[tip: core/rcu] rcu: Unconditionally use rcuc threads on PREEMPT_RT

2021-02-15 Thread tip-bot2 for Scott Wood
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 8b9a0ecc7ef5e1ed3afbc926de17399a37128c82
Gitweb:
https://git.kernel.org/tip/8b9a0ecc7ef5e1ed3afbc926de17399a37128c82
Author: Scott Wood 
AuthorDate: Tue, 15 Dec 2020 15:16:46 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 13:43:51 -08:00

rcu: Unconditionally use rcuc threads on PREEMPT_RT

PREEMPT_RT systems have long used the rcutree.use_softirq kernel
boot parameter to avoid use of RCU_SOFTIRQ handlers, which can disrupt
real-time applications by invoking callbacks during return from interrupts
that arrived while executing time-critical code.  This kernel boot
parameter instead runs RCU core processing in an 'rcuc' kthread, thus
allowing the scheduler to do its job of avoiding disrupting time-critical
code.

This commit therefore disables the rcutree.use_softirq kernel boot
parameter on PREEMPT_RT systems, thus forcing such systems to do RCU
core processing in 'rcuc' kthreads.  This approach has long been in
use by users of the -rt patchset, and there have been no complaints.
There is therefore no way for the system administrator to override this
choice, at least without modifying and rebuilding the kernel.

Signed-off-by: Scott Wood 
[bigeasy: Reword commit message]
Signed-off-by: Sebastian Andrzej Siewior 
[ paulmck: Update kernel-parameters.txt accordingly. ]
Signed-off-by: Paul E. McKenney 
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ++++
 kernel/rcu/tree.c   | 4 +++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index c722ec1..521255f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4092,6 +4092,10 @@
value, meaning that RCU_SOFTIRQ is used by default.
Specify rcutree.use_softirq=0 to use rcuc kthreads.
 
+   But note that CONFIG_PREEMPT_RT=y kernels disable
+   this kernel boot parameter, forcibly setting it
+   to zero.
+
rcutree.rcu_fanout_exact= [KNL]
Disable autobalancing of the rcu_node combining
tree.  This is used by rcutorture, and might
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..d609035 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -100,8 +100,10 @@ static struct rcu_state rcu_state = {
 static bool dump_tree;
 module_param(dump_tree, bool, 0444);
 /* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */
-static bool use_softirq = true;
+static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT);
+#ifndef CONFIG_PREEMPT_RT
 module_param(use_softirq, bool, 0444);
+#endif
 /* Control rcu_node-tree auto-balancing at boot time. */
 static bool rcu_fanout_exact;
 module_param(rcu_fanout_exact, bool, 0444);
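
For readers unfamiliar with the mechanism, here is a simplified sketch (not the verbatim kernel code) of how the use_softirq flag selects the RCU core processing path; invoke_rcu_core_kthread() stands for the helper that wakes the per-CPU rcuc kthread:

static void invoke_rcu_core(void)
{
	if (!cpu_online(smp_processor_id()))
		return;
	if (use_softirq)
		raise_softirq(RCU_SOFTIRQ);	/* rcu_core() runs from softirq on irq return */
	else
		invoke_rcu_core_kthread();	/* wake this CPU's rcuc kthread instead */
}

With use_softirq forced to false on PREEMPT_RT, only the kthread path remains, and the scheduler decides when RCU core work may run relative to time-critical tasks.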


Re: [PATCH] sched/fair: Don't migrate with src_cpu == dst_cpu

2020-12-07 Thread Scott Wood
On Thu, 2020-12-03 at 09:47 +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2020 at 12:04:49AM -0600, Scott Wood wrote:
> > Besides being a waste of time to try to move tasks to where they already
> > are, this avoids triggering the WARN_ON_ONCE(is_migration_disabled(p))
> > in set_task_cpu().
> > 
> > Signed-off-by: Scott Wood 
> > ---
> > Patch is against tip/master.  Assertion was seen by running rteval on
> > the
> > RT tree.
> > 
> >  kernel/sched/fair.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e7e21ac479a2..f443626164d4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7574,7 +7574,8 @@ int can_migrate_task(struct task_struct *p, struct
> > lb_env *env)
> >  
> > /* Prevent to re-select dst_cpu via env's CPUs: */
> > for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
> > -   if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> > +   if (cpu != env->src_cpu &&
> > +   cpumask_test_cpu(cpu, p->cpus_ptr)) {
> > env->flags |= LBF_DST_PINNED;
> > env->new_dst_cpu = cpu;
> > break;
> 
> Do we have _any_ clue as to how we ended up in that situation? The above
> sounds like it should be a WARN and we should avoid getting here in the
> first place.

My initial impression was that there simply wasn't anything stopping it from
happening, but digging deeper it looks like it's specific to NUMA domains
with overlapping CPUs.

-Scott




Re: [PATCH 1/2] powerpc: Retire e200 core (mpc555x processor)

2020-12-04 Thread Scott Wood
On Tue, 2020-11-17 at 05:07 +, Christophe Leroy wrote:
> There is no defconfig selecting CONFIG_E200, and no platform.
> 
> e200 is an earlier version of booke, a predecessor of e500,
> with some particularities like an unified cache instead of both an
> instruction cache and a data cache.
> 
> Remove it.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/Makefile |  1 -
>  arch/powerpc/include/asm/cputable.h   | 11 -
>  arch/powerpc/include/asm/mmu.h|  2 +-
>  arch/powerpc/include/asm/reg.h|  5 --
>  arch/powerpc/include/asm/reg_booke.h  | 12 -
>  arch/powerpc/kernel/cpu_setup_fsl_booke.S |  9 
>  arch/powerpc/kernel/cputable.c| 46 --
>  arch/powerpc/kernel/head_booke.h  |  3 +-
>  arch/powerpc/kernel/head_fsl_booke.S  | 57 +--
>  arch/powerpc/kernel/setup_32.c|  2 -
>  arch/powerpc/kernel/traps.c   | 25 --
>  arch/powerpc/mm/nohash/fsl_booke.c| 12 ++---
>  arch/powerpc/platforms/Kconfig.cputype| 13 ++----
>  13 files changed, 11 insertions(+), 187 deletions(-)

Acked-by: Scott Wood 

-Scott




[PATCH] sched/fair: Don't migrate with src_cpu == dst_cpu

2020-12-02 Thread Scott Wood
Besides being a waste of time to try to move tasks to where they already
are, this avoids triggering the WARN_ON_ONCE(is_migration_disabled(p))
in set_task_cpu().

Signed-off-by: Scott Wood 
---
Patch is against tip/master.  Assertion was seen by running rteval on the
RT tree.

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e7e21ac479a2..f443626164d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7574,7 +7574,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env 
*env)
 
/* Prevent to re-select dst_cpu via env's CPUs: */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
-   if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
+   if (cpu != env->src_cpu &&
+   cpumask_test_cpu(cpu, p->cpus_ptr)) {
env->flags |= LBF_DST_PINNED;
env->new_dst_cpu = cpu;
break;
-- 
2.27.0
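
For context, the WARN_ON_ONCE mentioned in the changelog lives in set_task_cpu(); a rough sketch of the relevant debug check (assumed from the tip tree of that time, not quoted verbatim):

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
#ifdef CONFIG_SCHED_DEBUG
	/* Moving a migration-disabled task to a different CPU is a bug. */
	WARN_ON_ONCE(is_migration_disabled(p));
#endif
	/* ... migration bookkeeping elided ... */
}

Skipping cpu == env->src_cpu means the balancer no longer tries to "move" such a task to the CPU it is already on, which is what tripped the warning.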



Re: [PATCH] powerpc: Drop -me200 addition to build flags

2020-11-16 Thread Scott Wood
On Mon, 2020-11-16 at 23:09 +1100, Michael Ellerman wrote:
> Currently a build with CONFIG_E200=y will fail with:
> 
>   Error: invalid switch -me200
>   Error: unrecognized option -me200
> 
> Upstream binutils has never supported an -me200 option. Presumably it
> was supported at some point by either a fork or Freescale internal
> binutils.
> 
> We can't support code that we can't even build test, so drop the
> addition of -me200 to the build flags, so we can at least build with
> CONFIG_E200=y.
> 
> Reported-by: Németh Márton 
> Reported-by: kernel test robot 
> Signed-off-by: Michael Ellerman 
> ---
> 
> More discussion: 
> https://lore.kernel.org/lkml/202011131146.g8dplqdd-...@intel.com
> ---
>  arch/powerpc/Makefile | 1 -
>  1 file changed, 1 deletion(-)

Acked-by: Scott Wood 

I'd go further and remove E200 code entirely, unless someone with the hardware
can claim that it actually works.  There doesn't appear to be any actual
platform support for an e200-based system.  It seems to be a long-abandoned
work in progress.

-Scott




Re: Error: invalid switch -me200

2020-11-16 Thread Scott Wood
On Fri, 2020-11-13 at 18:50 -0600, Segher Boessenkool wrote:
> On Fri, Nov 13, 2020 at 04:37:38PM -0800, Fāng-ruì Sòng wrote:
> > On Fri, Nov 13, 2020 at 4:23 PM Segher Boessenkool
> >  wrote:
> > > On Fri, Nov 13, 2020 at 12:14:18PM -0800, Nick Desaulniers wrote:
> > > > > > > Error: invalid switch -me200
> > > > > > > Error: unrecognized option -me200
> > > > > > 
> > > > > > 251 cpu-as-$(CONFIG_E200)   += -Wa,-me200
> > > > > > 
> > > > > > Are those all broken configs, or is Kconfig messed up such that
> > > > > > randconfig can select these when it should not?
> > > > > 
> > > > > Hmmm, looks like this flag does not exist in mainline binutils?
> > > > > There is
> > > > > a thread in 2010 about this that Segher commented on:
> > > > > 
> > > > > 
> > > > > https://lore.kernel.org/linuxppc-dev/9859e645-954d-4d07-8003-ffcd2391a...@kernel.crashing.org/
> > > > > 
> > > > > Guess this config should be eliminated?
> > > 
> > > The help text for this config options says that e200 is used in 55xx,
> > > and there *is* an -me5500 GAS flag (which probably does this same
> > > thing, too).  But is any of this tested, or useful, or wanted?
> > > 
> > > Maybe Christophe knows, cc:ed.
> > 
> > CC Alan Modra, a binutils global maintainer.
> > 
> > Alan, can the few -Wa,-m* options deleted from arch/powerpc/Makefile ?
> 
> All the others work fine (and are needed afaics), it is only -me200 that
> doesn't exist (in mainline binutils).  Perhaps -me5500 will work for it
> instead.

According to Wikipedia e200 is from mpc55xx (for which I don't see any
platform support having ever been added).  e5500 is completely different (64-
bit version of e500mc).

-Scott




[tip: sched/urgent] sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption

2020-06-23 Thread tip-bot2 for Scott Wood
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 16568f1f4fd4decee6935751d5ada1f963e5bd5f
Gitweb:
https://git.kernel.org/tip/16568f1f4fd4decee6935751d5ada1f963e5bd5f
Author: Scott Wood 
AuthorDate: Wed, 17 Jun 2020 14:17:42 +02:00
Committer: Ingo Molnar 
CommitterDate: Tue, 23 Jun 2020 10:42:30 +02:00

sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix 
mask corruption

This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.

Signed-off-by: Scott Wood 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmp...@linutronix.de
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f36032..9eeac94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1637,7 +1637,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
goto out;
}
 
-   if (cpumask_equal(p->cpus_ptr, new_mask))
+   if (cpumask_equal(&p->cpus_mask, new_mask))
goto out;
 
/*
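
For context, the two task_struct fields involved (simplified excerpt; the exact layout varies by kernel version):

struct task_struct {
	/* ... */
	const cpumask_t		*cpus_ptr;	/* effective mask used for task placement */
	cpumask_t		cpus_mask;	/* long-term affinity set by the administrator */
	/* ... */
};

cpus_ptr normally points at &cpus_mask; while a task is migrate-disabled (on RT at the time), cpus_ptr is temporarily redirected to a single-CPU mask, which is why comparing the new mask against *cpus_ptr can spuriously match and drop the update, as the changelog describes.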


[tip: sched/urgent] sched: __set_cpus_allowed_ptr(): Check cpus_mask, not cpus_ptr

2020-06-23 Thread tip-bot2 for Scott Wood
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: a51dde512a93959d1df3a915fa3e3df68b1fc94d
Gitweb:
https://git.kernel.org/tip/a51dde512a93959d1df3a915fa3e3df68b1fc94d
Author: Scott Wood 
AuthorDate: Wed, 17 Jun 2020 14:17:42 +02:00
Committer: Peter Zijlstra 
CommitterDate: Mon, 22 Jun 2020 20:51:05 +02:00

sched: __set_cpus_allowed_ptr(): Check cpus_mask, not cpus_ptr

This function is concerned with the long-term cpu mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the cpu that the task was running on, then the mask update
would be lost.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Scott Wood 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmp...@linutronix.de
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8298b2c..9f8aa34 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1637,7 +1637,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
goto out;
}
 
-   if (cpumask_equal(p->cpus_ptr, new_mask))
+   if (cpumask_equal(&p->cpus_mask, new_mask))
goto out;
 
/*


Re: [PATCH] sched: __set_cpus_allowed_ptr(): Check cpus_mask, not cpus_ptr

2020-06-17 Thread Scott Wood
On Wed, 2020-06-17 at 15:15 +0100, Valentin Schneider wrote:
> On 17/06/20 13:17, Sebastian Andrzej Siewior wrote:
> > From: Scott Wood 
> > 
> > This function is concerned with the long-term cpu mask, not the
> > transitory mask the task might have while migrate disabled.  Before
> > this patch, if a task was migrate disabled at the time
> > __set_cpus_allowed_ptr() was called, and the new mask happened to be
> > equal to the cpu that the task was running on, then the mask update
> > would be lost.
> > 
> > Signed-off-by: Scott Wood 
> > Signed-off-by: Sebastian Andrzej Siewior 
> > ---
> >  kernel/sched/core.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1637,7 +1637,7 @@ static int __set_cpus_allowed_ptr(struct
> >   goto out;
> >   }
> > 
> > -   if (cpumask_equal(p->cpus_ptr, new_mask))
> > +   if (cpumask_equal(&p->cpus_mask, new_mask))
> >   goto out;
> > 
> >   /*
> 
> Makes sense, but what about the rest of the checks? Further down there is
> 
> /* Can the task run on the task's current CPU? If so, we're done
> */
> if (cpumask_test_cpu(task_cpu(p), new_mask))
> goto out;
> 
> If the task is currently migrate disabled and for some stupid reason it
> gets affined elsewhere, we could try to move it out - which AFAICT we
> don't
> want to do because migrate disabled. So I suppose you'd want an extra
> bailout condition here when the task is migrate disabled.
> 
> ISTR in RT you do re-check the affinity and potentially move the task away
> when re-enabling migration, so that should work out all fine.

On RT the above test is:

/* Can the task run on the task's current CPU? If so, we're done */
if (cpumask_test_cpu(task_cpu(p), new_mask) ||
    p->cpus_ptr != &p->cpus_mask)
goto out;

...so we do bail out if we're migrate disabled.

-Scott
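
A minimal sketch of the RT behaviour being relied on here (assumed and simplified; the real migrate_disable()/migrate_enable() do considerably more bookkeeping).  While migrate-disabled, cpus_ptr points at the current CPU's mask, so the p->cpus_ptr != &p->cpus_mask test above doubles as a "migrate disabled" check:

static inline void migrate_disable_sketch(struct task_struct *p)
{
	preempt_disable();
	if (p->migrate_disable++ == 0)
		/* Pin the effective mask to this CPU; cpus_mask is untouched. */
		p->cpus_ptr = cpumask_of(smp_processor_id());
	preempt_enable();
}

static inline void migrate_enable_sketch(struct task_struct *p)
{
	preempt_disable();
	if (--p->migrate_disable == 0)
		/* Restore the long-term mask as the effective one. */
		p->cpus_ptr = &p->cpus_mask;
	preempt_enable();
}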




Re: [PATCH] powerpc/fsl_booke/32: fix build with CONFIG_RANDOMIZE_BASE

2020-06-15 Thread Scott Wood
On Sat, 2020-06-13 at 23:28 +0700, Arseny Solokha wrote:
> Building the current 5.8 kernel for a e500 machine with
> CONFIG_RANDOMIZE_BASE set yields the following failure:
> 
>   arch/powerpc/mm/nohash/kaslr_booke.c: In function 'kaslr_early_init':
>   arch/powerpc/mm/nohash/kaslr_booke.c:387:2: error: implicit declaration
> of function 'flush_icache_range'; did you mean 'flush_tlb_range'?
> [-Werror=implicit-function-declaration]
> 
> Indeed, including asm/cacheflush.h into kaslr_booke.c fixes the build.
> 
> The issue dates back to the introduction of that file and probably went
> unnoticed because there's no in-tree defconfig with CONFIG_RANDOMIZE_BASE
> set.
> 
> Fixes: 2b0e86cc5de6 ("powerpc/fsl_booke/32: implement KASLR infrastructure")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arseny Solokha 
> ---
>  arch/powerpc/mm/nohash/kaslr_booke.c | 1 +
>  1 file changed, 1 insertion(+)

Acked-by: Scott Wood 

-Scott
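
The patch body is not reproduced here, but based on the description it presumably amounts to a single added include in kaslr_booke.c (hypothetical reconstruction):

#include <asm/cacheflush.h>	/* provides flush_icache_range() */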




Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()

2020-04-29 Thread Scott Wood
On Wed, 2020-04-29 at 10:27 +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2020 at 07:02, Scott Wood  wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..74c3c5280d6b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6758,8 +6758,6 @@ balance_fair(struct rq *rq, struct task_struct
> > *prev, struct rq_flags *rf)
> >  {
> > if (rq->nr_running)
> > return 1;
> > -
> > -   return newidle_balance(rq, rf) != 0;
> 
> Missing return ?

Yes, editing error during last minute cleanup. :-P

-Scott




Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()

2020-04-29 Thread Scott Wood
On Wed, 2020-04-29 at 11:05 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 06:20:32PM -0500, Scott Wood wrote:
> > On Wed, 2020-04-29 at 01:02 +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> > > > On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> > > > > Also, if you move it this late, this is entirely the wrong
> > > > > place.  If you do it after the context switch either use the
> > > > > balance_callback or put it in the idle path.
> > > > > 
> > > > > But what Valentin said; this needs a fair bit of support, the
> > > > > whole reason we've never done this is to avoid that double
> > > > > context switch...
> > > > > 
> > > > 
> > > > balance_callback() enters with the rq lock held but BH not
> > > > separately
> > > 
> > > BH? softirqs you mean? Pray tell more.
> > 
> > In https://lore.kernel.org/lkml/5122cd9c.9070...@oracle.com/ the need to
> > keep softirqs disabled during rebalance was brought up, but simply
> > wrapping
> > the lock dropping in local_bh_enable()/local_bh_disable() meant that
> > local_bh_enable() would be called with interrupts disabled, which isn't
> > allowed.
> 
> That thread, nor your explanation make any sense -- why do we care about
> softirqs?, 

I was trusting Steve's claim that that was the issue (it seemed plausible
given that system-wide rebalancing is done from a softirq).  If things have
changed since then, great.  If that was never the issue, then there's the
question of what caused the bug Sasha saw.

> nor do I see how placing it in finish_task_switch() helps
> with any of this.

It lets us do the local_bh_enable() after IRQs are enabled, since we don't
enter with any existing atomic context.  Though I suppose we could instead
do another lock drop at the end of newidle_balance() just to enable
softirqs.

> > > > disabled, which interferes with the ability to enable interrupts
> > > > but not BH.  It also gets called from rt_mutex_setprio() and
> > > > __sched_setscheduler(), and I didn't want the caller of those to
> > > > be stuck with the latency.
> > > 
> > > You're not reading it right.
> > 
> > Could you elaborate?
> 
> If you were to do a queue_balance_callback() from somewhere in the
> pick_next_task() machinery, then the balance_callback() at the end of
> __schedule() would run it, and it'd be gone. How would
> rt_mutex_setprio() / __sched_setscheduler() be affected?

The rq lock is dropped between queue_balance_callback() and the
balance_callback() at the end of __schedule().  What stops
setprio/setscheduler on another cpu from doing the callback at that
point?

-Scott
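
For reference, the balance_callback mechanism under discussion works roughly like this (simplified sketch of the scheduler-internal helper as it looked around that time; not quoted verbatim):

static inline void
queue_balance_callback(struct rq *rq, struct callback_head *head,
		       void (*func)(struct rq *rq))
{
	lockdep_assert_held(&rq->lock);

	if (unlikely(head->next))		/* already queued */
		return;

	head->func = (void (*)(struct callback_head *))func;
	head->next = rq->balance_callback;
	rq->balance_callback = head;		/* flushed by balance_callback() under rq->lock */
}

Whichever path next runs balance_callback() with rq->lock held consumes the queued work, which is the crux of the race being asked about here.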




Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()

2020-04-28 Thread Scott Wood
On Wed, 2020-04-29 at 01:02 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 05:55:03PM -0500, Scott Wood wrote:
> > On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> > > Also, if you move it this late, this is entirely the wrong place. If
> > > you
> > > do it after the context switch either use the balance_callback or put
> > > it
> > > in the idle path.
> > > 
> > > But what Valentin said; this needs a fair bit of support, the whole
> > > reason we've never done this is to avoid that double context switch...
> > > 
> > 
> > balance_callback() enters with the rq lock held but BH not separately
> 
> BH? softirqs you mean? Pray tell more.

In https://lore.kernel.org/lkml/5122cd9c.9070...@oracle.com/ the need to
keep softirqs disabled during rebalance was brought up, but simply wrapping
the lock dropping in local_bh_enable()/local_bh_disable() meant that
local_bh_enable() would be called with interrupts disabled, which isn't
allowed.

> > disabled, which interferes with the ability to enable interrupts but not
> > BH.
> > It also gets called from rt_mutex_setprio() and __sched_setscheduler(),
> > and
> > I didn't want the caller of those to be stuck with the latency.
> 
> You're not reading it right.

Could you elaborate?

-Scott




Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()

2020-04-28 Thread Scott Wood
On Wed, 2020-04-29 at 00:09 +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 10:37:18PM +0100, Valentin Schneider wrote:
> > On 28/04/20 06:02, Scott Wood wrote:
> > > Thus, newidle_balance() is entered with interrupts enabled, which
> > > allows
> > > (in the next patch) enabling interrupts when the lock is dropped.
> > > 
> > > Signed-off-by: Scott Wood 
> > > ---
> > >  kernel/sched/core.c  |  7 ---
> > >  kernel/sched/fair.c  | 45 ---
> > > -
> > >  kernel/sched/sched.h |  6 ++
> > >  3 files changed, 22 insertions(+), 36 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9a2fbf98fd6f..0294beb8d16c 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct
> > > task_struct *prev)
> > >   }
> > > 
> > >   tick_nohz_task_switch();
> > > +
> > > + if (is_idle_task(current))
> > > + newidle_balance();
> > > +
> > 
> > This means we must go through a switch_to(idle) before figuring out we
> > could've switched to a CFS task, and do it then. I'm curious to see the
> > performance impact of that.
> 
> Also, if you move it this late, this is entirely the wrong place. If you
> do it after the context switch either use the balance_callback or put it
> in the idle path.
> 
> But what Valentin said; this needs a fair bit of support, the whole
> reason we've never done this is to avoid that double context switch...
> 

balance_callback() enters with the rq lock held but BH not separately
disabled, which interferes with the ability to enable interrupts but not BH.
It also gets called from rt_mutex_setprio() and __sched_setscheduler(), and
I didn't want the caller of those to be stuck with the latency.

-Scott




Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears

2020-04-28 Thread Scott Wood
On Tue, 2020-04-28 at 17:33 -0500, Scott Wood wrote:
> On Tue, 2020-04-28 at 22:56 +0100, Valentin Schneider wrote:
> > On 28/04/20 06:02, Scott Wood wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index dfde7f0ce3db..e7437e4e40b4 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env
> > > *env)
> > >   struct sched_group *sg = env->sd->groups;
> > >   int cpu, balance_cpu = -1;
> > > 
> > > + /* Run the realtime task now; load balance later. */
> > > + if (rq_has_runnable_rt_task(env->dst_rq))
> > > + return 0;
> > > +
> > 
> > I have a feeling this isn't very nice to CFS tasks, since we would now
> > "waste" load-balance attempts if they happen to coincide with an RT task
> > being runnable.
> > 
> > On your 72 CPUs machine, the system-wide balance happens (at best) every
> > 72ms if you have idle time, every ~2300ms otherwise (every balance
> > CPU gets to try to balance however, so it's not as horrible as I'm
> > making
> > it sound). This is totally worst-case scenario territory, and you'd hope
> > newidle_balance() could help here and there (as it isn't gated by any
> > balance interval).
> > 
> > Still, even for a single rq, postponing a system-wide balance for a
> > full balance interval (i.e. ~2 secs worst case here) just because we had
> > a
> > single RT task running when we tried to balance seems a bit much.
> > 
> > It may be possible to hack something to detect those cases and reset the
> > interval to "now" when e.g. dequeuing the last RT task (& after having
> > previously aborted a load-balance due to RT/DL/foobar).
> 
> Yeah, some way to retry at an appropriate time after aborting a rebalance
> would be good.

Another option is to limit the bailing out to newidle balancing (as the
patchset currently stands, it isn't checking the right rq for global
balancing anyway).  On RT the softirq runs from thread context, so enabling
interrupts and (on RT) preemption should suffice to avoid latency problems
in the global rebalance.

-Scott




Re: [RFC PATCH 3/3] sched,rt: break out of load balancing if an RT task appears

2020-04-28 Thread Scott Wood
On Tue, 2020-04-28 at 22:56 +0100, Valentin Schneider wrote:
> On 28/04/20 06:02, Scott Wood wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index dfde7f0ce3db..e7437e4e40b4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9394,6 +9400,10 @@ static int should_we_balance(struct lb_env *env)
> >   struct sched_group *sg = env->sd->groups;
> >   int cpu, balance_cpu = -1;
> > 
> > +   /* Run the realtime task now; load balance later. */
> > +   if (rq_has_runnable_rt_task(env->dst_rq))
> > +   return 0;
> > +
> 
> I have a feeling this isn't very nice to CFS tasks, since we would now
> "waste" load-balance attempts if they happen to coincide with an RT task
> being runnable.
>
> On your 72 CPUs machine, the system-wide balance happens (at best) every
> 72ms if you have idle time, every ~2300ms otherwise (every balance
> CPU gets to try to balance however, so it's not as horrible as I'm making
> it sound). This is totally worst-case scenario territory, and you'd hope
> newidle_balance() could help here and there (as it isn't gated by any
> balance interval).
> 
> Still, even for a single rq, postponing a system-wide balance for a
> full balance interval (i.e. ~2 secs worst case here) just because we had a
> single RT task running when we tried to balance seems a bit much.
> 
> It may be possible to hack something to detect those cases and reset the
> interval to "now" when e.g. dequeuing the last RT task (& after having
> previously aborted a load-balance due to RT/DL/foobar).

Yeah, some way to retry at an appropriate time after aborting a rebalance
would be good.


> > +
> > +/* Is there a task of a high priority class? */
> > +static inline bool rq_has_runnable_rt_task(struct rq *rq)
> > +{
> > +   return unlikely(rq->nr_running != rq->cfs.h_nr_running);
> 
> Seeing as that can be RT, DL or stopper, that name is somewhat misleading.

rq_has_runnable_rt_dl_task()?  Or is there some term that unambiguously
encompasses both?

-Scott
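
A purely hypothetical sketch of the "reset the interval" idea mentioned above (the flag name is invented; nothing like this exists in the posted patches):

static inline void reset_balance_after_rt(struct rq *rq)
{
	/* lb_skipped_for_rt: hypothetical flag set when should_we_balance()
	 * bailed out because an RT/DL task was runnable. */
	if (rq->lb_skipped_for_rt && rq->nr_running == rq->cfs.h_nr_running) {
		rq->lb_skipped_for_rt = 0;
		rq->next_balance = jiffies;	/* allow an immediate retry */
	}
}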




Re: [RFC PATCH 1/3] sched/fair: Call newidle_balance() from finish_task_switch()

2020-04-28 Thread Scott Wood
On Tue, 2020-04-28 at 22:37 +0100, Valentin Schneider wrote:
> On 28/04/20 06:02, Scott Wood wrote:
> > Thus, newidle_balance() is entered with interrupts enabled, which allows
> > (in the next patch) enabling interrupts when the lock is dropped.
> > 
> > Signed-off-by: Scott Wood 
> > ---
> >  kernel/sched/core.c  |  7 ---
> >  kernel/sched/fair.c  | 45 
> >  kernel/sched/sched.h |  6 ++
> >  3 files changed, 22 insertions(+), 36 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..0294beb8d16c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3241,6 +3241,10 @@ static struct rq *finish_task_switch(struct
> > task_struct *prev)
> >   }
> > 
> >   tick_nohz_task_switch();
> > +
> > +   if (is_idle_task(current))
> > +   newidle_balance();
> > +
> 
> This means we must go through a switch_to(idle) before figuring out we
> could've switched to a CFS task, and do it then. I'm curious to see the
> performance impact of that.

Any particular benchmark I should try?

> >  return rq;
> >  }
> > 
> > @@ -10425,14 +10408,23 @@ static inline void nohz_newidle_balance(struct
> > rq *this_rq) { }
> >   * 0 - failed, no new tasks
> >   *   > 0 - success, new (fair) tasks present
> >   */
> > -int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> > +int newidle_balance(void)
> >  {
> >   unsigned long next_balance = jiffies + HZ;
> > -   int this_cpu = this_rq->cpu;
> > +   int this_cpu;
> >   struct sched_domain *sd;
> > +   struct rq *this_rq;
> >   int pulled_task = 0;
> >   u64 curr_cost = 0;
> > 
> > +   preempt_disable();
> > +   this_rq = this_rq();
> > +   this_cpu = this_rq->cpu;
> > +   local_bh_disable();
> > > +   raw_spin_lock_irq(&this_rq->lock);
> > +
> > +   update_rq_clock(this_rq);
> > +
> >   update_misfit_status(NULL, this_rq);
> 
> I'm thinking this should be moved to where newidle_balance() used to be,
> otherwise we have a window where the rq is flagged as having a misfit
> task despite not having any runnable CFS tasks.

OK

-Scott




Re: [PATCH v7 00/12] implement KASLR for powerpc/fsl_booke/32

2019-10-22 Thread Scott Wood
On Mon, 2019-10-21 at 11:34 +0800, Jason Yan wrote:
> 
> On 2019/10/10 2:46, Scott Wood wrote:
> > On Wed, 2019-10-09 at 16:41 +0800, Jason Yan wrote:
> > > Hi Scott,
> > > 
> > > On 2019/10/9 15:13, Scott Wood wrote:
> > > > On Wed, 2019-10-09 at 14:10 +0800, Jason Yan wrote:
> > > > > Hi Scott,
> > > > > 
> > > > > Would you please take sometime to test this?
> > > > > 
> > > > > Thank you so much.
> > > > > 
> > > > > On 2019/9/24 13:52, Jason Yan wrote:
> > > > > > Hi Scott,
> > > > > > 
> > > > > > Can you test v7 to see if it works to load a kernel at a non-zero
> > > > > > address?
> > > > > > 
> > > > > > Thanks,
> > > > 
> > > > Sorry for the delay.  Here's the output:
> > > > 
> > > 
> > > Thanks for the test.
> > > 
> > > > ## Booting kernel from Legacy Image at 1000 ...
> > > >  Image Name:   Linux-5.4.0-rc2-00050-g8ac2cf5b4
> > > >  Image Type:   PowerPC Linux Kernel Image (gzip compressed)
> > > >  Data Size:7521134 Bytes = 7.2 MiB
> > > >  Load Address: 0400
> > > >  Entry Point:  0400
> > > >  Verifying Checksum ... OK
> > > > ## Flattened Device Tree blob at 1fc0
> > > >  Booting using the fdt blob at 0x1fc0
> > > >  Uncompressing Kernel Image ... OK
> > > >  Loading Device Tree to 07fe, end 07fff65c ... OK
> > > > KASLR: No safe seed for randomizing the kernel base.
> > > > OF: reserved mem: initialized node qman-fqd, compatible id fsl,qman-
> > > > fqd
> > > > OF: reserved mem: initialized node qman-pfdr, compatible id fsl,qman-
> > > > pfdr
> > > > OF: reserved mem: initialized node bman-fbpr, compatible id fsl,bman-
> > > > fbpr
> > > > Memory CAM mapping: 64/64/64 Mb, residual: 12032Mb
> > > 
> > > When boot from 0400, the max CAM value is 64M. And
> > > you have a board with 12G memory, CONFIG_LOWMEM_CAM_NUM=3 means only
> > > 192M memory is mapped and when kernel is randomized at the middle of
> > > this 192M memory, we will not have enough continuous memory for node
> > > map.
> > > 
> > > Can you set CONFIG_LOWMEM_CAM_NUM=8 and see if it works?
> > 
> > OK, that worked.
> > 
> 
> Hi Scott, any more cases should be tested or any more comments?
> What else need to be done before this feature can be merged?

I've just applied it and sent a pull request.

-Scott
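
To make the arithmetic above concrete (rounded figures, taken from the boot failure reported earlier in this exchange):

	CONFIG_LOWMEM_CAM_NUM=3:  3 x 64 MB = 192 MB of lowmem mapped
	CONFIG_LOWMEM_CAM_NUM=8:  8 x 64 MB = 512 MB of lowmem mapped

The failed node-0 memory map allocation was 125173760 bytes (~120 MB).  With the kernel randomized into the middle of a 192 MB window there is no contiguous region that large left, whereas a 512 MB window leaves plenty of room.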




Re: [PATCH RT v2 2/3] sched: Lazy migrate_disable processing

2019-10-18 Thread Scott Wood
On Fri, 2019-10-18 at 14:12 -0400, Waiman Long wrote:
> On 10/12/19 2:52 AM, Scott Wood wrote:
> > Avoid overhead on the majority of migrate disable/enable sequences by
> > only manipulating scheduler data (and grabbing the relevant locks) when
> > the task actually schedules while migrate-disabled.  A kernel build
> > showed around a 10% reduction in system time (with CONFIG_NR_CPUS=512).
> > 
> > Instead of cpuhp_pin_lock, CPU hotplug is handled by keeping a per-CPU
> > count of the number of pinned tasks (including tasks which have not
> > scheduled in the migrate-disabled section); takedown_cpu() will
> > wait until that reaches zero (confirmed by take_cpu_down() in stop
> > machine context to deal with races) before migrating tasks off of the
> > cpu.
> > 
> > To simplify synchronization, updating cpus_mask is no longer deferred
> > until migrate_enable().  This lets us not have to worry about
> > migrate_enable() missing the update if it's on the fast path (didn't
> > schedule during the migrate disabled section).  It also makes the code
> > a bit simpler and reduces deviation from mainline.
> > 
> > While the main motivation for this is the performance benefit, lazy
> > migrate disable also eliminates the restriction on calling
> > migrate_disable() while atomic but leaving the atomic region prior to
> > calling migrate_enable() -- though this won't help with
> > local_bh_disable()
> > (and thus rcutorture) unless something similar is done with the recently
> > added local_lock.
> > 
> > Signed-off-by: Scott Wood 
> > ---
> > The speedup is smaller than before, due to commit 659252061477862f
> > ("lib/smp_processor_id: Don't use cpumask_equal()") achieving
> > an overlapping speedup.
> > ---
> >  include/linux/cpu.h|   4 --
> >  include/linux/sched.h  |  11 +--
> >  init/init_task.c   |   4 ++
> >  kernel/cpu.c   | 103 +++-
> >  kernel/sched/core.c| 180 
> > -
> >  kernel/sched/sched.h   |   4 ++
> >  lib/smp_processor_id.c |   3 +
> >  7 files changed, 128 insertions(+), 181 deletions(-)
> > 
> > diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> > index f4a772c12d14..2df500fdcbc4 100644
> > --- a/include/linux/cpu.h
> > +++ b/include/linux/cpu.h
> > @@ -113,8 +113,6 @@ static inline void cpu_maps_update_done(void)
> >  extern void cpu_hotplug_enable(void);
> >  void clear_tasks_mm_cpumask(int cpu);
> >  int cpu_down(unsigned int cpu);
> > -extern void pin_current_cpu(void);
> > -extern void unpin_current_cpu(void);
> >  
> >  #else /* CONFIG_HOTPLUG_CPU */
> >  
> > @@ -126,8 +124,6 @@ static inline void cpus_read_unlock(void) { }
> >  static inline void lockdep_assert_cpus_held(void) { }
> >  static inline void cpu_hotplug_disable(void) { }
> >  static inline void cpu_hotplug_enable(void) { }
> > -static inline void pin_current_cpu(void) { }
> > -static inline void unpin_current_cpu(void) { }
> >  
> >  #endif /* !CONFIG_HOTPLUG_CPU */
> >  
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 7e892e727f12..c6872b38bf77 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -229,6 +229,8 @@
> >  extern long io_schedule_timeout(long timeout);
> >  extern void io_schedule(void);
> >  
> > +int cpu_nr_pinned(int cpu);
> > +
> >  /**
> >   * struct prev_cputime - snapshot of system and user cputime
> >   * @utime: time spent in user mode
> > @@ -661,16 +663,13 @@ struct task_struct {
> > cpumask_t   cpus_mask;
> >  #if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
> > int migrate_disable;
> > -   int migrate_disable_update;
> > -   int pinned_on_cpu;
> > +   boolmigrate_disable_scheduled;
> >  # ifdef CONFIG_SCHED_DEBUG
> > -   int migrate_disable_atomic;
> > +   int pinned_on_cpu;
> >  # endif
> > -
> >  #elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
> >  # ifdef CONFIG_SCHED_DEBUG
> > int migrate_disable;
> > -   int migrate_disable_atomic;
> >  # endif
> >  #endif
> >  #ifdef CONFIG_PREEMPT_RT_FULL
> > @@ -2074,4 +2073,6 @@ static inline void rseq_syscall(struct 

[tip: x86/cpu] x86/Kconfig: Enforce limit of 512 CPUs with MAXSMP and no CPUMASK_OFFSTACK

2019-10-17 Thread tip-bot2 for Scott Wood
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: 1edae1ae62589f28d00da186465a003e2a7f9c6c
Gitweb:
https://git.kernel.org/tip/1edae1ae62589f28d00da186465a003e2a7f9c6c
Author: Scott Wood 
AuthorDate: Sat, 12 Oct 2019 02:00:54 -05:00
Committer: Borislav Petkov 
CommitterDate: Thu, 17 Oct 2019 18:04:51 +02:00

x86/Kconfig: Enforce limit of 512 CPUs with MAXSMP and no CPUMASK_OFFSTACK

The help text of NR_CPUS says that the maximum number of CPUs supported
without CPUMASK_OFFSTACK is 512. However, NR_CPUS_RANGE_END allows this
limit to be bypassed by MAXSMP even if CPUMASK_OFFSTACK is not set.

This scenario can currently only happen in the RT tree, since it has
"select CPUMASK_OFFSTACK if !PREEMPT_RT_FULL" in MAXSMP. However,
even if we ignore the RT tree, checking for MAXSMP in addition to
CPUMASK_OFFSTACK is redundant.

Signed-off-by: Scott Wood 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Mike Travis 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20191012070054.28657-1-sw...@redhat.com
---
 arch/x86/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 91c22ee..896f840 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1000,8 +1000,8 @@ config NR_CPUS_RANGE_END
 config NR_CPUS_RANGE_END
int
depends on X86_64
-   default 8192 if  SMP && ( MAXSMP ||  CPUMASK_OFFSTACK)
-   default  512 if  SMP && (!MAXSMP && !CPUMASK_OFFSTACK)
+   default 8192 if  SMP && CPUMASK_OFFSTACK
+   default  512 if  SMP && !CPUMASK_OFFSTACK
default    1 if !SMP
 
 config NR_CPUS_DEFAULT


Re: [PATCH RT] kernel/sched: Don't recompute cpumask weight in migrate_enable_update_cpus_allowed()

2019-10-12 Thread Scott Wood
On Fri, 2019-10-11 at 10:09 -0400, Waiman Long wrote:
> At each invocation of rt_spin_unlock(), cpumask_weight() is called
> via migrate_enable_update_cpus_allowed() to recompute the weight of
> cpus_mask which doesn't change that often.
> 
> The following is a sample output of perf-record running the testpmd
> microbenchmark on an RT kernel:
> 
>   34.77%   1.65%  testpmd  [kernel.kallsyms]  [k] rt_spin_unlock
>   34.32%   2.52%  testpmd  [kernel.kallsyms]  [k] migrate_enable
>   21.76%  21.76%  testpmd  [kernel.kallsyms]  [k] __bitmap_weight
> 
> By adding an extra variable to keep track of the weight of cpus_mask,
> we could eliminate the frequent call to cpumask_weight() and replace
> it with simple assignment.

Can you try this with my migrate disable patchset (which makes
migrate_enable_update_cpus_allowed() be called much less often) and see if
caching nr_cpus_allowed still makes a difference?

-Scott




[PATCH] x86/smp: Enforce limit of 512 cpus with MAXSMP and no CPUMASK_OFFSTACK

2019-10-12 Thread Scott Wood
The help text of NR_CPUS says that the maximum number of cpus supported
without CPUMASK_OFFSTACK is 512.  However, NR_CPUS_RANGE_END allows this
limit to be bypassed by MAXSMP even if CPUMASK_OFFSTACK is not set.

This scenario can currently only happen in the RT tree, since it
has "select CPUMASK_OFFSTACK if !PREEMPT_RT_FULL" in MAXSMP.  However,
even if we ignore the RT tree, checking for MAXSMP in addition to
CPUMASK_OFFSTACK is redundant.

Signed-off-by: Scott Wood 
---
 arch/x86/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d6e1faa28c58..2b526e101dc9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1000,8 +1000,8 @@ config NR_CPUS_RANGE_END
 config NR_CPUS_RANGE_END
int
depends on X86_64
-   default 8192 if  SMP && ( MAXSMP ||  CPUMASK_OFFSTACK)
-   default  512 if  SMP && (!MAXSMP && !CPUMASK_OFFSTACK)
+   default 8192 if  SMP && CPUMASK_OFFSTACK
+   default  512 if  SMP && !CPUMASK_OFFSTACK
default    1 if !SMP
 
 config NR_CPUS_DEFAULT
-- 
2.20.1



[PATCH RT v2 2/3] sched: Lazy migrate_disable processing

2019-10-12 Thread Scott Wood
Avoid overhead on the majority of migrate disable/enable sequences by
only manipulating scheduler data (and grabbing the relevant locks) when
the task actually schedules while migrate-disabled.  A kernel build
showed around a 10% reduction in system time (with CONFIG_NR_CPUS=512).

Instead of cpuhp_pin_lock, CPU hotplug is handled by keeping a per-CPU
count of the number of pinned tasks (including tasks which have not
scheduled in the migrate-disabled section); takedown_cpu() will
wait until that reaches zero (confirmed by take_cpu_down() in stop
machine context to deal with races) before migrating tasks off of the
cpu.

To simplify synchronization, updating cpus_mask is no longer deferred
until migrate_enable().  This lets us not have to worry about
migrate_enable() missing the update if it's on the fast path (didn't
schedule during the migrate disabled section).  It also makes the code
a bit simpler and reduces deviation from mainline.

While the main motivation for this is the performance benefit, lazy
migrate disable also eliminates the restriction on calling
migrate_disable() while atomic but leaving the atomic region prior to
calling migrate_enable() -- though this won't help with local_bh_disable()
(and thus rcutorture) unless something similar is done with the recently
added local_lock.

Signed-off-by: Scott Wood 
---
The speedup is smaller than before, due to commit 659252061477862f
("lib/smp_processor_id: Don't use cpumask_equal()") achieving
an overlapping speedup.
---
 include/linux/cpu.h|   4 --
 include/linux/sched.h  |  11 +--
 init/init_task.c   |   4 ++
 kernel/cpu.c   | 103 +++-
 kernel/sched/core.c| 180 -
 kernel/sched/sched.h   |   4 ++
 lib/smp_processor_id.c |   3 +
 7 files changed, 128 insertions(+), 181 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f4a772c12d14..2df500fdcbc4 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -113,8 +113,6 @@ static inline void cpu_maps_update_done(void)
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
 int cpu_down(unsigned int cpu);
-extern void pin_current_cpu(void);
-extern void unpin_current_cpu(void);
 
 #else /* CONFIG_HOTPLUG_CPU */
 
@@ -126,8 +124,6 @@ static inline void cpus_read_unlock(void) { }
 static inline void lockdep_assert_cpus_held(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
-static inline void pin_current_cpu(void) { }
-static inline void unpin_current_cpu(void) { }
 
 #endif /* !CONFIG_HOTPLUG_CPU */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e892e727f12..c6872b38bf77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -229,6 +229,8 @@
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+int cpu_nr_pinned(int cpu);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -661,16 +663,13 @@ struct task_struct {
cpumask_t   cpus_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
int migrate_disable;
-   int migrate_disable_update;
-   int pinned_on_cpu;
+   boolmigrate_disable_scheduled;
 # ifdef CONFIG_SCHED_DEBUG
-   int migrate_disable_atomic;
+   int pinned_on_cpu;
 # endif
-
 #elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
 # ifdef CONFIG_SCHED_DEBUG
int migrate_disable;
-   int migrate_disable_atomic;
 # endif
 #endif
 #ifdef CONFIG_PREEMPT_RT_FULL
@@ -2074,4 +2073,6 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+extern struct task_struct *takedown_cpu_task;
+
 #endif
diff --git a/init/init_task.c b/init/init_task.c
index e402413dc47d..c0c7618fd2fb 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -81,6 +81,10 @@ struct task_struct init_task
.cpus_ptr   = &init_task.cpus_mask,
.cpus_mask  = CPU_MASK_ALL,
.nr_cpus_allowed= NR_CPUS,
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE) && \
+defined(CONFIG_SCHED_DEBUG)
+   .pinned_on_cpu  = -1,
+#endif
.mm = NULL,
.active_mm  = &init_mm,
.restart_block  = {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 25afa2bb1a2c..e1bf3c698a32 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -76,11 +76,6 @@ static DEFINE_PER_CPU(struct cpuhp_cpu_state, cpuhp_state) = 
{
.fail = CPUHP_INVALID,
 };
 
-#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_PREEMPT_RT_FULL)
-static DEFINE_PER_CPU(struct rt_rw_lock, cpuhp_pin_lock) = \
-   __RWLOCK_RT_INITIALIZER(cpuhp_pin_lock);
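
A minimal sketch of the per-CPU pinned-task accounting described in the changelog (hypothetical helper names; the actual patch differs in detail):

static DEFINE_PER_CPU(int, nr_pinned);	/* tasks currently migrate-disabled on this CPU */

static inline void pin_this_cpu(void)
{
	this_cpu_inc(nr_pinned);	/* from migrate_disable() */
}

static inline void unpin_this_cpu(void)
{
	this_cpu_dec(nr_pinned);	/* from migrate_enable() */
}

int cpu_nr_pinned(int cpu)
{
	/* Polled by takedown_cpu() until it drops to zero. */
	return per_cpu(nr_pinned, cpu);
}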

[PATCH RT v2 0/3] migrate disable fixes and performance

2019-10-12 Thread Scott Wood
These are the unapplied patches from v1, minus the sched deadline
patch, and with stop_one_cpu_nowait() in place of clobbering
current->state.

Scott Wood (3):
  sched: migrate_enable: Use select_fallback_rq()
  sched: Lazy migrate_disable processing
  sched: migrate_enable: Use stop_one_cpu_nowait()

 include/linux/cpu.h  |   4 -
 include/linux/sched.h|  11 +--
 include/linux/stop_machine.h |   2 +
 init/init_task.c |   4 +
 kernel/cpu.c | 103 +--
 kernel/sched/core.c  | 195 ++-
 kernel/sched/sched.h |   4 +
 kernel/stop_machine.c|   7 +-
 lib/smp_processor_id.c   |   3 +
 9 files changed, 142 insertions(+), 191 deletions(-)

-- 
1.8.3.1



[PATCH RT v2 3/3] sched: migrate_enable: Use stop_one_cpu_nowait()

2019-10-12 Thread Scott Wood
migrate_enable() can be called with current->state != TASK_RUNNING.
Avoid clobbering the existing state by using stop_one_cpu_nowait().
Since we're stopping the current cpu, we know that we won't get
past __schedule() until migration_cpu_stop() has run (at least up to
the point of migrating us to another cpu).

Signed-off-by: Scott Wood 
---
 include/linux/stop_machine.h |  2 ++
 kernel/sched/core.c  | 22 +-
 kernel/stop_machine.c|  7 +--
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 6d3635c86dbe..82fc686ddd9e 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -26,6 +26,8 @@ struct cpu_stop_work {
cpu_stop_fn_t   fn;
void*arg;
struct cpu_stop_done*done;
+   /* Did not run due to disabled stopper; for nowait debug checks */
+   booldisabled;
 };
 
 int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5cb2a519b8bf..6383ade320f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1051,6 +1051,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
rq_flags *rf,
 struct migration_arg {
struct task_struct *task;
int dest_cpu;
+   bool done;
 };
 
 /*
@@ -1086,6 +1087,11 @@ static int migration_cpu_stop(void *data)
struct task_struct *p = arg->task;
struct rq *rq = this_rq();
struct rq_flags rf;
+   int dest_cpu = arg->dest_cpu;
+
+   /* We don't look at arg after this point. */
+   smp_mb();
+   arg->done = true;
 
/*
 * The original target CPU might have gone down and we might
@@ -1108,9 +1114,9 @@ static int migration_cpu_stop(void *data)
 */
if (task_rq(p) == rq) {
if (task_on_rq_queued(p))
-   rq = __migrate_task(rq, &rf, p, arg->dest_cpu);
+   rq = __migrate_task(rq, &rf, p, dest_cpu);
else
-   p->wake_cpu = arg->dest_cpu;
+   p->wake_cpu = dest_cpu;
}
rq_unlock(rq, &rf);
raw_spin_unlock(&p->pi_lock);
@@ -7392,6 +7398,7 @@ void migrate_enable(void)
WARN_ON(smp_processor_id() != cpu);
if (!is_cpu_allowed(p, cpu)) {
struct migration_arg arg = { p };
+   struct cpu_stop_work work;
struct rq_flags rf;
 
rq = task_rq_lock(p, &rf);
@@ -7399,13 +7406,10 @@ void migrate_enable(void)
arg.dest_cpu = select_fallback_rq(cpu, p);
task_rq_unlock(rq, p, &rf);
 
-   preempt_lazy_enable();
-   preempt_enable();
-
-   sleeping_lock_inc();
-   stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
-   sleeping_lock_dec();
-   return;
+   stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
+   &arg, &work);
+   __schedule(true);
+   WARN_ON_ONCE(!arg.done && !work.disabled);
}
 
 out:
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 2b5a6754646f..fa53a472dd44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -85,8 +85,11 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct 
cpu_stop_work *work)
enabled = stopper->enabled;
if (enabled)
__cpu_stop_queue_work(stopper, work, &wakeq);
-   else if (work->done)
-   cpu_stop_signal_done(work->done);
+   else {
+   work->disabled = true;
+   if (work->done)
+   cpu_stop_signal_done(work->done);
+   }
raw_spin_unlock_irqrestore(&stopper->lock, flags);
 
wake_up_q(&wakeq);
-- 
1.8.3.1
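
A hedged usage sketch contrasting the two stopper calls involved (signatures as in the kernel of that era; error handling elided):

static void migrate_self_sketch(struct task_struct *p, struct migration_arg *arg)
{
	struct cpu_stop_work work;

	/*
	 * stop_one_cpu() waits on a completion and therefore sleeps,
	 * clobbering current->state -- the problem described above.
	 * stop_one_cpu_nowait() only queues the work and returns; the
	 * caller keeps its state and relies on __schedule() not getting
	 * past the stopper running on this CPU.
	 */
	stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop, arg, &work);
}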



[PATCH RT v2 1/3] sched: migrate_enable: Use select_fallback_rq()

2019-10-12 Thread Scott Wood
migrate_enable() currently open-codes a variant of select_fallback_rq().
However, it does not have the "No more Mr. Nice Guy" fallback and thus
it will pass an invalid CPU to the migration thread if cpus_mask only
contains a CPU that is !active.

Signed-off-by: Scott Wood 
---
This scenario will be more likely after the next patch, since
the migrate_disable_update check goes away.  However, it could happen
anyway if cpus_mask was updated to a CPU other than the one we were
pinned to, and that CPU subsequently became inactive.
---
 kernel/sched/core.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 93b4ae1ecaff..bfd160e90797 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7405,6 +7405,7 @@ void migrate_enable(void)
if (p->migrate_disable_update) {
struct rq *rq;
struct rq_flags rf;
+   int cpu = task_cpu(p);
 
rq = task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -7414,21 +7415,15 @@ void migrate_enable(void)
 
p->migrate_disable_update = 0;
 
-   WARN_ON(smp_processor_id() != task_cpu(p));
-   if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
-   const struct cpumask *cpu_valid_mask = cpu_active_mask;
-   struct migration_arg arg;
-   unsigned int dest_cpu;
-
-   if (p->flags & PF_KTHREAD) {
-   /*
-* Kernel threads are allowed on online && 
!active CPUs
-*/
-   cpu_valid_mask = cpu_online_mask;
-   }
-   dest_cpu = cpumask_any_and(cpu_valid_mask, 
&p->cpus_mask);
-   arg.task = p;
-   arg.dest_cpu = dest_cpu;
+   WARN_ON(smp_processor_id() != cpu);
+   if (!cpumask_test_cpu(cpu, &p->cpus_mask)) {
+   struct migration_arg arg = { p };
+   struct rq_flags rf;
+
+   rq = task_rq_lock(p, &rf);
+   update_rq_clock(rq);
+   arg.dest_cpu = select_fallback_rq(cpu, p);
+   task_rq_unlock(rq, p, &rf);
 
unpin_current_cpu();
preempt_lazy_enable();
-- 
1.8.3.1



Re: [PATCH RT 5/8] sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()

2019-10-09 Thread Scott Wood
On Wed, 2019-10-09 at 09:27 +0200, Juri Lelli wrote:
> On 09/10/19 01:25, Scott Wood wrote:
> > On Tue, 2019-10-01 at 10:52 +0200, Juri Lelli wrote:
> > > On 30/09/19 11:24, Scott Wood wrote:
> > > > On Mon, 2019-09-30 at 09:12 +0200, Juri Lelli wrote:
> > > 
> > > [...]
> > > 
> > > > > Hummm, I was actually more worried about the fact that we call
> > > > > free_old_
> > > > > cpuset_bw_dl() only if p->state != TASK_WAKING.
> > > > 
> > > > Oh, right. :-P  Not sure what I had in mind there; we want to call
> > > > it
> > > > regardless.
> > > > 
> > > > I assume we need rq->lock in free_old_cpuset_bw_dl()?  So something
> > > > like
> > > 
> > > I think we can do with rcu_read_lock_sched() (see
> > > dl_task_can_attach()).
> > 
> > RCU will keep dl_bw from being freed under us (we're implicitly in an
> > RCU
> > sched read section due to atomic context).  It won't stop rq->rd from
> > changing, but that could have happened before we took rq->lock.  If the
> > cpu
> > the task was running on was removed from the cpuset, and that raced with
> > the
> > task being moved to a different cpuset, couldn't we end up erroneously
> > subtracting from the cpu's new root domain (or failing to subtract at
> > all if
> > the old cpu's new cpuset happens to be the task's new cpuset)?  I don't
> > see
> > anything that forces tasks off of the cpu when a cpu is removed from a
> > cpuset (though maybe I'm not looking in the right place), so the race
> > window
> > could be quite large.  In any case, that's an existing problem that's
> > not
> > going to get solved in this patchset.
> 
> OK. So, mainline has got cpuset_read_lock() which should be enough to
> guard against changes to rd(s).
> 
> I agree that rq->lock is needed here.

My point was that rq->lock isn't actually helping, because rq->rd could have
changed before rq->lock is acquired (and it's still the old rd that needs
the bandwidth subtraction).  cpuset_mutex/cpuset_rwsem helps somewhat,
though there's still a problem due to the subtraction not happening until
the task is woken up (by which time cpuset_mutex may have been released and
further reconfiguration could have happened).  This would be an issue even
without lazy migrate, since in that case ->set_cpus_allowed() can get
deferred, but this patch makes the window much bigger.

The right solution is probably to explicitly track the rd for which a task
has a pending bandwidth subtraction (if any), and to continue doing it from
set_cpus_allowed() if the task is not migrate-disabled.  In the meantime, I
think we should drop this patch from the patchset -- without it, lazy
migrate disable doesn't really make the race situation any worse than it
already was.

BTW, what happens to the bw addition in dl_task_can_attach() if a subsequent
can_attach fails and the whole operation is cancelled?

-Scott
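
A purely hypothetical sketch of the "track the rd owing a subtraction" idea floated above (struct, field, and helper names are invented for illustration):

struct dl_pending_sub {
	struct root_domain	*rd;	/* rd whose dl_bw still owes a subtraction */
	u64			bw;	/* bandwidth to subtract once it is safe */
};

static void dl_flush_pending_sub(struct dl_pending_sub *ps)
{
	unsigned long flags;

	if (!ps->rd)
		return;
	raw_spin_lock_irqsave(&ps->rd->dl_bw.lock, flags);
	__dl_sub(&ps->rd->dl_bw, ps->bw, cpumask_weight(ps->rd->span));
	raw_spin_unlock_irqrestore(&ps->rd->dl_bw.lock, flags);
	ps->rd = NULL;
}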




Re: [PATCH v7 00/12] implement KASLR for powerpc/fsl_booke/32

2019-10-09 Thread Scott Wood
On Wed, 2019-10-09 at 16:41 +0800, Jason Yan wrote:
> Hi Scott,
> 
> On 2019/10/9 15:13, Scott Wood wrote:
> > On Wed, 2019-10-09 at 14:10 +0800, Jason Yan wrote:
> > > Hi Scott,
> > > 
> > > Would you please take sometime to test this?
> > > 
> > > Thank you so much.
> > > 
> > > On 2019/9/24 13:52, Jason Yan wrote:
> > > > Hi Scott,
> > > > 
> > > > Can you test v7 to see if it works to load a kernel at a non-zero
> > > > address?
> > > > 
> > > > Thanks,
> > 
> > Sorry for the delay.  Here's the output:
> > 
> 
> Thanks for the test.
> 
> > ## Booting kernel from Legacy Image at 1000 ...
> > Image Name:   Linux-5.4.0-rc2-00050-g8ac2cf5b4
> > Image Type:   PowerPC Linux Kernel Image (gzip compressed)
> > Data Size:7521134 Bytes = 7.2 MiB
> > Load Address: 0400
> > Entry Point:  0400
> > Verifying Checksum ... OK
> > ## Flattened Device Tree blob at 1fc0
> > Booting using the fdt blob at 0x1fc0
> > Uncompressing Kernel Image ... OK
> > Loading Device Tree to 07fe, end 07fff65c ... OK
> > KASLR: No safe seed for randomizing the kernel base.
> > OF: reserved mem: initialized node qman-fqd, compatible id fsl,qman-fqd
> > OF: reserved mem: initialized node qman-pfdr, compatible id fsl,qman-pfdr
> > OF: reserved mem: initialized node bman-fbpr, compatible id fsl,bman-fbpr
> > Memory CAM mapping: 64/64/64 Mb, residual: 12032Mb
> 
> When boot from 0400, the max CAM value is 64M. And
> you have a board with 12G memory, CONFIG_LOWMEM_CAM_NUM=3 means only
> 192M memory is mapped and when kernel is randomized at the middle of 
> this 192M memory, we will not have enough continuous memory for node map.
> 
> Can you set CONFIG_LOWMEM_CAM_NUM=8 and see if it works?

OK, that worked.

-Scott




Re: [PATCH v7 00/12] implement KASLR for powerpc/fsl_booke/32

2019-10-09 Thread Scott Wood
On Wed, 2019-10-09 at 14:10 +0800, Jason Yan wrote:
> Hi Scott,
> 
> Would you please take sometime to test this?
> 
> Thank you so much.
> 
> On 2019/9/24 13:52, Jason Yan wrote:
> > Hi Scott,
> > 
> > Can you test v7 to see if it works to load a kernel at a non-zero address?
> > 
> > Thanks,

Sorry for the delay.  Here's the output:

## Booting kernel from Legacy Image at 1000 ...
   Image Name:   Linux-5.4.0-rc2-00050-g8ac2cf5b4
   Image Type:   PowerPC Linux Kernel Image (gzip compressed)
   Data Size:7521134 Bytes = 7.2 MiB
   Load Address: 0400
   Entry Point:  0400
   Verifying Checksum ... OK
## Flattened Device Tree blob at 1fc0
   Booting using the fdt blob at 0x1fc0
   Uncompressing Kernel Image ... OK
   Loading Device Tree to 07fe, end 07fff65c ... OK
KASLR: No safe seed for randomizing the kernel base.
OF: reserved mem: initialized node qman-fqd, compatible id fsl,qman-fqd
OF: reserved mem: initialized node qman-pfdr, compatible id fsl,qman-pfdr
OF: reserved mem: initialized node bman-fbpr, compatible id fsl,bman-fbpr
Memory CAM mapping: 64/64/64 Mb, residual: 12032Mb
Linux version 5.4.0-rc2-00050-g8ac2cf5b4e4a-dirty (scott@snotra) (gcc version 8.
1.0 (GCC)) #26 SMP Wed Oct 9 01:50:40 CDT 2019
Using CoreNet Generic machine description
printk: bootconsole [udbg0] enabled
CPU maps initialized for 1 thread per core
-
phys_mem_size = 0x2fc00
dcache_bsize  = 0x40
icache_bsize  = 0x40
cpu_features  = 0x03b4
  possible= 0x010103bc
  always  = 0x0020
cpu_user_features = 0x8c008000 0x0800
mmu_features  = 0x000a0010
physical_start= 0xc7c4000
-
CoreNet Generic board
mpc85xx_qe_init: Could not find Quicc Engine node
barrier-nospec: using isync; sync as speculation barrier
Zone ranges:
  Normal   [mem 0x0400-0x0fff]
  HighMem  [mem 0x1000-0x0002]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0400-0x0002]
Initmem setup node 0 [mem 0x0400-0x0002]
Kernel panic - not syncing: Failed to allocate 125173760 bytes for node 0 memory
 map
CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-rc2-00050-g8ac2cf5b4e4a-dirty #26
Call Trace:
[c989fe10] [c924bfb0] dump_stack+0x84/0xb4 (unreliable)
[c989fe30] [c880badc] panic+0x140/0x334
[c989fe90] [c89a1144] alloc_node_mem_map.constprop.117+0xa0/0x11c
[c989feb0] [c95481c4] free_area_init_node+0x314/0x5b8
[c989ff30] [c9548b34] free_area_init_nodes+0x57c/0x5c0
[c989ff80] [c952cbb4] setup_arch+0x250/0x270
[c989ffa0] [c95278e0] start_kernel+0x74/0x4e8
[c989fff0] [c87c4478] set_ivor+0x150/0x18c
Kernel Offset: 0x87c4000 from 0xc000
Rebooting in 180 seconds..

-Scott




Re: [PATCH RT 5/8] sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()

2019-10-09 Thread Scott Wood
On Tue, 2019-10-01 at 10:52 +0200, Juri Lelli wrote:
> On 30/09/19 11:24, Scott Wood wrote:
> > On Mon, 2019-09-30 at 09:12 +0200, Juri Lelli wrote:
> 
> [...]
> 
> > > Hummm, I was actually more worried about the fact that we call
> > > free_old_
> > > cpuset_bw_dl() only if p->state != TASK_WAKING.
> > 
> > Oh, right. :-P  Not sure what I had in mind there; we want to call it
> > regardless.
> > 
> > I assume we need rq->lock in free_old_cpuset_bw_dl()?  So something like
> 
> I think we can do with rcu_read_lock_sched() (see dl_task_can_attach()).

RCU will keep dl_bw from being freed under us (we're implicitly in an RCU
sched read section due to atomic context).  It won't stop rq->rd from
changing, but that could have happened before we took rq->lock.  If the cpu
the task was running on was removed from the cpuset, and that raced with the
task being moved to a different cpuset, couldn't we end up erroneously
subtracting from the cpu's new root domain (or failing to subtract at all if
the old cpu's new cpuset happens to be the task's new cpuset)?  I don't see
anything that forces tasks off of the cpu when a cpu is removed from a
cpuset (though maybe I'm not looking in the right place), so the race window
could be quite large.  In any case, that's an existing problem that's not
going to get solved in this patchset.

-Scott




[PATCH] tick-sched: Update nohz load even if tick already stopped

2019-10-02 Thread Scott Wood
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz.  This can be particularly noticeable if nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.  We've had reports of a loadavg near 150 on a mostly idle
system.

Calling calc_load_nohz_start() regardless of whether the tick is already
stopped addresses the issue when going idle.  Tracking load changes when
not going idle (e.g. multiple SCHED_FIFO tasks coming and going) is not
addressed by this patch.

Signed-off-by: Scott Wood 
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 955851748dc3..f177d8168400 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -763,6 +763,9 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int 
cpu)
ts->do_timer_last = 0;
}
 
+   /* Even if the tick was already stopped, load may have changed */
+   calc_load_nohz_start();
+
/* Skip reprogram of event if its not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
@@ -783,7 +786,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int 
cpu)
 * the scheduler tick in nohz_restart_sched_tick.
 */
if (!ts->tick_stopped) {
-   calc_load_nohz_start();
quiet_vmstat();
 
ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
-- 
1.8.3.1



Re: [PATCH RT 5/8] sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()

2019-09-30 Thread Scott Wood
On Mon, 2019-09-30 at 09:12 +0200, Juri Lelli wrote:
> On 27/09/19 11:40, Scott Wood wrote:
> > On Fri, 2019-09-27 at 10:11 +0200, Juri Lelli wrote:
> > > Hi Scott,
> > > 
> > > On 27/07/19 00:56, Scott Wood wrote:
> > > > With the changes to migrate disabling, ->set_cpus_allowed() no
> > > > longer
> > > > gets deferred until migrate_enable().  To avoid releasing the
> > > > bandwidth
> > > > while the task may still be executing on the old CPU, move the
> > > > subtraction
> > > > to ->migrate_task_rq().
> > > > 
> > > > Signed-off-by: Scott Wood 
> > > > ---
> > > >  kernel/sched/deadline.c | 67 +++---
> > > > 
> > > > ---
> > > >  1 file changed, 31 insertions(+), 36 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > > index c18be51f7608..2f18d0cf1b56 100644
> > > > --- a/kernel/sched/deadline.c
> > > > +++ b/kernel/sched/deadline.c
> > > > @@ -1606,14 +1606,42 @@ static void yield_task_dl(struct rq *rq)
> > > > return cpu;
> > > >  }
> > > >  
> > > > +static void free_old_cpuset_bw_dl(struct rq *rq, struct task_struct
> > > > *p)
> > > > +{
> > > > +   struct root_domain *src_rd = rq->rd;
> > > > +
> > > > +   /*
> > > > +* Migrating a SCHED_DEADLINE task between exclusive
> > > > +* cpusets (different root_domains) entails a bandwidth
> > > > +* update. We already made space for us in the destination
> > > > +* domain (see cpuset_can_attach()).
> > > > +*/
> > > > +   if (!cpumask_intersects(src_rd->span, p->cpus_ptr)) {
> > > > +   struct dl_bw *src_dl_b;
> > > > +
> > > > +   src_dl_b = dl_bw_of(cpu_of(rq));
> > > > +   /*
> > > > +* We now free resources of the root_domain we are
> > > > migrating
> > > > +* off. In the worst case, sched_setattr() may
> > > > temporary
> > > > fail
> > > > +* until we complete the update.
> > > > +*/
> > > > +   raw_spin_lock(&src_dl_b->lock);
> > > > +   __dl_sub(src_dl_b, p->dl.dl_bw,
> > > > dl_bw_cpus(task_cpu(p)));
> > > > +   raw_spin_unlock(&src_dl_b->lock);
> > > > +   }
> > > > +}
> > > > +
> > > >  static void migrate_task_rq_dl(struct task_struct *p, int new_cpu
> > > > __maybe_unused)
> > > >  {
> > > > struct rq *rq;
> > > >  
> > > > -   if (p->state != TASK_WAKING)
> > > > +   rq = task_rq(p);
> > > > +
> > > > +   if (p->state != TASK_WAKING) {
> > > > +   free_old_cpuset_bw_dl(rq, p);
> > > 
> > > What happens if a DEADLINE task is moved between cpusets while it was
> > > sleeping? Don't we miss removing from the old cpuset if the task gets
> > > migrated on wakeup?
> > 
> > In that case set_task_cpu() is called by ttwu after setting state to
> > TASK_WAKING.
> 
> Right.
> 
> > I guess it could be annoying if the task doesn't wake up for a
> > long time and therefore doesn't release the bandwidth until then.
> 
> Hummm, I was actually more worried about the fact that we call free_old_
> cpuset_bw_dl() only if p->state != TASK_WAKING.

Oh, right. :-P  Not sure what I had in mind there; we want to call it
regardless.

I assume we need rq->lock in free_old_cpuset_bw_dl()?  So something like
this:

if (p->state == TASK_WAKING)
	raw_spin_lock(&rq->lock);
free_old_cpuset_bw_dl(rq, p);
if (p->state != TASK_WAKING)
	return;

if (p->dl.dl_non_contending) {


BTW, is the full cpumask_intersects() necessary or would it suffice to see
that the new cpu is not in the old span?

-Scott




Re: [PATCH RT 4/8] sched: migrate disable: Protect cpus_ptr with lock

2019-09-27 Thread Scott Wood
On Fri, 2019-09-27 at 14:19 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-26 11:52:42 [-0500], Scott Wood wrote:
> > Looks good, thanks!
> 
> Thanks, just released.
> Moving forward. It would be nice to have some DL-dev feedback on DL
> patch. For the remaining once, could please throw Steven's
> stress-test-hostplug-cpu-script? If that one does not complain I don't
> see a reason why not apply the patches (since they improve performance
> and do not break anything while doing so).

I'd been using a quick-and-dirty script that does something similar, and ran
it along with rcutorture, a kernel build, and something that randomly
changes affinities -- though with these loads I have to ignore occasional
RCU forward progress complaints that Paul said were expected with this
version of the RCU code, XFS warnings that happen even on unmodified non-RT, 
and sometimes a transitory netdev timeout that is not that surprising given
the heavy load and affinity stress (I get it without these patches as well,
though it takes longer).

I just ran Steven's script (both alone and with the other stuff above) and
saw no difference.

-Scott



Re: [PATCH RT 5/8] sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()

2019-09-27 Thread Scott Wood
On Fri, 2019-09-27 at 10:11 +0200, Juri Lelli wrote:
> Hi Scott,
> 
> On 27/07/19 00:56, Scott Wood wrote:
> > With the changes to migrate disabling, ->set_cpus_allowed() no longer
> > gets deferred until migrate_enable().  To avoid releasing the bandwidth
> > while the task may still be executing on the old CPU, move the
> > subtraction
> > to ->migrate_task_rq().
> > 
> > Signed-off-by: Scott Wood 
> > ---
> >  kernel/sched/deadline.c | 67 +++---
> > ---
> >  1 file changed, 31 insertions(+), 36 deletions(-)
> > 
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index c18be51f7608..2f18d0cf1b56 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -1606,14 +1606,42 @@ static void yield_task_dl(struct rq *rq)
> > return cpu;
> >  }
> >  
> > +static void free_old_cpuset_bw_dl(struct rq *rq, struct task_struct *p)
> > +{
> > +   struct root_domain *src_rd = rq->rd;
> > +
> > +   /*
> > +* Migrating a SCHED_DEADLINE task between exclusive
> > +* cpusets (different root_domains) entails a bandwidth
> > +* update. We already made space for us in the destination
> > +* domain (see cpuset_can_attach()).
> > +*/
> > +   if (!cpumask_intersects(src_rd->span, p->cpus_ptr)) {
> > +   struct dl_bw *src_dl_b;
> > +
> > +   src_dl_b = dl_bw_of(cpu_of(rq));
> > +   /*
> > +* We now free resources of the root_domain we are migrating
> > +* off. In the worst case, sched_setattr() may temporary
> > fail
> > +* until we complete the update.
> > +*/
> > +   raw_spin_lock(&src_dl_b->lock);
> > +   __dl_sub(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
> > +   raw_spin_unlock(&src_dl_b->lock);
> > +   }
> > +}
> > +
> >  static void migrate_task_rq_dl(struct task_struct *p, int new_cpu
> > __maybe_unused)
> >  {
> > struct rq *rq;
> >  
> > -   if (p->state != TASK_WAKING)
> > +   rq = task_rq(p);
> > +
> > +   if (p->state != TASK_WAKING) {
> > +   free_old_cpuset_bw_dl(rq, p);
> 
> What happens if a DEADLINE task is moved between cpusets while it was
> sleeping? Don't we miss removing from the old cpuset if the task gets
> migrated on wakeup?

In that case set_task_cpu() is called by ttwu after setting state to
TASK_WAKING.  I guess it could be annoying if the task doesn't wake up for a
long time and therefore doesn't release the bandwidth until then.

-Scott




Re: [PATCH RT 4/8] sched: migrate disable: Protect cpus_ptr with lock

2019-09-26 Thread Scott Wood
On Thu, 2019-09-26 at 18:39 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-27 00:56:34 [-0500], Scott Wood wrote:
> > Various places assume that cpus_ptr is protected by rq/pi locks,
> > so don't change it before grabbing those locks.
> > 
> > Signed-off-by: Scott Wood 
> 
> I applied now everything until here and you can take a look at
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-5.2.y-rt-RC
> 
> If there are no objections then I would make this a -rt9 and then go
> further.
> 
> Sebastian

Looks good, thanks!

-Scott




Re: [PATCH RT 7/8] sched: migrate_enable: Use select_fallback_rq()

2019-09-24 Thread Scott Wood
On Tue, 2019-09-17 at 18:00 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-27 00:56:37 [-0500], Scott Wood wrote:
> > migrate_enable() currently open-codes a variant of select_fallback_rq().
> > However, it does not have the "No more Mr. Nice Guy" fallback and thus
> > it will pass an invalid CPU to the migration thread if cpus_mask only
> > contains a CPU that is !active.
> > 
> > Signed-off-by: Scott Wood 
> > ---
> > This scenario will be more likely after the next patch, since
> > the migrate_disable_update check goes away.  However, it could happen
> > anyway if cpus_mask was updated to a CPU other than the one we were
> > pinned to, and that CPU subsequently became inactive.
> 
> I'm unclear about the problem / side effect this has (before and after
> the change). It is possible (before and after that change) that a CPU is
> selected which is invalid / goes offline after the "preempt_enable()"
> statement and before stop_one_cpu() does its job, correct?

By "pass an invalid CPU" I don't mean offline; I mean >= nr_cpu_ids which is
what cpumask_any_and() returns if the sets don't intersect (a CPU going
offline is merely a way to end up in that situation).  At one point I
observed that causing a crash.  I guess is_cpu_allowed() returned true by
chance based on the contents of data beyond the end of the cpumask?  That
doesn't seem likely based on what comes after p->cpus_mask (at that point
migrate_disable should be zero), but maybe I had something else at that
point in the struct while developing.  In any case, not something to be
relied on. :-)
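
For reference, the failing path boils down to roughly the following (per the
old migrate_enable() code; sketch only):

	const struct cpumask *cpu_valid_mask = cpu_active_mask;
	unsigned int dest_cpu;

	/* If p->cpus_mask no longer intersects the active mask,
	 * cpumask_any_and() returns nr_cpu_ids rather than a real CPU. */
	dest_cpu = cpumask_any_and(cpu_valid_mask, &p->cpus_mask);

That out-of-range value then gets handed to the migration thread, which is
what select_fallback_rq()'s "No more Mr. Nice Guy" fallback avoids.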

Going offline after selection shouldn't be a problem.  migration_cpu_stop()
won't do anything if is_cpu_allowed() returns false, and we'll get migrated
off the CPU by migrate_tasks().  Even if we get migrated after reading
task_cpu(p) but before queuing the stop machine work, it'll either get
flushed when the stopper thread parks, or rejected due to stopper->enabled
being false.

-Scott




Re: [PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-24 Thread Scott Wood
On Tue, 2019-09-24 at 18:05 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-24 10:47:36 [-0500], Scott Wood wrote:
> > When the stop machine finishes it will do a wake_up_process() via
> > complete().  Since this does not pass WF_LOCK_SLEEPER, saved_state will
> > be
> > cleared, and you'll have TASK_RUNNING when you get to other_func() and
> > schedule(), regardless of whether CPU1 sends wake_up() -- so this change
> > doesn't actually accomplish anything.
> 
> True, I completely missed that part.
> 
> > While as noted in the other thread I don't think these spurious wakeups
> > are
> > a huge problem, we could avoid them by doing stop_one_cpu_nowait() and
> > then
> > schedule() without messing with task state.  Since we're stopping our
> > own
> > cpu, it should be guaranteed that the stopper has finished by the time
> > we
> > exit schedule().
> 
> I remember losing a state can be a problem. Let's say it is not "just"
> TASK_UNINTERRUPTIBLE -> TASK_RUNNING which sounds harmless but it is
> __TASK_TRACED and you lose it as part of unlocking siglock.

OK, sounds like stop_one_cpu_nowait() is the way to go then.

-Scott




Re: [PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-24 Thread Scott Wood
On Tue, 2019-09-24 at 17:25 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-24 08:53:43 [-0500], Scott Wood wrote:
> > As I pointed out in the "[PATCH RT 6/8] sched: migrate_enable: Set state
> > to
> > TASK_RUNNING" discussion, we can get here inside the rtmutex code (e.g.
> > from
> > debug_rt_mutex_print_deadlock) where saved_state is already holding
> > something -- plus, the waker won't have WF_LOCK_SLEEPER and therefore
> > saved_state will get cleared anyway.
> 
> So let me drop the saved_state pieces and get back to it once I get to
> the other thread (which you replied and I didn't realised until now).
> 
> Regarding the WF_LOCK_SLEEPER part. I think this works as expected.
> Imagine:
> 
> CPU0  CPU1
> spin_lock();
> set_current_state(TASK_UNINTERRUPTIBLE);
> …
> spin_unlock()
>  -> migrate_enable();
>-> stop_one_cpu(); <-- A)
> other_func(); <-- B)
> schedule();
> 
> So. With only CPU0 we enter schedule() with TASK_UNINTERRUPTIBLE because
> the state gets preserved with the change I added (which is expected).
> If CPU1 sends a wake_up() at A) then the saved_state gets overwritten
> and we enter schedule() with TASK_RUNNING. Same happens if it is sent at
> B) point which is outside of any migrate/spin lock related code. 
> 
> Was this clear or did I miss the point?

When the stop machine finishes it will do a wake_up_process() via
complete().  Since this does not pass WF_LOCK_SLEEPER, saved_state will be
cleared, and you'll have TASK_RUNNING when you get to other_func() and
schedule(), regardless of whether CPU1 sends wake_up() -- so this change
doesn't actually accomplish anything.

While as noted in the other thread I don't think these spurious wakeups are
a huge problem, we could avoid them by doing stop_one_cpu_nowait() and then
schedule() without messing with task state.  Since we're stopping our own
cpu, it should be guaranteed that the stopper has finished by the time we
exit schedule().
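
Concretely, something like the following untested sketch (glossing over
where the cpu_stop_work buffer should live):

	struct cpu_stop_work work;

	/* Queue the stop work without waiting for completion... */
	stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop, &arg, &work);
	/* ...and let the stopper preempt us; by the time schedule() returns
	 * the migration has happened, with no task-state games needed. */
	schedule();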

-Scott




Re: [PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-24 Thread Scott Wood
On Tue, 2019-09-24 at 13:21 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-23 19:52:33 [+0200], To Scott Wood wrote:
> 
> I made dis:
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 885a195dfbe02..25afa2bb1a2cf 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -308,7 +308,9 @@ void pin_current_cpu(void)
>   preempt_lazy_enable();
>   preempt_enable();
>  
> + sleeping_lock_inc();
>   __read_rt_lock(cpuhp_pin);
> + sleeping_lock_dec();
>  
>   preempt_disable();
>   preempt_lazy_disable();
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e1bdd7f9be054..63a6420d01053 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7388,6 +7388,7 @@ void migrate_enable(void)
>  
>   WARN_ON(smp_processor_id() != task_cpu(p));
>   if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
> + struct task_struct *self = current;
>   const struct cpumask *cpu_valid_mask =
> cpu_active_mask;
>   struct migration_arg arg;
>   unsigned int dest_cpu;
> @@ -7405,7 +7406,21 @@ void migrate_enable(void)
>   unpin_current_cpu();
>   preempt_lazy_enable();
>   preempt_enable();
> + rt_invol_sleep_inc();
> +
> + raw_spin_lock_irq(&self->pi_lock);
> + self->saved_state = self->state;
> + __set_current_state_no_track(TASK_RUNNING);
> + raw_spin_unlock_irq(&self->pi_lock);
> +
>   stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
> +
> + raw_spin_lock_irq(&self->pi_lock);
> + __set_current_state_no_track(self->saved_state);
> + self->saved_state = TASK_RUNNING;
> + raw_spin_unlock_irq(&self->pi_lock);
> +
> + rt_invol_sleep_dec();
>   return;
>   }
>   }
> 
> I think we need to preserve the current state, otherwise we will lose
> anything != TASK_RUNNING here. So the spin_lock() would preserve it
> while waiting but the migrate_enable() will lose it if it needs to
> change the CPU at the end.
> I will try to prepare all commits for the next release before I release
> so you can have a look first and yell if needed.

As I pointed out in the "[PATCH RT 6/8] sched: migrate_enable: Set state to
TASK_RUNNING" discussion, we can get here inside the rtmutex code (e.g. from
debug_rt_mutex_print_deadlock) where saved_state is already holding
something -- plus, the waker won't have WF_LOCK_SLEEPER and therefore
saved_state will get cleared anyway.

-Scott




Re: [PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-23 Thread Scott Wood
On Tue, 2019-09-17 at 09:06 -0500, Scott Wood wrote:
> On Tue, 2019-09-17 at 09:59 +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-09-11 17:57:27 [+0100], Scott Wood wrote:
> > > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > > index 885a195dfbe0..32c6175b63b6 100644
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -308,7 +308,9 @@ void pin_current_cpu(void)
> > >   preempt_lazy_enable();
> > >   preempt_enable();
> > >  
> > > + rt_invol_sleep_inc();
> > >   __read_rt_lock(cpuhp_pin);
> > > + rt_invol_sleep_dec();
> > >  
> > >   preempt_disable();
> > >   preempt_lazy_disable();
> > 
> > I understand the other one. But now looking at it, both end up in
> > rt_spin_lock_slowlock_locked() which would be the proper place to do
> > that annotation. Okay.
> 
> FWIW, if my lazy migrate patchset is accepted, then there will be no users
> of __read_rt_lock() outside rwlock-rt.c and it'll be moot.

I missed the "both" -- which is the "other one" that ends up in
rt_spin_lock_slowlock_locked()?  stop_one_cpu() doesn't...

-Scott




Re: [PATCH RT 6/8] sched: migrate_enable: Set state to TASK_RUNNING

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 17:31 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-27 00:56:36 [-0500], Scott Wood wrote:
> > If migrate_enable() is called while a task is preparing to sleep
> > (state != TASK_RUNNING), that triggers a debug check in stop_one_cpu().
> > Explicitly reset state to acknowledge that we're accepting the spurious
> > wakeup.
> > 
> > Signed-off-by: Scott Wood 
> > ---
> >  kernel/sched/core.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 38a9a9df5638..eb27a9bf70d7 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7396,6 +7396,14 @@ void migrate_enable(void)
> > unpin_current_cpu();
> > preempt_lazy_enable();
> > preempt_enable();
> > +
> > +   /*
> > +* Avoid sleeping with an existing non-running
> > +* state.  This will result in a spurious wakeup
> > +* for the calling context.
> > +*/
> > +   __set_current_state(TASK_RUNNING);
> 
> Do you have an example for this?

Unfortunately I didn't save the traceback of where I originally saw it, but
I ran it again and hit the warning in prepare_to_wait_event().

>  I'm not too sure if we are not using a
> state by doing this. Actually we were losing it and get yelled at.  We do:
> > rt_spin_unlock()
> > {
> > rt_spin_lock_fastunlock();
> > migrate_enable();
> > }
> 
> So save the state as part of the locking process and lose it as part of
> migrate_enable() if the CPU mask was changed. I *think* we need to
> preserve that state until after the migration to the other CPU.

Preserving the state would be ideal, though in most cases occasionally
losing the state should be tolerable as long as it doesn't happen
persistently.  TASK_DEAD is an exception but that doesn't do anything before
schedule() that could end up here.  I suppose there could be some places
that use TASK_UNINTERRUPTIBLE without checking the actual condition after
schedule(), and which do something that can end up here before schedule()...

We can't use saved_state here though, because we can get here inside the
rtmutex code (e.g. from debug_rt_mutex_print_deadlock)... and besides, the
waker won't have WF_LOCK_SLEEPER and so saved_state will get cleared.

-Scott




Re: [PATCH RT 8/8] sched: Lazy migrate_disable processing

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 18:50 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-27 00:56:38 [-0500], Scott Wood wrote:
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 885a195dfbe0..0096acf1a692 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -939,17 +893,34 @@ static int takedown_cpu(unsigned int cpu)
> >  */
> > irq_lock_sparse();
> >  
> > -#ifdef CONFIG_PREEMPT_RT_FULL
> > -   __write_rt_lock(cpuhp_pin);
> > +#ifdef CONFIG_PREEMPT_RT_BASE
> > +   WARN_ON_ONCE(takedown_cpu_task);
> > +   takedown_cpu_task = current;
> > +
> > +again:
> > +   for (;;) {
> > +   int nr_pinned;
> > +
> > +   set_current_state(TASK_UNINTERRUPTIBLE);
> > +   nr_pinned = cpu_nr_pinned(cpu);
> > +   if (nr_pinned == 0)
> > +   break;
> > +   schedule();
> > +   }
> 
> we used to have cpuhp_pin which ensured that once we own the write lock
> there will be no more tasks that can enter a migrate_disable() section
> on this CPU. It has been placed fairly late to ensure that nothing new
> comes in as part of the shutdown process and that it flushes everything
> out that is still in a migrate_disable() section.
> Now you claim that once the counter reached zero it never increments
> again. I would be happier if there was an explicit check for that :)

I don't claim that.  A check is added in take_cpu_down() to see whether it
went back up, and if so, exit with EAGAIN.  If *that* check succeeds, it
can't go back up because it's in stop machine, and any tasks will get
migrated to another CPU before they can run again.  There's also a WARN in
migrate_tasks() if somehow a migrate-disabled task does get encountered.
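
The re-check in take_cpu_down() is conceptually just this (a sketch, not the
literal hunk):

	/* A task may have entered a migrate_disable() section between
	 * takedown_cpu() seeing the count hit zero and stop machine
	 * starting; if so, bail out and let takedown_cpu() wait again. */
	if (cpu_nr_pinned(smp_processor_id()))
		return -EAGAIN;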

> There is no back off and flush mechanism which means on a busy CPU (as
> in heavily lock contended by multiple tasks) this will wait until the
> CPU gets idle again.

Not really any different from the reader-biased rwlock that this replaces...

-Scott




Re: [PATCH RT v3 5/5] rcutorture: Avoid problematic critical section nesting on RT

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 16:50 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-17 09:36:22 [-0500], Scott Wood wrote:
> > > On non-RT you can (but should not) use the counter part of the
> > > function
> > > in random order like:
> > >   local_bh_disable();
> > >   local_irq_disable();
> > >   local_bh_enable();
> > >   local_irq_enable();
> > 
> > Actually even non-RT will assert if you do local_bh_enable() with IRQs
> > disabled -- but the other combinations do work, and are used some places
> > via
> > spinlocks.  If they are used via direct calls to preempt_disable() or
> > local_irq_disable() (or via raw spinlocks), then that will not go away
> > on RT
> > and we'll have a problem.
> 
> lockdep_assert_irqs_enabled() is a nop with CONFIG_PROVE_LOCKING=N and
> RT breaks either way. 

Right, I meant a non-RT kernel with debug checks enabled.

> > > Since you _can_ use it in random order Paul wants to test that the
> > > random use of those function does not break RCU in any way. Since they
> > > can not be used on RT in random order it has been agreed that we keep
> > > the test for !RT but disable it on RT.
> > 
> > For now, yes.  Long term it would be good to keep track of when
> > preemption/irqs would be disabled on RT, even when running a non-RT
> > debug
> > kernel, and assert when bad things are done with it (assuming an RT-
> > capable
> > arch).  Besides detecting these fairly unusual patterns, it could also
> > detect earlier the much more common problem of nesting a non-raw
> > spinlock
> > inside a raw spinlock or other RT-atomic context.
> 
> you will be surprised but we have patches for that. We need first get
> rid of other "false positives" before plugging this in.

Nice!  Are the "false positives" real issues from components that are
currently blacklisted on RT, or something different?

-Scott




Re: [PATCH RT v3 1/5] rcu: Acquire RCU lock when disabling BHs

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 16:42 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-17 09:06:28 [-0500], Scott Wood wrote:
> > Sorry, I missed that you were asking about rcu_read_lock_bh() as
> > well.  I
> > did remove the change to rcu_read_lock_bh_held().
> 
> Sorry for not being clear here.
> 
> > With this patch, local_bh_disable() calls rcu_read_lock() on RT which
> > handles this debug stuff.  Doing it twice shouldn't be explicitly
> > harmful,
> > but it's redundant, and debug kernels are slow enough as is.
> 
> rcu_read_lock() does:
> > __rcu_read_lock();
> > __acquire(RCU);
> > rcu_lock_acquire(&rcu_lock_map);
> > RCU_LOCKDEP_WARN(!rcu_is_watching(),
> >  "rcu_read_lock() used illegally while idle");
> 
> __acquire() is removed removed by cpp.
> That RCU_LOCKDEP_WARN is doing the same thing as above and redundant.
> Am I right to assume that you consider
>   rcu_lock_acquire(&rcu_bh_lock_map);
> 
> redundant because the only user of that is also checking for
> rcu_lock_map?

Yes.

-Scott




Re: [PATCH RT 2/8] sched: __set_cpus_allowed_ptr: Check cpus_mask, not cpus_ptr

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 16:57 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-07-27 00:56:32 [-0500], Scott Wood wrote:
> > This function is concerned with the long-term cpu mask, not the
> > transitory mask the task might have while migrate disabled.  Before
> > this patch, if a task was migrate disabled at the time
> > __set_cpus_allowed_ptr() was called, and the new mask happened to be
> > equal to the cpu that the task was running on, then the mask update
> > would be lost.
> 
> lost as in "would not be carried out" I assume.

Right.  The old mask would be restored upon migrate_enable() even though
that's no longer the policy that was requested.
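
In other words, the equality check in __set_cpus_allowed_ptr() needs to look
at the long-term mask -- roughly (sketch of the intent, not the exact hunk):

	/* Compare against the permanent affinity, not the temporary
	 * mask that migrate_disable() may have installed. */
	if (cpumask_equal(&p->cpus_mask, new_mask))
		goto out;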

-Scott




Re: [PATCH RT v3 5/5] rcutorture: Avoid problematic critical section nesting on RT

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 12:07 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-16 11:55:57 [-0500], Scott Wood wrote:
> > On Thu, 2019-09-12 at 18:17 -0400, Joel Fernandes wrote:
> > > On Wed, Sep 11, 2019 at 05:57:29PM +0100, Scott Wood wrote:
> > > > rcutorture was generating some nesting scenarios that are not
> > > > reasonable.  Constrain the state selection to avoid them.
> > > > 
> > > > Example #1:
> > > > 
> > > > 1. preempt_disable()
> > > > 2. local_bh_disable()
> > > > 3. preempt_enable()
> > > > 4. local_bh_enable()
> > > > 
> > > > On PREEMPT_RT, BH disabling takes a local lock only when called in
> > > > non-atomic context.  Thus, atomic context must be retained until
> > > > after
> > > > BH
> > > > is re-enabled.  Likewise, if BH is initially disabled in non-atomic
> > > > context, it cannot be re-enabled in atomic context.
> > > > 
> > > > Example #2:
> > > > 
> > > > 1. rcu_read_lock()
> > > > 2. local_irq_disable()
> > > > 3. rcu_read_unlock()
> > > > 4. local_irq_enable()
> > > 
> > > If I understand correctly, these examples are not unrealistic in the
> > > real
> > > world unless RCU is used in the scheduler.
> > 
> > I hope you mean "not realistic", at least when it comes to explicit
> > preempt/irq disabling rather than spinlock variants that don't disable
> > preempt/irqs on PREEMPT_RT.
> 
> We have:
> - local_irq_disable() (+save)
> - spin_lock()
> - local_bh_disable()
> - preempt_disable()
> 
> On non-RT you can (but should not) use the counter part of the function
> in random order like:
>   local_bh_disable();
>   local_irq_disable();
>   local_bh_enable();
>   local_irq_enable();

Actually even non-RT will assert if you do local_bh_enable() with IRQs
disabled -- but the other combinations do work, and are used some places via
spinlocks.  If they are used via direct calls to preempt_disable() or
local_irq_disable() (or via raw spinlocks), then that will not go away on RT
and we'll have a problem.

> The non-RT will survive this. On RT the counterpart functions have to be
> used in reverse order:
>   local_bh_disable();
>   local_irq_disable();
>   local_irq_enable();
>   local_bh_enable();
> 
> or the kernel will fall apart.
> 
> Since you _can_ use it in random order Paul wants to test that the
> random use of those function does not break RCU in any way. Since they
> can not be used on RT in random order it has been agreed that we keep
> the test for !RT but disable it on RT.

For now, yes.  Long term it would be good to keep track of when
preemption/irqs would be disabled on RT, even when running a non-RT debug
kernel, and assert when bad things are done with it (assuming an RT-capable
arch).  Besides detecting these fairly unusual patterns, it could also
detect earlier the much more common problem of nesting a non-raw spinlock
inside a raw spinlock or other RT-atomic context.

-Scott




Re: [PATCH RT v3 4/5] rcu: Disable use_softirq on PREEMPT_RT

2019-09-17 Thread Scott Wood
On Wed, 2019-09-11 at 17:57 +0100, Scott Wood wrote:
>  kernel/rcu/tree.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index fc8b00c61b32..ee0a5ec2c30f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -98,9 +98,14 @@ struct rcu_state rcu_state = {
>  /* Dump rcu_node combining tree at boot to verify correct setup. */
>  static bool dump_tree;
>  module_param(dump_tree, bool, 0444);
> -/* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */
> -static bool use_softirq = 1;
> +/*
> + * By default, use RCU_SOFTIRQ instead of rcuc kthreads.
> + * But, avoid RCU_SOFTIRQ on PREEMPT_RT due to pi/rq deadlocks.
> + */
> +static bool use_softirq = !IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> +#ifdef CONFIG_PREEMPT_RT_FULL
>  module_param(use_softirq, bool, 0444);
> +#endif

Ugh, that should be s/ifdef/ifndef/

-Scott




Re: [PATCH RT v3 1/5] rcu: Acquire RCU lock when disabling BHs

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 09:44 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-11 17:57:25 [+0100], Scott Wood wrote:
> > 
> > @@ -615,10 +645,7 @@ static inline void rcu_read_unlock(void)
> >  static inline void rcu_read_lock_bh(void)
> >  {
> > local_bh_disable();
> > -   __acquire(RCU_BH);
> > -   rcu_lock_acquire(&rcu_bh_lock_map);
> > -   RCU_LOCKDEP_WARN(!rcu_is_watching(),
> > -"rcu_read_lock_bh() used illegally while idle");
> > +   rcu_bh_lock_acquire();
> >  }
> >  
> >  /*
> 
> I asked previously why do you need to change rcu_read_lock_bh() and you
> replied that you don't remember:
>
> https://lore.kernel.org/linux-rt-users/b948ec6cccda31925ed8dc123bd0f55423fff3d4.ca...@redhat.com/
> 
> Did this change?

Sorry, I missed that you were asking about rcu_read_lock_bh() as well.  I
did remove the change to rcu_read_lock_bh_held().

With this patch, local_bh_disable() calls rcu_read_lock() on RT which
handles this debug stuff.  Doing it twice shouldn't be explicitly harmful,
but it's redundant, and debug kernels are slow enough as is.
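
To spell out the RT side of it, conceptually (this is a sketch of the idea,
not the actual softirq.c hunk):

	void local_bh_disable(void)
	{
		/* ... existing PREEMPT_RT softirq local-lock handling ... */
		rcu_read_lock();	/* a BH-disabled region is an RCU reader */
	}

	void local_bh_enable(void)
	{
		rcu_read_unlock();
		/* ... existing PREEMPT_RT softirq local-lock handling ... */
	}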

-Scott




Re: [PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-17 Thread Scott Wood
On Tue, 2019-09-17 at 09:59 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-09-11 17:57:27 [+0100], Scott Wood wrote:
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 885a195dfbe0..32c6175b63b6 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -308,7 +308,9 @@ void pin_current_cpu(void)
> > preempt_lazy_enable();
> > preempt_enable();
> >  
> > +   rt_invol_sleep_inc();
> > __read_rt_lock(cpuhp_pin);
> > +   rt_invol_sleep_dec();
> >  
> > preempt_disable();
> > preempt_lazy_disable();
> 
> I understand the other one. But now looking at it, both end up in
> rt_spin_lock_slowlock_locked() which would be the proper place to do
> that annotation. Okay.

FWIW, if my lazy migrate patchset is accepted, then there will be no users
of __read_rt_lock() outside rwlock-rt.c and it'll be moot.

-Scott




Re: [PATCH v3 2/2] powerpc/83xx: map IMMR with a BAT.

2019-09-16 Thread Scott Wood
On Mon, 2019-09-16 at 20:25 +, Christophe Leroy wrote:
> On mpc83xx with a QE, IMMR is 2Mbytes and aligned on a 2Mbytes boundary.
> On mpc83xx without a QE, IMMR is 1Mbyte and 1Mbyte aligned.
> 
> Each driver will map a part of it to access the registers it needs.
> Some drivers will map the same part of IMMR as other drivers.
> 
> In order to reduce TLB misses, map the full IMMR with a BAT. If it is
> 2Mbytes aligned, map 2Mbytes. If there is no QE, the upper part will
> remain unused, but it doesn't harm as it is mapped as guarded memory.
> 
> When the IMMR is not aligned on a 2Mbytes boundary, only map 1Mbyte.
> 
> Signed-off-by: Christophe Leroy 
> 
> ---
> v2:
> - use a fixmap area instead of playing with ioremap_bot
> - always map 2M unless IMMRBAR is only 1M aligned
> 
> v3:
> - replaced __fix_to_virt() by fix_to_virt()
> ---
>  arch/powerpc/include/asm/fixmap.h  |  8 
>  arch/powerpc/platforms/83xx/misc.c | 11 +++
>  2 files changed, 19 insertions(+)

Acked-by: Scott Wood 

-Scott




Re: [PATCH RT v3 5/5] rcutorture: Avoid problematic critical section nesting on RT

2019-09-16 Thread Scott Wood
On Thu, 2019-09-12 at 18:17 -0400, Joel Fernandes wrote:
> On Wed, Sep 11, 2019 at 05:57:29PM +0100, Scott Wood wrote:
> > rcutorture was generating some nesting scenarios that are not
> > reasonable.  Constrain the state selection to avoid them.
> > 
> > Example #1:
> > 
> > 1. preempt_disable()
> > 2. local_bh_disable()
> > 3. preempt_enable()
> > 4. local_bh_enable()
> > 
> > On PREEMPT_RT, BH disabling takes a local lock only when called in
> > non-atomic context.  Thus, atomic context must be retained until after
> > BH
> > is re-enabled.  Likewise, if BH is initially disabled in non-atomic
> > context, it cannot be re-enabled in atomic context.
> > 
> > Example #2:
> > 
> > 1. rcu_read_lock()
> > 2. local_irq_disable()
> > 3. rcu_read_unlock()
> > 4. local_irq_enable()
> 
> If I understand correctly, these examples are not unrealistic in the real
> world unless RCU is used in the scheduler.

I hope you mean "not realistic", at least when it comes to explicit
preempt/irq disabling rather than spinlock variants that don't disable
preempt/irqs on PREEMPT_RT.

> > If the thread is preempted between steps 1 and 2,
> > rcu_read_unlock_special.b.blocked will be set, but it won't be
> > acted on in step 3 because IRQs are disabled.  Thus, reporting of the
> > quiescent state will be delayed beyond the local_irq_enable().
> 
> Yes, with consolidated RCU this can happen but AFAIK it has not seen to be
> a
> problem since deferred QS reporting will happen take care of it, which can
> also happen from subsequent rcu_read_unlock_special().

The defer_qs_iw_pending stuff isn't in 5.2-rt.  Still, given patch 4/5 (and
special.b.deferred_qs on mainline) this shouldn't present a deadlock concern
(letting the test run a bit now to double check) so this patch could
probably be limited to the "example #1" sequence.

> > For now, these scenarios will continue to be tested on non-PREEMPT_RT
> > kernels, until debug checks are added to ensure that they are not
> > happening elsewhere.
> 
> Are you seeing real issues that need this patch? It would be good to not
> complicate rcutorture if not needed.

rcutorture crashes on RT without this patch (in particular due to the
local_bh_disable misordering).

-Scott




Re: [PATCH v2 2/2] powerpc/83xx: map IMMR with a BAT.

2019-09-16 Thread Scott Wood
On Mon, 2019-09-16 at 06:42 +, Christophe Leroy wrote:
> @@ -145,6 +147,15 @@ void __init mpc83xx_setup_arch(void)
>   if (ppc_md.progress)
>   ppc_md.progress("mpc83xx_setup_arch()", 0);
>  
> + if (!__map_without_bats) {
> + phys_addr_t immrbase = get_immrbase();
> + int immrsize = IS_ALIGNED(immrbase, SZ_2M) ? SZ_2M : SZ_1M;
> + unsigned long va = __fix_to_virt(FIX_IMMR_BASE);

Why __fix_to_virt() and not fix_to_virt()?

-Scott




Re: [PATCH 2/2] powerpc/83xx: map IMMR with a BAT.

2019-09-15 Thread Scott Wood
On Sat, 2019-09-14 at 18:51 +0200, Christophe Leroy wrote:
> 
> Le 14/09/2019 à 16:34, Scott Wood a écrit :
> > On Fri, 2019-08-23 at 12:50 +, Christophe Leroy wrote:
> > > On mpc83xx with a QE, IMMR is 2Mbytes.
> > > On mpc83xx without a QE, IMMR is 1Mbytes.
> > > Each driver will map a part of it to access the registers it needs.
> > > Some drivers will map the same part of IMMR as other drivers.
> > > 
> > > In order to reduce TLB misses, map the full IMMR with a BAT.
> > > 
> > > Signed-off-by: Christophe Leroy 
> > > ---
> > >   arch/powerpc/platforms/83xx/misc.c | 10 ++
> > >   1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/arch/powerpc/platforms/83xx/misc.c
> > > b/arch/powerpc/platforms/83xx/misc.c
> > > index f46d7bf3b140..1e395b01c535 100644
> > > --- a/arch/powerpc/platforms/83xx/misc.c
> > > +++ b/arch/powerpc/platforms/83xx/misc.c
> > > @@ -18,6 +18,8 @@
> > >   #include 
> > >   #include 
> > >   
> > > +#include 
> > > +
> > >   #include "mpc83xx.h"
> > >   
> > >   static __be32 __iomem *restart_reg_base;
> > > @@ -145,6 +147,14 @@ void __init mpc83xx_setup_arch(void)
> > >   if (ppc_md.progress)
> > >   ppc_md.progress("mpc83xx_setup_arch()", 0);
> > >   
> > > + if (!__map_without_bats) {
> > > + int immrsize = IS_ENABLED(CONFIG_QUICC_ENGINE) ? SZ_2M :
> > > SZ_1M;
> > 
> > Any reason not to unconditionally make it 2M?  After all, the kernel being
> > built with CONFIG_QUICC_ENGINE doesn't mean that the hardware you're
> > running
> > on has it...
> > 
> 
> Euh .. ok. I didn't see it that way, but you are right.
> 
> Do you think it is not a problem to map 2M even when the quicc engine is 
> not there ? Or should it check device tree instead ?

It should be OK, since it's a guarded mapping.  Unless the IMMR base is not 2M
aligned...

-Scott




Re: [PATCH 2/2] powerpc/83xx: map IMMR with a BAT.

2019-09-14 Thread Scott Wood
On Fri, 2019-08-23 at 12:50 +, Christophe Leroy wrote:
> On mpc83xx with a QE, IMMR is 2Mbytes.
> On mpc83xx without a QE, IMMR is 1Mbytes.
> Each driver will map a part of it to access the registers it needs.
> Some drivers will map the same part of IMMR as other drivers.
> 
> In order to reduce TLB misses, map the full IMMR with a BAT.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/platforms/83xx/misc.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/83xx/misc.c
> b/arch/powerpc/platforms/83xx/misc.c
> index f46d7bf3b140..1e395b01c535 100644
> --- a/arch/powerpc/platforms/83xx/misc.c
> +++ b/arch/powerpc/platforms/83xx/misc.c
> @@ -18,6 +18,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #include "mpc83xx.h"
>  
>  static __be32 __iomem *restart_reg_base;
> @@ -145,6 +147,14 @@ void __init mpc83xx_setup_arch(void)
>   if (ppc_md.progress)
>   ppc_md.progress("mpc83xx_setup_arch()", 0);
>  
> + if (!__map_without_bats) {
> + int immrsize = IS_ENABLED(CONFIG_QUICC_ENGINE) ? SZ_2M :
> SZ_1M;

Any reason not to unconditionally make it 2M?  After all, the kernel being
built with CONFIG_QUICC_ENGINE doesn't mean that the hardware you're running
on has it...

-Scott




Re: [PATCH v6 00/12] implement KASLR for powerpc/fsl_booke/32

2019-09-14 Thread Scott Wood
On Tue, 2019-09-10 at 13:34 +0800, Jason Yan wrote:
> Hi Scott,
> 
> On 2019/8/28 12:05, Scott Wood wrote:
> > On Fri, 2019-08-09 at 18:07 +0800, Jason Yan wrote:
> > > This series implements KASLR for powerpc/fsl_booke/32, as a security
> > > feature that deters exploit attempts relying on knowledge of the
> > > location
> > > of kernel internals.
> > > 
> > > Since CONFIG_RELOCATABLE has already supported, what we need to do is
> > > map or copy kernel to a proper place and relocate.
> > 
> > Have you tested this with a kernel that was loaded at a non-zero
> > address?  I
> > tried loading a kernel at 0x0400 (by changing the address in the
> > uImage,
> > and setting bootm_low to 0400 in U-Boot), and it works without
> > CONFIG_RANDOMIZE and fails with.
> > 
> 
> How did you change the load address of the uImage, by changing the
> kernel config CONFIG_PHYSICAL_START or the "-a/-e" parameter of mkimage?
> I tried both, but it did not work with or without CONFIG_RANDOMIZE.

With mkimage.  Did you set bootm_low in U-Boot as described above?  Was
CONFIG_RELOCATABLE set in the non-CONFIG_RANDOMIZE kernel?

-Scott




[PATCH RT v3 5/5] rcutorture: Avoid problematic critical section nesting on RT

2019-09-11 Thread Scott Wood
rcutorture was generating some nesting scenarios that are not
reasonable.  Constrain the state selection to avoid them.

Example #1:

1. preempt_disable()
2. local_bh_disable()
3. preempt_enable()
4. local_bh_enable()

On PREEMPT_RT, BH disabling takes a local lock only when called in
non-atomic context.  Thus, atomic context must be retained until after BH
is re-enabled.  Likewise, if BH is initially disabled in non-atomic
context, it cannot be re-enabled in atomic context.

Example #2:

1. rcu_read_lock()
2. local_irq_disable()
3. rcu_read_unlock()
4. local_irq_enable()

If the thread is preempted between steps 1 and 2,
rcu_read_unlock_special.b.blocked will be set, but it won't be
acted on in step 3 because IRQs are disabled.  Thus, reporting of the
quiescent state will be delayed beyond the local_irq_enable().

For now, these scenarios will continue to be tested on non-PREEMPT_RT
kernels, until debug checks are added to ensure that they are not
happening elsewhere.

Signed-off-by: Scott Wood 
---
v3: Limit to RT kernels, and remove one constraint that, while it
is bad on both RT and non-RT (missing a schedule), does not oops or
otherwise prevent using rcutorture.  It would be added once debug checks
are implemented.

 kernel/rcu/rcutorture.c | 96 +
 1 file changed, 82 insertions(+), 14 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index efaa5b3f4d3f..ecb82cc432af 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -60,10 +60,13 @@
 #define RCUTORTURE_RDR_RBH  0x08   /*  ... rcu_read_lock_bh(). */
 #define RCUTORTURE_RDR_SCHED0x10   /*  ... rcu_read_lock_sched(). */
 #define RCUTORTURE_RDR_RCU  0x20   /*  ... entering another RCU reader. */
-#define RCUTORTURE_RDR_NBITS6  /* Number of bits defined above. */
+#define RCUTORTURE_RDR_ATOM_BH  0x40   /*  ... disabling bh while atomic */
+#define RCUTORTURE_RDR_ATOM_RBH 0x80   /*  ... RBH while atomic */
+#define RCUTORTURE_RDR_NBITS8  /* Number of bits defined above. */
 #define RCUTORTURE_MAX_EXTEND   \
(RCUTORTURE_RDR_BH | RCUTORTURE_RDR_IRQ | RCUTORTURE_RDR_PREEMPT | \
-RCUTORTURE_RDR_RBH | RCUTORTURE_RDR_SCHED)
+RCUTORTURE_RDR_RBH | RCUTORTURE_RDR_SCHED | \
+RCUTORTURE_RDR_ATOM_BH | RCUTORTURE_RDR_ATOM_RBH)
 #define RCUTORTURE_RDR_MAX_LOOPS 0x7   /* Maximum reader extensions. */
/* Must be power of two minus one. */
 #define RCUTORTURE_RDR_MAX_SEGS (RCUTORTURE_RDR_MAX_LOOPS + 3)
@@ -1092,31 +1095,52 @@ static void rcutorture_one_extend(int *readstate, int 
newstate,
WARN_ON_ONCE((idxold >> RCUTORTURE_RDR_SHIFT) > 1);
rtrsp->rt_readstate = newstate;
 
-   /* First, put new protection in place to avoid critical-section gap. */
+   /*
+* First, put new protection in place to avoid critical-section gap.
+* Disable preemption around the ATOM disables to ensure that
+* in_atomic() is true.
+*/
if (statesnew & RCUTORTURE_RDR_BH)
local_bh_disable();
+   if (statesnew & RCUTORTURE_RDR_RBH)
+   rcu_read_lock_bh();
if (statesnew & RCUTORTURE_RDR_IRQ)
local_irq_disable();
if (statesnew & RCUTORTURE_RDR_PREEMPT)
preempt_disable();
-   if (statesnew & RCUTORTURE_RDR_RBH)
-   rcu_read_lock_bh();
if (statesnew & RCUTORTURE_RDR_SCHED)
rcu_read_lock_sched();
+   preempt_disable();
+   if (statesnew & RCUTORTURE_RDR_ATOM_BH)
+   local_bh_disable();
+   if (statesnew & RCUTORTURE_RDR_ATOM_RBH)
+   rcu_read_lock_bh();
+   preempt_enable();
if (statesnew & RCUTORTURE_RDR_RCU)
idxnew = cur_ops->readlock() << RCUTORTURE_RDR_SHIFT;
 
-   /* Next, remove old protection, irq first due to bh conflict. */
+   /*
+* Next, remove old protection, in decreasing order of strength
+* to avoid unlock paths that aren't safe in the stronger
+* context.  Disable preemption around the ATOM enables in
+* case the context was only atomic due to IRQ disabling.
+*/
+   preempt_disable();
if (statesold & RCUTORTURE_RDR_IRQ)
local_irq_enable();
-   if (statesold & RCUTORTURE_RDR_BH)
+   if (statesold & RCUTORTURE_RDR_ATOM_BH)
local_bh_enable();
+   if (statesold & RCUTORTURE_RDR_ATOM_RBH)
+   rcu_read_unlock_bh();
+   preempt_enable();
if (statesold & RCUTORTURE_RDR_PREEMPT)
preempt_enable();
-   if (statesold & RCUTORTURE_RDR_RBH)
-   rcu_read_unlock_bh();
if (statesold & RCUTORTURE_RDR_SCHED)
rcu_read_unlock_sched();
+   if (statesold & RCUTORTURE_RDR_BH)
+   local_bh

[PATCH RT v3 4/5] rcu: Disable use_softirq on PREEMPT_RT

2019-09-11 Thread Scott Wood
Besides restoring behavior that used to be default on RT, this avoids
a deadlock on scheduler locks:

[  136.894657] 039: 
[  136.900401] 039: WARNING: possible recursive locking detected
[  136.906146] 039: 5.2.9-rt3.dbg+ #174 Tainted: GE
[  136.912152] 039: 
[  136.917896] 039: rcu_torture_rea/13474 is trying to acquire lock:
[  136.923990] 039: 5f25146d
[  136.927310] 039:  (
[  136.929414] 039: >pi_lock
[  136.932303] 039: ){-...}
[  136.934840] 039: , at: try_to_wake_up+0x39/0x920
[  136.939461] 039:
but task is already holding lock:
[  136.944425] 039: 5f25146d
[  136.947745] 039:  (
[  136.949852] 039: >pi_lock
[  136.952738] 039: ){-...}
[  136.955274] 039: , at: try_to_wake_up+0x39/0x920
[  136.959895] 039:
other info that might help us debug this:
[  136.96] 039:  Possible unsafe locking scenario:

[  136.970608] 039:CPU0
[  136.973493] 039:
[  136.976378] 039:   lock(
[  136.978918] 039: >pi_lock
[  136.981806] 039: );
[  136.983911] 039:   lock(
[  136.986451] 039: >pi_lock
[  136.989336] 039: );
[  136.991444] 039:
 *** DEADLOCK ***

[  136.995194] 039:  May be due to missing lock nesting notation

[  137.001115] 039: 3 locks held by rcu_torture_rea/13474:
[  137.006341] 039:  #0:
[  137.008707] 039: 5f25146d
[  137.012024] 039:  (
[  137.014131] 039: >pi_lock
[  137.017015] 039: ){-...}
[  137.019558] 039: , at: try_to_wake_up+0x39/0x920
[  137.024175] 039:  #1:
[  137.026540] 039: 11c8e51d
[  137.029859] 039:  (
[  137.031966] 039: >lock
[  137.034679] 039: ){-...}
[  137.037217] 039: , at: try_to_wake_up+0x241/0x920
[  137.041924] 039:  #2:
[  137.044291] 039: 098649b9
[  137.047610] 039:  (
[  137.049714] 039: rcu_read_lock
[  137.052774] 039: ){}
[  137.055314] 039: , at: cpuacct_charge+0x33/0x1e0
[  137.059934] 039:
stack backtrace:
[  137.063425] 039: CPU: 39 PID: 13474 Comm: rcu_torture_rea Kdump: loaded 
Tainted: GE 5.2.9-rt3.dbg+ #174
[  137.074197] 039: Hardware name: Intel Corporation S2600BT/S2600BT, BIOS 
SE5C620.86B.01.00.0763.022420181017 02/24/2018
[  137.084886] 039: Call Trace:
[  137.087773] 039:  
[  137.090226] 039:  dump_stack+0x5e/0x8b
[  137.093997] 039:  __lock_acquire+0x725/0x1100
[  137.098358] 039:  lock_acquire+0xc0/0x240
[  137.102374] 039:  ? try_to_wake_up+0x39/0x920
[  137.106737] 039:  _raw_spin_lock_irqsave+0x47/0x90
[  137.111534] 039:  ? try_to_wake_up+0x39/0x920
[  137.115910] 039:  try_to_wake_up+0x39/0x920
[  137.120098] 039:  rcu_read_unlock_special+0x65/0xb0
[  137.124977] 039:  __rcu_read_unlock+0x5d/0x70
[  137.129337] 039:  cpuacct_charge+0xd9/0x1e0
[  137.133522] 039:  ? cpuacct_charge+0x33/0x1e0
[  137.137880] 039:  update_curr+0x14b/0x420
[  137.141894] 039:  enqueue_entity+0x42/0x370
[  137.146080] 039:  enqueue_task_fair+0xa9/0x490
[  137.150528] 039:  activate_task+0x5a/0xf0
[  137.154539] 039:  ttwu_do_activate+0x4e/0x90
[  137.158813] 039:  try_to_wake_up+0x277/0x920
[  137.163086] 039:  irq_exit+0xb6/0xf0
[  137.11] 039:  smp_apic_timer_interrupt+0xe3/0x3a0
[  137.171714] 039:  apic_timer_interrupt+0xf/0x20
[  137.176249] 039:  
[  137.178785] 039: RIP: 0010:__schedule+0x0/0x8e0
[  137.183319] 039: Code: 00 02 48 89 43 20 e8 0f 5a 00 00 48 8d 7b 28 e8 86 f2 
fd ff 31 c0 5b 5d 41 5c c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <55> 48 
89 e5 41 57 41 56 49 c7 c6 c0 ca 1e 00 41 55 41 89 fd 41 54
[  137.202498] 039: RSP: 0018:c9005835fbc0 EFLAGS: 0246
[  137.208158] 039:  ORIG_RAX: ff13
[  137.212428] 039: RAX:  RBX: 8897c3e1bb00 RCX: 
0001
[  137.219994] 039: RDX: 80004008 RSI: 0006 RDI: 
0001
[  137.227560] 039: RBP: 8897c3e1bb00 R08:  R09: 

[  137.235126] 039: R10: 0001 R11: 0001 R12: 
81001fd1
[  137.242694] 039: R13: 0044 R14:  R15: 
c9005835fcac
[  137.250259] 039:  ? ___preempt_schedule+0x16/0x18
[  137.254969] 039:  preempt_schedule_common+0x32/0x80
[  137.259846] 039:  ___preempt_schedule+0x16/0x18
[  137.264379] 039:  rcutorture_one_extend+0x33a/0x510 [rcutorture]
[  137.270397] 039:  rcu_torture_one_read+0x18c/0x450 [rcutorture]
[  137.276334] 039:  rcu_torture_reader+0xac/0x1f0 [rcutorture]
[  137.281998] 039:  ? rcu_torture_reader+0x1f0/0x1f0 [rcutorture]
[  137.287920] 039:  kthread+0x106/0x140
[  137.291591] 039:  ? rcu_torture_one_read+0x450/0x450 [rcutorture]
[  137.297681] 039:  ? kthread_bind+0x10/0x10
[  137.301783] 039:  ret_from_fork+0x3a/0x50

Signed-off-by: Scott Wood 
---
The prohibition on use_softirq should be able to be dropped once RT gets
the latest RCU code, but the question of what use_softirq should default
to on PREEMPT_RT remains.

v3: Use IS_ENABLED
---
 kernel/rcu/tree.c | 9 +++--
 1 file changed, 7 in

[PATCH RT v3 3/5] sched: migrate_dis/enable: Use rt_invol_sleep

2019-09-11 Thread Scott Wood
Without this, rcu_note_context_switch() will complain if an RCU read lock
is held when migrate_enable() calls stop_one_cpu().  Likewise when
migrate_disable() calls pin_current_cpu() which calls __read_rt_lock() --
which bypasses the part of the mutex code that calls rt_invol_sleep_inc().

Signed-off-by: Scott Wood 
---
v3: Add to pin_current_cpu as well

 include/linux/sched.h| 4 ++--
 kernel/cpu.c | 2 ++
 kernel/rcu/tree_plugin.h | 2 +-
 kernel/sched/core.c  | 4 
 4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index edc93b74f7d8..ecf5cbb23335 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -673,7 +673,7 @@ struct task_struct {
int migrate_disable_atomic;
 # endif
 #endif
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
/* Task is blocking due to RT-specific mechanisms, not voluntarily */
int rt_invol_sleep;
 #endif
@@ -1882,7 +1882,7 @@ static __always_inline bool need_resched(void)
return unlikely(tif_need_resched());
 }
 
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
 static inline void rt_invol_sleep_inc(void)
 {
current->rt_invol_sleep++;
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 885a195dfbe0..32c6175b63b6 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -308,7 +308,9 @@ void pin_current_cpu(void)
preempt_lazy_enable();
preempt_enable();
 
+   rt_invol_sleep_inc();
__read_rt_lock(cpuhp_pin);
+   rt_invol_sleep_dec();
 
preempt_disable();
preempt_lazy_disable();
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0da4b975cd71..6d92dafeeca5 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -292,7 +292,7 @@ void rcu_note_context_switch(bool preempt)
barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
-#if defined(CONFIG_PREEMPT_RT_FULL)
+#if defined(CONFIG_PREEMPT_RT_BASE)
rt_invol = t->rt_invol_sleep;
 #endif
WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0 && !rt_invol);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1bdd7f9be05..a151332474d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7405,7 +7405,11 @@ void migrate_enable(void)
unpin_current_cpu();
preempt_lazy_enable();
preempt_enable();
+
+   rt_invol_sleep_inc();
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
+   rt_invol_sleep_dec();
+
return;
}
}
-- 
1.8.3.1



[PATCH RT v3 0/5] RCU fixes

2019-09-11 Thread Scott Wood
With these patches, rcutorture works on PREEMPT_RT_FULL.

Scott Wood (5):
  rcu: Acquire RCU lock when disabling BHs
  sched: Rename sleeping_lock to rt_invol_sleep
  sched: migrate_dis/enable: Use rt_invol_sleep
  rcu: Disable use_softirq on PREEMPT_RT
  rcutorture: Avoid problematic critical section nesting

 include/linux/rcupdate.h   | 40 +++
 include/linux/sched.h  | 19 -
 kernel/cpu.c   |  2 +
 kernel/locking/rtmutex.c   | 14 +++
 kernel/locking/rwlock-rt.c | 16 
 kernel/rcu/rcutorture.c| 96 +++---
 kernel/rcu/tree.c  |  9 -
 kernel/rcu/tree_plugin.h   |  8 ++--
 kernel/sched/core.c|  4 ++
 kernel/softirq.c   | 14 +--
 kernel/time/hrtimer.c  |  4 +-
 11 files changed, 168 insertions(+), 58 deletions(-)

-- 
1.8.3.1



[PATCH RT v3 2/5] sched: Rename sleeping_lock to rt_invol_sleep

2019-09-11 Thread Scott Wood
It's already used for one situation other than acquiring a lock, and the
next patch will add another, so change the name to avoid confusion.

Signed-off-by: Scott Wood 
---
 include/linux/sched.h  | 15 ---
 kernel/locking/rtmutex.c   | 14 +++---
 kernel/locking/rwlock-rt.c | 16 
 kernel/rcu/tree_plugin.h   |  6 +++---
 kernel/softirq.c   |  2 +-
 kernel/time/hrtimer.c  |  4 ++--
 6 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e892e727f12..edc93b74f7d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -674,7 +674,8 @@ struct task_struct {
 # endif
 #endif
 #ifdef CONFIG_PREEMPT_RT_FULL
-   int sleeping_lock;
+   /* Task is blocking due to RT-specific mechanisms, not voluntarily */
+   int rt_invol_sleep;
 #endif
 
 #ifdef CONFIG_PREEMPT_RCU
@@ -1882,20 +1883,20 @@ static __always_inline bool need_resched(void)
 }
 
 #ifdef CONFIG_PREEMPT_RT_FULL
-static inline void sleeping_lock_inc(void)
+static inline void rt_invol_sleep_inc(void)
 {
-   current->sleeping_lock++;
+   current->rt_invol_sleep++;
 }
 
-static inline void sleeping_lock_dec(void)
+static inline void rt_invol_sleep_dec(void)
 {
-   current->sleeping_lock--;
+   current->rt_invol_sleep--;
 }
 
 #else
 
-static inline void sleeping_lock_inc(void) { }
-static inline void sleeping_lock_dec(void) { }
+static inline void rt_invol_sleep_inc(void) { }
+static inline void rt_invol_sleep_dec(void) { }
 #endif
 
 /*
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 5ccbb45131e5..d7100586c597 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1135,7 +1135,7 @@ void __sched rt_spin_lock_slowunlock(struct rt_mutex 
*lock)
 
 void __lockfunc rt_spin_lock(spinlock_t *lock)
 {
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock);
@@ -1150,7 +1150,7 @@ void __lockfunc __rt_spin_lock(struct rt_mutex *lock)
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass)
 {
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock);
@@ -1164,7 +1164,7 @@ void __lockfunc rt_spin_unlock(spinlock_t *lock)
spin_release(&lock->dep_map, 1, _RET_IP_);
rt_spin_lock_fastunlock(&lock->lock, rt_spin_lock_slowunlock);
migrate_enable();
-   sleeping_lock_dec();
+   rt_invol_sleep_dec();
 }
 EXPORT_SYMBOL(rt_spin_unlock);
 
@@ -1190,14 +1190,14 @@ int __lockfunc rt_spin_trylock(spinlock_t *lock)
 {
int ret;
 
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
ret = __rt_mutex_trylock(&lock->lock);
if (ret) {
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
} else {
migrate_enable();
-   sleeping_lock_dec();
+   rt_invol_sleep_dec();
}
return ret;
 }
@@ -1210,7 +1210,7 @@ int __lockfunc rt_spin_trylock_bh(spinlock_t *lock)
local_bh_disable();
ret = __rt_mutex_trylock(&lock->lock);
if (ret) {
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
} else
@@ -1226,7 +1226,7 @@ int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, 
unsigned long *flags)
*flags = 0;
ret = __rt_mutex_trylock(&lock->lock);
if (ret) {
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
}
diff --git a/kernel/locking/rwlock-rt.c b/kernel/locking/rwlock-rt.c
index c3b91205161c..de025a7cc9c4 100644
--- a/kernel/locking/rwlock-rt.c
+++ b/kernel/locking/rwlock-rt.c
@@ -305,14 +305,14 @@ int __lockfunc rt_read_trylock(rwlock_t *rwlock)
 {
int ret;
 
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
ret = do_read_rt_trylock(rwlock);
if (ret) {
rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_);
} else {
migrate_enable();
-   sleeping_lock_dec();
+   rt_invol_sleep_dec();
}
return ret;
 }
@@ -322,14 +322,14 @@ int __lockfunc rt_write_trylock(rwlock_t *rwlock)
 {
int ret;
 
-   sleeping_lock_inc();
+   rt_invol_sleep_inc();
migrate_disable();
ret = do_write_rt_trylock(rwlock);
if (ret) {
rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_);
} else {
 

[PATCH RT v3 1/5] rcu: Acquire RCU lock when disabling BHs

2019-09-11 Thread Scott Wood
A plain local_bh_disable() is documented as creating an RCU critical
section, and (at least) rcutorture expects this to be the case.  However,
in_softirq() doesn't block a grace period on PREEMPT_RT, since RCU checks
preempt_count() directly.  Even if RCU were changed to check
in_softirq(), that wouldn't allow blocked BH disablers to be boosted.

Fix this by calling rcu_read_lock() from local_bh_disable(), and update
rcu_read_lock_bh_held() accordingly.

Signed-off-by: Scott Wood 
---
v3: Remove change to rcu_read_lock_bh_held(), and move debug portions
of rcu_read_[un]lock_bh() to separate functions
---
 include/linux/rcupdate.h | 40 
 kernel/softirq.c | 12 +---
 2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 388ace315f32..9ce7c5006a5e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -600,6 +600,36 @@ static inline void rcu_read_unlock(void)
rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
 }
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * On RT, local_bh_disable() calls rcu_read_lock() -- no need to
+ * track it twice.
+ */
+static inline void rcu_bh_lock_acquire(void)
+{
+}
+
+static inline void rcu_bh_lock_release(void)
+{
+}
+#else
+static inline void rcu_bh_lock_acquire(void)
+{
+   __acquire(RCU_BH);
+   rcu_lock_acquire(&rcu_bh_lock_map);
+   RCU_LOCKDEP_WARN(!rcu_is_watching(),
+"rcu_read_lock_bh() used illegally while idle");
+}
+
+static inline void rcu_bh_lock_release(void)
+{
+   RCU_LOCKDEP_WARN(!rcu_is_watching(),
+"rcu_read_unlock_bh() used illegally while idle");
+   rcu_lock_release(&rcu_bh_lock_map);
+   __release(RCU_BH);
+}
+#endif
+
 /**
  * rcu_read_lock_bh() - mark the beginning of an RCU-bh critical section
  *
@@ -615,10 +645,7 @@ static inline void rcu_read_unlock(void)
 static inline void rcu_read_lock_bh(void)
 {
local_bh_disable();
-   __acquire(RCU_BH);
-   rcu_lock_acquire(&rcu_bh_lock_map);
-   RCU_LOCKDEP_WARN(!rcu_is_watching(),
-"rcu_read_lock_bh() used illegally while idle");
+   rcu_bh_lock_acquire();
 }
 
 /*
@@ -628,10 +655,7 @@ static inline void rcu_read_lock_bh(void)
  */
 static inline void rcu_read_unlock_bh(void)
 {
-   RCU_LOCKDEP_WARN(!rcu_is_watching(),
-"rcu_read_unlock_bh() used illegally while idle");
-   rcu_lock_release(&rcu_bh_lock_map);
-   __release(RCU_BH);
+   rcu_bh_lock_release();
local_bh_enable();
 }
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d16d080a74f7..6080c9328df1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -115,8 +115,10 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int 
cnt)
long soft_cnt;
 
WARN_ON_ONCE(in_irq());
-   if (!in_atomic())
+   if (!in_atomic()) {
local_lock(bh_lock);
+   rcu_read_lock();
+   }
soft_cnt = this_cpu_inc_return(softirq_counter);
WARN_ON_ONCE(soft_cnt == 0);
current->softirq_count += SOFTIRQ_DISABLE_OFFSET;
@@ -151,8 +153,10 @@ void _local_bh_enable(void)
 #endif
 
current->softirq_count -= SOFTIRQ_DISABLE_OFFSET;
-   if (!in_atomic())
+   if (!in_atomic()) {
+   rcu_read_unlock();
local_unlock(bh_lock);
+   }
 }
 
 void _local_bh_enable_rt(void)
@@ -185,8 +189,10 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int 
cnt)
WARN_ON_ONCE(count < 0);
local_irq_enable();
 
-   if (!in_atomic())
+   if (!in_atomic()) {
+   rcu_read_unlock();
local_unlock(bh_lock);
+   }
 
current->softirq_count -= SOFTIRQ_DISABLE_OFFSET;
preempt_check_resched();
-- 
1.8.3.1



Re: [PATCH v6 06/12] powerpc/fsl_booke/32: implement KASLR infrastructure

2019-08-28 Thread Scott Wood
On Wed, 2019-08-28 at 19:03 +0800, Jason Yan wrote:
> 
> On 2019/8/28 12:54, Scott Wood wrote:
> > On Fri, Aug 09, 2019 at 06:07:54PM +0800, Jason Yan wrote:
> > > +/*
> > > + * To see if we need to relocate the kernel to a random offset
> > > + * void *dt_ptr - address of the device tree
> > > + * phys_addr_t size - size of the first memory block
> > > + */
> > > +notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)
> > > +{
> > > + unsigned long tlb_virt;
> > > + phys_addr_t tlb_phys;
> > > + unsigned long offset;
> > > + unsigned long kernel_sz;
> > > +
> > > + kernel_sz = (unsigned long)_end - KERNELBASE;
> > 
> > Why KERNELBASE and not kernstart_addr?
> > 
> 
> Did you mean kernstart_virt_addr? It should be kernstart_virt_addr.

Yes, kernstart_virt_addr.  KERNELBASE will be incorrect if the kernel was
loaded at a nonzero physical address without CONFIG_PHYSICAL_START being
adjusted to match.
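
For illustration, the start of kaslr_early_init() would then look roughly
like this (an untested sketch based on the v6 code quoted above, nothing
more):

notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)
{
	unsigned long offset;
	unsigned long kernel_sz;

	/*
	 * Size the image from the runtime virtual base rather than the
	 * compile-time KERNELBASE, so a kernel loaded at a nonzero
	 * physical address still computes its own size correctly.
	 */
	kernel_sz = (unsigned long)_end - kernstart_virt_addr;

	offset = kaslr_choose_location(dt_ptr, size, kernel_sz);
	if (offset == 0)
		return;

	kernstart_virt_addr += offset;
	kernstart_addr += offset;

	/* ... the rest (second TLB entry, is_second_reloc) as in v6 ... */
}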

-Scott




Re: [PATCH v6 00/12] implement KASLR for powerpc/fsl_booke/32

2019-08-27 Thread Scott Wood
On Tue, 2019-08-27 at 11:33 +1000, Michael Ellerman wrote:
> Jason Yan  writes:
> > A polite ping :)
> > 
> > What else should I do now?
> 
> That's a good question.
> 
> Scott, are you still maintaining FSL bits, 

Sort of... now that it's become very low volume, it's easy to forget when
something does show up (or miss it if I'm not CCed).  It'd probably help if I
were to just ack patches instead of thinking "I'll do a pull request for this
later" when it's just one or two patches per cycle.

-Scott




Re: [PATCH v6 00/12] implement KASLR for powerpc/fsl_booke/32

2019-08-27 Thread Scott Wood
On Tue, 2019-08-27 at 23:05 -0500, Scott Wood wrote:
> On Fri, 2019-08-09 at 18:07 +0800, Jason Yan wrote:
> >  Freescale Book-E
> > parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
> > entries are not suitable to map the kernel directly in a randomized
> > region, so we chose to copy the kernel to a proper place and restart to
> > relocate.
> > 
> > Entropy is derived from the banner and timer base, which will change every
> build and boot. This is not particularly safe, so additionally the bootloader
> may pass entropy via the /chosen/kaslr-seed node in the device tree.
> 
> How complicated would it be to directly access the HW RNG (if present) that
> early in the boot?  It'd be nice if a U-Boot update weren't required (and
> particularly concerning that KASLR would appear to work without a U-Boot
> update, but without decent entropy).

OK, I see that kaslr-seed is used on some other platforms, though arm64 aborts
KASLR if it doesn't get a seed.  I'm not sure if that's better than a loud
warning message (or if it was a conscious choice rather than just not having
an alternative implemented), but silently using poor entropy for something
like this seems bad.

-Scott




Re: [PATCH v6 06/12] powerpc/fsl_booke/32: implement KASLR infrastructure

2019-08-27 Thread Scott Wood
On Fri, Aug 09, 2019 at 06:07:54PM +0800, Jason Yan wrote:
> This patch add support to boot kernel from places other than KERNELBASE.
> Since CONFIG_RELOCATABLE is already supported, what we need to do is
> map or copy the kernel to a proper place and relocate. Freescale Book-E
> parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
> entries are not suitable to map the kernel directly in a randomized
> region, so we chose to copy the kernel to a proper place and restart to
> relocate.
> 
> The offset of the kernel was not randomized yet (a fixed 64M is set). We
> will randomize it in the next patch.
> 
> Signed-off-by: Jason Yan 
> Cc: Diana Craciun 
> Cc: Michael Ellerman 
> Cc: Christophe Leroy 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Nicholas Piggin 
> Cc: Kees Cook 
> Tested-by: Diana Craciun 
> Reviewed-by: Christophe Leroy 
> ---
>  arch/powerpc/Kconfig  | 11 
>  arch/powerpc/kernel/Makefile  |  1 +
>  arch/powerpc/kernel/early_32.c|  2 +-
>  arch/powerpc/kernel/fsl_booke_entry_mapping.S | 17 +++--
>  arch/powerpc/kernel/head_fsl_booke.S  | 13 +++-
>  arch/powerpc/kernel/kaslr_booke.c | 62 +++
>  arch/powerpc/mm/mmu_decl.h|  7 +++
>  arch/powerpc/mm/nohash/fsl_booke.c|  7 ++-
>  8 files changed, 105 insertions(+), 15 deletions(-)
>  create mode 100644 arch/powerpc/kernel/kaslr_booke.c
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 77f6ebf97113..710c12ef7159 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -548,6 +548,17 @@ config RELOCATABLE
> setting can still be useful to bootwrappers that need to know the
> load address of the kernel (eg. u-boot/mkimage).
>  
> +config RANDOMIZE_BASE
> + bool "Randomize the address of the kernel image"
> + depends on (FSL_BOOKE && FLATMEM && PPC32)
> + depends on RELOCATABLE
> + help
> +   Randomizes the virtual address at which the kernel image is
> +   loaded, as a security feature that deters exploit attempts
> +   relying on knowledge of the location of kernel internals.
> +
> +   If unsure, say N.
> +

Why is N the safe default (other than concerns about code maturity,
though arm64 and mips don't seem to have updated this recommendation
after several years)?  On x86 this defaults to Y.

> diff --git a/arch/powerpc/kernel/fsl_booke_entry_mapping.S 
> b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
> index f4d3eaae54a9..641920d4f694 100644
> --- a/arch/powerpc/kernel/fsl_booke_entry_mapping.S
> +++ b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
> @@ -155,23 +155,22 @@ skpinv: addi r6,r6,1 /* Increment */
>  
>  #if defined(ENTRY_MAPPING_BOOT_SETUP)
>  
> -/* 6. Setup KERNELBASE mapping in TLB1[0] */
> +/* 6. Setup kernstart_virt_addr mapping in TLB1[0] */
>   lis r6,0x1000   /* Set MAS0(TLBSEL) = TLB1(1), ESEL = 0 
> */
>   mtspr   SPRN_MAS0,r6
>   lis r6,(MAS1_VALID|MAS1_IPROT)@h
>   ori r6,r6,(MAS1_TSIZE(BOOK3E_PAGESZ_64M))@l
>   mtspr   SPRN_MAS1,r6
> - lis r6,MAS2_VAL(PAGE_OFFSET, BOOK3E_PAGESZ_64M, MAS2_M_IF_NEEDED)@h
> - ori r6,r6,MAS2_VAL(PAGE_OFFSET, BOOK3E_PAGESZ_64M, 
> MAS2_M_IF_NEEDED)@l
> - mtspr   SPRN_MAS2,r6
> + lis r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@h
> + ori r6,r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@l
> + and r6,r6,r20
> + ori r6,r6,MAS2_M_IF_NEEDED@l
> + mtspr   SPRN_MAS2,r6

Please use tabs rather than spaces between the mnemonic and the
arguments.

It looks like that was the last user of MAS2_VAL so let's remove it.

> diff --git a/arch/powerpc/kernel/kaslr_booke.c 
> b/arch/powerpc/kernel/kaslr_booke.c
> new file mode 100644
> index ..f8dc60534ac1
> --- /dev/null
> +++ b/arch/powerpc/kernel/kaslr_booke.c

Shouldn't this go under arch/powerpc/mm/nohash?

> +/*
> + * To see if we need to relocate the kernel to a random offset
> + * void *dt_ptr - address of the device tree
> + * phys_addr_t size - size of the first memory block
> + */
> +notrace void __init kaslr_early_init(void *dt_ptr, phys_addr_t size)
> +{
> + unsigned long tlb_virt;
> + phys_addr_t tlb_phys;
> + unsigned long offset;
> + unsigned long kernel_sz;
> +
> + kernel_sz = (unsigned long)_end - KERNELBASE;

Why KERNELBASE and not kernstart_addr?

> +
> + offset = kaslr_choose_location(dt_ptr, size, kernel_sz);
> +
> + if (offset == 0)
> + return;
> +
> + kernstart_virt_addr += offset;
> + kernstart_addr += offset;
> +
> + is_second_reloc = 1;
> +
> + if (offset >= SZ_64M) {
> + tlb_virt = round_down(kernstart_virt_addr, SZ_64M);
> + tlb_phys = round_down(kernstart_addr, SZ_64M);

If kernstart_addr wasn't 64M-aligned before adding offset, then "offset
>= SZ_64M" is not necessarily going to detect when you've moved outside
the region covered by the initial 64M TLB entry.

Re: [PATCH v6 00/12] implement KASLR for powerpc/fsl_booke/32

2019-08-27 Thread Scott Wood
On Fri, 2019-08-09 at 18:07 +0800, Jason Yan wrote:
> This series implements KASLR for powerpc/fsl_booke/32, as a security
> feature that deters exploit attempts relying on knowledge of the location
> of kernel internals.
> 
> Since CONFIG_RELOCATABLE is already supported, what we need to do is
> map or copy the kernel to a proper place and relocate.

Have you tested this with a kernel that was loaded at a non-zero address?  I
tried loading a kernel at 0x0400 (by changing the address in the uImage,
and setting bootm_low to 0400 in U-Boot), and it works without
CONFIG_RANDOMIZE and fails with.

>  Freescale Book-E
> parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
> entries are not suitable to map the kernel directly in a randomized
> region, so we chose to copy the kernel to a proper place and restart to
> relocate.
> 
> Entropy is derived from the banner and timer base, which will change every
> build and boot. This is not particularly safe, so additionally the bootloader
> may pass entropy via the /chosen/kaslr-seed node in the device tree.

How complicated would it be to directly access the HW RNG (if present) that
early in the boot?  It'd be nice if a U-Boot update weren't required (and
particularly concerning that KASLR would appear to work without a U-Boot
update, but without decent entropy).

-Scott




Re: [PATCH v6 04/12] powerpc/fsl_booke/32: introduce create_tlb_entry() helper

2019-08-27 Thread Scott Wood
On Fri, Aug 09, 2019 at 06:07:52PM +0800, Jason Yan wrote:
> Add a new helper create_tlb_entry() to create a tlb entry by the virtual
> and physical address. This is a preparation to support boot kernel at a
> randomized address.
> 
> Signed-off-by: Jason Yan 
> Cc: Diana Craciun 
> Cc: Michael Ellerman 
> Cc: Christophe Leroy 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Nicholas Piggin 
> Cc: Kees Cook 
> Reviewed-by: Christophe Leroy 
> Reviewed-by: Diana Craciun 
> Tested-by: Diana Craciun 
> ---
>  arch/powerpc/kernel/head_fsl_booke.S | 29 
>  arch/powerpc/mm/mmu_decl.h   |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
> b/arch/powerpc/kernel/head_fsl_booke.S
> index adf0505dbe02..04d124fee17d 100644
> --- a/arch/powerpc/kernel/head_fsl_booke.S
> +++ b/arch/powerpc/kernel/head_fsl_booke.S
> @@ -1114,6 +1114,35 @@ __secondary_hold_acknowledge:
>   .long   -1
>  #endif
>  
> +/*
> + * Create a 64M tlb by address and entry
> + * r3/r4 - physical address
> + * r5 - virtual address
> + * r6 - entry
> + */
> +_GLOBAL(create_tlb_entry)

This function is broadly named but contains various assumptions about the
entry being created.  I'd just call it create_kaslr_tlb_entry.

> + lis r7,0x1000   /* Set MAS0(TLBSEL) = 1 */
> + rlwimi  r7,r6,16,4,15   /* Setup MAS0 = TLBSEL | ESEL(r6) */
> + mtspr   SPRN_MAS0,r7/* Write MAS0 */
> +
> + lis r6,(MAS1_VALID|MAS1_IPROT)@h
> + ori r6,r6,(MAS1_TSIZE(BOOK3E_PAGESZ_64M))@l
> + mtspr   SPRN_MAS1,r6/* Write MAS1 */
> +
> + lis r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@h
> + ori r6,r6,MAS2_EPN_MASK(BOOK3E_PAGESZ_64M)@l
> + and r6,r6,r5
> + ori r6,r6,MAS2_M@l
> + mtspr   SPRN_MAS2,r6/* Write MAS2(EPN) */
> +
> + ori r8,r4,(MAS3_SW|MAS3_SR|MAS3_SX)
> + mtspr   SPRN_MAS3,r8/* Write MAS3(RPN) */
> +
> + tlbwe   /* Write TLB */
> + isync
> + sync
> + blr

Should set MAS7 under MMU_FTR_BIG_PHYS (or CONFIG_PHYS_64BIT if it's
too early for features) -- even if relocatable kernels over 4GiB aren't
supported (I don't remember if they work or not), MAS7 might be non-zero
on entry.  And the function claims to take a 64-bit phys addr as input...

MAS2_M should be MAS2_M_IF_NEEDED to match other kmem tlb entries.

-Scott


Re: [PATCH RT v2 1/3] rcu: Acquire RCU lock when disabling BHs

2019-08-26 Thread Scott Wood
On Mon, 2019-08-26 at 17:59 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-08-23 14:46:39 [-0500], Scott Wood wrote:
> > > > Before consolidation, RT mapped rcu_read_lock_bh_held() to
> > > > rcu_read_lock_bh() and called rcu_read_lock() from
> > > > rcu_read_lock_bh().  This
> > > > somehow got lost when rebasing on top of 5.0.
> > > 
> > > so now rcu_read_lock_bh_held() is untouched and in_softirq() reports
> > > 1.
> > > So the problem is that we never hold RCU but report 1 like we do?
> > 
> > Yes.
> 
> I understand the part where "rcu_read_lock() becomes part of
> local_bh_disable()". But why do you modify rcu_read_lock_bh_held() and
> rcu_read_lock_bh()? Couldn't they remain as-is?

Yes, it looks like they can.  I recall having problems with
rcu_read_lock_bh_held() in an earlier version which is what prompted the
change, but looking back I don't see what the problem was.

-Scott




Re: [PATCH RT v2 2/3] sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep

2019-08-26 Thread Scott Wood
On Mon, 2019-08-26 at 09:29 -0700, Paul E. McKenney wrote:
> On Mon, Aug 26, 2019 at 05:25:23PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-08-23 23:10:14 [-0400], Joel Fernandes wrote:
> > > On Fri, Aug 23, 2019 at 02:28:46PM -0500, Scott Wood wrote:
> > > > On Fri, 2019-08-23 at 18:20 +0200, Sebastian Andrzej Siewior wrote:
> > > > > this looks like an ugly hack. This sleeping_lock_inc() is used
> > > > > where we
> > > > > actually hold a sleeping lock and schedule() which is okay. But
> > > > > this
> > > > > would mean we hold a RCU lock and schedule() anyway. Is that okay?
> > > > 
> > > > Perhaps the name should be changed, but the concept is the same --
> > > > RT-
> > > > specific sleeping which should be considered involuntary for the
> > > > purpose of
> > > > debug checks.  Voluntary sleeping is not allowed in an RCU critical
> > > > section
> > > > because it will break the critical section on certain flavors of
> > > > RCU, but
> > > > that doesn't apply to the flavor used on RT.  Sleeping for a long
> > > > time in an
> > > > RCU critical section would also be a bad thing, but that also
> > > > doesn't apply
> > > > here.
> > > 
> > > I think the name should definitely be changed. At best, it is super
> > > confusing to
> > > call it "sleeping_lock" for this scenario. In fact here, you are not
> > > even
> > > blocking on a lock.
> > > 
> > > Maybe "sleeping_allowed" or some such.
> > 
> > The mechanism that is used here may change in future. I just wanted to
> > make sure that from RCU's side it is okay to schedule here.
> 
> Good point.
> 
> The effect from RCU's viewpoint will be to split any non-rcu_read_lock()
> RCU read-side critical section at this point.  This alrady happens in a
> few places, for example, rcu_note_context_switch() constitutes an RCU
> quiescent state despite being invoked with interrupts disabled (as is
> required!).  The __schedule() function just needs to understand (and does
> understand) that the RCU read-side critical section that would otherwise
> span that call to rcu_node_context_switch() is split in two by that call.
> 
> However, if this was instead an rcu_read_lock() critical section within
> a PREEMPT=y kernel, then if a schedule() occured within stop_one_task(),
> RCU would consider that critical section to be preempted.  This means
> that any RCU grace period that is blocked by this RCU read-side critical
> section would remain blocked until stop_one_cpu() resumed, returned,
> and so on until the matching rcu_read_unlock() was reached.  In other
> words, RCU would consider that RCU read-side critical section to span
> the call to stop_one_cpu() even if stop_one_cpu() invoked schedule().
> 
> On the other hand, within a PREEMPT=n kernel, the call to schedule()
> would split even an rcu_read_lock() critical section.  Which is why I
> asked earlier if sleeping_lock_inc() and sleeping_lock_dec() are no-ops
> in !PREEMPT_RT_BASE kernels.  We would after all want the usual lockdep
> complaints in that case.

migrate_enable() is PREEMPT_RT_BASE-specific -- this code won't execute at
all with PREEMPT=n.

-Scott




Re: [PATCH RT v2 1/3] rcu: Acquire RCU lock when disabling BHs

2019-08-23 Thread Scott Wood
On Fri, 2019-08-23 at 18:17 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-08-22 22:23:23 [-0500], Scott Wood wrote:
> > On Thu, 2019-08-22 at 09:39 -0400, Joel Fernandes wrote:
> > > On Wed, Aug 21, 2019 at 04:33:58PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Aug 21, 2019 at 06:19:04PM -0500, Scott Wood wrote:
> > > > > Signed-off-by: Scott Wood 
> > > > > ---
> > > > > Another question is whether non-raw spinlocks are intended to
> > > > > create
> > > > > an
> > > > > RCU read-side critical section due to implicit preempt disable.
> > > > 
> > > > Hmmm...  Did non-raw spinlocks act like rcu_read_lock_sched()
> > > > and rcu_read_unlock_sched() pairs in -rt prior to the RCU flavor
> > > > consolidation?  If not, I don't see why they should do so after that
> > > > consolidation in -rt.
> > > 
> > > Maybe I am missing something, but I didn't see the connection between
> > > consolidation and this patch. AFAICS, this patch is so that
> > > rcu_read_lock_bh_held() works at all on -rt. Did I badly miss
> > > something?
> > 
> > Before consolidation, RT mapped rcu_read_lock_bh_held() to
> > rcu_read_lock_bh() and called rcu_read_lock() from
> > rcu_read_lock_bh().  This
> > somehow got lost when rebasing on top of 5.0.
> 
> so now rcu_read_lock_bh_held() is untouched and in_softirq() reports 1.
> So the problem is that we never hold RCU but report 1 like we do?

Yes.

-Scott




Re: [PATCH RT v2 2/3] sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep

2019-08-23 Thread Scott Wood
On Fri, 2019-08-23 at 18:20 +0200, Sebastian Andrzej Siewior wrote:
> On 2019-08-21 18:19:05 [-0500], Scott Wood wrote:
> > Without this, rcu_note_context_switch() will complain if an RCU read
> > lock is held when migrate_enable() calls stop_one_cpu().
> > 
> > Signed-off-by: Scott Wood 
> > ---
> > v2: Added comment.
> > 
> > If my migrate disable changes aren't taken, then pin_current_cpu()
> > will also need to use sleeping_lock_inc() because calling
> > __read_rt_lock() bypasses the usual place it's done.
> > 
> >  include/linux/sched.h| 4 ++--
> >  kernel/rcu/tree_plugin.h | 2 +-
> >  kernel/sched/core.c  | 8 
> >  3 files changed, 11 insertions(+), 3 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7405,7 +7405,15 @@ void migrate_enable(void)
> > unpin_current_cpu();
> > preempt_lazy_enable();
> > preempt_enable();
> > +
> > +   /*
> > +* sleeping_lock_inc suppresses a debug check for
> > +* sleeping inside an RCU read side critical section
> > +*/
> > +   sleeping_lock_inc();
> > stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
> > +   sleeping_lock_dec();
> 
> this looks like an ugly hack. This sleeping_lock_inc() is used where we
> actually hold a sleeping lock and schedule() which is okay. But this
> would mean we hold a RCU lock and schedule() anyway. Is that okay?

Perhaps the name should be changed, but the concept is the same -- RT-
specific sleeping which should be considered involuntary for the purpose of
debug checks.  Voluntary sleeping is not allowed in an RCU critical section
because it will break the critical section on certain flavors of RCU, but
that doesn't apply to the flavor used on RT.  Sleeping for a long time in an
RCU critical section would also be a bad thing, but that also doesn't apply
here.
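
To make that concrete, the check being suppressed is the one in
rcu_note_context_switch(), which the tree_plugin.h change in this series
turns into roughly the following (simplified sketch, not the full function):

void rcu_note_context_switch(bool preempt)
{
	struct task_struct *t = current;
	int sleeping_l = 0;

#if defined(CONFIG_PREEMPT_RT_BASE)
	sleeping_l = t->sleeping_lock;
#endif
	/*
	 * Complain only about a voluntary schedule() inside an RCU
	 * read-side critical section; a nonzero sleeping_lock marks
	 * RT-specific involuntary sleeps such as this one.
	 */
	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0 && !sleeping_l);

	/* ... the rest of the function is unchanged ... */
}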

-Scott




Re: [PATCH RT v2 1/3] rcu: Acquire RCU lock when disabling BHs

2019-08-22 Thread Scott Wood
On Thu, 2019-08-22 at 09:39 -0400, Joel Fernandes wrote:
> On Wed, Aug 21, 2019 at 04:33:58PM -0700, Paul E. McKenney wrote:
> > On Wed, Aug 21, 2019 at 06:19:04PM -0500, Scott Wood wrote:
> > > Signed-off-by: Scott Wood 
> > > ---
> > > Another question is whether non-raw spinlocks are intended to create
> > > an
> > > RCU read-side critical section due to implicit preempt disable.
> > 
> > Hmmm...  Did non-raw spinlocks act like rcu_read_lock_sched()
> > and rcu_read_unlock_sched() pairs in -rt prior to the RCU flavor
> > consolidation?  If not, I don't see why they should do so after that
> > consolidation in -rt.
> 
> Maybe I am missing something, but I didn't see the connection between
> consolidation and this patch. AFAICS, this patch is so that
> rcu_read_lock_bh_held() works at all on -rt. Did I badly miss something?

Before consolidation, RT mapped rcu_read_lock_bh_held() to
rcu_read_lock_bh() and called rcu_read_lock() from rcu_read_lock_bh().  This
somehow got lost when rebasing on top of 5.0.

> > >  include/linux/rcupdate.h |  4 
> > >  kernel/rcu/update.c  |  4 
> > >  kernel/softirq.c | 12 +---
> > >  3 files changed, 17 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index 388ace315f32..d6e357378732 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -615,10 +615,12 @@ static inline void rcu_read_unlock(void)
> > >  static inline void rcu_read_lock_bh(void)
> > >  {
> > >   local_bh_disable();
> > > +#ifndef CONFIG_PREEMPT_RT_FULL
> > >   __acquire(RCU_BH);
> > >   rcu_lock_acquire(&rcu_bh_lock_map);
> > >   RCU_LOCKDEP_WARN(!rcu_is_watching(),
> > >"rcu_read_lock_bh() used illegally while idle");
> > > +#endif
> > 
> > Any chance of this using "if (!IS_ENABLED(CONFIG_PREEMPT_RT_FULL))"?
> > We should be OK providing a do-nothing __maybe_unused rcu_bh_lock_map
> > for lockdep-enabled -rt kernels, right?
> 
> Since this function is small, I prefer if -rt defines their own
> rcu_read_lock_bh() which just does the local_bh_disable(). That would be
> way
> cleaner IMO. IIRC, -rt does similar things for spinlocks, but it has been
> sometime since I look at the -rt patchset.

I'll do it whichever way you all decide, though I'm not sure I agree about
it being cleaner (especially while RT is still out-of-tree and a change to
the non-RT version that fails to trigger a merge conflict is a concern).

What about moving everything but the local_bh_disable into a separate
function called from rcu_read_lock_bh, and making that a no-op on RT?
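
Concretely, something like this (a rough sketch only -- the helper names
are placeholders):

#ifdef CONFIG_PREEMPT_RT_FULL
/*
 * On RT, local_bh_disable() already enters an RCU read-side critical
 * section, so there is nothing extra to annotate here.
 */
static inline void rcu_bh_lock_acquire(void) { }
static inline void rcu_bh_lock_release(void) { }
#else
static inline void rcu_bh_lock_acquire(void)
{
	__acquire(RCU_BH);
	rcu_lock_acquire(&rcu_bh_lock_map);
	RCU_LOCKDEP_WARN(!rcu_is_watching(),
			 "rcu_read_lock_bh() used illegally while idle");
}

static inline void rcu_bh_lock_release(void)
{
	RCU_LOCKDEP_WARN(!rcu_is_watching(),
			 "rcu_read_unlock_bh() used illegally while idle");
	rcu_lock_release(&rcu_bh_lock_map);
	__release(RCU_BH);
}
#endif

static inline void rcu_read_lock_bh(void)
{
	local_bh_disable();
	rcu_bh_lock_acquire();
}

static inline void rcu_read_unlock_bh(void)
{
	rcu_bh_lock_release();
	local_bh_enable();
}
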

> > >  
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 016c66a98292..a9cdf3d562bc 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -296,7 +296,11 @@ int rcu_read_lock_bh_held(void)
> > >   return 0;
> > >   if (!rcu_lockdep_current_cpu_online())
> > >   return 0;
> > > +#ifdef CONFIG_PREEMPT_RT_FULL
> > > + return lock_is_held(&rcu_lock_map) || irqs_disabled();
> > > +#else
> > >   return in_softirq() || irqs_disabled();
> > > +#endif
> > 
> > And globally.
> 
> And could be untangled a bit as well:
> 
> if (irqs_disabled())
>   return 1;
> 
> if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL))
>   return lock_is_held(&rcu_lock_map);
> 
> return in_softirq();

OK.

-Scott




Re: [PATCH RT v2 1/3] rcu: Acquire RCU lock when disabling BHs

2019-08-22 Thread Scott Wood
On Wed, 2019-08-21 at 16:33 -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 06:19:04PM -0500, Scott Wood wrote:
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index 388ace315f32..d6e357378732 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -615,10 +615,12 @@ static inline void rcu_read_unlock(void)
> >  static inline void rcu_read_lock_bh(void)
> >  {
> > local_bh_disable();
> > +#ifndef CONFIG_PREEMPT_RT_FULL
> > __acquire(RCU_BH);
> > rcu_lock_acquire(&rcu_bh_lock_map);
> > RCU_LOCKDEP_WARN(!rcu_is_watching(),
> >  "rcu_read_lock_bh() used illegally while idle");
> > +#endif
> 
> Any chance of this using "if (!IS_ENABLED(CONFIG_PREEMPT_RT_FULL))"?
> We should be OK providing a do-nothing __maybe_unused rcu_bh_lock_map
> for lockdep-enabled -rt kernels, right?

OK.

> > @@ -185,8 +189,10 @@ void __local_bh_enable_ip(unsigned long ip,
> > > > unsigned int cnt)
> > WARN_ON_ONCE(count < 0);
> > local_irq_enable();
> >  
> > -   if (!in_atomic())
> > +   if (!in_atomic()) {
> > +   rcu_read_unlock();
> > local_unlock(bh_lock);
> > +   }
> 
> The return from in_atomic() is guaranteed to be the same at
> local_bh_enable() time as was at the call to the corresponding
> local_bh_disable()?

That's an existing requirement on RT (which rcutorture currently violates)
due to bh_lock.

> I could have sworn that I ran afoul of this last year.  Might these
> added rcu_read_lock() and rcu_read_unlock() calls need to check for
> CONFIG_PREEMPT_RT_FULL?

This code is already under a PREEMPT_RT_FULL ifdef.

-Scott




Re: [PATCH RT v2 2/3] sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep

2019-08-22 Thread Scott Wood
On Wed, 2019-08-21 at 16:35 -0700, Paul E. McKenney wrote:
> On Wed, Aug 21, 2019 at 06:19:05PM -0500, Scott Wood wrote:
> > Without this, rcu_note_context_switch() will complain if an RCU read
> > lock is held when migrate_enable() calls stop_one_cpu().
> > 
> > Signed-off-by: Scott Wood 
> 
> I have to ask...  Both sleeping_lock_inc() and sleeping_lock_dec() are
> no-ops if not CONFIG_PREEMPT_RT_BASE?

Yes.

-Scott




Re: [PATCH RT v2 3/3] rcu: Disable use_softirq on PREEMPT_RT

2019-08-22 Thread Scott Wood
On Thu, 2019-08-22 at 09:59 -0400, Joel Fernandes wrote:
> On Wed, Aug 21, 2019 at 06:19:06PM -0500, Scott Wood wrote:
> > I think the prohibition on use_softirq can be dropped once RT gets the
> > latest RCU code, but the question of what use_softirq should default
> > to on PREEMPT_RT remains.
> 
> Independent of the question of what use_softirq should default to, could
> we
> test RT with latest RCU code now to check if the deadlock goes away?  That
> way, maybe we can find any issues in current RCU that cause scheduler
> deadlocks in the situation you pointed. The reason I am asking is because
> recently additional commits [1] try to prevent deadlock and it'd be nice
> to
> ensure that other conditions are not lingering (I don't think they are but
> it'd be nice to be sure).
> 
> I am happy to do such testing myself if you want, however what does it
> take
> to apply the RT patchset to the latest mainline? Is it an achievable feat?

I did run such a test (cherry picking all RCU patches that aren't already in
RT, plus your RFC patch to rcu_read_unlock_special, rather than applying RT
to current mainline) with rcutorture plus a looping kernel build overnight,
and didn't see any splats with or without use_softirq.

-Scott




[PATCH RT v2 1/3] rcu: Acquire RCU lock when disabling BHs

2019-08-21 Thread Scott Wood
A plain local_bh_disable() is documented as creating an RCU critical
section, and (at least) rcutorture expects this to be the case.  However,
in_softirq() doesn't block a grace period on PREEMPT_RT, since RCU checks
preempt_count() directly.  Even if RCU were changed to check
in_softirq(), that wouldn't allow blocked BH disablers to be boosted.

Fix this by calling rcu_read_lock() from local_bh_disable(), and update
rcu_read_lock_bh_held() accordingly.

Signed-off-by: Scott Wood 
---
Another question is whether non-raw spinlocks are intended to create an
RCU read-side critical section due to implicit preempt disable.  If they
are, then we'd need to add rcu_read_lock() there as well since RT doesn't
disable preemption (and rcutorture should explicitly test with a
spinlock).  If not, the documentation should make that clear.

 include/linux/rcupdate.h |  4 
 kernel/rcu/update.c  |  4 
 kernel/softirq.c | 12 +---
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 388ace315f32..d6e357378732 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -615,10 +615,12 @@ static inline void rcu_read_unlock(void)
 static inline void rcu_read_lock_bh(void)
 {
local_bh_disable();
+#ifndef CONFIG_PREEMPT_RT_FULL
__acquire(RCU_BH);
rcu_lock_acquire(_bh_lock_map);
RCU_LOCKDEP_WARN(!rcu_is_watching(),
 "rcu_read_lock_bh() used illegally while idle");
+#endif
 }
 
 /*
@@ -628,10 +630,12 @@ static inline void rcu_read_lock_bh(void)
  */
 static inline void rcu_read_unlock_bh(void)
 {
+#ifndef CONFIG_PREEMPT_RT_FULL
RCU_LOCKDEP_WARN(!rcu_is_watching(),
 "rcu_read_unlock_bh() used illegally while idle");
rcu_lock_release(&rcu_bh_lock_map);
__release(RCU_BH);
+#endif
local_bh_enable();
 }
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 016c66a98292..a9cdf3d562bc 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -296,7 +296,11 @@ int rcu_read_lock_bh_held(void)
return 0;
if (!rcu_lockdep_current_cpu_online())
return 0;
+#ifdef CONFIG_PREEMPT_RT_FULL
+   return lock_is_held(&rcu_lock_map) || irqs_disabled();
+#else
return in_softirq() || irqs_disabled();
+#endif
 }
 EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held);
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d16d080a74f7..6080c9328df1 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -115,8 +115,10 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int 
cnt)
long soft_cnt;
 
WARN_ON_ONCE(in_irq());
-   if (!in_atomic())
+   if (!in_atomic()) {
local_lock(bh_lock);
+   rcu_read_lock();
+   }
soft_cnt = this_cpu_inc_return(softirq_counter);
WARN_ON_ONCE(soft_cnt == 0);
current->softirq_count += SOFTIRQ_DISABLE_OFFSET;
@@ -151,8 +153,10 @@ void _local_bh_enable(void)
 #endif
 
current->softirq_count -= SOFTIRQ_DISABLE_OFFSET;
-   if (!in_atomic())
+   if (!in_atomic()) {
+   rcu_read_unlock();
local_unlock(bh_lock);
+   }
 }
 
 void _local_bh_enable_rt(void)
@@ -185,8 +189,10 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int 
cnt)
WARN_ON_ONCE(count < 0);
local_irq_enable();
 
-   if (!in_atomic())
+   if (!in_atomic()) {
+   rcu_read_unlock();
local_unlock(bh_lock);
+   }
 
current->softirq_count -= SOFTIRQ_DISABLE_OFFSET;
preempt_check_resched();
-- 
1.8.3.1



[PATCH RT v2 3/3] rcu: Disable use_softirq on PREEMPT_RT

2019-08-21 Thread Scott Wood
Besides restoring behavior that used to be default on RT, this avoids
a deadlock on scheduler locks:

[  136.894657] 039: 
[  136.900401] 039: WARNING: possible recursive locking detected
[  136.906146] 039: 5.2.9-rt3.dbg+ #174 Tainted: GE
[  136.912152] 039: 
[  136.917896] 039: rcu_torture_rea/13474 is trying to acquire lock:
[  136.923990] 039: 5f25146d
[  136.927310] 039:  (
[  136.929414] 039: &p->pi_lock
[  136.932303] 039: ){-...}
[  136.934840] 039: , at: try_to_wake_up+0x39/0x920
[  136.939461] 039:
but task is already holding lock:
[  136.944425] 039: 5f25146d
[  136.947745] 039:  (
[  136.949852] 039: &p->pi_lock
[  136.952738] 039: ){-...}
[  136.955274] 039: , at: try_to_wake_up+0x39/0x920
[  136.959895] 039:
other info that might help us debug this:
[  136.96] 039:  Possible unsafe locking scenario:

[  136.970608] 039:CPU0
[  136.973493] 039:
[  136.976378] 039:   lock(
[  136.978918] 039: &p->pi_lock
[  136.981806] 039: );
[  136.983911] 039:   lock(
[  136.986451] 039: &p->pi_lock
[  136.989336] 039: );
[  136.991444] 039:
 *** DEADLOCK ***

[  136.995194] 039:  May be due to missing lock nesting notation

[  137.001115] 039: 3 locks held by rcu_torture_rea/13474:
[  137.006341] 039:  #0:
[  137.008707] 039: 5f25146d
[  137.012024] 039:  (
[  137.014131] 039: &p->pi_lock
[  137.017015] 039: ){-...}
[  137.019558] 039: , at: try_to_wake_up+0x39/0x920
[  137.024175] 039:  #1:
[  137.026540] 039: 11c8e51d
[  137.029859] 039:  (
[  137.031966] 039: &rq->lock
[  137.034679] 039: ){-...}
[  137.037217] 039: , at: try_to_wake_up+0x241/0x920
[  137.041924] 039:  #2:
[  137.044291] 039: 098649b9
[  137.047610] 039:  (
[  137.049714] 039: rcu_read_lock
[  137.052774] 039: ){}
[  137.055314] 039: , at: cpuacct_charge+0x33/0x1e0
[  137.059934] 039:
stack backtrace:
[  137.063425] 039: CPU: 39 PID: 13474 Comm: rcu_torture_rea Kdump: loaded 
Tainted: GE 5.2.9-rt3.dbg+ #174
[  137.074197] 039: Hardware name: Intel Corporation S2600BT/S2600BT, BIOS 
SE5C620.86B.01.00.0763.022420181017 02/24/2018
[  137.084886] 039: Call Trace:
[  137.087773] 039:  
[  137.090226] 039:  dump_stack+0x5e/0x8b
[  137.093997] 039:  __lock_acquire+0x725/0x1100
[  137.098358] 039:  lock_acquire+0xc0/0x240
[  137.102374] 039:  ? try_to_wake_up+0x39/0x920
[  137.106737] 039:  _raw_spin_lock_irqsave+0x47/0x90
[  137.111534] 039:  ? try_to_wake_up+0x39/0x920
[  137.115910] 039:  try_to_wake_up+0x39/0x920
[  137.120098] 039:  rcu_read_unlock_special+0x65/0xb0
[  137.124977] 039:  __rcu_read_unlock+0x5d/0x70
[  137.129337] 039:  cpuacct_charge+0xd9/0x1e0
[  137.133522] 039:  ? cpuacct_charge+0x33/0x1e0
[  137.137880] 039:  update_curr+0x14b/0x420
[  137.141894] 039:  enqueue_entity+0x42/0x370
[  137.146080] 039:  enqueue_task_fair+0xa9/0x490
[  137.150528] 039:  activate_task+0x5a/0xf0
[  137.154539] 039:  ttwu_do_activate+0x4e/0x90
[  137.158813] 039:  try_to_wake_up+0x277/0x920
[  137.163086] 039:  irq_exit+0xb6/0xf0
[  137.11] 039:  smp_apic_timer_interrupt+0xe3/0x3a0
[  137.171714] 039:  apic_timer_interrupt+0xf/0x20
[  137.176249] 039:  
[  137.178785] 039: RIP: 0010:__schedule+0x0/0x8e0
[  137.183319] 039: Code: 00 02 48 89 43 20 e8 0f 5a 00 00 48 8d 7b 28 e8 86 f2 
fd ff 31 c0 5b 5d 41 5c c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <55> 48 
89 e5 41 57 41 56 49 c7 c6 c0 ca 1e 00 41 55 41 89 fd 41 54
[  137.202498] 039: RSP: 0018:c9005835fbc0 EFLAGS: 0246
[  137.208158] 039:  ORIG_RAX: ff13
[  137.212428] 039: RAX:  RBX: 8897c3e1bb00 RCX: 
0001
[  137.219994] 039: RDX: 80004008 RSI: 0006 RDI: 
0001
[  137.227560] 039: RBP: 8897c3e1bb00 R08:  R09: 

[  137.235126] 039: R10: 0001 R11: 0001 R12: 
81001fd1
[  137.242694] 039: R13: 0044 R14:  R15: 
c9005835fcac
[  137.250259] 039:  ? ___preempt_schedule+0x16/0x18
[  137.254969] 039:  preempt_schedule_common+0x32/0x80
[  137.259846] 039:  ___preempt_schedule+0x16/0x18
[  137.264379] 039:  rcutorture_one_extend+0x33a/0x510 [rcutorture]
[  137.270397] 039:  rcu_torture_one_read+0x18c/0x450 [rcutorture]
[  137.276334] 039:  rcu_torture_reader+0xac/0x1f0 [rcutorture]
[  137.281998] 039:  ? rcu_torture_reader+0x1f0/0x1f0 [rcutorture]
[  137.287920] 039:  kthread+0x106/0x140
[  137.291591] 039:  ? rcu_torture_one_read+0x450/0x450 [rcutorture]
[  137.297681] 039:  ? kthread_bind+0x10/0x10
[  137.301783] 039:  ret_from_fork+0x3a/0x50

Signed-off-by: Scott Wood 
---
I think the prohibition on use_softirq can be dropped once RT gets the
latest RCU code, but the question of what use_softirq should default
to on PREEMPT_RT remains.
---
 kernel/rcu/tree.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

[PATCH RT v2 0/3] RCU fixes

2019-08-21 Thread Scott Wood
This is a respin of the "Address rcutorture issues" patchset,
minus the actual rcutorture changes.

I still plan to implement detection of bad nesting scenarios, but it's
complicated by the need to distinguish (on a non-RT kernel) between
irq/preempt disabling that would and would not happen on an RT kernel
(which would also have the benefit of being able to detect nesting
regular spinlocks inside raw spinlocks on a non-RT debug kernel).  In
the meantime I could send the rcutorture changes as a PREEMPT_RT only
patch, though the extent of the changes depends on whether my migrate
disable patchset is applied since it removes a restriction.

Scott Wood (3):
  rcu: Acquire RCU lock when disabling BHs
  sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep
  rcu: Disable use_softirq on PREEMPT_RT

 include/linux/rcupdate.h |  4 
 include/linux/sched.h|  4 ++--
 kernel/rcu/tree.c|  9 -
 kernel/rcu/tree_plugin.h |  2 +-
 kernel/rcu/update.c  |  4 
 kernel/sched/core.c  |  8 
 kernel/softirq.c | 12 +---
 7 files changed, 36 insertions(+), 7 deletions(-)

-- 
1.8.3.1



[PATCH RT v2 2/3] sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep

2019-08-21 Thread Scott Wood
Without this, rcu_note_context_switch() will complain if an RCU read
lock is held when migrate_enable() calls stop_one_cpu().

Signed-off-by: Scott Wood 
---
v2: Added comment.

If my migrate disable changes aren't taken, then pin_current_cpu()
will also need to use sleeping_lock_inc() because calling
__read_rt_lock() bypasses the usual place it's done.

 include/linux/sched.h| 4 ++--
 kernel/rcu/tree_plugin.h | 2 +-
 kernel/sched/core.c  | 8 
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e892e727f12..1ebc97f28009 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -673,7 +673,7 @@ struct task_struct {
int migrate_disable_atomic;
 # endif
 #endif
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
int sleeping_lock;
 #endif
 
@@ -1881,7 +1881,7 @@ static __always_inline bool need_resched(void)
return unlikely(tif_need_resched());
 }
 
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
 static inline void sleeping_lock_inc(void)
 {
current->sleeping_lock++;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 23a54e4b649c..7a3aa085ce2c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -292,7 +292,7 @@ void rcu_note_context_switch(bool preempt)
barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
-#if defined(CONFIG_PREEMPT_RT_FULL)
+#if defined(CONFIG_PREEMPT_RT_BASE)
sleeping_l = t->sleeping_lock;
 #endif
WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0 && !sleeping_l);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1bdd7f9be05..0758ee85634e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7405,7 +7405,15 @@ void migrate_enable(void)
unpin_current_cpu();
preempt_lazy_enable();
preempt_enable();
+
+   /*
+* sleeping_lock_inc suppresses a debug check for
+* sleeping inside an RCU read side critical section
+*/
+   sleeping_lock_inc();
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
+   sleeping_lock_dec();
+
return;
}
}
-- 
1.8.3.1



Re: [RFC v2] rcu/tree: Try to invoke_rcu_core() if in_irq() during unlock

2019-08-19 Thread Scott Wood
On Sun, 2019-08-18 at 17:49 -0400, Joel Fernandes (Google) wrote:
> When we're in hard interrupt context in rcu_read_unlock_special(), we
> can still benefit from invoke_rcu_core() doing wake ups of rcuc
> threads when the !use_softirq parameter is passed.  This is safe
> to do so because:

What is the benefit, beyond skipping the irq work overhead?  Is there some
reason to specifically want the rcuc thread woken rather than just getting
into the scheduler (and thus rcu_note_context_switch) as soon as possible?

-Scott




[PATCH RT 4/8] sched: migrate disable: Protect cpus_ptr with lock

2019-07-26 Thread Scott Wood
Various places assume that cpus_ptr is protected by rq/pi locks,
so don't change it before grabbing those locks.

Signed-off-by: Scott Wood 
---
 kernel/sched/core.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 99a3cfccf4d3..38a9a9df5638 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7283,9 +7283,8 @@ void dump_cpu_task(int cpu)
struct rq *rq;
struct rq_flags rf;
 
-   p->cpus_ptr = cpumask_of(smp_processor_id());
-
rq = task_rq_lock(p, &rf);
+   p->cpus_ptr = cpumask_of(smp_processor_id());
update_nr_migratory(p, -1);
p->nr_cpus_allowed = 1;
task_rq_unlock(rq, p, &rf);
@@ -7297,9 +7296,8 @@ void dump_cpu_task(int cpu)
struct rq *rq;
struct rq_flags rf;
 
-   p->cpus_ptr = &p->cpus_mask;
-
rq = task_rq_lock(p, &rf);
+   p->cpus_ptr = &p->cpus_mask;
p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
update_nr_migratory(p, 1);
task_rq_unlock(rq, p, &rf);
-- 
1.8.3.1



[PATCH RT 1/8] sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep

2019-07-26 Thread Scott Wood
Without this, rcu_note_context_switch() will complain if an RCU read
lock is held when migrate_enable() calls stop_one_cpu().

Signed-off-by: Scott Wood 
---
 include/linux/sched.h| 4 ++--
 kernel/rcu/tree_plugin.h | 2 +-
 kernel/sched/core.c  | 2 ++
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4e218f8d8048..ad23ab939b35 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -673,7 +673,7 @@ struct task_struct {
int migrate_disable_atomic;
 # endif
 #endif
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
int sleeping_lock;
 #endif
 
@@ -1873,7 +1873,7 @@ static __always_inline bool need_resched(void)
return unlikely(tif_need_resched());
 }
 
-#ifdef CONFIG_PREEMPT_RT_FULL
+#ifdef CONFIG_PREEMPT_RT_BASE
 static inline void sleeping_lock_inc(void)
 {
current->sleeping_lock++;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 23a54e4b649c..7a3aa085ce2c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -292,7 +292,7 @@ void rcu_note_context_switch(bool preempt)
barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
lockdep_assert_irqs_disabled();
-#if defined(CONFIG_PREEMPT_RT_FULL)
+#if defined(CONFIG_PREEMPT_RT_BASE)
sleeping_l = t->sleeping_lock;
 #endif
WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0 && !sleeping_l);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d3c6542b306f..c3407707e367 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7405,7 +7405,9 @@ void migrate_enable(void)
unpin_current_cpu();
preempt_lazy_enable();
preempt_enable();
+   sleeping_lock_inc();
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
+   sleeping_lock_dec();
return;
}
}
-- 
1.8.3.1



[PATCH RT 6/8] sched: migrate_enable: Set state to TASK_RUNNING

2019-07-26 Thread Scott Wood
If migrate_enable() is called while a task is preparing to sleep
(state != TASK_RUNNING), that triggers a debug check in stop_one_cpu().
Explicitly reset state to acknowledge that we're accepting the spurious
wakeup.

Signed-off-by: Scott Wood 
---
 kernel/sched/core.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 38a9a9df5638..eb27a9bf70d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7396,6 +7396,14 @@ void migrate_enable(void)
unpin_current_cpu();
preempt_lazy_enable();
preempt_enable();
+
+   /*
+* Avoid sleeping with an existing non-running
+* state.  This will result in a spurious wakeup
+* for the calling context.
+*/
+   __set_current_state(TASK_RUNNING);
+
sleeping_lock_inc();
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
sleeping_lock_dec();
-- 
1.8.3.1



[PATCH RT 5/8] sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()

2019-07-26 Thread Scott Wood
With the changes to migrate disabling, ->set_cpus_allowed() no longer
gets deferred until migrate_enable().  To avoid releasing the bandwidth
while the task may still be executing on the old CPU, move the subtraction
to ->migrate_task_rq().

Signed-off-by: Scott Wood 
---
 kernel/sched/deadline.c | 67 +++--
 1 file changed, 31 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c18be51f7608..2f18d0cf1b56 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1606,14 +1606,42 @@ static void yield_task_dl(struct rq *rq)
return cpu;
 }
 
+static void free_old_cpuset_bw_dl(struct rq *rq, struct task_struct *p)
+{
+   struct root_domain *src_rd = rq->rd;
+
+   /*
+* Migrating a SCHED_DEADLINE task between exclusive
+* cpusets (different root_domains) entails a bandwidth
+* update. We already made space for us in the destination
+* domain (see cpuset_can_attach()).
+*/
+   if (!cpumask_intersects(src_rd->span, p->cpus_ptr)) {
+   struct dl_bw *src_dl_b;
+
+   src_dl_b = dl_bw_of(cpu_of(rq));
+   /*
+* We now free resources of the root_domain we are migrating
+* off. In the worst case, sched_setattr() may temporary fail
+* until we complete the update.
+*/
+   raw_spin_lock(&src_dl_b->lock);
+   __dl_sub(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
+   raw_spin_unlock(&src_dl_b->lock);
+   }
+}
+
 static void migrate_task_rq_dl(struct task_struct *p, int new_cpu 
__maybe_unused)
 {
struct rq *rq;
 
-   if (p->state != TASK_WAKING)
+   rq = task_rq(p);
+
+   if (p->state != TASK_WAKING) {
+   free_old_cpuset_bw_dl(rq, p);
return;
+   }
 
-   rq = task_rq(p);
/*
 * Since p->state == TASK_WAKING, set_task_cpu() has been called
 * from try_to_wake_up(). Hence, p->pi_lock is locked, but
@@ -2220,39 +2248,6 @@ static void task_woken_dl(struct rq *rq, struct 
task_struct *p)
}
 }
 
-static void set_cpus_allowed_dl(struct task_struct *p,
-   const struct cpumask *new_mask)
-{
-   struct root_domain *src_rd;
-   struct rq *rq;
-
-   BUG_ON(!dl_task(p));
-
-   rq = task_rq(p);
-   src_rd = rq->rd;
-   /*
-* Migrating a SCHED_DEADLINE task between exclusive
-* cpusets (different root_domains) entails a bandwidth
-* update. We already made space for us in the destination
-* domain (see cpuset_can_attach()).
-*/
-   if (!cpumask_intersects(src_rd->span, new_mask)) {
-   struct dl_bw *src_dl_b;
-
-   src_dl_b = dl_bw_of(cpu_of(rq));
-   /*
-* We now free resources of the root_domain we are migrating
-* off. In the worst case, sched_setattr() may temporary fail
-* until we complete the update.
-*/
-   raw_spin_lock(&src_dl_b->lock);
-   __dl_sub(src_dl_b, p->dl.dl_bw, dl_bw_cpus(task_cpu(p)));
-   raw_spin_unlock(&src_dl_b->lock);
-   }
-
-   set_cpus_allowed_common(p, new_mask);
-}
-
 /* Assumes rq->lock is held */
 static void rq_online_dl(struct rq *rq)
 {
@@ -2407,7 +2402,7 @@ static void prio_changed_dl(struct rq *rq, struct 
task_struct *p,
 #ifdef CONFIG_SMP
.select_task_rq = select_task_rq_dl,
.migrate_task_rq= migrate_task_rq_dl,
-   .set_cpus_allowed   = set_cpus_allowed_dl,
+   .set_cpus_allowed   = set_cpus_allowed_common,
.rq_online  = rq_online_dl,
.rq_offline = rq_offline_dl,
.task_woken = task_woken_dl,
-- 
1.8.3.1



[PATCH RT 2/8] sched: __set_cpus_allowed_ptr: Check cpus_mask, not cpus_ptr

2019-07-26 Thread Scott Wood
This function is concerned with the long-term cpu mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the cpu that the task was running on, then the mask update
would be lost.

Signed-off-by: Scott Wood 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3407707e367..6e643d656d71 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1218,7 +1218,7 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
goto out;
}
 
-   if (cpumask_equal(p->cpus_ptr, new_mask))
+   if (cpumask_equal(&p->cpus_mask, new_mask))
goto out;
 
if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
-- 
1.8.3.1



[PATCH RT 7/8] sched: migrate_enable: Use select_fallback_rq()

2019-07-26 Thread Scott Wood
migrate_enable() currently open-codes a variant of select_fallback_rq().
However, it does not have the "No more Mr. Nice Guy" fallback and thus
it will pass an invalid CPU to the migration thread if cpus_mask only
contains a CPU that is !active.

Signed-off-by: Scott Wood 
---
This scenario will be more likely after the next patch, since
the migrate_disable_update check goes away.  However, it could happen
anyway if cpus_mask was updated to a CPU other than the one we were
pinned to, and that CPU subsequently became inactive.
---
 kernel/sched/core.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb27a9bf70d7..3a2d8251a30c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7368,6 +7368,7 @@ void migrate_enable(void)
if (p->migrate_disable_update) {
struct rq *rq;
struct rq_flags rf;
+   int cpu = task_cpu(p);
 
rq = task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -7377,21 +7378,15 @@ void migrate_enable(void)
 
p->migrate_disable_update = 0;
 
-   WARN_ON(smp_processor_id() != task_cpu(p));
-   if (!cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
-   const struct cpumask *cpu_valid_mask = cpu_active_mask;
-   struct migration_arg arg;
-   unsigned int dest_cpu;
-
-   if (p->flags & PF_KTHREAD) {
-   /*
-* Kernel threads are allowed on online && 
!active CPUs
-*/
-   cpu_valid_mask = cpu_online_mask;
-   }
-   dest_cpu = cpumask_any_and(cpu_valid_mask, &p->cpus_mask);
-   arg.task = p;
-   arg.dest_cpu = dest_cpu;
+   WARN_ON(smp_processor_id() != cpu);
+   if (!cpumask_test_cpu(cpu, &p->cpus_mask)) {
+   struct migration_arg arg = { p };
+   struct rq_flags rf;
+
+   rq = task_rq_lock(p, &rf);
+   update_rq_clock(rq);
+   arg.dest_cpu = select_fallback_rq(cpu, p);
+   task_rq_unlock(rq, p, &rf);
 
unpin_current_cpu();
preempt_lazy_enable();
-- 
1.8.3.1



[PATCH RT 3/8] sched: Remove dead __migrate_disabled() check

2019-07-26 Thread Scott Wood
This code was unreachable given the __migrate_disabled() branch
to "out" immediately beforehand.

Signed-off-by: Scott Wood 
---
 kernel/sched/core.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6e643d656d71..99a3cfccf4d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1242,13 +1242,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_test_cpu(task_cpu(p), new_mask) || __migrate_disabled(p))
goto out;
 
-#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
-   if (__migrate_disabled(p)) {
-   p->migrate_disable_update = 1;
-   goto out;
-   }
-#endif
-
dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
if (task_running(rq, p) || p->state == TASK_WAKING) {
struct migration_arg arg = { p, dest_cpu };
-- 
1.8.3.1



[PATCH RT 8/8] sched: Lazy migrate_disable processing

2019-07-26 Thread Scott Wood
Avoid overhead on the majority of migrate disable/enable sequences
by only manipulating scheduler data (and grabbing the relevant locks)
when the task actually schedules while migrate-disabled.  Very large
speedups were seen during a kernel build.

Instead of cpuhp_pin_lock, CPU hotplug is handled by keeping a per-CPU
count of the number of pinned tasks (including tasks which have not
scheduled in the migrate-disabled section); takedown_cpu() will
wait until that reaches zero (confirmed by take_cpu_down() in stop
machine context to deal with races) before migrating tasks off of the
cpu.

To simplify synchronization, updating cpus_mask is no longer deferred
until migrate_enable().  This lets us not have to worry about
migrate_enable() missing the update if it's on the fast path (didn't
schedule during the migrate disabled section).  It also makes the code
a bit simpler and reduces deviation from mainline.

While the main motivation for this is the performance benefit, lazy
migrate disable also eliminates the restriction on calling
migrate_disable() while atomic but leaving the atomic region prior to
calling migrate_enable() -- though this won't help with local_bh_disable()
(and thus rcutorture) unless something similar is done with the recently
added local_lock.
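
For reference, the hotplug-side accounting is conceptually just the
following (heavily simplified sketch; the per-CPU variable name is made up
here, and the real patch folds this into the scheduler and closes the
races against CPU takedown via take_cpu_down() in stop-machine context):

static DEFINE_PER_CPU(int, nr_pinned);

void migrate_disable(void)
{
	preempt_disable();
	if (current->migrate_disable++ == 0)
		this_cpu_inc(nr_pinned);	/* task is now pinned here */
	preempt_enable();
}

void migrate_enable(void)
{
	preempt_disable();
	if (--current->migrate_disable == 0)
		this_cpu_dec(nr_pinned);	/* still on the same CPU */
	preempt_enable();
}

/*
 * takedown_cpu() waits for this to reach zero before migrating the
 * remaining tasks off of the dying CPU.
 */
int cpu_nr_pinned(int cpu)
{
	return per_cpu(nr_pinned, cpu);
}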

Signed-off-by: Scott Wood 
---
 include/linux/cpu.h|   4 --
 include/linux/sched.h  |  11 +--
 init/init_task.c   |   4 ++
 kernel/cpu.c   |  97 +
 kernel/sched/core.c| 192 -
 kernel/sched/sched.h   |   4 ++
 lib/smp_processor_id.c |   3 +
 7 files changed, 130 insertions(+), 185 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index f4a772c12d14..2df500fdcbc4 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -113,8 +113,6 @@ static inline void cpu_maps_update_done(void)
 extern void cpu_hotplug_enable(void);
 void clear_tasks_mm_cpumask(int cpu);
 int cpu_down(unsigned int cpu);
-extern void pin_current_cpu(void);
-extern void unpin_current_cpu(void);
 
 #else /* CONFIG_HOTPLUG_CPU */
 
@@ -126,8 +124,6 @@ static inline void cpus_read_unlock(void) { }
 static inline void lockdep_assert_cpus_held(void) { }
 static inline void cpu_hotplug_disable(void) { }
 static inline void cpu_hotplug_enable(void) { }
-static inline void pin_current_cpu(void) { }
-static inline void unpin_current_cpu(void) { }
 
 #endif /* !CONFIG_HOTPLUG_CPU */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad23ab939b35..069c46dde15b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -229,6 +229,8 @@
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+int cpu_nr_pinned(int cpu);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
@@ -661,16 +663,13 @@ struct task_struct {
cpumask_t   cpus_mask;
 #if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
int migrate_disable;
-   int migrate_disable_update;
-   int pinned_on_cpu;
+   boolmigrate_disable_scheduled;
 # ifdef CONFIG_SCHED_DEBUG
-   int migrate_disable_atomic;
+   int pinned_on_cpu;
 # endif
-
 #elif !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE)
 # ifdef CONFIG_SCHED_DEBUG
int migrate_disable;
-   int migrate_disable_atomic;
 # endif
 #endif
 #ifdef CONFIG_PREEMPT_RT_BASE
@@ -2066,4 +2065,6 @@ static inline void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+extern struct task_struct *takedown_cpu_task;
+
 #endif
diff --git a/init/init_task.c b/init/init_task.c
index e402413dc47d..c0c7618fd2fb 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -81,6 +81,10 @@ struct task_struct init_task
.cpus_ptr   = &init_task.cpus_mask,
.cpus_mask  = CPU_MASK_ALL,
.nr_cpus_allowed= NR_CPUS,
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT_RT_BASE) && \
+defined(CONFIG_SCHED_DEBUG)
+   .pinned_on_cpu  = -1,
+#endif
.mm = NULL,
.active_mm  = _mm,
.restart_block  = {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 885a195dfbe0..0096acf1a692 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -76,11 +76,6 @@ static DEFINE_PER_CPU(struct cpuhp_cpu_state, cpuhp_state) = {
.fail = CPUHP_INVALID,
 };
 
-#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_PREEMPT_RT_FULL)
-static DEFINE_PER_CPU(struct rt_rw_lock, cpuhp_pin_lock) = \
-   __RWLOCK_RT_INITIALIZER(cpuhp_pin_lock);
-#endif
-
 #if defined(CONFIG_LOCKDEP) && defined(CONFIG_SMP)
 static struct lockdep_map cpuhp_state_up_map =
STATIC_LOCKDEP_MAP_INIT("cpuhp_state-up", &cpuhp_state_up_map);
@@ -

[RT PATCH 0/8] migrate disable fixes and performance

2019-07-26 Thread Scott Wood
With these patches, a kernel build on a 104-cpu machine took around 75%
less wall time and 85% less system time.  Note that there is a difference
in v5.2-rt compared to v5.0-rt.  The performance with these patches is
similar in both cases, but without these patches v5.2-rt is substantially
slower.  In v5.0-rt with a previous version of these patches, lazy
migrate disable reduced kernel build time by around 15-20% wall and
70-75% system.

Scott Wood (8):
  sched: migrate_enable: Use sleeping_lock to indicate involuntary sleep
  sched: __set_cpus_allowed_ptr: Check cpus_mask, not cpus_ptr
  sched: Remove dead __migrate_disabled() check
  sched: migrate disable: Protect cpus_ptr with lock
  sched/deadline: Reclaim cpuset bandwidth in .migrate_task_rq()
  sched: migrate_enable: Set state to TASK_RUNNING
  sched: migrate_enable: Use select_fallback_rq()
  sched: Lazy migrate_disable processing

 include/linux/cpu.h  |   4 -
 include/linux/sched.h|  15 ++--
 init/init_task.c |   4 +
 kernel/cpu.c |  97 ---
 kernel/rcu/tree_plugin.h |   2 +-
 kernel/sched/core.c  | 200 +++
 kernel/sched/deadline.c  |  67 
 kernel/sched/sched.h |   4 +
 lib/smp_processor_id.c   |   3 +
 9 files changed, 166 insertions(+), 230 deletions(-)

-- 
1.8.3.1



Re: [PATCH] clk: qoriq: Fix -Wunused-const-variable

2019-06-28 Thread Scott Wood
On Thu, 2019-06-27 at 15:06 -0700, Nathan Huckleberry wrote:
> drivers/clk/clk-qoriq.c:138:38: warning: unused variable
> 'p5020_cmux_grp1' [-Wunused-const-variable] static const struct
> clockgen_muxinfo p5020_cmux_grp1
> 
> drivers/clk/clk-qoriq.c:146:38: warning: unused variable
> 'p5020_cmux_grp2' [-Wunused-const-variable] static const struct
> clockgen_muxinfo p5020_cmux_grp2
> 
> In the definition of the p5020 chip, the p2041 chip's info was used
> instead.  The p5020 and p2041 chips have different info. This is most
> likely a typo.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/525
> Cc: clang-built-li...@googlegroups.com
> Signed-off-by: Nathan Huckleberry 

Acked-by: Scott Wood 

-Scott
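
For readers without the driver handy, here is a stand-alone illustration of
the bug class being fixed.  The types and values are made up; the real tables
live in drivers/clk/clk-qoriq.c.  When the p5020 entry reuses the p2041
groups, the p5020_cmux_grp* constants are never referenced and the compiler
rightly warns about unused const variables.

#include <stdio.h>

struct clockgen_muxinfo { int id; };

static const struct clockgen_muxinfo p2041_cmux_grp1 = { 1 };
static const struct clockgen_muxinfo p2041_cmux_grp2 = { 2 };
static const struct clockgen_muxinfo p5020_cmux_grp1 = { 3 };
static const struct clockgen_muxinfo p5020_cmux_grp2 = { 4 };

struct chipinfo {
	const char *name;
	const struct clockgen_muxinfo *cmux_groups[2];
};

static const struct chipinfo p2041 = {
	.name = "p2041",
	.cmux_groups = { &p2041_cmux_grp1, &p2041_cmux_grp2 },
};

static const struct chipinfo p5020 = {
	.name = "p5020",
	/* The bug: this previously pointed at the p2041 groups, leaving
	 * the p5020 tables unreferenced (hence -Wunused-const-variable). */
	.cmux_groups = { &p5020_cmux_grp1, &p5020_cmux_grp2 },
};

int main(void)
{
	printf("%s uses group ids %d,%d\n", p5020.name,
	       p5020.cmux_groups[0]->id, p5020.cmux_groups[1]->id);
	printf("%s uses group ids %d,%d\n", p2041.name,
	       p2041.cmux_groups[0]->id, p2041.cmux_groups[1]->id);
	return 0;
}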




Re: [RFC] Deadlock via recursive wakeup via RCU with threadirqs

2019-06-28 Thread Scott Wood
On Fri, 2019-06-28 at 16:15 +0200, Peter Zijlstra wrote:
> On Thu, Jun 27, 2019 at 01:36:12PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 27, 2019 at 03:17:27PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> > > > Of course, unconditionally refusing to do the wakeup might not be
> > > > happy
> > > > thing for NO_HZ_FULL kernels that don't implement IRQ work.
> > > 
> > > Couldn't smp_send_reschedule() be used instead?
> > 
> > Good point.  If current -rcu doesn't fix things for Sebastian's case,
> > that would be well worth looking at.  But there must be some reason
> > why Peter Zijlstra didn't suggest it when he instead suggested using
> > the IRQ work approach.
> > 
> > Peter, thoughts?
> 
> I've not exactly kept up with the thread; but irq_work allows you to run
> some actual code on the remote CPU which is often useful and it is only
> a little more expensive than smp_send_reschedule().
> 
> Also, just smp_send_reschedule() doesn't really do anything without
> first poking TIF_NEED_RESCHED (or other scheduler state) and if you want
> to do both, there's other helpers you should use, like resched_cpu().

resched_cpu() will not send an IPI to the current CPU[1].  Plus, the RCU
code needs to set need_resched even in cases where it doesn't need to send
the IPI.  And worst of all, resched_cpu() takes the rq lock which is the
deadlock scenario we're trying to avoid.

-Scott

[1] Which makes me nervous about latency if there are any wakeups with irqs
disabled, without a preempt_enable() after irqs are enabled again, and not
inside an interrupt.
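
To make the alternative concrete, here is a hypothetical helper -- the name is
made up and this is not code from the RCU tree -- sketching the lockless
approach argued for above: always mark the current task as needing a
reschedule, and only send the IPI-to-self when irqs are disabled, so the
need_resched check on IRQ exit does the rest.  No runqueue lock is taken, so
it avoids the self-deadlock worry with resched_cpu().

static void rcu_self_ipi_sketch(void)
{
	set_tsk_need_resched(current);   /* RCU must set need_resched either way */
	set_preempt_need_resched();
	if (irqs_disabled())
		/* IRQ exit after local_irq_enable() will notice need_resched */
		smp_send_reschedule(smp_processor_id());
}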



Re: [RFC PATCH RT 4/4] rcutorture: Avoid problematic critical section nesting

2019-06-28 Thread Scott Wood
On Thu, 2019-06-27 at 17:52 -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 05:46:27PM -0500, Scott Wood wrote:
> > On Thu, 2019-06-27 at 13:50 -0700, Paul E. McKenney wrote:
> > > If by IPI-to-self you mean the IRQ work trick, that isn't implemented
> > > across all architectures yet, is it?
> > 
> > Right... smp_send_reschedule() has wider coverage, but even then there's
> > some hardware that just can't do it reasonably (e.g. pre-APIC x86).
> 
> Except that smp_send_reschedule() won't do anything unless the scheduler
> thinks something needs to be done, as in its wake list is non-empty.
> Which might explain why Peter Zijlstra didn't suggest it.

The wake list stuff is separate from the original purpose of the IPI, which
is to hit the need_resched check on IRQ exit.  When that happens, the
scheduler will call into RCU, even if it doesn't change threads.  

> > So I guess the options are:
> > 
> > 1. Accept that such hardware might experience delayed grace period
> > completion in certain configurations,
> > 2. Have such hardware check for need_resched in local_irq_enable() (not
> > nice
> > if sharing a kernel build with hardware that doesn't need it), or
> > 3. Forbid the sequence (enforced by debug checks).  Again, this would
> > only
> > prohibit rcu_read_lock()/local_irq_disable()/rcu_read_unlock()/
> > local_irq_enable() *without* preempt disabling around the IRQ-disabled
> > region.
> 
> 4. If further testing continues to show it to be reliable, continue
> using the scheme in -rcu.

If the testing isn't done on machines that can't do the IPI then it's
basically option #1.  FWIW I don't think option #1 is unreasonable given
that we're talking about very old and/or specialized hardware, and we're
only talking about delays, not a crash (maybe limit the ability to use
nohz_full on such hardware?).  Of course if it turns out people are actually
trying to run (modern versions of) RT on such hardware, that might be
different. :-)

> 5. Use a short-duration hrtimer to get a clean environment in short
> order.  Yes, the timer might fire while preemption and/or softirqs
> are disabled, but then the code can rely on the following
> preempt_enable(), local_bh_enable(), or whatever.  This condition
> should be sufficiently rare to avoid issues with hrtimer overhead.

Yeah, I considered that but was hesitant due to overhead -- at least in the
case of the example I gave (pre-APIC x86), arming a oneshot timer is pretty
slow.  Plus, some hardware might entirely lack one-shot timer capability.

> 6. Use smp_call_function_single() to IPI some other poor slob of a
> CPU, which then does the same back.  Non-waiting version in both
> cases, of course.

I was assuming any hardware that can't do smp_send_reschedule() is not SMP.

> 
> Probably others as well.
> 
> > > Why not simply make rcutorture check whether it is running in a
> > > PREEMPT_RT_FULL environment and avoid the PREEMPT_RT_FULL-unfriendly
> > > testing only in that case?
> > > 
> > > And should we later get to a place where the PREEMPT_RT_FULL-
> > > unfriendly
> > > scenarios are prohibited across all kernel configurations, then the
> > > module
> > > parameter can be removed.  Again, until we know (as opposed to
> > > suspect)
> > > that these scenarios really don't happen, mainline rcutorture must
> > > continue testing them.
> > 
> > Yes, I already acknowledged that debug checks detecting the sequences
> > should
> > come before the test removal
> 
> OK, good to hear.  As you may have noticed, I was getting the impression
> that you might have changed your mind on this point.  ;-)
> 
> >  (including this patch as an RFC at this
> > point
> > was mainly meant as a demonstration of what's needed to get rcutorture
> > to
> > pass), but it'd be nice to have some idea of whether there would be
> > opposition to the concept before coding up the checks.  I'd rather not
> > continue the state of "these sequences can blow up on RT and we don't
> > know
> > if they exist or not" any longer than necessary.  Plus, only one of the
> > sequences is exclusively an RT issue (though it's the one with the worst
> > consequences).
> 
> Steve Rostedt's point about enlisting the aid of lockdep seems worth
> looking into.

Sure.  I was just concerned by the "Linus was against enforcing this in the
past" comment and was hoping for more details.

-Scott




Re: [RFC PATCH RT 4/4] rcutorture: Avoid problematic critical section nesting

2019-06-27 Thread Scott Wood
On Thu, 2019-06-27 at 13:50 -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 03:16:09PM -0500, Scott Wood wrote:
> > On Thu, 2019-06-27 at 11:00 -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 26, 2019 at 11:49:16AM -0500, Scott Wood wrote:
> > > > > 
> > > > > On Fri, 21 Jun 2019 16:59:55 -0700
> > > > > "Paul E. McKenney"  wrote:
> > > > > 
> > > > > > I have no objection to the outlawing of a number of these
> > > > > > sequences
> > > > > > in
> > > > > > mainline, but am rather pointing out that until they really are
> > > > > > outlawed
> > > > > > and eliminated, rcutorture must continue to test them in
> > > > > > mainline.
> > > > > > Of course, an rcutorture running in -rt should avoid testing
> > > > > > things
> > > > > > that
> > > > > > break -rt, including these sequences.
> > > > > 
> > > > > sequences in the code. And we also need to get Linus's approval of
> > > > > this
> > > > > as I believe he was against enforcing this in the past.
> > > > 
> > > > Was the opposition to prohibiting some specific sequence?  It's only
> > > > certain
> > > > misnesting scenarios that are problematic.  The rcu_read_lock/
> > > > local_irq_disable restriction can be dropped with the IPI-to-self
> > > > added
> > > > in
> > > > Paul's tree.  Are there any known instances of the other two
> > > > (besides
> > > > rcutorture)?
> 
> If by IPI-to-self you mean the IRQ work trick, that isn't implemented
> across all architectures yet, is it?

Right... smp_send_reschedule() has wider coverage, but even then there's
some hardware that just can't do it reasonably (e.g. pre-APIC x86).  So I
guess the options are:

1. Accept that such hardware might experience delayed grace period
completion in certain configurations,
2. Have such hardware check for need_resched in local_irq_enable() (not nice
if sharing a kernel build with hardware that doesn't need it), or
3. Forbid the sequence (enforced by debug checks).  Again, this would only
prohibit rcu_read_lock()/local_irq_disable()/rcu_read_unlock()/
local_irq_enable() *without* preempt disabling around the IRQ-disabled
region.

> Why not simply make rcutorture check whether it is running in a
> PREEMPT_RT_FULL environment and avoid the PREEMPT_RT_FULL-unfriendly
> testing only in that case?
>
> And should we later get to a place where the PREEMPT_RT_FULL-unfriendly
> scenarios are prohibited across all kernel configurations, then the module
> parameter can be removed.  Again, until we know (as opposed to suspect)
> that these scenarios really don't happen, mainline rcutorture must
> continue testing them.

Yes, I already acknowledged that debug checks detecting the sequences should
come before the test removal (including this patch as an RFC at this point
was mainly meant as a demonstration of what's needed to get rcutorture to
pass), but it'd be nice to have some idea of whether there would be
opposition to the concept before coding up the checks.  I'd rather not
continue the state of "these sequences can blow up on RT and we don't know
if they exist or not" any longer than necessary.  Plus, only one of the
sequences is exclusively an RT issue (though it's the one with the worst
consequences).

-Scott




Re: [RFC] Deadlock via recursive wakeup via RCU with threadirqs

2019-06-27 Thread Scott Wood
On Thu, 2019-06-27 at 11:41 -0700, Paul E. McKenney wrote:
> On Thu, Jun 27, 2019 at 02:16:38PM -0400, Joel Fernandes wrote:
> > 
> > I think the fix should be to prevent the wake-up not based on whether we
> > are
> > in hard/soft-interrupt mode but that we are doing the rcu_read_unlock()
> > from
> > a scheduler path (if we can detect that)
> 
> Or just don't do the wakeup at all, if it comes to that.  I don't know
> of any way to determine whether rcu_read_unlock() is being called from
> the scheduler, but it has been some time since I asked Peter Zijlstra
> about that.
> 
> Of course, unconditionally refusing to do the wakeup might not be happy
> thing for NO_HZ_FULL kernels that don't implement IRQ work.

Couldn't smp_send_reschedule() be used instead?

-Scott




Re: [RFC PATCH RT 4/4] rcutorture: Avoid problematic critical section nesting

2019-06-27 Thread Scott Wood
On Thu, 2019-06-27 at 11:00 -0700, Paul E. McKenney wrote:
> On Wed, Jun 26, 2019 at 11:49:16AM -0500, Scott Wood wrote:
> > On Wed, 2019-06-26 at 11:08 -0400, Steven Rostedt wrote:
> > > On Fri, 21 Jun 2019 16:59:55 -0700
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > I have no objection to the outlawing of a number of these sequences
> > > > in
> > > > mainline, but am rather pointing out that until they really are
> > > > outlawed
> > > > and eliminated, rcutorture must continue to test them in mainline.
> > > > Of course, an rcutorture running in -rt should avoid testing things
> > > > that
> > > > break -rt, including these sequences.
> > > 
> > > We should update lockdep to complain about these sequences. That would
> > > "outlaw" them in mainline. That is, after we clean up all the current
> > > sequences in the code. And we also need to get Linus's approval of
> > > this
> > > as I believe he was against enforcing this in the past.
> > 
> > Was the opposition to prohibiting some specific sequence?  It's only
> > certain
> > misnesting scenarios that are problematic.  The rcu_read_lock/
> > local_irq_disable restriction can be dropped with the IPI-to-self added
> > in
> > Paul's tree.  Are there any known instances of the other two (besides
> > rcutorture)?
> 
> Given the failure scenario Sebastian Siewior reported today, there
> apparently are some, at least when running threaded interrupt handlers.

That's the rcu misnesting, which it looks like we can allow with the IPI-to-
self; I was asking about the other two.  I suppose if we really need to, we
could work around preempt_disable()/local_irq_disable()/preempt_enable()/
local_irq_enable() by having preempt_enable() do an IPI-to-self if
need_resched is set and IRQs are disabled.  The RT local_bh_disable()
atomic/non-atomic misnesting would be more difficult, but I don't think
impossible.  I've got lazy migrate disable working (initially as an attempt
to deal with misnesting but it turned out to give a huge speedup as well;
will send as soon as I take care of a loose end in the deadline scheduler);
it's possible that something similar could be done with the softirq lock
(and given that I saw a slowdown when that lock was introduced, it may also
be worth doing just for performance).

BTW, it's not clear to me whether the failure Sebastian saw was due to the
bare irq disabled version, which was what I was talking about prohibiting
(he didn't show the context that was interrupted).  The version where
preempt is disabled (with or without irqs being disabled inside the preempt
disabled region) definitely happens and is what I was trying to address with
patch 3/4.

-Scott
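
A sketch of the workaround floated above -- a hypothetical helper, not an
existing kernel interface: if preemption is being re-enabled with irqs off
and need_resched already set, poke the local CPU so the reschedule happens as
soon as irqs come back on instead of being lost.

static inline void preempt_enable_resched_safe(void)
{
	preempt_enable_no_resched();
	if (!need_resched())
		return;
	if (irqs_disabled())
		/* deliver the reschedule as soon as irqs are re-enabled */
		smp_send_reschedule(smp_processor_id());
	else
		preempt_schedule();
}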




Re: [RFC PATCH RT 4/4] rcutorture: Avoid problematic critical section nesting

2019-06-26 Thread Scott Wood
On Wed, 2019-06-26 at 11:08 -0400, Steven Rostedt wrote:
> On Fri, 21 Jun 2019 16:59:55 -0700
> "Paul E. McKenney"  wrote:
> 
> > I have no objection to the outlawing of a number of these sequences in
> > mainline, but am rather pointing out that until they really are outlawed
> > and eliminated, rcutorture must continue to test them in mainline.
> > Of course, an rcutorture running in -rt should avoid testing things that
> > break -rt, including these sequences.
> 
> We should update lockdep to complain about these sequences. That would
> "outlaw" them in mainline. That is, after we clean up all the current
> sequences in the code. And we also need to get Linus's approval of this
> as I believe he was against enforcing this in the past.

Was the opposition to prohibiting some specific sequence?  It's only certain
misnesting scenarios that are problematic.  The rcu_read_lock/
local_irq_disable restriction can be dropped with the IPI-to-self added in
Paul's tree.  Are there any known instances of the other two (besides
rcutorture)?

-Scott




Re: [RFC PATCH RT 3/4] rcu: unlock special: Treat irq and preempt disabled the same

2019-06-24 Thread Scott Wood
On Sat, 2019-06-22 at 12:13 -0700, Paul E. McKenney wrote:
> On Fri, Jun 21, 2019 at 05:26:06PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 20, 2019 at 06:08:19PM -0500, Scott Wood wrote:
> > > On Thu, 2019-06-20 at 15:25 -0700, Paul E. McKenney wrote:
> > > > On Thu, Jun 20, 2019 at 04:59:30PM -0500, Scott Wood wrote:
> > > > > On Thu, 2019-06-20 at 14:10 -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Jun 18, 2019 at 08:19:07PM -0500, Scott Wood wrote:
> > > > > > > [Note: Just before posting this I noticed that the
> > > > > > > invoke_rcu_core
> > > > > > > stuff
> > > > > > >  is part of the latest RCU pull request, and it has a patch
> > > > > > > that
> > > > > > >  addresses this in a more complicated way that appears to deal
> > > > > > > with
> > > > > > > the
> > > > > > >  bare irq-disabled sequence as well.
> > > > > > 
> > > > > > Far easier to deal with it than to debug the lack of it.  ;-)
> > > > > > 
> > > > > > >  Assuming we need/want to support such sequences, is the
> > > > > > >  invoke_rcu_core() call actually going to result in scheduling
> > > > > > > any
> > > > > > >  sooner?  resched_curr() just does the same setting of
> > > > > > > need_resched
> > > > > > >  when it's the same cpu.
> > > > > > > ]
> > > > > > 
> > > > > > Yes, invoke_rcu_core() can in some cases invoke the scheduler
> > > > > > sooner.
> > > > > > Setting the CPU-local bits might not have effect until the next
> > > > > > interrupt.
> > > > > 
> > > > > Maybe I'm missing something, but I don't see how (in the non-
> > > > > use_softirq
> > > > > case).  It just calls wake_up_process(), which in resched_curr()
> > > > > will
> > > > > set
> > > > > need_resched but not do an IPI-to-self.
> > > > 
> > > > The common non-rt case will be use_softirq.  Or are you referring
> > > > specifically to this block of code in current -rcu?
> > > > 
> > > > } else if (exp && irqs_were_disabled && !use_softirq &&
> > > >!t->rcu_read_unlock_special.b.deferred_qs) {
> > > > // Safe to awaken and we get no help from enabling
> > > > // irqs, unlike bh/preempt.
> > > > invoke_rcu_core();
> > > 
> > > Yes, that one.  If that block is removed the else path should be
> > > sufficient,
> > > now that an IPI-to-self has been added.
> > 
> > I will give it a try and let you know what happens.
> 
> How about the following?

Looks good, thanks.

-Scott




Re: [RFC PATCH RT 3/4] rcu: unlock special: Treat irq and preempt disabled the same

2019-06-20 Thread Scott Wood
On Thu, 2019-06-20 at 15:25 -0700, Paul E. McKenney wrote:
> On Thu, Jun 20, 2019 at 04:59:30PM -0500, Scott Wood wrote:
> > On Thu, 2019-06-20 at 14:10 -0700, Paul E. McKenney wrote:
> > > On Tue, Jun 18, 2019 at 08:19:07PM -0500, Scott Wood wrote:
> > > > [Note: Just before posting this I noticed that the invoke_rcu_core
> > > > stuff
> > > >  is part of the latest RCU pull request, and it has a patch that
> > > >  addresses this in a more complicated way that appears to deal with
> > > > the
> > > >  bare irq-disabled sequence as well.
> > > 
> > > Far easier to deal with it than to debug the lack of it.  ;-)
> > > 
> > > >  Assuming we need/want to support such sequences, is the
> > > >  invoke_rcu_core() call actually going to result in scheduling any
> > > >  sooner?  resched_curr() just does the same setting of need_resched
> > > >  when it's the same cpu.
> > > > ]
> > > 
> > > Yes, invoke_rcu_core() can in some cases invoke the scheduler sooner.
> > > Setting the CPU-local bits might not have effect until the next
> > > interrupt.
> > 
> > Maybe I'm missing something, but I don't see how (in the non-use_softirq
> > case).  It just calls wake_up_process(), which in resched_curr() will
> > set
> > need_resched but not do an IPI-to-self.
> 
> The common non-rt case will be use_softirq.  Or are you referring
> specifically to this block of code in current -rcu?
> 
>   } else if (exp && irqs_were_disabled && !use_softirq &&
>  !t->rcu_read_unlock_special.b.deferred_qs) {
>   // Safe to awaken and we get no help from enabling
>   // irqs, unlike bh/preempt.
>   invoke_rcu_core();

Yes, that one.  If that block is removed the else path should be sufficient,
now that an IPI-to-self has been added.

Also, shouldn't the IPI-to-self be conditioned on irqs_were_disabled? 
Besides that being the problem the IPI was meant to address, if irqs are
enabled the IPI is likely to happen before preempt is re-enabled and thus it
won't accomplish anything.

-Scott
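
Paraphrasing the ordering being suggested here, as a sketch rather than the
actual tree_plugin.h hunk (the deferred_qs_iw parameter stands in for
whatever irq_work object the real code uses):

static void sketch_deferred_qs_kick(struct irq_work *deferred_qs_iw,
				    bool exp, bool irqs_were_disabled)
{
	set_tsk_need_resched(current);
	set_preempt_need_resched();
	if (exp && irqs_were_disabled) {
		/*
		 * Only the irq-disabled case benefits from the IPI-to-self:
		 * the deferred QS cannot be reported until local_irq_enable(),
		 * and the queued irq_work fires exactly then.
		 */
		irq_work_queue(deferred_qs_iw);
	}
	/*
	 * With irqs already on, the IPI would land before preempt_enable()
	 * and change nothing, so the need_resched flags above are all that
	 * helps.
	 */
}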




Re: [RFC PATCH RT 3/4] rcu: unlock special: Treat irq and preempt disabled the same

2019-06-20 Thread Scott Wood
On Thu, 2019-06-20 at 14:10 -0700, Paul E. McKenney wrote:
> On Tue, Jun 18, 2019 at 08:19:07PM -0500, Scott Wood wrote:
> > [Note: Just before posting this I noticed that the invoke_rcu_core stuff
> >  is part of the latest RCU pull request, and it has a patch that
> >  addresses this in a more complicated way that appears to deal with the
> >  bare irq-disabled sequence as well.
> 
> Far easier to deal with it than to debug the lack of it.  ;-)
> 
> >  Assuming we need/want to support such sequences, is the
> >  invoke_rcu_core() call actually going to result in scheduling any
> >  sooner?  resched_curr() just does the same setting of need_resched
> >  when it's the same cpu.
> > ]
> 
> Yes, invoke_rcu_core() can in some cases invoke the scheduler sooner.
> Setting the CPU-local bits might not have effect until the next interrupt.

Maybe I'm missing something, but I don't see how (in the non-use_softirq
case).  It just calls wake_up_process(), which in resched_curr() will set
need_resched but not do an IPI-to-self.

-Scott




Re: [RFC PATCH RT 4/4] rcutorture: Avoid problematic critical section nesting

2019-06-20 Thread Scott Wood
On Thu, 2019-06-20 at 14:18 -0700, Paul E. McKenney wrote:
> On Tue, Jun 18, 2019 at 08:19:08PM -0500, Scott Wood wrote:
> > rcutorture was generating some nesting scenarios that are not
> > reasonable.  Constrain the state selection to avoid them.
> > 
> > Example #1:
> > 
> > 1. preempt_disable()
> > 2. local_bh_disable()
> > 3. preempt_enable()
> > 4. local_bh_enable()
> > 
> > On PREEMPT_RT, BH disabling takes a local lock only when called in
> > non-atomic context.  Thus, atomic context must be retained until after
> > BH
> > is re-enabled.  Likewise, if BH is initially disabled in non-atomic
> > context, it cannot be re-enabled in atomic context.
> > 
> > Example #2:
> > 
> > 1. rcu_read_lock()
> > 2. local_irq_disable()
> > 3. rcu_read_unlock()
> > 4. local_irq_enable()
> > 
> > If the thread is preempted between steps 1 and 2,
> > rcu_read_unlock_special.b.blocked will be set, but it won't be
> > acted on in step 3 because IRQs are disabled.  Thus, reporting of the
> > quiescent state will be delayed beyond the local_irq_enable().
> > 
> > Example #3:
> > 
> > 1. preempt_disable()
> > 2. local_irq_disable()
> > 3. preempt_enable()
> > 4. local_irq_enable()
> > 
> > If need_resched is set between steps 1 and 2, then the reschedule
> > in step 3 will not happen.
> > 
> > Signed-off-by: Scott Wood 
> 
> OK for -rt, but as long as people can code those sequences without getting
> their wrists slapped, RCU needs to deal with it.  So I cannot accept
> this in mainline at the current time.  Yes, I will know when it is safe
> to accept it when rcutorture's virtual wrist gets slapped in mainline.
> Why did you ask?  ;-)

Hence the "TODO: Document restrictions and add debug checks for invalid
sequences". :-)

> But I have to ask...  With this elaboration, is it time to make this a
> data-driven state machine?  Or is the complexity not yet to the point
> where that would constitute a simplification?

Perhaps... Making the entire sequence visible to the constraint enforcement
would also allow relaxing some of the constraints that currently have to
assume the worst regarding what happened before the most recent state.

-Scott
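
For Example #3 quoted above, a stand-alone toy model (not kernel code; all
functions here are fakes that just mimic the relevant checks) shows why the
wakeup is lost: the reschedule on preempt_enable() only fires when both
preemption and irqs are enabled, and a bare local_irq_enable() never checks
need_resched at all.

#include <stdbool.h>
#include <stdio.h>

static bool need_resched, irqs_off;
static int preempt_count;

static void maybe_schedule(const char *where)
{
	if (need_resched && !preempt_count && !irqs_off) {
		printf("schedule() from %s\n", where);
		need_resched = false;
	}
}

static void preempt_disable(void)   { preempt_count++; }
static void preempt_enable(void)    { preempt_count--; maybe_schedule("preempt_enable"); }
static void local_irq_disable(void) { irqs_off = true; }
static void local_irq_enable(void)  { irqs_off = false; /* no need_resched check here */ }

int main(void)
{
	preempt_disable();
	need_resched = true;    /* wakeup arrives between steps 1 and 2 */
	local_irq_disable();
	preempt_enable();       /* reschedule skipped: irqs still off */
	local_irq_enable();     /* never checks need_resched */
	printf("need_resched still set: %d\n", need_resched);
	return 0;
}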



