[tip: core/rcu] rcu: Remove superfluous rdp fetch

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: d3ad5bbc4da70c25ad6b386e038e711d0755767b
Gitweb:
https://git.kernel.org/tip/d3ad5bbc4da70c25ad6b386e038e711d0755767b
Author:Frederic Weisbecker 
AuthorDate:Wed, 06 Jan 2021 23:07:15 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:17:35 -08:00

rcu: Remove superfluous rdp fetch

Cc: Rafael J. Wysocki 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f521..cdf091f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -648,7 +648,6 @@ static noinstr void rcu_eqs_enter(bool user)
instrumentation_begin();
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
-   rdp = this_cpu_ptr(&rcu_data);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
 


[tip: core/rcu] rcu/nocb: Fix missed nocb_timer requeue

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: b2fcf2102049f6e56981e0ab3d9b633b8e2741da
Gitweb:
https://git.kernel.org/tip/b2fcf2102049f6e56981e0ab3d9b633b8e2741da
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:09:59 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Fix missed nocb_timer requeue

This sequence of events can lead to a failure to requeue a CPU's
->nocb_timer:

1.  There are no callbacks queued for any CPU covered by CPU 0-2's
->nocb_gp_kthread.  Note that ->nocb_gp_kthread is associated
with CPU 0.

2.  CPU 1 enqueues its first callback with interrupts disabled, and
thus must defer awakening its ->nocb_gp_kthread.  It therefore
queues its rcu_data structure's ->nocb_timer.  At this point,
CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE.

3.  CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a
callback, but with interrupts enabled, allowing it to directly
awaken the ->nocb_gp_kthread.

4.  The newly awakened ->nocb_gp_kthread associates both CPU 1's
and CPU 2's callbacks with a future grace period and arranges
for that grace period to be started.

5.  This ->nocb_gp_kthread goes to sleep waiting for the end of this
future grace period.

6.  This grace period elapses before CPU 1's timer fires.  This
is normally improbable given that the timer is set for only
one jiffy, but timers can be delayed.  Besides, it is possible
that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y.

7.  The grace period ends, so rcu_gp_kthread awakens the
->nocb_gp_kthread, which in turn awakens both CPU 1's and
CPU 2's ->nocb_cb_kthread.  Then ->nocb_gp_kthread sleeps
waiting for more newly queued callbacks.

8.  CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps
waiting for more invocable callbacks.

9.  Note that neither kthread updated any ->nocb_timer state,
so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE.

10. CPU 1 enqueues its second callback, this time with interrupts
enabled so it can directly awaken ->nocb_gp_kthread.  It does so
by calling wake_nocb_gp(), which also cancels the pending timer
that got queued in step 2.  But that doesn't reset CPU 1's
->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE.  So
CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now
desynchronized.

11. ->nocb_gp_kthread associates the callback queued in 10 with a new
grace period, arranges for that grace period to start and sleeps
waiting for it to complete.

12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread,
which in turn wakes up CPU 1's ->nocb_cb_kthread which then
invokes the callback queued in 10.

13. CPU 1 enqueues its third callback, this time with interrupts
disabled so it must queue a timer for a deferred wakeup.  However,
the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE, which
incorrectly indicates that a timer is already queued.  But CPU 1's
->nocb_timer was in fact cancelled back in step 10, so CPU 1 fails
to queue the ->nocb_timer.

14. CPU 1's pending callback may therefore go unnoticed until some
other CPU eventually wakes up ->nocb_gp_kthread or until CPU 1
performs an explicit deferred wakeup, for example, during idle entry.

This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time
we delete the ->nocb_timer.
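
In code form, the intended invariant in wake_nocb_gp() looks like this (a
minimal sketch lifted from the hunk below, with the surrounding locking
elided):

	/* Cancel the deferred-wakeup timer and reset the deferral state
	 * together, so ->nocb_defer_wakeup can never claim a timer that
	 * has already been cancelled. */
	if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
		WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
		del_timer(&rdp->nocb_timer);
	}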

It is quite possible that there is a similar scenario involving
->nocb_bypass_timer and ->nocb_defer_wakeup.  However, despite some
effort from several people, a failure scenario has not yet been located.
That by no means guarantees that no such scenario exists, so finding
one is left as an exercise for the reader.  Note that the "Fixes:"
tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.

Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing)
Cc: 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Boqun Feng 
Reviewed-by: Neeraj Upadhyay 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index a1a17ad..e392bd1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1708,7 +1708,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
-   del_timer(&rdp->nocb_timer);
+
+   if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) {
+   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
+   del_timer(&rdp->nocb_timer);
+  

[tip: core/rcu] rcu/nocb: Detect unsafe checks for offloaded rdp

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3820b513a2e33d6dee1caa3b4815f92079cb9890
Gitweb:
https://git.kernel.org/tip/3820b513a2e33d6dee1caa3b4815f92079cb9890
Author:Frederic Weisbecker 
AuthorDate:Thu, 12 Nov 2020 01:51:21 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:20 -08:00

rcu/nocb: Detect unsafe checks for offloaded rdp

Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading
the offloaded state of an rdp in a safe and stable way and prevent its
value from being changed under us.  We must either hold the barrier mutex,
the cpu-hotplug lock (read or write) or the nocb lock.  Local
non-preemptible reads are also safe.  NOCB kthreads and timers have
their own means of synchronization against the offloaded state updaters.
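
For illustration, a minimal sketch of such a check (not the verbatim kernel
code, which lives in the truncated tree_plugin.h hunk below;
rcu_lockdep_is_held_nocb() and rcu_current_is_nocb_kthread() are assumed
helpers standing in for the nocb-lock and kthread/timer cases):

	static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
	{
		/*
		 * Under CONFIG_PROVE_RCU, complain unless the caller holds one
		 * of the locks that stabilize the offloaded state, is a local
		 * non-preemptible reader, or is one of the nocb kthreads/timers.
		 */
		RCU_LOCKDEP_WARN(
			!(lockdep_is_held(&rcu_state.barrier_mutex) ||
			  (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
			  rcu_lockdep_is_held_nocb(rdp) ||		/* assumed helper */
			  (rdp == this_cpu_ptr(&rcu_data) &&
			   !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
			  rcu_current_is_nocb_kthread(rdp)),		/* assumed helper */
			"Unsafe read of RCU_NOCB offloaded state");

		return rcu_segcblist_is_offloaded(&rdp->cblist);
	}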

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c| 21 -
 kernel/rcu/tree_plugin.h | 90 ---
 2 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index da6f521..03503e2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -156,6 +156,7 @@ static void invoke_rcu_core(void);
 static void rcu_report_exp_rdp(struct rcu_data *rdp);
 static void sync_sched_exp_online_cleanup(int cpu);
 static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp);
+static bool rcu_rdp_is_offloaded(struct rcu_data *rdp);
 
 /* rcuc/rcub kthread realtime priority */
 static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
@@ -1672,7 +1673,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp)
 {
bool ret = false;
bool need_qs;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
 
raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -2128,7 +2129,7 @@ static void rcu_gp_cleanup(void)
needgp = true;
}
/* Advance CBs to reduce false positives below. */
-   offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   offloaded = rcu_rdp_is_offloaded(rdp);
if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) {
WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT);
WRITE_ONCE(rcu_state.gp_req_activity, jiffies);
@@ -2327,7 +2328,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
unsigned long flags;
unsigned long mask;
bool needwake = false;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
struct rcu_node *rnp;
 
WARN_ON_ONCE(rdp->cpu != smp_processor_id());
@@ -2497,7 +2498,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
int div;
bool __maybe_unused empty;
unsigned long flags;
-   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_rdp_is_offloaded(rdp);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count = 0;
@@ -3066,7 +3067,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
 
/* Go handle any RCU core processing required. */
-   if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
+   if (unlikely(rcu_rdp_is_offloaded(rdp))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
} else {
__call_rcu_core(rdp, head, flags);
@@ -3843,13 +3844,13 @@ static int rcu_pending(int user)
return 1;
 
/* Does this CPU have callbacks ready to invoke? */
-   if (!rcu_segcblist_is_offloaded(&rdp->cblist) &&
+   if (!rcu_rdp_is_offloaded(rdp) &&
rcu_segcblist_ready_cbs(&rdp->cblist))
return 1;
 
/* Has RCU gone idle with this CPU needing another grace period? */
if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
-   !rcu_segcblist_is_offloaded(&rdp->cblist) &&
+   !rcu_rdp_is_offloaded(rdp) &&
!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
return 1;
 
@@ -3968,7 +3969,7 @@ void rcu_barrier(void)
for_each_possible_cpu(cpu) {
rdp = per_cpu_ptr(&rcu_data, cpu);
if (cpu_is_offline(cpu) &&
-   !rcu_segcblist_is_offloaded(&rdp->cblist))
+   !rcu_rdp_is_offloaded(rdp))
continue;
if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
rcu_barrier_trace(TPS("OnlineQ"), cpu,
@@ -4291,7 +4292,7 @@ void rcutree_migrate_callbacks(int cpu)
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
bool 

[tip: core/rcu] rcu/nocb: Comment the reason behind BH disablement on batch processing

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 5de2e5bb80aeef82f75fff76120874cdc86f935d
Gitweb:
https://git.kernel.org/tip/5de2e5bb80aeef82f75fff76120874cdc86f935d
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:08 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:20 -08:00

rcu/nocb: Comment the reason behind BH disablement on batch processing

This commit explains why softirqs need to be disabled while invoking
callbacks, even when callback processing has been offloaded.  After
all, invoking callbacks concurrently is one thing, but concurrently
invoking the same callback is quite another.

Reported-by: Boqun Feng 
Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cd513ea..013142d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2235,6 +2235,12 @@ static void nocb_cb_wait(struct rcu_data *rdp)
local_irq_save(flags);
rcu_momentary_dyntick_idle();
local_irq_restore(flags);
+   /*
+    * Disable BH to provide the expected environment.  Also, when
+    * transitioning to/from NOCB mode, a self-requeuing callback might
+    * be invoked from softirq.  A short grace period could cause both
+    * instances of this callback to execute concurrently.
+    */
local_bh_disable();
rcu_do_batch(rdp);
local_bh_enable();


[tip: core/rcu] rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 8a682b3974c36853b52fc8ede14dee966e96e19f
Gitweb:
https://git.kernel.org/tip/8a682b3974c36853b52fc8ede14dee966e96e19f
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:12 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:21 -08:00

rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep

The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to
true after invoking the callbacks, and then sets it back to false if
it finds more callbacks that are ready to invoke.

This is confusing and will become unsafe if this flag is ever read
locklessly.  This commit therefore writes it only once, based on the
state after both callback invocation and checking.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9fd8588..6a7f77d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2230,6 +2230,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
unsigned long flags;
bool needwake_state = false;
bool needwake_gp = false;
+   bool can_sleep = true;
struct rcu_node *rnp = rdp->mynode;
 
local_irq_save(flags);
@@ -2253,8 +2254,6 @@ static void nocb_cb_wait(struct rcu_data *rdp)
raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
}
 
-   WRITE_ONCE(rdp->nocb_cb_sleep, true);
-
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) {
rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB);
@@ -2262,7 +2261,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
needwake_state = true;
}
if (rcu_segcblist_ready_cbs(cblist))
-   WRITE_ONCE(rdp->nocb_cb_sleep, false);
+   can_sleep = false;
} else {
/*
 * De-offloading. Clear our flag and notify the de-offload 
worker.
@@ -2275,6 +2274,8 @@ static void nocb_cb_wait(struct rcu_data *rdp)
needwake_state = true;
}
 
+   WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep);
+
if (rdp->nocb_cb_sleep)
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep"));
 


[tip: core/rcu] rcu/nocb: Forbid NOCB toggling on offline CPUs

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 64305db2856b969a5d48e8f3a5b0d06b5594591c
Gitweb:
https://git.kernel.org/tip/64305db2856b969a5d48e8f3a5b0d06b5594591c
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:09 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:21 -08:00

rcu/nocb: Forbid NOCB toggling on offline CPUs

It makes no sense to de-offload an offline CPU because that CPU will never
invoke any remaining callbacks.  It also makes little sense to offload an
offline CPU because any pending RCU callbacks were migrated when that CPU
went offline.  Yes, it is in theory possible to use a number of tricks
to permit offloading and deoffloading offline CPUs in certain cases, but
in practice it is far better to have the simple and deterministic rule
"Toggling the offload state of an offline CPU is forbidden".

For but one example, consider that an offloaded offline CPU might have
millions of callbacks queued.  Best to just say "no".

This commit therefore forbids toggling of the offloaded state of
offline CPUs.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c|  3 +--
 kernel/rcu/tree_plugin.h | 57 ++-
 2 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 03503e2..ee77858 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4086,8 +4086,7 @@ int rcutree_prepare_cpu(unsigned int cpu)
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
/*
 * Lock in case the CB/GP kthreads are still around handling
-* old callbacks (longer term we should flush all callbacks
-* before completing CPU offline)
+* old callbacks.
 */
rcu_nocb_lock(rdp);
if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 013142d..9fd8588 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2399,23 +2399,18 @@ static int rdp_offload_toggle(struct rcu_data *rdp,
return 0;
 }
 
-static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
+static long rcu_nocb_rdp_deoffload(void *arg)
 {
+   struct rcu_data *rdp = arg;
struct rcu_segcblist *cblist = &rdp->cblist;
unsigned long flags;
int ret;
 
+   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
+
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-   /*
-* If there are still pending work offloaded, the offline
-* CPU won't help much handling them.
-*/
-   if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(&rdp->cblist)) {
-   rcu_nocb_unlock_irqrestore(rdp, flags);
-   return -EBUSY;
-   }
 
ret = rdp_offload_toggle(rdp, false, flags);
swait_event_exclusive(rdp->nocb_state_wq,
@@ -2446,14 +2441,6 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
return ret;
 }
 
-static long rcu_nocb_rdp_deoffload(void *arg)
-{
-   struct rcu_data *rdp = arg;
-
-   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
-   return __rcu_nocb_rdp_deoffload(rdp);
-}
-
 int rcu_nocb_cpu_deoffload(int cpu)
 {
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
@@ -2466,12 +2453,14 @@ int rcu_nocb_cpu_deoffload(int cpu)
mutex_lock(&rcu_state.barrier_mutex);
cpus_read_lock();
if (rcu_rdp_is_offloaded(rdp)) {
-   if (cpu_online(cpu))
+   if (cpu_online(cpu)) {
ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp);
-   else
-   ret = __rcu_nocb_rdp_deoffload(rdp);
-   if (!ret)
-   cpumask_clear_cpu(cpu, rcu_nocb_mask);
+   if (!ret)
+   cpumask_clear_cpu(cpu, rcu_nocb_mask);
+   } else {
+   pr_info("NOCB: Can't CB-deoffload an offline CPU\n");
+   ret = -EINVAL;
+   }
}
cpus_read_unlock();
mutex_unlock(&rcu_state.barrier_mutex);
@@ -2480,12 +2469,14 @@ int rcu_nocb_cpu_deoffload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload);
 
-static int __rcu_nocb_rdp_offload(struct rcu_data *rdp)
+static long rcu_nocb_rdp_offload(void *arg)
 {
+   struct rcu_data *rdp = arg;
struct rcu_segcblist *cblist = &rdp->cblist;
unsigned long flags;
int ret;
 
+   WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
/*
 * For now we only support re-offload, ie: the rdp must have been
 * offloaded on boot first.
@@ -2525,14 +2516,6 @@ static int __rcu_nocb_rdp_offload(struct rcu_data *rdp)

[tip: core/rcu] rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 55adc3e1c82a25e99e9efef4f2b14b8b4806918a
Gitweb:
https://git.kernel.org/tip/55adc3e1c82a25e99e9efef4f2b14b8b4806918a
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:13 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:22 -08:00

rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading

The name nocb_gp_update_state() is unenlightening, so this commit changes
it to nocb_gp_update_state_deoffloading().  This function now does what
its name says, updates state and returns true if the CPU corresponding to
the specified rcu_data structure is in the process of being de-offloaded.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h |  9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6a7f77d..93d3938 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2016,7 +2016,8 @@ static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp)
return rcu_segcblist_test_flags(&rdp->cblist, flags);
 }
 
-static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool 
*needwake_state)
+static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp,
+bool *needwake_state)
 {
struct rcu_segcblist *cblist = &rdp->cblist;
 
@@ -2026,7 +2027,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
*needwake_state = true;
}
-   return true;
+   return false;
}
 
/*
@@ -2037,7 +2038,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta
rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP);
if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
*needwake_state = true;
-   return false;
+   return true;
 }
 
 
@@ -2075,7 +2076,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
continue;
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
rcu_nocb_lock_irqsave(rdp, flags);
-   if (!nocb_gp_update_state(rdp, &needwake_state)) {
+   if (nocb_gp_update_state_deoffloading(rdp, &needwake_state)) {
rcu_nocb_unlock_irqrestore(rdp, flags);
if (needwake_state)
swake_up_one(&rdp->nocb_state_wq);


[tip: core/rcu] rcu/nocb: Only (re-)initialize segcblist when needed on CPU up

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ec711bc12c777b1165585f59f7a6c35a89e04cc3
Gitweb:
https://git.kernel.org/tip/ec711bc12c777b1165585f59f7a6c35a89e04cc3
Author:Frederic Weisbecker 
AuthorDate:Thu, 28 Jan 2021 18:12:10 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 08 Mar 2021 14:20:22 -08:00

rcu/nocb: Only (re-)initialize segcblist when needed on CPU up

At the start of a CPU-hotplug operation, the incoming CPU's callback
list can be in a number of states:

1.  Disabled and empty.  This is the case when the boot CPU has
not invoked call_rcu(), when a non-boot CPU first comes online,
and when a non-offloaded CPU comes back online.  In this case,
it is both necessary and permissible to initialize ->cblist.
Because either the CPU is currently running with interrupts
disabled (boot CPU) or is not yet running at all (other CPUs),
it is not necessary to acquire ->nocb_lock.

In this case, initialization is required.

2.  Disabled and non-empty.  This cannot occur, because early boot
call_rcu() invocations enable the callback list before enqueuing
their callback.

3.  Enabled, whether empty or not.  In this case, the callback
list has already been initialized.  This case occurs when the
boot CPU has executed an early boot call_rcu() and also when
an offloaded CPU comes back online.  In both cases, there is
no need to initialize the callback list: In the boot-CPU case,
the CPU has not (yet) gone offline, and in the offloaded case,
the rcuo kthreads are taking care of business.

Because it is not necessary to initialize the callback list,
it is also not necessary to acquire ->nocb_lock.

Therefore, checking if the segcblist is enabled suffices.  This commit
therefore initializes the callback list at rcutree_prepare_cpu() time
only if that list is disabled.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c |  9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ee77858..402ea36 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4084,14 +4084,13 @@ int rcutree_prepare_cpu(unsigned int cpu)
rdp->dynticks_nesting = 1;  /* CPU not up, no tearing. */
rcu_dynticks_eqs_online();
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
+
/*
-* Lock in case the CB/GP kthreads are still around handling
-* old callbacks.
+* Only non-NOCB CPUs that didn't have early-boot callbacks need to be
+* (re-)initialized.
 */
-   rcu_nocb_lock(rdp);
-   if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */
+   if (!rcu_segcblist_is_enabled(&rdp->cblist))
rcu_segcblist_init(&rdp->cblist);  /* Re-enable callbacks. */
-   rcu_nocb_unlock(rdp);
 
/*
 * Add CPU to leaf rcu_node pending-online bitmask.  Any needed


[tip: core/rcu] rcu/nocb: Disable bypass when CPU isn't completely offloaded

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 76d00b494d7962e88d4bbd4135f34aba9019c67f
Gitweb:
https://git.kernel.org/tip/76d00b494d7962e88d4bbd4135f34aba9019c67f
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:00 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Disable bypass when CPU isn't completely offloaded

Currently, the bypass is flushed at the very last moment in the
deoffloading procedure.  However, this approach leads to a larger state
space than would be preferred.  This commit therefore disables the
bypass as soon as the deoffloading procedure begins, then flushes it.
This guarantees that the bypass remains empty and thus out of the way
of the deoffloading procedure.

Symmetrically, this commit waits to enable the bypass until the offloading
procedure has completed.

Reported-by: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h |  7 +++---
 kernel/rcu/tree_plugin.h  | 38 +-
 2 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 8afe886..3db96c4 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -109,7 +109,7 @@ struct rcu_cblist {
  *  |                           SEGCBLIST_KTHREAD_GP                          |
  *  |                                                                         |
  *  |   Kthreads handle callbacks holding nocb_lock, local rcu_core() stops   |
- *  |   handling callbacks.                                                   |
+ *  |   handling callbacks. Enable bypass queueing.                           |
  *  ----------------------------------------------------------------------------
  */
 
@@ -125,7 +125,7 @@ struct rcu_cblist {
  *  |                           SEGCBLIST_KTHREAD_GP                          |
  *  |                                                                         |
  *  |   CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core()   |
- *  |   ignores callbacks.                                                    |
+ *  |   ignores callbacks. Bypass enqueue is enabled.                         |
  *  ----------------------------------------------------------------------------
  *                                         |
  *                                         v
@@ -134,7 +134,8 @@ struct rcu_cblist {
  *  |                           SEGCBLIST_KTHREAD_GP                          |
  *  |                                                                         |
  *  |   CB/GP kthreads and local rcu_core() handle callbacks concurrently     |
- *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary.           |
+ *  |   holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable   |
+ *  |   bypass enqueue.                                                       |
  *  ----------------------------------------------------------------------------
  *  |
  *  v
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e392bd1..b08564b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1830,11 +1830,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
unsigned long j = jiffies;
long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 
+   lockdep_assert_irqs_disabled();
+
+   // Pure softirq/rcuc based processing: no bypassing, no
+   // locking.
if (!rcu_rdp_is_offloaded(rdp)) {
+   *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+   return false;
+   }
+
+   // In the process of (de-)offloading: no bypassing, but
+   // locking.
+   if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) {
+   rcu_nocb_lock(rdp);
+   *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
return false; /* Not offloaded, no bypassing. */
}
-   lockdep_assert_irqs_disabled();
 
// Don't use ->nocb_bypass during early boot.
if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) {
@@ -2416,7 +2427,16 @@ static long rcu_nocb_rdp_deoffload(void *arg)
pr_info("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
-
+   /*
+* Flush once and for all now. This suffices because we are
+* running on the target CPU holding ->nocb_lock (thus having
+* interrupts disabled), and because rdp_offload_toggle()
+* invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED.
+* Thus future calls to rcu_segcblist_completely_offloaded() will
+* return false, which 

[tip: core/rcu] rcu/nocb: Remove stale comment above rcu_segcblist_offload()

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 0efdf14a9f83618335a0849df3586808bff36cfb
Gitweb:
https://git.kernel.org/tip/0efdf14a9f83618335a0849df3586808bff36cfb
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:01 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:54 -07:00

rcu/nocb: Remove stale comment above rcu_segcblist_offload()

This commit removes a stale comment claiming that the cblist must be
empty before changing the offloading state.  This claim was correct back
when the offloaded state was defined exclusively at boot.

Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 7f181c9..aaa1112 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
 }
 
 /*
- * Mark the specified rcu_segcblist structure as offloaded.  This
- * structure must be empty.
+ * Mark the specified rcu_segcblist structure as offloaded.
  */
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload)
 {


[tip: core/rcu] rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible

2021-04-11 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: e02691b7ef51c5fac0eee5a6ebde45ce92958fae
Gitweb:
https://git.kernel.org/tip/e02691b7ef51c5fac0eee5a6ebde45ce92958fae
Author:Frederic Weisbecker 
AuthorDate:Tue, 23 Feb 2021 01:10:02 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 15 Mar 2021 13:54:55 -07:00

rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible

Those tracing calls don't need to be under ->nocb_lock.  This commit
therefore moves them outside of that lock.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b08564b..9846c8a 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1703,9 +1703,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
 
lockdep_assert_held(&rdp->nocb_lock);
if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return false;
}
 
@@ -1955,9 +1955,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
// If we are being polled or there is no kthread, just leave.
t = READ_ONCE(rdp->nocb_gp_kthread);
if (rcu_nocb_poll || !t) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("WakeNotPoll"));
-   rcu_nocb_unlock_irqrestore(rdp, flags);
return;
}
// Need to actually to a wakeup.
@@ -1992,8 +1992,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, 
bool was_alldone,
   TPS("WakeOvfIsDeferred"));
rcu_nocb_unlock_irqrestore(rdp, flags);
} else {
-   trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
rcu_nocb_unlock_irqrestore(rdp, flags);
+   trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot"));
}
return;
 }


[tip: sched/core] static_call: Provide DEFINE_STATIC_CALL_RET0()

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 29fd01944b7273bb630c649a2104b7f9e4ef3fa6
Gitweb:
https://git.kernel.org/tip/29fd01944b7273bb630c649a2104b7f9e4ef3fa6
Author:Frederic Weisbecker 
AuthorDate:Mon, 18 Jan 2021 15:12:17 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:08:51 +01:00

static_call: Provide DEFINE_STATIC_CALL_RET0()

DECLARE_STATIC_CALL() must pass the original function targeted for a
given static call. But DEFINE_STATIC_CALL() may want to initialize it as
off. In this case we can't pass NULL (for functions without return value)
or __static_call_return0 (for functions returning a value) directly
to DEFINE_STATIC_CALL() as that may trigger a static call redeclaration
with a different function prototype. Nor can type casts work around
that, as they don't get along with typeof().

The proper way to do that for functions that don't return a value is
to use DEFINE_STATIC_CALL_NULL(). But functions returning an actual value
don't have an equivalent yet.

Provide DEFINE_STATIC_CALL_RET0() to solve this situation.
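
A hedged usage sketch (my_hook and do_hook are made-up names; only the
DEFINE_STATIC_CALL_RET0() macro itself comes from this patch):

	int do_hook(int arg);				/* real implementation, defined elsewhere */

	/* The key starts out patched to __static_call_return0(). */
	DEFINE_STATIC_CALL_RET0(my_hook, do_hook);

	void hook_user(void)
	{
		int v = static_call(my_hook)(42);	/* returns 0 until updated */

		static_call_update(my_hook, do_hook);	/* switch to the real function */
		v = static_call(my_hook)(42);		/* now calls do_hook(42) */
	}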

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lkml.kernel.org/r/20210118141223.123667-3-frede...@kernel.org
---
 include/linux/static_call.h | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index bd6735d..d69dd8b 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -144,13 +144,13 @@ extern int static_call_text_reserved(void *start, void 
*end);
 
 extern long __static_call_return0(void);
 
-#define DEFINE_STATIC_CALL(name, _func)					\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
.type = 1,  \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -178,12 +178,12 @@ struct static_call_key {
void *func;
 };
 
-#define DEFINE_STATIC_CALL(name, _func)					\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -234,10 +234,10 @@ static inline long __static_call_return0(void)
return 0;
 }
 
-#define DEFINE_STATIC_CALL(name, _func)					\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
}
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
@@ -286,4 +286,10 @@ static inline int static_call_text_reserved(void *start, 
void *end)
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
 
+#define DEFINE_STATIC_CALL(name, _func)					\
+   __DEFINE_STATIC_CALL(name, _func, _func)
+
+#define DEFINE_STATIC_CALL_RET0(name, _func)   \
+   __DEFINE_STATIC_CALL(name, _func, __static_call_return0)
+
 #endif /* _LINUX_STATIC_CALL_H */


[tip: sched/core] rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 54b7429efffc99e845ba9381bee3244f012a06c2
Gitweb:
https://git.kernel.org/tip/54b7429efffc99e845ba9381bee3244f012a06c2
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:44 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently depending on whether it is initiated by idle, user
or guest entry. Prepare for this by pulling that control up to the
rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frede...@kernel.org
---
 kernel/rcu/tree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..63032e5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,7 +644,6 @@ static noinstr void rcu_eqs_enter(bool user)
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
rdp = this_cpu_ptr(&rcu_data);
-   do_nocb_deferred_wakeup(rdp);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
 
@@ -672,7 +671,10 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
lockdep_assert_irqs_disabled();
+   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -691,7 +693,14 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
  */
 noinstr void rcu_user_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
lockdep_assert_irqs_disabled();
+
+   instrumentation_begin();
+   do_nocb_deferred_wakeup(rdp);
+   instrumentation_end();
+
rcu_eqs_enter(true);
 }
 #endif /* CONFIG_NO_HZ_FULL */


[tip: sched/core] rcu/nocb: Perform deferred wake up before last idle's need_resched() check

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 43789ef3f7d61aa7bed0cb2764e588fc990c30ef
Gitweb:
https://git.kernel.org/tip/43789ef3f7d61aa7bed0cb2764e588fc990c30ef
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:45 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:43 +01:00

rcu/nocb: Perform deferred wake up before last idle's need_resched() check

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
and placing it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.
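
In simplified form, the intended ordering in the idle loop is roughly the
following (a rough sketch of do_idle()'s shape, not the exact code; see the
kernel/sched/idle.c hunk below for the actual one-line placement):

	while (!need_resched()) {
		arch_cpu_idle_enter();

		/* Flush any deferred rcuog wakeup; this may set need_resched(). */
		rcu_nocb_flush_deferred_wakeup();

		/*
		 * The cpuidle path re-checks need_resched() with interrupts
		 * disabled before halting, so a wakeup raised just above is
		 * still seen before the CPU enters low power mode.
		 */
		cpuidle_idle_call();
		arch_cpu_idle_exit();
	}
	schedule_idle();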

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frede...@kernel.org
---
 include/linux/rcupdate.h | 2 ++
 kernel/rcu/tree.c| 3 ---
 kernel/rcu/tree_plugin.h | 5 +
 kernel/sched/idle.c  | 1 +
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index fd02c5f..36c2119 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -110,8 +110,10 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+void rcu_nocb_flush_deferred_wakeup(void);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 /**
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 63032e5..82838e9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -671,10 +671,7 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
-   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
lockdep_assert_irqs_disabled();
-   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7e291ce..d5b38c2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2187,6 +2187,11 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
do_nocb_deferred_wakeup_common(rdp);
 }
 
+void rcu_nocb_flush_deferred_wakeup(void)
+{
+   do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
+}
+
 void __init rcu_init_nohz(void)
 {
int cpu;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 305727e..7199e6f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -285,6 +285,7 @@ static void do_idle(void)
}
 
arch_cpu_idle_enter();
+   rcu_nocb_flush_deferred_wakeup();
 
/*
 * In poll mode we reenable interrupts and spin. Also if we


[tip: sched/core] entry: Explicitly flush pending rcuog wakeup before last rescheduling point

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 47b8ff194c1fd73d58dc339b597d466fe48c8958
Gitweb:
https://git.kernel.org/tip/47b8ff194c1fd73d58dc339b597d466fe48c8958
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:47 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:43 +01:00

entry: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point on resuming to user mode. This
way we can avoid doing it from rcu_user_enter() with the last-resort
self-IPI hack that enforces rescheduling.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-5-frede...@kernel.org
---
 kernel/entry/common.c |  7 +++
 kernel/rcu/tree.c | 12 +++-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f09cae3..8442e5c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -184,6 +184,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs 
*regs,
 * enabled above.
 */
local_irq_disable_exit_to_user();
+
+   /* Check if any of the above work has queued a deferred wakeup */
+   rcu_nocb_flush_deferred_wakeup();
+
ti_work = READ_ONCE(current_thread_info()->flags);
}
 
@@ -197,6 +201,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
lockdep_assert_irqs_disabled();
 
+   /* Flush pending rcuog wakeup before the last need_resched() check */
+   rcu_nocb_flush_deferred_wakeup();
+
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4b1e5bd..2ebc211 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -707,13 +707,15 @@ noinstr void rcu_user_enter(void)
lockdep_assert_irqs_disabled();
 
/*
-* We may be past the last rescheduling opportunity in the entry code.
-* Trigger a self IPI that will fire and reschedule once we resume to
-* user/guest mode.
+* Other than generic entry implementation, we may be past the last
+* rescheduling opportunity in the entry code. Trigger a self IPI
+* that will fire and reschedule once we resume in user/guest mode.
 */
instrumentation_begin();
-   if (do_nocb_deferred_wakeup(rdp) && need_resched())
-   irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+   if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
+   if (do_nocb_deferred_wakeup(rdp) && need_resched())
+   irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+   }
instrumentation_end();
 
rcu_eqs_enter(true);


[tip: sched/core] rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: f8bb5cae9616224a39cbb399de382d36ac41df10
Gitweb:
https://git.kernel.org/tip/f8bb5cae9616224a39cbb399de382d36ac41df10
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:46 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:43 +01:00

rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsite is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that wants a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frede...@kernel.org
---
 kernel/rcu/tree.c| 21 -
 kernel/rcu/tree.h|  2 +-
 kernel/rcu/tree_plugin.h | 25 -
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 82838e9..4b1e5bd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -677,6 +677,18 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
+
+/*
+ * An empty function that will trigger a reschedule on
+ * IRQ tail once IRQs get re-enabled on userspace resume.
+ */
+static void late_wakeup_func(struct irq_work *work)
+{
+}
+
+static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
+   IRQ_WORK_INIT(late_wakeup_func);
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -694,12 +706,19 @@ noinstr void rcu_user_enter(void)
 
lockdep_assert_irqs_disabled();
 
+   /*
+* We may be past the last rescheduling opportunity in the entry code.
+* Trigger a self IPI that will fire and reschedule once we resume to
+* user/guest mode.
+*/
instrumentation_begin();
-   do_nocb_deferred_wakeup(rdp);
+   if (do_nocb_deferred_wakeup(rdp) && need_resched())
+   irq_work_queue(this_cpu_ptr(&late_wakeup_work));
instrumentation_end();
 
rcu_eqs_enter(true);
 }
+
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7708ed1..9226f40 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, 
struct rcu_head *rhp,
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
+static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_cpu_nocb_kthread(int cpu);
 static void __init rcu_spawn_nocb_kthreads(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d5b38c2..384856e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1631,8 +1631,8 @@ bool rcu_is_nocb_cpu(int cpu)
  * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
  * and this function releases it.
  */
-static void wake_nocb_gp(struct rcu_data *rdp, bool force,
-  unsigned long flags)
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
+unsigned long flags)
__releases(rdp->nocb_lock)
 {
bool needwake = false;
@@ -1643,7 +1643,7 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
rcu_nocb_unlock_irqrestore(rdp, flags);
-   return;
+   return false;
}
del_timer(&rdp->nocb_timer);
rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1656,6 +1656,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
if (needwake)
wake_up_process(rdp_gp->nocb_gp_kthread);
+
+   return needwake;
 }
 
 /*
@@ -2152,20 +2154,23 @@ static int rcu_nocb_need_deferred_wakeup(struct 
rcu_data *rdp)
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static void 

[tip: sched/core] entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point

2021-02-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 4ae7dc97f726ea95c58ac58af71cc034ad22d7de
Gitweb:
https://git.kernel.org/tip/4ae7dc97f726ea95c58ac58af71cc034ad22d7de
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:48 +01:00
Committer: Ingo Molnar 
CommitterDate: Wed, 17 Feb 2021 14:12:43 +01:00

entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point upon resuming to guest mode. This
way we can avoid doing it from rcu_user_enter() with the last-resort
self-IPI hack that enforces rescheduling.

Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-6-frede...@kernel.org
---
 arch/x86/kvm/x86.c|  1 +-
 include/linux/entry-kvm.h | 14 -
 kernel/rcu/tree.c | 44 +-
 kernel/rcu/tree_plugin.h  |  1 +-
 4 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b404e4..b967c1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1782,6 +1782,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
 
 bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
 {
+   xfer_to_guest_mode_prepare();
return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
xfer_to_guest_mode_work_pending();
 }
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 9b93f85..8b2b1d6 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -47,6 +47,20 @@ static inline int arch_xfer_to_guest_mode_handle_work(struct 
kvm_vcpu *vcpu,
 int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu);
 
 /**
+ * xfer_to_guest_mode_prepare - Perform last minute preparation work that
+ * need to be handled while IRQs are disabled
+ * upon entering to guest.
+ *
+ * Has to be invoked with interrupts disabled before the last call
+ * to xfer_to_guest_mode_work_pending().
+ */
+static inline void xfer_to_guest_mode_prepare(void)
+{
+   lockdep_assert_irqs_disabled();
+   rcu_nocb_flush_deferred_wakeup();
+}
+
+/**
  * __xfer_to_guest_mode_work_pending - Check if work is pending
  *
  * Returns: True if work pending, False otherwise.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2ebc211..ce17b84 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -678,9 +678,10 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
 
+#if !defined(CONFIG_GENERIC_ENTRY) || !defined(CONFIG_KVM_XFER_TO_GUEST_WORK)
 /*
  * An empty function that will trigger a reschedule on
- * IRQ tail once IRQs get re-enabled on userspace resume.
+ * IRQ tail once IRQs get re-enabled on userspace/guest resume.
  */
 static void late_wakeup_func(struct irq_work *work)
 {
@@ -689,6 +690,37 @@ static void late_wakeup_func(struct irq_work *work)
 static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
IRQ_WORK_INIT(late_wakeup_func);
 
+/*
+ * If either:
+ *
+ * 1) the task is about to enter in guest mode and $ARCH doesn't support KVM generic work
+ * 2) the task is about to enter in user mode and $ARCH doesn't support generic entry.
+ *
+ * In these cases the late RCU wake ups aren't supported in the resched loops and our
+ * last resort is to fire a local irq_work that will trigger a reschedule once IRQs
+ * get re-enabled again.
+ */
+noinstr static void rcu_irq_work_resched(void)
+{
+   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+   if (IS_ENABLED(CONFIG_GENERIC_ENTRY) && !(current->flags & PF_VCPU))
+   return;
+
+   if (IS_ENABLED(CONFIG_KVM_XFER_TO_GUEST_WORK) && (current->flags & PF_VCPU))
+   return;
+
+   instrumentation_begin();
+   if (do_nocb_deferred_wakeup(rdp) && need_resched()) {
+   irq_work_queue(this_cpu_ptr(&late_wakeup_work));
+   }
+   instrumentation_end();
+}
+
+#else
+static inline void rcu_irq_work_resched(void) { }
+#endif
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -702,8 +734,6 @@ static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
  */
 noinstr void rcu_user_enter(void)
 {
-   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
lockdep_assert_irqs_disabled();
 
/*
@@ -711,13 +741,7 @@ noinstr void rcu_user_enter(void)
 * rescheduling opportunity in the entry code. Trigger a self IPI
 * that will fire and reschedule once we resume in user/guest mode.
 */
-   instrumentation_begin();
-   if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
-   if (do_nocb_deferred_wakeup(rdp) && need_resched())
-   

[tip: core/rcu] tools/rcutorture: Make identify_qemu_vcpus() independent of local language

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 106cc0d9e79aa7fcb43bd8feab97ee6e114d348b
Gitweb:
https://git.kernel.org/tip/106cc0d9e79aa7fcb43bd8feab97ee6e114d348b
Author:Frederic Weisbecker 
AuthorDate:Thu, 19 Nov 2020 01:30:24 +01:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 04 Jan 2021 14:01:20 -08:00

tools/rcutorture: Make identify_qemu_vcpus() independent of local language

The rcutorture scripts' identify_qemu_vcpus() function expects `lscpu`
to have a "CPU: " line, for example:

CPU(s): 8

But different local language settings can give different results:

Processeur(s) : 8

As a result, identify_qemu_vcpus() may return an empty string, resulting
in the following warning (with the same local language settings):

kvm-test-1-run.sh: ligne 138 : test:  : nombre entier attendu comme 
expression

This commit therefore changes identify_qemu_vcpus() to use getconf,
which produces local-language-independent output.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: r...@vger.kernel.org
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 tools/testing/selftests/rcutorture/bin/functions.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh 
b/tools/testing/selftests/rcutorture/bin/functions.sh
index 8266349..fef8b4b 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -232,7 +232,7 @@ identify_qemu_args () {
 # Returns the number of virtual CPUs available to the aggregate of the
 # guest OSes.
 identify_qemu_vcpus () {
-   lscpu | grep '^CPU(s):' | sed -e 's/CPU(s)://' -e 's/[  ]*//g'
+   getconf _NPROCESSORS_ONLN
 }
 
 # print_bug


[tip: core/rcu] rcu/nocb: Turn enabled/offload states into a common flag

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 65e560327fe68153a9ad7452d5fd3171a1927d33
Gitweb:
https://git.kernel.org/tip/65e560327fe68153a9ad7452d5fd3171a1927d33
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:16 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/nocb: Turn enabled/offload states into a common flag

This commit gathers the rcu_segcblist ->enabled and ->offloaded property
fields into a single ->flags bitmask to avoid further proliferation of
individual u8 fields in the structure.  This change prepares for the
state formerly known as ->offloaded to be modified at runtime.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h |  6 --
 kernel/rcu/rcu_segcblist.c|  6 +++---
 kernel/rcu/rcu_segcblist.h| 23 +--
 3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 6c01f09..4714b02 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -63,6 +63,9 @@ struct rcu_cblist {
 #define RCU_NEXT_TAIL  3
 #define RCU_CBLIST_NSEGS   4
 
+#define SEGCBLIST_ENABLED  BIT(0)
+#define SEGCBLIST_OFFLOADEDBIT(1)
+
 struct rcu_segcblist {
struct rcu_head *head;
struct rcu_head **tails[RCU_CBLIST_NSEGS];
@@ -73,8 +76,7 @@ struct rcu_segcblist {
long len;
 #endif
long seglen[RCU_CBLIST_NSEGS];
-   u8 enabled;
-   u8 offloaded;
+   u8 flags;
 };
 
 #define RCU_SEGCBLIST_INITIALIZER(n) \
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 89e0dff..934945d 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -246,7 +246,7 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp)
rcu_segcblist_set_seglen(rsclp, i, 0);
}
rcu_segcblist_set_len(rsclp, 0);
-   rsclp->enabled = 1;
+   rcu_segcblist_set_flags(rsclp, SEGCBLIST_ENABLED);
 }
 
 /*
@@ -257,7 +257,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
 {
WARN_ON_ONCE(!rcu_segcblist_empty(rsclp));
WARN_ON_ONCE(rcu_segcblist_n_cbs(rsclp));
-   rsclp->enabled = 0;
+   rcu_segcblist_clear_flags(rsclp, SEGCBLIST_ENABLED);
 }
 
 /*
@@ -266,7 +266,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
  */
 void rcu_segcblist_offload(struct rcu_segcblist *rsclp)
 {
-   rsclp->offloaded = 1;
+   rcu_segcblist_set_flags(rsclp, SEGCBLIST_OFFLOADED);
 }
 
 /*
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 18e101d..ff372db 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -53,19 +53,38 @@ static inline long rcu_segcblist_n_cbs(struct rcu_segcblist 
*rsclp)
 #endif
 }
 
+static inline void rcu_segcblist_set_flags(struct rcu_segcblist *rsclp,
+  int flags)
+{
+   rsclp->flags |= flags;
+}
+
+static inline void rcu_segcblist_clear_flags(struct rcu_segcblist *rsclp,
+int flags)
+{
+   rsclp->flags &= ~flags;
+}
+
+static inline bool rcu_segcblist_test_flags(struct rcu_segcblist *rsclp,
+   int flags)
+{
+   return READ_ONCE(rsclp->flags) & flags;
+}
+
 /*
  * Is the specified rcu_segcblist enabled, for example, not corresponding
  * to an offline CPU?
  */
 static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp)
 {
-   return rsclp->enabled;
+   return rcu_segcblist_test_flags(rsclp, SEGCBLIST_ENABLED);
 }
 
 /* Is the specified rcu_segcblist offloaded?  */
 static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp)
 {
-   return IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rsclp->offloaded;
+   return IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
+   rcu_segcblist_test_flags(rsclp, SEGCBLIST_OFFLOADED);
 }
 
 /*


[tip: core/rcu] rcu/nocb: Provide basic callback offloading state machine bits

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 8d346d438f93b5344e99d429727ec9c2f392d4ec
Gitweb:
https://git.kernel.org/tip/8d346d438f93b5344e99d429727ec9c2f392d4ec
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:17 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/nocb: Provide basic callback offloading state machine bits

Offloading and de-offloading RCU callback processes must be done
carefully.  There must never be a time at which callback processing is
disabled because the task driving the offloading or de-offloading might be
preempted or otherwise stalled at that point in time, which would result
in OOM due to callbacks piling up indefinitely.  This implies that there
will be times during which a given CPU's callbacks might be concurrently
invoked by both that CPU's RCU_SOFTIRQ handler (or, equivalently, that
CPU's rcuc kthread) and by that CPU's rcuo kthread.

This situation could fatally confuse both rcu_barrier() and the
CPU-hotplug offlining process, so these must be excluded during any
concurrent-callback-invocation period.  In addition, during times of
concurrent callback invocation, changes to ->cblist must be protected
both as needed for RCU_SOFTIRQ and as needed for the rcuo kthread.

This commit therefore defines and documents the states for a state
machine that coordinates offloading and deoffloading.
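
Condensed into a list, the offload-direction transitions documented by
the new comment read roughly as follows (a hedged summary rendered as a
C comment, not additional kernel code):

/*
 * SEGCBLIST_SOFTIRQ_ONLY
 *   -> SEGCBLIST_OFFLOADED
 *   -> SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB    (CB kthread acked)
 *      SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_GP    (GP kthread acked)
 *   -> SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB | SEGCBLIST_KTHREAD_GP
 *
 * De-offloading walks these states in the opposite direction, dropping
 * SEGCBLIST_OFFLOADED first so that both kthreads can acknowledge.
 */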

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcu_segcblist.h | 115 -
 kernel/rcu/rcu_segcblist.c|   1 +-
 kernel/rcu/rcu_segcblist.h|  12 ++-
 kernel/rcu/tree.c |   3 +-
 4 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 4714b02..8afe886 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -63,8 +63,121 @@ struct rcu_cblist {
 #define RCU_NEXT_TAIL  3
 #define RCU_CBLIST_NSEGS   4
 
+
+/*
+ * ==NOCB Offloading state machine==
+ *
+ *
+ *  ------------------------------------------------------------------------
+ *  |                        SEGCBLIST_SOFTIRQ_ONLY                         |
+ *  |                                                                       |
+ *  |  Callbacks processed by rcu_core() from softirqs or local             |
+ *  |  rcuc kthread, without holding nocb_lock.                             |
+ *  ------------------------------------------------------------------------
+ *                                     |
+ *                                     v
+ *  ------------------------------------------------------------------------
+ *  |                          SEGCBLIST_OFFLOADED                          |
+ *  |                                                                       |
+ *  |  Callbacks processed by rcu_core() from softirqs or local             |
+ *  |  rcuc kthread, while holding nocb_lock.  Waking up CB and GP          |
+ *  |  kthreads, allowing nocb_timer to be armed.                           |
+ *  ------------------------------------------------------------------------
+ *                                     |
+ *                    ----------------------------------
+ *                    |                                |
+ *                    v                                v
+ *  ---------------------------------    ---------------------------------
+ *  |    SEGCBLIST_OFFLOADED |      |    |    SEGCBLIST_OFFLOADED |      |
+ *  |    SEGCBLIST_KTHREAD_CB       |    |    SEGCBLIST_KTHREAD_GP       |
+ *  |                               |    |                               |
+ *  |  CB kthread woke up and       |    |  GP kthread woke up and       |
+ *  |  acknowledged                 |    |  acknowledged                 |
+ *  |  SEGCBLIST_OFFLOADED.         |    |  SEGCBLIST_OFFLOADED.         |
+ *  |  Processes callbacks          |    |                               |
+ *  |  concurrently with            |    |                               |
+ *  |  rcu_core(), holding          |    |                               |
+ *  |  nocb_lock.                   |    |                               |
+ *  ---------------------------------    ---------------------------------
+ *                    |                                |
+ *                    ----------------------------------
+ *                                     |
+ *                                     v
+ *  ------------------------------------------------------------------------

[tip: core/rcu] rcu/nocb: Always init segcblist on CPU up

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 126d9d49528dae792859e5f11f3b447ce8a9a9b4
Gitweb:
https://git.kernel.org/tip/126d9d49528dae792859e5f11f3b447ce8a9a9b4
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:18 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/nocb: Always init segcblist on CPU up

How the rdp->cblist enabled state is treated at CPU-hotplug time depends
on whether or not that ->cblist is offloaded.

1) Not offloaded: The ->cblist is disabled when the CPU goes down. All
   its callbacks are migrated and none can be enqueued until after some
   later CPU-hotplug operation brings the CPU back up.

2) Offloaded: The ->cblist is not disabled on CPU down because the CB/GP
   kthreads must finish invoking the remaining callbacks. There is thus
   no need to re-enable it on CPU up.

Since the ->cblist offloaded state is set in stone at boot, it cannot
change between CPU down and CPU up. So 1) and 2) are symmetrical.

However, given runtime toggling of the offloaded state, there are two
additional asymmetrical scenarios:

3) The ->cblist is not offloaded when the CPU goes down. The ->cblist
   is later toggled to offloaded and then the CPU comes back up.

4) The ->cblist is offloaded when the CPU goes down. The ->cblist is
   later toggled to no longer be offloaded and then the CPU comes back up.

Scenario 4) is currently handled correctly. The ->cblist remains enabled
on CPU down and gets re-initialized on CPU up. The toggling operation
will wait until ->cblist is empty, so ->cblist will remain empty until
CPU-up time.

Scenario 3) would run into trouble though, as the rdp is disabled
on CPU down and not re-initialized/re-enabled on CPU up.  Except that
in this case, ->cblist is guaranteed to be empty because all its
callbacks were migrated away at CPU-down time.  And the CPU-up code
already initializes and enables any empty ->cblist structures, which
covers the case where no early-boot invocation of call_rcu() has
queued anything on it.  So all that need be done
is to adjust the locking.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7cfc2e8..83362f6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4015,12 +4015,18 @@ int rcutree_prepare_cpu(unsigned int cpu)
rdp->qlen_last_fqs_check = 0;
rdp->n_force_qs_snap = rcu_state.n_force_qs;
rdp->blimit = blimit;
-   if (rcu_segcblist_empty(>cblist) && /* No early-boot CBs? */
-   !rcu_segcblist_is_offloaded(>cblist))
-   rcu_segcblist_init(>cblist);  /* Re-enable callbacks. */
rdp->dynticks_nesting = 1;  /* CPU not up, no tearing. */
rcu_dynticks_eqs_online();
raw_spin_unlock_rcu_node(rnp);  /* irqs remain disabled. */
+   /*
+* Lock in case the CB/GP kthreads are still around handling
+* old callbacks (longer term we should flush all callbacks
+* before completing CPU offline)
+*/
+   rcu_nocb_lock(rdp);
+   if (rcu_segcblist_empty(>cblist)) /* No early-boot CBs? */
+   rcu_segcblist_init(>cblist);  /* Re-enable callbacks. */
+   rcu_nocb_unlock(rdp);
 
/*
 * Add CPU to leaf rcu_node pending-online bitmask.  Any needed


[tip: core/rcu] rcu/nocb: De-offloading CB kthread

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: d97b078182406c0bd0aacd36fc0a693e118e608f
Gitweb:
https://git.kernel.org/tip/d97b078182406c0bd0aacd36fc0a693e118e608f
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:19 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/nocb: De-offloading CB kthread

To de-offload callback processing back onto a CPU, it is necessary to
clear SEGCBLIST_OFFLOADED and notify the nocb CB kthread, which will then
clear its own bit flag and go to sleep to stop handling callbacks.  This
commit makes that change.  It will also be necessary to notify the nocb
GP kthread in this same way, which is the subject of a follow-on commit.
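
As a rough illustration of that handshake, the CB kthread's
acknowledgement path can be pictured as below (a condensed sketch, not
the literal patch; the helper name is hypothetical):

static void nocb_cb_ack_deoffload(struct rcu_data *rdp)
{
        struct rcu_segcblist *cblist = &rdp->cblist;

        /* The de-offloading worker already cleared SEGCBLIST_OFFLOADED. */
        rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_CB);
        /* Let the worker sleeping in __rcu_nocb_rdp_deoffload() proceed. */
        swake_up_one(&rdp->nocb_state_wq);
}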

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
[ paulmck: Add export per kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h   |   2 +-
 kernel/rcu/rcu_segcblist.c |  10 ++-
 kernel/rcu/rcu_segcblist.h |   2 +-
 kernel/rcu/tree.h  |   1 +-
 kernel/rcu/tree_plugin.h   | 130 +++-
 5 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de08264..40266eb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -104,8 +104,10 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+int rcu_nocb_cpu_deoffload(int cpu);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 /**
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 7fc6362..7f181c9 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -264,10 +264,14 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
  * Mark the specified rcu_segcblist structure as offloaded.  This
  * structure must be empty.
  */
-void rcu_segcblist_offload(struct rcu_segcblist *rsclp)
+void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload)
 {
-   rcu_segcblist_clear_flags(rsclp, SEGCBLIST_SOFTIRQ_ONLY);
-   rcu_segcblist_set_flags(rsclp, SEGCBLIST_OFFLOADED);
+   if (offload) {
+   rcu_segcblist_clear_flags(rsclp, SEGCBLIST_SOFTIRQ_ONLY);
+   rcu_segcblist_set_flags(rsclp, SEGCBLIST_OFFLOADED);
+   } else {
+   rcu_segcblist_clear_flags(rsclp, SEGCBLIST_OFFLOADED);
+   }
 }
 
 /*
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index e05952a..28c9a52 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -109,7 +109,7 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
 void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v);
 void rcu_segcblist_init(struct rcu_segcblist *rsclp);
 void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
-void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
+void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload);
 bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp);
 bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp);
 struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7708ed1..e0deb48 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -201,6 +201,7 @@ struct rcu_data {
/* 5) Callback offloading. */
 #ifdef CONFIG_RCU_NOCB_CPU
struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */
+   struct swait_queue_head nocb_state_wq; /* For offloading state changes 
*/
struct task_struct *nocb_gp_kthread;
raw_spinlock_t nocb_lock;   /* Guard following pair of fields. */
atomic_t nocb_lock_contended;   /* Contention experienced. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7e291ce..1b870d0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2081,16 +2081,29 @@ static int rcu_nocb_gp_kthread(void *arg)
return 0;
 }
 
+static inline bool nocb_cb_can_run(struct rcu_data *rdp)
+{
+   u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB;
+   return rcu_segcblist_test_flags(>cblist, flags);
+}
+
+static inline bool nocb_cb_wait_cond(struct rcu_data *rdp)
+{
+   return nocb_cb_can_run(rdp) && !READ_ONCE(rdp->nocb_cb_sleep);
+}
+
 /*
  * Invoke any ready callbacks from the corresponding no-CBs CPU,
  * then, if there are no more, wait for more to appear.
  */
 static void nocb_cb_wait(struct rcu_data *rdp)
 {
+   struct rcu_segcblist *cblist = >cblist;
+   struct rcu_node *rnp = rdp->mynode;
+   bool needwake_state = false;
+   bool 

[tip: core/rcu] rcu/nocb: Don't deoffload an offline CPU with pending work

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: ef005345e6e49859e225f549c88c985e79477bb9
Gitweb:
https://git.kernel.org/tip/ef005345e6e49859e225f549c88c985e79477bb9
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:20 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:19 -08:00

rcu/nocb: Don't deoffload an offline CPU with pending work

Offloaded CPUs do not migrate their callbacks, instead relying on
their rcuo kthread to invoke them.  But if the CPU is offline, it
will be running neither its RCU_SOFTIRQ handler nor its rcuc kthread.
This means that de-offloading an offline CPU that still has pending
callbacks will strand those callbacks.  This commit therefore refuses
to toggle offline CPUs having pending callbacks.
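
Since de-offloading can now fail, callers are expected to check the
return value; a minimal hypothetical sketch, assuming the -EBUSY error
propagates up through rcu_nocb_cpu_deoffload() (wrapper name and message
are illustrative):

static int try_nocb_deoffload(int cpu)
{
        int ret = rcu_nocb_cpu_deoffload(cpu);

        if (ret == -EBUSY)
                pr_info("rcu: CPU %d offline with pending callbacks, not de-offloaded\n",
                        cpu);
        return ret;
}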

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Suggested-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h |  9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1b870d0..b70cc91 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2227,6 +2227,15 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
printk("De-offloading %d\n", rdp->cpu);
 
rcu_nocb_lock_irqsave(rdp, flags);
+   /*
+* If there are still pending work offloaded, the offline
+* CPU won't help much handling them.
+*/
+   if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(>cblist)) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
+   return -EBUSY;
+   }
+
rcu_segcblist_offload(cblist, false);
 
if (rdp->nocb_cb_sleep) {


[tip: core/rcu] rcu/nocb: De-offloading GP kthread

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 5bb39dc956f3d4f1bb75b5962b503426c45340ae
Gitweb:
https://git.kernel.org/tip/5bb39dc956f3d4f1bb75b5962b503426c45340ae
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:21 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:20 -08:00

rcu/nocb: De-offloading GP kthread

To de-offload callback processing back onto a CPU, it is necessary
to clear SEGCBLIST_OFFLOADED and notify the nocb GP kthread, which will
then clear its own bit flag and ignore this CPU until further notice.
Whichever of the nocb CB and nocb GP kthreads is last to clear its own
bit notifies the de-offloading worker kthread.  Once notified, this
worker kthread can proceed safe in the knowledge that the nocb CB and
GP kthreads will no longer be manipulating this CPU's RCU callback list.

This commit makes this change.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 54 ---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b70cc91..fe46e70 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1928,6 +1928,33 @@ static void do_nocb_bypass_wakeup_timer(struct 
timer_list *t)
__call_rcu_nocb_wake(rdp, true, flags);
 }
 
+static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp)
+{
+   u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_GP;
+
+   return rcu_segcblist_test_flags(>cblist, flags);
+}
+
+static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool 
*needwake_state)
+{
+   struct rcu_segcblist *cblist = >cblist;
+
+   if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
+   return true;
+   } else {
+   /*
+* De-offloading. Clear our flag and notify the de-offload 
worker.
+* We will ignore this rdp until it ever gets re-offloaded.
+*/
+   WARN_ON_ONCE(!rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_GP));
+   rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP);
+   if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB))
+   *needwake_state = true;
+   return false;
+   }
+}
+
+
 /*
  * No-CBs GP kthreads come here to wait for additional callbacks to show up
  * or for grace periods to end.
@@ -1956,8 +1983,17 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 */
WARN_ON_ONCE(my_rdp->nocb_gp_rdp != my_rdp);
for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) {
+   bool needwake_state = false;
+   if (!nocb_gp_enabled_cb(rdp))
+   continue;
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
rcu_nocb_lock_irqsave(rdp, flags);
+   if (!nocb_gp_update_state(rdp, _state)) {
+   rcu_nocb_unlock_irqrestore(rdp, flags);
+   if (needwake_state)
+   swake_up_one(>nocb_state_wq);
+   continue;
+   }
bypass_ncbs = rcu_cblist_n_cbs(>nocb_bypass);
if (bypass_ncbs &&
(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
@@ -2221,7 +2257,8 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
 static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
 {
struct rcu_segcblist *cblist = >cblist;
-   bool wake_cb = false;
+   struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
+   bool wake_cb = false, wake_gp = false;
unsigned long flags;
 
printk("De-offloading %d\n", rdp->cpu);
@@ -2247,9 +2284,19 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
if (wake_cb)
swake_up_one(>nocb_cb_wq);
 
-   swait_event_exclusive(rdp->nocb_state_wq,
- !rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB));
+   raw_spin_lock_irqsave(_gp->nocb_gp_lock, flags);
+   if (rdp_gp->nocb_gp_sleep) {
+   rdp_gp->nocb_gp_sleep = false;
+   wake_gp = true;
+   }
+   raw_spin_unlock_irqrestore(_gp->nocb_gp_lock, flags);
 
+   if (wake_gp)
+   wake_up_process(rdp_gp->nocb_gp_kthread);
+
+   swait_event_exclusive(rdp->nocb_state_wq,
+ !rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB |
+   SEGCBLIST_KTHREAD_GP));
return 0;
 }
 
@@ -2332,6 +2379,7 @@ void __init rcu_init_nohz(void)
rcu_segcblist_init(>cblist);

[tip: core/rcu] rcu/nocb: Re-offload support

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 254e11efde66ca0a0ce0c99a62c377314b5984ff
Gitweb:
https://git.kernel.org/tip/254e11efde66ca0a0ce0c99a62c377314b5984ff
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:22 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:25 -08:00

rcu/nocb: Re-offload support

To re-offload the callback processing off a CPU, it is necessary to
clear SEGCBLIST_SOFTIRQ_ONLY, set SEGCBLIST_OFFLOADED, and then notify
both the CB and GP kthreads so that they both set their own bit flag and
start processing the callbacks remotely.  The re-offloading worker is
then notified that it can stop the RCU_SOFTIRQ handler (or rcuc kthread,
as the case may be) from processing the callbacks locally.

Ordering must be carefully enforced so that the callbacks that used to be
processed locally without locking will have the same ordering properties
when they are invoked by the nocb CB and GP kthreads.

This commit makes this change.
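
At the API level, re-offloading mirrors de-offloading; a hypothetical
caller sketch, assuming the return conventions visible in the
rcupdate.h hunk below (-EINVAL when CONFIG_RCU_NOCB_CPU=n):

static int toggle_nocb_offload(int cpu, bool offload)
{
        return offload ? rcu_nocb_cpu_offload(cpu)
                       : rcu_nocb_cpu_deoffload(cpu);
}

The rcutorture nocb-toggle support later in this series exercises
exactly this kind of back-and-forth.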

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
[ paulmck: Export rcu_nocb_cpu_offload(). ]
Signed-off-by: Paul E. McKenney 
---
 include/linux/rcupdate.h |   2 +-
 kernel/rcu/tree_plugin.h | 158 --
 2 files changed, 138 insertions(+), 22 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 40266eb..e0ee52e 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -104,9 +104,11 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+int rcu_nocb_cpu_offload(int cpu);
 int rcu_nocb_cpu_deoffload(int cpu);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline int rcu_nocb_cpu_offload(int cpu) { return -EINVAL; }
 static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index fe46e70..03ae1ce 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1928,6 +1928,20 @@ static void do_nocb_bypass_wakeup_timer(struct 
timer_list *t)
__call_rcu_nocb_wake(rdp, true, flags);
 }
 
+/*
+ * Check if we ignore this rdp.
+ *
+ * We check that without holding the nocb lock but
+ * we make sure not to miss a freshly offloaded rdp
+ * with the current ordering:
+ *
+ *  rdp_offload_toggle()nocb_gp_enabled_cb()
+ * -   
+ *WRITE flags LOCK nocb_gp_lock
+ *LOCK nocb_gp_lock   READ/WRITE nocb_gp_sleep
+ *READ/WRITE nocb_gp_sleepUNLOCK nocb_gp_lock
+ *UNLOCK nocb_gp_lock READ flags
+ */
 static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp)
 {
u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_GP;
@@ -1940,6 +1954,11 @@ static inline bool nocb_gp_update_state(struct rcu_data 
*rdp, bool *needwake_sta
struct rcu_segcblist *cblist = >cblist;
 
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
+   if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_GP)) {
+   rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_GP);
+   if (rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB))
+   *needwake_state = true;
+   }
return true;
} else {
/*
@@ -2003,6 +2022,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
bypass_ncbs = rcu_cblist_n_cbs(>nocb_bypass);
} else if (!bypass_ncbs && rcu_segcblist_empty(>cblist)) {
rcu_nocb_unlock_irqrestore(rdp, flags);
+   if (needwake_state)
+   swake_up_one(>nocb_state_wq);
continue; /* No callbacks here, try next. */
}
if (bypass_ncbs) {
@@ -2054,6 +2075,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
}
if (needwake_gp)
rcu_gp_kthread_wake();
+   if (needwake_state)
+   swake_up_one(>nocb_state_wq);
}
 
my_rdp->nocb_gp_bypass = bypass;
@@ -2159,6 +2182,11 @@ static void nocb_cb_wait(struct rcu_data *rdp)
WRITE_ONCE(rdp->nocb_cb_sleep, true);
 
if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) {
+   if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) {
+   rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB);
+   if (rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_GP))
+   needwake_state = 

[tip: core/rcu] rcu/nocb: Shutdown nocb timer on de-offloading

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 69cdea873cde261586a2cae2440178df1a313bbe
Gitweb:
https://git.kernel.org/tip/69cdea873cde261586a2cae2440178df1a313bbe
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:23 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Shutdown nocb timer on de-offloading

This commit ensures that the nocb timer is shut down before reaching the
final de-offloaded state.  The key goal is to prevent the timer handler
from manipulating the callbacks without the protection of the nocb locks.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.h|  1 +
 kernel/rcu/tree_plugin.h | 12 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e0deb48..5d359b9 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -257,6 +257,7 @@ struct rcu_data {
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
+#define RCU_NOCB_WAKE_OFF  -1
 #define RCU_NOCB_WAKE_NOT  0
 #define RCU_NOCB_WAKE  1
 #define RCU_NOCB_WAKE_FORCE2
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 03ae1ce..c88ad62 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1665,6 +1665,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
 static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
   const char *reason)
 {
+   if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_OFF)
+   return;
if (rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT)
mod_timer(>nocb_timer, jiffies + 1);
if (rdp->nocb_defer_wakeup < waketype)
@@ -2243,7 +2245,7 @@ static int rcu_nocb_cb_kthread(void *arg)
 /* Is a deferred wakeup of rcu_nocb_kthread() required? */
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp)
 {
-   return READ_ONCE(rdp->nocb_defer_wakeup);
+   return READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT;
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
@@ -2337,6 +2339,12 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
swait_event_exclusive(rdp->nocb_state_wq,
  !rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB |
SEGCBLIST_KTHREAD_GP));
+   /* Make sure nocb timer won't stay around */
+   rcu_nocb_lock_irqsave(rdp, flags);
+   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_OFF);
+   rcu_nocb_unlock_irqrestore(rdp, flags);
+   del_timer_sync(>nocb_timer);
+
return ret;
 }
 
@@ -2394,6 +2402,8 @@ static int __rcu_nocb_rdp_offload(struct rcu_data *rdp)
 * SEGCBLIST_SOFTIRQ_ONLY mode.
 */
raw_spin_lock_irqsave(>nocb_lock, flags);
+   /* Re-enable nocb timer */
+   WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT);
/*
 * We didn't take the nocb lock while working on the
 * rdp->cblist in SEGCBLIST_SOFTIRQ_ONLY mode.


[tip: core/rcu] rcu/nocb: Flush bypass before setting SEGCBLIST_SOFTIRQ_ONLY

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 314202f84ddd61e4d7576ef62570ad2e2d9db06b
Gitweb:
https://git.kernel.org/tip/314202f84ddd61e4d7576ef62570ad2e2d9db06b
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:24 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Flush bypass before setting SEGCBLIST_SOFTIRQ_ONLY

This commit flushes the bypass queue and sets state to avoid its being
refilled before switching to the final de-offloaded state.  To avoid
refilling, this commit sets SEGCBLIST_SOFTIRQ_ONLY before re-enabling
IRQs.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c88ad62..35dc9b3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2339,12 +2339,21 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data 
*rdp)
swait_event_exclusive(rdp->nocb_state_wq,
  !rcu_segcblist_test_flags(cblist, 
SEGCBLIST_KTHREAD_CB |
SEGCBLIST_KTHREAD_GP));
-   /* Make sure nocb timer won't stay around */
rcu_nocb_lock_irqsave(rdp, flags);
+   /* Make sure nocb timer won't stay around */
WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_OFF);
rcu_nocb_unlock_irqrestore(rdp, flags);
del_timer_sync(>nocb_timer);
 
+   /*
+* Flush bypass. While IRQs are disabled and once we set
+* SEGCBLIST_SOFTIRQ_ONLY, no callback is supposed to be
+* enqueued on bypass.
+*/
+   rcu_nocb_lock_irqsave(rdp, flags);
+   rcu_nocb_flush_bypass(rdp, NULL, jiffies);
+   rcu_nocb_unlock_irqrestore(rdp, flags);
+
return ret;
 }
 


[tip: core/rcu] rcu/nocb: Only cond_resched() from actual offloaded batch processing

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: e3abe959fbd57aa751bc533677a35c411cee9b16
Gitweb:
https://git.kernel.org/tip/e3abe959fbd57aa751bc533677a35c411cee9b16
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:26 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Only cond_resched() from actual offloaded batch processing

During a toggle operation, rcu_do_batch() may be invoked concurrently
by softirqs and offloaded processing for a given CPU's callbacks.
This commit therefore makes sure cond_resched() is invoked only from
the offloaded context.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 83362f6..4ef59a5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2516,8 +2516,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
/* Exceeded the time limit, so leave. */
break;
}
-   if (offloaded) {
-   WARN_ON_ONCE(in_serving_softirq());
+   if (!in_serving_softirq()) {
local_bh_enable();
lockdep_assert_irqs_enabled();
cond_resched_tasks_rcu_qs();


[tip: core/rcu] rcu/nocb: Set SEGCBLIST_SOFTIRQ_ONLY at the very last stage of de-offloading

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: b9ced9e1ab51ed6057ac8198fd1eeb404a32a867
Gitweb:
https://git.kernel.org/tip/b9ced9e1ab51ed6057ac8198fd1eeb404a32a867
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:25 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Set SEGCBLIST_SOFTIRQ_ONLY at the very last stage of de-offloading

This commit sets SEGCBLIST_SOFTIRQ_ONLY once toggling is otherwise fully
complete, allowing further RCU callback manipulation to be carried out
locklessly and locally.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree_plugin.h |  9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 35dc9b3..8641b72 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2352,7 +2352,14 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp)
 */
rcu_nocb_lock_irqsave(rdp, flags);
rcu_nocb_flush_bypass(rdp, NULL, jiffies);
-   rcu_nocb_unlock_irqrestore(rdp, flags);
+   rcu_segcblist_set_flags(cblist, SEGCBLIST_SOFTIRQ_ONLY);
+   /*
+* With SEGCBLIST_SOFTIRQ_ONLY, we can't use
+* rcu_nocb_unlock_irqrestore() anymore. Theoretically we
+* could set SEGCBLIST_SOFTIRQ_ONLY with cb unlocked and IRQs
+* disabled now, but let's be paranoid.
+*/
+   raw_spin_unlock_irqrestore(>nocb_lock, flags);
 
return ret;
 }


[tip: core/rcu] rcu/nocb: Process batch locally as long as offloading isn't complete

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 32aa2f4170d22f0b9fcb75ab05679ab122fae373
Gitweb:
https://git.kernel.org/tip/32aa2f4170d22f0b9fcb75ab05679ab122fae373
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:27 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Process batch locally as long as offloading isn't complete

This commit makes sure to process the callbacks locally (via either
RCU_SOFTIRQ or the rcuc kthread) whenever the segcblist isn't entirely
offloaded.  This ensures that callbacks are invoked one way or another
while a CPU is in the middle of a toggle operation.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.h | 12 
 kernel/rcu/tree.c  |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 28c9a52..afad6fc 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -95,6 +95,18 @@ static inline bool rcu_segcblist_is_offloaded(struct 
rcu_segcblist *rsclp)
return false;
 }
 
+static inline bool rcu_segcblist_completely_offloaded(struct rcu_segcblist 
*rsclp)
+{
+   int flags = SEGCBLIST_KTHREAD_CB | SEGCBLIST_KTHREAD_GP | 
SEGCBLIST_OFFLOADED;
+
+   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU)) {
+   if ((rsclp->flags & flags) == flags)
+   return true;
+   }
+
+   return false;
+}
+
 /*
  * Are all segments following the specified segment of the specified
  * rcu_segcblist structure empty of callbacks?  (The specified
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4ef59a5..ec14c01 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2700,6 +2700,7 @@ static __latent_entropy void rcu_core(void)
struct rcu_data *rdp = raw_cpu_ptr(_data);
struct rcu_node *rnp = rdp->mynode;
const bool offloaded = rcu_segcblist_is_offloaded(>cblist);
+   const bool do_batch = !rcu_segcblist_completely_offloaded(>cblist);
 
if (cpu_is_offline(smp_processor_id()))
return;
@@ -2729,7 +2730,7 @@ static __latent_entropy void rcu_core(void)
rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());
 
/* If there are callbacks ready, invoke them. */
-   if (!offloaded && rcu_segcblist_ready_cbs(>cblist) &&
+   if (do_batch && rcu_segcblist_ready_cbs(>cblist) &&
likely(READ_ONCE(rcu_scheduler_fully_active)))
rcu_do_batch(rdp);
 


[tip: core/rcu] rcu/nocb: Locally accelerate callbacks as long as offloading isn't complete

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 634954c2dbf88e67aa267798f60af6b9a476cf4b
Gitweb:
https://git.kernel.org/tip/634954c2dbf88e67aa267798f60af6b9a476cf4b
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:28 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

rcu/nocb: Locally accelerate callbacks as long as offloading isn't complete

The local callback processing checks whether any callbacks need acceleration.
This commit carries out this checking under nocb lock protection in
the middle of toggle operations, during which time rcu_core() executes
concurrently with GP/CB kthreads.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/tree.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ec14c01..03810a5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2699,7 +2699,6 @@ static __latent_entropy void rcu_core(void)
unsigned long flags;
struct rcu_data *rdp = raw_cpu_ptr(_data);
struct rcu_node *rnp = rdp->mynode;
-   const bool offloaded = rcu_segcblist_is_offloaded(>cblist);
const bool do_batch = !rcu_segcblist_completely_offloaded(>cblist);
 
if (cpu_is_offline(smp_processor_id()))
@@ -2720,11 +2719,11 @@ static __latent_entropy void rcu_core(void)
 
/* No grace period and unregistered callbacks? */
if (!rcu_gp_in_progress() &&
-   rcu_segcblist_is_enabled(>cblist) && !offloaded) {
-   local_irq_save(flags);
+   rcu_segcblist_is_enabled(>cblist) && do_batch) {
+   rcu_nocb_lock_irqsave(rdp, flags);
if (!rcu_segcblist_restempty(>cblist, RCU_NEXT_READY_TAIL))
rcu_accelerate_cbs_unlocked(rnp, rdp);
-   local_irq_restore(flags);
+   rcu_nocb_unlock_irqrestore(rdp, flags);
}
 
rcu_check_gp_start_stall(rnp, rdp, rcu_jiffies_till_stall_check());


[tip: core/rcu] cpu/hotplug: Add lockdep_is_cpus_held()

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 43759fe5a137389e94ed6d4680c3c63c17273158
Gitweb:
https://git.kernel.org/tip/43759fe5a137389e94ed6d4680c3c63c17273158
Author:Frederic Weisbecker 
AuthorDate:Wed, 11 Nov 2020 23:53:13 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

cpu/hotplug: Add lockdep_is_cpus_held()

This commit adds a lockdep_is_cpus_held() function to verify that the
proper locks are held and that various operations are running in the
correct context.
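
A hedged usage sketch (the helper below is hypothetical, not part of the
patch): assert that the caller pinned CPU hotplug, without caring
whether the read or the write side is held.

static void assert_cpus_pinned(void)
{
#ifdef CONFIG_LOCKDEP
        WARN_ON_ONCE(!lockdep_is_cpus_held());
#endif
}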

Signed-off-by: Frederic Weisbecker 
Cc: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney 
---
 include/linux/cpu.h | 2 ++
 kernel/cpu.c| 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index d6428aa..3aaa068 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -111,6 +111,8 @@ static inline void cpu_maps_update_done(void)
 #endif /* CONFIG_SMP */
 extern struct bus_type cpu_subsys;
 
+extern int lockdep_is_cpus_held(void);
+
 #ifdef CONFIG_HOTPLUG_CPU
 extern void cpus_write_lock(void);
 extern void cpus_write_unlock(void);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 4e11e91..1b6302e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -330,6 +330,13 @@ void lockdep_assert_cpus_held(void)
percpu_rwsem_assert_held(_hotplug_lock);
 }
 
+#ifdef CONFIG_LOCKDEP
+int lockdep_is_cpus_held(void)
+{
+   return percpu_rwsem_is_held(_hotplug_lock);
+}
+#endif
+
 static void lockdep_acquire_cpus_lock(void)
 {
rwsem_acquire(_hotplug_lock.dep_map, 0, 0, _THIS_IP_);


[tip: core/rcu] timer: Add timer_curr_running()

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: dcd42591ebb8a25895b551a5297ea9c24414ba54
Gitweb:
https://git.kernel.org/tip/dcd42591ebb8a25895b551a5297ea9c24414ba54
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:33 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

timer: Add timer_curr_running()

This commit adds a timer_curr_running() function that verifies that the
current code is running in the context of the specified timer's handler.
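
A hedged usage sketch (the handler below is hypothetical): from inside a
timer callback, the new helper lets debug code assert that it really is
running as that timer's handler on the current CPU.

static void my_timer_handler(struct timer_list *t)
{
        /* Debug-only sanity check: are we really this timer's handler? */
        WARN_ON_ONCE(!timer_curr_running(t));

        /* ... normal handler work goes here ... */
}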

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 include/linux/timer.h |  2 ++
 kernel/time/timer.c   | 13 +
 2 files changed, 15 insertions(+)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index fda13c9..4118a97 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -192,6 +192,8 @@ extern int try_to_del_timer_sync(struct timer_list *timer);
 
 #define del_singleshot_timer_sync(t) del_timer_sync(t)
 
+extern bool timer_curr_running(struct timer_list *timer);
+
 extern void init_timers(void);
 struct hrtimer;
 extern enum hrtimer_restart it_real_fn(struct hrtimer *);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8dbc008..f9b2096 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1237,6 +1237,19 @@ int try_to_del_timer_sync(struct timer_list *timer)
 }
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
+bool timer_curr_running(struct timer_list *timer)
+{
+   int i;
+
+   for (i = 0; i < NR_BASES; i++) {
+   struct timer_base *base = this_cpu_ptr(_bases[i]);
+   if (base->running_timer == timer)
+   return true;
+   }
+
+   return false;
+}
+
 #ifdef CONFIG_PREEMPT_RT
 static __init void timer_base_init_expiry_lock(struct timer_base *base)
 {


[tip: core/rcu] tools/rcutorture: Support nocb toggle in TREE01

2021-02-12 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 70e8088b97211177225acf499247b3741cc8a229
Gitweb:
https://git.kernel.org/tip/70e8088b97211177225acf499247b3741cc8a229
Author:Frederic Weisbecker 
AuthorDate:Fri, 13 Nov 2020 13:13:29 +01:00
Committer: Paul E. McKenney 
CommitterDate: Wed, 06 Jan 2021 16:24:59 -08:00

tools/rcutorture: Support nocb toggle in TREE01

This commit adds periodic toggling of 7 of 8 CPUs every second to TREE01
in order to test NOCB toggle code.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney 
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot 
b/tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot
index d6da9a6..40af3df 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot
@@ -2,5 +2,7 @@ maxcpus=8 nr_cpus=43
 rcutree.gp_preinit_delay=3
 rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
-rcu_nocbs=0
+rcu_nocbs=0-1,3-7
+rcutorture.nocbs_nthreads=8
+rcutorture.nocbs_toggle=1000
 rcutorture.fwd_progress=0


[tip: sched/core] entry: Explicitly flush pending rcuog wakeup before last rescheduling point

2021-02-10 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2c910e0753dc424dfdeb1f8e230ad8f187a744a7
Gitweb:
https://git.kernel.org/tip/2c910e0753dc424dfdeb1f8e230ad8f187a744a7
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:47 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:51 +01:00

entry: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point on resuming to user mode. This
way we can avoid doing it from rcu_user_enter() with the last-resort
self-IPI hack that enforces rescheduling.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-5-frede...@kernel.org
---
 kernel/entry/common.c |  7 +++
 kernel/rcu/tree.c | 12 +++-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f09cae3..8442e5c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -184,6 +184,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs 
*regs,
 * enabled above.
 */
local_irq_disable_exit_to_user();
+
+   /* Check if any of the above work has queued a deferred wakeup 
*/
+   rcu_nocb_flush_deferred_wakeup();
+
ti_work = READ_ONCE(current_thread_info()->flags);
}
 
@@ -197,6 +201,9 @@ static void exit_to_user_mode_prepare(struct pt_regs *regs)
 
lockdep_assert_irqs_disabled();
 
+   /* Flush pending rcuog wakeup before the last need_resched() check */
+   rcu_nocb_flush_deferred_wakeup();
+
if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
ti_work = exit_to_user_mode_loop(regs, ti_work);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4b1e5bd..2ebc211 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -707,13 +707,15 @@ noinstr void rcu_user_enter(void)
lockdep_assert_irqs_disabled();
 
/*
-* We may be past the last rescheduling opportunity in the entry code.
-* Trigger a self IPI that will fire and reschedule once we resume to
-* user/guest mode.
+* Other than generic entry implementation, we may be past the last
+* rescheduling opportunity in the entry code. Trigger a self IPI
+* that will fire and reschedule once we resume in user/guest mode.
 */
instrumentation_begin();
-   if (do_nocb_deferred_wakeup(rdp) && need_resched())
-   irq_work_queue(this_cpu_ptr(_wakeup_work));
+   if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
+   if (do_nocb_deferred_wakeup(rdp) && need_resched())
+   irq_work_queue(this_cpu_ptr(_wakeup_work));
+   }
instrumentation_end();
 
rcu_eqs_enter(true);


[tip: sched/core] entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point

2021-02-10 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 14bbd41d5109a8049f3f1b77e994e0213f94f4c0
Gitweb:
https://git.kernel.org/tip/14bbd41d5109a8049f3f1b77e994e0213f94f4c0
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:48 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:51 +01:00

entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point

Following the idle loop model, cleanly check for pending rcuog wakeup
before the last rescheduling point upon resuming to guest mode. This
way we can avoid doing it from rcu_user_enter() with the last-resort
self-IPI hack that enforces rescheduling.

Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-6-frede...@kernel.org
---
 arch/x86/kvm/x86.c|  1 +-
 include/linux/entry-kvm.h | 14 -
 kernel/rcu/tree.c | 44 +-
 kernel/rcu/tree_plugin.h  |  1 +-
 4 files changed, 50 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b404e4..b967c1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1782,6 +1782,7 @@ EXPORT_SYMBOL_GPL(kvm_emulate_wrmsr);
 
 bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
 {
+   xfer_to_guest_mode_prepare();
return vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) ||
xfer_to_guest_mode_work_pending();
 }
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 9b93f85..8b2b1d6 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -47,6 +47,20 @@ static inline int arch_xfer_to_guest_mode_handle_work(struct 
kvm_vcpu *vcpu,
 int xfer_to_guest_mode_handle_work(struct kvm_vcpu *vcpu);
 
 /**
+ * xfer_to_guest_mode_prepare - Perform last minute preparation work that
+ * need to be handled while IRQs are disabled
+ * upon entering to guest.
+ *
+ * Has to be invoked with interrupts disabled before the last call
+ * to xfer_to_guest_mode_work_pending().
+ */
+static inline void xfer_to_guest_mode_prepare(void)
+{
+   lockdep_assert_irqs_disabled();
+   rcu_nocb_flush_deferred_wakeup();
+}
+
+/**
  * __xfer_to_guest_mode_work_pending - Check if work is pending
  *
  * Returns: True if work pending, False otherwise.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2ebc211..ce17b84 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -678,9 +678,10 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
 
+#if !defined(CONFIG_GENERIC_ENTRY) || !defined(CONFIG_KVM_XFER_TO_GUEST_WORK)
 /*
  * An empty function that will trigger a reschedule on
- * IRQ tail once IRQs get re-enabled on userspace resume.
+ * IRQ tail once IRQs get re-enabled on userspace/guest resume.
  */
 static void late_wakeup_func(struct irq_work *work)
 {
@@ -689,6 +690,37 @@ static void late_wakeup_func(struct irq_work *work)
 static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
IRQ_WORK_INIT(late_wakeup_func);
 
+/*
+ * If either:
+ *
+ * 1) the task is about to enter in guest mode and $ARCH doesn't support KVM 
generic work
+ * 2) the task is about to enter in user mode and $ARCH doesn't support 
generic entry.
+ *
+ * In these cases the late RCU wake ups aren't supported in the resched loops 
and our
+ * last resort is to fire a local irq_work that will trigger a reschedule once 
IRQs
+ * get re-enabled again.
+ */
+noinstr static void rcu_irq_work_resched(void)
+{
+   struct rcu_data *rdp = this_cpu_ptr(_data);
+
+   if (IS_ENABLED(CONFIG_GENERIC_ENTRY) && !(current->flags & PF_VCPU))
+   return;
+
+   if (IS_ENABLED(CONFIG_KVM_XFER_TO_GUEST_WORK) && (current->flags & 
PF_VCPU))
+   return;
+
+   instrumentation_begin();
+   if (do_nocb_deferred_wakeup(rdp) && need_resched()) {
+   irq_work_queue(this_cpu_ptr(_wakeup_work));
+   }
+   instrumentation_end();
+}
+
+#else
+static inline void rcu_irq_work_resched(void) { }
+#endif
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -702,8 +734,6 @@ static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
  */
 noinstr void rcu_user_enter(void)
 {
-   struct rcu_data *rdp = this_cpu_ptr(_data);
-
lockdep_assert_irqs_disabled();
 
/*
@@ -711,13 +741,7 @@ noinstr void rcu_user_enter(void)
 * rescheduling opportunity in the entry code. Trigger a self IPI
 * that will fire and reschedule once we resume in user/guest mode.
 */
-   instrumentation_begin();
-   if (!IS_ENABLED(CONFIG_GENERIC_ENTRY) || (current->flags & PF_VCPU)) {
-   if (do_nocb_deferred_wakeup(rdp) && need_resched())
-   irq_work_queue(this_cpu_ptr(_wakeup_work));
-   }
-

[tip: sched/core] rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

2021-02-10 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: e4234f21d2ea7674bcc1aeaca9d382b50ca1efec
Gitweb:
https://git.kernel.org/tip/e4234f21d2ea7674bcc1aeaca9d382b50ca1efec
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:44 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:49 +01:00

rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers

Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently whether initiated by idle, user or guest. Prepare
with pulling that control up to rcu_eqs_enter() callers.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frede...@kernel.org
---
 kernel/rcu/tree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40e5e3d..63032e5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -644,7 +644,6 @@ static noinstr void rcu_eqs_enter(bool user)
trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, 
atomic_read(>dynticks));
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && 
!is_idle_task(current));
rdp = this_cpu_ptr(_data);
-   do_nocb_deferred_wakeup(rdp);
rcu_prepare_for_idle();
rcu_preempt_deferred_qs(current);
 
@@ -672,7 +671,10 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(_data);
+
lockdep_assert_irqs_disabled();
+   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -691,7 +693,14 @@ EXPORT_SYMBOL_GPL(rcu_idle_enter);
  */
 noinstr void rcu_user_enter(void)
 {
+   struct rcu_data *rdp = this_cpu_ptr(_data);
+
lockdep_assert_irqs_disabled();
+
+   instrumentation_begin();
+   do_nocb_deferred_wakeup(rdp);
+   instrumentation_end();
+
rcu_eqs_enter(true);
 }
 #endif /* CONFIG_NO_HZ_FULL */


[tip: sched/core] rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

2021-02-10 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0940cbceefbaa40d85efeb968ce9f2707a145e58
Gitweb:
https://git.kernel.org/tip/0940cbceefbaa40d85efeb968ce9f2707a145e58
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:46 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:50 +01:00

rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsites is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that wants a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and 
perf)
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frede...@kernel.org
---
 kernel/rcu/tree.c| 21 -
 kernel/rcu/tree.h|  2 +-
 kernel/rcu/tree_plugin.h | 25 -
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 82838e9..4b1e5bd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -677,6 +677,18 @@ void rcu_idle_enter(void)
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
 
 #ifdef CONFIG_NO_HZ_FULL
+
+/*
+ * An empty function that will trigger a reschedule on
+ * IRQ tail once IRQs get re-enabled on userspace resume.
+ */
+static void late_wakeup_func(struct irq_work *work)
+{
+}
+
+static DEFINE_PER_CPU(struct irq_work, late_wakeup_work) =
+   IRQ_WORK_INIT(late_wakeup_func);
+
 /**
  * rcu_user_enter - inform RCU that we are resuming userspace.
  *
@@ -694,12 +706,19 @@ noinstr void rcu_user_enter(void)
 
lockdep_assert_irqs_disabled();
 
+   /*
+* We may be past the last rescheduling opportunity in the entry code.
+* Trigger a self IPI that will fire and reschedule once we resume to
+* user/guest mode.
+*/
instrumentation_begin();
-   do_nocb_deferred_wakeup(rdp);
+   if (do_nocb_deferred_wakeup(rdp) && need_resched())
+   irq_work_queue(this_cpu_ptr(_wakeup_work));
instrumentation_end();
 
rcu_eqs_enter(true);
 }
+
 #endif /* CONFIG_NO_HZ_FULL */
 
 /**
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 7708ed1..9226f40 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -433,7 +433,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, 
struct rcu_head *rhp,
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
+static bool do_nocb_deferred_wakeup(struct rcu_data *rdp);
 static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
 static void rcu_spawn_cpu_nocb_kthread(int cpu);
 static void __init rcu_spawn_nocb_kthreads(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d5b38c2..384856e 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1631,8 +1631,8 @@ bool rcu_is_nocb_cpu(int cpu)
  * Kick the GP kthread for this NOCB group.  Caller holds ->nocb_lock
  * and this function releases it.
  */
-static void wake_nocb_gp(struct rcu_data *rdp, bool force,
-  unsigned long flags)
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force,
+unsigned long flags)
__releases(rdp->nocb_lock)
 {
bool needwake = false;
@@ -1643,7 +1643,7 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
TPS("AlreadyAwake"));
rcu_nocb_unlock_irqrestore(rdp, flags);
-   return;
+   return false;
}
del_timer(&rdp->nocb_timer);
rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1656,6 +1656,8 @@ static void wake_nocb_gp(struct rcu_data *rdp, bool force,
raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
if (needwake)
wake_up_process(rdp_gp->nocb_gp_kthread);
+
+   return needwake;
 }
 
 /*
@@ -2152,20 +2154,23 @@ static int rcu_nocb_need_deferred_wakeup(struct 
rcu_data *rdp)
 }
 
 /* Do a deferred wakeup of rcu_nocb_kthread(). */
-static void do_nocb_deferred_wakeup_common(struct rcu_data *rdp)

[tip: sched/core] rcu/nocb: Perform deferred wake up before last idle's need_resched() check

2021-02-10 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3a7b5c87a0b29c8554a9bdbbbd75eeb4176fb5d4
Gitweb:
https://git.kernel.org/tip/3a7b5c87a0b29c8554a9bdbbbd75eeb4176fb5d4
Author:Frederic Weisbecker 
AuthorDate:Mon, 01 Feb 2021 00:05:45 +01:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:50 +01:00

rcu/nocb: Perform deferred wake up before last idle's need_resched() check

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.

Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.

Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
and placing it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.
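
As an illustration of the resulting ordering constraint, here is a hedged,
heavily simplified sketch (the real loop lives in do_idle() in
kernel/sched/idle.c; idle_loop_iteration() is just an illustrative name):

/*
 * Simplified sketch, not the actual do_idle(): flushing the deferred rcuog
 * wakeup may set TIF_NEED_RESCHED, so it must run before the final
 * need_resched() check that decides whether the CPU may go to sleep.
 */
static void idle_loop_iteration(void)
{
	rcu_nocb_flush_deferred_wakeup();	/* may wake rcuog and set need_resched() */

	if (need_resched()) {
		schedule_idle();		/* break out to the scheduler */
		return;
	}

	/* No call_rcu() is expected past this point until the CPU wakes up. */
	arch_cpu_idle();
}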

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and 
perf)
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frede...@kernel.org
---
 include/linux/rcupdate.h | 2 ++
 kernel/rcu/tree.c| 3 ---
 kernel/rcu/tree_plugin.h | 5 +
 kernel/sched/idle.c  | 1 +
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index fd02c5f..36c2119 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -110,8 +110,10 @@ static inline void rcu_user_exit(void) { }
 
 #ifdef CONFIG_RCU_NOCB_CPU
 void rcu_init_nohz(void);
+void rcu_nocb_flush_deferred_wakeup(void);
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_init_nohz(void) { }
+static inline void rcu_nocb_flush_deferred_wakeup(void) { }
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
 /**
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 63032e5..82838e9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -671,10 +671,7 @@ static noinstr void rcu_eqs_enter(bool user)
  */
 void rcu_idle_enter(void)
 {
-   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
-
lockdep_assert_irqs_disabled();
-   do_nocb_deferred_wakeup(rdp);
rcu_eqs_enter(false);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 7e291ce..d5b38c2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2187,6 +2187,11 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
do_nocb_deferred_wakeup_common(rdp);
 }
 
+void rcu_nocb_flush_deferred_wakeup(void)
+{
+   do_nocb_deferred_wakeup(this_cpu_ptr(&rcu_data));
+}
+
 void __init rcu_init_nohz(void)
 {
int cpu;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 305727e..7199e6f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -285,6 +285,7 @@ static void do_idle(void)
}
 
arch_cpu_idle_enter();
+   rcu_nocb_flush_deferred_wakeup();
 
/*
 * In poll mode we reenable interrupts and spin. Also if we


[tip: sched/core] static_call: Provide DEFINE_STATIC_CALL_RET0()

2021-02-08 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 50ace20f2cfecd90c88edaf58400b362f42f2960
Gitweb:
https://git.kernel.org/tip/50ace20f2cfecd90c88edaf58400b362f42f2960
Author:Frederic Weisbecker 
AuthorDate:Mon, 18 Jan 2021 15:12:17 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 05 Feb 2021 17:19:55 +01:00

static_call: Provide DEFINE_STATIC_CALL_RET0()

DECLARE_STATIC_CALL() must pass the original function targeted for a
given static call. But DEFINE_STATIC_CALL() may want to initialize it as
off. In this case we can't pass NULL (for functions without return value)
or __static_call_return0 (for functions returning a value) directly
to DEFINE_STATIC_CALL() as that may trigger a static call redeclaration
with a different function prototype. Nor can type casts work around
that, as they don't get along with typeof().

The proper way to do that for functions that don't return a value is
to use DEFINE_STATIC_CALL_NULL(). But functions returning an actual value
don't have an equivalent yet.

Provide DEFINE_STATIC_CALL_RET0() to solve this situation.
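
As a usage illustration (hypothetical hook names, not part of this patch),
a static call for a value-returning function can now be defined as
initially returning 0:

/* Hypothetical hook; only later installed as the static call target. */
static int my_hook_impl(int arg)
{
	return arg * 2;
}

/* Defined "off": calls return 0 until a real target is installed. */
DEFINE_STATIC_CALL_RET0(my_hook, my_hook_impl);

static int run_hook(int arg)
{
	return static_call(my_hook)(arg);
}

/* Enable later with: static_call_update(my_hook, my_hook_impl); */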

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210118141223.123667-3-frede...@kernel.org
---
 include/linux/static_call.h | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index bd6735d..d69dd8b 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -144,13 +144,13 @@ extern int static_call_text_reserved(void *start, void 
*end);
 
 extern long __static_call_return0(void);
 
-#define DEFINE_STATIC_CALL(name, _func)
\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
.type = 1,  \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -178,12 +178,12 @@ struct static_call_key {
void *func;
 };
 
-#define DEFINE_STATIC_CALL(name, _func)
\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
};  \
-   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func)
+   ARCH_DEFINE_STATIC_CALL_TRAMP(name, _func_init)
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
DECLARE_STATIC_CALL(name, _func);   \
@@ -234,10 +234,10 @@ static inline long __static_call_return0(void)
return 0;
 }
 
-#define DEFINE_STATIC_CALL(name, _func)
\
+#define __DEFINE_STATIC_CALL(name, _func, _func_init)  \
DECLARE_STATIC_CALL(name, _func);   \
struct static_call_key STATIC_CALL_KEY(name) = {\
-   .func = _func,  \
+   .func = _func_init, \
}
 
 #define DEFINE_STATIC_CALL_NULL(name, _func)   \
@@ -286,4 +286,10 @@ static inline int static_call_text_reserved(void *start, 
void *end)
 
 #endif /* CONFIG_HAVE_STATIC_CALL */
 
+#define DEFINE_STATIC_CALL(name, _func)
\
+   __DEFINE_STATIC_CALL(name, _func, _func)
+
+#define DEFINE_STATIC_CALL_RET0(name, _func)   \
+   __DEFINE_STATIC_CALL(name, _func, __static_call_return0)
+
 #endif /* _LINUX_STATIC_CALL_H */


[tip: core/rcu] rcu: Implement rcu_segcblist_is_offloaded() config dependent

2020-12-13 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: e3771c850d3b9349b48449c9a91c98944a08650c
Gitweb:
https://git.kernel.org/tip/e3771c850d3b9349b48449c9a91c98944a08650c
Author:Frederic Weisbecker 
AuthorDate:Mon, 21 Sep 2020 14:43:40 +02:00
Committer: Paul E. McKenney 
CommitterDate: Thu, 19 Nov 2020 19:37:16 -08:00

rcu: Implement rcu_segcblist_is_offloaded() config dependent

This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
Note that rcu_segcblist_is_offloaded() is defined in the header file,
which means that the generated code should be just as efficient as before.
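
The pattern in isolation (a generic sketch with made-up names, not the
kernel code): once the Kconfig switch is folded into the inline predicate,
callers can drop their own IS_ENABLED() checks while the compiler still
constant-folds the condition and eliminates the dead branches.

#include <stdbool.h>

#define FEATURE_ENABLED 0	/* stand-in for IS_ENABLED(CONFIG_RCU_NOCB_CPU) */

struct cblist { bool offloaded; };

static inline bool cblist_is_offloaded(const struct cblist *l)
{
	/* Constant-folds to "false" when the feature is compiled out. */
	return FEATURE_ENABLED && l->offloaded;
}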

Suggested-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Paul E. McKenney 
Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Signed-off-by: Paul E. McKenney 
---
 kernel/rcu/rcu_segcblist.h |  2 +-
 kernel/rcu/tree.c  | 21 +++--
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 5c293af..492262b 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -62,7 +62,7 @@ static inline bool rcu_segcblist_is_enabled(struct 
rcu_segcblist *rsclp)
 /* Is the specified rcu_segcblist offloaded?  */
 static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp)
 {
-   return rsclp->offloaded;
+   return IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rsclp->offloaded;
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 93e1808..0ccdca4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1603,8 +1603,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, 
struct rcu_data *rdp)
 {
bool ret = false;
bool need_qs;
-   const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-  rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
 
raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -2048,8 +2047,7 @@ static void rcu_gp_cleanup(void)
needgp = true;
}
/* Advance CBs to reduce false positives below. */
-   offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-   rcu_segcblist_is_offloaded(&rdp->cblist);
+   offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) {
WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT);
WRITE_ONCE(rcu_state.gp_req_activity, jiffies);
@@ -2248,8 +2246,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
unsigned long flags;
unsigned long mask;
bool needwake = false;
-   const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-  rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_node *rnp;
 
WARN_ON_ONCE(rdp->cpu != smp_processor_id());
@@ -2417,8 +2414,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 {
int div;
unsigned long flags;
-   const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-  rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
struct rcu_head *rhp;
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
long bl, count;
@@ -2675,8 +2671,7 @@ static __latent_entropy void rcu_core(void)
unsigned long flags;
struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
struct rcu_node *rnp = rdp->mynode;
-   const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-  rcu_segcblist_is_offloaded(&rdp->cblist);
+   const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist);
 
if (cpu_is_offline(smp_processor_id()))
return;
@@ -2978,8 +2973,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
   rcu_segcblist_n_cbs(&rdp->cblist));
 
/* Go handle any RCU core processing required. */
-   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
-   unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
+   if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) {
__call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */
} else {
__call_rcu_core(rdp, head, flags);
@@ -3712,8 +3706,7 @@ static int rcu_pending(int user)
 
/* Has RCU gone idle with this CPU needing another grace period? */
if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
-   (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) ||
-!rcu_segcblist_is_offloaded(&rdp->cblist)) &&
+   !rcu_segcblist_is_offloaded(&rdp->cblist) &&
!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
return 1;
 


[tip: irq/core] s390/vtime: Use the generic IRQ entry accounting

2020-12-02 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the irq/core branch of tip:

Commit-ID: 2b91ec9f551b56751cde48792f1c0a1130358844
Gitweb:
https://git.kernel.org/tip/2b91ec9f551b56751cde48792f1c0a1130358844
Author:Frederic Weisbecker 
AuthorDate:Wed, 02 Dec 2020 12:57:29 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 20:20:04 +01:00

s390/vtime: Use the generic IRQ entry accounting

s390 has its own version of IRQ entry accounting because it doesn't
account the idle time the same way the other architectures do. Only
the actual idle sleep time is accounted as idle time, the rest of the
idle task execution is accounted as system time.

Make the generic IRQ entry accounting aware of architectures that have
their own way of accounting idle time and convert s390 to use it.

This prepares s390 to get involved in further consolidations of IRQ
time accounting.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201202115732.27827-3-frede...@kernel.org

---
 arch/Kconfig  |  7 ++-
 arch/s390/Kconfig |  1 +
 arch/s390/include/asm/vtime.h |  1 -
 arch/s390/kernel/vtime.c  |  4 
 kernel/sched/cputime.c| 13 ++---
 5 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc..0f151b4 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -627,6 +627,12 @@ config HAVE_TIF_NOHZ
 config HAVE_VIRT_CPU_ACCOUNTING
bool
 
+config HAVE_VIRT_CPU_ACCOUNTING_IDLE
+   bool
+   help
+ Architecture has its own way to account idle CPU time and therefore
+ doesn't implement vtime_account_idle().
+
 config ARCH_HAS_SCALED_CPUTIME
bool
 
@@ -641,7 +647,6 @@ config HAVE_VIRT_CPU_ACCOUNTING_GEN
  some 32-bit arches may require multiple accesses, so proper
  locking is needed to protect against concurrent accesses.
 
-
 config HAVE_IRQ_TIME_ACCOUNTING
bool
help
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 4a2a12b..6f1fdcd 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -181,6 +181,7 @@ config S390
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
+   select HAVE_VIRT_CPU_ACCOUNTING_IDLE
select IOMMU_HELPER if PCI
select IOMMU_SUPPORTif PCI
select MODULES_USE_ELF_RELA
diff --git a/arch/s390/include/asm/vtime.h b/arch/s390/include/asm/vtime.h
index 3622d4e..fac6a67 100644
--- a/arch/s390/include/asm/vtime.h
+++ b/arch/s390/include/asm/vtime.h
@@ -2,7 +2,6 @@
 #ifndef _S390_VTIME_H
 #define _S390_VTIME_H
 
-#define __ARCH_HAS_VTIME_ACCOUNT
 #define __ARCH_HAS_VTIME_TASK_SWITCH
 
 #endif /* _S390_VTIME_H */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index f9f2a11..ebd8e56 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -247,10 +247,6 @@ void vtime_account_kernel(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
-void vtime_account_irq_enter(struct task_struct *tsk)
-__attribute__((alias("vtime_account_kernel")));
-
-
 /*
  * Sorted add to a list. List is linear searched until first bigger
  * element is found.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 61ce9f9..2783162 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -417,23 +417,14 @@ void vtime_task_switch(struct task_struct *prev)
 }
 # endif
 
-/*
- * Archs that account the whole time spent in the idle task
- * (outside irq) as idle time can rely on this and just implement
- * vtime_account_kernel() and vtime_account_idle(). Archs that
- * have other meaning of the idle time (s390 only includes the
- * time spent by the CPU when it's in low power mode) must override
- * vtime_account().
- */
-#ifndef __ARCH_HAS_VTIME_ACCOUNT
 void vtime_account_irq_enter(struct task_struct *tsk)
 {
-   if (!in_interrupt() && is_idle_task(tsk))
+   if (!IS_ENABLED(CONFIG_HAVE_VIRT_CPU_ACCOUNTING_IDLE) &&
+   !in_interrupt() && is_idle_task(tsk))
vtime_account_idle(tsk);
else
vtime_account_kernel(tsk);
 }
-#endif /* __ARCH_HAS_VTIME_ACCOUNT */
 
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
u64 *ut, u64 *st)


[tip: irq/core] irqtime: Move irqtime entry accounting after irq offset incrementation

2020-12-02 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the irq/core branch of tip:

Commit-ID: d3759e7184f8f6187e62f8c4e7dcb1f6c47c075a
Gitweb:
https://git.kernel.org/tip/d3759e7184f8f6187e62f8c4e7dcb1f6c47c075a
Author:Frederic Weisbecker 
AuthorDate:Wed, 02 Dec 2020 12:57:31 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 20:20:05 +01:00

irqtime: Move irqtime entry accounting after irq offset incrementation

IRQ time entry is currently accounted before HARDIRQ_OFFSET or
SOFTIRQ_OFFSET are incremented. This is convenient to decide to which
index the cputime to account is dispatched.

Unfortunately it prevents tick_irq_enter() from being called under
HARDIRQ_OFFSET because tick_irq_enter() has to be called before the IRQ
entry accounting due to the necessary clock catch up. As a result we
don't benefit from appropriate lockdep coverage on tick_irq_enter().

To prepare for fixing this, move the IRQ entry cputime accounting after
the preempt offset is incremented. This requires the cputime dispatch
code to handle the extra offset.
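
The kernel/sched/cputime.c hunk is truncated in this archive. As a hedged
sketch of the idea (account_delta() is a made-up stand-in, not the kernel
helper), the dispatcher subtracts the offset that was just added in order
to classify the time that elapsed in the interrupted context:

/* Hedged sketch, not the exact kernel code: preempt_count() already carries
 * the offset of the context being entered, so subtracting it back yields the
 * context that was interrupted, which owns the elapsed time. */
static void irqtime_dispatch(u64 delta, unsigned int offset)
{
	unsigned int pc = preempt_count() - offset;

	if (pc & HARDIRQ_MASK)
		account_delta(CPUTIME_IRQ, delta);	/* a hardirq was interrupted */
	else if (pc & SOFTIRQ_OFFSET)
		account_delta(CPUTIME_SOFTIRQ, delta);	/* a softirq was interrupted */
	/* otherwise the time belongs to task/idle accounting, handled elsewhere */
}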

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20201202115732.27827-5-frede...@kernel.org

---
 include/linux/hardirq.h |  4 ++--
 include/linux/vtime.h   | 34 --
 kernel/sched/cputime.c  | 18 +++---
 kernel/softirq.c|  6 +++---
 4 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 754f67a..7c9d6a2 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -32,9 +32,9 @@ static __always_inline void rcu_irq_enter_check_tick(void)
  */
 #define __irq_enter()  \
do {\
-   account_irq_enter_time(current);\
preempt_count_add(HARDIRQ_OFFSET);  \
lockdep_hardirq_enter();\
+   account_hardirq_enter(current); \
} while (0)
 
 /*
@@ -62,8 +62,8 @@ void irq_enter_rcu(void);
  */
 #define __irq_exit()   \
do {\
+   account_hardirq_exit(current);  \
lockdep_hardirq_exit(); \
-   account_irq_exit_time(current); \
preempt_count_sub(HARDIRQ_OFFSET);  \
} while (0)
 
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 6c98674..041d652 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -83,32 +83,46 @@ static inline void vtime_init_idle(struct task_struct *tsk, 
int cpu) { }
 #endif
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-extern void vtime_account_irq(struct task_struct *tsk);
+extern void vtime_account_irq(struct task_struct *tsk, unsigned int offset);
 extern void vtime_account_softirq(struct task_struct *tsk);
 extern void vtime_account_hardirq(struct task_struct *tsk);
 extern void vtime_flush(struct task_struct *tsk);
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-static inline void vtime_account_irq(struct task_struct *tsk) { }
+static inline void vtime_account_irq(struct task_struct *tsk, unsigned int 
offset) { }
+static inline void vtime_account_softirq(struct task_struct *tsk) { }
+static inline void vtime_account_hardirq(struct task_struct *tsk) { }
 static inline void vtime_flush(struct task_struct *tsk) { }
 #endif
 
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-extern void irqtime_account_irq(struct task_struct *tsk);
+extern void irqtime_account_irq(struct task_struct *tsk, unsigned int offset);
 #else
-static inline void irqtime_account_irq(struct task_struct *tsk) { }
+static inline void irqtime_account_irq(struct task_struct *tsk, unsigned int 
offset) { }
 #endif
 
-static inline void account_irq_enter_time(struct task_struct *tsk)
+static inline void account_softirq_enter(struct task_struct *tsk)
 {
-   vtime_account_irq(tsk);
-   irqtime_account_irq(tsk);
+   vtime_account_irq(tsk, SOFTIRQ_OFFSET);
+   irqtime_account_irq(tsk, SOFTIRQ_OFFSET);
 }
 
-static inline void account_irq_exit_time(struct task_struct *tsk)
+static inline void account_softirq_exit(struct task_struct *tsk)
 {
-   vtime_account_irq(tsk);
-   irqtime_account_irq(tsk);
+   vtime_account_softirq(tsk);
+   irqtime_account_irq(tsk, 0);
+}
+
+static inline void account_hardirq_enter(struct task_struct *tsk)
+{
+   vtime_account_irq(tsk, HARDIRQ_OFFSET);
+   irqtime_account_irq(tsk, HARDIRQ_OFFSET);
+}
+
+static inline void account_hardirq_exit(struct task_struct *tsk)
+{
+   vtime_account_hardirq(tsk);
+   irqtime_account_irq(tsk, 0);
 }
 
 #endif /* _LINUX_KERNEL_VTIME_H */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 02163d4..5f61165 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ 

[tip: irq/core] sched/vtime: Consolidate IRQ time accounting

2020-12-02 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the irq/core branch of tip:

Commit-ID: 8a6a5920d3286eb0eae9f36a4ec4fc9df511eccb
Gitweb:
https://git.kernel.org/tip/8a6a5920d3286eb0eae9f36a4ec4fc9df511eccb
Author:Frederic Weisbecker 
AuthorDate:Wed, 02 Dec 2020 12:57:30 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 20:20:05 +01:00

sched/vtime: Consolidate IRQ time accounting

The 3 architectures implementing CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
all have their own version of irq time accounting that dispatch the
cputime to the appropriate index: hardirq, softirq, system, idle,
guest... from an all-in-one function.

Instead of having these ad-hoc versions, move the cputime destination
dispatch decision to the core code and leave only the actual per-index
cputime accounting to the architecture.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201202115732.27827-4-frede...@kernel.org

---
 arch/ia64/kernel/time.c| 20 +
 arch/powerpc/kernel/time.c | 56 ++---
 arch/s390/kernel/vtime.c   | 45 +-
 include/linux/vtime.h  | 16 ---
 kernel/sched/cputime.c | 13 ++---
 5 files changed, 102 insertions(+), 48 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 7abc5f3..733e0e3 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -138,12 +138,8 @@ void vtime_account_kernel(struct task_struct *tsk)
struct thread_info *ti = task_thread_info(tsk);
__u64 stime = vtime_delta(tsk);
 
-   if ((tsk->flags & PF_VCPU) && !irq_count())
+   if (tsk->flags & PF_VCPU)
ti->gtime += stime;
-   else if (hardirq_count())
-   ti->hardirq_time += stime;
-   else if (in_serving_softirq())
-   ti->softirq_time += stime;
else
ti->stime += stime;
 }
@@ -156,6 +152,20 @@ void vtime_account_idle(struct task_struct *tsk)
ti->idle_time += vtime_delta(tsk);
 }
 
+void vtime_account_softirq(struct task_struct *tsk)
+{
+   struct thread_info *ti = task_thread_info(tsk);
+
+   ti->softirq_time += vtime_delta(tsk);
+}
+
+void vtime_account_hardirq(struct task_struct *tsk)
+{
+   struct thread_info *ti = task_thread_info(tsk);
+
+   ti->hardirq_time += vtime_delta(tsk);
+}
+
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 static irqreturn_t
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 74efe46..cf3f8db 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -311,12 +311,11 @@ static unsigned long vtime_delta_scaled(struct 
cpu_accounting_data *acct,
return stime_scaled;
 }
 
-static unsigned long vtime_delta(struct task_struct *tsk,
+static unsigned long vtime_delta(struct cpu_accounting_data *acct,
 unsigned long *stime_scaled,
 unsigned long *steal_time)
 {
unsigned long now, stime;
-   struct cpu_accounting_data *acct = get_accounting(tsk);
 
WARN_ON_ONCE(!irqs_disabled());
 
@@ -331,29 +330,30 @@ static unsigned long vtime_delta(struct task_struct *tsk,
return stime;
 }
 
+static void vtime_delta_kernel(struct cpu_accounting_data *acct,
+  unsigned long *stime, unsigned long 
*stime_scaled)
+{
+   unsigned long steal_time;
+
+   *stime = vtime_delta(acct, stime_scaled, &steal_time);
+   *stime -= min(*stime, steal_time);
+   acct->steal_time += steal_time;
+}
+
 void vtime_account_kernel(struct task_struct *tsk)
 {
-   unsigned long stime, stime_scaled, steal_time;
struct cpu_accounting_data *acct = get_accounting(tsk);
+   unsigned long stime, stime_scaled;
 
-   stime = vtime_delta(tsk, &stime_scaled, &steal_time);
-
-   stime -= min(stime, steal_time);
-   acct->steal_time += steal_time;
+   vtime_delta_kernel(acct, &stime, &stime_scaled);
 
-   if ((tsk->flags & PF_VCPU) && !irq_count()) {
+   if (tsk->flags & PF_VCPU) {
acct->gtime += stime;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->utime_scaled += stime_scaled;
 #endif
} else {
-   if (hardirq_count())
-   acct->hardirq_time += stime;
-   else if (in_serving_softirq())
-   acct->softirq_time += stime;
-   else
-   acct->stime += stime;
-
+   acct->stime += stime;
 #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->stime_scaled += stime_scaled;
 #endif
@@ -366,10 +366,34 @@ void vtime_account_idle(struct task_struct *tsk)
unsigned long stime, stime_scaled, steal_time;
struct cpu_accounting_data *acct = get_accounting(tsk);
 
-   stime = vtime_delta(tsk, &stime_scaled, &steal_time);
+   stime = vtime_delta(acct, &stime_scaled, &steal_time);
acct->idle_time += stime + steal_time;

[tip: irq/core] sched/cputime: Remove symbol exports from IRQ time accounting

2020-12-02 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the irq/core branch of tip:

Commit-ID: 7197688b2006357da75a014e0a76be89ca9c2d46
Gitweb:
https://git.kernel.org/tip/7197688b2006357da75a014e0a76be89ca9c2d46
Author:Frederic Weisbecker 
AuthorDate:Wed, 02 Dec 2020 12:57:28 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 20:20:04 +01:00

sched/cputime: Remove symbol exports from IRQ time accounting

account_irq_enter_time() and account_irq_exit_time() are not called
from modules. EXPORT_SYMBOL_GPL() can be safely removed from the IRQ
cputime accounting functions called from there.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201202115732.27827-2-frede...@kernel.org

---
 arch/s390/kernel/vtime.c | 10 +-
 kernel/sched/cputime.c   |  2 --
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index 8df10d3..f9f2a11 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -226,7 +226,7 @@ void vtime_flush(struct task_struct *tsk)
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
  */
-void vtime_account_irq_enter(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
u64 timer;
 
@@ -245,12 +245,12 @@ void vtime_account_irq_enter(struct task_struct *tsk)
 
virt_timer_forward(timer);
 }
-EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
-
-void vtime_account_kernel(struct task_struct *tsk)
-__attribute__((alias("vtime_account_irq_enter")));
 EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
+void vtime_account_irq_enter(struct task_struct *tsk)
+__attribute__((alias("vtime_account_kernel")));
+
+
 /*
  * Sorted add to a list. List is linear searched until first bigger
  * element is found.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5a55d23..61ce9f9 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -71,7 +71,6 @@ void irqtime_account_irq(struct task_struct *curr)
else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
 }
-EXPORT_SYMBOL_GPL(irqtime_account_irq);
 
 static u64 irqtime_tick_accounted(u64 maxtime)
 {
@@ -434,7 +433,6 @@ void vtime_account_irq_enter(struct task_struct *tsk)
else
vtime_account_kernel(tsk);
 }
-EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
 
 void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,


[tip: irq/core] irq: Call tick_irq_enter() inside HARDIRQ_OFFSET

2020-12-02 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the irq/core branch of tip:

Commit-ID: d14ce74f1fb376ccbbc0b05ded477ada51253729
Gitweb:
https://git.kernel.org/tip/d14ce74f1fb376ccbbc0b05ded477ada51253729
Author:Frederic Weisbecker 
AuthorDate:Wed, 02 Dec 2020 12:57:32 +01:00
Committer: Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 20:20:05 +01:00

irq: Call tick_irq_enter() inside HARDIRQ_OFFSET

Now that account_hardirq_enter() is called after HARDIRQ_OFFSET has
been incremented, there is nothing left that prevents us from also
moving tick_irq_enter() after HARDIRQ_OFFSET is incremented.

The desired outcome is to remove the nasty hack that prevents softirqs
from being raised through ksoftirqd instead of the hardirq bottom half.
Also tick_irq_enter() then becomes appropriately covered by lockdep.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20201202115732.27827-6-frede...@kernel.org

---
 kernel/softirq.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b8f42b3..d5bfd5e 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -377,16 +377,12 @@ restart:
  */
 void irq_enter_rcu(void)
 {
-   if (is_idle_task(current) && !in_interrupt()) {
-   /*
-* Prevent raise_softirq from needlessly waking up ksoftirqd
-* here, as softirq will be serviced on return from interrupt.
-*/
-   local_bh_disable();
+   __irq_enter_raw();
+
+   if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
tick_irq_enter();
-   _local_bh_enable();
-   }
-   __irq_enter();
+
+   account_hardirq_enter(current);
 }
 
 /**


[tip: core/entry] context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs

2020-11-20 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/entry branch of tip:

Commit-ID: 6775de4984ea83ce39f19a40c09f8813d7423831
Gitweb:
https://git.kernel.org/tip/6775de4984ea83ce39f19a40c09f8813d7423831
Author:Frederic Weisbecker 
AuthorDate:Tue, 17 Nov 2020 16:16:36 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:42 +01:00

context_tracking: Only define schedule_user() on 
!HAVE_CONTEXT_TRACKING_OFFSTACK archs

schedule_user() was traditionally used by the entry code's tail to
preempt userspace after the call to user_enter(). Indeed the call to
user_enter() used to be performed upon syscall exit slow path which was
right before the last opportunity to schedule() while resuming to
userspace. The context tracking state had to be saved on the task stack
and set back to CONTEXT_KERNEL temporarily in order to safely switch to
another task.

Only a few archs use it now (namely sparc64 and powerpc64) and those
implementing HAVE_CONTEXT_TRACKING_OFFSTACK definitely can't rely on it.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201117151637.259084-5-frede...@kernel.org
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c23d7cb..44426e5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4631,7 +4631,7 @@ void __sched schedule_idle(void)
} while (need_resched());
 }
 
-#ifdef CONFIG_CONTEXT_TRACKING
+#if defined(CONFIG_CONTEXT_TRACKING) && 
!defined(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK)
 asmlinkage __visible void __sched schedule_user(void)
 {
/*


[tip: core/entry] sched: Detect call to schedule from critical entry code

2020-11-20 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/entry branch of tip:

Commit-ID: 9f68b5b74c48761bcbd7d90cf1426049bdbaabb7
Gitweb:
https://git.kernel.org/tip/9f68b5b74c48761bcbd7d90cf1426049bdbaabb7
Author:Frederic Weisbecker 
AuthorDate:Tue, 17 Nov 2020 16:16:35 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:42 +01:00

sched: Detect call to schedule from critical entry code

Detect calls to schedule() between user_enter() and user_exit(). Those
are symptoms of early entry code that either forgot to protect a call
to schedule() inside exception_enter()/exception_exit() or, in the case
of HAVE_CONTEXT_TRACKING_OFFSTACK, enabled interrupts or preemption in
a wrong spot.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201117151637.259084-4-frede...@kernel.org
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7..c23d7cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4291,6 +4291,7 @@ static inline void schedule_debug(struct task_struct 
*prev, bool preempt)
preempt_count_set(PREEMPT_DISABLED);
}
rcu_sleep_check();
+   SCHED_WARN_ON(ct_state() == CONTEXT_USER);
 
profile_hit(SCHED_PROFILING, __builtin_return_address(0));
 


[tip: core/entry] context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK

2020-11-20 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/entry branch of tip:

Commit-ID: 179a9cf79212bb3b96fb69a314583189cd863c5b
Gitweb:
https://git.kernel.org/tip/179a9cf79212bb3b96fb69a314583189cd863c5b
Author:Frederic Weisbecker 
AuthorDate:Tue, 17 Nov 2020 16:16:34 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:42 +01:00

context_tracking: Don't implement exception_enter/exit() on 
CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK

The typical steps with context tracking are:

1) Task runs in userspace
2) Task enters the kernel (syscall/exception/IRQ)
3) Task switches from context tracking state CONTEXT_USER to
   CONTEXT_KERNEL (user_exit())
4) Task does stuff in kernel
5) Task switches from context tracking state CONTEXT_KERNEL to
   CONTEXT_USER (user_enter())
6) Task exits the kernel

If an exception fires between 5) and 6), the pt_regs and the context
tracking disagree on the context of the faulted/trapped instruction.
CONTEXT_KERNEL must be set before the exception handler, that's
unconditional for those handlers that want to be able to call into
schedule(), but CONTEXT_USER must be restored when the exception exits
whereas pt_regs tells that we are resuming to kernel space.

This can't be fixed with storing the context tracking state in a per-cpu
or per-task variable since another exception may fire onto the current
one and overwrite the saved state. Also the task can schedule. So it
has to be stored in a per task stack.

This is how exception_enter()/exception_exit() paper over the problem:

5) Task switches from context tracking state CONTEXT_KERNEL to
   CONTEXT_USER (user_enter())
5.1) Exception fires
5.2) prev_state = exception_enter() // save CONTEXT_USER to prev_state
// and set CONTEXT_KERNEL
5.3) Exception handler
5.4) exception_exit(prev_state) // restore CONTEXT_USER
5.5) Exception resumes
6) Task exits the kernel

The condition for living without exception_enter()/exception_exit() is to
forbid exceptions and IRQs between 2) and 3) and between 5) and 6), or, if
any is allowed to trigger, to guarantee that it won't call into context
tracking (eg: NMIs) and won't schedule. These requirements are met by
architectures supporting CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK, which can
therefore afford not to implement this hack.
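
For reference, a hedged sketch of the legacy pattern that a non-OFFSTACK
architecture keeps relying on (the fault handler names are made up):

void arch_handle_fault(struct pt_regs *regs)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();	/* save CONTEXT_USER, set CONTEXT_KERNEL */

	do_fault_handling(regs);	/* may call schedule() safely */

	exception_exit(prev_state);	/* restore the saved context tracking state */
}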

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201117151637.259084-3-frede...@kernel.org
---
 include/linux/context_tracking.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d53cd33..bceb064 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -51,7 +51,8 @@ static inline enum ctx_state exception_enter(void)
 {
enum ctx_state prev_ctx;
 
-   if (!context_tracking_enabled())
+   if (IS_ENABLED(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK) ||
+   !context_tracking_enabled())
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
@@ -63,7 +64,8 @@ static inline enum ctx_state exception_enter(void)
 
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
-   if (context_tracking_enabled()) {
+   if (!IS_ENABLED(CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK) &&
+   context_tracking_enabled()) {
if (prev_ctx != CONTEXT_KERNEL)
context_tracking_enter(prev_ctx);
}


[tip: core/entry] context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK

2020-11-20 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/entry branch of tip:

Commit-ID: 83c2da2e605c73aafcc02df04b2dbf1ccbfc24c0
Gitweb:
https://git.kernel.org/tip/83c2da2e605c73aafcc02df04b2dbf1ccbfc24c0
Author:Frederic Weisbecker 
AuthorDate:Tue, 17 Nov 2020 16:16:33 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:41 +01:00

context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK

Historically, context tracking had to deal with fragile entry code paths,
ie: before user_exit() is called and after user_enter() is called, in
case some of those spots would call schedule() or use RCU. In such
cases, the site had to be protected between exception_enter() and
exception_exit(), which save the context tracking state on the task stack.

Such sleepable fragile code paths had many different origins: tracing,
exceptions, early or late calls to context tracking on syscalls...

Aside from that not being pretty, saving the context tracking state on
the task stack forces us to run context tracking on all CPUs, including
housekeepers, and prevents us from completely shutting down nohz_full at
runtime on a CPU in the future, as context tracking and its overhead
would still need to run system wide.

Now thanks to the extensive efforts to sanitize x86 entry code, those
conditions have been removed and we can now get rid of these workarounds
in this architecture.

Create a Kconfig feature to express this achievement.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201117151637.259084-2-frede...@kernel.org
---
 arch/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc..090ef35 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -618,6 +618,23 @@ config HAVE_CONTEXT_TRACKING
  protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal
  handling on irq exit still need to be protected.
 
+config HAVE_CONTEXT_TRACKING_OFFSTACK
+   bool
+   help
+ Architecture neither relies on exception_enter()/exception_exit()
+ nor on schedule_user(). Also preempt_schedule_notrace() and
+ preempt_schedule_irq() can't be called in a preemptible section
+ while context tracking is CONTEXT_USER. This feature reflects a sane
+ entry implementation where the following requirements are met on
+ critical entry code, ie: before user_exit() or after user_enter():
+
+ - Critical entry code isn't preemptible (or better yet:
+   not interruptible).
+ - No use of RCU read side critical sections, unless rcu_nmi_enter()
+   got called.
+ - No use of instrumentation, unless instrumentation_begin() got
+   called.
+
 config HAVE_TIF_NOHZ
bool
help


[tip: core/entry] x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK

2020-11-20 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/entry branch of tip:

Commit-ID: d1f250e2205eca9f1264f8e2d3a41fcf38f92d91
Gitweb:
https://git.kernel.org/tip/d1f250e2205eca9f1264f8e2d3a41fcf38f92d91
Author:Frederic Weisbecker 
AuthorDate:Tue, 17 Nov 2020 16:16:37 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 19 Nov 2020 11:25:42 +01:00

x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK

A lot of ground work has been performed on x86 entry code. Fragile paths
between user_enter() and user_exit() have IRQs disabled. Uses of RCU and
instrumentation in these fragile areas have been explicitly annotated
and protected.

This architecture doesn't need exception_enter()/exception_exit()
anymore and has therefore earned CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201117151637.259084-6-frede...@kernel.org
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b8..d793361 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -162,6 +162,7 @@ config X86
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKINGif X86_64
+   select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS


[tip: core/rcu] tick/nohz: Narrow down noise while setting current task's tick dependency

2020-07-31 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 3c8920e2dbd1a55f72dc14d656df9d0097cf5c72
Gitweb:
https://git.kernel.org/tip/3c8920e2dbd1a55f72dc14d656df9d0097cf5c72
Author:Frederic Weisbecker 
AuthorDate:Fri, 15 May 2020 02:34:29 +02:00
Committer: Paul E. McKenney 
CommitterDate: Mon, 29 Jun 2020 11:58:50 -07:00

tick/nohz: Narrow down noise while setting current task's tick dependency

Setting a tick dependency on any task, including the case where a task
sets that dependency on itself, triggers an IPI to all CPUs.  That is
of course suboptimal but it had previously not been an issue because it
was only used by POSIX CPU timers on nohz_full, which apparently never
occurs in latency-sensitive workloads in production.  (Or users of such
systems are suffering in silence on the one hand or venting their ire
on the wrong people on the other.)

But RCU now sets a task tick dependency on the current task in order
to fix stall issues that can occur during RCU callback processing.
Thus, RCU callback processing triggers frequent system-wide IPIs from
nohz_full CPUs.  This is quite counter-productive, after all, avoiding
IPIs is what nohz_full is supposed to be all about.

This commit therefore optimizes tasks' self-setting of a task tick
dependency by using tick_nohz_full_kick() to avoid the system-wide IPI.
Instead, only the execution of the one task is disturbed, which is
acceptable given that this disturbance is well down into the noise
compared to the degree to which the RCU callback processing itself
disturbs execution.
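
The underlying pattern, shown as a standalone C11 program separate from the
kernel code below: atomic_fetch_or() returns the previous mask, so the
expensive kick only happens when the mask transitions from empty to
non-empty.

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint tick_dep_mask;

static void kick(void)
{
	puts("kick");	/* stand-in for the IPI / tick_nohz_full_kick() */
}

static void set_dep(unsigned int bit)
{
	/* fetch_or returns the old mask: kick only if it was empty before. */
	if (!atomic_fetch_or(&tick_dep_mask, 1u << bit))
		kick();
}

int main(void)
{
	set_dep(3);	/* prints "kick" */
	set_dep(3);	/* dependency already set: silent */
	return 0;
}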

Fixes: 6a949b7af82d (rcu: Force on tick when invoking lots of callbacks)
Reported-by: Matt Fleming 
Signed-off-by: Frederic Weisbecker 
Cc: sta...@kernel.org
Cc: Paul E. McKenney 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Signed-off-by: Paul E. McKenney 
---
 kernel/time/tick-sched.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e2dc9b..f0199a4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -351,16 +351,24 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits 
bit)
 EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);
 
 /*
- * Set a per-task tick dependency. Posix CPU timers need this in order to 
elapse
- * per task timers.
+ * Set a per-task tick dependency. RCU need this. Also posix CPU timers
+ * in order to elapse per task timers.
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   /*
-* We could optimize this with just kicking the target running the task
-* if that noise matters for nohz full users.
-*/
-   tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit);
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
+   if (tsk == current) {
+   preempt_disable();
+   tick_nohz_full_kick();
+   preempt_enable();
+   } else {
+   /*
+* Some future tick_nohz_full_kick_task()
+* should optimize this.
+*/
+   tick_nohz_full_kick_all();
+   }
+   }
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 


[tip: timers/core] timers: Recalculate next timer interrupt only when necessary

2020-07-24 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 31cd0e119d50cf27ebe214d1a8f7ca36692f13a5
Gitweb:
https://git.kernel.org/tip/31cd0e119d50cf27ebe214d1a8f7ca36692f13a5
Author:Frederic Weisbecker 
AuthorDate:Thu, 23 Jul 2020 17:16:41 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 24 Jul 2020 12:49:40 +02:00

timers: Recalculate next timer interrupt only when necessary

The nohz tick code recalculates the timer wheel's next expiry on each idle
loop iteration.

On the other hand, the base next expiry is now always cached and updated
upon timer enqueue and execution. Only timer dequeue may leave
base->next_expiry out of date (but then its stale value won't ever go past
the actual next expiry to be recalculated).

Since recalculating the next_expiry isn't a free operation, especially when
the last wheel level is reached to find out that no timer has been enqueued
at all, reuse the next expiry cache when it is known to be reliable, which
it is most of the time.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20200723151641.12236-1-frede...@kernel.org

---
 kernel/time/timer.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 77e21e9..96d802e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -204,6 +204,7 @@ struct timer_base {
unsigned long   clk;
unsigned long   next_expiry;
unsigned intcpu;
+   boolnext_expiry_recalc;
boolis_idle;
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head   vectors[WHEEL_SIZE];
@@ -593,6 +594,7 @@ static void enqueue_timer(struct timer_base *base, struct 
timer_list *timer,
 * can reevaluate the wheel:
 */
base->next_expiry = bucket_expiry;
+   base->next_expiry_recalc = false;
trigger_dyntick_cpu(base, timer);
}
 }
@@ -836,8 +838,10 @@ static int detach_if_pending(struct timer_list *timer, 
struct timer_base *base,
if (!timer_pending(timer))
return 0;
 
-   if (hlist_is_singular_node(&timer->entry, base->vectors + idx))
+   if (hlist_is_singular_node(&timer->entry, base->vectors + idx)) {
__clear_bit(idx, base->pending_map);
+   base->next_expiry_recalc = true;
+   }
 
detach_timer(timer, clear_pending);
return 1;
@@ -1571,6 +1575,9 @@ static unsigned long __next_timer_interrupt(struct 
timer_base *base)
clk >>= LVL_CLK_SHIFT;
clk += adj;
}
+
+   base->next_expiry_recalc = false;
+
return next;
 }
 
@@ -1631,9 +1638,11 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 
basem)
return expires;
 
raw_spin_lock(&base->lock);
-   nextevt = __next_timer_interrupt(base);
+   if (base->next_expiry_recalc)
+   base->next_expiry = __next_timer_interrupt(base);
+   nextevt = base->next_expiry;
is_max_delta = (nextevt == base->clk + NEXT_TIMER_MAX_DELTA);
-   base->next_expiry = nextevt;
+
/*
 * We have a fresh next event. Check whether we can forward the
 * base. We can only do that when @basej is past base->clk
@@ -1725,6 +1734,12 @@ static inline void __run_timers(struct timer_base *base)
while (time_after_eq(jiffies, base->clk) &&
   time_after_eq(jiffies, base->next_expiry)) {
levels = collect_expired_timers(base, heads);
+   /*
+* The only possible reason for not finding any expired
+* timer at this clk is that all matching timers have been
+* dequeued.
+*/
+   WARN_ON_ONCE(!levels && !base->next_expiry_recalc);
base->clk++;
base->next_expiry = __next_timer_interrupt(base);
 


[tip: timers/core] timers: Spare timer softirq until next expiry

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: d4f7dae87096dfe722bf32aa82076ece1063746c
Gitweb:
https://git.kernel.org/tip/d4f7dae87096dfe722bf32aa82076ece1063746c
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:49 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:24 +02:00

timers: Spare timer softirq until next expiry

Now that the core timer infrastructure doesn't depend anymore on
periodic base->clk increments, even when the CPU is not in NO_HZ mode,
timer softirqs can be skipped until there are timers to expire.

Some spurious softirqs can still remain since base->next_expiry doesn't
keep track of canceled timers but this still reduces the number of softirqs
significantly: ~15 times less for HZ=1000 and ~5 times less for HZ=100.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-11-frede...@kernel.org

---
 kernel/time/timer.c | 49 +++-
 1 file changed, 8 insertions(+), 41 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1be92b5..4f78a7b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1458,10 +1458,10 @@ static void expire_timers(struct timer_base *base, 
struct hlist_head *head)
}
 }
 
-static int __collect_expired_timers(struct timer_base *base,
-   struct hlist_head *heads)
+static int collect_expired_timers(struct timer_base *base,
+ struct hlist_head *heads)
 {
-   unsigned long clk = base->clk;
+   unsigned long clk = base->clk = base->next_expiry;
struct hlist_head *vec;
int i, levels = 0;
unsigned int idx;
@@ -1684,40 +1684,6 @@ void timer_clear_idle(void)
 */
base->is_idle = false;
 }
-
-static int collect_expired_timers(struct timer_base *base,
- struct hlist_head *heads)
-{
-   unsigned long now = READ_ONCE(jiffies);
-
-   /*
-* NOHZ optimization. After a long idle sleep we need to forward the
-* base to current jiffies. Avoid a loop by searching the bitfield for
-* the next expiring timer.
-*/
-   if ((long)(now - base->clk) > 2) {
-   /*
-* If the next timer is ahead of time forward to current
-* jiffies, otherwise forward to the next expiry time:
-*/
-   if (time_after(base->next_expiry, now)) {
-   /*
-* The call site will increment base->clk and then
-* terminate the expiry loop immediately.
-*/
-   base->clk = now;
-   return 0;
-   }
-   base->clk = base->next_expiry;
-   }
-   return __collect_expired_timers(base, heads);
-}
-#else
-static inline int collect_expired_timers(struct timer_base *base,
-struct hlist_head *heads)
-{
-   return __collect_expired_timers(base, heads);
-}
 #endif
 
 /*
@@ -1750,7 +1716,7 @@ static inline void __run_timers(struct timer_base *base)
struct hlist_head heads[LVL_DEPTH];
int levels;
 
-   if (!time_after_eq(jiffies, base->clk))
+   if (time_before(jiffies, base->next_expiry))
return;
 
timer_base_lock_expiry(base);
@@ -1763,7 +1729,8 @@ static inline void __run_timers(struct timer_base *base)
 */
base->must_forward_clk = false;
 
-   while (time_after_eq(jiffies, base->clk)) {
+   while (time_after_eq(jiffies, base->clk) &&
+  time_after_eq(jiffies, base->next_expiry)) {
 
levels = collect_expired_timers(base, heads);
base->clk++;
@@ -1798,12 +1765,12 @@ void run_local_timers(void)
 
hrtimer_run_queues();
/* Raise the softirq only if required. */
-   if (time_before(jiffies, base->clk)) {
+   if (time_before(jiffies, base->next_expiry)) {
if (!IS_ENABLED(CONFIG_NO_HZ_COMMON))
return;
/* CPU is awake, so check the deferrable base. */
base++;
-   if (time_before(jiffies, base->clk))
+   if (time_before(jiffies, base->next_expiry))
return;
}
raise_softirq(TIMER_SOFTIRQ);


[tip: timers/core] timers: Expand clk forward logic beyond nohz

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 1f8a4212dc83f8353843fabf6465fd918372fbbf
Gitweb:
https://git.kernel.org/tip/1f8a4212dc83f8353843fabf6465fd918372fbbf
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:48 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:24 +02:00

timers: Expand clk forward logic beyond nohz

As for next_expiry, the base->clk catch up logic will be expanded beyond
NOHZ in order to avoid triggering useless softirqs.

If softirqs should only fire to expire pending timers, periodic base->clk
increments must be skippable for random amounts of time.  Therefore prepare
to catch up with missing updates whenever an up-to-date base clock is
needed.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-10-frede...@kernel.org

---
 kernel/time/timer.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 13f48ee..1be92b5 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -888,19 +888,12 @@ get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-#ifdef CONFIG_NO_HZ_COMMON
unsigned long jnow;
 
-   /*
-* We only forward the base when we are idle or have just come out of
-* idle (must_forward_clk logic), and have a delta between base clock
-* and jiffies. In the common case, run_timers will take care of it.
-*/
-   if (likely(!base->must_forward_clk))
+   if (!base->must_forward_clk)
return;
 
jnow = READ_ONCE(jiffies);
-   base->must_forward_clk = base->is_idle;
if ((long)(jnow - base->clk) < 2)
return;
 
@@ -915,7 +908,6 @@ static inline void forward_timer_base(struct timer_base 
*base)
return;
base->clk = base->next_expiry;
}
-#endif
 }
 
 
@@ -1667,10 +1659,8 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 
basem)
 * logic is only maintained for the BASE_STD base, deferrable
 * timers may still see large granularity skew (by design).
 */
-   if ((expires - basem) > TICK_NSEC) {
-   base->must_forward_clk = true;
+   if ((expires - basem) > TICK_NSEC)
base->is_idle = true;
-   }
}
raw_spin_unlock(&base->lock);
 
@@ -1769,16 +1759,7 @@ static inline void __run_timers(struct timer_base *base)
/*
 * timer_base::must_forward_clk must be cleared before running
 * timers so that any timer functions that call mod_timer() will
-* not try to forward the base. Idle tracking / clock forwarding
-* logic is only used with BASE_STD timers.
-*
-* The must_forward_clk flag is cleared unconditionally also for
-* the deferrable base. The deferrable base is not affected by idle
-* tracking and never forwarded, so clearing the flag is a NOOP.
-*
-* The fact that the deferrable base is never forwarded can cause
-* large variations in granularity for deferrable timers, but they
-* can be deferred for long periods due to idle anyway.
+* not try to forward the base.
 */
base->must_forward_clk = false;
 
@@ -1791,6 +1772,7 @@ static inline void __run_timers(struct timer_base *base)
while (levels--)
expire_timers(base, heads + levels);
}
+   base->must_forward_clk = true;
raw_spin_unlock_irq(&base->lock);
timer_base_unlock_expiry(base);
 }


[tip: timers/core] timers: Reuse next expiry cache after nohz exit

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 90d52f65f303091be17b5f4ffab7090b2064b4a1
Gitweb:
https://git.kernel.org/tip/90d52f65f303091be17b5f4ffab7090b2064b4a1
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:47 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:23 +02:00

timers: Reuse next expiry cache after nohz exit

Now that the next expiry is tracked unconditionally when a timer is added,
this information can be reused on a tick firing after exiting nohz.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-9-frede...@kernel.org

---
 kernel/time/timer.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 76fd964..13f48ee 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1706,13 +1706,11 @@ static int collect_expired_timers(struct timer_base 
*base,
 * the next expiring timer.
 */
if ((long)(now - base->clk) > 2) {
-   unsigned long next = __next_timer_interrupt(base);
-
/*
 * If the next timer is ahead of time forward to current
 * jiffies, otherwise forward to the next expiry time:
 */
-   if (time_after(next, now)) {
+   if (time_after(base->next_expiry, now)) {
/*
 * The call site will increment base->clk and then
 * terminate the expiry loop immediately.
@@ -1720,7 +1718,7 @@ static int collect_expired_timers(struct timer_base *base,
base->clk = now;
return 0;
}
-   base->clk = next;
+   base->clk = base->next_expiry;
}
return __collect_expired_timers(base, heads);
 }


[tip: timers/core] timers: Preserve higher bits of expiration on index calculation

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 3d2e83a2a6a0657c1cf145fa6ba23620715d6c36
Gitweb:
https://git.kernel.org/tip/3d2e83a2a6a0657c1cf145fa6ba23620715d6c36
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:41 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:21 +02:00

timers: Preserve higher bits of expiration on index calculation

The higher bits of the timer expiration are cropped while calling
calc_index() due to the implicit cast from unsigned long to unsigned int.

This loss shouldn't have consequences on the current code since all the
computation to calculate the index is done on the lower 32 bits.

However to prepare for returning the actual bucket expiration from
calc_index() in order to properly fix base->next_expiry updates, the higher
bits need to be preserved.
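
A tiny standalone C example of the kind of cropping such an implicit cast
causes (not kernel code, assuming a 64-bit unsigned long):

    #include <stdio.h>

    /* Parameter is 'unsigned', so a 64-bit expiry is silently truncated. */
    static unsigned int cropped(unsigned int expires)
    {
        return expires;
    }

    int main(void)
    {
        unsigned long expires = 0x100000042UL;          /* bit 32 set */

        printf("cropped: 0x%x\n", cropped(expires));    /* 0x42: high bits lost */
        printf("full:    0x%lx\n", expires);            /* 0x100000042 */
        return 0;
    }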

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20200717140551.29076-3-frede...@kernel.org

---
 kernel/time/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index df1ff80..bcdc304 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -487,7 +487,7 @@ static inline void timer_set_idx(struct timer_list *timer, unsigned int idx)
  * Helper function to calculate the array index for a given expiry
  * time.
  */
-static inline unsigned calc_index(unsigned expires, unsigned lvl)
+static inline unsigned calc_index(unsigned long expires, unsigned lvl)
 {
expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
return LVL_OFFS(lvl) + (expires & LVL_MASK);


[tip: timers/core] timers: Add comments about calc_index() ceiling work

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 4468897211628865ee2392acb5ad281f74176f63
Gitweb:
https://git.kernel.org/tip/4468897211628865ee2392acb5ad281f74176f63
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:44 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:22 +02:00

timers: Add comments about calc_index() ceiling work

calc_index() adds one unit of the level granularity to the expiry passed
in as a parameter to ensure that the timer doesn't expire too early. Add a
comment to explain that and the resulting layout in the wheel.
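
A rough worked example with made-up numbers (assuming a level whose
granularity is 8 jiffies, i.e. a shift of 3, not the real wheel constants):

    #include <stdio.h>

    int main(void)
    {
        unsigned long expires = 17;     /* requested expiry, in jiffies */
        unsigned int shift = 3;         /* 8 jiffies per bucket at this level */
        unsigned long gran = 1UL << shift;

        /* Round up by one granularity unit, then truncate to the bucket. */
        unsigned long idx = (expires + gran) >> shift;  /* (17 + 8) >> 3 = 3 */
        unsigned long bucket_expiry = idx << shift;     /* 3 << 3 = 24 */

        /* Without the ceiling, 17 >> 3 = 2 would fire at 16: too early. */
        printf("bucket %lu fires at %lu, never before %lu\n",
               idx, bucket_expiry, expires);
        return 0;
    }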

Suggested-by: Thomas Gleixner 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-6-frede...@kernel.org

---
 kernel/time/timer.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2af08a1..af1c08b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -156,7 +156,8 @@ EXPORT_SYMBOL(jiffies_64);
 
 /*
  * The time start value for each level to select the bucket at enqueue
- * time.
+ * time. We start from the last possible delta of the previous level
+ * so that we can later add an extra LVL_GRAN(n) to n (see calc_index()).
  */
 #define LVL_START(n)   ((LVL_SIZE - 1) << (((n) - 1) * LVL_CLK_SHIFT))
 
@@ -490,6 +491,15 @@ static inline void timer_set_idx(struct timer_list *timer, unsigned int idx)
 static inline unsigned calc_index(unsigned long expires, unsigned lvl,
  unsigned long *bucket_expiry)
 {
+
+   /*
+* The timer wheel has to guarantee that a timer does not fire
+* early. Early expiry can happen due to:
+* - Timer is armed at the edge of a tick
+* - Truncation of the expiry time in the outer wheel levels
+*
+* Round up with level granularity to prevent this.
+*/
expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
*bucket_expiry = expires << LVL_SHIFT(lvl);
return LVL_OFFS(lvl) + (expires & LVL_MASK);


[tip: timers/core] timers: Optimize __next_timer_interrupt() level iteration

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 001ec1b3925da0d51847c23fc0aa4129282db526
Gitweb:
https://git.kernel.org/tip/001ec1b3925da0d51847c23fc0aa4129282db526
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:45 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:22 +02:00

timers: Optimize __next_timer_interrupt() level iteration

If a level has a timer that expires before reaching the next level, there
is no need to iterate further.

The next level is reached when the 3 lower bits of the current level are
cleared. If the next event happens before/during that, the next levels
won't provide an earlier expiration.
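
To picture the break condition, a small standalone sketch using the upstream
constants (LVL_CLK_SHIFT = 3): for each position within a level it prints the
threshold that the found bucket offset is compared against before giving up
on higher levels.

    #include <stdio.h>

    #define LVL_CLK_SHIFT   3
    #define LVL_CLK_DIV     (1UL << LVL_CLK_SHIFT)  /* 8 */
    #define LVL_CLK_MASK    (LVL_CLK_DIV - 1)       /* 7 */

    int main(void)
    {
        unsigned long clk;

        for (clk = 0; clk < LVL_CLK_DIV; clk++) {
            unsigned long lvl_clk = clk & LVL_CLK_MASK;
            unsigned long threshold = (LVL_CLK_DIV - lvl_clk) & LVL_CLK_MASK;

            /* A pending bucket found at offset <= threshold expires before
             * the clock rolls over into the next level. */
            printf("lvl_clk %lu -> break if pos <= %lu\n", lvl_clk, threshold);
        }
        return 0;
    }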

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-7-frede...@kernel.org

---
 kernel/time/timer.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index af1c08b..9abc417 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1526,6 +1526,7 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
clk = base->clk;
for (lvl = 0; lvl < LVL_DEPTH; lvl++, offset += LVL_SIZE) {
int pos = next_pending_bucket(base, offset, clk & LVL_MASK);
+   unsigned long lvl_clk = clk & LVL_CLK_MASK;
 
if (pos >= 0) {
unsigned long tmp = clk + (unsigned long) pos;
@@ -1533,6 +1534,13 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
tmp <<= LVL_SHIFT(lvl);
if (time_before(tmp, next))
next = tmp;
+
+   /*
+* If the next expiration happens before we reach
+* the next level, no need to check further.
+*/
+   if (pos <= ((LVL_CLK_DIV - lvl_clk) & LVL_CLK_MASK))
+   break;
}
/*
 * Clock for the next level. If the current level clock lower
@@ -1570,7 +1578,7 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
 * So the simple check whether the lower bits of the current
 * level are 0 or not is sufficient for all cases.
 */
-   adj = clk & LVL_CLK_MASK ? 1 : 0;
+   adj = lvl_clk ? 1 : 0;
clk >>= LVL_CLK_SHIFT;
clk += adj;
}


[tip: timers/core] timers: Move trigger_dyntick_cpu() to enqueue_timer()

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 9a2b764b06c880678416d803d027f575ae40ec99
Gitweb:
https://git.kernel.org/tip/9a2b764b06c880678416d803d027f575ae40ec99
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:43 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:22 +02:00

timers: Move trigger_dyntick_cpu() to enqueue_timer()

Consolidate the code by calling trigger_dyntick_cpu() from
enqueue_timer() instead of calling it from all its callers.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-5-frede...@kernel.org

---
 kernel/time/timer.c | 61 ++--
 1 file changed, 25 insertions(+), 36 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a7a3cf7..2af08a1 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -533,30 +533,6 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
return idx;
 }
 
-/*
- * Enqueue the timer into the hash bucket, mark it pending in
- * the bitmap and store the index in the timer flags.
- */
-static void enqueue_timer(struct timer_base *base, struct timer_list *timer,
- unsigned int idx)
-{
-   hlist_add_head(&timer->entry, base->vectors + idx);
-   __set_bit(idx, base->pending_map);
-   timer_set_idx(timer, idx);
-
-   trace_timer_start(timer, timer->expires, timer->flags);
-}
-
-static void
-__internal_add_timer(struct timer_base *base, struct timer_list *timer,
-unsigned long *bucket_expiry)
-{
-   unsigned int idx;
-
-   idx = calc_wheel_index(timer->expires, base->clk, bucket_expiry);
-   enqueue_timer(base, timer, idx);
-}
-
 static void
 trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer,
unsigned long bucket_expiry)
@@ -598,15 +574,31 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer,
wake_up_nohz_cpu(base->cpu);
 }
 
-static void
-internal_add_timer(struct timer_base *base, struct timer_list *timer)
+/*
+ * Enqueue the timer into the hash bucket, mark it pending in
+ * the bitmap, store the index in the timer flags then wake up
+ * the target CPU if needed.
+ */
+static void enqueue_timer(struct timer_base *base, struct timer_list *timer,
+ unsigned int idx, unsigned long bucket_expiry)
 {
-   unsigned long bucket_expiry;
+   hlist_add_head(&timer->entry, base->vectors + idx);
+   __set_bit(idx, base->pending_map);
+   timer_set_idx(timer, idx);
 
-   __internal_add_timer(base, timer, &bucket_expiry);
+   trace_timer_start(timer, timer->expires, timer->flags);
trigger_dyntick_cpu(base, timer, bucket_expiry);
 }
 
+static void internal_add_timer(struct timer_base *base, struct timer_list *timer)
+{
+   unsigned long bucket_expiry;
+   unsigned int idx;
+
+   idx = calc_wheel_index(timer->expires, base->clk, &bucket_expiry);
+   enqueue_timer(base, timer, idx, bucket_expiry);
+}
+
 #ifdef CONFIG_DEBUG_OBJECTS_TIMERS
 
 static struct debug_obj_descr timer_debug_descr;
@@ -1057,16 +1049,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option
/*
 * If 'idx' was calculated above and the base time did not advance
 * between calculating 'idx' and possibly switching the base, only
-* enqueue_timer() and trigger_dyntick_cpu() is required. Otherwise
-* we need to (re)calculate the wheel index via
-* internal_add_timer().
+* enqueue_timer() is required. Otherwise we need to (re)calculate
+* the wheel index via internal_add_timer().
 */
-   if (idx != UINT_MAX && clk == base->clk) {
-   enqueue_timer(base, timer, idx);
-   trigger_dyntick_cpu(base, timer, bucket_expiry);
-   } else {
+   if (idx != UINT_MAX && clk == base->clk)
+   enqueue_timer(base, timer, idx, bucket_expiry);
+   else
internal_add_timer(base, timer);
-   }
 
 out_unlock:
raw_spin_unlock_irqrestore(>lock, flags);


[tip: timers/core] timers: Lower base clock forwarding threshold

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 36cd28a4cdd05d47ccb62a2d86e8f37839cc879a
Gitweb:
https://git.kernel.org/tip/36cd28a4cdd05d47ccb62a2d86e8f37839cc879a
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:51 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:25 +02:00

timers: Lower base clock forwarding threshold

There is nothing that prevents forwarding the base clock when it is only one
jiffy off. The reason for this arbitrary limit of two jiffies is historical
and no longer exists.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-13-frede...@kernel.org

---
 kernel/time/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8b3fb52..77e21e9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -894,7 +894,7 @@ static inline void forward_timer_base(struct timer_base *base)
 * Also while executing timers, base->clk is 1 offset ahead
* of jiffies to avoid endless requeuing to current jiffies.
 */
-   if ((long)(jnow - base->clk) < 2)
+   if ((long)(jnow - base->clk) < 1)
return;
 
/*


[tip: timers/core] timers: Always keep track of next expiry

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: dc2a0f1fb2a06df09f5094f29aea56b763aa7cca
Gitweb:
https://git.kernel.org/tip/dc2a0f1fb2a06df09f5094f29aea56b763aa7cca
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:46 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:23 +02:00

timers: Always keep track of next expiry

So far next expiry was only tracked while the CPU was in nohz_idle mode
in order to cope with missing ticks that can't increment the base->clk
periodically anymore.

This logic is going to be expanded beyond nohz in order to spare timer
softirqs so do it unconditionally.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-8-frede...@kernel.org

---
 kernel/time/timer.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 9abc417..76fd964 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -544,8 +544,7 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
 }
 
 static void
-trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer,
-   unsigned long bucket_expiry)
+trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 {
if (!is_timers_nohz_active())
return;
@@ -565,23 +564,8 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer,
 * timer is not deferrable. If the other CPU is on the way to idle
 * then it can't set base->is_idle as we hold the base lock:
 */
-   if (!base->is_idle)
-   return;
-
-   /*
-* Check whether this is the new first expiring timer. The
-* effective expiry time of the timer is required here
-* (bucket_expiry) instead of timer->expires.
-*/
-   if (time_after_eq(bucket_expiry, base->next_expiry))
-   return;
-
-   /*
-* Set the next expiry time and kick the CPU so it can reevaluate the
-* wheel:
-*/
-   base->next_expiry = bucket_expiry;
-   wake_up_nohz_cpu(base->cpu);
+   if (base->is_idle)
+   wake_up_nohz_cpu(base->cpu);
 }
 
 /*
@@ -592,12 +576,26 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer,
 static void enqueue_timer(struct timer_base *base, struct timer_list *timer,
  unsigned int idx, unsigned long bucket_expiry)
 {
+
hlist_add_head(&timer->entry, base->vectors + idx);
__set_bit(idx, base->pending_map);
timer_set_idx(timer, idx);
 
trace_timer_start(timer, timer->expires, timer->flags);
-   trigger_dyntick_cpu(base, timer, bucket_expiry);
+
+   /*
+* Check whether this is the new first expiring timer. The
+* effective expiry time of the timer is required here
+* (bucket_expiry) instead of timer->expires.
+*/
+   if (time_before(bucket_expiry, base->next_expiry)) {
+   /*
+* Set the next expiry time and kick the CPU so it
+* can reevaluate the wheel:
+*/
+   base->next_expiry = bucket_expiry;
+   trigger_dyntick_cpu(base, timer);
+   }
 }
 
 static void internal_add_timer(struct timer_base *base, struct timer_list *timer)
@@ -1493,7 +1491,6 @@ static int __collect_expired_timers(struct timer_base *base,
return levels;
 }
 
-#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Find the next pending bucket of a level. Search from level start (@offset)
  * + @clk upwards and if nothing there, search from start of the level
@@ -1585,6 +1582,7 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
return next;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
 /*
  * Check, if the next hrtimer event is before the next timer wheel
  * event:
@@ -1790,6 +1788,7 @@ static inline void __run_timers(struct timer_base *base)
 
levels = collect_expired_timers(base, heads);
base->clk++;
+   base->next_expiry = __next_timer_interrupt(base);
 
while (levels--)
expire_timers(base, heads + levels);
@@ -2042,6 +2041,7 @@ static void __init init_timer_cpu(int cpu)
base->cpu = cpu;
raw_spin_lock_init(&base->lock);
base->clk = jiffies;
+   base->next_expiry = base->clk + NEXT_TIMER_MAX_DELTA;
timer_base_init_expiry_lock(base);
}
 }


[tip: timers/core] timers: Remove must_forward_clk

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 0975fb565b8b8f9e0c96d0de39fcb954833ea5e0
Gitweb:
https://git.kernel.org/tip/0975fb565b8b8f9e0c96d0de39fcb954833ea5e0
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:50 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:55:25 +02:00

timers: Remove must_forward_clk

There is no reason to keep this guard around. The code makes sure that
base->clk remains sane and won't be forwarded beyond jiffies nor set
backward.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Juri Lelli 
Link: https://lkml.kernel.org/r/20200717140551.29076-12-frede...@kernel.org

---
 kernel/time/timer.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 4f78a7b..8b3fb52 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -205,7 +205,6 @@ struct timer_base {
unsigned long   next_expiry;
unsigned intcpu;
boolis_idle;
-   boolmust_forward_clk;
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head   vectors[WHEEL_SIZE];
 } cacheline_aligned;
@@ -888,12 +887,13 @@ get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-   unsigned long jnow;
+   unsigned long jnow = READ_ONCE(jiffies);
 
-   if (!base->must_forward_clk)
-   return;
-
-   jnow = READ_ONCE(jiffies);
+   /*
+* No need to forward if we are close enough below jiffies.
+* Also while executing timers, base->clk is 1 offset ahead
+* of jiffies to avoid endless requeuing to current jiffies.
+*/
if ((long)(jnow - base->clk) < 2)
return;
 
@@ -1722,16 +1722,8 @@ static inline void __run_timers(struct timer_base *base)
timer_base_lock_expiry(base);
raw_spin_lock_irq(&base->lock);
 
-   /*
-* timer_base::must_forward_clk must be cleared before running
-* timers so that any timer functions that call mod_timer() will
-* not try to forward the base.
-*/
-   base->must_forward_clk = false;
-
while (time_after_eq(jiffies, base->clk) &&
   time_after_eq(jiffies, base->next_expiry)) {
-
levels = collect_expired_timers(base, heads);
base->clk++;
base->next_expiry = __next_timer_interrupt(base);
@@ -1739,7 +1731,6 @@ static inline void __run_timers(struct timer_base *base)
while (levels--)
expire_timers(base, heads + levels);
}
-   base->must_forward_clk = true;
raw_spin_unlock_irq(&base->lock);
timer_base_unlock_expiry(base);
 }
@@ -1935,7 +1926,6 @@ int timers_prepare_cpu(unsigned int cpu)
base->clk = jiffies;
base->next_expiry = base->clk + NEXT_TIMER_MAX_DELTA;
base->is_idle = false;
-   base->must_forward_clk = true;
}
return 0;
 }


[tip: timers/urgent] timer: Fix wheel index calculation on last level

2020-07-17 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: e2a71bdea81690b6ef11f4368261ec6f5b6891aa
Gitweb:
https://git.kernel.org/tip/e2a71bdea81690b6ef11f4368261ec6f5b6891aa
Author:Frederic Weisbecker 
AuthorDate:Fri, 17 Jul 2020 16:05:40 +02:00
Committer: Thomas Gleixner 
CommitterDate: Fri, 17 Jul 2020 21:44:05 +02:00

timer: Fix wheel index calculation on last level

When an expiration delta falls into the last level of the wheel, that delta
has to be compared against the maximum possible delay and reduced to fit in
if necessary.

However instead of comparing the delta against the maximum, the code
compares the actual expiry against the maximum. Then instead of fixing the
delta to fit in, it sets the maximum delta as the expiry value.

This can result in various undesired outcomes, the worst possible one
being a timer due 15 days ahead firing immediately.
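
A hedged standalone sketch of the difference, with made-up, much smaller
constants (the macro names are reused only for readability): comparing the
absolute expiry instead of the delta triggers the clamp for perfectly
reasonable timers, and clamping to the bare maximum detaches the expiry from
the current clock.

    #include <stdio.h>

    #define WHEEL_TIMEOUT_CUTOFF    100UL   /* made-up capacity of a toy wheel */
    #define WHEEL_TIMEOUT_MAX       (WHEEL_TIMEOUT_CUTOFF - 1)

    int main(void)
    {
        unsigned long clk = 1000;       /* current base clock, in jiffies */
        unsigned long expires = 1050;   /* only 50 jiffies away: fits easily */
        unsigned long delta = expires - clk;

        /* Buggy check: the absolute expiry exceeds the cutoff, so the timer
         * is clamped to 99, which lies before clk and would be treated as
         * already expired. */
        unsigned long buggy = (expires >= WHEEL_TIMEOUT_CUTOFF)
                                ? WHEEL_TIMEOUT_MAX : expires;

        /* Fixed check: compare the delta and clamp relative to clk. */
        unsigned long fixed = (delta >= WHEEL_TIMEOUT_CUTOFF)
                                ? clk + WHEEL_TIMEOUT_MAX : expires;

        printf("buggy expiry: %lu, fixed expiry: %lu\n", buggy, fixed);
        return 0;
    }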

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20200717140551.29076-2-frede...@kernel.org

---
 kernel/time/timer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 9a838d3..df1ff80 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -521,8 +521,8 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk)
 * Force expire obscene large timeouts to expire at the
 * capacity limit of the wheel.
 */
-   if (expires >= WHEEL_TIMEOUT_CUTOFF)
-   expires = WHEEL_TIMEOUT_MAX;
+   if (delta >= WHEEL_TIMEOUT_CUTOFF)
+   expires = clk + WHEEL_TIMEOUT_MAX;
 
idx = calc_index(expires, LVL_DEPTH - 1);
}


[tip: timers/urgent] timer: Prevent base->clk from moving backward

2020-07-09 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: 30c66fc30ee7a98c4f3adf5fb7e213b61884474f
Gitweb:
https://git.kernel.org/tip/30c66fc30ee7a98c4f3adf5fb7e213b61884474f
Author:Frederic Weisbecker 
AuthorDate:Fri, 03 Jul 2020 03:06:57 +02:00
Committer: Thomas Gleixner 
CommitterDate: Thu, 09 Jul 2020 11:56:57 +02:00

timer: Prevent base->clk from moving backward

When a timer is enqueued with a negative delta (ie: expiry is below
base->clk), it gets added to the wheel as expiring now (base->clk).

Yet the value that gets stored in base->next_expiry, while calling
trigger_dyntick_cpu(), is the initial timer->expires value. The
resulting state becomes:

base->next_expiry < base->clk

On the next timer enqueue, forward_timer_base() may accidentally
rewind base->clk. As a possible outcome, timers may expire way too
early, the worst case being that the highest wheel levels get spuriously
processed again.

To prevent that, make sure that base->next_expiry doesn't get below
base->clk.
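
A minimal standalone sketch of the invariant being enforced (toy structure
and simplified comparisons, not the kernel code):

    #include <stdio.h>

    struct toy_base {
        unsigned long clk;
        unsigned long next_expiry;
    };

    /* Record a new earliest expiry, clamped so it never drops below clk. */
    static void toy_note_expiry(struct toy_base *b, unsigned long expires)
    {
        if (expires < b->clk)
            expires = b->clk;           /* negative delta: treat as "now" */
        if (expires < b->next_expiry)
            b->next_expiry = expires;
    }

    /* Forwarding may then use next_expiry as a bound without going backward. */
    static void toy_forward(struct toy_base *b, unsigned long now)
    {
        b->clk = (b->next_expiry > now) ? now : b->next_expiry;
    }

    int main(void)
    {
        struct toy_base b = { .clk = 200, .next_expiry = ~0UL };

        toy_note_expiry(&b, 150);       /* past expiry gets clamped to 200 */
        toy_forward(&b, 205);
        printf("clk = %lu\n", b.clk);   /* 200: never rewound to 150 */
        return 0;
    }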

Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Anna-Maria Behnsen 
Tested-by: Juri Lelli 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20200703010657.2302-1-frede...@kernel.org
---
 kernel/time/timer.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 398e6ea..9a838d3 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -584,7 +584,15 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
 * Set the next expiry time and kick the CPU so it can reevaluate the
 * wheel:
 */
-   base->next_expiry = timer->expires;
+   if (time_before(timer->expires, base->clk)) {
+   /*
+* Prevent from forward_timer_base() moving the base->clk
+* backward
+*/
+   base->next_expiry = base->clk;
+   } else {
+   base->next_expiry = timer->expires;
+   }
wake_up_nohz_cpu(base->cpu);
 }
 
@@ -896,10 +904,13 @@ static inline void forward_timer_base(struct timer_base *base)
 * If the next expiry value is > jiffies, then we fast forward to
 * jiffies otherwise we forward to the next expiry value.
 */
-   if (time_after(base->next_expiry, jnow))
+   if (time_after(base->next_expiry, jnow)) {
base->clk = jnow;
-   else
+   } else {
+   if (WARN_ON_ONCE(time_before(base->next_expiry, base->clk)))
+   return;
base->clk = base->next_expiry;
+   }
 #endif
 }
 


[tip: core/rcu] arm64: Prepare arch_nmi_enter() for recursion

2020-05-19 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the core/rcu branch of tip:

Commit-ID: 28f6bf9e247fe23d177cfdbf7e709270e8cc7fa6
Gitweb:
https://git.kernel.org/tip/28f6bf9e247fe23d177cfdbf7e709270e8cc7fa6
Author:Frederic Weisbecker 
AuthorDate:Thu, 27 Feb 2020 09:51:40 +01:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 19 May 2020 15:51:17 +02:00

arm64: Prepare arch_nmi_enter() for recursion

When using nmi_enter() recursively, arch_nmi_enter() must also be recursion
safe. In particular, it must be ensured that HCR_TGE is always set while in
NMI context when in HYP mode, and be restored to its former state when
done.

The current code fails this when interleaved incorrectly. Notably it
overwrites the original HCR state on nesting.

Introduce a nesting counter to make sure the original value is stored only once.
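
A rough userspace sketch of the nesting-counter pattern (hypothetical names
and a fake register; the real macros additionally need compiler barriers
around the counter updates):

    #include <stdio.h>

    /* Hypothetical per-context state standing in for the saved HCR value. */
    struct toy_nmi_ctx {
        unsigned long saved;    /* state captured by the outermost entry */
        unsigned int cnt;       /* nesting depth */
    };

    static unsigned long hw_state = 0x10;   /* pretend hardware register */

    static void toy_nmi_enter(struct toy_nmi_ctx *ctx)
    {
        if (ctx->cnt++)         /* nested: the outer level owns 'saved' */
            return;
        ctx->saved = hw_state;  /* outermost: capture the original value */
        hw_state |= 0x1;        /* and force the required bit */
    }

    static void toy_nmi_exit(struct toy_nmi_ctx *ctx)
    {
        if (--ctx->cnt)         /* still nested: nothing to restore yet */
            return;
        hw_state = ctx->saved;  /* outermost exit: restore the original */
    }

    int main(void)
    {
        struct toy_nmi_ctx ctx = { 0 };

        toy_nmi_enter(&ctx);    /* outer NMI */
        toy_nmi_enter(&ctx);    /* nested NMI must not overwrite ctx.saved */
        toy_nmi_exit(&ctx);
        toy_nmi_exit(&ctx);
        printf("hw_state restored to 0x%lx\n", hw_state);  /* 0x10 */
        return 0;
    }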

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Alexandre Chartre 
Cc: Will Deacon 
Cc: Catalin Marinas 
Link: https://lkml.kernel.org/r/20200505134100.771491...@linutronix.de


---
 arch/arm64/include/asm/hardirq.h | 78 +++
 1 file changed, 59 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h
index 87ad961..985493a 100644
--- a/arch/arm64/include/asm/hardirq.h
+++ b/arch/arm64/include/asm/hardirq.h
@@ -32,30 +32,70 @@ u64 smp_irq_stat_cpu(unsigned int cpu);
 
 struct nmi_ctx {
u64 hcr;
+   unsigned int cnt;
 };
 
 DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);
 
-#define arch_nmi_enter()   \
-   do {\
-   if (is_kernel_in_hyp_mode()) {  \
-   struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts); \
-   nmi_ctx->hcr = read_sysreg(hcr_el2);\
-   if (!(nmi_ctx->hcr & HCR_TGE)) {\
-   write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);  \
-   isb();  \
-   }   \
-   }   \
-   } while (0)
+#define arch_nmi_enter()   \
+do {   \
+   struct nmi_ctx *___ctx; \
+   u64 ___hcr; \
+   \
+   if (!is_kernel_in_hyp_mode())   \
+   break;  \
+   \
+   ___ctx = this_cpu_ptr(&nmi_contexts);   \
+   if (___ctx->cnt) {  \
+   ___ctx->cnt++;  \
+   break;  \
+   }   \
+   \
+   ___hcr = read_sysreg(hcr_el2);  \
+   if (!(___hcr & HCR_TGE)) {  \
+   write_sysreg(___hcr | HCR_TGE, hcr_el2);\
+   isb();  \
+   }   \
+   /*  \
+* Make sure the sysreg write is performed before ___ctx->cnt   \
+* is set to 1. NMIs that see cnt == 1 will rely on us. \
+*/ \
+   barrier();  \
+   ___ctx->cnt = 1;\
+   /*  \
+* Make sure ___ctx->cnt is set before we save ___hcr. We   \
+* don't want ___ctx->hcr to be overwritten.\
+*/ \
+   barrier();  \
+   ___ctx->hcr = ___hcr;   \
+} while (0)
 
-#define arch_nmi_exit()   \
-   do {\
-   if (is_kernel_in_hyp_mode()) {  \
-  

[tip: sched/urgent] sched/vtime: Fix guest/system mis-accounting on task switch

2019-10-09 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 68e7a4d66b0ce04bf18ff2ffded5596ab3618585
Gitweb:
https://git.kernel.org/tip/68e7a4d66b0ce04bf18ff2ffded5596ab3618585
Author:Frederic Weisbecker 
AuthorDate:Wed, 25 Sep 2019 23:42:42 +02:00
Committer: Ingo Molnar 
CommitterDate: Wed, 09 Oct 2019 12:38:03 +02:00

sched/vtime: Fix guest/system mis-accounting on task switch

vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

vtime_common_task_switch(prev)
vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also make vtime_guest_enter/exit() use the task
passed in as a parameter, to avoid similar issues in the future.
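
A small illustrative sketch of the pitfall (toy types, not the scheduler
API): at this point of the switch, current already names the incoming task,
so testing its flags misclassifies the outgoing task's time.

    #include <stdio.h>

    #define PF_VCPU 0x1     /* toy flag: task is running guest code */

    struct toy_task {
        const char *name;
        unsigned int flags;
    };

    static struct toy_task *current_task;   /* toy stand-in for 'current' */

    /* Account the cputime of the task that is scheduling out. */
    static void toy_account_switch_out(struct toy_task *prev)
    {
        /* Wrong: current_task is already the incoming task here. */
        if (current_task->flags & PF_VCPU)
            printf("buggy: %s accounted as guest time\n", prev->name);

        /* Right: check the flags of the task actually being accounted. */
        if (prev->flags & PF_VCPU)
            printf("fixed: %s accounted as guest time\n", prev->name);
        else
            printf("fixed: %s accounted as system time\n", prev->name);
    }

    int main(void)
    {
        struct toy_task host = { "host-task", 0 };
        struct toy_task vcpu = { "vcpu-task", PF_VCPU };

        current_task = &vcpu;           /* the incoming task runs a guest */
        toy_account_switch_out(&host);  /* the outgoing task does not */
        return 0;
    }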

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frede...@kernel.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2305ce8..46ed4e1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -740,7 +740,7 @@ void vtime_account_system(struct task_struct *tsk)
 
write_seqcount_begin(&vtime->seqcount);
/* We might have scheduled out from guest path */
-   if (current->flags & PF_VCPU)
+   if (tsk->flags & PF_VCPU)
vtime_account_guest(tsk, vtime);
else
__vtime_account_system(tsk, vtime);
@@ -783,7 +783,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 */
write_seqcount_begin(&vtime->seqcount);
__vtime_account_system(tsk, vtime);
-   current->flags |= PF_VCPU;
+   tsk->flags |= PF_VCPU;
write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
@@ -794,7 +794,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 
write_seqcount_begin(&vtime->seqcount);
vtime_account_guest(tsk, vtime);
-   current->flags &= ~PF_VCPU;
+   tsk->flags &= ~PF_VCPU;
write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);


[tip: sched/core] sched/vtime: Fix guest/system mis-accounting on task switch

2019-10-09 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 68e7a4d66b0ce04bf18ff2ffded5596ab3618585
Gitweb:
https://git.kernel.org/tip/68e7a4d66b0ce04bf18ff2ffded5596ab3618585
Author:Frederic Weisbecker 
AuthorDate:Wed, 25 Sep 2019 23:42:42 +02:00
Committer: Ingo Molnar 
CommitterDate: Wed, 09 Oct 2019 12:38:03 +02:00

sched/vtime: Fix guest/system mis-accounting on task switch

vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

vtime_common_task_switch(prev)
vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also make vtime_guest_enter/exit() use the task
passed in as a parameter, to avoid similar issues in the future.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frede...@kernel.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/cputime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 2305ce8..46ed4e1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -740,7 +740,7 @@ void vtime_account_system(struct task_struct *tsk)
 
write_seqcount_begin(&vtime->seqcount);
/* We might have scheduled out from guest path */
-   if (current->flags & PF_VCPU)
+   if (tsk->flags & PF_VCPU)
vtime_account_guest(tsk, vtime);
else
__vtime_account_system(tsk, vtime);
@@ -783,7 +783,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 */
write_seqcount_begin(&vtime->seqcount);
__vtime_account_system(tsk, vtime);
-   current->flags |= PF_VCPU;
+   tsk->flags |= PF_VCPU;
write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_enter);
@@ -794,7 +794,7 @@ void vtime_guest_exit(struct task_struct *tsk)
 
write_seqcount_begin(&vtime->seqcount);
vtime_account_guest(tsk, vtime);
-   current->flags &= ~PF_VCPU;
+   tsk->flags &= ~PF_VCPU;
write_seqcount_end(&vtime->seqcount);
 }
 EXPORT_SYMBOL_GPL(vtime_guest_exit);


[tip: sched/core] sched/cputime: Spare a seqcount lock/unlock cycle on context switch

2019-10-09 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8d495477d62e4397207f22a432fcaa86d9f2bc2d
Gitweb:
https://git.kernel.org/tip/8d495477d62e4397207f22a432fcaa86d9f2bc2d
Author:Frederic Weisbecker 
AuthorDate:Thu, 03 Oct 2019 18:17:45 +02:00
Committer: Ingo Molnar 
CommitterDate: Wed, 09 Oct 2019 12:39:26 +02:00

sched/cputime: Spare a seqcount lock/unlock cycle on context switch

On context switch we are locking the vtime seqcount of the scheduling-out
task twice:

 * On vtime_task_switch_common(), when we flush the pending vtime through
   vtime_account_system()

 * On arch_vtime_task_switch() to reset the vtime state.

This is pointless as these actions can be performed without the need
to unlock/lock in the middle. The reason these steps are separated is to
consolidate a very small amount of common code between
CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.

Performance in this fast path is definitely a priority over artificial
code factorization, so split the task switch code between GEN and
NATIVE and share the parts that can run under a single seqcount-locked
block.

As a side effect, vtime_account_idle() becomes included in the seqcount
protection. This happens to be a welcome preparation in order to
properly support kcpustat under vtime in the future and fetch
CPUTIME_IDLE without race.
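
A hedged sketch of the structural change using a toy write-side counter (not
the kernel's seqcount_t): the flush and the state reset move under one write
section instead of two back-to-back ones.

    #include <stdio.h>

    /* Toy write-side seqcount: odd while a writer is inside the section. */
    struct toy_seqcount { unsigned int seq; };

    static void toy_write_begin(struct toy_seqcount *s) { s->seq++; }
    static void toy_write_end(struct toy_seqcount *s)   { s->seq++; }

    static void flush_pending(void) { /* account the pending vtime */ }
    static void reset_state(void)   { /* re-arm vtime for the next task */ }

    /* Before: two lock/unlock cycles on the scheduling-out task. */
    static void task_switch_old(struct toy_seqcount *s)
    {
        toy_write_begin(s); flush_pending(); toy_write_end(s);
        toy_write_begin(s); reset_state();   toy_write_end(s);
    }

    /* After: both steps run under a single write section. */
    static void task_switch_new(struct toy_seqcount *s)
    {
        toy_write_begin(s);
        flush_pending();
        reset_state();
        toy_write_end(s);
    }

    int main(void)
    {
        struct toy_seqcount s = { 0 };

        task_switch_old(&s);
        task_switch_new(&s);
        printf("seq = %u (even: no writer inside)\n", s.seq);
        return 0;
    }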

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yauheni Kaliuta 
Link: https://lkml.kernel.org/r/20191003161745.28464-3-frede...@kernel.org
Signed-off-by: Ingo Molnar 
---
 include/linux/vtime.h  | 32 
 kernel/sched/cputime.c | 30 +++---
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 2fd247f..d9160ab 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -14,8 +14,12 @@ struct task_struct;
  * vtime_accounting_cpu_enabled() definitions/declarations
  */
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
+
 static inline bool vtime_accounting_cpu_enabled(void) { return true; }
+extern void vtime_task_switch(struct task_struct *prev);
+
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
+
 /*
  * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
  * in that case and compute the tickless cputime.
@@ -36,33 +40,29 @@ static inline bool vtime_accounting_cpu_enabled(void)
 
return false;
 }
+
+extern void vtime_task_switch_generic(struct task_struct *prev);
+
+static inline void vtime_task_switch(struct task_struct *prev)
+{
+   if (vtime_accounting_cpu_enabled())
+   vtime_task_switch_generic(prev);
+}
+
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+
 static inline bool vtime_accounting_cpu_enabled(void) { return false; }
-#endif
+static inline void vtime_task_switch(struct task_struct *prev) { }
 
+#endif
 
 /*
  * Common vtime APIs
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
-
-#ifdef __ARCH_HAS_VTIME_TASK_SWITCH
-extern void vtime_task_switch(struct task_struct *prev);
-#else
-extern void vtime_common_task_switch(struct task_struct *prev);
-static inline void vtime_task_switch(struct task_struct *prev)
-{
-   if (vtime_accounting_cpu_enabled())
-   vtime_common_task_switch(prev);
-}
-#endif /* __ARCH_HAS_VTIME_TASK_SWITCH */
-
 extern void vtime_account_kernel(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
-
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
-
-static inline void vtime_task_switch(struct task_struct *prev) { }
 static inline void vtime_account_kernel(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b45932e..cef23c2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -405,9 +405,10 @@ static inline void irqtime_account_process_tick(struct task_struct *p, int user_
 /*
  * Use precise platform statistics if available:
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+
 # ifndef __ARCH_HAS_VTIME_TASK_SWITCH
-void vtime_common_task_switch(struct task_struct *prev)
+void vtime_task_switch(struct task_struct *prev)
 {
if (is_idle_task(prev))
vtime_account_idle(prev);
@@ -418,10 +419,7 @@ void vtime_common_task_switch(struct task_struct *prev)
arch_vtime_task_switch(prev);
 }
 # endif
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
-
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 /*
  * Archs that account the whole time spent in the idle task
  * (outside irq) as idle time can rely on this and just implement
@@ -731,6 +729,16 @@ static void vtime_account_guest(struct task_struct *tsk,
}
 }
 
+static void __vtime_account_kernel(struct task_struct *tsk,
+  struct vtime *vtime)
+{
+   

[tip: sched/core] sched/cputime: Rename vtime_account_system() to vtime_account_kernel()

2019-10-09 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the sched/core branch of tip:

Commit-ID: f83eeb1a01689b2691f6f56629ac9f66de8d41c2
Gitweb:
https://git.kernel.org/tip/f83eeb1a01689b2691f6f56629ac9f66de8d41c2
Author:Frederic Weisbecker 
AuthorDate:Thu, 03 Oct 2019 18:17:44 +02:00
Committer: Ingo Molnar 
CommitterDate: Wed, 09 Oct 2019 12:39:25 +02:00

sched/cputime: Rename vtime_account_system() to vtime_account_kernel()

vtime_account_system() decides if we need to account the time to the
system (__vtime_account_system()) or to the guest (vtime_account_guest()).

So this function is a misnomer as we are on a higher level than
"system". All we know when we call that function is that we are
accounting kernel cputime. Whether it belongs to guest or system time
is a lower level detail.

Rename this function to vtime_account_kernel(). This will clarify things
and avoid too many underscored vtime_account_system() versions.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Wanpeng Li 
Cc: Yauheni Kaliuta 
Link: https://lkml.kernel.org/r/20191003161745.28464-2-frede...@kernel.org
Signed-off-by: Ingo Molnar 
---
 arch/ia64/kernel/time.c  |  4 ++--
 arch/powerpc/kernel/time.c   |  6 +++---
 arch/s390/kernel/vtime.c |  4 ++--
 include/linux/context_tracking.h |  4 ++--
 include/linux/vtime.h|  6 +++---
 kernel/sched/cputime.c   | 18 +-
 6 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 1e95d32..91b4024 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -132,7 +132,7 @@ static __u64 vtime_delta(struct task_struct *tsk)
return delta_stime;
 }
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
__u64 stime = vtime_delta(tsk);
@@ -146,7 +146,7 @@ void vtime_account_system(struct task_struct *tsk)
else
ti->stime += stime;
 }
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 6945223..84827da 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -338,7 +338,7 @@ static unsigned long vtime_delta(struct task_struct *tsk,
return stime;
 }
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 {
unsigned long stime, stime_scaled, steal_time;
struct cpu_accounting_data *acct = get_accounting(tsk);
@@ -366,7 +366,7 @@ void vtime_account_system(struct task_struct *tsk)
 #endif
}
 }
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
@@ -395,7 +395,7 @@ static void vtime_flush_scaled(struct task_struct *tsk,
 /*
  * Account the whole cputime accumulated in the paca
  * Must be called with interrupts disabled.
- * Assumes that vtime_account_system/idle() has been called
+ * Assumes that vtime_account_kernel/idle() has been called
  * recently (i.e. since the last entry from usermode) so that
  * get_paca()->user_time_scaled is up to date.
  */
diff --git a/arch/s390/kernel/vtime.c b/arch/s390/kernel/vtime.c
index c475ca4..8df10d3 100644
--- a/arch/s390/kernel/vtime.c
+++ b/arch/s390/kernel/vtime.c
@@ -247,9 +247,9 @@ void vtime_account_irq_enter(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(vtime_account_irq_enter);
 
-void vtime_account_system(struct task_struct *tsk)
+void vtime_account_kernel(struct task_struct *tsk)
 __attribute__((alias("vtime_account_irq_enter")));
-EXPORT_SYMBOL_GPL(vtime_account_system);
+EXPORT_SYMBOL_GPL(vtime_account_kernel);
 
 /*
  * Sorted add to a list. List is linear searched until first bigger
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d05609a..558a209 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -141,7 +141,7 @@ static inline void guest_enter_irqoff(void)
 * to assume that it's the stime pending cputime
 * to flush.
 */
-   vtime_account_system(current);
+   vtime_account_kernel(current);
current->flags |= PF_VCPU;
rcu_virt_note_context_switch(smp_processor_id());
 }
@@ -149,7 +149,7 @@ static inline void guest_enter_irqoff(void)
 static inline void guest_exit_irqoff(void)
 {
/* Flush the guest cputime we spent on the guest */
-   vtime_account_system(current);
+   vtime_account_kernel(current);
current->flags &= ~PF_VCPU;
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index a26ed10..2fd247f 100644
--- 

[tip: timers/core] hrtimer: Improve comments on handling priority inversion against softirq kthread

2019-08-22 Thread tip-bot2 for Frederic Weisbecker
The following commit has been merged into the timers/core branch of tip:

Commit-ID: 0bee3b601b77dbe7981b5474ae8758d6bf60177a
Gitweb:
https://git.kernel.org/tip/0bee3b601b77dbe7981b5474ae8758d6bf60177a
Author:Frederic Weisbecker 
AuthorDate:Tue, 20 Aug 2019 15:12:23 +02:00
Committer: Thomas Gleixner 
CommitterDate: Tue, 20 Aug 2019 22:05:46 +02:00

hrtimer: Improve comments on handling priority inversion against softirq kthread

The handling of a priority inversion between timer cancelling and a possible,
not well defined, preemption of the softirq kthread is not very clear.

Especially on the posix timers side it's unclear why there is a specific RT
wait callback.

All the nice explanations can be found in the initial changelog of
f61eff83cec9 ("hrtimer: Prepare support for PREEMPT_RT").

Extract the detailed information from there and put it into comments.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190820132656.GC2093@lenoir
---
 kernel/time/hrtimer.c  | 14 ++
 kernel/time/posix-timers.c |  6 ++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 4991227..8333537 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1201,10 +1201,16 @@ static void hrtimer_sync_wait_running(struct hrtimer_cpu_base *cpu_base,
  * deletion of a timer failed because the timer callback function was
  * running.
  *
- * This prevents priority inversion, if the softirq thread on a remote CPU
- * got preempted, and it prevents a life lock when the task which tries to
- * delete a timer preempted the softirq thread running the timer callback
- * function.
+ * This prevents priority inversion: if the soft irq thread is preempted
+ * in the middle of a timer callback, then calling del_timer_sync() can
+ * lead to two issues:
+ *
+ *  - If the caller is on a remote CPU then it has to spin wait for the timer
+ *handler to complete. This can result in unbound priority inversion.
+ *
+ *  - If the caller originates from the task which preempted the timer
+ *handler on the same CPU, then spin waiting for the timer handler to
+ *complete is never going to end.
  */
 void hrtimer_cancel_wait_running(const struct hrtimer *timer)
 {
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 9e37783..0ec5b7a 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -810,6 +810,12 @@ static void common_timer_wait_running(struct k_itimer *timer)
hrtimer_cancel_wait_running(&timer->it.real.timer);
 }
 
+/*
+ * On PREEMPT_RT this prevents priority inversion against softirq kthread in
+ * case it gets preempted while executing a timer callback. See comments in
+ * hrtimer_cancel_wait_running. For PREEMPT_RT=n this just results in a
+ * cpu_relax().
+ */
 static struct k_itimer *timer_wait_running(struct k_itimer *timer,
   unsigned long *flags)
 {