[tip: core/rcu] rcu: Fix CPU-offline trace in rcutree_dying_cpu
The following commit has been merged into the core/rcu branch of tip: Commit-ID: 47fcbc8dd62f15dc75916225ebacdc3bca9c12b2 Gitweb: https://git.kernel.org/tip/47fcbc8dd62f15dc75916225ebacdc3bca9c12b2 Author:Neeraj Upadhyay AuthorDate:Mon, 11 Jan 2021 17:15:58 +05:30 Committer: Paul E. McKenney CommitterDate: Mon, 08 Mar 2021 14:17:35 -08:00 rcu: Fix CPU-offline trace in rcutree_dying_cpu The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is backwards, so that it uses the string "cpuofl" when the offline CPU is blocking the current grace period and "cpuofl-bgp" otherwise. Given that the "-bgp" stands for "blocking grace period", this is at best misleading. This commit therefore switches these strings in order to correctly trace whether the outgoing cpu blocks the current grace period. Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index cdf091f..e62c2de 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2413,7 +2413,7 @@ int rcutree_dying_cpu(unsigned int cpu) blkd = !!(rnp->qsmask & rdp->grpmask); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), - blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); + blkd ? TPS("cpuofl-bgp") : TPS("cpuofl")); return 0; }
Re: [PATCH 01/13] rcu/nocb: Fix potential missed nocb_timer rearm
This is normally improbably given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. 7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread. Then ->nocb_gb_kthread sleeps waiting for more newly queued callbacks. 8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps waiting for more invocable callbacks. 9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE. 10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can wake directly ->nocb_gp_kthread. It does so with calling wake_nocb_gp() which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized. 11. ->nocb_gp_kthread associates the callback queued in 10 with a new grace period, arranges for that grace period to start and sleeps waiting for it to complete. 12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread, which in turn wakes up CPU 1's ->nocb_cb_kthread which then invokes the callback queued in 10. 13. CPU 1 enqueues its third callback, this time with interrupts disabled so it must queue a timer for a deferred wakeup. However the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE which incorrectly indicates that a timer is already queued. Instead, CPU 1's ->nocb_timer was cancelled in 10. CPU 1 therefore fails to queue the ->nocb_timer. 14. CPU 1 has its pending callback and it may go unnoticed until some other CPU ever wakes up ->nocb_gp_kthread or CPU 1 ever calls an explicit deferred wakeup, for example, during idle entry. This commit fixes this bug by resetting rdp->nocb_defer_wakeup everytime we delete the ->nocb_timer. Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing) Cc: Stable Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 44746d8..429491d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1721,7 +1721,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, rcu_nocb_unlock_irqrestore(rdp, flags); return false; } - del_timer(>nocb_timer); + + if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) { + WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(>nocb_timer); + } rcu_nocb_unlock_irqrestore(rdp, flags); raw_spin_lock_irqsave(_gp->nocb_gp_lock, flags); if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { @@ -2350,7 +2354,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp) return false; } ndw = READ_ONCE(rdp->nocb_defer_wakeup); - WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); ret = wake_nocb_gp(rdp, ndw == RCU_NOCB_WAKE_FORCE, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("DeferredWake")); Reviewed-by: Neeraj Upadhyay Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] arm64: Add part number for Arm Cortex-A78
On 2/17/2021 10:36 PM, Will Deacon wrote: On Wed, Feb 17, 2021 at 10:14:11PM +0530, Neeraj Upadhyay wrote: Add the MIDR part number info for the Arm Cortex-A78. Signed-off-by: Neeraj Upadhyay --- arch/arm64/include/asm/cputype.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h index ef5b040..3aced88 100644 --- a/arch/arm64/include/asm/cputype.h +++ b/arch/arm64/include/asm/cputype.h @@ -72,6 +72,7 @@ #define ARM_CPU_PART_CORTEX_A76 0xD0B #define ARM_CPU_PART_NEOVERSE_N1 0xD0C #define ARM_CPU_PART_CORTEX_A77 0xD0D +#define ARM_CPU_PART_CORTEX_A780xD41 #define APM_CPU_PART_POTENZA 0x000 @@ -109,6 +110,7 @@ #define MIDR_CORTEX_A76 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76) #define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1) #define MIDR_CORTEX_A77 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77) +#define MIDR_CORTEX_A78MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78) #define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX) #define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX) #define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX) This usually means there's an erratum to work around. What are you hiding ;) Will :) . This is needed for supporting implementation defined AMU counters in A78 [1]. However, there is no upstream user of it. [1] https://www.spinics.net/lists/arm-kernel/msg856989.html Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] arm64: Add part number for Arm Cortex-A78
Add the MIDR part number info for the Arm Cortex-A78. Signed-off-by: Neeraj Upadhyay --- arch/arm64/include/asm/cputype.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h index ef5b040..3aced88 100644 --- a/arch/arm64/include/asm/cputype.h +++ b/arch/arm64/include/asm/cputype.h @@ -72,6 +72,7 @@ #define ARM_CPU_PART_CORTEX_A760xD0B #define ARM_CPU_PART_NEOVERSE_N1 0xD0C #define ARM_CPU_PART_CORTEX_A770xD0D +#define ARM_CPU_PART_CORTEX_A780xD41 #define APM_CPU_PART_POTENZA 0x000 @@ -109,6 +110,7 @@ #define MIDR_CORTEX_A76MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76) #define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1) #define MIDR_CORTEX_A77MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77) +#define MIDR_CORTEX_A78MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78) #define MIDR_THUNDERX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX) #define MIDR_THUNDERX_81XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_81XX) #define MIDR_THUNDERX_83XX MIDR_CPU_MODEL(ARM_CPU_IMP_CAVIUM, CAVIUM_CPU_PART_THUNDERX_83XX) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[tip: core/rcu] rcu: Check and report missed fqs timer wakeup on RCU stall
The following commit has been merged into the core/rcu branch of tip: Commit-ID: 683954e55c981467bfd4688417e914bafc40959f Gitweb: https://git.kernel.org/tip/683954e55c981467bfd4688417e914bafc40959f Author:Neeraj Upadhyay AuthorDate:Mon, 16 Nov 2020 21:36:00 +05:30 Committer: Paul E. McKenney CommitterDate: Wed, 06 Jan 2021 16:54:11 -08:00 rcu: Check and report missed fqs timer wakeup on RCU stall For a new grace period request, the RCU GP kthread transitions through following states: a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS] The RCU_GP_WAIT_GPS state is where the GP kthread waits for a request for a new GP. Once it receives a request (for example, when a new RCU callback is queued), the GP kthread transitions to RCU_GP_DONE_GPS. b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF] Grace period initialization starts in rcu_gp_init(), which records the start of new GP in rcu_state.gp_seq and transitions to RCU_GP_ONOFF. c. [RCU_GP_ONOFF] -> [RCU_GP_INIT] The purpose of the RCU_GP_ONOFF state is to apply the online/offline information that was buffered for any CPUs that recently came online or went offline. This state is maintained in per-leaf rcu_node bitmasks, with the buffered state in ->qsmaskinitnext and the state for the upcoming GP in ->qsmaskinit. At the end of this RCU_GP_ONOFF state, each bit in ->qsmaskinit will correspond to a CPU that must pass through a quiescent state before the upcoming grace period is allowed to complete. However, a leaf rcu_node structure with an all-zeroes ->qsmaskinit cannot necessarily be ignored. In preemptible RCU, there might well be tasks still in RCU read-side critical sections that were first preempted while running on one of the CPUs managed by this structure. Such tasks will be queued on this structure's ->blkd_tasks list. Only after this list fully drains can this leaf rcu_node structure be ignored, and even then only if none of its CPUs have come back online in the meantime. Once that happens, the ->qsmaskinit masks further up the tree will be updated to exclude this leaf rcu_node structure. Once the ->qsmaskinitnext and ->qsmaskinit fields have been updated as needed, the GP kthread transitions to RCU_GP_INIT. d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS] The purpose of the RCU_GP_INIT state is to copy each ->qsmaskinit to the ->qsmask field within each rcu_node structure. This copying is done breadth-first from the root to the leaves. Why not just copy directly from ->qsmaskinitnext to ->qsmask? Because the ->qsmaskinitnext masks can change in the meantime as additional CPUs come online or go offline. Such changes would result in inconsistencies in the ->qsmask fields up and down the tree, which could in turn result in too-short grace periods or grace-period hangs. These issues are avoided by snapshotting the leaf rcu_node structures' ->qsmaskinitnext fields into their ->qsmaskinit counterparts, generating a consistent set of ->qsmaskinit fields throughout the tree, and only then copying these consistent ->qsmaskinit fields to their ->qsmask counterparts. Once this initialization step is complete, the GP kthread transitions to RCU_GP_WAIT_FQS, where it waits to do a force-quiescent-state scan on the one hand or for the end of the grace period on the other. e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS] The RCU_GP_WAIT_FQS state waits for one of three things: (1) An explicit request to do a force-quiescent-state scan, (2) The end of the grace period, or (3) A short interval of time, after which it will do a force-quiescent-state (FQS) scan. The explicit request can come from rcutorture or from any CPU that has too many RCU callbacks queued (see the qhimark kernel parameter and the RCU_GP_FLAG_OVLD flag). The aforementioned "short period of time" is specified by the jiffies_till_first_fqs boot parameter for a given grace period's first FQS scan and by the jiffies_till_next_fqs for later FQS scans. Either way, once the wait is over, the GP kthread transitions to RCU_GP_DOING_FQS. f. [RCU_GP_DOING_FQS] -> [RCU_GP_CLEANUP] The RCU_GP_DOING_FQS state performs an FQS scan. Each such scan carries out two functions for any CPU whose bit is still set in its leaf rcu_node structure's ->qsmask field, that is, for any CPU that has not yet reported a quiescent state for the current grace period: i. Report quiescent states on behalf of CPUs that have been observed to be idle (from an RCU perspective) since the beginning of the grace period. ii. If the current grace period is too old, take various actions to encourage holdout CPUs to pass through quiescent states, including enlisting the aid of any calls to cond_resched() and might_sleep(), and even including IPIing the holdout CPUs. These checks are skipped for any leaf rcu_node structure with a all-zero ->qsmask field, however such structures are subject
Re: [PATCH] rcu: Correct cpu offline trace in rcutree_dying_cpu
On 1/12/2021 11:01 PM, Paul E. McKenney wrote: On Mon, Jan 11, 2021 at 05:15:58PM +0530, Neeraj Upadhyay wrote: Correctly trace whether the outgoing cpu blocks current gp in rcutree_dying_cpu(). Signed-off-by: Neeraj Upadhyay Good catch, queued, thank you! Please see below for my usual wordsmithing, and please lat me know if I messed something up. Thanx, Paul Thanks Paul, looks good! Thanks Neeraj commit ab6e7609e7590e1bb220ef6b0822a823dde46f6c Author: Neeraj Upadhyay Date: Mon Jan 11 17:15:58 2021 +0530 rcu: Fix CPU-offline trace in rcutree_dying_cpu The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is backwards, so that it uses the string "cpuofl" when the offline CPU is blocking the current grace period and "cpuofl-bgp" otherwise. Given that the "-bgp" stands for "blocking grace period", this is at best misleading. This commit therefore switches these strings in order to correctly trace whether the outgoing cpu blocks the current grace period. Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index cc6b6fc..63c6dba 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2387,7 +2387,7 @@ int rcutree_dying_cpu(unsigned int cpu) blkd = !!(rnp->qsmask & rdp->grpmask); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), - blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); + blkd ? TPS("cpuofl-bgp") : TPS("cpuofl")); return 0; } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu: Correct cpu offline trace in rcutree_dying_cpu
Correctly trace whether the outgoing cpu blocks current gp in rcutree_dying_cpu(). Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 097990a..1f4bff4 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2364,7 +2364,7 @@ int rcutree_dying_cpu(unsigned int cpu) blkd = !!(rnp->qsmask & rdp->grpmask); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), - blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); + blkd ? TPS("cpuofl-bgp") : TPS("cpuofl")); return 0; } -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] rcu: Trace cbs accelerated without leaf rnp node lock held
Trace cbs which are accelerated without rnp lock help in rcu_accelerate_cbs_unlocked(). Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 338b817..097990a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1473,6 +1473,17 @@ static void rcu_gp_kthread_wake(void) swake_up_one(_state.gp_wq); } +static void rcu_trace_accelerated_cbs(struct rcu_data *rdp, + unsigned long gp_seq_req) +{ + rcu_lockdep_assert_cblist_protected(rdp); + /* Trace depending on how much we were able to accelerate. */ + if (rcu_segcblist_restempty(>cblist, RCU_WAIT_TAIL)) + trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccWaitCB")); + else + trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccReadyCB")); +} + /* * If there is room, assign a ->gp_seq number to any callbacks on this * CPU that have not already been assigned. Also accelerate any callbacks @@ -1511,12 +1522,7 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) if (!rcu_segcblist_accelerate(>cblist, gp_seq_req)) return ret; ret = rcu_start_this_gp(rnp, rdp, gp_seq_req); - - /* Trace depending on how much we were able to accelerate. */ - if (rcu_segcblist_restempty(>cblist, RCU_WAIT_TAIL)) - trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccWaitCB")); - else - trace_rcu_grace_period(rcu_state.name, gp_seq_req, TPS("AccReadyCB")); + rcu_trace_accelerated_cbs(rdp, gp_seq_req); return ret; } @@ -1538,7 +1544,8 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, c = rcu_seq_snap(_state.gp_seq); if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { /* Old request still live, so mark recent callbacks. */ - (void)rcu_segcblist_accelerate(>cblist, c); + if (rcu_segcblist_accelerate(>cblist, c)) + rcu_trace_accelerated_cbs(rdp, c); return; } raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] rcu: Only trace when cbs are accelerated
Fix rcu_accelerate_cbs() traces to only trace when cbs are accelerated in current call. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 804e543..338b817 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1508,8 +1508,9 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) * number. */ gp_seq_req = rcu_seq_snap(_state.gp_seq); - if (rcu_segcblist_accelerate(>cblist, gp_seq_req)) - ret = rcu_start_this_gp(rnp, rdp, gp_seq_req); + if (!rcu_segcblist_accelerate(>cblist, gp_seq_req)) + return ret; + ret = rcu_start_this_gp(rnp, rdp, gp_seq_req); /* Trace depending on how much we were able to accelerate. */ if (rcu_segcblist_restempty(>cblist, RCU_WAIT_TAIL)) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] rcu: Fix dynticks_nmi_nesting underflow check in rcu_is_cpu_rrupt_from_idle
For the smp_call_function() optimization, where callbacks can run from idle context, in commit 806f04e9fd2c ("rcu: Allow for smp_call_function() running callbacks from idle"), an additional check is added in rcu_is_cpu_rrupt_from_idle(), for dynticks_nmi_nesting value being 0, for these smp_call_function() callbacks running from idle loop. However, this commit missed updating a preexisting underflow check of dynticks_nmi_nesting, which checks for a non zero positive value. Fix this warning and while at it, read the counter only once. Signed-off-by: Neeraj Upadhyay --- Hi, I was not able to get this warning, with scftorture. RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0, "RCU dynticks_nmi_nesting counter underflow/zero!"); Not sure if idle loop smp_call_function() optimization is already present in mainline? Another thing, which I am not sure of is, maybe lockdep gets disabled in the idle loop contexts, where rcu_is_cpu_rrupt_from_idle() is called? Was this the original intention, to keep the lockdep based RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0 check separate from idle task context nesting value WARN_ON_ONCE(!nesting && !is_idle_task(current)) check? Thanks Neeraj kernel/rcu/tree.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index bc8b489..c3037cf 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -457,11 +457,10 @@ static int rcu_is_cpu_rrupt_from_idle(void) /* Check for counter underflows */ RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nesting) < 0, "RCU dynticks_nesting counter underflow!"); - RCU_LOCKDEP_WARN(__this_cpu_read(rcu_data.dynticks_nmi_nesting) <= 0, -"RCU dynticks_nmi_nesting counter underflow/zero!"); + nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting); + RCU_LOCKDEP_WARN(nesting < 0, "RCU dynticks_nmi_nesting counter underflow!"); /* Are we at first interrupt nesting level? */ - nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting); if (nesting > 1) return false; -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[tip: core/rcu] rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
The following commit has been merged into the core/rcu branch of tip: Commit-ID: a3941517fcd6625adc540aef5ec3f717c8fa71e8 Gitweb: https://git.kernel.org/tip/a3941517fcd6625adc540aef5ec3f717c8fa71e8 Author:Neeraj Upadhyay AuthorDate:Thu, 24 Sep 2020 12:04:10 +05:30 Committer: Paul E. McKenney CommitterDate: Thu, 19 Nov 2020 19:37:16 -08:00 rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config This commit clarifies that the "p" and the "s" in the in the RCU_NOCB_CPU config-option description refer to the "x" in the "rcuox/N" kthread name. Signed-off-by: Neeraj Upadhyay [ paulmck: While in the area, update description and advice. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index b71e21f..cdc57b4 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -221,19 +221,23 @@ config RCU_NOCB_CPU Use this option to reduce OS jitter for aggressive HPC or real-time workloads. It can also be used to offload RCU callback invocation to energy-efficient CPUs in battery-powered - asymmetric multiprocessors. + asymmetric multiprocessors. The price of this reduced jitter + is that the overhead of call_rcu() increases and that some + workloads will incur significant increases in context-switch + rates. This option offloads callback invocation from the set of CPUs specified at boot time by the rcu_nocbs parameter. For each such CPU, a kthread ("rcuox/N") will be created to invoke callbacks, where the "N" is the CPU being offloaded, and where - the "p" for RCU-preempt (PREEMPTION kernels) and "s" for RCU-sched - (!PREEMPTION kernels). Nothing prevents this kthread from running - on the specified CPUs, but (1) the kthreads may be preempted - between each callback, and (2) affinity or cgroups can be used - to force the kthreads to run on whatever set of CPUs is desired. - - Say Y here if you want to help to debug reduced OS jitter. + the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for + RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread + from running on the specified CPUs, but (1) the kthreads may be + preempted between each callback, and (2) affinity or cgroups can + be used to force the kthreads to run on whatever set of CPUs is + desired. + + Say Y here if you need reduced OS jitter, despite added overhead. Say N here if you are unsure. config TASKS_TRACE_RCU_READ_MB
[tip: core/rcu] rcu: Fix single-CPU check in rcu_blocking_is_gp()
The following commit has been merged into the core/rcu branch of tip: Commit-ID: ed73860cecc3ec12aa50a6dcfb4900e5b4ae9507 Gitweb: https://git.kernel.org/tip/ed73860cecc3ec12aa50a6dcfb4900e5b4ae9507 Author:Neeraj Upadhyay AuthorDate:Wed, 23 Sep 2020 12:59:33 +05:30 Committer: Paul E. McKenney CommitterDate: Thu, 19 Nov 2020 19:37:16 -08:00 rcu: Fix single-CPU check in rcu_blocking_is_gp() Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses num_online_cpus() to determine whether there is only one CPU online. When there is only a single CPU online, the simple fact that synchronize_rcu() could be legally called implies that a full grace period has elapsed. Therefore, in the single-CPU case, synchronize_rcu() simply returns immediately. Unfortunately, num_online_cpus() is unreliable while a CPU-hotplug operation is transitioning to or from single-CPU operation because: 1. num_online_cpus() uses atomic_read(&__num_online_cpus) to locklessly sample the number of online CPUs. The hotplug locks are not held, which means that an incoming CPU can concurrently update this count. This in turn means that an RCU read-side critical section on the incoming CPU might observe updates prior to the grace period, but also that this critical section might extend beyond the end of the optimized synchronize_rcu(). This breaks RCU's fundamental guarantee. 2. In addition, num_online_cpus() does no ordering, thus providing another way that RCU's fundamental guarantee can be broken by the current code. 3. The most probable failure mode happens on outgoing CPUs. The outgoing CPU updates the count of online CPUs in the CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in and of itself due to preemption being disabled at the call to num_online_cpus(). Unfortunately, after that stop-machine handler returns, the CPU takes one last trip through the scheduler (which has RCU readers) and, after the resulting context switch, one final dive into the idle loop. During this time, RCU needs to keep track of two CPUs, but num_online_cpus() will say that there is only one, which in turn means that the surviving CPU will incorrectly ignore the outgoing CPU's RCU read-side critical sections. This problem is illustrated by the following litmus test in which P0() corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU. The herd7 tool confirms that the "exists" clause can be satisfied, thus demonstrating that this breakage can happen according to the Linux kernel memory model. { int x = 0; atomic_t numonline = ATOMIC_INIT(1); } P0(int *x, atomic_t *numonline) { int r0; WRITE_ONCE(*x, 1); r0 = atomic_read(numonline); if (r0 == 1) { smp_mb(); } else { synchronize_rcu(); } WRITE_ONCE(*x, 2); } P1(int *x, atomic_t *numonline) { int r0; int r1; atomic_inc(numonline); smp_mb(); rcu_read_lock(); r0 = READ_ONCE(*x); smp_rmb(); r1 = READ_ONCE(*x); rcu_read_unlock(); } locations [x;numonline;] exists (1:r0=0 /\ 1:r1=2) It is important to note that these problems arise only when the system is transitioning to or from single-CPU operation. One solution would be to hold the CPU-hotplug locks while sampling num_online_cpus(), which was in fact the intent of the (redundant) preempt_disable() and preempt_enable() surrounding this call to num_online_cpus(). Actually blocking CPU hotplug would not only result in excessive overhead, but would also unnecessarily impede CPU-hotplug operations. This commit therefore follows long-standing RCU tradition by maintaining a separate RCU-specific set of CPU-hotplug books. This separate set of books is implemented by a new ->n_online_cpus field in the rcu_state structure that maintains RCU's count of the online CPUs. This count is incremented early in the CPU-online process, so that the critical transition away from single-CPU operation will occur when there is only a single CPU. Similarly for the critical transition to single-CPU operation, the counter is decremented late in the CPU-offline process, again while there is only a single CPU. Because there is only ever a single CPU when the ->n_online_cpus field undergoes the critical 1->2 and 2->1 transitions, full memory ordering and mutual exclusion is provided implicitly and, better yet, for free. In the case where the CPU is coming online, nothing will happen until the current CPU helps it come online. Therefore, the new CPU will see all accesses prior to the optimized grace period, which means that RCU does not need to further delay this new CPU. In the case where the CPU is going offline, the outgoing CPU is totally out of the picture before the optimized
Re: [PATCH v2 tip/core/rcu 1/6] srcu: Make Tiny SRCU use multi-bit grace-period counter
On 11/28/2020 7:46 AM, Paul E. McKenney wrote: On Wed, Nov 25, 2020 at 10:03:26AM +0530, Neeraj Upadhyay wrote: On 11/24/2020 10:48 AM, Neeraj Upadhyay wrote: On 11/24/2020 1:25 AM, Paul E. McKenney wrote: On Mon, Nov 23, 2020 at 10:01:13AM +0530, Neeraj Upadhyay wrote: On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to distinguish between an SRCU instance being idle on the one hand or in the middle of a grace period on the other. This commit therefore converts the Tiny SRCU srcu_struct structure's srcu_idx from a defacto boolean to a free-running counter, using the bottom bit to indicate that a grace period is in progress. The second-from-bottom bit is thus used as the index returned by srcu_read_lock(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Fix __srcu_read_lock() idx computation Neeraj per Upadhyay. ] Signed-off-by: Paul E. McKenney --- include/linux/srcutiny.h | 4 ++-- kernel/rcu/srcutiny.c | 5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index 5a5a194..d9edb67 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -15,7 +15,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ - short srcu_idx; /* Current reader array element. */ + unsigned short srcu_idx; /* Current reader array element in bit 0x2. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; @@ -59,7 +59,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(ssp->srcu_idx); + idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } Need change in idx calcultion in srcu_torture_stats_print() ? static inline void srcu_torture_stats_print(struct srcu_struct *ssp, idx = READ_ONCE(ssp->srcu_idx) & 0x1; Excellent point! It should match the calculation in __srcu_read_lock(), shouldn't it? I have updated this, thank you! Thanx, Paul Updated version looks good! Thanks Neeraj For the version in rcu -dev: Reviewed-by: Neeraj Upadhyay I applied all of these, thank you very much! Welcome :) Only minor point which I have is, the idx calculation can be made an inline func (though srcu_drive_gp() does not require a READ_ONCE for ->srcu_idx): __srcu_read_lock() and srcu_torture_stats_print() are using idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; whereas srcu_drive_gp() uses: idx = (ssp->srcu_idx & 0x2) / 2; They do work on different elements of the various arrays. Or do you believe that the srcu_drive_gp() use needs adjusting? My bad, I missed that they are using different elements of array. Please ignore this comment. Thanks Neeraj Either way, the overhead of READ_ONCE() is absolutely not at all a problem. Would you like to put together a patch so that I can see exactly what you are suggesting? Thanx, Paul Thanks Neeraj Thanks Neeraj diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 6208c1d..5598cf6 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -124,11 +124,12 @@ void srcu_drive_gp(struct work_struct *wp) ssp->srcu_cb_head = NULL; ssp->srcu_cb_tail = >srcu_cb_head; local_irq_enable(); - idx = ssp->srcu_idx; - WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + idx = (ssp->srcu_idx & 0x2) / 2; + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* Invoke the callbacks we removed above. */ while (lh) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2 tip/core/rcu 6/6] srcu: Document polling interfaces for Tree SRCU grace periods
On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" This commit adds requirements documentation for the get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() functions. Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet Signed-off-by: Paul E. McKenney --- Reviewed-by: Neeraj Upadhyay Thanks Neeraj Documentation/RCU/Design/Requirements/Requirements.rst | 18 ++ 1 file changed, 18 insertions(+) diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 1e3df77..2dce79d 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -2600,6 +2600,24 @@ also includes DEFINE_SRCU(), DEFINE_STATIC_SRCU(), and init_srcu_struct() APIs for defining and initializing ``srcu_struct`` structures. +More recently, the SRCU API has added polling interfaces: + +#. start_poll_synchronize_srcu() returns a cookie identifying + the completion of a future SRCU grace period and ensures + that this grace period will be started. +#. poll_state_synchronize_srcu() returns ``true`` iff the + specified cookie corresponds to an already-completed + SRCU grace period. +#. get_state_synchronize_srcu() returns a cookie just like + start_poll_synchronize_srcu() does, but differs in that + it does nothing to ensure that any future SRCU grace period + will be started. + +These functions are used to avoid unnecessary SRCU grace periods in +certain types of buffer-cache algorithms having multi-stage age-out +mechanisms. The idea is that by the time the block has aged completely +from the cache, an SRCU grace period will be very likely to have elapsed. + Tasks RCU ~ -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2 tip/core/rcu 5/6] srcu: Provide polling interfaces for Tree SRCU grace periods
On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- For version in -rcu dev Reviewed-by: Neeraj Upadhyay Thanks Neeraj kernel/rcu/srcutiny.c | 2 +- kernel/rcu/srcutree.c | 67 --- 2 files changed, 64 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index b073175..c8f4202 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -148,7 +148,7 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) + if (USHORT_CMP_LT(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index d930ece..945c047 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -810,7 +810,8 @@ static void srcu_leak_callback(struct rcu_head *rhp) /* * Start an SRCU grace period, and also queue the callback if non-NULL. */ -static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rhp, bool do_norm) +static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, +struct rcu_head *rhp, bool do_norm) { unsigned long flags; int idx; @@ -819,10 +820,12 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh unsigned long s; struct srcu_data *sdp; + check_init_srcu_struct(ssp); idx = srcu_read_lock(ssp); sdp = raw_cpu_ptr(ssp->sda); spin_lock_irqsave_rcu_node(sdp, flags); - rcu_segcblist_enqueue(>srcu_cblist, rhp); + if (rhp) + rcu_segcblist_enqueue(>srcu_cblist, rhp); rcu_segcblist_advance(>srcu_cblist, rcu_seq_current(>srcu_gp_seq)); s = rcu_seq_snap(>srcu_gp_seq); @@ -841,6 +844,7 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh else if (needexp) srcu_funnel_exp_start(ssp, sdp->mynode, s); srcu_read_unlock(ssp, idx); + return s; } /* @@ -874,7 +878,6 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, rcu_callback_t func, bool do_norm) { - check_init_srcu_struct(ssp); if (debug_rcu_head_queue(rhp)) { /* Probable double call_srcu(), so leak the callback. */ WRITE_ONCE(rhp->func, srcu_leak_callback); @@ -882,7 +885,7 @@ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, return; } rhp->func = func; - srcu_gp_start_if_needed(ssp, rhp, do_norm); + (void)srcu_gp_start_if_needed(ssp, rhp, do_norm); } /** @@ -1011,6 +1014,62 @@ void synchronize_srcu(struct srcu_struct *ssp) } EXPORT_SYMBOL_GPL(synchronize_srcu); +/** + * get_state_synchronize_srcu - Provide an end-of-grace-period cookie + * @ssp: srcu_struct to provide cookie for. + * + * This function returns a cookie that can be passed to + * poll_state_synchronize_srcu(), which will return true if a full grace + * period has elapsed in the meantime. It is the caller's responsibility + * to make sure that grace period happens, for example, by invoking + * call_srcu() after return from get_state_synchronize_srcu(). + */ +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) +{ + // Any prior manipulation of SRCU-protected data must happen + // before the load from ->srcu_gp_seq. + smp_mb(); + return rcu_seq_snap(>srcu_gp_seq); +} +EXPORT_SYMBOL_GPL(get_st
Re: [PATCH v2 tip/core/rcu 4/6] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/25/2020 1:00 AM, Paul E. McKenney wrote: On Tue, Nov 24, 2020 at 10:44:24AM +0530, Neeraj Upadhyay wrote: On 11/24/2020 2:42 AM, Paul E. McKenney wrote: On Mon, Nov 23, 2020 at 10:13:13AM +0530, Neeraj Upadhyay wrote: On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a) (*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index d9edb67..c7f0c1f 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b073175 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; Minor: cleanup_srcu_struct() can probably have 2 new sanity checks? WARN_ON(ssp->srcu_idx != ssp->srcu_idx_max); WARN_ON(ssp->srcu_idx & 1); Good point, added and under test. Thanks Neeraj @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,13 +148,18 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max)))
Re: [PATCH v2 tip/core/rcu 1/6] srcu: Make Tiny SRCU use multi-bit grace-period counter
On 11/24/2020 10:48 AM, Neeraj Upadhyay wrote: On 11/24/2020 1:25 AM, Paul E. McKenney wrote: On Mon, Nov 23, 2020 at 10:01:13AM +0530, Neeraj Upadhyay wrote: On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to distinguish between an SRCU instance being idle on the one hand or in the middle of a grace period on the other. This commit therefore converts the Tiny SRCU srcu_struct structure's srcu_idx from a defacto boolean to a free-running counter, using the bottom bit to indicate that a grace period is in progress. The second-from-bottom bit is thus used as the index returned by srcu_read_lock(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Fix __srcu_read_lock() idx computation Neeraj per Upadhyay. ] Signed-off-by: Paul E. McKenney --- include/linux/srcutiny.h | 4 ++-- kernel/rcu/srcutiny.c | 5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index 5a5a194..d9edb67 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -15,7 +15,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ - short srcu_idx; /* Current reader array element. */ + unsigned short srcu_idx; /* Current reader array element in bit 0x2. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; @@ -59,7 +59,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(ssp->srcu_idx); + idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } Need change in idx calcultion in srcu_torture_stats_print() ? static inline void srcu_torture_stats_print(struct srcu_struct *ssp, idx = READ_ONCE(ssp->srcu_idx) & 0x1; Excellent point! It should match the calculation in __srcu_read_lock(), shouldn't it? I have updated this, thank you! Thanx, Paul Updated version looks good! Thanks Neeraj For the version in rcu -dev: Reviewed-by: Neeraj Upadhyay Only minor point which I have is, the idx calculation can be made an inline func (though srcu_drive_gp() does not require a READ_ONCE for ->srcu_idx): __srcu_read_lock() and srcu_torture_stats_print() are using idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; whereas srcu_drive_gp() uses: idx = (ssp->srcu_idx & 0x2) / 2; Thanks Neeraj Thanks Neeraj diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 6208c1d..5598cf6 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -124,11 +124,12 @@ void srcu_drive_gp(struct work_struct *wp) ssp->srcu_cb_head = NULL; ssp->srcu_cb_tail = >srcu_cb_head; local_irq_enable(); - idx = ssp->srcu_idx; - WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + idx = (ssp->srcu_idx & 0x2) / 2; + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* Invoke the callbacks we removed above. */ while (lh) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: AMU extension v1 support for cortex A76, A77, A78 CPUs
Thanks Marc, Vladimir, Mark, Sudeep for your inputs! Thanks Neeraj On 11/20/2020 3:43 PM, Mark Rutland wrote: On Fri, Nov 20, 2020 at 09:09:00AM +, Vladimir Murzin wrote: On 11/20/20 8:56 AM, Marc Zyngier wrote: On 2020-11-20 04:30, Neeraj Upadhyay wrote: Hi, For ARM cortex A76, A77, A78 cores (which as per TRM, support AMU) AA64PFR0[47:44] field is not set, and AMU does not get enabled for them. Can you please provide support for these CPUs in cpufeature.c? If that was the case, that'd be an erratum, and it would need to be documented as such. It could also be that this is an optional feature for these cores (though the TRM doesn't suggest that). Can someone at ARM confirm what is the expected behaviour of these CPUs? Not a confirmation, but IIRC, these are imp def features, while our cpufeatures catches architected one. We generally don't make use of IMP-DEF featurees because of all the pain it brings. Looking at the Cortex-A76 TRM, the encoding for AMCNTENCLR is: Op0: 3 (0b11) Op1: 3 (0b011) CRn: 15 (0b) CRm: 9 (0b1001) Op2: 7 (0b111) ... whereas the architected encoding (from our sysreg.h) is: Op0: 3 Op1: 3 CRn: 13 CRm: 2 Op2: 4 ... so that's a different register with the same name, which is confusing and unfortunate. The encodings are different (and I haven't checked whether the fields / semantics are the same), so it's not just a matter of wiring up new detection code. There are also IMP-DEF traps in ACTLR_EL3 and ACTLR_EL2 which we can't be certain of the configuration of, and as the registers are in the IMP-DEF encoding space they'll be trapped by HCR_EL2.TIDCP and emulated as UNDEFINED by a hypervisor. All of that means that going by the MIDR alone is not sufficient to know we can safely access the registers. So as usual for IMP-DEF stuff I don't think we can or should make use of this. Thanks, Mark. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2 tip/core/rcu 1/6] srcu: Make Tiny SRCU use multi-bit grace-period counter
On 11/24/2020 1:25 AM, Paul E. McKenney wrote: On Mon, Nov 23, 2020 at 10:01:13AM +0530, Neeraj Upadhyay wrote: On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to distinguish between an SRCU instance being idle on the one hand or in the middle of a grace period on the other. This commit therefore converts the Tiny SRCU srcu_struct structure's srcu_idx from a defacto boolean to a free-running counter, using the bottom bit to indicate that a grace period is in progress. The second-from-bottom bit is thus used as the index returned by srcu_read_lock(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Fix __srcu_read_lock() idx computation Neeraj per Upadhyay. ] Signed-off-by: Paul E. McKenney --- include/linux/srcutiny.h | 4 ++-- kernel/rcu/srcutiny.c| 5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index 5a5a194..d9edb67 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -15,7 +15,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ - short srcu_idx; /* Current reader array element. */ + unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; @@ -59,7 +59,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(ssp->srcu_idx); + idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } Need change in idx calcultion in srcu_torture_stats_print() ? static inline void srcu_torture_stats_print(struct srcu_struct *ssp, idx = READ_ONCE(ssp->srcu_idx) & 0x1; Excellent point! It should match the calculation in __srcu_read_lock(), shouldn't it? I have updated this, thank you! Thanx, Paul Updated version looks good! Thanks Neeraj Thanks Neeraj diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 6208c1d..5598cf6 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -124,11 +124,12 @@ void srcu_drive_gp(struct work_struct *wp) ssp->srcu_cb_head = NULL; ssp->srcu_cb_tail = >srcu_cb_head; local_irq_enable(); - idx = ssp->srcu_idx; - WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + idx = (ssp->srcu_idx & 0x2) / 2; + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* Invoke the callbacks we removed above. */ while (lh) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2 tip/core/rcu 4/6] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/24/2020 2:42 AM, Paul E. McKenney wrote: On Mon, Nov 23, 2020 at 10:13:13AM +0530, Neeraj Upadhyay wrote: On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a)(*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index d9edb67..c7f0c1f 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b073175 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; Minor: cleanup_srcu_struct() can probably have 2 new sanity checks? WARN_ON(ssp->srcu_idx != ssp->srcu_idx_max); WARN_ON(ssp->srcu_idx & 1); Good point, added and under test. Thanks Neeraj @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,13 +148,18 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_
Re: [PATCH v2 tip/core/rcu 4/6] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b)(ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b)(ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a) (*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index d9edb67..c7f0c1f 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b073175 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; Minor: cleanup_srcu_struct() can probably have 2 new sanity checks? WARN_ON(ssp->srcu_idx != ssp->srcu_idx_max); WARN_ON(ssp->srcu_idx & 1); Thanks Neeraj @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,13 +148,18 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { + unsigned short cookie; + + cookie = get_state_synchronize_srcu(ssp); + if (USHORT_CMP_LT(READ_ONCE(ssp->srcu_idx_max), co
Re: [PATCH RFC tip/core/rcu 4/5] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/22/2020 11:31 PM, Paul E. McKenney wrote: On Sun, Nov 22, 2020 at 07:57:26PM +0530, Neeraj Upadhyay wrote: On 11/21/2020 5:43 AM, Paul E. McKenney wrote: On Fri, Nov 20, 2020 at 05:28:32PM +0530, Neeraj Upadhyay wrote: Hi Paul, On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a) (*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index fed4a2d..e9bd6fb 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b405811 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,14 +148,19 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) Should this be USHORT_CMP_LT ? I believe that you are correct. As is, it works but does needless grace periods. schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { + unsigned short cookie; + if (!READ_ONCE(ssp-&g
Re: [PATCH v2 tip/core/rcu 1/6] srcu: Make Tiny SRCU use multi-bit grace-period counter
On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to distinguish between an SRCU instance being idle on the one hand or in the middle of a grace period on the other. This commit therefore converts the Tiny SRCU srcu_struct structure's srcu_idx from a defacto boolean to a free-running counter, using the bottom bit to indicate that a grace period is in progress. The second-from-bottom bit is thus used as the index returned by srcu_read_lock(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Fix __srcu_read_lock() idx computation Neeraj per Upadhyay. ] Signed-off-by: Paul E. McKenney --- include/linux/srcutiny.h | 4 ++-- kernel/rcu/srcutiny.c| 5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index 5a5a194..d9edb67 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -15,7 +15,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ - short srcu_idx; /* Current reader array element. */ + unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; @@ -59,7 +59,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(ssp->srcu_idx); + idx = ((READ_ONCE(ssp->srcu_idx) + 1) & 0x2) >> 1; WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } Need change in idx calcultion in srcu_torture_stats_print() ? static inline void srcu_torture_stats_print(struct srcu_struct *ssp, idx = READ_ONCE(ssp->srcu_idx) & 0x1; Thanks Neeraj diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 6208c1d..5598cf6 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -124,11 +124,12 @@ void srcu_drive_gp(struct work_struct *wp) ssp->srcu_cb_head = NULL; ssp->srcu_cb_tail = >srcu_cb_head; local_irq_enable(); - idx = ssp->srcu_idx; - WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + idx = (ssp->srcu_idx & 0x2) / 2; + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* Invoke the callbacks we removed above. */ while (lh) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2 tip/core/rcu 4/6] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/21/2020 6:29 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/lkml/20201117004017.GA7444@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b)(ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b)(ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a) (*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index d9edb67..c7f0c1f 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b073175 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,13 +148,18 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) USHORT_CMP_LT ? Thanks Neeraj schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { + unsigned short cookie; + + cookie = get_state_synchronize_srcu(ssp); + if (USHORT_CMP_LT(READ_ONCE(ssp->srcu_idx_max), cookie)) + WRITE_ONCE(ssp->srcu_idx_max, cookie); if (!READ_ONCE(ssp->srcu_gp_running)) { if (l
Re: [PATCH RFC tip/core/rcu 4/5] srcu: Provide polling interfaces for Tiny SRCU grace periods
On 11/21/2020 5:43 AM, Paul E. McKenney wrote: On Fri, Nov 20, 2020 at 05:28:32PM +0530, Neeraj Upadhyay wrote: Hi Paul, On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a)(*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index fed4a2d..e9bd6fb 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b405811 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,14 +148,19 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) Should this be USHORT_CMP_LT ? I believe that you are correct. As is, it works but does needless grace periods. schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { + unsigned short cookie; + if (!READ_ONCE(ssp->srcu_gp_running)) { + cookie = get_state_synchronize_srcu(ssp); + if (USHORT_CMP_LT(READ_ONCE(ssp-
Re: [PATCH RFC tip/core/rcu 5/5] srcu: Provide polling interfaces for Tree SRCU grace periods
On 11/21/2020 5:46 AM, Paul E. McKenney wrote: On Fri, Nov 20, 2020 at 05:31:43PM +0530, Neeraj Upadhyay wrote: On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 63 --- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index d930ece..015d80e 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -810,7 +810,8 @@ static void srcu_leak_callback(struct rcu_head *rhp) /* * Start an SRCU grace period, and also queue the callback if non-NULL. */ -static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rhp, bool do_norm) +static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, +struct rcu_head *rhp, bool do_norm) { unsigned long flags; int idx; @@ -822,7 +823,8 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh idx = srcu_read_lock(ssp); sdp = raw_cpu_ptr(ssp->sda); spin_lock_irqsave_rcu_node(sdp, flags); - rcu_segcblist_enqueue(>srcu_cblist, rhp); + if (rhp) + rcu_segcblist_enqueue(>srcu_cblist, rhp); rcu_segcblist_advance(>srcu_cblist, rcu_seq_current(>srcu_gp_seq)); s = rcu_seq_snap(>srcu_gp_seq); @@ -841,6 +843,7 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh else if (needexp) srcu_funnel_exp_start(ssp, sdp->mynode, s); srcu_read_unlock(ssp, idx); + return s; } /* @@ -882,7 +885,7 @@ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, return; } rhp->func = func; - srcu_gp_start_if_needed(ssp, rhp, do_norm); + (void)srcu_gp_start_if_needed(ssp, rhp, do_norm); } /** @@ -1011,6 +1014,60 @@ void synchronize_srcu(struct srcu_struct *ssp) } EXPORT_SYMBOL_GPL(synchronize_srcu); +/** + * get_state_synchronize_srcu - Provide an end-of-grace-period cookie + * @ssp: srcu_struct to provide cookie for. + * + * This function returns a cookie that can be passed to + * poll_state_synchronize_srcu(), which will return true if a full grace + * period has elapsed in the meantime. It is the caller's responsibility + * to make sure that grace period happens, for example, by invoking + * call_srcu() after return from get_state_synchronize_srcu(). + */ +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) +{ + // Any prior manipulation of SRCU-protected data must happen +// before the load from ->srcu_gp_seq. + smp_mb(); + return rcu_seq_snap(>srcu_gp_seq); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_srcu); + +/** + * start_poll_synchronize_srcu - Provide cookie and start grace period + * @ssp: srcu_struct to provide cookie for. + * + * This function returns a cookie that can be passed to + * poll_state_synchronize_srcu(), which will return true if a full grace + * period has elapsed in the meantime. Unlike get_state_synchronize_srcu(), + * this function also ensures that any needed SRCU grace period will be + * started. This convenience does come at a cost in terms of CPU overhead. + */ +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp) +{ + return srcu_gp_start_if_needed(ssp, NULL, true); +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_srcu); + +/** + * poll_state_synchronize_srcu - Has cookie's grace period ended? + * @ssp: srcu_struct to provide cookie for. + * @cookie: Return value from get_state_synchronize_srcu() or start_poll_synchronize_srcu(). + * + * This function takes the cookie that was returned from either + * get_state_synchronize_srcu() or start_poll_synchronize_srcu(), and + * returns @true if an SRCU grace period elapsed since the time that the + * cookie was created. + */ +bool poll_state_synchronize_srcu(struc
Re: [PATCH RFC tip/core/rcu 5/5] srcu: Provide polling interfaces for Tree SRCU grace periods
On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 63 --- 1 file changed, 60 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index d930ece..015d80e 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -810,7 +810,8 @@ static void srcu_leak_callback(struct rcu_head *rhp) /* * Start an SRCU grace period, and also queue the callback if non-NULL. */ -static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rhp, bool do_norm) +static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, +struct rcu_head *rhp, bool do_norm) { unsigned long flags; int idx; @@ -822,7 +823,8 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh idx = srcu_read_lock(ssp); sdp = raw_cpu_ptr(ssp->sda); spin_lock_irqsave_rcu_node(sdp, flags); - rcu_segcblist_enqueue(>srcu_cblist, rhp); + if (rhp) + rcu_segcblist_enqueue(>srcu_cblist, rhp); rcu_segcblist_advance(>srcu_cblist, rcu_seq_current(>srcu_gp_seq)); s = rcu_seq_snap(>srcu_gp_seq); @@ -841,6 +843,7 @@ static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rh else if (needexp) srcu_funnel_exp_start(ssp, sdp->mynode, s); srcu_read_unlock(ssp, idx); + return s; } /* @@ -882,7 +885,7 @@ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, return; } rhp->func = func; - srcu_gp_start_if_needed(ssp, rhp, do_norm); + (void)srcu_gp_start_if_needed(ssp, rhp, do_norm); } /** @@ -1011,6 +1014,60 @@ void synchronize_srcu(struct srcu_struct *ssp) } EXPORT_SYMBOL_GPL(synchronize_srcu); +/** + * get_state_synchronize_srcu - Provide an end-of-grace-period cookie + * @ssp: srcu_struct to provide cookie for. + * + * This function returns a cookie that can be passed to + * poll_state_synchronize_srcu(), which will return true if a full grace + * period has elapsed in the meantime. It is the caller's responsibility + * to make sure that grace period happens, for example, by invoking + * call_srcu() after return from get_state_synchronize_srcu(). + */ +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp) +{ + // Any prior manipulation of SRCU-protected data must happen +// before the load from ->srcu_gp_seq. + smp_mb(); + return rcu_seq_snap(>srcu_gp_seq); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_srcu); + +/** + * start_poll_synchronize_srcu - Provide cookie and start grace period + * @ssp: srcu_struct to provide cookie for. + * + * This function returns a cookie that can be passed to + * poll_state_synchronize_srcu(), which will return true if a full grace + * period has elapsed in the meantime. Unlike get_state_synchronize_srcu(), + * this function also ensures that any needed SRCU grace period will be + * started. This convenience does come at a cost in terms of CPU overhead. + */ +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp) +{ + return srcu_gp_start_if_needed(ssp, NULL, true); +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_srcu); + +/** + * poll_state_synchronize_srcu - Has cookie's grace period ended? + * @ssp: srcu_struct to provide cookie for. + * @cookie: Return value from get_state_synchronize_srcu() or start_poll_synchronize_srcu(). + * + * This function takes the cookie that was returned from either + * get_state_synchronize_srcu() or start_poll_synchronize_srcu(), and + * returns @true if an SRCU grace period elapsed since the time that the + * cookie was created. + */ +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie) +{ + if (!rcu_seq_done(>srcu_gp_seq, cookie)) + return false; + smp_mb(); // ^^^ Minor: Should this comment be
Re: [PATCH RFC tip/core/rcu 4/5] srcu: Provide polling interfaces for Tiny SRCU grace periods
Hi Paul, On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods, so this commit supplies get_state_synchronize_srcu(), start_poll_synchronize_srcu(), and poll_state_synchronize_srcu() for this purpose. The first can be used if future grace periods are inevitable (perhaps due to a later call_srcu() invocation), the second if future grace periods might not otherwise happen, and the third to check if a grace period has elapsed since the corresponding call to either of the first two. As with get_state_synchronize_rcu() and cond_synchronize_rcu(), the return value from either get_state_synchronize_srcu() or start_poll_synchronize_srcu() must be passed in to a later call to poll_state_synchronize_srcu(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet [ paulmck: Add EXPORT_SYMBOL_GPL() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 ++ include/linux/srcu.h | 3 +++ include/linux/srcutiny.h | 1 + kernel/rcu/srcutiny.c| 52 ++-- 4 files changed, 56 insertions(+), 2 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index de08264..e09c0d8 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -33,6 +33,8 @@ #define ULONG_CMP_GE(a, b)(ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b)(ULONG_MAX / 2 < (a) - (b)) #define ulong2long(a) (*(long *)(&(a))) +#define USHORT_CMP_GE(a, b)(USHRT_MAX / 2 >= (unsigned short)((a) - (b))) +#define USHORT_CMP_LT(a, b)(USHRT_MAX / 2 < (unsigned short)((a) - (b))) /* Exported common interfaces */ void call_rcu(struct rcu_head *head, rcu_callback_t func); diff --git a/include/linux/srcu.h b/include/linux/srcu.h index e432cc9..a0895bb 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -60,6 +60,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp); int __srcu_read_lock(struct srcu_struct *ssp) __acquires(ssp); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp); void synchronize_srcu(struct srcu_struct *ssp); +unsigned long get_state_synchronize_srcu(struct srcu_struct *ssp); +unsigned long start_poll_synchronize_srcu(struct srcu_struct *ssp); +bool poll_state_synchronize_srcu(struct srcu_struct *ssp, unsigned long cookie); #ifdef CONFIG_DEBUG_LOCK_ALLOC diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index fed4a2d..e9bd6fb 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -16,6 +16,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ + unsigned short srcu_idx_max;/* Furthest future srcu_idx request. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 3bac1db..b405811 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -34,6 +34,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp) ssp->srcu_gp_running = false; ssp->srcu_gp_waiting = false; ssp->srcu_idx = 0; + ssp->srcu_idx_max = 0; INIT_WORK(>srcu_work, srcu_drive_gp); INIT_LIST_HEAD(>srcu_work.entry); return 0; @@ -114,7 +115,7 @@ void srcu_drive_gp(struct work_struct *wp) struct srcu_struct *ssp; ssp = container_of(wp, struct srcu_struct, srcu_work); - if (ssp->srcu_gp_running || !READ_ONCE(ssp->srcu_cb_head)) + if (ssp->srcu_gp_running || USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) return; /* Already running or nothing to do. */ /* Remove recently arrived callbacks and wait for readers. */ @@ -147,14 +148,19 @@ void srcu_drive_gp(struct work_struct *wp) * straighten that out. */ WRITE_ONCE(ssp->srcu_gp_running, false); - if (READ_ONCE(ssp->srcu_cb_head)) + if (USHORT_CMP_GE(ssp->srcu_idx, READ_ONCE(ssp->srcu_idx_max))) Should this be USHORT_CMP_LT ? schedule_work(>srcu_work); } EXPORT_SYMBOL_GPL(srcu_drive_gp); static void srcu_gp_start_if_needed(struct srcu_struct *ssp) { + unsigned short cookie; + if (!READ_ONCE(ssp->srcu_gp_running)) { + cookie = get_state_synchronize_srcu(ssp); + if (USHORT_CMP_LT(READ_ONCE(ssp->srcu_idx_max), cookie)) + WRITE_ONCE(ssp->srcu_idx_max, cookie); I was thinking of a case which might break with this. Consider a scenario, where GP is in progress and kworker is right before below point, after executing callbacks: void srcu_drive_gp(struct work_struct
Re: [PATCH RFC tip/core/rcu 2/5] srcu: Provide internal interface to start a Tiny SRCU grace period
On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to initiate an SRCU grace period without having to queue (and manage) a callback. This commit therefore splits the Tiny SRCU call_srcu() function into callback-queuing and start-grace-period portions, with the latter in a new function named srcu_gp_start_if_needed(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet Signed-off-by: Paul E. McKenney --- Reviewed-by: Neeraj Upadhyay Thanks Neeraj kernel/rcu/srcutiny.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 5598cf6..3bac1db 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -152,6 +152,16 @@ void srcu_drive_gp(struct work_struct *wp) } EXPORT_SYMBOL_GPL(srcu_drive_gp); +static void srcu_gp_start_if_needed(struct srcu_struct *ssp) +{ + if (!READ_ONCE(ssp->srcu_gp_running)) { + if (likely(srcu_init_done)) + schedule_work(>srcu_work); + else if (list_empty(>srcu_work.entry)) + list_add(>srcu_work.entry, _boot_list); + } +} + /* * Enqueue an SRCU callback on the specified srcu_struct structure, * initiating grace-period processing if it is not already running. @@ -167,12 +177,7 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, *ssp->srcu_cb_tail = rhp; ssp->srcu_cb_tail = >next; local_irq_restore(flags); - if (!READ_ONCE(ssp->srcu_gp_running)) { - if (likely(srcu_init_done)) - schedule_work(>srcu_work); - else if (list_empty(>srcu_work.entry)) - list_add(>srcu_work.entry, _boot_list); - } + srcu_gp_start_if_needed(ssp); } EXPORT_SYMBOL_GPL(call_srcu); -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH RFC tip/core/rcu 3/5] srcu: Provide internal interface to start a Tree SRCU grace period
On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to initiate an SRCU grace period without having to queue (and manage) a callback. This commit therefore splits the Tree SRCU __call_srcu() function into callback-initialization and queuing/start-grace-period portions, with the latter in a new function named srcu_gp_start_if_needed(). This function may be passed a NULL callback pointer, in which case it will refrain from queuing anything. Why have the new function mess with queuing? Locking considerations, of course! Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet Signed-off-by: Paul E. McKenney --- Reviewed-by: Neeraj Upadhyay Thanks Neeraj kernel/rcu/srcutree.c | 66 +-- 1 file changed, 37 insertions(+), 29 deletions(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 79b7081..d930ece 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -808,6 +808,42 @@ static void srcu_leak_callback(struct rcu_head *rhp) } /* + * Start an SRCU grace period, and also queue the callback if non-NULL. + */ +static void srcu_gp_start_if_needed(struct srcu_struct *ssp, struct rcu_head *rhp, bool do_norm) +{ + unsigned long flags; + int idx; + bool needexp = false; + bool needgp = false; + unsigned long s; + struct srcu_data *sdp; + + idx = srcu_read_lock(ssp); + sdp = raw_cpu_ptr(ssp->sda); + spin_lock_irqsave_rcu_node(sdp, flags); + rcu_segcblist_enqueue(>srcu_cblist, rhp); + rcu_segcblist_advance(>srcu_cblist, + rcu_seq_current(>srcu_gp_seq)); + s = rcu_seq_snap(>srcu_gp_seq); + (void)rcu_segcblist_accelerate(>srcu_cblist, s); + if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { + sdp->srcu_gp_seq_needed = s; + needgp = true; + } + if (!do_norm && ULONG_CMP_LT(sdp->srcu_gp_seq_needed_exp, s)) { + sdp->srcu_gp_seq_needed_exp = s; + needexp = true; + } + spin_unlock_irqrestore_rcu_node(sdp, flags); + if (needgp) + srcu_funnel_gp_start(ssp, sdp, s, do_norm); + else if (needexp) + srcu_funnel_exp_start(ssp, sdp->mynode, s); + srcu_read_unlock(ssp, idx); +} + +/* * Enqueue an SRCU callback on the srcu_data structure associated with * the current CPU and the specified srcu_struct structure, initiating * grace-period processing if it is not already running. @@ -838,13 +874,6 @@ static void srcu_leak_callback(struct rcu_head *rhp) static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, rcu_callback_t func, bool do_norm) { - unsigned long flags; - int idx; - bool needexp = false; - bool needgp = false; - unsigned long s; - struct srcu_data *sdp; - check_init_srcu_struct(ssp); if (debug_rcu_head_queue(rhp)) { /* Probable double call_srcu(), so leak the callback. */ @@ -853,28 +882,7 @@ static void __call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp, return; } rhp->func = func; - idx = srcu_read_lock(ssp); - sdp = raw_cpu_ptr(ssp->sda); - spin_lock_irqsave_rcu_node(sdp, flags); - rcu_segcblist_enqueue(>srcu_cblist, rhp); - rcu_segcblist_advance(>srcu_cblist, - rcu_seq_current(>srcu_gp_seq)); - s = rcu_seq_snap(>srcu_gp_seq); - (void)rcu_segcblist_accelerate(>srcu_cblist, s); - if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) { - sdp->srcu_gp_seq_needed = s; - needgp = true; - } - if (!do_norm && ULONG_CMP_LT(sdp->srcu_gp_seq_needed_exp, s)) { - sdp->srcu_gp_seq_needed_exp = s; - needexp = true; - } - spin_unlock_irqrestore_rcu_node(sdp, flags); - if (needgp) - srcu_funnel_gp_start(ssp, sdp, s, do_norm); - else if (needexp) - srcu_funnel_exp_start(ssp, sdp->mynode, s); - srcu_read_unlock(ssp, idx); + srcu_gp_start_if_needed(ssp, rhp, do_norm); } /** -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
AMU extension v1 support for cortex A76, A77, A78 CPUs
Hi, For ARM cortex A76, A77, A78 cores (which as per TRM, support AMU) AA64PFR0[47:44] field is not set, and AMU does not get enabled for them. Can you please provide support for these CPUs in cpufeature.c? Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH RFC tip/core/rcu 1/5] srcu: Make Tiny SRCU use multi-bit grace-period counter
Hi Paul, On 11/17/2020 6:10 AM, paul...@kernel.org wrote: From: "Paul E. McKenney" There is a need for a polling interface for SRCU grace periods. This polling needs to distinguish between an SRCU instance being idle on the one hand or in the middle of a grace period on the other. This commit therefore converts the Tiny SRCU srcu_struct structure's srcu_idx from a defacto boolean to a free-running counter, using the bottom bit to indicate that a grace period is in progress. The second-from-bottom bit is thus used as the index returned by srcu_read_lock(). Link: https://lore.kernel.org/rcu/20201112201547.gf3365...@moria.home.lan/ Reported-by: Kent Overstreet Signed-off-by: Paul E. McKenney --- include/linux/srcutiny.h | 4 ++-- kernel/rcu/srcutiny.c| 5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h index 5a5a194..fed4a2d 100644 --- a/include/linux/srcutiny.h +++ b/include/linux/srcutiny.h @@ -15,7 +15,7 @@ struct srcu_struct { short srcu_lock_nesting[2]; /* srcu_read_lock() nesting depth. */ - short srcu_idx; /* Current reader array element. */ + unsigned short srcu_idx;/* Current reader array element in bit 0x2. */ u8 srcu_gp_running; /* GP workqueue running? */ u8 srcu_gp_waiting; /* GP waiting for readers? */ struct swait_queue_head srcu_wq; @@ -59,7 +59,7 @@ static inline int __srcu_read_lock(struct srcu_struct *ssp) { int idx; - idx = READ_ONCE(ssp->srcu_idx); + idx = (READ_ONCE(ssp->srcu_idx) & 0x2) / 2; Should we use bit 0x2 of (READ_ONCE(ssp->srcu_idx) + 1) , if GP (srcu_drive_gp()) is in progress? Or am I missing something here? idx = ((READ_ONCE(ssp->srcu_idx) +1) & 0x2) / 2; Also, any reason for using divison instead of shift; something to do with 16-bit srcu_idx which I am missing here? Thanks Neeraj WRITE_ONCE(ssp->srcu_lock_nesting[idx], ssp->srcu_lock_nesting[idx] + 1); return idx; } diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c index 6208c1d..5598cf6 100644 --- a/kernel/rcu/srcutiny.c +++ b/kernel/rcu/srcutiny.c @@ -124,11 +124,12 @@ void srcu_drive_gp(struct work_struct *wp) ssp->srcu_cb_head = NULL; ssp->srcu_cb_tail = >srcu_cb_head; local_irq_enable(); - idx = ssp->srcu_idx; - WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx); + idx = (ssp->srcu_idx & 0x2) / 2; + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); WRITE_ONCE(ssp->srcu_gp_waiting, true); /* srcu_read_unlock() wakes! */ swait_event_exclusive(ssp->srcu_wq, !READ_ONCE(ssp->srcu_lock_nesting[idx])); WRITE_ONCE(ssp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */ + WRITE_ONCE(ssp->srcu_idx, ssp->srcu_idx + 1); /* Invoke the callbacks we removed above. */ while (lh) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH V2] rcu: Check and report missed fqs timer wakeup on RCU stall
For a new grace period request, RCU GP kthread transitions through following states: a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS] Initial state, where GP kthread waits for a new GP start request is RCU_GP_WAIT_GPS. On new GP request (started by callers, for any new callbacks, by setting RCU_GP_FLAG_INIT gp_state flag and waking up the GP kthread) GP kthread enters RCU_GP_DONE_GPS. b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF] Grace period initialization starts in rcu_gp_init(), which records the start of new GP in rcu_state.gp_seq and enters into RCU_GP_ONOFF state. c. [RCU_GP_ONOFF] -> [RCU_GP_INIT] In RCU_GP_ONOFF state, for each leaf rnp node, GP kthread applies the buffered online/offline information of its cpus, from ->qsmaskinitnext to ->qsmaskinit, which is basically the mask of cpus, which need to report quiescent state, for the new GP to complete. Also, in this state, an outgoing rnp's (for which all cpus are now offline and there are no tasks blocked inside RCU read section) or an incoming rnp's (for which first cpu comes online) participation in the new and subsequent GP is advertised, by propagating its qsmaskinit information up the tree. d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS] In RCU_GP_INIT state, current GPs qs mask information and new GP seq is propagated down the tree, into all rnp nodes, in breadth first order. On GP initialization completion, GP kthread enters a fqs loop, which does following things, to get quiescent state reported for cpus, which aren't quiesce in ->qsmask: i. For CPUs, which have entered idle since the first iteration, report quiescent state up the tree. ii. Based on how long the current grace period has been running for, either, set the appropriate flags, so that sched clock interrupt on that cpu, does qs reporting or do a resched on that cpu. On each iteration, it transitions through following states. The fqs loop is terminated on GP completion, when all CPUs report their quiescent state and all RCU readers, blocking current grace period have completed the RCU read section. e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS] GP kthread uses a timed wait (jiffies_till_first_fqs for first iteration and jiffies_till_next_fqs for subsequent iterations), before doing all the work, to force quiescent state on all cpus. It wakes up from this state, either on timeout, or when (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD) flags are set, either on callback overload conditions or when last cpu reports its quiescent state and all RCU readers blocking current GP, have left the RCU read section. f. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED] Mark end of grace period, in ->gp_seq, on all rnp nodes, in breadth first order and finally in rcu_state. For cases where timers are not served (due to issues in timer configuration, in timer framework or due to softirqs not getting handled, when there is a storm of interrupts) on the CPU, where GP kthread queued a timer (for timed wait, which is done in RCU_GP_WAIT_FQS) its possible that RCU kthread never wakes up. Report the same from stall warnings, if GP thread is in RCU_GP_WAIT_FQS state, and the timeout has elapsed and the kthread is not woken. Signed-off-by: Neeraj Upadhyay --- Changes in V2: - Documentation update. - Replace READ_ONCE()/smp_rmb() with smp_load_acquire(), as suggested by Paul. - Correct time_after() check. - Move fqs timer warning above starvation warning. Documentation/RCU/stallwarn.rst | 23 ++- kernel/rcu/tree.c | 25 +++-- kernel/rcu/tree_stall.h | 32 3 files changed, 69 insertions(+), 11 deletions(-) diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index c9ab6af..5170cc0 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -92,7 +92,9 @@ warnings: buggy timer hardware through bugs in the interrupt or exception path (whether hardware, firmware, or software) through bugs in Linux's timer subsystem through bugs in the scheduler, and, - yes, even including bugs in RCU itself. + yes, even including bugs in RCU itself. It can also result in + the ``rcu_.*timer wakeup didn't happen for`` console-log message, + which will include additional debugging information. - A bug in the RCU implementation. @@ -292,6 +294,25 @@ kthread is waiting for a short timeout, the "state" precedes value of the task_struct ->state field, and the "cpu" indicates that the grace-period kthread last ran on CPU 5. +If the relevant grace-period kthread isn't woken up from fqs wait, and +fqs timeout has happened, then the following additional line is printed:: + + kthread timer wakeup didn't happen for 23804 jiffies! g7076 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 + +The "23804" indicates that kthread's timer expired 23 thousand jiffies +back. Re
Re: [PATCH] rcu: Check and report missed fqs timer wakeup on RCU stall
Hi Paul, On 11/12/2020 1:01 AM, Paul E. McKenney wrote: On Wed, Nov 11, 2020 at 07:37:37PM +0530, Neeraj Upadhyay wrote: For a new grace period request, RCU GP kthread transitions through following states: a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS] Initial state, where GP kthread waits for a new GP start request is RCU_GP_WAIT_GPS. On new GP request (started by callers, for any new callbacks, by setting RCU_GP_FLAG_INIT gp_state flag and waking up the GP kthread) GP kthread enters RCU_GP_DONE_GPS. b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF] Grace period initialization starts in rcu_gp_init(), which records the start of new GP in rcu_state.gp_seq and enters into RCU_GP_ONOFF state. c. [RCU_GP_ONOFF] -> [RCU_GP_INIT] In RCU_GP_ONOFF state, for each leaf rnp node, GP kthread applies the buffered online/offline information of its cpus, from ->qsmaskinitnext to ->qsmaskinit, which is basically the mask of cpus, which need to report quiescent state, for the new GP to complete. Also, in this state, an outgoing rnp's (for which all cpus are now offline and there are no tasks blocked inside RCU read section) or an incoming rnp's (for which first cpu comes online) participation in the new and subsequent GP is advertised, by propagating its qsmaskinit information up the tree. d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS] In RCU_GP_INIT state, current GPs qs mask information and new GP seq is propagated down the tree, into all rnp nodes, in breadth first order. On GP initialization completion, GP kthread enters a fqs loop, which does following things, to get quiescent state reported for cpus, which aren't quiesce in ->qsmask: i. For CPUs, which have entered idle since the first iteration, report quiescent state up the tree. ii. Based on how long the current grace period has been running for, either, set the appropriate flags, so that sched clock interrupt on that cpu, does qs reporting or do a resched on that cpu. On each iteration, it transitions through following states. The fqs loop is terminated on GP completion, when all CPUs report their quiescent state and all RCU readers, blocking current grace period have completed the RCU read section. e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS] GP kthread uses a timed wait (jiffies_till_first_fqs for first iteration and jiffies_till_next_fqs for subsequent iterations), before doing all the work, to force quiescent state on all cpus. It wakes up from this state, either on timeout, or when (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD) flags are set, either on callback overload conditions or when last cpu reports its quiescent state and all RCU readers blocking current GP, have left the RCU read section. f. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED] Mark end of grace period, in ->gp_seq, on all rnp nodes, in breadth first order and finally in rcu_state. For cases where timers are not served (due to issues in timer configuration, in timer framework or due to softirqs not getting handled, when there is a storm of interrupts) on the CPU, where GP kthread queued a timer (for timed wait, which is done in RCU_GP_WAIT_FQS) its possible that RCU kthread never wakes up. Report the same from stall warnings, if GP thread is in RCU_GP_WAIT_FQS state, and the timeout has elapsed and the kthread is not woken. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 25 +++-- kernel/rcu/tree_stall.h | 33 + 2 files changed, 48 insertions(+), 10 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 413831b..804e543 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1767,7 +1767,7 @@ static bool rcu_gp_init(void) * go offline later. Please also refer to "Hotplug CPU" section * of RCU's Requirements documentation. */ - rcu_state.gp_state = RCU_GP_ONOFF; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF); rcu_for_each_leaf_node(rnp) { smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. firstseq = READ_ONCE(rnp->ofl_seq); @@ -1833,7 +1833,7 @@ static bool rcu_gp_init(void) * The grace period cannot complete until the initialization * process finishes, because this kthread handles both. */ - rcu_state.gp_state = RCU_GP_INIT; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_INIT); rcu_for_each_node_breadth_first(rnp) { rcu_gp_slow(gp_init_delay); raw_spin_lock_irqsave_rcu_node(rnp, flags); @@ -1932,17 +1932,22 @@ static void rcu_gp_fqs_loop(void) ret = 0; for (;;) { if (!ret) { - rcu_state.jiffies_force_qs = jiffies + j; + WRITE_ONCE(rcu_state.jiffies_force_qs, jiffies + j); + /* +* jiffies_force_qs before RCU_GP_WAIT_FQS state +
[PATCH] rcu: Check and report missed fqs timer wakeup on RCU stall
For a new grace period request, RCU GP kthread transitions through following states: a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS] Initial state, where GP kthread waits for a new GP start request is RCU_GP_WAIT_GPS. On new GP request (started by callers, for any new callbacks, by setting RCU_GP_FLAG_INIT gp_state flag and waking up the GP kthread) GP kthread enters RCU_GP_DONE_GPS. b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF] Grace period initialization starts in rcu_gp_init(), which records the start of new GP in rcu_state.gp_seq and enters into RCU_GP_ONOFF state. c. [RCU_GP_ONOFF] -> [RCU_GP_INIT] In RCU_GP_ONOFF state, for each leaf rnp node, GP kthread applies the buffered online/offline information of its cpus, from ->qsmaskinitnext to ->qsmaskinit, which is basically the mask of cpus, which need to report quiescent state, for the new GP to complete. Also, in this state, an outgoing rnp's (for which all cpus are now offline and there are no tasks blocked inside RCU read section) or an incoming rnp's (for which first cpu comes online) participation in the new and subsequent GP is advertised, by propagating its qsmaskinit information up the tree. d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS] In RCU_GP_INIT state, current GPs qs mask information and new GP seq is propagated down the tree, into all rnp nodes, in breadth first order. On GP initialization completion, GP kthread enters a fqs loop, which does following things, to get quiescent state reported for cpus, which aren't quiesce in ->qsmask: i. For CPUs, which have entered idle since the first iteration, report quiescent state up the tree. ii. Based on how long the current grace period has been running for, either, set the appropriate flags, so that sched clock interrupt on that cpu, does qs reporting or do a resched on that cpu. On each iteration, it transitions through following states. The fqs loop is terminated on GP completion, when all CPUs report their quiescent state and all RCU readers, blocking current grace period have completed the RCU read section. e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS] GP kthread uses a timed wait (jiffies_till_first_fqs for first iteration and jiffies_till_next_fqs for subsequent iterations), before doing all the work, to force quiescent state on all cpus. It wakes up from this state, either on timeout, or when (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD) flags are set, either on callback overload conditions or when last cpu reports its quiescent state and all RCU readers blocking current GP, have left the RCU read section. f. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED] Mark end of grace period, in ->gp_seq, on all rnp nodes, in breadth first order and finally in rcu_state. For cases where timers are not served (due to issues in timer configuration, in timer framework or due to softirqs not getting handled, when there is a storm of interrupts) on the CPU, where GP kthread queued a timer (for timed wait, which is done in RCU_GP_WAIT_FQS) its possible that RCU kthread never wakes up. Report the same from stall warnings, if GP thread is in RCU_GP_WAIT_FQS state, and the timeout has elapsed and the kthread is not woken. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 25 +++-- kernel/rcu/tree_stall.h | 33 + 2 files changed, 48 insertions(+), 10 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 413831b..804e543 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1767,7 +1767,7 @@ static bool rcu_gp_init(void) * go offline later. Please also refer to "Hotplug CPU" section * of RCU's Requirements documentation. */ - rcu_state.gp_state = RCU_GP_ONOFF; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF); rcu_for_each_leaf_node(rnp) { smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. firstseq = READ_ONCE(rnp->ofl_seq); @@ -1833,7 +1833,7 @@ static bool rcu_gp_init(void) * The grace period cannot complete until the initialization * process finishes, because this kthread handles both. */ - rcu_state.gp_state = RCU_GP_INIT; + WRITE_ONCE(rcu_state.gp_state, RCU_GP_INIT); rcu_for_each_node_breadth_first(rnp) { rcu_gp_slow(gp_init_delay); raw_spin_lock_irqsave_rcu_node(rnp, flags); @@ -1932,17 +1932,22 @@ static void rcu_gp_fqs_loop(void) ret = 0; for (;;) { if (!ret) { - rcu_state.jiffies_force_qs = jiffies + j; + WRITE_ONCE(rcu_state.jiffies_force_qs, jiffies + j); + /* +* jiffies_force_qs before RCU_GP_WAIT_FQS state +* update; required for stall checks. +*/ + smp_wmb();
Re: [PATCH v9 4/7] rcu/trace: Add tracing for how segcb list changes
On 11/3/2020 7:56 PM, Joel Fernandes (Google) wrote: Track how the segcb list changes before/after acceleration, during queuing and during dequeuing. This has proved useful to discover an optimization to avoid unwanted GP requests when there are no callbacks accelerated. The overhead is minimal as each segment's length is now stored in the respective segment. Reviewed-by: Frederic Weisbecker Reviewed-by: Neeraj Upadhyay Signed-off-by: Joel Fernandes (Google) --- include/trace/events/rcu.h | 25 + kernel/rcu/rcu_segcblist.c | 34 ++ kernel/rcu/rcu_segcblist.h | 5 + kernel/rcu/tree.c | 9 + 4 files changed, 73 insertions(+) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 155b5cb43cfd..5f8f2ee1a936 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -505,6 +505,31 @@ TRACE_EVENT_RCU(rcu_callback, __entry->qlen) ); +TRACE_EVENT_RCU(rcu_segcb_stats, + + TP_PROTO(const char *ctx, int *cb_count, unsigned long *gp_seq), I think we need to use long[] instead of int[] for cb_count everywhere in this patch? Thanks Neeraj + + TP_ARGS(ctx, cb_count, gp_seq), + + TP_STRUCT__entry( + __field(const char *, ctx) + __array(int, cb_count, RCU_CBLIST_NSEGS) + __array(unsigned long, gp_seq, RCU_CBLIST_NSEGS) + ), + + TP_fast_assign( + __entry->ctx = ctx; + memcpy(__entry->cb_count, cb_count, RCU_CBLIST_NSEGS * sizeof(int)); + memcpy(__entry->gp_seq, gp_seq, RCU_CBLIST_NSEGS * sizeof(unsigned long)); + ), + + TP_printk("%s cb_count: (DONE=%d, WAIT=%d, NEXT_READY=%d, NEXT=%d) " + "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, NEXT=%lu)", __entry->ctx, + __entry->cb_count[0], __entry->cb_count[1], __entry->cb_count[2], __entry->cb_count[3], + __entry->gp_seq[0], __entry->gp_seq[1], __entry->gp_seq[2], __entry->gp_seq[3]) + +); + /* * Tracepoint for the registration of a single RCU callback of the special * kvfree() form. The first argument is the RCU type, the second argument diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 357c19bbcb00..2a03949d0b82 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -14,6 +14,7 @@ #include #include "rcu_segcblist.h" +#include "rcu.h" /* Initialize simple callback list. */ void rcu_cblist_init(struct rcu_cblist *rclp) @@ -328,6 +329,39 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp, rcu_segcblist_set_seglen(rsclp, RCU_DONE_TAIL, 0); } +/* + * Return how many CBs each segment along with their gp_seq values. + * + * This function is O(N) where N is the number of segments. Only used from + * tracing code which is usually disabled in production. + */ +#ifdef CONFIG_RCU_TRACE +static void rcu_segcblist_countseq(struct rcu_segcblist *rsclp, +int cbcount[RCU_CBLIST_NSEGS], +unsigned long gpseq[RCU_CBLIST_NSEGS]) +{ + int i; + + for (i = 0; i < RCU_CBLIST_NSEGS; i++) { + cbcount[i] = rcu_segcblist_get_seglen(rsclp, i); + gpseq[i] = rsclp->gp_seq[i]; + } +} + +void __trace_rcu_segcb_stats(struct rcu_segcblist *rsclp, const char *context) +{ + int cbs[RCU_CBLIST_NSEGS]; + unsigned long gps[RCU_CBLIST_NSEGS]; + + if (!trace_rcu_segcb_stats_enabled()) + return; + + rcu_segcblist_countseq(rsclp, cbs, gps); + + trace_rcu_segcb_stats(context, cbs, gps); +} +#endif + /* * Extract only those callbacks still pending (not yet ready to be * invoked) from the specified rcu_segcblist structure and place them in diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index cd35c9faaf51..7750734fa116 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -103,3 +103,8 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq); bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq); void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp, struct rcu_segcblist *src_rsclp); +#ifdef CONFIG_RCU_TRACE +void __trace_rcu_segcb_stats(struct rcu_segcblist *rsclp, const char *context); +#else +#define __trace_rcu_segcb_stats(...) +#endif diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 24c00020ab83..f6c6653b3ec2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1497,6 +1497,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) if (!rcu_segcblist_pend_
Re: [PATCH v9 3/7] srcu: Fix invoke_rcu_callbacks() segcb length adjustment
On 11/3/2020 7:55 PM, Joel Fernandes (Google) wrote: With earlier patches, the negative counting of the unsegmented list cannot be used to adjust the segmented one. To fix this, sample the unsegmented length in advance, and use it after CB execution to adjust the segmented list's length. Reviewed-by: Frederic Weisbecker Suggested-by: Frederic Weisbecker Signed-off-by: Joel Fernandes (Google) --- Reviewed-by: Neeraj Upadhyay kernel/rcu/srcutree.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 0f23d20d485a..79b7081143a7 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1160,6 +1160,7 @@ static void srcu_advance_state(struct srcu_struct *ssp) */ static void srcu_invoke_callbacks(struct work_struct *work) { + long len; bool more; struct rcu_cblist ready_cbs; struct rcu_head *rhp; @@ -1182,6 +1183,7 @@ static void srcu_invoke_callbacks(struct work_struct *work) /* We are on the job! Extract and invoke ready callbacks. */ sdp->srcu_cblist_invoking = true; rcu_segcblist_extract_done_cbs(>srcu_cblist, _cbs); + len = ready_cbs.len; spin_unlock_irq_rcu_node(sdp); rhp = rcu_cblist_dequeue(_cbs); for (; rhp != NULL; rhp = rcu_cblist_dequeue(_cbs)) { @@ -1190,13 +1192,14 @@ static void srcu_invoke_callbacks(struct work_struct *work) rhp->func(rhp); local_bh_enable(); } + WARN_ON_ONCE(ready_cbs.len); /* * Update counts, accelerate new callbacks, and if needed, * schedule another round of callback invocation. */ spin_lock_irq_rcu_node(sdp); - rcu_segcblist_insert_count(>srcu_cblist, _cbs); + rcu_segcblist_add_len(>srcu_cblist, -len); (void)rcu_segcblist_accelerate(>srcu_cblist, rcu_seq_snap(>srcu_gp_seq)); sdp->srcu_cblist_invoking = false; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v9 2/7] rcu/segcblist: Add counters to segcblist datastructure
On 11/3/2020 7:55 PM, Joel Fernandes (Google) wrote: Add counting of segment lengths of segmented callback list. This will be useful for a number of things such as knowing how big the ready-to-execute segment have gotten. The immediate benefit is ability to trace how the callbacks in the segmented callback list change. Also this patch remove hacks related to using donecbs's ->len field as a temporary variable to save the segmented callback list's length. This cannot be done anymore and is not needed. Reviewed-by: Frederic Weisbecker Signed-off-by: Joel Fernandes (Google) --- Reviewed-by: Neeraj Upadhyay include/linux/rcu_segcblist.h | 1 + kernel/rcu/rcu_segcblist.c| 120 ++ kernel/rcu/rcu_segcblist.h| 2 - 3 files changed, 79 insertions(+), 44 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index b36afe7b22c9..6c01f09a6456 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -72,6 +72,7 @@ struct rcu_segcblist { #else long len; #endif + long seglen[RCU_CBLIST_NSEGS]; u8 enabled; u8 offloaded; }; diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index bb246d8c6ef1..357c19bbcb00 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -7,10 +7,11 @@ * Authors: Paul E. McKenney */ -#include -#include +#include #include +#include #include +#include #include "rcu_segcblist.h" @@ -88,6 +89,46 @@ static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) #endif } +/* Get the length of a segment of the rcu_segcblist structure. */ +static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg) +{ + return READ_ONCE(rsclp->seglen[seg]); +} + +/* Set the length of a segment of the rcu_segcblist structure. */ +static void rcu_segcblist_set_seglen(struct rcu_segcblist *rsclp, int seg, long v) +{ + WRITE_ONCE(rsclp->seglen[seg], v); +} + +/* Return number of callbacks in a segment of the segmented callback list. */ +static void rcu_segcblist_add_seglen(struct rcu_segcblist *rsclp, int seg, long v) +{ + WRITE_ONCE(rsclp->seglen[seg], rsclp->seglen[seg] + v); +} + +/* Move from's segment length to to's segment. */ +static void rcu_segcblist_move_seglen(struct rcu_segcblist *rsclp, int from, int to) +{ + long len; + + if (from == to) + return; + + len = rcu_segcblist_get_seglen(rsclp, from); + if (!len) + return; + + rcu_segcblist_add_seglen(rsclp, to, len); + rcu_segcblist_set_seglen(rsclp, from, 0); +} + +/* Increment segment's length. */ +static void rcu_segcblist_inc_seglen(struct rcu_segcblist *rsclp, int seg) +{ + rcu_segcblist_add_seglen(rsclp, seg, 1); +} + /* * Increase the numeric length of an rcu_segcblist structure by the * specified amount, which can be negative. This can cause the ->len @@ -119,26 +160,6 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp) rcu_segcblist_add_len(rsclp, 1); } -/* - * Exchange the numeric length of the specified rcu_segcblist structure - * with the specified value. This can cause the ->len field to disagree - * with the actual number of callbacks on the structure. This exchange is - * fully ordered with respect to the callers accesses both before and after. - */ -static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v) -{ -#ifdef CONFIG_RCU_NOCB_CPU - return atomic_long_xchg(>len, v); -#else - long ret = rsclp->len; - - smp_mb(); /* Up to the caller! */ - WRITE_ONCE(rsclp->len, v); - smp_mb(); /* Up to the caller! */ - return ret; -#endif -} - /* * Initialize an rcu_segcblist structure. */ @@ -149,8 +170,10 @@ void rcu_segcblist_init(struct rcu_segcblist *rsclp) BUILD_BUG_ON(RCU_NEXT_TAIL + 1 != ARRAY_SIZE(rsclp->gp_seq)); BUILD_BUG_ON(ARRAY_SIZE(rsclp->tails) != ARRAY_SIZE(rsclp->gp_seq)); rsclp->head = NULL; - for (i = 0; i < RCU_CBLIST_NSEGS; i++) + for (i = 0; i < RCU_CBLIST_NSEGS; i++) { rsclp->tails[i] = >head; + rcu_segcblist_set_seglen(rsclp, i, 0); + } rcu_segcblist_set_len(rsclp, 0); rsclp->enabled = 1; } @@ -246,6 +269,7 @@ void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp, { rcu_segcblist_inc_len(rsclp); smp_mb(); /* Ensure counts are updated before callback is enqueued. */ + rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL); rhp->next = NULL; WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp); WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], >next); @@ -274,27 +298,13 @@ bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp, for (i = RCU_NEXT_TAIL; i > RCU_DONE_TAIL; i--)
Re: Queries on ARM SDEI Linux kernel code
Hi James, Sorry for late reply. Thanks for your comments! On 10/16/2020 9:57 PM, James Morse wrote: Hi Neeraj, On 15/10/2020 07:07, Neeraj Upadhyay wrote: 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available for clients to use; can you please share information on why it is not provided? There is no compelling use-case for it, and its very complex to support as the driver can no longer hide things like hibernate. Last time I looked, it looked like the SDEI driver would need to ask the irqchip to prevent modification while firmware re-configures the irq. I couldn't work out how this would work if the irq is in-progress on another CPU. Got it. I will think in this direction, on how to achieve this. The reasons to use bound-interrupts can equally be supported with an event provided by firmware. Ok, I will explore in that direction. While trying to dig information on this, I saw that [1] says: Now the hotplug callbacks save nothing, and restore the OS-view of registered/enabled. This makes bound-interrupts harder to work with. Based on this comment, the changes from v4 [2], which I could understand is, cpu down path does not save the current event enable status, and we rely on the enable status `event->reenable', which is set, when register/unregister, enable/disable calls are made; this enable status is used during cpu up path, to decide whether to reenable the interrupt. Does this make, bound-interrupts harder to work with? how? Can you please explain? Or above save/restore is not the reason and you meant something else? If you bind a level-triggered interrupt, how does firmware know how to clear the interrupt from whatever is generating it? What happens if the OS can't do this either, as it needs to allocate memory, or take a lock, which it can't do in nmi context? Ok, makes sense. The people that wrote the SDEI spec's answer to this was that the handler can disable the event from inside the handler... and firmware will do, something, to stop the interrupt screaming. So now an event can become disabled anytime its registered, which makes it more complicated to save/restore. Also, does shared bound interrupts Shared-interrupts as an NMI made me jump. But I think you mean a bound interrupt as a shared event. i.e. and SPI not a PPI. Sorry I should have worded properly; yes I meant SPI as shared event. also have the same problem, as save/restore behavior was only for private events? See above, the problem is the event disabling itself. This makes sense now. Additionally those changes to unregister the private-event mean the code can't tell the difference between cpuhp and hibernate... only hibernate additionally loses the state in firmware. Got it! 2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its handling has the same problems, which are there for bound interrupts? Its not supported as no-one showed up with a use-case. While firmware is expected to back it with a PPI, its doesn't have the same problems as bound-interrupts as its not an interrupt the OS ever knows about. Also, if it is provided, clients need to register event 0 ? Vendor events or other event nums are not supported, as per spec. Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger it. This should fit in with however the GIC's PMR based NMI does its PPI based crash/stacktrace call so that the caller doesn't need to know if its back by IRQ, pNMI or SDEI. Ok; I will explore how PMR based NMIs work; I thought it was SGI based. But will recheck. 3. Can kernel panic() be triggered from sdei event handler? Yes, Is it a safe operation? panic() wipes out the machine... did you expect it to keep running? I wanted to check the case where panic triggers kexec/kdump path into capture kernel. What does safe mean here? I think I didn't put it correctly; I meant what possible scenarios can happen in this case and you explained one below, thanks! You should probably call nmi_panic() if there is the risk that the event occurred during panic() on the same CPU, as it would otherwise just block. The spec says, synchronous exceptions should not be triggered; I think panic won't do it; but anything which triggers a WARN or other sync exception in that path can cause undefined behavior. Can you share your thoughts on this? What do you mean by undefined behaviour? I was thinking, if SDEI event preempts EL1, at the point, where EL1 has just entered an exception, and hasn't captured the registers like spsr_el1, elr_el1 and other registers, what will be the behavior? SDEI was originally to report external abort to the OS in regions where the OS can't take an exception because the exception-registers are live, just after and exception and just before eret. If you take another exception from the NMI handler, chances are you're going to go back ro
Queries on ARM SDEI Linux kernel code
Hi James, Have few queries on ARM SDEI Linux code. Queries are listed below; can you please help provide your insights on these? 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available for clients to use; can you please share information on why it is not provided? While trying to dig information on this, I saw that [1] says: Now the hotplug callbacks save nothing, and restore the OS-view of registered/enabled. This makes bound-interrupts harder to work with. Based on this comment, the changes from v4 [2], which I could understand is, cpu down path does not save the current event enable status, and we rely on the enable status `event->reenable', which is set, when register/unregister, enable/disable calls are made; this enable status is used during cpu up path, to decide whether to reenable the interrupt. Does this make, bound-interrupts harder to work with? how? Can you please explain? Or above save/restore is not the reason and you meant something else? Also, does shared bound interrupts also have the same problem, as save/restore behavior was only for private events? 2. SDEI_EVENT_SIGNAL api is not provided? What is the reason for it? Its handling has the same problems, which are there for bound interrupts? Also, if it is provided, clients need to register event 0 ? Vendor events or other event nums are not supported, as per spec. 3. Can kernel panic() be triggered from sdei event handler? Is it a safe operation? The spec says, synchronous exceptions should not be triggered; I think panic won't do it; but anything which triggers a WARN or other sync exception in that path can cause undefined behavior. Can you share your thoughts on this? "The handler code should not enable asynchronous exceptions by clearing any of the PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception level." [1] https://lwn.net/Articles/740817/ [2] https://www.spinics.net/lists/kvm-arm/msg27784.html -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v6 3/4] rcu/trace: Add tracing for how segcb list changes
On 9/23/2020 8:52 PM, Joel Fernandes (Google) wrote: Track how the segcb list changes before/after acceleration, during queuing and during dequeuing. This has proved useful to discover an optimization to avoid unwanted GP requests when there are no callbacks accelerated. The overhead is minimal as each segment's length is now stored in the respective segment. Signed-off-by: Joel Fernandes (Google) --- include/trace/events/rcu.h | 25 + kernel/rcu/rcu_segcblist.c | 34 ++ kernel/rcu/rcu_segcblist.h | 5 + kernel/rcu/tree.c | 9 + 4 files changed, 73 insertions(+) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 155b5cb43cfd..7b84df3c95df 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -505,6 +505,31 @@ TRACE_EVENT_RCU(rcu_callback, __entry->qlen) ); +TRACE_EVENT_RCU(rcu_segcb, + + TP_PROTO(const char *ctx, int *cb_count, unsigned long *gp_seq), + + TP_ARGS(ctx, cb_count, gp_seq), + + TP_STRUCT__entry( + __field(const char *, ctx) + __array(int, cb_count, 4) + __array(unsigned long, gp_seq, 4) Use RCU_CBLIST_NSEGS in place of 4 ? + ), + + TP_fast_assign( + __entry->ctx = ctx; + memcpy(__entry->cb_count, cb_count, 4 * sizeof(int)); + memcpy(__entry->gp_seq, gp_seq, 4 * sizeof(unsigned long)); + ), + + TP_printk("%s cb_count: (DONE=%d, WAIT=%d, NEXT_READY=%d, NEXT=%d) " + "gp_seq: (DONE=%lu, WAIT=%lu, NEXT_READY=%lu, NEXT=%lu)", __entry->ctx, + __entry->cb_count[0], __entry->cb_count[1], __entry->cb_count[2], __entry->cb_count[3], + __entry->gp_seq[0], __entry->gp_seq[1], __entry->gp_seq[2], __entry->gp_seq[3]) + +); + /* * Tracepoint for the registration of a single RCU callback of the special * kvfree() form. The first argument is the RCU type, the second argument diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 0e6d19bd3de9..df0f31e30947 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -13,6 +13,7 @@ #include #include "rcu_segcblist.h" +#include "rcu.h" /* Initialize simple callback list. */ void rcu_cblist_init(struct rcu_cblist *rclp) @@ -343,6 +344,39 @@ void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp, rcu_segcblist_set_seglen(rsclp, RCU_DONE_TAIL, 0); } +/* + * Return how many CBs each segment along with their gp_seq values. + * + * This function is O(N) where N is the number of callbacks. Only used from N is number of segments? + * tracing code which is usually disabled in production. + */ +#ifdef CONFIG_RCU_TRACE +static void rcu_segcblist_countseq(struct rcu_segcblist *rsclp, +int cbcount[RCU_CBLIST_NSEGS], +unsigned long gpseq[RCU_CBLIST_NSEGS]) +{ + int i; + + for (i = 0; i < RCU_CBLIST_NSEGS; i++) + cbcount[i] = 0; + What is the reason for initializing to 0? + for (i = 0; i < RCU_CBLIST_NSEGS; i++) { + cbcount[i] = rcu_segcblist_get_seglen(rsclp, i); + gpseq[i] = rsclp->gp_seq[i]; + } +} + +void trace_rcu_segcb_list(struct rcu_segcblist *rsclp, char *context) +{ + int cbs[RCU_CBLIST_NSEGS]; + unsigned long gps[RCU_CBLIST_NSEGS]; + + rcu_segcblist_countseq(rsclp, cbs, gps); + + trace_rcu_segcb(context, cbs, gps); +} +#endif + /* * Extract only those callbacks still pending (not yet ready to be * invoked) from the specified rcu_segcblist structure and place them in diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 3e0eb1056ae9..15c10d30f88c 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -103,3 +103,8 @@ void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq); bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq); void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp, struct rcu_segcblist *src_rsclp); +#ifdef CONFIG_RCU_TRACE +void trace_rcu_segcb_list(struct rcu_segcblist *rsclp, char *context); +#else +#define trace_rcu_segcb_list(...) +#endif diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 50af465729f4..e3381ff67fc6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1492,6 +1492,8 @@ static bool rcu_accelerate_cbs(struct rcu_node *rnp, struct rcu_data *rdp) if (!rcu_segcblist_pend_cbs(>cblist)) return false; + trace_rcu_segcb_list(>cblist, "SegCbPreAcc"); Use TPS("SegCbPreAcc") ? Thanks Neeraj + /* * Callbacks are often registered with incomplete grace-period *
Re: [PATCH v6 1/4] rcu/tree: Make rcu_do_batch count how many callbacks were executed
On 9/23/2020 8:52 PM, Joel Fernandes (Google) wrote: Currently, rcu_do_batch() depends on the unsegmented callback list's len field to know how many CBs are executed. This fields counts down from 0 as CBs are dequeued. It is possible that all CBs could not be run because of reaching limits in which case the remaining unexecuted callbacks are requeued in the CPU's segcblist. The number of callbacks that were not requeued are then the negative count (how many CBs were run) stored in the rcl->len which has been counting down on every dequeue. This negative count is then added to the per-cpu segmented callback list's to correct its count. Such a design works against future efforts to track the length of each segment of the segmented callback list. The reason is because rcu_segcblist_extract_done_cbs() will be populating the unsegmented callback list's length field (rcl->len) during extraction. Also, the design of counting down from 0 is confusing and error-prone IMHO. This commit therefore explicitly counts have many callbacks were executed in rcu_do_batch() itself, and uses that to update the per-CPU segcb list's ->len field, without relying on the negativity of rcl->len. Signed-off-by: Joel Fernandes (Google) --- kernel/rcu/rcu_segcblist.c | 2 +- kernel/rcu/rcu_segcblist.h | 1 + kernel/rcu/tree.c | 9 - 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 2d2a6b6b9dfb..bb246d8c6ef1 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -95,7 +95,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) * This increase is fully ordered with respect to the callers accesses * both before and after. */ -static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v) +void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v) { #ifdef CONFIG_RCU_NOCB_CPU smp_mb__before_atomic(); /* Up to the caller! */ diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 5c293afc07b8..b90725f81d77 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -76,6 +76,7 @@ static inline bool rcu_segcblist_restempty(struct rcu_segcblist *rsclp, int seg) } void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp); +void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v); void rcu_segcblist_init(struct rcu_segcblist *rsclp); void rcu_segcblist_disable(struct rcu_segcblist *rsclp); void rcu_segcblist_offload(struct rcu_segcblist *rsclp); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 7623128d0020..50af465729f4 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2427,7 +2427,7 @@ static void rcu_do_batch(struct rcu_data *rdp) rcu_segcblist_is_offloaded(>cblist); struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); - long bl, count; + long bl, count = 0; long pending, tlimit = 0; /* If no callbacks are ready, just return. */ @@ -2472,6 +2472,7 @@ static void rcu_do_batch(struct rcu_data *rdp) for (; rhp; rhp = rcu_cblist_dequeue()) { rcu_callback_t f; + count++; debug_rcu_head_unqueue(rhp); rcu_lock_acquire(_callback_map); @@ -2485,9 +2486,8 @@ static void rcu_do_batch(struct rcu_data *rdp) /* * Stop only if limit reached and CPU has something to do. -* Note: The rcl structure counts down from zero. */ - if (-rcl.len >= bl && !offloaded && + if (count >= bl && !offloaded && (need_resched() || (!is_idle_task(current) && !rcu_is_callbacks_kthread( break; Update below usage of -rcl.len also? if (likely((-rcl.len & 31) || local_clock() < tlimit)) Thanks Neeraj @@ -2510,7 +2510,6 @@ static void rcu_do_batch(struct rcu_data *rdp) local_irq_save(flags); rcu_nocb_lock(rdp); - count = -rcl.len; rdp->n_cbs_invoked += count; trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(), is_idle_task(current), rcu_is_callbacks_kthread()); @@ -2518,7 +2517,7 @@ static void rcu_do_batch(struct rcu_data *rdp) /* Update counts and requeue any remaining callbacks. */ rcu_segcblist_insert_done_cbs(>cblist, ); smp_mb(); /* List handling before counting for rcu_barrier(). */ - rcu_segcblist_insert_count(>cblist, ); + rcu_segcblist_add_len(>cblist, -count); /* Reinstate batch limit if we have worked down the excess. */ count = rcu_segcblist_n_cbs(>cblist); -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[tip: core/rcu] rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp()
The following commit has been merged into the core/rcu branch of tip: Commit-ID: 9b1ce0acb5e65e9ea1e6b322562d072f9f7d1f6e Gitweb: https://git.kernel.org/tip/9b1ce0acb5e65e9ea1e6b322562d072f9f7d1f6e Author:Neeraj Upadhyay AuthorDate:Mon, 22 Jun 2020 23:37:03 +05:30 Committer: Paul E. McKenney CommitterDate: Mon, 24 Aug 2020 18:36:06 -07:00 rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp() Originally, the call to rcu_preempt_blocked_readers_cgp() from force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition of rcu_initiate_boost() simply releases the rcu_node structure's ->lock, which is what happens when the "if" condition evaluates to false. This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check, so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y kernels when there are readers blocking the current grace period. This does not change the behavior, but reduces code-reader confusion by eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost(). Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4770d77..acc926f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2533,8 +2533,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) raw_spin_lock_irqsave_rcu_node(rnp, flags); rcu_state.cbovldnext |= !!rnp->cbovldmask; if (rnp->qsmask == 0) { - if (!IS_ENABLED(CONFIG_PREEMPT_RCU) || - rcu_preempt_blocked_readers_cgp(rnp)) { + if (rcu_preempt_blocked_readers_cgp(rnp)) { /* * No point in scanning bits because they * are all zero. But we might need to
[tip: core/rcu] rcu/tree: Force quiescent state on callback overload
The following commit has been merged into the core/rcu branch of tip: Commit-ID: 9c39245382de4d52a122641952900709d4a9950b Gitweb: https://git.kernel.org/tip/9c39245382de4d52a122641952900709d4a9950b Author:Neeraj Upadhyay AuthorDate:Mon, 22 Jun 2020 00:07:27 +05:30 Committer: Paul E. McKenney CommitterDate: Mon, 24 Aug 2020 18:36:05 -07:00 rcu/tree: Force quiescent state on callback overload On callback overload, it is necessary to quickly detect idle CPUs, and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately, the code following the call to this function does not repeat this check, which means that in reality no actual quiescent-state forcing, instead only a couple of quick and pointless wakeups at the beginning of the grace period. This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in the post-wakeup "if" statement in rcu_gp_fqs_loop(). Fixes: 1fca4d12f4637 ("rcu: Expedite first two FQS scans under callback-overload conditions") Reviewed-by: Joel Fernandes (Google) Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8969120..4770d77 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1884,7 +1884,7 @@ static void rcu_gp_fqs_loop(void) break; /* If time for quiescent-state forcing, do it. */ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || - (gf & RCU_GP_FLAG_FQS)) { + (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs);
Re: [PATCH v2] rcu/tree: nocb: Avoid raising softirq when there are ready to execute CBs
On 10/8/2020 4:04 AM, Paul E. McKenney wrote: On Sun, Oct 04, 2020 at 10:11:32PM -0400, Joel Fernandes (Google) wrote: During testing, I see it is possible that rcu_pending() returns 1 when offloaded callbacks are ready to execute thus raising the RCU softirq. However, softirq does not execute offloaded callbacks. They are executed in a kthread which is awakened independent of the softirq. This commit therefore avoids raising the softirq in the first place. That's probably a good thing considering that the purpose of callback offloading is to reduce softirq activity. Passed 30 minute tests of TREE01 through TREE09 each. On TREE08, I notice that there is atmost 150us from when the softirq was NOT raised when ready cbs were present, to when the ready callbacks were invoked by the rcuop thread. This also further confirms that there is no need to raise the softirq for ready cbs in the first place. Cc: neer...@codeaurora.org Signed-off-by: Joel Fernandes (Google) Looks good, applied, thank you! I reworked things a bit based on previous patches and to more precisely capture why this patch does not cause additional problems. Please let me know if I messed anything up. Thanx, Paul commit 33847a34a2d261354a79b4a24d9d37222e8ec888 Author: Joel Fernandes (Google) Date: Wed Oct 7 13:50:36 2020 -0700 rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs Testing showed that rcu_pending() can return 1 when offloaded callbacks are ready to execute. This invokes RCU core processing, for example, by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core(). However, rcu_core() explicitly avoids in any way manipulating offloaded callbacks, which are instead handled by the rcuog and rcuoc kthreads, which work independently of rcu_core(). One exception to this independence is that rcu_core() invokes do_nocb_deferred_wakeup(), however, rcu_pending() also checks rcu_nocb_need_deferred_wakeup() in order to correctly handle this case, invoking rcu_core() when needed. This commit therefore avoids needlessly invoking RCU core processing by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs. This reduces overhead, for example, by reducing softirq activity. This change passed 30 minute tests of TREE01 through TREE09 each. On TREE08, there is at most 150us from the time that rcu_pending() chose not to invoke RCU core processing to the time when the ready callbacks were invoked by the rcuoc kthread. This provides further evidence that there is no need to invoke rcu_core() for offloaded callbacks that are ready to invoke. Cc: Neeraj Upadhyay Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney Reviewed-by: Neeraj Upadhyay Thanks Neeraj diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 85e3f29..bfd38f2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3716,7 +3716,8 @@ static int rcu_pending(int user) return 1; /* Does this CPU have callbacks ready to invoke? */ - if (rcu_segcblist_ready_cbs(>cblist)) + if (!rcu_segcblist_is_offloaded(>cblist) && + rcu_segcblist_ready_cbs(>cblist)) return 1; /* Has RCU gone idle with this CPU needing another grace period? */ -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
Hi Paul, On 9/25/2020 4:29 AM, Paul E. McKenney wrote: On Thu, Sep 24, 2020 at 12:04:10PM +0530, Neeraj Upadhyay wrote: Clarify the "x" in rcuox/N naming in RCU_NOCB_CPU config description. Signed-off-by: Neeraj Upadhyay Applied with a few additional updates as shown below. As always, please let me know if I messed anything up. Looks good! thanks! Thanks Neeraj Thanx, Paul commit 8d1d776b4998896a6f8f4608edb0b258bd37ec9f Author: Neeraj Upadhyay Date: Thu Sep 24 12:04:10 2020 +0530 rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config This commit clarifies that the "p" and the "s" in the in the RCU_NOCB_CPU config-option description refer to the "x" in the "rcuox/N" kthread name. Signed-off-by: Neeraj Upadhyay [ paulmck: While in the area, update description and advice. ] Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index b71e21f..cdc57b4 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -221,19 +221,23 @@ config RCU_NOCB_CPU Use this option to reduce OS jitter for aggressive HPC or real-time workloads. It can also be used to offload RCU callback invocation to energy-efficient CPUs in battery-powered - asymmetric multiprocessors. + asymmetric multiprocessors. The price of this reduced jitter + is that the overhead of call_rcu() increases and that some + workloads will incur significant increases in context-switch + rates. This option offloads callback invocation from the set of CPUs specified at boot time by the rcu_nocbs parameter. For each such CPU, a kthread ("rcuox/N") will be created to invoke callbacks, where the "N" is the CPU being offloaded, and where - the "p" for RCU-preempt (PREEMPTION kernels) and "s" for RCU-sched - (!PREEMPTION kernels). Nothing prevents this kthread from running - on the specified CPUs, but (1) the kthreads may be preempted - between each callback, and (2) affinity or cgroups can be used - to force the kthreads to run on whatever set of CPUs is desired. - - Say Y here if you want to help to debug reduced OS jitter. + the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for + RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread + from running on the specified CPUs, but (1) the kthreads may be + preempted between each callback, and (2) affinity or cgroups can + be used to force the kthreads to run on whatever set of CPUs is + desired. + + Say Y here if you need reduced OS jitter, despite added overhead. Say N here if you are unsure. config TASKS_TRACE_RCU_READ_MB -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
Clarify the "x" in rcuox/N naming in RCU_NOCB_CPU config description. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/Kconfig | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index b71e21f..5b22747 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -227,11 +227,12 @@ config RCU_NOCB_CPU specified at boot time by the rcu_nocbs parameter. For each such CPU, a kthread ("rcuox/N") will be created to invoke callbacks, where the "N" is the CPU being offloaded, and where - the "p" for RCU-preempt (PREEMPTION kernels) and "s" for RCU-sched - (!PREEMPTION kernels). Nothing prevents this kthread from running - on the specified CPUs, but (1) the kthreads may be preempted - between each callback, and (2) affinity or cgroups can be used - to force the kthreads to run on whatever set of CPUs is desired. + the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for + RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread + from running on the specified CPUs, but (1) the kthreads may be + preempted between each callback, and (2) affinity or cgroups can + be used to force the kthreads to run on whatever set of CPUs is + desired. Say Y here if you want to help to debug reduced OS jitter. Say N here if you are unsure. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH V2] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Hi Paul, On 9/24/2020 2:33 AM, Paul E. McKenney wrote: On Wed, Sep 23, 2020 at 12:59:33PM +0530, Neeraj Upadhyay wrote: Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used, for transition of __num_online_cpus from 2 -> 1 for cpu down path and 1 -> 2 for cpu up path. Again, good catch! As usual, I could not resist editing the commit log and comments, so could you please look the following over to make sure that I did not mess anything up? The commit log and comments look very good! Thanks! Thanks Neeraj Thanx, Paul commit 7af8c1c8d13c6c9c7013fbfef77f843dbc430c4b Author: Neeraj Upadhyay Date: Wed Sep 23 12:59:33 2020 +0530 rcu: Fix single-CPU check in rcu_blocking_is_gp() Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses num_online_cpus() to determine whether there is only one CPU online. When there is only single CPU online, the simple fact that synchronize_rcu() could be legally called implies that a full grace period has elapsed. Therefore, in the single-CPU case, synchronize_rcu() simply returns immediately. Unfortunately, num_online_cpus() is unreliable while a CPU-hotplug operation is transitioning to or from single-CPU operation because: 1. num_online_cpus() uses atomic_read(&__num_online_cpus) to locklessly sample the number of online CPUs. The hotplug locks are not held, which means that an incoming CPU can concurrently update this count. This in turn means that an RCU read-side critical section on the incoming CPU might observe updates prior to the grace period, but also that this critical section might extend beyond the end of the optimized synchronize_rcu(). This breaks RCU's fundamental guarantee. 2. In addition, num_online_cpus() does no ordering, thus providing another way that RCU's fundamental guarantee can be broken by the current code. 3. The most probable failure mode happens on outgoing CPUs. The outgoing CPU updates the count of online CPUs in the CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in and of itself due to preemption being disabled at the call to num_online_cpus(). Unfortunately, after that stop-machine handler returns, the CPU takes one last trip through the scheduler (which has RCU readers) and, after the resulting context switch, one final dive into the idle loop. During this time, RCU needs to keep track of two CPUs, but num_online_cpus() will say that there is only one, which in turn means that the surviving CPU will incorrectly ignore the outgoing CPU's RCU read-side critical sections. This problem is illustrated by the following litmus test in which P0() corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU. The herd7 tool confirms that the "exists" clause can be satisfied, thus demonstrating that this breakage can happen according to the Linux kernel memory model. { int x = 0; atomic_t numonline = ATOMIC_INIT(1); } P0(int *x, atomic_t *numonline) { int r0; WRITE_ONCE(*x, 1); r0 = atomic_read(numonline); if (r0 == 1) { smp_mb(); } else { synchronize_rcu(); } WRITE_ONCE(*x, 2); } P1(int *x, atomic_t *numonline) { int r0; int r1; atomic_inc(numonline); smp_mb(); rcu_read_lock(); r0 = READ_ONCE(*x); smp_rmb(); r1 = READ_ONCE(*x); rcu_read_unlock(); } locations [x;numonline;] exists (1:r0=0 /\ 1:r1=2) It is important to note that these problems arise only when the system is transitioning to or from single-CPU operation. One solution would be to hold the CPU-hotplug locks while sampling num_online_cpus(), which was in fact the intent of the (redundant) preempt_disable() and preempt_enable() surrounding this call to num_online_cpus(). Actually blocking CPU hotplug would not only result in excessive overhead, but would also unnecessaril
[PATCH V2] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used, for transition of __num_online_cpus from 2 -> 1 for cpu down path and 1 -> 2 for cpu up path. 1. For CPU path, num_online_cpus() does a atomic_read(&__num_online_cpus). As hotplug locks are not held, this does not ensure that new incoming cpus update of the count is visible. This can result in read side section on new incoming cpu, observe updates which should not be visible beyond the grace period corresponding to synchronize_rcu(). For e.g. below litmus test, where P0 process corresponds to synchronize_rcu() and P1 corresponds to new online cpu, has positive witnesses; confirming the possibility of read side section to extend before and after the grace period, thereby breaking guarantees provided by synchronize_rcu(). { int x = 0; atomic_t numonline = ATOMIC_INIT(1); } P0(int *x, atomic_t *numonline) { int r0; WRITE_ONCE(*x, 1); r0 = atomic_read(numonline); if (r0 == 1) { smp_mb(); } else { synchronize_rcu(); } WRITE_ONCE(*x, 2); } P1(int *x, atomic_t *numonline) { int r0; int r1; atomic_inc(numonline); smp_mb(); rcu_read_lock(); r0 = READ_ONCE(*x); smp_rmb(); r1 = READ_ONCE(*x); rcu_read_unlock(); } locations [x;numonline;] exists (1:r0=0 /\ 1:r1=2) 2. Second problem is, the same early exit, from synchronize_rcu() does not provide full ordering, memory barrier, w.r.t. memory accesses after synchronize_rcu() call. 3. Third, more important issue is related to outgoing cpu. Checking only for __num_online_cpus with preemotion disabled isn't sufficient for RCU, as, on completion of CPUHP_TEARDOWN_CPU stop machine (which clears outgoing cpu from __num_online_cpus, the CPU switches to idle task. So, checking only for __num_online_cpus does not consider RCU read side sections in scheduler code (before switching to idle task) and any potential read side sections in idle task, before final RCU-quiesce entry into cpuhp_report_idle_dead() -> rcu_report_dead(). The problem here is, changes to number of online cpus happen from places that are not helpful for lockless access from synchronize_rcu() : for offlining cpu, it happens from stop machine on that cpu, before cpu is marked offline in rcu bookkeeping, resulting in unwatched RCU critical sections; and for onlining cpu, it happens on the new cpu, after it is online, resulting in visibility issues. To handle these issues, add a new rcu_state member n_online_cpus, to keep count of the current number of online cpus. This counter is incremented early in the CPU-hotplug CPU-online process, that is, on the control CPU, prior to the new CPU coming online. For outgoing cpu, counter is decremented on the control cpu, after the offlining cpu is long gone. Update of this counter from control cpu means that, if rcu_state. n_online_cpus is equal to one in rcu_blocking_is_gp(), below is guaranteed: a. There is only one CPU, and that CPU sees all prior accesses made by any CPU that was online at the time of its access. Furthermore, rcu_state.n_online_cpus value cannot change, as CONFIG_PREEMPTION=n. b. All later CPUs (both this one and any that come online later on) are guaranteed to see all accesses by any CPU prior to this point in the code. The CPU-hotplug code provides the required memory barriers to ensure this. This eliminates the need for any additional memory barriers for the single cpu online case; it also eliminates the early counter update and visibility problems with __num_online_cpus value. For multiple cpu online case, synchronize_rcu() does all the work of grace period initiation and wait and that provides the required ordering gurantees. rcu_state.n_online_cpus update from control cpu would result in unnecessary calls to synchronize_rcu() slow path during the CPU-online process, but that should have negligible impact. Signed-off-by: Neeraj Upadhyay --- Changes in V2: - Make rcu_state.n_online_cpus int, instead of atomic_t. - Move counter updates to rcutree_prepare_cpu() and rcutree_dead_cpu(). - Update commit log and code comments to highlight single-cpu behavior. kernel/rcu/tree.c | 21 - kernel/rcu/tree.h | 1 + 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 2424e2a..facdec9 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2405,6 +2405,7 @@ int rcutree_dead_cpu(unsigned int cpu) if (!IS_ENABLED(CONFI
Re: [PATCH] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Hi Paul, On 9/23/2020 1:59 AM, Paul E. McKenney wrote: On Tue, Sep 22, 2020 at 01:15:57AM +0530, Neeraj Upadhyay wrote: Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used: Great catch!!! I do have some questions about your suggested fix, though. 1. num_online_cpus() does a atomic_read(&__num_online_cpus). As hotplug locks are not held, this does not ensure that new incoming cpus update of the count is visible. This can result in read side section on new incoming cpu, observe updates which should not be visible beyond the grace period corresponding to synchronize_rcu(). For e.g. below litmus test, where P0 process corresponds to synchronize_rcu() and P1 corresponds to new online cpu, has positive witnesses; confirming the possibility of read side section to extend before and after the grace period, thereby breaking guarantees provided by synchronize_rcu(). { int x = 0; atomic_t numonline = ATOMIC_INIT(1); } P0(int *x, atomic_t *numonline) { int r0; WRITE_ONCE(*x, 1); r0 = atomic_read(numonline); if (r0 == 1) { smp_mb(); } else { synchronize_rcu(); } WRITE_ONCE(*x, 2); } P1(int *x, atomic_t *numonline) { int r0; int r1; atomic_inc(numonline); smp_mb(); rcu_read_lock(); r0 = READ_ONCE(*x); smp_rmb(); r1 = READ_ONCE(*x); rcu_read_unlock(); } locations [x;numonline;] exists (1:r0=0 /\ 1:r1=2) 2. Second problem is, the same early exit, from synchronize_rcu() does not provide full ordering, memory barrier, w.r.t. memory accesses after synchronize_rcu() call. 3. Third, more important issue is related to outgoing cpu. Checking only for __num_online_cpus with preemotion disabled isn't sufficient for RCU, as, on completion of CPUHP_TEARDOWN_CPU stop machine (which clears outgoing cpu from __num_online_cpus, the CPU switches to idle task. So, checking only for __num_online_cpus does not consider RCU read side sections in scheduler code (before switching to idle task) and any potential read side sections in idle task, before final RCU-quiesce entry into cpuhp_report_idle_dead() -> rcu_report_dead(). To handle these issues, add a new rcu_state member n_online_cpus, to keep account of the current number of online cpus. The atomic updates to this counter from rcu_report_dead() and rcu_cpu_starting() and the added read/write memory ordering semantics ensure that synchronize_rcu() fast path waits for all read side sections, where incoming/outgoing cpus are considered online, for RCU i.e. after rcu_cpu_starting() and before rcu_report_dead(). Signed-off-by: Neeraj Upadhyay --- Below is the reproducer for issue described in point 3; this snippet is based on klitmus generated test, which is modified to sample reads from idle thread: static void code0(int* x) { WRITE_ONCE(*x, 1); idle_ctr = 0; smp_mb(); mdelay(10); WRITE_ONCE(*x, 1); idle_ctr = 1; synchronize_rcu(); WRITE_ONCE(*x, 2); idle_ctr = 2; } static int thread0(void *_p) { int _j, _i; ctx_t *_a = (ctx_t *)_p; smp_mb(); for (_j = 0 ; _j < stride ; _j++) { for (_i = _j ; _i < size ; _i += stride) { while (idle_wait1) { cpu_relax(); cond_resched(); } code0(&_a->x[_i]); smp_mb(); get_online_cpus(); idle_wait1 = true; put_online_cpus(); } } atomic_inc(); smp_mb(); wake_up(wq); smp_mb(); do_exit(0); } static void code1(int* x,int* out_1_r1,int* out_1_r0) { int r0; int r1; r0 = READ_ONCE(idle_ctr_snap1); r1 = READ_ONCE(idle_ctr_snap2); *out_1_r1 = (int)r1; *out_1_r0 = (int)r0; } static int thread1(void *_p) { ctx_t *_a = (ctx_t *)_p; int _j, _i; smp_mb(); for (_j = 0 ; _j < stride ; _j++) { for (_i = _j ; _i < size ; _i += stride) { while (idle_wait2) { cpu_relax(); cond_resched(); } get_online_cpus(); code1(&_a->x[_i],&_a->out_1_r1[_i],&_a->out_1_r0[_i]); smp_mb(); idle_wait2 = true; put_online_cpus(); } } atomic_inc(); smp_mb(); wake_up(wq); smp_mb(); do_exit(0); } Idle thread snippet: if (cpu_is_offline(cpu)) { smp_mb(); idle_wait1 = false; mdelay(8); smp_mb(); rcu_read_lock(); idle_ctr_snap1 = idle_ctr; mdelay(40); smp_rmb(); idle_ctr_snap2 = idle_c
[PATCH] rcu/tree: Correctly handle single cpu check in rcu_blocking_is_gp
Currently, for non-preempt kernels (with CONFIG_PREEMPTION=n), rcu_blocking_is_gp() checks (with preemption disabled), whether there is only one cpu online. It uses num_online_cpus() to decide whether only one cpu is online. If there is only single cpu online, synchronize_rcu() is optimized to return without doing all the work to wait for grace period. However, there are few issues with the num_online_cpus() check used: 1. num_online_cpus() does a atomic_read(&__num_online_cpus). As hotplug locks are not held, this does not ensure that new incoming cpus update of the count is visible. This can result in read side section on new incoming cpu, observe updates which should not be visible beyond the grace period corresponding to synchronize_rcu(). For e.g. below litmus test, where P0 process corresponds to synchronize_rcu() and P1 corresponds to new online cpu, has positive witnesses; confirming the possibility of read side section to extend before and after the grace period, thereby breaking guarantees provided by synchronize_rcu(). { int x = 0; atomic_t numonline = ATOMIC_INIT(1); } P0(int *x, atomic_t *numonline) { int r0; WRITE_ONCE(*x, 1); r0 = atomic_read(numonline); if (r0 == 1) { smp_mb(); } else { synchronize_rcu(); } WRITE_ONCE(*x, 2); } P1(int *x, atomic_t *numonline) { int r0; int r1; atomic_inc(numonline); smp_mb(); rcu_read_lock(); r0 = READ_ONCE(*x); smp_rmb(); r1 = READ_ONCE(*x); rcu_read_unlock(); } locations [x;numonline;] exists (1:r0=0 /\ 1:r1=2) 2. Second problem is, the same early exit, from synchronize_rcu() does not provide full ordering, memory barrier, w.r.t. memory accesses after synchronize_rcu() call. 3. Third, more important issue is related to outgoing cpu. Checking only for __num_online_cpus with preemotion disabled isn't sufficient for RCU, as, on completion of CPUHP_TEARDOWN_CPU stop machine (which clears outgoing cpu from __num_online_cpus, the CPU switches to idle task. So, checking only for __num_online_cpus does not consider RCU read side sections in scheduler code (before switching to idle task) and any potential read side sections in idle task, before final RCU-quiesce entry into cpuhp_report_idle_dead() -> rcu_report_dead(). To handle these issues, add a new rcu_state member n_online_cpus, to keep account of the current number of online cpus. The atomic updates to this counter from rcu_report_dead() and rcu_cpu_starting() and the added read/write memory ordering semantics ensure that synchronize_rcu() fast path waits for all read side sections, where incoming/outgoing cpus are considered online, for RCU i.e. after rcu_cpu_starting() and before rcu_report_dead(). Signed-off-by: Neeraj Upadhyay --- Below is the reproducer for issue described in point 3; this snippet is based on klitmus generated test, which is modified to sample reads from idle thread: static void code0(int* x) { WRITE_ONCE(*x, 1); idle_ctr = 0; smp_mb(); mdelay(10); WRITE_ONCE(*x, 1); idle_ctr = 1; synchronize_rcu(); WRITE_ONCE(*x, 2); idle_ctr = 2; } static int thread0(void *_p) { int _j, _i; ctx_t *_a = (ctx_t *)_p; smp_mb(); for (_j = 0 ; _j < stride ; _j++) { for (_i = _j ; _i < size ; _i += stride) { while (idle_wait1) { cpu_relax(); cond_resched(); } code0(&_a->x[_i]); smp_mb(); get_online_cpus(); idle_wait1 = true; put_online_cpus(); } } atomic_inc(); smp_mb(); wake_up(wq); smp_mb(); do_exit(0); } static void code1(int* x,int* out_1_r1,int* out_1_r0) { int r0; int r1; r0 = READ_ONCE(idle_ctr_snap1); r1 = READ_ONCE(idle_ctr_snap2); *out_1_r1 = (int)r1; *out_1_r0 = (int)r0; } static int thread1(void *_p) { ctx_t *_a = (ctx_t *)_p; int _j, _i; smp_mb(); for (_j = 0 ; _j < stride ; _j++) { for (_i = _j ; _i < size ; _i += stride) { while (idle_wait2) { cpu_relax(); cond_resched(); } get_online_cpus(); code1(&_a->x[_i],&_a->out_1_r1[_i],&_a->out_1_r0[_i]); smp_mb(); idle_wait2 = true; put_online_cpus(); } } atomic_inc(); smp_mb(); wake_up(wq); smp_mb(); do_exit(0); } Idle thread snippet: if (cpu_is_offline(cpu)) { smp_mb(); idle_wait1 = false; mdelay(8); smp_mb(); rcu_read_lock(); idle_ctr_snap1 = idle_ctr; mdelay(40); smp_rmb(); idle_ctr_snap2 = idle_ctr; rcu_read_unlock(); smp_mb(); idle_wait2 = false; tick_nohz_idle_stop_tick(); cpuhp_report_idle_dead(); arch_cpu_idle_dead(); kernel/rcu/tree.c | 65 +++ kernel/rcu/tree.h | 1 + 2 files changed, 66 insertions(+) dif
Re: [PATCH] rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp
Hi Paul, On 6/23/2020 4:48 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 11:37:03PM +0530, Neeraj Upadhyay wrote: Remove CONFIG_PREMPT_RCU check in force_qs_rnp(). Originally, this check was required to skip executing fqs failsafe for rcu-sched, which was added in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, this failsafe has been removed, since then. So, cleanup the code to avoid any confusion around the need for boosting, for !CONFIG_PREMPT_RCU. Signed-off-by: Neeraj Upadhyay Good point, there is a !PREEMPT definition of the function rcu_preempt_blocked_readers_cgp() that unconditionally returns zero. And if !PREEMPT kernels, the same things happens in the "if" body as after it, so behavior is not changed. I have queued and pushed this with an upgraded commit log as shown below. Thanx, Paul Thanks! patch looks good to me! Thanks Neeraj --- kernel/rcu/tree.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6226bfb..57c904b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2514,8 +2514,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) raw_spin_lock_irqsave_rcu_node(rnp, flags); rcu_state.cbovldnext |= !!rnp->cbovldmask; if (rnp->qsmask == 0) { - if (!IS_ENABLED(CONFIG_PREEMPT_RCU) || - rcu_preempt_blocked_readers_cgp(rnp)) { + if (rcu_preempt_blocked_readers_cgp(rnp)) { /* * No point in scanning bits because they * are all zero. But we might need to -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project commit a4600389c35010aef414b89e2817d4a527e751b5 Author: Neeraj Upadhyay Date: Mon Jun 22 23:37:03 2020 +0530 rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp() Originally, the call to rcu_preempt_blocked_readers_cgp() from force_qs_rnp() had to be conditioned on CONFIG_PREEMPT_RCU=y, as in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, there is now a CONFIG_PREEMPT_RCU=n definition of rcu_preempt_blocked_readers_cgp() that unconditionally returns zero, so invoking it is now safe. In addition, the CONFIG_PREEMPT_RCU=n definition of rcu_initiate_boost() simply releases the rcu_node structure's ->lock, which is what happens when the "if" condition evaluates to false. This commit therefore drops the IS_ENABLED(CONFIG_PREEMPT_RCU) check, so that rcu_initiate_boost() is called only in CONFIG_PREEMPT_RCU=y kernels when there are readers blocking the current grace period. This does not change the behavior, but reduces code-reader confusion by eliminating non-CONFIG_PREEMPT_RCU=y calls to rcu_initiate_boost(). Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6226bfb..57c904b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2514,8 +2514,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) raw_spin_lock_irqsave_rcu_node(rnp, flags); rcu_state.cbovldnext |= !!rnp->cbovldmask; if (rnp->qsmask == 0) { - if (!IS_ENABLED(CONFIG_PREEMPT_RCU) || - rcu_preempt_blocked_readers_cgp(rnp)) { + if (rcu_preempt_blocked_readers_cgp(rnp)) { /* * No point in scanning bits because they * are all zero. But we might need to -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu/tree: Force quiescent state on callback overload
Hi Paul, On 6/23/2020 4:23 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 09:16:24AM +0530, Neeraj Upadhyay wrote: Hi Paul, On 6/22/2020 8:43 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 01:30:31AM +0530, Neeraj Upadhyay wrote: Hi Paul, On 6/22/2020 1:20 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 12:07:27AM +0530, Neeraj Upadhyay wrote: On callback overload, we want to force quiescent state immediately, for the first and second fqs. Enforce the same, by including RCU_GP_FLAG_OVLD flag, in fqsstart check. Signed-off-by: Neeraj Upadhyay Good catch! But what did you do to verify that this change does the right thing? Thanx, Paul I haven't done a runtime verification of this code path; I posted this, based on review of this code. My concern is that under overload, the FQS scans would happen continuously rather than accelerating only the first such scan in a given grace period. This would of course result in a CPU-bound grace-period kthread, which users might not be all that happy with. Or am I missing something subtle that prevents this? Looks like under overload, only the first and second scans are accelerated? gf = 0; if (first_gp_fqs) { first_gp_fqs = false; gf = rcu_state.cbovld ? RCU_GP_FLAG_OVLD : 0; } Very good, it does sound like you understand this, and it matches my analysis and passes light testing, so I queued this one. I did improve the commit log, please check below. The added detail is helpful to people (including ourselves, by the way) who might need to look at this commit some time in the future. Thanks; patch looks good; I will try to put more efforts on commit log for future patches. If you have an x86 system lying around, running rcutorture is quite straightforward. Non-x86 systems can also run rcutorture, if nothing else by using modprobe and rmmod as described here: https://paulmck.livejournal.com/57769.html The scripting described in the latter part of this document has worked on ARMv8 and PowerPC, and might still work for all I know. I will set it up at my end; the livejournal is pretty detailed! Thanks for sharing this! Thanks Neeraj Thanx, Paul commit 9482524d7dd0aea5d32a6efa2979223eea07c029 Author: Neeraj Upadhyay Date: Mon Jun 22 00:07:27 2020 +0530 rcu/tree: Force quiescent state on callback overload On callback overload, it is necessary to quickly detect idle CPUs, and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately, the code following the call to this function does not repeat this check, which means that in reality no actual quiescent-state forcing, instead only a couple of quick and pointless wakeups at the beginning of the grace period. This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in the post-wakeup "if" statement in rcu_gp_fqs_loop(). Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d0988a1..6226bfb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1865,7 +1865,7 @@ static void rcu_gp_fqs_loop(void) break; /* If time for quiescent-state forcing, do it. */ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || - (gf & RCU_GP_FLAG_FQS)) { + (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu/tree: Remove CONFIG_PREMPT_RCU check in force_qs_rnp
Remove CONFIG_PREMPT_RCU check in force_qs_rnp(). Originally, this check was required to skip executing fqs failsafe for rcu-sched, which was added in commit a77da14ce9af ("rcu: Yet another fix for preemption and CPU hotplug"). However, this failsafe has been removed, since then. So, cleanup the code to avoid any confusion around the need for boosting, for !CONFIG_PREMPT_RCU. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6226bfb..57c904b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2514,8 +2514,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) raw_spin_lock_irqsave_rcu_node(rnp, flags); rcu_state.cbovldnext |= !!rnp->cbovldmask; if (rnp->qsmask == 0) { - if (!IS_ENABLED(CONFIG_PREEMPT_RCU) || - rcu_preempt_blocked_readers_cgp(rnp)) { + if (rcu_preempt_blocked_readers_cgp(rnp)) { /* * No point in scanning bits because they * are all zero. But we might need to -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH] rcu/tree: Force quiescent state on callback overload
Hi Paul, On 6/22/2020 8:43 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 01:30:31AM +0530, Neeraj Upadhyay wrote: Hi Paul, On 6/22/2020 1:20 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 12:07:27AM +0530, Neeraj Upadhyay wrote: On callback overload, we want to force quiescent state immediately, for the first and second fqs. Enforce the same, by including RCU_GP_FLAG_OVLD flag, in fqsstart check. Signed-off-by: Neeraj Upadhyay Good catch! But what did you do to verify that this change does the right thing? Thanx, Paul I haven't done a runtime verification of this code path; I posted this, based on review of this code. My concern is that under overload, the FQS scans would happen continuously rather than accelerating only the first such scan in a given grace period. This would of course result in a CPU-bound grace-period kthread, which users might not be all that happy with. Or am I missing something subtle that prevents this? Looks like under overload, only the first and second scans are accelerated? gf = 0; if (first_gp_fqs) { first_gp_fqs = false; gf = rcu_state.cbovld ? RCU_GP_FLAG_OVLD : 0; } Thanks Neeraj But yes, it does look like the current mainline code fails to do the first scan immediately, so again, good catch! Thanx, Paul Thanks Neeraj --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d0988a1..6226bfb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1865,7 +1865,7 @@ static void rcu_gp_fqs_loop(void) break; /* If time for quiescent-state forcing, do it. */ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || - (gf & RCU_GP_FLAG_FQS)) { + (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu/tree: Force quiescent state on callback overload
Hi Paul, On 6/22/2020 1:20 AM, Paul E. McKenney wrote: On Mon, Jun 22, 2020 at 12:07:27AM +0530, Neeraj Upadhyay wrote: On callback overload, we want to force quiescent state immediately, for the first and second fqs. Enforce the same, by including RCU_GP_FLAG_OVLD flag, in fqsstart check. Signed-off-by: Neeraj Upadhyay Good catch! But what did you do to verify that this change does the right thing? Thanx, Paul I haven't done a runtime verification of this code path; I posted this, based on review of this code. Thanks Neeraj --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d0988a1..6226bfb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1865,7 +1865,7 @@ static void rcu_gp_fqs_loop(void) break; /* If time for quiescent-state forcing, do it. */ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || - (gf & RCU_GP_FLAG_FQS)) { + (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu/tree: Force quiescent state on callback overload
On callback overload, we want to force quiescent state immediately, for the first and second fqs. Enforce the same, by including RCU_GP_FLAG_OVLD flag, in fqsstart check. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d0988a1..6226bfb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1865,7 +1865,7 @@ static void rcu_gp_fqs_loop(void) break; /* If time for quiescent-state forcing, do it. */ if (!time_after(rcu_state.jiffies_force_qs, jiffies) || - (gf & RCU_GP_FLAG_FQS)) { + (gf & (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD))) { trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: Query regarding pseudo nmi support on GIC V3 and request_nmi()
Hi Marc, Thanks a lot for your comments. I will work on exploring how SDEI can be used for it. Thanks Neeraj On 5/8/2020 9:41 PM, Marc Zyngier wrote: On Fri, 08 May 2020 14:34:10 +0100, Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 6:23 PM, Marc Zyngier wrote: On Fri, 8 May 2020 18:09:00 +0530 Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 5:57 PM, Marc Zyngier wrote: On Fri, 8 May 2020 16:36:42 +0530 Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 4:15 PM, Marc Zyngier wrote: On Thu, 07 May 2020 17:06:19 +0100, Neeraj Upadhyay wrote: Hi, I have one query regarding pseudo NMI support on GIC v3; from what I could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs. However the request_nmi() in irq framework requires NMI to be per cpu interrupt source (it checks for IRQF_PERCPU). Can you please help understand this part, how SPIs can be configured as NMIs, if there is a per cpu interrupt source restriction? Let me answer your question by another question: what is the semantic of a NMI if you can't associate it with a particular CPU? >> I was actually thinking of a use case, where, we have a watchdog interrupt (which is a SPI), which is used for detecting software hangs and cause device reset; If that interrupt's current cpu affinity is on a core, where interrupts are disabled, we won't be able to serve it; so, we need to group that interrupt as an fiq; Linux doesn't use Group-0 interrupts, as they are strictly secure (unless your SoC doesn't have EL3, which I doubt). Yes, we handle that watchdog interrupt as a Group-0 interrupt, which is handled as fiq in EL3. I was thinking, if its feasible to mark that interrupt as pseudo NMI and route it to EL1 as irq. However, looks like that is not the semantic of a NMI and we would need something like pseudo NMI ipi for this. Sending a NMI IPI from another NMI handler? Even once I've added these, there is no way this will work for that particular scenario. Just look at the restrictions we impose on NMIs. Sending a pseudo NMI IPI (to EL1) from fiq handler (which runs in EL3); I will check, but do you think, that might not work? How do you know, from EL3, what to write in memory so that the NMI handler knows what you want to do? Are you going to parse the S1 page tables? Hard-code the behaviour of some random Linux version in your legendary non-updatable firmware? This isn't an acceptable behaviour. Ok, I understand; Initial thought was to use watchdog SPI as pseudo NMI; however, as pseudo NMIs are only per CPU sources, we were exploring the possibility of using an unused ipi (using the work which is done in [1] and [2] for SGIs) as pseudo NMI, which EL3 sends to EL1, on receiving watchdog fiq. The pseudo NMI handler would collect required debug information, to help indentify the lockup cause. We weren't thinking of communicating any information from EL3 fiq handler to EL1. What if the operating system running at EL1/EL2 is *not* Linux? However, from this discussion, I realize that calling irq handler from fiq handler, would not be possible. So, the approach looks flawed. I believe, allowing a non-percpu pseudo NMI is not acceptable to community? No, I really don't want to entertain this idea, because the semantics are way too loosely defined and you'd end up with everyone wanting something mildly different. An IPI is between two CPUs used by the same SW entitiy. What runs at EL3 is completely alien to Linux, and so is Linux to EL3. If you want to IPI, send Group-0 IPIs that are private to the firmware. Ok got it; however, I wonder what's the use case of sending SGI to EL1, from secure world, using ICC_ASGI1R. I thought it allowed communication between EL1 and EL3; but, looks like I understood in wrong. There is what the GIC architecture can do, and there is what is sensible for Linux. The GIC allows IPIs from S-to-NS as well as the opposite. This doesn't make it a good idea (it actually is a terrible idea, and I really hope that future versions of the architecture will simply kill the feature). The idea was that you'd make SGIs an first class ABI between S and NS. Given that the two are developed separately and that nobody ever standardised what the SGI numbers mean, this idea is completely dead. If you want to inject NMI-type exceptions into EL1, you can always try SDEI (did I actually write this? Help!). But given your use case below, that wouldn't work either. Ok. Frankly, if all you need to do is to reset the SoC, use EL3 firmware. That is what it is for. Before triggering SoC reset, we want to collect certain EL1 debug information like stack trace for CPUs and other debug information. Frankly, if you are going to reset the SoC because EL1/EL2 has gone bust, how can you trust it to do anything sensible when injecting an interrupt?. Once you take a SPI at EL3, gather whatever state you want from EL3. Do not involve EL1 at all. M. Agree that it
Re: Query regarding pseudo nmi support on GIC V3 and request_nmi()
Hi Marc, On 5/8/2020 6:23 PM, Marc Zyngier wrote: On Fri, 8 May 2020 18:09:00 +0530 Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 5:57 PM, Marc Zyngier wrote: On Fri, 8 May 2020 16:36:42 +0530 Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 4:15 PM, Marc Zyngier wrote: On Thu, 07 May 2020 17:06:19 +0100, Neeraj Upadhyay wrote: Hi, I have one query regarding pseudo NMI support on GIC v3; from what I could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs. However the request_nmi() in irq framework requires NMI to be per cpu interrupt source (it checks for IRQF_PERCPU). Can you please help understand this part, how SPIs can be configured as NMIs, if there is a per cpu interrupt source restriction? Let me answer your question by another question: what is the semantic of a NMI if you can't associate it with a particular CPU? >> I was actually thinking of a use case, where, we have a watchdog interrupt (which is a SPI), which is used for detecting software hangs and cause device reset; If that interrupt's current cpu affinity is on a core, where interrupts are disabled, we won't be able to serve it; so, we need to group that interrupt as an fiq; Linux doesn't use Group-0 interrupts, as they are strictly secure (unless your SoC doesn't have EL3, which I doubt). Yes, we handle that watchdog interrupt as a Group-0 interrupt, which is handled as fiq in EL3. I was thinking, if its feasible to mark that interrupt as pseudo NMI and route it to EL1 as irq. However, looks like that is not the semantic of a NMI and we would need something like pseudo NMI ipi for this. Sending a NMI IPI from another NMI handler? Even once I've added these, there is no way this will work for that particular scenario. Just look at the restrictions we impose on NMIs. Sending a pseudo NMI IPI (to EL1) from fiq handler (which runs in EL3); I will check, but do you think, that might not work? How do you know, from EL3, what to write in memory so that the NMI handler knows what you want to do? Are you going to parse the S1 page tables? Hard-code the behaviour of some random Linux version in your legendary non-updatable firmware? This isn't an acceptable behaviour. Ok, I understand; Initial thought was to use watchdog SPI as pseudo NMI; however, as pseudo NMIs are only per CPU sources, we were exploring the possibility of using an unused ipi (using the work which is done in [1] and [2] for SGIs) as pseudo NMI, which EL3 sends to EL1, on receiving watchdog fiq. The pseudo NMI handler would collect required debug information, to help indentify the lockup cause. We weren't thinking of communicating any information from EL3 fiq handler to EL1. However, from this discussion, I realize that calling irq handler from fiq handler, would not be possible. So, the approach looks flawed. I believe, allowing a non-percpu pseudo NMI is not acceptable to community? [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/gic-sgi [2] https://lkml.org/lkml/2020/4/25/135 An IPI is between two CPUs used by the same SW entitiy. What runs at EL3 is completely alien to Linux, and so is Linux to EL3. If you want to IPI, send Group-0 IPIs that are private to the firmware. Ok got it; however, I wonder what's the use case of sending SGI to EL1, from secure world, using ICC_ASGI1R. I thought it allowed communication between EL1 and EL3; but, looks like I understood in wrong. If you want to inject NMI-type exceptions into EL1, you can always try SDEI (did I actually write this? Help!). But given your use case below, that wouldn't work either. Ok. Frankly, if all you need to do is to reset the SoC, use EL3 firmware. That is what it is for. Before triggering SoC reset, we want to collect certain EL1 debug information like stack trace for CPUs and other debug information. Frankly, if you are going to reset the SoC because EL1/EL2 has gone bust, how can you trust it to do anything sensible when injecting an interrupt?. Once you take a SPI at EL3, gather whatever state you want from EL3. Do not involve EL1 at all. M. Agree that it might not work for all cases. But, for the cases like, some kernel code is stuck after disabling local irqs; pseudo NMI might still be able to run and capture stack and other debug info, to help detect the cause of lockups. Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: Query regarding pseudo nmi support on GIC V3 and request_nmi()
Hi Marc, On 5/8/2020 5:57 PM, Marc Zyngier wrote: On Fri, 8 May 2020 16:36:42 +0530 Neeraj Upadhyay wrote: Hi Marc, On 5/8/2020 4:15 PM, Marc Zyngier wrote: On Thu, 07 May 2020 17:06:19 +0100, Neeraj Upadhyay wrote: Hi, I have one query regarding pseudo NMI support on GIC v3; from what I could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs. However the request_nmi() in irq framework requires NMI to be per cpu interrupt source (it checks for IRQF_PERCPU). Can you please help understand this part, how SPIs can be configured as NMIs, if there is a per cpu interrupt source restriction? Let me answer your question by another question: what is the semantic of a NMI if you can't associate it with a particular CPU? I was actually thinking of a use case, where, we have a watchdog interrupt (which is a SPI), which is used for detecting software hangs and cause device reset; If that interrupt's current cpu affinity is on a core, where interrupts are disabled, we won't be able to serve it; so, we need to group that interrupt as an fiq; Linux doesn't use Group-0 interrupts, as they are strictly secure (unless your SoC doesn't have EL3, which I doubt). Yes, we handle that watchdog interrupt as a Group-0 interrupt, which is handled as fiq in EL3. I was thinking, if its feasible to mark that interrupt as pseudo NMI and route it to EL1 as irq. However, looks like that is not the semantic of a NMI and we would need something like pseudo NMI ipi for this. Sending a NMI IPI from another NMI handler? Even once I've added these, there is no way this will work for that particular scenario. Just look at the restrictions we impose on NMIs. Sending a pseudo NMI IPI (to EL1) from fiq handler (which runs in EL3); I will check, but do you think, that might not work? Frankly, if all you need to do is to reset the SoC, use EL3 firmware. That is what it is for. Before triggering SoC reset, we want to collect certain EL1 debug information like stack trace for CPUs and other debug information. Thanks, M. Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: Query regarding pseudo nmi support on GIC V3 and request_nmi()
Hi Marc, On 5/8/2020 4:15 PM, Marc Zyngier wrote: On Thu, 07 May 2020 17:06:19 +0100, Neeraj Upadhyay wrote: Hi, I have one query regarding pseudo NMI support on GIC v3; from what I could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs. However the request_nmi() in irq framework requires NMI to be per cpu interrupt source (it checks for IRQF_PERCPU). Can you please help understand this part, how SPIs can be configured as NMIs, if there is a per cpu interrupt source restriction? Let me answer your question by another question: what is the semantic of a NMI if you can't associate it with a particular CPU? I was actually thinking of a use case, where, we have a watchdog interrupt (which is a SPI), which is used for detecting software hangs and cause device reset; If that interrupt's current cpu affinity is on a core, where interrupts are disabled, we won't be able to serve it; so, we need to group that interrupt as an fiq; I was thinking, if its feasible to mark that interrupt as pseudo NMI and route it to EL1 as irq. However, looks like that is not the semantic of a NMI and we would need something like pseudo NMI ipi for this. We use pseudo-NMI to be able to profile (or detect lockups) within sections where normal interrupts cannot fire. If the interrupt can end-up on a random CPU (with an unrelated PMU or one that hasn't locked up), what have we achieved? Only confusion. The whole point is that NMIs have to be tied to a given CPU. For SGI/PPI, this is guaranteed by construction. For SPIs, this means that the affinity cannot be changed from userspace. IRQF_PERCPU doesn't mean much in this context as we don't "broadcast" interrupts, but is an indication to the core kernel that the same interrupt cannot be taken on another CPU. The short of it is that NMIs are only for per-CPU sources. For SPIs, that's for PMUs that use SPIs instead of PPIs. Don't use it for anything else. Thank you for the explanation! Thanks, M. Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Query regarding pseudo nmi support on GIC V3 and request_nmi()
Hi, I have one query regarding pseudo NMI support on GIC v3; from what I could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs. However the request_nmi() in irq framework requires NMI to be per cpu interrupt source (it checks for IRQF_PERCPU). Can you please help understand this part, how SPIs can be configured as NMIs, if there is a per cpu interrupt source restriction? Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] arm64: Explicitly set pstate.ssbs for el0 on kernel entry
Hi Marc, On 7/9/19 7:52 PM, Marc Zyngier wrote: On 09/07/2019 15:18, Neeraj Upadhyay wrote: Hi Marc, On 7/9/19 6:38 PM, Marc Zyngier wrote: Hi Neeraj, On 09/07/2019 12:22, Neeraj Upadhyay wrote: For cpus which do not support pstate.ssbs feature, el0 might not retain spsr.ssbs. This is problematic, if this task migrates to a cpu supporting this feature, thus relying on its state to be correct. On kernel entry, explicitly set spsr.ssbs, so that speculation is enabled for el0, when this task migrates to a cpu supporting ssbs feature. Restoring state at kernel entry ensures that el0 ssbs state is always consistent while we are in el1. As alternatives are applied by boot cpu, at the end of smp init, presence/absence of ssbs feature on boot cpu, is used for deciding, whether the capability is uniformly provided. I've seen the same issue, but went for a slightly different approach, see below. Signed-off-by: Neeraj Upadhyay --- arch/arm64/kernel/cpu_errata.c | 16 arch/arm64/kernel/entry.S | 26 +- 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index ca11ff7..c84a56d 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -336,6 +336,22 @@ void __init arm64_enable_wa2_handling(struct alt_instr *alt, *updptr = cpu_to_le32(aarch64_insn_gen_nop()); } +void __init arm64_restore_ssbs_state(struct alt_instr *alt, +__le32 *origptr, __le32 *updptr, +int nr_inst) +{ + BUG_ON(nr_inst != 1); + /* +* Only restore EL0 SSBS state on EL1 entry if cpu does not +* support the capability and capability is present for at +* least one cpu and if the SSBD state allows it to +* be changed. +*/ + if (!this_cpu_has_cap(ARM64_SSBS) && cpus_have_cap(ARM64_SSBS) && + arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE) + *updptr = cpu_to_le32(aarch64_insn_gen_nop()); +} + void arm64_set_ssbd_mitigation(bool state) { if (!IS_ENABLED(CONFIG_ARM64_SSBD)) { diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index 9cdc459..7e79305 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -143,6 +143,25 @@ alternative_cb_end #endif .endm + // This macro updates spsr. It also corrupts the condition + // codes state. + .macro restore_ssbs_state, saved_spsr, tmp +#ifdef CONFIG_ARM64_SSBD +alternative_cb arm64_restore_ssbs_state + b .L__asm_ssbs_skip\@ +alternative_cb_end + ldr \tmp, [tsk, #TSK_TI_FLAGS] + tbnz\tmp, #TIF_SSBD, .L__asm_ssbs_skip\@ + tst \saved_spsr, #PSR_MODE32_BIT// native task? + b.ne.L__asm_ssbs_compat\@ + orr \saved_spsr, \saved_spsr, #PSR_SSBS_BIT + b .L__asm_ssbs_skip\@ +.L__asm_ssbs_compat\@: + orr \saved_spsr, \saved_spsr, #PSR_AA32_SSBS_BIT +.L__asm_ssbs_skip\@: +#endif + .endm Although this is in keeping with the rest of entry.S (perfectly unreadable ;-), I think we can do something a bit simpler, that doesn't rely on patching. Also, this doesn't seem to take the SSBD options such as ARM64_SSBD_FORCE_ENABLE into account. arm64_restore_ssbs_state has a check for ARM64_SSBD_FORCE_ENABLE, does that look wrong? No, I just focused on the rest of the asm code and missed it, apologies. + .macro kernel_entry, el, regsize = 64 .if \regsize == 32 mov w0, w0 // zero upper 32 bits of x0 @@ -182,8 +201,13 @@ alternative_cb_end str x20, [tsk, #TSK_TI_ADDR_LIMIT] /* No need to reset PSTATE.UAO, hardware's already set it to 0 for us */ .endif /* \el == 0 */ - mrs x22, elr_el1 mrs x23, spsr_el1 + + .if \el == 0 + restore_ssbs_state x23, x22 + .endif + + mrs x22, elr_el1 stp lr, x21, [sp, #S_LR] /* How about the patch below? Looks good; was just going to mention PF_KTHREAD check, but Mark R. has already given detailed information about it. Yup, well spotted. I'll respin it shortly and we can then work out whether that's really a better approach. Did you get chance to recheck it? Thanks Neeraj Thanks, M. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] arm64: Explicitly set pstate.ssbs for el0 on kernel entry
Hi Marc, On 7/9/19 6:38 PM, Marc Zyngier wrote: Hi Neeraj, On 09/07/2019 12:22, Neeraj Upadhyay wrote: For cpus which do not support pstate.ssbs feature, el0 might not retain spsr.ssbs. This is problematic, if this task migrates to a cpu supporting this feature, thus relying on its state to be correct. On kernel entry, explicitly set spsr.ssbs, so that speculation is enabled for el0, when this task migrates to a cpu supporting ssbs feature. Restoring state at kernel entry ensures that el0 ssbs state is always consistent while we are in el1. As alternatives are applied by boot cpu, at the end of smp init, presence/absence of ssbs feature on boot cpu, is used for deciding, whether the capability is uniformly provided. I've seen the same issue, but went for a slightly different approach, see below. Signed-off-by: Neeraj Upadhyay --- arch/arm64/kernel/cpu_errata.c | 16 arch/arm64/kernel/entry.S | 26 +- 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index ca11ff7..c84a56d 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -336,6 +336,22 @@ void __init arm64_enable_wa2_handling(struct alt_instr *alt, *updptr = cpu_to_le32(aarch64_insn_gen_nop()); } +void __init arm64_restore_ssbs_state(struct alt_instr *alt, +__le32 *origptr, __le32 *updptr, +int nr_inst) +{ + BUG_ON(nr_inst != 1); + /* +* Only restore EL0 SSBS state on EL1 entry if cpu does not +* support the capability and capability is present for at +* least one cpu and if the SSBD state allows it to +* be changed. +*/ + if (!this_cpu_has_cap(ARM64_SSBS) && cpus_have_cap(ARM64_SSBS) && + arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE) + *updptr = cpu_to_le32(aarch64_insn_gen_nop()); +} + void arm64_set_ssbd_mitigation(bool state) { if (!IS_ENABLED(CONFIG_ARM64_SSBD)) { diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index 9cdc459..7e79305 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -143,6 +143,25 @@ alternative_cb_end #endif .endm + // This macro updates spsr. It also corrupts the condition + // codes state. + .macro restore_ssbs_state, saved_spsr, tmp +#ifdef CONFIG_ARM64_SSBD +alternative_cb arm64_restore_ssbs_state + b .L__asm_ssbs_skip\@ +alternative_cb_end + ldr \tmp, [tsk, #TSK_TI_FLAGS] + tbnz\tmp, #TIF_SSBD, .L__asm_ssbs_skip\@ + tst \saved_spsr, #PSR_MODE32_BIT// native task? + b.ne.L__asm_ssbs_compat\@ + orr \saved_spsr, \saved_spsr, #PSR_SSBS_BIT + b .L__asm_ssbs_skip\@ +.L__asm_ssbs_compat\@: + orr \saved_spsr, \saved_spsr, #PSR_AA32_SSBS_BIT +.L__asm_ssbs_skip\@: +#endif + .endm Although this is in keeping with the rest of entry.S (perfectly unreadable ;-), I think we can do something a bit simpler, that doesn't rely on patching. Also, this doesn't seem to take the SSBD options such as ARM64_SSBD_FORCE_ENABLE into account. arm64_restore_ssbs_state has a check for ARM64_SSBD_FORCE_ENABLE, does that look wrong? + .macro kernel_entry, el, regsize = 64 .if \regsize == 32 mov w0, w0 // zero upper 32 bits of x0 @@ -182,8 +201,13 @@ alternative_cb_end str x20, [tsk, #TSK_TI_ADDR_LIMIT] /* No need to reset PSTATE.UAO, hardware's already set it to 0 for us */ .endif /* \el == 0 */ - mrs x22, elr_el1 mrs x23, spsr_el1 + + .if \el == 0 + restore_ssbs_state x23, x22 + .endif + + mrs x22, elr_el1 stp lr, x21, [sp, #S_LR] /* How about the patch below? Looks good; was just going to mention PF_KTHREAD check, but Mark R. has already given detailed information about it. Thanks Neeraj Thanks, M. From 7d4314d1ef3122d8bf56a7ef239c8c68e0c81277 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 4 Jun 2019 17:35:18 +0100 Subject: [PATCH] arm64: Force SSBS on context switch On a CPU that doesn't support SSBS, PSTATE[12] is RES0. In a system where only some of the CPUs implement SSBS, we end-up losing track of the SSBS bit across task migration. To address this issue, let's force the SSBS bit on context switch. Signed-off-by: Marc Zyngier --- arch/arm64/include/asm/processor.h | 14 -- arch/arm64/kernel/process.c| 14 ++ 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index fd5b1a4efc70..844e2964b0f5 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -193,6 +193,16
[PATCH] arm64: Explicitly set pstate.ssbs for el0 on kernel entry
For cpus which do not support pstate.ssbs feature, el0 might not retain spsr.ssbs. This is problematic, if this task migrates to a cpu supporting this feature, thus relying on its state to be correct. On kernel entry, explicitly set spsr.ssbs, so that speculation is enabled for el0, when this task migrates to a cpu supporting ssbs feature. Restoring state at kernel entry ensures that el0 ssbs state is always consistent while we are in el1. As alternatives are applied by boot cpu, at the end of smp init, presence/absence of ssbs feature on boot cpu, is used for deciding, whether the capability is uniformly provided. Signed-off-by: Neeraj Upadhyay --- arch/arm64/kernel/cpu_errata.c | 16 arch/arm64/kernel/entry.S | 26 +- 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index ca11ff7..c84a56d 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -336,6 +336,22 @@ void __init arm64_enable_wa2_handling(struct alt_instr *alt, *updptr = cpu_to_le32(aarch64_insn_gen_nop()); } +void __init arm64_restore_ssbs_state(struct alt_instr *alt, +__le32 *origptr, __le32 *updptr, +int nr_inst) +{ + BUG_ON(nr_inst != 1); + /* +* Only restore EL0 SSBS state on EL1 entry if cpu does not +* support the capability and capability is present for at +* least one cpu and if the SSBD state allows it to +* be changed. +*/ + if (!this_cpu_has_cap(ARM64_SSBS) && cpus_have_cap(ARM64_SSBS) && + arm64_get_ssbd_state() != ARM64_SSBD_FORCE_ENABLE) + *updptr = cpu_to_le32(aarch64_insn_gen_nop()); +} + void arm64_set_ssbd_mitigation(bool state) { if (!IS_ENABLED(CONFIG_ARM64_SSBD)) { diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index 9cdc459..7e79305 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -143,6 +143,25 @@ alternative_cb_end #endif .endm + // This macro updates spsr. It also corrupts the condition + // codes state. + .macro restore_ssbs_state, saved_spsr, tmp +#ifdef CONFIG_ARM64_SSBD +alternative_cb arm64_restore_ssbs_state + b .L__asm_ssbs_skip\@ +alternative_cb_end + ldr \tmp, [tsk, #TSK_TI_FLAGS] + tbnz\tmp, #TIF_SSBD, .L__asm_ssbs_skip\@ + tst \saved_spsr, #PSR_MODE32_BIT// native task? + b.ne.L__asm_ssbs_compat\@ + orr \saved_spsr, \saved_spsr, #PSR_SSBS_BIT + b .L__asm_ssbs_skip\@ +.L__asm_ssbs_compat\@: + orr \saved_spsr, \saved_spsr, #PSR_AA32_SSBS_BIT +.L__asm_ssbs_skip\@: +#endif + .endm + .macro kernel_entry, el, regsize = 64 .if \regsize == 32 mov w0, w0 // zero upper 32 bits of x0 @@ -182,8 +201,13 @@ alternative_cb_end str x20, [tsk, #TSK_TI_ADDR_LIMIT] /* No need to reset PSTATE.UAO, hardware's already set it to 0 for us */ .endif /* \el == 0 */ - mrs x22, elr_el1 mrs x23, spsr_el1 + + .if \el == 0 + restore_ssbs_state x23, x22 + .endif + + mrs x22, elr_el1 stp lr, x21, [sp, #S_LR] /* -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH v3] pinctrl: qcom: Add irq_enable callback for msm gpio
From: Srinivas Ramana Introduce the irq_enable callback which will be same as irq_unmask except that it will also clear the status bit before unmask. This will help in clearing any erroneous interrupts that would have got latched when the interrupt is not in use. There may be devices like UART which can use the same gpio line for data rx as well as a wakeup gpio when in suspend. The data that was flowing on the line may latch the interrupt and when we enable the interrupt before going to suspend, this would trigger the unexpected interrupt. This change helps clearing the interrupt so that these unexpected interrupts gets cleared. Signed-off-by: Srinivas Ramana Signed-off-by: Neeraj Upadhyay --- Changes since v2: - Renamed function to msm_gpio_irq_clear_unmask() drivers/pinctrl/qcom/pinctrl-msm.c | 25 - 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/drivers/pinctrl/qcom/pinctrl-msm.c b/drivers/pinctrl/qcom/pinctrl-msm.c index 6e319bc..0223755 100644 --- a/drivers/pinctrl/qcom/pinctrl-msm.c +++ b/drivers/pinctrl/qcom/pinctrl-msm.c @@ -729,7 +729,7 @@ static void msm_gpio_irq_mask(struct irq_data *d) raw_spin_unlock_irqrestore(>lock, flags); } -static void msm_gpio_irq_unmask(struct irq_data *d) +static void msm_gpio_irq_clear_unmask(struct irq_data *d, bool status_clear) { struct gpio_chip *gc = irq_data_get_irq_chip_data(d); struct msm_pinctrl *pctrl = gpiochip_get_data(gc); @@ -741,6 +741,17 @@ static void msm_gpio_irq_unmask(struct irq_data *d) raw_spin_lock_irqsave(>lock, flags); + if (status_clear) { + /* +* clear the interrupt status bit before unmask to avoid +* any erroneous interrupts that would have got latched +* when the interrupt is not in use. +*/ + val = msm_readl_intr_status(pctrl, g); + val &= ~BIT(g->intr_status_bit); + msm_writel_intr_status(val, pctrl, g); + } + val = msm_readl_intr_cfg(pctrl, g); val |= BIT(g->intr_raw_status_bit); val |= BIT(g->intr_enable_bit); @@ -751,6 +762,17 @@ static void msm_gpio_irq_unmask(struct irq_data *d) raw_spin_unlock_irqrestore(>lock, flags); } +static void msm_gpio_irq_enable(struct irq_data *d) +{ + + msm_gpio_irq_clear_unmask(d, true); +} + +static void msm_gpio_irq_unmask(struct irq_data *d) +{ + msm_gpio_irq_clear_unmask(d, false); +} + static void msm_gpio_irq_ack(struct irq_data *d) { struct gpio_chip *gc = irq_data_get_irq_chip_data(d); @@ -978,6 +1000,7 @@ static int msm_gpio_init(struct msm_pinctrl *pctrl) chip->need_valid_mask = msm_gpio_needs_valid_mask(pctrl); pctrl->irq_chip.name = "msmgpio"; + pctrl->irq_chip.irq_enable = msm_gpio_irq_enable; pctrl->irq_chip.irq_mask = msm_gpio_irq_mask; pctrl->irq_chip.irq_unmask = msm_gpio_irq_unmask; pctrl->irq_chip.irq_ack = msm_gpio_irq_ack; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH v2] pinctrl: qcom: Add irq_enable callback for msm gpio
On 6/25/19 2:28 PM, Linus Walleij wrote: On Mon, Jun 17, 2019 at 11:35 AM Neeraj Upadhyay wrote: From: Srinivas Ramana Introduce the irq_enable callback which will be same as irq_unmask except that it will also clear the status bit before unmask. This will help in clearing any erroneous interrupts that would have got latched when the interrupt is not in use. There may be devices like UART which can use the same gpio line for data rx as well as a wakeup gpio when in suspend. The data that was flowing on the line may latch the interrupt and when we enable the interrupt before going to suspend, this would trigger the unexpected interrupt. This change helps clearing the interrupt so that these unexpected interrupts gets cleared. Signed-off-by: Srinivas Ramana Signed-off-by: Neeraj Upadhyay Overall this looks good to me, waiting for Bjorn's review. Changes since v1: - Extracted common code into __msm_gpio_irq_unmask(). Please don't name functions __like __that. -static void msm_gpio_irq_unmask(struct irq_data *d) +static void __msm_gpio_irq_unmask(struct irq_data *d, bool status_clear) Instead of __unclear __underscore __semantic use something really descriptive like static void msm_gpio_irq_clear_irq() That is what it does, right? Is below ok? as it clears (if status_clear set) and then unmasks irq static void msm_gpio_irq_clear_unmask() Other than that it looks fine. Yours, Linus Walleij -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: Fwd: Re: [PATCH] pinctrl: qcom: Clear status bit on irq_unmask
Thanks for the review, Linus. On 6/17/19 5:20 PM, Linus Walleij wrote: On Mon, Jun 17, 2019 at 12:35 PM Neeraj Upadhyay wrote: Hi Stephen, there is one use case with is not covered by commit b55326dc969e ( "pinctrl: msm: Really mask level interrupts to prevent latching"). That happens when gpio line is toggled between i/o mode and interrupt mode : 1. GPIO is configured as irq line. Peripheral raises interrupt. 2. IRQ handler runs and disables the irq line (through wq work). 3. GPIO is configured for input and and data is received from the peripheral. There is no distinction between using a GPIO line as input and using it for IRQ. All input GPIOs can be used for IRQs, if the hardware supports it (has an irqchip). Ok 4. Now, when GPIO is re-enabled as irq, we see spurious irq, and there isn't any data received on the gpio line, when it is read back after configuring as input. That's an interesting usecase. Hans Verkuil reworked the GPIO irq support very elegantly exactly to support this type of usecase (irq switch on and off dynamically), where he was even switching the line into output mode between the IRQ trains. (one-wire transcactions for CEC). Patch https://lkml.org/lkml/2019/6/17/226 tries to cover this use case. Can you please provide your comments? What this patch does is clear all pending IRQs at irq unmask. This is usually safe, unless there may be cases where you *want* to catch any pending IRQs. I guess normally you don't so it should be safe? We want to clear pending status at irq enable. Afaik, no, we don't normally track these pending irqs. So, should be fine here. The corner case is when you start some transaction or whatever that gets ACKed by an IRQ and you actually get the IRQ back before you had time to execute the code enabling the IRQ. That would be racy and bad code, as you should clearly enable the IRQ first, then start the transaction. So I think this patch is safe. But let's see what Bjorn says. Yours, Linus Walleij -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: Fwd: Re: [PATCH] pinctrl: qcom: Clear status bit on irq_unmask
Quoting tengf...@codeaurora.org (2019-06-11 03:41:26) On 2019-06-10 22:51, Stephen Boyd wrote: > Quoting Linus Walleij (2019-06-07 14:08:10) >> On Fri, May 31, 2019 at 8:52 AM Tengfei Fan >> wrote: >> >> > The gpio interrupt status bit is getting set after the >> > irq is disabled and causing an immediate interrupt after >> > enablling the irq, so clear status bit on irq_unmask. >> > >> > Signed-off-by: Tengfei Fan >> >> This looks pretty serious, can one of the Qcom maintainers ACK >> this? >> >> Should it be sent to fixes and even stable? >> >> Fixes: tag? >> > > How is the interrupt status bit getting set after the irq is disabled? > It looks like this is a level type interrupt? I thought that after > commit b55326dc969e ("pinctrl: msm: Really mask level interrupts to > prevent latching") this wouldn't be a problem. Am I wrong, or is qcom > just clearing out patches on drivers and this is the last one that > needs > to be upstreamed? Your patch(commit b55326dc969e) can cover our issue, and my patch is no longer needed. Your patch isn't included in our code, so I submitted this patch. Alright cool. Sounds like this patch can be dropped then and you can pick up the patch from upstream into your vendor kernel. Hi Stephen, there is one use case with is not covered by commit b55326dc969e ( "pinctrl: msm: Really mask level interrupts to prevent latching"). That happens when gpio line is toggled between i/o mode and interrupt mode : 1. GPIO is configured as irq line. Peripheral raises interrupt. 2. IRQ handler runs and disables the irq line (through wq work). 3. GPIO is configured for input and and data is received from the peripheral. 4. Now, when GPIO is re-enabled as irq, we see spurious irq, and there isn't any data received on the gpio line, when it is read back after configuring as input. This can happen for both edge and level interrupts. Patch https://lkml.org/lkml/2019/6/17/226 tries to cover this use case. Can you please provide your comments? Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH v2] pinctrl: qcom: Add irq_enable callback for msm gpio
From: Srinivas Ramana Introduce the irq_enable callback which will be same as irq_unmask except that it will also clear the status bit before unmask. This will help in clearing any erroneous interrupts that would have got latched when the interrupt is not in use. There may be devices like UART which can use the same gpio line for data rx as well as a wakeup gpio when in suspend. The data that was flowing on the line may latch the interrupt and when we enable the interrupt before going to suspend, this would trigger the unexpected interrupt. This change helps clearing the interrupt so that these unexpected interrupts gets cleared. Signed-off-by: Srinivas Ramana Signed-off-by: Neeraj Upadhyay --- Changes since v1: - Extracted common code into __msm_gpio_irq_unmask(). drivers/pinctrl/qcom/pinctrl-msm.c | 25 - 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/drivers/pinctrl/qcom/pinctrl-msm.c b/drivers/pinctrl/qcom/pinctrl-msm.c index 6e319bc..2a127f0 100644 --- a/drivers/pinctrl/qcom/pinctrl-msm.c +++ b/drivers/pinctrl/qcom/pinctrl-msm.c @@ -729,7 +729,7 @@ static void msm_gpio_irq_mask(struct irq_data *d) raw_spin_unlock_irqrestore(>lock, flags); } -static void msm_gpio_irq_unmask(struct irq_data *d) +static void __msm_gpio_irq_unmask(struct irq_data *d, bool status_clear) { struct gpio_chip *gc = irq_data_get_irq_chip_data(d); struct msm_pinctrl *pctrl = gpiochip_get_data(gc); @@ -741,6 +741,17 @@ static void msm_gpio_irq_unmask(struct irq_data *d) raw_spin_lock_irqsave(>lock, flags); + if (status_clear) { + /* +* clear the interrupt status bit before unmask to avoid +* any erroneous interrupts that would have got latched +* when the interrupt is not in use. +*/ + val = msm_readl_intr_status(pctrl, g); + val &= ~BIT(g->intr_status_bit); + msm_writel_intr_status(val, pctrl, g); + } + val = msm_readl_intr_cfg(pctrl, g); val |= BIT(g->intr_raw_status_bit); val |= BIT(g->intr_enable_bit); @@ -751,6 +762,17 @@ static void msm_gpio_irq_unmask(struct irq_data *d) raw_spin_unlock_irqrestore(>lock, flags); } +static void msm_gpio_irq_enable(struct irq_data *d) +{ + + __msm_gpio_irq_unmask(d, true); +} + +static void msm_gpio_irq_unmask(struct irq_data *d) +{ + __msm_gpio_irq_unmask(d, false); +} + static void msm_gpio_irq_ack(struct irq_data *d) { struct gpio_chip *gc = irq_data_get_irq_chip_data(d); @@ -978,6 +1000,7 @@ static int msm_gpio_init(struct msm_pinctrl *pctrl) chip->need_valid_mask = msm_gpio_needs_valid_mask(pctrl); pctrl->irq_chip.name = "msmgpio"; + pctrl->irq_chip.irq_enable = msm_gpio_irq_enable; pctrl->irq_chip.irq_mask = msm_gpio_irq_mask; pctrl->irq_chip.irq_unmask = msm_gpio_irq_unmask; pctrl->irq_chip.irq_ack = msm_gpio_irq_ack; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu: tree_stall: Correctly unlock root node in rcu_check_gp_start_stall
On 3/30/19 2:57 AM, Paul E. McKenney wrote: On Fri, Mar 29, 2019 at 07:52:15PM +0530, Neeraj Upadhyay wrote: On 3/29/19 6:58 PM, Mukesh Ojha wrote: On 3/29/2019 4:57 PM, Neeraj Upadhyay wrote: Only unlock the root node, if current node (rnp) is not root node. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree_stall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index f65a73a..0651833 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h why this is showing as under tree_stall.h while it is under "kernel/rcu/tree.c" It's moved in https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev=10462d6f58fb6dbde7563e9343505d98d5bfba3d Please see linux-rcu dev tree for other changes, which moves code to this file. Thanks Neeraj @@ -630,7 +630,9 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, time_before(j, rcu_state.gp_req_activity + gpssdelay) || time_before(j, rcu_state.gp_activity + gpssdelay) || atomic_xchg(, 1)) { - raw_spin_unlock_rcu_node(rnp_root); /* irqs remain disabled. */ + if (rnp_root != rnp) + /* irqs remain disabled. */ + raw_spin_unlock_rcu_node(rnp_root); Looks good as it will balance the lock .if it is the root_node, which was not there earlier, and unlock was happening without any lock on root. Reviewed-by: Mukesh Ojha Applied, again thank you both! In both cases, I updated the commit log, so please check to make sure that I didn't mess anything up. Thanx, Paul Thanks Paul. One minor comment on https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev=ec6530e763046b6bb1f4c2c2aed49ebc68aae2a0 "it clearly does not make sense to release both rnp->lock and rnp->lock" should be rnp->lock and rnp_root->lock Thanks Neeraj Cheers, -Mukesh raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu: tree_stall: Correctly unlock root node in rcu_check_gp_start_stall
On 3/29/19 6:58 PM, Mukesh Ojha wrote: On 3/29/2019 4:57 PM, Neeraj Upadhyay wrote: Only unlock the root node, if current node (rnp) is not root node. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree_stall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index f65a73a..0651833 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h why this is showing as under tree_stall.h while it is under "kernel/rcu/tree.c" It's moved in https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=dev=10462d6f58fb6dbde7563e9343505d98d5bfba3d Please see linux-rcu dev tree for other changes, which moves code to this file. Thanks Neeraj @@ -630,7 +630,9 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, time_before(j, rcu_state.gp_req_activity + gpssdelay) || time_before(j, rcu_state.gp_activity + gpssdelay) || atomic_xchg(, 1)) { - raw_spin_unlock_rcu_node(rnp_root); /* irqs remain disabled. */ + if (rnp_root != rnp) + /* irqs remain disabled. */ + raw_spin_unlock_rcu_node(rnp_root); Looks good as it will balance the lock .if it is the root_node, which was not there earlier, and unlock was happening without any lock on root. Reviewed-by: Mukesh Ojha Cheers, -Mukesh raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu: tree_stall: Correctly unlock root node in rcu_check_gp_start_stall
Only unlock the root node, if current node (rnp) is not root node. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree_stall.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index f65a73a..0651833 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -630,7 +630,9 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, time_before(j, rcu_state.gp_req_activity + gpssdelay) || time_before(j, rcu_state.gp_activity + gpssdelay) || atomic_xchg(, 1)) { - raw_spin_unlock_rcu_node(rnp_root); /* irqs remain disabled. */ + if (rnp_root != rnp) + /* irqs remain disabled. */ + raw_spin_unlock_rcu_node(rnp_root); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return; } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu: tree_plugin: Dump specified number of blocked tasks
dump_blkd_tasks() uses 10 as the max number of blocked tasks, which are printed. However, it has an argument which provides that number. So, use the argument value instead. As all callers currently pass 10 as the number, there isn't any impact. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 3960294..08bcd87 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -740,7 +740,7 @@ void exit_rcu(void) i = 0; list_for_each(lhp, >blkd_tasks) { pr_cont(" %p", lhp); - if (++i >= 10) + if (++i >= ncheck) break; } pr_cont("\n"); -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu/tree: Fix self wakeups for grace period kthread
On 3/12/19 7:20 AM, Steven Rostedt wrote: On Fri, 8 Mar 2019 15:16:18 +0530 Neeraj Upadhyay wrote: Update the code to match the comment that self wakeup of grace period kthread is allowed from interrupt handler, and softirq handler, running in the grace period kthread's context. Present code allows self wakeups from all interrupt contexts - nmi, softirq and hardirq contexts. That's not actually the issue. But it appears that we return if we simply have BH disabled, which I don't think we want, and we don't care about NMI as NMI should never call this code. I think your patch is correct, but the change log is not. -- Steve Hi Steve, sorry, I don't understand fully, why we want to not return in BH disabled case. From the commit logs and lkml discussion, there is a case where GP kthread is interrupted in the wait event path and rcu_gp_kthread_wake() is called in softirq handler (I am not sure about interrupt handler case; how rcu_gp_kthread_wake() is called from that path). https://github.com/torvalds/linux/commit/1d1f898df6586c5ea9aeaf349f13089c6fa37903 Thanks Neeraj Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index acd6ccf..57cac6d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1585,7 +1585,7 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) static void rcu_gp_kthread_wake(void) { if ((current == rcu_state.gp_kthread && -!in_interrupt() && !in_serving_softirq()) || +!in_irq() && !in_serving_softirq()) || !READ_ONCE(rcu_state.gp_flags) || !rcu_state.gp_kthread) return; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] rcu/tree_plugin: Fix nohz status in stall warning
On 3/9/19 2:48 AM, Paul E. McKenney wrote: On Fri, Mar 08, 2019 at 11:51:49PM +0530, Neeraj Upadhyay wrote: Fix stall warning, to show correct nohz marker. Signed-off-by: Neeraj Upadhyay Good eyes, thank you! I applied and pushed all three with modified commit logs. Please check to make sure that I didn't mess anything up. If testing and reviews are OK, I expect to submit them for the v5.2 merge window. I had to hand-apply this one, so in future patches could you please generate them against -rcu? It may be found here: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git Thanx, Paul Thanks Paul, I reviewed the patches on -rcu; they look good. Thanks for applying them! I will move my codebase to -rcu, for all future work. Thanks Neeraj --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 97dba50..93da32c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1659,7 +1659,7 @@ static void print_cpu_stall_fast_no_hz(char *cp, int cpu) rdp->last_accelerate & 0x, jiffies & 0x, ".l"[rdp->all_lazy], ".L"[!rcu_segcblist_n_nonlazy_cbs(>cblist)], - ".D"[!rdp->tick_nohz_enabled_snap]); + ".D"[!!rdp->tick_nohz_enabled_snap]); } #else /* #ifdef CONFIG_RCU_FAST_NO_HZ */ -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu/rcutorture: Fix expected forward progress duration in oom notifier
Fix a possible miscalculation in rcutorture_oom_notify(), for the expected forward progress duration of current forward progress test. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f14d1b1..5d09f79 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1848,7 +1848,7 @@ static int rcutorture_oom_notify(struct notifier_block *self, WARN(1, "%s invoked upon OOM during forward-progress testing.\n", __func__); rcu_torture_fwd_cb_hist(); - rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat) / 2)); + rcu_fwd_progress_check(1 + (jiffies - READ_ONCE(rcu_fwd_startat)) / 2); WRITE_ONCE(rcu_fwd_emergency_stop, true); smp_mb(); /* Emergency stop before free and wait to avoid hangs. */ pr_info("%s: Freed %lu RCU callbacks.\n", -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu/tree_plugin: Fix nohz status in stall warning
Fix stall warning, to show correct nohz marker. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 97dba50..93da32c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1659,7 +1659,7 @@ static void print_cpu_stall_fast_no_hz(char *cp, int cpu) rdp->last_accelerate & 0x, jiffies & 0x, ".l"[rdp->all_lazy], ".L"[!rcu_segcblist_n_nonlazy_cbs(>cblist)], - ".D"[!rdp->tick_nohz_enabled_snap]); + ".D"[!!rdp->tick_nohz_enabled_snap]); } #else /* #ifdef CONFIG_RCU_FAST_NO_HZ */ -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] rcu/tree: Fix self wakeups for grace period kthread
Update the code to match the comment that self wakeup of grace period kthread is allowed from interrupt handler, and softirq handler, running in the grace period kthread's context. Present code allows self wakeups from all interrupt contexts - nmi, softirq and hardirq contexts. Signed-off-by: Neeraj Upadhyay --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index acd6ccf..57cac6d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1585,7 +1585,7 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) static void rcu_gp_kthread_wake(void) { if ((current == rcu_state.gp_kthread && -!in_interrupt() && !in_serving_softirq()) || +!in_irq() && !in_serving_softirq()) || !READ_ONCE(rcu_state.gp_flags) || !rcu_state.gp_kthread) return; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[tip:smp/urgent] cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
Commit-ID: f8b7530aa0a1def79c93101216b5b17cf408a70a Gitweb: https://git.kernel.org/tip/f8b7530aa0a1def79c93101216b5b17cf408a70a Author: Neeraj Upadhyay AuthorDate: Wed, 5 Sep 2018 11:22:07 +0530 Committer: Thomas Gleixner CommitDate: Thu, 6 Sep 2018 15:21:37 +0200 cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun() The smp_mb() in cpuhp_thread_fun() is misplaced. It needs to be after the load of st->should_run to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay Signed-off-by: Thomas Gleixner Acked-by: Peter Zijlstra (Intel) Cc: j...@joshtriplett.org Cc: pet...@infradead.org Cc: jiangshan...@gmail.com Cc: dzic...@redhat.com Cc: brendan.jack...@arm.com Cc: ma...@debian.org Cc: mo...@codeaurora.org Cc: sram...@codeaurora.org Cc: linux-arm-...@vger.kernel.org Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/1536126727-11629-1-git-send-email-neer...@codeaurora.org --- kernel/cpu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85ad62e..eb4041f78073 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -607,15 +607,15 @@ static void cpuhp_thread_fun(unsigned int cpu) bool bringup = st->bringup; enum cpuhp_state state; + if (WARN_ON_ONCE(!st->should_run)) + return; + /* * ACQUIRE for the cpuhp_should_run() load of ->should_run. Ensures * that if we see ->should_run we also see the rest of the state. */ smp_mb(); - if (WARN_ON_ONCE(!st->should_run)) - return; - cpuhp_lock_acquire(bringup); if (st->single) {
[tip:smp/urgent] cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
Commit-ID: f8b7530aa0a1def79c93101216b5b17cf408a70a Gitweb: https://git.kernel.org/tip/f8b7530aa0a1def79c93101216b5b17cf408a70a Author: Neeraj Upadhyay AuthorDate: Wed, 5 Sep 2018 11:22:07 +0530 Committer: Thomas Gleixner CommitDate: Thu, 6 Sep 2018 15:21:37 +0200 cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun() The smp_mb() in cpuhp_thread_fun() is misplaced. It needs to be after the load of st->should_run to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay Signed-off-by: Thomas Gleixner Acked-by: Peter Zijlstra (Intel) Cc: j...@joshtriplett.org Cc: pet...@infradead.org Cc: jiangshan...@gmail.com Cc: dzic...@redhat.com Cc: brendan.jack...@arm.com Cc: ma...@debian.org Cc: mo...@codeaurora.org Cc: sram...@codeaurora.org Cc: linux-arm-...@vger.kernel.org Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/1536126727-11629-1-git-send-email-neer...@codeaurora.org --- kernel/cpu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85ad62e..eb4041f78073 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -607,15 +607,15 @@ static void cpuhp_thread_fun(unsigned int cpu) bool bringup = st->bringup; enum cpuhp_state state; + if (WARN_ON_ONCE(!st->should_run)) + return; + /* * ACQUIRE for the cpuhp_should_run() load of ->should_run. Ensures * that if we see ->should_run we also see the rest of the state. */ smp_mb(); - if (WARN_ON_ONCE(!st->should_run)) - return; - cpuhp_lock_acquire(bringup); if (st->single) {
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/06/2018 01:48 PM, Thomas Gleixner wrote: On Thu, 6 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 06:47 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 05:53 PM, Thomas Gleixner wrote: And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . And how did your hack fix that up magically? I'll have a look later today. Thanks, tglx The hack fixes it by not calling cpuhp_reset_state() and doing rollback state reset inline in _cpu_down(). And how is that any different from the proper patch I sent? It ends up in the same state. So I have a hard time to understand your blurb above where you claim that my patch just solves one of two problems? Thanks, tglx Yes, your patch solves the problem related to smpboot unparking being skipped during rollback and with the hack we end up in same state. The second thing, which I am referring to, is that there is one additional state increment. I missed the part that, it could be required, so that we reach CPUHP_AP_ONLINE_IDLE before calling __cpuhp_kick_ap(). So, it's not a problem. * cpuhp_reset_state() does one state increment and we reach CPUHP_AP_ONLINE_IDLE. if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { cpuhp_reset_state(st, prev_state); __cpuhp_kick_ap(st); } CPUHP_TEARDOWN_CPU, CPUHP_AP_ONLINE_IDLE, CPUHP_AP_SMPBOOT_THREADS, * cpuhp_thread_fun() does one more increment before invoking state callback (so, we skip CPUHP_AP_ONLINE_IDLE) and we reach CPUHP_AP_SMPBOOT_THREADS: static void cpuhp_thread_fun(unsigned int cpu) if (bringup) { st->state++; state = st->state; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/06/2018 01:48 PM, Thomas Gleixner wrote: On Thu, 6 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 06:47 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 05:53 PM, Thomas Gleixner wrote: And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . And how did your hack fix that up magically? I'll have a look later today. Thanks, tglx The hack fixes it by not calling cpuhp_reset_state() and doing rollback state reset inline in _cpu_down(). And how is that any different from the proper patch I sent? It ends up in the same state. So I have a hard time to understand your blurb above where you claim that my patch just solves one of two problems? Thanks, tglx Yes, your patch solves the problem related to smpboot unparking being skipped during rollback and with the hack we end up in same state. The second thing, which I am referring to, is that there is one additional state increment. I missed the part that, it could be required, so that we reach CPUHP_AP_ONLINE_IDLE before calling __cpuhp_kick_ap(). So, it's not a problem. * cpuhp_reset_state() does one state increment and we reach CPUHP_AP_ONLINE_IDLE. if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { cpuhp_reset_state(st, prev_state); __cpuhp_kick_ap(st); } CPUHP_TEARDOWN_CPU, CPUHP_AP_ONLINE_IDLE, CPUHP_AP_SMPBOOT_THREADS, * cpuhp_thread_fun() does one more increment before invoking state callback (so, we skip CPUHP_AP_ONLINE_IDLE) and we reach CPUHP_AP_SMPBOOT_THREADS: static void cpuhp_thread_fun(unsigned int cpu) if (bringup) { st->state++; state = st->state; -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/05/2018 06:47 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 05:53 PM, Thomas Gleixner wrote: And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . And how did your hack fix that up magically? I'll have a look later today. Thanks, tglx The hack fixes it by not calling cpuhp_reset_state() and doing rollback state reset inline in _cpu_down(). -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/05/2018 06:47 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Neeraj Upadhyay wrote: On 09/05/2018 05:53 PM, Thomas Gleixner wrote: And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . And how did your hack fix that up magically? I'll have a look later today. Thanks, tglx The hack fixes it by not calling cpuhp_reset_state() and doing rollback state reset inline in _cpu_down(). -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/05/2018 05:53 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Thomas Gleixner wrote: On Tue, 4 Sep 2018, Neeraj Upadhyay wrote: ret = cpuhp_down_callbacks(cpu, st, target); if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { - cpuhp_reset_state(st, prev_state); + /* +* As st->last is not set, cpuhp_reset_state() increments +* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being +* skipped during rollback. So, don't use it here. +*/ + st->rollback = true; + st->target = prev_state; + st->bringup = !st->bringup; No, this is just papering over the actual problem. The state inconsistency happens in take_cpu_down() when it returns with a failure from __cpu_disable() because that returns with state = TEARDOWN_CPU and st->state is then incremented in undo_cpu_down(). That's the real issue and we need to analyze the whole cpu_down rollback logic first. And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. Patch below. Thanks, tglx As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL); if (ret) { st->target = prev_state; - undo_cpu_down(cpu, st); + if (st->state < prev_state) + undo_cpu_down(cpu, st); break; } } @@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int * to do the further cleanups. */ ret = cpuhp_down_callbacks(cpu, st, target); - if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { + if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) { cpuhp_reset_state(st, prev_state); __cpuhp_kick_ap(st); } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
On 09/05/2018 05:53 PM, Thomas Gleixner wrote: On Wed, 5 Sep 2018, Thomas Gleixner wrote: On Tue, 4 Sep 2018, Neeraj Upadhyay wrote: ret = cpuhp_down_callbacks(cpu, st, target); if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { - cpuhp_reset_state(st, prev_state); + /* +* As st->last is not set, cpuhp_reset_state() increments +* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being +* skipped during rollback. So, don't use it here. +*/ + st->rollback = true; + st->target = prev_state; + st->bringup = !st->bringup; No, this is just papering over the actual problem. The state inconsistency happens in take_cpu_down() when it returns with a failure from __cpu_disable() because that returns with state = TEARDOWN_CPU and st->state is then incremented in undo_cpu_down(). That's the real issue and we need to analyze the whole cpu_down rollback logic first. And looking closer this is a general issue. Just that the TEARDOWN state makes it simple to observe. It's universaly broken, when the first teardown callback fails because, st->state is only decremented _AFTER_ the callback returns success, but undo_cpu_down() increments unconditionally. Patch below. Thanks, tglx As per my understanding, there are 2 problems here; one is fixed with your patch, and other is cpuhp_reset_state() is used during rollback from non-AP to AP state, which seem to result in 2 increments of st->state (one increment done by cpuhp_reset_state() and another by cpu_thread_fun()) . --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL); if (ret) { st->target = prev_state; - undo_cpu_down(cpu, st); + if (st->state < prev_state) + undo_cpu_down(cpu, st); break; } } @@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int * to do the further cleanups. */ ret = cpuhp_down_callbacks(cpu, st, target); - if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { + if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) { cpuhp_reset_state(st, prev_state); __cpuhp_kick_ap(st); } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] cpu/hotplug: Fix acquire of st->should_run in cpuhp_thread_fun()
The smp_mb() in cpuhp_thread_fun() appears to be misplaced, and need to be after the load of st->should_run, to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Signed-off-by: Neeraj Upadhyay --- kernel/cpu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85..eb4041f 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -607,15 +607,15 @@ static void cpuhp_thread_fun(unsigned int cpu) bool bringup = st->bringup; enum cpuhp_state state; + if (WARN_ON_ONCE(!st->should_run)) + return; + /* * ACQUIRE for the cpuhp_should_run() load of ->should_run. Ensures * that if we see ->should_run we also see the rest of the state. */ smp_mb(); - if (WARN_ON_ONCE(!st->should_run)) - return; - cpuhp_lock_acquire(bringup); if (st->single) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] cpu/hotplug: Fix acquire of st->should_run in cpuhp_thread_fun()
The smp_mb() in cpuhp_thread_fun() appears to be misplaced, and need to be after the load of st->should_run, to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Signed-off-by: Neeraj Upadhyay --- kernel/cpu.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85..eb4041f 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -607,15 +607,15 @@ static void cpuhp_thread_fun(unsigned int cpu) bool bringup = st->bringup; enum cpuhp_state state; + if (WARN_ON_ONCE(!st->should_run)) + return; + /* * ACQUIRE for the cpuhp_should_run() load of ->should_run. Ensures * that if we see ->should_run we also see the rest of the state. */ smp_mb(); - if (WARN_ON_ONCE(!st->should_run)) - return; - cpuhp_lock_acquire(bringup); if (st->single) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
If takedown_cpu() fails during _cpu_down(), st->state is reset, by calling cpuhp_reset_state(). This results in an additional increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS state being skipped during rollback. Fix this by not calling cpuhp_reset_state() and doing the state reset directly in _cpu_down(). Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay --- kernel/cpu.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85..9f49edb 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen, */ ret = cpuhp_down_callbacks(cpu, st, target); if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { - cpuhp_reset_state(st, prev_state); + /* +* As st->last is not set, cpuhp_reset_state() increments +* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being +* skipped during rollback. So, don't use it here. +*/ + st->rollback = true; + st->target = prev_state; + st->bringup = !st->bringup; __cpuhp_kick_ap(st); } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()
If takedown_cpu() fails during _cpu_down(), st->state is reset, by calling cpuhp_reset_state(). This results in an additional increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS state being skipped during rollback. Fix this by not calling cpuhp_reset_state() and doing the state reset directly in _cpu_down(). Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay --- kernel/cpu.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index aa7fe85..9f49edb 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen, */ ret = cpuhp_down_callbacks(cpu, st, target); if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) { - cpuhp_reset_state(st, prev_state); + /* +* As st->last is not set, cpuhp_reset_state() increments +* st->state, which results in CPUHP_AP_SMPBOOT_THREADS being +* skipped during rollback. So, don't use it here. +*/ + st->rollback = true; + st->target = prev_state; + st->bringup = !st->bringup; __cpuhp_kick_ap(st); } -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] workqueue: Handle race between wake up and rebind
On 01/18/2018 08:32 AM, Lai Jiangshan wrote: On Wed, Jan 17, 2018 at 4:08 AM, Neeraj Upadhyay <neer...@codeaurora.org> wrote: On 01/16/2018 11:05 PM, Tejun Heo wrote: Hello, Neeraj. On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote: - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ Do you mean that because REBOUND was set? Actually, I meant REBOUND was not set. Below is the sequence - cpu0 bounded pool is unbound. - kworker/0:0 is woken up on cpu1. - cpu0 pool is rebound REBOUND is set for kworker/0:0 Thanks for looking into the detail of workqueue... "REBOUND is set for kworker/0:0" means set_cpus_allowed_ptr(kworker/0:0) already successfull returned and kworker/0:0 is already moved to cpu0. It will not still run on cpu1 as the following steps you described. If there is something wrong with " set_cpus_allowed_ptr()" in this situation, could you please elaborate it. Thanks Lai, I missed that; will debug from that perspective. - kworker/0:0 starts running on cpu1 worker_thread() // It clears REBOUND and sets nr_running =1 after below call worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); - kworker/0:0 goes to sleep wq_worker_sleeping() // Below condition is not true, as all NOT_RUNNING // flags were cleared in worker_thread() if (worker->flags & WORKER_NOT_RUNNING) // Below is true, as worker is running on cpu1 if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) return NULL; // Below is not reached and nr_running stays 1 if (atomic_dec_and_test(>nr_running) && - kworker/0:0 wakes up again, this time on cpu0, as worker->task cpus_allowed was set to cpu0, in rebind_workers. wq_worker_waking_up() if (!(worker->flags & WORKER_NOT_RUNNING)) { // Increments pool->nr_running to 2 atomic_inc(>pool->nr_running); thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. Why is it suddenly 2? Who made it one on the account of the kworker? As shown in above comment, it became 1 in worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); Do you see this happening? Or better, is there a (semi) reliable repro for this issue? Yes, this was reported in our long run testing with random hotplug. Sorry, don't have a quick reproducer for it. Issue is reported in few days of testing. Thanks. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] workqueue: Handle race between wake up and rebind
On 01/18/2018 08:32 AM, Lai Jiangshan wrote: On Wed, Jan 17, 2018 at 4:08 AM, Neeraj Upadhyay wrote: On 01/16/2018 11:05 PM, Tejun Heo wrote: Hello, Neeraj. On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote: - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ Do you mean that because REBOUND was set? Actually, I meant REBOUND was not set. Below is the sequence - cpu0 bounded pool is unbound. - kworker/0:0 is woken up on cpu1. - cpu0 pool is rebound REBOUND is set for kworker/0:0 Thanks for looking into the detail of workqueue... "REBOUND is set for kworker/0:0" means set_cpus_allowed_ptr(kworker/0:0) already successfull returned and kworker/0:0 is already moved to cpu0. It will not still run on cpu1 as the following steps you described. If there is something wrong with " set_cpus_allowed_ptr()" in this situation, could you please elaborate it. Thanks Lai, I missed that; will debug from that perspective. - kworker/0:0 starts running on cpu1 worker_thread() // It clears REBOUND and sets nr_running =1 after below call worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); - kworker/0:0 goes to sleep wq_worker_sleeping() // Below condition is not true, as all NOT_RUNNING // flags were cleared in worker_thread() if (worker->flags & WORKER_NOT_RUNNING) // Below is true, as worker is running on cpu1 if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) return NULL; // Below is not reached and nr_running stays 1 if (atomic_dec_and_test(>nr_running) && - kworker/0:0 wakes up again, this time on cpu0, as worker->task cpus_allowed was set to cpu0, in rebind_workers. wq_worker_waking_up() if (!(worker->flags & WORKER_NOT_RUNNING)) { // Increments pool->nr_running to 2 atomic_inc(>pool->nr_running); thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. Why is it suddenly 2? Who made it one on the account of the kworker? As shown in above comment, it became 1 in worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); Do you see this happening? Or better, is there a (semi) reliable repro for this issue? Yes, this was reported in our long run testing with random hotplug. Sorry, don't have a quick reproducer for it. Issue is reported in few days of testing. Thanks. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] workqueue: Handle race between wake up and rebind
On 01/16/2018 11:05 PM, Tejun Heo wrote: Hello, Neeraj. On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote: - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ Do you mean that because REBOUND was set? Actually, I meant REBOUND was not set. Below is the sequence - cpu0 bounded pool is unbound. - kworker/0:0 is woken up on cpu1. - cpu0 pool is rebound REBOUND is set for kworker/0:0 - kworker/0:0 starts running on cpu1 worker_thread() // It clears REBOUND and sets nr_running =1 after below call worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); - kworker/0:0 goes to sleep wq_worker_sleeping() // Below condition is not true, as all NOT_RUNNING // flags were cleared in worker_thread() if (worker->flags & WORKER_NOT_RUNNING) // Below is true, as worker is running on cpu1 if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) return NULL; // Below is not reached and nr_running stays 1 if (atomic_dec_and_test(>nr_running) && - kworker/0:0 wakes up again, this time on cpu0, as worker->task cpus_allowed was set to cpu0, in rebind_workers. wq_worker_waking_up() if (!(worker->flags & WORKER_NOT_RUNNING)) { // Increments pool->nr_running to 2 atomic_inc(>pool->nr_running); thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. Why is it suddenly 2? Who made it one on the account of the kworker? As shown in above comment, it became 1 in worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); Do you see this happening? Or better, is there a (semi) reliable repro for this issue? Yes, this was reported in our long run testing with random hotplug. Sorry, don't have a quick reproducer for it. Issue is reported in few days of testing. Thanks. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] workqueue: Handle race between wake up and rebind
On 01/16/2018 11:05 PM, Tejun Heo wrote: Hello, Neeraj. On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote: - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ Do you mean that because REBOUND was set? Actually, I meant REBOUND was not set. Below is the sequence - cpu0 bounded pool is unbound. - kworker/0:0 is woken up on cpu1. - cpu0 pool is rebound REBOUND is set for kworker/0:0 - kworker/0:0 starts running on cpu1 worker_thread() // It clears REBOUND and sets nr_running =1 after below call worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); - kworker/0:0 goes to sleep wq_worker_sleeping() // Below condition is not true, as all NOT_RUNNING // flags were cleared in worker_thread() if (worker->flags & WORKER_NOT_RUNNING) // Below is true, as worker is running on cpu1 if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) return NULL; // Below is not reached and nr_running stays 1 if (atomic_dec_and_test(>nr_running) && - kworker/0:0 wakes up again, this time on cpu0, as worker->task cpus_allowed was set to cpu0, in rebind_workers. wq_worker_waking_up() if (!(worker->flags & WORKER_NOT_RUNNING)) { // Increments pool->nr_running to 2 atomic_inc(>pool->nr_running); thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. Why is it suddenly 2? Who made it one on the account of the kworker? As shown in above comment, it became 1 in worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND); Do you see this happening? Or better, is there a (semi) reliable repro for this issue? Yes, this was reported in our long run testing with random hotplug. Sorry, don't have a quick reproducer for it. Issue is reported in few days of testing. Thanks. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] workqueue: Handle race between wake up and rebind
There is a potential race b/w rebind_workers() and wakeup of a worker thread, which can result in workqueue lockup for a bounder worker pool. Below is the potential race: - cpu0 is a bounded worker pool, which is unbound from its cpu. A new work is queued on this pool, which causes its worker (kworker/0:0) to be woken up on a cpu different from cpu0, lets say cpu1. workqueue_queue_work workqueue_activate_work - cpu0 rebind happens rebind_workers() Clears POOL_DISASSOCIATED and binds cpumask of all workers. - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. - When kworker/0:0 enters idle, it decrements pool->nr_running by 1. This leaves pool->nr_running =1 , with no workers in runnable state. - Now, no new workers will be woken up, as pool->nr_running is non-zero. This results in indefinite lockup for this pool. Fix this by deferring the work to some other idle worker, if the current worker is not bound to its pool's CPU. Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org> --- kernel/workqueue.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 43d18cb..71c0023 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2218,6 +2218,17 @@ static int worker_thread(void *__worker) if (unlikely(!may_start_working(pool)) && manage_workers(worker)) goto recheck; + /* handle the case where, while a bounded pool is unbound, +* its worker is woken up on a target CPU, which is different +* from pool->cpu, but pool is rebound before this worker gets +* chance to run on the target CPU. +*/ + if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && + raw_smp_processor_id() != pool->cpu)) { + wake_up_worker(pool); + goto sleep; + } + /* * ->scheduled list can only be filled while a worker is * preparing to process a work or actually processing it. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] workqueue: Handle race between wake up and rebind
There is a potential race b/w rebind_workers() and wakeup of a worker thread, which can result in workqueue lockup for a bounder worker pool. Below is the potential race: - cpu0 is a bounded worker pool, which is unbound from its cpu. A new work is queued on this pool, which causes its worker (kworker/0:0) to be woken up on a cpu different from cpu0, lets say cpu1. workqueue_queue_work workqueue_activate_work - cpu0 rebind happens rebind_workers() Clears POOL_DISASSOCIATED and binds cpumask of all workers. - kworker/0:0 gets chance to run on cpu1; while processing a work, it goes to sleep. However, it does not decrement pool->nr_running. This is because WORKER_REBOUND (NOT_ RUNNING) flag was cleared, when worker entered worker_ thread(). Worker 0 runs on cpu1 worker_thread() process_one_work() wq_worker_sleeping() if (worker->flags & WORKER_NOT_RUNNING) return NULL; if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id())) - After this, when kworker/0:0 wakes up, this time on its bounded cpu cpu0, it increments pool->nr_running again. So, pool->nr_running becomes 2. - When kworker/0:0 enters idle, it decrements pool->nr_running by 1. This leaves pool->nr_running =1 , with no workers in runnable state. - Now, no new workers will be woken up, as pool->nr_running is non-zero. This results in indefinite lockup for this pool. Fix this by deferring the work to some other idle worker, if the current worker is not bound to its pool's CPU. Signed-off-by: Neeraj Upadhyay --- kernel/workqueue.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 43d18cb..71c0023 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2218,6 +2218,17 @@ static int worker_thread(void *__worker) if (unlikely(!may_start_working(pool)) && manage_workers(worker)) goto recheck; + /* handle the case where, while a bounded pool is unbound, +* its worker is woken up on a target CPU, which is different +* from pool->cpu, but pool is rebound before this worker gets +* chance to run on the target CPU. +*/ + if (WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) && + raw_smp_processor_id() != pool->cpu)) { + wake_up_worker(pool); + goto sleep; + } + /* * ->scheduled list can only be filled while a worker is * preparing to process a work or actually processing it. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
Re: Query regarding srcu_funnel_exp_start()
On 10/28/2017 03:50 AM, Paul E. McKenney wrote: On Fri, Oct 27, 2017 at 10:15:04PM +0530, Neeraj Upadhyay wrote: On 10/27/2017 05:56 PM, Paul E. McKenney wrote: On Fri, Oct 27, 2017 at 02:23:07PM +0530, Neeraj Upadhyay wrote: Hi, One query regarding srcu_funnel_exp_start() function in kernel/rcu/srcutree.c. static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, unsigned long s) { if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; } Why is sp->srcu_gp_seq_needed_exp set to 's' if srcu_gp_seq_needed_exp is >= 's'. Shouldn't srcu_gp_seq_needed_exp be equal to the greater of both? Let's suppose that it is incorrect as currently written. Can you construct a test case demonstrating a failure of some sort, then provide a fix? Will check this. Might take some time to build a test case. Fair enough! I suggest checking to see if kernel/rcu/rcuperf.c can do what you need for this test. (Might not with a single test, but perhaps a before-and-after comparison. Or maybe you really do need to add some test code somewhere.) Thanks for the suggestion, will try that out. To start with, if it is currently incorrect, what would be the nature of the failure? Thanx, Paul Hi Paul, I see below scenario, where new gp won't be expedited. Please correct me if I am missing something here. 1. CPU0 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // lets say srcu_gp_seq = 0; // s = 0x100 Looks like you have one hex digit and then two binary digits, but why not? (RCU_SEQ_STATE_MASK is 3 rather than 0xff > Yeah, sorry I confused myself while representing the values. 0x100 need to be replaced with b'100' and 0x200 with b'1000'. sdp->srcu_gp_seq_needed = s // 0x100 needgp = true sdp->srcu_gp_seq_needed_exp = s // 0x100 srcu_funnel_gp_start() sp->srcu_gp_seq_needed_exp = s; srcu_gp_start(sp); rcu_seq_start(>srcu_gp_seq); 2. CPU1 calls normal synchronize_srcu() synchronize_srcu() __synchronize_srcu(sp, true) __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s= 0x200 sdp->srcu_gp_seq_needed = s; // 0x200 srcu_funnel_gp_start() smp_store_release(>srcu_gp_seq_needed, s); // 0x200 3. CPU3 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s = 0x200 sdp->srcu_gp_seq_needed_exp = s // 0x200 srcu_funnel_exp_start(sp, sdp->mynode, s); // sp->srcu_gp_seq_needed_exp = 0x100 // s = 0x200 ; sp->srcu_gp_seq_needed_exp is not updated if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; Seems plausible, but you should be able to show the difference in grace-period duration with a test. Ok sure, will attempt that. While you are in srcu_funnel_exp_start(), should it be rechecking rcu_seq_done(>srcu_gp_seq, s) as well as the current ULONG_CMP_GE(snp->srcu_gp_seq_needed_exp, s) under the lock? Why or why not? Thanx, Paul Hi Paul, I don't see how it will impact. I have put markers in code snippet below to explain my points. My understanding is * rcu_seq_done check @a is a fastpath return, and avoid contention for snp lock, if the gp has already elapsed. * Checking it @b, inside srcu_node lock might not make any difference, as sp->srcu_gp_seq counter portion is updated under srcu_struct lock. Also, we cannot lock srcu_struct at this point, as it will cause lock contention among multiple CPUs. * Checking rcu_seq_done @c also does not impact, as we have already done all the work of traversing the entire parent chain and if rcu_seq_done() is true srcu_gp_seq_needed_exp will be greater than or equal to 's'. srcu_gp_end() raw_spin_lock_irq_rcu_node(sp); rcu_seq_end(>srcu_gp_seq); gpseq = rcu_seq_current(>srcu_gp_seq); if (ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, gpseq)) sp->srcu_gp_seq_needed_exp = gpseq; raw_spin_unlock_irq_rcu_node(sp); static void srcu_funnel_exp_start(...) { for (; snp != NULL; snp = snp->srcu_parent) { if (rcu_seq_done(>srcu_gp_seq, s) || /* a */ ULONG_CMP_GE(READ_ONCE(snp->srcu_gp_seq_needed_exp), s)) return;
Re: Query regarding srcu_funnel_exp_start()
On 10/28/2017 03:50 AM, Paul E. McKenney wrote: On Fri, Oct 27, 2017 at 10:15:04PM +0530, Neeraj Upadhyay wrote: On 10/27/2017 05:56 PM, Paul E. McKenney wrote: On Fri, Oct 27, 2017 at 02:23:07PM +0530, Neeraj Upadhyay wrote: Hi, One query regarding srcu_funnel_exp_start() function in kernel/rcu/srcutree.c. static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, unsigned long s) { if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; } Why is sp->srcu_gp_seq_needed_exp set to 's' if srcu_gp_seq_needed_exp is >= 's'. Shouldn't srcu_gp_seq_needed_exp be equal to the greater of both? Let's suppose that it is incorrect as currently written. Can you construct a test case demonstrating a failure of some sort, then provide a fix? Will check this. Might take some time to build a test case. Fair enough! I suggest checking to see if kernel/rcu/rcuperf.c can do what you need for this test. (Might not with a single test, but perhaps a before-and-after comparison. Or maybe you really do need to add some test code somewhere.) Thanks for the suggestion, will try that out. To start with, if it is currently incorrect, what would be the nature of the failure? Thanx, Paul Hi Paul, I see below scenario, where new gp won't be expedited. Please correct me if I am missing something here. 1. CPU0 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // lets say srcu_gp_seq = 0; // s = 0x100 Looks like you have one hex digit and then two binary digits, but why not? (RCU_SEQ_STATE_MASK is 3 rather than 0xff > Yeah, sorry I confused myself while representing the values. 0x100 need to be replaced with b'100' and 0x200 with b'1000'. sdp->srcu_gp_seq_needed = s // 0x100 needgp = true sdp->srcu_gp_seq_needed_exp = s // 0x100 srcu_funnel_gp_start() sp->srcu_gp_seq_needed_exp = s; srcu_gp_start(sp); rcu_seq_start(>srcu_gp_seq); 2. CPU1 calls normal synchronize_srcu() synchronize_srcu() __synchronize_srcu(sp, true) __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s= 0x200 sdp->srcu_gp_seq_needed = s; // 0x200 srcu_funnel_gp_start() smp_store_release(>srcu_gp_seq_needed, s); // 0x200 3. CPU3 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s = 0x200 sdp->srcu_gp_seq_needed_exp = s // 0x200 srcu_funnel_exp_start(sp, sdp->mynode, s); // sp->srcu_gp_seq_needed_exp = 0x100 // s = 0x200 ; sp->srcu_gp_seq_needed_exp is not updated if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; Seems plausible, but you should be able to show the difference in grace-period duration with a test. Ok sure, will attempt that. While you are in srcu_funnel_exp_start(), should it be rechecking rcu_seq_done(>srcu_gp_seq, s) as well as the current ULONG_CMP_GE(snp->srcu_gp_seq_needed_exp, s) under the lock? Why or why not? Thanx, Paul Hi Paul, I don't see how it will impact. I have put markers in code snippet below to explain my points. My understanding is * rcu_seq_done check @a is a fastpath return, and avoid contention for snp lock, if the gp has already elapsed. * Checking it @b, inside srcu_node lock might not make any difference, as sp->srcu_gp_seq counter portion is updated under srcu_struct lock. Also, we cannot lock srcu_struct at this point, as it will cause lock contention among multiple CPUs. * Checking rcu_seq_done @c also does not impact, as we have already done all the work of traversing the entire parent chain and if rcu_seq_done() is true srcu_gp_seq_needed_exp will be greater than or equal to 's'. srcu_gp_end() raw_spin_lock_irq_rcu_node(sp); rcu_seq_end(>srcu_gp_seq); gpseq = rcu_seq_current(>srcu_gp_seq); if (ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, gpseq)) sp->srcu_gp_seq_needed_exp = gpseq; raw_spin_unlock_irq_rcu_node(sp); static void srcu_funnel_exp_start(...) { for (; snp != NULL; snp = snp->srcu_parent) { if (rcu_seq_done(>srcu_gp_seq, s) || /* a */ ULONG_CMP_GE(READ_ONCE(snp->srcu_gp_seq_needed_exp), s)) return;
Re: Query regarding srcu_funnel_exp_start()
On 10/27/2017 05:56 PM, Paul E. McKenney wrote: On Fri, Oct 27, 2017 at 02:23:07PM +0530, Neeraj Upadhyay wrote: Hi, One query regarding srcu_funnel_exp_start() function in kernel/rcu/srcutree.c. static void srcu_funnel_exp_start(struct srcu_struct *sp, struct srcu_node *snp, unsigned long s) { if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; } Why is sp->srcu_gp_seq_needed_exp set to 's' if srcu_gp_seq_needed_exp is >= 's'. Shouldn't srcu_gp_seq_needed_exp be equal to the greater of both? Let's suppose that it is incorrect as currently written. Can you construct a test case demonstrating a failure of some sort, then provide a fix? Will check this. Might take some time to build a test case. To start with, if it is currently incorrect, what would be the nature of the failure? Thanx, Paul Hi Paul, I see below scenario, where new gp won't be expedited. Please correct me if I am missing something here. 1. CPU0 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // lets say srcu_gp_seq = 0; // s = 0x100 sdp->srcu_gp_seq_needed = s // 0x100 needgp = true sdp->srcu_gp_seq_needed_exp = s // 0x100 srcu_funnel_gp_start() sp->srcu_gp_seq_needed_exp = s; srcu_gp_start(sp); rcu_seq_start(>srcu_gp_seq); 2. CPU1 calls normal synchronize_srcu() synchronize_srcu() __synchronize_srcu(sp, true) __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s= 0x200 sdp->srcu_gp_seq_needed = s; // 0x200 srcu_funnel_gp_start() smp_store_release(>srcu_gp_seq_needed, s); // 0x200 3. CPU3 calls synchronize_srcu_expedited() synchronize_srcu_expedited() __synchronize_srcu() __call_srcu() s = rcu_seq_snap(>srcu_gp_seq); // srcu_gp_seq = 1 // s = 0x200 sdp->srcu_gp_seq_needed_exp = s // 0x200 srcu_funnel_exp_start(sp, sdp->mynode, s); // sp->srcu_gp_seq_needed_exp = 0x100 // s = 0x200 ; sp->srcu_gp_seq_needed_exp is not updated if (!ULONG_CMP_LT(sp->srcu_gp_seq_needed_exp, s)) sp->srcu_gp_seq_needed_exp = s; Thanks Neeraj -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation