Re: [PATCH v2 4/4] x86/paravirt: remove no longer needed paravirt patching code

2023-10-16 Thread Peter Zijlstra
On Mon, Oct 16, 2023 at 02:39:33PM +0200, Juergen Gross wrote:
> Now that paravirt is using the alternatives patching infrastructure,
> remove the paravirt patching code.
> 
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   | 18 
>  arch/x86/include/asm/paravirt_types.h | 40 
>  arch/x86/include/asm/text-patching.h  | 12 -
>  arch/x86/kernel/alternative.c | 66 +--
>  arch/x86/kernel/paravirt.c| 30 
>  arch/x86/kernel/vmlinux.lds.S | 13 --
>  arch/x86/tools/relocs.c   |  2 +-
>  7 files changed, 3 insertions(+), 178 deletions(-)

More - more better! :-)

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-10-16 Thread Peter Zijlstra
On Mon, Oct 16, 2023 at 02:39:32PM +0200, Juergen Gross wrote:
> Instead of stacking alternative and paravirt patching, use the new
> ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
> handling.
> 
> This eliminates the need to be careful regarding the sequence of
> alternative and paravirt patching.
> 
> For call depth tracking callthunks_setup() needs to be adapted to patch
> calls at alternative patching sites instead of paravirt calls.
> 
> Signed-off-by: Juergen Gross 

I cannot help but feel this would've been better as two patches, one
introducing ALT_NOT_XEN and then a second with the rest.

Regardless,

Acked-by: Peter Zijlstra (Intel) 


Re: [RFC PATCH 3/3] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-09-20 Thread Peter Zijlstra
On Thu, Jun 08, 2023 at 04:03:33PM +0200, Juergen Gross wrote:
> Instead of stacking alternative and paravirt patching, use the new
> ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
> handling.
> 
> This eliminates the need to be careful regarding the sequence of
> alternative and paravirt patching.
> 
> For call depth tracking callthunks_setup() needs to be adapted to patch
> calls at alternative patching sites instead of paravirt calls.
> 
> Remove the no longer needed paravirt patching and related code.

I think this becomes easier if you first convert the paravirt sites to
alternatives, such that .parainstructions is empty, and then in a
subsequent patch remove all the paravirt infrastructure that is unused.


> +#define SAVE_FLAGS   ALTERNATIVE_2 "PARA_IRQ_save_fl;", ALT_CALL_INSTR, \
> +   ALT_CALL_ALWAYS, "pushf; pop %rax;", \
> +   ALT_NOT(X86_FEATURE_XENPV)

I find this more readable when written as:

#define SAVE_FLAGS  ALTERNATIVE_2 "PARA_IRQ_save_fl;",  \
                      ALT_CALL_INSTR, ALT_CALL_ALWAYS,  \
                      "pushf; pop %rax;", ALT_NOT(X86_FEATURE_XENPV)

(and perhaps ALT_NOT_XEN is in order, there's a ton of those)
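
Something like the obvious one-liner, I suppose (name as above, exact home for it TBD):

#define ALT_NOT_XEN	ALT_NOT(X86_FEATURE_XENPV)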

If you base this on top of the nested alternative patches, another
helper might be:

#define __PV_ALTERNATIVE(old) __ALTERNATIVE(old, ALT_CALL_INSTR, ALT_CALL_ALWAYS)

So that you can then write:

#define SAVE_FLAGS  __ALTERNATIVE(__PV_ALTERNATIVE("PARA_IRQ_save_fl;"),
  "pushf; pop %rax;", ALT_NOT_XEN)

But perhaps I'm over-cooking things now..



Re: [RFC PATCH 1/3] x86/paravirt: move some functions and defines to alternative

2023-09-20 Thread Peter Zijlstra
On Thu, Jun 08, 2023 at 04:03:31PM +0200, Juergen Gross wrote:
> As a preparation for replacing paravirt patching completely by
> alternative patching, move some backend functions and #defines to
> alternative code and header.
> 
> Signed-off-by: Juergen Gross 

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-24 Thread Peter Zijlstra
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:

> +void shrinker_unregister(struct shrinker *shrinker)
> +{
> + struct dentry *debugfs_entry;
> + int debugfs_id;
> +
> + if (!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))
> + return;
> +
> + down_write(&shrinker_rwsem);
> + list_del(&shrinker->list);
> + shrinker->flags &= ~SHRINKER_REGISTERED;
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);
> + debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
> + up_write(&shrinker_rwsem);
> +
> + shrinker_debugfs_remove(debugfs_entry, debugfs_id);

Should there not be an rcu_barrier() right about here?

> +
> + kfree(shrinker->nr_deferred);
> + shrinker->nr_deferred = NULL;
> +
> + kfree(shrinker);
> +}
> +EXPORT_SYMBOL(shrinker_unregister);
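
That is, something along these lines (placement sketch only, untested):

	shrinker_debugfs_remove(debugfs_entry, debugfs_id);
	rcu_barrier();	/* wait for concurrent RCU users of the shrinker to drain */

	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;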



Re: [PATCH] x86/paravirt: convert simple paravirt functions to asm

2023-03-16 Thread Peter Zijlstra
On Wed, Mar 08, 2023 at 04:42:10PM +0100, Juergen Gross wrote:

> +DEFINE_PARAVIRT_ASM(pv_native_irq_disable, "cli", .text);
> +DEFINE_PARAVIRT_ASM(pv_native_irq_enable, "sti", .text);
> +DEFINE_PARAVIRT_ASM(pv_native_read_cr2, "mov %cr2, %rax", .text);

per these v, the above ^ should be in .noinstr.text

> -static noinstr unsigned long pv_native_read_cr2(void)
> -static noinstr void pv_native_irq_enable(void)
> -static noinstr void pv_native_irq_disable(void)




Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads

2023-01-31 Thread Peter Zijlstra
On Mon, Jan 30, 2023 at 11:59:30AM -0800, Josh Poimboeuf wrote:

> @@ -8662,16 +8665,19 @@ void sched_dynamic_update(int mode)
>  
>   switch (mode) {
>   case preempt_dynamic_none:
> - preempt_dynamic_enable(cond_resched);
> + if (!klp_override)
> + preempt_dynamic_enable(cond_resched);
>   preempt_dynamic_disable(might_resched);
>   preempt_dynamic_disable(preempt_schedule);
>   preempt_dynamic_disable(preempt_schedule_notrace);
>   preempt_dynamic_disable(irqentry_exit_cond_resched);
> + //FIXME avoid printk for klp restore

if (mode != preempt_dynamic_mode)

>   pr_info("Dynamic Preempt: none\n");
>   break;
>  
>   case preempt_dynamic_voluntary:
> - preempt_dynamic_enable(cond_resched);
> + if (!klp_override)
> + preempt_dynamic_enable(cond_resched);
>   preempt_dynamic_enable(might_resched);
>   preempt_dynamic_disable(preempt_schedule);
>   preempt_dynamic_disable(preempt_schedule_notrace);




[PATCH v2.1 4/9] tracing, preempt: Squash _rcuidle tracing

2023-01-31 Thread Peter Zijlstra


Extend commit 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle()
tracing") to also cover trace_preempt_{on,off}() which were
mysteriously untouched.

Fixes: 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle() tracing")
Reported-by: Mark Rutland 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Mark Rutland 
Link: https://lkml.kernel.org/r/y9d5afnoukwno...@hirez.programming.kicks-ass.net
---
 kernel/trace/trace_preemptirq.c |   14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -15,10 +15,6 @@
 #define CREATE_TRACE_POINTS
 #include 
 
-#ifdef CONFIG_TRACE_IRQFLAGS
-/* Per-cpu variable to prevent redundant calls when IRQs already off */
-static DEFINE_PER_CPU(int, tracing_irq_cpu);
-
 /*
  * Use regular trace points on architectures that implement noinstr
  * tooling: these calls will only happen with RCU enabled, which can
@@ -33,6 +29,10 @@ static DEFINE_PER_CPU(int, tracing_irq_c
 #define trace(point)   if (!in_nmi()) trace_##point##_rcuidle
 #endif
 
+#ifdef CONFIG_TRACE_IRQFLAGS
+/* Per-cpu variable to prevent redundant calls when IRQs already off */
+static DEFINE_PER_CPU(int, tracing_irq_cpu);
+
 /*
  * Like trace_hardirqs_on() but without the lockdep invocation. This is
  * used in the low level entry code where the ordering vs. RCU is important
@@ -100,15 +100,13 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
 
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_enable_rcuidle(a0, a1);
+   trace(preempt_enable)(a0, a1);
tracer_preempt_on(a0, a1);
 }
 
 void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_disable_rcuidle(a0, a1);
+   trace(preempt_disable)(a0, a1);
tracer_preempt_off(a0, a1);
 }
 #endif


Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads

2023-01-30 Thread Peter Zijlstra
On Fri, Jan 27, 2023 at 02:11:31PM -0800, Josh Poimboeuf wrote:


> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4df2b3e76b30..fbcd3acca25c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /* task_struct member predeclarations (sorted alphabetically): */
> @@ -2074,6 +2075,9 @@ DECLARE_STATIC_CALL(cond_resched, __cond_resched);
>  
>  static __always_inline int _cond_resched(void)
>  {
> + //FIXME this is a bit redundant with preemption disabled
> + klp_sched_try_switch();
> +
>   return static_call_mod(cond_resched)();
>  }

Right, I was thinking you'd do something like:

static_call_update(cond_resched, klp_cond_resched);

With:

static int klp_cond_resched(void)
{
klp_try_switch_task(current);
return __cond_resched();
}

That would force cond_resched() into doing the transition thing,
irrespective of the preemption mode at hand. And then, when KLP is done,
re-run sched_dynamic_update() to reset it to whatever it ought to be.
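
Where that restore step could be as simple as the below sketch (made-up name; it
would have to live next to sched_dynamic_update() in kernel/sched/core.c so it
can see preempt_dynamic_mode):

static void klp_cond_resched_restore(void)
{
	/* put the cond_resched static call back to whatever the current mode wants */
	sched_dynamic_update(preempt_dynamic_mode);
}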

> @@ -401,8 +421,10 @@ void klp_try_complete_transition(void)
>*/
>   read_lock(&tasklist_lock);
>   for_each_process_thread(g, task)
> - if (!klp_try_switch_task(task))
> + if (!klp_try_switch_task(task)) {
> + set_tsk_need_resched(task);
>   complete = false;
> + }

Yeah, no, that's broken -- preemption state lives in more than just the
TIF bit.

>   read_unlock(&tasklist_lock);
>  
>   /*
> @@ -413,6 +435,7 @@ void klp_try_complete_transition(void)
>   task = idle_task(cpu);
>   if (cpu_online(cpu)) {
>   if (!klp_try_switch_task(task)) {
> + set_tsk_need_resched(task);
>   complete = false;
>   /* Make idle task go through the main loop. */
>   wake_up_if_idle(cpu);

Idem.

Also, I don't see the point of this and the __schedule() hook here:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3a0ef2fefbd5..01e32d242ef6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6506,6 +6506,8 @@ static void __sched notrace __schedule(unsigned int 
> sched_mode)
>   struct rq *rq;
>   int cpu;
>  
> + klp_sched_try_switch();
> +
>   cpu = smp_processor_id();
>   rq = cpu_rq(cpu);
>   prev = rq->curr;

If it schedules, you'll get it with the normal switcheroo, because it'll
be inactive some of the time. If it doesn't schedule, it'll run into
cond_resched().

> @@ -8500,8 +8502,10 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
>  static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
>  int __sched dynamic_cond_resched(void)
>  {
> - if (!static_branch_unlikely(&sk_dynamic_cond_resched))
> + if (!static_branch_unlikely(&sk_dynamic_cond_resched)) {
> + klp_sched_try_switch();
>   return 0;
> + }
>   return __cond_resched();
>  }
>  EXPORT_SYMBOL(dynamic_cond_resched);

I would make the klp_sched_try_switch() not depend on
sk_dynamic_cond_resched, because __cond_resched() is not a guaranteed
pass through __schedule().
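
That is, something like (sketch only):

int __sched dynamic_cond_resched(void)
{
	klp_sched_try_switch();		/* always give KLP a chance here */
	if (!static_branch_unlikely(&sk_dynamic_cond_resched))
		return 0;
	return __cond_resched();
}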

But you'll probably want to check with Mark here, this all might
generate crap code on arm64.

Both ways this seems to make KLP 'depend' (or at least work lots better)
when PREEMPT_DYNAMIC=y. Do we want a PREEMPT_DYNAMIC=n fallback for
_cond_resched() too?
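
Something like the below, perhaps -- only a sketch, and the #if condition is a guess:

#if !defined(CONFIG_PREEMPT_DYNAMIC) && !defined(CONFIG_PREEMPTION)
static __always_inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return __cond_resched();
}
#endif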




Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads

2023-01-27 Thread Peter Zijlstra
On Thu, Jan 26, 2023 at 08:43:55PM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 26, 2023 at 03:12:35PM -0600, Seth Forshee (DigitalOcean) wrote:
> > On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > > We've fairly regularly seen livepatches which cannot transition within
> > > > kpatch's timeout period due to busy vhost worker kthreads.
> > > 
> > > I have missed this detail. Miroslav told me that we have solved
> > > something similar some time ago, see
> > > https://lore.kernel.org/all/20220507174628.2086373-1-s...@kernel.org/
> > 
> > Interesting thread. I had thought about something along the lines of the
> > original patch, but there are some ideas in there that I hadn't
> > considered.
> 
> Here's another idea, have we considered this?  Have livepatch set
> TIF_NEED_RESCHED on all kthreads to force them into schedule(), and then
> have the scheduler call klp_try_switch_task() if TIF_PATCH_PENDING is
> set.
> 
> Not sure how scheduler folks would feel about that ;-)

So, let me try and page all that back in :-)

KLP needs to unwind the stack to see if any of the patched functions are
active, if not, flip task to new set.

Unwinding the stack of a task can be done when:

 - task is inactive (stable reg and stack) -- provided it stays inactive
   while unwinding etc..

 - task is current (guarantees stack doesn't dip below where we started
   due to being busy on top etc..)

Can NOT be done from interrupt context, because can hit in the middle of
setting up stack frames etc..

The issue at hand is that some tasks run for a long time without passing
through an explicit check.

The thread above tried sticking something in cond_resched() which is a
problem for PREEMPT=y since cond_resched() is a no-op.

Preempt notifiers were raised, and those would actually be nice, except
you can only install a notifier on current and you need some memory
allocated per task, which makes it less than ideal. Plus ...

... putting something in finish_task_switch() wouldn't be the end of the
world I suppose, but then you still need to force schedule the task --
imagine it being the only runnable task on the CPU, there's nothing
going to make it actually switch.

Which then leads me to suggest something daft like this.. does that
help?


diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..06746095a724 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include <linux/stop_machine.h>
 #include "core.h"
 #include "patch.h"
 #include "transition.h"
@@ -334,6 +335,16 @@ static bool klp_try_switch_task(struct task_struct *task)
return !ret;
 }
 
+static int __stop_try_switch(void *arg)
+{
+   return klp_try_switch_task(arg) ? 0 : -EBUSY;
+}
+
+static bool klp_try_switch_task_harder(struct task_struct *task)
+{
+   return !stop_one_cpu(task_cpu(task), __stop_try_switch, task);
+}
+
 /*
  * Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
  * Kthreads with TIF_PATCH_PENDING set are woken up.
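
Usage would then be something like this in klp_try_complete_transition()
(equally untested; only bother with the heavy hammer for tasks that are
actually on a CPU):

	for_each_process_thread(g, task) {
		if (klp_try_switch_task(task))
			continue;
		if (!task_curr(task) || !klp_try_switch_task_harder(task))
			complete = false;
	}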



[PATCH v2 7/9] x86: Mark sched_clock() noinstr

2023-01-26 Thread Peter Zijlstra
In order to use sched_clock() from noinstr code, mark it and all its
implementations noinstr.

The whole pvclock thing (used by KVM/Xen) is a bit of a pain since it
calls out to watchdogs; create a pvclock_clocksource_read_nowd()
variant that doesn't do that and can be noinstr.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/kvmclock.h |2 +-
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/include/asm/pvclock.h  |3 ++-
 arch/x86/kernel/cpu/vmware.c|2 +-
 arch/x86/kernel/kvmclock.c  |6 +++---
 arch/x86/kernel/pvclock.c   |   19 +++
 arch/x86/kernel/tsc.c   |7 +++
 arch/x86/xen/time.c |   12 ++--
 include/linux/math64.h  |4 ++--
 9 files changed, 38 insertions(+), 19 deletions(-)

--- a/arch/x86/include/asm/kvmclock.h
+++ b/arch/x86/include/asm/kvmclock.h
@@ -8,7 +8,7 @@ extern struct clocksource kvm_clock;
 
 DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
 
-static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+static __always_inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
 {
return &this_cpu_read(hv_clock_per_cpu)->pvti;
 }
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -26,7 +26,7 @@ DECLARE_STATIC_CALL(pv_sched_clock, dumm
 
 void paravirt_set_sched_clock(u64 (*func)(void));
 
-static inline u64 paravirt_sched_clock(void)
+static __always_inline u64 paravirt_sched_clock(void)
 {
return static_call(pv_sched_clock)();
 }
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -7,6 +7,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
@@ -39,7 +40,7 @@ bool pvclock_read_retry(const struct pvc
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
-static inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
+static __always_inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
 {
u64 product;
 #ifdef __i386__
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -143,7 +143,7 @@ static __init int parse_no_stealacc(char
 }
 early_param("no-steal-acc", parse_no_stealacc);
 
-static unsigned long long notrace vmware_sched_clock(void)
+static noinstr u64 vmware_sched_clock(void)
 {
unsigned long long ns;
 
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -71,12 +71,12 @@ static int kvm_set_wallclock(const struc
return -ENODEV;
 }
 
-static u64 kvm_clock_read(void)
+static noinstr u64 kvm_clock_read(void)
 {
u64 ret;
 
preempt_disable_notrace();
-   ret = pvclock_clocksource_read(this_cpu_pvti());
+   ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
preempt_enable_notrace();
return ret;
 }
@@ -86,7 +86,7 @@ static u64 kvm_clock_get_cycles(struct c
return kvm_clock_read();
 }
 
-static u64 kvm_sched_clock_read(void)
+static noinstr u64 kvm_sched_clock_read(void)
 {
return kvm_clock_read() - kvm_sched_clock_offset;
 }
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -64,7 +64,8 @@ u8 pvclock_read_flags(struct pvclock_vcp
return flags & valid_flags;
 }
 
-u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
 {
unsigned version;
u64 ret;
@@ -77,7 +78,7 @@ u64 pvclock_clocksource_read(struct pvcl
flags = src->flags;
} while (pvclock_read_retry(src, version));
 
-   if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
+   if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
}
@@ -100,15 +101,25 @@ u64 pvclock_clocksource_read(struct pvcl
 * updating at the same time, and one of them could be slightly behind,
 * making the assumption that last_value always go forward fail to hold.
 */
-   last = atomic64_read(&last_value);
+   last = arch_atomic64_read(&last_value);
do {
if (ret <= last)
return last;
-   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
+   } while (!arch_atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
 
+u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+   return __pvclock_clocksource_read(src, true);
+}
+
+noinstr u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src)
+{
+   return __pvclock_clocksource_read(src, false);
+}

[PATCH v2 4/9] tracing, preempt: Squash _rcuidle tracing

2023-01-26 Thread Peter Zijlstra
Extend commit 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle()
tracing") to also cover trace_preempt_{on,off}() which were
mysteriously untouched.

Fixes: 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle() tracing")
Reported-by: Mark Rutland 
Signed-off-by: Peter Zijlstra (Intel) 
Tested-by: Mark Rutland 
Link: https://lkml.kernel.org/r/y9d5afnoukwno...@hirez.programming.kicks-ass.net
---
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index f992444a0b1f..ea96b41c8838 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -100,15 +100,13 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
 
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_enable_rcuidle(a0, a1);
+   trace(preempt_enable)(a0, a1);
tracer_preempt_on(a0, a1);
 }
 
 void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_disable_rcuidle(a0, a1);
+   trace(preempt_disable)(a0, a1);
tracer_preempt_off(a0, a1);
 }
 #endif




[PATCH v2 9/9] cpuidle: Fix poll_idle() noinstr annotation

2023-01-26 Thread Peter Zijlstra
The instrumentation_begin()/end() annotations in poll_idle() were
complete nonsense. Specifically they caused tracing to happen in the
middle of noinstr code, resulting in RCU splats.

Now that local_clock() is noinstr, mark up the rest and let it rip.

Fixes: 00717eb8c955 ("cpuidle: Annotate poll_idle()")
Signed-off-by: Peter Zijlstra (Intel) 
Reported-by: kernel test robot 
Acked-by: Rafael J. Wysocki 
Link: https://lore.kernel.org/oe-lkp/202301192148.58ece903-oliver.s...@intel.com
---
 drivers/cpuidle/cpuidle.c|2 +-
 drivers/cpuidle/poll_state.c |2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -426,7 +426,7 @@ void cpuidle_reflect(struct cpuidle_devi
  * @dev:   the cpuidle device
  *
  */
-u64 cpuidle_poll_time(struct cpuidle_driver *drv,
+__cpuidle u64 cpuidle_poll_time(struct cpuidle_driver *drv,
  struct cpuidle_device *dev)
 {
int i;
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -15,7 +15,6 @@ static int __cpuidle poll_idle(struct cp
 {
u64 time_start;
 
-   instrumentation_begin();
time_start = local_clock();
 
dev->poll_time_limit = false;
@@ -42,7 +41,6 @@ static int __cpuidle poll_idle(struct cp
raw_local_irq_disable();
 
current_clr_polling();
-   instrumentation_end();
 
return index;
 }




[PATCH v2 6/9] x86/pvclock: improve atomic update of last_value in pvclock_clocksource_read

2023-01-26 Thread Peter Zijlstra
From: Uros Bizjak 

Improve atomic update of last_value in pvclock_clocksource_read:

- Atomic update can be skipped if the "last_value" is already
  equal to "ret".

- The detection of atomic update failure is not correct. The value,
  returned by atomic64_cmpxchg should be compared to the old value
  from the location to be updated. If these two are the same, then
  atomic update succeeded and "last_value" location is updated to
  "ret" in an atomic way. Otherwise, the atomic update failed and
  it should be retried with the value from "last_value" - exactly
  what atomic64_try_cmpxchg does in a correct and more optimal way.

Signed-off-by: Uros Bizjak 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20230118202330.3740-1-ubiz...@gmail.com
---
 arch/x86/kernel/pvclock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index eda37df016f0..5a2a517dd61b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -102,10 +102,9 @@ u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 */
last = atomic64_read(&last_value);
do {
-   if (ret < last)
+   if (ret <= last)
return last;
-   last = atomic64_cmpxchg(&last_value, last, ret);
-   } while (unlikely(last != ret));
+   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
-- 
2.39.0
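
For reference, the atomic64_try_cmpxchg() used above behaves roughly like this
generic illustration (not the actual x86 implementation):

static inline bool illustrate_try_cmpxchg64(atomic64_t *v, s64 *old, s64 new)
{
	s64 cur = atomic64_cmpxchg(v, *old, new);	/* returns the previous value */

	if (cur == *old)
		return true;		/* swap happened */
	*old = cur;			/* swap failed: hand back the value found */
	return false;
}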





[PATCH v2 8/9] sched/clock: Make local_clock() noinstr

2023-01-26 Thread Peter Zijlstra
With sched_clock() noinstr, provide a noinstr implementation of
local_clock().

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched/clock.h |8 +++-
 kernel/sched/clock.c|   27 +--
 2 files changed, 24 insertions(+), 11 deletions(-)

--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -45,7 +45,7 @@ static inline u64 cpu_clock(int cpu)
return sched_clock();
 }
 
-static inline u64 local_clock(void)
+static __always_inline u64 local_clock(void)
 {
return sched_clock();
 }
@@ -79,10 +79,8 @@ static inline u64 cpu_clock(int cpu)
return sched_clock_cpu(cpu);
 }
 
-static inline u64 local_clock(void)
-{
-   return sched_clock_cpu(raw_smp_processor_id());
-}
+extern u64 local_clock(void);
+
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -93,7 +93,7 @@ struct sched_clock_data {
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
 
-notrace static inline struct sched_clock_data *this_scd(void)
+static __always_inline struct sched_clock_data *this_scd(void)
 {
return this_cpu_ptr(&sched_clock_data);
 }
@@ -244,12 +244,12 @@ late_initcall(sched_clock_init_late);
  * min, max except they take wrapping into account
  */
 
-notrace static inline u64 wrap_min(u64 x, u64 y)
+static __always_inline u64 wrap_min(u64 x, u64 y)
 {
return (s64)(x - y) < 0 ? x : y;
 }
 
-notrace static inline u64 wrap_max(u64 x, u64 y)
+static __always_inline u64 wrap_max(u64 x, u64 y)
 {
return (s64)(x - y) > 0 ? x : y;
 }
@@ -260,7 +260,7 @@ notrace static inline u64 wrap_max(u64 x
  *  - filter out backward motion
  *  - use the GTOD tick value to create a window to filter crazy TSC values
  */
-notrace static u64 sched_clock_local(struct sched_clock_data *scd)
+static __always_inline u64 sched_clock_local(struct sched_clock_data *scd)
 {
u64 now, clock, old_clock, min_clock, max_clock, gtod;
s64 delta;
@@ -287,13 +287,28 @@ notrace static u64 sched_clock_local(str
clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
 
-   if (!try_cmpxchg64(&scd->clock, &old_clock, clock))
+   if (!arch_try_cmpxchg64(&scd->clock, &old_clock, clock))
goto again;
 
return clock;
 }
 
-notrace static u64 sched_clock_remote(struct sched_clock_data *scd)
+noinstr u64 local_clock(void)
+{
+   u64 clock;
+
+   if (static_branch_likely(&__sched_clock_stable))
+   return sched_clock() + __sched_clock_offset;
+
+   preempt_disable_notrace();
+   clock = sched_clock_local(this_scd());
+   preempt_enable_notrace();
+
+   return clock;
+}
+EXPORT_SYMBOL_GPL(local_clock);
+
+static notrace u64 sched_clock_remote(struct sched_clock_data *scd)
 {
struct sched_clock_data *my_scd = this_scd();
u64 this_clock, remote_clock;




[PATCH v2 3/9] tracing: Warn about !rcu_is_watching()

2023-01-26 Thread Peter Zijlstra
When using noinstr, WARN when tracing hits when RCU is disabled.

Suggested-by: Steven Rostedt (Google) 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/trace_recursion.h |   18 ++
 1 file changed, 18 insertions(+)

--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,21 @@ extern void ftrace_record_recursion(unsi
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
+#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+# define trace_warn_on_no_rcu(ip)  \
+   ({  \
+   bool __ret = !rcu_is_watching();\
+   if (__ret && !trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
+   trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
+   WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
+   trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
+   }   \
+   __ret;  \
+   })
+#else
+# define trace_warn_on_no_rcu(ip)  false
+#endif
+
 /*
  * Preemption is promised to be disabled when return bit >= 0.
  */
@@ -144,6 +159,9 @@ static __always_inline int trace_test_an
unsigned int val = READ_ONCE(current->trace_recursion);
int bit;
 
+   if (trace_warn_on_no_rcu(ip))
+   return -1;
+
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*




[PATCH v2 0/9] A few more cpuidle vs rcu fixes

2023-01-26 Thread Peter Zijlstra
0-day robot reported graph-tracing made the cpuidle-vs-rcu rework go splat.

These patches appear to cure this, the ftrace selftest now runs to completion
without spamming scary messages to dmesg.

Since v1:

 - fixed recursive RCU splats
 - fixed psci thingies for arm (null)
 - improved the tracing WARN (rostedt)
 - fixed TRACE_PREEMPT_TOGGLE (null)

---
 arch/x86/include/asm/atomic64_32.h | 44 +++---
 arch/x86/include/asm/atomic64_64.h | 36 +++
 arch/x86/include/asm/kvmclock.h|  2 +-
 arch/x86/include/asm/paravirt.h|  2 +-
 arch/x86/include/asm/pvclock.h |  3 ++-
 arch/x86/kernel/cpu/vmware.c   |  2 +-
 arch/x86/kernel/kvmclock.c |  6 +++---
 arch/x86/kernel/pvclock.c  | 22 +--
 arch/x86/kernel/tsc.c  |  7 +++---
 arch/x86/xen/time.c| 12 +--
 drivers/cpuidle/cpuidle.c  |  2 +-
 drivers/cpuidle/poll_state.c   |  2 --
 drivers/firmware/psci/psci.c   | 31 ---
 include/linux/context_tracking.h   | 27 +++
 include/linux/math64.h |  4 ++--
 include/linux/sched/clock.h|  8 +++
 include/linux/trace_recursion.h| 18 
 kernel/locking/lockdep.c   |  3 +++
 kernel/panic.c |  5 +
 kernel/sched/clock.c   | 27 +--
 kernel/trace/trace_preemptirq.c|  6 ++
 lib/bug.c  | 15 -
 22 files changed, 192 insertions(+), 92 deletions(-)



[PATCH v2 2/9] bug: Disable rcu_is_watching() during WARN/BUG

2023-01-26 Thread Peter Zijlstra
In order to avoid WARN/BUG from generating nested or even recursive
warnings, force rcu_is_watching() true during
WARN/lockdep_rcu_suspicious().

Notably things like unwinding the stack can trigger rcu_dereference()
warnings, which then triggers more unwinding which then triggers more
warnings etc..

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/context_tracking.h |   27 +++
 kernel/locking/lockdep.c |3 +++
 kernel/panic.c   |5 +
 lib/bug.c|   15 ++-
 4 files changed, 49 insertions(+), 1 deletion(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -130,9 +130,36 @@ static __always_inline unsigned long ct_
return arch_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+static __always_inline bool warn_rcu_enter(void)
+{
+   bool ret = false;
+
+   /*
+* Horrible hack to shut up recursive RCU isn't watching fail since
+* lots of the actual reporting also relies on RCU.
+*/
+   preempt_disable_notrace();
+   if (rcu_dynticks_curr_cpu_in_eqs()) {
+   ret = true;
+   ct_state_inc(RCU_DYNTICKS_IDX);
+   }
+
+   return ret;
+}
+
+static __always_inline void warn_rcu_exit(bool rcu)
+{
+   if (rcu)
+   ct_state_inc(RCU_DYNTICKS_IDX);
+   preempt_enable_notrace();
+}
+
 #else
 static inline void ct_idle_enter(void) { }
 static inline void ct_idle_exit(void) { }
+
+static __always_inline bool warn_rcu_enter(void) { return false; }
+static __always_inline void warn_rcu_exit(bool rcu) { }
 #endif /* !CONFIG_CONTEXT_TRACKING_IDLE */
 
 #endif
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 
 #include 
 
@@ -6555,6 +6556,7 @@ void lockdep_rcu_suspicious(const char *
 {
struct task_struct *curr = current;
int dl = READ_ONCE(debug_locks);
+   bool rcu = warn_rcu_enter();
 
/* Note: the following can be executed concurrently, so be careful. */
pr_warn("\n");
@@ -6595,5 +6597,6 @@ void lockdep_rcu_suspicious(const char *
lockdep_print_held_locks(curr);
pr_warn("\nstack backtrace:\n");
dump_stack();
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 #include 
 #include 
 
@@ -679,6 +680,7 @@ void __warn(const char *file, int line,
 void warn_slowpath_fmt(const char *file, int line, unsigned taint,
   const char *fmt, ...)
 {
+   bool rcu = warn_rcu_enter();
struct warn_args args;
 
pr_warn(CUT_HERE);
@@ -693,11 +695,13 @@ void warn_slowpath_fmt(const char *file,
va_start(args.args, fmt);
__warn(file, line, __builtin_return_address(0), taint, NULL, &args);
va_end(args.args);
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL(warn_slowpath_fmt);
 #else
 void __warn_printk(const char *fmt, ...)
 {
+   bool rcu = warn_rcu_enter();
va_list args;
 
pr_warn(CUT_HERE);
@@ -705,6 +709,7 @@ void __warn_printk(const char *fmt, ...)
va_start(args, fmt);
vprintk(fmt, args);
va_end(args);
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL(__warn_printk);
 #endif
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 
 extern struct bug_entry __start___bug_table[], __stop___bug_table[];
 
@@ -153,7 +154,7 @@ struct bug_entry *find_bug(unsigned long
return module_find_bug(bugaddr);
 }
 
-enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
+static enum bug_trap_type __report_bug(unsigned long bugaddr, struct pt_regs 
*regs)
 {
struct bug_entry *bug;
const char *file;
@@ -209,6 +210,18 @@ enum bug_trap_type report_bug(unsigned l
return BUG_TRAP_TYPE_BUG;
 }
 
+enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
+{
+   enum bug_trap_type ret;
+   bool rcu = false;
+
+   rcu = warn_rcu_enter();
+   ret = __report_bug(bugaddr, regs);
+   warn_rcu_exit(rcu);
+
+   return ret;
+}
+
 static void clear_once_table(struct bug_entry *start, struct bug_entry *end)
 {
struct bug_entry *bug;




[PATCH v2 5/9] x86: Always inline arch_atomic64

2023-01-26 Thread Peter Zijlstra
As already done for regular arch_atomic, always inline arch_atomic64.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/atomic64_32.h |   44 ++---
 arch/x86/include/asm/atomic64_64.h |   36 +++---
 2 files changed, 40 insertions(+), 40 deletions(-)

--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -71,7 +71,7 @@ ATOMIC64_DECL(add_unless);
  * the old value.
  */
 
-static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
+static __always_inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
 {
return arch_cmpxchg64(&v->counter, o, n);
 }
@@ -85,7 +85,7 @@ static inline s64 arch_atomic64_cmpxchg(
  * Atomically xchgs the value of @v to @n and returns
  * the old value.
  */
-static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
+static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 {
s64 o;
unsigned high = (unsigned)(n >> 32);
@@ -104,7 +104,7 @@ static inline s64 arch_atomic64_xchg(ato
  *
  * Atomically sets the value of @v to @n.
  */
-static inline void arch_atomic64_set(atomic64_t *v, s64 i)
+static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
unsigned high = (unsigned)(i >> 32);
unsigned low = (unsigned)i;
@@ -119,7 +119,7 @@ static inline void arch_atomic64_set(ato
  *
  * Atomically reads the value of @v and returns it.
  */
-static inline s64 arch_atomic64_read(const atomic64_t *v)
+static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
 {
s64 r;
alternative_atomic64(read, "=&A" (r), "c" (v) : "memory");
@@ -133,7 +133,7 @@ static inline s64 arch_atomic64_read(con
  *
  * Atomically adds @i to @v and returns @i + *@v
  */
-static inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(add_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -145,7 +145,7 @@ static inline s64 arch_atomic64_add_retu
 /*
  * Other variants with different arithmetic operators:
  */
-static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(sub_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -154,7 +154,7 @@ static inline s64 arch_atomic64_sub_retu
 }
 #define arch_atomic64_sub_return arch_atomic64_sub_return
 
-static inline s64 arch_atomic64_inc_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_inc_return(atomic64_t *v)
 {
s64 a;
alternative_atomic64(inc_return, "=&A" (a),
@@ -163,7 +163,7 @@ static inline s64 arch_atomic64_inc_retu
 }
 #define arch_atomic64_inc_return arch_atomic64_inc_return
 
-static inline s64 arch_atomic64_dec_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_dec_return(atomic64_t *v)
 {
s64 a;
alternative_atomic64(dec_return, "=&A" (a),
@@ -179,7 +179,7 @@ static inline s64 arch_atomic64_dec_retu
  *
  * Atomically adds @i to @v.
  */
-static inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
 {
__alternative_atomic64(add, add_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -194,7 +194,7 @@ static inline s64 arch_atomic64_add(s64
  *
  * Atomically subtracts @i from @v.
  */
-static inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
 {
__alternative_atomic64(sub, sub_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -208,7 +208,7 @@ static inline s64 arch_atomic64_sub(s64
  *
  * Atomically increments @v by 1.
  */
-static inline void arch_atomic64_inc(atomic64_t *v)
+static __always_inline void arch_atomic64_inc(atomic64_t *v)
 {
__alternative_atomic64(inc, inc_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -221,7 +221,7 @@ static inline void arch_atomic64_inc(ato
  *
  * Atomically decrements @v by 1.
  */
-static inline void arch_atomic64_dec(atomic64_t *v)
+static __always_inline void arch_atomic64_dec(atomic64_t *v)
 {
__alternative_atomic64(dec, dec_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -237,7 +237,7 @@ static inline void arch_atomic64_dec(ato
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns non-zero if the add was done, zero otherwise.
  */
-static inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+static __always_in

[PATCH v2 1/9] drivers: firmware: psci: Dont instrument suspend code

2023-01-26 Thread Peter Zijlstra
From: Mark Rutland 

The PSCI suspend code is currently instrumentable, which is not safe as
instrumentation (e.g. ftrace) may try to make use of RCU during idle
periods when RCU is not watching.

To fix this we need to ensure that psci_suspend_finisher() and anything
it calls are not instrumented. We can do this fairly simply by marking
psci_suspend_finisher() and the psci*_cpu_suspend() functions as
noinstr, and the underlying helper functions as __always_inline.

When CONFIG_DEBUG_VIRTUAL=y, __pa_symbol() can expand to an out-of-line
instrumented function, so we must use __pa_symbol_nodebug() within
psci_suspend_finisher().

The raw SMCCC invocation functions are written in assembly, and are not
subject to compiler instrumentation.

Signed-off-by: Mark Rutland 
Signed-off-by: Peter Zijlstra (Intel) 
---
 drivers/firmware/psci/psci.c |   31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -108,9 +108,10 @@ bool psci_power_state_is_valid(u32 state
return !(state & ~valid_mask);
 }
 
-static unsigned long __invoke_psci_fn_hvc(unsigned long function_id,
-   unsigned long arg0, unsigned long arg1,
-   unsigned long arg2)
+static __always_inline unsigned long
+__invoke_psci_fn_hvc(unsigned long function_id,
+unsigned long arg0, unsigned long arg1,
+unsigned long arg2)
 {
struct arm_smccc_res res;
 
@@ -118,9 +119,10 @@ static unsigned long __invoke_psci_fn_hv
return res.a0;
 }
 
-static unsigned long __invoke_psci_fn_smc(unsigned long function_id,
-   unsigned long arg0, unsigned long arg1,
-   unsigned long arg2)
+static __always_inline unsigned long
+__invoke_psci_fn_smc(unsigned long function_id,
+unsigned long arg0, unsigned long arg1,
+unsigned long arg2)
 {
struct arm_smccc_res res;
 
@@ -128,7 +130,7 @@ static unsigned long __invoke_psci_fn_sm
return res.a0;
 }
 
-static int psci_to_linux_errno(int errno)
+static __always_inline int psci_to_linux_errno(int errno)
 {
switch (errno) {
case PSCI_RET_SUCCESS:
@@ -169,7 +171,8 @@ int psci_set_osi_mode(bool enable)
return psci_to_linux_errno(err);
 }
 
-static int __psci_cpu_suspend(u32 fn, u32 state, unsigned long entry_point)
+static __always_inline int
+__psci_cpu_suspend(u32 fn, u32 state, unsigned long entry_point)
 {
int err;
 
@@ -177,13 +180,15 @@ static int __psci_cpu_suspend(u32 fn, u3
return psci_to_linux_errno(err);
 }
 
-static int psci_0_1_cpu_suspend(u32 state, unsigned long entry_point)
+static __always_inline int
+psci_0_1_cpu_suspend(u32 state, unsigned long entry_point)
 {
return __psci_cpu_suspend(psci_0_1_function_ids.cpu_suspend,
  state, entry_point);
 }
 
-static int psci_0_2_cpu_suspend(u32 state, unsigned long entry_point)
+static __always_inline int
+psci_0_2_cpu_suspend(u32 state, unsigned long entry_point)
 {
return __psci_cpu_suspend(PSCI_FN_NATIVE(0_2, CPU_SUSPEND),
  state, entry_point);
@@ -447,10 +452,12 @@ late_initcall(psci_debugfs_init)
 #endif
 
 #ifdef CONFIG_CPU_IDLE
-static int psci_suspend_finisher(unsigned long state)
+static noinstr int psci_suspend_finisher(unsigned long state)
 {
u32 power_state = state;
-   phys_addr_t pa_cpu_resume = __pa_symbol(cpu_resume);
+   phys_addr_t pa_cpu_resume;
+
+   pa_cpu_resume = __pa_symbol_nodebug((unsigned long)cpu_resume);
 
return psci_ops.cpu_suspend(power_state, pa_cpu_resume);
 }




Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-26 Thread Peter Zijlstra
On Wed, Jan 25, 2023 at 10:46:58AM -0800, Paul E. McKenney wrote:

> > Ofc. Paul might have an opinion on this glorious bodge ;-)
> 
> For some definition of the word "glorious", to be sure.  ;-)
> 
> Am I correct that you have two things happening here?  (1) Preventing
> trace recursion and (2) forcing RCU to pay attention when needed.

Mostly just (1), we're in an error situation, I'm not too worried about
(2).

> I cannot resist pointing out that you have re-invented RCU_NONIDLE(),
> though avoiding much of the overhead when not needed.  ;-)

Yeah, this was the absolute minimal bodge I could come up with that
shuts up the rcu_dereference warning thing.

> I would have objections if this ever leaks out onto a non-error code path.

Agreed.

> There are things that need doing when RCU starts and stops watching,
> and this approach omits those things.  Which again is OK in this case,
> where this code is only ever executed when something is already broken,
> but definitely *not* OK when things are not already broken.

And agreed.

Current version of the bodge looks like so (will repost the whole series
a little later today).

I managed to tickle the recursion so that it was a test-case for the
stack guard...

With this on, it prints just the one WARN and lives.

---
Subject: bug: Disable rcu_is_watching() during WARN/BUG
From: Peter Zijlstra 
Date: Wed Jan 25 13:57:49 CET 2023

In order to avoid WARN/BUG from generating nested or even recursive
warnings, force rcu_is_watching() true during
WARN/lockdep_rcu_suspicious().

Notably things like unwinding the stack can trigger rcu_dereference()
warnings, which then triggers more unwinding which then triggers more
warnings etc..

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/context_tracking.h |   27 +++
 kernel/locking/lockdep.c |3 +++
 kernel/panic.c   |5 +
 lib/bug.c|   15 ++-
 4 files changed, 49 insertions(+), 1 deletion(-)

--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -130,9 +130,36 @@ static __always_inline unsigned long ct_
return arch_atomic_add_return(incby, this_cpu_ptr(&context_tracking.state));
 }
 
+static __always_inline bool warn_rcu_enter(void)
+{
+   bool ret = false;
+
+   /*
+* Horrible hack to shut up recursive RCU isn't watching fail since
+* lots of the actual reporting also relies on RCU.
+*/
+   preempt_disable_notrace();
+   if (rcu_dynticks_curr_cpu_in_eqs()) {
+   ret = true;
+   ct_state_inc(RCU_DYNTICKS_IDX);
+   }
+
+   return ret;
+}
+
+static __always_inline void warn_rcu_exit(bool rcu)
+{
+   if (rcu)
+   ct_state_inc(RCU_DYNTICKS_IDX);
+   preempt_enable_notrace();
+}
+
 #else
 static inline void ct_idle_enter(void) { }
 static inline void ct_idle_exit(void) { }
+
+static __always_inline bool warn_rcu_enter(void) { return false; }
+static __always_inline void warn_rcu_exit(bool rcu) { }
 #endif /* !CONFIG_CONTEXT_TRACKING_IDLE */
 
 #endif
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 
 #include 
 
@@ -6555,6 +6556,7 @@ void lockdep_rcu_suspicious(const char *
 {
struct task_struct *curr = current;
int dl = READ_ONCE(debug_locks);
+   bool rcu = warn_rcu_enter();
 
/* Note: the following can be executed concurrently, so be careful. */
pr_warn("\n");
@@ -6595,5 +6597,6 @@ void lockdep_rcu_suspicious(const char *
lockdep_print_held_locks(curr);
pr_warn("\nstack backtrace:\n");
dump_stack();
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 #include 
 #include 
 
@@ -679,6 +680,7 @@ void __warn(const char *file, int line,
 void warn_slowpath_fmt(const char *file, int line, unsigned taint,
   const char *fmt, ...)
 {
+   bool rcu = warn_rcu_enter();
struct warn_args args;
 
pr_warn(CUT_HERE);
@@ -693,11 +695,13 @@ void warn_slowpath_fmt(const char *file,
va_start(args.args, fmt);
__warn(file, line, __builtin_return_address(0), taint, NULL, &args);
va_end(args.args);
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL(warn_slowpath_fmt);
 #else
 void __warn_printk(const char *fmt, ...)
 {
+   bool rcu = warn_rcu_enter();
va_list args;
 
pr_warn(CUT_HERE);
@@ -705,6 +709,7 @@ void __warn_printk(const char *fmt, ...)
va_start(args, fmt);
vprintk(fmt, args);
va_end(args);
+   warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL(__warn_printk);
 #endif
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 
 extern struct bug_entry __start___b

Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-25 Thread Peter Zijlstra
On Tue, Jan 24, 2023 at 05:12:14PM +, Mark Rutland wrote:
> On Tue, Jan 24, 2023 at 03:44:35PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 23, 2023 at 05:07:53PM -0500, Steven Rostedt wrote:
> > 
> > > Actually, perhaps we can just add this, and all you need to do is create
> > > and set CONFIG_NO_RCU_TRACING (or some other name).
> > 
> > Elsewhere I've used CONFIG_ARCH_WANTS_NO_INSTR for this.
> 
> Yes please; if we use CONFIG_ARCH_WANTS_NO_INSTR then arm64 will get this "for
> free" once we add the missing checks (which I assume we need) in our 
> ftrace_prepare_return().
> 
> > Anyway, I took it for a spin and it doesn't seem to do the job.
> > 
> > With my patch the first splat is
> > 
> >   "RCU not on for: cpuidle_poll_time+0x0/0x70"
> > 
> > While with yours I seems to get the endless:
> > 
> >   "WARNING: suspicious RCU usage"
> > 
> > thing. Let me see if I can figure out where it goes side-ways.
> 
> Hmmm... for WARN_ONCE() don't we need to wake RCU first also? I thought we
> needed that at least for the printk machinery?

OK, the below seems to work nice for me -- although I'm still on a
hacked up printk, but the recursive RCU not watching fail seems to be
tamed.

Ofc. Paul might have an opinion on this glorious bodge ;-)

---

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index c303f7a114e9..d48cd92d2364 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,21 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
+#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+# define trace_warn_on_no_rcu(ip)  \
+   ({  \
+   bool __ret = !rcu_is_watching();\
+   if (__ret && !trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
+   trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
+   WARN_ONCE(true, "RCU not on for: %pS\n", (void *)ip); \
+   trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
+   }   \
+   __ret;  \
+   })
+#else
+# define trace_warn_on_no_rcu(ip)  false
+#endif
+
 /*
  * Preemption is promised to be disabled when return bit >= 0.
  */
@@ -144,6 +159,9 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign
unsigned int val = READ_ONCE(current->trace_recursion);
int bit;
 
+   if (trace_warn_on_no_rcu(ip))
+   return -1;
+
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*
diff --git a/lib/bug.c b/lib/bug.c
index c223a2575b72..0a10643ea168 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include <linux/context_tracking.h>
 
 extern struct bug_entry __start___bug_table[], __stop___bug_table[];
 
@@ -153,7 +154,7 @@ struct bug_entry *find_bug(unsigned long bugaddr)
return module_find_bug(bugaddr);
 }
 
-enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
+static enum bug_trap_type __report_bug(unsigned long bugaddr, struct pt_regs *regs)
 {
struct bug_entry *bug;
const char *file;
@@ -209,6 +210,30 @@ enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
return BUG_TRAP_TYPE_BUG;
 }
 
+enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
+{
+   enum bug_trap_type ret;
+   bool rcu = false;
+
+#ifdef CONFIG_CONTEXT_TRACKING_IDLE
+   /*
+* Horrible hack to shut up recursive RCU isn't watching fail since
+* lots of the actual reporting also relies on RCU.
+*/
+   if (!rcu_is_watching()) {
+   rcu = true;
+   ct_state_inc(RCU_DYNTICKS_IDX);
+   }
+#endif
+
+   ret = __report_bug(bugaddr, regs);
+
+   if (rcu)
+   ct_state_inc(RCU_DYNTICKS_IDX);
+
+   return ret;
+}
+
 static void clear_once_table(struct bug_entry *start, struct bug_entry *end)
 {
struct bug_entry *bug;


Re: [PATCH 0/6] A few cpuidle vs rcu fixes

2023-01-25 Thread Peter Zijlstra
On Wed, Jan 25, 2023 at 10:35:16AM +0100, Peter Zijlstra wrote:
> tip/sched/core contains the following patch addressing this:
> 
> ---
> commit 9aedeaed6fc6fe8452b9b8225e95cc2b8631ff91
> Author: Peter Zijlstra 
> Date:   Thu Jan 12 20:43:49 2023 +0100
> 
> tracing, hardirq: No moar _rcuidle() tracing
> 
> Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
> _rcuidle() tracepoint through local_irq_{en,dis}able().
> 
> For 'sane' configs, these calls will only happen with RCU enabled and
> as such can use the regular tracepoint. This also means it's possible
> to trace them from NMI context again.
> 
> Reported-by: kernel test robot 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ingo Molnar 
> Link: https://lore.kernel.org/r/20230112195541.477416...@infradead.org
> 
> diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
> index 629f2854e12b..f992444a0b1f 100644
> --- a/kernel/trace/trace_preemptirq.c
> +++ b/kernel/trace/trace_preemptirq.c
> @@ -19,6 +19,20 @@
>  /* Per-cpu variable to prevent redundant calls when IRQs already off */
>  static DEFINE_PER_CPU(int, tracing_irq_cpu);
>  
> +/*
> + * Use regular trace points on architectures that implement noinstr
> + * tooling: these calls will only happen with RCU enabled, which can
> + * use a regular tracepoint.
> + *
> + * On older architectures, use the rcuidle tracing methods (which
> + * aren't NMI-safe - so exclude NMI contexts):
> + */
> +#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#define trace(point) trace_##point
> +#else
> +#define trace(point) if (!in_nmi()) trace_##point##_rcuidle
> +#endif
> +
>  /*
>   * Like trace_hardirqs_on() but without the lockdep invocation. This is
>   * used in the low level entry code where the ordering vs. RCU is important

For some reason I missed the trace_preempt_{on,off} things, so that then
gets the below on top or so.

diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index f992444a0b1f..ea96b41c8838 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -100,15 +100,13 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
 
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_enable_rcuidle(a0, a1);
+   trace(preempt_enable)(a0, a1);
tracer_preempt_on(a0, a1);
 }
 
 void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
-   if (!in_nmi())
-   trace_preempt_disable_rcuidle(a0, a1);
+   trace(preempt_disable)(a0, a1);
tracer_preempt_off(a0, a1);
 }
 #endif


Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-25 Thread Peter Zijlstra
On Tue, Jan 24, 2023 at 05:12:14PM +, Mark Rutland wrote:
> On Tue, Jan 24, 2023 at 03:44:35PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 23, 2023 at 05:07:53PM -0500, Steven Rostedt wrote:
> > 
> > > Actually, perhaps we can just add this, and all you need to do is create
> > > and set CONFIG_NO_RCU_TRACING (or some other name).
> > 
> > Elsewhere I've used CONFIG_ARCH_WANTS_NO_INSTR for this.
> 
> Yes please; if we use CONFIG_ARCH_WANTS_NO_INSTR then arm64 will get this "for
> free" once we add the missing checks (which I assume we need) in our 
> ftrace_prepare_return().

Aye.

> > Anyway, I took it for a spin and it doesn't seem to do the job.
> > 
> > With my patch the first splat is
> > 
> >   "RCU not on for: cpuidle_poll_time+0x0/0x70"
> > 
> > While with yours I seems to get the endless:
> > 
> >   "WARNING: suspicious RCU usage"
> > 
> > thing. Let me see if I can figure out where it goes side-ways.
> 
> Hmmm... for WARN_ONCE() don't we need to wake RCU first also? I thought we
> needed that at least for the printk machinery?

Yeah, I'm currently running with a hacked up printk that redirects
everything into early_printk() but it still trips up lots.

I was just about to go stick some RCU magic into WARN itself, this isn't
going to be the only site triggering this fail-cascade.


Re: [PATCH 0/6] A few cpuidle vs rcu fixes

2023-01-25 Thread Peter Zijlstra
On Tue, Jan 24, 2023 at 06:39:12PM +, Mark Rutland wrote:
> On Tue, Jan 24, 2023 at 05:30:29PM +, Mark Rutland wrote:
> > On Tue, Jan 24, 2023 at 04:34:23PM +, Mark Rutland wrote:
> > > Hi Peter,
> > > 
> > > On Mon, Jan 23, 2023 at 09:50:09PM +0100, Peter Zijlstra wrote:
> > > > 0-day robot reported graph-tracing made the cpuidle-vs-rcu rework go splat.
> > > 
> > > Do you have a link toe the splat somewhere?
> > > 
> > > I'm assuming that this is partially generic, and I'd like to make sure I test
> > > the right thing on arm64. I'll throw my usual lockdep options at the ftrace
> > > selftests...
> > 
> > Hmm... with the tip sched/core branch, with or without this series applied atop
> > I see a couple of splats which I don't see with v6.2-rc1 (which seems to be
> > entirely clean). I'm not seeing any other splats.
> > 
> > I can trigger those reliably with the 'toplevel-enable.tc' ftrace test:
> > 
> >   ./ftracetest test.d/event/toplevel-enable.tc
> > 
> > Splats below; I'll dig into this a bit more tomorrow.
> > 
> > [   65.729252] [ cut here ]
> > [   65.730397] WARNING: CPU: 3 PID: 1162 at 
> > include/trace/events/preemptirq.h:55 trace_preempt_on+0x68/0x70
> 
> The line number here is a bit inscrutable, but a bisect led me down to commit
> 
>   408b961146be4c1a ("tracing: WARN on rcuidle")
> 
> ... and it appears this must be the RCUIDLE_COND() warning that adds, and that
> seems to be because trace_preempt_on() calls trace_preempt_enable_rcuidle():
> 
> | void trace_preempt_on(unsigned long a0, unsigned long a1)
> | {
> | if (!in_nmi())
> | trace_preempt_enable_rcuidle(a0, a1);
> | tracer_preempt_on(a0, a1);
> | }
> 
> It looks like that tracing is dependent upon CONFIG_TRACE_PREEMPT_TOGGLE, and 
> I
> have that because I enabled CONFIG_PREEMPT_TRACER. I reckon the same should
> happen on x86 with CONFIG_PREEMPT_TRACER=y.
> 
> IIUC we'll need to clean up that trace_.*_rcuidle() usage too, but I'm not
> entirely sure how to do that.

tip/sched/core contains the following patch addressing this:

---
commit 9aedeaed6fc6fe8452b9b8225e95cc2b8631ff91
Author: Peter Zijlstra 
Date:   Thu Jan 12 20:43:49 2023 +0100

tracing, hardirq: No moar _rcuidle() tracing

Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
_rcuidle() tracepoint through local_irq_{en,dis}able().

For 'sane' configs, these calls will only happen with RCU enabled and
as such can use the regular tracepoint. This also means it's possible
to trace them from NMI context again.

Reported-by: kernel test robot 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20230112195541.477416...@infradead.org

diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 629f2854e12b..f992444a0b1f 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -19,6 +19,20 @@
 /* Per-cpu variable to prevent redundant calls when IRQs already off */
 static DEFINE_PER_CPU(int, tracing_irq_cpu);
 
+/*
+ * Use regular trace points on architectures that implement noinstr
+ * tooling: these calls will only happen with RCU enabled, which can
+ * use a regular tracepoint.
+ *
+ * On older architectures, use the rcuidle tracing methods (which
+ * aren't NMI-safe - so exclude NMI contexts):
+ */
+#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#define trace(point)   trace_##point
+#else
+#define trace(point)   if (!in_nmi()) trace_##point##_rcuidle
+#endif
+
 /*
  * Like trace_hardirqs_on() but without the lockdep invocation. This is
  * used in the low level entry code where the ordering vs. RCU is important
@@ -28,8 +42,7 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
 void trace_hardirqs_on_prepare(void)
 {
if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
-   trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
this_cpu_write(tracing_irq_cpu, 0);
}
@@ -40,8 +53,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on_prepare);
 void trace_hardirqs_on(void)
 {
if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
-   trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
this_cpu_write(tracing_irq_cpu, 0);
}
@@ -63,8 +75,7 @@ void trace_hardirqs_off_finish(void)
 

Re: [PATCH 0/6] A few cpuidle vs rcu fixes

2023-01-25 Thread Peter Zijlstra
On Tue, Jan 24, 2023 at 04:34:23PM +, Mark Rutland wrote:
> Hi Peter,
> 
> On Mon, Jan 23, 2023 at 09:50:09PM +0100, Peter Zijlstra wrote:
> > 0-day robot reported graph-tracing made the cpuidle-vs-rcu rework go splat.
> 
> Do you have a link to the splat somewhere?
> 
> I'm assuming that this is partially generic, and I'd like to make sure I test
> the right thing on arm64. I'll throw my usual lockdep options at the ftrace
> selftests...

0-day triggered this by running tools/testing/selftests/ftrace/ftracetest,
which is what I've been using to reproduce.

If that doesn't work for you I can try and dig out the 0day email to see
if it has more details.


Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Peter Zijlstra
On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..6c7c70bf50dd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,13 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;

We have __private and ACCESS_PRIVATE() to help with enforcing this.


Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-24 Thread Peter Zijlstra
On Mon, Jan 23, 2023 at 05:07:53PM -0500, Steven Rostedt wrote:

> Actually, perhaps we can just add this, and all you need to do is create
> and set CONFIG_NO_RCU_TRACING (or some other name).

Elsewhere I've used CONFIG_ARCH_WANTS_NO_INSTR for this.

Anyway, I took it for a spin and it doesn't seem to do the job.

With my patch the first splat is

  "RCU not on for: cpuidle_poll_time+0x0/0x70"

While with yours I seem to get the endless:

  "WARNING: suspicious RCU usage"

thing. Let me see if I can figure out where it goes side-ways.


[PATCH 6/6] cpuidle: Fix poll_idle() noinstr annotation

2023-01-23 Thread Peter Zijlstra
The instrumentation_begin()/end() annotations in poll_idle() were
complete nonsense. Specifically they caused tracing to happen in the
middle of noinstr code, resulting in RCU splats.

Now that local_clock() is noinstr, mark up the rest and let it rip.

Fixes: 00717eb8c955 ("cpuidle: Annotate poll_idle()")
Signed-off-by: Peter Zijlstra (Intel) 
Reported-by: kernel test robot 
Link: https://lore.kernel.org/oe-lkp/202301192148.58ece903-oliver.s...@intel.com
---
 drivers/cpuidle/cpuidle.c|2 +-
 drivers/cpuidle/poll_state.c |2 --
 2 files changed, 1 insertion(+), 3 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -426,7 +426,7 @@ void cpuidle_reflect(struct cpuidle_devi
  * @dev:   the cpuidle device
  *
  */
-u64 cpuidle_poll_time(struct cpuidle_driver *drv,
+__cpuidle u64 cpuidle_poll_time(struct cpuidle_driver *drv,
  struct cpuidle_device *dev)
 {
int i;
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -15,7 +15,6 @@ static int __cpuidle poll_idle(struct cp
 {
u64 time_start;
 
-   instrumentation_begin();
time_start = local_clock();
 
dev->poll_time_limit = false;
@@ -42,7 +41,6 @@ static int __cpuidle poll_idle(struct cp
raw_local_irq_disable();
 
current_clr_polling();
-   instrumentation_end();
 
return index;
 }




[PATCH 2/6] x86/pvclock: improve atomic update of last_value in pvclock_clocksource_read

2023-01-23 Thread Peter Zijlstra
From: Uros Bizjak 

Improve atomic update of last_value in pvclock_clocksource_read:

- Atomic update can be skipped if the "last_value" is already
  equal to "ret".

- The detection of atomic update failure is not correct. The value,
  returned by atomic64_cmpxchg should be compared to the old value
  from the location to be updated. If these two are the same, then
  atomic update succeeded and "last_value" location is updated to
  "ret" in an atomic way. Otherwise, the atomic update failed and
  it should be retried with the value from "last_value" - exactly
  what atomic64_try_cmpxchg does in a correct and more optimal way.

Signed-off-by: Uros Bizjak 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20230118202330.3740-1-ubiz...@gmail.com
---
 arch/x86/kernel/pvclock.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index eda37df016f0..5a2a517dd61b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -102,10 +102,9 @@ u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info 
*src)
 */
	last = atomic64_read(&last_value);
do {
-   if (ret < last)
+   if (ret <= last)
return last;
-   last = atomic64_cmpxchg(&last_value, last, ret);
-   } while (unlikely(last != ret));
+   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
-- 
2.39.0
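
For context: atomic64_try_cmpxchg() folds the failure-path re-read into the
primitive itself. Its generic fallback behaves roughly like the sketch below
(atomic64_try_cmpxchg_sketch is an illustrative name, not the actual x86
implementation):

static __always_inline bool
atomic64_try_cmpxchg_sketch(atomic64_t *v, s64 *old, s64 new)
{
	s64 r = atomic64_cmpxchg(v, *old, new);	/* returns the value observed at *v */

	if (r != *old) {
		*old = r;	/* failed: hand the observed value back to the caller */
		return false;
	}
	return true;		/* succeeded: *v held *old and now holds new */
}

Which is why the new loop can simply retry with "last" already refreshed, and
why comparing the cmpxchg return value against "ret" (as the old code did) is
the wrong success test.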





[PATCH 4/6] x86: Mark sched_clock() noinstr

2023-01-23 Thread Peter Zijlstra
In order to use sched_clock() from noinstr code, mark it and all its
implementations noinstr.

The whole pvclock thing (used by KVM/Xen) is a bit of a pain,
since it calls out to watchdogs; create a
pvclock_clocksource_read_nowd() variant that doesn't do that and can be
noinstr.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/kvmclock.h |2 +-
 arch/x86/include/asm/paravirt.h |2 +-
 arch/x86/include/asm/pvclock.h  |3 ++-
 arch/x86/kernel/cpu/vmware.c|2 +-
 arch/x86/kernel/kvmclock.c  |6 +++---
 arch/x86/kernel/pvclock.c   |   19 +++
 arch/x86/kernel/tsc.c   |7 +++
 arch/x86/xen/time.c |   12 ++--
 include/linux/math64.h  |4 ++--
 9 files changed, 38 insertions(+), 19 deletions(-)

--- a/arch/x86/include/asm/kvmclock.h
+++ b/arch/x86/include/asm/kvmclock.h
@@ -8,7 +8,7 @@ extern struct clocksource kvm_clock;
 
 DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
 
-static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
+static __always_inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
 {
	return &this_cpu_read(hv_clock_per_cpu)->pvti;
 }
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -26,7 +26,7 @@ DECLARE_STATIC_CALL(pv_sched_clock, dumm
 
 void paravirt_set_sched_clock(u64 (*func)(void));
 
-static inline u64 paravirt_sched_clock(void)
+static __always_inline u64 paravirt_sched_clock(void)
 {
return static_call(pv_sched_clock)();
 }
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -7,6 +7,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src);
 u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
@@ -39,7 +40,7 @@ bool pvclock_read_retry(const struct pvc
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
  */
-static inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int shift)
+static __always_inline u64 pvclock_scale_delta(u64 delta, u32 mul_frac, int 
shift)
 {
u64 product;
 #ifdef __i386__
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -143,7 +143,7 @@ static __init int parse_no_stealacc(char
 }
 early_param("no-steal-acc", parse_no_stealacc);
 
-static unsigned long long notrace vmware_sched_clock(void)
+static noinstr u64 vmware_sched_clock(void)
 {
unsigned long long ns;
 
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -71,12 +71,12 @@ static int kvm_set_wallclock(const struc
return -ENODEV;
 }
 
-static u64 kvm_clock_read(void)
+static noinstr u64 kvm_clock_read(void)
 {
u64 ret;
 
preempt_disable_notrace();
-   ret = pvclock_clocksource_read(this_cpu_pvti());
+   ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
preempt_enable_notrace();
return ret;
 }
@@ -86,7 +86,7 @@ static u64 kvm_clock_get_cycles(struct c
return kvm_clock_read();
 }
 
-static u64 kvm_sched_clock_read(void)
+static noinstr u64 kvm_sched_clock_read(void)
 {
return kvm_clock_read() - kvm_sched_clock_offset;
 }
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -64,7 +64,8 @@ u8 pvclock_read_flags(struct pvclock_vcp
return flags & valid_flags;
 }
 
-u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
 {
unsigned version;
u64 ret;
@@ -77,7 +78,7 @@ u64 pvclock_clocksource_read(struct pvcl
flags = src->flags;
} while (pvclock_read_retry(src, version));
 
-   if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
+   if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
pvclock_touch_watchdogs();
}
@@ -100,15 +101,25 @@ u64 pvclock_clocksource_read(struct pvcl
 * updating at the same time, and one of them could be slightly behind,
 * making the assumption that last_value always go forward fail to hold.
 */
-   last = atomic64_read(&last_value);
+   last = arch_atomic64_read(&last_value);
do {
if (ret <= last)
return last;
-   } while (!atomic64_try_cmpxchg(&last_value, &last, ret));
+   } while (!arch_atomic64_try_cmpxchg(&last_value, &last, ret));
 
return ret;
 }
 
+u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+   return __pvclock_clocksource_read(src, true);
+}
+
+noinstr u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src)
+{
+   return __pv

[PATCH 1/6] x86: Always inline arch_atomic64

2023-01-23 Thread Peter Zijlstra
As already done for regular arch_atomic, always inline arch_atomic64.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/atomic64_32.h |   44 ++---
 arch/x86/include/asm/atomic64_64.h |   36 +++---
 2 files changed, 40 insertions(+), 40 deletions(-)

--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -71,7 +71,7 @@ ATOMIC64_DECL(add_unless);
  * the old value.
  */
 
-static inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
+static __always_inline s64 arch_atomic64_cmpxchg(atomic64_t *v, s64 o, s64 n)
 {
	return arch_cmpxchg64(&v->counter, o, n);
 }
@@ -85,7 +85,7 @@ static inline s64 arch_atomic64_cmpxchg(
  * Atomically xchgs the value of @v to @n and returns
  * the old value.
  */
-static inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
+static __always_inline s64 arch_atomic64_xchg(atomic64_t *v, s64 n)
 {
s64 o;
unsigned high = (unsigned)(n >> 32);
@@ -104,7 +104,7 @@ static inline s64 arch_atomic64_xchg(ato
  *
  * Atomically sets the value of @v to @n.
  */
-static inline void arch_atomic64_set(atomic64_t *v, s64 i)
+static __always_inline void arch_atomic64_set(atomic64_t *v, s64 i)
 {
unsigned high = (unsigned)(i >> 32);
unsigned low = (unsigned)i;
@@ -119,7 +119,7 @@ static inline void arch_atomic64_set(ato
  *
  * Atomically reads the value of @v and returns it.
  */
-static inline s64 arch_atomic64_read(const atomic64_t *v)
+static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
 {
s64 r;
	alternative_atomic64(read, "=&A" (r), "c" (v) : "memory");
@@ -133,7 +133,7 @@ static inline s64 arch_atomic64_read(con
  *
  * Atomically adds @i to @v and returns @i + *@v
  */
-static inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(add_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -145,7 +145,7 @@ static inline s64 arch_atomic64_add_retu
 /*
  * Other variants with different arithmetic operators:
  */
-static inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub_return(s64 i, atomic64_t *v)
 {
alternative_atomic64(sub_return,
 ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -154,7 +154,7 @@ static inline s64 arch_atomic64_sub_retu
 }
 #define arch_atomic64_sub_return arch_atomic64_sub_return
 
-static inline s64 arch_atomic64_inc_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_inc_return(atomic64_t *v)
 {
s64 a;
	alternative_atomic64(inc_return, "=&A" (a),
@@ -163,7 +163,7 @@ static inline s64 arch_atomic64_inc_retu
 }
 #define arch_atomic64_inc_return arch_atomic64_inc_return
 
-static inline s64 arch_atomic64_dec_return(atomic64_t *v)
+static __always_inline s64 arch_atomic64_dec_return(atomic64_t *v)
 {
s64 a;
	alternative_atomic64(dec_return, "=&A" (a),
@@ -179,7 +179,7 @@ static inline s64 arch_atomic64_dec_retu
  *
  * Atomically adds @i to @v.
  */
-static inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_add(s64 i, atomic64_t *v)
 {
__alternative_atomic64(add, add_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -194,7 +194,7 @@ static inline s64 arch_atomic64_add(s64
  *
  * Atomically subtracts @i from @v.
  */
-static inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
+static __always_inline s64 arch_atomic64_sub(s64 i, atomic64_t *v)
 {
__alternative_atomic64(sub, sub_return,
   ASM_OUTPUT2("+A" (i), "+c" (v)),
@@ -208,7 +208,7 @@ static inline s64 arch_atomic64_sub(s64
  *
  * Atomically increments @v by 1.
  */
-static inline void arch_atomic64_inc(atomic64_t *v)
+static __always_inline void arch_atomic64_inc(atomic64_t *v)
 {
__alternative_atomic64(inc, inc_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -221,7 +221,7 @@ static inline void arch_atomic64_inc(ato
  *
  * Atomically decrements @v by 1.
  */
-static inline void arch_atomic64_dec(atomic64_t *v)
+static __always_inline void arch_atomic64_dec(atomic64_t *v)
 {
__alternative_atomic64(dec, dec_return, /* no output */,
   "S" (v) : "memory", "eax", "ecx", "edx");
@@ -237,7 +237,7 @@ static inline void arch_atomic64_dec(ato
  * Atomically adds @a to @v, so long as it was not @u.
  * Returns non-zero if the add was done, zero otherwise.
  */
-static inline int arch_atomic64_add_unless(atomic64_t *v, s64 a, s64 u)
+static __always_in

[PATCH 0/6] A few cpuidle vs rcu fixes

2023-01-23 Thread Peter Zijlstra
0-day robot reported graph-tracing made the cpuidle-vs-rcu rework go splat.

These patches appear to cure this, the ftrace selftest now runs to completion
without spamming scary messages to dmesg.

---
 arch/x86/include/asm/atomic64_32.h | 44 +++---
 arch/x86/include/asm/atomic64_64.h | 36 +++
 arch/x86/include/asm/kvmclock.h|  2 +-
 arch/x86/include/asm/paravirt.h|  2 +-
 arch/x86/include/asm/pvclock.h |  3 ++-
 arch/x86/kernel/cpu/vmware.c   |  2 +-
 arch/x86/kernel/ftrace.c   |  3 +++
 arch/x86/kernel/kvmclock.c |  6 +++---
 arch/x86/kernel/pvclock.c  | 22 +--
 arch/x86/kernel/tsc.c  |  7 +++---
 arch/x86/xen/time.c| 12 +--
 drivers/cpuidle/cpuidle.c  |  2 +-
 drivers/cpuidle/poll_state.c   |  2 --
 include/linux/math64.h |  4 ++--
 include/linux/sched/clock.h|  8 +++
 kernel/sched/clock.c   | 27 +--
 16 files changed, 107 insertions(+), 75 deletions(-)




[PATCH 5/6] sched/clock: Make local_clock() noinstr

2023-01-23 Thread Peter Zijlstra
With sched_clock() noinstr, provide a noinstr implementation of
local_clock().

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/sched/clock.h |8 +++-
 kernel/sched/clock.c|   27 +--
 2 files changed, 24 insertions(+), 11 deletions(-)

--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -45,7 +45,7 @@ static inline u64 cpu_clock(int cpu)
return sched_clock();
 }
 
-static inline u64 local_clock(void)
+static __always_inline u64 local_clock(void)
 {
return sched_clock();
 }
@@ -79,10 +79,8 @@ static inline u64 cpu_clock(int cpu)
return sched_clock_cpu(cpu);
 }
 
-static inline u64 local_clock(void)
-{
-   return sched_clock_cpu(raw_smp_processor_id());
-}
+extern u64 local_clock(void);
+
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -93,7 +93,7 @@ struct sched_clock_data {
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, 
sched_clock_data);
 
-notrace static inline struct sched_clock_data *this_scd(void)
+static __always_inline struct sched_clock_data *this_scd(void)
 {
	return this_cpu_ptr(&sched_clock_data);
 }
@@ -244,12 +244,12 @@ late_initcall(sched_clock_init_late);
  * min, max except they take wrapping into account
  */
 
-notrace static inline u64 wrap_min(u64 x, u64 y)
+static __always_inline u64 wrap_min(u64 x, u64 y)
 {
return (s64)(x - y) < 0 ? x : y;
 }
 
-notrace static inline u64 wrap_max(u64 x, u64 y)
+static __always_inline u64 wrap_max(u64 x, u64 y)
 {
return (s64)(x - y) > 0 ? x : y;
 }
@@ -260,7 +260,7 @@ notrace static inline u64 wrap_max(u64 x
  *  - filter out backward motion
  *  - use the GTOD tick value to create a window to filter crazy TSC values
  */
-notrace static u64 sched_clock_local(struct sched_clock_data *scd)
+static __always_inline u64 sched_clock_local(struct sched_clock_data *scd)
 {
u64 now, clock, old_clock, min_clock, max_clock, gtod;
s64 delta;
@@ -287,13 +287,28 @@ notrace static u64 sched_clock_local(str
clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
 
-   if (!try_cmpxchg64(&scd->clock, &old_clock, clock))
+   if (!arch_try_cmpxchg64(&scd->clock, &old_clock, clock))
goto again;
 
return clock;
 }
 
-notrace static u64 sched_clock_remote(struct sched_clock_data *scd)
+noinstr u64 local_clock(void)
+{
+   u64 clock;
+
+   if (static_branch_likely(&__sched_clock_stable))
+   return sched_clock() + __sched_clock_offset;
+
+   preempt_disable_notrace();
+   clock = sched_clock_local(this_scd());
+   preempt_enable_notrace();
+
+   return clock;
+}
+EXPORT_SYMBOL_GPL(local_clock);
+
+static notrace u64 sched_clock_remote(struct sched_clock_data *scd)
 {
struct sched_clock_data *my_scd = this_scd();
u64 this_clock, remote_clock;




[PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Peter Zijlstra
All RCU disabled code should be noinstr and hence we should never get
here -- when we do, WARN about it and make sure to not actually do
tracing.

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/x86/kernel/ftrace.c |3 +++
 1 file changed, 3 insertions(+)

--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
	if (unlikely(atomic_read(&current->tracing_graph_pause)))
return;
 
+   if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
+   return;
+
bit = ftrace_test_recursion_trylock(ip, *parent);
if (bit < 0)
return;




Re: [PATCH v3 16/51] cpuidle: Annotate poll_idle()

2023-01-20 Thread Peter Zijlstra
On Thu, Jan 12, 2023 at 08:43:30PM +0100, Peter Zijlstra wrote:
> The __cpuidle functions will become a noinstr class, as such they need
> explicit annotations.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Rafael J. Wysocki 
> Acked-by: Frederic Weisbecker 
> Tested-by: Tony Lindgren 
> Tested-by: Ulf Hansson 
> ---
>  drivers/cpuidle/poll_state.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -13,7 +13,10 @@
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  struct cpuidle_driver *drv, int index)
>  {
> - u64 time_start = local_clock();
> + u64 time_start;
> +
> + instrumentation_begin();
> + time_start = local_clock();
>  
>   dev->poll_time_limit = false;
>  
> @@ -39,6 +42,7 @@ static int __cpuidle poll_idle(struct cp
>   raw_local_irq_disable();
>  
>   current_clr_polling();
> + instrumentation_end();
>  
>   return index;
>  }

Pff, this patch is garbage. Whoever wrote it didn't have his brain
engaged :/

Something like the below fixes it, but I still need to build me funny
configs like ia64 and paravirt to see if I didn't wreck me something...

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a78e73da4a74..70c07e11caa6 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -215,7 +215,7 @@ static void __init cyc2ns_init_secondary_cpus(void)
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
-u64 native_sched_clock(void)
+noinstr u64 native_sched_clock(void)
 {
if (static_branch_likely(&__use_tsc)) {
u64 tsc_now = rdtsc();
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 500d1720421e..0b00f21cefe3 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -426,7 +426,7 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
  * @dev:   the cpuidle device
  *
  */
-u64 cpuidle_poll_time(struct cpuidle_driver *drv,
+__cpuidle u64 cpuidle_poll_time(struct cpuidle_driver *drv,
  struct cpuidle_device *dev)
 {
int i;
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index d25ec52846e6..bdcfeaecd228 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -15,7 +15,6 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
 {
u64 time_start;
 
-   instrumentation_begin();
time_start = local_clock();
 
dev->poll_time_limit = false;
@@ -42,7 +41,6 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
raw_local_irq_disable();
 
current_clr_polling();
-   instrumentation_end();
 
return index;
 }
diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index 867d588314e0..7960f0769884 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -45,7 +45,7 @@ static inline u64 cpu_clock(int cpu)
return sched_clock();
 }
 
-static inline u64 local_clock(void)
+static __always_inline u64 local_clock(void)
 {
return sched_clock();
 }
@@ -79,7 +79,7 @@ static inline u64 cpu_clock(int cpu)
return sched_clock_cpu(cpu);
 }
 
-static inline u64 local_clock(void)
+static __always_inline u64 local_clock(void)
 {
return sched_clock_cpu(raw_smp_processor_id());
 }
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index e374c0c923da..6b3b0559e53c 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -260,7 +260,7 @@ notrace static inline u64 wrap_max(u64 x, u64 y)
  *  - filter out backward motion
  *  - use the GTOD tick value to create a window to filter crazy TSC values
  */
-notrace static u64 sched_clock_local(struct sched_clock_data *scd)
+noinstr static u64 sched_clock_local(struct sched_clock_data *scd)
 {
u64 now, clock, old_clock, min_clock, max_clock, gtod;
s64 delta;
@@ -287,7 +287,7 @@ notrace static u64 sched_clock_local(struct 
sched_clock_data *scd)
clock = wrap_max(clock, min_clock);
clock = wrap_min(clock, max_clock);
 
-   if (!try_cmpxchg64(&scd->clock, &old_clock, clock))
+   if (!arch_try_cmpxchg64(&scd->clock, &old_clock, clock))
goto again;
 
return clock;
@@ -360,7 +360,7 @@ notrace static u64 sched_clock_remote(struct 
sched_clock_data *scd)
  *
  * See cpu_clock().
  */
-notrace u64 sched_clock_cpu(int cpu)
+noinstr u64 sched_clock_cpu(int cpu)
 {
struct sched_clock_data *scd;
u64 clock;


Re: [PATCH] x86/paravirt: merge activate_mm and dup_mmap callbacks

2023-01-17 Thread Peter Zijlstra
On Sun, Jan 15, 2023 at 08:27:50PM -0800, Srivatsa S. Bhat wrote:

> I see that's not an issue right now since there is no other actual
> user for these callbacks. But are we sure that merging the callbacks
> just because the current user (Xen PV) has the same implementation for
> both is a good idea?

IIRC the pv_ops are part of the PARAVIRT_ME_HARDER (also spelled as
_XXL) suite of ops and they are only to be used by Xen PV, no new users
of these must happen.

The moment we can drop Xen PV (hopes and dreams etc..) all these things
go in the bin right along with it.




Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Peter Zijlstra
On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:

> I'm sorry to have to bear some bad news on that front. :(

Moo, something had to give..


> IIUC what's happenign here is the PSCI cpuidle driver has entered idle and RCU
> is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> local_daif_*() helpers poke lockdep and tracing, hence the call to
> trace_hardirqs_off() and the RCU usage.

Right, strictly speaking not needed at this point, IRQs should have been
traced off a long time ago.

> I think we need RCU to be watching all the way down to cpu_suspend(), and it's
> cpu_suspend() that should actually enter/exit idle context. That and we need 
> to
> make cpu_suspend() and the low-level PSCI invocation noinstr.
> 
> I'm not sure whether 32-bit will have a similar issue or not.

I'm not seeing 32bit or RISC-V have similar issues here, but who knows,
maybe I missed something.

In any case, the below ought to cure the ARM64 case and remove that last
known RCU_NONIDLE() user as a bonus.

---
diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c
index 41974a1a229a..42e19fff40ee 100644
--- a/arch/arm64/kernel/cpuidle.c
+++ b/arch/arm64/kernel/cpuidle.c
@@ -67,10 +67,10 @@ __cpuidle int acpi_processor_ffh_lpi_enter(struct 
acpi_lpi_state *lpi)
u32 state = lpi->address;
 
if (ARM64_LPI_IS_RETENTION_STATE(lpi->arch_flags))
-   return 
CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(psci_cpu_suspend_enter,
+   return 
CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(psci_cpu_suspend_enter,
lpi->index, state);
else
-   return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter,
+   return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter,
 lpi->index, state);
 }
 #endif
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index e7163f31f716..0fbdf5fe64d8 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,6 +105,10 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
long))
 * From this point debug exceptions are disabled to prevent
 * updates to mdscr register (saved and restored along with
 * general purpose registers) from kernel debuggers.
+*
+* Strictly speaking the trace_hardirqs_off() here is superfluous,
+* hardirqs should be firmly off by now. This really ought to use
+* something like raw_local_daif_save().
 */
flags = local_daif_save();
 
@@ -120,6 +125,8 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 */
	arm_cpuidle_save_irq_context(&context);
 
+   ct_cpuidle_enter();
+
if (__cpu_suspend_enter()) {
/* Call the suspend finisher */
ret = fn(arg);
@@ -133,8 +140,11 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
long))
 */
if (!ret)
ret = -EOPNOTSUPP;
+
+   ct_cpuidle_exit();
} else {
-   RCU_NONIDLE(__cpu_suspend_exit());
+   ct_cpuidle_exit();
+   __cpu_suspend_exit();
}
 
	arm_cpuidle_restore_irq_context(&context);
diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
index 4fc4e0381944..312a34ef28dc 100644
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -69,16 +69,12 @@ static __cpuidle int __psci_enter_domain_idle_state(struct 
cpuidle_device *dev,
else
pm_runtime_put_sync_suspend(pd_dev);
 
-   ct_cpuidle_enter();
-
state = psci_get_domain_state();
if (!state)
state = states[idx];
 
ret = psci_cpu_suspend_enter(state) ? -1 : idx;
 
-   ct_cpuidle_exit();
-
if (s2idle)
dev_pm_genpd_resume(pd_dev);
else
@@ -192,7 +188,7 @@ static __cpuidle int psci_enter_idle_state(struct 
cpuidle_device *dev,
 {
u32 *state = __this_cpu_read(psci_cpuidle_data.psci_states);
 
-   return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, 
state[idx]);
+   return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter, idx, 
state[idx]);
 }
 
 static const struct of_device_id psci_idle_state_match[] = {
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index e7bcfca4159f..f3a044fa4652 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -462,11 +462,22 @@ int psci_cpu_suspend_enter(u32 state)
if (!psci_power_state_loses_context(state)) {
struct arm_cpuidle_irq_context context;
 
+   ct_cpuidle_enter();
	arm_cpuidle_save_irq_context(&context);
ret = psci_ops.cpu_suspend(state, 0);
	arm_cpuidle_restore_irq_context(&context);
+ 

Re: [PATCH v3 35/51] trace,hardirq: No moar _rcuidle() tracing

2023-01-17 Thread Peter Zijlstra
On Tue, Jan 17, 2023 at 01:24:46PM +0900, Masami Hiramatsu wrote:
> Hi Peter,
> 
> On Thu, 12 Jan 2023 20:43:49 +0100
> Peter Zijlstra  wrote:
> 
> > Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
> > _rcuidle() tracepoint through local_irq_{en,dis}able().
> > 
> > For 'sane' configs, these calls will only happen with RCU enabled and
> > as such can use the regular tracepoint. This also means it's possible
> > to trace them from NMI context again.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> 
> The code looks good to me. I just have a question about comment.
> 
> > ---
> >  kernel/trace/trace_preemptirq.c |   21 +
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> > 
> > --- a/kernel/trace/trace_preemptirq.c
> > +++ b/kernel/trace/trace_preemptirq.c
> > @@ -20,6 +20,15 @@
> >  static DEFINE_PER_CPU(int, tracing_irq_cpu);
> >  
> >  /*
> > + * ...
> 
> Is this intended? Wouldn't you leave any comment here?

I indeed forgot to write the comment before posting, my bad :/ Ingo fixed
it up when he applied.


[PATCH v3 50/51] cpuidle: Comments about noinstr/__cpuidle

2023-01-12 Thread Peter Zijlstra
Add a few words on noinstr / __cpuidle usage.

Signed-off-by: Peter Zijlstra (Intel) 
---
 drivers/cpuidle/cpuidle.c  |   12 
 include/linux/compiler_types.h |   10 ++
 2 files changed, 22 insertions(+)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -252,6 +252,18 @@ noinstr int cpuidle_enter_state(struct c
instrumentation_begin();
}
 
+   /*
+* NOTE!!
+*
+* For cpuidle_state::enter() methods that do *NOT* set
+* CPUIDLE_FLAG_RCU_IDLE RCU will be disabled here and these functions
+* must be marked either noinstr or __cpuidle.
+*
+* For cpuidle_state::enter() methods that *DO* set
+* CPUIDLE_FLAG_RCU_IDLE this isn't required, but they must mark the
+* function calling ct_cpuidle_enter() as noinstr/__cpuidle and all
+* functions called within the RCU-idle region.
+*/
entered_state = target_state->enter(dev, drv, index);
 
if (WARN_ONCE(!irqs_disabled(), "%ps leaked IRQ state", 
target_state->enter))
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -233,6 +233,16 @@ struct ftrace_likely_data {
 
 #define noinstr __noinstr_section(".noinstr.text")
 
+/*
+ * The __cpuidle section is used twofold:
+ *
+ *  1) the original use -- identifying if a CPU is 'stuck' in idle state based
+ * on its instruction pointer. See cpu_in_idle().
+ *
+ *  2) suppressing instrumentation around where cpuidle disables RCU; where the
+ * function isn't strictly required for #1, this is interchangeable with
+ * noinstr.
+ */
 #define __cpuidle __noinstr_section(".cpuidle.text")
 
 #endif /* __KERNEL__ */
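
To make the rule above concrete, two hypothetical enter methods (a sketch only;
foo_* is not a real driver and cpu_do_idle() merely stands in for the low-level
idle operation):

/* Without CPUIDLE_FLAG_RCU_IDLE: RCU is already disabled around ->enter(),
 * so the whole callback must be __cpuidle (or noinstr): */
static __cpuidle int foo_enter_simple(struct cpuidle_device *dev,
				      struct cpuidle_driver *drv, int index)
{
	cpu_do_idle();			/* no tracing/instrumentation in here */
	return index;
}

/* With CPUIDLE_FLAG_RCU_IDLE: the driver does the RCU transition itself; the
 * function calling ct_cpuidle_enter() (here the enter method) and everything
 * inside the enter()/exit() window must still be __cpuidle/noinstr: */
static __cpuidle int foo_enter_rcu_idle(struct cpuidle_device *dev,
					struct cpuidle_driver *drv, int index)
{
	/* instrumentation is still fine here */
	ct_cpuidle_enter();
	cpu_do_idle();			/* RCU is not watching */
	ct_cpuidle_exit();
	return index;
}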




[PATCH v3 51/51] context_tracking: Fix noinstr vs KASAN

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: __ct_user_enter+0x72: call to 
__kasan_check_write() leaves .noinstr.text section
vmlinux.o: warning: objtool: __ct_user_exit+0x47: call to __kasan_check_write() 
leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/context_tracking.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -510,7 +510,7 @@ void noinstr __ct_user_enter(enum ctx_st
 * In this we case we don't care about any 
concurrency/ordering.
 */
if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE))
-   atomic_set(&ct->state, state);
+   arch_atomic_set(&ct->state, state);
} else {
/*
 * Even if context tracking is disabled on this CPU, 
because it's outside
@@ -527,7 +527,7 @@ void noinstr __ct_user_enter(enum ctx_st
 */
if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE)) {
/* Tracking for vtime only, no concurrent RCU 
EQS accounting */
-   atomic_set(&ct->state, state);
+   arch_atomic_set(&ct->state, state);
} else {
/*
 * Tracking for vtime and RCU EQS. Make sure we 
don't race
@@ -535,7 +535,7 @@ void noinstr __ct_user_enter(enum ctx_st
 * RCU only requires RCU_DYNTICKS_IDX 
increments to be fully
 * ordered.
 */
-   atomic_add(state, &ct->state);
+   arch_atomic_add(state, &ct->state);
}
}
}
@@ -630,12 +630,12 @@ void noinstr __ct_user_exit(enum ctx_sta
 * In this we case we don't care about any 
concurrency/ordering.
 */
if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE))
-   atomic_set(&ct->state, CONTEXT_KERNEL);
+   arch_atomic_set(&ct->state, CONTEXT_KERNEL);
 
} else {
if (!IS_ENABLED(CONFIG_CONTEXT_TRACKING_IDLE)) {
/* Tracking for vtime only, no concurrent RCU 
EQS accounting */
-   atomic_set(&ct->state, CONTEXT_KERNEL);
+   arch_atomic_set(&ct->state, CONTEXT_KERNEL);
} else {
/*
 * Tracking for vtime and RCU EQS. Make sure we 
don't race
@@ -643,7 +643,7 @@ void noinstr __ct_user_exit(enum ctx_sta
 * RCU only requires RCU_DYNTICKS_IDX 
increments to be fully
 * ordered.
 */
-   atomic_sub(state, &ct->state);
+   arch_atomic_sub(state, &ct->state);
}
}
}
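
For the record, the reason the arch_atomic_*() variants cure this: the regular
atomic_*() wrappers are the instrumented ones, generated roughly as in the
sketch below (simplified rendering of the atomic-instrumented.h wrapper):

static __always_inline void
atomic_set(atomic_t *v, int i)
{
	instrument_atomic_write(v, sizeof(*v));	/* KASAN/KCSAN hook, not noinstr */
	arch_atomic_set(v, i);
}

Calling that from noinstr code is what objtool complains about above; the
arch_atomic_*() variants skip the instrumentation layer entirely.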




[PATCH v3 29/51] cpuidle,tdx: Make tdx noinstr clean

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: __halt+0x2c: call to hcall_func.constprop.0() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: __halt+0x3f: call to __tdx_hypercall() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: __tdx_hypercall+0x66: call to 
__tdx_hypercall_failed() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/boot/compressed/vmlinux.lds.S |1 +
 arch/x86/coco/tdx/tdcall.S |2 ++
 arch/x86/coco/tdx/tdx.c|5 +++--
 3 files changed, 6 insertions(+), 2 deletions(-)

--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -34,6 +34,7 @@ SECTIONS
_text = .;  /* Text */
*(.text)
*(.text.*)
+   *(.noinstr.text)
_etext = . ;
}
.rodata : {
--- a/arch/x86/coco/tdx/tdcall.S
+++ b/arch/x86/coco/tdx/tdcall.S
@@ -31,6 +31,8 @@
  TDX_R12 | TDX_R13 | \
  TDX_R14 | TDX_R15 )
 
+.section .noinstr.text, "ax"
+
 /*
  * __tdx_module_call()  - Used by TDX guests to request services from
  * the TDX module (does not include VMM services) using TDCALL instruction.
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -53,8 +53,9 @@ static inline u64 _tdx_hypercall(u64 fn,
 }
 
 /* Called from __tdx_hypercall() for unrecoverable failure */
-void __tdx_hypercall_failed(void)
+noinstr void __tdx_hypercall_failed(void)
 {
+   instrumentation_begin();
panic("TDVMCALL failed. TDX module bug?");
 }
 
@@ -64,7 +65,7 @@ void __tdx_hypercall_failed(void)
  * Reusing the KVM EXIT_REASON macros makes it easier to connect the host and
  * guest sides of these calls.
  */
-static u64 hcall_func(u64 exit_reason)
+static __always_inline u64 hcall_func(u64 exit_reason)
 {
return exit_reason;
 }




[PATCH v3 41/51] cpuidle,clk: Remove trace_.*_rcuidle()

2023-01-12 Thread Peter Zijlstra
OMAP was the one and only user.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ulf Hansson 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/clk/clk.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -978,12 +978,12 @@ static void clk_core_disable(struct clk_
if (--core->enable_count > 0)
return;
 
-   trace_clk_disable_rcuidle(core);
+   trace_clk_disable(core);
 
if (core->ops->disable)
core->ops->disable(core->hw);
 
-   trace_clk_disable_complete_rcuidle(core);
+   trace_clk_disable_complete(core);
 
clk_core_disable(core->parent);
 }
@@ -1037,12 +1037,12 @@ static int clk_core_enable(struct clk_co
if (ret)
return ret;
 
-   trace_clk_enable_rcuidle(core);
+   trace_clk_enable(core);
 
if (core->ops->enable)
ret = core->ops->enable(core->hw);
 
-   trace_clk_enable_complete_rcuidle(core);
+   trace_clk_enable_complete(core);
 
if (ret) {
clk_core_disable(core->parent);




[PATCH v3 49/51] cpuidle, arch: Mark all regular cpuidle_state::enter methods __cpuidle

2023-01-12 Thread Peter Zijlstra
For all cpuidle drivers that do not use CPUIDLE_FLAG_RCU_IDLE (iow,
the simple ones) make sure all the functions are marked __cpuidle.

( due to lack of noinstr validation on these platforms it is entirely
  possible this isn't complete )

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm/kernel/cpuidle.c   |4 ++--
 arch/arm/mach-davinci/cpuidle.c |4 ++--
 arch/arm/mach-imx/cpuidle-imx5.c|4 ++--
 arch/arm/mach-imx/cpuidle-imx6sl.c  |4 ++--
 arch/arm/mach-imx/cpuidle-imx7ulp.c |4 ++--
 arch/arm/mach-s3c/cpuidle-s3c64xx.c |5 ++---
 arch/mips/kernel/idle.c |6 +++---
 7 files changed, 15 insertions(+), 16 deletions(-)

--- a/arch/arm/kernel/cpuidle.c
+++ b/arch/arm/kernel/cpuidle.c
@@ -26,8 +26,8 @@ static struct cpuidle_ops cpuidle_ops[NR
  *
  * Returns the index passed as parameter
  */
-int arm_cpuidle_simple_enter(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+__cpuidle int arm_cpuidle_simple_enter(struct cpuidle_device *dev, struct
+  cpuidle_driver *drv, int index)
 {
cpu_do_idle();
 
--- a/arch/arm/mach-davinci/cpuidle.c
+++ b/arch/arm/mach-davinci/cpuidle.c
@@ -44,8 +44,8 @@ static void davinci_save_ddr_power(int e
 }
 
 /* Actual code that puts the SoC in different idle states */
-static int davinci_enter_idle(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index)
+static __cpuidle int davinci_enter_idle(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
 {
davinci_save_ddr_power(1, ddr2_pdown);
cpu_do_idle();
--- a/arch/arm/mach-imx/cpuidle-imx5.c
+++ b/arch/arm/mach-imx/cpuidle-imx5.c
@@ -8,8 +8,8 @@
 #include 
 #include "cpuidle.h"
 
-static int imx5_cpuidle_enter(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int index)
+static __cpuidle int imx5_cpuidle_enter(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
 {
arm_pm_idle();
return index;
--- a/arch/arm/mach-imx/cpuidle-imx6sl.c
+++ b/arch/arm/mach-imx/cpuidle-imx6sl.c
@@ -11,8 +11,8 @@
 #include "common.h"
 #include "cpuidle.h"
 
-static int imx6sl_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx6sl_enter_wait(struct cpuidle_device *dev,
+  struct cpuidle_driver *drv, int index)
 {
imx6_set_lpm(WAIT_UNCLOCKED);
/*
--- a/arch/arm/mach-imx/cpuidle-imx7ulp.c
+++ b/arch/arm/mach-imx/cpuidle-imx7ulp.c
@@ -12,8 +12,8 @@
 #include "common.h"
 #include "cpuidle.h"
 
-static int imx7ulp_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx7ulp_enter_wait(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
 {
if (index == 1)
imx7ulp_set_lpm(ULP_PM_WAIT);
--- a/arch/arm/mach-s3c/cpuidle-s3c64xx.c
+++ b/arch/arm/mach-s3c/cpuidle-s3c64xx.c
@@ -19,9 +19,8 @@
 #include "regs-sys-s3c64xx.h"
 #include "regs-syscon-power-s3c64xx.h"
 
-static int s3c64xx_enter_idle(struct cpuidle_device *dev,
- struct cpuidle_driver *drv,
- int index)
+static __cpuidle int s3c64xx_enter_idle(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int index)
 {
unsigned long tmp;
 
--- a/arch/mips/kernel/idle.c
+++ b/arch/mips/kernel/idle.c
@@ -241,7 +241,7 @@ void __init check_wait(void)
}
 }
 
-void arch_cpu_idle(void)
+__cpuidle void arch_cpu_idle(void)
 {
if (cpu_wait)
cpu_wait();
@@ -249,8 +249,8 @@ void arch_cpu_idle(void)
 
 #ifdef CONFIG_CPU_IDLE
 
-int mips_cpuidle_wait_enter(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+__cpuidle int mips_cpuidle_wait_enter(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
 {
arch_cpu_idle();
return index;




[PATCH v3 37/51] cpuidle,omap3: Push RCU-idle into omap_sram_idle()

2023-01-12 Thread Peter Zijlstra
OMAP3 uses full SoC suspend modes as idle states, as such it needs the
whole power-domain and clock-domain code from the idle path.

All that code is not suitable to run with RCU disabled, as such push
RCU-idle deeper still.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Tony Lindgren 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/cpuidle34xx.c |4 +---
 arch/arm/mach-omap2/pm.h  |2 +-
 arch/arm/mach-omap2/pm34xx.c  |   12 ++--
 3 files changed, 12 insertions(+), 6 deletions(-)

--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -133,9 +133,7 @@ static int omap3_enter_idle(struct cpuid
}
 
/* Execute ARM wfi */
-   ct_cpuidle_enter();
-   omap_sram_idle();
-   ct_cpuidle_exit();
+   omap_sram_idle(true);
 
/*
 * Call idle CPU PM enter notifier chain to restore
--- a/arch/arm/mach-omap2/pm.h
+++ b/arch/arm/mach-omap2/pm.h
@@ -29,7 +29,7 @@ static inline int omap4_idle_init(void)
 
 extern void *omap3_secure_ram_storage;
 extern void omap3_pm_off_mode_enable(int);
-extern void omap_sram_idle(void);
+extern void omap_sram_idle(bool rcuidle);
 extern int omap_pm_clkdms_setup(struct clockdomain *clkdm, void *unused);
 
 #if defined(CONFIG_PM_OPP)
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -174,7 +175,7 @@ static int omap34xx_do_sram_idle(unsigne
return 0;
 }
 
-void omap_sram_idle(void)
+void omap_sram_idle(bool rcuidle)
 {
/* Variable to tell what needs to be saved and restored
 * in omap_sram_idle*/
@@ -254,11 +255,18 @@ void omap_sram_idle(void)
 */
if (save_state)
omap34xx_save_context(omap3_arm_context);
+
+   if (rcuidle)
+   ct_cpuidle_enter();
+
if (save_state == 1 || save_state == 3)
cpu_suspend(save_state, omap34xx_do_sram_idle);
else
omap34xx_do_sram_idle(save_state);
 
+   if (rcuidle)
+   ct_cpuidle_exit();
+
/* Restore normal SDRC POWER settings */
if (cpu_is_omap3430() && omap_rev() >= OMAP3430_REV_ES3_0 &&
(omap_type() == OMAP2_DEVICE_TYPE_EMU ||
@@ -316,7 +324,7 @@ static int omap3_pm_suspend(void)
 
omap3_intc_suspend();
 
-   omap_sram_idle();
+   omap_sram_idle(false);
 
 restore:
/* Restore next_pwrsts */




[PATCH v3 40/51] cpuidle,powerdomain: Remove trace_.*_rcuidle()

2023-01-12 Thread Peter Zijlstra
OMAP was the one and only user.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ulf Hansson 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/powerdomain.c |   10 +-
 drivers/base/power/runtime.c  |   24 
 2 files changed, 17 insertions(+), 17 deletions(-)

--- a/arch/arm/mach-omap2/powerdomain.c
+++ b/arch/arm/mach-omap2/powerdomain.c
@@ -187,9 +187,9 @@ static int _pwrdm_state_switch(struct po
trace_state = (PWRDM_TRACE_STATES_FLAG |
   ((next & OMAP_POWERSTATE_MASK) << 8) |
   ((prev & OMAP_POWERSTATE_MASK) << 0));
-   trace_power_domain_target_rcuidle(pwrdm->name,
- trace_state,
- 
raw_smp_processor_id());
+   trace_power_domain_target(pwrdm->name,
+ trace_state,
+ raw_smp_processor_id());
}
break;
default:
@@ -541,8 +541,8 @@ int pwrdm_set_next_pwrst(struct powerdom
 
if (arch_pwrdm && arch_pwrdm->pwrdm_set_next_pwrst) {
/* Trace the pwrdm desired target state */
-   trace_power_domain_target_rcuidle(pwrdm->name, pwrst,
- raw_smp_processor_id());
+   trace_power_domain_target(pwrdm->name, pwrst,
+ raw_smp_processor_id());
/* Program the pwrdm desired target state */
ret = arch_pwrdm->pwrdm_set_next_pwrst(pwrdm, pwrst);
}
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -442,7 +442,7 @@ static int rpm_idle(struct device *dev,
int (*callback)(struct device *);
int retval;
 
-   trace_rpm_idle_rcuidle(dev, rpmflags);
+   trace_rpm_idle(dev, rpmflags);
retval = rpm_check_suspend_allowed(dev);
if (retval < 0)
;   /* Conditions are wrong. */
@@ -481,7 +481,7 @@ static int rpm_idle(struct device *dev,
dev->power.request_pending = true;
	queue_work(pm_wq, &dev->power.work);
}
-   trace_rpm_return_int_rcuidle(dev, _THIS_IP_, 0);
+   trace_rpm_return_int(dev, _THIS_IP_, 0);
return 0;
}
 
@@ -493,7 +493,7 @@ static int rpm_idle(struct device *dev,
	wake_up_all(&dev->power.wait_queue);
 
  out:
-   trace_rpm_return_int_rcuidle(dev, _THIS_IP_, retval);
+   trace_rpm_return_int(dev, _THIS_IP_, retval);
return retval ? retval : rpm_suspend(dev, rpmflags | RPM_AUTO);
 }
 
@@ -557,7 +557,7 @@ static int rpm_suspend(struct device *de
struct device *parent = NULL;
int retval;
 
-   trace_rpm_suspend_rcuidle(dev, rpmflags);
+   trace_rpm_suspend(dev, rpmflags);
 
  repeat:
retval = rpm_check_suspend_allowed(dev);
@@ -708,7 +708,7 @@ static int rpm_suspend(struct device *de
}
 
  out:
-   trace_rpm_return_int_rcuidle(dev, _THIS_IP_, retval);
+   trace_rpm_return_int(dev, _THIS_IP_, retval);
 
return retval;
 
@@ -760,7 +760,7 @@ static int rpm_resume(struct device *dev
struct device *parent = NULL;
int retval = 0;
 
-   trace_rpm_resume_rcuidle(dev, rpmflags);
+   trace_rpm_resume(dev, rpmflags);
 
  repeat:
if (dev->power.runtime_error) {
@@ -925,7 +925,7 @@ static int rpm_resume(struct device *dev
	spin_lock_irq(&dev->power.lock);
}
 
-   trace_rpm_return_int_rcuidle(dev, _THIS_IP_, retval);
+   trace_rpm_return_int(dev, _THIS_IP_, retval);
 
return retval;
 }
@@ -1081,7 +1081,7 @@ int __pm_runtime_idle(struct device *dev
if (retval < 0) {
return retval;
} else if (retval > 0) {
-   trace_rpm_usage_rcuidle(dev, rpmflags);
+   trace_rpm_usage(dev, rpmflags);
return 0;
}
}
@@ -1119,7 +1119,7 @@ int __pm_runtime_suspend(struct device *
if (retval < 0) {
return retval;
} else if (retval > 0) {
-   trace_rpm_usage_rcuidle(dev, rpmflags);
+   trace_rpm_usage(dev, rpmflags);
return 0;
}
}
@@ -1202,7 +1202,7 @@ int pm_runtime_get_if_active(struct devi
} else {
	retval = atomic_inc_not_zero(&dev->power.usage_count);
}
-   trace_rpm_usage_rcuidle(dev, 0);
+   trace_rpm_usage(dev, 0);
	spin_unlock_irqrestore(&dev->power.lock, flags);

[PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-12 Thread Peter Zijlstra
Hi All!

The (hopefully) final respin of cpuidle vs rcu cleanup patches. Barring any
objections I'll be queueing these patches in tip/sched/core in the next few
days.

v2: https://lkml.kernel.org/r/20220919095939.761690...@infradead.org

These here patches clean up the mess that is cpuidle vs rcuidle.

At the end of the ride there's only one RCU_NONIDLE user left:

  arch/arm64/kernel/suspend.c:RCU_NONIDLE(__cpu_suspend_exit());

And I know Mark has been prodding that with something sharp.

The last version was tested by a number of people and I'm hoping to not have
broken anything in the meantime ;-)


Changes since v2:

 - rebased to v6.2-rc3; as available at:
 git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/idle

 - folded: 
https://lkml.kernel.org/r/y3ubwyny15etu...@hirez.programming.kicks-ass.net
   which makes the ARM cpuidle index 0 consistently not use
   CPUIDLE_FLAG_RCU_IDLE, as requested by Ulf.

 - added a few more __always_inline to empty stub functions as found by the
   robot.

 - Used _RET_IP_ instead of _THIS_IP_ in a few places because of:
   https://github.com/ClangBuiltLinux/linux/issues/263

 - Added new patches to address various robot reports:

 #35:  trace,hardirq: No moar _rcuidle() tracing
 #47:  cpuidle: Ensure ct_cpuidle_enter() is always called from 
noinstr/__cpuidle
 #48:  cpuidle,arch: Mark all ct_cpuidle_enter() callers __cpuidle
 #49:  cpuidle,arch: Mark all regular cpuidle_state::enter methods __cpuidle
 #50:  cpuidle: Comments about noinstr/__cpuidle
 #51:  context_tracking: Fix noinstr vs KASAN


---
 arch/alpha/kernel/process.c   |  1 -
 arch/alpha/kernel/vmlinux.lds.S   |  1 -
 arch/arc/kernel/process.c |  3 ++
 arch/arc/kernel/vmlinux.lds.S |  1 -
 arch/arm/include/asm/vmlinux.lds.h|  1 -
 arch/arm/kernel/cpuidle.c |  4 +-
 arch/arm/kernel/process.c |  1 -
 arch/arm/kernel/smp.c |  6 +--
 arch/arm/mach-davinci/cpuidle.c   |  4 +-
 arch/arm/mach-gemini/board-dt.c   |  3 +-
 arch/arm/mach-imx/cpuidle-imx5.c  |  4 +-
 arch/arm/mach-imx/cpuidle-imx6q.c |  8 ++--
 arch/arm/mach-imx/cpuidle-imx6sl.c|  4 +-
 arch/arm/mach-imx/cpuidle-imx6sx.c|  9 ++--
 arch/arm/mach-imx/cpuidle-imx7ulp.c   |  4 +-
 arch/arm/mach-omap2/common.h  |  6 ++-
 arch/arm/mach-omap2/cpuidle34xx.c | 16 ++-
 arch/arm/mach-omap2/cpuidle44xx.c | 29 +++--
 arch/arm/mach-omap2/omap-mpuss-lowpower.c | 12 +-
 arch/arm/mach-omap2/pm.h  |  2 +-
 arch/arm/mach-omap2/pm24xx.c  | 51 +-
 arch/arm/mach-omap2/pm34xx.c  | 14 +--
 arch/arm/mach-omap2/pm44xx.c  |  2 +-
 arch/arm/mach-omap2/powerdomain.c | 10 ++---
 arch/arm/mach-s3c/cpuidle-s3c64xx.c   |  5 +--
 arch/arm64/kernel/cpuidle.c   |  2 +-
 arch/arm64/kernel/idle.c  |  1 -
 arch/arm64/kernel/smp.c   |  4 +-
 arch/arm64/kernel/vmlinux.lds.S   |  1 -
 arch/csky/kernel/process.c|  1 -
 arch/csky/kernel/smp.c|  2 +-
 arch/csky/kernel/vmlinux.lds.S|  1 -
 arch/hexagon/kernel/process.c |  1 -
 arch/hexagon/kernel/vmlinux.lds.S |  1 -
 arch/ia64/kernel/process.c|  1 +
 arch/ia64/kernel/vmlinux.lds.S|  1 -
 arch/loongarch/kernel/idle.c  |  1 +
 arch/loongarch/kernel/vmlinux.lds.S   |  1 -
 arch/m68k/kernel/vmlinux-nommu.lds|  1 -
 arch/m68k/kernel/vmlinux-std.lds  |  1 -
 arch/m68k/kernel/vmlinux-sun3.lds |  1 -
 arch/microblaze/kernel/process.c  |  1 -
 arch/microblaze/kernel/vmlinux.lds.S  |  1 -
 arch/mips/kernel/idle.c   | 14 +++
 arch/mips/kernel/vmlinux.lds.S|  1 -
 arch/nios2/kernel/process.c   |  1 -
 arch/nios2/kernel/vmlinux.lds.S   |  1 -
 arch/openrisc/kernel/process.c|  1 +
 arch/openrisc/kernel/vmlinux.lds.S|  1 -
 arch/parisc/kernel/process.c  |  2 -
 arch/parisc/kernel/vmlinux.lds.S  |  1 -
 arch/powerpc/kernel/idle.c|  5 +--
 arch/powerpc/kernel/vmlinux.lds.S |  1 -
 arch/riscv/kernel/process.c   |  1 -
 arch/riscv/kernel/vmlinux-xip.lds.S   |  1 -
 arch/riscv/kernel/vmlinux.lds.S   |  1 -
 arch/s390/kernel/idle.c   |  1 -
 arch/s390/kernel/vmlinux.lds.S|  1 -
 arch/sh/kernel/idle.c |  1 +
 arch/sh/kernel/vmlinux.lds.S  |  1 -
 arch/sparc/kernel/leon_pmc.c  |  4 ++
 arch/sparc/kernel/process_32.c|  1 -
 arch/sparc/kernel/process_64.c|  3 +-
 arch/sparc/kernel/vmlinux.lds.S   |  1 -
 arch/um/kernel/dyn.lds.S  |  1 -
 arch/um/kernel/process.c 

[PATCH v3 31/51] cpuidle,nospec: Make noinstr clean

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: mwait_idle+0x47: call to 
mds_idle_clear_cpu_buffers() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0xa2: call to 
mds_idle_clear_cpu_buffers() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle+0x91: call to 
mds_idle_clear_cpu_buffers() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_s2idle+0x8c: call to 
mds_idle_clear_cpu_buffers() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0xaa: call to 
mds_idle_clear_cpu_buffers() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/include/asm/nospec-branch.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -310,7 +310,7 @@ static __always_inline void mds_user_cle
  *
  * Clear CPU buffers if the corresponding static key is enabled
  */
-static inline void mds_idle_clear_cpu_buffers(void)
+static __always_inline void mds_idle_clear_cpu_buffers(void)
 {
	if (static_branch_likely(&mds_idle_clear))
mds_clear_cpu_buffers();




[PATCH v3 16/51] cpuidle: Annotate poll_idle()

2023-01-12 Thread Peter Zijlstra
The __cpuidle functions will become a noinstr class, as such they need
explicit annotations.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/poll_state.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -13,7 +13,10 @@
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
   struct cpuidle_driver *drv, int index)
 {
-   u64 time_start = local_clock();
+   u64 time_start;
+
+   instrumentation_begin();
+   time_start = local_clock();
 
dev->poll_time_limit = false;
 
@@ -39,6 +42,7 @@ static int __cpuidle poll_idle(struct cp
raw_local_irq_disable();
 
current_clr_polling();
+   instrumentation_end();
 
return index;
 }




[PATCH v3 48/51] cpuidle, arch: Mark all ct_cpuidle_enter() callers __cpuidle

2023-01-12 Thread Peter Zijlstra
For all cpuidle drivers that use CPUIDLE_FLAG_RCU_IDLE, ensure that
all functions that call ct_cpuidle_enter() are marked __cpuidle.

( due to lack of noinstr validation on these platforms it is entirely
  possible this isn't complete )

Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/arm/mach-imx/cpuidle-imx6q.c |4 ++--
 arch/arm/mach-imx/cpuidle-imx6sx.c|4 ++--
 arch/arm/mach-omap2/omap-mpuss-lowpower.c |4 ++--
 arch/arm/mach-omap2/pm34xx.c  |2 +-
 arch/arm64/kernel/cpuidle.c   |2 +-
 drivers/cpuidle/cpuidle-arm.c |4 ++--
 drivers/cpuidle/cpuidle-big_little.c  |4 ++--
 drivers/cpuidle/cpuidle-mvebu-v7.c|6 +++---
 drivers/cpuidle/cpuidle-psci.c|   17 ++---
 drivers/cpuidle/cpuidle-qcom-spm.c|4 ++--
 drivers/cpuidle/cpuidle-riscv-sbi.c   |   10 +-
 drivers/cpuidle/cpuidle-tegra.c   |   10 +-
 12 files changed, 33 insertions(+), 38 deletions(-)

--- a/arch/arm/mach-imx/cpuidle-imx6q.c
+++ b/arch/arm/mach-imx/cpuidle-imx6q.c
@@ -17,8 +17,8 @@
 static int num_idle_cpus = 0;
 static DEFINE_RAW_SPINLOCK(cpuidle_lock);
 
-static int imx6q_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx6q_enter_wait(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
 {
	raw_spin_lock(&cpuidle_lock);
if (++num_idle_cpus == num_online_cpus())
--- a/arch/arm/mach-imx/cpuidle-imx6sx.c
+++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
@@ -30,8 +30,8 @@ static int imx6sx_idle_finish(unsigned l
return 0;
 }
 
-static int imx6sx_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx6sx_enter_wait(struct cpuidle_device *dev,
+  struct cpuidle_driver *drv, int index)
 {
imx6_set_lpm(WAIT_UNCLOCKED);
 
--- a/arch/arm/mach-omap2/omap-mpuss-lowpower.c
+++ b/arch/arm/mach-omap2/omap-mpuss-lowpower.c
@@ -224,8 +224,8 @@ static void __init save_l2x0_context(voi
  * 2 - CPUx L1 and logic lost + GIC lost: MPUSS OSWR
  * 3 - CPUx L1 and logic lost + GIC + L2 lost: DEVICE OFF
  */
-int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
-bool rcuidle)
+__cpuidle int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
+  bool rcuidle)
 {
	struct omap4_cpu_pm_info *pm_info = &per_cpu(omap4_pm_info, cpu);
unsigned int save_state = 0, cpu_logic_state = PWRDM_POWER_RET;
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -175,7 +175,7 @@ static int omap34xx_do_sram_idle(unsigne
return 0;
 }
 
-void omap_sram_idle(bool rcuidle)
+__cpuidle void omap_sram_idle(bool rcuidle)
 {
/* Variable to tell what needs to be saved and restored
 * in omap_sram_idle*/
--- a/arch/arm64/kernel/cpuidle.c
+++ b/arch/arm64/kernel/cpuidle.c
@@ -62,7 +62,7 @@ int acpi_processor_ffh_lpi_probe(unsigne
return psci_acpi_cpu_init_idle(cpu);
 }
 
-int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi)
+__cpuidle int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi)
 {
u32 state = lpi->address;
 
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -31,8 +31,8 @@
  * Called from the CPUidle framework to program the device to the
  * specified target state selected by the governor.
  */
-static int arm_enter_idle_state(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int idx)
+static __cpuidle int arm_enter_idle_state(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int idx)
 {
/*
 * Pass idle state index to arm_cpuidle_suspend which in turn
--- a/drivers/cpuidle/cpuidle-big_little.c
+++ b/drivers/cpuidle/cpuidle-big_little.c
@@ -122,8 +122,8 @@ static int notrace bl_powerdown_finisher
  * Called from the CPUidle framework to program the device to the
  * specified target state selected by the governor.
  */
-static int bl_enter_powerdown(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int idx)
+static __cpuidle int bl_enter_powerdown(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int idx)
 {
cpu_pm_enter();
ct_cpuidle_enter();
--- a/drivers/cpuidle/cpuidle-mvebu-v7.c
+++ b/drivers/cpuidle/cpuidle-mvebu-v7.c
@@ -25,9 +25,9 @@
 
 static int (*mvebu_v7_cpu_suspend)(int);
 
-static int mvebu_v7_enter_idle(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv,
-   int index)
+static __cpuidle int mvebu_v7_enter_idle(struct cpuidle_device *dev,
+struct cpuidle_driver *

[PATCH v3 19/51] cpuidle,intel_idle: Fix CPUIDLE_FLAG_INIT_XSTATE

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: intel_idle_s2idle+0xd5: call to fpu_idle_fpregs() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_xstate+0x11: call to fpu_idle_fpregs() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: fpu_idle_fpregs+0x9: call to xfeatures_in_use() 
leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/include/asm/fpu/xcr.h   |4 ++--
 arch/x86/include/asm/special_insns.h |2 +-
 arch/x86/kernel/fpu/core.c   |4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/fpu/xcr.h
+++ b/arch/x86/include/asm/fpu/xcr.h
@@ -5,7 +5,7 @@
 #define XCR_XFEATURE_ENABLED_MASK  0x
 #define XCR_XFEATURE_IN_USE_MASK   0x0001
 
-static inline u64 xgetbv(u32 index)
+static __always_inline u64 xgetbv(u32 index)
 {
u32 eax, edx;
 
@@ -27,7 +27,7 @@ static inline void xsetbv(u32 index, u64
  *
  * Callers should check X86_FEATURE_XGETBV1.
  */
-static inline u64 xfeatures_in_use(void)
+static __always_inline u64 xfeatures_in_use(void)
 {
return xgetbv(XCR_XFEATURE_IN_USE_MASK);
 }
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -295,7 +295,7 @@ static inline int enqcmds(void __iomem *
return 0;
 }
 
-static inline void tile_release(void)
+static __always_inline void tile_release(void)
 {
/*
 * Instruction opcode for TILERELEASE; supported in binutils
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -856,12 +856,12 @@ int fpu__exception_code(struct fpu *fpu,
  * Initialize register state that may prevent from entering low-power idle.
  * This function will be invoked from the cpuidle driver only when needed.
  */
-void fpu_idle_fpregs(void)
+noinstr void fpu_idle_fpregs(void)
 {
/* Note: AMX_TILE being enabled implies XGETBV1 support */
if (cpu_feature_enabled(X86_FEATURE_AMX_TILE) &&
(xfeatures_in_use() & XFEATURE_MASK_XTILE)) {
tile_release();
-   fpregs_deactivate(&current->thread.fpu);
+   __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
}
 }




[PATCH v3 44/51] entry, kasan, x86: Disallow overriding mem*() functions

2023-01-12 Thread Peter Zijlstra
KASAN cannot just hijack the mem*() functions, it needs to emit
__asan_mem*() variants if it wants instrumentation (other sanitizers
already do this).
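
As a rough userspace analogue of that split (hypothetical names, not
the kernel or compiler interfaces): the sanitizer gets its own checked
entry point that validates the range and then defers to the plain
primitive, so the primitive itself stays usable from code that must
not be instrumented.

/* Userspace sketch of the __asan_mem*() wrapper pattern; checked_memcpy()
 * and range_ok() are made-up stand-ins. */
#include <stdio.h>
#include <string.h>

static char arena[64];		/* pretend this is the only valid memory */

static int range_ok(const void *p, size_t len)
{
	const char *c = p;

	return c >= arena && len <= sizeof(arena) &&
	       c + len <= arena + sizeof(arena);
}

/* Instrumented entry point: check, then defer to the unchecked primitive. */
static void *checked_memcpy(void *dst, const void *src, size_t len)
{
	if (!range_ok(dst, len) || !range_ok(src, len)) {
		fprintf(stderr, "bad memcpy of %zu bytes\n", len);
		return NULL;
	}
	return memcpy(dst, src, len);	/* the __memcpy equivalent */
}

int main(void)
{
	checked_memcpy(arena, arena + 32, 16);		/* passes the check */
	checked_memcpy(arena, arena + 32, 4096);	/* rejected */
	return 0;
}

The kernel __asan_*() variants check via kasan_check_range() instead,
but the shape is the same: memcpy() is no longer a weak alias anyone
can hijack, and only the __asan_*() symbols carry the checks.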

vmlinux.o: warning: objtool: sync_regs+0x24: call to memcpy() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: vc_switch_off_ist+0xbe: call to memcpy() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: fixup_bad_iret+0x36: call to memset() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: __sev_get_ghcb+0xa0: call to memcpy() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: __sev_put_ghcb+0x35: call to memcpy() leaves 
.noinstr.text section

Remove the weak aliases to ensure nobody hijacks these functions and
add them to the noinstr section.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/lib/memcpy_64.S  |5 ++---
 arch/x86/lib/memmove_64.S |4 +++-
 arch/x86/lib/memset_64.S  |4 +++-
 mm/kasan/kasan.h  |4 
 mm/kasan/shadow.c |   38 ++
 tools/objtool/check.c |3 +++
 6 files changed, 53 insertions(+), 5 deletions(-)

--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -7,7 +7,7 @@
 #include 
 #include 
 
-.pushsection .noinstr.text, "ax"
+.section .noinstr.text, "ax"
 
 /*
  * We build a jump to memcpy_orig by default which gets NOPped out on
@@ -42,7 +42,7 @@ SYM_FUNC_START(__memcpy)
 SYM_FUNC_END(__memcpy)
 EXPORT_SYMBOL(__memcpy)
 
-SYM_FUNC_ALIAS_WEAK(memcpy, __memcpy)
+SYM_FUNC_ALIAS(memcpy, __memcpy)
 EXPORT_SYMBOL(memcpy)
 
 /*
@@ -183,4 +183,3 @@ SYM_FUNC_START_LOCAL(memcpy_orig)
RET
 SYM_FUNC_END(memcpy_orig)
 
-.popsection
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -13,6 +13,8 @@
 
 #undef memmove
 
+.section .noinstr.text, "ax"
+
 /*
  * Implement memmove(). This can handle overlap between src and dst.
  *
@@ -213,5 +215,5 @@ SYM_FUNC_START(__memmove)
 SYM_FUNC_END(__memmove)
 EXPORT_SYMBOL(__memmove)
 
-SYM_FUNC_ALIAS_WEAK(memmove, __memmove)
+SYM_FUNC_ALIAS(memmove, __memmove)
 EXPORT_SYMBOL(memmove)
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+.section .noinstr.text, "ax"
+
 /*
  * ISO C memset - set a memory block to a byte value. This function uses fast
  * string to get better performance than the original function. The code is
@@ -43,7 +45,7 @@ SYM_FUNC_START(__memset)
 SYM_FUNC_END(__memset)
 EXPORT_SYMBOL(__memset)
 
-SYM_FUNC_ALIAS_WEAK(memset, __memset)
+SYM_FUNC_ALIAS(memset, __memset)
 EXPORT_SYMBOL(memset)
 
 /*
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -551,6 +551,10 @@ void __asan_set_shadow_f3(const void *ad
 void __asan_set_shadow_f5(const void *addr, size_t size);
 void __asan_set_shadow_f8(const void *addr, size_t size);
 
+void *__asan_memset(void *addr, int c, size_t len);
+void *__asan_memmove(void *dest, const void *src, size_t len);
+void *__asan_memcpy(void *dest, const void *src, size_t len);
+
 void __hwasan_load1_noabort(unsigned long addr);
 void __hwasan_store1_noabort(unsigned long addr);
 void __hwasan_load2_noabort(unsigned long addr);
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -38,6 +38,12 @@ bool __kasan_check_write(const volatile
 }
 EXPORT_SYMBOL(__kasan_check_write);
 
+#ifndef CONFIG_GENERIC_ENTRY
+/*
+ * CONFIG_GENERIC_ENTRY relies on compiler emitted mem*() calls to not be
+ * instrumented. KASAN enabled toolchains should emit __asan_mem*() functions
+ * for the sites they want to instrument.
+ */
 #undef memset
 void *memset(void *addr, int c, size_t len)
 {
@@ -68,6 +74,38 @@ void *memcpy(void *dest, const void *src
 
return __memcpy(dest, src, len);
 }
+#endif
+
+void *__asan_memset(void *addr, int c, size_t len)
+{
+   if (!kasan_check_range((unsigned long)addr, len, true, _RET_IP_))
+   return NULL;
+
+   return __memset(addr, c, len);
+}
+EXPORT_SYMBOL(__asan_memset);
+
+#ifdef __HAVE_ARCH_MEMMOVE
+void *__asan_memmove(void *dest, const void *src, size_t len)
+{
+   if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
+   !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+   return NULL;
+
+   return __memmove(dest, src, len);
+}
+EXPORT_SYMBOL(__asan_memmove);
+#endif
+
+void *__asan_memcpy(void *dest, const void *src, size_t len)
+{
+   if (!kasan_check_range((unsigned long)src, len, false, _RET_IP_) ||
+   !kasan_check_range((unsigned long)dest, len, true, _RET_IP_))
+   return NULL;
+
+   return __memcpy(dest, src, len);
+}
+EXPORT_SYMBOL(__asan_memcpy);
 
 void kasan_poison(const void *addr, size_t size, u8 value, bool init)
 {
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -956,6 +956,9 @@ static const char *uaccess_safe_builtin[
"__asan_store16_noabort",

[PATCH v3 39/51] arm,omap2: Use WFI for omap2_pm_idle()

2023-01-12 Thread Peter Zijlstra
arch_cpu_idle() is a very simple idle interface and exposes only a
single idle state and is expected to not require RCU and not do any
tracing/instrumentation.

As such, omap2_pm_idle() is not a valid implementation. Replace it
with a simple (shallow) omap2_do_wfi() call.

Omap2 doesn't have a cpuidle driver; but adding one would be the
recourse to (re)gain the other idle states.

Suggested-by: Tony Lindgren 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/pm24xx.c |   51 +--
 1 file changed, 2 insertions(+), 49 deletions(-)

--- a/arch/arm/mach-omap2/pm24xx.c
+++ b/arch/arm/mach-omap2/pm24xx.c
@@ -116,50 +116,12 @@ static int omap2_enter_full_retention(vo
 
 static int sti_console_enabled;
 
-static int omap2_allow_mpu_retention(void)
-{
-   if (!omap2xxx_cm_mpu_retention_allowed())
-   return 0;
-   if (sti_console_enabled)
-   return 0;
-
-   return 1;
-}
-
-static void omap2_enter_mpu_retention(void)
+static void omap2_do_wfi(void)
 {
const int zero = 0;
 
-   /* The peripherals seem not to be able to wake up the MPU when
-* it is in retention mode. */
-   if (omap2_allow_mpu_retention()) {
-   /* REVISIT: These write to reserved bits? */
-   omap_prm_clear_mod_irqs(CORE_MOD, PM_WKST1, ~0);
-   omap_prm_clear_mod_irqs(CORE_MOD, OMAP24XX_PM_WKST2, ~0);
-   omap_prm_clear_mod_irqs(WKUP_MOD, PM_WKST, ~0);
-
-   /* Try to enter MPU retention */
-   pwrdm_set_next_pwrst(mpu_pwrdm, PWRDM_POWER_RET);
-
-   } else {
-   /* Block MPU retention */
-   pwrdm_set_next_pwrst(mpu_pwrdm, PWRDM_POWER_ON);
-   }
-
/* WFI */
asm("mcr p15, 0, %0, c7, c0, 4" : : "r" (zero) : "memory", "cc");
-
-   pwrdm_set_next_pwrst(mpu_pwrdm, PWRDM_POWER_ON);
-}
-
-static int omap2_can_sleep(void)
-{
-   if (omap2xxx_cm_fclks_active())
-   return 0;
-   if (__clk_is_enabled(osc_ck))
-   return 0;
-
-   return 1;
 }
 
 static void omap2_pm_idle(void)
@@ -169,16 +131,7 @@ static void omap2_pm_idle(void)
if (omap_irq_pending())
return;
 
-   error = cpu_cluster_pm_enter();
-   if (error || !omap2_can_sleep()) {
-   omap2_enter_mpu_retention();
-   goto out_cpu_cluster_pm;
-   }
-
-   omap2_enter_full_retention();
-
-out_cpu_cluster_pm:
-   cpu_cluster_pm_exit();
+   omap2_do_wfi();
 }
 
 static void __init prcm_setup_regs(void)




[PATCH v3 47/51] cpuidle: Ensure ct_cpuidle_enter() is always called from noinstr/__cpuidle

2023-01-12 Thread Peter Zijlstra
Tracing (kprobes included) and other compiler instrumentation relies
on a normal kernel runtime. Therefore all functions that disable RCU
should be noinstr, as should all functions that are called while RCU
is disabled.
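
Schematically, with no-op stand-ins for the annotations (a shape
sketch of the bracketing used below, not the kernel implementation):

/* noinstr and instrumentation_{begin,end}() are no-op stand-ins here. */
#include <stdio.h>

#define noinstr				/* section placement elided */
#define instrumentation_begin()		do { } while (0)
#define instrumentation_end()		do { } while (0)

static void ct_cpuidle_enter(void)	{ puts("RCU off"); }
static void ct_cpuidle_exit(void)	{ puts("RCU on"); }
static void state_enter(void)		{ puts("low-power entry"); }

static noinstr void idle_path(void)
{
	instrumentation_begin();	/* the body proper may be traced */
	puts("setup");

	ct_cpuidle_enter();
	instrumentation_begin();	/* annotate away the indirect call */
	state_enter();
	instrumentation_end();
	ct_cpuidle_exit();

	puts("cleanup");
	instrumentation_end();
}

int main(void)
{
	idle_path();
	return 0;
}

Everything outside the begin/end pairs must be instrumentation-free,
which is why the callers themselves have to be noinstr/__cpuidle.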

Signed-off-by: Peter Zijlstra (Intel) 
---
 drivers/cpuidle/cpuidle.c |   37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -137,11 +137,13 @@ int cpuidle_find_deepest_state(struct cp
 }
 
 #ifdef CONFIG_SUSPEND
-static void enter_s2idle_proper(struct cpuidle_driver *drv,
-   struct cpuidle_device *dev, int index)
+static noinstr void enter_s2idle_proper(struct cpuidle_driver *drv,
+struct cpuidle_device *dev, int index)
 {
-   ktime_t time_start, time_end;
	struct cpuidle_state *target_state = &drv->states[index];
+   ktime_t time_start, time_end;
+
+   instrumentation_begin();
 
time_start = ns_to_ktime(local_clock());
 
@@ -152,13 +154,18 @@ static void enter_s2idle_proper(struct c
 * suspended is generally unsafe.
 */
stop_critical_timings();
-   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
+   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
ct_cpuidle_enter();
+   /* Annotate away the indirect call */
+   instrumentation_begin();
+   }
target_state->enter_s2idle(dev, drv, index);
if (WARN_ON_ONCE(!irqs_disabled()))
raw_local_irq_disable();
-   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
+   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
+   instrumentation_end();
ct_cpuidle_exit();
+   }
tick_unfreeze();
start_critical_timings();
 
@@ -166,6 +173,7 @@ static void enter_s2idle_proper(struct c
 
dev->states_usage[index].s2idle_time += ktime_us_delta(time_end, 
time_start);
dev->states_usage[index].s2idle_usage++;
+   instrumentation_end();
 }
 
 /**
@@ -200,8 +208,9 @@ int cpuidle_enter_s2idle(struct cpuidle_
  * @drv: cpuidle driver for this cpu
  * @index: index into the states table in @drv of the state to enter
  */
-int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
-   int index)
+noinstr int cpuidle_enter_state(struct cpuidle_device *dev,
+struct cpuidle_driver *drv,
+int index)
 {
int entered_state;
 
@@ -209,6 +218,8 @@ int cpuidle_enter_state(struct cpuidle_d
bool broadcast = !!(target_state->flags & CPUIDLE_FLAG_TIMER_STOP);
ktime_t time_start, time_end;
 
+   instrumentation_begin();
+
/*
 * Tell the time framework to switch to a broadcast timer because our
 * local timer will be shut down.  If a local timer is used from another
@@ -235,15 +246,21 @@ int cpuidle_enter_state(struct cpuidle_d
time_start = ns_to_ktime(local_clock());
 
stop_critical_timings();
-   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
+   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
ct_cpuidle_enter();
+   /* Annotate away the indirect call */
+   instrumentation_begin();
+   }
 
entered_state = target_state->enter(dev, drv, index);
+
if (WARN_ONCE(!irqs_disabled(), "%ps leaked IRQ state", 
target_state->enter))
raw_local_irq_disable();
 
-   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
+   if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE)) {
+   instrumentation_end();
ct_cpuidle_exit();
+   }
start_critical_timings();
 
sched_clock_idle_wakeup_event();
@@ -306,6 +323,8 @@ int cpuidle_enter_state(struct cpuidle_d
dev->states_usage[index].rejected++;
}
 
+   instrumentation_end();
+
return entered_state;
 }
 




[PATCH v3 43/51] intel_idle: Add force_irq_on module param

2023-01-12 Thread Peter Zijlstra
For testing purposes.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/idle/intel_idle.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1787,6 +1787,9 @@ static bool __init intel_idle_verify_cst
return true;
 }
 
+static bool force_irq_on __read_mostly;
+module_param(force_irq_on, bool, 0444);
+
 static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
 {
int cstate;
@@ -1838,8 +1841,10 @@ static void __init intel_idle_init_cstat
/* Structure copy. */
drv->states[drv->state_count] = cpuidle_state_table[cstate];
 
-   if (cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IRQ_ENABLE)
+   if ((cpuidle_state_table[cstate].flags & 
CPUIDLE_FLAG_IRQ_ENABLE) || force_irq_on) {
+   printk("intel_idle: forced intel_idle_irq for state 
%d\n", cstate);
drv->states[drv->state_count].enter = intel_idle_irq;
+   }
 
if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS) &&
cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IBRS) {




[PATCH v3 09/51] cpuidle,omap3: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again before going idle is daft.

Notably the cpu_pm_*() calls implicitly re-enable RCU for a bit.
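
In call-ordering terms the change amounts to this (stub sketch, not
the OMAP code): the RCU-off window has to sit inside the cpu_pm
notifier calls, tightly around the actual low-power entry, because
the notifiers themselves rely on RCU.

/* Ordering sketch only; every function below is a stub. */
#include <stdio.h>

static void cpu_pm_enter(void)	{ puts("cpu_pm notifiers (need RCU)"); }
static void cpu_pm_exit(void)	{ puts("cpu_pm notifiers (need RCU)"); }
static void ct_idle_enter(void)	{ puts("RCU off"); }
static void ct_idle_exit(void)	{ puts("RCU on"); }
static void wfi(void)		{ puts("wfi"); }

static void driver_enter_idle(void)
{
	cpu_pm_enter();
	ct_idle_enter();	/* only now is RCU switched off ... */
	wfi();
	ct_idle_exit();		/* ... and back on before anything traced runs */
	cpu_pm_exit();
}

int main(void)
{
	driver_enter_idle();
	return 0;
}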

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Tony Lindgren 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/cpuidle34xx.c |   16 
 1 file changed, 16 insertions(+)

--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -133,7 +133,9 @@ static int omap3_enter_idle(struct cpuid
}
 
/* Execute ARM wfi */
+   ct_idle_enter();
omap_sram_idle();
+   ct_idle_exit();
 
/*
 * Call idle CPU PM enter notifier chain to restore
@@ -265,6 +267,7 @@ static struct cpuidle_driver omap3_idle_
.owner= THIS_MODULE,
.states = {
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 2 + 2,
.target_residency = 5,
@@ -272,6 +275,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU ON + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 10 + 10,
.target_residency = 30,
@@ -279,6 +283,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU ON + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 50 + 50,
.target_residency = 300,
@@ -286,6 +291,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU RET + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 1500 + 1800,
.target_residency = 4000,
@@ -293,6 +299,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU OFF + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 2500 + 7500,
.target_residency = 12000,
@@ -300,6 +307,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU RET + CORE RET",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 3000 + 8500,
.target_residency = 15000,
@@ -307,6 +315,7 @@ static struct cpuidle_driver omap3_idle_
.desc = "MPU OFF + CORE RET",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 1 + 3,
.target_residency = 3,
@@ -328,6 +337,7 @@ static struct cpuidle_driver omap3430_id
.owner= THIS_MODULE,
.states = {
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 110 + 162,
.target_residency = 5,
@@ -335,6 +345,7 @@ static struct cpuidle_driver omap3430_id
.desc = "MPU ON + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 106 + 180,
.target_residency = 309,
@@ -342,6 +353,7 @@ static struct cpuidle_driver omap3430_id
.desc = "MPU ON + CORE ON",
},
{
+   .flags= CPUIDLE_FLAG_RCU_IDLE,
.enter= omap3_enter_idle_bm,
.exit_latency = 107 + 410,
.target_residency = 46057,
@@ -349,6 +361,7 @@ static struct cpuidle_driver omap3430_id
.desc = "MPU RET + CORE ON",
},
{
+  

[PATCH v3 26/51] time/tick-broadcast: Remove RCU_NONIDLE usage

2023-01-12 Thread Peter Zijlstra
No callers left that have already disabled RCU.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mark Rutland 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 kernel/time/tick-broadcast-hrtimer.c |   29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

--- a/kernel/time/tick-broadcast-hrtimer.c
+++ b/kernel/time/tick-broadcast-hrtimer.c
@@ -56,25 +56,20 @@ static int bc_set_next(ktime_t expires,
 * hrtimer callback function is currently running, then
 * hrtimer_start() cannot move it and the timer stays on the CPU on
 * which it is assigned at the moment.
+*/
+   hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
+   /*
+* The core tick broadcast mode expects bc->bound_on to be set
+* correctly to prevent a CPU which has the broadcast hrtimer
+* armed from going deep idle.
 *
-* As this can be called from idle code, the hrtimer_start()
-* invocation has to be wrapped with RCU_NONIDLE() as
-* hrtimer_start() can call into tracing.
+* As tick_broadcast_lock is held, nothing can change the cpu
+* base which was just established in hrtimer_start() above. So
+* the below access is safe even without holding the hrtimer
+* base lock.
 */
-   RCU_NONIDLE( {
-   hrtimer_start(&bctimer, expires, HRTIMER_MODE_ABS_PINNED_HARD);
-   /*
-* The core tick broadcast mode expects bc->bound_on to be set
-* correctly to prevent a CPU which has the broadcast hrtimer
-* armed from going deep idle.
-*
-* As tick_broadcast_lock is held, nothing can change the cpu
-* base which was just established in hrtimer_start() above. So
-* the below access is safe even without holding the hrtimer
-* base lock.
-*/
-   bc->bound_on = bctimer.base->cpu_base->cpu;
-   } );
+   bc->bound_on = bctimer.base->cpu_base->cpu;
+
return 0;
 }
 




[PATCH v3 14/51] cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter, exit}()

2023-01-12 Thread Peter Zijlstra
All callers should still have RCU enabled.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ulf Hansson 
Acked-by: Mark Rutland 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 kernel/cpu_pm.c |9 -
 1 file changed, 9 deletions(-)

--- a/kernel/cpu_pm.c
+++ b/kernel/cpu_pm.c
@@ -30,16 +30,9 @@ static int cpu_pm_notify(enum cpu_pm_eve
 {
int ret;
 
-   /*
-* This introduces a RCU read critical section, which could be
-* disfunctional in cpu idle. Copy RCU_NONIDLE code to let RCU know
-* this.
-*/
-   ct_irq_enter_irqson();
rcu_read_lock();
	ret = raw_notifier_call_chain(&cpu_pm_notifier.chain, event, NULL);
rcu_read_unlock();
-   ct_irq_exit_irqson();
 
return notifier_to_errno(ret);
 }
@@ -49,11 +42,9 @@ static int cpu_pm_notify_robust(enum cpu
unsigned long flags;
int ret;
 
-   ct_irq_enter_irqson();
	raw_spin_lock_irqsave(&cpu_pm_notifier.lock, flags);
	ret = raw_notifier_call_chain_robust(&cpu_pm_notifier.chain, event_up, 
	event_down, NULL);
	raw_spin_unlock_irqrestore(&cpu_pm_notifier.lock, flags);
-   ct_irq_exit_irqson();
 
return notifier_to_errno(ret);
 }




[PATCH v3 42/51] ubsan: Fix objtool UACCESS warns

2023-01-12 Thread Peter Zijlstra
clang-14 allyesconfig gives:

vmlinux.o: warning: objtool: emulator_cmpxchg_emulated+0x705: call to 
__ubsan_handle_load_invalid_value() with UACCESS enabled
vmlinux.o: warning: objtool: paging64_update_accessed_dirty_bits+0x39e: call to 
__ubsan_handle_load_invalid_value() with UACCESS enabled
vmlinux.o: warning: objtool: paging32_update_accessed_dirty_bits+0x390: call to 
__ubsan_handle_load_invalid_value() with UACCESS enabled
vmlinux.o: warning: objtool: ept_update_accessed_dirty_bits+0x43f: call to 
__ubsan_handle_load_invalid_value() with UACCESS enabled

Add the required eflags save/restore and whitelist the thing.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 lib/ubsan.c   |5 -
 tools/objtool/check.c |1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -340,9 +340,10 @@ void __ubsan_handle_load_invalid_value(v
 {
struct invalid_value_data *data = _data;
char val_str[VALUE_LENGTH];
+   unsigned long ua_flags = user_access_save();
 
	if (suppress_report(&data->location))
-   return;
+   goto out;
 
	ubsan_prologue(&data->location, "invalid-load");
 
@@ -352,6 +353,8 @@ void __ubsan_handle_load_invalid_value(v
val_str, data->type->type_name);
 
ubsan_epilogue();
+out:
+   user_access_restore(ua_flags);
 }
 EXPORT_SYMBOL(__ubsan_handle_load_invalid_value);
 
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1068,6 +1068,7 @@ static const char *uaccess_safe_builtin[
"__ubsan_handle_type_mismatch",
"__ubsan_handle_type_mismatch_v1",
"__ubsan_handle_shift_out_of_bounds",
+   "__ubsan_handle_load_invalid_value",
/* misc */
"csum_partial_copy_generic",
"copy_mc_fragile",




[PATCH v3 35/51] trace,hardirq: No moar _rcuidle() tracing

2023-01-12 Thread Peter Zijlstra
Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
_rcuidle() tracepoint through local_irq_{en,dis}able().

For 'sane' configs, these calls will only happen with RCU enabled and
as such can use the regular tracepoint. This also means it's possible
to trace them from NMI context again.
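
The trace(point) helper added below is plain token pasting; a
stand-alone illustration of the dispatch (stub tracepoints without
arguments, build with or without -DARCH_WANTS_NO_INSTR):

/* Minimal illustration of the trace(point) selection macro. */
#include <stdio.h>

static int in_nmi(void)				{ return 0; }
static void trace_irq_enable(void)		{ puts("trace_irq_enable"); }
static void trace_irq_enable_rcuidle(void)	{ puts("trace_irq_enable_rcuidle"); }

#ifdef ARCH_WANTS_NO_INSTR
/* Entry/idle code guarantees RCU is watching: always the plain event. */
#define trace(point)	trace_##point
#else
/* Otherwise fall back to the _rcuidle variant, which can't run in NMI. */
#define trace(point)	if (!in_nmi()) trace_##point##_rcuidle
#endif

int main(void)
{
	trace(irq_enable)();
	return 0;
}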

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/trace/trace_preemptirq.c |   21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -20,6 +20,15 @@
 static DEFINE_PER_CPU(int, tracing_irq_cpu);
 
 /*
+ * ...
+ */
+#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#define trace(point)   trace_##point
+#else
+#define trace(point)   if (!in_nmi()) trace_##point##_rcuidle
+#endif
+
+/*
  * Like trace_hardirqs_on() but without the lockdep invocation. This is
  * used in the low level entry code where the ordering vs. RCU is important
  * and lockdep uses a staged approach which splits the lockdep hardirq
@@ -28,8 +37,7 @@ static DEFINE_PER_CPU(int, tracing_irq_c
 void trace_hardirqs_on_prepare(void)
 {
if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
-   trace_irq_enable(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
this_cpu_write(tracing_irq_cpu, 0);
}
@@ -40,8 +48,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on_prepar
 void trace_hardirqs_on(void)
 {
if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
-   trace_irq_enable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_enable)(CALLER_ADDR0, CALLER_ADDR1);
tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
this_cpu_write(tracing_irq_cpu, 0);
}
@@ -63,8 +70,7 @@ void trace_hardirqs_off_finish(void)
if (!this_cpu_read(tracing_irq_cpu)) {
this_cpu_write(tracing_irq_cpu, 1);
tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
-   if (!in_nmi())
-   trace_irq_disable(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_disable)(CALLER_ADDR0, CALLER_ADDR1);
}
 
 }
@@ -78,8 +84,7 @@ void trace_hardirqs_off(void)
if (!this_cpu_read(tracing_irq_cpu)) {
this_cpu_write(tracing_irq_cpu, 1);
tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
-   if (!in_nmi())
-   trace_irq_disable_rcuidle(CALLER_ADDR0, CALLER_ADDR1);
+   trace(irq_disable)(CALLER_ADDR0, CALLER_ADDR1);
}
 }
 EXPORT_SYMBOL(trace_hardirqs_off);




[PATCH v3 34/51] trace: WARN on rcuidle

2023-01-12 Thread Peter Zijlstra
ARCH_WANTS_NO_INSTR (a superset of CONFIG_GENERIC_ENTRY) disallows any
and all tracing when RCU isn't enabled.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 include/linux/tracepoint.h |   15 +--
 kernel/trace/trace.c   |3 +++
 2 files changed, 16 insertions(+), 2 deletions(-)

--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -178,6 +178,17 @@ static inline struct tracepoint *tracepo
 #endif /* CONFIG_HAVE_STATIC_CALL */
 
 /*
+ * ARCH_WANTS_NO_INSTR archs are expected to have sanitized entry and idle
+ * code that disallow any/all tracing/instrumentation when RCU isn't watching.
+ */
+#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#define RCUIDLE_COND(rcuidle)  (rcuidle)
+#else
+/* srcu can't be used from NMI */
+#define RCUIDLE_COND(rcuidle)  (rcuidle && in_nmi())
+#endif
+
+/*
  * it_func[0] is never NULL because there is at least one element in the array
  * when the array itself is non NULL.
  */
@@ -188,8 +199,8 @@ static inline struct tracepoint *tracepo
if (!(cond))\
return; \
\
-   /* srcu can't be used from NMI */   \
-   WARN_ON_ONCE(rcuidle && in_nmi());  \
+   if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
+   return; \
\
/* keep srcu and sched-rcu usage consistent */  \
preempt_disable_notrace();  \
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3119,6 +3119,9 @@ void __trace_stack(struct trace_array *t
return;
}
 
+   if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY)))
+   return;
+
/*
 * When an NMI triggers, RCU is enabled via ct_nmi_enter(),
 * but if the above rcu_is_watching() failed, then the NMI




[PATCH v3 10/51] cpuidle,armada: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again before going idle is daft.

Notably the cpu_pm_*() calls implicitly re-enable RCU for a bit.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/cpuidle-mvebu-v7.c |7 +++
 1 file changed, 7 insertions(+)

--- a/drivers/cpuidle/cpuidle-mvebu-v7.c
+++ b/drivers/cpuidle/cpuidle-mvebu-v7.c
@@ -36,7 +36,10 @@ static int mvebu_v7_enter_idle(struct cp
if (drv->states[index].flags & MVEBU_V7_FLAG_DEEP_IDLE)
deepidle = true;
 
+   ct_idle_enter();
ret = mvebu_v7_cpu_suspend(deepidle);
+   ct_idle_exit();
+
cpu_pm_exit();
 
if (ret)
@@ -49,6 +52,7 @@ static struct cpuidle_driver armadaxp_id
.name   = "armada_xp_idle",
.states[0]  = ARM_CPUIDLE_WFI_STATE,
.states[1]  = {
+   .flags  = CPUIDLE_FLAG_RCU_IDLE,
.enter  = mvebu_v7_enter_idle,
.exit_latency   = 100,
.power_usage= 50,
@@ -57,6 +61,7 @@ static struct cpuidle_driver armadaxp_id
.desc   = "CPU power down",
},
.states[2]  = {
+   .flags  = CPUIDLE_FLAG_RCU_IDLE,
.enter  = mvebu_v7_enter_idle,
.exit_latency   = 1000,
.power_usage= 5,
@@ -72,6 +77,7 @@ static struct cpuidle_driver armada370_i
.name   = "armada_370_idle",
.states[0]  = ARM_CPUIDLE_WFI_STATE,
.states[1]  = {
+   .flags  = CPUIDLE_FLAG_RCU_IDLE,
.enter  = mvebu_v7_enter_idle,
.exit_latency   = 100,
.power_usage= 5,
@@ -87,6 +93,7 @@ static struct cpuidle_driver armada38x_i
.name   = "armada_38x_idle",
.states[0]  = ARM_CPUIDLE_WFI_STATE,
.states[1]  = {
+   .flags  = CPUIDLE_FLAG_RCU_IDLE,
.enter  = mvebu_v7_enter_idle,
.exit_latency   = 10,
.power_usage= 5,




[PATCH v3 45/51] sched: Always inline __this_cpu_preempt_check()

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: in_entry_stack+0x9: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: default_do_nmi+0x10: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: fpu_idle_fpregs+0x41: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: kvm_read_and_reset_apf_flags+0x1: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_on+0xb0: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: lockdep_hardirqs_off+0xae: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: irqentry_nmi_enter+0x69: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: irqentry_nmi_exit+0x32: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x9: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_idle_enter+0x43: call to 
__this_cpu_preempt_check() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_idle_enter_s2idle+0x45: call to 
__this_cpu_preempt_check() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 include/linux/percpu-defs.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -310,7 +310,7 @@ extern void __bad_size_call_parameter(vo
 #ifdef CONFIG_DEBUG_PREEMPT
 extern void __this_cpu_preempt_check(const char *op);
 #else
-static inline void __this_cpu_preempt_check(const char *op) { }
+static __always_inline void __this_cpu_preempt_check(const char *op) { }
 #endif
 
 #define __pcpu_size_call_return(stem, variable)
\




[PATCH v3 22/51] x86/tdx: Remove TDX_HCALL_ISSUE_STI

2023-01-12 Thread Peter Zijlstra
Now that arch_cpu_idle() is expected to return with IRQs disabled,
avoid the useless STI/CLI dance.

Per the specs this is supposed to work, but nobody has yet relied upon
this behaviour, so broken implementations are possible.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/coco/tdx/tdcall.S|   13 -
 arch/x86/coco/tdx/tdx.c   |   23 ---
 arch/x86/include/asm/shared/tdx.h |1 -
 3 files changed, 4 insertions(+), 33 deletions(-)

--- a/arch/x86/coco/tdx/tdcall.S
+++ b/arch/x86/coco/tdx/tdcall.S
@@ -139,19 +139,6 @@ SYM_FUNC_START(__tdx_hypercall)
 
movl $TDVMCALL_EXPOSE_REGS_MASK, %ecx
 
-   /*
-* For the idle loop STI needs to be called directly before the TDCALL
-* that enters idle (EXIT_REASON_HLT case). STI instruction enables
-* interrupts only one instruction later. If there is a window between
-* STI and the instruction that emulates the HALT state, there is a
-* chance for interrupts to happen in this window, which can delay the
-* HLT operation indefinitely. Since this is the not the desired
-* result, conditionally call STI before TDCALL.
-*/
-   testq $TDX_HCALL_ISSUE_STI, %rsi
-   jz .Lskip_sti
-   sti
-.Lskip_sti:
tdcall
 
/*
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -169,7 +169,7 @@ static int ve_instr_len(struct ve_info *
}
 }
 
-static u64 __cpuidle __halt(const bool irq_disabled, const bool do_sti)
+static u64 __cpuidle __halt(const bool irq_disabled)
 {
struct tdx_hypercall_args args = {
.r10 = TDX_HYPERCALL_STANDARD,
@@ -189,20 +189,14 @@ static u64 __cpuidle __halt(const bool i
 * can keep the vCPU in virtual HLT, even if an IRQ is
 * pending, without hanging/breaking the guest.
 */
-   return __tdx_hypercall(&args, do_sti ? TDX_HCALL_ISSUE_STI : 0);
+   return __tdx_hypercall(&args, 0);
 }
 
 static int handle_halt(struct ve_info *ve)
 {
-   /*
-* Since non safe halt is mainly used in CPU offlining
-* and the guest will always stay in the halt state, don't
-* call the STI instruction (set do_sti as false).
-*/
const bool irq_disabled = irqs_disabled();
-   const bool do_sti = false;
 
-   if (__halt(irq_disabled, do_sti))
+   if (__halt(irq_disabled))
return -EIO;
 
return ve_instr_len(ve);
@@ -210,22 +204,13 @@ static int handle_halt(struct ve_info *v
 
 void __cpuidle tdx_safe_halt(void)
 {
-/*
- * For do_sti=true case, __tdx_hypercall() function enables
- * interrupts using the STI instruction before the TDCALL. So
- * set irq_disabled as false.
- */
const bool irq_disabled = false;
-   const bool do_sti = true;
 
/*
 * Use WARN_ONCE() to report the failure.
 */
-   if (__halt(irq_disabled, do_sti))
+   if (__halt(irq_disabled))
WARN_ONCE(1, "HLT instruction emulation failed\n");
-
-   /* XXX I can't make sense of what @do_sti actually does */
-   raw_local_irq_disable();
 }
 
 static int read_msr(struct pt_regs *regs, struct ve_info *ve)
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -8,7 +8,6 @@
 #define TDX_HYPERCALL_STANDARD  0
 
 #define TDX_HCALL_HAS_OUTPUT   BIT(0)
-#define TDX_HCALL_ISSUE_STIBIT(1)
 
 #define TDX_CPUID_LEAF_ID  0x21
 #define TDX_IDENT  "IntelTDX"




[PATCH v3 02/51] x86/idle: Replace x86_idle with a static_call

2023-01-12 Thread Peter Zijlstra
Typical boot time setup; no need to suffer an indirect call for that.
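
For reference, a function-pointer stand-in for how the pieces map
(userspace sketch; the macro bodies below are NOT the kernel
implementation; the real static_call() patches the call site into a
direct CALL, so the indirect-branch cost goes away entirely):

#include <stdio.h>

#define DEFINE_STATIC_CALL_NULL(name, proto)	static void (*__sc_##name)(void)
#define static_call_update(name, fn)		(__sc_##name = (fn))
#define static_call_query(name)			(__sc_##name)
#define static_call(name)			(*__sc_##name)

static void default_idle(void)	{ puts("hlt"); }
static void mwait_idle(void)	{ puts("mwait"); }

DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);

static int x86_idle_set(void)
{
	return static_call_query(x86_idle) != NULL;
}

int main(void)
{
	printf("idle routine set: %d\n", x86_idle_set());	/* 0 */
	static_call_update(x86_idle, default_idle);		/* boot default */
	static_call_update(x86_idle, mwait_idle);		/* select_idle_routine() */
	static_call(x86_idle)();				/* arch_cpu_idle() prints "mwait" */
	return 0;
}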

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/kernel/process.c |   50 +-
 1 file changed, 28 insertions(+), 22 deletions(-)

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -692,7 +693,23 @@ void __switch_to_xtra(struct task_struct
 unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE;
 EXPORT_SYMBOL(boot_option_idle_override);
 
-static void (*x86_idle)(void);
+/*
+ * We use this if we don't have any better idle routine..
+ */
+void __cpuidle default_idle(void)
+{
+   raw_safe_halt();
+}
+#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
+EXPORT_SYMBOL(default_idle);
+#endif
+
+DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+
+static bool x86_idle_set(void)
+{
+   return !!static_call_query(x86_idle);
+}
 
 #ifndef CONFIG_SMP
 static inline void play_dead(void)
@@ -715,28 +732,17 @@ void arch_cpu_idle_dead(void)
 /*
  * Called from the generic idle code.
  */
-void arch_cpu_idle(void)
-{
-   x86_idle();
-}
-
-/*
- * We use this if we don't have any better idle routine..
- */
-void __cpuidle default_idle(void)
+void __cpuidle arch_cpu_idle(void)
 {
-   raw_safe_halt();
+   static_call(x86_idle)();
 }
-#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
-EXPORT_SYMBOL(default_idle);
-#endif
 
 #ifdef CONFIG_XEN
 bool xen_set_default_idle(void)
 {
-   bool ret = !!x86_idle;
+   bool ret = x86_idle_set();
 
-   x86_idle = default_idle;
+   static_call_update(x86_idle, default_idle);
 
return ret;
 }
@@ -859,20 +865,20 @@ void select_idle_routine(const struct cp
if (boot_option_idle_override == IDLE_POLL && smp_num_siblings > 1)
pr_warn_once("WARNING: polling idle and HT enabled, performance 
may degrade\n");
 #endif
-   if (x86_idle || boot_option_idle_override == IDLE_POLL)
+   if (x86_idle_set() || boot_option_idle_override == IDLE_POLL)
return;
 
if (boot_cpu_has_bug(X86_BUG_AMD_E400)) {
pr_info("using AMD E400 aware idle routine\n");
-   x86_idle = amd_e400_idle;
+   static_call_update(x86_idle, amd_e400_idle);
} else if (prefer_mwait_c1_over_halt(c)) {
pr_info("using mwait in idle threads\n");
-   x86_idle = mwait_idle;
+   static_call_update(x86_idle, mwait_idle);
} else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
pr_info("using TDX aware idle routine\n");
-   x86_idle = tdx_safe_halt;
+   static_call_update(x86_idle, tdx_safe_halt);
} else
-   x86_idle = default_idle;
+   static_call_update(x86_idle, default_idle);
 }
 
 void amd_e400_c1e_apic_setup(void)
@@ -925,7 +931,7 @@ static int __init idle_setup(char *str)
 * To continue to load the CPU idle driver, don't touch
 * the boot_option_idle_override.
 */
-   x86_idle = default_idle;
+   static_call_update(x86_idle, default_idle);
boot_option_idle_override = IDLE_HALT;
} else if (!strcmp(str, "nomwait")) {
/*




[PATCH v3 20/51] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IBRS

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: intel_idle_ibrs+0x17: call to spec_ctrl_current() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_ibrs+0x27: call to wrmsrl.constprop.0() 
leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/kernel/cpu/bugs.c |2 +-
 drivers/idle/intel_idle.c  |4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -79,7 +79,7 @@ void write_spec_ctrl_current(u64 val, bo
wrmsrl(MSR_IA32_SPEC_CTRL, val);
 }
 
-u64 spec_ctrl_current(void)
+noinstr u64 spec_ctrl_current(void)
 {
return this_cpu_read(x86_spec_ctrl_current);
 }
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -181,12 +181,12 @@ static __cpuidle int intel_idle_ibrs(str
int ret;
 
if (smt_active)
-   wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+   native_wrmsrl(MSR_IA32_SPEC_CTRL, 0);
 
ret = __intel_idle(dev, drv, index);
 
if (smt_active)
-   wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
+   native_wrmsrl(MSR_IA32_SPEC_CTRL, spec_ctrl);
 
return ret;
 }




[PATCH v3 13/51] cpuidle: Fix ct_idle_*() usage

2023-01-12 Thread Peter Zijlstra
The whole disable-RCU, enable-IRQS dance is very intricate since
changing IRQ state is traced, which depends on RCU.

Add two helpers for the cpuidle case that mirror the entry code.
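
The constraint the helpers encode, in stub form (intent only, not the
actual helper bodies; the real ones are added to include/linux/cpuidle.h
below and do considerably more bookkeeping): IRQ-state tracing depends
on RCU, so it has to run before RCU is switched off, and everything
past that point may only use the raw, untraced IRQ accessors.

#include <stdio.h>

static int rcu_watching = 1;

static void trace_hardirqs_on(void)
{
	if (!rcu_watching)
		puts("BUG: tracepoint with RCU off");
	else
		puts("irq-on tracing/lockdep bookkeeping");
}

static void ct_idle_enter(void)		{ rcu_watching = 0; }
static void ct_idle_exit(void)		{ rcu_watching = 1; }
static void raw_local_irq_enable(void)	{ /* untraced */ }
static void raw_local_irq_disable(void)	{ /* untraced */ }

static void ct_cpuidle_enter(void)
{
	trace_hardirqs_on();	/* while RCU is still watching */
	ct_idle_enter();
}

static void ct_cpuidle_exit(void)
{
	ct_idle_exit();		/* RCU back on before anything traced runs */
}

static void idle(void)
{
	ct_cpuidle_enter();
	raw_local_irq_enable();		/* arch idle: raw accessors only */
	/* ... wfi ... */
	raw_local_irq_disable();
	ct_cpuidle_exit();
}

int main(void)
{
	idle();
	return 0;
}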

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-imx/cpuidle-imx6q.c|4 +--
 arch/arm/mach-imx/cpuidle-imx6sx.c   |4 +--
 arch/arm/mach-omap2/cpuidle34xx.c|4 +--
 arch/arm/mach-omap2/cpuidle44xx.c|8 +++---
 drivers/acpi/processor_idle.c|8 --
 drivers/cpuidle/cpuidle-big_little.c |4 +--
 drivers/cpuidle/cpuidle-mvebu-v7.c   |4 +--
 drivers/cpuidle/cpuidle-psci.c   |4 +--
 drivers/cpuidle/cpuidle-riscv-sbi.c  |4 +--
 drivers/cpuidle/cpuidle-tegra.c  |8 +++---
 drivers/cpuidle/cpuidle.c|   11 
 include/linux/clockchips.h   |4 +--
 include/linux/cpuidle.h  |   34 --
 kernel/sched/idle.c  |   45 ++-
 kernel/time/tick-broadcast.c |6 +++-
 15 files changed, 86 insertions(+), 66 deletions(-)

--- a/arch/arm/mach-imx/cpuidle-imx6q.c
+++ b/arch/arm/mach-imx/cpuidle-imx6q.c
@@ -25,9 +25,9 @@ static int imx6q_enter_wait(struct cpuid
imx6_set_lpm(WAIT_UNCLOCKED);
	raw_spin_unlock(&cpuidle_lock);
 
-   ct_idle_enter();
+   ct_cpuidle_enter();
cpu_do_idle();
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
	raw_spin_lock(&cpuidle_lock);
if (num_idle_cpus-- == num_online_cpus())
--- a/arch/arm/mach-imx/cpuidle-imx6sx.c
+++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
@@ -47,9 +47,9 @@ static int imx6sx_enter_wait(struct cpui
cpu_pm_enter();
cpu_cluster_pm_enter();
 
-   ct_idle_enter();
+   ct_cpuidle_enter();
cpu_suspend(0, imx6sx_idle_finish);
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
cpu_cluster_pm_exit();
cpu_pm_exit();
--- a/arch/arm/mach-omap2/cpuidle34xx.c
+++ b/arch/arm/mach-omap2/cpuidle34xx.c
@@ -133,9 +133,9 @@ static int omap3_enter_idle(struct cpuid
}
 
/* Execute ARM wfi */
-   ct_idle_enter();
+   ct_cpuidle_enter();
omap_sram_idle();
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
/*
 * Call idle CPU PM enter notifier chain to restore
--- a/arch/arm/mach-omap2/cpuidle44xx.c
+++ b/arch/arm/mach-omap2/cpuidle44xx.c
@@ -105,9 +105,9 @@ static int omap_enter_idle_smp(struct cp
}
	raw_spin_unlock_irqrestore(&mpu_lock, flag);
 
-   ct_idle_enter();
+   ct_cpuidle_enter();
omap4_enter_lowpower(dev->cpu, cx->cpu_state);
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
	raw_spin_lock_irqsave(&mpu_lock, flag);
if (cx->mpu_state_vote == num_online_cpus())
@@ -186,10 +186,10 @@ static int omap_enter_idle_coupled(struc
}
}
 
-   ct_idle_enter();
+   ct_cpuidle_enter();
omap4_enter_lowpower(dev->cpu, cx->cpu_state);
cpu_done[dev->cpu] = true;
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
/* Wakeup CPU1 only if it is not offlined */
if (dev->cpu == 0 && cpumask_test_cpu(1, cpu_online_mask)) {
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -642,6 +642,8 @@ static int __cpuidle acpi_idle_enter_bm(
 */
bool dis_bm = pr->flags.bm_control;
 
+   instrumentation_begin();
+
/* If we can skip BM, demote to a safe state. */
if (!cx->bm_sts_skip && acpi_idle_bm_check()) {
dis_bm = false;
@@ -663,11 +665,11 @@ static int __cpuidle acpi_idle_enter_bm(
	raw_spin_unlock(&c3_lock);
}
 
-   ct_idle_enter();
+   ct_cpuidle_enter();
 
acpi_idle_do_entry(cx);
 
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
/* Re-enable bus master arbitration */
if (dis_bm) {
@@ -677,6 +679,8 @@ static int __cpuidle acpi_idle_enter_bm(
	raw_spin_unlock(&c3_lock);
}
 
+   instrumentation_end();
+
return index;
 }
 
--- a/drivers/cpuidle/cpuidle-big_little.c
+++ b/drivers/cpuidle/cpuidle-big_little.c
@@ -126,13 +126,13 @@ static int bl_enter_powerdown(struct cpu
struct cpuidle_driver *drv, int idx)
 {
cpu_pm_enter();
-   ct_idle_enter();
+   ct_cpuidle_enter();
 
cpu_suspend(0, bl_powerdown_finisher);
 
/* signals the MCPM core that CPU is out of low power state */
mcpm_cpu_powered_up();
-   ct_idle_exit();
+   ct_cpuidle_exit();
 
cpu_pm_exit();
 
--- a/drivers/cpuidle/cpuidle-mvebu-v7.c
+++ b/drivers/cpuidle/cpuidle-mvebu-v7.c
@@ -36,9 +36,9 @@ static int mvebu_v7_enter_idle(struct cp
if (drv->states[index].flags & MVEBU_

[PATCH v3 04/51] cpuidle: Move IRQ state validation

2023-01-12 Thread Peter Zijlstra
Make cpuidle_enter_state() consistent with the s2idle variant and
verify ->enter() always returns with interrupts disabled.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/cpuidle.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -236,7 +236,11 @@ int cpuidle_enter_state(struct cpuidle_d
stop_critical_timings();
if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
ct_idle_enter();
+
entered_state = target_state->enter(dev, drv, index);
+   if (WARN_ONCE(!irqs_disabled(), "%ps leaked IRQ state", 
target_state->enter))
+   raw_local_irq_disable();
+
if (!(target_state->flags & CPUIDLE_FLAG_RCU_IDLE))
ct_idle_exit();
start_critical_timings();
@@ -248,12 +252,8 @@ int cpuidle_enter_state(struct cpuidle_d
/* The cpu is no longer idle or about to enter idle. */
sched_idle_set_state(NULL);
 
-   if (broadcast) {
-   if (WARN_ON_ONCE(!irqs_disabled()))
-   local_irq_disable();
-
+   if (broadcast)
tick_broadcast_exit();
-   }
 
if (!cpuidle_state_is_coupled(drv, index))
local_irq_enable();




[PATCH v3 21/51] arch/idle: Change arch_cpu_idle() IRQ behaviour

2023-01-12 Thread Peter Zijlstra
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.

However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.

Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.
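
The saving, spelled out with stubs (userspace sketch that just counts
transitions): under the old contract an architecture that can idle
with IRQs masked still paid for an enable in arch_cpu_idle() plus the
immediate disable in the generic loop; under the new contract both
disappear.

/* Counting IRQ state flips under the old and new contract. */
#include <stdio.h>

static int flips;

static void irq_enable(void)	{ flips++; }
static void irq_disable(void)	{ flips++; }

/* Old contract: returns with IRQs enabled, even if the hardware could
 * have idled (and woken) with them masked. */
static void old_arch_cpu_idle(void)
{
	/* wfi/hlt ... */
	irq_enable();
}

/* New contract: returns with IRQs disabled. */
static void new_arch_cpu_idle(void)
{
	/* wfi/hlt with IRQs masked ... */
}

int main(void)
{
	flips = 0;
	old_arch_cpu_idle();
	irq_disable();		/* generic idle loop re-disables right away */
	printf("old contract: %d flips\n", flips);	/* 2 */

	flips = 0;
	new_arch_cpu_idle();	/* nothing to undo */
	printf("new contract: %d flips\n", flips);	/* 0 */
	return 0;
}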

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Gautham R. Shenoy 
Acked-by: Mark Rutland  [arm64]
Acked-by: Rafael J. Wysocki 
Acked-by: Guo Ren 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/alpha/kernel/process.c  |1 -
 arch/arc/kernel/process.c|3 +++
 arch/arm/kernel/process.c|1 -
 arch/arm/mach-gemini/board-dt.c  |3 ++-
 arch/arm64/kernel/idle.c |1 -
 arch/csky/kernel/process.c   |1 -
 arch/csky/kernel/smp.c   |2 +-
 arch/hexagon/kernel/process.c|1 -
 arch/ia64/kernel/process.c   |1 +
 arch/loongarch/kernel/idle.c |1 +
 arch/microblaze/kernel/process.c |1 -
 arch/mips/kernel/idle.c  |8 +++-
 arch/nios2/kernel/process.c  |1 -
 arch/openrisc/kernel/process.c   |1 +
 arch/parisc/kernel/process.c |2 --
 arch/powerpc/kernel/idle.c   |5 ++---
 arch/riscv/kernel/process.c  |1 -
 arch/s390/kernel/idle.c  |1 -
 arch/sh/kernel/idle.c|1 +
 arch/sparc/kernel/leon_pmc.c |4 
 arch/sparc/kernel/process_32.c   |1 -
 arch/sparc/kernel/process_64.c   |3 ++-
 arch/um/kernel/process.c |1 -
 arch/x86/coco/tdx/tdx.c  |3 +++
 arch/x86/kernel/process.c|   15 ---
 arch/xtensa/kernel/process.c |1 +
 kernel/sched/idle.c  |2 --
 27 files changed, 29 insertions(+), 37 deletions(-)

--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -57,7 +57,6 @@ EXPORT_SYMBOL(pm_power_off);
 void arch_cpu_idle(void)
 {
wtint(0);
-   raw_local_irq_enable();
 }
 
 void arch_cpu_idle_dead(void)
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -114,6 +114,8 @@ void arch_cpu_idle(void)
"sleep %0   \n"
:
:"I"(arg)); /* can't be "r" has to be embedded const */
+
+   raw_local_irq_disable();
 }
 
 #else  /* ARC700 */
@@ -122,6 +124,7 @@ void arch_cpu_idle(void)
 {
/* sleep, but enable both set E1/E2 (levels of interrupts) before 
committing */
__asm__ __volatile__("sleep 0x3 \n");
+   raw_local_irq_disable();
 }
 
 #endif
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -78,7 +78,6 @@ void arch_cpu_idle(void)
arm_pm_idle();
else
cpu_do_idle();
-   raw_local_irq_enable();
 }
 
 void arch_cpu_idle_prepare(void)
--- a/arch/arm/mach-gemini/board-dt.c
+++ b/arch/arm/mach-gemini/board-dt.c
@@ -42,8 +42,9 @@ static void gemini_idle(void)
 */
 
/* FIXME: Enabling interrupts here is racy! */
-   local_irq_enable();
+   raw_local_irq_enable();
cpu_do_idle();
+   raw_local_irq_disable();
 }
 
 static void __init gemini_init_machine(void)
--- a/arch/arm64/kernel/idle.c
+++ b/arch/arm64/kernel/idle.c
@@ -42,5 +42,4 @@ void noinstr arch_cpu_idle(void)
 * tricks
 */
cpu_do_idle();
-   raw_local_irq_enable();
 }
--- a/arch/csky/kernel/process.c
+++ b/arch/csky/kernel/process.c
@@ -100,6 +100,5 @@ void arch_cpu_idle(void)
 #ifdef CONFIG_CPU_PM_STOP
asm volatile("stop\n");
 #endif
-   raw_local_irq_enable();
 }
 #endif
--- a/arch/csky/kernel/smp.c
+++ b/arch/csky/kernel/smp.c
@@ -309,7 +309,7 @@ void arch_cpu_idle_dead(void)
while (!secondary_stack)
arch_cpu_idle();
 
-   local_irq_disable();
+   raw_local_irq_disable();
 
asm volatile(
"movsp, %0\n"
--- a/arch/hexagon/kernel/process.c
+++ b/arch/hexagon/kernel/process.c
@@ -44,7 +44,6 @@ void arch_cpu_idle(void)
 {
__vmwait();
/*  interrupts wake us up, but irqs are still disabled */
-   raw_local_irq_enable();
 }
 
 /*
--- a/arch/ia64/kernel/process.c
+++ b/arch/ia64/kernel/process.c
@@ -242,6 +242,7 @@ void arch_cpu_idle(void)
(*mark_idle)(1);
 
raw_safe_halt();
+   raw_local_irq_disable();
 
if (mark_idle)
(*mark_idle)(0);
--- a/arch/loongarch/kernel/idle.c
+++ b/arch/loongarch/kernel/idle.c
@@ -13,4 +13,5 @@ void __cpuidle arch_cpu_idle(void)
 {
raw_local_irq_enable();
__arch_cpu_idle(); /* idle instruction needs irq enabled */
+   raw_local_irq_disable();
 }
--- a/arch/microblaze/kernel/process.c
+++ b/arch/microblaze/kernel/process.c
@@ -140,5 +140,4 @@ int dump_fpu(struct pt_regs *regs, elf_f

[PATCH v3 25/51] printk: Remove trace_.*_rcuidle() usage

2023-01-12 Thread Peter Zijlstra
The problem, per commit fc98c3c8c9dc ("printk: use rcuidle console
tracepoint"), was printk usage from the cpuidle path where RCU was
already disabled.

Per the patches earlier in this series, this is no longer the case.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Sergey Senozhatsky 
Acked-by: Petr Mladek 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 kernel/printk/printk.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2238,7 +2238,7 @@ static u16 printk_sprint(char *text, u16
}
}
 
-   trace_console_rcuidle(text, text_len);
+   trace_console(text, text_len);
 
return text_len;
 }




[PATCH v3 33/51] trace: Remove trace_hardirqs_{on,off}_caller()

2023-01-12 Thread Peter Zijlstra
Per commit 56e62a737028 ("s390: convert to generic entry") the last
and only callers of trace_hardirqs_{on,off}_caller() went away; clean
up.

Cc: Sven Schnelle 
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/trace/trace_preemptirq.c |   29 -
 1 file changed, 29 deletions(-)

--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -84,35 +84,6 @@ void trace_hardirqs_off(void)
 }
 EXPORT_SYMBOL(trace_hardirqs_off);
 NOKPROBE_SYMBOL(trace_hardirqs_off);
-
-__visible void trace_hardirqs_on_caller(unsigned long caller_addr)
-{
-   if (this_cpu_read(tracing_irq_cpu)) {
-   if (!in_nmi())
-   trace_irq_enable_rcuidle(CALLER_ADDR0, caller_addr);
-   tracer_hardirqs_on(CALLER_ADDR0, caller_addr);
-   this_cpu_write(tracing_irq_cpu, 0);
-   }
-
-   lockdep_hardirqs_on_prepare();
-   lockdep_hardirqs_on(caller_addr);
-}
-EXPORT_SYMBOL(trace_hardirqs_on_caller);
-NOKPROBE_SYMBOL(trace_hardirqs_on_caller);
-
-__visible void trace_hardirqs_off_caller(unsigned long caller_addr)
-{
-   lockdep_hardirqs_off(caller_addr);
-
-   if (!this_cpu_read(tracing_irq_cpu)) {
-   this_cpu_write(tracing_irq_cpu, 1);
-   tracer_hardirqs_off(CALLER_ADDR0, caller_addr);
-   if (!in_nmi())
-   trace_irq_disable_rcuidle(CALLER_ADDR0, caller_addr);
-   }
-}
-EXPORT_SYMBOL(trace_hardirqs_off_caller);
-NOKPROBE_SYMBOL(trace_hardirqs_off_caller);
 #endif /* CONFIG_TRACE_IRQFLAGS */
 
 #ifdef CONFIG_TRACE_PREEMPT_TOGGLE




[PATCH v3 18/51] cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE *again*

2023-01-12 Thread Peter Zijlstra
  vmlinux.o: warning: objtool: intel_idle_irq+0x10c: call to 
trace_hardirqs_off() leaves .noinstr.text section

As per commit 32d4fd5751ea ("cpuidle,intel_idle: Fix
CPUIDLE_FLAG_IRQ_ENABLE"):

  "must not have tracing in idle functions"

Clearly people can't read and tinker along until the splat disappears.
This straight up reverts commit d295ad34f236 ("intel_idle: Fix false
positive RCU splats due to incorrect hardirqs state").

It doesn't re-introduce the problem because preceding patches fixed it
properly.

Fixes: d295ad34f236 ("intel_idle: Fix false positive RCU splats due to 
incorrect hardirqs state")
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/idle/intel_idle.c |8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -168,13 +168,7 @@ static __cpuidle int intel_idle_irq(stru
 
raw_local_irq_enable();
ret = __intel_idle(dev, drv, index);
-
-   /*
-* The lockdep hardirqs state may be changed to 'on' with timer
-* tick interrupt followed by __do_softirq(). Use local_irq_disable()
-* to keep the hardirqs state correct.
-*/
-   local_irq_disable();
+   raw_local_irq_disable();
 
return ret;
 }




[PATCH v3 01/51] x86/perf/amd: Remove tracing from perf_lopwr_cb()

2023-01-12 Thread Peter Zijlstra
The perf_lopwr_cb() is called from the idle routines; there is no RCU
there, we must not enter tracing.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/events/amd/brs.c |   13 +
 arch/x86/include/asm/perf_event.h |2 +-
 2 files changed, 6 insertions(+), 9 deletions(-)

--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -41,18 +41,15 @@ static inline unsigned int brs_to(int id
return MSR_AMD_SAMP_BR_FROM + 2 * idx + 1;
 }
 
-static inline void set_debug_extn_cfg(u64 val)
+static __always_inline void set_debug_extn_cfg(u64 val)
 {
/* bits[4:3] must always be set to 11b */
-   wrmsrl(MSR_AMD_DBG_EXTN_CFG, val | 3ULL << 3);
+   __wrmsr(MSR_AMD_DBG_EXTN_CFG, val | 3ULL << 3, val >> 32);
 }
 
-static inline u64 get_debug_extn_cfg(void)
+static __always_inline u64 get_debug_extn_cfg(void)
 {
-   u64 val;
-
-   rdmsrl(MSR_AMD_DBG_EXTN_CFG, val);
-   return val;
+   return __rdmsr(MSR_AMD_DBG_EXTN_CFG);
 }
 
 static bool __init amd_brs_detect(void)
@@ -338,7 +335,7 @@ void amd_pmu_brs_sched_task(struct perf_
  * called from ACPI processor_idle.c or acpi_pad.c
  * with interrupts disabled
  */
-void perf_amd_brs_lopwr_cb(bool lopwr_in)
+void noinstr perf_amd_brs_lopwr_cb(bool lopwr_in)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
union amd_debug_extn_cfg cfg;
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -554,7 +554,7 @@ extern void perf_amd_brs_lopwr_cb(bool l
 
 DECLARE_STATIC_CALL(perf_lopwr_cb, perf_amd_brs_lopwr_cb);
 
-static inline void perf_lopwr_cb(bool lopwr_in)
+static __always_inline void perf_lopwr_cb(bool lopwr_in)
 {
static_call_mod(perf_lopwr_cb)(lopwr_in);
 }




[PATCH v3 28/51] cpuidle,mwait: Make noinstr clean

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: intel_idle_s2idle+0x6e: call to 
__monitor.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0x8c: call to 
__monitor.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle+0x73: call to __monitor.constprop.0() 
leaves .noinstr.text section

vmlinux.o: warning: objtool: mwait_idle+0x88: call to clflush() leaves 
.noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/include/asm/mwait.h |   12 ++--
 arch/x86/include/asm/special_insns.h |2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -25,7 +25,7 @@
 #define TPAUSE_C01_STATE   1
 #define TPAUSE_C02_STATE   0
 
-static inline void __monitor(const void *eax, unsigned long ecx,
+static __always_inline void __monitor(const void *eax, unsigned long ecx,
 unsigned long edx)
 {
/* "monitor %eax, %ecx, %edx;" */
@@ -33,7 +33,7 @@ static inline void __monitor(const void
 :: "a" (eax), "c" (ecx), "d"(edx));
 }
 
-static inline void __monitorx(const void *eax, unsigned long ecx,
+static __always_inline void __monitorx(const void *eax, unsigned long ecx,
  unsigned long edx)
 {
/* "monitorx %eax, %ecx, %edx;" */
@@ -41,7 +41,7 @@ static inline void __monitorx(const void
 :: "a" (eax), "c" (ecx), "d"(edx));
 }
 
-static inline void __mwait(unsigned long eax, unsigned long ecx)
+static __always_inline void __mwait(unsigned long eax, unsigned long ecx)
 {
mds_idle_clear_cpu_buffers();
 
@@ -76,8 +76,8 @@ static inline void __mwait(unsigned long
  * EAX (logical) address to monitor
  * ECX #GP if not zero
  */
-static inline void __mwaitx(unsigned long eax, unsigned long ebx,
-   unsigned long ecx)
+static __always_inline void __mwaitx(unsigned long eax, unsigned long ebx,
+unsigned long ecx)
 {
/* No MDS buffer clear as this is AMD/HYGON only */
 
@@ -86,7 +86,7 @@ static inline void __mwaitx(unsigned lon
 :: "a" (eax), "b" (ebx), "c" (ecx));
 }
 
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
+static __always_inline void __sti_mwait(unsigned long eax, unsigned long ecx)
 {
mds_idle_clear_cpu_buffers();
/* "mwait %eax, %ecx;" */
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -196,7 +196,7 @@ static inline void load_gs_index(unsigne
 
 #endif /* CONFIG_PARAVIRT_XXL */
 
-static inline void clflush(volatile void *__p)
+static __always_inline void clflush(volatile void *__p)
 {
asm volatile("clflush %0" : "+m" (*(volatile char __force *)__p));
 }




[PATCH v3 08/51] cpuidle,imx6: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again, at least twice, before going idle is daft.

Notably both cpu_pm_enter() and cpu_cluster_pm_enter() implicitly
re-enable RCU.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-imx/cpuidle-imx6sx.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/arch/arm/mach-imx/cpuidle-imx6sx.c
+++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
@@ -47,7 +47,9 @@ static int imx6sx_enter_wait(struct cpui
cpu_pm_enter();
cpu_cluster_pm_enter();
 
+   ct_idle_enter();
cpu_suspend(0, imx6sx_idle_finish);
+   ct_idle_exit();
 
cpu_cluster_pm_exit();
cpu_pm_exit();
@@ -87,7 +89,8 @@ static struct cpuidle_driver imx6sx_cpui
 */
.exit_latency = 300,
.target_residency = 500,
-   .flags = CPUIDLE_FLAG_TIMER_STOP,
+   .flags = CPUIDLE_FLAG_TIMER_STOP |
+CPUIDLE_FLAG_RCU_IDLE,
.enter = imx6sx_enter_wait,
.name = "LOW-POWER-IDLE",
.desc = "ARM power off",




[PATCH v3 30/51] cpuidle,xenpv: Make more PARAVIRT_XXL noinstr clean

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: acpi_idle_enter_s2idle+0xde: call to wbinvd() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: default_idle+0x4: call to arch_safe_halt() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: xen_safe_halt+0xa: call to 
HYPERVISOR_sched_op.constprop.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Srivatsa S. Bhat (VMware) 
Reviewed-by: Juergen Gross 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/include/asm/paravirt.h  |6 --
 arch/x86/include/asm/special_insns.h |4 ++--
 arch/x86/include/asm/xen/hypercall.h |2 +-
 arch/x86/kernel/paravirt.c   |   14 --
 arch/x86/xen/enlighten_pv.c  |2 +-
 arch/x86/xen/irq.c   |2 +-
 6 files changed, 21 insertions(+), 9 deletions(-)

--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -168,7 +168,7 @@ static inline void __write_cr4(unsigned
PVOP_VCALL1(cpu.write_cr4, x);
 }
 
-static inline void arch_safe_halt(void)
+static __always_inline void arch_safe_halt(void)
 {
PVOP_VCALL0(irq.safe_halt);
 }
@@ -178,7 +178,9 @@ static inline void halt(void)
PVOP_VCALL0(irq.halt);
 }
 
-static inline void wbinvd(void)
+extern noinstr void pv_native_wbinvd(void);
+
+static __always_inline void wbinvd(void)
 {
PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV));
 }
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -115,7 +115,7 @@ static inline void wrpkru(u32 pkru)
 }
 #endif
 
-static inline void native_wbinvd(void)
+static __always_inline void native_wbinvd(void)
 {
asm volatile("wbinvd": : :"memory");
 }
@@ -179,7 +179,7 @@ static inline void __write_cr4(unsigned
native_write_cr4(x);
 }
 
-static inline void wbinvd(void)
+static __always_inline void wbinvd(void)
 {
native_wbinvd();
 }
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -382,7 +382,7 @@ MULTI_stack_switch(struct multicall_entr
 }
 #endif
 
-static inline int
+static __always_inline int
 HYPERVISOR_sched_op(int cmd, void *arg)
 {
return _hypercall2(int, sched_op, cmd, arg);
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -233,6 +233,11 @@ static noinstr void pv_native_set_debugr
native_set_debugreg(regno, val);
 }
 
+noinstr void pv_native_wbinvd(void)
+{
+   native_wbinvd();
+}
+
 static noinstr void pv_native_irq_enable(void)
 {
native_irq_enable();
@@ -242,6 +247,11 @@ static noinstr void pv_native_irq_disabl
 {
native_irq_disable();
 }
+
+static noinstr void pv_native_safe_halt(void)
+{
+   native_safe_halt();
+}
 #endif
 
 enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
@@ -273,7 +283,7 @@ struct paravirt_patch_template pv_ops =
.cpu.read_cr0   = native_read_cr0,
.cpu.write_cr0  = native_write_cr0,
.cpu.write_cr4  = native_write_cr4,
-   .cpu.wbinvd = native_wbinvd,
+   .cpu.wbinvd = pv_native_wbinvd,
.cpu.read_msr   = native_read_msr,
.cpu.write_msr  = native_write_msr,
.cpu.read_msr_safe  = native_read_msr_safe,
@@ -307,7 +317,7 @@ struct paravirt_patch_template pv_ops =
.irq.save_fl= __PV_IS_CALLEE_SAVE(native_save_fl),
.irq.irq_disable= __PV_IS_CALLEE_SAVE(pv_native_irq_disable),
.irq.irq_enable = __PV_IS_CALLEE_SAVE(pv_native_irq_enable),
-   .irq.safe_halt  = native_safe_halt,
+   .irq.safe_halt  = pv_native_safe_halt,
.irq.halt   = native_halt,
 #endif /* CONFIG_PARAVIRT_XXL */
 
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1019,7 +1019,7 @@ static const typeof(pv_ops) xen_cpu_ops
 
.write_cr4 = xen_write_cr4,
 
-   .wbinvd = native_wbinvd,
+   .wbinvd = pv_native_wbinvd,
 
.read_msr = xen_read_msr,
.write_msr = xen_write_msr,
--- a/arch/x86/xen/irq.c
+++ b/arch/x86/xen/irq.c
@@ -24,7 +24,7 @@ noinstr void xen_force_evtchn_callback(v
(void)HYPERVISOR_xen_version(0, NULL);
 }
 
-static void xen_safe_halt(void)
+static noinstr void xen_safe_halt(void)
 {
/* Blocking includes an implicit local_irq_enable(). */
if (HYPERVISOR_sched_op(SCHEDOP_block, NULL) != 0)




[PATCH v3 46/51] arm64,riscv,perf: Remove RCU_NONIDLE() usage

2023-01-12 Thread Peter Zijlstra
The PM notifiers should no longer be run with RCU disabled (per the
previous patches); as such, this hack is no longer required either.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/perf/arm_pmu.c   |   11 +--
 drivers/perf/riscv_pmu_sbi.c |8 +---
 2 files changed, 2 insertions(+), 17 deletions(-)

--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -762,17 +762,8 @@ static void cpu_pm_pmu_setup(struct arm_
case CPU_PM_ENTER_FAILED:
 /*
  * Restore and enable the counter.
- * armpmu_start() indirectly calls
- *
- * perf_event_update_userpage()
- *
- * that requires RCU read locking to be functional,
- * wrap the call within RCU_NONIDLE to make the
- * RCU subsystem aware this cpu is not idle from
- * an RCU perspective for the armpmu_start() call
- * duration.
  */
-   RCU_NONIDLE(armpmu_start(event, PERF_EF_RELOAD));
+   armpmu_start(event, PERF_EF_RELOAD);
break;
default:
break;
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -747,14 +747,8 @@ static int riscv_pm_pmu_notify(struct no
case CPU_PM_ENTER_FAILED:
/*
 * Restore and enable the counter.
-*
-* Requires RCU read locking to be functional,
-* wrap the call within RCU_NONIDLE to make the
-* RCU subsystem aware this cpu is not idle from
-* an RCU perspective for the riscv_pmu_start() call
-* duration.
 */
-   RCU_NONIDLE(riscv_pmu_start(event, PERF_EF_RELOAD));
+   riscv_pmu_start(event, PERF_EF_RELOAD);
break;
default:
break;




[PATCH v3 36/51] cpuidle,omap3: Use WFI for omap3_pm_idle()

2023-01-12 Thread Peter Zijlstra
arch_cpu_idle() is a very simple idle interface: it exposes only a
single idle state and is expected not to require RCU or do any
tracing/instrumentation.
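
For illustration only (hypothetical sketch, not part of this patch), a
conforming arch_cpu_idle() is expected to be little more than:

  void arch_cpu_idle(void)
  {
          /*
           * A plain, shallow "wait for interrupt": no tracing, no RCU
           * usage, none of the power/clock-domain machinery.
           */
          arch_wait_for_interrupt();  /* stand-in for the real idle insn */
  }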

As such, omap_sram_idle() is not a valid implementation. Replace it
with the simple (shallow) omap3_do_wfi() call, leaving the more
complicated idle states to the cpuidle driver.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Tony Lindgren 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/pm34xx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -294,7 +294,7 @@ static void omap3_pm_idle(void)
if (omap_irq_pending())
return;
 
-   omap_sram_idle();
+   omap3_do_wfi();
 }
 
 #ifdef CONFIG_SUSPEND




[PATCH v3 27/51] cpuidle, sched: Remove annotations from TIF_{POLLING_NRFLAG, NEED_RESCHED}

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: mwait_idle+0x5: call to 
current_set_polling_and_test() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0xc5: call to 
current_set_polling_and_test() leaves .noinstr.text section
vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to 
test_ti_thread_flag() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle+0xbc: call to 
current_set_polling_and_test() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0xea: call to 
current_set_polling_and_test() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_s2idle+0xb4: call to 
current_set_polling_and_test() leaves .noinstr.text section

vmlinux.o: warning: objtool: intel_idle+0xa6: call to current_clr_polling() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0xbf: call to current_clr_polling() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_s2idle+0xa1: call to 
current_clr_polling() leaves .noinstr.text section

vmlinux.o: warning: objtool: mwait_idle+0xe: call to __current_set_polling() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0xc5: call to 
__current_set_polling() leaves .noinstr.text section
vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to 
test_ti_thread_flag() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle+0xbc: call to __current_set_polling() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0xea: call to 
__current_set_polling() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_s2idle+0xb4: call to 
__current_set_polling() leaves .noinstr.text section

vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x73: call to 
test_ti_thread_flag() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_s2idle+0x73: call to 
test_ti_thread_flag.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle_irq+0x91: call to 
test_ti_thread_flag.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: intel_idle+0x78: call to 
test_ti_thread_flag.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_safe_halt+0xf: call to 
test_ti_thread_flag.constprop.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 include/linux/sched/idle.h  |   40 ++--
 include/linux/thread_info.h |   18 +-
 2 files changed, 47 insertions(+), 11 deletions(-)

--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -23,12 +23,37 @@ static inline void wake_up_if_idle(int c
  */
 #ifdef TIF_POLLING_NRFLAG
 
-static inline void __current_set_polling(void)
+#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_ATOMIC_H
+
+static __always_inline void __current_set_polling(void)
 {
-   set_thread_flag(TIF_POLLING_NRFLAG);
+   arch_set_bit(TIF_POLLING_NRFLAG,
+(unsigned long *)(&current_thread_info()->flags));
 }
 
-static inline bool __must_check current_set_polling_and_test(void)
+static __always_inline void __current_clr_polling(void)
+{
+   arch_clear_bit(TIF_POLLING_NRFLAG,
+  (unsigned long *)(&current_thread_info()->flags));
+}
+
+#else
+
+static __always_inline void __current_set_polling(void)
+{
+   set_bit(TIF_POLLING_NRFLAG,
+   (unsigned long *)(&current_thread_info()->flags));
+}
+
+static __always_inline void __current_clr_polling(void)
+{
+   clear_bit(TIF_POLLING_NRFLAG,
+ (unsigned long *)(&current_thread_info()->flags));
+}
+
+#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_ATOMIC_H */
+
+static __always_inline bool __must_check current_set_polling_and_test(void)
 {
__current_set_polling();
 
@@ -41,12 +66,7 @@ static inline bool __must_check current_
return unlikely(tif_need_resched());
 }
 
-static inline void __current_clr_polling(void)
-{
-   clear_thread_flag(TIF_POLLING_NRFLAG);
-}
-
-static inline bool __must_check current_clr_polling_and_test(void)
+static __always_inline bool __must_check current_clr_polling_and_test(void)
 {
__current_clr_polling();
 
@@ -73,7 +93,7 @@ static inline bool __must_check current_
 }
 #endif
 
-static inline void current_clr_polling(void)
+static __always_inline void current_clr_polling(void)
 {
__current_clr_polling();
 
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -177,7 +177,23 @@ static __always_inline unsigned long rea
clear_ti_thread_flag(task_thread_info(t), TIF_##fl)
 #endif /* !CONFIG_GENERIC_ENTRY */
 
-#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)
+#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
+
+static __always_inline bool tif_need_resched(void)
+{
+   return arch_test_bit(TIF_

[PATCH v3 12/51] cpuidle,dt: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again before going idle is daft.

Notably: this converts all dt_init_idle_driver() and
__CPU_PM_CPU_IDLE_ENTER() users, for they are inextricably intertwined.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/acpi/processor_idle.c|2 ++
 drivers/cpuidle/cpuidle-arm.c|1 +
 drivers/cpuidle/cpuidle-big_little.c |8 ++--
 drivers/cpuidle/cpuidle-psci.c   |1 +
 drivers/cpuidle/cpuidle-qcom-spm.c   |1 +
 drivers/cpuidle/cpuidle-riscv-sbi.c  |1 +
 drivers/cpuidle/dt_idle_states.c |2 +-
 include/linux/cpuidle.h  |2 ++
 8 files changed, 15 insertions(+), 3 deletions(-)

--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1219,6 +1219,8 @@ static int acpi_processor_setup_lpi_stat
state->target_residency = lpi->min_residency;
if (lpi->arch_flags)
state->flags |= CPUIDLE_FLAG_TIMER_STOP;
+   if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
+   state->flags |= CPUIDLE_FLAG_RCU_IDLE;
state->enter = acpi_idle_lpi_enter;
drv->safe_state_index = i;
}
--- a/drivers/cpuidle/cpuidle-big_little.c
+++ b/drivers/cpuidle/cpuidle-big_little.c
@@ -64,7 +64,8 @@ static struct cpuidle_driver bl_idle_lit
.enter  = bl_enter_powerdown,
.exit_latency   = 700,
.target_residency   = 2500,
-   .flags  = CPUIDLE_FLAG_TIMER_STOP,
+   .flags  = CPUIDLE_FLAG_TIMER_STOP |
+ CPUIDLE_FLAG_RCU_IDLE,
.name   = "C1",
.desc   = "ARM little-cluster power down",
},
@@ -85,7 +86,8 @@ static struct cpuidle_driver bl_idle_big
.enter  = bl_enter_powerdown,
.exit_latency   = 500,
.target_residency   = 2000,
-   .flags  = CPUIDLE_FLAG_TIMER_STOP,
+   .flags  = CPUIDLE_FLAG_TIMER_STOP |
+ CPUIDLE_FLAG_RCU_IDLE,
.name   = "C1",
.desc   = "ARM big-cluster power down",
},
@@ -124,11 +126,13 @@ static int bl_enter_powerdown(struct cpu
struct cpuidle_driver *drv, int idx)
 {
cpu_pm_enter();
+   ct_idle_enter();
 
cpu_suspend(0, bl_powerdown_finisher);
 
/* signals the MCPM core that CPU is out of low power state */
mcpm_cpu_powered_up();
+   ct_idle_exit();
 
cpu_pm_exit();
 
--- a/drivers/cpuidle/dt_idle_states.c
+++ b/drivers/cpuidle/dt_idle_states.c
@@ -77,7 +77,7 @@ static int init_state_node(struct cpuidl
if (err)
desc = state_node->name;
 
-   idle_state->flags = 0;
+   idle_state->flags = CPUIDLE_FLAG_RCU_IDLE;
if (of_property_read_bool(state_node, "local-timer-stop"))
idle_state->flags |= CPUIDLE_FLAG_TIMER_STOP;
/*
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -289,7 +289,9 @@ extern s64 cpuidle_governor_latency_req(
if (!is_retention)  \
__ret =  cpu_pm_enter();\
if (!__ret) {   \
+   ct_idle_enter();\
__ret = low_level_idle_enter(state);\
+   ct_idle_exit(); \
if (!is_retention)  \
cpu_pm_exit();  \
}   \




[PATCH v3 15/51] acpi_idle: Remove tracing

2023-01-12 Thread Peter Zijlstra
All the idle routines are called with RCU disabled; as such, there must
not be any tracing inside.

While there, clean up the io-port idle thing.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/acpi/processor_idle.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -109,8 +109,8 @@ static const struct dmi_system_id proces
 static void __cpuidle acpi_safe_halt(void)
 {
if (!tif_need_resched()) {
-   safe_halt();
-   local_irq_disable();
+   raw_safe_halt();
+   raw_local_irq_disable();
}
 }
 
@@ -525,8 +525,11 @@ static int acpi_idle_bm_check(void)
return bm_status;
 }
 
-static void wait_for_freeze(void)
+static __cpuidle void io_idle(unsigned long addr)
 {
+   /* IO port based C-state */
+   inb(addr);
+
 #ifdef CONFIG_X86
/* No delay is needed if we are in guest */
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
@@ -571,9 +574,7 @@ static void __cpuidle acpi_idle_do_entry
} else if (cx->entry_method == ACPI_CSTATE_HALT) {
acpi_safe_halt();
} else {
-   /* IO port based C-state */
-   inb(cx->address);
-   wait_for_freeze();
+   io_idle(cx->address);
}
 
perf_lopwr_cb(false);
@@ -595,8 +596,7 @@ static int acpi_idle_play_dead(struct cp
if (cx->entry_method == ACPI_CSTATE_HALT)
safe_halt();
else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
-   inb(cx->address);
-   wait_for_freeze();
+   io_idle(cx->address);
} else
return -ENODEV;
 




[PATCH v3 32/51] cpuidle,acpi: Make noinstr clean

2023-01-12 Thread Peter Zijlstra
vmlinux.o: warning: objtool: io_idle+0xc: call to __inb.isra.0() leaves 
.noinstr.text section
vmlinux.o: warning: objtool: acpi_idle_enter+0xfe: call to num_online_cpus() 
leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_idle_enter+0x115: call to 
acpi_idle_fallback_to_c1.isra.0() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/x86/include/asm/shared/io.h |4 ++--
 drivers/acpi/processor_idle.c|2 +-
 include/linux/cpumask.h  |4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/shared/io.h
+++ b/arch/x86/include/asm/shared/io.h
@@ -5,13 +5,13 @@
 #include 
 
 #define BUILDIO(bwl, bw, type) \
-static inline void __out##bwl(type value, u16 port)\
+static __always_inline void __out##bwl(type value, u16 port)   \
 {  \
asm volatile("out" #bwl " %" #bw "0, %w1"   \
 : : "a"(value), "Nd"(port));   \
 }  \
\
-static inline type __in##bwl(u16 port) \
+static __always_inline type __in##bwl(u16 port)
\
 {  \
type value; \
asm volatile("in" #bwl " %w1, %" #bw "0"\
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -593,7 +593,7 @@ static int acpi_idle_play_dead(struct cp
return 0;
 }
 
-static bool acpi_idle_fallback_to_c1(struct acpi_processor *pr)
+static __always_inline bool acpi_idle_fallback_to_c1(struct acpi_processor *pr)
 {
return IS_ENABLED(CONFIG_HOTPLUG_CPU) && !pr->flags.has_cst &&
!(acpi_gbl_FADT.flags & ACPI_FADT_C2_MP_SUPPORTED);
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -908,9 +908,9 @@ static inline const struct cpumask *get_
  * concurrent CPU hotplug operations unless invoked from a cpuhp_lock held
  * region.
  */
-static inline unsigned int num_online_cpus(void)
+static __always_inline unsigned int num_online_cpus(void)
 {
-   return atomic_read(&__num_online_cpus);
+   return arch_atomic_read(&__num_online_cpus);
 }
 #define num_possible_cpus()cpumask_weight(cpu_possible_mask)
 #define num_present_cpus() cpumask_weight(cpu_present_mask)




[PATCH v3 17/51] objtool/idle: Validate __cpuidle code as noinstr

2023-01-12 Thread Peter Zijlstra
Idle code is very much like entry code in that RCU isn't available. As
such, add a little validation.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Geert Uytterhoeven 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/alpha/kernel/vmlinux.lds.S  |1 -
 arch/arc/kernel/vmlinux.lds.S|1 -
 arch/arm/include/asm/vmlinux.lds.h   |1 -
 arch/arm64/kernel/vmlinux.lds.S  |1 -
 arch/csky/kernel/vmlinux.lds.S   |1 -
 arch/hexagon/kernel/vmlinux.lds.S|1 -
 arch/ia64/kernel/vmlinux.lds.S   |1 -
 arch/loongarch/kernel/vmlinux.lds.S  |1 -
 arch/m68k/kernel/vmlinux-nommu.lds   |1 -
 arch/m68k/kernel/vmlinux-std.lds |1 -
 arch/m68k/kernel/vmlinux-sun3.lds|1 -
 arch/microblaze/kernel/vmlinux.lds.S |1 -
 arch/mips/kernel/vmlinux.lds.S   |1 -
 arch/nios2/kernel/vmlinux.lds.S  |1 -
 arch/openrisc/kernel/vmlinux.lds.S   |1 -
 arch/parisc/kernel/vmlinux.lds.S |1 -
 arch/powerpc/kernel/vmlinux.lds.S|1 -
 arch/riscv/kernel/vmlinux-xip.lds.S  |1 -
 arch/riscv/kernel/vmlinux.lds.S  |1 -
 arch/s390/kernel/vmlinux.lds.S   |1 -
 arch/sh/kernel/vmlinux.lds.S |1 -
 arch/sparc/kernel/vmlinux.lds.S  |1 -
 arch/um/kernel/dyn.lds.S |1 -
 arch/um/kernel/uml.lds.S |1 -
 arch/x86/include/asm/irqflags.h  |   11 ---
 arch/x86/include/asm/mwait.h |2 +-
 arch/x86/kernel/vmlinux.lds.S|1 -
 arch/xtensa/kernel/vmlinux.lds.S |1 -
 include/asm-generic/vmlinux.lds.h|9 +++--
 include/linux/compiler_types.h   |8 ++--
 include/linux/cpu.h  |3 ---
 tools/objtool/check.c|   13 +
 32 files changed, 27 insertions(+), 45 deletions(-)

--- a/arch/alpha/kernel/vmlinux.lds.S
+++ b/arch/alpha/kernel/vmlinux.lds.S
@@ -27,7 +27,6 @@ SECTIONS
HEAD_TEXT
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
*(.fixup)
*(.gnu.warning)
--- a/arch/arc/kernel/vmlinux.lds.S
+++ b/arch/arc/kernel/vmlinux.lds.S
@@ -85,7 +85,6 @@ SECTIONS
_stext = .;
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
--- a/arch/arm/include/asm/vmlinux.lds.h
+++ b/arch/arm/include/asm/vmlinux.lds.h
@@ -96,7 +96,6 @@
SOFTIRQENTRY_TEXT   \
TEXT_TEXT   \
SCHED_TEXT  \
-   CPUIDLE_TEXT\
LOCK_TEXT   \
KPROBES_TEXT\
ARM_STUBS_TEXT  \
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -175,7 +175,6 @@ SECTIONS
ENTRY_TEXT
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
HYPERVISOR_TEXT
--- a/arch/csky/kernel/vmlinux.lds.S
+++ b/arch/csky/kernel/vmlinux.lds.S
@@ -34,7 +34,6 @@ SECTIONS
SOFTIRQENTRY_TEXT
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
*(.fixup)
--- a/arch/hexagon/kernel/vmlinux.lds.S
+++ b/arch/hexagon/kernel/vmlinux.lds.S
@@ -41,7 +41,6 @@ SECTIONS
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
*(.fixup)
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -51,7 +51,6 @@ SECTIONS {
__end_ivt_text = .;
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
--- a/arch/loongarch/kernel/vmlinux.lds.S
+++ b/arch/loongarch/kernel/vmlinux.lds.S
@@ -42,7 +42,6 @@ SECTIONS
.text : {
TEXT_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
--- a/arch/m68k/kernel/vmlinux-nommu.lds
+++ b/arch/m68k/kernel/vmlinux-nommu.lds
@@ -48,7 +48,6 @@ SECTIONS {
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
SCHED_TEXT
-   CPUIDLE_TEXT
LOCK_TEXT
*(.fixup)
. = ALIGN(16);
--- a/arch/m68k/kernel/vmlinux

[PATCH v3 07/51] cpuidle,psci: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again, at least twice, before going idle is daft.

Notably once implicitly through the cpu_pm_*() calls and once
explicitly doing ct_irq_*_irqon().

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Guo Ren 
Acked-by: Rafael J. Wysocki 
Tested-by: Kajetan Puchalski 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/cpuidle-psci.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -69,12 +69,12 @@ static int __psci_enter_domain_idle_stat
return -1;
 
/* Do runtime PM to manage a hierarchical CPU toplogy. */
-   ct_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_suspend(pd_dev);
else
pm_runtime_put_sync_suspend(pd_dev);
-   ct_irq_exit_irqson();
+
+   ct_idle_enter();
 
state = psci_get_domain_state();
if (!state)
@@ -82,12 +82,12 @@ static int __psci_enter_domain_idle_stat
 
ret = psci_cpu_suspend_enter(state) ? -1 : idx;
 
-   ct_irq_enter_irqson();
+   ct_idle_exit();
+
if (s2idle)
dev_pm_genpd_resume(pd_dev);
else
pm_runtime_get_sync(pd_dev);
-   ct_irq_exit_irqson();
 
cpu_pm_exit();
 
@@ -240,6 +240,7 @@ static int psci_dt_cpu_init_topology(str
 * of a shared state for the domain, assumes the domain states are all
 * deeper states.
 */
+   drv->states[state_count - 1].flags |= CPUIDLE_FLAG_RCU_IDLE;
drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
drv->states[state_count - 1].enter_s2idle = 
psci_enter_s2idle_domain_idle_state;
psci_cpuidle_use_cpuhp = true;




[PATCH v3 11/51] cpuidle,omap4: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again, some *four* times, before going idle is daft.

Notably three times explicitly using RCU_NONIDLE() and once implicitly
through cpu_pm_*().

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Reviewed-by: Tony Lindgren 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/mach-omap2/cpuidle44xx.c |   29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

--- a/arch/arm/mach-omap2/cpuidle44xx.c
+++ b/arch/arm/mach-omap2/cpuidle44xx.c
@@ -105,7 +105,9 @@ static int omap_enter_idle_smp(struct cp
}
raw_spin_unlock_irqrestore(&mpu_lock, flag);
 
+   ct_idle_enter();
omap4_enter_lowpower(dev->cpu, cx->cpu_state);
+   ct_idle_exit();
 
raw_spin_lock_irqsave(&mpu_lock, flag);
if (cx->mpu_state_vote == num_online_cpus())
@@ -151,10 +153,10 @@ static int omap_enter_idle_coupled(struc
 (cx->mpu_logic_state == PWRDM_POWER_OFF);
 
/* Enter broadcast mode for periodic timers */
-   RCU_NONIDLE(tick_broadcast_enable());
+   tick_broadcast_enable();
 
/* Enter broadcast mode for one-shot timers */
-   RCU_NONIDLE(tick_broadcast_enter());
+   tick_broadcast_enter();
 
/*
 * Call idle CPU PM enter notifier chain so that
@@ -166,7 +168,7 @@ static int omap_enter_idle_coupled(struc
 
if (dev->cpu == 0) {
pwrdm_set_logic_retst(mpu_pd, cx->mpu_logic_state);
-   RCU_NONIDLE(omap_set_pwrdm_state(mpu_pd, cx->mpu_state));
+   omap_set_pwrdm_state(mpu_pd, cx->mpu_state);
 
/*
 * Call idle CPU cluster PM enter notifier chain
@@ -178,14 +180,16 @@ static int omap_enter_idle_coupled(struc
index = 0;
cx = state_ptr + index;
pwrdm_set_logic_retst(mpu_pd, 
cx->mpu_logic_state);
-   RCU_NONIDLE(omap_set_pwrdm_state(mpu_pd, 
cx->mpu_state));
+   omap_set_pwrdm_state(mpu_pd, cx->mpu_state);
mpuss_can_lose_context = 0;
}
}
}
 
+   ct_idle_enter();
omap4_enter_lowpower(dev->cpu, cx->cpu_state);
cpu_done[dev->cpu] = true;
+   ct_idle_exit();
 
/* Wakeup CPU1 only if it is not offlined */
if (dev->cpu == 0 && cpumask_test_cpu(1, cpu_online_mask)) {
@@ -194,9 +198,9 @@ static int omap_enter_idle_coupled(struc
mpuss_can_lose_context)
gic_dist_disable();
 
-   RCU_NONIDLE(clkdm_deny_idle(cpu_clkdm[1]));
-   RCU_NONIDLE(omap_set_pwrdm_state(cpu_pd[1], PWRDM_POWER_ON));
-   RCU_NONIDLE(clkdm_allow_idle(cpu_clkdm[1]));
+   clkdm_deny_idle(cpu_clkdm[1]);
+   omap_set_pwrdm_state(cpu_pd[1], PWRDM_POWER_ON);
+   clkdm_allow_idle(cpu_clkdm[1]);
 
if (IS_PM44XX_ERRATUM(PM_OMAP4_ROM_SMP_BOOT_ERRATUM_GICD) &&
mpuss_can_lose_context) {
@@ -222,7 +226,7 @@ static int omap_enter_idle_coupled(struc
cpu_pm_exit();
 
 cpu_pm_out:
-   RCU_NONIDLE(tick_broadcast_exit());
+   tick_broadcast_exit();
 
 fail:
cpuidle_coupled_parallel_barrier(dev, &abort_barrier);
@@ -247,7 +251,8 @@ static struct cpuidle_driver omap4_idle_
/* C2 - CPU0 OFF + CPU1 OFF + MPU CSWR */
.exit_latency = 328 + 440,
.target_residency = 960,
-   .flags = CPUIDLE_FLAG_COUPLED,
+   .flags = CPUIDLE_FLAG_COUPLED |
+CPUIDLE_FLAG_RCU_IDLE,
.enter = omap_enter_idle_coupled,
.name = "C2",
.desc = "CPUx OFF, MPUSS CSWR",
@@ -256,7 +261,8 @@ static struct cpuidle_driver omap4_idle_
/* C3 - CPU0 OFF + CPU1 OFF + MPU OSWR */
.exit_latency = 460 + 518,
.target_residency = 1100,
-   .flags = CPUIDLE_FLAG_COUPLED,
+   .flags = CPUIDLE_FLAG_COUPLED |
+CPUIDLE_FLAG_RCU_IDLE,
.enter = omap_enter_idle_coupled,
.name = "C3",
.desc = "CPUx OFF, MPUSS OSWR",
@@ -282,7 +288,8 @@ static struct cpuidle_driver omap5_idle_
/* C2 - CPU0 RET + CPU1 RET + MPU CSWR */
.exit_latency = 48 + 60,
.target_residency = 100,
-   .flags = CPUIDLE_FLAG_TIMER_STOP,
+   .flags = CPUIDLE_FLA

[PATCH v3 23/51] arm,smp: Remove trace_.*_rcuidle() usage

2023-01-12 Thread Peter Zijlstra
None of these functions should ever be run with RCU disabled anymore.

Specifically, do_handle_IPI() is only called from handle_IPI() which
explicitly does irq_enter()/irq_exit() which ensures RCU is watching.

The problem with smp_cross_call() was, per commit 7c64cc0531fa ("arm: Use
_rcuidle for smp_cross_call() tracepoints"), that
cpuidle_enter_state_coupled() already had RCU disabled, but that's
long been fixed by commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle
deeper into the idle path").

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ulf Hansson 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm/kernel/smp.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -639,7 +639,7 @@ static void do_handle_IPI(int ipinr)
unsigned int cpu = smp_processor_id();
 
if ((unsigned)ipinr < NR_IPI)
-   trace_ipi_entry_rcuidle(ipi_types[ipinr]);
+   trace_ipi_entry(ipi_types[ipinr]);
 
switch (ipinr) {
case IPI_WAKEUP:
@@ -686,7 +686,7 @@ static void do_handle_IPI(int ipinr)
}
 
if ((unsigned)ipinr < NR_IPI)
-   trace_ipi_exit_rcuidle(ipi_types[ipinr]);
+   trace_ipi_exit(ipi_types[ipinr]);
 }
 
 /* Legacy version, should go away once all irqchips have been converted */
@@ -709,7 +709,7 @@ static irqreturn_t ipi_handler(int irq,
 
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr)
 {
-   trace_ipi_raise_rcuidle(target, ipi_types[ipinr]);
+   trace_ipi_raise(target, ipi_types[ipinr]);
__ipi_send_mask(ipi_desc[ipinr], target);
 }
 




[PATCH v3 06/51] cpuidle,tegra: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again, at least twice, before going idle is daft.

Notably once implicitly through the cpu_pm_*() calls and once
explicitly doing RCU_NONIDLE().

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Frederic Weisbecker 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/cpuidle-tegra.c |   21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

--- a/drivers/cpuidle/cpuidle-tegra.c
+++ b/drivers/cpuidle/cpuidle-tegra.c
@@ -180,9 +180,11 @@ static int tegra_cpuidle_state_enter(str
}
 
local_fiq_disable();
-   RCU_NONIDLE(tegra_pm_set_cpu_in_lp2());
+   tegra_pm_set_cpu_in_lp2();
cpu_pm_enter();
 
+   ct_idle_enter();
+
switch (index) {
case TEGRA_C7:
err = tegra_cpuidle_c7_enter();
@@ -197,8 +199,10 @@ static int tegra_cpuidle_state_enter(str
break;
}
 
+   ct_idle_exit();
+
cpu_pm_exit();
-   RCU_NONIDLE(tegra_pm_clear_cpu_in_lp2());
+   tegra_pm_clear_cpu_in_lp2();
local_fiq_enable();
 
return err ?: index;
@@ -226,6 +230,7 @@ static int tegra_cpuidle_enter(struct cp
   struct cpuidle_driver *drv,
   int index)
 {
+   bool do_rcu = drv->states[index].flags & CPUIDLE_FLAG_RCU_IDLE;
unsigned int cpu = cpu_logical_map(dev->cpu);
int ret;
 
@@ -233,9 +238,13 @@ static int tegra_cpuidle_enter(struct cp
if (dev->states_usage[index].disable)
return -1;
 
-   if (index == TEGRA_C1)
+   if (index == TEGRA_C1) {
+   if (do_rcu)
+   ct_idle_enter();
ret = arm_cpuidle_simple_enter(dev, drv, index);
-   else
+   if (do_rcu)
+   ct_idle_exit();
+   } else
ret = tegra_cpuidle_state_enter(dev, index, cpu);
 
if (ret < 0) {
@@ -285,7 +294,8 @@ static struct cpuidle_driver tegra_idle_
.exit_latency   = 2000,
.target_residency   = 2200,
.power_usage= 100,
-   .flags  = CPUIDLE_FLAG_TIMER_STOP,
+   .flags  = CPUIDLE_FLAG_TIMER_STOP |
+ CPUIDLE_FLAG_RCU_IDLE,
.name   = "C7",
.desc   = "CPU core powered off",
},
@@ -295,6 +305,7 @@ static struct cpuidle_driver tegra_idle_
.target_residency   = 1,
.power_usage= 0,
.flags  = CPUIDLE_FLAG_TIMER_STOP |
+ CPUIDLE_FLAG_RCU_IDLE   |
  CPUIDLE_FLAG_COUPLED,
.name   = "CC6",
.desc   = "CPU cluster powered off",




[PATCH v3 38/51] cpuidle, omap4: Push RCU-idle into omap4_enter_lowpower()

2023-01-12 Thread Peter Zijlstra
From: Tony Lindgren 

OMAP4 uses full SoC suspend modes as idle states; as such, it needs the
whole power-domain and clock-domain code from the idle path.

All that code is not suitable to run with RCU disabled; as such, push
RCU-idle deeper still.

Signed-off-by: Tony Lindgren 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
Link: https://lkml.kernel.org/r/yqcv6crsnkusw...@atomide.com
---
 arch/arm/mach-omap2/common.h  |6 --
 arch/arm/mach-omap2/cpuidle44xx.c |8 ++--
 arch/arm/mach-omap2/omap-mpuss-lowpower.c |   12 +++-
 arch/arm/mach-omap2/pm44xx.c  |2 +-
 4 files changed, 18 insertions(+), 10 deletions(-)

--- a/arch/arm/mach-omap2/common.h
+++ b/arch/arm/mach-omap2/common.h
@@ -284,11 +284,13 @@ extern u32 omap4_get_cpu1_ns_pa_addr(voi
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PM)
 extern int omap4_mpuss_init(void);
-extern int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state);
+extern int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
+   bool rcuidle);
 extern int omap4_hotplug_cpu(unsigned int cpu, unsigned int power_state);
 #else
 static inline int omap4_enter_lowpower(unsigned int cpu,
-   unsigned int power_state)
+   unsigned int power_state,
+   bool rcuidle)
 {
cpu_do_idle();
return 0;
--- a/arch/arm/mach-omap2/cpuidle44xx.c
+++ b/arch/arm/mach-omap2/cpuidle44xx.c
@@ -105,9 +105,7 @@ static int omap_enter_idle_smp(struct cp
}
raw_spin_unlock_irqrestore(&mpu_lock, flag);
 
-   ct_cpuidle_enter();
-   omap4_enter_lowpower(dev->cpu, cx->cpu_state);
-   ct_cpuidle_exit();
+   omap4_enter_lowpower(dev->cpu, cx->cpu_state, true);
 
raw_spin_lock_irqsave(&mpu_lock, flag);
if (cx->mpu_state_vote == num_online_cpus())
@@ -186,10 +184,8 @@ static int omap_enter_idle_coupled(struc
}
}
 
-   ct_cpuidle_enter();
-   omap4_enter_lowpower(dev->cpu, cx->cpu_state);
+   omap4_enter_lowpower(dev->cpu, cx->cpu_state, true);
cpu_done[dev->cpu] = true;
-   ct_cpuidle_exit();
 
/* Wakeup CPU1 only if it is not offlined */
if (dev->cpu == 0 && cpumask_test_cpu(1, cpu_online_mask)) {
--- a/arch/arm/mach-omap2/omap-mpuss-lowpower.c
+++ b/arch/arm/mach-omap2/omap-mpuss-lowpower.c
@@ -33,6 +33,7 @@
  * and first to wake-up when MPUSS low power states are excercised
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -214,6 +215,7 @@ static void __init save_l2x0_context(voi
  * of OMAP4 MPUSS subsystem
  * @cpu : CPU ID
  * @power_state: Low power state.
+ * @rcuidle: RCU needs to be idled
  *
  * MPUSS states for the context save:
  * save_state =
@@ -222,7 +224,8 @@ static void __init save_l2x0_context(voi
  * 2 - CPUx L1 and logic lost + GIC lost: MPUSS OSWR
  * 3 - CPUx L1 and logic lost + GIC + L2 lost: DEVICE OFF
  */
-int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state)
+int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
+bool rcuidle)
 {
struct omap4_cpu_pm_info *pm_info = &per_cpu(omap4_pm_info, cpu);
unsigned int save_state = 0, cpu_logic_state = PWRDM_POWER_RET;
@@ -268,6 +271,10 @@ int omap4_enter_lowpower(unsigned int cp
cpu_clear_prev_logic_pwrst(cpu);
pwrdm_set_next_pwrst(pm_info->pwrdm, power_state);
pwrdm_set_logic_retst(pm_info->pwrdm, cpu_logic_state);
+
+   if (rcuidle)
+   ct_cpuidle_enter();
+
set_cpu_wakeup_addr(cpu, __pa_symbol(omap_pm_ops.resume));
omap_pm_ops.scu_prepare(cpu, power_state);
l2x0_pwrst_prepare(cpu, save_state);
@@ -283,6 +290,9 @@ int omap4_enter_lowpower(unsigned int cp
if (IS_PM44XX_ERRATUM(PM_OMAP4_ROM_SMP_BOOT_ERRATUM_GICD) && cpu)
gic_dist_enable();
 
+   if (rcuidle)
+   ct_cpuidle_exit();
+
/*
 * Restore the CPUx power state to ON otherwise CPUx
 * power domain can transitions to programmed low power
--- a/arch/arm/mach-omap2/pm44xx.c
+++ b/arch/arm/mach-omap2/pm44xx.c
@@ -76,7 +76,7 @@ static int omap4_pm_suspend(void)
 * domain CSWR is not supported by hardware.
 * More details can be found in OMAP4430 TRM section 4.3.4.2.
 */
-   omap4_enter_lowpower(cpu_id, cpu_suspend_state);
+   omap4_enter_lowpower(cpu_id, cpu_suspend_state, false);
 
/* Restore next powerdomain state */
list_for_each_entry(pwrst, _list, node) {




[PATCH v3 03/51] cpuidle/poll: Ensure IRQ state is invariant

2023-01-12 Thread Peter Zijlstra
cpuidle_state::enter() methods should be IRQ invariant.

Additionally make sure to use raw_local_irq_*() methods since this
cpuidle callback will be called with RCU already disabled.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Rafael J. Wysocki 
Reviewed-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/poll_state.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -17,7 +17,7 @@ static int __cpuidle poll_idle(struct cp
 
dev->poll_time_limit = false;
 
-   local_irq_enable();
+   raw_local_irq_enable();
if (!current_set_polling_and_test()) {
unsigned int loop_count = 0;
u64 limit;
@@ -36,6 +36,8 @@ static int __cpuidle poll_idle(struct cp
}
}
}
+   raw_local_irq_disable();
+
current_clr_polling();
 
return index;




[PATCH v3 24/51] arm64,smp: Remove trace_.*_rcuidle() usage

2023-01-12 Thread Peter Zijlstra
Ever since commit d3afc7f12987 ("arm64: Allow IPIs to be handled as
normal interrupts") this function is called in regular IRQ context.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Mark Rutland 
Acked-by: Marc Zyngier 
Acked-by: Rafael J. Wysocki 
Acked-by: Frederic Weisbecker 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 arch/arm64/kernel/smp.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -865,7 +865,7 @@ static void do_handle_IPI(int ipinr)
unsigned int cpu = smp_processor_id();
 
if ((unsigned)ipinr < NR_IPI)
-   trace_ipi_entry_rcuidle(ipi_types[ipinr]);
+   trace_ipi_entry(ipi_types[ipinr]);
 
switch (ipinr) {
case IPI_RESCHEDULE:
@@ -914,7 +914,7 @@ static void do_handle_IPI(int ipinr)
}
 
if ((unsigned)ipinr < NR_IPI)
-   trace_ipi_exit_rcuidle(ipi_types[ipinr]);
+   trace_ipi_exit(ipi_types[ipinr]);
 }
 
 static irqreturn_t ipi_handler(int irq, void *data)




[PATCH v3 05/51] cpuidle,riscv: Push RCU-idle into driver

2023-01-12 Thread Peter Zijlstra
Doing RCU-idle outside the driver, only to then temporarily enable it
again, at least twice, before going idle is daft.

That is, once implicitly through the cpu_pm_*() calls and once
explicitly doing ct_irq_*_irqon().

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Anup Patel 
Reviewed-by: Frederic Weisbecker 
Acked-by: Rafael J. Wysocki 
Tested-by: Tony Lindgren 
Tested-by: Ulf Hansson 
---
 drivers/cpuidle/cpuidle-riscv-sbi.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/drivers/cpuidle/cpuidle-riscv-sbi.c
+++ b/drivers/cpuidle/cpuidle-riscv-sbi.c
@@ -116,12 +116,12 @@ static int __sbi_enter_domain_idle_state
return -1;
 
/* Do runtime PM to manage a hierarchical CPU toplogy. */
-   ct_irq_enter_irqson();
if (s2idle)
dev_pm_genpd_suspend(pd_dev);
else
pm_runtime_put_sync_suspend(pd_dev);
-   ct_irq_exit_irqson();
+
+   ct_idle_enter();
 
if (sbi_is_domain_state_available())
state = sbi_get_domain_state();
@@ -130,12 +130,12 @@ static int __sbi_enter_domain_idle_state
 
ret = sbi_suspend(state) ? -1 : idx;
 
-   ct_irq_enter_irqson();
+   ct_idle_exit();
+
if (s2idle)
dev_pm_genpd_resume(pd_dev);
else
pm_runtime_get_sync(pd_dev);
-   ct_irq_exit_irqson();
 
cpu_pm_exit();
 
@@ -246,6 +246,7 @@ static int sbi_dt_cpu_init_topology(stru
 * of a shared state for the domain, assumes the domain states are all
 * deeper states.
 */
+   drv->states[state_count - 1].flags |= CPUIDLE_FLAG_RCU_IDLE;
drv->states[state_count - 1].enter = sbi_enter_domain_idle_state;
drv->states[state_count - 1].enter_s2idle =
sbi_enter_s2idle_domain_idle_state;




Re: [PATCH v2 12/44] cpuidle,dt: Push RCU-idle into driver

2022-11-16 Thread Peter Zijlstra


Sorry; things keep getting in the way of finishing this :/

As such, I need a bit of time to get on-track again..

On Tue, Oct 04, 2022 at 01:03:57PM +0200, Ulf Hansson wrote:

> > --- a/drivers/acpi/processor_idle.c
> > +++ b/drivers/acpi/processor_idle.c
> > @@ -1200,6 +1200,8 @@ static int acpi_processor_setup_lpi_stat
> > state->target_residency = lpi->min_residency;
> > if (lpi->arch_flags)
> > state->flags |= CPUIDLE_FLAG_TIMER_STOP;
> > +   if (lpi->entry_method == ACPI_CSTATE_FFH)
> > +   state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> 
> I assume the state index here will never be 0?
> 
> If not, it may lead to that acpi_processor_ffh_lpi_enter() may trigger
> CPU_PM_CPU_IDLE_ENTER_PARAM() to call ct_cpuidle_enter|exit() for an
> idle-state that doesn't have the CPUIDLE_FLAG_RCU_IDLE bit set.

I'm not quite sure I see how. AFAICT this condition above implies
acpi_processor_ffh_lpi_enter() gets called, no?

Which in turn is an unconditional __CPU_PM_CPU_IDLE_ENTER() user, so
even if idx==0, it ends up in ct_idle_{enter,exit}().

> 
> > state->enter = acpi_idle_lpi_enter;
> > drv->safe_state_index = i;
> > }
> > --- a/drivers/cpuidle/cpuidle-arm.c
> > +++ b/drivers/cpuidle/cpuidle-arm.c
> > @@ -53,6 +53,7 @@ static struct cpuidle_driver arm_idle_dr
> >  * handler for idle state index 0.
> >  */
> > .states[0] = {
> > +   .flags  = CPUIDLE_FLAG_RCU_IDLE,
> 
> Comparing arm64 and arm32 idle-states/idle-drivers, the $subject
> series ends up setting the CPUIDLE_FLAG_RCU_IDLE for the ARM WFI idle
> state (state zero), but only for the arm64 and psci cases (mostly
> arm64). For arm32 we would need to update the ARM_CPUIDLE_WFI_STATE
> too, as that is what most arm32 idle-drivers are using. My point is,
> the code becomes a bit inconsistent.

True.

> Perhaps it's easier to avoid setting the CPUIDLE_FLAG_RCU_IDLE bit for
> all of the ARM WFI idle states, for both arm64 and arm32?

As per the below?

> 
> > .enter  = arm_enter_idle_state,
> > .exit_latency   = 1,
> > .target_residency   = 1,

> > --- a/include/linux/cpuidle.h
> > +++ b/include/linux/cpuidle.h
> > @@ -282,14 +282,18 @@ extern s64 cpuidle_governor_latency_req(
> > int __ret = 0;  \
> > \
> > if (!idx) { \
> > +   ct_idle_enter();\
> 
> According to my comment above, we should then drop these calls to
> ct_idle_enter and ct_idle_exit() here. Right?

Yes, if we ensure idx==0 never has RCU_IDLE set then these must be
removed.

> > cpu_do_idle();  \
> > +   ct_idle_exit(); \
> > return idx; \
> > }   \
> > \
> > if (!is_retention)  \
> > __ret =  cpu_pm_enter();\
> > if (!__ret) {   \
> > +   ct_idle_enter();\
> > __ret = low_level_idle_enter(state);\
> > +   ct_idle_exit(); \
> > if (!is_retention)  \
> > cpu_pm_exit();  \
> > }   \
> >

So the basic premise is that any idle state whose callback needs RCU
must set CPUIDLE_FLAG_RCU_IDLE and, by doing that, promises to call
ct_idle_{enter,exit}() itself.

Setting RCU_IDLE is required when there is RCU usage; however, even if
there is no RCU usage, setting RCU_IDLE is fine, as long as
ct_idle_{enter,exit}() then get called.
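
To make that concrete, a driver ->enter() callback for a state with
CPUIDLE_FLAG_RCU_IDLE set would look roughly like this (hypothetical
driver, purely to illustrate the bracketing; anything that may use RCU
stays outside of it):

  static int foo_enter_idle(struct cpuidle_device *dev,
                            struct cpuidle_driver *drv, int idx)
  {
          cpu_pm_enter();                 /* may use RCU, outside the bracket */

          ct_idle_enter();
          foo_low_level_idle(idx);        /* hypothetical firmware/WFI entry */
          ct_idle_exit();

          cpu_pm_exit();

          return idx;
  }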


So does the below (delta) look better to you?

--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1218,7 +1218,7 @@ static int acpi_processor_setup_lpi_stat
state->target_residency = lpi->min_residency;
if (lpi->arch_flags)
state->flags |= CPUIDLE_FLAG_TIMER_STOP;
-   if (lpi->entry_method == ACPI_CSTATE_FFH)
+   if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
state->flags |= CPUIDLE_FLAG_RCU_IDLE;
state->enter = acpi_idle_lpi_enter;
drv->safe_state_index 

Re: [PATCH v2] x86/paravirt: use common macro for creating simple asm paravirt functions

2022-11-16 Thread Peter Zijlstra
On Wed, Nov 09, 2022 at 02:44:18PM +0100, Juergen Gross wrote:
> There are some paravirt assembler functions which are sharing a common
> pattern. Introduce a macro DEFINE_PARAVIRT_ASM() for creating them.
> 
> Note that this macro is including explicit alignment of the generated
> functions, leading to __raw_callee_save___kvm_vcpu_is_preempted(),
> _paravirt_nop() and paravirt_ret0() to be aligned at 4 byte boundaries
> now.
> 
> The explicit _paravirt_nop() prototype in paravirt.c isn't needed, as
> it is included in paravirt_types.h already.
> 
> Signed-off-by: Juergen Gross 
> Reviewed-by: Srivatsa S. Bhat (VMware) 
> ---

Seems nice; I've made the below little edits, but this is certainly a
bit large for /urgent at this point in time. So how about I merge
locking/urgent into x86/paravirt and munge this on top?

---
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -737,7 +737,7 @@ static __always_inline unsigned long arc
 __ALIGN_STR "\n"   \
 #func ":\n\t"  \
 ASM_ENDBR  \
-instr  \
+instr "\n\t"   \
 ASM_RET\
 ".size " #func ", . - " #func "\n\t"   \
 ".popsection")
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -54,8 +54,8 @@ __PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_
"pop%rdx\n\t"   \
FRAME_END
 
-DEFINE_PARAVIRT_ASM(__raw_callee_save___pv_queued_spin_unlock, PV_UNLOCK_ASM,
-   .spinlock.text);
+DEFINE_PARAVIRT_ASM(__raw_callee_save___pv_queued_spin_unlock,
+   PV_UNLOCK_ASM, .spinlock.text);
 
 #else /* CONFIG_64BIT */
 
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -802,6 +802,7 @@ extern bool __raw_callee_save___kvm_vcpu
  "movq   __per_cpu_offset(,%rdi,8), %rax\n\t"   \
  "cmpb   $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax)\n\t" \
  "setne  %al\n\t"
+
 DEFINE_PARAVIRT_ASM(__raw_callee_save___kvm_vcpu_is_preempted,
PV_VCPU_PREEMPTED_ASM, .text);
 #endif
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -40,8 +40,7 @@
 DEFINE_PARAVIRT_ASM(_paravirt_nop, "", .entry.text);
 
 /* stub always returning 0. */
-#define PV_RET0_ASM"xor %" _ASM_AX ", %" _ASM_AX "\n\t"
-DEFINE_PARAVIRT_ASM(paravirt_ret0, PV_RET0_ASM, .entry.text);
+DEFINE_PARAVIRT_ASM(paravirt_ret0, "xor %eax,%eax", .entry.text);
 
 void __init default_banner(void)
 {
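
For readers without the full patch at hand, a rough reconstruction of what
DEFINE_PARAVIRT_ASM() ends up looking like with the edit above applied; the
.pushsection attributes and the .global/.type directives are inferred from
the visible hunk rather than quoted from the patch:

#define DEFINE_PARAVIRT_ASM(func, instr, sec)		\
	asm (".pushsection " #sec ", \"ax\"\n"		\
	     ".global " #func "\n\t"			\
	     ".type " #func ", @function\n\t"		\
	     __ALIGN_STR "\n"				\
	     #func ":\n\t"				\
	     ASM_ENDBR					\
	     instr "\n\t"				\
	     ASM_RET					\
	     ".size " #func ", . - " #func "\n\t"	\
	     ".popsection")

/* e.g. the ret0 stub from the hunk above: */
DEFINE_PARAVIRT_ASM(paravirt_ret0, "xor %eax,%eax", .entry.text);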


Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic

2022-09-26 Thread Peter Zijlstra
On Mon, Sep 26, 2022 at 08:06:46PM +0200, Peter Zijlstra wrote:

> Let me go git-grep some to see if there's more similar fail.

I've ended up with the below...

---
 include/linux/wait.h | 2 +-
 kernel/hung_task.c   | 8 ++--
 kernel/sched/core.c  | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 14ad8a0e9fac..7f5a51aae0a7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -281,7 +281,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
 
 #define ___wait_is_interruptible(state)				\
 	(!__builtin_constant_p(state) ||				\
-	 state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)	\
+	 (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
 
 extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f1321c03c32a..4a8a713fd67b 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -191,6 +191,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
hung_task_show_lock = false;
rcu_read_lock();
for_each_process_thread(g, t) {
+   unsigned int state;
+
if (!max_count--)
goto unlock;
if (time_after(jiffies, last_break + HUNG_TASK_LOCK_BREAK)) {
@@ -198,8 +200,10 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
goto unlock;
last_break = jiffies;
}
-   /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
-   if (READ_ONCE(t->__state) == TASK_UNINTERRUPTIBLE)
+   /* skip the TASK_KILLABLE tasks -- these can be killed */
+   state = READ_ONCE(t->__state);
+   if ((state & TASK_UNINTERRUPTIBLE) &&
+   !(state & TASK_WAKEKILL))
check_hung_task(t, timeout);
}
  unlock:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1095917ed048..12ee5b98e2c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8885,7 +8885,7 @@ state_filter_match(unsigned long state_filter, struct task_struct *p)
 * When looking for TASK_UNINTERRUPTIBLE skip TASK_IDLE (allows
 * TASK_KILLABLE).
 */
-   if (state_filter == TASK_UNINTERRUPTIBLE && state == TASK_IDLE)
+   if (state_filter == TASK_UNINTERRUPTIBLE && state & TASK_NOLOAD)
return false;
 
return true;


Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic

2022-09-26 Thread Peter Zijlstra
On Mon, Sep 26, 2022 at 05:49:16PM +0200, Christian Borntraeger wrote:

> Hmm,
> 
> #define ___wait_is_interruptible(state)				\
> 	(!__builtin_constant_p(state) ||				\
> 	 state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)	\
> 
> That would not trigger when state is also TASK_FREEZABLE, no?

Spot on!

signal_pending_state() writes that as:

state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)

which is the correct form.

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 14ad8a0e9fac..7f5a51aae0a7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -281,7 +281,7 @@ static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
 
 #define ___wait_is_interruptible(state)				\
 	(!__builtin_constant_p(state) ||				\
-	 state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE)	\
+	 (state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
 
 extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 

Let me go git-grep some to see if there's more similar fail.
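
To see the two tests side by side, a tiny stand-alone illustration (not
part of the patch; the bit values mirror the kernel headers this series
touches and are only for the example):

#include <stdio.h>

#define TASK_INTERRUPTIBLE	0x0001
#define TASK_UNINTERRUPTIBLE	0x0002
#define TASK_WAKEKILL		0x0100
#define TASK_KILLABLE		(TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_FREEZABLE		0x2000

int main(void)
{
	unsigned int state = TASK_INTERRUPTIBLE | TASK_FREEZABLE;

	/* old test: false as soon as an extra flag is OR'ed in */
	printf("equality: %d\n",
	       state == TASK_INTERRUPTIBLE || state == TASK_KILLABLE);

	/* new test, same form as signal_pending_state(): still true */
	printf("mask:     %d\n",
	       !!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)));

	return 0;
}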


Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic

2022-09-26 Thread Peter Zijlstra
On Mon, Sep 26, 2022 at 03:23:10PM +0200, Christian Borntraeger wrote:
> On 26.09.22 at 14:55, Peter Zijlstra wrote:
> 
> > Could you please test with something like the below on? I can boot that
> > with KVM, but obviously I didn't suffer any weirdness to begin with :/
> > 
> > ---
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4e6a6417211f..ef9ccfc3a8c0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > unsigned long flags;
> > int cpu, success = 0;
> > +   WARN_ON_ONCE(state & TASK_FREEZABLE);
> > +
> > preempt_disable();
> > if (p == current) {
> > /*
> 
> Does not seem to trigger.

Moo -- quite the puzzle this :/ I'll go stare at it more then.


Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic

2022-09-26 Thread Peter Zijlstra
On Mon, Sep 26, 2022 at 02:32:24PM +0200, Christian Borntraeger wrote:
> diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
> index 9fa3c76a267f..e93df4f735fe 100644
> --- a/drivers/char/virtio_console.c
> +++ b/drivers/char/virtio_console.c
> @@ -790,7 +790,7 @@ static int wait_port_writable(struct port *port, bool nonblock)
> if (nonblock)
> return -EAGAIN;
> -   ret = wait_event_freezable(port->waitqueue,
> +   ret = wait_event_interruptible(port->waitqueue,
>!will_write_block(port));
> if (ret < 0)
> return ret;
> 
> Does fix the problem.

It's almost as if someone does try_to_wake_up(.state = TASK_FREEZABLE)
-- which would be quite insane.

Could you please test with something like the below on? I can boot that
with KVM, but obviously I didn't suffer any weirdness to begin with :/

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4e6a6417211f..ef9ccfc3a8c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4051,6 +4051,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
unsigned long flags;
int cpu, success = 0;
 
+   WARN_ON_ONCE(state & TASK_FREEZABLE);
+
preempt_disable();
if (p == current) {
/*


Re: [PATCH v3 6/6] freezer,sched: Rewrite core freezer logic

2022-09-26 Thread Peter Zijlstra
On Mon, Sep 26, 2022 at 12:55:21PM +0200, Christian Borntraeger wrote:
> 
> 
> > On 26.09.22 at 10:06, Christian Borntraeger wrote:
> > 
> > 
> > > On 23.09.22 at 09:53, Christian Borntraeger wrote:
> > > > On 23.09.22 at 09:21, Christian Borntraeger wrote:
> > > > Peter,
> > > > 
> > > > as a heads-up. This commit (bisected and verified) triggers a
> > > > regression in our KVM on s390x CI. The symptom is that a specific
> > > > testcase (start a guest with next kernel and a poky ramdisk,
> > > > then ssh via vsock into the guest and run the reboot command) now
> > > > takes much longer (300 instead of 20 seconds). From a first look
> > > > it seems that the sshd takes very long to end during shutdown
> > > > but I have not looked into that yet.
> > > > Any quick idea?
> > > > 
> > > > Christian
> > > 
> > > the sshd seems to hang in virtio-serial (not vsock).
> > 
> > FWIW, sshd does not seem to hang, instead it seems to busy loop in
> > wait_port_writable calling into the scheduler over and over again.
> 
> -#define TASK_FREEZABLE 0x2000
> +#define TASK_FREEZABLE 0x00000000
> 
> "Fixes" the issue. Just have to find out which of users is responsible.

Since it's not the wait_port_writable() one -- we already tested that by
virtue of 's/wait_event_freezable/wait_event/' there, it must be on the
producing side of that port. But I'm having a wee bit of trouble
following that code.

Is there a task stuck in FROZEN state? -- then again, I thought you said
there was no actual suspend involved, so that should not be it either.

I'm curious though -- how far does it get into the scheduler? It should
call schedule() with __state == TASK_INTERRUPTIBLE|TASK_FREEZABLE, which
is quite sufficient to get it off the runqueue; who then puts it back?
Or is it bailing early in the wait_event loop?
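
For reference, the freezable wait boils down to roughly this (condensed
from include/linux/wait.h as changed by this series, not a verbatim
quote), i.e. the condition loop ends in a plain schedule() with both
bits set in __state:

#define __wait_event_freezable(wq_head, condition)			\
	___wait_event(wq_head, condition,				\
		      (TASK_INTERRUPTIBLE | TASK_FREEZABLE),		\
		      0, 0, schedule())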


Re: [PATCH] smp/hotplug, x86/vmware: Put offline vCPUs in halt instead of mwait

2022-09-23 Thread Peter Zijlstra
On Thu, Jul 21, 2022 at 01:44:33PM -0700, Srivatsa S. Bhat wrote:
> From: Srivatsa S. Bhat (VMware) 
> 
> VMware ESXi allows enabling a passthru mwait CPU-idle state in the
> guest using the following VMX option:
> 
> monitor_control.mwait_in_guest = "TRUE"
> 
> This lets a vCPU in mwait to remain in guest context (instead of
> yielding to the hypervisor via a VMEXIT), which helps speed up
> wakeups from idle.
> 
> However, this runs into problems with CPU hotplug, because the Linux
> CPU offline path prefers to put the vCPU-to-be-offlined in mwait
> state, whenever mwait is available. As a result, since a vCPU in mwait
> remains in guest context and does not yield to the hypervisor, an
> offline vCPU *appears* to be 100% busy as viewed from ESXi, which
> prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU (particularly when vCPU - pCPU mappings are
> statically defined by the user).

I would hope vCPU pinning is a mandatory thing when MWAIT passthrough is
set?

> [ Note that such a vCPU is not
> actually busy spinning though; it remains in mwait idle state in the
> guest ].
> 
> Fix this by overriding the CPU offline play_dead() callback for VMware
> hypervisor, by putting the CPU in halt state (which actually yields to
> the hypervisor), even if mwait support is available.
> 
> Signed-off-by: Srivatsa S. Bhat (VMware) 
> ---

> +static void vmware_play_dead(void)
> +{
> + play_dead_common();
> + tboot_shutdown(TB_SHUTDOWN_WFS);
> +
> + /*
> +  * Put the vCPU going offline in halt instead of mwait (even
> +  * if mwait support is available), to make sure that the
> +  * offline vCPU yields to the hypervisor (which may not happen
> +  * with mwait, for example, if the guest's VMX is configured
> +  * to retain the vCPU in guest context upon mwait).
> +  */
> + hlt_play_dead();
> +}
>  #endif
>  
>  static __init int activate_jump_labels(void)
> @@ -349,6 +365,7 @@ static void __init vmware_paravirt_ops_setup(void)
>  #ifdef CONFIG_SMP
>   smp_ops.smp_prepare_boot_cpu =
>   vmware_smp_prepare_boot_cpu;
> + smp_ops.play_dead = vmware_play_dead;
>   if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> "x86/vmware:online",
> vmware_cpu_online,

No real objection here; but would not something like the below fix the
problem more generally? I'm thinking MWAIT passthrough for *any*
hypervisor doesn't want play_dead to use it.

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f24227bc3220..166cb3aaca8a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1759,6 +1759,8 @@ static inline void mwait_play_dead(void)
return;
if (!this_cpu_has(X86_FEATURE_CLFLUSH))
return;
+   if (this_cpu_has(X86_FEATURE_HYPERVISOR))
+   return;
if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
return;
 


Re: [PATCH v2 00/44] cpuidle,rcu: Clean up the mess

2022-09-20 Thread Peter Zijlstra


Because Nadav asked about tracing/kprobing idle, I had another go around
and noticed not all functions calling ct_cpuidle_enter are __cpuidle.

Basically all cpuidle_driver::enter functions should be __cpuidle; I'll
do that audit shortly.

For now this is ct_cpuidle_enter / CPU_IDLE_ENTER users.

---
--- a/arch/arm/mach-imx/cpuidle-imx6q.c
+++ b/arch/arm/mach-imx/cpuidle-imx6q.c
@@ -17,8 +17,8 @@
 static int num_idle_cpus = 0;
 static DEFINE_RAW_SPINLOCK(cpuidle_lock);
 
-static int imx6q_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx6q_enter_wait(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
 {
raw_spin_lock(&cpuidle_lock);
if (++num_idle_cpus == num_online_cpus())
--- a/arch/arm/mach-imx/cpuidle-imx6sx.c
+++ b/arch/arm/mach-imx/cpuidle-imx6sx.c
@@ -30,8 +30,8 @@ static int imx6sx_idle_finish(unsigned l
return 0;
 }
 
-static int imx6sx_enter_wait(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int index)
+static __cpuidle int imx6sx_enter_wait(struct cpuidle_device *dev,
+  struct cpuidle_driver *drv, int index)
 {
imx6_set_lpm(WAIT_UNCLOCKED);
 
--- a/arch/arm/mach-omap2/omap-mpuss-lowpower.c
+++ b/arch/arm/mach-omap2/omap-mpuss-lowpower.c
@@ -224,8 +224,8 @@ static void __init save_l2x0_context(voi
  * 2 - CPUx L1 and logic lost + GIC lost: MPUSS OSWR
  * 3 - CPUx L1 and logic lost + GIC + L2 lost: DEVICE OFF
  */
-int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
-bool rcuidle)
+__cpuidle int omap4_enter_lowpower(unsigned int cpu, unsigned int power_state,
+  bool rcuidle)
 {
struct omap4_cpu_pm_info *pm_info = &per_cpu(omap4_pm_info, cpu);
unsigned int save_state = 0, cpu_logic_state = PWRDM_POWER_RET;
--- a/arch/arm/mach-omap2/pm34xx.c
+++ b/arch/arm/mach-omap2/pm34xx.c
@@ -175,7 +175,7 @@ static int omap34xx_do_sram_idle(unsigne
return 0;
 }
 
-void omap_sram_idle(bool rcuidle)
+__cpuidle void omap_sram_idle(bool rcuidle)
 {
/* Variable to tell what needs to be saved and restored
 * in omap_sram_idle*/
--- a/arch/arm64/kernel/cpuidle.c
+++ b/arch/arm64/kernel/cpuidle.c
@@ -62,7 +62,7 @@ int acpi_processor_ffh_lpi_probe(unsigne
return psci_acpi_cpu_init_idle(cpu);
 }
 
-int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi)
+__cpuidle int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi)
 {
u32 state = lpi->address;
 
--- a/drivers/cpuidle/cpuidle-arm.c
+++ b/drivers/cpuidle/cpuidle-arm.c
@@ -31,8 +31,8 @@
  * Called from the CPUidle framework to program the device to the
  * specified target state selected by the governor.
  */
-static int arm_enter_idle_state(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int idx)
+static __cpuidle int arm_enter_idle_state(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int idx)
 {
/*
 * Pass idle state index to arm_cpuidle_suspend which in turn
--- a/drivers/cpuidle/cpuidle-big_little.c
+++ b/drivers/cpuidle/cpuidle-big_little.c
@@ -122,8 +122,8 @@ static int notrace bl_powerdown_finisher
  * Called from the CPUidle framework to program the device to the
  * specified target state selected by the governor.
  */
-static int bl_enter_powerdown(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv, int idx)
+static __cpuidle int bl_enter_powerdown(struct cpuidle_device *dev,
+   struct cpuidle_driver *drv, int idx)
 {
cpu_pm_enter();
ct_cpuidle_enter();
--- a/drivers/cpuidle/cpuidle-mvebu-v7.c
+++ b/drivers/cpuidle/cpuidle-mvebu-v7.c
@@ -25,9 +25,9 @@
 
 static int (*mvebu_v7_cpu_suspend)(int);
 
-static int mvebu_v7_enter_idle(struct cpuidle_device *dev,
-   struct cpuidle_driver *drv,
-   int index)
+static __cpuidle int mvebu_v7_enter_idle(struct cpuidle_device *dev,
+struct cpuidle_driver *drv,
+int index)
 {
int ret;
bool deepidle = false;
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -49,14 +49,9 @@ static inline u32 psci_get_domain_state(
return __this_cpu_read(domain_state);
 }
 
-static inline int psci_enter_state(int idx, u32 state)
-{
-   return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state);
-}
-
-static int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
- struct cpuidle_driver *drv, int idx,
- bool s2idle)
+static __cpuidle int __psci_enter_domain_idle_state(struct cpuidle_device 
