Re: [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c

2023-05-11 Thread Petr Mladek
On Fri 2023-05-05 09:37:50, Doug Anderson wrote:
> Hi,
> 
> On Thu, May 4, 2023 at 7:58 PM Nicholas Piggin  wrote:
> >
> > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > > The perf hardlockup detector works by looking at interrupt counts and
> > > seeing if they change from run to run. The interrupt counts are
> > > managed by the common watchdog code via its watchdog_timer_fn().
> > >
> > > Currently the API between the perf detector and the common code is a
> > > function: is_hardlockup(). When the hard lockup detector sees that
> > > function return true then it handles printing out debug info and
> > > inducing a panic if necessary.
> > >
> > > Let's change the API a little bit in preparation for the buddy
> > > hardlockup detector. The buddy hardlockup detector wants to print
> >
> > I think the name change is a gratuitous. Especially since it's now
> > static.
> >
> > watchdog_hardlockup_ is a pretty long prefix too, hardlockup_
> > should be enough?
> >
> > Seems okay otherwise though.
> 
> I went back and forth on names far too much when constructing this
> patch series. Mostly I was trying to balance what looked good to me
> and what Petr suggested [1]. I'm not super picky about the names and
> I'm happy to change them all to a "hardlockup_" prefix. I'd love to
> hear Petr's opinion.

Sigh, the original code was a real mess of naming schemes. It is hard
to say how to move forward.

My opinion:

  + watchdog_hardlockup_check(): looks fine. It is consistent with
watchdog_hardlockup_enable()/disable().

  + watchdog_hardlockup_is_lockedup() is really overly complicated.
I would personally keep is_hardlockup(). It is static.
And it will be consitent with is_softlockup().

  + watchdog_hardlockup_interrupt_count() looks better then
the original. It clearly shows that it makes sense only
for the hardlockup detector ("bug" fixed by this patch).

Well, I would personally call it watchdog_hardlockup_kick()
and remove the comment.


That said, I could live with this patch. It is better than the
original.

Best Regards,
Petr


Re: [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 7:58 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > The perf hardlockup detector works by looking at interrupt counts and
> > seeing if they change from run to run. The interrupt counts are
> > managed by the common watchdog code via its watchdog_timer_fn().
> >
> > Currently the API between the perf detector and the common code is a
> > function: is_hardlockup(). When the hard lockup detector sees that
> > function return true then it handles printing out debug info and
> > inducing a panic if necessary.
> >
> > Let's change the API a little bit in preparation for the buddy
> > hardlockup detector. The buddy hardlockup detector wants to print
>
> I think the name change is a gratuitous. Especially since it's now
> static.
>
> watchdog_hardlockup_ is a pretty long prefix too, hardlockup_
> should be enough?
>
> Seems okay otherwise though.

I went back and forth on names far too much when constructing this
patch series. Mostly I was trying to balance what looked good to me
and what Petr suggested [1]. I'm not super picky about the names and
I'm happy to change them all to a "hardlockup_" prefix. I'd love to
hear Petr's opinion.

[1] https://lore.kernel.org/r/ZFErmshcrcikrSU1@alley


Re: [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c

2023-05-04 Thread Nicholas Piggin
On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> The perf hardlockup detector works by looking at interrupt counts and
> seeing if they change from run to run. The interrupt counts are
> managed by the common watchdog code via its watchdog_timer_fn().
>
> Currently the API between the perf detector and the common code is a
> function: is_hardlockup(). When the hard lockup detector sees that
> function return true then it handles printing out debug info and
> inducing a panic if necessary.
>
> Let's change the API a little bit in preparation for the buddy
> hardlockup detector. The buddy hardlockup detector wants to print

I think the name change is a gratuitous. Especially since it's now
static.

watchdog_hardlockup_ is a pretty long prefix too, hardlockup_ 
should be enough?

Seems okay otherwise though.

Thanks,
Nick

> nearly the same debug info and have nearly the same panic
> behavior. That means we want to move all that code to the common
> file. For now, the code in the common file will only be there if the
> perf hardlockup detector is enabled, but eventually it will be
> selected by a common config.
>
> Right now, this _just_ moves the code from the perf detector file to
> the common file and changes the names. It doesn't make the changes
> that the buddy hardlockup detector will need and doesn't do any style
> cleanups. A future patch will do cleanup to make it more obvious what
> changed.
>
> With the above, we no longer have any callers of is_hardlockup()
> outside of the "watchdog.c" file, so we can remove it from the header,
> make it static, move it to the same "#ifdef" block as our new
> watchdog_hardlockup_check(), and rename it to make it obvious it's
> just for hardlockup detectors. While doing this, it can be noted that
> even if no hardlockup detectors were configured the existing code used
> to still have the code for counting/checking "hrtimer_interrupts" even
> if the perf hardlockup detector wasn't configured. We didn't need to
> do that, so move all the "hrtimer_interrupts" counting to only be
> there if the perf hardlockup detector is configured as well.
>
> This change is expected to be a no-op.
>
> Signed-off-by: Douglas Anderson 
> ---
>
> Changes in v4:
> - ("Move perf hardlockup checking/panic ...") new for v4.
>
>  include/linux/nmi.h|  5 ++-
>  kernel/watchdog.c  | 92 +-
>  kernel/watchdog_perf.c | 42 +--
>  3 files changed, 78 insertions(+), 61 deletions(-)
>
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 35d09d70f394..c6cb9bc5dc80 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -15,7 +15,6 @@
>  void lockup_detector_init(void);
>  void lockup_detector_soft_poweroff(void);
>  void lockup_detector_cleanup(void);
> -bool is_hardlockup(void);
>  
>  extern int watchdog_user_enabled;
>  extern int nmi_watchdog_user_enabled;
> @@ -88,6 +87,10 @@ extern unsigned int hardlockup_panic;
>  static inline void hardlockup_detector_disable(void) {}
>  #endif
>  
> +#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> +void watchdog_hardlockup_check(struct pt_regs *regs);
> +#endif
> +
>  #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
>  # define NMI_WATCHDOG_SYSCTL_PERM0644
>  #else
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index c705a18b26bf..2d319cdf64b9 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -85,6 +85,78 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
>  
>  #endif /* CONFIG_HARDLOCKUP_DETECTOR */
>  
> +#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> +
> +static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> +static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
> +static DEFINE_PER_CPU(bool, hard_watchdog_warn);
> +static unsigned long hardlockup_allcpu_dumped;
> +
> +static bool watchdog_hardlockup_is_lockedup(void)
> +{
> + unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
> +
> + if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
> + return true;
> +
> + __this_cpu_write(hrtimer_interrupts_saved, hrint);
> + return false;
> +}
> +
> +static void watchdog_hardlockup_interrupt_count(void)
> +{
> + __this_cpu_inc(hrtimer_interrupts);
> +}
> +
> +void watchdog_hardlockup_check(struct pt_regs *regs)
> +{
> + /* check for a hardlockup
> +  * This is done by making sure our timer interrupt
> +  * is incrementing.  The timer interrupt should have
> +  * fired multiple times before we overflow'd.  If it hasn't
> +  * then this is a good indication the cpu is stuck
> +  */
> + if (watchdog_hardlockup_is_lockedup()) {
> + int this_cpu = smp_processor_id();
> +
> + /* only print hardlockups once */
> + if (__this_cpu_read(hard_watchdog_warn) == true)
> + return;
> +
> + pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n",
> +  this