Re: [Kgdb-bugreport] KGDB and ARM

2024-06-25 Thread Doug Anderson
Hi,

On Tue, Jun 25, 2024 at 8:35 AM ton1ght via Kgdb-bugreport
 wrote:
>
> Hello,
>
> I am sorry to bother you but I am very frustrated because i cannot get a 
> cross debugging setup working (x86 Linux host and ARM Linux target). The 
> setup works "a little bit", but as soon as i encounter a specific instruction 
> the board just hangs and gets unresponsive. Do you have an idea what is wrong 
> with my setup?
>
> I have a detailed description of the problem right here:
> https://pastebin.com/xLEZmYHk
> or here 
> (formatted):https://www.reddit.com/r/gdb/comments/1djkj80/kgdb_for_arm/

Sounds like you're hitting single stepping problems. Those are known
broken. You could see Sumit's most recent patch series for this,
though (if I remember correctly) there are some corner cases that it
doesn't handle properly and ARM guys say we need a rewrite of a bunch
of stuff to get it working. Nobody at ARM has had time to do this...

Sumit's most recent patch series that I'm aware of:

https://lore.kernel.org/r/20230202073148.657746-1-sumit.g...@linaro.org


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 13/13] kdb: Add mdi, mdiW / mdiWcN commands to show iomapped memory

2024-06-21 Thread Doug Anderson
Hi,

On Fri, Jun 21, 2024 at 8:43 AM Daniel Thompson
 wrote:
>
> For example I have long wanted to be able to let
> the user see /proc/interrupts before the usespace comes up but the spin
> locks get in the way.

BTW the "gdb" scripts are supposed to handle this with
"lx-interruptlist", though when I try it right now it doesn't seem to
work.

...actually, there's also a script "lx-iomem" which sounds great but
only has the physical addresses. :(

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH RESEND] kdb: Get rid of redundant kdb_curr_task()

2024-06-21 Thread Doug Anderson
Hi,

On Thu, Jun 20, 2024 at 6:58 AM Zheng Zengkai  wrote:
>
> Commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture")
> removed the only definition of macro _TIF_MCA_INIT, so kdb_curr_task()
> is actually the same as curr_task() now and becomes redundant.
>
> Let's remove the definition of kdb_curr_task() and replace remaining
> calls with curr_task().
>
> Signed-off-by: Zheng Zengkai 
> ---
>  kernel/debug/kdb/kdb_bt.c  |  2 +-
>  kernel/debug/kdb/kdb_main.c| 18 --
>  kernel/debug/kdb/kdb_private.h |  2 --
>  3 files changed, 5 insertions(+), 17 deletions(-)

In case Daniel picks this one up since it CCs LKML, I'll copy my tag
from the one that didn't:

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: Get rid of redundant kdb_curr_task()

2024-06-20 Thread Doug Anderson
Hi,

On Wed, Jun 19, 2024 at 5:34 AM Zheng Zengkai  wrote:
>
> Commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture")
> removed the only definition of macro _TIF_MCA_INIT, so kdb_curr_task()
> is actually the same as curr_task() now and becomes redundant.
>
> Let's remove the definition of kdb_curr_task() and replace remaining
> calls with curr_task().
>
> Signed-off-by: Zheng Zengkai 
> ---
>  kernel/debug/kdb/kdb_bt.c  |  2 +-
>  kernel/debug/kdb/kdb_main.c| 18 --
>  kernel/debug/kdb/kdb_private.h |  2 --
>  3 files changed, 5 insertions(+), 17 deletions(-)

Looks right to me.

Reviewed-by: Douglas Anderson 

It's slightly annoying that this isn't CCed any kernel mailing lists
that are tracked by lore.kernel.org, but I guess that'll be up to
Daniel to figure out. :-P

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 13/13] kdb: Add mdi, mdiW / mdiWcN commands to show iomapped memory

2024-06-18 Thread Doug Anderson
Hi,

On Tue, Jun 18, 2024 at 8:59 AM Daniel Thompson
 wrote:
>
> On Mon, Jun 17, 2024 at 05:34:47PM -0700, Douglas Anderson wrote:
> > Add commands that are like the other "md" commands but that allow you
> > to read memory that's in the IO space.
> >
> > Signed-off-by: Douglas Anderson 
>
> Sorry to be the bearer of bad news but...
>
>
> > ---
> > 
> > +/*
> > + * kdb_getioword
> > + * Inputs:
> > + *   wordPointer to the word to receive the result.
> > + *   addrAddress of the area to copy.
> > + *   sizeSize of the area.
> > + * Returns:
> > + *   0 for success, < 0 for error.
> > + */
> > +int kdb_getioword(unsigned long *word, unsigned long addr, size_t size)
> > +{
> > + void __iomem *mapped = ioremap(addr, size);
>
> ioremap() is a might_sleep() function. It's unsafe to call it from the
> debug trap handler.

H. Do you have a pointer to documentation or code showing that
it's a might_sleep() function. I was worried about this initially but
I couldn't find any documentation or code indicating it to be so. I
also got no warnings when I ran my prototype code and then the code
worked fine, so I assumed that it must somehow work. Sigh...

Looking more closely, maybe this is:

ioremap() -> ioremap_prot() -> generic_ioremap_prot() ->
__get_vm_area_caller() -> __get_vm_area_node() -> __get_vm_area_node()

...and that has a kernel allocation with GFP_KERNEL?

I guess it also then calls alloc_vmap_area()  which has might_sleep()...

I'll have to track down why no warnings triggered...

> I'm afraid I don't know a safe alternative either. Machinary such as
> kmap_atomic() needs a page and iomem won't have one.

Ugh. It would be nice to come up with something since it's not
uncommon to need to look at the state of hardware registers when a
crash happens. In the past I've managed to get into gdb, track down a
global variable where someone has already mapped the memory, and then
use gdb to look at the memory. It's always a big pain, though...

...even if I could just look up the virtual address where someone else
had already mapped it that might be enough. At least I wouldn't need
to go track down the globals myself...

...anyway, I guess I'll ponder on it and poke if I have time...


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 07/13] kdb: Tweak "repeat" handling code for "mdW" and "mdWcN"

2024-06-18 Thread Doug Anderson
Hi,

On Tue, Jun 18, 2024 at 5:57 AM Daniel Thompson
 wrote:
>
> On Mon, Jun 17, 2024 at 05:34:41PM -0700, Douglas Anderson wrote:
> > In general, "md"-style commands are meant to be "repeated". This is a
> > feature of kdb and "md"-style commands get it enabled because they
> > have the flag KDB_REPEAT_NO_ARGS. What this means is that if you type
> > "md4c2 0xff808ef05400" and then keep hitting return on the "kdb>"
> > prompt that you'll read more and more memory. For instance:
> >   [5]kdb> md4c2 0xff808ef05400
> >   0xff808ef05400 0204  
> >   [5]kdb>
> >   0xff808ef05408 8235e000  ..5.
> >   [5]kdb>
> >   0xff808ef05410 0003 0001 
> >
> > As a side effect of the way kdb works is implemented, you can get the
> > same behavior as the above by typing the command again with no
> > arguments. Though it seems unlikely anyone would do this it shouldn't
> > really hurt:
> >   [5]kdb> md4c2 0xff808ef05400
> >   0xff808ef05400 0204  
> >   [5]kdb> md4c2
> >   0xff808ef05408 8235e000  ..5.
> >   [5]kdb> md4c2
> >   0xff808ef05410 0003 0001 
> >
> > In general supporting "repeat" should be easy. If argc is 0 then we
> > just copy the results of the arg parsing from the last time, making
> > sure that the address has been updated. This is all handled nicely in
> > the "if (argc == 0)" clause in kdb_md().
> >
> > Oddly, the "mdW" and "mdWcN" code seems to update "last_bytesperword"
> > and "last_repeat", which doesn't seem like it should be necessary. It
> > appears that this code is needed to make this use case work, though
> > it's a bit unclear if this is truly an important feature to support:
> >   [1]kdb> md2c3 0xff80c8e2b280
> >   0xff80c8e2b280 0200  ...
> >   [1]kdb> md2c4
> >   0xff80c8e2b286  e000 8235    ...
> >
> > In order to abstract the code better, remove the code updating
> > "last_bytesperword" and "last_repeat" from the "mdW" and "mdWcN"
> > handling. This breaks the above case where the user tweaked "argv[0]"
> > and then tried to somehow leverage the "repeat" code to do something
> > smart, but that feels like it was a misfeature anyway.
>
> I'm not too keen on "successfully" doing the wrong thing.
>
> In that light I'd view this as a feature that is arguably simpler to
> implement than it is to error check *not* implementing it. In other
> words by the time you add error checking to the argc == 0 path to
> spot mismatches then you are better off honouring the user request
> rather then telling them they got it wrong.

Hmmm. How about we store the last "argv0" and we just immediately fail
if "argc == 0" and "argv[0]" doesn't match the previous one?

The override rules for lines / word count is already complicated
enough and my head was spinning trying to get this right and reason
about it. I tried several times and each time I thought I had it
working cleanly I ended up with some other weird corner case that was
broken. For instance, with the current code this case is broken (IMO):

[3]kdb> md2c4 0xd8948040
0xd8948040 0204      
[3]kdb> md2
0xd8948048 3000 824f   0003  0001    .0O.
0xd8948058 0003          
0xd8948068 8000 884a 8000  0001  0100 0040   ..J...@.
0xd8948078   0001        
0xd8948088 0030    000a      0...
0xd8948098 f0e6 fffb   9340 809b     @...
0xd89480a8 0005  0003  0001  0078    x...
0xd89480b8 0078  0078        x...x...

Specifically I would have expected the "last" wordcount to persist but
instead it fell back to the default mdcount. Maybe the above is right
but it's distinctly non-obvious.


> Daniel.
>
>
> PS I have never done so but I also wondered if it is reasonable to use
>this feature to manually decompose structures. For example:
>
>  md1c1 structure_pointer; md1c7; md8c1; md8c1; md2c2

Sure, you can come up with cases where it could be useful, but it
feels like there are other ways to accomplish the same thing w/ less
complexity.


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 06/13] kdb: Remove "mdW" and "mdWcN" handling of "W" == 0

2024-06-18 Thread Doug Anderson
Hi,

On Tue, Jun 18, 2024 at 4:38 AM Daniel Thompson
 wrote:
>
> On Mon, Jun 17, 2024 at 05:34:40PM -0700, Douglas Anderson wrote:
> > The "mdW" and "mdWcN" generally lets the user control more carefully
> > what word size we display memory in and exactly how many words should
> > be displayed. Specifically, "md4" says to display memory w/ 4
> > bytes-per word and "md4c6" says to display 6 words of memory w/
> > 4-bytes-per word.
> >
> > The kdb "md" implementation has a special rule for when "W" is 0. In
> > this case:
> > * If you run with "W" == 0 and you've never run a kdb "md" command
> >   this reboot then it will pick 4 bytes-per-word, ignoring the normal
> >   default from the environment.
> > * If you run with "W" == 0 and you've run a kdb "md" command this
> >   reboot then it will pick up the bytes per word of the last command.
> >
> > As an example:
> >   [1]kdb> md2 0xff80c8e2b280 1
> >   0xff80c8e2b280 0200    e000 8235     ...
> >   [1]kdb> md0 0xff80c8e2b280 1
> >   0xff80c8e2b280 0200    e000 8235     ...
> >   [1]kdb> md 0xff80c8e2b280 1
> >   0xff80c8e2b280 0200 8235e000   ...
> >   [1]kdb> md0 0xff80c8e2b280 1
> >   0xff80c8e2b280 0200 8235e000   ...
> >
> > This doesn't seem like particularly useful behavior and adds a bunch
> > of complexity to the arg parsing. Remove it.
> >
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> >  kernel/debug/kdb/kdb_main.c | 5 -
> >  1 file changed, 5 deletions(-)
> >
> > diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
> > index c013b014a7d3..700b4e355545 100644
> > --- a/kernel/debug/kdb/kdb_main.c
> > +++ b/kernel/debug/kdb/kdb_main.c
> > @@ -1611,11 +1611,6 @@ static int kdb_md(int argc, const char **argv)
> >
> >   if (isdigit(argv[0][2])) {
> >   bytesperword = (int)(argv[0][2] - '0');
> > - if (bytesperword == 0) {
> > - bytesperword = last_bytesperword;
> > - if (bytesperword == 0)
> > - bytesperword = 4;
> > - }
> >   last_bytesperword = bytesperword;
> >   repeat = mdcount * 16 / bytesperword;
>
> Isn't this now a divide-by-zero?

Dang, you're right. It goes away in a later patch, though, since we
stop re-calculating everything until the end when things are
validated. I'll plan to reorder this patch to be after the patch
("kdb: In kdb_md() make `repeat` and `mdcount` calculations more
obvious").


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 02/13] kdb: Document the various "md" commands better

2024-06-18 Thread Doug Anderson
Hi,

On Tue, Jun 18, 2024 at 4:24 AM Daniel Thompson
 wrote:
>
> On Mon, Jun 17, 2024 at 05:34:36PM -0700, Douglas Anderson wrote:
> > The documentation for the variouis "md" commands was inconsistent
> > about documenting the command arguments. It was also hard to figure
> > out what the differences between the "phys", "raw", and "symbolic"
> > versions was.
> >
> > Update the help strings to make things more obvious.
> >
> > As part of this, add "bogus" commands to the table for "mdW" and
> > "mdWcN" so we don't have to obscurely reference them in the normal
> > "md" help. These bogus commands don't really hurt since kdb_md()
> > validates argv[0] enough.
> >
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> >  kernel/debug/kdb/kdb_main.c | 39 ++---
> >  1 file changed, 23 insertions(+), 16 deletions(-)
> >
> > diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
> > index cbeb203785b4..47e037c3c002 100644
> > --- a/kernel/debug/kdb/kdb_main.c
> > +++ b/kernel/debug/kdb/kdb_main.c
> > @@ -1516,14 +1516,9 @@ static int kdb_mdr(unsigned long addr, unsigned int 
> > count)
> >  }
> >
> >  /*
> > - * kdb_md - This function implements the 'md', 'md1', 'md2', 'md4',
> > - *   'md8' 'mdr' and 'mds' commands.
> > + * kdb_md - This function implements the guts of the various 'md' commands.
> >   *
> > - *   md|mds  [ [ []]]
> > - *   mdWcN   [ [ []]]
> > - *   where W = is the width (1, 2, 4 or 8) and N is the count.
> > - *   for eg., md1c20 reads 20 bytes, 1 at a time.
> > - *   mdr  ,
> > + * See the kdb help for syntax.
> >   */
> >  static void kdb_md_line(const char *fmtstr, unsigned long addr,
> >   int symbolic, int nosect, int bytesperword,
> > @@ -2677,26 +2672,38 @@ EXPORT_SYMBOL_GPL(kdb_unregister);
> >  static kdbtab_t maintab[] = {
> >   {   .name = "md",
> >   .func = kdb_md,
> > - .usage = "",
> > - .help = "Display Memory Contents, also mdWcN, e.g. md8c1",
> > + .usage = " [ []]",
> > + .help = "Display RAM using BYTESPERWORD; show MDCOUNT lines",
>
> I'd prefer "memory" over "RAM" because it's what the mnemonic is
> abbreviating. This applies to all of the below but I won't be adding a
> "same here" for all of them.
>
> Where we have to crush something to fit into one line we'd than have to
> break the pattern and choose from thing like:
>
> 1. Show memory
> 2. Display RAM
> 3. Display mem
>
> Personally I prefer #1 but could probably cope with #2.

I'm not dead set on RAM, but at least for me RAM was more clarifying.
Specifically when it said "memory" I always assumed it would take in
any memory address and when I first looked at this I tried to figure
out why memory addresses in the IO space didn't work with these
commands. I figured I was holding it wrong only to find out that the
commands specifically limit you to just the RAM range of memory
addresses.

That being said, I don't feel strongly so if you really like "memory"
I'll change it back.


> >   .flags = KDB_ENABLE_MEM_READ | KDB_REPEAT_NO_ARGS,
> >   },
> > - {   .name = "mdr",
> > + {   .name = "mdW",
> >   .func = kdb_md,
> > - .usage = " ",
> > - .help = "Display Raw Memory",
> > + .usage = " [ []]",
> > + .help = "Display RAM using word size (W) of 1, 2, 4, or 8",
>
> We need an "e.g. md8" in here somewhere. Otherwise it is not at all
> obvious that W is a wildcard.
>
> I guess that alternatively you could also try naming the command with
> hint that W is a wild card (what happens if you register a command
> called md?).
>
>
> > + .flags = KDB_ENABLE_MEM_READ | KDB_REPEAT_NO_ARGS,
> > + },
> > + {   .name = "mdWcN",
> > + .func = kdb_md,
> > + .usage = " [ []]",
> > + .help = "Display RAM using word size (W); show N words",
>
> Same here.

Sure, so:

.name = "md",
.help = "Display RAM using word size 1, 2, 4, or 8; e.g. md8",

.name = "mdc",
.help = "Display RAM using word size W; show N words; e.g. md4c6",

...or changing RAM to "memory" if you don't buy my argument above.

We're definitely ending up over the 80 character mark here, but I
assume that's OK. We were even before my change.

I'll assume that I don't need the "e.g." for all the followup (mdp,
mdi) variants introduced in later patches?


Random question: for the mdWcN variant, should I make specifying
 illegal? It's pretty silly to let the user specify a word
count and then immediately override it. In that case, do I bump
"" to the 2nd argument or just don't allow "" for mdWcN?
That would need to be done in a later patch, obviously...

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: address -Wformat-security warnings

2024-05-28 Thread Doug Anderson
Hi,

On Tue, May 28, 2024 at 5:12 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> When -Wformat-security is not disabled, using a string pointer
> as a format causes a warning:
>
> kernel/debug/kdb/kdb_io.c: In function 'kdb_read':
> kernel/debug/kdb/kdb_io.c:365:36: error: format not a string literal and no 
> format arguments [-Werror=format-security]
>   365 | kdb_printf(kdb_prompt_str);
>   |^~
> kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
> kernel/debug/kdb/kdb_io.c:456:20: error: format not a string literal and no 
> format arguments [-Werror=format-security]
>   456 | kdb_printf(kdb_prompt_str);
>   |^~
>
> Use an explcit "%s" format instead.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  kernel/debug/kdb/kdb_io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Fixes: 5d5314d6795f ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson 

...probably also justifies a:

Cc: sta...@vger.kernel.org


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 7/7] kdb: Simplify management of tmpbuffer in kdb_read()

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:38 AM Daniel Thompson
 wrote:
>
> The current approach to filling tmpbuffer with completion candidates is
> confusing, with the buffer management being especially hard to reason
> about. That's because it doesn't copy the completion canidate into
> tmpbuffer, instead of copies a whole bunch of other nonsense and then
> runs the completion stearch from the middle of tmpbuffer!
>
> Change this to copy nothing but the completion candidate into tmpbuffer.
>
> Pretty much everything else in this patch is renaming to reflect the
> above change:
>
> s/p_tmp/tmpbuffer/
> s/buf_size/sizeof(tmpbuffer)/
>
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 40 +---
>  1 file changed, 17 insertions(+), 23 deletions(-)

Definitely an improvement.

Reviewed-by: Douglas Anderson 


> diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c
> index 94a638a9d52fa..640208675c9a8 100644
> --- a/kernel/debug/kdb/kdb_io.c
> +++ b/kernel/debug/kdb/kdb_io.c
> @@ -227,8 +227,7 @@ static char *kdb_read(char *buffer, size_t bufsize)
> int count;
> int i;
> int diag, dtab_count;
> -   int key, buf_size, ret;
> -
> +   int key, ret;
>
> diag = kdbgetintenv("DTABCOUNT", _count);
> if (diag)
> @@ -310,21 +309,16 @@ static char *kdb_read(char *buffer, size_t bufsize)
> case 9: /* Tab */
> if (tab < 2)
> ++tab;
> -   p_tmp = buffer;
> -   while (*p_tmp == ' ')
> -   p_tmp++;
> -   if (p_tmp > cp)
> -   break;
> -   memcpy(tmpbuffer, p_tmp, cp-p_tmp);
> -   *(tmpbuffer + (cp-p_tmp)) = '\0';
> -   p_tmp = strrchr(tmpbuffer, ' ');
> -   if (p_tmp)
> -   ++p_tmp;
> -   else
> -   p_tmp = tmpbuffer;
> -   len = strlen(p_tmp);
> -   buf_size = sizeof(tmpbuffer) - (p_tmp - tmpbuffer);
> -   count = kallsyms_symbol_complete(p_tmp, buf_size);
> +
> +   tmp = *cp;
> +   *cp = '\0';
> +   p_tmp = strrchr(buffer, ' ');
> +   p_tmp = (p_tmp ? p_tmp + 1 : buffer);
> +   strscpy(tmpbuffer, p_tmp, sizeof(tmpbuffer));

You're now using strscpy() here. Is that actually important, or are
you just following good practices and being extra paranoid? If it's
actually important, this probably also needs to be CCed to stable,
right? The old code just assumed that it  could copy the whole buffer
into tmpbuffer. I assume that was OK, but it wasn't documented in the
function comments that there was a maximum size that buffer could
be...


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 6/7] kdb: Replace double memcpy() with memmove() in kdb_read()

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:38 AM Daniel Thompson
 wrote:
>
> At several points in kdb_read() there are variants of the following
> code pattern (with offsets slightly altered):
>
> memcpy(tmpbuffer, cp, lastchar - cp);
> memcpy(cp-1, tmpbuffer, lastchar - cp);
> *(--lastchar) = '\0';
>
> There is no need to use tmpbuffer here, since we can use memmove() instead
> so refactor in the obvious way. Additionally the strings that are being
> copied are already properly terminated so let's also change the code so
> that the library calls also move the terminator.
>
> Changing how the terminators are managed has no functional effect for now
> but might allow us to retire lastchar at a later point. lastchar, although
> stored as a pointer, is functionally equivalent to caching strlen(buffer).
>
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 19 +++
>  1 file changed, 7 insertions(+), 12 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 1/7] kdb: Fix buffer overflow during tab-complete

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:37 AM Daniel Thompson
 wrote:
>
> Currently, when the user attempts symbol completion with the Tab key, kdb
> will use strncpy() to insert the completed symbol into the command buffer.
> Unfortunately it passes the size of the source buffer rather than the
> destination to strncpy() with predictably horrible results. Most obviously
> if the command buffer is already full but cp, the cursor position, is in
> the middle of the buffer, then we will write past the end of the supplied
> buffer.
>
> Fix this by replacing the dubious strncpy() calls with memmove()/memcpy()
> calls plus explicit boundary checks to make sure we have enough space
> before we start moving characters around.
>
> Reported-by: Justin Stitt 
> Closes: 
> https://lore.kernel.org/all/cafhgd8qesuuifuhsnjfpr-va3p80bxrw+lqvc8dea8gziuj...@mail.gmail.com/
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 21 +
>  1 file changed, 13 insertions(+), 8 deletions(-)

Boy, this function (and especially the tab handling) is hard to read.
I spent a bit of time trying to grok it and, as far as I can tell,
your patch makes things better and I don't see any bugs.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 4/7] kdb: Merge identical case statements in kdb_read()

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:38 AM Daniel Thompson
 wrote:
>
> The code that handles case 14 (down) and case 16 (up) has been copy and
> pasted despite being byte-for-byte identical. Combine them.
>
> Cc: sta...@vger.kernel.org # Not a bug fix but it is needed for later bug 
> fixes
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 5/7] kdb: Use format-specifiers rather than memset() for padding in kdb_read()

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:38 AM Daniel Thompson
 wrote:
>
> Currently when the current line should be removed from the display
> kdb_read() uses memset() to fill a temporary buffer with spaces.
> The problem is not that this could be trivially implemented using a
> format string rather than open coding it. The real problem is that
> it is possible, on systems with a long kdb_prompt_str, to write pas

nit: s/pas/past

> the end of the tmpbuffer.
>
> Happily, as mentioned above, this can be trivially implemented using a
> format string. Make it so!
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 3/7] kdb: Fix console handling when editing and tab-completing commands

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:37 AM Daniel Thompson
 wrote:
>
> Currently, if the cursor position is not at the end of the command buffer
> and the user uses the Tab-complete functions, then the console does not
> leave the cursor in the correct position.
>
> For example consider the following buffer with the cursor positioned
> at the ^:
>
> md kdb_pro 10
>   ^
>
> Pressing tab should result in:
>
> md kdb_prompt_str 10
>  ^
>
> However this does not happen. Instead the cursor is placed at the end
> (after then 10) and further cursor movement redraws incorrectly. The
> same problem exists when we double-Tab but in a different part of the
> code.
>
> Fix this by sending a carriage return and then redisplaying the text to
> the left of the cursor.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 5 +
>  1 file changed, 5 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 2/7] kdb: Use format-strings rather than '\0' injection in kdb_read()

2024-04-22 Thread Doug Anderson
Hi,

On Mon, Apr 22, 2024 at 9:37 AM Daniel Thompson
 wrote:
>
> Currently when kdb_read() needs to reposition the cursor it uses copy and
> paste code that works by injecting an '\0' at the cursor position before
> delivering a carriage-return and reprinting the line (which stops at the
> '\0').
>
> Tidy up the code by hoisting the copy and paste code into an appropriately
> named function. Additionally let's replace the '\0' injection with a
> proper field width parameter so that the string will be abridged during
> formatting instead.
>
> Cc: sta...@vger.kernel.org # Not a bug fix but it is needed for later bug 
> fixes
> Signed-off-by: Daniel Thompson 
> ---
>  kernel/debug/kdb/kdb_io.c | 34 ++
>  1 file changed, 14 insertions(+), 20 deletions(-)

Looks like a nice fix, but I think this'll create a compile warning on
some compilers. The variable "tmp" is no longer used, I think.

Once the "tmp" variable is deleted, feel free to add my Reviewed-by.

NOTE: patch #7 in your series re-adds a user of "tmp", but since this
one is "Cc: stable" you will need to delete it here and then re-add it
in patch #7.


> diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c
> index 06dfbccb10336..a42607e4d1aba 100644
> --- a/kernel/debug/kdb/kdb_io.c
> +++ b/kernel/debug/kdb/kdb_io.c
> @@ -184,6 +184,13 @@ char kdb_getchar(void)
> unreachable();
>  }
>
> +static void kdb_position_cursor(char *prompt, char *buffer, char *cp)
> +{
> +   kdb_printf("\r%s", kdb_prompt_str);
> +   if (cp > buffer)
> +   kdb_printf("%.*s", (int)(cp - buffer), buffer);

nit: personally, I'd take the "if" statement out. I'm nearly certain
that kdb_printf() can handle zero-length for the width argument and
"buffer" can never be _after_ cp (so you can't get negative).


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] serial: kgdboc: Fix NMI-safety problems from keyboard reset code

2024-04-22 Thread Doug Anderson
Hi,

On Fri, Apr 19, 2024 at 3:30 AM Daniel Thompson
 wrote:
>
> Currently, when kdb is compiled with keyboard support, then we will use
> schedule_work() to provoke reset of the keyboard status.  Unfortunately
> schedule_work() gets called from the kgdboc post-debug-exception
> handler.  That risks deadlock since schedule_work() is not NMI-safe and,
> even on platforms where the NMI is not directly used for debugging, the
> debug trap can have NMI-like behaviour depending on where breakpoints
> are placed.
>
> Fix this by using the irq work system, which is NMI-safe, to defer the
> call to schedule_work() to a point when it is safe to call.
>
> Reported-by: Liuye 
> Closes: https://lore.kernel.org/all/20240228025602.3087748-1-liu@h3c.com/
> Cc: sta...@vger.kernel.org
> Signed-off-by: Daniel Thompson 
> ---
>  drivers/tty/serial/kgdboc.c | 30 +-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> index 7ce7bb1640054..adcea70fd7507 100644
> --- a/drivers/tty/serial/kgdboc.c
> +++ b/drivers/tty/serial/kgdboc.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -48,6 +49,25 @@ static struct kgdb_io
> kgdboc_earlycon_io_ops;
>  static int  (*earlycon_orig_exit)(struct console *con);
>  #endif /* IS_BUILTIN(CONFIG_KGDB_SERIAL_CONSOLE) */
>
> +/*
> + * When we leave the debug trap handler we need to reset the keyboard status
> + * (since the original keyboard state gets partially clobbered by kdb use of
> + * the keyboard).
> + *
> + * The path to deliver the reset is somewhat circuitous.
> + *
> + * To deliver the reset we register an input handler, reset the keyboard and
> + * then deregister the input handler. However, to get this done right, we do
> + * have to carefully manage the calling context because we can only register
> + * input handlers from task context.
> + *
> + * In particular we need to trigger the action from the debug trap handler 
> with
> + * all its NMI and/or NMI-like oddities. To solve this the kgdboc trap exit 
> code
> + * (the "post_exception" callback) uses irq_work_queue(), which is NMI-safe, 
> to
> + * schedule a callback from a hardirq context. From there we have to defer 
> the
> + * work again, this time using schedule_Work(), to get a callback using the

nit: schedule_work() (no capital "W").

> + * system workqueue, which runs in task context.

Thank you for the comment. It makes the double-jump through IRQ work
and then normal work clearer.


Other than the nit in the comment, this looks good to me.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: Use str_plural() to fix Coccinelle warning

2024-03-28 Thread Doug Anderson
Hi,

On Thu, Mar 28, 2024 at 7:03 AM Thorsten Blum  wrote:
>
> Fixes the following Coccinelle/coccicheck warning reported by
> string_choices.cocci:
>
> opportunity for str_plural(days)
>
> Signed-off-by: Thorsten Blum 
> ---
>  kernel/debug/kdb/kdb_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH V2] tty: serial: kgdboc: Fix 8250_* kgd over serial

2023-12-18 Thread Doug Anderson
Hi,

On Sun, Dec 17, 2023 at 11:34 PM Michael Trimarchi
 wrote:
>
> Check if port type is not PORT_UNKNOWN during poll_init.
> The kgdboc calls the tty_find_polling_driver that check
> if the serial is able to use poll_init. The poll_init calls
> the uart uart_poll_init that try to configure the uart with the
> selected boot parameters. The uart must be ready before setting
> parameters. Seems that PORT_UNKNOWN is already used by other
> functions in serial_core to detect uart status, so use the same
> to avoid to use it in invalid state.
>
> The crash happen for instance in am62x architecture where the 8250
> register the platform driver after the 8250 core is initialized.
>
> Follow the report crash coming from KGDB
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
>
> This section of the code is too early because in this case
> the omap serial is not probed
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
> 0  _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 1  logic_outb (value=0 '\000', addr=18446739675637874689) at 
> lib/logic_pio.c:299
> 2  0x80008082dfcc in io_serial_out (p=0x0, offset=16760830, value=0) at 
> drivers/tty/serial/8250/8250_port.c:416
> 3  0x80008082fe34 in serial_port_out (value=, 
> offset=, up=)
> at ./include/linux/serial_core.h:677
> 4  serial8250_do_set_termios (port=0x8000828ee940 
> , termios=0x80008292b93c, old=0x0)
> at drivers/tty/serial/8250/8250_port.c:2860
> 5  0x800080830064 in serial8250_set_termios (port=0xfbfffe80, 
> termios=0xffbffe, old=0x0)
> at drivers/tty/serial/8250/8250_port.c:2912
> 6  0x80008082571c in uart_set_options (port=0x8000828ee940 
> , co=0x0, baud=115200, parity=110, bits=8, flow=110)
> at drivers/tty/serial/serial_core.c:2285
> 7  0x800080828434 in uart_poll_init (driver=0xfbfffe80, 
> line=16760830, options=0x8000828f7506  "115200n8")
> at drivers/tty/serial/serial_core.c:2656
> 8  0x800080801690 in tty_find_polling_driver (name=0x8000828f7500 
>  "ttyS2,115200n8", line=0x80008292ba90)
> at drivers/tty/tty_io.c:410
> 9  0x80008086c0b0 in configure_kgdboc () at 
> drivers/tty/serial/kgdboc.c:194
> 10 0x80008086c1ec in kgdboc_probe (pdev=0xfbfffe80) at 
> drivers/tty/serial/kgdboc.c:249
> 11 0x8000808b399c in platform_probe (_dev=0x00ebb810) at 
> drivers/base/platform.c:1404
> 12 0x8000808b0b44 in call_driver_probe (drv=, 
> dev=) at drivers/base/dd.c:579
> 13 really_probe (dev=0x00ebb810, drv=0x80008277f138 
> ) at drivers/base/dd.c:658
> 14 0x8000808b0d2c in __driver_probe_device (drv=0x80008277f138 
> , dev=0x00ebb810)
> at drivers/base/dd.c:800
> 15 0x8000808b0eb8 in driver_probe_device (drv=0xfbfffe80, 
> dev=0x00ebb810) at drivers/base/dd.c:830
> 16 0x8000808b0ff4 in __device_attach_driver (drv=0x80008277f138 
> , _data=0x80008292bc48)
> at drivers/base/dd.c:958
> 17 0x8000808ae970 in bus_for_each_drv (bus=0xfbfffe80, start=0x0, 
> data=0x80008292bc48,
> fn=0x8000808b0f3c <__device_attach_driver>) at drivers/base/bus.c:457
> 18 0x8000808b1408 in __device_attach (dev=0x00ebb810, 
> allow_async=true) at drivers/base/dd.c:1030
> 19 0x8000808b16d8 in device_initial_probe (dev=0xfbfffe80) at 
> drivers/base/dd.c:1079
> 20 0x8000808af9f4 in bus_probe_device (dev=0x00ebb810) at 
> drivers/base/bus.c:532
> 21 0x8000808ac77c in device_add (dev=0xfbfffe80) at 
> drivers/base/core.c:3625
> 22 0x8000808b3428 in platform_device_add (pdev=0x00ebb800) at 
> drivers/base/platform.c:716
> 23 0x800081b5dc0c in init_kgdboc () at drivers/tty/serial/kgdboc.c:292
> 24 0x800080014db0 in do_one_initcall (fn=0x800081b5dba4 
> ) at init/main.c:1236
> 25 0x800081b0114c in do_initcall_level (command_line=, 
> level=) at init/main.c:1298
> 26 do_initcalls () at init/main.c:1314
> 27 do_basic_setup () at init/main.c:1333
> 28 kernel_init_freeable () at init/main.c:1551
> 29 0x8000810271ec in kernel_init (unused=0xfbfffe80) at 
> init/main.c:1441
> 30 0x800080015e80 in ret_from_fork () at arch/arm64/kernel/entry.S:857
>
> Signed-off-by: Michael Trimarchi 
> ---
> v1 -> v2:
> - fix if condition during submission
> - improve a bit the commit message
> RFC -> v1:
> - refuse uart that has type PORT_UNKNOWN

Re: [Kgdb-bugreport] [RFC PATCH] tty: serial: kgdboc: Fix 8250_* kgd over serial

2023-12-11 Thread Doug Anderson
Hi,

On Fri, Dec 8, 2023 at 1:28 PM Michael Trimarchi
 wrote:
>
> Use late_initcall_sync insted of module init to be sure that
> serial driver is really probed and get take handover from
> early driver.

Awesome that you used the earlycon driver to debug problems with
registering the normal driver! :-P


> The 8250 register the platform driver after
> the 8250 core is initialized. As shown by kdbg
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
>
> This section of the code is too early because in this case
> the omap serial is not probed
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
>
> Thread 2 received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1]
> _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 584 __raw_writeb(value, PCI_IOBASE + addr);
> (gdb) bt
> 0  _outb (addr=, value=) at 
> ./include/asm-generic/io.h:584
> 1  logic_outb (value=0 '\000', addr=18446739675637874689) at 
> lib/logic_pio.c:299
> 2  0x80008082dfcc in io_serial_out (p=0x0, offset=16760830, value=0) at 
> drivers/tty/serial/8250/8250_port.c:416
> 3  0x80008082fe34 in serial_port_out (value=, 
> offset=, up=)
> at ./include/linux/serial_core.h:677
> 4  serial8250_do_set_termios (port=0x8000828ee940 
> , termios=0x80008292b93c, old=0x0)
> at drivers/tty/serial/8250/8250_port.c:2860
> 5  0x800080830064 in serial8250_set_termios (port=0xfbfffe80, 
> termios=0xffbffe, old=0x0)
> at drivers/tty/serial/8250/8250_port.c:2912
> 6  0x80008082571c in uart_set_options (port=0x8000828ee940 
> , co=0x0, baud=115200, parity=110, bits=8, flow=110)
> at drivers/tty/serial/serial_core.c:2285
> 7  0x800080828434 in uart_poll_init (driver=0xfbfffe80, 
> line=16760830, options=0x8000828f7506  "115200n8")
> at drivers/tty/serial/serial_core.c:2656
> 8  0x800080801690 in tty_find_polling_driver (name=0x8000828f7500 
>  "ttyS2,115200n8", line=0x80008292ba90)
> at drivers/tty/tty_io.c:410
> 9  0x80008086c0b0 in configure_kgdboc () at 
> drivers/tty/serial/kgdboc.c:194
> 10 0x80008086c1ec in kgdboc_probe (pdev=0xfbfffe80) at 
> drivers/tty/serial/kgdboc.c:249
> 11 0x8000808b399c in platform_probe (_dev=0x00ebb810) at 
> drivers/base/platform.c:1404
> 12 0x8000808b0b44 in call_driver_probe (drv=, 
> dev=) at drivers/base/dd.c:579
> 13 really_probe (dev=0x00ebb810, drv=0x80008277f138 
> ) at drivers/base/dd.c:658
> 14 0x8000808b0d2c in __driver_probe_device (drv=0x80008277f138 
> , dev=0x00ebb810)
> at drivers/base/dd.c:800
> 15 0x8000808b0eb8 in driver_probe_device (drv=0xfbfffe80, 
> dev=0x00ebb810) at drivers/base/dd.c:830
> 16 0x8000808b0ff4 in __device_attach_driver (drv=0x80008277f138 
> , _data=0x80008292bc48)
> at drivers/base/dd.c:958
> 17 0x8000808ae970 in bus_for_each_drv (bus=0xfbfffe80, start=0x0, 
> data=0x80008292bc48,
> fn=0x8000808b0f3c <__device_attach_driver>) at drivers/base/bus.c:457
> 18 0x8000808b1408 in __device_attach (dev=0x00ebb810, 
> allow_async=true) at drivers/base/dd.c:1030
> 19 0x8000808b16d8 in device_initial_probe (dev=0xfbfffe80) at 
> drivers/base/dd.c:1079
> 20 0x8000808af9f4 in bus_probe_device (dev=0x00ebb810) at 
> drivers/base/bus.c:532
> 21 0x8000808ac77c in device_add (dev=0xfbfffe80) at 
> drivers/base/core.c:3625
> 22 0x8000808b3428 in platform_device_add (pdev=0x00ebb800) at 
> drivers/base/platform.c:716
> 23 0x800081b5dc0c in init_kgdboc () at drivers/tty/serial/kgdboc.c:292
> 24 0x800080014db0 in do_one_initcall (fn=0x800081b5dba4 
> ) at init/main.c:1236
> 25 0x800081b0114c in do_initcall_level (command_line=, 
> level=) at init/main.c:1298
> 26 do_initcalls () at init/main.c:1314
> 27 do_basic_setup () at init/main.c:1333
> 28 kernel_init_freeable () at init/main.c:1551
> 29 0x8000810271ec in kernel_init (unused=0xfbfffe80) at 
> init/main.c:1441
> 30 0x800080015e80 in ret_from_fork () at arch/arm64/kernel/entry.S:857
>
> Signed-off-by: Michael Trimarchi 
> ---
>  drivers/tty/serial/kgdboc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> index 7ce7bb164005..7f8364507f55 100644
> --- a/drivers/tty/serial/kgdboc.c
> +++ b/drivers/tty/serial/kgdboc.c
> @@ -622,7 +622,7 @@ console_initcall(kgdboc_earlycon_late_init);
>
>  #endif /* IS_BUILTIN(CONFIG_KGDB_SERIAL_CONSOLE) */
>
> -module_init(init_kgdboc);
> +late_initcall_sync(init_kgdboc);

While I'm not denying that 

Re: [Kgdb-bugreport] [RFC PATCH] tty: serial: kgdboc: Fix 8250_* kgd over serial

2023-12-11 Thread Doug Anderson
Hi,

On Mon, Dec 11, 2023 at 1:42 PM Michael Nazzareno Trimarchi
 wrote:
>
> > 1. init_kgdboc() runs and registers the singleton kgdb "platform driver".
> >
> > 2. The platform driver's probe function, kgdboc_probe(), runs and
> > checks to see if the console is ready by looking at the return value
> > of configure_kgdboc(). If it's ready then we're good to go. If it's
> > not ready then we defer.
> >
> > So I think the bug here is that somehow the console looks "ready"
> > (because tty_find_polling_driver() can find it) but it isn't actually
> > ready yet (because it crashes). That's what you need to fix.
> >
>
> The polling driver look for uart and uart8250_core is registered and 4 fake 
> uart
> are there but there are not still replaced by platform driver that can
> come later.
> The try_polling find it but it's the isa-8250 driver. It means that
> add_uart 8250 is
> not still happen

The 8250 driver is always a maze, so you might need to do a bunch of
digging. ...but it sure sounds like the console shouldn't be
registered until the correct ops are in place. That either means
getting the ops put in place earlier or deferring when the console is
registered...

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [PATCH 2/2] trace: kdb: Replace simple_strtoul with kstrtoul in kdb_ftdump

2023-12-06 Thread Doug Anderson
Hi,

On Wed, Dec 6, 2023 at 3:38 AM Daniel Thompson
 wrote:
>
> On Tue, Dec 05, 2023 at 01:41:57PM -0800, Doug Anderson wrote:
> > Hi,
> >
> > On Sun, Nov 19, 2023 at 4:10 PM Yuran Pereira  
> > wrote:
> > >
> > > The function simple_strtoul performs no error checking in scenarios
> > > where the input value overflows the intended output variable.
> > > This results in this function successfully returning, even when the
> > > output does not match the input string (aka the function returns
> > > successfully even when the result is wrong).
> > >
> > > Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(),
> > > simple_strtoul(), and simple_strtoull() functions explicitly ignore
> > > overflows, which may lead to unexpected results in callers."
> > > Hence, the use of those functions is discouraged.
> > >
> > > This patch replaces all uses of the simple_strtoul with the safer
> > > alternatives kstrtoint and kstrtol.
> > >
> > > [1] 
> > > https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull
> > >
> > > Signed-off-by: Yuran Pereira 
> > > ---
> > >  kernel/trace/trace_kdb.c | 14 ++
> > >  1 file changed, 6 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c
> > > index 59857a1ee44c..3891f885e4a6 100644
> > > --- a/kernel/trace/trace_kdb.c
> > > +++ b/kernel/trace/trace_kdb.c
> > > @@ -96,23 +96,21 @@ static int kdb_ftdump(int argc, const char **argv)
> > >  {
> > > int skip_entries = 0;
> > > long cpu_file;
> > > -   char *cp;
> > > +   int err;
> > > int cnt;
> > > int cpu;
> > >
> > > if (argc > 2)
> > > return KDB_ARGCOUNT;
> > >
> > > -   if (argc) {
> > > -   skip_entries = simple_strtol(argv[1], , 0);
> > > -   if (*cp)
> > > +   if (argc)
> > > +   if (kstrtoint(argv[1], 0, _entries))
> > > skip_entries = 0;
> > > -   }
> >
> > Similar nit about braces as in patch #1. tl;dr is change the above to:
> >
> > if (argc && kstrtoint(argv[1], 0, _entries))
> >   skip_entries = 0;
>
> Surely that should be:
>
> if (...)
> return KDB_BADINT;
>
> There seems little point switching to a "safer" API if we just ignore the
> errors it provides us.

Ah, sure. I have no objections to that. Note that would have also been
possible with the old code, which did still do awkward error checking,
so I assumed that it was a conscious decision. ...but I'm definitely
happier with the error being reported instead of glossed over.

-Doug



Re: [PATCH 2/2] trace: kdb: Replace simple_strtoul with kstrtoul in kdb_ftdump

2023-12-05 Thread Doug Anderson
Hi,

On Sun, Nov 19, 2023 at 4:10 PM Yuran Pereira  wrote:
>
> The function simple_strtoul performs no error checking in scenarios
> where the input value overflows the intended output variable.
> This results in this function successfully returning, even when the
> output does not match the input string (aka the function returns
> successfully even when the result is wrong).
>
> Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(),
> simple_strtoul(), and simple_strtoull() functions explicitly ignore
> overflows, which may lead to unexpected results in callers."
> Hence, the use of those functions is discouraged.
>
> This patch replaces all uses of the simple_strtoul with the safer
> alternatives kstrtoint and kstrtol.
>
> [1] 
> https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull
>
> Signed-off-by: Yuran Pereira 
> ---
>  kernel/trace/trace_kdb.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c
> index 59857a1ee44c..3891f885e4a6 100644
> --- a/kernel/trace/trace_kdb.c
> +++ b/kernel/trace/trace_kdb.c
> @@ -96,23 +96,21 @@ static int kdb_ftdump(int argc, const char **argv)
>  {
> int skip_entries = 0;
> long cpu_file;
> -   char *cp;
> +   int err;
> int cnt;
> int cpu;
>
> if (argc > 2)
> return KDB_ARGCOUNT;
>
> -   if (argc) {
> -   skip_entries = simple_strtol(argv[1], , 0);
> -   if (*cp)
> +   if (argc)
> +   if (kstrtoint(argv[1], 0, _entries))
> skip_entries = 0;
> -   }

Similar nit about braces as in patch #1. tl;dr is change the above to:

if (argc && kstrtoint(argv[1], 0, _entries))
  skip_entries = 0;


>
> if (argc == 2) {
> -   cpu_file = simple_strtol(argv[2], , 0);
> -   if (*cp || cpu_file >= NR_CPUS || cpu_file < 0 ||
> -   !cpu_online(cpu_file))
> +   err = kstrtol(argv[2], 0, _file);
> +   if (err || cpu_file >= NR_CPUS || cpu_file < 0 ||
> +   !cpu_online(cpu_file))

nit: why did you change the indentation for "!cpu_online(cpu_file))"?
It seemed better before.

With nits fixed:

Reviewed-by: Douglas Anderson 



Re: [PATCH 1/2] kdb: Replace the use of simple_strto with safer kstrto in kdb_main

2023-12-05 Thread Doug Anderson
Hi,

On Sun, Nov 19, 2023 at 4:07 PM Yuran Pereira  wrote:
>
> The simple_str* family of functions perform no error checking in
> scenarios where the input value overflows the intended output variable.
> This results in these functions successfully returning even when the
> output does not match the input string.
>
> Or as it was mentioned [1], "...simple_strtol(), simple_strtoll(),
> simple_strtoul(), and simple_strtoull() functions explicitly ignore
> overflows, which may lead to unexpected results in callers."
> Hence, the use of those functions is discouraged.
>
> This patch replaces all uses of the simple_strto* series of functions
> with their safer kstrto* alternatives.
>
> Side effects of this patch:
> - Every string to long or long long conversion using kstrto* is now
>   checked for failure.
> - kstrto* errors are handled with appropriate `KDB_BADINT` wherever
>   applicable.
> - A good side effect is that we end up saving a few lines of code
>   since unlike in simple_strto* functions, kstrto functions do not
>   need an additional "end pointer" variable, and the return values
>   of the latter can be directly checked in an "if" statement without
>   the need to define additional `ret` or `err` variables.
>   This, of course, results in cleaner, yet still easy to understand
>   code.
>
> [1] 
> https://www.kernel.org/doc/html/latest/process/deprecated.html#simple-strtol-simple-strtoll-simple-strtoul-simple-strtoull
>
> Signed-off-by: Yuran Pereira 
> ---
>  kernel/debug/kdb/kdb_main.c | 70 +++--
>  1 file changed, 21 insertions(+), 49 deletions(-)

Sorry for taking so long to review this--it arrived in my inbox at a
bad time. A few minor nits below that I think should be fixed before
landing but overall I think it's a nice cleanup. Thanks!


> @@ -412,42 +412,21 @@ static void kdb_printenv(void)
>   */
>  int kdbgetularg(const char *arg, unsigned long *value)
>  {
> -   char *endp;
> -   unsigned long val;
> -
> -   val = simple_strtoul(arg, , 0);
> -
> -   if (endp == arg) {
> -   /*
> -* Also try base 16, for us folks too lazy to type the
> -* leading 0x...
> -*/
> -   val = simple_strtoul(arg, , 16);
> -   if (endp == arg)
> +   /*
> +* If the first fails, also try base 16, for us
> +* folks too lazy to type the leading 0x...
> +*/
> +   if (kstrtoul(arg, 0, value))
> +   if (kstrtoul(arg, 16, value))

Not new to your patch, but the above seems like a terrible idea to me.
What that means is that:

kdbgetularg("18", ) => value is 18
kdbgetularg("19", ) => value is 19
kdbgetularg("1a", ) => value is 26

Bleh! If someone wants hex then they should put the 0x first.

I'd suggest a followup patch that removes the fallback for the lazy
folks. Here and in the next function...


> @@ -2095,15 +2074,11 @@ static int kdb_dmesg(int argc, const char **argv)
> if (argc > 2)
> return KDB_ARGCOUNT;
> if (argc) {
> -   char *cp;
> -   lines = simple_strtol(argv[1], , 0);
> -   if (*cp)
> +   if (kstrtoint(argv[1], 0, ))
> lines = 0;
> -   if (argc > 1) {
> -   adjust = simple_strtoul(argv[2], , 0);
> -   if (*cp || adjust < 0)
> +   if (argc > 1)
> +   if (kstrtouint(argv[2], 0, ) || adjust < 0)

My gut reaction is that some sort of build bot is going to come and
yell at you about the above line. Even if it doesn't, it's a bit
confusing. You're passing a pointer to an int into a function that
expects a pointer to an unsigned int. Most things don't really care
about signed/unsigned, but I could swear that some compilers get mad
when you start working with pointers to those types...

In any case, I think everything would work fine if you just change it
to kstrtoint(), right? I guess the other option would be to change the
variable to unsigned, but I guess that doesn't make sense since it's a
modifier to "lines" which is an int.

Side note: I didn't even know about the "adjust" argument, since it's
not in the help text in the command table below. I guess that could be
fixed in a separate patch.

nit: IMO if you have nested "if" statements then the outer one should
have braces. AKA:

if (a) {
  if (b)
blah();
}

instead of:

if (a)
  if (b)
blah();

...or you could do better and just change it to:

if (a && b)
  blah();



Re: [Kgdb-bugreport] [PATCH v2] kdb: Fix a potential buffer overflow in kdb_local()

2023-11-27 Thread Doug Anderson
Hi,

On Sat, Nov 25, 2023 at 4:05 AM Christophe JAILLET
 wrote:
>
> When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
> already in the buffer should be taken into account.
>
> An option could be to switch from strncat() to strlcat() which does the
> correct test to avoid such an overflow.
>
> However, this actually looks as dead code, because 'defcmd_in_progress'
> can't be true here.
> See a more detailed explanation at [1].
>
> [1]: 
> https://lore.kernel.org/all/CAD=FV=wsh7wkn7yp-3wwidgx4e3isq8uh0lcztmd1v9cg9j...@mail.gmail.com/
>
> Fixes: 5d5314d6795f ("kdb: core for kgdb back end (1 of 2)")
> Signed-off-by: Christophe JAILLET 
> ---
> Changes in v2:
>- Delete the strncat() call   [Doug Anderson]
>
> v1: 
> https://lore.kernel.org/all/0b1790ca91b71e3362a6a4c2863bc5787b4d60c9.1698501284.git.christophe.jail...@wanadoo.fr/
> ---
>  kernel/debug/kdb/kdb_main.c | 2 --
>  1 file changed, 2 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: Fix a potential buffer overflow in kdb_local()

2023-10-30 Thread Doug Anderson
Hi,

On Sat, Oct 28, 2023 at 6:55 AM Christophe JAILLET
 wrote:
>
> When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
> already in the buffer should be taken into account.
>
> Switch from strncat() to strlcat() which does the correct test to avoid
> such an overflow.
>
> Fixes: 5d5314d6795f ("kdb: core for kgdb back end (1 of 2)")
> Signed-off-by: Christophe JAILLET 
> ---
>  kernel/debug/kdb/kdb_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
> index 438b868cbfa9..e5f0bf0f45d1 100644
> --- a/kernel/debug/kdb/kdb_main.c
> +++ b/kernel/debug/kdb/kdb_main.c
> @@ -1350,7 +1350,7 @@ static int kdb_local(kdb_reason_t reason, int error, 
> struct pt_regs *regs,
> snprintf(kdb_prompt_str, CMD_BUFLEN, kdbgetenv("PROMPT"),
>  raw_smp_processor_id());
> if (defcmd_in_progress)
> -   strncat(kdb_prompt_str, "[defcmd]", CMD_BUFLEN);
> +   strlcat(kdb_prompt_str, "[defcmd]", CMD_BUFLEN);

Some of this code is a bit hard to follow, but I think it's better to
simply delete the whole "strncat". Specifically, as of commit
a37372f6c3c0 ("kdb: Prevent kernel oops with kdb_defcmd") it's clear
that "defcmd" can't actually be run to define new commands
interactively. It's also clear to me that "defcmd_in_progress" is only
set when defining new commands.

The prompt being constructed here is a prompt that's printed to the
end user when working interactively. That means the "if
(defcmd_in_progress)" should never be true and it can be deleted as
dead code.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v13 3/7] arm64: smp: Remove dedicated wakeup IPI

2023-10-02 Thread Doug Anderson
Hi,

On Mon, Sep 25, 2023 at 5:39 PM Doug Anderson  wrote:
>
> Mark,
>
> On Wed, Sep 6, 2023 at 9:06 AM Douglas Anderson  wrote:
> >
> > +#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
> > +void arch_send_wakeup_ipi(unsigned int cpu)
> > +{
> > +   /*
> > +* We use a scheduler IPI to wake the CPU as this avoids the need 
> > for a
> > +* dedicated IPI and we can safely handle spurious scheduler IPIs.
> > +*/
> > +   arch_smp_send_reschedule(cpu);
>
> I was backporting this to our ChromeOS kernels and our build test bot
> noticed that arch_smp_send_reschedule() didn't exist in older kernels.
> That's fine--I can always adjust this patch when backporting or
> cherry-pick extra patches, but it made me wonder. Is there a reason
> you chose to use arch_smp_send_reschedule() directly here instead of
> smp_send_reschedule()? I guess the only difference is that you're
> bypassing the tracing. Is that on purpose? Should we add a comment
> about it, or change this to smp_send_reschedule()?

FWIW, I posted a patch changing this to smp_send_reschedule(). Please
yell if this is incorrect.

https://lore.kernel.org/r/20231002094526.2.I2e6d22fc42ccbf6b26465a28a10e36e05ccf3075@changeid


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v13 3/7] arm64: smp: Remove dedicated wakeup IPI

2023-09-25 Thread Doug Anderson
Mark,

On Wed, Sep 6, 2023 at 9:06 AM Douglas Anderson  wrote:
>
> +#ifdef CONFIG_ARM64_ACPI_PARKING_PROTOCOL
> +void arch_send_wakeup_ipi(unsigned int cpu)
> +{
> +   /*
> +* We use a scheduler IPI to wake the CPU as this avoids the need for 
> a
> +* dedicated IPI and we can safely handle spurious scheduler IPIs.
> +*/
> +   arch_smp_send_reschedule(cpu);

I was backporting this to our ChromeOS kernels and our build test bot
noticed that arch_smp_send_reschedule() didn't exist in older kernels.
That's fine--I can always adjust this patch when backporting or
cherry-pick extra patches, but it made me wonder. Is there a reason
you chose to use arch_smp_send_reschedule() directly here instead of
smp_send_reschedule()? I guess the only difference is that you're
bypassing the tracing. Is that on purpose? Should we add a comment
about it, or change this to smp_send_reschedule()?

Thanks!

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v12 1/7] irqchip/gic-v3: Enable support for SGIs to act as NMIs

2023-08-31 Thread Doug Anderson
Hi,

On Thu, Aug 31, 2023 at 1:53 AM Mark Rutland  wrote:
>
> On Wed, Aug 30, 2023 at 12:11:22PM -0700, Douglas Anderson wrote:
> > As of commit 6abbd6988971 ("irqchip/gic, gic-v3: Make SGIs use
> > handle_percpu_devid_irq()") SGIs are treated the same as PPIs/EPPIs
> > and use handle_percpu_devid_irq() by default. Unfortunately,
> > handle_percpu_devid_irq() isn't NMI safe, and so to run in an NMI
> > context those should use handle_percpu_devid_fasteoi_nmi().
> >
> > In order to accomplish this, we just have to make room for SGIs in the
> > array of refcounts that keeps track of which interrupts are set as
> > NMI. We also rename the array and create a new indexing scheme that
> > accounts for SGIs.
> >
> > Also, enable NMI support prior to gic_smp_init() as allocation of SGIs
> > as IRQs/NMIs happen as part of this routine.
> >
> > Co-developed-by: Sumit Garg 
> > Signed-off-by: Sumit Garg 
> > Signed-off-by: Douglas Anderson 
> > ---
> > I'll note that this change is a little more black magic to me than
> > others in this series. I don't have a massive amounts of familiarity
> > with all the moving parts of gic-v3, so I mostly just followed Mark
> > Rutland's advice [1]. Please pay extra attention to make sure I didn't
> > do anything too terrible.
> >
> > Mark's advice wasn't a full patch and I ended up doing a bit of work
> > to translate it to reality, so I did not add him as "Co-developed-by"
> > here. Mark: if you would like this tag then please provide it and your
> > Signed-off-by. I certainly won't object.
>
> That's all reasonable, and I'm perfectly happy without a tag.
>
> I have one trivial nit below, but with or without that fixed up:
>
> Acked-by: Mark Rutland 
>
> >
> > [1] 
> > https://lore.kernel.org/r/znc-yrqopo0pa...@fvff77s0q05n.cambridge.arm.com
> >
> > Changes in v12:
> > - Added a comment about why we account for 16 SGIs when Linux uses 8.
> >
> > Changes in v10:
> > - Rewrite as needed for 5.11+ as per Mark Rutland and Sumit.
> >
> >  drivers/irqchip/irq-gic-v3.c | 59 +---
> >  1 file changed, 41 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > index eedfa8e9f077..8d20122ba0a8 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -78,6 +78,13 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
> >  #define GIC_LINE_NR  min(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 
> > 1020U)
> >  #define GIC_ESPI_NR  GICD_TYPER_ESPIS(gic_data.rdists.gicd_typer)
> >
> > +/*
> > + * There are 16 SGIs, though we only actually use 8 in Linux. The other 8 
> > SGIs
> > + * are potentially stolen by the secure side. Some code, especially code 
> > dealing
> > + * with hwirq IDs, is simplified by accounting for all 16.
> > + */
> > +#define SGI_NR   16
> > +
> >  /*
> >   * The behaviours of RPR and PMR registers differ depending on the value of
> >   * SCR_EL3.FIQ, and the behaviour of non-secure priority registers of the
> > @@ -125,8 +132,8 @@ EXPORT_SYMBOL(gic_nonsecure_priorities);
> >   __priority; \
> >   })
> >
> > -/* ppi_nmi_refs[n] == number of cpus having ppi[n + 16] set as NMI */
> > -static refcount_t *ppi_nmi_refs;
> > +/* rdist_nmi_refs[n] == number of cpus having the rdist interrupt n set as 
> > NMI */
> > +static refcount_t *rdist_nmi_refs;
> >
> >  static struct gic_kvm_info gic_v3_kvm_info __initdata;
> >  static DEFINE_PER_CPU(bool, has_rss);
> > @@ -519,9 +526,22 @@ static u32 __gic_get_ppi_index(irq_hw_number_t hwirq)
> >   }
> >  }
> >
> > -static u32 gic_get_ppi_index(struct irq_data *d)
> > +static u32 __gic_get_rdist_idx(irq_hw_number_t hwirq)
> > +{
> > + switch (__get_intid_range(hwirq)) {
> > + case SGI_RANGE:
> > + case PPI_RANGE:
> > + return hwirq;
> > + case EPPI_RANGE:
> > + return hwirq - EPPI_BASE_INTID + 32;
> > + default:
> > + unreachable();
> > + }
> > +}
> > +
> > +static u32 gic_get_rdist_idx(struct irq_data *d)
> >  {
> > - return __gic_get_ppi_index(d->hwirq);
> > + return __gic_get_rdist_idx(d->hwirq);
> >  }
>
> Nit: It would be nicer to call this gic_get_rdist_index() to match
> gic_get_ppi_index(); likewise with __gic_get_rdist_index().
>
> That's my fault given I suggested the gic_get_rdist_idx() name in:
>
>   
> https://lore.kernel.org/linux-arm-kernel/znc-yrqopo0pa...@fvff77s0q05n.cambridge.arm.com/
>
> ... so sorry about that!

Yeah, I kept the name you suggested even though it seemed a little
inconsistent. I'll happily send a v13 with that fixed up, though I'll
probably wait a little bit just to avoid spamming new versions too
quickly. It's not like the patches can land in the middle of the merge
window anyway.

Unless someone says otherwise, I guess this series is in good shape to
land then. Does anyone have any plans for the details of how to land
it? I guess 

Re: [Kgdb-bugreport] [PATCH v11 4/6] arm64: smp: Add arch support for backtrace using pseudo-NMI

2023-08-29 Thread Doug Anderson
Hi,

On Mon, Aug 28, 2023 at 10:23 PM Tomohiro Misono (Fujitsu)
 wrote:
>
> > -Original Message-
> > Subject: [PATCH v11 4/6] arm64: smp: Add arch support for backtrace using
> > pseudo-NMI
> >
> > Enable arch_trigger_cpumask_backtrace() support on arm64. This enables
> > things much like they are enabled on arm32 (including some of the
> > funky logic around NR_IPI, nr_ipi, and MAX_IPI) but with the
> > difference that, unlike arm32, we'll try to enable the backtrace to
> > use pseudo-NMI.
> >
> > NOTE: this patch is a squash of the little bit of code adding the
> > ability to mark an IPI to try to use pseudo-NMI plus the little bit of
> > code to hook things up for kgdb. This approach was decided upon in the
> > discussion of v9 [1].
> >
> > This patch depends on commit 36759e343ff9 ("nmi_backtrace: allow
>
> Hi,
> Is this commit ID valid? I believe this commit is not included in the
> mainline yet. Other than this,
>  Reviewed-by: Misono Tomohiro 

Ah, good call. I must have grabbed that git hash before the commit
moved from Andrew Morton's unstable branch to his stable branch. As
far as I can tell, it should be commit 8d539b84f1e3 ("nmi_backtrace:
allow excluding an arbitrary CPU"). ...at least that's what's in
linuxnext today. That also appears to match the commit hash in the
pull request that Andrew just sent to Linus [1].

I'll update the git hash and add your tag in v12, which I'm currently
aiming to send out tomorrow.

[1] 
https://lore.kernel.org/r/20230828225431.354d3d2d3b80ee5b88e65...@linux-foundation.org


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v11 1/6] irqchip/gic-v3: Enable support for SGIs to act as NMIs

2023-08-28 Thread Doug Anderson
Hi,

On Sat, Aug 26, 2023 at 3:37 AM Marc Zyngier  wrote:
>
> On Thu, 24 Aug 2023 16:30:27 +0100,
> Douglas Anderson  wrote:
> >
> > As of commit 6abbd6988971 ("irqchip/gic, gic-v3: Make SGIs use
> > handle_percpu_devid_irq()") SGIs are treated the same as PPIs/EPPIs
> > and use handle_percpu_devid_irq() by default. Unfortunately,
> > handle_percpu_devid_irq() isn't NMI safe, and so to run in an NMI
> > context those should use handle_percpu_devid_fasteoi_nmi().
> >
> > In order to accomplish this, we just have to make room for SGIs in the
> > array of refcounts that keeps track of which interrupts are set as
> > NMI. We also rename the array and create a new indexing scheme that
> > accounts for SGIs.
> >
> > Also, enable NMI support prior to gic_smp_init() as allocation of SGIs
> > as IRQs/NMIs happen as part of this routine.
> >
> > Co-developed-by: Sumit Garg 
> > Signed-off-by: Sumit Garg 
> > Signed-off-by: Douglas Anderson 
> > ---
> > In v10 I removed the previous Reviewed-by and Tested-by tags since the
> > patch contents changed pretty drastically.
> >
> > I'll also note that this change is a little more black magic to me
> > than others in this series. I don't have a massive amounts of
> > familiarity with all the moving parts of gic-v3, so I mostly just
> > followed Mark Rutland's advice [1]. Please pay extra attention to make
> > sure I didn't do anything too terrible.
> >
> > Mark's advice wasn't a full patch and I ended up doing a bit of work
> > to translate it to reality, so I did not add him as "Co-developed-by"
> > here. Mark: if you would like this tag then please provide it and your
> > Signed-off-by. I certainly won't object.
> >
> > [1] 
> > https://lore.kernel.org/r/znc-yrqopo0pa...@fvff77s0q05n.cambridge.arm.com
> >
> > (no changes since v10)
> >
> > Changes in v10:
> > - Rewrite as needed for 5.11+ as per Mark Rutland and Sumit.
> >
> >  drivers/irqchip/irq-gic-v3.c | 54 
> >  1 file changed, 36 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > index eedfa8e9f077..49d18cf3f636 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -78,6 +78,8 @@ static DEFINE_STATIC_KEY_TRUE(supports_deactivate_key);
> >  #define GIC_LINE_NR  min(GICD_TYPER_SPIS(gic_data.rdists.gicd_typer), 
> > 1020U)
> >  #define GIC_ESPI_NR  GICD_TYPER_ESPIS(gic_data.rdists.gicd_typer)
> >
> > +#define SGI_NR   16
>
> Why 16? We only allocate 8, as the other 8 are potentially stolen by
> the secure side. We do try and initialise them all so that they have a
> known state if they were actually configured as Group-1NS, but we
> don't use them.
>
> I understand that this simplifies the indexing in the rdist_nmi_refs
> array and I'm not going to cry over 32 wasted bytes, but this
> definitely deserves a comment.

Good point. I'll plan to wait another day or two to see if any other
feedback shows up and then send a v12 with this plus Stephen's nit
fixes on one of the other patches.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v11 4/6] arm64: smp: Add arch support for backtrace using pseudo-NMI

2023-08-25 Thread Doug Anderson
Hi,

On Fri, Aug 25, 2023 at 3:27 PM Stephen Boyd  wrote:
>
> Quoting Douglas Anderson (2023-08-24 08:30:30)
> > diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> > index fac08e18bcd5..50ce8b697ff3 100644
> > --- a/arch/arm64/include/asm/irq.h
> > +++ b/arch/arm64/include/asm/irq.h
> > @@ -6,6 +6,9 @@
> >
> >  #include 
> >
> > +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int 
> > exclude_cpu);
>
> Some nits, but otherwise
>
> Reviewed-by: Stephen Boyd 
>
> > +#define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
> > +
> >  struct pt_regs;
> >
> >  int set_handle_irq(void (*handle_irq)(struct pt_regs *));
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index a5848f1ef817..c8896cbc5327 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -72,12 +73,18 @@ enum ipi_msg_type {
> > IPI_CPU_CRASH_STOP,
> > IPI_TIMER,
> > IPI_IRQ_WORK,
> > -   NR_IPI
> > +   NR_IPI,
> > +   /*
> > +* Any enum >= NR_IPI and < MAX_IPI is special and not tracable
> > +* with trace_ipi_*
> > +*/
> > +   IPI_CPU_BACKTRACE = NR_IPI,
> > +   MAX_IPI
> >  };
> >
> >  static int ipi_irq_base __read_mostly;
> >  static int nr_ipi __read_mostly = NR_IPI;
> > -static struct irq_desc *ipi_desc[NR_IPI] __read_mostly;
> > +static struct irq_desc *ipi_desc[MAX_IPI] __read_mostly;
>
> Side note: it would be nice to mark ipi_desc as __ro_after_init. Same
> for nr_ipi and ipi_irq_base.

I'd rather not change it in this patch since it's a pre-existing and
separate issue, but I can add a patch to the end of the series for
that if I end up spinning it. Otherwise I can send a follow-up patch
for it.


> >  static void ipi_setup(int cpu);
> >
> > @@ -845,6 +852,22 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int 
> > cpu, struct pt_regs *regs
> >  #endif
> >  }
> >
> > +static void arm64_backtrace_ipi(cpumask_t *mask)
> > +{
> > +   __ipi_send_mask(ipi_desc[IPI_CPU_BACKTRACE], mask);
> > +}
> > +
> > +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
>
> Can this be 'bool exclude_self' instead of int? That matches all other
> implementations from what I can tell.

Nope. See the part of the commit message that says:

This patch depends on commit 36759e343ff9 ("nmi_backtrace: allow
excluding an arbitrary CPU") since that commit changed the prototype
of arch_trigger_cpumask_backtrace(), which this patch implements.

> > +{
> > +   /*
> > +* NOTE: though nmi_trigger_cpumask_backtrace has "nmi_" in the 
> > name,
>
> USe nmi_trigger_cpumask_backtrace() to indicate function.

I won't plan on doing an immediate spin for this and for now I'll wait
for additional feedback. If a maintainer is getting ready to land
this, I'm happy to post a new version with this fix or also happy if a
maintainer wants to add the "()" while applying.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kgdb: Flush console before entering kgdb on panic

2023-08-25 Thread Doug Anderson
Hi,

On Fri, Aug 25, 2023 at 3:09 AM Daniel Thompson
 wrote:
>
> On Tue, Aug 22, 2023 at 01:19:46PM -0700, Douglas Anderson wrote:
> > When entering kdb/kgdb on a kernel panic, it was be observed that the
> > console isn't flushed before the `kdb` prompt came up. Specifically,
> > when using the buddy lockup detector on arm64 and running:
> >   echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> >
> > I could see:
> >   [   26.161099] lkdtm: Performing direct entry HARDLOCKUP
> >   [   32.499881] watchdog: Watchdog detected hard LOCKUP on cpu 6
> >   [   32.552865] Sending NMI from CPU 5 to CPUs 6:
> >   [   32.557359] NMI backtrace for cpu 6
> >   ... [backtrace for cpu 6] ...
> >   [   32.558353] NMI backtrace for cpu 5
> >   ... [backtrace for cpu 5] ...
> >   [   32.867471] Sending NMI from CPU 5 to CPUs 0-4,7:
> >   [   32.872321] NMI backtrace forP cpuANC: Hard LOCKUP
> >
> >   Entering kdb (current=..., pid 0) on processor 5 due to Keyboard Entry
> >   [5]kdb>
> >
> > As you can see, backtraces for the other CPUs start printing and get
> > interleaved with the kdb PANIC print.
> >
> > Let's replicate the commands to flush the console in the kdb panic
> > entry point to avoid this.
> >
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> >  kernel/debug/debug_core.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index d5e9ccde3ab8..3a904d8697c8 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -1006,6 +1006,9 @@ void kgdb_panic(const char *msg)
> >   if (panic_timeout)
> >   return;
> >
> > + debug_locks_off();
> > + console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> > +
> >   if (dbg_kdb_mode)
> >   kdb_printf("PANIC: %s\n", msg);
>
> I'm somewhat included to say *this* (calling kdb_printf() when not
> actually in the debugger) is the cause of the problem. kdb_printf()
> does some pretty horid things to the console and isn't intended to
> run while the system is active.
>
> I'd therefore be more tempted to defer the print to the b.p. trap
> handler itself and make this part of kgdb_panic() look more like:
>
> kgdb_panic_msg = msg;
> kgdb_breakpoint();
> kgdb_panic_msg = NULL;

Unfortunately I think that only solves half the problem. As a quick
test, I tried simply commenting out the "kdb_printf" line in
kgdb_panic(). While that avoids the interleaved panic message and
backtrace, it does nothing to actually get the backtraces printed out
before you end up in kdb. As an example, this is what happened when I
used `echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT` and
had the "kdb_printf" in kgdb_panic() commented out:

[   72.658424] lkdtm: Performing direct entry HARDLOCKUP
[   82.181857] watchdog: Watchdog detected hard LOCKUP on cpu 6
...
[   82.234801] Sending NMI from CPU 5 to CPUs 6:
[   82.239296] NMI backtrace for cpu 6
... [ stack trace for CPU 6 ] ...
[   82.240294] NMI backtrace for cpu 5
... [ stack trace for CPU 5 ] ...
[   82.576443] Sending NMI from CPU 5 to CPUs 0-4,7:
[   82.581291] NMI backtrace
Entering kdb (current=0xff80da5a1080, pid 6978) on processor 5 due
to Keyboard Entry
[5]kdb>

As you can see, I don't see the traces for CPUs 0-4 and 7. Those do
show up if I use the "dmesg" command but it's a bit of a hassle to run
"dmesg" to look for any extra debug messages every time I drop in kdb.

I guess perhaps that part isn't obvious from the commit message?
Should I send a new version with an updated commit message indicating
that it's not just the jumbled text that's a problem but also the lack
of stack traces?

Thanks!

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v9 5/7] arm64: ipi_debug: Add support for backtrace using the debug IPI

2023-08-21 Thread Doug Anderson
Hi,

On Mon, Aug 7, 2023 at 3:23 AM Mark Rutland  wrote:
>
> On Thu, Jun 01, 2023 at 02:31:49PM -0700, Douglas Anderson wrote:
> > From: Sumit Garg 
> >
> > Enable arch_trigger_cpumask_backtrace() support on arm64 using the new
> > debug IPI. With this arm64 can now get backtraces in cases where
> > callers of the trigger_xyz_backtrace() class of functions don't check
> > the return code and implement a fallback. One example is
> > `kernel.softlockup_all_cpu_backtrace`. This also allows us to
> > backtrace hard locked up CPUs in cases where the debug IPI is backed
> > by an NMI (or pseudo NMI).
> >
> > Signed-off-by: Sumit Garg 
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> > Changes in v9:
> > - Added comments that we might not be using NMI always.
> > - Renamed "NMI IPI" to "debug IPI" since it might not be backed by NMI.
> > - arch_trigger_cpumask_backtrace() no longer returns bool
> >
> > Changes in v8:
> > - Removed "#ifdef CONFIG_SMP" since arm64 is always SMP
> >
> >  arch/arm64/include/asm/irq.h  |  3 +++
> >  arch/arm64/kernel/ipi_debug.c | 25 +++--
> >  2 files changed, 26 insertions(+), 2 deletions(-)
>
> As with prior patches, I'd prefer if this lived in smp.c with the other IPI
> logic.
>
> >
> > diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> > index fac08e18bcd5..be2d103f316e 100644
> > --- a/arch/arm64/include/asm/irq.h
> > +++ b/arch/arm64/include/asm/irq.h
> > @@ -6,6 +6,9 @@
> >
> >  #include 
> >
> > +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, bool 
> > exclude_self);
> > +#define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace
> > +
> >  struct pt_regs;
> >
> >  int set_handle_irq(void (*handle_irq)(struct pt_regs *));
> > diff --git a/arch/arm64/kernel/ipi_debug.c b/arch/arm64/kernel/ipi_debug.c
> > index b57833e31eaf..6984ed507e1f 100644
> > --- a/arch/arm64/kernel/ipi_debug.c
> > +++ b/arch/arm64/kernel/ipi_debug.c
> > @@ -8,6 +8,7 @@
> >
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  #include "ipi_debug.h"
> > @@ -24,11 +25,31 @@ void arm64_debug_ipi(cpumask_t *mask)
> >   __ipi_send_mask(ipi_debug_desc, mask);
> >  }
> >
> > +void arch_trigger_cpumask_backtrace(const cpumask_t *mask, bool 
> > exclude_self)
> > +{
> > + /*
> > +  * NOTE: though nmi_trigger_cpumask_backtrace has "nmi_" in the name,
> > +  * nothing about it truly needs to be backed by an NMI, it's just that
> > +  * it's _allowed_ to be called from an NMI. If set_smp_debug_ipi()
> > +  * failed to get an NMI (or pseudo-NMI) this will just be backed by a
> > +  * regular IPI.
> > +  */
> > + nmi_trigger_cpumask_backtrace(mask, exclude_self, arm64_debug_ipi);
> > +}
> > +
> >  static irqreturn_t ipi_debug_handler(int irq, void *data)
> >  {
> > - /* nop, NMI handlers for special features can be added here. */
> > + irqreturn_t ret = IRQ_NONE;
> > +
> > + /*
> > +  * NOTE: Just like in arch_trigger_cpumask_backtrace(), we're calling
> > +  * a function with "nmi_" in the name but it works fine even if we
> > +  * are using a regulaor IPI.
> > +  */
> > + if (nmi_cpu_backtrace(get_irq_regs()))
> > + ret = IRQ_HANDLED;
> >
>
> Does this need the printk_deferred_{enter,exit}() that 32-bit arm has?

I don't _think_ so, but I also don't _think_ it's needed for arm32.
;-) Let me explain my logic and you can tell me if it sounds right to
you.

If we're doing the backtrace in pseudo-NMI context then we definitely
don't need it. Specifically, the printk_deferred_{enter,exit}() just
manages the per-cpu "printk_context" value. That value doesn't matter
if "in_nmi()" returns true.

If we're _not_ doing the backtrace in pseudo-NMI context then I think
we also don't need it. After all, if we're not in pseudo-NMI context
then it's perfectly fine to print, right?

While it's hard to know 100% for sure, my best guess is that in arm
this might have helped prevent stack traces from getting interspersed
among one another. I guess this is no longer needed as of commit
55d6af1d6688 ("lib/nmi_backtrace: explicitly serialize banner and
regs")? In any case, when I tested this earlier things seemed to
printout fine without it...

That being said, it wouldn't hurt to include it here and I'll do it if you want.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v9 3/7] arm64: Add framework for a debug IPI

2023-08-21 Thread Doug Anderson
Hi,

On Mon, Aug 7, 2023 at 3:12 AM Mark Rutland  wrote:
>
> On Thu, Jun 01, 2023 at 02:31:47PM -0700, Douglas Anderson wrote:
> > From: Sumit Garg 
> >
> > Introduce a framework for an IPI that will be used for debug
> > purposes. The primary use case of this IPI will be to generate stack
> > crawls on other CPUs, but it will also be used to round up CPUs for
> > kgdb.
> >
> > When possible, we try to allocate this debug IPI as an NMI (or a
> > pseudo NMI). If that fails (due to CONFIG, an incompatible interrupt
> > controller, a quirk, missing the "irqchip.gicv3_pseudo_nmi=1" kernel
> > parameter, etc) we fall back to a normal IPI.
> >
> > NOTE: hooking this up for CPU backtrace / kgdb will happen in a future
> > patch, this just adds the framework.
> >
> > Signed-off-by: Sumit Garg 
> > Signed-off-by: Douglas Anderson 
>
> I think that we shouldn't add a framework in a separate file for this:
>
> * This is very similar to our existing IPI management in smp.c, so it feels
>   like duplication, or at least another thing we'd like to keep in-sync.
>
> * We're going to want an NMI backtrace regardless of KGDB
>
> * We're going to want the IPI_CPU_STOP and IPI_CRASH_CPU_STOP IPIs to be NMIs
>   too.
>
> I reckon it'd be better to extend the existing IPI logic in smp.c to allow 
> IPIs
> to be requested as NMIs, e.g.
>
> 
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index edd63894d61e8..48e6aa62c473e 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -926,6 +927,21 @@ static void smp_cross_call(const struct cpumask *target, 
> unsigned int ipinr)
> __ipi_send_mask(ipi_desc[ipinr], target);
>  }
>
> +static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
> +{
> +   if (!system_uses_irq_prio_masking())
> +   return false;
> +
> +   switch (ipi) {
> +   /*
> +* TODO: select NMI IPIs here
> +*/
> +   return true;
> +   default:
> +   return false;
> +   }
> +}
> +
>  static void ipi_setup(int cpu)
>  {
> int i;
> @@ -933,8 +949,14 @@ static void ipi_setup(int cpu)
> if (WARN_ON_ONCE(!ipi_irq_base))
> return;
>
> -   for (i = 0; i < nr_ipi; i++)
> -   enable_percpu_irq(ipi_irq_base + i, 0);
> +   for (i = 0; i < nr_ipi; i++) {
> +   if (ipi_should_be_nmi(i)) {
> +   prepare_percpu_nmi(ipi_irq_base + i);
> +   enable_percpu_nmi(ipi_irq_base + i, 0);
> +   } else {
> +   enable_percpu_irq(ipi_irq_base + i, 0);
> +   }
> +   }
>  }
>
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -945,8 +967,14 @@ static void ipi_teardown(int cpu)
> if (WARN_ON_ONCE(!ipi_irq_base))
> return;
>
> -   for (i = 0; i < nr_ipi; i++)
> -   disable_percpu_irq(ipi_irq_base + i);
> +   for (i = 0; i < nr_ipi; i++) {
> +   if (ipi_should_be_nmi(i)) {
> +   disable_percpu_nmi(ipi_irq_base + i);
> +   teardown_percpu_nmi(ipi_irq_base + i);
> +   } else {
> +   disable_percpu_irq(ipi_irq_base + i);
> +   }
> +   }
>  }
>  #endif
>
> @@ -958,11 +986,19 @@ void __init set_smp_ipi_range(int ipi_base, int n)
> nr_ipi = min(n, NR_IPI);
>
> for (i = 0; i < nr_ipi; i++) {
> -   int err;
> -
> -   err = request_percpu_irq(ipi_base + i, ipi_handler,
> -"IPI", _number);
> -   WARN_ON(err);
> +   int err = -EINVAL;
> +
> +   if (ipi_should_be_nmi(i)) {
> +   err = request_percpu_nmi(ipi_base + i, ipi_handler,
> +"IPI", _number);
> +   WARN(err, "Could not request IPI %d as NMI, err=%d\n",
> +i, err);
> +   } else {
> +   err = request_percpu_irq(ipi_base + i, ipi_handler,
> +"IPI", _number);
> +   WARN(err, "Could not request IPI %d as IRQ, err=%d\n",
> +i, err);
> +   }
>
> ipi_desc[i] = irq_to_desc(ipi_base + i);
> irq_set_status_flags(ipi_base + i, IRQ_HIDDEN);
> 
>
> ... and then if we need an IPI for KGDB, we can add that to the existing list
> of IPIs, and have it requested/enabled/disabled as usual.

Sounds good. I'm starting to work on v10 incorporating your feedback.
A few quick questions:

1. If I mostly take your patch above verbatim, do you have any
suggested tags for Author/Signed-off-by? I'd tend to set you as the
author but I can't do that because you didn't provide a
Signed-off-by...

2. Would you prefer this patch on its own, or would 

Re: [Kgdb-bugreport] [PATCH v9 0/7] arm64: Add debug IPI for backtraces / kgdb; try to use NMI for it

2023-07-24 Thread Doug Anderson
Hi folks,

On Thu, Jun 1, 2023 at 2:37 PM Douglas Anderson  wrote:
>
> This is an attempt to resurrect Sumit's old patch series [1] that
> allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> could find was v7, so I started my series at v8. I haven't copied all
> of his old changelongs here, but you can find them from the link.
>
> I'm really looking for a way to land this patch series. In response to
> v8, Mark Rutland indicated [2] that he was worried about the soundness
> of pseudo NMI. Those definitely need to get fixed, but IMO this patch
> series could still land in the meantime. That would at least let
> people test with it.
>
> Request for anyone reading this: please help indicate your support of
> this patch series landing by replying, even if you don't have the
> background for a full review. My suspicion is that there are a lot of
> people who agree that this would be super useful to get landed.
>
> Since v8, I have cleaned up this patch series by integrating the 10th
> patch from v8 [3] into the whole series. As part of this, I renamed
> the "NMI IPI" to the "debug IPI" since it could now be backed by a
> regular IPI in the case that pseudo NMIs weren't available. With the
> fallback, this allowed me to drop some extra patches from the
> series. This feels (to me) to be pretty clean and hopefully others
> agree. Any patch I touched significantly I removed Masayoshi and
> Chen-Yu's tags from.
>
> ...also in v8, I reorderd the patches a bit in a way that seemed a
> little cleaner to me.
>
> Since v7, I have:
> * Addressed the small amount of feedback that was there for v7.
> * Rebased.
> * Added a new patch that prevents us from spamming the logs with idle
>   tasks.
> * Added an extra patch to gracefully fall back to regular IPIs if
>   pseudo-NMIs aren't there.
>
> It can be noted that this patch series works very well with the recent
> "hardlockup" patches that have landed through Andrew Morton's tree and
> are currently in linuxnext. It works especially well with the "buddy"
> lockup detector.
>
> [1] 
> https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.g...@linaro.org/
> [2] 
> https://lore.kernel.org/lkml/zfvgqd%2f%2fpm%2flz...@fvff77s0q05n.cambridge.arm.com/
> [3] 
> https://lore.kernel.org/r/20230419155341.v8.10.Ic3659997d6243139d0522fc3afcdfd88d7a5f030@changeid/
>
> Changes in v9:
> - Add a warning if we don't have enough IPIs for the NMI IPI
> - Added comments that we might not be using NMI always.
> - Added missing "inline"
> - Added to commit message that this doesn't catch all cases.
> - Fold in v8 patch #10 ("Fallback to a regular IPI if NMI isn't enabled")
> - Moved header file out of "include" since it didn't need to be there.
> - Remove arm64_supports_nmi()
> - Remove fallback for when debug IPI isn't available.
> - Renamed "NMI IPI" to "debug IPI" since it might not be backed by NMI.
> - Update commit description
> - arch_trigger_cpumask_backtrace() no longer returns bool
>
> Changes in v8:
> - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
> - "Tag the arm64 idle functions as __cpuidle" new for v8
> - Removed "#ifdef CONFIG_SMP" since arm64 is always SMP
> - debug_ipi_setup() and debug_ipi_teardown() no longer take cpu param
>
> Douglas Anderson (2):
>   arm64: idle: Tag the arm64 idle functions as __cpuidle
>   kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
>
> Sumit Garg (5):
>   irqchip/gic-v3: Enable support for SGIs to act as NMIs
>   arm64: Add framework for a debug IPI
>   arm64: smp: Assign and setup the debug IPI
>   arm64: ipi_debug: Add support for backtrace using the debug IPI
>   arm64: kgdb: Roundup cpus using the debug IPI
>
>  arch/arm64/include/asm/irq.h  |   3 +
>  arch/arm64/kernel/Makefile|   2 +-
>  arch/arm64/kernel/idle.c  |   4 +-
>  arch/arm64/kernel/ipi_debug.c | 102 ++
>  arch/arm64/kernel/ipi_debug.h |  13 +
>  arch/arm64/kernel/kgdb.c  |  14 +
>  arch/arm64/kernel/smp.c   |  11 
>  drivers/irqchip/irq-gic-v3.c  |  29 +++---
>  include/linux/kgdb.h  |   1 +
>  9 files changed, 168 insertions(+), 11 deletions(-)

I'm looking for some ideas on what to do to move this patch series
forward. Thanks to Daniel, the kgdb patch is now in Linus's tree which
hopefully makes this simpler to land. I guess there is still the
irqchip dependency that will need to be sorted out, though...

Even if folks aren't in agreement about whether this is ready to be
enabled in production, I don't think anything here is super
objectionable or controversial, is it? Can we land it? If you feel
like it needs extra review, would it help if I tried to drum up some
extra people to provide review feedback?

Also: in case it's interesting to anyone, I've been doing benchmarks
on sc7180-trogdor devices in preparation for enabling this. On that
platform, I did manage to see 

Re: [Kgdb-bugreport] [PATCH v2 6/6] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-07-01 Thread Doug Anderson
Hi,

On Sat, Jul 1, 2023 at 7:40 AM Guenter Roeck  wrote:
>
> On Fri, Jun 16, 2023 at 05:06:18PM +0200, Petr Mladek wrote:
> > The HAVE_ prefix means that the code could be enabled. Add another
> > variable for HAVE_HARDLOCKUP_DETECTOR_ARCH without this prefix.
> > It will be set when it should be built. It will make it compatible
> > with the other hardlockup detectors.
> >
> > The change allows to clean up dependencies of PPC_WATCHDOG
> > and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc.
> >
> > As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies
> > on arm, x86, powerpc architectures.
> >
> > Signed-off-by: Petr Mladek 
> > Reviewed-by: Douglas Anderson 
> > ---
> ...
> > --- a/include/linux/nmi.h
> > +++ b/include/linux/nmi.h
> > @@ -9,7 +9,7 @@
> >  #include 
> >
> >  /* Arch specific watchdogs might need to share extra watchdog-related 
> > APIs. */
> > -#if defined(CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH) || 
> > defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
> > +#if defined(CONFIG_HARDLOCKUP_DETECTOR_ARCH) || 
> > defined(CONFIG_HARDLOCKUP_DETECTOR_SPARC64)
>
> This results in:
>
> arch/powerpc/platforms/pseries/mobility.c: In function 
> 'pseries_migrate_partition':
> arch/powerpc/platforms/pseries/mobility.c:753:17: error: implicit declaration 
> of function 'watchdog_hardlockup_set_timeout_pct'; did you mean 
> 'watchdog_hardlockup_stop'? [-Werror=implicit-function-declaration]
>   753 | watchdog_hardlockup_set_timeout_pct(factor);
>
> with ppc64_defconfig -CONFIG_HARDLOCKUP_DETECTOR, because the dummy
> for watchdog_hardlockup_set_timeout_pct() is still defined in
> arch/powerpc/include/asm/nmi.h which is no longer included.

Can you test with:

https://lore.kernel.org/r/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid

Thanks!


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: move kdb_send_sig() declaration to a better header file

2023-06-30 Thread Doug Anderson
Hi,

On Fri, Jun 30, 2023 at 1:12 PM Daniel Thompson
 wrote:
>
> kdb_send_sig() is defined in the signal code and called from kdb,
> but the declaration is part of the kdb internal code.
> Move the declaration to the shared header to avoid the warning:
>
> kernel/signal.c:4789:6: error: no previous prototype for 'kdb_send_sig' 
> [-Werror=missing-prototypes]
>
> Reported-by: Arnd Bergmann 

FWIW: these days, I think checkpatch yells at you for a bare
"Reported-by" like above. It ideally wants a "Closes" tag to follow
immediately with a link to the report, or in the very least a "Link"
tag if this doesn't fully address the issue.

> Signed-off-by: Daniel Thompson 

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: Handle LF in the command parser

2023-06-29 Thread Doug Anderson
Hi,

On Thu, Jun 29, 2023 at 9:48 AM Daniel Thompson
 wrote:
>
> On Wed, Jun 28, 2023 at 12:56:17PM -0700, Douglas Anderson wrote:
> > The main kdb command parser only handles CR (ASCII 13 AKA '\r') today,
> > but not LF (ASCII 10 AKA '\n'). That means that the kdb command parser
> > can handle terminals that send just CR or that send CR+LF but can't
> > handle terminals that send just LF.
> >
> > The fact that kdb didn't handle LF in the command parser tripped up a
> > tool I tried to use with it. Specifically, I was trying to send a
> > command to my device to resume it from kdb using a ChromeOS tool like:
> >   dut-control cpu_uart_cmd:"g"
> > That tool only terminates lines with LF, not CR+LF.
> >
> > Arguably the ChromeOS tool should be fixed. After all, officially kdb
> > seems to be designed such that CR+LF is the official line ending
> > transmitted over the wire and that internally a line ending is just
> > '\n' (LF). Some evidence:
> > * uart_poll_put_char(), which is used by kdb, notices a '\n' and
> >   converts it to '\r\n'.
> > * kdb functions specifically use '\r' to get a carriage return without
> >   a newline. You can see this in the pager where kdb will write a '\r'
> >   and then write over the pager prompt.
>
> I'd view this as simply "what you have to do drive a terminal correctly"
> rather than indicating what the official newline protocol for kdb is.
> With a human and a terminal emulator I would expect the typical input to
> be CR-only (and that's what I setup the test suite to send ;-)).

Fair enough. I haven't done massive amounts of serial communications
across lots of platforms, but I do remember the history of line
endings in text files and so I wanted to document my findings a bit.
;-)


> > However, all that being said there's no real harm in accepting LF as a
> > command terminator in the kdb parser and doing so seems like it would
> > improve compatibility. After this, I'd expect that things would work
> > OK-ish with a remote terminal that used any of CR, CR+LF, or LF as a
> > line ending. Someone using CR as a line ending might get some ugliness
> > where kdb wasn't able to overwrite the last line, but basic commands
> > would work. Someone using just LF as a line ending would probably also
> > work OK.
> >
> > A few other notes:
> > - It can be noted that "bash" running on an "agetty" handles LF as a
> >   line termination with no complaints.
> > - Historically, kdb's "pager" actually handled either CR or LF fine. A
> >   very quick inspection would make one think that kdb's pager actually
> >   could have paged down two lines instead of one for anyone using
> >   CR+LF, but this is generally avoided because of kdb_input_flush().
>
> These are very convincing though!
>
> > - Conceivably one could argue that some of this special case logic
> >   belongs in uart_poll_get_char() since uart_poll_put_char() handles
> >   the '\n' => '\r\n' conversion. I would argue that perhaps we should
> >   eventually do the opposite and move the '\n' => '\r\n' out of
> >   uart_poll_put_char(). Having that conversion at such a low level
> >   could interfere if we ever want to transfer binary data. In
> >   addition, if we truly made uart_poll_get_char() the inverse of
> >   uart_poll_put_char() it would convert back to '\n' and (ironically)
> >   kdb's parser currently only looks for '\r' to find the end of a
> >   command.
> >
> > Signed-off-by: Douglas Anderson 
>
> This arrived just as I was gathering up the patches (I know... running
> late). I've added a couple of cases to the test suite to cover the
> new feature.
>
> The code looks good to me. Are you happy for me to take it this
> merge cycle?

Yeah, it should be OK. It's really pretty straightforward. Thanks!

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 6/6] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-21 Thread Doug Anderson
Hi,

On Wed, Jun 21, 2023 at 6:08 AM Michael Ellerman  wrote:
>
> Petr Mladek  writes:
> > The HAVE_ prefix means that the code could be enabled. Add another
> > variable for HAVE_HARDLOCKUP_DETECTOR_ARCH without this prefix.
> > It will be set when it should be built. It will make it compatible
> > with the other hardlockup detectors.
> >
> > The change allows to clean up dependencies of PPC_WATCHDOG
> > and HAVE_HARDLOCKUP_DETECTOR_PERF definitions for powerpc.
> >
> > As a result HAVE_HARDLOCKUP_DETECTOR_PERF has the same dependencies
> > on arm, x86, powerpc architectures.
> >
> > Signed-off-by: Petr Mladek 
> > Reviewed-by: Douglas Anderson 
> > ---
> >  arch/powerpc/Kconfig | 5 ++---
> >  include/linux/nmi.h  | 2 +-
> >  lib/Kconfig.debug| 9 +
> >  3 files changed, 12 insertions(+), 4 deletions(-)
>
> Something in this patch is breaking the powerpc g5_defconfig, I don't
> immediately see what though.
>
> ../arch/powerpc/kernel/stacktrace.c: In function ‘handle_backtrace_ipi’:
> ../arch/powerpc/kernel/stacktrace.c:171:9: error: implicit declaration of 
> function ‘nmi_cpu_backtrace’ [-Werror=implicit-function-declaration]
>   171 | nmi_cpu_backtrace(regs);
>   | ^
> ../arch/powerpc/kernel/stacktrace.c: In function 
> ‘arch_trigger_cpumask_backtrace’:
> ../arch/powerpc/kernel/stacktrace.c:226:9: error: implicit declaration of 
> function ‘nmi_trigger_cpumask_backtrace’; did you mean 
> ‘arch_trigger_cpumask_backtrace’? [-Werror=implicit-function-declaration]
>   226 | nmi_trigger_cpumask_backtrace(mask, exclude_self, 
> raise_backtrace_ipi);
>   | ^
>   | arch_trigger_cpumask_backtrace
> cc1: all warnings being treated as errors

Yeah, I can reproduce that.

The problem is that before ${SUBJECT} patch "include/linux/nmi.h"
would include . Now it won't.

There are a ton of different ways to fix this, but I think the one
that makes sense is to be consistent with other architectures and move
the "arch_trigger_cpumask_backtrace" definitions to asm/irq.h.

https://lore.kernel.org/r/20230621164809.1.Ice67126857506712559078e7de26d32d26e64631@changeid

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 4/6] watchdog/hardlockup: Make HAVE_NMI_WATCHDOG sparc64-specific

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek  wrote:
>
> There are several hardlockup detector implementations and several Kconfig
> values which allow selection and build of the preferred one.
>
> CONFIG_HARDLOCKUP_DETECTOR was introduced by the commit 23637d477c1f53acb
> ("lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR") in v2.6.36.
> It was a preparation step for introducing the new generic perf hardlockup
> detector.
>
> The existing arch-specific variants did not support the to-be-created
> generic build configurations, sysctl interface, etc. This distinction
> was made explicit by the commit 4a7863cc2eb5f98 ("x86, nmi_watchdog:
> Remove ARCH_HAS_NMI_WATCHDOG and rely on CONFIG_HARDLOCKUP_DETECTOR")
> in v2.6.38.
>
> CONFIG_HAVE_NMI_WATCHDOG was introduced by the commit d314d74c695f967e105
> ("nmi watchdog: do not use cpp symbol in Kconfig") in v3.4-rc1. It replaced
> the above mentioned ARCH_HAS_NMI_WATCHDOG. At that time, it was still used
> by three architectures, namely blackfin, mn10300, and sparc.
>
> The support for blackfin and mn10300 architectures has been completely
> dropped some time ago. And sparc is the only architecture with the historic
> NMI watchdog at the moment.
>
> And the old sparc implementation is really special. It is always built on
> sparc64. It used to be always enabled until the commit 7a5c8b57cec93196b
> ("sparc: implement watchdog_nmi_enable and watchdog_nmi_disable") added
> in v4.10-rc1.
>
> There are only few locations where the sparc64 NMI watchdog interacts
> with the generic hardlockup detectors code:
>
>   + implements arch_touch_nmi_watchdog() which is called from the generic
> touch_nmi_watchdog()
>
>   + implements watchdog_hardlockup_enable()/disable() to support
> /proc/sys/kernel/nmi_watchdog
>
>   + is always preferred over other generic watchdogs, see
> CONFIG_HARDLOCKUP_DETECTOR
>
>   + includes asm/nmi.h into linux/nmi.h because some sparc-specific
> functions are needed in sparc-specific code which includes
> only linux/nmi.h.
>
> The situation became more complicated after the commit 05a4a95279311c3
> ("kernel/watchdog: split up config options") and commit 2104180a53698df5
> ("powerpc/64s: implement arch-specific hardlockup watchdog") in v4.13-rc1.
> They introduced HAVE_HARDLOCKUP_DETECTOR_ARCH. It was used for powerpc
> specific hardlockup detector. It was compatible with the perf one
> regarding the general boot, sysctl, and programming interfaces.
>
> HAVE_HARDLOCKUP_DETECTOR_ARCH was defined as a superset of
> HAVE_NMI_WATCHDOG. It made some sense because all arch-specific
> detectors had some common requirements, namely:
>
>   + implemented arch_touch_nmi_watchdog()
>   + included asm/nmi.h into linux/nmi.h
>   + defined the default value for /proc/sys/kernel/nmi_watchdog
>
> But it actually has made things pretty complicated when the generic
> buddy hardlockup detector was added. Before the generic perf detector
> was newer supported together with an arch-specific one. But the buddy
> detector could work on any SMP system. It means that an architecture
> could support both the arch-specific and buddy detector.
>
> As a result, there are few tricky dependencies. For example,
> CONFIG_HARDLOCKUP_DETECTOR depends on:
>
>   ((HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_BUDDY) && 
> !HAVE_NMI_WATCHDOG) || HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> The problem is that the very special sparc implementation is defined as:
>
>   HAVE_NMI_WATCHDOG && !HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> Another problem is that the meaning of HAVE_NMI_WATCHDOG is far from clear
> without reading understanding the history.
>
> Make the logic less tricky and more self-explanatory by making
> HAVE_NMI_WATCHDOG specific for the sparc64 implementation. And rename it to
> HAVE_HARDLOCKUP_DETECTOR_SPARC64.
>
> Note that HARDLOCKUP_DETECTOR_PREFER_BUDDY, HARDLOCKUP_DETECTOR_PERF,
> and HARDLOCKUP_DETECTOR_BUDDY may conflict only with
> HAVE_HARDLOCKUP_DETECTOR_ARCH. They depend on HARDLOCKUP_DETECTOR
> and it is not longer enabled when HAVE_NMI_WATCHDOG is set.
>
> Signed-off-by: Petr Mladek 
>
> watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64
>
> The configuration variable HAVE_NMI_WATCHDOG has a generic name but
> it is selected only for SPARC64.
>
> It should _not_ be used in general because it is not integrated with
> the other hardlockup detectors. Namely, it does not support the hardlockup
> specific command line parameters and systcl interface. Instead, it is
> enabled/disabled together with the softlockup detector by the global
> "watchdog" sysctl.
>
> Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special
> behavior more clear.
>
> Also the variable is set only on sparc64. Move the definition
> from arch/Kconfig to arch/sparc/Kconfig.debug.
>
> Signed-off-by: Petr Mladek 

I think you goofed up when squashing the patches. You've now got a
second patch subject after your first Signed-off-by and 

Re: [Kgdb-bugreport] [PATCH v2 1/6] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:06 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> Only one hardlockup detector can be compiled in. The selection is done
> using quite complex dependencies between several CONFIG variables.
> The following patches will try to make it more straightforward.
>
> As a first step, reorder the definitions of the various CONFIG variables.
> The logical order is:
>
>1. HAVE_* variables define available variants. They are typically
>   defined in the arch/ config files.
>
>2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
>   detector is enabled at all.
>
>3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
>   the buddy detector should be preferred over the perf one.
>   Note that the arch specific variants are always preferred when
>   available.
>
>4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
>   detector is enabled in the end.
>
>5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
>   are temporary variables that are going to be removed in
>   a followup patch.
>
> This is a preparation step for further cleanup. It will change the logic
> without shuffling the definitions.
>
> This change temporary breaks the C-like ordering where the variables are
> declared or defined before they are used. It is not really needed for
> Kconfig. Also the following patches will rework the logic so that
> the ordering will be C-like in the end.
>
> The patch just shuffles the definitions. It should not change the existing
> behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  lib/Kconfig.debug | 112 +++---
>  1 file changed, 56 insertions(+), 56 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2 2/6] watchdog/hardlockup: Make the config checks more straightforward

2023-06-16 Thread Doug Anderson
Hi,

On Fri, Jun 16, 2023 at 8:07 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> The check for the sparc64 variant is more complicated because
> HAVE_NMI_WATCHDOG is used to #ifdef code used by both arch-specific
> and sparc64 specific variant. Therefore it is automatically
> selected with HAVE_HARDLOCKUP_DETECTOR_ARCH.
>
> This complexity is partly hidden in HAVE_HARDLOCKUP_DETECTOR_NON_ARCH.
> It reduces the size of some checks but it makes them harder to follow.
>
> Finally, the other temporary variable HARDLOCKUP_DETECTOR_NON_ARCH
> is used to re-compute HARDLOCKUP_DETECTOR_PERF/BUDDY when the global
> HARDLOCKUP_DETECTOR switch is enabled/disabled.
>
> Make the logic more straightforward by the following changes:
>
>   + Better explain the role of HAVE_HARDLOCKUP_DETECTOR_ARCH and
> HAVE_NMI_WATCHDOG in comments.
>
>   + Add HAVE_HARDLOCKUP_DETECTOR_BUDDY so that there is separate
> HAVE_* for all four hardlockup detector variants.
>
> Use it in the other conditions instead of SMP. It makes it
> clear that it is about the buddy detector.
>
>   + Open code HAVE_HARDLOCKUP_DETECTOR_NON_ARCH in HARDLOCKUP_DETECTOR
> and HARDLOCKUP_DETECTOR_PREFER_BUDDY. It helps to understand
> the conditions between the four hardlockup detector variants.
>
>   + Define the exact conditions when HARDLOCKUP_DETECTOR_PERF/BUDDY
> can be enabled. It explains the dependency on the other
> hardlockup detector variants.
>
> Also it allows to remove HARDLOCKUP_DETECTOR_NON_ARCH by using "imply".
> It triggers re-evaluating HARDLOCKUP_DETECTOR_PERF/BUDDY when
> the global HARDLOCKUP_DETECTOR switch is changed.
>
>   + Add dependency on HARDLOCKUP_DETECTOR so that the affected variables
> disappear when the hardlockup detectors are disabled.
>
> Another nice side effect is that HARDLOCKUP_DETECTOR_PREFER_BUDDY
> value is not preserved when the global switch is disabled.
> The user has to make the decision again when it gets re-enabled.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/Kconfig  | 23 +-
>  lib/Kconfig.debug | 62 +++
>  2 files changed, 53 insertions(+), 32 deletions(-)

While I'd still paint the bikeshed a different color and organize the
dependencies a little differently, as discussed in your v1 post, this
is still OK w/ me.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v9 6/7] kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB

2023-06-15 Thread Doug Anderson
Daniel,

On Thu, Jun 1, 2023 at 2:37 PM Douglas Anderson  wrote:
>
> To save architectures from needing to wrap the call in #ifdefs, add a
> stub no-op version of kgdb_nmicallback(), which returns 1 if it didn't
> handle anything.
>
> Reviewed-by: Daniel Thompson 
> Signed-off-by: Douglas Anderson 
> ---
> In v9 this is the only kgdb dependency. I'm assuming it could go
> through the arm64 tree? If that's not a good idea, we could always
> change the patch ("arm64: kgdb: Roundup cpus using IPI as NMI") not to
> depend on it by only calling kgdb_nmicallback() if CONFIG_KGDB is not
> defined.
>
> Changes in v9:
> - Added missing "inline"
>
> Changes in v8:
> - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
>
>  include/linux/kgdb.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> index 258cdde8d356..76e891ee9e37 100644
> --- a/include/linux/kgdb.h
> +++ b/include/linux/kgdb.h
> @@ -365,5 +365,6 @@ extern void kgdb_free_init_mem(void);
>  #define dbg_late_init()
>  static inline void kgdb_panic(const char *msg) {}
>  static inline void kgdb_free_init_mem(void) { }
> +static inline int kgdb_nmicallback(int cpu, void *regs) { return 1; }

What do you think about landing just ${SUBJECT} patch in kgdb right
now so it can end up in v6.5-rc1? It seems like this series is
currently blocked on Mark getting a spare moment and it seems unlikely
that'll happen this cycle. If we at least land the kgdb patch then it
would make things all that much easier to land in the next cycle. The
kgdb patch feels like it can make sense on its own...

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-14 Thread Doug Anderson
Hi,

On Wed, Jun 14, 2023 at 3:29 AM Petr Mladek  wrote:
>
> It seems that we have entered into a bike shedding mode.
> The following questions come to my mind:
>
>1. Does this patchset improve the current state?
>
>2. Maybe, it is not black Is it possible to summarize
>   what exactly got better and what got worse?
>
> Maybe, there is no need to do bike-shedding about every step
> if the final result is reasonable and the steps are not
> completely wrong.
>
> I just followed my intuition and tried to do some changes step
> by step. I got lost many times so maybe the steps are not
> ideal. Anyway, the steps helped me to understand the logic
> and stay reasonably confident that they did not change
> the behavior.
>
> I could rework the patchset. But I first need to know what
> exactly is bad in the result. And eventually if there is more
> logical way how to end there.

Sure. I still feel like the end result of the CONFIG options after
your whole patchset is easier to understand / cleaner by adjusting the
dependencies as I have suggested. That being said, I agree that it is
the type of thing that can be more a matter of personal preference. I
do agree that, even if you don't take my suggestion of adjusting the
dependencies, the end result of your patchset still makes things
better than they were.

...so if you really feel strongly that things are more understandable
with the dependencies specified as you have, I won't stand in the way.
I still think you need a v2, though, just to address other nits.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v5 15/18] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs

2023-06-12 Thread Doug Anderson
Mark,

On Mon, Jun 12, 2023 at 3:33 AM Mark Rutland  wrote:
>
> On Fri, May 19, 2023 at 10:18:39AM -0700, Douglas Anderson wrote:
> > On arm64, NMI support needs to be detected at runtime. Add a weak
> > function to the perf hardlockup detector so that an architecture can
> > implement it to detect whether NMIs are available.
> >
> > Signed-off-by: Douglas Anderson 
> > ---
> > While I won't object to this patch landing, I consider it part of the
> > arm64 perf hardlockup effort. I would be OK with the earlier patches
> > in the series landing and then not landing ${SUBJECT} patch nor
> > anything else later.
>
> FWIW, everything prior to this looks fine to me, so I reckon it'd be worth
> splitting the series here and getting the buddy lockup detector in first, to
> avoid a log-jam on all the subsequent NMI bits.

I think the whole series has already landed in Andrew's tree,
including the arm64 "perf" lockup detector bits. I saw all the
notifications from Andrew go through over the weekend that they were
moved from an "unstable" branch to a "stable" one and I see them at:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/log/?h=mm-nonmm-stable

When I first saw Anderw land the arm64 perf lockup detector bits in
his unstable branch several weeks ago, I sent a private message to the
arm64 maintainers (yourself included) to make sure you were aware of
it and that it hadn't been caught in mail filters. I got the
impression that everything was OK. Is that not the case?


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-08 Thread Doug Anderson
Hi,

On Thu, Jun 8, 2023 at 4:02 AM Petr Mladek  wrote:
>
> > >  config HARDLOCKUP_DETECTOR
> > > bool "Detect Hard Lockups"
> > > depends on DEBUG_KERNEL && !S390
> > > -   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> > > HAVE_HARDLOCKUP_DETECTOR_ARCH
> > > +   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> > > HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> > > HAVE_HARDLOCKUP_DETECTOR_ARCH
> >
> > Adding the dependency to buddy (see ablove) would simplify the above
> > to just this:
> >
> > depends on HAVE_HARDLOCKUP_DETECTOR_PERF ||
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> This is exactly what I do not want. It would just move the check
> somewhere else. But it would make the logic harder to understand.

Hmmm. To me, it felt easier to understand by moving this into the
"HAVE_HARDLOCKUP_DETECTOR_BUDDY". To me it was pretty easy to say "if
an architecture defined its own arch-specific watchdog then buddy
can't be enabled" and that felt like it fit cleanly within the
"HAVE_HARDLOCKUP_DETECTOR_BUDDY" definition. It got rid of _a lot_ of
other special cases / checks elsewhere and felt quite a bit cleaner to
me. I only had to think about the conflict between the "buddy" and
"nmi" watchdogs once when I understood
"HAVE_HARDLOCKUP_DETECTOR_BUDDY".


> > As per above, it's simply a responsibility of architectures not to
> > define that they have both "perf" if they have the NMI watchdog, so
> > it's just buddy to worry about.
>
> Where is this documented, please?
> Is it safe to assume this?

It's not well documented and I agree that it could be improved. Right
now, HAVE_NMI_WATCHDOG is documented to say that the architecture
"defines its own arch_touch_nmi_watchdog()". Looking before my
patches, you can see that "kernel/watchdog_hld.c" (the "perf" detector
code) unconditionally defines arch_touch_nmi_watchdog(). That would
give you a linker error.


> I would personally prefer to ensure this by the config check.
> It is even better than documentation because nobody reads
> documentation ;-)

Sure. IMO this should be documented as close as possible to the root
of the problem. Make "HAVE_NMI_WATCHDOG" depend on
"!HAVE_HARDLOCKUP_DETECTOR_PERF". That expresses that an architecture
is not allowed to declare that it has both.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 4/7] watchdog/hardlockup: Enable HAVE_NMI_WATCHDOG only on sparc64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 13c6e596cf9e..57f15babe188 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -404,10 +404,9 @@ config HAVE_NMI_WATCHDOG
> depends on HAVE_NMI
> bool
> help
> - The arch provides its own hardlockup detector implementation instead
> + Sparc64 provides its own hardlockup detector implementation instead
>   of the generic perf one.

It's a little weird to document generic things with the specifics of
the user. The exception, IMO, is when something is deprecated.
Personally, it would sound less weird to me to say something like:

The arch provides its own hardlockup detector implementation instead
of the generic perf one. This is a deprecated thing to do and kept
around until sparc64 provides a full hardlockup implementation or
moves to generic code.


> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d201f5d3876b..4b4aa0f941f9 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1050,15 +1050,13 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
>  #  sparc64: has a custom implementation which is not using the common
>  #  hardlockup command line options and sysctl interface.
>  #
> -# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> -# implementaion. It is automatically enabled also for other arch-specific
> -# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> -# of avaialable and supported variants quite tricky.
> +# Note that HAVE_NMI_WATCHDOG is set when the sparc64 specific implementation
> +# is used.
>  #
>  config HARDLOCKUP_DETECTOR
> bool "Detect Hard Lockups"
> -   depends on DEBUG_KERNEL && !S390
> -   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   depends on DEBUG_KERNEL && !S390 && !HAVE_NMI_WATCHDOG
> +   depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH

If you add the "!HAVE_NMI_WATCHDOG" as a dependency to
HAVE_HARDLOCKUP_DETECTOR_BUDDY, as discussed in a previous patch, you
can skip adding it here.


> imply HARDLOCKUP_DETECTOR_PERF
> imply HARDLOCKUP_DETECTOR_BUDDY
> select LOCKUP_DETECTOR
> @@ -1079,7 +1077,7 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> bool "Prefer the buddy CPU hardlockup detector"
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF && 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARLOCKUP_DETECTOR_ARCH

Don't need this. Architectures never are allowed to define
HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_HARLOCKUP_DETECTOR_ARCH


> default n
> help
>   Say Y here to prefer the buddy hardlockup detector over the perf 
> one.
> @@ -1096,7 +1094,7 @@ config HARDLOCKUP_DETECTOR_PERF
> bool
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF && 
> !HARDLOCKUP_DETECTOR_PREFER_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH

Similarly, don't need this.


> select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
>
>  config HARDLOCKUP_DETECTOR_BUDDY
> @@ -1104,7 +1102,7 @@ config HARDLOCKUP_DETECTOR_BUDDY
> depends on HARDLOCKUP_DETECTOR
> depends on HAVE_HARDLOCKUP_DETECTOR_BUDDY
> depends on !HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HARDLOCKUP_DETECTOR_PREFER_BUDDY
> -   depends on !HAVE_NMI_WATCHDOG
> +   depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH

Similarly, don't need this.


In general I don't object to splitting out HAVE_NMI_WATCHDOG from
HAVE_HARDLOCKUP_DETECTOR_ARCH.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 3/7] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> arch_touch_nmi_watchdog() needs a different implementation for various
> hardlockup detector implementations. And it does nothing when
> any hardlockup detector is not build at all.

s/build/built/


> arch_touch_nmi_watchdog() has to be declared in linux/nmi.h. It is done
> directly in this header file for the perf and buddy detectors. And it
> is done in the included asm/linux.h for arch specific detectors.
>
> The reason probably is that the arch specific variants build the code
> using another conditions. For example, powerpc64/sparc64 builds the code
> when CONFIG_PPC_WATCHDOG is enabled.
>
> Another reason might be that these architectures define more functions
> in asm/nmi.h anyway.
>
> However the generic code actually knows the information. The config
> variables HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_ARCH are used
> to decide whether to build the buddy detector.
>
> In particular, CONFIG_HARDLOCKUP_DETECTOR is set only when a generic
> or arch-specific hardlockup detector is built. The only exception
> is sparc64 which ignores the global HARDLOCKUP_DETECTOR switch.
>
> The information about sparc64 is a bit complicated. The hardlockup
> detector is built there when CONFIG_HAVE_NMI_WATCHDOG is set and
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> People might wonder whether this change really makes things easier.
> The motivation is:
>
>   + The current logic in linux/nmi.h is far from obvious.
> For example, arch_touch_nmi_watchdog() is defined as {} when
> neither CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER nor
> CONFIG_HAVE_NMI_WATCHDOG is defined.
>
>   + The change synchronizes the checks in lib/Kconfig.debug and
> in the generic code.
>
>   + It is a step that will help cleaning HAVE_NMI_WATCHDOG related
> checks.
>
> The change should not change the existing behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/powerpc/include/asm/nmi.h |  2 --
>  arch/sparc/include/asm/nmi.h   |  1 -
>  include/linux/nmi.h| 13 ++---
>  3 files changed, 10 insertions(+), 6 deletions(-)

This looks right and is a nice cleanup.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 0/7] watchdog/hardlockup: Cleanup configuration of hardlockup detectors

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> Hi,
>
> this patchset is supposed to replace the last patch in the patchset cleaning
> up after introducing the buddy detector, see
> https://lore.kernel.org/r/20230526184139.10.I821fe7609e57608913fe05abd8f35b343e7a9aae@changeid

I will let Andrew chime in with his preference, but so far I haven't
seen him dropping and/or modifying any patches that he's picked up in
this series. I see that he's already picked up the patch that you're
"replacing". I wonder if it would be easier for him if you just built
atop that?

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 7/7] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:26 AM Petr Mladek  wrote:
>
> @@ -1102,6 +1103,14 @@ config HARDLOCKUP_DETECTOR_BUDDY
> depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
> select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
>
> +config HARDLOCKUP_DETECTOR_ARCH
> +   bool
> +   depends on HARDLOCKUP_DETECTOR
> +   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   help
> + The arch-specific implementation of the hardlockup detector is
> + available.

nit: "is available" makes it sound a bit too much like a "have"
version. Maybe "The arch-specific implementation of the hardlockup
detector will be used" or something like that?

Otherise:

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 6/7] watchdog/sparc64: Define HARDLOCKUP_DETECTOR_SPARC64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:26 AM Petr Mladek  wrote:
>
> The HAVE_ prefix means that the code could be enabled. Add another
> variable for HAVE_HARDLOCKUP_DETECTOR_SPARC64 without this prefix.
> It will be set when it should be built. It will make it compatible
> with the other hardlockup detectors.
>
> Before, it is far from obvious that the SPARC64 variant is actually used:
>
> $> make ARCH=sparc64 defconfig
> $> grep HARDLOCKUP_DETECTOR .config
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
>
> After, it is more clear:
>
> $> make ARCH=sparc64 defconfig
> $> grep HARDLOCKUP_DETECTOR .config
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_BUDDY=y
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_SPARC64=y
> CONFIG_HARDLOCKUP_DETECTOR_SPARC64=y
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/sparc/Kconfig.debug | 10 +-
>  include/linux/nmi.h  |  4 ++--
>  kernel/watchdog.c|  2 +-
>  lib/Kconfig.debug|  2 +-
>  4 files changed, 13 insertions(+), 5 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 5/7] watchdog/sparc64: Rename HAVE_NMI_WATCHDOG to HAVE_HARDLOCKUP_WATCHDOG_SPARC64

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> The configuration variable HAVE_NMI_WATCHDOG has a generic name but
> it is selected only for SPARC64.
>
> It should _not_ be used in general because it is not integrated with
> the other hardlockup detectors. Namely, it does not support the hardlockup
> specific command line parameters and systcl interface. Instead, it is
> enabled/disabled together with the softlockup detector by the global
> "watchdog" sysctl.
>
> Rename it to HAVE_HARDLOCKUP_WATCHDOG_SPARC64 to make the special
> behavior more clear.
>
> Also the variable is set only on sparc64. Move the definition
> from arch/Kconfig to arch/sparc/Kconfig.debug.
>
> Signed-off-by: Petr Mladek 
> ---
>  arch/Kconfig | 12 
>  arch/sparc/Kconfig   |  2 +-
>  arch/sparc/Kconfig.debug | 12 
>  include/linux/nmi.h  |  4 ++--
>  kernel/watchdog.c|  2 +-
>  lib/Kconfig.debug|  5 +
>  6 files changed, 17 insertions(+), 20 deletions(-)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 422f0ffa269e..13c6e596cf9e 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -404,17 +404,27 @@ config HAVE_NMI_WATCHDOG
> depends on HAVE_NMI
> bool
> help
> - The arch provides a low level NMI watchdog. It provides
> - asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
> - arch_touch_nmi_watchdog().
> + The arch provides its own hardlockup detector implementation instead
> + of the generic perf one.

nit: did you mean to have different wording here compared to
HAVE_HARDLOCKUP_DETECTOR_ARCH? Here you say "the generic perf one" and
there you say "the generic ones", though it seems like you mean them
to be the same.

> +
> + Sparc64 defines this variable without HAVE_HARDLOCKUP_DETECTOR_ARCH.
> + It does _not_ use the command line parameters and sysctl interface
> + used by generic hardlockup detectors. Instead it is enabled/disabled
> + by the top-level watchdog interface that is common for both 
> softlockup
> + and hardlockup detectors.
>
>  config HAVE_HARDLOCKUP_DETECTOR_ARCH
> bool
> select HAVE_NMI_WATCHDOG
> help
> - The arch chooses to provide its own hardlockup detector, which is
> - a superset of the HAVE_NMI_WATCHDOG. It also conforms to config
> - interfaces and parameters provided by hardlockup detector subsystem.
> + The arch provides its own hardlockup detector implementation instead
> + of the generic ones.
> +
> + It uses the same command line parameters, and sysctl interface,
> + as the generic hardlockup detectors.
> +
> + HAVE_NMI_WATCHDOG is selected to build the code shared with
> + the sparc64 specific implementation.
>
>  config HAVE_PERF_REGS
> bool
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 3e91fa33c7a0..d201f5d3876b 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
>
>   Say N if unsure.
>
> +config HAVE_HARDLOCKUP_DETECTOR_BUDDY
> +   bool
> +   depends on SMP
> +   default y

I think you simplify your life if you also add:

  depends on !HAVE_NMI_WATCHDOG

The existing config system has always assumed that no architecture
defines both HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_NMI_WATCHDOG
(symbols would have clashed and you would get a link error as two
watchdogs try to implement the same weak symbol). If you add the extra
dependency to "buddy" as per above, then a few things below fall out.
I'll try to point them out below.


> +
>  #
> -# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
> -# lockup detector rather than the perf based detector.
> +# Global switch whether to build a hardlockup detector at all. It is 
> available
> +# only when the architecture supports at least one implementation. There are
> +# two exceptions. The hardlockup detector is newer enabled on:

s/newer/never/


> +#
> +#  s390: it reported many false positives there
> +#
> +#  sparc64: has a custom implementation which is not using the common
> +#  hardlockup command line options and sysctl interface.
> +#
> +# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> +# implementaion. It is automatically enabled also for other arch-specific
> +# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> +# of avaialable and supported variants quite tricky.
>  #
>  config HARDLOCKUP_DETECTOR
> bool "Detect Hard Lockups"
> depends on DEBUG_KERNEL && !S390
> -   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH
> +   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> HAVE_HARDLOCKUP_DETECTOR_ARCH

Adding the dependency to buddy (see ablove) would simplify the above
to just this:

depends on HAVE_HARDLOCKUP_DETECTOR_PERF ||
HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH

As per above, it's simply a responsibility of architectures not to
define that they have both "perf" if they have the NMI watchdog, so
it's just buddy to worry about.


> +   imply HARDLOCKUP_DETECTOR_PERF
> +   imply HARDLOCKUP_DETECTOR_BUDDY
> select LOCKUP_DETECTOR
> -   select HARDLOCKUP_DETECTOR_NON_ARCH if 
> HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
>
> help
>   Say Y here to enable the kernel to act as a watchdog to detect
> @@ -1055,9 +1072,15 @@ config HARDLOCKUP_DETECTOR
>   chance to run.  The current stack trace is displayed upon detection
>   and the system will stay locked up.
>
> +#
> +# Note that arch-specific variants are always preferred.
> +#
>  config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> bool "Prefer the buddy CPU hardlockup detector"
> 

Re: [Kgdb-bugreport] [PATCH 1/7] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-07 Thread Doug Anderson
Hi,

On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
>
> There are four possible variants of hardlockup detectors:
>
>   + buddy: available when SMP is set.
>
>   + perf: available when HAVE_HARDLOCKUP_DETECTOR_PERF is set.
>
>   + arch-specific: available when HAVE_HARDLOCKUP_DETECTOR_ARCH is set.
>
>   + sparc64 special variant: available when HAVE_NMI_WATCHDOG is set
> and HAVE_HARDLOCKUP_DETECTOR_ARCH is not set.
>
> Only one hardlockup detector can be compiled in. The selection is done
> using quite complex dependencies between several CONFIG variables.
> The following patches will try to make it more straightforward.
>
> As a first step, reorder the definitions of the various CONFIG variables.
> The logical order is:
>
>1. HAVE_* variables define available variants. They are typically
>   defined in the arch/ config files.
>
>2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
>   detector is enabled at all.
>
>3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
>   the buddy detector should be preferred over the perf one.
>   Note that the arch specific variants are always preferred when
>   available.
>
>4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
>   detector is enabled in the end.
>
>5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
>   are temporary variables that are going to be removed in
>   a followup patch.
>
> The patch just shuffles the definitions. It should not change the existing
> behavior.
>
> Signed-off-by: Petr Mladek 
> ---
>  lib/Kconfig.debug | 112 +++---
>  1 file changed, 56 insertions(+), 56 deletions(-)

I don't really have any strong opinions, so I'm fine with this. In
general I think the ordering I picked tried to match the existing
"style" which generally tried to list configs and then select them
below. To me the existing style makes more sense when thinking about
writing C code without having to put a pile of prototypes at the top
of your file: you define things higher in the file and reference them
below. For instance, the old style (before any of my patches) had:

config SOFTLOCKUP_DETECTOR:
  ... blah blah blah ...

config HARDLOCKUP_DETECTOR_PERF:
  select SOFTLOCKUP_DETECTOR

config HARDLOCKUP_DETECTOR:
  ... blah blah blah ...
  select LOCKUP_DETECTOR
  select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF

Your new style seems to be the opposite of that.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 07/10] kgdb: Expose default CPUs roundup fallback mechanism

2023-06-01 Thread Doug Anderson
Hi,

On Mon, May 15, 2023 at 4:21 PM Doug Anderson  wrote:
>
> Hi,
>
> On Fri, May 12, 2023 at 6:49 AM Daniel Thompson
>  wrote:
> >
> > On Wed, Apr 19, 2023 at 03:56:01PM -0700, Douglas Anderson wrote:
> > > From: Sumit Garg 
> > >
> > > Add a new API kgdb_smp_call_nmi_hook() to expose default CPUs roundup
> > > mechanism to a particular archichecture as a runtime fallback if it
> > > detects to not support NMI roundup.
> > >
> > > Currently such an architecture example is arm64 supporting pseudo NMIs
> > > feature which is only available on platforms which have support for GICv3
> > > or later version.
> > >
> > > Signed-off-by: Sumit Garg 
> > > Tested-by: Chen-Yu Tsai 
> > > Signed-off-by: Douglas Anderson 
> > > ---
> > >
> > > (no changes since v1)
> > >
> > >  include/linux/kgdb.h  | 12 
> > >  kernel/debug/debug_core.c |  8 +++-
> > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> > > index 258cdde8d356..87713bd390f3 100644
> > > --- a/include/linux/kgdb.h
> > > +++ b/include/linux/kgdb.h
> > > @@ -199,6 +199,18 @@ kgdb_arch_handle_qxfer_pkt(char *remcom_in_buffer,
> > >
> > >  extern void kgdb_call_nmi_hook(void *ignored);
> > >
> > > +/**
> > > + *   kgdb_smp_call_nmi_hook - Provide default fallback mechanism to
> > > + *round-up CPUs
> > > + *
> > > + *   If you're using the default implementation of kgdb_roundup_cpus()
> > > + *   this function will be called.  And if an arch detects at runtime to
> > > + *   not support NMI based roundup then it can fallback to default
> > > + *   mechanism using this API.
> > > + */
> > > +
> > > +extern void kgdb_smp_call_nmi_hook(void);
> >
> > Concept looks sensible but this is a terrible name for aa command to
> > round up the CPUs using smp_call... functions. Whilst it is true it that
> > kgdb_roundup_cpus() does use kgdb_call_nmi_hook() internally that
> > doesn't mean we should name functions after it. They should be named
> > after what they are do, not how they do it.
> >
> > Something more like kgdb_roundup_cpus_with_smp_call() would be a much
> > better name.
>
> Sounds good. I'm happy to spin with this rename, though I was kinda
> hoping to drop ${SUBJECT} patch if folks were OK with patch #10 in
> this series [1]. I personally think that's the right way to go but
> it's unclear to me if arm64 maintainers will think it's a hack
> (despite the fact that arm32 implements the "nmi" functions with
> regular IPIs).
>
> For now, maybe I'll think positive thoughts and hope that folks will
> have the time to review the series soon. If another few weeks go by
> then I'll send a v9 with just your comments addressed. If nothing
> else, maybe you can land the kgdb parts in a tree targeting v6.5 and
> then when arm64 folks have the bandwidth then it will be easier to get
> them landed.
>
> [1] 
> https://lore.kernel.org/r/20230419155341.v8.10.Ic3659997d6243139d0522fc3afcdfd88d7a5f030@changeid

Breadcrumbs: I've dropped this patch and several others in v9 [1] by
embracing the idea of using a normal IPI as a fallback.

[1] https://lore.kernel.org/r/20230601213440.2488667-1-diand...@chromium.org/


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 00/10] arm64: Add framework to turn an IPI as NMI

2023-06-01 Thread Doug Anderson
Hi,

On Wed, May 10, 2023 at 9:42 AM Doug Anderson  wrote:
>
> Hi,
>
> On Wed, May 10, 2023 at 9:30 AM Mark Rutland  wrote:
> >
> > On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote:
> > > Hi,
> >
> > Hi Doug,
> >
> > > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  
> > > wrote:
> > > > This is an attempt to resurrect Sumit's old patch series [1] that
> > > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> > > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> > > > could find was v7, so I called this series v8. I haven't copied all of
> > > > his old changelongs here, but you can find them from the link.
> > > >
> > > > Since v7, I have:
> > > > * Addressed the small amount of feedback that was there for v7.
> > > > * Rebased.
> > > > * Added a new patch that prevents us from spamming the logs with idle
> > > >   tasks.
> > > > * Added an extra patch to gracefully fall back to regular IPIs if
> > > >   pseudo-NMIs aren't there.
> > > >
> > > > Since there appear to be a few different patches series related to
> > > > being able to use NMIs to get stack traces of crashed systems, let me
> > > > try to organize them to the best of my understanding:
> > > >
> > > > a) This series. On its own, a) will (among other things) enable stack
> > > >traces of all running processes with the soft lockup detector if
> > > >you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On
> > > >its own, a) doesn't give a hard lockup detector.
> > > >
> > > > b) A different recently-posted series [2] that adds a hard lockup
> > > >detector based on perf. On its own, b) gives a stack crawl of the
> > > >locked up CPU but no stack crawls of other CPUs (even if they're
> > > >locked too). Together with a) + b) we get everything (full lockup
> > > >detect, full ability to get stack crawls).
> > > >
> > > > c) The old Android "buddy" hard lockup detector [3] that I'm
> > > >considering trying to upstream. If b) lands then I believe c) would
> > > >be redundant (at least for arm64). c) on its own is really only
> > > >useful on arm64 for platforms that can print CPU_DBGPCSR somehow
> > > >(see [4]). a) + c) is roughly as good as a) + b).
> >
> > > It's been 3 weeks and I haven't heard a peep on this series. That
> > > means nobody has any objections and it's all good to land, right?
> > > Right? :-P
> >
> > FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI
> > support (and fixing that requires an overhaul of our DAIF / IRQ flag
> > management, which I've been chipping away at for a number of releases), so I
> > hadn't looked at this in detail yet because the foundations are still 
> > somewhat
> > dodgy.
> >
> > I appreciate that this has been around for a while, and it's on my queue to
> > look at.
>
> Ah, thanks for the heads up! We've been thinking about turning this on
> in production in ChromeOS because it will help us track down a whole
> class of field-generated crash reports that are otherwise opaque to
> us. It sounds as if maybe that's not a good idea quite yet? Do you
> have any idea of how much farther along this needs to go?

I'm still very interested in this topic. Do you have anything concrete
that is broken? Your reply gives me the vibe that there have been a
bunch of bugs found recently and, while there are no known issues,
you're worried that there might be something lingering still. Is that
correct, or do you have something concrete that's broken? Is this
anything others could help with? I could make an attempt, or we might
be able to rope some of the others interested in this topic and more
GIC-knowledgeable to help?

I still have a goal for turning this on for production but obviously
don't want to make the device crashier because of it.

Also: if this really has known bugs, should we put a big WARN_ON splat
if anyone tries to turn on pseudo NMIs? Without that, it feels like
it's a bit of a footgun.


> ...of
> course, we've also run into issues with Mediatek devices because they
> don't save/restore GICR registers properly [1]. In theory, we might be
> able to work around that in the kernel.

To followup with this, we've formulated a plan for dealing with
Mediatek Chromebooks. I doubt anyone is truly interested, but if
anyone is the details are in the public 

Re: [Kgdb-bugreport] [PATCH v5 12/18] watchdog/hardlockup: Rename some "NMI watchdog" constants/function

2023-05-25 Thread Doug Anderson
Hi,

On Wed, May 24, 2023 at 6:38 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:36, Douglas Anderson wrote:
> > Do a search and replace of:
> > - NMI_WATCHDOG_ENABLED => WATCHDOG_HARDLOCKUP_ENABLED
> > - SOFT_WATCHDOG_ENABLED => WATCHDOG_SOFTOCKUP_ENABLED
> > - watchdog_nmi_ => watchdog_hardlockup_
> > - nmi_watchdog_available => watchdog_hardlockup_available
> > - nmi_watchdog_user_enabled => watchdog_hardlockup_user_enabled
> > - soft_watchdog_user_enabled => watchdog_softlockup_user_enabled
> > - NMI_WATCHDOG_DEFAULT => WATCHDOG_HARDLOCKUP_DEFAULT
> >
> > Then update a few comments near where names were changed.
> >
> > This is specifically to make it less confusing when we want to
> > introduce the buddy hardlockup detector, which isn't using NMIs. As
> > part of this, we sanitized a few names for consistency.
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -30,17 +30,17 @@
> >  static DEFINE_MUTEX(watchdog_mutex);
> >
> >  #if defined(CONFIG_HARDLOCKUP_DETECTOR) || 
> > defined(CONFIG_HAVE_NMI_WATCHDOG)
> > -# define NMI_WATCHDOG_DEFAULT1
> > +# define WATCHDOG_HARDLOCKUP_DEFAULT 1
> >  #else
> > -# define NMI_WATCHDOG_DEFAULT0
> > +# define WATCHDOG_HARDLOCKUP_DEFAULT 0
> >  #endif
> >
> >  unsigned long __read_mostly watchdog_enabled;
> >  int __read_mostly watchdog_user_enabled = 1;
> > -int __read_mostly nmi_watchdog_user_enabled = NMI_WATCHDOG_DEFAULT;
> > -int __read_mostly soft_watchdog_user_enabled = 1;
> > +int __read_mostly watchdog_hardlockup_user_enabled = 
> > WATCHDOG_HARDLOCKUP_DEFAULT;
> > +int __read_mostly watchdog_softlockup_user_enabled = 1;
>
> I still see nmi_watchdog_user_enabled and soft_watchdog_user_enabled
> in include/linux/nmi.h. They are declared there and also mentioned
> in a comment.
>
> It seems that they do not actually need to be declared there.
> I guess that it was need for the /proc stuff. But it was
> moved into kernel/watchdog.c by the commit commit dd0693fdf054f2ed37
> ("watchdog: move watchdog sysctl interface to watchdog.c").
>
> >  int __read_mostly watchdog_thresh = 10;
> > -static int __read_mostly nmi_watchdog_available;
> > +static int __read_mostly watchdog_hardlockup_available;
> >
> >  struct cpumask watchdog_cpumask __read_mostly;
> >  unsigned long *watchdog_cpumask_bits = cpumask_bits(_cpumask);
>
> Otherwise, I like the changes.
>
> With the following:
>
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 83076bf70ce8..d14fe345eae9 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -17,8 +17,6 @@ void lockup_detector_soft_poweroff(void);
>  void lockup_detector_cleanup(void);
>
>  extern int watchdog_user_enabled;
> -extern int nmi_watchdog_user_enabled;
> -extern int soft_watchdog_user_enabled;
>  extern int watchdog_thresh;
>  extern unsigned long watchdog_enabled;
>
> @@ -68,8 +66,8 @@ static inline void reset_hung_task_detector(void) { }
>   * 'watchdog_enabled' variable. Each lockup detector has its dedicated bit -
>   * bit 0 for the hard lockup detector and bit 1 for the soft lockup detector.
>   *
> - * 'watchdog_user_enabled', 'nmi_watchdog_user_enabled' and
> - * 'soft_watchdog_user_enabled' are variables that are only used as an
> + * 'watchdog_user_enabled', 'watchdog_hardlockup_user_enabled' and
> + * 'watchdog_softlockup_user_enabled' are variables that are only used as an
>   * 'interface' between the parameters in /proc/sys/kernel and the internal
>   * state bits in 'watchdog_enabled'. The 'watchdog_thresh' variable is
>   * handled differently because its value is not boolean, and the lockup
>
> Reviewed-by: Petr Mladek 
>
> Even better might be to remove the unused declaration in a separate
> patch. I think that more declarations are not needed after moving
> the /proc stuff into kernel/watchdog.c.
>
> But it might also be fixed later.

Breadcrumbs: I squashed your suggestion together with Tom's patch and
posted the result:

https://lore.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v5 14/18] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-25 Thread Doug Anderson
Hi,

On Thu, May 25, 2023 at 9:27 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:38, Douglas Anderson wrote:
> > Implement a hardlockup detector that doesn't doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -85,7 +85,7 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
> >
> >  #endif /* CONFIG_HARDLOCKUP_DETECTOR */
> >
> > -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> > +#if defined(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)
> >
> >  static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
> >  static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
> > @@ -106,6 +106,14 @@ notrace void arch_touch_nmi_watchdog(void)
> >  }
> >  EXPORT_SYMBOL(arch_touch_nmi_watchdog);
> >
> > +void watchdog_hardlockup_touch_cpu(unsigned int cpu)
> > +{
> > + per_cpu(watchdog_hardlockup_touched, cpu) = true;
> > +
> > + /* Match with smp_rmb() in watchdog_hardlockup_check() */
> > + smp_wmb();
>
> It is great that you described where the related barrier is.
>
> Another important information is what exactly is synchronized.
> And I am actually not sure what we are synchronizing here.
>
> My understanding is that a write barrier should synchronize
> related writes, for example:
>
> X = ...;
> /* Make sure that X is modified before Y */
> smp_wmb();
> Y = ...;
>
> And the related read barrier should synchronize the related reads,
> for example:
>
> if (test(Y)) {
> /*
>  * Make sure that we use the updated X when
>  * we saw the updated Y.
>  */
>  smp_rmb();
>  do_something(X);
>  }
>
> IMHO, we do not need any barrier here because we have only
> one variable "watchdog_hardlockup_touched" here.
> watchdog_hardlockup_check() will either see the updated value
> or not. But it does not synchronize it against any other
> variables or values.

Fair. These barriers were present in the original buddy lockup
detector that we've been carrying in ChromeOS but that doesn't
necessarily mean that they were there for a good reason.

Reasoning about weakly ordered memory always makes my brain hurt and I
never feel confident at the end that I got the right answer and, of
course, this is coupled by the fact that if I have a logic error in my
reasoning that it might cause a rare / subtle bug. :( When possible I
try to write code that uses full blown locks to make it easier to
reason about (even if less efficient), but that's not necessarily
possible here. While we obviously don't just want to sprinkle barriers
all over the code, IMO it's not a terrible sin to put a barrier in a
case where it makes it easier to reason about the order of things.

In any case, I guess in this case I would worry about some sort of
ordering race when enabling / disabling the buddy lockup detector. At
the end of the buddy's watchdog_hardlockup_enable() /
watchdog_hardlockup_disable() we adjust the "watchdog_cpus" which
changes buddy assignments. Without a barrier, I _think_ it would be
possible for a new CPU to notice the change in buddies without
noticing the touch. Does that match your understanding? Now when
reasoning about CPUs going online/offline we need to consider this
extra case and we have to decide if there's any chance it could lead
to a false lockup detection. With the memory barriers here, it's a
little easier to think about.

Did the above convince you about keeping the barriers? If so, do you
have any suggested comment that would make it clearer?


> > +}
> > +
> >  static bool is_hardlockup(unsigned int cpu)
> >  {
> >   int hrint = atomic_read(_cpu(hrtimer_interrupts, cpu));
> > @@ -443,11 +454,15 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
> > hrtimer *hrtimer)
> >   struct pt_regs *regs = get_irq_regs();
> >   int duration;
> >   int softlockup_all_cpu_backtrace = 
> > sysctl_softlockup_all_cpu_backtrace;
> > + unsigned long hrtimer_interrupts;
> >
> >   if (!watchdog_enabled)
> >   return HRTIMER_NORESTART;
> >
> > - watchdog_hardlockup_kick();
> > + hrtimer_interrupts = watchdog_hardlockup_kick();
> > +
> > + /* test for hardlockups */
>
> I would omit the comment. It is not valid when perf detector is used.
> And checking the buddy is clear from the function name.
>
> > + watchdog_buddy_check_hardlockup(hrtimer_interrupts);
>
> I would personally move this into watchdog_hardlockup_kick().
> watchdog_timer_fn() is already complex enough. And checking
> the buddy when kicking a CPU makes sense.

Sure, I'll add that to my list of things to follow-up 

Re: [Kgdb-bugreport] [PATCH v5 13/18] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly

2023-05-24 Thread Doug Anderson
Hi,

On Wed, May 24, 2023 at 6:59 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:37, Douglas Anderson wrote:
> > The fact that there watchdog_hardlockup_enable(),
> > watchdog_hardlockup_disable(), and watchdog_hardlockup_probe() are
> > declared __weak means that the configured hardlockup detector can
> > define non-weak versions of those functions if it needs to. Instead of
> > doing this, the perf hardlockup detector hooked itself into the
> > default __weak implementation, which was a bit awkward. Clean this up.
> >
> > >From comments, it looks as if the original design was done because the
> > __weak function were expected to implemented by the architecture and
> > not by the configured hardlockup detector. This got awkward when we
> > tried to add the buddy lockup detector which was not arch-specific but
> > wanted to hook into those same functions.
> >
> > This is not expected to have any functional impact.
> >
> > @@ -187,27 +187,33 @@ static inline void watchdog_hardlockup_kick(void) { }
> >  #endif /* !CONFIG_HARDLOCKUP_DETECTOR_PERF */
> >
> >  /*
> > - * These functions can be overridden if an architecture implements its
> > - * own hardlockup detector.
> > + * These functions can be overridden based on the configured hardlockdup 
> > detector.
> >   *
> >   * watchdog_hardlockup_enable/disable can be implemented to start and stop 
> > when
> > - * softlockup watchdog start and stop. The arch must select the
> > + * softlockup watchdog start and stop. The detector must select the
> >   * SOFTLOCKUP_DETECTOR Kconfig.
> >   */
> > -void __weak watchdog_hardlockup_enable(unsigned int cpu)
> > -{
> > - hardlockup_detector_perf_enable();
> > -}
> > +void __weak watchdog_hardlockup_enable(unsigned int cpu) { }
> >
> > -void __weak watchdog_hardlockup_disable(unsigned int cpu)
> > -{
> > - hardlockup_detector_perf_disable();
> > -}
> > +void __weak watchdog_hardlockup_disable(unsigned int cpu) { }
> >
> >  /* Return 0, if a hardlockup watchdog is available. Error code otherwise */
> >  int __weak __init watchdog_hardlockup_probe(void)
> >  {
> > - return hardlockup_detector_perf_init();
> > + /*
> > +  * If CONFIG_HAVE_NMI_WATCHDOG is defined then an architecture
> > +  * is assumed to have the hard watchdog available and we return 0.
> > +  */
> > + if (IS_ENABLED(CONFIG_HAVE_NMI_WATCHDOG))
> > + return 0;
> > +
> > + /*
> > +  * Hardlockup detectors other than those using 
> > CONFIG_HAVE_NMI_WATCHDOG
> > +  * are required to implement a non-weak version of this probe function
> > +  * to tell whether they are available. If they don't override then
> > +  * we'll return -ENODEV.
> > +  */
> > + return -ENODEV;
> >  }
>
> When thinking more about it. It is weird that we need to handle
> CONFIG_HAVE_NMI_WATCHDOG in this default week function.
>
> It should be handled in watchdog_hardlockup_probe() implemented
> in kernel/watchdog_perf.c.
>
> IMHO, the default __weak function could always return -ENODEV;
>
> Would it make sense, please?

I don't quite understand. I'd agree that the special case for
CONFIG_HAVE_NMI_WATCHDOG is ugly, but it was also ugly before. IMO
it's actually a little less ugly / easier to understand after my
patch. ...but let me walk through how I think this special case works
and maybe you can tell me where I'm confused.

The first thing to understand is that CONFIG_HARDLOCKUP_DETECTOR_PERF
and CONFIG_HAVE_NMI_WATCHDOG are mutually exclusive from each other.
This was true before any of my patches and is still true after them.
Specifically, if CONFIG_HAVE_NMI_WATCHDOG is defined then an
architecture implements arch_touch_nmi_watchdog() (as documented in
the Kconfig docs for HAVE_NMI_WATCHDOG). Looking at the tree before my
series you can see that the perf hardlockup detector also implemented
arch_touch_nmi_watchdog(). This would have caused a conflict. The
mutual exclusion was presumably enforced by an architecture not
defining both HAVE_NMI_WATCHDOG and HAVE_HARDLOCKUP_DETECTOR_PERF.

The second thing to understand is that an architecture that defines
CONFIG_HAVE_NMI_WATCHDOG is _not_ required to implement
watchdog_hardlockup_probe() (used to be called watchdog_nmi_probe()).
Maybe this should change, but at the very least it appears that
SPARC64 defines HAVE_NMI_WATCHDOG but doesn't define
watchdog_hardlockup_probe() AKA watchdog_nmi_probe(). Anyone who
defines CONFIG_HAVE_NMI_WATCHDOG and doesn't implement
watchdog_hardlockup_probe() is claiming that their watchdog needs no
probing and is always available.

So with that context:

1. We can't handle any special cases for CONFIG_HAVE_NMI_WATCHDOG in
"kernel/watchdog_perf.c". The special cases that we need to handle are
all for the cases where CONFIG_HARDLOCKUP_DETECTOR_PERF isn't defined
and that means "kernel/watchdog_perf.c" isn't included.

2. We can't have the default __weak function return -ENODEV because
CONFIG_HAVE_NMI_WATCHDOG doesn't require an 

Re: [Kgdb-bugreport] [PATCH v5 10/18] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check()

2023-05-23 Thread Doug Anderson
Hi,

On Tue, May 23, 2023 at 9:02 AM Petr Mladek  wrote:
>
> On Fri 2023-05-19 10:18:34, Douglas Anderson wrote:
> > In preparation for the buddy hardlockup detector where the CPU
> > checking for lockup might not be the currently running CPU, add a
> > "cpu" parameter to watchdog_hardlockup_check().
> >
> > As part of this change, make hrtimer_interrupts an atomic_t since now
> > the CPU incrementing the value and the CPU reading the value might be
> > different. Technially this could also be done with just READ_ONCE and
> > WRITE_ONCE, but atomic_t feels a little cleaner in this case.
> >
> > While hrtimer_interrupts is made atomic_t, we change
> > hrtimer_interrupts_saved from "unsigned long" to "int". The "int" is
> > needed to match the data type backing atomic_t for hrtimer_interrupts.
> > Even if this changes us from 64-bits to 32-bits (which I don't think
> > is true for most compilers), it doesn't really matter. All we ever do
> > is increment it every few seconds and compare it to an old value so
> > 32-bits is fine (even 16-bits would be). The "signed" vs "unsigned"
> > also doesn't matter for simple equality comparisons.
> >
> > hrtimer_interrupts_saved is _not_ switched to atomic_t nor even
> > accessed with READ_ONCE / WRITE_ONCE. The hrtimer_interrupts_saved is
> > always consistently accessed with the same CPU. NOTE: with the
> > upcoming "buddy" detector there is one special case. When a CPU goes
> > offline/online then we can change which CPU is the one to consistently
> > access a given instance of hrtimer_interrupts_saved. We still can't
> > end up with a partially updated hrtimer_interrupts_saved, however,
> > because we end up petting all affected CPUs to make sure the new and
> > old CPU can't end up somehow read/write hrtimer_interrupts_saved at
> > the same time.
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -87,29 +87,34 @@ __setup("nmi_watchdog=", hardlockup_panic_setup);
> >
> >  #if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
> >
> > -static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
> > -static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
> > +static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
> > +static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
> >  static DEFINE_PER_CPU(bool, watchdog_hardlockup_warned);
> >  static unsigned long watchdog_hardlockup_all_cpu_dumped;
> >
> > -static bool is_hardlockup(void)
> > +static bool is_hardlockup(unsigned int cpu)
> >  {
> > - unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
> > + int hrint = atomic_read(_cpu(hrtimer_interrupts, cpu));
> >
> > - if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
> > + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> >   return true;
> >
> > - __this_cpu_write(hrtimer_interrupts_saved, hrint);
> > + /*
> > +  * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> > +  * for hrtimer_interrupts_saved. hrtimer_interrupts_saved is
> > +  * written/read by a single CPU.
> > +  */
> > + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> >
> >   return false;
> >  }
> >
> >  static void watchdog_hardlockup_kick(void)
> >  {
> > - __this_cpu_inc(hrtimer_interrupts);
> > + atomic_inc(raw_cpu_ptr(_interrupts));
>
> Is there any particular reason why raw_*() is needed, please?
>
> My expectation is that the raw_ API should be used only when
> there is a good reason for it. Where a good reason might be
> when the checks might fail but the consistency is guaranteed
> another way.
>
> IMHO, we should use:
>
> atomic_inc(this_cpu_ptr(_interrupts));
>
> To be honest, I am a bit lost in the per_cpu API definitions.
>
> But this_cpu_ptr() seems to be implemented the same way as
> per_cpu_ptr() when CONFIG_DEBUG_PREEMPT is enabled.
> And we use per_cpu_ptr() in is_hardlockup().
>
> Also this_cpu_ptr() is used more commonly:
>
> $> git grep this_cpu_ptr | wc -l
> 1385
> $> git grep raw_cpu_ptr | wc -l
> 114

Hmmm, I think maybe I confused myself. The old code purposely used the
double-underscore prefixed version of this_cpu_inc(). I couldn't find
a double-underscore version of this_cpu_ptr() and I somehow convinced
myself that the raw() version was the right equivalent version.

You're right that this_cpu_ptr() is a better choice here and I don't
see any reason why we'd need the raw version.

> >  }
> >
> > -void watchdog_hardlockup_check(struct pt_regs *regs)
> > +void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >  {
> >   /*
> >* Check for a hardlockup by making sure the CPU's timer
> > @@ -117,35 +122,42 @@ void watchdog_hardlockup_check(struct pt_regs *regs)
> >* fired multiple times before we overflow'd. If it hasn't
> >* then this is a good indication the cpu is stuck
> >*/
> > - if (is_hardlockup()) {
> > + if (is_hardlockup(cpu)) {
> >   unsigned int this_cpu = smp_processor_id();
> > 

Re: [Kgdb-bugreport] [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-20 Thread Doug Anderson
Hi,

On Mon, May 8, 2023 at 8:52 AM Doug Anderson  wrote:
>
> Hmmm, but I don't think you really need "all-to-all" checking to get
> the stacktraces you want, do you? Each CPU can be "watching" exactly
> one other CPU, but then when we actually lock up we could check all of
> them and dump stacks on all the ones that are locked up. I think this
> would be a fairly easy improvement for the buddy system. I'll leave it
> out for now just to keep things simple for the initial landing, but it
> wouldn't be hard to add. Then I think the two SMP systems  (buddy vs.
> all-to-all) would be equivalent in terms of functionality?

FWIW, I take back my "this would be fairly easy" comment. :-P ...or,
at least I'll acknowledge that the easy way has some tradeoffs. It
wouldn't be trivially easy to just snoop on the data of the other
buddies because the watching processors aren't necessarily
synchronized with each other.

That being said, if someone really wanted to report on other locked
CPUs before doing a panic() and was willing to delay the panic, it
probably wouldn't be too hard to put in a mode where the CPU that
detects the first lockup could do some extra work to look for lockups.
Maybe it could send a normal IPI to other CPUs and see if they respond
or maybe it could take over monitoring all CPUs and wait one extra
period.

In any case, I'm not planning on implementing this now, but at least
wanted to document thoughts. ;-)

> With my simplistic solution
> of just allowing the buddy detector to be enabled in parallel with a
> perf-based detector then we wouldn't have this level of coordination,
> but I'll assume that's OK for the initial landing.

I dug into this more as well and I also wanted to note that, at least
for now, I'm not going to include support to turn on both the buddy
and perf lockup detectors in the common core. In order to do this and
not have them stomp on each other then I think we need extra
coordination or two copies of the interrupt count / saved interrupt
count and, at least at this point in time, it doesn't seem worth it
for a halfway solution. From everything I've heard there is a push on
many x86 machines to get off the perf lockup detector anyway to free
up the resources. Someone could look at adding this complexity later.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check()

2023-05-20 Thread Doug Anderson
Hi,

On Thu, May 11, 2023 at 7:14 AM Petr Mladek  wrote:
>
> On Thu 2023-05-04 15:13:41, Douglas Anderson wrote:
> > In preparation for the buddy hardlockup detector where the CPU
> > checking for lockup might not be the currently running CPU, add a
> > "cpu" parameter to watchdog_hardlockup_check().
> >
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -92,14 +92,14 @@ static DEFINE_PER_CPU(unsigned long, 
> > hrtimer_interrupts_saved);
> >  static DEFINE_PER_CPU(bool, watchdog_hardlockup_processed);
> >  static unsigned long watchdog_hardlockup_dumped_stacks;
> >
> > -static bool watchdog_hardlockup_is_lockedup(void)
> > +static bool watchdog_hardlockup_is_lockedup(unsigned int cpu)
> >  {
> > - unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
> > + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
>
> My radar tells me that this should be
> READ_ONCE(per_cpu(hrtimer_interrupts, cpu)) when the value might
> be modified on another CPU. Otherwise, the compiler is allowed
> to split the read into more instructions.
>
> It will be needed for the buddy detector. And it will require
> also incrementing the value in watchdog_hardlockup_interrupt_count()
> an atomic way.
>
> Note that __this_cpu_inc_return() does not guarantee atomicity
> according to my understanding. In theory, the following should
> work because counter will never be incremented in parallel:
>
> static unsigned long watchdog_hardlockup_interrupt_count(void)
> {
> unsigned long count;
>
> count = __this_cpu_read(hrtimer_interrupts);
> count++;
> WRITE_ONCE(*raw_cpu_ptr(hrtimer_interrupts), count);
> }
>
> but it is nasty. A more elegant solution might be using atomic_t
> for hrtimer_interrupts counter.

I switched it over to atomic_t.


> > - if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
> > + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> >   return true;
> >
> > - __this_cpu_write(hrtimer_interrupts_saved, hrint);
> > + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>
> IMHO, hrtimer_interrupts_saved might be handled this way.
> The value is read/written only by this function.
>
> The buddy watchdog should see consistent values even when
> the buddy CPU goes offline. This check should never race
> because this CPU should get touched when another buddy
> gets assigned.
>
> Well, it would deserve a comment.

I spent a bunch of time thinking about this too and I agree that for
hrtimer_interrupts_saved we don't need atomic_t nor even
READ_ONCE/WRITE_ONCE. I've add a comment and a note in the commit
message in v5.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c

2023-05-20 Thread Doug Anderson
Hi,

On Thu, May 11, 2023 at 8:46 AM Petr Mladek  wrote:
>
> > @@ -111,6 +125,11 @@ static void watchdog_hardlockup_interrupt_count(void)
> >
> >  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> >  {
> > + if (__this_cpu_read(watchdog_hardlockup_touch)) {
> > + __this_cpu_write(watchdog_hardlockup_touch, false);
> > + return;
> > + }
>
> If we clear watchdog_hardlockup_touch() here then
> watchdog_hardlockup_check() won't be called yet another
> watchdog_hrtimer_sample_threshold perior.
>
> It means that any touch will cause ignoring one full period.
> The is_hardlockup() check will be done after full two periods.
>
> It is not ideal, see below.
>
> > +
> >   /*
> >* Check for a hardlockup by making sure the CPU's timer
> >* interrupt is incrementing. The timer interrupt should have
> > diff --git a/kernel/watchdog_perf.c b/kernel/watchdog_perf.c
> > index 9be90b2a2ea7..547917ebd5d3 100644
> > --- a/kernel/watchdog_perf.c
> > +++ b/kernel/watchdog_perf.c
> > @@ -112,11 +98,6 @@ static void watchdog_overflow_callback(struct 
> > perf_event *event,
> >   /* Ensure the watchdog never gets throttled */
> >   event->hw.interrupts = 0;
> >
> > - if (__this_cpu_read(watchdog_nmi_touch) == true) {
> > - __this_cpu_write(watchdog_nmi_touch, false);
> > - return;
> > - }
>
> The original code looks wrong. arch_touch_nmi_watchdog() caused
> skipping only one period of the perf event.
>
> I would expect that it caused restarting the period,
> something like:
>
> if (__this_cpu_read(watchdog_nmi_touch) == true) {
> /*
>  * Restart the period after which the interrupt
>  * counter is checked.
>  */
> __this_cpu_write(nmi_rearmed, 0);
> __this_cpu_write(last_timestamp, now);
> __this_cpu_write(watchdog_nmi_touch, false);
> return;
> }
>
> By other words, we should restart the period in the very next perf
> event after the watchdog was touched.
>
> That said, the new code looks better than the original.
> IMHO, the original code was prone to false positives.

I had a little bit of a hard time following, but I _think_ the "tl;dr"
of all the above is that my change is fine. If I misunderstood, please
yell.


> Best Regards,
> Petr
>
> PS: It might be worth fixing this problem in a separate patch at the
> beginning of this patchset. It might be a candidate for stable
> backports.

Done. It's now its own patch and early in the series.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: include kdb_private.h for function prototypes

2023-05-17 Thread Doug Anderson
Hi,

On Wed, May 17, 2023 at 5:48 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> The kdb_kbd_cleanup_state() is called from another file through
> the kdb_private.h file, but that is not included before the
> definition, causing a W=1 warning:
>
> kernel/debug/kdb/kdb_keyboard.c:198:6: error: no previous prototype for 
> 'kdb_kbd_cleanup_state' [-Werror=missing-prototypes]
>
> Signed-off-by: Arnd Bergmann 
> ---
>  kernel/debug/kdb/kdb_keyboard.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] kdb: include header in signal handling code

2023-05-17 Thread Doug Anderson
Hi,

On Wed, May 17, 2023 at 5:54 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> kdb_send_sig() is defined in the signal code and called from kdb,
> but the declaration is part of the kdb internal code.
> Include this from signal.c as well to avoid the warning:
>
> kernel/signal.c:4789:6: error: no previous prototype for 'kdb_send_sig' 
> [-Werror=missing-prototypes]
>
> Signed-off-by: Arnd Bergmann 
> ---
>  kernel/signal.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 07/10] kgdb: Expose default CPUs roundup fallback mechanism

2023-05-15 Thread Doug Anderson
Hi,

On Fri, May 12, 2023 at 6:49 AM Daniel Thompson
 wrote:
>
> On Wed, Apr 19, 2023 at 03:56:01PM -0700, Douglas Anderson wrote:
> > From: Sumit Garg 
> >
> > Add a new API kgdb_smp_call_nmi_hook() to expose default CPUs roundup
> > mechanism to a particular archichecture as a runtime fallback if it
> > detects to not support NMI roundup.
> >
> > Currently such an architecture example is arm64 supporting pseudo NMIs
> > feature which is only available on platforms which have support for GICv3
> > or later version.
> >
> > Signed-off-by: Sumit Garg 
> > Tested-by: Chen-Yu Tsai 
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> > (no changes since v1)
> >
> >  include/linux/kgdb.h  | 12 
> >  kernel/debug/debug_core.c |  8 +++-
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> > index 258cdde8d356..87713bd390f3 100644
> > --- a/include/linux/kgdb.h
> > +++ b/include/linux/kgdb.h
> > @@ -199,6 +199,18 @@ kgdb_arch_handle_qxfer_pkt(char *remcom_in_buffer,
> >
> >  extern void kgdb_call_nmi_hook(void *ignored);
> >
> > +/**
> > + *   kgdb_smp_call_nmi_hook - Provide default fallback mechanism to
> > + *round-up CPUs
> > + *
> > + *   If you're using the default implementation of kgdb_roundup_cpus()
> > + *   this function will be called.  And if an arch detects at runtime to
> > + *   not support NMI based roundup then it can fallback to default
> > + *   mechanism using this API.
> > + */
> > +
> > +extern void kgdb_smp_call_nmi_hook(void);
>
> Concept looks sensible but this is a terrible name for aa command to
> round up the CPUs using smp_call... functions. Whilst it is true it that
> kgdb_roundup_cpus() does use kgdb_call_nmi_hook() internally that
> doesn't mean we should name functions after it. They should be named
> after what they are do, not how they do it.
>
> Something more like kgdb_roundup_cpus_with_smp_call() would be a much
> better name.

Sounds good. I'm happy to spin with this rename, though I was kinda
hoping to drop ${SUBJECT} patch if folks were OK with patch #10 in
this series [1]. I personally think that's the right way to go but
it's unclear to me if arm64 maintainers will think it's a hack
(despite the fact that arm32 implements the "nmi" functions with
regular IPIs).

For now, maybe I'll think positive thoughts and hope that folks will
have the time to review the series soon. If another few weeks go by
then I'll send a v9 with just your comments addressed. If nothing
else, maybe you can land the kgdb parts in a tree targeting v6.5 and
then when arm64 folks have the bandwidth then it will be easier to get
them landed.

[1] 
https://lore.kernel.org/r/20230419155341.v8.10.Ic3659997d6243139d0522fc3afcdfd88d7a5f030@changeid


-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 09/10] arm64: kgdb: Roundup cpus using IPI as NMI

2023-05-15 Thread Doug Anderson
Hi,

On Fri, May 12, 2023 at 7:00 AM Daniel Thompson
 wrote:
>
> On Wed, Apr 19, 2023 at 03:56:03PM -0700, Douglas Anderson wrote:
> > From: Sumit Garg 
> >
> > arm64 platforms with GICv3 or later supports pseudo NMIs which can be
> > leveraged to roundup CPUs which are stuck in hard lockup state with
> > interrupts disabled that wouldn't be possible with a normal IPI.
> >
> > So instead switch to roundup CPUs using IPI turned as NMI. And in
> > case a particular arm64 platform doesn't supports pseudo NMIs,
> > it will switch back to default kgdb CPUs roundup mechanism.
> >
> > Signed-off-by: Sumit Garg 
> > Tested-by: Chen-Yu Tsai 
> > Signed-off-by: Douglas Anderson 
> > ---
> >
> > (no changes since v1)
> >
> >  arch/arm64/kernel/ipi_nmi.c |  5 +
> >  arch/arm64/kernel/kgdb.c| 18 ++
> >  2 files changed, 23 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/ipi_nmi.c b/arch/arm64/kernel/ipi_nmi.c
> > index c592e92b8cbf..2adaaf1519e5 100644
> > --- a/arch/arm64/kernel/ipi_nmi.c
> > +++ b/arch/arm64/kernel/ipi_nmi.c
> > @@ -8,6 +8,7 @@
> >
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -45,10 +46,14 @@ bool arch_trigger_cpumask_backtrace(const cpumask_t 
> > *mask, bool exclude_self)
> >  static irqreturn_t ipi_nmi_handler(int irq, void *data)
> >  {
> >   irqreturn_t ret = IRQ_NONE;
> > + unsigned int cpu = smp_processor_id();
>
> Does this play nice with CONFIG_DEBUG_PREEMPT? I may have missed
> something about the NMI entry but a quick scan of the arm64
> arch_irq_disabled() suggests that debug_smp_processor_id() will issue
> warnings at this point...
>
> Other than I didn't see anything I don't like here.

Good question. It seems to, at least on the sc7180-trogdor-based
system I tested.

Just to confirm, I printed out the values of some of the stuff that's
checked. When this function was called, I saw:

irqs_disabled() => true
raw_local_save_flags() => 0x00f0
__irqflags_uses_pmr() => 1

The "__irqflags_uses_pmr" is the thing that gets set when we try to
enable pseudo-NMIs and they're actually there. That causes us to call
__pmr_irqs_disabled_flags() which will compare the flags (0xf0) to
GIC_PRIO_IRQON (0xe0). If flags is not the same as GIC_PRIO_IRQON then
interrupts are disabled.

...so, assuming I understood everything, I think we're OK.

I also tried to see what happened with the whole "fallback to use
regular IPIs if we don't have pseudo-NMIs enabled" (AKA patch #10 in
this series). In that case, I had:

irqs_disabled() => true
raw_local_save_flags() => 0x00c0
__irqflags_uses_pmr() => 0

...in this case we end up in __daif_irqs_disabled_flags(). That checks
to see if the flags (0xc0) has the "I bit" (0x80) set. It is set, so
interrupts are disabled.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 08/10] kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB

2023-05-11 Thread Doug Anderson
Hi,

On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  wrote:
>
> To save architectures from needing to wrap the call in #ifdefs, add a
> stub no-op version of kgdb_nmicallback(), which returns 1 if it didn't
> handle anything.
>
> Signed-off-by: Douglas Anderson 
> ---
>
> Changes in v8:
> - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
>
>  include/linux/kgdb.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> index 87713bd390f3..9ce628ee47cc 100644
> --- a/include/linux/kgdb.h
> +++ b/include/linux/kgdb.h
> @@ -377,5 +377,6 @@ extern void kgdb_free_init_mem(void);
>  #define dbg_late_init()
>  static inline void kgdb_panic(const char *msg) {}
>  static inline void kgdb_free_init_mem(void) { }
> +static int kgdb_nmicallback(int cpu, void *regs) { return 1; }

FWIW: I just realized that the above needs an "inline" to make the
compiler not complain. I'm still hoping for more feedback on the
series, but I'll plan to fix that in the next spin.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 06/10] arm64: idle: Tag the arm64 idle functions as __cpuidle

2023-05-10 Thread Doug Anderson
Hi,

On Wed, May 10, 2023 at 9:43 AM Mark Rutland  wrote:
>
> On Wed, Apr 19, 2023 at 03:56:00PM -0700, Douglas Anderson wrote:
> > As per the (somewhat recent) comment before the definition of
> > `__cpuidle`, the tag is like `noinstr` but also marks a function so it
> > can be identified by cpu_in_idle(). Let'a add this.
> >
> > After doing this then when we dump stack traces of all processors
> > using nmi_cpu_backtrace() then instead of getting useless backtraces
> > we get things like:
> >
> >   NMI backtrace for cpu N skipped: idling at cpu_do_idle+0x94/0x98
>
> As a heads-up, this is only going to work in the trivial case where a CPU is
> within the default cpu_do_idle(), and not for anything using PSCI
> cpu_suspend() (which I'd *really* hope is the common case).

Thanks for the review and the heads up! Right. It only catches shallow
idle. Specifically this was the stack trace when it was "caught" on a
sc7180-trogdor device:

 cpu_do_idle+0x94/0x98
 psci_enter_idle_state+0x58/0x70
 cpuidle_enter_state+0xb8/0x260
 cpuidle_enter+0x44/0x5c
 do_idle+0x188/0x30c
 ...

I checked in kgdb and saw that "psci_enter_idle_state()" had "idx" as
0, which makes sense since __CPU_PM_CPU_IDLE_ENTER will directly call
cpu_do_idle() when idx is 0.

I agree that it doesn't catch every case. Certainly it's not too hard
to see a CPU here:

 gic_cpu_sys_reg_init+0x1f8/0x314
 gic_cpu_pm_notifier+0x40/0x78
 raw_notifier_call_chain+0x5c/0x134
 cpu_pm_notify+0x38/0x64
 cpu_pm_exit+0x20/0x2c
 psci_enter_idle_state+0x48/0x70
 cpuidle_enter_state+0xb8/0x260
 cpuidle_enter+0x44/0x5c
 do_idle+0x188/0x30c
 ...

...and kgdb showed "idx" was 3.

That being said, catching some of the cases is better than catching
none of the cases. ;-)

In reality, I've seen cases where it catches most CPUs. For instance,
running soon after bootup on my sc7180-trogdor device:

echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

...and then having the "buddy" hardlockup detector [1] catch the crash
caught all of the CPUs other than the one that was locked up and the
one that detected it. Specifically:

NMI backtrace for cpu 2 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 1 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 0 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 3 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 4 skipped: idling at cpu_do_idle+0x94/0x98
NMI backtrace for cpu 7 skipped: idling at cpu_do_idle+0x94/0x98

I haven't analyzed it, but I guess it's possible that when the buddy
hardlockup detector triggers that other CPUs are more likely to be in
a shallow idle? Certainly I seem to catch a lot more CPUs in a shallow
idle in buddy lockup reports...


[1] https://lore.kernel.org/r/20230504221349.1535669-1-diand...@chromium.org


> That doesn't get inlined, and the invocation is shared with other SMCCC users,
> so we probably need more work there if culling idle backtraces is important.

At this point I'd say that we should just take the low hanging fruit
(this patch). If someone later wants to try to do better they can.


> I'm not averse to this change itself.

Any chance you'd be willing to give any tags to it? :-P Do you need me
to add any of the above caveats to the commit message?

I also certainly wouldn't object to this patch landing even if others
aren't ready. It has no dependencies on any other patches and just
makes the debug messages prettier in some cases.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 00/10] arm64: Add framework to turn an IPI as NMI

2023-05-10 Thread Doug Anderson
Hi,

On Wed, May 10, 2023 at 9:30 AM Mark Rutland  wrote:
>
> On Wed, May 10, 2023 at 08:28:17AM -0700, Doug Anderson wrote:
> > Hi,
>
> Hi Doug,
>
> > On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  
> > wrote:
> > > This is an attempt to resurrect Sumit's old patch series [1] that
> > > allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> > > also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> > > could find was v7, so I called this series v8. I haven't copied all of
> > > his old changelongs here, but you can find them from the link.
> > >
> > > Since v7, I have:
> > > * Addressed the small amount of feedback that was there for v7.
> > > * Rebased.
> > > * Added a new patch that prevents us from spamming the logs with idle
> > >   tasks.
> > > * Added an extra patch to gracefully fall back to regular IPIs if
> > >   pseudo-NMIs aren't there.
> > >
> > > Since there appear to be a few different patches series related to
> > > being able to use NMIs to get stack traces of crashed systems, let me
> > > try to organize them to the best of my understanding:
> > >
> > > a) This series. On its own, a) will (among other things) enable stack
> > >traces of all running processes with the soft lockup detector if
> > >you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On
> > >its own, a) doesn't give a hard lockup detector.
> > >
> > > b) A different recently-posted series [2] that adds a hard lockup
> > >detector based on perf. On its own, b) gives a stack crawl of the
> > >locked up CPU but no stack crawls of other CPUs (even if they're
> > >locked too). Together with a) + b) we get everything (full lockup
> > >detect, full ability to get stack crawls).
> > >
> > > c) The old Android "buddy" hard lockup detector [3] that I'm
> > >considering trying to upstream. If b) lands then I believe c) would
> > >be redundant (at least for arm64). c) on its own is really only
> > >useful on arm64 for platforms that can print CPU_DBGPCSR somehow
> > >(see [4]). a) + c) is roughly as good as a) + b).
>
> > It's been 3 weeks and I haven't heard a peep on this series. That
> > means nobody has any objections and it's all good to land, right?
> > Right? :-P
>
> FWIW, there are still longstanding soundness issues in the arm64 pseudo-NMI
> support (and fixing that requires an overhaul of our DAIF / IRQ flag
> management, which I've been chipping away at for a number of releases), so I
> hadn't looked at this in detail yet because the foundations are still somewhat
> dodgy.
>
> I appreciate that this has been around for a while, and it's on my queue to
> look at.

Ah, thanks for the heads up! We've been thinking about turning this on
in production in ChromeOS because it will help us track down a whole
class of field-generated crash reports that are otherwise opaque to
us. It sounds as if maybe that's not a good idea quite yet? Do you
have any idea of how much farther along this needs to go? ...of
course, we've also run into issues with Mediatek devices because they
don't save/restore GICR registers properly [1]. In theory, we might be
able to work around that in the kernel.

In any case, even if there are bugs that would prevent turning this on
for production, it still seems like we could still land this series.
It simply wouldn't do anything until someone turned on pseudo NMIs,
which wouldn't happen till the kinks are worked out.

...actually, I guess I should say that if all the patches of the
current series do land then it actually _would_ still do something,
even without pseudo-NMI. Assuming the last patch looks OK, it would at
least start falling back to using regular IPIs to do backtraces. That
wouldn't get backtraces on hard locked up CPUs but it would be better
than what we have today where we don't get any backtraces. This would
get arm64 on par with arm32...

[1] https://issuetracker.google.com/281831288


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 00/10] arm64: Add framework to turn an IPI as NMI

2023-05-10 Thread Doug Anderson
Hi,

On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  wrote:
>
> This is an attempt to resurrect Sumit's old patch series [1] that
> allowed us to use the arm64 pseudo-NMI to get backtraces of CPUs and
> also to round up CPUs in kdb/kgdb. The last post from Sumit that I
> could find was v7, so I called this series v8. I haven't copied all of
> his old changelongs here, but you can find them from the link.
>
> Since v7, I have:
> * Addressed the small amount of feedback that was there for v7.
> * Rebased.
> * Added a new patch that prevents us from spamming the logs with idle
>   tasks.
> * Added an extra patch to gracefully fall back to regular IPIs if
>   pseudo-NMIs aren't there.
>
> Since there appear to be a few different patches series related to
> being able to use NMIs to get stack traces of crashed systems, let me
> try to organize them to the best of my understanding:
>
> a) This series. On its own, a) will (among other things) enable stack
>traces of all running processes with the soft lockup detector if
>you've enabled the sysctl "kernel.softlockup_all_cpu_backtrace". On
>its own, a) doesn't give a hard lockup detector.
>
> b) A different recently-posted series [2] that adds a hard lockup
>detector based on perf. On its own, b) gives a stack crawl of the
>locked up CPU but no stack crawls of other CPUs (even if they're
>locked too). Together with a) + b) we get everything (full lockup
>detect, full ability to get stack crawls).
>
> c) The old Android "buddy" hard lockup detector [3] that I'm
>considering trying to upstream. If b) lands then I believe c) would
>be redundant (at least for arm64). c) on its own is really only
>useful on arm64 for platforms that can print CPU_DBGPCSR somehow
>(see [4]). a) + c) is roughly as good as a) + b).
>
> [1] 
> https://lore.kernel.org/linux-arm-kernel/1604317487-14543-1-git-send-email-sumit.g...@linaro.org/
> [2] 
> https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.c...@mediatek.com/
> [3] https://issuetracker.google.com/172213097
> [4] https://issuetracker.google.com/172213129
>
> Changes in v8:
> - dynamic_ipi_setup() and dynamic_ipi_teardown() no longer take cpu param
> - dynamic_ipi_setup() and dynamic_ipi_teardown() no longer take cpu param
> - Add loongarch support, too
> - Removed "#ifdef CONFIG_SMP" since arm64 is always SMP
> - "Tag the arm64 idle functions as __cpuidle" new for v8
> - "Provide a stub kgdb_nmicallback() if !CONFIG_KGDB" new for v8
> - "Fallback to a regular IPI if NMI isn't enabled" new for v8
>
> Douglas Anderson (3):
>   arm64: idle: Tag the arm64 idle functions as __cpuidle
>   kgdb: Provide a stub kgdb_nmicallback() if !CONFIG_KGDB
>   arm64: ipi_nmi: Fallback to a regular IPI if NMI isn't enabled
>
> Sumit Garg (7):
>   arm64: Add framework to turn IPI as NMI
>   irqchip/gic-v3: Enable support for SGIs to act as NMIs
>   arm64: smp: Assign and setup an IPI as NMI
>   nmi: backtrace: Allow runtime arch specific override
>   arm64: ipi_nmi: Add support for NMI backtrace
>   kgdb: Expose default CPUs roundup fallback mechanism
>   arm64: kgdb: Roundup cpus using IPI as NMI
>
>  arch/arm/include/asm/irq.h   |   2 +-
>  arch/arm/kernel/smp.c|   3 +-
>  arch/arm64/include/asm/irq.h |   4 ++
>  arch/arm64/include/asm/nmi.h |  17 +
>  arch/arm64/kernel/Makefile   |   2 +-
>  arch/arm64/kernel/idle.c |   4 +-
>  arch/arm64/kernel/ipi_nmi.c  | 103 +++
>  arch/arm64/kernel/kgdb.c |  18 ++
>  arch/arm64/kernel/smp.c  |   8 +++
>  arch/loongarch/include/asm/irq.h |   2 +-
>  arch/loongarch/kernel/process.c  |   3 +-
>  arch/mips/include/asm/irq.h  |   2 +-
>  arch/mips/kernel/process.c   |   3 +-
>  arch/powerpc/include/asm/nmi.h   |   2 +-
>  arch/powerpc/kernel/stacktrace.c |   3 +-
>  arch/sparc/include/asm/irq_64.h  |   2 +-
>  arch/sparc/kernel/process_64.c   |   4 +-
>  arch/x86/include/asm/irq.h   |   2 +-
>  arch/x86/kernel/apic/hw_nmi.c|   3 +-
>  drivers/irqchip/irq-gic-v3.c |  29 ++---
>  include/linux/kgdb.h |  13 
>  include/linux/nmi.h  |  12 ++--
>  kernel/debug/debug_core.c|   8 ++-
>  23 files changed, 217 insertions(+), 32 deletions(-)

It's been 3 weeks and I haven't heard a peep on this series. That
means nobody has any objections and it's all good to land, right?
Right? :-P

Seriously, though, I really think we should figure out how to get this
landed. There's obviously interest from several different parties and
I'm chomping at the bit waiting to spin this series according to your
wishes. I also don't think there's anything super scary/ugly here. IMO
the ideal situation is that folks are OK with the idea presented in
the last patch in the series ("arm64: ipi_nmi: Fallback to a regular
IPI if NMI isn't enabled") and then I can re-spin this series to be
_much_ simpler. That being said, I'm also OK with 

Re: [Kgdb-bugreport] [PATCH v4 05/17] watchdog/hardlockup: Rename touch_nmi_watchdog() to touch_hardlockup_watchdog()

2023-05-08 Thread Doug Anderson
Hi,

On Sun, May 7, 2023 at 6:35 PM Nicholas Piggin  wrote:
>
> On Sat May 6, 2023 at 2:37 AM AEST, Doug Anderson wrote:
> > Hi,
> >
> > On Thu, May 4, 2023 at 7:51 PM Nicholas Piggin  wrote:
> > >
> > > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > > > In preparation for the buddy hardlockup detector, rename
> > > > touch_nmi_watchdog() to touch_hardlockup_watchdog() to make it clear
> > > > that it will touch whatever hardlockup detector is configured. We'll
> > > > add a #define for the old name (touch_nmi_watchdog) so that we don't
> > > > have to touch every piece of code referring to the old name.
> > >
> > > Is this really helpful? Now it's got two names Could just leave it.
> > > If you insist then it'd be better just to rename everything in one
> > > go at the end of a merge window IMO. Conflicts would be trivial.
> >
> > I'm not picky here. I changed the name since Petr requested names to
> > be changed for any code I was touching [1] and so I threw this out as
> > a proposal. I agree that having two names can be confusing, but in
> > this case it didn't feel too terrible to me.
> >
> > I'd love to hear Petr's opinion on this name change. I'm happy with:
> >
> > a) This patch as it is.
> >
> > b) Dropping this patch (or perhaps just changing it to add comments).
> >
> > c) Changing this patch to rename all 70 uses of the old name. Assuming
> > this will go through Andrew Morton's tree, I'd be interested in
> > whether he's OK w/ this.
> >
> > d) Dropping this patch from this series but putting it on the
> > backburner to try to do later (so that the rename can happen at a time
> > when it's least disruptive).
> >
> >
> > > > Ideally this change would also rename the arch_touch_nmi_watchdog(),
> > > > but that is harder since arch_touch_nmi_watchdog() is exported with
> > > > EXPORT_SYMBOL() and thus is ABI. Add a comment next to the call to
> > > > hopefully alleviate some of the confusion here.
> > >
> > > We don't keep ABI fixed upstream.
> >
> > I'm happy to be corrected, but my understanding was that kernel devs
> > made an effort not to mess with things exported via "EXPORT_SYMBOL",
> > but things exported via "EXPORT_SYMBOL_GPL" were fair game.
>
> I don't think that's the case. If anything people might be a bit more
> inclined to accommodate GPL exports for out of tree modules that use
> them.
>
> > I guess maybe my patch calling it "ABI" is a stronger statement than
> > that, though. Doing a little more research, nobody wants to say that
> > things exported with "EXPORT_SYMBOL" are ABI, they just want to say
> > that we make an effort to have them be more stable.
>
> We wouldn't break any symbol for no reason, but in this case there is a
> good reason. If the name change is important for clarity then we change
> it. And this is about the easiest change for an out of tree module to
> deal with, so it should be no big deal for them.

OK, fair enough. My current plan is to wait a few more days to see if
anyone else chimes in with opinions. If I don't hear anything, in my
next version I will rename _neither_ touch_nmi_watchdog() nor
arch_touch_nmi_watchdog(). I'll still add comments indicating that
these functions touch the "hardlockup" watchdog but I won't attempt
the rename just to keep the series simpler.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-08 Thread Doug Anderson
Hi,

On Sun, May 7, 2023 at 6:05 PM Nicholas Piggin  wrote:
>
> > No, I wasn't aware of it. Interesting, it seems to basically enable
> > both types of hardlockup detectors together. If that really catches
> > more lockups, it seems like we could do the same thing for the buddy
> > system.
>
> It doesn't catch more lockups. On powerpc we don't have a reliable
> periodic NMI hence the SMP checker. But it is preferable that a CPU
> detects its own lockup because NMI IPIs can result in crashes if
> they are taken in certain critical sections.

Ah, interesting. Is the fact that NMI IPIs can crash the system
something specific to the way they're implemented in powerpc, or is
this something common in Linux? I've been assuming that IPI NMIs would
be roughly as reliable (or perhaps more reliable) than the NMI
timer/perf counter.


> > > It's all to
> > > all rather than buddy which makes it more complicated but arguably
> > > bit better functionality.
> >
> > Can you come up with an example crash where the "all to all" would
> > work better than the simple buddy system provided by this patch?
>
> CPU2 CPU3
> spin_lock_irqsave(A) spin_lock_irqsave(B)
> spin_lock_irqsave(B) spin_lock_irqsave(A)
>
> CPU1 will detect the lockup on CPU2, but CPU3's lockup won't be
> detected so we don't get the trace that can diagnose the bug.

Ah, that makes sense as a benefit if
"sysctl_hardlockup_all_cpu_backtrace" is false, which you'd probably
want if you had massively SMP systems. ...of course, if you had a lot
of cores then I'd imagine the "all-to-all" might become a lot of
overhead...

Hmmm, but I don't think you really need "all-to-all" checking to get
the stacktraces you want, do you? Each CPU can be "watching" exactly
one other CPU, but then when we actually lock up we could check all of
them and dump stacks on all the ones that are locked up. I think this
would be a fairly easy improvement for the buddy system. I'll leave it
out for now just to keep things simple for the initial landing, but it
wouldn't be hard to add. Then I think the two SMP systems  (buddy vs.
all-to-all) would be equivalent in terms of functionality?


Thinking about this more, I guess you must be assuming coordination
between the "SMP" and "non-SMP" checker. Specifically I guess you'd
want some guarantee that the "SMP" checker always fires before the
non-SMP checker because that would have a higher chance of getting a
stack trace without tickling the IPI NMI crash. ...but then once the
non-SMP checker printed its own stack trace you'd like the SMP checker
running on the same CPU to check to see if any other CPUs were locked
up so you could get their stacks as well. With my simplistic solution
of just allowing the buddy detector to be enabled in parallel with a
perf-based detector then we wouldn't have this level of coordination,
but I'll assume that's OK for the initial landing.


> Another thing I actually found it useful for is you can easily
> see if a core (i.e., all threads in the core) or a chip has
> died. Maybe more useful when doing presilicon and bring up work
> or firmware hacking, but still useful.

I'm probably biased, but for bringup work and stuff like this I
usually have kdb/kgdb enabled and then I could get stack traces for
whichever CPUs I want. :-P

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 7:58 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > The perf hardlockup detector works by looking at interrupt counts and
> > seeing if they change from run to run. The interrupt counts are
> > managed by the common watchdog code via its watchdog_timer_fn().
> >
> > Currently the API between the perf detector and the common code is a
> > function: is_hardlockup(). When the hard lockup detector sees that
> > function return true then it handles printing out debug info and
> > inducing a panic if necessary.
> >
> > Let's change the API a little bit in preparation for the buddy
> > hardlockup detector. The buddy hardlockup detector wants to print
>
> I think the name change is a gratuitous. Especially since it's now
> static.
>
> watchdog_hardlockup_ is a pretty long prefix too, hardlockup_
> should be enough?
>
> Seems okay otherwise though.

I went back and forth on names far too much when constructing this
patch series. Mostly I was trying to balance what looked good to me
and what Petr suggested [1]. I'm not super picky about the names and
I'm happy to change them all to a "hardlockup_" prefix. I'd love to
hear Petr's opinion.

[1] https://lore.kernel.org/r/ZFErmshcrcikrSU1@alley


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 7:36 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that doesn't doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
>
> Powerpc's watchdog has an SMP checker, did you see it?

No, I wasn't aware of it. Interesting, it seems to basically enable
both types of hardlockup detectors together. If that really catches
more lockups, it seems like we could do the same thing for the buddy
system. If people want, I don't think it would be very hard to make
the buddy system _not_ exclusive of the perf system. Instead of having
the buddy system implement the "weak" functions I could just call the
buddy functions in the right places directly and leave the "weak"
functions for a more traditional hardlockup detector to implement.
Opinions?

Maybe after all this lands, the powerpc watchdog could move to use the
common code? As evidenced by this patch series, there's not really a
reason for the SMP detection to be platform specific.


> It's all to
> all rather than buddy which makes it more complicated but arguably
> bit better functionality.

Can you come up with an example crash where the "all to all" would
work better than the simple buddy system provided by this patch? It
seems like they would be equivalent, but I could be missing something.
Specifically they both need at least one non-locked-up CPU to detect a
problem. If one or more CPUs is locked up then we'll always detect it.
I suppose maybe you could provide a better error message at lockup
time saying that several CPUs were locked up and that could be
helpful. For now, I'd keep the current buddy system the way it is and
if you want to provide a patch improving things to be "all-to-all" in
the future that would be interesting to review.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 11/17] watchdog/hardlockup: Rename some "NMI watchdog" constants/function

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 8:07 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > Do a search and replace of:
> > - NMI_WATCHDOG_ENABLED => HARD_WATCHDOG_ENABLED
> > - watchdog_nmi_ => watchdog_hardlockup_
>
> These are just making prefixes inconsistent again.
>
> If you really want to do a prefix, I would call it hardlockup which
> probably best matches existing code and sysctl / boot stuff, and
> concentrate on non-static symbols.

As with other similar patches, I'm happy to drop this and am doing it
at Petr's request.

[1] https://lore.kernel.org/r/ZFErmshcrcikrSU1@alley


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 08/17] watchdog/hardlockup: Style changes to watchdog_hardlockup_check() / ..._is_lockedup()

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 8:02 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > These are tiny style changes:
> > - Add a blank line before a "return".
> > - Renames two globals to use the "watchdog_hld" prefix.
>
> Particularly static ones don't really need the namespace prefixes.

Renames are mostly at Petr's request. If I've misunderstood what he
wants here that I'm happy to remove them.


> Not sure if processed is better than warn.

I can undo this one if you want. It felt like we were doing more than
just warning, but if people think "warn" is a better way to describe
it then that's fine with me.


> allcpu_dumped is better
> than dumped_stacks though because the all-CPUs-dump is a particular
> thing.

OK, I can undo this and leave it as "allcpu_dumped".


[1] https://lore.kernel.org/r/ZFErmshcrcikrSU1@alley


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v4 05/17] watchdog/hardlockup: Rename touch_nmi_watchdog() to touch_hardlockup_watchdog()

2023-05-05 Thread Doug Anderson
Hi,

On Thu, May 4, 2023 at 7:51 PM Nicholas Piggin  wrote:
>
> On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > In preparation for the buddy hardlockup detector, rename
> > touch_nmi_watchdog() to touch_hardlockup_watchdog() to make it clear
> > that it will touch whatever hardlockup detector is configured. We'll
> > add a #define for the old name (touch_nmi_watchdog) so that we don't
> > have to touch every piece of code referring to the old name.
>
> Is this really helpful? Now it's got two names Could just leave it.
> If you insist then it'd be better just to rename everything in one
> go at the end of a merge window IMO. Conflicts would be trivial.

I'm not picky here. I changed the name since Petr requested names to
be changed for any code I was touching [1] and so I threw this out as
a proposal. I agree that having two names can be confusing, but in
this case it didn't feel too terrible to me.

I'd love to hear Petr's opinion on this name change. I'm happy with:

a) This patch as it is.

b) Dropping this patch (or perhaps just changing it to add comments).

c) Changing this patch to rename all 70 uses of the old name. Assuming
this will go through Andrew Morton's tree, I'd be interested in
whether he's OK w/ this.

d) Dropping this patch from this series but putting it on the
backburner to try to do later (so that the rename can happen at a time
when it's least disruptive).


> > Ideally this change would also rename the arch_touch_nmi_watchdog(),
> > but that is harder since arch_touch_nmi_watchdog() is exported with
> > EXPORT_SYMBOL() and thus is ABI. Add a comment next to the call to
> > hopefully alleviate some of the confusion here.
>
> We don't keep ABI fixed upstream.

I'm happy to be corrected, but my understanding was that kernel devs
made an effort not to mess with things exported via "EXPORT_SYMBOL",
but things exported via "EXPORT_SYMBOL_GPL" were fair game.

I guess maybe my patch calling it "ABI" is a stronger statement than
that, though. Doing a little more research, nobody wants to say that
things exported with "EXPORT_SYMBOL" are ABI, they just want to say
that we make an effort to have them be more stable.

So certainly I should adjust my patch series not to call it ABI, but
I'm still on the fence about whether I should rename this or not. I'd
love to hear other opinions. This rename actually would be a lot
easier than the touch_nmi_watchdog() one since the code referencing
the name "arch_touch_nmi_watchdog" isn't spread so broadly through the
kernel.

[1] https://lore.kernel.org/r/ZFErmshcrcikrSU1@alley

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-04 Thread Doug Anderson
Hi,

On Mon, May 1, 2023 at 8:25 AM Douglas Anderson  wrote:
>
> From: Colin Cross 
>
> Implement a hardlockup detector that doesn't doesn't need any extra
> arch-specific support code to detect lockups. Instead of using
> something arch-specific we will use the buddy system, where each CPU
> watches out for another one. Specifically, each CPU will use its
> softlockup hrtimer to check that the next CPU is processing hrtimer
> interrupts by verifying that a counter is increasing.
>
> NOTE: unlike the other hard lockup detectors, the buddy one can't
> easily show what's happening on the CPU that locked up just by doing a
> simple backtrace. It relies on some other mechanism in the system to
> get information about the locked up CPUs. This could be support for
> NMI backtraces like [1], it could be a mechanism for printing the PC
> of locked CPUs at panic time like [2] / [3], or it could be something
> else. Even though that means we still rely on arch-specific code, this
> arch-specific code seems to often be implemented even on architectures
> that don't have a hardlockup detector.
>
> This style of hardlockup detector originated in some downstream
> Android trees and has been rebased on / carried in ChromeOS trees for
> quite a long time for use on arm and arm64 boards. Historically on
> these boards we've leveraged mechanism [2] / [3] to get information
> about hung CPUs, but we could move to [1].
>
> Although the original motivation for the buddy system was for use on
> systems without an arch-specific hardlockup detector, it can still be
> useful to use even on systems that _do_ have an arch-specific
> hardlockup detector. On x86, for instance, there is a 24-part patch
> series [4] in progress switching the arch-specific hard lockup
> detector from a scarce perf counter to a less-scarce hardware
> resource. Potentially the buddy system could be a simpler alternative
> to free up the perf counter but still get hard lockup detection.
>
> Overall, pros (+) and cons (-) of the buddy system compared to an
> arch-specific hardlockup detector:
> + The buddy system is usable on systems that don't have an
>   arch-specific hardlockup detector, like arm32 and arm64 (though it's
>   being worked on for arm64 [5]).
> + The buddy system may free up scarce hardware resources.
> + If a CPU totally goes out to lunch (can't process NMIs) the buddy
>   system could still detect the problem (though it would be unlikely
>   to be able to get a stack trace).
> + The buddy system uses the same timer function to pet the hardlockup
>   detector on the running CPU as it uses to detect hardlockups on
>   other CPUs. Compared to other hardlockup detectors, this means it
>   generates fewer interrupts and thus is likely better able to let
>   CPUs stay idle longer.
> - If all CPUs are hard locked up at the same time the buddy system
>   can't detect it.
> - If we don't have SMP we can't use the buddy system.
> - The buddy system needs an arch-specific mechanism (possibly NMI
>   backtrace) to get info about the locked up CPU.
>
> [1] https://lore.kernel.org/r/20230419225604.21204-1-diand...@chromium.org
> [2] https://issuetracker.google.com/172213129
> [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
> [4] 
> https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calde...@linux.intel.com/
> [5] 
> https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.c...@mediatek.com/
>
> Signed-off-by: Colin Cross 
> Signed-off-by: Matthias Kaehlcke 
> Signed-off-by: Guenter Roeck 
> Signed-off-by: Tzung-Bi Shih 
> Signed-off-by: Douglas Anderson 
> ---
> This patch has been rebased in ChromeOS kernel trees many times, and
> each time someone had to do work on it they added their
> Signed-off-by. I've included those here. I've also left the author as
> Colin Cross since the core code is still his.
>
> I'll also note that the CC list is pretty giant, but that's what
> get_maintainers came up with (plus a few other folks I thought would
> be interested). As far as I can tell, there's no true MAINTAINER
> listed for the existing watchdog code. Assuming people don't hate
> this, maybe it would go through Andrew Morton's tree?
>
> Changes in v3:
> - More cpu => CPU (in Kconfig and comments).
> - Added a note in commit message about the effect on idle.
> - Cleaned up commit message pros/cons to be complete sentences.
> - No code changes other than comments.
>
> Changes in v2:
> - cpu => CPU (in commit message).
> - Reworked description and Kconfig based on v1 discussion.
> - No code changes.
>
>  include/linux/nmi.h |  18 -
>  kernel/Makefile |   1 +
>  kernel/watchdog.c   |  24 --
>  kernel/watchdog_buddy_cpu.c | 141 
>  lib/Kconfig.debug   |  23 +-
>  5 files changed, 196 insertions(+), 11 deletions(-)

To leave breadcrumbs: I've posted v4 which is now a big series


Re: [Kgdb-bugreport] shared code: was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-04 Thread Doug Anderson
Hi,

On Tue, May 2, 2023 at 8:26 AM Petr Mladek  wrote:
>
> > +int __weak watchdog_nmi_enable(unsigned int cpu) { return 0; }
> > +void __weak watchdog_nmi_disable(unsigned int cpu) { return; }
>
> Honestly, the mix of softlockup and hardlockup code was a hard to
> follow even before this patch. And it is going to be worse.
>
> Anyway, the buddy watchdog is not using NMI at all. It should not
> get enable using a function called *_nmi_enabled().

Thanks for your review! I'm not going to individually reply to all
your comments below, but I've sent out a v4 [1] where I think I've
done a semi-decent job of making this a little cleaner (not perfect,
but hopefully a step in the right direction). Please take a look.

[1] https://lore.kernel.org/r/20230504221349.1535669-1-diand...@chromium.org


> Also some comments are not longer valid, for example:
>
> static void watchdog_enable(unsigned int cpu)
> {
> [...]
> /* Enable the perf event */
> if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
> watchdog_nmi_enable(cpu);

Ugh, after I sent the new version I just realized that I missed
updating the above comment. :( If I need to send a v5 I can update it
then, or if v4 lands I can send a follow-on patch to update that
comment. My eyes are glazed over enough from trying to organize a
17-patch series, so I somewhat imagine it's not the only comment I
missed...

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] cpu hotplug : was: Re: [PATCH v3] hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-04 Thread Doug Anderson
Hi,

On Tue, May 2, 2023 at 8:23 AM Petr Mladek  wrote:
>
> On Mon 2023-05-01 08:24:46, Douglas Anderson wrote:
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that doesn't doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > --- /dev/null
> > +++ b/kernel/watchdog_buddy_cpu.c
> > +int watchdog_nmi_enable(unsigned int cpu)
> > +{
> > + /*
> > +  * The new CPU will be marked online before the first hrtimer 
> > interrupt
> > +  * runs on it.
>
> It does not need to be the first hrtimer interrupt. The CPU might have
> been offlined/onlined repeatedly. The counter might have any value.
>
> > +  * If another CPU tests for a hardlockup on the new CPU
> > +  * before it has run its first hrtimer, it will get a false positive.
> > +  * Touch the watchdog on the new CPU to delay the first check for at
> > +  * least 3 sampling periods to guarantee one hrtimer has run on the 
> > new
> > +  * CPU.
> > +  */

OK, I've updated the above comment to:

/*
 * The new CPU will be marked online before the hrtimer interrupt
 * gets a chance to run on it. If another CPU tests for a
 * hardlockup on the new CPU before it has run its the hrtimer
 * interrupt, it will get a false positive. Touch the watchdog on
 * the new CPU to delay the check for at least 3 sampling periods
 * to guarantee one hrtimer has run on the new CPU.
 */

> > + per_cpu(watchdog_touch, cpu) = true;
>
> We should touch also the next_cpu:
>
> /*
>  * We are going to check the next CPU. Our watchdog_hrtimer
>  * need not be zero if the CPU has already been online earlier.
>  * Touch the watchdog on the next CPU to avoid false positive
>  * if we try to check it in less then 3 interrupts.
>  */
> next_cpu = watchdog_next_cpu(cpu);
> if (next_cpu < nr_cpu_ids)
> per_cpu(watchdog_touch, next_cpu) = true;
>
> Alternative would be to clear watchdog_hrtimer. But it would kind-of
> affect also the softlockup detector.

Looks reasonable. I've incorporated it.


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2] hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-01 Thread Doug Anderson
Hi,

On Sat, Apr 29, 2023 at 2:22 PM Ian Rogers  wrote:
>
> On Fri, Apr 28, 2023 at 4:41 PM Douglas Anderson  
> wrote:
> >
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that doesn't doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily show what's happening on the CPU that locked up just by doing a
> > simple backtrace. It relies on some other mechanism in the system to
> > get information about the locked up CPUs. This could be support for
> > NMI backtraces like [1], it could be a mechanism for printing the PC
> > of locked CPUs at panic time like [2] / [3], or it could be something
> > else. Even though that means we still rely on arch-specific code, this
> > arch-specific code seems to often be implemented even on architectures
> > that don't have a hardlockup detector.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] / [3] to get information
> > about hung CPUs, but we could move to [1].
> >
> > Although the original motivation for the buddy system was for use on
> > systems without an arch-specific hardlockup detector, it can still be
> > useful to use even on systems that _do_ have an arch-specific
> > hardlockup detector. On x86, for instance, there is a 24-part patch
> > series [4] in progress switching the arch-specific hard lockup
> > detector from a scarce perf counter to a less-scarce hardware
> > resource. Potentially the buddy system could be a simpler alternative
> > to free up the perf counter but still get hard lockup detection.
> >
> > Overall, pros (+) and cons (-) of the buddy system compared to an
> > arch-specific hardlockup detector:
> > + Usable on systems that don't have an arch-specific hardlockup
> >   detector, like arm32 and arm64 (though it's being worked on for
> >   arm64 [5]).
> > + May free up scarce hardware resources.
> > + If a CPU totally goes out to lunch (can't process NMIs) the buddy
> >   system could still detect the problem (though it would be unlikely
> >   to be able to get a stack trace).
> > - If all CPUs are hard locked up at the same time the buddy system
> >   can't detect it.
> > - If we don't have SMP we can't use the buddy system.
> > - The buddy system needs an arch-specific mechanism (possibly NMI
> >   backtrace) to get info about the locked up CPU.
>
> Thanks for this list, it is really useful! Is it worth mentioning the
> behavior around idle? Could this approach potentially use more power?

Sure, I'll add some text in there. If I'm analyzing the code properly,
my belief is that, if anything, the buddy detector should be better
for idle/power than some other detectors.

Specifically, note that the main "worker" of the buddy detector is
called from watchdog_timer_fn(). The timer function is the same one
that runs for other hard lockup detectors, but in those cases it
_only_ pets the watchdog of the running CPU. With the buddy detector
it pets the running CPU's watchdog and then checks on the buddy's
watchdog. There is no separate wakeup / interrupt that needs to run
periodically to look for hard lockups.

I'm about to send a v3 to fix the cpu=>CPU that I missed on v2. I'll
add text similar to the above to the commit message.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v2] hardlockup: detect hard lockups using secondary (buddy) CPUs

2023-05-01 Thread Doug Anderson
Hi,


On Fri, Apr 28, 2023 at 5:36 PM Randy Dunlap  wrote:
>
> Hi--
>
> On 4/28/23 16:37, Douglas Anderson wrote:
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that doesn't doesn't need any extra
> > arch-specific support code to detect lockups. Instead of using
> > something arch-specific we will use the buddy system, where each CPU
> > watches out for another one. Specifically, each CPU will use its
> > softlockup hrtimer to check that the next CPU is processing hrtimer
> > interrupts by verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily show what's happening on the CPU that locked up just by doing a
> > simple backtrace. It relies on some other mechanism in the system to
> > get information about the locked up CPUs. This could be support for
> > NMI backtraces like [1], it could be a mechanism for printing the PC
> > of locked CPUs at panic time like [2] / [3], or it could be something
> > else. Even though that means we still rely on arch-specific code, this
> > arch-specific code seems to often be implemented even on architectures
> > that don't have a hardlockup detector.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] / [3] to get information
> > about hung CPUs, but we could move to [1].
> >
> > Although the original motivation for the buddy system was for use on
> > systems without an arch-specific hardlockup detector, it can still be
> > useful to use even on systems that _do_ have an arch-specific
> > hardlockup detector. On x86, for instance, there is a 24-part patch
> > series [4] in progress switching the arch-specific hard lockup
> > detector from a scarce perf counter to a less-scarce hardware
> > resource. Potentially the buddy system could be a simpler alternative
> > to free up the perf counter but still get hard lockup detection.
> >
> > Overall, pros (+) and cons (-) of the buddy system compared to an
> > arch-specific hardlockup detector:
> > + Usable on systems that don't have an arch-specific hardlockup
> >   detector, like arm32 and arm64 (though it's being worked on for
> >   arm64 [5]).
> > + May free up scarce hardware resources.
> > + If a CPU totally goes out to lunch (can't process NMIs) the buddy
> >   system could still detect the problem (though it would be unlikely
> >   to be able to get a stack trace).
> > - If all CPUs are hard locked up at the same time the buddy system
> >   can't detect it.
> > - If we don't have SMP we can't use the buddy system.
> > - The buddy system needs an arch-specific mechanism (possibly NMI
> >   backtrace) to get info about the locked up CPU.
> >
> > [1] https://lore.kernel.org/r/20230419225604.21204-1-diand...@chromium.org
> > [2] https://issuetracker.google.com/172213129
> > [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
> > [4] 
> > https://lore.kernel.org/lkml/20230301234753.28582-1-ricardo.neri-calde...@linux.intel.com/
> > [5] 
> > https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.c...@mediatek.com/
> >
> > Signed-off-by: Colin Cross 
> > Signed-off-by: Matthias Kaehlcke 
> > Signed-off-by: Guenter Roeck 
> > Signed-off-by: Tzung-Bi Shih 
> > Signed-off-by: Douglas Anderson 
> > ---
> > This patch has been rebased in ChromeOS kernel trees many times, and
> > each time someone had to do work on it they added their
> > Signed-off-by. I've included those here. I've also left the author as
> > Colin Cross since the core code is still his.
> >
> > I'll also note that the CC list is pretty giant, but that's what
> > get_maintainers came up with (plus a few other folks I thought would
> > be interested). As far as I can tell, there's no true MAINTAINER
> > listed for the existing watchdog code. Assuming people don't hate
> > this, maybe it would go through Andrew Morton's tree?
> >
> > Changes in v2:
> > - cpu => CPU.
> > - Reworked description and Kconfig based on v1 discussion.
>
> or at least some of the comments from v1. :(

Oh no! My email program confused me and I thought all of your cpu=>CPU
stuff was in the patch description, not in the Kconfig. I'll whip up a
quick v3.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus

2023-04-25 Thread Doug Anderson
Hi,

On Mon, Apr 24, 2023 at 9:58 PM Chen-Yu Tsai  wrote:
>
> On Mon, Apr 24, 2023 at 11:42 PM Doug Anderson  wrote:
> >
> > Hi,
> >
> > On Mon, Apr 24, 2023 at 5:54 AM Daniel Thompson
> >  wrote:
> > >
> > > On Fri, Apr 21, 2023 at 03:53:30PM -0700, Douglas Anderson wrote:
> > > > From: Colin Cross 
> > > >
> > > > Implement a hardlockup detector that can be enabled on SMP systems
> > > > that don't have an arch provided one or one implemented atop perf by
> > > > using interrupts on other cpus. Each cpu will use its softlockup
> > > > hrtimer to check that the next cpu is processing hrtimer interrupts by
> > > > verifying that a counter is increasing.
> > > >
> > > > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > > > easily provide a backtrace on the CPU that locked up. It relies on
> > > > some other mechanism in the system to get information about the locked
> > > > up CPUs. This could be support for NMI backtraces like [1], it could
> > > > be a mechanism for printing the PC of locked CPUs like [2], or it
> > > > could be something else.
> > > >
> > > > This style of hardlockup detector originated in some downstream
> > > > Android trees and has been rebased on / carried in ChromeOS trees for
> > > > quite a long time for use on arm and arm64 boards. Historically on
> > > > these boards we've leveraged mechanism [2] to get information about
> > > > hung CPUs, but we could move to [1].
> > >
> > > On the Arm platforms is this code able to leverage the existing
> > > infrastructure to extract status from stuck CPUs:
> > > https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html
> >
> > Yup! I wasn't explicit about this, but that's where you end up if you
> > follow the whole bug tracker item that was linked as [2].
> > Specifically, we used to have downstream patches in the ChromeOS that
> > just reached into the coresight range from a SoC specific driver and
> > printed out the CPU_DBGPCSR. When Brian was uprevving rk3399
> > Chromebooks he found that the equivalent functionality had made it
> > upstream in a generic way through the coresight framework. Brian
> > confirmed it was working on rk3399 and made all of the device tree
> > changes needed to get it all hooked up, so (at least for that SoC) it
> > should work on that SoC.
> >
> > [2] https://issuetracker.google.com/172213129
>
> IIRC with the coresight CPU debug driver enabled and the proper DT nodes
> added, the panic handler does dump out information from the hardware.
> I don't think it's wired up for hung tasks though.

Yes, that's correct. The coresight CPU debug driver doesn't work for
hung tasks because it can't get a real stack crawl. All it can get is
the PC of the last branch that the CPU took. This is why combining
${SUBJECT} patch with the ability to get stack traces via pseudo-NMI
is superior. That being said, even with just the coresight CPU debug
driver ${SUBJECT} patch is still helpful because (assuming
"hardlockup_panic" is set) we'll do a panic which will then trigger
the coresight CPU debug driver. :-)

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH v8 03/10] arm64: smp: Assign and setup an IPI as NMI

2023-04-24 Thread Doug Anderson
Hi,

On Wed, Apr 19, 2023 at 3:57 PM Douglas Anderson  wrote:
>
> From: Sumit Garg 
>
> Assign an unused IPI which can be turned as NMI using ipi_nmi framework.
> Also, invoke corresponding dynamic IPI setup/teardown APIs.
>
> Signed-off-by: Sumit Garg 
> Reviewed-by: Masayoshi Mizuma 
> Tested-by: Chen-Yu Tsai 
> Signed-off-by: Douglas Anderson 
> ---
>
> Changes in v8:
> - dynamic_ipi_setup() and dynamic_ipi_teardown() no longer take cpu param
>
>  arch/arm64/kernel/smp.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 4e8327264255..94ff063527c6 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -938,6 +939,8 @@ static void ipi_setup(int cpu)
>
> for (i = 0; i < nr_ipi; i++)
> enable_percpu_irq(ipi_irq_base + i, 0);
> +
> +   dynamic_ipi_setup();
>  }
>
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -950,6 +953,8 @@ static void ipi_teardown(int cpu)
>
> for (i = 0; i < nr_ipi; i++)
> disable_percpu_irq(ipi_irq_base + i);
> +
> +   dynamic_ipi_teardown();
>  }
>  #endif
>
> @@ -971,6 +976,9 @@ void __init set_smp_ipi_range(int ipi_base, int n)
> irq_set_status_flags(ipi_base + i, IRQ_HIDDEN);
> }
>
> +   if (n > nr_ipi)
> +   set_smp_dynamic_ipi(ipi_base + nr_ipi);

From thinking about this patch more, I'm guessing that the biggest
objection someone could have is that this uses up the "last" IPI slot
in any systems that are stuck with only 8. Is that the reason that
this patch series stagnated in the past, or was it something else?

If this is truly the concern that people have, it doesn't look like it
would be terribly hard to combine the existing IPI_CPU_STOP and
IPI_CPU_CRASH_STOP. Presumably we could just get rid of the "crash
stop" IPI and have the normal "stop" IPI do the crash if
"waiting_for_crash_ipi" is non-zero. If that's the thing blocking the
series from moving forward then I can add that to the series, or we
could just wait until someone actually needs another IPI. ;-)

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus

2023-04-24 Thread Doug Anderson
Hi,

On Mon, Apr 24, 2023 at 5:54 AM Daniel Thompson
 wrote:
>
> On Fri, Apr 21, 2023 at 03:53:30PM -0700, Douglas Anderson wrote:
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that can be enabled on SMP systems
> > that don't have an arch provided one or one implemented atop perf by
> > using interrupts on other cpus. Each cpu will use its softlockup
> > hrtimer to check that the next cpu is processing hrtimer interrupts by
> > verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily provide a backtrace on the CPU that locked up. It relies on
> > some other mechanism in the system to get information about the locked
> > up CPUs. This could be support for NMI backtraces like [1], it could
> > be a mechanism for printing the PC of locked CPUs like [2], or it
> > could be something else.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] to get information about
> > hung CPUs, but we could move to [1].
>
> On the Arm platforms is this code able to leverage the existing
> infrastructure to extract status from stuck CPUs:
> https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html

Yup! I wasn't explicit about this, but that's where you end up if you
follow the whole bug tracker item that was linked as [2].
Specifically, we used to have downstream patches in the ChromeOS that
just reached into the coresight range from a SoC specific driver and
printed out the CPU_DBGPCSR. When Brian was uprevving rk3399
Chromebooks he found that the equivalent functionality had made it
upstream in a generic way through the coresight framework. Brian
confirmed it was working on rk3399 and made all of the device tree
changes needed to get it all hooked up, so (at least for that SoC) it
should work on that SoC.

[2] https://issuetracker.google.com/172213129


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] hardlockup: detect hard lockups using secondary (buddy) cpus

2023-04-24 Thread Doug Anderson
Hi,

On Fri, Apr 21, 2023 at 6:20 PM Ian Rogers  wrote:
>
> On Fri, Apr 21, 2023 at 3:54 PM Douglas Anderson  
> wrote:
> >
> > From: Colin Cross 
> >
> > Implement a hardlockup detector that can be enabled on SMP systems
> > that don't have an arch provided one or one implemented atop perf by
> > using interrupts on other cpus. Each cpu will use its softlockup
> > hrtimer to check that the next cpu is processing hrtimer interrupts by
> > verifying that a counter is increasing.
> >
> > NOTE: unlike the other hard lockup detectors, the buddy one can't
> > easily provide a backtrace on the CPU that locked up. It relies on
> > some other mechanism in the system to get information about the locked
> > up CPUs. This could be support for NMI backtraces like [1], it could
> > be a mechanism for printing the PC of locked CPUs like [2], or it
> > could be something else.
> >
> > This style of hardlockup detector originated in some downstream
> > Android trees and has been rebased on / carried in ChromeOS trees for
> > quite a long time for use on arm and arm64 boards. Historically on
> > these boards we've leveraged mechanism [2] to get information about
> > hung CPUs, but we could move to [1].
> >
> > NOTE: the buddy system is not really useful to enable on any
> > architectures that have a better mechanism. On arm64 folks have been
> > trying to get a better mechanism for years and there has even been
> > recent posts of patches adding support [3]. However, nothing about the
> > buddy system is tied to arm64 and several archs (even arm32, where it
> > was originally developed) could find it useful.
> >
> > [1] https://lore.kernel.org/r/20230419225604.21204-1-diand...@chromium.org
> > [2] https://issuetracker.google.com/172213129
> > [3] 
> > https://lore.kernel.org/linux-arm-kernel/20220903093415.15850-1-lecopzer.c...@mediatek.com/
>
> There is another proposal to use timers for lockup detection but not
> the buddy system:
> https://lore.kernel.org/lkml/20230413035844.ga31...@ranerica-svr.sc.intel.com/
> It'd be very good to free up the counter used by the current NMI watchdog.

Thanks for the link!

Looks like that series is x86 only, so I think that ${SUBJECT} patch
should still move forward since it provides a solution that is generic
across any platform. I guess the question is: if the buddy system gets
landed then is the HPET series still worthwhile? I guess the answer to
that would depend on whether the HPET-based watchdog has any
advantages over the buddy system.

I'd imagine that there could be some cases where the HPET system could
detect lockups that the buddy system can't. If _all_ CPUs in the
system have interrupts disabled then the buddy system won't be able to
run, but the HPET system could run. That's a win for the HPET system.
That being said, I guess I could imagine that there could be lockups
that the buddy system could detect that the HPET system couldn't. The
HPET system seems to have a single CPU in charge of processing the
main NMI and then that single CPU is in charge of checking all the
others. If that single CPU goes out to lunch then the system couldn't
detect hard lockups.

In any case, I'm happy to let others debate about the HPET system. For
now, I'll take my action items to be:

1. Modify the patch description and KConfig to include some of the
same advantages that the HPET patch series talks about (freeing up
resources).

2. Increase my CC list for the next version even more to include the
people you added to this thread who have been working on the HPET
patch series.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH 2/3] serial: uart_poll_init() should power on the UART

2023-03-13 Thread Doug Anderson
Hi,

On Tue, Mar 7, 2023 at 7:32 AM Douglas Anderson  wrote:
>
> On Qualcomm devices which use the "geni" serial driver, kdb/kgdb won't
> be very happy if you use it but the resources of the port haven't been
> powered on. Today kdb/kgdb rely on someone else powering the port
> on. This could be the normal kernel console or an agetty running.
> Let's fix this to explicitly power things on when setting up a polling
> driver.
>
> Signed-off-by: Douglas Anderson 
> ---
>
>  drivers/tty/serial/serial_core.c | 6 ++
>  1 file changed, 6 insertions(+)

Just in case it's not obvious, even though we ended up going with
Johan's series [1] instead of patch #1 of my series, patch #2 and #3
of my series are still relevant. I can repost the series without patch
#1 if it's helpful.

[1] https://lore.kernel.org/r/20230307164405.14218-1-johan+lin...@kernel.org

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v1 01/18] kdb: do not assume write() callback available

2023-03-07 Thread Doug Anderson
Hi,

On Thu, Mar 2, 2023 at 11:57 AM John Ogness  wrote:
>
> It is allowed for consoles to provide no write() callback. For
> example ttynull does this.
>
> Check if a write() callback is available before using it.
>
> Signed-off-by: John Ogness 
> ---
>  kernel/debug/kdb/kdb_io.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c
> index 5c7e9ba7cd6b..e9139dfc1f0a 100644
> --- a/kernel/debug/kdb/kdb_io.c
> +++ b/kernel/debug/kdb/kdb_io.c
> @@ -576,6 +576,8 @@ static void kdb_msg_write(const char *msg, int msg_len)
> continue;
> if (c == dbg_io_ops->cons)
> continue;
> +   if (!c->write)
> +   continue;

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH] Reduce number of concurrent KGDB_MAX_BREAKPOINTS

2022-12-21 Thread Doug Anderson
Hi,

On Mon, Dec 19, 2022 at 2:11 PM Helge Deller  wrote:
>
> On my 32-bit machine, with BREAK_INSTR_SIZE=4 the kgdb_break[] structure
> allocates 16000 bytes of static kernel memory, which is - by default -
> to be able to handle up to 1000 concurrent kgdb breakpoints.  I might be
> wrong, but I doubt that in real life someone really needs that many
> breakpoints, so I suggest to reduce the number of possible kgdb
> breakpoints and thus reduce the memory footprint of kgdb_break[].
>
> Signed-off-by: Helge Deller 
>
> diff --git a/include/linux/kgdb.h b/include/linux/kgdb.h
> index 258cdde8d356..fab81c4f007e 100644
> --- a/include/linux/kgdb.h
> +++ b/include/linux/kgdb.h
> @@ -95,7 +95,7 @@ extern char *dbg_get_reg(int regno, void *mem, struct 
> pt_regs *regs);
>  extern int dbg_set_reg(int regno, void *mem, struct pt_regs *regs);
>  #endif
>  #ifndef KGDB_MAX_BREAKPOINTS
> -# define KGDB_MAX_BREAKPOINTS  1000
> +# define KGDB_MAX_BREAKPOINTS  40

My first instinct is that, while I agree that 1000 feels too much,
that maybe 40 would have been slightly too small? I could almost
imagine myself using that many breakpoints while stepping through
code.

...but then again, I guess I'd say that Linux doesn't really behave
that great at having its code stepped through with a debugger anyway.
While it can work OK, it's not something where I'd expect people to
regularly have super long debugging sessions with lots of breakpoints
set and constantly dropping into the debugger and then continuing. So
I guess in the end, I'd be OK with limiting it to 40. Thus:

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v5 14/40] kdb: use srcu console list iterator

2022-11-16 Thread Doug Anderson
Hi,

On Wed, Nov 16, 2022 at 8:22 AM John Ogness  wrote:
>
> Guarantee safe iteration of the console list by using SRCU.
>
> Signed-off-by: John Ogness 
> Reviewed-by: Petr Mladek 
> Reviewed-by: Aaron Tomlin 
> ---
>  kernel/debug/kdb/kdb_io.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)

Without becoming an expert on this whole series, this seems reasonable to me.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v5 35/40] tty: serial: kgdboc: use srcu console list iterator

2022-11-16 Thread Doug Anderson
Hi,

On Wed, Nov 16, 2022 at 8:22 AM John Ogness  wrote:
>
> Use srcu console list iteration for safe console list traversal.
> Note that this is a preparatory change for when console_lock no
> longer provides synchronization for the console list.
>
> Signed-off-by: John Ogness 
> Reviewed-by: Petr Mladek 
> ---
>  drivers/tty/serial/kgdboc.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> index 5be381003e58..c6df9ef34099 100644
> --- a/drivers/tty/serial/kgdboc.c
> +++ b/drivers/tty/serial/kgdboc.c
> @@ -451,6 +451,7 @@ static void kgdboc_earlycon_pre_exp_handler(void)
>  {
> struct console *con;
> static bool already_warned;
> +   int cookie;
>
> if (already_warned)
> return;
> @@ -463,9 +464,14 @@ static void kgdboc_earlycon_pre_exp_handler(void)
>  * serial drivers might be OK with this, print a warning once per
>  * boot if we detect this case.
>  */
> -   for_each_console(con)
> +   cookie = console_srcu_read_lock();
> +   for_each_console_srcu(con) {
> if (con == kgdboc_earlycon_io_ops.cons)
> -   return;
> +   break;
> +   }
> +   console_srcu_read_unlock(cookie);
> +   if (con)
> +   return;

Is there truly any guarantee that "con" will be NULL if
for_each_console_srcu() finishes naturally (AKA without a "break"
being executed)?

It looks as if currently this will be true but nothing in the comments
of for_each_console_srcu() nor hlist_for_each_entry_srcu() (which it
calls) guarantees this, right? It would be nice if that was
documented, but I guess it's not a huge deal.

 Also: wasn't there just some big issue about people using loop
iteration variables after the loop finished?

https://lwn.net/Articles/885941/

Ah, I guess that's a slightly different problem and probably not relevant here.

So it seems like this is fine.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v5 36/40] tty: serial: kgdboc: use console_list_lock for list traversal

2022-11-16 Thread Doug Anderson
Hi,

On Wed, Nov 16, 2022 at 8:22 AM John Ogness  wrote:
>
> configure_kgdboc() uses the console_lock for console list iteration. Use
> the console_list_lock instead because list synchronization responsibility
> will be removed from the console_lock in a later change.
>
> The SRCU iterator could have been used here, but a later change will
> relocate the locking of the console_list_lock to also provide
> synchronization against register_console().
>
> Note, the console_lock is still needed to serialize the device()
> callback with other console operations.
>
> Signed-off-by: John Ogness 
> Reviewed-by: Petr Mladek 
> ---
>  drivers/tty/serial/kgdboc.c | 12 
>  1 file changed, 12 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v5 37/40] tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console()

2022-11-16 Thread Doug Anderson
Hi,

On Wed, Nov 16, 2022 at 8:22 AM John Ogness  wrote:
>
> Calling tty_find_polling_driver() can lead to uart_set_options() being
> called (via the poll_init() callback of tty_operations) to configure the
> uart. But uart_set_options() can also be called by register_console()
> (via the setup() callback of console).
>
> Take the console_list_lock to synchronize against register_console() and
> also use it for console list traversal. This also ensures the console list
> cannot change until the polling console has been chosen.
>
> Signed-off-by: John Ogness 
> Reviewed-by: Daniel Thompson 
> Reviewed-by: Petr Mladek 
> ---
>  drivers/tty/serial/kgdboc.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> index 82b4b4d67823..8c2b7ccdfebf 100644
> --- a/drivers/tty/serial/kgdboc.c
> +++ b/drivers/tty/serial/kgdboc.c
> @@ -189,12 +189,20 @@ static int configure_kgdboc(void)
> if (kgdboc_register_kbd())
> goto do_register;
>
> +   /*
> +* tty_find_polling_driver() can call uart_set_options()
> +* (via poll_init) to configure the uart. Take the console_list_lock
> +* in order to synchronize against register_console(), which can also
> +* configure the uart via uart_set_options(). This also allows safe
> +* traversal of the console list.
> +*/
> +   console_list_lock();
> +
> p = tty_find_polling_driver(cptr, _line);
> -   if (!p)
> +   if (!p) {
> +   console_list_unlock();
> goto noconfig;
> -
> -   /* For safe traversal of the console list. */
> -   console_list_lock();
> +   }

Seems OK to me, though I guess I would have moved console_lock() up
too just because this isn't a case we need to optimize and then we can
be extra certain that nobody else is messing with console structures
while we're looking at them.

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v5 38/40] tty: serial: kgdboc: use console_list_lock to trap exit

2022-11-16 Thread Doug Anderson
Hi,

On Wed, Nov 16, 2022 at 8:22 AM John Ogness  wrote:
>
> kgdboc_earlycon_init() uses the console_lock to ensure that no consoles
> are unregistered until the kgdboc_earlycon is setup. The console_list_lock
> should be used instead because list synchronization responsibility will
> be removed from the console_lock in a later change.
>
> Signed-off-by: John Ogness 
> Reviewed-by: Daniel Thompson 
> Reviewed-by: Petr Mladek 
> ---
>  drivers/tty/serial/kgdboc.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> index 8c2b7ccdfebf..a3ed9b34e2ab 100644
> --- a/drivers/tty/serial/kgdboc.c
> +++ b/drivers/tty/serial/kgdboc.c
> @@ -558,13 +558,13 @@ static int __init kgdboc_earlycon_init(char *opt)
>  */
>
> /*
> -* Hold the console_lock to guarantee that no consoles are
> +* Hold the console_list_lock to guarantee that no consoles are
>  * unregistered until the kgdboc_earlycon setup is complete.
>  * Trapping the exit() callback relies on exit() not being
>  * called until the trap is setup. This also allows safe
>  * traversal of the console list and race-free reading of @flags.
>  */
> -   console_lock();
> +   console_list_lock();
> for_each_console(con) {
> if (con->write && con->read &&
> (con->flags & (CON_BOOT | CON_ENABLED)) &&

Officially don't we need both the list lock and normal lock here since
we're reaching into the consoles?

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v2 12/38] tty: serial: kgdboc: use console_is_enabled()

2022-10-24 Thread Doug Anderson
Hi,

On Mon, Oct 24, 2022 at 3:46 PM Doug Anderson  wrote:
>
> Hi,
>
> On Wed, Oct 19, 2022 at 7:56 AM John Ogness  wrote:
> >
> > Replace (console->flags & CON_ENABLED) usage with console_is_enabled().
> >
> > Signed-off-by: John Ogness 
> > ---
> >  drivers/tty/serial/kgdboc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
> > index e76f0186c335..b17aa7e49894 100644
> > --- a/drivers/tty/serial/kgdboc.c
> > +++ b/drivers/tty/serial/kgdboc.c
> > @@ -533,7 +533,7 @@ static int __init kgdboc_earlycon_init(char *opt)
> > console_lock();
> > for_each_console(con) {
> > if (con->write && con->read &&
> > -   (con->flags & (CON_BOOT | CON_ENABLED)) &&
> > +   (console_is_enabled(con) || (con->flags & CON_BOOT)) &&
>
> . I guess this is OK, but it feels a little pointless. If we're
> still directly looking at the CON_BOOT bit in con->flags it seems
> weird to be accessing CON_ENABLED through a special wrapper that's
> marked as a `data_race`. In our case it's _not_ a data race, right,
> since this function continues to hold the console_lock() even at the
> end of the series? I personally would drop this patch but if you
> really want it I won't object.

I realized that my statement isn't quite true. It actually only holds
console_list_lock() even at the end of the series. Still, it seems
weird that we're declaring the `data_race` on CON_ENABLED but not
CON_BOOT ?


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v2 26/38] kdb: use srcu console list iterator

2022-10-24 Thread Doug Anderson
Hi,

On Wed, Oct 19, 2022 at 7:56 AM John Ogness  wrote:
>
> Guarantee safe iteration of the console list by using SRCU.
>
> Signed-off-by: John Ogness 
> ---
>  kernel/debug/kdb/kdb_io.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/debug/kdb/kdb_io.c b/kernel/debug/kdb/kdb_io.c
> index 550fe8b456ec..5c0bd93c3574 100644
> --- a/kernel/debug/kdb/kdb_io.c
> +++ b/kernel/debug/kdb/kdb_io.c
> @@ -545,6 +545,7 @@ static void kdb_msg_write(const char *msg, int msg_len)
>  {
> struct console *c;
> const char *cp;
> +   int cookie;
> int len;
>
> if (msg_len == 0)
> @@ -558,7 +559,8 @@ static void kdb_msg_write(const char *msg, int msg_len)
> cp++;
> }
>
> -   for_each_console(c) {
> +   cookie = console_srcu_read_lock();
> +   for_each_console_srcu(c) {

Maybe it wouldn't hurt to also have a comment saying that normally the
console_srcu_read_lock() wouldn't be enough given that we're poking
into each individual console and calling ->write() but that we're
relying on the fact that all the other CPUs are stopped at the moment
and thus we should be safe.

-Doug


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v2 22/38] serial: kgdboc: document console_lock usage

2022-10-24 Thread Doug Anderson
Hi,

On Wed, Oct 19, 2022 at 7:56 AM John Ogness  wrote:
>
> kgdboc_earlycon_init() uses the console_lock to ensure that no consoles
> are unregistered until the kgdboc_earlycon is setup. This is necessary
> because the trapping of the exit() callback assumes that the exit()
> callback is not called before the trap is setup.
>
> Explicitly document this non-typical console_lock usage.
>
> Signed-off-by: John Ogness 
> ---
>  drivers/tty/serial/kgdboc.c | 8 
>  1 file changed, 8 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


Re: [Kgdb-bugreport] [PATCH printk v2 01/38] serial: kgdboc: Lock console list in probe function

2022-10-24 Thread Doug Anderson
Hi,

On Wed, Oct 19, 2022 at 7:56 AM John Ogness  wrote:
>
> From: Thomas Gleixner 
>
> Unprotected list walks are not necessarily safe.
>
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: John Ogness 
> Reviewed-by: Petr Mladek 
> ---
>  drivers/tty/serial/kgdboc.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Douglas Anderson 


___
Kgdb-bugreport mailing list
Kgdb-bugreport@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport


  1   2   3   >