Re: [PATCH 24/30] panic: Refactor the panic path
Perfect Petr, thanks for your feedback! I'll be out for some weeks, but after that what I'm doing is to split the series in 2 parts: (a) The general fixes, which should be reviewed by subsystem maintainers and even merged individually by them. (b) The proper panic refactor, which includes the notifiers list split, etc. I'll think about what I consider the best solution for the crash_dump required ones, and will try to split in very simple patches to make it easier to review. Cheers, Guilherme
Re: [PATCH 24/30] panic: Refactor the panic path
On Thu 2022-05-26 13:25:57, Guilherme G. Piccoli wrote: > OK, so it seems we have some points in which agreement exists, and some > points that there is no agreement and instead, we have antagonistic / > opposite views and needs. Let's start with the easier part heh > > It seems everybody agrees that *we shouldn't over-engineer things*, and > as per Eric good words: making the panic path more feature-full or > increasing flexibility isn't a good idea. So, as a "corollary": the > panic level approach I'm proposing is not a good fit, I'll drop it and > let's go with something simpler. Makes sense. > Another point of agreement seems to be that _notifier lists in the panic > path are dangerous_, for *2 different reasons*: > > (a) We cannot guarantee that people won't add crazy callbacks there, we > can plan and document things the best as possible - it'll never be > enough, somebody eventually would slip a nonsense callback that would > break things and defeat the planned purpose of such a list; It is true that notifier lists might allow to add crazy stuff without proper review more easily. Things added into the core code would most likely get better review. But nothing is error-proof. And bugs will happen with any approach. > (b) As per Eric point, in a panic/crash situation we might have memory > corruption exactly in the list code / pointers, etc, so the notifier > lists are, by nature, a bit fragile. But I think we shouldn't consider > it completely "bollocks", since this approach has been used for a while > with a good success rate. So, lists aren't perfect at all, but at the > same time, they aren't completely useless. I am not able to judge this. Of course, any extra step increases the risk. I am just not sure how much more complicated it would be to hardcode the calls. Most of them are architecture and/or feature specific. And such code is often hard to review and maintain. > To avoid using a 4th list, 4th or 5th? We already have "hypervisor", "info", "pre-reboot", and "pre-loop". The 5th might be pre-crash-exec. > especially given the list nature is a bit > fragile, I'd suggest one of the 3 following approaches - I *really > appreciate feedbacks* on that so I can implement the best solution and > avoid wasting time in some poor/disliked solution: Honestly, I am not able to decide what might be better without seeing the code. Most things fits pretty well into the 4 proposed lists: "hypervisor", "info", "pre-reboot", and "pre-loop". IMHO, the only question is the code that needs to be always called even before crash_dump. I suggest that you solve the crash_dump callbacks the way that looks best to you. Ideally do it in a separate patch so it can be reviewed and reworked more easily. I believe that a fresh code with an updated split and simplified logic would help us to move forward. Best Regards, Petr
Re: [PATCH 24/30] panic: Refactor the panic path
Hey folks, first of all thanks a lot for the reviews / opinions about this. I imagined that such change would be polemic, and I see I was right heh I'll try to "mix" all the relevant opinions in a single email, since they happened in different responses and even different mail threads. I've looped here the most interested parties based on the feedback received, such as Baoquan (kdump), Hatayama (kdump), Eric (kexec), Mark (arm64), Michael (Hyper-V), Petr (console/printk) and Vitaly (hyper-v / kvm). I hope we can discuss and try to reach some consensus - my apologies in advance for this long message! So, here goes some feedback we received about this change and correlated feedback from arm64 community - my apologies if I missed something important, I've tried to collect the most relevant portions, while keeping the summary "as short" as possible. I'll respond to such feedback below, after the quotes. On 24/05/2022 05:32, Baoquan He wrote: >> [...] >> Firstly, kdump is not always the first thing. In any use case, if kdump >> kernel is not loaded, it's not the first thing at all. Not to mention >> if crash_kexec_post_notifiers is specified. >> [...] >> Changing this will cause regression. During these years, nobody ever doubt >> kdump should execute firstly if crashkernel is reserved and kdump kernel is >> loaded. That's not saying we can't change >> this, but need a convincing justification. >> [...] >> Secondly, even with the notifiers' split, we can't guarantee people will >> absolutely add notifiers into right list in the future. Letting kdump >> execute behind lists by default will put kdump into risk. >> [...] >> As for Hyper-V, if it enforces to terminate VMbus connection, no matter >> it's kdump or not, why not taking it out of panic notifiers list and >> execute it before kdump unconditionally. On 24/05/2022 05:01, Petr Mladek wrote: >> [...] >> Anyway, I see four possible solutions: >> >> 1. The most conservative approach is to keep the current behavior >> and call kdump first by default. >> >> 2. A medium conservative approach to change the default default >> behavior and call hypervisor and eventually the info notifiers >> before kdump. There still would be the possibility to call kdump >> first by the command line parameter. >> >> 3. Remove the possibility to call kdump first completely. It would >> assume that all the notifiers in the info list are super safe >> or that they make kdump actually more safe. >> >> 4. Create one more notifier list for operations that always should >> be called before crash_dump. >> >> Regarding the extra notifier list (4th solution). It is not clear to >> me whether it would be always called even before hypervisor list or >> when kdump is not enabled. We must not over-engineer it. >> >> 2nd proposal looks like a good compromise. But maybe we could do >> this change few releases later. The notifiers split is a big >> change on its own. On 24/05/2022 07:18, Baoquan He wrote: >>[...] >> I would vote for 1 or 4 without any hesitation, and prefer 4. I ever >> suggest the variant of solution 4 in v1 reviewing. That's taking those >> notifiers out of list and enforcing to execute them before kdump. E.g >> the one on HyperV to terminate VMbus connection. Maybe solution 4 is >> better to provide a determinate way for people to add necessary code >> at the earliest part. >> [...] >>> >>> Regarding the extra notifier list (4th solution). It is not clear to >>> me whether it would be always called even before hypervisor list or >>> when kdump is not enabled. We must not over-engineer it. >> >> One thing I would like to notice is, no matter how perfect we split the >> lists this time, we can't gurantee people will add notifiers reasonablly >> in the future. And people from different sub-component may not do >> sufficient investigation and add them to fulfil their local purpose. >> >> The current panic notifers list is the best example. Hyper-V actually >> wants to run some necessary code before kdump, but not all of them, they >> just add it, ignoring the original purpose of >> crash_kexec_post_notifiers. I guess they do like this just because it's >> easy to do, no need to bother changing code in generic place. >> >> Solution 4 can make this no doubt, that's why I like it better. >> [...] >> As I replied to Guilherme, solution 2 will cause regression if not >> calling kdump firstly. Solution 3 leaves people space to make mistake, >> they could add nontifier into wrong list. >> >> I would like to note again that the panic notifiers are optional to run, >> while kdump is expectd once loaded, from the original purpose. I guess >> people I know will still have this thought, e.g Hatayama, Masa, they are >> truly often use panic notifiers like this on their company's system. On 24/05/2022 11:44, Eric W. Biederman wrote: > [...] > Unfortunately I am also very grouchy. > > Notifiers before kexec on panic are fundame
Re: [PATCH 24/30] panic: Refactor the panic path
"Guilherme G. Piccoli" writes: > The panic() function is somewhat convoluted - a lot of changes were > made over the years, adding comments that might be misleading/outdated > now, it has a code structure that is a bit complex to follow, with > lots of conditionals, for example. The panic notifier list is something > else - a single list, with multiple callbacks of different purposes, > that run in a non-deterministic order and may affect hardly kdump > reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. > > This patch proposes a major refactor on the panic path based on Petr's > idea [0] - basically we split the notifiers list in three, having a set > of different call points in the panic path. Below a list of changes > proposed in this patch, culminating in the panic notifiers level > concept: > > (a) First of all, we improved comments all over the function > and removed useless variables / includes. Also, as part of this > clean-up we concentrate the console flushing functions in a helper. > > (b) As mentioned before, there is a split of the panic notifier list > in three, based on the purpose of the callback. The code contains > good documentation in form of comments, but a summary of the three > lists follows: > > - the hypervisor list aims low-risk procedures to inform hypervisors > or firmware about the panic event, also includes LED-related functions; > > - the informational list contains callbacks that provide more details, > like kernel offset or trace dump (if enabled) and also includes the > callbacks aimed at reducing log pollution or warns, like the RCU and > hung task disable callbacks; > > - finally, the pre_reboot list is the old notifier list renamed, > containing the more risky callbacks that didn't fit the previous > lists. There is also a 4th list (the post_reboot one), but it's not > related with the original list - it contains late time architecture > callbacks aimed at stopping the machine, for example. > > The 3 notifiers lists execute in different moments, hypervisor being > the first, followed by informational and finally the pre_reboot list. > > (c) But then, there is the ordering problem of the notifiers against > the crash_kernel() call - kdump must be as reliable as possible. > For that, a simple binary "switch" as "crash_kexec_post_notifiers" > is not enough, hence we introduce here concept of panic notifier > levels: there are 5 levels, from 0 (no notifier executes before > kdump) until 4 (all notifiers run before kdump); the default level > is 2, in which the hypervisor and (iff we have any kmsg dumper) > the informational notifiers execute before kdump. > > The detailed documentation of the levels is present in code comments > and in the kernel-parameters.txt file; as an analogy with the previous > panic() implementation, the level 0 is exactly the same as the old > behavior of notifiers, running all after kdump, and the level 4 is > the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as > a deprecated one). > > (d) Finally, an important change made here: we now use only the > function "crash_smp_send_stop()" to shut all the secondary CPUs > in the panic path. Before, there was a case of using the regular > "smp_send_stop()", but the better approach is to simplify the > code and try to use the function which was created exclusively > for the panic path. Experiments showed that it works fine, and > code was very simplified with that. > > Functional change is expected from this refactor, since now we > call some notifiers by default before kdump, but the goal here > besides code clean-up is to have a better panic path, more > reliable and deterministic, but also very customizable. > > [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ I am late to this discussion. My apologies. Unfortunately I am also very grouchy. Notifiers before kexec on panic are fundamentally broken. So please just remove crash_kexec_post notifiers and be done with it. Part of the deep issue is that firmware always has a common and broken implementation for anything that is not mission critical to motherboards. Notifiers in any sense on these paths are just bollocks. Any kind of notifier list is fundamentally fragile in the face of memory corruption and very very difficult to review. So I am going to refresh my ancient NACK on this. I can certainly appreciate that there are pieces of the reboot paths that can be improved. I don't think making anything more feature full or flexible is any kind of real improvement. Eric > > Suggested-by: Petr Mladek > Signed-off-by: Guilherme G. Piccoli > --- > > Special thanks to Petr and Baoquan for the suggestion and feedback in a > previous > email thread. There's some important design decisions that worth mentioning > and > discussing: > > * The default panic notifiers level is 2, based on Petr Mladek's suggestion, > which makes a lot of sense. Of course, this is customizable through the > parameter, but
Re: [PATCH 24/30] panic: Refactor the panic path
On 05/24/22 at 10:01am, Petr Mladek wrote: > On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote: > > On 19/05/2022 20:45, Baoquan He wrote: > > > [...] > > >> I really appreciate the summary skill you have, to convert complex > > >> problems in very clear and concise ideas. Thanks for that, very useful! > > >> I agree with what was summarized above. > > > > > > I want to say the similar words to Petr's reviewing comment when I went > > > through the patches and traced each reviewing sub-thread to try to > > > catch up. Petr has reivewed this series so carefully and given many > > > comments I want to ack immediately. > > > > > > I agree with most of the suggestions from Petr to this patch, except of > > > one tiny concern, please see below inline comment. > > > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > > > > [...] > > > > > > I like the proposed skeleton of panic() and code style suggested by > > > Petr very much. About panic_prefer_crash_dump which might need be added, > > > I hope it has a default value true. This makes crash_dump execute at > > > first by default just as before, unless people specify > > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > > this is inconsistent with the old behaviour. > > > > I'd like to understand better why the crash_kexec() must always be the > > first thing in your use case. If we keep that behavior, we'll see all > > sorts of workarounds - see the last patches of this series, Hyper-V and > > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > > execution of their relevant notifiers (like the vmbus disconnect, > > specially in arm64 that has no custom machine_crash_shutdown, or the > > fadump case in ppc). This led to more risk in kdump. > > > > The thing is: with the notifiers' split, we tried to keep only the most > > relevant/necessary stuff in this first list, things that ultimately > > should improve kdump reliability or if not, at least not break it. My > > feeling is that, with this series, we should change the idea/concept > > that kdump must run first nevertheless, not matter what. We're here > > trying to accommodate the antagonistic goals of hypervisors that need > > some clean-up (even for kdump to work) VS. kdump users, that wish a > > "pristine" system reboot ASAP after the crash. > > Good question. I wonder if Baoquan knows about problems caused by the > particular notifiers that will end up in the hypervisor list. Note > that there will be some shuffles and the list will be slightly > different in V2. Yes, I knew some of them. Please check my response to Guilherme. We have bug to track the issue on Hyper-V in which failure happened during panic notifiers running, haven't come to kdump. Seems both of us sent mail replying to Guilherme at the same time. > > Anyway, I see four possible solutions: > > 1. The most conservative approach is to keep the current behavior > and call kdump first by default. > > 2. A medium conservative approach to change the default default > behavior and call hypervisor and eventually the info notifiers > before kdump. There still would be the possibility to call kdump > first by the command line parameter. > > 3. Remove the possibility to call kdump first completely. It would > assume that all the notifiers in the info list are super safe > or that they make kdump actually more safe. > > 4. Create one more notifier list for operations that always should > be called before crash_dump. I would vote for 1 or 4 without any hesitation, and prefer 4. I ever suggest the variant of solution 4 in v1 reviewing. That's taking those notifiers out of list and enforcing to execute them before kdump. E.g the one on HyperV to terminate VMbus connection. Maybe solution 4 is better to provide a determinate way for people to add necessary code at the earliest part. > > Regarding the extra notifier list (4th solution). It is not clear to > me whether it would be always called even before hypervisor list or > when kdump is not enabled. We must not over-engineer it. One thing I would like to notice is, no matter how perfect we split the lists this time, we can't gurantee people will add notifiers reasonablly in the future. And people from different sub-component may not do sufficient investigation and add them to fulfil their local purpose. The current panic notifers list is the best example. Hyper-V actually wants to run some necessary code before kdump, but not all of them, they just add it, ignoring the original purpose of crash_kexec_post_notifiers. I guess they do like this just because it's easy to do, no need to bother changing code in generic place. Solution 4 can make this no doubt, that's why I like it better. > > 2nd proposal looks like a good compromise. But maybe we could do > this change few releases later. The noti
Re: [PATCH 24/30] panic: Refactor the panic path
On 05/20/22 at 08:23am, Guilherme G. Piccoli wrote: > On 19/05/2022 20:45, Baoquan He wrote: > > [...] > >> I really appreciate the summary skill you have, to convert complex > >> problems in very clear and concise ideas. Thanks for that, very useful! > >> I agree with what was summarized above. > > > > I want to say the similar words to Petr's reviewing comment when I went > > through the patches and traced each reviewing sub-thread to try to > > catch up. Petr has reivewed this series so carefully and given many > > comments I want to ack immediately. > > > > I agree with most of the suggestions from Petr to this patch, except of > > one tiny concern, please see below inline comment. > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > [...] > > > > I like the proposed skeleton of panic() and code style suggested by > > Petr very much. About panic_prefer_crash_dump which might need be added, > > I hope it has a default value true. This makes crash_dump execute at > > first by default just as before, unless people specify > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > this is inconsistent with the old behaviour. > > I'd like to understand better why the crash_kexec() must always be the > first thing in your use case. If we keep that behavior, we'll see all > sorts of workarounds - see the last patches of this series, Hyper-V and > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > execution of their relevant notifiers (like the vmbus disconnect, > specially in arm64 that has no custom machine_crash_shutdown, or the > fadump case in ppc). This led to more risk in kdump. Firstly, kdump is not always the first thing. In any use case, if kdump kernel is not loaded, it's not the first thing at all. Not to mention if crash_kexec_post_notifiers is specified. if kdump kernel is loaded, kdump has been executing firslty, since it was added into kenrel/panic(); Until 2014, Masa added crash_kexec_post_notifiers kernel parameter to make panic notifiers be able to execute before kdump if specified. commit dc009d92435f99498cbc579ce76bf28e837e2c14 Author: Eric W. Biederman Date: Sat Jun 25 14:57:52 2005 -0700 [PATCH] kexec: add kexec syscalls commit f06e5153f4ae2e2f3b0300f0e260e40cb7fefd45 Author: Masami Hiramatsu Date: Fri Jun 6 14:37:07 2014 -0700 kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers Changing this will cause regression. During these years, nobody ever doubt kdump should execute firstly if crashkernel is reserved and kdump kernel is loaded. That's not saying we can't change this, but need a convincing justification. Secondly, even with the notifiers' split, we can't guarantee people will absolutely add notifiers into right list in the future. Letting kdump execute behind lists by default will put kdump into risk. For example, you replied to Hatamata saying you have been working with kdump in the last 3, 4 years, and you have have been working on these panic notifiers refactoring issue in the recent months. However, in your refactoring patches of introducing hypervisor/info/pre-reboot, I noticed you acked the suggestion from Petr that several notifiers need be moved to correct position. So even you can't make sure these, how can other people be able to recognize which list should be 100% appropriate when they try to register one notifier for their sub-component? At last, I am wondering why fadump matters. I don't know in which case people wants to load kdump kernel, but expect to trigger crash fadump. Power people need consider this carefully and makes some change. Fadump just borrows the crashkernel reservation mechanism. If fadump would rather take risk to run all panic notifiers, whether fadump really needs them or not, then execute crash_fadump(), that's powerpc's business. As for Hyper-V, if it enforces to terminate VMbus connection, no matter it's kdump or not, why not taking it out of panic notifiers list and execute it before kdump unconditionally. Below is abstracted from Michael's words. https://lore.kernel.org/all/mwhpr21mb15933573f5c81c5250bf6a1cd7...@mwhpr21mb1593.namprd21.prod.outlook.com/T/#u === I looked at the code again, and should revise my previous comments somewhat. The Hyper-V resets that I described indeed must be done prior to kexec'ing the kdump kernel. Most such resets are actually done via __crash_kexec() -> machine_crash_shutdown(), not via the panic notifier. However, the Hyper-V panic notifier must terminate the VMbus connection, because that must be done even if kdump is not being invoked. See commit 74347a99e73. === > > The thing is: with the notifiers' split, we tried to keep only the most > relevant/necessary stuff in this first list, things that ultimately > should i
Re: [PATCH 24/30] panic: Refactor the panic path
On Fri 2022-05-20 08:23:33, Guilherme G. Piccoli wrote: > On 19/05/2022 20:45, Baoquan He wrote: > > [...] > >> I really appreciate the summary skill you have, to convert complex > >> problems in very clear and concise ideas. Thanks for that, very useful! > >> I agree with what was summarized above. > > > > I want to say the similar words to Petr's reviewing comment when I went > > through the patches and traced each reviewing sub-thread to try to > > catch up. Petr has reivewed this series so carefully and given many > > comments I want to ack immediately. > > > > I agree with most of the suggestions from Petr to this patch, except of > > one tiny concern, please see below inline comment. > > Hi Baoquan, thanks! I'm glad you're also reviewing that =) > > > > [...] > > > > I like the proposed skeleton of panic() and code style suggested by > > Petr very much. About panic_prefer_crash_dump which might need be added, > > I hope it has a default value true. This makes crash_dump execute at > > first by default just as before, unless people specify > > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > > this is inconsistent with the old behaviour. > > I'd like to understand better why the crash_kexec() must always be the > first thing in your use case. If we keep that behavior, we'll see all > sorts of workarounds - see the last patches of this series, Hyper-V and > PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force > execution of their relevant notifiers (like the vmbus disconnect, > specially in arm64 that has no custom machine_crash_shutdown, or the > fadump case in ppc). This led to more risk in kdump. > > The thing is: with the notifiers' split, we tried to keep only the most > relevant/necessary stuff in this first list, things that ultimately > should improve kdump reliability or if not, at least not break it. My > feeling is that, with this series, we should change the idea/concept > that kdump must run first nevertheless, not matter what. We're here > trying to accommodate the antagonistic goals of hypervisors that need > some clean-up (even for kdump to work) VS. kdump users, that wish a > "pristine" system reboot ASAP after the crash. Good question. I wonder if Baoquan knows about problems caused by the particular notifiers that will end up in the hypervisor list. Note that there will be some shuffles and the list will be slightly different in V2. Anyway, I see four possible solutions: 1. The most conservative approach is to keep the current behavior and call kdump first by default. 2. A medium conservative approach to change the default default behavior and call hypervisor and eventually the info notifiers before kdump. There still would be the possibility to call kdump first by the command line parameter. 3. Remove the possibility to call kdump first completely. It would assume that all the notifiers in the info list are super safe or that they make kdump actually more safe. 4. Create one more notifier list for operations that always should be called before crash_dump. Regarding the extra notifier list (4th solution). It is not clear to me whether it would be always called even before hypervisor list or when kdump is not enabled. We must not over-engineer it. 2nd proposal looks like a good compromise. But maybe we could do this change few releases later. The notifiers split is a big change on its own. Best Regards, Petr
Re: [PATCH 24/30] panic: Refactor the panic path
On 19/05/2022 20:45, Baoquan He wrote: > [...] >> I really appreciate the summary skill you have, to convert complex >> problems in very clear and concise ideas. Thanks for that, very useful! >> I agree with what was summarized above. > > I want to say the similar words to Petr's reviewing comment when I went > through the patches and traced each reviewing sub-thread to try to > catch up. Petr has reivewed this series so carefully and given many > comments I want to ack immediately. > > I agree with most of the suggestions from Petr to this patch, except of > one tiny concern, please see below inline comment. Hi Baoquan, thanks! I'm glad you're also reviewing that =) > [...] > > I like the proposed skeleton of panic() and code style suggested by > Petr very much. About panic_prefer_crash_dump which might need be added, > I hope it has a default value true. This makes crash_dump execute at > first by default just as before, unless people specify > panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add > panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, > this is inconsistent with the old behaviour. I'd like to understand better why the crash_kexec() must always be the first thing in your use case. If we keep that behavior, we'll see all sorts of workarounds - see the last patches of this series, Hyper-V and PowerPC folks hardcoded "crash_kexec_post_notifiers" in order to force execution of their relevant notifiers (like the vmbus disconnect, specially in arm64 that has no custom machine_crash_shutdown, or the fadump case in ppc). This led to more risk in kdump. The thing is: with the notifiers' split, we tried to keep only the most relevant/necessary stuff in this first list, things that ultimately should improve kdump reliability or if not, at least not break it. My feeling is that, with this series, we should change the idea/concept that kdump must run first nevertheless, not matter what. We're here trying to accommodate the antagonistic goals of hypervisors that need some clean-up (even for kdump to work) VS. kdump users, that wish a "pristine" system reboot ASAP after the crash. Cheers, Guilherme
Re: [PATCH 24/30] panic: Refactor the panic path
On 05/15/22 at 07:47pm, Guilherme G. Piccoli wrote: > On 12/05/2022 11:03, Petr Mladek wrote: .. > > OK, the question is how to make it better. Let's start with > > a clear picture of the problem: > > > > 1. panic() has basically two funtions: > > > > + show/store debug information (optional ways and amount) > > + do something with the system (reboot, stay hanged) > > > > > > 2. There are 4 ways how to show/store the information: > > > > + tell hypervisor to store what it is interested about > > + crash_dump > > + kmsg_dump() > > + consoles > > > > , where crash_dump and consoles are special: > > > > + crash_dump does not return. Instead it ends up with reboot. > > > > + Consoles work transparently. They just need an extra flush > >before reboot or staying hanged. > > > > > > 3. The various notifiers do things like: > > > > + tell hypervisor about the crash > > + print more information (also stop watchdogs) > > + prepare system for reboot (touch some interfaces) > > + prepare system for staying hanged (blinking) > > > >Note that it pretty nicely matches the 4 notifier lists. > > > > I really appreciate the summary skill you have, to convert complex > problems in very clear and concise ideas. Thanks for that, very useful! > I agree with what was summarized above. I want to say the similar words to Petr's reviewing comment when I went through the patches and traced each reviewing sub-thread to try to catch up. Petr has reivewed this series so carefully and given many comments I want to ack immediately. I agree with most of the suggestions from Petr to this patch, except of one tiny concern, please see below inline comment. > > > > Now, we need to decide about the ordering. The main area is how > > to store the debug information. Consoles are transparent so > > the quesition is about: > > > > + hypervisor > > + crash_dump > > + kmsg_dump > > > > Some people need none and some people want all. There is a > > risk that system might hung at any stage. This why people want to > > make the order configurable. > > > > But crash_dump() does not return when it succeeds. And kmsg_dump() > > users havn't complained about hypervisor problems yet. So, that > > two variants might be enough: > > > > + crash_dump (hypervisor, kmsg_dump as fallback) > > + hypervisor, kmsg_dump, crash_dump > > > > One option "panic_prefer_crash_dump" should be enough. > > And the code might look like: > > > > void panic() > > { > > [...] > > dump_stack(); > > kgdb_panic(buf); > > > > < --- here starts the reworked code --- > > > > > /* crash dump is enough when enabled and preferred. */ > > if (panic_prefer_crash_dump) > > __crash_kexec(NULL); I like the proposed skeleton of panic() and code style suggested by Petr very much. About panic_prefer_crash_dump which might need be added, I hope it has a default value true. This makes crash_dump execute at first by default just as before, unless people specify panic_prefer_crash_dump=0|n|off to disable it. Otherwise we need add panic_prefer_crash_dump=1 in kernel and in our distros to enable kdump, this is inconsistent with the old behaviour. > > > > /* Stop other CPUs and focus on handling the panic state. */ > > if (has_kexec_crash_image) > > crash_smp_send_stop(); > > else > > smp_send_stop() > > > > Here we have a very important point. Why do we need 2 variants of SMP > CPU stopping functions? I disagree with that - my understanding of this > after some study in architectures is that the crash_() variant is > "stronger", should work in all cases and if not, we should fix that - > that'd be a bug. > > Such variant either maps to smp_send_stop() (in various architectures, > including XEN/x86) or overrides the basic function with more proper > handling for panic() case...I don't see why we still need such > distinction, if you / others have some insight about that, I'd like to > hear =) > > > > /* Notify hypervisor about the system panic. */ > > atomic_notifier_call_chain(&panic_hypervisor_list, 0, NULL); > > > > /* > > * No need to risk extra info when there is no kmsg dumper > > * registered. > > */ > > if (!has_kmsg_dumper()) > > __crash_kexec(NULL); > > > > /* Add extra info from different subsystems. */ > > atomic_notifier_call_chain(&panic_info_list, 0, NULL); > > > > kmsg_dump(KMSG_DUMP_PANIC); > > __crash_kexec(NULL); > > > > /* Flush console */ > > unblank_screen(); > > console_unblank(); > > debug_locks_off(); > > console_flush_on_panic(CONSOLE_FLUSH_PENDING); > > > > if (panic_timeout > 0) { > > delay() > > } > > > > /* > > * Prepare system for eventual reboot and allow custom > > * reboot handling. > > */ > > atomic_notifier_call_chain(&panic_reboot_lis
Re: [PATCH 24/30] panic: Refactor the panic path
On 16/05/2022 07:21, Petr Mladek wrote: > [...] > Ah, it should have been: > > + notifiers vs. kmsg_dump > + notifiers vs. crash_dump > + crash_dump vs. kmsg_dump > > I am sorry for the confusion. Even "crash_dump" is slightly > misleading because there is no function with this name. > But it seems to be easier to understand than __crash_kexec(). Cool, thanks! Now it's totally clear for me =) I feel crash dump is the proper term, but I personally prefer kdump to avoid mess-up with user space "core dump" concept heheh Also, KDUMP is an entry on MAINTAINERS file. > [...] >> Heheh OK, I appreciate your opinion, but I guess we'll need to agree in >> disagree here - I'm much more fond to this kind of code than a bunch of >> if/else blocks that almost give headaches. Encoding such "level" logic >> in the if/else scheme is very convoluted, generates a very big code. And >> the functions aren't so black magic - they map a level in bits, and the >> functions _once() are called...once! Although we switch the position in >> the code, so there are 2 calls, one of them is called and the other not. > > I see. Well, I would consider this as a warning that the approach is > too complex. If the code, using if/then/else, would cause headaches > then also understanding of the behavior would cause headaches for > both users and programmers. > > >> But that's totally fine to change - especially if we're moving away from >> the "level" logic. I see below you propose a much simpler approach - if >> we follow that, definitely we won't need the "black magic" approach heheh > > I do not say that my proposal is fully correct. But we really need > this kind of simpler approach. It's cool, I agree that your idea is much simpler and makes sense - mine seems to be an over-engineering effort. Let's see the opinions of the interested parties, I'm curious to see if everybody agrees here, that'd would be ideal (and kind of "wishful thinking" I guess heheh - panic path is polemic). > [...] >> Here we have a very important point. Why do we need 2 variants of SMP >> CPU stopping functions? I disagree with that - my understanding of this >> after some study in architectures is that the crash_() variant is >> "stronger", should work in all cases and if not, we should fix that - >> that'd be a bug. >> >> Such variant either maps to smp_send_stop() (in various architectures, >> including XEN/x86) or overrides the basic function with more proper >> handling for panic() case...I don't see why we still need such >> distinction, if you / others have some insight about that, I'd like to >> hear =) > > The two variants were introduced by the commit 0ee59413c967c35a6dd > ("x86/panic: replace smp_send_stop() with kdump friendly version in > panic path") > > It points to https://lkml.org/lkml/2015/6/24/44 that talks about > still running watchdogs. > > It is possible that the problem could be fixed another way. It is > even possible that it has already been fixed by the notifiers > that disable the watchdogs. > > Anyway, any change of the smp_send_stop() behavior should be done > in a separate patch. It will help with bisection of possible > regression. Also it would require a good explanation in > the commit message. I would personally do it in a separate > patch(set). Thanks for the archeology and interesting findings. I agree that is better to split in smaller patches. I'm planning a split in 3 patches for V2: clean-up (comment, console flushing idea, useless header), the refactor itself and finally, this SMP change. > [...] >> You had the order of panic_reboot_list VS. consoles flushing inverted. >> It might make sense, although I didn't do that in V1... > > IMHO, it makes sense: > > 1. panic_reboot_list contains notifiers that do the reboot > immediately, for example, xen_panic_event, alpha_panic_event. > The consoles have to be flushed earlier. > > 2. console_flush_on_panic() ignores the result of console_trylock() > and always calls console_unlock(). As a result the lock should > be unlocked at the end. And any further printk() should be able > to printk the messages to the console immediately. It means > that any messages printed by the reboot notifiers should appear > on the console as well. > [...] >> OK, I agree with you! It's indeed simpler and if others agree, I can >> happily change the logic to what you proposed. Although...currently the >> "crash_kexec_post_notifiers" allows to call _all_ panic_reboot_list >> callbacks _before kdump_. >> >> We need to mention this change in the commit messages, but I really >> would like to hear the opinions of heavy users of notifiers (as >> Michael/Hyper-V) and the kdump interested parties (like Baoquan / Dave >> Young / Hayatama). If we all agree on such approach, will change that >> for V2 =) > > Sure, we need to make sure that we call everything that is needed. > And it should be documented. > > I believe that this is the
Re: [PATCH 24/30] panic: Refactor the panic path
On Sun 2022-05-15 19:47:39, Guilherme G. Piccoli wrote: > On 12/05/2022 11:03, Petr Mladek wrote: > > This talks only about kdump. The reality is much more complicated. > > The level affect the order of: > > > > + notifiers vs. kdump > > + notifiers vs. crash_dump > > + crash_dump vs. kdump > > First of all, I'd like to ask you please to clarify to me *exactly* what > are the differences between "crash_dump" and "kdump". I'm sorry if > that's a silly question, I need to be 100% sure I understand the > concepts the same way you do. Ah, it should have been: + notifiers vs. kmsg_dump + notifiers vs. crash_dump + crash_dump vs. kmsg_dump I am sorry for the confusion. Even "crash_dump" is slightly misleading because there is no function with this name. But it seems to be easier to understand than __crash_kexec(). > > There might theoretically many variants of the ordering of kdump, > > crash_dump, and the 4 notifier list. Some variants do not make > > much sense. You choose 5 variants and tried to select them by > > a level number. > > > > The question is if we really could easily describe the meaning this > > way. It is not only about a "level" of notifiers before kdump. It is > > also about the ordering of crash_dump vs. kdump. IMHO, "level" > > semantic does not fit there. > > > > Maybe more parameters might be easier to understand the effect. > > Anyway, we first need to agree on the chosen variants. > > I am going to discuss it more in the code, see below. > > > > > > [...] > > Here is the code using the above functions. It helps to discuss > > the design and logic. > > > > I have to say that the logic is very unclear. Almost all > > functions are called twice: > > > > The really used code path is defined by order_panic_notifiers_and_kdump() > > that encodes "level" into "bits". The bits are then flipped in > > panic_notifier_*_once() calls that either do something or not. > > kmsg_dump() is called according to the bit flip. > > > > Also I guess that it is good proof that "level" abstraction does > > not fit here. Normal levels would not need this kind of magic. > > Heheh OK, I appreciate your opinion, but I guess we'll need to agree in > disagree here - I'm much more fond to this kind of code than a bunch of > if/else blocks that almost give headaches. Encoding such "level" logic > in the if/else scheme is very convoluted, generates a very big code. And > the functions aren't so black magic - they map a level in bits, and the > functions _once() are called...once! Although we switch the position in > the code, so there are 2 calls, one of them is called and the other not. I see. Well, I would consider this as a warning that the approach is too complex. If the code, using if/then/else, would cause headaches then also understanding of the behavior would cause headaches for both users and programmers. > But that's totally fine to change - especially if we're moving away from > the "level" logic. I see below you propose a much simpler approach - if > we follow that, definitely we won't need the "black magic" approach heheh I do not say that my proposal is fully correct. But we really need this kind of simpler approach. > > OK, the question is how to make it better. > > One option "panic_prefer_crash_dump" should be enough. > > And the code might look like: > > > > void panic() > > { > > [...] > > dump_stack(); > > kgdb_panic(buf); > > > > < --- here starts the reworked code --- > > > > > /* crash dump is enough when enabled and preferred. */ > > if (panic_prefer_crash_dump) > > __crash_kexec(NULL); > > > > /* Stop other CPUs and focus on handling the panic state. */ > > if (has_kexec_crash_image) > > crash_smp_send_stop(); > > else > > smp_send_stop() > > > > Here we have a very important point. Why do we need 2 variants of SMP > CPU stopping functions? I disagree with that - my understanding of this > after some study in architectures is that the crash_() variant is > "stronger", should work in all cases and if not, we should fix that - > that'd be a bug. > > Such variant either maps to smp_send_stop() (in various architectures, > including XEN/x86) or overrides the basic function with more proper > handling for panic() case...I don't see why we still need such > distinction, if you / others have some insight about that, I'd like to > hear =) The two variants were introduced by the commit 0ee59413c967c35a6dd ("x86/panic: replace smp_send_stop() with kdump friendly version in panic path") It points to https://lkml.org/lkml/2015/6/24/44 that talks about still running watchdogs. It is possible that the problem could be fixed another way. It is even possible that it has already been fixed by the notifiers that disable the watchdogs. Anyway, any change of the smp_send_stop() behavior should be done in a separate patch. It will help with bisection of possible regression. Also it would r
Re: [PATCH 24/30] panic: Refactor the panic path
On 12/05/2022 11:03, Petr Mladek wrote: > Hello, > > first, I am sorry for stepping into the discussion so late. > I was busy with some other stuff and this patchset is far > from trivial. > > Second, thanks a lot for putting so much effort into it. > Most of the changes look pretty good, especially all > the fixes of particular notifiers and split into > four lists. > > Though this patch will need some more love. See below > for more details. Thanks a lot for your review Petr, it is much appreciated! No need for apologies, there is no urgency here =) > [...] > This talks only about kdump. The reality is much more complicated. > The level affect the order of: > > + notifiers vs. kdump > + notifiers vs. crash_dump > + crash_dump vs. kdump First of all, I'd like to ask you please to clarify to me *exactly* what are the differences between "crash_dump" and "kdump". I'm sorry if that's a silly question, I need to be 100% sure I understand the concepts the same way you do. > There might theoretically many variants of the ordering of kdump, > crash_dump, and the 4 notifier list. Some variants do not make > much sense. You choose 5 variants and tried to select them by > a level number. > > The question is if we really could easily describe the meaning this > way. It is not only about a "level" of notifiers before kdump. It is > also about the ordering of crash_dump vs. kdump. IMHO, "level" > semantic does not fit there. > > Maybe more parameters might be easier to understand the effect. > Anyway, we first need to agree on the chosen variants. > I am going to discuss it more in the code, see below. > > > [...] > Here is the code using the above functions. It helps to discuss > the design and logic. > > > order_panic_notifiers_and_kdump(); > > /* If no level, we should kdump ASAP. */ > if (!panic_notifiers_level) > __crash_kexec(NULL); > > crash_smp_send_stop(); > panic_notifier_hypervisor_once(buf); > > if (panic_notifier_info_once(buf)) > kmsg_dump(KMSG_DUMP_PANIC); > > panic_notifier_pre_reboot_once(buf); > > __crash_kexec(NULL); > > panic_notifier_hypervisor_once(buf); > > if (panic_notifier_info_once(buf)) > kmsg_dump(KMSG_DUMP_PANIC); > > panic_notifier_pre_reboot_once(buf); > > > I have to say that the logic is very unclear. Almost all > functions are called twice: > >+ __crash_kexec() >+ kmsg_dump() >+ panic_notifier_hypervisor_once() >+ panic_notifier_pre_reboot_once() >+ panic_notifier_info_once() > > It is pretty hard to find what functions are always called in the same > order and where the order can be inverted. > > The really used code path is defined by order_panic_notifiers_and_kdump() > that encodes "level" into "bits". The bits are then flipped in > panic_notifier_*_once() calls that either do something or not. > kmsg_dump() is called according to the bit flip. > > It is an interesting approach. I guess that you wanted to avoid too > many if/then/else levels in panic(). But honestly, it looks like > a black magic to me. > > IMHO, it is always easier to follow if/then/else logic than using > a translation table that requires additional bit flips when > a value is used more times. > > Also I guess that it is good proof that "level" abstraction does > not fit here. Normal levels would not need this kind of magic. Heheh OK, I appreciate your opinion, but I guess we'll need to agree in disagree here - I'm much more fond to this kind of code than a bunch of if/else blocks that almost give headaches. Encoding such "level" logic in the if/else scheme is very convoluted, generates a very big code. And the functions aren't so black magic - they map a level in bits, and the functions _once() are called...once! Although we switch the position in the code, so there are 2 calls, one of them is called and the other not. But that's totally fine to change - especially if we're moving away from the "level" logic. I see below you propose a much simpler approach - if we follow that, definitely we won't need the "black magic" approach heheh > > OK, the question is how to make it better. Let's start with > a clear picture of the problem: > > 1. panic() has basically two funtions: > > + show/store debug information (optional ways and amount) > + do something with the system (reboot, stay hanged) > > > 2. There are 4 ways how to show/store the information: > > + tell hypervisor to store what it is interested about > + crash_dump > + kmsg_dump() > + consoles > > , where crash_dump and consoles are special: > > + crash_dump does not return. Instead it ends up with reboot. > > + Consoles work transparently. They just need an extra flush >before reboot or staying hanged. > > > 3. The various notifiers do things like: > > + tell hypervisor about the crash > + print more info
Re: [PATCH 24/30] panic: Refactor the panic path
Hello, first, I am sorry for stepping into the discussion so late. I was busy with some other stuff and this patchset is far from trivial. Second, thanks a lot for putting so much effort into it. Most of the changes look pretty good, especially all the fixes of particular notifiers and split into four lists. Though this patch will need some more love. See below for more details. On Wed 2022-04-27 19:49:18, Guilherme G. Piccoli wrote: > The panic() function is somewhat convoluted - a lot of changes were > made over the years, adding comments that might be misleading/outdated > now, it has a code structure that is a bit complex to follow, with > lots of conditionals, for example. The panic notifier list is something > else - a single list, with multiple callbacks of different purposes, > that run in a non-deterministic order and may affect hardly kdump > reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. > > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: This talks only about kdump. The reality is much more complicated. The level affect the order of: + notifiers vs. kdump + notifiers vs. crash_dump + crash_dump vs. kdump There might theoretically many variants of the ordering of kdump, crash_dump, and the 4 notifier list. Some variants do not make much sense. You choose 5 variants and tried to select them by a level number. The question is if we really could easily describe the meaning this way. It is not only about a "level" of notifiers before kdump. It is also about the ordering of crash_dump vs. kdump. IMHO, "level" semantic does not fit there. Maybe more parameters might be easier to understand the effect. Anyway, we first need to agree on the chosen variants. I am going to discuss it more in the code, see below. > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* > + there's any kmsg_dumper defined) the informational > + list are executed before kdump. > + 3: both the hypervisor and the informational lists > + (always) execute before kdump. > + 4: the 3 lists (hypervisor, info and pre_reboot) > + execute before kdump - this behavior is analog to the > + deprecated parameter "crash_kexec_post_notifiers". > + > panic_print=Bitmask for printing system info when panic happens. > User can chose combination of the following bits: > bit 0: print all tasks info > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -183,6 +195,112 @@ static void panic_print_sys_info(bool console_flush) > ftrace_dump(DUMP_ALL); > } > > +/* > + * Helper that accumulates all console flushing routines executed on panic. > + */ > +static void console_flushing(void) > +{ > +#ifdef CONFIG_VT > + unblank_screen(); > +#endif > + console_unblank(); > + > + /* > + * In this point, we may have disabled other CPUs, hence stopping the > + * CPU holding the lock while still having some valuable data in the > + * console buffer. > + * > + * Try to acquire the lock then release it regardless of the result. > + * The release will also print the buffers out. Locks debug should > + * be disabled to avoid reporting bad unlock balance when panic() > + * is not being called from OOPS. > + */ > + debug_locks_off(); > + console_flush_on_panic(CONSOLE_FLUSH_PENDING); > + > + panic_print_sys_info(true); > +} > + > +#define PN_HYPERVISOR_BIT0 > +#define PN_INFO_BIT 1 > +#define PN_PRE_REBOOT_BIT
Re: [PATCH 24/30] panic: Refactor the panic path
Hey Hatayma, thanks for your great analysis and no need for apologies! I'll comment/respond properly inline below, just noticing here that I've CCed Mark and Marc (from the ARM64 perspective), Michael (Hyper-V perspective) and Hari (PowerPC perspective), besides the usual suspects as Petr, Baoquan, etc. On 09/05/2022 12:16, d.hatay...@fujitsu.com wrote: > Sorry for the delayed response. Unfortunately, I had 10 days holidays > until yesterday... > [...] >> + We currently have 4 lists of panic notifiers; based >> + on the functionality and risk (for panic success) the >> + callbacks are added in a given list. The lists are: >> + - hypervisor/FW notification list (low risk); >> + - informational list (low/medium risk); >> + - pre_reboot list (higher risk); >> + - post_reboot list (only run late in panic and after >> + kdump, not configurable for now). >> + This parameter defines the ordering of the first 3 >> + lists with regards to kdump; the levels determine >> + which set of notifiers execute before kdump. The >> + accepted levels are: >> + 0: kdump is the first thing to run, NO list is >> + executed before kdump. >> + 1: only the hypervisor list is executed before kdump. >> + 2 (default level): the hypervisor list and (*if* > > Hmmm, why are you trying to change default setting? > > Based on the current design of kdump, it's natural to put what the > handlers for these level 1 and level 2 handlers do in > machine_crash_shutdown(), as these are necessary by default, right? > > Or have you already tried that and figured out it's difficult in some > reason and reached the current design? If so, why is that difficult? > Could you point to if there is already such discussion online? > > kdump is designed to perform as little things as possible before > transferring the execution to the 2nd kernel in order to increase > reliability. Just detour to panic() increases risks of kdump failure > in the sense of increasing the executed codes in the abnormal > situation, which is very the note in the explanation of > crash_kexec_post_notifiers. > > Also, the current implementation of crash_kexec_post_notifiers uses > the panic notifier, but this is not from the technical > reason. Ideally, it should have been implemented in the context of > crash_kexec() independently of panic(). > > That is, it looks to me that, in addition to changing design of panic > notifier, you are trying to integrate shutdown code of the crash kexec > and the panic paths. If so, this is a big design change for kdump. > I'm concerned about increase of reliability. I'd like you to discuss > them carefully. >From my understanding (specially based on both these threads [0] and [1]), 3 facts are clear and quite complex in nature: (a) Currently, the panic notifier list is a "no man's land" - it's a mess, all sort of callbacks are added there, some of them are extremely risk for breaking kdump, others are quite safe (like setting a variable). Petr's details in thread [0] are really clear and express in great way how confusing and conflicting the panic notifiers goals are. (b) In order to "address" problems in the antagonistic goals of notifiers (see point (a) above and thread [0]), we have this quirk/workaround called "crash_kexec_post_notifiers". This is useful, but (almost as for attesting how this is working as band-aid over complex and fundamental issues) see the following commits: a11589563e96 ("x86/Hyper-V: Report crash register data or kmsg before running crash kernel") 06e629c25daa ("powerpc/fadump: Fix inaccurate CPU state info in vmcore generated with panic") They hardcode such workaround, because they *need* some notifiers' callbacks. But notice they *don't need all of them*, only some important ones (that usually are good considering the risk, it's a good cost/benefit). Since we currently have an all-or-nothing behavior for the panic notifiers, both PowerPC and Hyper-V end-up bringing all of them to run before kdump due to the lack of flexibility, increasing a lot the risk of failure for kdump. (c) To add on top of all such complexity, we don't have a custom machine_crash_shutdown() handler for some architectures like ARM64, and the feeling is that's not right to add a bunch of callbacks / complexity in such architecture code, specially since we have the notifiers infrastructure in the kernel. I've recently started a discussion about that with ARM64 community, please take a look in [1]. With that said, we can use (a) + (b) + (c) to justify our argument here: the panic notifiers should be refactored! We need to try to encompass the antagonistic goals of kdump (wants to be the fi
Re: [PATCH 24/30] panic: Refactor the panic path
Sorry for the delayed response. Unfortunately, I had 10 days holidays until yesterday... > .../admin-guide/kernel-parameters.txt | 42 ++- > include/linux/panic_notifier.h| 1 + > kernel/kexec_core.c | 8 +- > kernel/panic.c| 292 +- > .../selftests/pstore/pstore_crash_test| 5 +- > 5 files changed, 252 insertions(+), 96 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 3f1cc5e317ed..8d3524060ce3 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt ...snip... > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* Hmmm, why are you trying to change default setting? Based on the current design of kdump, it's natural to put what the handlers for these level 1 and level 2 handlers do in machine_crash_shutdown(), as these are necessary by default, right? Or have you already tried that and figured out it's difficult in some reason and reached the current design? If so, why is that difficult? Could you point to if there is already such discussion online? kdump is designed to perform as little things as possible before transferring the execution to the 2nd kernel in order to increase reliability. Just detour to panic() increases risks of kdump failure in the sense of increasing the executed codes in the abnormal situation, which is very the note in the explanation of crash_kexec_post_notifiers. Also, the current implementation of crash_kexec_post_notifiers uses the panic notifier, but this is not from the technical reason. Ideally, it should have been implemented in the context of crash_kexec() independently of panic(). That is, it looks to me that, in addition to changing design of panic notifier, you are trying to integrate shutdown code of the crash kexec and the panic paths. If so, this is a big design change for kdump. I'm concerned about increase of reliability. I'd like you to discuss them carefully. Thanks. HATAYAMA, Daisuke
Re: [PATCH 24/30] panic: Refactor the panic path
On 29/04/2022 13:04, Guilherme G. Piccoli wrote: > On 27/04/2022 21:28, Randy Dunlap wrote: >> >> >> On 4/27/22 15:49, Guilherme G. Piccoli wrote: >>> + crash_kexec_post_notifiers >>> + This was DEPRECATED - users should always prefer the >> >> This is DEPRECATED - users should always prefer the >> >>> + parameter "panic_notifiers_level" - check its entry >>> + in this documentation for details on how it works. >>> + Setting this parameter is exactly the same as setting >>> + "panic_notifiers_level=4". >> > > Thanks Randy, for your suggestion - but I confess I couldn't understand > it properly. It's related to spaces/tabs, right? What you suggest me to > change in this formatting? Just by looking the email I can't parse. > > Cheers, > > > Guilherme Complete lack of attention from me, apologies! The suggestions was s/was/is - already fixed for V2, thanks Randy.
Re: [PATCH 24/30] panic: Refactor the panic path
On 03/05/2022 14:31, Michael Kelley (LINUX) wrote: > [...] > > To me, it's a weak correlation between having a kmsg dumper, and > wanting or not wanting the info level output to come before kdump. > Hyper-V is one of only a few places that register a kmsg dumper, so most > Linux instances outside of Hyper-V guest (and PowerPC systems?) will have > the info level output after kdump. It seems like anyone who cared strongly > about the info level output would set the panic_notifier_level to 1 or to 3 > so that the result is more deterministic. But that's just my opinion, and > it's probably an opinion that is not as well informed on the topic as some > others in the discussion. So keeping things as in your patch set is not a > show-stopper for me. > > However, I would request a clarification in the documentation. The > panic_notifier_level affects not only the hypervisor, informational, > and pre_reboot lists, but it also affects panic_print_sys_info() and > kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and > kmsg_dump() will not be run before kdump. At level 3, they will > always be run before kdump. Your documentation above mentions > "informational lists" (plural), which I take to vaguely include > kmsg_dump() and panic_print_sys_info(), but being explicit about > the effect would be better. > > Michael Thanks again Michael, to express your points and concerns - great idea of documentation improvement here, I'll do that for V2, for sure. The idea of "defaulting" to skip the info list on kdump (if no kmsg_dump() is set) is again a mechanism that aims at accommodating all users and concerns of antagonistic goals, kdump vs notifier lists. Before this patch set, by default no notifier executed before kdump. So, the "pendulum" was strongly on kdump side, and clearly this was a sub-optimal decision - proof of that is that both Hyper-V / PowerPC code forcibly set the "crash_kexec_post_notifiers". The goal here is to have a more lightweight list that by default runs before kdump, a secondary list that only runs before kdump if there's usage for that (either user sets that or kmsg_dumper set is considered a valid user), and the remaining notifiers run by default only after kdump, all of that very customizable through the levels idea. Now, one thing we could do to improve consistency for the hyper-v case: having a kmsg_dump_once() helper, and *for Hyper-V only*, call it on the hypervisor list, within the info notifier (that would be moved to hypervisor list, ofc). Let's wait for more feedback on that, just throwing some ideas in order we can have everyone happy with the end-result! Cheers, Guilherme
RE: [PATCH 24/30] panic: Refactor the panic path
From: Guilherme G. Piccoli Sent: Friday, April 29, 2022 1:38 PM > > On 29/04/2022 14:53, Michael Kelley (LINUX) wrote: > > From: Guilherme G. Piccoli Sent: Wednesday, April 27, > > 2022 > 3:49 PM > >> [...] > >> + panic_notifiers_level= > >> + [KNL] Set the panic notifiers execution order. > >> + Format: > >> + We currently have 4 lists of panic notifiers; based > >> + on the functionality and risk (for panic success) the > >> + callbacks are added in a given list. The lists are: > >> + - hypervisor/FW notification list (low risk); > >> + - informational list (low/medium risk); > >> + - pre_reboot list (higher risk); > >> + - post_reboot list (only run late in panic and after > >> + kdump, not configurable for now). > >> + This parameter defines the ordering of the first 3 > >> + lists with regards to kdump; the levels determine > >> + which set of notifiers execute before kdump. The > >> + accepted levels are: > >> + 0: kdump is the first thing to run, NO list is > >> + executed before kdump. > >> + 1: only the hypervisor list is executed before kdump. > >> + 2 (default level): the hypervisor list and (*if* > >> + there's any kmsg_dumper defined) the informational > >> + list are executed before kdump. > >> + 3: both the hypervisor and the informational lists > >> + (always) execute before kdump. > > > > I'm not clear on why level 2 exists. What is the scenario where > > execution of the info list before kdump should be conditional on the > > existence of a kmsg_dumper? Maybe the scenario is described > > somewhere in the patch set and I just missed it. > > > > Hi Michael, thanks for your review/consideration. So, this idea started > kind of some time ago. It all started with a need of exposing more > information on kernel log *before* kdump and *before* pstore - > specifically, we're talking about panic_print. But this cause some > reactions, Baoquan was very concerned with that [0]. Soon after, I've > proposed a panic notifiers filter (orthogonal) approach, to which Petr > suggested instead doing a major refactor [1] - it finally is alive in > the form of this series. > > The theory behind the level 2 is to allow a scenario of kdump with the > minimum amount of notifiers - what is the point in printing more > information if the user doesn't care, since it's going to kdump? Now, if > there is a kmsg dumper, it means that there is likely some interest in > collecting information, and that might as well be required before the > potential kdump (which is my case, hence the proposal on [0]). > > Instead of forcing one of the two behaviors (level 1 or level 3), we > have a middle-term/compromise: if there's interest in collecting such > data (in the form of a kmsg dumper), we then execute the informational > notifiers before kdump. If not, why to increase (even slightly) the risk > for kdump? > > I'm OK in removing the level 2 if people prefer, but I don't feel it's a > burden, quite opposite - seems a good way to accommodate the somewhat > antagonistic ideas (jump to kdump ASAP vs collecting more info in the > panicked kernel log). > > [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/ > > [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ > To me, it's a weak correlation between having a kmsg dumper, and wanting or not wanting the info level output to come before kdump. Hyper-V is one of only a few places that register a kmsg dumper, so most Linux instances outside of Hyper-V guest (and PowerPC systems?) will have the info level output after kdump. It seems like anyone who cared strongly about the info level output would set the panic_notifier_level to 1 or to 3 so that the result is more deterministic. But that's just my opinion, and it's probably an opinion that is not as well informed on the topic as some others in the discussion. So keeping things as in your patch set is not a show-stopper for me. However, I would request a clarification in the documentation. The panic_notifier_level affects not only the hypervisor, informational, and pre_reboot lists, but it also affects panic_print_sys_info() and kmsg_dump(). Specifically, at level 1, panic_print_sys_info() and kmsg_dump() will not be run before kdump. At level 3, they will always be run before kdump. Your documentation above mentions "informational lists" (plural), which I take to vaguely include kmsg_dump() and panic_print_sys_info(), but being explicit about the effect would be better. Michael > > >[...] > >> + * Based on the level configured (smaller than 4), we clear the > >> + * proper bits in "panic_notifiers_bits". Not
Re: [PATCH 24/30] panic: Refactor the panic path
On 29/04/2022 14:53, Michael Kelley (LINUX) wrote: > From: Guilherme G. Piccoli Sent: Wednesday, April 27, > 2022 3:49 PM >> [...] >> +panic_notifiers_level= >> +[KNL] Set the panic notifiers execution order. >> +Format: >> +We currently have 4 lists of panic notifiers; based >> +on the functionality and risk (for panic success) the >> +callbacks are added in a given list. The lists are: >> +- hypervisor/FW notification list (low risk); >> +- informational list (low/medium risk); >> +- pre_reboot list (higher risk); >> +- post_reboot list (only run late in panic and after >> +kdump, not configurable for now). >> +This parameter defines the ordering of the first 3 >> +lists with regards to kdump; the levels determine >> +which set of notifiers execute before kdump. The >> +accepted levels are: >> +0: kdump is the first thing to run, NO list is >> +executed before kdump. >> +1: only the hypervisor list is executed before kdump. >> +2 (default level): the hypervisor list and (*if* >> +there's any kmsg_dumper defined) the informational >> +list are executed before kdump. >> +3: both the hypervisor and the informational lists >> +(always) execute before kdump. > > I'm not clear on why level 2 exists. What is the scenario where > execution of the info list before kdump should be conditional on the > existence of a kmsg_dumper? Maybe the scenario is described > somewhere in the patch set and I just missed it. > Hi Michael, thanks for your review/consideration. So, this idea started kind of some time ago. It all started with a need of exposing more information on kernel log *before* kdump and *before* pstore - specifically, we're talking about panic_print. But this cause some reactions, Baoquan was very concerned with that [0]. Soon after, I've proposed a panic notifiers filter (orthogonal) approach, to which Petr suggested instead doing a major refactor [1] - it finally is alive in the form of this series. The theory behind the level 2 is to allow a scenario of kdump with the minimum amount of notifiers - what is the point in printing more information if the user doesn't care, since it's going to kdump? Now, if there is a kmsg dumper, it means that there is likely some interest in collecting information, and that might as well be required before the potential kdump (which is my case, hence the proposal on [0]). Instead of forcing one of the two behaviors (level 1 or level 3), we have a middle-term/compromise: if there's interest in collecting such data (in the form of a kmsg dumper), we then execute the informational notifiers before kdump. If not, why to increase (even slightly) the risk for kdump? I'm OK in removing the level 2 if people prefer, but I don't feel it's a burden, quite opposite - seems a good way to accommodate the somewhat antagonistic ideas (jump to kdump ASAP vs collecting more info in the panicked kernel log). [0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/ [1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ >[...] >> + * Based on the level configured (smaller than 4), we clear the >> + * proper bits in "panic_notifiers_bits". Notice that this bitfield >> + * is initialized with all notifiers set. >> + */ >> +switch (panic_notifiers_level) { >> +case 3: >> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); >> +break; >> +case 2: >> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); >> + >> +if (!kmsg_has_dumpers()) >> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits); >> +break; >> +case 1: >> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); >> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits); >> +break; >> +case 0: >> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); >> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits); >> +clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); >> +break; >> +} > > I think the above switch statement could be done as follows: > > if (panic_notifiers_level <= 3) > clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits); > if (panic_notifiers_level <= 2) > if (!kmsg_has_dumpers()) > clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > if (panic_notifiers_level <=1) > clear_bit(PN_INFO_BIT, &panic_notifiers_bits); > if (panic_notifiers_level == 0) > clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits); > > That's about half the lines of code. It's
RE: [PATCH 24/30] panic: Refactor the panic path
From: Guilherme G. Piccoli Sent: Wednesday, April 27, 2022 3:49 PM > > The panic() function is somewhat convoluted - a lot of changes were > made over the years, adding comments that might be misleading/outdated > now, it has a code structure that is a bit complex to follow, with > lots of conditionals, for example. The panic notifier list is something > else - a single list, with multiple callbacks of different purposes, > that run in a non-deterministic order and may affect hardly kdump > reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. > > This patch proposes a major refactor on the panic path based on Petr's > idea [0] - basically we split the notifiers list in three, having a set > of different call points in the panic path. Below a list of changes > proposed in this patch, culminating in the panic notifiers level > concept: > > (a) First of all, we improved comments all over the function > and removed useless variables / includes. Also, as part of this > clean-up we concentrate the console flushing functions in a helper. > > (b) As mentioned before, there is a split of the panic notifier list > in three, based on the purpose of the callback. The code contains > good documentation in form of comments, but a summary of the three > lists follows: > > - the hypervisor list aims low-risk procedures to inform hypervisors > or firmware about the panic event, also includes LED-related functions; > > - the informational list contains callbacks that provide more details, > like kernel offset or trace dump (if enabled) and also includes the > callbacks aimed at reducing log pollution or warns, like the RCU and > hung task disable callbacks; > > - finally, the pre_reboot list is the old notifier list renamed, > containing the more risky callbacks that didn't fit the previous > lists. There is also a 4th list (the post_reboot one), but it's not > related with the original list - it contains late time architecture > callbacks aimed at stopping the machine, for example. > > The 3 notifiers lists execute in different moments, hypervisor being > the first, followed by informational and finally the pre_reboot list. > > (c) But then, there is the ordering problem of the notifiers against > the crash_kernel() call - kdump must be as reliable as possible. > For that, a simple binary "switch" as "crash_kexec_post_notifiers" > is not enough, hence we introduce here concept of panic notifier > levels: there are 5 levels, from 0 (no notifier executes before > kdump) until 4 (all notifiers run before kdump); the default level > is 2, in which the hypervisor and (iff we have any kmsg dumper) > the informational notifiers execute before kdump. > > The detailed documentation of the levels is present in code comments > and in the kernel-parameters.txt file; as an analogy with the previous > panic() implementation, the level 0 is exactly the same as the old > behavior of notifiers, running all after kdump, and the level 4 is > the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as > a deprecated one). > > (d) Finally, an important change made here: we now use only the > function "crash_smp_send_stop()" to shut all the secondary CPUs > in the panic path. Before, there was a case of using the regular > "smp_send_stop()", but the better approach is to simplify the > code and try to use the function which was created exclusively > for the panic path. Experiments showed that it works fine, and > code was very simplified with that. > > Functional change is expected from this refactor, since now we > call some notifiers by default before kdump, but the goal here > besides code clean-up is to have a better panic path, more > reliable and deterministic, but also very customizable. > > [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ > > Suggested-by: Petr Mladek > Signed-off-by: Guilherme G. Piccoli > --- > > Special thanks to Petr and Baoquan for the suggestion and feedback in a > previous > email thread. There's some important design decisions that worth mentioning > and > discussing: > > * The default panic notifiers level is 2, based on Petr Mladek's suggestion, > which makes a lot of sense. Of course, this is customizable through the > parameter, but would be something worthwhile to have a KConfig option to set > the default level? It would help distros that want the old behavior > (no notifiers before kdump) as default. > > * The implementation choice was to _avoid_ intricate if conditionals in the > panic path, which would _definitely_ be present with the panic notifiers > levels > idea; so, instead of lots of if conditionals, the set/clear bits approach with > functions called in 2 points (but executing only in one of them) is much > easier > to follow an was used here; the ordering helper function and the comments also > help a lot to avoid confusion (hopefully). > > * Choice was to *always* use crash_smp_send_stop() instead of sometimes making > use of the re
Re: [PATCH 24/30] panic: Refactor the panic path
On 27/04/2022 21:28, Randy Dunlap wrote: > > > On 4/27/22 15:49, Guilherme G. Piccoli wrote: >> +crash_kexec_post_notifiers >> +This was DEPRECATED - users should always prefer the > > This is DEPRECATED - users should always prefer the > >> +parameter "panic_notifiers_level" - check its entry >> +in this documentation for details on how it works. >> +Setting this parameter is exactly the same as setting >> +"panic_notifiers_level=4". > Thanks Randy, for your suggestion - but I confess I couldn't understand it properly. It's related to spaces/tabs, right? What you suggest me to change in this formatting? Just by looking the email I can't parse. Cheers, Guilherme
Re: [PATCH 24/30] panic: Refactor the panic path
On 4/27/22 15:49, Guilherme G. Piccoli wrote: > + crash_kexec_post_notifiers > + This was DEPRECATED - users should always prefer the This is DEPRECATED - users should always prefer the > + parameter "panic_notifiers_level" - check its entry > + in this documentation for details on how it works. > + Setting this parameter is exactly the same as setting > + "panic_notifiers_level=4". -- ~Randy
[PATCH 24/30] panic: Refactor the panic path
The panic() function is somewhat convoluted - a lot of changes were made over the years, adding comments that might be misleading/outdated now, it has a code structure that is a bit complex to follow, with lots of conditionals, for example. The panic notifier list is something else - a single list, with multiple callbacks of different purposes, that run in a non-deterministic order and may affect hardly kdump reliability - see the "crash_kexec_post_notifiers" workaround-ish flag. This patch proposes a major refactor on the panic path based on Petr's idea [0] - basically we split the notifiers list in three, having a set of different call points in the panic path. Below a list of changes proposed in this patch, culminating in the panic notifiers level concept: (a) First of all, we improved comments all over the function and removed useless variables / includes. Also, as part of this clean-up we concentrate the console flushing functions in a helper. (b) As mentioned before, there is a split of the panic notifier list in three, based on the purpose of the callback. The code contains good documentation in form of comments, but a summary of the three lists follows: - the hypervisor list aims low-risk procedures to inform hypervisors or firmware about the panic event, also includes LED-related functions; - the informational list contains callbacks that provide more details, like kernel offset or trace dump (if enabled) and also includes the callbacks aimed at reducing log pollution or warns, like the RCU and hung task disable callbacks; - finally, the pre_reboot list is the old notifier list renamed, containing the more risky callbacks that didn't fit the previous lists. There is also a 4th list (the post_reboot one), but it's not related with the original list - it contains late time architecture callbacks aimed at stopping the machine, for example. The 3 notifiers lists execute in different moments, hypervisor being the first, followed by informational and finally the pre_reboot list. (c) But then, there is the ordering problem of the notifiers against the crash_kernel() call - kdump must be as reliable as possible. For that, a simple binary "switch" as "crash_kexec_post_notifiers" is not enough, hence we introduce here concept of panic notifier levels: there are 5 levels, from 0 (no notifier executes before kdump) until 4 (all notifiers run before kdump); the default level is 2, in which the hypervisor and (iff we have any kmsg dumper) the informational notifiers execute before kdump. The detailed documentation of the levels is present in code comments and in the kernel-parameters.txt file; as an analogy with the previous panic() implementation, the level 0 is exactly the same as the old behavior of notifiers, running all after kdump, and the level 4 is the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as a deprecated one). (d) Finally, an important change made here: we now use only the function "crash_smp_send_stop()" to shut all the secondary CPUs in the panic path. Before, there was a case of using the regular "smp_send_stop()", but the better approach is to simplify the code and try to use the function which was created exclusively for the panic path. Experiments showed that it works fine, and code was very simplified with that. Functional change is expected from this refactor, since now we call some notifiers by default before kdump, but the goal here besides code clean-up is to have a better panic path, more reliable and deterministic, but also very customizable. [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/ Suggested-by: Petr Mladek Signed-off-by: Guilherme G. Piccoli --- Special thanks to Petr and Baoquan for the suggestion and feedback in a previous email thread. There's some important design decisions that worth mentioning and discussing: * The default panic notifiers level is 2, based on Petr Mladek's suggestion, which makes a lot of sense. Of course, this is customizable through the parameter, but would be something worthwhile to have a KConfig option to set the default level? It would help distros that want the old behavior (no notifiers before kdump) as default. * The implementation choice was to _avoid_ intricate if conditionals in the panic path, which would _definitely_ be present with the panic notifiers levels idea; so, instead of lots of if conditionals, the set/clear bits approach with functions called in 2 points (but executing only in one of them) is much easier to follow an was used here; the ordering helper function and the comments also help a lot to avoid confusion (hopefully). * Choice was to *always* use crash_smp_send_stop() instead of sometimes making use of the regular smp_send_stop(); for most architectures they are the same, including Xen (on x86). For the ones that override it, all should work fine, in the powerpc case it's even more correct (see the subsequent patch "powerpc: Do not force all panic notifiers to execute before