Re: [PATCH 24/30] panic: Refactor the panic path
Sorry for the delayed response. Unfortunately, I had 10 days holidays until yesterday... > .../admin-guide/kernel-parameters.txt | 42 ++- > include/linux/panic_notifier.h| 1 + > kernel/kexec_core.c | 8 +- > kernel/panic.c| 292 +- > .../selftests/pstore/pstore_crash_test| 5 +- > 5 files changed, 252 insertions(+), 96 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 3f1cc5e317ed..8d3524060ce3 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt ...snip... > @@ -3784,6 +3791,33 @@ > timeout < 0: reboot immediately > Format: > > + panic_notifiers_level= > + [KNL] Set the panic notifiers execution order. > + Format: > + We currently have 4 lists of panic notifiers; based > + on the functionality and risk (for panic success) the > + callbacks are added in a given list. The lists are: > + - hypervisor/FW notification list (low risk); > + - informational list (low/medium risk); > + - pre_reboot list (higher risk); > + - post_reboot list (only run late in panic and after > + kdump, not configurable for now). > + This parameter defines the ordering of the first 3 > + lists with regards to kdump; the levels determine > + which set of notifiers execute before kdump. The > + accepted levels are: > + 0: kdump is the first thing to run, NO list is > + executed before kdump. > + 1: only the hypervisor list is executed before kdump. > + 2 (default level): the hypervisor list and (*if* Hmmm, why are you trying to change default setting? Based on the current design of kdump, it's natural to put what the handlers for these level 1 and level 2 handlers do in machine_crash_shutdown(), as these are necessary by default, right? Or have you already tried that and figured out it's difficult in some reason and reached the current design? If so, why is that difficult? Could you point to if there is already such discussion online? kdump is designed to perform as little things as possible before transferring the execution to the 2nd kernel in order to increase reliability. Just detour to panic() increases risks of kdump failure in the sense of increasing the executed codes in the abnormal situation, which is very the note in the explanation of crash_kexec_post_notifiers. Also, the current implementation of crash_kexec_post_notifiers uses the panic notifier, but this is not from the technical reason. Ideally, it should have been implemented in the context of crash_kexec() independently of panic(). That is, it looks to me that, in addition to changing design of panic notifier, you are trying to integrate shutdown code of the crash kexec and the panic paths. If so, this is a big design change for kdump. I'm concerned about increase of reliability. I'd like you to discuss them carefully. Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter
> On 01/24/22 at 11:48am, Guilherme G. Piccoli wrote: > > On 24/01/2022 10:59, Baoquan He wrote: > > > [...] > > > About pre_dump, if the dump is crash dump, hope those pre_dump notifiers > > > will be executed under conditional check, e.g only if > > > 'crash_kexec_post_notifiers' > > > is specified in kernel cmdline. > > > > Hi Baoquan, based on Petr's suggestion, I think pre_dump would be > > responsible for really *non-intrusive/non-risky* tasks and should be > > always executed in the panic path (before kdump), regardless of > > "crash_kexec_post_notifiers". > > > > The idea is that the majority of the notifiers would be executed in the > > post_dump portion, and for that, we have the > > "crash_kexec_post_notifiers" conditional. I also suggest we have > > blacklist options (based on function names) for both notifiers, in order > > to make kdump issues debug easier. > > > > Do you agree with that? Feel free to comment with suggestions! > > Cheers, > > I would say "please NO" cautiously. > > As Petr said, kdump mostly works only if people configure it correctly. > That's because we try best to switch to kdump kernel from the fragile > panicked kernel immediately. When we try to add anthing before the switching, > please consider carefully and ask if that adding is mandatory, otherwise > switching into kdump kernel may fail. If the answer is yes, the adding > is needed and welcomed. Othewise, any unnecessary action, including any > "non-intrusive/non-risky" tasks, would be unwelcomed. > > Surely, we don't oppose the "non-intrusive/non-risky" or completely > "intrusive/risky" action adding before kdump kernel switching, with a > conditional limitation. When we handle customers' kdump support, we > explicitly declare we only support normal and default kdump operation. > If any action which is done before switching into kdump kernel is specified, > e.g panic_notifier, panic_print, they need take care of their own kdump > failure. Sorry for bringing back the past discussion, but how about reconsidering the following design? - kdump-specific notifier list within crash_kexec() I don't think that the current implementation of crash_kexec_post_notifiers is ideal, where panic() and panic notifier are used, which contains the code that is unnecessary for kdump, and it unnecessarily decreases kdump's reliability. The presence of kdump code in panic() also conversely makes panic()'s code unnecessarily complicated. I use crash_kexec_post_notifiers with the understanding that it affects kdump's reliability to some degree, but at the same time I want it to be as reliable as possible. I think introducing kdump-specific notifier list will resolve the issues caused by dependencies to panic()'s code. In addition, as I already said in another mail, I want to avoid invoking any other handlers passed by users other than me, in order to keep kdump's reliability. For such purpose, I think some feature like Guilherme's panic notifier filter is needed. ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter
> Hi, thanks for the review. First of all, notice that it's very likely > this patch isn't gonna get merged this way, we are considering a > refactor that included 2 panic notifiers: one a bit earlier (pre_dump), > that includes functions less risky, as watchdog unloaders, kernel offset > dump, etc, and the second panic notifier (post_dump) will keep the > majority of callbacks, and can be conditionally executed on kdump > through the usage of "crash_kexec_post_notifiers". But the pre_dump cannot avoid calling multiple unnecessary handlers, right? It's more risky than the previous idea... > Anyway, I'm curious with your code review - how can we use this filter > with modules, if the filter setup is invoked as early_param(), before > modules load? In that case, module functions won't have a valid address, > correct? So, in that moment, this lookup fails, we cannot record an > unloaded module address in such list. Please, correct me if I'm wrong. For example, how about simply maintaining function symbol names in the list as string, not address. Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] sadump, kaslr: fix failure of calculating kaslr_offset
On kernels v5.8 or later, makedumpfile fails for memory dumps in the sadump-related formats as follows: # makedumpfile -f -l -d 31 -x ./vmlinux /dev/sdd4 /root/vmcore-ld31 __vtop4_x86_64: Can't get a valid pud_pte. ...110 lines of the same message... __vtop4_x86_64: Can't get a valid pud_pte. calc_kaslr_offset: failed to calculate kaslr_offset and phys_base; default to 0 readmem: type_addr: 1, addr:85411858, size:8 __vtop4_x86_64: Can't get pgd (page_dir:85411858). readmem: Can't convert a virtual address(059be980) to physical address. readmem: type_addr: 0, addr:059be980, size:1024 cpu_online_mask_init: Can't read cpu_online_mask memory. makedumpfile Failed. This is caused by the kernel commit 9d06c4027f21 ("x86/entry: Convert Divide Error to IDTENTRY") that renamed divide_error to asm_exc_divide_error, breaking logic for calculating kaslr offset. Fix this by adding initialization of asm_exc_divide_error. Signed-off-by: HATAYAMA Daisuke --- makedumpfile.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/makedumpfile.c b/makedumpfile.c index a51bdaf..7ed9756 100644 --- a/makedumpfile.c +++ b/makedumpfile.c @@ -1667,6 +1667,8 @@ get_symbol_info(void) SYMBOL_INIT(cur_cpu_spec, "cur_cpu_spec"); SYMBOL_INIT(divide_error, "divide_error"); + if (SYMBOL(divide_error) == NOT_FOUND_SYMBOL) + SYMBOL_INIT(divide_error, "asm_exc_divide_error"); SYMBOL_INIT(idt_table, "idt_table"); SYMBOL_INIT(saved_command_line, "saved_command_line"); SYMBOL_INIT(pti_init, "pti_init"); -- 2.31.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter
> @@ -146,6 +157,34 @@ void nmi_panic(struct pt_regs *regs, const char *msg) > } > EXPORT_SYMBOL(nmi_panic); > > +static int __init panic_notifier_filter_setup(char *buf) > +{ > + char *func; > + unsigned long addr; > + > + while ((func = strsep(, ","))) { > + addr = kallsyms_lookup_name(func); > + if (!addr) { > + pr_warn("panic_notifier_filter: invalid symbol %s\n", > func); > + continue; > + } Could you remove this check? panic_notifier_list is exported to kernel modules and this check prevents such users from using this feature. Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter
> On 01/21/22 at 05:31pm, Guilherme G. Piccoli wrote: > .. > > > IMHO, the right solution is to split the callbacks into 2 or more > > > notifier list. Then we might rework panic() to do: > > > > > > void panic(void) > > > { > > > [...] > > > > > > /* stop watchdogs + extra info */ > > > atomic_notifier_call_chain(_disable_watchdogs_notifier_list, 0, > > > buf); > > > atomic_notifier_call_chain(_info_notifier_list, 0, buf); > > > panic_print_sys_info(); > > > > > > /* crash_kexec + kmsg_dump in configurable order */ > > > if (!_crash_kexec_post_kmsg_dump) { > > > __crash_kexec(NULL); > > > smp_send_stop(); > > > } else { > > > crash_smp_send_stop(); > > > } > > > > > > kmsg_dump(); > > > if (_crash_kexec_post_kmsg_dump) > > > __crash_kexec(NULL); > > > > > > /* infinite loop or reboot */ > > > atomic_notifier_call_chain(_hypervisor_notifier_list, 0, buf); > > > atomic_notifier_call_chain(_rest_notifier_list, 0, buf); > > > > > > console_flush_on_panic(CONSOLE_FLUSH_PENDING); > > > [...] > > > Two notifier lists might be enough in the above scenario. I would call > > > them: > > > > > > panic_pre_dump_notifier_list > > > panic_post_dump_notifier_list > > > > > > > > > It is a real solution that will help everyone. It is more complicated now > > > but it will makes things much easier in the long term. And it might be > > > done > > > step by step: > > > > > > 1. introduce the two notifier lists > > > 2. convert all users: one by one > > > 3. remove the original notifier list when there is no user > > > > That's a great idea! I'm into it, if we have a consensus. The thing that > > scares me most here is that this is a big change and consumes time to > > implement - I'd not risk such time if somebody is really against that. > > So, let's see more opinions, maybe the kdump maintainers have good input. > > I am fine with it. As long as thing is made clear, glad to see code is > refactored to be more understandable and improved. Earlier, during several > rounds of discussion between you and Petr, seveal pitfalls have been > pointed out and avoided. > > Meanwhile, I would suggest Masa and HATAYAMA to help give input about > panic_notifier usage and refactory. AFAIK, they contributed code and use > panic_notifier in their product or environment a lot, that will be very > helpful to get the first hand information from them. > > Hi Masa, HATAYANA, > > Any comment on this? (Please ignore this if it's not in your care.) > Thanks for CCing to me. I like this patch set. I have same motivation. For example, when I used crash_kexec_post_notifiers, I sometimes ran into deadlock in printk's exclusion logic during the call of panic notifiers since kaslr outputs kernel offset at panic by dump_kernel_offset() via panic notifers (although this might never happen now thanks to lockless implementation). The problem is that in the current design, we have to run all the tasks registered, although most of them are actually unnecessary for other users' requirements. Each user wants to call only their own handlers in order to keep kdump as reliable as possible. I've just started reading this patch set and have no other comments for now. Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH] sadump: Fix failure of reading __per_cpu_load memory
Saito-san, Thanks for your patch. I think it better to remove __per_cpu_load code, which is unnecessary. 差出人: kexec が saito.kaz...@fujitsu.com の代理で送信 送信日時: 2020年5月21日 16:15 宛先: 'kexec@lists.infradead.org' 件名: [PATCH] sadump: Fix failure of reading __per_cpu_load memory Creating vmcore from sadump by makedumpfile fails with the following error messages since kernel-4.19 with PTI (Page Table Isolation) enabled: __vtop4_x86_64: Can't get a valid pte. readmem: Can't convert a virtual address(b2986000) to physical address. readmem: type_addr: 0, addr:b2986000, size:8 per_cpu_init: Can't read __per_cpu_load memory. This is caused by the following patch: https://github.com/torvalds/linux/commit/c40a56a7818cfe735fc93a69e1875f8bba834483 The above patch clears _PAGE_PRESENT bit of __per_cpu_load memory, so __vtop4_x86_64 fails to convert the virtual address of the __per_cpu_load. To fix this issue, this patch changes sanity check of per_cpu_ptr() to use address of the __per_cpu_load instead of data of the memory. Signed-off-by: Kazuya Saito Signed-off-by: Kiyotaka Nakamura --- sadump_info.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sadump_info.c b/sadump_info.c index 46867ce..72a077b 100644 --- a/sadump_info.c +++ b/sadump_info.c @@ -1732,11 +1732,11 @@ per_cpu_init(void) return FALSE; } - if (!readmem(VADDR, SYMBOL(__per_cpu_load), >__per_cpu_load, -sizeof(unsigned long))) { - ERRMSG("Can't read __per_cpu_load memory.\n"); + if (SYMBOL(__per_cpu_load) == NOT_FOUND_SYMBOL) { + ERRMSG("Can't find __per_cpu_load symbol.\n"); return FALSE; } + si->__per_cpu_load = SYMBOL(__per_cpu_load); DEBUG_MSG("sadump: __per_cpu_load: %#lx\n", si->__per_cpu_load); DEBUG_MSG("sadump: __per_cpu_offset: LENGTH: %ld\n", -- 2.12.3 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
RE: [RFD] kdump, kaslr: how to fix the failure of reservation of crashkernel low memory due to physical kaslr
> -Original Message- > > Hi HATAYAMA, > > On 12/25/19 at 04:26am, d.hatay...@fujitsu.com wrote: > > Currently, reservation of crashkernel low memory sometimes fails due > > to a sparse memory caused by physical kaslr with the following > > message: > > > > Cannot reserve 256MB crashkernel low memory, please try smaller size. > > I don't understand, may not get your point. KASLR will randomize the > position of kernel image. However, kernel image usually takes up 50M > memory. Under low 4G memory, how come it can't reserve 256M crashkernel > low memory. Do you have the boot log of the failed case? Thanks for your comments and sorry for the insufficient explanation. Low 4GB memory in our system is considerably limited. The size of the largest contiguous free physical pages at the timing when kernel attempts at reserving low memory for crash kernel is less than 512MB. Hence, if physical kaslr inserts kernel image into the center of the chunk, every remaining chunks have less than 256M size. Then, the failure occurs. > > > > > Kdump needs low memory, memory area less than 4GB, e.g. for swiotlb. > > Its size is 256MB for low memory by default. OTOH, physical kaslr > > loads kernel images in a random physical address for > > security. Physical kaslr sometimes choose low memory and sparse > > there and as a result, reservation of crash kernel low memory could fail. > > > > This failure seldom occurs on systems with large memory. For example, > > on our system with 128GB, the issue occurs once in hundreds of > > reboots. Although it doesn't occur frequently and can be avoided in > > practice simply by rebooting the system, it definitely occurs once in > > hundreds of reboots. Once the issue occurs, it's difficult for ordinary > > users to understand why it failed. I'd like to fix this current behavior. > > > > I'm now coming up two ideas but don't know what is best. Please > > discuss how to fix the issue in better way. > > > > 1) Add a kernel parameter to make physical kaslr to avoid specified > >memory area > > > > This is the simplest idea I came up with first just like > >kaslr_mem_avoid=4GB-0, which is similar syntax to memap=, meaning > >that kaslr, please avoid to load kernel image into the region [0, > >4GB). > > > > It looks to me that this can be implemented easily by taking > >advantage of the existing code about mem_avoid mechanism in > >arch/x86/boot/compressed/kaslr.c. > > > > This mechanism doesn't lose security provided by physical kaslr if > >system memory is large enough. > > > > Demerit of this is that users need configuration. Automatic way is > >better if possible. > > > > 2) Add special handling for crashkernel= low in physical kaslr > > > > The second idea I came up with is to add special handling for > >crashkernel= low in physical kaslr, i.e. physical kaslr recognizes > >crashkernel= in kernel parameters and keep enough memory for > >crashkenrel. > > > > To guarantee that the memory area kept by the special handling in > >physical kaslr is used for crashkernel, it is necessary to mark the > >area to indicate to the crashkernel code executed after kernel > >runs. To implement this, I imagine introducing a new type of memory > >a kind of E820_CRASHKERNEL_LOW. > > > > My concern on this idea is whether its worth implementing such > >special handling in physical kaslr simply because I don't find such > >code in physical kaslr now. > > > > 3) Any other better ideas? > > Someone ever told that some systems may not have low 4G memory since > they own hardware iommu. In real life, I never see such kind of system, > and most of them can give 256M crashkernel memory a satisfactory result. > Unless you reserve more than 1G under low 4G, it could fail because of > kinds of complicated memory reservations there. I'm surprised to hear such system without low 4GB memory and I wonder how such system works well without restriction of memory access range in early runtime mode on x86 such as real mode. Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[RFD] kdump, kaslr: how to fix the failure of reservation of crashkernel low memory due to physical kaslr
Currently, reservation of crashkernel low memory sometimes fails due to a sparse memory caused by physical kaslr with the following message: Cannot reserve 256MB crashkernel low memory, please try smaller size. Kdump needs low memory, memory area less than 4GB, e.g. for swiotlb. Its size is 256MB for low memory by default. OTOH, physical kaslr loads kernel images in a random physical address for security. Physical kaslr sometimes choose low memory and sparse there and as a result, reservation of crash kernel low memory could fail. This failure seldom occurs on systems with large memory. For example, on our system with 128GB, the issue occurs once in hundreds of reboots. Although it doesn't occur frequently and can be avoided in practice simply by rebooting the system, it definitely occurs once in hundreds of reboots. Once the issue occurs, it's difficult for ordinary users to understand why it failed. I'd like to fix this current behavior. I'm now coming up two ideas but don't know what is best. Please discuss how to fix the issue in better way. 1) Add a kernel parameter to make physical kaslr to avoid specified memory area This is the simplest idea I came up with first just like kaslr_mem_avoid=4GB-0, which is similar syntax to memap=, meaning that kaslr, please avoid to load kernel image into the region [0, 4GB). It looks to me that this can be implemented easily by taking advantage of the existing code about mem_avoid mechanism in arch/x86/boot/compressed/kaslr.c. This mechanism doesn't lose security provided by physical kaslr if system memory is large enough. Demerit of this is that users need configuration. Automatic way is better if possible. 2) Add special handling for crashkernel= low in physical kaslr The second idea I came up with is to add special handling for crashkernel= low in physical kaslr, i.e. physical kaslr recognizes crashkernel= in kernel parameters and keep enough memory for crashkenrel. To guarantee that the memory area kept by the special handling in physical kaslr is used for crashkernel, it is necessary to mark the area to indicate to the crashkernel code executed after kernel runs. To implement this, I imagine introducing a new type of memory a kind of E820_CRASHKERNEL_LOW. My concern on this idea is whether its worth implementing such special handling in physical kaslr simply because I don't find such code in physical kaslr now. 3) Any other better ideas? Thanks. HATAYAMA, Daisuke ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
RE: [PATCH v2 2/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Masayoshi Mizuma > Sent: Wednesday, December 4, 2019 5:14 AM > To: Ard Biesheuvel ; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org > Cc: Masayoshi Mizuma ; Mizuma, Masayoshi/水間 理仁 > ; linux-ker...@vger.kernel.org; > kexec@lists.infradead.org; Hatayama, Daisuke/畑山 大輔 > ; Eric Biederman ; Matthias > Brugger > Subject: [PATCH v2 2/2] efi: arm64: Introduce /proc/efi/memreserve to tell the > persistent pages > > From: Masayoshi Mizuma > > kexec reboot stops in early boot sequence because efi_config_parse_tables() > refers garbage data. We can see the log with memblock=debug kernel option: > > efi: ACPI 2.0=0x9821790014 PROP=0x8757f5c0 SMBIOS 3.0=0x982074 > MEMRESERVE=0x9820bfdc58 > memblock_reserve: [0x009820bfdc58-0x009820bfdc67] > efi_config_parse_tables+0x228/0x278 > memblock_reserve: [0x8276-0x324d07ff] > efi_config_parse_tables+0x228/0x278 > memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9] > efi_config_parse_tables+0x244/0x278 > memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2] > efi_config_parse_tables+0x244/0x278 > memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12] > efi_config_parse_tables+0x244/0x278 > ... > > That happens because 0x8276, struct linux_efi_memreserve, is destroyed. > 0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the > head page of LPI pending table and LPI property table which are allocated by > gic_reserve_range(). > > The destroyer is kexec. kexec locates the initrd to the area: > > ]# kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img > --reuse-cmdline > ... > initrd: base 8229, size 388dd8ah (59301258) > ... > > From dynamic debug log. initrd is located in segment[1]: > machine_kexec_prepare:70: > kexec kimage info: > type:0 > start: 85b30680 > head:0 > nr_segments: 4 > segment[0]: 8048 - 8229, 0x1e1 bytes, 481 > pages > segment[1]: 8229 - 85b2, 0x389 bytes, 905 > pages > segment[2]: 85b2 - 85b3, 0x1 bytes, 1 > pages > segment[3]: 85b3 - 85b4, 0x1 bytes, 1 > pages > > kexec searches the memory region to locate initrd through > "System RAM" in /proc/iomem. The pending tables are included in > "System RAM" because they are allocated by alloc_pages(), so kexec > destroys the LPI pending tables. > > Introduce /proc/efi/memreserve to tell the pages pointed by > efi.mem_reserve so that kexec can avoid the area to locate initrd. > > Signed-off-by: Masayoshi Mizuma > --- > drivers/firmware/efi/efi.c | 75 -- > 1 file changed, 72 insertions(+), 3 deletions(-) > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > index d8157cb34..80bbe0b3e 100644 > --- a/drivers/firmware/efi/efi.c > +++ b/drivers/firmware/efi/efi.c > @@ -325,17 +325,87 @@ static __init int efivar_ssdt_load(void) > static inline int efivar_ssdt_load(void) { return 0; } > #endif > > +static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init; > + > #ifdef CONFIG_PROC_FS > static struct proc_dir_entry *proc_efi; > +#ifdef CONFIG_KEXEC > +static int memreserve_show(struct seq_file *m, void *v) > +{ > + struct linux_efi_memreserve *rsv; > + phys_addr_t start, end; > + unsigned long prsv; > + int count, i; > + > + if ((efi_memreserve_root == (void *)ULONG_MAX) || > + (!efi_memreserve_root)) > + return -ENODEV; > + > + for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) { > + rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB); > + if (!rsv) { > + pr_err("Could not map efi_memreserve\n"); > + return -ENOMEM; > + } > + count = atomic_read(>count); > + for (i = 0; i < count; i++) { > + start = rsv->entry[i].base; > + end = start + rsv->entry[i].size - 1; > + > + seq_printf(m, "%pa-%pa\n", , ); > + } It looks to me that we reach the same issue as sysfs if memreserve_show() calls seq_printf multiple times this way. Default buffer size of struct seq_file is also PAGE_SIZE. We can configure default buffer size using single_open_size(), but I think it better to do the same thing as /proc/iomem (and /proc/mounts and many others) than choosing a certain size as default value. Looking into implementation of /proc/iomem, its show method, r_show(), calls seq_printf() only once; in other words, r_show() is called the number of lines of /proc/iomem, which is equal to the number of resource objects. Then, there is no buffer size issue because "%pa-%pa\n"
RE: [RFC PATCH] efi: arm64: Introduce /sys/firmware/efi/memreserve to tell the persistent pages
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Masayoshi Mizuma > Sent: Wednesday, November 13, 2019 1:53 AM > To: Ard Biesheuvel ; > linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org > Cc: Masayoshi Mizuma ; Mizuma, Masayoshi/水間 理仁 > ; linux-ker...@vger.kernel.org; > kexec@lists.infradead.org > Subject: [RFC PATCH] efi: arm64: Introduce /sys/firmware/efi/memreserve to > tell the persistent pages > > From: Masayoshi Mizuma > > kexec reboot stucks because efi_config_parse_tables() refers garbage > (with memblock=debug): > > efi: ACPI 2.0=0x9821790014 PROP=0x8757f5c0 SMBIOS 3.0=0x982074 > MEMRESERVE=0x9820bfdc58 > memblock_reserve: [0x009820bfdc58-0x009820bfdc67] > efi_config_parse_tables+0x228/0x278 > memblock_reserve: [0x8276-0x324d07ff] > efi_config_parse_tables+0x228/0x278 > memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9] > efi_config_parse_tables+0x244/0x278 > memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2] > efi_config_parse_tables+0x244/0x278 > memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12] > efi_config_parse_tables+0x244/0x278 > ... > > That happens because 0x8276, struct linux_efi_memreserve, is destroyed. > 0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the > head page of pending table and prop table which are allocated by > gic_reserve_range(). > > The destroyer is kexec. kexec locates the inird to the area: > > # kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img > --reuse-cmdline > ... > initrd: base 8229, size 388dd8ah (59301258) > ... > > From dynamic debug log: > machine_kexec_prepare:70: > kexec kimage info: > type:0 > start: 85b30680 > head:0 > nr_segments: 4 > segment[0]: 8048 - 8229, 0x1e1 bytes, 481 > pages > segment[1]: 8229 - 85b2, 0x389 bytes, 905 > pages > segment[2]: 85b2 - 85b3, 0x1 bytes, 1 > pages > segment[3]: 85b3 - 85b4, 0x1 bytes, 1 > pages > > kexec searches the appropriate memory region to locate initrd through "System > RAM" > in /proc/iomem. The pending tables are included in "System RAM" because they > are > allocated by alloc_pages(), so kexec destroys the pending tables. > > Introduce /sys/firmware/efi/memreserve to tell the pages pointed by > efi.mem_reserve > so that kexec can avoid the area to locate initrd. > > Signed-off-by: Masayoshi Mizuma > --- > drivers/firmware/efi/efi.c | 32 +++- > 1 file changed, 31 insertions(+), 1 deletion(-) > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > index e98bbf8e5..67b21ae7a 100644 > --- a/drivers/firmware/efi/efi.c > +++ b/drivers/firmware/efi/efi.c > @@ -141,6 +141,36 @@ static ssize_t systab_show(struct kobject *kobj, > > static struct kobj_attribute efi_attr_systab = __ATTR_RO_MODE(systab, 0400); > > +static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init; > +static ssize_t memreserve_show(struct kobject *kobj, > +struct kobj_attribute *attr, char *buf) > +{ > + struct linux_efi_memreserve *rsv; > + unsigned long prsv; > + char *str = buf; > + int index, i; > + > + if (!kobj || !buf) > + return -EINVAL; > + > + if (!efi_memreserve_root) > + return -ENODEV; Other functions use different conditions. The latter efi_memreserve_root == (void *)ULONG_MAX is correct? static int __init efi_memreserve_map_root(void) { if (efi.mem_reserve == EFI_INVALID_TABLE_ADDR) return -ENODEV; int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size) { struct linux_efi_memreserve *rsv; unsigned long prsv; int rc, index; if (efi_memreserve_root == (void *)ULONG_MAX) return -ENODEV; > + > + for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) { > + rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB); memremap() could fail with NULL as a return value. You need to deal with such case. It looks to me efi_mem_reserve_persistent() also doesn't deal with this. Maybe you should fix this, too. > + index = atomic_read(>count); > + for (i = 0; i < index; i++) > + str += sprintf(str, "%llx-%llx\n", > + rsv->entry[i].base, > + rsv->entry[i].base + rsv->entry[i].size - > 1); Is memreserve supported on 32-bit system? If so, phy_addr_t could have a type of 4-byte length in such system (not so if with PAE) and then %llx could lead to inconsistent type error. It's enough to add a cast to unsigned long long. > + memunmap(rsv); > + } > + > + return str - buf; > +} > +
RE: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified
> -Original Message- > From: lijiang [mailto:liji...@redhat.com] > Sent: Friday, October 25, 2019 10:31 AM > To: Simon Horman ; Hatayama, Daisuke/畑山 大輔 > > Cc: linux-ker...@vger.kernel.org; jgr...@suse.com; thomas.lenda...@amd.com; > b...@redhat.com; x...@kernel.org; kexec@lists.infradead.org; > dhowe...@redhat.com; mi...@redhat.com; b...@alien8.de; ebied...@xmission.com; > h...@zytor.com; t...@linutronix.de; dyo...@redhat.com; vgo...@redhat.com > Subject: Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the > crashkernel option is specified > > 在 2019年10月24日 19:33, lijiang 写道: > > 在 2019年10月24日 18:07, Simon Horman 写道: > >> Hi Linbo, > >> > >> thanks for your patch. > >> > >> On Wed, Oct 23, 2019 at 10:19:11PM +0800, Lianbo Jiang wrote: > >>> Kdump kernel will reuse the first 640k region because the real mode > >>> trampoline has to work in this area. When the vmcore is dumped, the > >>> old memory in this area may be accessed, therefore, kernel has to > >>> copy the contents of the first 640k area to a backup region so that > >>> kdump kernel can read the old memory from the backup area of the > >>> first 640k area, which is done in the purgatory(). > >>> > >>> But, the current handling of copying the first 640k area runs into > >>> problems when SME is enabled, kernel does not properly copy these > >>> old memory to the backup area in the purgatory(), thereby, kdump > >>> kernel reads out the encrypted contents, because the kdump kernel > >>> must access the first kernel's memory with the encryption bit set > >>> when SME is enabled in the first kernel. Please refer to this link: > >>> > >>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793 > >>> > >>> Finally, it causes the following errors, and the crash tool gets > >>> invalid pointers when parsing the vmcore. > >>> > >>> crash> kmem -s|grep -i invalid > >>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid > freepointer:a6086ac099f0c5a4 > >>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid > freepointer:a6086ac099f0c5a4 > >>> crash> > >>> > >>> To avoid the above errors, when the crashkernel option is specified, > >>> lets reserve the remaining low 1MiB memory(after reserving real mode > >>> memory) so that the allocated memory does not fall into the low 1MiB > >>> area, which makes us not to copy the first 640k content to a backup > >>> region in purgatory(). This indicates that it does not need to be > >>> included in crash dumps or used for anything except the processor > >>> trampolines that must live in the low 1MiB. > >>> > >>> Signed-off-by: Lianbo Jiang > >>> --- > >>> BTW:I also tried to fix the above problem in purgatory(), but there > >>> are too many restricts in purgatory() context, for example: i can't > >>> allocate new memory to create the identity mapping page table for > >>> SME situation. > >>> > >>> Currently, there are two places where the first 640k area is needed, > >>> the first one is in the find_trampoline_placement(), another one is > >>> in the reserve_real_mode(), and their content doesn't matter. > >>> > >>> In addition, also need to clean all the code related to the backup > >>> region later. > >>> > >>> arch/x86/realmode/init.c | 2 ++ > >>> include/linux/kexec.h| 2 ++ > >>> kernel/kexec_core.c | 13 + > >>> 3 files changed, 17 insertions(+) > >>> > >>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > >>> index 7dce39c8c034..064cc79a015d 100644 > >>> --- a/arch/x86/realmode/init.c > >>> +++ b/arch/x86/realmode/init.c > >>> @@ -3,6 +3,7 @@ > >>> #include > >>> #include > >>> #include > >>> +#include > >>> > >>> #include > >>> #include > >>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void) > >>> > >>> memblock_reserve(mem, size); > >>> set_real_mode_mem(mem); > >>> + kexec_reserve_low_1MiB(); > >>> } > >>> > >>> static void __init setup_real_mode(void) > >>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h > >>> index 1776eb2e43a4..30acf1d738bc 100644 > >>> --- a/include/linux/kexec.h > >>> +++ b/include/linux/kexec.h > >>> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *); > >>> extern void crash_kexec(struct pt_regs *); > >>> int kexec_should_crash(struct task_struct *); > >>> int kexec_crash_loaded(void); > >>> +void __init kexec_reserve_low_1MiB(void); > >>> void crash_save_cpu(struct pt_regs *regs, int cpu); > >>> extern int kimage_crash_copy_vmcoreinfo(struct kimage *image); > >>> > >>> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) > { } > >>> static inline void crash_kexec(struct pt_regs *regs) { } > >>> static inline int kexec_should_crash(struct task_struct *p) { return 0; } > >>> static inline int kexec_crash_loaded(void) { return 0; } > >>> +static inline void __init kexec_reserve_low_1MiB(void) { } > >>> #define kexec_in_progress false > >>> #endif /* CONFIG_KEXEC_CORE */ > >>> > >>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
RE: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified
I don't find the corresponding patch in the v5 patchset, so I comment here. > -Original Message- > From: linux-kernel-ow...@vger.kernel.org > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of lijiang > Sent: Wednesday, October 23, 2019 2:35 PM > To: Borislav Petkov > Cc: linux-ker...@vger.kernel.org; t...@linutronix.de; mi...@redhat.com; > h...@zytor.com; x...@kernel.org; b...@redhat.com; dyo...@redhat.com; > jgr...@suse.com; dhowe...@redhat.com; thomas.lenda...@amd.com; > ebied...@xmission.com; vgo...@redhat.com; kexec@lists.infradead.org > Subject: Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the > crashkernel option is specified > > 在 2019年10月22日 16:30, Borislav Petkov 写道: > > This ifdeffery needs to be a function in kernel/kexec_core.c which is > > called by reserve_real_mode(), instead. > > Would you mind if i improve this patch as follow? Thanks. > > From 5804abec62279585f374d78ace1250505c44c6b7 Mon Sep 17 00:00:00 2001 > From: Lianbo Jiang > Date: Wed, 23 Oct 2019 11:27:04 +0800 > Subject: [PATCH] x86/kdump: always reserve the low 1MiB when the crashkernel > option is specified > > Kdump kernel will reuse the first 640k region because the real mode > trampoline has to work in this area. When the vmcore is dumped, the > old memory in this area may be accessed, therefore, kernel has to > copy the contents of the first 640k area to a backup region so that > kdump kernel can read the old memory from the backup area of the > first 640k area, which is done in the purgatory(). > > But, the current handling of copying the first 640k area runs into > problems when SME is enabled, kernel does not properly copy these > old memory to the backup area in the purgatory(), thereby, kdump > kernel reads out the encrypted contents, because the kdump kernel > must access the first kernel's memory with the encryption bit set > when SME is enabled in the first kernel. Please refer to this link: > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793 > > Finally, it causes the following errors, and the crash tool gets > invalid pointers when parsing the vmcore. > > crash> kmem -s|grep -i invalid > kmem: dma-kmalloc-512: slab:d77680001c00 invalid > freepointer:a6086ac099f0c5a4 > kmem: dma-kmalloc-512: slab:d77680001c00 invalid > freepointer:a6086ac099f0c5a4 > crash> > > To avoid the above errors, when the crashkernel option is specified, > lets reserve the remaining low 1MiB memory(after reserving real mode > memory) so that the allocated memory does not fall into the low 1MiB > area, which makes us not to copy the first 640k content to a backup > region in purgatory(). This indicates that it does not need to be > included in crash dumps or used for anything except the processor > trampolines that must live in the low 1MiB. > > Signed-off-by: Lianbo Jiang > --- > BTW:I also tried to fix the above problem in purgatory(), but there > are too many restricts in purgatory() context, for example: i can't > allocate new memory to create the identity mapping page table for > SME situation. > > Currently, there are two places where the first 640k area is needed, > the first one is in the find_trampoline_placement(), another one is > in the reserve_real_mode(), and their content doesn't matter. > > In addition, also need to clean all the code related to the backup > region later. > > arch/x86/realmode/init.c | 2 ++ > include/linux/kexec.h| 2 ++ > kernel/kexec_core.c | 13 + > 3 files changed, 17 insertions(+) > > diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > index 7dce39c8c034..064cc79a015d 100644 > --- a/arch/x86/realmode/init.c > +++ b/arch/x86/realmode/init.c > @@ -3,6 +3,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -34,6 +35,7 @@ void __init reserve_real_mode(void) > > memblock_reserve(mem, size); > set_real_mode_mem(mem); > + kexec_reserve_low_1MiB(); > } > > static void __init setup_real_mode(void) > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index 1776eb2e43a4..30acf1d738bc 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *); > extern void crash_kexec(struct pt_regs *); > int kexec_should_crash(struct task_struct *); > int kexec_crash_loaded(void); > +void kexec_reserve_low_1MiB(void); > void crash_save_cpu(struct pt_regs *regs, int cpu); > extern int kimage_crash_copy_vmcoreinfo(struct kimage *image); > > @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > static inline void crash_kexec(struct pt_regs *regs) { } > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > static inline int kexec_crash_loaded(void) { return 0; } > +static inline void kexec_reserve_low_1MiB(void) { } > #define kexec_in_progress false > #endif /* CONFIG_KEXEC_CORE */ > > diff --git a/kernel/kexec_core.c