Re: [PATCH 24/30] panic: Refactor the panic path

2022-05-10 Thread d.hatay...@fujitsu.com
Sorry for the delayed response. Unfortunately, I had 10 days holidays
until yesterday...

>  .../admin-guide/kernel-parameters.txt |  42 ++-
>  include/linux/panic_notifier.h|   1 +
>  kernel/kexec_core.c   |   8 +-
>  kernel/panic.c| 292 +-
>  .../selftests/pstore/pstore_crash_test|   5 +-
>  5 files changed, 252 insertions(+), 96 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 3f1cc5e317ed..8d3524060ce3 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
...snip...
> @@ -3784,6 +3791,33 @@
> timeout < 0: reboot immediately
> Format: 
> 
> +   panic_notifiers_level=
> +   [KNL] Set the panic notifiers execution order.
> +   Format: 
> +   We currently have 4 lists of panic notifiers; based
> +   on the functionality and risk (for panic success) the
> +   callbacks are added in a given list. The lists are:
> +   - hypervisor/FW notification list (low risk);
> +   - informational list (low/medium risk);
> +   - pre_reboot list (higher risk);
> +   - post_reboot list (only run late in panic and after
> +   kdump, not configurable for now).
> +   This parameter defines the ordering of the first 3
> +   lists with regards to kdump; the levels determine
> +   which set of notifiers execute before kdump. The
> +   accepted levels are:
> +   0: kdump is the first thing to run, NO list is
> +   executed before kdump.
> +   1: only the hypervisor list is executed before kdump.
> +   2 (default level): the hypervisor list and (*if*

Hmmm, why are you trying to change default setting?

Based on the current design of kdump, it's natural to put what the
handlers for these level 1 and level 2 handlers do in
machine_crash_shutdown(), as these are necessary by default, right?

Or have you already tried that and figured out it's difficult in some
reason and reached the current design? If so, why is that difficult?
Could you point to if there is already such discussion online?

kdump is designed to perform as little things as possible before
transferring the execution to the 2nd kernel in order to increase
reliability. Just detour to panic() increases risks of kdump failure
in the sense of increasing the executed codes in the abnormal
situation, which is very the note in the explanation of
crash_kexec_post_notifiers.

Also, the current implementation of crash_kexec_post_notifiers uses
the panic notifier, but this is not from the technical
reason. Ideally, it should have been implemented in the context of
crash_kexec() independently of panic().

That is, it looks to me that, in addition to changing design of panic
notifier, you are trying to integrate shutdown code of the crash kexec
and the panic paths. If so, this is a big design change for kdump.
I'm concerned about increase of reliability. I'd like you to discuss
them carefully.

Thanks.
HATAYAMA, Daisuke


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter

2022-01-26 Thread d.hatay...@fujitsu.com
> On 01/24/22 at 11:48am, Guilherme G. Piccoli wrote:
> > On 24/01/2022 10:59, Baoquan He wrote:
> > > [...]
> > > About pre_dump, if the dump is crash dump, hope those pre_dump notifiers
> > > will be executed under conditional check, e.g only if 
> > > 'crash_kexec_post_notifiers'
> > > is specified in kernel cmdline.
> >
> > Hi Baoquan, based on Petr's suggestion, I think pre_dump would be
> > responsible for really *non-intrusive/non-risky* tasks and should be
> > always executed in the panic path (before kdump), regardless of
> > "crash_kexec_post_notifiers".
> >
> > The idea is that the majority of the notifiers would be executed in the
> > post_dump portion, and for that, we have the
> > "crash_kexec_post_notifiers" conditional. I also suggest we have
> > blacklist options (based on function names) for both notifiers, in order
> > to make kdump issues debug easier.
> >
> > Do you agree with that? Feel free to comment with suggestions!
> > Cheers,
>
> I would say "please NO" cautiously.
>
> As Petr said, kdump mostly works only if people configure it correctly.
> That's because we try best to switch to kdump kernel from the fragile
> panicked kernel immediately. When we try to add anthing before the switching,
> please consider carefully and ask if that adding is mandatory, otherwise
> switching into kdump kernel may fail. If the answer is yes, the adding
> is needed and welcomed. Othewise, any unnecessary action, including any
> "non-intrusive/non-risky" tasks, would be unwelcomed.
>
> Surely, we don't oppose the "non-intrusive/non-risky" or completely
> "intrusive/risky" action adding before kdump kernel switching, with a
> conditional limitation. When we handle customers' kdump support, we
> explicitly declare we only support normal and default kdump operation.
> If any action which is done before switching into kdump kernel is specified,
> e.g panic_notifier, panic_print, they need take care of their own kdump
> failure.

Sorry for bringing back the past discussion, but how about
reconsidering the following design?

- kdump-specific notifier list within crash_kexec()

I don't think that the current implementation of
crash_kexec_post_notifiers is ideal, where panic() and panic notifier
are used, which contains the code that is unnecessary for kdump, and
it unnecessarily decreases kdump's reliability. The presence of kdump
code in panic() also conversely makes panic()'s code unnecessarily
complicated.

I use crash_kexec_post_notifiers with the understanding that it
affects kdump's reliability to some degree, but at the same time I
want it to be as reliable as possible.

I think introducing kdump-specific notifier list will resolve the
issues caused by dependencies to panic()'s code.

In addition, as I already said in another mail, I want to avoid
invoking any other handlers passed by users other than me, in order to
keep kdump's reliability. For such purpose, I think some feature like
Guilherme's panic notifier filter is needed.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter

2022-01-25 Thread d.hatay...@fujitsu.com
> Hi, thanks for the review. First of all, notice that it's very likely
> this patch isn't gonna get merged this way, we are considering a
> refactor that included 2 panic notifiers: one a bit earlier (pre_dump),
> that includes functions less risky, as watchdog unloaders, kernel offset
> dump, etc, and the second panic notifier (post_dump) will keep the
> majority of callbacks, and can be conditionally executed on kdump
> through the usage of "crash_kexec_post_notifiers".

But the pre_dump cannot avoid calling multiple unnecessary handlers, right?
It's more risky than the previous idea...

> Anyway, I'm curious with your code review - how can we use this filter
> with modules, if the filter setup is invoked as early_param(), before
> modules load? In that case, module functions won't have a valid address,
> correct? So, in that moment, this lookup fails, we cannot record an
> unloaded module address in such list. Please, correct me if I'm wrong.

For example, how about simply maintaining function symbol names in the list
as string, not address.

Thanks.
HATAYAMA, Daisuke

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[PATCH] sadump, kaslr: fix failure of calculating kaslr_offset

2022-01-25 Thread d.hatay...@fujitsu.com
On kernels v5.8 or later, makedumpfile fails for memory dumps in the
sadump-related formats as follows:

# makedumpfile -f -l -d 31 -x ./vmlinux /dev/sdd4 /root/vmcore-ld31
__vtop4_x86_64: Can't get a valid pud_pte.
...110 lines of the same message...
__vtop4_x86_64: Can't get a valid pud_pte.
calc_kaslr_offset: failed to calculate kaslr_offset and phys_base; default 
to 0
readmem: type_addr: 1, addr:85411858, size:8
__vtop4_x86_64: Can't get pgd (page_dir:85411858).
readmem: Can't convert a virtual address(059be980) to physical 
address.
readmem: type_addr: 0, addr:059be980, size:1024
cpu_online_mask_init: Can't read cpu_online_mask memory.

makedumpfile Failed.

This is caused by the kernel commit 9d06c4027f21 ("x86/entry: Convert
Divide Error to IDTENTRY") that renamed divide_error to
asm_exc_divide_error, breaking logic for calculating kaslr offset.

Fix this by adding initialization of asm_exc_divide_error.

Signed-off-by: HATAYAMA Daisuke 
---
 makedumpfile.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/makedumpfile.c b/makedumpfile.c
index a51bdaf..7ed9756 100644
--- a/makedumpfile.c
+++ b/makedumpfile.c
@@ -1667,6 +1667,8 @@ get_symbol_info(void)
SYMBOL_INIT(cur_cpu_spec, "cur_cpu_spec");
 
SYMBOL_INIT(divide_error, "divide_error");
+   if (SYMBOL(divide_error) == NOT_FOUND_SYMBOL)
+   SYMBOL_INIT(divide_error, "asm_exc_divide_error");
SYMBOL_INIT(idt_table, "idt_table");
SYMBOL_INIT(saved_command_line, "saved_command_line");
SYMBOL_INIT(pti_init, "pti_init");
-- 
2.31.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter

2022-01-25 Thread d.hatay...@fujitsu.com
> @@ -146,6 +157,34 @@ void nmi_panic(struct pt_regs *regs, const char *msg)
>  }
>  EXPORT_SYMBOL(nmi_panic);
> 
> +static int __init panic_notifier_filter_setup(char *buf)
> +{
> +   char *func;
> +   unsigned long addr;
> +
> +   while ((func = strsep(, ","))) {
> +   addr = kallsyms_lookup_name(func);
> +   if (!addr) {
> +   pr_warn("panic_notifier_filter: invalid symbol %s\n", 
> func);
> +   continue;
> +   }

Could you remove this check?

panic_notifier_list is exported to kernel modules and this check
prevents such users from using this feature.

Thanks.
HATAYAMA, Daisuke

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH V4] notifier/panic: Introduce panic_notifier_filter

2022-01-24 Thread d.hatay...@fujitsu.com


> On 01/21/22 at 05:31pm, Guilherme G. Piccoli wrote:
> ..
> > > IMHO, the right solution is to split the callbacks into 2 or more
> > > notifier list. Then we might rework panic() to do:
> > >
> > > void panic(void)
> > > {
> > > [...]
> > >
> > > /* stop watchdogs + extra info */
> > > atomic_notifier_call_chain(_disable_watchdogs_notifier_list, 0, 
> > > buf);
> > > atomic_notifier_call_chain(_info_notifier_list, 0, buf);
> > > panic_print_sys_info();
> > >
> > > /* crash_kexec + kmsg_dump in configurable order */
> > > if (!_crash_kexec_post_kmsg_dump) {
> > > __crash_kexec(NULL);
> > > smp_send_stop();
> > > } else {
> > > crash_smp_send_stop();
> > > }
> > >
> > > kmsg_dump();
> > > if (_crash_kexec_post_kmsg_dump)
> > > __crash_kexec(NULL);
> > >
> > > /* infinite loop or reboot */
> > > atomic_notifier_call_chain(_hypervisor_notifier_list, 0, buf);
> > > atomic_notifier_call_chain(_rest_notifier_list, 0, buf);
> > >
> > > console_flush_on_panic(CONSOLE_FLUSH_PENDING);
> > > [...]
> > > Two notifier lists might be enough in the above scenario. I would call
> > > them:
> > >
> > > panic_pre_dump_notifier_list
> > > panic_post_dump_notifier_list
> > >
> > >
> > > It is a real solution that will help everyone. It is more complicated now
> > > but it will makes things much easier in the long term. And it might be 
> > > done
> > > step by step:
> > >
> > >  1. introduce the two notifier lists
> > >  2. convert all users: one by one
> > >  3. remove the original notifier list when there is no user
> >
> > That's a great idea! I'm into it, if we have a consensus. The thing that
> > scares me most here is that this is a big change and consumes time to
> > implement - I'd not risk such time if somebody is really against that.
> > So, let's see more opinions, maybe the kdump maintainers have good input.
> 
> I am fine with it. As long as thing is made clear, glad to see code is
> refactored to be more understandable and improved. Earlier, during several
> rounds of discussion between you and Petr, seveal pitfalls have been
> pointed out and avoided.
> 
> Meanwhile, I would suggest Masa and HATAYAMA to help give input about
> panic_notifier usage and refactory. AFAIK, they contributed code and use
> panic_notifier in their product or environment a lot, that will be very
> helpful to get the first hand information from them.
> 
> Hi Masa, HATAYANA,
> 
> Any comment on this? (Please ignore this if it's not in your care.)
> 

Thanks for CCing to me. I like this patch set. I have same motivation.

For example, when I used crash_kexec_post_notifiers, I sometimes ran into
deadlock in printk's exclusion logic during the call of panic notifiers since
kaslr outputs kernel offset at panic by dump_kernel_offset() via panic notifers
(although this might never happen now thanks to lockless implementation).

The problem is that in the current design, we have to run all the
tasks registered, although most of them are actually unnecessary for
other users' requirements. Each user wants to call only their own handlers
in order to keep kdump as reliable as possible.

I've just started reading this patch set and have no other comments for now.

Thanks.
HATAYAMA, Daisuke

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH] sadump: Fix failure of reading __per_cpu_load memory

2020-05-21 Thread d.hatay...@fujitsu.com
Saito-san,

Thanks for your patch.

I think it better to remove __per_cpu_load code, which is unnecessary.


差出人: kexec  が saito.kaz...@fujitsu.com 
 の代理で送信
送信日時: 2020年5月21日 16:15
宛先: 'kexec@lists.infradead.org'
件名: [PATCH] sadump: Fix failure of reading __per_cpu_load memory

Creating vmcore from sadump by makedumpfile fails with the following
error messages since kernel-4.19 with PTI (Page Table Isolation)
enabled:

  __vtop4_x86_64: Can't get a valid pte.
  readmem: Can't convert a virtual address(b2986000) to physical 
address.
  readmem: type_addr: 0, addr:b2986000, size:8
  per_cpu_init: Can't read __per_cpu_load memory.

This is caused by the following patch:
https://github.com/torvalds/linux/commit/c40a56a7818cfe735fc93a69e1875f8bba834483

The above patch clears _PAGE_PRESENT bit of __per_cpu_load memory, so
__vtop4_x86_64 fails to convert the virtual address of the
__per_cpu_load.

To fix this issue, this patch changes sanity check of per_cpu_ptr()
to use address of the __per_cpu_load instead of data of the memory.

Signed-off-by: Kazuya Saito 
Signed-off-by: Kiyotaka Nakamura 
---
 sadump_info.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sadump_info.c b/sadump_info.c
index 46867ce..72a077b 100644
--- a/sadump_info.c
+++ b/sadump_info.c
@@ -1732,11 +1732,11 @@ per_cpu_init(void)
return FALSE;
}

-   if (!readmem(VADDR, SYMBOL(__per_cpu_load), >__per_cpu_load,
-sizeof(unsigned long))) {
-   ERRMSG("Can't read __per_cpu_load memory.\n");
+   if (SYMBOL(__per_cpu_load) == NOT_FOUND_SYMBOL) {
+   ERRMSG("Can't find __per_cpu_load symbol.\n");
return FALSE;
}
+   si->__per_cpu_load = SYMBOL(__per_cpu_load);

DEBUG_MSG("sadump: __per_cpu_load: %#lx\n", si->__per_cpu_load);
DEBUG_MSG("sadump: __per_cpu_offset: LENGTH: %ld\n",
--
2.12.3


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


RE: [RFD] kdump, kaslr: how to fix the failure of reservation of crashkernel low memory due to physical kaslr

2019-12-26 Thread d.hatay...@fujitsu.com


> -Original Message-
> 
> Hi HATAYAMA,
> 
> On 12/25/19 at 04:26am, d.hatay...@fujitsu.com wrote:
> > Currently, reservation of crashkernel low memory sometimes fails due
> > to a sparse memory caused by physical kaslr with the following
> > message:
> >
> > Cannot reserve 256MB crashkernel low memory, please try smaller size.
> 
> I don't understand, may not get your point. KASLR will randomize the
> position of kernel image. However, kernel image usually takes up 50M
> memory. Under low 4G memory, how come it can't reserve 256M crashkernel
> low memory. Do you have the boot log of the failed case?

Thanks for your comments and sorry for the insufficient explanation.

Low 4GB memory in our system is considerably limited. The size of the largest
contiguous free physical pages at the timing when kernel attempts at
reserving low memory for crash kernel is less than 512MB. Hence, if physical
kaslr inserts kernel image into the center of the chunk, every remaining
chunks have less than 256M size. Then, the failure occurs.

> 
> >
> > Kdump needs low memory, memory area less than 4GB, e.g. for swiotlb.
> > Its size is 256MB for low memory by default. OTOH, physical kaslr
> > loads kernel images in a random physical address for
> > security. Physical kaslr sometimes choose low memory and sparse
> > there and as a result, reservation of crash kernel low memory could fail.
> >
> > This failure seldom occurs on systems with large memory. For example,
> > on our system with 128GB, the issue occurs once in hundreds of
> > reboots. Although it doesn't occur frequently and can be avoided in
> > practice simply by rebooting the system, it definitely occurs once in
> > hundreds of reboots. Once the issue occurs, it's difficult for ordinary
> > users to understand why it failed. I'd like to fix this current behavior.
> >
> > I'm now coming up two ideas but don't know what is best. Please
> > discuss how to fix the issue in better way.
> >
> > 1) Add a kernel parameter to make physical kaslr to avoid specified
> >memory area
> >
> >   This is the simplest idea I came up with first just like
> >kaslr_mem_avoid=4GB-0, which is similar syntax to memap=, meaning
> >that kaslr, please avoid to load kernel image into the region [0,
> >4GB).
> >
> >   It looks to me that this can be implemented easily by taking
> >advantage of the existing code about mem_avoid mechanism in
> >arch/x86/boot/compressed/kaslr.c.
> >
> >   This mechanism doesn't lose security provided by physical kaslr if
> >system memory is large enough.
> >
> >   Demerit of this is that users need configuration. Automatic way is
> >better if possible.
> >
> > 2) Add special handling for crashkernel= low in physical kaslr
> >
> >   The second idea I came up with is to add special handling for
> >crashkernel= low in physical kaslr, i.e. physical kaslr recognizes
> >crashkernel= in kernel parameters and keep enough memory for
> >crashkenrel.
> >
> >   To guarantee that the memory area kept by the special handling in
> >physical kaslr is used for crashkernel, it is necessary to mark the
> >area to indicate to the crashkernel code executed after kernel
> >runs. To implement this, I imagine introducing a new type of memory
> >a kind of E820_CRASHKERNEL_LOW.
> >
> >   My concern on this idea is whether its worth implementing such
> >special handling in physical kaslr simply because I don't find such
> >code in physical kaslr now.
> >
> > 3) Any other better ideas?
> 
> Someone ever told that some systems may not have low 4G memory since
> they own hardware iommu. In real life, I never see such kind of system,
> and most of them can give 256M crashkernel memory a satisfactory result.
> Unless you reserve more than 1G under low 4G, it could fail because of
> kinds of complicated memory reservations there.

I'm surprised to hear such system without low 4GB memory and I wonder
how such system works well without restriction of memory access range
in early runtime mode on x86 such as real mode.

Thanks.
HATAYAMA, Daisuke


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


[RFD] kdump, kaslr: how to fix the failure of reservation of crashkernel low memory due to physical kaslr

2019-12-24 Thread d.hatay...@fujitsu.com
Currently, reservation of crashkernel low memory sometimes fails due
to a sparse memory caused by physical kaslr with the following
message:

Cannot reserve 256MB crashkernel low memory, please try smaller size.

Kdump needs low memory, memory area less than 4GB, e.g. for swiotlb.
Its size is 256MB for low memory by default. OTOH, physical kaslr
loads kernel images in a random physical address for
security. Physical kaslr sometimes choose low memory and sparse
there and as a result, reservation of crash kernel low memory could fail.

This failure seldom occurs on systems with large memory. For example,
on our system with 128GB, the issue occurs once in hundreds of
reboots. Although it doesn't occur frequently and can be avoided in
practice simply by rebooting the system, it definitely occurs once in
hundreds of reboots. Once the issue occurs, it's difficult for ordinary
users to understand why it failed. I'd like to fix this current behavior.

I'm now coming up two ideas but don't know what is best. Please
discuss how to fix the issue in better way.

1) Add a kernel parameter to make physical kaslr to avoid specified
   memory area

  This is the simplest idea I came up with first just like
   kaslr_mem_avoid=4GB-0, which is similar syntax to memap=, meaning
   that kaslr, please avoid to load kernel image into the region [0,
   4GB).

  It looks to me that this can be implemented easily by taking
   advantage of the existing code about mem_avoid mechanism in
   arch/x86/boot/compressed/kaslr.c.

  This mechanism doesn't lose security provided by physical kaslr if
   system memory is large enough.

  Demerit of this is that users need configuration. Automatic way is
   better if possible.

2) Add special handling for crashkernel= low in physical kaslr

  The second idea I came up with is to add special handling for
   crashkernel= low in physical kaslr, i.e. physical kaslr recognizes
   crashkernel= in kernel parameters and keep enough memory for
   crashkenrel.

  To guarantee that the memory area kept by the special handling in
   physical kaslr is used for crashkernel, it is necessary to mark the
   area to indicate to the crashkernel code executed after kernel
   runs. To implement this, I imagine introducing a new type of memory
   a kind of E820_CRASHKERNEL_LOW.

  My concern on this idea is whether its worth implementing such
   special handling in physical kaslr simply because I don't find such
   code in physical kaslr now.

3) Any other better ideas?

Thanks.
HATAYAMA, Daisuke


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


RE: [PATCH v2 2/2] efi: arm64: Introduce /proc/efi/memreserve to tell the persistent pages

2019-12-04 Thread d.hatay...@fujitsu.com



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Masayoshi Mizuma
> Sent: Wednesday, December 4, 2019 5:14 AM
> To: Ard Biesheuvel ;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org
> Cc: Masayoshi Mizuma ; Mizuma, Masayoshi/水間 理仁
> ; linux-ker...@vger.kernel.org;
> kexec@lists.infradead.org; Hatayama, Daisuke/畑山 大輔
> ; Eric Biederman ; Matthias
> Brugger 
> Subject: [PATCH v2 2/2] efi: arm64: Introduce /proc/efi/memreserve to tell the
> persistent pages
> 
> From: Masayoshi Mizuma 
> 
> kexec reboot stops in early boot sequence because efi_config_parse_tables()
> refers garbage data. We can see the log with memblock=debug kernel option:
> 
>   efi:  ACPI 2.0=0x9821790014  PROP=0x8757f5c0  SMBIOS 3.0=0x982074
> MEMRESERVE=0x9820bfdc58
>   memblock_reserve: [0x009820bfdc58-0x009820bfdc67]
> efi_config_parse_tables+0x228/0x278
>   memblock_reserve: [0x8276-0x324d07ff]
> efi_config_parse_tables+0x228/0x278
>   memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9]
> efi_config_parse_tables+0x244/0x278
>   memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2]
> efi_config_parse_tables+0x244/0x278
>   memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12]
> efi_config_parse_tables+0x244/0x278
>   ...
> 
> That happens because 0x8276, struct linux_efi_memreserve, is destroyed.
> 0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the
> head page of LPI pending table and LPI property table which are allocated by
> gic_reserve_range().
> 
> The destroyer is kexec. kexec locates the initrd to the area:
> 
>   ]# kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img
> --reuse-cmdline
>   ...
>   initrd: base 8229, size 388dd8ah (59301258)
>   ...
> 
> From dynamic debug log. initrd is located in segment[1]:
>   machine_kexec_prepare:70:
> kexec kimage info:
>   type:0
>   start:   85b30680
>   head:0
>   nr_segments: 4
> segment[0]: 8048 - 8229, 0x1e1 bytes, 481
> pages
> segment[1]: 8229 - 85b2, 0x389 bytes, 905
> pages
> segment[2]: 85b2 - 85b3, 0x1 bytes, 1
> pages
> segment[3]: 85b3 - 85b4, 0x1 bytes, 1
> pages
> 
> kexec searches the memory region to locate initrd through
> "System RAM" in /proc/iomem. The pending tables are included in
> "System RAM" because they are allocated by alloc_pages(), so kexec
> destroys the LPI pending tables.
> 
> Introduce /proc/efi/memreserve to tell the pages pointed by
> efi.mem_reserve so that kexec can avoid the area to locate initrd.
> 
> Signed-off-by: Masayoshi Mizuma 
> ---
>  drivers/firmware/efi/efi.c | 75 --
>  1 file changed, 72 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index d8157cb34..80bbe0b3e 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -325,17 +325,87 @@ static __init int efivar_ssdt_load(void)
>  static inline int efivar_ssdt_load(void) { return 0; }
>  #endif
> 
> +static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init;
> +
>  #ifdef CONFIG_PROC_FS
>  static struct proc_dir_entry *proc_efi;
> +#ifdef CONFIG_KEXEC
> +static int memreserve_show(struct seq_file *m, void *v)
> +{
> + struct linux_efi_memreserve *rsv;
> + phys_addr_t start, end;
> + unsigned long prsv;
> + int count, i;
> +
> + if ((efi_memreserve_root == (void *)ULONG_MAX) ||
> + (!efi_memreserve_root))
> + return -ENODEV;
> +
> + for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) {
> + rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB);
> + if (!rsv) {
> + pr_err("Could not map efi_memreserve\n");
> + return -ENOMEM;
> + }
> + count = atomic_read(>count);
> + for (i = 0; i < count; i++) {
> + start = rsv->entry[i].base;
> + end = start + rsv->entry[i].size - 1;
> +
> + seq_printf(m, "%pa-%pa\n", , );
> + }

It looks to me that we reach the same issue as sysfs if
memreserve_show() calls seq_printf multiple times this way.
Default buffer size of struct seq_file is also PAGE_SIZE.
We can configure default buffer size using single_open_size(),
but I think it better to do the same thing as /proc/iomem
(and /proc/mounts and many others) than choosing a certain
size as default value.

Looking into implementation of /proc/iomem, its show method, r_show(),
calls seq_printf() only once; in other words, r_show() is
called the number of lines of /proc/iomem, which is equal to
the number of resource objects. Then, there is no buffer size
issue because "%pa-%pa\n" 

RE: [RFC PATCH] efi: arm64: Introduce /sys/firmware/efi/memreserve to tell the persistent pages

2019-11-13 Thread d.hatay...@fujitsu.com



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Masayoshi Mizuma
> Sent: Wednesday, November 13, 2019 1:53 AM
> To: Ard Biesheuvel ;
> linux-arm-ker...@lists.infradead.org; linux-...@vger.kernel.org
> Cc: Masayoshi Mizuma ; Mizuma, Masayoshi/水間 理仁
> ; linux-ker...@vger.kernel.org;
> kexec@lists.infradead.org
> Subject: [RFC PATCH] efi: arm64: Introduce /sys/firmware/efi/memreserve to
> tell the persistent pages
> 
> From: Masayoshi Mizuma 
> 
> kexec reboot stucks because efi_config_parse_tables() refers garbage
>  (with memblock=debug):
> 
>   efi:  ACPI 2.0=0x9821790014  PROP=0x8757f5c0  SMBIOS 3.0=0x982074
> MEMRESERVE=0x9820bfdc58
>   memblock_reserve: [0x009820bfdc58-0x009820bfdc67]
> efi_config_parse_tables+0x228/0x278
>   memblock_reserve: [0x8276-0x324d07ff]
> efi_config_parse_tables+0x228/0x278
>   memblock_reserve: [0xcc4f84ecc0511670-0x5f6e5214a7fd91f9]
> efi_config_parse_tables+0x244/0x278
>   memblock_reserve: [0xd2fd4144b9af693d-0xad0c1db1086f40a2]
> efi_config_parse_tables+0x244/0x278
>   memblock_reserve: [0x0c719bb159b1fadc-0x5aa6e62a1417ce12]
> efi_config_parse_tables+0x244/0x278
>   ...
> 
> That happens because 0x8276, struct linux_efi_memreserve, is destroyed.
> 0x8276 is pointed from efi.mem_reseve, and efi.mem_reserve points the
> head page of pending table and prop table which are allocated by
> gic_reserve_range().
> 
> The destroyer is kexec. kexec locates the inird to the area:
> 
> # kexec -d -l /boot/vmlinuz-5.4.0-rc7 /boot/initramfs-5.4.0-rc7.img
> --reuse-cmdline
> ...
> initrd: base 8229, size 388dd8ah (59301258)
> ...
> 
> From dynamic debug log:
>   machine_kexec_prepare:70:
> kexec kimage info:
>   type:0
>   start:   85b30680
>   head:0
>   nr_segments: 4
> segment[0]: 8048 - 8229, 0x1e1 bytes, 481
> pages
> segment[1]: 8229 - 85b2, 0x389 bytes, 905
> pages
> segment[2]: 85b2 - 85b3, 0x1 bytes, 1
> pages
> segment[3]: 85b3 - 85b4, 0x1 bytes, 1
> pages
> 
> kexec searches the appropriate memory region to locate initrd through "System
> RAM"
> in /proc/iomem. The pending tables are included in "System RAM" because they
> are
> allocated by alloc_pages(), so kexec destroys the pending tables.
> 
> Introduce /sys/firmware/efi/memreserve to tell the pages pointed by
> efi.mem_reserve
> so that kexec can avoid the area to locate initrd.
> 
> Signed-off-by: Masayoshi Mizuma 
> ---
>  drivers/firmware/efi/efi.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index e98bbf8e5..67b21ae7a 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -141,6 +141,36 @@ static ssize_t systab_show(struct kobject *kobj,
> 
>  static struct kobj_attribute efi_attr_systab = __ATTR_RO_MODE(systab, 0400);
> 
> +static struct linux_efi_memreserve *efi_memreserve_root __ro_after_init;
> +static ssize_t memreserve_show(struct kobject *kobj,
> +struct kobj_attribute *attr, char *buf)
> +{
> + struct linux_efi_memreserve *rsv;
> + unsigned long prsv;
> + char *str = buf;
> + int index, i;
> +
> + if (!kobj || !buf)
> + return -EINVAL;
> +
> + if (!efi_memreserve_root)
> + return -ENODEV;

Other functions use different conditions.
The latter efi_memreserve_root == (void *)ULONG_MAX is correct?

static int __init efi_memreserve_map_root(void)
{
if (efi.mem_reserve == EFI_INVALID_TABLE_ADDR)
return -ENODEV;

int __ref efi_mem_reserve_persistent(phys_addr_t addr, u64 size)
{
struct linux_efi_memreserve *rsv;
unsigned long prsv;
int rc, index;

if (efi_memreserve_root == (void *)ULONG_MAX)
return -ENODEV;

> +
> + for (prsv = efi_memreserve_root->next; prsv; prsv = rsv->next) {
> + rsv = memremap(prsv, sizeof(*rsv), MEMREMAP_WB);

memremap() could fail with NULL as a return value.
You need to deal with such case.

It looks to me efi_mem_reserve_persistent() also doesn't deal with this.
Maybe you should fix this, too.

> + index = atomic_read(>count);
> + for (i = 0; i < index; i++)
> + str += sprintf(str, "%llx-%llx\n",
> + rsv->entry[i].base,
> + rsv->entry[i].base + rsv->entry[i].size -
> 1);

Is memreserve supported on 32-bit system?
If so, phy_addr_t could have a type of 4-byte length in such system
(not so if with PAE) and then %llx could lead to inconsistent type error.
It's enough to add a cast to unsigned long long.

> + memunmap(rsv);
> + }
> +
> + return str - buf;
> +}
> +

RE: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-25 Thread d.hatay...@fujitsu.com


> -Original Message-
> From: lijiang [mailto:liji...@redhat.com]
> Sent: Friday, October 25, 2019 10:31 AM
> To: Simon Horman ; Hatayama, Daisuke/畑山 大輔
> 
> Cc: linux-ker...@vger.kernel.org; jgr...@suse.com; thomas.lenda...@amd.com;
> b...@redhat.com; x...@kernel.org; kexec@lists.infradead.org;
> dhowe...@redhat.com; mi...@redhat.com; b...@alien8.de; ebied...@xmission.com;
> h...@zytor.com; t...@linutronix.de; dyo...@redhat.com; vgo...@redhat.com
> Subject: Re: [PATCH 1/2 v5] x86/kdump: always reserve the low 1MiB when the
> crashkernel option is specified
> 
> 在 2019年10月24日 19:33, lijiang 写道:
> > 在 2019年10月24日 18:07, Simon Horman 写道:
> >> Hi Linbo,
> >>
> >> thanks for your patch.
> >>
> >> On Wed, Oct 23, 2019 at 10:19:11PM +0800, Lianbo Jiang wrote:
> >>> Kdump kernel will reuse the first 640k region because the real mode
> >>> trampoline has to work in this area. When the vmcore is dumped, the
> >>> old memory in this area may be accessed, therefore, kernel has to
> >>> copy the contents of the first 640k area to a backup region so that
> >>> kdump kernel can read the old memory from the backup area of the
> >>> first 640k area, which is done in the purgatory().
> >>>
> >>> But, the current handling of copying the first 640k area runs into
> >>> problems when SME is enabled, kernel does not properly copy these
> >>> old memory to the backup area in the purgatory(), thereby, kdump
> >>> kernel reads out the encrypted contents, because the kdump kernel
> >>> must access the first kernel's memory with the encryption bit set
> >>> when SME is enabled in the first kernel. Please refer to this link:
> >>>
> >>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
> >>>
> >>> Finally, it causes the following errors, and the crash tool gets
> >>> invalid pointers when parsing the vmcore.
> >>>
> >>> crash> kmem -s|grep -i invalid
> >>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid
> freepointer:a6086ac099f0c5a4
> >>> kmem: dma-kmalloc-512: slab:d77680001c00 invalid
> freepointer:a6086ac099f0c5a4
> >>> crash>
> >>>
> >>> To avoid the above errors, when the crashkernel option is specified,
> >>> lets reserve the remaining low 1MiB memory(after reserving real mode
> >>> memory) so that the allocated memory does not fall into the low 1MiB
> >>> area, which makes us not to copy the first 640k content to a backup
> >>> region in purgatory(). This indicates that it does not need to be
> >>> included in crash dumps or used for anything except the processor
> >>> trampolines that must live in the low 1MiB.
> >>>
> >>> Signed-off-by: Lianbo Jiang 
> >>> ---
> >>> BTW:I also tried to fix the above problem in purgatory(), but there
> >>> are too many restricts in purgatory() context, for example: i can't
> >>> allocate new memory to create the identity mapping page table for
> >>> SME situation.
> >>>
> >>> Currently, there are two places where the first 640k area is needed,
> >>> the first one is in the find_trampoline_placement(), another one is
> >>> in the reserve_real_mode(), and their content doesn't matter.
> >>>
> >>> In addition, also need to clean all the code related to the backup
> >>> region later.
> >>>
> >>>  arch/x86/realmode/init.c |  2 ++
> >>>  include/linux/kexec.h|  2 ++
> >>>  kernel/kexec_core.c  | 13 +
> >>>  3 files changed, 17 insertions(+)
> >>>
> >>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> >>> index 7dce39c8c034..064cc79a015d 100644
> >>> --- a/arch/x86/realmode/init.c
> >>> +++ b/arch/x86/realmode/init.c
> >>> @@ -3,6 +3,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>
> >>>  #include 
> >>>  #include 
> >>> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
> >>>
> >>>   memblock_reserve(mem, size);
> >>>   set_real_mode_mem(mem);
> >>> + kexec_reserve_low_1MiB();
> >>>  }
> >>>
> >>>  static void __init setup_real_mode(void)
> >>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> >>> index 1776eb2e43a4..30acf1d738bc 100644
> >>> --- a/include/linux/kexec.h
> >>> +++ b/include/linux/kexec.h
> >>> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
> >>>  extern void crash_kexec(struct pt_regs *);
> >>>  int kexec_should_crash(struct task_struct *);
> >>>  int kexec_crash_loaded(void);
> >>> +void __init kexec_reserve_low_1MiB(void);
> >>>  void crash_save_cpu(struct pt_regs *regs, int cpu);
> >>>  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
> >>>
> >>> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs)
> { }
> >>>  static inline void crash_kexec(struct pt_regs *regs) { }
> >>>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
> >>>  static inline int kexec_crash_loaded(void) { return 0; }
> >>> +static inline void __init kexec_reserve_low_1MiB(void) { }
> >>>  #define kexec_in_progress false
> >>>  #endif /* CONFIG_KEXEC_CORE */
> >>>
> >>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c

RE: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the crashkernel option is specified

2019-10-24 Thread d.hatay...@fujitsu.com
I don't find the corresponding patch in the v5 patchset, so I comment here.

> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of lijiang
> Sent: Wednesday, October 23, 2019 2:35 PM
> To: Borislav Petkov 
> Cc: linux-ker...@vger.kernel.org; t...@linutronix.de; mi...@redhat.com;
> h...@zytor.com; x...@kernel.org; b...@redhat.com; dyo...@redhat.com;
> jgr...@suse.com; dhowe...@redhat.com; thomas.lenda...@amd.com;
> ebied...@xmission.com; vgo...@redhat.com; kexec@lists.infradead.org
> Subject: Re: [PATCH 1/3 v4] x86/kdump: always reserve the low 1MiB when the
> crashkernel option is specified
> 
> 在 2019年10月22日 16:30, Borislav Petkov 写道:
> > This ifdeffery needs to be a function in kernel/kexec_core.c which is
> > called by reserve_real_mode(), instead.
> 
> Would you mind if i improve this patch as follow? Thanks.
> 
> From 5804abec62279585f374d78ace1250505c44c6b7 Mon Sep 17 00:00:00 2001
> From: Lianbo Jiang 
> Date: Wed, 23 Oct 2019 11:27:04 +0800
> Subject: [PATCH] x86/kdump: always reserve the low 1MiB when the crashkernel
>  option is specified
> 
> Kdump kernel will reuse the first 640k region because the real mode
> trampoline has to work in this area. When the vmcore is dumped, the
> old memory in this area may be accessed, therefore, kernel has to
> copy the contents of the first 640k area to a backup region so that
> kdump kernel can read the old memory from the backup area of the
> first 640k area, which is done in the purgatory().
> 
> But, the current handling of copying the first 640k area runs into
> problems when SME is enabled, kernel does not properly copy these
> old memory to the backup area in the purgatory(), thereby, kdump
> kernel reads out the encrypted contents, because the kdump kernel
> must access the first kernel's memory with the encryption bit set
> when SME is enabled in the first kernel. Please refer to this link:
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793
> 
> Finally, it causes the following errors, and the crash tool gets
> invalid pointers when parsing the vmcore.
> 
> crash> kmem -s|grep -i invalid
> kmem: dma-kmalloc-512: slab:d77680001c00 invalid
> freepointer:a6086ac099f0c5a4
> kmem: dma-kmalloc-512: slab:d77680001c00 invalid
> freepointer:a6086ac099f0c5a4
> crash>
> 
> To avoid the above errors, when the crashkernel option is specified,
> lets reserve the remaining low 1MiB memory(after reserving real mode
> memory) so that the allocated memory does not fall into the low 1MiB
> area, which makes us not to copy the first 640k content to a backup
> region in purgatory(). This indicates that it does not need to be
> included in crash dumps or used for anything except the processor
> trampolines that must live in the low 1MiB.
> 
> Signed-off-by: Lianbo Jiang 
> ---
> BTW:I also tried to fix the above problem in purgatory(), but there
> are too many restricts in purgatory() context, for example: i can't
> allocate new memory to create the identity mapping page table for
> SME situation.
> 
> Currently, there are two places where the first 640k area is needed,
> the first one is in the find_trampoline_placement(), another one is
> in the reserve_real_mode(), and their content doesn't matter.
> 
> In addition, also need to clean all the code related to the backup
> region later.
> 
>  arch/x86/realmode/init.c |  2 ++
>  include/linux/kexec.h|  2 ++
>  kernel/kexec_core.c  | 13 +
>  3 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> index 7dce39c8c034..064cc79a015d 100644
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -3,6 +3,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -34,6 +35,7 @@ void __init reserve_real_mode(void)
> 
>   memblock_reserve(mem, size);
>   set_real_mode_mem(mem);
> + kexec_reserve_low_1MiB();
>  }
> 
>  static void __init setup_real_mode(void)
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 1776eb2e43a4..30acf1d738bc 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -306,6 +306,7 @@ extern void __crash_kexec(struct pt_regs *);
>  extern void crash_kexec(struct pt_regs *);
>  int kexec_should_crash(struct task_struct *);
>  int kexec_crash_loaded(void);
> +void kexec_reserve_low_1MiB(void);
>  void crash_save_cpu(struct pt_regs *regs, int cpu);
>  extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
> 
> @@ -397,6 +398,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { }
>  static inline void crash_kexec(struct pt_regs *regs) { }
>  static inline int kexec_should_crash(struct task_struct *p) { return 0; }
>  static inline int kexec_crash_loaded(void) { return 0; }
> +static inline void kexec_reserve_low_1MiB(void) { }
>  #define kexec_in_progress false
>  #endif /* CONFIG_KEXEC_CORE */
> 
> diff --git a/kernel/kexec_core.c