from:"Andrew Morton"

Re: [PATCH -next] crash: fix x86_32 memory reserve dead loop retry bug at "high"

2024-07-17 Thread Andrew Morton

On Wed, 17 Jul 2024 21:38:41 +0800 Baoquan He  wrote:

> 1) revert commit 8f9dade5906a in Andrew's tree;

Thanks, dropped.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] crash: use macro to add crashk_res into iomem early for specific arch

2024-03-25 Thread Andrew Morton

On Mon, 25 Mar 2024 09:50:50 +0800 Baoquan He  wrote:

> There are regression reports[1][2] that crashkernel region on x86_64 can't
> be added into iomem tree sometime. This causes the later failure of kdump
> loading.

So I think a cc:stable is needed.

> This happened after commit 4a693ce65b18 ("kdump: defer the insertion of
> crashkernel resources") was merged.
> 
> Even though, these reported issues are proved to be related to other
> component, they are just exposed after above commmit applied, I still
> would like to keep crashk_res and crashk_low_res being added into iomem
> early as before because the early adding has been always there on x86_64
> and working very well. For safety of kdump, Let's change it back.

I'll use 4a693ce65b18 as the Fixes: target, since there is no
"Exposed-by-non-buggy-patch:" tag.  To tell the -stable tree
maintainers how far back in history this should be backported.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH RESEND] crash_core: export vmemmap when CONFIG_SPARSEMEM_VMEMMAP is enabled

2024-02-26 Thread Andrew Morton

On Mon, 26 Feb 2024 10:04:33 +0800 Huang Shijie  
wrote:

> In memory_model.h, if CONFIG_SPARSEMEM_VMEMMAP is configed,
> kernel will use vmemmap to do the __pfn_to_page/page_to_pfn,
> and kernel will not use the "classic sparse" to do the
> __pfn_to_page/page_to_pfn.
> 
> So export the vmemmap when CONFIG_SPARSEMEM_VMEMMAP is configed.
> This makes the user applications (crash, etc) get faster
> pfn_to_page/page_to_pfn operations too.

This is significantly out of sync with development trees.  Please take
a look at linux-next and redo the patch if still needed?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 00/14] Split crash out from kexec and clean up related config items

2024-02-22 Thread Andrew Morton

On Thu, 22 Feb 2024 10:47:29 +0530 Hari Bathini  wrote:

> 
> 
> On 22/02/24 2:27 am, Andrew Morton wrote:
> > On Wed, 21 Feb 2024 11:15:00 +0530 Hari Bathini  
> > wrote:
> > 
> >> On 04/02/24 8:56 am, Baoquan He wrote:
> >>>>> Hope Hari and Pingfan can help have a look, see if
> >>>>> it's doable. Now, I make it either have both kexec and crash enabled, or
> >>>>> disable both of them altogether.
> >>>>
> >>>> Sure. I will take a closer look...
> >>> Thanks a lot. Please feel free to post patches to make that, or I can do
> >>> it with your support or suggestion.
> >>
> >> Tested your changes and on top of these changes, came up with the below
> >> changes to get it working for powerpc:
> >>
> >>   
> >> https://lore.kernel.org/all/20240213113150.1148276-1-hbath...@linux.ibm.com/
> > 
> > So can we take it that you're OK with Baoquan's series as-is?
> 
> Hi Andrew,
> 
> If you mean
> 
> v3 (https://lore.kernel.org/all/20240124051254.67105-1-...@redhat.com/)
> +
> follow-up from Baoquan 
> (https://lore.kernel.org/all/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv/)
> 
> Yes.
> 

Can I add your Acked-by: and/or Tested-by: to the patches in this series?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 00/14] Split crash out from kexec and clean up related config items

2024-02-21 Thread Andrew Morton

On Wed, 21 Feb 2024 11:15:00 +0530 Hari Bathini  wrote:

> On 04/02/24 8:56 am, Baoquan He wrote:
> >>> Hope Hari and Pingfan can help have a look, see if
> >>> it's doable. Now, I make it either have both kexec and crash enabled, or
> >>> disable both of them altogether.
> >>
> >> Sure. I will take a closer look...
> > Thanks a lot. Please feel free to post patches to make that, or I can do
> > it with your support or suggestion.
> 
> Tested your changes and on top of these changes, came up with the below
> changes to get it working for powerpc:
> 
>  
> https://lore.kernel.org/all/20240213113150.1148276-1-hbath...@linux.ibm.com/

So can we take it that you're OK with Baoquan's series as-is?

Baoquan, do you believe the patches in mm-unstable are ready for moving
into mm-stable in preparation for an upstream merge?




___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 01/14] kexec: split crashkernel reservation code out from crash_core.c

2024-02-21 Thread Andrew Morton

On Wed, 21 Feb 2024 22:59:47 +0530 Sourabh Jain  
wrote:

> >   config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
> > -   def_bool CRASH_CORE
> > +   def_bool CRASH_RESEERVE
> 
> %s/CRASH_RESEERVE/CRASH_RESERVE? 

Yes, thanks, this has been addressed in a followon fixup patch
in the mm.git tree.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kdump: Defer the insertion of crashkernel resources

2024-01-05 Thread Andrew Morton

On Fri, 29 Dec 2023 16:02:13 +0800 Huacai Chen  wrote:

> In /proc/iomem, sub-regions should be inserted after their parent,
> otherwise the insertion of parent resource fails. But after generic
> crashkernel reservation applied, in both RISC-V and ARM64 (LoongArch
> will also use generic reservation later on), crashkernel resources are
> inserted before their parent, which causes the parent disappear in
> /proc/iomem. So we defer the insertion of crashkernel resources to an
> early_initcall().
> 
> ...
>
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -377,7 +377,6 @@ static int __init reserve_crashkernel_low(unsigned long 
> long low_size)
>  
>   crashk_low_res.start = low_base;
>   crashk_low_res.end   = low_base + low_size - 1;
> - insert_resource(_resource, _low_res);
>  #endif
>   return 0;
>  }
> @@ -459,8 +458,19 @@ void __init reserve_crashkernel_generic(char *cmdline,
>  
>   crashk_res.start = crash_base;
>   crashk_res.end = crash_base + crash_size - 1;
> - insert_resource(_resource, _res);
>  }
> +
> +static __init int insert_crashkernel_resources(void)
> +{
> + if (crashk_res.start < crashk_res.end)
> + insert_resource(_resource, _res);
> +
> + if (crashk_low_res.start < crashk_low_res.end)
> + insert_resource(_resource, _low_res);
> +
> + return 0;
> +}
> +early_initcall(insert_crashkernel_resources);
>  #endif
>  
>  int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,

I'm thinking 

Fixes: 0ab97169aa0 ("crash_core: add generic function to do reservation").

Also, is this a regression?  Were earlier kernels OK?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] crash_core: optimize crash_exclude_mem_range()

2023-12-29 Thread Andrew Morton

On Wed, 20 Dec 2023 00:34:18 +0800 Yuntao Wang  wrote:

> Because memory ranges in mem->ranges are stored in ascending order, when we
> detect `p_end < start`, we can break the for loop early, as the subsequent
> memory ranges must also be outside the range we are looking for.
> 
> Signed-off-by: Yuntao Wang 
> ---
> Hi Andrew,
> 
> Patch "[PATCH 2/2] crash_core: fix out-of-bounds access check in
> crash_exclude_mem_range()" can be ignored, use this patch instead.
> 

Some reviewer input on this would be helpful please?

> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -575,9 +575,12 @@ int crash_exclude_mem_range(struct crash_mem *mem,
>   p_start = mstart;
>   p_end = mend;
>  
> - if (p_start > end || p_end < start)
> + if (p_start > end)
>   continue;
>  
> + if (p_end < start)
> + break;
> +
>   /* Truncate any area outside of range */
>   if (p_start < start)
>   p_start = start;
> -- 
> 2.43.0

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 3/3] crash_core: fix and simplify the logic of crash_exclude_mem_range()

2023-12-29 Thread Andrew Morton

On Sat, 16 Dec 2023 11:31:04 +0800 Baoquan He  wrote:

> > > Imagine we have a crashkernel region 256M reserved under 4G, say [2G, 
> > > 2G+256M].
> > > Then after excluding the 256M from a region, it should stop. But now, 
> > > this patch
> > > will make it continue scanning. Not sure if it's all in my mind.
> > 
> > Hi Baoquan,
> > 
> > Thank you for such a detailed reply. Now I finally understand why the code 
> > is
> > written this way.
> > 
> > However, if we can guarantee its correctness, wouldn't it be better to use 
> > the
> > generic region removing logic? At least it is more concise and clear, and 
> > other
> > people reading this code for the first time wouldn't get confused like me.
> > 
> > As for your concern about the while loop, I think it wouldn't affect 
> > performance
> > much because the total number of loops is small.
> 
> Well, see below kexec-tools commit, you wouldn't say that. And when you
> understand the code, you will feel a little uncomfortable about the
> sustaining useless scanning. At least, we should stop scanning after
> needed exluding is done.
> 
> Or, we may need add a generic region removing function so that it
> can be shared, e.g e820 memory region removing, memblock region removing.
> Otherwise, I can't see why a specific region excluding need a generic 
> region removing function.

So where do we now stand on this patchset?

Thanks.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec_core: fix the assignment to kimage->control_page

2023-12-21 Thread Andrew Morton

On Thu, 21 Dec 2023 12:23:08 +0800 Yuntao Wang  wrote:

> image->control_page represents the starting address for allocating the next
> control page, while hole_end represents the address of the last valid byte
> of the currently allocated control page.
> 
> Therefore, after successfully allocating a control page, image->control_page
> should be updated to `hole_end + 1`, rather than hole_end.

Thanks.  Again, please include a description of the userspace-visible effects 
of the bug.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4 5/7] kexec_file, ricv: print out debugging message if required

2023-12-20 Thread Andrew Morton

On Wed, 20 Dec 2023 12:22:29 +0800 Baoquan He  wrote:

> Could you help fix the typo in subject?
> 
> [PATCH v4 5/7] kexec_file, ricv: print out debugging message if required
>~~~ s/ricv/riscv/

I made that change.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs

2023-12-19 Thread Andrew Morton

On Tue, 19 Dec 2023 15:29:01 +0800 Yuntao Wang  wrote:

> When an error is detected, use pr_err() instead of kexec_dprintk() to
> output log message.
>
> In addition, remove the unnecessary return from set_page_address().

The above describes what the code does, which is already quite clear
from looking at the code.

Please write changelogs and code comments which describe *why* the code
does something, rather than *what* it does.

>
>
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -429,7 +429,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>* command line. Make sure it does not overflow
>*/
>   if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> - kexec_dprintk("Appending elfcorehdr= to command line 
> exceeds maximum allowed length\n");
> + pr_err("Appending elfcorehdr= to command line exceeds 
> maximum allowed length\n");

ie, what are the reasons for this change?

>   return ERR_PTR(-EINVAL);
>   }
>  
> diff --git a/mm/highmem.c b/mm/highmem.c
> index e19269093a93..bd48ba445dd4 100644
> --- a/mm/highmem.c
> +++ b/mm/highmem.c
> @@ -799,8 +799,6 @@ void set_page_address(struct page *page, void *virtual)
>   }
>   spin_unlock_irqrestore(>lock, flags);
>   }
> -
> - return;
>  }
>  
>  void __init page_address_init(void)
> -- 
> 2.43.0

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v5 3/3] x86/kexec: use pr_err() instead of pr_debug() when an error occurs

2023-12-18 Thread Andrew Morton

On Sun, 17 Dec 2023 11:35:28 +0800 Yuntao Wang  wrote:

> When an error is detected, use pr_err() instead of pr_debug() to output
> log message.
> 
> In addition, remove the unnecessary return from set_page_address().
> 
> ...
>
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -424,7 +424,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>* command line. Make sure it does not overflow
>*/
>   if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
> - pr_debug("Appending elfcorehdr= to command line exceeds 
> maximum allowed length\n");
> + pr_err("Appending elfcorehdr= to command line exceeds 
> maximum allowed length\n");
>   return ERR_PTR(-EINVAL);
>   }

https://lkml.kernel.org/r/20231213055747.61826-4-...@redhat.com has
already changed this to call kexec_dprintk().  I'll skip this patch.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 2/2] crash_core: fix out-of-bounds access check in crash_exclude_mem_range()

2023-12-18 Thread Andrew Morton

On Mon, 18 Dec 2023 16:19:15 +0800 Yuntao Wang  wrote:

> mem->nr_ranges represents the current number of elements stored in
> the mem->ranges array, and mem->max_nr_ranges represents the maximum number
> of elements that the mem->ranges array can hold. Therefore, the correct
> array out-of-bounds check should be mem->nr_ranges >= mem->max_nr_ranges.
> 

This does not apply after your own "crash_core: fix and simplify the
logic of crash_exclude_mem_range()".  What should be done?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] crash_core: Fix the check for whether crashkernel is from high memory

2023-12-09 Thread Andrew Morton

On Sat,  9 Dec 2023 22:14:38 +0800 Yuntao Wang  wrote:

> If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that
> the crashkernel memory is allocated from high memory. However, the
> current check only considers the case where crash_base is greater than
> CRASH_ADDR_LOW_MAX. Fix it.
> 
> This patch also includes some minor cleanups.

Can we please include a description of the runtime effects of this
change?  ie, what happens now and under what circumstances, and how
does the fix alter these things?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies

2023-11-30 Thread Andrew Morton

On Thu, 2 Nov 2023 16:03:18 +0800 Baoquan He  wrote:

> > > CONFIG_KEXEC_FILE, but still get purgatory code built in which is
> > > totally useless.
> > >
> > > Not sure if I think too much over this.
> > 
> > I see your point here, and I would suggest changing the
> > CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate
> > the availability of the purgatory code for the arch, rather
> > than actually controlling the code itself. I already mentioned
> > this for s390, but riscv would need the same thing on top.
> > 
> > I think the change below should address your concern.
> 
> Since no new comment, do you mind spinning v2 to wrap all these up?

This patchset remains in mm-hotfixes-unstable from the previous -rc
cycle.  Eric, do you have any comments?  Arnd, do you plan on a v2?  If
not, should I merge v1?  If so, should I now add cc:stable?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP

2023-11-29 Thread Andrew Morton

On Wed, 29 Nov 2023 22:34:13 + Ignat Korchagin  wrote:

> On Wed, Nov 29, 2023 at 10:23 PM Andrew Morton
>  wrote:
> >
> > On Wed, 29 Nov 2023 22:04:09 + Ignat Korchagin  
> > wrote:
> >
> > > Fixes: 91506f7e5d21 ("arm64/kexec: refactor for kernel/Kconfig.kexec")
> > > Cc: sta...@vger.kernel.org # 6.6+: f8ff234: kernel/Kconfig.kexec: drop 
> > > select of KEXEC for CRASH_DUMP
> > > Cc: sta...@vger.kernel.org # 6.6+
> >
> > I doubt if anyone knows what the two above lines mean.  What are your
> > recommendations for the merging of this patch?
> 
> Hmm... I was just following
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#option-1
> and basically wanted to make sure that this patch gets backported
> together with commit f8ff234: kernel/Kconfig.kexec: drop select of
> KEXEC for CRASH_DUMP (they should go together)

I see, thanks.  I don't think I've ever received a patch which did this!



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP

2023-11-29 Thread Andrew Morton

On Wed, 29 Nov 2023 22:04:09 + Ignat Korchagin  wrote:

> Fixes: 91506f7e5d21 ("arm64/kexec: refactor for kernel/Kconfig.kexec")
> Cc: sta...@vger.kernel.org # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select 
> of KEXEC for CRASH_DUMP
> Cc: sta...@vger.kernel.org # 6.6+

I doubt if anyone knows what the two above lines mean.  What are your
recommendations for the merging of this patch?

> Signed-off-by: Ignat Korchagin 


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 2/3] drivers/base/cpu: crash data showing should depends on KEXEC_CORE

2023-11-24 Thread Andrew Morton

On Thu, 23 Nov 2023 19:15:43 +0800 Baoquan He  wrote:

> > > CONFIG_KEXEC is used to enable kexec_load interface, the
> > > crash_notes/crash_notes_size/crash_hotplug showing depends on
> > > CONFIG_KEXEC is incorrect. It should depend on KEXEC_CORE instead.
> > >
> > > Fix it now.
> > 
> > Can we add Fixes/CC stable, so it gets eventually backported into 6.6?
> 
> Makes sense. Will add it in v2 since I need respin to add the missing
> stuff in patch 1. Thanks.

Please avoid mixing cc:stable patches and this-merge-window fixes in
the same series as next-merge-window material.  Because I'll just have
to separate them out anyway, causing what-I-merged to unnecessarily
differ from what-you-sent.



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] resource: add walk_system_ram_res_rev()

2023-11-14 Thread Andrew Morton

On Tue, 14 Nov 2023 17:16:57 +0800 Baoquan He  wrote:

> This function, being a variant of walk_system_ram_res() introduced in
> commit 8c86e70acead ("resource: provide new functions to walk through
> resources"), walks through a list of all the resources of System RAM
> in reversed order, i.e., from higher to lower.
> 
> It will be used in kexec_file code to load kernel, initrd etc when
> preparing kexec reboot.
>
> ...
>
> +/*
> + * This function, being a variant of walk_system_ram_res(), calls the @func
> + * callback against all memory ranges of type System RAM which are marked as
> + * IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY in reversed order, i.e., from
> + * higher to lower.
> + */
> +int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + struct resource res, *rams;
> + int rams_size = 16, i;
> + unsigned long flags;
> + int ret = -1;
> +
> + /* create a list */
> + rams = kvcalloc(rams_size, sizeof(struct resource), GFP_KERNEL);
> + if (!rams)
> + return ret;
> +
> + flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + i = 0;
> + while ((start < end) &&
> + (!find_next_iomem_res(start, end, flags, IORES_DESC_NONE, 
> ))) {
> + if (i >= rams_size) {
> + /* re-alloc */
> + struct resource *rams_new;
> + int rams_new_size;
> +
> + rams_new_size = rams_size + 16;
> + rams_new = kvcalloc(rams_new_size, sizeof(struct 
> resource),
> + GFP_KERNEL);

kvrealloc()?

> + if (!rams_new)
> + goto out;
> +
> + memcpy(rams_new, rams,
> + sizeof(struct resource) * rams_size);
> + kvfree(rams);
> + rams = rams_new;
> + rams_size = rams_new_size;
> + }
> +
> + rams[i].start = res.start;
> + rams[i++].end = res.end;
> +
> + start = res.end + 1;
> + }
> +
> + /* go reverse */
> + for (i--; i >= 0; i--) {
> + ret = (*func)([i], arg);
> + if (ret)
> + break;
> + }
> +
> +out:
> + kvfree(rams);
> + return ret;
> +}
> +
>  /*
>   * This function calls the @func callback against all memory ranges, which
>   * are ranges marked as IORESOURCE_MEM and IORESOUCE_BUSY.
> -- 
> 2.41.0

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v3] Crash: add lock to serialize crash hotplug handling

2023-09-26 Thread Andrew Morton

On Tue, 26 Sep 2023 20:09:05 +0800 Baoquan He  wrote:

> Eric reported that handling corresponding crash hotplug event can be
> failed easily when many memory hotplug event are notified in a short
> period. They failed because failing to take __kexec_lock.

I'm assuming that this failure is sufficiently likely so as to  justify a 
-stable
backport of the fix.   Please let me know if this is incorrect.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: change locking mechanism to a mutex

2023-09-21 Thread Andrew Morton

On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder  
wrote:

> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
> 
>  crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> 
> When hotplug events occur, the crash hotplug infrastructure first
> attempts to obtain the lock via the kexec_trylock(). However, the
> implementation either acquires the lock, or fails and returns; there
> is no waiting on the lock. Here is the comment/explanation from
> kernel/kexec_internal.h:kexec_trylock():
> 
>  * Whatever is used to serialize accesses to the kexec_crash_image needs to be
>  * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use 
> a
>  * "simple" atomic variable that is acquired with a cmpxchg().
> 
> While this in theory can happen for either CPU or memory hoptlug,
> this problem is most prone to occur for memory hotplug.
> 
> When memory is hot plugged, the memory is converted into smaller
> 128MiB memblocks (typically). As each memblock is processed, a
> kernel thread and a udev event thread are created. The udev thread
> tries for the lock via the reading of the sysfs node
> /sys/devices/system/memory/crash_hotplug node, and the kernel
> worker thread tries for the lock upon entering the crash hotplug
> infrastructure.
> 
> These threads then compete for the kexec lock.
> 
> For example, a 1GiB DIMM is converted into 8 memblocks, each
> spawning two threads for a total of 16 threads that create a small
> "swarm" all trying to acquire the lock. The larger the DIMM, the
> more the memblocks and the larger the swarm.
> 
> At the root of the problem is the atomic lock behind kexec_trylock();
> it works well for low lock traffic; ie loading/unloading a capture
> kernel, things that happen basically once. But with the introduction
> of crash hotplug, the traffic through the lock increases significantly,
> and more importantly in bursts occurring at roughly the same time. Thus
> there is a need to wait on the lock.
> 
> A possible workaround is to simply retry the lock, say up to N times.
> There is, of course, the problem of determining a value of N that works for
> all implementations, and for all the other call sites of kexec_trylock().
> Not ideal.
> 
> The design decision to use the atomic lock is described in the comment
> from kexec_internal.h, cited above. However, examining the code of
> __crash_kexec():
> 
> if (kexec_trylock()) {
> if (kexec_crash_image) {
> ...
> }
> kexec_unlock();
> }
> 
> reveals that the use of kexec_trylock() here is actually a "best effort"
> due to the atomic lock.  This atomic lock, prior to crash hotplug,
> would almost always be assured (another kexec syscall could hold the lock
> and prevent this, but that is about it).
> 
> So at the point where the capture kernel would be invoked, if the lock
> is not obtained, then kdump doesn't occur.
> 
> It is possible to instead use a mutex with proper waiting, and utilize
> mutex_trylock() as the "best effort" in __crash_kexec(). The use of a
> mutex then avoids all the lock acquisition problems that were revealed
> by the crash hotplug activity.
> 
> Convert the atomic lock to a mutex.
> 
> ...
>
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -47,7 +47,7 @@
>  #include 
>  #include "kexec_internal.h"
>  
> -atomic_t __kexec_lock = ATOMIC_INIT(0);
> +DEFINE_MUTEX(__kexec_lock);
>  
>  /* Flag to indicate we are going to kexec a new kernel */
>  bool kexec_in_progress = false;
> @@ -1057,7 +1057,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
>* of memory the xchg(_crash_image) would be
>* sufficient.  But since I reuse the memory...
>*/
> - if (kexec_trylock()) {
> + if (mutex_trylock(&__kexec_lock)) {
>   if (kexec_crash_image) {
>   struct pt_regs fixed_regs;

What's happening here?  If someone else held the lock we silently fail
to run the kexec?  Shouldn't we at least alert the user to what just
happened?



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: change locking mechanism to a mutex

2023-09-21 Thread Andrew Morton

On Thu, 21 Sep 2023 17:59:38 -0400 Eric DeVolder  
wrote:

> Scaled up testing has revealed that the kexec_trylock()
> implementation leads to failures within the crash hotplug
> infrastructure due to the inability to acquire the lock,
> specifically the message:
> 
> ...
>
> Convert the atomic lock to a mutex.
> 

Do you think this problem is serious enough to warrant a backport into
-stable kernels?  

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v28 0/8] crash: Kernel handling of CPU and memory hot un/plug

2023-08-14 Thread Andrew Morton

On Mon, 14 Aug 2023 17:44:38 -0400 Eric DeVolder  
wrote:

> This series is dependent upon "refactor Kconfig to consolidate
> KEXEC and CRASH options".
>  https://lore.kernel.org/lkml/20230712161545.87870-1-eric.devol...@oracle.com/
> 
> Once the kdump service is loaded, if changes to CPUs or memory occur,
> either by hot un/plug or off/onlining, the crash elfcorehdr must also
> be updated.

Thanks, I updated branch mm-nonmm-unstable to this version.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 0/2] kexec: Remove unnecessary arch hook

2023-03-23 Thread Andrew Morton

On Thu, 23 Mar 2023 15:49:20 +0800 Baoquan He  wrote:

> Hi,
> 
> On 03/07/23 at 04:44pm, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas 
> > 
> > There are no arch-specific things in arch_kexec_kernel_image_load(), so
> > remove it and just use the generic version.
> > 
> > v1 is at:
> > https://lore.kernel.org/all/20221215182339.129803-1-helg...@kernel.org/
> > 
> > This v2 is trivially rebased to v6.3-rc1 and the commit log expanded
> > slightly.
> 
> This is an obvious and good cleanup patchset, who should I ping to ask
> for accepting?  It's touching kexec generic code, while the hook
> only exists on x86 ARCH.

I grabbed them

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4] notifiers: Add tracepoints to the notifiers infrastructure

2023-03-14 Thread Andrew Morton

On Tue, 14 Mar 2023 18:08:37 -0300 "Guilherme G. Piccoli"  
wrote:

> On 14/03/2023 17:50, Andrew Morton wrote:
> > On Tue, 14 Mar 2023 17:00:58 -0300 "Guilherme G. Piccoli" 
> >  wrote:
> > 
> >>  include/trace/events/notifiers.h | 69 
> >>  kernel/notifier.c|  6 +++
> > 
> > Perhaps the filenames should match, which means "notifier.h".
> 
> Hi Andrew, thanks!
> 
> Do you want me to re-submit? I see some emails of the patch getting
> added to "mm-nonmm-unstable" (and also a checkpatch fixes you added on
> top of that). Lemme know how should I proceed.

I changed it, thanks.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4] notifiers: Add tracepoints to the notifiers infrastructure

2023-03-14 Thread Andrew Morton

On Tue, 14 Mar 2023 17:00:58 -0300 "Guilherme G. Piccoli"  
wrote:

>  include/trace/events/notifiers.h | 69 
>  kernel/notifier.c|  6 +++

Perhaps the filenames should match, which means "notifier.h".

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4] panic: Fixes the panic_print NMI backtrace setting

2023-02-25 Thread Andrew Morton

On Fri, 10 Feb 2023 17:35:10 -0300 "Guilherme G. Piccoli"  
wrote:

> Commit 8d470a45d1a6 ("panic: add option to dump all CPUs backtraces in 
> panic_print")
> introduced a setting for the "panic_print" kernel parameter to allow
> users to request a NMI backtrace on panic. Problem is that the panic_print
> handling happens after the secondary CPUs are already disabled, hence
> this option ended-up being kind of a no-op - kernel skips the NMI trace
> in idling CPUs, which is the case of offline CPUs.
> 
> Fix it by checking the NMI backtrace bit in the panic_print prior to
> the CPU disabling function.
> 
> ...
> 
> Notice that while at it, I got rid of the "crash_kexec_post_notifiers"
> local copy in panic(). This was introduced by commit b26e27ddfd2a
> ("kexec: use core_param for crash_kexec_post_notifiers boot option"),
> but it is not clear from comments or commit message why this local copy
> is required.
> 
> My understanding is that it's a mechanism to prevent some concurrency,
> in case some other CPU modify this variable while panic() is running.
> I find it very unlikely, hence I removed it - but if people consider
> this copy needed, I can respin this patch and keep it, even providing a
> comment about that, in order to be explict about its need.

Only two sites change crash_kexec_post_notifiers, in
arch/powerpc/kernel/fadump.c and drivers/hv/hv_common.c.  Yes, it's
very unlikely that this will be altered while panic() is running and
the consequences will be slight anyway.

But formally, we shouldn't do this, especially in a -stable
backportable patch.  So please, let's have the minimal bugfix for now
and we can look at removing that local at a later time?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH AUTOSEL 6.1 13/22] proc/vmcore: fix potential memory leak in vmcore_init()

2022-12-17 Thread Andrew Morton

On Sat, 17 Dec 2022 10:27:14 -0500 Sasha Levin  wrote:

> From: Jianglei Nie 
> 
> [ Upstream commit 12b9d301ff73122aebd78548fa4c04ca69ed78fe ]
> 
> Patch series "Some minor cleanup patches resent".
> 
> The first three patches trivial clean up patches.
>
> And for the patch "kexec: replace crash_mem_range with range", I got a
> ibm-p9wr ppc64le system to test, it works well.
> 
> This patch (of 4):
> 
> elfcorehdr_alloc() allocates a memory chunk for elfcorehdr_addr with
> kzalloc().  If is_vmcore_usable() returns false, elfcorehdr_addr is a
> predefined value.  If parse_crash_elf_headers() gets some error and
> returns a negetive value, the elfcorehdr_addr should be released with
> elfcorehdr_free().

This is exceedingly minor - a single memory leak per boot, under very
rare circumstances.

With every patch I merge I consider -stable.  Often I'll discuss the
desirability of a backport with the author and with reviewers.  Every
single patch.  And then some damn script comes along and overrides that
quite careful decision.  argh.

Can we please do something like

if (akpm && !cc:stable)
dont_backport()

And even go further - if your script thinks it might be something we
should backport and if it didn't have cc:stable then contact the
author, reviewers and committers and ask them to reconsider before we
go and backport it.  This approach will have the advantage of training
people to consider the backport more consistently.

I'd (still) like to have a new patch tag like Not-For-Stable: or
cc:not-stable or something to tell your scripts "yes, we thought about
it and we decided no".

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] vmcoreinfo: Warn if we exceed vmcoreinfo data size

2022-11-08 Thread Andrew Morton

On Thu, 27 Oct 2022 13:50:08 -0700 Stephen Brennan 
 wrote:

> Though vmcoreinfo is intended to be small, at just one page, useful
> information is still added to it, so we risk running out of space.
> Currently there is no runtime check to see whether the vmcoreinfo buffer
> has been exhausted. Add a warning for this case.
> 
> Currently, my static checking tool[1] indicates that a good upper bound
> for vmcoreinfo size is currently 3415 bytes, but the best time to add
> warnings is before the risk becomes too high.
> 
> ...
>
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -383,6 +383,9 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>   memcpy(_data[vmcoreinfo_size], buf, r);
>  
>   vmcoreinfo_size += r;
> +
> + WARN_ONCE(vmcoreinfo_size == VMCOREINFO_BYTES,
> +   "vmcoreinfo data exceeds allocated size, truncating");
>  }

Seems that vmcoreinfo_append_str() will truncate (ie: corrupt) the
final entry when limiting the overall data size to VMCOREINFO_BYTES. 
And that final entry will be missing any terminating \n or \0.

Is all this desirable, or should we be checking for (and warning about)
sufficient space _before_ appending this string?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4 0/2] kexec, panic: Making crash_kexec() NMI safe

2022-10-03 Thread Andrew Morton

On Mon, 03 Oct 2022 14:20:51 +0100 Valentin Schneider  
wrote:

> > I'll stash it away for consideration after -rc1.
> 
> I've seen them in linux-next for a while, am I right in assuming they'll be
> part of a 6.1 PR?

yup.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] vmcoreinfo: add kallsyms_num_syms symbol

2022-08-25 Thread Andrew Morton

On Thu, 25 Aug 2022 10:27:04 -0700 Stephen Brennan 
 wrote:

> I just wanted to check on the status of this. I got your email about it 
> being added to mm-hotfixes-unstable, and I do still see it in the quilt 
> patches set:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/vmcoreinfo-add-kallsyms_num_syms-symbol.patch
> 
> I see the patch in mm-everything-2022-08-15-02-47 but it seems to have 
> dropped out in mm-everything-2022-08-17-20-59 and later tags. As a 
> result, it no longer shows up in linux-next.
> 
> Please excuse me if I'm ignorant of part of your process here, I know 
> you handle a huge volume of patches!

Thanks for noticing this - I appear to have messed up a rebase and lost
a batch of hotfixes :(

I readded

mm-hugetlb-avoid-corrupting-page-mapping-in-hugetlb_mcopy_atomic_pte.patch
shmem-update-folio-if-shmem_replace_page-updates-the-page.patch
writeback-avoid-use-after-free-after-removing-device.patch
mailmap-update-guilherme-g-piccolis-email-addresses.patch
vmcoreinfo-add-kallsyms_num_syms-symbol.patch
mm-re-allow-pinning-of-zero-pfns-again.patch

to the hotfixes queue.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4 0/2] kexec, panic: Making crash_kexec() NMI safe

2022-07-13 Thread Andrew Morton

On Tue, 12 Jul 2022 12:13:03 +0100 Valentin Schneider  
wrote:

> On 12/07/22 10:47, Baoquan He wrote:
> > Hi,
> >
> > On 06/30/22 at 11:32pm, Valentin Schneider wrote:
> >> Hi folks,
> >>
> >> Here's ~v3~ v4 where we now completely get rid of kexec_mutex.
> >>
> >> o Patch 1 makes sure all kexec_mutex acquisitions are trylocks. This 
> >> prevents
> >>   having to add any while(atomic_cmpxchg()) loops which I'd really hate to 
> >> see
> >>   here. If that can't be done then I think we're better off with the 
> >> combined
> >>   mutex+atomic var approach.
> >> o Patch 2 does the mutex -> atomic var switch.
> >
> > This series looks good, has it been taken into any tree?
> >
> 
> I don't think so, briefly poked around git and haven't seen it anywhere.

I'll stash it away for consideration after -rc1.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 0/2] Expose kallsyms data in vmcoreinfo note

2022-06-13 Thread Andrew Morton

On Mon, 13 Jun 2022 14:59:44 -0700 Stephen Brennan 
 wrote:

> >> Related discussion around the BTF side of this:
> >> https://lore.kernel.org/bpf/586a6288-704a-f7a7-b256-e18a67592...@oracle.com/T/#u
> >> 
> >> Some work-in-progress branches using this feature:
> >> https://github.com/brenns10/dwarves/tree/remove_percpu_restriction_1
> >> https://github.com/brenns10/drgn/tree/kallsyms_plus_btf
> >
> > What's the story on using gdb with this?
> 
> There is no story with GDB as of yet. I was already familiar with the
> code of drgn when I started down this path, so that's what I used. Drgn
> happens to have a very extensible type system which made it quite simple
> to do. I'd love to see support for doing this with GDB, and might look
> into the feasibility of it, but it's not on my roadmap right now.

Naive question - could some standalone tool take this kallsyms-based
info, combine it with a core image and create a minimally-dwarfified
file which any debugger can munch on?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 0/2] Expose kallsyms data in vmcoreinfo note

2022-06-13 Thread Andrew Morton

On Mon, 16 May 2022 17:05:06 -0700 Stephen Brennan 
 wrote:

> The kernel can be configured to contain a lot of introspection or
> debugging information built-in, such as ORC for unwinding stack traces,
> BTF for type information, and of course kallsyms. Debuggers could use
> this information to navigate a core dump or live system, but they need
> to be able to find it.
> 
> This patch series adds the necessary symbols into vmcoreinfo, which
> would allow a debugger to find and interpret the kallsyms table. Using
> the kallsyms data, the debugger can then lookup any symbol, allowing it
> to find ORC, BTF, or any other useful data.
> 
> This would allow a live kernel, or core dump, to be debugged without
> any DWARF debuginfo. This is useful for many cases: the debuginfo may
> not have been generated, or you may not want to deploy the large files
> everywhere you need them.

Am trying to understand the value of all of this.  Can you explain
further why carrying the dwarf info is problematic?  How problematic
are these large files?

> I've demonstrated a proof of concept for this at LSF/MM+BPF during a
> lighting talk. Using a work-in-progress branch of the drgn debugger, and
> an extended set of BTF generated by a patched version of dwarves, I've
> been able to open a core dump without any DWARF info and do basic tasks
> such as enumerating slab caches, block devices, tasks, and doing
> backtraces. I hope this series can be a first step toward a new
> possibility of "DWARFless debugging".
> 
> Related discussion around the BTF side of this:
> https://lore.kernel.org/bpf/586a6288-704a-f7a7-b256-e18a67592...@oracle.com/T/#u
> 
> Some work-in-progress branches using this feature:
> https://github.com/brenns10/dwarves/tree/remove_percpu_restriction_1
> https://github.com/brenns10/drgn/tree/kallsyms_plus_btf

What's the story on using gdb with this?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-25 Thread Andrew Morton

On Fri, 20 May 2022 14:25:05 -0500 "Eric W. Biederman"  
wrote:

> > I am not strongly against taking off __weak, just wondering if there's
> > chance to fix it in recordmcount, and the cost comparing with kernel fix;
> > except of this issue, any other weakness of __weak. Noticed Andrew has
> > picked this patch, as a witness of this moment, raise a tiny concern.
> 
> I just don't see what else we can realistically do.

I think converting all of the kexec __weaks to use the ifdef approach
makes sense, if only because kexec is now using two different styles.

But for now, I'll send Naveen's v2 patch in to Linus to get us out of
trouble.

I'm thinking that we should add cc:stable to that patch as well, to
reduce the amount of problems which people experience when using newer
binutils on older kernels?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-19 Thread Andrew Morton

On Thu, 19 May 2022 23:17:48 +0800 kernel test robot  wrote:

> Hi "Naveen,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on f993aed406eaf968ba3867a76bb46c95336a33d0]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Naveen-N-Rao/kexec_file-Drop-weak-attribute-from-arch_kexec_apply_relocations-_add/20220519-171432
> base:   f993aed406eaf968ba3867a76bb46c95336a33d0
> config: s390-allmodconfig 
> (https://download.01.org/0day-ci/archive/20220519/202205192320.coxevcfr-...@intel.com/config)
> compiler: s390-linux-gcc (GCC) 11.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://github.com/intel-lab-lkp/linux/commit/67171688c71cb5b05f26e0dfc45eec8d8d1428ff
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review 
> Naveen-N-Rao/kexec_file-Drop-weak-attribute-from-arch_kexec_apply_relocations-_add/20220519-171432
> git checkout 67171688c71cb5b05f26e0dfc45eec8d8d1428ff
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.3.0 make.cross W=1 
> O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> All errors (new ones prefixed by >>):
> 
>In file included from arch/s390/kernel/machine_kexec_reloc.c:3:
> >> arch/s390/include/asm/kexec.h:89:38: error: unknown type name 'Elf_Shdr'; 
> >> did you mean 'elf_shdr'?
>   89 |  Elf_Shdr *section,
>  |  ^~~~
>  |  elf_shdr
>arch/s390/include/asm/kexec.h:90:44: error: unknown type name 'Elf_Shdr'
>   90 |  const Elf_Shdr *relsec,
>  |^~~~
>arch/s390/include/asm/kexec.h:91:44: error: unknown type name 'Elf_Shdr'
>   91 |  const Elf_Shdr *symtab);
>  |^~~~

Thanks, I did this:

--- 
a/arch/s390/include/asm/kexec.h~kexec_file-drop-weak-attribute-from-arch_kexec_apply_relocations-fix
+++ a/arch/s390/include/asm/kexec.h
@@ -9,6 +9,8 @@
 #ifndef _S390_KEXEC_H
 #define _S390_KEXEC_H
 
+#include 
+
 #include 
 #include 
 #include 
_


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]

2022-05-18 Thread Andrew Morton

On Wed, 18 May 2022 23:48:28 +0530 "Naveen N. Rao" 
 wrote:

> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
> symbols") [1], binutils (v2.36+) started dropping section symbols that
> it thought were unused.  This isn't an issue in general, but with
> kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a
> separate .text.unlikely section and the section symbol ".text.unlikely"
> is being dropped. Due to this, recordmcount is unable to find a non-weak
> symbol in .text.unlikely to generate a relocation record against.
> 
> Address this by dropping the weak attribute from these functions:
> - arch_kexec_apply_relocations() is not overridden by any architecture
>   today, so just drop the weak attribute.
> - arch_kexec_apply_relocations_add() is only overridden by x86 and s390.
>   Retain the function prototype for those and move the weak
>   implementation into the header as a static inline for other
>   architectures.
> 
> ...
>

Sigh.  This patch demonstrates why I like __weak :<

> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -229,6 +225,30 @@ extern int crash_exclude_mem_range(struct crash_mem *mem,
>  unsigned long long mend);
>  extern int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
>  void **addr, unsigned long *sz);
> +
> +#if defined(CONFIG_X86_64) || defined(CONFIG_S390)

Let's avoid listing the architectures here?  Better to add

select ARCH_HAVE_ARCH_KEXEC_APPLY_RELOCATIONS_ADD

to arch//Kconfig?

Please cc me on any additional work on this.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 0/2] x86/kexec: fix memory leak of elf header buffer

2022-04-12 Thread Andrew Morton

On Thu, 31 Mar 2022 19:32:17 +0800 Baoquan He  wrote:

> Hi Andrew,
> 
> On 02/23/22 at 07:32pm, Baoquan He wrote:
> > The memory leak is reported by kmemleak detector, has been existing
> > for very long time. It casue much memory loss on large machine
> > with huge memory hotplug which will trigger kdump kernel reloading
> > many times, with kexec_file_load interface.
> 
> Could you merge these two patches? Or should I ping x86 maintainers to
> take them? 

Ah, sorry, I tend to fall asleep if there's "x86" in the subject. 
Poking a sleeping Andrew is always the right thing to do.  Shall look
at them.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v5 0/3] Convert vmcore to use an iov_iter

2022-04-04 Thread Andrew Morton

> On Sat,  2 Apr 2022 12:30:05 +0800 Baoquan He  wrote:

You were on the patch delivery path, so these patches should have had
your signoff.  Documentation/process/submitting-patches.rst explains.

> Copy the description of v3 cover letter from Willy:
> ===
> For some reason several people have been sending bad patches to fix
> compiler warnings in vmcore recently.  Here's how it should be done.
> Compile-tested only on x86.  As noted in the first patch, s390 should
> take this conversion a bit further, but I'm not inclined to do that
> work myself.

We should tell the S390 maintainers this!

Can you please fix the signoff issue, add the S390 team to Cc and resend?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4 0/4] Convert vmcore to use an iov_iter

2022-03-31 Thread Andrew Morton

On Thu, 31 Mar 2022 15:36:39 +0100 Matthew Wilcox  wrote:

> On Thu, Mar 31, 2022 at 07:25:33PM +0800, Baoquan He wrote:
> > Hi Andrew,
> > 
> > On 03/18/22 at 05:37pm, Baoquan He wrote:
> > > Copy the description of v3 cover letter from Willy:
> > 
> > Could you pick this series into your tree? I reviewed the patches 1~3
> > and tested the whole patchset, no issue found.
> 
> ... I'd fold patch 4 into patch 1,

I think so too, please.  The addition then removal of a
read_from_oldmem() implementation is a bit odd.

> but yes, Andrew, please take these patches.

And against current -linus please.  There have been some changes since
then (rcu stuff).


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] proc/vmcore: fix possible deadlock on concurrent mmap and read

2022-01-25 Thread Andrew Morton

On Wed, 19 Jan 2022 20:34:17 +0100 David Hildenbrand  wrote:

> Lockdep noticed that there is chance for a deadlock if we have
> concurrent mmap, concurrent read, and the addition/removal of a
> callback.
> 
> As nicely explained by Boqun:
> 
> "
> Lockdep warned about the above sequences because rw_semaphore is a fair
> read-write lock, and the following can cause a deadlock:
> 
>   TASK 1  TASK 2  TASK 3
>   ==  ==  ==
>   down_write(mmap_lock);
>   down_read(vmcore_cb_rwsem)
>   down_write(vmcore_cb_rwsem); // 
> blocked
>   down_read(vmcore_cb_rwsem); // cannot get the lock because of the 
> fairness
>   down_read(mmap_lock); // blocked

I'm wondering about cc:stable.  It's hard to believe that this is
likely to be observed in real life.  But the ongoing reports of lockdep
splats will be irritating.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v3 0/5] Avoid requesting page from DMA zone when no managed pages

2021-12-13 Thread Andrew Morton

On Mon, 13 Dec 2021 20:27:07 +0800 Baoquan He  wrote:

> Background information can be checked in cover letter of v2 RESEND POST
> as below:
> https://lore.kernel.org/all/20211207030750.30824-1-...@redhat.com/T/#u

Please include all relevant info right here, in the [0/n].  For a
number of reasons, one of which is that the text is more likely to be
up to date as the patchset evolves.

It's unusual that this patchset has two non-urgent patches and the
final three patches are cc:stable.  It makes one worry that patches 3-5
might have dependencies on 1-2.  Also, I'd expect to merge the three
-stable patches during 5.16-rcX which means I have to reorder things,
redo changelogs, update links and blah blah.

So can I ask that you redo all of this as two patch series?  A 3-patch
series which is targeted at -stable, followed by a separate two-patch
series which is targeted at 5.17-rc1.  Each series with its own fully
prepared [0/n] cover.

Thanks.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] kdump: vmcore: move copy_to() from vmcore.c to uaccess.h

2021-12-10 Thread Andrew Morton

On Fri, 10 Dec 2021 21:36:00 +0800 Tiezhu Yang  wrote:

> In arch/*/kernel/crash_dump*.c, there exist similar code about
> copy_oldmem_page(), move copy_to() from vmcore.c to uaccess.h,
> and then we can use copy_to() to simplify the related code.
> 
> ...
>
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -238,20 +238,6 @@ copy_oldmem_page_encrypted(unsigned long pfn, char *buf, 
> size_t csize,
>   return copy_oldmem_page(pfn, buf, csize, offset, userbuf);
>  }
>  
> -/*
> - * Copy to either kernel or user space
> - */
> -static int copy_to(void *target, void *src, size_t size, int userbuf)
> -{
> - if (userbuf) {
> - if (copy_to_user((char __user *) target, src, size))
> - return -EFAULT;
> - } else {
> - memcpy(target, src, size);
> - }
> - return 0;
> -}
> -
>  #ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
>  static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int 
> userbuf)
>  {
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index ac03940..4a6c3e4 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -201,6 +201,20 @@ copy_to_user(void __user *to, const void *from, unsigned 
> long n)
>   return n;
>  }
>  
> +/*
> + * Copy to either kernel or user space
> + */
> +static inline int copy_to(void *target, void *src, size_t size, int userbuf)
> +{
> + if (userbuf) {
> + if (copy_to_user((char __user *) target, src, size))
> + return -EFAULT;
> + } else {
> + memcpy(target, src, size);
> + }
> + return 0;
> +}
> +

Ordinarily I'd say "this is too large to be inlined".  But the function
has only a single callsite per architecture so inlining it won't cause
bloat at present.

But hopefully copy_to() will get additional callers in the future, in
which case it shouldn't be inlined.  So I'm thinking it would be best
to start out with this as a regular non-inlined function, in
lib/usercopy.c.

Also, copy_to() is a very poor name for a globally-visible helper
function.  Better would be copy_to_user_or_kernel(), although that's
perhaps a bit long.

And the `userbuf' arg should have type bool, yes?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH RESEND v2 0/5] Avoid requesting page from DMA zone when no managed pages

2021-12-07 Thread Andrew Morton

On Mon, 6 Dec 2021 22:03:59 -0600 John Donnelly  
wrote:

> On 12/6/21 9:16 PM, Baoquan He wrote:
> > Sorry, forgot adding x86 and x86/mm maintainers
> 
> Hi,
> 
>These commits need applied to Linux-5.15.0 (LTS) too since it has the 
> original regression :
> 
>   1d659236fb43 ("dma-pool: scale the default DMA coherent pool
> size with memory capacity")
> 
> Maybe add "Fixes" to the other commits ?

And cc:stable, please.  "Fixes:" doesn't always mean "should be
backported".

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] proc/vmcore: fix clearing user buffer by properly using clear_user()

2021-11-15 Thread Andrew Morton

On Fri, 12 Nov 2021 10:27:50 +0100 David Hildenbrand  wrote:

> To clear a user buffer we cannot simply use memset, we have to use
> clear_user(). With a virtio-mem device that registers a vmcore_cb and has
> some logically unplugged memory inside an added Linux memory block, I can
> easily trigger a BUG by copying the vmcore via "cp":
> 
> ...
>
> Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
> Prevention (SMAP)", which is used to detect wrong access from the kernel to
> user buffers like this: SMAP triggers a permissions violation on wrong
> access. In the x86-64 variant of clear_user(), SMAP is properly
> handled via clac()+stac().
> 
> To fix, properly use clear_user() when we're dealing with a user buffer.
> 

I added cc:stable, OK?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v3 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-08 Thread Andrew Morton

On Tue,  8 Jun 2021 12:13:15 +0300 Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
> configuration options are equivalent.
> 
> Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
> 
> Done with
> 
>   $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
>   $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
>   $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
>   $(git grep -wl NEED_MULTIPLE_NODES)
> 
> with manual tweaks afterwards.
> 
> ...
>
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -987,7 +987,7 @@ extern int movable_zone;
>  #ifdef CONFIG_HIGHMEM
>  static inline int zone_movable_is_highmem(void)
>  {
> -#ifdef CONFIG_NEED_MULTIPLE_NODES
> +#ifdef CONFIG_NUMA
>   return movable_zone == ZONE_HIGHMEM;
>  #else
>   return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;

I dropped this hunk - your "mm/mmzone.h: simplify is_highmem_idx()"
removed zone_movable_is_highmem().  

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] crash_core, vmcoreinfo: Append 'SECTION_SIZE_BITS' to vmcoreinfo

2021-06-08 Thread Andrew Morton

On Tue, 8 Jun 2021 22:24:32 +0800 Baoquan He  wrote:

> On 06/08/21 at 06:33am, Pingfan Liu wrote:
> > As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
> > Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
> > formula:
> > #define SECTIONS_SHIFT(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
> > 
> > Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
> > PAGES_PER_SECTION in makedumpfile just like kernel.
> > 
> > Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
> > recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
> > SECTION_SIZE_BITS"). But user space wants a stable interface to get this
> > info. Such info is impossible to be deduced from a crashdump vmcore.
> > Hence append SECTION_SIZE_BITS to vmcoreinfo.
> 
> ...
>
> Add the discussion of the original thread in kexec ML for reference:
> http://lists.infradead.org/pipermail/kexec/2021-June/022676.html

I added a Link: for this.
 
> This looks good to me.
> 
> Acked-by: Baoquan He 

I'm thinking we should backport this at least to Fixes:f0b13ee23241. 
But perhaps it's simpler to just backport it as far as possible, so I
added a bare cc:stable with no Fixes:.  Thoughts?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections

2021-06-01 Thread Andrew Morton

On Wed, 2 Jun 2021 01:11:07 + HAGIO KAZUHITO(萩尾　一仁)  
wrote:

> -Original Message-
> > On Tue, 1 Jun 2021 10:40:09 +0200 David Hildenbrand  
> > wrote:
> > 
> > > > Thanks, i explained the reason during my last reply.
> > > > Andrew has already picked this patch to -mm tree.
> > >
> > > Just because it's in Andrews tree doesn't mean it will end up upstream. ;)
> > >
> > > Anyhow, no really strong opinion, it's simply unnecessary code churn
> > > that makes bisecting harder without real value IMHO.
> > 
> > I think it's a good change - mem_sections refers to multiple instances
> > of a mem_section.  Churn is a pain, but that's the price we pay for more
> > readable code.  And for having screwed it up originally ;)
> 
> >From a makedumpfile/crash-utility viewpoint, I don't deny kernel improvement
> and probably the change will not be hard for them to support, but I'd like
> you to remember that the tool users will need to update them for the change.
> 
> The situation where we need to update the tools for new kernels is usual, but
> there are not many cases that they cannot even start session, and this change
> will cause it.  Personally I wonder the change is worth forcing users to 
> update
> them.

Didn't know that.  I guess I'll drop it then.

We could do an assembly-level alias I assume..

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH V2 4/6] mm: rename the global section array to mem_sections

2021-06-01 Thread Andrew Morton

On Tue, 1 Jun 2021 10:40:09 +0200 David Hildenbrand  wrote:

> > Thanks, i explained the reason during my last reply.
> > Andrew has already picked this patch to -mm tree.
> 
> Just because it's in Andrews tree doesn't mean it will end up upstream. ;)
> 
> Anyhow, no really strong opinion, it's simply unnecessary code churn 
> that makes bisecting harder without real value IMHO.

I think it's a good change - mem_sections refers to multiple instances
of a mem_section.  Churn is a pain, but that's the price we pay for more
readable code.  And for having screwed it up originally ;)


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-09 Thread Andrew Morton

On Wed, 7 Apr 2021 11:46:37 +0300 Andy Shevchenko  
wrote:

> On Wed, Apr 7, 2021 at 11:17 AM Kees Cook  wrote:
> >
> > On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> > > kernel.h is being used as a dump for all kinds of stuff for a long time.
> > > Here is the attempt to start cleaning it up by splitting out panic and
> > > oops helpers.
> > >
> > > At the same time convert users in header and lib folder to use new header.
> > > Though for time being include new header back to kernel.h to avoid twisted
> > > indirected includes for existing users.
> > >
> > > Signed-off-by: Andy Shevchenko 
> >
> > I like it! Do you have a multi-arch CI to do allmodconfig builds to
> > double-check this?
> 
> Unfortunately no, I rely on plenty of bots that are harvesting mailing lists.
> 
> But I will appreciate it if somebody can run this through various build tests.
> 

um, did you try x86_64 allmodconfig?

I'm up to
kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix-fix.patch
and counting.

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix

more files need panic_notifier.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 arch/x86/xen/enlighten.c|1 +
 drivers/video/fbdev/hyperv_fb.c |1 +
 2 files changed, 2 insertions(+)

--- a/arch/x86/xen/enlighten.c~kernelh-split-out-panic-and-oops-helpers-fix
+++ a/arch/x86/xen/enlighten.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
--- 
a/drivers/video/fbdev/hyperv_fb.c~kernelh-split-out-panic-and-oops-helpers-fix
+++ a/drivers/video/fbdev/hyperv_fb.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
_


From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix

arch/x86/purgatory/purgatory.c needs kernel.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 arch/x86/purgatory/purgatory.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/arch/x86/purgatory/purgatory.c~kernelh-split-out-panic-and-oops-helpers-fix-fix
+++ a/arch/x86/purgatory/purgatory.c
@@ -8,6 +8,7 @@
  *   Vivek Goyal 
  */
 
+#include 
 #include 
 #include 
 #include 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix

drivers/clk/analogbits/wrpll-cln28hpc.c needs minmax.h, math.h and limits.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/clk/analogbits/wrpll-cln28hpc.c |4 
 1 file changed, 4 insertions(+)

--- 
a/drivers/clk/analogbits/wrpll-cln28hpc.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix
+++ a/drivers/clk/analogbits/wrpll-cln28hpc.c
@@ -25,6 +25,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+
 #include 
 
 /* MIN_INPUT_FREQ: minimum input clock frequency, in Hz (Fref_min) */
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix

drivers/misc/pvpanic/pvpanic.c needs panic_notifier.h

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/misc/pvpanic/pvpanic.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/drivers/misc/pvpanic/pvpanic.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix
+++ a/drivers/misc/pvpanic/pvpanic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
_
From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix

fix drivers/misc/pvpanic/pvpanic.c and drivers/net/ipa/ipa_smp2p.c

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/net/ipa/ipa_smp2p.c |1 +
 1 file changed, 1 insertion(+)

--- 
a/drivers/net/ipa/ipa_smp2p.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix
+++ a/drivers/net/ipa/ipa_smp2p.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix

fix drivers/power/reset/ltc2952-poweroff.c and drivers/misc/bcm-vk/bcm_vk_dev.c

Cc: Andy Shevchenko 
Signed-off-by: Andrew Morton 
---

 drivers/misc/bcm-vk/bcm_vk_dev.c   |1 +
 drivers/power/reset/ltc2952-poweroff.c |1 +
 2 files changed, 2 insertions(+)

--- 
a/drivers/power/reset/ltc2952-poweroff.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix
+++ a/drivers/power/reset/ltc2952-poweroff.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
--- 
a/drivers/misc/bcm-vk/bcm_vk_dev.c~kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix
+++ a/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
_

From: Andrew Morton 
Subject: kernelh-split-out-panic-and-oops-helpers-fix-fix-fix-fix-fix-fix-fix

fix drivers/leds/trigger/ledtrig-panic.c and drivers/firmware/google/gsmi.c

Cc: Andy Sh

Re: [EXTERNAL] Re: [PATCH] kexec: Add kexec reboot string

2021-03-11 Thread Andrew Morton

On Thu, 11 Mar 2021 18:14:19 + Joe LeVeque  wrote:

> Is this all your looking for? If not, please let me know.
> 
> > Signed-off-by: Joe LeVeque 

Yes, thanks.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: Add kexec reboot string

2021-03-10 Thread Andrew Morton

On Thu,  4 Mar 2021 13:46:26 +0100 Paul Menzel  wrote:

> From: Joe LeVeque 
> 
> The purpose is to notify the kernel module for fast reboot.
> 
> Upstream a patch from the SONiC network operating system [1].
> 
> [1]: https://github.com/Azure/sonic-linux-kernel/pull/46
> 
> Signed-off-by: Paul Menzel 

We should have Joe's signed-off-by: for this.  Joe, can you please send
it?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] Only allow to set crash_kexec_post_notifiers on boot time

2020-09-18 Thread Andrew Morton

On Fri, 18 Sep 2020 11:25:46 +0800 Dave Young  wrote:

> crash_kexec_post_notifiers enables running various panic notifier
> before kdump kernel booting. This increases risks of kdump failure.
> It is well documented in kernel-parameters.txt. We do not suggest
> people to enable it together with kdump unless he/she is really sure.
> This is also not suggested to be enabled by default when users are
> not aware in distributions.
> 
> But unfortunately it is enabled by default in systemd, see below
> discussions in a systemd report, we can not convince systemd to change
> it:
> https://github.com/systemd/systemd/issues/16661
> 
> Actually we have got reports about kdump kernel hangs in both s390x
> and powerpcle cases caused by the systemd change,  also some x86 cases
> could also be caused by the same (although that is in Hyper-V code
> instead of systemd, that need to be addressed separately).
> 
> Thus to avoid the auto enablement here just disable the param writable
> permission in sysfs.
> 

Well.  I don't think this is at all a desirable way of resolving a
disagreement with the systemd developers

At the above github address I'm seeing "ryncsn added a commit to
ryncsn/systemd that referenced this issue 9 days ago", "pstore: don't
enable crash_kexec_post_notifiers by default".  So didn't that address
the issue?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2][RFC] kdump: append kernel build-id string to VMCOREINFO

2020-06-17 Thread Andrew Morton

On Wed, 10 Jun 2020 21:27:52 -0700 Vijay Balakrishna 
 wrote:

> Make kernel GNU build-id available in VMCOREINFO.  Having
> build-id in VMCOREINFO facilitates presenting appropriate kernel
> namelist image with debug information file to kernel crash dump
> analysis tools.  Currently VMCOREINFO lacks uniquely identifiable
> key for crash analysis automation.
> 
> Regarding if this patch is necessary or matching of linux_banner
> and OSRELEASE in VMCOREINFO employed by crash(8) meets the
> need -- IMO, build-id approach more foolproof, in most instances it
> is a cryptographic hash generated using internal code/ELF bits unlike
> kernel version string upon which linux_banner is based that is
> external to the code.  I feel each is intended for a different purpose.
> Also OSRELEASE is not suitable when two different kernel builds
> from same version with different features enabled.
> 
> Currently for most linux (and non-linux) systems build-id can be
> extracted using standard methods for file types such as user mode crash
> dumps, shared libraries, loadable kernel modules etc.,  This is an
> exception for linux kernel dump.  Having build-id in VMCOREINFO brings
> some uniformity for automation tools.
> 
> ...
>
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -11,6 +11,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  /* vmcoreinfo stuff */
>  unsigned char *vmcoreinfo_data;
>  size_t vmcoreinfo_size;
> @@ -376,6 +378,53 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>  
> +#define NOTES_SIZE (&__stop_notes - &__start_notes)
> +#define BUILD_ID_MAX SHA1_DIGEST_SIZE
> +#define NT_GNU_BUILD_ID 3
> +
> +struct elf_note_section {
> + struct elf_note n_hdr;
> + u8 n_data[];
> +};
> +
> +/*
> + * Add build ID from .notes section as generated by the GNU ld(1)
> + * or LLVM lld(1) --build-id option.
> + */
> +static void add_build_id_vmcoreinfo(void)
> +{
> + char build_id[BUILD_ID_MAX * 2 + 1];
> + int n_remain = NOTES_SIZE;
> +
> + while (n_remain >= sizeof(struct elf_note)) {
> + const struct elf_note_section *note_sec =
> + &__start_notes + NOTES_SIZE - n_remain;
> + const u32 n_namesz = note_sec->n_hdr.n_namesz;
> +
> + if (note_sec->n_hdr.n_type == NT_GNU_BUILD_ID &&
> + n_namesz != 0 &&
> + !strcmp((char *)_sec->n_data[0], "GNU")) {

Is it guaranteed that n_data[] is null-terminated?

> + if (note_sec->n_hdr.n_descsz <= BUILD_ID_MAX) {
> + const u32 n_descsz = note_sec->n_hdr.n_descsz;
> + const u8 *s = _sec->n_data[n_namesz];
> +
> + s = PTR_ALIGN(s, 4);
> + bin2hex(build_id, s, n_descsz);
> + build_id[2 * n_descsz] = '\0';
> + VMCOREINFO_BUILD_ID(build_id);
> + return;
> + }
> + pr_warn("Build ID is too large to include in 
> vmcoreinfo: %u > %u\n",
> + note_sec->n_hdr.n_descsz,
> + BUILD_ID_MAX);
> + return;
> + }
> + n_remain -= sizeof(struct elf_note) +
> + ALIGN(note_sec->n_hdr.n_namesz, 4) +
> + ALIGN(note_sec->n_hdr.n_descsz, 4);
> + }
> +}
> +
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
>   vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] kexec: Do not verify the signature without the lockdown or mandatory signature

2020-06-17 Thread Andrew Morton

On Tue,  2 Jun 2020 12:59:52 +0800 Lianbo Jiang  wrote:

> Signature verification is an important security feature, to protect
> system from being attacked with a kernel of unknown origin. Kexec
> rebooting is a way to replace the running kernel, hence need be
> secured carefully.

I'm finding this changelog quite hard to understand,

> In the current code of handling signature verification of kexec kernel,
> the logic is very twisted. It mixes signature verification, IMA signature
> appraising and kexec lockdown.
> 
> If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
> signature, the supported crypto, and key, we don't think this is wrong,

I think this is saying that in the absence of KEXEC_SIG_FORCE and if
the signature/crypto/key are all incorrect, the kexec still succeeds,
but it should not.

> Unless kexec lockdown is executed. IMA is considered as another kind of
> signature appraising method.
> 
> If kexec kernel image has signature/crypto/key, it has to go through the
> signature verification and pass. Otherwise it's seen as verification
> failure, and won't be loaded.

I don't know if this is describing the current situation or the
post-patch situation.

> Seems kexec kernel image with an unqualified signature is even worse than
> those w/o signature at all, this sounds very unreasonable. E.g. If people
> get a unsigned kernel to load, or a kernel signed with expired key, which
> one is more dangerous?
> 
> So, here, let's simplify the logic to improve code readability. If the
> KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature verification
> is mandated. Otherwise, we lift the bar for any kernel image.

I think the whole thing needs a rewrite.  Start out by fully describing
the current situation.  THen describe what is wrong with it, and why. 
Then describe the proposed change.  Or something along these lines.

The changelog should also make clear the end-user impact of the patch. 
In sufficient detail for others to decide which kernel version(s)
should be patched.  Your recommendations will also be valuable - which
kernel version(s) do you think should be patched, and why?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 2/3] mm/memory_hotplug: Allow arch override of non boot memory resource names

2020-05-08 Thread Andrew Morton

On Thu, 26 Mar 2020 18:07:29 + James Morse  wrote:

> Memory added to the system by hotplug has a 'System RAM' resource created
> for it. This is exposed to user-space via /proc/iomem.
> 
> This poses problems for kexec on arm64. If kexec decides to place the
> kernel in one of these newly onlined regions, the new kernel will find
> itself booting from a region not described as memory in the firmware
> tables.
> 
> Arm64 doesn't have a structure like the e820 memory map that can be
> re-written when memory is brought online. Instead arm64 uses the UEFI
> memory map, or the memory node from the DT, sometimes both. We never
> rewrite these.
> 
> Allow an architecture to specify a different name for these hotplug
> regions.
> 
> ...
>
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,10 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +#ifndef MEMORY_HOTPLUG_RES_NAME
> +#define MEMORY_HOTPLUG_RES_NAME "System RAM"
> +#endif
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -103,7 +107,7 @@ static struct resource *register_memory_resource(u64 
> start, u64 size)
>  {
>   struct resource *res;
>   unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> - char *resource_name = "System RAM";
> + char *resource_name = MEMORY_HOTPLUG_RES_NAME;
>  
>   if (start + size > max_mem_size)
>   return ERR_PTR(-E2BIG);

I suppose we should do this as well:

--- 
a/mm/memory_hotplug.c~mm-memory_hotplug-allow-arch-override-of-non-boot-memory-resource-names-fix
+++ a/mm/memory_hotplug.c
@@ -129,7 +129,8 @@ static struct resource *register_memory_
   resource_name, flags);
 
if (!res) {
-   pr_debug("Unable to reserve System RAM region: 
%016llx->%016llx\n",
+   pr_debug("Unable to reserve " MEMORY_HOTPLUG_RES_NAME
+   " region: %016llx->%016llx\n",
start, start + size);
return ERR_PTR(-EEXIST);
}

It assumes that MEMORY_HOTPLUG_RES_NAME will be a literal string, which
is the case in [3/3].


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem

2020-05-08 Thread Andrew Morton

On Fri,  8 May 2020 10:42:14 +0200 David Hildenbrand  wrote:

> Assume we have kmem configured and loaded:
>   [root@localhost ~]# cat /proc/iomem
>   ...
>   14000-33fff : Persistent Memory$
> 14000-1481f : namespace0.0
> 15000-33fff : dax0.0
>   15000-33fff : System RAM
> 
> Assume we try to unload kmem. This force-unloading will work, even if
> memory cannot get removed from the system.
>   [root@localhost ~]# rmmod kmem
>   [   86.380228] removing memory fails, because memory 
> [0x00015000-0x000157ff] is onlined
>   ...
>   [   86.431225] kmem dax0.0: DAX region [mem 0x15000-0x33fff] cannot 
> be hotremoved until the next reboot
> 
> Now, we can reconfigure the namespace:
>   [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 
> --mode=devdax
>   [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 
> 0x14000-0x33fff]dax
>   [  131.410147] nd_pmem: probe of namespace0.0 failed with error 
> -16namespace0.0 --mode=devdax
>   ...
> 
> This fails as expected due to the busy memory resource, and the memory
> cannot be used. However, the dax0.0 device is removed, and along its name.
> 
> The name of the memory resource now points at freed memory (name of the
> device).
>   [root@localhost ~]# cat /proc/iomem
>   ...
>   14000-33fff : Persistent Memory
> 14000-1481f : namespace0.0
> 15000-33fff : �_�^7_��/_��wR��WQ���^��� ...
> 15000-33fff : System RAM
> 
> We have to make sure to duplicate the string. While at it, remove the
> superfluous setting of the name and fixup a stale comment.
> 
> Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used 
> like normal RAM")
> Cc: sta...@vger.kernel.org # v5.3

hm.

Is this really -stable material?  These are all privileged operations,
I expect?

Assuming "yes", I've queued this separately, staged for 5.7-rcX.  I'll
redo patches 2-4 as a three-patch series for 5.8-rc1.

> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Pavel Tatashin 

Reviewers, please ;)



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-13 Thread Andrew Morton

On Mon, 13 Apr 2020 08:15:23 -0500 ebied...@xmission.com (Eric W. Biederman) 
wrote:

> > For 3), people can still use kexec_load and develop/fix for it, if no
> > kexec_file_load supported. But 32-bit arm should be a different one,
> > more like i386, we will leave it as is, and fix anything which could
> > break it. But people really expects to improve or add feature to it? E.g
> > in this patchset, the mem hotplug issue James raised, I assume James is
> > focusing on arm64, x86_64, but not 32-bit arm. As DavidH commented in
> > another reply, people even don't agree to continue supporting memory
> > hotplug on 32-bit system. We ever took effort to fix a memory hotplug
> > bug on i386 with a patch, but people would rather set it as BROKEN.
> 
> For memory hotplug just reload.  Userspace already gets good events.
> 
> We should not expect anything except a panic kernel to be loaded over a
> memory hotplug event. The kexec on panic code should actually be loaded
> in a location that we don't reliquish if asked for it.

Is that a nack for James's patchset?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

2020-04-10 Thread Andrew Morton

It's unclear (to me) what is the status of this patchset.  But it does appear 
that
an new version can be expected?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCHv2] mm/sparse: reset section's mem_map when fully deactivated

2020-01-23 Thread Andrew Morton

On Mon, 20 Jan 2020 08:29:39 +0100 Michal Hocko  wrote:

> On Mon 20-01-20 10:33:14, Pingfan Liu wrote:
> > After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
> > when a mem section is fully deactivated, section_mem_map still records the
> > section's start pfn, which is not used any more and will be reassigned
> > during re-added.
> > 
> > In analogy with alloc/free pattern, it is better to clear all fields of
> > section_mem_map.
> > 
> > Beside this, it breaks the user space tool "makedumpfile" [1], which makes
> > assumption that a hot-removed section has mem_map as NULL, instead of
> > checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile will be
> > better to change the assumption, and need a patch)
> > 
> > The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" ,
> > trigger a crash, and save vmcore by makedumpfile
> 
> While makedumpfile lives very closely to the kernel and occasional
> breakage is to be expected I still believe that Fixes: ba72b4c8cf60
> is due.

But not a cc:stable?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 0/8] mm/kdump: allow to exclude pages that are logically offline

2019-02-28 Thread Andrew Morton

On Wed, 27 Feb 2019 13:32:14 +0800 Dave Young  wrote:

> This series have been in -next for some days, could we get this in
> mainline? 

It's been in -next for two months?

> Andrew, do you have plan about them, maybe next release?

They're all reviewed except for "xen/balloon: mark inflated pages
PG_offline". 
(https://ozlabs.org/~akpm/mmotm/broken-out/xen-balloon-mark-inflated-pages-pg_offline.patch).
Yes, I plan on sending these to Linus during the merge window for 5.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-23 Thread Andrew Morton

On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He  wrote:

> Hi Andrew,
> 
> On 07/18/18 at 03:33pm, Andrew Morton wrote:
> > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He  wrote:
> > 
> > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > > top to down. This is what we have been doing all along in the old kexec
> > > loading interface and the kexec loading is still default setting in some
> > > distributions. However, the current kexec_file loading interface doesn't
> > > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > > all resources of System RAM from bottom to up, to try to find memory 
> > > region
> > > which can contain the specific kexec buffer, then call 
> > > locate_mem_hole_callback()
> > > to allocate memory in that found memory region from top to down. This 
> > > brings
> > > confusion especially when KASLR is widely supported , users have to make 
> > > clear
> > > why kexec/kdump kernel loading position is different between these two
> > > interfaces in order to exclude unnecessary noises. Hence these two 
> > > interfaces
> > > need be unified on behaviour.
> > 
> > As far as I can tell, the above is the whole reason for the patchset,
> > yes?  To avoid confusing users.
> 
> 
> In fact, it's not just trying to avoid confusing users. Kexec loading
> and kexec_file loading are just do the same thing in essence. Just we
> need do kernel image verification on uefi system, have to port kexec
> loading code to kernel. 
> 
> Kexec has been a formal feature in our distro, and customers owning
> those kind of very large machine can make use of this feature to speed
> up the reboot process. On uefi machine, the kexec_file loading will
> search place to put kernel under 4G from top to down. As we know, the
> 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> it. It may have possibility to not be able to find a usable space for
> kernel/initrd. From the top down of the whole memory space, we don't
> have this worry. 
> 
> And at the first post, I just posted below with AKASHI's
> walk_system_ram_res_rev() version. Later you suggested to use
> list_head to link child sibling of resource, see what the code change
> looks like.
> http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com
> 
> Then I posted v2
> http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
> Rob Herring mentioned that other components which has this tree struct
> have planned to do the same thing, replacing the singly linked list with
> list_head to link resource child sibling. Just quote Rob's words as
> below. I think this could be another reason.
> 
> ~ From Rob
> The DT struct device_node also has the same tree structure with
> parent, child, sibling pointers and converting to list_head had been
> on the todo list for a while. ACPI also has some tree walking
> functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> common tree struct and helpers defined either on top of list_head or a
> ~
> new struct if that saves some size.

Please let's get all this into the changelogs?

> > 
> > Is that sufficient?  Can we instead simplify their lives by providing
> > better documentation or informative printks or better Kconfig text,
> > etc?
> > 
> > And who *are* the people who are performing this configuration?  Random
> > system administrators?  Linux distro engineers?  If the latter then
> > they presumably aren't easily confused!
> 
> Kexec was invented for kernel developer to speed up their kernel
> rebooting. Now high end sever admin, kernel developer and QE are also
> keen to use it to reboot large box for faster feature testing, bug
> debugging. Kernel dev could know this well, about kernel loading
> position, admin or QE might not be aware of it very well. 
> 
> > 
> > In other words, I'm trying to understand how much benefit this patchset
> > will provide to our users as a whole.
> 
> Understood. The list_head replacing patch truly involes too many code
> changes, it's risky. I am willing to try any idea from reviewers, won't
> persuit they have to be accepted finally. If don't have a try, we don't
> know what it looks like, and what impact it may have. I am fine to take
> AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
> though it could be a little bit low efficient.

The larger patch produces a better result.  We can handle it ;)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-23 Thread Andrew Morton

On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He  wrote:

> For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> is used to load kernel/initrd/purgatory is supposed to be allocated from
> top to down. This is what we have been doing all along in the old kexec
> loading interface and the kexec loading is still default setting in some
> distributions. However, the current kexec_file loading interface doesn't
> do like this. The function arch_kexec_walk_mem() it calls ignores checking
> kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> all resources of System RAM from bottom to up, to try to find memory region
> which can contain the specific kexec buffer, then call 
> locate_mem_hole_callback()
> to allocate memory in that found memory region from top to down. This brings
> confusion especially when KASLR is widely supported , users have to make clear
> why kexec/kdump kernel loading position is different between these two
> interfaces in order to exclude unnecessary noises. Hence these two interfaces
> need be unified on behaviour.

As far as I can tell, the above is the whole reason for the patchset,
yes?  To avoid confusing users.

Is that sufficient?  Can we instead simplify their lives by providing
better documentation or informative printks or better Kconfig text,
etc?

And who *are* the people who are performing this configuration?  Random
system administrators?  Linux distro engineers?  If the latter then
they presumably aren't easily confused!

In other words, I'm trying to understand how much benefit this patchset
will provide to our users as a whole.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-21 Thread Andrew Morton

On Mon, 21 May 2018 10:53:37 +0800 Dave Young  wrote:

> This is a rework of the crashkernel=auto patches back to 2009 although
> I'm not sure if below is the last version of the old effort:
> https://lkml.org/lkml/2009/8/12/61
> https://lwn.net/Articles/345344/
> 
> I changed the original design, instead of adding the auto reserve logic
> in code, in this patch just introduce two kernel config options for
> the default crashkernel value in MB and the threshold of system memory
> in MB so that only reserve default when system memory is equal or
> above the threshold.
> 
> With the kernel configs distributions can easily change the default
> values so that people do not need to manually set kernel cmdline
> for common use cases and one can still overwrite the default value
> with manual setup or disable it by using crashkernel=0
> 
> Signed-off-by: Dave Young 
> ---
> Another difference is with original design the crashkernel size scales
> with system memory, according to test, large machine may need more
> memory in kdump kernel because of several factors:
> 1. cpu numbers, because of the percpu memory allocated for cpus.
>(kdump can use nr_cpus=1 to workaround this, but some
> arches do not support nr_cpus=X for example powerpc) 
> 2. IO devices, large system can have a lot of io devices, although we
>can try to only add those device drivers we needed, it is still a
>problem because of some built-in drivers, some stacked logical devices
>eg. device mapper devices, acpi etc.  Even if only considering the
>meta data for driver model it will still be a big number eg. sysfs
>files etc.
> 3. The minimum memory requirement for some device drivers are big, even
>if some of them have implemented low meory profile.  It is usual to see
>10M memory use for a storage driver.
> 4. user space initramfs size growing.  Busybox is not usable if we need
>to add udev support and some complicate storage support.  Use dracut
>with systemd, especially networking stuff need more memory.
> 
> So probably add another kernel config option to scale the memory size
> eg.  CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have,  in RHEL we
> use base_value + system_mem >> (2^14) for x86.  I'm still hesatating
> how to describe and add this option. Any suggestions will be appreciated.
> 
> ...
>
> --- linux-x86.orig/arch/Kconfig
> +++ linux-x86/arch/Kconfig
> @@ -10,6 +10,22 @@ config KEXEC_CORE
>   select CRASH_CORE
>   bool
>  
> +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> + int "System memory size threshold for kdump memory default reserving"
> + depends on CRASH_CORE
> + default 0
> + help
> +   CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> +   the system memory size is equal or bigger than the threshold.

"the threshold" is rather vague.  Can it be clarified?

In fact I'm really struggling to understand the logic here


> +config CRASHKERNEL_DEFAULT_MB
> + int "Default crashkernel memory size reserved for kdump"
> + depends on CRASH_CORE
> + default 0
> + help
> +   This is used as the default kdump reserved memory size in MB.
> +   crashkernel=X kernel cmdline can overwrite this value.
> +
>  config HAVE_IMA_KEXEC
>   bool
>  
> @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
>   return 0;
>  }
>  
> +static int __init get_crashkernel_default(unsigned long long system_ram,
> +   unsigned long long *size)
> +{
> + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> +
> + thres *= SZ_1M;
> + sz *= SZ_1M;
> +
> + if (sz >= system_ram || system_ram < thres) {
> + pr_debug("crashkernel default size can not be used.\n");
> + return -EINVAL;

In other words,

if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
fail;

yes?

How come?  What's happening here?  Perhaps a (good) explanatory comment
is needed.  And clearer Kconfig text.

All confused :(

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] resource: add walk_system_ram_res_rev()

2018-03-23 Thread Andrew Morton

On Fri, 23 Mar 2018 11:10:13 +0800 Baoquan He <b...@redhat.com> wrote:

> On 03/22/18 at 07:06pm, Andrew Morton wrote:
> > On Fri, 23 Mar 2018 08:58:45 +0800 Baoquan He <b...@redhat.com> wrote:
> > 
> > > > erk, this is pretty nasty.  Isn't there a better way :(
> > > 
> > > Yes, this is not efficient.
> > > 
> > > In struct resource{}, ->sibling list is a singly linked list. I ever
> > > thought about changing it to doubly linked list, yet not very sure if
> > > it will have effect since struct resource is a core data structure.
> > 
> > Switching to a list_head sounds OK.  The only issue really is memory
> > consumption and surely we don't have tens of thousands of struct
> > resources floating about(?).  Or if we do have a lot, the machine is
> > presumably huge (hope?).
> 
> Yes. It doubles the memory consumption.
> 
> AFAIK, the biggest number of resrouces I heard of possibly is mentioned
> in this user space kexec_tools commit. In this commit, Xunlei told on
> SGI system with 64TB RAM, the array which we have been using to store
> "System RAM"|"Reserved"|"ACPI **" regions is not big enough. In that
> case, we need extra 8Byte*2048=16KB at most. With my understanding, this
> increase is system wide, since each resource instance only needs its own
> list_head member, right?

Yes.  That sounds perfectly acceptable.

It would be interesting to see what this approach looks like, if you
have time to toss something together?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] resource: add walk_system_ram_res_rev()

2018-03-22 Thread Andrew Morton

On Fri, 23 Mar 2018 08:58:45 +0800 Baoquan He  wrote:

> > erk, this is pretty nasty.  Isn't there a better way :(
> 
> Yes, this is not efficient.
> 
> In struct resource{}, ->sibling list is a singly linked list. I ever
> thought about changing it to doubly linked list, yet not very sure if
> it will have effect since struct resource is a core data structure.

Switching to a list_head sounds OK.  The only issue really is memory
consumption and surely we don't have tens of thousands of struct
resources floating about(?).  Or if we do have a lot, the machine is
presumably huge (hope?).

> AKASHI's method is more acceptable, and currently only kexec has this
> requirement.

What method is that?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 0/2] Kexec_file: Load kernel at top of system ram

2018-03-22 Thread Andrew Morton

On Thu, 22 Mar 2018 11:37:20 +0800 Baoquan He  wrote:

> The current kexec_file ignores kexec_buf.top_down value when call
> arch_kexec_walk_mem() to allocate memory for loading kernel/initrd
> stuffs. This is not supposed to be what kexec_buf.top_down is used
> for.

OK, but why is this a problem?  When fixing a bug, please fully
describe the user-visible effects of that bug.  That helps others
understand the importance of the fix, whether it should be backported
into various kernels, etc.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] resource: add walk_system_ram_res_rev()

2018-03-22 Thread Andrew Morton

On Thu, 22 Mar 2018 11:37:21 +0800 Baoquan He  wrote:

> From: AKASHI Takahiro 
> 
> This function, being a variant of walk_system_ram_res() introduced in
> commit 8c86e70acead ("resource: provide new functions to walk through
> resources"), walks through a list of all the resources of System RAM
> in reversed order, i.e., from higher to lower.
> 
> It will be used in kexec_file code.
> 
> ...
>
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -23,6 +23,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  
>  
> @@ -470,6 +472,67 @@ int walk_system_ram_res(u64 start, u64 end, void *arg,
>  }
>  
>  /*
> + * This function, being a variant of walk_system_ram_res(), calls the @func
> + * callback against all memory ranges of type System RAM which are marked as
> + * IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY in reversed order, i.e., from
> + * higher to lower.
> + */

This should document the return value, as should walk_system_ram_res().
Why does it return -1 on error rather than an errno (ENOMEM)?

> +int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
> + int (*func)(struct resource *, void *))
> +{
> + struct resource res, *rams;
> + int rams_size = 16, i;
> + int ret = -1;
> +
> + /* create a list */
> + rams = vmalloc(sizeof(struct resource) * rams_size);
> + if (!rams)
> + return ret;
> +
> + res.start = start;
> + res.end = end;
> + res.flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + i = 0;
> + while ((res.start < res.end) &&
> + (!find_next_iomem_res(, IORES_DESC_NONE, true))) {
> + if (i >= rams_size) {
> + /* re-alloc */
> + struct resource *rams_new;
> + int rams_new_size;
> +
> + rams_new_size = rams_size + 16;
> + rams_new = vmalloc(sizeof(struct resource)
> + * rams_new_size);
> + if (!rams_new)
> + goto out;
> +
> + memcpy(rams_new, rams,
> + sizeof(struct resource) * rams_size);
> + vfree(rams);
> + rams = rams_new;
> + rams_size = rams_new_size;
> + }
> +
> + rams[i].start = res.start;
> + rams[i++].end = res.end;
> +
> + res.start = res.end + 1;
> + res.end = end;
> + }
> +
> + /* go reverse */
> + for (i--; i >= 0; i--) {
> + ret = (*func)([i], arg);
> + if (ret)
> + break;
> + }

erk, this is pretty nasty.  Isn't there a better way :(

> +out:
> + vfree(rams);
> + return ret;
> +}


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 00/11] kexec_file: Clean up purgatory load

2018-03-21 Thread Andrew Morton

On Wed, 21 Mar 2018 12:27:40 +0100 Philipp Rudo  
wrote:

> here is the updated patch set including Dave's comments.

There were some syntactic clashes with the "kexec_file, x86, powerpc:
refactoring for other architecutres" series.  Please check (in
linux-next) that I got it all correct and that both patch series are
functioning correctly.

I expect to send all this into Stephen for -next tomorrow.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATHC v2 0/9] ima: carry the measurement list across kexec

2016-08-31 Thread Andrew Morton

On Tue, 30 Aug 2016 18:40:02 -0400 Mimi Zohar  wrote:

> The TPM PCRs are only reset on a hard reboot.  In order to validate a
> TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
> of the running kernel must be saved and then restored on the subsequent
> boot, possibly of a different architecture.
> 
> The existing securityfs binary_runtime_measurements file conveniently
> provides a serialized format of the IMA measurement list. This patch
> set serializes the measurement list in this format and restores it.
> 
> Up to now, the binary_runtime_measurements was defined as architecture
> native format.  The assumption being that userspace could and would
> handle any architecture conversions.  With the ability of carrying the
> measurement list across kexec, possibly from one architecture to a
> different one, the per boot architecture information is lost and with it
> the ability of recalculating the template digest hash.  To resolve this
> problem, without breaking the existing ABI, this patch set introduces
> the boot command line option "ima_canonical_fmt", which is arbitrarily
> defined as little endian.
> 
> The need for this boot command line option will be limited to the
> existing version 1 format of the binary_runtime_measurements.
> Subsequent formats will be defined as canonical format (eg. TPM 2.0
> support for larger digests).
> 
> This patch set pre-req's Thiago Bauermann's "kexec_file: Add buffer
> hand-over for the next kernel" patch set. 
> 
> These patches can also be found in the next-kexec-restore branch of:
> git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git

I'll merge these into -mm to get some linux-next exposure.  I don't
know what your upstream merge planes will be?


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v3 0/5] kexec_file: Add buffer hand-over for the next kernel

2016-08-25 Thread Andrew Morton

On Thu, 25 Aug 2016 15:18:26 -0300 Thiago Jung Bauermann 
 wrote:

> Hello,
> 
> This patch series implements a mechanism which allows the kernel to pass
> on a buffer to the kernel that will be kexec'd. This buffer is passed
> as a segment which is added to the kimage when it is being prepared
> by kexec_file_load.
> 
> How the second kernel is informed of this buffer is architecture-specific.
> On powerpc, this is done via the device tree, by checking
> the properties /chosen/linux,kexec-handover-buffer-start and
> /chosen/linux,kexec-handover-buffer-end, which is analogous to how the
> kernel finds the initrd.
> 
> This is needed because the Integrity Measurement Architecture subsystem
> needs to preserve its measurement list accross the kexec reboot. The
> following patch series for the IMA subsystem uses this feature for that
> purpose:
> 
> https://lists.infradead.org/pipermail/kexec/2016-August/016745.html
> 
> This is so that IMA can implement trusted boot support on the OpenPower
> platform, because on such systems an intermediary Linux instance running
> as part of the firmware is used to boot the target operating system via
> kexec. Using this mechanism, IMA on this intermediary instance can
> hand over to the target OS the measurements of the components that were
> used to boot it.
> 
> Because there could be additional measurement events between the
> kexec_file_load call and the actual reboot, IMA needs a way to update the
> buffer with those additional events before rebooting. One can minimize
> the interval between the kexec_file_load and the reboot syscalls, but as
> small as it can be, there is always the possibility that the measurement
> list will be out of date at the time of reboot.
> 
> To address this issue, this patch series also introduces
> kexec_update_segment, which allows a reboot notifier to change the
> contents of the image segment during the reboot process.
> 
> The last patch is not intended to be merged, it just demonstrates how
> this feature can be used.
> 
> This series applies on top of v6 of the "kexec_file_load implementation
> for PowerPC" patch series (which applies on top of v4.8-rc1):
> 
> https://lists.infradead.org/pipermail/kexec/2016-August/016960.html

I grabbed these two patch series.  I also merged the "IMA:
Demonstration code for kexec buffer passing." demonstration patch just
to get things a bit of testing.  I assume that once the "ima: carry the
measurement list across kexec" series has stabilised, I should drop the
demo patch and also grab those?  If so, pelase start cc'ing me.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2 4/6] kexec_file: Add mechanism to update kexec segments.

2016-08-16 Thread Andrew Morton

On Sat, 13 Aug 2016 00:18:23 -0300 Thiago Jung Bauermann 
 wrote:

> kexec_update_segment allows a given segment in kexec_image to have
> its contents updated. This is useful if the current kernel wants to
> send information to the next kernel that is up-to-date at the time of
> reboot.
> 
> ...
>
> @@ -721,6 +721,105 @@ static struct page *kimage_alloc_page(struct kimage 
> *image,
>   return page;
>  }
>  
> +/**
> + * kexec_update_segment - update the contents of a kimage segment
> + * @buffer:  New contents of the segment.
> + * @bufsz:   @buffer size.
> + * @load_addr:   Segment's physical address in the next kernel.
> + * @memsz:   Segment size.
> + *
> + * This function assumes kexec_mutex is held.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int kexec_update_segment(const char *buffer, unsigned long bufsz,
> +  unsigned long load_addr, unsigned long memsz)
> +{
> + int i;
> + unsigned long entry;
> + unsigned long *ptr = NULL;
> + void *dest = NULL;
> +
> + if (kexec_image == NULL) {
> + pr_err("Can't update segment: no kexec image loaded.\n");
> + return -EINVAL;
> + }
> +
> + /*
> +  * kexec_add_buffer rounds up segment sizes to PAGE_SIZE, so
> +  * we have to do it here as well.
> +  */
> + memsz = ALIGN(memsz, PAGE_SIZE);
> +
> + for (i = 0; i < kexec_image->nr_segments; i++)
> + /* We only support updating whole segments. */
> + if (load_addr == kexec_image->segment[i].mem &&
> + memsz == kexec_image->segment[i].memsz) {
> + if (kexec_image->segment[i].do_checksum) {
> + pr_err("Trying to update non-modifiable 
> segment.\n");
> + return -EINVAL;
> + }
> +
> + break;
> + }
> + if (i == kexec_image->nr_segments) {
> + pr_err("Couldn't find segment to update: 0x%lx, size 0x%lx\n",
> +load_addr, memsz);
> + return -EINVAL;
> + }
> +
> + for (entry = kexec_image->head; !(entry & IND_DONE) && memsz;
> +  entry = *ptr++) {
> + void *addr = (void *) (entry & PAGE_MASK);
> +
> + switch (entry & IND_FLAGS) {
> + case IND_DESTINATION:
> + dest = addr;
> + break;
> + case IND_INDIRECTION:
> + ptr = __va(addr);
> + break;
> + case IND_SOURCE:
> + /* Shouldn't happen, but verify just to be safe. */
> + if (dest == NULL) {
> + pr_err("Invalid kexec entries list.");
> + return -EINVAL;
> + }
> +
> + if (dest == (void *) load_addr) {
> + struct page *page;
> + char *ptr;
> + size_t uchunk, mchunk;
> +
> + page = kmap_to_page(addr);
> +
> + ptr = kmap(page);

kmap_atomic() could be used here, and it is appreciably faster.


> + ptr += load_addr & ~PAGE_MASK;
> + mchunk = min_t(size_t, memsz,
> +PAGE_SIZE - (load_addr & 
> ~PAGE_MASK));
> + uchunk = min(bufsz, mchunk);
> + memcpy(ptr, buffer, uchunk);
> +
> + kunmap(page);
> +
> + bufsz -= uchunk;
> + load_addr += mchunk;
> + buffer += mchunk;
> + memsz -= mchunk;
> + }
> + dest += PAGE_SIZE;
> + }
> +
> + /* Shouldn't happen, but verify just to be safe. */
> + if (ptr == NULL) {
> + pr_err("Invalid kexec entries list.");
> + return -EINVAL;
> + }
> + }
> +
> + return 0;
> +}
> +


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] kexec: add resriction on the kexec_load

2016-07-22 Thread Andrew Morton

On Fri, 22 Jul 2016 13:36:22 +0800 zhongjiang <zhongji...@huawei.com> wrote:

> From: zhong jiang <zhongji...@huawei.com>
> 
> I hit the following question when run trinity in my system. The
> kernel is 3.4 version. but the mainline have same question to be
> solved. The root cause is the segment size is too large, it can
> expand the most of the area or the whole memory, therefore, it
> may waste an amount of time to abtain a useable page. and other
> cases will block until the test case quit. at the some time,
> OOM will come up.
> 
> Call Trace:
>  [] __alloc_pages_nodemask+0x14c/0x8f0
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] alloc_pages_current+0xaf/0x120
>  [] kimage_alloc_pages+0x10/0x60
>  [] kimage_alloc_control_pages+0x5d/0x270
>  [] machine_kexec_prepare+0xe5/0x6c0
>  [] ? kimage_free_page_list+0x52/0x70
>  [] sys_kexec_load+0x141/0x600
>  [] ? vfs_write+0x100/0x180
>  [] system_call_fastpath+0x16/0x1b
> 
> The patch just add condition on sanity_check_segment_list to
> restriction the segment size.
> 
> ...
>
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -148,6 +148,7 @@ static struct page *kimage_alloc_page(struct kimage 
> *image,
>  int sanity_check_segment_list(struct kimage *image)
>  {
>   int result, i;
> + unsigned long total_segments = 0;
>   unsigned long nr_segments = image->nr_segments;
>  
>   /*
> @@ -209,6 +210,21 @@ int sanity_check_segment_list(struct kimage *image)
>   return result;
>   }
>  
> + /* Verity all segment size donnot exceed the specified size.
> +  * if segment size from user space is too large,  a large
> +  * amount of time will be wasted when allocating page. so,
> +  * softlockup may be come up.
> +  */
> + for (i = 0; i < nr_segments; i++) {
> + if (image->segment[i].memsz > (totalram_pages / 2))
> + return result;
> +
> + total_segments += image->segment[i].memsz;
> + }
> +
> + if (total_segments > (totalram_pages / 2))
> + return result;
> +
>   /*
>* Verify we have good destination addresses.  Normally
>* the caller is responsible for making certain we don't

This needed a few adjustments for pending changes in linux-next's
sanity_check_segment_list().  Mainly s/return result/return -EINVAL/. 
I also tweaked the patch changelog.  Please check.

From: zhong jiang <zhongji...@huawei.com>
Subject: kexec: add restriction on kexec_load() segment sizes

I hit the following issue when run trinity in my system.  The kernel is
3.4 version, but mainline has the same issue.

The root cause is that the segment size is too large so the kerenl spends
too long trying to allocate a page.  Other cases will block until the test
case quits.  Also, OOM conditions will occur.

Call Trace:
 [] __alloc_pages_nodemask+0x14c/0x8f0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [] alloc_pages_current+0xaf/0x120
 [] kimage_alloc_pages+0x10/0x60
 [] kimage_alloc_control_pages+0x5d/0x270
 [] machine_kexec_prepare+0xe5/0x6c0
 [] ? kimage_free_page_list+0x52/0x70
 [] sys_kexec_load+0x141/0x600
 [] ? vfs_write+0x100/0x180
 [] system_call_fastpath+0x16/0x1b

The patch chnages sanity_check_segment_list() to verify that no segment is
larger than half of memory.

Link: 
http://lkml.kernel.org/r/1469165782-13193-1-git-send-email-zhongji...@huawei.com
Signed-off-by: zhong jiang <zhongji...@huawei.com>
Cc: Eric Biederman <ebied...@xmission.com>
Cc: Vivek Goyal <vgo...@redhat.com>
Cc: Dave Young <dyo...@redhat.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
---

 kernel/kexec_core.c |   16 
 1 file changed, 16 insertions(+)

diff -puN kernel/kexec_core.c~kexec-add-resriction-on-the-kexec_load 
kernel/kexec_core.c
--- a/kernel/kexec_core.c~kexec-add-resriction-on-the-kexec_load
+++ a/kernel/kexec_core.c
@@ -154,6 +154,7 @@ static struct page *kimage_alloc_page(st
 int sanity_check_segment_list(struct kimage *image)
 {
int i;
+   unsigned long total_segments = 0;
unsigned long nr_segments = image->nr_segments;
 
/*
@@ -214,6 +215,21 @@ int sanity_check_segment_list(struct kim
return -EINVAL;
}
 
+   /* Verity all segment size donnot exceed the specified size.
+* if segment size from user space is too large,  a large
+* amount of time will be wasted

Re: [PATCH 1/2] kexec: update VMCOREINFO for compound_order/dtor

2016-04-01 Thread Andrew Morton

On Tue, 1 Mar 2016 06:14:32 + Atsushi Kumagai  
wrote:

> makedumpfile refers page.lru.next to get the order of compound pages
> for page filtering. However, now the order is stored in page.compound_order,
> hence VMCOREINFO should be updated to export the offset of
> page.compound_order.
> 
> The fact is, page.compound_order was introduced already in kernel 4.0,
> but the offset of it was the same as page.lru.next until kernel 4.3,
> so this was not actual problem.
> 
> The above can be said also for page.lru.prev and page.compound_dtor,
> it's necessary to detect hugetlbfs pages. Further, the content was
> changed from direct address to the ID which means dtor.

It's unclear which kernels need the patch and why.  I *think* that the
patch is needed in 4.3.x, 4.4.x, 4.5.x and 4.6 in order to make
makedumpfile work correctly.  Is that right?

And it appears that [patch 2/2] is needed in 4.0+?

However in both cases I am uncertain - what are the end-user visible
effects of these regressions?  Why can bugs remain in place for so long
without having been observed?

Please make all these things clear when perparing changelogs for
bugfixes: which kernel versions need fixing and why (ie: what are the
end-user visible effects of the bug).

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/2] kexec: Introduce a protection mechanism for the crashkernel reserved memory

2016-03-22 Thread Andrew Morton

On Wed, 23 Dec 2015 19:12:25 +0800 Xunlei Pang <xlp...@redhat.com> wrote:

> For the cases that some kernel (module) path stamps the crash
> reserved memory(already mapped by the kernel) where has been
> loaded the second kernel data, the kdump kernel will probably
> fail to boot when panic happens (or even not happens) leaving
> the culprit at large, this is unacceptable.
> 
> The patch introduces a mechanism for detecting such cases:
> 1) After each crash kexec loading, it simply marks the reserved
> memory regions readonly since we no longer access it after that.
> When someone stamps the region, the first kernel will panic and
> trigger the kdump. The weak arch_kexec_protect_crashkres() is
> introduced to do the actual protection.
> 
> 2) To allow multiple loading, once 1) was done we also need to
> remark the reserved memory to readwrite each time a system call
> related to kdump is made. The weak arch_kexec_unprotect_crashkres()
> is introduced to do the actual protection.
> 
> The architecture can make its specific implementation by overriding
> arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().

I'd like to see a bit more review activity on these patches from the
other kexec developers, please.  I'll include both the patches below.


Also I don't have any record of reviews of these two:

http://ozlabs.org/~akpm/mmotm/broken-out/kexec-make-a-pair-of-map-unmap-reserved-pages-in-error-path.patch
http://ozlabs.org/~akpm/mmotm/broken-out/kexec-do-a-cleanup-for-function-kexec_load.patch

Thanks.


From: Xunlei Pang <xlp...@redhat.com>
Subject: kexec: introduce a protection mechanism for the crashkernel reserved 
memory

For the cases that some kernel (module) path stamps the crash reserved
memory(already mapped by the kernel) where has been loaded the second
kernel data, the kdump kernel will probably fail to boot when panic
happens (or even not happens) leaving the culprit at large, this is
unacceptable.

The patch introduces a mechanism for detecting such cases:

1) After each crash kexec loading, it simply marks the reserved memory
   regions readonly since we no longer access it after that.  When someone
   stamps the region, the first kernel will panic and trigger the kdump. 
   The weak arch_kexec_protect_crashkres() is introduced to do the actual
   protection.

2) To allow multiple loading, once 1) was done we also need to remark
   the reserved memory to readwrite each time a system call related to
   kdump is made.  The weak arch_kexec_unprotect_crashkres() is introduced
   to do the actual protection.

The architecture can make its specific implementation by overriding
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().

Signed-off-by: Xunlei Pang <xlp...@redhat.com>
Cc: Eric Biederman <ebied...@xmission.com>
Cc: Dave Young <dyo...@redhat.com>
Cc: Minfei Huang <mhu...@redhat.com>
Cc: Vivek Goyal <vgo...@redhat.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
---

 include/linux/kexec.h |2 ++
 kernel/kexec.c|9 -
 kernel/kexec_core.c   |6 ++
 kernel/kexec_file.c   |8 +++-
 4 files changed, 23 insertions(+), 2 deletions(-)

diff -puN 
include/linux/kexec.h~kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory
 include/linux/kexec.h
--- 
a/include/linux/kexec.h~kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory
+++ a/include/linux/kexec.h
@@ -317,6 +317,8 @@ int __weak arch_kexec_apply_relocations_
Elf_Shdr *sechdrs, unsigned int relsec);
 int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr 
*sechdrs,
unsigned int relsec);
+void arch_kexec_protect_crashkres(void);
+void arch_kexec_unprotect_crashkres(void);
 
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
diff -puN 
kernel/kexec.c~kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory
 kernel/kexec.c
--- 
a/kernel/kexec.c~kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory
+++ a/kernel/kexec.c
@@ -167,8 +167,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
return -EBUSY;
 
dest_image = _image;
-   if (flags & KEXEC_ON_CRASH)
+   if (flags & KEXEC_ON_CRASH) {
dest_image = _crash_image;
+   if (kexec_crash_image)
+   arch_kexec_unprotect_crashkres();
+   }
+
if (nr_segments > 0) {
unsigned long i;
 
@@ -211,6 +215,9 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
image = xchg(dest_image, image);
 
 out:
+   if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
+   arch_kexec_protect_crashkres();
+
mutex_unlock(_mutex);
kimage_free(image);
 
diff -puN 
kernel/kexec_core.c~kexec-introduce-a-protection-m

Re: [PATCH V2] proc-vmcore: wrong data type casting fix

2016-03-11 Thread Andrew Morton

On Fri, 11 Mar 2016 16:42:48 +0800 Dave Young  wrote:

> On i686 PAE enabled machine the contiguous physical area could be large
> and it can cause trimming down variables in below calculation in
> read_vmcore() and mmap_vmcore():
> 
>   tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> 
> Then the real size passed down is not correct any more.
> Suppose m->offset + m->size - *fpos being truncated to 0, buflen >0 then
> we will get tsz = 0. It is of course not an expected result.

I don't really understand this.

vmcore.offset if loff_t which is 64-bit
vmcore.size is long long
*fpos is loff_t

so the expression should all be done with 64-bit arithmetic anyway.

Maybe buflen (size_t) has the wrong type, but the result of the other
expression should be in-range by the time we come to doing the
comparison.

> During our tests there are two problems caused by it:
> 1) read_vmcore will refuse to continue so makedumpfile fails.
> 2) mmap_vmcore will trigger BUG_ON() in remap_pfn_range().
> 
> Use unsigned long long in min_t instead so that the variables are not
> truncated.
> 
> Signed-off-by: Baoquan He 
> Signed-off-by: Dave Young 

I think we'll need a cc:stable here.

> --- linux-x86.orig/fs/proc/vmcore.c
> +++ linux-x86/fs/proc/vmcore.c
> @@ -231,7 +231,9 @@ static ssize_t __read_vmcore(char *buffe
>  
>   list_for_each_entry(m, _list, list) {
>   if (*fpos < m->offset + m->size) {
> - tsz = min_t(size_t, m->offset + m->size - *fpos, 
> buflen);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - *fpos,
> + buflen);

This is rather a mess.  Can we please try to fix this bug by choosing
appropriate types rather than all the typecasting?


>   start = m->paddr + *fpos - m->offset;
>   tmp = read_from_oldmem(buffer, tsz, , userbuf);
>   if (tmp < 0)
> @@ -461,7 +463,8 @@ static int mmap_vmcore(struct file *file
>   if (start < m->offset + m->size) {
>   u64 paddr = 0;
>  
> - tsz = min_t(size_t, m->offset + m->size - start, size);
> + tsz = (size_t)min_t(unsigned long long,
> + m->offset + m->size - start, size);
>   paddr = m->paddr + start - m->offset;
>   if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
>   paddr >> PAGE_SHIFT, tsz,

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH V2 2/2] kexec: Do a cleanup for function kexec_load

2016-03-01 Thread Andrew Morton

ntry, nr_segments, segments, flags);

if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
arch_kexec_protect_crashkres();

mutex_unlock(_mutex);

return result;
}


From: Minfei Huang <mnfhu...@gmail.com>
Subject: kexec: do a cleanup for function kexec_load

There are a lof of work to be done in function kexec_load, not only for
allocating structs and loading initram, but also for some misc.

To make it more clear, wrap a new function do_kexec_load which is used to
allocate structs and load initram.  And the pre-work will be done in
kexec_load.

Signed-off-by: Minfei Huang <mnfhu...@gmail.com>
Cc: Vivek Goyal <vgo...@redhat.com>
Cc: "Eric W. Biederman" <ebied...@xmission.com>
Signed-off-by: Andrew Morton <a...@linux-foundation.org>
---

 kernel/kexec.c |  125 +--
 1 file changed, 69 insertions(+), 56 deletions(-)

diff -puN kernel/kexec.c~kexec-do-a-cleanup-for-function-kexec_load 
kernel/kexec.c
--- a/kernel/kexec.c~kexec-do-a-cleanup-for-function-kexec_load
+++ a/kernel/kexec.c
@@ -103,6 +103,74 @@ out_free_image:
return ret;
 }
 
+static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
+   struct kexec_segment __user *segments, unsigned long flags)
+{
+   struct kimage **dest_image, *image;
+   unsigned long i;
+   int ret;
+
+   if (flags & KEXEC_ON_CRASH) {
+   dest_image = _crash_image;
+   if (kexec_crash_image)
+   arch_kexec_unprotect_crashkres();
+   } else {
+   dest_image = _image;
+   }
+
+   if (nr_segments == 0) {
+   /* Uninstall image */
+   kimage_free(xchg(dest_image, NULL));
+   return 0;
+   }
+   if (flags & KEXEC_ON_CRASH) {
+   /*
+* Loading another kernel to switch to if this one
+* crashes.  Free any current crash dump kernel before
+* we corrupt it.
+*/
+   kimage_free(xchg(_crash_image, NULL));
+   }
+
+   ret = kimage_alloc_init(, entry, nr_segments, segments, flags);
+   if (ret)
+   return ret;
+
+   if (flags & KEXEC_ON_CRASH)
+   crash_map_reserved_pages();
+
+   if (flags & KEXEC_PRESERVE_CONTEXT)
+   image->preserve_context = 1;
+
+   ret = machine_kexec_prepare(image);
+   if (ret)
+   goto out;
+
+   for (i = 0; i < nr_segments; i++) {
+   ret = kimage_load_segment(image, >segment[i]);
+   if (ret)
+   goto out;
+   }
+
+   kimage_terminate(image);
+
+   /* Install the new kernel and uninstall the old */
+   image = xchg(dest_image, image);
+
+out:
+   if ((flags & KEXEC_ON_CRASH) && kexec_crash_image)
+   arch_kexec_protect_crashkres();
+
+   /*
+* Once the reserved memory is mapped, we should unmap this memory
+* before returning
+*/
+   if (flags & KEXEC_ON_CRASH)
+   crash_unmap_reserved_pages();
+   kimage_free(image);
+   return ret;
+}
+
 /*
  * Exec Kernel system call: for obvious reasons only root may call it.
  *
@@ -127,7 +195,6 @@ out_free_image:
 SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
struct kexec_segment __user *, segments, unsigned long, flags)
 {
-   struct kimage **dest_image, *image;
int result;
 
/* We only trust the superuser with rebooting the system. */
@@ -152,9 +219,6 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
if (nr_segments > KEXEC_SEGMENT_MAX)
return -EINVAL;
 
-   image = NULL;
-   result = 0;
-
/* Because we write directly to the reserved memory
 * region when loading crash kernels we need a mutex here to
 * prevent multiple crash  kernels from attempting to load
@@ -166,63 +230,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
if (!mutex_trylock(_mutex))
return -EBUSY;
 
-   dest_image = _image;
-   if (flags & KEXEC_ON_CRASH) {
-   dest_image = _crash_image;
-   if (kexec_crash_image)
-   arch_kexec_unprotect_crashkres();
-   }
-
-   if (nr_segments > 0) {
-   unsigned long i;
-
-   if (flags & KEXEC_ON_CRASH) {
-   /*
-* Loading another kernel to switch to if this one
-* crashes.  Free any current crash dump kernel before
-* we corrupt it.
-*/
-
-   kimage_free(xchg(_crash_image, NULL));
-   result = kimage_alloc_init(, entry, nr_segments,
-  segments, flags);
-

Re: [PATCH V2 1/2] kexec: Make a pair of map/unmap reserved pages in error path

2016-03-01 Thread Andrew Morton

On Tue,  1 Mar 2016 16:02:28 +0800 Minfei Huang  wrote:

> For some arch, kexec shall map the reserved pages, then use them, when
> we try to start the kdump service.

Which architectures are these, by the way?

> kexec may return directly, without unmaping the reserved pages, if it
> fails during starting service. To fix it, we make a pair of map/unmap
> reserved pages both in generic path and error path.

I'm having trouble understanding the urgency of this patch.  Do you
think it is needed in 4.5?  -stable?  If so, why?

Thanks.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: Make a pair of map/unmap reserved pages in error path

2016-02-26 Thread Andrew Morton

On Thu, 25 Feb 2016 22:02:40 +0800 Minfei Huang  wrote:

> From: Minfei Huang 
> 
> For some arch, kexec shall map the reserved pages, then use them, when
> we try to start the kdump service.
> 
> kexec may return directly, without unmaping the reserved pages, if it
> fails during starting service. To fix it, we make a pair of map/unmap
> reserved pages both in generic path and error path.

This patch both refactors the code AND fixes the bug.  It is a
decent-looking refactoring, but mixing the two together makes it *much*
harder to review the bugfix.  These two steps should be separated
please, with the bugfix patch coming first.

> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -103,6 +103,68 @@ out_free_image:
>   return ret;
>  }
>  
> +static int do_kexec_load(unsigned long entry, unsigned long nr_segments,
> + struct kexec_segment __user *segments, unsigned long flags)
> +{
> + struct kimage **dest_image, *image;
> + unsigned long i;
> + int ret;
> +
> + if (flags & KEXEC_ON_CRASH)
> + dest_image = _crash_image;
> + else
> + dest_image = _image;
> +
> + if (nr_segments == 0) {
> + /* Uninstall image */
> + kimage_free(xchg(dest_image, NULL));
> + return 0;
> + }
> + if (flags & KEXEC_ON_CRASH) {
> + /*
> +  * Loading another kernel to switch to if this one
> +  * crashes.  Free any current crash dump kernel before
> +  * we corrupt it.
> +  */
> + kimage_free(xchg(_crash_image, NULL));
> + }
> +
> + ret = kimage_alloc_init(, entry, nr_segments, segments, flags);
> + if (ret)
> + return ret;

This is a bug, isn't it?  Missed kimage_free().

> + if (flags & KEXEC_ON_CRASH)
> + crash_map_reserved_pages();
> +
> + if (flags & KEXEC_PRESERVE_CONTEXT)
> + image->preserve_context = 1;
> +
> + ret = machine_kexec_prepare(image);
> + if (ret)
> + goto out;
> +
> + for (i = 0; i < nr_segments; i++) {
> + ret = kimage_load_segment(image, >segment[i]);
> + if (ret)
> + goto out;
> + }
> +
> + kimage_terminate(image);
> +
> + /* Install the new kernel and uninstall the old */
> + image = xchg(dest_image, image);
> +
> +out:
> + /*
> +  * Once the reserved memory is mapped, we should unmap this memory
> +  * before returning
> +  */
> + if (flags & KEXEC_ON_CRASH)
> + crash_unmap_reserved_pages();
> + kimage_free(image);
> + return ret;
> +}
> +
>
> ...
>

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: unmap reserved pages for each error-return way

2016-02-01 Thread Andrew Morton

On Thu, 28 Jan 2016 11:57:22 +0300 Dmitry Safonov  
wrote:

> On 01/28/2016 09:29 AM, Minfei Huang wrote:
> > On 01/27/16 at 02:48pm, Dmitry Safonov wrote:
> >> For allocation of kimage failure or kexec_prepare or load segments
> >> errors there is no need to keep crashkernel memory mapped.
> >> It will affect only s390 as map/unmap hook defined only for it.
> >> As on unmap s390 also changes os_info structure let's check return code
> >> and add info only on success.
> > Hi, Dmitry.
> >
> > Previously, I sent a patch to fix this issue. You can refer it in
> > following link.
> >
> > http://lists.infradead.org/pipermail/kexec/2015-July/013960.html
> Oh, scratch my patch - I'm fine with yours, wanted to do the similar thing
> because it has dazzled me while I was debugging around.

There were a bunch of patches tossed around in that thread but I'm not
sure that anything actually got applied?  Perhaps some resending is
needed.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: unmap reserved pages for each error-return way

2016-01-27 Thread Andrew Morton

On Wed, 27 Jan 2016 14:48:31 +0300 Dmitry Safonov  
wrote:

> For allocation of kimage failure or kexec_prepare or load segments
> errors there is no need to keep crashkernel memory mapped.
> It will affect only s390 as map/unmap hook defined only for it.
> As on unmap s390 also changes os_info structure let's check return code
> and add info only on success.
> 

This conflicts (both mechanically and somewhat conceptually) with
Xunlei Pang's "kexec: Introduce a protection mechanism for the
crashkernel reserved memory" and "kexec: provide
arch_kexec_protect(unprotect)_crashkres()".

http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory.patch
http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory-v4.patch

and

http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres.patch
http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres-v4.patch

so can we please sort all that out?

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kexec: Move some memembers and definitions within the scope of CONFIG_KEXEC_FILE

2015-12-22 Thread Andrew Morton

On Tue, 22 Dec 2015 19:40:39 +0800 Xunlei Pang  wrote:

> > Following functions will be used only in kexec_file. Please wrap them in
> > CONFIG_KEXEC_FILE.
> >
> > int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
> >  unsigned long buf_len);
> > void * __weak arch_kexec_kernel_image_load(struct kimage *image);
> > int __weak arch_kimage_file_post_load_cleanup(struct kimage *image);
> > int __weak arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
> > unsigned long buf_len);
> > int __weak arch_kexec_apply_relocations_add(const Elf_Ehdr *ehdr,
> > Elf_Shdr *sechdrs, unsigned int relsec);
> > int __weak arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr 
> > *sechdrs,
> > unsigned int relsec);
> 
> Thanks for the comment.
> 
> I noticed this as well, but seems for the function declarations we don't need 
> do this,
> since they don't consume the actual space.
> 
> For example, in the include/linux/timekeeping.h
> /*  
>  * RTC specific
>  */ 
> extern bool timekeeping_rtc_skipsuspend(void);
> extern bool timekeeping_rtc_skipresume(void);
> 
> extern void timekeeping_inject_sleeptime64(struct timespec64 *delta);
> 
> also not embraced by the corresponding macros.

Yes.  If we add the ifdefs then a programming error will be detected at
compile time.  If we don't add the ifdefs then that error will be
detected at link time.  So the ifdefs provide a quite small advantage,
while making the code harder to read and harder to maintain.  I believe
that "no ifdefs" is the better side of this tradeoff.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v3] kexec: fix out of the ELF headers buffer issue in syscall kexec_file_load()

2015-10-01 Thread Andrew Morton

On Tue, 29 Sep 2015 20:58:57 +0800 "Lee, Chun-Yi"  
wrote:

> This patch modified the code in fill_up_crash_elf_data by using
> walk_system_ram_res instead of walk_system_ram_range to count the max
> number of crash memory ranges. That's because the walk_system_ram_range
> filters out small memory regions that are resided in the same page, but
> walk_system_ram_res does not.
> 
> The oringial issue is page fault error that sometimes happened on big machines
> when preparing ELF headers:
> 
> [  305.291522] BUG: unable to handle kernel paging request at c90613fc9000
> [  305.299621] IP: [] 
> prepare_elf64_ram_headers_callback+0x165/0x260
> [  305.308300] PGD e32067 PUD 6dcbec54067 PMD 9dc9bdeb067 PTE 0
> [  305.315393] Oops: 0002 [#1] SMP
> [...snip]
> [  305.420953] task: 8e1c01ced600 ti: 8e1c03ec2000 task.ti: 
> 8e1c03ec2000
> [  305.429292] RIP: 0010:[]  [] 
> prepare_elf64_ra
> m_headers_callback+0x165/0x260
> [...snip]
> 
> After tracing prepare_elf64_headers and prepare_elf64_ram_headers_callback,
> the code uses walk_system_ram_res to fill-in crash memory regions information
> to program header, so it counts those small memory regions that are resided in
> a page area. But, when kernel was using walk_system_ram_range in
> fill_up_crash_elf_data to count the number of crash memory regions, it filters
> out small regions. I printed those small memory regions, for example:
> 
> kexec: Get nr_ram ranges. vaddr=0x880077592258 paddr=0x77592258, sz=0xdc0
> 
> Base on the code in walk_system_ram_range, this memory region will be filtered
> out:
> 
> pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
> end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
> end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
> 
> So, the max_nr_ranges that's counted by kernel doesn't include small memory
> regions. That causes the page fault issue happened in later code path for
> preparing EFL headers.
> 
> This issus is not easy to reproduce on small machines that don't have too
> many CPUs because the allocated page aligned ELF buffer has more free space
> to cover those small memory regions' PT_LOAD headers.
> 

fyi, I added a cc:stable to my copy of this patch.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [REPOST PATCH] kexec: Remove the unnecessary conditional judgement to simplify the code logic

2015-08-13 Thread Andrew Morton

On Thu, 13 Aug 2015 09:55:25 +0900 Simon Horman ho...@verge.net.au wrote:

 On Tue, Jul 28, 2015 at 12:46:42PM +0800, Minfei Huang wrote:
  Transforming PFN(Page Frame Number) to struct page is never failure, so
  we can simplify the code logic to do the image-control_page assignment
  directly in the loop, and remove the unnecessary conditional judgement.
  
  Signed-off-by: Minfei Huang mnfhu...@gmail.com
  Acked-by: Dave Young dyo...@redhat.com
  Acked-by: Vivek Goyal vgo...@redhat.com
 
 Andrew, could you consider picking this up.
 It seems to have been sufficiently reviewed, acked, etc...

I grabbed this one of July 28. 
http://ozlabs.org/~akpm/mmotm/broken-out/kexec-remove-the-unnecessary-conditional-judgement-to-simplify-the-code-logic.patch

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [Patch v2] align crash_notes allocation to make it be inside one physical page

2015-08-03 Thread Andrew Morton

On Mon,  3 Aug 2015 20:50:43 +0800 Baoquan He b...@redhat.com wrote:

 People reported that crash_notes in /proc/vmcore were corrupted and
 this cause crash kdump failure. With code debugging and log we got
 the root cause. This is because percpu variable crash_notes are
 allocated in 2 vmalloc pages. Currently percpu is based on vmalloc
 by default. Vmalloc can't guarantee 2 continuous vmalloc pages
 are also on 2 continuous physical pages. So when 1st kernel exports
 the starting address and size of crash_notes through sysfs like below:
 
 /sys/devices/system/cpu/cpux/crash_notes
 /sys/devices/system/cpu/cpux/crash_notes_size
 
 kdump kernel use them to get the content of crash_notes. However the
 2nd part may not be in the next neighbouring physical page as we
 expected if crash_notes are allocated accross 2 vmalloc pages. That's
 why nhdr_ptr-n_namesz or nhdr_ptr-n_descsz could be very huge in
 update_note_header_size_elf64() and cause note header merging failure
 or some warnings.
 
 In this patch change to call __alloc_percpu() to passed in the align
 value by rounding crash_notes_size up to the nearest power of two.
 This make sure the crash_notes is allocated inside one physical page
 since sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE.
 Meanwhile add a WARN_ON in case it grows to be bigger than PAGE_SIZE
 in the future.
 
 ...

 --- a/kernel/kexec.c
 +++ b/kernel/kexec.c
 @@ -1620,7 +1620,16 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
  static int __init crash_notes_memory_init(void)
  {
   /* Allocate memory for saving cpu registers. */
 - crash_notes = alloc_percpu(note_buf_t);
 + size_t size, align;
 + int order;
 +
 + size = sizeof(note_buf_t);
 + order = get_count_order(size);
 + align = min_t(size_t, (1order), PAGE_SIZE);
 +
 + WARN_ON(size  PAGE_SIZE);
 +
 + crash_notes = __alloc_percpu(size, align);

A code comment would be helpful - the reason for this code's existence
is otherwise utterly unobvious.

I think it can be done this way:

align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);


I never noticed get_count_order() before.  afaict it does the same as
order_base_2(), except get_count_order() generates better code and has
a ridiculous name.

And I think the WARN_ON can be replaced with a
BUILD_BUG_ON(sizeofPAGE_SIZE)?  That would avoid adding runtime
overhead.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v2] kdump, vmcoreinfo: report actual value of phys_base

2014-12-15 Thread Andrew Morton


(cc trimmed a bit)

On Thu, 13 Nov 2014 11:30:11 +0900 (JST) HATAYAMA Daisuke 
d.hatay...@jp.fujitsu.com wrote:

 Currently, VMCOREINFO note information reports the virtual address of
 phys_base that is assigned to symbol phys_base. But this doesn't make
 sense because to refer to phys_base, it's necessary to get the value
 of phys_base itself we are now about to refer to.

Folks, could we please get a bit of reviewing, acking or nacking for
this one?


From: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Subject: kdump, vmcoreinfo: report actual value of phys_base

Currently, VMCOREINFO note information reports the virtual address of
phys_base that is assigned to symbol phys_base.  But this doesn't make
sense because to refer to phys_base, it's necessary to get the value of
phys_base itself we are now about to refer to.

Userland tools related to kdump such as makedumpfile and crash utility so
far have made some efforts to calculate phys_base on crash dump formats
generated by mechanisms running outside Linux kernel, such as virtual
machine hypervisor such as qemu dump, which ordinary users use via virsh
dump, or ones implemented on vendor specific firmware.

That is, find a kernel data whose virtual and physical addresses are
available via its note information and calculate phys_base from it. 
However, such data structure is not the one prepared for phys_base
purpose.  There's no guarantee that other crash dump mechanisms include
such information that can be used to calculate phys_base similarly.

To get VMCOREINFO in vmcore, it's easy to use strings and grep commands
like this; VMCOREINFO consists of simple string:

$ strings vmcore-3.10.0-121.el7.x86_64 | grep -E .*VMCOREINFO.* -A 100
VMCOREINFO
OSRELEASE=3.10.0-121.el7.x86_64
PAGESIZE=4096
...

This is also useful to get value of phys_base in kdump 2nd kernel
contained in vmcore using the above-mentioned external crash dump
mechanism; kdump 2nd kernel is an inherently relocated kernel.

This commit doesn't remove VMCOREINFO_SYMBOL(phys_base) line because
makedumpfile refers to it and if removing it, old versions makedumpfile
doesn't work well.

Signed-off-by: HATAYAMA Daisuke d.hatay...@jp.fujitsu.com
Cc: Eric W. Biederman ebied...@xmission.com
Cc: Vivek Goyal vgo...@redhat.com
Cc: Atsushi Kumagai kumagai-atsu...@mxc.nes.nec.co.jp
Cc: Dave Anderson ander...@redhat.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 arch/x86/kernel/machine_kexec_64.c |1 +
 include/linux/kexec.h  |2 ++
 2 files changed, 3 insertions(+)

diff -puN 
arch/x86/kernel/machine_kexec_64.c~kdump-vmcoreinfo-report-actual-value-of-phys_base
 arch/x86/kernel/machine_kexec_64.c
--- 
a/arch/x86/kernel/machine_kexec_64.c~kdump-vmcoreinfo-report-actual-value-of-phys_base
+++ a/arch/x86/kernel/machine_kexec_64.c
@@ -334,6 +334,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
vmcoreinfo_append_str(KERNELOFFSET=%lx\n,
  (unsigned long)_text - __START_KERNEL);
+   VMCOREINFO_PHYS_BASE(phys_base);
 }
 
 /* arch-dependent functionality related to kexec file-based syscall */
diff -puN 
include/linux/kexec.h~kdump-vmcoreinfo-report-actual-value-of-phys_base 
include/linux/kexec.h
--- a/include/linux/kexec.h~kdump-vmcoreinfo-report-actual-value-of-phys_base
+++ a/include/linux/kexec.h
@@ -258,6 +258,8 @@ unsigned long paddr_vmcoreinfo_note(void
vmcoreinfo_append_str(NUMBER(%s)=%ld\n, #name, (long)name)
 #define VMCOREINFO_CONFIG(name) \
vmcoreinfo_append_str(CONFIG_%s=y\n, #name)
+#define VMCOREINFO_PHYS_BASE(value) \
+   vmcoreinfo_append_str(PHYS_BASE=%lx\n, (unsigned long)value)
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
_


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH V3] kernel, add bug_on_warn

2014-10-23 Thread Andrew Morton

On Thu, 23 Oct 2014 08:53:14 -0400 Prarit Bhargava pra...@redhat.com wrote:

 There have been several times where I have had to rebuild a kernel to
 cause a panic when hitting a WARN() in the code in order to get a crash
 dump from a system.  Sometimes this is easy to do, other times (such as
 in the case of a remote admin) it is not trivial to send new images to the
 user.
 
 A much easier method would be a switch to change the WARN() over to a
 BUG().  This makes debugging easier in that I can now test the actual
 image the WARN() was seen on and I do not have to engage in remote
 debugging.
 
 This patch adds a bug_on_warn kernel parameter, which calls BUG() in the
 warn_slowpath_common() path.  The function will still print out the
 location of the warning.

The changelog doesn't mention /proc/sys/kernel/bug_on_warn?

Why do we need both the sysctl and the kernel parameter?  Only to
trigger BUG for warnings which occur prior to initscripts.  Is there a
legitimate case for this?  Is kdump even usable at this time?

 --- a/include/asm-generic/bug.h
 +++ b/include/asm-generic/bug.h
 @@ -75,10 +75,18 @@ extern void warn_slowpath_null(const char *file, const 
 int line);
  #define __WARN_printf_taint(taint, arg...)   \
   warn_slowpath_fmt_taint(__FILE__, __LINE__, taint, arg)
  #else
 -#define __WARN() __WARN_TAINT(TAINT_WARN)
 +#define check_bug_on_warn()  \
 + do {\
 + if (bug_on_warn)\
 + BUG();  \
 + } while (0)

#define check_bug_on_warn() BUG_ON(bug_on_warn)

would suffice?

 +#define __WARN() \
 + do { __WARN_TAINT(TAINT_WARN); check_bug_on_warn(); } while (0)
 +
  #define __WARN_printf(arg...)do { printk(arg); __WARN(); } while (0)
  #define __WARN_printf_taint(taint, arg...)   \
 - do { printk(arg); __WARN_TAINT(taint); } while (0)
 + do { printk(arg); __WARN_TAINT(taint); check_bug_on_warn(); } while (0)
  #endif

What's this code here for anyway?  The changes to
warn_slowpath_common() aren't sufficient?

  #ifndef WARN_ON
 diff --git a/include/linux/kernel.h b/include/linux/kernel.h
 index 40728cf..4094a60 100644
 --- a/include/linux/kernel.h
 +++ b/include/linux/kernel.h
 @@ -422,6 +422,7 @@ extern int panic_on_oops;
  extern int panic_on_unrecovered_nmi;
  extern int panic_on_io_nmi;
  extern int sysctl_panic_on_stackoverflow;
 +extern int bug_on_warn;
  /*
   * Only to be used by arch init code. If the user over-wrote the default
   * CONFIG_PANIC_TIMEOUT, honor it.
 diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
 index 43aaba1..2ba0a58 100644
 --- a/include/uapi/linux/sysctl.h
 +++ b/include/uapi/linux/sysctl.h
 @@ -153,6 +153,7 @@ enum
   KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
   KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
   KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
 + KERN_BUG_ON_WARN=77, /* int: call BUG() in WARN() functions */
  };
  
  
 diff --git a/kernel/panic.c b/kernel/panic.c
 index d09dc5c..a6d2e2f 100644
 --- a/kernel/panic.c
 +++ b/kernel/panic.c
 @@ -33,6 +33,7 @@ static int pause_on_oops;
  static int pause_on_oops_flag;
  static DEFINE_SPINLOCK(pause_on_oops_lock);
  static bool crash_kexec_post_notifiers;
 +int bug_on_warn;

I suppose this should be __read_mostly.  Assuming __read_mostly is
useful :(


 ...


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 1/1] kexec: Verify the signature of signed PE bzImage

2014-07-24 Thread Andrew Morton

On Thu, 24 Jul 2014 10:16:42 -0400 Vivek Goyal vgo...@redhat.com wrote:

 I am wondering how this final kexec patch should be routed. Issue here is that
 this patch depends on David Howells's PKCS7 and PEFILE patches. Which are now
 sitting in security tree. (security/next).
 
 http://git.kernel.org/cgit/linux/kernel/git/jmorris/linux-security.git/log/?h=next
 
 Not sure how to handle this dependency issue.

I have all these patches staged after linux-next so it fitted
straight in.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

2014-07-01 Thread Andrew Morton

On Tue, 1 Jul 2014 20:46:05 +0100 Matt Fleming m...@console-pimps.org wrote:

  +int get_efi_runtime_map_size(void)
  +{
  +   return nr_efi_runtime_map * efi_memdesc_size;
  +}
  +
  +int get_efi_runtime_map_desc_size(void)
  +{
  +   return efi_memdesc_size;
  +}
  +
  +int efi_runtime_map_copy(void *buf, size_t bufsz)
  +{
  +   size_t sz = get_efi_runtime_map_size();
  +
  +   if (sz  bufsz)
  +   sz = bufsz;
  +
  +   memcpy(buf, efi_runtime_map, sz);
  +   return 0;
  +}
 
 Could we prefix these with efi_, e.g. efi_get_runtime_map_size() ?

This?

From: Andrew Morton a...@linux-foundation.org
Subject: kexec-support-kexec-kdump-on-efi-systems-fix

s/get_efi/efi_get/g, per Matt

Cc: Vivek Goyal vgo...@redhat.com
Cc: Matt Fleming m...@console-pimps.org
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 arch/x86/kernel/kexec-bzimage64.c  |4 ++--
 drivers/firmware/efi/runtime-map.c |6 +++---
 include/linux/efi.h|8 
 3 files changed, 9 insertions(+), 9 deletions(-)

diff -puN 
arch/x86/kernel/kexec-bzimage64.c~kexec-support-kexec-kdump-on-efi-systems-fix 
arch/x86/kernel/kexec-bzimage64.c
--- 
a/arch/x86/kernel/kexec-bzimage64.c~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/arch/x86/kernel/kexec-bzimage64.c
@@ -181,7 +181,7 @@ setup_efi_state(struct boot_params *para
ei-efi_systab_hi = current_ei-efi_systab_hi;
 
ei-efi_memdesc_version = current_ei-efi_memdesc_version;
-   ei-efi_memdesc_size = get_efi_runtime_map_desc_size();
+   ei-efi_memdesc_size = efi_get_runtime_map_desc_size();
 
setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
  efi_map_sz);
@@ -397,7 +397,7 @@ void *bzImage64_load(struct kimage *imag
 * have to create separate segment for each. Keeps things
 * little bit simple
 */
-   efi_map_sz = get_efi_runtime_map_size();
+   efi_map_sz = efi_get_runtime_map_size();
efi_map_sz = ALIGN(efi_map_sz, 16);
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
MAX_ELFCOREHDR_STR_LEN;
diff -puN 
drivers/firmware/efi/runtime-map.c~kexec-support-kexec-kdump-on-efi-systems-fix 
drivers/firmware/efi/runtime-map.c
--- 
a/drivers/firmware/efi/runtime-map.c~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/drivers/firmware/efi/runtime-map.c
@@ -138,19 +138,19 @@ add_sysfs_runtime_map_entry(struct kobje
return entry;
 }
 
-int get_efi_runtime_map_size(void)
+int efi_get_runtime_map_size(void)
 {
return nr_efi_runtime_map * efi_memdesc_size;
 }
 
-int get_efi_runtime_map_desc_size(void)
+int efi_get_runtime_map_desc_size(void)
 {
return efi_memdesc_size;
 }
 
 int efi_runtime_map_copy(void *buf, size_t bufsz)
 {
-   size_t sz = get_efi_runtime_map_size();
+   size_t sz = efi_get_runtime_map_size();
 
if (sz  bufsz)
sz = bufsz;
diff -puN include/linux/efi.h~kexec-support-kexec-kdump-on-efi-systems-fix 
include/linux/efi.h
--- a/include/linux/efi.h~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/include/linux/efi.h
@@ -1151,8 +1151,8 @@ int efivars_sysfs_init(void);
 #ifdef CONFIG_EFI_RUNTIME_MAP
 int efi_runtime_map_init(struct kobject *);
 void efi_runtime_map_setup(void *, int, u32);
-int get_efi_runtime_map_size(void);
-int get_efi_runtime_map_desc_size(void);
+int efi_get_runtime_map_size(void);
+int efi_get_runtime_map_desc_size(void);
 int efi_runtime_map_copy(void *buf, size_t bufsz);
 #else
 static inline int efi_runtime_map_init(struct kobject *kobj)
@@ -1163,12 +1163,12 @@ static inline int efi_runtime_map_init(s
 static inline void
 efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}
 
-static inline int get_efi_runtime_map_size(void)
+static inline int efi_get_runtime_map_size(void)
 {
return 0;
 }
 
-static inline int get_efi_runtime_map_desc_size(void)
+static inline int efi_get_runtime_map_desc_size(void)
 {
return 0;
 }
_


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

2014-06-26 Thread Andrew Morton

On Thu, 26 Jun 2014 16:33:29 -0400 Vivek Goyal vgo...@redhat.com wrote:

 This patch series does not do kernel signature verification yet. I plan
 to post another patch series for that. Now distributions are already signing
 PE/COFF bzImage with PKCS7 signature I plan to parse and verify those
 signatures.
 
 Primary goal of this patchset is to prepare groundwork so that kernel
 image can be signed and signatures be verified during kexec load. This
 should help with two things.
 
 - It should allow kexec/kdump on secureboot enabled machines.
 
 - In general it can help even without secureboot. By being able to verify
   kernel image signature in kexec, it should help with avoiding module
   signing restrictions. Matthew Garret showed how to boot into a custom
   kernel, modify first kernel's memory and then jump back to old kernel and
   bypass any policy one wants to.
 
 I hope these patches can be queued up for 3.17. Even without signature
 verification support, they provide new syscall functionality. But I
 wil leave it to maintainers to decide if they want signature verification
 support also be ready to merge before they merge this patchset.

Well, this is an absolute ton of new code, much of it pretty complex. 
And I believe the entire point of this work is to enable image
signature checking, but that hasn't been implemented yet?

In which case I'm thinking it would be unwise to merge these parts into
mainline - if signature checking doesn't work or fails review or if you
get hit by a bus then we'd be left with a large lump of rather useless
code?

In which case I'm inclined to put this series into -next and keep it
there pending completion of the signature checking part.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 09/15] kexec: Implementation of new syscall kexec_file_load

2014-06-26 Thread Andrew Morton

On Thu, 26 Jun 2014 16:33:38 -0400 Vivek Goyal vgo...@redhat.com wrote:

 Previous patch provided the interface definition and this patch prvides
 implementation of new syscall.
 
 Previously segment list was prepared in user space. Now user space just
 passes kernel fd, initrd fd and command line and kernel will create a
 segment list internally.
 
 This patch contains generic part of the code. Actual segment preparation
 and loading is done by arch and image specific loader. Which comes in
 next patch.
 
 ...

 --- a/kernel/kexec.c
 +++ b/kernel/kexec.c
 @@ -6,6 +6,8 @@
   * Version 2.  See the file COPYING for more details.
   */
  
 +#define pr_fmt(fmt)  kexec:  fmt
 +
  #include linux/capability.h
  #include linux/mm.h
  #include linux/file.h
 @@ -326,6 +328,215 @@ out_free_image:
   return ret;
  }
  
 +static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
 +{
 + struct fd f = fdget(fd);
 + int ret = 0;

unneeded initialisation.

 + struct kstat stat;
 + loff_t pos;
 + ssize_t bytes = 0;
 +
 + if (!f.file)
 + return -EBADF;
 +
 + ret = vfs_getattr(f.file-f_path, stat);
 + if (ret)
 + goto out;
 +
 + if (stat.size  INT_MAX) {
 + ret = -EFBIG;
 + goto out;
 + }
 +
 + /* Don't hand 0 to vmalloc, it whines. */
 + if (stat.size == 0) {
 + ret = -EINVAL;
 + goto out;
 + }
 +
 + *buf = vmalloc(stat.size);
 + if (!*buf) {
 + ret = -ENOMEM;
 + goto out;
 + }
 +
 + pos = 0;
 + while (pos  stat.size) {
 + bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
 + stat.size - pos);
 + if (bytes  0) {
 + vfree(*buf);
 + ret = bytes;
 + goto out;
 + }
 +
 + if (bytes == 0)
 + break;

Here we can get a short read: (pos  stat.size).  Seems to me that it
is risky to return this result to the caller as if all is well.

 + pos += bytes;
 + }
 +
 + *buf_len = pos;
 +out:
 + fdput(f);
 + return ret;
 +}
 
 ...



___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH] kdump: Fix exported size of vmcoreinfo note

2014-01-14 Thread Andrew Morton

On Tue, 14 Jan 2014 14:33:11 -0500 Vivek Goyal vgo...@redhat.com wrote:

 Right now we seem to be exporting the max data size contained inside
 vmcoreinfo note. But this does not include the size of meta data around
 vmcore info data. Like name of the note and starting and ending elf_note.
 
 I think user space expects total size and that size is put in PT_NOTE
 elf header. Things seem to be fine so far because we are not using
 vmcoreinfo note to the maximum capacity. But as it starts filling up,
 to capacity, at some point of time, problem will be visible.

urgh.  This is what we get for adding undocumented interfaces.

Looking through the fd59d231f81cb0287 changelog and the various email
threads it points to I can find no mention of what vmcoreinfo is
*supposed* to contain.  It was just kinda silently tossed in there.

So as a remedial step, could we please get this and any associated
interfaces written down in a way which people can very belatedly
review?

Phrases like I think user space expects and Things seem to be fine
so far don't inspire a ton of confidence.  What are the chances of
userspace breakage here?  Would it be safer/saner to leave vmcoreinfo
alone and add a new vmcoreinfo2 with the altered behaviour?

 --- linux-2.6.orig/kernel/ksysfs.c2014-01-14 14:09:42.107767503 -0500
 +++ linux-2.6/kernel/ksysfs.c 2014-01-14 14:15:24.385510314 -0500
 @@ -126,7 +126,7 @@ static ssize_t vmcoreinfo_show(struct ko
  {
   return sprintf(buf, %lx %x\n,
  paddr_vmcoreinfo_note(),
 -(unsigned int)vmcoreinfo_max_size);
 +(unsigned int)sizeof(vmcoreinfo_note));
  }
  KERNEL_ATTR_RO(vmcoreinfo);
  

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v5 00/14] kexec kernel efi runtime support

2013-12-10 Thread Andrew Morton

On Mon,  9 Dec 2013 17:42:13 +0800 Dave Young dyo...@redhat.com wrote:

 Here is the V5 patchset for supporting kexec kernel efi runtime.

It's unclear (to me) who's supposed to merge this lot.  It's a kexec
patch which actually hits on efi and x86.

I grabbed them for some -next testing but won't be offended if someone
else merges them ;)

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 0/7] Remove unused /dev/oldmem interface

2013-05-29 Thread Andrew Morton

On Tue, 28 May 2013 08:11:34 -0700 H. Peter Anvin h...@zytor.com wrote:

 On 05/28/2013 07:37 AM, Vivek Goyal wrote:
  
  Eric,
  
  Should we schedule the removal of this interface after 1-2 releases
  and give a warning once if anybody opens /dev/oldmem and tell them
  to use /proc/vmcore instead? 
  
  I am kind of inclined towards warning approarch. If there is any xyz
  /dev/oldmem user in the wild out there, he/she atleast gets a chance to
  migrate to /proc/vmcore.
  
 
 This has been discussed ad nauseam at the Kernel Summit.  It doesn't work.
 

Who said that and what was their evidence?  Giving users (especially
the more technical ones) several months warning is eminently sensible.

Although it doesn't appear to be necessary in this case.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH 2/7] Documentation/devices.txt: Remove /dev/oldmem description

2013-05-29 Thread Andrew Morton

On Mon, 27 May 2013 09:27:42 +0800 Zhang Yanfei zhangyan...@cn.fujitsu.com 
wrote:

 ___ 2013___05___27___ 09:27, HATAYAMA Daisuke __:
  (2013/05/26 9:58), Zhang Yanfei wrote:
  ___ 2013___05___26___ 07:20, Eric W. Biederman __:
  Zhang Yanfei zhangyanfei@gmail.com writes:
 
  From: Zhang Yanfei zhangyan...@cn.fujitsu.com
 
  Won't we want to keep this reservation around to so that this number
  doesn't get reused, and cause people confusion when
  upgrading/downgrading kernels?
 
  Ah, yes. I will just keep this and add a note to make people know that
  it is removed in the next version.
 
  
  It looks enough writing obsolete according to the other parts of the same 
  file.
  
 
 I've sent the v2 version and actually did what you said.

I can't find any v2 version of this patchset.

I did this:

--- a/Documentation/devices.txt~dev-oldmem-remove-the-interface-fix
+++ a/Documentation/devices.txt
@@ -102,6 +102,7 @@ Your cooperation is appreciated.
export the buffered printk records.
 12 = /dev/oldmem   Used by crashdump kernels to access
the memory of the kernel that crashed.
+   (obsolete)
 
   1 block  RAM disk
  0 = /dev/ram0 First RAM disk
_


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH v8 2/9] vmcore: allocate buffer for ELF headers on page-size alignment

2013-05-23 Thread Andrew Morton

On Thu, 23 May 2013 14:25:07 +0900 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com 
wrote:

 Allocate ELF headers on page-size boundary using __get_free_pages()
 instead of kmalloc().
 
 Later patch will merge PT_NOTE entries into a single unique one and
 decrease the buffer size actually used. Keep original buffer size in
 variable elfcorebuf_sz_orig to kfree the buffer later and actually
 used buffer size with rounded up to page-size boundary in variable
 elfcorebuf_sz separately.
 
 The size of part of the ELF buffer exported from /proc/vmcore is
 elfcorebuf_sz.
 
 The merged, removed PT_NOTE entries, i.e. the range [elfcorebuf_sz,
 elfcorebuf_sz_orig], is filled with 0.
 
 Use size of the ELF headers as an initial offset value in
 set_vmcore_list_offsets_elf{64,32} and
 process_ptload_program_headers_elf{64,32} in order to indicate that
 the offset includes the holes towards the page boundary.
 
 As a result, both set_vmcore_list_offsets_elf{64,32} have the same
 definition. Merge them as set_vmcore_list_offsets.
 
 ...

 @@ -526,30 +505,35 @@ static int __init parse_crash_elf64_headers(void)
   }
  
   /* Read in all elf headers. */
 - elfcorebuf_sz = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
 - elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
 + elfcorebuf_sz_orig = sizeof(Elf64_Ehdr) + ehdr.e_phnum * 
 sizeof(Elf64_Phdr);
 + elfcorebuf_sz = elfcorebuf_sz_orig;
 + elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
 +get_order(elfcorebuf_sz_orig));
   if (!elfcorebuf)
   return -ENOMEM;
   addr = elfcorehdr_addr;
 - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, addr, 0);
 + rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, addr, 0);
   if (rc  0) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
  
   /* Merge all PT_NOTE headers into one. */
   rc = merge_note_headers_elf64(elfcorebuf, elfcorebuf_sz, vmcore_list);
   if (rc) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
   rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
   vmcore_list);
   if (rc) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
 - set_vmcore_list_offsets_elf64(elfcorebuf, vmcore_list);
 + set_vmcore_list_offsets(elfcorebuf_sz, vmcore_list);
   return 0;
  }
  
 @@ -581,30 +565,35 @@ static int __init parse_crash_elf32_headers(void)
   }
  
   /* Read in all elf headers. */
 - elfcorebuf_sz = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
 - elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
 + elfcorebuf_sz_orig = sizeof(Elf32_Ehdr) + ehdr.e_phnum * 
 sizeof(Elf32_Phdr);
 + elfcorebuf_sz = elfcorebuf_sz_orig;
 + elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
 +get_order(elfcorebuf_sz_orig));
   if (!elfcorebuf)
   return -ENOMEM;
   addr = elfcorehdr_addr;
 - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, addr, 0);
 + rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, addr, 0);
   if (rc  0) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
  
   /* Merge all PT_NOTE headers into one. */
   rc = merge_note_headers_elf32(elfcorebuf, elfcorebuf_sz, vmcore_list);
   if (rc) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
   rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
   vmcore_list);
   if (rc) {
 - kfree(elfcorebuf);
 + free_pages((unsigned long)elfcorebuf,
 +get_order(elfcorebuf_sz_orig));
   return rc;
   }
 - set_vmcore_list_offsets_elf32(elfcorebuf, vmcore_list);
 + set_vmcore_list_offsets(elfcorebuf_sz, vmcore_list);
   return 0;
  }
  
 @@ -629,14 +618,14 @@ static int __init parse_crash_elf_headers(void)
   return rc;
  
   /* Determine vmcore size. */
 - vmcore_size = get_vmcore_size_elf64(elfcorebuf);
 + vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
   } else if (e_ident[EI_CLASS] == ELFCLASS32) {
   rc = parse_crash_elf32_headers();

Re: [PATCH v8 3/9] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

2013-05-23 Thread Andrew Morton

On Thu, 23 May 2013 14:25:13 +0900 HATAYAMA Daisuke d.hatay...@jp.fujitsu.com 
wrote:

 Treat memory chunks referenced by PT_LOAD program header entries in
 page-size boundary in vmcore_list. Formally, for each range [start,
 end], we set up the corresponding vmcore object in vmcore_list to
 [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
 
 This change affects layout of /proc/vmcore.

Well, changing a userspace interface is generally unacceptable because
it can break existing userspace code.

If you think the risk is acceptable then please do explain why.  In
great detail!

 The gaps generated by the
 rearrangement are newly made visible to applications as
 holes. Concretely, they are two ranges [rounddown(start, PAGE_SIZE),
 start] and [end, roundup(end, PAGE_SIZE)].
 
 Suppose variable m points at a vmcore object in vmcore_list, and
 variable phdr points at the program header of PT_LOAD type the
 variable m corresponds to. Then, pictorially:
 
   m-offset+---+
| hole  |
 phdr-p_offset =   +---+
   m-offset + (paddr - start)  |   |\
| kernel memory | phdr-p_memsz
|   |/
+---+
| hole  |
   m-offset + m-size  +---+
 
 where m-offset and m-offset + m-size are always page-size aligned.

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH]Add kmsg_dump() to kexec path

2013-05-21 Thread Andrew Morton

Guys, can we please get some review attention to this?

From: Seiji Aguchi seiji.agu...@hds.com
Subject: Add kmsg_dump() to kexec path

Problem
===

From our support service experience, we always need to detect root cause
of OS panic.  And customers in enterprise area never forgive us if we
can't detect the root cause of panic due to lack of materials for
investigation.

Kdump is a powerful troubleshooting feature, but it may accesses to
multiple hardware, like HBA, FC-cable, to get to dump disk.

This means kdump is not robust against hardware failure.

Solution

Logging kernel message to persistent device is an effective way to get
materials for investigation in case of kdump failure.

So this patch adds kmsg_dump() to a kexec path.  Also, it adds
KMSG_DUMP_KEXEC to pstore_cannot_block_path() so that it can avoid
deadlocking in kexec path.

Please see the detail of pstore_cannot_block_path(). 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/pstore/platform.c?id=9f244e9cfd70c7c0f82d3c92ce772ab2a92d9f64

Actually, there are some objections about kmsg_dump(KMSG_DUMP_KEXEC) and
EFI below.  But I still think adding kmsg_dump() to a kexec path is
useful.

- http://marc.info/?l=linux-kernelm=130698519720887w=2

(1) kdump already saves kernel messages inside /proc/vmcore

  - 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/kernel/kexec.c?id=a3dd3323058d281abd584b15ad4c5b65064d7a61

  It is correct, but the content of /proc/vmcore is stored a dump disk
  as well.  So, if kdump fails due to hardware failures, the kernel
  messages will be lost.

(2) EFI firmware is buggy

  - http://marc.info/?l=linux-kernelm=130698519720887w=2

  I haven't seen actual firmware bugs which may cause kdump failure.  So
  I don't think we need to care about it too much.  However, just to be
  safe, I introduced pstore_cannot_block_path() avoid deadlocking to
  pstore.

Also, this patch doesn't affect almost all users because kmsg_dump() is
kicked only when specifying both pstore.backend and
printk.always_kmsg_dump parameters.  Even if a buggy firmware causes a
kdump failure and someone blames kdump, we can ask them to reproduce the
kdump failure by removing the parameters.

In addition, I checked current coding of platform drivers.  There is no
obvious problem as follows.

- mtdoops/ramoops
  They are designed to be kicked in panic and oops cases only.
  So, they never run in a kexec path.

- erst/efi/early_printk_mrst/nvram driver for powerpc
  I don't see any bugs which may causes kdump failure because
  deadlocking/dynamic memory allocation don't happen in their write callbacks.

Signed-off-by: Seiji Aguchi seiji.agu...@hds.com
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 fs/pstore/platform.c  |4 
 include/linux/kmsg_dump.h |1 +
 kernel/kexec.c|2 ++
 3 files changed, 7 insertions(+)

diff -puN fs/pstore/platform.c~add-kmsg_dump-to-kexec-path fs/pstore/platform.c
--- a/fs/pstore/platform.c~add-kmsg_dump-to-kexec-path
+++ a/fs/pstore/platform.c
@@ -91,6 +91,8 @@ static const char *get_reason_str(enum k
return Halt;
case KMSG_DUMP_POWEROFF:
return Poweroff;
+   case KMSG_DUMP_KEXEC:
+   return Kexec;
default:
return Unknown;
}
@@ -110,6 +112,8 @@ bool pstore_cannot_block_path(enum kmsg_
case KMSG_DUMP_PANIC:
/* Emergency restart shouldn't be blocked by spin lock. */
case KMSG_DUMP_EMERG:
+   /* In kexec path, pstore shouldn't be blocked to avoid kexec failure. */
+   case KMSG_DUMP_KEXEC:
return true;
default:
return false;
diff -puN include/linux/kmsg_dump.h~add-kmsg_dump-to-kexec-path 
include/linux/kmsg_dump.h
--- a/include/linux/kmsg_dump.h~add-kmsg_dump-to-kexec-path
+++ a/include/linux/kmsg_dump.h
@@ -28,6 +28,7 @@ enum kmsg_dump_reason {
KMSG_DUMP_RESTART,
KMSG_DUMP_HALT,
KMSG_DUMP_POWEROFF,
+   KMSG_DUMP_KEXEC,
 };

 /**
diff -puN kernel/kexec.c~add-kmsg_dump-to-kexec-path kernel/kexec.c
--- a/kernel/kexec.c~add-kmsg_dump-to-kexec-path
+++ a/kernel/kexec.c
@@ -32,6 +32,7 @@
 #include linux/vmalloc.h
 #include linux/swap.h
 #include linux/syscore_ops.h
+#include linux/kmsg_dump.h

 #include asm/page.h
 #include asm/uaccess.h
@@ -1089,6 +1090,7 @@ void crash_kexec(struct pt_regs *regs)

crash_setup_regs(fixed_regs, regs);
crash_save_vmcoreinfo();
+   kmsg_dump(KMSG_DUMP_KEXEC);
machine_crash_shutdown(fixed_regs);
machine_kexec(kexec_crash_image);
}
_

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

Re: [PATCH]Add kmsg_dump() to kexec path

2013-05-21 Thread Andrew Morton

On Tue, 21 May 2013 15:47:41 -0700 gre...@linuxfoundation.org 
gre...@linuxfoundation.org wrote:

 On Tue, May 21, 2013 at 03:41:37PM -0700, Andrew Morton wrote:
  Guys, can we please get some review attention to this?
 
 I thought it was reviewed, and rejected, already.  Did you miss those
 emails?

I had memories of the same thing, but can't find any record in lkml
archives.  Either we're thinking of a different patch or this one has
appeared under multiple titles.


___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec

1 2 >

1 - 100 of 148 matches

Mail list logo