
Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-25 Thread Dave Young

> > 
> > Dave Young sent the original post, and I just re-post it with commit log
> 
> If he sent it, he should be the author I guess.

Just drop the line. The credit can go to Chao Wang instead, since he did
it initially.

But Chao does not work on kexec/kdump any more, and his email address is
outdated as well.

> 
> > improvement as his requirement.
> > http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > There was an old discussion below (previously posted by Chao Wang):
> > https://lkml.org/lkml/2013/10/15/601
> 
> All that changelog info doesn't belong in the commit message ...
> 
> > Signed-off-by: Pingfan Liu 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: ying...@kernel.org,
> > Cc: vgo...@redhat.com
> > Cc: Randy Dunlap 
> > Cc: Borislav Petkov 
> > Cc: x...@kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
> 
>  but here.
> 
> > v6 -> v7: commit log improvement
> >  arch/x86/kernel/setup.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3d872a5..fa62c81 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
> > high ? CRASH_ADDR_HIGH_MAX
> >  : CRASH_ADDR_LOW_MAX,
> > crash_size, CRASH_ALIGN);
> > +#ifdef CONFIG_X86_64
> > +   /*
> > +* crashkernel=X reserve below 896M fails? Try below 4G
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(CRASH_ALIGN,
> > +   (1ULL << 32),
> > +   crash_size, CRASH_ALIGN);
> > +   /*
> > +* crashkernel=X reserve below 4G fails? Try MAXMEM
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(CRASH_ALIGN,
> > +   CRASH_ADDR_HIGH_MAX,
> > +   crash_size, CRASH_ALIGN);
> > +#endif
> 
> Ok, so this is silly: we know at which physical address KASLR allocated
> the kernel so why aren't we querying that and seeing if there's enough
> room before it or after it to call memblock_find_in_range() on the
> bigger range?

Baoquan may have comments?

> 
> Also, why is "high" dealt with separately and why isn't the code
> enforcing "high" if the normal reservation fails?
> 

AFAIK, some people prefer to explicitly reserve crash memory in the high
region even if it is possible to reserve it in the low area.  Maybe because
<4G memory is limited on large servers, they want to leave it for other
uses.

Yinghai or Vivek should know more about the history; they can probably
recall the initial reason.

> The presence of high is requiring from our users to pay attention what
> to use when the kernel can do all that automatically. Looks like a UI
> fail to me.
> 
> And look at all the flavors of crashkernel= :
> 
> crashkernel=size[KMG][@offset[KMG]]
> crashkernel=range1:size1[,range2:size2,...][@offset]
> crashkernel=size[KMG],high
> crashkernel=size[KMG],low
> 
> We couldn't do one so we made 4?!?!
> 
> What for?
> 
> Nowhere in that help text does it explain why a user would care about
> high or low or range or offset or the planets alignment...
> 
> So what's up?

Good question. There may still be some historical reason, but it is good to
make them clear and rethink them after such a long time.

I also want to understand; I need to dig through the log more.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Thanks
Dave

Re: [PATCH v3 1/3] x86, kexec_file_load: Don't setup EFI info if EFI runtime is not enabled

2019-01-24 Thread Dave Young

On 01/18/19 at 07:13pm, Kairui Song wrote:
> Currently with "efi=noruntime" in kernel command line, calling
> kexec_file_load will raise below problem:
> 
> [   97.967067] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [   97.967894] #PF error: [normal kernel read fault]
> ...
> [   97.980456] Call Trace:
> [   97.980724]  efi_runtime_map_copy+0x28/0x30
> [   97.981267]  bzImage64_load+0x688/0x872
> [   97.981794]  arch_kexec_kernel_image_load+0x6d/0x70
> [   97.982441]  kimage_file_alloc_init+0x13e/0x220
> [   97.983035]  __x64_sys_kexec_file_load+0x144/0x290
> [   97.983586]  do_syscall_64+0x55/0x1a0
> [   97.983962]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> When efi runtime is not enabled, efi memmap is not mapped, so just skip
> EFI info setup.
> 
> Suggested-by: Dave Young 
> Signed-off-by: Kairui Song 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 2c007abd3d40..097f52fb02e3 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -167,6 +167,9 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
>   struct efi_info *current_ei = &boot_params.efi_info;
>   struct efi_info *ei = &params->efi_info;
>  
> + if (!efi_enabled(EFI_RUNTIME_SERVICES))
> + return 0;
> +
>   if (!current_ei->efi_memmap_size)
>   return 0;
>  
> -- 
> 2.20.1
> 

Patch 1/3 looks good to me. Patches 2-3 should depend on Chao's early RSDP
parsing, according to Boris.

Acked-by: Dave Young 

Thanks
Dave
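
The guard in the quoted patch can be sketched in userspace roughly as below.
This is only an illustration under stated assumptions: every demo_* name and
the flag values are hypothetical stand-ins, not the kernel's efi_enabled() or
setup_efi_state(). The point is just that the runtime-services check comes
first, so the memmap is never touched when it was not mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bits, loosely modelled on the kernel's EFI_* flags. */
enum demo_efi_feature {
	DEMO_EFI_BOOT             = 1u << 0,
	DEMO_EFI_RUNTIME_SERVICES = 1u << 1,
};

struct demo_efi_info {
	const void *memmap;      /* NULL when the memmap was never mapped */
	size_t      memmap_size;
};

static bool demo_efi_enabled(unsigned int flags, unsigned int feature)
{
	return (flags & feature) != 0;
}

/*
 * Hand the current EFI info over to the next kernel.
 * Returns the number of memmap bytes handed over, or 0 when setup is skipped.
 */
static size_t demo_setup_efi_state(unsigned int flags,
				   const struct demo_efi_info *current_ei,
				   struct demo_efi_info *ei)
{
	assert(current_ei != NULL && ei != NULL);

	/* Without runtime services the memmap is not mapped: skip early. */
	if (!demo_efi_enabled(flags, DEMO_EFI_RUNTIME_SERVICES))
		return 0;

	if (current_ei->memmap_size == 0)
		return 0;

	*ei = *current_ei;
	return current_ei->memmap_size;
}
```

With only DEMO_EFI_BOOT set, the function returns 0 and leaves *ei untouched,
which mirrors the "just skip EFI info setup" behaviour described above.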

Re: [PATCH v5 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-22 Thread Dave Young

On 01/21/19 at 05:59pm, Kairui Song wrote:
> This patch let kexec_file_load makes use of .platform keyring as fall
> back if it failed to verify a PE signed image against secondary or
> builtin key ring, make it possible to verify kernel image signed with
> preboot keys as well.
> 
> This commit adds a VERIFY_USE_PLATFORM_KEYRING similar to previous
> VERIFY_USE_SECONDARY_KEYRING indicating that verify_pkcs7_signature
> should verify the signature using platform keyring. Also, decrease
> the error message log level when verification failed with -ENOKEY,
> so that if called tried multiple time with different keyring it
> won't generate extra noises.
> 
> Signed-off-by: Kairui Song 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 13 ++---
>  certs/system_keyring.c| 13 -
>  include/linux/verification.h  |  1 +
>  3 files changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 7d97e432cbbc..2c007abd3d40 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
>  #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
>  static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
>  {
> - return verify_pefile_signature(kernel, kernel_len,
> -VERIFY_USE_SECONDARY_KEYRING,
> -VERIFYING_KEXEC_PE_SIGNATURE);
> + int ret;
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_SECONDARY_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING)) {
> + ret = verify_pefile_signature(kernel, kernel_len,
> +   VERIFY_USE_PLATFORM_KEYRING,
> +   VERIFYING_KEXEC_PE_SIGNATURE);
> + }
> + return ret;
>  }
>  #endif
>  
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index 4690ef9cda8a..7085c286f4bd 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -240,11 +240,22 @@ int verify_pkcs7_signature(const void *data, size_t len,
>  #else
>   trusted_keys = builtin_trusted_keys;
>  #endif
> + } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
> +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> + trusted_keys = platform_trusted_keys;
> +#else
> + trusted_keys = NULL;
> +#endif
> + if (!trusted_keys) {
> + ret = -ENOKEY;
> + pr_devel("PKCS#7 platform keyring is not available\n");
> + goto error;
> + }
>   }
>   ret = pkcs7_validate_trust(pkcs7, trusted_keys);
>   if (ret < 0) {
>   if (ret == -ENOKEY)
> - pr_err("PKCS#7 signature not signed with a trusted key\n");
> + pr_devel("PKCS#7 signature not signed with a trusted key\n");
>   goto error;
>   }
>  
> diff --git a/include/linux/verification.h b/include/linux/verification.h
> index cfa4730d607a..018fb5f13d44 100644
> --- a/include/linux/verification.h
> +++ b/include/linux/verification.h
> @@ -17,6 +17,7 @@
>   * should be used.
>   */
>  #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
> +#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)
>  
>  /*
>   * The use to which an asymmetric key is being put.
> -- 
> 2.20.1
> 

For the kexec_file part:

Acked-by: Dave Young 

Thanks
Dave

Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Dave Young

On 01/18/19 at 08:34pm, Dave Young wrote:
> On 01/18/19 at 06:53am, Mimi Zohar wrote:
> > On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > > This patch series adds a .platform_trusted_keys in system_keyring as the
> > > reference to .platform keyring in integrity subsystem, when platform
> > > keyring is being initialized it will be updated. So other component could
> > > use this keyring as well.
> > 
> > Kairui, when people review patches, the comments could be specific,
> > but are normally generic.  My review included a couple of generic
> > suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> > term "preboot" keys, and remove any references to "other components".
> > 
> > After all the wording suggestions I've made, you are still saying, "So
> > other components could use this keyring as well".  Really?!  How the
> > platform keyring will be used in the future, is up to you and others
> > to convince Linus.  At least for now, please limit its usage to
> > verifying the PE signed kernel image.  If this patch set needs to be
> > reposted, please remove all references to "other components".
> > 
> > Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> > Acked the original post.  Can I include it?  Can we get some
> > additional Ack's on these patches?
> 
> It is better to update patch to use IS_ENABLED in patch 1/2 as well.

Hmm, not only patch 1/2; patch 2/2 also needs an update.

> Other than that, for kexec part I'm fine with an ack.
>  
> Thanks
> Dave

Re: [PATCH v4 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-18 Thread Dave Young

On 01/18/19 at 06:53am, Mimi Zohar wrote:
> On Fri, 2019-01-18 at 17:17 +0800, Kairui Song wrote:
> > This patch series adds a .platform_trusted_keys in system_keyring as the
> > reference to .platform keyring in integrity subsystem, when platform
> > keyring is being initialized it will be updated. So other component could
> > use this keyring as well.
> 
> Kairui, when people review patches, the comments could be specific,
> but are normally generic.  My review included a couple of generic
> suggestions - not to use "#ifdef" in C code (eg. is_enabled), use the
> term "preboot" keys, and remove any references to "other components".
> 
> After all the wording suggestions I've made, you are still saying, "So
> other components could use this keyring as well".  Really?!  How the
> platform keyring will be used in the future, is up to you and others
> to convince Linus.  At least for now, please limit its usage to
> verifying the PE signed kernel image.  If this patch set needs to be
> reposted, please remove all references to "other components".
> 
> Dave/David, are you ok with Kairui's usage of "#ifdef's"?  Dave, you
> Acked the original post.  Can I include it?  Can we get some
> additional Ack's on these patches?

It is better to update the patch to use IS_ENABLED in patch 1/2 as well.
Other than that, for the kexec part I'm fine with an ack.
 
Thanks
Dave
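
To show what the "#ifdef" versus IS_ENABLED suggestion buys, here is a small
userspace imitation of the macro trick. It is a sketch under assumptions: the
DEMO_* macros and CONFIG_DEMO_* options are hypothetical, and the real
IS_ENABLED() in the kernel also treats =m options as enabled, which this
version does not. The idea is that the option becomes a plain 0/1 expression,
so both branches are still parsed and type-checked by the compiler.

```c
#include <stddef.h>

/* Expands to "0," only when the option is defined to 1. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND_ARG(ignored, val, ...) val
#define DEMO_IS_DEFINED_3(arg1_or_junk) DEMO_TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED_2(val) DEMO_IS_DEFINED_3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED_1(x) DEMO_IS_DEFINED_2(x)

/* Evaluates to 1 if the option is defined to 1, otherwise to 0. */
#define DEMO_IS_ENABLED(option) DEMO_IS_DEFINED_1(option)

/* Hypothetical config: one option on, the other deliberately undefined. */
#define CONFIG_DEMO_PLATFORM_KEYRING 1
/* CONFIG_DEMO_MISSING_FEATURE is not defined anywhere. */

/* Pick a keyring name without any #ifdef inside the function body. */
static const char *demo_pick_keyring(void)
{
	if (DEMO_IS_ENABLED(CONFIG_DEMO_PLATFORM_KEYRING))
		return "platform";
	return NULL; /* mirrors the "keyring is not available" path */
}
```

Compared with an #ifdef/#else pair, the disabled branch here is still compiled
and then dropped by the optimizer, so it cannot silently bit-rot.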

Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-17 Thread Dave Young

Pingfan, thanks for the post.

On 01/15/19 at 04:07pm, Pingfan Liu wrote:
> People reported a bug on a high end server with many pcie devices, where
> kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> though we still see much memory under 896 MB, the finding still failed
> intermittently. Because currently we can only find region under 896 MB,
> if without ',high' specified. Then KASLR breaks 896 MB into several parts
> randomly, and crashkernel reservation need be aligned to 128 MB, that's
> why failure is found. It raises confusion to the end user that sometimes
> crashkernel=X works while sometimes fails.
> If want to make it succeed, customer can change kernel option to
> "crashkernel=384M,high". Just this give "crashkernel=xx@yy" a very
> limited space to behave even though its grammar looks more generic.
> And we can't answer questions raised from customer that confidently:
> 1) why it doesn't succeed to reserve 896 MB;
> 2) what's wrong with memory region under 4G;
> 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
> finally above 4G.

The patch log still does not look very good.  It needs some cleanup, such as
paragraph breaks, to make it more readable.

For example, you could use something like the below:
--
People reported that crashkernel=384M reservation failed on a high end server
with KASLR enabled.  In that case there is enough free memory under 896M,
but the crashkernel reservation still fails intermittently.

The crashkernel reservation code only looks for a free region under 896 MB,
aligned to 128M, when ',high' is not used.  KASLR can break the first 896M
into several parts randomly, and that is how the failure happens.  The user
has no way to predict this or to make sure crashkernel=xM works unless
'crashkernel=xM,high' is used.  Since 'crashkernel=xM' is the most common
use case, this issue is a serious bug.

And we can't answer the questions raised by customers:
1) why it doesn't succeed to reserve 896 MB;
2) what's wrong with the memory region under 4G;
3) why I have to add ',high', I only require 384 MB, not 3840 MB.

This patch tries to get a memory region below 896 MB first, then in
[896MB,4G], and finally above 4G.

> Dave Young sent the original post, and I just re-post it with commit log
> improvement as his requirement.
> http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> There was an old discussion below (previously posted by Chao Wang):
> https://lkml.org/lkml/2013/10/15/601

I hope someone else can provide a review, because I posted it previously.

When I posted it previously it was a good-to-have improvement, but now it
is a really serious bug which needs to be fixed.  I can review and ack if
you repost with a better log.

> 
> Signed-off-by: Pingfan Liu 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: Randy Dunlap 
> ---
> v6 -> v7: fix spelling mistake pointed out by Randy
>  arch/x86/kernel/setup.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 3d872a5..fa62c81 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
>   high ? CRASH_ADDR_HIGH_MAX
>: CRASH_ADDR_LOW_MAX,
>   crash_size, CRASH_ALIGN);
> +#ifdef CONFIG_X86_64
> + /*
> +  * crashkernel=X reserve below 896M fails? Try below 4G
> +  */
> + if (!high && !crash_base)
> + crash_base = memblock_find_in_range(CRASH_ALIGN,
> + (1ULL << 32),
> + crash_size, CRASH_ALIGN);
> + /*
> +  * crashkernel=X reserve below 4G fails? Try MAXMEM
> +  */
> + if (!high && !crash_base)
> + crash_base = memblock_find_in_range(CRASH_ALIGN,
> + CRASH_ADDR_HIGH_MAX,
> + crash_size, CRASH_ALIGN);
> +#endif
>   if (!crash_base) {
>   pr_info("crashkernel reservation failed - No suitable area found.\n");
>   return;
> -- 
> 2.7.4
> 

Thanks
Dave
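
The fallback order in the quoted patch (below 896M, then below 4G, then up to
the high limit) can be tried out in userspace with the sketch below. It rests
on several assumptions: all demo_* names are hypothetical, the search is a
simple first-fit scan rather than what memblock_find_in_range() really does,
the 128M alignment is the figure from the log above, and the free map is made
up to mimic KASLR fragmenting low memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define GiB (1024ULL * MiB)

/* Hypothetical stand-ins; the real kernel constants depend on the config. */
#define DEMO_CRASH_ALIGN   (128 * MiB)   /* alignment figure from the log */
#define DEMO_ADDR_HIGH_MAX (64ULL << 40) /* made-up upper bound (64 TiB) */

struct demo_range {
	uint64_t start; /* inclusive */
	uint64_t end;   /* exclusive */
};

/* Made-up free map: three fragments below 896M, one big range below 4G. */
static const struct demo_range demo_free_map[] = {
	{  16 * MiB, 200 * MiB },
	{ 300 * MiB, 500 * MiB },
	{ 600 * MiB, 896 * MiB },
	{   1 * GiB,   3 * GiB },
};

static uint64_t demo_align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) / align * align;
}

/* First-fit search in [lo, hi); returns 0 when nothing suitable is free. */
static uint64_t demo_find_in_range(const struct demo_range *free_map, size_t n,
				   uint64_t lo, uint64_t hi,
				   uint64_t size, uint64_t align)
{
	assert(align != 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t start = free_map[i].start > lo ? free_map[i].start : lo;
		uint64_t end = free_map[i].end < hi ? free_map[i].end : hi;
		uint64_t base = demo_align_up(start, align);

		if (base < end && end - base >= size)
			return base;
	}
	return 0;
}

/* Try each limit in turn, as the patch does for plain crashkernel=X. */
static uint64_t demo_reserve_crashkernel(const struct demo_range *free_map,
					 size_t n, uint64_t size)
{
	static const uint64_t limits[] = { 896 * MiB, 4 * GiB, DEMO_ADDR_HIGH_MAX };

	for (size_t i = 0; i < sizeof(limits) / sizeof(limits[0]); i++) {
		uint64_t base = demo_find_in_range(free_map, n, DEMO_CRASH_ALIGN,
						   limits[i], size, DEMO_CRASH_ALIGN);
		if (base)
			return base;
	}
	return 0;
}
```

With this map, a 384M request finds nothing below 896M even though well over
384M is free there in total, and only succeeds once the limit is raised to 4G.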

Re: [PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-17 Thread Dave Young

On 01/18/19 at 09:35am, Dave Young wrote:
> On 01/17/19 at 08:08pm, Mimi Zohar wrote:
> > On Wed, 2019-01-16 at 18:16 +0800, Kairui Song wrote:
> > > This patch series adds a .platform_trusted_keys in system_keyring as the
> > > reference to .platform keyring in integrity subsystem, when platform
> > > keyring is being initialized it will be updated. So other component could
> > > use this keyring as well.
> > 
> > Remove "other component could use ...".
> > > 
> > > This patch series also let kexec_file_load use platform keyring as fall
> > > back if it failed to verify the image against secondary keyring, make it
> > > possible to load kernel signed by third part key if third party key is
> > > imported in the firmware.
> > 
> > This is the only reason for these patches.  Please remove "also".
> > 
> > > 
> > > After this patch kexec_file_load will be able to verify a signed PE
> > > bzImage using keys in platform keyring.
> > > 
> > > Tested in a VM with locally signed kernel with pesign and imported the
> > > cert to EFI's MokList variable.
> > 
> > It's taken so long for me to review/test this patch set due to a
> > regression in sanity_check_segment_list(), introduced somewhere
> > between 4.20 and 5.0.0-rc1.  The sgement overlap test - "if ((mend >
> > pstart) && (mstart < pend))" - fails, returning a -EINVAL.
> > 
> > Is anyone else seeing this?
> 
> Mimi, should be this issue?  I have sent a fix for that.
> https://lore.kernel.org/lkml/20181228011247.ga9...@dhcp-128-65.nay.redhat.com/

Hi Kairui, I think you should have known about this while working on this
series. It is good to mention such a test dependency in the cover letter so
that reviewers can save time.

BTW, Boris took it in tip already:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=993a110319a4a60aadbd02f6defdebe048f7773b

> 
> Thanks
> Dave

Re: [PATCH v3 0/2] let kexec_file_load use platform keyring to verify the kernel image

2019-01-17 Thread Dave Young

On 01/17/19 at 08:08pm, Mimi Zohar wrote:
> On Wed, 2019-01-16 at 18:16 +0800, Kairui Song wrote:
> > This patch series adds a .platform_trusted_keys in system_keyring as the
> > reference to .platform keyring in integrity subsystem, when platform
> > keyring is being initialized it will be updated. So other component could
> > use this keyring as well.
> 
> Remove "other component could use ...".
> > 
> > This patch series also let kexec_file_load use platform keyring as fall
> > back if it failed to verify the image against secondary keyring, make it
> > possible to load kernel signed by third part key if third party key is
> > imported in the firmware.
> 
> This is the only reason for these patches.  Please remove "also".
> 
> > 
> > After this patch kexec_file_load will be able to verify a signed PE
> > bzImage using keys in platform keyring.
> > 
> > Tested in a VM with locally signed kernel with pesign and imported the
> > cert to EFI's MokList variable.
> 
> It's taken so long for me to review/test this patch set due to a
> regression in sanity_check_segment_list(), introduced somewhere
> between 4.20 and 5.0.0-rc1.  The sgement overlap test - "if ((mend >
> pstart) && (mstart < pend))" - fails, returning a -EINVAL.
> 
> Is anyone else seeing this?

Mimi, could it be this issue?  I have sent a fix for that.
https://lore.kernel.org/lkml/20181228011247.ga9...@dhcp-128-65.nay.redhat.com/

Thanks
Dave
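
As a side note, the overlap condition Mimi quotes is the usual intersection
test for half-open ranges. A minimal sketch, assuming a hypothetical
demo_segment struct rather than the real struct kexec_segment, and saying
nothing about what actually caused the regression:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical segment: occupies [mem, mem + memsz). */
struct demo_segment {
	unsigned long mem;
	unsigned long memsz;
};

/* True when the two half-open ranges share at least one byte. */
static bool demo_segments_overlap(const struct demo_segment *a,
				  const struct demo_segment *b)
{
	unsigned long mstart, mend, pstart, pend;

	assert(a != NULL && b != NULL);
	mstart = a->mem;
	mend = mstart + a->memsz;
	pstart = b->mem;
	pend = pstart + b->memsz;

	return (mend > pstart) && (mstart < pend);
}
```

Segments that merely touch (one ends exactly where the next begins) do not
count as overlapping under this test.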

Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-17 Thread Dave Young

+ linux-acpi list
On 01/17/19 at 03:49pm, Chao Fan wrote:
> On Thu, Jan 17, 2019 at 03:41:13PM +0800, Kairui Song wrote:
> >On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov  wrote:
> >>
> >> On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> >> > I didn't see a way to reuse things in that patch series, situation is
> >> > different, in that patch it needs to get RSDP in very early boot stage
> >> > so it did everything from scratch, in this patch kexec_file_load need
> >> > to get RSDP too, but everything is well setup so things are a lot
> >> > easier, just read from current boot_prams, efi and fallback to
> >> > acpi_find_root_pointer should be good.
> >>
> >> No no. Early code should find out that venerable RSDP thing once and
> >> will save it somewhere for further use. No gazillion parsings of it.
> >> Just once and share it with the rest of the code that needs it.
> >>
> >
> >How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> >in early code, so it could be used as a reliable RSDP address source?
> >That should make things easier.
> 
> I think it's OK.
> Try to read it, if get RSDP, use it.
> If not, search in EFI/BIOS/... and refill the RSDP to
> boot_params.acpi_rsdp_addr.
> By the way, I search kernel code, I didn't find other code fill and
> use it, only you(KEXEC) are trying to fill it.
> If I miss something, please let me know.
> 
> Thanks,
> Chao Fan
> 
> >
> >But if early code should parse it and store it should be done in
> >Chao's patch, or I can post another patch to do it if Chao's patch is
> >merged.
> >
> >For now I think good to have something like this in this patch series
> >to always keep storing acpi_rsdp in late code,
> >acpi_os_get_root_pointer_late (maybe comeup with a better name later)
> >could be used anytime to get RSDP and no extra parsing:
> >
> >--- a/drivers/acpi/osl.c
> >+++ b/drivers/acpi/osl.c
> >@@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
> > #endif
> > }
> >
> >-#ifdef CONFIG_KEXEC
> > static unsigned long acpi_rsdp;
> >+#ifdef CONFIG_KEXEC
> > static int __init setup_acpi_rsdp(char *arg)
> > {
> >return kstrtoul(arg, 16, &acpi_rsdp);
> >@@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
> > early_param("acpi_rsdp", setup_acpi_rsdp);
> > #endif
> >
> >+acpi_physical_address acpi_os_get_root_pointer_late(void) {
> >+   return acpi_rsdp;
> >+}
> >+
> > acpi_physical_address __init acpi_os_get_root_pointer(void)
> > {
> >acpi_physical_address pa;
> >
> >-#ifdef CONFIG_KEXEC
> >if (acpi_rsdp)
> >return acpi_rsdp;
> >-#endif
> >+
> >pa = acpi_arch_get_root_pointer();
> >-   if (pa)
> >+   if (pa) {
> >+   acpi_rsdp = pa;
> >return pa;
> >+   }
> >
> >if (efi_enabled(EFI_CONFIG_TABLES)) {
> >-   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> >+   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
> >+   acpi_rsdp = efi.acpi20;
> >return efi.acpi20;
> >-   if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> >+   }
> >+   if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
> >+   acpi_rsdp = efi.acpi;
> >return efi.acpi;
> >+   }
> >pr_err(PREFIX "System description tables not found\n");
> >} else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> >acpi_find_root_pointer(&pa);
> >}
> >
> > +   acpi_rsdp = pa;
> >return pa;
> > }
> >
> >> --
> >> Regards/Gruss,
> >> Boris.
> >>
> >> Good mailing practices for 400: avoid top-posting and trim the reply.
> >--
> >Best Regards,
> >Kairui Song
> >
> >
> 
>

Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-17 Thread Dave Young

Add linux-acpi list
On 01/17/19 at 03:41pm, Kairui Song wrote:
> On Wed, Jan 16, 2019 at 5:46 PM Borislav Petkov  wrote:
> >
> > On Wed, Jan 16, 2019 at 03:08:42PM +0800, Kairui Song wrote:
> > > I didn't see a way to reuse things in that patch series, situation is
> > > different, in that patch it needs to get RSDP in very early boot stage
> > > so it did everything from scratch, in this patch kexec_file_load need
> > > to get RSDP too, but everything is well setup so things are a lot
> > > easier, just read from current boot_prams, efi and fallback to
> > > acpi_find_root_pointer should be good.
> >
> > No no. Early code should find out that venerable RSDP thing once and
> > will save it somewhere for further use. No gazillion parsings of it.
> > Just once and share it with the rest of the code that needs it.
> >
> 
> How about we refill the boot_params.acpi_rsdp_addr if it is not valid
> in early code, so it could be used as a reliable RSDP address source?
> That should make things easier.
> 
> But if early code should parse it and store it should be done in
> Chao's patch, or I can post another patch to do it if Chao's patch is
> merged.
> 
> For now I think good to have something like this in this patch series
> to always keep storing acpi_rsdp in late code,
> acpi_os_get_root_pointer_late (maybe comeup with a better name later)
> could be used anytime to get RSDP and no extra parsing:
> 
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -180,8 +180,8 @@ void acpi_os_vprintf(const char *fmt, va_list args)
>  #endif
>  }
> 
> -#ifdef CONFIG_KEXEC
>  static unsigned long acpi_rsdp;
> +#ifdef CONFIG_KEXEC
>  static int __init setup_acpi_rsdp(char *arg)
>  {
> return kstrtoul(arg, 16, &acpi_rsdp);
> @@ -189,28 +189,38 @@ static int __init setup_acpi_rsdp(char *arg)
>  early_param("acpi_rsdp", setup_acpi_rsdp);
>  #endif
> 
> +acpi_physical_address acpi_os_get_root_pointer_late(void) {
> +   return acpi_rsdp;
> +}
> +
>  acpi_physical_address __init acpi_os_get_root_pointer(void)
>  {
> acpi_physical_address pa;
> 
> -#ifdef CONFIG_KEXEC
> if (acpi_rsdp)
> return acpi_rsdp;
> -#endif
> +
> pa = acpi_arch_get_root_pointer();
> -   if (pa)
> +   if (pa) {
> +   acpi_rsdp = pa;
> return pa;
> +   }
> 
> if (efi_enabled(EFI_CONFIG_TABLES)) {
> -   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> +   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR) {
> +   acpi_rsdp = efi.acpi20;
> return efi.acpi20;
> -   if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> +   }
> +   if (efi.acpi != EFI_INVALID_TABLE_ADDR) {
> +   acpi_rsdp = efi.acpi;
> return efi.acpi;
> +   }
> pr_err(PREFIX "System description tables not found\n");
> } else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> acpi_find_root_pointer(&pa);
> }
> 
>  +   acpi_rsdp = pa;
> return pa;
>  }
> 
> > --
> > Regards/Gruss,
> > Boris.
> >
> > Good mailing practices for 400: avoid top-posting and trim the reply.
> --
> Best Regards,
> Kairui Song

Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-15 Thread Dave Young

On 01/16/19 at 01:09pm, Kairui Song wrote:
> On Wed, Jan 16, 2019 at 11:32 AM Dave Young  wrote:
> >
> > On 01/16/19 at 12:10am, Borislav Petkov wrote:
> > > On Tue, Jan 15, 2019 at 05:58:34PM +0800, Kairui Song wrote:
> > > > When efi=noruntime or efi=oldmap is used, EFI services won't be available
> > > > in the second kernel, therefore the second kernel will not be able to get
> > > > the ACPI RSDP address from firmware by calling EFI services and won't
> > > > boot. Previously we are expecting the user to set the acpi_rsdp=
> > > > on kernel command line for second kernel as there was no way to pass RSDP
> > > > address to second kernel.
> > > >
> > > > After commit e6e094e053af ('x86/acpi, x86/boot: Take RSDP address from
> > > > boot params if available'), now it's possible to set an acpi_rsdp_addr
> > > > parameter in the boot_params passed to second kernel, this commit make
> > > > use of it, detect and set the RSDP address when it's required for second
> > > > kernel to boot.
> > > >
> > > > Tested with an EFI enabled KVM VM with efi=noruntime.
> > > >
> > > > Suggested-by: Dave Young 
> > > > Signed-off-by: Kairui Song 
> > > > ---
> > > >  arch/x86/kernel/kexec-bzimage64.c | 21 +
> > > >  drivers/acpi/acpica/tbxfroot.c|  3 +--
> > > >  include/acpi/acpixf.h |  2 +-
> > > >  3 files changed, 23 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> > > > index 53917a3ebf94..0a90dcbd041f 100644
> > > > --- a/arch/x86/kernel/kexec-bzimage64.c
> > > > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > > > @@ -20,6 +20,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >
> > > >  #include 
> > > >  #include 
> > > > @@ -255,8 +256,28 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params,
> > > > /* Setup EFI state */
> > > > setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> > > > efi_setup_data_offset);
> > > > +
> > > > +#ifdef CONFIG_ACPI
> > > > +   /* Setup ACPI RSDP pointer in case EFI is not available in second kernel */
> > > > +   if (!acpi_disabled && (!efi_enabled(EFI_RUNTIME_SERVICES) || efi_enabled(EFI_OLD_MEMMAP))) {
> > > > +   /* Copied from acpi_os_get_root_pointer accordingly */
> > > > +   params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
> > > > +   if (!params->acpi_rsdp_addr) {
> > > > +   if (efi_enabled(EFI_CONFIG_TABLES)) {
> > > > +   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> > > > +   params->acpi_rsdp_addr = efi.acpi20;
> > > > +   else if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> > > > +   params->acpi_rsdp_addr = efi.acpi;
> > > > +   } else if (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> > > > +   acpi_find_root_pointer(&params->acpi_rsdp_addr);
> > > > +   }
> > > > +   }
> > > > +   if (!params->acpi_rsdp_addr)
> > > > +   pr_warn("RSDP is not available for second kernel\n");
> > > > +   }
> > > >  #endif
> > >
> > > Amazing the amount of ACPI RDSP parsing and fiddling patches flying
> > > around these days...
> > >
> > > In any case, this needs to be synchronized with:
> > >
> > > https://lkml.kernel.org/r/20190107032243.25324-1-fanc.f...@cn.fujitsu.com
> > >
> > > and checked whether the above can be used instead of sprinkling of ACPI
> > > parsing code left and right.
> >
> > Both Baoquan and Chao are cced for comments.
> > The above KASLR patches seems some early code to parsing acpi, but I think in this
> > patch just call acpi function to get the root pointer no need to add the
> > duplicate logic of if/else/else if.
> >
> > Kairui,  do you have any reason for the checking?  Is there some simple
> > acpi function to just return the root pointer?
> 
> Hi, I'm afraid that would require moving multiple structure and
> function out of .init,
> acpi_os_get_root_pointer is an ideal function to do the job, but it's
> in .init and (on x86) it will call x86_init.acpi.get_root_pointer (by
> calling acpi_arch_get_root_pointer) which will also be freed after
> init, unless I change the x86_init, move they out of .init which is
> not a good idea.
> 
> Maybe I could split acpi_os_get_root_pointer into two, one gets freed
> after init, one for later use. But the "acpi_rsdp_addr" trick only
> works for x86, and this would change more common acpi driver code so
> not sure if a good idea.

Can the ACPI root pointer be saved once it has been initialized? If that is
acceptable, then adding a new function to get it would be easier. But we need
an opinion from the ACPI people.

Thanks
Dave
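
What I have in mind is roughly the pattern below: look the pointer up once in
early code, save it, and give later callers a trivial getter. This is a
userspace sketch only; the demo_* names, the call counter and the address are
all made up for illustration, and whether ACPI code should really be
structured this way is for the ACPI people to decide.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_phys_addr;

static demo_phys_addr demo_saved_rsdp; /* 0 means "not found yet" */
static unsigned int demo_lookup_calls; /* counts expensive lookups */

/* Stand-in for the early firmware/EFI/legacy probing; address is made up. */
static demo_phys_addr demo_firmware_lookup(void)
{
	demo_lookup_calls++;
	return 0x7fe14000ULL;
}

/* Early path: probe at most once and remember the result. */
static demo_phys_addr demo_get_root_pointer_early(void)
{
	if (!demo_saved_rsdp)
		demo_saved_rsdp = demo_firmware_lookup();
	return demo_saved_rsdp;
}

/* Late path (e.g. a kexec_file_load caller): no parsing, just the saved value. */
static demo_phys_addr demo_get_root_pointer_late(void)
{
	return demo_saved_rsdp;
}
```

The late getter returns 0 if the early path never ran, so a caller would still
need to handle the "not available" case.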

Re: [PATCH v2 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=old_map

2019-01-15 Thread Dave Young

On 01/16/19 at 12:10am, Borislav Petkov wrote:
> On Tue, Jan 15, 2019 at 05:58:34PM +0800, Kairui Song wrote:
> > When efi=noruntime or efi=oldmap is used, EFI services won't be available
> > in the second kernel, therefore the second kernel will not be able to get
> > the ACPI RSDP address from firmware by calling EFI services and won't
> > boot. Previously we are expecting the user to set the acpi_rsdp=
> > on kernel command line for second kernel as there was no way to pass RSDP
> > address to second kernel.
> > 
> > After commit e6e094e053af ('x86/acpi, x86/boot: Take RSDP address from
> > boot params if available'), now it's possible to set an acpi_rsdp_addr
> > parameter in the boot_params passed to second kernel, this commit make
> > use of it, detect and set the RSDP address when it's required for second
> > kernel to boot.
> > 
> > Tested with an EFI enabled KVM VM with efi=noruntime.
> > 
> > Suggested-by: Dave Young 
> > Signed-off-by: Kairui Song 
> > ---
> >  arch/x86/kernel/kexec-bzimage64.c | 21 +
> >  drivers/acpi/acpica/tbxfroot.c|  3 +--
> >  include/acpi/acpixf.h |  2 +-
> >  3 files changed, 23 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> > b/arch/x86/kernel/kexec-bzimage64.c
> > index 53917a3ebf94..0a90dcbd041f 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -20,6 +20,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -255,8 +256,28 @@ setup_boot_parameters(struct kimage *image, struct 
> > boot_params *params,
> > /* Setup EFI state */
> > setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> > efi_setup_data_offset);
> > +
> > +#ifdef CONFIG_ACPI
> > +   /* Setup ACPI RSDP pointer in case EFI is not available in second 
> > kernel */
> > +   if (!acpi_disabled && (!efi_enabled(EFI_RUNTIME_SERVICES) || 
> > efi_enabled(EFI_OLD_MEMMAP))) {
> > +   /* Copied from acpi_os_get_root_pointer accordingly */
> > +   params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
> > +   if (!params->acpi_rsdp_addr) {
> > +   if (efi_enabled(EFI_CONFIG_TABLES)) {
> > +   if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> > +   params->acpi_rsdp_addr = efi.acpi20;
> > +   else if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> > +   params->acpi_rsdp_addr = efi.acpi;
> > +   } else if 
> > (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> > +   acpi_find_root_pointer(&params->acpi_rsdp_addr);
> > +   }
> > +   }
> > +   if (!params->acpi_rsdp_addr)
> > +   pr_warn("RSDP is not available for second kernel\n");
> > +   }
> >  #endif
> 
> Amazing the amount of ACPI RDSP parsing and fiddling patches flying
> around these days...
> 
> In any case, this needs to be synchronized with:
> 
> https://lkml.kernel.org/r/20190107032243.25324-1-fanc.f...@cn.fujitsu.com
> 
> and checked whether the above can be used instead of sprinkling of ACPI
> parsing code left and right.

Both Baoquan and Chao are cc'ed for comments.
The KASLR patches above seem to be early code for parsing ACPI, but I think
this patch can just call an ACPI function to get the root pointer; there is
no need to duplicate the if/else if logic here.

Kairui, do you have any reason for the checks? Is there some simple
ACPI function that just returns the root pointer?
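
For reference, the lookup order that the quoted hunk open-codes can be
written as one helper. This is only a userspace sketch under assumptions:
struct rsdp_sources and pick_rsdp() are hypothetical names, and
INVALID_TABLE_ADDR merely stands in for EFI_INVALID_TABLE_ADDR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for EFI_INVALID_TABLE_ADDR (assumed all-ones here). */
#define INVALID_TABLE_ADDR (~0ULL)

/* Hypothetical snapshot of the inputs the quoted hunk consults. */
struct rsdp_sources {
	uint64_t boot_params_rsdp;  /* boot_params.acpi_rsdp_addr, 0 if unset */
	bool     efi_config_tables; /* efi_enabled(EFI_CONFIG_TABLES) */
	uint64_t efi_acpi20;        /* efi.acpi20 */
	uint64_t efi_acpi;          /* efi.acpi */
	uint64_t legacy_scan;       /* legacy scan result, 0 if none/disabled */
};

/* Returns the RSDP address to pass on, or 0 if none is available. */
uint64_t pick_rsdp(const struct rsdp_sources *s)
{
	/* 1. An address already handed over in boot_params wins. */
	if (s->boot_params_rsdp)
		return s->boot_params_rsdp;

	/* 2. With EFI config tables, prefer ACPI 2.0, then ACPI 1.0. */
	if (s->efi_config_tables) {
		if (s->efi_acpi20 != INVALID_TABLE_ADDR)
			return s->efi_acpi20;
		if (s->efi_acpi != INVALID_TABLE_ADDR)
			return s->efi_acpi;
		return 0;
	}

	/* 3. Otherwise fall back to the legacy table scan. */
	return s->legacy_scan;
}
```

If an existing ACPI function already returns the same answer after init,
the whole chain could be dropped from the kexec code.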

> 
> Thx.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Thanks
Dave

[tip:x86/urgent] x86/kexec: Fix a kexec_file_load() failure

2019-01-15 Thread tip-bot for Dave Young

Commit-ID:  993a110319a4a60aadbd02f6defdebe048f7773b
Gitweb: https://git.kernel.org/tip/993a110319a4a60aadbd02f6defdebe048f7773b
Author: Dave Young 
AuthorDate: Fri, 28 Dec 2018 09:12:47 +0800
Committer:  Borislav Petkov 
CommitDate: Tue, 15 Jan 2019 12:12:50 +0100

x86/kexec: Fix a kexec_file_load() failure

Commit

  b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")

changed the behavior of kexec_locate_mem_hole(): it will try to allocate
free memory only when kbuf.mem is initialized to zero.

However, x86's kexec_file_load() implementation reuses a struct
kexec_buf allocated on the stack and its kbuf.mem member gets set by
each kexec_add_buffer() invocation.

The second kexec_add_buffer() will reuse the same kbuf but not
reinitialize kbuf.mem.

Therefore, explicitly reset kbuf.mem each time in order for
kexec_locate_mem_hole() to locate a free memory region each time.

 [ bp: massage commit message. ]

Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
Signed-off-by: Dave Young 
Signed-off-by: Borislav Petkov 
Acked-by: Baoquan He 
Cc: "Eric W. Biederman" 
Cc: "H. Peter Anvin" 
Cc: AKASHI Takahiro 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Martin Schwidefsky 
Cc: Philipp Rudo 
Cc: Thomas Gleixner 
Cc: Vivek Goyal 
Cc: Yannik Sembritzki 
Cc: Yi Wang 
Cc: ke...@lists.infradead.org
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20181228011247.ga9...@dhcp-128-65.nay.redhat.com
---
 arch/x86/kernel/crash.c   | 1 +
 arch/x86/kernel/kexec-bzimage64.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c8b07d8ea5a2..17ffc869cab8 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -470,6 +470,7 @@ int crash_load_segments(struct kimage *image)
 
 	kbuf.memsz = kbuf.bufsz;
 	kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret) {
 		vfree((void *)image->arch.elf_headers);
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 278cd07228dd..0d5efa34f359 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	kbuf.memsz = PAGE_ALIGN(header->init_size);
 	kbuf.buf_align = header->kernel_alignment;
 	kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
 		goto out_free_params;
@@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	kbuf.bufsz = kbuf.memsz = initrd_len;
 	kbuf.buf_align = PAGE_SIZE;
 	kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
 		goto out_free_params;

Re: [PATCH 2/2] x86, kexec_file_load: make it work with efi=noruntime or efi=oldmap

2019-01-15 Thread Dave Young

On 01/09/19 at 02:47pm, Kairui Song wrote:
> When efi=noruntime or efi=oldmap is used, EFI services won't be available
> in the second kernel, therefore the second kernel will not be able to get
> the ACPI RSDP address from firmware by calling EFI services and won't
> boot. Previously we are expecting the user to set the acpi_rsdp=
> on kernel command line for second kernel as there was no way to pass RSDP
> address to second kernel.
> 
> After commit e6e094e053af ('x86/acpi, x86/boot: Take RSDP address from
> boot params if available'), now it's possible to set a acpi_rsdp_addr
> parameter in the boot_params passed to second kernel, this commit make
> use of it, detect and set the RSDP address when it's required for second
> kernel to boot.
> 
> Tested with an EFI enabled KVM VM with efi=noruntime.
> 
> Suggested-by: Dave Young 
> Signed-off-by: Kairui Song 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 21 +
>  drivers/acpi/acpica/tbxfroot.c|  3 +--
>  include/acpi/acpixf.h |  2 +-
>  3 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 53917a3ebf94..0a90dcbd041f 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -255,8 +256,28 @@ setup_boot_parameters(struct kimage *image, struct 
> boot_params *params,
>   /* Setup EFI state */
>   setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
>   efi_setup_data_offset);
> +
> +#ifdef CONFIG_ACPI
> + /* Setup ACPI RSDP pointer in case EFI is not available in second 
> kernel */
> + if (!acpi_disabled && (!efi_enabled(EFI_RUNTIME_SERVICES) || 
> efi_enabled(EFI_OLD_MEMMAP))) {
> + /* Copied from acpi_os_get_root_pointer accordingly */
> + params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
> + if (!params->acpi_rsdp_addr) {
> + if (efi_enabled(EFI_CONFIG_TABLES)) {
> + if (efi.acpi20 != EFI_INVALID_TABLE_ADDR)
> + params->acpi_rsdp_addr = efi.acpi20;
> + else if (efi.acpi != EFI_INVALID_TABLE_ADDR)
> + params->acpi_rsdp_addr = efi.acpi;
> + } else if 
> (IS_ENABLED(CONFIG_ACPI_LEGACY_TABLES_LOOKUP)) {
> + acpi_find_root_pointer(&params->acpi_rsdp_addr);
> + }
> + }
> + if (!params->acpi_rsdp_addr)
> + pr_warn("RSDP is not available for second kernel\n");
> + }
>  #endif
>  
> +#endif
>   /* Setup EDD info */
>   memcpy(params->eddbuf, boot_params.eddbuf,
>   EDDMAXNR * sizeof(struct edd_info));
> diff --git a/drivers/acpi/acpica/tbxfroot.c b/drivers/acpi/acpica/tbxfroot.c
> index 483d0ce5180a..dac1e34a931c 100644
> --- a/drivers/acpi/acpica/tbxfroot.c
> +++ b/drivers/acpi/acpica/tbxfroot.c
> @@ -108,8 +108,7 @@ acpi_status acpi_tb_validate_rsdp(struct acpi_table_rsdp 
> *rsdp)
>   *
>   
> **/
>  
> -acpi_status ACPI_INIT_FUNCTION
> -acpi_find_root_pointer(acpi_physical_address *table_address)
> +acpi_status acpi_find_root_pointer(acpi_physical_address *table_address)
>  {
>   u8 *table_ptr;
>   u8 *mem_rover;
> diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
> index 7aa38b648564..869d75ecaf7d 100644
> --- a/include/acpi/acpixf.h
> +++ b/include/acpi/acpixf.h
> @@ -474,7 +474,7 @@ ACPI_EXTERNAL_RETURN_STATUS(acpi_status ACPI_INIT_FUNCTION
>  ACPI_EXTERNAL_RETURN_STATUS(acpi_status ACPI_INIT_FUNCTION
>   acpi_reallocate_root_table(void))
>  
> -ACPI_EXTERNAL_RETURN_STATUS(acpi_status ACPI_INIT_FUNCTION
> +ACPI_EXTERNAL_RETURN_STATUS(acpi_status
>   acpi_find_root_pointer(acpi_physical_address
>  *rsdp_address))
>  ACPI_EXTERNAL_RETURN_STATUS(acpi_status
> -- 
> 2.20.1
> 

Kairui, thanks for the patches. I tested them and they work for me.

It seems the two patches are not in one thread; can you resend them together?

Dave

Re: [PATCH V2] x86/kexec: fix a kexec_file_load failure

2019-01-14 Thread Dave Young

On 12/28/18 at 09:12am, Dave Young wrote:
> The code cleanup mentioned in Fixes tag changed the behavior of
> kexec_locate_mem_hole.  The kexec_locate_mem_hole will try to
> allocate free memory only when kbuf.mem is initialized as zero.
> 
> But in x86 kexec_file_load implementation there are a few places
> the kbuf.mem is reused like below:
>   /* kbuf initialized, kbuf.mem = 0 */
>   ...
>   kexec_add_buffer()
>   ...
>   kexec_add_buffer()
> 
>   The second kexec_add_buffer will reuse previous kbuf but not
>   reinitialize the kbuf.mem.
> 
> Thus kexec_file_load failed because the sanity check failed.
> 
> So explicitly reset kbuf.mem to fix the issue.
> 
> Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> Signed-off-by: Dave Young 
> Cc: 
> ---
> V1 -> V2: use KEXEC_BUF_MEM_UNKNOWN in code.
>  arch/x86/kernel/crash.c   | 1 +
>  arch/x86/kernel/kexec-bzimage64.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index f631a3f15587..6b7890c7889b 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
>  
>   kbuf.memsz = kbuf.bufsz;
>   kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret) {
>   vfree((void *)image->arch.elf_headers);
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 278cd07228dd..0d5efa34f359 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>   kbuf.memsz = PAGE_ALIGN(header->init_size);
>   kbuf.buf_align = header->kernel_alignment;
>   kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> @@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>   kbuf.bufsz = kbuf.memsz = initrd_len;
>   kbuf.buf_align = PAGE_SIZE;
>   kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> -- 
> 2.17.0
> 

Andrew, Boris, can either of you take this patch? Without this fix we have a
regression.
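
For anyone who wants to see the failure mode in isolation, below is a
reduced userspace sketch. It is not the kernel code: struct buf,
locate_mem_hole(), add_buffer() and the overlap check are simplified
stand-ins, and the toy allocator is an assumption. It only shows why a
stale mem value makes the second call fail and why resetting it helps.

```c
#include <errno.h>

/* Mirrors KEXEC_BUF_MEM_UNKNOWN, which is 0. */
#define BUF_MEM_UNKNOWN 0UL

/* Hypothetical, heavily reduced stand-in for struct kexec_buf. */
struct buf {
	unsigned long mem;   /* chosen load address, or BUF_MEM_UNKNOWN */
	unsigned long memsz; /* size of the segment */
};

static unsigned long next_free = 0x100000; /* toy allocator cursor */
static unsigned long last_start, last_end; /* previously added segment */

/* Only searches for a hole when the caller has not fixed an address. */
static int locate_mem_hole(struct buf *b)
{
	if (b->mem != BUF_MEM_UNKNOWN)
		return 0; /* looks caller-provided: keep it as is */
	b->mem = next_free;
	next_free += b->memsz;
	return 0;
}

/* Toy sanity check: reject a segment overlapping the previous one. */
int add_buffer(struct buf *b)
{
	int ret = locate_mem_hole(b);

	if (ret)
		return ret;
	if (b->mem < last_end && b->mem + b->memsz > last_start)
		return -EINVAL;
	last_start = b->mem;
	last_end = b->mem + b->memsz;
	return 0;
}
```

Calling add_buffer() twice on the same struct buf without touching mem
trips the overlap check on the second call; setting mem back to
BUF_MEM_UNKNOWN before the second call lets it find a fresh hole.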

Thanks
Dave

Re: [RFC PATCH 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-14 Thread Dave Young

On 01/14/19 at 11:10am, Mimi Zohar wrote:
> On Sun, 2019-01-13 at 09:39 +0800, Dave Young wrote:
> > Hi,
> > 
> > On 01/11/19 at 11:13am, Mimi Zohar wrote:
> > > On Fri, 2019-01-11 at 21:43 +0800, Dave Young wrote:
> > > [snip]
> > > 
> > > > Personally I would like to see platform key separated from integrity.
> > > > But for the kexec_file part I think it is good at least it works with
> > > > this fix.
> > > > 
> > > > Acked-by: Dave Young 
> > > 
> > > The original "platform" keyring patches that Nayna posted multiple
> > > times were in the certs directory, but nobody commented/responded.  So
> > > she reworked the patches, moving them to the integrity directory and
> > > posted them (cc'ing the kexec mailing list).  It's a bit late to be
> > > asking to move it, isn't it?
> > 
> > Hmm, apologies for being late, I did not get a chance to look at the
> > old series.  Since we have the need now, it should still be fine.
> > 
> > Maybe Kairui can check Nayna's old series and see if he can do something
> > again?
> 
> Whether the platform keyring is defined in certs/ or in integrity/ the
> keyring id needs to be accessible to the other, without making the
> keyring id global.  Moving where the platform keyring is defined is
> not the problem.

Agreed, but I just feel that having kexec depend on IMA does not sound good.

> 
> Commit a210fd32a46b ("kexec: add call to LSM hook in original
> kexec_load syscall") introduced a new LSM hook.  Assuming
> CONFIG_KEXEC_VERIFY_SIG is enabled, with commit b5ca117365d9 ("ima:
> prevent kexec_load syscall based on runtime secureboot flag") we can
> now block the kexec_load syscall.  Without being able to block the
> kexec_load syscall, verifying the kexec image signature via the
> kexec_file_load syscall is kind of pointless.
> 
> Unless you're planning on writing an LSM to prevent the kexec_load
> syscall, I assume you'll want to enable integrity anyway.

Users can disable kexec_load in the kernel config and only allow
kexec_file_load.  But yes, this can be improved separately for the case
where IMA is not enabled.

For the time being we can live with it and fix it the way this series does.

> 
> Mimi
> 

Thanks
Dave

Re: [RFC PATCH 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-12 Thread Dave Young

Hi,

On 01/11/19 at 11:13am, Mimi Zohar wrote:
> On Fri, 2019-01-11 at 21:43 +0800, Dave Young wrote:
> [snip]
> 
> > Personally I would like to see platform key separated from integrity.
> > But for the kexec_file part I think it is good at least it works with
> > this fix.
> > 
> > Acked-by: Dave Young 
> 
> The original "platform" keyring patches that Nayna posted multiple
> times were in the certs directory, but nobody commented/responded.  So
> she reworked the patches, moving them to the integrity directory and
> posted them (cc'ing the kexec mailing list).  It's a bit late to be
> asking to move it, isn't it?

Hmm, apologies for being late, I did not get a chance to look at the
old series.  Since we have the need now, it should still be fine.

Maybe Kairui can check Nayna's old series and see if he can do something
again?

> 
> Mimi
> 

Thanks
Dave

Re: [RFC PATCH 2/2] kexec, KEYS: Make use of platform keyring for signature verify

2019-01-11 Thread Dave Young

On 01/10/19 at 12:48am, Kairui Song wrote:
> kexec_file_load will need to verify the kernel signed with third-party
> keys, and the keys could be stored in firmware, then loaded into
> the .platform keyring. Now we have a .platform_trusted_keyring
> as the reference to the .platform keyring; this patch makes use of it and
> allows kexec_file_load to verify the image against keys in the .platform
> keyring.
> 
> This commit adds a VERIFY_USE_PLATFORM_KEYRING similar to the previous
> VERIFY_USE_SECONDARY_KEYRING, indicating that verify_pkcs7_signature
> should verify the signature using the platform keyring. Also, decrease
> the log level of the error message when verification fails with -ENOKEY,
> so that a caller that tries multiple keyrings in turn does not
> generate extra noise.
> 
> Signed-off-by: Kairui Song 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 13 ++---
>  certs/system_keyring.c|  7 ++-
>  include/linux/verification.h  |  1 +
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 7d97e432cbbc..a8a5c1773ccc 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -534,9 +534,16 @@ static int bzImage64_cleanup(void *loader_data)
>  #ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
>  static int bzImage64_verify_sig(const char *kernel, unsigned long kernel_len)
>  {
> - return verify_pefile_signature(kernel, kernel_len,
> -VERIFY_USE_SECONDARY_KEYRING,
> -VERIFYING_KEXEC_PE_SIGNATURE);
> + int ret;
> + ret = verify_pefile_signature(kernel, kernel_len,
> + VERIFY_USE_SECONDARY_KEYRING,
> + VERIFYING_KEXEC_PE_SIGNATURE);
> + if (ret == -ENOKEY) {
> + ret = verify_pefile_signature(kernel, kernel_len,
> + VERIFY_USE_PLATFORM_KEYRING,
> + VERIFYING_KEXEC_PE_SIGNATURE);
> + }
> + return ret;
>  }
>  #endif
>  
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index a61b95390b80..7514e69e719f 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  static struct key *builtin_trusted_keys;
> @@ -239,12 +240,16 @@ int verify_pkcs7_signature(const void *data, size_t len,
>   trusted_keys = secondary_trusted_keys;
>  #else
>   trusted_keys = builtin_trusted_keys;
> +#endif
> +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> + } else if (trusted_keys == VERIFY_USE_PLATFORM_KEYRING) {
> + trusted_keys = platform_trusted_keys;
>  #endif
>   }
>   ret = pkcs7_validate_trust(pkcs7, trusted_keys);
>   if (ret < 0) {
>   if (ret == -ENOKEY)
> - pr_err("PKCS#7 signature not signed with a trusted 
> key\n");
> + pr_devel("PKCS#7 signature not signed with a trusted 
> key\n");
>   goto error;
>   }
>  
> diff --git a/include/linux/verification.h b/include/linux/verification.h
> index cfa4730d607a..018fb5f13d44 100644
> --- a/include/linux/verification.h
> +++ b/include/linux/verification.h
> @@ -17,6 +17,7 @@
>   * should be used.
>   */
>  #define VERIFY_USE_SECONDARY_KEYRING ((struct key *)1UL)
> +#define VERIFY_USE_PLATFORM_KEYRING  ((struct key *)2UL)
>  
>  /*
>   * The use to which an asymmetric key is being put.
> -- 
> 2.20.1
> 

Personally I would like to see the platform key separated from integrity.
But for the kexec_file part, I think it is good that at least it works with
this fix.
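
The fallback order the patch implements (secondary keyring first, platform
keyring only on -ENOKEY) can be sketched in userspace as below. The
keyrings here are toy string lists and verify_with()/verify_image() are
hypothetical names; no real signature checking is done.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#ifndef ENOKEY
#define ENOKEY 126 /* fallback for systems whose errno.h lacks ENOKEY */
#endif

/* Toy keyrings: NULL-terminated lists of signer names (illustrative). */
static const char *secondary_keys[] = { "distro-key", NULL };
static const char *platform_keys[]  = { "firmware-db-key", NULL };

enum keyring { USE_SECONDARY, USE_PLATFORM };

/* Returns 0 if the signer is in the chosen keyring, -ENOKEY otherwise. */
static int verify_with(const char *signer, enum keyring which)
{
	const char **k = (which == USE_PLATFORM) ? platform_keys
						 : secondary_keys;

	for (; *k; k++)
		if (strcmp(*k, signer) == 0)
			return 0;
	return -ENOKEY;
}

/* Try the secondary keyring; only on -ENOKEY retry with the platform one. */
int verify_image(const char *signer)
{
	int ret = verify_with(signer, USE_SECONDARY);

	if (ret == -ENOKEY)
		ret = verify_with(signer, USE_PLATFORM);
	return ret;
}
```

Any error other than -ENOKEY from the first attempt is returned as is, so
a bad signature is not retried against the platform keys.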

Acked-by: Dave Young 

Thanks
Dave

Re: [RFC PATCH 1/1] KEYS, integrity: Link .platform keyring to .secondary_trusted_keys

2019-01-08 Thread Dave Young

CC kexec list
On 01/08/19 at 10:18am, Mimi Zohar wrote:
> [Cc'ing the LSM and integrity mailing lists]
> 
> Repeating my comment on PATCH 0/1 here with the expanded set of
> mailing lists.
> 
> The builtin and secondary keyrings have a signature chain of trust
> rooted in the signed kernel image.  Adding the pre-boot keys to the
> secondary keyring breaks that signature chain of trust.
> 
> Please do NOT add the pre-boot "platform" keys to the secondary
> keyring.

If we regard kexec as a bootloader, it sounds natural to use the
platform key to verify the signature in the kexec_file_load syscall.

It would be hard for a user to manually sign a kernel and import the key
just to be able to use kexec_file_load.

I think we do not care whether the platform key can be added to the
secondary keyring or not. Any suggestions on how kexec_file can use the
platform key?

> 
> Mimi
> 
> 
> On Tue, 2019-01-08 at 16:12 +0800, Kairui Song wrote:
> > Currently kexec may need to verify the kernel image, and the kernel image
> > could be signed with third-party keys which are provided by the platform or
> > firmware (e.g. stored in the MokListRT EFI variable). At the same time,
> > kexec_file_load will only verify the image against .builtin_trusted_keys
> > or .secondary_trusted_keys according to configuration, but there is no
> > way for kexec_file_load to verify the image against any of the third-party
> > keys mentioned above.
> > 
> > In ea93102f3224 ('integrity: Define a trusted platform keyring') a
> > .platform keyring is introduced to store the keys provided by platform
> > or firmware. And with a few following commits including 15ea0e1e3e185
> > ('efi: Import certificates from UEFI Secure Boot'), now keys required to
> > verify the image are being imported to the .platform keyring, and later
> > IMA-appraisal could access the keyring and verify the image.
> > 
> > This patch links the .platform keyring to .secondary_trusted_keys so
> > kexec_file_load could also leverage the .platform keyring to verify the
> > kernel image.
> > 
> > Signed-off-by: Kairui Song 
> > ---
> >  certs/system_keyring.c  | 30 ++
> >  include/keys/platform_keyring.h | 12 
> >  security/integrity/digsig.c |  7 +++
> >  3 files changed, 49 insertions(+)
> >  create mode 100644 include/keys/platform_keyring.h
> > 
> > diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> > index 81728717523d..dcef0259e149 100644
> > --- a/certs/system_keyring.c
> > +++ b/certs/system_keyring.c
> > @@ -18,12 +18,14 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  
> >  static struct key *builtin_trusted_keys;
> >  #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
> >  static struct key *secondary_trusted_keys;
> >  #endif
> > +static struct key *platform_keys = NULL;
> >  
> >  extern __initconst const u8 system_certificate_list[];
> >  extern __initconst const unsigned long system_certificate_list_size;
> > @@ -67,6 +69,12 @@ int restrict_link_by_builtin_and_secondary_trusted(
> > /* Allow the builtin keyring to be added to the secondary */
> > return 0;
> >  
> > +   if (type == &key_type_keyring &&
> > +   dest_keyring == secondary_trusted_keys &&
> > +   payload == &platform_keys->payload)
> > +   /* Allow the platform keyring to be added to the secondary */
> > +   return 0;
> > +
> > return restrict_link_by_signature(dest_keyring, type, payload,
> >   secondary_trusted_keys);
> >  }
> > @@ -188,6 +196,28 @@ static __init int load_system_certificate_list(void)
> >  }
> >  late_initcall(load_system_certificate_list);
> >  
> > +#if defined(CONFIG_INTEGRITY_PLATFORM_KEYRING) && 
> > defined(CONFIG_SECONDARY_TRUSTED_KEYRING)
> > +
> > +/*
> > + * Link .platform keyring to .secondary_trusted_key keyring
> > + */
> > +static __init int load_platform_certificate_list(void)
> > +{
> > +   int ret = 0;
> > +   platform_keys = integrity_get_platform_keyring();
> > +   if (!platform_keys) {
> > +   return 0;
> > +   }
> > +   ret = key_link(secondary_trusted_keys, platform_keys);
> > +   if (ret < 0) {
> > +   pr_err("Failed to link platform keyring: %d", ret);
> > +   }
> > +   return 0;
> > +}
> > +late_initcall(load_platform_certificate_list);
> > +
> > +#endif
> > +
> >  #ifdef CONFIG_SYSTEM_DATA_VERIFICATION
> >  
> >  /**
> > diff --git a/include/keys/platform_keyring.h 
> > b/include/keys/platform_keyring.h
> > new file mode 100644
> > index ..4f92ed6c0b42
> > --- /dev/null
> > +++ b/include/keys/platform_keyring.h
> > @@ -0,0 +1,12 @@
> > +#ifndef _KEYS_PLATFORM_KEYRING_H
> > +#define _KEYS_PLATFORM_KEYRING_H
> > +
> > +#include 
> > +
> > +#ifdef CONFIG_INTEGRITY_PLATFORM_KEYRING
> > +
> > +extern const struct key* __init integrity_get_platform_keyring(void);
> > +
> > +#endif /* CONFIG_INTEGRITY_PLATFORM_KEYRING */
> > +
> > +#endif /* _KEYS_SYSTEM_KEYRING_H */
> > diff --git a/security/integrity/digsig.c

Re: [PATCH V2] x86/kexec: fix a kexec_file_load failure

2019-01-08 Thread Dave Young

On 01/08/19 at 04:51pm, Baoquan He wrote:
> On 01/08/19 at 04:46pm, Dave Young wrote:
> > > Wondering why this place doesn't need the initialization assignment.
> > > Isn't it to assign in all places before kexec_add_buffer() calling?
> > 
> > C designated initializers will make sure to initialize it as zero.
> > We set KEXEC_BUF_MEM_UNKNOWN as 0 so it just works.
> 
> Got it, it works, thanks. People may need check code to find out
> KEXEC_BUF_MEM_UNKNOWN is 0, then realize this fact.

Agreed, it is not very clear now. It would be better to use an explicit
initial value since we have the macro.  But since this is a regression,
I suggest fixing the bug first; I can send a patch later for the
improvement.

Thanks!
> 
> Other than this, it looks good to me, ack it.
> 
> Acked-by: Baoquan He 
> 
> Thanks
> Baoquan

Re: [PATCH V2] x86/kexec: fix a kexec_file_load failure

2019-01-08 Thread Dave Young

On 01/08/19 at 01:24pm, Baoquan He wrote:
> On 12/28/18 at 09:12am, Dave Young wrote:
> > The code cleanup mentioned in Fixes tag changed the behavior of
> > kexec_locate_mem_hole.  The kexec_locate_mem_hole will try to
> > allocate free memory only when kbuf.mem is initialized as zero.
> > 
> > But in x86 kexec_file_load implementation there are a few places
> > the kbuf.mem is reused like below:
> >   /* kbuf initialized, kbuf.mem = 0 */
> >   ...
> >   kexec_add_buffer()
> >   ...
> >   kexec_add_buffer()
> > 
> >   The second kexec_add_buffer will reuse previous kbuf but not
> >   reinitialize the kbuf.mem.
> > 
> > Thus kexec_file_load failed because the sanity check failed.
> > 
> > So explicitly reset kbuf.mem to fix the issue.
> > 
> > Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> > Signed-off-by: Dave Young 
> > Cc: 
> > ---
> > V1 -> V2: use KEXEC_BUF_MEM_UNKNOWN in code.
> >  arch/x86/kernel/crash.c   | 1 +
> >  arch/x86/kernel/kexec-bzimage64.c | 2 ++
> >  2 files changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index f631a3f15587..6b7890c7889b 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
> >  
> 
> Wondering why this place doesn't need the initialization assignment.
> Isn't it to assign in all places before kexec_add_buffer() calling?

C designated initializers make sure that any member not named in the
initializer is set to zero. KEXEC_BUF_MEM_UNKNOWN is defined as 0, so it
just works.
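
A small standalone illustration of that rule (struct buf and make_buf() are
made-up names, loosely modelled on struct kexec_buf; only the language
behaviour is the point):

```c
/* Mirrors KEXEC_BUF_MEM_UNKNOWN, which is 0. */
#define BUF_MEM_UNKNOWN 0UL

/* Hypothetical, reduced stand-in for struct kexec_buf. */
struct buf {
	void *image;
	unsigned long buf_min;
	unsigned long buf_max;
	unsigned long mem;
	int top_down;
};

struct buf make_buf(void *image)
{
	/*
	 * Only three members are named. C guarantees that the remaining
	 * ones (buf_min and mem here) are initialized as if to zero, so
	 * mem starts out equal to BUF_MEM_UNKNOWN without being mentioned.
	 */
	struct buf b = { .image = image, .buf_max = ~0UL, .top_down = 0 };

	return b;
}
```

This covers only the first kexec_add_buffer() call after the initializer;
later calls that reuse the same variable still need the explicit reset.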

> 
>   /* Add backup segment. */
> if (image->arch.backup_src_sz) { 
>   }
> 
> > kbuf.memsz = kbuf.bufsz;
> > kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
> > +   kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
> > ret = kexec_add_buffer(&kbuf);
> > if (ret) {
> > vfree((void *)image->arch.elf_headers);
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> > b/arch/x86/kernel/kexec-bzimage64.c
> > index 278cd07228dd..0d5efa34f359 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char 
> > *kernel,
> > kbuf.memsz = PAGE_ALIGN(header->init_size);
> > kbuf.buf_align = header->kernel_alignment;
> > kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
> 
> Same question for bzImage64_load(), there are three kexec_add_buffer()
> calling, I only saw two initialization in this patch.
> 
> > +   kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
> > ret = kexec_add_buffer(&kbuf);
> > if (ret)
> > goto out_free_params;
> > @@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char 
> > *kernel,
> > kbuf.bufsz = kbuf.memsz = initrd_len;
> > kbuf.buf_align = PAGE_SIZE;
> > kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
> > +   kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
> > ret = kexec_add_buffer(&kbuf);
> > if (ret)
> > goto out_free_params;
> > -- 
> > 2.17.0
> > 

Thanks
Dave

Re: [PATCH V2] x86/kexec: fix a kexec_file_load failure

2019-01-07 Thread Dave Young

On 12/28/18 at 09:12am, Dave Young wrote:
> The code cleanup mentioned in Fixes tag changed the behavior of
> kexec_locate_mem_hole.  The kexec_locate_mem_hole will try to
> allocate free memory only when kbuf.mem is initialized as zero.
> 
> But in x86 kexec_file_load implementation there are a few places
> the kbuf.mem is reused like below:
>   /* kbuf initialized, kbuf.mem = 0 */
>   ...
>   kexec_add_buffer()
>   ...
>   kexec_add_buffer()
> 
>   The second kexec_add_buffer will reuse previous kbuf but not
>   reinitialize the kbuf.mem.
> 
> Thus kexec_file_load failed because the sanity check failed.
> 
> So explicitly reset kbuf.mem to fix the issue.
> 
> Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> Signed-off-by: Dave Young 
> Cc: 
> ---
> V1 -> V2: use KEXEC_BUF_MEM_UNKNOWN in code.
>  arch/x86/kernel/crash.c   | 1 +
>  arch/x86/kernel/kexec-bzimage64.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index f631a3f15587..6b7890c7889b 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
>  
>   kbuf.memsz = kbuf.bufsz;
>   kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret) {
>   vfree((void *)image->arch.elf_headers);
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 278cd07228dd..0d5efa34f359 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>   kbuf.memsz = PAGE_ALIGN(header->init_size);
>   kbuf.buf_align = header->kernel_alignment;
>   kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> @@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char 
> *kernel,
>   kbuf.bufsz = kbuf.memsz = initrd_len;
>   kbuf.buf_align = PAGE_SIZE;
>   kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
> + kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> -- 
> 2.17.0
> 


Ping, this is a regression issue, can we get this fixed?

Thanks
Dave

[PATCH V2] x86/kexec: fix a kexec_file_load failure

2018-12-27 Thread Dave Young

The code cleanup mentioned in Fixes tag changed the behavior of
kexec_locate_mem_hole.  The kexec_locate_mem_hole will try to
allocate free memory only when kbuf.mem is initialized as zero.

But in x86 kexec_file_load implementation there are a few places
the kbuf.mem is reused like below:
  /* kbuf initialized, kbuf.mem = 0 */
  ...
  kexec_add_buffer()
  ...
  kexec_add_buffer()

  The second kexec_add_buffer will reuse previous kbuf but not
  reinitialize the kbuf.mem.

Thus kexec_file_load failed because the sanity check failed.

So explicitly reset kbuf.mem to fix the issue.

Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
Signed-off-by: Dave Young 
Cc: 
---
V1 -> V2: use KEXEC_BUF_MEM_UNKNOWN in code.
 arch/x86/kernel/crash.c   | 1 +
 arch/x86/kernel/kexec-bzimage64.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index f631a3f15587..6b7890c7889b 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
 
 	kbuf.memsz = kbuf.bufsz;
 	kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret) {
 		vfree((void *)image->arch.elf_headers);
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 278cd07228dd..0d5efa34f359 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	kbuf.memsz = PAGE_ALIGN(header->init_size);
 	kbuf.buf_align = header->kernel_alignment;
 	kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
 		goto out_free_params;
@@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	kbuf.bufsz = kbuf.memsz = initrd_len;
 	kbuf.buf_align = PAGE_SIZE;
 	kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
+	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
 		goto out_free_params;
-- 
2.17.0

Re: [PATCH] x86/kexec: fix a kexec_file_load failure

2018-12-27 Thread Dave Young

On 12/27/18 at 01:06pm, Dave Young wrote:
> The code cleanup mentioned in the Fixes tag changed the behavior of
> kexec_locate_mem_hole(): it now tries to allocate free memory only
> when kbuf.mem is initialized to zero.
> 
> But the x86 kexec_file_load implementation reuses kbuf.mem in a few
> places, like below:
>   /* kbuf initialized, kbuf.mem = 0 */
>   ...
>   kexec_add_buffer()
>   ...
>   kexec_add_buffer()
> 
>   The second kexec_add_buffer() call reuses the previous kbuf but does
>   not reinitialize kbuf.mem.
> 
> Thus kexec_file_load fails because the sanity check fails.
> 
> So explicitly reset mem = 0 to fix the issue.
> 
> Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
> Signed-off-by: Dave Young 
> Cc: 
> ---
>  arch/x86/kernel/crash.c   | 1 +
>  arch/x86/kernel/kexec-bzimage64.c | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index f631a3f15587..37147509d2c8 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
>  
>   kbuf.memsz = kbuf.bufsz;
>   kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
> + kbuf.mem = 0;

Self NAK, will resend with KEXEC_BUF_MEM_UNKNOWN instead of "0"

>   ret = kexec_add_buffer(&kbuf);
>   if (ret) {
>   vfree((void *)image->arch.elf_headers);
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 278cd07228dd..558204bdf412 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
>   kbuf.memsz = PAGE_ALIGN(header->init_size);
>   kbuf.buf_align = header->kernel_alignment;
>   kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
> + kbuf.mem = 0;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> @@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
>   kbuf.bufsz = kbuf.memsz = initrd_len;
>   kbuf.buf_align = PAGE_SIZE;
>   kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
> + kbuf.mem = 0;
>   ret = kexec_add_buffer(&kbuf);
>   if (ret)
>   goto out_free_params;
> -- 
> 2.17.0
>

[PATCH] x86/kexec: fix a kexec_file_load failure

2018-12-26 Thread Dave Young

The code cleanup mentioned in the Fixes tag changed the behavior of
kexec_locate_mem_hole(): it now tries to allocate free memory only
when kbuf.mem is initialized to zero.

But the x86 kexec_file_load implementation reuses kbuf.mem in a few
places, like below:
  /* kbuf initialized, kbuf.mem = 0 */
  ...
  kexec_add_buffer()
  ...
  kexec_add_buffer()

  The second kexec_add_buffer() call reuses the previous kbuf but does
  not reinitialize kbuf.mem.

Thus kexec_file_load fails because the sanity check fails.

So explicitly reset mem = 0 to fix the issue.

Fixes: b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()")
Signed-off-by: Dave Young 
Cc: 
---
 arch/x86/kernel/crash.c   | 1 +
 arch/x86/kernel/kexec-bzimage64.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index f631a3f15587..37147509d2c8 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -469,6 +469,7 @@ int crash_load_segments(struct kimage *image)
 
kbuf.memsz = kbuf.bufsz;
kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+   kbuf.mem = 0;
ret = kexec_add_buffer(&kbuf);
if (ret) {
vfree((void *)image->arch.elf_headers);
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 278cd07228dd..558204bdf412 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -434,6 +434,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.memsz = PAGE_ALIGN(header->init_size);
kbuf.buf_align = header->kernel_alignment;
kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
+   kbuf.mem = 0;
ret = kexec_add_buffer(&kbuf);
if (ret)
goto out_free_params;
@@ -448,6 +449,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
kbuf.bufsz = kbuf.memsz = initrd_len;
kbuf.buf_align = PAGE_SIZE;
kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
+   kbuf.mem = 0;
ret = kexec_add_buffer(&kbuf);
if (ret)
goto out_free_params;
-- 
2.17.0

Re: [PATCH 1/2 v3] kdump: add the vmcoreinfo documentation

2018-12-25 Thread Dave Young

On 12/26/18 at 11:24am, Dave Young wrote:
> > >> +
> > >> +KERNEL_IMAGE_SIZE
> > >> +=
> > >> +The size of 'KERNEL_IMAGE_SIZE', currently unused.
> > > 
> > > So remove?
> > > 
> > 
> > I'm not sure whether it should be removed, so I kept it.
> 
> Just remove it.  It was added by Baoquan for KASLR issues; later
> makedumpfile reverted the userspace part and added another implementation.
> 
> In case an old makedumpfile does not support a new kernel, it has a list
> of supported kernel versions in its code, so there is no need to worry
> about compatibility.

Ah, it is actually not unused. In a clone of the crash tool git repository:
$ git grep KERNEL_IMAGE_SIZE
x86_64.c:   if ((string = pc->read_vmcoreinfo("NUMBER(KERNEL_IMAGE_SIZE)"))) {

So in the documentation, the use cases of crash tool should also be
covered.
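
As a rough illustration of such a consumer: assuming vmcoreinfo is a block of newline-separated `KEY=value` text with decimal values for `NUMBER()` entries, a tool could look up the entry along these lines. `vmcoreinfo_number()` is a hypothetical helper written for this note; it is not the crash tool's `read_vmcoreinfo()`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look up "NUMBER(<name>)=<value>" in a newline-separated vmcoreinfo text
 * blob. Returns 0 and stores the value on success, -1 if no line matches.
 */
static int vmcoreinfo_number(const char *blob, const char *name, long long *out)
{
	size_t name_len = strlen(name);
	const char *line = blob;

	while (line && *line) {
		/* Each comparison only runs if the previous prefix matched. */
		if (strncmp(line, "NUMBER(", 7) == 0 &&
		    strncmp(line + 7, name, name_len) == 0 &&
		    strncmp(line + 7 + name_len, ")=", 2) == 0) {
			*out = strtoll(line + 7 + name_len + 2, NULL, 10);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline to the next entry */
	}
	return -1;
}
```

A caller would pass the note contents and the bare name, for example `vmcoreinfo_number(blob, "KERNEL_IMAGE_SIZE", &value)`, and treat a return of -1 as "not exported by this kernel".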

Lianbo, it would be good to cc Dave and Kazu for these patches, could
you cc them in your next post?

> 
> Thanks
> Dave

Thanks
Dave

Re: [PATCH 0/2 v4] kdump,vmcoreinfo: Export the value of sme mask to vmcoreinfo

2018-12-25 Thread Dave Young

Add Kazu and Dave in cc

On 12/20/18 at 01:40pm, Lianbo Jiang wrote:
> This patchset did two things:
> a. add a new document for vmcoreinfo
> 
> This document lists some variables that are exported to vmcoreinfo, and briefly
> describes what these variables indicate. It should be instructive for
> many people who do not know vmcoreinfo, and it also normalizes the
> exported variables as a convention between the kernel and user space.
> 
> b. export the value of sme mask to vmcoreinfo
> 
> On AMD machines with the SME feature, the makedumpfile tool needs to know
> whether the crash kernel was encrypted or not. If SME is enabled in the first
> kernel, the crash kernel's page tables (pgd/pud/pmd/pte) contain the
> memory encryption mask, so the sme mask needs to be removed to obtain the true
> physical address.
> 
> Changes since v1:
> 1. No need to export a kernel-internal mask to userspace, so copy the
> value of sme_me_mask to a local variable 'sme_mask' and write the value
> of sme_mask to vmcoreinfo.
> 2. Add comment for the code.
> 3. Improve the patch log.
> 4. Add the vmcoreinfo documentation.
> 
> Changes since v2:
> 1. Improve the vmcoreinfo document, add more descriptions for the
> exported variables.
> 2. Fix spelling errors in the document.
> 
> Changes since v3:
> 1. Further improve the vmcoreinfo document, making it clearer
> and easier to read.
> 2. Move sme_mask comments in the code to the vmcoreinfo document.
> 3. Improve patch log.
> 
> Lianbo Jiang (2):
>   kdump: add the vmcoreinfo documentation
>   kdump,vmcoreinfo: Export the value of sme mask to vmcoreinfo
> 
>  Documentation/kdump/vmcoreinfo.txt | 513 +
>  arch/x86/kernel/machine_kexec_64.c |   3 +
>  2 files changed, 516 insertions(+)
>  create mode 100644 Documentation/kdump/vmcoreinfo.txt
> 
> -- 
> 2.17.1
>
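
Item b of the cover letter above says the sme mask must be removed from a page table entry to obtain the true physical address. A minimal sketch of that step follows. `pte_to_phys()` and the address-bit constant are assumptions made for illustration (4 KiB pages, physical address in bits 12..51); they are not taken from the patch or from makedumpfile.

```c
#include <stdint.h>

/*
 * Strip the memory encryption mask and the non-address bits from a page
 * table entry, leaving the physical address of the page it maps.
 */
static uint64_t pte_to_phys(uint64_t pte, uint64_t sme_mask)
{
	/* Assumed layout: bits 12..51 hold the physical page address. */
	const uint64_t addr_bits = 0x000ffffffffff000ULL;

	return (pte & ~sme_mask) & addr_bits;
}
```

With an example mask of `1ULL << 47` (the bit position is illustrative only), an entry of `0x0000800012345067` yields `0x12345000`; without removing the mask, the encryption bit would be misread as part of the address.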

Re: [PATCH 1/2 v3] kdump: add the vmcoreinfo documentation

2018-12-25 Thread Dave Young

> >> +
> >> +KERNEL_IMAGE_SIZE
> >> +=
> >> +The size of 'KERNEL_IMAGE_SIZE', currently unused.
> > 
> > So remove?
> > 
> 
> I'm not sure whether it should be removed, so I kept it.

Just remove it.  It was added by Baoquan for KASLR issues; later
makedumpfile reverted the userspace part and added another implementation.

In case an old makedumpfile does not support a new kernel, it has a list
of supported kernel versions in its code, so there is no need to worry
about compatibility.

Thanks
Dave

Re: [PATCHv2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2018-12-25 Thread Dave Young

On 12/14/18 at 12:07pm, Pingfan Liu wrote:
> A customer reported a bug on a high-end server with many PCIe devices, where
> the kernel boots with crashkernel=384M and KASLR enabled. Even
> though there is still plenty of memory under 896 MB, the search
> intermittently fails, because currently we can only search the region
> under 896 MB if ',high' is not specified. KASLR then randomly breaks the
> 896 MB into several parts, and the crashkernel reservation needs to be
> aligned to 128 MB, which is why the failure occurs. It confuses the end
> user that crashkernel=X sometimes works and sometimes fails.
> To make it succeed, the customer can change the kernel option to
> "crashkernel=384M,high". But this leaves "crashkernel=xx@yy" very
> limited room to work in, even though its grammar looks more generic.
> And we can't confidently answer the questions raised by the customer:
> 1) why it doesn't succeed in reserving 896 MB;
> 2) what's wrong with the memory region under 4G;
> 3) why I have to add ',high' when I only require 384 MB, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for crashkernel. Bottom-up search may
> be more compatible with the old reservation style, i.e. it still tries to
> get a memory region below 896 MB first, then in [896 MB, 4G], and finally
> above 4G.
> 
> There is one trivial point about compatibility with old kexec-tools:
> if the reserved region is above 896M, the old tool will fail to load
> bzImage. But without this patch, the old tool also fails, since no
> memory below 896M can be reserved for crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu 
> Cc: Dave Young 
> Cc: Andrew Morton 
> Cc: Baoquan He 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: ke...@lists.infradead.org
> 
> ---
> v1->v2:
>   improve commit log
>  arch/x86/kernel/setup.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..60f12c4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
>  
>   /* 0 means: find the address automatically */
>   if (crash_base <= 0) {
> + if (!memblock_bottom_up())
> + memblock_set_bottom_up(true);

Looking at the memblock_find_in_range_node() code, it allocates
bottom-up when bottom_up is true, but it tries to allocate above
kernel_end:

bottom_up_start = max(start, kernel_end);

If the kernel lives very high, e.g. in the KASLR case, this bottom-up
approach does not help.  So the previous version, which tries 896M first,
then 4G, then MAXMEM, is probably better.
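
The effect of that clamp can be seen in a small userspace model. This is a sketch, not memblock code: `find_bottom_up()` is a hypothetical function that treats the whole range as free, mirrors only the `max(start, kernel_end)` step quoted above, and omits anything else the real function does (including any fallback). The sizes in the usage note are arbitrary examples.

```c
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Toy bottom-up search in [start, end), assuming the whole range is free.
 * Returns the lowest aligned base at or above kernel_end that fits `size`,
 * or 0 if nothing fits below `end`. `align` must be a power of two.
 */
static uint64_t find_bottom_up(uint64_t start, uint64_t end,
			       uint64_t kernel_end, uint64_t size,
			       uint64_t align)
{
	uint64_t base = MAX(start, kernel_end);

	base = (base + align - 1) & ~(align - 1); /* round up to alignment */
	if (base + size > end)
		return 0;
	return base;
}
```

With a 16 MB alignment and a 384 MB request, a kernel ending at 16 MB gets a base of 16 MB under an 896 MB limit. If the kernel instead ends at 700 MB, the search starts at 704 MB and nothing fits below 896 MB, although the same request fits once the limit is raised to 4 GB.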

>   /*
>* Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
>* as old kexec-tools loads bzImage below that, unless
>* "crashkernel=size[KMG],high" is specified.
>*/
>   crash_base = memblock_find_in_range(CRASH_ALIGN,
> - high ? CRASH_ADDR_HIGH_MAX
> -  : CRASH_ADDR_LOW_MAX,
> - crash_size, CRASH_ALIGN);
> + (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> + if (!memblock_bottom_up())
> + memblock_set_bottom_up(false);
> +
>   if (!crash_base) {
>   pr_info("crashkernel reservation failed - No suitable area found.\n");
>   return;
> -- 
> 2.7.4
>

Re: [PATCHv2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2018-12-25 Thread Dave Young

On 12/14/18 at 12:07pm, Pingfan Liu wrote:
> A customer reported a bug on a high-end server with many PCIe devices, where
> the kernel boots with crashkernel=384M and KASLR enabled. Even
> though there is still plenty of memory under 896 MB, the search
> intermittently fails, because currently we can only search the region
> under 896 MB if ',high' is not specified. KASLR then randomly breaks the
> 896 MB into several parts, and the crashkernel reservation needs to be
> aligned to 128 MB, which is why the failure occurs. It confuses the end
> user that crashkernel=X sometimes works and sometimes fails.
> To make it succeed, the customer can change the kernel option to
> "crashkernel=384M,high". But this leaves "crashkernel=xx@yy" very
> limited room to work in, even though its grammar looks more generic.
> And we can't confidently answer the questions raised by the customer:
> 1) why it doesn't succeed in reserving 896 MB;
> 2) what's wrong with the memory region under 4G;
> 3) why I have to add ',high' when I only require 384 MB, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for crashkernel. Bottom-up search may
> be more compatible with the old reservation style, i.e. it still tries to
> get a memory region below 896 MB first, then in [896 MB, 4G], and finally
> above 4G.
> 
> There is one trivial point about compatibility with old kexec-tools:
> if the reserved region is above 896M, the old tool will fail to load
> bzImage. But without this patch, the old tool also fails, since no
> memory below 896M can be reserved for crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu 
> Cc: Dave Young 
> Cc: Andrew Morton 
> Cc: Baoquan He 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: ke...@lists.infradead.org
> 
> ---
> v1->v2:
>   improve commit log
>  arch/x86/kernel/setup.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..60f12c4 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
>  
>   /* 0 means: find the address automatically */
>   if (crash_base <= 0) {
> + if (!memblock_bottom_up())
> + memblock_set_bottom_up(true);
>   /*
>* Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
>* as old kexec-tools loads bzImage below that, unless
>* "crashkernel=size[KMG],high" is specified.
>*/
>   crash_base = memblock_find_in_range(CRASH_ALIGN,
> - high ? CRASH_ADDR_HIGH_MAX
> -  : CRASH_ADDR_LOW_MAX,
> - crash_size, CRASH_ALIGN);
> + (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> + if (!memblock_bottom_up())
> + memblock_set_bottom_up(false);

The earlier memblock_set_bottom_up(true) already set it to true, so
"!memblock_bottom_up()" can never hold here; I'm not sure what the point
of this condition check is.

Do you want to restore the original memblock direction? If so, a variable
to save the old direction is needed.  But is this really necessary?
Do you know of any side effects of setting bottom-up to true?
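
The point about the dead check, and the save/restore alternative, can be shown with a userspace stand-in. This is only a sketch: `bottom_up`, `get_bottom_up()` and `set_bottom_up()` are hypothetical replacements for memblock's direction flag and its accessors, and the region search itself is left out.

```c
#include <stdbool.h>

/* Stand-ins for memblock's global direction flag and its accessors. */
static bool bottom_up;

static bool get_bottom_up(void) { return bottom_up; }
static void set_bottom_up(bool enable) { bottom_up = enable; }

/* Pattern from the posted patch: the second check can never be true. */
static void search_as_posted(void)
{
	if (!get_bottom_up())
		set_bottom_up(true);
	/* ... the region search would run here ... */
	if (!get_bottom_up())     /* always false: direction is bottom-up now */
		set_bottom_up(false);
}

/* Alternative: remember the caller's direction and put it back afterwards. */
static void search_with_restore(void)
{
	bool saved = get_bottom_up();

	set_bottom_up(true);
	/* ... the region search would run here ... */
	set_bottom_up(saved);
}
```

Starting from a top-down state, `search_as_posted()` leaves the flag set to bottom-up, whereas `search_with_restore()` returns it to whatever it was before the call.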

> +
>   if (!crash_base) {
>   pr_info("crashkernel reservation failed - No suitable area found.\n");
>   return;
> -- 
> 2.7.4
> 

Thanks
Dave

Re: [PATCH 1/2 v2] kdump: add the vmcoreinfo documentation

2018-12-05 Thread Dave Young

> >> +init_uts_ns
> >> +===
> >> +This is the UTS namespace, which is used to isolate two specific elements
> >> +of the system that relate to the uname system call. The UTS namespace is
> >> +named after the data structure used to store information returned by the
> >> +uname system call.
> > 
> > Those non-obvious exports should also have a short explanation why
> > they're part of VMCOREINFO.
> > 
> >> +
> >> +node_online_map
> >> +===
> >> +It is a macro definition; it is actually the array node_states[N_ONLINE],
> >> +and it represents the set of online nodes in a system, one bit position
> >> +per node number.
> > 
> > Ditto.
> > 
> > So yeah, people can find out what those things are but I think it is
> > more important to state here *why* they're part of VMCOREINFO and how
> > they're used and why they're exported.
> > 
> 
> This is a good question.
> 
> For the two *why* questions, it should be easy to understand: user-space
> tools need to know basic information, such as symbol values, field offsets,
> structure sizes, etc. Otherwise, these tools won't know how to analyze the
> memory of the crash kernel.
> 
> For the second question, 'how they are used', we can get the answer from
> user-space tools such as makedumpfile and the crash tool. Therefore, it may
> not need any further explanation in the kernel document. On the other hand,
> if we must put these contents into the kernel document, I have to say that
> would be a hard task.

It should be a good chance to learn how makedumpfile works :)  Maybe it is
hard to get *all* of them, but it would be good to dig, find what
you can, and then explain it.  Leave the *unknown* parts as
FIXME or TODO; people can add descriptions later.

Added Kazu in cc as well.

Thanks
Dave

Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name

2018-11-19 Thread Dave Young

On 11/15/18 at 11:39am, Borislav Petkov wrote:
> + Bjorn.
> 
> On Thu, Nov 15, 2018 at 01:44:07PM +0800, lijiang wrote:
> > At present, the upstream kernel does not pass the e820 reserved ranges to
> > the second kernel, which might cause two problems:
> > 
> > The first one is the MMCONFIG issue: PCI MMCONFIG (extended mode) requires
> > the reserved region, otherwise it falls back to legacy mode, which might
> > mean that a hot-plug device cannot be recognized in the kdump kernel.
> 
> Well, this still doesn't explain it fully. Let's look at a box:
> 
> [0.00] e820: BIOS-provided physical RAM map:
> [0.00] BIOS-e820: [mem 0x-0x000997ff] usable
> [0.00] BIOS-e820: [mem 0x00099800-0x0009] reserved
> [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
> [0.00] BIOS-e820: [mem 0x0010-0x65642fff] usable
> [0.00] BIOS-e820: [mem 0x65643000-0x67fb8fff] reserved
> [0.00] BIOS-e820: [mem 0x67fb9000-0x689e8fff] ACPI NVS
> [0.00] BIOS-e820: [mem 0x689e9000-0x68bf5fff] ACPI data
> [0.00] BIOS-e820: [mem 0x68bf6000-0x6f7f] usable
> [0.00] BIOS-e820: [mem 0x6f80-0x8fff] reserved
> [0.00] BIOS-e820: [mem 0xfd00-0xfe7f] reserved
> [0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
> [0.00] BIOS-e820: [mem 0xfec8-0xfed00fff] reserved
> [0.00] BIOS-e820: [mem 0xff80-0x0001007f] reserved
> [0.00] BIOS-e820: [mem 0x00010080-0x00603fff] usable
> 
> this one has 8 reserved regions. Does that mean that we need to pass
> them *all* 8 to the second kernel so that MMCONFIG works?

We just copy the 1st kernel's memory map (/proc/iomem) for use in the 2nd
kernel's e820.  I'm not sure we can get the exact memory range and pass it:
different I/O devices may have different ranges, so it is hard to
cover all the use cases, and there seems to be no easy way to get them.

Another point is that it is not worth getting the exact info.  Whatever the
1st kernel reserved is fine to reserve in the 2nd kernel as well, with no
side effects.  Actually, there might be some obscure use cases we
have not found that rely on those reserved memory ranges, so it is better
to have them.

> 
> Or is it only one reserved region which is needed for MMCONFIG?
> 
> Bjorn, do you know what the detection logic should be to map the correct
> reserved region (or regions) for MMCONFIG?
> 
> Now, even if we don't map that reserved region and MMCONFIG falls back
> to legacy mode, why is that a problem for the kdump kernel? Why does
> the kdump kernel need the hotplug device? What would be the use case?
> Hotplug a SATA drive to store the memory dump to it ... or?

According to an old bug report, only devices on PCI segment 0 fall back
to legacy mode; devices on segment 1 do not fall back, they just
do not work, and this does not seem to be related to hotplug.

I found the old bug report; copying part of it here:
'''
When doing a kdump, the kdump kernel failed to boot due to not finding 
/dev/root.  The root drive is on a LSI Megaraid disk.

...
[6.869903] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci:00/:00:1a.0/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.1/input/input1
[6.885358] generic-usb 0003:046B:FF10.0002: input,hidraw1: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-:00:1a.0-1.3.1/input1
[6.901927] usbcore: registered new interface driver usbhid
[6.908145] usbhid: USB HID core driver
..Could not find /dev/root.
Want me to fall back to /dev/disk/by-id/scsi-3600605b0049fac9018513918775bfc13-part4? (Y/n) 
y
Waiting for device /dev/disk/by-id/scsi-3600605b0049fac9018513918775bfc13-part4 to appear: ..not found -- exiting to /bin/sh
$

The basic problem is that this device is in PCI segment 1 and the kernel PCI 
probing cannot find it without all the e820 i/o reservations being present in 
the e820 table.  And the crash kernel does not have those reservations because 
the kexec command does not pass i/o reservation via the memmap= command line 
option. (This problem does not show up for other vendors, as SGI is apparently 
the only one using extended PCI. The lookup of devices in PCI segment 0 
actually fails for everyone, but devices in segment 0 are then found by some 
legacy lookup method.)  The workaround for this is to fix kexec to pass i/o 
reserved areas to the crash kernel.  The patch will be attached.
'''

And here are some old patches in kexec-tools for fixing this:
http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
(patch from SGI, later reverted)

http://lists.infradead.org/pipermail/kexec/2014-April/011710.html
(patch from Chaowang)

But

Re: [PATCH 2/2] x86/kexec_file: add reserved e820 ranges to 2nd kernel e820 table

2018-09-18 Thread Dave Young

On 09/18/18 at 06:20pm, lijiang wrote:
> On 2018-09-18 at 11:20, Dave Young wrote:
> > On 09/18/18 at 10:48am, Lianbo Jiang wrote:
> >> e820 reserved ranges are useful in the kdump kernel; we have already added
> >> this in the kexec-tools code.
> >>
> >> One reason is that PCI mmconf (extended mode) requires the reserved region,
> >> otherwise it falls back to legacy mode.
> >>
> >> For AMD SME kdump support, the dmi table area needs to be mapped as
> >> unencrypted. For normal boot these ranges sit in e820 reserved ranges, thus
> >> the early ioremap code naturally maps them as unencrypted. So if we have the
> >> same e820 reserve setup in the kdump kernel, it will just work like the
> >> normal kernel.
> >>
> >> Signed-off-by: Dave Young 
> >> Signed-off-by: Lianbo Jiang 
> >> ---
> >>  arch/x86/kernel/crash.c | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> >> index 3c113e6545a3..db453e9c117b 100644
> >> --- a/arch/x86/kernel/crash.c
> >> +++ b/arch/x86/kernel/crash.c
> >> @@ -384,6 +384,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> >>walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1, &cmd,
> >>memmap_entry_callback);
> >>  
> >> +  /* Add all reserved ranges */
> >> +  cmd.type = E820_TYPE_RESERVED;
> >> +  flags = IORESOURCE_MEM;
> > 
> > Lianbo, thinking about this again: we will miss other io resource types
> > if we only match IORESOURCE_MEM here. Can you redo the patch and just
> > pass "0" for the flags?
> > 
> 
> These patches align with kexec-tools for e820 reserved ranges. If so,
> kexec-tools also needs to be improved for the other types, such as
> IORESOURCE_IO/BUS/DMA(...), right?

Kexec-tools just goes through the "reserved" items in /proc/iomem and adds
them; these flags are not visible to it.  So we have already done the same
in kexec-tools.
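
To illustrate what "going through the reserved items" can look like from user space, here is a sketch of a line parser. It is not the kexec-tools implementation: `parse_reserved_line()` is a hypothetical helper, and the line layout (`start-end : name`, hexadecimal, optional leading indentation) is assumed from typical /proc/iomem output.

```c
#include <stdio.h>
#include <strings.h>

/*
 * Parse one /proc/iomem-style line such as "  000a0000-000fffff : Reserved".
 * Returns 1 if the entry is named "reserved" (compared case-insensitively),
 * 0 otherwise. *start and *end are only meaningful when 1 is returned.
 */
static int parse_reserved_line(const char *line, unsigned long long *start,
			       unsigned long long *end)
{
	char name[64];

	if (sscanf(line, " %llx-%llx : %63[^\n]", start, end, name) != 3)
		return 0;
	return strcasecmp(name, "reserved") == 0;
}
```

Only the name and the address range are available at this level, which is why the resource flags discussed above never enter the picture for a user-space tool.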

> 
> I will improve these patches and post v2 tomorrow.

Thanks!
Dave

Re: [PATCH 2/2] x86/kexec_file: add reserved e820 ranges to 2nd kernel e820 table

2018-09-17 Thread Dave Young

On 09/18/18 at 10:48am, Lianbo Jiang wrote:
> e820 reserved ranges are useful in the kdump kernel; we have already added
> this in the kexec-tools code.
> 
> One reason is that PCI mmconf (extended mode) requires the reserved region,
> otherwise it falls back to legacy mode.
> 
> For AMD SME kdump support, the dmi table area needs to be mapped as
> unencrypted. For normal boot these ranges sit in e820 reserved ranges, thus
> the early ioremap code naturally maps them as unencrypted. So if we have the
> same e820 reserve setup in the kdump kernel, it will just work like the
> normal kernel.
> 
> Signed-off-by: Dave Young 
> Signed-off-by: Lianbo Jiang 
> ---
>  arch/x86/kernel/crash.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 3c113e6545a3..db453e9c117b 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -384,6 +384,12 @@ int crash_setup_memmap_entries(struct kimage *image, 
> struct boot_params *params)
>   walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1, &cmd,
>   memmap_entry_callback);
>  
> + /* Add all reserved ranges */
> + cmd.type = E820_TYPE_RESERVED;
> + flags = IORESOURCE_MEM;

Lianbo, thinking about this again: we will miss other I/O resource types if
we only match IORESOURCE_MEM here.  Can you redo the patch and just pass "0"
for the flags?

> + walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, -1, &cmd,
> + memmap_entry_callback);
> +
>   /* Add crashk_low_res region */
>   if (crashk_low_res.end) {
>   ei.addr = crashk_low_res.start;
> -- 
> 2.17.1
>
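
The review comment above asks for a flags value of "0" so that no resource
type is filtered out.  A simplified user-space model of that filtering,
assuming the walker keeps an entry only when every requested flag bit is set
on it; the TOY_* constants and toy_* names are hypothetical and are not the
kernel's IORESOURCE_* values or API:

```c
#include <stddef.h>

/* Illustrative flag bits, NOT the kernel's real IORESOURCE_* values. */
#define TOY_RES_IO   0x1u
#define TOY_RES_MEM  0x2u

struct toy_res {
    unsigned long long start, end;
    unsigned int flags;
};

/*
 * Count the entries whose flags contain every bit in 'want'.
 * With want == 0 the test is always true, so every entry matches.
 */
static size_t toy_count_matching(const struct toy_res *tbl, size_t n,
                                 unsigned int want)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++)
        if ((tbl[i].flags & want) == want)
            hits++;
    return hits;
}
```

With a table holding one memory entry, one I/O entry and one entry with no
flags, asking for TOY_RES_MEM finds one entry while asking for 0 finds all
three, which is the behaviour the comment is after.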

Re: Intel Wireless 8265/8275 (rev 78) issue

2018-08-30 Thread Dave Young

On 08/30/18 at 08:55am, Luca Coelho wrote:
> On Thu, 2018-08-30 at 13:06 +0800, Dave Young wrote:
> > On 08/30/18 at 07:15am, Luca Coelho wrote:
> > > On Wed, 2018-08-29 at 14:54 +0800, Dave Young wrote:
> > > > [   74.123114] wlp61s0: authenticate with 00:1f:c6:82:0a:57
> > > > [   74.126099] wlp61s0: send auth to 00:1f:c6:82:0a:57 (try 1/3)
> > > > [   74.139740] wlp61s0: authenticated
> > > > [   74.140142] iwlwifi :3d:00.0 wlp61s0: disabling HE/HT/VHT due to 
> > > > WEP/TKIP use
> > > > [   74.140146] iwlwifi :3d:00.0 wlp61s0: disabling HT as WMM/QoS is 
> > > > not supported by the AP
> > > > [   74.140149] iwlwifi :3d:00.0 wlp61s0: disabling VHT as WMM/QoS 
> > > > is not supported by the AP
> > > 
> > > Your AP seems to be configured to use WEP or TKIP, which is not allowed
> > > by IEEE802.11n, so we disable all the higher rates, as you can see
> > > here.  It should still work, but you will only get very basic rates
> > > (i.e. slow).
> > 
> > Luka, thanks for the suggestion,  moved AP to use WPA2-personal + AES.
> > seems I got only two features disabled this time:
> > [   20.949347] iwlwifi :3d:00.0 wlp61s0: disabling HT as WMM/QoS is not 
> > supported by the AP
> > [   20.949351] iwlwifi :3d:00.0 wlp61s0: disabling VHT as WMM/QoS is 
> > not supported by the AP
> > [   20.949622] wlp61s0: associate with 00:1f:c6:82:0a:57 (try 1/3)
> 
> There are many interoperability problems with old 11n APs that either
> support a draft version of the specs or simply have bugs in their
> implementations of the 11n specs.  When we detect these problems, we
> disable 11n rates, so connectivity will be slower, but it will at least
> remain connected.  This seems to be the case with your AP. :(

This does not seem to cause a much slower connection.
But enabling WMM does cause a very slow connection, so I will just leave WMM
disabled as before.

> 
> 
> > > > [ 2521.068469] wlp61s0: deauthenticating from 00:1f:c6:82:0a:57 by 
> > > > local choice (Reason: 3=DEAUTH_LEAVING)
> > > 
> > > Other than that I can't really see what is going on, except that
> > > something is deciding to disconnect here...
> > > 
> > > So, first thing I'd suggest would be to try to reconfigure your AP to
> > > use something else than WEP or TKIP, just to see if it improves in any
> > > way.
> > 
> > Will observe if it works or not. BTW, my old laptop Thinkpad T440s works
> > well with the same AP.
> 
> I hope it improves.  If not, we should debug it further.  It's a good
> thing not to use WEP and TKIP anyway, because they're not very secure
> anyway.

Ok, will provide feedback later, thanks!

> 
> 
> > > And BTW, is this a regression in recent kernels or has this always
> > > happened to you with this laptop/AP combination?
> > 
> > From my observation it just always happen, but hmm not 100% sure..
> 
> Okay, so we don't have enough information on this.  It doesn't matter. 
> What is the NIC on your old T440s?

It is an Intel Wireless 7260, but since the laptop is not at hand I cannot
get the lspci details.

> 
> --
> Cheers,
> Luca.
> 

Thanks
Dave

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-22 Thread Dave Young

On 08/22/18 at 06:23pm, Dave Young wrote:
> On 08/21/18 at 03:39pm, Ard Biesheuvel wrote:
> > On 9 August 2018 at 11:13, Dave Young  wrote:
> > > On 08/09/18 at 09:33am, Mike Galbraith wrote:
> > >> On Thu, 2018-08-09 at 12:21 +0800, Dave Young wrote:
> > >> > Hi Mike,
> > >> >
> > >> > Thanks for the patch!
> > >> > On 08/08/18 at 04:03pm, Mike Galbraith wrote:
> > >> > > When booting with efi=noruntime, we call efi_runtime_map_copy() while
> > >> > > loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
> > >> > > that and a useless allocation when the only mapping we can use (1:1)
> > >> > > is not available.
> > >> >
> > >> > At first glance, efi_get_runtime_map_size should return 0 in case
> > >> > noruntime.
> > >>
> > >> What efi does internally at unmap time is to leave everything except
> > >> efi.mmap.map untouched, setting it to NULL and turning off EFI_MEMMAP,
> > >> rendering efi.mmap.map accessors useless/unsafe without first checking
> > >> EFI_MEMMAP.
> > >
> > > Probably the x86 efi_init should reset nr_map to zero in case runtime is
> > > disabled.  But let's see how Ard thinks about this and cc linux-efi.
> > >
> > > As for efi_get_runtime_map_size, it was introduced for x86 kexec use.
> > > for copying runtime maps,  so I think it is reasonable this function
> > > return zero in case no runtime.
> > >
> > 
> > I don't see the patch in the context so I cannot comment in great detail.
> 
> The patch is below:
> https://lore.kernel.org/lkml/1533737025.4936.3.ca...@gmx.de
> 
> > 
> > In any case, it is better to decouple EFI_MEMMAP from EFI_RUNTIME
> > dependencies. On x86, one may imply the other, but this is not
> > generally the case.
> > 
> > That means that efi_get_runtime_map_size() should probably check the
> > EFI_RUNTIME flag, and return 0 if it is cleared. Perhaps there are
> > other places where EFI_MEMMAP flag checks are missing, but I consider
> > that a separate issue.
> 
> Yes, I also agree with checking EFI_RUNTIME_SERVICES. There is no point in
> efi_get_runtime_map_size returning a value other than 0 when
> EFI_RUNTIME_SERVICES is not set, i.e. via efi=noruntime.
> 
> Is the patch below acceptable?  The copy function can be changed to return
> an error in case map size == 0, but that can be done later along with
> the caller-side cleanups in the kexec code.

I forgot to add Mike's Reported-by tag.

Mike, since we are going this way, I'm working on a kexec code cleanup,
but it needs careful testing, so it will still take some time.

Can you help test the EFI fix below and provide your Tested-by if it works?

> ---
> 
> efi: check EFI_RUNTIME_SERVICES flag in runtime map copying code
> 
> Mike reported a kexec_file_load NULL pointer dereference bug like below:
> [5.878262] BUG: unable to handle kernel NULL pointer dereference at 
> 
> [5.879868] PGD 80013c1f1067 P4D 80013c1f1067 PUD 13aea7067 PMD 0 
> [5.881225] Oops:  [#1] SMP PTI
> [5.882068] Modules linked in:
> [5.882851] CPU: 0 PID: 394 Comm: kexec Kdump: loaded Not tainted 
> 4.17.0-rc2+ #648
> [5.884333] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [5.885843] RIP: 0010:memcpy_erms+0x6/0x10
> [5.886789] RSP: 0018:c958bd00 EFLAGS: 00010246
> [5.887899] RAX: 880138e050b0 RBX: 000980b0 RCX: 
> 0ba0
> [5.889304] RDX: 0ba0 RSI:  RDI: 
> 880138e050b0
> [5.890977] RBP: 880138e04000 R08: 0017 R09: 
> 0002
> [5.892524] R10: 00099000 R11: 52d0 R12: 
> 39400200
> [5.893967] R13: 880138e05000 R14: 0ba0 R15: 
> c9a4d000
> [5.895378] FS:  7f167c9e6740() GS:88013fc0() 
> knlGS:
> [5.896953] CS:  0010 DS:  ES:  CR0: 80050033
> [5.898143] CR2:  CR3: 00013c3ec002 CR4: 
> 001606f0
> [5.899542] DR0:  DR1:  DR2: 
> 
> [5.900962] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [5.902552] Call Trace:
> [5.903267]  efi_runtime_map_copy+0x28/0x30
> [5.904956]  bzImage64_load+0x59d/0x736
> [5.906881]  ? arch_kexec_kernel_image_load+0x6d/0x70
> [5.908243]  ? __se_sys_kexec
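
The approach discussed in this thread is to report a runtime map size of 0
when runtime services are disabled, and to let the copy step fail cleanly on
a size of 0.  A rough user-space model of that idea follows; the struct and
function names are hypothetical, this is not the kernel code, and the extra
NULL check on the map pointer is a defensive assumption of this sketch
rather than something taken from the patch:

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the EFI state that matters here (hypothetical). */
struct toy_efi_state {
    int runtime_enabled;   /* models the EFI_RUNTIME_SERVICES flag */
    const void *map;       /* models the memory map; NULL once unmapped */
    size_t nr_entries;
    size_t desc_size;
};

/* Report 0 when runtime services are off or the map is gone. */
static size_t toy_runtime_map_size(const struct toy_efi_state *s)
{
    if (!s->runtime_enabled || !s->map)
        return 0;
    return s->nr_entries * s->desc_size;
}

/* Returns 0 on success, -1 if there is nothing to copy or buf is too small. */
static int toy_runtime_map_copy(const struct toy_efi_state *s,
                                void *buf, size_t bufsz)
{
    size_t sz = toy_runtime_map_size(s);

    if (sz == 0 || sz > bufsz)
        return -1;
    memcpy(buf, s->map, sz);
    return 0;
}
```

Because the size check comes first, the copy never reaches memcpy() with a
NULL source, which is the failure shown in the oops above.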

Re: [PATCH 2/2] [FIXED v2] Replace magic for trusting the secondary keyring with #define

2018-08-16 Thread Dave Young

On 08/16/18 at 09:43am, Yannik Sembritzki wrote:
> On 16.08.2018 03:11, Dave Young wrote:
> > Instead of fix your 1st patch in 2nd patch, I would suggest to
> > switch the patch order.  In 1st patch change the common code to use
> > the new macro and in 2nd patch you can directly fix the kexec code
> > with TRUST_SECONDARY_KEYRING.
> My reasoning for doing it in this order was that the first patch which
> fixes the bug itself should be merged into stable, while the refactoring
> doesn't necessarily have to. I'm not familiar with the linux development
> process, so please correct me if this should be done in another fashion.

Frankly, I'm not sure about the stable process, but personally I do not
like the order.

Cc'ed Greg for his opinion on the stable concern.

> 
> Yannik

Re: [PATCH 2/2] [FIXED v2] Replace magic for trusting the secondary keyring with #define

2018-08-15 Thread Dave Young

On 08/16/18 at 12:07am, Yannik Sembritzki wrote:
> Signed-off-by: Yannik Sembritzki 
> ---
>  arch/x86/kernel/kexec-bzimage64.c   | 2 +-
>  certs/system_keyring.c  | 3 ++-
>  crypto/asymmetric_keys/pkcs7_key_type.c | 2 +-
>  include/linux/verification.h    | 3 +++
>  4 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c
> b/arch/x86/kernel/kexec-bzimage64.c
> index 74628275..97d199a3 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -532,7 +532,7 @@ static int bzImage64_cleanup(void *loader_data)
>  static int bzImage64_verify_sig(const char *kernel, unsigned long
> kernel_len)
>  {
>      return verify_pefile_signature(kernel, kernel_len,
> -                   ((struct key *)1UL),
> +                   TRUST_SECONDARY_KEYRING,

Instead of fixing your 1st patch in the 2nd patch, I would suggest
switching the patch order.  In the 1st patch, change the common code to use
the new macro, and in the 2nd patch you can directly fix the kexec code
with TRUST_SECONDARY_KEYRING.

>                     VERIFYING_KEXEC_PE_SIGNATURE);
>  }
>  #endif
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index 6251d1b2..777ac7d2 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -230,7 +231,7 @@ int verify_pkcs7_signature(const void *data, size_t len,
>  
>      if (!trusted_keys) {
>          trusted_keys = builtin_trusted_keys;
> -    } else if (trusted_keys == (void *)1UL) {
> +    } else if (trusted_keys == TRUST_SECONDARY_KEYRING) {
>  #ifdef CONFIG_SECONDARY_TRUSTED_KEYRING
>          trusted_keys = secondary_trusted_keys;
>  #else
> diff --git a/crypto/asymmetric_keys/pkcs7_key_type.c
> b/crypto/asymmetric_keys/pkcs7_key_type.c
> index e284d9cb..0783e555 100644
> --- a/crypto/asymmetric_keys/pkcs7_key_type.c
> +++ b/crypto/asymmetric_keys/pkcs7_key_type.c
> @@ -63,7 +63,7 @@ static int pkcs7_preparse(struct key_preparsed_payload
> *prep)
>  
>      return verify_pkcs7_signature(NULL, 0,
>                    prep->data, prep->datalen,
> -                  (void *)1UL, usage,
> +                  TRUST_SECONDARY_KEYRING, usage,
>                    pkcs7_view_content, prep);
>  }
>  
> diff --git a/include/linux/verification.h b/include/linux/verification.h
> index a10549a6..c00c1143 100644
> --- a/include/linux/verification.h
> +++ b/include/linux/verification.h
> @@ -12,6 +12,9 @@
>  #ifndef _LINUX_VERIFICATION_H
>  #define _LINUX_VERIFICATION_H
>  
> +// Allow both builtin trusted keys and secondary trusted keys

It would be better to use the /* ... */ commenting style.

> +#define TRUST_SECONDARY_KEYRING ((struct key *)1UL)
> +
>  /*
>   * The use to which an asymmetric key is being put.
>   */
> -- 
> 2.17.1
> 
> 

Thanks
Dave
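
The patch above replaces the bare (void *)1UL argument with a named sentinel
pointer.  A small user-space sketch of that pattern, modelled on the
selection logic visible in the quoted verify_pkcs7_signature() hunk; all
toy_* and TOY_* names are hypothetical, and the have_secondary argument
stands in for the CONFIG_SECONDARY_TRUSTED_KEYRING build option:

```c
struct toy_key {
    const char *name;
};

/* Sentinel: never dereferenced, only compared against. Fully parenthesized. */
#define TOY_TRUST_SECONDARY ((struct toy_key *)1UL)

static struct toy_key toy_builtin = { "builtin" };
static struct toy_key toy_secondary = { "secondary" };

/*
 * Pick the keyring to verify against:
 *   null pointer -> builtin keys only
 *   sentinel     -> secondary keyring if available, else builtin
 *   other        -> the caller's own keyring
 */
static struct toy_key *toy_pick_keyring(struct toy_key *requested,
                                        int have_secondary)
{
    if (!requested)
        return &toy_builtin;
    if (requested == TOY_TRUST_SECONDARY)
        return have_secondary ? &toy_secondary : &toy_builtin;
    return requested;
}
```

Naming the sentinel keeps the call sites readable, and wrapping the cast in
parentheses avoids surprises when the macro is used inside a larger
expression.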

Re: [PATCH] Fix kexec forbidding kernels signed with custom platform keys to boot

2018-08-15 Thread Dave Young

On 08/16/18 at 08:52am, Dave Young wrote:
> On 08/15/18 at 01:42pm, Vivek Goyal wrote:
> > On Wed, Aug 15, 2018 at 07:27:33PM +0200, Yannik Sembritzki wrote:
> > > Would this be okay?
> > 
> > [ CC dave young, Baoquan, Justin Forbes]
> > 
> > Hi Yannik,
> > 
> > I am reading that bug and wondering that what broke it. It used to work,
> > so some change broke it. 
> > 
> > Justin said that we have been signing fedora kernels with fedora keys so
> > looks like no change there.
> > 
> > Previously, I think all the keys used to go in system keyring and it
> > used to work. Is it somehow because of split in builtin keyring and
> > secondary system keyring. Could it be that fedora key used to show
> > up in system keyring previously and it worked but now it shows up
> > in secondary system keyring and by default we don't use keys from
> > that keyring for signature verification?

The commit that introduced this issue is:

commit d3bfe84129f65e0af2450743ebdab33d161d01c9
Author: David Howells 
Date:   Wed Apr 6 16:14:27 2016 +0100

certs: Add a secondary system keyring that can be added to dynamically

> 
> There was a Fedora bug below:
> https://bugzilla.redhat.com/show_bug.cgi?id=1470995
> 
> I posted a fix here but nobody responded. I think I obviously did not
> consider the "trust build system only" point from Linus:
> http://lists.infradead.org/pipermail/kexec/2017-November/019632.html
> 
> But either the above patch or defining a macro for the "1UL" in the cert
> header file works.
> 
> Since nobody reviewed my patch, I later submitted a Fedora-only patch,
> which is similar to Yannik's and was merged into the Fedora tree:
> https://bugzilla.redhat.com/attachment.cgi?id=1450772=edit
> 
> > 
> > Thanks
> > Vivek
> > 
> > > 
> > > diff --git a/arch/x86/kernel/kexec-bzimage64.c
> > > b/arch/x86/kernel/kexec-bzimage64.c
> > > index 7326078e..2ba47e24 100644
> > > --- a/arch/x86/kernel/kexec-bzimage64.c
> > > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > > @@ -41,6 +41,9 @@
> > >  #define MIN_KERNEL_LOAD_ADDR   0x10
> > >  #define MIN_INITRD_LOAD_ADDR   0x100
> > >  
> > > +// Allow both builtin trusted keys and secondary trusted keys
> > > +#define TRUST_FULL_KEYRING (void *)1UL
> > > +
> > >  /*
> > >   * This is a place holder for all boot loader specific data structure 
> > > which
> > >   * gets allocated in one call but gets freed much later during cleanup
> > > @@ -532,7 +535,7 @@ static int bzImage64_cleanup(void *loader_data)
> > >  static int bzImage64_verify_sig(const char *kernel, unsigned long
> > > kernel_len)
> > >  {
> > >     return verify_pefile_signature(kernel, kernel_len,
> > > -  NULL,
> > > +  TRUST_FULL_KEYRING,
> > >    VERIFYING_KEXEC_PE_SIGNATURE);
> > >  }
> > >  #endif
> > > --
> > > 
> > > On 15.08.2018 18:54, Linus Torvalds wrote:
> > > > This needs more people involved, and at least a sign-off.
> > > >
> > > > It looks ok, but I think we need a #define for the magical (void *)1UL
> > > > thing. I see the use in verify_pkcs7_signature(), but still.
> > > >
> > > >   Linus
> > > >
> > > >
> > > >
> > > > On Wed, Aug 15, 2018 at 3:11 AM Yannik Sembritzki 
> > > >  wrote:
> > > >> ---
> > > >>  arch/x86/kernel/kexec-bzimage64.c | 2 +-
> > > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> > > >> b/arch/x86/kernel/kexec-bzimage64.c
> > > >> index 7326078e..eaaa125d 100644
> > > >> --- a/arch/x86/kernel/kexec-bzimage64.c
> > > >> +++ b/arch/x86/kernel/kexec-bzimage64.c
> > > >> @@ -532,7 +532,7 @@ static int bzImage64_cleanup(void *loader_data)
> > > >>  static int bzImage64_verify_sig(const char *kernel, unsigned long 
> > > >> kernel_len)
> > > >>  {
> > > >> return verify_pefile_signature(kernel, kernel_len,
> > > >> -  NULL,
> > > >> +  (void *)1UL,
> > > >>VERIFYING_KEXEC_PE_SIGNATURE);
> > > >>  }
> > > >>  #endif
> > > >> --
> > > >> 2.17.1
> > > >>
> > > >> The exact scenario under which this issue occurs is described here:
> > > >> https://bugzilla.redhat.com/show_bug.cgi?id=1554113
> > > >>
> > > 
> 
> Thanks
> Dave

Re: [PATCH] Fix kexec forbidding kernels signed with custom platform keys to boot

2018-08-15 Thread Dave Young

On 08/15/18 at 01:42pm, Vivek Goyal wrote:
> On Wed, Aug 15, 2018 at 07:27:33PM +0200, Yannik Sembritzki wrote:
> > Would this be okay?
> 
> [ CC dave young, Baoquan, Justin Forbes]
> 
> Hi Yannik,
> 
> I am reading that bug and wondering what broke it. It used to work,
> so some change broke it. 
> 
> Justin said that we have been signing Fedora kernels with Fedora keys, so
> it looks like no change there.
> 
> Previously, I think all the keys used to go in the system keyring and it
> used to work. Is it somehow because of the split into the builtin keyring
> and the secondary system keyring? Could it be that the Fedora key used to
> show up in the system keyring previously and it worked, but now it shows
> up in the secondary system keyring and by default we don't use keys from
> that keyring for signature verification?

There was a Fedora bug below:
https://bugzilla.redhat.com/show_bug.cgi?id=1470995

I posted a fix here but nobody responded. I think I obviously did not
consider the "trust build system only" point from Linus:
http://lists.infradead.org/pipermail/kexec/2017-November/019632.html

But either the above patch or defining a macro for the "1UL" in the cert
header file works.

Since nobody reviewed my patch, I later submitted a Fedora-only patch,
which is similar to Yannik's and was merged into the Fedora tree:
https://bugzilla.redhat.com/attachment.cgi?id=1450772=edit
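
To illustrate the macro idea, here is a minimal userspace sketch of how a
named sentinel could replace the bare (void *)1UL. Everything in it is an
assumption made for illustration: TRUST_SECONDARY_KEYRING, pick_keyring()
and the enum are hypothetical names, not the kernel's API, and the mapping
of NULL to "builtin keys only" follows the behaviour described in this
thread rather than any particular kernel version.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical name for the sentinel. The thread only asks that the bare
 * (void *)1UL gets a #define; the outer parentheses keep the cast intact
 * wherever the macro is expanded.
 */
#define TRUST_SECONDARY_KEYRING ((void *)1UL)

enum keyring_choice {
	KEYRING_BUILTIN_ONLY,          /* NULL: only keys built into the kernel */
	KEYRING_BUILTIN_AND_SECONDARY, /* sentinel: secondary keyring as well */
	KEYRING_CALLER_SUPPLIED,       /* any other pointer: caller's keyring */
};

/* Mock of how a verifier might interpret its trusted_keys argument. */
static enum keyring_choice pick_keyring(const void *trusted_keys)
{
	if (trusted_keys == NULL)
		return KEYRING_BUILTIN_ONLY;
	if (trusted_keys == TRUST_SECONDARY_KEYRING)
		return KEYRING_BUILTIN_AND_SECONDARY;
	return KEYRING_CALLER_SUPPLIED;
}

/* Human-readable name for a choice, for log messages. */
static const char *keyring_choice_name(enum keyring_choice c)
{
	static const char *const names[] = {
		"builtin only",
		"builtin and secondary",
		"caller supplied",
	};

	assert((size_t)c < sizeof(names) / sizeof(names[0]));
	return names[c];
}
```

With such a define, the call site reads as a policy decision instead of a
magic constant, which is the readability point Linus raised.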

> 
> Thanks
> Vivek
> 
> > 
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c
> > b/arch/x86/kernel/kexec-bzimage64.c
> > index 7326078e..2ba47e24 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -41,6 +41,9 @@
> >  #define MIN_KERNEL_LOAD_ADDR   0x100000
> >  #define MIN_INITRD_LOAD_ADDR   0x1000000
> >  
> > +// Allow both builtin trusted keys and secondary trusted keys
> > +#define TRUST_FULL_KEYRING (void *)1UL
> > +
> >  /*
> >   * This is a place holder for all boot loader specific data structure which
> >   * gets allocated in one call but gets freed much later during cleanup
> > @@ -532,7 +535,7 @@ static int bzImage64_cleanup(void *loader_data)
> >  static int bzImage64_verify_sig(const char *kernel, unsigned long
> > kernel_len)
> >  {
> >     return verify_pefile_signature(kernel, kernel_len,
> > -  NULL,
> > +  TRUST_FULL_KEYRING,
> >    VERIFYING_KEXEC_PE_SIGNATURE);
> >  }
> >  #endif
> > --
> > 
> > On 15.08.2018 18:54, Linus Torvalds wrote:
> > > This needs more people involved, and at least a sign-off.
> > >
> > > It looks ok, but I think we need a #define for the magical (void *)1UL
> > > thing. I see the use in verify_pkcs7_signature(), but still.
> > >
> > >   Linus
> > >
> > >
> > >
> > > On Wed, Aug 15, 2018 at 3:11 AM Yannik Sembritzki  
> > > wrote:
> > >> ---
> > >>  arch/x86/kernel/kexec-bzimage64.c | 2 +-
> > >>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> > >> b/arch/x86/kernel/kexec-bzimage64.c
> > >> index 7326078e..eaaa125d 100644
> > >> --- a/arch/x86/kernel/kexec-bzimage64.c
> > >> +++ b/arch/x86/kernel/kexec-bzimage64.c
> > >> @@ -532,7 +532,7 @@ static int bzImage64_cleanup(void *loader_data)
> > >>  static int bzImage64_verify_sig(const char *kernel, unsigned long 
> > >> kernel_len)
> > >>  {
> > >> return verify_pefile_signature(kernel, kernel_len,
> > >> -  NULL,
> > >> +  (void *)1UL,
> > >>VERIFYING_KEXEC_PE_SIGNATURE);
> > >>  }
> > >>  #endif
> > >> --
> > >> 2.17.1
> > >>
> > >> The exact scenario under which this issue occurs is described here:
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1554113
> > >>
> > 

Thanks
Dave

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-14 Thread Dave Young

Apologies for the late reply, I was occupied with something else.

On 08/10/18 at 07:39pm, Mike Galbraith wrote:
> On Fri, 2018-08-10 at 18:28 +0800, Dave Young wrote:
> > 
> > > @@ -250,8 +253,10 @@ setup_boot_parameters(struct kimage *image, struct 
> > > boot_params *params,
> > >  
> > >  #ifdef CONFIG_EFI
> > >   /* Setup EFI state */
> > > - setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> > > + ret = setup_efi_state(params, params_load_addr, efi_map_offset, 
> > > efi_map_sz,
> > >   efi_setup_data_offset);
> > > + if (ret)
> > 
> > Here should check efi_enabled(EFI_BOOT) && ret
> 
> Patch with that works for me.
> 
> > In case efi boot we need the efi info set correctly,  or one need pass
> > acpi_rsdp= in kernel cmdline param.
> > 
> > Still not sure how to allow one to workaround it by using acpi_rsdp=
> > param with kexec_file_load..
> 
> Does this improve things, and plug the no boot hole?

Would you mind tuning my patch with some acpi_rsdp checking and adding an
error message in case the kexec load fails? E.g. suggest that people
append acpi_rsdp= for noefi booting.

I'm still not very satisfied with the code cleanup. Ideally we should add a
separate kbuf for the efi stuff, so that we can do the efi_map_sz,
efi_setup_data_offset, and efi_map_offset initialization only when
necessary.  Anyway, the cleanup can be a separate patch.
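
As a rough illustration of the acpi_rsdp check suggested above, here is a
userspace sketch. It is not kernel code: check_efi_handover(),
cmdline_has_acpi_rsdp() and the two bool parameters are hypothetical
stand-ins for efi_enabled(EFI_BOOT) and "a usable 1:1 runtime map exists",
and the -ENODEV return simply follows the changelog quoted below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if "acpi_rsdp=" starts a parameter in a space-separated cmdline. */
static bool cmdline_has_acpi_rsdp(const char *cmdline)
{
	static const char key[] = "acpi_rsdp=";
	const char *p = cmdline;

	assert(cmdline != NULL);
	while ((p = strstr(p, key)) != NULL) {
		/* Accept only a match at the start of a parameter. */
		if (p == cmdline || p[-1] == ' ')
			return true;
		p += sizeof(key) - 1;
	}
	return false;
}

/*
 * Hypothetical load-time check. Returns 0 when the second kernel can find
 * the ACPI tables (non-EFI boot, runtime map present, or acpi_rsdp= given),
 * otherwise -ENODEV with a hint for the user in *errmsg.
 */
static int check_efi_handover(bool efi_booted, bool have_runtime_map,
			      const char *cmdline, const char **errmsg)
{
	*errmsg = NULL;
	if (!efi_booted || have_runtime_map)
		return 0;
	if (cmdline_has_acpi_rsdp(cmdline))
		return 0;
	*errmsg = "no EFI runtime map: append acpi_rsdp=<addr> to boot "
		  "the second kernel without EFI";
	return -ENODEV;
}
```

The point of returning an error plus a message, instead of silently
returning 0, is that the load fails early with a hint rather than the
second kernel failing to boot later.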

> 
> x86, kdump: cleanup efi setup data handling a bit
> 
> 1. Remove efi specific variables from bzImage64_load() other than the
> one it needs, efi_map_sz, passing it and params_cmdline_sz on to efi
> setup functions, giving them all they need without duplication.
> 
> 2. Only allocate space for efi setup data when a 1:1 mapping is available.
> Bail early with -ENODEV if not available, but is required to boot, and
> acpi_rsdp= was not passed on the command line. 
> 
> 3. Use the proper config dependency to isolate efi setup functions,
> adding a !EFI_RUNTIME_MAP stub for setup_efi_state().
> 
> 4. Change efi functions that cannot fail to void. 
> 
> Signed-off-by: Mike Galbraith 
> ---
>  arch/x86/kernel/kexec-bzimage64.c |   99 
> +-
>  1 file changed, 45 insertions(+), 54 deletions(-)
> 
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -112,35 +112,32 @@ static int setup_e820_entries(struct boo
>   return 0;
>  }
>  
> -#ifdef CONFIG_EFI
> -static int setup_efi_info_memmap(struct boot_params *params,
> +#ifdef CONFIG_EFI_RUNTIME_MAP
> +static void setup_efi_info_memmap(struct boot_params *params,
> unsigned long params_load_addr,
> -   unsigned int efi_map_offset,
> +   unsigned int params_cmdline_sz,
> unsigned int efi_map_sz)
>  {
> - void *efi_map = (void *)params + efi_map_offset;
> - unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
> + void *efi_map = (void *)params + params_cmdline_sz;
> + unsigned long efi_map_phys_addr = params_load_addr + params_cmdline_sz;
>   struct efi_info *ei = &params->efi_info;
>  
> - if (!efi_map_sz)
> - return -EINVAL;
> -
>   efi_runtime_map_copy(efi_map, efi_map_sz);
>  
>   ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
>   ei->efi_memmap_hi = efi_map_phys_addr >> 32;
>   ei->efi_memmap_size = efi_map_sz;
> -
> - return 0;
>  }
>  
> -static int
> +static void
>  prepare_add_efi_setup_data(struct boot_params *params,
> -unsigned long params_load_addr,
> -unsigned int efi_setup_data_offset)
> +unsigned long params_load_addr,
> +unsigned int params_cmdline_sz,
> +unsigned int efi_map_sz)
>  {
> + unsigned int data_offset = params_cmdline_sz + ALIGN(efi_map_sz, 16);
>   unsigned long setup_data_phys;
> - struct setup_data *sd = (void *)params + efi_setup_data_offset;
> + struct setup_data *sd = (void *)params + data_offset;
>   struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
>  
>   esd->fw_vendor = efi.fw_vendor;
> @@ -152,33 +149,20 @@ prepare_add_efi_setup_data(struct boot_p
>   sd->len = sizeof(struct efi_setup_data);
>  
>   /* Add setup data */
> - setup_data_phys = params_load_addr + efi_setup_data_offset;
> + setup_data_phys = params_load_addr + data_offset;
>   sd->next = params->hdr.setup_data;
>   params->hdr

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-10 Thread Dave Young

On 08/10/18 at 04:45pm, Dave Young wrote:
> On 08/08/18 at 04:03pm, Mike Galbraith wrote:
> > When booting with efi=noruntime, we call efi_runtime_map_copy() while
> > loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
> > that and a useless allocation when the only mapping we can use (1:1)
> > is not available.
> > 
> > Signed-off-by: Mike Galbraith 
> > ---
> >  arch/x86/kernel/kexec-bzimage64.c |   22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> > 
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -122,9 +122,6 @@ static int setup_efi_info_memmap(struct
> > unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
> > struct efi_info *ei = &params->efi_info;
> >  
> > -   if (!efi_map_sz)
> > -   return 0;
> > -
> > efi_runtime_map_copy(efi_map, efi_map_sz);
> >  
> > ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
> > @@ -176,7 +173,7 @@ setup_efi_state(struct boot_params *para
> >  * acpi_rsdp= on kernel command line to make second kernel boot
> >  * without efi.
> >  */
> > -   if (efi_enabled(EFI_OLD_MEMMAP))
> > +   if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_MEMMAP))
> > return 0;
> >  
> > ei->efi_loader_signature = current_ei->efi_loader_signature;
> > @@ -338,7 +335,7 @@ static void *bzImage64_load(struct kimag
> > struct kexec_entry64_regs regs64;
> > void *stack;
> > unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> > -   unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
> > +   unsigned int efi_map_offset = 0, efi_map_sz = 0, efi_setup_data_offset 
> > = 0;
> > struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
> >   .top_down = true };
> > struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
> > @@ -397,19 +394,22 @@ static void *bzImage64_load(struct kimag
> >  * have to create separate segment for each. Keeps things
> >  * little bit simple
> >  */
> > -   efi_map_sz = efi_get_runtime_map_size();
> > params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
> > MAX_ELFCOREHDR_STR_LEN;
> > params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
> > -   kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
> > -   sizeof(struct setup_data) +
> > -   sizeof(struct efi_setup_data);
> > +   kbuf.bufsz = params_cmdline_sz + sizeof(struct setup_data);
> > +
> > +   /* Now add space for the efi stuff if we have a useable 1:1 mapping. */
> > +   if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_MEMMAP)) {
> > +   efi_map_sz = efi_get_runtime_map_size();
> > +   kbuf.bufsz += ALIGN(efi_map_sz, 16) + sizeof(struct 
> > efi_setup_data);
> > +   efi_map_offset = params_cmdline_sz;
> > +   efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
> > +   }
> >  
> > params = kzalloc(kbuf.bufsz, GFP_KERNEL);
> > if (!params)
> > return ERR_PTR(-ENOMEM);
> > -   efi_map_offset = params_cmdline_sz;
> > -   efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
> >  
> > /* Copy setup header onto bootparams. Documentation/x86/boot.txt */
> > setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
> 
> BTW, this patch only fixes the kexec load phase problem. Even if kexec
> load succeeds with the fix, the 2nd kernel can not boot because the efi
> memmap info is not correct and usable.
> 
> So we should go with some fix similar to below, and do the cleanup we
> mentioned with a separate patch later.
> 
> Also, user space kexec-tools needs a similar patch to error out in case
> there are no runtime maps.  It would be good to fix both userspace and
> kernel load.
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 7326078eaa7a..e34ba2f53cfb 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -123,7 +123,7 @@ static int setup_efi_info_memmap(struct boot_params 
> *params,
>   struct efi_info *ei = &params->efi_info;
>  
>   if (!efi_map_sz)
> - return 0;
> + return -EINVAL;
>  
>   efi_runtime_map_copy(efi_map, efi_map_sz);
>  
> @@ -166,9 +166,10 @@ setup_efi_state(struct boot_params *params, un

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-10 Thread Dave Young

On 08/08/18 at 04:03pm, Mike Galbraith wrote:
> When booting with efi=noruntime, we call efi_runtime_map_copy() while
> loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
> that and a useless allocation when the only mapping we can use (1:1)
> is not available.
> 
> Signed-off-by: Mike Galbraith 
> ---
>  arch/x86/kernel/kexec-bzimage64.c |   22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -122,9 +122,6 @@ static int setup_efi_info_memmap(struct
>   unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
>   struct efi_info *ei = &params->efi_info;
>  
> - if (!efi_map_sz)
> - return 0;
> -
>   efi_runtime_map_copy(efi_map, efi_map_sz);
>  
>   ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
> @@ -176,7 +173,7 @@ setup_efi_state(struct boot_params *para
>* acpi_rsdp= on kernel command line to make second kernel boot
>* without efi.
>*/
> - if (efi_enabled(EFI_OLD_MEMMAP))
> + if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_MEMMAP))
>   return 0;
>  
>   ei->efi_loader_signature = current_ei->efi_loader_signature;
> @@ -338,7 +335,7 @@ static void *bzImage64_load(struct kimag
>   struct kexec_entry64_regs regs64;
>   void *stack;
>   unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> - unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
> + unsigned int efi_map_offset = 0, efi_map_sz = 0, efi_setup_data_offset 
> = 0;
>   struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
> .top_down = true };
>   struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
> @@ -397,19 +394,22 @@ static void *bzImage64_load(struct kimag
>* have to create separate segment for each. Keeps things
>* little bit simple
>*/
> - efi_map_sz = efi_get_runtime_map_size();
>   params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
>   MAX_ELFCOREHDR_STR_LEN;
>   params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
> - kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
> - sizeof(struct setup_data) +
> - sizeof(struct efi_setup_data);
> + kbuf.bufsz = params_cmdline_sz + sizeof(struct setup_data);
> +
> + /* Now add space for the efi stuff if we have a useable 1:1 mapping. */
> + if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_MEMMAP)) {
> + efi_map_sz = efi_get_runtime_map_size();
> + kbuf.bufsz += ALIGN(efi_map_sz, 16) + sizeof(struct 
> efi_setup_data);
> + efi_map_offset = params_cmdline_sz;
> + efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
> + }
>  
>   params = kzalloc(kbuf.bufsz, GFP_KERNEL);
>   if (!params)
>   return ERR_PTR(-ENOMEM);
> - efi_map_offset = params_cmdline_sz;
> - efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
>  
>   /* Copy setup header onto bootparams. Documentation/x86/boot.txt */
>   setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
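
For readers following the buffer arithmetic in the quoted patch, the layout
it computes can be sketched in userspace as below. This is only an
illustration under assumptions: struct params_layout, compute_layout() and
ALIGN_UP() are hypothetical names, and the structure sizes are passed in as
plain numbers because the real sizeof() values are not given in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

struct params_layout {
	size_t params_cmdline_sz;     /* boot_params + cmdline, 16-aligned */
	size_t efi_map_offset;        /* 0 when no EFI data is appended */
	size_t efi_setup_data_offset; /* 0 when no EFI data is appended */
	size_t bufsz;                 /* total allocation size */
};

/* Mirrors the size/offset computation from the quoted bzImage64_load(). */
static struct params_layout
compute_layout(size_t boot_params_sz, size_t cmdline_len,
	       size_t elfcorehdr_str_len, size_t setup_data_sz,
	       size_t efi_setup_data_sz, size_t efi_map_sz, bool efi_usable)
{
	struct params_layout l = { 0, 0, 0, 0 };

	l.params_cmdline_sz = ALIGN_UP(boot_params_sz + cmdline_len +
				       elfcorehdr_str_len, 16);
	l.bufsz = l.params_cmdline_sz + setup_data_sz;

	/* Only reserve room for the EFI map when a 1:1 mapping is usable. */
	if (efi_usable) {
		l.bufsz += ALIGN_UP(efi_map_sz, 16) + efi_setup_data_sz;
		l.efi_map_offset = l.params_cmdline_sz;
		l.efi_setup_data_offset = l.efi_map_offset +
					  ALIGN_UP(efi_map_sz, 16);
		/* The setup data must end exactly at the end of the buffer. */
		assert(l.efi_setup_data_offset + setup_data_sz +
		       efi_setup_data_sz == l.bufsz);
	}
	return l;
}
```

With the made-up sizes 4096, 10 and 30 for the first three arguments,
params_cmdline_sz rounds up from 4136 to 4144, and a 100-byte map occupies
112 bytes after alignment.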

BTW, this patch only fixes the kexec load phase problem. Even if kexec
load succeeds with the fix, the 2nd kernel can not boot because the efi
memmap info is not correct and usable.

So we should go with some fix similar to below, and do the cleanup we
mentioned with a separate patch later.

Also, user space kexec-tools needs a similar patch to error out in case
there are no runtime maps.  It would be good to fix both userspace and
kernel load.

diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index 7326078eaa7a..e34ba2f53cfb 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -123,7 +123,7 @@ static int setup_efi_info_memmap(struct boot_params *params,
struct efi_info *ei = &params->efi_info;
 
if (!efi_map_sz)
-   return 0;
+   return -EINVAL;
 
efi_runtime_map_copy(efi_map, efi_map_sz);
 
@@ -166,9 +166,10 @@ setup_efi_state(struct boot_params *params, unsigned long 
params_load_addr,
 {
struct efi_info *current_ei = &boot_params.efi_info;
struct efi_info *ei = &params->efi_info;
+   int ret;
 
if (!current_ei->efi_memmap_size)
-   return 0;
+   return -EINVAL;
 
/*
 * If 1:1 mapping is not enabled, second kernel can not setup EFI
@@ -176,8 +177,8 @@ setup_efi_state(struct boot_params *params, unsigned long 
params_load_addr,
 * acpi_rsdp= on kernel command line to make second kernel boot
 * without efi.
 */
-   if (efi_enabled(EFI_OLD_MEMMAP))
-   return 0;
+   if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_RUNTIME_SERVICES))
+   return

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-10 Thread Dave Young

On 08/08/18 at 04:03pm, Mike Galbraith wrote:
> When booting with efi=noruntime, we call efi_runtime_map_copy() while
> loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
> that and a useless allocation when the only mapping we can use (1:1)
> is not available.
> 
> Signed-off-by: Mike Galbraith 
> ---
>  arch/x86/kernel/kexec-bzimage64.c |   22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -122,9 +122,6 @@ static int setup_efi_info_memmap(struct
>   unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
>   struct efi_info *ei = >efi_info;
>  
> - if (!efi_map_sz)
> - return 0;
> -
>   efi_runtime_map_copy(efi_map, efi_map_sz);
>  
>   ei->efi_memmap = efi_map_phys_addr & 0x;
> @@ -176,7 +173,7 @@ setup_efi_state(struct boot_params *para
>* acpi_rsdp= on kernel command line to make second kernel boot
>* without efi.
>*/
> - if (efi_enabled(EFI_OLD_MEMMAP))
> + if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_MEMMAP))
>   return 0;
>  
>   ei->efi_loader_signature = current_ei->efi_loader_signature;
> @@ -338,7 +335,7 @@ static void *bzImage64_load(struct kimag
>   struct kexec_entry64_regs regs64;
>   void *stack;
>   unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
> - unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
> + unsigned int efi_map_offset = 0, efi_map_sz = 0, efi_setup_data_offset 
> = 0;
>   struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
> .top_down = true };
>   struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
> @@ -397,19 +394,22 @@ static void *bzImage64_load(struct kimag
>* have to create separate segment for each. Keeps things
>* little bit simple
>*/
> - efi_map_sz = efi_get_runtime_map_size();
>   params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
>   MAX_ELFCOREHDR_STR_LEN;
>   params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
> - kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
> - sizeof(struct setup_data) +
> - sizeof(struct efi_setup_data);
> + kbuf.bufsz = params_cmdline_sz + sizeof(struct setup_data);
> +
> + /* Now add space for the efi stuff if we have a useable 1:1 mapping. */
> + if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_MEMMAP)) {
> + efi_map_sz = efi_get_runtime_map_size();
> + kbuf.bufsz += ALIGN(efi_map_sz, 16) + sizeof(struct 
> efi_setup_data);
> + efi_map_offset = params_cmdline_sz;
> + efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
> + }
>  
>   params = kzalloc(kbuf.bufsz, GFP_KERNEL);
>   if (!params)
>   return ERR_PTR(-ENOMEM);
> - efi_map_offset = params_cmdline_sz;
> - efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
>  
>   /* Copy setup header onto bootparams. Documentation/x86/boot.txt */
>   setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;

BTW, this patch only fixes the kexec load phase problem.  Even if kexec
load succeeds with the fix, the 2nd kernel can not boot because the efi
memmap info is not correct and not usable.

So we should go with some fix similar to the one below, and do the cleanup
we mentioned in a separate patch later.

Also the user space kexec-tools needs a similar patch to error out in case
there are no runtime maps.  It would be good to fix both the userspace and
the kernel load paths.

diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index 7326078eaa7a..e34ba2f53cfb 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -123,7 +123,7 @@ static int setup_efi_info_memmap(struct boot_params *params,
struct efi_info *ei = &params->efi_info;
 
if (!efi_map_sz)
-   return 0;
+   return -EINVAL;
 
efi_runtime_map_copy(efi_map, efi_map_sz);
 
@@ -166,9 +166,10 @@ setup_efi_state(struct boot_params *params, unsigned long 
params_load_addr,
 {
struct efi_info *current_ei = &boot_params.efi_info;
struct efi_info *ei = &params->efi_info;
+   int ret;
 
if (!current_ei->efi_memmap_size)
-   return 0;
+   return -EINVAL;
 
/*
 * If 1:1 mapping is not enabled, second kernel can not setup EFI
@@ -176,8 +177,8 @@ setup_efi_state(struct boot_params *params, unsigned long 
params_load_addr,
 * acpi_rsdp= on kernel command line to make second kernel boot
 * without efi.
 */
-   if (efi_enabled(EFI_OLD_MEMMAP))
-   return 0;
+   if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_RUNTIME_SERVICES))
+   return
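
A minimal sketch of the error propagation described in this reply, written
as a standalone userspace model rather than kernel code.  The struct and
the *_model function names are hypothetical stand-ins, and only the two
checks whose new return value is visible in the diff are modelled; the 1:1
mapping branch is left out because its return value is cut off in the
archived message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the real code reads from
 * boot_params and the EFI runtime map; not the kernel's own types. */
struct efi_load_model {
	size_t memmap_size;       /* first kernel's efi_memmap_size */
	size_t runtime_map_size;  /* size of the exported runtime map */
};

/* Models setup_efi_info_memmap(): an empty runtime map is now an error
 * instead of a silent success. */
int setup_efi_info_memmap_model(size_t efi_map_sz)
{
	if (!efi_map_sz)
		return -EINVAL;
	/* The real function copies the runtime map here. */
	return 0;
}

/* Models setup_efi_state(): propagate the failure to the caller so the
 * load is refused instead of producing an unbootable second kernel. */
int setup_efi_state_model(const struct efi_load_model *st)
{
	int ret;

	assert(st != NULL);

	if (!st->memmap_size)
		return -EINVAL;

	ret = setup_efi_info_memmap_model(st->runtime_map_size);
	if (ret)
		return ret;

	return 0;
}
```

With this shape a caller sees -EINVAL at load time whenever either size is
zero, which is the behaviour the reply asks for on both the kernel and the
kexec-tools side.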

Re: [PATCH v12 04/16] powerpc, kexec_file: factor out memblock-based arch_kexec_walk_mem()

2018-07-25 Thread Dave Young

On 07/24/18 at 03:57pm, AKASHI Takahiro wrote:
> Memblock list is another source for usable system memory layout.
> So move powerpc's arch_kexec_walk_mem() to common code so that other
> memblock-based architectures, particularly arm64, can also utilise it.
> A moved function is now renamed to kexec_walk_memblock() and integrated
> into kexec_locate_mem_hole(), which will now be usable for all
> architectures with no need for overriding arch_kexec_walk_mem().
> 
> kexec_walk_memblock() will not work for kdump in this form, this will be
> fixed in the next patch.
> 
> Signed-off-by: AKASHI Takahiro 
> Cc: "Eric W. Biederman" 
> Cc: Dave Young 
> Cc: Vivek Goyal 
> Cc: Baoquan He 
> Acked-by: James Morse 
> ---
>  arch/powerpc/kernel/machine_kexec_file_64.c | 54 ---
>  include/linux/kexec.h   |  2 -
>  kernel/kexec_file.c | 58 -
>  3 files changed, 56 insertions(+), 58 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/machine_kexec_file_64.c 
> b/arch/powerpc/kernel/machine_kexec_file_64.c
> index 0bd23dc789a4..5357b09902c5 100644
> --- a/arch/powerpc/kernel/machine_kexec_file_64.c
> +++ b/arch/powerpc/kernel/machine_kexec_file_64.c
> @@ -24,7 +24,6 @@
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -46,59 +45,6 @@ int arch_kexec_kernel_image_probe(struct kimage *image, 
> void *buf,
>   return kexec_image_probe_default(image, buf, buf_len);
>  }
>  
> -/**
> - * arch_kexec_walk_mem - call func(data) for each unreserved memory block
> - * @kbuf:Context info for the search. Also passed to @func.
> - * @func:Function to call for each memory block.
> - *
> - * This function is used by kexec_add_buffer and kexec_locate_mem_hole
> - * to find unreserved memory to load kexec segments into.
> - *
> - * Return: The memory walk will stop when func returns a non-zero value
> - * and that value will be returned. If all free regions are visited without
> - * func returning non-zero, then zero will be returned.
> - */
> -int arch_kexec_walk_mem(struct kexec_buf *kbuf,
> - int (*func)(struct resource *, void *))
> -{
> - int ret = 0;
> - u64 i;
> - phys_addr_t mstart, mend;
> - struct resource res = { };
> -
> - if (kbuf->top_down) {
> - for_each_free_mem_range_reverse(i, NUMA_NO_NODE, 0,
> - &mstart, &mend, NULL) {
> - /*
> -  * In memblock, end points to the first byte after the
> -  * range while in kexec, end points to the last byte
> -  * in the range.
> -  */
> - res.start = mstart;
> - res.end = mend - 1;
> - ret = func(&res, kbuf);
> - if (ret)
> - break;
> - }
> - } else {
> - for_each_free_mem_range(i, NUMA_NO_NODE, 0, &mstart, &mend,
> - NULL) {
> - /*
> -  * In memblock, end points to the first byte after the
> -  * range while in kexec, end points to the last byte
> -  * in the range.
> -  */
> - res.start = mstart;
> - res.end = mend - 1;
> - ret = func(&res, kbuf);
> - if (ret)
> - break;
> - }
> - }
> -
> - return ret;
> -}
> -
>  /**
>   * setup_purgatory - initialize the purgatory's global variables
>   * @image:   kexec image.
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 49ab758f4d91..c196bfd11bee 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -184,8 +184,6 @@ int __weak arch_kexec_apply_relocations(struct 
> purgatory_info *pi,
>   const Elf_Shdr *relsec,
>   const Elf_Shdr *symtab);
>  
> -int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> -int (*func)(struct resource *, void *));
>  extern int kexec_add_buffer(struct kexec_buf *kbuf);
>  int kexec_locate_mem_hole(struct kexec_buf *kbuf);
>  
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index bf39df5e5bb9..2f0691b0f8ad 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -501,6 +502,55 @@ static
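
The walk that the quoted patch moves into common code can be sketched in
plain C as follows.  This is only an illustration under assumptions: the
range types, walk_free_ranges() and take_first_fit() are hypothetical names,
and a simple array stands in for the memblock free list.  It shows the two
points the removed comment block makes: the end address is converted from
"first byte after the range" to "last byte in the range", and the walk stops
as soon as the callback returns non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* memblock-style range: end is the first byte after the range. */
struct range_excl { uint64_t start; uint64_t end; };
/* kexec/resource-style range: end is the last byte in the range. */
struct range_incl { uint64_t start; uint64_t end; };

typedef int (*walk_fn)(const struct range_incl *res, void *data);

/* Calls func for each free range, from the top down or the bottom up.
 * Ranges are assumed to be non-empty (end > start). */
int walk_free_ranges(const struct range_excl *free_list, size_t n,
		     int top_down, walk_fn func, void *data)
{
	int ret = 0;

	assert(func != NULL);
	assert(free_list != NULL || n == 0);

	for (size_t k = 0; k < n; k++) {
		size_t idx = top_down ? n - 1 - k : k;
		struct range_incl res = {
			.start = free_list[idx].start,
			.end = free_list[idx].end - 1, /* exclusive -> inclusive */
		};

		ret = func(&res, data);
		if (ret)
			break; /* non-zero stops the walk and is returned */
	}
	return ret;
}

struct hole_request { uint64_t size; uint64_t found_start; };

/* Example callback: accept the first range large enough for the request. */
int take_first_fit(const struct range_incl *res, void *data)
{
	struct hole_request *req = data;

	if (res->end - res->start + 1 < req->size)
		return 0; /* too small, keep walking */
	req->found_start = res->start;
	return 1;
}
```

Passing a non-zero top_down makes the same callback pick the highest
suitable range instead of the lowest, which mirrors the kbuf->top_down
switch in the quoted function.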

Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver

2018-05-28 Thread Dave Young

On 05/24/18 at 11:14am, David Hildenbrand wrote:
> On 24.05.2018 10:56, Dave Young wrote:
> > Hi,
> > 
> > [snip]
> >>>
> >>>> For kdump and onlining/offlining code, we
> >>>> have to mark pages as offline before a new segment is visible to the 
> >>>> system
> >>>> (e.g. as these pages might not be backed by real memory in the 
> >>>> hypervisor).
> >>>
> >>> Please expand on the kdump part. That is really confusing because
> >>> hotplug should simply not depend on kdump at all. Moreover why don't you
> >>> simply mark those pages reserved and pull them out from the page
> >>> allocator?
> >>
> >> 1. "hotplug should simply not depend on kdump at all"
> >>
> >> In theory yes. In the current state we already have to trigger kdump to
> >> reload whenever we add/remove a memory block.
> >>
> >>
> >> 2. kdump part
> >>
> >> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> >> we should not assume that we can read that page again. Now, if dumping
> >> tools assume they can read all memory that is offline, we are in trouble.
> >>
> >> It is the same thing as we already have with Pg_hwpoison. Just a
> >> different meaning - "don't touch this page, it is offline" compared to
> >> "don't touch this page, hw is broken".
> > 
> > Does that means in case an offline no kdump reload as mentioned in 1)?
> > 
> > If we have the offline event and reload kdump, I assume the memory state
> > is refreshed so kdump will not read the memory offlined, am I missing
> > something?
> 
> If a whole section is offline: yes. (ACPI hotplug)
> 
> If pages are online but broken ("logically offline" - hwpoison): no
> 
> If single pages are logically offline: no. (Balloon inflation - let's
> call it unplug as that's what some people refer to)
> 
> If only subsections (4MB chunks) are offline: no.
> 
> Exporting memory ranges in a smaller granularity to kdump than section
> size would a) be heavily complicated b) introduce a lot of overhead for
> this tracking data c) make us retrigger kdump way too often.
> 
> So simply marking pages offline in the struct pages and telling kdump
> about it is the straight forward thing to do. And it is fairly easy to
> add and implement as we have the exact same thing in place for hwpoison.

Ok, it is clear enough.  In that case a fine grained offline page is like
a hwpoison page, so a userspace patch for makedumpfile is needed to
exclude such pages when copying vmcore.

> 
> > 
> >>
> >> Balloon drivers solve this problem by always allowing to read unplugged
> >> memory. In virtio-mem, this cannot and should even not be guaranteed.
> >>
> > 
> > Hmm, that sounds a bug..
> 
> I can give you a simple example why reading such unplugged (or balloon
> inflated) memory is problematic: Huge page backed guests.
> 
> There is no zero page for huge pages. So if we allow the guest to read
> that memory any time, we cannot guarantee that we actually consume less
> memory in the hypervisor. This is absolutely to be avoided.
> 
> Existing balloon drivers don't support huge page backed guests. (well
> you can inflate, but the hypervisor cannot madvise() 4k on a huge page,
> resulting in no action being performed). This scenario is to be
> supported with virtio-mem.
> 
> 
> So yes, this is actually a bug in e.g. virtio-balloon implementations:
> 
> With "VIRTIO_BALLOON_F_MUST_TELL_HOST" we have to tell the hypervisor
> before we access a page again. kdump cannot do this and does not care,
> so this page is silently accessed and dumped. One of the main problems
> why extending virtio-balloon hypervisor implementations to support
> host-enforced R/W protection is impossible.

I'm not sure I got all the virt related background, but still thank you
for the detailed explanation.  This is the first time I have heard about
this, nobody complained before :(

> 
> > 
> >> And what we have to do to make this work is actually pretty simple: Just
> >> like Pg_hwpoison, track per page if it is online and provide this
> >> information to kdump.
> >>
> >>
> > 
> > Thanks
> > Dave
> > 
> 
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb

Thanks
Dave
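
The per-page filtering discussed above (treat an offline page like a
hwpoison page and leave it out of the dump) could look roughly like the
sketch below.  It is not makedumpfile code: the flag values and the
select_dumpable_pages() helper are hypothetical, and a flat array of flags
stands in for the struct page information a dump tool would read from the
vmcore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page flag bits, for illustration only. */
enum {
	PAGE_FLAG_HWPOISON = 1, /* "don't touch this page, hw is broken" */
	PAGE_FLAG_OFFLINE  = 2, /* "don't touch this page, it is offline" */
};

/* Writes the indices of pages that are safe to read into out[] and
 * returns how many were selected (at most out_cap). */
size_t select_dumpable_pages(const uint32_t *flags, size_t npages,
			     size_t *out, size_t out_cap)
{
	size_t count = 0;

	assert(flags != NULL || npages == 0);
	assert(out != NULL || out_cap == 0);

	for (size_t pfn = 0; pfn < npages && count < out_cap; pfn++) {
		/* Skip anything marked poisoned or offline, never read it. */
		if (flags[pfn] & (PAGE_FLAG_HWPOISON | PAGE_FLAG_OFFLINE))
			continue;
		out[count++] = pfn;
	}
	return count;
}
```

The point of the design is that the decision is made per page from
information in the dump itself, so no kdump reload is needed when single
pages or 4MB chunks go offline.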

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Dave Young

Hi Eric,

On 05/24/18 at 11:41am, Eric W. Biederman wrote:
> Dave Young <dyo...@redhat.com> writes:
> 
> > Hi Eric,
> > On 05/23/18 at 10:53am, Eric W. Biederman wrote:
> >> Dave Young <dyo...@redhat.com> writes:
> >> 
> >> > [snip]
> >> >
> >> >> >  
> >> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> >> > + int "System memory size threshold for kdump memory default 
> >> >> > reserving"
> >> >> > + depends on CRASH_CORE
> >> >> > + default 0
> >> >> > + help
> >> >> > +   CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> >> > +   the system memory size is equal or bigger than the threshold.
> >> >> 
> >> >> "the threshold" is rather vague.  Can it be clarified?
> >> >> 
> >> >> In fact I'm really struggling to understand the logic here
> >> >> 
> >> >> 
> >> >> > +config CRASHKERNEL_DEFAULT_MB
> >> >> > + int "Default crashkernel memory size reserved for kdump"
> >> >> > + depends on CRASH_CORE
> >> >> > + default 0
> >> >> > + help
> >> >> > +   This is used as the default kdump reserved memory size in MB.
> >> >> > +   crashkernel=X kernel cmdline can overwrite this value.
> >> >> > +
> >> >> >  config HAVE_IMA_KEXEC
> >> >> >   bool
> >> >> >  
> >> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > +static int __init get_crashkernel_default(unsigned long long 
> >> >> > system_ram,
> >> >> > +   unsigned long long *size)
> >> >> > +{
> >> >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> >> > + unsigned long long thres = 
> >> >> > CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> >> > +
> >> >> > + thres *= SZ_1M;
> >> >> > + sz *= SZ_1M;
> >> >> > +
> >> >> > + if (sz >= system_ram || system_ram < thres) {
> >> >> > + pr_debug("crashkernel default size can not be used.\n");
> >> >> > + return -EINVAL;
> >> >> 
> >> >> In other words,
> >> >> 
> >> >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >> >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >> >> fail;
> >> >> 
> >> >> yes?
> >> >> 
> >> >> How come?  What's happening here?  Perhaps a (good) explanatory comment
> >> >> is needed.  And clearer Kconfig text.
> >> >> 
> >> >> All confused :(
> >> >
> >> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> >> > the size is too large and kernel can not find enough memory it will
> >> > still fail in latter code.
> >> >
> >> > Is below version looks clearer?
> >> 
> >> What is the advantage of providing this in a kconfig option rather
> >> than on the kernel command line as we can now?
> >
> > It is not a replacement of the cmdline, this can be a supplement to
> > the crashkernel command line.  For a lot of common use cases if we have
> > the auto reservation user just do not need to manually set the cmdline
> > for example on a virtual machine and usual setup (except of the
> > comlicate storage and very large machines).  The crashkernel=auto
> > has been used for long time, Red Hat QE tested it on a lot of different
> > lab machines and proved it works well.  Kdump usually just works so admin
> > do little work to enable kdump.
> >
> > But the crashkernel=auto implementation has some drawbacks that is it
> > is more like embed policy in the code and it is not flexible like a
> > config option.
> 
> Have you considered using the builtin command line aka CONFIG_CMDLINE?
> If as you are reserving a fixed amount of memory as your patch does that
> should be sufficient, and doable without any kernel changes.

Hmm, even in the builtin cmdline it is the same as an explicitly used
crashkernel=.  If we think from a distribution point of view, it will be
hard to differentiate the builtin provided param from the bootloader
provided params.  It looks odd to see two crashkernel= when doing
`cat /proc/cmdline`; it will confuse people and it could cause
compatibility problems because it is an explicit value visible in the
kernel cmdline.

Thanks
Dave
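
For reference, the threshold logic from the quoted patch, in the tuned form
described above (with the sz >= system_ram check removed), can be sketched
as a standalone function.  This is an assumption-laden model: the function
name, the SZ_1M_MODEL constant and passing the two Kconfig values as
parameters are illustrative choices, and the tail of the real
get_crashkernel_default() is not visible in the quoted diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SZ_1M_MODEL (1024ULL * 1024ULL)

/* default_mb and threshold_mb stand in for CONFIG_CRASHKERNEL_DEFAULT_MB
 * and CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB.  Returns 0 and stores the
 * reservation size in bytes when the default applies, -EINVAL otherwise. */
int crashkernel_default_model(unsigned long long system_ram,
			      unsigned long long default_mb,
			      unsigned long long threshold_mb,
			      unsigned long long *size)
{
	assert(size != NULL);

	/* The default is only used when system memory is equal to or
	 * bigger than the threshold. */
	if (system_ram < threshold_mb * SZ_1M_MODEL)
		return -EINVAL;

	/* No check against system_ram here: if the size is too large, the
	 * later reservation code is expected to fail on its own. */
	*size = default_mb * SZ_1M_MODEL;
	return 0;
}
```

For example, with a 2048 MB threshold and a 256 MB default, a 4 GB machine
gets a 256 MB reservation and a 1 GB machine gets none unless crashkernel=
is given on the command line.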

Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver

2018-05-24 Thread Dave Young

Hi,

[snip]
> > 
> >> For kdump and onlining/offlining code, we
> >> have to mark pages as offline before a new segment is visible to the system
> >> (e.g. as these pages might not be backed by real memory in the hypervisor).
> > 
> > Please expand on the kdump part. That is really confusing because
> > hotplug should simply not depend on kdump at all. Moreover why don't you
> > simply mark those pages reserved and pull them out from the page
> > allocator?
> 
> 1. "hotplug should simply not depend on kdump at all"
> 
> In theory yes. In the current state we already have to trigger kdump to
> reload whenever we add/remove a memory block.
> 
> 
> 2. kdump part
> 
> Whenever we offline a page and tell the hypervisor about it ("unplug"),
> we should not assume that we can read that page again. Now, if dumping
> tools assume they can read all memory that is offline, we are in trouble.
> 
> It is the same thing as we already have with Pg_hwpoison. Just a
> different meaning - "don't touch this page, it is offline" compared to
> "don't touch this page, hw is broken".

Does that mean that in the offline case there is no kdump reload as mentioned in 1)?

If we have the offline event and reload kdump, I assume the memory state
is refreshed so kdump will not read the memory offlined, am I missing
something?

> 
> Balloon drivers solve this problem by always allowing to read unplugged
> memory. In virtio-mem, this cannot and should even not be guaranteed.
> 

Hmm, that sounds like a bug..

> And what we have to do to make this work is actually pretty simple: Just
> like Pg_hwpoison, track per page if it is online and provide this
> information to kdump.
> 
> 

Thanks
Dave

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Dave Young

> > > Instead of setting aside a significant chunk of memory nobody can use,
> > > [...] reserve a significant chunk of memory that the kernel is prevented
> > > from using [...], but applications are free to use it.
> > 
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.
> 
> Good suggestion. I have been reading that posts already at the same time 
> before I saw
> this reply from you :)
> 
> That could be a good idea and worth to discuss more.  I cced Hari
> already in the thread. Hari, is it possible for you to extend your
> idea to general use, ie. shared by both kdump and fadump?  Anyway I
> think that is another topic we can discuss separately.

BTW, I remember we had some Red Hat internal discussion about CMA
previously.  There is a problem: we have crashkernel=,high for reserving
high memory and ,low for low memory, and we were not sure if CMA can
handle this case.

Thanks
Dave

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Dave Young

On 05/24/18 at 03:26pm, Dave Young wrote:
> On 05/24/18 at 08:57am, Petr Tesarik wrote:
> > On Thu, 24 May 2018 09:49:05 +0800
> > Dave Young <dyo...@redhat.com> wrote:
> > 
> > > Hi Petr,
> > > 
> > > On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> > >[...]
> > > > In short, if one size fits none, what good is it to hardcode that "one
> > > > size" into the kernel image?  
> > > 
> > > > I agree that we cannot know the exact memory requirement for 100%
> > > > of use cases.  But that does not mean this is useless; it is still
> > > > useful for common use cases without special or memory-hogging
> > > > requirements. As I mentioned in another reply, it can simplify kdump
> > > > deployment for those people who do not need a special setup.
> > 
> > I still tend to disagree. This "common-case" reservation depends on
> > things that are defined by user space. It surely does not make it
> > easier to build a distribution kernel. Today, I get bug reports that
> > the number calculated and added to the boot loader configuration by the
> > installer is inaccurate. If I put a fixed number into a kernel config
> > option, I will start getting bugs that this number is incorrect (for
> > some systems).
> 
> The value is a best effort; it will never be 100% correct, and we do not
> guarantee that.  The kernel config option value is up to the user,
> so I think of it as a nice-to-have benefit.

I mean that this patch is not trying to force a fixed crashkernel value
into the kernel code. It provides another option that one can set at kernel
build time when a fixed value just works.

> 
> > 
> > > For example, if this is a workstation I just want to break into a shell
> > > to collect some panic info, then I just need a very minimal initrd, then
> > > the Kconfig will work just fine.
> > 
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
> 
> Still, I think we agree and feel the same about the userspace
> memory requirement.  Although it is hard, we are still
> trying to shrink the initramfs memory use.
> 
> Besides distribution use, why can people not use a minimal
> initrd?  For example, one with only a basic shell, some necessary tools and
> basic storage support (e.g. raw disks), so that the user can collect the
> panic information by hand in a shell.
> 
> > 
> > Anyway, if you want to improve the "common case", then look how IBM
> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
> > 
> > https://patchwork.ozlabs.org/patch/905026/
> > 
> > The main idea is:
> > 
> > > Instead of setting aside a significant chunk of memory nobody can use,
> > > [...] reserve a significant chunk of memory that the kernel is prevented
> > > from using [...], but applications are free to use it.
> > 
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.
> 
> Good suggestion. I was already reading those posts when I saw this
> reply from you :)
> 
> That could be a good idea and is worth discussing more.  I have already
> cc'ed Hari in the thread. Hari, is it possible for you to extend your
> idea to general use, i.e. shared by both kdump and fadump?  Anyway, I
> think that is another topic we can discuss separately.
> 
> > 
> > Just my two cents,
> > Petr T
> 
> Thanks
> Dave

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-24 Thread Dave Young

On 05/24/18 at 08:57am, Petr Tesarik wrote:
> On Thu, 24 May 2018 09:49:05 +0800
> Dave Young <dyo...@redhat.com> wrote:
> 
> > Hi Petr,
> > 
> > On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >[...]
> > > In short, if one size fits none, what good is it to hardcode that "one
> > > size" into the kernel image?  
> > 
> > I agree that we cannot know the exact memory requirement for 100%
> > of use cases.  But that does not mean this is useless; it is still
> > useful for common use cases without special or memory-hogging
> > requirements. As I mentioned in another reply, it can simplify kdump
> > deployment for those people who do not need a special setup.
> 
> I still tend to disagree. This "common-case" reservation depends on
> things that are defined by user space. It surely does not make it
> easier to build a distribution kernel. Today, I get bug reports that
> the number calculated and added to the boot loader configuration by the
> installer is inaccurate. If I put a fixed number into a kernel config
> option, I will start getting bugs that this number is incorrect (for
> some systems).

The value is a best effort; it will never be 100% correct, and we do not
guarantee that.  The kernel config option value is up to the user,
so I think of it as a nice-to-have benefit.

> 
> > For example, if this is a workstation I just want to break into a shell
> > to collect some panic info, then I just need a very minimal initrd, then
> > the Kconfig will work just fine.
> 
> What is "a very minimal initrd"? Last time I had to make a significant
> adjustment to the estimation for openSUSE, this was caused by growing
> user-space requirements (systemd in this case, but I don't want to
> start flamewars on that topic, please).

Still, I think we agree and feel the same about the userspace
memory requirement.  Although it is hard, we are still
trying to shrink the initramfs memory use.

Besides distribution use, why can people not use a minimal
initrd?  For example, one with only a basic shell, some necessary tools and
basic storage support (e.g. raw disks), so that the user can collect the
panic information by hand in a shell.

> 
> Anyway, if you want to improve the "common case", then look how IBM
> tries to solve it for firmware-assisted dump (fadump) on powerpc:
> 
> https://patchwork.ozlabs.org/patch/905026/
> 
> The main idea is:
> 
> > Instead of setting aside a significant chunk of memory nobody can use,
> > [...] reserve a significant chunk of memory that the kernel is prevented
> > from using [...], but applications are free to use it.
> 
> That works great, because user space pages are filtered out in the
> common case, so they can be used freely by the panic kernel.

Good suggestion. I was already reading those posts when I saw this
reply from you :)

That could be a good idea and is worth discussing more.  I have already
cc'ed Hari in the thread. Hari, is it possible for you to extend your
idea to general use, i.e. shared by both kdump and fadump?  Anyway, I
think that is another topic we can discuss separately.

> 
> Just my two cents,
> Petr T

Thanks
Dave

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-23 Thread Dave Young

Hi Petr,

On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> On Wed, 23 May 2018 10:53:55 -0500
> ebied...@xmission.com (Eric W. Biederman) wrote:
> 
> > Dave Young  writes:
> > 
> > > [snip]
> > >  
> > >> >  
> > >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > >> > +  int "System memory size threshold for kdump memory default reserving"
> > >> > +  depends on CRASH_CORE
> > >> > +  default 0
> > >> > +  help
> > >> > +CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > >> > +the system memory size is equal or bigger than the threshold. 
> > >> >  
> > >> 
> > >> "the threshold" is rather vague.  Can it be clarified?
> > >> 
> > >> In fact I'm really struggling to understand the logic here
> > >> 
> > >>   
> > >> > +config CRASHKERNEL_DEFAULT_MB
> > >> > +  int "Default crashkernel memory size reserved for kdump"
> > >> > +  depends on CRASH_CORE
> > >> > +  default 0
> > >> > +  help
> > >> > +This is used as the default kdump reserved memory size in MB.
> > >> > +crashkernel=X kernel cmdline can overwrite this value.
> > >> > +
> > >> >  config HAVE_IMA_KEXEC
> > >> >bool
> > >> >  
> > >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > >> >return 0;
> > >> >  }
> > >> >  
> > >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > >> > +unsigned long long *size)
> > >> > +{
> > >> > +  unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > >> > +  unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > >> > +
> > >> > +  thres *= SZ_1M;
> > >> > +  sz *= SZ_1M;
> > >> > +
> > >> > +  if (sz >= system_ram || system_ram < thres) {
> > >> > +  pr_debug("crashkernel default size can not be used.\n");
> > >> > +  return -EINVAL;  
> > >> 
> > >> In other words,
> > >> 
> > >>  if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> > >>  system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> > >>  fail;
> > >> 
> > >> yes?
> > >> 
> > >> How come?  What's happening here?  Perhaps a (good) explanatory comment
> > >> is needed.  And clearer Kconfig text.
> > >> 
> > >> All confused :(  
> > >
> > > > Andrew, I tuned it a bit and removed the sz >= system_ram check, so if
> > > > the size is too large and the kernel cannot find enough memory, it will
> > > > still fail in later code.
> > > >
> > > > Does the version below look clearer?
> > 
> > What is the advantage of providing this in a kconfig option rather
> > than on the kernel command line as we can now?
> 
> Yeah, I was about to ask the very same question.
> 
> Having spent quite some time on estimating RAM required to save a crash
> dump, I can tell you that there is no silver bullet. My main objection
> is that core dumps are saved from user space, and the kernel cannot
> have a clue what it is going to be.
> 
> First, the primary kernel cannot know how much memory will be needed
> for the panic kernel (not necessarily same as the primary kernel) and
> the panic initrd. If you build a minimal initrd for your system, then
> at least it depends on which modules must be included, which in turn
> depends on where you want to store the resulting dump. Mounting a local
> ext2 partition will require less software than mounting an LVM logical
> volume in a PV accessed through iSCSI over two bonded Ethernet NICs.
> 
> Second, run-time requirements may vary wildly. While sending the data
> over a simple TCP connection (e.g. using FTP) consumes just a few
> megabytes even on 10G Ethernet, dm block devices tend to consume much
> more, because of the additional buffers allocated by device mapper.
> 
> Third, systems should be treated as "big" not so much because of the
> amount of RAM, but more so because of the amount of attached devices.
> I've se

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-23 Thread Dave Young

Hi Eric,
On 05/23/18 at 10:53am, Eric W. Biederman wrote:
> Dave Young <dyo...@redhat.com> writes:
> 
> > [snip]
> >
> >> >  
> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> > +int "System memory size threshold for kdump memory default reserving"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> > +  the system memory size is equal or bigger than the threshold.
> >> 
> >> "the threshold" is rather vague.  Can it be clarified?
> >> 
> >> In fact I'm really struggling to understand the logic here
> >> 
> >> 
> >> > +config CRASHKERNEL_DEFAULT_MB
> >> > +int "Default crashkernel memory size reserved for kdump"
> >> > +depends on CRASH_CORE
> >> > +default 0
> >> > +help
> >> > +  This is used as the default kdump reserved memory size in MB.
> >> > +  crashkernel=X kernel cmdline can overwrite this value.
> >> > +
> >> >  config HAVE_IMA_KEXEC
> >> >  bool
> >> >  
> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> >  return 0;
> >> >  }
> >> >  
> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> > +  unsigned long long *size)
> >> > +{
> >> > +unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> > +unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> > +
> >> > +thres *= SZ_1M;
> >> > +sz *= SZ_1M;
> >> > +
> >> > +if (sz >= system_ram || system_ram < thres) {
> >> > +pr_debug("crashkernel default size can not be used.\n");
> >> > +return -EINVAL;
> >> 
> >> In other words,
> >> 
> >>if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >>system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >>fail;
> >> 
> >> yes?
> >> 
> >> How come?  What's happening here?  Perhaps a (good) explanatory comment
> >> is needed.  And clearer Kconfig text.
> >> 
> >> All confused :(
> >
> > Andrew, I tuned it a bit and removed the sz >= system_ram check, so if
> > the size is too large and the kernel cannot find enough memory, it will
> > still fail in later code.
> >
> > Does the version below look clearer?
> 
> What is the advantage of providing this in a kconfig option rather
> than on the kernel command line as we can now?

It is not a replacement for the cmdline; it can be a supplement to
the crashkernel command line.  For a lot of common use cases, if we have
the auto reservation, users do not need to set the cmdline manually,
for example on a virtual machine or a usual setup (except for
complicated storage and very large machines).  crashkernel=auto
has been used for a long time; Red Hat QE tested it on a lot of different
lab machines and it proved to work well.  Kdump usually just works, so
admins need to do little work to enable kdump.

But the crashkernel=auto implementation has a drawback: it is more
like embedding policy in the code, and it is not as flexible as a
config option.

Thanks
Dave
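
As a rough sketch of the "supplement, not replacement" behaviour described
above, the selection between the command line and the two config options
could look like the following. This is not the code from the patch:
sketch_crashkernel_size() and SKETCH_SZ_1M are hypothetical names, and
treating a default of 0 MB as "disabled" is an assumption based on the
Kconfig defaults quoted in this thread.

```c
#include <stdbool.h>

#define SKETCH_SZ_1M (1024ULL * 1024ULL)

/*
 * Hypothetical helper: returns the number of bytes to reserve for the
 * crash kernel, or 0 when nothing should be reserved.
 */
static unsigned long long sketch_crashkernel_size(bool have_cmdline,
						  unsigned long long cmdline_size,
						  unsigned long long system_ram,
						  unsigned long long default_mb,
						  unsigned long long threshold_mb)
{
	/* An explicit crashkernel=X always wins over the built-in default. */
	if (have_cmdline)
		return cmdline_size;

	/* Assumption: a default of 0 MB means no default reservation. */
	if (default_mb == 0)
		return 0;

	/* Systems below the threshold get no default reservation. */
	if (system_ram < threshold_mb * SKETCH_SZ_1M)
		return 0;

	return default_mb * SKETCH_SZ_1M;
}
```

With a 256 MB default and a 4096 MB threshold, a machine with 8 GB of RAM
and no crashkernel= parameter would get 256 MB under this sketch, while a
2 GB machine would get nothing.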

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-23 Thread Dave Young

[snip]

> >  
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > +   int "System memory size threshold for kdump memory default reserving"
> > +   depends on CRASH_CORE
> > +   default 0
> > +   help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
> 
> "the threshold" is rather vague.  Can it be clarified?
> 
> In fact I'm really struggling to understand the logic here
> 
> 
> > +config CRASHKERNEL_DEFAULT_MB
> > +   int "Default crashkernel memory size reserved for kdump"
> > +   depends on CRASH_CORE
> > +   default 0
> > +   help
> > + This is used as the default kdump reserved memory size in MB.
> > + crashkernel=X kernel cmdline can overwrite this value.
> > +
> >  config HAVE_IMA_KEXEC
> > bool
> >  
> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > return 0;
> >  }
> >  
> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > + unsigned long long *size)
> > +{
> > +   unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > +   unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > +
> > +   thres *= SZ_1M;
> > +   sz *= SZ_1M;
> > +
> > +   if (sz >= system_ram || system_ram < thres) {
> > +   pr_debug("crashkernel default size can not be used.\n");
> > +   return -EINVAL;
> 
> In other words,
> 
>   if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
>   system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
>   fail;
> 
> yes?
> 
> How come?  What's happening here?  Perhaps a (good) explanatory comment
> is needed.  And clearer Kconfig text.
> 
> All confused :(

Andrew, I tuned it a bit and removed the sz >= system_ram check, so if
the size is too large and the kernel cannot find enough memory, it will
still fail in later code.

Does the version below look clearer?
---

This is a rework of the crashkernel=auto patches dating back to 2009,
although I'm not sure whether the links below point to the last version
of that old effort:
https://lkml.org/lkml/2009/8/12/61
https://lwn.net/Articles/345344/

I changed the original design: instead of adding the auto reserve logic
in code, this patch just introduces two kernel config options, one for
the default crashkernel value in MB and one for the system memory
threshold in MB, so that the default is only reserved when system memory
is equal to or above the threshold.

Signed-off-by: Dave Young <dyo...@redhat.com>
---
Another difference is that in the original design the crashkernel size
scales with system memory.  According to testing, large machines may need
more memory in the kdump kernel because of several factors:
1. CPU count, because of the percpu memory allocated for each CPU.
   (kdump can use nr_cpus=1 to work around this, but some arches,
   for example powerpc, do not support nr_cpus=X.)
2. IO devices.  A large system can have a lot of IO devices.  Although we
   can try to add only the device drivers we need, it is still a
   problem because of some built-in drivers and some stacked logical
   devices, e.g. device mapper devices, ACPI etc.  Even considering only
   the metadata for the driver model, e.g. sysfs files, it is still a
   big number.
3. The minimum memory requirement of some device drivers is big, even
   if some of them have implemented a low-memory profile.  It is usual
   to see 10M of memory used by a storage driver.
4. The user space initramfs size is growing.  Busybox is not usable if
   we need to add udev support and some complicated storage support.
   Using dracut with systemd, especially the networking stuff, needs
   more memory.

So it would probably also be good to add another kernel config option
to scale the memory size, e.g. CRASHKERNEL_DEFAULT_SCALE_RATIO.  In RHEL
we use base_value + system_mem >> (2^14) for x86.  I'm still hesitating
about how to describe and add this option.  Any suggestions will be
appreciated.
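
A hedged note on the formula above, not the RHEL implementation: taken
literally as C, "base_value + system_mem >> (2^14)" parses as
"(base_value + system_mem) >> (2 ^ 14)", and "2 ^ 14" is XOR, which
evaluates to 12.  One plausible reading is base_value plus system memory
divided by 2^14.  The sketch below assumes that reading; the helper name
and the shift parameter are hypothetical and do not come from the patch.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * Hypothetical helper, not part of the patch: scale the reservation with
 * system memory.  Assumes the intent is base + (system_ram >> shift), so
 * the parentheses are explicit and the shift count is a plain integer
 * (it must be below 64).
 */
static uint64_t crashkernel_scaled_size(uint64_t base_bytes,
					uint64_t system_ram_bytes,
					unsigned int shift)
{
	return base_bytes + (system_ram_bytes >> shift);
}
```

With an assumed 160M base and 64G of RAM,
crashkernel_scaled_size(160 * MIB, 65536 * MIB, 14) gives 164M, since
64G >> 14 is 4M.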

 arch/Kconfig|   17 +
 kernel/crash_core.c |   19 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

--- linux-x86.orig/arch/Kconfig
+++ linux-x86/arch/Kconfig
@@ -10,6 +10,23 @@ config KEXEC_CORE
select CRASH_CORE
bool
 
+config CRASHKERNEL_DEFAULT_THRESHOLD_MB
+   int "System memory size threshold for using CRASHKERNEL_DEFAULT_MB"
+   depends on CRASH_CORE
+   default 0
+   help
+ CRASHKERNEL_DEFAULT_MB will be reserved for kdump if the system
+ memory is above or equal to CRASHKERNEL_DEFAULT_THRESHOLD_MB MB.
+ It is only effective in case no crashkernel=X parameter is used.
+
+config CRASHKERNEL_DEFAULT_MB
+   int

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-21 Thread Dave Young

On 05/21/18 at 12:02pm, Andrew Morton wrote:
> On Mon, 21 May 2018 10:53:37 +0800 Dave Young <dyo...@redhat.com> wrote:
> 
> > This is a rework of the crashkernel=auto patches from 2009, although
> > I'm not sure whether the links below point to the last version of that
> > old effort:
> > https://lkml.org/lkml/2009/8/12/61
> > https://lwn.net/Articles/345344/
> > 
> > I changed the original design: instead of adding the auto-reserve logic
> > in code, this patch just introduces two kernel config options, one for
> > the default crashkernel value in MB and one for the system memory
> > threshold in MB, so that the default is reserved only when system memory
> > is equal to or above the threshold.
> > 
> > With the kernel configs, distributions can easily change the default
> > values so that people do not need to set the kernel cmdline manually
> > for common use cases.  One can still override the default value
> > with a manual setup, or disable it by using crashkernel=0.
> > 
> > Signed-off-by: Dave Young <dyo...@redhat.com>
> > ---
> > Another difference is that in the original design the crashkernel size
> > scales with system memory.  According to testing, large machines may need
> > more memory in the kdump kernel because of several factors:
> > 1. CPU count, because of the percpu memory allocated for each CPU.
> >    (kdump can use nr_cpus=1 to work around this, but some arches,
> >    for example powerpc, do not support nr_cpus=X.)
> > 2. IO devices.  A large system can have a lot of IO devices.  Although we
> >    can try to add only the device drivers we need, it is still a
> >    problem because of some built-in drivers and some stacked logical
> >    devices, e.g. device mapper devices, ACPI etc.  Even considering only
> >    the metadata for the driver model, e.g. sysfs files, it is still a
> >    big number.
> > 3. The minimum memory requirement of some device drivers is big, even
> >    if some of them have implemented a low-memory profile.  It is usual
> >    to see 10M of memory used by a storage driver.
> > 4. The user space initramfs size is growing.  Busybox is not usable if
> >    we need to add udev support and some complicated storage support.
> >    Using dracut with systemd, especially the networking stuff, needs
> >    more memory.
> > 
> > So it would probably also be good to add another kernel config option
> > to scale the memory size, e.g. CRASHKERNEL_DEFAULT_SCALE_RATIO.  In RHEL
> > we use base_value + system_mem >> (2^14) for x86.  I'm still hesitating
> > about how to describe and add this option.  Any suggestions will be
> > appreciated.
> > 
> > ...
> >
> > --- linux-x86.orig/arch/Kconfig
> > +++ linux-x86/arch/Kconfig
> > @@ -10,6 +10,22 @@ config KEXEC_CORE
> > select CRASH_CORE
> > bool
> >  
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > +   int "System memory size threshold for kdump memory default reserving"
> > +   depends on CRASH_CORE
> > +   default 0
> > +   help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
> 
> "the threshold" is rather vague.  Can it be clarified?
> 
> In fact I'm really struggling to understand the logic here

Sorry, I missed this comment.  The threshold works like this:

If the system memory size is lower than the threshold, we do nothing
and do not reserve crashkernel memory by default.  E.g. if the
threshold is 1024M, the default reservation is used only when the system
has at least 1024M of memory; on low-memory machines below that we do
not reserve by default.
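
As a minimal user-space sketch of the threshold rule described above
(the function name is hypothetical and this is not the kernel code): the
patch fails the default when "system_ram < thres", so a machine with
exactly the threshold amount of memory still gets the default
reservation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * Illustrative predicate: true when the default reservation should be
 * used, i.e. when system RAM is equal to or above the threshold.  This
 * is the patch's "system_ram < thres" failure test with the sense
 * inverted.  The threshold is given in MB, the RAM size in bytes.
 */
static bool crashkernel_default_applies(uint64_t system_ram_bytes,
					uint64_t threshold_mb)
{
	return system_ram_bytes >= threshold_mb * MIB;
}
```

With a 1024M threshold, a 512M machine is skipped and a 1024M machine is
not.  A threshold of 0, the Kconfig default in the patch, never filters
anything out.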

Thanks
Dave

Re: [PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-21 Thread Dave Young

On 05/21/18 at 12:02pm, Andrew Morton wrote:
> On Mon, 21 May 2018 10:53:37 +0800 Dave Young <dyo...@redhat.com> wrote:
> 
> > This is a rework of the crashkernel=auto patches from 2009, although
> > I'm not sure whether the links below point to the last version of that
> > old effort:
> > https://lkml.org/lkml/2009/8/12/61
> > https://lwn.net/Articles/345344/
> > 
> > I changed the original design: instead of adding the auto-reserve logic
> > in code, this patch just introduces two kernel config options, one for
> > the default crashkernel value in MB and one for the system memory
> > threshold in MB, so that the default is reserved only when system memory
> > is equal to or above the threshold.
> > 
> > With the kernel configs, distributions can easily change the default
> > values so that people do not need to set the kernel cmdline manually
> > for common use cases.  One can still override the default value
> > with a manual setup, or disable it by using crashkernel=0.
> > 
> > Signed-off-by: Dave Young <dyo...@redhat.com>
> > ---
> > Another difference is that in the original design the crashkernel size
> > scales with system memory.  According to testing, large machines may need
> > more memory in the kdump kernel because of several factors:
> > 1. CPU count, because of the percpu memory allocated for each CPU.
> >    (kdump can use nr_cpus=1 to work around this, but some arches,
> >    for example powerpc, do not support nr_cpus=X.)
> > 2. IO devices.  A large system can have a lot of IO devices.  Although we
> >    can try to add only the device drivers we need, it is still a
> >    problem because of some built-in drivers and some stacked logical
> >    devices, e.g. device mapper devices, ACPI etc.  Even considering only
> >    the metadata for the driver model, e.g. sysfs files, it is still a
> >    big number.
> > 3. The minimum memory requirement of some device drivers is big, even
> >    if some of them have implemented a low-memory profile.  It is usual
> >    to see 10M of memory used by a storage driver.
> > 4. The user space initramfs size is growing.  Busybox is not usable if
> >    we need to add udev support and some complicated storage support.
> >    Using dracut with systemd, especially the networking stuff, needs
> >    more memory.
> > 
> > So it would probably also be good to add another kernel config option
> > to scale the memory size, e.g. CRASHKERNEL_DEFAULT_SCALE_RATIO.  In RHEL
> > we use base_value + system_mem >> (2^14) for x86.  I'm still hesitating
> > about how to describe and add this option.  Any suggestions will be
> > appreciated.
> > 
> > ...
> >
> > --- linux-x86.orig/arch/Kconfig
> > +++ linux-x86/arch/Kconfig
> > @@ -10,6 +10,22 @@ config KEXEC_CORE
> > select CRASH_CORE
> > bool
> >  
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > +   int "System memory size threshold for kdump memory default reserving"
> > +   depends on CRASH_CORE
> > +   default 0
> > +   help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
> 
> "the threshold" is rather vague.  Can it be clarified?
> 
> In fact I'm really struggling to understand the logic here
> 
> 
> > +config CRASHKERNEL_DEFAULT_MB
> > +   int "Default crashkernel memory size reserved for kdump"
> > +   depends on CRASH_CORE
> > +   default 0
> > +   help
> > + This is used as the default kdump reserved memory size in MB.
> > + crashkernel=X kernel cmdline can overwrite this value.
> > +
> >  config HAVE_IMA_KEXEC
> > bool
> >  
> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > return 0;
> >  }
> >  
> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > + unsigned long long *size)
> > +{
> > +   unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > +   unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > +
> > +   thres *= SZ_1M;
> > +   sz *= SZ_1M;
> > +
> > +   if (sz >= system_ram || system_ram < thres) {
> > +   pr_debug("crashkernel default size can not be used.\n");
> > +   return -EINVAL;
> 
> In other words,
> 
>   if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
>   system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
>   fail;
> 
> yes?

The first comparison is a sanity check on the default reserved size; if
it is not smaller than the system RAM size, it is clearly bad:
if ( CONFIG_CRASHKERNEL_DEFAULT_MB >= system_ram )
	fail;

The second comparison is for the threshold setting; the intended logic
is:
if ( system_ram >= CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB ) then
	go ahead and use the default value of CONFIG_CRASHKERNEL_DEFAULT_MB
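
The two comparisons can be separated in a small user-space sketch.  This
is only an illustration under assumptions: the enum and function names
are hypothetical, and the config values are passed as parameters instead
of being read from Kconfig.  As in the patch, both MB values are scaled
to bytes before they are compared with system_ram, which the shorthand
pseudocode leaves out.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Hypothetical result codes so each failing comparison is visible. */
enum default_check {
	DEFAULT_OK = 0,
	DEFAULT_TOO_LARGE,       /* default size >= system RAM */
	DEFAULT_BELOW_THRESHOLD, /* system RAM < threshold */
};

static enum default_check check_crashkernel_default(uint64_t system_ram_bytes,
						    uint64_t default_mb,
						    uint64_t threshold_mb,
						    uint64_t *size_bytes)
{
	uint64_t sz = default_mb * MIB;      /* MB -> bytes */
	uint64_t thres = threshold_mb * MIB; /* MB -> bytes */

	/* Sanity check: the reservation cannot cover all of RAM. */
	if (sz >= system_ram_bytes)
		return DEFAULT_TOO_LARGE;

	/* Threshold check: small machines get no default reservation. */
	if (system_ram_bytes < thres)
		return DEFAULT_BELOW_THRESHOLD;

	*size_bytes = sz;
	return DEFAULT_OK;
}
```

The output size is written only on success, so a caller can keep a zero
reservation when either check fails.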

> 
> How come?  What's happening here?  Perhaps a (good) explanatory comment
> is needed.  And clearer Kconfig text.
> 
> All confused :(

Hmm, scratch head~, will think about how to describe it better.  If you
have any suggestions just let me know :)

Thanks
Dave

[PATCH] kdump: add default crashkernel reserve kernel config options

2018-05-20 Thread Dave Young

This is a rework of the crashkernel=auto patches from 2009, although
I'm not sure whether the links below point to the last version of that
old effort:
https://lkml.org/lkml/2009/8/12/61
https://lwn.net/Articles/345344/

I changed the original design: instead of adding the auto-reserve logic
in code, this patch just introduces two kernel config options, one for
the default crashkernel value in MB and one for the system memory
threshold in MB, so that the default is reserved only when system memory
is equal to or above the threshold.

With the kernel configs, distributions can easily change the default
values so that people do not need to set the kernel cmdline manually
for common use cases.  One can still override the default value
with a manual setup, or disable it by using crashkernel=0.

Signed-off-by: Dave Young <dyo...@redhat.com>
---
Another difference is that in the original design the crashkernel size
scales with system memory.  According to testing, large machines may need
more memory in the kdump kernel because of several factors:
1. CPU count, because of the percpu memory allocated for each CPU.
   (kdump can use nr_cpus=1 to work around this, but some arches,
   for example powerpc, do not support nr_cpus=X.)
2. IO devices.  A large system can have a lot of IO devices.  Although we
   can try to add only the device drivers we need, it is still a
   problem because of some built-in drivers and some stacked logical
   devices, e.g. device mapper devices, ACPI etc.  Even considering only
   the metadata for the driver model, e.g. sysfs files, it is still a
   big number.
3. The minimum memory requirement of some device drivers is big, even
   if some of them have implemented a low-memory profile.  It is usual
   to see 10M of memory used by a storage driver.
4. The user space initramfs size is growing.  Busybox is not usable if
   we need to add udev support and some complicated storage support.
   Using dracut with systemd, especially the networking stuff, needs
   more memory.

So it would probably also be good to add another kernel config option
to scale the memory size, e.g. CRASHKERNEL_DEFAULT_SCALE_RATIO.  In RHEL
we use base_value + system_mem >> (2^14) for x86.  I'm still hesitating
about how to describe and add this option.  Any suggestions will be
appreciated.

 arch/Kconfig|   16 
 kernel/crash_core.c |   23 ++-
 2 files changed, 38 insertions(+), 1 deletion(-)

--- linux-x86.orig/arch/Kconfig
+++ linux-x86/arch/Kconfig
@@ -10,6 +10,22 @@ config KEXEC_CORE
 	select CRASH_CORE
 	bool
 
+config CRASHKERNEL_DEFAULT_THRESHOLD_MB
+	int "System memory size threshold for kdump memory default reserving"
+	depends on CRASH_CORE
+	default 0
+	help
+	  CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
+	  the system memory size is equal or bigger than the threshold.
+
+config CRASHKERNEL_DEFAULT_MB
+	int "Default crashkernel memory size reserved for kdump"
+	depends on CRASH_CORE
+	default 0
+	help
+	  This is used as the default kdump reserved memory size in MB.
+	  crashkernel=X kernel cmdline can overwrite this value.
+
 config HAVE_IMA_KEXEC
 	bool
 
--- linux-x86.orig/kernel/crash_core.c
+++ linux-x86/kernel/crash_core.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
 	return 0;
 }
 
+static int __init get_crashkernel_default(unsigned long long system_ram,
+					  unsigned long long *size)
+{
+	unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
+	unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
+
+	thres *= SZ_1M;
+	sz *= SZ_1M;
+
+	if (sz >= system_ram || system_ram < thres) {
+		pr_debug("crashkernel default size can not be used.\n");
+		return -EINVAL;
+	}
+	*size = sz;
+
+	return 0;
+}
+
 #define SUFFIX_HIGH 0
 #define SUFFIX_LOW  1
 #define SUFFIX_NULL 2
@@ -240,8 +259,10 @@ static int __init __parse_crashkernel(ch
 	*crash_size = 0;
 	*crash_base = 0;
 
-	ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
+	if (!strstr(cmdline, "crashkernel="))
+		return get_crashkernel_default(system_ram, crash_size);
 
+	ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
 	if (!ck_cmdline)
 		return -EINVAL;
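
The command-line test in the last hunk above can be tried in user space.
The sketch below is only an illustration with a hypothetical function
name; it reuses the same strstr() condition to show which command lines
would fall back to the Kconfig default.  Note that crashkernel=0 still
contains "crashkernel=", so it skips the default, which matches the
"disable it by using crashkernel=0" behaviour described in the changelog.
A plain substring match also triggers on any longer parameter name that
merely ends in "crashkernel=".

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical user-space helper: true when no "crashkernel=" substring
 * appears on the command line, which is the condition the patch uses to
 * fall back to the Kconfig default reservation.
 */
static bool should_use_crashkernel_default(const char *cmdline)
{
	return strstr(cmdline, "crashkernel=") == NULL;
}
```

For example, "ro quiet" selects the default, while "ro crashkernel=256M"
and "crashkernel=0" do not.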

Re: [PATCH 1/2] kdump/x86: crashkernel=X try to reserve below 896M first then below 4G and MAXMEM

2018-05-06 Thread Dave Young

On 04/27/18 at 05:14pm, Dave Young wrote:
> Hi,
>  
> This is a resend of the patches below:
> http://lists.infradead.org/pipermail/kexec/2017-October/019569.html
>  
> I dropped the original patch 1 since Baoquan is not happy with it.
> For patch 2 (the 1st patch in this series), there is an improvement
> suggestion from Baoquan to create a generic memblock iteration function,
> but nobody has time to work on it for the time being.  According to an
> offline discussion with him, that can be done in the future if someone
> is interested.  We can go with the current kdump-only fixes.
>  
> Other than the above, the patches are just the same.

Hi Andrew, do you have any concerns about the patches?  They have been
used for a long time in the Red Hat kernel.  Since nobody has objected
to them, could you pick them up if there are no other concerns?

Thanks
Dave

Re: [PATCH v9 02/11] kexec_file: make kexec_image_post_load_cleanup_default() global

2018-04-28 Thread Dave Young

On 04/25/18 at 03:26pm, AKASHI Takahiro wrote:
> Change this function from static to global so that arm64 can implement
> its own arch_kimage_file_post_load_cleanup() later using
> kexec_image_post_load_cleanup_default().
> 
> Signed-off-by: AKASHI Takahiro <takahiro.aka...@linaro.org>
> Cc: Dave Young <dyo...@redhat.com>
> Cc: Vivek Goyal <vgo...@redhat.com>
> Cc: Baoquan He <b...@redhat.com>
> ---
>  include/linux/kexec.h | 1 +
>  kernel/kexec_file.c   | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 9e4e638fb505..49ab758f4d91 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -143,6 +143,7 @@ extern const struct kexec_file_ops * const 
> kexec_file_loaders[];
>  
>  int kexec_image_probe_default(struct kimage *image, void *buf,
> unsigned long buf_len);
> +int kexec_image_post_load_cleanup_default(struct kimage *image);
>  
>  /**
>   * struct kexec_buf - parameters for finding a place for a buffer in memory
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 75d8e7cf040e..eef89d9b1f03 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -78,7 +78,7 @@ void * __weak arch_kexec_kernel_image_load(struct kimage 
> *image)
>   return kexec_image_load_default(image);
>  }
>  
> -static int kexec_image_post_load_cleanup_default(struct kimage *image)
> +int kexec_image_post_load_cleanup_default(struct kimage *image)
>  {
>   if (!image->fops || !image->fops->cleanup)
>   return 0;
> -- 
> 2.17.0
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Acked-by: Dave Young <dyo...@redhat.com>

Thanks
Dave

Re: pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)

2018-04-27 Thread Dave Young

On 04/28/18 at 08:56am, Dave Young wrote:
> On 04/27/18 at 04:12pm, Bjorn Helgaas wrote:
> > [+cc Eric, Vivek, kexec list]
> > 
> > On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
> > > On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> > > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > > generate hotplug interrupts" command.  I think we should work on this
> > > > idea a little more.  If we're shutting down the whole system, I can't
> > > > believe there's much value in *anything* we do in the pciehp_remove()
> > > > path.
> > > > 
> > > > Maybe we should just get rid of pciehp_remove() (and probably
> > > > pcie_port_remove_service() and the other service driver remove methods)
> > > > completely.  That dates from when the service drivers could be modules 
> > > > that

Hmm, if it is the remove() method then kexec does not use it.  kexec uses
the shutdown() method instead.  I missed this detail when I replied.

> > > > could be potentially unloaded, but unloading them hasn't been possible 
> > > > for
> > > > years.
> > > 
> > > Shutdown path is also used for kexec. Leaving hotplug interrupts
> > > pending is dangerous for the newly loaded kernel as it leaves
> > > spurious interrupts during the new kernel boot.
> > > 
> > > I think we should always disable the hotplug interrupt on shutdown.
> > > We might think of not waiting for command-completion as a
> > > middle-ground or go to polling path instead of interrupts all the
> > > time.
> > 
> > Ah, I forgot about the kexec path.  The kexec path is used for
> > crashdump, too, so ideally the newly-loaded kernel would defend itself
> > when possible so it doesn't depend on the original kernel doing things
> > correctly.
> 
> It is true for kdump.  But kexec needs device shutdown.
> 
> > 
> > Seems like this question of whether to do things in the original
> > kernel or the kexec-ed kernel comes up periodically, but I can never
> > remember a definitive answer.  My initial reaction is that it'd be
> > nice if we didn't have to do *any* shutdown in the original kernel,
> > but I'm sure there are reasons that's not practical.
> 
> Devices sometimes assume it is in a good state initialized in firmware boot
> phase, so we need a shutdown in 1st kernel so that kexec kernel can boot
> correctly for those devices.  For kdump since kernel already panicked
> and it is not reliable so we do as less as we can in the 1st kernel
> crash path, but there are some special handling for kdump in various drivers
> to reset the devices in 2nd kernel, eg. when it see "reset_devices" kernel 
> parameter.
> 
> > 
> > I copied Eric (kexec maintainer) and Vivek (contact listed in
> > Documentation/kdump/kdump.txt) in case they have suggestions or would
> > consider some sort of Documentation/ update.
> > 
> > Bjorn
> > 
> > ___
> > kexec mailing list
> > ke...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> 
> Thanks
> Dave
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Re: pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)

2018-04-27 Thread Dave Young

On 04/27/18 at 04:12pm, Bjorn Helgaas wrote:
> [+cc Eric, Vivek, kexec list]
> 
> On Fri, Apr 27, 2018 at 03:34:30PM -0400, Sinan Kaya wrote:
> > On 4/27/2018 3:22 PM, Bjorn Helgaas wrote:
> > > Sinan mooted the idea of using a "no-wait" path of sending the "don't
> > > generate hotplug interrupts" command.  I think we should work on this
> > > idea a little more.  If we're shutting down the whole system, I can't
> > > believe there's much value in *anything* we do in the pciehp_remove()
> > > path.
> > > 
> > > Maybe we should just get rid of pciehp_remove() (and probably
> > > pcie_port_remove_service() and the other service driver remove methods)
> > > completely.  That dates from when the service drivers could be modules 
> > > that
> > > could be potentially unloaded, but unloading them hasn't been possible for
> > > years.
> > 
> > Shutdown path is also used for kexec. Leaving hotplug interrupts
> > pending is dangerous for the newly loaded kernel as it leaves
> > spurious interrupts during the new kernel boot.
> > 
> > I think we should always disable the hotplug interrupt on shutdown.
> > We might think of not waiting for command-completion as a
> > middle-ground or go to polling path instead of interrupts all the
> > time.
> 
> Ah, I forgot about the kexec path.  The kexec path is used for
> crashdump, too, so ideally the newly-loaded kernel would defend itself
> when possible so it doesn't depend on the original kernel doing things
> correctly.

It is true for kdump.  But kexec needs device shutdown.

> 
> Seems like this question of whether to do things in the original
> kernel or the kexec-ed kernel comes up periodically, but I can never
> remember a definitive answer.  My initial reaction is that it'd be
> nice if we didn't have to do *any* shutdown in the original kernel,
> but I'm sure there are reasons that's not practical.

Devices sometimes assume they are in the good state that firmware set up
during boot, so we need a shutdown in the 1st kernel so that the kexec
kernel can boot correctly for those devices.  For kdump the kernel has
already panicked and is not reliable, so we do as little as we can in the
1st kernel crash path, but there is some special handling for kdump in
various drivers to reset the devices in the 2nd kernel, e.g. when they see
the "reset_devices" kernel parameter.

> 
> I copied Eric (kexec maintainer) and Vivek (contact listed in
> Documentation/kdump/kdump.txt) in case they have suggestions or would
> consider some sort of Documentation/ update.
> 
> Bjorn
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave
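
The difference between the two paths can be sketched as a small user
space model.  Nothing here is kernel code: struct model_device and all
model_* functions are hypothetical names, and the only behaviour modelled
is what is described above (kexec reboot runs shutdown(), the crash path
touches no device, and the 2nd kernel can still cope if the driver resets
the device when "reset_devices" is given).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device and its shutdown() hook. */
struct model_device {
    const char *name;
    bool quiesced;                              /* set once shutdown() ran */
    void (*shutdown)(struct model_device *dev);
};

/* Stand-in shutdown() method: leave the device in a quiet state. */
void model_quiesce(struct model_device *dev)
{
    dev->quiesced = true;
}

/* Normal kexec reboot: walk the devices and call shutdown() on each. */
void model_kexec_reboot(struct model_device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].shutdown)
            devs[i].shutdown(&devs[i]);
}

/* Crash (kdump) path: the panicked kernel does not touch any device. */
void model_crash_kexec(struct model_device *devs, size_t n)
{
    (void)devs;
    (void)n;
}

/*
 * Probe in the new kernel: the device is usable if the old kernel shut
 * it down, or if the driver resets it because reset_devices was given.
 */
bool model_probe_ok(const struct model_device *dev, bool reset_devices)
{
    return dev->quiesced || reset_devices;
}
```

In this model a device left running by the crash path only probes
cleanly when reset_devices is set, which is why the kdump kernel is
normally booted with that parameter.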

Re: [PATCH 1/2] kdump/x86: crashkernel=X try to reserve below 896M first then below 4G and MAXMEM

2018-04-27 Thread Dave Young

Hi,
 
This is a resend of the patches below:
http://lists.infradead.org/pipermail/kexec/2017-October/019569.html
 
I dropped the original patch 1 since Baoquan is not happy with it.
For patch 2 (the 1st patch in this series), Baoquan suggested an
improvement: create a generic memblock iteration function.  But nobody
has time to work on it for the time being.  According to an offline
discussion with him, that can be done in the future if someone is
interested.  We can go with the current kdump-only fixes.
 
Other than the above, the patches are the same.
 
Thanks
Dave

Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-18 Thread Dave Young

On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> > Hi Rahul,
> > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > > On production servers running variety of workloads over time, kernel
> > > panic can happen sporadically after days or even months. It is
> > > important to collect as much debug logs as possible to root cause
> > > and fix the problem, that may not be easy to reproduce. Snapshot of
> > > underlying hardware/firmware state (like register dump, firmware
> > > logs, adapter memory, etc.), at the time of kernel panic will be very
> > > helpful while debugging the culprit device driver.
> > > 
> > > This series of patches add new generic framework that enable device
> > > drivers to collect device specific snapshot of the hardware/firmware
> > > state of the underlying device in the crash recovery kernel. In crash
> > > recovery kernel, the collected logs are added as elf notes to
> > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> > > 
> > > The sequence of actions done by device drivers to append their device
> > > specific hardware/firmware logs to /proc/vmcore are as follows:
> > > 
> > > 1. During probe (before hardware is initialized), device drivers
> > > register to the vmcore module (via vmcore_add_device_dump()), with
> > > callback function, along with buffer size and log name needed for
> > > firmware/hardware log collection.
> > 
> > I assumed the elf notes info should be prepared while kexec_[file_]load
> > phase. But I did not read the old comment, not sure if it has been discussed
> > or not.
> > 
> 
> We must not collect dumps in crashing kernel. Adding more things in
> crash dump path risks not collecting vmcore at all. Eric had
> discussed this in more detail at:
> 
> https://lkml.org/lkml/2018/3/24/319
> 
> We are safe to collect dumps in the second kernel. Each device dump
> will be exported as an elf note in /proc/vmcore.

I understand that we should avoid adding anything to the crash path, and I
also agree with collecting the device dump in the second kernel.  I just
assumed the device dump uses some memory area to store the debug info and
that the memory is persistent, so that this could be done in two steps:
first register the address in the elf header at kexec_load time, then
collect the dump in the 2nd kernel.  But it seems the driver runs some
other logic to collect the info, so it is not as simple as I thought.

> 
> > If do this in 2nd kernel a question is driver can be loaded later than 
> > vmcore init.
> 
> Yes, drivers will add their device dumps after vmcore init.
> 
> > How to guarantee the function works if vmcore reading happens before
> > the driver is loaded?
> > 
> > Also it is possible that kdump initramfs does not contains the driver
> > module.
> > 
> > Am I missing something?
> > 
> 
> Yes, driver must be in initramfs if it wants to collect and add device
> dump to /proc/vmcore in second kernel.

In the RH/Fedora kdump scripts we only add the things that are required to
bring up the dump target, so that we use as little memory as we can.

For example, if a net driver panicked and the dump target is the rootfs
on a scsi disk, then no network related stuff will be added to the
initramfs.

In this case the device dump info will not be collected.
> 
> > > 
> > > 2. vmcore module allocates the buffer with requested size. It adds
> > > an elf note and invokes the device driver's registered callback
> > > function.
> > > 
> > > 3. Device driver collects all hardware/firmware logs into the buffer
> > > and returns control back to vmcore module.
> > > 
> > > The device specific hardware/firmware logs can be seen as elf notes:
> > > 
> > > # readelf -n /proc/vmcore
> > > 
> > > Displaying notes found at file offset 0x1000 with length 0x04003288:
> > >   Owner Data size Description
> > >   VMCOREDD_cxgb4_:02:00.4 0x02000fd8  Unknown note type: (0x0700)
> > >   VMCOREDD_cxgb4_:04:00.4 0x02000fd8  Unknown note type: (0x0700)
> > >   CORE 0x0150 NT_PRSTATUS (prstatus structure)
> > >   CORE 0x0150 NT_PRSTATUS (prstatus structure)
> > >   CORE 0x0150 NT_PRSTATUS (prstatus structure)
> > >   CORE 0x0150 NT_PRSTATUS (prstatus structure)
> > >   CORE 0x0150 NT_PRSTATUS (prstatus structure)
> > >   CORE 0x01

Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-18 Thread Dave Young

Hi Rahul,
On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
> 
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device in the crash recovery kernel. In crash
> recovery kernel, the collected logs are added as elf notes to
> /proc/vmcore, which is copied by user space scripts for post-analysis.
> 
> The sequence of actions done by device drivers to append their device
> specific hardware/firmware logs to /proc/vmcore are as follows:
> 
> 1. During probe (before hardware is initialized), device drivers
> register to the vmcore module (via vmcore_add_device_dump()), with
> callback function, along with buffer size and log name needed for
> firmware/hardware log collection.

I assumed the elf notes info would be prepared during the kexec_[file_]load
phase.  But I did not read the earlier comments, so I am not sure whether
this has been discussed or not.

If this is done in the 2nd kernel, one question is that the driver can be
loaded later than vmcore init.  How do we guarantee the function works if
vmcore reading happens before the driver is loaded?

Also it is possible that the kdump initramfs does not contain the driver
module.

Am I missing something?

> 
> 2. vmcore module allocates the buffer with requested size. It adds
> an elf note and invokes the device driver's registered callback
> function.
> 
> 3. Device driver collects all hardware/firmware logs into the buffer
> and returns control back to vmcore module.
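
The three steps quoted above can be sketched as a user space model.  This
is not the API from the patch series: struct model_dump_request,
model_add_device_dump() and the other model_* names are hypothetical, and
a plain linked list stands in for the elf notes that the vmcore module
would export.

```c
#include <stdlib.h>
#include <string.h>

#define MODEL_NAME_MAX 44            /* arbitrary limit for this sketch */

/* Step 1: what a driver hands over (log name, buffer size, callback). */
struct model_dump_request {
    char name[MODEL_NAME_MAX];
    size_t size;
    int (*collect)(struct model_dump_request *req, void *buf);
};

/* One collected dump; stands in for an elf note in /proc/vmcore. */
struct model_note {
    char name[MODEL_NAME_MAX];
    size_t size;
    void *data;
    struct model_note *next;
};

static struct model_note *model_notes;   /* most recent note first */

/* Steps 2 and 3: allocate the buffer, run the callback, keep the note. */
int model_add_device_dump(struct model_dump_request *req)
{
    if (!req || !req->collect || !req->size || !req->name[0])
        return -1;

    struct model_note *note = calloc(1, sizeof(*note));
    if (!note)
        return -1;

    note->data = calloc(1, req->size);
    if (!note->data) {
        free(note);
        return -1;
    }

    /* The driver fills the buffer and returns control to us. */
    int ret = req->collect(req, note->data);
    if (ret) {
        free(note->data);
        free(note);
        return ret;
    }

    memcpy(note->name, req->name, MODEL_NAME_MAX);
    note->name[MODEL_NAME_MAX - 1] = '\0';
    note->size = req->size;
    note->next = model_notes;
    model_notes = note;
    return 0;
}

/* Number of dumps registered so far. */
size_t model_note_count(void)
{
    size_t n = 0;

    for (struct model_note *p = model_notes; p; p = p->next)
        n++;
    return n;
}

/* Stand-in driver callback: fill the buffer with a fixed pattern. */
int model_fill_pattern(struct model_dump_request *req, void *buf)
{
    memset(buf, 0xA5, req->size);
    return 0;
}
```

In this model nothing is recorded unless the driver actually runs and
registers, which matches the concern that a driver missing from the
kdump initramfs contributes no device dump.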
> 
> The device specific hardware/firmware logs can be seen as elf notes:
> 
> # readelf -n /proc/vmcore
> 
> Displaying notes found at file offset 0x1000 with length 0x04003288:
>   Owner Data size Description
>   VMCOREDD_cxgb4_:02:00.4 0x02000fd8  Unknown note type: (0x0700)
>   VMCOREDD_cxgb4_:04:00.4 0x02000fd8  Unknown note type: (0x0700)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   VMCOREINFO   0x074f Unknown note type: (0x)
> 
> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs.  The logs will
> be added to /proc/vmcore as elf notes.
> 
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
> 
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.
> 
> Thanks,
> Rahul
> 
> RFC v1: https://lkml.org/lkml/2018/3/2/542
> RFC v2: https://lkml.org/lkml/2018/3/16/326
> 
> ---
> v4:
> - Made __vmcore_add_device_dump() static.
> - Moved compile check to define vmcore_add_device_dump() to
>   crash_dump.h to fix compilation when vmcore.c is not compiled in.
> - Convert ---help--- to help in Kconfig as indicated by checkpatch.
> - Rebased to tip.
> 
> v3:
> - Dropped sysfs crashdd module.
> - Exported dumps as elf notes. Suggested by Eric Biederman.
>   Added as patch 2 in this version.
> - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
>   dump support.
> - Moved logic related to adding dumps from crashdd to vmcore module.
> - Rename all crashdd* to vmcoredd*.
> - Updated comments.
> 
> v2:
> - Added ABI Documentation for crashdd.
> - Directly use octal permission instead of macro.
> 
> Changes since rfc v2:
> - Moved exporting crashdd from procfs to sysfs. Suggested by
>   Stephen Hemminger 
> - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
> - Replaced all proc API with sysfs API and updated comments.
> - Calling driver callback before creating the binary file under
>   crashdd sysfs.
> - Changed binary dump file permission from S_IRUSR to S_IRUGO.
> - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.
> 
> rfc v2:
> - Collecting logs in 2nd kernel instead of during kernel panic.
>   Suggested by Eric Biederman.
> - Added new crashdd

Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-18 Thread Dave Young

Hi Rahul,
On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
> 
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device in the crash recovery kernel. In crash
> recovery kernel, the collected logs are added as elf notes to
> /proc/vmcore, which is copied by user space scripts for post-analysis.
> 
> The sequence of actions done by device drivers to append their device
> specific hardware/firmware logs to /proc/vmcore are as follows:
> 
> 1. During probe (before hardware is initialized), device drivers
> register to the vmcore module (via vmcore_add_device_dump()), with
> callback function, along with buffer size and log name needed for
> firmware/hardware log collection.

I assumed the elf notes info should be prepared during the kexec_[file_]load
phase. But I did not read the old comments, so I am not sure whether this has
been discussed or not.

If this is done in the 2nd kernel, one question is that the driver can be
loaded later than vmcore init. How do we guarantee the function works if
vmcore reading happens before the driver is loaded?

Also, it is possible that the kdump initramfs does not contain the driver
module.

Am I missing something?

> 
> 2. vmcore module allocates the buffer with requested size. It adds
> an elf note and invokes the device driver's registered callback
> function.
> 
> 3. Device driver collects all hardware/firmware logs into the buffer
> and returns control back to vmcore module.
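
The three quoted steps can be sketched in user space roughly as follows. This is a simplified mock, not the kernel implementation: struct dump_request, struct dump_entry, add_device_dump() and fake_collect() are hypothetical names invented for illustration, and a linked list stands in for the elf note that the vmcore module would add.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for what a driver hands over when it registers. */
struct dump_request {
	const char *name;                       /* log name */
	size_t size;                            /* buffer size the driver asks for */
	int (*collect)(void *buf, size_t size); /* driver callback */
};

/* One collected dump; the list stands in for the elf notes. */
struct dump_entry {
	const char *name;
	void *buf;
	size_t size;
	struct dump_entry *next;
};

static struct dump_entry *dump_list;

/* Steps 2 and 3: allocate the buffer, let the driver fill it, keep it. */
static int add_device_dump(const struct dump_request *req)
{
	struct dump_entry *e;

	if (!req || !req->name || !req->collect || req->size == 0)
		return -1;

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;

	e->buf = calloc(1, req->size);
	if (!e->buf) {
		free(e);
		return -1;
	}

	/* Hand control to the driver; drop the entry if collection fails. */
	if (req->collect(e->buf, req->size) != 0) {
		free(e->buf);
		free(e);
		return -1;
	}

	e->name = req->name;
	e->size = req->size;
	e->next = dump_list;
	dump_list = e;
	return 0;
}

/* Example callback: a real driver would read registers or firmware logs. */
static int fake_collect(void *buf, size_t size)
{
	assert(buf != NULL);
	memset(buf, 0xab, size);
	return 0;
}
```

A driver's probe path would fill a struct dump_request with its name, size and callback and call add_device_dump() before touching the hardware. The mock does nothing to address the ordering question raised in the reply above: a dump only exists once the driver has actually registered.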
> 
> The device specific hardware/firmware logs can be seen as elf notes:
> 
> # readelf -n /proc/vmcore
> 
> Displaying notes found at file offset 0x1000 with length 0x04003288:
>   Owner Data size Description
>   VMCOREDD_cxgb4_:02:00.4 0x02000fd8  Unknown note type: (0x0700)
>   VMCOREDD_cxgb4_:04:00.4 0x02000fd8  Unknown note type: (0x0700)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   CORE 0x0150 NT_PRSTATUS (prstatus structure)
>   VMCOREINFO   0x074f Unknown note type: (0x)
> 
> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs.  The logs will
> be added to /proc/vmcore as elf notes.
> 
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
> 
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.
> 
> Thanks,
> Rahul
> 
> RFC v1: https://lkml.org/lkml/2018/3/2/542
> RFC v2: https://lkml.org/lkml/2018/3/16/326
> 
> ---
> v4:
> - Made __vmcore_add_device_dump() static.
> - Moved compile check to define vmcore_add_device_dump() to
>   crash_dump.h to fix compilation when vmcore.c is not compiled in.
> - Convert ---help--- to help in Kconfig as indicated by checkpatch.
> - Rebased to tip.
> 
> v3:
> - Dropped sysfs crashdd module.
> - Exported dumps as elf notes. Suggested by Eric Biederman.
>   Added as patch 2 in this version.
> - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
>   dump support.
> - Moved logic related to adding dumps from crashdd to vmcore module.
> - Rename all crashdd* to vmcoredd*.
> - Updated comments.
> 
> v2:
> - Added ABI Documentation for crashdd.
> - Directly use octal permission instead of macro.
> 
> Changes since rfc v2:
> - Moved exporting crashdd from procfs to sysfs. Suggested by
>   Stephen Hemminger 
> - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
> - Replaced all proc API with sysfs API and updated comments.
> - Calling driver callback before creating the binary file under
>   crashdd sysfs.
> - Changed binary dump file permission from S_IRUSR to S_IRUGO.
> - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.
> 
> rfc v2:
> - Collecting logs in 2nd kernel instead of during kernel panic.
>   Suggested by Eric Biederman.
> - Added new crashdd module that exports /proc/crashdd/ containing
>   driver's registered

[PATCH] kexec_file: do not add extra alignment to efi memmap

2018-04-17 Thread Dave Young

Chun-Yi reported a kernel warning message below:
WARNING: CPU: 0 PID: 0 at ../mm/early_ioremap.c:182 early_iounmap+0x4f/0x12c()
early_iounmap(ff200180, 0118) [0] size not consistent 0120

The problem is that x86 kexec_file_load adds extra alignment to the efi memmap
in bzImage64_load():
efi_map_sz = efi_get_runtime_map_size();
efi_map_sz = ALIGN(efi_map_sz, 16);

__efi_memmap_init maps with the size including the alignment bytes,
but efi_memmap_unmap uses nr_maps * desc_size, which does not include the
extra bytes.

The alignment in the kexec code is only needed for the kexec buffer's internal
use. kexec should pass the exact size of the efi memmap to the 2nd kernel.

Signed-off-by: Dave Young <dyo...@redhat.com>
Reported-by: joeyli <j...@suse.com>
Tested-by: Randy Wright <rwri...@hpe.com>
---
 arch/x86/kernel/kexec-bzimage64.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- linux.orig/arch/x86/kernel/kexec-bzimage64.c
+++ linux/arch/x86/kernel/kexec-bzimage64.c
@@ -398,11 +398,10 @@ static void *bzImage64_load(struct kimag
 	 * little bit simple
 	 */
 	efi_map_sz = efi_get_runtime_map_size();
-	efi_map_sz = ALIGN(efi_map_sz, 16);
 	params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
 				MAX_ELFCOREHDR_STR_LEN;
 	params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
-	kbuf.bufsz = params_cmdline_sz + efi_map_sz +
+	kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
 				sizeof(struct setup_data) +
 				sizeof(struct efi_setup_data);
 
@@ -410,7 +409,7 @@ static void *bzImage64_load(struct kimag
 	if (!params)
 		return ERR_PTR(-ENOMEM);
 	efi_map_offset = params_cmdline_sz;
-	efi_setup_data_offset = efi_map_offset + efi_map_sz;
+	efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
 
 	/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
 	setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
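
The size mismatch described in the commit message above can be checked with a few lines of C. This is only a sketch of the arithmetic, not kernel code: align_up() stands in for the kernel's ALIGN() macro, memmap_size() and size_mismatch() are hypothetical helper names, and splitting 0x118 bytes into 7 descriptors of 40 bytes is an assumption, since the warning only shows the totals 0118 and 0120.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Size that the unmap side derives from the descriptor count. */
static size_t memmap_size(size_t nr_map, size_t desc_size)
{
	return nr_map * desc_size;
}

/*
 * Padding bytes that show up only when the 16-byte aligned size,
 * rather than the exact size, is handed to the 2nd kernel.
 */
static size_t size_mismatch(size_t nr_map, size_t desc_size)
{
	size_t exact = memmap_size(nr_map, desc_size);

	return align_up(exact, 16) - exact;
}
```

With the assumed values, align_up(0x118, 16) is 0x120, so size_mismatch(7, 40) returns 8. Those 8 bytes are the padding that is mapped but not covered by nr_maps * desc_size on unmap. The patch keeps ALIGN() for the buffer layout only and leaves efi_map_sz itself exact.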

Re: [PATCH] efi: Fix the size not consistent issue when unmapping memory map

2018-04-16 Thread Dave Young

On 04/16/18 at 06:35pm, Randy Wright wrote:
> On Mon, Apr 16, 2018 at 02:37:38PM +0800, joeyli wrote:
> > Hi Randy,
> > ...
> > Randy, do you want to try Dave's kexec patch on your environment? Please 
> > remove
> > my patch first.  
> > 
> > Thanks a lot!
> > Joey Lee
> 
> Hi Joey, 
> 
> I tried Dave's patch to kexec-bzimage64.c on my build of the SuSE
> 4.12.14-15 kernel.   I ran the same test as I did with your patch: I
> verified the early_ioremap.c warnings occurred with a crash triggered
> from a kexec boot of the unmodified kernel. Then I applied the patch to
> kexec-bzimage64.c, rebuilt, re-ran the test to crash from the kexec'ed
> kernel, and verified the warnings are no longer seen.

Great, thanks for the testing, will send out the patch after some local
tests.

Thanks
Dave
