Re: [PATCH v2 0/2] x86/boot/KASLR: Code bug fix about kernel virtual address randomization

2017-06-29 Thread Dave Young

On 06/27/17 at 08:39pm, Baoquan He wrote:
> People complained that crashkernel high doesn't work when the kaslr code is
> compiled in but 'nokaslr' is added to disable it. Kexec has the same
> phenomenon.

This is a regression: with 4.12* kernels, kexec reboot now always fails on
my desktop PC when kaslr is not enabled.

> 
> The root cause is a code bug which assigned the original loading address
> of the kernel to the local variable 'virt_addr', which represents the offset
> of kernel virtual address randomization. As we know, the kernel can be loaded
> anywhere under 64T physically, so this wrong assignment can make the x86_64
> kernel relocation handling fail if kaslr is not used.
> 
> The v1 post can be found here:
>   x86/boot/KASLR: Skip relocation handling in no kaslr case
>   https://patchwork.kernel.org/patch/9807789/
> 
> In v2, Ingo suggested that we should add a check for whether 'virt_addr'
> is randomized so far that the kernel goes beyond the kernel mapping area. This
> check lets us know about the error instead of resetting to firmware quietly
> as it does now.
> 
> Baoquan He (2):
>   x86/boot/KASLR: Add checking for the offset of kernel virtual address
> randomization
>   x86/boot/KASLR: Fix the wrong assignment to 'virt_addr'
> 
>  arch/x86/boot/compressed/kaslr.c | 3 ---
>  arch/x86/boot/compressed/misc.c  | 6 --
>  arch/x86/boot/compressed/misc.h  | 2 --
>  3 files changed, 4 insertions(+), 7 deletions(-)
> 
> -- 
> 2.5.5
>
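
For illustration, the kind of bounds check described in the cover letter can
be sketched in userspace C. This is only a sketch under assumptions: the names
demo_virt_offset_ok, DEMO_LOAD_PHYSICAL_ADDR and DEMO_KERNEL_IMAGE_SIZE are
hypothetical, and the constant values are illustrative rather than taken from
the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real constants depend on the kernel config. */
#define DEMO_LOAD_PHYSICAL_ADDR	0x1000000ULL		/* 16 MiB */
#define DEMO_KERNEL_IMAGE_SIZE	(1024ULL * 1024 * 1024)	/* 1 GiB mapping window */

/*
 * Return true when an image of image_size bytes placed at virtual offset
 * virt_off still ends inside the kernel mapping window. The subtraction
 * form avoids overflow in virt_off + image_size.
 */
bool demo_virt_offset_ok(uint64_t virt_off, uint64_t image_size)
{
	if (image_size > DEMO_KERNEL_IMAGE_SIZE)
		return false;
	return virt_off <= DEMO_KERNEL_IMAGE_SIZE - image_size;
}
```

With these illustrative values, a physical load address such as 4G that is
mistakenly used as the virtual offset is rejected, while the default offset
passes.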

[tip:efi/urgent] efi: Fix boot panic because of invalid BGRT image address

2017-06-09 Thread tip-bot for Dave Young

Commit-ID:  792ef14df5c585c19b2831673a077504a09e5203
Gitweb: http://git.kernel.org/tip/792ef14df5c585c19b2831673a077504a09e5203
Author: Dave Young 
AuthorDate: Fri, 9 Jun 2017 08:45:58 +
Committer:  Ingo Molnar 
CommitDate: Fri, 9 Jun 2017 14:50:11 +0200

efi: Fix boot panic because of invalid BGRT image address

Maniaxx reported a kernel boot crash in the EFI code, which I emulated
by using the same invalid physical address in code:

  BUG: unable to handle kernel paging request at ff280001
  IP: efi_bgrt_init+0xfb/0x153
  ...
  Call Trace:
   ? bgrt_init+0xbc/0xbc
   acpi_parse_bgrt+0xe/0x12
   acpi_table_parse+0x89/0xb8
   acpi_boot_init+0x445/0x4e2
   ? acpi_parse_x2apic+0x79/0x79
   ? dmi_ignore_irq0_timer_override+0x33/0x33
   setup_arch+0xb63/0xc82
   ? early_idt_handler_array+0x120/0x120
   start_kernel+0xb7/0x443
   ? early_idt_handler_array+0x120/0x120
   x86_64_start_reservations+0x29/0x2b
   x86_64_start_kernel+0x154/0x177
   secondary_startup_64+0x9f/0x9f

There is also a similar bug filed in bugzilla.kernel.org:

  https://bugzilla.kernel.org/show_bug.cgi?id=195633

The crash is caused by this commit:

  7b0a911478c7 efi/x86: Move the EFI BGRT init code to early init code

The root cause is that the firmware on those machines provides invalid BGRT
image addresses.

In kernels before the above commit, BGRT is initialized late and uses ioremap()
to map the image address. ioremap() validates the address: if it is not a
valid physical address, ioremap() just fails and returns. In the current
kernel, however, EFI BGRT is initialized early and uses early_memremap(), which
does not validate the image address, so the kernel panics.

According to the ACPI spec, the BGRT image address should fall into
EFI_BOOT_SERVICES_DATA; see section 5.2.22.4 of the document below:

  http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf

Fix this issue by validating the image address in efi_bgrt_init(). If the
image address does not fall into any EFI_BOOT_SERVICES_DATA areas we just
bail out with a warning message.
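
As a rough illustration of the validation described above, here is a
standalone userspace sketch. It is not the kernel code: struct demo_mem_desc,
demo_addr_in_boot_data() and demo_map are hypothetical stand-ins for the EFI
memory map, and the example addresses are made up. Only the type value 4 for
boot services data and the 4 KiB EFI page size come from the UEFI spec.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT		12	/* EFI pages are 4 KiB */
#define DEMO_BOOT_SERVICES_DATA	4	/* EfiBootServicesData in the UEFI spec */

/* Hypothetical, simplified memory descriptor. */
struct demo_mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* Return true if addr lies inside any boot services data region. */
bool demo_addr_in_boot_data(const struct demo_mem_desc *map, size_t n,
			    uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t size = map[i].num_pages << DEMO_PAGE_SHIFT;

		if (map[i].type != DEMO_BOOT_SERVICES_DATA)
			continue;

		/* Half-open interval [phys_addr, phys_addr + size). */
		if (addr >= map[i].phys_addr && addr - map[i].phys_addr < size)
			return true;
	}

	return false;
}

/* Made-up example map: one conventional region, one boot data region. */
const struct demo_mem_desc demo_map[] = {
	{ 7, 0x00100000ULL, 256 },	/* conventional memory, 1 MiB */
	{ 4, 0x7e000000ULL, 16 },	/* boot services data, 64 KiB */
};
```

The sketch compares addr - phys_addr against the size instead of computing an
end address, which sidesteps overflow for regions near the top of the address
space; that is a choice made for this sketch only.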

Reported-by: Maniaxx 
Signed-off-by: Dave Young 
Signed-off-by: Ard Biesheuvel 
Cc: Linus Torvalds 
Cc: Matt Fleming 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Fixes: 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code")
Link: http://lkml.kernel.org/r/20170609084558.26766-2-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/efi-bgrt.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/efi-bgrt.c b/drivers/firmware/efi/efi-bgrt.c
index 8bf2732..b58233e 100644
--- a/drivers/firmware/efi/efi-bgrt.c
+++ b/drivers/firmware/efi/efi-bgrt.c
@@ -27,6 +27,26 @@ struct bmp_header {
 	u32 size;
 } __packed;
 
+static bool efi_bgrt_addr_valid(u64 addr)
+{
+	efi_memory_desc_t *md;
+
+	for_each_efi_memory_desc(md) {
+		u64 size;
+		u64 end;
+
+		if (md->type != EFI_BOOT_SERVICES_DATA)
+			continue;
+
+		size = md->num_pages << EFI_PAGE_SHIFT;
+		end = md->phys_addr + size;
+		if (addr >= md->phys_addr && addr < end)
+			return true;
+	}
+
+	return false;
+}
+
 void __init efi_bgrt_init(struct acpi_table_header *table)
 {
 	void *image;
@@ -36,7 +56,7 @@ void __init efi_bgrt_init(struct acpi_table_header *table)
 	if (acpi_disabled)
 		return;
 
-	if (!efi_enabled(EFI_BOOT))
+	if (!efi_enabled(EFI_MEMMAP))
 		return;
 
 	if (table->length < sizeof(bgrt_tab)) {
@@ -65,6 +85,10 @@ void __init efi_bgrt_init(struct acpi_table_header *table)
 		goto out;
 	}
 
+	if (!efi_bgrt_addr_valid(bgrt->image_address)) {
+		pr_notice("Ignoring BGRT: invalid image address\n");
+		goto out;
+	}
 	image = early_memremap(bgrt->image_address, sizeof(bmp_header));
 	if (!image) {
 		pr_notice("Ignoring BGRT: failed to map image header memory\n");

[PATCH v2] efi: fix boot panic because of invalid bgrt image address

2017-06-09 Thread Dave Young

Maniaxx reported the kernel boot failure below
(I emulated the panic by using the same invalid phys addr in code).
There is also a bug in bugzilla.kernel.org:
https://bugzilla.kernel.org/show_bug.cgi?id=195633

The reported panic happens after the commit below:
7b0a911 efi/x86: Move the EFI BGRT init code to early init code

The root cause is that the firmware on those machines provides invalid bgrt
image addresses.

In a kernel before the above commit, bgrt initializes late and uses ioremap
to map the image address. Ioremap validates the address; if it is not a
valid physical address, ioremap just fails and returns. However, in the current
kernel efi bgrt initializes early and uses early_memremap, which does not
validate the image address, so a kernel panic happens.

According to the ACPI spec, the BGRT image address should fall into
EFI_BOOT_SERVICES_DATA; see section 5.2.22.4 of the document below:
http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf

Fix this issue by validating the image address in efi_bgrt_init(). If the
image address does not fall into any EFI_BOOT_SERVICES_DATA areas we just
bail out.

[0.00] BUG: unable to handle kernel paging request at ff280001
[0.00] IP: efi_bgrt_init+0xfb/0x153
[0.00] PGD 6e00b067 
[0.00] P4D 6e00b067 
[0.00] PUD 6e00d067 
[0.00] PMD 6e221067 
[0.00] PTE 8a08e0180163
[0.00] 
[0.00] Oops: 0009 [#1] SMP
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc4+ #135
[0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[0.00] task: 9840f4c0 task.stack: 9840
[0.00] RIP: 0010:efi_bgrt_init+0xfb/0x153
[0.00] RSP: :98403d50 EFLAGS: 00010082
[0.00] RAX: ff280001 RBX:  RCX: 0006
[0.00] RDX: 0a08e0181000 RSI: 8a08e0180163 RDI: 057e
[0.00] RBP: 98403d68 R08: 0041 R09: 0002
[0.00] R10:  R11: 8c063cff8fc6 R12: 981d1fb2
[0.00] R13: 986b4fa0 R14: 0010 R15: 
[0.00] FS:  () GS:984db000() knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: ff280001 CR3: 6e00a000 CR4: 000406b0
[0.00] Call Trace:
[0.00]  ? bgrt_init+0xbc/0xbc
[0.00]  acpi_parse_bgrt+0xe/0x12
[0.00]  acpi_table_parse+0x89/0xb8
[0.00]  acpi_boot_init+0x445/0x4e2
[0.00]  ? acpi_parse_x2apic+0x79/0x79
[0.00]  ? dmi_ignore_irq0_timer_override+0x33/0x33
[0.00]  setup_arch+0xb63/0xc82
[0.00]  ? early_idt_handler_array+0x120/0x120
[0.00]  start_kernel+0xb7/0x443
[0.00]  ? early_idt_handler_array+0x120/0x120
[0.00]  x86_64_start_reservations+0x29/0x2b
[0.00]  x86_64_start_kernel+0x154/0x177
[0.00]  secondary_startup_64+0x9f/0x9f
[0.00] Code: 3f ff eb 6c 48 bf 01 00 00 00 18 e0 08 0a be 06 00 00 00 e8 ef 2b fe ff 48 85 c0 75 0e 48 c7 c7 88 09 22 98 e8 e1 31 3f ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 91 2c fe ff 66
[0.00] RIP: efi_bgrt_init+0xfb/0x153 RSP: 98403d50
[0.00] CR2: ff280001
[0.00] ---[ end trace 9843d3b7cbcab26a ]---
[0.00] Kernel panic - not syncing: Attempted to kill the idle task!
[0.00] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!

Fixes: 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
Reported-by: Maniaxx 
Signed-off-by: Dave Young 
---
v1->v2: Ard: move EFI_MEMMAP checking out and improve the patchlog.
 drivers/firmware/efi/efi-bgrt.c |   26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

--- linux-x86.orig/drivers/firmware/efi/efi-bgrt.c
+++ linux-x86/drivers/firmware/efi/efi-bgrt.c
@@ -27,6 +27,26 @@ struct bmp_header {
 	u32 size;
 } __packed;
 
+static bool efi_bgrt_addr_valid(u64 addr)
+{
+	efi_memory_desc_t *md;
+
+	for_each_efi_memory_desc(md) {
+		u64 size;
+		u64 end;
+
+		if (md->type != EFI_BOOT_SERVICES_DATA)
+			continue;
+
+		size = md->num_pages << EFI_PAGE_SHIFT;
+		end = md->phys_addr + size;
+		if (addr >= md->phys_addr && addr < end)
+			return true;
+	}
+
+	return false;
+}
+
 void __init efi_bgrt_init(struct acpi_table_header *table)
 {
 	void *image;
@@ -36,7 +56,7 @@ void __init efi_bgrt_init(struct acpi_ta
 	if (acpi_disabled)
 		return;
 
-	if (!efi_enabled(EFI_BOOT))
+	if (!efi_enabled(EFI_MEMMAP))
 		return;
 
 	if (table->length < sizeof(bgrt_tab)) {
@@ -65,6 +85

Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-09 Thread Dave Young

On 06/09/17 at 10:29am, Dave Young wrote:
> On 06/09/17 at 10:17am, Xunlei Pang wrote:
> > S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
> > is now defined as follows:
> > typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
> > It was changed by the CONFIG_CRASH_CORE feature.
> > 
> > This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
> > renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
> > 
> > Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
> > Cc: Dave Young 
> > Cc: Dave Anderson 
> > Cc: Hari Bathini 
> > Cc: Gustavo Luiz Duarte 
> > Signed-off-by: Xunlei Pang 
> > ---
> >  arch/s390/include/asm/kexec.h |  2 +-
> >  include/linux/crash_core.h|  7 +++
> >  include/linux/kexec.h | 11 +--
> >  3 files changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> > index 2f924bc..352deb8 100644
> > --- a/arch/s390/include/asm/kexec.h
> > +++ b/arch/s390/include/asm/kexec.h
> > @@ -47,7 +47,7 @@
> >   * Seven notes plus zero note at the end: prstatus, fpregset, timer,
> >   * tod_cmp, tod_reg, control regs, and prefix
> >   */
> > -#define KEXEC_NOTE_BYTES \
> > +#define CRASH_CORE_NOTE_BYTES \
> > (ALIGN(sizeof(struct elf_note), 4) * 8 + \
> >  ALIGN(sizeof("CORE"), 4) * 7 + \
> >  ALIGN(sizeof(struct elf_prstatus), 4) + \

I found that in mainline, since the commit below, the above define should be
unused. Distributions with older kernels may still need your fix, but in
mainline the right fix would be to drop the s390 part that uses these
macros.

Anyway, this needs a comment from Michael.

commit 8a07dd02d7615d91d65d6235f7232e3f9b5d347f
Author: Martin Schwidefsky 
Date:   Wed Oct 14 15:53:06 2015 +0200

s390/kdump: remove code to create ELF notes in the crashed system

The s390 architecture can store the CPU registers of the crashed system
after the kdump kernel has been started and this is the preferred way.
Remove the remaining code fragments that deal with storing CPU registers
while the crashed system is still active.

Acked-by: Michael Holzheu 
Signed-off-by: Martin Schwidefsky 


> > diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> > index e9de6b4..dbc6e5c 100644
> > --- a/include/linux/crash_core.h
> > +++ b/include/linux/crash_core.h
> > @@ -10,9 +10,16 @@
> >  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
> >  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
> >  
> > +/*
> > + * The per-cpu notes area is a list of notes terminated by a "NULL"
> > + * note header.  For kdump, the code in vmcore.c runs in the context
> > + * of the second kernel to combine them into one note.
> > + */
> > +#ifndef CRASH_CORE_NOTE_BYTES
> >  #define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) + \
> >  CRASH_CORE_NOTE_NAME_BYTES +   \
> >  CRASH_CORE_NOTE_DESC_BYTES)
> > +#endif
> >  
> >  #define VMCOREINFO_BYTES  PAGE_SIZE
> >  #define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index 3ea8275..133df03 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -14,7 +14,6 @@
> >  
> >  #if !defined(__ASSEMBLY__)
> >  
> > -#include 
> >  #include 
> >  
> >  #include 
> > @@ -25,6 +24,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /* Verify architecture specific macros are defined */
> >  
> > @@ -63,15 +63,6 @@
> >  #define KEXEC_CORE_NOTE_NAME   CRASH_CORE_NOTE_NAME
> >  
> >  /*
> > - * The per-cpu notes area is a list of notes terminated by a "NULL"
> > - * note header.  For kdump, the code in vmcore.c runs in the context
> > - * of the second kernel to combine them into one note.
> > - */
> > -#ifndef KEXEC_NOTE_BYTES
> > -#define KEXEC_NOTE_BYTES   CRASH_CORE_NOTE_BYTES
> > -#endif
> 
> It is still not clear how s390 uses the crash_notes apart from this macro.
> But from the code's point of view we do need to update this as well after
> the crash_core split.
> 
> Acked-by: Dave Young 

Please hold the ack because of the new findings; let's wait for Michael's
feedback.

Thanks
Dave

Re: [PATCH] x86/efi: fix boot panic because of invalid bgrt image address

2017-06-08 Thread Dave Young

On 06/08/17 at 03:51pm, Ard Biesheuvel wrote:
>  On 8 June 2017 at 05:32, Dave Young  wrote:
> > Maniaxx reported a kernel boot panic similar to the one below
> > (I emulated the panic by using the same invalid phys addr in a uefi vm).
> > There is also a bug in bugzilla.kernel.org:
> > https://bugzilla.kernel.org/show_bug.cgi?id=195633
> >
> > This happens after the commit below:
> > 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
> >
> > The root cause is that the firmware on those machines provides invalid bgrt
> > image addresses.
> >
> > With the original efi bgrt code we initialize bgrt late
> > and use ioremap to map the image address. The ioremap code checks whether
> > the address is a valid physical address before really mapping it.
> >
> > With the current efi bgrt code we moved the initialization to early code,
> > so we switched to early_memremap, which does not check the phys_addr like
> > ioremap does. This leads to the early kernel panics.
> >
> > Fix this by checking the image physical address: if it is not within
> > any EFI_BOOT_SERVICES_DATA areas then we just bail out. This is stronger
> > than the original ioremap checking; according to the spec the BGRT data
> > should fall into EFI_BOOT_SERVICES_DATA.
> >
> > [0.00] BUG: unable to handle kernel paging request at ff280001
> > [0.00] IP: efi_bgrt_init+0xfb/0x153
> > [0.00] PGD 6e00b067
> > [0.00] P4D 6e00b067
> > [0.00] PUD 6e00d067
> > [0.00] PMD 6e221067
> > [0.00] PTE 8a08e0180163
> > [0.00]
> > [0.00] Oops: 0009 [#1] SMP
> > [0.00] Modules linked in:
> > [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc4+ #135
> > [0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > [0.00] task: 9840f4c0 task.stack: 9840
> > [0.00] RIP: 0010:efi_bgrt_init+0xfb/0x153
> > [0.00] RSP: :98403d50 EFLAGS: 00010082
> > [0.00] RAX: ff280001 RBX:  RCX: 0006
> > [0.00] RDX: 0a08e0181000 RSI: 8a08e0180163 RDI: 057e
> > [0.00] RBP: 98403d68 R08: 0041 R09: 0002
> > [0.00] R10:  R11: 8c063cff8fc6 R12: 981d1fb2
> > [0.00] R13: 986b4fa0 R14: 0010 R15:
> > [0.00] FS:  () GS:984db000() knlGS:
> > [0.00] CS:  0010 DS:  ES:  CR0: 80050033
> > [0.00] CR2: ff280001 CR3: 6e00a000 CR4: 000406b0
> > [0.00] Call Trace:
> > [0.00]  ? bgrt_init+0xbc/0xbc
> > [0.00]  acpi_parse_bgrt+0xe/0x12
> > [0.00]  acpi_table_parse+0x89/0xb8
> > [0.00]  acpi_boot_init+0x445/0x4e2
> > [0.00]  ? acpi_parse_x2apic+0x79/0x79
> > [0.00]  ? dmi_ignore_irq0_timer_override+0x33/0x33
> > [0.00]  setup_arch+0xb63/0xc82
> > [0.00]  ? early_idt_handler_array+0x120/0x120
> > [0.00]  start_kernel+0xb7/0x443
> > [0.00]  ? early_idt_handler_array+0x120/0x120
> > [0.00]  x86_64_start_reservations+0x29/0x2b
> > [0.00]  x86_64_start_kernel+0x154/0x177
> > [0.00]  secondary_startup_64+0x9f/0x9f
> > [0.00] Code: 3f ff eb 6c 48 bf 01 00 00 00 18 e0 08 0a be 06 00 00 00 e8 ef 2b fe ff 48 85 c0 75 0e 48 c7 c7 88 09 22 98 e8 e1 31 3f ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 91 2c fe ff 66
> > [0.00] RIP: efi_bgrt_init+0xfb/0x153 RSP: 98403d50
> > [0.00] CR2: ff280001
> > [0.00] ---[ end trace 9843d3b7cbcab26a ]---
> > [0.00] Kernel panic - not syncing: Attempted to kill the idle task!
> > [0.00] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> >
> > Fixes: 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
> > Reported-by: Maniaxx 
> > Signed-off-by: Dave Young 
> 
> Hi Dave,
> 
> I'm with the program now :-)
> 
> Could you please check your commit log for grammar? And add that the
> spec you refer to is the ACPI spec?

Will do both. My fault, I should have stated it clearly when I wrote the log.

> 
> > ---
> >  drivers/firmware/efi/efi-bgrt.c |   29 +
> >  1 file changed, 2

Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-08 Thread Dave Young

On 06/09/17 at 10:17am, Xunlei Pang wrote:
> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
> is now defined as follows:
> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
> It was changed by the CONFIG_CRASH_CORE feature.
> 
> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
> 
> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
> Cc: Dave Young 
> Cc: Dave Anderson 
> Cc: Hari Bathini 
> Cc: Gustavo Luiz Duarte 
> Signed-off-by: Xunlei Pang 
> ---
>  arch/s390/include/asm/kexec.h |  2 +-
>  include/linux/crash_core.h|  7 +++
>  include/linux/kexec.h | 11 +--
>  3 files changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> index 2f924bc..352deb8 100644
> --- a/arch/s390/include/asm/kexec.h
> +++ b/arch/s390/include/asm/kexec.h
> @@ -47,7 +47,7 @@
>   * Seven notes plus zero note at the end: prstatus, fpregset, timer,
>   * tod_cmp, tod_reg, control regs, and prefix
>   */
> -#define KEXEC_NOTE_BYTES \
> +#define CRASH_CORE_NOTE_BYTES \
>   (ALIGN(sizeof(struct elf_note), 4) * 8 + \
>ALIGN(sizeof("CORE"), 4) * 7 + \
>ALIGN(sizeof(struct elf_prstatus), 4) + \
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index e9de6b4..dbc6e5c 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -10,9 +10,16 @@
>  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
>  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
>  
> +/*
> + * The per-cpu notes area is a list of notes terminated by a "NULL"
> + * note header.  For kdump, the code in vmcore.c runs in the context
> + * of the second kernel to combine them into one note.
> + */
> +#ifndef CRASH_CORE_NOTE_BYTES
>  #define CRASH_CORE_NOTE_BYTES   ((CRASH_CORE_NOTE_HEAD_BYTES * 2) + \
>CRASH_CORE_NOTE_NAME_BYTES +   \
>CRASH_CORE_NOTE_DESC_BYTES)
> +#endif
>  
>  #define VMCOREINFO_BYTESPAGE_SIZE
>  #define VMCOREINFO_NOTE_NAME"VMCOREINFO"
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3ea8275..133df03 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -14,7 +14,6 @@
>  
>  #if !defined(__ASSEMBLY__)
>  
> -#include 
>  #include 
>  
>  #include 
> @@ -25,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Verify architecture specific macros are defined */
>  
> @@ -63,15 +63,6 @@
>  #define KEXEC_CORE_NOTE_NAME CRASH_CORE_NOTE_NAME
>  
>  /*
> - * The per-cpu notes area is a list of notes terminated by a "NULL"
> - * note header.  For kdump, the code in vmcore.c runs in the context
> - * of the second kernel to combine them into one note.
> - */
> -#ifndef KEXEC_NOTE_BYTES
> -#define KEXEC_NOTE_BYTES CRASH_CORE_NOTE_BYTES
> -#endif

It is still not clear how s390 uses the crash_notes apart from this macro.
But from the code's point of view we do need to update this as well after
the crash_core split.

Acked-by: Dave Young 

Thanks
Dave
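
To make the #ifndef override in the crash_core.h hunk easier to follow, here
is a small userspace sketch of the same pattern. Everything prefixed with
DEMO_/demo_ is hypothetical, and DEMO_PRSTATUS_SIZE is just an illustrative
stand-in for sizeof(struct elf_prstatus), which differs per architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define DEMO_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Same layout as an ELF note header: three 32-bit words. */
struct demo_note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

/* Illustrative stand-in for sizeof(struct elf_prstatus). */
#define DEMO_PRSTATUS_SIZE	336

#define DEMO_NOTE_HEAD_BYTES	DEMO_ALIGN(sizeof(struct demo_note_hdr), 4)
#define DEMO_NOTE_NAME_BYTES	DEMO_ALIGN(sizeof("CORE"), 4)
#define DEMO_NOTE_DESC_BYTES	DEMO_ALIGN((size_t)DEMO_PRSTATUS_SIZE, 4)

/*
 * Generic size: one real note plus the terminating empty note header.
 * An architecture header included earlier may define DEMO_NOTE_BYTES
 * itself, in which case this default is skipped.
 */
#ifndef DEMO_NOTE_BYTES
#define DEMO_NOTE_BYTES ((DEMO_NOTE_HEAD_BYTES * 2) + \
			 DEMO_NOTE_NAME_BYTES + \
			 DEMO_NOTE_DESC_BYTES)
#endif

/* The buffer type is sized from whichever definition won. */
typedef uint32_t demo_note_buf_t[DEMO_NOTE_BYTES / 4];
```

The point of the sketch is only that the buffer typedef follows whichever
macro definition is seen first, so an architecture override has to use the
same macro name that the typedef consumes.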

Re: [PATCH] x86/efi: fix boot panic because of invalid bgrt image address

2017-06-08 Thread Dave Young

On 06/08/17 at 10:02am, Ard Biesheuvel wrote:
> On 8 June 2017 at 05:32, Dave Young  wrote:
> > Maniaxx reported a kernel boot panic similar to the one below
> > (I emulated the panic by using the same invalid phys addr in a uefi vm).
> > There is also a bug in bugzilla.kernel.org:
> > https://bugzilla.kernel.org/show_bug.cgi?id=195633
> >
> > This happens after the commit below:
> > 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
> >
> > The root cause is that the firmware on those machines provides invalid bgrt
> > image addresses.
> >
> > With the original efi bgrt code we initialize bgrt late
> > and use ioremap to map the image address. The ioremap code checks whether
> > the address is a valid physical address before really mapping it.
> >
> > With the current efi bgrt code we moved the initialization to early code,
> > so we switched to early_memremap, which does not check the phys_addr like
> > ioremap does. This leads to the early kernel panics.
> >
> > Fix this by checking the image physical address: if it is not within
> > any EFI_BOOT_SERVICES_DATA areas then we just bail out. This is stronger
> > than the original ioremap checking; according to the spec the BGRT data
> > should fall into EFI_BOOT_SERVICES_DATA.
> >
> 
> Which spec? The UEFI spec does not mention BGRT, and given that it is

It is mentioned in the ACPI spec; see section 5.2.22.4 of the 6.1 spec:
http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf

> an ACPI table, I would expect an ACPI reclaim region to be the most
> appropriate. A quick test with QEMU confirms this:
> 
> ACPI: BGRT 0x00013A5E 38 (v01 INTEL  EDK2 0002 0113)
> 
> and
> 
> efi:   0x00013a5e-0x00013a5e [ACPI Reclaim Memory|   |  |  |  |  |  |  |   |WB|  |  |  ]
> 
> So while I agree that we have to fix this, and that checking the BGRT
> address against the UEFI memory map is the most appropriate course of
> action, requiring a certain region type is probably not what we want.
> 
> We have a similar check for ESRT, in efi_mem_desc_lookup(), which
> looks a bit dodgy tbh, given that it allows any region type (including
> MMIO), as long as it has the EFI_MEMORY_RUNTIME attribute, which is
> almost certainly incorrect.

Yes, I had that in mind but it got delayed by something else.

> 
> So what I would like to see is a function that can tell you whether a
> certain address is covered by a region of a type that is normal
> memory, and is occupied, i.e.,
> 
> EFI_RESERVED_TYPE
> EFI_LOADER_CODE
> EFI_LOADER_DATA
> EFI_BOOT_SERVICES_CODE
> EFI_BOOT_SERVICES_DATA
> EFI_RUNTIME_SERVICES_CODE
> EFI_RUNTIME_SERVICES_DATA
> EFI_ACPI_RECLAIM_MEMORY
> EFI_ACPI_MEMORY_NVS
> 
> The EFI_MEMORY_RUNTIME attribute is irrelevant: the firmware itself
> does not have to read these tables at runtime, so it doesn't matter
> whether the O/S maps them on its behalf.
> 
> If you could please stick that in drivers/firmware/efi/efi.c, and
> rework the patch to use it instead? I will move the ESRT code to it as
> well once this is merged.

Will think about this. I had a plan to change the desc lookup function, but
it was delayed by something else. For this bgrt issue, since the ACPI spec
clearly says the image is in boot services data, can we just check the boot
services data areas?

Thanks
Dave
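
For reference, the region-type test Ard describes above could look roughly
like the following userspace sketch. The function name
demo_type_is_occupied_ram() and the DEMO_ enum names are hypothetical; the
numeric values follow the memory type numbering in the UEFI spec. It only
classifies a region type and leaves the address range lookup out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory type numbers as assigned by the UEFI specification. */
enum demo_efi_type {
	DEMO_RESERVED_TYPE		= 0,
	DEMO_LOADER_CODE		= 1,
	DEMO_LOADER_DATA		= 2,
	DEMO_BOOT_SERVICES_CODE		= 3,
	DEMO_BOOT_SERVICES_DATA		= 4,
	DEMO_RUNTIME_SERVICES_CODE	= 5,
	DEMO_RUNTIME_SERVICES_DATA	= 6,
	DEMO_CONVENTIONAL_MEMORY	= 7,
	DEMO_UNUSABLE_MEMORY		= 8,
	DEMO_ACPI_RECLAIM_MEMORY	= 9,
	DEMO_ACPI_MEMORY_NVS		= 10,
	DEMO_MEMORY_MAPPED_IO		= 11,
};

/*
 * Return true for region types that are normal memory and occupied,
 * i.e. the nine types listed in the review comment. Free conventional
 * memory, unusable memory and MMIO are rejected.
 */
bool demo_type_is_occupied_ram(uint32_t type)
{
	switch (type) {
	case DEMO_RESERVED_TYPE:
	case DEMO_LOADER_CODE:
	case DEMO_LOADER_DATA:
	case DEMO_BOOT_SERVICES_CODE:
	case DEMO_BOOT_SERVICES_DATA:
	case DEMO_RUNTIME_SERVICES_CODE:
	case DEMO_RUNTIME_SERVICES_DATA:
	case DEMO_ACPI_RECLAIM_MEMORY:
	case DEMO_ACPI_MEMORY_NVS:
		return true;
	default:
		return false;
	}
}
```

Note that the sketch deliberately ignores the EFI_MEMORY_RUNTIME attribute,
in line with the comment that it is irrelevant for this purpose.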

Re: [PATCH] x86/efi: fix boot panic because of invalid bgrt image address

2017-06-07 Thread Dave Young

The subject tag should be efi instead of x86/efi since the code
is in general driver code now. Matt/Ard, if I need to resend, please let me
know. Please help review the patch.

Maniaxx, can you verify it on your machine? It passed my test with
an emulation of your wrong address.

On 06/08/17 at 01:32pm, Dave Young wrote:
> Maniaxx reported a kernel boot panic similar to the one below
> (I emulated the panic by using the same invalid phys addr in a uefi vm).
> There is also a bug in bugzilla.kernel.org:
> https://bugzilla.kernel.org/show_bug.cgi?id=195633
>
> This happens after the commit below:
> 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
>
> The root cause is that the firmware on those machines provides invalid bgrt
> image addresses.
>
> With the original efi bgrt code we initialize bgrt late
> and use ioremap to map the image address. The ioremap code checks whether
> the address is a valid physical address before really mapping it.
>
> With the current efi bgrt code we moved the initialization to early code,
> so we switched to early_memremap, which does not check the phys_addr like
> ioremap does. This leads to the early kernel panics.
>
> Fix this by checking the image physical address: if it is not within
> any EFI_BOOT_SERVICES_DATA areas then we just bail out. This is stronger
> than the original ioremap checking; according to the spec the BGRT data
> should fall into EFI_BOOT_SERVICES_DATA.
> 
> [0.00] BUG: unable to handle kernel paging request at ff280001
> [0.00] IP: efi_bgrt_init+0xfb/0x153
> [0.00] PGD 6e00b067 
> [0.00] P4D 6e00b067 
> [0.00] PUD 6e00d067 
> [0.00] PMD 6e221067 
> [0.00] PTE 8a08e0180163
> [0.00] 
> [0.00] Oops: 0009 [#1] SMP
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc4+ #135
> [0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [0.00] task: 9840f4c0 task.stack: 9840
> [0.00] RIP: 0010:efi_bgrt_init+0xfb/0x153
> [0.00] RSP: :98403d50 EFLAGS: 00010082
> [0.00] RAX: ff280001 RBX:  RCX: 0006
> [0.00] RDX: 0a08e0181000 RSI: 8a08e0180163 RDI: 057e
> [0.00] RBP: 98403d68 R08: 0041 R09: 0002
> [0.00] R10:  R11: 8c063cff8fc6 R12: 981d1fb2
> [0.00] R13: 986b4fa0 R14: 0010 R15:
> [0.00] FS:  () GS:984db000() knlGS:
> [0.00] CS:  0010 DS:  ES:  CR0: 80050033
> [0.00] CR2: ff280001 CR3: 6e00a000 CR4: 000406b0
> [0.00] Call Trace:
> [0.00]  ? bgrt_init+0xbc/0xbc
> [0.00]  acpi_parse_bgrt+0xe/0x12
> [0.00]  acpi_table_parse+0x89/0xb8
> [0.00]  acpi_boot_init+0x445/0x4e2
> [0.00]  ? acpi_parse_x2apic+0x79/0x79
> [0.00]  ? dmi_ignore_irq0_timer_override+0x33/0x33
> [0.00]  setup_arch+0xb63/0xc82
> [0.00]  ? early_idt_handler_array+0x120/0x120
> [0.00]  start_kernel+0xb7/0x443
> [0.00]  ? early_idt_handler_array+0x120/0x120
> [0.00]  x86_64_start_reservations+0x29/0x2b
> [0.00]  x86_64_start_kernel+0x154/0x177
> [0.00]  secondary_startup_64+0x9f/0x9f
> [0.00] Code: 3f ff eb 6c 48 bf 01 00 00 00 18 e0 08 0a be 06 00 00 00 e8 ef 2b fe ff 48 85 c0 75 0e 48 c7 c7 88 09 22 98 e8 e1 31 3f ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 91 2c fe ff 66
> [0.00] RIP: efi_bgrt_init+0xfb/0x153 RSP: 98403d50
> [0.00] CR2: ff280001
> [0.00] ---[ end trace 9843d3b7cbcab26a ]---
> [0.00] Kernel panic - not syncing: Attempted to kill the idle task!
> [0.00] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> 
> Fixes: 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
> Reported-by: Maniaxx 
> Signed-off-by: Dave Young 
> ---
>  drivers/firmware/efi/efi-bgrt.c |   29 +
>  1 file changed, 29 insertions(+)
> 
> --- linux.orig/drivers/firmware/efi/efi-bgrt.c
> +++ linux/drivers/firmware/efi/efi-bgrt.c
> @@ -27,6 +27,31 @@ struct bmp_header {
>  	u32 size;
>  } __packed;
>  
> +static bool efi_bgrt_addr_valid(u64 addr)
> +{
> +	efi_memory_desc_t *md;
> +
> +	if (!efi_enabled(EFI_MEMMAP)) {
> +		pr_err("EFI_MEMMAP is not enabled.\n");
> +		return true;
> +	}
>

[PATCH] x86/efi: fix boot panic because of invalid bgrt image address

2017-06-07 Thread Dave Young

Maniaxx reported a kernel boot panic similar to the one below
(I emulated the panic by using the same invalid phys addr in a uefi vm).
There is also a bug in bugzilla.kernel.org:
https://bugzilla.kernel.org/show_bug.cgi?id=195633

This happens after the commit below:
7b0a911 efi/x86: Move the EFI BGRT init code to early init code

The root cause is that the firmware on those machines provides invalid bgrt
image addresses.

With the original efi bgrt code we initialize bgrt late
and use ioremap to map the image address. The ioremap code checks whether
the address is a valid physical address before really mapping it.

With the current efi bgrt code we moved the initialization to early code,
so we switched to early_memremap, which does not check the phys_addr like
ioremap does. This leads to the early kernel panics.

Fix this by checking the image physical address: if it is not within
any EFI_BOOT_SERVICES_DATA areas then we just bail out. This is stronger
than the original ioremap checking; according to the spec the BGRT data
should fall into EFI_BOOT_SERVICES_DATA.

[0.00] BUG: unable to handle kernel paging request at ff280001
[0.00] IP: efi_bgrt_init+0xfb/0x153
[0.00] PGD 6e00b067 
[0.00] P4D 6e00b067 
[0.00] PUD 6e00d067 
[0.00] PMD 6e221067 
[0.00] PTE 8a08e0180163
[0.00] 
[0.00] Oops: 0009 [#1] SMP
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc4+ #135
[0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[0.00] task: 9840f4c0 task.stack: 9840
[0.00] RIP: 0010:efi_bgrt_init+0xfb/0x153
[0.00] RSP: :98403d50 EFLAGS: 00010082
[0.00] RAX: ff280001 RBX:  RCX: 0006
[0.00] RDX: 0a08e0181000 RSI: 8a08e0180163 RDI: 057e
[0.00] RBP: 98403d68 R08: 0041 R09: 0002
[0.00] R10:  R11: 8c063cff8fc6 R12: 981d1fb2
[0.00] R13: 986b4fa0 R14: 0010 R15: 
[0.00] FS:  () GS:984db000() knlGS:
[0.00] CS:  0010 DS:  ES:  CR0: 80050033
[0.00] CR2: ff280001 CR3: 6e00a000 CR4: 000406b0
[0.00] Call Trace:
[0.00]  ? bgrt_init+0xbc/0xbc
[0.00]  acpi_parse_bgrt+0xe/0x12
[0.00]  acpi_table_parse+0x89/0xb8
[0.00]  acpi_boot_init+0x445/0x4e2
[0.00]  ? acpi_parse_x2apic+0x79/0x79
[0.00]  ? dmi_ignore_irq0_timer_override+0x33/0x33
[0.00]  setup_arch+0xb63/0xc82
[0.00]  ? early_idt_handler_array+0x120/0x120
[0.00]  start_kernel+0xb7/0x443
[0.00]  ? early_idt_handler_array+0x120/0x120
[0.00]  x86_64_start_reservations+0x29/0x2b
[0.00]  x86_64_start_kernel+0x154/0x177
[0.00]  secondary_startup_64+0x9f/0x9f
[0.00] Code: 3f ff eb 6c 48 bf 01 00 00 00 18 e0 08 0a be 06 00 00 00 e8 ef 2b fe ff 48 85 c0 75 0e 48 c7 c7 88 09 22 98 e8 e1 31 3f ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 91 2c fe ff 66
[0.00] RIP: efi_bgrt_init+0xfb/0x153 RSP: 98403d50
[0.00] CR2: ff280001
[0.00] ---[ end trace 9843d3b7cbcab26a ]---
[0.00] Kernel panic - not syncing: Attempted to kill the idle task!
[0.00] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!

Fixes: 7b0a911 efi/x86: Move the EFI BGRT init code to early init code
Reported-by: Maniaxx 
Signed-off-by: Dave Young 
---
 drivers/firmware/efi/efi-bgrt.c |   29 +
 1 file changed, 29 insertions(+)

--- linux.orig/drivers/firmware/efi/efi-bgrt.c
+++ linux/drivers/firmware/efi/efi-bgrt.c
@@ -27,6 +27,31 @@ struct bmp_header {
 	u32 size;
 } __packed;
 
+static bool efi_bgrt_addr_valid(u64 addr)
+{
+	efi_memory_desc_t *md;
+
+	if (!efi_enabled(EFI_MEMMAP)) {
+		pr_err("EFI_MEMMAP is not enabled.\n");
+		return true;
+	}
+
+	for_each_efi_memory_desc(md) {
+		u64 size;
+		u64 end;
+
+		if (md->type != EFI_BOOT_SERVICES_DATA)
+			continue;
+
+		size = md->num_pages << EFI_PAGE_SHIFT;
+		end = md->phys_addr + size;
+		if (addr >= md->phys_addr && addr < end)
+			return true;
+	}
+
+	return false;
+}
+
 void __init efi_bgrt_init(struct acpi_table_header *table)
 {
 	void *image;
@@ -65,6 +90,10 @@ void __init efi_bgrt_init(struct acpi_ta
 		goto out;
 	}
 
+	if (!efi_bgrt_addr_valid(bgrt->image_address)) {
+		pr_notice("Ignoring BGRT: invalid image address\n");
+		goto out;
+	}
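
The range check in efi_bgrt_addr_valid() above can be tried outside the
kernel. What follows is only a userspace sketch under stated assumptions:
struct mem_desc is a simplified, hypothetical stand-in for
efi_memory_desc_t, the function name is made up for illustration, and the
type value 4 is the UEFI memory type number for boot services data. It is
not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K            12 /* EFI pages are 4 KiB */
#define TYPE_BOOT_SERVICES_DATA  4  /* UEFI memory type number */

/* Simplified, hypothetical stand-in for efi_memory_desc_t. */
struct mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

/*
 * Return true if addr falls inside [phys_addr, end) of any
 * boot-services-data descriptor in the map, false otherwise.
 */
static bool addr_in_boot_services_data(const struct mem_desc *map, size_t n,
				       uint64_t addr)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct mem_desc *md = &map[i];
		uint64_t end;

		if (md->type != TYPE_BOOT_SERVICES_DATA)
			continue;

		end = md->phys_addr + (md->num_pages << PAGE_SHIFT_4K);
		if (addr >= md->phys_addr && addr < end)
			return true;
	}

	return false;
}
```

The end of each region is exclusive, so an address equal to
phys_addr + size is rejected, which matches the "addr < end" comparison in
the patch.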

[tip:efi/urgent] efi/bgrt: Skip efi_bgrt_init() in case of non-EFI boot

2017-05-28 Thread tip-bot for Dave Young

Commit-ID:  7425826f4f7ac60f2538b06a7f0a5d1006405159
Gitweb: http://git.kernel.org/tip/7425826f4f7ac60f2538b06a7f0a5d1006405159
Author: Dave Young 
AuthorDate: Fri, 26 May 2017 12:36:51 +0100
Committer:  Ingo Molnar 
CommitDate: Sun, 28 May 2017 11:06:17 +0200

efi/bgrt: Skip efi_bgrt_init() in case of non-EFI boot

Sabrina Dubroca reported an early panic:

  BUG: unable to handle kernel paging request at ff240001
  IP: efi_bgrt_init+0xdc/0x134

  [...]

  ---[ end Kernel panic - not syncing: Attempted to kill the idle task!

... which was introduced by:

  7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code")

The cause is that on this machine the firmware provides the EFI ACPI BGRT
table even on legacy non-EFI bootups - which table should be EFI only.

The garbage BGRT data causes the efi_bgrt_init() panic.

Add a check to skip efi_bgrt_init() in case of a non-EFI bootup to work
around this firmware bug.

Tested-by: Sabrina Dubroca 
Signed-off-by: Dave Young 
Signed-off-by: Ard Biesheuvel 
Signed-off-by: Matt Fleming 
Cc:  # v4.11+
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Fixes: 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code")
Link: http://lkml.kernel.org/r/20170526113652.21339-6-m...@codeblueprint.co.uk
[ Rewrote the changelog to be more readable. ]
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/efi-bgrt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/firmware/efi/efi-bgrt.c b/drivers/firmware/efi/efi-bgrt.c
index 04ca876..8bf2732 100644
--- a/drivers/firmware/efi/efi-bgrt.c
+++ b/drivers/firmware/efi/efi-bgrt.c
@@ -36,6 +36,9 @@ void __init efi_bgrt_init(struct acpi_table_header *table)
 	if (acpi_disabled)
 		return;
 
+	if (!efi_enabled(EFI_BOOT))
+		return;
+
 	if (table->length < sizeof(bgrt_tab)) {
 		pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
 			  table->length, sizeof(bgrt_tab));

Re: [PATCH v5 28/32] x86/mm, kexec: Allow kexec to be used with SME

2017-05-26 Thread Dave Young

On 05/26/17 at 12:17pm, Xunlei Pang wrote:
> On 04/19/2017 at 05:21 AM, Tom Lendacky wrote:
> > Provide support so that kexec can be used to boot a kernel when SME is
> > enabled.
> >
> > Support is needed to allocate pages for kexec without encryption.  This
> > is needed in order to be able to reboot in the kernel in the same manner
> > as originally booted.
> 
> Hi Tom,
> 
> Looks like kdump will break, I didn't see the similar handling for kdump 
> cases, see kernel:
> kimage_alloc_crash_control_pages(), kimage_load_crash_segment(), etc.
> 
> We need to support kdump with SME, kdump 
> kernel/initramfs/purgatory/elfcorehdr/etc
> are all loaded into the reserved memory(see crashkernel=X) by userspace 
> kexec-tools.

For kexec_load the segments are loaded by kexec-tools. For the in-kernel
loader syscall, kexec_file_load, loading is handled in the kernel.

> I think a straightforward way would be to mark the whole reserved memory 
> range without
> encryption before loading all the kexec segments for kdump, I guess we can 
> handle this
> easily in arch_kexec_unprotect_crashkres().
> 
> Moreover, now that "elfcorehdr=X" is left as decrypted, it needs to be 
> remapped to the
> encrypted data.

Tom, could you give kdump a try following Xunlei's suggestion?
The suggestion is based only on a theoretical understanding of the
patches, so there could be other issues once you work on it. Feel free
to ask if we can help with anything.

Thanks
Dave

Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-05-25 Thread Dave Young

Cc'ing Xunlei; he is reading the patches to see what needs to be done for
kdump. There are probably still several places to handle to make kdump work.

On 05/18/17 at 07:01pm, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 04:22:12PM -0500, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
> 
> But why do user-space tools need to know that?
> 
> I mean, when we load the kdump kernel, we do it with the first kernel,
> with the kexec_load() syscall, AFAICT. And that code does a lot of
> things during that init, like machine_kexec_prepare()->init_pgtable() to
> prepare the ident mapping of the second kernel, for example.
> 
> What I'm aiming at is that the first kernel knows *exactly* whether SME
> is enabled or not and doesn't need to tell the second one through some
> sysfs entries - it can do that during loading.
> 
> So I don't think we need any userspace things at all...

If the kdump kernel can get the SME status from a hardware register, then
this should not be necessary and this patch can be dropped.

Thanks
Dave
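
As a rough illustration of reading the SME status from hardware: the sketch
below only decodes register values that the caller has already read, so it
stays portable. It assumes the AMD definitions as I understand them (CPUID
leaf 0x8000001f, EAX bit 0 for SME support, EBX bits 5:0 for the C-bit
position, and bit 23 of the SYSCFG MSR 0xc0010010 for the enable bit). The
function names are made up for illustration, and whether the running
kernel actually uses SME also depends on kernel configuration and command
line, which this does not cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_SME_LEAF      0x8000001fu           /* memory encryption leaf */
#define MSR_K8_SYSCFG       0xc0010010u           /* SYSCFG MSR number */
#define SYSCFG_MEM_ENCRYPT  (UINT64_C(1) << 23)   /* enable bit in SYSCFG */

/* EAX bit 0 of the memory encryption leaf: the CPU supports SME. */
static bool sme_supported(uint32_t eax)
{
	return (eax & 1u) != 0;
}

/* EBX bits 5:0: page table bit used as the encryption (C) bit. */
static unsigned int sme_cbit_position(uint32_t eax, uint32_t ebx)
{
	assert(sme_supported(eax)); /* only meaningful when SME exists */
	return ebx & 0x3fu;
}

/* SME is enabled in hardware only if supported and switched on in SYSCFG. */
static bool sme_hw_enabled(uint32_t eax, uint64_t syscfg)
{
	return sme_supported(eax) && (syscfg & SYSCFG_MEM_ENCRYPT) != 0;
}
```

A kdump kernel could feed these helpers with the values it reads through
cpuid and rdmsr early in boot instead of relying on a sysfs entry from the
first kernel.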

Re: [PATCH v4] x86/efi: Correct ident mapping of efi old_map when kalsr enabled

2017-05-24 Thread Dave Young

Hi Baoquan,
On 05/18/17 at 02:39pm, Baoquan He wrote:
> For EFI with 'efi=old_map' kernel option specified, Kernel will panic
> when kaslr is enabled.
> 
> The back trace is:
> 
> BUG: unable to handle kernel paging request at 7febd57e
> IP: 0x7febd57e
> PGD 1025a067
> PUD 0
> 
> Oops: 0010 [#1] SMP
> [ ... ]
> Call Trace:
>  ? efi_call+0x58/0x90
>  ? printk+0x58/0x6f
>  efi_enter_virtual_mode+0x3c5/0x50d
>  start_kernel+0x40f/0x4b8
>  ? set_init_arg+0x55/0x55
>  ? early_idt_handler_array+0x120/0x120
>  x86_64_start_reservations+0x24/0x26
>  x86_64_start_kernel+0x14c/0x16f
>  start_cpu+0x14/0x14
> 
> The root cause is the ident mapping is not built correctly in old_map case.
> 
> For nokaslr kernel, PAGE_OFFSET is 0xffff880000000000 which is PGDIR_SIZE
> aligned. We can borrow the pud table from direct mapping safely. Given a
> physical address X, we have pud_index(X) == pud_index(__va(X)). However,
> for kaslr kernel, PAGE_OFFSET is PUD_SIZE aligned. For a given physical
> address X, pud_index(X) != pud_index(__va(X)). We can't only copy pgd entry
> from direct mapping to build ident mapping, instead need copy pud entry
> one by one from direct mapping.
> 
> Fix it.
> 
> Signed-off-by: Baoquan He 
> Signed-off-by: Dave Young 

Although I put some effort into debugging the problem, the credit for the
patch should be yours. Also, I'm not familiar with the page table walking
code, especially the p4d usage, so it should be reviewed by the x86/efi/mm
maintainers.

It would be better to drop my Signed-off-by.

> Cc: Matt Fleming 
> Cc: Ard Biesheuvel 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Thomas Garnier 
> Cc: Kees Cook 
> Cc: Russ Anderson 
> Cc: Frank Ramsay 
> Cc: Borislav Petkov 
> Cc: Bhupesh Sharma 
> Cc: x...@kernel.org
> Cc: linux-...@vger.kernel.org
> ---
> v3->v4:
> 1. Forgot to run scripts/checkpatch.pl on the patch; there were several
> code style issues. Corrected them in this version.
> 
> v2->v3:
> 1. Rewrite the code to copy pud entries one by one so that it is easier to
> understand. Machines usually have less than 1TB, or at most several TB, of
> memory, so copying pud entries one by one won't hurt efficiency.
> 
> 2. Adding p4d page table handling.
> 
> v1->v2:
> Change code and add description according to Thomas's suggestion as below:
> 
> 1. Add checking if pud table is allocated successfully. If not just break
> the for loop.
> 
> 2. Add code comment to explain how the 1:1 mapping is built in 
> efi_call_phys_prolog
> 
> 3. Other minor change
> 
>  arch/x86/platform/efi/efi_64.c | 70 
> +-
>  1 file changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index c488625..087aafc 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -71,11 +71,13 @@ static void __init early_code_mapping_set_exec(int 
> executable)
>  
>  pgd_t * __init efi_call_phys_prolog(void)
>  {
> - unsigned long vaddress;
> - pgd_t *save_pgd;
> + unsigned long vaddr, addr_pgd, addr_p4d, addr_pud;
> + pgd_t *save_pgd, *pgd_k, *pgd_efi;
> + p4d_t *p4d, *p4d_k, *p4d_efi;
> + pud_t *pud;
>  
>   int pgd;
> - int n_pgds;
> + int n_pgds, i, j;
>  
>   if (!efi_enabled(EFI_OLD_MEMMAP)) {
>   save_pgd = (pgd_t *)read_cr3();
> @@ -88,10 +90,44 @@ pgd_t * __init efi_call_phys_prolog(void)
>   n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
>   save_pgd = kmalloc_array(n_pgds, sizeof(*save_pgd), GFP_KERNEL);
>  
> + /*
> +  * Build 1:1 ident mapping for old_map usage. It needs to be noticed
> +  * that PAGE_OFFSET is PGDIR_SIZE aligned with KASLR disabled, while
> +  * PUD_SIZE ALIGNED with KASLR enabled. So for a given physical
> +  * address X, the pud_index(X) != pud_index(__va(X)), we can only copy
> +  * pud entry of __va(X) to fill in pud entry of X to build 1:1 mapping
> +  * . Means here we can only reuse pmd table of direct mapping.
> +  */
>   for (pgd = 0; pgd < n_pgds; pgd++) {
> - save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> - vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> - set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), 
> *pgd_offset_k(vaddress));
> + addr_pgd = (unsigned long)(pgd * PGDIR_SIZE);
> + vaddr = (unsigned long)__va(pgd * PGDIR_SIZE);
> + pgd_efi = pgd_offset_k(addr_pgd);
> + save_pgd[pgd] = *pgd_efi;
> +
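
The alignment argument in the quoted changelog can be checked with plain
arithmetic. The sketch below is a userspace illustration only. It assumes
x86_64 4-level paging constants (PUD_SHIFT 30, PGDIR_SHIFT 39, 512 entries
per table) and takes 0xffff880000000000 as the non-KASLR PAGE_OFFSET. The
helper names direct_map_va() and pud_table_shareable() are hypothetical,
and so is any randomized base passed to them.

```c
#include <assert.h>
#include <stdint.h>

/* x86_64 4-level paging constants (assumed; 5-level paging differs). */
#define PUD_SHIFT     30
#define PGDIR_SHIFT   39
#define PTRS_PER_PUD  512
#define PUD_SIZE      (UINT64_C(1) << PUD_SHIFT)
#define PGDIR_SIZE    (UINT64_C(1) << PGDIR_SHIFT)

/* Index of the PUD entry that translates addr. */
static unsigned int pud_index(uint64_t addr)
{
	return (unsigned int)((addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1));
}

/* Direct-mapping virtual address of phys for a given PAGE_OFFSET. */
static uint64_t direct_map_va(uint64_t page_offset, uint64_t phys)
{
	assert(page_offset % PUD_SIZE == 0); /* at least PUD_SIZE aligned */
	return page_offset + phys;
}

/*
 * True if phys and its direct-mapping address use the same PUD slot,
 * i.e. the whole PUD table could be borrowed by copying one PGD entry.
 */
static int pud_table_shareable(uint64_t page_offset, uint64_t phys)
{
	return pud_index(phys) == pud_index(direct_map_va(page_offset, phys));
}
```

With a PGDIR_SIZE aligned PAGE_OFFSET the low 39 bits of the offset are
zero, so the PUD index of X and __va(X) always match. Once the base is
only PUD_SIZE aligned, the offset contributes to the PUD index and the
entries have to be copied one by one, as the patch does.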

Re: [PATCH] kexec/kdump: Minor Documentation updates for arm64 and Image

2017-05-17 Thread Dave Young

Add Takahiro and Pratyush, they should be able to review the arm64 part.

On 05/18/17 at 11:03am, Bharat Bhushan wrote:
> This patch makes minor updates to the documentation to list arm64 as
> supporting a relocatable kernel.
> It also updates the documentation for using the uncompressed
> image "Image", which is used on ARM64.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  Documentation/kdump/kdump.txt | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 615434d..522ce13 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -112,8 +112,8 @@ There are two possible methods of using Kdump.
>  2) Or use the system kernel binary itself as dump-capture kernel and there is
> no need to build a separate dump-capture kernel. This is possible
> only with the architectures which support a relocatable kernel. As
> -   of today, i386, x86_64, ppc64, ia64 and arm architectures support 
> relocatable
> -   kernel.
> +   of today, i386, x86_64, ppc64, ia64, arm and arm64 architectures support
> +   relocatable kernel.
>  
>  Building a relocatable kernel is advantageous from the point of view that
>  one does not have to build a second kernel for capturing the dump. But
> @@ -361,6 +361,12 @@ to load dump-capture kernel.
> --dtb= \
> --append="root= "
>  
> +If you are using a uncompressed Image, then use following command

s/a/an

> +to load dump-capture kernel.
> +
> +   kexec -p  \
> +   --initrd= \
> +   --append="root= "

Is --dtb not necessary for an uncompressed Image?

>  
>  Please note, that --args-linux does not need to be specified for ia64.
>  It is planned to make this a no-op on that architecture, but for now
> -- 
> 1.9.3
> 

Thanks
Dave

Re: [PATCH v3] x86/efi: Correct ident mapping of efi old_map when kalsr enabled

2017-05-16 Thread Dave Young

Hi, Baoquan

On 05/13/17 at 11:56am, Baoquan He wrote:
> For EFI with 'efi=old_map' kernel option specified, Kernel will panic
> when kaslr is enabled.
> 
> The back trace is:
> 
> BUG: unable to handle kernel paging request at 7febd57e
> IP: 0x7febd57e
> PGD 1025a067
> PUD 0
> 
> Oops: 0010 [#1] SMP
> [ ... ]
> Call Trace:
>  ? efi_call+0x58/0x90
>  ? printk+0x58/0x6f
>  efi_enter_virtual_mode+0x3c5/0x50d
>  start_kernel+0x40f/0x4b8
>  ? set_init_arg+0x55/0x55
>  ? early_idt_handler_array+0x120/0x120
>  x86_64_start_reservations+0x24/0x26
>  x86_64_start_kernel+0x14c/0x16f
>  start_cpu+0x14/0x14
> 
> The root cause is the ident mapping is not built correctly in old_map case.
> 
> For nokaslr kernel, PAGE_OFFSET is 0xffff880000000000 which is PGDIR_SIZE
> aligned. We can borrow the pud table from direct mapping safely. Given a
> physical address X, we have pud_index(X) == pud_index(__va(X)). However,
> for kaslr kernel, PAGE_OFFSET is PUD_SIZE aligned. For a given physical
> address X, pud_index(X) != pud_index(__va(X)). We can't only copy pgd entry
> from direct mapping to build ident mapping, instead need copy pud entry
> one by one from direct mapping.
> 
> Fix it.
> 
> Signed-off-by: Baoquan He 
> Signed-off-by: Dave Young 
> Cc: Matt Fleming 
> Cc: Ard Biesheuvel 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Thomas Garnier 
> Cc: Kees Cook 
> Cc: x...@kernel.org
> Cc: linux-...@vger.kernel.org
> ---
> v2->v3:
> 1. Rewrite the code to copy pud entries one by one so that it is easier to
> understand. Machines usually have less than 1TB, or at most several TB, of
> memory, so copying pud entries one by one won't hurt efficiency.
> 
> 2. Adding p4d page table handling.
> 
> v1->v2:
> Change code and add description according to Thomas's suggestion as below:
> 
> 1. Add checking if pud table is allocated successfully. If not just break
> the for loop.
> 
> 2. Add code comment to explain how the 1:1 mapping is built in 
> efi_call_phys_prolog
> 
> 3. Other minor change
> 
>  arch/x86/platform/efi/efi_64.c | 69 
> +-
>  1 file changed, 61 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index c488625..c9dffec 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -71,11 +71,13 @@ static void __init early_code_mapping_set_exec(int 
> executable)
>  
>  pgd_t * __init efi_call_phys_prolog(void)
>  {
> - unsigned long vaddress;
> - pgd_t *save_pgd;
> + unsigned long vaddr, addr_pgd, addr_p4d, addr_pud;
> + pgd_t *save_pgd, *pgd_k, *pgd_efi;
> + p4d_t *p4d, *p4d_k, *p4d_efi;
> + pud_t *pud;
>  
>   int pgd;
> - int n_pgds;
> + int n_pgds, i, j;
>  
>   if (!efi_enabled(EFI_OLD_MEMMAP)) {
>   save_pgd = (pgd_t *)read_cr3();
> @@ -88,10 +90,44 @@ pgd_t * __init efi_call_phys_prolog(void)
>   n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
>   save_pgd = kmalloc_array(n_pgds, sizeof(*save_pgd), GFP_KERNEL);
>  
> + /*
> +  * Build 1:1 ident mapping for old_map usage. It needs to be noticed
> +  * that PAGE_OFFSET is PGDIR_SIZE aligned with KASLR disabled, while
> +  * PUD_SIZE ALIGNED with KASLR enabled. So for a given physical
> +  * address X, the pud_index(X) != pud_index(__va(X)), we can only copy
> +  * pud entry of __va(X) to fill in pud entry of X to build 1:1 mapping
> +  * . Means here we can only reuse pmd table of direct mapping.
> +  */
>   for (pgd = 0; pgd < n_pgds; pgd++) {
> - save_pgd[pgd] = *pgd_offset_k(pgd * PGDIR_SIZE);
> - vaddress = (unsigned long)__va(pgd * PGDIR_SIZE);
> - set_pgd(pgd_offset_k(pgd * PGDIR_SIZE), 
> *pgd_offset_k(vaddress));
> + addr_pgd = (unsigned long)(pgd * PGDIR_SIZE);
> + vaddr = (unsigned long)__va(pgd * PGDIR_SIZE);
> + pgd_efi = pgd_offset_k(addr_pgd);
> + save_pgd[pgd] = *pgd_efi;
> + p4d =  p4d_alloc(&init_mm, pgd_efi, addr_pgd);
> +
> + if (!p4d) {
> + pr_err("Failed to allocate p4d table \n");
> + goto out;
> + }
> + for(i=0; i + addr_p4d = addr_pgd + i * P4D_SIZE;
> + p4d_efi = p4d + p4d_index(addr_p4d);
> + pud = pud_alloc(&init_mm, p4d_efi, addr_p4d);
> + if (!p

Re: [PATCH v2] x86/efi: Disable runtime services on kexec kernel if booted with efi=old_map

2017-05-16 Thread Dave Young

Hi Sai,
On 05/16/17 at 06:14pm, Sai Praneeth Prakhya wrote:
> From: Sai Praneeth 
> 
> Booting kexec kernel with "efi=old_map" in kernel command line hits
> kernel panic as shown below.
> 
> [0.001000] BUG: unable to handle kernel paging request at 88007fe78070
> [0.001000] IP: virt_efi_set_variable.part.7+0x63/0x1b0
> [0.001000] PGD 7ea28067
> [0.001000] PUD 7ea2b067
> [0.001000] PMD 7ea2d067
> [0.001000] PTE 0
> [0.001000]
> [0.001000] Oops:  [#1] SMP
> [0.001000] Modules linked in:
> [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 4.11.0-rc2-yocto-standard+ #229
> [0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 
> 02/06/2015
> [0.001000] task: 82022500 task.stack: 8200
> [0.001000] RIP: 0010:virt_efi_set_variable.part.7+0x63/0x1b0
> [0.001000] RSP: :82003dc0 EFLAGS: 00010246
> [0.001000] RAX: 88007fe78018 RBX: 82050300 RCX: 0007
> [0.001000] RDX: 82003e50 RSI: 82050300 RDI: 82050300
> [0.001000] RBP: 82003e08 R08:  R09: 
> [0.001000] R10:  R11:  R12: 82003e50
> [0.001000] R13: 0007 R14: 0246 R15: 
> [0.001000] FS:  () GS:88007fa0() 
> knlGS:
> [0.001000] CS:  0010 DS:  ES:  CR0: 80050033
> [0.001000] CR2: 88007fe78070 CR3: 7da1d000 CR4: 06b0
> [0.001000] DR0:  DR1:  DR2: 
> [0.001000] DR3:  DR6: fffe0ff0 DR7: 0400
> [0.001000] Call Trace:
> [0.001000]  virt_efi_set_variable+0x5d/0x70
> [0.001000]  efi_delete_dummy_variable+0x7a/0x80
> [0.001000]  efi_enter_virtual_mode+0x3f6/0x4a7
> [0.001000]  start_kernel+0x375/0x400
> [0.001000]  x86_64_start_reservations+0x2a/0x2c
> [0.001000]  x86_64_start_kernel+0x168/0x176
> [0.001000]  start_cpu+0x14/0x14
> [0.001000] Code: 04 b0 84 ff 80 3d c5 56 b3 00 00 4c 8b 44 24 08 75
> 6b 9c 41 5e 48 8b 05 9c 78 99 00 4d 89 c1 48 89 de 4d 89 f8 44 89 e9 4c
> 89 e2 <48> 8b 40 58 48 8b 78 58 e8 b0 2d 88 ff 48 c7 c6 b6 1d f4 81 4c
> [0.001000] RIP: virt_efi_set_variable.part.7+0x63/0x1b0 RSP: 82003dc0
> [0.001000] CR2: 88007fe78070
> [0.001000] ---[ end trace  ]---
> [0.001000] Kernel panic - not syncing: Attempted to kill the idle task!
> [0.001000] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
> task!
> 
> [efi=old_map was never intended to work with kexec. The problem with
> using efi=old_map is that the virtual addresses are assigned from the
> memory region used by other kernel mappings; vmalloc() space.
> Potentially there could be collisions when booting kexec if something
> else is mapped at the virtual address we allocated for runtime service
> regions in the initial boot] - Matt Fleming
> 
> Since kexec was never intended to work with efi=old_map, disable
> runtime services in kexec if booted with efi=old_map, so that we don't
> panic.
> 
> Signed-off-by: Sai Praneeth Prakhya 
> Cc: Borislav Petkov 
> Cc: Ricardo Neri 
> Cc: Matt Fleming 
> Cc: Ard Biesheuvel 
> Cc: Ravi Shankar 
> Cc: Lee Chun-Yi 
> Cc: Dave Young 
> 
> Changes since v1:
> Don't fix the panic, because this was never intended to work.
> ---
>  arch/x86/platform/efi/efi.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 87835c97611f..627b7b86f369 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -827,9 +827,11 @@ static void __init kexec_enter_virtual_mode(void)
>  
>   /*
>* We don't do virtual mode, since we don't do runtime services, on
> -  * non-native EFI
> +  * non-native EFI. With efi=old_map, we don't do runtime services in
> +  * kexec kernel because in the initial boot something else might
> +  * have been mapped at these virtual addresses.
>    */
> - if (!efi_is_native()) {
> + if (!efi_is_native() || efi_enabled(EFI_OLD_MEMMAP)) {
>   efi_memmap_unmap();
>   clear_bit(EFI_RUNTIME_SERVICES, &efi.flags);
>   return;
> -- 
> 2.1.4
> 

Assuming it has passed testing on your hardware:
Acked-by: Dave Young 

Thanks
Dave
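
For readers who want to play with the condition from the patch above, here
is a tiny userspace model of it. The flag values and the function name are
hypothetical and chosen only for illustration; the kernel keeps these as
bit numbers in efi.flags and tests them with efi_enabled().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the kernel uses its own bit numbers. */
#define FLAG_RUNTIME_SERVICES (1u << 0)
#define FLAG_OLD_MEMMAP       (1u << 1)

/*
 * Model of the patched check: a kexec'ed kernel keeps EFI runtime
 * services only on native EFI that was not booted with efi=old_map.
 */
static unsigned int kexec_adjust_efi_flags(unsigned int flags, bool native)
{
	if (!native || (flags & FLAG_OLD_MEMMAP))
		flags &= ~FLAG_RUNTIME_SERVICES;

	/* Postcondition: old_map never leaves runtime services enabled. */
	assert(!((flags & FLAG_OLD_MEMMAP) && (flags & FLAG_RUNTIME_SERVICES)));
	return flags;
}
```

The model leaves out efi_memmap_unmap() and the early return, since those
only matter inside the kernel.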

Re: [PATCH] x86/efi: Fix kexec kernel panic when efi=old_map is enabled

2017-05-15 Thread Dave Young

On 05/15/17 at 02:23pm, Matt Fleming wrote:
> (Pulling in Dave, Mr. Kexec on EFI)
> 
> On Mon, 08 May, at 12:25:23PM, Sai Praneeth Prakhya wrote:
> > From: Sai Praneeth 
> > 
> > Booting kexec kernel with "efi=old_map" in kernel command line hits
> > kernel panic as shown below.
> > 
> > [0.001000] BUG: unable to handle kernel paging request at 
> > 88007fe78070
> > [0.001000] IP: virt_efi_set_variable.part.7+0x63/0x1b0
> > [0.001000] PGD 7ea28067
> > [0.001000] PUD 7ea2b067
> > [0.001000] PMD 7ea2d067
> > [0.001000] PTE 0
> > [0.001000]
> > [0.001000] Oops:  [#1] SMP
> > [0.001000] Modules linked in:
> > [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> > 4.11.0-rc2-yocto-standard+ #229
> > [0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 0.0.0 02/06/2015
> > [0.001000] task: 82022500 task.stack: 8200
> > [0.001000] RIP: 0010:virt_efi_set_variable.part.7+0x63/0x1b0
> > [0.001000] RSP: :82003dc0 EFLAGS: 00010246
> > [0.001000] RAX: 88007fe78018 RBX: 82050300 RCX: 
> > 0007
> > [0.001000] RDX: 82003e50 RSI: 82050300 RDI: 
> > 82050300
> > [0.001000] RBP: 82003e08 R08:  R09: 
> > 
> > [0.001000] R10:  R11:  R12: 
> > 82003e50
> > [0.001000] R13: 0007 R14: 0246 R15: 
> > 
> > [0.001000] FS:  () GS:88007fa0() 
> > knlGS:
> > [0.001000] CS:  0010 DS:  ES:  CR0: 80050033
> > [0.001000] CR2: 88007fe78070 CR3: 7da1d000 CR4: 
> > 06b0
> > [0.001000] DR0:  DR1:  DR2: 
> > 
> > [0.001000] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [0.001000] Call Trace:
> > [0.001000]  virt_efi_set_variable+0x5d/0x70
> > [0.001000]  efi_delete_dummy_variable+0x7a/0x80
> > [0.001000]  efi_enter_virtual_mode+0x3f6/0x4a7
> > [0.001000]  start_kernel+0x375/0x400
> > [0.001000]  x86_64_start_reservations+0x2a/0x2c
> > [0.001000]  x86_64_start_kernel+0x168/0x176
> > [0.001000]  start_cpu+0x14/0x14
> > [0.001000] Code: 04 b0 84 ff 80 3d c5 56 b3 00 00 4c 8b 44 24 08 75
> > 6b 9c 41 5e 48 8b 05 9c 78 99 00 4d 89 c1 48 89 de 4d 89 f8 44 89 e9 4c
> > 89 e2 <48> 8b 40 58 48 8b 78 58 e8 b0 2d 88 ff 48 c7 c6 b6 1d f4 81 4c
> > [0.001000] RIP: virt_efi_set_variable.part.7+0x63/0x1b0 RSP: 
> > 82003dc0
> > [0.001000] CR2: 88007fe78070
> > [0.001000] ---[ end trace  ]---
> > [0.001000] Kernel panic - not syncing: Attempted to kill the idle task!
> > [0.001000] ---[ end Kernel panic - not syncing: Attempted to kill the 
> > idle task!
> > 
> > This happens because efi=old_map doesn't use efi_pgd but rather it uses
> > kernel's pgd. We don't hit the same panic in a regular kernel because
> > it uses old_map_region() and not __map_region().
> > 
> > Signed-off-by: Sai Praneeth Prakhya 
> > Cc: Borislav Petkov 
> > Cc: Ricardo Neri 
> > Cc: Matt Fleming 
> > Cc: Ard Biesheuvel 
> > Cc: Ravi Shankar 
> > ---
> >  arch/x86/platform/efi/efi_64.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index 4e043a8c8556..76e1cd6b74dd 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -320,6 +320,9 @@ static void __init __map_region(efi_memory_desc_t *md, 
> > u64 va)
> > >  	unsigned long pfn;
> > >  	pgd_t *pgd = efi_pgd;
> > >  
> > > +	if (efi_enabled(EFI_OLD_MEMMAP))
> > > +		pgd = swapper_pg_dir;
> > > +
> > >  	if (!(md->attribute & EFI_MEMORY_WB))
> > >  		flags |= _PAGE_PCD;
> > >  
> 
> The thing is, efi=old_map was never intended to work with kexec.
> 
> Part of the reason for introducing the new EFI runtime services
> mapping scheme was so that we could kexec on EFI. See commit
> d2f7cbe7b26a ("x86/efi: Runtime services virtual mapping").
> 
> The problem with using efi=old_map is that the virtual addresses are
> assigned from the memory region used by other kernel mappings;
> vmalloc() space.
> 
> Potentially there could be collisions when booting kexec if something
> else is mapped at the virtual address we allocated for runtime service
> regions in the initial boot.
> 
> So, while this patch may work for you and Joey, I don't think it's
> reliable.
> 
> Dave, did I miss anything?

Matt, sorry for the late reply. I did not notice that this patch was kexec
related and I missed it.

Yes, you are right: efi=old_map is not supposed to work with kexec
reboot because the runtime virtual addresses are not persistent. The only
way to kexec boot with old_map is to load the kexec kernel with the
kexec-tools option below:
kexec --noefi
and at the same time need pass acpi root pointer in

[PATCH] efi/bgrt: skip efi_bgrt_init in case non-efi boot

2017-05-15 Thread Dave Young


Sabrina Dubroca reported the early panic below. It was introduced by
commit 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code").
The cause is that on this machine the firmware still provides the ACPI BGRT
table, which should be EFI only, even for legacy boot. The garbage BGRT data
then causes the efi_bgrt_init() panic.

Add a check to skip efi_bgrt_init() in case of non-EFI boot to solve this
problem.

BUG: unable to handle kernel paging request at ff240001
IP: efi_bgrt_init+0xdc/0x134
PGD 1ac0c067
PUD 1ac0e067
PMD 1aee9067
PTE 938070180163

Oops: 0009 [#1] SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00116-g7b0a911 #19
Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.02 05/03/2012
task: 9fc10500 task.stack: 9fc0
RIP: 0010:efi_bgrt_init+0xdc/0x134
RSP: :9fc03d58 EFLAGS: 00010082
RAX: ff240001 RBX:  RCX: 138070180006
RDX: 8163 RSI: 938070180163 RDI: 05be
RBP: 9fc03d70 R08: 138070181000 R09: 0002
R10: 0002d000 R11: 98a3dedd2fc6 R12: 9f9f22b6
R13: 9ff49480 R14: 0010 R15: 
FS:  () GS:9fd2() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff240001 CR3: 1ac09000 CR4: 000406b0
Call Trace:
 ? acpi_parse_ioapic+0x98/0x98
 acpi_parse_bgrt+0x9/0xd
 acpi_table_parse+0x7a/0xa9
 acpi_boot_init+0x3c7/0x4f9
 ? acpi_parse_x2apic+0x74/0x74
 ? acpi_parse_x2apic_nmi+0x46/0x46
 setup_arch+0xb4b/0xc6f
 ? printk+0x52/0x6e
 start_kernel+0xb2/0x47b
 ? early_idt_handler_array+0x120/0x120
 x86_64_start_reservations+0x24/0x26
 x86_64_start_kernel+0xf7/0x11a
 start_cpu+0x14/0x14
Code: 48 c7 c7 10 16 a0 9f e8 4e 94 40 ff eb 62 be 06 00 00 00 e8 f9 ff 00 00 48 85 c0 75 0e 48 c7 c7 40 16 a0 9f e8 31 94 40 ff eb 45 <66> 44 8b 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 87 00 01 00 66
RIP: efi_bgrt_init+0xdc/0x134 RSP: 9fc03d58
CR2: ff240001
---[ end trace f68728a0d3053b52 ]---
Kernel panic - not syncing: Attempted to kill the idle task!
---[ end Kernel panic - not syncing: Attempted to kill the idle task!

Fixes: 7b0a911478c7 ("efi/x86: Move the EFI BGRT init code to early init code")
Signed-off-by: Dave Young 
Tested-by: Sabrina Dubroca 
---
 drivers/firmware/efi/efi-bgrt.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/firmware/efi/efi-bgrt.c b/drivers/firmware/efi/efi-bgrt.c
index 04ca876..8bf2732 100644
--- a/drivers/firmware/efi/efi-bgrt.c
+++ b/drivers/firmware/efi/efi-bgrt.c
@@ -36,6 +36,9 @@ void __init efi_bgrt_init(struct acpi_table_header *table)
 	if (acpi_disabled)
 		return;
 
+	if (!efi_enabled(EFI_BOOT))
+		return;
+
 	if (table->length < sizeof(bgrt_tab)) {
 		pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
 			  table->length, sizeof(bgrt_tab));
-- 
2.10.2

Re: [PATCH 08/10] efi/x86: Move EFI BGRT init code to early init code

2017-05-15 Thread Dave Young

On 05/15/17 at 01:10pm, Sabrina Dubroca wrote:
> 2017-05-15, 16:37:40 +0800, Dave Young wrote:
> > Hi,
> > 
> > Thanks for the report.
> > On 05/14/17 at 01:18am, Sabrina Dubroca wrote:
> > > 2017-01-31, 13:21:40 +, Ard Biesheuvel wrote:
> > > > From: Dave Young 
> > > > 
> > > > Before invoking the arch specific handler, efi_mem_reserve() reserves
> > > > the given memory region through memblock.
> > > > 
> > > > efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
> > > > time memblock is dead and should not be used anymore.
> > > > 
> > > > The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
> > > > table, so move parsing of the BGRT table to ACPI early boot code to
> > > > ensure that efi_mem_reserve() in EFI BGRT code still use memblock 
> > > > safely.
> > > > 
> > > > Signed-off-by: Dave Young 
> > > > Cc: Matt Fleming 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Len Brown 
> > > > Cc: linux-a...@vger.kernel.org
> > > > Tested-by: Bhupesh Sharma 
> > > > Signed-off-by: Ard Biesheuvel 
> > > 
> > > I have a box that panics in early boot after this patch. The kernel
> > > config is based on a Fedora 25 kernel + localmodconfig.
> > > 
> > > BUG: unable to handle kernel paging request at ff240001
> > > IP: efi_bgrt_init+0xdc/0x134
> > > PGD 1ac0c067
> > > PUD 1ac0e067
> > > PMD 1aee9067
> > > PTE 938070180163
> > > 
> > > Oops: 0009 [#1] SMP
> > > Modules linked in:
> > > CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00116-g7b0a911 #19
> > > Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 
> > > v01.02 05/03/2012
> > > task: 9fc10500 task.stack: 9fc0
> > > RIP: 0010:efi_bgrt_init+0xdc/0x134
> > > RSP: :9fc03d58 EFLAGS: 00010082
> > > RAX: ff240001 RBX:  RCX: 138070180006
> > > RDX: 8163 RSI: 938070180163 RDI: 05be
> > > RBP: 9fc03d70 R08: 138070181000 R09: 0002
> > > R10: 0002d000 R11: 98a3dedd2fc6 R12: 9f9f22b6
> > > R13: 9ff49480 R14: 0010 R15: 
> > > FS:  () GS:9fd2() 
> > > knlGS:
> > > CS:  0010 DS:  ES:  CR0: 80050033
> > > CR2: ff240001 CR3: 1ac09000 CR4: 000406b0
> > > Call Trace:
> > >  ? acpi_parse_ioapic+0x98/0x98
> > >  acpi_parse_bgrt+0x9/0xd
> > >  acpi_table_parse+0x7a/0xa9
> > >  acpi_boot_init+0x3c7/0x4f9
> > >  ? acpi_parse_x2apic+0x74/0x74
> > >  ? acpi_parse_x2apic_nmi+0x46/0x46
> > >  setup_arch+0xb4b/0xc6f
> > >  ? printk+0x52/0x6e
> > >  start_kernel+0xb2/0x47b
> > >  ? early_idt_handler_array+0x120/0x120
> > >  x86_64_start_reservations+0x24/0x26
> > >  x86_64_start_kernel+0xf7/0x11a
> > >  start_cpu+0x14/0x14
> > > Code: 48 c7 c7 10 16 a0 9f e8 4e 94 40 ff eb 62 be 06 00 00 00 e8 f9 ff 
> > > 00 00 48 85 c0 75 0e 48 c7 c7 40 16 a0 9f e8 31 94 40 ff eb 45 <66> 44 8b 
> > > 20 be 06 00 00 00 48 89 c7 8b 58 02 e8 87 00 01 00 66
> > > RIP: efi_bgrt_init+0xdc/0x134 RSP: 9fc03d58
> > > CR2: ff240001
> > > ---[ end trace f68728a0d3053b52 ]---
> > > Kernel panic - not syncing: Attempted to kill the idle task!
> > > ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> > > 
> > > 
> > > That code is:
> > > 
> > > 
> > > All code
> > > 
> > >    0: 48 c7 c7 10 16 a0 9f    mov    $0x9fa01610,%rdi
> > >    7: e8 4e 94 40 ff          callq  0xff40945a
> > >    c: eb 62                   jmp    0x70
> > >    e: be 06 00 00 00          mov    $0x6,%esi
> > >   13: e8 f9 ff 00 00          callq  0x10011
> > >   18: 48 85 c0                test   %rax,%rax
> > >   1b: 75 0e                   jne    0x2b
> > >   1d: 48 c7 c7 40 16 a0 9f    mov    $0x9fa01640,%rdi
> > >   24: e8 31 94 40 ff          callq  0xff40945a
> > >   29: eb 45                   jmp    0x70
> > >   2b:*66 44 8b 20             mov    (%rax),%r12w     <-- trapping instruction

Re: [PATCH 08/10] efi/x86: Move EFI BGRT init code to early init code

2017-05-15 Thread Dave Young

Hi,

Thanks for the report.
On 05/14/17 at 01:18am, Sabrina Dubroca wrote:
> 2017-01-31, 13:21:40 +, Ard Biesheuvel wrote:
> > From: Dave Young 
> > 
> > Before invoking the arch specific handler, efi_mem_reserve() reserves
> > the given memory region through memblock.
> > 
> > efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
> > time memblock is dead and should not be used anymore.
> > 
> > The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
> > table, so move parsing of the BGRT table to ACPI early boot code to
> > ensure that efi_mem_reserve() in EFI BGRT code still use memblock safely.
> > 
> > Signed-off-by: Dave Young 
> > Cc: Matt Fleming 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Len Brown 
> > Cc: linux-a...@vger.kernel.org
> > Tested-by: Bhupesh Sharma 
> > Signed-off-by: Ard Biesheuvel 
> 
> I have a box that panics in early boot after this patch. The kernel
> config is based on a Fedora 25 kernel + localmodconfig.
> 
> BUG: unable to handle kernel paging request at ff240001
> IP: efi_bgrt_init+0xdc/0x134
> PGD 1ac0c067
> PUD 1ac0e067
> PMD 1aee9067
> PTE 938070180163
> 
> Oops: 0009 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.10.0-rc5-00116-g7b0a911 #19
> Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.02 
> 05/03/2012
> task: 9fc10500 task.stack: 9fc0
> RIP: 0010:efi_bgrt_init+0xdc/0x134
> RSP: :9fc03d58 EFLAGS: 00010082
> RAX: ff240001 RBX:  RCX: 138070180006
> RDX: 8163 RSI: 938070180163 RDI: 05be
> RBP: 9fc03d70 R08: 138070181000 R09: 0002
> R10: 0002d000 R11: 98a3dedd2fc6 R12: 9f9f22b6
> R13: 9ff49480 R14: 0010 R15: 
> FS:  () GS:9fd2() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ff240001 CR3: 1ac09000 CR4: 000406b0
> Call Trace:
>  ? acpi_parse_ioapic+0x98/0x98
>  acpi_parse_bgrt+0x9/0xd
>  acpi_table_parse+0x7a/0xa9
>  acpi_boot_init+0x3c7/0x4f9
>  ? acpi_parse_x2apic+0x74/0x74
>  ? acpi_parse_x2apic_nmi+0x46/0x46
>  setup_arch+0xb4b/0xc6f
>  ? printk+0x52/0x6e
>  start_kernel+0xb2/0x47b
>  ? early_idt_handler_array+0x120/0x120
>  x86_64_start_reservations+0x24/0x26
>  x86_64_start_kernel+0xf7/0x11a
>  start_cpu+0x14/0x14
> Code: 48 c7 c7 10 16 a0 9f e8 4e 94 40 ff eb 62 be 06 00 00 00 e8 f9 ff 00 00 
> 48 85 c0 75 0e 48 c7 c7 40 16 a0 9f e8 31 94 40 ff eb 45 <66> 44 8b 20 be 06 
> 00 00 00 48 89 c7 8b 58 02 e8 87 00 01 00 66
> RIP: efi_bgrt_init+0xdc/0x134 RSP: 9fc03d58
> CR2: ff240001
> ---[ end trace f68728a0d3053b52 ]---
> Kernel panic - not syncing: Attempted to kill the idle task!
> ---[ end Kernel panic - not syncing: Attempted to kill the idle task!
> 
> 
> That code is:
> 
> 
> All code
> 
>    0: 48 c7 c7 10 16 a0 9f    mov    $0x9fa01610,%rdi
>    7: e8 4e 94 40 ff          callq  0xff40945a
>    c: eb 62                   jmp    0x70
>    e: be 06 00 00 00          mov    $0x6,%esi
>   13: e8 f9 ff 00 00          callq  0x10011
>   18: 48 85 c0                test   %rax,%rax
>   1b: 75 0e                   jne    0x2b
>   1d: 48 c7 c7 40 16 a0 9f    mov    $0x9fa01640,%rdi
>   24: e8 31 94 40 ff          callq  0xff40945a
>   29: eb 45                   jmp    0x70
>   2b:*66 44 8b 20             mov    (%rax),%r12w    <-- trapping instruction
>   2f: be 06 00 00 00          mov    $0x6,%esi
>   34: 48 89 c7                mov    %rax,%rdi
>   37: 8b 58 02                mov    0x2(%rax),%ebx
>   3a: e8 87 00 01 00          callq  0x100c6
>   3f: 66                      data16
> 
> Code starting with the faulting instruction
> ===
>    0: 66 44 8b 20             mov    (%rax),%r12w
>    4: be 06 00 00 00          mov    $0x6,%esi
>    9: 48 89 c7                mov    %rax,%rdi
>    c: 8b 58 02                mov    0x2(%rax),%ebx
>    f: e8 87 00 01 00          callq  0x1009b
>   14: 66                      data16
> 
> 
> which is just after the early_memremap() call.
> 
> I enabled early_ioremap_debug and the last warning had:
> 
> __early_ioremap(138070181000, 1000) [1] => 0001 + ff24

The phys addr looks odd..

From the kernel log, I do not see any EFI messages, so can you check whether
you are booting in legacy mode or in EFI mode?

I suppose bgrt are efi only, if y

Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-27 Thread Dave Young

On 04/27/17 at 08:52am, Dave Hansen wrote:
> On 04/27/2017 12:25 AM, Dave Young wrote:
> > On 04/21/17 at 02:55pm, Dave Hansen wrote:
> >> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> >>> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> >>> determine if SME is active.
> >>>
> >>> A new directory will be created:
> >>>   /sys/kernel/mm/sme/
> >>>
> >>> And two entries within the new directory:
> >>>   /sys/kernel/mm/sme/active
> >>>   /sys/kernel/mm/sme/encryption_mask
> >>
> >> Why do they care, and what will they be doing with this information?
> > 
> > Since kdump will copy old memory but need this to know if the old memory
> > was encrypted or not. With this sysfs file we can know the previous SME
> > status and pass to kdump kernel as like a kernel param.
> > 
> > Tom, have you got chance to try if it works or not?
> 
> What will the kdump kernel do with it though?  We kexec() into that
> kernel so the SME keys will all be the same, right?  So, will the kdump
> kernel be just setting the encryption bit in the PTE so it can copy the
> old plaintext out?

I assume it is for the active -> non-active case: the newly booted kernel
needs to know that the old memory is encrypted. But I have not read all the
patches, so I may be missing things.

> 
> Why do we need both 'active' and 'encryption_mask'?  How could it be
> that the hardware-enumerated 'encryption_mask' changes across a kexec()?

Leave this question to Tom..

Thanks
Dave

Re: [PATCH] kexec: allocate buffer in top-down, if specified, correctly

2017-04-27 Thread Dave Young

Correct Vivek's email address
On 04/28/17 at 01:19pm, Dave Young wrote:
> Vivek, can you help to give some comments about the locate hole isssue
> in kexec_file?
> 
> On 04/28/17 at 09:51am, AKASHI Takahiro wrote:
> > Thiago,
> > 
> > Thank you for the comment.
> > 
> > On Thu, Apr 27, 2017 at 07:00:04PM -0300, Thiago Jung Bauermann wrote:
> > > Hello,
> > > 
> > > Am Mittwoch, 26. April 2017, 17:22:09 BRT schrieb AKASHI Takahiro:
> > > > The current kexec_locate_mem_hole(kbuf.top_down == 1) stops searching at
> > > > the first memory region that has enough space for requested size even if
> > > > some of higher regions may also have.
> > > 
> > > kexec_locate_mem_hole expects arch_kexec_walk_mem to walk memory from top 
> > > to 
> > > bottom if top_down is true. That is what powerpc's version does.
> > 
> > Ah, I haven't noticed that, but x86 doesn't have arch_kexec_walk_mem and
> > how can it work for x86?
> > 
> > > Isn't it possible to walk resources from top to bottom?
> > 
> > Yes, it will be, but it seems to me that such a behavior is not intuitive
> > and even confusing if it doesn't come with explicit explanation.
> 
> Thing need to make clear is why do we need the change, it might be a
> problem for crashkernel=xM,low since that is for softiotlb in case
> crashkernel=xM,high being used, otherwise seems current code is fine.
>  
> Need seeking for old memory from Vivek to confirm.
> > 
> > > > This behavior is not consistent with locate_hole(hole_end == -1) 
> > > > function
> > > > of kexec-tools.
> > > > 
> > > > This patch fixes the bug, going though all the memory regions anyway.
> > > 
> > > This patch would break powerpc, because at the end of the memory walk 
> > > kbuf 
> > > would have the lowest memory hole.
> > > 
> > > If it's not possible to walk resources in reverse order, then this patch 
> > > needs 
> > > to change powerpc to always walk memory from bottom to top.
> > 
> > So I would like to hear from x86 guys.
> > 
> > Thanks
> > -Takahiro AKASHI
> > 
> > > -- 
> > > Thiago Jung Bauermann
> > > IBM Linux Technology Center
> > >

Re: [PATCH] kexec: allocate buffer in top-down, if specified, correctly

2017-04-27 Thread Dave Young

Vivek, can you help by commenting on the locate hole issue
in kexec_file?

On 04/28/17 at 09:51am, AKASHI Takahiro wrote:
> Thiago,
> 
> Thank you for the comment.
> 
> On Thu, Apr 27, 2017 at 07:00:04PM -0300, Thiago Jung Bauermann wrote:
> > Hello,
> > 
> > Am Mittwoch, 26. April 2017, 17:22:09 BRT schrieb AKASHI Takahiro:
> > > The current kexec_locate_mem_hole(kbuf.top_down == 1) stops searching at
> > > the first memory region that has enough space for requested size even if
> > > some of higher regions may also have.
> > 
> > kexec_locate_mem_hole expects arch_kexec_walk_mem to walk memory from top 
> > to 
> > bottom if top_down is true. That is what powerpc's version does.
> 
> Ah, I haven't noticed that, but x86 doesn't have arch_kexec_walk_mem and
> how can it work for x86?
> 
> > Isn't it possible to walk resources from top to bottom?
> 
> Yes, it will be, but it seems to me that such a behavior is not intuitive
> and even confusing if it doesn't come with explicit explanation.

What needs to be made clear is why we need the change. It might be a
problem for crashkernel=xM,low, since that is for the software IO TLB
(swiotlb) when crashkernel=xM,high is used; otherwise the current code
seems fine.

We need Vivek to recall the history here and confirm.
> 
> > > This behavior is not consistent with locate_hole(hole_end == -1) function
> > > of kexec-tools.
> > > 
> > > This patch fixes the bug, going though all the memory regions anyway.
> > 
> > This patch would break powerpc, because at the end of the memory walk kbuf 
> > would have the lowest memory hole.
> > 
> > If it's not possible to walk resources in reverse order, then this patch 
> > needs 
> > to change powerpc to always walk memory from bottom to top.
> 
> So I would like to hear from x86 guys.
> 
> Thanks
> -Takahiro AKASHI
> 
> > -- 
> > Thiago Jung Bauermann
> > IBM Linux Technology Center
> >

Re: [PATCH] kexec: allocate buffer in top-down, if specified, correctly

2017-04-27 Thread Dave Young

Hi AKASHI
On 04/26/17 at 05:22pm, AKASHI Takahiro wrote:
> The current kexec_locate_mem_hole(kbuf.top_down == 1) stops searching at
> the first memory region that has enough space for requested size even if
> some of higher regions may also have.
> This behavior is not consistent with locate_hole(hole_end == -1) function
> of kexec-tools.

Have you seen an actual bug happen, or did you just notice this during code
review?

So far we have not seen any reports about this.

> 
> This patch fixes the bug, going though all the memory regions anyway.
> 
> Signed-off-by: AKASHI Takahiro 
> ---
>  kernel/kexec_file.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b118735fea9d..2f131c0d9017 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -373,8 +373,8 @@ static int locate_mem_hole_top_down(unsigned long start, 
> unsigned long end,
>   /* If we are here, we found a suitable memory range */
>   kbuf->mem = temp_start;
>  
> - /* Success, stop navigating through remaining System RAM ranges */
> - return 1;
> + /* always return zero, going through all the System RAM ranges */
> + return 0;
>  }
>  
>  static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
> @@ -439,18 +439,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, 
> void *arg)
>   *
>   * Return: The memory walk will stop when func returns a non-zero value
>   * and that value will be returned. If all free regions are visited without
> - * func returning non-zero, then zero will be returned.
> + * func returning non-zero, then kbuf->mem will be additionally checked
> + * for top-down search.
> + * After all, zero will be returned if none of regions fits.
>   */
>  int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
>  int (*func)(u64, u64, void *))
>  {
> + int ret;
> +
> + kbuf->mem = 0;
>   if (kbuf->image->type == KEXEC_TYPE_CRASH)
> - return walk_iomem_res_desc(crashk_res.desc,
> + ret = walk_iomem_res_desc(crashk_res.desc,
>  IORESOURCE_SYSTEM_RAM | 
> IORESOURCE_BUSY,
>  crashk_res.start, crashk_res.end,
>  kbuf, func);
>   else
> - return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
> + ret = walk_system_ram_res(0, ULONG_MAX, kbuf, func);
> +
> + if (!ret && kbuf->mem)
> + ret = 1; /* found for top-down search */
> + return ret;
>  }
>  
>  /**
> -- 
> 2.11.1
>
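
To make the behavior under discussion concrete, here is a minimal userspace
sketch, not the kernel code. It assumes the ranges are visited in ascending
address order; struct range, hole_in_range(), find_hole_first_fit() and
find_hole_highest() are hypothetical names used only for this illustration.
Stopping at the first range that fits yields the lowest usable hole, while
visiting every range and keeping the last match yields the highest one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one System RAM range. */
struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Place a buffer of 'size' bytes as high as possible inside one range,
 * aligned down to 'align' (assumed to be a power of two).
 * Returns 0 when the buffer does not fit; address 0 is therefore treated
 * as "not found", similar to the kbuf->mem check in the quoted patch.
 */
static uint64_t hole_in_range(const struct range *r, uint64_t size,
			      uint64_t align)
{
	uint64_t len = r->end - r->start + 1;
	uint64_t addr;

	if (size == 0 || size > len)
		return 0;
	addr = (r->end - size + 1) & ~(align - 1);
	return addr >= r->start ? addr : 0;
}

/* Stop at the first range that fits: on an ascending walk, the lowest hole. */
static uint64_t find_hole_first_fit(const struct range *r, size_t n,
				    uint64_t size, uint64_t align)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t addr = hole_in_range(&r[i], size, align);

		if (addr)
			return addr;
	}
	return 0;
}

/* Visit every range and keep the last match: the highest hole. */
static uint64_t find_hole_highest(const struct range *r, size_t n,
				  uint64_t size, uint64_t align)
{
	uint64_t best = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t addr = hole_in_range(&r[i], size, align);

		if (addr)
			best = addr;
	}
	return best;
}
```

If the walk itself runs from top to bottom, as described for powerpc in this
thread, the first-fit variant already returns the highest hole and the
keep-the-last-match variant would return the lowest one instead.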

Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-27 Thread Dave Young

On 04/21/17 at 02:55pm, Dave Hansen wrote:
> On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> > Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> > determine if SME is active.
> > 
> > A new directory will be created:
> >   /sys/kernel/mm/sme/
> > 
> > And two entries within the new directory:
> >   /sys/kernel/mm/sme/active
> >   /sys/kernel/mm/sme/encryption_mask
> 
> Why do they care, and what will they be doing with this information?

kdump will copy the old memory, and it needs this to know whether the old
memory was encrypted or not. With this sysfs file we can know the previous SME
status and pass it to the kdump kernel, for example as a kernel parameter.

Tom, have you had a chance to try whether it works?

Thanks
Dave
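
As a rough illustration of how a userspace tool might consume such a file,
here is a minimal sketch. It assumes the proposed
/sys/kernel/mm/sme/active entry holds a single integer ("0" or "1"); that
format is an assumption, and read_flag_file() is a hypothetical helper, not
part of any existing tool.

```c
#include <stdio.h>

/*
 * Read a single integer flag from a file.
 * Returns 1 or 0 for the value read, -1 if the file is missing or
 * cannot be parsed. The "0"/"1" content format is an assumption.
 */
static int read_flag_file(const char *path)
{
	FILE *f = fopen(path, "r");
	int val;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	else
		val = val ? 1 : 0;
	fclose(f);
	return val;
}
```

A caller would pass the path from the proposal, for example
read_flag_file("/sys/kernel/mm/sme/active"), and treat -1 as "SME status
unknown" on kernels that do not provide the entry.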

Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Dave Young

Hi Xunlei,

On 04/27/17 at 01:25pm, Xunlei Pang wrote:
> On 04/27/2017 at 11:06 AM, Dave Young wrote:
> > [snip]
> >>>>>  
> >>>>>  static int __init crash_save_vmcoreinfo_init(void)
> >>>>>  {
> >>>>> +   /* One page should be enough for VMCOREINFO_BYTES under all 
> >>>>> archs */
> >>>> Can we add a comment in the VMCOREINFO_BYTES header file about the one
> >>>> page assumption?
> >>>>
> >>>> Or just define the VMCOREINFO_BYTES as PAGE_SIZE instead of 4096
> >>> Yes, I considered this before, but VMCOREINFO_BYTES is also used by 
> >>> VMCOREINFO_NOTE_SIZE
> >>> definition which is exported to sysfs, also some platform has larger page 
> >>> size(64KB), so
> >>> I didn't touch this 4096 value.
> >>>
> >>> I think I should use kmalloc() to allocate both of them, then move this 
> >>> comment to Patch3 
> >>> kimage_crash_copy_vmcoreinfo().
> >> But on the other hand, using a separate page for them seems safer compared 
> >> with
> >> using frequently-used slab, what's your opinion?
> > I feel current page based way is better.
> >
> > For 64k page the vmcore note size will increase it seems fine. Do you
> > have concern in mind?
> 
> Since tools are supposed to acquire vmcoreinfo note size from sysfs, it 
> should be safe to do so,
> except that there is some waste in memory for larger PAGE_SIZE.

Either way is fine with me; I think it is up to your implementation. If you
choose page allocation, then changing the macro to PAGE_SIZE looks better.

Thanks
Dave

Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Dave Young

[snip]
> >>>  
> >>>  static int __init crash_save_vmcoreinfo_init(void)
> >>>  {
> >>> + /* One page should be enough for VMCOREINFO_BYTES under all archs */
> >> Can we add a comment in the VMCOREINFO_BYTES header file about the one
> >> page assumption?
> >>
> >> Or just define the VMCOREINFO_BYTES as PAGE_SIZE instead of 4096
> > Yes, I considered this before, but VMCOREINFO_BYTES is also used by 
> > VMCOREINFO_NOTE_SIZE
> > definition which is exported to sysfs, also some platform has larger page 
> > size(64KB), so
> > I didn't touch this 4096 value.
> >
> > I think I should use kmalloc() to allocate both of them, then move this 
> > comment to Patch3 
> > kimage_crash_copy_vmcoreinfo().
> 
> But on the other hand, using a separate page for them seems safer compared 
> with
> using frequently-used slab, what's your opinion?

I feel the current page-based way is better.

For 64k pages the vmcore note size will increase, but that seems fine. Do you
have a concern in mind?

Thanks

Re: [PATCH v4 3/3] kdump: Protect vmcoreinfo data under the crash memory

2017-04-26 Thread Dave Young

[snip]
> >> index 43cdb00..a29e9ad 100644
> >> --- a/kernel/crash_core.c
> >> +++ b/kernel/crash_core.c
> >> @@ -15,9 +15,12 @@
> >>  
> >>  /* vmcoreinfo stuff */
> >>  static unsigned char *vmcoreinfo_data;
> >> -size_t vmcoreinfo_size;
> >> +static size_t vmcoreinfo_size;
> >>  u32 *vmcoreinfo_note;
> >>  
> >> +/* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
> > May make it clearer like:
> > /* Trusted vmcoreinfo copy in the kdump reserved memory */
> 
> My thought is that it is in crash_core.c now which should be independent of 
> kexec/kdump,
> so I used "e.g. ..." just like one use case.

Ok, then it is fine.

[snip]
> >>  static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
> >>  {
> >>if (*image->entry != 0)
> >> @@ -598,6 +632,11 @@ void kimage_free(struct kimage *image)
> >>if (image->file_mode)
> >>kimage_file_post_load_cleanup(image);
> >>  
> >> +  if (image->vmcoreinfo_data_copy) {
> >> +  crash_update_vmcoreinfo_safecopy(NULL);
> >> +  vunmap(image->vmcoreinfo_data_copy);
> >> +  }
> >> +
> > Should move above chunk before the freeing of the actual page?
> 
> It should be fine, because it is allocated from the reserved memory, it 
> doesn't
> need to be freed. Anyway I can move it above to avoid confusion. Thanks!
> 

Yes, that looks better, thanks for the explanation.

Thanks
Dave

Re: [PATCH] memmap: Parse "Reserved" together with "reserved"

2017-04-26 Thread Dave Young

On 04/26/17 at 03:28pm, Dave Young wrote:
> On 04/26/17 at 08:22am, Ingo Molnar wrote:
> > 
> > * Yinghai Lu  wrote:
> > 
> > > For x86 with recent kernel after
> > >  commit 640e1b38b0 ("x86/boot/e820: Basic cleanup of e820.c")
> > > change "reserved" to "Reserved" in /sys firmware memmap and /proc/iomem.
> > > 
> > > So here, we add handling for that too.
> > > 
> > > Signed-off-by: Yinghai Lu 
> > > 
> > > ---
> > >  kexec/arch/i386/crashdump-x86.c |2 ++
> > >  kexec/arch/ia64/kexec-ia64.c|2 ++
> > >  kexec/arch/mips/kexec-mips.c|2 ++
> > >  kexec/firmware_memmap.c |2 ++
> > >  4 files changed, 8 insertions(+)
> > 
> > I'd rather fix the bug I introduced and undo the reserved->Reserved string 
> > change 
> 
> This patch parses both 'reserved' and 'Reserved' it should be fine, but
> reverting the change in kernel sounds better..

Hmm, after pressing send I noticed that old kexec-tools with a new kernel
is still a problem, so we had better revert the kernel change.

> 
> > in e820.c: I didn't realize that it's exposed in sysfs and had quasi-ABI 
> > consequences for kexec.
> > 
> > Agreed?
> > 
> > Thanks,
> > 
> > Ingo
> > 
> > ___
> > kexec mailing list
> > ke...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> 
> Thanks
> Dave

Re: [PATCH] memmap: Parse "Reserved" together with "reserved"

2017-04-26 Thread Dave Young

On 04/26/17 at 08:22am, Ingo Molnar wrote:
> 
> * Yinghai Lu  wrote:
> 
> > For x86 with recent kernel after
> >  commit 640e1b38b0 ("x86/boot/e820: Basic cleanup of e820.c")
> > change "reserved" to "Reserved" in /sys firmware memmap and /proc/iomem.
> > 
> > So here, we add handling for that too.
> > 
> > Signed-off-by: Yinghai Lu 
> > 
> > ---
> >  kexec/arch/i386/crashdump-x86.c |2 ++
> >  kexec/arch/ia64/kexec-ia64.c|2 ++
> >  kexec/arch/mips/kexec-mips.c|2 ++
> >  kexec/firmware_memmap.c |2 ++
> >  4 files changed, 8 insertions(+)
> 
> I'd rather fix the bug I introduced and undo the reserved->Reserved string 
> change 

This patch parses both 'reserved' and 'Reserved', so it should be fine, but
reverting the change in the kernel sounds better.

> in e820.c: I didn't realize that it's exposed in sysfs and had quasi-ABI 
> consequences for kexec.
> 
> Agreed?
> 
> Thanks,
> 
>   Ingo
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave
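
For illustration only, one way a userspace parser could tolerate both
spellings is a case-insensitive comparison of the type string. This is a
sketch under that assumption; is_reserved_type() is a hypothetical helper,
not a kexec-tools function, and the body of the quoted patch is not shown
here, so this may differ from what it does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>	/* strcasecmp(), POSIX */

/* Accept "reserved", "Reserved" or any other capitalization. */
static bool is_reserved_type(const char *type)
{
	return type != NULL && strcasecmp(type, "reserved") == 0;
}
```

A check like this only helps kexec-tools binaries that are rebuilt with it;
already deployed binaries would still compare against the old string.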

Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-26 Thread Dave Young

Adding the ia64 list and the s390 list, although Michael has already tested it.

On 04/20/17 at 07:39pm, Xunlei Pang wrote:
> As Eric said,
> "what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
> 
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section."
> 
> This patch allocates extra pages for these vmcoreinfo_XXX variables,
> one advantage is that it enhances some safety of vmcoreinfo, because
> vmcoreinfo now is kept far away from other kernel data structures.
> 
> Suggested-by: Eric Biederman 
> Cc: Michael Holzheu 
> Cc: Juergen Gross 
> Signed-off-by: Xunlei Pang 
> ---
> v3->v4:
> -Rebased on the latest linux-next
> -Handle S390 vmcoreinfo_note properly
> -Handle the newly-added xen/mmu_pv.c
> 
>  arch/ia64/kernel/machine_kexec.c |  5 -
>  arch/s390/kernel/machine_kexec.c |  1 +
>  arch/s390/kernel/setup.c |  6 --
>  arch/x86/kernel/crash.c  |  2 +-
>  arch/x86/xen/mmu_pv.c|  4 ++--
>  include/linux/crash_core.h   |  2 +-
>  kernel/crash_core.c  | 27 +++
>  kernel/ksysfs.c  |  2 +-
>  8 files changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/ia64/kernel/machine_kexec.c 
> b/arch/ia64/kernel/machine_kexec.c
> index 599507b..c14815d 100644
> --- a/arch/ia64/kernel/machine_kexec.c
> +++ b/arch/ia64/kernel/machine_kexec.c
> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>  #endif
>  }
>  
> -phys_addr_t paddr_vmcoreinfo_note(void)
> -{
> - return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
> -}
> -
> diff --git a/arch/s390/kernel/machine_kexec.c 
> b/arch/s390/kernel/machine_kexec.c
> index 49a6bd4..3d0b14a 100644
> --- a/arch/s390/kernel/machine_kexec.c
> +++ b/arch/s390/kernel/machine_kexec.c
> @@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
>   VMCOREINFO_SYMBOL(lowcore_ptr);
>   VMCOREINFO_SYMBOL(high_memory);
>   VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> + mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>  }
>  
>  void machine_shutdown(void)
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 3ae756c..3d1d808 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
>   pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
>  }
>  
> -static void __init setup_vmcoreinfo(void)
> -{
> - mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
> -}
> -
>  #ifdef CONFIG_CRASH_DUMP
>  
>  /*
> @@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
>  #endif
>  
>   setup_resources();
> - setup_vmcoreinfo();
>   setup_lowcore();
>   smp_fill_possible_mask();
>   cpu_detect_mhz_feature();
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 22217ec..44404e2 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
> *ced,
>   bufp += sizeof(Elf64_Phdr);
>   phdr->p_type = PT_NOTE;
>   phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> - phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> + phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>   (ehdr->e_phnum)++;
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 9d9ae66..35543fa 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
> unsigned int order)
>  phys_addr_t paddr_vmcoreinfo_note(void)
>  {
>   if (xen_pv_domain())
> - return virt_to_machine(&vmcoreinfo_note).maddr;
> + return virt_to_machine(vmcoreinfo_note).maddr;
>   else
> - return __pa_symbol(&vmcoreinfo_note);
> + return __pa(vmcoreinfo_note);
>  }
>  #endif /* CONFIG_KEXEC_CORE */
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index eb71a70..ba283a2 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -53,7 +53,7 @@
>  #define VMCOREINFO_PHYS_BASE(value) \
>   vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
>  
> -extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
> +extern u32 *vmcoreinfo_note;
>  extern size_t vmcoreinfo_size;
>  extern size_t vmcoreinfo_max_size;
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index fcbd568..0321f04 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -14,10 +14,10 @@
>  #include 
>  
>  /* vmcoreinfo stuff */
> -static unsigned char vmcoreinfo_data[VMCOREINFO_B

Re: [PATCH v4 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-04-26 Thread Dave Young

Ccing ppc list
On 04/20/17 at 07:39pm, Xunlei Pang wrote:
> vmcoreinfo_max_size stands for the vmcoreinfo_data, the
> correct one we should use is vmcoreinfo_note whose total
> size is VMCOREINFO_NOTE_SIZE.
> 
> Like explained in commit 77019967f06b ("kdump: fix exported
> size of vmcoreinfo note"), it should not affect the actual
> function, but we better fix it, also this change should be
> safe and backward compatible.
> 
> After this, we can get rid of variable vmcoreinfo_max_size,
> let's use the corresponding macros directly, fewer variables
> means more safety for vmcoreinfo operation.
> 
> Cc: Mahesh Salgaonkar 
> Cc: Hari Bathini 
> Signed-off-by: Xunlei Pang 
> ---
> v3->v4:
> -Rebased on the latest linux-next
> 
>  arch/powerpc/kernel/fadump.c | 3 +--
>  include/linux/crash_core.h   | 1 -
>  kernel/crash_core.c  | 3 +--
>  3 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 466569e..7bd6cd0 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
>  
>   phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
>   phdr->p_offset  = phdr->p_paddr;
> - phdr->p_memsz   = vmcoreinfo_max_size;
> - phdr->p_filesz  = vmcoreinfo_max_size;
> + phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
>  
>   /* Increment number of program headers. */
>   (elf->e_phnum)++;
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index ba283a2..7d6bc7b 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -55,7 +55,6 @@
>  
>  extern u32 *vmcoreinfo_note;
>  extern size_t vmcoreinfo_size;
> -extern size_t vmcoreinfo_max_size;
>  
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len);
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 0321f04..43cdb00 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -16,7 +16,6 @@
>  /* vmcoreinfo stuff */
>  static unsigned char *vmcoreinfo_data;
>  size_t vmcoreinfo_size;
> -size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
>  u32 *vmcoreinfo_note;
>  
>  /*
> @@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>   r = vscnprintf(buf, sizeof(buf), fmt, args);
>   va_end(args);
>  
> - r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
> + r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
>  
>   memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
>  
> -- 
> 1.8.3.1
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Reviewed-by: Dave Young 

Thanks
Dave
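
The clamping done in the quoted vmcoreinfo_append_str() hunk can be
illustrated with a small userspace sketch. This is not the kernel
implementation: INFO_BYTES, struct info_buf and info_append() are
hypothetical names, and the sketch uses vsnprintf() where the kernel code
uses vscnprintf().

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define INFO_BYTES 64	/* hypothetical capacity, stands in for VMCOREINFO_BYTES */

struct info_buf {
	char data[INFO_BYTES];
	size_t size;	/* bytes used so far */
};

/*
 * Append a formatted string to the buffer, silently truncating once the
 * buffer is full. Returns the number of bytes actually copied.
 */
static size_t info_append(struct info_buf *b, const char *fmt, ...)
{
	char tmp[128];
	va_list args;
	size_t r, room;
	int n;

	va_start(args, fmt);
	n = vsnprintf(tmp, sizeof(tmp), fmt, args);
	va_end(args);
	if (n < 0)
		return 0;

	/* vsnprintf() reports the untruncated length; clamp to what tmp holds. */
	r = (size_t)n < sizeof(tmp) ? (size_t)n : sizeof(tmp) - 1;

	/* Clamp again to the space left in the destination buffer. */
	room = INFO_BYTES - b->size;
	if (r > room)
		r = room;

	memcpy(&b->data[b->size], tmp, r);
	b->size += r;
	return r;
}
```

Using a compile-time constant for the capacity, as the patch does with
VMCOREINFO_BYTES, means there is no separate size variable that could be
corrupted at run time.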

Re: [PATCH v4 3/3] kdump: Protect vmcoreinfo data under the crash memory

2017-04-26 Thread Dave Young

On 04/20/17 at 07:39pm, Xunlei Pang wrote:
> Currently vmcoreinfo data is updated at boot time subsys_initcall(),
> it has the risk of being modified by some wrong code during system
> is running.
> 
> As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
> when using "crash", "makedumpfile", etc utility to parse this vmcore,
> we probably will get "Segmentation fault" or other unexpected errors.
> 
> E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
> system; 3) trigger kdump, then we obviously will fail to recognize the
> crash context correctly due to the corrupted vmcoreinfo.
> 
> Now except for vmcoreinfo, all the crash data is well protected(including
> the cpu note which is fully updated in the crash path, thus its correctness
> is guaranteed). Given that vmcoreinfo data is a large chunk prepared for
> kdump, we better protect it as well.
> 
> To solve this, we relocate and copy vmcoreinfo_data to the crash memory
> when kdump is loading via kexec syscalls. Because the whole crash memory
> will be protected by existing arch_kexec_protect_crashkres() mechanism,
> we naturally protect vmcoreinfo_data from write(even read) access under
> kernel direct mapping after kdump is loaded.
> 
> Since kdump is usually loaded at the very early stage after boot, we can
> trust the correctness of the vmcoreinfo data copied.
> 
> On the other hand, we still need to operate the vmcoreinfo safe copy when
> crash happens to generate vmcoreinfo_note again, we rely on vmap() to map
> out a new kernel virtual address and update to use this new one instead in
> the following crash_save_vmcoreinfo().
> 
> BTW, we do not touch vmcoreinfo_note, because it will be fully updated
> using the protected vmcoreinfo_data after crash which is surely correct
> just like the cpu crash note.
> 
> Cc: Michael Holzheu 
> Signed-off-by: Xunlei Pang 
> ---
> v3->v4:
> -Rebased on the latest linux-next
> -Copy vmcoreinfo after machine_kexec_prepare()
> 
>  include/linux/crash_core.h |  2 +-
>  include/linux/kexec.h  |  2 ++
>  kernel/crash_core.c| 17 -
>  kernel/kexec.c |  8 
>  kernel/kexec_core.c| 39 +++
>  kernel/kexec_file.c|  8 
>  6 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 7d6bc7b..5469adb 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -23,6 +23,7 @@
>  
>  typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
>  
> +void crash_update_vmcoreinfo_safecopy(void *ptr);
>  void crash_save_vmcoreinfo(void);
>  void arch_crash_save_vmcoreinfo(void);
>  __printf(1, 2)
> @@ -54,7 +55,6 @@
>   vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
>  
>  extern u32 *vmcoreinfo_note;
> -extern size_t vmcoreinfo_size;
>  
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
> void *data, size_t data_len);
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index c9481eb..3ea8275 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -181,6 +181,7 @@ struct kimage {
>   unsigned long start;
>   struct page *control_code_page;
>   struct page *swap_page;
> + void *vmcoreinfo_data_copy; /* locates in the crash memory */
>  
>   unsigned long nr_segments;
>   struct kexec_segment segment[KEXEC_SEGMENT_MAX];
> @@ -250,6 +251,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct 
> kimage *image,
>  int kexec_should_crash(struct task_struct *);
>  int kexec_crash_loaded(void);
>  void crash_save_cpu(struct pt_regs *regs, int cpu);
> +extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
>  
>  extern struct kimage *kexec_image;
>  extern struct kimage *kexec_crash_image;
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 43cdb00..a29e9ad 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -15,9 +15,12 @@
>  
>  /* vmcoreinfo stuff */
>  static unsigned char *vmcoreinfo_data;
> -size_t vmcoreinfo_size;
> +static size_t vmcoreinfo_size;
>  u32 *vmcoreinfo_note;
>  
> +/* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */

May make it clearer like:
/* Trusted vmcoreinfo copy in the kdump reserved memory */

> +static unsigned char *vmcoreinfo_data_safecopy;
> +
>  /*
>   * parsing the "crashkernel" commandline
>   *
> @@ -323,11 +326,23 @@ static void update_vmcoreinfo_note(void)
>   final_note(buf);
>  }
>  
> +void crash_update_vmcoreinfo_safecopy(void *ptr)
> +{
> + if (ptr)
> + memcpy(ptr, vmcoreinfo_data, vmcoreinfo_size);
> +
> + vmcoreinfo_data_safecopy = ptr;
> +}
> +
>  void crash_save_vmcoreinfo(void)
>  {
>   if (!vmcoreinfo_note)
>   return;
>  
> + /* Use the safe copy to generate vmcoreinfo note if have */
> + if (vmcoreinfo_data_safecopy)
> + vmcoreinfo_dat

Re: KASLR causes intermittent boot failures on some systems

2017-04-12 Thread Dave Young

On 04/12/17 at 04:24pm, Dave Young wrote:
> On 04/07/17 at 10:41am, Jeff Moyer wrote:
> > Hi,
> > 
> > commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> > regions") causes some of my systems with persistent memory (whether real
> > or emulated) to fail to boot with a couple of different crash
> > signatures.  The first signature is a NMI watchdog lockup of all but 1
> > cpu, which causes much difficulty in extracting useful information from
> > the console.  The second variant is an invalid paging request, listed
> > below.
> > 
> > On some systems, I haven't hit this problem at all.  Other systems
> > experience a failed boot maybe 20-30% of the time.  To reproduce it,
> > configure some emulated pmem on your system.  You can find directions
> > for that here: https://nvdimm.wiki.kernel.org/
> > 
> > Install ndctl (https://github.com/pmem/ndctl).
> > Configure the namespace:
> > # ndctl create-namespace -f -e namespace0.0 -m memory
> > 
> > Then just reboot several times (5 should be enough), and hopefully
> > you'll hit the issue.
> > 
> > I've attached both my .config and the dmesg output from a successful
> > boot at the end of this mail.
> > 
> [snip]
> 
> I did some tests about emulated pmem via memmap=, kdump kernel hangs or
> just reboots early during compressing kernel, no clue how to handle it.
> Since for kdump kernel kaslr is pointless a workaround is use "nokaslr"
> 
> In Fedora or RHEL, just add "nokaslr" in KDUMP_COMMANDLINE_APPEND
> in /etc/sysconfig/kdump 
> 
> Can you try if this works?

Oops, your problem is with normal boot rather than kdump, so these are two
different problems. It seems we have not hit your bug yet.

Thanks
Dave

Re: KASLR causes intermittent boot failures on some systems

2017-04-12 Thread Dave Young

On 04/12/17 at 04:24pm, Dave Young wrote:
> On 04/07/17 at 10:41am, Jeff Moyer wrote:
> > Hi,
> > 
> > commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> > regions") causes some of my systems with persistent memory (whether real
> > or emulated) to fail to boot with a couple of different crash
> > signatures.  The first signature is a NMI watchdog lockup of all but 1
> > cpu, which causes much difficulty in extracting useful information from
> > the console.  The second variant is an invalid paging request, listed
> > below.
> > 
> > On some systems, I haven't hit this problem at all.  Other systems
> > experience a failed boot maybe 20-30% of the time.  To reproduce it,
> > configure some emulated pmem on your system.  You can find directions
> > for that here: https://nvdimm.wiki.kernel.org/
> > 
> > Install ndctl (https://github.com/pmem/ndctl).
> > Configure the namespace:
> > # ndctl create-namespace -f -e namespace0.0 -m memory
> > 
> > Then just reboot several times (5 should be enough), and hopefully
> > you'll hit the issue.
> > 
> > I've attached both my .config and the dmesg output from a successful
> > boot at the end of this mail.
> > 
> [snip]
> 
> I did some tests about emulated pmem via memmap=, kdump kernel hangs or
> just reboots early during compressing kernel, no clue how to handle it.

s/compressing/uncompressing

> Since for kdump kernel kaslr is pointless a workaround is use "nokaslr"
> 
> In Fedora or RHEL, just add "nokaslr" in KDUMP_COMMANDLINE_APPEND
> in /etc/sysconfig/kdump 
> 
> Can you try if this works?
> 
> Thanks
> Dave

Re: KASLR causes intermittent boot failures on some systems

2017-04-12 Thread Dave Young

On 04/07/17 at 10:41am, Jeff Moyer wrote:
> Hi,
> 
> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> regions") causes some of my systems with persistent memory (whether real
> or emulated) to fail to boot with a couple of different crash
> signatures.  The first signature is a NMI watchdog lockup of all but 1
> cpu, which causes much difficulty in extracting useful information from
> the console.  The second variant is an invalid paging request, listed
> below.
> 
> On some systems, I haven't hit this problem at all.  Other systems
> experience a failed boot maybe 20-30% of the time.  To reproduce it,
> configure some emulated pmem on your system.  You can find directions
> for that here: https://nvdimm.wiki.kernel.org/
> 
> Install ndctl (https://github.com/pmem/ndctl).
> Configure the namespace:
> # ndctl create-namespace -f -e namespace0.0 -m memory
> 
> Then just reboot several times (5 should be enough), and hopefully
> you'll hit the issue.
> 
> I've attached both my .config and the dmesg output from a successful
> boot at the end of this mail.
> 
[snip]

I did some tests about emulated pmem via memmap=, kdump kernel hangs or
just reboots early during compressing kernel, no clue how to handle it.
Since KASLR is pointless for the kdump kernel, a workaround is to use "nokaslr".

In Fedora or RHEL, just add "nokaslr" in KDUMP_COMMANDLINE_APPEND
in /etc/sysconfig/kdump 

Can you try if this works?

Thanks
Dave

Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-07 Thread Dave Young

On 04/07/17 at 04:28am, Mimi Zohar wrote:
> On Fri, 2017-04-07 at 15:41 +0800, Dave Young wrote:
> > On 04/07/17 at 08:07am, David Howells wrote:
> > > Dave Young  wrote:
> > > 
> > > > > > > + /* Don't permit images to be loaded into trusted kernels if we're not
> > > > > > > +  * going to verify the signature on them
> > > > > > > +  */
> > > > > > > + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> > > > > > > + return -EPERM;
> > > > > > > +
> > > > > > >  
> > > > > 
> > > > > IMA can be used to verify file signatures too, based on the LSM hooks
> > > > > in  kernel_read_file_from_fd().  CONFIG_KEXEC_VERIFY_SIG should not be
> > > > > required.
> > > > 
> > > > Mimi, I remember we talked somthing before about the two signature 
> > > > verification. One can change IMA policy in initramfs userspace,
> > > > also there are kernel cmdline param to disable IMA, so it can break the
> > > > lockdown? Suppose kexec boot with ima disabled cmdline param and then
> > > > kexec reboot again..
> > > 
> > > I guess I should lock down the parameter to disable IMA too.
> > 
> > That is one thing, user can change IMA policy in initramfs userspace,
> > I'm not sure if IMA enforce the signed policy now, if no it will be also
> > a problem.
> 
> I'm not sure how this relates to the question of whether IMA verifies
> the kexec kernel image signature, as the test would not be based on a
> Kconfig option, but on a runtime variable.

I assumed one can change the policy to avoid the kexec and initramfs checks,
and that we would use a global IMA status in the -EPERM check for lockdown.
But if there is some fine-grained checking that ensures kernel signature
verification, it should be fine.
> 
> To answer your question, the rule for requiring the policy to be
> signed is:  appraise func=POLICY_CHECK appraise_type=imasig
> 
> When the ability to append rules is Kconfig enabled, the builtin
> policy requires the new policy or additional rules to be signed.
>  Unfortunately, always requiring the policy to be signed, would have
> broken userspace.
> 
> Mimi
> 

Thanks
Dave

Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-07 Thread Dave Young

On 04/07/17 at 03:45am, Mimi Zohar wrote:
> On Fri, 2017-04-07 at 14:19 +0800, Dave Young wrote:
> > On 04/06/17 at 11:49pm, Mimi Zohar wrote:
> > > On Fri, 2017-04-07 at 11:05 +0800, Dave Young wrote:
> > > > On 04/05/17 at 09:15pm, David Howells wrote:
> > > > > From: Chun-Yi Lee 
> > > > > 
> > > > > When KEXEC_VERIFY_SIG is not enabled, kernel should not loads image
> > > > > through kexec_file systemcall if securelevel has been set.
> > > > > 
> > > > > This code was showed in Matthew's patch but not in git:
> > > > > https://lkml.org/lkml/2015/3/13/778
> 
> I specifically checked to make sure that either kexec_file() signature
> verification was acceptable and would have commented then, if it had
> not been included.
> 
> > > > > Cc: Matthew Garrett 
> > > > > Signed-off-by: Chun-Yi Lee 
> > > > > Signed-off-by: David Howells 
> > > > > cc: ke...@lists.infradead.org
> > > > > ---
> > > > > 
> > > > >  kernel/kexec_file.c |6 ++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> > > > > index b118735fea9d..f6937eecd1eb 100644
> > > > > --- a/kernel/kexec_file.c
> > > > > +++ b/kernel/kexec_file.c
> > > > > @@ -268,6 +268,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, 
> > > > > int, initrd_fd,
> > > > >   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
> > > > >   return -EPERM;
> > > > >  
> > > > > > + /* Don't permit images to be loaded into trusted kernels if we're not
> > > > > > +  * going to verify the signature on them
> > > > > > +  */
> > > > > > + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> > > > > > + return -EPERM;
> > > > > +
> > > > >  
> > > 
> > > IMA can be used to verify file signatures too, based on the LSM hooks
> > > in  kernel_read_file_from_fd().  CONFIG_KEXEC_VERIFY_SIG should not be
> > > required.
> > 
> > Mimi, I remember we talked somthing before about the two signature 
> > verification. One can change IMA policy in initramfs userspace,
> > also there are kernel cmdline param to disable IMA, so it can break the
> > lockdown? Suppose kexec boot with ima disabled cmdline param and then
> > kexec reboot again..
> 
> Right, we discussed that the same method of measuring the kexec image
> and initramfs, for extending trusted boot to the OS, could also be
> used for verifying the kexec image and initramfs signatures, for
> extending secure boot to the OS.  The file hash would be calculated
> once for both.
> 
> All of your concerns could be addressed with very minor changes to
> IMA.  (Continued in response to David.)

Thanks! As long as IMA can ensure that the lockdown is not broken, it should
be fine to add a check for either !CONFIG_KEXEC_VERIFY_SIG or !IMA
enforced.
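
One possible reading of the check discussed here, as a minimal sketch: refuse
kexec_file_load on a locked-down kernel only when neither signature mechanism
is enforcing. The function name and its boolean parameters are hypothetical
stand-ins for kernel_is_locked_down(), CONFIG_KEXEC_VERIFY_SIG and whatever
IMA enforcement state ends up being exposed; this is not the code from the
patch.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a kernel API.
 * locked_down:   stands in for kernel_is_locked_down()
 * verify_sig:    stands in for IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG)
 * ima_enforced:  assumed flag meaning "IMA appraisal of the kexec image
 *                is enforced and cannot be switched off"
 */
static inline int kexec_file_lockdown_check(bool locked_down,
                                            bool verify_sig,
                                            bool ima_enforced)
{
    /* Only an unverified image on a locked-down kernel is refused. */
    if (locked_down && !verify_sig && !ima_enforced)
        return -EPERM;

    return 0;
}
```

With this shape, either verification path alone is enough to let the load
proceed, which is why the IMA state must not be changeable from the cmdline or
from initramfs userspace.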

> 
> > > 
> > > > /* Make sure we have a legal set of flags */
> > > > >   if (flags != (flags & KEXEC_FILE_FLAGS))
> > > > >   return -EINVAL;
> > > > > 
> > > > > 
> > > > > ___
> > > > > kexec mailing list
> > > > > ke...@lists.infradead.org
> > > > > http://lists.infradead.org/mailman/listinfo/kexec
> > > > 
> > > > Acked-by: Dave Young 
> > > > 
> > > > Thanks
> > > > Dave
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe 
> > > > linux-security-module" in
> > > > the body of a message to majord...@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > 
> > 
>

Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-07 Thread Dave Young

On 04/07/17 at 08:07am, David Howells wrote:
> Dave Young  wrote:
> 
> > > > > > + /* Don't permit images to be loaded into trusted kernels if we're not
> > > > > > +  * going to verify the signature on them
> > > > > > +  */
> > > > > > + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> > > > > > + return -EPERM;
> > > > > +
> > > > >  
> > > 
> > > IMA can be used to verify file signatures too, based on the LSM hooks
> > > in  kernel_read_file_from_fd().  CONFIG_KEXEC_VERIFY_SIG should not be
> > > required.
> > 
> > Mimi, I remember we talked somthing before about the two signature 
> > verification. One can change IMA policy in initramfs userspace,
> > also there are kernel cmdline param to disable IMA, so it can break the
> > lockdown? Suppose kexec boot with ima disabled cmdline param and then
> > kexec reboot again..
> 
> I guess I should lock down the parameter to disable IMA too.

That is one thing. A user can also change the IMA policy in initramfs
userspace; I'm not sure whether IMA enforces the signed policy now. If not,
that will also be a problem.

Thanks
Dave

Re: [PATCH 17/24] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down

2017-04-07 Thread Dave Young

On 04/07/17 at 08:05am, David Howells wrote:
> Dave Young  wrote:
> 
> > > > This option allows userspace to pass the RSDP address to the kernel, 
> > > > which
> > > > makes it possible for a user to circumvent any restrictions imposed on
> > > > loading modules.  Ignore the option when the kernel is locked down.
> > > 
> > > I'm not really sure here.
> > > 
> > > What exactly is the mechanism?
> > 
> > Actually this acpi_rsdp param is created for EFI kexec reboot in old
> > days when we had not supported persistent efi vm space across kexec
> > reboot. At that time kexec reboot runs as noefi mode, it can not find
> > the acpi root table thus kernel will hang early.
> > 
> > Now kexec can support EFI boot so this param is not necessary for most
> > user unless they still use efi=old_map.
> 
> Is this patch now unnecessary?

I think it is still necessary because the acpi_rsdp kernel param is still
a valid parameter and one can still pass a pointer to be recognized as the
ACPI root pointer.

Maybe "imposed on loading modules" is not clear and can be dropped.
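
To illustrate the idea of ignoring the param under lockdown, here is a rough
userspace sketch. It is not the patch itself: setup_acpi_rsdp_sketch() and
acpi_rsdp_override are hypothetical names, and the locked_down flag stands in
for kernel_is_locked_down().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical storage for the user-supplied RSDP address; 0 means unset. */
static unsigned long acpi_rsdp_override;

/*
 * Hypothetical parser for "acpi_rsdp=<hex address>".
 * When locked_down is true the value is silently ignored, so firmware
 * discovery of the ACPI root pointer is the only path left.
 */
static inline int setup_acpi_rsdp_sketch(const char *arg, bool locked_down)
{
    assert(arg != NULL);

    if (locked_down)
        return 0;               /* ignore the user-supplied pointer */

    acpi_rsdp_override = strtoul(arg, NULL, 16);
    return 0;
}
```

Calling it with locked_down set leaves acpi_rsdp_override untouched; without
lockdown the hex string is parsed and stored.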

Thanks
Dave

Re: [PATCH 17/24] acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down

2017-04-06 Thread Dave Young

On 04/06/17 at 09:43pm, Rafael J. Wysocki wrote:
> On Wed, Apr 5, 2017 at 10:16 PM, David Howells  wrote:
> > From: Josh Boyer 
> >
> > This option allows userspace to pass the RSDP address to the kernel, which
> > makes it possible for a user to circumvent any restrictions imposed on
> > loading modules.  Ignore the option when the kernel is locked down.
> 
> I'm not really sure here.
> 
> What exactly is the mechanism?

Actually this acpi_rsdp param was created for EFI kexec reboot in the old
days, before we supported a persistent EFI VM space across kexec reboot.
At that time a kexec reboot ran in noefi mode, so it could not find the
ACPI root table and the kernel would hang early.

Now kexec supports EFI boot, so this param is not necessary for most
users unless they still use efi=old_map.

> 
> Thanks,
> Rafael
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-06 Thread Dave Young

On 04/06/17 at 11:49pm, Mimi Zohar wrote:
> On Fri, 2017-04-07 at 11:05 +0800, Dave Young wrote:
> > On 04/05/17 at 09:15pm, David Howells wrote:
> > > From: Chun-Yi Lee 
> > > 
> > > When KEXEC_VERIFY_SIG is not enabled, kernel should not loads image
> > > through kexec_file systemcall if securelevel has been set.
> > > 
> > > This code was showed in Matthew's patch but not in git:
> > > https://lkml.org/lkml/2015/3/13/778
> > > 
> > > Cc: Matthew Garrett 
> > > Signed-off-by: Chun-Yi Lee 
> > > Signed-off-by: David Howells 
> > > cc: ke...@lists.infradead.org
> > > ---
> > > 
> > >  kernel/kexec_file.c |6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> > > index b118735fea9d..f6937eecd1eb 100644
> > > --- a/kernel/kexec_file.c
> > > +++ b/kernel/kexec_file.c
> > > @@ -268,6 +268,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, 
> > > int, initrd_fd,
> > >   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
> > >   return -EPERM;
> > >  
> > > + /* Don't permit images to be loaded into trusted kernels if we're not
> > > +  * going to verify the signature on them
> > > +  */
> > > + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> > > + return -EPERM;
> > > +
> > >  
> 
> IMA can be used to verify file signatures too, based on the LSM hooks
> in  kernel_read_file_from_fd().  CONFIG_KEXEC_VERIFY_SIG should not be
> required.

Mimi, I remember we talked about the two signature verifications before.
One can change the IMA policy in initramfs userspace, and there is also a
kernel cmdline param to disable IMA, so can't that break the lockdown?
Suppose a kexec boot with the IMA-disabling cmdline param, followed by
another kexec reboot.

> 
> Mimi
> 
> 
> > /* Make sure we have a legal set of flags */
> > >   if (flags != (flags & KEXEC_FILE_FLAGS))
> > >   return -EINVAL;
> > > 
> > > 
> > > ___
> > > kexec mailing list
> > > ke...@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/kexec
> > 
> > Acked-by: Dave Young 
> > 
> > Thanks
> > Dave
> > --
> > To unsubscribe from this list: send the line "unsubscribe 
> > linux-security-module" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
>

Re: [PATCH 07/24] kexec: Disable at runtime if the kernel is locked down

2017-04-06 Thread Dave Young

On 04/05/17 at 09:15pm, David Howells wrote:
> From: Matthew Garrett 
> 
> kexec permits the loading and execution of arbitrary code in ring 0, which
> is something that lock-down is meant to prevent. It makes sense to disable
> kexec in this situation.
> 
> This does not affect kexec_file_load() which can check for a signature on the
> image to be booted.
> 
> Signed-off-by: Matthew Garrett 
> Signed-off-by: David Howells 
> cc: ke...@lists.infradead.org
> ---
> 
>  kernel/kexec.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 980936a90ee6..46de8e6b42f4 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -194,6 +194,13 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   return -EPERM;
>  
>   /*
> +  * kexec can be used to circumvent module loading restrictions, so
> +  * prevent loading in that case
> +  */
> + if (kernel_is_locked_down())
> + return -EPERM;
> +
> + /*
>* Verify we have a legal set of flags
>* This leaves us room for future extensions.
>*/
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Acked-by: Dave Young 

Thanks
Dave

Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-06 Thread Dave Young

On 04/05/17 at 09:15pm, David Howells wrote:
> From: Chun-Yi Lee 
> 
> When KEXEC_VERIFY_SIG is not enabled, kernel should not loads image
> through kexec_file systemcall if securelevel has been set.
> 
> This code was showed in Matthew's patch but not in git:
> https://lkml.org/lkml/2015/3/13/778
> 
> Cc: Matthew Garrett 
> Signed-off-by: Chun-Yi Lee 
> Signed-off-by: David Howells 
> cc: ke...@lists.infradead.org
> ---
> 
>  kernel/kexec_file.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b118735fea9d..f6937eecd1eb 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -268,6 +268,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
> initrd_fd,
>   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
>   return -EPERM;
>  
> + /* Don't permit images to be loaded into trusted kernels if we're not
> +  * going to verify the signature on them
> +  */
> + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> + return -EPERM;
> +
>   /* Make sure we have a legal set of flags */
>   if (flags != (flags & KEXEC_FILE_FLAGS))
>   return -EINVAL;
> 
> 
> _______
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Acked-by: Dave Young 

Thanks
Dave

Re: kexec regression since 4.9 caused by efi

2017-04-04 Thread Dave Young

On 04/04/17 at 02:37pm, Matt Fleming wrote:
> On Mon, 20 Mar, at 10:14:12AM, Dave Young wrote:
> > 
> > Matt, I'm fine if you prefer to capture the range checking errors.
> > Would you like me to post it or just you send it out?
> 
> Can you please send out the patch with the minimal change to
> efi_arch_mem_reserve() and we'll get it into urgent ASAP.

Omar has sent it out. As for the lookup function issue, I think I can handle
it later, after this one.

Thanks
Dave

Re: [PATCH v2] x86/mm/KASLR: EFI region is mistakenly included into KASLR VA space for randomization

2017-03-24 Thread Dave Young

On 03/24/17 at 09:08am, Ingo Molnar wrote:
> 
> * Baoquan He  wrote:
> 
> > Currently KASLR is enabled on three regions: the direct mapping of physical
> > memory, vmalloc and vmemmap. However EFI region is also mistakenly included
> > for VA space randomization because of misusing EFI_VA_START macro and
> > assuming EFI_VA_START < EFI_VA_END.
> > 
> > The EFI region is reserved for EFI runtime services virtual mapping which
> > should not be included in kaslr ranges. In Documentation/x86/x86_64/mm.txt,
> > we can see:
> >   ffef - fffe (=64 GB) EFI region mapping space
> > EFI use the space from -4G to -64G thus EFI_VA_START > EFI_VA_END,
> > Here EFI_VA_START = -4G, and EFI_VA_END = -64G.
> > 
> > Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem.
> > 
> > Cc:  #4.8+
> > Signed-off-by: Baoquan He 
> > Acked-by: Dave Young 
> > Reviewed-by: Bhupesh Sharma 
> > Acked-by: Thomas Garnier 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Cc: x...@kernel.org
> > Cc: linux-...@vger.kernel.org
> > Cc: Thomas Garnier 
> > Cc: Kees Cook 
> > Cc: Borislav Petkov 
> > Cc: Andrew Morton 
> > Cc: Masahiro Yamada 
> > Cc: Dave Young 
> > Cc: Bhupesh Sharma 
> 
> So I applied this kexec fix and extended the changelog to clearly show why this
> fix matters in practice.
> 
> Also, to make sure I understood it correctly: these addresses are all dynamic on
> 64-bit kernels, i.e. we are establishing and then tearing down these page tables
> around EFI calls, and they are 'normally' not present at all, right?

Ingo, if I understand the question right and "these addresses" means the EFI
VA addresses, then that is right: EFI switches to its own page tables, so they
are not present in the kernel page tables.

> 
> Thanks,
> 
>   Ingo

Re: [PATCH v2] x86/mm/KASLR: EFI region is mistakenly included into KASLR VA space for randomization

2017-03-24 Thread Dave Young

On 03/24/17 at 04:34pm, Baoquan He wrote:
> On 03/24/17 at 09:08am, Ingo Molnar wrote:
> > > Cc:  #4.8+
> > > Signed-off-by: Baoquan He 
> > > Acked-by: Dave Young 
> > > Reviewed-by: Bhupesh Sharma 
> > > Acked-by: Thomas Garnier 
> > > Cc: Thomas Gleixner 
> > > Cc: Ingo Molnar 
> > > Cc: "H. Peter Anvin" 
> > > Cc: x...@kernel.org
> > > Cc: linux-...@vger.kernel.org
> > > Cc: Thomas Garnier 
> > > Cc: Kees Cook 
> > > Cc: Borislav Petkov 
> > > Cc: Andrew Morton 
> > > Cc: Masahiro Yamada 
> > > Cc: Dave Young 
> > > Cc: Bhupesh Sharma 
> > 
> > So I applied this kexec fix and extended the changelog to clearly show why this
> > fix matters in practice.
> 
> I thought it only impacts kexec, but Dave thought it will impact the 1st
> kernel too.

Yes, I think there is no need to mention kexec; it is a general issue.

First, the space is reserved for EFI, so the kernel should not use it for
KASLR. Second, the EFI page tables sync the low kernel page tables into
their own copy; if others use this space for non-EFI purposes, those parts
will be missing.

In arch/x86/platform/efi/efi_64.c, efi_sync_low_kernel_mappings() syncs the
kernel page tables to EFI's. From the function comment:
Add low kernel mappings for passing arguments to EFI functions.

Suppose some arguments use a KASLR-randomized address that falls within the
EFI range; then we will hit bugs. We have not seen actual bug reports in the
real world yet, though. This was found during patch review.

Anyway, since this area is reserved for EFI, there is no reason to add it to
the KASLR pool.
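
To make the range argument concrete, here is a small sketch using the values
quoted in the changelog (EFI_VA_START = -4G, EFI_VA_END = -64G). The macros
are redefined locally for illustration; VADDR_START, overlaps_efi_region() and
kaslr_vaddr_end() are hypothetical names, and the start address is only an
assumed example of the direct-mapping base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIB           (UINT64_C(1) << 30)
#define EFI_VA_START  (UINT64_C(0) - 4 * GIB)    /* -4G: top of the EFI region */
#define EFI_VA_END    (UINT64_C(0) - 64 * GIB)   /* -64G: bottom of the EFI region */

/* Assumed example value for the start of the randomized range. */
#define VADDR_START   UINT64_C(0xffff880000000000)

/* The EFI region grows downward, so START is the higher address. */
static_assert(EFI_VA_START > EFI_VA_END, "EFI region grows downward");

/* Does [lo, hi) intersect the EFI region [EFI_VA_END, EFI_VA_START)? */
static inline bool overlaps_efi_region(uint64_t lo, uint64_t hi)
{
    return lo < EFI_VA_START && hi > EFI_VA_END;
}

/* Upper bound of the KASLR range before (false) and after (true) the fix. */
static inline uint64_t kaslr_vaddr_end(bool fixed)
{
    return fixed ? EFI_VA_END : EFI_VA_START;
}
```

With vaddr_end = EFI_VA_START the randomized range reaches all the way to -4G
and so covers the EFI region; with vaddr_end = EFI_VA_END it stops just below
it.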

> > 
> > Also, to make sure I understood it correctly: these addresses are all dynamic on
> > 64-bit kernels, i.e. we are establishing and then tearing down these page tables
> > around EFI calls, and they are 'normally' not present at all, right?
> 
> The EFI region is reserved for EFI runtime services virtual mapping, the
> original purpose is to preserve this region so that they can be reused
> by kexec-ed kernel. This was introduced by Boris in commit d2f7cbe7b26a7
> ("x86/efi: Runtime services virtual mapping").
> 
> So it will be establishing and stay there. According to Dave's telling,
> efi will still fetch efi variables or time/date by runtime service by
> switching the efi pgd and entering into efi mode. And then switch back
> to normal OS. Not sure if I am right for efi part in 1st kernel. For 2nd
> kernel, if want to reuse the them, the efi region has to be kept.
> 
> Thanks
> Baoquan

Thanks
Dave

Re: [PATCH v1 RESEND 1/2] x86/mm/KASLR: EFI region is mistakenly included into KASLR VA space for randomization

2017-03-23 Thread Dave Young

This should also cc linux-efi

On 03/24/17 at 10:29am, Dave Young wrote:
> Hi, Baoquan
> 
> On 03/23/17 at 11:27am, Baoquan He wrote:
> > Currently KASLR is enabled on three regions: the direct mapping of physical
> > memory, vmalloc and vmemmap. However EFI region is also mistakenly included
> > for VA space randomization because of misusing EFI_VA_START macro and
> > assuming EFI_VA_START < EFI_VA_END.
> > 
> > The EFI region is reserved for EFI runtime services virtual mapping which
> > should not be included in kaslr ranges. It will be re-used by kexec/kdump
> > kernel, the mistake may cause failure when jump to kexec/kdump kernel if
> > vmemmap allocation stomps on the allocated efi mapping region.
> 
> No need to mention kexec/kdump in changelog although it is true that
> kexec kernel will use the persistent efi runtime mapping. The main point
> is it is wrong to use the reserved vm space for efi.

To explain more about this:

It is a general issue rather than a kexec/kdump issue, and it is a real
bug. Although EFI has its own page tables, it still syncs the kernel
page tables along with the mapping of the EFI reserved area. So if vmalloc
etc. use the VM space of the EFI reserved area, some of those mappings will
be missed when EFI syncs the low kernel page tables.

> 
> Also I think this patch can be sent as a standalone patch, no need to be
> a patch series. For the second patch I think it depends on efi
> maintainer's opinion, personally I think only this simple fix for kaslr only
> will be better.
> 
> > 
> > In Documentation/x86/x86_64/mm.txt, we can see:
> >   ffef - fffe (=64 GB) EFI region mapping space
> > EFI use the space from -4G to -64G thus EFI_VA_START > EFI_VA_END
> > Here EFI_VA_START = -4G, and EFI_VA_END = -64G
> > 
> > Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem.
> > 
> > Cc:  #4.8+
> > Signed-off-by: Baoquan He 
> > Acked-by: Dave Young 
> > Reviewed-by: Bhupesh Sharma 
> > Acked-by: Thomas Garnier 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin"  
> > Cc: x...@kernel.org
> > Cc: Thomas Garnier 
> > Cc: Kees Cook 
> > Cc: Borislav Petkov 
> > Cc: Andrew Morton 
> > Cc: Masahiro Yamada 
> > ---
> >  arch/x86/mm/kaslr.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> > index 887e571..aed2064 100644
> > --- a/arch/x86/mm/kaslr.c
> > +++ b/arch/x86/mm/kaslr.c
> > @@ -48,7 +48,7 @@ static const unsigned long vaddr_start = 
> > __PAGE_OFFSET_BASE;
> >  #if defined(CONFIG_X86_ESPFIX64)
> >  static const unsigned long vaddr_end = ESPFIX_BASE_ADDR;
> >  #elif defined(CONFIG_EFI)
> > -static const unsigned long vaddr_end = EFI_VA_START;
> > +static const unsigned long vaddr_end = EFI_VA_END;
> >  #else
> >  static const unsigned long vaddr_end = __START_KERNEL_map;
> >  #endif
> > @@ -105,7 +105,7 @@ void __init kernel_randomize_memory(void)
> >  */
> > BUILD_BUG_ON(vaddr_start >= vaddr_end);
> > BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_ESPFIX64) &&
> > -vaddr_end >= EFI_VA_START);
> > +vaddr_end >= EFI_VA_END);
> > BUILD_BUG_ON((IS_ENABLED(CONFIG_X86_ESPFIX64) ||
> >   IS_ENABLED(CONFIG_EFI)) &&
> >  vaddr_end >= __START_KERNEL_map);
> > -- 
> > 2.5.5
> > 
> 
> Thanks
> Dave

Re: [PATCH v1 RESEND 1/2] x86/mm/KASLR: EFI region is mistakenly included into KASLR VA space for randomization

2017-03-23 Thread Dave Young

Hi, Baoquan

On 03/23/17 at 11:27am, Baoquan He wrote:
> Currently KASLR is enabled on three regions: the direct mapping of physical
> memory, vmalloc and vmemmap. However EFI region is also mistakenly included
> for VA space randomization because of misusing EFI_VA_START macro and
> assuming EFI_VA_START < EFI_VA_END.
> 
> The EFI region is reserved for EFI runtime services virtual mapping which
> should not be included in kaslr ranges. It will be re-used by kexec/kdump
> kernel, the mistake may cause failure when jump to kexec/kdump kernel if
> vmemmap allocation stomps on the allocated efi mapping region.

No need to mention kexec/kdump in the changelog, although it is true that
the kexec kernel will use the persistent EFI runtime mapping. The main point
is that it is wrong to use the VM space reserved for EFI.

Also, I think this patch can be sent as a standalone patch; there is no need
for a patch series. The second patch depends on the EFI maintainer's opinion;
personally I think this simple KASLR-only fix will be better.

> 
> In Documentation/x86/x86_64/mm.txt, we can see:
>   ffef - fffe (=64 GB) EFI region mapping space
> EFI use the space from -4G to -64G thus EFI_VA_START > EFI_VA_END
> Here EFI_VA_START = -4G, and EFI_VA_END = -64G
> 
> Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem.
> 
> Cc:  #4.8+
> Signed-off-by: Baoquan He 
> Acked-by: Dave Young 
> Reviewed-by: Bhupesh Sharma 
> Acked-by: Thomas Garnier 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin"  
> Cc: x...@kernel.org
> Cc: Thomas Garnier 
> Cc: Kees Cook 
> Cc: Borislav Petkov 
> Cc: Andrew Morton 
> Cc: Masahiro Yamada 
> ---
>  arch/x86/mm/kaslr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 887e571..aed2064 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -48,7 +48,7 @@ static const unsigned long vaddr_start = __PAGE_OFFSET_BASE;
>  #if defined(CONFIG_X86_ESPFIX64)
>  static const unsigned long vaddr_end = ESPFIX_BASE_ADDR;
>  #elif defined(CONFIG_EFI)
> -static const unsigned long vaddr_end = EFI_VA_START;
> +static const unsigned long vaddr_end = EFI_VA_END;
>  #else
>  static const unsigned long vaddr_end = __START_KERNEL_map;
>  #endif
> @@ -105,7 +105,7 @@ void __init kernel_randomize_memory(void)
>*/
>   BUILD_BUG_ON(vaddr_start >= vaddr_end);
>   BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_ESPFIX64) &&
> -  vaddr_end >= EFI_VA_START);
> +  vaddr_end >= EFI_VA_END);
>   BUILD_BUG_ON((IS_ENABLED(CONFIG_X86_ESPFIX64) ||
> IS_ENABLED(CONFIG_EFI)) &&
>vaddr_end >= __START_KERNEL_map);
> -- 
> 2.5.5
> 

Thanks
Dave

Re: kexec regression since 4.9 caused by efi

2017-03-22 Thread Dave Young

On 03/22/17 at 04:10pm, Ard Biesheuvel wrote:
> On 21 March 2017 at 07:48, Dave Young  wrote:
> > On 03/20/17 at 10:14am, Dave Young wrote:
> >> On 03/17/17 at 01:32pm, Matt Fleming wrote:
> >> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> >> > >
> >> > > Matt, I think it should be fine although I think the md type checking 
> >> > > in
> >> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
> >> >
> >> > Could you make that a separate patch if you think of improvements
> >> > there?
> >>
> >> Duplicate the lookup function is indeed a little ugly, will do it when I
> >> have a better idea, we can leave it as is since it works.
> >
> > Matt, rethinking this, how about doing something like the below? It is not
> > tested; I am just seeking ideas and opinions. This way there is no need to
> > duplicate a function, but it assumes that there are no overlapping mem
> > ranges in the efi memmap.
> >
> > Also the case Omar reported is the esrt memory range type is
> > RUNTIME_DATA, that is a little different with the mem attribute of
> > RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
> > attribute, like bgrt in kexec reboot. Should we distinguish the two
> > cases and give out some warnings or debug info?
> >
> >
> > ---
> >  arch/x86/platform/efi/quirks.c |5 +
> >  drivers/firmware/efi/efi.c |6 --
> >  drivers/firmware/efi/esrt.c|7 +++
> >  3 files changed, 12 insertions(+), 6 deletions(-)
> >
> > --- linux-x86.orig/drivers/firmware/efi/efi.c
> > +++ linux-x86/drivers/firmware/efi/efi.c
> > @@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
> > u64 size;
> > u64 end;
> >
> > -   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > -   md->type != EFI_BOOT_SERVICES_DATA &&
> > -   md->type != EFI_RUNTIME_SERVICES_DATA) {
> > -   continue;
> > -   }
> > -
> > size = md->num_pages << EFI_PAGE_SHIFT;
> > end = md->phys_addr + size;
> > if (phys_addr >= md->phys_addr && phys_addr < end) {
> > --- linux-x86.orig/drivers/firmware/efi/esrt.c
> > +++ linux-x86/drivers/firmware/efi/esrt.c
> > @@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
> > return;
> > }
> >
> > +   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > + md->type != EFI_BOOT_SERVICES_DATA &&
> > + md->type != EFI_RUNTIME_SERVICES_DATA) {
> > +   pr_err("ESRT header memory map type is invalid\n");
> > +   return;
> > +   }
> > +
> 
> This looks wrong to me. While the meanings get convoluted in practice,
> the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
> a virtual mapping for the region. It is perfectly legal for a
> EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
> attribute, if the region is never accessed by the runtime services
> themselves, and this is not entirely unlikely for tables that the
> firmware exposes to the OS

Thanks for the comment. If so, the "!(md->attribute & EFI_MEMORY_RUNTIME) &&"
part should be dropped.

BTW, md->type should be md.type. With that change bgrt reserving works fine,
but I have no esrt machine to test on. I would like to wait for Matt's opinion
before sending an update.

Also cc Peter about the esrt piece.
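
To make the point about dropping the attribute term concrete, here is a
minimal user-space sketch of an ESRT check that looks only at the descriptor
type. The struct and the helper name esrt_md_type_ok() are made up for
illustration and are not kernel code; the type numbers and the
EFI_MEMORY_RUNTIME bit follow the UEFI spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined by the UEFI spec. */
#define EFI_BOOT_SERVICES_DATA     4
#define EFI_RUNTIME_SERVICES_DATA  6
#define EFI_CONVENTIONAL_MEMORY    7
#define EFI_MEMORY_RUNTIME         0x8000000000000000ULL

/* Simplified stand-in for efi_memory_desc_t (hypothetical). */
struct esrt_md {
	uint32_t type;
	uint64_t attribute;
};

/*
 * Hypothetical helper: accept the ESRT region based on its type alone.
 * The attribute field is deliberately not consulted.
 */
bool esrt_md_type_ok(const struct esrt_md *md)
{
	return md->type == EFI_BOOT_SERVICES_DATA ||
	       md->type == EFI_RUNTIME_SERVICES_DATA;
}
```

With this shape a RUNTIME_SERVICES_DATA region that lacks EFI_MEMORY_RUNTIME
is still accepted, which is the case Ard describes, while a region of another
type is rejected even if it carries the attribute.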
> 
> > max = efi_mem_desc_end(&md);
> > if (max < efi.esrt) {
> > > pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > +   if (md->attribute & EFI_MEMORY_RUNTIME ||
> > + md->type != EFI_BOOT_SERVICES_DATA) {
> > +   return;
> > +   }
> > +
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
> >
> >>
> >> >
> >> > > How about move the if chunk early like below because it seems no need
> >> > > to sanity check the addr + size any more if the md is still RUNTIME?
> >> >
> >> > My original version did as you suggest, but I changed it because we
> >> > *really* want to know if someone tries to reserve a range that spans
> >> > regions. That would be totally unexpected and a warning about a
> >> > potential bug/issue.
> >>
> >> Matt, I'm fine if you prefer to capture the range checking errors.
> >> Would you like me to post it or just you send it out?
> >>
> >> Thanks
> >> Dave
> >
> > Thanks
> > Dave

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-21 Thread Dave Young

On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
> Dave Young  writes:
> 
> > On 03/20/17 at 10:33pm, Eric W. Biederman wrote:
> >> Xunlei Pang  writes:
> >> 
> >> > As Eric said,
> >> > "what we need to do is move the variable vmcoreinfo_note out
> >> > of the kernel's .bss section.  And modify the code to regenerate
> >> > and keep this information in something like the control page.
> >> >
> >> > Definitely something like this needs a page all to itself, and ideally
> >> > far away from any other kernel data structures.  I clearly was not
> >> > watching closely the data someone decided to keep this silly thing
> >> > in the kernel's .bss section."
> >> >
> >> > This patch allocates extra pages for these vmcoreinfo_XXX variables,
> >> > one advantage is that it enhances some safety of vmcoreinfo, because
> >> > vmcoreinfo now is kept far away from other kernel data structures.
> >> 
> >> Can you precede this patch with a patch that removes CRASHTIME from
> >> vmcoreinfo?  If someone actually cares we can add a separate note that
> >> holds a 64bit crashtime in the per cpu notes.
> >
> > I think makedumpfile is using it, but I also vote to remove the
> > CRASHTIME. It is better not to do this while crashing and a makedumpfile
> > userspace patch is needed to drop the use of it.
> >
> >> 
> >> As we are looking at reliability concerns removing CRASHTIME should make
> >> everything in vmcoreinfo a boot time constant.  Which should simplify
> >> everything considerably.
> >
> > It is a nice improvement..
> 
> We also need to take a close look at what s390 is doing with vmcoreinfo.
> As apparently it is reading it in a different kind of crashdump process.

Yes, this needs careful review for s390 and maybe ppc64, especially patch
2/3. It would be better to have comments from IBM about the s390 dump tool
and ppc fadump. Added more cc.

Thanks
Dave
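
As a rough illustration of the idea under discussion (vmcoreinfo data living
in its own allocation instead of a static array, and filled in only at init
time once CRASHTIME is gone), here is a user-space sketch. It is not the
kernel code: the vmci_* names are invented, calloc() stands in for
get_zeroed_page(), and the 4096 byte size is an assumption taken from the
"one page" comment in the patch.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VMCI_BYTES 4096	/* assumed: one page */

static unsigned char *vmci_data;	/* allocated at init, not a static array */
static size_t vmci_size;

/* Allocate the zeroed buffer once. Returns 0 on success, -1 on failure. */
int vmci_init(void)
{
	vmci_data = calloc(1, VMCI_BYTES);
	if (!vmci_data)
		return -1;
	vmci_size = 0;
	return 0;
}

/* Append a formatted string, truncating at the end of the buffer. */
void vmci_append_str(const char *fmt, ...)
{
	char buf[256];
	va_list args;
	size_t r;
	int n;

	if (!vmci_data)		/* allocation failed or init not run */
		return;

	va_start(args, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);
	if (n < 0)
		return;

	r = (size_t)n;
	if (r >= sizeof(buf))	/* output was truncated by vsnprintf */
		r = sizeof(buf) - 1;
	if (r > VMCI_BYTES - vmci_size)
		r = VMCI_BYTES - vmci_size;

	memcpy(vmci_data + vmci_size, buf, r);
	vmci_size += r;
}

size_t vmci_len(void)
{
	return vmci_size;
}

const unsigned char *vmci_buf(void)
{
	return vmci_data;
}
```

If every vmci_append_str() call happens at init, the buffer is a boot time
constant and nothing needs to touch it on the crash path.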

Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-21 Thread Dave Young

On 03/20/17 at 10:33pm, Eric W. Biederman wrote:
> Xunlei Pang  writes:
> 
> > As Eric said,
> > "what we need to do is move the variable vmcoreinfo_note out
> > of the kernel's .bss section.  And modify the code to regenerate
> > and keep this information in something like the control page.
> >
> > Definitely something like this needs a page all to itself, and ideally
> > far away from any other kernel data structures.  I clearly was not
> > watching closely the data someone decided to keep this silly thing
> > in the kernel's .bss section."
> >
> > This patch allocates extra pages for these vmcoreinfo_XXX variables,
> > one advantage is that it enhances some safety of vmcoreinfo, because
> > vmcoreinfo now is kept far away from other kernel data structures.
> 
> Can you precede this patch with a patch that removes CRASHTIME from
> vmcoreinfo?  If someone actually cares we can add a separate note that holds
> a 64bit crashtime in the per cpu notes.  

I think makedumpfile is using it, but I also vote to remove CRASHTIME. It is
better not to do this work while crashing; a makedumpfile userspace patch
will be needed to drop its use.

> 
> As we are looking at reliability concerns removing CRASHTIME should make
> everything in vmcoreinfo a boot time constant.  Which should simplify
> everything considerably.

It is a nice improvement..

> 
> Which means we only need to worry about the per-cpu notes being written
> at the time of a crash.
> 
> > Suggested-by: Eric Biederman 
> > Signed-off-by: Xunlei Pang 
> > ---
> >  arch/ia64/kernel/machine_kexec.c |  5 -
> >  arch/x86/kernel/crash.c  |  2 +-
> >  include/linux/kexec.h|  2 +-
> >  kernel/kexec_core.c  | 29 -
> >  kernel/ksysfs.c  |  2 +-
> >  5 files changed, 27 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/ia64/kernel/machine_kexec.c b/arch/ia64/kernel/machine_kexec.c
> > index 599507b..c14815d 100644
> > --- a/arch/ia64/kernel/machine_kexec.c
> > +++ b/arch/ia64/kernel/machine_kexec.c
> > @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
> >  #endif
> >  }
> >  
> > -phys_addr_t paddr_vmcoreinfo_note(void)
> > -{
> > -   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
> > -}
> > -
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 3741461..4d35fbb 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -456,7 +456,7 @@ static int prepare_elf64_headers(struct crash_elf_data *ced,
> > bufp += sizeof(Elf64_Phdr);
> > phdr->p_type = PT_NOTE;
> > phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> > -   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> > +   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
> > (ehdr->e_phnum)++;
> >  
> >  #ifdef CONFIG_X86_64
> > diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> > index e98e546..f1c601b 100644
> > --- a/include/linux/kexec.h
> > +++ b/include/linux/kexec.h
> > @@ -317,7 +317,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
> >  extern struct resource crashk_low_res;
> >  typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];
> >  extern note_buf_t __percpu *crash_notes;
> > -extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
> > +extern u32 *vmcoreinfo_note;
> >  extern size_t vmcoreinfo_size;
> >  extern size_t vmcoreinfo_max_size;
> >  
> > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> > index bfe62d5..e3a4bda 100644
> > --- a/kernel/kexec_core.c
> > +++ b/kernel/kexec_core.c
> > @@ -52,10 +52,10 @@
> >  note_buf_t __percpu *crash_notes;
> >  
> >  /* vmcoreinfo stuff */
> > -static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
> > -u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
> > +static unsigned char *vmcoreinfo_data;
> >  size_t vmcoreinfo_size;
> > -size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
> > +size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
> > +u32 *vmcoreinfo_note;
> >  
> >  /* Flag to indicate we are going to kexec a new kernel */
> >  bool kexec_in_progress = false;
> > @@ -1369,6 +1369,9 @@ static void update_vmcoreinfo_note(void)
> >  
> >  void crash_save_vmcoreinfo(void)
> >  {
> > +   if (!vmcoreinfo_note)
> > +   return;
> > +
> > vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
> > update_vmcoreinfo_note();
> >  }
> > @@ -1397,13 +1400,29 @@ void vmcoreinfo_append_str(const char *fmt, ...)
> >  void __weak arch_crash_save_vmcoreinfo(void)
> >  {}
> >  
> > -phys_addr_t __weak paddr_vmcoreinfo_note(void)
> > +phys_addr_t paddr_vmcoreinfo_note(void)
> >  {
> > -   return __pa_symbol((unsigned long)(char *)&vmcoreinfo_note);
> > +   return __pa(vmcoreinfo_note);
> >  }
> >  
> >  static int __init crash_save_vmcoreinfo_init(void)
> >  {
> > +   /* One page should be enough for VMCOREINFO_BYTES under all archs */
> > +   vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
> > +   if (!vmcoreinfo

Re: kexec regression since 4.9 caused by efi

2017-03-21 Thread Dave Young

On 03/20/17 at 10:14am, Dave Young wrote:
> On 03/17/17 at 01:32pm, Matt Fleming wrote:
> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> > > 
> > > Matt, I think it should be fine although I think the md type checking in
> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
> >  
> > Could you make that a separate patch if you think of improvements
> > there?
> 
> Duplicate the lookup function is indeed a little ugly, will do it when I
> have a better idea, we can leave it as is since it works.

Matt, rethinking this, how about doing something like the patch below? It is
not tested, I am just looking for ideas and opinions. This way there is no
need to duplicate a function, but it assumes that there are no overlapping
mem ranges in the efi memmap.

Also, in the case Omar reported the esrt memory range has type RUNTIME_DATA.
That is a little different from a range that only carries the RUNTIME
attribute, which also covers BOOT_DATA that has had the RUNTIME attribute
set, like bgrt in a kexec reboot. Should we distinguish the two cases and
print some warnings or debug info?


---
 arch/x86/platform/efi/quirks.c |5 +
 drivers/firmware/efi/efi.c |6 --
 drivers/firmware/efi/esrt.c|7 +++
 3 files changed, 12 insertions(+), 6 deletions(-)

--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
u64 size;
u64 end;
 
-   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
-   md->type != EFI_BOOT_SERVICES_DATA &&
-   md->type != EFI_RUNTIME_SERVICES_DATA) {
-   continue;
-   }
-
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;
if (phys_addr >= md->phys_addr && phys_addr < end) {
--- linux-x86.orig/drivers/firmware/efi/esrt.c
+++ linux-x86/drivers/firmware/efi/esrt.c
@@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
return;
}
 
+   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+ md->type != EFI_BOOT_SERVICES_DATA &&
+ md->type != EFI_RUNTIME_SERVICES_DATA) {
+   pr_err("ESRT header memory map type is invalid\n");
+   return;
+   }
+
max = efi_mem_desc_end(&md);
if (max < efi.esrt) {
pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
 
+   if (md->attribute & EFI_MEMORY_RUNTIME ||
+ md->type != EFI_BOOT_SERVICES_DATA) {
+   return;
+   }
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);

> 
> > 
> > > How about move the if chunk early like below because it seems no need
> > > to sanity check the addr + size any more if the md is still RUNTIME?
> > 
> > My original version did as you suggest, but I changed it because we
> > *really* want to know if someone tries to reserve a range that spans
> > regions. That would be totally unexpected and a warning about a
> > potential bug/issue.
> 
> Matt, I'm fine if you prefer to capture the range checking errors.
> Would you like me to post it or just you send it out?
> 
> Thanks
> Dave

Thanks
Dave
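
For reference, the shape of a lookup with the type check removed can be
modelled in user space as below. This is only a sketch of the idea in the
patch above: struct md_model and md_lookup() are hypothetical names, the map
is a plain array rather than the kernel's efi memmap, and callers are
expected to apply their own type or attribute checks to the result.

```c
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT 12

/* Simplified stand-in for efi_memory_desc_t (hypothetical). */
struct md_model {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/*
 * Hypothetical lookup without any type or attribute filter: return the
 * first descriptor whose range contains phys_addr, or NULL if none does.
 * "First match" is only unambiguous if the ranges do not overlap.
 */
const struct md_model *md_lookup(const struct md_model *map, size_t n,
				 uint64_t phys_addr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t size = map[i].num_pages << EFI_PAGE_SHIFT;
		uint64_t end = map[i].phys_addr + size;	/* exclusive */

		if (phys_addr >= map[i].phys_addr && phys_addr < end)
			return &map[i];
	}
	return NULL;
}
```

The no-overlap assumption mentioned above matters here: with overlapping
entries the result would depend on map order instead of on the type filter.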

Re: kexec regression since 4.9 caused by efi

2017-03-19 Thread Dave Young

On 03/17/17 at 01:32pm, Matt Fleming wrote:
> On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> > 
> > Matt, I think it should be fine although I think the md type checking in
> > efi_mem_desc_lookup() is causing confusion and not easy to understand..
>  
> Could you make that a separate patch if you think of improvements
> there?

Duplicating the lookup function is indeed a little ugly. I will do it when I
have a better idea; we can leave it as is for now since it works.

> 
> > How about move the if chunk early like below because it seems no need
> > to sanity check the addr + size any more if the md is still RUNTIME?
> 
> My original version did as you suggest, but I changed it because we
> *really* want to know if someone tries to reserve a range that spans
> regions. That would be totally unexpected and a warning about a
> potential bug/issue.

Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it, or will you send it out yourself?

Thanks
Dave

Re: kexec regression since 4.9 caused by efi

2017-03-16 Thread Dave Young

On 03/16/17 at 12:41pm, Matt Fleming wrote:
> On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
> > 
> > Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
> > correct to be used in efi_arch_mem_reserve, if it passed your test, I
> > can rewrite patch log with more background and send it out:
> > 
> > for_each_efi_memory_desc(md) {
> > [snip]
> > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > md->type != EFI_BOOT_SERVICES_DATA &&
> > md->type != EFI_RUNTIME_SERVICES_DATA) {
> > continue;
> > }
> > 
> > In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
> > data or runtime data, this is wrong for efi_mem_reserve, because we are
> > reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
> > running time. Just is happened to work and we did not capture the error.
> 
> Wouldn't something like this be simpler?
> 
> ---
> 
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 30031d5293c4..cdfe8c628959 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>   return;
>   }
>  
> + /* No need to reserve regions that will never be freed. */
> + if (md.attribute & EFI_MEMORY_RUNTIME)
> + return;
> +

Matt, I think it should be fine, although the md type checking in
efi_mem_desc_lookup() is confusing and not easy to understand.

How about moving the if chunk earlier, like below? It seems there is no need
to sanity check addr + size any more if the md is still RUNTIME.

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -196,6 +196,10 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
 
+   /* No need to reserve regions that will never be freed. */
+   if (md.attribute & EFI_MEMORY_RUNTIME)
+   return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;

Thanks
Dave
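
The difference between the two placements is only the order of two tests,
which a small user-space model can show. This is a sketch with invented names
(struct resv_md, resv_check(), the RESV_* results), not the kernel function.
It follows the order in the quoted diff, where the span check runs first.

```c
#include <stdint.h>

#define EFI_PAGE_SHIFT      12
#define EFI_MEMORY_RUNTIME  0x8000000000000000ULL

/* Simplified stand-in for efi_memory_desc_t (hypothetical). */
struct resv_md {
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

enum resv_result {
	RESV_DO_RESERVE,	/* go on and mark the range */
	RESV_SKIP_RUNTIME,	/* region is never freed, nothing to do */
	RESV_ERR_SPANS,		/* request crosses the descriptor's end */
};

/* Hypothetical model of the checks, span check first. */
enum resv_result resv_check(const struct resv_md *md, uint64_t addr,
			    uint64_t size)
{
	uint64_t md_end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);

	if (addr + size > md_end)
		return RESV_ERR_SPANS;
	if (md->attribute & EFI_MEMORY_RUNTIME)
		return RESV_SKIP_RUNTIME;
	return RESV_DO_RESERVE;
}
```

With this order a request that runs past the end of a RUNTIME descriptor is
still reported as RESV_ERR_SPANS. Swapping the two tests would return
RESV_SKIP_RUNTIME for that request and the span would go unnoticed.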

Re: kexec regression since 4.9 caused by efi

2017-03-13 Thread Dave Young

On 03/09/17 at 01:54am, Omar Sandoval wrote:
> On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> > Add efi/kexec list.
> > 
> > On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> 
> [snip]
> 
> > I have no more clue yet from your provided log, but the runtime value is
> > odd to me. It is set in below code:
> > 
> > arch/x86/platform/efi/efi.c: efi_systab_init()
> > efi_systab.runtime = data ?
> >  (void *)(unsigned long)data->runtime :
> >  (void *)(unsigned long)systab64->runtime;
> > 
> > Here data is the setup_data passed by kexec-tools from normal kernel to
> > kexec kernel, efi_setup_data structure is like below: 
> > struct efi_setup_data {
> > u64 fw_vendor;
> > u64 runtime;
> > u64 tables;
> > u64 smbios;
> > u64 reserved[8];
> > };
> > 
> > kexec-tools get the runtime address from /sys/firmware/efi/runtime
> > 
> > So can you do some debugging on your side, eg. see the sysfs runtime
> > value is correct or not. And add some printk in efi init path etc.
> 
> The attached patch fixes this for me.

Omar, could you try the patch below? Looking at efi_mem_desc_lookup(), it is
not correct to use it in efi_arch_mem_reserve(). If it passes your test, I
can rewrite the patch log with more background and send it out:

	for_each_efi_memory_desc(md) {
		[snip]
		if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
		    md->type != EFI_BOOT_SERVICES_DATA &&
		    md->type != EFI_RUNTIME_SERVICES_DATA) {
			continue;
		}

The code above is meant to get an md with EFI_MEMORY_RUNTIME for either boot
data or runtime data. This is wrong for efi_mem_reserve, because we are
reserving boot data, which has no EFI_MEMORY_RUNTIME attribute at run time.
It just happened to work and we did not catch the error.

Signed-off-by: Dave Young 
---
 arch/x86/platform/efi/quirks.c |4 +++-
 drivers/firmware/efi/efi.c |   39 +++
 include/linux/efi.h|1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -191,7 +191,9 @@ void __init efi_arch_mem_reserve(phys_ad
int num_entries;
void *new;
 
-   if (efi_mem_desc_lookup(addr, &md)) {
+   if (efi_md_lookup_boot_data(addr, &md)) {
+   if (md.attribute & EFI_MEMORY_RUNTIME)
+   return;
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -353,6 +353,45 @@ err_put:
 subsys_initcall(efisubsys_init);
 
 /*
+ * Find the efi memory descriptor for a given physical address.
+ * Given a physical address, if it exists within an EFI memory map
+ * entry of type EFI_BOOT_SERVICES_DATA and the entry has no attribute
+ * EFI_MEMORY_RUNTIME, then populate the supplied memory descriptor
+ * with the appropriate data.
+ */
+int __init efi_md_lookup_boot_data(u64 phys_addr,
+efi_memory_desc_t *out_md)
+{
+   efi_memory_desc_t *md;
+
+   if (!efi_enabled(EFI_MEMMAP)) {
+   pr_err_once("EFI_MEMMAP is not enabled.\n");
+   return -EINVAL;
+   }
+
+   if (!out_md) {
+   pr_err_once("out_md is null.\n");
+   return -EINVAL;
+   }
+
+   for_each_efi_memory_desc(md) {
+   u64 size;
+   u64 end;
+
+   if (md->type != EFI_BOOT_SERVICES_DATA)
+   continue;
+
+   size = md->num_pages << EFI_PAGE_SHIFT;
+   end = md->phys_addr + size;
+   if (phys_addr >= md->phys_addr && phys_addr < end) {
+   memcpy(out_md, md, sizeof(*out_md));
+   return 0;
+   }
+   }
+   return -ENOENT;
+}
+
+/*
  * Find the efi memory descriptor for a given physical address.  Given a
  * physical address, determine if it exists within an EFI Memory Map entry,
  * and if so, populate the supplied memory descriptor with the appropriate
--- linux-x86.orig/include/linux/efi.h
+++ linux-x86/include/linux/efi.h
@@ -979,6 +979,7 @@ extern u64 efi_mem_attribute (unsigned l
 extern int __init efi_uart_console_only (void);
 extern u64 efi_mem_desc_end(efi_memory_desc_t *md);
 extern int efi_mem_desc_lookup(u64 phys_addr, efi_memory_desc_t *out_md);
+extern int efi_md_lookup_boot_data(u64 phys_addr, efi_memory_desc_t *out_md);
 extern void efi_mem_reserve(phys_addr_t addr, u64 size);
 extern void efi_initialize_iomem_resources(struct resource *code_resource,
struct resource *data_resource, struct resource *bss_resource);

Re: kexec regression since 4.9 caused by efi

2017-03-09 Thread Dave Young

On 03/09/17 at 01:54am, Omar Sandoval wrote:
> On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> > Add efi/kexec list.
> > 
> > On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> 
> [snip]
> 
> > I have no more clue yet from your provided log, but the runtime value is
> > odd to me. It is set in below code:
> > 
> > arch/x86/platform/efi/efi.c: efi_systab_init()
> > efi_systab.runtime = data ?
> >  (void *)(unsigned long)data->runtime :
> >  (void *)(unsigned long)systab64->runtime;
> > 
> > Here data is the setup_data passed by kexec-tools from normal kernel to
> > kexec kernel, efi_setup_data structure is like below: 
> > struct efi_setup_data {
> > u64 fw_vendor;
> > u64 runtime;
> > u64 tables;
> > u64 smbios;
> > u64 reserved[8];
> > };
> > 
> > kexec-tools get the runtime address from /sys/firmware/efi/runtime
> > 
> > So can you do some debugging on your side, eg. see the sysfs runtime
> > value is correct or not. And add some printk in efi init path etc.
> 
> The attached patch fixes this for me.
> From 4b343f0b0b408469f28c973ea52877797a166313 Mon Sep 17 00:00:00 2001
> Message-Id: 
> <4b343f0b0b408469f28c973ea52877797a166313.1489053164.git.osan...@fb.com>
> From: Omar Sandoval 
> Date: Thu, 9 Mar 2017 01:46:19 -0800
> Subject: [PATCH] efi: adjust virt_addr when splitting descriptors in
>  efi_memmap_insert()
> 
> When we split efi memory descriptors, we adjust the physical address but
> not the virtual address it maps to. This leads to bogus memory mappings
> later when these virtual addresses are used.
> 
> This fixes a kexec boot regression since 8e80632fb23f ("efi/esrt: Use
> efi_mem_reserve() and avoid a kmalloc()"), although the bug was only
> exposed by that commit.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  drivers/firmware/efi/memmap.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
> index 78686443cb37..ca614db76faf 100644
> --- a/drivers/firmware/efi/memmap.c
> +++ b/drivers/firmware/efi/memmap.c
> @@ -298,6 +298,7 @@ void __init efi_memmap_insert(struct efi_memory_map 
> *old_memmap, void *buf,
>   memcpy(new, old, old_memmap->desc_size);
>   md = new;
>   md->phys_addr = m_end + 1;
> + md->virt_addr += md->phys_addr - start;
>   md->num_pages = (end - md->phys_addr + 1) >>
>   EFI_PAGE_SHIFT;
>   }
> @@ -312,6 +313,7 @@ void __init efi_memmap_insert(struct efi_memory_map 
> *old_memmap, void *buf,
>   md = new;
>   md->attribute |= m_attr;
>   md->phys_addr = m_start;
> + md->virt_addr += md->phys_addr - start;
>   md->num_pages = (m_end - m_start + 1) >>
>   EFI_PAGE_SHIFT;
>   /* last part */
> @@ -319,6 +321,7 @@ void __init efi_memmap_insert(struct efi_memory_map 
> *old_memmap, void *buf,
>   memcpy(new, old, old_memmap->desc_size);
>   md = new;
>   md->phys_addr = m_end + 1;
> + md->virt_addr += md->phys_addr - start;
>   md->num_pages = (end - m_end) >>
>   EFI_PAGE_SHIFT;
>   }
> @@ -333,6 +336,7 @@ void __init efi_memmap_insert(struct efi_memory_map 
> *old_memmap, void *buf,
>   memcpy(new, old, old_memmap->desc_size);
>   md = new;
>   md->phys_addr = m_start;
> + md->virt_addr += md->phys_addr - start;
>   md->num_pages = (end - md->phys_addr + 1) >>
>   EFI_PAGE_SHIFT;
>   md->attribute |= m_attr;
> -- 
> 2.12.0
> 

Nice, thanks for the debugging. The problem is clear now.

However, runtime areas do not need to be reserved, and for boot areas there
is no need to update the virt address. But I'm not sure about the fakemem
usage of this.

So we need comments from Matt.

Thanks
Dave
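
The arithmetic in Omar's patch can be checked with a small user-space model
of one split. This is a sketch, not the kernel's efi_memmap_insert(): struct
split_md and split_md_at() are invented names, only a single two-way split is
handled, and the end address is exclusive here while the kernel code works
with inclusive ends.

```c
#include <stdint.h>

#define EFI_PAGE_SHIFT 12
#define EFI_PAGE_SIZE  (1ULL << EFI_PAGE_SHIFT)

/* Simplified stand-in for efi_memory_desc_t (hypothetical). */
struct split_md {
	uint64_t phys_addr;
	uint64_t virt_addr;
	uint64_t num_pages;
};

/*
 * Hypothetical helper: split *old at split_phys into head and tail.
 * split_phys must be page aligned and strictly inside the region.
 * Returns 0 on success, -1 if the split point is not usable.
 */
int split_md_at(const struct split_md *old, uint64_t split_phys,
		struct split_md *head, struct split_md *tail)
{
	uint64_t start = old->phys_addr;
	uint64_t end = start + (old->num_pages << EFI_PAGE_SHIFT);

	if (split_phys <= start || split_phys >= end ||
	    (split_phys & (EFI_PAGE_SIZE - 1)))
		return -1;

	*head = *old;
	head->num_pages = (split_phys - start) >> EFI_PAGE_SHIFT;

	*tail = *old;
	tail->phys_addr = split_phys;
	/* Move the virtual address by the same distance as the physical one. */
	tail->virt_addr += split_phys - start;
	tail->num_pages = (end - split_phys) >> EFI_PAGE_SHIFT;
	return 0;
}
```

Without the virt_addr adjustment the tail would keep the virtual address of
the original start, so its physical and virtual ranges would no longer line
up.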

Re: kexec regression since 4.9 caused by efi

2017-03-09 Thread Dave Young

On 03/09/17 at 12:53pm, Ard Biesheuvel wrote:
> On 9 March 2017 at 10:54, Omar Sandoval  wrote:
> > On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> >> Add efi/kexec list.
> >>
> >> On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> >
> > [snip]
> >
> >> I have no more clue yet from your provided log, but the runtime value is
> >> odd to me. It is set in below code:
> >>
> >> arch/x86/platform/efi/efi.c: efi_systab_init()
> >>   efi_systab.runtime = data ?
> >>(void *)(unsigned long)data->runtime :
> >>(void *)(unsigned long)systab64->runtime;
> >>
> >> Here data is the setup_data passed by kexec-tools from normal kernel to
> >> kexec kernel, efi_setup_data structure is like below:
> >> struct efi_setup_data {
> >> u64 fw_vendor;
> >> u64 runtime;
> >> u64 tables;
> >> u64 smbios;
> >> u64 reserved[8];
> >> };
> >>
> >> kexec-tools get the runtime address from /sys/firmware/efi/runtime
> >>
> >> So can you do some debugging on your side, eg. see the sysfs runtime
> >> value is correct or not. And add some printk in efi init path etc.
> >
> > The attached patch fixes this for me.
> 
> Hi Omar,
> 
> Thanks for tracking this down.
> 
> I wonder if this is an unintended side effect of the way we repurpose
> the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
> splitting memory map entries should only be necessary for regions that
> are not runtime memory regions to begin with, and so whether their
> virtual mapping address makes sense or not should be irrelevant.

In this case the esrt chunk is Runtime Data, which does not need to be
reserved explicitly. I think efi_arch_mem_reserve() is for boot areas.

Could there be esrt data that belongs to boot data? If we are sure it is all
runtime, the better fix may be to just drop the efi_mem_reserve() call in
esrt.c.

> 
> Perhaps this only illustrates my lack of understanding of the x86 way
> of doing this, so perhaps Matt can shed some light on this?
> 
> Thanks,
> Ard.

Thanks
Dave

Re: kexec regression since 4.9 caused by efi

2017-03-08 Thread Dave Young

Add efi/kexec list.

On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> Hi, everyone,
> 
> Since 4.9, kexec results in the following panic on some of our servers:
> 
> [0.001000] general protection fault:  [#1] SMP
> [0.001000] Modules linked in:
> [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
> [0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05  
>  09/30/2016
> [0.001000] task: 81e0e4c0 task.stack: 81e0
> [0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
> [0.001000] RSP: :81e03e18 EFLAGS: 00010202
> [0.001000] RAX: afafafafafafafaf RBX: 81e3a4e0 RCX: 
> 0007
> [0.001000] RDX: 81e03e70 RSI: 81e3a4e0 RDI: 
> 88407f8c2de0
> [0.001000] RBP: 81e03e60 R08:  R09: 
> 
> [0.001000] R10:  R11:  R12: 
> 81e03e70
> [0.001000] R13: 0007 R14:  R15: 
> 
> [0.001000] FS:  () GS:881fff60() 
> knlGS:
> [0.001000] CS:  0010 DS:  ES:  CR0: 80050033
> [0.001000] CR2: 88407f30f000 CR3: 001fff102000 CR4: 
> 000406b0
> [0.001000] DR0:  DR1:  DR2: 
> 
> [0.001000] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [0.001000] Call Trace:
> [0.001000]  efi_delete_dummy_variable+0x7a/0x80
> [0.001000]  efi_enter_virtual_mode+0x3e2/0x494
> [0.001000]  start_kernel+0x392/0x418
> [0.001000]  ? set_init_arg+0x55/0x55
> [0.001000]  x86_64_start_reservations+0x2a/0x2c
> [0.001000]  x86_64_start_kernel+0xea/0xed
> [0.001000]  start_cpu+0x14/0x14
> [0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 
> 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 
> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
> [0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: 81e03e18
> [0.001000] ---[ end trace 0bd213e540e9b19f ]---
> [0.001000] Kernel panic - not syncing: Fatal exception
> [0.001000] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Booting normally (i.e., not kexec) still works.
> 
> The decoded code is:
> 
> 
>    0:   42 25 8d ff 80 3d       rex.X and $0x3d80ff8d,%eax
>    6:   43 77 95                rex.XB ja 0xff9e
>    9:   00 00                   add    %al,(%rax)
>    b:   75 68                   jne    0x75
>    d:   9c                      pushfq
>    e:   8f 04 24                popq   (%rsp)
>   11:   48 8b 05 3e 7d 7e 00    mov    0x7e7d3e(%rip),%rax        # 0x7e7d56
>   18:   48 89 de                mov    %rbx,%rsi
>   1b:   4d 89 f9                mov    %r15,%r9
>   1e:   4d 89 f0                mov    %r14,%r8
>   21:   44 89 e9                mov    %r13d,%ecx
>   24:   4c 89 e2                mov    %r12,%rdx
>   27:   48 8b 40 58             mov    0x58(%rax),%rax
>   2b:*  48 8b 78 58             mov    0x58(%rax),%rdi          <-- trapping instruction
>   2f:   31 c0                   xor    %eax,%eax
>   31:   e8 90 e4 92 ff          callq  0xff92e4c6
>   36:   48 8b 3c 24             mov    (%rsp),%rdi
>   3a:   48                      rex.W
>   3b:   c7                      .byte 0xc7
>   3c:   c6                      (bad)
>   3d:   2b 0a                   sub    (%rdx),%ecx
>   3f:   ca                      .byte 0xca
> 
> If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,

I have no more clues yet from the log you provided, but the runtime value
looks odd to me. It is set in the code below:

arch/x86/platform/efi/efi.c: efi_systab_init()
	efi_systab.runtime = data ?
			     (void *)(unsigned long)data->runtime :
			     (void *)(unsigned long)systab64->runtime;

Here data is the setup_data passed by kexec-tools from the normal kernel to
the kexec kernel. The efi_setup_data structure looks like this:
struct efi_setup_data {
	u64 fw_vendor;
	u64 runtime;
	u64 tables;
	u64 smbios;
	u64 reserved[8];
};

kexec-tools gets the runtime address from /sys/firmware/efi/runtime.

So can you do some debugging on your side, e.g. check whether the sysfs
runtime value is correct, and add some printk calls in the efi init path?
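
For that kind of debugging, a small user-space helper may be handy. This is
only a sketch with made-up function names. It assumes the text read from the
sysfs file is a single number that strtoull() accepts with base 0, and that
the machine uses 48-bit virtual addresses (4-level paging). The second
function is meant for pointer values such as the RAX value in the oops, where
a non-canonical address is what turns a dereference into a general protection
fault.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one number (e.g. "0x1234\n") into *out.
 * Returns 0 on success, -1 if nothing could be parsed or on overflow.
 */
int parse_addr(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 0);
	if (errno != 0 || end == text)
		return -1;
	*out = (uint64_t)v;
	return 0;
}

/*
 * Hypothetical helper: with 48-bit virtual addresses, bits 63..47 of a
 * canonical address are either all zero or all one.
 */
bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the upper 17 bits */

	return top == 0 || top == 0x1ffff;
}
```

Feeding it the value from the report, 0xafafafafafafafaf, gives a
non-canonical result, which fits the general protection fault in the log.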

> and we're crashing when we try to dereference that.
> 
> Here is the output of efi=debug from before the crash:
> 
> [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) 
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 
> 12:07:16 PST 2017
> [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 
> ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 
> fsck.repair=yes pcie_pme=nomsi 
> netconsole=+@2401:db00:0011:b03e:face::0009:0

Re: kexec regression since 4.9 caused by efi

2017-03-08 Thread Dave Young

On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> Hi, everyone,
> 
> Since 4.9, kexec results in the following panic on some of our servers:
> 
> [0.001000] general protection fault:  [#1] SMP
> [0.001000] Modules linked in:
> [0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
> [0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05  
>  09/30/2016
> [0.001000] task: 81e0e4c0 task.stack: 81e0
> [0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
> [0.001000] RSP: :81e03e18 EFLAGS: 00010202
> [0.001000] RAX: afafafafafafafaf RBX: 81e3a4e0 RCX: 
> 0007
> [0.001000] RDX: 81e03e70 RSI: 81e3a4e0 RDI: 
> 88407f8c2de0
> [0.001000] RBP: 81e03e60 R08:  R09: 
> 
> [0.001000] R10:  R11:  R12: 
> 81e03e70
> [0.001000] R13: 0007 R14:  R15: 
> 
> [0.001000] FS:  () GS:881fff60() 
> knlGS:
> [0.001000] CS:  0010 DS:  ES:  CR0: 80050033
> [0.001000] CR2: 88407f30f000 CR3: 001fff102000 CR4: 
> 000406b0
> [0.001000] DR0:  DR1:  DR2: 
> 
> [0.001000] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [0.001000] Call Trace:
> [0.001000]  efi_delete_dummy_variable+0x7a/0x80
> [0.001000]  efi_enter_virtual_mode+0x3e2/0x494
> [0.001000]  start_kernel+0x392/0x418
> [0.001000]  ? set_init_arg+0x55/0x55
> [0.001000]  x86_64_start_reservations+0x2a/0x2c
> [0.001000]  x86_64_start_kernel+0xea/0xed
> [0.001000]  start_cpu+0x14/0x14
> [0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 
> 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 
> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
> [0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: 81e03e18
> [0.001000] ---[ end trace 0bd213e540e9b19f ]---
> [0.001000] Kernel panic - not syncing: Fatal exception
> [0.001000] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Booting normally (i.e., not kexec) still works.
> 
> The decoded code is:
> 
> 
>    0:   42 25 8d ff 80 3d       rex.X and $0x3d80ff8d,%eax
>    6:   43 77 95                rex.XB ja 0xff9e
>    9:   00 00                   add    %al,(%rax)
>    b:   75 68                   jne    0x75
>    d:   9c                      pushfq
>    e:   8f 04 24                popq   (%rsp)
>   11:   48 8b 05 3e 7d 7e 00    mov    0x7e7d3e(%rip),%rax        # 0x7e7d56
>   18:   48 89 de                mov    %rbx,%rsi
>   1b:   4d 89 f9                mov    %r15,%r9
>   1e:   4d 89 f0                mov    %r14,%r8
>   21:   44 89 e9                mov    %r13d,%ecx
>   24:   4c 89 e2                mov    %r12,%rdx
>   27:   48 8b 40 58             mov    0x58(%rax),%rax
>   2b:*  48 8b 78 58             mov    0x58(%rax),%rdi          <-- trapping instruction
>   2f:   31 c0                   xor    %eax,%eax
>   31:   e8 90 e4 92 ff          callq  0xff92e4c6
>   36:   48 8b 3c 24             mov    (%rsp),%rdi
>   3a:   48                      rex.W
>   3b:   c7                      .byte 0xc7
>   3c:   c6                      (bad)
>   3d:   2b 0a                   sub    (%rdx),%ecx
>   3f:   ca                      .byte 0xca
> 
> If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,
> and we're crashing when we try to dereference that.
> 
> Here is the output of efi=debug from before the crash:
> 
> [0.00] Linux version 4.11.0-rc1 (osan...@devbig561.prn1.facebook.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017
> [0.00] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+@2401:db00:0011:b03e:face::0009:/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 console=ttyS1,57600 efi=debug
> [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
> [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
> [0.00] e820: BIOS-provided physical RAM map:
> [0.00] BIOS-e820: [mem 0x0100-0x0009] usable
> [0.00] BIOS-e820: [mem 0x0010-0x750bdfff] usable
> [0.00] BIOS-e820: [mem 0x750be000-0x75ddbf

Re: [PATCH 1/2] x86/efi: Correct a tiny mistake in code comment

2017-03-08 Thread Dave Young

Hi,

On 03/08/17 at 04:45pm, Baoquan He wrote:
> Forgot cc to Boris, add him.
> 
> On 03/08/17 at 04:18pm, Dave Young wrote:
> > On 03/08/17 at 03:47pm, Baoquan He wrote:
> > > EFI allocate runtime services regions down from EFI_VA_START, -4G.
> > > It should be top-down handling.
> > > 
> > > Signed-off-by: Baoquan He 
> > > ---
> > >  arch/x86/platform/efi/efi_64.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > > index a4695da..6cbf9e0 100644
> > > --- a/arch/x86/platform/efi/efi_64.c
> > > +++ b/arch/x86/platform/efi/efi_64.c
> > > @@ -47,7 +47,7 @@
> > >  #include 
> > >  
> > >  /*
> > > > - * We allocate runtime services regions bottom-up, starting from -4G, i.e.
> > > + * We allocate runtime services regions top-down, starting from -4G, i.e.
> > 
> > Baoquan, I think original bottom-up is right, it is just considering
> > -68G as up, see the x86_64 mm.txt. We regard vmalloc as higher address
> > although from mathematics view it is lower then positive addresses.
> 
> Thanks for reviewing!
> 
> I am not sure. Just in efi_map_region() it gets the starting va to map
> 'size' big of region by below code:
>   efi_va -= size;
> 
> -4G and -68G just a trick which makes people understand easily, still we
> think kernel text mapping region is in higher addr area then vmalloc. I
> personnally think.

I understand your point; there is no right or wrong here. So I think
dropping the wording, as you did in your V2, looks good.

Thanks
Dave
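
For readers following the bottom-up/top-down wording debate, here is a
minimal user-space sketch of the "efi_va -= size" step Baoquan quotes. It
is not the kernel code: efi_va_alloc() is a hypothetical helper, and the
two constants are simply the -4G and -68G figures mentioned in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the "-4G" and "-68G" figures quoted in the thread. */
#define GIB          (1ULL << 30)
#define EFI_VA_START ((uint64_t)0 - 4 * GIB)
#define EFI_VA_END   ((uint64_t)0 - 68 * GIB)

/* The cursor starts at the numerically highest address of the window. */
static uint64_t efi_va = EFI_VA_START;

/*
 * Hypothetical stand-in for the "efi_va -= size" step in efi_map_region():
 * hand out 'size' bytes directly below the previous allocation.
 * Returns 0 when the 64G window cannot hold the request.
 */
static uint64_t efi_va_alloc(uint64_t size)
{
	assert(size != 0);
	if (size > efi_va - EFI_VA_END)
		return 0;
	efi_va -= size;
	return efi_va;
}
```

Each call returns an address below the previous one, whether the value is
read as unsigned or as a signed (negative) number, so the disagreement in
this thread is about naming rather than about what the code does.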

Re: [PATCH 1/2] x86/efi: Correct a tiny mistake in code comment

2017-03-08 Thread Dave Young

On 03/08/17 at 11:50am, Borislav Petkov wrote:
> On Wed, Mar 08, 2017 at 06:17:50PM +0800, Baoquan He wrote:
> > All right, I will just update the code comment. Just back ported kaslr
> > to our OS product, people reviewed and found the upper boundary of kaslr
> > mm region is EFI_VA_START, that's not correct, it has to be corrected
> > firstly in upstream. Then found the confusion in code comment.
> 
> BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_ESPFIX64) &&
> +vaddr_end >= EFI_VA_END);
> 
> so I think that once we've done the mapping, we won't need anymore VA
> space so we could simply check the range [efi_va, EFI_VA_START] instead.
> 
> However, that won't work currently because efi_va is not valid at
> build time. And it won't work at boot time either because, AFAICT,
> kernel_randomize_memory() runs before efi_enter_virtual_mode() so ...
> 
> So yours is probably OK.
> 
> I guess what's confusing there is the naming - EFI_VA_START and
> EFI_VA_END. They're kinda swapped because of the direction we take when
> we start mapping runtime services, i.e., from the higher (unsigned)
> address to lower.
> 
> I guess we could swap the naming so that it doesn't confuse people but
> that would be up to EFI maintainers.
> 
> Then stuff like that:
> 
> # ifdef CONFIG_EFI
> { EFI_VA_END,   "EFI Runtime Services" },
> # endif
> 
> will make more sense when they are:
> 
> # ifdef CONFIG_EFI
> { EFI_VA_START,   "EFI Runtime Services" },
> # endif
> 
> But changing it now could confuse more people who have the current
> mental picture of the mapping direction so I'd vote for the simple fix
> above.

People should understand the meaning of a macro and then use it correctly;
one should not assume START == lower address without being sure.

> 
> Again, as previously, this is a maintainer decision.
> 

Personally I think the current way is just fine, but I agree it is up to the
EFI maintainer.

Thanks
Dave

Re: [PATCH 2/2] x86/mm/KASLR: Correct the upper boundary of KALSR mm regions if adjacent to EFI

2017-03-08 Thread Dave Young

On 03/08/17 at 03:47pm, Baoquan He wrote:
> EFI allocates runtime services regions top-down, starting from EFI_VA_START
> to EFI_VA_END. So EFI_VA_START is bigger than EFI_VA_END and is the end of
> EFI region. The upper boundary of memory regions randomized by KASLR should
> be EFI_VA_END if it's adjacent to EFI region, but not EFI_VA_START.
> 
> Correct it in this patch.
> 
> Signed-off-by: Baoquan He 
> ---
>  arch/x86/mm/kaslr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
> index 887e571..aed2064 100644
> --- a/arch/x86/mm/kaslr.c
> +++ b/arch/x86/mm/kaslr.c
> @@ -48,7 +48,7 @@ static const unsigned long vaddr_start = __PAGE_OFFSET_BASE;
>  #if defined(CONFIG_X86_ESPFIX64)
>  static const unsigned long vaddr_end = ESPFIX_BASE_ADDR;
>  #elif defined(CONFIG_EFI)
> -static const unsigned long vaddr_end = EFI_VA_START;
> +static const unsigned long vaddr_end = EFI_VA_END;
>  #else
>  static const unsigned long vaddr_end = __START_KERNEL_map;
>  #endif
> @@ -105,7 +105,7 @@ void __init kernel_randomize_memory(void)
>*/
>   BUILD_BUG_ON(vaddr_start >= vaddr_end);
>   BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_ESPFIX64) &&
> -  vaddr_end >= EFI_VA_START);
> +  vaddr_end >= EFI_VA_END);
>   BUILD_BUG_ON((IS_ENABLED(CONFIG_X86_ESPFIX64) ||
>     IS_ENABLED(CONFIG_EFI)) &&
>vaddr_end >= __START_KERNEL_map);
> -- 
> 2.5.5
> 

Acked-by: Dave Young 

Thanks
Dave
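
A small sketch of why the boundary matters, under the assumption that the
EFI window spans -68G (EFI_VA_END) up to -4G (EFI_VA_START) as described in
the patch. overlaps_efi_window() is a hypothetical helper written for this
illustration, not something in arch/x86/mm/kaslr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIB          (1ULL << 30)
#define EFI_VA_START ((uint64_t)0 - 4 * GIB)   /* top of the EFI window */
#define EFI_VA_END   ((uint64_t)0 - 68 * GIB)  /* bottom of the EFI window */

/*
 * Hypothetical check: true if [start, end) overlaps the EFI runtime
 * window [EFI_VA_END, EFI_VA_START).
 */
static bool overlaps_efi_window(uint64_t start, uint64_t end)
{
	assert(start < end);
	return start < EFI_VA_START && end > EFI_VA_END;
}
```

For any start below the window, a randomization range ending at EFI_VA_END
does not overlap it, while one ending at EFI_VA_START covers the whole EFI
region, which is the situation the patch corrects.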

Re: [PATCH 1/2] x86/efi: Correct a tiny mistake in code comment

2017-03-08 Thread Dave Young

On 03/08/17 at 03:47pm, Baoquan He wrote:
> EFI allocate runtime services regions down from EFI_VA_START, -4G.
> It should be top-down handling.
> 
> Signed-off-by: Baoquan He 
> ---
>  arch/x86/platform/efi/efi_64.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index a4695da..6cbf9e0 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -47,7 +47,7 @@
>  #include 
>  
>  /*
> - * We allocate runtime services regions bottom-up, starting from -4G, i.e.
> + * We allocate runtime services regions top-down, starting from -4G, i.e.

Baoquan, I think the original "bottom-up" is right; it just considers
-68G as up, see the x86_64 mm.txt. We regard vmalloc as a higher address
although, mathematically, it is lower than the positive addresses.

>   * 0x___ and limit EFI VA mapping space to 64G.
>   */
>  static u64 efi_va = EFI_VA_START;
> -- 
> 2.5.5
> 

Thanks
Dave

Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME

2017-03-08 Thread Dave Young

On 03/06/17 at 11:58am, Tom Lendacky wrote:
> On 3/1/2017 3:25 AM, Dave Young wrote:
> > Hi Tom,
> 
> Hi Dave,
> 
> > 
> > On 02/17/17 at 10:43am, Tom Lendacky wrote:
> > > On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > > > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote:
> > > > > Provide support so that kexec can be used to boot a kernel when SME is
> > > > > enabled.
> > > > 
> > > > Is the point of kexec and kdump to ehh, dump memory ? But if the
> > > > rest of the memory is encrypted you won't get much, will you?
> > > 
> > > Kexec can be used to reboot a system without going back through BIOS.
> > > So you can use kexec without using kdump.
> > > 
> > > For kdump, just taking a quick look, the option to enable memory
> > > encryption can be provided on the crash kernel command line and then
> > 
> > Is there a simple way to get the SME status? Probably add some sysfs
> > file for this purpose.
> 
> Currently there is not.  I can look at adding something, maybe just the
> sme_me_mask value, which if non-zero, would indicate SME is active.
> 
> > 
> > > crash kernel can would be able to copy the memory decrypted if the
> > > pagetable is set up properly. It looks like currently ioremap_cache()
> > > is used to map the old memory page.  That might be able to be changed
> > > to a memremap() so that the encryption bit is set in the mapping. That
> > > will mean that memory that is not marked encrypted (EFI tables, swiotlb
> > > memory, etc) would not be read correctly.
> > 
> > Manage to store info about those ranges which are not encrypted so that
> > memremap can handle them?
> 
> I can look into whether something can be done in this area. Any input
> you can provide as to what would be the best way/place to store the
> range info so kdump can make use of it, would be greatly appreciated.

Previously, to support EFI runtime in kexec, I passed some EFI
information via setup_data; see this userspace kexec-tools commit:
e1ffc9e9a0769e1f54185003102e9bec428b84e8. That is what Boris mentioned
as the setup_data use case for kexec.

I suppose you have successfully tested kexec reboot, so the EFI tables you
mentioned would be the areas in old memory that are copied for /proc/vmcore?
If it is only the EFI tables and swiotlb, it may not be worth passing that
information across kexec reboot.

I have no more ideas about this for now.
> 
> > 
> > > 
> > > > 
> > > > Would it make sense to include some printk to the user if they
> > > > are setting up kdump that they won't get anything out of it?
> > > 
> > > Probably a good idea to add something like that.
> > 
> > It will break kdump functionality, it should be fixed instead of
> > just adding printk to warn user..
> 
> I do want kdump to work. I'll investigate further what can be done in
> this area.

Thanks a lot!

Dave
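
To make the "sme_me_mask is non-zero when SME is active" idea from Tom's
reply concrete, here is a hedged user-space sketch. Only the name
sme_me_mask comes from the thread; the helper functions and the idea of
modelling a page table entry as a plain 64-bit value are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 0 means SME is off; otherwise the value is the single
 * page-table bit that marks a page as encrypted. The real bit position
 * is reported by the CPU and is not modelled here.
 */
static uint64_t sme_me_mask;

/* A non-zero mask is the status indicator suggested in the thread. */
static bool sme_is_active(void)
{
	return sme_me_mask != 0;
}

/* Hypothetical helpers: set or clear the encryption bit in a PTE value. */
static uint64_t pte_mk_encrypted(uint64_t pte)
{
	return pte | sme_me_mask;
}

static uint64_t pte_mk_decrypted(uint64_t pte)
{
	return pte & ~sme_me_mask;
}
```

With the mask at zero both helpers leave the entry unchanged, which is why
a single exported value could serve as the status that kdump tooling needs.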

Re: [RFC PATCH v4 25/28] x86: Access the setup data through sysfs decrypted

2017-03-07 Thread Dave Young

On 02/16/17 at 09:47am, Tom Lendacky wrote:
> Use memremap() to map the setup data.  This will make the appropriate
> decision as to whether a RAM remapping can be done or if a fallback to
> ioremap_cache() is needed (similar to the setup data debugfs support).
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/kernel/ksysfs.c |   27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kernel/ksysfs.c b/arch/x86/kernel/ksysfs.c
> index 4afc67f..d653b3e 100644
> --- a/arch/x86/kernel/ksysfs.c
> +++ b/arch/x86/kernel/ksysfs.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -79,12 +80,12 @@ static int get_setup_data_paddr(int nr, u64 *paddr)
>   *paddr = pa_data;
>   return 0;
>   }
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   i++;
>   }
>   return -EINVAL;
> @@ -97,17 +98,17 @@ static int __init get_setup_data_size(int nr, size_t *size)
>   u64 pa_data = boot_params.hdr.setup_data;
>  
>   while (pa_data) {
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>   if (nr == i) {
>   *size = data->len;
> - iounmap(data);
> + memunmap(data);
>   return 0;
>   }
>  
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   i++;
>   }
>   return -EINVAL;
> @@ -127,12 +128,12 @@ static ssize_t type_show(struct kobject *kobj,
>   ret = get_setup_data_paddr(nr, &paddr);
>   if (ret)
>   return ret;
> - data = ioremap_cache(paddr, sizeof(*data));
> + data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
>   ret = sprintf(buf, "0x%x\n", data->type);
> - iounmap(data);
> + memunmap(data);
>   return ret;
>  }
>  
> @@ -154,7 +155,7 @@ static ssize_t setup_data_data_read(struct file *fp,
>   ret = get_setup_data_paddr(nr, &paddr);
>   if (ret)
>   return ret;
> - data = ioremap_cache(paddr, sizeof(*data));
> + data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
>   if (!data)
>   return -ENOMEM;
>  
> @@ -170,15 +171,15 @@ static ssize_t setup_data_data_read(struct file *fp,
>   goto out;
>  
>   ret = count;
> - p = ioremap_cache(paddr + sizeof(*data), data->len);
> + p = memremap(paddr + sizeof(*data), data->len, MEMREMAP_WB);
>   if (!p) {
>   ret = -ENOMEM;
>   goto out;
>   }
>   memcpy(buf, p + off, count);
> - iounmap(p);
> + memunmap(p);
>  out:
> - iounmap(data);
> + memunmap(data);
>   return ret;
>  }
>  
> @@ -250,13 +251,13 @@ static int __init get_setup_data_total_num(u64 pa_data, int *nr)
>   *nr = 0;
>   while (pa_data) {
>   *nr += 1;
> - data = ioremap_cache(pa_data, sizeof(*data));
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
>   if (!data) {
>   ret = -ENOMEM;
>   goto out;
>   }
>   pa_data = data->next;
> - iounmap(data);
> + memunmap(data);
>   }
>  
>  out:
> 

It would be better to send these cleanup patches separately.

Acked-by: Dave Young 

Thanks
Dave
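
The loops in the patch above all follow the same map, read "next", unmap
pattern. A minimal user-space sketch of that walk follows, assuming a toy
byte array in place of physical memory. read_hdr() stands in for the
memremap()/memunmap() pair, put_node() is a helper for building test data,
and all names here are hypothetical rather than taken from ksysfs.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header fields of a setup_data node (payload follows the header). */
struct setup_data_hdr {
	uint64_t next;   /* physical address of the next node, 0 ends the list */
	uint32_t type;
	uint32_t len;    /* payload bytes following the header */
};

/* Toy "physical memory"; a physical address is just an offset into it. */
static unsigned char fake_ram[4096];

/* Hypothetical stand-in for memremap(): copy the header out of fake RAM. */
static int read_hdr(uint64_t pa, struct setup_data_hdr *out)
{
	if (pa == 0 || pa > sizeof(fake_ram) - sizeof(*out))
		return -1;
	memcpy(out, fake_ram + pa, sizeof(*out));
	return 0;
}

/* Find the physical address of node number 'nr' (0-based). */
static int setup_data_paddr(uint64_t head, int nr, uint64_t *paddr)
{
	struct setup_data_hdr hdr;
	uint64_t pa = head;
	int i = 0;

	while (pa) {
		if (i == nr) {
			*paddr = pa;
			return 0;
		}
		if (read_hdr(pa, &hdr))
			return -1;
		pa = hdr.next;
		i++;
	}
	return -1;   /* ran off the end of the list */
}

/* Helper for building a list: place a node header at offset 'pa'. */
static void put_node(uint64_t pa, uint64_t next, uint32_t type, uint32_t len)
{
	struct setup_data_hdr hdr = { .next = next, .type = type, .len = len };

	memcpy(fake_ram + pa, &hdr, sizeof(hdr));
}
```

The sketch returns -1 where the kernel code returns -ENOMEM or -EINVAL; the
point is only the shape of the traversal, not the error handling.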

Re: [RFC PATCH v4 24/28] x86: Access the setup data through debugfs decrypted

2017-03-07 Thread Dave Young

On 02/16/17 at 09:47am, Tom Lendacky wrote:
> Use memremap() to map the setup data.  This simplifies the code and will
> make the appropriate decision as to whether a RAM remapping can be done
> or if a fallback to ioremap_cache() is needed (which includes checking
> PageHighMem).
> 
> Signed-off-by: Tom Lendacky 
> ---
>  arch/x86/kernel/kdebugfs.c |   30 +++---
>  1 file changed, 11 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c
> index bdb83e4..c3d354d 100644
> --- a/arch/x86/kernel/kdebugfs.c
> +++ b/arch/x86/kernel/kdebugfs.c
> @@ -48,17 +48,13 @@ static ssize_t setup_data_read(struct file *file, char __user *user_buf,
>  
>   pa = node->paddr + sizeof(struct setup_data) + pos;
>   pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
> - if (PageHighMem(pg)) {
> - p = ioremap_cache(pa, count);
> - if (!p)
> - return -ENXIO;
> - } else
> - p = __va(pa);
> + p = memremap(pa, count, MEMREMAP_WB);
> + if (!p)
> + return -ENXIO;

-ENOMEM looks better for a memremap() failure; ditto for the other places.

>  
>   remain = copy_to_user(user_buf, p, count);
>  
> - if (PageHighMem(pg))
> - iounmap(p);
> + memunmap(p);
>  
>   if (remain)
>   return -EFAULT;
> @@ -127,15 +123,12 @@ static int __init create_setup_data_nodes(struct dentry *parent)
>   }
>  
>   pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT);
> - if (PageHighMem(pg)) {
> - data = ioremap_cache(pa_data, sizeof(*data));
> - if (!data) {
> - kfree(node);
> - error = -ENXIO;
> - goto err_dir;
> - }
> - } else
> - data = __va(pa_data);
> + data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
> + if (!data) {
> + kfree(node);
> + error = -ENXIO;
> + goto err_dir;
> + }
>  
>   node->paddr = pa_data;
>   node->type = data->type;
> @@ -143,8 +136,7 @@ static int __init create_setup_data_nodes(struct dentry *parent)
>   error = create_setup_data_node(d, no, node);
>   pa_data = data->next;
>  
> - if (PageHighMem(pg))
> - iounmap(data);
> + memunmap(data);
>   if (error)
>   goto err_dir;
>   no++;
> 

Thanks
Dave

Re: [RFC PATCH v4 14/28] Add support to access boot related data in the clear

2017-03-07 Thread Dave Young

On 02/16/17 at 09:45am, Tom Lendacky wrote:
[snip]
> + * This function determines if an address should be mapped encrypted.
> + * Boot setup data, EFI data and E820 areas are checked in making this
> + * determination.
> + */
> +static bool memremap_should_map_encrypted(resource_size_t phys_addr,
> +   unsigned long size)
> +{
> + /*
> +  * SME is not active, return true:
> +  *   - For early_memremap_pgprot_adjust(), returning true or false
> +  * results in the same protection value
> +  *   - For arch_memremap_do_ram_remap(), returning true will allow
> +  * the RAM remap to occur instead of falling back to ioremap()
> +  */
> + if (!sme_active())
> + return true;

From the function name, shouldn't the above return false?

> +
> + /* Check if the address is part of the setup data */
> + if (memremap_is_setup_data(phys_addr, size))
> + return false;
> +
> + /* Check if the address is part of EFI boot/runtime data */
> + switch (efi_mem_type(phys_addr)) {
> + case EFI_BOOT_SERVICES_DATA:
> + case EFI_RUNTIME_SERVICES_DATA:

Are only these two types needed? I'm not sure about this; I am just raising
the question.

> + return false;
> + default:
> + break;
> + }
> +
> + /* Check if the address is outside kernel usable area */
> + switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
> + case E820_TYPE_RESERVED:
> + case E820_TYPE_ACPI:
> + case E820_TYPE_NVS:
> + case E820_TYPE_UNUSABLE:
> + return false;
> + default:
> + break;
> + }
> +
> + return true;
> +}
> +

Thanks
Dave
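
The order of checks in the quoted function can be summarised with the
following simplified sketch. The enum and the should_map_encrypted() name
are hypothetical; the real function classifies the range by calling
memremap_is_setup_data(), efi_mem_type() and e820__get_entry_type(), which
this model collapses into one pre-computed value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified classification of a physical range. */
enum range_kind {
	RANGE_KERNEL_RAM,       /* ordinary usable RAM */
	RANGE_SETUP_DATA,       /* boot setup data */
	RANGE_EFI_DATA,         /* EFI boot/runtime services data */
	RANGE_E820_NOT_USABLE,  /* reserved, ACPI, NVS or unusable E820 entries */
};

/* Mirrors the order of checks in the quoted function, nothing more. */
static bool should_map_encrypted(bool sme_active, enum range_kind kind)
{
	/* As posted: with SME off the answer is still "true". */
	if (!sme_active)
		return true;

	switch (kind) {
	case RANGE_SETUP_DATA:
	case RANGE_EFI_DATA:
	case RANGE_E820_NOT_USABLE:
		return false;   /* must be accessed un-encrypted */
	default:
		return true;
	}
}
```

The first branch is the one questioned above: the result is "true" with SME
inactive even though nothing is encrypted, because, per the quoted comment,
that lets the RAM remap proceed instead of falling back to ioremap().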

Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME

2017-03-01 Thread Dave Young

Add kexec list..

On 03/01/17 at 05:25pm, Dave Young wrote:
> Hi Tom,
> 
> On 02/17/17 at 10:43am, Tom Lendacky wrote:
> > On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote:
> > > > Provide support so that kexec can be used to boot a kernel when SME is
> > > > enabled.
> > > 
> > > Is the point of kexec and kdump to ehh, dump memory ? But if the
> > > rest of the memory is encrypted you won't get much, will you?
> > 
> > Kexec can be used to reboot a system without going back through BIOS.
> > So you can use kexec without using kdump.
> > 
> > For kdump, just taking a quick look, the option to enable memory
> > encryption can be provided on the crash kernel command line and then
> 
> Is there a simple way to get the SME status? Probably add some sysfs
> file for this purpose.
> 
> > crash kernel can would be able to copy the memory decrypted if the
> > pagetable is set up properly. It looks like currently ioremap_cache()
> > is used to map the old memory page.  That might be able to be changed
> > to a memremap() so that the encryption bit is set in the mapping. That
> > will mean that memory that is not marked encrypted (EFI tables, swiotlb
> > memory, etc) would not be read correctly.
> 
> Manage to store info about those ranges which are not encrypted so that
> memremap can handle them?
> 
> > 
> > > 
> > > Would it make sense to include some printk to the user if they
> > > are setting up kdump that they won't get anything out of it?
> > 
> > Probably a good idea to add something like that.
> 
> It will break kdump functionality, it should be fixed instead of
> just adding printk to warn user..
> 
> Thanks
> Dave

Re: [RFC PATCH v4 26/28] x86: Allow kexec to be used with SME

2017-03-01 Thread Dave Young

Hi Tom,

On 02/17/17 at 10:43am, Tom Lendacky wrote:
> On 2/17/2017 9:57 AM, Konrad Rzeszutek Wilk wrote:
> > On Thu, Feb 16, 2017 at 09:47:55AM -0600, Tom Lendacky wrote:
> > > Provide support so that kexec can be used to boot a kernel when SME is
> > > enabled.
> > 
> > Is the point of kexec and kdump to ehh, dump memory ? But if the
> > rest of the memory is encrypted you won't get much, will you?
> 
> Kexec can be used to reboot a system without going back through BIOS.
> So you can use kexec without using kdump.
> 
> For kdump, just taking a quick look, the option to enable memory
> encryption can be provided on the crash kernel command line and then

Is there a simple way to get the SME status? Perhaps a sysfs file could be
added for this purpose.

> crash kernel can would be able to copy the memory decrypted if the
> pagetable is set up properly. It looks like currently ioremap_cache()
> is used to map the old memory page.  That might be able to be changed
> to a memremap() so that the encryption bit is set in the mapping. That
> will mean that memory that is not marked encrypted (EFI tables, swiotlb
> memory, etc) would not be read correctly.

Could we store info about the ranges that are not encrypted so that
memremap can handle them?

> 
> > 
> > Would it make sense to include some printk to the user if they
> > are setting up kdump that they won't get anything out of it?
> 
> Probably a good idea to add something like that.

This will break kdump functionality; it should be fixed instead of
just adding a printk to warn the user.

Thanks
Dave

Re: [RFC PATCH v4 00/28] x86: Secure Memory Encryption (AMD)

2017-03-01 Thread Dave Young

Hi Tom,

On 02/16/17 at 09:41am, Tom Lendacky wrote:
> This RFC patch series provides support for AMD's new Secure Memory
> Encryption (SME) feature.
> 
> SME can be used to mark individual pages of memory as encrypted through the
> page tables. A page of memory that is marked encrypted will be automatically
> decrypted when read from DRAM and will be automatically encrypted when
> written to DRAM. Details on SME can found in the links below.
> 
> The SME feature is identified through a CPUID function and enabled through
> the SYSCFG MSR. Once enabled, page table entries will determine how the
> memory is accessed. If a page table entry has the memory encryption mask set,
> then that memory will be accessed as encrypted memory. The memory encryption
> mask (as well as other related information) is determined from settings
> returned through the same CPUID function that identifies the presence of the
> feature.
> 
> The approach that this patch series takes is to encrypt everything possible
> starting early in the boot where the kernel is encrypted. Using the page
> table macros the encryption mask can be incorporated into all page table
> entries and page allocations. By updating the protection map, userspace
> allocations are also marked encrypted. Certain data must be accounted for
> as having been placed in memory before SME was enabled (EFI, initrd, etc.)
> and accessed accordingly.
> 
> This patch series is a pre-cursor to another AMD processor feature called
> Secure Encrypted Virtualization (SEV). The support for SEV will build upon
> the SME support and will be submitted later. Details on SEV can be found
> in the links below.
> 
> The following links provide additional detail:
> 
> AMD Memory Encryption whitepaper:
>
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
> 
> AMD64 Architecture Programmer's Manual:
>http://support.amd.com/TechDocs/24593.pdf
>SME is section 7.10
>SEV is section 15.34
> 
> This patch series is based off of the master branch of tip.
>   Commit a27cb9e1b2b4 ("Merge branch 'WIP.sched/core'")
> 
> ---
> 
> Still to do: IOMMU enablement support
> 
> Changes since v3:
> - Broke out some of the patches into smaller individual patches
> - Updated Documentation
> - Added a message to indicate why the IOMMU was disabled
> - Updated CPU feature support for SME by taking into account whether
>   BIOS has enabled SME
> - Eliminated redundant functions
> - Added some warning messages for DMA usage of bounce buffers when SME
>   is active
> - Added support for persistent memory
> - Added support to determine when setup data is being mapped and be sure
>   to map it un-encrypted
> - Added CONFIG support to set the default action of whether to activate
>   SME if it is supported/enabled
> - Added support for (re)booting with kexec

Could you please add the kexec list to cc when you update the patches, so
that kexec/kdump people do not miss them?

> 
> Changes since v2:
> - Updated Documentation
> - Make the encryption mask available outside of arch/x86 through a
>   standard include file
> - Conversion of assembler routines to C where possible (not everything
>   could be converted, e.g. the routine that does the actual encryption
>   needs to be copied into a safe location and it is difficult to
>   determine the actual length of the function in order to copy it)
> - Fix SME feature use of scattered CPUID feature
> - Creation of SME specific functions for things like encrypting
>   the setup data, ramdisk, etc.
> - New take on early_memremap / memremap encryption support
> - Additional support for accessing video buffers (fbdev/gpu) as
>   un-encrypted
> - Disable IOMMU for now - need to investigate further in relation to
>   how it needs to be programmed relative to accessing physical memory
> 
> Changes since v1:
> - Added Documentation.
> - Removed AMD vendor check for setting the PAT write protect mode
> - Updated naming of trampoline flag for SME as well as moving of the
>   SME check to before paging is enabled.
> - Change to early_memremap to identify the data being mapped as either
>   boot data or kernel data.  The idea being that boot data will have
>   been placed in memory as un-encrypted data and would need to be accessed
>   as such.
> - Updated debugfs support for the bootparams to access the data properly.
> - Do not set the SYSCFG[MEME] bit, only check it.  The setting of the
>   MemEncryptionModeEn bit results in a reduction of physical address size
>   of the processor.  It is possible that BIOS could have configured resources
>   into a range that will now not be addressable.  To prevent this,
>   rely on BIOS to set the SYSCFG[MEME] bit and only then enable memory
>   encryption support in the kernel.
> 
> Tom Lendacky (28):
>   x86: Documentation for AMD Secure Memory Encryption (SME)
>   x86: Set the write-protect cache mode for full PAT support
>   x86: Add the Secure

Re: [PATCH V2 0/4] efi/x86: move efi bgrt init code to early init

2017-02-02 Thread Dave Young

On 01/27/17 at 05:03pm, Ard Biesheuvel wrote:
> On 16 January 2017 at 02:45, Dave Young  wrote:
> > Hi,
> >
> > Here is the update of the series for moving the bgrt init code to early init.
> >
> > The main changes are:
> > - Move the 1st patch to the last, because it no longer blocks the 2nd patch
> > now that Peter's patch prunes invalid memmap entries:
> > https://git.kernel.org/cgit/linux/kernel/git/efi/efi.git/commit/?h=next&id=b2a91a35445229
> > But it is still good to have, since efi_mem_reserve only cares about
> > boot-related mem ranges.
> >
> > - Other comments about the code itself; for details, please see the patches
> > themselves.
> >
> >  arch/x86/include/asm/efi.h   |1
> >  arch/x86/kernel/acpi/boot.c  |   12 +++
> >  arch/x86/platform/efi/efi-bgrt.c |   59 ---
> >  arch/x86/platform/efi/efi.c  |   26 +++--
> >  arch/x86/platform/efi/quirks.c   |2 -
> >  drivers/acpi/bgrt.c  |   28 +-
> >  drivers/firmware/efi/fake_mem.c  |3 +
> >  drivers/firmware/efi/memmap.c|   22 +-
> >  include/linux/efi-bgrt.h |7 +---
> >  include/linux/efi.h  |5 +--
> >  init/main.c  |1
> >  11 files changed, 92 insertions(+), 74 deletions(-)
> >
> 
> Dave,
> 
> I have pushed these to efi/next: please double check if I did it
> correctly. I had some trouble applying these given that I rebased
> efi/next onto -rc4. However, the fact that you are not using standard
> git cover letters and emails doesn't help things, so could you
> *please* use standard git send-email to post to linux-efi in the
> future? Thanks.

Ard, apologies for the late reply; I was on a one-week holiday. I double-checked
the efi/next commits and I think they are correct. As for the email format,
I use quilt to send the series, so the cover letter is not git formatted.
The patches are based on Linus's mainline tree, which may be the reason for
the trouble. Next time I will check that the series applies to efi/next with
"git am" before sending it out, and will switch to git send-email if it
does not work.

Thanks
Dave

[tip:efi/core] efi/x86: Add debug code to print cooked memmap

2017-02-01 Thread tip-bot for Dave Young

Commit-ID:  22c091d02a5422d2825a4fb1af71e5a62f9e4d0f
Gitweb: http://git.kernel.org/tip/22c091d02a5422d2825a4fb1af71e5a62f9e4d0f
Author: Dave Young 
AuthorDate: Tue, 31 Jan 2017 13:21:41 +
Committer:  Ingo Molnar 
CommitDate: Wed, 1 Feb 2017 08:45:46 +0100

efi/x86: Add debug code to print cooked memmap

It is not obvious if the reserved boot area are added correctly, add a
efi_print_memmap() call to print the new memmap.

Tested-by: Nicolai Stange 
Signed-off-by: Dave Young 
Signed-off-by: Ard Biesheuvel 
Reviewed-by: Nicolai Stange 
Cc: Linus Torvalds 
Cc: Matt Fleming 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1485868902-20401-10-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/platform/efi/efi.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 0d4becf..565dff3 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -955,6 +955,11 @@ static void __init __efi_enter_virtual_mode(void)
return;
}
 
+   if (efi_enabled(EFI_DBG)) {
+   pr_info("EFI runtime memory map:\n");
+   efi_print_memmap();
+   }
+
BUG_ON(!efi.systab);
 
if (efi_setup_page_tables(pa, 1 << pg_shift)) {

[tip:efi/core] efi/x86: Move the EFI BGRT init code to early init code

2017-02-01 Thread tip-bot for Dave Young

Commit-ID:  7b0a911478c74ca02581d496f732c10e811e894f
Gitweb: http://git.kernel.org/tip/7b0a911478c74ca02581d496f732c10e811e894f
Author: Dave Young 
AuthorDate: Tue, 31 Jan 2017 13:21:40 +
Committer:  Ingo Molnar 
CommitDate: Wed, 1 Feb 2017 08:45:46 +0100

efi/x86: Move the EFI BGRT init code to early init code

Before invoking the arch specific handler, efi_mem_reserve() reserves
the given memory region through memblock.

efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which
time memblock is dead and should not be used anymore.

The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI
table, so move parsing of the BGRT table to ACPI early boot code to
ensure that efi_mem_reserve() in EFI BGRT code still use memblock safely.

Tested-by: Bhupesh Sharma 
Signed-off-by: Dave Young 
Signed-off-by: Ard Biesheuvel 
Cc: Len Brown 
Cc: Linus Torvalds 
Cc: Matt Fleming 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Thomas Gleixner 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Link: http://lkml.kernel.org/r/1485868902-20401-9-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 arch/x86/kernel/acpi/boot.c  |  9 ++
 arch/x86/platform/efi/efi-bgrt.c | 59 +---
 arch/x86/platform/efi/efi.c  |  5 
 drivers/acpi/bgrt.c  | 28 +--
 include/linux/efi-bgrt.h | 11 
 init/main.c  |  1 -
 6 files changed, 59 insertions(+), 54 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 64422f8..7ff007e 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1557,6 +1558,12 @@ int __init early_acpi_boot_init(void)
return 0;
 }
 
+static int __init acpi_parse_bgrt(struct acpi_table_header *table)
+{
+   efi_bgrt_init(table);
+   return 0;
+}
+
 int __init acpi_boot_init(void)
 {
/* those are executed after early-quirks are executed */
@@ -1581,6 +1588,8 @@ int __init acpi_boot_init(void)
acpi_process_madt();
 
acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
+   if (IS_ENABLED(CONFIG_ACPI_BGRT))
+   acpi_table_parse(ACPI_SIG_BGRT, acpi_parse_bgrt);
 
if (!acpi_noirq)
x86_init.pci.init = pci_acpi_init;
diff --git a/arch/x86/platform/efi/efi-bgrt.c b/arch/x86/platform/efi/efi-bgrt.c
index 6aad870..04ca876 100644
--- a/arch/x86/platform/efi/efi-bgrt.c
+++ b/arch/x86/platform/efi/efi-bgrt.c
@@ -19,8 +19,7 @@
 #include 
 #include 
 
-struct acpi_table_bgrt *bgrt_tab;
-void *__initdata bgrt_image;
+struct acpi_table_bgrt bgrt_tab;
 size_t __initdata bgrt_image_size;
 
 struct bmp_header {
@@ -28,66 +27,58 @@ struct bmp_header {
u32 size;
 } __packed;
 
-void __init efi_bgrt_init(void)
+void __init efi_bgrt_init(struct acpi_table_header *table)
 {
-   acpi_status status;
void *image;
struct bmp_header bmp_header;
+   struct acpi_table_bgrt *bgrt = &bgrt_tab;
 
if (acpi_disabled)
return;
 
-   status = acpi_get_table("BGRT", 0,
-   (struct acpi_table_header **)&bgrt_tab);
-   if (ACPI_FAILURE(status))
-   return;
-
-   if (bgrt_tab->header.length < sizeof(*bgrt_tab)) {
+   if (table->length < sizeof(bgrt_tab)) {
pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
-  bgrt_tab->header.length, sizeof(*bgrt_tab));
+  table->length, sizeof(bgrt_tab));
return;
}
-   if (bgrt_tab->version != 1) {
+   *bgrt = *(struct acpi_table_bgrt *)table;
+   if (bgrt->version != 1) {
pr_notice("Ignoring BGRT: invalid version %u (expected 1)\n",
-  bgrt_tab->version);
-   return;
+  bgrt->version);
+   goto out;
}
-   if (bgrt_tab->status & 0xfe) {
+   if (bgrt->status & 0xfe) {
pr_notice("Ignoring BGRT: reserved status bits are non-zero %u\n",
-  bgrt_tab->status);
-   return;
+  bgrt->status);
+   goto out;
}
-   if (bgrt_tab->image_type != 0) {
+   if (bgrt->image_type != 0) {
pr_notice("Ignoring BGRT: invalid image type %u (expected 0)\n",
-  bgrt_tab->image_type);
-   return;
+  bgrt->image_type);
+   goto out;
}
-   if (!bgrt_tab->image_address) {
+   if (!bgrt->image_address) {
pr_notice("Ignoring BGRT: null image address\n");
-   return;
+   goto out;
}
 
-

Re: [PATCH V3 1/4] efi/x86: move efi bgrt init code to early init code

2017-01-25 Thread Dave Young

On 01/19/17 at 12:48pm, Ard Biesheuvel wrote:
> On 18 January 2017 at 19:24, Bhupesh Sharma  wrote:
> > On Wed, Jan 18, 2017 at 7:30 PM, Ard Biesheuvel
> >  wrote:
> >> On 18 January 2017 at 13:48, Dave Young  wrote:
> >>> Before invoking the arch specific handler, efi_mem_reserve() reserves
> >>> the given memory region through memblock.
> >>>
> >>> efi_bgrt_init will call efi_mem_reserve after mm_init(), at that time
> >>> memblock is dead and it should not be used any more.
> >>>
> >>> efi bgrt code depends on acpi initialization to get the bgrt acpi table,
> >>> moving bgrt parsing to acpi early boot code can make sure efi_mem_reserve
> >>> in efi bgrt code still use memblock safely.
> >>>
> >>> Signed-off-by: Dave Young 
> >>
> >> This patch looks fine to me now
> >>
> >> Reviewed-by: Ard Biesheuvel 
> >>
> >> but before applying it, I'd like
> >>
> >> - Bhupesh to confirm that this patch is a move in the right direction
> >> with regard to enabling BGRT on arm64/ACPI,
> >
> > I gave the changes from Dave a try on top of the following combination:
> > 4.10-rc3 + efi/next patches not available in 4.10-rc3 + Nicolai's patches
> >
> > and was able to get the BGRT table working properly with OVMF on a
> > QEMU-x86_64 machine. So you can add my tested-by for this patch
> > series.
> >
> > I think this is the right direction for the ARM64 BGRT handling
> > patches as well and I will post a RFC in two phases as suggested by
> > Ard, once Dave's patches are accepted (in efi/next?).
> >
> 
> Thanks!
> 
> >
> >> - an ack from the ACPI maintainers (cc'ed)
> >>
> 
> Rafael, Len? Any objections?


Ping..

> 
> If not, I will go ahead and queue this for v4.11

Ard, thanks, just ignore the last one 4/4 if you think it is risky or
unnecessary. 

> 
> 
> >>> ---
> >>> v1->v2: efi_bgrt_init: check table length first before copying bgrt table
> >>> error checking in drivers/acpi/bgrt.c
> >>> v2->v3: drop #ifdef added before; efi-bgrt.h build warning fix
> >>> since only changed this patch, so I just only resend this one.
> >>>  arch/x86/kernel/acpi/boot.c  |9 +
> >>>  arch/x86/platform/efi/efi-bgrt.c |   59 ---
> >>>  arch/x86/platform/efi/efi.c  |5 ---
> >>>  drivers/acpi/bgrt.c  |   28 +-
> >>>  include/linux/efi-bgrt.h |   11 +++
> >>>  init/main.c  |1
> >>>  6 files changed, 59 insertions(+), 54 deletions(-)
> >>>
> >>> --- linux-x86.orig/arch/x86/kernel/acpi/boot.c
> >>> +++ linux-x86/arch/x86/kernel/acpi/boot.c
> >>> @@ -35,6 +35,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>
> >>>  #include 
> >>>  #include 
> >>> @@ -1557,6 +1558,12 @@ int __init early_acpi_boot_init(void)
> >>> return 0;
> >>>  }
> >>>
> >>> +static int __init acpi_parse_bgrt(struct acpi_table_header *table)
> >>> +{
> >>> +   efi_bgrt_init(table);
> >>> +   return 0;
> >>> +}
> >>> +
> >>>  int __init acpi_boot_init(void)
> >>>  {
> >>> /* those are executed after early-quirks are executed */
> >>> @@ -1581,6 +1588,8 @@ int __init acpi_boot_init(void)
> >>> acpi_process_madt();
> >>>
> >>> acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
> >>> +   if (IS_ENABLED(CONFIG_ACPI_BGRT))
> >>> +   acpi_table_parse(ACPI_SIG_BGRT, acpi_parse_bgrt);
> >>>
> >>> if (!acpi_noirq)
> >>> x86_init.pci.init = pci_acpi_init;
> >>> --- linux-x86.orig/arch/x86/platform/efi/efi-bgrt.c
> >>> +++ linux-x86/arch/x86/platform/efi/efi-bgrt.c
> >>> @@ -19,8 +19,7 @@
> >>>  #include 
> >>>  #include 
> >>>
> >>> -struct acpi_table_bgrt *bgrt_tab;
> >>> -void *__initdata bgrt_image;
> >>> +struct acpi_table_bgrt bgrt_tab;
> >>>  size_t __initdata bgrt_image_size;
> >>>
> >>>  struct bmp_header {
> >>> @@ -28,66 +27,58 @@ struct

Re: [PATCH] /proc/kcore: Update physical address for kcore ram and text

2017-01-24 Thread Dave Young

Hi Pratyush
On 01/25/17 at 10:14am, Pratyush Anand wrote:
> Currently all the p_paddr of PT_LOAD headers are assigned to 0, which is
> not true and could be misleading, since 0 is a valid physical address.

I do not know the history of /proc/kcore, so one question: why was p_paddr
set to 0? Was there a reason for it, and could changing it cause any risk
or breakage?

> 
> User space tools like makedumpfile need to know the physical address for
> PT_LOAD segments of direct mapped regions. Therefore this patch updates
> paddr for such regions. It also sets an invalid paddr (-1) for other
> regions, so that user space tool can know whether a physical address
> provided in PT_LOAD is correct or not.
> 
> Signed-off-by: Pratyush Anand 
> ---
>  fs/proc/kcore.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 0b80ad87b4d6..ea9f3d1ae830 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -373,7 +373,10 @@ static void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff)
>   phdr->p_flags   = PF_R|PF_W|PF_X;
>   phdr->p_offset  = kc_vaddr_to_offset(m->addr) + dataoff;
>   phdr->p_vaddr   = (size_t)m->addr;
> - phdr->p_paddr   = 0;
> + if (m->type == KCORE_RAM || m->type == KCORE_TEXT)
> + phdr->p_paddr   = __pa(m->addr);
> + else
> + phdr->p_paddr   = (elf_addr_t)-1;
>   phdr->p_filesz  = phdr->p_memsz = m->size;
>   phdr->p_align   = PAGE_SIZE;
>   }
> -- 
> 2.9.3
> 

Thanks
Dave
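
For illustration only, here is a minimal userspace sketch of how a tool
could interpret p_paddr if the quoted patch were applied. The helper name
kcore_paddr_is_valid() is hypothetical; it is not taken from makedumpfile
or from the kernel. It relies only on the convention described in the
patch: direct-mapped regions carry a real physical address, everything
else carries (elf_addr_t)-1.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether the physical address stored in a
 * /proc/kcore program header can be trusted. Assumes the convention from
 * the quoted patch, where an all-ones p_paddr marks "no valid physical
 * address". Note that 0 is treated as a valid physical address.
 */
static bool kcore_paddr_is_valid(const Elf64_Phdr *phdr)
{
	assert(phdr != NULL);

	if (phdr->p_type != PT_LOAD)
		return false;

	return phdr->p_paddr != (Elf64_Addr)-1;
}
```

A caller would walk the program headers of /proc/kcore and use p_paddr
only for the segments where this check passes.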

Re: [PATCH v2] x86/crash: Update the stale comment in reserve_crashkernel()

2017-01-23 Thread Dave Young

Hi, Xunlei

On 01/23/17 at 02:48pm, Xunlei Pang wrote:
> CRASH_KERNEL_ADDR_MAX has been missing for a long time,
> update it with more detailed explanation.
> 
> Cc: Robert LeBlanc 
> Cc: Baoquan He 
> Signed-off-by: Xunlei Pang 
> ---
>  arch/x86/kernel/setup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4cfba94..c32a167 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -575,7 +575,9 @@ static void __init reserve_crashkernel(void)
>   /* 0 means: find the address automatically */
>   if (crash_base <= 0) {
>   /*
> -  *  kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
> +  * Set CRASH_ADDR_LOW_MAX upper bound for crash memory
> +  * as old kexec-tools loads bzImage below that, unless
> +  * "crashkernel=size[KMG],high" is specified.

There is already a comment before the definition of those macros, and the
32-bit case, which has a different reason for its 512M limit, is described
there as well.

So it looks better to just drop the one-line comment without adding
further comments here.
>*/
>   crash_base = memblock_find_in_range(CRASH_ALIGN,
>   high ? CRASH_ADDR_HIGH_MAX
> -- 
> 1.8.3.1
> 

Thanks
Dave

[PATCH V3 1/4] efi/x86: move efi bgrt init code to early init code

2017-01-18 Thread Dave Young

Before invoking the arch specific handler, efi_mem_reserve() reserves
the given memory region through memblock.

efi_bgrt_init() will call efi_mem_reserve() after mm_init(), at which time
memblock is dead and should not be used any more.

The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI table,
so moving BGRT parsing to the ACPI early boot code makes sure that
efi_mem_reserve() in the EFI BGRT code can still use memblock safely.

Signed-off-by: Dave Young 
---
v1->v2: efi_bgrt_init: check table length first before copying bgrt table
error checking in drivers/acpi/bgrt.c
v2->v3: drop #ifdef added before; efi-bgrt.h build warning fix
Since only this patch changed, I am resending just this one.
 arch/x86/kernel/acpi/boot.c  |9 +
 arch/x86/platform/efi/efi-bgrt.c |   59 ---
 arch/x86/platform/efi/efi.c  |5 ---
 drivers/acpi/bgrt.c  |   28 +-
 include/linux/efi-bgrt.h |   11 +++
 init/main.c  |1 
 6 files changed, 59 insertions(+), 54 deletions(-)

--- linux-x86.orig/arch/x86/kernel/acpi/boot.c
+++ linux-x86/arch/x86/kernel/acpi/boot.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1557,6 +1558,12 @@ int __init early_acpi_boot_init(void)
return 0;
 }
 
+static int __init acpi_parse_bgrt(struct acpi_table_header *table)
+{
+   efi_bgrt_init(table);
+   return 0;
+}
+
 int __init acpi_boot_init(void)
 {
/* those are executed after early-quirks are executed */
@@ -1581,6 +1588,8 @@ int __init acpi_boot_init(void)
acpi_process_madt();
 
acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
+   if (IS_ENABLED(CONFIG_ACPI_BGRT))
+   acpi_table_parse(ACPI_SIG_BGRT, acpi_parse_bgrt);
 
if (!acpi_noirq)
x86_init.pci.init = pci_acpi_init;
--- linux-x86.orig/arch/x86/platform/efi/efi-bgrt.c
+++ linux-x86/arch/x86/platform/efi/efi-bgrt.c
@@ -19,8 +19,7 @@
 #include 
 #include 
 
-struct acpi_table_bgrt *bgrt_tab;
-void *__initdata bgrt_image;
+struct acpi_table_bgrt bgrt_tab;
 size_t __initdata bgrt_image_size;
 
 struct bmp_header {
@@ -28,66 +27,58 @@ struct bmp_header {
u32 size;
 } __packed;
 
-void __init efi_bgrt_init(void)
+void __init efi_bgrt_init(struct acpi_table_header *table)
 {
-   acpi_status status;
void *image;
struct bmp_header bmp_header;
+   struct acpi_table_bgrt *bgrt = &bgrt_tab;
 
if (acpi_disabled)
return;
 
-   status = acpi_get_table("BGRT", 0,
-   (struct acpi_table_header **)&bgrt_tab);
-   if (ACPI_FAILURE(status))
-   return;
-
-   if (bgrt_tab->header.length < sizeof(*bgrt_tab)) {
+   if (table->length < sizeof(bgrt_tab)) {
pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
-  bgrt_tab->header.length, sizeof(*bgrt_tab));
+  table->length, sizeof(bgrt_tab));
return;
}
-   if (bgrt_tab->version != 1) {
+   *bgrt = *(struct acpi_table_bgrt *)table;
+   if (bgrt->version != 1) {
pr_notice("Ignoring BGRT: invalid version %u (expected 1)\n",
-  bgrt_tab->version);
-   return;
+  bgrt->version);
+   goto out;
}
-   if (bgrt_tab->status & 0xfe) {
+   if (bgrt->status & 0xfe) {
pr_notice("Ignoring BGRT: reserved status bits are non-zero %u\n",
-  bgrt_tab->status);
-   return;
+  bgrt->status);
+   goto out;
}
-   if (bgrt_tab->image_type != 0) {
+   if (bgrt->image_type != 0) {
pr_notice("Ignoring BGRT: invalid image type %u (expected 0)\n",
-  bgrt_tab->image_type);
-   return;
+  bgrt->image_type);
+   goto out;
}
-   if (!bgrt_tab->image_address) {
+   if (!bgrt->image_address) {
pr_notice("Ignoring BGRT: null image address\n");
-   return;
+   goto out;
}
 
-   image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB);
+   image = early_memremap(bgrt->image_address, sizeof(bmp_header));
if (!image) {
pr_notice("Ignoring BGRT: failed to map image header memory\n");
-   return;
+   goto out;
}
 
memcpy(&bmp_header, image, sizeof(bmp_header));
-   memunmap(image);
+   early_memunmap(image, sizeof(bmp_header));
if (bmp_header.id != 0x4d42) {
pr_notice("Ignoring BGRT: Incorrect BMP magic number 0x%x (expected 0x4d42)\n",
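
As a side note on the magic number check above: 0x4d42 is the two bytes
'B' (0x42) and 'M' (0x4d) read as a little-endian 16-bit value. The
following userspace sketch shows the same check on a raw byte buffer. The
struct and function names are hypothetical, and the fields are decoded
byte by byte so that the sketch does not depend on host endianness or on
struct packing, unlike the kernel code, which copies into a packed struct.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two BMP header fields used by the patch. */
struct bmp_header_sketch {
	uint16_t id;	/* magic, "BM" on disk */
	uint32_t size;	/* total file size in bytes */
};

/*
 * Decode the first six bytes of a BMP file. Returns true only if the
 * buffer is long enough and the magic number matches 0x4d42.
 */
static bool bmp_header_parse(const uint8_t *buf, size_t len,
			     struct bmp_header_sketch *out)
{
	if (len < 6)
		return false;

	out->id = (uint16_t)(buf[0] | (buf[1] << 8));
	out->size = (uint32_t)buf[2] |
		    ((uint32_t)buf[3] << 8) |
		    ((uint32_t)buf[4] << 16) |
		    ((uint32_t)buf[5] << 24);

	return out->id == 0x4d42;
}
```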

Re: [PATCH V2 4/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-18 Thread Dave Young

On 01/18/17 at 07:06pm, Dave Young wrote:
> On 01/18/17 at 07:01pm, Dave Young wrote:
> > On 01/17/17 at 08:48pm, Nicolai Stange wrote:
> > > On Tue, Jan 17 2017, Ard Biesheuvel wrote:
> > > 
> > > > On 16 January 2017 at 02:45, Dave Young  wrote:
> > > >> efi_mem_reserve cares only about boot services regions, for making sure
> > > >> later efi_free_boot_services does not free areas which are still useful,
> > > >> such as bgrt image buffer.
> > > >>
> > > >> So add a new argument to efi_memmap_insert for this purpose.
> > > >>
> > > >
> > > > So what happens is we try to efi_mem_reserve() a regions that is not
> > > > bootservices code or data?
> > > > We shouldn't simply ignore it, because it is a serious condition.
> > > 
> > > The efi_mem_desc_lookup() call in efi_arch_mem_reserve() wouldn't return
> > > anything and the latter would
> > > 
> > >   pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
> > > 
> > > then.
> > > 
> > > This is so because efi_mem_desc_lookup() searches only for regions that
> > > either
> > > - are of type EFI_BOOT_SERVICES_DATA or EFI_RUNTIME_SERVICES_DATA
> > > - or which have EFI_MEMORY_RUNTIME set already:
> > > 
> > >   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > >   md->type != EFI_BOOT_SERVICES_DATA &&
> > >   md->type != EFI_RUNTIME_SERVICES_DATA) {
> > >   continue;
> > >   }
> > > 
> > > For EFI_RUNTIME_SERVICES_DATA and EFI_MEMORY_RUNTIME,
> > > efi_arch_mem_reserve() would be a nop.
> > > 
> > > So we're fine here? Do you want to have a more descriptive error message
> > > than "Failed to lookup EFI memory descriptor"?
> > > 
> > > 
> > > For the other checks you suggested in that other thread, i.e. for the
> > > post-slab_is_available() condition and so: let me wait until Dave's
> > > series has stabilized (or even picked) and I'll submit patches for
> > > what remains to be sanity checked then.
> > > 
> > > Also, since Dave eliminated the need for late efi_mem_reserve()'s,
> > > my 20b1e22d01a4 ("x86/efi: Don't allocate memmap through memblock after
> > > mm_init()") should certainly get reverted at some point.
> > 
> > While testing my patches with latest edk2, I found another thing to be
> > fixed, I will repost bgrt patch according to Ard's comment tomorrow,
> > maybe with below patch as another fix to the memblock_alloc late
> > callback.

Please ignore that comment: your patch already addresses this, which
means it is still necessary after the BGRT code is moved early, because
efi_free_boot_services still needs it.

Apologies for the noise.
> > 
> > ---
> >  arch/x86/platform/efi/quirks.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -355,7 +355,7 @@ void __init efi_free_boot_services(void)
> > }
> >  
> > new_size = efi.memmap.desc_size * num_entries;
> > -   new_phys = kzalloc(new_size, GFP_KERNEL);
> 
> Oops, it was memblock_alloc(), this is a middle test code used for
> debugging. Just ignore it.
> 
> > +   new_phys = efi_memmap_alloc(num_entries);
> 
> Maybe a efi_memmap_late_alloc is enough, will see if I can get a better
> version, maybe your previous patch can be dropped partially or just
> kept.
> 
> > if (!new_phys) {
> > pr_err("Failed to allocate new EFI memmap\n");
> > return;
> > 
> > > 
> > > 
> > > Thanks,
> > > 
> > > Nicolai

Re: [PATCH V2 4/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-18 Thread Dave Young

On 01/17/17 at 08:48pm, Nicolai Stange wrote:
> On Tue, Jan 17 2017, Ard Biesheuvel wrote:
> 
> > On 16 January 2017 at 02:45, Dave Young  wrote:
> >> efi_mem_reserve cares only about boot services regions, for making sure
> >> later efi_free_boot_services does not free areas which are still useful,
> >> such as bgrt image buffer.
> >>
> >> So add a new argument to efi_memmap_insert for this purpose.
> >>
> >
> > So what happens is we try to efi_mem_reserve() a regions that is not
> > bootservices code or data?
> > We shouldn't simply ignore it, because it is a serious condition.
> 
> The efi_mem_desc_lookup() call in efi_arch_mem_reserve() wouldn't return
> anything and the latter would
> 
>   pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
> 
> then.
> 
> This is so because efi_mem_desc_lookup() searches only for regions that
> either
> - are of type EFI_BOOT_SERVICES_DATA or EFI_RUNTIME_SERVICES_DATA
> - or which have EFI_MEMORY_RUNTIME set already:
> 
>   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
>   md->type != EFI_BOOT_SERVICES_DATA &&
>   md->type != EFI_RUNTIME_SERVICES_DATA) {
>   continue;
>   }
> 
> For EFI_RUNTIME_SERVICES_DATA and EFI_MEMORY_RUNTIME,
> efi_arch_mem_reserve() would be a nop.
> 
> So we're fine here? Do you want to have a more descriptive error message
> than "Failed to lookup EFI memory descriptor"?
> 
> 
> For the other checks you suggested in that other thread, i.e. for the
> post-slab_is_available() condition and so: let me wait until Dave's
> series has stabilized (or even picked) and I'll submit patches for
> what remains to be sanity checked then.
> 
> Also, since Dave eliminated the need for late efi_mem_reserve()'s,
> my 20b1e22d01a4 ("x86/efi: Don't allocate memmap through memblock after
> mm_init()") should certainly get reverted at some point.

While testing my patches with the latest edk2, I found another thing to
fix. I will repost the bgrt patch according to Ard's comments tomorrow,
maybe with the patch below as another fix for the late memblock_alloc()
call.

---
 arch/x86/platform/efi/quirks.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -355,7 +355,7 @@ void __init efi_free_boot_services(void)
}
 
new_size = efi.memmap.desc_size * num_entries;
-   new_phys = kzalloc(new_size, GFP_KERNEL);
+   new_phys = efi_memmap_alloc(num_entries);
if (!new_phys) {
pr_err("Failed to allocate new EFI memmap\n");
return;

> 
> 
> Thanks,
> 
> Nicolai
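
To make the filter quoted above easier to follow, here is a standalone
sketch of the same predicate. The descriptor struct and the function name
are hypothetical simplifications; only the condition itself is taken from
the quoted snippet. The numeric values of the memory types and of the
EFI_MEMORY_RUNTIME attribute are the ones defined by the UEFI
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE		3
#define EFI_BOOT_SERVICES_DATA		4
#define EFI_RUNTIME_SERVICES_DATA	6
#define EFI_MEMORY_RUNTIME		((uint64_t)0x8000000000000000ULL)

/* Hypothetical, reduced memory descriptor: only the fields the filter reads. */
struct md_sketch {
	uint32_t type;
	uint64_t attribute;
};

/*
 * Returns true if a descriptor passes the filter quoted from
 * efi_mem_desc_lookup(), i.e. if the lookup loop would not "continue"
 * past it. Address range matching is left out of this sketch.
 */
static bool lookup_would_consider(const struct md_sketch *md)
{
	if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
	    md->type != EFI_BOOT_SERVICES_DATA &&
	    md->type != EFI_RUNTIME_SERVICES_DATA)
		return false;

	return true;
}
```

Under this predicate, a boot services code region without the runtime
attribute is skipped, which is the case where the pr_err() quoted above
would be reached.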

Re: [PATCH V2 4/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-18 Thread Dave Young

On 01/18/17 at 07:01pm, Dave Young wrote:
> On 01/17/17 at 08:48pm, Nicolai Stange wrote:
> > On Tue, Jan 17 2017, Ard Biesheuvel wrote:
> > 
> > > On 16 January 2017 at 02:45, Dave Young  wrote:
> > >> efi_mem_reserve cares only about boot services regions, for making sure
> > >> later efi_free_boot_services does not free areas which are still useful,
> > >> such as bgrt image buffer.
> > >>
> > >> So add a new argument to efi_memmap_insert for this purpose.
> > >>
> > >
> > > So what happens is we try to efi_mem_reserve() a regions that is not
> > > bootservices code or data?
> > > We shouldn't simply ignore it, because it is a serious condition.
> > 
> > The efi_mem_desc_lookup() call in efi_arch_mem_reserve() wouldn't return
> > anything and the latter would
> > 
> >   pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
> > 
> > then.
> > 
> > This is so because efi_mem_desc_lookup() searches only for regions that
> > either
> > - are of type EFI_BOOT_SERVICES_DATA or EFI_RUNTIME_SERVICES_DATA
> > - or which have EFI_MEMORY_RUNTIME set already:
> > 
> > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > md->type != EFI_BOOT_SERVICES_DATA &&
> > md->type != EFI_RUNTIME_SERVICES_DATA) {
> > continue;
> > }
> > 
> > For EFI_RUNTIME_SERVICES_DATA and EFI_MEMORY_RUNTIME,
> > efi_arch_mem_reserve() would be a nop.
> > 
> > So we're fine here? Do you want to have a more descriptive error message
> > than "Failed to lookup EFI memory descriptor"?
> > 
> > 
> > For the other checks you suggested in that other thread, i.e. for the
> > post-slab_is_available() condition and so: let me wait until Dave's
> > series has stabilized (or even picked) and I'll submit patches for
> > what remains to be sanity checked then.
> > 
> > Also, since Dave eliminated the need for late efi_mem_reserve()'s,
> > my 20b1e22d01a4 ("x86/efi: Don't allocate memmap through memblock after
> > mm_init()") should certainly get reverted at some point.
> 
> While testing my patches with latest edk2, I found another thing to be
> fixed, I will repost bgrt patch according to Ard's comment tomorrow,
> maybe with below patch as another fix to the memblock_alloc late
> callback.
> 
> ---
>  arch/x86/platform/efi/quirks.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -355,7 +355,7 @@ void __init efi_free_boot_services(void)
>   }
>  
>   new_size = efi.memmap.desc_size * num_entries;
> - new_phys = kzalloc(new_size, GFP_KERNEL);

Oops, this line was originally memblock_alloc(); the kzalloc() here is
intermediate test code I used for debugging. Just ignore it.

> + new_phys = efi_memmap_alloc(num_entries);

Maybe an efi_memmap_late_alloc is enough. I will see if I can come up with
a better version; maybe your previous patch can be partially dropped, or
just kept.

>   if (!new_phys) {
>   pr_err("Failed to allocate new EFI memmap\n");
>   return;
> 
> > 
> > 
> > Thanks,
> > 
> > Nicolai

Re: [PATCH V2 4/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-17 Thread Dave Young

On 01/17/17 at 05:13pm, Ard Biesheuvel wrote:
> On 16 January 2017 at 02:45, Dave Young  wrote:
> > efi_mem_reserve cares only about boot services regions, for making sure
> > later efi_free_boot_services does not free areas which are still useful,
> > such as bgrt image buffer.
> >
> > So add a new argument to efi_memmap_insert for this purpose.
> >
> 
> So what happens if we try to efi_mem_reserve() a region that is not
> bootservices code or data? We shouldn't simply ignore it, because it
> is a serious condition.

efi_mem_reserve is designed to address the boot services memory reservation
issue, but I'm not sure whether we could have other requirements in the
future. If so, efi_mem_reserve itself, or at least its function comment,
would need an update as well. Anyway, I have no strong opinion about this
patch.

> 
> > Signed-off-by: Dave Young 
> > ---
> > v1->v2: only check EFI_BOOT_SERVICES_CODE/_DATA
> >  arch/x86/platform/efi/quirks.c  |2 +-
> >  drivers/firmware/efi/fake_mem.c |3 ++-
> >  drivers/firmware/efi/memmap.c   |6 +-
> >  include/linux/efi.h |4 ++--
> >  4 files changed, 10 insertions(+), 5 deletions(-)
> >
> > --- linux-x86.orig/drivers/firmware/efi/memmap.c
> > +++ linux-x86/drivers/firmware/efi/memmap.c
> > @@ -229,7 +229,7 @@ int __init efi_memmap_split_count(efi_me
> >   * to see how large @buf needs to be.
> >   */
> >  void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> > - struct efi_mem_range *mem)
> > + struct efi_mem_range *mem, bool boot_only)
> >  {
> > u64 m_start, m_end, m_attr;
> > efi_memory_desc_t *md;
> > @@ -262,6 +262,10 @@ void __init efi_memmap_insert(struct efi
> > start = md->phys_addr;
> > end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> >
> > +   if (boot_only && !(md->type == EFI_BOOT_SERVICES_CODE ||
> > +  md->type == EFI_BOOT_SERVICES_DATA))
> > +   continue;
> > +
> > if (m_start <= start && end <= m_end)
> > md->attribute |= m_attr;
> >
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -226,7 +226,7 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > -   efi_memmap_insert(&efi.memmap, new, &mr);
> > +   efi_memmap_insert(&efi.memmap, new, &mr, true);
> > early_memunmap(new, new_size);
> >
> > efi_memmap_install(new_phys, num_entries);
> > --- linux-x86.orig/drivers/firmware/efi/fake_mem.c
> > +++ linux-x86/drivers/firmware/efi/fake_mem.c
> > @@ -85,7 +85,8 @@ void __init efi_fake_memmap(void)
> > }
> >
> > for (i = 0; i < nr_fake_mem; i++)
> > -   efi_memmap_insert(&efi.memmap, new_memmap, &fake_mems[i]);
> > +   efi_memmap_insert(&efi.memmap, new_memmap, &fake_mems[i],
> > + false);
> >
> > /* swap into new EFI memmap */
> > early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
> > --- linux-x86.orig/include/linux/efi.h
> > +++ linux-x86/include/linux/efi.h
> > @@ -959,8 +959,8 @@ extern int __init efi_memmap_install(phy
> >  extern int __init efi_memmap_split_count(efi_memory_desc_t *md,
> >  struct range *range);
> >  extern void __init efi_memmap_insert(struct efi_memory_map *old_memmap,
> > -void *buf, struct efi_mem_range *mem);
> > -
> > +void *buf, struct efi_mem_range *mem,
> > +bool boot_only);
> >  extern int efi_config_init(efi_config_table_type_t *arch_tables);
> >  #ifdef CONFIG_EFI_ESRT
> >  extern void __init efi_esrt_init(void);
> >
> >

Thanks
Dave
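
For readers following the thread, the effect of the boot_only argument in
the quoted patch can be sketched as a small predicate. The function name
is hypothetical; the condition is the one added to efi_memmap_insert() in
the patch, and the memory type values are those from the UEFI
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE	3
#define EFI_BOOT_SERVICES_DATA	4

/*
 * Hypothetical helper mirroring the check added by the patch: with
 * boot_only set, any descriptor that is not boot services code or data
 * is skipped; without it, nothing is skipped by this check.
 */
static bool insert_should_skip(uint32_t type, bool boot_only)
{
	return boot_only && !(type == EFI_BOOT_SERVICES_CODE ||
			      type == EFI_BOOT_SERVICES_DATA);
}
```

In the patch, efi_arch_mem_reserve() passes true and efi_fake_memmap()
passes false, so only the reservation path is restricted to boot services
regions.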

Re: [PATCH V2 1/4] efi/x86: move efi bgrt init code to early init code

2017-01-17 Thread Dave Young

On 01/17/17 at 05:10pm, Ard Biesheuvel wrote:
> On 16 January 2017 at 02:45, Dave Young  wrote:
> > Before invoking the arch specific handler, efi_mem_reserve() reserves
> > the given memory region through memblock.
> >
> > efi_bgrt_init will call efi_mem_reserve after mm_init(), at that time
> > memblock is dead and it should not be used any more.
> >
> > efi bgrt code depends on acpi initialization to get the bgrt acpi table,
> > moving bgrt parsing to acpi early boot code can make sure efi_mem_reserve
> > in efi bgrt code still use memblock safely.
> >
> > Signed-off-by: Dave Young 
> > ---
> > v1->v2: efi_bgrt_init: check table length first before copying bgrt table
> > error checking in drivers/acpi/bgrt.c
> >  arch/x86/kernel/acpi/boot.c  |   12 +++
> >  arch/x86/platform/efi/efi-bgrt.c |   59 ---
> >  arch/x86/platform/efi/efi.c  |5 ---
> >  drivers/acpi/bgrt.c  |   28 +-
> >  include/linux/efi-bgrt.h |7 +---
> >  init/main.c  |1
> >  6 files changed, 60 insertions(+), 52 deletions(-)
> >
> > --- linux-x86.orig/arch/x86/kernel/acpi/boot.c
> > +++ linux-x86/arch/x86/kernel/acpi/boot.c
> > @@ -35,6 +35,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -1557,6 +1558,14 @@ int __init early_acpi_boot_init(void)
> > return 0;
> >  }
> >
> > +#ifdef CONFIG_ACPI_BGRT
> > +static int __init acpi_parse_bgrt(struct acpi_table_header *table)
> > +{
> > +   efi_bgrt_init(table);
> > +   return 0;
> > +}
> > +#endif
> > +
> 
> Please drop the #ifdef / #endif
> 
> >  int __init acpi_boot_init(void)
> >  {
> > /* those are executed after early-quirks are executed */
> > @@ -1581,6 +1590,9 @@ int __init acpi_boot_init(void)
> > acpi_process_madt();
> >
> > acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
> > +#ifdef CONFIG_ACPI_BGRT
> 
> Please replace with
> 
> if (IS_ENABLED(CONFIG_ACPI_BGRT))
> 
> > +   acpi_table_parse("BGRT", acpi_parse_bgrt);
> > +#endif
> >
> 
> and perhaps we should add a #define for ACPI_SIG_BGRT as well?

There is already an ACPI_SIG_BGRT in the acpi header file; I will use it
and switch to if (IS_ENABLED..) as you mentioned.

> 
> > if (!acpi_noirq)
> > x86_init.pci.init = pci_acpi_init;
> > --- linux-x86.orig/arch/x86/platform/efi/efi-bgrt.c
> > +++ linux-x86/arch/x86/platform/efi/efi-bgrt.c
> > @@ -19,8 +19,7 @@
> >  #include 
> >  #include 
> >
> > -struct acpi_table_bgrt *bgrt_tab;
> > -void *__initdata bgrt_image;
> > +struct acpi_table_bgrt bgrt_tab;
> >  size_t __initdata bgrt_image_size;
> >
> >  struct bmp_header {
> > @@ -28,66 +27,58 @@ struct bmp_header {
> > u32 size;
> >  } __packed;
> >
> > -void __init efi_bgrt_init(void)
> > +void __init efi_bgrt_init(struct acpi_table_header *table)
> >  {
> > -   acpi_status status;
> > void *image;
> > struct bmp_header bmp_header;
> > +   struct acpi_table_bgrt *bgrt = &bgrt_tab;
> >
> > if (acpi_disabled)
> > return;
> >
> > -   status = acpi_get_table("BGRT", 0,
> > -   (struct acpi_table_header **)&bgrt_tab);
> > -   if (ACPI_FAILURE(status))
> > -   return;
> > -
> > -   if (bgrt_tab->header.length < sizeof(*bgrt_tab)) {
> > +   if (table->length < sizeof(bgrt_tab)) {
> > pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
> > -  bgrt_tab->header.length, sizeof(*bgrt_tab));
> > +  table->length, sizeof(bgrt_tab));
> > return;
> > }
> > -   if (bgrt_tab->version != 1) {
> > +   *bgrt = *(struct acpi_table_bgrt *)table;
> > +   if (bgrt->version != 1) {
> > pr_notice("Ignoring BGRT: invalid version %u (expected 1)\n",
> > -  bgrt_tab->version);
> > -   return;
> > +  bgrt->version);
> > +   goto out;
> > }
> > -   if (bgrt_tab->status & 0xfe) {
> > +   if (bgrt->status & 0xfe) {
> >

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-15 Thread Dave Young

On 01/13/17 at 01:21pm, Nicolai Stange wrote:
> On Fri, Jan 13 2017, Dave Young wrote:
> 
> > On 01/13/17 at 10:21am, Dave Young wrote:
> >> On 01/13/17 at 12:11am, Nicolai Stange wrote:
> >> > On Fri, Jan 13 2017, Dave Young wrote:
> >> > 
> >> > > On 01/12/17 at 12:54pm, Nicolai Stange wrote:
> >> > >> On Thu, Jan 12 2017, Dave Young wrote:
> >> > >> 
> >> > >> > -void __init efi_bgrt_init(void)
> >> > >> > +void __init efi_bgrt_init(struct acpi_table_header *table)
> >> > >> >  {
> >> > >> > -   acpi_status status;
> >> > >> > void *image;
> >> > >> > struct bmp_header bmp_header;
> >> > >> >  
> >> > >> > if (acpi_disabled)
> >> > >> > return;
> >> > >> >  
> >> > >> > -   status = acpi_get_table("BGRT", 0,
> >> > >> > -   (struct acpi_table_header **)&bgrt_tab);
> >> > >> > -   if (ACPI_FAILURE(status))
> >> > >> > -   return;
> >> > >> 
> >> > >> 
> >> > >> Not sure, but wouldn't it be safer to reverse the order of this
> >> > >> assignment
> >> > >> 
> >> > >> > +   bgrt_tab = *(struct acpi_table_bgrt *)table;
> >> > >
> >> > > Nicolai, sorry, I'm not sure I understand the comment, is it
> >> > > about above line?
> >> > > Could you elaborate a bit?
> >> > >
> >> > >> 
> >> > >> and this length check
> >> > >> 
> >> > >
> >> > > I also do not get this :(
> >> > 
> >> > Ah sorry, my point is this: the length check should perhaps be made
> >> > before doing the assignment to bgrt_tab because otherwise, we might end
> >> > up reading from invalid memory.
> >> > 
> >> > I.e. if (struct acpi_table_bgrt *)table->length < sizeof(bgrt_tab), then
> >> > 
> >> >   bgrt_tab = *(struct acpi_table_bgrt *)table;
> >> > 
> >> > would read past the table's end.
> >> > 
> >> > I'm not sure whether this is a real problem though -- that is, whether
> >> > this read could ever hit some unmapped memory.
> >> 
> >> Nicolai, thanks for the explanation. It makes sense to move it to even later
> >> at the end of the function.
> >
> > Indeed, the assignment should come after the length check, but with another
> > tmp variable the assignment to the global var can be moved to the end to
> > avoid clearing the image_address field.
> 
> I had a look at your updated patches at
> http://people.redhat.com/~ruyang/efi-bgrt/ and they look fine to me.

Many thanks~

> 
> One minor remark:
> 
> sizeof(acpi_table_bgrt) == 56 and it might be better to avoid the extra
> tmp copy in efi_bgrt_init() by
> - assigning directly to bgrt_tab
> - do a 'goto err' rather than a 'return' from all the error paths
> - do a memset(&bgrt_tab, 0, sizeof(bgrt_tab)) at 'err:'

Updated in V2, indeed text size shrunk from 1199 to 762.
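
The pattern discussed here (check the length first, copy directly into
the global, and clear the global on any error path) can be sketched in
userspace as follows. All struct and function names are hypothetical and
the table layout is heavily simplified; this is not the real
struct acpi_table_bgrt, only an illustration of the control flow.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the ACPI table types. */
struct table_header_sketch {
	uint32_t length;	/* total table length in bytes */
};

struct bgrt_sketch {
	struct table_header_sketch header;
	uint16_t version;
	uint64_t image_address;
};

static struct bgrt_sketch bgrt_tab_sketch;

static bool bgrt_init_sketch(const struct table_header_sketch *table)
{
	/* Length check first: nothing has been copied yet, so plain return. */
	if (table->length < sizeof(bgrt_tab_sketch))
		return false;

	/* Safe now: the source is known to be at least this large. */
	memcpy(&bgrt_tab_sketch, table, sizeof(bgrt_tab_sketch));

	if (bgrt_tab_sketch.version != 1)
		goto err;
	if (!bgrt_tab_sketch.image_address)
		goto err;

	return true;
err:
	/* Do not leave a half-validated table visible to later users. */
	memset(&bgrt_tab_sketch, 0, sizeof(bgrt_tab_sketch));
	return false;
}
```

Compared with validating an on-stack copy and assigning at the end, this
avoids the extra struct copy at the cost of one memset() on the error
path.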

> 
> 
> With the copy to the on-stack 'bgrt', gcc 6.2.0 emits this for each of
> the two copies:
> 
>   41:   8a 07      mov    (%rdi),%al
>   43:   88 45 d7   mov    %al,-0x29(%rbp)
>   46:   8a 47 01   mov    0x1(%rdi),%al
>   49:   88 45 d6   mov    %al,-0x2a(%rbp)
>   4c:   8a 47 02   mov    0x2(%rdi),%al
>   4f:   88 45 d5   mov    %al,-0x2b(%rbp)
>   52:   8a 47 03   mov    0x3(%rdi),%al
>   55:   88 45 d4   mov    %al,-0x2c(%rbp)
>   58:   8a 47 08   mov    0x8(%rdi),%al
>   5b:   88 45 d3   mov    %al,-0x2d(%rbp)
>   5e:   8a 47 09   mov    0x9(%rdi),%al
>   61:   88 45 d2   mov    %al,-0x2e(%rbp)
>   64:   8a 47 0a   mov    0xa(%rdi),%al
>   67:   88 45 d1   mov    %al,-0x2f(%rbp)
>   6a:   8a 47 0b   mov    0xb(%rdi),%al
>   6d:   88 45 d0   mov    %al,-0x30(%rbp)
>   70:   8a 47 0c   mov    0xc(%rdi),%al
>   73:   88 45 cf   mov    %al,-0x31(%rbp)
>   76:   8a 47 0d   mov    0xd(%rdi),%al
>   79:   88 45 ce   mov    %al,-0x32(%rbp)
>   7c:   8a 47 0e
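
The error-handling shape suggested in this message (assign directly to the
global, jump to a single error label from every failure path, and zero the
global there) can be sketched in user-space C roughly as follows. This is only
an illustration under stated assumptions: struct fake_bgrt, its field layout
and parse_bgrt() are hypothetical stand-ins, not the kernel's
struct acpi_table_bgrt or efi_bgrt_init().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the ACPI BGRT table. */
struct fake_bgrt {
	uint32_t length;        /* total table length in bytes */
	uint16_t version;       /* expected to be 1 */
	uint8_t  status;        /* bits covered by the 0xfe mask must be zero */
	uint8_t  image_type;    /* expected to be 0 */
	uint64_t image_address; /* address of the image */
};

/* Global copy; a zero image_address means "no usable BGRT". */
static struct fake_bgrt bgrt_tab;

static int parse_bgrt(const struct fake_bgrt *table)
{
	/* Check the length first so the copy cannot read past the table. */
	if (table->length < sizeof(bgrt_tab))
		return -1;

	/* Assign directly to the global, no temporary copy. */
	bgrt_tab = *table;

	if (bgrt_tab.version != 1)
		goto err;
	if (bgrt_tab.status & 0xfe)
		goto err;
	if (bgrt_tab.image_type != 0)
		goto err;
	if (!bgrt_tab.image_address)
		goto err;

	return 0;
err:
	/* One place to wipe the global for every failure after the copy. */
	memset(&bgrt_tab, 0, sizeof(bgrt_tab));
	return -1;
}
```

Later code then only needs to test bgrt_tab.image_address to learn whether
parsing succeeded.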

[PATCH V2 4/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-15 Thread Dave Young

efi_mem_reserve() cares only about boot services regions: it makes sure that a
later efi_free_boot_services() does not free areas which are still useful,
such as the BGRT image buffer.

So add a new argument to efi_memmap_insert for this purpose.
 
Signed-off-by: Dave Young 
---
v1->v2: only check EFI_BOOT_SERVICES_CODE/_DATA
 arch/x86/platform/efi/quirks.c  |2 +-
 drivers/firmware/efi/fake_mem.c |3 ++-
 drivers/firmware/efi/memmap.c   |6 +-
 include/linux/efi.h |4 ++--
 4 files changed, 10 insertions(+), 5 deletions(-)

--- linux-x86.orig/drivers/firmware/efi/memmap.c
+++ linux-x86/drivers/firmware/efi/memmap.c
@@ -229,7 +229,7 @@ int __init efi_memmap_split_count(efi_me
  * to see how large @buf needs to be.
  */
 void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
- struct efi_mem_range *mem)
+ struct efi_mem_range *mem, bool boot_only)
 {
u64 m_start, m_end, m_attr;
efi_memory_desc_t *md;
@@ -262,6 +262,10 @@ void __init efi_memmap_insert(struct efi
start = md->phys_addr;
end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
 
+   if (boot_only && !(md->type == EFI_BOOT_SERVICES_CODE ||
+  md->type == EFI_BOOT_SERVICES_DATA))
+   continue;
+
if (m_start <= start && end <= m_end)
md->attribute |= m_attr;
 
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -226,7 +226,7 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}
 
-   efi_memmap_insert(&efi.memmap, new, &mr);
+   efi_memmap_insert(&efi.memmap, new, &mr, true);
early_memunmap(new, new_size);
 
efi_memmap_install(new_phys, num_entries);
--- linux-x86.orig/drivers/firmware/efi/fake_mem.c
+++ linux-x86/drivers/firmware/efi/fake_mem.c
@@ -85,7 +85,8 @@ void __init efi_fake_memmap(void)
}
 
for (i = 0; i < nr_fake_mem; i++)
-   efi_memmap_insert(&efi.memmap, new_memmap, &fake_mems[i]);
+   efi_memmap_insert(&efi.memmap, new_memmap, &fake_mems[i],
+ false);
 
/* swap into new EFI memmap */
early_memunmap(new_memmap, efi.memmap.desc_size * new_nr_map);
--- linux-x86.orig/include/linux/efi.h
+++ linux-x86/include/linux/efi.h
@@ -959,8 +959,8 @@ extern int __init efi_memmap_install(phy
 extern int __init efi_memmap_split_count(efi_memory_desc_t *md,
 struct range *range);
 extern void __init efi_memmap_insert(struct efi_memory_map *old_memmap,
-void *buf, struct efi_mem_range *mem);
-
+void *buf, struct efi_mem_range *mem,
+bool boot_only);
 extern int efi_config_init(efi_config_table_type_t *arch_tables);
 #ifdef CONFIG_EFI_ESRT
 extern void __init efi_esrt_init(void);
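
The effect of the new boot_only argument can be sketched outside the kernel as
below. The memory type numbers follow the UEFI specification; struct mem_desc,
skip_desc() and mark_descs() are hypothetical names for this illustration, and
the range overlap logic of the real efi_memmap_insert() is deliberately left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Memory type values as defined by the UEFI specification. */
enum {
	EFI_RESERVED_TYPE = 0,
	EFI_LOADER_CODE = 1,
	EFI_LOADER_DATA = 2,
	EFI_BOOT_SERVICES_CODE = 3,
	EFI_BOOT_SERVICES_DATA = 4,
	EFI_RUNTIME_SERVICES_CODE = 5,
	EFI_RUNTIME_SERVICES_DATA = 6,
	EFI_CONVENTIONAL_MEMORY = 7,
};

/* Hypothetical, trimmed-down memory descriptor. */
struct mem_desc {
	uint32_t type;
	uint64_t attribute;
};

/* Same predicate as the hunk above: with boot_only set, everything that is
 * not a boot services region is skipped. */
static bool skip_desc(const struct mem_desc *md, bool boot_only)
{
	return boot_only && !(md->type == EFI_BOOT_SERVICES_CODE ||
			      md->type == EFI_BOOT_SERVICES_DATA);
}

/* OR attr into every descriptor that is not skipped; return how many
 * descriptors were touched. */
static int mark_descs(struct mem_desc *map, int n, uint64_t attr,
		      bool boot_only)
{
	int touched = 0;

	for (int i = 0; i < n; i++) {
		if (skip_desc(&map[i], boot_only))
			continue;
		map[i].attribute |= attr;
		touched++;
	}
	return touched;
}
```

With boot_only set to false the behaviour is unchanged, which is why the
fake_mem caller passes false.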

[PATCH V2 3/4] efi/x86: add debug code to print cooked memmap

2017-01-15 Thread Dave Young

It is not obvious whether the reserved boot areas are added correctly, so add
an efi_print_memmap() call to print the new memmap.

Signed-off-by: Dave Young 
Acked-by: Ard Biesheuvel 
---
 arch/x86/platform/efi/efi.c |5 +
 1 file changed, 5 insertions(+)

--- linux-x86.orig/arch/x86/platform/efi/efi.c
+++ linux-x86/arch/x86/platform/efi/efi.c
@@ -943,6 +943,11 @@ static void __init __efi_enter_virtual_m
return;
}
 
+   if (efi_enabled(EFI_DBG)) {
+   pr_info("EFI runtime memory map:\n");
+   efi_print_memmap();
+   }
+
BUG_ON(!efi.systab);
 
if (efi_setup_page_tables(pa, 1 << pg_shift)) {

[PATCH V2 0/4] efi/x86: move efi bgrt init code to early init

2017-01-15 Thread Dave Young

Hi,

Here is the update of the series for moving bgrt init code to early init.

Main changes are:
- Move the 1st patch to the last because it does not block the 2nd patch
any more with Peter's patch to prune invalid memmap entries:
https://git.kernel.org/cgit/linux/kernel/git/efi/efi.git/commit/?h=next&id=b2a91a35445229
But it is still good to have since efi_mem_reserve only cares about boot related
mem ranges.

- Other comments about the code itself; for details please see the patches themselves.

 arch/x86/include/asm/efi.h   |1 
 arch/x86/kernel/acpi/boot.c  |   12 +++
 arch/x86/platform/efi/efi-bgrt.c |   59 ---
 arch/x86/platform/efi/efi.c  |   26 +++--
 arch/x86/platform/efi/quirks.c   |2 -
 drivers/acpi/bgrt.c  |   28 +-
 drivers/firmware/efi/fake_mem.c  |3 +
 drivers/firmware/efi/memmap.c|   22 +-
 include/linux/efi-bgrt.h |7 +---
 include/linux/efi.h  |5 +--
 init/main.c  |1 
 11 files changed, 92 insertions(+), 74 deletions(-)

Thanks
Dave

[PATCH V2 2/4] efi/x86: move efi_print_memmap to drivers/firmware/efi/memmap.c

2017-01-15 Thread Dave Young

Signed-off-by: Dave Young 
---
v1->v2: move efi_print_memmap declaration to general header file 
 arch/x86/include/asm/efi.h|1 -
 arch/x86/platform/efi/efi.c   |   16 
 drivers/firmware/efi/memmap.c |   16 
 include/linux/efi.h   |1 +
 4 files changed, 17 insertions(+), 17 deletions(-)

--- linux-x86.orig/arch/x86/platform/efi/efi.c
+++ linux-x86/arch/x86/platform/efi/efi.c
@@ -278,22 +278,6 @@ static void __init efi_clean_memmap(void
}
 }
 
-void __init efi_print_memmap(void)
-{
-   efi_memory_desc_t *md;
-   int i = 0;
-
-   for_each_efi_memory_desc(md) {
-   char buf[64];
-
-   pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
-   i++, efi_md_typeattr_format(buf, sizeof(buf), md),
-   md->phys_addr,
-   md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
-   (md->num_pages >> (20 - EFI_PAGE_SHIFT)));
-   }
-}
-
 static int __init efi_systab_init(void *phys)
 {
if (efi_enabled(EFI_64BIT)) {
--- linux-x86.orig/drivers/firmware/efi/memmap.c
+++ linux-x86/drivers/firmware/efi/memmap.c
@@ -10,6 +10,22 @@
 #include 
 #include 
 
+void __init efi_print_memmap(void)
+{
+   efi_memory_desc_t *md;
+   int i = 0;
+
+   for_each_efi_memory_desc(md) {
+   char buf[64];
+
+   pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
+   i++, efi_md_typeattr_format(buf, sizeof(buf), md),
+   md->phys_addr,
+   md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
+   (md->num_pages >> (20 - EFI_PAGE_SHIFT)));
+   }
+}
+
 /**
  * __efi_memmap_init - Common code for mapping the EFI memory map
  * @data: EFI memory map data
--- linux-x86.orig/arch/x86/include/asm/efi.h
+++ linux-x86/arch/x86/include/asm/efi.h
@@ -116,7 +116,6 @@ extern void __init efi_set_executable(ef
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
-extern void __init efi_print_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
 extern void __init efi_map_region_fixed(efi_memory_desc_t *md);
--- linux-x86.orig/include/linux/efi.h
+++ linux-x86/include/linux/efi.h
@@ -949,6 +949,7 @@ static inline efi_status_t efi_query_var
return EFI_SUCCESS;
 }
 #endif
+extern void __init efi_print_memmap(void);
 extern void __iomem *efi_lookup_mapped_addr(u64 phys_addr);
 
 extern int __init efi_memmap_init_early(struct efi_memory_map_data *data);
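
The range and size arithmetic used in the pr_info() call of efi_print_memmap()
can be checked in isolation. The sketch below assumes the usual 4 KiB EFI page
size (EFI_PAGE_SHIFT of 12); desc_end() and desc_size_mb() are hypothetical
helper names introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT 12 /* EFI pages are 4 KiB */

/* Last byte covered by a descriptor (inclusive end address). */
static uint64_t desc_end(uint64_t phys_addr, uint64_t num_pages)
{
	return phys_addr + (num_pages << EFI_PAGE_SHIFT) - 1;
}

/* Size in MiB, rounded down: 2^(20 - 12) = 256 pages make one MiB. */
static uint64_t desc_size_mb(uint64_t num_pages)
{
	return num_pages >> (20 - EFI_PAGE_SHIFT);
}
```

Because the size is rounded down, any region smaller than 1 MiB is printed as
(0MB).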

[PATCH V2 1/4] efi/x86: move efi bgrt init code to early init code

2017-01-15 Thread Dave Young

Before invoking the arch specific handler, efi_mem_reserve() reserves
the given memory region through memblock.

efi_bgrt_init will call efi_mem_reserve after mm_init(), at that time
memblock is dead and it should not be used any more.

The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI table;
moving BGRT parsing to the ACPI early boot code makes sure that
efi_mem_reserve() in the EFI BGRT code can still use memblock safely.

Signed-off-by: Dave Young 
---
v1->v2: efi_bgrt_init: check table length first before copying bgrt table
error checking in drivers/acpi/bgrt.c
 arch/x86/kernel/acpi/boot.c  |   12 +++
 arch/x86/platform/efi/efi-bgrt.c |   59 ---
 arch/x86/platform/efi/efi.c  |5 ---
 drivers/acpi/bgrt.c  |   28 +-
 include/linux/efi-bgrt.h |7 +---
 init/main.c  |1 
 6 files changed, 60 insertions(+), 52 deletions(-)

--- linux-x86.orig/arch/x86/kernel/acpi/boot.c
+++ linux-x86/arch/x86/kernel/acpi/boot.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1557,6 +1558,14 @@ int __init early_acpi_boot_init(void)
return 0;
 }
 
+#ifdef CONFIG_ACPI_BGRT
+static int __init acpi_parse_bgrt(struct acpi_table_header *table)
+{
+   efi_bgrt_init(table);
+   return 0;
+}
+#endif
+
 int __init acpi_boot_init(void)
 {
/* those are executed after early-quirks are executed */
@@ -1581,6 +1590,9 @@ int __init acpi_boot_init(void)
acpi_process_madt();
 
acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
+#ifdef CONFIG_ACPI_BGRT
+   acpi_table_parse("BGRT", acpi_parse_bgrt);
+#endif
 
if (!acpi_noirq)
x86_init.pci.init = pci_acpi_init;
--- linux-x86.orig/arch/x86/platform/efi/efi-bgrt.c
+++ linux-x86/arch/x86/platform/efi/efi-bgrt.c
@@ -19,8 +19,7 @@
 #include 
 #include 
 
-struct acpi_table_bgrt *bgrt_tab;
-void *__initdata bgrt_image;
+struct acpi_table_bgrt bgrt_tab;
 size_t __initdata bgrt_image_size;
 
 struct bmp_header {
@@ -28,66 +27,58 @@ struct bmp_header {
u32 size;
 } __packed;
 
-void __init efi_bgrt_init(void)
+void __init efi_bgrt_init(struct acpi_table_header *table)
 {
-   acpi_status status;
void *image;
struct bmp_header bmp_header;
+   struct acpi_table_bgrt *bgrt = &bgrt_tab;
 
if (acpi_disabled)
return;
 
-   status = acpi_get_table("BGRT", 0,
-   (struct acpi_table_header **)&bgrt_tab);
-   if (ACPI_FAILURE(status))
-   return;
-
-   if (bgrt_tab->header.length < sizeof(*bgrt_tab)) {
+   if (table->length < sizeof(bgrt_tab)) {
pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
-  bgrt_tab->header.length, sizeof(*bgrt_tab));
+  table->length, sizeof(bgrt_tab));
return;
}
-   if (bgrt_tab->version != 1) {
+   *bgrt = *(struct acpi_table_bgrt *)table;
+   if (bgrt->version != 1) {
pr_notice("Ignoring BGRT: invalid version %u (expected 1)\n",
-  bgrt_tab->version);
-   return;
+  bgrt->version);
+   goto out;
}
-   if (bgrt_tab->status & 0xfe) {
+   if (bgrt->status & 0xfe) {
pr_notice("Ignoring BGRT: reserved status bits are non-zero %u\n",
-  bgrt_tab->status);
-   return;
+  bgrt->status);
+   goto out;
}
-   if (bgrt_tab->image_type != 0) {
+   if (bgrt->image_type != 0) {
pr_notice("Ignoring BGRT: invalid image type %u (expected 0)\n",
-  bgrt_tab->image_type);
-   return;
+  bgrt->image_type);
+   goto out;
}
-   if (!bgrt_tab->image_address) {
+   if (!bgrt->image_address) {
pr_notice("Ignoring BGRT: null image address\n");
-   return;
+   goto out;
}
 
-   image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB);
+   image = early_memremap(bgrt->image_address, sizeof(bmp_header));
if (!image) {
pr_notice("Ignoring BGRT: failed to map image header memory\n");
-   return;
+   goto out;
}
 
memcpy(&bmp_header, image, sizeof(bmp_header));
-   memunmap(image);
+   early_memunmap(image, sizeof(bmp_header));
if (bmp_header.id != 0x4d42) {
pr_notice("Ignoring BGRT: Incorrect BMP magic number 0x%x (expected 0x4d42)\n",
bmp_header.id);
-   return;
+   goto out;
}
bgrt_

Re: [PATCH 1/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-13 Thread Dave Young

On 01/13/17 at 05:20am, Dave Young wrote:
> On 01/12/17 at 04:15pm, Ard Biesheuvel wrote:
> > Hello Dave,
> > 
> > On 12 January 2017 at 09:41, Dave Young  wrote:
> > > There are memory ranges like the ones below when testing early efi_mem_reserve:
> > >
> > > efi: mem62: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > > efi: mem63: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > > efi: mem64: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > > efi: mem65: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > > efi: mem66: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > > efi: mem67: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > > range=[0x-0x] (0MB)
> > >
> > 
> > Did you spot Peter's patch to prune invalid memmap entries?
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/efi/efi.git/commit/?h=next&id=b2a91a35445229
> > 
> > I would expect this patch to no longer be necessary with that in place, no?
> 
> Ard, good suggestion, I did not notice that patch, will try. Actually
> I'm not sure about fake_mem handling, if we can filter out the invalid
> ranges then it will be natural to drop this one after a test.

The commit works for me. I updated the series.

I addressed the known comments and moved patch 1 to 4/4, now checking only boot
services regions (the loader area check is dropped). I plan to send the series
out next Monday to see if there are more comments.

If anyone wants to try them now, feel free to download from the URL below:
http://people.redhat.com/~ruyang/efi-bgrt/

Thanks
Dave

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-12 Thread Dave Young

On 01/13/17 at 10:21am, Dave Young wrote:
> On 01/13/17 at 12:11am, Nicolai Stange wrote:
> > On Fri, Jan 13 2017, Dave Young wrote:
> > 
> > > On 01/12/17 at 12:54pm, Nicolai Stange wrote:
> > >> On Thu, Jan 12 2017, Dave Young wrote:
> > >> 
> > >> > -void __init efi_bgrt_init(void)
> > >> > +void __init efi_bgrt_init(struct acpi_table_header *table)
> > >> >  {
> > >> > -  acpi_status status;
> > >> >void *image;
> > >> >struct bmp_header bmp_header;
> > >> >  
> > >> >if (acpi_disabled)
> > >> >return;
> > >> >  
> > >> > -  status = acpi_get_table("BGRT", 0,
> > >> > -  (struct acpi_table_header **)&bgrt_tab);
> > >> > -  if (ACPI_FAILURE(status))
> > >> > -  return;
> > >> 
> > >> 
> > >> Not sure, but wouldn't it be safer to reverse the order of this assignment
> > >> 
> > >> > +  bgrt_tab = *(struct acpi_table_bgrt *)table;
> > >
> > > Nicolai, sorry, I'm not sure I understand the comment, is it about the line above?
> > > Could you elaborate a bit?
> > >
> > >> 
> > >> and this length check
> > >> 
> > >
> > > I also do not get this :(
> > 
> > Ah sorry, my point is this: the length check should perhaps be made
> > before doing the assignment to bgrt_tab because otherwise, we might end
> > up reading from invalid memory.
> > 
> > I.e. if (struct acpi_table_bgrt *)table->length < sizeof(bgrt_tab), then
> > 
> >   bgrt_tab = *(struct acpi_table_bgrt *)table;
> > 
> > would read past the table's end.
> > 
> > I'm not sure whether this is a real problem though -- that is, whether
> > this read could ever hit some unmapped memory.
> 
> Nicolai, thanks for the explanation. It makes sense to move it even later,
> to the end of the function.

Indeed, the assignment should come after the length check, but with another
tmp variable the assignment to the global var can be moved to the end, which
avoids having to clear the image_address field.

> 
> Thanks
> Dave
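
The variant described in this reply (check the length, validate a temporary
copy, and assign the global only at the very end) might look roughly like the
sketch below. struct hdr, struct tab and init_tab() are hypothetical,
simplified stand-ins for the ACPI table header, the BGRT table and
efi_bgrt_init(); only the ordering of the steps is the point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the table header and the full table. */
struct hdr {
	uint32_t length; /* length of the whole table in bytes */
};

struct tab {
	struct hdr header;
	uint16_t version;
	uint64_t image_address;
};

static struct tab global_tab;

static int init_tab(const struct hdr *table)
{
	struct tab tmp;

	/* Only the header is known to be readable at this point, so the
	 * length is checked before the full-size copy. */
	if (table->length < sizeof(tmp))
		return -1;

	tmp = *(const struct tab *)table;

	if (tmp.version != 1 || !tmp.image_address)
		return -1;

	/* Commit to the global only after every check has passed, so no
	 * error path has to clear it. */
	global_tab = tmp;
	return 0;
}
```

The price of this approach is the extra on-stack copy of the table.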

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-12 Thread Dave Young

On 01/13/17 at 12:11am, Nicolai Stange wrote:
> On Fri, Jan 13 2017, Dave Young wrote:
> 
> > On 01/12/17 at 12:54pm, Nicolai Stange wrote:
> >> On Thu, Jan 12 2017, Dave Young wrote:
> >> 
> >> > -void __init efi_bgrt_init(void)
> >> > +void __init efi_bgrt_init(struct acpi_table_header *table)
> >> >  {
> >> > -acpi_status status;
> >> >  void *image;
> >> >  struct bmp_header bmp_header;
> >> >  
> >> >  if (acpi_disabled)
> >> >  return;
> >> >  
> >> > -status = acpi_get_table("BGRT", 0,
> >> > -(struct acpi_table_header **)&bgrt_tab);
> >> > -if (ACPI_FAILURE(status))
> >> > -return;
> >> 
> >> 
> >> Not sure, but wouldn't it be safer to reverse the order of this assignment
> >> 
> >> > +bgrt_tab = *(struct acpi_table_bgrt *)table;
> >
> > Nicolai, sorry, I'm not sure I understand the comment, is it about the line above?
> > Could you elaborate a bit?
> >
> >> 
> >> and this length check
> >> 
> >
> > I also do not get this :(
> 
> Ah sorry, my point is this: the length check should perhaps be made
> before doing the assignment to bgrt_tab because otherwise, we might end
> up reading from invalid memory.
> 
> I.e. if (struct acpi_table_bgrt *)table->length < sizeof(bgrt_tab), then
> 
>   bgrt_tab = *(struct acpi_table_bgrt *)table;
> 
> would read past the table's end.
> 
> I'm not sure whether this is a real problem though -- that is, whether
> this read could ever hit some unmapped memory.

Nicolai, thanks for the explanation. It makes sense to move it even later,
to the end of the function.

Thanks
Dave

Re: [PATCH 3/4] efi/x86: move efi_print_memmap to drivers/firmware/efi/memmap.c

2017-01-12 Thread Dave Young

On 01/12/17 at 01:08pm, Nicolai Stange wrote:
> On Thu, Jan 12 2017, Dave Young wrote:
> 
> > Signed-off-by: Dave Young 
> > ---
> >  arch/x86/platform/efi/efi.c   |   16 
> >  drivers/firmware/efi/memmap.c |   16 
> >  2 files changed, 16 insertions(+), 16 deletions(-)
> >
> > --- linux-x86.orig/arch/x86/platform/efi/efi.c
> > +++ linux-x86/arch/x86/platform/efi/efi.c
> > @@ -210,22 +210,6 @@ int __init efi_memblock_x86_reserve_rang
> > return 0;
> >  }
> >  
> > -void __init efi_print_memmap(void)
> > -{
> > -   efi_memory_desc_t *md;
> > -   int i = 0;
> > -
> > -   for_each_efi_memory_desc(md) {
> > -   char buf[64];
> > -
> > -   pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
> > -   i++, efi_md_typeattr_format(buf, sizeof(buf), md),
> > -   md->phys_addr,
> > -   md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
> > -   (md->num_pages >> (20 - EFI_PAGE_SHIFT)));
> > -   }
> > -}
> > -
> >  static int __init efi_systab_init(void *phys)
> >  {
> > if (efi_enabled(EFI_64BIT)) {
> > --- linux-x86.orig/drivers/firmware/efi/memmap.c
> > +++ linux-x86/drivers/firmware/efi/memmap.c
> > @@ -10,6 +10,22 @@
> >  #include 
> >  #include 
> >  
> > +void __init efi_print_memmap(void)
> > +{
> > +   efi_memory_desc_t *md;
> > +   int i = 0;
> > +
> > +   for_each_efi_memory_desc(md) {
> > +   char buf[64];
> > +
> > +   pr_info("mem%02u: %s range=[0x%016llx-0x%016llx] (%lluMB)\n",
> > +   i++, efi_md_typeattr_format(buf, sizeof(buf), md),
> > +   md->phys_addr,
> > +   md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1,
> > +   (md->num_pages >> (20 - EFI_PAGE_SHIFT)));
> > +   }
> > +}
> > +
> >  /**
> >   * __efi_memmap_init - Common code for mapping the EFI memory map
> >   * @data: EFI memory map data
> 
> Shouldn't the declaration in arch/x86/include/asm/efi.h get moved as well?

Good catch, will change it as well

Thanks
Dave

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-12 Thread Dave Young

On 01/12/17 at 12:54pm, Nicolai Stange wrote:
> On Thu, Jan 12 2017, Dave Young wrote:
> 
> > -void __init efi_bgrt_init(void)
> > +void __init efi_bgrt_init(struct acpi_table_header *table)
> >  {
> > -   acpi_status status;
> > void *image;
> > struct bmp_header bmp_header;
> >  
> > if (acpi_disabled)
> > return;
> >  
> > -   status = acpi_get_table("BGRT", 0,
> > -   (struct acpi_table_header **)&bgrt_tab);
> > -   if (ACPI_FAILURE(status))
> > -   return;
> 
> 
> Not sure, but wouldn't it be safer to reverse the order of this assignment
> 
> > +   bgrt_tab = *(struct acpi_table_bgrt *)table;

Nicolai, sorry, I'm not sure I understand the comment, is it about the line above?
Could you elaborate a bit?

> 
> and this length check
> 

I also do not get this :(

> > -   if (bgrt_tab->header.length < sizeof(*bgrt_tab)) {
> > +   if (bgrt_tab.header.length < sizeof(bgrt_tab)) {
> > pr_notice("Ignoring BGRT: invalid length %u (expected %zu)\n",
> > -  bgrt_tab->header.length, sizeof(*bgrt_tab));
> > +  bgrt_tab.header.length, sizeof(bgrt_tab));
> > return;
> > }
> 
> ?
> 
> Also, from here on, all error paths should zero out
> bgrt_tab.image_address (or so) to signal failure to bgrt_init():
> bgrt_init() now checks for !bgrt_tab.image_address whereas before it had
> tested bgrt_image and the latter used to be set at the very end of
> efi_bgrt_init().
> 

Will do, thanks!

> 
> > -   if (bgrt_tab->version != 1) {
> > +   if (bgrt_tab.version != 1) {
> > pr_notice("Ignoring BGRT: invalid version %u (expected 1)\n",
> > -  bgrt_tab->version);
> > +  bgrt_tab.version);
> > return;
> > }
> > -   if (bgrt_tab->status & 0xfe) {
> > +   if (bgrt_tab.status & 0xfe) {
> > pr_notice("Ignoring BGRT: reserved status bits are non-zero %u\n",
> > -  bgrt_tab->status);
> > +  bgrt_tab.status);
> > return;
> > }
> > -   if (bgrt_tab->image_type != 0) {
> > +   if (bgrt_tab.image_type != 0) {
> > pr_notice("Ignoring BGRT: invalid image type %u (expected 0)\n",
> > -  bgrt_tab->image_type);
> > +  bgrt_tab.image_type);
> > return;
> > }
> > -   if (!bgrt_tab->image_address) {
> > +   if (!bgrt_tab.image_address) {
> > pr_notice("Ignoring BGRT: null image address\n");
> > return;
> > }
> >  
> > -   image = memremap(bgrt_tab->image_address, sizeof(bmp_header), MEMREMAP_WB);
> > +   image = early_memremap(bgrt_tab.image_address, sizeof(bmp_header));
> > if (!image) {
> > pr_notice("Ignoring BGRT: failed to map image header memory\n");
> > return;
> > }
> >  
> > memcpy(&bmp_header, image, sizeof(bmp_header));
> > -   memunmap(image);
> > +   early_memunmap(image, sizeof(bmp_header));
> > if (bmp_header.id != 0x4d42) {
> > pr_notice("Ignoring BGRT: Incorrect BMP magic number 0x%x (expected 0x4d42)\n",
> > bmp_header.id);
> > @@ -82,12 +77,5 @@ void __init efi_bgrt_init(void)
> > }
> > bgrt_image_size = bmp_header.size;
> >  
> > -   bgrt_image = memremap(bgrt_tab->image_address, bmp_header.size, MEMREMAP_WB);
> > -   if (!bgrt_image) {
> > -   pr_notice("Ignoring BGRT: failed to map image memory\n");
> > -   bgrt_image = NULL;
> > -   return;
> > -   }
> > -
> > -   efi_mem_reserve(bgrt_tab->image_address, bgrt_image_size);
> > +   efi_mem_reserve(bgrt_tab.image_address, bgrt_image_size);
> >  }
> > --- linux-x86.orig/drivers/acpi/bgrt.c
> > +++ linux-x86/drivers/acpi/bgrt.c
> > @@ -15,40 +15,41 @@
> >  #include 
> >  #include 
> >  
> > +static void *bgrt_image;
> 
> [...]
> 
> > @@ -84,9 +85,17 @@ static int __init bgrt_init(void)
> >  {
> > int ret;
> >  
> > -   if (!bgrt_image)
> > +   if (!bgrt_tab.image_address)
> > return -ENODEV;
> >  
> > +   bgrt_image = memremap(bgrt_tab.image_address, bgrt_image_size,
> > + MEMREMAP_WB);
> > +   if (!bgrt_image) {
> > +   pr_notice("Ignoring BGRT: failed to map image memory\n");
> > +   bgrt_image = NULL;
> > +   return -ENOMEM;
> > +   }
> > +
> > bin_attr_image.private = bgrt_image;
> > bin_attr_image.size = bgrt_image_size;
> >  
> 
> Thanks,
> 
> Nicolai

Thanks
Dave
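
The BMP header check quoted in the patch above compares the first two bytes of
the image against 0x4d42, which is "BM" read as a little-endian 16-bit value.
A minimal user-space sketch, assuming a little-endian host (as on x86) and a
GCC or Clang compiler for the packed attribute; bmp_image_size() is a
hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same two fields as the struct bmp_header in the patch. */
struct bmp_header {
	uint16_t id;   /* magic, 'B' 'M' */
	uint32_t size; /* file size in bytes */
} __attribute__((packed));

/* Copy the header out of the image and validate the magic number.
 * Returns 0 and stores the image size on success, -1 otherwise. */
static int bmp_image_size(const uint8_t *image, uint32_t *size)
{
	struct bmp_header hdr;

	memcpy(&hdr, image, sizeof(hdr));
	if (hdr.id != 0x4d42)
		return -1;

	*size = hdr.size;
	return 0;
}
```

Without the packed attribute the compiler would insert two bytes of padding
after id, and size would be read from the wrong offset.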

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-12 Thread Dave Young

On 01/12/17 at 04:20pm, Ard Biesheuvel wrote:
> On 12 January 2017 at 09:41, Dave Young  wrote:
> > Before invoking the arch specific handler, efi_mem_reserve() reserves
> > the given memory region through memblock.
> >
> > efi_bgrt_init will call efi_mem_reserve after mm_init(), at that time
> > memblock is dead and it should not be used any more.
> >
> > The EFI BGRT code depends on ACPI initialization to get the BGRT ACPI table;
> > moving BGRT parsing to the ACPI early boot code makes sure that
> > efi_mem_reserve() in the EFI BGRT code can still use memblock safely.
> >
> > Signed-off-by: Dave Young 
> 
> I know this is probably out of scope for you, but since we're moving
> things around, any chance we could do so in a manner that will enable
> BGRT support for arm64/ACPI? Happy to test/collaborate on this.
> 

I'm happy to do so. Bhupesh Sharma said he has already done some investigation
on that, so I would like to ask him to help.

I have already Cc'ed him.

Thanks
Dave

Re: [PATCH 1/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-12 Thread Dave Young

On 01/12/17 at 12:15pm, Nicolai Stange wrote:
> Hi Dave,
> 
> On Thu, Jan 12 2017, Dave Young wrote:
> 
> > efi_mem_reserve cares only about boot services regions and maybe loader areas.
> > So add a new argument to efi_memmap_insert for this purpose.
> 
> Please see below.
> 
> 
> > --- linux-x86.orig/drivers/firmware/efi/memmap.c
> > +++ linux-x86/drivers/firmware/efi/memmap.c
> > @@ -213,7 +213,7 @@ int __init efi_memmap_split_count(efi_me
> >   * to see how large @buf needs to be.
> >   */
> >  void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> > - struct efi_mem_range *mem)
> > + struct efi_mem_range *mem, bool boot_only)
> >  {
> > u64 m_start, m_end, m_attr;
> > efi_memory_desc_t *md;
> > @@ -246,6 +246,12 @@ void __init efi_memmap_insert(struct efi
> > start = md->phys_addr;
> > end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> >  
> > +   if (boot_only && !(md->type == EFI_LOADER_DATA ||
> > +   md->type == EFI_LOADER_CODE ||
> > +   md->type == EFI_BOOT_SERVICES_CODE ||
> > +   md->type == EFI_BOOT_SERVICES_DATA))
> > +   continue;
> > +
> 
> 
> Actually, the efi_mem_desc_lookup() called from
> efi_arch_memmap_reserve() will only return mds not satisfying the
> following condition:
> 
>   if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
>   md->type != EFI_BOOT_SERVICES_DATA &&
>   md->type != EFI_RUNTIME_SERVICES_DATA) {
>   continue;
>   }
> 
> Furthermore, efi_arch_mem_reserve() will only accept ranges fully
> contained within such a region.
> 
> I think we can make efi_arch_mem_reserve() return early if
> EFI_MEMORY_RUNTIME has been set already and thus, neglect this case in
> efi_memmap_insert().
> 
> I suppose that we don't want to reserve within EFI_RUNTIME_SERVICES_DATA
> regions in efi_mem_reserve() either -- these won't ever get made
> available as general memory anyway [1]. So efi_arch_mem_reserve() should
> return early here as well imo.
> 
> So, what would remain to be handled from efi_memmap_insert() in case of
> boot_only would be EFI_BOOT_SERVICES_DATA only?

It sounds reasonable, though I'm still not sure about EFI_LOADER_*.

The main purpose of this patch is to address the invalid mem ranges
case. As Ard mentioned, I will test with Peter's patch first; if it works
fine I would like to either drop this patch and leave it as a future
improvement, or add it at the end of the next posting.

Matt, what's your opinion about the boot_only check and the EFI_LOADER_*
question?

> 
> (As a sidenote, Matt pointed out at [1] that the EFI_LOADER_* regions
>  should be reserved early through memblock_reserve() and not through
>  efi_mem_reserve()).
> 
> Thanks,
> 
> Nicolai
> 
> 
> [1] http://lkml.kernel.org/r/20170109130702.gi16...@codeblueprint.co.uk

Thanks
Dave
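
Nicolai's argument above can be followed step by step with a small predicate
sketch. The type and attribute values come from the UEFI specification;
struct mem_desc, lookup_accepts() and reserve_has_work() are hypothetical names
for this illustration and do not reproduce the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_DATA    4
#define EFI_RUNTIME_SERVICES_DATA 6
#define EFI_MEMORY_RUNTIME        0x8000000000000000ULL

/* Hypothetical, trimmed-down memory descriptor. */
struct mem_desc {
	uint32_t type;
	uint64_t attribute;
};

/* Negation of the quoted "continue" condition: a descriptor is returned by
 * the lookup only if it has the runtime attribute or one of the two data
 * types. */
static bool lookup_accepts(const struct mem_desc *md)
{
	return (md->attribute & EFI_MEMORY_RUNTIME) ||
	       md->type == EFI_BOOT_SERVICES_DATA ||
	       md->type == EFI_RUNTIME_SERVICES_DATA;
}

/* With the two proposed early returns (runtime attribute already set,
 * runtime services data), only boot services data remains. */
static bool reserve_has_work(const struct mem_desc *md)
{
	if (!lookup_accepts(md))
		return false;
	if (md->attribute & EFI_MEMORY_RUNTIME)
		return false;
	if (md->type == EFI_RUNTIME_SERVICES_DATA)
		return false;
	return true;
}
```

Under these assumptions reserve_has_work() is true exactly for
EFI_BOOT_SERVICES_DATA descriptors without EFI_MEMORY_RUNTIME, which matches
the question raised in the quoted mail.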

Re: [PATCH 1/4] efi/x86: make efi_memmap_reserve only insert into boot mem areas

2017-01-12 Thread Dave Young

On 01/12/17 at 04:15pm, Ard Biesheuvel wrote:
> Hello Dave,
> 
> On 12 January 2017 at 09:41, Dave Young  wrote:
> > There are memory ranges like the ones below when testing early efi_mem_reserve:
> >
> > efi: mem62: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> > efi: mem63: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> > efi: mem64: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> > efi: mem65: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> > efi: mem66: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> > efi: mem67: [Reserved   |   |  |  |  |  |  |  |   |  |  |  |  ] 
> > range=[0x-0x] (0MB)
> >
> 
> Did you spot Peter's patch to prune invalid memmap entries?
> 
> https://git.kernel.org/cgit/linux/kernel/git/efi/efi.git/commit/?h=next&id=b2a91a35445229
> 
> I would expect this patch to no longer be necessary with that in place, no?

Ard, good suggestion, I did not notice that patch, will try. Actually
I'm not sure about fake_mem handling, if we can filter out the invalid
ranges then it will be natural to drop this one after a test.

Thanks
Dave

Re: [PATCH 2/4] efi/x86: move efi bgrt init code to early init code

2017-01-12 Thread Dave Young

[snip]
> --- linux-x86.orig/drivers/acpi/bgrt.c
> +++ linux-x86/drivers/acpi/bgrt.c

[snip]
>  
> @@ -84,9 +85,17 @@ static int __init bgrt_init(void)
>  {
>   int ret;
>  
> - if (!bgrt_image)
> + if (!bgrt_tab.image_address)
>   return -ENODEV;
>  
> + bgrt_image = memremap(bgrt_tab.image_address, bgrt_image_size,
> +   MEMREMAP_WB);
> + if (!bgrt_image) {
> + pr_notice("Ignoring BGRT: failed to map image memory\n");
> + bgrt_image = NULL;
> + return -ENOMEM;
> + }
> +

Oops, the later error paths need to unmap bgrt_image; I will update this in the
next version after collecting more comments.

Also, bgrt_image = NULL is useless here, so I will drop it.

>   bin_attr_image.private = bgrt_image;
>   bin_attr_image.size = bgrt_image_size;
>  

Thanks
Dave
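
The fix mentioned above (unmapping the image on the later error paths) follows
the common cleanup-label pattern. The sketch below uses malloc()/free() as
stand-ins for the mapping calls; fake_map(), fake_unmap(), fake_bgrt_init()
and the fail_later flag are all hypothetical and exist only to show the
control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_mappings;  /* number of mappings not yet released */
static void *mapped_image; /* kept on success, like bgrt_image */

/* Stand-ins for the map/unmap calls. */
static void *fake_map(size_t size)
{
	void *p = malloc(size);

	if (p)
		live_mappings++;
	return p;
}

static void fake_unmap(void *p)
{
	free(p);
	live_mappings--;
}

static int fake_bgrt_init(size_t image_size, int fail_later)
{
	int ret;

	mapped_image = fake_map(image_size);
	if (!mapped_image)
		return -ENOMEM;

	/* Stand-in for the later setup steps that may fail. */
	ret = fail_later ? -ENODEV : 0;
	if (ret)
		goto out_unmap;

	return 0;

out_unmap:
	/* Every failure after the mapping succeeded must release it. */
	fake_unmap(mapped_image);
	mapped_image = NULL;
	return ret;
}
```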

[PATCH 0/4] efi/x86: move efi bgrt init code to early init

2017-01-12 Thread Dave Young

Hi,

Here is a patchset to move efi_bgrt_init to early code so that we can still use
the memblock API.

Comments and review are appreciated.

Diffstat:

 arch/x86/kernel/acpi/boot.c  |   12 +++
 arch/x86/platform/efi/efi-bgrt.c |   42 +--
 arch/x86/platform/efi/efi.c  |   26 
 arch/x86/platform/efi/quirks.c   |2 -
 drivers/acpi/bgrt.c  |   21 +--
 drivers/firmware/efi/fake_mem.c  |3 +-
 drivers/firmware/efi/memmap.c|   24 +-
 include/linux/efi-bgrt.h |7 ++
 include/linux/efi.h  |4 +--
 init/main.c  |1 
 10 files changed, 78 insertions(+), 64 deletions(-)

Thanks
Dave

[PATCH 4/4] efi/x86: add debug code to print cooked memmap

2017-01-12 Thread Dave Young

It is not obvious whether the reserved boot areas are added correctly, so add
an efi_print_memmap() call to print the new memmap.

Signed-off-by: Dave Young 
---
 arch/x86/platform/efi/efi.c |5 +
 1 file changed, 5 insertions(+)

--- linux-x86.orig/arch/x86/platform/efi/efi.c
+++ linux-x86/arch/x86/platform/efi/efi.c
@@ -873,6 +873,11 @@ static void __init __efi_enter_virtual_m
return;
}
 
+   if (efi_enabled(EFI_DBG)) {
+   pr_info("EFI runtime memory map:\n");
+   efi_print_memmap();
+   }
+
BUG_ON(!efi.systab);
 
if (efi_setup_page_tables(pa, 1 << pg_shift)) {
