Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-05-22 Thread Pingfan Liu
On Mon, May 13, 2019 at 11:31 PM Michal Hocko  wrote:
>
> On Mon 13-05-19 11:20:46, Qian Cai wrote:
> > On Mon, 2019-05-13 at 16:04 +0200, Michal Hocko wrote:
> > > On Mon 13-05-19 09:43:59, Qian Cai wrote:
> > > > On Mon, 2019-05-13 at 14:41 +0200, Michal Hocko wrote:
> > > > > On Sun 12-05-19 01:48:29, Qian Cai wrote:
> > > > > > The linux-next commit ("x86, numa: always initialize all possible
> > > > > > nodes") introduced a crash below during boot for systems with a
> > > > > > memory-less node. This is due to CPUs that get onlined during SMP 
> > > > > > boot,
> > > > > > but that onlining triggers a page fault in bus_add_device() during
> > > > > > device registration:
> > > > > >
> > > > > >   error = sysfs_create_link(&bus->p->devices_kset->kobj,
> > > > > >
> > > > > > bus->p is NULL. That "p" is the subsys_private struct, and it should
> > > > > > have been set in,
> > > > > >
> > > > > >   postcore_initcall(register_node_type);
> > > > > >
> > > > > > but that happens in do_basic_setup() after smp_init().
> > > > > >
> > > > > > The old code had set this node online via alloc_node_data(), so 
> > > > > > when it
> > > > > > came time to do_cpu_up() -> try_online_node(), the node was already 
> > > > > > up
> > > > > > and nothing happened.
> > > > > >
> > > > > > Now, it attempts to online the node, which registers the node with
> > > > > > sysfs, but that can't happen before the 'node' subsystem is 
> > > > > > registered.
> > > > > >
> > > > > > Since kernel_init() is run by a kernel thread that is in
> > > > > > SYSTEM_SCHEDULING state, fix this by skipping registering with sysfs
> > > > > > during early boot in __try_online_node().
> > > > >
> > > > > Relying on SYSTEM_SCHEDULING looks really hackish. Why cannot we 
> > > > > simply
> > > > > drop try_online_node from do_cpu_up? Your v2 remark below suggests 
> > > > > that
> > > > > we need to call node_set_online because something later on depends on
> > > > > that. Btw. why do we even allocate a pgdat from this path? This looks
> > > > > really messy.
> > > >
> > > > See the commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before
> > > > local
> > > > memory online")
> > > >
> > > > It looks like try_online_node() in do_cpu_up() is needed for memory 
> > > > hotplug
> > > > which is to put its node online if offlined and then hotadd_new_pgdat()
> > > > calls
> > > > build_all_zonelists() to initialize the zone list.
> > >
> > > Well, do we still have to follow the logic that the above (unreviewed)
> > > commit has established? The hotplug code in general made a lot of ad-hoc
> > > design decisions which had to be revisited over time. If we are not
> > > allocating pgdats for newly added memory then we should really make sure
> > > to do so at a proper time and hook. I am not sure about CPU vs. memory
> > > init ordering but even then I would really prefer if we could make the
> > > init less obscure and _documented_.
> >
> > I don't know, but I think it is a good idea to keep the existing logic 
> > rather
> > than do a big surgery
>
> Adding more hacks just doesn't make the situation any better.
>
> > unless someone is able to confirm it is not breaking NUMA
> > node physical hotplug.
>
> I have a machine to test whole node offline. I am just busy to prepare a
> patch myself. I can have it tested though.
>
I think the definition of "node online" is worth rethinking. Before the
patch "x86, numa: always initialize all possible nodes", online meant that
either a CPU or memory was present on the node. After this patch, only a
node owning memory is considered online.

In the commit log, I think the change's motivation should be "Not to
mention that it doesn't really make much sense to consider an empty
node as online because we just consider this node whenever we want to
iterate nodes to use and empty node is obviously not the best
candidate."

But in fact, we already have for_each_node_state(nid, N_MEMORY) to
cover this purpose. Furthermore, changing the definition of online may
break something in the scheduler, e.g. in task_numa_migrate(), where
it calls for_each_online_node.

By keeping a node that owns a CPU online, Michal's patch can avoid such
corner cases and keep things simple. Furthermore, if needed, a follow-up
patch can use for_each_node_state(nid, N_MEMORY) to replace
for_each_online_node in some places.
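
A minimal sketch of that replacement (illustration only; do_numa_work() is a
made-up placeholder for whatever per-node work the caller does, and the
nodemask helpers come from include/linux/nodemask.h):

	#include <linux/nodemask.h>

	int nid;

	/* May visit memoryless nodes once "online" includes CPU-only nodes: */
	for_each_online_node(nid)
		do_numa_work(nid);

	/* Visits only nodes that actually own memory, independent of how
	 * "online" is defined: */
	for_each_node_state(nid, N_MEMORY)
		do_numa_work(nid);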

Regards,
Pingfan

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 1/2] x86/boot: move early_serial_base to .data section

2019-05-07 Thread Pingfan Liu
On Tue, May 7, 2019 at 4:28 PM Ingo Molnar  wrote:
>
>
> * Pingfan Liu  wrote:
>
> > arch/x86/boot/compressed/head_64.S clears the BSS after relocation. If the
> > early serial console is set up before the BSS is cleared, early_serial_base
> > will be reset to 0.
> >
> > Initialize early_serial_base to -1 to push it into the .data section.
>
> I'm wondering whether it's wise to clear the BSS after relocation to
> begin with. It already gets cleared once, and an implicit zeroing of all
> fields on kernel relocation sounds dubious to me.
>
After reading the code more closely, I think that the BSS is not fully
initialized to 0, except for the stack and heap.

Furthermore, the BSS is not copied to the target address; we only copy [0, _bss).
> Is there a strong reason for that? I.e. is there some uninitialized or
> otherwise important-to-clear data there?
>
I guess the reason may be that the stack or heap can contain some
position-dependent data. (In practice, there is no such data in the code
nowadays.)
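
For reference, a minimal user-space illustration (an analogy, not the boot
code itself) of why the -1 initializer matters: a global with no initializer
or a zero one typically lands in .bss and is wiped by any later BSS clear,
while a nonzero initializer places it in .data:

	/* build with: gcc -c demo.c && objdump -t demo.o | grep serial_base */
	int serial_base_bss;		/* no initializer      -> .bss               */
	int serial_base_zero = 0;	/* zero initializer    -> .bss (typically)   */
	int serial_base_data = -1;	/* nonzero initializer -> .data              */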

Thanks,
Pingfan


[PATCH 0/2] x86/boot: support to handle exception in early boot

2019-05-07 Thread Pingfan Liu
The boot code has become a little complicated and has hit some bugs, e.g.
commit 3a63f70bf4c3a ("x86/boot: Early parse RSDP and save it in
boot_params") broke kexec boot on EFI systems.

There are few hints when a bug happens. Catching the exception and printing a
message gives immediate help, instead of adding more debug_putstr() calls to
narrow down the problem.

Although there is no functional dependency, the early console should be ready
in order to show the message. I have sent a separate series:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1992923.html
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1992919.html

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "Kirill A. Shutemov" 
Cc: Cao jin 
Cc: Wei Huang 
Cc: Chao Fan 
Cc: Nicolai Stange 
Cc: Dou Liyang 
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (2):
  x86/idt: split out idt routines
  x86/boot: set up idt for very early boot stage

 arch/x86/boot/compressed/head_64.S | 11 +++
 arch/x86/boot/compressed/misc.c| 61 
 arch/x86/include/asm/idt.h | 64 ++
 arch/x86/kernel/idt.c  | 58 +-
 4 files changed, 137 insertions(+), 57 deletions(-)
 create mode 100644 arch/x86/include/asm/idt.h

-- 
2.7.4



[PATCH 1/2] x86/idt: split out idt routines

2019-05-07 Thread Pingfan Liu
Some idt routines can be reused at the early boot stage. Split them out.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "Kirill A. Shutemov" 
Cc: Cao jin 
Cc: Wei Huang 
Cc: Chao Fan 
Cc: Nicolai Stange 
Cc: Dou Liyang 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/idt.h | 64 ++
 arch/x86/kernel/idt.c  | 58 +
 2 files changed, 65 insertions(+), 57 deletions(-)
 create mode 100644 arch/x86/include/asm/idt.h

diff --git a/arch/x86/include/asm/idt.h b/arch/x86/include/asm/idt.h
new file mode 100644
index 000..147f128
--- /dev/null
+++ b/arch/x86/include/asm/idt.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IDT_H
+#define _ASM_X86_IDT_H
+
+#include 
+
+struct idt_data {
+	unsigned int	vector;
+	unsigned int	segment;
+	struct idt_bits	bits;
+	const void	*addr;
+};
+
+#define DPL0   0x0
+#define DPL3   0x3
+
+#define DEFAULT_STACK  0
+
+#define G(_vector, _addr, _ist, _type, _dpl, _segment) \
+   {   \
+   .vector = _vector,  \
+   .bits.ist   = _ist, \
+   .bits.type  = _type,\
+   .bits.dpl   = _dpl, \
+   .bits.p = 1,\
+   .addr   = _addr,\
+   .segment= _segment, \
+   }
+
+/* Interrupt gate */
+#define INTG(_vector, _addr)   \
+   G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL0, __KERNEL_CS)
+
+/* System interrupt gate */
+#define SYSG(_vector, _addr)   \
+   G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL3, __KERNEL_CS)
+
+/* Interrupt gate with interrupt stack */
+#define ISTG(_vector, _addr, _ist) \
+   G(_vector, _addr, _ist, GATE_INTERRUPT, DPL0, __KERNEL_CS)
+
+/* System interrupt gate with interrupt stack */
+#define SISTG(_vector, _addr, _ist)\
+   G(_vector, _addr, _ist, GATE_INTERRUPT, DPL3, __KERNEL_CS)
+
+/* Task gate */
+#define TSKG(_vector, _gdt)\
+   G(_vector, NULL, DEFAULT_STACK, GATE_TASK, DPL0, _gdt << 3)
+
+static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
+{
+   unsigned long addr = (unsigned long) d->addr;
+
+	gate->offset_low	= (u16) addr;
+   gate->segment   = (u16) d->segment;
+   gate->bits  = d->bits;
+   gate->offset_middle = (u16) (addr >> 16);
+#ifdef CONFIG_X86_64
+   gate->offset_high   = (u32) (addr >> 32);
+   gate->reserved  = 0;
+#endif
+}
+
+#endif
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index 01adea2..80b811a 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -9,49 +9,7 @@
 #include 
 #include 
 #include 
-
-struct idt_data {
-	unsigned int	vector;
-	unsigned int	segment;
-	struct idt_bits	bits;
-	const void	*addr;
-};
-
-#define DPL0   0x0
-#define DPL3   0x3
-
-#define DEFAULT_STACK  0
-
-#define G(_vector, _addr, _ist, _type, _dpl, _segment) \
-   {   \
-   .vector = _vector,  \
-   .bits.ist   = _ist, \
-   .bits.type  = _type,\
-   .bits.dpl   = _dpl, \
-   .bits.p = 1,\
-   .addr   = _addr,\
-   .segment= _segment, \
-   }
-
-/* Interrupt gate */
-#define INTG(_vector, _addr)   \
-   G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL0, __KERNEL_CS)
-
-/* System interrupt gate */
-#define SYSG(_vector, _addr)   \
-   G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL3, __KERNEL_CS)
-
-/* Interrupt gate with interrupt stack */
-#define ISTG(_vector, _addr, _ist) \
-   G(_vector, _addr, _ist, GATE_INTERRUPT, DPL0, __KERNEL_CS)
-
-/* System interrupt gate with interrupt stack */
-#define SISTG(_vector, _addr, _ist)\
-   G(_vector, _addr, _ist, GATE_INTERRUPT, DPL3, __KERNEL_CS)
-
-/* Task gate */
-#define TSKG(_vector, _gdt)\
-   G(_vector, NULL, DEFAULT_STACK, GATE_TASK, DPL0, _gdt << 3)
+#include 
 
 /*
  * Early traps running on the DEFAULT_STACK because the other interrupt
@@ -202,20 +160,6 @@ const struct desc_ptr debug_idt_descr = {
 };
 #endif
 
-static inline void idt_init_desc(gate_desc *gate, const stru

[PATCH 2/2] x86/boot: set up idt for very early boot stage

2019-05-07 Thread Pingfan Liu
The boot code has become a little complicated and has hit some bugs, e.g.
commit 3a63f70bf4c3a ("x86/boot: Early parse RSDP and save it in
boot_params") broke kexec boot on EFI systems.

There are few hints when a bug happens. Catching the exception and printing a
message gives immediate help, instead of adding more debug_putstr() calls to
narrow down the problem.

At present, a page fault exception handler is added, and the printed message
looks like:
  early boot page fault:
  ENTRY(startup_64) is at: 00047f67d200
  nip: 00047fdeedd3
  fault address: fffeef6fde30

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "Kirill A. Shutemov" 
Cc: Cao jin 
Cc: Wei Huang 
Cc: Chao Fan 
Cc: Nicolai Stange 
Cc: Dou Liyang 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/head_64.S | 11 +++
 arch/x86/boot/compressed/misc.c| 61 ++
 2 files changed, 72 insertions(+)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index e4a25f9..f589aa2 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -527,6 +527,10 @@ relocated:
	shrq	$3, %rcx
	rep	stosq

+	pushq	%rsi	/* Save the real mode argument */
+	leaq	startup_64(%rip), %rdi
+	call	setup_early_boot_idt
+	popq	%rsi
 /*
  * Do the extraction, and jump to the new kernel..
  */
@@ -659,6 +663,13 @@ no_longmode:
 
 #include "../../kernel/verify_cpu.S"
 
+   .code64
+.align 8
+ENTRY(boot_page_fault)
+   mov 8(%rsp), %rdi
+	call	do_boot_page_fault
+   iretq
+
.data
 gdt64:
.word   gdt_end - gdt
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 475a3c6..8aaa582 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -76,6 +76,11 @@ static int lines, cols;
 #ifdef CONFIG_KERNEL_LZ4
 #include "../../../../lib/decompress_unlz4.c"
 #endif
+
+#include "../../include/asm/desc.h"
+#include "../../include/asm/idt.h"
+#include "../../include/asm/traps.h"
+
 /*
  * NOTE: When adding a new decompressor, please update the analysis in
  * ../header.S.
@@ -429,3 +434,59 @@ void fortify_panic(const char *name)
 {
error("detected buffer overflow");
 }
+
+static unsigned long rt_startup_64;
+
+void do_boot_page_fault(unsigned long retaddr)
+{
+   struct desc_ptr idt = { .address = 0, .size = 0 };
+   unsigned long fault_address = read_cr2();
+
+   debug_putstr("early boot page fault:\n");
+   debug_putstr("ENTRY(startup_64) is at: ");
+   debug_puthex(rt_startup_64);
+   debug_putstr("\n");
+   debug_putstr("nip: ");
+   debug_puthex(retaddr);
+   debug_putstr("\n");
+   debug_putstr("fault address: ");
+   debug_puthex(fault_address);
+   debug_putstr("\n");
+
+	load_idt(&idt);
+}
+
+asmlinkage void boot_page_fault(void);
+
+static struct idt_data boot_idts[] = {
+   INTG(X86_TRAP_PF, 0),
+};
+
+static gate_desc early_boot_idt_table[IDT_ENTRIES] __page_aligned_bss;
+
+static struct desc_ptr early_boot_idt_descr __ro_after_init = {
+   .size   = (IDT_ENTRIES * 2 * sizeof(unsigned long)) - 1,
+};
+
+static void
+idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size)
+{
+   gate_desc desc;
+
+   for (; size > 0; t++, size--) {
+		idt_init_desc(&desc, t);
+		write_idt_entry(idt, t->vector, &desc);
+   }
+}
+
+void setup_early_boot_idt(unsigned long rip)
+{
+   rt_startup_64 = rip;
+   /* fill it with runtime address */
+   boot_idts[0].addr = boot_page_fault;
+   early_boot_idt_descr.address = (unsigned long)early_boot_idt_table;
+
+   idt_setup_from_table(early_boot_idt_table, boot_idts,
+   ARRAY_SIZE(boot_idts));
+	load_idt(&early_boot_idt_descr);
+}
-- 
2.7.4



[PATCH 1/2] x86/boot: move early_serial_base to .data section

2019-05-07 Thread Pingfan Liu
arch/x86/boot/compressed/head_64.S clears the BSS after relocation. If the
early serial console is set up before the BSS is cleared, early_serial_base
will be reset to 0.

Initialize early_serial_base to -1 to push it into the .data section.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Jordan Borgner 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/early_serial_console.c | 2 +-
 arch/x86/boot/early_serial_console.c| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/early_serial_console.c b/arch/x86/boot/compressed/early_serial_console.c
index 261e81f..624e334 100644
--- a/arch/x86/boot/compressed/early_serial_console.c
+++ b/arch/x86/boot/compressed/early_serial_console.c
@@ -1,5 +1,5 @@
 #include "misc.h"
 
-int early_serial_base;
+int early_serial_base = -1;
 
 #include "../early_serial_console.c"
diff --git a/arch/x86/boot/early_serial_console.c b/arch/x86/boot/early_serial_console.c
index 023bf1c..d8de15a 100644
--- a/arch/x86/boot/early_serial_console.c
+++ b/arch/x86/boot/early_serial_console.c
@@ -149,6 +149,6 @@ void console_init(void)
 {
parse_earlyprintk();
 
-   if (!early_serial_base)
+   if (early_serial_base <= 0)
parse_console_uart8250();
 }
-- 
2.7.4



[PATCH 2/2] x86/boot: push console_init forward

2019-05-07 Thread Pingfan Liu
At the very early boot stage, an early console is badly needed. Push its
initialization as early as possible, just after the stack is ready.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Jordan Borgner 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/early_serial_console.c | 7 +++
 arch/x86/boot/compressed/head_64.S  | 4 
 arch/x86/boot/compressed/misc.c | 1 -
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/early_serial_console.c b/arch/x86/boot/compressed/early_serial_console.c
index 624e334..223954a 100644
--- a/arch/x86/boot/compressed/early_serial_console.c
+++ b/arch/x86/boot/compressed/early_serial_console.c
@@ -3,3 +3,10 @@
 int early_serial_base = -1;
 
 #include "../early_serial_console.c"
+
+void early_console_init(void *rmode)
+{
+   boot_params = rmode;
+   console_init();
+   debug_putstr("early console is ready\n");
+}
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index fafb75c..e4a25f9 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -323,6 +323,10 @@ ENTRY(startup_64)
	subq	$1b, %rdi

	call	adjust_got
+	pushq	%rsi
+	movq	%rsi, %rdi
+	call	early_console_init
+	popq	%rsi
 
/*
 * At this point we are in long mode with 4-level paging enabled,
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index cafc6aa..475a3c6 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -368,7 +368,6 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
lines = boot_params->screen_info.orig_video_lines;
cols = boot_params->screen_info.orig_video_cols;
 
-   console_init();
debug_putstr("early console in extract_kernel\n");
 
free_mem_ptr = heap;/* Heap */
-- 
2.7.4



[PATCHv5 0/2] x86/boot/KASLR: skip the specified crashkernel region

2019-05-06 Thread Pingfan Liu
The crashkernel=x@y or =range1:size1[,range2:size2,...]@offset option may
fail to reserve the required memory region if KASLR puts the kernel into that
region. To avoid this uncertainty, ask KASLR to skip the required region.
The existing parsing routine can be re-used at this early boot stage.
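
For reference, the two command-line forms this series handles look like the
following (the sizes and offsets here are only examples):

	crashkernel=256M@1G
	crashkernel=512M-2G:64M,2G-:128M@16M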

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Vivek Goyal 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
v3 -> v4:
  reuse the parse_crashkernel_xx routines
v4 -> v5:
  drop unnecessary initialization of crash_base in [2/2]

Pingfan Liu (2):
  kernel/crash_core: separate the parsing routines to
lib/parse_crashkernel.c
  x86/boot/KASLR: skip the specified crashkernel region

 arch/x86/boot/compressed/kaslr.c |  40 ++
 kernel/crash_core.c  | 273 
 lib/Makefile |   2 +
 lib/parse_crashkernel.c  | 289 +++
 4 files changed, 331 insertions(+), 273 deletions(-)
 create mode 100644 lib/parse_crashkernel.c

-- 
2.7.4



[PATCHv5 2/2] x86/boot/KASLR: skip the specified crashkernel region

2019-05-06 Thread Pingfan Liu
The crashkernel=x@y or =range1:size1[,range2:size2,...]@offset option may
fail to reserve the required memory region if KASLR puts the kernel into that
region. To avoid this uncertainty, ask KASLR to skip the required region.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Vivek Goyal 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/kaslr.c | 40 
 lib/parse_crashkernel.c  | 10 ++
 2 files changed, 50 insertions(+)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 2e53c05..12f72a3 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -107,6 +107,7 @@ enum mem_avoid_index {
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
+   MEM_AVOID_CRASHKERNEL,
MEM_AVOID_MAX,
 };
 
@@ -131,6 +132,11 @@ char *skip_spaces(const char *str)
 }
 #include "../../../../lib/ctype.c"
 #include "../../../../lib/cmdline.c"
+#ifdef CONFIG_CRASH_CORE
+#define printk
+#define _BOOT_KASLR
+#include "../../../../lib/parse_crashkernel.c"
+#endif
 
 static int
 parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
@@ -292,6 +298,39 @@ static void handle_mem_options(void)
return;
 }
 
+static u64 mem_ram_size(void)
+{
+   struct boot_e820_entry *entry;
+   u64 total_sz = 0;
+   int i;
+
+   for (i = 0; i < boot_params->e820_entries; i++) {
+		entry = &boot_params->e820_table[i];
+   /* Skip non-RAM entries. */
+   if (entry->type != E820_TYPE_RAM)
+   continue;
+   total_sz += entry->size;
+   }
+   return total_sz;
+}
+
+/*
+ * For crashkernel=size@offset or =range1:size1[,range2:size2,...]@offset
+ * options, recording mem_avoid for them.
+ */
+static void handle_crashkernel_options(void)
+{
+   unsigned long long crash_size, crash_base;
+   char *cmdline = (char *)get_cmd_line_ptr();
+   u64 total_sz = mem_ram_size();
+
+	parse_crashkernel(cmdline, total_sz, &crash_size, &crash_base);
+   if (crash_base) {
+   mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
+   }
+}
+
 /*
  * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
  * The mem_avoid array is used to store the ranges that need to be avoided
@@ -414,6 +453,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 
/* Mark the memmap regions we need to avoid */
handle_mem_options();
+   handle_crashkernel_options();
 
/* Enumerate the immovable memory regions */
num_immovable_mem = count_immovable_mem_regions();
diff --git a/lib/parse_crashkernel.c b/lib/parse_crashkernel.c
index b9a8dc6..4644379 100644
--- a/lib/parse_crashkernel.c
+++ b/lib/parse_crashkernel.c
@@ -137,6 +137,7 @@ static __initdata char *suffix_tbl[] = {
[SUFFIX_NULL] = NULL,
 };
 
+#ifndef _BOOT_KASLR
 /*
  * That function parses "suffix"  crashkernel command lines like
  *
@@ -169,6 +170,7 @@ static int __init parse_crashkernel_suffix(char *cmdline,
 
return 0;
 }
+#endif
 
 static __init char *get_last_crashkernel(char *cmdline,
 const char *name,
@@ -232,9 +234,11 @@ static int __init __parse_crashkernel(char *cmdline,
 
ck_cmdline += strlen(name);
 
+#ifndef _BOOT_KASLR
if (suffix)
return parse_crashkernel_suffix(ck_cmdline, crash_size,
suffix);
+#endif
/*
 * if the commandline contains a ':', then that's the extended
 * syntax -- if not, it must be the classic syntax
@@ -261,6 +265,11 @@ int __init parse_crashkernel(char *cmdline,
"crashkernel=", NULL);
 }
 
+/*
+ * At boot stage, KASLR does not care about crashkernel=size,[high|low], which
+ * never specifies the offset of region.
+ */
+#ifndef _BOOT_KASLR
 int __init parse_crashkernel_high(char *cmdline,
 unsigned long long system_ram,
 unsigned long long *crash_size,
@@ -278,3 +287,4 @@ int __init parse_crashkernel_low(char *cmdline,
return __parse_crashkernel(cmdline, system_ram, crash_size, crash_base,
"crashkernel=", suffix_tbl[SUFFIX_LOW]);
 }
+#endif
-- 
2.7.4



[PATCHv5 1/2] crash: Carve out crashkernel= cmdline parsing

2019-05-06 Thread Pingfan Liu
Make the "crashkernel=" parsing functionality available to the early
KASLR code. Will be used by a later patch to parse crashkernel regions
which KASLR should avoid.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
Cc: Vivek Goyal 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
 kernel/crash_core.c | 273 --
 lib/Makefile|   2 +
 lib/parse_crashkernel.c | 280 
 3 files changed, 282 insertions(+), 273 deletions(-)
 create mode 100644 lib/parse_crashkernel.c

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 093c9f9..37c4d6f 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,279 +21,6 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
-/*
- * parsing the "crashkernel" commandline
- *
- * this code is intended to be called from architecture specific code
- */
-
-
-/*
- * This function parses command lines in the format
- *
- *   crashkernel=ramsize-range:size[,...][@offset]
- *
- * The function returns 0 on success and -EINVAL on failure.
- */
-static int __init parse_crashkernel_mem(char *cmdline,
-   unsigned long long system_ram,
-   unsigned long long *crash_size,
-   unsigned long long *crash_base)
-{
-   char *cur = cmdline, *tmp;
-
-   /* for each entry of the comma-separated list */
-   do {
-   unsigned long long start, end = ULLONG_MAX, size;
-
-   /* get the start of the range */
-		start = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (*cur != '-') {
-   pr_warn("crashkernel: '-' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   /* if no ':' is here, than we read the end */
-   if (*cur != ':') {
-			end = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (end <= start) {
-   pr_warn("crashkernel: end <= start\n");
-   return -EINVAL;
-   }
-   }
-
-   if (*cur != ':') {
-   pr_warn("crashkernel: ':' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-		size = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (size >= system_ram) {
-   pr_warn("crashkernel: invalid size\n");
-   return -EINVAL;
-   }
-
-   /* match ? */
-   if (system_ram >= start && system_ram < end) {
-   *crash_size = size;
-   break;
-   }
-   } while (*cur++ == ',');
-
-   if (*crash_size > 0) {
-   while (*cur && *cur != ' ' && *cur != '@')
-   cur++;
-   if (*cur == '@') {
-   cur++;
-			*crash_base = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected after '@'\n");
-   return -EINVAL;
-   }
-   }
-   } else
-   pr_info("crashkernel size resulted in zero bytes\n");
-
-   return 0;
-}
-
-/*
- * That function parses "simple" (old) crashkernel command lines like
- *
- * crashkernel=size[@offset]
- *
- * It returns 0 on success and -EINVAL on failure.
- */
-static int __init parse_crashkernel_simple(char *cmdline,
-  unsigned long long *crash_size,
-  unsigned long long *crash_base)
-{
-   char *cur = cmdline;
-
-	*crash_size = memparse(cmdline, &cur);
-   if (cmdline == cur) {
-   pr_warn("crashkernel: memory value expected\n");
-   return -EINVAL;
-  

Re: [PATCH v4 2/2] x86/boot/KASLR: skip the specified crashkernel region

2019-05-06 Thread Pingfan Liu
On Thu, Apr 18, 2019 at 8:32 PM Borislav Petkov  wrote:
>
> On Thu, Apr 18, 2019 at 03:56:09PM +0800, Pingfan Liu wrote:
> > Then in my case, either no @offset or invalid argument will keep
> > "*crash_base = 0", and KASLR does not care about either of them.
>
> Ok.
>
> > It is not elegant. Will try a separate patch to fix it firstly.
>
> That's appreciated, thanks. It is about time that whole kexec/kaslr/...
> code gets some much needed cleaning up and streamlining.
>
I have tried it: v1 at https://patchwork.kernel.org/patch/10909627/ and
v2 at https://patchwork.kernel.org/patch/10914169/. It seems there is no
more feedback, and it is hard to push forward.

Since "x86/boot/KASLR: skip the specified crashkernel region" has no
dependency on the above patch, I would like to send the next version
for it.

Regards,
Pingfan


Re: [PATCHv2] kernel/crash: make parse_crashkernel()'s return value more indicant

2019-04-25 Thread Pingfan Liu
On Wed, Apr 24, 2019 at 4:31 PM Matthias Brugger  wrote:
>
>
[...]
> > @@ -139,6 +141,8 @@ static int __init parse_crashkernel_simple(char 
> > *cmdline,
> >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> >   return -EINVAL;
> >   }
> > + if (*crash_size == 0)
> > + return -EINVAL;
>
> This covers the case where I pass an argument like "crashkernel=0M" ?
> Can't we fix that by using kstrtoull() in memparse and check if the return 
> value
> is < 0? In that case we could return without updating the retptr and we will 
> be
> fine.
>
It seems that kstrtoull() treats "0M" as an invalid parameter, while
simple_strtoull() does not.

If changed as you suggest, then all callers of memparse() will treat "0M"
as an invalid parameter. This affects many components besides kexec; I am
not sure whether this can be done.
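
A small user-space sketch of the difference (simplified stand-ins for the
kernel's memparse() and kstrtoull(), for illustration only):

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *arg = "0M";
		char *end;

		/* memparse()-style: parse a number, then scale by the suffix. */
		unsigned long long val = strtoull(arg, &end, 0);
		if (*end == 'M' || *end == 'm')
			val <<= 20;
		printf("memparse-style: %llu (accepted)\n", val);	/* prints 0 */

		/* kstrtoull()-style: the whole string must be numeric. */
		val = strtoull(arg, &end, 0);
		if (*end != '\0')
			printf("kstrtoull-style: rejected (trailing \"%s\")\n", end);

		return 0;
	}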

Regards,
Pingfan

> >
> >   return 0;
> >  }
> > @@ -181,6 +185,8 @@ static int __init parse_crashkernel_suffix(char 
> > *cmdline,
> >   pr_warn("crashkernel: unrecognized char: %c\n", *cur);
> >   return -EINVAL;
> >   }
> > + if (*crash_size == 0)
> > + return -EINVAL;
>
> Same here.
>
> >
> >   return 0;
> >  }
> > @@ -266,6 +272,8 @@ static int __init __parse_crashkernel(char *cmdline,
> >  /*
> >   * That function is the entry point for command line parsing and should be
> >   * called from the arch-specific code.
> > + * On success 0. On error for either syntax error or crash_size=0, -EINVAL 
> > is
> > + * returned.
> >   */
> >  int __init parse_crashkernel(char *cmdline,
> >unsigned long long system_ram,
> >


Re: [PATCH v4 2/2] x86/boot/KASLR: skip the specified crashkernel region

2019-04-18 Thread Pingfan Liu
On Thu, Apr 18, 2019 at 12:06 AM Borislav Petkov  wrote:
>
> On Wed, Apr 17, 2019 at 01:53:37PM +0800, Pingfan Liu wrote:
> > Take __parse_crashkernel()->parse_crashkernel_simple() for example. If
> > no offset given, then it still return 0, but crash_base is dangling.

Sorry for misleading, I made a mistake. In
parse_crashkernel()->__parse_crashkernel(), { *crash_size = 0;
*crash_base = 0;}. Hence no need to initialize crash_base in
handle_crashkernel_options().
>
> Well, that is bad design. parse_crashkernel_simple() should return a
> *separate* distinct value which denotes that @offset hasn't been passed.

Then in my case, either no @offset or invalid argument will keep
"*crash_base = 0", and KASLR does not care about either of them.
>
> Please fix that by having it return 1 or something else positive to
> denote that there wasn't an [@offset] given.
>
> And then correct that crap here:
>
> static void __init reserve_crashkernel(void)
> {
> ...
>
> ret = parse_crashkernel(boot_command_line, total_mem, _size, 
> _base);
> if (ret != 0 || crash_size <= 0) {
It is not elegant. I will try a separate patch to fix it first.
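
Something along these lines (just a sketch of the suggested return-value
convention, not an actual patch):

	/* parse_crashkernel_simple(): return a distinct positive value when
	 * no @offset was supplied, so callers need not inspect crash_base. */
	*crash_size = memparse(cmdline, &cur);
	if (cmdline == cur) {
		pr_warn("crashkernel: memory value expected\n");
		return -EINVAL;
	}
	if (*cur == '@') {
		*crash_base = memparse(cur + 1, &cur);
		return 0;
	}
	return 1;	/* valid size, but no @offset given */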

Thanks,
Pingfan
>
> where *two*! variables are used as return values from a single function.
> That's just sloppy.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v4 2/2] x86/boot/KASLR: skip the specified crashkernel region

2019-04-16 Thread Pingfan Liu
On Wed, Apr 17, 2019 at 3:01 AM Borislav Petkov  wrote:
>
> On Mon, Apr 08, 2019 at 01:58:35PM +0800, Pingfan Liu wrote:
> > crashkernel=x@y or or =range1:size1[,range2:size2,...]@offset option may
> > fail to reserve the required memory region if KASLR puts kernel into the
> > region. To avoid this uncertainty, asking KASLR to skip the required
> > region.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Baoquan He 
> > Cc: Will Deacon 
> > Cc: Nicolas Pitre 
> > Cc: Vivek Goyal 
> > Cc: Chao Fan 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Ard Biesheuvel 
> > CC: Hari Bathini 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  arch/x86/boot/compressed/kaslr.c | 40 
> > 
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/arch/x86/boot/compressed/kaslr.c 
> > b/arch/x86/boot/compressed/kaslr.c
> > index 2e53c05..765a593 100644
> > --- a/arch/x86/boot/compressed/kaslr.c
> > +++ b/arch/x86/boot/compressed/kaslr.c
> > @@ -107,6 +107,7 @@ enum mem_avoid_index {
> >   MEM_AVOID_BOOTPARAMS,
> >   MEM_AVOID_MEMMAP_BEGIN,
> >   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 
> > 1,
> > + MEM_AVOID_CRASHKERNEL,
> >   MEM_AVOID_MAX,
> >  };
> >
> > @@ -131,6 +132,11 @@ char *skip_spaces(const char *str)
> >  }
> >  #include "../../../../lib/ctype.c"
> >  #include "../../../../lib/cmdline.c"
> > +#ifdef CONFIG_CRASH_CORE
> > +#define printk
> > +#define _BOOT_KASLR
> > +#include "../../../../lib/parse_crashkernel.c"
> > +#endif
> >
> >  static int
> >  parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
> > @@ -292,6 +298,39 @@ static void handle_mem_options(void)
> >   return;
> >  }
> >
> > +static u64 mem_ram_size(void)
> > +{
> > + struct boot_e820_entry *entry;
> > + u64 total_sz = 0;
> > + int i;
> > +
> > + for (i = 0; i < boot_params->e820_entries; i++) {
> > + entry = _params->e820_table[i];
> > + /* Skip non-RAM entries. */
> > + if (entry->type != E820_TYPE_RAM)
> > + continue;
> > + total_sz += entry->size;
> > + }
> > + return total_sz;
> > +}
> > +
> > +/*
> > + * For crashkernel=size@offset or =range1:size1[,range2:size2,...]@offset
> > + * options, recording mem_avoid for them.
> > + */
> > +static void handle_crashkernel_options(void)
> > +{
> > + unsigned long long crash_size, crash_base = 0;
> > + char *cmdline = (char *)get_cmd_line_ptr();
> > + u64 total_sz = mem_ram_size();
> > +
> > + parse_crashkernel(cmdline, total_sz, _size, _base);
>
> That function has a return value which you could test. And then you
> don't need to set crash_base to 0 above.
>
Take __parse_crashkernel()->parse_crashkernel_simple() for example: if no
offset is given, it still returns 0, but crash_base is left dangling.

Regards,
Pingfan


Re: [PATCH v4 1/2] kernel/crash_core: separate the parsing routines to lib/parse_crashkernel.c

2019-04-16 Thread Pingfan Liu
On Wed, Apr 17, 2019 at 3:01 AM Borislav Petkov  wrote:
>
> On Mon, Apr 08, 2019 at 01:58:34PM +0800, Pingfan Liu wrote:
> > Beside kernel, at early boot stage, the KASLR code also needs to parse the
> > crashkernel=x@y or crashkernel=ramsize-range:size[,...][@offset] option,
> > and avoid to put randomized kernel in the region.
> >
> > Extracting the parsing related routines to lib/parse_crashkernel.c, so it
> > will be handy included by other
> > files.
>
> Use this commit message for your next submission:
>
> crash: Carve out crashkernel= cmdline parsing
>
> Make the "crashkernel=" parsing functionality available to the early
> KASLR code. Will be used by a later patch to parse crashkernel regions
> which KASLR should aviod.
>
OK.

> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Baoquan He 
> > Cc: Will Deacon 
> > Cc: Nicolas Pitre 
> > Cc: Chao Fan 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Ard Biesheuvel 
> > Cc: Vivek Goyal 
> > CC: Hari Bathini 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  kernel/crash_core.c | 273 -
> >  lib/Makefile|   2 +
> >  lib/parse_crashkernel.c | 289 
> > 
> >  3 files changed, 291 insertions(+), 273 deletions(-)
>
> And this is not how you carve out code.
>
> First, you do a patch which does only code move. Nothing more.
>
> In a follow on patch, you make the changes to the moved code so that it
> is immediately visible what you're changing.
>
Will fix it. Thanks for your review.

Regards,
Pingfan


[PATCH v4 2/2] x86/boot/KASLR: skip the specified crashkernel region

2019-04-07 Thread Pingfan Liu
The crashkernel=x@y or =range1:size1[,range2:size2,...]@offset option may
fail to reserve the required memory region if KASLR puts the kernel into that
region. To avoid this uncertainty, ask KASLR to skip the required region.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Vivek Goyal 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/kaslr.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 2e53c05..765a593 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -107,6 +107,7 @@ enum mem_avoid_index {
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
+   MEM_AVOID_CRASHKERNEL,
MEM_AVOID_MAX,
 };
 
@@ -131,6 +132,11 @@ char *skip_spaces(const char *str)
 }
 #include "../../../../lib/ctype.c"
 #include "../../../../lib/cmdline.c"
+#ifdef CONFIG_CRASH_CORE
+#define printk
+#define _BOOT_KASLR
+#include "../../../../lib/parse_crashkernel.c"
+#endif
 
 static int
 parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
@@ -292,6 +298,39 @@ static void handle_mem_options(void)
return;
 }
 
+static u64 mem_ram_size(void)
+{
+   struct boot_e820_entry *entry;
+   u64 total_sz = 0;
+   int i;
+
+   for (i = 0; i < boot_params->e820_entries; i++) {
+		entry = &boot_params->e820_table[i];
+   /* Skip non-RAM entries. */
+   if (entry->type != E820_TYPE_RAM)
+   continue;
+   total_sz += entry->size;
+   }
+   return total_sz;
+}
+
+/*
+ * For crashkernel=size@offset or =range1:size1[,range2:size2,...]@offset
+ * options, recording mem_avoid for them.
+ */
+static void handle_crashkernel_options(void)
+{
+   unsigned long long crash_size, crash_base = 0;
+   char *cmdline = (char *)get_cmd_line_ptr();
+   u64 total_sz = mem_ram_size();
+
+	parse_crashkernel(cmdline, total_sz, &crash_size, &crash_base);
+   if (crash_base) {
+   mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
+   }
+}
+
 /*
  * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
  * The mem_avoid array is used to store the ranges that need to be avoided
@@ -414,6 +453,7 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
 
/* Mark the memmap regions we need to avoid */
handle_mem_options();
+   handle_crashkernel_options();
 
/* Enumerate the immovable memory regions */
num_immovable_mem = count_immovable_mem_regions();
-- 
2.7.4



[PATCH v4 1/2] kernel/crash_core: separate the parsing routines to lib/parse_crashkernel.c

2019-04-07 Thread Pingfan Liu
Besides the kernel proper, the KASLR code also needs to parse the
crashkernel=x@y or crashkernel=ramsize-range:size[,...][@offset] option at
the early boot stage and avoid putting the randomized kernel in that region.

Extract the parsing-related routines to lib/parse_crashkernel.c so that they
can easily be included by other files.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
Cc: Vivek Goyal 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
 kernel/crash_core.c | 273 -
 lib/Makefile|   2 +
 lib/parse_crashkernel.c | 289 
 3 files changed, 291 insertions(+), 273 deletions(-)
 create mode 100644 lib/parse_crashkernel.c

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 093c9f9..37c4d6f 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,279 +21,6 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
-/*
- * parsing the "crashkernel" commandline
- *
- * this code is intended to be called from architecture specific code
- */
-
-
-/*
- * This function parses command lines in the format
- *
- *   crashkernel=ramsize-range:size[,...][@offset]
- *
- * The function returns 0 on success and -EINVAL on failure.
- */
-static int __init parse_crashkernel_mem(char *cmdline,
-   unsigned long long system_ram,
-   unsigned long long *crash_size,
-   unsigned long long *crash_base)
-{
-   char *cur = cmdline, *tmp;
-
-   /* for each entry of the comma-separated list */
-   do {
-   unsigned long long start, end = ULLONG_MAX, size;
-
-   /* get the start of the range */
-		start = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (*cur != '-') {
-   pr_warn("crashkernel: '-' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   /* if no ':' is here, than we read the end */
-   if (*cur != ':') {
-			end = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (end <= start) {
-   pr_warn("crashkernel: end <= start\n");
-   return -EINVAL;
-   }
-   }
-
-   if (*cur != ':') {
-   pr_warn("crashkernel: ':' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-		size = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (size >= system_ram) {
-   pr_warn("crashkernel: invalid size\n");
-   return -EINVAL;
-   }
-
-   /* match ? */
-   if (system_ram >= start && system_ram < end) {
-   *crash_size = size;
-   break;
-   }
-   } while (*cur++ == ',');
-
-   if (*crash_size > 0) {
-   while (*cur && *cur != ' ' && *cur != '@')
-   cur++;
-   if (*cur == '@') {
-   cur++;
-			*crash_base = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected after '@'\n");
-   return -EINVAL;
-   }
-   }
-   } else
-   pr_info("crashkernel size resulted in zero bytes\n");
-
-   return 0;
-}
-
-/*
- * That function parses "simple" (old) crashkernel command lines like
- *
- * crashkernel=size[@offset]
- *
- * It returns 0 on success and -EINVAL on failure.
- */
-static int __init parse_crashkernel_simple(char *cmdline,
-  unsigned long long *crash_size,
-  unsigned long long *crash_base)
-{
-   char *cur = cmdline;
-
-   *crash_size = memp

[PATCH v4 0/2] x86/boot/KASLR: skip the specified crashkernel region

2019-04-07 Thread Pingfan Liu
The crashkernel=x@y or =range1:size1[,range2:size2,...]@offset option may
fail to reserve the required memory region if KASLR puts the kernel into that
region. To avoid this uncertainty, ask KASLR to skip the required region.
The existing parsing routine can be re-used at this early boot stage.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Vivek Goyal 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
CC: Hari Bathini 
Cc: linux-kernel@vger.kernel.org
---
v3 -> v4:
  reuse the parse_crashkernel_xx routines

Pingfan Liu (2):
  kernel/crash_core: separate the parsing routines to
lib/parse_crashkernel.c
  x86/boot/KASLR: skip the specified crashkernel region

 arch/x86/boot/compressed/kaslr.c |  40 ++
 kernel/crash_core.c  | 273 
 lib/Makefile |   2 +
 lib/parse_crashkernel.c  | 289 +++
 4 files changed, 331 insertions(+), 273 deletions(-)
 create mode 100644 lib/parse_crashkernel.c

-- 
2.7.4



Re: [PATCHv3] x86/boot/KASLR: skip the specified crashkernel region

2019-04-04 Thread Pingfan Liu
On Wed, Apr 3, 2019 at 11:10 AM Baoquan He  wrote:
>
> On 04/03/19 at 10:58am, Pingfan Liu wrote:
> > On Tue, Apr 2, 2019 at 4:08 PM Baoquan He  wrote:
> > >
> > > > +/* handle crashkernel=x@y or =range1:size1[,range2:size2,...]@offset 
> > > > options */
> > > > +static void mem_avoid_specified_crashkernel_region(char *option)
> > > > +{
> > > > + unsigned long long crash_size, crash_base = 0;
> > > > + char*first_colon, *first_space, *cur = option;
> > > > +
> > >
> > > Another thing which need be noticed is that you may only need to handle
> > > when '@' is found. Otherwise just let it go. Right?
> > >
> > According to kernel's behavior, only the last "crashkernel=" option
> > takes effect. Hence if no '@', then clearing mem_avoid
>
> Here I mean that you can search '@' at the beginning if crashkernel is
> found. Maybe no need to clear mem_avoid since it's global data, has been
> initialized to 0 during loading. It's in BSS, right?
>
Consider the following cmdline: crashkernel=256M@1G crashkernel=512M.
These two options are handled independently by handle_mem_options(),
and the latter should overwrite the former. This is the behavior of the
current kernel code; see get_last_crashkernel().
> You don't have to search first colon or first space, then parse size of
> crashkernel, and finally find out that it's only 
> crashkernel=512M-2G:64M,2G-:128M
> style, no '@' specified. What do you think?
>
Yes, that will be better. But I plan to reuse parse_crashkernel_mem/simple().

Thanks,
Pingfan
> >
> > > > + first_colon = strchr(option, ':');
> > > > + first_space = strchr(option, ' ');
> > > > + /* if contain ":" */
> > > > + if (first_colon && (!first_space || first_colon < first_space)) {
> > > > + int i;
> > > > + u64 total_sz = 0;
> > > > + struct boot_e820_entry *entry;
> > > > +
> > > > + for (i = 0; i < boot_params->e820_entries; i++) {
> > > > + entry = _params->e820_table[i];
> > > > + /* Skip non-RAM entries. */
> > > > + if (entry->type != E820_TYPE_RAM)
> > > > + continue;
> > > > + total_sz += entry->size;
> > > > + }
> > > > + handle_crashkernel_mem(option, total_sz, _size,
> > > > + _base);
> > > > + } else {
> > > > + crash_size = memparse(option, );
> > > > + if (option == cur)
> > > > + return;
> > > > + while (*cur && *cur != ' ' && *cur != '@')
> > > > + cur++;
> > > > + if (*cur == '@') {
> > > > + option = cur + 1;
> > > > + crash_base = memparse(option, );
> > > > + }
> > > > + }
> > > > + if (crash_base) {
> > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> > > > + } else {
> > > > + /*
> > > > +  * Clearing mem_avoid if no offset is given. This is 
> > > > consistent
> > > > +  * with kernel, which uses the last crashkernel= option.
> > > > +  */
> > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = 0;
> > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = 0;
> > > > + }
> > > > +}


Re: [PATCHv3] x86/boot/KASLR: skip the specified crashkernel region

2019-04-02 Thread Pingfan Liu
On Tue, Apr 2, 2019 at 2:46 PM Baoquan He  wrote:
>
> On 04/02/19 at 12:10pm, Pingfan Liu wrote:
> > crashkernel=x@y or or =range1:size1[,range2:size2,...]@offset option may
> > fail to reserve the required memory region if KASLR puts kernel into the
> > region. To avoid this uncertainty, asking KASLR to skip the required
> > region.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Baoquan He 
> > Cc: Will Deacon 
> > Cc: Nicolas Pitre 
> > Cc: Pingfan Liu 
> > Cc: Chao Fan 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Ard Biesheuvel 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> > v2 -> v3: adding parsing of 
> > crashkernel=range1:size1[,range2:size2,...]@offset
> >
> >  arch/x86/boot/compressed/kaslr.c | 116 
> > ++-
> >  1 file changed, 114 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/kaslr.c 
> > b/arch/x86/boot/compressed/kaslr.c
> > index 2e53c05..7f698f4 100644
> > --- a/arch/x86/boot/compressed/kaslr.c
> > +++ b/arch/x86/boot/compressed/kaslr.c
> > @@ -107,6 +107,7 @@ enum mem_avoid_index {
> >   MEM_AVOID_BOOTPARAMS,
> >   MEM_AVOID_MEMMAP_BEGIN,
> >   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 
> > 1,
> > + MEM_AVOID_CRASHKERNEL,
> >   MEM_AVOID_MAX,
> >  };
> >
> > @@ -238,6 +239,115 @@ static void parse_gb_huge_pages(char *param, char 
> > *val)
> >   }
> >  }
> >
> > +/* code heavily copied from parse_crashkernel_mem() */
> > +static void handle_crashkernel_mem(char *cmdline,
> > + unsigned long long system_ram,
> > + unsigned long long *crash_size,
> > + unsigned long long *crash_base)
>
> This version looks better and the logic is simple. It will be much better
> if we can share code with parse_crashkernel_mem() since both of them look
> almost the same.
>
That is a little hard, but I will give it a try.
> > +{
> > + char *tmp, *cur = cmdline;
> > +
> > + /* for each entry of the comma-separated list */
> > + do {
> > + unsigned long long start, end = ULLONG_MAX, size;
> > +
> > + /* get the start of the range */
> > + start = memparse(cur, );
> > + /* no value given */
> > + if (cur == tmp)
> > + return;
> > + cur = tmp;
> > + if (*cur != '-')
> > + return;
> > + cur++;
> > +
> > + /* if no ':' is here, than we read the end */
> > + if (*cur != ':') {
> > + end = memparse(cur, );
> > + /* no value given */
> > + if (cur == tmp)
> > + return;
> > + cur = tmp;
> > + /* invalid if crashkernel end <= start */
> > + if (end <= start)
> > + return;
> > + }
> > + /* expect ":" after range */
> > + if (*cur != ':')
> > + return;
> > + cur++;
> > +
> > + size = memparse(cur, );
> > + /* no size value given */
> > + if (cur == tmp)
> > + return;
> > + cur = tmp;
> > + if (size >= system_ram)
> > + return;
> > +
> > + /* match ? */
> > + if (system_ram >= start && system_ram < end) {
> > + *crash_size = size;
> > + break;
> > + }
> > + } while (*cur++ == ',');
> > +
> > + if (*crash_size > 0) {
> > + while (*cur && *cur != ' ' && *cur != '@')
> > + cur++;
> > + if (*cur == '@') {
> > + cur++;
> > + *crash_base = memparse(cur, );
> > + }
> > + }
> > +}
> > +
> > +/* handle crashkernel=x@y or =range1:size1[,range2:size2,...]@offset 
> > options */
> > +static void mem_avoid_specified_crashkernel_region(char *option)
>
> Maybe just add more words to expl

Re: [PATCHv3] x86/boot/KASLR: skip the specified crashkernel region

2019-04-02 Thread Pingfan Liu
On Tue, Apr 2, 2019 at 4:08 PM Baoquan He  wrote:
>
> > +/* handle crashkernel=x@y or =range1:size1[,range2:size2,...]@offset 
> > options */
> > +static void mem_avoid_specified_crashkernel_region(char *option)
> > +{
> > + unsigned long long crash_size, crash_base = 0;
> > + char*first_colon, *first_space, *cur = option;
> > +
>
> Another thing which need be noticed is that you may only need to handle
> when '@' is found. Otherwise just let it go. Right?
>
According to the kernel's behavior, only the last "crashkernel=" option
takes effect. Hence if there is no '@', the mem_avoid slot is cleared.
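For example (an illustrative command line): with
"crashkernel=256M@1G crashkernel=256M", only the trailing crashkernel=256M
takes effect; it carries no '@offset', so the MEM_AVOID_CRASHKERNEL slot is
cleared again instead of being left pointing at the 1G region.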

> > + first_colon = strchr(option, ':');
> > + first_space = strchr(option, ' ');
> > + /* if contain ":" */
> > + if (first_colon && (!first_space || first_colon < first_space)) {
> > + int i;
> > + u64 total_sz = 0;
> > + struct boot_e820_entry *entry;
> > +
> > + for (i = 0; i < boot_params->e820_entries; i++) {
> > + entry = &boot_params->e820_table[i];
> > + /* Skip non-RAM entries. */
> > + if (entry->type != E820_TYPE_RAM)
> > + continue;
> > + total_sz += entry->size;
> > + }
> > + handle_crashkernel_mem(option, total_sz, &crash_size,
> > + &crash_base);
> > + } else {
> > + crash_size = memparse(option, &cur);
> > + if (option == cur)
> > + return;
> > + while (*cur && *cur != ' ' && *cur != '@')
> > + cur++;
> > + if (*cur == '@') {
> > + option = cur + 1;
> > + crash_base = memparse(option, &cur);
> > + }
> > + }
> > + if (crash_base) {
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> > + } else {
> > + /*
> > +  * Clearing mem_avoid if no offset is given. This is 
> > consistent
> > +  * with kernel, which uses the last crashkernel= option.
> > +  */
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = 0;
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = 0;
> > + }
> > +}


Re: [PATCHv3] x86/boot/KASLR: skip the specified crashkernel region

2019-04-02 Thread Pingfan Liu
On Tue, Apr 2, 2019 at 1:20 PM Chao Fan  wrote:
>
> On Tue, Apr 02, 2019 at 12:10:46PM +0800, Pingfan Liu wrote:
> >crashkernel=x@y or or =range1:size1[,range2:size2,...]@offset option may
> or or?
> >fail to reserve the required memory region if KASLR puts kernel into the
> >region. To avoid this uncertainty, asking KASLR to skip the required
> >region.
> >
> >Signed-off-by: Pingfan Liu 
> >Cc: Thomas Gleixner 
> >Cc: Ingo Molnar 
> >Cc: Borislav Petkov 
> >Cc: "H. Peter Anvin" 
> >Cc: Baoquan He 
> >Cc: Will Deacon 
> >Cc: Nicolas Pitre 
> >Cc: Pingfan Liu 
> >Cc: Chao Fan 
> >Cc: "Kirill A. Shutemov" 
> >Cc: Ard Biesheuvel 
> >Cc: linux-kernel@vger.kernel.org
> >---
> [...]
> >+
> >+/* handle crashkernel=x@y or =range1:size1[,range2:size2,...]@offset 
> >options */
>
> Before review, I want to say more about the background.
> It's very hard to review the code for someone who is not so familiar
> with kdump, so could you please explain more ahout
> the uasge of crashkernel=range1:size1[,range2:size2,...]@offset.
> And also there are so many jobs who are parsing string. So I really
> need your help to understand the PATCH.
>
> >+static void mem_avoid_specified_crashkernel_region(char *option)
> >+{
> >+  unsigned long long crash_size, crash_base = 0;
> >+  char*first_colon, *first_space, *cur = option;
> Is there a tab after char?
> >+
> >+  first_colon = strchr(option, ':');
> >+  first_space = strchr(option, ' ');
> >+  /* if contain ":" */
> >+  if (first_colon && (!first_space || first_colon < first_space)) {
> >+  int i;
> >+  u64 total_sz = 0;
> >+  struct boot_e820_entry *entry;
> >+
> >+  for (i = 0; i < boot_params->e820_entries; i++) {
> >+  entry = &boot_params->e820_table[i];
> >+  /* Skip non-RAM entries. */
> >+  if (entry->type != E820_TYPE_RAM)
> >+  continue;
> >+  total_sz += entry->size;
> I wonder whether it's needed to consider the memory ranges here.
> I think it's OK to only record the regions should to be avoid.
Maybe I did not catch your point exactly. In the case of
crashkernel=range1:size1[,range2:size2,...]@offset, the size of the avoided
region depends on the total amount of system RAM.
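For example (illustrative values), with

	crashkernel=512M-2G:64M,2G-:128M@16M

handle_crashkernel_mem() walks the comma-separated list and picks the first
range that covers the total RAM summed from the E820 RAM entries: on a 1G
machine the 512M-2G entry matches and 64M at offset 16M is avoided; on a 4G
machine the open-ended 2G- entry matches and 128M at offset 16M is avoided.
memparse() takes care of the K/M/G suffixes.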
> I remeber I ever talked with Baoquan about the similiar problems.
> @Baoquan, I am not sure if I misunderstand something.
>
> Thanks,
> Chao Fan
> >+  }
> >+  handle_crashkernel_mem(option, total_sz, &crash_size,
> >+  &crash_base);
> >+  } else {
> >+  crash_size = memparse(option, &cur);
> >+  if (option == cur)
> >+  return;
> >+  while (*cur && *cur != ' ' && *cur != '@')
> >+  cur++;
> >+  if (*cur == '@') {
> >+  option = cur + 1;
> >+  crash_base = memparse(option, &cur);
> >+  }
> >+  }
> >+  if (crash_base) {
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> >+  } else {
> >+  /*
> >+   * Clearing mem_avoid if no offset is given. This is 
> >consistent
> >+   * with kernel, which uses the last crashkernel= option.
> >+   */
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].start = 0;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].size = 0;
> >+  }
> >+}
> >
> > static void handle_mem_options(void)
> > {
> >@@ -248,7 +358,7 @@ static void handle_mem_options(void)
> >   u64 mem_size;
> >
> >   if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
> >-  !strstr(args, "hugepages"))
> >+  !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
> >   return;
> >
> >   tmp_cmdline = malloc(len + 1);
> >@@ -284,6 +394,8 @@ static void handle_mem_options(void)
> >   goto out;
> >
> >   mem_limit = mem_size;
> >+  } else if (strstr(param, "crashkernel")) {
> >+  mem_avoid_specified_crashkernel_region(val);
> >   }
> >   }
> >
> >@@ -412,7 +524,7 @@ static void mem_avoid_init(unsigned long input, unsigned 
> >long input_size,
> >
> >   /* We don't need to set a mapping for setup_data. */
> >
> >-  /* Mark the memmap regions we need to avoid */
> >+  /* Mark the regions we need to avoid */
> >   handle_mem_options();
> >
> >   /* Enumerate the immovable memory regions */
> >--
> >2.7.4
> >
> >
> >
>
>


[PATCHv3] x86/boot/KASLR: skip the specified crashkernel region

2019-04-01 Thread Pingfan Liu
crashkernel=x@y or or =range1:size1[,range2:size2,...]@offset option may
fail to reserve the required memory region if KASLR puts kernel into the
region. To avoid this uncertainty, asking KASLR to skip the required
region.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Pingfan Liu 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
Cc: linux-kernel@vger.kernel.org
---
v2 -> v3: adding parsing of crashkernel=range1:size1[,range2:size2,...]@offset

 arch/x86/boot/compressed/kaslr.c | 116 ++-
 1 file changed, 114 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 2e53c05..7f698f4 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -107,6 +107,7 @@ enum mem_avoid_index {
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
+   MEM_AVOID_CRASHKERNEL,
MEM_AVOID_MAX,
 };
 
@@ -238,6 +239,115 @@ static void parse_gb_huge_pages(char *param, char *val)
}
 }
 
+/* code heavily copied from parse_crashkernel_mem() */
+static void handle_crashkernel_mem(char *cmdline,
+   unsigned long long system_ram,
+   unsigned long long *crash_size,
+   unsigned long long *crash_base)
+{
+   char *tmp, *cur = cmdline;
+
+   /* for each entry of the comma-separated list */
+   do {
+   unsigned long long start, end = ULLONG_MAX, size;
+
+   /* get the start of the range */
+   start = memparse(cur, &tmp);
+   /* no value given */
+   if (cur == tmp)
+   return;
+   cur = tmp;
+   if (*cur != '-')
+   return;
+   cur++;
+
+   /* if no ':' is here, than we read the end */
+   if (*cur != ':') {
+   end = memparse(cur, &tmp);
+   /* no value given */
+   if (cur == tmp)
+   return;
+   cur = tmp;
+   /* invalid if crashkernel end <= start */
+   if (end <= start)
+   return;
+   }
+   /* expect ":" after range */
+   if (*cur != ':')
+   return;
+   cur++;
+
+   size = memparse(cur, &tmp);
+   /* no size value given */
+   if (cur == tmp)
+   return;
+   cur = tmp;
+   if (size >= system_ram)
+   return;
+
+   /* match ? */
+   if (system_ram >= start && system_ram < end) {
+   *crash_size = size;
+   break;
+   }
+   } while (*cur++ == ',');
+
+   if (*crash_size > 0) {
+   while (*cur && *cur != ' ' && *cur != '@')
+   cur++;
+   if (*cur == '@') {
+   cur++;
+   *crash_base = memparse(cur, &tmp);
+   }
+   }
+}
+
+/* handle crashkernel=x@y or =range1:size1[,range2:size2,...]@offset options */
+static void mem_avoid_specified_crashkernel_region(char *option)
+{
+   unsigned long long crash_size, crash_base = 0;
+   char*first_colon, *first_space, *cur = option;
+
+   first_colon = strchr(option, ':');
+   first_space = strchr(option, ' ');
+   /* if contain ":" */
+   if (first_colon && (!first_space || first_colon < first_space)) {
+   int i;
+   u64 total_sz = 0;
+   struct boot_e820_entry *entry;
+
+   for (i = 0; i < boot_params->e820_entries; i++) {
+   entry = &boot_params->e820_table[i];
+   /* Skip non-RAM entries. */
+   if (entry->type != E820_TYPE_RAM)
+   continue;
+   total_sz += entry->size;
+   }
+   handle_crashkernel_mem(option, total_sz, &crash_size,
+   &crash_base);
+   } else {
+   crash_size = memparse(option, &cur);
+   if (option == cur)
+   return;
+   while (*cur && *cur != ' ' && *cur != '@')
+   cur++;
+   if (*cur == '@') {
+   option = cur + 1;
+   crash_base = memparse(option, &cur);
+   }
+   }
+   if (crash_base) {
+   mem_avoi

Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-29 Thread Pingfan Liu
On Fri, Mar 29, 2019 at 3:34 PM Baoquan He  wrote:
>
> On 03/29/19 at 03:25pm, Pingfan Liu wrote:
> > On Fri, Mar 29, 2019 at 2:27 PM Baoquan He  wrote:
> > >
> > > On 03/29/19 at 01:45pm, Pingfan Liu wrote:
> > > > On Fri, Mar 22, 2019 at 4:34 PM Baoquan He  wrote:
> > > > >
> > > > > On 03/22/19 at 03:52pm, Baoquan He wrote:
> > > > > > On 03/22/19 at 03:43pm, Pingfan Liu wrote:
> > > > > > > > > +/* parse crashkernel=x@y option */
> > > > > > > > > +static void mem_avoid_crashkernel_simple(char *option)
> > > > > > > >
> > > > > > > > Chao ever mentioned this, I want to ask again, why does it has 
> > > > > > > > to be
> > > > > > > > xxx_simple()?
> > > > > > > >
> > > > > > > Seems that I had replied Chao's question in another email. The 
> > > > > > > naming
> > > > > > > follows the function parse_crashkernel_simple(), as the notes 
> > > > > > > above
> > > > > >
> > > > > >
> > > > > > Sorry, I don't get.  typo?
> > > > >
> > > > > OK, I misunderstood it. We do have parse_crashkernel_simple() to 
> > > > > handle
> > > > > crashkernel=size[@offset] case, to differente with other complicated
> > > > > cases, like crashkernel=size,[high|low],
> > > > >
> > > > > Then I am fine with this naming. Soryy about the noise.
> > > > >
> > > > > By the way, do you think if we should take care of this case:
> > > > > crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> > > > >
> > > > > It can also specify @offset. Not sure if it's too complicated, you may
> > > > > have a investigation.
> > > > >
> > > > In this case, kernel should get the total memory size info. So
> > > > process_e820_entries() or process_efi_entries() should be called
> > > > twice. One before handle_mem_options(), so crashkernel can evaluate
> > > > the reserved size. It is doable, and what is your opinion about the
> > >
> > > You mean calling process_e820_entries to calculate the RAM size in
> > > system? I may not do like that, please check what __find_max_addr() is
> > > doing. Did I get it?
> >
> > Yes, you got my meaning. But __find_max_addr() relies on the info, fed
> > by e820__memblock_setup(). It also involves the iteration of all e820
> > entries to get the max address. No essential difference, right?
>
> Hmm, I would say iterating e820 or efi entries to get the mas addr should be
> different with calling process_e820_entries(). The 1st is much simpler,
> right?
>
Yes. My original meaning was to reuse process_e820_entries(), but without
calling process_mem_region() the first time.
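Something like the following rough sketch is what I have in mind
(illustrative only, not a posted patch; the extra sum_only flag is an
assumption, not existing code):

static u64 process_e820_entries(unsigned long minimum,
				unsigned long image_size, bool sum_only)
{
	u64 total_ram = 0;
	int i;
	struct mem_vector region;
	struct boot_e820_entry *entry;

	for (i = 0; i < boot_params->e820_entries; i++) {
		entry = &boot_params->e820_table[i];
		/* Skip non-RAM entries. */
		if (entry->type != E820_TYPE_RAM)
			continue;
		if (sum_only) {
			/* first pass: only account the amount of RAM */
			total_ram += entry->size;
			continue;
		}
		region.start = entry->addr;
		region.size = entry->size;
		process_mem_region(&region, minimum, image_size);
	}
	return total_ram;
}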

Thanks,
Pingfan


Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-29 Thread Pingfan Liu
On Fri, Mar 29, 2019 at 2:27 PM Baoquan He  wrote:
>
> On 03/29/19 at 01:45pm, Pingfan Liu wrote:
> > On Fri, Mar 22, 2019 at 4:34 PM Baoquan He  wrote:
> > >
> > > On 03/22/19 at 03:52pm, Baoquan He wrote:
> > > > On 03/22/19 at 03:43pm, Pingfan Liu wrote:
> > > > > > > +/* parse crashkernel=x@y option */
> > > > > > > +static void mem_avoid_crashkernel_simple(char *option)
> > > > > >
> > > > > > Chao ever mentioned this, I want to ask again, why does it has to be
> > > > > > xxx_simple()?
> > > > > >
> > > > > Seems that I had replied Chao's question in another email. The naming
> > > > > follows the function parse_crashkernel_simple(), as the notes above
> > > >
> > > >
> > > > Sorry, I don't get.  typo?
> > >
> > > OK, I misunderstood it. We do have parse_crashkernel_simple() to handle
> > > crashkernel=size[@offset] case, to differente with other complicated
> > > cases, like crashkernel=size,[high|low],
> > >
> > > Then I am fine with this naming. Soryy about the noise.
> > >
> > > By the way, do you think if we should take care of this case:
> > > crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
> > >
> > > It can also specify @offset. Not sure if it's too complicated, you may
> > > have a investigation.
> > >
> > In this case, kernel should get the total memory size info. So
> > process_e820_entries() or process_efi_entries() should be called
> > twice. One before handle_mem_options(), so crashkernel can evaluate
> > the reserved size. It is doable, and what is your opinion about the
>
> You mean calling process_e820_entries to calculate the RAM size in
> system? I may not do like that, please check what __find_max_addr() is
> doing. Did I get it?

Yes, you got my meaning. But __find_max_addr() relies on the info, fed
by e820__memblock_setup(). It also involves the iteration of all e820
entries to get the max address. No essential difference, right?


Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-28 Thread Pingfan Liu
On Fri, Mar 22, 2019 at 4:34 PM Baoquan He  wrote:
>
> On 03/22/19 at 03:52pm, Baoquan He wrote:
> > On 03/22/19 at 03:43pm, Pingfan Liu wrote:
> > > > > +/* parse crashkernel=x@y option */
> > > > > +static void mem_avoid_crashkernel_simple(char *option)
> > > >
> > > > Chao ever mentioned this, I want to ask again, why does it has to be
> > > > xxx_simple()?
> > > >
> > > Seems that I had replied Chao's question in another email. The naming
> > > follows the function parse_crashkernel_simple(), as the notes above
> >
> >
> > Sorry, I don't get.  typo?
>
> OK, I misunderstood it. We do have parse_crashkernel_simple() to handle
> crashkernel=size[@offset] case, to differente with other complicated
> cases, like crashkernel=size,[high|low],
>
> Then I am fine with this naming. Soryy about the noise.
>
> By the way, do you think if we should take care of this case:
> crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
>
> It can also specify @offset. Not sure if it's too complicated, you may
> have a investigation.
>
In this case, kernel should get the total memory size info. So
process_e820_entries() or process_efi_entries() should be called
twice. One before handle_mem_options(), so crashkernel can evaluate
the reserved size. It is doable, and what is your opinion about the
extra complexity?

Thanks,
Pingfan
[...]


Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-24 Thread Pingfan Liu
On Fri, Mar 22, 2019 at 4:34 PM Baoquan He  wrote:
>
> On 03/22/19 at 03:52pm, Baoquan He wrote:
> > On 03/22/19 at 03:43pm, Pingfan Liu wrote:
> > > > > +/* parse crashkernel=x@y option */
> > > > > +static void mem_avoid_crashkernel_simple(char *option)
> > > >
> > > > Chao ever mentioned this, I want to ask again, why does it has to be
> > > > xxx_simple()?
> > > >
> > > Seems that I had replied Chao's question in another email. The naming
> > > follows the function parse_crashkernel_simple(), as the notes above
> >
> >
> > Sorry, I don't get.  typo?
>
> OK, I misunderstood it. We do have parse_crashkernel_simple() to handle
> crashkernel=size[@offset] case, to differente with other complicated
> cases, like crashkernel=size,[high|low],
>
> Then I am fine with this naming. Soryy about the noise.
>
> By the way, do you think if we should take care of this case:
> crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]
>
> It can also specify @offset. Not sure if it's too complicated, you may
> have a investigation.
>
OK, I will try it.
> These two cases have dependency on your crashkernel=X bug fix patch.
No, crashkernel=x@y should have no dependency on crashkernel=X; the
latter relies on memblock searching.
> The current code only allow crashkernel= to reserve under 896MB. I
> noticed Boris has agreed on the solution. Maybe you can repost a new
> version based on the discussion.
I will sync with Dave to see whether he will post the new version.

Thank you for the kind review.

Regards,
Pingfan
>
> http://lkml.kernel.org/r/1548047768-7656-1-git-send-email-kernelf...@gmail.com
> [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent 
> with kaslr
>
> Thanks
> Baoquan
>
> >
> > > the definition
> > > /*
> > >  * That function parses "simple" (old) crashkernel command lines like
> > >  *
> > >  * crashkernel=size[@offset]
> >
> > Hmm, should only crashkernel=size@offset be cared? crashkernel=size will
> > auto finding a place to reserve, and that is after KASLR.
> >
> > >  *
> > >  * It returns 0 on success and -EINVAL on failure.
> > >  */
> > > static int __init parse_crashkernel_simple(char *cmdline,
> > >
> > > Do you have alternative suggestion?
> > >
> > > > Except of these, patch looks good to me. It's a nice catch, and only
> > > > need a simple fix based on the current code.
> > > >
> > > Thank you for the kindly review.
> > >
> > > Regards,
> > > Pingfan
> > >
> > > > Thanks
> > > > Baoquan
> > > >
> > > > > +{
> > > > > + unsigned long long crash_size, crash_base;
> > > > > + char *cur = option;
> > > > > +
> > > > > + crash_size = memparse(option, &cur);
> > > > > + if (option == cur)
> > > > > + return;
> > > > > +
> > > > > + if (*cur == '@') {
> > > > > + option = cur + 1;
> > > > > + crash_base = memparse(option, &cur);
> > > > > + if (option == cur)
> > > > > + return;
> > > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> > > > > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> > > > > + }
> > > > > +}
> > > > >
> > > > >  static void handle_mem_options(void)
> > > > >  {
> > > > > @@ -250,7 +270,7 @@ static void handle_mem_options(void)
> > > > >   u64 mem_size;
> > > > >
> > > > >   if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
> > > > > - !strstr(args, "hugepages"))
> > > > > + !strstr(args, "hugepages") && !strstr(args, 
> > > > > "crashkernel="))
> > > > >   return;
> > > > >
> > > > >   tmp_cmdline = malloc(len + 1);
> > > > > @@ -286,6 +306,8 @@ static void handle_mem_options(void)
> > > > >   goto out;
> > > > >
> > > > >   mem_limit = mem_size;
> > > > > + } else if (strstr(param, "crashkernel")) {
> > > > > + mem_avoid_crashkernel_simple(val);
> > > > >   }
> > > > >   }
> > > > >
> > > > > @@ -414,7 +436,7 @@ static void mem_avoid_init(unsigned long input, 
> > > > > unsigned long input_size,
> > > > >
> > > > >   /* We don't need to set a mapping for setup_data. */
> > > > >
> > > > > - /* Mark the memmap regions we need to avoid */
> > > > > + /* Mark the regions we need to avoid */
> > > > >   handle_mem_options();
> > > > >
> > > > >  #ifdef CONFIG_X86_VERBOSE_BOOTUP
> > > > > --
> > > > > 2.7.4
> > > > >


Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-22 Thread Pingfan Liu
On Thu, Mar 21, 2019 at 2:38 PM Chao Fan  wrote:
>
> On Wed, Mar 13, 2019 at 12:19:31PM +0800, Pingfan Liu wrote:
>
> I tested it in Qemu test with 12G memory, and set crashkernel=6G@6G.
> Without this PATCH, it successed to reserve memory just 4 times(total
> 10 times).
> With this PATCH, it successed to reserve memory 15 times(total 15
> times).
>
> So I think if you post new version, you can add:
>
> Tested-by: Chao Fan 
>
I appreciate your testing. I had done some tests on a real machine
with a private patch to narrow down the KASLR range. I think your test
method is simpler, and I will add your Tested-by.

Regards,
Pingfan

> Thanks,
> Chao Fan
>
> >crashkernel=x@y option may fail to reserve the required memory region if
> >KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
> >skip the required region.
> >
> >Signed-off-by: Pingfan Liu 
> >Cc: Thomas Gleixner 
> >Cc: Ingo Molnar 
> >Cc: Borislav Petkov 
> >Cc: "H. Peter Anvin" 
> >Cc: Baoquan He 
> >Cc: Will Deacon 
> >Cc: Nicolas Pitre 
> >Cc: Pingfan Liu 
> >Cc: Chao Fan 
> >Cc: "Kirill A. Shutemov" 
> >Cc: Ard Biesheuvel 
> >Cc: linux-kernel@vger.kernel.org
> >---
> >v1 -> v2: fix some trival format
> >
> > arch/x86/boot/compressed/kaslr.c | 26 --
> > 1 file changed, 24 insertions(+), 2 deletions(-)
> >
> >diff --git a/arch/x86/boot/compressed/kaslr.c 
> >b/arch/x86/boot/compressed/kaslr.c
> >index 9ed9709..e185318 100644
> >--- a/arch/x86/boot/compressed/kaslr.c
> >+++ b/arch/x86/boot/compressed/kaslr.c
> >@@ -109,6 +109,7 @@ enum mem_avoid_index {
> >   MEM_AVOID_BOOTPARAMS,
> >   MEM_AVOID_MEMMAP_BEGIN,
> >   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 
> > 1,
> >+  MEM_AVOID_CRASHKERNEL,
> >   MEM_AVOID_MAX,
> > };
> >
> >@@ -240,6 +241,25 @@ static void parse_gb_huge_pages(char *param, char *val)
> >   }
> > }
> >
> >+/* parse crashkernel=x@y option */
> >+static void mem_avoid_crashkernel_simple(char *option)
> >+{
> >+  unsigned long long crash_size, crash_base;
> >+  char *cur = option;
> >+
> >+  crash_size = memparse(option, &cur);
> >+  if (option == cur)
> >+  return;
> >+
> >+  if (*cur == '@') {
> >+  option = cur + 1;
> >+  crash_base = memparse(option, &cur);
> >+  if (option == cur)
> >+  return;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> >+  }
> >+}
> >
> > static void handle_mem_options(void)
> > {
> >@@ -250,7 +270,7 @@ static void handle_mem_options(void)
> >   u64 mem_size;
> >
> >   if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
> >-  !strstr(args, "hugepages"))
> >+  !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
> >   return;
> >
> >   tmp_cmdline = malloc(len + 1);
> >@@ -286,6 +306,8 @@ static void handle_mem_options(void)
> >   goto out;
> >
> >   mem_limit = mem_size;
> >+  } else if (strstr(param, "crashkernel")) {
> >+  mem_avoid_crashkernel_simple(val);
> >   }
> >   }
> >
> >@@ -414,7 +436,7 @@ static void mem_avoid_init(unsigned long input, unsigned 
> >long input_size,
> >
> >   /* We don't need to set a mapping for setup_data. */
> >
> >-  /* Mark the memmap regions we need to avoid */
> >+  /* Mark the regions we need to avoid */
> >   handle_mem_options();
> >
> > #ifdef CONFIG_X86_VERBOSE_BOOTUP
> >--
> >2.7.4
> >
> >
> >
>
>


Re: [PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-22 Thread Pingfan Liu
On Wed, Mar 20, 2019 at 8:25 AM Baoquan He  wrote:
>
> Please change subject as:
>
> "x86/boot/KASLR: skip the specified crashkernel region"
>
OK.

> Don't see why reserved is needed here.
>
> On 03/13/19 at 12:19pm, Pingfan Liu wrote:
> > crashkernel=x@y option may fail to reserve the required memory region if
> > KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
> > skip the required region.
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Baoquan He 
> > Cc: Will Deacon 
> > Cc: Nicolas Pitre 
> > Cc: Pingfan Liu 
> > Cc: Chao Fan 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Ard Biesheuvel 
> > Cc: linux-kernel@vger.kernel.org
> > ---
> > v1 -> v2: fix some trival format
> >
> >  arch/x86/boot/compressed/kaslr.c | 26 --
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/kaslr.c 
> > b/arch/x86/boot/compressed/kaslr.c
> > index 9ed9709..e185318 100644
> > --- a/arch/x86/boot/compressed/kaslr.c
> > +++ b/arch/x86/boot/compressed/kaslr.c
> > @@ -109,6 +109,7 @@ enum mem_avoid_index {
> >   MEM_AVOID_BOOTPARAMS,
> >   MEM_AVOID_MEMMAP_BEGIN,
> >   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 
> > 1,
> > + MEM_AVOID_CRASHKERNEL,
> >   MEM_AVOID_MAX,
> >  };
> >
> > @@ -240,6 +241,25 @@ static void parse_gb_huge_pages(char *param, char *val)
> >   }
> >  }
> >
> > +/* parse crashkernel=x@y option */
> > +static void mem_avoid_crashkernel_simple(char *option)
>
> Chao ever mentioned this, I want to ask again, why does it has to be
> xxx_simple()?
>
Seems that I had replied Chao's question in another email. The naming
follows the function parse_crashkernel_simple(), as the notes above
the definition
/*
 * That function parses "simple" (old) crashkernel command lines like
 *
 * crashkernel=size[@offset]
 *
 * It returns 0 on success and -EINVAL on failure.
 */
static int __init parse_crashkernel_simple(char *cmdline,

Do you have alternative suggestion?

> Except of these, patch looks good to me. It's a nice catch, and only
> need a simple fix based on the current code.
>
Thank you for the kindly review.

Regards,
Pingfan

> Thanks
> Baoquan
>
> > +{
> > + unsigned long long crash_size, crash_base;
> > + char *cur = option;
> > +
> > + crash_size = memparse(option, &cur);
> > + if (option == cur)
> > + return;
> > +
> > + if (*cur == '@') {
> > + option = cur + 1;
> > + crash_base = memparse(option, &cur);
> > + if (option == cur)
> > + return;
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> > + mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> > + }
> > +}
> >
> >  static void handle_mem_options(void)
> >  {
> > @@ -250,7 +270,7 @@ static void handle_mem_options(void)
> >   u64 mem_size;
> >
> >   if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
> > - !strstr(args, "hugepages"))
> > + !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
> >   return;
> >
> >   tmp_cmdline = malloc(len + 1);
> > @@ -286,6 +306,8 @@ static void handle_mem_options(void)
> >   goto out;
> >
> >   mem_limit = mem_size;
> > + } else if (strstr(param, "crashkernel")) {
> > + mem_avoid_crashkernel_simple(val);
> >   }
> >   }
> >
> > @@ -414,7 +436,7 @@ static void mem_avoid_init(unsigned long input, 
> > unsigned long input_size,
> >
> >   /* We don't need to set a mapping for setup_data. */
> >
> > - /* Mark the memmap regions we need to avoid */
> > + /* Mark the regions we need to avoid */
> >   handle_mem_options();
> >
> >  #ifdef CONFIG_X86_VERBOSE_BOOTUP
> > --
> > 2.7.4
> >


[PATCHv2] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-12 Thread Pingfan Liu
crashkernel=x@y option may fail to reserve the required memory region if
KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
skip the required region.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Pingfan Liu 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
Cc: linux-kernel@vger.kernel.org
---
v1 -> v2: fix some trival format

 arch/x86/boot/compressed/kaslr.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 9ed9709..e185318 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -109,6 +109,7 @@ enum mem_avoid_index {
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
+   MEM_AVOID_CRASHKERNEL,
MEM_AVOID_MAX,
 };
 
@@ -240,6 +241,25 @@ static void parse_gb_huge_pages(char *param, char *val)
}
 }
 
+/* parse crashkernel=x@y option */
+static void mem_avoid_crashkernel_simple(char *option)
+{
+   unsigned long long crash_size, crash_base;
+   char *cur = option;
+
+   crash_size = memparse(option, &cur);
+   if (option == cur)
+   return;
+
+   if (*cur == '@') {
+   option = cur + 1;
+   crash_base = memparse(option, &cur);
+   if (option == cur)
+   return;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
+   }
+}
 
 static void handle_mem_options(void)
 {
@@ -250,7 +270,7 @@ static void handle_mem_options(void)
u64 mem_size;
 
if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
-   !strstr(args, "hugepages"))
+   !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
return;
 
tmp_cmdline = malloc(len + 1);
@@ -286,6 +306,8 @@ static void handle_mem_options(void)
goto out;
 
mem_limit = mem_size;
+   } else if (strstr(param, "crashkernel")) {
+   mem_avoid_crashkernel_simple(val);
}
}
 
@@ -414,7 +436,7 @@ static void mem_avoid_init(unsigned long input, unsigned 
long input_size,
 
/* We don't need to set a mapping for setup_data. */
 
-   /* Mark the memmap regions we need to avoid */
+   /* Mark the regions we need to avoid */
handle_mem_options();
 
 #ifdef CONFIG_X86_VERBOSE_BOOTUP
-- 
2.7.4



Re: [PATCH] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-03-05 Thread Pingfan Liu
On Wed, Feb 27, 2019 at 3:40 PM Borislav Petkov  wrote:
>
> + Kees.
>
> @Kees, you might want to go upthread a bit for context.
>
It seems there was no reply from Kees.
> On Wed, Feb 27, 2019 at 09:30:34AM +0800, Baoquan He wrote:
> > Agree that 'crashkernel=x' should be encouraged to use as the first
> > choice when reserve crashkernel. If we decide to not obsolete
> > 'crashkernel=x@y', it will leave a unstable kernel parameter.
>
> Is anyone even talking about obsoleting this?
>
> And if anyone is, anyone can think a bit why we can't do this.
>
As Dave said, some non-relocatable code has to be loaded at a specified
address. Also the parameter is used by arches besides x86.
> > Another worry is that KASLR won't always fail 'crashkernel=x@y',
> > customer may set and check in testing stage, then later in production
> > environment one time of neglect to not check may cause carashed kernel
> > uncaptured.
> >
> > IMHO, 'crashkernel=x@y' is similar to those specified memmap=ss[#$!]nn
> > which have been avoided in boot stage KASLR.
>
> So my worry is that by specifying too many exclusion ranges, we might
> limit the kaslr space too much and make it too predictable. Especially
> since distros slap those things automatically and most users take them
> for granted.
>
The kernel has already done this when excluding 1GB pages. Do we need to
worry about 200-400 MB for crashkernel? And I think if a user specifies the
region, he/she should be aware of the limit it puts on KASLR (we can printk
to warn him/her).

> But I might be way off here because of something else I'm missing ...
>
So what do you think about this now? Just leave an unstable kernel
parameter, or printk some info when crashkernel=x@y fails?

Thanks,
Pingfan
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info

2019-03-05 Thread Pingfan Liu
On Tue, Feb 26, 2019 at 8:09 PM Michal Hocko  wrote:
>
> On Tue 26-02-19 13:47:37, Pingfan Liu wrote:
> > On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko  wrote:
> > >
> > > On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > > > There are NUMA machines with memory-less node. At present page 
> > > > allocator builds the
> > > > full fallback info by build_zonelists(). But memblock allocator does 
> > > > not utilize
> > > > this info. And for memory-less node, memblock allocator just falls back 
> > > > "node 0",
> > > > without utilizing the nearest node. Unfortunately, the percpu section 
> > > > is allocated
> > > > by memblock, which is accessed frequently after bootup.
> > > >
> > > > This series aims to improve the performance of per cpu section on 
> > > > memory-less node
> > > > by feeding node's fallback info to memblock allocator on x86, like we 
> > > > do for page
> > > > allocator. On other archs, it requires independent effort to setup node 
> > > > to cpumask
> > > > map ahead.
> > >
> > > Do you have any numbers to tell us how much does this improve the
> > > situation?
> >
> > Not yet. At present just based on the fact that we prefer to allocate
> > per cpu area on local node.
>
> Yes, we _usually_ do. But the additional complexity should be worth it.
> And if we find out that the final improvement is not all that great and
> considering that memory-less setups are crippled anyway then it might
> turn out we just do not care all that much.
> --
I have finished some tests on a "Dell Inc. PowerEdge R7425/02MJ3T"
machine, which has 8 NUMA nodes, and the topology is:
L1d cache:   32K
L1i cache:   64K
L2 cache:512K
L3 cache:4096K
NUMA node0 CPU(s):   0,8,16,24
NUMA node1 CPU(s):   2,10,18,26
NUMA node2 CPU(s):   4,12,20,28
NUMA node3 CPU(s):   6,14,22,30
NUMA node4 CPU(s):   1,9,17,25
NUMA node5 CPU(s):   3,11,19,27
NUMA node6 CPU(s):   5,13,21,29
NUMA node7 CPU(s):   7,15,23,31

That is the basic info about the NUMA machine. CPUs 0 and 16 share the
same L3 cache. Only nodes 1 and 5 have memory. Using the local node as
the baseline, memory write performance suffers a 25% drop to the nearest
node (i.e. writing data from node 0 to node 1), and a 78% drop to the
farthest node (i.e. writing from node 0 to node 5).

I used a user-space test case to get the performance difference
between the nearest node and the farthest one. The case pins two tasks
on CPUs 0 and 16. It uses two memory chunks: A, which emulates the small
footprint of the per-cpu section, and B, which emulates a large
footprint. Chunk B is always allocated on the nearest node, while chunk A
switches between the nearest node and the farthest one to give comparable
results. To emulate roughly 2.5% of the accesses hitting the per-cpu area,
the case interleaves two groups of writes: 1 write to chunk A, then 40
writes to chunk B.

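A minimal user-space sketch of such a test case (illustrative only: it
pins a single task rather than two, and the chunk sizes, node numbers and
iteration count are assumptions; built with something like
gcc -O2 test.c -lnuma):

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define CHUNK_A_SZ	(8 * 1024)		/* emulated per-cpu footprint (2K/4K/8K) */
#define CHUNK_B_SZ	(4 * 1024 * 1024)	/* large footprint, same size as the L3 */
#define ROUNDS		10000

int main(int argc, char **argv)
{
	int node_a = atoi(argv[1]);	/* 1 = nearest, 5 = farthest on this box */
	int node_b = 1;			/* chunk B always sits on the nearest node */
	cpu_set_t set;
	char *a, *b;
	long i, j;

	/* pin the task to CPU 0, which sits on the memory-less node 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);

	a = numa_alloc_onnode(CHUNK_A_SZ, node_a);
	b = numa_alloc_onnode(CHUNK_B_SZ, node_b);

	/* one write to chunk A, then 40 writes to chunk B (~2.5% of accesses) */
	for (i = 0; i < ROUNDS; i++) {
		memset(a, i, CHUNK_A_SZ);
		for (j = 0; j < 40; j++)
			memset(b, j, CHUNK_B_SZ);
	}

	numa_free(a, CHUNK_A_SZ);
	numa_free(b, CHUNK_B_SZ);
	return 0;
}
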
On the nearest node I used a 4MB footprint, which is the same size as the
L3 cache, and varied chunk A's footprint from 2K -> 4K -> 8K to emulate
the access to the per-cpu section. For 2K and 4K, the perf results cannot
tell the difference exactly, because the difference is smaller than the
variance. For 8K there is a 1.8% improvement, and the larger the
footprint, the higher the improvement. But 8K would mean a module
allocating 4K per cpu in the section, which does not happen in practice.

So the changes may not be needed.

Regards,
Pingfan


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-28 Thread Pingfan Liu
On Fri, Mar 1, 2019 at 11:04 AM Pingfan Liu  wrote:
>
> Hi Borislav,
>
> Do you think the following patch is good at present?
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 81f9d23..9213073 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -460,7 +460,7 @@ static void __init
> memblock_x86_reserve_range_setup_data(void)
# define CRASH_ADDR_LOW_MAX    (512 << 20)
>  # define CRASH_ADDR_HIGH_MAX   (512 << 20)
>  #else
-# define CRASH_ADDR_LOW_MAX    (896UL << 20)
+# define CRASH_ADDR_LOW_MAX    (1 << 32)
>  # define CRASH_ADDR_HIGH_MAX   MAXMEM
>  #endif
>
Or the patch could look like:
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..ed0def5 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -459,7 +459,7 @@ static void __init
memblock_x86_reserve_range_setup_data(void)
# define CRASH_ADDR_LOW_MAX    (512 << 20)
 # define CRASH_ADDR_HIGH_MAX   (512 << 20)
 #else
-# define CRASH_ADDR_LOW_MAX    (896UL << 20)
+# define CRASH_ADDR_LOW_MAX    (1 << 32)
 # define CRASH_ADDR_HIGH_MAX   MAXMEM
 #endif

@@ -551,6 +551,15 @@ static void __init reserve_crashkernel(void)
high ? CRASH_ADDR_HIGH_MAX
 : CRASH_ADDR_LOW_MAX,
crash_size, CRASH_ALIGN);
+#ifdef CONFIG_X86_64
+   /*
+* crashkernel=X reserve below 4G fails? Try MAXMEM
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   CRASH_ADDR_HIGH_MAX,
+   crash_size, CRASH_ALIGN);
+#endif

which tries 0-4G first, then falls back to above 4G

> For documentation, I will send another patch to improve the description.
>
> Thanks,
> Pingfan
>
> On Mon, Feb 25, 2019 at 7:30 PM Borislav Petkov  wrote:
> >
> > On Mon, Feb 25, 2019 at 07:12:16PM +0800, Dave Young wrote:
> > > If we move to high as default, it will allocate 160M high + 256M low. It
> >
> > We won't move to high by default - we will *fall* back to high if the
> > default allocation fails.
> >
> > > To make the process less fragile maybe we can remove the 896M limitation
> > > and only try <4G then go to high.
> >
> > Sure, the more robust for the user, the better.
> >
> > --
> > Regards/Gruss,
> > Boris.
> >
> > Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-28 Thread Pingfan Liu
Hi Borislav,

Do you think the following patch is good at present?
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 81f9d23..9213073 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -460,7 +460,7 @@ static void __init
memblock_x86_reserve_range_setup_data(void)
# define CRASH_ADDR_LOW_MAX    (512 << 20)
 # define CRASH_ADDR_HIGH_MAX   (512 << 20)
 #else
-# define CRASH_ADDR_LOW_MAX    (896UL << 20)
+# define CRASH_ADDR_LOW_MAX    (1 << 32)
 # define CRASH_ADDR_HIGH_MAX   MAXMEM
 #endif

For documentation, I will send another patch to improve the description.

Thanks,
Pingfan

On Mon, Feb 25, 2019 at 7:30 PM Borislav Petkov  wrote:
>
> On Mon, Feb 25, 2019 at 07:12:16PM +0800, Dave Young wrote:
> > If we move to high as default, it will allocate 160M high + 256M low. It
>
> We won't move to high by default - we will *fall* back to high if the
> default allocation fails.
>
> > To make the process less fragile maybe we can remove the 896M limitation
> > and only try <4G then go to high.
>
> Sure, the more robust for the user, the better.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH 2/6] mm/memblock: make full utilization of numa info

2019-02-27 Thread Pingfan Liu
On Tue, Feb 26, 2019 at 7:58 PM Mike Rapoport  wrote:
>
> On Sun, Feb 24, 2019 at 08:34:05PM +0800, Pingfan Liu wrote:
> > There are numa machines with memory-less node. When allocating memory for
> > the memory-less node, memblock allocator falls back to 'Node 0' without 
> > fully
> > utilizing the nearest node. This hurts the performance, especially for per
> > cpu section. Suppressing this defect by building the full node fall back
> > info for memblock allocator, like what we have done for page allocator.
>
> Is it really necessary to build full node fallback info for memblock and
> then rebuild it again for the page allocator?
>
Do you mean building the full node fallback info once and sharing it
between the memblock and page allocators? If so, node online/offline
is the corner case that blocks this design.

> I think it should be possible to split parts of build_all_zonelists_init()
> that do not touch per-cpu areas into a separate function and call that
> function after topology detection. Then it would be possible to use
> local_memory_node() when calling memblock.
>
Yes, this is one way, but it may come at the higher cost of changing more
code. I will try it.
Thank you for your suggestion.
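To make that concrete, a rough sketch of the direction (purely
illustrative; it assumes the node's zonelist has already been initialized
by the split-out init function):

	/* fall back to the nearest node that actually has memory */
	int nid = local_memory_node(cpu_to_node(cpu));
	void *ptr = memblock_alloc_try_nid(size, align, 0,
					   MEMBLOCK_ALLOC_ACCESSIBLE, nid);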

Best regards,
Pingfan
> > Signed-off-by: Pingfan Liu 
> > CC: Thomas Gleixner 
> > CC: Ingo Molnar 
> > CC: Borislav Petkov 
> > CC: "H. Peter Anvin" 
> > CC: Dave Hansen 
> > CC: Vlastimil Babka 
> > CC: Mike Rapoport 
> > CC: Andrew Morton 
> > CC: Mel Gorman 
> > CC: Joonsoo Kim 
> > CC: Andy Lutomirski 
> > CC: Andi Kleen 
> > CC: Petr Tesarik 
> > CC: Michal Hocko 
> > CC: Stephen Rothwell 
> > CC: Jonathan Corbet 
> > CC: Nicholas Piggin 
> > CC: Daniel Vacek 
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  include/linux/memblock.h |  3 +++
> >  mm/memblock.c| 68 
> > 
> >  2 files changed, 66 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index 64c41cf..ee999c5 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -342,6 +342,9 @@ void *memblock_alloc_try_nid_nopanic(phys_addr_t size, 
> > phys_addr_t align,
> >  void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
> >phys_addr_t min_addr, phys_addr_t max_addr,
> >int nid);
> > +extern int build_node_order(int *node_oder_array, int sz,
> > + int local_node, nodemask_t *used_mask);
> > +void memblock_build_node_order(void);
> >
> >  static inline void * __init memblock_alloc(phys_addr_t size,  phys_addr_t 
> > align)
> >  {
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 022d4cb..cf78850 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1338,6 +1338,47 @@ phys_addr_t __init 
> > memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t ali
> >   return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
> >  }
> >
> > +static int **node_fallback __initdata;
> > +
> > +/*
> > + * build_node_order() relies on cpumask_of_node(), hence arch should set up
> > + * cpumask before calling this func.
> > + */
> > +void __init memblock_build_node_order(void)
> > +{
> > + int nid, i;
> > + nodemask_t used_mask;
> > +
> > + node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> > + sizeof(int *));
> > + for_each_online_node(nid) {
> > + node_fallback[nid] = memblock_alloc(
> > + num_online_nodes() * sizeof(int), sizeof(int));
> > + for (i = 0; i < num_online_nodes(); i++)
> > + node_fallback[nid][i] = NUMA_NO_NODE;
> > + }
> > +
> > + for_each_online_node(nid) {
> > + nodes_clear(used_mask);
> > + node_set(nid, used_mask);
> > + build_node_order(node_fallback[nid], num_online_nodes(),
> > + nid, &used_mask);
> > + }
> > +}
> > +
> > +static void __init memblock_free_node_order(void)
> > +{
> > + int nid;
> > +
> > + if (!node_fallback)
> > + return;
> > + for_each_online_node(nid)
> > + memblock_free(__pa(node_fallback[nid]),
> > + num_online_nodes() * sizeof(int));
> > + memblock_free(__pa(node_fallback), MAX_NUMNODES * sizeof(int *));
> > + node_fallback = NULL;
> > +}
> > 

Re: [PATCH 0/6] make memblock allocator utilize the node's fallback info

2019-02-25 Thread Pingfan Liu
On Tue, Feb 26, 2019 at 12:04 AM Michal Hocko  wrote:
>
> On Sun 24-02-19 20:34:03, Pingfan Liu wrote:
> > There are NUMA machines with memory-less node. At present page allocator 
> > builds the
> > full fallback info by build_zonelists(). But memblock allocator does not 
> > utilize
> > this info. And for memory-less node, memblock allocator just falls back 
> > "node 0",
> > without utilizing the nearest node. Unfortunately, the percpu section is 
> > allocated
> > by memblock, which is accessed frequently after bootup.
> >
> > This series aims to improve the performance of per cpu section on 
> > memory-less node
> > by feeding node's fallback info to memblock allocator on x86, like we do 
> > for page
> > allocator. On other archs, it requires independent effort to setup node to 
> > cpumask
> > map ahead.
>
> Do you have any numbers to tell us how much does this improve the
> situation?

Not yet. At present just based on the fact that we prefer to allocate
per cpu area on local node.

Thanks,
Pingfan


Re: [PATCH 2/6] mm/memblock: make full utilization of numa info

2019-02-25 Thread Pingfan Liu
On Mon, Feb 25, 2019 at 11:34 PM Dave Hansen  wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > +/*
> > + * build_node_order() relies on cpumask_of_node(), hence arch should
> > + * set up cpumask before calling this func.
> > + */
>
> Whenever I see comments like this, I wonder what happens if the arch
> doesn't do this?  Do we just crash in early boot in wonderful new ways?
>  Or do we get a nice message telling us?
>
If the arch doesn't do this, this function will crash. It is a shame, but
it is a little hard to work around: since this function is called at the
early boot stage, things like cpumask_of_node(cpu_to_node(cpu)) cannot
work reliably, and we lack an abstract interface to get such information
from all arches. So I leave this to the arch developers.

> > +void __init memblock_build_node_order(void)
> > +{
> > + int nid, i;
> > + nodemask_t used_mask;
> > +
> > + node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
> > + sizeof(int *));
> > + for_each_online_node(nid) {
> > + node_fallback[nid] = memblock_alloc(
> > + num_online_nodes() * sizeof(int), sizeof(int));
> > + for (i = 0; i < num_online_nodes(); i++)
> > + node_fallback[nid][i] = NUMA_NO_NODE;
> > + }
> > +
> > + for_each_online_node(nid) {
> > + nodes_clear(used_mask);
> > + node_set(nid, used_mask);
> > + build_node_order(node_fallback[nid], num_online_nodes(),
> > + nid, &used_mask);
> > + }
> > +}
>
> This doesn't get used until patch 6 as far as I can tell.  Was there a
> reason to define it here?
>
Yes, it is not used until patch 6. Patch 6 has two groups of
prerequisites, [1-2] and [3-5]. Do you think reordering the patches and
moving [3-5] ahead of [1-2] would be a better choice?

Thanks and regards,
Pingfan


Re: [PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA

2019-02-25 Thread Pingfan Liu
On Mon, Feb 25, 2019 at 11:24 PM Dave Hansen  wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > +#ifdef CONFIG_NUMA
> >  /*
> >   * There are unfortunately some poorly designed mainboards around that
> >   * only connect memory to a single CPU. This breaks the 1:1 cpu->node
> > @@ -618,6 +619,9 @@ static void __init numa_init_array(void)
> >   rr = next_node_in(rr, node_online_map);
> >   }
> >  }
> > +#else
> > +static void __init numa_init_array(void) {}
> > +#endif
>
> What functional effect does this #ifdef have?
>
> Let's look at the code:
>
> > static void __init numa_init_array(void)
> > {
> > int rr, i;
> >
> > rr = first_node(node_online_map);
> > for (i = 0; i < nr_cpu_ids; i++) {
> > if (early_cpu_to_node(i) != NUMA_NO_NODE)
> > continue;
> > numa_set_node(i, rr);
> > rr = next_node_in(rr, node_online_map);
> > }
> > }
>
> and "play compiler" for a bit.
>
> The first iteration will see early_cpu_to_node(i)==1 because:
>
> static inline int early_cpu_to_node(int cpu)
> {
> return 0;
> }
>
> if CONFIG_NUMA=n.
>
> In other words, I'm not sure this patch does *anything*.

I had thought that separating [3/6] and [4/6] would ease the review. I
will merge them in the next version.

Thanks and regards,
Pingfan


Re: [PATCH 5/6] x86/numa: push forward the setup of node to cpumask map

2019-02-25 Thread Pingfan Liu
On Mon, Feb 25, 2019 at 11:30 PM Dave Hansen  wrote:
>
> On 2/24/19 4:34 AM, Pingfan Liu wrote:
> > At present the node to cpumask map is set up until the secondary
> > cpu boot up. But it is too late for the purpose of building node fall back
> > list at early boot stage. Considering that init_cpu_to_node() already owns
> > cpu to node map, it is a good place to set up node to cpumask map too. So
> > do it by calling numa_add_cpu(cpu) in init_cpu_to_node().
>
> It sounds like you have carefully considered the ordering and
> dependencies here.  However, none of that consideration has made it into
> the code.
>
> Could you please add some comments to the new call-sites to explain why
> the *must* be where they are?

OK. How about: "building up node fallback list needs cpumask info, so
filling cpumask info here"
Thanks for your kindly review.

Regards,
Pingfan


Re: [PATCH] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-02-25 Thread Pingfan Liu
On Mon, Feb 25, 2019 at 4:23 PM Chao Fan  wrote:
>
> On Mon, Feb 25, 2019 at 03:59:56PM +0800, Pingfan Liu wrote:
> >crashkernel=x@y option may fail to reserve the required memory region if
> >KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
> >skip the required region.
> >
> >Signed-off-by: Pingfan Liu 
> >Cc: Thomas Gleixner 
> >Cc: Ingo Molnar 
> >Cc: Borislav Petkov 
> >Cc: "H. Peter Anvin" 
> >Cc: Baoquan He 
> >Cc: Will Deacon 
> >Cc: Nicolas Pitre 
> >Cc: Pingfan Liu 
> >Cc: Chao Fan 
> >Cc: "Kirill A. Shutemov" 
> >Cc: Ard Biesheuvel 
> >Cc: linux-kernel@vger.kernel.org
> >---
> > arch/x86/boot/compressed/kaslr.c | 26 +-
> > 1 file changed, 25 insertions(+), 1 deletion(-)
> >
>
> Hi Pingfan,
>
> Some not important comments:
>
> >diff --git a/arch/x86/boot/compressed/kaslr.c 
> >b/arch/x86/boot/compressed/kaslr.c
> >index 9ed9709..728bc4b 100644
> >--- a/arch/x86/boot/compressed/kaslr.c
> >+++ b/arch/x86/boot/compressed/kaslr.c
> >@@ -109,6 +109,7 @@ enum mem_avoid_index {
> >   MEM_AVOID_BOOTPARAMS,
> >   MEM_AVOID_MEMMAP_BEGIN,
> >   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 
> > 1,
> >+  MEM_AVOID_CRASHKERNEL,
> >   MEM_AVOID_MAX,
> > };
> >
> >@@ -240,6 +241,27 @@ static void parse_gb_huge_pages(char *param, char *val)
> >   }
> > }
> >
> >+/* parse crashkernel=x@y option */
> >+static int mem_avoid_crashkernel_simple(char *option)
> >+{
> >+  char *cur = option;
> >+  unsigned long long crash_size, crash_base;
>
> Change the position of two lines above.
>
Yes, it is better.
> >+
> >+  crash_size = memparse(option, &cur);
> >+  if (option == cur)
> >+  return -EINVAL;
> >+
> >+  if (*cur == '@') {
> >+  option = cur + 1;
> >+  crash_base = memparse(option, &cur);
> >+  if (option == cur)
> >+  return -EINVAL;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
> >+  mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
> >+  }
> >+
> >+  return 0;
>
> You just call this function and don't use its return value.
> So why not change it as void type.
>
OK.
> >+}
> >
> > static void handle_mem_options(void)
>
> If you want to change this function, I think you could change the
> function name and the comment:
>
> /* Mark the memmap regions we need to avoid */
> handle_mem_options();
>
Yes, it is outdated; the comment should be fixed.
> > {
> >@@ -250,7 +272,7 @@ static void handle_mem_options(void)
> >   u64 mem_size;
> >
> >   if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
> >-  !strstr(args, "hugepages"))
> >+  !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
> >   return;
> >
> >   tmp_cmdline = malloc(len + 1);
> >@@ -286,6 +308,8 @@ static void handle_mem_options(void)
> >   goto out;
> >
> >   mem_limit = mem_size;
> >+  } else if (strstr(param, "crashkernel")) {
> >+  mem_avoid_crashkernel_simple(val);
>
> I am wondering why you call this function mem_avoid_crashkernel_*simple*().
>
It follows the name of parse_crashkernel_simple()

Thanks,
Pingfan


Re: [PATCH] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-02-25 Thread Pingfan Liu
On Mon, Feb 25, 2019 at 5:45 PM Borislav Petkov  wrote:
>
> On Mon, Feb 25, 2019 at 03:59:56PM +0800, Pingfan Liu wrote:
> > crashkernel=x@y option may fail to reserve the required memory region if
> > KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
> > skip the required region.
>
> Lemme see if I understand this correctly: supplying crashkernel=X@Y
> influences where KASLR would put the randomized kernel. And it should be

Yes, you get it.
> the other way around, IMHO. crashkernel= will have to "work" with KASLR
> to find a suitable range and if the reservation at Y fails, then we tell
> the user to try the more relaxed variant crashkernel=M.
>
I follow Baoquan's opinion. Due to the randomness caused by KASLR, a
user may be surprised to find crashkernel=x@y not working sometimes. If
the kernel can help them out of this corner automatically, there is no
need to bother them with a message telling them to use the alternative
crashkernel=M. Anyway, it is a cheap method already used by other options
like hugepages and memmap in handle_mem_options().
If we commit to the option, it should work without failure; otherwise we
should just remove the crashkernel=x@y option on x86.

Thanks and regards,
Pingfan

> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH] x86/boot/KASLR: skip the specified crashkernel reserved region

2019-02-25 Thread Pingfan Liu
crashkernel=x@y option may fail to reserve the required memory region if
KASLR puts kernel into the region. To avoid this uncertainty, making KASLR
skip the required region.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Baoquan He 
Cc: Will Deacon 
Cc: Nicolas Pitre 
Cc: Pingfan Liu 
Cc: Chao Fan 
Cc: "Kirill A. Shutemov" 
Cc: Ard Biesheuvel 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/boot/compressed/kaslr.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 9ed9709..728bc4b 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -109,6 +109,7 @@ enum mem_avoid_index {
MEM_AVOID_BOOTPARAMS,
MEM_AVOID_MEMMAP_BEGIN,
MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
+   MEM_AVOID_CRASHKERNEL,
MEM_AVOID_MAX,
 };
 
@@ -240,6 +241,27 @@ static void parse_gb_huge_pages(char *param, char *val)
}
 }
 
+/* parse crashkernel=x@y option */
+static int mem_avoid_crashkernel_simple(char *option)
+{
+   char *cur = option;
+   unsigned long long crash_size, crash_base;
+
+   crash_size = memparse(option, &cur);
+   if (option == cur)
+   return -EINVAL;
+
+   if (*cur == '@') {
+   option = cur + 1;
+   crash_base = memparse(option, &cur);
+   if (option == cur)
+   return -EINVAL;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].start = crash_base;
+   mem_avoid[MEM_AVOID_CRASHKERNEL].size = crash_size;
+   }
+
+   return 0;
+}
 
 static void handle_mem_options(void)
 {
@@ -250,7 +272,7 @@ static void handle_mem_options(void)
u64 mem_size;
 
if (!strstr(args, "memmap=") && !strstr(args, "mem=") &&
-   !strstr(args, "hugepages"))
+   !strstr(args, "hugepages") && !strstr(args, "crashkernel="))
return;
 
tmp_cmdline = malloc(len + 1);
@@ -286,6 +308,8 @@ static void handle_mem_options(void)
goto out;
 
mem_limit = mem_size;
+   } else if (strstr(param, "crashkernel")) {
+   mem_avoid_crashkernel_simple(val);
}
}
 
-- 
2.7.4



Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-24 Thread Pingfan Liu
On Fri, Feb 22, 2019 at 9:00 PM Borislav Petkov  wrote:
>
> On Fri, Feb 22, 2019 at 09:42:41AM +0100, Joerg Roedel wrote:
> > The current default of 256MB was found by experiments on a bigger
> > number of machines, to create a reasonable default that is at least
> > likely to be sufficient of an average machine.
>
> Exactly, and this is what makes sense.
>
> The code should try the requested reservation and if it fails, it should
> try high allocation with default swiotlb size because we need to reserve
> *some* range.
>
> If that reservation succeeds, we should say something along the lines of
>
> "... requested range failed, reserved  range instead."
>
Maybe I misunderstood you, but does "requested range failed" mean that
the user specified the range? If yes, then it should be the duty of the
user, as you said later, not the duty of the kernel.

> And then in Documentation/admin-guide/kernel-parameters.txt above the
> crashkernel= explanations, the allocation strategy of best effort should
> be explained in short. That the kernel will try to allocate high if the
> requested allocation didn't succeed and that the user can tweak the
> allocation with the below options.
>
Yes, it should be improved.

> Bottom line is: the kernel should assist the user and try harder to
> allocate *some* range for a crash kernel when there's no detailed
> specification what that range should be.
>
> *If* the user adds ,low, high, then the kernel should try only that
> specified range because the assumption is that the user knows what she's
> doing.
>
> But if the user simply wants a range for a crash kernel without stating
> where that range should be in particular and it's placement is a don't
> care - as long as there is a range - then the kernel should simply try
> high, etc.
>
We do not know the memory layout of a system; it could be a system with
less than 4GB of memory. So it is better to try the whole range of system
memory.

Thanks,
Pingfan

> Makes sense?
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH 6/6] x86/numa: build node fallback info after setting up node to cpumask map

2019-02-24 Thread Pingfan Liu
After the previous patches, on x86, it is safe to call
memblock_build_node_order() after init_cpu_to_node(), which has set up node
to cpumask map. So calling memblock_build_node_order() to feed memblock with
numa node fall back info.

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/setup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..3ec1a6e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1245,6 +1245,8 @@ void __init setup_arch(char **cmdline_p)
prefill_possible_map();
 
init_cpu_to_node();
+   /* node to cpumask map is ready */
+   memblock_build_node_order();
 
io_apic_init_mappings();
 
-- 
2.7.4



[PATCH 3/6] x86/numa: define numa_init_array() conditional on CONFIG_NUMA

2019-02-24 Thread Pingfan Liu
For non-NUMA, numa_init_array() performs no operations. Provide separate
definitions for the non-NUMA and NUMA cases, so that later they can be
combined into their counterpart init_cpu_to_node().

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..bfe6732 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -599,6 +599,7 @@ static int __init numa_register_memblks(struct numa_meminfo 
*mi)
return 0;
 }
 
+#ifdef CONFIG_NUMA
 /*
  * There are unfortunately some poorly designed mainboards around that
  * only connect memory to a single CPU. This breaks the 1:1 cpu->node
@@ -618,6 +619,9 @@ static void __init numa_init_array(void)
rr = next_node_in(rr, node_online_map);
}
 }
+#else
+static void __init numa_init_array(void) {}
+#endif
 
 static int __init numa_init(int (*init_func)(void))
 {
-- 
2.7.4



[PATCH 4/6] x86/numa: concentrate the code of setting cpu to node map

2019-02-24 Thread Pingfan Liu
Both numa_init_array() and init_cpu_to_node() aim at setting up the
cpu-to-node map, so combine them. A coming patch will also set up the
node-to-cpumask map in the combined function.

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 arch/x86/mm/numa.c | 39 +--
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index bfe6732..c8dd7af 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -599,30 +599,6 @@ static int __init numa_register_memblks(struct 
numa_meminfo *mi)
return 0;
 }
 
-#ifdef CONFIG_NUMA
-/*
- * There are unfortunately some poorly designed mainboards around that
- * only connect memory to a single CPU. This breaks the 1:1 cpu->node
- * mapping. To avoid this fill in the mapping for all possible CPUs,
- * as the number of CPUs is not known yet. We round robin the existing
- * nodes.
- */
-static void __init numa_init_array(void)
-{
-   int rr, i;
-
-   rr = first_node(node_online_map);
-   for (i = 0; i < nr_cpu_ids; i++) {
-   if (early_cpu_to_node(i) != NUMA_NO_NODE)
-   continue;
-   numa_set_node(i, rr);
-   rr = next_node_in(rr, node_online_map);
-   }
-}
-#else
-static void __init numa_init_array(void) {}
-#endif
-
 static int __init numa_init(int (*init_func)(void))
 {
int i;
@@ -675,7 +651,6 @@ static int __init numa_init(int (*init_func)(void))
if (!node_online(nid))
numa_clear_node(i);
}
-   numa_init_array();
 
return 0;
 }
@@ -758,14 +733,26 @@ void __init init_cpu_to_node(void)
 {
int cpu;
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+   int rr;
 
BUG_ON(cpu_to_apicid == NULL);
+   rr = first_node(node_online_map);
 
for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
 
-   if (node == NUMA_NO_NODE)
+   /*
+* There are unfortunately some poorly designed mainboards
+* around that only connect memory to a single CPU. This
+* breaks the 1:1 cpu->node mapping. To avoid this fill in
+* the mapping for all possible CPUs, as the number of CPUs
+* is not known yet. We round robin the existing nodes.
+*/
+   if (node == NUMA_NO_NODE) {
+   numa_set_node(cpu, rr);
+   rr = next_node_in(rr, node_online_map);
continue;
+   }
 
if (!node_online(node))
init_memory_less_node(node);
-- 
2.7.4



[PATCH 5/6] x86/numa: push forward the setup of node to cpumask map

2019-02-24 Thread Pingfan Liu
At present the node-to-cpumask map is not set up until the secondary CPUs
boot up. That is too late for building the node fallback list at the early
boot stage. Since init_cpu_to_node() already owns the cpu-to-node map, it is
a good place to set up the node-to-cpumask map too. Do so by calling
numa_add_cpu(cpu) in init_cpu_to_node().

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/topology.h | 4 
 arch/x86/kernel/setup_percpu.c  | 3 ---
 arch/x86/mm/numa.c  | 5 -
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..fad77c7 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -73,8 +73,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif
 
-extern void setup_node_to_cpumask_map(void);
-
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
 extern int __node_distance(int, int);
@@ -96,8 +94,6 @@ static inline int early_cpu_to_node(int cpu)
return 0;
 }
 
-static inline void setup_node_to_cpumask_map(void) { }
-
 #endif
 
 #include 
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index e8796fc..206fa43 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -283,9 +283,6 @@ void __init setup_per_cpu_areas(void)
early_per_cpu_ptr(x86_cpu_to_node_map) = NULL;
 #endif
 
-   /* Setup node to cpumask map */
-   setup_node_to_cpumask_map();
-
/* Setup cpu initialized, callin, callout masks */
setup_cpu_local_masks();
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c8dd7af..8d73e2273 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -110,7 +110,7 @@ void numa_clear_node(int cpu)
  * Note: cpumask_of_node() is not valid until after this is done.
  * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
  */
-void __init setup_node_to_cpumask_map(void)
+static void __init setup_node_to_cpumask_map(void)
 {
unsigned int node;
 
@@ -738,6 +738,7 @@ void __init init_cpu_to_node(void)
BUG_ON(cpu_to_apicid == NULL);
rr = first_node(node_online_map);
 
+   setup_node_to_cpumask_map();
for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
 
@@ -750,6 +751,7 @@ void __init init_cpu_to_node(void)
 */
if (node == NUMA_NO_NODE) {
numa_set_node(cpu, rr);
+   numa_add_cpu(cpu);
rr = next_node_in(rr, node_online_map);
continue;
}
@@ -758,6 +760,7 @@ void __init init_cpu_to_node(void)
init_memory_less_node(node);
 
numa_set_node(cpu, node);
+   numa_add_cpu(cpu);
}
 }
 
-- 
2.7.4



[PATCH 0/6] make memblock allocator utilize the node's fallback info

2019-02-24 Thread Pingfan Liu
There are NUMA machines with memory-less nodes. At present the page
allocator builds the full fallback info via build_zonelists(), but the
memblock allocator does not utilize this info: for a memory-less node it
simply falls back to node 0, without utilizing the nearest node.
Unfortunately, the percpu section is allocated by memblock, and it is
accessed frequently after bootup.

This series aims to improve the performance of the percpu section on
memory-less nodes by feeding the node fallback info to the memblock
allocator on x86, as is already done for the page allocator. On other
archs, it requires an independent effort to set up the node-to-cpumask map
early enough.
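
As an illustration of the lookup the series enables, a sketch built on the
node_fallback table from patch 2 (not the exact code):

/* Sketch: allocate near @nid by walking its fallback list, nearest first. */
static phys_addr_t __init alloc_near_node(int nid, phys_addr_t size,
                                          phys_addr_t align)
{
        int i;

        for (i = 0; i < num_online_nodes(); i++) {
                int fb = node_fallback[nid][i];
                phys_addr_t found;

                if (fb == NUMA_NO_NODE)         /* end of the fallback list */
                        break;
                found = memblock_find_in_range_node(size, align, 0,
                                                    MEMBLOCK_ALLOC_ACCESSIBLE,
                                                    fb, MEMBLOCK_NONE);
                if (found && !memblock_reserve(found, size))
                        return found;
        }
        return 0;
}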


CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org

Pingfan Liu (6):
  mm/numa: extract the code of building node fall back list
  mm/memblock: make full utilization of numa info
  x86/numa: define numa_init_array() conditional on CONFIG_NUMA
  x86/numa: concentrate the code of setting cpu to node map
  x86/numa: push forward the setup of node to cpumask map
  x86/numa: build node fallback info after setting up node to cpumask
map

 arch/x86/include/asm/topology.h |  4 ---
 arch/x86/kernel/setup.c |  2 ++
 arch/x86/kernel/setup_percpu.c  |  3 --
 arch/x86/mm/numa.c  | 40 +++-
 include/linux/memblock.h|  3 ++
 mm/memblock.c   | 68 ++---
 mm/page_alloc.c | 48 +
 7 files changed, 114 insertions(+), 54 deletions(-)

-- 
2.7.4



[PATCH 1/6] mm/numa: extract the code of building node fall back list

2019-02-24 Thread Pingfan Liu
A coming patch makes the memblock allocator also utilize the node fallback
list info. Hence, extract the related code for reuse.

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 mm/page_alloc.c | 48 +---
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde0..a6967a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5380,6 +5380,32 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
zonerefs->zone_idx = 0;
 }
 
+int build_node_order(int *node_order_array, int sz,
+   int local_node, nodemask_t *used_mask)
+{
+   int node, nr_nodes = 0;
+   int prev_node = local_node;
+   int load = nr_online_nodes;
+
+
+   while ((node = find_next_best_node(local_node, used_mask)) >= 0
+   && nr_nodes < sz) {
+   /*
+* We don't want to pressure a particular node.
+* So adding penalty to the first node in same
+* distance group to make it round-robin.
+*/
+   if (node_distance(local_node, node) !=
+   node_distance(local_node, prev_node))
+   node_load[node] = load;
+
+   node_order_array[nr_nodes++] = node;
+   prev_node = node;
+   load--;
+   }
+   return nr_nodes;
+}
+
 /*
  * Build zonelists ordered by zone and nodes within zones.
  * This results in conserving DMA zone[s] until all Normal memory is
@@ -5390,32 +5416,16 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 static void build_zonelists(pg_data_t *pgdat)
 {
static int node_order[MAX_NUMNODES];
-   int node, load, nr_nodes = 0;
+   int local_node, nr_nodes = 0;
nodemask_t used_mask;
-   int local_node, prev_node;
 
/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
-   load = nr_online_nodes;
-   prev_node = local_node;
nodes_clear(used_mask);
 
memset(node_order, 0, sizeof(node_order));
-   while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-   /*
-* We don't want to pressure a particular node.
-* So adding penalty to the first node in same
-* distance group to make it round-robin.
-*/
-   if (node_distance(local_node, node) !=
-   node_distance(local_node, prev_node))
-   node_load[node] = load;
-
-   node_order[nr_nodes++] = node;
-   prev_node = node;
-   load--;
-   }
-
+   nr_nodes = build_node_order(node_order, MAX_NUMNODES,
+   local_node, &used_mask);
build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
build_thisnode_zonelists(pgdat);
 }
-- 
2.7.4



[PATCH 2/6] mm/memblock: make full utilization of numa info

2019-02-24 Thread Pingfan Liu
There are NUMA machines with memory-less nodes. When allocating memory for
a memory-less node, the memblock allocator falls back to node 0 without
utilizing the nearest node. This hurts performance, especially for the
percpu section. Avoid this defect by building the full node fallback info
for the memblock allocator, as is already done for the page allocator.

Signed-off-by: Pingfan Liu 
CC: Thomas Gleixner 
CC: Ingo Molnar 
CC: Borislav Petkov 
CC: "H. Peter Anvin" 
CC: Dave Hansen 
CC: Vlastimil Babka 
CC: Mike Rapoport 
CC: Andrew Morton 
CC: Mel Gorman 
CC: Joonsoo Kim 
CC: Andy Lutomirski 
CC: Andi Kleen 
CC: Petr Tesarik 
CC: Michal Hocko 
CC: Stephen Rothwell 
CC: Jonathan Corbet 
CC: Nicholas Piggin 
CC: Daniel Vacek 
CC: linux-kernel@vger.kernel.org
---
 include/linux/memblock.h |  3 +++
 mm/memblock.c| 68 
 2 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 64c41cf..ee999c5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -342,6 +342,9 @@ void *memblock_alloc_try_nid_nopanic(phys_addr_t size, 
phys_addr_t align,
 void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align,
 phys_addr_t min_addr, phys_addr_t max_addr,
 int nid);
+extern int build_node_order(int *node_order_array, int sz,
+   int local_node, nodemask_t *used_mask);
+void memblock_build_node_order(void);
 
 static inline void * __init memblock_alloc(phys_addr_t size,  phys_addr_t 
align)
 {
diff --git a/mm/memblock.c b/mm/memblock.c
index 022d4cb..cf78850 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1338,6 +1338,47 @@ phys_addr_t __init 
memblock_phys_alloc_try_nid(phys_addr_t size, phys_addr_t ali
return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
 }
 
+static int **node_fallback __initdata;
+
+/*
+ * build_node_order() relies on cpumask_of_node(), hence arch should set up
+ * cpumask before calling this func.
+ */
+void __init memblock_build_node_order(void)
+{
+   int nid, i;
+   nodemask_t used_mask;
+
+   node_fallback = memblock_alloc(MAX_NUMNODES * sizeof(int *),
+   sizeof(int *));
+   for_each_online_node(nid) {
+   node_fallback[nid] = memblock_alloc(
+   num_online_nodes() * sizeof(int), sizeof(int));
+   for (i = 0; i < num_online_nodes(); i++)
+   node_fallback[nid][i] = NUMA_NO_NODE;
+   }
+
+   for_each_online_node(nid) {
+   nodes_clear(used_mask);
+   node_set(nid, used_mask);
+   build_node_order(node_fallback[nid], num_online_nodes(),
+   nid, &used_mask);
+   }
+}
+
+static void __init memblock_free_node_order(void)
+{
+   int nid;
+
+   if (!node_fallback)
+   return;
+   for_each_online_node(nid)
+   memblock_free(__pa(node_fallback[nid]),
+   num_online_nodes() * sizeof(int));
+   memblock_free(__pa(node_fallback), MAX_NUMNODES * sizeof(int *));
+   node_fallback = NULL;
+}
+
 /**
  * memblock_alloc_internal - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1370,6 +1411,7 @@ static void * __init memblock_alloc_internal(
 {
phys_addr_t alloc;
void *ptr;
+   int node;
enum memblock_flags flags = choose_memblock_flags();
 
if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is 
deprecated. Use NUMA_NO_NODE instead\n"))
@@ -1397,11 +1439,26 @@ static void * __init memblock_alloc_internal(
goto done;
 
if (nid != NUMA_NO_NODE) {
-   alloc = memblock_find_in_range_node(size, align, min_addr,
-   max_addr, NUMA_NO_NODE,
-   flags);
-   if (alloc && !memblock_reserve(alloc, size))
-   goto done;
+   if (!node_fallback) {
+   alloc = memblock_find_in_range_node(size, align,
+   min_addr, max_addr,
+   NUMA_NO_NODE, flags);
+   if (alloc && !memblock_reserve(alloc, size))
+   goto done;
+   } else {
+   int i;
+   for (i = 0; i < num_online_nodes(); i++) {
+   node = node_fallback[nid][i];
+   /* fallback list has all memory nodes */
+   if (node == NUMA_NO_NODE)
+   break;
+   alloc = memblock_find_in_range_node(size,
+   align, min_addr, max_addr,
+

Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-20 Thread Pingfan Liu
On Wed, Feb 20, 2019 at 5:41 PM Dave Young  wrote:
>
> On 02/20/19 at 09:32am, Borislav Petkov wrote:
> > On Mon, Feb 18, 2019 at 09:48:20AM +0800, Dave Young wrote:
> > > It is ideal if kernel can do it automatically, but I'm not sure if
> > > kernel can predict the swiotlb reserved size automatically.
> >
> > Do you see how even more absurd this gets?
> >
> > If the kernel cannot know the swiotlb reserved size automatically, how
> > is the normal user even supposed to know?!
> >
I think swiotlb is a bounce buffer; if we enlarge it, we can get better
performance, and the default size should be enough for the platform to
work. But in the case of reserving low memory for the crash kernel, things
are different:
  reserved low memory = swiotlb_size_or_default() + DMA32 memory for devices
and the second term on the right of the equation varies with the machine
type and the dynamic payload.
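
For example (the numbers are purely illustrative, not measured):

  reserved low memory = 64MB (default swiotlb) + X

where X is small on a desktop but can easily be hundreds of MB on a server
full of NICs/HBAs, and it also changes with the I/O load; that is exactly
the part the kernel cannot predict at boot time.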

> > I see swiotlb_size_or_default() so we have a sane default which we fall
> > back to. Now where's the problem with that?
>
> Good question, I expect some answer from people who know more about the
> background.  It would be good to have some actual test results, Pingfan
> is trying to do some tests.
>
I am not sure I follow the idea, and I do not think the following test
result can tell much (we would need various types of machines to get a
final answer). I did a quick test on "HPE ProLiant DL380 Gen10/ProLiant
DL380 Gen10": the command line "crashkernel=180M,high crashkernel=64M,low"
works for the 2nd kernel, although it complained about a memory shortage:
[7.655591] fbcon: mgadrmfb (fb0) is primary device
[7.655639] Console: switching to colour frame buffer device 128x48
[7.660609] systemd-udevd: page allocation failure: order:0, mode:0x280d4
[7.660611] CPU: 0 PID: 180 Comm: systemd-udevd Not tainted
3.10.0-957.el7.x86_64 #1
[7.660612] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
Gen10, BIOS U30 06/20/2018
[7.660612] Call Trace:
[7.660621]  [] dump_stack+0x19/0x1b
[7.660625]  [] warn_alloc_failed+0x110/0x180
[7.660628]  [] __alloc_pages_slowpath+0x6b6/0x724
[7.660631]  [] __alloc_pages_nodemask+0x405/0x420
[7.660633]  [] alloc_pages_current+0x98/0x110
[7.660638]  [] ttm_pool_populate+0x3d2/0x4b0 [ttm]
[7.660641]  [] ttm_tt_populate+0x7d/0x90 [ttm]
[7.660644]  [] ttm_bo_kmap+0x124/0x240 [ttm]
[7.660648]  [] ? __wake_up_sync_key+0x4f/0x60
[7.660650]  [] mga_dirty_update+0x25e/0x310 [mgag200]
[7.660653]  [] mga_imageblit+0x2f/0x40 [mgag200]
[7.660657]  [] soft_cursor+0x1ba/0x260
[7.660659]  [] bit_cursor+0x663/0x6a0
[7.660662]  [] ? console_trylock+0x19/0x70
[7.660664]  [] fbcon_cursor+0x13d/0x1c0
[7.660665]  [] ? bit_clear+0x120/0x120
[7.660668]  [] hide_cursor+0x2e/0xa0
[7.660669]  [] redraw_screen+0x188/0x270
[7.660671]  [] do_bind_con_driver+0x316/0x340
[7.660672]  [] do_take_over_console+0x49/0x60
[7.660674]  [] do_fbcon_takeover+0x63/0xd0
[7.660675]  [] fbcon_event_notify+0x61d/0x730
[7.660678]  [] notifier_call_chain+0x4f/0x70
[7.660681]  [] __blocking_notifier_call_chain+0x4d/0x70
[7.660683]  [] blocking_notifier_call_chain+0x16/0x20
[7.660684]  [] fb_notifier_call_chain+0x1b/0x20
[7.660686]  [] register_framebuffer+0x1f6/0x340
[7.660690]  []
__drm_fb_helper_initial_config_and_unlock+0x252/0x3e0 [drm_kms_helper]
[7.660694]  []
drm_fb_helper_initial_config+0x3e/0x50 [drm_kms_helper]
[7.660697]  [] mgag200_fbdev_init+0xe3/0x100 [mgag200]
[7.660699]  [] mgag200_modeset_init+0x154/0x1d0 [mgag200]
[7.660701]  [] mgag200_driver_load+0x41d/0x5b0 [mgag200]
[7.660708]  [] drm_dev_register+0x15f/0x1f0 [drm]
[7.660711]  [] ? pci_enable_device_flags+0xe8/0x140
[7.660718]  [] drm_get_pci_dev+0x8a/0x1a0 [drm]
[7.660720]  [] mga_pci_probe+0x9b/0xc0 [mgag200]
[7.660722]  [] local_pci_probe+0x4a/0xb0
[7.660723]  [] pci_device_probe+0x109/0x160
[7.660726]  [] driver_probe_device+0xc5/0x3e0
[7.660727]  [] __driver_attach+0x93/0xa0
[7.660728]  [] ? __device_attach+0x50/0x50
[7.660730]  [] bus_for_each_dev+0x75/0xc0
[7.660731]  [] driver_attach+0x1e/0x20
[7.660733]  [] bus_add_driver+0x200/0x2d0
[7.660734]  [] driver_register+0x64/0xf0
[7.660735]  [] __pci_register_driver+0xa5/0xc0
[7.660737]  [] ? 0xc012cfff
[7.660739]  [] mgag200_init+0x39/0x1000 [mgag200]
[7.660742]  [] do_one_initcall+0xba/0x240
[7.660745]  [] load_module+0x272c/0x2bc0
[7.660748]  [] ? ddebug_proc_write+0x100/0x100
[7.660750]  [] SyS_init_module+0xef/0x140
[7.660752]  [] system_call_fastpath+0x22/0x27
[7.660753] Mem-Info:
[7.660756] active_anon:3364 inactive_anon:6661 isolated_anon:0
[7.660756]  active_file:0 inactive_file:0 isolated_file:0
[7.660756]  unevictable:0 dirty:0 writeback:0 unstable:0
[7.660756]  slab_reclaimable:1492 slab_unreclaimable:3116
[7.660756]  mapped:1223 shmem:8449 pagetables:179 bounce:0
[7.660756]  

Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-19 Thread Pingfan Liu
On Mon, Feb 18, 2019 at 9:48 AM Dave Young  wrote:
>
> On 02/15/19 at 11:24am, Borislav Petkov wrote:
> > On Tue, Feb 12, 2019 at 04:48:16AM +0800, Dave Young wrote:
> > > Even we make it automatic in kernel, but we have to have some default
> > > value for swiotlb in case crashkernel can not find a free region under 4G.
> > > So this default value can not work for every use cases, people need
> > > manually use crashkernel=,low and crashkernel=,high in case
> > > crashkernel=X does not work.
> >
> > Why would the user need to find swiotlb range? The kernel has all the
> > information it requires at its finger tips in order to decide properly.
> >
> > The user wants a crashkernel range, the kernel tries the low range =>
> > no workie, then it tries the next range => workie but needs to allocate
> > swiotlb range so that DMA can happen too. Doh, then the kernel does
> > allocate that too.
>
> It is ideal if kernel can do it automatically, but I'm not sure if
> kernel can predict the swiotlb reserved size automatically.
>
Agreed, I think it is hard to decide the reserved size automatically.
We do not know the ZONE_DMA32 memory requirement at boot time; it depends
on how many DMA32 devices there are and on their dynamic payload.

> Let's add more people to seek for comments.
>
> >
> > Why would the user need to do anything here?!
> >
> > --
> > Regards/Gruss,
> > Boris.
> >
> > Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-02-11 Thread Pingfan Liu
On Tue, Feb 12, 2019 at 4:48 AM Dave Young  wrote:
>
> On 02/06/19 at 08:08pm, Dave Young wrote:
> > On 02/05/19 at 09:15am, Borislav Petkov wrote:
> > > On Mon, Feb 04, 2019 at 03:30:16PM -0700, Jerry Hoemann wrote:
> > > > Is your objection only to the second fallback of allocating
> > > > memory above >= 4GB?   Or are you objecting to allocating from
> > > > (896 .. 4GB) as well?
> > >
> > > My problem is why should the user need to specify high or low allocation
> > > explicitly when we can handle all that in the kernel automatically.
> > >
> > > The presence of crashkernel= on the cmdline sure means that the user
> > > wants to allocate memory for a second kernel.
> > >
> > > Now, if the requested allocation fails, we say:
> > >
> > >   Error reserving crashkernel
> > >
> > > So, instead of saying that, we can *try* *again* and say
> > >
> > >   Error reserving requested crashkernel at @..., attempting a high range.
> > >
> > > and run memblock_find_in_range() on the other regions which we deemed
> > > are ok to allocate from.
> > >
> > > Why aren't we doing that by default instead of placing all those
> > > different options in front of the user and expecting her/him to know
> > > something about all those magic ranges?
> >
> > As we talked in another reply, for the >4G allocation we can not avoid
> > the swiotlb issue,  but if one request for 256M in high region and we
> > allocate the low part automatically, it will eat more memory eg. 512M.
> >
> > But probably in case allacation failed in low region ,high is a must for 
> > kdump
> > reservation, since no other choices perhaps we can make that as you said
>
> That is exactly what Pingfan is doing in this patch.
>
> Even we make it automatic in kernel, but we have to have some default
> value for swiotlb in case crashkernel can not find a free region under 4G.
> So this default value can not work for every use cases, people need
> manually use crashkernel=,low and crashkernel=,high in case
> crashkernel=X does not work.  One can tune it for their use:
>
> 1) crashkernel=X reservation fails, likely the ,low default value is
> still too big, one can shrink the value and manually try other value
> 2) crashkernel=X reserve successfully on high memory and along with some
> default low memory region. But the low region is not enough.  In this
> case one can increase the
>
> This should answer the question why ,high and ,low is still needed.
>
> But for above consumption 1),  KASLR can still cause default ,low memory
> failed to reserve.  So I wonder if KASLR can skip the 0-896M if the
> system memory is big enough.
>
A small correction to the comment above. Refer to reserve_crashkernel_low():
  low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
so it tries the whole 0-4G range for ",low", and the default low size is
256M. Given that only limited memory is reserved by other components before
the crashkernel reservation, we can always find a contiguous 256M chunk
inside the fragmented [0,4G] range, which is split up by the initrd and KASLR.
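
For reference, a condensed sketch of that path (simplified; the real function
also honors an explicit crashkernel=Y,low and has more error handling):

static int __init reserve_crashkernel_low(void)
{
        unsigned long long low_base, low_size;

        /* default: swiotlb size plus some slack, but at least 256M */
        low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL << 20);

        low_base = memblock_find_in_range(0, 1ULL << 32, low_size, CRASH_ALIGN);
        if (!low_base)
                return -ENOMEM;

        memblock_reserve(low_base, low_size);
        crashk_low_res.start = low_base;
        crashk_low_res.end   = low_base + low_size - 1;
        return 0;
}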

Thanks,
Pingfan


[tip:x86/cleanups] x86/trap: Remove useless declaration

2019-01-29 Thread tip-bot for Pingfan Liu
Commit-ID:  439fbdf6a2021ab1cca94b30837674b2b7527ae8
Gitweb: https://git.kernel.org/tip/439fbdf6a2021ab1cca94b30837674b2b7527ae8
Author: Pingfan Liu 
AuthorDate: Fri, 4 Jan 2019 16:46:19 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 29 Jan 2019 22:09:12 +0100

x86/trap: Remove useless declaration

There is no early_trap_pf_init() implementation, hence remove this useless
declaration.

Signed-off-by: Pingfan Liu 
Signed-off-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Link: 
https://lkml.kernel.org/r/1546591579-23502-1-git-send-email-kernelf...@gmail.com


---
 arch/x86/include/asm/processor.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 33051436c864..2bb3a648fc12 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -742,7 +742,6 @@ enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, 
IDLE_NOMWAIT,
 extern void enable_sep_cpu(void);
 extern int sysenter_setup(void);
 
-void early_trap_pf_init(void);
 
 /* Defined in head.S */
 extern struct desc_ptr early_gdt_descr;


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-28 Thread Pingfan Liu
On Fri, Jan 25, 2019 at 6:39 PM Borislav Petkov  wrote:
>
>
> >  Subject: Re: [PATCHv7] x86/kdump: bugfix, make the behavior of 
> > crashkernel=X
>
> s/bugfix, //
>
OK.

> On Mon, Jan 21, 2019 at 01:16:08PM +0800, Pingfan Liu wrote:
> > People reported crashkernel=384M reservation failed on a high end server
> > with KASLR enabled.  In that case there is enough free memory under 896M
> > but crashkernel reservation still fails intermittently.
> >
> > The situation is crashkernel reservation code only finds free region under
> > 896 MB with 128M aligned in case no ',high' being used.  And KASLR could
> > break the first 896M into several parts randomly thus the failure happens.
>
> This reads very strange.
>
What about: "It turns out that the crashkernel reservation code only tries
to find a region under 896 MB, aligned to 128 MB. But KASLR randomly breaks
the big region inside [0,896M] into smaller pieces, none of them big enough
to satisfy the "crashkernel=X" request."

> > User has no way to predict and make sure crashkernel=xM working unless
> > he/she use 'crashkernel=xM,high'.  Since 'crashkernel=xM' is the most
> > common use case this issue is a serious bug.
> >
> > And we can't answer questions raised from customer:
> > 1) why it doesn't succeed to reserve 896 MB;
> > 2) what's wrong with memory region under 4G;
> > 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
>
> Errr, this looks like communication issue. Sounds to me like the text
> around crashkernel= in
>
What about dropping this section from the commit log and sending another
patch to fix the documentation?

> Documentation/admin-guide/kernel-parameters.txt
>
> needs improving?
>
> > This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
>
> Avoid having "This patch" or "This commit" in the commit message. It is
> tautologically useless.
>
OK

> Also, do
>
> $ git grep 'This patch' Documentation/process
>
> for more details.
>
> > finally above 4G.
> >
> > Dave Young sent the original post, and I just re-post it with commit log
>
> If he sent it, he should be the author I guess.
>
> > improvement as his requirement.
> > http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > There was an old discussion below (previously posted by Chao Wang):
> > https://lkml.org/lkml/2013/10/15/601
>
> All that changelog info doesn't belong in the commit message ...
>
> > Signed-off-by: Pingfan Liu 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: ying...@kernel.org,
> > Cc: vgo...@redhat.com
> > Cc: Randy Dunlap 
> > Cc: Borislav Petkov 
> > Cc: x...@kernel.org
> > Cc: linux-kernel@vger.kernel.org
> > ---
>
>  but here.
>
> > v6 -> v7: commit log improvement
> >  arch/x86/kernel/setup.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3d872a5..fa62c81 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
> >   high ? CRASH_ADDR_HIGH_MAX
> >: CRASH_ADDR_LOW_MAX,
> >   crash_size, CRASH_ALIGN);
> > +#ifdef CONFIG_X86_64
> > + /*
> > +  * crashkernel=X reserve below 896M fails? Try below 4G
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + (1ULL << 32),
> > + crash_size, CRASH_ALIGN);
> > + /*
> > +  * crashkernel=X reserve below 4G fails? Try MAXMEM
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + CRASH_ADDR_HIGH_MAX,
> > + crash_size, CRASH_ALIGN);
> > +#endif
>
> Ok, so this is silly: we know at which physical address KASLR allocated
> the kernel so why aren't we querying that and seeing if there's enough
> room before it or after it to call memblock_find_in_range() on the
> bigger range?
>
Sorry, I cannot quite follow you. Do you suggest
memblock_find_in_range(0, ker

Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-28 Thread Pingfan Liu
On Fri, Jan 25, 2019 at 10:08 PM Borislav Petkov  wrote:
>
> On Fri, Jan 25, 2019 at 09:45:18PM +0800, Dave Young wrote:
> > AFAIK, some people prefer to explictly reserve crash memory at high
> > region even if it is possible to reserve at low area.  May because
> > <4G memory is limited on large server, they want to leave this for other
> > use.
> >
> > Yinghai or Vivek should know more about the history, probably they can
> > recall some initial reason.
>
Going through the git log, I found the initial introduction of the
crashkernel_high option. Refer to:
commit 55a20ee7804ab64ac90bcdd4e2868a42829e2784
Author: Yinghai Lu 
Date:   Mon Apr 15 22:23:47 2013 -0700

x86, kdump: Retore crashkernel= to allocate under 896M

Vivek found old kexec-tools does not work new kernel anymore.

So change back crashkernel= back to old behavoir, and add crashkernel_high=
to let user decide if buffer could be above 4G, and also new
kexec-tools will
be needed.

But kexec-tools 2.0.3, released in 2012, can run a 4.20 kernel with
crashkernel=256M@5G, so I think only very old kexec-tools requires memory
under 896M. Since 1) few people run the latest kernel with very old
kexec-tools nowadays, and 2) crashkernel=X is more popular than
crashkernel=X,high, it is time to eliminate this limit of the crashkernel=X
parameter, otherwise we will keep running into this bug.
As for crashkernel=,high, I think it is a more specialized option for those
who care about DMA32. On a high-end machine a big region is reserved for the
crash kernel (e.g. 384M in this case), which makes the already crowded
situation under 4GB memory even worse.
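
To make the difference concrete (the sizes are only examples):

  crashkernel=384M
      the kernel picks the region itself (with the staged fallback:
      below 896M, then below 4G, then up to MAXMEM)
  crashkernel=384M,high
      reserve above 4G, plus a default ~256M ",low" chunk for
      swiotlb/DMA32 buffers
  crashkernel=384M,high crashkernel=64M,low
      the user sizes both the high and the low parts explicitly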

> Yes, just "prefer" is not good enough. There should be a technical
> reason why that's there.
>
> Also, if the user doesn't care, then the code should be free to force
> "high" and thus probe a different range for allocation.
>
Do you suggest removing the crashkernel=X,high parameter?

Thanks,
Pingfan
> > Good question, still it may be some historical reason, but it is good to
> > make them clear and rethink about it after long time.
> >
> > I also want to understand, need dig the log more.
>
> Good idea. That would be a very nice cleanup. :-)
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-21 Thread Pingfan Liu
On Sat, Jan 19, 2019 at 9:25 AM Jerry Hoemann  wrote:
>
> On Tue, Jan 15, 2019 at 04:07:03PM +0800, Pingfan Liu wrote:
> > People reported a bug on a high end server with many pcie devices, where
> > kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> > though we still see much memory under 896 MB, the finding still failed
> > intermittently. Because currently we can only find region under 896 MB,
> > if without ',high' specified. Then KASLR breaks 896 MB into several parts
> > randomly, and crashkernel reservation need be aligned to 128 MB, that's
> > why failure is found. It raises confusion to the end user that sometimes
> > crashkernel=X works while sometimes fails.
> > If want to make it succeed, customer can change kernel option to
> > "crashkernel=384M,high". Just this give "crashkernel=xx@yy" a very
> > limited space to behave even though its grammar looks more generic.
> > And we can't answer questions raised from customer that confidently:
> > 1) why it doesn't succeed to reserve 896 MB;
> > 2) what's wrong with memory region under 4G;
> > 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> > This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
> > finally above 4G.
>
> While allocating crashkernel from below 4G seems fine, won't we have
> problems if the crash kernel gets allocated above 4G because of the SWIOTLB?
>
It will reserve extra memory below 4G for swiotlb. You can find the logic
in reserve_crashkernel_low(). Testing with crashkernel=512M@4G, we get:
cat /proc/iomem  | grep Crash
  aa00-b9ff : Crash kernel
  1-11fff : Crash kernel

Thanks,
Pingfan

> thanks
>
>
> > Dave Young sent the original post, and I just re-post it with commit log
> > improvement as his requirement.
> > http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > There was an old discussion below (previously posted by Chao Wang):
> > https://lkml.org/lkml/2013/10/15/601
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: ying...@kernel.org,
> > Cc: vgo...@redhat.com
> > Cc: Randy Dunlap 
> > ---
> > v6 -> v7: fix spelling mistake pointed out by Randy
> >  arch/x86/kernel/setup.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3d872a5..fa62c81 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
> >   high ? CRASH_ADDR_HIGH_MAX
> >: CRASH_ADDR_LOW_MAX,
> >   crash_size, CRASH_ALIGN);
> > +#ifdef CONFIG_X86_64
> > + /*
> > +  * crashkernel=X reserve below 896M fails? Try below 4G
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + (1ULL << 32),
> > + crash_size, CRASH_ALIGN);
> > + /*
> > +  * crashkernel=X reserve below 4G fails? Try MAXMEM
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + CRASH_ADDR_HIGH_MAX,
> > + crash_size, CRASH_ALIGN);
> > +#endif
> >   if (!crash_base) {
> >   pr_info("crashkernel reservation failed - No suitable 
> > area found.\n");
> >   return;
> > --
> > 2.7.4
> >
> >
> > ___
> > kexec mailing list
> > ke...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
>
> --
>
> -
> Jerry Hoemann  Software Engineer   Hewlett Packard Enterprise
> -


[PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-20 Thread Pingfan Liu
People reported crashkernel=384M reservation failed on a high end server
with KASLR enabled.  In that case there is enough free memory under 896M
but crashkernel reservation still fails intermittently.

The situation is crashkernel reservation code only finds free region under
896 MB with 128M aligned in case no ',high' being used.  And KASLR could
break the first 896M into several parts randomly thus the failure happens.
User has no way to predict and make sure crashkernel=xM working unless
he/she use 'crashkernel=xM,high'.  Since 'crashkernel=xM' is the most
common use case this issue is a serious bug.

And we can't answer questions raised from customer:
1) why it doesn't succeed to reserve 896 MB;
2) what's wrong with memory region under 4G;
3) why I have to add ',high', I only require 384 MB, not 3840 MB.

This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
finally above 4G.

Dave Young sent the original post, and I just re-post it with commit log
improvement as his requirement.
http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
There was an old discussion below (previously posted by Chao Wang):
https://lkml.org/lkml/2013/10/15/601

Signed-off-by: Pingfan Liu 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: ying...@kernel.org,
Cc: vgo...@redhat.com
Cc: Randy Dunlap 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
---
v6 -> v7: commit log improvement
 arch/x86/kernel/setup.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..fa62c81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
high ? CRASH_ADDR_HIGH_MAX
 : CRASH_ADDR_LOW_MAX,
crash_size, CRASH_ALIGN);
+#ifdef CONFIG_X86_64
+   /*
+* crashkernel=X reserve below 896M fails? Try below 4G
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   (1ULL << 32),
+   crash_size, CRASH_ALIGN);
+   /*
+* crashkernel=X reserve below 4G fails? Try MAXMEM
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   CRASH_ADDR_HIGH_MAX,
+   crash_size, CRASH_ALIGN);
+#endif
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
-- 
2.7.4



[PATCHv7] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-15 Thread Pingfan Liu
People reported a bug on a high end server with many pcie devices, where
kernel bootup with crashkernel=384M, and kaslr is enabled. Even
though we still see much memory under 896 MB, the finding still failed
intermittently. Because currently we can only find region under 896 MB,
if without ',high' specified. Then KASLR breaks 896 MB into several parts
randomly, and crashkernel reservation need be aligned to 128 MB, that's
why failure is found. It raises confusion to the end user that sometimes
crashkernel=X works while sometimes fails.
If want to make it succeed, customer can change kernel option to
"crashkernel=384M,high". Just this give "crashkernel=xx@yy" a very
limited space to behave even though its grammar looks more generic.
And we can't answer questions raised from customer that confidently:
1) why it doesn't succeed to reserve 896 MB;
2) what's wrong with memory region under 4G;
3) why I have to add ',high', I only require 384 MB, not 3840 MB.
This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
finally above 4G.
Dave Young sent the original post, and I just re-post it with commit log
improvement as his requirement.
http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
There was an old discussion below (previously posted by Chao Wang):
https://lkml.org/lkml/2013/10/15/601

Signed-off-by: Pingfan Liu 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: ying...@kernel.org,
Cc: vgo...@redhat.com
Cc: Randy Dunlap 
---
v6 -> v7: fix spelling mistake pointed out by Randy
 arch/x86/kernel/setup.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..fa62c81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
high ? CRASH_ADDR_HIGH_MAX
 : CRASH_ADDR_LOW_MAX,
crash_size, CRASH_ALIGN);
+#ifdef CONFIG_X86_64
+   /*
+* crashkernel=X reserve below 896M fails? Try below 4G
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   (1ULL << 32),
+   crash_size, CRASH_ALIGN);
+   /*
+* crashkernel=X reserve below 4G fails? Try MAXMEM
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   CRASH_ADDR_HIGH_MAX,
+   crash_size, CRASH_ALIGN);
+#endif
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
-- 
2.7.4



Re: [PATCHv6] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-15 Thread Pingfan Liu
On Mon, Jan 14, 2019 at 12:24 PM Randy Dunlap  wrote:
>
> Hi,
>
> Just fix a few of the commit log comments...
>
> On 1/13/19 7:15 PM, Pingfan Liu wrote:
> > People reported a bug on a high end server with many pcie devices, where
> > kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> > though we still see much memory under 896 MB, the finding still failed
> > intermittently. Because currently we can only find region under 896 MB,
> > if w/0 ',high' specified. Then KASLR breaks 896 MB into several parts
>
>   if w/o
> or preferably:
>   if without
>
> > randomly, and crashkernel reservation need be aligned to 128 MB, that's
> > why failure is found. It raises confusion to the end user that sometimes
> > crashkernel=X works while sometimes fails.
> > If want to make it succeed, customer can change kernel option to
> > "crashkernel=384M, high". Just this give "crashkernel=xx@yy" a very
>
> no space?  just
>   "crashkernel=384M,high"
>
> > limited space to behave even though its grammer looks more generic.
>
>   grammar
>
Thanks for your review, will cc you in next version.

Regards,
Pingfan
> > And we can't answer questions raised from customer that confidently:
> > 1) why it doesn't succeed to reserve 896 MB;
> > 2) what's wrong with memory region under 4G;
> > 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> > This patch tries to get memory region from 896 MB firstly, then [896MB,4G],
> > finally above 4G.
> > Dave Young sent the original post, and I just re-post it with commit log
> > improvement as his requirement.
> > http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > There was an old discussion below (previously posted by Chao Wang):
> > https://lkml.org/lkml/2013/10/15/601
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: ying...@kernel.org,
> > Cc: vgo...@redhat.com
> > ---
> > v5 -> v6
> >   discard bottom-up allocation, just repost dyoung's original patch with 
> > commit log improved
> > ---
> >  arch/x86/kernel/setup.c | 16 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index 3d872a5..fa62c81 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
> >   high ? CRASH_ADDR_HIGH_MAX
> >: CRASH_ADDR_LOW_MAX,
> >   crash_size, CRASH_ALIGN);
> > +#ifdef CONFIG_X86_64
> > + /*
> > +  * crashkernel=X reserve below 896M fails? Try below 4G
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + (1ULL << 32),
> > + crash_size, CRASH_ALIGN);
> > + /*
> > +  * crashkernel=X reserve below 4G fails? Try MAXMEM
> > +  */
> > + if (!high && !crash_base)
> > + crash_base = memblock_find_in_range(CRASH_ALIGN,
> > + CRASH_ADDR_HIGH_MAX,
> > + crash_size, CRASH_ALIGN);
> > +#endif
> >   if (!crash_base) {
> >   pr_info("crashkernel reservation failed - No suitable 
> > area found.\n");
> >   return;
> >
>
> ciao.
> --
> ~Randy


Re: [PATCHv2 6/7] x86/mm: remove bottom-up allocation style for x86_64

2019-01-14 Thread Pingfan Liu
On Tue, Jan 15, 2019 at 7:27 AM Dave Hansen  wrote:
>
> On 1/10/19 9:12 PM, Pingfan Liu wrote:
> > Although kaslr-kernel can avoid to stain the movable node. [1]
>
> Can you explain what staining is, or perhaps try to use some more
> standard nomenclature?  There are exactly 0 instances of the word
> "stain" in arch/x86/ or mm/.
>
I mean that KASLR may randomly choose a base address that happens to be
located in a movable node.

> > But the
> > pgtable can still stain the movable node. That is a probability problem,
> > although low, but exist. This patch tries to make it certainty by
> > allocating pgtable on unmovable node, instead of following kernel end.
>
> Anyway, can you read my suggested summary in the earlier patch and see
> if it fits or if I missed anything?  This description is really hard to
> read.
>
Your summary in the reply to [PATCH 0/7] expresses things clearly. I will
use it to update the commit log.

> ...> +#ifdef CONFIG_X86_32
> > +
> > +static unsigned long min_pfn_mapped;
> > +
> >  static unsigned long __init get_new_step_size(unsigned long step_size)
> >  {
> >   /*
> > @@ -653,6 +655,32 @@ static void __init memory_map_bottom_up(unsigned long 
> > map_start,
> >   }
> >  }
> >
> > +static unsigned long __init init_range_memory_mapping32(
> > + unsigned long r_start, unsigned long r_end)
> > +{
>
> Why is this returning a value which is not used?
>
> Did you compile this?  Didn't you get a warning that you're not
> returning a value from a function returning non-void?
>
It should be void. I will fix it in the next version.

> Also, I'd much rather see something like this written:
>
> static __init
> unsigned long init_range_memory_mapping32(unsigned long r_start,
>   unsigned long r_end)
>
> than what you have above.  But, if you get rid of the 'unsigned long',
> it will look much more sane in the first place.

Yes. Thank for your kindly review.

Best Regards,
Pingfan


Re: [PATCHv2 2/7] acpi: change the topo of acpi_table_upgrade()

2019-01-14 Thread Pingfan Liu
On Tue, Jan 15, 2019 at 7:12 AM Dave Hansen  wrote:
>
> On 1/10/19 9:12 PM, Pingfan Liu wrote:
> > The current acpi_table_upgrade() relies on initrd_start, but this var is
>
> "var" meaning variable?
>
> Could you please go back and try to ensure you spell out all the words
> you are intending to write?  I think "topo" probably means "topology",
> but it's a really odd word to use for changing the arguments of a
> function, so I'm not sure.
>
> There are a couple more of these in this set.
>
Yes, I will do it and fix them in the next version.

> > only valid after relocate_initrd(). There is requirement to extract the
> > acpi info from initrd before memblock-allocator can work(see [2/4]), hence
> > acpi_table_upgrade() need to accept the input param directly.
>
> "[2/4]"
>
> It looks like you quickly resent this set without updating the patch
> descriptions.
>
> > diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> > index 61203ee..84e0a79 100644
> > --- a/drivers/acpi/tables.c
> > +++ b/drivers/acpi/tables.c
> > @@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, 
> > NR_ACPI_INITRD_TABLES);
> >
> >  #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
> >
> > -void __init acpi_table_upgrade(void)
> > +void __init acpi_table_upgrade(void *data, size_t size)
> >  {
> > - void *data = (void *)initrd_start;
> > - size_t size = initrd_end - initrd_start;
> >   int sig, no, table_nr = 0, total_offset = 0;
> >   long offset = 0;
> >   struct acpi_table_header *table;
>
> I know you are just replacing some existing variables, but we have a
> slightly higher standard for naming when you actually have to specify
> arguments to a function.  Can you please give these proper names?
>
OK, I will change it to acpi_table_upgrade(void *initrd, size_t size).

Thanks,
Pingfan


Re: [PATCHv2 1/7] x86/mm: concentrate the code to memblock allocator enabled

2019-01-14 Thread Pingfan Liu
On Tue, Jan 15, 2019 at 7:07 AM Dave Hansen  wrote:
>
> On 1/10/19 9:12 PM, Pingfan Liu wrote:
> > This patch identifies the point where memblock alloc start. It has no
> > functional.
>
> It has no functional ... what?  Effects?
>
While reorganizing the code, it took me a long time to figure out why
memblock_set_bottom_up(true) is added here, and how far it can be deferred.
Finally I realized that it only takes effect after e820__memblock_setup(),
the point from which the memblock allocator can work. So I concentrated the
related code, and hope this patch clarifies that fact.

> > - memblock_set_current_limit(ISA_END_ADDRESS);
> > - e820__memblock_setup();
> > -
> >   reserve_bios_regions();
> >
> >   if (efi_enabled(EFI_MEMMAP)) {
> > @@ -1113,6 +1087,8 @@ void __init setup_arch(char **cmdline_p)
> >   efi_reserve_boot_services();
> >   }
> >
> > + memblock_set_current_limit(0, ISA_END_ADDRESS, false);
> > + e820__memblock_setup();
>
> It looks like you changed the arguments passed to
> memblock_set_current_limit().  How can this even compile?  Did you mean
> that this patch is not functional?
>
Sorry, a trivial fix was merged in by mistake during rebasing. I will
build-test each patch of the series.

Best regards,
Pingfan


Re: [PATCHv2 0/7] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-14 Thread Pingfan Liu
On Tue, Jan 15, 2019 at 7:02 AM Dave Hansen  wrote:
>
> On 1/10/19 9:12 PM, Pingfan Liu wrote:
> > Background
> > When kaslr kernel can be guaranteed to sit inside unmovable node
> > after [1].
>
> What does this "[1]" refer to?
>
https://lore.kernel.org/patchwork/patch/1029376/

> Also, can you clarify your terminology here a bit.  By "kaslr kernel",
> do you mean the base address?
>
It should be the randomization of the load address. I googled and found
that the proper term is "base address".

> > But if kaslr kernel is located near the end of the movable node,
> > then bottom-up allocator may create pagetable which crosses the boundary
> > between unmovable node and movable node.
>
> Again, I'm confused.  Do you literally mean a single page table page?  I
> think you mean the page tables, but it would be nice to clarify this,
> and also explicitly state which page tables these are.
>
It should be page table pages. The page table is built by init_mem_mapping().

> >  It is a probability issue,
> > two factors include -1. how big the gap between kernel end and
> > unmovable node's end.  -2. how many memory does the system own.
> > Alternative way to fix this issue is by increasing the gap by
> > boot/compressed/kaslr*.
>
> Oh, you mean the KASLR code in arch/x86/boot/compressed/kaslr*.[ch]?
>
Sorry, and yes, code in arch/x86/boot/compressed/kaslr_64.c and kaslr.c

> It took me a minute to figure out you were talking about filenames.
>
> > But taking the scenario of PB level memory, the pagetable will take
> > server MB even if using 1GB page, different page attr and fragment
> > will make things worse. So it is hard to decide how much should the
> > gap increase.
> I'm not following this.  If we move the image around, we leave holes.
> Why do we need page table pages allocated to cover these holes?
>
I mean that in arch/x86/boot/compressed/kaslr.c, store_slot_info() computes
  slot_area.num = (region->size - image_size) / CONFIG_PHYSICAL_ALIGN + 1;
If we denote the size of the page tables as "X", the formula would change to
  slot_area.num = (region->size - image_size - X) / CONFIG_PHYSICAL_ALIGN + 1;
and it is hard to decide X because of the factors above.
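
In code form, the change I have in mind would look roughly like this
(pgtable_reserve stands in for the unknown "X"; a sketch only, not a
proposal):

static void store_slot_info(struct mem_vector *region, unsigned long image_size)
{
        struct slot_area slot_area;
        /* "X": the page-table estimate that is hard to bound, see above */
        unsigned long pgtable_reserve = 0;

        if (slot_area_index == MAX_SLOT_AREA)
                return;

        slot_area.addr = region->start;
        slot_area.num = (region->size - image_size - pgtable_reserve) /
                        CONFIG_PHYSICAL_ALIGN + 1;

        if (slot_area.num > 0) {
                slot_areas[slot_area_index++] = slot_area;
                slot_max += slot_area.num;
        }
}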

> > The following figure show the defection of current bottom-up style:
> >   [startA, endA][startB, "kaslr kernel verly close to" endB][startC, endC]
>
> "defection"?
>
Oh, defect.

> > If nodeA,B is unmovable, while nodeC is movable, then init_mem_mapping()
> > can generate pgtable on nodeC, which stain movable node.
>
> Let me see if I can summarize this:
> 1. The kernel ASLR decompression code picks a spot to place the kernel
>image in physical memory.
> 2. Some page tables are dynamically allocated near (after) this spot.
> 3. Sometimes, based on the random ASLR location, these page tables fall
>over into the "movable node" area.  Being unmovable allocations, this
>is not cool.
> 4. To fix this (on 64-bit at least), we stop allocating page tables
>based on the location of the kernel image.  Instead, we allocate
>using the memblock allocator itself, which knows how to avoid the
>movable node.
>
Yes, you got my idea exactly. Thanks for your help summarizing it; it is
hard for me to express it clearly in English.

> > This patch makes it certainty instead of a probablity problem. It achieves
> > this by pushing forward the parsing of mem hotplug info ahead of 
> > init_mem_mapping().
>
> What does memory hotplug have to do with this?  I thought this was all
> about early boot.

The information about which memory is hot-pluggable is handed to the
memblock allocator via initmem_init()->...->acpi_numa_memory_affinity_init(),
where memblock_mark_hotplug() records it. Later, when the memblock allocator
works, __next_mem_range() checks this info via memblock_is_hotpluggable().
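
In code, the two ends of that flow are roughly the following lines, condensed
from the mainline sources of that time (not new code):

        /* SRAT parsing, acpi_numa_memory_affinity_init(): mark ranges early */
        if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
                memblock_mark_hotplug(start, ma->length);

        /* allocation side, __next_mem_range(): skip such ranges */
        if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
                continue;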

Thanks and regards,
Pingfan


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-14 Thread Pingfan Liu
[...]
> >
> > I would appreciate a help with those architectures because I couldn't
> > really grasp how the memoryless nodes are really initialized there. E.g.
> > ppc only seem to call setup_node_data for online nodes but I couldn't
> > find any special treatment for nodes without any memory.
>
> We have a somewhat dubious hack in our hotplug code, see:
>
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes")
>
> Which basically onlines the node when we hotplug a CPU into it.
>
This bug should be related to the online state of the NUMA nodes during
boot. On PowerNV and pSeries, the boot code does not seem to bring up
memoryless nodes, so it cannot avoid this bug.

Thanks,
Pingfan


Re: [PATCHv2 3/7] mm/memblock: introduce allocation boundary for tracing purpose

2019-01-14 Thread Pingfan Liu
On Mon, Jan 14, 2019 at 4:50 PM Mike Rapoport  wrote:
>
> On Mon, Jan 14, 2019 at 04:33:50PM +0800, Pingfan Liu wrote:
> > On Mon, Jan 14, 2019 at 3:51 PM Mike Rapoport  wrote:
> > >
> > > Hi Pingfan,
> > >
> > > On Fri, Jan 11, 2019 at 01:12:53PM +0800, Pingfan Liu wrote:
> > > > During boot time, there is requirement to tell whether a series of func
> > > > call will consume memory or not. For some reason, a temporary memory
> > > > resource can be loan to those func through memblock allocator, but at a
> > > > check point, all of the loan memory should be turned back.
> > > > A typical using style:
> > > >  -1. find a usable range by memblock_find_in_range(), said, [A,B]
> > > >  -2. before calling a series of func, 
> > > > memblock_set_current_limit(A,B,true)
> > > >  -3. call funcs
> > > >  -4. memblock_find_in_range(A,B,B-A,1), if failed, then some memory is 
> > > > not
> > > >  turned back.
> > > >  -5. reset the original limit
> > > >
> > > > E.g. in the case of hotmovable memory, some acpi routines should be 
> > > > called,
> > > > and they are not allowed to own some movable memory. Although at present
> > > > these functions do not consume memory, but later, if changed without
> > > > awareness, they may do. With the above method, the allocation can be
> > > > detected, and pr_warn() to ask people to resolve it.
> > >
> > > To ensure there were that a sequence of function calls didn't create new
> > > memblock allocations you can simply check the number of the reserved
> > > regions before and after that sequence.
> > >
> > Yes, thank you point out it.
> >
> > > Still, I'm not sure it would be practical to try tracking what code 
> > > that's called
> > > from x86::setup_arch() did memory allocation.
> > > Probably a better approach is to verify no memory ended up in the movable
> > > areas after their extents are known.
> > >
> > It is a probability problem whether allocated memory sit on hotmovable
> > memory or not. And if warning based on the verification, then it is
> > also a probability problem and maybe we will miss it.
>
> I'm not sure I'm following you here.
>
> After the hotmovable memory configuration is detected it is possible to
> traverse reserved memblock areas and warn if some of them reside in the
> hotmovable memory.
>
Oh, sorry that I did not explain it accurately. Let us say a machine has
nodes A/B/C, ordered from low to high memory address. With the default
top-down allocation, memory at this point will always be allocated from
node C, but whether node C is hot-removable depends on the machine. The
verification can pass on a machine where node C is unmovable but fail on
a machine where node C is movable, so it remains a probability issue.
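
For illustration, a minimal sketch of the verification discussed above (a
hypothetical helper, not something in the tree), assuming the SRAT-derived
hotplug ranges have already been flagged via memblock_mark_hotplug():

/*
 * Hypothetical boot-time check, illustration only: warn if any reserved
 * memblock region intersects a memory region flagged as hotpluggable.
 */
static void __init warn_reserved_in_hotplug(void)
{
	struct memblock_region *mem, *rsv;

	for_each_memblock(memory, mem) {
		if (!memblock_is_hotpluggable(mem))
			continue;

		for_each_memblock(reserved, rsv) {
			if (rsv->base < mem->base + mem->size &&
			    mem->base < rsv->base + rsv->size)
				pr_warn("reservation %pa (size %pa) overlaps hotpluggable memory\n",
					&rsv->base, &rsv->size);
		}
	}
}

As noted above, whether such a warning ever fires depends on the node
layout of the particular machine, which is why it only catches the problem
probabilistically.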

Thanks

[...]


Re: [PATCHv2 3/7] mm/memblock: introduce allocation boundary for tracing purpose

2019-01-14 Thread Pingfan Liu
On Mon, Jan 14, 2019 at 3:51 PM Mike Rapoport  wrote:
>
> Hi Pingfan,
>
> On Fri, Jan 11, 2019 at 01:12:53PM +0800, Pingfan Liu wrote:
> > During boot time, there is requirement to tell whether a series of func
> > call will consume memory or not. For some reason, a temporary memory
> > resource can be loan to those func through memblock allocator, but at a
> > check point, all of the loan memory should be turned back.
> > A typical using style:
> >  -1. find a usable range by memblock_find_in_range(), said, [A,B]
> >  -2. before calling a series of func, memblock_set_current_limit(A,B,true)
> >  -3. call funcs
> >  -4. memblock_find_in_range(A,B,B-A,1), if failed, then some memory is not
> >  turned back.
> >  -5. reset the original limit
> >
> > E.g. in the case of hotmovable memory, some acpi routines should be called,
> > and they are not allowed to own some movable memory. Although at present
> > these functions do not consume memory, but later, if changed without
> > awareness, they may do. With the above method, the allocation can be
> > detected, and pr_warn() to ask people to resolve it.
>
> To ensure there were that a sequence of function calls didn't create new
> memblock allocations you can simply check the number of the reserved
> regions before and after that sequence.
>
Yes, thank you for pointing that out.

> Still, I'm not sure it would be practical to try tracking what code that's 
> called
> from x86::setup_arch() did memory allocation.
> Probably a better approach is to verify no memory ended up in the movable
> areas after their extents are known.
>
Whether the allocated memory sits on hot-removable memory is a probability
problem. If the warning is based on that verification, it is also
probabilistic, and we may miss it.

Thanks and regards,
Pingfan

> > Signed-off-by: Pingfan Liu 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: "H. Peter Anvin" 
> > Cc: Dave Hansen 
> > Cc: Andy Lutomirski 
> > Cc: Peter Zijlstra 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Len Brown 
> > Cc: Yinghai Lu 
> > Cc: Tejun Heo 
> > Cc: Chao Fan 
> > Cc: Baoquan He 
> > Cc: Juergen Gross 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: Vlastimil Babka 
> > Cc: Michal Hocko 
> > Cc: x...@kernel.org
> > Cc: linux-a...@vger.kernel.org
> > Cc: linux...@kvack.org
> > ---
> >  arch/arm/mm/init.c  |  3 ++-
> >  arch/arm/mm/mmu.c   |  4 ++--
> >  arch/arm/mm/nommu.c |  2 +-
> >  arch/csky/kernel/setup.c|  2 +-
> >  arch/microblaze/mm/init.c   |  2 +-
> >  arch/mips/kernel/setup.c|  2 +-
> >  arch/powerpc/mm/40x_mmu.c   |  6 --
> >  arch/powerpc/mm/44x_mmu.c   |  2 +-
> >  arch/powerpc/mm/8xx_mmu.c   |  2 +-
> >  arch/powerpc/mm/fsl_booke_mmu.c |  5 +++--
> >  arch/powerpc/mm/hash_utils_64.c |  4 ++--
> >  arch/powerpc/mm/init_32.c   |  2 +-
> >  arch/powerpc/mm/pgtable-radix.c |  2 +-
> >  arch/powerpc/mm/ppc_mmu_32.c|  8 ++--
> >  arch/powerpc/mm/tlb_nohash.c|  6 --
> >  arch/unicore32/mm/mmu.c |  2 +-
> >  arch/x86/kernel/setup.c |  2 +-
> >  arch/xtensa/mm/init.c   |  2 +-
> >  include/linux/memblock.h| 10 +++---
> >  mm/memblock.c   | 23 ++-
> >  20 files changed, 59 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
> > index 32e4845..58a4342 100644
> > --- a/arch/arm/mm/init.c
> > +++ b/arch/arm/mm/init.c
> > @@ -93,7 +93,8 @@ __tagtable(ATAG_INITRD2, parse_tag_initrd2);
> >  static void __init find_limits(unsigned long *min, unsigned long *max_low,
> >  unsigned long *max_high)
> >  {
> > - *max_low = PFN_DOWN(memblock_get_current_limit());
> > + memblock_get_current_limit(NULL, max_low);
> > + *max_low = PFN_DOWN(*max_low);
> >   *min = PFN_UP(memblock_start_of_DRAM());
> >   *max_high = PFN_DOWN(memblock_end_of_DRAM());
> >  }
> > diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
> > index f5cc1cc..9025418 100644
> > --- a/arch/arm/mm/mmu.c
> > +++ b/arch/arm/mm/mmu.c
> > @@ -1240,7 +1240,7 @@ void __init adjust_lowmem_bounds(void)
> >   }
> >   }
> >
> > - memblock_set_current_limit(memblock_limit);
> > + memblock_set_current_limit(0, memblock_limit, false);
> >  }
> >
> >  static inline void prepare_page_table(void)

[PATCHv6] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-13 Thread Pingfan Liu
People reported a bug on a high-end server with many PCIe devices, where
the kernel boots with crashkernel=384M and KASLR enabled. Even though
plenty of memory is still visible under 896 MB, the search for a region
fails intermittently, because without ',high' specified we can currently
only find a region under 896 MB. KASLR randomly breaks that 896 MB into
several parts, and since the crashkernel reservation must be aligned to
128 MB, the search can fail. This confuses end users: sometimes
crashkernel=X works and sometimes it fails.
To make it succeed, a customer can change the kernel option to
"crashkernel=384M,high", but that leaves "crashkernel=xx@yy" very little
room to behave even though its grammar looks more generic.
And we cannot answer the questions customers raise with confidence:
1) why it fails to reserve the region under 896 MB;
2) what is wrong with the memory region under 4G;
3) why ',high' has to be added when only 384 MB is required, not 3840 MB.
This patch tries to get a memory region below 896 MB first, then in
[896 MB, 4G], and finally above 4G.
Dave Young sent the original post, and I am re-posting it with an improved
commit log at his request.
http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
There was an older discussion below (previously posted by Chao Wang):
https://lkml.org/lkml/2013/10/15/601

Signed-off-by: Pingfan Liu 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: ying...@kernel.org,
Cc: vgo...@redhat.com
---
v5 -> v6
  discard bottom-up allocation; just repost Dave Young's original patch with
an improved commit log
---
 arch/x86/kernel/setup.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..fa62c81 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -551,6 +551,22 @@ static void __init reserve_crashkernel(void)
high ? CRASH_ADDR_HIGH_MAX
 : CRASH_ADDR_LOW_MAX,
crash_size, CRASH_ALIGN);
+#ifdef CONFIG_X86_64
+   /*
+* crashkernel=X reserve below 896M fails? Try below 4G
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   (1ULL << 32),
+   crash_size, CRASH_ALIGN);
+   /*
+* crashkernel=X reserve below 4G fails? Try MAXMEM
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(CRASH_ALIGN,
+   CRASH_ADDR_HIGH_MAX,
+   crash_size, CRASH_ALIGN);
+#endif
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
-- 
2.7.4



Re: [PATCHv2 2/7] acpi: change the topo of acpi_table_upgrade()

2019-01-11 Thread Pingfan Liu
On Fri, Jan 11, 2019 at 1:31 PM Chao Fan  wrote:
>
> On Fri, Jan 11, 2019 at 01:12:52PM +0800, Pingfan Liu wrote:
> >The current acpi_table_upgrade() relies on initrd_start, but this var is
> >only valid after relocate_initrd(). There is requirement to extract the
> >acpi info from initrd before memblock-allocator can work(see [2/4]), hence
> >acpi_table_upgrade() need to accept the input param directly.
> >
> >Signed-off-by: Pingfan Liu 
> >Acked-by: "Rafael J. Wysocki" 
> >Cc: Thomas Gleixner 
> >Cc: Ingo Molnar 
> >Cc: Borislav Petkov 
> >Cc: "H. Peter Anvin" 
> >Cc: Dave Hansen 
> >Cc: Andy Lutomirski 
> >Cc: Peter Zijlstra 
> >Cc: "Rafael J. Wysocki" 
> >Cc: Len Brown 
> >Cc: Yinghai Lu 
> >Cc: Tejun Heo 
> >Cc: Chao Fan 
> >Cc: Baoquan He 
> >Cc: Juergen Gross 
> >Cc: Andrew Morton 
> >Cc: Mike Rapoport 
> >Cc: Vlastimil Babka 
> >Cc: Michal Hocko 
> >Cc: x...@kernel.org
> >Cc: linux-a...@vger.kernel.org
> >Cc: linux...@kvack.org
> >---
> > arch/arm64/kernel/setup.c | 2 +-
> > arch/x86/kernel/setup.c   | 2 +-
> > drivers/acpi/tables.c | 4 +---
> > include/linux/acpi.h  | 4 ++--
> > 4 files changed, 5 insertions(+), 7 deletions(-)
> >
> >diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> >index f4fc1e0..bc4b47d 100644
> >--- a/arch/arm64/kernel/setup.c
> >+++ b/arch/arm64/kernel/setup.c
> >@@ -315,7 +315,7 @@ void __init setup_arch(char **cmdline_p)
> >   paging_init();
> >   efi_apply_persistent_mem_reservations();
> >
> >-  acpi_table_upgrade();
> >+  acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
> >
> >   /* Parse the ACPI tables for possible boot-time configuration */
> >   acpi_boot_table_init();
> >diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> >index ac432ae..dc8fc5d 100644
> >--- a/arch/x86/kernel/setup.c
> >+++ b/arch/x86/kernel/setup.c
> >@@ -1172,8 +1172,8 @@ void __init setup_arch(char **cmdline_p)
> >
> >   reserve_initrd();
> >
> >-  acpi_table_upgrade();
> >
> I wonder whether this will cause two blank lines together.
>
Yes, I will fix it in the next version.

Thanks,
Pingfan
> Thanks,
> Chao Fan
>
> >+  acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
> >   vsmp_init();
> >
> >   io_delay_init();
> >diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> >index 61203ee..84e0a79 100644
> >--- a/drivers/acpi/tables.c
> >+++ b/drivers/acpi/tables.c
> >@@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, 
> >NR_ACPI_INITRD_TABLES);
> >
> > #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
> >
> >-void __init acpi_table_upgrade(void)
> >+void __init acpi_table_upgrade(void *data, size_t size)
> > {
> >-  void *data = (void *)initrd_start;
> >-  size_t size = initrd_end - initrd_start;
> >   int sig, no, table_nr = 0, total_offset = 0;
> >   long offset = 0;
> >   struct acpi_table_header *table;
> >diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> >index ed80f14..0b6e0b6 100644
> >--- a/include/linux/acpi.h
> >+++ b/include/linux/acpi.h
> >@@ -1254,9 +1254,9 @@ acpi_graph_get_remote_endpoint(const struct 
> >fwnode_handle *fwnode,
> > #endif
> >
> > #ifdef CONFIG_ACPI_TABLE_UPGRADE
> >-void acpi_table_upgrade(void);
> >+void acpi_table_upgrade(void *data, size_t size);
> > #else
> >-static inline void acpi_table_upgrade(void) { }
> >+static inline void acpi_table_upgrade(void *data, size_t size) { }
> > #endif
> >
> > #if defined(CONFIG_ACPI) && defined(CONFIG_ACPI_WATCHDOG)
> >--
> >2.7.4
> >
> >
> >
>
>


Re: [PATCHv2 1/7] x86/mm: concentrate the code to memblock allocator enabled

2019-01-11 Thread Pingfan Liu
On Fri, Jan 11, 2019 at 2:13 PM Chao Fan  wrote:
>
> On Fri, Jan 11, 2019 at 01:12:51PM +0800, Pingfan Liu wrote:
> >This patch identifies the point where memblock alloc start. It has no
> >functional.
> [...]
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+  /*
> >+   * Memory used by the kernel cannot be hot-removed because Linux
> >+   * cannot migrate the kernel pages. When memory hotplug is
> >+   * enabled, we should prevent memblock from allocating memory
> >+   * for the kernel.
> >+   *
> >+   * ACPI SRAT records all hotpluggable memory ranges. But before
> >+   * SRAT is parsed, we don't know about it.
> >+   *
> >+   * The kernel image is loaded into memory at very early time. We
> >+   * cannot prevent this anyway. So on NUMA system, we set any
> >+   * node the kernel resides in as un-hotpluggable.
> >+   *
> >+   * Since on modern servers, one node could have double-digit
> >+   * gigabytes memory, we can assume the memory around the kernel
> >+   * image is also un-hotpluggable. So before SRAT is parsed, just
> >+   * allocate memory near the kernel image to try the best to keep
> >+   * the kernel away from hotpluggable memory.
> >+   */
> >+  if (movable_node_is_enabled())
> >+  memblock_set_bottom_up(true);
>
> Hi Pingfan,
>
> In my understanding, 'movable_node' is based on the that memory near
> kernel is considered as in the same node as kernel in high possibility.
>
> If SRAT has been parsed early, do we still need the kernel parameter
> 'movable_node'? Since you have got the memory information about hot-remove,
> so I wonder if it's OK to drop 'movable_node', and if memory-hotremove is
> enabled, change memblock allocation according to SRAT.
>
x86_32 still needs this logic. Maybe it can be done for x86_32 later.

Thanks,
Pingfan
> If there is something wrong in my understanding, please let me know.
>
> Thanks,
> Chao Fan
>
> >+#endif
> >   init_mem_mapping();
> >+  memblock_set_current_limit(get_max_mapped());
> >
> >   idt_setup_early_pf();
> >
> >@@ -1145,8 +1145,6 @@ void __init setup_arch(char **cmdline_p)
> >*/
> >   mmu_cr4_features = __read_cr4() & ~X86_CR4_PCIDE;
> >
> >-  memblock_set_current_limit(get_max_mapped());
> >-
> >   /*
> >* NOTE: On x86-32, only from this point on, fixmaps are ready for 
> > use.
> >*/
> >--
> >2.7.4
> >
> >
> >
>
>


[PATCHv2 4/7] x86/setup: parse acpi to get hotplug info before init_mem_mapping()

2019-01-10 Thread Pingfan Liu
At present, memblock bottom-up allocation helps us avoid staining the
movable node with high probability. But if the hotplug info has already
been parsed, the memblock allocator can step around the movable node by
itself. This patch pushes the parsing step forward, to just before the
point where the memblock allocator starts to work. For how the memblock
allocator steps around the movable node, refer to the check on
memblock_is_hotpluggable() in __next_mem_range().
Later in this series, the bottom-up allocation style is removed on x86_64.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/x86/kernel/setup.c | 39 ++-
 include/linux/acpi.h|  1 +
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a0122cd..9b57e01 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -804,6 +804,35 @@ dump_kernel_offset(struct notifier_block *self, unsigned 
long v, void *p)
return 0;
 }
 
+static void early_acpi_parse(void)
+{
+   phys_addr_t start, end, orig_start, orig_end;
+   bool enforcing;
+
+   enforcing = memblock_get_current_limit(&orig_start, &orig_end);
+   /* find a 16MB slot for temporary usage by the following routines. */
+   start = memblock_find_in_range(ISA_END_ADDRESS,
+   max_pfn, 1 << 24, 1);
+   end = start + 1 + (1 << 24);
+   memblock_set_current_limit(start, end, true);
+#ifdef CONFIG_BLK_DEV_INITRD
+   if (get_ramdisk_size())
+   acpi_table_upgrade(__va(get_ramdisk_image()),
+   get_ramdisk_size());
+#endif
+   /*
+* Parse the ACPI tables for possible boot-time SMP configuration.
+*/
+   acpi_boot_table_init();
+   early_acpi_boot_init();
+   initmem_init();
+   /* check whether memory is returned or not */
+   start = memblock_find_in_range(start, end, 1<<24, 1);
+   if (!start)
+   pr_warn("the above acpi routines change and consume memory\n");
+   memblock_set_current_limit(orig_start, orig_end, enforcing);
+}
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use these data structures
@@ -1129,6 +1158,7 @@ void __init setup_arch(char **cmdline_p)
if (movable_node_is_enabled())
memblock_set_bottom_up(true);
 #endif
+   early_acpi_parse();
init_mem_mapping();
memblock_set_current_limit(0, get_max_mapped(), false);
 
@@ -1173,21 +1203,12 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();
 
 
-   acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
vsmp_init();
 
io_delay_init();
 
early_platform_quirks();
 
-   /*
-* Parse the ACPI tables for possible boot-time SMP configuration.
-*/
-   acpi_boot_table_init();
-
-   early_acpi_boot_init();
-
-   initmem_init();
dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
/*
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 0b6e0b6..4f6b391 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -235,6 +235,7 @@ int acpi_mps_check (void);
 int acpi_numa_init (void);
 
 int acpi_table_init (void);
+void acpi_tb_terminate(void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
 int __init acpi_table_parse_entries(char *id, unsigned long table_size,
  int entry_id,
-- 
2.7.4



[PATCHv2 2/7] acpi: change the topo of acpi_table_upgrade()

2019-01-10 Thread Pingfan Liu
The current acpi_table_upgrade() relies on initrd_start, but this variable
is only valid after relocate_initrd(). There is a requirement to extract
the ACPI info from the initrd before the memblock allocator can work (see
[2/4]), hence acpi_table_upgrade() needs to accept the input parameters
directly.

Signed-off-by: Pingfan Liu 
Acked-by: "Rafael J. Wysocki" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/arm64/kernel/setup.c | 2 +-
 arch/x86/kernel/setup.c   | 2 +-
 drivers/acpi/tables.c | 4 +---
 include/linux/acpi.h  | 4 ++--
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index f4fc1e0..bc4b47d 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -315,7 +315,7 @@ void __init setup_arch(char **cmdline_p)
paging_init();
efi_apply_persistent_mem_reservations();
 
-   acpi_table_upgrade();
+   acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
 
/* Parse the ACPI tables for possible boot-time configuration */
acpi_boot_table_init();
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ac432ae..dc8fc5d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1172,8 +1172,8 @@ void __init setup_arch(char **cmdline_p)
 
reserve_initrd();
 
-   acpi_table_upgrade();
 
+   acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
vsmp_init();
 
io_delay_init();
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 61203ee..84e0a79 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, 
NR_ACPI_INITRD_TABLES);
 
 #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
 
-void __init acpi_table_upgrade(void)
+void __init acpi_table_upgrade(void *data, size_t size)
 {
-   void *data = (void *)initrd_start;
-   size_t size = initrd_end - initrd_start;
int sig, no, table_nr = 0, total_offset = 0;
long offset = 0;
struct acpi_table_header *table;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index ed80f14..0b6e0b6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1254,9 +1254,9 @@ acpi_graph_get_remote_endpoint(const struct fwnode_handle 
*fwnode,
 #endif
 
 #ifdef CONFIG_ACPI_TABLE_UPGRADE
-void acpi_table_upgrade(void);
+void acpi_table_upgrade(void *data, size_t size);
 #else
-static inline void acpi_table_upgrade(void) { }
+static inline void acpi_table_upgrade(void *data, size_t size) { }
 #endif
 
 #if defined(CONFIG_ACPI) && defined(CONFIG_ACPI_WATCHDOG)
-- 
2.7.4



[PATCHv2 5/7] x86/mm: set allowed range for memblock allocator

2019-01-10 Thread Pingfan Liu
Due to the coming divergence of x86_32 and x86_64, there is a requirement
to set the allowed allocation range at the early boot stage.
This patch also includes a minor change to remove a redundant condition
check: referring to memblock_find_in_range_node(), memblock_find_in_range()
already protects itself from the case start > end.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/x86/mm/init.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ef99f38..385b9cd 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -76,6 +76,14 @@ static unsigned long min_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
 
+static unsigned long min_pfn_allowed;
+static unsigned long max_pfn_allowed;
+void set_alloc_range(unsigned long low, unsigned long high)
+{
+   min_pfn_allowed = low;
+   max_pfn_allowed = high;
+}
+
 /*
  * Pages returned are already directly mapped.
  *
@@ -100,12 +108,10 @@ __ref void *alloc_low_pages(unsigned int num)
if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret = 0;
 
-   if (min_pfn_mapped < max_pfn_mapped) {
-   ret = memblock_find_in_range(
-   min_pfn_mapped << PAGE_SHIFT,
-   max_pfn_mapped << PAGE_SHIFT,
-   PAGE_SIZE * num , PAGE_SIZE);
-   }
+   ret = memblock_find_in_range(
+   min_pfn_allowed << PAGE_SHIFT,
+   max_pfn_allowed << PAGE_SHIFT,
+   PAGE_SIZE * num, PAGE_SIZE);
if (ret)
memblock_reserve(ret, PAGE_SIZE * num);
else if (can_use_brk_pgt)
@@ -588,14 +594,17 @@ static void __init memory_map_top_down(unsigned long 
map_start,
start = map_start;
mapped_ram_size += init_range_memory_mapping(start,
last_start);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
last_start = start;
min_pfn_mapped = last_start >> PAGE_SHIFT;
if (mapped_ram_size >= step_size)
step_size = get_new_step_size(step_size);
}
 
-   if (real_end < map_end)
+   if (real_end < map_end) {
init_range_memory_mapping(real_end, map_end);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+   }
 }
 
 /**
@@ -636,6 +645,7 @@ static void __init memory_map_bottom_up(unsigned long 
map_start,
}
 
mapped_ram_size += init_range_memory_mapping(start, next);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
start = next;
 
if (mapped_ram_size >= step_size)
-- 
2.7.4



[PATCHv2 7/7] x86/mm: isolate the bottom-up style to init_32.c

2019-01-10 Thread Pingfan Liu
The bottom-up style is no longer needed on x86_64, so isolate it. Later, it
may be removed from x86 completely.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/x86/mm/init.c| 153 +-
 arch/x86/mm/init_32.c | 147 
 arch/x86/mm/mm_internal.h |   8 ++-
 3 files changed, 155 insertions(+), 153 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 003ad77..6a853e4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -502,7 +502,7 @@ unsigned long __ref init_memory_mapping(unsigned long start,
  * That range would have hole in the middle or ends, and only ram parts
  * will be mapped in init_range_memory_mapping().
  */
-static unsigned long __init init_range_memory_mapping(
+unsigned long __init init_range_memory_mapping(
   unsigned long r_start,
   unsigned long r_end)
 {
@@ -530,157 +530,6 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
 }
 
-#ifdef CONFIG_X86_32
-
-static unsigned long min_pfn_mapped;
-
-static unsigned long __init get_new_step_size(unsigned long step_size)
-{
-   /*
-* Initial mapped size is PMD_SIZE (2M).
-* We can not set step_size to be PUD_SIZE (1G) yet.
-* In worse case, when we cross the 1G boundary, and
-* PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
-* to map 1G range with PTE. Hence we use one less than the
-* difference of page table level shifts.
-*
-* Don't need to worry about overflow in the top-down case, on 32bit,
-* when step_size is 0, round_down() returns 0 for start, and that
-* turns it into 0x1ULL.
-* In the bottom-up case, round_up(x, 0) returns 0 though too, which
-* needs to be taken into consideration by the code below.
-*/
-   return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
-}
-
-/**
- * memory_map_top_down - Map [map_start, map_end) top down
- * @map_start: start address of the target memory range
- * @map_end: end address of the target memory range
- *
- * This function will setup direct mapping for memory range
- * [map_start, map_end) in top-down. That said, the page tables
- * will be allocated at the end of the memory, and we map the
- * memory in top-down.
- */
-static void __init memory_map_top_down(unsigned long map_start,
-  unsigned long map_end)
-{
-   unsigned long real_end, start, last_start;
-   unsigned long step_size;
-   unsigned long addr;
-   unsigned long mapped_ram_size = 0;
-
-   /* xen has big range in reserved near end of ram, skip it at first.*/
-   addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
-   real_end = addr + PMD_SIZE;
-
-   /* step_size need to be small so pgt_buf from BRK could cover it */
-   step_size = PMD_SIZE;
-   max_pfn_mapped = 0; /* will get exact value next */
-   min_pfn_mapped = real_end >> PAGE_SHIFT;
-   last_start = start = real_end;
-
-   /*
-* We start from the top (end of memory) and go to the bottom.
-* The memblock_find_in_range() gets us a block of RAM from the
-* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-* for page table.
-*/
-   while (last_start > map_start) {
-   if (last_start > step_size) {
-   start = round_down(last_start - 1, step_size);
-   if (start < map_start)
-   start = map_start;
-   } else
-   start = map_start;
-   mapped_ram_size += init_range_memory_mapping(start,
-   last_start);
-   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
-   last_start = start;
-   min_pfn_mapped = last_start >> PAGE_SHIFT;
-   if (mapped_ram_size >= step_size)
-   step_size = get_new_step_size(step_size);
-   }
-
-   if (real_end < map_end) {
-   init_range_memory_mapping(real_end, map_end);
-   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
-   }
-}
-
-/**
- * memory_map_bottom_up - Map [map_start, map_end) bottom up
- * @map_start: start address of the target memory range
- * @map_end: end address of the target memory range
- *
- * Thi

[PATCHv2 6/7] x86/mm: remove bottom-up allocation style for x86_64

2019-01-10 Thread Pingfan Liu
Although the KASLR kernel can avoid staining the movable node after [1],
the page tables can still stain the movable node. That is a probability
problem: the probability is low, but it exists. This patch makes the
outcome certain by allocating page tables on an unmovable node instead of
following the kernel end.
This patch achieves two things:
-1st. keep the page-table subtree away from the movable node.
With the previous patch, at the point of init_mem_mapping() the memblock
allocator can work with the knowledge of the ACPI memory hotplug info and
avoid staining the movable node. As a result, memory_map_bottom_up() is
no longer needed.
The following figure shows the defect of the current bottom-up style:
  [startA, endA][startB, "kaslr kernel very close to" endB][startC, endC]
If nodes A and B are unmovable while node C is movable, then
init_mem_mapping() can generate page tables on node C, which stains the
movable node.
For a more lengthy background, please refer to the Background section.

-2nd. simplify the logic of memory_map_top_down().
Thanks to early_make_pgtable(), x86_64 can directly set up the page-table
subtree at any place, hence the careful iteration in memory_map_top_down()
can be discarded.

*Background section*
The KASLR kernel can be guaranteed to sit inside an unmovable node
after [1]. But if the KASLR kernel is located near the end of the movable
node, the bottom-up allocator may create page tables which cross the
boundary between the unmovable node and the movable node. It is a
probability issue, and two factors affect it: -1. how big the gap is
between the kernel end and the unmovable node's end; -2. how much memory
the system owns.
An alternative way to fix this issue is to increase the gap in
boot/compressed/kaslr*. But in the scenario of PB-level memory, the page
tables will take several MB even with 1 GB pages, and different page
attributes and fragmentation will make things worse. So it is hard to
decide how much the gap should increase.

[1]: https://lore.kernel.org/patchwork/patch/1029376/
Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org

---
 arch/x86/kernel/setup.c |  4 ++--
 arch/x86/mm/init.c  | 56 ++---
 2 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9b57e01..00a1b84 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -827,7 +827,7 @@ static void early_acpi_parse(void)
early_acpi_boot_init();
initmem_init();
/* check whether memory is returned or not */
-   start = memblock_find_in_range(start, end, 1<<24, 1);
+   start = memblock_find_in_range(start, end, 1 << 24, 1);
if (!start)
pr_warn("the above acpi routines change and consume memory\n");
memblock_set_current_limit(orig_start, orig_end, enforcing);
@@ -1135,7 +1135,7 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
 
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_X86_32)
/*
 * Memory used by the kernel cannot be hot-removed because Linux
 * cannot migrate the kernel pages. When memory hotplug is
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 385b9cd..003ad77 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -72,8 +72,6 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
-
 static bool __initdata can_use_brk_pgt = true;
 
 static unsigned long min_pfn_allowed;
@@ -532,6 +530,10 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
 }
 
+#ifdef CONFIG_X86_32
+
+static unsigned long min_pfn_mapped;
+
 static unsigned long __init get_new_step_size(unsigned long step_size)
 {
/*
@@ -653,6 +655,32 @@ static void __init memory_map_bottom_up(unsigned long 
map_start,
}
 }
 
+static unsigned long __init init_range_memory_mapping32(
+   unsigned long r_start, unsigned long r_end)
+{
+   /*
+* If the allocation is in bottom-up direction, we setup direct mapping
+* in bottom-up, otherwise we setup direct mapping in top-down.
+*/
+   if (memblock_bottom_up()) {
+   unsigned long kernel_end = __pa_symbol(_end);
+
+   /*
+* we need two separate calls here. This is because we want to
+* allocate page tables above the kernel. So we first map
+   

[PATCHv2 1/7] x86/mm: concentrate the code to memblock allocator enabled

2019-01-10 Thread Pingfan Liu
This patch identifies the point where memblock allocation starts. It has
no functional change.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/x86/kernel/setup.c | 54 -
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d494b9b..ac432ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -962,29 +962,6 @@ void __init setup_arch(char **cmdline_p)
 
if (efi_enabled(EFI_BOOT))
efi_memblock_x86_reserve_range();
-#ifdef CONFIG_MEMORY_HOTPLUG
-   /*
-* Memory used by the kernel cannot be hot-removed because Linux
-* cannot migrate the kernel pages. When memory hotplug is
-* enabled, we should prevent memblock from allocating memory
-* for the kernel.
-*
-* ACPI SRAT records all hotpluggable memory ranges. But before
-* SRAT is parsed, we don't know about it.
-*
-* The kernel image is loaded into memory at very early time. We
-* cannot prevent this anyway. So on NUMA system, we set any
-* node the kernel resides in as un-hotpluggable.
-*
-* Since on modern servers, one node could have double-digit
-* gigabytes memory, we can assume the memory around the kernel
-* image is also un-hotpluggable. So before SRAT is parsed, just
-* allocate memory near the kernel image to try the best to keep
-* the kernel away from hotpluggable memory.
-*/
-   if (movable_node_is_enabled())
-   memblock_set_bottom_up(true);
-#endif
 
x86_report_nx();
 
@@ -1096,9 +1073,6 @@ void __init setup_arch(char **cmdline_p)
 
cleanup_highmap();
 
-   memblock_set_current_limit(ISA_END_ADDRESS);
-   e820__memblock_setup();
-
reserve_bios_regions();
 
if (efi_enabled(EFI_MEMMAP)) {
@@ -1113,6 +1087,8 @@ void __init setup_arch(char **cmdline_p)
efi_reserve_boot_services();
}
 
+   memblock_set_current_limit(0, ISA_END_ADDRESS, false);
+   e820__memblock_setup();
/* preallocate 4k for mptable mpc */
e820__memblock_alloc_reserved_mpc_new();
 
@@ -1130,7 +1106,31 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+   /*
+* Memory used by the kernel cannot be hot-removed because Linux
+* cannot migrate the kernel pages. When memory hotplug is
+* enabled, we should prevent memblock from allocating memory
+* for the kernel.
+*
+* ACPI SRAT records all hotpluggable memory ranges. But before
+* SRAT is parsed, we don't know about it.
+*
+* The kernel image is loaded into memory at very early time. We
+* cannot prevent this anyway. So on NUMA system, we set any
+* node the kernel resides in as un-hotpluggable.
+*
+* Since on modern servers, one node could have double-digit
+* gigabytes memory, we can assume the memory around the kernel
+* image is also un-hotpluggable. So before SRAT is parsed, just
+* allocate memory near the kernel image to try the best to keep
+* the kernel away from hotpluggable memory.
+*/
+   if (movable_node_is_enabled())
+   memblock_set_bottom_up(true);
+#endif
init_mem_mapping();
+   memblock_set_current_limit(get_max_mapped());
 
idt_setup_early_pf();
 
@@ -1145,8 +1145,6 @@ void __init setup_arch(char **cmdline_p)
 */
mmu_cr4_features = __read_cr4() & ~X86_CR4_PCIDE;
 
-   memblock_set_current_limit(get_max_mapped());
-
/*
 * NOTE: On x86-32, only from this point on, fixmaps are ready for use.
 */
-- 
2.7.4



[PATCHv2 3/7] mm/memblock: introduce allocation boundary for tracing purpose

2019-01-10 Thread Pingfan Liu
During boot time, there is a requirement to tell whether a series of
function calls will consume memory or not. For some reason, a temporary
memory resource can be loaned to those functions through the memblock
allocator, but at a checkpoint all of the loaned memory should have been
returned.
A typical usage pattern (a sketch follows below):
 -1. find a usable range with memblock_find_in_range(), say [A,B]
 -2. before calling the series of functions, memblock_set_current_limit(A,B,true)
 -3. call the functions
 -4. memblock_find_in_range(A,B,B-A,1); if this fails, some memory has not
 been returned
 -5. reset the original limit

E.g. in the case of hot-removable memory, some ACPI routines have to be
called, and they are not allowed to own movable memory. Although at present
these functions do not consume memory, they may do so later if changed
without awareness. With the above method, such an allocation can be
detected, and pr_warn() asks people to resolve it.
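
A minimal sketch of that usage pattern, written against the extended
memblock_set_current_limit()/memblock_get_current_limit() signatures added
by this patch (the caller and some_early_routines() are hypothetical,
illustration only):

/*
 * Borrow a 16 MB window for a series of early calls and verify that it is
 * fully returned afterwards.
 */
static void __init call_funcs_with_loaned_memory(void)
{
	phys_addr_t orig_start, orig_end, start, end;
	bool enforcing;

	enforcing = memblock_get_current_limit(&orig_start, &orig_end);

	/* -1. find a usable range [A, B] */
	start = memblock_find_in_range(ISA_END_ADDRESS,
				       max_pfn << PAGE_SHIFT, SZ_16M, PAGE_SIZE);
	end = start + SZ_16M;

	/* -2. restrict allocations to the loaned range */
	memblock_set_current_limit(start, end, true);

	/* -3. call the functions that may (or may not) allocate */
	some_early_routines();

	/* -4. the whole range must still be findable, otherwise memory leaked */
	if (!memblock_find_in_range(start, end, end - start, 1))
		pr_warn("early routines consumed loaned memory\n");

	/* -5. restore the original limit */
	memblock_set_current_limit(orig_start, orig_end, enforcing);
}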

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
---
 arch/arm/mm/init.c  |  3 ++-
 arch/arm/mm/mmu.c   |  4 ++--
 arch/arm/mm/nommu.c |  2 +-
 arch/csky/kernel/setup.c|  2 +-
 arch/microblaze/mm/init.c   |  2 +-
 arch/mips/kernel/setup.c|  2 +-
 arch/powerpc/mm/40x_mmu.c   |  6 --
 arch/powerpc/mm/44x_mmu.c   |  2 +-
 arch/powerpc/mm/8xx_mmu.c   |  2 +-
 arch/powerpc/mm/fsl_booke_mmu.c |  5 +++--
 arch/powerpc/mm/hash_utils_64.c |  4 ++--
 arch/powerpc/mm/init_32.c   |  2 +-
 arch/powerpc/mm/pgtable-radix.c |  2 +-
 arch/powerpc/mm/ppc_mmu_32.c|  8 ++--
 arch/powerpc/mm/tlb_nohash.c|  6 --
 arch/unicore32/mm/mmu.c |  2 +-
 arch/x86/kernel/setup.c |  2 +-
 arch/xtensa/mm/init.c   |  2 +-
 include/linux/memblock.h| 10 +++---
 mm/memblock.c   | 23 ++-
 20 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 32e4845..58a4342 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -93,7 +93,8 @@ __tagtable(ATAG_INITRD2, parse_tag_initrd2);
 static void __init find_limits(unsigned long *min, unsigned long *max_low,
   unsigned long *max_high)
 {
-   *max_low = PFN_DOWN(memblock_get_current_limit());
+   memblock_get_current_limit(NULL, max_low);
+   *max_low = PFN_DOWN(*max_low);
*min = PFN_UP(memblock_start_of_DRAM());
*max_high = PFN_DOWN(memblock_end_of_DRAM());
 }
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index f5cc1cc..9025418 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -1240,7 +1240,7 @@ void __init adjust_lowmem_bounds(void)
}
}
 
-   memblock_set_current_limit(memblock_limit);
+   memblock_set_current_limit(0, memblock_limit, false);
 }
 
 static inline void prepare_page_table(void)
@@ -1625,7 +1625,7 @@ void __init paging_init(const struct machine_desc *mdesc)
 
prepare_page_table();
map_lowmem();
-   memblock_set_current_limit(arm_lowmem_limit);
+   memblock_set_current_limit(0, arm_lowmem_limit, false);
dma_contiguous_remap();
early_fixmap_shutdown();
devicemaps_init(mdesc);
diff --git a/arch/arm/mm/nommu.c b/arch/arm/mm/nommu.c
index 7d67c70..721535c 100644
--- a/arch/arm/mm/nommu.c
+++ b/arch/arm/mm/nommu.c
@@ -138,7 +138,7 @@ void __init adjust_lowmem_bounds(void)
adjust_lowmem_bounds_mpu();
end = memblock_end_of_DRAM();
high_memory = __va(end - 1) + 1;
-   memblock_set_current_limit(end);
+   memblock_set_current_limit(0, end, false);
 }
 
 /*
diff --git a/arch/csky/kernel/setup.c b/arch/csky/kernel/setup.c
index dff8b89..e6f88bf 100644
--- a/arch/csky/kernel/setup.c
+++ b/arch/csky/kernel/setup.c
@@ -100,7 +100,7 @@ static void __init csky_memblock_init(void)
 
highend_pfn = max_pfn;
 #endif
-   memblock_set_current_limit(PFN_PHYS(max_low_pfn));
+   memblock_set_current_limit(0, PFN_PHYS(max_low_pfn), false);
 
dma_contiguous_reserve(0);
 
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index b17fd8a..cee99da 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -353,7 +353,7 @@ asmlinkage void __init mmu_init(void)
/* Shortly after that, the entire linear mapping will be available */
/* This will also cause that unflatten device tree will be allocated
 * inside 768MB limit */
-   memblock_set_current_limit(memory_start + lowmem_size - 1);
+   memblock_set_current_limit(0, memory_start + lowmem_size - 1, false);

[PATCHv2 0/7] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-10 Thread Pingfan Liu
Background
The KASLR kernel can be guaranteed to sit inside an unmovable node
after [1]. But if the KASLR kernel is located near the end of the movable
node, the bottom-up allocator may create page tables which cross the
boundary between the unmovable node and the movable node. It is a
probability issue, and two factors affect it: -1. how big the gap is
between the kernel end and the unmovable node's end; -2. how much memory
the system owns.
An alternative way to fix this issue is to increase the gap in
boot/compressed/kaslr*. But in the scenario of PB-level memory, the page
tables will take several MB even with 1 GB pages, and different page
attributes and fragmentation will make things worse. So it is hard to
decide how much the gap should increase.
The following figure shows the defect of the current bottom-up style:
  [startA, endA][startB, "kaslr kernel very close to" endB][startC, endC]

If nodes A and B are unmovable while node C is movable, then
init_mem_mapping() can generate page tables on node C, which stains the
movable node.

This series makes that a certainty instead of a probability problem. It
achieves this by pushing the parsing of the memory hotplug info forward,
ahead of init_mem_mapping().
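
For orientation, a rough before/after sketch of the setup_arch() ordering
this series aims for, simplified from the diffs below (not an exact call
list):

/* before: the direct mapping is built first, SRAT is parsed afterwards */
init_mem_mapping();		/* page tables may land on a movable node */
acpi_boot_table_init();
early_acpi_boot_init();
initmem_init();			/* hotplug info only known from here on */

/* after this series: hotplug info is parsed first */
early_acpi_parse();		/* SRAT parsed, memblock_mark_hotplug() done */
init_mem_mapping();		/* memblock now avoids hotpluggable ranges */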

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Yinghai Lu 
Cc: Tejun Heo 
Cc: Chao Fan 
Cc: Baoquan He 
Cc: Juergen Gross 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
Pingfan Liu (7):
  x86/mm: concentrate the code to memblock allocator enabled
  acpi: change the topo of acpi_table_upgrade()
  mm/memblock: introduce allocation boundary for tracing purpose
  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  x86/mm: set allowed range for memblock allocator
  x86/mm: remove bottom-up allocation style for x86_64
  x86/mm: isolate the bottom-up style to init_32.c

 arch/arm/mm/init.c  |   3 +-
 arch/arm/mm/mmu.c   |   4 +-
 arch/arm/mm/nommu.c |   2 +-
 arch/arm64/kernel/setup.c   |   2 +-
 arch/csky/kernel/setup.c|   2 +-
 arch/microblaze/mm/init.c   |   2 +-
 arch/mips/kernel/setup.c|   2 +-
 arch/powerpc/mm/40x_mmu.c   |   6 +-
 arch/powerpc/mm/44x_mmu.c   |   2 +-
 arch/powerpc/mm/8xx_mmu.c   |   2 +-
 arch/powerpc/mm/fsl_booke_mmu.c |   5 +-
 arch/powerpc/mm/hash_utils_64.c |   4 +-
 arch/powerpc/mm/init_32.c   |   2 +-
 arch/powerpc/mm/pgtable-radix.c |   2 +-
 arch/powerpc/mm/ppc_mmu_32.c|   8 +-
 arch/powerpc/mm/tlb_nohash.c|   6 +-
 arch/unicore32/mm/mmu.c |   2 +-
 arch/x86/kernel/setup.c |  93 ++-
 arch/x86/mm/init.c  | 163 +---
 arch/x86/mm/init_32.c   | 147 
 arch/x86/mm/mm_internal.h   |   8 +-
 arch/xtensa/mm/init.c   |   2 +-
 drivers/acpi/tables.c   |   4 +-
 include/linux/acpi.h|   5 +-
 include/linux/memblock.h|  10 ++-
 mm/memblock.c   |  23 --
 26 files changed, 290 insertions(+), 221 deletions(-)

-- 
2.7.4



Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2019-01-10 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 10:34 PM Michal Hocko  wrote:
>
> On Thu 20-12-18 10:19:34, Michal Hocko wrote:
> > On Thu 20-12-18 15:19:39, Pingfan Liu wrote:
> > > Hi Michal,
> > >
> > > WIth this patch applied on the old one, I got the following message.
> > > Please get it from attachment.
> > [...]
> > > [0.409637] NUMA: Node 1 [mem 0x-0x0009] + [mem 
> > > 0x0010-0x7fff] -> [mem 0x-0x7fff]
> > > [0.419858] NUMA: Node 1 [mem 0x-0x7fff] + [mem 
> > > 0x1-0x47fff] -> [mem 0x-0x47fff]
> > > [0.430356] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
> > > [0.436325] NODE_DATA(0) on node 5
> > > [0.440092] Initmem setup node 0 [mem 
> > > 0x-0x]
> > > [0.447078] node[0] zonelist:
> > > [0.450106] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fff]
> > > [0.456114] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
> > > [0.462064] NODE_DATA(2) on node 5
> > > [0.465852] Initmem setup node 2 [mem 
> > > 0x-0x]
> > > [0.472813] node[2] zonelist:
> > > [0.475846] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
> > > [0.481827] NODE_DATA(3) on node 5
> > > [0.485590] Initmem setup node 3 [mem 
> > > 0x-0x]
> > > [0.492575] node[3] zonelist:
> > > [0.495608] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
> > > [0.501587] NODE_DATA(4) on node 5
> > > [0.505349] Initmem setup node 4 [mem 
> > > 0x-0x]
> > > [0.512334] node[4] zonelist:
> > > [0.515370] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
> > > [0.521384] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
> > > [0.527329] NODE_DATA(6) on node 5
> > > [0.531091] Initmem setup node 6 [mem 
> > > 0x-0x]
> > > [0.538076] node[6] zonelist:
> > > [0.541109] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
> > > [0.547090] NODE_DATA(7) on node 5
> > > [0.550851] Initmem setup node 7 [mem 
> > > 0x-0x]
> > > [0.557836] node[7] zonelist:
> >
> > OK, so it is clear that building zonelists this early is not going to
> > fly. We do not have the complete information yet. I am not sure when do
> > we get that at this moment but I suspect the we either need to move that
> > initialization to a sooner stage or we have to reconsider whether the
> > phase when we build zonelists really needs to consider only online numa
> > nodes.
> >
> > [...]
> > > [1.067658] percpu: Embedded 46 pages/cpu @(ptrval) s151552 
> > > r8192 d28672 u262144
> > > [1.075692] node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal
> > > [1.081376] node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA
> >
> > I hope to get to this before I leave for christmas vacation, if not I
> > will stare into it after then.
>
> I am sorry but I didn't get to this sooner. But I've got another idea. I
> concluded that the whole dance is simply bogus and we should treat
> memory less nodes, well, as nodes with no memory ranges rather than
> special case them. Could you give the following a spin please?
>
> ---
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..0e79445cfd85 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
>
> node_data[nid] = nd;
> memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -   node_set_online(nid);
>  }
>
>  /**
> @@ -535,6 +533,7 @@ static int __init numa_register_memblks(struct 
> numa_meminfo *mi)
> /* Account for nodes with cpus and no memory */
> node_possible_map = numa_nodes_parsed;
> numa_nodemask_from_meminfo(&node_possible_map, mi);
> +   pr_info("parsed=%*pbl, possible=%*pbl\n", 
> nodemask_pr_args(&numa_nodes_parsed), nodemask_pr_args(&node_possible_map));
> if (WARN_ON(nodes_empty(node_possible_map)))
> return -EINVAL;
>
> @@ -570,7 +569,7 @@ static int __init numa_register_memblks(struct 
> numa_meminfo *mi)
> return -EINVAL;
>
> /* Finally register nodes. */
> -   for_each_node_mask(nid, node_possible_map) {
> +   for_each_node_mask(nid, 

Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-10 Thread Pingfan Liu
On Wed, Jan 9, 2019 at 10:25 PM Baoquan He  wrote:
>
> On 01/08/19 at 05:48pm, Mike Rapoport wrote:
> > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > Hi Mike,
> > >
> > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > I'm not thrilled by duplicating this code (yet again).
> > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode 
> > > > to
> > > > allocate [0, kernel_start) unconditionally.
> > > > I'd just replace you first patch in v3 [2] with something like:
> > >
> > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > While reserve_crashkernel() is called after initmem_init(), it's not
> > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > to find region bottom up for crashkernel reservation, no matter where
> > > kernel is loaded, better call __memblock_find_range_bottom_up().
> > >
> > > Create a wrapper to do the necessary handling, then call
> > > __memblock_find_range_bottom_up() directly, looks better.
> >
> > What bothers me is 'the necessary handling' which is already done in
> > several places in memblock in a similar, but yet slightly different way.
>
> The page aligning for start and the mirror flag setting, I suppose.
> >
> > memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > allocating from the specified node and does not retry when it falls back to
> > any node. And memblock_alloc_internal() has yet another set of fallbacks.
>
> Get what you mean, seems they are trying to allocate within mirrorred
> memory region, if fail, try the non-mirrorred region. If kernel data
> allocation failed, no need to care about if it's movable or not, it need
> to live firstly. For the bottom-up allocation wrapper, maybe we need do
> like this too?
>
> >
> > So what should be the necessary handling in the wrapper for
> > __memblock_find_range_bottom_up() ?
> >
> > BTW, even without any memblock modifications, retrying allocation in
> > reserve_crashkerenel() for different ranges, like the proposal at [1] would
> > also work, wouldn't it?
>
> Yes, it also looks good. This patch only calls once, seems a simpler
> line adding.
>
> In fact, below one and this patch, both is fine to me, as long as it
> fixes the problem customers are complaining about.
>
It seems there is a divergence of opinion. Maybe it is easier to fix this
bug with Dave Young's patch, so I will repost it.

Thanks and regards,
Pingfan
> >
> > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
>
> Thanks
> Baoquan


Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-10 Thread Pingfan Liu
On Thu, Jan 10, 2019 at 3:57 PM Mike Rapoport  wrote:
>
> Hi Pingfan,
>
> On Wed, Jan 09, 2019 at 09:02:41PM +0800, Pingfan Liu wrote:
> > On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport  wrote:
> > >
> > > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > > Hi Mike,
> > > >
> > > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > > I'm not thrilled by duplicating this code (yet again).
> > > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up 
> > > > > mode to
> > > > > allocate [0, kernel_start) unconditionally.
> > > > > I'd just replace you first patch in v3 [2] with something like:
> > > >
> > > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > > While reserve_crashkernel() is called after initmem_init(), it's not
> > > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > > to find region bottom up for crashkernel reservation, no matter where
> > > > kernel is loaded, better call __memblock_find_range_bottom_up().
> > > >
> > > > Create a wrapper to do the necessary handling, then call
> > > > __memblock_find_range_bottom_up() directly, looks better.
> > >
> > > What bothers me is 'the necessary handling' which is already done in
> > > several places in memblock in a similar, but yet slightly different way.
> > >
> > > memblock_find_in_range() and memblock_phys_alloc_nid() retry with 
> > > different
> > > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > > allocating from the specified node and does not retry when it falls back 
> > > to
> > > any node. And memblock_alloc_internal() has yet another set of fallbacks.
> > >
> > > So what should be the necessary handling in the wrapper for
> > > __memblock_find_range_bottom_up() ?
> > >
> > Well, it is a hard choice.
> > > BTW, even without any memblock modifications, retrying allocation in
> > > reserve_crashkerenel() for different ranges, like the proposal at [1] 
> > > would
> > > also work, wouldn't it?
> > >
Yes, it can work. Then is it worth exposing the bottom-up allocation
style for purposes other than hot-removable memory?
>
> Some architectures use bottom-up as a "compatability" mode with bootmem.
> And, I believe, powerpc and s390 use bottom-up to make some of the
> allocations close to the kernel.
>
Ok, got it. Thanks.

Best regards,
Pingfan

> > Thanks,
> > Pingfan
> > > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > >
> > > > Thanks
> > > > Baoquan
> > > >
> > > > >
> > > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > > index 7df468c..d1b30b9 100644
> > > > > --- a/mm/memblock.c
> > > > > +++ b/mm/memblock.c
> > > > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock 
> > > > > memblock_find_in_range_node(phys_addr_t size,
> > > > >  * try bottom-up allocation only when bottom-up mode
> > > > >  * is set and @end is above the kernel image.
> > > > >  */
> > > > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > > > -   phys_addr_t bottom_up_start;
> > > > > -
> > > > > -   /* make sure we will allocate above the kernel */
> > > > > -   bottom_up_start = max(start, kernel_end);
> > > > > -
> > > > > +   if (memblock_bottom_up()) {
> > > > > /* ok, try bottom-up allocation first */
> > > > > -   ret = __memblock_find_range_bottom_up(bottom_up_start, 
> > > > > end,
> > > > > +   ret = __memblock_find_range_bottom_up(start, end,
> > > > >   size, align, nid, 
> > > > > flags);
> > > > > if (ret)
> > > > > return ret;
> > > > >
> > > > > /*
> > > > > -* we always limit bottom-up allocation above the kernel,
> > > > > -* but top-down allocation doesn't have the limit, so
> > > > > -* retrying top-down allocation may succeed when bottom-up
> > > > > -* allocation failed.
> > > > > -*
> > > > >  * bottom-up allocation is expected to be fail very 
> > > > > rarely,
> > > > >  * so we use WARN_ONCE() here to see the stack trace if
> > > > >  * fail happens.
> > > > >
> > > > > [1] 
> > > > > https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelf...@gmail.com/
> > > > > [2] 
> > > > > https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelf...@gmail.com/
> > > > >
> > > > > > +
> > > > > > + return ret;
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * __memblock_find_range_top_down - find free area utility, in 
> > > > > > top-down
> > > > > >   * @start: start of candidate range
> > > > > > --
> > > > > > 2.7.4
> > > > > >
> > > > >
> > > > > --
> > > > > Sincerely yours,
> > > > > Mike.
> > > > >
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> >
>
> --
> Sincerely yours,
> Mike.
>


Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-09 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport  wrote:
>
> On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > Hi Mike,
> >
> > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > I'm not thrilled by duplicating this code (yet again).
> > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > > allocate [0, kernel_start) unconditionally.
> > > I'd just replace you first patch in v3 [2] with something like:
> >
> > In initmem_init(), we will restore the top-down allocation style anyway.
> > While reserve_crashkernel() is called after initmem_init(), it's not
> > appropriate to adjust memblock_find_in_range_node(), and we really want
> > to find region bottom up for crashkernel reservation, no matter where
> > kernel is loaded, better call __memblock_find_range_bottom_up().
> >
> > Create a wrapper to do the necessary handling, then call
> > __memblock_find_range_bottom_up() directly, looks better.
>
> What bothers me is 'the necessary handling' which is already done in
> several places in memblock in a similar, but yet slightly different way.
>
> memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> allocating from the specified node and does not retry when it falls back to
> any node. And memblock_alloc_internal() has yet another set of fallbacks.
>
> So what should be the necessary handling in the wrapper for
> __memblock_find_range_bottom_up() ?
>
Well, it is a hard choice.
> BTW, even without any memblock modifications, retrying allocation in
> reserve_crashkernel() for different ranges, like the proposal at [1] would
> also work, wouldn't it?
>
Yes, it can work. Then is it worth exposing the bottom-up allocation
style beyond the hot-movable purpose?
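To make that alternative concrete, the retry approach in [1] would look
roughly like the sketch below (untested, just to illustrate the idea;
CRASH_ALIGN, CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX and crash_size are
the existing macros/locals in reserve_crashkernel()):

	/* keep the old behaviour first: try below 896MB */
	crash_base = memblock_find_in_range(CRASH_ALIGN, CRASH_ADDR_LOW_MAX,
					    crash_size, CRASH_ALIGN);
	/* then fall back to [896MB, 4G) */
	if (!crash_base)
		crash_base = memblock_find_in_range(CRASH_ADDR_LOW_MAX,
						    1ULL << 32,
						    crash_size, CRASH_ALIGN);
	/* finally try above 4G */
	if (!crash_base)
		crash_base = memblock_find_in_range(1ULL << 32,
						    CRASH_ADDR_HIGH_MAX,
						    crash_size, CRASH_ALIGN);

Either way the result for the user is the same: crashkernel=X no longer
depends on where KASLR happens to place the kernel.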

Thanks,
Pingfan
> [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
>
> > Thanks
> > Baoquan
> >
> > >
> > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > index 7df468c..d1b30b9 100644
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock 
> > > memblock_find_in_range_node(phys_addr_t size,
> > >  * try bottom-up allocation only when bottom-up mode
> > >  * is set and @end is above the kernel image.
> > >  */
> > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > -   phys_addr_t bottom_up_start;
> > > -
> > > -   /* make sure we will allocate above the kernel */
> > > -   bottom_up_start = max(start, kernel_end);
> > > -
> > > +   if (memblock_bottom_up()) {
> > > /* ok, try bottom-up allocation first */
> > > -   ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > +   ret = __memblock_find_range_bottom_up(start, end,
> > >   size, align, nid, 
> > > flags);
> > > if (ret)
> > > return ret;
> > >
> > > /*
> > > -* we always limit bottom-up allocation above the kernel,
> > > -* but top-down allocation doesn't have the limit, so
> > > -* retrying top-down allocation may succeed when bottom-up
> > > -* allocation failed.
> > > -*
> > >  * bottom-up allocation is expected to be fail very rarely,
> > >  * so we use WARN_ONCE() here to see the stack trace if
> > >  * fail happens.
> > >
> > > [1] 
> > > https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelf...@gmail.com/
> > > [2] 
> > > https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelf...@gmail.com/
> > >
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > > >   * @start: start of candidate range
> > > > --
> > > > 2.7.4
> > > >
> > >
> > > --
> > > Sincerely yours,
> > > Mike.
> > >
> >
>
> --
> Sincerely yours,
> Mike.
>


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2019-01-08 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 10:34 PM Michal Hocko  wrote:
>
> On Thu 20-12-18 10:19:34, Michal Hocko wrote:
> > On Thu 20-12-18 15:19:39, Pingfan Liu wrote:
> > > Hi Michal,
> > >
> > > WIth this patch applied on the old one, I got the following message.
> > > Please get it from attachment.
> > [...]
> > > [0.409637] NUMA: Node 1 [mem 0x-0x0009] + [mem 
> > > 0x0010-0x7fff] -> [mem 0x-0x7fff]
> > > [0.419858] NUMA: Node 1 [mem 0x-0x7fff] + [mem 
> > > 0x1-0x47fff] -> [mem 0x-0x47fff]
> > > [0.430356] NODE_DATA(0) allocated [mem 0x87efd4000-0x87effefff]
> > > [0.436325] NODE_DATA(0) on node 5
> > > [0.440092] Initmem setup node 0 [mem 
> > > 0x-0x]
> > > [0.447078] node[0] zonelist:
> > > [0.450106] NODE_DATA(1) allocated [mem 0x47ffd5000-0x47fff]
> > > [0.456114] NODE_DATA(2) allocated [mem 0x87efa9000-0x87efd3fff]
> > > [0.462064] NODE_DATA(2) on node 5
> > > [0.465852] Initmem setup node 2 [mem 
> > > 0x-0x]
> > > [0.472813] node[2] zonelist:
> > > [0.475846] NODE_DATA(3) allocated [mem 0x87ef7e000-0x87efa8fff]
> > > [0.481827] NODE_DATA(3) on node 5
> > > [0.485590] Initmem setup node 3 [mem 
> > > 0x-0x]
> > > [0.492575] node[3] zonelist:
> > > [0.495608] NODE_DATA(4) allocated [mem 0x87ef53000-0x87ef7dfff]
> > > [0.501587] NODE_DATA(4) on node 5
> > > [0.505349] Initmem setup node 4 [mem 
> > > 0x-0x]
> > > [0.512334] node[4] zonelist:
> > > [0.515370] NODE_DATA(5) allocated [mem 0x87ef28000-0x87ef52fff]
> > > [0.521384] NODE_DATA(6) allocated [mem 0x87eefd000-0x87ef27fff]
> > > [0.527329] NODE_DATA(6) on node 5
> > > [0.531091] Initmem setup node 6 [mem 
> > > 0x-0x]
> > > [0.538076] node[6] zonelist:
> > > [0.541109] NODE_DATA(7) allocated [mem 0x87eed2000-0x87eefcfff]
> > > [0.547090] NODE_DATA(7) on node 5
> > > [0.550851] Initmem setup node 7 [mem 
> > > 0x-0x]
> > > [0.557836] node[7] zonelist:
> >
> > OK, so it is clear that building zonelists this early is not going to
> > fly. We do not have the complete information yet. I am not sure when do
> > we get that at this moment but I suspect the we either need to move that
> > initialization to a sooner stage or we have to reconsider whether the
> > phase when we build zonelists really needs to consider only online numa
> > nodes.
> >
> > [...]
> > > [1.067658] percpu: Embedded 46 pages/cpu @(ptrval) s151552 
> > > r8192 d28672 u262144
> > > [1.075692] node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal
> > > [1.081376] node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA
> >
> > I hope to get to this before I leave for christmas vacation, if not I
> > will stare into it after then.
>
> I am sorry but I didn't get to this sooner. But I've got another idea. I
> concluded that the whole dance is simply bogus and we should treat
> memory less nodes, well, as nodes with no memory ranges rather than
> special case them. Could you give the following a spin please?
>

Sure, I have queued a loan request for the remote machine; it will take some time.

Regards,
Pingfan
> ---
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..0e79445cfd85 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -216,8 +216,6 @@ static void __init alloc_node_data(int nid)
>
> node_data[nid] = nd;
> memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> -
> -   node_set_online(nid);
>  }
>
>  /**
> @@ -535,6 +533,7 @@ static int __init numa_register_memblks(struct 
> numa_meminfo *mi)
> /* Account for nodes with cpus and no memory */
> node_possible_map = numa_nodes_parsed;
> numa_nodemask_from_meminfo(_possible_map, mi);
> +   pr_info("parsed=%*pbl, possible=%*pbl\n", 
> nodemask_pr_args(_nodes_parsed), nodemask_pr_args(_possible_map));
> if (WARN_ON(nodes_empty(node_possible_map)))
> return -EINVAL;
>
> @@ -570,7 +569,7 @@ static int __init numa_register_memblks(struct 
> numa_meminfo *mi)
> return -EINVAL;
>
> /* Finally register nodes. */
> -

Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64

2019-01-08 Thread Pingfan Liu
On Wed, Jan 9, 2019 at 1:33 AM Dave Hansen  wrote:
>
> On 1/7/19 10:13 PM, Pingfan Liu wrote:
> > On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen  wrote:
> >> Why is this 0x10 open-coded?  Why is this needed *now*?
> >>
> >
> > Memory under 1MB should be used by BIOS. For x86_64, after
> > e820__memblock_setup(), the memblock allocator has already been ready
> > to work. But there are two factors to in order to
> > set_alloc_range(0x10, end). The major one is to be compatible with
> > x86_32, please refer to alloc_low_pages->memblock_find_in_range() uses
> > [min_pfn_mapped, max_pfn_mapped] to limit the range, which is ready to
> > be allocated from. The minor one is to prevent unexpected allocation
> > from memblock allocator through allow_low_pages() at very early stage.
>
> Wow, that's a ton of critical information which was neither commented
> upon or referenced in the changelog.  Can you fix this up in the next
> version, please?

Sure.

Thanks,
Pingfan


Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-08 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 6:06 PM Chao Fan  wrote:
>
> On Mon, Jan 07, 2019 at 04:24:41PM +0800, Pingfan Liu wrote:
> >Background about the defect of the current bottom-up allocation style, take
> >the following scenario:
> >  |  unmovable node | movable node   |
> > | kaslr-kernel |subtree of pgtable for phy<->virt |
> >
> >Although kaslr-kernel can avoid to stain the movable node. But the
> >pgtable can still stain the movable node. That is a probability problem,
> >with low probability, but still exist. This patch tries to eliminate the
> >probability. With the previous patch, at the point of init_mem_mapping(),
> >memblock allocator can work with the knowledge of acpi memory hotmovable
> >info, and avoid to stain the movable node. As a result,
> >memory_map_bottom_up() is not needed any more.
> >
>
> Hi Pingfan,
>
> Tang Chen ever tried to do this before adding 'movable_node':
> commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> Author: Tang Chen 
> Date:   Fri Feb 22 16:33:44 2013 -0800
>
> acpi, memory-hotplug: parse SRAT before memblock is ready
>
> Then, Lu Yinghai tried to do the similar job, you can see:
> https://lwn.net/Articles/554854/
> for more information. Hope that can help you.
>
Thanks. It is a long thread; as I understand it, Tejun was concerned
that the early parsing of ACPI consumes memory from the memblock
allocator. If that is the concern, it should not happen in my series.
Cc Tejun and Yinghai.

Regards,
Pingfan
> Thanks,
> Chao Fan
>
> >
> >Cc: Thomas Gleixner 
> >Cc: Ingo Molnar 
> >Cc: Borislav Petkov 
> >Cc: "H. Peter Anvin" 
> >Cc: Dave Hansen 
> >Cc: Andy Lutomirski 
> >Cc: Peter Zijlstra 
> >Cc: "Rafael J. Wysocki" 
> >Cc: Len Brown 
> >Cc: linux-kernel@vger.kernel.org
> >
> >Pingfan Liu (4):
> >  acpi: change the topo of acpi_table_upgrade()
> >  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
> >  x86/mm: set allowed range for memblock allocator
> >  x86/mm: remove bottom-up allocation style for x86_64
> >
> > arch/arm64/kernel/setup.c |   2 +-
> > arch/x86/kernel/setup.c   |  17 -
> > arch/x86/mm/init.c| 154 
> > +++---
> > arch/x86/mm/init_32.c | 123 
> > arch/x86/mm/mm_internal.h |   7 +++
> > drivers/acpi/tables.c |   4 +-
> > include/linux/acpi.h  |   5 +-
> > 7 files changed, 172 insertions(+), 140 deletions(-)
> >
> >--
> >2.7.4
> >
> >
> >
>
>


Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()

2019-01-07 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 1:11 AM Dave Hansen  wrote:
>
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > At present, memblock bottom-up allocation can help us against stamping over
> > movable node in very high probability.
>
> Is this what you are fixing?  Making a "high probability", a certainty?
>  Is this the problem?
>

Yes, as explained in detail in my reply to another mail.
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index acbcd62..df4132c 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, 
> > unsigned long v, void *p)
> >   return 0;
> >  }
> >
> > +/* only need the effect of acpi_numa_memory_affinity_init()
> > + * ->memblock_mark_hotplug()
> > + */
>
> CodingStyle, please.
>

Will fix.
> > +static int early_detect_acpi_memhotplug(void)
> > +{
> > +#ifdef CONFIG_ACPI_NUMA
> > + acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
>
> This adds a new, early, call to acpi_table_upgrade(), and presumably all
> the following functions.  However, it does not remove any of the later
> calls.  How do they interact with each other now that they are
> presumably called twice?
>

ACPI is a big subsystem and I only hurried through these functions. This
group does not seem to allocate extra memory and works on static data, so
if it is called twice, the second call just overwrites the effect of the
previous one. The only issue is that some info is printed twice. I will
spend more time on this for the next version.
> > + acpi_table_init();
> > + acpi_numa_init();
> > + acpi_tb_terminate();
> > +#endif
> > + return 0;
> > +}
>
> Why does this return an 'int' that is unconsumed by its lone caller?
>

The return value has no special purpose; it is just a habit.
> There seems to be a lack of comments on this newly-added code.
>
> >  /*
> >   * Determine if we were loaded by an EFI loader.  If so, then we have also 
> > been
> >   * passed the efi memmap, systab, etc., so we should use these data 
> > structures
> > @@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
> >   trim_platform_memory_ranges();
> >   trim_low_memory_range();
> >
> > + early_detect_acpi_memhotplug();
>
> Comments, please.  Why is this call here, specifically?  What is it doing?
>
It parses the ACPI SRAT to extract the memory hot-movable info and feeds
that info to the memblock allocator. The exact effect is:
acpi_numa_memory_affinity_init() -> memblock_mark_hotplug(). So later,
when the memblock allocator picks a range in __next_mem_range(), there is
a condition check that skips the movable node:
if (movable_node_is_enabled() && memblock_is_hotpluggable(m)) continue;
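Put in simplified form (this is just the shape of the check, not the
verbatim code in mm/memblock.c):

	/* sketch of the walk over the memblock.memory regions */
	for (; idx < type->cnt; idx++) {
		struct memblock_region *m = &type->regions[idx];

		/* regions marked by memblock_mark_hotplug() carry
		 * MEMBLOCK_HOTPLUG; skip them when movable_node is on */
		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
			continue;

		/* ... otherwise the region is a candidate range ... */
	}

So once the SRAT is parsed early enough for memblock_mark_hotplug() to
run, every later memblock allocation automatically stays off the
hot-movable memory.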

Thanks,
Pingfan


Re: [RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64

2019-01-07 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 1:42 AM Dave Hansen  wrote:
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > There are two acheivements by this patch.
> > -1st. keep the subtree of pgtable away from movable node.
> > Background about the defect of the current bottom-up allocation style, take
> > the following scenario:
> >   |  unmovable node | movable node   |
> >  | kaslr-kernel |subtree of pgtable for phy<->virt |
>
>
>
> > Although kaslr-kernel can avoid to stain the movable node. [1] But the
> > pgtable can still stain the movable node. That is a probability problem,
> > with low probability, but still exist. This patch tries to eliminate the
> > probability. With the previous patch, at the point of init_mem_mapping(),
> > memblock allocator can work with the knowledge of acpi memory hotmovable
> > info, and avoid to stain the movable node. As a result,
> > memory_map_bottom_up() is not needed any more.
> >
> > -2nd. simplify the logic of memory_map_top_down()
> > Thanks to the help of early_make_pgtable(), x86_64 can directly set up the
> > subtree of pgtable at any place, hence the careful iteration in
> > memory_map_top_down() can be discard.
>
> >  void __init init_mem_mapping(void)
> >  {
> >   unsigned long end;
> > @@ -663,6 +540,7 @@ void __init init_mem_mapping(void)
> >
> >  #ifdef CONFIG_X86_64
> >   end = max_pfn << PAGE_SHIFT;
> > + set_alloc_range(0x10, end);
> >  #else
>
> Why is this 0x10 open-coded?  Why is this needed *now*?
>

Memory under 1MB should be left to the BIOS. For x86_64, after
e820__memblock_setup(), the memblock allocator is already ready to work.
But there are two reasons for calling set_alloc_range(0x100000, end). The
major one is compatibility with x86_32: alloc_low_pages() ->
memblock_find_in_range() uses [min_pfn_mapped, max_pfn_mapped] to limit
the range which is ready to be allocated from. The minor one is to prevent
unexpected allocations from the memblock allocator through
alloc_low_pages() at a very early stage.
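As far as I read the series, on x86_64 that boils down to roughly the
following (a sketch of the intent only; set_alloc_range() is the helper
added in patch 3/4 and 0x100000 is the 1MB mark):

	/* 64-bit: the whole [1MB, max_pfn) range can serve page-table
	 * pages right away, early_make_pgtable() handles the rest */
	set_alloc_range(0x100000, max_pfn << PAGE_SHIFT);

while on x86_32 the allowed window keeps being advanced via
set_alloc_range(min_pfn_mapped, max_pfn_mapped) each time
memory_map_top_down()/memory_map_bottom_up() maps another chunk.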
>
> >   /*
> >* If the allocation is in bottom-up direction, we setup direct 
> > mapping
> >* in bottom-up, otherwise we setup direct mapping in top-down.
> > @@ -692,13 +577,6 @@ void __init init_mem_mapping(void)
> >   } else {
> >   memory_map_top_down(ISA_END_ADDRESS, end);
> >   }
> > -
> > -#ifdef CONFIG_X86_64
> > - if (max_pfn > max_low_pfn) {
> > - /* can we preseve max_low_pfn ?*/
> > - max_low_pfn = max_pfn;
> > - }
> > -#else
> >   early_ioremap_page_table_range_init();
> >  #endif
> >
> > diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> > index 85c94f9..ecf7243 100644
> > --- a/arch/x86/mm/init_32.c
> > +++ b/arch/x86/mm/init_32.c
> > @@ -58,6 +58,8 @@ unsigned long highstart_pfn, highend_pfn;
> >
> >  bool __read_mostly __vmalloc_start_set = false;
> >
> > +static unsigned long min_pfn_mapped;
> > +
> >  /*
> >   * Creates a middle page table and puts a pointer to it in the
> >   * given global directory entry. This only returns the gd entry
> > @@ -516,6 +518,127 @@ void __init native_pagetable_init(void)
> >   paging_init();
> >  }
> >
> > +static unsigned long __init get_new_step_size(unsigned long step_size)
> > +{
> > + /*
> > +  * Initial mapped size is PMD_SIZE (2M).
> > +  * We can not set step_size to be PUD_SIZE (1G) yet.
> > +  * In worse case, when we cross the 1G boundary, and
> > +  * PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
> > +  * to map 1G range with PTE. Hence we use one less than the
> > +  * difference of page table level shifts.
> > +  *
> > +  * Don't need to worry about overflow in the top-down case, on 32bit,
> > +  * when step_size is 0, round_down() returns 0 for start, and that
> > +  * turns it into 0x1ULL.
> > +  * In the bottom-up case, round_up(x, 0) returns 0 though too, which
> > +  * needs to be taken into consideration by the code below.
> > +  */
> > + return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
> > +}
> > +
> > +/**
> > + * memory_map_top_down - Map [map_start, map_end) top down
> > + * @map_start: start address of the target memory range
> > + * @map_end: end address of the target memory range
> > + *
> > + * This function will setup direct mapping for memory range
> > + * [map_start, ma

Re: [RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-07 Thread Pingfan Liu
On Tue, Jan 8, 2019 at 1:04 AM Dave Hansen  wrote:
>
> On 1/7/19 12:24 AM, Pingfan Liu wrote:
> > Background about the defect of the current bottom-up allocation style, take
> > the following scenario:
> >   |  unmovable node | movable node   |
> >  | kaslr-kernel |subtree of pgtable for phy<->virt |
> >
> > Although kaslr-kernel can avoid to stain the movable node. But the
> > pgtable can still stain the movable node. That is a probability problem,
> > with low probability, but still exist. This patch tries to eliminate the
> > probability. With the previous patch, at the point of init_mem_mapping(),
> > memblock allocator can work with the knowledge of acpi memory hotmovable
> > info, and avoid to stain the movable node. As a result,
> > memory_map_bottom_up() is not needed any more.
>
> I'm really missing the basic problem statement.  What's the problem this
> is fixing?  What is the end-user-visible impact of this problem?
>
Sorry for the misaligned figure. It should be
   |  kaslr-kernel|subtree of pgtable for phy<->virt|
  |--- boundary between unmovable node and movable node
Here, the kaslr kernel can be guaranteed to sit inside the unmovable node
after the patch: https://lore.kernel.org/patchwork/patch/1029376/. But if
the kaslr kernel is located near the end of the unmovable node, then the
bottom-up allocator may create a pagetable which crosses the boundary
between the unmovable node and the movable node. It is a probability
issue; the factors include: 1. how big the gap is between the kernel end
and the unmovable node's end; 2. how much memory the system owns.
An alternative way to fix this issue is to increase the gap in
boot/compressed/kaslr*. But in the scenario of PB-level memory, the
pagetable will take several MB even when using 1GB pages, so it is hard
to decide how much the gap should increase.
In a word, this series replaces the probability with certainty, by
allocating the pagetable on the unmovable node instead of following the
kernel end.
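Just to give a rough sense of the scale (back-of-the-envelope numbers,
not measurements): mapping 1PB of RAM with 1GB pages already needs
2^50 / 2^30 = 2^20 PUD entries, i.e. about 8MB of PUD pages alone, plus
the P4D/PGD levels above them; with 2MB pages the page tables grow to
roughly 4GB. That is why a fixed gap after the kernel image is so hard to
size, while allocating the tables on the unmovable node side-steps the
question entirely.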

> To make memory hot-remove work, we want as much memory as possible to he
> hot-removable, which is basically what movable nodes are used for.  But,
> it sounds like, maybe, that KASLR can place the kernel image inside the
> movable node.  This is somehow related to the bottom-up allocation style
> currently in use.

Yes, currently the kaslr kernel can stain the movable node, but it will
soon no longer do so, after the patch:
https://lore.kernel.org/patchwork/patch/1029376/

Thanks,
Pingfan


Re: [RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()

2019-01-07 Thread Pingfan Liu
On Mon, Jan 7, 2019 at 4:25 PM Pingfan Liu  wrote:
>
> At present, memblock bottom-up allocation can help us against stamping over
> movable node in very high probability. But if the hotplug info has already
> been parsed, the memblock allocator can step around the movable node by
> itself. This patch pushes the parsing step forward, just ahead of where,
> the memblock allocator can work. Later in this series, the bottom-up
> allocation style can be removed on x86_64.
>
> Signed-off-by: Pingfan Liu 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/setup.c | 15 +++
>  include/linux/acpi.h|  1 +
>  2 files changed, 16 insertions(+)
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index acbcd62..df4132c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned 
> long v, void *p)
> return 0;
>  }
>
> +/* only need the effect of acpi_numa_memory_affinity_init()
> + * ->memblock_mark_hotplug()
> + */
> +static int early_detect_acpi_memhotplug(void)
> +{
> +#ifdef CONFIG_ACPI_NUMA
> +   acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
> +   acpi_table_init();
> +   acpi_numa_init();

As this is the RFC version, I have not suppressed this extra printk info
yet. I should do it in the next version.
> +   acpi_tb_terminate();
> +#endif
> +   return 0;
> +}
> +
>  /*
>   * Determine if we were loaded by an EFI loader.  If so, then we have also 
> been
>   * passed the efi memmap, systab, etc., so we should use these data 
> structures
> @@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
> trim_platform_memory_ranges();
> trim_low_memory_range();
>
> +   early_detect_acpi_memhotplug();
> init_mem_mapping();
>
> idt_setup_early_pf();
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 44dcbba..1b69044 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -235,6 +235,7 @@ int acpi_mps_check (void);
>  int acpi_numa_init (void);
>
>  int acpi_table_init (void);
> +void acpi_tb_terminate(void);
>  int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
>  int __init acpi_table_parse_entries(char *id, unsigned long table_size,
>   int entry_id,
> --
> 2.7.4
>


Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2019-01-07 Thread Pingfan Liu
I sent out a series, [RFC PATCH 0/4] x86_64/mm: remove bottom-up
allocation style by pushing forward the parsing of mem hotplug info
(https://lore.kernel.org/lkml/1546849485-27933-1-git-send-email-kernelf...@gmail.com/T/#t).
Please comment if you are interested.

Thanks,
Pingfan

On Fri, Jan 4, 2019 at 2:47 AM Tejun Heo  wrote:
>
> Hello,
>
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But this issues are not necessarily related to the
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR would put the kernel near the end of node0.
> >
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start)?
> >
> > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > kernel_end.
>
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.
>
> Thanks.
>
> --
> tejun


[RFC PATCH 0/4] x86_64/mm: remove bottom-up allocation style by pushing forward the parsing of mem hotplug info

2019-01-07 Thread Pingfan Liu
As background on the defect of the current bottom-up allocation style,
take the following scenario:
  |  unmovable node | movable node   |
 | kaslr-kernel |subtree of pgtable for phy<->virt |

Although the kaslr kernel can avoid staining the movable node, the
pgtable can still stain it. That is a probability problem: the
probability is low, but it still exists. This series tries to eliminate
that probability. With the preceding patches, at the point of
init_mem_mapping(), the memblock allocator can work with the knowledge of
the ACPI memory hot-movable info and avoid staining the movable node. As
a result, memory_map_bottom_up() is not needed any more.


Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-kernel@vger.kernel.org

Pingfan Liu (4):
  acpi: change the topo of acpi_table_upgrade()
  x86/setup: parse acpi to get hotplug info before init_mem_mapping()
  x86/mm: set allowed range for memblock allocator
  x86/mm: remove bottom-up allocation style for x86_64

 arch/arm64/kernel/setup.c |   2 +-
 arch/x86/kernel/setup.c   |  17 -
 arch/x86/mm/init.c| 154 +++---
 arch/x86/mm/init_32.c | 123 
 arch/x86/mm/mm_internal.h |   7 +++
 drivers/acpi/tables.c |   4 +-
 include/linux/acpi.h  |   5 +-
 7 files changed, 172 insertions(+), 140 deletions(-)

-- 
2.7.4



[RFC PATCH 4/4] x86/mm: remove bottom-up allocation style for x86_64

2019-01-07 Thread Pingfan Liu
This patch brings two achievements.
-1st. Keep the subtree of pgtable away from the movable node.
As background on the defect of the current bottom-up allocation style,
take the following scenario:
  |  unmovable node | movable node   |
 | kaslr-kernel |subtree of pgtable for phy<->virt |

Although the kaslr kernel can avoid staining the movable node [1], the
pgtable can still stain it. That is a probability problem: the
probability is low, but it still exists. This patch tries to eliminate
that probability. With the previous patch, at the point of
init_mem_mapping(), the memblock allocator can work with the knowledge of
the ACPI memory hot-movable info and avoid staining the movable node. As
a result, memory_map_bottom_up() is not needed any more.

-2nd. Simplify the logic of memory_map_top_down().
Thanks to the help of early_make_pgtable(), x86_64 can directly set up the
subtree of pgtable at any place, hence the careful iteration in
memory_map_top_down() can be discarded.

[1]: https://lore.kernel.org/patchwork/patch/1029376/
Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-kernel@vger.kernel.org

---
 arch/x86/mm/init.c| 140 +++---
 arch/x86/mm/init_32.c | 123 
 arch/x86/mm/mm_internal.h |   7 +++
 3 files changed, 139 insertions(+), 131 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 84baa66..4e0286b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -72,8 +72,6 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
-
 static bool __initdata can_use_brk_pgt = true;
 
 static unsigned long min_pfn_allowed;
@@ -504,7 +502,7 @@ unsigned long __ref init_memory_mapping(unsigned long start,
  * That range would have hole in the middle or ends, and only ram parts
  * will be mapped in init_range_memory_mapping().
  */
-static unsigned long __init init_range_memory_mapping(
+unsigned long __init init_range_memory_mapping(
   unsigned long r_start,
   unsigned long r_end)
 {
@@ -532,127 +530,6 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
 }
 
-static unsigned long __init get_new_step_size(unsigned long step_size)
-{
-   /*
-* Initial mapped size is PMD_SIZE (2M).
-* We can not set step_size to be PUD_SIZE (1G) yet.
-* In worse case, when we cross the 1G boundary, and
-* PG_LEVEL_2M is not set, we will need 1+1+512 pages (2M + 8k)
-* to map 1G range with PTE. Hence we use one less than the
-* difference of page table level shifts.
-*
-* Don't need to worry about overflow in the top-down case, on 32bit,
-* when step_size is 0, round_down() returns 0 for start, and that
-* turns it into 0x1ULL.
-* In the bottom-up case, round_up(x, 0) returns 0 though too, which
-* needs to be taken into consideration by the code below.
-*/
-   return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
-}
-
-/**
- * memory_map_top_down - Map [map_start, map_end) top down
- * @map_start: start address of the target memory range
- * @map_end: end address of the target memory range
- *
- * This function will setup direct mapping for memory range
- * [map_start, map_end) in top-down. That said, the page tables
- * will be allocated at the end of the memory, and we map the
- * memory in top-down.
- */
-static void __init memory_map_top_down(unsigned long map_start,
-  unsigned long map_end)
-{
-   unsigned long real_end, start, last_start;
-   unsigned long step_size;
-   unsigned long addr;
-   unsigned long mapped_ram_size = 0;
-
-   /* xen has big range in reserved near end of ram, skip it at first.*/
-   addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
-   real_end = addr + PMD_SIZE;
-
-   /* step_size need to be small so pgt_buf from BRK could cover it */
-   step_size = PMD_SIZE;
-   max_pfn_mapped = 0; /* will get exact value next */
-   min_pfn_mapped = real_end >> PAGE_SHIFT;
-   last_start = start = real_end;
-
-   /*
-* We start from the top (end of memory) and go to the bottom.
-* The memblock_find_in_range() gets us a block of RAM from the
-* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
-* for page table.
-*/
-   while (last_start > map_start) {
-   if (last_start > step_size) {
-   

[RFC PATCH 3/4] x86/mm: set allowed range for memblock allocator

2019-01-07 Thread Pingfan Liu
Due to the incoming divergence of x86_32 and x86_64, there is a
requirement to set the allowed allocation range at the early boot stage.
This patch also includes a minor change that removes a redundant condition
check; see memblock_find_in_range_node(): memblock_find_in_range()
already protects itself from the case start > end.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/mm/init.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index f905a23..84baa66 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -76,6 +76,14 @@ static unsigned long min_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
 
+static unsigned long min_pfn_allowed;
+static unsigned long max_pfn_allowed;
+void set_alloc_range(unsigned long low, unsigned long high)
+{
+   min_pfn_allowed = low;
+   max_pfn_allowed = high;
+}
+
 /*
  * Pages returned are already directly mapped.
  *
@@ -100,12 +108,10 @@ __ref void *alloc_low_pages(unsigned int num)
if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret = 0;
 
-   if (min_pfn_mapped < max_pfn_mapped) {
-   ret = memblock_find_in_range(
-   min_pfn_mapped << PAGE_SHIFT,
-   max_pfn_mapped << PAGE_SHIFT,
-   PAGE_SIZE * num , PAGE_SIZE);
-   }
+   ret = memblock_find_in_range(
+   min_pfn_allowed << PAGE_SHIFT,
+   max_pfn_allowed << PAGE_SHIFT,
+   PAGE_SIZE * num, PAGE_SIZE);
if (ret)
memblock_reserve(ret, PAGE_SIZE * num);
else if (can_use_brk_pgt)
@@ -588,14 +594,17 @@ static void __init memory_map_top_down(unsigned long 
map_start,
start = map_start;
mapped_ram_size += init_range_memory_mapping(start,
last_start);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
last_start = start;
min_pfn_mapped = last_start >> PAGE_SHIFT;
if (mapped_ram_size >= step_size)
step_size = get_new_step_size(step_size);
}
 
-   if (real_end < map_end)
+   if (real_end < map_end) {
init_range_memory_mapping(real_end, map_end);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
+   }
 }
 
 /**
@@ -636,6 +645,7 @@ static void __init memory_map_bottom_up(unsigned long 
map_start,
}
 
mapped_ram_size += init_range_memory_mapping(start, next);
+   set_alloc_range(min_pfn_mapped, max_pfn_mapped);
start = next;
 
if (mapped_ram_size >= step_size)
-- 
2.7.4



[RFC PATCH 2/4] x86/setup: parse acpi to get hotplug info before init_mem_mapping()

2019-01-07 Thread Pingfan Liu
At present, memblock bottom-up allocation can help us avoid stomping over
the movable node with very high probability. But if the hotplug info has
already been parsed, the memblock allocator can step around the movable
node by itself. This patch pushes the parsing step forward, just ahead of
the point where the memblock allocator starts to work. Later in this
series, the bottom-up allocation style can be removed on x86_64.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/setup.c | 15 +++
 include/linux/acpi.h|  1 +
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index acbcd62..df4132c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -805,6 +805,20 @@ dump_kernel_offset(struct notifier_block *self, unsigned 
long v, void *p)
return 0;
 }
 
+/* only need the effect of acpi_numa_memory_affinity_init()
+ * ->memblock_mark_hotplug()
+ */
+static int early_detect_acpi_memhotplug(void)
+{
+#ifdef CONFIG_ACPI_NUMA
+   acpi_table_upgrade(__va(get_ramdisk_image()), get_ramdisk_size());
+   acpi_table_init();
+   acpi_numa_init();
+   acpi_tb_terminate();
+#endif
+   return 0;
+}
+
 /*
  * Determine if we were loaded by an EFI loader.  If so, then we have also been
  * passed the efi memmap, systab, etc., so we should use these data structures
@@ -1131,6 +1145,7 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
 
+   early_detect_acpi_memhotplug();
init_mem_mapping();
 
idt_setup_early_pf();
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 44dcbba..1b69044 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -235,6 +235,7 @@ int acpi_mps_check (void);
 int acpi_numa_init (void);
 
 int acpi_table_init (void);
+void acpi_tb_terminate(void);
 int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
 int __init acpi_table_parse_entries(char *id, unsigned long table_size,
  int entry_id,
-- 
2.7.4



[RFC PATCH 1/4] acpi: change the topo of acpi_table_upgrade()

2019-01-07 Thread Pingfan Liu
The current acpi_table_upgrade() relies on initrd_start, but this variable
is only valid after relocate_initrd(). There is a requirement to extract
the ACPI info from the initrd before the memblock allocator can work (see
[2/4]), hence acpi_table_upgrade() needs to accept the input parameters
directly.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: linux-kernel@vger.kernel.org
---
 arch/arm64/kernel/setup.c | 2 +-
 arch/x86/kernel/setup.c   | 2 +-
 drivers/acpi/tables.c | 4 +---
 include/linux/acpi.h  | 4 ++--
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 4b0e123..48cb98c 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -315,7 +315,7 @@ void __init setup_arch(char **cmdline_p)
paging_init();
efi_apply_persistent_mem_reservations();
 
-   acpi_table_upgrade();
+   acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
 
/* Parse the ACPI tables for possible boot-time configuration */
acpi_boot_table_init();
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3d872a5..acbcd62 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1175,8 +1175,8 @@ void __init setup_arch(char **cmdline_p)
 
reserve_initrd();
 
-   acpi_table_upgrade();
 
+   acpi_table_upgrade((void *)initrd_start, initrd_end - initrd_start);
vsmp_init();
 
io_delay_init();
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 48eabb6..d29b05c 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -471,10 +471,8 @@ static DECLARE_BITMAP(acpi_initrd_installed, 
NR_ACPI_INITRD_TABLES);
 
 #define MAP_CHUNK_SIZE   (NR_FIX_BTMAPS << PAGE_SHIFT)
 
-void __init acpi_table_upgrade(void)
+void __init acpi_table_upgrade(void *data, size_t size)
 {
-   void *data = (void *)initrd_start;
-   size_t size = initrd_end - initrd_start;
int sig, no, table_nr = 0, total_offset = 0;
long offset = 0;
struct acpi_table_header *table;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 87715f2..44dcbba 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1272,9 +1272,9 @@ acpi_graph_get_remote_endpoint(const struct fwnode_handle 
*fwnode,
 #endif
 
 #ifdef CONFIG_ACPI_TABLE_UPGRADE
-void acpi_table_upgrade(void);
+void acpi_table_upgrade(void *data, size_t size);
 #else
-static inline void acpi_table_upgrade(void) { }
+static inline void acpi_table_upgrade(void *data, size_t size) { }
 #endif
 
 #if defined(CONFIG_ACPI) && defined(CONFIG_ACPI_WATCHDOG)
-- 
2.7.4



[PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-07 Thread Pingfan Liu
A customer reported a bug on a high-end server with many PCIe devices,
where the kernel boots with crashkernel=384M and KASLR enabled. Even
though there is still plenty of memory under 896 MB, the search for a
region still failed intermittently, because currently we can only find a
region under 896 MB unless ',high' is specified. KASLR then breaks the
space under 896 MB into several parts randomly, and the crashkernel
reservation needs to be aligned to 128 MB; that is why the failure is
seen. It confuses the end user that crashkernel=X sometimes works and
sometimes fails.
To make it succeed, the customer can change the kernel option to
"crashkernel=384M,high". But this leaves "crashkernel=xx@yy" very
limited room to behave in, even though its grammar looks more generic.
And we cannot answer the questions raised by the customer confidently:
1) why the reservation under 896 MB does not succeed;
2) what is wrong with the memory region under 4G;
3) why ',high' has to be added when only 384 MB is required, not 3840 MB.

This patch simplifies the method suggested in the mail [1]. It just goes
bottom-up to find a candidate region for the crashkernel. Bottom-up may be
more compatible with the old reservation style, i.e. it still tries to
get a memory region under 896 MB first, then in [896 MB, 4G], and finally
above 4G.

There is one trivial point about compatibility with old kexec-tools: if
the reserved region is above 896M, the old tool will fail to load the
bzImage. But without this patch the old tool fails as well, since there
is no memory below 896M that can be reserved for the crashkernel.

[1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
Signed-off-by: Pingfan Liu 
Cc: Tang Chen 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Michal Hocko 
Cc: Jonathan Corbet 
Cc: Yaowei Bai 
Cc: Pavel Tatashin 
Cc: Nicholas Piggin 
Cc: Naoya Horiguchi 
Cc: Daniel Vacek 
Cc: Mathieu Malaterre 
Cc: Stefan Agner 
Cc: Dave Young 
Cc: Baoquan He 
Cc: ying...@kernel.org,
Cc: vgo...@redhat.com
Cc: linux-kernel@vger.kernel.org
---
v4 -> v5:
  add a wrapper of bottom up allocation func
v3 -> v4:
  instead of exporting the stage of parsing mem hotplug info, just using the 
bottom-up allocation func directly
 arch/x86/kernel/setup.c  |  8 
 include/linux/memblock.h |  3 +++
 mm/memblock.c| 29 +
 3 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d494b9b..80e7923 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -546,10 +546,10 @@ static void __init reserve_crashkernel(void)
 * as old kexec-tools loads bzImage below that, unless
 * "crashkernel=size[KMG],high" is specified.
 */
-   crash_base = memblock_find_in_range(CRASH_ALIGN,
-   high ? CRASH_ADDR_HIGH_MAX
-: CRASH_ADDR_LOW_MAX,
-   crash_size, CRASH_ALIGN);
+   crash_base = memblock_find_range_bottom_up(CRASH_ALIGN,
+   (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN,
+   NUMA_NO_NODE);
+
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a..a35ae17 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -116,6 +116,9 @@ phys_addr_t memblock_find_in_range_node(phys_addr_t size, 
phys_addr_t align,
int nid, enum memblock_flags flags);
 phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
   phys_addr_t size, phys_addr_t align);
+phys_addr_t __init_memblock
+memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+   phys_addr_t size, phys_addr_t align, int nid);
 void memblock_allow_resize(void);
 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
 int memblock_add(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 81ae63c..f68287e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -192,6 +192,35 @@ __memblock_find_range_bottom_up(phys_addr_t start, 
phys_addr_t end,
return 0;
 }
 
+phys_addr_t __init_memblock
+memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+   phys_addr_t size, phys_addr_t align, int nid)
+{
+   phys_addr_t ret;
+   enum memblock_flags flags = choose_memblock_flags();
+
+   /* pump up @end */
+   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+   end = memblock.current_limit;
+
+   /* avoid allocating the first page */
+   start = max_t(phys_addr_t, start, PAGE_SIZE);
+   end = max(start, end);
+
+again:
+   ret =

Re: [PATCHv4] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-07 Thread Pingfan Liu
On Fri, Jan 4, 2019 at 5:43 PM Baoquan He  wrote:
>
> On 01/04/19 at 04:39pm, Pingfan Liu wrote:
> > Customer reported a bug on a high end server with many pcie devices, where
> > kernel bootup with crashkernel=384M, and kaslr is enabled. Even
> > though we still see much memory under 896 MB, the finding still failed
> > intermittently. Because currently we can only find region under 896 MB,
> > if w/0 ',high' specified. Then KASLR breaks 896 MB into several parts
> > randomly, and crashkernel reservation need be aligned to 128 MB, that's
> > why failure is found. It raises confusion to the end user that sometimes
> > crashkernel=X works while sometimes fails.
> > If want to make it succeed, customer can change kernel option to
> > "crashkernel=384M, high". Just this give "crashkernel=xx@yy" a very
> > limited space to behave even though its grammer looks more generic.
> > And we can't answer questions raised from customer that confidently:
> > 1) why it doesn't succeed to reserve 896 MB;
> > 2) what's wrong with memory region under 4G;
> > 3) why I have to add ',high', I only require 384 MB, not 3840 MB.
> >
> > This patch simplifies the method suggested in the mail [1]. It just goes
> > bottom-up to find a candidate region for crashkernel. The bottom-up may be
> > better compatible with the old reservation style, i.e. still want to get
> > memory region from 896 MB firstly, then [896 MB, 4G], finally above 4G.
> >
> > There is one trivial thing about the compatibility with old kexec-tools:
> > if the reserved region is above 896M, then old tool will fail to load
> > bzImage. But without this patch, the old tool also fail since there is no
> > memory below 896M can be reserved for crashkernel.
> >
> > [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> > Signed-off-by: Pingfan Liu 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Len Brown 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Cc: Michal Hocko 
> > Cc: Jonathan Corbet 
> > Cc: Yaowei Bai 
> > Cc: Nicholas Piggin 
> > Cc: Naoya Horiguchi 
> > Cc: Daniel Vacek 
> > Cc: Mathieu Malaterre 
> > Cc: Stefan Agner 
> > Cc: Dave Young 
> > Cc: Baoquan He 
> > Cc: ying...@kernel.org
> > Cc: vgo...@redhat.com
> > Cc: linux-kernel@vger.kernel.org
> > ---
> > v3 -> v4:
> >  instead of exporting the stage of parsing mem hotplug info, just using the 
> > bottom-up allocation func directly
> >  arch/x86/kernel/setup.c  | 8 
> >  include/linux/memblock.h | 4 
> >  mm/memblock.c| 2 +-
> >  3 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d494b9b..082aadd 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -546,10 +546,10 @@ static void __init reserve_crashkernel(void)
> >* as old kexec-tools loads bzImage below that, unless
> >* "crashkernel=size[KMG],high" is specified.
> >*/
> > - crash_base = memblock_find_in_range(CRASH_ALIGN,
> > - high ? CRASH_ADDR_HIGH_MAX
> > -  : CRASH_ADDR_LOW_MAX,
> > - crash_size, CRASH_ALIGN);
> > + crash_base = __memblock_find_range_bottom_up(CRASH_ALIGN,
>
> Better make a wrapper function for external invocation. E.g we need
> allocate kernel data in mirrorred memory region if it's available. This
> has been done in memblock_find_in_range(), and the boundary alignment.
>
OK, I will update it in v5.
Thanks for your kind review.

Regards,
Pingfan
> > + (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN,
> > + NUMA_NO_NODE, MEMBLOCK_NONE);
> > +
> >   if (!crash_base) {
> >   pr_info("crashkernel reservation failed - No suitable 
> > area found.\n");
> >   return;
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index aee299a..39720bf 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -116,6 +116,10 @@ phys_addr_t memblock_find_in_range_node(phys_addr_t 
> > size, phys_addr_t align,
> >   int nid, enum memblock_flags flags);
> >  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
> > 

[PATCH] x86/trap: remove useless declaration

2019-01-04 Thread Pingfan Liu
There is no early_trap_pf_init() implementation, so remove this useless
declaration.

Signed-off-by: Pingfan Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: linux-kernel@vger.kernel.org

---
 arch/x86/include/asm/processor.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 071b2a6..88a7365 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -742,7 +742,6 @@ enum idle_boot_override {IDLE_NO_OVERRIDE=0, IDLE_HALT, 
IDLE_NOMWAIT,
 extern void enable_sep_cpu(void);
 extern int sysenter_setup(void);
 
-void early_trap_pf_init(void);
 
 /* Defined in head.S */
 extern struct desc_ptr early_gdt_descr;
-- 
2.7.4


