Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-03 Thread Mike Rapoport
On Mon, Jun 03, 2024 at 11:56:01AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 10:29 AM, Mike Rapoport wrote:
> 
> > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > > On 6/3/2024 8:39 AM, Mike Rapoport wrote:
> > > 
> > > > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> > > > > On 6/3/2024 3:56 AM, Borislav Petkov wrote
> > > > > 
> > > > > > > EFI memory map and due to early allocation it uses memblock 
> > > > > > > allocation.
> > > > > > > 
> > > > > > > Later during boot, efi_enter_virtual_mode() calls 
> > > > > > > kexec_enter_virtual_mode()
> > > > > > > in case of a kexec-ed kernel boot.
> > > > > > > 
> > > > > > > This function kexec_enter_virtual_mode() installs the new EFI 
> > > > > > > memory map by
> > > > > > > calling efi_memmap_init_late() which remaps the efi_memmap 
> > > > > > > physically allocated
> > > > > > > in efi_arch_mem_reserve(), but this remapping is still using 
> > > > > > > memblock allocation.
> > > > > > > 
> > > > > > > Subsequently, when memblock is freed later in boot flow, this 
> > > > > > > remapped
> > > > > > > efi_memmap will have random corruption (similar to a 
> > > > > > > use-after-free scenario).
> > > > > > > 
> > > > > > > The corrupted EFI memory map is then passed to the next kexec-ed 
> > > > > > > kernel
> > > > > > > which causes a panic when trying to use the corrupted EFI memory 
> > > > > > > map.
> > > > > > This sounds fishy: memblock allocated memory is not freed later in 
> > > > > > the
> > > > > > boot - it remains reserved. Only free memory is freed from memblock 
> > > > > > to
> > > > > > the buddy allocator.
> > > > > > 
> > > > > > Or is the problem that memblock-allocated memory cannot be 
> > > > > > memremapped
> > > > > > because *raisins*?
> > > > > This is what seems to be happening:
> > > > > 
> > > > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> > > > > EFI memory map and due to early allocation it uses memblock 
> > > > > allocation.
> > > > > 
> > > > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > > > in case of a kexec-ed kernel boot.
> > > > > 
> > > > > This function kexec_enter_virtual_mode() installs the new EFI memory 
> > > > > map by
> > > > > calling efi_memmap_init_late() which does memremap() on 
> > > > > memblock-allocated memory.
> > > > Does the issue happen only with SNP?
> > > This is observed under SNP as efi_arch_mem_reserve() is only being called
> > > with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map
> > > using memblock.
> > I don't see how efi_arch_mem_reserve() is only called with SNP. What did I
> > miss?
> 
> This is the call stack for efi_arch_mem_reserve():
> 
> [0.310010]  efi_arch_mem_reserve+0xb1/0x220
> [0.311382]  efi_mem_reserve+0x36/0x60
> [0.311973]  efi_bgrt_init+0x17d/0x1a0
> [0.313265]  acpi_parse_bgrt+0x12/0x20
> [0.313858]  acpi_table_parse+0x77/0xd0
> [0.314463]  acpi_boot_init+0x362/0x630
> [0.315069]  setup_arch+0xa88/0xf80
> [0.315629]  start_kernel+0x68/0xa90
> [0.316194]  x86_64_start_reservations+0x1c/0x30
> [0.316921]  x86_64_start_kernel+0xbf/0x110
> [0.317582]  common_startup_64+0x13e/0x141
> 
> So it is probably being invoked specifically for the AMD platform?

AFAIU, efi_bgrt_init() can be called for any x86 platform, with or without
encryption.
So if my understanding is correct, efi_arch_mem_reserve() will be called with
SNP disabled as well. And if kexec works fine without SNP but fails with SNP,
this may give us a clue to the root cause of the failure.
 
> > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> > > for kexec case), then for kexec boot, EFI memmap is memremapped in the 
> > > same
> > > virtual address as the first kernel and not the allocated memblock 
> > > address.
> > Maybe we should skip efi_arch_mem_reserve() for kexec case

Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-03 Thread Mike Rapoport
On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> > for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> > virtual address as the first kernel and not the allocated memblock address.
> 
> Are you saying that we should simply do
> 
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index fdf07dd6f459..410cb0743289 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>  	if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
>  		return;
>  
> +	if (kexec_in_progress)
> +		return;
> +
>  	if (!memblock_is_region_reserved(addr, size))
>  		memblock_reserve(addr, size);
>  
> and skip that whole call?

I think Ashish rather suggested:

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index fdf07dd6f459..eccc10ab15a4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
 	if (!memblock_is_region_reserved(addr, size))
 		memblock_reserve(addr, size);
 
+	if (kexec_in_progress)
+		return;
+
 	/*
 	 * Some architectures (x86) reserve all boot services ranges
 	 * until efi_free_boot_services() because of buggy firmware
 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette

-- 
Sincerely yours,
Mike.



Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-03 Thread Mike Rapoport
On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 8:39 AM, Mike Rapoport wrote:
> 
> > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> > > On 6/3/2024 3:56 AM, Borislav Petkov wrote
> > > 
> > > > > EFI memory map and due to early allocation it uses memblock 
> > > > > allocation.
> > > > > 
> > > > > Later during boot, efi_enter_virtual_mode() calls 
> > > > > kexec_enter_virtual_mode()
> > > > > in case of a kexec-ed kernel boot.
> > > > > 
> > > > > This function kexec_enter_virtual_mode() installs the new EFI memory 
> > > > > map by
> > > > > calling efi_memmap_init_late() which remaps the efi_memmap physically 
> > > > > allocated
> > > > > in efi_arch_mem_reserve(), but this remapping is still using memblock 
> > > > > allocation.
> > > > > 
> > > > > Subsequently, when memblock is freed later in boot flow, this remapped
> > > > > efi_memmap will have random corruption (similar to a use-after-free 
> > > > > scenario).
> > > > > 
> > > > > The corrupted EFI memory map is then passed to the next kexec-ed 
> > > > > kernel
> > > > > which causes a panic when trying to use the corrupted EFI memory map.
> > > > This sounds fishy: memblock allocated memory is not freed later in the
> > > > boot - it remains reserved. Only free memory is freed from memblock to
> > > > the buddy allocator.
> > > > 
> > > > Or is the problem that memblock-allocated memory cannot be memremapped
> > > > because *raisins*?
> > > This is what seems to be happening:
> > > 
> > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> > > EFI memory map and due to early allocation it uses memblock allocation.
> > > 
> > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > in case of a kexec-ed kernel boot.
> > > 
> > > This function kexec_enter_virtual_mode() installs the new EFI memory map 
> > > by
> > > calling efi_memmap_init_late() which does memremap() on 
> > > memblock-allocated memory.
> > Does the issue happen only with SNP?
> 
> This is observed under SNP as efi_arch_mem_reserve() is only being called
> with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map
> using memblock.

I don't see how efi_arch_mem_reserve() is only called with SNP. What did I
miss?
 
> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> virtual address as the first kernel and not the allocated memblock address.

Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we
still need to understand what's causing memory corruption.
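
For readers following along, the flow under discussion condenses to roughly
this (a sketch of the call chain named above, not verbatim kernel code):

	/* early in boot, before the slab allocator is up */
	efi_arch_mem_reserve()
	  efi_memmap_alloc()			/* slab_is_available() is false,  */
	    memblock_phys_alloc()		/* so memblock provides the pages */
	  efi_memmap_install()			/* new EFI memmap at that paddr   */

	/* later in boot */
	efi_enter_virtual_mode()
	  kexec_enter_virtual_mode()		/* kexec-ed kernels take this	  */
	    efi_memmap_init_late()
	      memremap(phys, size, MEMREMAP_WB)	/* remaps that same paddr	  */

The open question is what invalidates that memory between this point and the
next kexec: memblock allocations are never freed to the buddy allocator, so
either something releases the range explicitly or the remap itself is wrong.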

> Thanks, Ashish
> 
> > 
> > I didn't really dig, but my theory would be that it has something to do
> > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
> > > Thanks, Ashish

-- 
Sincerely yours,
Mike.



Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

2024-06-03 Thread Mike Rapoport
On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 3:56 AM, Borislav Petkov wrote
> 
> > > EFI memory map and due to early allocation it uses memblock allocation.
> > > 
> > > Later during boot, efi_enter_virtual_mode() calls 
> > > kexec_enter_virtual_mode()
> > > in case of a kexec-ed kernel boot.
> > > 
> > > This function kexec_enter_virtual_mode() installs the new EFI memory map 
> > > by
> > > calling efi_memmap_init_late() which remaps the efi_memmap physically 
> > > allocated
> > > in efi_arch_mem_reserve(), but this remapping is still using memblock 
> > > allocation.
> > > 
> > > Subsequently, when memblock is freed later in boot flow, this remapped
> > > efi_memmap will have random corruption (similar to a use-after-free 
> > > scenario).
> > > 
> > > The corrupted EFI memory map is then passed to the next kexec-ed kernel
> > > which causes a panic when trying to use the corrupted EFI memory map.
> > This sounds fishy: memblock allocated memory is not freed later in the
> > boot - it remains reserved. Only free memory is freed from memblock to
> > the buddy allocator.
> > 
> > Or is the problem that memblock-allocated memory cannot be memremapped
> > because *raisins*?
> 
> This is what seems to be happening:
> 
> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> EFI memory map and due to early allocation it uses memblock allocation.
> 
> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> in case of a kexec-ed kernel boot.
> 
> This function kexec_enter_virtual_mode() installs the new EFI memory map by
> calling efi_memmap_init_late() which does memremap() on memblock-allocated 
> memory.

Does the issue happen only with SNP?

I didn't really dig, but my theory would be that it has something to do
with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
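
For reference, memremap() of ordinary RAM normally just returns the
linear-map alias instead of building a new mapping, and
arch_memremap_can_ram_remap() is the per-arch veto on that shortcut.
Roughly, paraphrasing kernel/iomem.c from memory (treat this as a sketch,
not a verbatim quote):

static void *try_ram_remap(resource_size_t offset, size_t size,
			   unsigned long flags)
{
	unsigned long pfn = PHYS_PFN(offset);

	/* Reuse the existing linear mapping if the arch allows it... */
	if (pfn_valid(pfn) && !PageHighMem(pfn_to_page(pfn)) &&
	    arch_memremap_can_ram_remap(offset, size, flags))
		return __va(offset);

	/* ...otherwise memremap() falls back to creating a new mapping */
	return NULL;
}

IIRC the x86 implementation refuses the shortcut for EFI data when memory
encryption is active, so under SNP the EFI memmap may be remapped through a
different path than in the unencrypted case, which would fit the symptoms.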
 
> Thanks, Ashish

-- 
Sincerely yours,
Mike.



Re: [PATCH v3 09/17] x86: Add KHO support

2024-02-20 Thread Mike Rapoport
Hi Alex,

On Wed, Jan 17, 2024 at 02:46:56PM +, Alexander Graf wrote:
> We now have all bits in place to support KHO kexecs. This patch adds
> awareness of KHO in the kexec file as well as boot path for x86 and
> adds the respective kconfig option to the architecture so that it can
> use KHO successfully.
> 
> In addition, it enlightens its decompression code with KHO so that its
> KASLR location finder only considers memory regions that are not already
> occupied by KHO memory.
> 
> Signed-off-by: Alexander Graf 
> 
> ---
> 
> v1 -> v2:
> 
>   - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
>   - s/kho_reserve_mem/kho_reserve_previous_mem/g
>   - s/kho_reserve/kho_reserve_scratch/g
> ---
>  arch/x86/Kconfig  |  3 ++
>  arch/x86/boot/compressed/kaslr.c  | 55 +++
>  arch/x86/include/uapi/asm/bootparam.h | 15 +++-
>  arch/x86/kernel/e820.c|  9 +
>  arch/x86/kernel/kexec-bzimage64.c | 39 +++
>  arch/x86/kernel/setup.c   | 46 ++
>  arch/x86/mm/init_32.c |  7 
>  arch/x86/mm/init_64.c |  7 
>  8 files changed, 180 insertions(+), 1 deletion(-)

...

> @@ -987,8 +1013,26 @@ void __init setup_arch(char **cmdline_p)
>   cleanup_highmap();
>  
>   memblock_set_current_limit(ISA_END_ADDRESS);
> +
>   e820__memblock_setup();
>  
> + /*
> +  * We can resize memblocks at this point, let's dump all KHO
> +  * reservations in and switch from scratch-only to normal allocations
> +  */
> + kho_reserve_previous_mem();
> +
> + /* Allocations now skip scratch mem, return low 1M to the pool */
> + if (is_kho_boot()) {
> + u64 i;
> + phys_addr_t base, end;
> +
> + __for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> +  MEMBLOCK_SCRATCH, &base, &end, NULL)
> + if (end <= ISA_END_ADDRESS)
> + memblock_clear_scratch(base, end - base);
> + }

You had to mark lower 16M as MEMBLOCK_SCRATCH because at this point the
mapping of the physical memory is not ready yet and page tables only cover
lower 16M and the memory mapped in kexec::init_pgtable(). Hence the call
for memblock_set_current_limit(ISA_END_ADDRESS) slightly above, which
essentially makes scratch mem reserved by KHO unusable for allocations.

I'd suggest moving kho_reserve_previous_mem() earlier, probably even right
next to kho_populate().
kho_populate() already does memblock_add(scratch) and at that point it's
the only physical memory that memblock knows of, so if it'll have to
allocate, the allocations will end up there.

Also, there are no kernel allocations before e820__memblock_setup(), so the
only memory that might need to be allocated is for memblock_double_array()
and that will be discarded later anyway.

With this, it seems that MEMBLOCK_SCRATCH is not needed, as the scratch
memory is anyway the only usable memory up to e820__memblock_setup().
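
Conceptually, the ordering I'm suggesting would look like this (a sketch
only, not a patch; the exact call site of kho_populate() is wherever the
series already puts it):

	kho_populate(...);		/* memblock_add()s the scratch area; at
					 * this point it's the only memory that
					 * memblock knows about */
	kho_reserve_previous_mem();	/* moved up: any memblock allocation
					 * until here can only land in scratch
					 * memory anyway */
	...
	e820__memblock_setup();		/* the rest of RAM is only added here */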

>   /*
>* Needs to run after memblock setup because it needs the physical
>* memory size.
> @@ -1104,6 +1148,8 @@ void __init setup_arch(char **cmdline_p)
>*/
>   arch_reserve_crashkernel();
>  
> + kho_reserve_scratch();
> +
>   memblock_find_dma_reserve();
>  
>   if (!early_xdbc_setup_hardware())
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index b63403d7179d..6c3810afed04 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -738,6 +739,12 @@ void __init mem_init(void)
>   after_bootmem = 1;
>   x86_init.hyper.init_after_bootmem();
>  
> + /*
> +  * Now that all KHO pages are marked as reserved, let's flip them back
> +  * to normal pages with accurate refcount.
> +  */
> + kho_populate_refcount();

This should go to mm_core_init(), there's nothing architecture specific
there.
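
I.e. something along these lines in generic code (a sketch of the suggested
placement, not a real patch):

	/* mm/mm_init.c */
	void __init mm_core_init(void)
	{
		...
		mem_init();
		/* all KHO pages are reserved by now; restore their refcounts */
		kho_populate_refcount();
		...
	}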

> +
>   /*
>* Check boundaries twice: Some fundamental inconsistencies can
>* be detected at build time already.

-- 
Sincerely yours,
Mike.



Re: [PATCH 0/3] arm64: kdump : take off the protection on crashkernel memory region

2023-03-25 Thread Mike Rapoport
On Fri, Mar 24, 2023 at 09:18:35PM +0800, Baoquan He wrote:
> Problem:
> ===
> On arm64, block and section mappings are supported for building page
> tables. However, base page mapping is currently enforced for the whole
> linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and the
> crashkernel kernel parameter is set. This lengthens the linear mapping
> process during bootup and causes severe performance degradation at run
> time.
> 
> Root cause:
> ==
> On arm64, crashkernel reservation relies on knowing the upper limit of
> low memory zone because it needs to reserve memory in the zone so that
> devices' DMA addressing in the kdump kernel can be satisfied. However,
> the upper limit of low memory on arm64 varies, and it can only be
> determined as late as when bootmem_init() is called [1].
> 
> And we need to map the crashkernel region with base page granularity when
> doing linear mapping, because kdump needs to protect the crashkernel region
> via set_memory_valid(..., 0) after kdump kernel loading. However, arm64
> doesn't support splitting already-built block or section mappings well due
> to some CPU restrictions [2]. And unfortunately, the linear mapping is done
> before
> bootmem_init().
> 
> To resolve the above conflict on arm64, the compromise is to enforce base
> page mapping for the entire linear mapping if crashkernel is set and
> CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled. Hence performance is
> sacrificed.
> 
> Solution:
> =
> Compared with always falling back to base page mapping for the whole
> linear region, it's better to take off the protection of the crashkernel
> memory region for now: the case the protection guards against is a
> one-in-a-million chance, while the base page mapping for the whole linear
> map always penalizes arm64 systems with crashkernel set.
> 
> This also gives distros a chance to backport this patchset to fix the
> performance issue caused by base page mapping of the whole linear
> region.
> 
> Extra words
> ===
> I personally expect that we can add these back in the near future once
> arm64_dma_phys_limit is fixed, e.g. Raspberry Pi enlarges the device
> addressing limit to 32 bits, or arm64 gains support for splitting built
> block or section mappings. This way the code stays simplest and clearest.
> 
> Or, as Catalin suggested, of the four cases below that we currently defer
> to bootmem_init(), we can try to handle case 3) in advance so that memory
> above 4G can avoid base page mapping entirely. This will complicate the
> already complex code; let's see how it looks if interested people post a
> patch.
> 
> crashkernel=size
> 1)first attempt:  low memory under arm64_dma_phys_limit
> 2)fallback:   finding memory above 4G
> 
> crashkernel=size,high
> 3)first attempt:  finding memory above 4G
> 4)fallback:   low memory under arm64_dma_phys_limit
> 
> 
> [1]
> https://lore.kernel.org/all/yriijkhkwsuaq...@arm.com/T/#u
> 
> [2]
> https://lore.kernel.org/linux-arm-kernel/20190911182546.17094-1-nsaenzjulie...@suse.de/T/
> 
> Baoquan He (3):
>   arm64: kdump : take off the protection on crashkernel memory region
>   arm64: kdump: do not map crashkernel region specifically
>   arm64: kdump: defer the crashkernel reservation for platforms with no
> DMA memory zones
> 
>  arch/arm64/include/asm/kexec.h|  6 -
>  arch/arm64/include/asm/memory.h   |  5 
>  arch/arm64/kernel/machine_kexec.c | 20 --
>  arch/arm64/mm/init.c  |  6 +
>  arch/arm64/mm/mmu.c   | 43 ---
>  5 files changed, 1 insertion(+), 79 deletions(-)

Acked-by: Mike Rapoport (IBM) 

> -- 
> 2.34.1
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v2 6/6] mm: export dump_mm()

2023-01-27 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:51AM -0800, Suren Baghdasaryan wrote:
> mmap_assert_write_locked() is used in vm_flags modifiers. Because
> mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
> modified from inside a module, it's necessary to export
> the dump_mm() function.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Mike Rapoport (IBM) 

> ---
>  mm/debug.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index 9d3d893dc7f4..96d594e16292 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
>   mm->def_flags, &mm->def_flags
>   );
>  }
> +EXPORT_SYMBOL(dump_mm);
>  
>  static bool page_init_poisoning __read_mostly = true;
>  
> -- 
> 2.39.1
> 



Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-27 Thread Mike Rapoport
On Thu, Jan 26, 2023 at 11:17:09AM +0200, Mike Rapoport wrote:
> On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> > vm_flags are among VMA attributes which affect decisions like VMA merging
> > and splitting. Therefore all vm_flags modifications are performed after
> > taking exclusive mmap_lock to prevent vm_flags updates racing with such
> > operations. Introduce modifier functions for vm_flags to be used whenever
> > flags are updated. This way we can better check and control correct
> > locking behavior during these updates.
> > 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h   | 37 +
> >  include/linux/mm_types.h |  8 +++-
> >  2 files changed, 44 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c2f62bdce134..b71f2809caac 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct 
> > *vma, struct mm_struct *mm)
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> >  }
> >  
> > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > +static inline void init_vm_flags(struct vm_area_struct *vma,
> > +unsigned long flags)
> 
> I'd suggest making it vm_flags_init() etc.

Thinking more about it, it will be even clearer to name these vma_flags_xyz()
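
For concreteness, the latest suggestion amounts to signatures like these
(naming sketch only; the bodies stay as in the patch):

static inline void vma_flags_init(struct vm_area_struct *vma, unsigned long flags);
static inline void vma_flags_reset(struct vm_area_struct *vma, unsigned long flags);
static inline void vma_flags_set(struct vm_area_struct *vma, unsigned long flags);
static inline void vma_flags_clear(struct vm_area_struct *vma, unsigned long flags);
static inline void vma_flags_mod(struct vm_area_struct *vma, unsigned long set,
				 unsigned long clear);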

> Except that
> 
> Acked-by: Mike Rapoport (IBM) 
> 

--
Sincerely yours,
Mike.



Re: [PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-26 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:50AM -0800, Suren Baghdasaryan wrote:
> In cases when VMA flags are modified after VMA was isolated and mmap_lock
> was downgraded, flags modifications would result in an assertion because
> mmap write lock is not held.
> Introduce mod_vm_flags_nolock to be used in such situation.

vm_flags_mod_nolock?

> Pass a hint to untrack_pfn to conditionally use mod_vm_flags_nolock for
> flags modification and to avoid assertion.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  arch/x86/mm/pat/memtype.c | 10 +++---
>  include/linux/mm.h| 12 +---
>  include/linux/pgtable.h   |  5 +++--
>  mm/memory.c   | 13 +++--
>  mm/memremap.c |  4 ++--
>  mm/mmap.c | 16 ++--
>  6 files changed, 38 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index ae9645c900fa..d8adc0b42cf2 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -1046,7 +1046,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
> pgprot_t *prot, pfn_t pfn)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> -  unsigned long size)
> +  unsigned long size, bool mm_wr_locked)
>  {
>   resource_size_t paddr;
>   unsigned long prot;
> @@ -1065,8 +1065,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned 
> long pfn,
>   size = vma->vm_end - vma->vm_start;
>   }
>   free_pfn_range(paddr, size);
> - if (vma)
> - clear_vm_flags(vma, VM_PAT);
> + if (vma) {
> + if (mm_wr_locked)
> + clear_vm_flags(vma, VM_PAT);
> + else
> + mod_vm_flags_nolock(vma, 0, VM_PAT);
> + }
>  }
>  
>  /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 55335edd1373..48d49930c411 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -656,12 +656,18 @@ static inline void clear_vm_flags(struct vm_area_struct 
> *vma,
>   vma->vm_flags &= ~flags;
>  }
>  
> +static inline void mod_vm_flags_nolock(struct vm_area_struct *vma,
> +unsigned long set, unsigned long clear)
> +{
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +
>  static inline void mod_vm_flags(struct vm_area_struct *vma,
>   unsigned long set, unsigned long clear)
>  {
>   mmap_assert_write_locked(vma->vm_mm);
> - vma->vm_flags |= set;
> - vma->vm_flags &= ~clear;
> + mod_vm_flags_nolock(vma, set, clear);
>  }
>  
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> @@ -2087,7 +2093,7 @@ static inline void zap_vma_pages(struct vm_area_struct 
> *vma)
>  }
>  void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>   struct vm_area_struct *start_vma, unsigned long start,
> - unsigned long end);
> + unsigned long end, bool mm_wr_locked);
>  
>  struct mmu_notifier_range;
>  
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5fd45454c073..c63cd44777ec 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1185,7 +1185,8 @@ static inline int track_pfn_copy(struct vm_area_struct 
> *vma)
>   * can be for the entire vma (in which case pfn, size are zero).
>   */
>  static inline void untrack_pfn(struct vm_area_struct *vma,
> -unsigned long pfn, unsigned long size)
> +unsigned long pfn, unsigned long size,
> +bool mm_wr_locked)
>  {
>  }
>  
> @@ -1203,7 +1204,7 @@ extern void track_pfn_insert(struct vm_area_struct 
> *vma, pgprot_t *prot,
>pfn_t pfn);
>  extern int track_pfn_copy(struct vm_area_struct *vma);
>  extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
> - unsigned long size);
> + unsigned long size, bool mm_wr_locked);
>  extern void untrack_pfn_moved(struct vm_area_struct *vma);
>  #endif
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index d6902065e558..5b11b50e2c4a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1613,7 +1613,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>  static void unmap_single_vma(struct mmu_gather *tlb,
>   struct vm_area_struct *vma, unsigned long start_addr,
>   unsigned long end_addr,
> - struct zap_details *details)
> + struct zap_details *details, bool mm_wr_locked)
>  {
>   unsigned long start = max(vma->vm_start, start_addr);
>   unsigned long end;
> @@ -1628,7 +1628,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   uprobe_munmap(vma, start, end);
>  
>   if (unlikely(vma->vm_flags & VM_PFNMAP))
> - untrack_pfn(vma, 0, 0);
> + untrack_pfn(vma, 0, 0, 

Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-26 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:49AM -0800, Suren Baghdasaryan wrote:
> Replace indirect modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> vm_flags modification attempts.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Mike Rapoport (IBM) 

> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 5 -
>  arch/s390/mm/gmap.c| 5 -
>  mm/khugepaged.c| 2 ++
>  mm/ksm.c   | 2 ++
>  4 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 1d67baa5557a..325a7a47d348 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
>  {
>   unsigned long gfn = memslot->base_gfn;
>   unsigned long end, start = gfn_to_hva(kvm, gfn);
> + unsigned long vm_flags;
>   int ret = 0;
>   struct vm_area_struct *vma;
>   int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> @@ -409,12 +410,14 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
>   ret = H_STATE;
>   break;
>   }
> + vm_flags = vma->vm_flags;
>   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> -   merge_flag, >vm_flags);
> +   merge_flag, _flags);
>   if (ret) {
>   ret = H_STATE;
>   break;
>   }
> + reset_vm_flags(vma, vm_flags);
>   start = vma->vm_end;
>   } while (end > vma->vm_end);
>  
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 3a695b8a1e3c..d5eb47dcdacb 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2587,14 +2587,17 @@ int gmap_mark_unmergeable(void)
>  {
>   struct mm_struct *mm = current->mm;
>   struct vm_area_struct *vma;
> + unsigned long vm_flags;
>   int ret;
>   VMA_ITERATOR(vmi, mm, 0);
>  
>   for_each_vma(vmi, vma) {
> + vm_flags = vma->vm_flags;
>   ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> -   MADV_UNMERGEABLE, >vm_flags);
> +   MADV_UNMERGEABLE, _flags);
>   if (ret)
>   return ret;
> + reset_vm_flags(vma, vm_flags);
>   }
>   mm->def_flags &= ~VM_MERGEABLE;
>   return 0;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 8abc59345bf2..76b24cd0c179 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -354,6 +354,8 @@ struct attribute_group khugepaged_attr_group = {
>  int hugepage_madvise(struct vm_area_struct *vma,
>unsigned long *vm_flags, int advice)
>  {
> + /* vma->vm_flags can be changed only using modifier functions */
> + BUG_ON(vm_flags == &vma->vm_flags);
>   switch (advice) {
>   case MADV_HUGEPAGE:
>  #ifdef CONFIG_S390
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 04f1c8c2df11..992b2be9f5e6 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -2573,6 +2573,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned 
> long start,
>   struct mm_struct *mm = vma->vm_mm;
>   int err;
>  
> + /* vma->vm_flags can be changed only using modifier functions */
> + BUG_ON(vm_flags == &vma->vm_flags);
>   switch (advice) {
>   case MADV_MERGEABLE:
>   /*
> -- 
> 2.39.1
> 
> 



Re: [PATCH v2 3/6] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-26 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:48AM -0800, Suren Baghdasaryan wrote:
> Replace direct modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Mike Rapoport (IBM) 

> ---
>  arch/arm/kernel/process.c  |  2 +-
>  arch/ia64/mm/init.c|  8 
>  arch/loongarch/include/asm/tlb.h   |  2 +-
>  arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
>  arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
>  arch/powerpc/platforms/book3s/vas-api.c|  2 +-
>  arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
>  arch/s390/mm/gmap.c|  3 +--
>  arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
>  arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
>  arch/x86/kernel/cpu/sgx/virt.c |  2 +-
>  arch/x86/mm/pat/memtype.c  |  6 +++---
>  arch/x86/um/mem_32.c   |  2 +-
>  drivers/acpi/pfr_telemetry.c   |  2 +-
>  drivers/android/binder.c   |  3 +--
>  drivers/char/mspec.c   |  2 +-
>  drivers/crypto/hisilicon/qm.c  |  2 +-
>  drivers/dax/device.c   |  2 +-
>  drivers/dma/idxd/cdev.c|  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
>  drivers/gpu/drm/drm_gem.c  |  2 +-
>  drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
>  drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
>  drivers/gpu/drm/drm_vm.c   |  8 
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
>  drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
>  drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
>  drivers/gpu/drm/i810/i810_dma.c|  2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
>  drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
>  drivers/gpu/drm/msm/msm_gem.c  |  2 +-
>  drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
>  drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
>  drivers/gpu/drm/tegra/gem.c|  5 ++---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
>  drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
>  drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
>  drivers/hsi/clients/cmt_speech.c   |  2 +-
>  drivers/hwtracing/intel_th/msu.c   |  2 +-
>  drivers/hwtracing/stm/core.c   |  2 +-
>  drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
>  drivers/infiniband/hw/mlx5/main.c  |  4 ++--
>  drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
>  drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
>  drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
>  .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
>  drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
>  drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
>  drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
>  drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
>  drivers/misc/cxl/context.c |  2 +-
>  drivers/misc/habanalabs/common/memory.c|  2 +-
>  drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
>  drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
>  drivers/misc/habanalabs/goya/goya.c|  4 ++--
>  drivers/misc/ocxl/context.c|  4 ++--
>  drivers/misc/ocxl/sysfs.c  |  2 +-
>  drivers/misc/open-dice.c   |  4 ++--
>  drivers/misc/sgi-gru/grufile.c |  4 ++--
>  drivers/misc/uacce/uacce.c |  2 +-
>  drivers/sbus/char/oradax.c |  2 +-
>  drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
>  drivers/scsi/sg.c  |  2 +-
>  drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
>  drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
>  .../media/deprecated/stkwebcam

Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-26 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:46AM -0800, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h   | 37 +
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c2f62bdce134..b71f2809caac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
> >   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)

> I'd suggest making it vm_flags_init() etc.
Except that

Acked-by: Mike Rapoport (IBM) 

> +{
> + vma->vm_flags = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination 
> */
> +static inline void reset_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline void set_vm_flags(struct vm_area_struct *vma,
> + unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline void clear_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline void mod_vm_flags(struct vm_area_struct *vma,
> + unsigned long set, unsigned long clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..6c7c70bf50dd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,13 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;
>  
>   /*
>* For areas with an address space and backing store,
> -- 
> 2.39.1
> 
> 



Re: [PATCH v2 2/6] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-26 Thread Mike Rapoport
On Wed, Jan 25, 2023 at 12:38:47AM -0800, Suren Baghdasaryan wrote:
> To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
> replace it with VM_LOCKED_MASK bitmask and convert all users.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Mike Rapoport (IBM) 

> ---
>  include/linux/mm.h | 4 ++--
>  kernel/fork.c  | 2 +-
>  mm/hugetlb.c   | 4 ++--
>  mm/mlock.c | 6 +++---
>  mm/mmap.c  | 6 +++---
>  mm/mremap.c| 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b71f2809caac..da62bdd627bf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
>  /* This mask defines which mm->def_flags a process can inherit its parent */
>  #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>  
> -/* This mask is used to clear all the VMA flags used by mlock */
> -#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
> +/* This mask represents all the VMA flag bits used by mlock */
> +#define VM_LOCKED_MASK   (VM_LOCKED | VM_LOCKONFAULT)
>  
>  /* Arch-specific flags to clear when updating VM flags on protection change 
> */
>  #ifndef VM_ARCH_CLEAR
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6683c1b0f460..03d472051236 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   tmp->anon_vma = NULL;
>   } else if (anon_vma_fork(tmp, mpnt))
>   goto fail_nomem_anon_vma_fork;
> - tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> + clear_vm_flags(tmp, VM_LOCKED_MASK);
>   file = tmp->vm_file;
>   if (file) {
>   struct address_space *mapping = file->f_mapping;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d20c8b09890e..4ecdbad9a451 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct 
> vm_area_struct *svma,
>   unsigned long s_end = sbase + PUD_SIZE;
>  
>   /* Allow segments to share if only one is marked locked */
> - unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> - unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
> + unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
>  
>   /*
>* match the virtual addresses, permission and the alignment of the
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0336f52e03d7..5c4fff93cd6b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, 
> size_t len,
>   if (vma->vm_start != tmp)
>   return -ENOMEM;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= flags;
>   /* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
>   tmp = vma->vm_end;
> @@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
>   struct vm_area_struct *vma, *prev = NULL;
>   vm_flags_t to_add = 0;
>  
> - current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
> + current->mm->def_flags &= ~VM_LOCKED_MASK;
>   if (flags & MCL_FUTURE) {
>   current->mm->def_flags |= VM_LOCKED;
>  
> @@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
>   for_each_vma(vmi, vma) {
>   vm_flags_t newflags;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= to_add;
>  
>   /* Ignore errors */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d4abc6feced1..323bd253b25a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>   if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>   is_vm_hugetlb_page(vma) ||
>   vma == get_gate_vma(current->mm))
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + clear_vm_flags(vma, VM_LOCKED_MASK);
>   else
>   mm->locked_vm += (len >> PAGE_SHIFT);
>   }
> @@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_sp

Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-09-21 Thread Mike Rapoport
On Tue, Sep 06, 2022 at 03:05:57PM +0200, Ard Biesheuvel wrote:
> 
> While I appreciate the effort that has gone into solving this problem,
> I don't think there is any consensus that an elaborate fix is required
> to ensure that the crash kernel can be unmapped from the linear map at
> all cost. In fact, I personally think we shouldn't bother, and IIRC,
> Will made a remark along the same lines back when the Huawei engineers
> were still driving this effort.
> 
> So perhaps we could align on that before doing yet another version of this?

I suggest starting with disabling crash kernel protection when its memory
reservation is deferred and then Baoquan and kdump folks can take it from
here.

From 6430407f784f3571da9b4d79340487f2647a44ab Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Wed, 21 Sep 2022 10:14:46 +0300
Subject: [PATCH] arm64/mm: don't protect crash kernel memory with
 CONFIG_ZONE_DMA/DMA32

Currently, in order to allow protection of crash kernel memory when
CONFIG_ZONE_DMA/DMA32 is enabled, the block mappings in the linear map are
disabled and the entire linear map uses base size pages.

This results in performance degradation because of higher TLB pressure for
kernel memory accesses, so there is a trade off between performance and
ability to protect the crash kernel memory.

Baoquan He said [1]:

In fact, panic is a small probability event, and accidental
corruption on kdump kernel data is a much smaller probability
event.

With this, it makes sense to protect crash kernel memory only when it
can be reserved before creation of the linear map.

Simplify the logic around crash kernel protection in map_mem() so that it
uses base pages only if crash kernel memory is already reserved, and
introduce a crashkres_protection_possible variable to ensure that
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres() won't
try to modify the page tables if the crash kernel is not mapped with base
pages.

[1] https://lore.kernel.org/all/Yw2C9ahluhX4Mg3G@MiWiFi-R3L-srv

Suggested-by: Will Deacon 
Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/mmu.h  |  1 +
 arch/arm64/kernel/machine_kexec.c |  6 
 arch/arm64/mm/init.c  | 30 +---
 arch/arm64/mm/mmu.c   | 46 ---
 4 files changed, 32 insertions(+), 51 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466a4be9..975607843548 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, 
phys_addr_t phys,
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern bool kaslr_requires_kpti(void);
+extern bool crashkres_protection_possible;
 
 #define INIT_MM_CONTEXT(name)  \
.pgd = init_pg_dir,
diff --git a/arch/arm64/kernel/machine_kexec.c 
b/arch/arm64/kernel/machine_kexec.c
index 19c2d487cb08..68295403aa40 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -272,6 +272,9 @@ void arch_kexec_protect_crashkres(void)
 {
int i;
 
+   if (!crashkres_protection_possible)
+   return;
+
for (i = 0; i < kexec_crash_image->nr_segments; i++)
set_memory_valid(
__phys_to_virt(kexec_crash_image->segment[i].mem),
@@ -282,6 +285,9 @@ void arch_kexec_unprotect_crashkres(void)
 {
int i;
 
+   if (!crashkres_protection_possible)
+   return;
+
for (i = 0; i < kexec_crash_image->nr_segments; i++)
set_memory_valid(
__phys_to_virt(kexec_crash_image->segment[i].mem),
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index b9af30be813e..220d45655918 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -62,27 +62,21 @@ EXPORT_SYMBOL(memstart_addr);
  * In such case, ZONE_DMA32 covers the rest of the 32-bit addressable memory,
  * otherwise it is empty.
  *
- * Memory reservation for crash kernel either done early or deferred
- * depending on DMA memory zones configs (ZONE_DMA) --
+ * Memory reservation for crash kernel must know the upper limit of low
+ * memory in order to allow DMA access for devices with kdump kernel. When
+ * ZONE_DMA/DMA32 is enabled, this limit is determined after DT/ACPI is
+ * parsed, and crash kernel reservation happens afterwards. In this case,
+ * the crash kernel memory is reserved after the linear map is created,
+ * there is no guarantee that the crash kernel memory will be mapped with
+ * base pages in the linear map, and thus the protection of the crash
+ * kernel memory is disabled.
  *
  * In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
  * here instead of max_zone_phys().  This lets early reservation of
  * crash kernel memory which has a dependency on arm64_dma_

Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-09-05 Thread Mike Rapoport
On Thu, Sep 01, 2022 at 08:25:54PM +0800, Baoquan He wrote:
> On 09/01/22 at 10:24am, Mike Rapoport wrote:
> 
> max_zone_phys() only handles the cases when CONFIG_ZONE_DMA/DMA32 is
> enabled; the disabled CONFIG_ZONE_DMA/DMA32 case is not included. I can
> change it like:
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
> phys_addr_t low_mem_mask = U32_MAX;
> phys_addr_t phys_start = memblock_start_of_DRAM();
> 
> if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) 
> ||
>  (phys_start > U32_MAX))
> low_mem_mask = PHYS_ADDR_MAX;
> 
> return low_mem_mask + 1;
> }
> 
> or add the disabled CONFIG_ZONE_DMA/DMA32 case into crash_addr_low_max()
> as you suggested. Which one do you like better?
> 
> static phys_addr_t __init crash_addr_low_max(void)
> {
> if (!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32))
>   return PHYS_ADDR_MAX + 1;
> 
> return max_zone_phys(32);
> }
 
I like the second variant better.

-- 
Sincerely yours,
Mike.



Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-09-01 Thread Mike Rapoport
On Wed, Aug 31, 2022 at 10:29:39PM +0800, Baoquan He wrote:
> On 08/31/22 at 10:37am, Mike Rapoport wrote:
> > On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> > > 
> > > Solution:
> > > =
> > > To fix the problem, we should always take 4G as the crashkernel low
> > > memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> > > With this, we don't need to defer the crashkernel reservation till
> > > bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> > > memblock init is done, we can conclude what is the upper limit of low
> > > memory zone.
> > > 
> > > 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or 
> > > memblock_start_of_DRAM() > 4G
> > >   limit = PHYS_ADDR_MAX+1  (Corner cases)
> > 
> > Why are these corner cases?
> > The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
> > simplest one because it does not require the whole dancing around
> > arm64_dma_phys_limit initialization.
> > 
> > And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
> > does not matter for device DMA addressing.
> 
> Thanks for reviewing.
> 
> I could be wrong and may have a misunderstanding about corner cases.
> 
> In my understanding, both ZONE_DMA and ZONE_DMA32 are enabled by default
> in the kernel, and I believe they are on in distros too. The case with
> both ZONE_DMA and ZONE_DMA32 disabled should only exist on specific
> products, and the memblock_start_of_DRAM() > 4G case too. At least, I
> haven't seen one in our lab. Calling the non-generic cases corner cases
> may have been wrong on my part. I will change that phrasing.
> 
> mm/Kconfig:
> config ZONE_DMA
> bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
> default y if ARM64 || X86
> 
> config ZONE_DMA32
> bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
> depends on !X86_32
> default y if ARM64

My point was that the cases with ZONE_DMA/DMA32 disabled or with RAM above
4G do not require detection of arm64_dma_phys_limit before reserving the
crash kernel; they can use predefined constants and are simple to handle.
 
> > The actual corner cases are systems with ZONE_DMA/DMA32 and with a <32-bit
> > limit for device DMA addressing (e.g. RPi 4). I think the changelog should
> 
> Right, RPi4's 30bit DMA addressing device is corner case.
> 
> > mention that to use kdump on these devices the user must specify
> > crashkernel=X@Y 
> 
> Makes sense. I will add words in log, and add sentences to
> mention that in code comment or some place of document.
> Thanks for advice.
> 
> > 
> > > 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
> > >limit = 4G  (generic case)
> > > 

...

> > > +static phys_addr_t __init crash_addr_low_max(void)
> > > +{
> > > + phys_addr_t low_mem_mask = U32_MAX;
> > > + phys_addr_t phys_start = memblock_start_of_DRAM();
> > > +
> > > + if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> > > +  (phys_start > U32_MAX))
> > > + low_mem_mask = PHYS_ADDR_MAX;
> > > +
> > > + return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;
> > 
> > Since RAM frequently starts at a non-zero address, the limit for systems with
> > ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to
> 
> Using memblock_start_of_DRAM() may not be right. On most arm64 servers I
> have tested, memblock usually starts from a higher address, not zero as
> on x86. E.g. in the memory ranges printed on an ampere-mtsnow-altra
> system, the starting address is 0x8300. In my understanding, the DMA
> addressing bits correspond to the CPU logical address range that devices
> can address. So memblock_start_of_DRAM() + 4G seems wrong for normal
> systems, and wrong for systems whose starting physical address is above
> 4G. I referred to max_zone_phys() in arch/arm64/mm/init.c when
> implementing crash_addr_low_max(). Please correct me if I am wrong.

My understanding was that no matter where DRAM starts, the first 4G would
be accessible by 32-bit devices, but I maybe wrong as well :)

I haven't notice you used max_zone_phys() as a reference. Wouldn't it be
simpler to just call it from crash_addr_low_max():

static phys_addr_t __init crash_addr_low_max(void)
{
return max_zone_phys(32);
}
 
-- 
Sincerely yours,
Mike.



Re: [PATCH 1/2] arm64, kdump: enforce to take 4G as the crashkernel low memory end

2022-08-31 Thread Mike Rapoport
On Sun, Aug 28, 2022 at 08:55:44AM +0800, Baoquan He wrote:
> Problem:
> ===
> On arm64, block and section mappings are supported for building page
> tables. However, base page mapping is currently enforced for the whole
> linear mapping if CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled and the
> crashkernel kernel parameter is set. This lengthens the linear mapping
> process during bootup and causes severe performance degradation at run
> time.
> 
> Root cause:
> ==
> On arm64, crashkernel reservation relies on knowing the upper limit of
> low memory zone because it needs to reserve memory in the zone so that
> devices' DMA addressing in the kdump kernel can be satisfied. However,
> the limit on arm64 varies, and the upper limit can only be determined
> as late as when bootmem_init() is called.
> 
> And we need to map the crashkernel region with base page granularity when
> doing linear mapping, because kdump needs to protect the crashkernel region
> via set_memory_valid(..., 0) after kdump kernel loading. However, arm64
> doesn't support splitting already-built block or section mappings well due
> to some CPU restrictions [1]. And unfortunately, the linear mapping is done
> before
> bootmem_init().
> 
> To resolve the above conflict on arm64, the compromise is to enforce base
> page mapping for the entire linear mapping if crashkernel is set and
> CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled. Hence performance is
> sacrificed.
> 
> Solution:
> =
> To fix the problem, we should always take 4G as the crashkernel low
> memory end in case CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 is enabled.
> With this, we don't need to defer the crashkernel reservation till
> bootmem_init() is called to set the arm64_dma_phys_limit. As long as
> memblock init is done, we can conclude what is the upper limit of low
> memory zone.
> 
> 1) both CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled or 
> memblock_start_of_DRAM() > 4G
>   limit = PHYS_ADDR_MAX+1  (Corner cases)

Why are these corner cases?
The case when CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are disabled is the
simplest one because it does not require the whole dancing around
arm64_dma_phys_limit initialization.

And AFAIK, memblock_start_of_DRAM() > 4G is not uncommon on arm64, but it
does not matter for device DMA addressing.

The actual corner cases are systems with ZONE_DMA/DMA32 and with a <32-bit
limit for device DMA addressing (e.g. RPi 4). I think the changelog should
mention that to use kdump on these devices the user must specify
crashkernel=X@Y 

> 2) CONFIG_ZONE_DMA or CONFIG_ZONE_DMA32 are enabled:
>limit = 4G  (generic case)
> 
> [1]
> https://lore.kernel.org/all/yriijkhkwsuaq...@arm.com/T/#u
> 
> Signed-off-by: Baoquan He 
> ---
>  arch/arm64/mm/init.c | 24 ++--
>  arch/arm64/mm/mmu.c  | 38 ++
>  2 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index b9af30be813e..8ae55afdd11c 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -90,10 +90,22 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit;
>  phys_addr_t __ro_after_init arm64_dma_phys_limit = PHYS_MASK + 1;
>  #endif

Please also update the comment above this hunk.

> +static phys_addr_t __init crash_addr_low_max(void)
> +{
> + phys_addr_t low_mem_mask = U32_MAX;
> + phys_addr_t phys_start = memblock_start_of_DRAM();
> +
> + if ((!IS_ENABLED(CONFIG_ZONE_DMA) && !IS_ENABLED(CONFIG_ZONE_DMA32)) ||
> +  (phys_start > U32_MAX))
> + low_mem_mask = PHYS_ADDR_MAX;
> +
> + return min(low_mem_mask, memblock_end_of_DRAM() - 1) + 1;

Since RAM frequently starts at a non-zero address, the limit for systems with
ZONE_DMA/DMA32 should be memblock_start_of_DRAM() + 4G. There is no need to
take the end of DRAM into account; memblock allocation will take care of
that. I'd suggest simplifying crash_addr_low_max() to:

static phys_addr_t __init crash_addr_low_max(void)
{
if (IS_ENABLED(CONFIG_ZONE_DMA) || IS_ENABLED(CONFIG_ZONE_DMA32))
return memblock_start_of_DRAM() + SZ_4G;

return PHYS_ADDR_MAX;
}

> +}
> +
>  /* Current arm64 boot protocol requires 2MB alignment */
>  #define CRASH_ALIGN  SZ_2M
>  
> -#define CRASH_ADDR_LOW_MAX   arm64_dma_phys_limit
> +#define CRASH_ADDR_LOW_MAX   crash_addr_low_max()

With the introduction of crash_addr_low_max() I think it's better to get rid of
the CRASH_ADDR_LOW_MAX and use local variables in reserve_crashkernel() and
reserve_crashkernel_low() that would get initialized to
crash_addr_low_max().

Besides, #ifdef around arm64_dma_phys_limit declaration can go away because
this variable will be used only after it is initialized in
zone_sizes_init().

>  #define CRASH_ADDR_HIGH_MAX  (PHYS_MASK + 1)
>  
>  static int __init reserve_crashkernel_low(unsigned long long low_size)
> @@ -389,8 +401,7 @@ void __init 

Re: [PATCH 0/5] arm64/mm: remap crash kernel with base pages even if rodata_full disabled

2022-08-29 Thread Mike Rapoport
On Sun, Aug 28, 2022 at 04:37:29PM +0800, Baoquan He wrote:
> On 08/25/22 at 10:48am, Mike Rapoport wrote:
> .. 
> > > > There were several rounds of discussion how to remap with base pages 
> > > > only
> > > > the crash kernel area, the latest one here:
> > > > 
> > > > https://lore.kernel.org/all/1656777473-73887-1-git-send-email-guanghuif...@linux.alibaba.com
> > > > 
> > > > and this is my attempt to allow having both large pages in the linear 
> > > > map
> > > > and protection for the crash kernel memory.
> > > > 
> > > > For server systems it is important to protect crash kernel memory for
> > > > post-mortem analysis, and for that protection to work the crash kernel
> > > > memory should be mapped with base pages in the linear map. 
> > > > 
> > > > On the systems with ZONE_DMA/DMA32 enabled, crash kernel reservation
> > > > happens after the linear map is created and the current code forces 
> > > > using
> > > > base pages for the entire linear map, which results in performance
> > > > degradation.
> > > > 
> > > > These patches enable remapping of the crash kernel area with base pages
> > > > while keeping large pages in the rest of the linear map.
> > > > 
> > > > The idea is to align crash kernel reservation to PUD boundaries, remap 
> > > > that
> > > > PUD and then free the extra memory.
> > > 
> > > Hi Mike,
> > > 
> > > Thanks for the effort to work on this issue. But I have to say this
> > > isn't good, because it relies on the prerequisite that there's enough
> > > memory. On a system with, say, 2G of memory, it's not easy to get one
> > > 1G region, while we only require a far smaller region than 1G, e.g.
> > > about 200M, which should be easy to get. So the way taken in this
> > > patchset is too quirky and will cause regressions on systems with
> > > small memory. Systems with small memory exist widely among virtual
> > > guest instances.
> > 
> > I don't agree there is a regression. If the PUD-aligned allocation fails,
> > there is a fallback to the allocation of the exact size requested for crash
> > kernel. This allocation just won't get protected.
> 
> Sorry, I misunderstood it. I just went through the log and didn't
> look into the code.
> 
> But honestly, if we accept the fallback which doesn't do the protection,
> we should be able to take off the protection completely, right?
> Otherwise, the reservation code is a little complicated.

We don't do protection of the crash kernel for most architectures
supporting kexec ;-)

My goal was to allow large systems with ZONE_DMA/DMA32 to have block
mappings in the linear map and crash kernel protection, without breaking
backward compatibility for the existing systems.

> > Also please note that the changes are only for the case when the user didn't
> > force base-size pages in the linear map, so anything that works now will
> > work the same way with this set applied.
> >  
> > > The crashkernel reservation happens after the linear map is created
> > > because the reservation needs to know the DMA zone boundary,
> > > arm64_dma_phys_limit. If we can deduce that before bootmem_init(), the
> > > reservation can be done before the linear map. I will make an attempt
> > > at that. If it still can't be accepted, we would like to take off the
> > > crashkernel region protection on arm64 for now.
> > 
> > I doubt it would be easy because arm64_dma_phys_limit is determined after
> > parsing of the device tree and there might be memory allocations of
> > possibly unmapped memory during the parsing.
> 
> I have sent out the patches with an attempt; it's pretty straightforward
> and simple, because arm64 only has one exception, namely Raspberry Pi 4,
> on which some peripherals can only address a 30-bit range. That is a
> corner case, to be honest. Kdump is a necessary feature on servers, but
> may not be so expected on Raspberry Pi 4, a system for computer education
> and hobbyists. And kdump only cares whether the dump target devices, namely
> the storage device or network card on a server, can address a 32-bit range.
> If it is finally confirmed that storage devices can only address a 30-bit
> range on Raspberry Pi 4, people can still use the crashkernel=xM@yM method
> to reserve crashkernel regions.

I hope you are right and Raspberry Pi 4 is the only system that limits the
DMA'able range to 30 bits. But with the diversity of arm64 chips and boards
I wouldn't be surprised if there are other variants with a similar problem.
 
> Thanks
> Baoquan
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH 0/5] arm64/mm: remap crash kernel with base pages even if rodata_full disabled

2022-08-25 Thread Mike Rapoport
Hi Baoquan,

On Thu, Aug 25, 2022 at 03:35:04PM +0800, Baoquan He wrote:
> Add kexec list in CC
> 
> On 08/19/22 at 07:11am, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Hi,
> > 
> > There were several rounds of discussion how to remap with base pages only
> > the crash kernel area, the latest one here:
> > 
> > https://lore.kernel.org/all/1656777473-73887-1-git-send-email-guanghuif...@linux.alibaba.com
> > 
> > and this is my attempt to allow having both large pages in the linear map
> > and protection for the crash kernel memory.
> > 
> > For server systems it is important to protect crash kernel memory for
> > post-mortem analysis, and for that protection to work the crash kernel
> > memory should be mapped with base pages in the linear map. 
> > 
> > On the systems with ZONE_DMA/DMA32 enabled, crash kernel reservation
> > happens after the linear map is created and the current code forces using
> > base pages for the entire linear map, which results in performance
> > degradation.
> > 
> > These patches enable remapping of the crash kernel area with base pages
> > while keeping large pages in the rest of the linear map.
> > 
> > The idea is to align crash kernel reservation to PUD boundaries, remap that
> > PUD and then free the extra memory.
> 
> Hi Mike,
> 
> Thanks for the effort to work on this issue. However, I have to say this
> isn't good because it relies on a prerequisite that there's big enough
> memory. On a system with, say, 2G of memory, it's not easy to succeed in
> getting one 1G block, while we only require a far smaller region than 1G,
> e.g. about 200M, which should be easy to get. So the way taken in this
> patchset is too quirky and will cause a regression on systems with small
> memory. This kind of system with small memory exists widely on virt
> guest instances.

I don't agree there is a regression. If the PUD-aligned allocation fails,
there is a fallback to the allocation of the exact size requested for crash
kernel. This allocation just won't get protected.

Also please note that the changes are only for the case when the user didn't
force base-size pages in the linear map, so anything that works now will
work the same way with this set applied.
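
For reference, a rough sketch of the allocation logic described above
(hypothetical helper name; the actual series is more involved):

static phys_addr_t __init alloc_crash_mem(phys_addr_t size, phys_addr_t limit)
{
	phys_addr_t base;

	/*
	 * Try a PUD-aligned block first so that it can be remapped with
	 * base pages and protected for post-mortem analysis.
	 */
	base = memblock_phys_alloc_range(ALIGN(size, PUD_SIZE), PUD_SIZE,
					 0, limit);
	if (base)
		return base;

	/* Fallback: exact size, stays unprotected, matches old behaviour. */
	return memblock_phys_alloc_range(size, SZ_2M, 0, limit);
}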
 
> The crashkernel reservation happens after the linear map is created
> because the reservation needs to know the DMA zone boundary,
> arm64_dma_phys_limit. If we can deduce that before bootmem_init(), the
> reservation can be done before the linear map. I will make an attempt
> at that. If it still can't be accepted, we would like to take off the
> crashkernel region protection on arm64 for now.

I doubt it would be easy because arm64_dma_phys_limit is determined after
parsing of the device tree and there might be memory allocations of
possibly unmapped memory during the parsing.
 
> Thanks
> Baoquan
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH 1/3] memblock: define functions to set the usable memory range

2022-01-13 Thread Mike Rapoport
On Tue, Jan 11, 2022 at 08:44:41PM +, Frank van der Linden wrote:
> On Tue, Jan 11, 2022 at 12:31:58PM +0200, Mike Rapoport wrote:
> > > --- a/include/linux/memblock.h
> > > +++ b/include/linux/memblock.h
> > > @@ -481,6 +481,8 @@ phys_addr_t memblock_reserved_size(void);
> > >  phys_addr_t memblock_start_of_DRAM(void);
> > >  phys_addr_t memblock_end_of_DRAM(void);
> > >  void memblock_enforce_memory_limit(phys_addr_t memory_limit);
> > > +void memblock_set_usable_range(phys_addr_t base, phys_addr_t size);
> > > +void memblock_enforce_usable_range(void);
> > >  void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
> > >  void memblock_mem_limit_remove_map(phys_addr_t limit);
> > 
> > We already have 3 very similar interfaces that deal with memory capping.
> > Now you suggest adding a fourth that will "generically" solve a single use
> > case of DT, EFI and kdump interaction on arm64.
> > 
> > Looks like a workaround for a fundamental issue of incompatibility between
> > DT and EFI wrt memory registration.
> 
> Yep, I figured this would be the main argument against this - arm64
> already added several other more-or-less special cased interfaces over
> time.
> 
> I'm more than happy to solve this in a different way.
> 
> What would you suggest:
> 
> 1) Try to merge the similar interfaces into one.
> 2) Just deal with it at a lower (arm64) level?
> 3) Some other way?

We've discussed this with Ard on IRC, and our conclusion was that on arm64
the kdump kernel should have memblock.memory exactly the same as the normal
kernel. Then, the memory outside usable-memory-range should be reserved so
that the kdump kernel won't step over it.

With that, simple (untested) patch below could be what we need:

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bdca35284ceb..371418dffaf1 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1275,7 +1275,8 @@ void __init early_init_dt_scan_nodes(void)
of_scan_flat_dt(early_init_dt_scan_memory, NULL);
 
/* Handle linux,usable-memory-range property */
-   memblock_cap_memory_range(cap_mem_addr, cap_mem_size);
+   memblock_reserve(0, cap_mem_addr);
+   memblock_reserve(cap_mem_addr + cap_mem_size, PHYS_ADDR_MAX);
 }
 
 bool __init early_init_dt_scan(void *params)

> Thanks,
> 
> - Frank
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH 1/3] memblock: define functions to set the usable memory range

2022-01-12 Thread Mike Rapoport
On Tue, Jan 11, 2022 at 08:44:41PM +, Frank van der Linden wrote:
> On Tue, Jan 11, 2022 at 12:31:58PM +0200, Mike Rapoport wrote:
> > > --- a/include/linux/memblock.h
> > > +++ b/include/linux/memblock.h
> > > @@ -481,6 +481,8 @@ phys_addr_t memblock_reserved_size(void);
> > >  phys_addr_t memblock_start_of_DRAM(void);
> > >  phys_addr_t memblock_end_of_DRAM(void);
> > >  void memblock_enforce_memory_limit(phys_addr_t memory_limit);
> > > +void memblock_set_usable_range(phys_addr_t base, phys_addr_t size);
> > > +void memblock_enforce_usable_range(void);
> > >  void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
> > >  void memblock_mem_limit_remove_map(phys_addr_t limit);
> > 
> > We already have 3 very similar interfaces that deal with memory capping.
> > Now you suggest adding a fourth that will "generically" solve a single use
> > case of DT, EFI and kdump interaction on arm64.
> > 
> > Looks like a workaround for a fundamental issue of incompatibility between
> > DT and EFI wrt memory registration.
> 
> Yep, I figured this would be the main argument against this - arm64
> already added several other more-or-less special cased interfaces over
> time.
> 
> I'm more than happy to solve this in a different way.
> 
> What would you suggest:
> 
> 1) Try to merge the similar interfaces into one.

This could be a nice cleanup regardless of how we handle
"linux,usable-memory-range".

> 2) Just deal with it at a lower (arm64) level?

Probably it will be the simplest solution in the short term.

> 3) Some other way?

I'm not enough of an expert on DT and EFI to see how they communicate the
linux,usable-memory-range property.

One thought I have is, since we already create a DT for kexec/kdump, why
can't we add some data to the EFI memory description similar to
linux,usable-memory-range?

Another thing is, if we could presume that DT and EFI are consistent in
their view of what the span of the physical memory is, we could drop
memblock_remove(EVERYTHING) and early_init_dt_add_memory_arch() from
efi_init::reserve_regions(), and then the loop over EFI memory descriptors
would only take care of reserved and nomap regions.

> Thanks,
> 
> - Frank
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH 1/3] memblock: define functions to set the usable memory range

2022-01-11 Thread Mike Rapoport
On Mon, Jan 10, 2022 at 09:08:07PM +, Frank van der Linden wrote:
> Some architectures might limit the usable memory range based
> on a firmware property, like "linux,usable-memory-range"
> for ARM crash kernels. This limit needs to be enforced after
> firmware memory map processing has been done, which might be
> e.g. FDT or EFI, or both.
> 
> Define an interface for it that is firmware type agnostic.
> 
> Signed-off-by: Frank van der Linden 
> ---
>  include/linux/memblock.h |  2 ++
>  mm/memblock.c| 37 +
>  2 files changed, 39 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 34de69b3b8ba..6128efa50d33 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -481,6 +481,8 @@ phys_addr_t memblock_reserved_size(void);
>  phys_addr_t memblock_start_of_DRAM(void);
>  phys_addr_t memblock_end_of_DRAM(void);
>  void memblock_enforce_memory_limit(phys_addr_t memory_limit);
> +void memblock_set_usable_range(phys_addr_t base, phys_addr_t size);
> +void memblock_enforce_usable_range(void);
>  void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
>  void memblock_mem_limit_remove_map(phys_addr_t limit);

We already have 3 very similar interfaces that deal with memory capping.
Now you suggest adding a fourth that will "generically" solve a single use
case of DT, EFI and kdump interaction on arm64.

Looks like a workaround for a fundamental issue of incompatibility between
DT and EFI wrt memory registration.

>  bool memblock_is_memory(phys_addr_t addr);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5096500b2647..cb961965f3ad 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -101,6 +101,7 @@ unsigned long max_low_pfn;
>  unsigned long min_low_pfn;
>  unsigned long max_pfn;
>  unsigned long long max_possible_pfn;
> +phys_addr_t usable_start, usable_size;
>
>  static struct memblock_region 
> memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
>  static struct memblock_region 
> memblock_reserved_init_regions[INIT_MEMBLOCK_RESERVED_REGIONS] 
> __initdata_memblock;
> @@ -1715,6 +1716,42 @@ void __init memblock_cap_memory_range(phys_addr_t 
> base, phys_addr_t size)
>   base + size, PHYS_ADDR_MAX);
>  }
>  
> +/**
> + * memblock_set_usable_range - set usable memory range
> + * @base: physical address that is the start of the range
> + * @size: size of the range.
> + *
> + * Used when a firmware property limits the range of usable
> + * memory, like for the linux,usable-memory-range property
> + * used by ARM crash kernels.
> + */
> +void __init memblock_set_usable_range(phys_addr_t base, phys_addr_t size)
> +{
> + usable_start = base;
> + usable_size = size;
> +}
> +
> +/**
> + * memblock_enforce_usable_range - cap memory ranges to usable range
> + *
> + * Some architectures call this during boot after firmware memory ranges
> + * have been scanned, to make sure they fall within the usable range
> + * set by memblock_set_usable_range.
> + *
> + * This may be called more than once if there are multiple firmware sources
> + * for memory ranges.
> + *
> + * Avoid "no memory registered" warning - the warning itself is
> + * useful, but we know this can be called with no registered
> + * memory (e.g. when the synthetic DT for the crash kernel has
> + * been parsed on EFI arm64 systems).
> + */
> +void __init memblock_enforce_usable_range(void)
> +{
> + if (memblock_memory->total_size)
> + memblock_cap_memory_range(usable_start, usable_size);
> +}
> +
>  void __init memblock_mem_limit_remove_map(phys_addr_t limit)
>  {
>   phys_addr_t max_addr;
> -- 
> 2.32.0
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v2 3/5] memblock: allow to specify flags with memblock_add_node()

2021-10-05 Thread Mike Rapoport
On Mon, Oct 04, 2021 at 11:36:03AM +0200, David Hildenbrand wrote:
> We want to specify flags when hotplugging memory. Let's prepare to pass
> flags to memblock_add_node() by adjusting all existing users.
> 
> Note that when hotplugging memory the system is already up and running
> and we might have concurrent memblock users: for example, while we're
> hotplugging memory, kexec_file code might search for suitable memory
> regions to place kexec images. It's important to add the memory directly
> to memblock via a single call with the right flags, instead of adding the
> memory first and apply flags later: otherwise, concurrent memblock users
> might temporarily stumble over memblocks with wrong flags, which will be
> important in a follow-up patch that introduces a new flag to properly
> handle add_memory_driver_managed().
> 
> Acked-by: Geert Uytterhoeven 
> Acked-by: Heiko Carstens 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Mike Rapoport 

> ---
>  arch/arc/mm/init.c   | 4 ++--
>  arch/ia64/mm/contig.c| 2 +-
>  arch/ia64/mm/init.c  | 2 +-
>  arch/m68k/mm/mcfmmu.c| 3 ++-
>  arch/m68k/mm/motorola.c  | 6 --
>  arch/mips/loongson64/init.c  | 4 +++-
>  arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
>  arch/s390/kernel/setup.c | 3 ++-
>  include/linux/memblock.h | 3 ++-
>  include/linux/mm.h   | 2 +-
>  mm/memblock.c| 9 +
>  mm/memory_hotplug.c  | 2 +-
>  12 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 699ecf119641..110eb69e9bee 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 
> size)
>  
>   low_mem_sz = size;
>   in_use = 1;
> - memblock_add_node(base, size, 0);
> + memblock_add_node(base, size, 0, MEMBLOCK_NONE);
>   } else {
>  #ifdef CONFIG_HIGHMEM
>   high_mem_start = base;
>   high_mem_sz = size;
>   in_use = 1;
> - memblock_add_node(base, size, 1);
> + memblock_add_node(base, size, 1, MEMBLOCK_NONE);
>   memblock_reserve(base, size);
>  #endif
>   }
> diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
> index 42e025cfbd08..24901d809301 100644
> --- a/arch/ia64/mm/contig.c
> +++ b/arch/ia64/mm/contig.c
> @@ -153,7 +153,7 @@ find_memory (void)
>   efi_memmap_walk(find_max_min_low_pfn, NULL);
>   max_pfn = max_low_pfn;
>  
> - memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
> + memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
>  
>   find_initrd();
>  
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 5c6da8d83c1a..5d165607bf35 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int 
> nid)
>  #endif
>  
>   if (start < end)
> - memblock_add_node(__pa(start), end - start, nid);
> + memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
>   return 0;
>  }
>  
> diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
> index eac9dde65193..6f1f25125294 100644
> --- a/arch/m68k/mm/mcfmmu.c
> +++ b/arch/m68k/mm/mcfmmu.c
> @@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
>   m68k_memory[0].addr = _rambase;
>   m68k_memory[0].size = _ramend - _rambase;
>  
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> +   MEMBLOCK_NONE);
>  
>   /* compute total pages in system */
>   num_pages = PFN_DOWN(_ramend - _rambase);
> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
> index 9f3f77785aa7..2b05bb2bac00 100644
> --- a/arch/m68k/mm/motorola.c
> +++ b/arch/m68k/mm/motorola.c
> @@ -410,7 +410,8 @@ void __init paging_init(void)
>  
>   min_addr = m68k_memory[0].addr;
>   max_addr = min_addr + m68k_memory[0].size;
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> +   MEMBLOCK_NONE);
>   for (i = 1; i < m68k_num_memory;) {
>   if (m68k_memory[i].addr < min_addr) {
>   printk("Ignoring memory chunk at 0x%lx:0x%lx before the 
> first chunk\n",
> @@ -421,7 +422,8 @@ void __init paging_init(void)
>   (m68k_num_memory - i) * sizeof(str

Re: [PATCH v2 4/5] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-10-05 Thread Mike Rapoport
On Mon, Oct 04, 2021 at 11:36:04AM +0200, David Hildenbrand wrote:
> Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
> indicating that we're dealing with a memory region that is never
> indicated in the firmware-provided memory map, but always detected and
> added by a driver.
> 
> Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
> regions like ordinary MEMBLOCK_NONE memory regions -- for example, when
> selecting memory regions to add to the vmcore for dumping in the
> crashkernel via for_each_mem_range().
> 
> However, especially kexec_file is not supposed to select such memblocks via
> for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
> kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
> without CONFIG_ARCH_KEEP_MEMBLOCK.
> 
> We'll make sure that memory hotplug code sets the flag where applicable
> (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
> that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
> support.
> 
> Note that kexec *must not* indicate this memory to the second kernel
> and *must not* place kexec-images on this memory. Let's add a comment to
> kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
> now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
> locate_mem_hole_callback() for kexec_walk_resources().
> 
> Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
> semantics:
>   MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
>   firmware-provided memory map and added to the system early during
>   boot; kexec *has to* indicate this memory to the second kernel and
>   can place kexec-images on this memory. After memory hotunplug,
>   kexec has to be re-armed. We mostly ignore this flag when
>   "movable_node" is not set on the kernel command line, because
>   then we're told to not care about hotunpluggability of such
>   memory regions.
> 
>   MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
>   the firmware-provided memory map; this memory is always detected
>   and added to the system by a driver; memory might not actually be
>   physically hotunpluggable. kexec *must not* indicate this memory to
>   the second kernel and *must not* place kexec-images on this memory.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Mike Rapoport 

> ---
>  include/linux/memblock.h | 16 ++--
>  kernel/kexec_file.c  |  5 +
>  mm/memblock.c|  4 
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 2bc726e43a1b..b3b29ccf91f3 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -37,12 +37,17 @@ extern unsigned long long max_possible_pfn;
>   * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
>   * reserved in the memory map; refer to memblock_mark_nomap() description
>   * for further details
> + * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
> + * via a driver, and never indicated in the firmware-provided memory map as
> + * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
> + * kernel resource tree.
>   */
>  enum memblock_flags {
>   MEMBLOCK_NONE   = 0x0,  /* No special request */
>   MEMBLOCK_HOTPLUG= 0x1,  /* hotpluggable region */
>   MEMBLOCK_MIRROR = 0x2,  /* mirrored region */
>   MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
> + MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
>  };
>  
>  /**
> @@ -213,7 +218,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
> memblock_type *type,
>   */
>  #define for_each_mem_range(i, p_start, p_end) \
>   __for_each_mem_range(i, , NULL, NUMA_NO_NODE,   \
> -  MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> +  MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
> +  p_start, p_end, NULL)
>  
>  /**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
> @@ -224,7 +230,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
> memblock_type *type,
>   */
>  #define for_each_mem_range_rev(i, p_start, p_end)\
>   __for_each_mem_range_rev(i, , NULL, NUMA_NO_NODE, \
> -  MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> +  MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
> +  p_start, p_end, NULL)
>  
>  /**
>   * for_

Re: [PATCH v2 2/5] memblock: improve MEMBLOCK_HOTPLUG documentation

2021-10-05 Thread Mike Rapoport
On Mon, Oct 04, 2021 at 11:36:02AM +0200, David Hildenbrand wrote:
> The description of MEMBLOCK_HOTPLUG is currently short and consequently
> misleading: we're actually dealing with a memory region that might get
> hotunplugged later (i.e., the platform+firmware supports it), yet it is
> indicated in the firmware-provided memory map as system ram that will just
> get used by the system for any purpose when not taking special care. The
> firmware marked this memory region as a hot(un)plugged (e.g., hotplugged
> before reboot), implying that it might get hotunplugged again later.
> 
> Whether we consider this information depends on the "movable_node" kernel
> commandline parameter: only with "movable_node" set, we'll try keeping
> this memory hotunpluggable, for example, by not serving early allocations
> from this memory region and by letting the buddy manage it using the
> ZONE_MOVABLE.
> 
> Let's make this clearer by extending the documentation.
> 
> Note: kexec *has to* indicate this memory to the second kernel. With
> "movable_node" set, we don't want to place kexec-images on this memory.
> Without "movable_node" set, we don't care and can place kexec-images on
> this memory. In both cases, after successful memory hotunplug, kexec has to
> be re-armed to update the memory map for the second kernel and to place the
> kexec-images somewhere else.
> 
> Signed-off-by: David Hildenbrand 

Reviewed-by: Mike Rapoport 

> ---
>  include/linux/memblock.h | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 34de69b3b8ba..4ee8dd2d63a7 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -28,7 +28,11 @@ extern unsigned long long max_possible_pfn;
>  /**
>   * enum memblock_flags - definition of memory region attributes
>   * @MEMBLOCK_NONE: no special request
> - * @MEMBLOCK_HOTPLUG: hotpluggable region
> + * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
> + * map during early boot as hot(un)pluggable system RAM (e.g., memory range
> + * that might get hotunplugged later). With "movable_node" set on the kernel
> + * commandline, try keeping this memory region hotunpluggable. Does not apply
> + * to memblocks added ("hotplugged") after early boot.
>   * @MEMBLOCK_MIRROR: mirrored region
>   * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
>   * reserved in the memory map; refer to memblock_mark_nomap() description
> -- 
> 2.31.1
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-10-01 Thread Mike Rapoport
On Fri, Oct 01, 2021 at 10:04:24AM +0200, David Hildenbrand wrote:
> On 30.09.21 23:21, Mike Rapoport wrote:
> > On Wed, Sep 29, 2021 at 06:54:01PM +0200, David Hildenbrand wrote:
> > > On 29.09.21 18:39, Mike Rapoport wrote:
> > > > Hi,
> > > > 
> > > > On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:
> > > > > Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
> > > > > Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such 
> > > > > memory
> > > > > like ordinary MEMBLOCK_NONE memory -- for example, when selecting 
> > > > > memory
> > > > > regions to add to the vmcore for dumping in the crashkernel via
> > > > > for_each_mem_range().
> > > > Can you please elaborate on the difference in semantics of 
> > > > MEMBLOCK_HOTPLUG
> > > > and MEMBLOCK_DRIVER_MANAGED?
> > > > Unless I'm missing something they both mark memory that can be unplugged
> > > > anytime and so it should not be used in certain cases. Why is there a 
> > > > need
> > > > for a new flag?
> > > 
> > > In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG.
> > > MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into 
> > > the
> > > details it won't work as is.
> > > 
> > > MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later 
> > > get
> > > hotunplugged again and should be placed into ZONE_MOVABLE if the
> > > "movable_node" kernel parameter is set.
> > > 
> > > The confusing part is that we talk about "hotpluggable" but really mean
> > > "hotunpluggable": the reason is that HW flags DIMM slots that can later be
> > > hotplugged as "hotpluggable" even though there is already something
> > > hotplugged.
> > 
> > The MEMBLOCK_HOTPLUG name is indeed somewhat confusing, but still its core
> > meaning is "this memory may be removed", which does not differ from what
> > IORESOURCE_SYSRAM_DRIVER_MANAGED means.
> > 
> > MEMBLOCK_HOTPLUG regions are indeed placed into ZONE_MOVABLE, but more
> > importantly, they are avoided when we allocate memory from memblock.
> > 
> > So, in my view, both flags mean that the memory may be removed and it
> > should not be used for certain types of allocations.
> 
> The semantics are different:
> 
> MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
> firmware-provided memory map and added to the system early during boot; we
> want this memory to be managed by ZONE_MOVABLE with "movable_node" set on
> the kernel command line, because only then we want it to be hotpluggable
> again. kexec *has to* indicate this memory to the second kernel and can
> place kexec-images on this memory. After memory hotunplug, kexec has to be
> re-armed.
> 
> MEMBLOCK_DRIVER_MANAGED: memory is not indicated as System RAM" in the
> firmware-provided memory map; this memory is always detected and added to
> the system by a driver; memory might not actually be physically
> hotunpluggable and the ZONE selection does not depend on "movable_node".
> kexec *must not* indicate this memory to the second kernel and *must not*
> place kexec-images on this memory.

OK, this clarifies things.
This explanation should be part of the changelog. The sentences about the
zone selection could probably be skipped, because they are less important
for this case. E.g. something like:

MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
firmware-provided memory map and added to the system early during boot;
kexec *has to* indicate this memory to the second kernel and can place
kexec-images on this memory. After memory hotunplug, kexec has to be
re-armed.

MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in the
firmware-provided memory map; this memory is always detected and added to
the system by a driver; memory might not actually be physically
hotunpluggable.  kexec *must not* indicate this memory to the second kernel
and *must not* place kexec-images on this memory.

-- 
Sincerely yours,
Mike.



Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-09-30 Thread Mike Rapoport
On Wed, Sep 29, 2021 at 06:54:01PM +0200, David Hildenbrand wrote:
> On 29.09.21 18:39, Mike Rapoport wrote:
> > Hi,
> > 
> > On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:
> > > Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
> > > Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
> > > like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory
> > > regions to add to the vmcore for dumping in the crashkernel via
> > > for_each_mem_range().
> > Can you please elaborate on the difference in semantics of MEMBLOCK_HOTPLUG
> > and MEMBLOCK_DRIVER_MANAGED?
> > Unless I'm missing something they both mark memory that can be unplugged
> > anytime and so it should not be used in certain cases. Why is there a need
> > for a new flag?
> 
> In the cover letter I have "Alternative B: Reuse MEMBLOCK_HOTPLUG.
> MEMBLOCK_HOTPLUG serves a different purpose, though.", but looking into the
> details it won't work as is.
> 
> MEMBLOCK_HOTPLUG is used to mark memory early during boot that can later get
> hotunplugged again and should be placed into ZONE_MOVABLE if the
> "movable_node" kernel parameter is set.
> 
> The confusing part is that we talk about "hotpluggable" but really mean
> "hotunpluggable": the reason is that HW flags DIMM slots that can later be
> hotplugged as "hotpluggable" even though there is already something
> hotplugged.

The MEMBLOCK_HOTPLUG name is indeed somewhat confusing, but still its core
meaning is "this memory may be removed", which does not differ from what
IORESOURCE_SYSRAM_DRIVER_MANAGED means.

MEMBLOCK_HOTPLUG regions are indeed placed into ZONE_MOVABLE, but more
importantly, they are avoided when we allocate memory from memblock.

So, in my view, both flags mean that the memory may be removed and it
should not be used for certain types of allocations.
 
> For example, ranges in the ACPI SRAT that are marked as
> ACPI_SRAT_MEM_HOT_PLUGGABLE will be marked MEMBLOCK_HOTPLUG early during
> boot (drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()). Later, we
> use that information to size ZONE_MOVABLE
> (mm/page_alloc.c:find_zone_movable_pfns_for_nodes()). This will make sure
> that these "hotpluggable" DIMMs can later get hotunplugged.
> 
> Also, see should_skip_region() how this relates to the "movable_node" kernel
> parameter:
> 
>   /* skip hotpluggable memory regions if needed */
>   if (movable_node_is_enabled() && memblock_is_hotpluggable(m) &&
>   !(flags & MEMBLOCK_HOTPLUG))
>   return true;

Hmm, I think that the movable_node_is_enabled() check here is excessive,
but I suspect we cannot simply remove it without breaking anything.

I'll take a deeper look at the potential consequences.
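
The hypothetical simplification under discussion would be dropping the
movable_node check, i.e. something like:

	/* skip hotpluggable memory regions unless explicitly requested */
	if (memblock_is_hotpluggable(m) && !(flags & MEMBLOCK_HOTPLUG))
		return true;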

BTW, is there anything that prevents placing kexec images in hot-unpluggable
memory that was cold-plugged at boot?

-- 
Sincerely yours,
Mike.



Re: [PATCH v1 3/4] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

2021-09-29 Thread Mike Rapoport
Hi,

On Mon, Sep 27, 2021 at 05:05:17PM +0200, David Hildenbrand wrote:
> Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED.
> Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
> like ordinary MEMBLOCK_NONE memory -- for example, when selecting memory
> regions to add to the vmcore for dumping in the crashkernel via
> for_each_mem_range().
 
Can you please elaborate on the difference in semantics of MEMBLOCK_HOTPLUG
and MEMBLOCK_DRIVER_MANAGED?
Unless I'm missing something they both mark memory that can be unplugged
anytime and so it should not be used in certain cases. Why is there a need
for a new flag?

> However, especially kexec_file is not supposed to select such memblocks via
> for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
> kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
> without CONFIG_ARCH_KEEP_MEMBLOCK.
> 
> Let's document why kexec_walk_memblock() won't try placing images on
> areas marked MEMBLOCK_DRIVER_MANAGED -- similar to
> IORESOURCE_SYSRAM_DRIVER_MANAGED handling in locate_mem_hole_callback()
> via kexec_walk_resources().
> 
> We'll make sure that memory hotplug code sets the flag where applicable
> (IORESOURCE_SYSRAM_DRIVER_MANAGED) next. This prepares architectures
> that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
> support.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/memblock.h | 16 ++--
>  kernel/kexec_file.c  |  5 +
>  mm/memblock.c|  4 
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b49a58f621bc..7d8d656d5082 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -33,12 +33,17 @@ extern unsigned long long max_possible_pfn;
>   * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
>   * reserved in the memory map; refer to memblock_mark_nomap() description
>   * for further details
> + * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected via a 
> driver,
> + * corresponding to IORESOURCE_SYSRAM_DRIVER_MANAGED in the kernel resource
> + * tree. Especially kexec should never use this memory for placing images and
> + * shouldn't expose this memory to the second kernel.
>   */
>  enum memblock_flags {
>   MEMBLOCK_NONE   = 0x0,  /* No special request */
>   MEMBLOCK_HOTPLUG= 0x1,  /* hotpluggable region */
>   MEMBLOCK_MIRROR = 0x2,  /* mirrored region */
>   MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
> + MEMBLOCK_DRIVER_MANAGED = 0x8,  /* always detected via a driver */
>  };
>  
>  /**
> @@ -209,7 +214,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
> memblock_type *type,
>   */
>  #define for_each_mem_range(i, p_start, p_end) \
>   __for_each_mem_range(i, , NULL, NUMA_NO_NODE,   \
> -  MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> +  MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
> +  p_start, p_end, NULL)
>  
>  /**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
> @@ -220,7 +226,8 @@ static inline void __next_physmem_range(u64 *idx, struct 
> memblock_type *type,
>   */
>  #define for_each_mem_range_rev(i, p_start, p_end)\
>   __for_each_mem_range_rev(i, , NULL, NUMA_NO_NODE, \
> -  MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
> +  MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
> +  p_start, p_end, NULL)
>  
>  /**
>   * for_each_reserved_mem_range - iterate over all reserved memblock areas
> @@ -250,6 +257,11 @@ static inline bool memblock_is_nomap(struct 
> memblock_region *m)
>   return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> +static inline bool memblock_is_driver_managed(struct memblock_region *m)
> +{
> + return m->flags & MEMBLOCK_DRIVER_MANAGED;
> +}
> +
>  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>   unsigned long  *end_pfn);
>  void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 33400ff051a8..8347fc158d2b 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct kexec_buf *kbuf,
>   if (kbuf->image->type == KEXEC_TYPE_CRASH)
>   return func(_res, kbuf);
>  
> + /*
> +  * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
> +  * IORESOURCE_SYSRAM_DRIVER_MANAGED handling in
> +  * locate_mem_hole_callback().
> +  */
>   if (kbuf->top_down) {
>   for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
>   , , NULL) {
> diff --git a/mm/memblock.c b/mm/memblock.c
> 

Re: [PATCH v1 2/4] memblock: allow to specify flags with memblock_add_node()

2021-09-29 Thread Mike Rapoport
On Mon, Sep 27, 2021 at 05:05:16PM +0200, David Hildenbrand wrote:
> We want to specify flags when hotplugging memory. Let's prepare to pass
> flags to memblock_add_node() by adjusting all existing users.
> 
> Note that when hotplugging memory the system is already up and running
> and we don't want to add the memory first and apply flags later: it
> should happen within one memblock call.

Why is it important that the system is up, and why should it happen in a
single call?
I don't mind adding a flags parameter to memblock_add_node(), but this
changelog does not really explain the reasons for doing it.
 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/arc/mm/init.c   | 4 ++--
>  arch/ia64/mm/contig.c| 2 +-
>  arch/ia64/mm/init.c  | 2 +-
>  arch/m68k/mm/mcfmmu.c| 3 ++-
>  arch/m68k/mm/motorola.c  | 6 --
>  arch/mips/loongson64/init.c  | 4 +++-
>  arch/mips/sgi-ip27/ip27-memory.c | 3 ++-
>  arch/s390/kernel/setup.c | 3 ++-
>  include/linux/memblock.h | 3 ++-
>  include/linux/mm.h   | 2 +-
>  mm/memblock.c| 9 +
>  mm/memory_hotplug.c  | 2 +-
>  12 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 699ecf119641..110eb69e9bee 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 
> size)
>  
>   low_mem_sz = size;
>   in_use = 1;
> - memblock_add_node(base, size, 0);
> + memblock_add_node(base, size, 0, MEMBLOCK_NONE);
>   } else {
>  #ifdef CONFIG_HIGHMEM
>   high_mem_start = base;
>   high_mem_sz = size;
>   in_use = 1;
> - memblock_add_node(base, size, 1);
> + memblock_add_node(base, size, 1, MEMBLOCK_NONE);
>   memblock_reserve(base, size);
>  #endif
>   }
> diff --git a/arch/ia64/mm/contig.c b/arch/ia64/mm/contig.c
> index 42e025cfbd08..24901d809301 100644
> --- a/arch/ia64/mm/contig.c
> +++ b/arch/ia64/mm/contig.c
> @@ -153,7 +153,7 @@ find_memory (void)
>   efi_memmap_walk(find_max_min_low_pfn, NULL);
>   max_pfn = max_low_pfn;
>  
> - memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
> + memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
>  
>   find_initrd();
>  
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 5c6da8d83c1a..5d165607bf35 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int 
> nid)
>  #endif
>  
>   if (start < end)
> - memblock_add_node(__pa(start), end - start, nid);
> + memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
>   return 0;
>  }
>  
> diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
> index eac9dde65193..6f1f25125294 100644
> --- a/arch/m68k/mm/mcfmmu.c
> +++ b/arch/m68k/mm/mcfmmu.c
> @@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
>   m68k_memory[0].addr = _rambase;
>   m68k_memory[0].size = _ramend - _rambase;
>  
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> +   MEMBLOCK_NONE);
>  
>   /* compute total pages in system */
>   num_pages = PFN_DOWN(_ramend - _rambase);
> diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
> index 3a653f0a4188..e80c5d7e6728 100644
> --- a/arch/m68k/mm/motorola.c
> +++ b/arch/m68k/mm/motorola.c
> @@ -410,7 +410,8 @@ void __init paging_init(void)
>  
>   min_addr = m68k_memory[0].addr;
>   max_addr = min_addr + m68k_memory[0].size;
> - memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
> + memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
> +   MEMBLOCK_NONE);
>   for (i = 1; i < m68k_num_memory;) {
>   if (m68k_memory[i].addr < min_addr) {
>   printk("Ignoring memory chunk at 0x%lx:0x%lx before the 
> first chunk\n",
> @@ -421,7 +422,8 @@ void __init paging_init(void)
>   (m68k_num_memory - i) * sizeof(struct 
> m68k_mem_info));
>   continue;
>   }
> - memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
> + memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
> +   MEMBLOCK_NONE);
>   addr = m68k_memory[i].addr + m68k_memory[i].size;
>   if (addr > max_addr)
>   max_addr = addr;
> diff --git a/arch/mips/loongson64/init.c b/arch/mips/loongson64/init.c
> index 76e0a9636a0e..4ac5ba80bbf6 100644
> --- a/arch/mips/loongson64/init.c
> +++ b/arch/mips/loongson64/init.c
> @@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
>  

Re: [PATCH v5 1/9] MIPS: Avoid future duplicate elf core header reservation

2021-08-23 Thread Mike Rapoport
On Mon, Aug 23, 2021 at 09:44:55AM -0500, Rob Herring wrote:
> On Mon, Aug 23, 2021 at 8:10 AM Mike Rapoport  wrote:
> >
> > On Mon, Aug 23, 2021 at 12:17:50PM +0200, Geert Uytterhoeven wrote:
> > > Hi Mike,
> > >
> > > On Mon, Aug 16, 2021 at 7:52 AM Mike Rapoport  wrote:
> > > > On Wed, Aug 11, 2021 at 10:50:59AM +0200, Geert Uytterhoeven wrote:
> > > > > Prepare for early_init_fdt_scan_reserved_mem() reserving the memory
> > > > > occupied by an elf core header described in the device tree.
> > > > > As arch_mem_init() calls early_init_fdt_scan_reserved_mem() before
> > > > > mips_reserve_vmcore(), the latter needs to check if the memory has
> > > > > already been reserved before.
> > > >
> > Doing memblock_reserve() for the same region is usually fine; did you
> > > > encounter any issues without this patch?
> > >
> > > Does it also work if the same region is part of an earlier larger
> > > reservation?  I am no memblock expert, so I don't know.
> > > I didn't run into any issues, as my MIPS platform is non-DT, but I
> > > assume arch/arm64/mm/init.c:reserve_elfcorehdr() had the check for
> > > a reason.
> >
> > The memory will be reserved regardless of the earlier reservation; the
> > issue may appear when the reservations are made for different purposes,
> > e.g. if there was a crash kernel allocation before the reservation of
> > the elfcorehdr.
> >
> > The check in such a case will prevent the second reservation, but, at
> > least in arch/arm64/mm/init.c:reserve_elfcorehdr(), it does not seem to
> > prevent different users of the overlapping regions from stepping on each
> > other's toes.
> 
> If the kernel has been passed in overlapping regions, is there
> anything you can do other than hope to get a message out?

Nothing really. I've been thinking about adding flags to memblock.reserved
to at least distinguish firmware regions from the kernel allocations, but I
never got to that.
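
Purely illustrative (nothing like this exists at this point, and the names
are made up), such flags could look like:

/* hypothetical origin flags for memblock.reserved regions */
enum memblock_rsrv_flags {
	MEMBLOCK_RSRV_NONE	= 0x0,	/* ordinary kernel allocation */
	MEMBLOCK_RSRV_FIRMWARE	= 0x1,	/* region described by firmware */
};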
 
> > Moreover, arm64::reserve_elfcorehdr() seems buggy to me, because if there
> > is only a partial overlap of the elfcorehdr with a previous reservation,
> > the non-overlapping part of the elfcorehdr won't get reserved at all.
> 
> What do you suggest as the arm64 version is not the common version?

I'm not really familiar with crash dump internals, so I don't know if
resetting elfcorehdr_addr to ELFCORE_ADDR_ERR is a good idea. I think at
least arm64::reserve_elfcorehdr() should reserve the entire elfcorehdr area
regardless of the overlap. Otherwise it might get overwritten by a random
memblock_alloc().
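
Something along these lines (an untested sketch; memblock_reserve() copes
with overlapping ranges, so reserving the whole area is safe):

static void __init reserve_elfcorehdr(void)
{
	if (!elfcorehdr_size)
		return;

	if (memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size))
		pr_warn("elfcorehdr overlaps an existing reservation\n");

	/* reserve the entire area even on a (partial) overlap */
	memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
}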

> Rob

-- 
Sincerely yours,
Mike.



Re: [PATCH v5 1/9] MIPS: Avoid future duplicate elf core header reservation

2021-08-23 Thread Mike Rapoport
On Mon, Aug 23, 2021 at 12:17:50PM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Mon, Aug 16, 2021 at 7:52 AM Mike Rapoport  wrote:
> > On Wed, Aug 11, 2021 at 10:50:59AM +0200, Geert Uytterhoeven wrote:
> > > Prepare for early_init_fdt_scan_reserved_mem() reserving the memory
> > > occupied by an elf core header described in the device tree.
> > > As arch_mem_init() calls early_init_fdt_scan_reserved_mem() before
> > > mips_reserve_vmcore(), the latter needs to check if the memory has
> > > already been reserved before.
> >
> > Doing memblock_reserve() for the same region is usually fine; did you
> > encounter any issues without this patch?
> 
> Does it also work if the same region is part of an earlier larger
> reservation?  I am no memblock expert, so I don't know.
> I didn't run into any issues, as my MIPS platform is non-DT, but I
> assume arch/arm64/mm/init.c:reserve_elfcorehdr() had the check for
> a reason.

The memory will be reserved regardless of the earlier reservation; the
issue may appear when the reservations are made for different purposes,
e.g. if there was a crash kernel allocation before the reservation of the
elfcorehdr.

The check in such a case will prevent the second reservation, but, at least
in arch/arm64/mm/init.c:reserve_elfcorehdr(), it does not seem to prevent
different users of the overlapping regions from stepping on each other's
toes.

Moreover, arm64::reserve_elfcorehdr() seems buggy to me, because if there
is only a partial overlap of the elfcorehdr with a previous reservation,
the non-overlapping part of the elfcorehdr won't get reserved at all.

> Thanks!
> 
> >
> > > Note that mips_reserve_vmcore() cannot just be removed, as not all MIPS
> > > systems use DT.
> > >
> > > Signed-off-by: Geert Uytterhoeven 
> > > ---
> > > v5:
> > >   - New.
> > > ---
> > >  arch/mips/kernel/setup.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
> > > index 23a140327a0bac1b..4693add05743d78b 100644
> > > --- a/arch/mips/kernel/setup.c
> > > +++ b/arch/mips/kernel/setup.c
> > > @@ -429,7 +429,8 @@ static void __init mips_reserve_vmcore(void)
> > >   pr_info("Reserving %ldKB of memory at %ldKB for kdump\n",
> > >   (unsigned long)elfcorehdr_size >> 10, (unsigned 
> > > long)elfcorehdr_addr >> 10);
> > >
> > > - memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
> > > + if (!memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size))
> > > + memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
> > >  #endif
> > >  }
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds

-- 
Sincerely yours,
Mike.



Re: [PATCH v5 1/9] MIPS: Avoid future duplicate elf core header reservation

2021-08-15 Thread Mike Rapoport
Hi Geert,

On Wed, Aug 11, 2021 at 10:50:59AM +0200, Geert Uytterhoeven wrote:
> Prepare for early_init_fdt_scan_reserved_mem() reserving the memory
> occupied by an elf core header described in the device tree.
> As arch_mem_init() calls early_init_fdt_scan_reserved_mem() before
> mips_reserve_vmcore(), the latter needs to check if the memory has
> already been reserved before.

Doing memblock_reserve() for the same region is usually fine; did you
encounter any issues without this patch?
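
For reference, duplicate or overlapping reservations simply merge into
memblock.reserved, e.g. (purely illustrative):

	memblock_reserve(addr, size);
	/* a second, identical call merges into the existing region */
	memblock_reserve(addr, size);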
 
> Note that mips_reserve_vmcore() cannot just be removed, as not all MIPS
> systems use DT.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
> v5:
>   - New.
> ---
>  arch/mips/kernel/setup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
> index 23a140327a0bac1b..4693add05743d78b 100644
> --- a/arch/mips/kernel/setup.c
> +++ b/arch/mips/kernel/setup.c
> @@ -429,7 +429,8 @@ static void __init mips_reserve_vmcore(void)
>   pr_info("Reserving %ldKB of memory at %ldKB for kdump\n",
>   (unsigned long)elfcorehdr_size >> 10, (unsigned 
> long)elfcorehdr_addr >> 10);
>  
> - memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
> + if (!memblock_is_region_reserved(elfcorehdr_addr, elfcorehdr_size))
> + memblock_reserve(elfcorehdr_addr, elfcorehdr_size);
>  #endif
>  }
>  
> -- 
> 2.25.1
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v4 02/10] memblock: Add variables for usable memory limitation

2021-07-19 Thread Mike Rapoport
Hi Geert,

On Mon, Jul 19, 2021 at 08:59:03AM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Sun, Jul 18, 2021 at 11:31 AM Mike Rapoport  wrote:
> > On Wed, Jul 14, 2021 at 07:51:01AM -0600, Rob Herring wrote:
> > > On Wed, Jul 14, 2021 at 02:50:12PM +0200, Geert Uytterhoeven wrote:
> > > > Add two global variables (cap_mem_addr and cap_mem_size) for storing a
> > > > base address and size, describing a limited region in which memory may
> > > > be considered available for use by the kernel.  If enabled, memory
> > > > outside of this range is not available for use.
> > > >
> > > > These variables can by filled by firmware-specific code, and used in
> > > > calls to memblock_cap_memory_range() by architecture-specific code.
> > > > An example user is the parser of the "linux,usable-memory-range"
> > > > property in the DT "/chosen" node.
> > > >
> > > > Signed-off-by: Geert Uytterhoeven 
> > > > ---
> > > > This is similar to how the initial ramdisk (phys_initrd_{start,size})
> > > > and ELF core headers (elfcorehdr_{addr,size})) are handled.
> > > >
> > > > Does there exist a suitable place in the common memblock code to call
> > > > "memblock_cap_memory_range(cap_mem_addr, cap_mem_size)", or does this
> > > > have to be done in architecture-specific code?
> > >
> > > Can't you just call it from early_init_dt_scan_usablemem? If the
> > > property is present, you want to call it. If the property is not
> > > present, nothing happens.
> 
> I will have a look...
> 
> > For memblock_cap_memory_range() to work properly it should be called after
> > memory is detected and added to memblock with memblock_add[_node]()
> >
> > I'm not a huge fan of adding more globals to memblock, so if such ordering can
> > be implemented on the DT side it would be great.
> 
> Me neither ;-)
> 
> > I don't see a way to actually enforce this ordering, so maybe we'd want to
> > add a warning in memblock_cap_memory_range() if memblock.memory is empty.
> 
> "linux,usable-memory-range" is optional, and typically used only in
> crashdump kernels, so it would be a bad idea to add such a warning.

If I remember correctly, memblock_cap_memory_range() was added to support
"linux,usable-memory-range" for crashdump kernels on arm64, and if it were
called before memory is registered we might silently corrupt memory,
because the crash kernel would see all the memory as available.

So while WARN() may be too much, a pr_warn() seems quite appropriate to me.
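
E.g. something like this (an untested fragment, assuming it is placed at
the top of memblock_cap_memory_range() in mm/memblock.c):

	if (!memblock.memory.total_size) {
		pr_warn("%s: no memory registered, capping has no effect\n",
			__func__);
		return;
	}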
 
-- 
Sincerely yours,
Mike.



Re: [PATCH v4 02/10] memblock: Add variables for usable memory limitation

2021-07-18 Thread Mike Rapoport
Hi,

On Wed, Jul 14, 2021 at 07:51:01AM -0600, Rob Herring wrote:
> On Wed, Jul 14, 2021 at 02:50:12PM +0200, Geert Uytterhoeven wrote:
> > Add two global variables (cap_mem_addr and cap_mem_size) for storing a
> > base address and size, describing a limited region in which memory may
> > be considered available for use by the kernel.  If enabled, memory
> > outside of this range is not available for use.
> > 
> > These variables can by filled by firmware-specific code, and used in
> > calls to memblock_cap_memory_range() by architecture-specific code.
> > An example user is the parser of the "linux,usable-memory-range"
> > property in the DT "/chosen" node.
> > 
> > Signed-off-by: Geert Uytterhoeven 
> > ---
> > This is similar to how the initial ramdisk (phys_initrd_{start,size})
> > and ELF core headers (elfcorehdr_{addr,size})) are handled.
> > 
> > Does there exist a suitable place in the common memblock code to call
> > "memblock_cap_memory_range(cap_mem_addr, cap_mem_size)", or does this
> > have to be done in architecture-specific code?
> 
> Can't you just call it from early_init_dt_scan_usablemem? If the 
> property is present, you want to call it. If the property is not 
> present, nothing happens.

For memblock_cap_memory_range() to work properly it should be called after
memory is detected and added to memblock with memblock_add[_node]()

I'm not a huge fan of adding more globals to memblock, so if such ordering can
be implemented on the DT side it would be great.

I don't see a way to actually enforce this ordering, so maybe we'd want to
add a warning in memblock_cap_memory_range() if memblock.memory is empty.
 
> Rob

-- 
Sincerely yours,
Mike.



Re: [PATCH v3 5/9] mm: remove CONFIG_DISCONTIGMEM

2021-06-11 Thread Mike Rapoport
On Fri, Jun 11, 2021 at 01:53:48PM -0700, Stephen Brennan wrote:
> Mike Rapoport  writes:
> > From: Mike Rapoport 
> >
> > There are no architectures that support DISCONTIGMEM left.
> >
> > Remove the configuration option and the dead code it was guarding in the
> > generic memory management code.
> >
> > Signed-off-by: Mike Rapoport 
> > ---
> >  include/asm-generic/memory_model.h | 37 --
> >  include/linux/mmzone.h |  8 ---
> >  mm/Kconfig | 25 +++-
> >  mm/page_alloc.c| 13 ---
> >  4 files changed, 12 insertions(+), 71 deletions(-)
> >
> > diff --git a/include/asm-generic/memory_model.h 
> > b/include/asm-generic/memory_model.h
> > index 7637fb46ba4f..a2c8ed60233a 100644
> > --- a/include/asm-generic/memory_model.h
> > +++ b/include/asm-generic/memory_model.h
> > @@ -6,47 +6,18 @@
> >  
> >  #ifndef __ASSEMBLY__
> >  
> > +/*
> > + * supports 3 memory models.
> > + */
> 
> This comment could either be updated to reflect 2 memory models, or
> removed entirely.

I counted SPARSE and SPARSE_VMEMMAP as 2.

The code below has three clauses: one for FLATMEM, one for SPARSE and one
for VMEMMAP.
 
> Thanks,
> Stephen
> 
> >  #if defined(CONFIG_FLATMEM)
> >  
> >  #ifndef ARCH_PFN_OFFSET
> >  #define ARCH_PFN_OFFSET(0UL)
> >  #endif
> >  
> > -#elif defined(CONFIG_DISCONTIGMEM)
> > -
> > -#ifndef arch_pfn_to_nid
> > -#define arch_pfn_to_nid(pfn)   pfn_to_nid(pfn)
> > -#endif
> > -
> > -#ifndef arch_local_page_offset
> > -#define arch_local_page_offset(pfn, nid)   \
> > -   ((pfn) - NODE_DATA(nid)->node_start_pfn)
> > -#endif
> > -
> > -#endif /* CONFIG_DISCONTIGMEM */
> > -
> > -/*
> > - * supports 3 memory models.
> > - */
> > -#if defined(CONFIG_FLATMEM)
> > -
> >  #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
> >  #define __page_to_pfn(page)((unsigned long)((page) - mem_map) + \
> >  ARCH_PFN_OFFSET)
> > -#elif defined(CONFIG_DISCONTIGMEM)
> > -
> > -#define __pfn_to_page(pfn) \
> > -({ unsigned long __pfn = (pfn);\
> > -   unsigned long __nid = arch_pfn_to_nid(__pfn);  \
> > -   NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
> > -})
> > -
> > -#define __page_to_pfn(pg)  \
> > -({ const struct page *__pg = (pg); \
> > -   struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
> > -   (unsigned long)(__pg - __pgdat->node_mem_map) + \
> > -__pgdat->node_start_pfn;   \
> > -})
> >  
> >  #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
> >  
> > @@ -70,7 +41,7 @@
> > struct mem_section *__sec = __pfn_to_section(__pfn);\
> > __section_mem_map_addr(__sec) + __pfn;  \
> >  })
> > -#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
> > +#endif /* CONFIG_FLATMEM/SPARSEMEM */
> >  
> >  /*
> >   * Convert a physical address to a Page Frame Number and back
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 0d53eba1c383..700032e99419 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -738,10 +738,12 @@ struct zonelist {
> > struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
> >  };
> >  
> > -#ifndef CONFIG_DISCONTIGMEM
> > -/* The array of struct pages - for discontigmem use pgdat->lmem_map */
> > +/*
> > + * The array of struct pages for flatmem.
> > + * It must be declared for SPARSEMEM as well because there are 
> > configurations
> > + * that rely on that.
> > + */
> >  extern struct page *mem_map;
> > -#endif
> >  
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  struct deferred_split {
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 02d44e3420f5..218b96ccc84a 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -19,7 +19,7 @@ choice
> >  
> >  config FLATMEM_MANUAL
> > bool "Flat Memory"
> > -   depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || 
> > ARCH_FLATMEM_ENABLE
> > +   depends on !ARCH_SPARSEMEM_ENABLE || ARCH_FLATMEM_ENABLE
> > help
> >   This option is best suited for non-NUMA systems with
> >   flat address space.

Re: [PATCH v2 0/9] Remove DISCINTIGMEM memory model

2021-06-09 Thread Mike Rapoport
Hi Arnd,

On Wed, Jun 09, 2021 at 01:30:39PM +0200, Arnd Bergmann wrote:
> On Fri, Jun 4, 2021 at 8:49 AM Mike Rapoport  wrote:
> >
> > From: Mike Rapoport 
> >
> > Hi,
> >
> > SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a
> > (long) while ago. The last architectures that used DISCONTIGMEM were
> > updated to use other memory models in v5.11 and it is about time to
> > entirely remove DISCONTIGMEM from the kernel.
> >
> > This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
> > model selection in mm/Kconfig and replaces usage of redundant
> > CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
> > and CONFIG_FLATMEM respectively.
> >
> > I've also removed NUMA support on alpha that was BROKEN for more than 15
> > years.
> >
> > There were also minor updates all over arch/ to remove mentions of
> > DISCONTIGMEM in comments and #ifdefs.
> 
> Hi Mike and Andrew,
> 
> It looks like everyone is happy with this version so far. How should we merge 
> it
> for linux-next? I'm happy to take it through the asm-generic tree, but 
> linux-mm
> would fit at least as well. In case we go for linux-mm, feel free to add

Andrew already took it to mmotm.
 
> Acked-by: Arnd Bergmann 

Thanks!

> for the whole series.

-- 
Sincerely yours,
Mike.



Re: [PATCH v3 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-09 Thread Mike Rapoport
On Tue, Jun 08, 2021 at 05:25:44PM -0700, Andrew Morton wrote:
> On Tue,  8 Jun 2021 12:13:15 +0300 Mike Rapoport  wrote:
> 
> > From: Mike Rapoport 
> > 
> > After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
> > configuration options are equivalent.
> > 
> > Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
> > 
> > Done with
> > 
> > $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
> > $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
> > $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
> > $(git grep -wl NEED_MULTIPLE_NODES)
> > 
> > with manual tweaks afterwards.
> > 
> > ...
> >
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -987,7 +987,7 @@ extern int movable_zone;
> >  #ifdef CONFIG_HIGHMEM
> >  static inline int zone_movable_is_highmem(void)
> >  {
> > -#ifdef CONFIG_NEED_MULTIPLE_NODES
> > +#ifdef CONFIG_NUMA
> > return movable_zone == ZONE_HIGHMEM;
> >  #else
> > return (ZONE_MOVABLE - 1) == ZONE_HIGHMEM;
> 
> I dropped this hunk - your "mm/mmzone.h: simplify is_highmem_idx()"
> removed zone_movable_is_highmem().  

Ah, right.
Thanks!

-- 
Sincerely yours,
Mike.



[PATCH v3 9/9] mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Signed-off-by: Mike Rapoport 
---
 include/linux/mmzone.h | 4 ++--
 kernel/crash_core.c| 2 +-
 mm/Kconfig | 4 
 mm/page_alloc.c| 6 +++---
 mm/page_ext.c  | 2 +-
 5 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acdc51c7b259..1d5cafe5ccc3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -777,7 +777,7 @@ typedef struct pglist_data {
struct zonelist node_zonelists[MAX_ZONELISTS];
 
int nr_zones; /* number of populated zones in this node */
-#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
+#ifdef CONFIG_FLATMEM  /* means !SPARSEMEM */
struct page *node_mem_map;
 #ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
@@ -867,7 +867,7 @@ typedef struct pglist_data {
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
 #define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
 #define pgdat_page_nr(pgdat, pagenr)   pfn_to_page((pgdat)->node_start_pfn + 
(pagenr))
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 53eb8bc6026d..2b8446ea7105 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -483,7 +483,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, compound_head);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
VMCOREINFO_OFFSET(pglist_data, node_mem_map);
 #endif
VMCOREINFO_OFFSET(pglist_data, node_start_pfn);
diff --git a/mm/Kconfig b/mm/Kconfig
index bffe4bd859f3..ded98fb859ab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -55,10 +55,6 @@ config FLATMEM
def_bool y
depends on !SPARSEMEM || FLATMEM_MANUAL
 
-config FLAT_NODE_MEM_MAP
-   def_bool y
-   depends on !SPARSEMEM
-
 #
 # SPARSEMEM_EXTREME (which is the default) does some bootmem
 # allocations when sparse_init() is called.  If this cannot
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f08135d3eb4..f039736541eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6444,7 +6444,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+#if !defined(CONFIG_FLATMEM)
 /*
  * Only struct pages that correspond to ranges defined by memblock.memory
  * are zeroed and initialized by going through __init_single_page() during
@@ -7241,7 +7241,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
}
 }
 
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
 {
unsigned long __maybe_unused start = 0;
@@ -7289,7 +7289,7 @@ static void __ref alloc_node_mem_map(struct pglist_data 
*pgdat)
 }
 #else
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+#endif /* CONFIG_FLATMEM */
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index df6f74aac8e1..293b2685fc48 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -191,7 +191,7 @@ void __init page_ext_init_flatmem(void)
panic("Out of memory");
 }
 
-#else /* CONFIG_FLAT_NODE_MEM_MAP */
+#else /* CONFIG_FLATMEM */
 
 struct page_ext *lookup_page_ext(const struct page *page)
 {
-- 
2.28.0




[PATCH v3 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig|  2 +-
 arch/ia64/Kconfig |  2 +-
 arch/mips/Kconfig |  2 +-
 arch/mips/include/asm/mmzone.h|  2 +-
 arch/mips/include/asm/page.h  |  2 +-
 arch/mips/mm/init.c   |  4 ++--
 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/mmzone.h |  4 ++--
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/kernel/smp.c |  2 +-
 arch/powerpc/kexec/core.c |  4 ++--
 arch/powerpc/mm/Makefile  |  2 +-
 arch/powerpc/mm/mem.c |  4 ++--
 arch/riscv/Kconfig|  2 +-
 arch/s390/Kconfig |  2 +-
 arch/sh/include/asm/mmzone.h  |  4 ++--
 arch/sh/kernel/topology.c |  2 +-
 arch/sh/mm/Kconfig|  2 +-
 arch/sh/mm/init.c |  2 +-
 arch/sparc/Kconfig|  2 +-
 arch/sparc/include/asm/mmzone.h   |  4 ++--
 arch/sparc/kernel/smp_64.c|  2 +-
 arch/sparc/mm/init_64.c   | 12 ++--
 arch/x86/Kconfig  |  2 +-
 arch/x86/kernel/setup_percpu.c|  6 +++---
 arch/x86/mm/init_32.c |  4 ++--
 include/asm-generic/topology.h|  2 +-
 include/linux/memblock.h  |  6 +++---
 include/linux/mm.h|  4 ++--
 include/linux/mmzone.h|  8 
 kernel/crash_core.c   |  2 +-
 mm/Kconfig|  9 -
 mm/memblock.c |  8 
 mm/memory.c   |  3 +--
 mm/page_alloc.c   |  6 +++---
 35 files changed, 59 insertions(+), 69 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..d01a1545ab8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1035,7 +1035,7 @@ config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
range 1 10
default "4"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  Specify the maximum number of NUMA Nodes available on the target
  system.  Increases memory reserved to accommodate various tables.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 279252e3e0f7..da22a35e6f03 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -302,7 +302,7 @@ config NODES_SHIFT
int "Max num nodes shift(3-10)"
range 3 10
default "10"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  This option specifies the maximum number of nodes in your SSI system.
  MAX_NUMNODES will be 2^(This value).
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index ed51970c08e7..4704a16c2e44 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2867,7 +2867,7 @@ config RANDOMIZE_BASE_MAX_OFFSET
 config NODES_SHIFT
int
default "6"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
 
 config HW_PERF_EVENTS
bool "Enable hardware performance counter support for perf events"
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index 7649ab45e80c..602a21aee9d4 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -8,7 +8,7 @@
 
 #include 
 
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 # include 
 #endif
 
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 195ff4e9771f..96bc798c1ec1 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -239,7 +239,7 @@ static inline int pfn_valid(unsigned long pfn)
 
 /* pfn_valid is defined in linux/mmzone.h */
 
-#elif defined(CONFIG_NEED_MULTIPLE_NODES)
+#elif defined(CONFIG_NUMA)
 
 #define pfn_valid(pfn) \
 ({ \
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 97f6ca341448..19347dc6bbf8 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -394,7 +394,7 @@ void maar_init(void)
}
 }
 
-#ifndef CONFIG_NEED_MULTIPLE_NODES
+#ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -473,7 +473,7 @@ void __init mem_init(void)
0x80000000 - 4, KCORE_TEXT);
 #endif
 }
-#endif /* !CONFIG_NEED_MULTIPLE_NODES */
+#endif /* !CONFIG_NUMA */
 
 void free_init_pages(const char *what, unsigned long begin, unsigned long end)
 {
diff --git a/ar

[PATCH v3 7/9] docs: remove description of DISCONTIGMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

Remove description of DISCONTIGMEM from the "Memory Models" document and
update the VM sysctl description so that it won't mention DISCONTIGMEM.

Signed-off-by: Mike Rapoport 
---
 Documentation/admin-guide/sysctl/vm.rst | 12 +++
 Documentation/vm/memory-model.rst   | 45 ++---
 2 files changed, 8 insertions(+), 49 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..ddbd71d592e0 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -936,12 +936,12 @@ allocations, THP and hugetlbfs pages.
 
 To make it sensible with respect to the watermark_scale_factor
 parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
 
 
 watermark_scale_factor
diff --git a/Documentation/vm/memory-model.rst 
b/Documentation/vm/memory-model.rst
index ce398a7dc6cd..30e8fbed6914 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges 
at
 completely distinct addresses. And, don't forget about NUMA, where
 different memory banks are attached to different CPUs.
 
-Linux abstracts this diversity using one of the three memory models:
-FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
+Linux abstracts this diversity using one of the two memory models:
+FLATMEM and SPARSEMEM. Each architecture defines what
 memory models it supports, what the default memory model is and
 whether it is possible to manually override that default.
 
-.. note::
-   At time of this writing, DISCONTIGMEM is considered deprecated,
-   although it is still in use by several architectures.
-
 All the memory models track the status of physical page frames using
 struct page arranged in one or more arrays.
 
@@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
 The `ARCH_PFN_OFFSET` defines the first page frame number for
 systems with physical memory starting at address different from 0.
 
-DISCONTIGMEM
-
-
-The DISCONTIGMEM model treats the physical memory as a collection of
-`nodes` similarly to how Linux NUMA support does. For each node Linux
-constructs an independent memory management subsystem represented by
-`struct pglist_data` (or `pg_data_t` for short). Among other
-things, `pg_data_t` holds the `node_mem_map` array that maps
-physical pages belonging to that node. The `node_start_pfn` field of
-`pg_data_t` is the number of the first page frame belonging to that
-node.
-
-The architecture setup code should call :c:func:`free_area_init_node` for
-each node in the system to initialize the `pg_data_t` object and its
-`node_mem_map`.
-
-Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
-every physical page frame in a node has a `struct page` entry in the
-`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
-`flags` field of the `struct page` encodes the node number of the
-node hosting that page.
-
-The conversion between a PFN and the `struct page` in the
-DISCONTIGMEM model became slightly more complex as it has to determine
-which node hosts the physical page and which `pg_data_t` object
-holds the `struct page`.
-
-Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
-to convert PFN to the node number. The opposite conversion helper
-:c:func:`page_to_nid` is generic as it uses the node number encoded in
-page->flags.
-
-Once the node number is known, the PFN can be used to index
-appropriate `node_mem_map` array to access the `struct page` and
-the offset of the `struct page` from the `node_mem_map` plus
-`node_start_pfn` is the PFN of that page.
-
 SPARSEMEM
 =
 
-- 
2.28.0




[PATCH v3 5/9] mm: remove CONFIG_DISCONTIGMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

There are no architectures that support DISCONTIGMEM left.

Remove the configuration option and the dead code it was guarding in the
generic memory management code.

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/memory_model.h | 37 --
 include/linux/mmzone.h |  8 ---
 mm/Kconfig | 25 +++-
 mm/page_alloc.c| 13 ---
 4 files changed, 12 insertions(+), 71 deletions(-)

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index 7637fb46ba4f..a2c8ed60233a 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -6,47 +6,18 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * supports 3 memory models.
+ */
 #if defined(CONFIG_FLATMEM)
 
 #ifndef ARCH_PFN_OFFSET
 #define ARCH_PFN_OFFSET	(0UL)
 #endif
 
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#ifndef arch_pfn_to_nid
-#define arch_pfn_to_nid(pfn)   pfn_to_nid(pfn)
-#endif
-
-#ifndef arch_local_page_offset
-#define arch_local_page_offset(pfn, nid)   \
-   ((pfn) - NODE_DATA(nid)->node_start_pfn)
-#endif
-
-#endif /* CONFIG_DISCONTIGMEM */
-
-/*
- * supports 3 memory models.
- */
-#if defined(CONFIG_FLATMEM)
-
 #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
 #define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
 ARCH_PFN_OFFSET)
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#define __pfn_to_page(pfn) \
-({ unsigned long __pfn = (pfn);\
-   unsigned long __nid = arch_pfn_to_nid(__pfn);  \
-   NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
-})
-
-#define __page_to_pfn(pg)  \
-({ const struct page *__pg = (pg); \
-   struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
-   (unsigned long)(__pg - __pgdat->node_mem_map) + \
-__pgdat->node_start_pfn;   \
-})
 
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 
@@ -70,7 +41,7 @@
struct mem_section *__sec = __pfn_to_section(__pfn);\
__section_mem_map_addr(__sec) + __pfn;  \
 })
-#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
+#endif /* CONFIG_FLATMEM/SPARSEMEM */
 
 /*
  * Convert a physical address to a Page Frame Number and back
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..700032e99419 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,10 +738,12 @@ struct zonelist {
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
 };
 
-#ifndef CONFIG_DISCONTIGMEM
-/* The array of struct pages - for discontigmem use pgdat->lmem_map */
+/*
+ * The array of struct pages for flatmem.
+ * It must be declared for SPARSEMEM as well because there are configurations
+ * that rely on that.
+ */
 extern struct page *mem_map;
-#endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 struct deferred_split {
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3420f5..218b96ccc84a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,7 +19,7 @@ choice
 
 config FLATMEM_MANUAL
bool "Flat Memory"
-   depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || 
ARCH_FLATMEM_ENABLE
+   depends on !ARCH_SPARSEMEM_ENABLE || ARCH_FLATMEM_ENABLE
help
  This option is best suited for non-NUMA systems with
  flat address space. The FLATMEM is the most efficient
@@ -32,21 +32,6 @@ config FLATMEM_MANUAL
 
  If unsure, choose this option (Flat Memory) over any other.
 
-config DISCONTIGMEM_MANUAL
-   bool "Discontiguous Memory"
-   depends on ARCH_DISCONTIGMEM_ENABLE
-   help
- This option provides enhanced support for discontiguous
- memory systems, over FLATMEM.  These systems have holes
- in their physical address spaces, and this option provides
- more efficient handling of these holes.
-
- Although "Discontiguous Memory" is still used by several
- architectures, it is considered deprecated in favor of
- "Sparse Memory".
-
- If unsure, choose "Sparse Memory" over this option.
-
 config SPARSEMEM_MANUAL
bool "Sparse Memory"
depends on ARCH_SPARSEMEM_ENABLE
@@ -62,17 +47,13 @@ config SPARSEMEM_MANUAL
 
 endchoice
 
-config DISCONTIGMEM
-   def_bool y
-   depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || 
DISCONTIGMEM_MANUAL
-
 config SPARSEMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_SPARSEMEM_ENABLE) || 
SPARSEMEM_MANUAL
 
 config FLATMEM
def_bool y
-   depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL
+   depends on !SPARSEMEM || FLATMEM_MANUAL
 
 config FLAT_NODE_M

[PATCH v3 6/9] arch, mm: remove stale mentions of DISCONTIGMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

There are several places that mention DISCONTIGMEM in comments or have stale
code guarded by CONFIG_DISCONTIGMEM.

Remove the dead code and update the comments.

Signed-off-by: Mike Rapoport 
---
 arch/ia64/kernel/topology.c | 5 ++---
 arch/ia64/mm/numa.c | 5 ++---
 arch/mips/include/asm/mmzone.h  | 6 --
 arch/mips/mm/init.c | 3 ---
 arch/nds32/include/asm/memory.h | 6 --
 arch/xtensa/include/asm/page.h  | 4 
 include/linux/gfp.h | 4 ++--
 7 files changed, 6 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/kernel/topology.c b/arch/ia64/kernel/topology.c
index 09fc385c2acd..3639e0a7cb3b 100644
--- a/arch/ia64/kernel/topology.c
+++ b/arch/ia64/kernel/topology.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 2002/08/07 Erich Focht 
  * Populate cpu entries in sysfs for non-numa systems as well
  * Intel Corporation - Ashok Raj
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 46b6e5f3a40f..d6579ec3ea32 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 
  * 2002/08/07 Erich Focht 
  */
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index b826b8473e95..7649ab45e80c 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -20,10 +20,4 @@
 #define nid_to_addrbase(nid) 0
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-
-#define pfn_to_nid(pfn)	pa_to_nid((pfn) << PAGE_SHIFT)
-
-#endif /* CONFIG_DISCONTIGMEM */
-
 #endif /* _ASM_MMZONE_H_ */
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index c36358758969..97f6ca341448 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -454,9 +454,6 @@ void __init mem_init(void)
BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (_PFN_SHIFT > PAGE_SHIFT));
 
 #ifdef CONFIG_HIGHMEM
-#ifdef CONFIG_DISCONTIGMEM
-#error "CONFIG_HIGHMEM and CONFIG_DISCONTIGMEM dont work together yet"
-#endif
max_mapnr = highend_pfn ? highend_pfn : max_low_pfn;
 #else
max_mapnr = max_low_pfn;
diff --git a/arch/nds32/include/asm/memory.h b/arch/nds32/include/asm/memory.h
index 940d32842793..62faafbc28e4 100644
--- a/arch/nds32/include/asm/memory.h
+++ b/arch/nds32/include/asm/memory.h
@@ -76,18 +76,12 @@
  *  virt_to_page(k)convert a _valid_ virtual address to struct page *
  *  virt_addr_valid(k) indicates whether a virtual address is valid
  */
-#ifndef CONFIG_DISCONTIGMEM
-
 #define ARCH_PFN_OFFSET	PHYS_PFN_OFFSET
 #define pfn_valid(pfn) ((pfn) >= PHYS_PFN_OFFSET && (pfn) < 
(PHYS_PFN_OFFSET + max_mapnr))
 
 #define virt_to_page(kaddr)(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && 
(unsigned long)(kaddr) < (unsigned long)high_memory)
 
-#else /* CONFIG_DISCONTIGMEM */
-#error CONFIG_DISCONTIGMEM is not supported yet.
-#endif /* !CONFIG_DISCONTIGMEM */
-
 #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
 
 #endif
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 37ce25ef92d6..493eb7083b1a 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -192,10 +192,6 @@ static inline unsigned long ___pa(unsigned long va)
 #define pfn_valid(pfn) \
((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
 
-#ifdef CONFIG_DISCONTIGMEM
-# error CONFIG_DISCONTIGMEM not supported
-#endif
-
 #define virt_to_page(kaddr)pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
 #define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
 #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 11da8af06704..dbe1f5fc901d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -494,8 +494,8 @@ static inline int gfp_zonelist(gfp_t flags)
  * There are two zonelists per node, one for all zones with memory and
  * one containing just zones from the node the zonelist belongs to.
  *
- * For the normal case of non-DISCONTIGMEM systems the NODE_DATA(

[PATCH v3 3/9] arc: remove support for DISCONTIGMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
Acked-by: Vineet Gupta 
---
 arch/arc/Kconfig  | 13 
 arch/arc/include/asm/mmzone.h | 40 ---
 arch/arc/mm/init.c|  8 ---
 3 files changed, 61 deletions(-)
 delete mode 100644 arch/arc/include/asm/mmzone.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 2d98501c0897..d8f51eb8963b 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
 config GENERIC_CSUM
def_bool y
 
-config ARCH_DISCONTIGMEM_ENABLE
-   def_bool n
-   depends on BROKEN
-
 config ARCH_FLATMEM_ENABLE
def_bool y
 
@@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
 
 endchoice
 
-config NODES_SHIFT
-   int "Maximum NUMA Nodes (as a power of 2)"
-   default "0" if !DISCONTIGMEM
-   default "1" if DISCONTIGMEM
-   depends on NEED_MULTIPLE_NODES
-   help
- Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
- zones.
-
 config ARC_COMPACT_IRQ_LEVELS
depends on ISA_ARCOMPACT
bool "Setup Timer IRQ as high Priority"
diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
deleted file mode 100644
index b86b9d1e54dc..
--- a/arch/arc/include/asm/mmzone.h
+++ /dev/null
@@ -1,40 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
- */
-
-#ifndef _ASM_ARC_MMZONE_H
-#define _ASM_ARC_MMZONE_H
-
-#ifdef CONFIG_DISCONTIGMEM
-
-extern struct pglist_data node_data[];
-#define NODE_DATA(nid) (&node_data[nid])
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-   int is_end_low = 1;
-
-   if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
-   is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL);
-
-   /*
-* node 0: lowmem: 0x8000_0000   to 0xFFFF_FFFF
-* node 1: HIGHMEM w/o  PAE40: 0x0   to 0x7FFF_FFFF
-* HIGHMEM with PAE40: 0x1_0000_0000 to ...
-*/
-   if (pfn >= ARCH_PFN_OFFSET && is_end_low)
-   return 0;
-
-   return 1;
-}
-
-static inline int pfn_valid(unsigned long pfn)
-{
-   int nid = pfn_to_nid(pfn);
-
-   return (pfn <= node_end_pfn(nid));
-}
-#endif /* CONFIG_DISCONTIGMEM  */
-
-#endif
diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 397a201adfe3..abfeef7bf6f8 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
 EXPORT_SYMBOL(arch_pfn_offset);
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
-#endif
-
 long __init arc_get_mem_sz(void)
 {
return low_mem_sz;
@@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
 * to the hole is freed and ARC specific version of pfn_valid()
 * handles the hole in the memory map.
 */
-#ifdef CONFIG_DISCONTIGMEM
-   node_set_online(1);
-#endif
 
min_high_pfn = PFN_DOWN(high_mem_start);
max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
-- 
2.28.0




[PATCH v3 2/9] arc: update comment about HIGHMEM implementation

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

Arc does not use DISCONTIGMEM to implement high memory; update the comment
describing how high memory works to reflect this.

Signed-off-by: Mike Rapoport 
Acked-by: Vineet Gupta 
---
 arch/arc/mm/init.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index e2ed355438c9..397a201adfe3 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
 
 #ifdef CONFIG_HIGHMEM
/*
-* Populate a new node with highmem
-*
 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
-* than addresses in normal ala low memory (0x8000_0000 based).
+* than addresses in normal aka low memory (0x8000_0000 based).
 * Even with PAE, the huge peripheral space hole would waste a lot of
-* mem with single mem_map[]. This warrants a mem_map per region design.
-* Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
-*
-* DISCONTIGMEM in turns requires multiple nodes. node 0 above is
-* populated with normal memory zone while node 1 only has highmem
+* mem with single contiguous mem_map[].
+* Thus when HIGHMEM on ARC is enabled the memory map corresponding
+* to the hole is freed and ARC specific version of pfn_valid()
+* handles the hole in the memory map.
 */
 #ifdef CONFIG_DISCONTIGMEM
node_set_online(1);
-- 
2.28.0




[PATCH v3 4/9] m68k: remove support for DISCONTIGMEM

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/Kconfig.cpu   | 10 --
 arch/m68k/include/asm/mmzone.h  | 10 --
 arch/m68k/include/asm/page.h|  2 +-
 arch/m68k/include/asm/page_mm.h | 35 -
 arch/m68k/mm/init.c | 20 ---
 5 files changed, 1 insertion(+), 76 deletions(-)
 delete mode 100644 arch/m68k/include/asm/mmzone.h

diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index f4d23977d2a5..29e946394fdb 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -408,10 +408,6 @@ config SINGLE_MEMORY_CHUNK
  order" to save memory that could be wasted for unused memory map.
  Say N if not sure.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   depends on BROKEN
-   def_bool MMU && !SINGLE_MEMORY_CHUNK
-
 config FORCE_MAX_ZONEORDER
int "Maximum zone order" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
@@ -451,11 +447,6 @@ config M68K_L2_CACHE
depends on MAC
default y
 
-config NODES_SHIFT
-   int
-   default "3"
-   depends on DISCONTIGMEM
-
 config CPU_HAS_NO_BITFIELDS
bool
 
@@ -553,4 +544,3 @@ config CACHE_COPYBACK
  The ColdFire CPU cache is set into Copy-back mode.
 endchoice
 endif
-
diff --git a/arch/m68k/include/asm/mmzone.h b/arch/m68k/include/asm/mmzone.h
deleted file mode 100644
index 64573fe8e60d..
--- a/arch/m68k/include/asm/mmzone.h
+++ /dev/null
@@ -1,10 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_M68K_MMZONE_H_
-#define _ASM_M68K_MMZONE_H_
-
-extern pg_data_t pg_data_map[];
-
-#define NODE_DATA(nid) (&pg_data_map[nid])
-#define NODE_MEM_MAP(nid)  (NODE_DATA(nid)->node_mem_map)
-
-#endif /* _ASM_M68K_MMZONE_H_ */
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 97087dd3ca6d..2f1c54e4725d 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -62,7 +62,7 @@ extern unsigned long _ramend;
 #include 
 #endif
 
-#if !defined(CONFIG_MMU) || defined(CONFIG_DISCONTIGMEM)
+#ifndef CONFIG_MMU
 #define __phys_to_pfn(paddr)   ((unsigned long)((paddr) >> PAGE_SHIFT))
 #define __pfn_to_phys(pfn) PFN_PHYS(pfn)
 #endif
diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h
index 2411ea9ef578..a5b459bcb7d8 100644
--- a/arch/m68k/include/asm/page_mm.h
+++ b/arch/m68k/include/asm/page_mm.h
@@ -126,26 +126,6 @@ static inline void *__va(unsigned long x)
 
 extern int m68k_virt_to_node_shift;
 
-#ifndef CONFIG_DISCONTIGMEM
-#define __virt_to_node(addr)   (&pg_data_map[0])
-#else
-extern struct pglist_data *pg_data_table[];
-
-static inline __attribute_const__ int __virt_to_node_shift(void)
-{
-   int shift;
-
-   asm (
-   "1: moveq   #0,%0\n"
-   m68k_fixup(%c1, 1b)
-   : "=d" (shift)
-   : "i" (m68k_fixup_vnode_shift));
-   return shift;
-}
-
-#define __virt_to_node(addr)   (pg_data_table[(unsigned long)(addr) >> 
__virt_to_node_shift()])
-#endif
-
 #define virt_to_page(addr) ({  \
pfn_to_page(virt_to_pfn(addr)); \
 })
@@ -153,23 +133,8 @@ static inline __attribute_const__ int 
__virt_to_node_shift(void)
pfn_to_virt(page_to_pfn(page)); \
 })
 
-#ifdef CONFIG_DISCONTIGMEM
-#define pfn_to_page(pfn) ({\
-   unsigned long __pfn = (pfn);\
-   struct pglist_data *pgdat;  \
-   pgdat = __virt_to_node((unsigned long)pfn_to_virt(__pfn));  \
-   pgdat->node_mem_map + (__pfn - pgdat->node_start_pfn);  \
-})
-#define page_to_pfn(_page) ({  \
-   const struct page *__p = (_page);   \
-   struct pglist_data *pgdat;  \
-   pgdat = &pg_data_map[page_to_nid(__p)]; \
-   ((__p) - pgdat->node_mem_map) + pgdat->node_start_pfn;  \
-})
-#else
 #define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT)
 #include 
-#endif
 
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && 
(unsigned long)(kaddr) < (unsigned long)high_memory)
 #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))
diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c
index 1759ab875d47..5d749e188246 100644
--- a/arch/m68k/mm/init.c
+++ b/arch/m68k/mm/init.c
@@ -44,28 +44,8 @@ EXPORT_SYMBOL(empty_zero_page);
 
 int m68k_virt_to_node_shift;
 
-#ifdef CONFIG_

[PATCH v3 1/9] alpha: remove DISCONTIGMEM and NUMA

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

NUMA is marked broken on alpha for more than 15 years and DISCONTIGMEM was
replaced with SPARSEMEM in v5.11.

Remove both NUMA and DISCONTIGMEM support from alpha.

Signed-off-by: Mike Rapoport 
---
 arch/alpha/Kconfig|  22 ---
 arch/alpha/include/asm/machvec.h  |   6 -
 arch/alpha/include/asm/mmzone.h   | 100 --
 arch/alpha/include/asm/pgtable.h  |   4 -
 arch/alpha/include/asm/topology.h |  39 --
 arch/alpha/kernel/core_marvel.c   |  53 +--
 arch/alpha/kernel/core_wildfire.c |  29 +---
 arch/alpha/kernel/pci_iommu.c |  29 
 arch/alpha/kernel/proto.h |   8 --
 arch/alpha/kernel/setup.c |  16 ---
 arch/alpha/kernel/sys_marvel.c|   5 -
 arch/alpha/kernel/sys_wildfire.c  |   5 -
 arch/alpha/mm/Makefile|   2 -
 arch/alpha/mm/init.c  |   3 -
 arch/alpha/mm/numa.c  | 223 --
 15 files changed, 4 insertions(+), 540 deletions(-)
 delete mode 100644 arch/alpha/include/asm/mmzone.h
 delete mode 100644 arch/alpha/mm/numa.c

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5998106faa60..8954216b9956 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -549,29 +549,12 @@ config NR_CPUS
  MARVEL support can handle a maximum of 32 CPUs, all the others
  with working support have a maximum of 4 CPUs.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   bool "Discontiguous Memory Support"
-   depends on BROKEN
-   help
- Say Y to support efficient handling of discontiguous physical memory,
- for architectures which are either NUMA (Non-Uniform Memory Access)
- or have huge holes in the physical address space for other reasons.
- See  for more.
-
 config ARCH_SPARSEMEM_ENABLE
bool "Sparse Memory Support"
help
  Say Y to support efficient handling of discontiguous physical memory,
  for systems that have huge holes in the physical address space.
 
-config NUMA
-   bool "NUMA Support (EXPERIMENTAL)"
-   depends on DISCONTIGMEM && BROKEN
-   help
- Say Y to compile the kernel to support NUMA (Non-Uniform Memory
- Access).  This option is for configuring high-end multiprocessor
- server machines.  If in doubt, say N.
-
 config ALPHA_WTINT
bool "Use WTINT" if ALPHA_SRM || ALPHA_GENERIC
default y if ALPHA_QEMU
@@ -596,11 +579,6 @@ config ALPHA_WTINT
 
  If unsure, say N.
 
-config NODES_SHIFT
-   int
-   default "7"
-   depends on NEED_MULTIPLE_NODES
-
 # LARGE_VMALLOC is racy, if you *really* need it then fix it first
 config ALPHA_LARGE_VMALLOC
bool
diff --git a/arch/alpha/include/asm/machvec.h b/arch/alpha/include/asm/machvec.h
index a4e96e2bec74..e49fabce7b33 100644
--- a/arch/alpha/include/asm/machvec.h
+++ b/arch/alpha/include/asm/machvec.h
@@ -99,12 +99,6 @@ struct alpha_machine_vector
 
const char *vector_name;
 
-   /* NUMA information */
-   int (*pa_to_nid)(unsigned long);
-   int (*cpuid_to_nid)(int);
-   unsigned long (*node_mem_start)(int);
-   unsigned long (*node_mem_size)(int);
-
/* System specific parameters.  */
union {
struct {
diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h
deleted file mode 100644
index 86644604d977..
--- a/arch/alpha/include/asm/mmzone.h
+++ /dev/null
@@ -1,100 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Written by Kanoj Sarcar (ka...@sgi.com) Aug 99
- * Adapted for the alpha wildfire architecture Jan 2001.
- */
-#ifndef _ASM_MMZONE_H_
-#define _ASM_MMZONE_H_
-
-#ifdef CONFIG_DISCONTIGMEM
-
-#include 
-
-/*
- * Following are macros that are specific to this numa platform.
- */
-
-extern pg_data_t node_data[];
-
-#define alpha_pa_to_nid(pa)\
-(alpha_mv.pa_to_nid\
-? alpha_mv.pa_to_nid(pa)   \
-: (0))
-#define node_mem_start(nid)\
-(alpha_mv.node_mem_start   \
-? alpha_mv.node_mem_start(nid) \
-: (0UL))
-#define node_mem_size(nid) \
-(alpha_mv.node_mem_size\
-? alpha_mv.node_mem_size(nid)  \
-: ((nid) ? (0UL) : (~0UL)))
-
-#define pa_to_nid(pa)  alpha_pa_to_nid(pa)
-#define NODE_DATA(nid) (&node_data[(nid)])
-
-#define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn)
-
-#if 1
-#define PLAT_NODE_DATA_LOCALNR(p, n)   \
-   (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
-#else
-static inline unsigned long
-PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
-{
-   unsigned long temp;
-   temp = p >> PAGE_SHIFT;
-   return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn;
-}
-#endif
-
-/*
- * Following are macros that each numa implementation must define.
- */
-
-/*
- * Given a kernel address,

[PATCH v3 0/9] Remove DISCONTIGMEM memory model

2021-06-08 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a
(long) while ago. The last architectures that used DISCONTIGMEM were
updated to use other memory models in v5.11 and it is about time to
entirely remove DISCONTIGMEM from the kernel.

This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
model selection in mm/Kconfig and replaces usage of redundant
CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
and CONFIG_FLATMEM respectively. 

I've also removed NUMA support on alpha that was BROKEN for more than 15
years.

There were also minor updates all over arch/ to remove mentions of
DISCONTIGMEM in comments and #ifdefs.

v3:
* Remove stale reference to CONFIG_NEED_MULTIPLE_NODES and stale
  discontigmem comment, per Geert
* Add Vineet's Acks
* Fix spelling in cover letter subject

v2: Link: https://lore.kernel.org/lkml/20210604064916.26580-1-r...@kernel.org
* Fix build errors reported by kbuild bot
* Add additional cleanups in m68k as suggested by Geert

v1: Link: https://lore.kernel.org/lkml/20210602105348.13387-1-r...@kernel.org

Mike Rapoport (9):
  alpha: remove DISCONTIGMEM and NUMA
  arc: update comment about HIGHMEM implementation
  arc: remove support for DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONTIGMEM
  docs: remove description of DISCONTIGMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

 Documentation/admin-guide/sysctl/vm.rst |  12 +-
 Documentation/vm/memory-model.rst   |  45 +
 arch/alpha/Kconfig  |  22 ---
 arch/alpha/include/asm/machvec.h|   6 -
 arch/alpha/include/asm/mmzone.h | 100 ---
 arch/alpha/include/asm/pgtable.h|   4 -
 arch/alpha/include/asm/topology.h   |  39 -
 arch/alpha/kernel/core_marvel.c |  53 +-
 arch/alpha/kernel/core_wildfire.c   |  29 +--
 arch/alpha/kernel/pci_iommu.c   |  29 ---
 arch/alpha/kernel/proto.h   |   8 -
 arch/alpha/kernel/setup.c   |  16 --
 arch/alpha/kernel/sys_marvel.c  |   5 -
 arch/alpha/kernel/sys_wildfire.c|   5 -
 arch/alpha/mm/Makefile  |   2 -
 arch/alpha/mm/init.c|   3 -
 arch/alpha/mm/numa.c| 223 
 arch/arc/Kconfig|  13 --
 arch/arc/include/asm/mmzone.h   |  40 -
 arch/arc/mm/init.c  |  21 +--
 arch/arm64/Kconfig  |   2 +-
 arch/ia64/Kconfig   |   2 +-
 arch/ia64/kernel/topology.c |   5 +-
 arch/ia64/mm/numa.c |   5 +-
 arch/m68k/Kconfig.cpu   |  10 --
 arch/m68k/include/asm/mmzone.h  |  10 --
 arch/m68k/include/asm/page.h|   2 +-
 arch/m68k/include/asm/page_mm.h |  35 
 arch/m68k/mm/init.c |  20 ---
 arch/mips/Kconfig   |   2 +-
 arch/mips/include/asm/mmzone.h  |   8 +-
 arch/mips/include/asm/page.h|   2 +-
 arch/mips/mm/init.c |   7 +-
 arch/nds32/include/asm/memory.h |   6 -
 arch/powerpc/Kconfig|   2 +-
 arch/powerpc/include/asm/mmzone.h   |   4 +-
 arch/powerpc/kernel/setup_64.c  |   2 +-
 arch/powerpc/kernel/smp.c   |   2 +-
 arch/powerpc/kexec/core.c   |   4 +-
 arch/powerpc/mm/Makefile|   2 +-
 arch/powerpc/mm/mem.c   |   4 +-
 arch/riscv/Kconfig  |   2 +-
 arch/s390/Kconfig   |   2 +-
 arch/sh/include/asm/mmzone.h|   4 +-
 arch/sh/kernel/topology.c   |   2 +-
 arch/sh/mm/Kconfig  |   2 +-
 arch/sh/mm/init.c   |   2 +-
 arch/sparc/Kconfig  |   2 +-
 arch/sparc/include/asm/mmzone.h |   4 +-
 arch/sparc/kernel/smp_64.c  |   2 +-
 arch/sparc/mm/init_64.c |  12 +-
 arch/x86/Kconfig|   2 +-
 arch/x86/kernel/setup_percpu.c  |   6 +-
 arch/x86/mm/init_32.c   |   4 +-
 arch/xtensa/include/asm/page.h  |   4 -
 include/asm-generic/memory_model.h  |  37 +---
 include/asm-generic/topology.h  |   2 +-
 include/linux/gfp.h |   4 +-
 include/linux/memblock.h|   6 +-
 include/linux/mm.h  |   4 +-
 include/linux/mmzone.h  |  20 ++-
 kernel/crash_core.c |   4 +-
 mm/Kconfig  |  36 +---
 mm/memblock.c   |   8 +-
 mm/memory.c |   3 +-
 mm/page_alloc.c |  25 +--
 mm/page_ext.c   |   2 +-
 67 files changed, 101 insertions

Re: [PATCH v2 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-07 Thread Mike Rapoport
Hi,

On Mon, Jun 07, 2021 at 10:53:08AM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Fri, Jun 4, 2021 at 8:50 AM Mike Rapoport  wrote:
> > From: Mike Rapoport 
> >
> > After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
> > configuration options are equivalent.
> >
> > Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
> >
> > Done with
> >
> > $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
> > $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
> > $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
> > $(git grep -wl NEED_MULTIPLE_NODES)
> >
> > with manual tweaks afterwards.
> >
> > Signed-off-by: Mike Rapoport 
> 
> Thanks for your patch!
> 
> As you dropped the following hunk from v2 of PATCH 5/9, there's now
> one reference left of CONFIG_NEED_MULTIPLE_NODES
> (plus the discontigmem comment):

Aargh, indeed. Thanks for catching this.

And I wondered why you suggested fixing the spelling in the cover letter for v3 :)
 
> -diff --git a/mm/memory.c b/mm/memory.c
> -index f3ffab9b9e39157b..fd0ebb63be3304f5 100644
>  a/mm/memory.c
> -+++ b/mm/memory.c
> -@@ -90,8 +90,7 @@
> - #warning Unfortunate NUMA and NUMA Balancing config, growing
> page-frame for last_cpupid.
> - #endif
> -
> --#ifndef CONFIG_NEED_MULTIPLE_NODES
> --/* use the per-pgdat data instead for discontigmem - mbligh */
> -+#ifdef CONFIG_FLATMEM
> - unsigned long max_mapnr;
> - EXPORT_SYMBOL(max_mapnr);
> -
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds

-- 
Sincerely yours,
Mike.



Re: [PATCH v2 3/9] arc: remove support for DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
On Fri, Jun 04, 2021 at 02:07:39PM +, Vineet Gupta wrote:
> On 6/3/21 11:49 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> >
> > DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
> > in v5.11.
> >
> > Remove the support for DISCONTIGMEM entirely.
> >
> > Signed-off-by: Mike Rapoport 
> 
> Looks non-intrusive, but I'd still like to give this a spin on hardware 
> - considering highmem on ARC has a tendency to go sideways ;-)
> Can you please share a branch!

Sure:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memory-models/rm-discontig/v2
 
> Acked-by: Vineet Gupta 

Thanks!
 
> Thx,
> -Vineet
> 
> > ---
> >   arch/arc/Kconfig  | 13 
> >   arch/arc/include/asm/mmzone.h | 40 ---
> >   arch/arc/mm/init.c|  8 ---
> >   3 files changed, 61 deletions(-)
> >   delete mode 100644 arch/arc/include/asm/mmzone.h
> >
> > diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
> > index 2d98501c0897..d8f51eb8963b 100644
> > --- a/arch/arc/Kconfig
> > +++ b/arch/arc/Kconfig
> > @@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
> >   config GENERIC_CSUM
> > def_bool y
> >   
> > -config ARCH_DISCONTIGMEM_ENABLE
> > -   def_bool n
> > -   depends on BROKEN
> > -
> >   config ARCH_FLATMEM_ENABLE
> > def_bool y
> >   
> > @@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
> >   
> >   endchoice
> >   
> > -config NODES_SHIFT
> > -   int "Maximum NUMA Nodes (as a power of 2)"
> > -   default "0" if !DISCONTIGMEM
> > -   default "1" if DISCONTIGMEM
> > -   depends on NEED_MULTIPLE_NODES
> > -   help
> > - Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
> > - zones.
> > -
> >   config ARC_COMPACT_IRQ_LEVELS
> > depends on ISA_ARCOMPACT
> > bool "Setup Timer IRQ as high Priority"
> > diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
> > deleted file mode 100644
> > index b86b9d1e54dc..
> > --- a/arch/arc/include/asm/mmzone.h
> > +++ /dev/null
> > @@ -1,40 +0,0 @@
> > -/* SPDX-License-Identifier: GPL-2.0-only */
> > -/*
> > - * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
> > - */
> > -
> > -#ifndef _ASM_ARC_MMZONE_H
> > -#define _ASM_ARC_MMZONE_H
> > -
> > -#ifdef CONFIG_DISCONTIGMEM
> > -
> > -extern struct pglist_data node_data[];
> > -#define NODE_DATA(nid) (&node_data[nid])
> > -
> > -static inline int pfn_to_nid(unsigned long pfn)
> > -{
> > -   int is_end_low = 1;
> > -
> > -   if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
> > -   is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL);
> > -
> > -   /*
> > -* node 0: lowmem: 0x8000_0000   to 0xFFFF_FFFF
> > -* node 1: HIGHMEM w/o  PAE40: 0x0   to 0x7FFF_FFFF
> > -* HIGHMEM with PAE40: 0x1_0000_0000 to ...
> > -*/
> > -   if (pfn >= ARCH_PFN_OFFSET && is_end_low)
> > -   return 0;
> > -
> > -   return 1;
> > -}
> > -
> > -static inline int pfn_valid(unsigned long pfn)
> > -{
> > -   int nid = pfn_to_nid(pfn);
> > -
> > -   return (pfn <= node_end_pfn(nid));
> > -}
> > -#endif /* CONFIG_DISCONTIGMEM  */
> > -
> > -#endif
> > diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> > index 397a201adfe3..abfeef7bf6f8 100644
> > --- a/arch/arc/mm/init.c
> > +++ b/arch/arc/mm/init.c
> > @@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
> >   EXPORT_SYMBOL(arch_pfn_offset);
> >   #endif
> >   
> > -#ifdef CONFIG_DISCONTIGMEM
> > -struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
> > -EXPORT_SYMBOL(node_data);
> > -#endif
> > -
> >   long __init arc_get_mem_sz(void)
> >   {
> > return low_mem_sz;
> > @@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
> >  * to the hole is freed and ARC specific version of pfn_valid()
> >  * handles the hole in the memory map.
> >  */
> > -#ifdef CONFIG_DISCONTIGMEM
> > -   node_set_online(1);
> > -#endif
> >   
> > min_high_pfn = PFN_DOWN(high_mem_start);
> > max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
> 

-- 
Sincerely yours,
Mike.



[PATCH v2 9/9] mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Signed-off-by: Mike Rapoport 
---
 include/linux/mmzone.h | 4 ++--
 kernel/crash_core.c| 2 +-
 mm/Kconfig | 4 
 mm/page_alloc.c| 6 +++---
 mm/page_ext.c  | 2 +-
 5 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acdc51c7b259..1d5cafe5ccc3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -777,7 +777,7 @@ typedef struct pglist_data {
struct zonelist node_zonelists[MAX_ZONELISTS];
 
int nr_zones; /* number of populated zones in this node */
-#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
+#ifdef CONFIG_FLATMEM  /* means !SPARSEMEM */
struct page *node_mem_map;
 #ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
@@ -867,7 +867,7 @@ typedef struct pglist_data {
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
 #define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
 #define pgdat_page_nr(pgdat, pagenr)   pfn_to_page((pgdat)->node_start_pfn + 
(pagenr))
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 53eb8bc6026d..2b8446ea7105 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -483,7 +483,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, compound_head);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
VMCOREINFO_OFFSET(pglist_data, node_mem_map);
 #endif
VMCOREINFO_OFFSET(pglist_data, node_start_pfn);
diff --git a/mm/Kconfig b/mm/Kconfig
index bffe4bd859f3..ded98fb859ab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -55,10 +55,6 @@ config FLATMEM
def_bool y
depends on !SPARSEMEM || FLATMEM_MANUAL
 
-config FLAT_NODE_MEM_MAP
-   def_bool y
-   depends on !SPARSEMEM
-
 #
 # SPARSEMEM_EXTREME (which is the default) does some bootmem
 # allocations when sparse_init() is called.  If this cannot
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f08135d3eb4..f039736541eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6444,7 +6444,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+#if !defined(CONFIG_FLATMEM)
 /*
  * Only struct pages that correspond to ranges defined by memblock.memory
  * are zeroed and initialized by going through __init_single_page() during
@@ -7241,7 +7241,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
}
 }
 
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
 {
unsigned long __maybe_unused start = 0;
@@ -7289,7 +7289,7 @@ static void __ref alloc_node_mem_map(struct pglist_data 
*pgdat)
 }
 #else
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+#endif /* CONFIG_FLATMEM */
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index df6f74aac8e1..293b2685fc48 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -191,7 +191,7 @@ void __init page_ext_init_flatmem(void)
panic("Out of memory");
 }
 
-#else /* CONFIG_FLAT_NODE_MEM_MAP */
+#else /* CONFIG_FLATMEM */
 
 struct page_ext *lookup_page_ext(const struct page *page)
 {
-- 
2.28.0




[PATCH v2 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig|  2 +-
 arch/ia64/Kconfig |  2 +-
 arch/mips/Kconfig |  2 +-
 arch/mips/include/asm/mmzone.h|  2 +-
 arch/mips/include/asm/page.h  |  2 +-
 arch/mips/mm/init.c   |  4 ++--
 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/mmzone.h |  4 ++--
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/kernel/smp.c |  2 +-
 arch/powerpc/kexec/core.c |  4 ++--
 arch/powerpc/mm/Makefile  |  2 +-
 arch/powerpc/mm/mem.c |  4 ++--
 arch/riscv/Kconfig|  2 +-
 arch/s390/Kconfig |  2 +-
 arch/sh/include/asm/mmzone.h  |  4 ++--
 arch/sh/kernel/topology.c |  2 +-
 arch/sh/mm/Kconfig|  2 +-
 arch/sh/mm/init.c |  2 +-
 arch/sparc/Kconfig|  2 +-
 arch/sparc/include/asm/mmzone.h   |  4 ++--
 arch/sparc/kernel/smp_64.c|  2 +-
 arch/sparc/mm/init_64.c   | 12 ++--
 arch/x86/Kconfig  |  2 +-
 arch/x86/kernel/setup_percpu.c|  6 +++---
 arch/x86/mm/init_32.c |  4 ++--
 include/asm-generic/topology.h|  2 +-
 include/linux/memblock.h  |  6 +++---
 include/linux/mm.h|  4 ++--
 include/linux/mmzone.h|  8 
 kernel/crash_core.c   |  2 +-
 mm/Kconfig|  9 -
 mm/memblock.c |  8 
 mm/page_alloc.c   |  6 +++---
 34 files changed, 58 insertions(+), 67 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..d01a1545ab8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1035,7 +1035,7 @@ config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
range 1 10
default "4"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  Specify the maximum number of NUMA Nodes available on the target
  system.  Increases memory reserved to accommodate various tables.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 279252e3e0f7..da22a35e6f03 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -302,7 +302,7 @@ config NODES_SHIFT
int "Max num nodes shift(3-10)"
range 3 10
default "10"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  This option specifies the maximum number of nodes in your SSI system.
  MAX_NUMNODES will be 2^(This value).
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index ed51970c08e7..4704a16c2e44 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2867,7 +2867,7 @@ config RANDOMIZE_BASE_MAX_OFFSET
 config NODES_SHIFT
int
default "6"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
 
 config HW_PERF_EVENTS
bool "Enable hardware performance counter support for perf events"
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index 7649ab45e80c..602a21aee9d4 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -8,7 +8,7 @@
 
 #include 
 
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 # include 
 #endif
 
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 195ff4e9771f..96bc798c1ec1 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -239,7 +239,7 @@ static inline int pfn_valid(unsigned long pfn)
 
 /* pfn_valid is defined in linux/mmzone.h */
 
-#elif defined(CONFIG_NEED_MULTIPLE_NODES)
+#elif defined(CONFIG_NUMA)
 
 #define pfn_valid(pfn) \
 ({ \
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 97f6ca341448..19347dc6bbf8 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -394,7 +394,7 @@ void maar_init(void)
}
 }
 
-#ifndef CONFIG_NEED_MULTIPLE_NODES
+#ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -473,7 +473,7 @@ void __init mem_init(void)
0x80000000 - 4, KCORE_TEXT);
 #endif
 }
-#endif /* !CONFIG_NEED_MULTIPLE_NODES */
+#endif /* !CONFIG_NUMA */
 
 void free_init_pages(const char *what, unsigned long begin, unsigned long end)
 {
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2

[PATCH v2 7/9] docs: remove description of DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

Remove description of DISCONTIGMEM from the "Memory Models" document and
update the VM sysctl description so that it no longer mentions DISCONTIGMEM.

Signed-off-by: Mike Rapoport 
---
 Documentation/admin-guide/sysctl/vm.rst | 12 +++
 Documentation/vm/memory-model.rst   | 45 ++---
 2 files changed, 8 insertions(+), 49 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..ddbd71d592e0 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -936,12 +936,12 @@ allocations, THP and hugetlbfs pages.
 
 To make it sensible with respect to the watermark_scale_factor
 parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
 
 
 watermark_scale_factor
diff --git a/Documentation/vm/memory-model.rst 
b/Documentation/vm/memory-model.rst
index ce398a7dc6cd..30e8fbed6914 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges 
at
 completely distinct addresses. And, don't forget about NUMA, where
 different memory banks are attached to different CPUs.
 
-Linux abstracts this diversity using one of the three memory models:
-FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
+Linux abstracts this diversity using one of the two memory models:
+FLATMEM and SPARSEMEM. Each architecture defines what
 memory models it supports, what the default memory model is and
 whether it is possible to manually override that default.
 
-.. note::
-   At time of this writing, DISCONTIGMEM is considered deprecated,
-   although it is still in use by several architectures.
-
 All the memory models track the status of physical page frames using
 struct page arranged in one or more arrays.
 
@@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
 The `ARCH_PFN_OFFSET` defines the first page frame number for
 systems with physical memory starting at address different from 0.
 
-DISCONTIGMEM
-============
-
-The DISCONTIGMEM model treats the physical memory as a collection of
-`nodes` similarly to how Linux NUMA support does. For each node Linux
-constructs an independent memory management subsystem represented by
-`struct pglist_data` (or `pg_data_t` for short). Among other
-things, `pg_data_t` holds the `node_mem_map` array that maps
-physical pages belonging to that node. The `node_start_pfn` field of
-`pg_data_t` is the number of the first page frame belonging to that
-node.
-
-The architecture setup code should call :c:func:`free_area_init_node` for
-each node in the system to initialize the `pg_data_t` object and its
-`node_mem_map`.
-
-Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
-every physical page frame in a node has a `struct page` entry in the
-`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
-`flags` field of the `struct page` encodes the node number of the
-node hosting that page.
-
-The conversion between a PFN and the `struct page` in the
-DISCONTIGMEM model became slightly more complex as it has to determine
-which node hosts the physical page and which `pg_data_t` object
-holds the `struct page`.
-
-Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
-to convert PFN to the node number. The opposite conversion helper
-:c:func:`page_to_nid` is generic as it uses the node number encoded in
-page->flags.
-
-Once the node number is known, the PFN can be used to index
-appropriate `node_mem_map` array to access the `struct page` and
-the offset of the `struct page` from the `node_mem_map` plus
-`node_start_pfn` is the PFN of that page.
-
 SPARSEMEM
 =========
 
-- 
2.28.0
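
The FLATMEM conversion described in the remaining text is plain pointer
arithmetic. A self-contained toy model (userspace C, not kernel code; the
names mirror the kernel's but everything here is illustrative) shows the
round trip:

#include <assert.h>
#include <stdlib.h>

struct page { int flags; };

#define ARCH_PFN_OFFSET	0x80000UL	/* RAM starting at 2 GiB, 4 KiB pages */

static struct page *mem_map;

static struct page *pfn_to_page(unsigned long pfn)
{
	/* PFN - ARCH_PFN_OFFSET is an index into mem_map */
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}

int main(void)
{
	mem_map = calloc(1024, sizeof(*mem_map));
	unsigned long pfn = ARCH_PFN_OFFSET + 42;

	assert(page_to_pfn(pfn_to_page(pfn)) == pfn);	/* round trip holds */
	free(mem_map);
	return 0;
}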




[PATCH v2 6/9] arch, mm: remove stale mentions of DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

There are several places that mention DISCONTIGMEM in comments or have stale
code guarded by CONFIG_DISCONTIGMEM.

Remove the dead code and update the comments.

Signed-off-by: Mike Rapoport 
---
 arch/ia64/kernel/topology.c | 5 ++---
 arch/ia64/mm/numa.c | 5 ++---
 arch/mips/include/asm/mmzone.h  | 6 --
 arch/mips/mm/init.c | 3 ---
 arch/nds32/include/asm/memory.h | 6 --
 arch/xtensa/include/asm/page.h  | 4 
 include/linux/gfp.h | 4 ++--
 7 files changed, 6 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/kernel/topology.c b/arch/ia64/kernel/topology.c
index 09fc385c2acd..3639e0a7cb3b 100644
--- a/arch/ia64/kernel/topology.c
+++ b/arch/ia64/kernel/topology.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 2002/08/07 Erich Focht 
  * Populate cpu entries in sysfs for non-numa systems as well
  * Intel Corporation - Ashok Raj
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 46b6e5f3a40f..d6579ec3ea32 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 
  * 2002/08/07 Erich Focht 
  */
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index b826b8473e95..7649ab45e80c 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -20,10 +20,4 @@
 #define nid_to_addrbase(nid) 0
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-
-#define pfn_to_nid(pfn)	pa_to_nid((pfn) << PAGE_SHIFT)
-
-#endif /* CONFIG_DISCONTIGMEM */
-
 #endif /* _ASM_MMZONE_H_ */
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index c36358758969..97f6ca341448 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -454,9 +454,6 @@ void __init mem_init(void)
BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (_PFN_SHIFT > PAGE_SHIFT));
 
 #ifdef CONFIG_HIGHMEM
-#ifdef CONFIG_DISCONTIGMEM
-#error "CONFIG_HIGHMEM and CONFIG_DISCONTIGMEM dont work together yet"
-#endif
max_mapnr = highend_pfn ? highend_pfn : max_low_pfn;
 #else
max_mapnr = max_low_pfn;
diff --git a/arch/nds32/include/asm/memory.h b/arch/nds32/include/asm/memory.h
index 940d32842793..62faafbc28e4 100644
--- a/arch/nds32/include/asm/memory.h
+++ b/arch/nds32/include/asm/memory.h
@@ -76,18 +76,12 @@
  *  virt_to_page(k)convert a _valid_ virtual address to struct page *
  *  virt_addr_valid(k) indicates whether a virtual address is valid
  */
-#ifndef CONFIG_DISCONTIGMEM
-
 #define ARCH_PFN_OFFSET	PHYS_PFN_OFFSET
 #define pfn_valid(pfn) ((pfn) >= PHYS_PFN_OFFSET && (pfn) < 
(PHYS_PFN_OFFSET + max_mapnr))
 
 #define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && 
(unsigned long)(kaddr) < (unsigned long)high_memory)
 
-#else /* CONFIG_DISCONTIGMEM */
-#error CONFIG_DISCONTIGMEM is not supported yet.
-#endif /* !CONFIG_DISCONTIGMEM */
-
 #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
 
 #endif
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 37ce25ef92d6..493eb7083b1a 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -192,10 +192,6 @@ static inline unsigned long ___pa(unsigned long va)
 #define pfn_valid(pfn) \
((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
 
-#ifdef CONFIG_DISCONTIGMEM
-# error CONFIG_DISCONTIGMEM not supported
-#endif
-
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
 #define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
 #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 11da8af06704..dbe1f5fc901d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -494,8 +494,8 @@ static inline int gfp_zonelist(gfp_t flags)
  * There are two zonelists per node, one for all zones with memory and
  * one containing just zones from the node the zonelist belongs to.
  *
- * For the normal case of non-DISCONTIGMEM systems the NODE_DATA(

[PATCH v2 4/9] m68k: remove support for DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/Kconfig.cpu   | 10 --
 arch/m68k/include/asm/mmzone.h  | 10 --
 arch/m68k/include/asm/page.h|  2 +-
 arch/m68k/include/asm/page_mm.h | 35 -
 arch/m68k/mm/init.c | 20 ---
 5 files changed, 1 insertion(+), 76 deletions(-)
 delete mode 100644 arch/m68k/include/asm/mmzone.h

diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index f4d23977d2a5..29e946394fdb 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -408,10 +408,6 @@ config SINGLE_MEMORY_CHUNK
  order" to save memory that could be wasted for unused memory map.
  Say N if not sure.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   depends on BROKEN
-   def_bool MMU && !SINGLE_MEMORY_CHUNK
-
 config FORCE_MAX_ZONEORDER
int "Maximum zone order" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
@@ -451,11 +447,6 @@ config M68K_L2_CACHE
depends on MAC
default y
 
-config NODES_SHIFT
-   int
-   default "3"
-   depends on DISCONTIGMEM
-
 config CPU_HAS_NO_BITFIELDS
bool
 
@@ -553,4 +544,3 @@ config CACHE_COPYBACK
  The ColdFire CPU cache is set into Copy-back mode.
 endchoice
 endif
-
diff --git a/arch/m68k/include/asm/mmzone.h b/arch/m68k/include/asm/mmzone.h
deleted file mode 100644
index 64573fe8e60d..000000000000
--- a/arch/m68k/include/asm/mmzone.h
+++ /dev/null
@@ -1,10 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_M68K_MMZONE_H_
-#define _ASM_M68K_MMZONE_H_
-
-extern pg_data_t pg_data_map[];
-
-#define NODE_DATA(nid) (&pg_data_map[nid])
-#define NODE_MEM_MAP(nid)  (NODE_DATA(nid)->node_mem_map)
-
-#endif /* _ASM_M68K_MMZONE_H_ */
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 97087dd3ca6d..2f1c54e4725d 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -62,7 +62,7 @@ extern unsigned long _ramend;
 #include 
 #endif
 
-#if !defined(CONFIG_MMU) || defined(CONFIG_DISCONTIGMEM)
+#ifndef CONFIG_MMU
 #define __phys_to_pfn(paddr)   ((unsigned long)((paddr) >> PAGE_SHIFT))
 #define __pfn_to_phys(pfn) PFN_PHYS(pfn)
 #endif
diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h
index 2411ea9ef578..a5b459bcb7d8 100644
--- a/arch/m68k/include/asm/page_mm.h
+++ b/arch/m68k/include/asm/page_mm.h
@@ -126,26 +126,6 @@ static inline void *__va(unsigned long x)
 
 extern int m68k_virt_to_node_shift;
 
-#ifndef CONFIG_DISCONTIGMEM
-#define __virt_to_node(addr)   (&pg_data_map[0])
-#else
-extern struct pglist_data *pg_data_table[];
-
-static inline __attribute_const__ int __virt_to_node_shift(void)
-{
-   int shift;
-
-   asm (
-   "1: moveq   #0,%0\n"
-   m68k_fixup(%c1, 1b)
-   : "=d" (shift)
-   : "i" (m68k_fixup_vnode_shift));
-   return shift;
-}
-
-#define __virt_to_node(addr)   (pg_data_table[(unsigned long)(addr) >> 
__virt_to_node_shift()])
-#endif
-
 #define virt_to_page(addr) ({  \
pfn_to_page(virt_to_pfn(addr)); \
 })
@@ -153,23 +133,8 @@ static inline __attribute_const__ int 
__virt_to_node_shift(void)
pfn_to_virt(page_to_pfn(page)); \
 })
 
-#ifdef CONFIG_DISCONTIGMEM
-#define pfn_to_page(pfn) ({\
-   unsigned long __pfn = (pfn);\
-   struct pglist_data *pgdat;  \
-   pgdat = __virt_to_node((unsigned long)pfn_to_virt(__pfn));  \
-   pgdat->node_mem_map + (__pfn - pgdat->node_start_pfn);  \
-})
-#define page_to_pfn(_page) ({  \
-   const struct page *__p = (_page);   \
-   struct pglist_data *pgdat;  \
-   pgdat = &pg_data_map[page_to_nid(__p)]; \
-   ((__p) - pgdat->node_mem_map) + pgdat->node_start_pfn;  \
-})
-#else
 #define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT)
 #include 
-#endif
 
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && 
(unsigned long)(kaddr) < (unsigned long)high_memory)
 #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))
diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c
index 1759ab875d47..5d749e188246 100644
--- a/arch/m68k/mm/init.c
+++ b/arch/m68k/mm/init.c
@@ -44,28 +44,8 @@ EXPORT_SYMBOL(empty_zero_page);
 
 int m68k_virt_to_node_shift;
 
-#ifdef CONFIG_

[PATCH v2 5/9] mm: remove CONFIG_DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

There are no architectures that support DISCONTIGMEM left.

Remove the configuration option and the dead code it was guarding in the
generic memory management code.

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/memory_model.h | 37 --
 include/linux/mmzone.h |  8 ---
 mm/Kconfig | 25 +++-
 mm/page_alloc.c| 13 ---
 4 files changed, 12 insertions(+), 71 deletions(-)

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index 7637fb46ba4f..a2c8ed60233a 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -6,47 +6,18 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * supports 3 memory models.
+ */
 #if defined(CONFIG_FLATMEM)
 
 #ifndef ARCH_PFN_OFFSET
 #define ARCH_PFN_OFFSET	(0UL)
 #endif
 
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#ifndef arch_pfn_to_nid
-#define arch_pfn_to_nid(pfn)   pfn_to_nid(pfn)
-#endif
-
-#ifndef arch_local_page_offset
-#define arch_local_page_offset(pfn, nid)   \
-   ((pfn) - NODE_DATA(nid)->node_start_pfn)
-#endif
-
-#endif /* CONFIG_DISCONTIGMEM */
-
-/*
- * supports 3 memory models.
- */
-#if defined(CONFIG_FLATMEM)
-
 #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
 #define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
 ARCH_PFN_OFFSET)
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#define __pfn_to_page(pfn) \
-({ unsigned long __pfn = (pfn);\
-   unsigned long __nid = arch_pfn_to_nid(__pfn);  \
-   NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
-})
-
-#define __page_to_pfn(pg)  \
-({ const struct page *__pg = (pg); \
-   struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
-   (unsigned long)(__pg - __pgdat->node_mem_map) + \
-__pgdat->node_start_pfn;   \
-})
 
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 
@@ -70,7 +41,7 @@
struct mem_section *__sec = __pfn_to_section(__pfn);\
__section_mem_map_addr(__sec) + __pfn;  \
 })
-#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
+#endif /* CONFIG_FLATMEM/SPARSEMEM */
 
 /*
  * Convert a physical address to a Page Frame Number and back
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..700032e99419 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,10 +738,12 @@ struct zonelist {
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
 };
 
-#ifndef CONFIG_DISCONTIGMEM
-/* The array of struct pages - for discontigmem use pgdat->lmem_map */
+/*
+ * The array of struct pages for flatmem.
+ * It must be declared for SPARSEMEM as well because there are configurations
+ * that rely on that.
+ */
 extern struct page *mem_map;
-#endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 struct deferred_split {
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3420f5..218b96ccc84a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,7 +19,7 @@ choice
 
 config FLATMEM_MANUAL
bool "Flat Memory"
-   depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || 
ARCH_FLATMEM_ENABLE
+   depends on !ARCH_SPARSEMEM_ENABLE || ARCH_FLATMEM_ENABLE
help
  This option is best suited for non-NUMA systems with
  flat address space. The FLATMEM is the most efficient
@@ -32,21 +32,6 @@ config FLATMEM_MANUAL
 
  If unsure, choose this option (Flat Memory) over any other.
 
-config DISCONTIGMEM_MANUAL
-   bool "Discontiguous Memory"
-   depends on ARCH_DISCONTIGMEM_ENABLE
-   help
- This option provides enhanced support for discontiguous
- memory systems, over FLATMEM.  These systems have holes
- in their physical address spaces, and this option provides
- more efficient handling of these holes.
-
- Although "Discontiguous Memory" is still used by several
- architectures, it is considered deprecated in favor of
- "Sparse Memory".
-
- If unsure, choose "Sparse Memory" over this option.
-
 config SPARSEMEM_MANUAL
bool "Sparse Memory"
depends on ARCH_SPARSEMEM_ENABLE
@@ -62,17 +47,13 @@ config SPARSEMEM_MANUAL
 
 endchoice
 
-config DISCONTIGMEM
-   def_bool y
-   depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || 
DISCONTIGMEM_MANUAL
-
 config SPARSEMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_SPARSEMEM_ENABLE) || 
SPARSEMEM_MANUAL
 
 config FLATMEM
def_bool y
-   depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL
+   depends on !SPARSEMEM || FLATMEM_MANUAL
 
 config FLAT_NODE_M

[PATCH v2 2/9] arc: update comment about HIGHMEM implementation

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

Arc does not use DISCONTIGMEM to implement high memory; update the comment
describing how high memory works to reflect this.

Signed-off-by: Mike Rapoport 
---
 arch/arc/mm/init.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index e2ed355438c9..397a201adfe3 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
 
 #ifdef CONFIG_HIGHMEM
/*
-* Populate a new node with highmem
-*
 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
-* than addresses in normal ala low memory (0x8000_0000 based).
+* than addresses in normal aka low memory (0x8000_0000 based).
 * Even with PAE, the huge peripheral space hole would waste a lot of
-* mem with single mem_map[]. This warrants a mem_map per region design.
-* Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
-*
-* DISCONTIGMEM in turns requires multiple nodes. node 0 above is
-* populated with normal memory zone while node 1 only has highmem
+* mem with single contiguous mem_map[].
+* Thus when HIGHMEM on ARC is enabled the memory map corresponding
+* to the hole is freed and ARC specific version of pfn_valid()
+* handles the hole in the memory map.
 */
 #ifdef CONFIG_DISCONTIGMEM
node_set_online(1);
-- 
2.28.0
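
The ARC-specific pfn_valid() mentioned in the updated comment boils down
to two range checks once the hole in the memory map is freed. A
simplified sketch (assuming disjoint lowmem/highmem PFN ranges with
bounds as in arch/arc/mm/init.c; not the verbatim kernel code):

/* valid iff the pfn falls in either populated range, hole excluded */
static int arc_pfn_valid(unsigned long pfn)
{
	return (pfn >= min_low_pfn && pfn <= max_low_pfn) ||
	       (pfn >= min_high_pfn && pfn <= max_high_pfn);
}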




[PATCH v2 0/9] Remove DISCONTIGMEM memory model

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

The SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a
(long) while ago. The last architectures that used DISCONTIGMEM were
updated to use other memory models in v5.11, and it is about time to
entirely remove DISCONTIGMEM from the kernel.

This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
model selection in mm/Kconfig and replaces usage of redundant
CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
and CONFIG_FLATMEM respectively. 

I've also removed NUMA support on alpha, which had been BROKEN for more
than 15 years.

There were also minor updates all over arch/ to remove mentions of
DISCONTIGMEM in comments and #ifdefs.

v2:
* Fix build errors reported by kbuild bot
* Add additional cleanups in m68k as suggested by Geert

v1: Link: https://lore.kernel.org/lkml/20210602105348.13387-1-r...@kernel.org

Mike Rapoport (9):
  alpha: remove DISCONTIGMEM and NUMA
  arc: update comment about HIGHMEM implementation
  arc: remove support for DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONTIGMEM
  docs: remove description of DISCONTIGMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

 Documentation/admin-guide/sysctl/vm.rst |  12 +-
 Documentation/vm/memory-model.rst   |  45 +
 arch/alpha/Kconfig  |  22 ---
 arch/alpha/include/asm/machvec.h|   6 -
 arch/alpha/include/asm/mmzone.h | 100 ---
 arch/alpha/include/asm/pgtable.h|   4 -
 arch/alpha/include/asm/topology.h   |  39 -
 arch/alpha/kernel/core_marvel.c |  53 +-
 arch/alpha/kernel/core_wildfire.c   |  29 +--
 arch/alpha/kernel/pci_iommu.c   |  29 ---
 arch/alpha/kernel/proto.h   |   8 -
 arch/alpha/kernel/setup.c   |  16 --
 arch/alpha/kernel/sys_marvel.c  |   5 -
 arch/alpha/kernel/sys_wildfire.c|   5 -
 arch/alpha/mm/Makefile  |   2 -
 arch/alpha/mm/init.c|   3 -
 arch/alpha/mm/numa.c| 223 
 arch/arc/Kconfig|  13 --
 arch/arc/include/asm/mmzone.h   |  40 -
 arch/arc/mm/init.c  |  21 +--
 arch/arm64/Kconfig  |   2 +-
 arch/ia64/Kconfig   |   2 +-
 arch/ia64/kernel/topology.c |   5 +-
 arch/ia64/mm/numa.c |   5 +-
 arch/m68k/Kconfig.cpu   |  10 --
 arch/m68k/include/asm/mmzone.h  |  10 --
 arch/m68k/include/asm/page.h|   2 +-
 arch/m68k/include/asm/page_mm.h |  35 
 arch/m68k/mm/init.c |  20 ---
 arch/mips/Kconfig   |   2 +-
 arch/mips/include/asm/mmzone.h  |   8 +-
 arch/mips/include/asm/page.h|   2 +-
 arch/mips/mm/init.c |   7 +-
 arch/nds32/include/asm/memory.h |   6 -
 arch/powerpc/Kconfig|   2 +-
 arch/powerpc/include/asm/mmzone.h   |   4 +-
 arch/powerpc/kernel/setup_64.c  |   2 +-
 arch/powerpc/kernel/smp.c   |   2 +-
 arch/powerpc/kexec/core.c   |   4 +-
 arch/powerpc/mm/Makefile|   2 +-
 arch/powerpc/mm/mem.c   |   4 +-
 arch/riscv/Kconfig  |   2 +-
 arch/s390/Kconfig   |   2 +-
 arch/sh/include/asm/mmzone.h|   4 +-
 arch/sh/kernel/topology.c   |   2 +-
 arch/sh/mm/Kconfig  |   2 +-
 arch/sh/mm/init.c   |   2 +-
 arch/sparc/Kconfig  |   2 +-
 arch/sparc/include/asm/mmzone.h |   4 +-
 arch/sparc/kernel/smp_64.c  |   2 +-
 arch/sparc/mm/init_64.c |  12 +-
 arch/x86/Kconfig|   2 +-
 arch/x86/kernel/setup_percpu.c  |   6 +-
 arch/x86/mm/init_32.c   |   4 +-
 arch/xtensa/include/asm/page.h  |   4 -
 include/asm-generic/memory_model.h  |  37 +---
 include/asm-generic/topology.h  |   2 +-
 include/linux/gfp.h |   4 +-
 include/linux/memblock.h|   6 +-
 include/linux/mm.h  |   4 +-
 include/linux/mmzone.h  |  20 ++-
 kernel/crash_core.c |   4 +-
 mm/Kconfig  |  36 +---
 mm/memblock.c   |   8 +-
 mm/page_alloc.c |  25 +--
 mm/page_ext.c   |   2 +-
 66 files changed, 100 insertions(+), 909 deletions(-)
 delete mode 100644 arch/alpha/include/asm/mmzone.h
 delete mode 100644 arch/alpha/mm/numa.c
 delete mode 100644 arch/arc/include/asm/mmzone.h
 delete mode 100644 arch/m68k/include/asm/mmzone.h


base-commit: c4681547bcce777daf576925a966ffa824edd09d
-- 
2.28.0

[PATCH v2 3/9] arc: remove support for DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
---
 arch/arc/Kconfig  | 13 
 arch/arc/include/asm/mmzone.h | 40 ---
 arch/arc/mm/init.c|  8 ---
 3 files changed, 61 deletions(-)
 delete mode 100644 arch/arc/include/asm/mmzone.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 2d98501c0897..d8f51eb8963b 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
 config GENERIC_CSUM
def_bool y
 
-config ARCH_DISCONTIGMEM_ENABLE
-   def_bool n
-   depends on BROKEN
-
 config ARCH_FLATMEM_ENABLE
def_bool y
 
@@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
 
 endchoice
 
-config NODES_SHIFT
-   int "Maximum NUMA Nodes (as a power of 2)"
-   default "0" if !DISCONTIGMEM
-   default "1" if DISCONTIGMEM
-   depends on NEED_MULTIPLE_NODES
-   help
- Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
- zones.
-
 config ARC_COMPACT_IRQ_LEVELS
depends on ISA_ARCOMPACT
bool "Setup Timer IRQ as high Priority"
diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
deleted file mode 100644
index b86b9d1e54dc..000000000000
--- a/arch/arc/include/asm/mmzone.h
+++ /dev/null
@@ -1,40 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
- */
-
-#ifndef _ASM_ARC_MMZONE_H
-#define _ASM_ARC_MMZONE_H
-
-#ifdef CONFIG_DISCONTIGMEM
-
-extern struct pglist_data node_data[];
-#define NODE_DATA(nid) (&node_data[nid])
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-   int is_end_low = 1;
-
-   if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
-   is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL);
-
-   /*
-* node 0: lowmem:             0x8000_0000   to 0xFFFF_FFFF
-* node 1: HIGHMEM w/o  PAE40: 0x0           to 0x7FFF_FFFF
-*         HIGHMEM with PAE40: 0x1_0000_0000 to ...
-*/
-   if (pfn >= ARCH_PFN_OFFSET && is_end_low)
-   return 0;
-
-   return 1;
-}
-
-static inline int pfn_valid(unsigned long pfn)
-{
-   int nid = pfn_to_nid(pfn);
-
-   return (pfn <= node_end_pfn(nid));
-}
-#endif /* CONFIG_DISCONTIGMEM  */
-
-#endif
diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 397a201adfe3..abfeef7bf6f8 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
 EXPORT_SYMBOL(arch_pfn_offset);
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
-#endif
-
 long __init arc_get_mem_sz(void)
 {
return low_mem_sz;
@@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
 * to the hole is freed and ARC specific version of pfn_valid()
 * handles the hole in the memory map.
 */
-#ifdef CONFIG_DISCONTIGMEM
-   node_set_online(1);
-#endif
 
min_high_pfn = PFN_DOWN(high_mem_start);
max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
-- 
2.28.0




[PATCH v2 1/9] alpha: remove DISCONTIGMEM and NUMA

2021-06-04 Thread Mike Rapoport
From: Mike Rapoport 

NUMA has been marked broken on alpha for more than 15 years, and DISCONTIGMEM was
replaced with SPARSEMEM in v5.11.

Remove both NUMA and DISCONTIGMEM support from alpha.

Signed-off-by: Mike Rapoport 
---
 arch/alpha/Kconfig|  22 ---
 arch/alpha/include/asm/machvec.h  |   6 -
 arch/alpha/include/asm/mmzone.h   | 100 --
 arch/alpha/include/asm/pgtable.h  |   4 -
 arch/alpha/include/asm/topology.h |  39 --
 arch/alpha/kernel/core_marvel.c   |  53 +--
 arch/alpha/kernel/core_wildfire.c |  29 +---
 arch/alpha/kernel/pci_iommu.c |  29 
 arch/alpha/kernel/proto.h |   8 --
 arch/alpha/kernel/setup.c |  16 ---
 arch/alpha/kernel/sys_marvel.c|   5 -
 arch/alpha/kernel/sys_wildfire.c  |   5 -
 arch/alpha/mm/Makefile|   2 -
 arch/alpha/mm/init.c  |   3 -
 arch/alpha/mm/numa.c  | 223 --
 15 files changed, 4 insertions(+), 540 deletions(-)
 delete mode 100644 arch/alpha/include/asm/mmzone.h
 delete mode 100644 arch/alpha/mm/numa.c

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5998106faa60..8954216b9956 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -549,29 +549,12 @@ config NR_CPUS
  MARVEL support can handle a maximum of 32 CPUs, all the others
  with working support have a maximum of 4 CPUs.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   bool "Discontiguous Memory Support"
-   depends on BROKEN
-   help
- Say Y to support efficient handling of discontiguous physical memory,
- for architectures which are either NUMA (Non-Uniform Memory Access)
- or have huge holes in the physical address space for other reasons.
- See  for more.
-
 config ARCH_SPARSEMEM_ENABLE
bool "Sparse Memory Support"
help
  Say Y to support efficient handling of discontiguous physical memory,
  for systems that have huge holes in the physical address space.
 
-config NUMA
-   bool "NUMA Support (EXPERIMENTAL)"
-   depends on DISCONTIGMEM && BROKEN
-   help
- Say Y to compile the kernel to support NUMA (Non-Uniform Memory
- Access).  This option is for configuring high-end multiprocessor
- server machines.  If in doubt, say N.
-
 config ALPHA_WTINT
bool "Use WTINT" if ALPHA_SRM || ALPHA_GENERIC
default y if ALPHA_QEMU
@@ -596,11 +579,6 @@ config ALPHA_WTINT
 
  If unsure, say N.
 
-config NODES_SHIFT
-   int
-   default "7"
-   depends on NEED_MULTIPLE_NODES
-
 # LARGE_VMALLOC is racy, if you *really* need it then fix it first
 config ALPHA_LARGE_VMALLOC
bool
diff --git a/arch/alpha/include/asm/machvec.h b/arch/alpha/include/asm/machvec.h
index a4e96e2bec74..e49fabce7b33 100644
--- a/arch/alpha/include/asm/machvec.h
+++ b/arch/alpha/include/asm/machvec.h
@@ -99,12 +99,6 @@ struct alpha_machine_vector
 
const char *vector_name;
 
-   /* NUMA information */
-   int (*pa_to_nid)(unsigned long);
-   int (*cpuid_to_nid)(int);
-   unsigned long (*node_mem_start)(int);
-   unsigned long (*node_mem_size)(int);
-
/* System specific parameters.  */
union {
struct {
diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h
deleted file mode 100644
index 86644604d977..000000000000
--- a/arch/alpha/include/asm/mmzone.h
+++ /dev/null
@@ -1,100 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Written by Kanoj Sarcar (ka...@sgi.com) Aug 99
- * Adapted for the alpha wildfire architecture Jan 2001.
- */
-#ifndef _ASM_MMZONE_H_
-#define _ASM_MMZONE_H_
-
-#ifdef CONFIG_DISCONTIGMEM
-
-#include 
-
-/*
- * Following are macros that are specific to this numa platform.
- */
-
-extern pg_data_t node_data[];
-
-#define alpha_pa_to_nid(pa)\
-(alpha_mv.pa_to_nid\
-? alpha_mv.pa_to_nid(pa)   \
-: (0))
-#define node_mem_start(nid)\
-(alpha_mv.node_mem_start   \
-? alpha_mv.node_mem_start(nid) \
-: (0UL))
-#define node_mem_size(nid) \
-(alpha_mv.node_mem_size\
-? alpha_mv.node_mem_size(nid)  \
-: ((nid) ? (0UL) : (~0UL)))
-
-#define pa_to_nid(pa)  alpha_pa_to_nid(pa)
-#define NODE_DATA(nid) (&node_data[(nid)])
-
-#define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn)
-
-#if 1
-#define PLAT_NODE_DATA_LOCALNR(p, n)   \
-   (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
-#else
-static inline unsigned long
-PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
-{
-   unsigned long temp;
-   temp = p >> PAGE_SHIFT;
-   return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn;
-}
-#endif
-
-/*
- * Following are macros that each numa implementation must define.
- */
-
-/*
- * Given a kernel address,

Re: [PATCH 4/9] m68k: remove support for DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
On Wed, Jun 02, 2021 at 01:25:24PM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Wed, Jun 2, 2021 at 12:54 PM Mike Rapoport  wrote:
> > From: Mike Rapoport 
> >
> > DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
> > in v5.11.
> >
> > Remove the support for DISCONTIGMEM entirely.
> >
> > Signed-off-by: Mike Rapoport 
> 
> Thanks for your patch!
> 
> Reviewed-by: Geert Uytterhoeven 
> Acked-by: Geert Uytterhoeven 
> 
> > --- a/arch/m68k/include/asm/page_mm.h
> > +++ b/arch/m68k/include/asm/page_mm.h
> > @@ -126,25 +126,7 @@ static inline void *__va(unsigned long x)
> >
> >  extern int m68k_virt_to_node_shift;
> >
> > -#ifndef CONFIG_DISCONTIGMEM
> >  #define __virt_to_node(addr)   (&pg_data_map[0])
> 
> With pg_data_map[] removed, this definition can go as well.
> Seems to be a leftover from 1008a11590b966b4 ("m68k: switch to MEMBLOCK
>  + NO_BOOTMEM")
> 
> There are a few more:
> arch/m68k/include/asm/mmzone.h:extern pg_data_t pg_data_map[];
> arch/m68k/include/asm/mmzone.h:#define NODE_DATA(nid)	(&pg_data_map[nid])

It seems that arch/m68k/include/asm/mmzone.h can be simply removed.
 
> > -#else
> > -extern struct pglist_data *pg_data_table[];
> > -
> > -static inline __attribute_const__ int __virt_to_node_shift(void)
> > -{
> > -   int shift;
> > -
> > -   asm (
> > -   "1: moveq   #0,%0\n"
> > -   m68k_fixup(%c1, 1b)
> > -   : "=d" (shift)
> > -   : "i" (m68k_fixup_vnode_shift));
> > -   return shift;
> > -}
> > -
> > -#define __virt_to_node(addr)   (pg_data_table[(unsigned long)(addr) >> 
> > __virt_to_node_shift()])
> > -#endif
> 
> > --- a/arch/m68k/mm/init.c
> > +++ b/arch/m68k/mm/init.c
> > @@ -44,28 +44,8 @@ EXPORT_SYMBOL(empty_zero_page);
> >
> >  int m68k_virt_to_node_shift;
> >
> > -#ifdef CONFIG_DISCONTIGMEM
> > -pg_data_t pg_data_map[MAX_NUMNODES];
> > -EXPORT_SYMBOL(pg_data_map);
> > -
> > -pg_data_t *pg_data_table[65];
> > -EXPORT_SYMBOL(pg_data_table);
> > -#endif
> > -
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds

-- 
Sincerely yours,
Mike.



[PATCH 7/9] docs: remove description of DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

Remove description of DISCONTIGMEM from the "Memory Models" document and
update the VM sysctl description so that it no longer mentions DISCONTIGMEM.

Signed-off-by: Mike Rapoport 
---
 Documentation/admin-guide/sysctl/vm.rst | 12 +++
 Documentation/vm/memory-model.rst   | 45 ++---
 2 files changed, 8 insertions(+), 49 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..ddbd71d592e0 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -936,12 +936,12 @@ allocations, THP and hugetlbfs pages.
 
 To make it sensible with respect to the watermark_scale_factor
 parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
 
 
 watermark_scale_factor
diff --git a/Documentation/vm/memory-model.rst 
b/Documentation/vm/memory-model.rst
index ce398a7dc6cd..30e8fbed6914 100644
--- a/Documentation/vm/memory-model.rst
+++ b/Documentation/vm/memory-model.rst
@@ -14,15 +14,11 @@ for the CPU. Then there could be several contiguous ranges 
at
 completely distinct addresses. And, don't forget about NUMA, where
 different memory banks are attached to different CPUs.
 
-Linux abstracts this diversity using one of the three memory models:
-FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
+Linux abstracts this diversity using one of the two memory models:
+FLATMEM and SPARSEMEM. Each architecture defines what
 memory models it supports, what the default memory model is and
 whether it is possible to manually override that default.
 
-.. note::
-   At time of this writing, DISCONTIGMEM is considered deprecated,
-   although it is still in use by several architectures.
-
 All the memory models track the status of physical page frames using
 struct page arranged in one or more arrays.
 
@@ -63,43 +59,6 @@ straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
 The `ARCH_PFN_OFFSET` defines the first page frame number for
 systems with physical memory starting at address different from 0.
 
-DISCONTIGMEM
-============
-
-The DISCONTIGMEM model treats the physical memory as a collection of
-`nodes` similarly to how Linux NUMA support does. For each node Linux
-constructs an independent memory management subsystem represented by
-`struct pglist_data` (or `pg_data_t` for short). Among other
-things, `pg_data_t` holds the `node_mem_map` array that maps
-physical pages belonging to that node. The `node_start_pfn` field of
-`pg_data_t` is the number of the first page frame belonging to that
-node.
-
-The architecture setup code should call :c:func:`free_area_init_node` for
-each node in the system to initialize the `pg_data_t` object and its
-`node_mem_map`.
-
-Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
-every physical page frame in a node has a `struct page` entry in the
-`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
-`flags` field of the `struct page` encodes the node number of the
-node hosting that page.
-
-The conversion between a PFN and the `struct page` in the
-DISCONTIGMEM model became slightly more complex as it has to determine
-which node hosts the physical page and which `pg_data_t` object
-holds the `struct page`.
-
-Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
-to convert PFN to the node number. The opposite conversion helper
-:c:func:`page_to_nid` is generic as it uses the node number encoded in
-page->flags.
-
-Once the node number is known, the PFN can be used to index
-appropriate `node_mem_map` array to access the `struct page` and
-the offset of the `struct page` from the `node_mem_map` plus
-`node_start_pfn` is the PFN of that page.
-
 SPARSEMEM
 =========
 
-- 
2.28.0
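
To make the boost-factor units above concrete, a small worked example
(hypothetical numbers; the kernel's actual computation lives in
mm/page_alloc.c):

/* watermark_boost_factor is expressed in fractions of 10,000 */
static unsigned long boost_limit(unsigned long high_wmark_pages,
				 unsigned long boost_factor)
{
	return high_wmark_pages * boost_factor / 10000;
}

/*
 * boost_limit(10000, 15000) == 15000 pages: with the default factor of
 * 15,000, up to 150% of the high watermark may be reclaimed after a
 * fragmentation event.
 */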




[PATCH 6/9] arch, mm: remove stale mentions of DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

There are several places that mention DISCONTIGMEM in comments or have stale
code guarded by CONFIG_DISCONTIGMEM.

Remove the dead code and update the comments.

Signed-off-by: Mike Rapoport 
---
 arch/ia64/kernel/topology.c | 5 ++---
 arch/ia64/mm/numa.c | 5 ++---
 arch/mips/include/asm/mmzone.h  | 6 --
 arch/mips/mm/init.c | 3 ---
 arch/nds32/include/asm/memory.h | 6 --
 arch/xtensa/include/asm/page.h  | 4 
 include/linux/gfp.h | 4 ++--
 7 files changed, 6 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/kernel/topology.c b/arch/ia64/kernel/topology.c
index 09fc385c2acd..3639e0a7cb3b 100644
--- a/arch/ia64/kernel/topology.c
+++ b/arch/ia64/kernel/topology.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 2002/08/07 Erich Focht 
  * Populate cpu entries in sysfs for non-numa systems as well
  * Intel Corporation - Ashok Raj
diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c
index 46b6e5f3a40f..d6579ec3ea32 100644
--- a/arch/ia64/mm/numa.c
+++ b/arch/ia64/mm/numa.c
@@ -3,9 +3,8 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * This file contains NUMA specific variables and functions which can
- * be split away from DISCONTIGMEM and are used on NUMA machines with
- * contiguous memory.
+ * This file contains NUMA specific variables and functions which are used on
+ * NUMA machines with contiguous memory.
  * 
  * 2002/08/07 Erich Focht 
  */
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index b826b8473e95..7649ab45e80c 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -20,10 +20,4 @@
 #define nid_to_addrbase(nid) 0
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-
-#define pfn_to_nid(pfn)	pa_to_nid((pfn) << PAGE_SHIFT)
-
-#endif /* CONFIG_DISCONTIGMEM */
-
 #endif /* _ASM_MMZONE_H_ */
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index c36358758969..97f6ca341448 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -454,9 +454,6 @@ void __init mem_init(void)
BUILD_BUG_ON(IS_ENABLED(CONFIG_32BIT) && (_PFN_SHIFT > PAGE_SHIFT));
 
 #ifdef CONFIG_HIGHMEM
-#ifdef CONFIG_DISCONTIGMEM
-#error "CONFIG_HIGHMEM and CONFIG_DISCONTIGMEM dont work together yet"
-#endif
max_mapnr = highend_pfn ? highend_pfn : max_low_pfn;
 #else
max_mapnr = max_low_pfn;
diff --git a/arch/nds32/include/asm/memory.h b/arch/nds32/include/asm/memory.h
index 940d32842793..62faafbc28e4 100644
--- a/arch/nds32/include/asm/memory.h
+++ b/arch/nds32/include/asm/memory.h
@@ -76,18 +76,12 @@
  *  virt_to_page(k)convert a _valid_ virtual address to struct page *
  *  virt_addr_valid(k) indicates whether a virtual address is valid
  */
-#ifndef CONFIG_DISCONTIGMEM
-
 #define ARCH_PFN_OFFSET	PHYS_PFN_OFFSET
 #define pfn_valid(pfn) ((pfn) >= PHYS_PFN_OFFSET && (pfn) < 
(PHYS_PFN_OFFSET + max_mapnr))
 
 #define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && 
(unsigned long)(kaddr) < (unsigned long)high_memory)
 
-#else /* CONFIG_DISCONTIGMEM */
-#error CONFIG_DISCONTIGMEM is not supported yet.
-#endif /* !CONFIG_DISCONTIGMEM */
-
 #define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
 
 #endif
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 37ce25ef92d6..493eb7083b1a 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -192,10 +192,6 @@ static inline unsigned long ___pa(unsigned long va)
 #define pfn_valid(pfn) \
((pfn) >= ARCH_PFN_OFFSET && ((pfn) - ARCH_PFN_OFFSET) < max_mapnr)
 
-#ifdef CONFIG_DISCONTIGMEM
-# error CONFIG_DISCONTIGMEM not supported
-#endif
-
 #define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
 #define page_to_virt(page) __va(page_to_pfn(page) << PAGE_SHIFT)
 #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 11da8af06704..dbe1f5fc901d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -494,8 +494,8 @@ static inline int gfp_zonelist(gfp_t flags)
  * There are two zonelists per node, one for all zones with memory and
  * one containing just zones from the node the zonelist belongs to.
  *
- * For the normal case of non-DISCONTIGMEM systems the NODE_DATA(

[PATCH 9/9] mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Signed-off-by: Mike Rapoport 
---
 include/linux/mmzone.h | 4 ++--
 kernel/crash_core.c| 2 +-
 mm/Kconfig | 4 
 mm/page_alloc.c| 6 +++---
 mm/page_ext.c  | 2 +-
 5 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ad42f440c704..2698cdbfbf75 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -775,7 +775,7 @@ typedef struct pglist_data {
struct zonelist node_zonelists[MAX_ZONELISTS];
 
int nr_zones; /* number of populated zones in this node */
-#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
+#ifdef CONFIG_FLATMEM  /* means !SPARSEMEM */
struct page *node_mem_map;
 #ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
@@ -865,7 +865,7 @@ typedef struct pglist_data {
 
 #define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
 #define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
 #define pgdat_page_nr(pgdat, pagenr)   pfn_to_page((pgdat)->node_start_pfn + 
(pagenr))
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 53eb8bc6026d..2b8446ea7105 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -483,7 +483,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(page, compound_head);
VMCOREINFO_OFFSET(pglist_data, node_zones);
VMCOREINFO_OFFSET(pglist_data, nr_zones);
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
VMCOREINFO_OFFSET(pglist_data, node_mem_map);
 #endif
VMCOREINFO_OFFSET(pglist_data, node_start_pfn);
diff --git a/mm/Kconfig b/mm/Kconfig
index bffe4bd859f3..ded98fb859ab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -55,10 +55,6 @@ config FLATMEM
def_bool y
depends on !SPARSEMEM || FLATMEM_MANUAL
 
-config FLAT_NODE_MEM_MAP
-   def_bool y
-   depends on !SPARSEMEM
-
 #
 # SPARSEMEM_EXTREME (which is the default) does some bootmem
 # allocations when sparse_init() is called.  If this cannot
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f08135d3eb4..f039736541eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6444,7 +6444,7 @@ static void __meminit zone_init_free_lists(struct zone 
*zone)
}
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+#if !defined(CONFIG_FLATMEM)
 /*
  * Only struct pages that correspond to ranges defined by memblock.memory
  * are zeroed and initialized by going through __init_single_page() during
@@ -7241,7 +7241,7 @@ static void __init free_area_init_core(struct pglist_data 
*pgdat)
}
 }
 
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLATMEM
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat)
 {
unsigned long __maybe_unused start = 0;
@@ -7289,7 +7289,7 @@ static void __ref alloc_node_mem_map(struct pglist_data 
*pgdat)
 }
 #else
 static void __ref alloc_node_mem_map(struct pglist_data *pgdat) { }
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+#endif /* CONFIG_FLATMEM */
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index df6f74aac8e1..293b2685fc48 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -191,7 +191,7 @@ void __init page_ext_init_flatmem(void)
panic("Out of memory");
 }
 
-#else /* CONFIG_FLAT_NODE_MEM_MAP */
+#else /* CONFIG_FLATMEM */
 
 struct page_ext *lookup_page_ext(const struct page *page)
 {
-- 
2.28.0




[PATCH 5/9] mm: remove CONFIG_DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

There are no architectures that support DISCONTIGMEM left.

Remove the configuration option and the dead code it was guarding in the
generic memory management code.

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/memory_model.h | 37 --
 include/linux/mmzone.h |  4 ++--
 mm/Kconfig | 25 +++-
 mm/memory.c|  3 +--
 mm/page_alloc.c| 13 ---
 5 files changed, 10 insertions(+), 72 deletions(-)

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index 7637fb46ba4f..a2c8ed60233a 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -6,47 +6,18 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * supports 3 memory models.
+ */
 #if defined(CONFIG_FLATMEM)
 
 #ifndef ARCH_PFN_OFFSET
 #define ARCH_PFN_OFFSET	(0UL)
 #endif
 
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#ifndef arch_pfn_to_nid
-#define arch_pfn_to_nid(pfn)   pfn_to_nid(pfn)
-#endif
-
-#ifndef arch_local_page_offset
-#define arch_local_page_offset(pfn, nid)   \
-   ((pfn) - NODE_DATA(nid)->node_start_pfn)
-#endif
-
-#endif /* CONFIG_DISCONTIGMEM */
-
-/*
- * supports 3 memory models.
- */
-#if defined(CONFIG_FLATMEM)
-
 #define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
 #define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + \
 ARCH_PFN_OFFSET)
-#elif defined(CONFIG_DISCONTIGMEM)
-
-#define __pfn_to_page(pfn) \
-({ unsigned long __pfn = (pfn);\
-   unsigned long __nid = arch_pfn_to_nid(__pfn);  \
-   NODE_DATA(__nid)->node_mem_map + arch_local_page_offset(__pfn, __nid);\
-})
-
-#define __page_to_pfn(pg)  \
-({ const struct page *__pg = (pg); \
-   struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg)); \
-   (unsigned long)(__pg - __pgdat->node_mem_map) + \
-__pgdat->node_start_pfn;   \
-})
 
 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
 
@@ -70,7 +41,7 @@
struct mem_section *__sec = __pfn_to_section(__pfn);\
__section_mem_map_addr(__sec) + __pfn;  \
 })
-#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */
+#endif /* CONFIG_FLATMEM/SPARSEMEM */
 
 /*
  * Convert a physical address to a Page Frame Number and back
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..2b41e252a995 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -738,8 +738,8 @@ struct zonelist {
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
 };
 
-#ifndef CONFIG_DISCONTIGMEM
-/* The array of struct pages - for discontigmem use pgdat->lmem_map */
+#ifdef CONFIG_FLATMEM
+/* The array of struct pages for flatmem */
 extern struct page *mem_map;
 #endif
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3420f5..218b96ccc84a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,7 +19,7 @@ choice
 
 config FLATMEM_MANUAL
bool "Flat Memory"
-   depends on !(ARCH_DISCONTIGMEM_ENABLE || ARCH_SPARSEMEM_ENABLE) || 
ARCH_FLATMEM_ENABLE
+   depends on !ARCH_SPARSEMEM_ENABLE || ARCH_FLATMEM_ENABLE
help
  This option is best suited for non-NUMA systems with
  flat address space. The FLATMEM is the most efficient
@@ -32,21 +32,6 @@ config FLATMEM_MANUAL
 
  If unsure, choose this option (Flat Memory) over any other.
 
-config DISCONTIGMEM_MANUAL
-   bool "Discontiguous Memory"
-   depends on ARCH_DISCONTIGMEM_ENABLE
-   help
- This option provides enhanced support for discontiguous
- memory systems, over FLATMEM.  These systems have holes
- in their physical address spaces, and this option provides
- more efficient handling of these holes.
-
- Although "Discontiguous Memory" is still used by several
- architectures, it is considered deprecated in favor of
- "Sparse Memory".
-
- If unsure, choose "Sparse Memory" over this option.
-
 config SPARSEMEM_MANUAL
bool "Sparse Memory"
depends on ARCH_SPARSEMEM_ENABLE
@@ -62,17 +47,13 @@ config SPARSEMEM_MANUAL
 
 endchoice
 
-config DISCONTIGMEM
-   def_bool y
-   depends on (!SELECT_MEMORY_MODEL && ARCH_DISCONTIGMEM_ENABLE) || 
DISCONTIGMEM_MANUAL
-
 config SPARSEMEM
def_bool y
depends on (!SELECT_MEMORY_MODEL && ARCH_SPARSEMEM_ENABLE) || 
SPARSEMEM_MANUAL
 
 config FLATMEM
def_bool y
-   depends on (!DISCONTIGMEM && !SPARSEMEM) || FLATMEM_MANUAL
+   depends on !SPARSEMEM || FLATMEM_MANUAL
 
 config FLAT_NODE_MEM_MAP
def_bool y
@@ -85,7 +66,7 @@ config FLAT_NODE_MEM_MAP
 #
 config NEED_MULTIPLE_NODES
d

[PATCH 8/9] mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

After removal of DISCONTIGMEM, the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig|  2 +-
 arch/ia64/Kconfig |  2 +-
 arch/mips/Kconfig |  2 +-
 arch/mips/include/asm/mmzone.h|  2 +-
 arch/mips/include/asm/page.h  |  2 +-
 arch/mips/mm/init.c   |  4 ++--
 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/mmzone.h |  4 ++--
 arch/powerpc/kernel/setup_64.c|  2 +-
 arch/powerpc/kernel/smp.c |  2 +-
 arch/powerpc/kexec/core.c |  4 ++--
 arch/powerpc/mm/Makefile  |  2 +-
 arch/powerpc/mm/mem.c |  4 ++--
 arch/riscv/Kconfig|  2 +-
 arch/s390/Kconfig |  2 +-
 arch/sh/include/asm/mmzone.h  |  4 ++--
 arch/sh/kernel/topology.c |  2 +-
 arch/sh/mm/Kconfig|  2 +-
 arch/sh/mm/init.c |  2 +-
 arch/sparc/Kconfig|  2 +-
 arch/sparc/include/asm/mmzone.h   |  4 ++--
 arch/sparc/kernel/smp_64.c|  2 +-
 arch/sparc/mm/init_64.c   | 12 ++--
 arch/x86/Kconfig  |  2 +-
 arch/x86/kernel/setup_percpu.c|  6 +++---
 arch/x86/mm/init_32.c |  4 ++--
 include/asm-generic/topology.h|  2 +-
 include/linux/memblock.h  |  6 +++---
 include/linux/mm.h|  4 ++--
 include/linux/mmzone.h|  8 
 kernel/crash_core.c   |  2 +-
 mm/Kconfig|  9 -
 mm/memblock.c |  8 
 mm/page_alloc.c   |  6 +++---
 34 files changed, 58 insertions(+), 67 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..d01a1545ab8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1035,7 +1035,7 @@ config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
range 1 10
default "4"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  Specify the maximum number of NUMA Nodes available on the target
  system.  Increases memory reserved to accommodate various tables.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 279252e3e0f7..da22a35e6f03 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -302,7 +302,7 @@ config NODES_SHIFT
int "Max num nodes shift(3-10)"
range 3 10
default "10"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
help
  This option specifies the maximum number of nodes in your SSI system.
  MAX_NUMNODES will be 2^(This value).
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index ed51970c08e7..4704a16c2e44 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2867,7 +2867,7 @@ config RANDOMIZE_BASE_MAX_OFFSET
 config NODES_SHIFT
int
default "6"
-   depends on NEED_MULTIPLE_NODES
+   depends on NUMA
 
 config HW_PERF_EVENTS
bool "Enable hardware performance counter support for perf events"
diff --git a/arch/mips/include/asm/mmzone.h b/arch/mips/include/asm/mmzone.h
index 7649ab45e80c..602a21aee9d4 100644
--- a/arch/mips/include/asm/mmzone.h
+++ b/arch/mips/include/asm/mmzone.h
@@ -8,7 +8,7 @@
 
 #include 
 
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
 # include 
 #endif
 
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index 195ff4e9771f..96bc798c1ec1 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -239,7 +239,7 @@ static inline int pfn_valid(unsigned long pfn)
 
 /* pfn_valid is defined in linux/mmzone.h */
 
-#elif defined(CONFIG_NEED_MULTIPLE_NODES)
+#elif defined(CONFIG_NUMA)
 
 #define pfn_valid(pfn) \
 ({ \
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 97f6ca341448..19347dc6bbf8 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -394,7 +394,7 @@ void maar_init(void)
}
 }
 
-#ifndef CONFIG_NEED_MULTIPLE_NODES
+#ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -473,7 +473,7 @@ void __init mem_init(void)
0x80000000 - 4, KCORE_TEXT);
 #endif
 }
-#endif /* !CONFIG_NEED_MULTIPLE_NODES */
+#endif /* !CONFIG_NUMA */
 
 void free_init_pages(const char *what, unsigned long begin, unsigned long end)
 {
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2

[PATCH 4/9] m68k: remove support for DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
---
 arch/m68k/Kconfig.cpu   | 10 --
 arch/m68k/include/asm/page.h|  2 +-
 arch/m68k/include/asm/page_mm.h | 33 -
 arch/m68k/mm/init.c | 20 
 4 files changed, 1 insertion(+), 64 deletions(-)

diff --git a/arch/m68k/Kconfig.cpu b/arch/m68k/Kconfig.cpu
index f4d23977d2a5..29e946394fdb 100644
--- a/arch/m68k/Kconfig.cpu
+++ b/arch/m68k/Kconfig.cpu
@@ -408,10 +408,6 @@ config SINGLE_MEMORY_CHUNK
  order" to save memory that could be wasted for unused memory map.
  Say N if not sure.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   depends on BROKEN
-   def_bool MMU && !SINGLE_MEMORY_CHUNK
-
 config FORCE_MAX_ZONEORDER
int "Maximum zone order" if ADVANCED
depends on !SINGLE_MEMORY_CHUNK
@@ -451,11 +447,6 @@ config M68K_L2_CACHE
depends on MAC
default y
 
-config NODES_SHIFT
-   int
-   default "3"
-   depends on DISCONTIGMEM
-
 config CPU_HAS_NO_BITFIELDS
bool
 
@@ -553,4 +544,3 @@ config CACHE_COPYBACK
  The ColdFire CPU cache is set into Copy-back mode.
 endchoice
 endif
-
diff --git a/arch/m68k/include/asm/page.h b/arch/m68k/include/asm/page.h
index 97087dd3ca6d..2f1c54e4725d 100644
--- a/arch/m68k/include/asm/page.h
+++ b/arch/m68k/include/asm/page.h
@@ -62,7 +62,7 @@ extern unsigned long _ramend;
 #include <asm/page_no.h>
 #endif
 
-#if !defined(CONFIG_MMU) || defined(CONFIG_DISCONTIGMEM)
+#ifndef CONFIG_MMU
 #define __phys_to_pfn(paddr)   ((unsigned long)((paddr) >> PAGE_SHIFT))
 #define __pfn_to_phys(pfn) PFN_PHYS(pfn)
 #endif
diff --git a/arch/m68k/include/asm/page_mm.h b/arch/m68k/include/asm/page_mm.h
index 2411ea9ef578..ff8f8a3f7cac 100644
--- a/arch/m68k/include/asm/page_mm.h
+++ b/arch/m68k/include/asm/page_mm.h
@@ -126,25 +126,7 @@ static inline void *__va(unsigned long x)
 
 extern int m68k_virt_to_node_shift;
 
-#ifndef CONFIG_DISCONTIGMEM
 #define __virt_to_node(addr)   (&pg_data_map[0])
-#else
-extern struct pglist_data *pg_data_table[];
-
-static inline __attribute_const__ int __virt_to_node_shift(void)
-{
-   int shift;
-
-   asm (
-   "1: moveq   #0,%0\n"
-   m68k_fixup(%c1, 1b)
-   : "=d" (shift)
-   : "i" (m68k_fixup_vnode_shift));
-   return shift;
-}
-
-#define __virt_to_node(addr)   (pg_data_table[(unsigned long)(addr) >> __virt_to_node_shift()])
-#endif
 
 #define virt_to_page(addr) ({  \
pfn_to_page(virt_to_pfn(addr)); \
@@ -153,23 +135,8 @@ static inline __attribute_const__ int __virt_to_node_shift(void)
pfn_to_virt(page_to_pfn(page)); \
 })
 
-#ifdef CONFIG_DISCONTIGMEM
-#define pfn_to_page(pfn) ({\
-   unsigned long __pfn = (pfn);\
-   struct pglist_data *pgdat;  \
-   pgdat = __virt_to_node((unsigned long)pfn_to_virt(__pfn));  \
-   pgdat->node_mem_map + (__pfn - pgdat->node_start_pfn);  \
-})
-#define page_to_pfn(_page) ({  \
-   const struct page *__p = (_page);   \
-   struct pglist_data *pgdat;  \
-   pgdat = &pg_data_map[page_to_nid(__p)]; \
-   ((__p) - pgdat->node_mem_map) + pgdat->node_start_pfn;  \
-})
-#else
 #define ARCH_PFN_OFFSET (m68k_memory[0].addr >> PAGE_SHIFT)
 #include <asm-generic/memory_model.h>
-#endif
 
 #define virt_addr_valid(kaddr) ((unsigned long)(kaddr) >= PAGE_OFFSET && (unsigned long)(kaddr) < (unsigned long)high_memory)
 #define pfn_valid(pfn) virt_addr_valid(pfn_to_virt(pfn))
diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c
index 1759ab875d47..5d749e188246 100644
--- a/arch/m68k/mm/init.c
+++ b/arch/m68k/mm/init.c
@@ -44,28 +44,8 @@ EXPORT_SYMBOL(empty_zero_page);
 
 int m68k_virt_to_node_shift;
 
-#ifdef CONFIG_DISCONTIGMEM
-pg_data_t pg_data_map[MAX_NUMNODES];
-EXPORT_SYMBOL(pg_data_map);
-
-pg_data_t *pg_data_table[65];
-EXPORT_SYMBOL(pg_data_table);
-#endif
-
 void __init m68k_setup_node(int node)
 {
-#ifdef CONFIG_DISCONTIGMEM
-   struct m68k_mem_info *info = m68k_memory + node;
-   int i, end;
-
-   i = (unsigned long)phys_to_virt(info->addr) >> __virt_to_node_shift();
-   end = (unsigned long)phys_to_virt(info->addr + info->size - 1) >> __virt_to_node_shift();
-   for (; i <= end; i++) {
-   if (pg_data_table[i])
-   pr_warn("overlap at %u for chunk 

[PATCH 3/9] arc: remove support for DISCONTIGMEM

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
in v5.11.

Remove the support for DISCONTIGMEM entirely.

Signed-off-by: Mike Rapoport 
---
 arch/arc/Kconfig  | 13 
 arch/arc/include/asm/mmzone.h | 40 ---
 arch/arc/mm/init.c|  8 ---
 3 files changed, 61 deletions(-)
 delete mode 100644 arch/arc/include/asm/mmzone.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 2d98501c0897..d8f51eb8963b 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
 config GENERIC_CSUM
def_bool y
 
-config ARCH_DISCONTIGMEM_ENABLE
-   def_bool n
-   depends on BROKEN
-
 config ARCH_FLATMEM_ENABLE
def_bool y
 
@@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
 
 endchoice
 
-config NODES_SHIFT
-   int "Maximum NUMA Nodes (as a power of 2)"
-   default "0" if !DISCONTIGMEM
-   default "1" if DISCONTIGMEM
-   depends on NEED_MULTIPLE_NODES
-   help
- Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
- zones.
-
 config ARC_COMPACT_IRQ_LEVELS
depends on ISA_ARCOMPACT
bool "Setup Timer IRQ as high Priority"
diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
deleted file mode 100644
index b86b9d1e54dc..000000000000
--- a/arch/arc/include/asm/mmzone.h
+++ /dev/null
@@ -1,40 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
- */
-
-#ifndef _ASM_ARC_MMZONE_H
-#define _ASM_ARC_MMZONE_H
-
-#ifdef CONFIG_DISCONTIGMEM
-
-extern struct pglist_data node_data[];
-#define NODE_DATA(nid) (&node_data[nid])
-
-static inline int pfn_to_nid(unsigned long pfn)
-{
-   int is_end_low = 1;
-
-   if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
-   is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL);
-
-   /*
-* node 0: lowmem: 0x8000_0000   to 0xFFFF_FFFF
-* node 1: HIGHMEM w/o  PAE40: 0x0   to 0x7FFF_FFFF
-* HIGHMEM with PAE40: 0x1_0000_0000 to ...
-*/
-   if (pfn >= ARCH_PFN_OFFSET && is_end_low)
-   return 0;
-
-   return 1;
-}
-
-static inline int pfn_valid(unsigned long pfn)
-{
-   int nid = pfn_to_nid(pfn);
-
-   return (pfn <= node_end_pfn(nid));
-}
-#endif /* CONFIG_DISCONTIGMEM  */
-
-#endif
diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index 397a201adfe3..abfeef7bf6f8 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
 EXPORT_SYMBOL(arch_pfn_offset);
 #endif
 
-#ifdef CONFIG_DISCONTIGMEM
-struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
-EXPORT_SYMBOL(node_data);
-#endif
-
 long __init arc_get_mem_sz(void)
 {
return low_mem_sz;
@@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
 * to the hole is freed and ARC specific version of pfn_valid()
 * handles the hole in the memory map.
 */
-#ifdef CONFIG_DISCONTIGMEM
-   node_set_online(1);
-#endif
 
min_high_pfn = PFN_DOWN(high_mem_start);
max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
-- 
2.28.0




[PATCH 2/9] arc: update comment about HIGHMEM implementation

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

Arc does not use DISCONTIGMEM to implement high memory, update the comment
describing how high memory works to reflect this.

Signed-off-by: Mike Rapoport 
---
 arch/arc/mm/init.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
index e2ed355438c9..397a201adfe3 100644
--- a/arch/arc/mm/init.c
+++ b/arch/arc/mm/init.c
@@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
 
 #ifdef CONFIG_HIGHMEM
/*
-* Populate a new node with highmem
-*
 * On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
-* than addresses in normal ala low memory (0x8000_ based).
+* than addresses in normal aka low memory (0x8000_ based).
 * Even with PAE, the huge peripheral space hole would waste a lot of
-* mem with single mem_map[]. This warrants a mem_map per region design.
-* Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
-*
-* DISCONTIGMEM in turns requires multiple nodes. node 0 above is
-* populated with normal memory zone while node 1 only has highmem
+* mem with single contiguous mem_map[].
+* Thus when HIGHMEM on ARC is enabled the memory map corresponding
+* to the hole is freed and ARC specific version of pfn_valid()
+* handles the hole in the memory map.
 */
 #ifdef CONFIG_DISCONTIGMEM
node_set_online(1);
-- 
2.28.0




[PATCH 0/9] Remove DISCONTIGMEM memory model

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

SPARSEMEM memory model was supposed to entirely replace DISCONTIGMEM a
(long) while ago. The last architectures that used DISCONTIGMEM were
updated to use other memory models in v5.11 and it is about time to
entirely remove DISCONTIGMEM from the kernel.

This set removes DISCONTIGMEM from alpha, arc and m68k, simplifies memory
model selection in mm/Kconfig and replaces usage of redundant
CONFIG_NEED_MULTIPLE_NODES and CONFIG_FLAT_NODE_MEM_MAP with CONFIG_NUMA
and CONFIG_FLATMEM respectively. 

I've also removed NUMA support on alpha that was BROKEN for more than 15
years.

There were also minor updates all over arch/ to remove mentions of
DISCONTIGMEM in comments and #ifdefs.

Mike Rapoport (9):
  alpha: remove DISCONTIGMEM and NUMA
  arc: update comment about HIGHMEM implementation
  arc: remove support for DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONTIGMEM
  docs: remove description of DISCONTIGMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM

 Documentation/admin-guide/sysctl/vm.rst |  12 +-
 Documentation/vm/memory-model.rst   |  45 +
 arch/alpha/Kconfig  |  22 ---
 arch/alpha/include/asm/machvec.h|   6 -
 arch/alpha/include/asm/mmzone.h | 100 ---
 arch/alpha/include/asm/pgtable.h|   4 -
 arch/alpha/include/asm/topology.h   |  39 -
 arch/alpha/kernel/core_marvel.c |  53 +-
 arch/alpha/kernel/core_wildfire.c   |  29 +--
 arch/alpha/kernel/pci_iommu.c   |  29 ---
 arch/alpha/kernel/proto.h   |   8 -
 arch/alpha/kernel/setup.c   |  16 --
 arch/alpha/kernel/sys_marvel.c  |   5 -
 arch/alpha/kernel/sys_wildfire.c|   5 -
 arch/alpha/mm/Makefile  |   2 -
 arch/alpha/mm/init.c|   3 -
 arch/alpha/mm/numa.c| 223 
 arch/arc/Kconfig|  13 --
 arch/arc/include/asm/mmzone.h   |  40 -
 arch/arc/mm/init.c  |  21 +--
 arch/arm64/Kconfig  |   2 +-
 arch/ia64/Kconfig   |   2 +-
 arch/ia64/kernel/topology.c |   5 +-
 arch/ia64/mm/numa.c |   5 +-
 arch/m68k/Kconfig.cpu   |  10 --
 arch/m68k/include/asm/page.h|   2 +-
 arch/m68k/include/asm/page_mm.h |  33 
 arch/m68k/mm/init.c |  20 ---
 arch/mips/Kconfig   |   2 +-
 arch/mips/include/asm/mmzone.h  |   8 +-
 arch/mips/include/asm/page.h|   2 +-
 arch/mips/mm/init.c |   7 +-
 arch/nds32/include/asm/memory.h |   6 -
 arch/powerpc/Kconfig|   2 +-
 arch/powerpc/include/asm/mmzone.h   |   4 +-
 arch/powerpc/kernel/setup_64.c  |   2 +-
 arch/powerpc/kernel/smp.c   |   2 +-
 arch/powerpc/kexec/core.c   |   4 +-
 arch/powerpc/mm/Makefile|   2 +-
 arch/powerpc/mm/mem.c   |   4 +-
 arch/riscv/Kconfig  |   2 +-
 arch/s390/Kconfig   |   2 +-
 arch/sh/include/asm/mmzone.h|   4 +-
 arch/sh/kernel/topology.c   |   2 +-
 arch/sh/mm/Kconfig  |   2 +-
 arch/sh/mm/init.c   |   2 +-
 arch/sparc/Kconfig  |   2 +-
 arch/sparc/include/asm/mmzone.h |   4 +-
 arch/sparc/kernel/smp_64.c  |   2 +-
 arch/sparc/mm/init_64.c |  12 +-
 arch/x86/Kconfig|   2 +-
 arch/x86/kernel/setup_percpu.c  |   6 +-
 arch/x86/mm/init_32.c   |   4 +-
 arch/xtensa/include/asm/page.h  |   4 -
 include/asm-generic/memory_model.h  |  37 +---
 include/asm-generic/topology.h  |   2 +-
 include/linux/gfp.h |   4 +-
 include/linux/memblock.h|   6 +-
 include/linux/mm.h  |   4 +-
 include/linux/mmzone.h  |  16 +-
 kernel/crash_core.c |   4 +-
 mm/Kconfig  |  36 +---
 mm/memblock.c   |   8 +-
 mm/memory.c |   3 +-
 mm/page_alloc.c |  25 +--
 mm/page_ext.c   |   2 +-
 66 files changed, 98 insertions(+), 898 deletions(-)
 delete mode 100644 arch/alpha/include/asm/mmzone.h
 delete mode 100644 arch/alpha/mm/numa.c
 delete mode 100644 arch/arc/include/asm/mmzone.h


base-commit: c4681547bcce777daf576925a966ffa824edd09d
-- 
2.28.0




[PATCH 1/9] alpha: remove DISCONTIGMEM and NUMA

2021-06-02 Thread Mike Rapoport
From: Mike Rapoport 

NUMA has been marked broken on alpha for more than 15 years and DISCONTIGMEM was
replaced with SPARSEMEM in v5.11.

Remove both NUMA and DISCONTIGMEM support from alpha.

Signed-off-by: Mike Rapoport 
---
 arch/alpha/Kconfig|  22 ---
 arch/alpha/include/asm/machvec.h  |   6 -
 arch/alpha/include/asm/mmzone.h   | 100 --
 arch/alpha/include/asm/pgtable.h  |   4 -
 arch/alpha/include/asm/topology.h |  39 --
 arch/alpha/kernel/core_marvel.c   |  53 +--
 arch/alpha/kernel/core_wildfire.c |  29 +---
 arch/alpha/kernel/pci_iommu.c |  29 
 arch/alpha/kernel/proto.h |   8 --
 arch/alpha/kernel/setup.c |  16 ---
 arch/alpha/kernel/sys_marvel.c|   5 -
 arch/alpha/kernel/sys_wildfire.c  |   5 -
 arch/alpha/mm/Makefile|   2 -
 arch/alpha/mm/init.c  |   3 -
 arch/alpha/mm/numa.c  | 223 --
 15 files changed, 4 insertions(+), 540 deletions(-)
 delete mode 100644 arch/alpha/include/asm/mmzone.h
 delete mode 100644 arch/alpha/mm/numa.c

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 5998106faa60..8954216b9956 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -549,29 +549,12 @@ config NR_CPUS
  MARVEL support can handle a maximum of 32 CPUs, all the others
  with working support have a maximum of 4 CPUs.
 
-config ARCH_DISCONTIGMEM_ENABLE
-   bool "Discontiguous Memory Support"
-   depends on BROKEN
-   help
- Say Y to support efficient handling of discontiguous physical memory,
- for architectures which are either NUMA (Non-Uniform Memory Access)
- or have huge holes in the physical address space for other reasons.
- See  for more.
-
 config ARCH_SPARSEMEM_ENABLE
bool "Sparse Memory Support"
help
  Say Y to support efficient handling of discontiguous physical memory,
  for systems that have huge holes in the physical address space.
 
-config NUMA
-   bool "NUMA Support (EXPERIMENTAL)"
-   depends on DISCONTIGMEM && BROKEN
-   help
- Say Y to compile the kernel to support NUMA (Non-Uniform Memory
- Access).  This option is for configuring high-end multiprocessor
- server machines.  If in doubt, say N.
-
 config ALPHA_WTINT
bool "Use WTINT" if ALPHA_SRM || ALPHA_GENERIC
default y if ALPHA_QEMU
@@ -596,11 +579,6 @@ config ALPHA_WTINT
 
  If unsure, say N.
 
-config NODES_SHIFT
-   int
-   default "7"
-   depends on NEED_MULTIPLE_NODES
-
 # LARGE_VMALLOC is racy, if you *really* need it then fix it first
 config ALPHA_LARGE_VMALLOC
bool
diff --git a/arch/alpha/include/asm/machvec.h b/arch/alpha/include/asm/machvec.h
index a4e96e2bec74..e49fabce7b33 100644
--- a/arch/alpha/include/asm/machvec.h
+++ b/arch/alpha/include/asm/machvec.h
@@ -99,12 +99,6 @@ struct alpha_machine_vector
 
const char *vector_name;
 
-   /* NUMA information */
-   int (*pa_to_nid)(unsigned long);
-   int (*cpuid_to_nid)(int);
-   unsigned long (*node_mem_start)(int);
-   unsigned long (*node_mem_size)(int);
-
/* System specific parameters.  */
union {
struct {
diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h
deleted file mode 100644
index 86644604d977..000000000000
--- a/arch/alpha/include/asm/mmzone.h
+++ /dev/null
@@ -1,100 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Written by Kanoj Sarcar (ka...@sgi.com) Aug 99
- * Adapted for the alpha wildfire architecture Jan 2001.
- */
-#ifndef _ASM_MMZONE_H_
-#define _ASM_MMZONE_H_
-
-#ifdef CONFIG_DISCONTIGMEM
-
-#include <asm/smp.h>
-
-/*
- * Following are macros that are specific to this numa platform.
- */
-
-extern pg_data_t node_data[];
-
-#define alpha_pa_to_nid(pa)\
-(alpha_mv.pa_to_nid\
-? alpha_mv.pa_to_nid(pa)   \
-: (0))
-#define node_mem_start(nid)\
-(alpha_mv.node_mem_start   \
-? alpha_mv.node_mem_start(nid) \
-: (0UL))
-#define node_mem_size(nid) \
-(alpha_mv.node_mem_size\
-? alpha_mv.node_mem_size(nid)  \
-: ((nid) ? (0UL) : (~0UL)))
-
-#define pa_to_nid(pa)  alpha_pa_to_nid(pa)
-#define NODE_DATA(nid) (&node_data[(nid)])
-
-#define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn)
-
-#if 1
-#define PLAT_NODE_DATA_LOCALNR(p, n)   \
-   (((p) >> PAGE_SHIFT) - PLAT_NODE_DATA(n)->gendata.node_start_pfn)
-#else
-static inline unsigned long
-PLAT_NODE_DATA_LOCALNR(unsigned long p, int n)
-{
-   unsigned long temp;
-   temp = p >> PAGE_SHIFT;
-   return temp - PLAT_NODE_DATA(n)->gendata.node_start_pfn;
-}
-#endif
-
-/*
- * Following are macros that each numa implementation must define.
- */
-
-/*
- * Given a kernel address,

Re: [PATCH v2] x86/efi: unconditionally hold the whole low-1MB memory regions

2021-05-31 Thread Mike Rapoport
On Mon, May 31, 2021 at 12:52:06PM +0200, Borislav Petkov wrote:
> On Mon, May 31, 2021 at 12:58:40PM +0300, Mike Rapoport wrote:
> > Right, but TBH, I didn't update efi_free_boot_services() in my initial
> > version. I've added similar change there now and I'm waiting now to see if
> > kbuild is happy with this:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=x86/reservelow
> 
> Right, also I'm guessing that first patch should be
> 
> Cc: 
> 
> as there was one report with failing boot, right?

Hmm, why?
The regression is from v5.13-rc1, isn't it?

-- 
Sincerely yours,
Mike.



Re: [PATCH v2] x86/efi: unconditionally hold the whole low-1MB memory regions

2021-05-31 Thread Mike Rapoport
On Mon, May 31, 2021 at 07:00:59PM +0800, lijiang wrote:
> Thank you for the information, Boris and Mike.
> 
> BTW: I just noticed that Mike's patch is incorrect, maybe it's a typo:
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 7850111008a8b..e262ca858787f 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -450,6 +450,18 @@ void __init efi_free_boot_services(void)
> size -= rm_size;
> }
> + /*
> + * Don't free memory under 1M for two reasons:
> + * - BIOS might clobber it
> + * - Crash kernel needs it to be reserved
> + */
> + if (start + size < SZ_1M)
> + continue;
> + if (start < SZ_1M) {
> + size -= (start - SZ_1M);
> 
> 
> It looks like: size -= (SZ_1M - start);

Right, thanks!

> + start = SZ_1M;
> + }
> +
> memblock_free_late(start, size);
> }
> 
> Mike's patch link:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/commit/?h=x86/reservelow&id=479fb34676ac448529b605854cf48c007e796ccd
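
With the typo fixed the whole hunk would read:

	/*
	 * Don't free memory under 1M for two reasons:
	 * - BIOS might clobber it
	 * - Crash kernel needs it to be reserved
	 */
	if (start + size < SZ_1M)
		continue;
	if (start < SZ_1M) {
		size -= (SZ_1M - start);
		start = SZ_1M;
	}

	memblock_free_late(start, size);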
 
-- 
Sincerely yours,
Mike.



Re: [PATCH v2] x86/efi: unconditionally hold the whole low-1MB memory regions

2021-05-31 Thread Mike Rapoport
On Mon, May 31, 2021 at 11:08:32AM +0200, Borislav Petkov wrote:
> + Mike.
> 
> On Mon, May 31, 2021 at 05:00:23PM +0800, Lianbo Jiang wrote:
> > Some sub-1MB memory regions may be reserved by EFI boot services, and the
> > memory regions will be released later in the efi_free_boot_services().
> > 
> > Currently, always reserve all sub-1MB memory regions when the crashkernel
> > option is specified, but unfortunately EFI boot services may have already
> > reserved some sub-1MB memory regions before the crash_reserve_low_1M() is
> > called, which makes that the crash_reserve_low_1M() only own the
> > remaining sub-1MB memory regions, not all sub-1MB memory regions, because,
> > subsequently EFI boot services will free its own sub-1MB memory regions.
> > Eventually, DMA will be able to allocate memory from the sub-1MB area and
> > cause the following error:
> > 
> > crash> kmem -s |grep invalid
> > kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> > 9403c0067300
> > kmem: dma-kmalloc-512: slab: d52c40001900 invalid freepointer: 
> > 9403c0067300
> > crash> vtop 9403c0067300
> > VIRTUAL   PHYSICAL
> > 9403c0067300  67300   --->The physical address falls into this range 
> > [0x00063000-0x0008efff]
> > 
> > kernel debugging log:
> > ...
> > [0.008927] memblock_reserve: [0x0001-0x00013fff] 
> > efi_reserve_boot_services+0x85/0xd0
> > [0.008930] memblock_reserve: [0x00063000-0x0008efff] 
> > efi_reserve_boot_services+0x85/0xd0
> > ...
> > [0.009425] memblock_reserve: [0x-0x000f] 
> > crash_reserve_low_1M+0x2c/0x49
> > ...
> > [0.010586] Zone ranges:
> > [0.010587]   DMA  [mem 0x1000-0x00ff]
> > [0.010589]   DMA32[mem 0x0100-0x]
> > [0.010591]   Normal   [mem 0x0001-0x000c7fff]
> > [0.010593]   Device   empty
> > ...
> > [8.814894] __memblock_free_late: 
> > [0x00063000-0x0008efff] efi_free_boot_services+0x14b/0x23b
> > [8.815793] __memblock_free_late: 
> > [0x0001-0x00013fff] efi_free_boot_services+0x14b/0x23b
> > 
> > To fix the above issues, let's hold the whole low-1M memory regions
> > unconditionally in the efi_free_boot_services().
> > 
> > Signed-off-by: Lianbo Jiang 
> > ---
> > Background(copy from bhe's comment in the patch v1):
> > 
> > Kdump kernel also need go through real mode code path during bootup. It
> > is not different than normal kernel except that it skips the firmware
> > resetting. So kdump kernel needs low 1M as system RAM just as normal
> > kernel does. Here we reserve the whole low 1M with memblock_reserve()
> > to avoid any later kernel or driver data reside in this area. Otherwise,
> > we need dump the content of this area to vmcore. As we know, when crash
> > happened, the old memory of 1st kernel should be untouched until vmcore
> > dumping read out its content. Meanwhile, kdump kernel need reuse low 1M.
> > In the past, we used a back up region to copy out the low 1M area, and
> > map the back up region into the low 1M area in vmcore elf file. In
> > 6f599d84231fd27 ("x86/kdump: Always reserve the low 1M when the crashkernel
> > option is specified"), we changed to lock the whole low 1M to avoid
> > writting any kernel data into, like this we can skip this area when
> > dumping vmcore.
> > 
> > Above is why we try to memblock reserve the whole low 1M. We don't want
> > to use it, just don't want anyone to use it in 1st kernel.
> > 
> > 
> >  arch/x86/platform/efi/quirks.c | 32 +++-
> >  1 file changed, 15 insertions(+), 17 deletions(-)
> > 
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 7850111008a8..840b7e3b3d48 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/memblock.h>
> >  #include <linux/acpi.h>
> >  #include <linux/dmi.h>
> > +#include <linux/sizes.h>
> >  
> >  #include <asm/e820/api.h>
> >  #include <asm/efi.h>
> > @@ -409,7 +410,7 @@ void __init efi_free_boot_services(void)
> > for_each_efi_memory_desc(md) {
> > unsigned long long start = md->phys_addr;
> > unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
> > -   size_t rm_size;
> > +   unsigned long long end = start + size;
> >  
> > if (md->type != EFI_BOOT_SERVICES_CODE &&
> > md->type != EFI_BOOT_SERVICES_DATA) {
> > @@ -431,23 +432,20 @@ void __init efi_free_boot_services(void)
> > efi_unmap_pages(md);
> >  
> > /*
> > -* Nasty quirk: if all sub-1MB memory is used for boot
> > -* services, we can get here without having allocated the
> > -* real mode trampoline.  It's too late to hand boot services
> > -* memory back to the memblock allocator, so instead
> > -* try to manually allocate the trampoline if 

Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Mike Rapoport
On Wed, May 19, 2021 at 08:12:06AM +0800, Miles Chen wrote:
> On Tue, 2021-05-18 at 19:09 +0300, Mike Rapoport wrote:
> > Hello Miles,
> > 
> > On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> > > This patch series was created to fix the __pa() warning messages when
> > > CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> > > instances.
> > > 
> > > In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> > > pglist_data is allocated by a memblock API. If 
> > > CONFIG_NEED_MULTIPLE_NODES=n,
> > > we use a global variable named "contig_page_data".
> > > 
> > > If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both
> > > allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> > > we will have the "virt_to_phys used for non-linear address" warning
> > > when booting.
> > > 
> > > To fix the warning, always allocate pglist_data by memblock APIs and
> > > remove the usage of contig_page_data.
> > 
> > Somehow I was sure that we can allocate pglist_data before it is accessed
> > in sparse_init() somewhere outside mm/sparse.c. It's really not the case
> > and having two places that may allocate this structure is surely worse
> > than your previous suggestion.
> > 
> > Sorry about that.
> 
> Do you mean taht to call allocation function arch/*, somewhere after
> paging_init() (so we can access pglist_data) and before sparse_init()
> and free_area_init()?

No, I meant that your original patch is better than adding allocation of
NODE_DATA(0) in two places.
 
> Miles
> 
> >  
> > > Warning message:
> > > [0.00] [ cut here ]
> > > [0.00] virt_to_phys used for non-linear address: (ptrval) 
> > > (contig_page_data+0x0/0x1c00)
> > > [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> > > __virt_to_phys+0x58/0x68
> > > [0.00] Modules linked in:
> > > [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> > > 5.13.0-rc1-00074-g1140ab592e2e #3
> > > [0.00] Hardware name: linux,dummy-virt (DT)
> > > [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> > > [0.00] pc : __virt_to_phys+0x58/0x68
> > > [0.00] lr : __virt_to_phys+0x54/0x68
> > > [0.00] sp : 800011833e70
> > > [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> > > 
> > > [0.00] x26: 000a x25: 800011b7 x24: 
> > > 800011b7
> > > [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> > > 47b0
> > > [0.00] x20: 0008 x19: 800011b082c0 x18: 
> > > 
> > > [0.00] x17:  x16: 800011833bf9 x15: 
> > > 0004
> > > [0.00] x14: 0fff x13: 80001186a548 x12: 
> > > 
> > > [0.00] x11:  x10:  x9 : 
> > > 
> > > [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> > > 800011b62ef8
> > > [0.00] x5 :  x4 : 0001 x3 : 
> > > 
> > > [0.00] x2 :  x1 : 80001159585e x0 : 
> > > 0058
> > > [0.00] Call trace:
> > > [0.00]  __virt_to_phys+0x58/0x68
> > > [0.00]  check_usemap_section_nr+0x50/0xfc
> > > [0.00]  sparse_init_nid+0x1ac/0x28c
> > > [0.00]  sparse_init+0x1c4/0x1e0
> > > [0.00]  bootmem_init+0x60/0x90
> > > [0.00]  setup_arch+0x184/0x1f0
> > > [0.00]  start_kernel+0x78/0x488
> > > [0.00] ---[ end trace f68728a0d3053b60 ]---
> > > 
> > > [1] https://lore.kernel.org/patchwork/patch/1425110/
> > >  
> > > 
> > > Change since v1:
> > > - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> > > 
> > > Miles Chen (2):
> > >   mm: introduce prepare_node_data
> > >   mm: replace contig_page_data with node_data
> > > 
> > >  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
> > >  arch/powerpc/kexec/core.c  |  5 -
> > >

Re: [PATCH v2 0/2] mm: unify the allocation of pglist_data instances

2021-05-18 Thread Mike Rapoport
Hello Miles,

On Tue, May 18, 2021 at 05:24:44PM +0800, Miles Chen wrote:
> This patch series was created to fix the __pa() warning messages when
> CONFIG_DEBUG_VIRTUAL=y by unifying the allocation of pglist_data
> instances.
> 
> In current implementation of node_data, if CONFIG_NEED_MULTIPLE_NODES=y,
> pglist_data is allocated by a memblock API. If CONFIG_NEED_MULTIPLE_NODES=n,
> we use a global variable named "contig_page_data".
> 
> If CONFIG_DEBUG_VIRTUAL is not enabled. __pa() can handle both
> allocation and symbol cases. But if CONFIG_DEBUG_VIRTUAL is set,
> we will have the "virt_to_phys used for non-linear address" warning
> when booting.
> 
> To fix the warning, always allocate pglist_data by memblock APIs and
> remove the usage of contig_page_data.

Somehow I was sure that we can allocate pglist_data before it is accessed
in sparse_init() somewhere outside mm/sparse.c. It's really not the case
and having two places that may allocate this structure is surely worse
than your previous suggestion.

Sorry about that.
 
> Warning message:
> [0.00] [ cut here ]
> [0.00] virt_to_phys used for non-linear address: (ptrval) 
> (contig_page_data+0x0/0x1c00)
> [0.00] WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 
> __virt_to_phys+0x58/0x68
> [0.00] Modules linked in:
> [0.00] CPU: 0 PID: 0 Comm: swapper Tainted: GW 
> 5.13.0-rc1-00074-g1140ab592e2e #3
> [0.00] Hardware name: linux,dummy-virt (DT)
> [0.00] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> [0.00] pc : __virt_to_phys+0x58/0x68
> [0.00] lr : __virt_to_phys+0x54/0x68
> [0.00] sp : 800011833e70
> [0.00] x29: 800011833e70 x28: 418a0018 x27: 
> 
> [0.00] x26: 000a x25: 800011b7 x24: 
> 800011b7
> [0.00] x23: fc0001c0 x22: 800011b7 x21: 
> 47b0
> [0.00] x20: 0008 x19: 800011b082c0 x18: 
> 
> [0.00] x17:  x16: 800011833bf9 x15: 
> 0004
> [0.00] x14: 0fff x13: 80001186a548 x12: 
> 
> [0.00] x11:  x10:  x9 : 
> 
> [0.00] x8 : 8000115c9000 x7 : 737520737968705f x6 : 
> 800011b62ef8
> [0.00] x5 :  x4 : 0001 x3 : 
> 
> [0.00] x2 :  x1 : 80001159585e x0 : 
> 0058
> [0.00] Call trace:
> [0.00]  __virt_to_phys+0x58/0x68
> [0.00]  check_usemap_section_nr+0x50/0xfc
> [0.00]  sparse_init_nid+0x1ac/0x28c
> [0.00]  sparse_init+0x1c4/0x1e0
> [0.00]  bootmem_init+0x60/0x90
> [0.00]  setup_arch+0x184/0x1f0
> [0.00]  start_kernel+0x78/0x488
> [0.00] ---[ end trace f68728a0d3053b60 ]---
> 
> [1] https://lore.kernel.org/patchwork/patch/1425110/
> 
> Change since v1:
> - use memblock_alloc() to create pglist_data when CONFIG_NUMA=n
> 
> Miles Chen (2):
>   mm: introduce prepare_node_data
>   mm: replace contig_page_data with node_data
> 
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 13 -
>  arch/powerpc/kexec/core.c  |  5 -
>  include/linux/gfp.h|  3 ---
>  include/linux/mm.h |  2 ++
>  include/linux/mmzone.h |  4 ++--
>  kernel/crash_core.c|  1 -
>  mm/memblock.c  |  3 +--
>  mm/page_alloc.c| 16 
>  mm/sparse.c|  2 ++
>  9 files changed, 23 insertions(+), 26 deletions(-)
> 
> 
> base-commit: 8ac91e6c6033ebc12c5c1e4aa171b81a662bd70f
> -- 
> 2.18.0
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-06 Thread Mike Rapoport
On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> kernel.h is being used as a dump for all kinds of stuff for a long time.
> Here is the attempt to start cleaning it up by splitting out panic and
> oops helpers.
> 
> At the same time convert users in header and lib folder to use new header.
> Though for time being include new header back to kernel.h to avoid twisted
> indirected includes for existing users.
> 
> Signed-off-by: Andy Shevchenko 

Acked-by: Mike Rapoport 

> ---
>  arch/powerpc/kernel/setup-common.c   |  1 +
>  arch/x86/include/asm/desc.h  |  1 +
>  arch/x86/kernel/cpu/mshyperv.c   |  1 +
>  arch/x86/kernel/setup.c  |  1 +
>  drivers/char/ipmi/ipmi_msghandler.c  |  1 +
>  drivers/remoteproc/remoteproc_core.c |  1 +
>  include/asm-generic/bug.h|  3 +-
>  include/linux/kernel.h   | 84 +---
>  include/linux/panic.h| 98 
>  include/linux/panic_notifier.h   | 12 
>  kernel/hung_task.c   |  1 +
>  kernel/kexec_core.c  |  1 +
>  kernel/panic.c   |  1 +
>  kernel/rcu/tree.c|  2 +
>  kernel/sysctl.c  |  1 +
>  kernel/trace/trace.c |  1 +
>  16 files changed, 126 insertions(+), 84 deletions(-)
>  create mode 100644 include/linux/panic.h
>  create mode 100644 include/linux/panic_notifier.h
> 
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> index 476082a83d1c..ceb12683b6d1 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  
> +#include 

This seems unrelated, but I might be missing something.

>  #include 
>  #include 
>  

-- 
Sincerely yours,
Mike.



Re: [PATCH v13 6/8] arm64: kdump: reimplement crashkernel=X

2020-11-12 Thread Mike Rapoport
On Wed, Nov 11, 2020 at 09:54:48PM +0800, Baoquan He wrote:
> On 11/11/20 at 09:27pm, chenzhou wrote:
> > Hi Baoquan,
> ...
> > >>  #ifdef CONFIG_CRASH_DUMP
> > >>  static int __init early_init_dt_scan_elfcorehdr(unsigned long node,
> > >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > >> index 1c0f3e02f731..c55cee290bbb 100644
> > >> --- a/arch/arm64/mm/mmu.c
> > >> +++ b/arch/arm64/mm/mmu.c
> > >> @@ -488,6 +488,10 @@ static void __init map_mem(pgd_t *pgdp)
> > >>   */
> > >>  memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > >>  #ifdef CONFIG_KEXEC_CORE
> > >> +if (crashk_low_res.end)
> > >> +memblock_mark_nomap(crashk_low_res.start,
> > >> +resource_size(&crashk_low_res));
> > >> +
> > >>  if (crashk_res.end)
> > >>  memblock_mark_nomap(crashk_res.start,
> > >>  resource_size(&crashk_res));
> > >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > >> index d39892bdb9ae..cdef7d8c91a6 100644
> > >> --- a/kernel/crash_core.c
> > >> +++ b/kernel/crash_core.c
> > >> @@ -321,7 +321,7 @@ int __init parse_crashkernel_low(char *cmdline,
> > >>  
> > >>  int __init reserve_crashkernel_low(void)
> > >>  {
> > >> -#ifdef CONFIG_X86_64
> > >> +#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
> > > Not very sure if a CONFIG_64BIT checking is better.
> > If doing like this, there may be some compiling errors for other 64-bit 
> > kernel, such as mips.
> > >
> > >>  unsigned long long base, low_base = 0, low_size = 0;
> > >>  unsigned long low_mem_limit;
> > >>  int ret;
> > >> @@ -362,12 +362,14 @@ int __init reserve_crashkernel_low(void)
> > >>  
> > >>  crashk_low_res.start = low_base;
> > >>  crashk_low_res.end   = low_base + low_size - 1;
> > >> +#ifdef CONFIG_X86_64
> > >>  insert_resource(&iomem_resource, &crashk_low_res);
> > >> +#endif
> > >>  #endif
> > >>  return 0;
> > >>  }
> > >>  
> > >> -#ifdef CONFIG_X86
> > >> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> > > Should we make this weak default so that we can remove the ARCH config?
> > The same as above, some arch may not support kdump, in that case,  
> > compiling errors occur.
> 
> OK, not sure if other people have better idea, oterwise, we can leave with 
> it. 
> Thanks for telling.

I think it would be better to have CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL
in arch/Kconfig and select this by X86 and ARM64.

Since reserve_crashkernel() implementations are quite similar on other
architectures as well, we can have more users of this later.
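
Roughly (a sketch only; the symbol name is just a suggestion):

config ARCH_WANT_RESERVE_CRASH_KERNEL
	bool

in arch/Kconfig, plus

	select ARCH_WANT_RESERVE_CRASH_KERNEL

in arch/x86/Kconfig and arch/arm64/Kconfig, so that the generic
reserve_crashkernel[_low]() can be guarded by
CONFIG_ARCH_WANT_RESERVE_CRASH_KERNEL instead of an explicit list of
architecture #ifdefs.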

-- 
Sincerely yours,
Mike.



Re: [PATCH v13 4/8] x86: kdump: move reserve_crashkernel[_low]() into crash_core.c

2020-11-12 Thread Mike Rapoport
On Sat, Oct 31, 2020 at 03:44:33PM +0800, Chen Zhou wrote:
> Make the functions reserve_crashkernel[_low]() as generic.
> Arm64 will use these to reimplement crashkernel=X.
> 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/include/asm/kexec.h |  25 ++
>  arch/x86/kernel/setup.c  | 151 +---
>  include/linux/crash_core.h   |   4 +
>  include/linux/kexec.h|   2 -
>  kernel/crash_core.c  | 164 +++
>  kernel/kexec_core.c  |  17 
>  6 files changed, 195 insertions(+), 168 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 8cf9d3fd31c7..34afa7b645f9 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -21,6 +21,27 @@
>  /* 2M alignment for crash kernel regions */
>  #define CRASH_ALIGN  SZ_16M
>  
> +/*
> + * Keep the crash kernel below this limit.
> + *
> + * Earlier 32-bits kernels would limit the kernel to the low 512 MB range
> + * due to mapping restrictions.
> + *
> + * 64-bit kdump kernels need to be restricted to be under 64 TB, which is
> + * the upper limit of system RAM in 4-level paging mode. Since the kdump
> + * jump could be from 5-level paging to 4-level paging, the jump will fail if
> + * the kernel is put above 64 TB, and during the 1st kernel bootup there's
> + * no good way to detect the paging mode of the target kernel which will be
> + * loaded for dumping.
> + */
> +#ifdef CONFIG_X86_32
> +# define CRASH_ADDR_LOW_MAX  SZ_512M
> +# define CRASH_ADDR_HIGH_MAX SZ_512M
> +#else
> +# define CRASH_ADDR_LOW_MAX  SZ_4G
> +# define CRASH_ADDR_HIGH_MAX SZ_64T
> +#endif
> +
>  #ifndef __ASSEMBLY__
>  
>  #include <linux/string.h>
> @@ -200,6 +221,10 @@ typedef void crash_vmclear_fn(void);
>  extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
>  extern void kdump_nmi_shootdown_cpus(void);
>  
> +#ifdef CONFIG_KEXEC_CORE
> +extern void __init reserve_crashkernel(void);
> +#endif
> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_KEXEC_H */
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 1289f079ad5f..00b3840d30f9 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -25,8 +25,6 @@
>  
>  #include 
>  
> -#include 
> -
>  #include 
>  #include 
>  #include 
> @@ -38,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -389,153 +388,7 @@ static void __init 
> memblock_x86_reserve_range_setup_data(void)
>   }
>  }
>  
> -/*
> - * - Crashkernel reservation --
> - */
> -
> -#ifdef CONFIG_KEXEC_CORE
> -
> -/*
> - * Keep the crash kernel below this limit.
> - *
> - * Earlier 32-bits kernels would limit the kernel to the low 512 MB range
> - * due to mapping restrictions.
> - *
> - * 64-bit kdump kernels need to be restricted to be under 64 TB, which is
> - * the upper limit of system RAM in 4-level paging mode. Since the kdump
> - * jump could be from 5-level paging to 4-level paging, the jump will fail if
> - * the kernel is put above 64 TB, and during the 1st kernel bootup there's
> - * no good way to detect the paging mode of the target kernel which will be
> - * loaded for dumping.
> - */
> -#ifdef CONFIG_X86_32
> -# define CRASH_ADDR_LOW_MAX  SZ_512M
> -# define CRASH_ADDR_HIGH_MAX SZ_512M
> -#else
> -# define CRASH_ADDR_LOW_MAX  SZ_4G
> -# define CRASH_ADDR_HIGH_MAX SZ_64T
> -#endif
> -
> -static int __init reserve_crashkernel_low(void)
> -{
> -#ifdef CONFIG_X86_64
> - unsigned long long base, low_base = 0, low_size = 0;
> - unsigned long low_mem_limit;
> - int ret;
> -
> - low_mem_limit = min(memblock_phys_mem_size(), CRASH_ADDR_LOW_MAX);
> -
> - /* crashkernel=Y,low */
> - ret = parse_crashkernel_low(boot_command_line, low_mem_limit, 
> &low_size, &base);
> - if (ret) {
> - /*
> -  * two parts from kernel/dma/swiotlb.c:
> -  * -swiotlb size: user-specified with swiotlb= or default.
> -  *
> -  * -swiotlb overflow buffer: now hardcoded to 32k. We round it
> -  * to 8M for other buffers that may need to stay low too. Also
> -  * make sure we allocate enough extra low memory so that we
> -  * don't run out of DMA buffers for 32-bit devices.
> -  */
> - low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL 
> << 20);
> - } else {
> - /* passed with crashkernel=0,low ? */
> - if (!low_size)
> - return 0;
> - }
> -
> - low_base = memblock_phys_alloc_range(low_size, CRASH_ALIGN, 
> CRASH_ALIGN, CRASH_ADDR_LOW_MAX);
> - if (!low_base) {
> - pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> smaller size.\n",
> -(unsigned long)(low_size >> 20));
> - return -ENOMEM;
> - }
> -
> - 

Re: [PATCH v13 1/8] x86: kdump: replace the hard-coded alignment with macro CRASH_ALIGN

2020-11-12 Thread Mike Rapoport
Hi,

On Sat, Oct 31, 2020 at 03:44:30PM +0800, Chen Zhou wrote:
> Move CRASH_ALIGN to header asm/kexec.h and replace the hard-coded
> alignment with macro CRASH_ALIGN in function reserve_crashkernel().
> 
> Suggested-by: Dave Young 
> Signed-off-by: Chen Zhou 
> Tested-by: John Donnelly 
> ---
>  arch/x86/include/asm/kexec.h | 3 +++
>  arch/x86/kernel/setup.c  | 5 +
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 6802c59e8252..8cf9d3fd31c7 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -18,6 +18,9 @@
>  
>  # define KEXEC_CONTROL_CODE_MAX_SIZE 2048
>  
> +/* 2M alignment for crash kernel regions */
> +#define CRASH_ALIGN  SZ_16M

Please update the comment to match the code.
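
I.e. something like:

+/* 16M alignment for crash kernel regions */
+#define CRASH_ALIGN	SZ_16M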

> +
>  #ifndef __ASSEMBLY__
>  
>  #include <linux/string.h>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 84f581c91db4..bf373422dc8a 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -395,9 +395,6 @@ static void __init 
> memblock_x86_reserve_range_setup_data(void)
>  
>  #ifdef CONFIG_KEXEC_CORE
>  
> -/* 16M alignment for crash kernel regions */
> -#define CRASH_ALIGN  SZ_16M
> -
>  /*
>   * Keep the crash kernel below this limit.
>   *
> @@ -515,7 +512,7 @@ static void __init reserve_crashkernel(void)
>   } else {
>   unsigned long long start;
>  
> - start = memblock_phys_alloc_range(crash_size, SZ_1M, crash_base,
> + start = memblock_phys_alloc_range(crash_size, CRASH_ALIGN, 
> crash_base,
> crash_base + crash_size);
>   if (start != crash_base) {
>   pr_info("crashkernel reservation failed - memory is in 
> use.\n");
> -- 
> 2.20.1
> 

-- 
Sincerely yours,
Mike.



Re: [RFC 14/43] mm: memblock: PKRAM: prevent memblock resize from clobbering preserved pages

2020-05-11 Thread Mike Rapoport
On Wed, May 06, 2020 at 05:41:40PM -0700, Anthony Yznaga wrote:
> The size of the memblock reserved array may be increased while preserved
> pages are being reserved. When this happens, preserved pages that have
> not yet been reserved are at risk for being clobbered when space for a
> larger array is allocated.
> When called from memblock_double_array(), a wrapper around
> memblock_find_in_range() walks the preserved pages pagetable to find
> sufficiently sized ranges without preserved pages and passes them to
> memblock_find_in_range().

I'd suggest to create an array of memblock_region's that will contain
the PKRAM ranges before kexec and pass this array to the new kernel.
Then, somewhere in start_kerenel() replace replace
memblock.reserved->regions with that array. 
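
Something along these lines (a very rough sketch, all names are made up;
the array must itself live in already-preserved memory and hold sorted,
non-overlapping ranges):

static struct memblock_region pkram_regions[INIT_MEMBLOCK_REGIONS];

void __init pkram_adopt_reserved(int cnt)
{
	phys_addr_t total = 0;
	int i;

	for (i = 0; i < cnt; i++)
		total += pkram_regions[i].size;

	/* Swap in the pre-built array as the initial reserved regions so
	 * that no memblock_double_array() call can clobber preserved
	 * pages while they are being re-registered. */
	memblock.reserved.regions = pkram_regions;
	memblock.reserved.cnt = cnt;
	memblock.reserved.max = INIT_MEMBLOCK_REGIONS;
	memblock.reserved.total_size = total;
}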

> Signed-off-by: Anthony Yznaga 
> ---
>  include/linux/pkram.h |  3 +++
>  mm/memblock.c | 15 +--
>  mm/pkram.c| 51 
> +++
>  3 files changed, 67 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pkram.h b/include/linux/pkram.h
> index edc5d8bef9d3..409022e1472f 100644
> --- a/include/linux/pkram.h
> +++ b/include/linux/pkram.h
> @@ -62,6 +62,9 @@ struct page *pkram_load_page(struct pkram_stream *ps, 
> unsigned long *index,
>  ssize_t pkram_write(struct pkram_stream *ps, const void *buf, size_t count);
>  size_t pkram_read(struct pkram_stream *ps, void *buf, size_t count);
>  
> +phys_addr_t pkram_memblock_find_in_range(phys_addr_t start, phys_addr_t end,
> +  phys_addr_t size, phys_addr_t align);
> +
>  #ifdef CONFIG_PKRAM
>  extern unsigned long pkram_reserved_pages;
>  void pkram_reserve(void);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index c79ba6f9920c..69ae883b8d21 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -16,6 +16,7 @@
>  #include <linux/kmemleak.h>
>  #include <linux/seq_file.h>
>  #include <linux/memblock.h>
> +#include <linux/pkram.h>
>  
>  #include <asm/sections.h>
>  #include <linux/io.h>
> @@ -349,6 +350,16 @@ phys_addr_t __init_memblock 
> memblock_find_in_range(phys_addr_t start,
>   return ret;
>  }
>  
> +phys_addr_t __init_memblock __memblock_find_in_range(phys_addr_t start,
> + phys_addr_t end, phys_addr_t size,
> + phys_addr_t align)
> +{
> + if (IS_ENABLED(CONFIG_PKRAM))
> + return pkram_memblock_find_in_range(start, end, size, align);
> + else
> + return memblock_find_in_range(start, end, size, align);
> +}
> +
>  static void __init_memblock memblock_remove_region(struct memblock_type 
> *type, unsigned long r)
>  {
>   type->total_size -= type->regions[r].size;
> @@ -447,11 +458,11 @@ static int __init_memblock memblock_double_array(struct memblock_type *type,
>   if (type != &memblock.memory)
>   new_area_start = new_area_size = 0;
>  
> - addr = memblock_find_in_range(new_area_start + new_area_size,
> + addr = __memblock_find_in_range(new_area_start + new_area_size,
>   memblock.current_limit,
>   new_alloc_size, PAGE_SIZE);
>   if (!addr && new_area_size)
> - addr = memblock_find_in_range(0,
> + addr = __memblock_find_in_range(0,
>   min(new_area_start, memblock.current_limit),
>   new_alloc_size, PAGE_SIZE);
>  
> diff --git a/mm/pkram.c b/mm/pkram.c
> index dd3c89614010..e49c9bcd3854 100644
> --- a/mm/pkram.c
> +++ b/mm/pkram.c
> @@ -1238,3 +1238,54 @@ void pkram_free_pgt(void)
>   __free_pages_core(virt_to_page(pkram_pgd), 0);
>   pkram_pgd = NULL;
>  }
> +
> +static int __init_memblock pkram_memblock_find_cb(struct pkram_pg_state *st, 
> unsigned long base, unsigned long size)
> +{
> + unsigned long end = base + size;
> + unsigned long addr;
> +
> + if (size < st->min_size)
> + return 0;
> +
> + addr =  memblock_find_in_range(base, end, st->min_size, PAGE_SIZE);
> + if (!addr)
> + return 0;
> +
> + st->retval = addr;
> + return 1;
> +}
> +
> +/*
> + * It may be necessary to allocate a larger reserved memblock array
> + * while populating it with ranges of preserved pages.  To avoid
> + * trampling preserved pages that have not yet been added to the
> + * memblock reserved list this function implements a wrapper around
> + * memblock_find_in_range() that restricts searches to subranges
> + * that do not contain preserved pages.
> + */
> +phys_addr_t __init_memblock pkram_memblock_find_in_range(phys_addr_t start,
> + phys_addr_t end, phys_addr_t size,
> + phys_addr_t align)
> +{
> + struct pkram_pg_state st = {
> + .range_cb = pkram_memblock_find_cb,
> + .min_addr = start,
> + .max_addr = end,
> + .min_size = PAGE_ALIGN(size),
> +

[PATCH] memblock: make keeping memblock memory opt-in rather than opt-out

2019-04-24 Thread Mike Rapoport
Most architectures do not need the memblock memory after the page allocator
is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after system
initialization and avoids the need to add ARCH_DISCARD_MEMBLOCK to the
architectures that are still missing that option.

Signed-off-by: Mike Rapoport 
---
 arch/arm/Kconfig |  2 +-
 arch/arm64/Kconfig   |  1 +
 arch/hexagon/Kconfig |  1 -
 arch/ia64/Kconfig|  1 -
 arch/m68k/Kconfig|  1 -
 arch/mips/Kconfig|  1 -
 arch/nios2/Kconfig   |  1 -
 arch/powerpc/Kconfig |  1 +
 arch/s390/Kconfig|  1 +
 arch/sh/Kconfig  |  1 -
 arch/x86/Kconfig |  1 -
 include/linux/memblock.h |  3 ++-
 kernel/kexec_file.c  | 16 
 mm/Kconfig   |  2 +-
 mm/memblock.c|  6 +++---
 mm/page_alloc.c  |  3 +--
 16 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 850b480..7073436 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -4,7 +4,6 @@ config ARM
default y
select ARCH_32BIT_OFF_T
select ARCH_CLOCKSOURCE_DATA
-   select ARCH_DISCARD_MEMBLOCK if !HAVE_ARCH_PFN_VALID && !KEXEC
select ARCH_HAS_DEBUG_VIRTUAL if MMU
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
@@ -21,6 +20,7 @@ config ARM
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAVE_CUSTOM_GPIO_H
select ARCH_HAS_GCOV_PROFILE_ALL
+   select ARCH_KEEP_MEMBLOCK if HAVE_ARCH_PFN_VALID || KEXEC
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_NO_SG_CHAIN if !ARM_HAS_SG_CHAIN
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e34b9e..d71f043 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -58,6 +58,7 @@ config ARM64
select ARCH_INLINE_SPIN_UNLOCK_BH if !PREEMPT
select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPT
select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPT
+   select ARCH_KEEP_MEMBLOCK
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index ac44168..bbe3819 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -22,7 +22,6 @@ config HEXAGON
select GENERIC_IRQ_SHOW
select HAVE_ARCH_KGDB
select HAVE_ARCH_TRACEHOOK
-   select ARCH_DISCARD_MEMBLOCK
select NEED_SG_DMA_LENGTH
select NO_IOPORT_MAP
select GENERIC_IOMAP
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 8d7396b..bd51d3b 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -33,7 +33,6 @@ config IA64
select ARCH_HAS_DMA_COHERENT_TO_PFN if SWIOTLB
select ARCH_HAS_SYNC_DMA_FOR_CPU if SWIOTLB
select VIRT_TO_BUS
-   select ARCH_DISCARD_MEMBLOCK
select GENERIC_IRQ_PROBE
select GENERIC_PENDING_IRQ if SMP
select GENERIC_IRQ_SHOW
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index b542064..7d1e5d9 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -27,7 +27,6 @@ config M68K
select MODULES_USE_ELF_RELA
select OLD_SIGSUSPEND3
select OLD_SIGACTION
-   select ARCH_DISCARD_MEMBLOCK
 
 config CPU_BIG_ENDIAN
def_bool y
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 4a5f5b0..8b9298b 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -5,7 +5,6 @@ config MIPS
select ARCH_32BIT_OFF_T if !64BIT
select ARCH_BINFMT_ELF_STATE if MIPS_FP_SUPPORT
select ARCH_CLOCKSOURCE_DATA
-   select ARCH_DISCARD_MEMBLOCK
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index 4ef15a6..dc4239c 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -23,7 +23,6 @@ config NIOS2
select SPARSE_IRQ
select USB_ARCH_HAS_HCD if USB_SUPPORT
select CPU_NO_EFFICIENT_FFS
-   select ARCH_DISCARD_MEMBLOCK
 
 config GENERIC_CSUM
def_bool y
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82..39877b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -143,6 +143,7 @@ config PPC
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+   select ARCH_KEEP_MEMBLOCK
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index

Re: [PATCH v4 3/5] memblock: add memblock_cap_memory_ranges for multiple ranges

2019-04-15 Thread Mike Rapoport
Hi,

On Mon, Apr 15, 2019 at 06:57:23PM +0800, Chen Zhou wrote:
> The memblock_cap_memory_range() removes all the memory except the
> range passed to it. Extend this function to receive memblock_type
> with the regions that should be kept.
> 
> Enable this function in arm64 for reservation of multiple regions
> for the crash kernel.
> 
> Signed-off-by: Chen Zhou 
> Signed-off-by: Mike Rapoport 

I didn't work on this version, please drop the signed-off.

> ---
>  include/linux/memblock.h |  1 +
>  mm/memblock.c| 45 +
>  2 files changed, 46 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 47e3c06..180877c 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -446,6 +446,7 @@ phys_addr_t memblock_start_of_DRAM(void);
>  phys_addr_t memblock_end_of_DRAM(void);
>  void memblock_enforce_memory_limit(phys_addr_t memory_limit);
>  void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
> +void memblock_cap_memory_ranges(struct memblock_type *regions_to_keep);
>  void memblock_mem_limit_remove_map(phys_addr_t limit);
>  bool memblock_is_memory(phys_addr_t addr);
>  bool memblock_is_map_memory(phys_addr_t addr);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index f315eca..9661807 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1697,6 +1697,51 @@ void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size)
>   base + size, PHYS_ADDR_MAX);
>  }
>  
> +void __init memblock_cap_memory_ranges(struct memblock_type *regions_to_keep)
> +{
> + int start_rgn[INIT_MEMBLOCK_REGIONS], end_rgn[INIT_MEMBLOCK_REGIONS];
> + int i, j, ret, nr = 0;
> + struct memblock_region *regs = regions_to_keep->regions;
> +
> + for (i = 0; i < regions_to_keep->cnt; i++) {
> + ret = memblock_isolate_range(&memblock.memory, regs[i].base,
> + regs[i].size, &start_rgn[i], &end_rgn[i]);
> + if (ret)
> + break;
> + nr++;
> + }
> + if (!nr)
> + return;
> +
> + /* remove all the MAP regions */
> + for (i = memblock.memory.cnt - 1; i >= end_rgn[nr - 1]; i--)
> + if (!memblock_is_nomap(&memblock.memory.regions[i]))
> + memblock_remove_region(&memblock.memory, i);
> +
> + for (i = nr - 1; i > 0; i--)
> + for (j = start_rgn[i] - 1; j >= end_rgn[i - 1]; j--)
> + if (!memblock_is_nomap(&memblock.memory.regions[j]))
> + memblock_remove_region(&memblock.memory, j);
> +
> + for (i = start_rgn[0] - 1; i >= 0; i--)
> + if (!memblock_is_nomap(&memblock.memory.regions[i]))
> + memblock_remove_region(&memblock.memory, i);
> +
> + /* truncate the reserved regions */
> + memblock_remove_range(&memblock.reserved, 0, regs[0].base);
> +
> + for (i = nr - 1; i > 0; i--) {
> + phys_addr_t remove_base = regs[i - 1].base + regs[i - 1].size;
> + phys_addr_t remove_size = regs[i].base - remove_base;
> +
> + memblock_remove_range(&memblock.reserved, remove_base,
> + remove_size);
> + }
> +
> + memblock_remove_range(&memblock.reserved,
> + regs[nr - 1].base + regs[nr - 1].size, PHYS_ADDR_MAX);
> +}
> +

I've double-checked and I see no problem with using
for_each_mem_range_rev() iterators for removing some ranges. And with them
this function becomes much clearer and more efficient.

Can you please check if the below patch works for you?

From e25e6c9cd94a01abac124deacc66e5d258fdbf7c Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Wed, 10 Apr 2019 16:02:32 +0300
Subject: [PATCH] memblock: extend memblock_cap_memory_range to multiple ranges

The memblock_cap_memory_range() removes all the memory except the range
passed to it. Extend this function to receive an array of memblock_regions
that should be kept. This allows switching to simple iteration over
memblock arrays with 'for_each_mem_range_rev' to remove the unneeded memory.

Enable use of this function in arm64 for reservation of multiple regions for
the crash kernel.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/mm/init.c | 34 --
 include/linux/memblock.h |  2 +-
 mm/memblock.c| 44 
 3 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 6bc1350..8665d29 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -64,6 +64,10 @@ EXPORT_SYMBOL(memstart_addr);
 phys_addr_t arm64_dma_phys_limit __ro_after_init;
 
 #ifdef CONFIG_KEXEC_CORE
+
+/* at most two crash kernel regions, low_region and high_region */
+#define CRASH_MAX_USABLE_RANGES	2
+
 /*
  * rese
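
The interesting part is the memblock change itself; a sketch of it, under
the assumption that passing the regions to keep as the second
memblock_type argument makes for_each_mem_range_rev() return only the
areas *not* covered by them (NOMAP regions are skipped by the iterator):

void __init memblock_cap_memory_ranges(struct memblock_type *regions_to_keep)
{
	phys_addr_t start, end;
	u64 i;

	/* remove memory that is not covered by the regions to keep */
	for_each_mem_range_rev(i, &memblock.memory, regions_to_keep,
			       NUMA_NO_NODE, MEMBLOCK_NONE,
			       &start, &end, NULL)
		memblock_remove(start, end - start);

	/* same for the reserved regions */
	for_each_mem_range_rev(i, &memblock.reserved, regions_to_keep,
			       NUMA_NO_NODE, MEMBLOCK_NONE,
			       &start, &end, NULL)
		memblock_remove_range(&memblock.reserved, start, end - start);
}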

Re: [PATCH v3 3/4] arm64: kdump: support more than one crash kernel regions

2019-04-14 Thread Mike Rapoport
Hi,

On Mon, Apr 15, 2019 at 10:05:18AM +0800, Chen Zhou wrote:
> Hi Mike,
> 
> On 2019/4/14 20:10, Mike Rapoport wrote:
> >>
> >> solution A:phys_addr_t start[INIT_MEMBLOCK_RESERVED_REGIONS * 2];
> >>phys_addr_t end[INIT_MEMBLOCK_RESERVED_REGIONS * 2];
> >> start, end is physical addr
> >>
> >> solution B:int start_rgn[INIT_MEMBLOCK_REGIONS], 
> >> end_rgn[INIT_MEMBLOCK_REGIONS];
> >> start_rgn, end_rgn is rgn index
> >>
> >> Solution B do less remove operations and with no warning comparing to 
> >> solution A.
> >> I think solution B is better, could you give some suggestions?
> >  
> > Solution B is indeed better that solution A, but I'm still worried by
> > relatively large arrays on stack and the amount of loops :(
> > 
> > The very least we could do is to call memblock_cap_memory_range() to drop
> > the memory before and after the ranges we'd like to keep.
> 
> 1. relatively large arrays
> As my said above, the start_rgn, end_rgn is rgn index, we could use unsigned 
> char type.

Let's stick to int for now

> 2. loops
> Loops always exist; a solution with fewer visible loops may simply
> encapsulate them well.

Of course the loops are there; I just hoped we could get rid of the nested
loop and get away with single passes in all the cases.
Apparently that's not the case :(

> Thanks,
> Chen Zhou
> 

-- 
Sincerely yours,
Mike.




Re: [PATCH v3 3/4] arm64: kdump: support more than one crash kernel regions

2019-04-14 Thread Mike Rapoport
On Mon, Apr 15, 2019 at 10:27:30AM +0800, Chen Zhou wrote:
> Hi Mike,
> 
> On 2019/4/14 20:13, Mike Rapoport wrote:
> > Hi,
> > 
> > On Tue, Apr 09, 2019 at 06:28:18PM +0800, Chen Zhou wrote:
> >> After commit (arm64: kdump: support reserving crashkernel above 4G),
> >> there may be two crash kernel regions, one is below 4G, the other is
> >> above 4G.
> >>
> >> Crash dump kernel reads more than one crash kernel regions via a dtb
> >> property under node /chosen,
> >> linux,usable-memory-range = 
> > 
> > Somehow I've missed that previously, but how is this supposed to work on
> > EFI systems?
> 
> Whatever way the systems work, there is an FDT pointer (__fdt_pointer)
> in the arm64 kernel, and the file /sys/firmware/fdt is created in a
> late_initcall.
> 
> Kexec-tools reads and updates /sys/firmware/fdt on EFI systems to support
> booting the kdump capture kernel.
> 
> To support more than one crash kernel region, kexec-tools makes the
> corresponding changes.
> Details are below:
> http://lists.infradead.org/pipermail/kexec/2019-April/022792.html
 
Thanks for the clarification!

> Thanks,
> Chen Zhou
> 
> >  
> >> Signed-off-by: Chen Zhou 
> >> ---
> >>  arch/arm64/mm/init.c | 66 
> >> 
> >>  include/linux/memblock.h |  6 +
> >>  mm/memblock.c|  7 ++---
> >>  3 files changed, 66 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index 3bebddf..0f18665 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -65,6 +65,11 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >>  
> >>  #ifdef CONFIG_KEXEC_CORE
> >>  
> >> +/* at most two crash kernel regions, low_region and high_region */
> >> +#define CRASH_MAX_USABLE_RANGES   2
> >> +#define LOW_REGION_IDX 0
> >> +#define HIGH_REGION_IDX   1
> >> +
> >>  /*
> >>   * reserve_crashkernel() - reserves memory for crash kernel
> >>   *
> >> @@ -297,8 +302,8 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>const char *uname, int depth, void *data)
> >>  {
> >>struct memblock_region *usablemem = data;
> >> -  const __be32 *reg;
> >> -  int len;
> >> +  const __be32 *reg, *endp;
> >> +  int len, nr = 0;
> >>  
> >>if (depth != 1 || strcmp(uname, "chosen") != 0)
> >>return 0;
> >> @@ -307,22 +312,63 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
> >>return 1;
> >>  
> >> -  usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> -  usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +  endp = reg + (len / sizeof(__be32));
> >> +  while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> >> +  usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> +  usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +
> >> +  if (++nr >= CRASH_MAX_USABLE_RANGES)
> >> +  break;
> >> +  }
> >>  
> >>return 1;
> >>  }
> >>  
> >>  static void __init fdt_enforce_memory_region(void)
> >>  {
> >> -  struct memblock_region reg = {
> >> -  .size = 0,
> >> -  };
> >> +  int i, cnt = 0;
> >> +  struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> >> +
> >> +  memset(regs, 0, sizeof(regs));
> >> +  of_scan_flat_dt(early_init_dt_scan_usablemem, regs);
> >> +
> >> +  for (i = 0; i < CRASH_MAX_USABLE_RANGES; i++)
> >> +  if (regs[i].size)
> >> +  cnt++;
> >> +  else
> >> +  break;
> >> +
> >> +  if (cnt - 1 == LOW_REGION_IDX)
> >> +  memblock_cap_memory_range(regs[LOW_REGION_IDX].base,
> >> +  regs[LOW_REGION_IDX].size);
> >> +  else if (cnt - 1 == HIGH_REGION_IDX) {
> >> +  /*
> >> +   * Two crash kernel regions, cap the memory range
> >> +   * [regs[LOW_REGION_IDX].base, regs[HIGH_REGION_IDX].end]
> >> +   * and then remove the memory range in the middle.

Re: [PATCH v3 3/4] arm64: kdump: support more than one crash kernel regions

2019-04-14 Thread Mike Rapoport
Hi,

On Tue, Apr 09, 2019 at 06:28:18PM +0800, Chen Zhou wrote:
> After commit (arm64: kdump: support reserving crashkernel above 4G),
> there may be two crash kernel regions, one is below 4G, the other is
> above 4G.
> 
> Crash dump kernel reads more than one crash kernel regions via a dtb
> property under node /chosen,
> > linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>

Somehow I've missed that previously, but how is this supposed to work on
EFI systems?
 
> Signed-off-by: Chen Zhou 
> ---
>  arch/arm64/mm/init.c | 66 
> 
>  include/linux/memblock.h |  6 +
>  mm/memblock.c|  7 ++---
>  3 files changed, 66 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 3bebddf..0f18665 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -65,6 +65,11 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  
>  #ifdef CONFIG_KEXEC_CORE
>  
> +/* at most two crash kernel regions, low_region and high_region */
> +#define CRASH_MAX_USABLE_RANGES  2
> +#define LOW_REGION_IDX   0
> +#define HIGH_REGION_IDX  1
> +
>  /*
>   * reserve_crashkernel() - reserves memory for crash kernel
>   *
> @@ -297,8 +302,8 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   const char *uname, int depth, void *data)
>  {
>   struct memblock_region *usablemem = data;
> - const __be32 *reg;
> - int len;
> + const __be32 *reg, *endp;
> + int len, nr = 0;
>  
>   if (depth != 1 || strcmp(uname, "chosen") != 0)
>   return 0;
> @@ -307,22 +312,63 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
>   return 1;
>  
> - usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> - usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> + endp = reg + (len / sizeof(__be32));
> + while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> + usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> + usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> +
> + if (++nr >= CRASH_MAX_USABLE_RANGES)
> + break;
> + }
>  
>   return 1;
>  }
>  
>  static void __init fdt_enforce_memory_region(void)
>  {
> - struct memblock_region reg = {
> - .size = 0,
> - };
> + int i, cnt = 0;
> + struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> +
> + memset(regs, 0, sizeof(regs));
> + of_scan_flat_dt(early_init_dt_scan_usablemem, regs);
> +
> + for (i = 0; i < CRASH_MAX_USABLE_RANGES; i++)
> + if (regs[i].size)
> + cnt++;
> + else
> + break;
> +
> + if (cnt - 1 == LOW_REGION_IDX)
> + memblock_cap_memory_range(regs[LOW_REGION_IDX].base,
> + regs[LOW_REGION_IDX].size);
> + else if (cnt - 1 == HIGH_REGION_IDX) {
> + /*
> +  * Two crash kernel regions, cap the memory range
> +  * [regs[LOW_REGION_IDX].base, regs[HIGH_REGION_IDX].end]
> +  * and then remove the memory range in the middle.
> +  */
> + int start_rgn, end_rgn, i, ret;
> + phys_addr_t mid_base, mid_size;
> +
> + mid_base = regs[LOW_REGION_IDX].base + 
> regs[LOW_REGION_IDX].size;
> + mid_size = regs[HIGH_REGION_IDX].base - mid_base;
> > + ret = memblock_isolate_range(&memblock.memory, mid_base,
> > + mid_size, &start_rgn, &end_rgn);
>  
> > - of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> + if (ret)
> + return;
>  
> - if (reg.size)
> - memblock_cap_memory_range(reg.base, reg.size);
> + memblock_cap_memory_range(regs[LOW_REGION_IDX].base,
> + regs[HIGH_REGION_IDX].base -
> + regs[LOW_REGION_IDX].base +
> + regs[HIGH_REGION_IDX].size);
> > + for (i = end_rgn - 1; i >= start_rgn; i--) {
> > + if (!memblock_is_nomap(&memblock.memory.regions[i]))
> > + memblock_remove_region(&memblock.memory, i);
> > + }
> > + memblock_remove_range(&memblock.reserved, mid_base,
> > + mid_base + mid_size);
> + }
>  }
>  
>  void __init arm64_memblock_init(void)
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 294d5d8..787d252 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -110,9 +110,15 @@ void memblock_discard(void);
>  
>  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
>  phys_addr_t size, phys_addr_t align);
> +void memblock_remove_region(struct memblock_type *type, unsigned long r);
>  void memblock_allow_resize(void);
>  

Re: [PATCH v3 3/4] arm64: kdump: support more than one crash kernel regions

2019-04-14 Thread Mike Rapoport
Hi,

On Thu, Apr 11, 2019 at 08:17:43PM +0800, Chen Zhou wrote:
> Hi Mike,
> 
> This overall looks good.
> Replacing memblock_cap_memory_range() with memblock_cap_memory_ranges() was
> what I wanted to do in v1; sorry for not expressing that clearly.

I didn't object to memblock_cap_memory_ranges() in general, I was worried
about its complexity and I hoped that we could find a simpler solution.
 
> But there are some issues, as noted below. After fixing these, it works correctly.
> 
> On 2019/4/10 21:09, Mike Rapoport wrote:
> > Hi,
> > 
> > On Tue, Apr 09, 2019 at 06:28:18PM +0800, Chen Zhou wrote:
> >> After commit (arm64: kdump: support reserving crashkernel above 4G),
> >> there may be two crash kernel regions, one is below 4G, the other is
> >> above 4G.
> >>
> >> Crash dump kernel reads more than one crash kernel regions via a dtb
> >> property under node /chosen,
> >> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
> >>
> >> Signed-off-by: Chen Zhou 
> >> ---
> >>  arch/arm64/mm/init.c | 66 
> >> 
> >>  include/linux/memblock.h |  6 +
> >>  mm/memblock.c|  7 ++---
> >>  3 files changed, 66 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index 3bebddf..0f18665 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -65,6 +65,11 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >>  
> >>  #ifdef CONFIG_KEXEC_CORE
> >>  
> >> +/* at most two crash kernel regions, low_region and high_region */
> >> +#define CRASH_MAX_USABLE_RANGES   2
> >> +#define LOW_REGION_IDX 0
> >> +#define HIGH_REGION_IDX   1
> >> +
> >>  /*
> >>   * reserve_crashkernel() - reserves memory for crash kernel
> >>   *
> >> @@ -297,8 +302,8 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>const char *uname, int depth, void *data)
> >>  {
> >>struct memblock_region *usablemem = data;
> >> -  const __be32 *reg;
> >> -  int len;
> >> +  const __be32 *reg, *endp;
> >> +  int len, nr = 0;
> >>  
> >>if (depth != 1 || strcmp(uname, "chosen") != 0)
> >>return 0;
> >> @@ -307,22 +312,63 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
> >>return 1;
> >>  
> >> -  usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> -  usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +  endp = reg + (len / sizeof(__be32));
> >> +  while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> >> +  usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> +  usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +
> >> +  if (++nr >= CRASH_MAX_USABLE_RANGES)
> >> +  break;
> >> +  }
> >>  
> >>return 1;
> >>  }
> >>  
> >>  static void __init fdt_enforce_memory_region(void)
> >>  {
> >> -  struct memblock_region reg = {
> >> -  .size = 0,
> >> -  };
> >> +  int i, cnt = 0;
> >> +  struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> > 
> > I only now noticed that fdt_enforce_memory_region() uses memblock_region to
> > pass the ranges around. If we'd switch to memblock_type instead, the
> > implementation of memblock_cap_memory_ranges() would be really
> > straightforward. Can you check if the below patch works for you? 
> > 
> > From e476d584098e31273af573e1a78e308880c5cf28 Mon Sep 17 00:00:00 2001
> > From: Mike Rapoport 
> > Date: Wed, 10 Apr 2019 16:02:32 +0300
> > Subject: [PATCH] memblock: extend memblock_cap_memory_range to multiple 
> > ranges
> > 
> > The memblock_cap_memory_range() removes all the memory except the range
> > passed to it. Extend this function to receive memblock_type with the
> > regions that should be kept. This allows switching to simple iteration over
> > memblock arrays with 'for_each_mem_range' to remove the unneeded memory.
> > 
> > Enable use of this function in arm64 for reservation of multiple regions for
> > the crash kernel.
> > 
> > Signed-off-by: Mike Rapoport

Re: [PATCH v3 3/4] arm64: kdump: support more than one crash kernel regions

2019-04-10 Thread Mike Rapoport
Hi,

On Tue, Apr 09, 2019 at 06:28:18PM +0800, Chen Zhou wrote:
> After commit (arm64: kdump: support reserving crashkernel above 4G),
> there may be two crash kernel regions, one is below 4G, the other is
> above 4G.
> 
> Crash dump kernel reads more than one crash kernel regions via a dtb
> property under node /chosen,
> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
> 
> Signed-off-by: Chen Zhou 
> ---
>  arch/arm64/mm/init.c | 66 
> 
>  include/linux/memblock.h |  6 +
>  mm/memblock.c|  7 ++---
>  3 files changed, 66 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 3bebddf..0f18665 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -65,6 +65,11 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  
>  #ifdef CONFIG_KEXEC_CORE
>  
> +/* at most two crash kernel regions, low_region and high_region */
> +#define CRASH_MAX_USABLE_RANGES  2
> +#define LOW_REGION_IDX   0
> +#define HIGH_REGION_IDX  1
> +
>  /*
>   * reserve_crashkernel() - reserves memory for crash kernel
>   *
> @@ -297,8 +302,8 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   const char *uname, int depth, void *data)
>  {
>   struct memblock_region *usablemem = data;
> - const __be32 *reg;
> - int len;
> + const __be32 *reg, *endp;
> + int len, nr = 0;
>  
>   if (depth != 1 || strcmp(uname, "chosen") != 0)
>   return 0;
> @@ -307,22 +312,63 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
>   return 1;
>  
> - usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> - usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> + endp = reg + (len / sizeof(__be32));
> + while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> + usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> + usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> +
> + if (++nr >= CRASH_MAX_USABLE_RANGES)
> + break;
> + }
>  
>   return 1;
>  }
>  
>  static void __init fdt_enforce_memory_region(void)
>  {
> - struct memblock_region reg = {
> - .size = 0,
> - };
> + int i, cnt = 0;
> + struct memblock_region regs[CRASH_MAX_USABLE_RANGES];

I only now noticed that fdt_enforce_memory_region() uses memblock_region to
pass the ranges around. If we'd switch to memblock_type instead, the
implementation of memblock_cap_memory_ranges() would be really
straightforward. Can you check if the below patch works for you? 

From e476d584098e31273af573e1a78e308880c5cf28 Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Wed, 10 Apr 2019 16:02:32 +0300
Subject: [PATCH] memblock: extend memblock_cap_memory_range to multiple ranges

The memblock_cap_memory_range() removes all the memory except the range
passed to it. Extend this function to receive memblock_type with the
regions that should be kept. This allows switching to simple iteration over
memblock arrays with 'for_each_mem_range' to remove the unneeded memory.

Enable use of this function in arm64 for reservation of multiple regions for
the crash kernel.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/mm/init.c | 34 --
 include/linux/memblock.h |  2 +-
 mm/memblock.c| 45 ++---
 3 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 6bc1350..30a496f 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -64,6 +64,10 @@ EXPORT_SYMBOL(memstart_addr);
 phys_addr_t arm64_dma_phys_limit __ro_after_init;
 
 #ifdef CONFIG_KEXEC_CORE
+
+/* at most two crash kernel regions, low_region and high_region */
+#define CRASH_MAX_USABLE_RANGES 2
+
 /*
  * reserve_crashkernel() - reserves memory for crash kernel
  *
@@ -280,9 +284,9 @@ early_param("mem", early_mem);
 static int __init early_init_dt_scan_usablemem(unsigned long node,
const char *uname, int depth, void *data)
 {
-   struct memblock_region *usablemem = data;
-   const __be32 *reg;
-   int len;
+   struct memblock_type *usablemem = data;
+   const __be32 *reg, *endp;
+   int len, nr = 0;
 
if (depth != 1 || strcmp(uname, "chosen") != 0)
return 0;
@@ -291,22 +295,32 @@ static int __init early_init_dt_scan_usablemem(unsigned 
long node,
if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
   

Re: [PATCH 2/3] arm64: kdump: support more than one crash kernel regions

2019-04-08 Thread Mike Rapoport
Hi,

On Fri, Apr 05, 2019 at 11:47:27AM +0800, Chen Zhou wrote:
> Hi Mike,
> 
> On 2019/4/5 10:17, Chen Zhou wrote:
> > Hi Mike,
> > 
> > On 2019/4/4 22:44, Mike Rapoport wrote:
> >> Hi,
> >>
> >> On Wed, Apr 03, 2019 at 09:51:27PM +0800, Chen Zhou wrote:
> >>> Hi Mike,
> >>>
> >>> On 2019/4/3 19:29, Mike Rapoport wrote:
> >>>> On Wed, Apr 03, 2019 at 11:05:45AM +0800, Chen Zhou wrote:
> >>>>> After commit (arm64: kdump: support reserving crashkernel above 4G),
> >>>>> there may be two crash kernel regions, one is below 4G, the other is
> >>>>> above 4G.
> >>>>>
> >>>>> Crash dump kernel reads more than one crash kernel regions via a dtb
> >>>>> property under node /chosen,
> >>>>> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
> >>>>>
> >>>>> Signed-off-by: Chen Zhou 
> >>>>> ---
> >>>>>  arch/arm64/mm/init.c | 37 +
> >>>>>  include/linux/memblock.h |  1 +
> >>>>>  mm/memblock.c| 40 
> >>>>>  3 files changed, 66 insertions(+), 12 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >>>>> index ceb2a25..769c77a 100644
> >>>>> --- a/arch/arm64/mm/init.c
> >>>>> +++ b/arch/arm64/mm/init.c
> >>>>> @@ -64,6 +64,8 @@ EXPORT_SYMBOL(memstart_addr);
> >>>>>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >>>>>  
> >>>>>  #ifdef CONFIG_KEXEC_CORE
> >>>>> +# define CRASH_MAX_USABLE_RANGES 2
> >>>>> +
> >>>>>  static int __init reserve_crashkernel_low(void)
> >>>>>  {
> >>>>> unsigned long long base, low_base = 0, low_size = 0;
> >>>>> @@ -346,8 +348,8 @@ static int __init 
> >>>>> early_init_dt_scan_usablemem(unsigned long node,
> >>>>> const char *uname, int depth, void *data)
> >>>>>  {
> >>>>> struct memblock_region *usablemem = data;
> >>>>> -   const __be32 *reg;
> >>>>> -   int len;
> >>>>> +   const __be32 *reg, *endp;
> >>>>> +   int len, nr = 0;
> >>>>>  
> >>>>> if (depth != 1 || strcmp(uname, "chosen") != 0)
> >>>>> return 0;
> >>>>> @@ -356,22 +358,33 @@ static int __init 
> >>>>> early_init_dt_scan_usablemem(unsigned long node,
> >>>>> if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
> >>>>> return 1;
> >>>>>  
> >>>>> -   usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >>>>> -   usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >>>>> +   endp = reg + (len / sizeof(__be32));
> >>>>> +   while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> >>>>> +   usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >>>>> +   usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >>>>> +
> >>>>> +   if (++nr >= CRASH_MAX_USABLE_RANGES)
> >>>>> +   break;
> >>>>> +   }
> >>>>>  
> >>>>> return 1;
> >>>>>  }
> >>>>>  
> >>>>>  static void __init fdt_enforce_memory_region(void)
> >>>>>  {
> >>>>> -   struct memblock_region reg = {
> >>>>> -   .size = 0,
> >>>>> -   };
> >>>>> -
> >>>>> -   of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> >>>>> -
> >>>>> -   if (reg.size)
> >>>>> -   memblock_cap_memory_range(reg.base, reg.size);
> >>>>> +   int i, cnt = 0;
> >>>>> +   struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> >>>>> +
> >>>>> +   memset(regs, 0, sizeof(regs));
> >

Re: [PATCH 1/3] arm64: kdump: support reserving crashkernel above 4G

2019-04-04 Thread Mike Rapoport
Hi,

On Wed, Apr 03, 2019 at 11:05:44AM +0800, Chen Zhou wrote:
> When crashkernel is reserved above 4G in memory, kernel should
> reserve some amount of low memory for swiotlb and some DMA buffers.
> 
> Kernel would try to allocate at least 256M below 4G automatically
> as x86_64 if crashkernel is above 4G. Meanwhile, support
> crashkernel=X,[high,low] in arm64.
> 
> Signed-off-by: Chen Zhou 
> ---
>  arch/arm64/kernel/setup.c |  3 ++
>  arch/arm64/mm/init.c  | 71 
> +--
>  2 files changed, 71 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 413d566..82cd9a0 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -243,6 +243,9 @@ static void __init request_standard_resources(void)
>   request_resource(res, &kernel_data);
>  #ifdef CONFIG_KEXEC_CORE
>   /* Userspace will find "Crash kernel" region in /proc/iomem. */
> + if (crashk_low_res.end && crashk_low_res.start >= res->start &&
> + crashk_low_res.end <= res->end)
> + request_resource(res, &crashk_low_res);
>   if (crashk_res.end && crashk_res.start >= res->start &&
>   crashk_res.end <= res->end)
>   request_resource(res, &crashk_res);
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 6bc1350..ceb2a25 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -64,6 +64,57 @@ EXPORT_SYMBOL(memstart_addr);
>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  
>  #ifdef CONFIG_KEXEC_CORE
> +static int __init reserve_crashkernel_low(void)
> +{
> + unsigned long long base, low_base = 0, low_size = 0;
> + unsigned long total_low_mem;
> + int ret;
> +
> + total_low_mem = memblock_mem_size(1UL << (32 - PAGE_SHIFT));
> +
> + /* crashkernel=Y,low */
> + ret = parse_crashkernel_low(boot_command_line, total_low_mem,
> + &low_size, &base);
> + if (ret) {
> + /*
> +  * two parts from lib/swiotlb.c:
> +  * -swiotlb size: user-specified with swiotlb= or default.
> +  *
> +  * -swiotlb overflow buffer: now hardcoded to 32k. We round it
> +  * to 8M for other buffers that may need to stay low too. Also
> +  * make sure we allocate enough extra low memory so that we
> +  * don't run out of DMA buffers for 32-bit devices.
> +  */
> + low_size = max(swiotlb_size_or_default() + (8UL << 20), 256UL 
> << 20);
> + } else {
> + /* passed with crashkernel=0,low ? */
> + if (!low_size)
> + return 0;
> + }
> +
> + low_base = memblock_find_in_range(0, 1ULL << 32, low_size, SZ_2M);
> + if (!low_base) {
> + pr_err("Cannot reserve %ldMB crashkernel low memory, please try 
> smaller size.\n",
> + (unsigned long)(low_size >> 20));
> + return -ENOMEM;
> + }
> +
> + ret = memblock_reserve(low_base, low_size);
> + if (ret) {
> + pr_err("%s: Error reserving crashkernel low memblock.\n", 
> __func__);
> + return ret;
> + }
> +
> + pr_info("Reserving %ldMB of low memory at %ldMB for crashkernel (System 
> RAM: %ldMB)\n",
> + (unsigned long)(low_size >> 20),
> + (unsigned long)(low_base >> 20),
> + (unsigned long)(total_low_mem >> 20));
> +
> + crashk_low_res.start = low_base;
> + crashk_low_res.end   = low_base + low_size - 1;
> +
> + return 0;
> +}
> +
>  /*
>   * reserve_crashkernel() - reserves memory for crash kernel
>   *
> @@ -74,19 +125,28 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  static void __init reserve_crashkernel(void)
>  {
>   unsigned long long crash_base, crash_size;
> + bool high = false;
>   int ret;
>  
>   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
>   &crash_size, &crash_base);
>   /* no crashkernel= or invalid value specified */
> - if (ret || !crash_size)
> - return;
> + if (ret || !crash_size) {
> + /* crashkernel=X,high */
> + ret = parse_crashkernel_high(boot_command_line, memblock_phys_mem_size(),
> + &crash_size, &crash_base);
> + if (ret || !crash_size)
> + return;
> + high = true;
> + }
>  
>   crash_size = PAGE_ALIGN(crash_size);
>  
>   if (crash_base == 0) {
>   /* Current arm64 boot protocol requires 2MB alignment */
> - crash_base = memblock_find_in_range(0, ARCH_LOW_ADDRESS_LIMIT,
> + crash_base = memblock_find_in_range(0,
> + high ? memblock_end_of_DRAM()
> + : ARCH_LOW_ADDRESS_LIMIT,
>   crash_size, SZ_2M);
>

Re: [PATCH 2/3] arm64: kdump: support more than one crash kernel regions

2019-04-04 Thread Mike Rapoport
Hi,

On Wed, Apr 03, 2019 at 09:51:27PM +0800, Chen Zhou wrote:
> Hi Mike,
> 
> On 2019/4/3 19:29, Mike Rapoport wrote:
> > On Wed, Apr 03, 2019 at 11:05:45AM +0800, Chen Zhou wrote:
> >> After commit (arm64: kdump: support reserving crashkernel above 4G),
> >> there may be two crash kernel regions, one is below 4G, the other is
> >> above 4G.
> >>
> >> Crash dump kernel reads more than one crash kernel regions via a dtb
> >> property under node /chosen,
> >> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
> >>
> >> Signed-off-by: Chen Zhou 
> >> ---
> >>  arch/arm64/mm/init.c | 37 +
> >>  include/linux/memblock.h |  1 +
> >>  mm/memblock.c| 40 
> >>  3 files changed, 66 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index ceb2a25..769c77a 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -64,6 +64,8 @@ EXPORT_SYMBOL(memstart_addr);
> >>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >>  
> >>  #ifdef CONFIG_KEXEC_CORE
> >> +# define CRASH_MAX_USABLE_RANGES 2
> >> +
> >>  static int __init reserve_crashkernel_low(void)
> >>  {
> >>unsigned long long base, low_base = 0, low_size = 0;
> >> @@ -346,8 +348,8 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>const char *uname, int depth, void *data)
> >>  {
> >>struct memblock_region *usablemem = data;
> >> -  const __be32 *reg;
> >> -  int len;
> >> +  const __be32 *reg, *endp;
> >> +  int len, nr = 0;
> >>  
> >>if (depth != 1 || strcmp(uname, "chosen") != 0)
> >>return 0;
> >> @@ -356,22 +358,33 @@ static int __init 
> >> early_init_dt_scan_usablemem(unsigned long node,
> >>if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
> >>return 1;
> >>  
> >> -  usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> -  usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +  endp = reg + (len / sizeof(__be32));
> >> +  while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> >> +  usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> >> +  usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> >> +
> >> +  if (++nr >= CRASH_MAX_USABLE_RANGES)
> >> +  break;
> >> +  }
> >>  
> >>return 1;
> >>  }
> >>  
> >>  static void __init fdt_enforce_memory_region(void)
> >>  {
> >> -  struct memblock_region reg = {
> >> -  .size = 0,
> >> -  };
> >> -
> >> -  of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> >> -
> >> -  if (reg.size)
> >> -  memblock_cap_memory_range(reg.base, reg.size);
> >> +  int i, cnt = 0;
> >> +  struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> >> +
> >> +  memset(regs, 0, sizeof(regs));
> >> +  of_scan_flat_dt(early_init_dt_scan_usablemem, regs);
> >> +
> >> +  for (i = 0; i < CRASH_MAX_USABLE_RANGES; i++)
> >> +  if (regs[i].size)
> >> +  cnt++;
> >> +  else
> >> +  break;
> >> +  if (cnt)
> >> +  memblock_cap_memory_ranges(regs, cnt);
> > 
> > Why not simply call memblock_cap_memory_range() for each region?
> 
> The function memblock_cap_memory_range() removes all memory ranges except
> the specified range.
> So if we simply called memblock_cap_memory_range() for each region, there
> would be no usable memory left
> for the kdump capture kernel.

Thanks for the clarification.
I still think that memblock_cap_memory_ranges() is overly complex. 

How about doing something like this:

Cap the memory to the range [min(regs[*].start), max(regs[*].end)] and then
remove the range in the middle?
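
To illustrate the idea, a sketch with a hypothetical helper (it assumes
exactly two kept regions sorted by base and ignores the NOMAP and
reserved-region details the real code has to handle):

	static void __init cap_to_two_regions(const struct memblock_region *low,
					      const struct memblock_region *high)
	{
		phys_addr_t gap_base = low->base + low->size;
		phys_addr_t gap_size = high->base - gap_base;

		/* keep only the spanning range [low->base, end of high) ... */
		memblock_cap_memory_range(low->base,
					  high->base + high->size - low->base);

		/* ... then carve the gap between the two regions back out */
		memblock_remove(gap_base, gap_size);
	}

This needs only one cap plus one remove instead of per-range isolation
bookkeeping.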
 
> Thanks,
> Chen Zhou
> 
> > 
> >>  }
> >>  
> >>  void __init arm64_memblock_init(void)
> >> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >> index 47e3c06..aeade34 100644
> >> --- a/include/linux/memblock.h
> >> +++ b/include/linux/memblock.h
> >> @@ -446,6 +446,7 @@ phys_addr_t memblock_start_of_DRAM(void);
> >>  phys_addr_t memblock_end_of_DRAM(void);

Re: [PATCH 2/3] arm64: kdump: support more than one crash kernel regions

2019-04-03 Thread Mike Rapoport
On Wed, Apr 03, 2019 at 11:05:45AM +0800, Chen Zhou wrote:
> After commit (arm64: kdump: support reserving crashkernel above 4G),
> there may be two crash kernel regions, one is below 4G, the other is
> above 4G.
> 
> Crash dump kernel reads more than one crash kernel regions via a dtb
> property under node /chosen,
> linux,usable-memory-range = <BASE1 SIZE1 [BASE2 SIZE2]>
> 
> Signed-off-by: Chen Zhou 
> ---
>  arch/arm64/mm/init.c | 37 +
>  include/linux/memblock.h |  1 +
>  mm/memblock.c| 40 
>  3 files changed, 66 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index ceb2a25..769c77a 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -64,6 +64,8 @@ EXPORT_SYMBOL(memstart_addr);
>  phys_addr_t arm64_dma_phys_limit __ro_after_init;
>  
>  #ifdef CONFIG_KEXEC_CORE
> +# define CRASH_MAX_USABLE_RANGES 2
> +
>  static int __init reserve_crashkernel_low(void)
>  {
>   unsigned long long base, low_base = 0, low_size = 0;
> @@ -346,8 +348,8 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   const char *uname, int depth, void *data)
>  {
>   struct memblock_region *usablemem = data;
> - const __be32 *reg;
> - int len;
> + const __be32 *reg, *endp;
> + int len, nr = 0;
>  
>   if (depth != 1 || strcmp(uname, "chosen") != 0)
>   return 0;
> @@ -356,22 +358,33 @@ static int __init early_init_dt_scan_usablemem(unsigned 
> long node,
>   if (!reg || (len < (dt_root_addr_cells + dt_root_size_cells)))
>   return 1;
>  
> - usablemem->base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> - usablemem->size = dt_mem_next_cell(dt_root_size_cells, &reg);
> + endp = reg + (len / sizeof(__be32));
> + while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
> + usablemem[nr].base = dt_mem_next_cell(dt_root_addr_cells, &reg);
> + usablemem[nr].size = dt_mem_next_cell(dt_root_size_cells, &reg);
> +
> + if (++nr >= CRASH_MAX_USABLE_RANGES)
> + break;
> + }
>  
>   return 1;
>  }
>  
>  static void __init fdt_enforce_memory_region(void)
>  {
> - struct memblock_region reg = {
> - .size = 0,
> - };
> -
> - of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> -
> - if (reg.size)
> - memblock_cap_memory_range(reg.base, reg.size);
> + int i, cnt = 0;
> + struct memblock_region regs[CRASH_MAX_USABLE_RANGES];
> +
> + memset(regs, 0, sizeof(regs));
> + of_scan_flat_dt(early_init_dt_scan_usablemem, regs);
> +
> + for (i = 0; i < CRASH_MAX_USABLE_RANGES; i++)
> + if (regs[i].size)
> + cnt++;
> + else
> + break;
> + if (cnt)
> + memblock_cap_memory_ranges(regs, cnt);

Why not simply call memblock_cap_memory_range() for each region?

>  }
>  
>  void __init arm64_memblock_init(void)
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 47e3c06..aeade34 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -446,6 +446,7 @@ phys_addr_t memblock_start_of_DRAM(void);
>  phys_addr_t memblock_end_of_DRAM(void);
>  void memblock_enforce_memory_limit(phys_addr_t memory_limit);
>  void memblock_cap_memory_range(phys_addr_t base, phys_addr_t size);
> +void memblock_cap_memory_ranges(struct memblock_region *regs, int cnt);
>  void memblock_mem_limit_remove_map(phys_addr_t limit);
>  bool memblock_is_memory(phys_addr_t addr);
>  bool memblock_is_map_memory(phys_addr_t addr);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 28fa8926..1a7f4ee7c 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1697,6 +1697,46 @@ void __init memblock_cap_memory_range(phys_addr_t 
> base, phys_addr_t size)
>   base + size, PHYS_ADDR_MAX);
>  }
>  
> +void __init memblock_cap_memory_ranges(struct memblock_region *regs, int cnt)
> +{
> + int start_rgn[INIT_MEMBLOCK_REGIONS], end_rgn[INIT_MEMBLOCK_REGIONS];
> + int i, j, ret, nr = 0;
> +
> + for (i = 0; i < cnt; i++) {
> + ret = memblock_isolate_range(&memblock.memory, regs[i].base,
> + regs[i].size, &start_rgn[i], &end_rgn[i]);
> + if (ret)
> + break;
> + nr++;
> + }
> + if (!nr)
> + return;
> +
> + /* remove all the MAP regions */
> + for (i = memblock.memory.cnt - 1; i >= end_rgn[nr - 1]; i--)
> + if (!memblock_is_nomap(&memblock.memory.regions[i]))
> + memblock_remove_region(&memblock.memory, i);
> +
> + for (i = nr - 1; i > 0; i--)
> + for (j = start_rgn[i] - 1; j >= end_rgn[i - 1]; j--)
> + if (!memblock_is_nomap(&memblock.memory.regions[j]))
> + memblock_remove_region(&memblock.memory, j);
> +
> + for (i = start_rgn[0] - 1; i >= 0; i--)
> + if (!memblock_is_nomap(&memblock.memory.regions[i]))
> + memblock_remove_region(&memblock.memory, i);

Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-09 Thread Mike Rapoport
Hi Pingfan,

On Wed, Jan 09, 2019 at 09:02:41PM +0800, Pingfan Liu wrote:
> On Tue, Jan 8, 2019 at 11:49 PM Mike Rapoport  wrote:
> >
> > On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> > > Hi Mike,
> > >
> > > On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > > > I'm not thrilled by duplicating this code (yet again).
> > > > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode 
> > > > to
> > > > allocate [0, kernel_start) unconditionally.
> > > > I'd just replace you first patch in v3 [2] with something like:
> > >
> > > In initmem_init(), we will restore the top-down allocation style anyway.
> > > Since reserve_crashkernel() is called after initmem_init(), it's not
> > > appropriate to adjust memblock_find_in_range_node(), and we really want
> > > to find a region bottom-up for the crashkernel reservation no matter
> > > where the kernel is loaded, so it's better to call
> > > __memblock_find_range_bottom_up() directly.
> > >
> > > Creating a wrapper to do the necessary handling and then calling
> > > __memblock_find_range_bottom_up() directly looks better.
> >
> > What bothers me is 'the necessary handling', which is already done in
> > several places in memblock in a similar, yet slightly different, way.
> >
> > memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
> > MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
> > allocating from the specified node and does not retry when it falls back to
> > any node. And memblock_alloc_internal() has yet another set of fallbacks.
> >
> > So what should be the necessary handling in the wrapper for
> > __memblock_find_range_bottom_up() ?
> >
> Well, it is a hard choice.
> > BTW, even without any memblock modifications, retrying allocation in
> > reserve_crashkernel() for different ranges, like the proposal at [1], would
> > also work, wouldn't it?
> >
> Yes, it can work. Then is it worth exposing the bottom-up allocation
> style beyond the hot-movable purpose?

Some architectures use bottom-up as a "compatibility" mode with bootmem.
And, I believe, powerpc and s390 use bottom-up to make some of the
allocations close to the kernel.
 
> Thanks,
> Pingfan
> > [1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> >
> > > Thanks
> > > Baoquan
> > >
> > > >
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 7df468c..d1b30b9 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock 
> > > > memblock_find_in_range_node(phys_addr_t size,
> > > >  * try bottom-up allocation only when bottom-up mode
> > > >  * is set and @end is above the kernel image.
> > > >  */
> > > > -   if (memblock_bottom_up() && end > kernel_end) {
> > > > -   phys_addr_t bottom_up_start;
> > > > -
> > > > -   /* make sure we will allocate above the kernel */
> > > > -   bottom_up_start = max(start, kernel_end);
> > > > -
> > > > +   if (memblock_bottom_up()) {
> > > > /* ok, try bottom-up allocation first */
> > > > -   ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > > > +   ret = __memblock_find_range_bottom_up(start, end,
> > > >   size, align, nid, 
> > > > flags);
> > > > if (ret)
> > > > return ret;
> > > >
> > > > /*
> > > > -* we always limit bottom-up allocation above the kernel,
> > > > -* but top-down allocation doesn't have the limit, so
> > > > -* retrying top-down allocation may succeed when bottom-up
> > > > -* allocation failed.
> > > > -*
> > > >  * bottom-up allocation is expected to be fail very rarely,
> > > >  * so we use WARN_ONCE() here to see the stack trace if
> > > >  * fail happens.
> > > >
> > > > [1] 
> > > > https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelf...@gmail.com/
> > > > [2] 
> > > > https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelf...@gmail.com/
> > > >
> > > > > +
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * __memblock_find_range_top_down - find free area utility, in 
> > > > > top-down
> > > > >   * @start: start of candidate range
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > --
> > > > Sincerely yours,
> > > > Mike.
> > > >
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> 

-- 
Sincerely yours,
Mike.




Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-08 Thread Mike Rapoport
On Tue, Jan 08, 2019 at 05:01:38PM +0800, Baoquan He wrote:
> Hi Mike,
> 
> On 01/08/19 at 10:05am, Mike Rapoport wrote:
> > I'm not thrilled by duplicating this code (yet again).
> > I liked the v3 of this patch [1] more, assuming we allow bottom-up mode to
> > allocate [0, kernel_start) unconditionally. 
> > I'd just replace you first patch in v3 [2] with something like:
> 
> In initmem_init(), we will restore the top-down allocation style anyway.
> Since reserve_crashkernel() is called after initmem_init(), it's not
> appropriate to adjust memblock_find_in_range_node(), and we really want
> to find a region bottom-up for the crashkernel reservation no matter
> where the kernel is loaded, so it's better to call
> __memblock_find_range_bottom_up() directly.
> 
> Creating a wrapper to do the necessary handling and then calling
> __memblock_find_range_bottom_up() directly looks better.

What bothers me is 'the necessary handling', which is already done in
several places in memblock in a similar, yet slightly different, way.

memblock_find_in_range() and memblock_phys_alloc_nid() retry with different
MEMBLOCK_MIRROR, but memblock_phys_alloc_try_nid() does that only when
allocating from the specified node and does not retry when it falls back to
any node. And memblock_alloc_internal() has yet another set of fallbacks. 

So what should be the necessary handling in the wrapper for
__memblock_find_range_bottom_up() ?

BTW, even without any memblock modifications, retrying allocation in
reserve_crashkernel() for different ranges, like the proposal at [1], would
also work, wouldn't it?

[1] http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
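
A minimal sketch of that retry idea (hypothetical helper, reusing the x86
constants that appear in arch/x86/kernel/setup.c; not the proposal itself):

	static phys_addr_t __init find_crash_base(unsigned long long crash_size)
	{
		/* widen the search window step by step: below 896M for old
		 * kexec-tools, then below 4G, then anywhere in RAM */
		const phys_addr_t limits[] = {
			CRASH_ADDR_LOW_MAX,
			SZ_4G,
			memblock_end_of_DRAM(),
		};
		phys_addr_t crash_base;
		int i;

		for (i = 0; i < ARRAY_SIZE(limits); i++) {
			crash_base = memblock_find_in_range(CRASH_ALIGN,
							    limits[i],
							    crash_size,
							    CRASH_ALIGN);
			if (crash_base)
				return crash_base;
		}

		return 0;
	}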
 
> Thanks
> Baoquan
> 
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 7df468c..d1b30b9 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -274,24 +274,14 @@ phys_addr_t __init_memblock 
> > memblock_find_in_range_node(phys_addr_t size,
> >  * try bottom-up allocation only when bottom-up mode
> >  * is set and @end is above the kernel image.
> >  */
> > -   if (memblock_bottom_up() && end > kernel_end) {
> > -   phys_addr_t bottom_up_start;
> > -
> > -   /* make sure we will allocate above the kernel */
> > -   bottom_up_start = max(start, kernel_end);
> > -
> > +   if (memblock_bottom_up()) {
> > /* ok, try bottom-up allocation first */
> > -   ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> > +   ret = __memblock_find_range_bottom_up(start, end,
> >   size, align, nid, flags);
> > if (ret)
> > return ret;
> >  
> > /*
> > -* we always limit bottom-up allocation above the kernel,
> > -* but top-down allocation doesn't have the limit, so
> > -* retrying top-down allocation may succeed when bottom-up
> > -* allocation failed.
> > -*
> >  * bottom-up allocation is expected to be fail very rarely,
> >  * so we use WARN_ONCE() here to see the stack trace if
> >  * fail happens.
> > 
> > [1] 
> > https://lore.kernel.org/lkml/1545966002-3075-3-git-send-email-kernelf...@gmail.com/
> > [2] 
> > https://lore.kernel.org/lkml/1545966002-3075-2-git-send-email-kernelf...@gmail.com/
> > 
> > > +
> > > + return ret;
> > > +}
> > > +
> > >  /**
> > >   * __memblock_find_range_top_down - find free area utility, in top-down
> > >   * @start: start of candidate range
> > > -- 
> > > 2.7.4
> > > 
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 
> 

-- 
Sincerely yours,
Mike.




Re: [PATCHv5] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2019-01-08 Thread Mike Rapoport
On Mon, Jan 07, 2019 at 04:04:59PM +0800, Pingfan Liu wrote:
> A customer reported a bug on a high-end server with many PCIe devices, where
> the kernel boots with crashkernel=384M and KASLR enabled. Even
> though we still see much memory under 896 MB, the search still failed
> intermittently, because currently we can only find a region under 896 MB
> if ',high' is not specified. KASLR breaks the area under 896 MB into several
> parts randomly, and the crashkernel reservation needs to be aligned to
> 128 MB; that's why the failure occurs. It confuses the end user that
> crashkernel=X sometimes works and sometimes fails.
> To make it succeed, the customer can change the kernel option to
> "crashkernel=384M,high". But this gives "crashkernel=xx@yy" very
> limited room to behave even though its grammar looks more generic.
> And we can't answer these customer questions confidently:
> 1) why it doesn't succeed in reserving 896 MB;
> 2) what's wrong with the memory region under 4G;
> 3) why ',high' has to be added when only 384 MB is required, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for the crashkernel. Bottom-up may be
> more compatible with the old reservation style, i.e. still trying to get a
> memory region below 896 MB first, then in [896 MB, 4G], and finally above 4G.
> 
> There is one trivial point about compatibility with old kexec-tools:
> if the reserved region is above 896M, the old tool will fail to load
> bzImage. But without this patch, the old tool also fails, since there is no
> memory below 896M that can be reserved for the crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu 
> Cc: Tang Chen 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Jonathan Corbet 
> Cc: Yaowei Bai 
> Cc: Pavel Tatashin 
> Cc: Nicholas Piggin 
> Cc: Naoya Horiguchi 
> Cc: Daniel Vacek 
> Cc: Mathieu Malaterre 
> Cc: Stefan Agner 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: linux-ker...@vger.kernel.org
> ---
> v4 -> v5:
>   add a wrapper of bottom up allocation func
> v3 -> v4:
>   instead of exporting the stage of parsing mem hotplug info, just using the 
> bottom-up allocation func directly
>  arch/x86/kernel/setup.c  |  8 
>  include/linux/memblock.h |  3 +++
>  mm/memblock.c| 29 +
>  3 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..80e7923 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -546,10 +546,10 @@ static void __init reserve_crashkernel(void)
>* as old kexec-tools loads bzImage below that, unless
>* "crashkernel=size[KMG],high" is specified.
>*/
> - crash_base = memblock_find_in_range(CRASH_ALIGN,
> - high ? CRASH_ADDR_HIGH_MAX
> -  : CRASH_ADDR_LOW_MAX,
> - crash_size, CRASH_ALIGN);
> + crash_base = memblock_find_range_bottom_up(CRASH_ALIGN,
> + (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN,
> + NUMA_NO_NODE);
> +
>   if (!crash_base) {
>   pr_info("crashkernel reservation failed - No suitable 
> area found.\n");
>   return;
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a..a35ae17 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -116,6 +116,9 @@ phys_addr_t memblock_find_in_range_node(phys_addr_t size, 
> phys_addr_t align,
>   int nid, enum memblock_flags flags);
>  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
>  phys_addr_t size, phys_addr_t align);
> +phys_addr_t __init_memblock
> +memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
> + phys_addr_t size, phys_addr_t align, int nid);
>  void memblock_allow_resize(void);
>  int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
>  int memblock_add(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 81ae63c..f68287e 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -192,6 +192,35 @@ __memblock_find_range_bottom_up(phys_addr_t start, 
> phys_addr_t end,
>   ret
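
The body of the new wrapper is cut off in the archive preview. A rough
sketch of what such a wrapper might look like, modeled on the
MEMBLOCK_MIRROR retry done by memblock_find_in_range(); this is an
assumption based on the discussion, not the actual v5 code:

	phys_addr_t __init_memblock
	memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
				      phys_addr_t size, phys_addr_t align,
				      int nid)
	{
		enum memblock_flags flags = choose_memblock_flags();
		phys_addr_t ret;

	again:
		ret = __memblock_find_range_bottom_up(start, end, size, align,
						      nid, flags);
		if (!ret && (flags & MEMBLOCK_MIRROR)) {
			/* retry without the mirroring restriction,
			 * as the other finders do */
			flags &= ~MEMBLOCK_MIRROR;
			goto again;
		}

		return ret;
	}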

Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2019-01-05 Thread Mike Rapoport
On Sat, Jan 05, 2019 at 11:44:50AM +0800, Baoquan He wrote:
> On 01/04/19 at 05:09pm, Mike Rapoport wrote:
> > On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > > > I agree that currently the bottom-up allocation after the kernel text 
> > > > has
> > > > issues with KASLR. But these issues are not necessarily related to the
> > > > memory hotplug. Even with a single memory node, a bottom-up allocation
> > > > will
> > > > fail if KASLR puts the kernel near the end of node0.
> > > > 
> > > > What I am trying to understand is whether there is a fundamental reason 
> > > > to
> > > > prevent allocations from [0, kernel_start)?
> > > > 
> > > > Maybe Tejun can recall why he suggested to start bottom-up allocations 
> > > > from
> > > > kernel_end.
> > > 
> > > That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> > > allocation mode").  I wasn't involved in that patch, so no idea why
> > > the restrictions were added, but FWIW it doesn't seem necessary to me.
> > 
> > I should have added the reference [1] at the first place :)
> > Thanks!
> > 
> > [1] https://lore.kernel.org/lkml/20130904192215.gg26...@mtj.dyndns.org/
> 
> With my understanding, we may not be able to discard the bottom-up
> method in the current kernel. It's related to the hotplug feature when
> the 'movable_node' kernel parameter is specified. With 'movable_node',
> the system relies on reading hotplug information from firmware; on x86
> it's the ACPI SRAT table. In the current system, we allocate memblock
> regions top-down by default. However, several memblock allocations
> happen before that hotplug information is retrieved, and top-down
> memblock allocation must break the hotplug feature, since it will place
> kernel data in a movable zone, which is usually the last node on a
> bare-metal system.

I do not suggest to discard the bottom-up method, I merely suggest to allow
it to use [0, kernel_start).
 
> This bottom-up way is taken on many arches, and it works well as long as
> KASLR is not enabled. Below is the search result in the current Linux
> kernel; we can see that all arches have this mechanism, except
> arm/arm64. But currently only arm64/mips/x86 have KASLR.
> 
> W/o KASLR, allocating memblock regions above the kernel end while hotplug
> info is not yet parsed looks very reasonable, since the kernel is usually
> put at a low address, e.g. 16M on x86. My thought is that we need to keep
> memblock allocations near the kernel until hotplug info is parsed. That is,
> for systems w/o KASLR we keep the current bottom-up way; for systems with
> KASLR we should allocate memblock regions top-down just below the kernel
> start.

I completely agree. I was thinking about making
memblock_find_in_range_node() do something like:

	if (memblock_bottom_up()) {
		bottom_up_start = max(start, kernel_end);

		ret = __memblock_find_range_bottom_up(bottom_up_start, end,
						      size, align, nid, flags);
		if (ret)
			return ret;

		bottom_up_start = max(start, 0);
		end = kernel_start;

		ret = __memblock_find_range_top_down(bottom_up_start, end,
						     size, align, nid, flags);
		if (ret)
			return ret;
	}

 
> This issue must break hotplug; it is only covered up because bare-metal
> systems currently need to add 'nokaslr' to disable KASLR while another
> bug fix, linked below, is still under discussion.
> 
>  [PATCH v14 0/5] x86/boot/KASLR: Parse ACPI table and limit KASLR to choosing 
> immovable memory
> lkml.kernel.org/r/20181214093013.13370-1-fanc.f...@cn.fujitsu.com
> 
> [~ ]$ git grep memblock_set_bottom_up
> arch/alpha/kernel/setup.c:  memblock_set_bottom_up(true);
> arch/m68k/mm/motorola.c:memblock_set_bottom_up(true);
> arch/mips/kernel/setup.c:   memblock_set_bottom_up(true);
> arch/mips/kernel/traps.c:   memblock_set_bottom_up(false);
> arch/nds32/kernel/setup.c:  memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c: memblock_set_bottom_up(true);
> arch/powerpc/kernel/paca.c: memblock_set_bottom_up(false);
> arch/s390/kernel/setup.c:   memblock_set_bottom_up(true);
> arch/s390/kernel/setup.c:   memblock_set_bottom_up(false);
> arch/sparc/mm/init_32.c:memblock_set_bottom_up(true);
> arch/x86/kernel/setup.c:memblock_set_bottom_up(true);
> arch/x86/mm/numa.c: memblock_set_bottom_up(false);
> include/linux/memblock.h:static inline void __init 
> memblock_set_bottom_up(bool enable)
> 

-- 
Sincerely yours,
Mike.




Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2019-01-04 Thread Mike Rapoport
On Fri, Jan 04, 2019 at 01:59:46PM +0800, Pingfan Liu wrote:
> On Wed, Jan 2, 2019 at 6:18 PM Baoquan He  wrote:
> >
> > On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport  
> > > > wrote:
> > > > >
> > > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > > The bottom-up allocation style is introduced to cope with 
> > > > > > movable_node,
> > > > > > where the limit inferior of allocation starts from kernel's end, 
> > > > > > due to
> > > > > > lack of knowledge of memory hotplug info at this early time. But if
> > > > > > later hotplug info has been obtained, the limit inferior can be
> > > > > > extended to 0.
> > > > > > 'kexec -c' prefers to reuse this style to alloc mem at lower 
> > > > > > address,
> > > > > > since if the reserved region is beyond 4G, then it requires extra 
> > > > > > mem
> > > > > > (default is 16M) for swiotlb.
> > > > >
> > > > > I fail to understand why the availability of memory hotplug 
> > > > > information
> > > > > would allow to extend the lower limit of bottom-up memblock 
> > > > > allocations
> > > > > below the kernel. The memory in the physical range [0, kernel_start) 
> > > > > can be
> > > > > allocated as soon as the kernel memory is reserved.
> > > > >
> > > > Yes, [0, kernel_start) can be allocated at this time by some funcs,
> > > > e.g. memblock_reserve(). But there is a trick: for funcs like
> > > > memblock_find_in_range(), there is hotplug attribute checking; it will
> > > > check the hot-movable attribute in __next_mem_range()
> > > > {
> > > > if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > > continue
> > > > }.  So the movable memory can be safely skipped.
> > >
> > > I still don't see the connection between allocating memory below
> > > kernel_start and the hotplug info.
> > >
> > > The check for 'end > kernel_end' in
> > >
> > >   if (memblock_bottom_up() && end > kernel_end)
> > >
> > > does not protect against allocation in a hotpluggable area.
> > > If memblock_find_in_range() is called before hotplug info is parsed it can
> > > return a range in a hotpluggable area.
> > >
> > > The point I'd like to clarify is why allocating memory in the range [0,
> > > kernel_start) cannot be done before hotplug info is available and why it 
> > > is
> > > safe to allocate that memory afterwards?
> >
> > Well, I think that's because we have KASLR. Before KASLR was introduced,
> > the kernel was put at a low and fixed physical address. Allocating memblock
> > bottom-up after the kernel makes sure that kernel data lands in the same node
> > as the kernel text itself before SRAT is parsed. While [0, kernel_start) is a
> > very small range, e.g. on x86 it was 16 MB, which is very possibly used
> > up.
> >
> > But now, with KASLR enabled by default, this bottom-up-after-kernel-text
> > allocation has a potential issue. E.g. if we have node0 (including a normal
> > zone) and node1 (including a movable zone), and KASLR puts the kernel at the
> > top of node0, the next memblock allocation before SRAT is parsed will stomp
> > into the movable zone of node1, and consequently hotplug no longer works
> > well. I had considered this issue previously, but haven't thought of a way
> > to fix it.
> >
> > While that's not related to this patch: about this patchset, I didn't
> > check it carefully in the v2 post and acked it. In fact the current way is
> > not good; Pingfan should call __memblock_find_range_bottom_up() directly
> > for the crashkernel reservation. The reasons are:
> 
> Good suggestion, thanks. I will send out V4.

I think we can simply remove the restriction of allocating above the kernel
in the memblock_find_in_range_node().
 
> Regards,
> Pingfan
> > 1) SRAT parsing is done, and the system restores the top-down way of
> > memblock allocation.
> > 2) we do need to find the range bottom-up if the user specifies
> > crashkernel=xxM (without an explicit base address).
> >
> > Thanks
> > Baoquan
> >
> > >
> > > > Thanks for your kindly review.
> > > >
> > > > Regards,
> > > > Pingfan

Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2019-01-04 Thread Mike Rapoport
On Thu, Jan 03, 2019 at 10:47:06AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 02, 2019 at 07:05:38PM +0200, Mike Rapoport wrote:
> > I agree that currently the bottom-up allocation after the kernel text has
> > issues with KASLR. But these issues are not necessarily related to the
> > memory hotplug. Even with a single memory node, a bottom-up allocation will
> > fail if KASLR puts the kernel near the end of node0.
> > 
> > What I am trying to understand is whether there is a fundamental reason to
> > prevent allocations from [0, kernel_start)?
> > 
> > Maybe Tejun can recall why he suggested to start bottom-up allocations from
> > kernel_end.
> 
> That's from 79442ed189ac ("mm/memblock.c: introduce bottom-up
> allocation mode").  I wasn't involved in that patch, so no idea why
> the restrictions were added, but FWIW it doesn't seem necessary to me.

I should have added the reference [1] at the first place :)
Thanks!

[1] https://lore.kernel.org/lkml/20130904192215.gg26...@mtj.dyndns.org/
 
> Thanks.
> 
> -- 
> tejun
> 

-- 
Sincerely yours,
Mike.




Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2019-01-02 Thread Mike Rapoport
(added Tejun)

On Wed, Jan 02, 2019 at 06:18:04PM +0800, Baoquan He wrote:
> On 01/02/19 at 11:27am, Mike Rapoport wrote:
> > On Wed, Jan 02, 2019 at 02:47:34PM +0800, Pingfan Liu wrote:
> > > On Mon, Dec 31, 2018 at 4:40 PM Mike Rapoport  wrote:
> > > >
> > > > On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> > > > > The bottom-up allocation style is introduced to cope with 
> > > > > movable_node,
> > > > > where the limit inferior of allocation starts from kernel's end, due 
> > > > > to
> > > > > lack of knowledge of memory hotplug info at this early time. But if
> > > > > later hotplug info has been obtained, the limit inferior can be
> > > > > extended to 0.
> > > > > 'kexec -c' prefers to reuse this style to alloc mem at lower address,
> > > > > since if the reserved region is beyond 4G, then it requires extra mem
> > > > > (default is 16M) for swiotlb.
> > > >
> > > > I fail to understand why the availability of memory hotplug information
> > > > would allow to extend the lower limit of bottom-up memblock allocations
> > > > below the kernel. The memory in the physical range [0, kernel_start) 
> > > > can be
> > > > allocated as soon as the kernel memory is reserved.
> > > >
> > > Yes, [0, kernel_start) can be allocated at this time by some funcs,
> > > e.g. memblock_reserve(). But there is a trick: for funcs like
> > > memblock_find_in_range(), there is hotplug attribute checking; it will
> > > check the hot-movable attribute in __next_mem_range()
> > > {
> > > if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
> > > continue
> > > }.  So the movable memory can be safely skipped.
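
[Editor's note] For context, a rough sketch of why that check only helps once
hotplug information is known (paraphrased from the 2018-era mm/memblock.c;
hotplug_base/hotplug_size are placeholders for SRAT-provided ranges, and the
ordering shown is exactly the point of contention below):

	struct memblock_region *m;

	/* SRAT parsing is what sets MEMBLOCK_HOTPLUG on regions; before it
	 * runs, no region carries the flag and the filter below is a no-op. */
	memblock_mark_hotplug(hotplug_base, hotplug_size);

	/* later, while __next_mem_range() iterates candidates for an
	 * allocation, each region is filtered: */
	for_each_memblock(memory, m) {
		/* skip hotpluggable memory when movable_node is in effect */
		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
			continue;
		/* ...otherwise this region may satisfy the request... */
	}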
> > 
> > I still don't see the connection between allocating memory below
> > kernel_start and the hotplug info.
> > 
> > The check for 'end > kernel_end' in
> > 
> > if (memblock_bottom_up() && end > kernel_end)
> > 
> > does not protect against allocation in a hotpluggable area.
> > If memblock_find_in_range() is called before hotplug info is parsed, it can
> > return a range in a hotpluggable area.
> > 
> > The point I'd like to clarify is why allocating memory in the range [0,
> > kernel_start) cannot be done before hotplug info is available and why it is
> > safe to allocate that memory afterwards?
> 
> Well, I think that's because we have KASLR. Before KASLR was introduced,
> the kernel was put at a low and fixed physical address. Allocating memblock
> bottom-up after the kernel makes sure that kernel data lands in the same
> node as the kernel text itself before SRAT is parsed. Meanwhile,
> [0, kernel_start) is a very small range, e.g. on x86 it was 16 MB, which
> could easily be used up.
> 
> But now, with KASLR enabled by default, this bottom-up-after-kernel-text
> allocation has a potential issue. E.g. with node0 (including a normal zone)
> and node1 (including a movable zone), if KASLR puts the kernel at the top
> of node0, the next memblock allocation before SRAT is parsed will stomp
> into the movable zone of node1, and consequently hotplug no longer works
> well. I had considered this issue previously, but haven't thought of a way
> to fix it.
 
I agree that currently the bottom-up allocation after the kernel text has
issues with KASLR. But these issues are not necessarily related to
memory hotplug. Even with a single memory node, a bottom-up allocation will
fail if KASLR puts the kernel near the end of node0.

What I am trying to understand is whether there is a fundamental reason to
prevent allocations from [0, kernel_start)?

Maybe Tejun can recall why he suggested to start bottom-up allocations from
kernel_end.

> Though it's not related to this patch: about this patchset, I didn't
> check it carefully in the v2 post, and acked it. In fact the current way is
> not good; Pingfan should call __memblock_find_range_bottom_up() directly
> for crashkernel reservation. Reasons are:
> 1) Once SRAT parsing is done, the system goes back to the top-down way of
> doing memblock allocation.
> 2) We do need to find the range bottom-up if the user specifies
> crashkernel=xxM (without an explicit base address).
> 
> Thanks
> Baoquan
> 
> > 
> > > Thanks for your kind review.
> > > 
> > > Regards,
> > > Pingfan
> > > 
> > > > The extents of the memory node hosting the kernel image can be used to
> > > > limit memblock allocations from that particular node, even in
> > > > top-down mode.
> > > >
> > > > > Signed-off-by: Pingfa

Re: [PATCHv3 2/2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2018-12-31 Thread Mike Rapoport
On Fri, Dec 28, 2018 at 11:00:02AM +0800, Pingfan Liu wrote:
> A customer reported a bug on a high-end server with many PCIe devices, where
> the kernel boots with crashkernel=384M and KASLR enabled. Even though we
> still see plenty of memory under 896 MB, the search still fails
> intermittently, because currently we can only find a region under 896 MB
> if ',high' is not specified. KASLR randomly breaks the area under 896 MB
> into several parts, and the crashkernel reservation needs to be aligned to
> 128 MB; that's why the failure occurs. It confuses the end user that
> crashkernel=X sometimes works while it sometimes fails.
> To make it succeed, the customer can change the kernel option to
> "crashkernel=384M,high". But this gives "crashkernel=xx@yy" very limited
> room to behave, even though its grammar looks more generic.
> And we can't answer these questions from the customer confidently:
> 1) why it doesn't succeed in reserving memory under 896 MB;
> 2) what's wrong with the memory region under 4G;
> 3) why I have to add ',high' when I only require 384 MB, not 3840 MB.
> 
> This patch simplifies the method suggested in the mail [1]. It just goes
> bottom-up to find a candidate region for the crashkernel. Bottom-up may be
> better compatible with the old reservation style, i.e. still trying to get
> a memory region under 896 MB first, then in [896 MB, 4G], and finally
> above 4G.
> 
> There is one trivial caveat about compatibility with old kexec-tools:
> if the reserved region is above 896M, the old tool will fail to load
> bzImage. But without this patch the old tool also fails, since there is no
> memory below 896M that can be reserved for the crashkernel.
> 
> [1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
> Signed-off-by: Pingfan Liu 
> Cc: Tang Chen 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Jonathan Corbet 
> Cc: Yaowei Bai 
> Cc: Pavel Tatashin 
> Cc: Nicholas Piggin 
> Cc: Naoya Horiguchi 
> Cc: Daniel Vacek 
> Cc: Mathieu Malaterre 
> Cc: Stefan Agner 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: linux-ker...@vger.kernel.org
> ---
>  arch/x86/kernel/setup.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d494b9b..165f9c3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
> 
>   /* 0 means: find the address automatically */
>   if (crash_base <= 0) {
> + bool bottom_up = memblock_bottom_up();
> +
> + memblock_set_bottom_up(true);
>
>   /*
>* Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
>* as old kexec-tools loads bzImage below that, unless
>* "crashkernel=size[KMG],high" is specified.
>*/
>   crash_base = memblock_find_in_range(CRASH_ALIGN,
> - high ? CRASH_ADDR_HIGH_MAX
> -  : CRASH_ADDR_LOW_MAX,
> - crash_size, CRASH_ALIGN);
> + (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
> + memblock_set_bottom_up(bottom_up);

Using bottom-up does not guarantee that the allocation won't fall into a
removable memory, it only makes it highly probable.

I think that the 'max_pfn * PAGE_SIZE' limit should be replaced with the
end of the non-removable memory node.
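
[Editor's note] A minimal sketch of that suggestion, assuming memblock's
region flags are already populated at this point in boot; the helper name and
its use are illustrative, not part of the patch, and the 2018-era
for_each_memblock() iterator is assumed:

static phys_addr_t __init nonmovable_memory_end(void)
{
	struct memblock_region *r;
	phys_addr_t end = 0;

	for_each_memblock(memory, r) {
		/* ignore memory that may be hot-removed later */
		if (memblock_is_hotpluggable(r))
			continue;
		end = max(end, r->base + r->size);
	}
	return end;
}

With something like this, reserve_crashkernel() could cap the bottom-up
search at min(nonmovable_memory_end(), max_pfn * PAGE_SIZE) rather than at
max_pfn * PAGE_SIZE alone.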

> +
>   if (!crash_base) {
>   pr_info("crashkernel reservation failed - No suitable 
> area found.\n");
>   return;
> -- 
> 2.7.4
> 

-- 
Sincerely yours,
Mike.




Re: [PATCHv3 1/2] mm/memblock: extend the limit inferior of bottom-up after parsing hotplug attr

2018-12-31 Thread Mike Rapoport
On Fri, Dec 28, 2018 at 11:00:01AM +0800, Pingfan Liu wrote:
> The bottom-up allocation style is introduced to cope with movable_node,
> where the limit inferior of allocation starts from the kernel's end, due to
> lack of knowledge of memory hotplug info at this early time. But if, later,
> hotplug info has been obtained, the limit inferior can be extended to 0.
> 'kexec -c' prefers to reuse this style to allocate memory at a lower
> address, since if the reserved region is beyond 4G, then it requires extra
> memory (default is 16M) for swiotlb.

I fail to understand why the availability of memory hotplug information
would allow extending the lower limit of bottom-up memblock allocations
below the kernel. The memory in the physical range [0, kernel_start) can be
allocated as soon as the kernel memory is reserved.

The extents of the memory node hosting the kernel image can be used to
limit memblock allocations from that particular node, even in top-down mode.
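
[Editor's note] A hypothetical sketch of that alternative, using the
memblock_find_in_range_node() signature visible in the hunks below;
kernel_nid and the [node_start, node_end) extents are assumed to come from
early SRAT parsing, and all names are illustrative:

/* keep top-down allocation but confine the search to the node that
 * hosts the kernel image, so early allocations stay node-local */
static phys_addr_t __init find_range_near_kernel(phys_addr_t size,
						 phys_addr_t align)
{
	return memblock_find_in_range_node(size, align, node_start,
					   node_end, kernel_nid,
					   MEMBLOCK_NONE);
}

Passing the node id keeps the search inside that node's registered regions
even if the extents are rough, which is what would make this workable in
top-down mode.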
 
> Signed-off-by: Pingfan Liu 
> Cc: Tang Chen 
> Cc: "Rafael J. Wysocki" 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Jonathan Corbet 
> Cc: Yaowei Bai 
> Cc: Pavel Tatashin 
> Cc: Nicholas Piggin 
> Cc: Naoya Horiguchi 
> Cc: Daniel Vacek 
> Cc: Mathieu Malaterre 
> Cc: Stefan Agner 
> Cc: Dave Young 
> Cc: Baoquan He 
> Cc: ying...@kernel.org,
> Cc: vgo...@redhat.com
> Cc: linux-ker...@vger.kernel.org
> ---
>  drivers/acpi/numa.c  |  4 
>  include/linux/memblock.h |  1 +
>  mm/memblock.c| 58 +---
>  3 files changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 2746994..3eea4e4 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -462,6 +462,10 @@ int __init acpi_numa_init(void)
> 
>   cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
>   acpi_parse_memory_affinity, 0);
> +
> +#if defined(CONFIG_X86) || defined(CONFIG_ARM64)
> + mark_mem_hotplug_parsed();
> +#endif
>   }
> 
>   /* SLIT: System Locality Information Table */
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a..d89ed9e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -125,6 +125,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
>  bool memblock_overlaps_region(struct memblock_type *type,
> phys_addr_t base, phys_addr_t size);
> +void mark_mem_hotplug_parsed(void);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 81ae63c..a3f5e46 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -231,6 +231,12 @@ __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
>   return 0;
>  }
> 
> +static bool mem_hotmovable_parsed __initdata_memblock;
> +void __init_memblock mark_mem_hotplug_parsed(void)
> +{
> + mem_hotmovable_parsed = true;
> +}
> +
>  /**
>   * memblock_find_in_range_node - find free area in given range and node
>   * @size: size of free area to find
> @@ -259,7 +265,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>   phys_addr_t end, int nid,
>   enum memblock_flags flags)
>  {
> - phys_addr_t kernel_end, ret;
> + phys_addr_t kernel_end, ret = 0;
> 
>   /* pump up @end */
>   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
> @@ -270,34 +276,40 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
>   end = max(start, end);
>   kernel_end = __pa_symbol(_end);
> 
> - /*
> -  * try bottom-up allocation only when bottom-up mode
> -  * is set and @end is above the kernel image.
> -  */
> - if (memblock_bottom_up() && end > kernel_end) {
> - phys_addr_t bottom_up_start;
> + if (memblock_bottom_up()) {
> + phys_addr_t bottom_up_start = start;
> 
> - /* make sure we will allocate above the kernel */
> - bottom_up_start = max(start, kernel_end);
> -
> - /* ok, try bottom-up allocation first */
> - ret = __memblock_find_range_bottom_up(bottom_up_start, end,
> -   size, align, nid, flags);
> - if (ret)
> + if (mem_hotmovable_parse
