Re: [PATCH] KVM: arm64: Prevent kmemleak from accessing pKVM memory

2022-06-17 Thread Mike Rapoport
On Fri, Jun 17, 2022 at 09:21:31AM +0100, Marc Zyngier wrote:
> On Thu, 16 Jun 2022 16:11:34 +, Quentin Perret wrote:
> > Commit a7259df76702 ("memblock: make memblock_find_in_range method
> > private") changed the API using which memory is reserved for the pKVM
> > hypervisor. However, it seems that memblock_phys_alloc() differs
> > from the original API in terms of kmemleak semantics -- the old one
> > excluded the reserved regions from kmemleak scans when the new one
> > doesn't seem to. Unfortunately, when protected KVM is enabled, all
> > kernel accesses to pKVM-private memory result in a fatal exception,
> > which can now happen because of kmemleak scans:
> > 
> > [...]
> 
> Applied to fixes, thanks!
> 
> [1/1] KVM: arm64: Prevent kmemleak from accessing pKVM memory
>   commit: 9e5afa8a537f742bccc2cd91bc0bef4b6483ee98

I'd really like to update the changelog to this:

Commit a7259df76702 ("memblock: make memblock_find_in_range method
private") changed the API using which memory is reserved for the pKVM
hypervisor. However, memblock_phys_alloc() differs from the original API in
terms of kmemleak semantics -- the old one didn't report the reserved
regions to kmemleak while the new one does. Unfortunately, when protected
KVM is enabled, all kernel accesses to pKVM-private memory result in a
fatal exception, which can now happen because of kmemleak scans:

$ echo scan > /sys/kernel/debug/kmemleak
[   34.991354] kvm [304]: nVHE hyp BUG at: [] 
__kvm_nvhe_handle_host_mem_abort+0x270/0x290!
...

Fix this by explicitly excluding the hypervisor's memory pool from
kmemleak like we already do for the hyp BSS.
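
For reference, the applied fix amounts to one extra kmemleak_free_part() call
in finalize_hyp_mode() in arch/arm64/kvm/arm.c. A minimal sketch of the
resulting function, reconstructed from the hunk quoted later in this thread
(the early-return guard is paraphrased, not copied verbatim):

static int finalize_hyp_mode(void)
{
	if (!is_protected_kvm_enabled())
		return 0;

	/*
	 * Exclude HYP sections from kmemleak so that they don't get peeked
	 * at, which would end badly once inaccessible.
	 */
	kmemleak_free_part(__hyp_bss_start, __hyp_bss_end - __hyp_bss_start);
	kmemleak_free_part(__va(hyp_mem_base), hyp_mem_size);

	return pkvm_drop_host_privileges();
}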


> Cheers,
> 
>   M.
> -- 
> Marc Zyngier 
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH] KVM: arm64: Prevent kmemleak from accessing pKVM memory

2022-06-17 Thread Mike Rapoport
On Thu, Jun 16, 2022 at 04:11:34PM +, Quentin Perret wrote:
> Commit a7259df76702 ("memblock: make memblock_find_in_range method
> private") changed the API using which memory is reserved for the pKVM
> hypervisor. However, it seems that memblock_phys_alloc() differs
> from the original API in terms of kmemleak semantics -- the old one
> excluded the reserved regions from kmemleak scans when the new one
> doesn't seem to. Unfortunately, when protected KVM is enabled, all

I'd rather say that memblock_find_in_range() didn't inform kmemleak about
the reserved regions, while memblock_phys_alloc() does.

> kernel accesses to pKVM-private memory result in a fatal exception,
> which can now happen because of kmemleak scans:
> 
> $ echo scan > /sys/kernel/debug/kmemleak
> [   34.991354] kvm [304]: nVHE hyp BUG at: [] 
> __kvm_nvhe_handle_host_mem_abort+0x270/0x290!
> [   34.991580] kvm [304]: Hyp Offset: 0xfffe8be807e0
> [   34.991813] Kernel panic - not syncing: HYP panic:
> [   34.991813] PS:63c9 PC:f418011a3750 ESR:f2000800
> [   34.991813] FAR:00043920 HPFAR:04792000 
> PAR:
> [   34.991813] VCPU:
> [   34.993660] CPU: 0 PID: 304 Comm: bash Not tainted 5.19.0-rc2 #102
> [   34.994059] Hardware name: linux,dummy-virt (DT)
> [   34.994452] Call trace:
> [   34.994641]  dump_backtrace.part.0+0xcc/0xe0
> [   34.994932]  show_stack+0x18/0x6c
> [   34.995094]  dump_stack_lvl+0x68/0x84
> [   34.995276]  dump_stack+0x18/0x34
> [   34.995484]  panic+0x16c/0x354
> [   34.995673]  __hyp_pgtable_total_pages+0x0/0x60
> [   34.995933]  scan_block+0x74/0x12c
> [   34.996129]  scan_gray_list+0xd8/0x19c
> [   34.996332]  kmemleak_scan+0x2c8/0x580
> [   34.996535]  kmemleak_write+0x340/0x4a0
> [   34.996744]  full_proxy_write+0x60/0xbc
> [   34.996967]  vfs_write+0xc4/0x2b0
> [   34.997136]  ksys_write+0x68/0xf4
> [   34.997311]  __arm64_sys_write+0x20/0x2c
> [   34.997532]  invoke_syscall+0x48/0x114
> [   34.997779]  el0_svc_common.constprop.0+0x44/0xec
> [   34.998029]  do_el0_svc+0x2c/0xc0
> [   34.998205]  el0_svc+0x2c/0x84
> [   34.998421]  el0t_64_sync_handler+0xf4/0x100
> [   34.998653]  el0t_64_sync+0x18c/0x190
> [   34.999252] SMP: stopping secondary CPUs
> [   35.34] Kernel Offset: disabled
> [   35.000261] CPU features: 0x800,7831,1086
> [   35.000642] Memory Limit: none
> [   35.001329] ---[ end Kernel panic - not syncing: HYP panic:
> [   35.001329] PS:63c9 PC:f418011a3750 ESR:f2000800
> [   35.001329] FAR:00043920 HPFAR:04792000 
> PAR:
> [   35.001329] VCPU: ]---
> 
> Fix this by explicitly excluding the hypervisor's memory pool from
> kmemleak like we already do for the hyp BSS.
> 
> Cc: Mike Rapoport 
> Fixes: a7259df76702 ("memblock: make memblock_find_in_range method private")
> Signed-off-by: Quentin Perret 
> ---
> An alternative could be to actually exclude memory allocated using
> memblock_phys_alloc_range() from kmemleak scans to revert back to the
> old behaviour.

This would be wrong because memblock_phys_alloc() does allocate memory, and
that memory should be tracked by kmemleak unless there is a good reason to
exclude it.

> But nobody else has complained about this AFAIK, so I'd be inclined to
> keep this local to pKVM. No strong opinion.

Yes, please :)
An alternative to excluding this memory from kmemleak is to allocate it
using 

memblock_phys_alloc_range(size, align, 0, MEMBLOCK_ALLOC_NOLEAKTRACE)

then it won't be added to kmemleak in the first place.
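
Applied to kvm_hyp_reserve(), that would look roughly like the sketch below
(untested, and not what was merged -- the merged fix is the
kmemleak_free_part() one above; the calls mirror the allocation code in
arch/arm64/kvm/hyp/reserved_mem.c quoted further down in this archive):

	/* Hypothetical alternative: never register the pool with kmemleak */
	hyp_mem_base = memblock_phys_alloc_range(ALIGN(hyp_mem_size, PMD_SIZE),
						 PMD_SIZE, 0,
						 MEMBLOCK_ALLOC_NOLEAKTRACE);
	if (!hyp_mem_base)
		hyp_mem_base = memblock_phys_alloc_range(hyp_mem_size,
							 PAGE_SIZE, 0,
							 MEMBLOCK_ALLOC_NOLEAKTRACE);
	else
		hyp_mem_size = ALIGN(hyp_mem_size, PMD_SIZE);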

> ---
>  arch/arm64/kvm/arm.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 400bb0fe2745..28765bd22efb 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2110,11 +2110,11 @@ static int finalize_hyp_mode(void)
>   return 0;
>  
>   /*
> -  * Exclude HYP BSS from kmemleak so that it doesn't get peeked
> -  * at, which would end badly once the section is inaccessible.
> -  * None of other sections should ever be introspected.
> +  * Exclude HYP sections from kmemleak so that they don't get peeked
> +  * at, which would end badly once inaccessible.
>*/
>   kmemleak_free_part(__hyp_bss_start, __hyp_bss_end - __hyp_bss_start);
> + kmemleak_free_part(__va(hyp_mem_base), hyp_mem_size);
>   return pkvm_drop_host_privileges();
>  }
>  
> -- 
> 2.36.1.476.g0c4daa206d-goog
> 

-- 
Sincerely yours,
Mike.


[PATCH v5] memblock: make memblock_find_in_range method private

2021-08-16 Thread Mike Rapoport
From: Mike Rapoport 

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Signed-off-by: Mike Rapoport 
Acked-by: Kirill A. Shutemov 
Acked-by: Rafael J. Wysocki # ACPI
Acked-by: Russell King (Oracle) 
Acked-by: Nick Kossifidis # riscv
Reviewed-by: Catalin Marinas   # arm64
---
v5:
* restore the original behaviour on x86 with the addition of a more
  elaborate comment; I will address the issue in memory_map_top_down() in a
  separate series.

v4: https://lore.kernel.org/lkml/20210812065907.20046-1-r...@kernel.org
* Add patch that prevents the crashes reported by Guenter Roeck on x86/i386
  on QEMU with 256M or 512M of memory and EFI boot enabled.
* Add Acked-by and Reviewed-by, thanks everybody!

v3: https://lore.kernel.org/lkml/20210803064218.6611-1-r...@kernel.org
* simplify the check for exact crash kernel allocation on arm, per Rob
* make crash_max unsigned long long on arm64, per Rob

v2: https://lore.kernel.org/lkml/20210802063737.22733-1-r...@kernel.org
* don't change error message in arm::reserve_crashkernel(), per Russell

v1: https://lore.kernel.org/lkml/20210730104039.7047-1-r...@kernel.org


 arch/arm/kernel/setup.c   | 20 +-
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 23 ++--
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 81 insertions(+), 118 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index f97eb2371672..284a80c0b6e1 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
if (crash_max > lowmem_max)
crash_max = lowmem_max;
-   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
-   crash_size, CRASH_ALIGN);
+
+   crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
+  CRASH_ALIGN, crash_max);
if (!crash_base) {
pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
} else {
+   unsigned long long crash_max = crash_base + crash_size;
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, SECTION_SIZE);
-   if (start != crash_base) {
+   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
+ crash_base, crash_max);
+   if (!start) {
pr_err("crashkernel reservation failed - memory is in 
use.\n");
return;
}
}
 
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret < 0) {
-   pr_warn("crashkernel reservation failed - memory is in use 
(0x%lx)\n",
-   (unsigned long)crash_base);
-   return;
-   }
-
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c 
b/arch/arm64/kvm/hyp/reserved_mem.c
index d654921dd09b..578670e3f608 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -92,12 +92,10 @@ void __init kvm_hyp_reserve(void)
 * this is

[PATCH v4 2/2] memblock: make memblock_find_in_range method private

2021-08-12 Thread Mike Rapoport
From: Mike Rapoport 

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Signed-off-by: Mike Rapoport 
Acked-by: Kirill A. Shutemov 
Acked-by: Rafael J. Wysocki 
Acked-by: Russell King (Oracle) 
Acked-by: Nick Kossifidis 
Reviewed-by: Catalin Marinas 
---
 arch/arm/kernel/setup.c   | 20 +-
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 11 
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 71 insertions(+), 116 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index f97eb2371672..284a80c0b6e1 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
if (crash_max > lowmem_max)
crash_max = lowmem_max;
-   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
-   crash_size, CRASH_ALIGN);
+
+   crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
+  CRASH_ALIGN, crash_max);
if (!crash_base) {
pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
} else {
+   unsigned long long crash_max = crash_base + crash_size;
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, SECTION_SIZE);
-   if (start != crash_base) {
+   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
+ crash_base, crash_max);
+   if (!start) {
pr_err("crashkernel reservation failed - memory is in 
use.\n");
return;
}
}
 
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret < 0) {
-   pr_warn("crashkernel reservation failed - memory is in use 
(0x%lx)\n",
-   (unsigned long)crash_base);
-   return;
-   }
-
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c 
b/arch/arm64/kvm/hyp/reserved_mem.c
index d654921dd09b..578670e3f608 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -92,12 +92,10 @@ void __init kvm_hyp_reserve(void)
 * this is unmapped from the host stage-2, and fallback to PAGE_SIZE.
 */
hyp_mem_size = hyp_mem_pages << PAGE_SHIFT;
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- ALIGN(hyp_mem_size, PMD_SIZE),
- PMD_SIZE);
+   hyp_mem_base = memblock_phys_alloc(ALIGN(hyp_mem_size, PMD_SIZE),
+  PMD_SIZE);
if (!hyp_mem_base)
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- hyp_mem_size, PAGE_SIZE);
+   hyp_mem_base = memblock_phys_alloc(hyp_mem_size, PAGE_SIZE);
else
hyp_mem_size = ALIGN(hyp_mem_size, PMD_SIZE);
 
@@ -105,7 +103,6 @@ void __init kvm_hyp_reserve(void)
k

[PATCH v4 1/2] x86/mm: memory_map_top_down: remove spurious reservation of upper 2M

2021-08-12 Thread Mike Rapoport
From: Mike Rapoport 

The memory_map_top_down() function skips the upper 2M at the beginning and
maps them at the end because

"xen has big range in reserved near end of ram, skip it at first"

It appears, though, that the root cause was that there was not enough
memory in the range [min_pfn_mapped, max_pfn_mapped] to allocate page
tables from that range in alloc_low_pages() because min_pfn_mapped didn't
reflect the actual minimal pfn that was already mapped but remained close
to the end of the range being mapped by memory_map_top_down().

This happened because min_pfn_mapped is updated at every iteration of the
loop in memory_map_top_down(), but there is another loop in
init_range_memory_mapping() that maps several regions below the current
min_pfn_mapped without updating this variable.

Move the update of min_pfn_mapped to add_pfn_range_mapped() next to the
update of max_pfn_mapped so that every time a new range is mapped both
limits will be updated accordingly, and remove the spurious "reservation"
of upper 2M.

Signed-off-by: Mike Rapoport 
---
 arch/x86/mm/init.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 75ef19aa8903..87150961fdca 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -486,6 +486,7 @@ static void add_pfn_range_mapped(unsigned long start_pfn, 
unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_MAX_ENTRIES);
 
max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+   min_pfn_mapped = min(min_pfn_mapped, start_pfn);
 
if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
max_low_pfn_mapped = max(max_low_pfn_mapped,
@@ -605,20 +606,14 @@ static unsigned long __init get_new_step_size(unsigned 
long step_size)
 static void __init memory_map_top_down(unsigned long map_start,
   unsigned long map_end)
 {
-   unsigned long real_end, last_start;
-   unsigned long step_size;
-   unsigned long addr;
+   unsigned long real_end = ALIGN_DOWN(map_end, PMD_SIZE);
+   unsigned long last_start = real_end;
+   /* step_size need to be small so pgt_buf from BRK could cover it */
+   unsigned long step_size = PMD_SIZE;
unsigned long mapped_ram_size = 0;
 
-   /* xen has big range in reserved near end of ram, skip it at first.*/
-   addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
-   real_end = addr + PMD_SIZE;
-
-   /* step_size need to be small so pgt_buf from BRK could cover it */
-   step_size = PMD_SIZE;
max_pfn_mapped = 0; /* will get exact value next */
min_pfn_mapped = real_end >> PAGE_SHIFT;
-   last_start = real_end;
 
/*
 * We start from the top (end of memory) and go to the bottom.
@@ -638,7 +633,6 @@ static void __init memory_map_top_down(unsigned long 
map_start,
mapped_ram_size += init_range_memory_mapping(start,
last_start);
last_start = start;
-   min_pfn_mapped = last_start >> PAGE_SHIFT;
if (mapped_ram_size >= step_size)
step_size = get_new_step_size(step_size);
}
-- 
2.28.0



[PATCH v4 0/2] memblock: make memblock_find_in_range method private

2021-08-12 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

This is v4 of "memblock: make memblock_find_in_range method private" patch
that essentially replaces memblock_find_in_range() + memblock_reserve()
calls with equivalent calls to memblock_phys_alloc() and prevents usage of
memblock_find_in_range() outside memblock itself.

The patch uncovered an issue with top down memory mapping on x86 and this
version has a preparation patch that addresses this issue.

Guenter, I didn't add your Tested-by because the patch that addresses the
crashes differs from the one you've tested.

v4: 
* Add patch that prevents the crashes reported by Guenter Roeck on x86/i386
  on QEMU with 256M or 512M of memory and EFI boot enabled.
* Add Acked-by and Reviewed-by, thanks everybody!

v3: https://lore.kernel.org/lkml/20210803064218.6611-1-r...@kernel.org
* simplify the check for exact crash kernel allocation on arm, per Rob
* make crash_max unsigned long long on arm64, per Rob

v2: https://lore.kernel.org/lkml/20210802063737.22733-1-r...@kernel.org
* don't change error message in arm::reserve_crashkernel(), per Russell

v1: https://lore.kernel.org/lkml/20210730104039.7047-1-r...@kernel.org

Mike Rapoport (2):
  x86/mm: memory_map_top_down: remove spurious reservation of upper 2M
  memblock: make memblock_find_in_range method private

 arch/arm/kernel/setup.c   | 20 +-
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 27 +++
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 76 insertions(+), 127 deletions(-)


base-commit: ff1176468d368232b684f75e82563369208bc371
-- 
2.28.0



Re: [PATCH v3] memblock: make memblock_find_in_range method private

2021-08-11 Thread Mike Rapoport
On Tue, Aug 10, 2021 at 12:21:46PM -0700, Guenter Roeck wrote:
> On 8/10/21 11:55 AM, Mike Rapoport wrote:
> > On Mon, Aug 09, 2021 at 12:06:41PM -0700, Guenter Roeck wrote:
> > > On Tue, Aug 03, 2021 at 09:42:18AM +0300, Mike Rapoport wrote:
> > > > From: Mike Rapoport 
> > > > 
> > > > There are a lot of uses of memblock_find_in_range() along with
> > > > memblock_reserve() from the times memblock allocation APIs did not 
> > > > exist.
> > > > 
> > > > memblock_find_in_range() is the very core of memblock allocations, so 
> > > > any
> > > > future changes to its internal behaviour would mandate updates of all 
> > > > the
> > > > users outside memblock.
> > > > 
> > > > Replace the calls to memblock_find_in_range() with an equivalent calls 
> > > > to
> > > > memblock_phys_alloc() and memblock_phys_alloc_range() and make
> > > > memblock_find_in_range() private method of memblock.
> > > > 
> > > > This simplifies the callers, ensures that (unlikely) errors in
> > > > memblock_reserve() are handled and improves maintainability of
> > > > memblock_find_in_range().
> > > > 
> > > > Signed-off-by: Mike Rapoport 
> > > 
> > > I see a number of crashes in next-20210806 when booting x86 images from 
> > > efi.
> > > 
> > > [0.00] efi: EFI v2.70 by EDK II
> > > [0.00] efi: SMBIOS=0x1fbcc000 ACPI=0x1fbfa000 ACPI 2.0=0x1fbfa014 
> > > MEMATTR=0x1f25f018
> > > [0.00] SMBIOS 2.8 present.
> > > [0.00] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> > > 02/06/2015
> > > [0.00] last_pfn = 0x1ff50 max_arch_pfn = 0x4
> > > [0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- 
> > > WT
> > > [0.00] Kernel panic - not syncing: alloc_low_pages: can not alloc 
> > > memory
> > > [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> > > 5.14.0-rc4-next-20210806 #1
> > > [0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > > 0.0.0 02/06/2015
> > > [0.00] Call Trace:
> > > [0.00]  ? dump_stack_lvl+0x57/0x7d
> > > [0.00]  ? panic+0xfc/0x2c6
> > > [0.00]  ? alloc_low_pages+0x117/0x156
> > > [0.00]  ? phys_pmd_init+0x234/0x342
> > > [0.00]  ? phys_pud_init+0x171/0x337
> > > [0.00]  ? __kernel_physical_mapping_init+0xec/0x276
> > > [0.00]  ? init_memory_mapping+0x1ea/0x2aa
> > > [0.00]  ? init_range_memory_mapping+0xdf/0x12e
> > > [0.00]  ? init_mem_mapping+0x1e9/0x26f
> > > [0.00]  ? setup_arch+0x5ff/0xb6d
> > > [0.00]  ? start_kernel+0x71/0x6b4
> > > [0.00]  ? secondary_startup_64_no_verify+0xc2/0xcb
> > > 
> > > Bisect points to this patch. Reverting it fixes the problem. Key seems to
> > > be the amount of memory configured in qemu; the problem is not seen if
> > > there is 1G or more of memory, but it is seen with all test boots with
> > > 512M or 256M of memory. It is also seen with almost all 32-bit efi boots.
> > > 
> > > The problem is not seen when booting without efi.
> > 
> > It looks like this change uncovered a problem in
> > x86::memory_map_top_down().
> > 
> > The allocation in alloc_low_pages() is limited by min_pfn_mapped and
> > max_pfn_mapped. The min_pfn_mapped is updated at every iteration of the
> > loop in memory_map_top_down, but there is another loop in
> > init_range_memory_mapping() that maps several regions below the current
> > min_pfn_mapped without updating this variable.
> > 
> > The memory layout in qemu with 256M of RAM and EFI enabled, causes
> > exhaustion of the memory limited by min_pfn_mapped and max_pfn_mapped
> > before min_pfn_mapped is updated.
> > 
> > Before this commit there was unconditional "reservation" of 2M in the end
> > of the memory that moved the initial min_pfn_mapped below the memory
> > reserved by EFI. The addition of check for xen_domain() removed this
> > reservation for !XEN and made alloc_low_pages() use the range already busy
> > with EFI data.
> > 
> > The patch below moves the update of min_pfn_mapped near the update of
> > max_pfn_mapped so that every time a new range is mapped both limits will be
> > updated accordingly.
> > 
> > diff --git 

Re: [PATCH v3] memblock: make memblock_find_in_range method private

2021-08-10 Thread Mike Rapoport
On Mon, Aug 09, 2021 at 12:06:41PM -0700, Guenter Roeck wrote:
> On Tue, Aug 03, 2021 at 09:42:18AM +0300, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > There are a lot of uses of memblock_find_in_range() along with
> > memblock_reserve() from the times memblock allocation APIs did not exist.
> > 
> > memblock_find_in_range() is the very core of memblock allocations, so any
> > future changes to its internal behaviour would mandate updates of all the
> > users outside memblock.
> > 
> > Replace the calls to memblock_find_in_range() with an equivalent calls to
> > memblock_phys_alloc() and memblock_phys_alloc_range() and make
> > memblock_find_in_range() private method of memblock.
> > 
> > This simplifies the callers, ensures that (unlikely) errors in
> > memblock_reserve() are handled and improves maintainability of
> > memblock_find_in_range().
> > 
> > Signed-off-by: Mike Rapoport 
> 
> I see a number of crashes in next-20210806 when booting x86 images from efi.
> 
> [0.00] efi: EFI v2.70 by EDK II
> [0.00] efi: SMBIOS=0x1fbcc000 ACPI=0x1fbfa000 ACPI 2.0=0x1fbfa014 
> MEMATTR=0x1f25f018
> [0.00] SMBIOS 2.8 present.
> [0.00] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [0.00] last_pfn = 0x1ff50 max_arch_pfn = 0x4
> [0.00] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> [0.00] Kernel panic - not syncing: alloc_low_pages: can not alloc 
> memory
> [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
> 5.14.0-rc4-next-20210806 #1
> [0.00] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [0.00] Call Trace:
> [0.00]  ? dump_stack_lvl+0x57/0x7d
> [0.00]  ? panic+0xfc/0x2c6
> [0.00]  ? alloc_low_pages+0x117/0x156
> [0.00]  ? phys_pmd_init+0x234/0x342
> [0.00]  ? phys_pud_init+0x171/0x337
> [0.00]  ? __kernel_physical_mapping_init+0xec/0x276
> [0.00]  ? init_memory_mapping+0x1ea/0x2aa
> [0.00]  ? init_range_memory_mapping+0xdf/0x12e
> [0.00]  ? init_mem_mapping+0x1e9/0x26f
> [0.00]  ? setup_arch+0x5ff/0xb6d
> [0.00]  ? start_kernel+0x71/0x6b4
> [0.00]  ? secondary_startup_64_no_verify+0xc2/0xcb
> 
> Bisect points to this patch. Reverting it fixes the problem. Key seems to
> be the amount of memory configured in qemu; the problem is not seen if
> there is 1G or more of memory, but it is seen with all test boots with
> 512M or 256M of memory. It is also seen with almost all 32-bit efi boots.
> 
> The problem is not seen when booting without efi.

It looks like this change uncovered a problem in
x86::memory_map_top_down(). 

The allocation in alloc_low_pages() is limited by min_pfn_mapped and
max_pfn_mapped. The min_pfn_mapped is updated at every iteration of the
loop in memory_map_top_down, but there is another loop in
init_range_memory_mapping() that maps several regions below the current
min_pfn_mapped without updating this variable.

The memory layout in qemu with 256M of RAM and EFI enabled causes
exhaustion of the memory limited by min_pfn_mapped and max_pfn_mapped
before min_pfn_mapped is updated.

Before this commit there was an unconditional "reservation" of 2M at the
end of memory that moved the initial min_pfn_mapped below the memory
reserved by EFI. The addition of a check for xen_domain() removed this
reservation for !XEN and made alloc_low_pages() use the range already busy
with EFI data.

The patch below moves the update of min_pfn_mapped near the update of
max_pfn_mapped so that every time a new range is mapped both limits will be
updated accordingly.

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 1152a29ce109..be279f6e5a0a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1,3 +1,4 @@
+#define DEBUG
 #include 
 #include 
 #include 
@@ -485,6 +486,7 @@ static void add_pfn_range_mapped(unsigned long start_pfn, 
unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_MAX_ENTRIES);
 
max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+   min_pfn_mapped = min(min_pfn_mapped, start_pfn);
 
if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
max_low_pfn_mapped = max(max_low_pfn_mapped,
@@ -643,7 +645,6 @@ static void __init memory_map_top_down(unsigned long 
map_start,
mapped_ram_size += init_range_memory_mapping(start,
last_start);
last_start = start;
-   min_pfn_mapped = last_start >> PAGE_SHIFT;
if (mapped_ram_size >= step_size)
step_size = get_new_step_size(step_size);
}
 


Re: [RFC PATCH 14/15] mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time constant.

2021-08-08 Thread Mike Rapoport
On Thu, Aug 05, 2021 at 03:02:52PM -0400, Zi Yan wrote:
> From: Zi Yan 
> 
> For other MAX_ORDER uses (described below), there is no need or too much
> hassle to convert certain static array to dynamic ones. Add
> MIN_MAX_ORDER to serve as compile time constant in place of MAX_ORDER.
> 
> ARM64 hypervisor maintains its own free page list and does not import
> any core kernel symbols, so soon-to-be runtime variable MAX_ORDER is not
> accessible in ARM64 hypervisor code. Also there is no need to allocating
> very large pages.
> 
> In SLAB/SLOB/SLUB, 2-D array kmalloc_caches uses MAX_ORDER in its second
> dimension. It is too much hassle to allocate memory for kmalloc_caches
> before any proper memory allocator is set up.
> 
> Signed-off-by: Zi Yan 
> Cc: Marc Zyngier 
> Cc: Catalin Marinas 
> Cc: Christoph Lameter 
> Cc: Vlastimil Babka 
> Cc: Quentin Perret 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: kvmarm@lists.cs.columbia.edu
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  arch/arm64/kvm/hyp/include/nvhe/gfp.h | 2 +-
>  arch/arm64/kvm/hyp/nvhe/page_alloc.c  | 3 ++-
>  include/linux/mmzone.h| 3 +++
>  include/linux/slab.h  | 8 
>  mm/slab.c | 2 +-
>  mm/slub.c | 7 ---
>  6 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h 
> b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> index fb0f523d1492..c774b4a98336 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> @@ -16,7 +16,7 @@ struct hyp_pool {
>* API at EL2.
>*/
>   hyp_spinlock_t lock;
> - struct list_head free_area[MAX_ORDER];
> + struct list_head free_area[MIN_MAX_ORDER];
>   phys_addr_t range_start;
>   phys_addr_t range_end;
>   unsigned short max_order;
> diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c 
> b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> index 41fc25bdfb34..a1cc1b648de0 100644
> --- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> @@ -226,7 +226,8 @@ int hyp_pool_init(struct hyp_pool *pool, u64 pfn, 
> unsigned int nr_pages,
>   int i;
>  
>   hyp_spin_lock_init(&pool->lock);
> - pool->max_order = min(MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
> +
> + pool->max_order = min(MIN_MAX_ORDER, get_order(nr_pages << PAGE_SHIFT));
>   for (i = 0; i < pool->max_order; i++)
>   INIT_LIST_HEAD(&pool->free_area[i]);
>   pool->range_start = phys;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 09aafc05aef4..379dada82d4b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -27,11 +27,14 @@
>  #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
>  #ifdef CONFIG_SET_MAX_ORDER
>  #define MAX_ORDER CONFIG_SET_MAX_ORDER
> +#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
>  #else
>  #define MAX_ORDER 11
> +#define MIN_MAX_ORDER MAX_ORDER
>  #endif /* CONFIG_SET_MAX_ORDER */
>  #else
>  #define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
> +#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
>  #endif /* CONFIG_ARCH_FORCE_MAX_ORDER */
>  #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

The end result of this #ifdef explosion looks entirely unreadable:

/* Free memory management - zoned buddy allocator.  */
#ifndef CONFIG_ARCH_FORCE_MAX_ORDER
#ifdef CONFIG_SET_MAX_ORDER
/* Defined in mm/page_alloc.c */
extern int buddy_alloc_max_order;

#define MAX_ORDER buddy_alloc_max_order
#define MIN_MAX_ORDER CONFIG_SET_MAX_ORDER
#else
#define MAX_ORDER 11
#define MIN_MAX_ORDER MAX_ORDER
#endif /* CONFIG_SET_MAX_ORDER */
#else

#ifdef CONFIG_SPARSEMEM_VMEMMAP
/* Defined in mm/page_alloc.c */
extern int buddy_alloc_max_order;

#define MAX_ORDER buddy_alloc_max_order
#else
#define MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
#define MIN_MAX_ORDER CONFIG_ARCH_FORCE_MAX_ORDER
#endif /* CONFIG_ARCH_FORCE_MAX_ORDER */

> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 2c0d80cca6b8..d8747c158db6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -244,8 +244,8 @@ static inline void __check_heap_object(const void *ptr, 
> unsigned long n,
>   * to do various tricks to work around compiler limitations in order to
>   * ensure proper constant folding.
>   */
> -#define KMALLOC_SHIFT_HIGH   ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
> - (MAX_ORDER + PAGE_SHIFT - 1) : 25)
> +#define KMALLOC_SHIFT_HIGH   ((MIN_MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
> + (MIN_MAX_ORDER + PAGE_SHIFT - 1) : 25)
>  #define KMALLOC_SHIFT_MAXKMALLOC_SHIFT_HIGH
>  #ifndef KMALLOC_SHIFT_LOW
>  #define KMALLOC_SHIFT_LOW5
> @@ -258,7 +258,7 @@ static inline void __check_heap_object(const void *ptr, 
> unsigned long n,
>   * (PAGE_SIZE*2).  Larger requests are passed to the page allocator.
>   */
>  #define KMALLOC_SHIFT_HIGH   (PAGE_SHIFT + 1)
> 

Re: [PATCH v3] memblock: make memblock_find_in_range method private

2021-08-03 Thread Mike Rapoport
On Tue, Aug 03, 2021 at 07:05:26PM +0100, Catalin Marinas wrote:
> On Tue, Aug 03, 2021 at 09:42:18AM +0300, Mike Rapoport wrote:
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 8490ed2917ff..0bffd2d1854f 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -74,6 +74,7 @@ phys_addr_t arm64_dma_phys_limit __ro_after_init;
> >  static void __init reserve_crashkernel(void)
> >  {
> > unsigned long long crash_base, crash_size;
> > +   unsigned long long crash_max = arm64_dma_phys_limit;
> > int ret;
> >  
> > ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
> > @@ -84,33 +85,18 @@ static void __init reserve_crashkernel(void)
> >  
> > crash_size = PAGE_ALIGN(crash_size);
> >  
> > -   if (crash_base == 0) {
> > -   /* Current arm64 boot protocol requires 2MB alignment */
> > -   crash_base = memblock_find_in_range(0, arm64_dma_phys_limit,
> > -   crash_size, SZ_2M);
> > -   if (crash_base == 0) {
> > -   pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
> > -   crash_size);
> > -   return;
> > -   }
> > -   } else {
> > -   /* User specifies base address explicitly. */
> > -   if (!memblock_is_region_memory(crash_base, crash_size)) {
> > -   pr_warn("cannot reserve crashkernel: region is not 
> > memory\n");
> > -   return;
> > -   }
> > +   /* User specifies base address explicitly. */
> > +   if (crash_base)
> > +   crash_max = crash_base + crash_size;
> >  
> > -   if (memblock_is_region_reserved(crash_base, crash_size)) {
> > -   pr_warn("cannot reserve crashkernel: region overlaps 
> > reserved memory\n");
> > -   return;
> > -   }
> > -
> > -   if (!IS_ALIGNED(crash_base, SZ_2M)) {
> > -   pr_warn("cannot reserve crashkernel: base address is 
> > not 2MB aligned\n");
> > -   return;
> > -   }
> > +   /* Current arm64 boot protocol requires 2MB alignment */
> > +   crash_base = memblock_phys_alloc_range(crash_size, SZ_2M,
> > +  crash_base, crash_max);
> > +   if (!crash_base) {
> > +   pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
> > +   crash_size);
> > +   return;
> > }
> > -   memblock_reserve(crash_base, crash_size);
> 
> We'll miss a bit on debug information provided to the user in case of a
> wrong crash_base/size option on the command line. Not sure we care much,
> though the alignment would probably be useful (maybe we document it
> somewhere).

It is already documented:

Documentation/admin-guide/kdump/kdump.rst:
   On arm64, use "crashkernel=Y[@X]".  Note that the start address of
   the kernel, X if explicitly specified, must be aligned to 2MiB (0x20).
 
> What I haven't checked is whether memblock_phys_alloc_range() aims to
> get a 2MB aligned end (size) as well. If crash_size is not 2MB aligned,
> crash_max wouldn't be either and the above could fail. We only care
> about the crash_base to be aligned but the memblock_phys_alloc_range()
> doc says that both the start and size would be aligned to this.

The doc lies :)

memblock_phys_alloc_range() boils down to 

	for_each_free_mem_range_reverse(i, nid, flags, &this_start, &this_end,
					NULL) {

		/* clamp this_{start,end} to the user defined limits */

		cand = round_down(this_end - size, align);
		if (cand >= this_start)
			return cand;
	}
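
In other words, only the candidate base is aligned to 'align'; neither the
size nor the end of the range has to be. A hypothetical worked example with
a size that is not a 2MiB multiple:

	size     = 0x501000		/* not a multiple of SZ_2M */
	align    = SZ_2M
	this_end = 0x80000000

	cand = round_down(0x80000000 - 0x501000, SZ_2M)
	     = round_down(0x7faff000, SZ_2M)
	     = 0x7fa00000		/* the returned base is 2MiB aligned */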

-- 
Sincerely yours,
Mike.


[PATCH v3] memblock: make memblock_find_in_range method private

2021-08-03 Thread Mike Rapoport
From: Mike Rapoport 

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Signed-off-by: Mike Rapoport 
---
v3:
* simplify the check for exact crash kernel allocation on arm, per Rob
* make crash_max unsigned long long on arm64, per Rob
 
v2: https://lore.kernel.org/lkml/20210802063737.22733-1-r...@kernel.org 
* don't change error message in arm::reserve_crashkernel(), per Russell

v1: https://lore.kernel.org/lkml/20210730104039.7047-1-r...@kernel.org

 arch/arm/kernel/setup.c   | 20 +-
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 21 +--
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 79 insertions(+), 118 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index f97eb2371672..284a80c0b6e1 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
if (crash_max > lowmem_max)
crash_max = lowmem_max;
-   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
-   crash_size, CRASH_ALIGN);
+
+   crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
+  CRASH_ALIGN, crash_max);
if (!crash_base) {
pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
} else {
+   unsigned long long crash_max = crash_base + crash_size;
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, SECTION_SIZE);
-   if (start != crash_base) {
+   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
+ crash_base, crash_max);
+   if (!start) {
pr_err("crashkernel reservation failed - memory is in 
use.\n");
return;
}
}
 
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret < 0) {
-   pr_warn("crashkernel reservation failed - memory is in use 
(0x%lx)\n",
-   (unsigned long)crash_base);
-   return;
-   }
-
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c 
b/arch/arm64/kvm/hyp/reserved_mem.c
index d654921dd09b..578670e3f608 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -92,12 +92,10 @@ void __init kvm_hyp_reserve(void)
 * this is unmapped from the host stage-2, and fallback to PAGE_SIZE.
 */
hyp_mem_size = hyp_mem_pages << PAGE_SHIFT;
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- ALIGN(hyp_mem_size, PMD_SIZE),
- PMD_SIZE);
+   hyp_mem_base = memblock_phys_alloc(ALIGN(hyp_mem_size, PMD_SIZE),
+  PMD_SIZE);
if (!hyp_mem_base)
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- hyp_mem_size, PAGE_SIZE);
+   hyp_mem_b

Re: [PATCH v2] memblock: make memblock_find_in_range method private

2021-08-02 Thread Mike Rapoport
Hi Rob,

On Mon, Aug 02, 2021 at 08:55:57AM -0600, Rob Herring wrote:
> On Mon, Aug 2, 2021 at 12:37 AM Mike Rapoport  wrote:
> >
> > From: Mike Rapoport 
> >
> > There are a lot of uses of memblock_find_in_range() along with
> > memblock_reserve() from the times memblock allocation APIs did not exist.
> >
> > memblock_find_in_range() is the very core of memblock allocations, so any
> > future changes to its internal behaviour would mandate updates of all the
> > users outside memblock.
> >
> > Replace the calls to memblock_find_in_range() with an equivalent calls to
> > memblock_phys_alloc() and memblock_phys_alloc_range() and make
> > memblock_find_in_range() private method of memblock.
> >
> > This simplifies the callers, ensures that (unlikely) errors in
> > memblock_reserve() are handled and improves maintainability of
> > memblock_find_in_range().
> >
> > Signed-off-by: Mike Rapoport 
> > ---
> > v2: don't change error message in arm::reserve_crashkernel(), per Russell
> > v1: https://lore.kernel.org/lkml/20210730104039.7047-1-r...@kernel.org
> >
> >  arch/arm/kernel/setup.c   | 18 +
> >  arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
> >  arch/arm64/mm/init.c  | 36 -
> >  arch/mips/kernel/setup.c  | 14 +-
> >  arch/riscv/mm/init.c  | 44 ++-
> >  arch/s390/kernel/setup.c  | 10 ---
> >  arch/x86/kernel/aperture_64.c |  5 ++--
> >  arch/x86/mm/init.c| 21 +--
> >  arch/x86/mm/numa.c|  5 ++--
> >  arch/x86/mm/numa_emulation.c  |  5 ++--
> >  arch/x86/realmode/init.c  |  2 +-
> >  drivers/acpi/tables.c |  5 ++--
> >  drivers/base/arch_numa.c  |  5 +---
> >  drivers/of/of_reserved_mem.c  | 12 ++---
> >  include/linux/memblock.h  |  2 --
> >  mm/memblock.c |  2 +-
> >  16 files changed, 78 insertions(+), 117 deletions(-)
> >
> > diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> > index f97eb2371672..67f5421b2af7 100644
> > --- a/arch/arm/kernel/setup.c
> > +++ b/arch/arm/kernel/setup.c
> > @@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
> > unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
> > if (crash_max > lowmem_max)
> > crash_max = lowmem_max;
> > -   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
> > -   crash_size, 
> > CRASH_ALIGN);
> > +
> > +   crash_base = memblock_phys_alloc_range(crash_size, 
> > CRASH_ALIGN,
> > +  CRASH_ALIGN, 
> > crash_max);
> > if (!crash_base) {
> > pr_err("crashkernel reservation failed - No 
> > suitable area found.\n");
> > return;
> > }
> > } else {
> > +   unsigned long long crash_max = crash_base + crash_size;
> > unsigned long long start;
> >
> > -   start = memblock_find_in_range(crash_base,
> > -  crash_base + crash_size,
> > -  crash_size, SECTION_SIZE);
> > +   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
> > + crash_base, crash_max);
> > if (start != crash_base) {
> 
> If this is true and start is non-zero, then you need an
> memblock_free(). However, since the range is equal to the size, then
> that can never happen and just checking !start is sufficient.

Agree. Will update.
 
> > pr_err("crashkernel reservation failed - memory is 
> > in use.\n");
> > return;
> > }
> > }
> >
> > -   ret = memblock_reserve(crash_base, crash_size);
> > -   if (ret < 0) {
> > -   pr_warn("crashkernel reservation failed - memory is in use 
> > (0x%lx)\n",
> > -   (unsigned long)crash_base);
> > -   return;
> > -   }
> > -
> > pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
> > RAM: %ldMB)\n",
> > (unsigned

[PATCH v2] memblock: make memblock_find_in_range method private

2021-08-02 Thread Mike Rapoport
From: Mike Rapoport 

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Signed-off-by: Mike Rapoport 
---
v2: don't change error message in arm::reserve_crashkernel(), per Russell
v1: https://lore.kernel.org/lkml/20210730104039.7047-1-r...@kernel.org

 arch/arm/kernel/setup.c   | 18 +
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 21 +--
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 78 insertions(+), 117 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index f97eb2371672..67f5421b2af7 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
if (crash_max > lowmem_max)
crash_max = lowmem_max;
-   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
-   crash_size, CRASH_ALIGN);
+
+   crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
+  CRASH_ALIGN, crash_max);
if (!crash_base) {
pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
} else {
+   unsigned long long crash_max = crash_base + crash_size;
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, SECTION_SIZE);
+   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
+ crash_base, crash_max);
if (start != crash_base) {
pr_err("crashkernel reservation failed - memory is in 
use.\n");
return;
}
}
 
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret < 0) {
-   pr_warn("crashkernel reservation failed - memory is in use 
(0x%lx)\n",
-   (unsigned long)crash_base);
-   return;
-   }
-
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c 
b/arch/arm64/kvm/hyp/reserved_mem.c
index d654921dd09b..578670e3f608 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -92,12 +92,10 @@ void __init kvm_hyp_reserve(void)
 * this is unmapped from the host stage-2, and fallback to PAGE_SIZE.
 */
hyp_mem_size = hyp_mem_pages << PAGE_SHIFT;
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- ALIGN(hyp_mem_size, PMD_SIZE),
- PMD_SIZE);
+   hyp_mem_base = memblock_phys_alloc(ALIGN(hyp_mem_size, PMD_SIZE),
+  PMD_SIZE);
if (!hyp_mem_base)
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- hyp_mem_size, PAGE_SIZE);
+   hyp_mem_base = memblock_phys_alloc(hyp_mem_size, PAGE_SIZE);
else
hyp_mem_size = ALIGN(hyp_mem_size, PMD_SIZE);
 
@@ -105,7 +103,6 @@ void __init kvm_hyp_reserve(void)
kvm_err("Failed to reserve hyp m

[PATCH] memblock: make memblock_find_in_range method private

2021-07-30 Thread Mike Rapoport
From: Mike Rapoport 

There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.

memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.

Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.

This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().

Signed-off-by: Mike Rapoport 
---
 arch/arm/kernel/setup.c   | 20 +-
 arch/arm64/kvm/hyp/reserved_mem.c |  9 +++
 arch/arm64/mm/init.c  | 36 -
 arch/mips/kernel/setup.c  | 14 +-
 arch/riscv/mm/init.c  | 44 ++-
 arch/s390/kernel/setup.c  | 10 ---
 arch/x86/kernel/aperture_64.c |  5 ++--
 arch/x86/mm/init.c| 21 +--
 arch/x86/mm/numa.c|  5 ++--
 arch/x86/mm/numa_emulation.c  |  5 ++--
 arch/x86/realmode/init.c  |  2 +-
 drivers/acpi/tables.c |  5 ++--
 drivers/base/arch_numa.c  |  5 +---
 drivers/of/of_reserved_mem.c  | 12 ++---
 include/linux/memblock.h  |  2 --
 mm/memblock.c |  2 +-
 16 files changed, 79 insertions(+), 118 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index f97eb2371672..1f8ef9fd5215 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1012,31 +1012,25 @@ static void __init reserve_crashkernel(void)
unsigned long long lowmem_max = __pa(high_memory - 1) + 1;
if (crash_max > lowmem_max)
crash_max = lowmem_max;
-   crash_base = memblock_find_in_range(CRASH_ALIGN, crash_max,
-   crash_size, CRASH_ALIGN);
+
+   crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
+  CRASH_ALIGN, crash_max);
if (!crash_base) {
pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
} else {
+   unsigned long long crash_max = crash_base + crash_size;
unsigned long long start;
 
-   start = memblock_find_in_range(crash_base,
-  crash_base + crash_size,
-  crash_size, SECTION_SIZE);
+   start = memblock_phys_alloc_range(crash_size, SECTION_SIZE,
+ crash_base, crash_max);
if (start != crash_base) {
-   pr_err("crashkernel reservation failed - memory is in 
use.\n");
+   pr_err("crashkernel reservation failed - No suitable 
area found.\n");
return;
}
}
 
-   ret = memblock_reserve(crash_base, crash_size);
-   if (ret < 0) {
-   pr_warn("crashkernel reservation failed - memory is in use 
(0x%lx)\n",
-   (unsigned long)crash_base);
-   return;
-   }
-
pr_info("Reserving %ldMB of memory at %ldMB for crashkernel (System 
RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
(unsigned long)(crash_base >> 20),
diff --git a/arch/arm64/kvm/hyp/reserved_mem.c 
b/arch/arm64/kvm/hyp/reserved_mem.c
index d654921dd09b..578670e3f608 100644
--- a/arch/arm64/kvm/hyp/reserved_mem.c
+++ b/arch/arm64/kvm/hyp/reserved_mem.c
@@ -92,12 +92,10 @@ void __init kvm_hyp_reserve(void)
 * this is unmapped from the host stage-2, and fallback to PAGE_SIZE.
 */
hyp_mem_size = hyp_mem_pages << PAGE_SHIFT;
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- ALIGN(hyp_mem_size, PMD_SIZE),
- PMD_SIZE);
+   hyp_mem_base = memblock_phys_alloc(ALIGN(hyp_mem_size, PMD_SIZE),
+  PMD_SIZE);
if (!hyp_mem_base)
-   hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
- hyp_mem_size, PAGE_SIZE);
+   hyp_mem_base = memblock_phys_alloc(hyp_mem_size, PAGE_SIZE);
else
hyp_mem_size = ALIGN(hyp_mem_size, PMD_SIZE);
 
@@ -105,7 +103,6 @@ void __init kvm_hyp_reserve(void)
kvm_err("Failed to reserve hyp memory\n");
return;

Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-13 Thread Mike Rapoport
On Thu, May 13, 2021 at 11:44:00AM +0800, Kefeng Wang wrote:
> On 2021/5/12 16:26, Mike Rapoport wrote:
> > On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/5/11 16:48, Mike Rapoport wrote:
> > > > On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
> > > > > 
> > > > > > > The memory is not continuous, see MEMBLOCK:
> > > > > > > memory size = 0x4c0f reserved size = 0x027ef058
> > > > > > > memory.cnt  = 0xa
> > > > > > > memory[0x0][0x80a0-0x855f], 0x04c0 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x1][0x86a0-0x87df], 0x0140 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x2][0x8bd0-0x8c4f], 0x0080 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x3][0x8e30-0x8ecf], 0x00a0 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x4][0x90d0-0xbfff], 0x2f30 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x5][0xcc00-0xdc9f], 0x10a0 bytes 
> > > > > > > flags: 0x0
> > > > > > > memory[0x6][0xde70-0xde9f], 0x0030 bytes 
> > > > > > > flags: 0x0
> > > > > > > ...
> > > > > > > 
> > > > > > > The pfn_range [0xde600,0xde700] => addr_range 
> > > > > > > [0xde60,0xde70]
> > > > > > > is not available memory, and we won't create memmap , so with or 
> > > > > > > without
> > > > > > > your patch, we can't see the range in free_memmap(), right?
> > > > > > 
> > > > > > This is not available memory and we won't see the reange in 
> > > > > > free_memmap(),
> > > > > > but we still should create memmap for it and that's what my patch 
> > > > > > tried to
> > > > > > do.
> > > > > > 
> > > > > > There are a lot of places in core mm that operate on pageblocks and
> > > > > > free_unused_memmap() should make sure that any pageblock has a 
> > > > > > valid memory
> > > > > > map.
> > > > > > 
> > > > > > Currently, that's not the case when SPARSEMEM=y and my patch tried 
> > > > > > to fix
> > > > > > it.
> > > > > > 
> > > > > > Can you please send log with my patch applied and with the printing 
> > > > > > of
> > > > > > ranges that are freed in free_unused_memmap() you've used in 
> > > > > > previous
> > > > > > mails?
> > > > 
> > > > > with your patch[1] and debug print in free_memmap,
> > > > > > free_memmap, start_pfn = 85800,  8580 end_pfn = 86800, 
> > > > > 8680
> > > > > > free_memmap, start_pfn = 8c800,  8c80 end_pfn = 8e000, 
> > > > > 8e00
> > > > > > free_memmap, start_pfn = 8f000,  8f00 end_pfn = 9, 
> > > > > 9000
> > > > > > free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de400, 
> > > > > de40
> > > > > > free_memmap, start_pfn = dec00,  dec0 end_pfn = e, 
> > > > > e000
> > > > > > free_memmap, start_pfn = e0c00,  e0c0 end_pfn = e4000, 
> > > > > e400
> > > > > > free_memmap, start_pfn = f7000,  f700 end_pfn = f8000, 
> > > > > f800
> > > > 
> > > > It seems that freeing of the memory map is suboptimal still because that
> > > > code was not designed for memory layout that has more holes than Swiss
> > > > cheese.
> > > > 
> > > > Still, the range [0xde600,0xde700] is not freed and there should be 
> > > > struct
> > > > pages for this range.
> > > > 
> > > > Can you add
> > > > 
> > > > dump_page(pfn_to_page(0xde600), "");
> > > > 
> > > > say, in the end of memblock_free_all()?
> > > > 
> > > The range [0xde600,0xde700] is not memory, so it won't create struct page
> > > for it when sparse_init?

Re: [PATCH v4 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-05-12 Thread Mike Rapoport
On Wed, May 12, 2021 at 09:59:33AM +0200, Ard Biesheuvel wrote:
> On Wed, 12 May 2021 at 09:34, Mike Rapoport  wrote:
> >
> > On Wed, May 12, 2021 at 09:00:02AM +0200, Ard Biesheuvel wrote:
> > > On Tue, 11 May 2021 at 12:05, Mike Rapoport  wrote:
> > > >
> > > > From: Mike Rapoport 
> > > >
> > > > Hi,
> > > >
> > > > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially 
> > > > hardwire
> > > > pfn_valid_within() to 1.
> > > >
> > > > The idea is to mark NOMAP pages as reserved in the memory map and 
> > > > restore
> > > > the intended semantics of pfn_valid() to designate availability of 
> > > > struct
> > > > page for a pfn.
> > > >
> > > > With this the core mm will be able to cope with the fact that it cannot 
> > > > use
> > > > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER 
> > > > blocks
> > > > will be treated correctly even without the need for pfn_valid_within.
> > > >
> > > > The patches are boot tested on qemu-system-aarch64.
> > > >
> > >
> > > Did you use EFI boot when testing this? The memory map is much more
> > > fragmented in that case, so this would be a good data point.
> >
> > Right, something like this:
> >
> 
> Yes, although it is not always that bad.
> 
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x4000-0xbfff]
> > [0.00]   node   0: [mem 0xc000-0x]
> 
> This is allocated below 4 GB by the firmware, for reasons that are
> only valid on x86 (where some of the early boot chain is IA32 only)
> 
> > [0.00]   node   0: [mem 0x0001-0x0004386f]
> > [0.00]   node   0: [mem 0x00043870-0x00043899]
> > [0.00]   node   0: [mem 0x0004389a-0x0004389b]
> > [0.00]   node   0: [mem 0x0004389c-0x000438b5]
> > [0.00]   node   0: [mem 0x000438b6-0x00043be3]
> > [0.00]   node   0: [mem 0x00043be4-0x00043bec]
> > [0.00]   node   0: [mem 0x00043bed-0x00043bed]
> > [0.00]   node   0: [mem 0x00043bee-0x00043bff]
> > [0.00]   node   0: [mem 0x00043c00-0x00043fff]
> >
> > This is a pity really, because I don't see a fundamental reason for those
> > tiny holes all over the place.
> >
> 
> There is a config option in the firmware build that allows these
> regions to be preallocated using larger windows, which greatly reduces
> the fragmentation.
> > I know that EFI/ACPI mandates "IO style" memory access for those regions,
> > but I fail to get why...
> >
> 
> Not sure what you mean by 'IO style memory access'.
 
Well, my understanding is that the memory reserved by the firmware cannot
be mapped in the linear map because it might require different caching
modes (e.g. like IO) and arm64 cannot tolerate aliased mappings with
different caching.
But what evades me is *why* these areas cannot be accessed as normal RAM.
 
> > > > I believe it would be best to route these via mmotm tree.
> > > >
> > > > v4:
> > > > * rebase on v5.13-rc1
> > > >
> > > > v3: Link: 
> > > > https://lore.kernel.org/lkml/20210422061902.21614-1-r...@kernel.org
> > > > * Fix minor issues found by Anshuman
> > > > * Freshen up the declaration of pfn_valid() to make it consistent with
> > > >   pfn_is_map_memory()
> > > > * Add more Acked-by and Reviewed-by tags, thanks Anshuman and David
> > > >
> > > > v2: Link: 
> > > > https://lore.kernel.org/lkml/20210421065108.1987-1-r...@kernel.org
> > > > * Add check for PFN overflow in pfn_is_map_memory()
> > > > * Add Acked-by and Reviewed-by tags, thanks David.
> > > >
> > > > v1: Link: 
> > > > https://lore.kernel.org/lkml/20210420090925.7457-1-r...@kernel.org
> > > > * Add comment about the semantics of pfn_valid() as Anshuman suggested
> > > > * Extend comments about MEMBLOCK_NOMAP, per Anshuman
> > > > * Use pfn_is_map_memory() name for the exported wrapper for
> > > >   memblock_is_map_memory(). It is still local to arch/arm64 in the end
> > > >   because of header dependency issues.
> > > >
> > > > rfc: Link: 
> > > > https

Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-12 Thread Mike Rapoport
On Wed, May 12, 2021 at 11:08:14AM +0800, Kefeng Wang wrote:
> 
> On 2021/5/11 16:48, Mike Rapoport wrote:
> > On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
> > > 
> > > > > The memory is not continuous, see MEMBLOCK:
> > > > >memory size = 0x4c0f reserved size = 0x027ef058
> > > > >memory.cnt  = 0xa
> > > > >memory[0x0][0x80a0-0x855f], 0x04c0 bytes flags: 0x0
> > > > >memory[0x1][0x86a0-0x87df], 0x0140 bytes flags: 0x0
> > > > >memory[0x2][0x8bd0-0x8c4f], 0x0080 bytes flags: 0x0
> > > > >memory[0x3][0x8e30-0x8ecf], 0x00a0 bytes flags: 0x0
> > > > >memory[0x4][0x90d0-0xbfff], 0x2f30 bytes flags: 0x0
> > > > >memory[0x5][0xcc00-0xdc9f], 0x10a0 bytes flags: 0x0
> > > > >memory[0x6][0xde70-0xde9f], 0x0030 bytes flags: 0x0
> > > > > ...
> > > > > 
> > > > > The pfn_range [0xde600,0xde700] => addr_range [0xde60,0xde70]
> > > > > is not available memory, and we won't create memmap , so with or 
> > > > > without
> > > > > your patch, we can't see the range in free_memmap(), right?
> > > > 
> > > > This is not available memory and we won't see the range in 
> > > > free_memmap(),
> > > > but we still should create memmap for it and that's what my patch tried 
> > > > to
> > > > do.
> > > > 
> > > > There are a lot of places in core mm that operate on pageblocks and
> > > > free_unused_memmap() should make sure that any pageblock has a valid 
> > > > memory
> > > > map.
> > > > 
> > > > Currently, that's not the case when SPARSEMEM=y and my patch tried to 
> > > > fix
> > > > it.
> > > > 
> > > > Can you please send log with my patch applied and with the printing of
> > > > ranges that are freed in free_unused_memmap() you've used in previous
> > > > mails?
> > 
> > > with your patch[1] and debug print in free_memmap,
> > > > free_memmap, start_pfn = 85800,  8580 end_pfn = 86800, 8680
> > > > free_memmap, start_pfn = 8c800,  8c80 end_pfn = 8e000, 8e00
> > > > free_memmap, start_pfn = 8f000,  8f00 end_pfn = 9, 9000
> > > > free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de400, de40
> > > > free_memmap, start_pfn = dec00,  dec0 end_pfn = e, e000
> > > > free_memmap, start_pfn = e0c00,  e0c0 end_pfn = e4000, e400
> > > > free_memmap, start_pfn = f7000,  f700 end_pfn = f8000, f800
> > 
> > It seems that freeing of the memory map is suboptimal still because that
> > code was not designed for memory layout that has more holes than Swiss
> > cheese.
> > 
> > Still, the range [0xde600,0xde700] is not freed and there should be struct
> > pages for this range.
> > 
> > Can you add
> > 
> > dump_page(pfn_to_page(0xde600), "");
> > 
> > say, in the end of memblock_free_all()?
> > 
> The range [0xde600,0xde700] is not memory, so it won't create struct page
> for it when sparse_init?

sparse_init() indeed does not create memory map for unpopulated memory, but
it has pretty coarse granularity, i.e. 64M in your configuration. A hole
should be at least 64M in order to skip allocation of the memory map for
it.
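
For a rough feel of the numbers, here is a minimal user-space sketch of that
granularity. It uses the SECTION_SIZE_BITS = 26 value quoted earlier in this
thread and assumes 4K pages (PAGE_SHIFT = 12); it is only an illustration,
not kernel code:

#include <stdio.h>

int main(void)
{
	unsigned long section_size_bits = 26;	/* from the reported config */
	unsigned long page_shift = 12;		/* assumed 4K pages */
	unsigned long pfn_section_shift = section_size_bits - page_shift;

	/* section size: 64 MiB, i.e. 0x4000 pages per section */
	printf("section: %lu MiB, %#lx pages\n",
	       (1UL << section_size_bits) >> 20, 1UL << pfn_section_shift);

	/* the hole starting at pfn 0xdca00 and the memory resuming at pfn
	 * 0xde700 land in the same section, so the memory map for that
	 * section is allocated anyway */
	printf("section of 0xdca00: %#lx, section of 0xde700: %#lx\n",
	       0xdca00UL >> pfn_section_shift, 0xde700UL >> pfn_section_shift);
	return 0;
}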

For example, your memory layout has a hole of 192M at pfn 0xc and this
hole won't have the memory map.

However the hole 0xdca00 - 0xde70 will still have a memory map in the
section  that covers 0xdc000 - 0xe.

I've tried outline this in a sketch below, hope it helps.

Memory:
  c  cc000  dca00
--+  +--+ ++
 memory bank  |<- hole ->| memory bank  | | mb |
--+  +--+ ++
de700  dea00

Memory map:

bb4000c  cc000   dd8000dc000
+++- ... -+  ++- ... -++-+
| memmap | memmap | ...   |<- hole ->| memmap |  ...  | memmap | memmap  |
+++- ... -+  ++- ... -++-+


> After apply patch[1], the dump_page log,
> 
> page:ef3cc000 is uninitialized and poisoned
> raw:        
> page dumped because:

This means that there is a memory map entry, and it got poisoned during the
initialization and never got reinitialized to sensible values, which would
be PageReserved() in this case.

I believe this was fixed by commit 0740a50b9baa ("mm/page_alloc.c: refactor
initialization of struct page for holes in memory layout") in the mainline
tree.

Can you backport it to your 5.10 tree and check if it helps?
 
-- 
Sincerely yours,
Mike.


Re: [PATCH v4 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-05-12 Thread Mike Rapoport
On Wed, May 12, 2021 at 09:00:02AM +0200, Ard Biesheuvel wrote:
> On Tue, 11 May 2021 at 12:05, Mike Rapoport  wrote:
> >
> > From: Mike Rapoport 
> >
> > Hi,
> >
> > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
> > pfn_valid_within() to 1.
> >
> > The idea is to mark NOMAP pages as reserved in the memory map and restore
> > the intended semantics of pfn_valid() to designate availability of struct
> > page for a pfn.
> >
> > With this the core mm will be able to cope with the fact that it cannot use
> > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
> > will be treated correctly even without the need for pfn_valid_within.
> >
> > The patches are boot tested on qemu-system-aarch64.
> >
> 
> Did you use EFI boot when testing this? The memory map is much more
> fragmented in that case, so this would be a good data point.

Right, something like this:

[0.00] Early memory node ranges 
[0.00]   node   0: [mem 0x4000-0xbfff]  
[0.00]   node   0: [mem 0xc000-0x]  
[0.00]   node   0: [mem 0x0001-0x0004386f]  
[0.00]   node   0: [mem 0x00043870-0x00043899]  
[0.00]   node   0: [mem 0x0004389a-0x0004389b]  
[0.00]   node   0: [mem 0x0004389c-0x000438b5]  
[0.00]   node   0: [mem 0x000438b6-0x00043be3]  
[0.00]   node   0: [mem 0x00043be4-0x00043bec]  
[0.00]   node   0: [mem 0x00043bed-0x00043bed]  
[0.00]   node   0: [mem 0x00043bee-0x00043bff]  
[0.00]   node   0: [mem 0x00043c00-0x00043fff]  

This is a pity really, because I don't see a fundamental reason for those
tiny holes all over the place. 

I know that EFI/ACPI mandates "IO style" memory access for those regions,
but I fail to get why...
 
> > I believe it would be best to route these via mmotm tree.
> >
> > v4:
> > * rebase on v5.13-rc1
> >
> > v3: Link: 
> > https://lore.kernel.org/lkml/20210422061902.21614-1-r...@kernel.org
> > * Fix minor issues found by Anshuman
> > * Freshen up the declaration of pfn_valid() to make it consistent with
> >   pfn_is_map_memory()
> > * Add more Acked-by and Reviewed-by tags, thanks Anshuman and David
> >
> > v2: Link: https://lore.kernel.org/lkml/20210421065108.1987-1-r...@kernel.org
> > * Add check for PFN overflow in pfn_is_map_memory()
> > * Add Acked-by and Reviewed-by tags, thanks David.
> >
> > v1: Link: https://lore.kernel.org/lkml/20210420090925.7457-1-r...@kernel.org
> > * Add comment about the semantics of pfn_valid() as Anshuman suggested
> > * Extend comments about MEMBLOCK_NOMAP, per Anshuman
> > * Use pfn_is_map_memory() name for the exported wrapper for
> >   memblock_is_map_memory(). It is still local to arch/arm64 in the end
> >   because of header dependency issues.
> >
> > rfc: Link: 
> > https://lore.kernel.org/lkml/20210407172607.8812-1-r...@kernel.org
> >
> > Mike Rapoport (4):
> >   include/linux/mmzone.h: add documentation for pfn_valid()
> >   memblock: update initialization of reserved pages
> >   arm64: decouple check whether pfn is in linear map from pfn_valid()
> >   arm64: drop pfn_valid_within() and simplify pfn_valid()
> >
> >  arch/arm64/Kconfig  |  3 ---
> >  arch/arm64/include/asm/memory.h |  2 +-
> >  arch/arm64/include/asm/page.h   |  3 ++-
> >  arch/arm64/kvm/mmu.c|  2 +-
> >  arch/arm64/mm/init.c| 14 +-
> >  arch/arm64/mm/ioremap.c |  4 ++--
> >  arch/arm64/mm/mmu.c |  2 +-
> >  include/linux/memblock.h|  4 +++-
> >  include/linux/mmzone.h  | 11 +++
> >  mm/memblock.c   | 28 ++--
> >  10 files changed, 60 insertions(+), 13 deletions(-)
> >
> >
> > base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
> > --
> > 2.28.0
> >

-- 
Sincerely yours,
Mike.


Re: [PATCH v4 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-05-11 Thread Mike Rapoport
On Tue, May 11, 2021 at 04:40:01PM -0700, Andrew Morton wrote:
> On Tue, 11 May 2021 13:05:50 +0300 Mike Rapoport  wrote:
> 
> > From: Mike Rapoport 
> > 
> > The arm64's version of pfn_valid() differs from the generic because of two
> > reasons:
> > 
> > * Parts of the memory map are freed during boot. This makes it necessary to
> >   verify that there is actual physical memory that corresponds to a pfn
> >   which is done by querying memblock.
> > 
> > * There are NOMAP memory regions. These regions are not mapped in the
> >   linear map and until the previous commit the struct pages representing
> >   these areas had default values.
> > 
> > As the consequence of absence of the special treatment of NOMAP regions in
> > the memory map it was necessary to use memblock_is_map_memory() in
> > pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
> > generic mm functionality would not treat a NOMAP page as a normal page.
> > 
> > Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
> > the rest of core mm will treat them as unusable memory and thus
> > pfn_valid_within() is no longer required at all and can be disabled by
> > removing CONFIG_HOLES_IN_ZONE on arm64.
> > 
> > pfn_valid() can be slightly simplified by replacing
> > memblock_is_map_memory() with memblock_is_memory().
> > 
> > ...
> >
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1052,9 +1052,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
> > def_bool y
> > depends on NUMA
> >  
> > -config HOLES_IN_ZONE
> > -   def_bool y
> > -
> >  source "kernel/Kconfig.hz"
> >  
> >  config ARCH_SPARSEMEM_ENABLE
> 
> https://lkml.kernel.org/r/20210417075946.181402-1-wangkefeng.w...@huawei.com
> already did this, so I simply dropped that hunk? 
> And I don't think the changelog needs updating for this?

We need another hunk instead (below)

> And I don't think the changelog needs updating for this?

maybe "s/disabled by removing CONFIG_HOLES_IN_ZONE/disabled/", but does not
seem that important to me.

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3d6c7436a2fa..d7dc8698cf8e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -201,7 +201,6 @@ config ARM64
select HAVE_KPROBES
select HAVE_KRETPROBES
select HAVE_GENERIC_VDSO
-   select HOLES_IN_ZONE
select IOMMU_DMA if IOMMU_SUPPORT
select IRQ_DOMAIN
select IRQ_FORCED_THREADING


[PATCH v4 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-05-11 Thread Mike Rapoport
From: Mike Rapoport 

The arm64's version of pfn_valid() differs from the generic because of two
reasons:

* Parts of the memory map are freed during boot. This makes it necessary to
  verify that there is actual physical memory that corresponds to a pfn
  which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As the consequence of absence of the special treatment of NOMAP regions in
the memory map it was necessary to use memblock_is_map_memory() in
pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled by
removing CONFIG_HOLES_IN_ZONE on arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().

Signed-off-by: Mike Rapoport 
Acked-by: David Hildenbrand 
---
 arch/arm64/Kconfig   | 3 ---
 arch/arm64/mm/init.c | 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..d7dc8698cf8e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1052,9 +1052,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
-config HOLES_IN_ZONE
-   def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 798f74f501d5..fb07218da2c0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -251,7 +251,7 @@ int pfn_valid(unsigned long pfn)
if (!early_section(ms))
return pfn_section_valid(ms, pfn);
 
-   return memblock_is_map_memory(addr);
+   return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
-- 
2.28.0



[PATCH v4 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-05-11 Thread Mike Rapoport
From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_map_memory() wrapper for
memblock_is_map_memory() to perform such check and use it where
appropriate.

Using a wrapper allows us to avoid cyclic include dependencies.

While here also update style of pfn_valid() so that both pfn_valid() and
pfn_is_map_memory() declarations will be consistent.

Signed-off-by: Mike Rapoport 
Acked-by: David Hildenbrand 
---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 12 
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 6 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 87b90dc27a43..9027b7e16c4c 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -369,7 +369,7 @@ static inline void *phys_to_virt(phys_addr_t x)
 
 #define virt_addr_valid(addr)  ({  \
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  
\
 })
 
 void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..75ddfe671393 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -37,7 +37,8 @@ void copy_highpage(struct page *to, struct page *from);
 
 typedef struct page *pgtable_t;
 
-extern int pfn_valid(unsigned long);
+int pfn_valid(unsigned long pfn);
+int pfn_is_map_memory(unsigned long pfn);
 
 #include 
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c5d1f3c87dbd..470070073085 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-   return !pfn_valid(pfn);
+   return !pfn_is_map_memory(pfn);
 }
 
 static void *stage2_memcache_zalloc_page(void *arg)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 16a2b2b1c54d..798f74f501d5 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -255,6 +255,18 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_map_memory(unsigned long pfn)
+{
+   phys_addr_t addr = PFN_PHYS(pfn);
+
+   /* avoid false positives for bogus PFNs, see comment in pfn_valid() */
+   if (PHYS_PFN(addr) != pfn)
+   return 0;
+
+   return memblock_is_map_memory(addr);
+}
+EXPORT_SYMBOL(pfn_is_map_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b5e83c46b23e..b7c81dacabf0 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, 
size_t size,
/*
 * Don't allow RAM to be mapped.
 */
-   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
+   if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr
return NULL;
 
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
/* For normal memory we already have a cacheable mapping. */
-   if (pfn_valid(__phys_to_pfn(phys_addr)))
+   if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
return (void __iomem *)__phys_to_virt(phys_addr);
 
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6dd9369e3ea0..ab5914cebd3c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -82,7 +82,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
-   if (!pfn_valid(pfn))
+   if (!pfn_is_map_memory(pfn))
return pgprot_noncached(vma_prot);
else if (file->f_flags & O_SYNC)
return pgprot_writecombine(vma_prot);
-- 
2.28.0



[PATCH v4 2/4] memblock: update initialization of reserved pages

2021-05-11 Thread Mike Rapoport
From: Mike Rapoport 

The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().

Split out initialization of the reserved pages to a function with a
meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions and mark struct pages for the NOMAP regions as
PageReserved.

Signed-off-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
Reviewed-by: Anshuman Khandual 
---
 include/linux/memblock.h |  4 +++-
 mm/memblock.c| 28 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..1b4c97c151ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NONE: no special request
  * @MEMBLOCK_HOTPLUG: hotpluggable region
  * @MEMBLOCK_MIRROR: mirrored region
- * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
+ * reserved in the memory map; refer to memblock_mark_nomap() description
+ * for further details
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3abf2c3fea7f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
+ * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
+ * direct mapping of the physical memory. These regions will still be
+ * covered by the memory map. The struct page representing NOMAP memory
+ * frames in the memory map will be PageReserved()
+ *
  * Return: 0 on success, -errno on failure.
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
@@ -2002,6 +2007,26 @@ static unsigned long __init 
__free_memory_core(phys_addr_t start,
return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+   struct memblock_region *region;
+   phys_addr_t start, end;
+   u64 i;
+
+   /* initialize struct pages for the reserved regions */
+   for_each_reserved_mem_range(i, , )
+   reserve_bootmem_region(start, end);
+
+   /* and also treat struct pages for the NOMAP regions as PageReserved */
+   for_each_mem_region(region) {
+   if (memblock_is_nomap(region)) {
+   start = region->base;
+   end = start + region->size;
+   reserve_bootmem_region(start, end);
+   }
+   }
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
@@ -2010,8 +2035,7 @@ static unsigned long __init 
free_low_memory_core_early(void)
 
memblock_clear_hotplug(0, -1);
 
-   for_each_reserved_mem_range(i, , )
-   reserve_bootmem_region(start, end);
+   memmap_init_reserved_pages();
 
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
-- 
2.28.0



[PATCH v4 1/4] include/linux/mmzone.h: add documentation for pfn_valid()

2021-05-11 Thread Mike Rapoport
From: Mike Rapoport 

Add comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for availability of a memory map entry (i.e. struct
page) for a PFN rather than availability of usable memory backing that PFN.

The most "generic" version of pfn_valid() used by the configurations with
SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
suitable place for documentation about semantics of pfn_valid().

Suggested-by: Anshuman Khandual 
Signed-off-by: Mike Rapoport 
Reviewed-by: Anshuman Khandual 
---
 include/linux/mmzone.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d53eba1c383..e5945ca24df7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1427,6 +1427,17 @@ static inline int pfn_section_valid(struct mem_section 
*ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+/**
+ * pfn_valid - check if there is a valid memory map entry for a PFN
+ * @pfn: the page frame number to check
+ *
+ * Check if there is a valid memory map entry aka struct page for the @pfn.
+ * Note, that availability of the memory map entry does not imply that
+ * there is actual usable memory at that @pfn. The struct page may
+ * represent a hole or an unusable page frame.
+ *
+ * Return: 1 for PFNs that have memory map entries and 0 otherwise
+ */
 static inline int pfn_valid(unsigned long pfn)
 {
struct mem_section *ms;
-- 
2.28.0



[PATCH v4 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-05-11 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this the core mm will be able to cope with the fact that it cannot use
NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without the need for pfn_valid_within.

The patches are boot tested on qemu-system-aarch64.

I believe it would be best to route these via mmotm tree.

v4:
* rebase on v5.13-rc1

v3: Link: https://lore.kernel.org/lkml/20210422061902.21614-1-r...@kernel.org
* Fix minor issues found by Anshuman
* Freshen up the declaration of pfn_valid() to make it consistent with
  pfn_is_map_memory()
* Add more Acked-by and Reviewed-by tags, thanks Anshuman and David

v2: Link: https://lore.kernel.org/lkml/20210421065108.1987-1-r...@kernel.org
* Add check for PFN overflow in pfn_is_map_memory()
* Add Acked-by and Reviewed-by tags, thanks David.

v1: Link: https://lore.kernel.org/lkml/20210420090925.7457-1-r...@kernel.org
* Add comment about the semantics of pfn_valid() as Anshuman suggested
* Extend comments about MEMBLOCK_NOMAP, per Anshuman
* Use pfn_is_map_memory() name for the exported wrapper for
  memblock_is_map_memory(). It is still local to arch/arm64 in the end
  because of header dependency issues.

rfc: Link: https://lore.kernel.org/lkml/20210407172607.8812-1-r...@kernel.org

Mike Rapoport (4):
  include/linux/mmzone.h: add documentation for pfn_valid()
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is in linear map from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig  |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 14 +-
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 include/linux/memblock.h|  4 +++-
 include/linux/mmzone.h  | 11 +++
 mm/memblock.c   | 28 ++--
 10 files changed, 60 insertions(+), 13 deletions(-)


base-commit: 6efb943b8616ec53a5e444193dccf1af9ad627b5
-- 
2.28.0



Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-11 Thread Mike Rapoport
On Mon, May 10, 2021 at 11:10:20AM +0800, Kefeng Wang wrote:
>
> > > The memory is not continuous, see MEMBLOCK:
> > >   memory size = 0x4c0f reserved size = 0x027ef058
> > >   memory.cnt  = 0xa
> > >   memory[0x0][0x80a0-0x855f], 0x04c0 bytes flags: 0x0
> > >   memory[0x1][0x86a0-0x87df], 0x0140 bytes flags: 0x0
> > >   memory[0x2][0x8bd0-0x8c4f], 0x0080 bytes flags: 0x0
> > >   memory[0x3][0x8e30-0x8ecf], 0x00a0 bytes flags: 0x0
> > >   memory[0x4][0x90d0-0xbfff], 0x2f30 bytes flags: 0x0
> > >   memory[0x5][0xcc00-0xdc9f], 0x10a0 bytes flags: 0x0
> > >   memory[0x6][0xde70-0xde9f], 0x0030 bytes flags: 0x0
> > > ...
> > > 
> > > The pfn_range [0xde600,0xde700] => addr_range [0xde60,0xde70]
> > > is not available memory, and we won't create memmap , so with or without
> > > your patch, we can't see the range in free_memmap(), right?
> > 
> > This is not available memory and we won't see the range in free_memmap(),
> > but we still should create memmap for it and that's what my patch tried to
> > do.
> > 
> > There are a lot of places in core mm that operate on pageblocks and
> > free_unused_memmap() should make sure that any pageblock has a valid memory
> > map.
> > 
> > Currently, that's not the case when SPARSEMEM=y and my patch tried to fix
> > it.
> > 
> > Can you please send log with my patch applied and with the printing of
> > ranges that are freed in free_unused_memmap() you've used in previous
> > mails?

> with your patch[1] and debug print in free_memmap,
> > free_memmap, start_pfn = 85800,  8580 end_pfn = 86800, 8680
> > free_memmap, start_pfn = 8c800,  8c80 end_pfn = 8e000, 8e00
> > free_memmap, start_pfn = 8f000,  8f00 end_pfn = 9, 9000
> > free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de400, de40
> > free_memmap, start_pfn = dec00,  dec0 end_pfn = e, e000
> > free_memmap, start_pfn = e0c00,  e0c0 end_pfn = e4000, e400
> > free_memmap, start_pfn = f7000,  f700 end_pfn = f8000, f800

It seems that freeing of the memory map is suboptimal still because that
code was not designed for memory layout that has more holes than Swiss
cheese. 

Still, the range [0xde600,0xde700] is not freed and there should be struct
pages for this range.

Can you add 

dump_page(pfn_to_page(0xde600), "");

say, in the end of memblock_free_all()?
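
Something like the following would do -- a sketch only, not a patch from this
thread; dump_page() and pfn_to_page() are existing kernel interfaces, while
the helper name and the hard-coded pfn are just for illustration:

#include <linux/init.h>
#include <linux/mm.h>	/* pfn_to_page(), dump_page() */

/* dump whatever struct page backs the given pfn, poisoned or not */
static void __init dump_suspect_pfn(unsigned long pfn)
{
	dump_page(pfn_to_page(pfn), "end of memblock_free_all");
}

/* ... and at the very end of memblock_free_all():
 *	dump_suspect_pfn(0xde600);
 */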
 
-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-08 Thread Mike Rapoport
On Fri, May 07, 2021 at 08:34:52PM +0800, Kefeng Wang wrote:
> 
> 
> On 2021/5/7 18:30, Mike Rapoport wrote:
> > On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/5/6 20:47, Kefeng Wang wrote:
> > > > 
> > > > > > > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > > > > > > move_freepages at
> > > > > > > > 
> > > > > > > > start_pfn/end_pfn [de600, de7ff], [de60, de7ff000]
> > > > > > > > :  pfn =de600, page
> > > > > > > > =ef3cc000, page-flags = ,  pfn2phy = de60
> > > > > > > > 
> > > > > > > > > > __free_memory_core, range: 0xb020 - 0xc000, pfn: b0200 - b0200
> > > > > > > > > > __free_memory_core, range: 0xcc00 - 0xdca0, pfn: cc000 - b0200
> > > > > > > > > > __free_memory_core, range: 0xde70 - 0xdea0, pfn: de700 - b0200
> > > > > > > 
> > > > > > > Hmm, [de600, de7ff] is not added to the free lists which is
> > > > > > > correct. But
> > > > > > > then it's unclear how the page for de600 gets to 
> > > > > > > move_freepages()...
> > > > > > > 
> > > > > > > Can't say I have any bright ideas to try here...
> > > > > > 
> > > > > > Are we missing some checks (e.g., PageReserved()) that
> > > > > > pfn_valid_within()
> > > > > > would have "caught" before?
> > > > > 
> > > > > Unless I'm missing something the crash happens in 
> > > > > __rmqueue_fallback():
> > > > > 
> > > > > do_steal:
> > > > >  page = get_page_from_free_area(area, fallback_mt);
> > > > > 
> > > > >  steal_suitable_fallback(zone, page, alloc_flags, 
> > > > > start_migratetype,
> > > > >      can_steal);
> > > > >      -> move_freepages()
> > > > >      -> BUG()
> > > > > 
> > > > > So a page from free area should be sane as the freed range was never
> > > > > added
> > > > > it to the free lists.
> > > > 
> > > > Sorry for the late response due to the vacation.
> > > > 
> > > > The pfn in range [de600, de7ff] won't be added into the free lists via
> > > > __free_memory_core(), but the pfn could be added into freelists via
> > > > free_highmem_page()
> > > > 
> > > > I add some debug[1] in add_to_free_list(), we could see the calltrace
> > > > 
> > > > free_highpages, range_pfn [b0200, c], range_addr [b020, 
> > > > c000]
> > > > free_highpages, range_pfn [cc000, dca00], range_addr [cc00, 
> > > > dca0]
> > > > free_highpages, range_pfn [de700, dea00], range_addr [de70, 
> > > > dea0]
> > > > add_to_free_list, ===> pfn = de700
> > > > [ cut here ]
> > > > WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
> > > > pfn = de700
> > > > Modules linked in:
> > > > CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
> > > > Hardware name: Hisilicon A9
> > > > [] (show_stack) from [] (dump_stack+0x9c/0xc0)
> > > > [] (dump_stack) from [] (__warn+0xc0/0xec)
> > > > [] (__warn) from [] (warn_slowpath_fmt+0x74/0xa4)
> > > > [] (warn_slowpath_fmt) from []
> > > > (add_to_free_list+0x8c/0xec)
> > > > [] (add_to_free_list) from []
> > > > (free_pcppages_bulk+0x200/0x278)
> > > > [] (free_pcppages_bulk) from []
> > > > (free_unref_page+0x58/0x68)
> > > > [] (free_unref_page) from []
> > > > (free_highmem_page+0xc/0x50)
> > > > [] (free_highmem_page) from [] 
> > > > (mem_init+0x21c/0x254)
> > > > [] (mem_init) from [] (start_kernel+0x258/0x5c0)
> > > > [] (start_kernel) from [<>] (0x0)
> > > > 
> > > > so any idea?
> > > 
> > > If pfn = 0xde700, due to the pageblock_nr_pages = 0x200, then the
> > > start_pfn,en

Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-07 Thread Mike Rapoport
On Fri, May 07, 2021 at 03:17:08PM +0800, Kefeng Wang wrote:
> 
> On 2021/5/6 20:47, Kefeng Wang wrote:
> > 
> > 
> > > > > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > > > > move_freepages at
> > > > > > 
> > > > > > start_pfn/end_pfn [de600, de7ff], [de60, de7ff000]
> > > > > > :  pfn =de600, page
> > > > > > =ef3cc000, page-flags = ,  pfn2phy = de60
> > > > > > 
> > > > > > > > __free_memory_core, range: 0xb020 - 0xc000, pfn: b0200 - b0200
> > > > > > > > __free_memory_core, range: 0xcc00 - 0xdca0, pfn: cc000 - b0200
> > > > > > > > __free_memory_core, range: 0xde70 - 0xdea0, pfn: de700 - b0200
> > > > > 
> > > > > Hmm, [de600, de7ff] is not added to the free lists which is
> > > > > correct. But
> > > > > then it's unclear how the page for de600 gets to move_freepages()...
> > > > > 
> > > > > Can't say I have any bright ideas to try here...
> > > > 
> > > > Are we missing some checks (e.g., PageReserved()) that
> > > > pfn_valid_within()
> > > > would have "caught" before?
> > > 
> > > Unless I'm missing something the crash happens in __rmqueue_fallback():
> > > 
> > > do_steal:
> > > page = get_page_from_free_area(area, fallback_mt);
> > > 
> > > steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
> > >     can_steal);
> > >     -> move_freepages()
> > >     -> BUG()
> > > 
> > > So a page from free area should be sane as the freed range was never
> > > added
> > > it to the free lists.
> > 
> > Sorry for the late response due to the vacation.
> > 
> > The pfn in range [de600, de7ff] won't be added into the free lists via
> > __free_memory_core(), but the pfn could be added into freelists via
> > free_highmem_page()
> > 
> > I add some debug[1] in add_to_free_list(), we could see the calltrace
> > 
> > free_highpages, range_pfn [b0200, c], range_addr [b020, c000]
> > free_highpages, range_pfn [cc000, dca00], range_addr [cc00, dca0]
> > free_highpages, range_pfn [de700, dea00], range_addr [de70, dea0]
> > add_to_free_list, ===> pfn = de700
> > [ cut here ]
> > WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:900 add_to_free_list+0x8c/0xec
> > pfn = de700
> > Modules linked in:
> > CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.0+ #48
> > Hardware name: Hisilicon A9
> > [] (show_stack) from [] (dump_stack+0x9c/0xc0)
> > [] (dump_stack) from [] (__warn+0xc0/0xec)
> > [] (__warn) from [] (warn_slowpath_fmt+0x74/0xa4)
> > [] (warn_slowpath_fmt) from []
> > (add_to_free_list+0x8c/0xec)
> > [] (add_to_free_list) from []
> > (free_pcppages_bulk+0x200/0x278)
> > [] (free_pcppages_bulk) from []
> > (free_unref_page+0x58/0x68)
> > [] (free_unref_page) from []
> > (free_highmem_page+0xc/0x50)
> > [] (free_highmem_page) from [] (mem_init+0x21c/0x254)
> > [] (mem_init) from [] (start_kernel+0x258/0x5c0)
> > [] (start_kernel) from [<>] (0x0)
> > 
> > so any idea?
> 
> If pfn = 0xde700, due to the pageblock_nr_pages = 0x200, then the
> start_pfn,end_pfn passed to move_freepages() will be [de600, de7ff],
> but the range of [de600,de700] without ‘struct page' will lead to
> this panic when pfn_valid_within not enabled if no HOLES_IN_ZONE,
> and the same issue will occur in isolate_freepages_block(), maybe

I think your analysis is correct except one minor detail. With the #ifdef
fix I've proposed earlier [1] the memmap for [0xde600, 0xde700] should not
be freed so there should be a struct page. Did you check what parts of the
memmap are actually freed with this patch applied?
Would you get a panic if you add

dump_page(pfn_to_page(0xde600), "");

say, in the end of memblock_free_all()?

> there are some scene, so I select HOLES_IN_ZONE in ARCH_HISI(ARM) to solve
> this issue in our 5.10, should we select HOLES_IN_ZONE in all ARM or only in
> ARCH_HISI, any better solution?  Thanks.

I don't think that HOLES_IN_ZONE is the right solution. I believe that we
must keep the memory map aligned on pageblock boundaries. That's surely not the
case for SPARSEMEM as of now, and if my fix is not enough we need to find
where it went wrong.
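
To make the pageblock-alignment point concrete, here is a minimal user-space
sketch of the rounding that move_freepages_block() performs; pageblock_nr_pages
= 0x200 and the pfns are the values reported in this thread, and the snippet is
an illustration rather than the kernel code path:

#include <stdio.h>

int main(void)
{
	unsigned long pageblock_nr_pages = 0x200;	/* from this thread */
	unsigned long pfn = 0xde700;			/* first pfn with memory */

	/* round down to the pageblock boundary, as move_freepages_block() does */
	unsigned long start_pfn = pfn & ~(pageblock_nr_pages - 1);
	unsigned long end_pfn = start_pfn + pageblock_nr_pages - 1;

	/* prints [de600, de7ff]: the walk starts in the hole below 0xde700,
	 * so the memory map there must exist and be properly initialized */
	printf("[%lx, %lx]\n", start_pfn, end_pfn);
	return 0;
}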

Besides, I'd say that if it is possible to update your firmware to make the
memory layout reported to the kernel less, hmm, esoteric, you would hit
less corner cases.

[1] https://lore.kernel.org/lkml/yipy8txcsc7lf...@kernel.org

-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-03 Thread Mike Rapoport
On Mon, May 03, 2021 at 10:07:01AM +0200, David Hildenbrand wrote:
> On 03.05.21 08:26, Mike Rapoport wrote:
> > On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
> > > 
> > > 
> > > On 2021/4/30 17:51, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> > > > > 
> > > > > On 2021/4/29 14:57, Mike Rapoport wrote:
> > > > > 
> > > > > > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > > > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your
> > > > > > > > configuration?
> > > > > > > Yes,
> > > > > > > 
> > > > > > > CONFIG_SPARSEMEM=y
> > > > > > > 
> > > > > > > CONFIG_SPARSEMEM_STATIC=y
> > > > > > > 
> > > > > > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > > > > > 
> > > > > > > CONFIG_PAGE_OFFSET=0xC000
> > > > > > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > > > > > CONFIG_HIGHMEM=y
> > > > > > > #define SECTION_SIZE_BITS26
> > > > > > > #define MAX_PHYSADDR_BITS32
> > > > > > > #define MAX_PHYSMEM_BITS 32
> > > > > 
> > > > > 
> > > > > With the patch,  the addr is aligned, but the panic still occurred,
> > > > 
> > > > Is this the same panic at move_freepages() for range [de600, de7ff]?
> > > > 
> > > > Do you enable CONFIG_ARM_LPAE?
> > > 
> > > no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> > > move_freepages at
> > > 
> > > start_pfn/end_pfn [de600, de7ff], [de60, de7ff000] :  pfn =de600, page
> > > =ef3cc000, page-flags = ,  pfn2phy = de60
> > > 
> > > > > __free_memory_core, range: 0xb020 - 0xc000, pfn: b0200 - b0200
> > > > > __free_memory_core, range: 0xcc00 - 0xdca0, pfn: cc000 - b0200
> > > > > __free_memory_core, range: 0xde70 - 0xdea0, pfn: de700 - b0200
> > 
> > Hmm, [de600, de7ff] is not added to the free lists which is correct. But
> > then it's unclear how the page for de600 gets to move_freepages()...
> > 
> > Can't say I have any bright ideas to try here...
> 
> Are we missing some checks (e.g., PageReserved()) that pfn_valid_within()
> would have "caught" before?

Unless I'm missing something the crash happens in __rmqueue_fallback():

do_steal:
page = get_page_from_free_area(area, fallback_mt);

steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
can_steal);
-> move_freepages() 
-> BUG()

So a page from free area should be sane as the freed range was never added
it to the free lists.

And honestly, with the memory layout reported elsewhere in the stack I'd
say that the bootloader/fdt beg for fixes...

-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-05-03 Thread Mike Rapoport
On Fri, Apr 30, 2021 at 07:24:37PM +0800, Kefeng Wang wrote:
> 
> 
> On 2021/4/30 17:51, Mike Rapoport wrote:
> > On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> > > 
> > > On 2021/4/29 14:57, Mike Rapoport wrote:
> > > 
> > > > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> > > > > Yes,
> > > > > 
> > > > > CONFIG_SPARSEMEM=y
> > > > > 
> > > > > CONFIG_SPARSEMEM_STATIC=y
> > > > > 
> > > > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > > > 
> > > > > CONFIG_PAGE_OFFSET=0xC000
> > > > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > > > CONFIG_HIGHMEM=y
> > > > > #define SECTION_SIZE_BITS26
> > > > > #define MAX_PHYSADDR_BITS32
> > > > > #define MAX_PHYSMEM_BITS 32
> > > 
> > > 
> > > With the patch,  the addr is aligned, but the panic still occurred,
> > 
> > Is this the same panic at move_freepages() for range [de600, de7ff]?
> > 
> > Do you enable CONFIG_ARM_LPAE?
> 
> no, the CONFIG_ARM_LPAE is not set, and yes with same panic at
> move_freepages at
> 
> start_pfn/end_pfn [de600, de7ff], [de60, de7ff000] :  pfn =de600, page
> =ef3cc000, page-flags = ,  pfn2phy = de60
> 
> > > __free_memory_core, range: 0xb020 - 0xc000, pfn: b0200 - b0200
> > > __free_memory_core, range: 0xcc00 - 0xdca0, pfn: cc000 - b0200
> > > __free_memory_core, range: 0xde70 - 0xdea0, pfn: de700 - b0200

Hmm, [de600, de7ff] is not added to the free lists which is correct. But
then it's unclear how the page for de600 gets to move_freepages()...

Can't say I have any bright ideas to try here...

> the __free_memory_core will check the start pfn and end pfn,
> 
>  if (start_pfn >= end_pfn)
>  return 0;
> 
>  __free_pages_memory(start_pfn, end_pfn);
> so the memory will not be freed to buddy, confused...

It's a check for range validity; all valid ranges are added.

> > > __free_memory_core, range: 0xe080 - 0xe0c0, pfn: e0800 - b0200
> > > __free_memory_core, range: 0xf4b0 - 0xf700, pfn: f4b00 - b0200
> > > __free_memory_core, range: 0xfda0 - 0x, pfn: fda00 - b0200
> > > > It seems that with SPARSEMEM we don't align the freed parts on pageblock
> > > > boundaries.
> > > > 
> > > > Can you try the patch below:
> > > > 
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index afaefa8fc6ab..1926369b52ec 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
> > > >  * due to SPARSEMEM sections which aren't present.
> > > >  */
> > > > start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> > > > -#else
> > > > +#endif
> > > > /*
> > > >  * Align down here since the VM subsystem insists that the
> > > >  * memmap entries are valid from the bank start aligned to
> > > >  * MAX_ORDER_NR_PAGES.
> > > >  */
> > > > start = round_down(start, MAX_ORDER_NR_PAGES);
> > > > -#endif
> > > > /*
> > > >  * If we had a previous bank, and there is a space
> > > > 
> > 

-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-04-30 Thread Mike Rapoport
On Thu, Apr 29, 2021 at 06:22:55PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/29 14:57, Mike Rapoport wrote:
> 
> > > > Do you use SPARSEMEM? If yes, what is your section size?
> > > > What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?
> > > Yes,
> > > 
> > > CONFIG_SPARSEMEM=y
> > > 
> > > CONFIG_SPARSEMEM_STATIC=y
> > > 
> > > CONFIG_FORCE_MAX_ZONEORDER = 11
> > > 
> > > CONFIG_PAGE_OFFSET=0xC000
> > > CONFIG_HAVE_ARCH_PFN_VALID=y
> > > CONFIG_HIGHMEM=y
> > > #define SECTION_SIZE_BITS26
> > > #define MAX_PHYSADDR_BITS32
> > > #define MAX_PHYSMEM_BITS 32
> 
> 
> With the patch,  the addr is aligned, but the panic still occurred,

Is this the same panic at move_freepages() for range [de600, de7ff]?

Do you enable CONFIG_ARM_LPAE?

> new free memory log is below,
> 
> memblock_free: [0xaf43-0xaf44] mem_init+0x158/0x23c
> 
> memblock_free: [0xaf51-0xaf53] mem_init+0x158/0x23c
> memblock_free: [0xaf56-0xaf57] mem_init+0x158/0x23c
> memblock_free: [0xafd98000-0xafdc7fff] mem_init+0x158/0x23c
> memblock_free: [0xafdd8000-0xafdf] mem_init+0x158/0x23c
> memblock_free: [0xafe18000-0xafe7] mem_init+0x158/0x23c
> memblock_free: [0xafee-0xafef] mem_init+0x158/0x23c
> __free_memory_core, range: 0x80a03000 - 0x80a04000, pfn: 80a03 - 80a04
> __free_memory_core, range: 0x80a08000 - 0x80b0, pfn: 80a08 - 80b00
> __free_memory_core, range: 0x812e8058 - 0x8300, pfn: 812e9 - 83000
> __free_memory_core, range: 0x8500 - 0x8560, pfn: 85000 - 85600
> __free_memory_core, range: 0x86a0 - 0x87e0, pfn: 86a00 - 87e00
> __free_memory_core, range: 0x8bd0 - 0x8c50, pfn: 8bd00 - 8c500
> __free_memory_core, range: 0x8e30 - 0x8ed0, pfn: 8e300 - 8ed00
> __free_memory_core, range: 0x90d0 - 0xaf2c, pfn: 90d00 - af2c0
> __free_memory_core, range: 0xaf43 - 0xaf45, pfn: af430 - af450
> __free_memory_core, range: 0xaf51 - 0xaf54, pfn: af510 - af540
> __free_memory_core, range: 0xaf56 - 0xaf58, pfn: af560 - af580
> __free_memory_core, range: 0xafd98000 - 0xafdc8000, pfn: afd98 - afdc8
> __free_memory_core, range: 0xafdd8000 - 0xafe0, pfn: afdd8 - afe00
> __free_memory_core, range: 0xafe18000 - 0xafe8, pfn: afe18 - afe80
> __free_memory_core, range: 0xafee - 0xaff0, pfn: afee0 - aff00
> __free_memory_core, range: 0xaff8 - 0xaff8d000, pfn: aff80 - aff8d
> __free_memory_core, range: 0xafff2000 - 0xafff4580, pfn: afff2 - afff4
> __free_memory_core, range: 0xafffe000 - 0xafffe0e0, pfn: afffe - afffe
> __free_memory_core, range: 0xafffe4fc - 0xafffe500, pfn: a - afffe
> __free_memory_core, range: 0xafffe6e4 - 0xafffe700, pfn: a - afffe
> __free_memory_core, range: 0xafffe8dc - 0xafffe8e0, pfn: a - afffe
> __free_memory_core, range: 0xafffe970 - 0xafffe980, pfn: a - afffe
> __free_memory_core, range: 0xafffe990 - 0xafffe9a0, pfn: a - afffe
> __free_memory_core, range: 0xafffe9a4 - 0xafffe9c0, pfn: a - afffe
> __free_memory_core, range: 0xafffeb54 - 0xafffeb60, pfn: a - afffe
> __free_memory_core, range: 0xafffecf4 - 0xafffed00, pfn: a - afffe
> __free_memory_core, range: 0xafffefc4 - 0xafffefd8, pfn: a - afffe
> __free_memory_core, range: 0xb020 - 0xc000, pfn: b0200 - b0200
> __free_memory_core, range: 0xcc00 - 0xdca0, pfn: cc000 - b0200
> __free_memory_core, range: 0xde70 - 0xdea0, pfn: de700 - b0200

The range [de600, de7ff] 

> __free_memory_core, range: 0xe080 - 0xe0c0, pfn: e0800 - b0200
> __free_memory_core, range: 0xf4b0 - 0xf700, pfn: f4b00 - b0200
> __free_memory_core, range: 0xfda0 - 0x, pfn: fda00 - b0200
> > It seems that with SPARSEMEM we don't align the freed parts on pageblock
> > boundaries.
> > 
> > Can you try the patch below:
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index afaefa8fc6ab..1926369b52ec 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1941,14 +1941,13 @@ static void __init free_unused_memmap(void)
> >  * due to SPARSEMEM sections which aren't present.
> >  */
> > start = min(start, ALIGN(prev_end, PAGES_PER_SECTION));
> > -#else
> > +#endif
> > /*
> >  * Align down here since the VM subsystem insists that the
> >  * memmap entries are valid from the bank start aligned to
> >  * MAX_ORDER_NR_PAGES.
> >  */
> > start = round_down(start, MAX_ORDER_NR_PAGES);
> > -#endif
> > /*
> >  * If we had a previous bank, and there is a space
> > 

-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-04-27 Thread Mike Rapoport
On Tue, Apr 27, 2021 at 07:08:59PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/27 14:23, Mike Rapoport wrote:
> > On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
> > > On 2021/4/26 13:20, Mike Rapoport wrote:
> > > > On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> > > > > On 2021/4/25 15:19, Mike Rapoport wrote:
> > > > > 
> > > > >   On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> > > > > 
> > > > >   I tested this patchset(plus arm32 change, like arm64 does)
> > > > >   based on lts 5.10,add some debug log, the useful info shows
> > > > >   below, if we enable HOLES_IN_ZONE, no panic, any idea,
> > > > >   thanks.
> > > > > 
> > > > >   Are there any changes on top of 5.10 except for pfn_valid() 
> > > > > patch?
> > > > >   Do you see this panic on 5.10 without the changes?
> > > > > 
> > > > > Yes, there are some BSP support for arm board based on 5.10,
> > Is it possible to test 5.12?

Do you use SPARSEMEM? If yes, what is your section size?
What is the value of CONFIG_FORCE_MAX_ZONEORDER in your configuration?

-- 
Sincerely yours,
Mike.


Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-04-27 Thread Mike Rapoport
On Mon, Apr 26, 2021 at 11:26:38PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/26 13:20, Mike Rapoport wrote:
> > On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> > > On 2021/4/25 15:19, Mike Rapoport wrote:
> > > 
> > >  On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> > > 
> > >  I tested this patchset(plus arm32 change, like arm64 does)
> > >  based on lts 5.10,add some debug log, the useful info shows
> > >  below, if we enable HOLES_IN_ZONE, no panic, any idea,
> > >  thanks.
> > > 
> > >  Are there any changes on top of 5.10 except for pfn_valid() patch?
> > >  Do you see this panic on 5.10 without the changes?
> > > 
> > > Yes, there are some BSP support for arm board based on 5.10,

Is it possible to test 5.12?

> > > with or without your patch will get same panic, the panic pfn=de600
> > > in the range of [dcc00,de00] which is freed by free_memmap, start_pfn
> > > = dcc00,  dcc0 end_pfn = de700, de70
> > > 
> > > we see the PC is at PageLRU, same reason like arm64 panic log,
> > > 
> > > "PageBuddy in move_freepages returns false
> > >  Then we call PageLRU, the macro calls PF_HEAD which is 
> > > compound_page()
> > >  compound_page reads page->compound_head, it is 0x, 
> > > so it
> > >  returns 0xfffe - and accessing this address causes 
> > > crash"
> > > 
> > >  Can you see stack backtrace beyond move_freepages_block?
> > > 
> > > I do some oom test, so the log is about memory allocate,
> > > 
> > > [] (move_freepages_block) from []
> > > (steal_suitable_fallback+0x174/0x1f4)
> > > 
> > > [] (steal_suitable_fallback) from [] 
> > > (get_page_from_freelist+0x490/0x9a4)
> >
> > Hmm, this is called with a page from free list, having a page from a freed
> > part of the memory map passed to steal_suitable_fallback() means that there
> > is an issue with creation of the free list.
> > 
> > Can you please add "memblock=debug" to the kernel command line and post the
> > log?
> 
> Here is the log,
> 
> CPU: ARMv7 Processor [413fc090] revision 0 (ARMv7), cr=1ac5387d
> 
> CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
> OF: fdt: Machine model: HISI-CA9
> memblock_add: [0x80a0-0x855f] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0x86a0-0x87df] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0x8bd0-0x8c4f] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0x8e30-0x8ecf] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0x90d0-0xbfff] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xcc00-0xdc9f] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xe080-0xe0bf] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xf530-0xf5bf] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xf5c0-0xf6ff] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xfe10-0xfebf] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xfec0-0x] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xde70-0xde9f] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xf4b0-0xf52f] early_init_dt_scan_memory+0x11c/0x188
> memblock_add: [0xfda0-0xfe0f] early_init_dt_scan_memory+0x11c/0x188
> memblock_reserve: [0x80a01000-0x80a02d2e] setup_arch+0x68/0x5c4
> Malformed early option 'vecpage_wrprotect'
> Memory policy: Data cache writealloc
> memblock_reserve: [0x80b0-0x812e8057] arm_memblock_init+0x34/0x14c
> memblock_reserve: [0x8300-0x84ff] arm_memblock_init+0x100/0x14c
> memblock_reserve: [0x80a04000-0x80a07fff] arm_memblock_init+0xa0/0x14c
> memblock_reserve: [0x80a0-0x80a02fff] hisi_mem_reserve+0x14/0x30
> MEMBLOCK configuration:
>  memory size = 0x4c0f reserved size = 0x027ef058
>  memory.cnt  = 0xa
>  memory[0x0]    [0x80a0-0x855f], 0x04c0 bytes flags: 0x0
>  memory[0x1]    [0x86a0-0x87df], 0x0140 bytes flags: 0x0
>  memory[0x2]    [0x8bd0-0x8c4f], 0x0080 bytes flags: 0x0
>  memory[0x3]    [0x8e30-0x8ecf], 0x00a0 bytes flags: 0x0
>  memory[0x4]    [0x90d0-0xbfff], 0x2f30 bytes flags: 0x0
>  memory[0x5]    [0xcc00-0xdc9f], 0x10a0 bytes flags: 0x0
>  memory[0x6]    [0xde70-0xde9f], 0x0030 bytes flags: 0x0
>  memory[0x7]    [0xe080-0xe0bf], 0x0040 bytes flags: 0x0
>  memory[0x8]    [0xf4b0

Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-04-25 Thread Mike Rapoport
On Sun, Apr 25, 2021 at 03:51:56PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/25 15:19, Mike Rapoport wrote:
> 
> On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> 
> I tested this patchset(plus arm32 change, like arm64 does) based on 
> lts
> 5.10,add
> 
> some debug log, the useful info shows below, if we enable 
> HOLES_IN_ZONE, no
> panic,
> 
> any idea, thanks.
> 
> 
> Are there any changes on top of 5.10 except for pfn_valid() patch?
> Do you see this panic on 5.10 without the changes?
> 
> Yes, there are some BSP support for arm board based on 5.10, with or without
> 
> your patch will get same panic, the panic pfn=de600 in the range of
> [dcc00,de00]
> 
> which is freed by free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de700,
> de70
> 
> we see the PC is at PageLRU, same reason like arm64 panic log,
> 
>"PageBuddy in move_freepages returns false
> Then we call PageLRU, the macro calls PF_HEAD which is compound_page()
> compound_page reads page->compound_head, it is 0x, so it
> returns 0xfffe - and accessing this address causes crash"
> 
> Can you see stack backtrace beyond move_freepages_block?
> 
> I do some oom test, so the log is about memory allocate,
> 
> [] (move_freepages_block) from []
> (steal_suitable_fallback+0x174/0x1f4)
> 
> [] (steal_suitable_fallback) from [] 
> (get_page_from_freelist+0x490/0x9a4)

Hmm, this is called with a page from free list, having a page from a freed
part of the memory map passed to steal_suitable_fallback() means that there
is an issue with creation of the free list.

Can you please add "memblock=debug" to the kernel command line and post the
log?

> [] (get_page_from_freelist) from [] 
> (__alloc_pages_nodemask+0x188/0xc08)
> [] (__alloc_pages_nodemask) from [] 
> (alloc_zeroed_user_highpage_movable+0x14/0x3c)
> [] (alloc_zeroed_user_highpage_movable) from [] 
> (handle_mm_fault+0x254/0xac8)
> [] (handle_mm_fault) from [] (do_page_fault+0x228/0x2f4)
> [] (do_page_fault) from [] (do_DataAbort+0x48/0xd0)
> [] (do_DataAbort) from [] (__dabt_usr+0x40/0x60)
> 
> 
> 
> Zone ranges:
>   Normal   [mem 0x80a0-0xb01f]
>   HighMem  [mem 0xb020-0xefff]
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x80a0-0x855f]
>   node   0: [mem 0x86a0-0x87df]
>   node   0: [mem 0x8bd0-0x8c4f]
>   node   0: [mem 0x8e30-0x8ecf]
>   node   0: [mem 0x90d0-0xbfff]
>   node   0: [mem 0xcc00-0xdc9f]
>   node   0: [mem 0xde70-0xde9f]
>   node   0: [mem 0xe080-0xe0bf]
>   node   0: [mem 0xf4b0-0xf6ff]
>   node   0: [mem 0xfda0-0xefff]
> 
> > free_memmap, start_pfn = 85800,  8580 end_pfn = 86a00, 86a0
> > free_memmap, start_pfn = 8c800,  8c80 end_pfn = 8e300, 8e30
> > free_memmap, start_pfn = 8f000,  8f00 end_pfn = 9, 9000
> > free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de700, de70
> > free_memmap, start_pfn = dec00,  dec0 end_pfn = e, e000
> > free_memmap, start_pfn = e0c00,  e0c0 end_pfn = e4000, e400
> > free_memmap, start_pfn = f7000,  f700 end_pfn = f8000, f800
> === >move_freepages: start_pfn/end_pfn [de601, de7ff], [de60, de7ff000]
> :  pfn =de600 pfn2phy = de60 , page = ef3cc000, page-flags = 
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address fffe
> pgd = 5dd50df5
> [fffe] *pgd=a861, *pte=, *ppte=
> Internal error: Oops: 37 [#1] SMP ARM
> Modules linked in: gmac(O)
> CPU: 2 PID: 635 Comm: test-oom Tainted: G   O  5.10.0+ #31
> Hardware name: Hisilicon A9
> PC is at move_freepages_block+0x150/0x278
> LR is at move_freepages_block+0x150/0x278
> pc : []    lr : []    psr: 200e0393
> sp : c4179cf8  ip :   fp : 0001
> r10: c4179d58  r9 : 000de7ff  r8 : 
> r7 : c0863280  r6 : 000de600  r5 : 000de600  r4 : ef3cc000
> r3 :   r2 :   r1 :

Re: arm32: panic in move_freepages (Was [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid())

2021-04-25 Thread Mike Rapoport
On Fri, Apr 23, 2021 at 04:11:16PM +0800, Kefeng Wang wrote:
> 
> I tested this patchset (plus an arm32 change, like arm64 does) based on
> LTS 5.10 and added some debug logs; the useful info is shown below. If we
> enable HOLES_IN_ZONE, there is no panic.
> 
> Any idea? Thanks.
 
Are there any changes on top of 5.10 except for pfn_valid() patch?
Do you see this panic on 5.10 without the changes?
Can you see stack backtrace beyond move_freepages_block?

> Zone ranges:
>   Normal   [mem 0x80a0-0xb01f]
>   HighMem  [mem 0xb020-0xefff]
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x80a0-0x855f]
>   node   0: [mem 0x86a0-0x87df]
>   node   0: [mem 0x8bd0-0x8c4f]
>   node   0: [mem 0x8e30-0x8ecf]
>   node   0: [mem 0x90d0-0xbfff]
>   node   0: [mem 0xcc00-0xdc9f]
>   node   0: [mem 0xde70-0xde9f]
>   node   0: [mem 0xe080-0xe0bf]
>   node   0: [mem 0xf4b0-0xf6ff]
>   node   0: [mem 0xfda0-0xefff]
> 
> > free_memmap, start_pfn = 85800,  8580 end_pfn = 86a00, 86a0
> > free_memmap, start_pfn = 8c800,  8c80 end_pfn = 8e300, 8e30
> > free_memmap, start_pfn = 8f000,  8f00 end_pfn = 9, 9000
> > free_memmap, start_pfn = dcc00,  dcc0 end_pfn = de700, de70
> > free_memmap, start_pfn = dec00,  dec0 end_pfn = e, e000
> > free_memmap, start_pfn = e0c00,  e0c0 end_pfn = e4000, e400
> > free_memmap, start_pfn = f7000,  f700 end_pfn = f8000, f800
> === >move_freepages: start_pfn/end_pfn [de601, de7ff], [de60, de7ff000]
> :  pfn =de600 pfn2phy = de60 , page = ef3cc000, page-flags = 
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address fffe
> pgd = 5dd50df5
> [fffe] *pgd=a861, *pte=, *ppte=
> Internal error: Oops: 37 [#1] SMP ARM
> Modules linked in: gmac(O)
> CPU: 2 PID: 635 Comm: test-oom Tainted: G   O  5.10.0+ #31
> Hardware name: Hisilicon A9
> PC is at move_freepages_block+0x150/0x278
> LR is at move_freepages_block+0x150/0x278
> pc : []    lr : []    psr: 200e0393
> sp : c4179cf8  ip :   fp : 0001
> r10: c4179d58  r9 : 000de7ff  r8 : 
> r7 : c0863280  r6 : 000de600  r5 : 000de600  r4 : ef3cc000
> r3 :   r2 :   r1 : ef5d069c  r0 : fffe
> Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
> Control: 1ac5387d  Table: 83b0c04a  DAC: 
> Process test-oom (pid: 635, stack limit = 0x25d667df)
> 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-25 Thread Mike Rapoport
On Thu, Apr 22, 2021 at 11:28:24PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/22 15:29, Mike Rapoport wrote:
> > On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
> > > On 2021/4/21 14:51, Mike Rapoport wrote:
> > > > From: Mike Rapoport 
> > > > 
> > > > Hi,
> > > > 
> > > > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially 
> > > > hardwire
> > > > pfn_valid_within() to 1.
> > > > 
> > > > The idea is to mark NOMAP pages as reserved in the memory map and 
> > > > restore
> > > > the intended semantics of pfn_valid() to designate availability of 
> > > > struct
> > > > page for a pfn.
> > > > 
> > > > With this the core mm will be able to cope with the fact that it cannot 
> > > > use
> > > > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER 
> > > > blocks
> > > > will be treated correctly even without the need for pfn_valid_within.
> > > > 
> > > > The patches are only boot tested on qemu-system-aarch64 so I'd really
> > > > appreciate memory stress tests on real hardware.
> > > > 
> > > > If this actually works we'll be one step closer to drop custom 
> > > > pfn_valid()
> > > > on arm64 altogether.
> > > Hi Mike, I have a question. Without HOLES_IN_ZONE, the pfn_valid_within()
> > > in move_freepages_block()->move_freepages() will be optimized out. If
> > > there are holes in a zone, the 'struct page' (memory map) for the pfn
> > > range of the hole will be freed by free_memmap(), and then the page
> > > traversal in the zone (with holes) from move_freepages() will hit the
> > > wrong page, so it could panic at the PageLRU(page) test; see link [1].
> > First, the HOLES_IN_ZONE name is hugely misleading: this configuration option
> > has nothing to do with memory holes, but rather it is there to deal with
> > holes or undefined struct pages in the memory map, when these holes can be
> > inside a MAX_ORDER_NR_PAGES region.
> > 
> > In general pfn walkers use pfn_valid() and pfn_valid_within() to avoid
> > accessing *missing* struct pages, like those that are freed at
> > free_memmap(). But on arm64 these tests also filter out the nomap entries
> > because their struct pages are not initialized.
> > 
> > The panic you refer to happened because there was an uninitialized struct
> > page in the middle of MAX_ORDER_NR_PAGES region because it corresponded to
> > nomap memory.
> > 
> > With these changes I make sure that such pages will be properly initialized
> > as PageReserved and the pfn walkers will be able to rely on the memory map.
> > 
> > Note also, that free_memmap() aligns the parts being freed on MAX_ORDER
> > boundaries, so there will be no missing parts in the memory map within a
> > MAX_ORDER_NR_PAGES region.
> 
> Ok, thanks, we hit the same panic as in the link on arm32 (without
> HOLES_IN_ZONE).
> 
> The scheme for arm64 could suit arm32 as well, right?

In general yes. You just need to make sure that usage of pfn_valid() in
arch/arm does not presume that it tests something beyond availability of
struct page for a pfn.
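
For illustration only, here is a minimal sketch of what the arm32 counterpart
could look like, mirroring the pfn_is_map_memory() helper this series adds on
arm64. The placement (e.g. arch/arm/mm/init.c) and the assumption that arm32
wants the identical split are mine, not part of the posted patches:

int pfn_is_map_memory(unsigned long pfn)
{
	phys_addr_t addr = PFN_PHYS(pfn);

	/* reject bogus PFNs whose physical address would not round-trip */
	if (PHYS_PFN(addr) != pfn)
		return 0;

	return memblock_is_map_memory(addr);
}
EXPORT_SYMBOL(pfn_is_map_memory);

Whether arm32 actually needs every caller converted the same way is of course
something the testing you mention would have to confirm.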
 
> I will try the patchset with some changes on arm32 and give some
> feedback.
> 
> Again, a stupid question: where do the memblock regions get marked with the
> MEMBLOCK_NOMAP flag?
 
Not sure I understand the question. The memory regions with "nomap"
property in the device tree will be marked MEMBLOCK_NOMAP.
 
> > > "The idea is to mark NOMAP pages as reserved in the memory map", I see the
> > > patch2 check memblock_is_nomap() in memory region
> > > of memblock, but it seems that memblock_mark_nomap() is not called(maybe I
> > > missed), then memmap_init_reserved_pages() won't
> > > work, so should the HOLES_IN_ZONE still be needed for generic mm code?
> > > 
> > > [1] 
> > > https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1...@arm.com/
> > > 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-22 Thread Mike Rapoport
On Thu, Apr 22, 2021 at 03:00:20PM +0800, Kefeng Wang wrote:
> 
> On 2021/4/21 14:51, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Hi,
> > 
> > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
> > pfn_valid_within() to 1.
> > 
> > The idea is to mark NOMAP pages as reserved in the memory map and restore
> > the intended semantics of pfn_valid() to designate availability of struct
> > page for a pfn.
> > 
> > With this the core mm will be able to cope with the fact that it cannot use
> > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
> > will be treated correctly even without the need for pfn_valid_within.
> > 
> > The patches are only boot tested on qemu-system-aarch64 so I'd really
> > appreciate memory stress tests on real hardware.
> > 
> > If this actually works we'll be one step closer to drop custom pfn_valid()
> > on arm64 altogether.
> 
> Hi Mike, I have a question. Without HOLES_IN_ZONE, the pfn_valid_within() in
> move_freepages_block()->move_freepages() will be optimized out. If there are
> holes in a zone, the 'struct page' (memory map) for the pfn range of the hole
> will be freed by free_memmap(), and then the page traversal in the zone (with
> holes) from move_freepages() will hit the wrong page, so it could panic at
> the PageLRU(page) test; see link [1].

First, the HOLES_IN_ZONE name is hugely misleading: this configuration option
has nothing to do with memory holes, but rather it is there to deal with
holes or undefined struct pages in the memory map, when these holes can be
inside a MAX_ORDER_NR_PAGES region.

In general pfn walkers use pfn_valid() and pfn_valid_within() to avoid
accessing *missing* struct pages, like those that are freed at
free_memmap(). But on arm64 these tests also filter out the nomap entries
because their struct pages are not initialized.

The panic you refer to happened because there was an uninitialized struct
page in the middle of MAX_ORDER_NR_PAGES region because it corresponded to
nomap memory.

With these changes I make sure that such pages will be properly initialized
as PageReserved and the pfn walkers will be able to rely on the memory map.

Note also, that free_memmap() aligns the parts being freed on MAX_ORDER
boundaries, so there will be no missing parts in the memory map within a
MAX_ORDER_NR_PAGES region.
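
To make that concrete, here is a minimal sketch (not taken from the patches)
of how a pfn walker can now rely on the memory map; start_pfn and end_pfn are
assumed bounds of the range being scanned:

	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))		/* no memmap entry at all */
			continue;

		page = pfn_to_page(pfn);
		if (PageReserved(page))		/* NOMAP or firmware-reserved frame */
			continue;

		/* safe to inspect the page state here */
	}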
 
> "The idea is to mark NOMAP pages as reserved in the memory map", I see the
> patch2 check memblock_is_nomap() in memory region
> of memblock, but it seems that memblock_mark_nomap() is not called(maybe I
> missed), then memmap_init_reserved_pages() won't
> work, so should the HOLES_IN_ZONE still be needed for generic mm code?
> 
> [1] 
> https://lore.kernel.org/linux-arm-kernel/541193a6-2bce-f042-5bb2-88913d5f1...@arm.com/
> 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-04-22 Thread Mike Rapoport
From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_map_memory() wrapper for
memblock_is_map_memory() to perform such check and use it where
appropriate.

Using a wrapper allows us to avoid cyclic include dependencies.

While here also update style of pfn_valid() so that both pfn_valid() and
pfn_is_map_memory() declarations will be consistent.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 12 
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 6 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..194f9f993d30 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
 
 #define virt_addr_valid(addr)  ({  \
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  
\
 })
 
 void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..75ddfe671393 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -37,7 +37,8 @@ void copy_highpage(struct page *to, struct page *from);
 
 typedef struct page *pgtable_t;
 
-extern int pfn_valid(unsigned long);
+int pfn_valid(unsigned long pfn);
+int pfn_is_map_memory(unsigned long pfn);
 
 #include 
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..23dd99e29b23 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-   return !pfn_valid(pfn);
+   return !pfn_is_map_memory(pfn);
 }
 
 /*
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 3685e12aba9b..966a7a18d528 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -258,6 +258,18 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_map_memory(unsigned long pfn)
+{
+   phys_addr_t addr = PFN_PHYS(pfn);
+
+   /* avoid false positives for bogus PFNs, see comment in pfn_valid() */
+   if (PHYS_PFN(addr) != pfn)
+   return 0;
+
+   return memblock_is_map_memory(addr);
+}
+EXPORT_SYMBOL(pfn_is_map_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b5e83c46b23e..b7c81dacabf0 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, 
size_t size,
/*
 * Don't allow RAM to be mapped.
 */
-   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
+   if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr
return NULL;
 
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
/* For normal memory we already have a cacheable mapping. */
-   if (pfn_valid(__phys_to_pfn(phys_addr)))
+   if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
return (void __iomem *)__phys_to_virt(phys_addr);
 
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5d9550fdb9cf..26045e9adbd7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
-   if (!pfn_valid(pfn))
+   if (!pfn_is_map_memory(pfn))
return pgprot_noncached(vma_prot);
else if (file->f_flags & O_SYNC)
return pgprot_writecombine(vma_prot);
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-22 Thread Mike Rapoport
From: Mike Rapoport 

The arm64's version of pfn_valid() differs from the generic because of two
reasons:

* Parts of the memory map are freed during boot. This makes it necessary to
  verify that there is actual physical memory that corresponds to a pfn
  which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As the consequence of absence of the special treatment of NOMAP regions in
the memory map it was necessary to use memblock_is_map_memory() in
pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled by
removing CONFIG_HOLES_IN_ZONE on arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().

Signed-off-by: Mike Rapoport 
Acked-by: David Hildenbrand 
---
 arch/arm64/Kconfig   | 3 ---
 arch/arm64/mm/init.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..58e439046d05 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
-config HOLES_IN_ZONE
-   def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 966a7a18d528..f431b38d0837 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
 
/*
 * ZONE_DEVICE memory does not have the memblock entries.
-* memblock_is_map_memory() check for ZONE_DEVICE based
+* memblock_is_memory() check for ZONE_DEVICE based
 * addresses will always fail. Even the normal hotplugged
 * memory will never have MEMBLOCK_NOMAP flag set in their
 * memblock entries. Skip memblock search for all non early
@@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
return pfn_section_valid(ms, pfn);
 }
 #endif
-   return memblock_is_map_memory(addr);
+   return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 2/4] memblock: update initialization of reserved pages

2021-04-22 Thread Mike Rapoport
From: Mike Rapoport 

The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().

Split out initialization of the reserved pages to a function with a
meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions and mark struct pages for the NOMAP regions as
PageReserved.

Signed-off-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
Reviewed-by: Anshuman Khandual 
---
 include/linux/memblock.h |  4 +++-
 mm/memblock.c| 28 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..1b4c97c151ae 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NONE: no special request
  * @MEMBLOCK_HOTPLUG: hotpluggable region
  * @MEMBLOCK_MIRROR: mirrored region
- * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
+ * reserved in the memory map; refer to memblock_mark_nomap() description
+ * for further details
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3abf2c3fea7f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
+ * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
+ * direct mapping of the physical memory. These regions will still be
+ * covered by the memory map. The struct page representing NOMAP memory
+ * frames in the memory map will be PageReserved()
+ *
  * Return: 0 on success, -errno on failure.
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
@@ -2002,6 +2007,26 @@ static unsigned long __init 
__free_memory_core(phys_addr_t start,
return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+   struct memblock_region *region;
+   phys_addr_t start, end;
+   u64 i;
+
+   /* initialize struct pages for the reserved regions */
+   for_each_reserved_mem_range(i, , )
+   reserve_bootmem_region(start, end);
+
+   /* and also treat struct pages for the NOMAP regions as PageReserved */
+   for_each_mem_region(region) {
+   if (memblock_is_nomap(region)) {
+   start = region->base;
+   end = start + region->size;
+   reserve_bootmem_region(start, end);
+   }
+   }
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
@@ -2010,8 +2035,7 @@ static unsigned long __init 
free_low_memory_core_early(void)
 
memblock_clear_hotplug(0, -1);
 
-   for_each_reserved_mem_range(i, , )
-   reserve_bootmem_region(start, end);
+   memmap_init_reserved_pages();
 
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 1/4] include/linux/mmzone.h: add documentation for pfn_valid()

2021-04-22 Thread Mike Rapoport
From: Mike Rapoport 

Add comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for availability of a memory map entry (i.e. struct
page) for a PFN rather than availability of usable memory backing that PFN.

The most "generic" version of pfn_valid() used by the configurations with
SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
suitable place for documentation about semantics of pfn_valid().

Suggested-by: Anshuman Khandual 
Signed-off-by: Mike Rapoport 
Reviewed-by: Anshuman Khandual 
---
 include/linux/mmzone.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..961f0eeefb62 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1410,6 +1410,17 @@ static inline int pfn_section_valid(struct mem_section 
*ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+/**
+ * pfn_valid - check if there is a valid memory map entry for a PFN
+ * @pfn: the page frame number to check
+ *
+ * Check if there is a valid memory map entry aka struct page for the @pfn.
+ * Note, that availability of the memory map entry does not imply that
+ * there is actual usable memory at that @pfn. The struct page may
+ * represent a hole or an unusable page frame.
+ *
+ * Return: 1 for PFNs that have memory map entries and 0 otherwise
+ */
 static inline int pfn_valid(unsigned long pfn)
 {
struct mem_section *ms;
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v3 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-22 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this the core mm will be able to cope with the fact that it cannot use
NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without the need for pfn_valid_within.

The patches are only boot tested on qemu-system-aarch64 so I'd really
appreciate memory stress tests on real hardware.

If this actually works we'll be one step closer to drop custom pfn_valid()
on arm64 altogether.

v3:
* Fix minor issues found by Anshuman
* Freshen up the declaration of pfn_valid() to make it consistent with
  pfn_is_map_memory()
* Add more Acked-by and Reviewed-by tags, thanks Anshuman and David

v2: Link: https://lore.kernel.org/lkml/20210421065108.1987-1-r...@kernel.org
* Add check for PFN overflow in pfn_is_map_memory()
* Add Acked-by and Reviewed-by tags, thanks David.

v1: Link: https://lore.kernel.org/lkml/20210420090925.7457-1-r...@kernel.org
* Add comment about the semantics of pfn_valid() as Anshuman suggested
* Extend comments about MEMBLOCK_NOMAP, per Anshuman
* Use pfn_is_map_memory() name for the exported wrapper for
  memblock_is_map_memory(). It is still local to arch/arm64 in the end
  because of header dependency issues.

rfc: Link: https://lore.kernel.org/lkml/20210407172607.8812-1-r...@kernel.org

Mike Rapoport (4):
  include/linux/mmzone.h: add documentation for pfn_valid()
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is in linear map from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig  |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  3 ++-
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 16 ++--
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 include/linux/memblock.h|  4 +++-
 include/linux/mmzone.h  | 11 +++
 mm/memblock.c   | 28 ++--
 10 files changed, 61 insertions(+), 14 deletions(-)

base-commit: e49d033bddf5b565044e2abe4241353959bc9120
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-21 Thread Mike Rapoport
On Wed, Apr 21, 2021 at 04:36:46PM +0530, Anshuman Khandual wrote:
> 
> On 4/21/21 12:21 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The arm64's version of pfn_valid() differs from the generic because of two
> > reasons:
> > 
> > * Parts of the memory map are freed during boot. This makes it necessary to
> >   verify that there is actual physical memory that corresponds to a pfn
> >   which is done by querying memblock.
> > 
> > * There are NOMAP memory regions. These regions are not mapped in the
> >   linear map and until the previous commit the struct pages representing
> >   these areas had default values.
> > 
> > As the consequence of absence of the special treatment of NOMAP regions in
> > the memory map it was necessary to use memblock_is_map_memory() in
> > pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
> > generic mm functionality would not treat a NOMAP page as a normal page.
> > 
> > Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
> > the rest of core mm will treat them as unusable memory and thus
> > pfn_valid_within() is no longer required at all and can be disabled by
> > removing CONFIG_HOLES_IN_ZONE on arm64.
> 
> This makes sense.
> 
> > 
> > pfn_valid() can be slightly simplified by replacing
> > memblock_is_map_memory() with memblock_is_memory().
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/arm64/Kconfig   | 3 ---
> >  arch/arm64/mm/init.c | 4 ++--
> >  2 files changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index e4e1b6550115..58e439046d05 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
> > def_bool y
> > depends on NUMA
> >  
> > -config HOLES_IN_ZONE
> > -   def_bool y
> > -
> 
> Right.
> 
> >  source "kernel/Kconfig.hz"
> >  
> >  config ARCH_SPARSEMEM_ENABLE
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index dc03bdc12c0f..eb3f56fb8c7c 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
> >  
> > /*
> >  * ZONE_DEVICE memory does not have the memblock entries.
> > -* memblock_is_map_memory() check for ZONE_DEVICE based
> > +* memblock_is_memory() check for ZONE_DEVICE based
> >  * addresses will always fail. Even the normal hotplugged
> >  * memory will never have MEMBLOCK_NOMAP flag set in their
> >  * memblock entries. Skip memblock search for all non early
> > @@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
> > return pfn_section_valid(ms, pfn);
> >  }
> >  #endif
> > -   return memblock_is_map_memory(addr);
> > +   return memblock_is_memory(addr);
> 
> Wondering if MEMBLOCK_NOMAP is now being treated similarly to other
> memory pfns for page table walking purpose but with PageReserved(),
> why memblock_is_memory() is still required ? At this point, should
> not we just return valid for early_section() memory. As pfn_valid()
> now just implies that pfn has a struct page backing which has been
> already verified with valid_section() etc.

memblock_is_memory() is required because arm64 frees unused parts of the
memory map. So, for instance, if we have 64M out of 128M populated in a
section the section based calculation would return 1 for a pfn in the
second half of the section, but there would be no memory map there.
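
A worked version of that example, with assumed numbers (4K pages, 128M
sections):

/*
 * Assume a section covering pfns [0x80000, 0x88000) (128M) where only the
 * first 64M, pfns [0x80000, 0x84000), is populated RAM.  arm64 frees the
 * memmap for the unpopulated half, but a purely section-based check still
 * returns 1 for e.g. pfn 0x86000 because the section itself exists.  Only
 * memblock_is_memory(PFN_PHYS(0x86000)) reports that there is no memory,
 * and therefore no usable memory map, behind that pfn.
 */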


> >  }
> >  EXPORT_SYMBOL(pfn_valid);
> >  
> > 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v2 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-04-21 Thread Mike Rapoport
On Wed, Apr 21, 2021 at 04:29:48PM +0530, Anshuman Khandual wrote:
> 
> On 4/21/21 12:21 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The intended semantics of pfn_valid() is to verify whether there is a
> > struct page for the pfn in question and nothing else.
> > 
> > Yet, on arm64 it is used to distinguish memory areas that are mapped in the
> > linear map vs those that require ioremap() to access them.
> > 
> > Introduce a dedicated pfn_is_map_memory() wrapper for
> > memblock_is_map_memory() to perform such check and use it where
> > appropriate.
> > 
> > Using a wrapper allows us to avoid cyclic include dependencies.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/arm64/include/asm/memory.h |  2 +-
> >  arch/arm64/include/asm/page.h   |  1 +
> >  arch/arm64/kvm/mmu.c|  2 +-
> >  arch/arm64/mm/init.c| 11 +++
> >  arch/arm64/mm/ioremap.c |  4 ++--
> >  arch/arm64/mm/mmu.c |  2 +-
> >  6 files changed, 17 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/memory.h 
> > b/arch/arm64/include/asm/memory.h
> > index 0aabc3be9a75..194f9f993d30 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
> >  
> >  #define virt_addr_valid(addr)  ({  
> > \
> > __typeof__(addr) __addr = __tag_reset(addr);\
> > -   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
> > +   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  
> > \
> >  })
> >  
> >  void dump_mem_limit(void);
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 012cffc574e8..99a6da91f870 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
> >  typedef struct page *pgtable_t;
> >  
> >  extern int pfn_valid(unsigned long);
> > +extern int pfn_is_map_memory(unsigned long);
> 
> Check patch is complaining about this.
> 
> WARNING: function definition argument 'unsigned long' should also have an 
> identifier name
> #50: FILE: arch/arm64/include/asm/page.h:41:
> +extern int pfn_is_map_memory(unsigned long);
> 
> 
> >  
> >  #include 
> >  
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 8711894db8c2..23dd99e29b23 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> >  
> >  static bool kvm_is_device_pfn(unsigned long pfn)
> >  {
> > -   return !pfn_valid(pfn);
> > +   return !pfn_is_map_memory(pfn);
> >  }
> >  
> >  /*
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 3685e12aba9b..dc03bdc12c0f 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -258,6 +258,17 @@ int pfn_valid(unsigned long pfn)
> >  }
> >  EXPORT_SYMBOL(pfn_valid);
> >  
> > +int pfn_is_map_memory(unsigned long pfn)
> > +{
> > +   phys_addr_t addr = PFN_PHYS(pfn);
> > +
> 
> Should also bring with it, the comment regarding upper bits in
> the pfn from arm64 pfn_valid().

I think a reference to the comment in pfn_valid() will suffice.

BTW, I wonder how it is that other architectures do not need this check?
 
> > +   if (PHYS_PFN(addr) != pfn)
> > +   return 0;
> > +   
> 
>  ^ trailing spaces here.
> 
> ERROR: trailing whitespace
> #81: FILE: arch/arm64/mm/init.c:263:
> +^I$

Oops :)
 
> > +   return memblock_is_map_memory(addr);
> > +}
> > +EXPORT_SYMBOL(pfn_is_map_memory);
> > +
> 
> Is the EXPORT_SYMBOL() required to build drivers which will use
> pfn_is_map_memory() but currently use pfn_valid() ?

Yes, this is required for virt_addr_valid() that is used by modules.
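
As an illustration only, a hypothetical out-of-tree module that ends up
needing the export through virt_addr_valid(); it is not part of this series:

#include <linux/module.h>
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/mm.h>

static int __init vav_demo_init(void)
{
	void *p = kmalloc(64, GFP_KERNEL);

	if (!p)
		return -ENOMEM;

	/* on arm64, virt_addr_valid() expands to pfn_is_map_memory(),
	 * so loading this module needs the EXPORT_SYMBOL() above */
	pr_info("virt_addr_valid(p) = %d\n", virt_addr_valid(p));

	kfree(p);
	return 0;
}

static void __exit vav_demo_exit(void)
{
}

module_init(vav_demo_init);
module_exit(vav_demo_exit);
MODULE_LICENSE("GPL");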

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-21 Thread Mike Rapoport
From: Mike Rapoport 

The arm64's version of pfn_valid() differs from the generic because of two
reasons:

* Parts of the memory map are freed during boot. This makes it necessary to
  verify that there is actual physical memory that corresponds to a pfn
  which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As the consequence of absence of the special treatment of NOMAP regions in
the memory map it was necessary to use memblock_is_map_memory() in
pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled by
removing CONFIG_HOLES_IN_ZONE on arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig   | 3 ---
 arch/arm64/mm/init.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..58e439046d05 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
-config HOLES_IN_ZONE
-   def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index dc03bdc12c0f..eb3f56fb8c7c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
 
/*
 * ZONE_DEVICE memory does not have the memblock entries.
-* memblock_is_map_memory() check for ZONE_DEVICE based
+* memblock_is_memory() check for ZONE_DEVICE based
 * addresses will always fail. Even the normal hotplugged
 * memory will never have MEMBLOCK_NOMAP flag set in their
 * memblock entries. Skip memblock search for all non early
@@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
return pfn_section_valid(ms, pfn);
 }
 #endif
-   return memblock_is_map_memory(addr);
+   return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-04-21 Thread Mike Rapoport
From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_map_memory() wrapper for
memblock_is_map_memory() to perform such check and use it where
appropriate.

Using a wrapper allows us to avoid cyclic include dependencies.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  1 +
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 11 +++
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 6 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..194f9f993d30 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
 
 #define virt_addr_valid(addr)  ({  \
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  
\
 })
 
 void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..99a6da91f870 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
 typedef struct page *pgtable_t;
 
 extern int pfn_valid(unsigned long);
+extern int pfn_is_map_memory(unsigned long);
 
 #include 
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..23dd99e29b23 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-   return !pfn_valid(pfn);
+   return !pfn_is_map_memory(pfn);
 }
 
 /*
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 3685e12aba9b..dc03bdc12c0f 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -258,6 +258,17 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_map_memory(unsigned long pfn)
+{
+   phys_addr_t addr = PFN_PHYS(pfn);
+
+   if (PHYS_PFN(addr) != pfn)
+   return 0;
+   
+   return memblock_is_map_memory(addr);
+}
+EXPORT_SYMBOL(pfn_is_map_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b5e83c46b23e..b7c81dacabf0 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, 
size_t size,
/*
 * Don't allow RAM to be mapped.
 */
-   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
+   if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr
return NULL;
 
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
/* For normal memory we already have a cacheable mapping. */
-   if (pfn_valid(__phys_to_pfn(phys_addr)))
+   if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
return (void __iomem *)__phys_to_virt(phys_addr);
 
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5d9550fdb9cf..26045e9adbd7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
-   if (!pfn_valid(pfn))
+   if (!pfn_is_map_memory(pfn))
return pgprot_noncached(vma_prot);
else if (file->f_flags & O_SYNC)
return pgprot_writecombine(vma_prot);
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2 2/4] memblock: update initialization of reserved pages

2021-04-21 Thread Mike Rapoport
From: Mike Rapoport 

The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().

Split out initialization of the reserved pages to a function with a
meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions and mark struct pages for the NOMAP regions as
PageReserved.

Signed-off-by: Mike Rapoport 
---
 include/linux/memblock.h |  4 +++-
 mm/memblock.c| 28 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..634c1a578db8 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NONE: no special request
  * @MEMBLOCK_HOTPLUG: hotpluggable region
  * @MEMBLOCK_MIRROR: mirrored region
- * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
+ * reserved in the memory map; refer to memblock_mark_nomap() description
+ * for further details
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3abf2c3fea7f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
+ * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
+ * direct mapping of the physical memory. These regions will still be
+ * covered by the memory map. The struct page representing NOMAP memory
+ * frames in the memory map will be PageReserved()
+ *
  * Return: 0 on success, -errno on failure.
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
@@ -2002,6 +2007,26 @@ static unsigned long __init 
__free_memory_core(phys_addr_t start,
return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+   struct memblock_region *region;
+   phys_addr_t start, end;
+   u64 i;
+
+   /* initialize struct pages for the reserved regions */
+   for_each_reserved_mem_range(i, , )
+   reserve_bootmem_region(start, end);
+
+   /* and also treat struct pages for the NOMAP regions as PageReserved */
+   for_each_mem_region(region) {
+   if (memblock_is_nomap(region)) {
+   start = region->base;
+   end = start + region->size;
+   reserve_bootmem_region(start, end);
+   }
+   }
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
@@ -2010,8 +2035,7 @@ static unsigned long __init 
free_low_memory_core_early(void)
 
memblock_clear_hotplug(0, -1);
 
-   for_each_reserved_mem_range(i, , )
-   reserve_bootmem_region(start, end);
+   memmap_init_reserved_pages();
 
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2 1/4] include/linux/mmzone.h: add documentation for pfn_valid()

2021-04-21 Thread Mike Rapoport
From: Mike Rapoport 

Add comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for availability of a memory map entry (i.e. struct
page) for a PFN rather than availability of usable memory backing that PFN.

The most "generic" version of pfn_valid() used by the configurations with
SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
suitable place for documentation about semantics of pfn_valid().

Suggested-by: Anshuman Khandual 
Signed-off-by: Mike Rapoport 
---
 include/linux/mmzone.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..961f0eeefb62 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1410,6 +1410,17 @@ static inline int pfn_section_valid(struct mem_section 
*ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+/**
+ * pfn_valid - check if there is a valid memory map entry for a PFN
+ * @pfn: the page frame number to check
+ *
+ * Check if there is a valid memory map entry aka struct page for the @pfn.
+ * Note, that availability of the memory map entry does not imply that
+ * there is actual usable memory at that @pfn. The struct page may
+ * represent a hole or an unusable page frame.
+ *
+ * Return: 1 for PFNs that have memory map entries and 0 otherwise
+ */
 static inline int pfn_valid(unsigned long pfn)
 {
struct mem_section *ms;
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v2 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-21 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this the core mm will be able to cope with the fact that it cannot use
NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without the need for pfn_valid_within.

The patches are only boot tested on qemu-system-aarch64 so I'd really
appreciate memory stress tests on real hardware.

If this actually works we'll be one step closer to drop custom pfn_valid()
on arm64 altogether.

v2:
* Add check for PFN overflow in pfn_is_map_memory()
* Add Acked-by and Reviewed-by tags, thanks David.

v1: Link: https://lore.kernel.org/lkml/20210420090925.7457-1-r...@kernel.org
* Add comment about the semantics of pfn_valid() as Anshuman suggested
* Extend comments about MEMBLOCK_NOMAP, per Anshuman
* Use pfn_is_map_memory() name for the exported wrapper for
  memblock_is_map_memory(). It is still local to arch/arm64 in the end
  because of header dependency issues.

rfc: Link: https://lore.kernel.org/lkml/20210407172607.8812-1-r...@kernel.org

Mike Rapoport (4):
  include/linux/mmzone.h: add documentation for pfn_valid()
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is in linear map from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig  |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  1 +
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 10 --
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 include/linux/memblock.h|  4 +++-
 include/linux/mmzone.h  | 11 +++
 mm/memblock.c   | 28 ++--
 10 files changed, 54 insertions(+), 13 deletions(-)

base-commit: e49d033bddf5b565044e2abe4241353959bc9120
-- 
2.28.0


___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-20 Thread Mike Rapoport
On Tue, Apr 20, 2021 at 06:00:55PM +0200, David Hildenbrand wrote:
> On 20.04.21 11:09, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The arm64's version of pfn_valid() differs from the generic because of two
> > reasons:
> > 
> > * Parts of the memory map are freed during boot. This makes it necessary to
> >verify that there is actual physical memory that corresponds to a pfn
> >which is done by querying memblock.
> > 
> > * There are NOMAP memory regions. These regions are not mapped in the
> >linear map and until the previous commit the struct pages representing
> >these areas had default values.
> > 
> > As the consequence of absence of the special treatment of NOMAP regions in
> > the memory map it was necessary to use memblock_is_map_memory() in
> > pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
> > generic mm functionality would not treat a NOMAP page as a normal page.
> > 
> > Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
> > the rest of core mm will treat them as unusable memory and thus
> > pfn_valid_within() is no longer required at all and can be disabled by
> > removing CONFIG_HOLES_IN_ZONE on arm64.
> > 
> > pfn_valid() can be slightly simplified by replacing
> > memblock_is_map_memory() with memblock_is_memory().
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >   arch/arm64/Kconfig   | 3 ---
> >   arch/arm64/mm/init.c | 4 ++--
> >   2 files changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index e4e1b6550115..58e439046d05 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
> > def_bool y
> > depends on NUMA
> > -config HOLES_IN_ZONE
> > -   def_bool y
> > -
> >   source "kernel/Kconfig.hz"
> >   config ARCH_SPARSEMEM_ENABLE
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index c54e329aca15..370f33765b64 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
> > /*
> >  * ZONE_DEVICE memory does not have the memblock entries.
> > -* memblock_is_map_memory() check for ZONE_DEVICE based
> > +* memblock_is_memory() check for ZONE_DEVICE based
> >  * addresses will always fail. Even the normal hotplugged
> >  * memory will never have MEMBLOCK_NOMAP flag set in their
> >  * memblock entries. Skip memblock search for all non early
> > @@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
> > return pfn_section_valid(ms, pfn);
> >   }
> >   #endif
> > -   return memblock_is_map_memory(addr);
> > +   return memblock_is_memory(addr);
> >   }
> >   EXPORT_SYMBOL(pfn_valid);
> > 
> 
> What are the steps needed to get rid of custom pfn_valid() completely?
> 
> I'd assume we would have to stop freeing parts of the mem map during boot.
> How relevant is that for arm64 nowadays, especially with reduced section
> sizes?

Yes, for arm64 to use the generic pfn_valid() it'd need to stop freeing
parts of the memory map.

Presuming struct page is 64 bytes, the memory map takes 2M per section in
the worst case (128M per section, 4k pages). 
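
For completeness, the arithmetic behind the 2M figure, assuming the 64-byte
struct page and 4K pages stated above:

/*
 * pages per 128M section : 128M / 4K        = 32768
 * memmap per section     : 32768 * 64 bytes = 2M of struct pages
 */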

So for systems that have less than 128M populated in each section freeing
unused memory map would mean significant savings.

But nowadays when a clock has at least 1G of RAM I doubt this is relevant
to many systems if at all.

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-04-20 Thread Mike Rapoport
On Tue, Apr 20, 2021 at 05:57:57PM +0200, David Hildenbrand wrote:
> On 20.04.21 11:09, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The intended semantics of pfn_valid() is to verify whether there is a
> > struct page for the pfn in question and nothing else.
> > 
> > Yet, on arm64 it is used to distinguish memory areas that are mapped in the
> > linear map vs those that require ioremap() to access them.
> > 
> > Introduce a dedicated pfn_is_map_memory() wrapper for
> > memblock_is_map_memory() to perform such check and use it where
> > appropriate.
> > 
> > Using a wrapper allows us to avoid cyclic include dependencies.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >   arch/arm64/include/asm/memory.h | 2 +-
> >   arch/arm64/include/asm/page.h   | 1 +
> >   arch/arm64/kvm/mmu.c| 2 +-
> >   arch/arm64/mm/init.c| 6 ++
> >   arch/arm64/mm/ioremap.c | 4 ++--
> >   arch/arm64/mm/mmu.c | 2 +-
> >   6 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/memory.h 
> > b/arch/arm64/include/asm/memory.h
> > index 0aabc3be9a75..194f9f993d30 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
> >   #define virt_addr_valid(addr) ({  
> > \
> > __typeof__(addr) __addr = __tag_reset(addr);\
> > -   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
> > +   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  
> > \
> >   })
> >   void dump_mem_limit(void);
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 012cffc574e8..99a6da91f870 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
> >   typedef struct page *pgtable_t;
> >   extern int pfn_valid(unsigned long);
> > +extern int pfn_is_map_memory(unsigned long);
> >   #include 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 8711894db8c2..23dd99e29b23 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> >   static bool kvm_is_device_pfn(unsigned long pfn)
> >   {
> > -   return !pfn_valid(pfn);
> > +   return !pfn_is_map_memory(pfn);
> >   }
> >   /*
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 3685e12aba9b..c54e329aca15 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
> >   }
> >   EXPORT_SYMBOL(pfn_valid);
> > +int pfn_is_map_memory(unsigned long pfn)
> > +{
> 
> I think you might have to add (see pfn_valid())
> 
> if (PHYS_PFN(PFN_PHYS(pfn)) != pfn)
>   return 0;
> 
> to catch false positives.
 
Yeah, makes sense. 
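
A small sketch of the false positive that round-trip check catches, assuming a
4K page size:

/*
 * pfn           = 1UL << 55              (far above any real RAM)
 * PFN_PHYS(pfn) = pfn << PAGE_SHIFT      (overflows 64 bits, wraps to 0x8)
 * PHYS_PFN(..)  = 0x8 >> PAGE_SHIFT = 0  != pfn, so the helper returns 0
 *
 * Without the check, memblock would be queried with the wrapped, bogus
 * address.
 */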

> > +   return memblock_is_map_memory(PFN_PHYS(pfn));
> > +}
> > +EXPORT_SYMBOL(pfn_is_map_memory);
> > +
> >   static phys_addr_t memory_limit = PHYS_ADDR_MAX;
> >   /*
> > diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> > index b5e83c46b23e..b7c81dacabf0 100644
> > --- a/arch/arm64/mm/ioremap.c
> > +++ b/arch/arm64/mm/ioremap.c
> > @@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t 
> > phys_addr, size_t size,
> > /*
> >  * Don't allow RAM to be mapped.
> >  */
> > -   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
> > +   if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr
> > return NULL;
> > area = get_vm_area_caller(size, VM_IOREMAP, caller);
> > @@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
> >   void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
> >   {
> > /* For normal memory we already have a cacheable mapping. */
> > -   if (pfn_valid(__phys_to_pfn(phys_addr)))
> > +   if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
> > return (void __iomem *)__phys_to_virt(phys_addr);
> > return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 5d9550fdb9cf..26045e9adbd7 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)

Re: [PATCH v1 2/4] memblock: update initialization of reserved pages

2021-04-20 Thread Mike Rapoport
On Tue, Apr 20, 2021 at 05:18:55PM +0200, David Hildenbrand wrote:
> On 20.04.21 17:03, Mike Rapoport wrote:
> > On Tue, Apr 20, 2021 at 03:56:28PM +0200, David Hildenbrand wrote:
> > > On 20.04.21 11:09, Mike Rapoport wrote:
> > > > From: Mike Rapoport 
> > > > 
> > > > The struct pages representing a reserved memory region are initialized
> > > > using reserve_bootmem_range() function. This function is called for each
> > > > reserved region just before the memory is freed from memblock to the 
> > > > buddy
> > > > page allocator.
> > > > 
> > > > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > > > values set by the memory map initialization which makes it necessary to
> > > > have a special treatment for such pages in pfn_valid() and
> > > > pfn_valid_within().
> > > 
> > > Just a general question while thinking about it:
> > > 
> > > Would we right now initialize the memmap of these pages already via
> > > memmap_init_zone()->memmap_init_range()? (IOW, not marking the
> > > PageReserved?)
> > 
> > Yep. These pages are part of memblock.memory so they are initialized in
> > memmap_init_zone()->memmap_init_range() to the default values.
> > 
> 
> So instead of fully initializing them again, we mostly would only have to
> set PageReserved(). Not sure how big that memory usually is -- IOW, if we
> really care about optimizing the double-init.

IIUC, these are small areas reserved by the firmware, like e.g. ACPI
tables.

@Ard, am I right?

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 2/4] memblock: update initialization of reserved pages

2021-04-20 Thread Mike Rapoport
On Tue, Apr 20, 2021 at 03:56:28PM +0200, David Hildenbrand wrote:
> On 20.04.21 11:09, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The struct pages representing a reserved memory region are initialized
> > using reserve_bootmem_range() function. This function is called for each
> > reserved region just before the memory is freed from memblock to the buddy
> > page allocator.
> > 
> > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > values set by the memory map initialization which makes it necessary to
> > have a special treatment for such pages in pfn_valid() and
> > pfn_valid_within().
> 
> Just a general question while thinking about it:
> 
> Would we right now initialize the memmap of these pages already via
> memmap_init_zone()->memmap_init_range()? (IOW, not marking the
> PageReserved?)

Yep. These pages are part of memblock.memory so they are initialized in
memmap_init_zone()->memmap_init_range() to the default values.

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v1 1/4] include/linux/mmzone.h: add documentation for pfn_valid()

2021-04-20 Thread Mike Rapoport
On Tue, Apr 20, 2021 at 11:22:53AM +0200, David Hildenbrand wrote:
> On 20.04.21 11:09, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Add comment describing the semantics of pfn_valid() that clarifies that
> > pfn_valid() only checks for availability of a memory map entry (i.e. struct
> > page) for a PFN rather than availability of usable memory backing that PFN.
> > 
> > The most "generic" version of pfn_valid() used by the configurations with
> > SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
> > suitable place for documentation about semantics of pfn_valid().
> > 
> > Suggested-by: Anshuman Khandual 
> > Signed-off-by: Mike Rapoport 
> > ---
> >   include/linux/mmzone.h | 11 +++
> >   1 file changed, 11 insertions(+)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 47946cec7584..961f0eeefb62 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1410,6 +1410,17 @@ static inline int pfn_section_valid(struct 
> > mem_section *ms, unsigned long pfn)
> >   #endif
> >   #ifndef CONFIG_HAVE_ARCH_PFN_VALID
> > +/**
> > + * pfn_valid - check if there is a valid memory map entry for a PFN
> > + * @pfn: the page frame number to check
> > + *
> > + * Check if there is a valid memory map entry aka struct page for the @pfn.
> > + * Note, that availability of the memory map entry does not imply that
> > + * there is actual usable memory at that @pfn. The struct page may
> > + * represent a hole or an unusable page frame.
> > + *
> > + * Return: 1 for PFNs that have memory map entries and 0 otherwise
> > + */
> >   static inline int pfn_valid(unsigned long pfn)
> >   {
> > struct mem_section *ms;
> > 
> 
> I'd rephrase all "there is a valid memory map" to "there is a memory map"
> and add "pfn_valid() does to indicate whether the memory map as actually
> initialized -- see pfn_to_online_page()."
> 
> pfn_valid() means that we can do a pfn_to_page() and don't get a fault when
> accessing the "struct page". It doesn't state anything about the content.

Well, I mean valid in the sense you can access the struct page :)
How about:

/**
 * pfn_valid - check if there is a memory map entry for a PFN
 * @pfn: the page frame number to check
 *
 * Check if there is a memory map entry aka struct page for the @pfn and it
 * is safe to access that struct page; the struct page state may be
 * uninitialized -- see pfn_to_online_page().
 *
 * Note, that availability of the memory map entry does not imply that
 * there is actual usable memory at that @pfn. The struct page may
 * represent a hole or an unusable page frame.
 *
 * Return: 1 for PFNs that have memory map entries and 0 otherwise.
 */

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v1 4/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-20 Thread Mike Rapoport
From: Mike Rapoport 

The arm64 version of pfn_valid() differs from the generic one for two
reasons:

* Parts of the memory map are freed during boot. This makes it necessary to
  verify that there is actual physical memory that corresponds to a pfn
  which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As a consequence of the absence of special treatment for NOMAP regions in
the memory map, it was necessary to use memblock_is_map_memory() in
pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled by
removing CONFIG_HOLES_IN_ZONE on arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig   | 3 ---
 arch/arm64/mm/init.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..58e439046d05 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
-config HOLES_IN_ZONE
-   def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index c54e329aca15..370f33765b64 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
 
/*
 * ZONE_DEVICE memory does not have the memblock entries.
-* memblock_is_map_memory() check for ZONE_DEVICE based
+* memblock_is_memory() check for ZONE_DEVICE based
 * addresses will always fail. Even the normal hotplugged
 * memory will never have MEMBLOCK_NOMAP flag set in their
 * memblock entries. Skip memblock search for all non early
@@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
return pfn_section_valid(ms, pfn);
 }
 #endif
-   return memblock_is_map_memory(addr);
+   return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v1 3/4] arm64: decouple check whether pfn is in linear map from pfn_valid()

2021-04-20 Thread Mike Rapoport
From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_map_memory() wrapper for
memblock_is_map_memory() to perform such check and use it where
appropriate.

Using a wrapper allows us to avoid cyclic include dependencies.

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/memory.h | 2 +-
 arch/arm64/include/asm/page.h   | 1 +
 arch/arm64/kvm/mmu.c| 2 +-
 arch/arm64/mm/init.c| 6 ++
 arch/arm64/mm/ioremap.c | 4 ++--
 arch/arm64/mm/mmu.c | 2 +-
 6 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..194f9f993d30 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
 
 #define virt_addr_valid(addr)  ({  \
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));  \
 })
 
 void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..99a6da91f870 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
 typedef struct page *pgtable_t;
 
 extern int pfn_valid(unsigned long);
+extern int pfn_is_map_memory(unsigned long);
 
 #include 
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..23dd99e29b23 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-   return !pfn_valid(pfn);
+   return !pfn_is_map_memory(pfn);
 }
 
 /*
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 3685e12aba9b..c54e329aca15 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_map_memory(unsigned long pfn)
+{
+   return memblock_is_map_memory(PFN_PHYS(pfn));
+}
+EXPORT_SYMBOL(pfn_is_map_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b5e83c46b23e..b7c81dacabf0 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
/*
 * Don't allow RAM to be mapped.
 */
-   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
+   if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr
return NULL;
 
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
/* For normal memory we already have a cacheable mapping. */
-   if (pfn_valid(__phys_to_pfn(phys_addr)))
+   if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
return (void __iomem *)__phys_to_virt(phys_addr);
 
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5d9550fdb9cf..26045e9adbd7 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
-   if (!pfn_valid(pfn))
+   if (!pfn_is_map_memory(pfn))
return pgprot_noncached(vma_prot);
else if (file->f_flags & O_SYNC)
return pgprot_writecombine(vma_prot);
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v1 2/4] memblock: update initialization of reserved pages

2021-04-20 Thread Mike Rapoport
From: Mike Rapoport 

The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization which makes it necessary to
have a special treatment for such pages in pfn_valid() and
pfn_valid_within().

Split out the initialization of the reserved pages to a function with a
meaningful name, treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions, and mark the struct pages for the NOMAP regions as
PageReserved.

Signed-off-by: Mike Rapoport 
---
 include/linux/memblock.h |  4 +++-
 mm/memblock.c| 28 ++--
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..634c1a578db8 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn;
  * @MEMBLOCK_NONE: no special request
  * @MEMBLOCK_HOTPLUG: hotpluggable region
  * @MEMBLOCK_MIRROR: mirrored region
- * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
+ * reserved in the memory map; refer to memblock_mark_nomap() description
+ * for further details
  */
 enum memblock_flags {
MEMBLOCK_NONE   = 0x0,  /* No special request */
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..3abf2c3fea7f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
+ * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
+ * direct mapping of the physical memory. These regions will still be
+ * covered by the memory map. The struct page representing NOMAP memory
+ * frames in the memory map will be PageReserved()
+ *
  * Return: 0 on success, -errno on failure.
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
@@ -2002,6 +2007,26 @@ static unsigned long __init __free_memory_core(phys_addr_t start,
return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+   struct memblock_region *region;
+   phys_addr_t start, end;
+   u64 i;
+
+   /* initialize struct pages for the reserved regions */
+   for_each_reserved_mem_range(i, , )
+   reserve_bootmem_region(start, end);
+
+   /* and also treat struct pages for the NOMAP regions as PageReserved */
+   for_each_mem_region(region) {
+   if (memblock_is_nomap(region)) {
+   start = region->base;
+   end = start + region->size;
+   reserve_bootmem_region(start, end);
+   }
+   }
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
@@ -2010,8 +2035,7 @@ static unsigned long __init free_low_memory_core_early(void)
 
memblock_clear_hotplug(0, -1);
 
-   for_each_reserved_mem_range(i, , )
-   reserve_bootmem_region(start, end);
+   memmap_init_reserved_pages();
 
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v1 1/4] include/linux/mmzone.h: add documentation for pfn_valid()

2021-04-20 Thread Mike Rapoport
From: Mike Rapoport 

Add comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for availability of a memory map entry (i.e. struct
page) for a PFN rather than availability of usable memory backing that PFN.

The most "generic" version of pfn_valid() used by the configurations with
SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most
suitable place for documentation about semantics of pfn_valid().

Suggested-by: Anshuman Khandual 
Signed-off-by: Mike Rapoport 
---
 include/linux/mmzone.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47946cec7584..961f0eeefb62 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1410,6 +1410,17 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+/**
+ * pfn_valid - check if there is a valid memory map entry for a PFN
+ * @pfn: the page frame number to check
+ *
+ * Check if there is a valid memory map entry aka struct page for the @pfn.
+ * Note, that availability of the memory map entry does not imply that
+ * there is actual usable memory at that @pfn. The struct page may
+ * represent a hole or an unusable page frame.
+ *
+ * Return: 1 for PFNs that have memory map entries and 0 otherwise
+ */
 static inline int pfn_valid(unsigned long pfn)
 {
struct mem_section *ms;
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[PATCH v1 0/4] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-20 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this the core mm will be able to cope with the fact that it cannot use
NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without the need for pfn_valid_within.

The patches are only boot tested on qemu-system-aarch64 so I'd really
appreciate memory stress tests on real hardware.

If this actually works we'll be one step closer to dropping the custom
pfn_valid() on arm64 altogether.

Changes since RFC
Link: https://lore.kernel.org/lkml/20210407172607.8812-1-r...@kernel.org

* Add comment about the semantics of pfn_valid() as Anshuman suggested
* Extend comments about MEMBLOCK_NOMAP, per Anshuman
* Use pfn_is_map_memory() name for the exported wrapper for
  memblock_is_map_memory(). It is still local to arch/arm64 in the end
  because of header dependency issues.

Mike Rapoport (4):
  include/linux/mmzone.h: add documentation for pfn_valid()
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is in linear map from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig  |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  1 +
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 10 --
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 include/linux/memblock.h|  4 +++-
 include/linux/mmzone.h  | 11 +++
 mm/memblock.c   | 28 ++--
 10 files changed, 54 insertions(+), 13 deletions(-)

base-commit: e49d033bddf5b565044e2abe4241353959bc9120
-- 
2.28.0

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-16 Thread Mike Rapoport
On Thu, Apr 15, 2021 at 11:30:12AM +0200, David Hildenbrand wrote:
> > Not sure we really need a new pagetype here, PG_Reserved seems to be quite
> > enough to say "don't touch this".  I generally agree that we could make
> > PG_Reserved a PageType and then have several sub-types for reserved memory.
> > This definitely will add clarity but I'm not sure that this justifies the
> > amount of churn and effort required to audit uses of PageReserved().
> > > Then, we could mostly avoid having to query memblock at runtime to figure
> > > out that this is special memory. This would obviously be an extension to
> > > this series. Just a thought.
> > 
> > Stop pushing memblock out of kernel! ;-)
> 
> Can't stop. Won't stop. :D
> 
> It's lovely for booting up a kernel until we have other data-structures in
> place ;)

A bit more seriously, we don't have any data structure that reliably
represents the physical memory layout in an arch-independent fashion.
memblock is probably the best starting point for eventually having one.

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()

2021-04-16 Thread Mike Rapoport
On Thu, Apr 15, 2021 at 11:31:26AM +0200, David Hildenbrand wrote:
> On 14.04.21 22:29, Mike Rapoport wrote:
> > On Wed, Apr 14, 2021 at 05:58:26PM +0200, David Hildenbrand wrote:
> > > On 08.04.21 07:14, Anshuman Khandual wrote:
> > > > 
> > > > On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > > > > From: Mike Rapoport 
> > > > > 
> > > > > The intended semantics of pfn_valid() is to verify whether there is a
> > > > > struct page for the pfn in question and nothing else.
> > > > 
> > > > Should there be a comment affirming this semantics interpretation, 
> > > > above the
> > > > generic pfn_valid() in include/linux/mmzone.h ?
> > > > 
> > > > > 
> > > > > Yet, on arm64 it is used to distinguish memory areas that are mapped 
> > > > > in the
> > > > > linear map vs those that require ioremap() to access them.
> > > > > 
> > > > > Introduce a dedicated pfn_is_memory() to perform such check and use it
> > > > > where appropriate.
> > > > > 
> > > > > Signed-off-by: Mike Rapoport 
> > > > > ---
> > > > >arch/arm64/include/asm/memory.h | 2 +-
> > > > >arch/arm64/include/asm/page.h   | 1 +
> > > > >arch/arm64/kvm/mmu.c| 2 +-
> > > > >arch/arm64/mm/init.c| 6 ++
> > > > >arch/arm64/mm/ioremap.c | 4 ++--
> > > > >arch/arm64/mm/mmu.c | 2 +-
> > > > >6 files changed, 12 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/arch/arm64/include/asm/memory.h 
> > > > > b/arch/arm64/include/asm/memory.h
> > > > > index 0aabc3be9a75..7e77fdf71b9d 100644
> > > > > --- a/arch/arm64/include/asm/memory.h
> > > > > +++ b/arch/arm64/include/asm/memory.h
> > > > > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
> > > > >#define virt_addr_valid(addr)  ({  
> > > > > \
> > > > >   __typeof__(addr) __addr = __tag_reset(addr);
> > > > > \
> > > > > - __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  
> > > > > \
> > > > > + __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr));  
> > > > > \
> > > > >})
> > > > >void dump_mem_limit(void);
> > > > > diff --git a/arch/arm64/include/asm/page.h 
> > > > > b/arch/arm64/include/asm/page.h
> > > > > index 012cffc574e8..32b485bcc6ff 100644
> > > > > --- a/arch/arm64/include/asm/page.h
> > > > > +++ b/arch/arm64/include/asm/page.h
> > > > > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page 
> > > > > *from);
> > > > >typedef struct page *pgtable_t;
> > > > >extern int pfn_valid(unsigned long);
> > > > > +extern int pfn_is_memory(unsigned long);
> > > > >#include 
> > > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > > index 8711894db8c2..ad2ea65a3937 100644
> > > > > --- a/arch/arm64/kvm/mmu.c
> > > > > +++ b/arch/arm64/kvm/mmu.c
> > > > > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> > > > >static bool kvm_is_device_pfn(unsigned long pfn)
> > > > >{
> > > > > - return !pfn_valid(pfn);
> > > > > + return !pfn_is_memory(pfn);
> > > > >}
> > > > >/*
> > > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > > index 3685e12aba9b..258b1905ed4a 100644
> > > > > --- a/arch/arm64/mm/init.c
> > > > > +++ b/arch/arm64/mm/init.c
> > > > > @@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
> > > > >}
> > > > >EXPORT_SYMBOL(pfn_valid);
> > > > > +int pfn_is_memory(unsigned long pfn)
> > > > > +{
> > > > > + return memblock_is_map_memory(PFN_PHYS(pfn));
> > > > > +}
> > > > > +EXPORT_SYMBOL(pfn_is_memory);
> > > > > +
> > > > 
> > > > Should not this be generic though ? There is nothing platform or arm64
> > > > specific in here. Wondering as pfn_is_memory() just indicates 

Re: [RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()

2021-04-14 Thread Mike Rapoport
On Wed, Apr 14, 2021 at 05:58:26PM +0200, David Hildenbrand wrote:
> On 08.04.21 07:14, Anshuman Khandual wrote:
> > 
> > On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > > 
> > > The intended semantics of pfn_valid() is to verify whether there is a
> > > struct page for the pfn in question and nothing else.
> > 
> > Should there be a comment affirming this semantics interpretation, above the
> > generic pfn_valid() in include/linux/mmzone.h ?
> > 
> > > 
> > > Yet, on arm64 it is used to distinguish memory areas that are mapped in 
> > > the
> > > linear map vs those that require ioremap() to access them.
> > > 
> > > Introduce a dedicated pfn_is_memory() to perform such check and use it
> > > where appropriate.
> > > 
> > > Signed-off-by: Mike Rapoport 
> > > ---
> > >   arch/arm64/include/asm/memory.h | 2 +-
> > >   arch/arm64/include/asm/page.h   | 1 +
> > >   arch/arm64/kvm/mmu.c| 2 +-
> > >   arch/arm64/mm/init.c| 6 ++
> > >   arch/arm64/mm/ioremap.c | 4 ++--
> > >   arch/arm64/mm/mmu.c | 2 +-
> > >   6 files changed, 12 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/memory.h 
> > > b/arch/arm64/include/asm/memory.h
> > > index 0aabc3be9a75..7e77fdf71b9d 100644
> > > --- a/arch/arm64/include/asm/memory.h
> > > +++ b/arch/arm64/include/asm/memory.h
> > > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
> > >   #define virt_addr_valid(addr)   ({  
> > > \
> > >   __typeof__(addr) __addr = __tag_reset(addr);
> > > \
> > > - __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
> > > + __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr));  \
> > >   })
> > >   void dump_mem_limit(void);
> > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > > index 012cffc574e8..32b485bcc6ff 100644
> > > --- a/arch/arm64/include/asm/page.h
> > > +++ b/arch/arm64/include/asm/page.h
> > > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
> > >   typedef struct page *pgtable_t;
> > >   extern int pfn_valid(unsigned long);
> > > +extern int pfn_is_memory(unsigned long);
> > >   #include 
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 8711894db8c2..ad2ea65a3937 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> > >   static bool kvm_is_device_pfn(unsigned long pfn)
> > >   {
> > > - return !pfn_valid(pfn);
> > > + return !pfn_is_memory(pfn);
> > >   }
> > >   /*
> > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > index 3685e12aba9b..258b1905ed4a 100644
> > > --- a/arch/arm64/mm/init.c
> > > +++ b/arch/arm64/mm/init.c
> > > @@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
> > >   }
> > >   EXPORT_SYMBOL(pfn_valid);
> > > +int pfn_is_memory(unsigned long pfn)
> > > +{
> > > + return memblock_is_map_memory(PFN_PHYS(pfn));
> > > +}
> > > +EXPORT_SYMBOL(pfn_is_memory);
> > > +
> > 
> > Should not this be generic though ? There is nothing platform or arm64
> > specific in here. Wondering as pfn_is_memory() just indicates that the
> > pfn is linear mapped, should not it be renamed as pfn_is_linear_memory()
> > instead ? Regardless, it's fine either way.
> 
> TBH, I dislike (generic) pfn_is_memory(). It feels like we're mixing
> concepts.

Yeah, at the moment NOMAP is very much arm specific so I'd keep it this way
for now.

>  NOMAP memory vs !NOMAP memory; even NOMAP is some kind of memory
> after all. pfn_is_map_memory() would be more expressive, although still
> sub-optimal.
>
> We'd actually want some kind of arm64-specific pfn_is_system_memory() or the
> inverse pfn_is_device_memory() -- to be improved.

In my current version (to be posted soon) I've started with
pfn_linearly_mapped() but then ended up with pfn_mapped() to make it
"upward" compatible with architectures that use a direct rather than a
linear map :)

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-14 Thread Mike Rapoport
On Wed, Apr 14, 2021 at 05:52:57PM +0200, David Hildenbrand wrote:
> On 14.04.21 17:27, Ard Biesheuvel wrote:
> > On Wed, 14 Apr 2021 at 17:14, David Hildenbrand  wrote:
> > > 
> > > On 07.04.21 19:26, Mike Rapoport wrote:
> > > > From: Mike Rapoport 
> > > > 
> > > > The struct pages representing a reserved memory region are initialized
> > > > using reserve_bootmem_range() function. This function is called for each
> > > > reserved region just before the memory is freed from memblock to the 
> > > > buddy
> > > > page allocator.
> > > > 
> > > > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > > > values set by the memory map initialization which makes it necessary to
> > > > have a special treatment for such pages in pfn_valid() and
> > > > pfn_valid_within().
> > > 
> > > I assume these pages are never given to the buddy, because we don't have
> > > a direct mapping. So to the kernel, it's essentially just like a memory
> > > hole with benefits.
> > > 
> > > I can spot that we want to export such memory like any special memory
> > > thingy/hole in /proc/iomem -- "reserved", which makes sense.
> > > 
> > > I would assume that MEMBLOCK_NOMAP is a special type of *reserved*
> > > memory. IOW, that for_each_reserved_mem_range() should already succeed
> > > on these as well -- we should mark anything that is MEMBLOCK_NOMAP
> > > implicitly as reserved. Or are there valid reasons not to do so? What
> > > can anyone do with that memory?
> > > 
> > > I assume they are pretty much useless for the kernel, right? Like other
> > > reserved memory ranges.
> > > 
> > 
> > On ARM, we need to know whether any physical regions that do not
> > contain system memory contain something with device semantics or not.
> > One of the examples is ACPI tables: these are in reserved memory, and
> > so they are not covered by the linear region. However, when the ACPI
> > core ioremap()s an arbitrary memory region, we don't know whether it
> > is mapping a memory region or a device region unless we keep track of
> > this in some way. (Device mappings require device attributes, but
> > firmware tables require memory attributes, as they might be accessed
> > using misaligned reads)
> 
> Using generically sounding NOMAP ("don't create direct mapping") to identify
> device regions feels like a hack. I know, it was introduced just for that
> purpose.
> 
> Looking at memblock_mark_nomap(), we consider "device regions"
> 
> 1) ACPI tables
> 
> 2) VIDEO_TYPE_EFI memory
> 
> 3) some device-tree regions in of/fdt.c
> 
> 
> IIUC, right now we end up creating a memmap for this NOMAP memory, but hide
> it away in pfn_valid(). This patch set at least fixes that.

Currently we have memmap entries with the struct page set to defaults for
the NOMAP memory. AFAIU, hiding them in pfn_valid()/pfn_valid_within() was a
solution to failures in pfn walkers that presumed that a pfn_valid() pfn has
a struct page that really reflects the state of that page.
> Assuming these pages are never mapped to user space via the struct page
> (which better be the case), we could further use a new pagetype to mark
> these pages in a special way, such that we can identify them directly via
> pfn_to_page().

Not sure we really need a new pagetype here, PG_Reserved seems to be quite
enough to say "don't touch this".  I generally agree that we could make
PG_Reserved a PageType and then have several sub-types for reserved memory.
This definitely will add clarity but I'm not sure that this justifies the
amount of churn and effort required to audit uses of PageReserved().
 
> Then, we could mostly avoid having to query memblock at runtime to figure
> out that this is special memory. This would obviously be an extension to
> this series. Just a thought. 

Stop pushing memblock out of kernel! ;-)

Now, seriously, we can minimize memblock involvement at run time, and this
series is yet another step in that direction.

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-14 Thread Mike Rapoport
On Wed, Apr 14, 2021 at 05:27:53PM +0200, Ard Biesheuvel wrote:
> On Wed, 14 Apr 2021 at 17:14, David Hildenbrand  wrote:
> >
> > On 07.04.21 19:26, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > >
> > > The struct pages representing a reserved memory region are initialized
> > > using reserve_bootmem_range() function. This function is called for each
> > > reserved region just before the memory is freed from memblock to the buddy
> > > page allocator.
> > >
> > > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > > values set by the memory map initialization which makes it necessary to
> > > have a special treatment for such pages in pfn_valid() and
> > > pfn_valid_within().
> >
> > I assume these pages are never given to the buddy, because we don't have
> > a direct mapping. So to the kernel, it's essentially just like a memory
> > hole with benefits.
> >
> > I can spot that we want to export such memory like any special memory
> > thingy/hole in /proc/iomem -- "reserved", which makes sense.
> >
> > I would assume that MEMBLOCK_NOMAP is a special type of *reserved*
> > memory. IOW, that for_each_reserved_mem_range() should already succeed
> > on these as well -- we should mark anything that is MEMBLOCK_NOMAP
> > implicitly as reserved. Or are there valid reasons not to do so? What
> > can anyone do with that memory?
> >
> > I assume they are pretty much useless for the kernel, right? Like other
> > reserved memory ranges.
> >
> 
> On ARM, we need to know whether any physical regions that do not
> contain system memory contain something with device semantics or not.
> One of the examples is ACPI tables: these are in reserved memory, and
> so they are not covered by the linear region. However, when the ACPI
> core ioremap()s an arbitrary memory region, we don't know whether it
> is mapping a memory region or a device region unless we keep track of
> this in some way. (Device mappings require device attributes, but
> firmware tables require memory attributes, as they might be accessed
> using misaligned reads)

I mostly agree, but my understanding is that regions of *physical* memory
that are occupied by various pieces of EFI/ACPI information require special
treatment because the ACPI spec defined them this way.
And since ARM cannot tolerate aliased mappings with mismatched caching
attributes, all of that firmware memory has to be ioremap()ed to access it.
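
In practice that boils down to the same check the ioremap_cache() path
already does (patch 2/3 just swaps the helper) -- roughly this sketch, where
map_firmware_table() is a made-up name and pfn_is_memory() is the helper
from that patch:

static void __iomem *map_firmware_table(phys_addr_t phys_addr, size_t size)
{
        /*
         * RAM covered by the linear map must be accessed through it to
         * avoid aliases with mismatched attributes; the NOMAP firmware
         * regions get an explicit Normal (cacheable) mapping so that
         * misaligned reads of the tables remain legal.
         */
        if (pfn_is_memory(__phys_to_pfn(phys_addr)))
                return (void __iomem *)__phys_to_virt(phys_addr);

        return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
                                __builtin_return_address(0));
}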

> > > Split out initialization of the reserved pages to a function with a
> > > meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
> > > reserved regions and mark struct pages for the NOMAP regions as
> > > PageReserved.
> > >
> > > Signed-off-by: Mike Rapoport 
> > > ---
> > >   mm/memblock.c | 23 +--
> > >   1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > index afaefa8fc6ab..6b7ea9d86310 100644
> > > --- a/mm/memblock.c
> > > +++ b/mm/memblock.c
> > > @@ -2002,6 +2002,26 @@ static unsigned long __init 
> > > __free_memory_core(phys_addr_t start,
> > >   return end_pfn - start_pfn;
> > >   }
> > >
> > > +static void __init memmap_init_reserved_pages(void)
> > > +{
> > > + struct memblock_region *region;
> > > + phys_addr_t start, end;
> > > + u64 i;
> > > +
> > > + /* initialize struct pages for the reserved regions */
> > > + for_each_reserved_mem_range(i, , )
> > > + reserve_bootmem_region(start, end);
> > > +
> > > + /* and also treat struct pages for the NOMAP regions as 
> > > PageReserved */
> > > + for_each_mem_region(region) {
> > > + if (memblock_is_nomap(region)) {
> > > + start = region->base;
> > > + end = start + region->size;
> > > + reserve_bootmem_region(start, end);
> > > + }
> > > + }
> > > +}
> > > +
> > >   static unsigned long __init free_low_memory_core_early(void)
> > >   {
> > >   unsigned long count = 0;
> > > @@ -2010,8 +2030,7 @@ static unsigned long __init 
> > > free_low_memory_core_early(void)
> > >
> > >   memblock_clear_hotplug(0, -1);
> > >
> > > - for_each_reserved_mem_range(i, , )
> > > - reserve_bootmem_region(start, end);
> > > + memmap_init_reserved_pages();
> > >
> > >   /*
> > >* We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
> > >
> >
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >
> >
> > ___
> > linux-arm-kernel mailing list
> > linux-arm-ker...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-14 Thread Mike Rapoport
On Wed, Apr 14, 2021 at 05:12:11PM +0200, David Hildenbrand wrote:
> On 07.04.21 19:26, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The struct pages representing a reserved memory region are initialized
> > using reserve_bootmem_range() function. This function is called for each
> > reserved region just before the memory is freed from memblock to the buddy
> > page allocator.
> > 
> > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > values set by the memory map initialization which makes it necessary to
> > have a special treatment for such pages in pfn_valid() and
> > pfn_valid_within().
> 
> I assume these pages are never given to the buddy, because we don't have a
> direct mapping. So to the kernel, it's essentially just like a memory hole
> with benefits.

The pages should not be accessed as normal memory so they do not have a
direct (or in ARMish linear) mapping and are never given to buddy. 
After looking at the ACPI standard I don't see a fundamental reason for this
but they've already made this mess and we need to cope with it.
 
> I can spot that we want to export such memory like any special memory
> thingy/hole in /proc/iomem -- "reserved", which makes sense.

It does, but let's wait with /proc/iomem changes. We don't really have a
100% consistent view of it on different architectures, so adding yet
another type there does not seem, well, urgent.
 
> I would assume that MEMBLOCK_NOMAP is a special type of *reserved* memory.
> IOW, that for_each_reserved_mem_range() should already succeed on these as
> well -- we should mark anything that is MEMBLOCK_NOMAP implicitly as
> reserved. Or are there valid reasons not to do so? What can anyone do with
> that memory?
> 
> I assume they are pretty much useless for the kernel, right? Like other
> reserved memory ranges.

I agree that there is a lot of commonality between NOMAP and reserved. The
problem is that even the semantics of reserved differ between
architectures. Moreover, on the same architecture there could be
E820_TYPE_RESERVED and memblock.reserved with different properties.

I'd really prefer moving in baby steps here because any change in the boot
mm can mean several months of debugging early hangs ;-)

> > Split out initialization of the reserved pages to a function with a
> > meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
> > reserved regions and mark struct pages for the NOMAP regions as
> > PageReserved.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >   mm/memblock.c | 23 +--
> >   1 file changed, 21 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index afaefa8fc6ab..6b7ea9d86310 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2002,6 +2002,26 @@ static unsigned long __init 
> > __free_memory_core(phys_addr_t start,
> > return end_pfn - start_pfn;
> >   }
> > +static void __init memmap_init_reserved_pages(void)
> > +{
> > +   struct memblock_region *region;
> > +   phys_addr_t start, end;
> > +   u64 i;
> > +
> > +   /* initialize struct pages for the reserved regions */
> > +   for_each_reserved_mem_range(i, , )
> > +   reserve_bootmem_region(start, end);
> > +
> > +   /* and also treat struct pages for the NOMAP regions as PageReserved */
> > +   for_each_mem_region(region) {
> > +   if (memblock_is_nomap(region)) {
> > +   start = region->base;
> > +   end = start + region->size;
> > +   reserve_bootmem_region(start, end);
> > +   }
> > +   }
> > +}
> > +
> >   static unsigned long __init free_low_memory_core_early(void)
> >   {
> > unsigned long count = 0;
> > @@ -2010,8 +2030,7 @@ static unsigned long __init 
> > free_low_memory_core_early(void)
> > memblock_clear_hotplug(0, -1);
> > -   for_each_reserved_mem_range(i, , )
> > -   reserve_bootmem_region(start, end);
> > +   memmap_init_reserved_pages();
> > /*
> >  * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 0/3] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-08 Thread Mike Rapoport
On Thu, Apr 08, 2021 at 10:49:02AM +0530, Anshuman Khandual wrote:
> Adding James here.
> 
> + James Morse 
> 
> On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > Hi,
> > 
> > These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
> > pfn_valid_within() to 1. 
> 
> That would be really great for arm64 platform as it will save CPU cycles on
> many generic MM paths, given that our pfn_valid() has been expensive.
> 
> > 
> > The idea is to mark NOMAP pages as reserved in the memory map and restore
> 
> Though I am not really sure, would that possibly be problematic for UEFI/EFI
> use cases as it might have just treated them as normal struct pages till now.

I don't think there should be a problem because so far the struct pages for
the UEFI/ACPI regions never got to be used by the core mm. They were
(rightfully) skipped by memblock_free_all() on one side, and pfn_valid() and
pfn_valid_within() return false for them in various pfn walkers on the
other side.
 
> > the intended semantics of pfn_valid() to designate availability of struct
> > page for a pfn.
> 
> Right, that would be better as the current semantics is not ideal.
> 
> > 
> > With this the core mm will be able to cope with the fact that it cannot use
> > NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks
> > will be treated correctly even without the need for pfn_valid_within.
> > 
> > The patches are only boot tested on qemu-system-aarch64 so I'd really
> > appreciate memory stress tests on real hardware.
> 
> Did some preliminary memory stress tests on a guest with portions of memory
> marked as MEMBLOCK_NOMAP and did not find any obvious problem. But this might
> require some testing on real UEFI environment with firmware using 
> MEMBLOCK_NOMAP
> memory to make sure that changing these struct pages to PageReserved() is 
> safe.

I surely have no access to such machines :)
 
> > If this actually works we'll be one step closer to drop custom pfn_valid()
> > on arm64 altogether.
> 
> Right, planning to rework and respin the RFC originally sent last month.
> 
> https://patchwork.kernel.org/project/linux-mm/patch/1615174073-10520-1-git-send-email-anshuman.khand...@arm.com/

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 3/3] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-08 Thread Mike Rapoport
On Thu, Apr 08, 2021 at 10:42:43AM +0530, Anshuman Khandual wrote:
> 
> On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The arm64's version of pfn_valid() differs from the generic because of two
> > reasons:
> > 
> > * Parts of the memory map are freed during boot. This makes it necessary to
> >   verify that there is actual physical memory that corresponds to a pfn
> >   which is done by querying memblock.
> > 
> > * There are NOMAP memory regions. These regions are not mapped in the
> >   linear map and until the previous commit the struct pages representing
> >   these areas had default values.
> > 
> > As the consequence of absence of the special treatment of NOMAP regions in
> > the memory map it was necessary to use memblock_is_map_memory() in
> > pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
> > generic mm functionality would not treat a NOMAP page as a normal page.
> > 
> > Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
> > the rest of core mm will treat them as unusable memory and thus
> > pfn_valid_within() is no longer required at all and can be disabled by
> > removing CONFIG_HOLES_IN_ZONE on arm64.
> 
> But what about the memory map that are freed during boot (mentioned above).
> Would not they still cause CONFIG_HOLES_IN_ZONE to be applicable and hence
> pfn_valid_within() ?

The CONFIG_HOLES_IN_ZONE name is misleading: pfn_valid_within() is only
required for holes within a MAX_ORDER_NR_PAGES block (see the comment near
the pfn_valid_within() definition in mmzone.h). Freeing the memory map
during boot does not break MAX_ORDER blocks because the holes for which the
memory map is freed are always aligned to MAX_ORDER.

AFAIU, the only case when there can be a hole inside a MAX_ORDER block is
when EFI/ACPI reserves memory for its own use and this memory becomes NOMAP
in the kernel. We still create struct pages for this memory, but they never
get values other than the defaults, so core mm has no idea that this memory
should not be touched, hence the need for pfn_valid_within() aliased to
pfn_valid() on arm64.
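
For reference, this is (roughly, from memory) how pfn_valid_within() is
wired up in include/linux/mmzone.h; dropping CONFIG_HOLES_IN_ZONE on arm64
turns it into a constant 1:

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn)   pfn_valid(pfn)
#else
#define pfn_valid_within(pfn)   (1)
#endif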
 
> > pfn_valid() can be slightly simplified by replacing
> > memblock_is_map_memory() with memblock_is_memory().
> 
> Just to understand this better, pfn_valid() will now return true for all
> MEMBLOCK_NOMAP based memory but that is okay as core MM would still ignore
> them as unusable memory for being PageReserved().

Right, pfn_valid() will return true for all memory, including
MEMBLOCK_NOMAP. Since core mm already deals with PageReserved() memory used
by the firmware, e.g. on x86, I don't see why it won't work on arm64.
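
E.g. a pfn walker that copes with firmware pages effectively does this
(sketch only, not any particular core-mm function):

static void walk_usable_pages(unsigned long start_pfn, unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page;

                if (!pfn_valid(pfn))
                        continue;

                page = pfn_to_page(pfn);
                if (PageReserved(page))
                        continue;       /* firmware/NOMAP memory, hands off */

                /* ... treat the page as normal memory ... */
        }
}
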
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/arm64/Kconfig   | 3 ---
> >  arch/arm64/mm/init.c | 4 ++--
> >  2 files changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index e4e1b6550115..58e439046d05 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
> > def_bool y
> > depends on NUMA
> >  
> > -config HOLES_IN_ZONE
> > -   def_bool y
> > -
> >  source "kernel/Kconfig.hz"
> >  
> >  config ARCH_SPARSEMEM_ENABLE
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 258b1905ed4a..bb6dd406b1f0 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
> >  
> > /*
> >  * ZONE_DEVICE memory does not have the memblock entries.
> > -* memblock_is_map_memory() check for ZONE_DEVICE based
> > +* memblock_is_memory() check for ZONE_DEVICE based
> >  * addresses will always fail. Even the normal hotplugged
> >  * memory will never have MEMBLOCK_NOMAP flag set in their
> >  * memblock entries. Skip memblock search for all non early
> > @@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
> > return pfn_section_valid(ms, pfn);
> >  }
> >  #endif
> > -   return memblock_is_map_memory(addr);
> > +   return memblock_is_memory(addr);
> >  }
> >  EXPORT_SYMBOL(pfn_valid);
> >  
> > 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()

2021-04-08 Thread Mike Rapoport
On Thu, Apr 08, 2021 at 10:44:58AM +0530, Anshuman Khandual wrote:
> 
> On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The intended semantics of pfn_valid() is to verify whether there is a
> > struct page for the pfn in question and nothing else.
> 
> Should there be a comment affirming this semantics interpretation, above the
> generic pfn_valid() in include/linux/mmzone.h ?

Yeah, that would have been helpful :)
 
> > 
> > Yet, on arm64 it is used to distinguish memory areas that are mapped in the
> > linear map vs those that require ioremap() to access them.
> > 
> > Introduce a dedicated pfn_is_memory() to perform such check and use it
> > where appropriate.
> > 
> > Signed-off-by: Mike Rapoport 
> > ---
> >  arch/arm64/include/asm/memory.h | 2 +-
> >  arch/arm64/include/asm/page.h   | 1 +
> >  arch/arm64/kvm/mmu.c| 2 +-
> >  arch/arm64/mm/init.c| 6 ++
> >  arch/arm64/mm/ioremap.c | 4 ++--
> >  arch/arm64/mm/mmu.c | 2 +-
> >  6 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/memory.h 
> > b/arch/arm64/include/asm/memory.h
> > index 0aabc3be9a75..7e77fdf71b9d 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
> >  
> >  #define virt_addr_valid(addr)  ({  
> > \
> > __typeof__(addr) __addr = __tag_reset(addr);\
> > -   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
> > +   __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr));  \
> >  })
> >  
> >  void dump_mem_limit(void);
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 012cffc574e8..32b485bcc6ff 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
> >  typedef struct page *pgtable_t;
> >  
> >  extern int pfn_valid(unsigned long);
> > +extern int pfn_is_memory(unsigned long);
> >  
> >  #include 
> >  
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 8711894db8c2..ad2ea65a3937 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
> >  
> >  static bool kvm_is_device_pfn(unsigned long pfn)
> >  {
> > -   return !pfn_valid(pfn);
> > +   return !pfn_is_memory(pfn);
> >  }
> >  
> >  /*
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 3685e12aba9b..258b1905ed4a 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
> >  }
> >  EXPORT_SYMBOL(pfn_valid);
> >  
> > +int pfn_is_memory(unsigned long pfn)
> > +{
> > +   return memblock_is_map_memory(PFN_PHYS(pfn));
> > +}
> > +EXPORT_SYMBOL(pfn_is_memory);
> > +
> 
> Should not this be generic though ? There is nothing platform or arm64
> specific in here.

As NOMAP itself is quite ARM-specific, this check is currently only
relevant for arm64 and maybe arm32.
But having an EXPORT_SYMBOL wrapper for memblock_is_map_memory(), say in
memblock, probably does make sense for all architectures that have
KEEP_MEMBLOCK.
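
Something along these lines, perhaps (purely illustrative -- the name, the
placement and the exact config guard are all still up for debate):

/* e.g. in mm/memblock.c, only built when the arch keeps memblock around */
#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
int pfn_is_map_memory(unsigned long pfn)
{
        return memblock_is_map_memory(PFN_PHYS(pfn));
}
EXPORT_SYMBOL(pfn_is_map_memory);
#endif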

> Wondering as pfn_is_memory() just indicates that the
> pfn is linear mapped, should not it be renamed as pfn_is_linear_memory()
> instead ? Regardless, it's fine either way.

Yeah, I agree that naming could be better here. I think that for a generic name
we'd need pfn_is_directly_mapped() so that it can be used on x86 ;-)
 
> >  static phys_addr_t memory_limit = PHYS_ADDR_MAX;
> >  
> >  /*
> > diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
> > index b5e83c46b23e..82a369b22ef5 100644
> > --- a/arch/arm64/mm/ioremap.c
> > +++ b/arch/arm64/mm/ioremap.c
> > @@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t 
> > phys_addr, size_t size,
> > /*
> >  * Don't allow RAM to be mapped.
> >  */
> > -   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
> > +   if (WARN_ON(pfn_is_memory(__phys_to_pfn(phys_addr
> > return NULL;
> >  
> > area = get_vm_area_caller(size, VM_IOREMAP, caller);
> > @@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
> >  void __iomem *ioremap_cache(phys_addr_t phys_addr, s

Re: [RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-07 Thread Mike Rapoport
On Thu, Apr 08, 2021 at 10:46:18AM +0530, Anshuman Khandual wrote:
> 
> 
> On 4/7/21 10:56 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> > 
> > The struct pages representing a reserved memory region are initialized
> > using reserve_bootmem_range() function. This function is called for each
> > reserved region just before the memory is freed from memblock to the buddy
> > page allocator.
> > 
> > The struct pages for MEMBLOCK_NOMAP regions are kept with the default
> > values set by the memory map initialization which makes it necessary to
> > have a special treatment for such pages in pfn_valid() and
> > pfn_valid_within().
> > 
> > Split out initialization of the reserved pages to a function with a
> > meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the
> > reserved regions and mark struct pages for the NOMAP regions as
> > PageReserved.
> 
> This would definitely need updating the comment for MEMBLOCK_NOMAP definition
> in include/linux/memblock.h just to make the semantics is clear,

Sure

> though arm64 is currently the only user for MEMBLOCK_NOMAP.

> > Signed-off-by: Mike Rapoport 
> > ---
> >  mm/memblock.c | 23 +--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index afaefa8fc6ab..6b7ea9d86310 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -2002,6 +2002,26 @@ static unsigned long __init 
> > __free_memory_core(phys_addr_t start,
> > return end_pfn - start_pfn;
> >  }
> >  
> > +static void __init memmap_init_reserved_pages(void)
> > +{
> > +   struct memblock_region *region;
> > +   phys_addr_t start, end;
> > +   u64 i;
> > +
> > +   /* initialize struct pages for the reserved regions */
> > +   for_each_reserved_mem_range(i, , )
> > +   reserve_bootmem_region(start, end);
> > +
> > +   /* and also treat struct pages for the NOMAP regions as PageReserved */
> > +   for_each_mem_region(region) {
> > +   if (memblock_is_nomap(region)) {
> > +   start = region->base;
> > +   end = start + region->size;
> > +   reserve_bootmem_region(start, end);
> > +   }
> > +   }
> > +}
> > +
> >  static unsigned long __init free_low_memory_core_early(void)
> >  {
> > unsigned long count = 0;
> > @@ -2010,8 +2030,7 @@ static unsigned long __init 
> > free_low_memory_core_early(void)
> >  
> > memblock_clear_hotplug(0, -1);
> >  
> > -   for_each_reserved_mem_range(i, , )
> > -   reserve_bootmem_region(start, end);
> > +   memmap_init_reserved_pages();
> >  
> > /*
> >  * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
> > 

-- 
Sincerely yours,
Mike.
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


[RFC/RFT PATCH 3/3] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-07 Thread Mike Rapoport
From: Mike Rapoport 

The arm64 version of pfn_valid() differs from the generic one for two
reasons:

* Parts of the memory map are freed during boot. This makes it necessary to
  verify that there is actual physical memory that corresponds to a pfn
  which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As a consequence of the absence of special treatment for NOMAP regions in
the memory map, it was necessary to use memblock_is_map_memory() in
pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled by
removing CONFIG_HOLES_IN_ZONE on arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().

Signed-off-by: Mike Rapoport 
---
 arch/arm64/Kconfig   | 3 ---
 arch/arm64/mm/init.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..58e439046d05 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1040,9 +1040,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
def_bool y
depends on NUMA
 
-config HOLES_IN_ZONE
-   def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 258b1905ed4a..bb6dd406b1f0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -243,7 +243,7 @@ int pfn_valid(unsigned long pfn)
 
/*
 * ZONE_DEVICE memory does not have the memblock entries.
-* memblock_is_map_memory() check for ZONE_DEVICE based
+* memblock_is_memory() check for ZONE_DEVICE based
 * addresses will always fail. Even the normal hotplugged
 * memory will never have MEMBLOCK_NOMAP flag set in their
 * memblock entries. Skip memblock search for all non early
@@ -254,7 +254,7 @@ int pfn_valid(unsigned long pfn)
return pfn_section_valid(ms, pfn);
 }
 #endif
-   return memblock_is_map_memory(addr);
+   return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
-- 
2.28.0



[RFC/RFT PATCH 2/3] arm64: decouple check whether pfn is normal memory from pfn_valid()

2021-04-07 Thread Mike Rapoport
From: Mike Rapoport 

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in the
linear map vs those that require ioremap() to access them.

Introduce a dedicated pfn_is_memory() to perform such check and use it
where appropriate.
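
To make the intended split explicit, here is an illustrative sketch with an
assumed caller (map_phys() is hypothetical, only pfn_is_memory() comes from
this patch):

        /*
         * pfn_valid(pfn):     a struct page exists for this pfn
         * pfn_is_memory(pfn): the pfn is normal RAM present in the linear
         *                     map, so __va()/page_address() may be used on it
         */
        static void __iomem *map_phys(phys_addr_t phys, size_t size)
        {
                if (pfn_is_memory(__phys_to_pfn(phys)))
                        /* normal RAM: reuse the existing cacheable mapping */
                        return (void __iomem *)__phys_to_virt(phys);

                /* not in the linear map: treat it as device memory */
                return ioremap(phys, size);
        }

This mirrors what ioremap_cache() does in the hunk below.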

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/memory.h | 2 +-
 arch/arm64/include/asm/page.h   | 1 +
 arch/arm64/kvm/mmu.c| 2 +-
 arch/arm64/mm/init.c| 6 ++
 arch/arm64/mm/ioremap.c | 4 ++--
 arch/arm64/mm/mmu.c | 2 +-
 6 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 0aabc3be9a75..7e77fdf71b9d 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -351,7 +351,7 @@ static inline void *phys_to_virt(phys_addr_t x)
 
 #define virt_addr_valid(addr)  ({  \
__typeof__(addr) __addr = __tag_reset(addr);\
-   __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));  \
+   __is_lm_address(__addr) && pfn_is_memory(virt_to_pfn(__addr));  \
 })
 
 void dump_mem_limit(void);
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 012cffc574e8..32b485bcc6ff 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -38,6 +38,7 @@ void copy_highpage(struct page *to, struct page *from);
 typedef struct page *pgtable_t;
 
 extern int pfn_valid(unsigned long);
+extern int pfn_is_memory(unsigned long);
 
 #include 
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8711894db8c2..ad2ea65a3937 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-   return !pfn_valid(pfn);
+   return !pfn_is_memory(pfn);
 }
 
 /*
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 3685e12aba9b..258b1905ed4a 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -258,6 +258,12 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_memory(unsigned long pfn)
+{
+   return memblock_is_map_memory(PFN_PHYS(pfn));
+}
+EXPORT_SYMBOL(pfn_is_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b5e83c46b23e..82a369b22ef5 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, 
size_t size,
/*
 * Don't allow RAM to be mapped.
 */
-   if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr
+   if (WARN_ON(pfn_is_memory(__phys_to_pfn(phys_addr
return NULL;
 
area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
/* For normal memory we already have a cacheable mapping. */
-   if (pfn_valid(__phys_to_pfn(phys_addr)))
+   if (pfn_is_memory(__phys_to_pfn(phys_addr)))
return (void __iomem *)__phys_to_virt(phys_addr);
 
return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 5d9550fdb9cf..038d20fe163f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -81,7 +81,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
-   if (!pfn_valid(pfn))
+   if (!pfn_is_memory(pfn))
return pgprot_noncached(vma_prot);
else if (file->f_flags & O_SYNC)
return pgprot_writecombine(vma_prot);
-- 
2.28.0



[RFC/RFT PATCH 1/3] memblock: update initialization of reserved pages

2021-04-07 Thread Mike Rapoport
From: Mike Rapoport 

The struct pages representing a reserved memory region are initialized
using reserve_bootmem_range() function. This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default
values set by the memory map initialization, which makes it necessary to
treat such pages specially in pfn_valid() and pfn_valid_within().

Split out the initialization of the reserved pages to a function with a
meaningful name, treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions, and mark the struct pages for the NOMAP regions as
PageReserved.
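
For each such range, reserve_bootmem_region() roughly does the following
(paraphrased from mm/page_alloc.c for context, not part of this diff):

        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                if (pfn_valid(pfn)) {
                        struct page *page = pfn_to_page(pfn);

                        /* init the struct page and keep core mm away from it */
                        init_reserved_page(pfn);
                        INIT_LIST_HEAD(&page->lru);
                        __SetPageReserved(page);
                }
        }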

Signed-off-by: Mike Rapoport 
---
 mm/memblock.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..6b7ea9d86310 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2002,6 +2002,26 @@ static unsigned long __init 
__free_memory_core(phys_addr_t start,
return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+   struct memblock_region *region;
+   phys_addr_t start, end;
+   u64 i;
+
+   /* initialize struct pages for the reserved regions */
+   for_each_reserved_mem_range(i, &start, &end)
+   reserve_bootmem_region(start, end);
+
+   /* and also treat struct pages for the NOMAP regions as PageReserved */
+   for_each_mem_region(region) {
+   if (memblock_is_nomap(region)) {
+   start = region->base;
+   end = start + region->size;
+   reserve_bootmem_region(start, end);
+   }
+   }
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
@@ -2010,8 +2030,7 @@ static unsigned long __init 
free_low_memory_core_early(void)
 
memblock_clear_hotplug(0, -1);
 
-   for_each_reserved_mem_range(i, &start, &end)
-   reserve_bootmem_region(start, end);
+   memmap_init_reserved_pages();
 
/*
 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
-- 
2.28.0



[RFC/RFT PATCH 0/3] arm64: drop pfn_valid_within() and simplify pfn_valid()

2021-04-07 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1. 

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this, the core mm will be able to cope with the fact that it cannot use
NOMAP pages, and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without the need for pfn_valid_within().

The patches are only boot tested on qemu-system-aarch64 so I'd really
appreciate memory stress tests on real hardware.

If this actually works we'll be one step closer to dropping the custom
pfn_valid() on arm64 altogether.
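
To illustrate the end state, a generic pfn walker is then expected to look
roughly like this (a sketch under the assumptions of this series, not code
from the patches):

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page;

                /* pfn_valid() now only means "a struct page exists" */
                if (!pfn_valid(pfn))
                        continue;

                page = pfn_to_page(pfn);

                /* NOMAP holes are PageReserved, so walkers skip them */
                if (PageReserved(page))
                        continue;

                /* the page is usable, linearly mapped memory */
        }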

Mike Rapoport (3):
  memblock: update initialization of reserved pages
  arm64: decouple check whether pfn is normal memory from pfn_valid()
  arm64: drop pfn_valid_within() and simplify pfn_valid()

 arch/arm64/Kconfig  |  3 ---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/page.h   |  1 +
 arch/arm64/kvm/mmu.c|  2 +-
 arch/arm64/mm/init.c| 10 --
 arch/arm64/mm/ioremap.c |  4 ++--
 arch/arm64/mm/mmu.c |  2 +-
 mm/memblock.c   | 23 +--
 8 files changed, 35 insertions(+), 12 deletions(-)


base-commit: e49d033bddf5b565044e2abe4241353959bc9120
-- 
2.28.0



Re: [PATCH v4 03/14] arm64: add support for folded p4d page tables

2020-05-16 Thread Mike Rapoport
On Fri, May 15, 2020 at 11:40:12AM -0700, Andrew Morton wrote:
> On Tue, 14 Apr 2020 18:34:44 +0300 Mike Rapoport  wrote:
> 
> > Implement primitives necessary for the 4th level folding, add walks of p4d
> > level where appropriate, replace 5level-fixup.h with pgtable-nop4d.h and
> > remove __ARCH_USE_5LEVEL_HACK.
> 
> This needed some rework due to arm changes in linux-next.  Please check
> my handiwork and test it once I've merged this into linux-next?

Looks ok to me. It passed defconfig and a couple of randconfig builds
and qemu-system-aarch64 boots fine with this.

> Rejects were
> 
> --- 
> arch/arm64/include/asm/pgtable.h~arm64-add-support-for-folded-p4d-page-tables
> +++ arch/arm64/include/asm/pgtable.h
> @@ -596,49 +604,50 @@ static inline phys_addr_t pud_page_paddr
>  
>  #define pud_ERROR(pud)   __pud_error(__FILE__, __LINE__, 
> pud_val(pud))
>  
> -#define pgd_none(pgd)(!pgd_val(pgd))
> -#define pgd_bad(pgd) (!(pgd_val(pgd) & 2))
> -#define pgd_present(pgd) (pgd_val(pgd))
> +#define p4d_none(p4d)(!p4d_val(p4d))
> +#define p4d_bad(p4d) (!(p4d_val(p4d) & 2))
> +#define p4d_present(p4d) (p4d_val(p4d))
>  
> -static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>  {
> - if (in_swapper_pgdir(pgdp)) {
> - set_swapper_pgd(pgdp, pgd);
> + if (in_swapper_pgdir(p4dp)) {
> + set_swapper_pgd((pgd_t *)p4dp, __pgd(p4d_val(p4d)));
>   return;
>   }
>  
> - WRITE_ONCE(*pgdp, pgd);
> + WRITE_ONCE(*p4dp, p4d);
>   dsb(ishst);
>   isb();
>  }
>  
> -static inline void pgd_clear(pgd_t *pgdp)
> +static inline void p4d_clear(p4d_t *p4dp)
>  {
> - set_pgd(pgdp, __pgd(0));
> + set_p4d(p4dp, __p4d(0));
>  }
>  
> -static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
> +static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
>  {
> - return __pgd_to_phys(pgd);
> + return __p4d_to_phys(p4d);
>  }
>  
>  /* Find an entry in the frst-level page table. */
>  #define pud_index(addr)  (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD 
> - 1))
>  
> -#define pud_offset_phys(dir, addr)   (pgd_page_paddr(READ_ONCE(*(dir))) + 
> pud_index(addr) * sizeof(pud_t))
> +#define pud_offset_phys(dir, addr)   (p4d_page_paddr(READ_ONCE(*(dir))) + 
> pud_index(addr) * sizeof(pud_t))
>  #define pud_offset(dir, addr)((pud_t 
> *)__va(pud_offset_phys((dir), (addr
>  
>  #define pud_set_fixmap(addr) ((pud_t *)set_fixmap_offset(FIX_PUD, 
> addr))
> -#define pud_set_fixmap_offset(pgd, addr) 
> pud_set_fixmap(pud_offset_phys(pgd, addr))
> +#define pud_set_fixmap_offset(p4d, addr) 
> pud_set_fixmap(pud_offset_phys(p4d, addr))
>  #define pud_clear_fixmap()   clear_fixmap(FIX_PUD)
>  
> -#define pgd_page(pgd)
> pfn_to_page(__phys_to_pfn(__pgd_to_phys(pgd)))
> +#define p4d_page(p4d)
> pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
>  
>  /* use ONLY for statically allocated translation tables */
>  #define pud_offset_kimg(dir,addr)((pud_t 
> *)__phys_to_kimg(pud_offset_phys((dir), (addr
>  
>  #else
>  
> +#define p4d_page_paddr(p4d)  ({ BUILD_BUG(); 0;})
>  #define pgd_page_paddr(pgd)  ({ BUILD_BUG(); 0;})
>  
>  /* Match pud_offset folding in  */
> 
> 
> 
> and
> 
> --- arch/arm64/kvm/mmu.c~arm64-add-support-for-folded-p4d-page-tables
> +++ arch/arm64/kvm/mmu.c
> @@ -469,7 +517,7 @@ static void stage2_flush_memslot(struct
>   do {
>   next = stage2_pgd_addr_end(kvm, addr, end);
>   if (!stage2_pgd_none(kvm, *pgd))
> - stage2_flush_puds(kvm, pgd, addr, next);
> + stage2_flush_p4ds(kvm, pgd, addr, next);
>   } while (pgd++, addr = next, addr != end);
>  }
>  
> 
> 
> Result:
> 
> From: Mike Rapoport 
> Subject: arm64: add support for folded p4d page tables
> 
> Implement primitives necessary for the 4th level folding, add walks of p4d
> level where appropriate, replace 5level-fixup.h with pgtable-nop4d.h and
> remove __ARCH_USE_5LEVEL_HACK.
> 
> Link: http://lkml.kernel.org/r/20200414153455.21744-4-r...@kernel.org
> Signed-off-by: Mike Rapoport 
> Cc: Arnd Bergmann 
> Cc: Benjamin Herrenschmidt 
> Cc: Brian Cain 
> Cc: Catalin Marinas 
> Cc: Christophe Leroy 
> Cc: Fenghua Yu 
> Cc: Geert Uytterhoeven 
> Cc: Guan Xuetao 
> Cc: James Morse 
> Cc: Jonas Bonn 
> Cc: Julien Thierry 
> Cc: Ley Foon Tan 
> Cc: Marc Zyngier 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Rich Felke

Re: [PATCH v4 02/14] arm: add support for folded p4d page tables

2020-05-11 Thread Mike Rapoport
Hi Marek,

On Mon, May 11, 2020 at 08:36:41AM +0200, Marek Szyprowski wrote:
> Hi Mike,
> 
> On 08.05.2020 19:42, Mike Rapoport wrote:
> > On Fri, May 08, 2020 at 08:53:27AM +0200, Marek Szyprowski wrote:
> >> On 07.05.2020 18:11, Mike Rapoport wrote:
> >>> On Thu, May 07, 2020 at 02:16:56PM +0200, Marek Szyprowski wrote:
> >>>> On 14.04.2020 17:34, Mike Rapoport wrote:
> >>>>> From: Mike Rapoport 
> >>>>>
> >>>>> Implement primitives necessary for the 4th level folding, add walks of 
> >>>>> p4d
> >>>>> level where appropriate, and remove __ARCH_USE_5LEVEL_HACK.
> >>>>>
> >>>>> Signed-off-by: Mike Rapoport 
> > Can you please try the patch below:
> >
> > diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
> > index 963b5284d284..f86b3d17928e 100644
> > --- a/arch/arm/mm/init.c
> > +++ b/arch/arm/mm/init.c
> > @@ -571,7 +571,7 @@ static inline void section_update(unsigned long addr, 
> > pmdval_t mask,
> >   {
> > pmd_t *pmd;
> >   
> > -   pmd = pmd_off_k(addr);
> > +   pmd = pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, addr), addr), 
> > addr), addr);
> >   
> >   #ifdef CONFIG_ARM_LPAE
> > pmd[0] = __pmd((pmd_val(pmd[0]) & mask) | prot);
> This fixes kexec issue! Thanks!
> 
> 
> Feel free to add:
> 
> Reported-by: Marek Szyprowski 
> Fixes: 218f1c390557 ("arm: add support for folded p4d page tables")
> Tested-by: Marek Szyprowski 

Thanks for testing!

The patch is still in the mmotm tree, so I don't think "Fixes" applies.

Andrew, would you like me to send the fix as a formal patch or will you
pick it up as a fixup?

> Best regards
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v4 02/14] arm: add support for folded p4d page tables

2020-05-08 Thread Mike Rapoport
On Fri, May 08, 2020 at 08:53:27AM +0200, Marek Szyprowski wrote:
> Hi Mike,
> 
> On 07.05.2020 18:11, Mike Rapoport wrote:
> > On Thu, May 07, 2020 at 02:16:56PM +0200, Marek Szyprowski wrote:
> >> On 14.04.2020 17:34, Mike Rapoport wrote:
> >>> From: Mike Rapoport 
> >>>
> >>> Implement primitives necessary for the 4th level folding, add walks of p4d
> >>> level where appropriate, and remove __ARCH_USE_5LEVEL_HACK.
> >>>
> >>> Signed-off-by: Mike Rapoport 
> >> Today I've noticed that kexec is broken on ARM 32bit. Bisecting between
> >> current linux-next and v5.7-rc1 pointed to this commit. I've tested this
> >> on Odroid XU4 and Raspberry Pi4 boards. Here is the relevant log:
> >>
> >> # kexec --kexec-syscall -l zImage --append "$(cat /proc/cmdline)"
> >> memory_range[0]:0x4000..0xbe9f
> >> memory_range[0]:0x4000..0xbe9f
> >> # kexec -e
> >> kexec_core: Starting new kernel
> >> 8<--- cut here ---
> >> Unable to handle kernel paging request at virtual address c010f1f4
> >> pgd = c6817793
> >> [c010f1f4] *pgd=441e(bad)
> >> Internal error: Oops: 80d [#1] PREEMPT ARM
> >> Modules linked in:
> >> CPU: 0 PID: 1329 Comm: kexec Tainted: G    W
> >> 5.7.0-rc3-00127-g6cba81ed0f62 #611
> >> Hardware name: Samsung Exynos (Flattened Device Tree)
> >> PC is at machine_kexec+0x40/0xfc
> > Any chance you have the debug info in this kernel?
> > scripts/faddr2line would come in handy here.
> 
> # ./scripts/faddr2line --list vmlinux machine_kexec+0x40
> machine_kexec+0x40/0xf8:
> 
> machine_kexec at arch/arm/kernel/machine_kexec.c:182
>   177    reboot_code_buffer = 
> page_address(image->control_code_page);
>   178
>   179    /* Prepare parameters for reboot_code_buffer*/
>   180    set_kernel_text_rw();
>   181    kexec_start_address = image->start;
>  >182<   kexec_indirection_page = page_list;
>   183    kexec_mach_type = machine_arch_type;
>   184    kexec_boot_atags = image->arch.kernel_r2;
>   185
>   186    /* copy our kernel relocation code to the control code 
> page */
>   187    reboot_entry = fncpy(reboot_code_buffer,

Can you please try the patch below:

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 963b5284d284..f86b3d17928e 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -571,7 +571,7 @@ static inline void section_update(unsigned long addr, 
pmdval_t mask,
 {
pmd_t *pmd;
 
-   pmd = pmd_off_k(addr);
+   pmd = pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, addr), addr), 
addr), addr);
 
 #ifdef CONFIG_ARM_LPAE
pmd[0] = __pmd((pmd_val(pmd[0]) & mask) | prot);

>  > ...
> 
> Best regards
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
> 

-- 
Sincerely yours,
Mike.


Re: [PATCH v4 02/14] arm: add support for folded p4d page tables

2020-05-07 Thread Mike Rapoport
Hi,

On Thu, May 07, 2020 at 02:16:56PM +0200, Marek Szyprowski wrote:
> Hi
> 
> On 14.04.2020 17:34, Mike Rapoport wrote:
> > From: Mike Rapoport 
> >
> > Implement primitives necessary for the 4th level folding, add walks of p4d
> > level where appropriate, and remove __ARCH_USE_5LEVEL_HACK.
> >
> > Signed-off-by: Mike Rapoport 
> 
> Today I've noticed that kexec is broken on ARM 32bit. Bisecting between 
> current linux-next and v5.7-rc1 pointed to this commit. I've tested this 
> on Odroid XU4 and Raspberry Pi4 boards. Here is the relevant log:
> 
> # kexec --kexec-syscall -l zImage --append "$(cat /proc/cmdline)"
> memory_range[0]:0x4000..0xbe9f
> memory_range[0]:0x4000..0xbe9f
> # kexec -e
> kexec_core: Starting new kernel
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address c010f1f4
> pgd = c6817793
> [c010f1f4] *pgd=441e(bad)
> Internal error: Oops: 80d [#1] PREEMPT ARM
> Modules linked in:
> CPU: 0 PID: 1329 Comm: kexec Tainted: G    W 
> 5.7.0-rc3-00127-g6cba81ed0f62 #611
> Hardware name: Samsung Exynos (Flattened Device Tree)
> PC is at machine_kexec+0x40/0xfc

Any chance you have the debug info in this kernel?
scripts/faddr2line would come in handy here.

> LR is at 0x
> pc : []    lr : []    psr: 6013
> sp : ebc13e60  ip : 40008000  fp : 0001
> r10: 0058  r9 : fee1dead  r8 : 0001
> r7 : c121387c  r6 : 6c224000  r5 : ece40c00  r4 : ec222000
> r3 : c010f1f4  r2 : c110  r1 : c110  r0 : 418d
> Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 6bc14059  DAC: 0051
> Process kexec (pid: 1329, stack limit = 0x366bb4dc)
> Stack: (0xebc13e60 to 0xebc14000)
> ...
> [] (machine_kexec) from [] (kernel_kexec+0x74/0x7c)
> [] (kernel_kexec) from [] (__do_sys_reboot+0x1f8/0x210)
> [] (__do_sys_reboot) from [] (ret_fast_syscall+0x0/0x28)
> Exception stack(0xebc13fa8 to 0xebc13ff0)
> ...
> ---[ end trace 3e8d6c81723c778d ]---
> 1329 Segmentation fault  ./kexec -e
> 
> > ---
> >   arch/arm/include/asm/pgtable.h |  1 -
> >   arch/arm/lib/uaccess_with_memcpy.c |  7 +-
> >   arch/arm/mach-sa1100/assabet.c |  2 +-
> >   arch/arm/mm/dump.c | 29 +-
> >   arch/arm/mm/fault-armv.c   |  7 +-
> >   arch/arm/mm/fault.c| 22 ++--
> >   arch/arm/mm/idmap.c|  3 ++-
> >   arch/arm/mm/init.c |  2 +-
> >   arch/arm/mm/ioremap.c  | 12 ++---
> >   arch/arm/mm/mm.h   |  2 +-
> >   arch/arm/mm/mmu.c  | 35 +-
> >   arch/arm/mm/pgd.c  | 40 --
> >   12 files changed, 125 insertions(+), 37 deletions(-)
> >
> > ...
> 
> Best regards
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
> 

-- 
Sincerely yours,
Mike.


[PATCH v4 14/14] mm: remove __ARCH_HAS_5LEVEL_HACK and include/asm-generic/5level-fixup.h

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

There are no architectures that use include/asm-generic/5level-fixup.h,
therefore it can be removed along with the __ARCH_HAS_5LEVEL_HACK define
and the code it surrounds.

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/5level-fixup.h | 58 --
 include/linux/mm.h |  6 
 mm/kasan/init.c| 11 --
 mm/memory.c|  8 -
 4 files changed, 83 deletions(-)
 delete mode 100644 include/asm-generic/5level-fixup.h

diff --git a/include/asm-generic/5level-fixup.h 
b/include/asm-generic/5level-fixup.h
deleted file mode 100644
index 4c74b1c1d13b..
--- a/include/asm-generic/5level-fixup.h
+++ /dev/null
@@ -1,58 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _5LEVEL_FIXUP_H
-#define _5LEVEL_FIXUP_H
-
-#define __ARCH_HAS_5LEVEL_HACK
-#define __PAGETABLE_P4D_FOLDED 1
-
-#define P4D_SHIFT  PGDIR_SHIFT
-#define P4D_SIZE   PGDIR_SIZE
-#define P4D_MASK   PGDIR_MASK
-#define MAX_PTRS_PER_P4D   1
-#define PTRS_PER_P4D   1
-
-#define p4d_t  pgd_t
-
-#define pud_alloc(mm, p4d, address) \
-   ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
-   NULL : pud_offset(p4d, address))
-
-#define p4d_alloc(mm, pgd, address)(pgd)
-#define p4d_offset(pgd, start) (pgd)
-
-#ifndef __ASSEMBLY__
-static inline int p4d_none(p4d_t p4d)
-{
-   return 0;
-}
-
-static inline int p4d_bad(p4d_t p4d)
-{
-   return 0;
-}
-
-static inline int p4d_present(p4d_t p4d)
-{
-   return 1;
-}
-#endif
-
-#define p4d_ERROR(p4d) do { } while (0)
-#define p4d_clear(p4d) pgd_clear(p4d)
-#define p4d_val(p4d)   pgd_val(p4d)
-#define p4d_populate(mm, p4d, pud) pgd_populate(mm, p4d, pud)
-#define p4d_populate_safe(mm, p4d, pud)pgd_populate(mm, p4d, pud)
-#define p4d_page(p4d)  pgd_page(p4d)
-#define p4d_page_vaddr(p4d)pgd_page_vaddr(p4d)
-
-#define __p4d(x)   __pgd(x)
-#define set_p4d(p4dp, p4d) set_pgd(p4dp, p4d)
-
-#undef p4d_free_tlb
-#define p4d_free_tlb(tlb, x, addr) do { } while (0)
-#define p4d_free(mm, x)do { } while (0)
-
-#undef  p4d_addr_end
-#define p4d_addr_end(addr, end)(end)
-
-#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..f794b77df1ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2060,11 +2060,6 @@ int __pte_alloc_kernel(pmd_t *pmd);
 
 #if defined(CONFIG_MMU)
 
-/*
- * The following ifdef needed to get the 5level-fixup.h header to work.
- * Remove it when 5level-fixup.h has been removed.
- */
-#ifndef __ARCH_HAS_5LEVEL_HACK
 static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
unsigned long address)
 {
@@ -2078,7 +2073,6 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, 
p4d_t *p4d,
return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
NULL : pud_offset(p4d, address);
 }
-#endif /* !__ARCH_HAS_5LEVEL_HACK */
 
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
address)
 {
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491ebcd..fe6be0be1f76 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -250,20 +250,9 @@ int __ref kasan_populate_early_shadow(const void 
*shadow_start,
 * 3,2 - level page tables where we don't have
 * puds,pmds, so pgd_populate(), pud_populate()
 * is noops.
-*
-* The ifndef is required to avoid build breakage.
-*
-* With 5level-fixup.h, pgd_populate() is not nop and
-* we reference kasan_early_shadow_p4d. It's not defined
-* unless 5-level paging enabled.
-*
-* The ifndef can be dropped once all KASAN-enabled
-* architectures will switch to pgtable-nop4d.h.
 */
-#ifndef __ARCH_HAS_5LEVEL_HACK
pgd_populate(&init_mm, pgd,
lm_alias(kasan_early_shadow_p4d));
-#endif
p4d = p4d_offset(pgd, addr);
p4d_populate(&init_mm, p4d,
lm_alias(kasan_early_shadow_pud));
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..379277c631b4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4434,19 +4434,11 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, 
unsigned long address)
smp_wmb(); /* See comment in __pte_alloc */
 
spin_lock(&mm->page_table_lock);
-#ifndef __ARCH_HAS_5LEVEL_HACK
if (!p4d_present(*p4d)) {
  

[PATCH v4 12/14] unicore32: remove __ARCH_USE_5LEVEL_HACK

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

The unicore32 architecture has 2-level page tables and uses
asm-generic/pgtable-nopmd.h together with explicit casts from pud_t to
pgd_t for page table folding.

Add p4d walk in the only place that actually unfolds the pud level and
remove __ARCH_USE_5LEVEL_HACK.

Signed-off-by: Mike Rapoport 
---
 arch/unicore32/include/asm/pgtable.h | 1 -
 arch/unicore32/kernel/hibernate.c| 4 +++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/unicore32/include/asm/pgtable.h 
b/arch/unicore32/include/asm/pgtable.h
index 3b8731b3a937..826f49edd94e 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -9,7 +9,6 @@
 #ifndef __UNICORE_PGTABLE_H__
 #define __UNICORE_PGTABLE_H__
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 
diff --git a/arch/unicore32/kernel/hibernate.c 
b/arch/unicore32/kernel/hibernate.c
index f3812245cc00..ccad051a79b6 100644
--- a/arch/unicore32/kernel/hibernate.c
+++ b/arch/unicore32/kernel/hibernate.c
@@ -33,9 +33,11 @@ struct swsusp_arch_regs swsusp_arch_regs_cpu0;
 static pmd_t *resume_one_md_table_init(pgd_t *pgd)
 {
pud_t *pud;
+   p4d_t *p4d;
pmd_t *pmd_table;
 
-   pud = pud_offset(pgd, 0);
+   p4d = p4d_offset(pgd, 0);
+   pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);
 
return pmd_table;
-- 
2.25.1



[PATCH v4 11/14] sh: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and remove usage of __ARCH_USE_5LEVEL_HACK.
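
The pattern repeated across the series is to insert the (possibly folded)
p4d step into the existing page table walks. The shape is roughly the
following (illustrative only, error handling varies per call site):

        pgd = pgd_offset(mm, addr);
        if (pgd_none(*pgd) || pgd_bad(*pgd))
                return NULL;

        /* new step: with pgtable-nop4d.h this folds back into the pgd */
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d) || p4d_bad(*p4d))
                return NULL;

        pud = pud_offset(p4d, addr);    /* was pud_offset(pgd, addr) */
        pmd = pmd_offset(pud, addr);
        pte = pte_offset_kernel(pmd, addr);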

Signed-off-by: Mike Rapoport 
---
 arch/sh/include/asm/pgtable-2level.h |  1 -
 arch/sh/include/asm/pgtable-3level.h |  1 -
 arch/sh/kernel/io_trapped.c  |  7 ++-
 arch/sh/mm/cache-sh4.c   |  4 +++-
 arch/sh/mm/cache-sh5.c   |  7 ++-
 arch/sh/mm/fault.c   | 26 +++---
 arch/sh/mm/hugetlbpage.c | 28 ++--
 arch/sh/mm/init.c|  9 -
 arch/sh/mm/kmap.c|  2 +-
 arch/sh/mm/tlbex_32.c|  6 +-
 arch/sh/mm/tlbex_64.c|  7 ++-
 11 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/arch/sh/include/asm/pgtable-2level.h 
b/arch/sh/include/asm/pgtable-2level.h
index bf1eb51c3ee5..08bff93927ff 100644
--- a/arch/sh/include/asm/pgtable-2level.h
+++ b/arch/sh/include/asm/pgtable-2level.h
@@ -2,7 +2,6 @@
 #ifndef __ASM_SH_PGTABLE_2LEVEL_H
 #define __ASM_SH_PGTABLE_2LEVEL_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /*
diff --git a/arch/sh/include/asm/pgtable-3level.h 
b/arch/sh/include/asm/pgtable-3level.h
index 779260b721ca..0f80097e5c9c 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -2,7 +2,6 @@
 #ifndef __ASM_SH_PGTABLE_3LEVEL_H
 #define __ASM_SH_PGTABLE_3LEVEL_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /*
diff --git a/arch/sh/kernel/io_trapped.c b/arch/sh/kernel/io_trapped.c
index 60c828a2b8a2..037aab2708b7 100644
--- a/arch/sh/kernel/io_trapped.c
+++ b/arch/sh/kernel/io_trapped.c
@@ -136,6 +136,7 @@ EXPORT_SYMBOL_GPL(match_trapped_io_handler);
 static struct trapped_io *lookup_tiop(unsigned long address)
 {
pgd_t *pgd_k;
+   p4d_t *p4d_k;
pud_t *pud_k;
pmd_t *pmd_k;
pte_t *pte_k;
@@ -145,7 +146,11 @@ static struct trapped_io *lookup_tiop(unsigned long 
address)
if (!pgd_present(*pgd_k))
return NULL;
 
-   pud_k = pud_offset(pgd_k, address);
+   p4d_k = p4d_offset(pgd_k, address);
+   if (!p4d_present(*p4d_k))
+   return NULL;
+
+   pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
return NULL;
 
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index eee911422cf9..45943bcb7042 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -209,6 +209,7 @@ static void sh4_flush_cache_page(void *args)
unsigned long address, pfn, phys;
int map_coherent = 0;
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -224,7 +225,8 @@ static void sh4_flush_cache_page(void *args)
return;
 
pgd = pgd_offset(vma->vm_mm, address);
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_kernel(pmd, address);
 
diff --git a/arch/sh/mm/cache-sh5.c b/arch/sh/mm/cache-sh5.c
index 445b5e69b73c..442a77cc2957 100644
--- a/arch/sh/mm/cache-sh5.c
+++ b/arch/sh/mm/cache-sh5.c
@@ -383,6 +383,7 @@ static void sh64_dcache_purge_user_pages(struct mm_struct 
*mm,
unsigned long addr, unsigned long end)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -397,7 +398,11 @@ static void sh64_dcache_purge_user_pages(struct mm_struct 
*mm,
if (pgd_bad(*pgd))
return;
 
-   pud = pud_offset(pgd, addr);
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d) || p4d_bad(*p4d))
+   return;
+
+   pud = pud_offset(p4d, addr);
if (pud_none(*pud) || pud_bad(*pud))
return;
 
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 7b74e18b71d7..8b3ab65c81c4 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -53,6 +53,7 @@ static void show_pte(struct mm_struct *mm, unsigned long addr)
 (u64)pgd_val(*pgd));
 
do {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -65,7 +66,20 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
}
 
-   pud = pud_offset(pgd, addr);
+   p4d = p4d_offset(pgd, addr);
+   if (PTRS_PER_P4D != 1)
+   pr_cont(", *p4d=%0*Lx", (u32)(sizeof(*p4d) * 2),
+   (u64)p4d_val(*p4d));
+
+   if (p4d_none(*p4d))
+   break;
+
+   if (p4d_bad(*p4d)) {
+   pr_cont("(bad)");
+   break;
+   }
+
+   pud = pud_offset(p4d, addr);
  

[PATCH v4 13/14] asm-generic: remove pgtable-nop4d-hack.h

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

No architecture defines __ARCH_USE_5LEVEL_HACK and therefore
pgtable-nop4d-hack.h will never actually be included.

Remove it.
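
For context, the hack header was only reachable through the #ifdef that the
pgtable-nopud.h hunk below removes; before this patch the chain looked
roughly like this (the header names are spelled out here as a reminder):

        /* include/asm-generic/pgtable-nopud.h, before this patch */
        #ifdef __ARCH_USE_5LEVEL_HACK
        #include <asm-generic/pgtable-nop4d-hack.h>
        #else
        #include <asm-generic/pgtable-nop4d.h>
        /* ... */
        #endif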

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/pgtable-nop4d-hack.h | 64 
 include/asm-generic/pgtable-nopud.h  |  4 --
 2 files changed, 68 deletions(-)
 delete mode 100644 include/asm-generic/pgtable-nop4d-hack.h

diff --git a/include/asm-generic/pgtable-nop4d-hack.h 
b/include/asm-generic/pgtable-nop4d-hack.h
deleted file mode 100644
index 829bdb0d6327..
--- a/include/asm-generic/pgtable-nop4d-hack.h
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _PGTABLE_NOP4D_HACK_H
-#define _PGTABLE_NOP4D_HACK_H
-
-#ifndef __ASSEMBLY__
-#include 
-
-#define __PAGETABLE_PUD_FOLDED 1
-
-/*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
- * without casting.
- */
-typedef struct { pgd_t pgd; } pud_t;
-
-#define PUD_SHIFT  PGDIR_SHIFT
-#define PTRS_PER_PUD   1
-#define PUD_SIZE   (1UL << PUD_SHIFT)
-#define PUD_MASK   (~(PUD_SIZE-1))
-
-/*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
- * setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
- */
-static inline int pgd_none(pgd_t pgd)  { return 0; }
-static inline int pgd_bad(pgd_t pgd)   { return 0; }
-static inline int pgd_present(pgd_t pgd)   { return 1; }
-static inline void pgd_clear(pgd_t *pgd)   { }
-#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
-#define pgd_populate(mm, pgd, pud) do { } while (0)
-#define pgd_populate_safe(mm, pgd, pud)do { } while (0)
-/*
- * (puds are folded into pgds so this doesn't get actually called,
- * but the define is needed for a generic inline function.)
- */
-#define set_pgd(pgdptr, pgdval)set_pud((pud_t *)(pgdptr), (pud_t) { 
pgdval })
-
-static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
-{
-   return (pud_t *)pgd;
-}
-
-#define pud_val(x) (pgd_val((x).pgd))
-#define __pud(x)   ((pud_t) { __pgd(x) })
-
-#define pgd_page(pgd)  (pud_page((pud_t){ pgd }))
-#define pgd_page_vaddr(pgd)(pud_page_vaddr((pud_t){ pgd }))
-
-/*
- * allocating and freeing a pud is trivial: the 1-entry pud is
- * inside the pgd, so has no extra memory associated with it.
- */
-#define pud_alloc_one(mm, address) NULL
-#define pud_free(mm, x)do { } while (0)
-#define __pud_free_tlb(tlb, x, a)  do { } while (0)
-
-#undef  pud_addr_end
-#define pud_addr_end(addr, end)(end)
-
-#endif /* __ASSEMBLY__ */
-#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h 
b/include/asm-generic/pgtable-nopud.h
index d3776cb494c0..ad05c1684bfc 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -4,9 +4,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef __ARCH_USE_5LEVEL_HACK
-#include 
-#else
 #include 
 
 #define __PAGETABLE_PUD_FOLDED 1
@@ -65,5 +62,4 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long 
address)
 #define pud_addr_end(addr, end)(end)
 
 #endif /* __ASSEMBLY__ */
-#endif /* !__ARCH_USE_5LEVEL_HACK */
 #endif /* _PGTABLE_NOPUD_H */
-- 
2.25.1



[PATCH v4 10/14] sh: drop __pXd_offset() macros that duplicate pXd_index() ones

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

The __pXd_offset() macros are identical to the pXd_index() macros and there
is no point in keeping both of them. All architectures define and use
pXd_index(), so let's keep only those to make sh consistent with the rest
of the kernel.

Signed-off-by: Mike Rapoport 
---
 arch/sh/include/asm/pgtable_32.h | 5 ++---
 arch/sh/include/asm/pgtable_64.h | 5 ++---
 arch/sh/mm/init.c| 6 +++---
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 29274f0e428e..4acce5f2cbf9 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -407,13 +407,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t 
newprot)
 /* to find an entry in a page-table-directory. */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, address)((mm)->pgd + pgd_index(address))
-#define __pgd_offset(address)  pgd_index(address)
 
 /* to find an entry in a kernel page-table-directory */
 #define pgd_offset_k(address)  pgd_offset(&init_mm, address)
 
-#define __pud_offset(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-#define __pmd_offset(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
+#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+#define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
 
 /* Find an entry in the third-level page table.. */
 #define pte_index(address) ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
diff --git a/arch/sh/include/asm/pgtable_64.h b/arch/sh/include/asm/pgtable_64.h
index 1778bc5971e7..27cc282ec6c0 100644
--- a/arch/sh/include/asm/pgtable_64.h
+++ b/arch/sh/include/asm/pgtable_64.h
@@ -46,14 +46,13 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval)
 
 /* To find an entry in a generic PGD. */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
-#define __pgd_offset(address) pgd_index(address)
 #define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
 
 /* To find an entry in a kernel PGD. */
 #define pgd_offset_k(address) pgd_offset(&init_mm, address)
 
-#define __pud_offset(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-#define __pmd_offset(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
+#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+/* #define pmd_index(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)) */
 
 /*
  * PMD level access routines. Same notes as above.
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index b9de2d4fa57e..f445ba630790 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -172,9 +172,9 @@ void __init page_table_range_init(unsigned long start, 
unsigned long end,
unsigned long vaddr;
 
vaddr = start;
-   i = __pgd_offset(vaddr);
-   j = __pud_offset(vaddr);
-   k = __pmd_offset(vaddr);
+   i = pgd_index(vaddr);
+   j = pud_index(vaddr);
+   k = pmd_index(vaddr);
pgd = pgd_base + i;
 
for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-- 
2.25.1



[PATCH v4 06/14] nios2: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and remove usage of __ARCH_USE_5LEVEL_HACK.

Signed-off-by: Mike Rapoport 
---
 arch/nios2/include/asm/pgtable.h | 3 +--
 arch/nios2/mm/fault.c| 9 +++--
 arch/nios2/mm/ioremap.c  | 6 +-
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index f98b7f4519ba..47a1a3ea5734 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -22,7 +22,6 @@
 #include 
 
 #include 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #define FIRST_USER_ADDRESS 0UL
@@ -100,7 +99,7 @@ extern pte_t invalid_pte_table[PAGE_SIZE/sizeof(pte_t)];
  */
 static inline void set_pmd(pmd_t *pmdptr, pmd_t pmdval)
 {
-   pmdptr->pud.pgd.pgd = pmdval.pud.pgd.pgd;
+   *pmdptr = pmdval;
 }
 
 /* to find an entry in a page-table-directory */
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index ec9d8a9c426f..964eac1a21d0 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -242,6 +242,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, 
unsigned long cause,
 */
int offset = pgd_index(address);
pgd_t *pgd, *pgd_k;
+   p4d_t *p4d, *p4d_k;
pud_t *pud, *pud_k;
pmd_t *pmd, *pmd_k;
pte_t *pte_k;
@@ -253,8 +254,12 @@ asmlinkage void do_page_fault(struct pt_regs *regs, 
unsigned long cause,
goto no_context;
set_pgd(pgd, *pgd_k);
 
-   pud = pud_offset(pgd, address);
-   pud_k = pud_offset(pgd_k, address);
+   p4d = p4d_offset(pgd, address);
+   p4d_k = p4d_offset(pgd_k, address);
+   if (!p4d_present(*p4d_k))
+   goto no_context;
+   pud = pud_offset(p4d, address);
+   pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
goto no_context;
pmd = pmd_offset(pud, address);
diff --git a/arch/nios2/mm/ioremap.c b/arch/nios2/mm/ioremap.c
index 819bdfcc2e71..fe821efb9a99 100644
--- a/arch/nios2/mm/ioremap.c
+++ b/arch/nios2/mm/ioremap.c
@@ -86,11 +86,15 @@ static int remap_area_pages(unsigned long address, unsigned 
long phys_addr,
if (address >= end)
BUG();
do {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
 
error = -ENOMEM;
-   pud = pud_alloc(&init_mm, dir, address);
+   p4d = p4d_alloc(&init_mm, dir, address);
+   if (!p4d)
+   break;
+   pud = pud_alloc(&init_mm, p4d, address);
if (!pud)
break;
pmd = pmd_alloc(_mm, pud, address);
-- 
2.25.1



[PATCH v4 08/14] powerpc: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.

Signed-off-by: Mike Rapoport 
Tested-by: Christophe Leroy  # 8xx and 83xx
---
 arch/powerpc/include/asm/book3s/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/book3s/64/hash.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h  |  4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 60 ++-
 arch/powerpc/include/asm/book3s/64/radix.h|  6 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/nohash/64/pgalloc.h  |  2 +-
 .../include/asm/nohash/64/pgtable-4k.h| 32 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  6 +-
 arch/powerpc/include/asm/pgtable.h| 10 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 32 ++
 arch/powerpc/lib/code-patching.c  |  7 ++-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |  4 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 26 +---
 arch/powerpc/mm/book3s64/subpage_prot.c   |  6 +-
 arch/powerpc/mm/hugetlbpage.c | 28 +
 arch/powerpc/mm/nohash/book3e_pgtable.c   | 15 ++---
 arch/powerpc/mm/pgtable.c | 30 ++
 arch/powerpc/mm/pgtable_64.c  | 10 ++--
 arch/powerpc/mm/ptdump/hashpagetable.c| 20 ++-
 arch/powerpc/mm/ptdump/ptdump.c   | 14 +++--
 arch/powerpc/xmon/xmon.c  | 18 +++---
 22 files changed, 197 insertions(+), 139 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 7549393c4c43..6052b72216a6 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 #define _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #include 
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 6fc4520092c7..73ad038ed10b 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -134,9 +134,9 @@ static inline int get_region_id(unsigned long ea)
 
 #definehash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
 #definehash__pud_bad(pud)  (pud_val(pud) & H_PUD_BAD_BITS)
-static inline int hash__pgd_bad(pgd_t pgd)
+static inline int hash__p4d_bad(p4d_t p4d)
 {
-   return (pgd_val(pgd) == 0);
+   return (p4d_val(p4d) == 0);
 }
 #ifdef CONFIG_STRICT_KERNEL_RWX
 extern void hash__mark_rodata_ro(void);
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index a41e91bd0580..69c5b051734f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -85,9 +85,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *pgd, pud_t *pud)
 {
-   *pgd =  __pgd(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
+   *pgd =  __p4d(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 368b136517e0..bc047514724c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 #define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 
-#include 
+#include 
 
 #ifndef __ASSEMBLY__
 #include 
@@ -251,7 +251,7 @@ extern unsigned long __pmd_frag_size_shift;
 /* Bits to mask out from a PUD to get to the PMD page */
 #define PUD_MASKED_BITS0xc0ffUL
 /* Bits to mask out from a PGD to get to the PUD page */
-#define PGD_MASKED_BITS0xc0ffUL
+#define P4D_MASKED_BITS0xc0ffUL
 
 /*
  * Used as an indicator for rcu callback functions
@@ -949,54 +949,60 @@ static inline bool pud_access_permitted(pud_t pud, bool 
write)
return pte_access_permitted(pud_pte(pud), write);
 }
 
-#define pgd_write(pgd) pte_write(pgd_pte(pgd))
+#define __p4d_raw(x)   ((p4d_t) { __pgd_raw(x) })
+static inline __be64 p4d_raw(p4d_t x)
+{
+   return pgd_raw(x.pgd);
+}
+
+#define p4d_write(p4d) pte_write(p4d_pte(p4d))
 
-static inline void pgd_clear(pgd_t *pgdp)
+static inline void p4d_clear(p4d_t *p4dp)
 {
-   *pgdp = __pgd(0);
+   *p4dp = __p4d(0);
 }
 
-static inline int pgd_none(pgd_t pgd)
+static inline int p4d_none(p4d_t p4d)
 {
-   return !pgd_raw(pgd);
+   return !p4d_raw(p4d);
 }
 
-static in

[PATCH v4 09/14] sh: fault: Modernize printing of kernel messages

2020-04-14 Thread Mike Rapoport
From: Geert Uytterhoeven 

  - Convert from printk() to pr_*(),
  - Add missing continuations,
  - Use "%llx" to format u64,
  - Join multiple prints in show_fault_oops() into a single print.

Signed-off-by: Geert Uytterhoeven 
Signed-off-by: Mike Rapoport 
---
 arch/sh/mm/fault.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f23d7907597..7b74e18b71d7 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -47,10 +47,10 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
pgd = swapper_pg_dir;
}
 
-   printk(KERN_ALERT "pgd = %p\n", pgd);
+   pr_alert("pgd = %p\n", pgd);
pgd += pgd_index(addr);
-   printk(KERN_ALERT "[%08lx] *pgd=%0*Lx", addr,
-  (u32)(sizeof(*pgd) * 2), (u64)pgd_val(*pgd));
+   pr_alert("[%08lx] *pgd=%0*llx", addr, (u32)(sizeof(*pgd) * 2),
+(u64)pgd_val(*pgd));
 
do {
pud_t *pud;
@@ -61,33 +61,33 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
 
if (pgd_bad(*pgd)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
pud = pud_offset(pgd, addr);
if (PTRS_PER_PUD != 1)
-   printk(", *pud=%0*Lx", (u32)(sizeof(*pud) * 2),
-  (u64)pud_val(*pud));
+   pr_cont(", *pud=%0*llx", (u32)(sizeof(*pud) * 2),
+   (u64)pud_val(*pud));
 
if (pud_none(*pud))
break;
 
if (pud_bad(*pud)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
pmd = pmd_offset(pud, addr);
if (PTRS_PER_PMD != 1)
-   printk(", *pmd=%0*Lx", (u32)(sizeof(*pmd) * 2),
-  (u64)pmd_val(*pmd));
+   pr_cont(", *pmd=%0*llx", (u32)(sizeof(*pmd) * 2),
+   (u64)pmd_val(*pmd));
 
if (pmd_none(*pmd))
break;
 
if (pmd_bad(*pmd)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
@@ -96,11 +96,11 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
 
pte = pte_offset_kernel(pmd, addr);
-   printk(", *pte=%0*Lx", (u32)(sizeof(*pte) * 2),
-  (u64)pte_val(*pte));
+   pr_cont(", *pte=%0*llx", (u32)(sizeof(*pte) * 2),
+   (u64)pte_val(*pte));
} while (0);
 
-   printk("\n");
+   pr_cont("\n");
 }
 
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
@@ -188,14 +188,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
address)
if (!oops_may_print())
return;
 
-   printk(KERN_ALERT "BUG: unable to handle kernel ");
-   if (address < PAGE_SIZE)
-   printk(KERN_CONT "NULL pointer dereference");
-   else
-   printk(KERN_CONT "paging request");
-
-   printk(KERN_CONT " at %08lx\n", address);
-   printk(KERN_ALERT "PC:");
+   pr_alert("BUG: unable to handle kernel %s at %08lx\n",
+address < PAGE_SIZE ? "NULL pointer dereference"
+: "paging request",
+address);
+   pr_alert("PC:");
printk_address(regs->pc, 1);
 
show_pte(NULL, address);
-- 
2.25.1



[PATCH v4 07/14] openrisc: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and remove usage of __ARCH_USE_5LEVEL_HACK.

Signed-off-by: Mike Rapoport 
---
 arch/openrisc/include/asm/pgtable.h |  1 -
 arch/openrisc/mm/fault.c| 10 --
 arch/openrisc/mm/init.c |  4 +++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/openrisc/include/asm/pgtable.h 
b/arch/openrisc/include/asm/pgtable.h
index 7f3fb9ceb083..219979e57790 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -21,7 +21,6 @@
 #ifndef __ASM_OPENRISC_PGTABLE_H
 #define __ASM_OPENRISC_PGTABLE_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #ifndef __ASSEMBLY__
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 8af1cc78c4fb..6e0a11ac4c00 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -295,6 +295,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, 
unsigned long address,
 
int offset = pgd_index(address);
pgd_t *pgd, *pgd_k;
+   p4d_t *p4d, *p4d_k;
pud_t *pud, *pud_k;
pmd_t *pmd, *pmd_k;
pte_t *pte_k;
@@ -321,8 +322,13 @@ asmlinkage void do_page_fault(struct pt_regs *regs, 
unsigned long address,
 * it exists.
 */
 
-   pud = pud_offset(pgd, address);
-   pud_k = pud_offset(pgd_k, address);
+   p4d = p4d_offset(pgd, address);
+   p4d_k = p4d_offset(pgd_k, address);
+   if (!p4d_present(*p4d_k))
+   goto no_context;
+
+   pud = pud_offset(p4d, address);
+   pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
goto no_context;
 
diff --git a/arch/openrisc/mm/init.c b/arch/openrisc/mm/init.c
index 1f87b524db78..2536aeae0975 100644
--- a/arch/openrisc/mm/init.c
+++ b/arch/openrisc/mm/init.c
@@ -71,6 +71,7 @@ static void __init map_ram(void)
unsigned long v, p, e;
pgprot_t prot;
pgd_t *pge;
+   p4d_t *p4e;
pud_t *pue;
pmd_t *pme;
pte_t *pte;
@@ -90,7 +91,8 @@ static void __init map_ram(void)
 
while (p < e) {
int j;
-   pue = pud_offset(pge, v);
+   p4e = p4d_offset(pge, v);
+   pue = pud_offset(p4e, v);
pme = pmd_offset(pue, v);
 
if ((u32) pue != (u32) pge || (u32) pme != (u32) pge) {
-- 
2.25.1



[PATCH v4 03/14] arm64: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate, replace 5level-fixup.h with pgtable-nop4d.h and
remove __ARCH_USE_5LEVEL_HACK.
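
With this change arm64 selects the generic folding header purely by
CONFIG_PGTABLE_LEVELS in pgtable-types.h; reconstructed from the hunk
below, with the generic header names filled in as an assumption of this
sketch:

        #if CONFIG_PGTABLE_LEVELS == 2
        #include <asm-generic/pgtable-nopmd.h>
        #elif CONFIG_PGTABLE_LEVELS == 3
        #include <asm-generic/pgtable-nopud.h>
        #elif CONFIG_PGTABLE_LEVELS == 4
        #include <asm-generic/pgtable-nop4d.h>
        #endif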

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/kvm_mmu.h|  10 +-
 arch/arm64/include/asm/pgalloc.h|  10 +-
 arch/arm64/include/asm/pgtable-types.h  |   5 +-
 arch/arm64/include/asm/pgtable.h|  37 +++--
 arch/arm64/include/asm/stage2_pgtable.h |  48 --
 arch/arm64/kernel/hibernate.c   |  44 -
 arch/arm64/mm/fault.c   |   9 +-
 arch/arm64/mm/hugetlbpage.c |  15 +-
 arch/arm64/mm/kasan_init.c  |  26 ++-
 arch/arm64/mm/mmu.c |  52 --
 arch/arm64/mm/pageattr.c|   7 +-
 virt/kvm/arm/mmu.c  | 209 
 12 files changed, 368 insertions(+), 104 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 30b0e8d6b895..8255fab2e441 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -172,8 +172,8 @@ void kvm_clear_hyp_idmap(void);
__pmd(__phys_to_pmd_val(__pa(ptep)) | PMD_TYPE_TABLE)
 #define kvm_mk_pud(pmdp)   \
__pud(__phys_to_pud_val(__pa(pmdp)) | PMD_TYPE_TABLE)
-#define kvm_mk_pgd(pudp)   \
-   __pgd(__phys_to_pgd_val(__pa(pudp)) | PUD_TYPE_TABLE)
+#define kvm_mk_p4d(pmdp)   \
+   __p4d(__phys_to_p4d_val(__pa(pmdp)) | PUD_TYPE_TABLE)
 
 #define kvm_set_pud(pudp, pud) set_pud(pudp, pud)
 
@@ -299,6 +299,12 @@ static inline bool kvm_s2pud_young(pud_t pud)
 #define hyp_pud_table_empty(pudp) kvm_page_empty(pudp)
 #endif
 
+#ifdef __PAGETABLE_P4D_FOLDED
+#define hyp_p4d_table_empty(p4dp) (0)
+#else
+#define hyp_p4d_table_empty(p4dp) kvm_page_empty(p4dp)
+#endif
+
 struct kvm;
 
 #define kvm_flush_dcache_to_poc(a,l)   __flush_dcache_area((a), (l))
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 172d76fa0245..58e93583ddb6 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -73,17 +73,17 @@ static inline void pud_free(struct mm_struct *mm, pud_t 
*pudp)
free_page((unsigned long)pudp);
 }
 
-static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pudp, pgdval_t prot)
+static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
 {
-   set_pgd(pgdp, __pgd(__phys_to_pgd_val(pudp) | prot));
+   set_p4d(p4dp, __p4d(__phys_to_p4d_val(pudp) | prot));
 }
 
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, pud_t *pudp)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
 {
-   __pgd_populate(pgdp, __pa(pudp), PUD_TYPE_TABLE);
+   __p4d_populate(p4dp, __pa(pudp), PUD_TYPE_TABLE);
 }
 #else
-static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pudp, pgdval_t prot)
+static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
 {
BUILD_BUG();
 }
diff --git a/arch/arm64/include/asm/pgtable-types.h 
b/arch/arm64/include/asm/pgtable-types.h
index acb0751a6606..b8f158ae2527 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -14,6 +14,7 @@
 typedef u64 pteval_t;
 typedef u64 pmdval_t;
 typedef u64 pudval_t;
+typedef u64 p4dval_t;
 typedef u64 pgdval_t;
 
 /*
@@ -44,13 +45,11 @@ typedef struct { pteval_t pgprot; } pgprot_t;
 #define __pgprot(x)((pgprot_t) { (x) } )
 
 #if CONFIG_PGTABLE_LEVELS == 2
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #elif CONFIG_PGTABLE_LEVELS == 3
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #elif CONFIG_PGTABLE_LEVELS == 4
-#include 
+#include 
 #endif
 
 #endif /* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 538c85e62f86..c23c5a4e6dc6 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -298,6 +298,11 @@ static inline pte_t pgd_pte(pgd_t pgd)
return __pte(pgd_val(pgd));
 }
 
+static inline pte_t p4d_pte(p4d_t p4d)
+{
+   return __pte(p4d_val(p4d));
+}
+
 static inline pte_t pud_pte(pud_t pud)
 {
return __pte(pud_val(pud));
@@ -401,6 +406,9 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 
 #define set_pmd_at(mm, addr, pmdp, pmd)set_pte_at(mm, addr, (pte_t 
*)pmdp, pmd_pte(pmd))
 
+#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
+#define __phys_to_p4d_val(phys)__phys_to_pte_val(phys)
+
 #define __pgd_to_phys(pgd) __pte_to_phys(pgd_pte(pgd))
 #define __phys_to_pgd_val(phys)__phys_to_pte_val(phys)
 
@@ -588,49 +596,50 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
 
 #define pud_ERROR(pud) __pud_error(__FILE__, __LINE__, pud_val(pud))
 
-#define pgd_none(pgd)  (!pgd_val(pgd))
-#define pgd_bad(pgd

[PATCH v4 05/14] ia64: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate, remove usage of __ARCH_USE_5LEVEL_HACK and replace
5level-fixup.h with pgtable-nop4d.h

Signed-off-by: Mike Rapoport 
---
 arch/ia64/include/asm/pgalloc.h |  4 ++--
 arch/ia64/include/asm/pgtable.h | 17 -
 arch/ia64/mm/fault.c|  7 ++-
 arch/ia64/mm/hugetlbpage.c  | 18 --
 arch/ia64/mm/init.c | 28 
 5 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/arch/ia64/include/asm/pgalloc.h b/arch/ia64/include/asm/pgalloc.h
index f4c491044882..2a3050345099 100644
--- a/arch/ia64/include/asm/pgalloc.h
+++ b/arch/ia64/include/asm/pgalloc.h
@@ -36,9 +36,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 #if CONFIG_PGTABLE_LEVELS == 4
 static inline void
-pgd_populate(struct mm_struct *mm, pgd_t * pgd_entry, pud_t * pud)
+p4d_populate(struct mm_struct *mm, p4d_t * p4d_entry, pud_t * pud)
 {
-   pgd_val(*pgd_entry) = __pa(pud);
+   p4d_val(*p4d_entry) = __pa(pud);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 0e7b645b76c6..787b0a91d255 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -283,12 +283,12 @@ extern unsigned long VMALLOC_END;
 #define pud_page(pud)  virt_to_page((pud_val(pud) + 
PAGE_OFFSET))
 
 #if CONFIG_PGTABLE_LEVELS == 4
-#define pgd_none(pgd)  (!pgd_val(pgd))
-#define pgd_bad(pgd)   (!ia64_phys_addr_valid(pgd_val(pgd)))
-#define pgd_present(pgd)   (pgd_val(pgd) != 0UL)
-#define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL)
-#define pgd_page_vaddr(pgd)((unsigned long) __va(pgd_val(pgd) & 
_PFN_MASK))
-#define pgd_page(pgd)  virt_to_page((pgd_val(pgd) + 
PAGE_OFFSET))
+#define p4d_none(p4d)  (!p4d_val(p4d))
+#define p4d_bad(p4d)   (!ia64_phys_addr_valid(p4d_val(p4d)))
+#define p4d_present(p4d)   (p4d_val(p4d) != 0UL)
+#define p4d_clear(p4dp)(p4d_val(*(p4dp)) = 0UL)
+#define p4d_page_vaddr(p4d)((unsigned long) __va(p4d_val(p4d) & 
_PFN_MASK))
+#define p4d_page(p4d)  virt_to_page((p4d_val(p4d) + 
PAGE_OFFSET))
 #endif
 
 /*
@@ -386,7 +386,7 @@ pgd_offset (const struct mm_struct *mm, unsigned long 
address)
 #if CONFIG_PGTABLE_LEVELS == 4
 /* Find an entry in the second-level page table.. */
 #define pud_offset(dir,addr) \
-   ((pud_t *) pgd_page_vaddr(*(dir)) + (((addr) >> PUD_SHIFT) & 
(PTRS_PER_PUD - 1)))
+   ((pud_t *) p4d_page_vaddr(*(dir)) + (((addr) >> PUD_SHIFT) & 
(PTRS_PER_PUD - 1)))
 #endif
 
 /* Find an entry in the third-level page table.. */
@@ -580,10 +580,9 @@ extern struct page *zero_page_memmap_ptr;
 
 
 #if CONFIG_PGTABLE_LEVELS == 3
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #endif
-#include 
+#include 
 #include 
 
 #endif /* _ASM_IA64_PGTABLE_H */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 30d0c1fca99e..12242aa0dad1 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -29,6 +29,7 @@ static int
 mapped_kernel_page_is_present (unsigned long address)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep, pte;
@@ -37,7 +38,11 @@ mapped_kernel_page_is_present (unsigned long address)
if (pgd_none(*pgd) || pgd_bad(*pgd))
return 0;
 
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   if (p4d_none(*p4d) || p4d_bad(*p4d))
+   return 0;
+
+   pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_bad(*pud))
return 0;
 
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index d16e419fd712..32352a73df0c 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -30,12 +30,14 @@ huge_pte_alloc(struct mm_struct *mm, unsigned long addr, 
unsigned long sz)
 {
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte = NULL;
 
pgd = pgd_offset(mm, taddr);
-   pud = pud_alloc(mm, pgd, taddr);
+   p4d = p4d_offset(pgd, taddr);
+   pud = pud_alloc(mm, p4d, taddr);
if (pud) {
pmd = pmd_alloc(mm, pud, taddr);
if (pmd)
@@ -49,17 +51,21 @@ huge_pte_offset (struct mm_struct *mm, unsigned long addr, 
unsigned long sz)
 {
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte = NULL;
 
pgd = pgd_offset(mm, taddr);
if (pgd_present(*pgd)) {
-   pud = pud_offset(pgd, taddr);
-   

[PATCH v4 02/14] arm: add support for folded p4d page tables

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate, and remove __ARCH_USE_5LEVEL_HACK.
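
For reference, this is roughly what include/asm-generic/pgtable-nop4d.h
provides once __ARCH_USE_5LEVEL_HACK is no longer defined (abridged from
memory; see the header for the authoritative definitions). The pgd-level
predicates become constants and p4d_offset() is a cast, so the walks added
below compile down to what was there before:

typedef struct { pgd_t pgd; } p4d_t;

#define P4D_SHIFT       PGDIR_SHIFT
#define PTRS_PER_P4D    1
#define P4D_SIZE        (1UL << P4D_SHIFT)
#define P4D_MASK        (~(P4D_SIZE - 1))

static inline int  pgd_none(pgd_t pgd)          { return 0; }
static inline int  pgd_bad(pgd_t pgd)           { return 0; }
static inline int  pgd_present(pgd_t pgd)       { return 1; }
static inline void pgd_clear(pgd_t *pgd)        { }

static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
{
        return (p4d_t *)pgd;
}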

Signed-off-by: Mike Rapoport 
---
 arch/arm/include/asm/pgtable.h |  1 -
 arch/arm/lib/uaccess_with_memcpy.c |  7 +-
 arch/arm/mach-sa1100/assabet.c |  2 +-
 arch/arm/mm/dump.c | 29 +-
 arch/arm/mm/fault-armv.c   |  7 +-
 arch/arm/mm/fault.c| 22 ++--
 arch/arm/mm/idmap.c|  3 ++-
 arch/arm/mm/init.c |  2 +-
 arch/arm/mm/ioremap.c  | 12 ++---
 arch/arm/mm/mm.h   |  2 +-
 arch/arm/mm/mmu.c  | 35 +-
 arch/arm/mm/pgd.c  | 40 --
 12 files changed, 125 insertions(+), 37 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index befc8fcec98f..fba20607c53c 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -17,7 +17,6 @@
 
 #else
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 #include 
diff --git a/arch/arm/lib/uaccess_with_memcpy.c 
b/arch/arm/lib/uaccess_with_memcpy.c
index c9450982a155..d72b14c96670 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -24,6 +24,7 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, 
spinlock_t **ptlp)
 {
unsigned long addr = (unsigned long)_addr;
pgd_t *pgd;
+   p4d_t *p4d;
pmd_t *pmd;
pte_t *pte;
pud_t *pud;
@@ -33,7 +34,11 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, 
spinlock_t **ptlp)
if (unlikely(pgd_none(*pgd) || pgd_bad(*pgd)))
return 0;
 
-   pud = pud_offset(pgd, addr);
+   p4d = p4d_offset(pgd, addr);
+   if (unlikely(p4d_none(*p4d) || p4d_bad(*p4d)))
+   return 0;
+
+   pud = pud_offset(p4d, addr);
if (unlikely(pud_none(*pud) || pud_bad(*pud)))
return 0;
 
diff --git a/arch/arm/mach-sa1100/assabet.c b/arch/arm/mach-sa1100/assabet.c
index d96a101e5504..0631a7b02678 100644
--- a/arch/arm/mach-sa1100/assabet.c
+++ b/arch/arm/mach-sa1100/assabet.c
@@ -633,7 +633,7 @@ static void __init map_sa1100_gpio_regs( void )
int prot = PMD_TYPE_SECT | PMD_SECT_AP_WRITE | PMD_DOMAIN(DOMAIN_IO);
pmd_t *pmd;
 
-   pmd = pmd_offset(pud_offset(pgd_offset_k(virt), virt), virt);
+   pmd = pmd_offset(pud_offset(p4d_offset(pgd_offset_k(virt), virt), 
virt), virt);
*pmd = __pmd(phys | prot);
flush_pmd_entry(pmd);
 }
diff --git a/arch/arm/mm/dump.c b/arch/arm/mm/dump.c
index 7d6291f23251..677549d6854c 100644
--- a/arch/arm/mm/dump.c
+++ b/arch/arm/mm/dump.c
@@ -207,6 +207,7 @@ struct pg_level {
 static struct pg_level pg_level[] = {
{
}, { /* pgd */
+   }, { /* p4d */
}, { /* pud */
}, { /* pmd */
.bits   = section_bits,
@@ -308,7 +309,7 @@ static void walk_pte(struct pg_state *st, pmd_t *pmd, 
unsigned long start,
 
for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
addr = start + i * PAGE_SIZE;
-   note_page(st, addr, 4, pte_val(*pte), domain);
+   note_page(st, addr, 5, pte_val(*pte), domain);
}
 }
 
@@ -350,14 +351,14 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, 
unsigned long start)
addr += SECTION_SIZE;
pmd++;
domain = get_domain_name(pmd);
-   note_page(st, addr, 3, pmd_val(*pmd), domain);
+   note_page(st, addr, 4, pmd_val(*pmd), domain);
}
}
 }
 
-static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start)
+static void walk_pud(struct pg_state *st, p4d_t *p4d, unsigned long start)
 {
-   pud_t *pud = pud_offset(pgd, 0);
+   pud_t *pud = pud_offset(p4d, 0);
unsigned long addr;
unsigned i;
 
@@ -366,7 +367,23 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, 
unsigned long start)
if (!pud_none(*pud)) {
walk_pmd(st, pud, addr);
} else {
-   note_page(st, addr, 2, pud_val(*pud), NULL);
+   note_page(st, addr, 3, pud_val(*pud), NULL);
+   }
+   }
+}
+
+static void walk_p4d(struct pg_state *st, pgd_t *pgd, unsigned long start)
+{
+   p4d_t *p4d = p4d_offset(pgd, 0);
+   unsigned long addr;
+   unsigned i;
+
+   for (i = 0; i < PTRS_PER_P4D; i++, p4d++) {
+   addr = start + i * P4D_SIZE;
+   if (!p4d_none(*p4d)) {
+   walk_pud(st, p4d, addr);
+   } else {
+   note_page(st, addr, 2, p4d_val(*p4d), NULL);
}
}
 }
@@ -381,7 +398,7 @@ static void walk_pgd(struct pg_state *st, struct mm_

[PATCH v4 01/14] h8300: remove usage of __ARCH_USE_5LEVEL_HACK

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

h8300 is a nommu architecture and does not require fixup for upper layers
of the page tables because it is already handled by the generic nommu
implementation.

Remove the definition of __ARCH_USE_5LEVEL_HACK from
arch/h8300/include/asm/pgtable.h.

Signed-off-by: Mike Rapoport 
---
 arch/h8300/include/asm/pgtable.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h
index 4d00152fab58..f00828720dc4 100644
--- a/arch/h8300/include/asm/pgtable.h
+++ b/arch/h8300/include/asm/pgtable.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _H8300_PGTABLE_H
 #define _H8300_PGTABLE_H
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 extern void paging_init(void);
-- 
2.25.1



[PATCH v4 04/14] hexagon: remove __ARCH_USE_5LEVEL_HACK

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

The hexagon architecture has 2-level page tables, so most of the page
table folding is already implemented in asm-generic/pgtable-nopmd.h.

Fix up the only place in arch/hexagon that needs to unfold the p4d level
and remove __ARCH_USE_5LEVEL_HACK.
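
Not part of the patch, but for readability: the kmap_get_fixmap_pte()
macro touched below, written out as an equivalent function (the helper
name is invented for the example):

#include <linux/mm.h>
#include <asm/pgtable.h>

static inline pte_t *example_fixmap_pte(unsigned long vaddr)
{
        pgd_t *pgd = pgd_offset_k(vaddr);
        p4d_t *p4d = p4d_offset(pgd, vaddr);    /* the newly unfolded step */
        pud_t *pud = pud_offset(p4d, vaddr);
        pmd_t *pmd = pmd_offset(pud, vaddr);

        return pte_offset_kernel(pmd, vaddr);
}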

Signed-off-by: Mike Rapoport 
---
 arch/hexagon/include/asm/fixmap.h  | 4 ++--
 arch/hexagon/include/asm/pgtable.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/hexagon/include/asm/fixmap.h 
b/arch/hexagon/include/asm/fixmap.h
index 933dac167504..97b1b062e750 100644
--- a/arch/hexagon/include/asm/fixmap.h
+++ b/arch/hexagon/include/asm/fixmap.h
@@ -16,7 +16,7 @@
 #include 
 
 #define kmap_get_fixmap_pte(vaddr) \
-   pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr), \
-   (vaddr)), (vaddr)), (vaddr))
+   pte_offset_kernel(pmd_offset(pud_offset(p4d_offset(pgd_offset_k(vaddr), 
\
+   (vaddr)), (vaddr)), (vaddr)), (vaddr))
 
 #endif
diff --git a/arch/hexagon/include/asm/pgtable.h 
b/arch/hexagon/include/asm/pgtable.h
index d383e8bea5b2..2a17d4eb2fa4 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -12,7 +12,6 @@
  * Page table definitions for Qualcomm Hexagon processor.
  */
 #include 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /* A handy thing to have if one has the RAM. Declared in head.S */
-- 
2.25.1



[PATCH v4 00/14] mm: remove __ARCH_HAS_5LEVEL_HACK

2020-04-14 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

These patches convert several architectures to use page table folding and
remove __ARCH_HAS_5LEVEL_HACK along with include/asm-generic/5level-fixup.h
and include/asm-generic/pgtable-nop4d-hack.h. With that we'll have a single
and consistent way of dealing with page table folding instead of a mix of
three existing options.

The changes are mostly about mechanical replacement of pgd accessors with
p4d ones and the addition of higher levels to page table traversals.
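
To make that concrete, a typical hunk in the series has the following
shape (illustrative only, not taken verbatim from any patch):

        /* before */
        pud = pud_offset(pgd, addr);

        /* after */
        p4d = p4d_offset(pgd, addr);
        if (p4d_none(*p4d) || p4d_bad(*p4d))
                return;
        pud = pud_offset(p4d, addr);

On architectures with fewer levels the p4d step is folded away by
asm-generic/pgtable-nop4d.h, so the extra walk costs nothing at runtime.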

v4 is about rebasing on top of v5.7-rc1:
* split arm and arm64 changes as there is no KVM host on arm anymore
* update the powerpc patches to reflect recent changes in its page table handling

v3:
* add Christophe's patch that removes ppc32 get_pteptr()
* reduce amount of upper layer walks in powerpc

v2:
* collect per-arch patches into a single set
* include Geert's update of 'sh' printing messages
* rebase on v5.6-rc1+

Geert Uytterhoeven (1):
  sh: fault: Modernize printing of kernel messages

Mike Rapoport (13):
  h8300: remove usage of __ARCH_USE_5LEVEL_HACK
  arm: add support for folded p4d page tables
  arm64: add support for folded p4d page tables
  hexagon: remove __ARCH_USE_5LEVEL_HACK
  ia64: add support for folded p4d page tables
  nios2: add support for folded p4d page tables
  openrisc: add support for folded p4d page tables
  powerpc: add support for folded p4d page tables
  sh: drop __pXd_offset() macros that duplicate pXd_index() ones
  sh: add support for folded p4d page tables
  unicore32: remove __ARCH_USE_5LEVEL_HACK
  asm-generic: remove pgtable-nop4d-hack.h
  mm: remove __ARCH_HAS_5LEVEL_HACK and include/asm-generic/5level-fixup.h

 arch/arm/include/asm/pgtable.h|   1 -
 arch/arm/lib/uaccess_with_memcpy.c|   7 +-
 arch/arm/mach-sa1100/assabet.c|   2 +-
 arch/arm/mm/dump.c|  29 ++-
 arch/arm/mm/fault-armv.c  |   7 +-
 arch/arm/mm/fault.c   |  22 +-
 arch/arm/mm/idmap.c   |   3 +-
 arch/arm/mm/init.c|   2 +-
 arch/arm/mm/ioremap.c |  12 +-
 arch/arm/mm/mm.h  |   2 +-
 arch/arm/mm/mmu.c |  35 ++-
 arch/arm/mm/pgd.c |  40 +++-
 arch/arm64/include/asm/kvm_mmu.h  |  10 +-
 arch/arm64/include/asm/pgalloc.h  |  10 +-
 arch/arm64/include/asm/pgtable-types.h|   5 +-
 arch/arm64/include/asm/pgtable.h  |  37 ++--
 arch/arm64/include/asm/stage2_pgtable.h   |  48 +++-
 arch/arm64/kernel/hibernate.c |  44 +++-
 arch/arm64/mm/fault.c |   9 +-
 arch/arm64/mm/hugetlbpage.c   |  15 +-
 arch/arm64/mm/kasan_init.c|  26 ++-
 arch/arm64/mm/mmu.c   |  52 +++--
 arch/arm64/mm/pageattr.c  |   7 +-
 arch/h8300/include/asm/pgtable.h  |   1 -
 arch/hexagon/include/asm/fixmap.h |   4 +-
 arch/hexagon/include/asm/pgtable.h|   1 -
 arch/ia64/include/asm/pgalloc.h   |   4 +-
 arch/ia64/include/asm/pgtable.h   |  17 +-
 arch/ia64/mm/fault.c  |   7 +-
 arch/ia64/mm/hugetlbpage.c|  18 +-
 arch/ia64/mm/init.c   |  28 ++-
 arch/nios2/include/asm/pgtable.h  |   3 +-
 arch/nios2/mm/fault.c |   9 +-
 arch/nios2/mm/ioremap.c   |   6 +-
 arch/openrisc/include/asm/pgtable.h   |   1 -
 arch/openrisc/mm/fault.c  |  10 +-
 arch/openrisc/mm/init.c   |   4 +-
 arch/powerpc/include/asm/book3s/32/pgtable.h  |   1 -
 arch/powerpc/include/asm/book3s/64/hash.h |   4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h  |   4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  60 ++---
 arch/powerpc/include/asm/book3s/64/radix.h|   6 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |   1 -
 arch/powerpc/include/asm/nohash/64/pgalloc.h  |   2 +-
 .../include/asm/nohash/64/pgtable-4k.h|  32 +--
 arch/powerpc/include/asm/nohash/64/pgtable.h  |   6 +-
 arch/powerpc/include/asm/pgtable.h|  10 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  32 +--
 arch/powerpc/lib/code-patching.c  |   7 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   4 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  26 ++-
 arch/powerpc/mm/book3s64/subpage_prot.c   |   6 +-
 arch/powerpc/mm/hugetlbpage.c |  28 ++-
 arch/powerpc/mm/nohash/book3e_pgtable.c   |  15 +-
 arch/powerpc/mm/pgtable.c |  30 ++-
 arch/powerpc/mm/pgtable_64.c  |  10 +-
 arch/powerpc/mm/ptdump/hashpagetable.c|  20 +-
 arch/powerpc/mm/ptdump/ptdump.c   |  14 +-
 arch/powerpc/xmon/xmon.c  |  18 +-
 arch/sh

Re: [PATCH v3 07/14] powerpc/32: drop get_pteptr()

2020-03-08 Thread Mike Rapoport
On Fri, Mar 06, 2020 at 08:00:16PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2020 10:46:01 +0200 Mike Rapoport  wrote:
> 
> > Commit 8d30c14cab30 ("powerpc/mm: Rework I$/D$ coherency (v3)") and
> > commit 90ac19a8b21b ("[POWERPC] Abolish iopa(), mm_ptov(),
> > io_block_mapping() from arch/powerpc") removed the use of get_pteptr()
> > outside of mm/pgtable_32.c
> > 
> > In mm/pgtable_32.c, the only user of get_pteptr() is __change_page_attr()
> > which operates on kernel context and on lowmem pages only.
> > 
> > Move page table traversal to __change_page_attr() and drop get_pteptr().
> 
> People have been changing things in linux-next and the powerpc patches
> are hurting.
> 
> I'll disable this patch series for now.  Can you please redo
> powerpc-32-drop-get_pteptr.patch and
> powerpc-add-support-for-folded-p4d-page-tables.patch (and
> powerpc-add-support-for-folded-p4d-page-tables-fix.patch)?
 
This is the powerpc-add-support-for-folded-p4d-page-tables.patch on top of
current powerpc/next. The powerpc-32-drop-get_pteptr.patch is already there
and I've folded powerpc-add-support-for-folded-p4d-page-tables-fix.patch
into this one.

From e2b405537d917c99430ee93b68fe4cb43d7b8787 Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Sun, 24 Nov 2019 15:38:00 +0200
Subject: [PATCH v4] powerpc: add support for folded p4d page tables

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.

Signed-off-by: Mike Rapoport 
Tested-by: Christophe Leroy  # 8xx and 83xx
---
 arch/powerpc/include/asm/book3s/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/book3s/64/hash.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h  |  4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 60 ++-
 arch/powerpc/include/asm/book3s/64/radix.h|  6 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/nohash/64/pgalloc.h  |  2 +-
 .../include/asm/nohash/64/pgtable-4k.h| 32 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  6 +-
 arch/powerpc/include/asm/pgtable.h| 10 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 32 ++
 arch/powerpc/lib/code-patching.c  |  7 ++-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |  4 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 26 +---
 arch/powerpc/mm/book3s64/subpage_prot.c   |  6 +-
 arch/powerpc/mm/hugetlbpage.c | 28 +
 arch/powerpc/mm/kasan/kasan_init_32.c |  1 -
 arch/powerpc/mm/nohash/book3e_pgtable.c   | 15 ++---
 arch/powerpc/mm/pgtable.c | 30 ++
 arch/powerpc/mm/pgtable_64.c  | 10 ++--
 arch/powerpc/mm/ptdump/hashpagetable.c| 20 ++-
 arch/powerpc/mm/ptdump/ptdump.c   | 14 +++--
 arch/powerpc/xmon/xmon.c  | 18 +++---
 23 files changed, 197 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 7549393c4c43..6052b72216a6 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 #define _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #include 
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 2781ebf6add4..876d1528c2cf 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -134,9 +134,9 @@ static inline int get_region_id(unsigned long ea)
 
 #definehash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
 #definehash__pud_bad(pud)  (pud_val(pud) & H_PUD_BAD_BITS)
-static inline int hash__pgd_bad(pgd_t pgd)
+static inline int hash__p4d_bad(p4d_t p4d)
 {
-   return (pgd_val(pgd) == 0);
+   return (p4d_val(p4d) == 0);
 }
 #ifdef CONFIG_STRICT_KERNEL_RWX
 extern void hash__mark_rodata_ro(void);
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index a41e91bd0580..69c5b051734f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -85,9 +85,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *pgd, pud_t *pud)
 {
-   *pgd =  __pgd(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
+   *pgd =  __p4d(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
diff --git a/arch/powerpc/incl

[PATCH v3 14/14] mm: remove __ARCH_HAS_5LEVEL_HACK and include/asm-generic/5level-fixup.h

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

There are no architectures that use include/asm-generic/5level-fixup.h,
therefore it can be removed along with the __ARCH_HAS_5LEVEL_HACK define
and the code it surrounds.
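
For illustration, with the hack gone every MMU architecture populates the
upper levels through the same generic helpers (a sketch modelled on
__handle_mm_fault(); the function name here is invented):

#include <linux/mm.h>
#include <linux/errno.h>

static int example_alloc_levels(struct mm_struct *mm, unsigned long address)
{
        pgd_t *pgd = pgd_offset(mm, address);
        p4d_t *p4d = p4d_alloc(mm, pgd, address);
        pud_t *pud;
        pmd_t *pmd;

        if (!p4d)
                return -ENOMEM;
        pud = pud_alloc(mm, p4d, address);
        if (!pud)
                return -ENOMEM;
        pmd = pmd_alloc(mm, pud, address);
        if (!pmd)
                return -ENOMEM;
        return 0;
}

Folded levels make p4d_alloc() and pud_alloc() effectively free, which is
why no per-architecture fixup header is needed anymore.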

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/5level-fixup.h | 58 --
 include/linux/mm.h |  6 
 mm/kasan/init.c| 11 --
 mm/memory.c|  8 -
 4 files changed, 83 deletions(-)
 delete mode 100644 include/asm-generic/5level-fixup.h

diff --git a/include/asm-generic/5level-fixup.h 
b/include/asm-generic/5level-fixup.h
deleted file mode 100644
index 4c74b1c1d13b..
--- a/include/asm-generic/5level-fixup.h
+++ /dev/null
@@ -1,58 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _5LEVEL_FIXUP_H
-#define _5LEVEL_FIXUP_H
-
-#define __ARCH_HAS_5LEVEL_HACK
-#define __PAGETABLE_P4D_FOLDED 1
-
-#define P4D_SHIFT  PGDIR_SHIFT
-#define P4D_SIZE   PGDIR_SIZE
-#define P4D_MASK   PGDIR_MASK
-#define MAX_PTRS_PER_P4D   1
-#define PTRS_PER_P4D   1
-
-#define p4d_t  pgd_t
-
-#define pud_alloc(mm, p4d, address) \
-   ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
-   NULL : pud_offset(p4d, address))
-
-#define p4d_alloc(mm, pgd, address)(pgd)
-#define p4d_offset(pgd, start) (pgd)
-
-#ifndef __ASSEMBLY__
-static inline int p4d_none(p4d_t p4d)
-{
-   return 0;
-}
-
-static inline int p4d_bad(p4d_t p4d)
-{
-   return 0;
-}
-
-static inline int p4d_present(p4d_t p4d)
-{
-   return 1;
-}
-#endif
-
-#define p4d_ERROR(p4d) do { } while (0)
-#define p4d_clear(p4d) pgd_clear(p4d)
-#define p4d_val(p4d)   pgd_val(p4d)
-#define p4d_populate(mm, p4d, pud) pgd_populate(mm, p4d, pud)
-#define p4d_populate_safe(mm, p4d, pud)pgd_populate(mm, p4d, pud)
-#define p4d_page(p4d)  pgd_page(p4d)
-#define p4d_page_vaddr(p4d)pgd_page_vaddr(p4d)
-
-#define __p4d(x)   __pgd(x)
-#define set_p4d(p4dp, p4d) set_pgd(p4dp, p4d)
-
-#undef p4d_free_tlb
-#define p4d_free_tlb(tlb, x, addr) do { } while (0)
-#define p4d_free(mm, x)do { } while (0)
-
-#undef  p4d_addr_end
-#define p4d_addr_end(addr, end)(end)
-
-#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..69fb46e1d91b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1841,11 +1841,6 @@ int __pte_alloc_kernel(pmd_t *pmd);
 
 #if defined(CONFIG_MMU)
 
-/*
- * The following ifdef needed to get the 5level-fixup.h header to work.
- * Remove it when 5level-fixup.h has been removed.
- */
-#ifndef __ARCH_HAS_5LEVEL_HACK
 static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
unsigned long address)
 {
@@ -1859,7 +1854,6 @@ static inline pud_t *pud_alloc(struct mm_struct *mm, 
p4d_t *p4d,
return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
NULL : pud_offset(p4d, address);
 }
-#endif /* !__ARCH_HAS_5LEVEL_HACK */
 
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
address)
 {
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491ebcd..fe6be0be1f76 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -250,20 +250,9 @@ int __ref kasan_populate_early_shadow(const void 
*shadow_start,
 * 3,2 - level page tables where we don't have
 * puds,pmds, so pgd_populate(), pud_populate()
 * is noops.
-*
-* The ifndef is required to avoid build breakage.
-*
-* With 5level-fixup.h, pgd_populate() is not nop and
-* we reference kasan_early_shadow_p4d. It's not defined
-* unless 5-level paging enabled.
-*
-* The ifndef can be dropped once all KASAN-enabled
-* architectures will switch to pgtable-nop4d.h.
 */
-#ifndef __ARCH_HAS_5LEVEL_HACK
pgd_populate(_mm, pgd,
lm_alias(kasan_early_shadow_p4d));
-#endif
p4d = p4d_offset(pgd, addr);
p4d_populate(_mm, p4d,
lm_alias(kasan_early_shadow_pud));
diff --git a/mm/memory.c b/mm/memory.c
index 0bccc622e482..10cc147db1b8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4252,19 +4252,11 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, 
unsigned long address)
smp_wmb(); /* See comment in __pte_alloc */
 
spin_lock(>page_table_lock);
-#ifndef __ARCH_HAS_5LEVEL_HACK
if (!p4d_present(*p4d)) {
  

[PATCH v3 13/14] asm-generic: remove pgtable-nop4d-hack.h

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

No architecture defines __ARCH_USE_5LEVEL_HACK and therefore
pgtable-nop4d-hack.h will never actually be included.

Remove it.
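
For comparison, the non-hack include/asm-generic/pgtable-nopud.h that
remains defines roughly the following (abridged from memory; see the
header for the authoritative version): pud_t wraps a p4d entry and
pud_offset() takes a p4d_t pointer, whereas the hack variant removed here
wrapped pgd_t and took a pgd_t pointer:

typedef struct { p4d_t p4d; } pud_t;

#define PUD_SHIFT       P4D_SHIFT
#define PTRS_PER_PUD    1

static inline int p4d_none(p4d_t p4d)           { return 0; }
static inline int p4d_bad(p4d_t p4d)            { return 0; }
static inline int p4d_present(p4d_t p4d)        { return 1; }

static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
{
        return (pud_t *)p4d;
}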

Signed-off-by: Mike Rapoport 
---
 include/asm-generic/pgtable-nop4d-hack.h | 64 
 include/asm-generic/pgtable-nopud.h  |  4 --
 2 files changed, 68 deletions(-)
 delete mode 100644 include/asm-generic/pgtable-nop4d-hack.h

diff --git a/include/asm-generic/pgtable-nop4d-hack.h 
b/include/asm-generic/pgtable-nop4d-hack.h
deleted file mode 100644
index 829bdb0d6327..
--- a/include/asm-generic/pgtable-nop4d-hack.h
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _PGTABLE_NOP4D_HACK_H
-#define _PGTABLE_NOP4D_HACK_H
-
-#ifndef __ASSEMBLY__
-#include 
-
-#define __PAGETABLE_PUD_FOLDED 1
-
-/*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
- * without casting.
- */
-typedef struct { pgd_t pgd; } pud_t;
-
-#define PUD_SHIFT  PGDIR_SHIFT
-#define PTRS_PER_PUD   1
-#define PUD_SIZE   (1UL << PUD_SHIFT)
-#define PUD_MASK   (~(PUD_SIZE-1))
-
-/*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
- * setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
- */
-static inline int pgd_none(pgd_t pgd)  { return 0; }
-static inline int pgd_bad(pgd_t pgd)   { return 0; }
-static inline int pgd_present(pgd_t pgd)   { return 1; }
-static inline void pgd_clear(pgd_t *pgd)   { }
-#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
-
-#define pgd_populate(mm, pgd, pud) do { } while (0)
-#define pgd_populate_safe(mm, pgd, pud)do { } while (0)
-/*
- * (puds are folded into pgds so this doesn't get actually called,
- * but the define is needed for a generic inline function.)
- */
-#define set_pgd(pgdptr, pgdval)set_pud((pud_t *)(pgdptr), (pud_t) { 
pgdval })
-
-static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
-{
-   return (pud_t *)pgd;
-}
-
-#define pud_val(x) (pgd_val((x).pgd))
-#define __pud(x)   ((pud_t) { __pgd(x) })
-
-#define pgd_page(pgd)  (pud_page((pud_t){ pgd }))
-#define pgd_page_vaddr(pgd)(pud_page_vaddr((pud_t){ pgd }))
-
-/*
- * allocating and freeing a pud is trivial: the 1-entry pud is
- * inside the pgd, so has no extra memory associated with it.
- */
-#define pud_alloc_one(mm, address) NULL
-#define pud_free(mm, x)do { } while (0)
-#define __pud_free_tlb(tlb, x, a)  do { } while (0)
-
-#undef  pud_addr_end
-#define pud_addr_end(addr, end)(end)
-
-#endif /* __ASSEMBLY__ */
-#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h 
b/include/asm-generic/pgtable-nopud.h
index d3776cb494c0..ad05c1684bfc 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -4,9 +4,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef __ARCH_USE_5LEVEL_HACK
-#include 
-#else
 #include 
 
 #define __PAGETABLE_PUD_FOLDED 1
@@ -65,5 +62,4 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long 
address)
 #define pud_addr_end(addr, end)(end)
 
 #endif /* __ASSEMBLY__ */
-#endif /* !__ARCH_USE_5LEVEL_HACK */
 #endif /* _PGTABLE_NOPUD_H */
-- 
2.24.0



[PATCH v3 09/14] sh: fault: Modernize printing of kernel messages

2020-02-27 Thread Mike Rapoport
From: Geert Uytterhoeven 

  - Convert from printk() to pr_*(),
  - Add missing continuations,
  - Use "%llx" to format u64,
  - Join multiple prints in show_fault_oops() into a single print.
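
For instance, the continuation pattern the conversion relies on looks like
this (a condensed illustration, not an actual hunk from the patch):

        pr_alert("[%08lx] *pgd=%0*llx", addr, (u32)(sizeof(*pgd) * 2),
                 (u64)pgd_val(*pgd));
        pr_cont(", *pmd=%0*llx", (u32)(sizeof(*pmd) * 2), (u64)pmd_val(*pmd));
        pr_cont("\n");  /* pr_cont() appends to the line started by pr_alert() */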

Signed-off-by: Geert Uytterhoeven 
Signed-off-by: Mike Rapoport 
---
 arch/sh/mm/fault.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f51456f4fc7..a2b0275413e8 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -47,10 +47,10 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
pgd = swapper_pg_dir;
}
 
-   printk(KERN_ALERT "pgd = %p\n", pgd);
+   pr_alert("pgd = %p\n", pgd);
pgd += pgd_index(addr);
-   printk(KERN_ALERT "[%08lx] *pgd=%0*Lx", addr,
-  (u32)(sizeof(*pgd) * 2), (u64)pgd_val(*pgd));
+   pr_alert("[%08lx] *pgd=%0*llx", addr, (u32)(sizeof(*pgd) * 2),
+(u64)pgd_val(*pgd));
 
do {
pud_t *pud;
@@ -61,33 +61,33 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
 
if (pgd_bad(*pgd)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
pud = pud_offset(pgd, addr);
if (PTRS_PER_PUD != 1)
-   printk(", *pud=%0*Lx", (u32)(sizeof(*pud) * 2),
-  (u64)pud_val(*pud));
+   pr_cont(", *pud=%0*llx", (u32)(sizeof(*pud) * 2),
+   (u64)pud_val(*pud));
 
if (pud_none(*pud))
break;
 
if (pud_bad(*pud)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
pmd = pmd_offset(pud, addr);
if (PTRS_PER_PMD != 1)
-   printk(", *pmd=%0*Lx", (u32)(sizeof(*pmd) * 2),
-  (u64)pmd_val(*pmd));
+   pr_cont(", *pmd=%0*llx", (u32)(sizeof(*pmd) * 2),
+   (u64)pmd_val(*pmd));
 
if (pmd_none(*pmd))
break;
 
if (pmd_bad(*pmd)) {
-   printk("(bad)");
+   pr_cont("(bad)");
break;
}
 
@@ -96,11 +96,11 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
 
pte = pte_offset_kernel(pmd, addr);
-   printk(", *pte=%0*Lx", (u32)(sizeof(*pte) * 2),
-  (u64)pte_val(*pte));
+   pr_cont(", *pte=%0*llx", (u32)(sizeof(*pte) * 2),
+   (u64)pte_val(*pte));
} while (0);
 
-   printk("\n");
+   pr_cont("\n");
 }
 
 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
@@ -188,14 +188,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
address)
if (!oops_may_print())
return;
 
-   printk(KERN_ALERT "BUG: unable to handle kernel ");
-   if (address < PAGE_SIZE)
-   printk(KERN_CONT "NULL pointer dereference");
-   else
-   printk(KERN_CONT "paging request");
-
-   printk(KERN_CONT " at %08lx\n", address);
-   printk(KERN_ALERT "PC:");
+   pr_alert("BUG: unable to handle kernel %s at %08lx\n",
+address < PAGE_SIZE ? "NULL pointer dereference"
+: "paging request",
+address);
+   pr_alert("PC:");
printk_address(regs->pc, 1);
 
show_pte(NULL, address);
-- 
2.24.0



[PATCH v3 07/14] powerpc/32: drop get_pteptr()

2020-02-27 Thread Mike Rapoport
From: Christophe Leroy 

Commit 8d30c14cab30 ("powerpc/mm: Rework I$/D$ coherency (v3)") and
commit 90ac19a8b21b ("[POWERPC] Abolish iopa(), mm_ptov(),
io_block_mapping() from arch/powerpc") removed the use of get_pteptr()
outside of mm/pgtable_32.c

In mm/pgtable_32.c, the only user of get_pteptr() is __change_page_attr()
which operates on kernel context and on lowmem pages only.

Move page table traversal to __change_page_attr() and drop get_pteptr().
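
A minimal sketch of the walk that replaces get_pteptr(), assuming a kernel
lowmem address on 32-bit powerpc where the upper levels are folded (the
helper name is invented for the example):

#include <linux/mm.h>
#include <asm/pgtable.h>

static pte_t *example_lowmem_pte(unsigned long va, pmd_t **pmdp)
{
        pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(va), va), va);

        if (!pmd_present(*pmd))
                return NULL;
        if (pmdp)
                *pmdp = pmd;
        return pte_offset_map(pmd, va); /* caller must pte_unmap() */
}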

Signed-off-by: Christophe Leroy 
Signed-off-by: Mike Rapoport 
---
 arch/powerpc/mm/pgtable_32.c | 43 ++--
 1 file changed, 7 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 5fb90edd865e..4894555622d7 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -121,53 +121,24 @@ void __init mapin_ram(void)
}
 }
 
-/* Scan the real Linux page tables and return a PTE pointer for
- * a virtual address in a context.
- * Returns true (1) if PTE was found, zero otherwise.  The pointer to
- * the PTE pointer is unmodified if PTE is not found.
- */
-static int
-get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t 
**pmdp)
-{
-pgd_t  *pgd;
-   pud_t   *pud;
-pmd_t  *pmd;
-pte_t  *pte;
-int retval = 0;
-
-pgd = pgd_offset(mm, addr & PAGE_MASK);
-if (pgd) {
-   pud = pud_offset(pgd, addr & PAGE_MASK);
-   if (pud && pud_present(*pud)) {
-   pmd = pmd_offset(pud, addr & PAGE_MASK);
-   if (pmd_present(*pmd)) {
-   pte = pte_offset_map(pmd, addr & PAGE_MASK);
-   if (pte) {
-   retval = 1;
-   *ptep = pte;
-   if (pmdp)
-   *pmdp = pmd;
-   /* XXX caller needs to do pte_unmap, 
yuck */
-   }
-   }
-   }
-}
-return(retval);
-}
-
 static int __change_page_attr_noflush(struct page *page, pgprot_t prot)
 {
pte_t *kpte;
pmd_t *kpmd;
-   unsigned long address;
+   unsigned long address, va;
 
BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);
+   va = address & PAGE_MASK;
 
if (v_block_mapped(address))
return 0;
-   if (!get_pteptr(_mm, address, , ))
+
+   kpmd = pmd_offset(pud_offset(pgd_offset_k(va), va), va);
+   if (!pmd_present(*kpmd))
return -EINVAL;
+
+   kpte = pte_offset_map(kpmd, va);
__set_pte_at(_mm, address, kpte, mk_pte(page, prot), 0);
pte_unmap(kpte);
 
-- 
2.24.0



[PATCH v3 12/14] unicore32: remove __ARCH_USE_5LEVEL_HACK

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

The unicore32 architecture has 2-level page tables and uses
asm-generic/pgtable-nopmd.h plus explicit casts from pud_t to pgd_t for
page table folding.

Add a p4d walk in the only place that actually unfolds the pud level and
remove __ARCH_USE_5LEVEL_HACK.

Signed-off-by: Mike Rapoport 
---
 arch/unicore32/include/asm/pgtable.h | 1 -
 arch/unicore32/kernel/hibernate.c| 4 +++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/unicore32/include/asm/pgtable.h 
b/arch/unicore32/include/asm/pgtable.h
index c8f7ba12f309..82030c32fc05 100644
--- a/arch/unicore32/include/asm/pgtable.h
+++ b/arch/unicore32/include/asm/pgtable.h
@@ -9,7 +9,6 @@
 #ifndef __UNICORE_PGTABLE_H__
 #define __UNICORE_PGTABLE_H__
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 
diff --git a/arch/unicore32/kernel/hibernate.c 
b/arch/unicore32/kernel/hibernate.c
index f3812245cc00..ccad051a79b6 100644
--- a/arch/unicore32/kernel/hibernate.c
+++ b/arch/unicore32/kernel/hibernate.c
@@ -33,9 +33,11 @@ struct swsusp_arch_regs swsusp_arch_regs_cpu0;
 static pmd_t *resume_one_md_table_init(pgd_t *pgd)
 {
pud_t *pud;
+   p4d_t *p4d;
pmd_t *pmd_table;
 
-   pud = pud_offset(pgd, 0);
+   p4d = p4d_offset(pgd, 0);
+   pud = pud_offset(p4d, 0);
pmd_table = pmd_offset(pud, 0);
 
return pmd_table;
-- 
2.24.0



[PATCH v3 08/14] powerpc: add support for folded p4d page tables

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h.

Signed-off-by: Mike Rapoport 
Tested-by: Christophe Leroy  # 8xx and 83xx
---
 arch/powerpc/include/asm/book3s/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/book3s/64/hash.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h  |  4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 60 ++-
 arch/powerpc/include/asm/book3s/64/radix.h|  6 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h  |  1 -
 arch/powerpc/include/asm/nohash/64/pgalloc.h  |  2 +-
 .../include/asm/nohash/64/pgtable-4k.h| 32 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  6 +-
 arch/powerpc/include/asm/pgtable.h|  6 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 30 ++
 arch/powerpc/lib/code-patching.c  |  7 ++-
 arch/powerpc/mm/book3s32/mmu.c|  2 +-
 arch/powerpc/mm/book3s32/tlb.c|  4 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c   |  4 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 26 +---
 arch/powerpc/mm/book3s64/subpage_prot.c   |  6 +-
 arch/powerpc/mm/hugetlbpage.c | 28 +
 arch/powerpc/mm/kasan/kasan_init_32.c |  8 +--
 arch/powerpc/mm/mem.c |  4 +-
 arch/powerpc/mm/nohash/40x.c  |  4 +-
 arch/powerpc/mm/nohash/book3e_pgtable.c   | 15 ++---
 arch/powerpc/mm/pgtable.c | 30 ++
 arch/powerpc/mm/pgtable_32.c  |  4 +-
 arch/powerpc/mm/pgtable_64.c  | 10 ++--
 arch/powerpc/mm/ptdump/hashpagetable.c| 20 ++-
 arch/powerpc/mm/ptdump/ptdump.c   | 14 +++--
 arch/powerpc/xmon/xmon.c  | 18 +++---
 28 files changed, 207 insertions(+), 149 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 5b39c11e884a..39ec11371be0 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 #define _ASM_POWERPC_BOOK3S_32_PGTABLE_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #include 
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 2781ebf6add4..876d1528c2cf 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -134,9 +134,9 @@ static inline int get_region_id(unsigned long ea)
 
 #definehash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
 #definehash__pud_bad(pud)  (pud_val(pud) & H_PUD_BAD_BITS)
-static inline int hash__pgd_bad(pgd_t pgd)
+static inline int hash__p4d_bad(p4d_t p4d)
 {
-   return (pgd_val(pgd) == 0);
+   return (p4d_val(p4d) == 0);
 }
 #ifdef CONFIG_STRICT_KERNEL_RWX
 extern void hash__mark_rodata_ro(void);
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index a41e91bd0580..69c5b051734f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -85,9 +85,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *pgd, pud_t *pud)
 {
-   *pgd =  __pgd(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
+   *pgd =  __p4d(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 201a69e6a355..fa60e8594b9f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 #define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 
-#include <asm-generic/5level-fixup.h>
+#include <asm-generic/pgtable-nop4d.h>
 
 #ifndef __ASSEMBLY__
 #include 
@@ -251,7 +251,7 @@ extern unsigned long __pmd_frag_size_shift;
 /* Bits to mask out from a PUD to get to the PMD page */
 #define PUD_MASKED_BITS0xc0ffUL
 /* Bits to mask out from a PGD to get to the PUD page */
-#define PGD_MASKED_BITS0xc0ffUL
+#define P4D_MASKED_BITS0xc0ffUL
 
 /*
  * Used as an indicator for rcu callback functions
@@ -949,54 +949,60 @@ static inline bool pud_access_permitted(pud_t pud, bool 
write)
return pte_access_permitted(pud_pte(pud), write);
 }
 
-#define pgd_write(pgd) pte_write(pgd_pte(pgd))
+#define __p4d_raw(x)   ((p4d_t) { __pgd_raw(x) })
+static inline __be64 p4d_raw(p4d_t x)
+{
+   return pgd_raw(x.pgd);
+}
+
+#define p4d_write(p4d) p

[PATCH v3 11/14] sh: add support for folded p4d page tables

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate and remove usage of __ARCH_USE_5LEVEL_HACK.

Signed-off-by: Mike Rapoport 
---
 arch/sh/include/asm/pgtable-2level.h |  1 -
 arch/sh/include/asm/pgtable-3level.h |  1 -
 arch/sh/kernel/io_trapped.c  |  7 ++-
 arch/sh/mm/cache-sh4.c   |  4 +++-
 arch/sh/mm/cache-sh5.c   |  7 ++-
 arch/sh/mm/fault.c   | 26 +++---
 arch/sh/mm/hugetlbpage.c | 28 ++--
 arch/sh/mm/init.c|  9 -
 arch/sh/mm/kmap.c|  2 +-
 arch/sh/mm/tlbex_32.c|  6 +-
 arch/sh/mm/tlbex_64.c|  7 ++-
 11 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/arch/sh/include/asm/pgtable-2level.h 
b/arch/sh/include/asm/pgtable-2level.h
index bf1eb51c3ee5..08bff93927ff 100644
--- a/arch/sh/include/asm/pgtable-2level.h
+++ b/arch/sh/include/asm/pgtable-2level.h
@@ -2,7 +2,6 @@
 #ifndef __ASM_SH_PGTABLE_2LEVEL_H
 #define __ASM_SH_PGTABLE_2LEVEL_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /*
diff --git a/arch/sh/include/asm/pgtable-3level.h 
b/arch/sh/include/asm/pgtable-3level.h
index 779260b721ca..0f80097e5c9c 100644
--- a/arch/sh/include/asm/pgtable-3level.h
+++ b/arch/sh/include/asm/pgtable-3level.h
@@ -2,7 +2,6 @@
 #ifndef __ASM_SH_PGTABLE_3LEVEL_H
 #define __ASM_SH_PGTABLE_3LEVEL_H
 
-#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /*
diff --git a/arch/sh/kernel/io_trapped.c b/arch/sh/kernel/io_trapped.c
index 60c828a2b8a2..037aab2708b7 100644
--- a/arch/sh/kernel/io_trapped.c
+++ b/arch/sh/kernel/io_trapped.c
@@ -136,6 +136,7 @@ EXPORT_SYMBOL_GPL(match_trapped_io_handler);
 static struct trapped_io *lookup_tiop(unsigned long address)
 {
pgd_t *pgd_k;
+   p4d_t *p4d_k;
pud_t *pud_k;
pmd_t *pmd_k;
pte_t *pte_k;
@@ -145,7 +146,11 @@ static struct trapped_io *lookup_tiop(unsigned long 
address)
if (!pgd_present(*pgd_k))
return NULL;
 
-   pud_k = pud_offset(pgd_k, address);
+   p4d_k = p4d_offset(pgd_k, address);
+   if (!p4d_present(*p4d_k))
+   return NULL;
+
+   pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
return NULL;
 
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index eee911422cf9..45943bcb7042 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -209,6 +209,7 @@ static void sh4_flush_cache_page(void *args)
unsigned long address, pfn, phys;
int map_coherent = 0;
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -224,7 +225,8 @@ static void sh4_flush_cache_page(void *args)
return;
 
pgd = pgd_offset(vma->vm_mm, address);
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_kernel(pmd, address);
 
diff --git a/arch/sh/mm/cache-sh5.c b/arch/sh/mm/cache-sh5.c
index 445b5e69b73c..442a77cc2957 100644
--- a/arch/sh/mm/cache-sh5.c
+++ b/arch/sh/mm/cache-sh5.c
@@ -383,6 +383,7 @@ static void sh64_dcache_purge_user_pages(struct mm_struct 
*mm,
unsigned long addr, unsigned long end)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -397,7 +398,11 @@ static void sh64_dcache_purge_user_pages(struct mm_struct 
*mm,
if (pgd_bad(*pgd))
return;
 
-   pud = pud_offset(pgd, addr);
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d) || p4d_bad(*p4d))
+   return;
+
+   pud = pud_offset(p4d, addr);
if (pud_none(*pud) || pud_bad(*pud))
return;
 
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index a2b0275413e8..ebd30003fd06 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -53,6 +53,7 @@ static void show_pte(struct mm_struct *mm, unsigned long addr)
 (u64)pgd_val(*pgd));
 
do {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -65,7 +66,20 @@ static void show_pte(struct mm_struct *mm, unsigned long 
addr)
break;
}
 
-   pud = pud_offset(pgd, addr);
+   p4d = p4d_offset(pgd, addr);
+   if (PTRS_PER_P4D != 1)
+   pr_cont(", *p4d=%0*Lx", (u32)(sizeof(*p4d) * 2),
+   (u64)p4d_val(*p4d));
+
+   if (p4d_none(*p4d))
+   break;
+
+   if (p4d_bad(*p4d)) {
+   pr_cont("(bad)");
+   break;
+   }
+
+   pud = pud_offset(p4d, addr);
  

[PATCH v3 10/14] sh: drop __pXd_offset() macros that duplicate pXd_index() ones

2020-02-27 Thread Mike Rapoport
From: Mike Rapoport 

The __pXd_offset() macros are identical to the pXd_index() macros and there
is no point in keeping both of them. All architectures define and use
pXd_index(), so let's keep only those to make sh consistent with the rest
of the kernel.
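
As a one-line illustration (not from the patch), both spellings compute
the same slot, so indexing a table reads the same either way:

        pgd = swapper_pg_dir + pgd_index(vaddr);   /* == __pgd_offset(vaddr) */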

Signed-off-by: Mike Rapoport 
---
 arch/sh/include/asm/pgtable_32.h | 5 ++---
 arch/sh/include/asm/pgtable_64.h | 5 ++---
 arch/sh/mm/init.c| 6 +++---
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 29274f0e428e..4acce5f2cbf9 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -407,13 +407,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t 
newprot)
 /* to find an entry in a page-table-directory. */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, address)((mm)->pgd + pgd_index(address))
-#define __pgd_offset(address)  pgd_index(address)
 
 /* to find an entry in a kernel page-table-directory */
 #define pgd_offset_k(address)  pgd_offset(_mm, address)
 
-#define __pud_offset(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-#define __pmd_offset(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
+#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+#define pmd_index(address) (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
 
 /* Find an entry in the third-level page table.. */
 #define pte_index(address) ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
diff --git a/arch/sh/include/asm/pgtable_64.h b/arch/sh/include/asm/pgtable_64.h
index 1778bc5971e7..27cc282ec6c0 100644
--- a/arch/sh/include/asm/pgtable_64.h
+++ b/arch/sh/include/asm/pgtable_64.h
@@ -46,14 +46,13 @@ static __inline__ void set_pte(pte_t *pteptr, pte_t pteval)
 
 /* To find an entry in a generic PGD. */
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
-#define __pgd_offset(address) pgd_index(address)
 #define pgd_offset(mm, address) ((mm)->pgd+pgd_index(address))
 
 /* To find an entry in a kernel PGD. */
 #define pgd_offset_k(address) pgd_offset(_mm, address)
 
-#define __pud_offset(address)  (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-#define __pmd_offset(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1))
+#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+/* #define pmd_index(address)  (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)) */
 
 /*
  * PMD level access routines. Same notes as above.
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index d1b1ff2be17a..4bab79baee75 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -172,9 +172,9 @@ void __init page_table_range_init(unsigned long start, 
unsigned long end,
unsigned long vaddr;
 
vaddr = start;
-   i = __pgd_offset(vaddr);
-   j = __pud_offset(vaddr);
-   k = __pmd_offset(vaddr);
+   i = pgd_index(vaddr);
+   j = pud_index(vaddr);
+   k = pmd_index(vaddr);
pgd = pgd_base + i;
 
for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
-- 
2.24.0



  1   2   >