[PATCH V5 0/3] arm64/mm/hotplug: Improve memory offline event notifier

2020-11-08 Thread Anshuman Khandual
This series brings three different changes to the only memory event notifier on
the arm64 platform. These changes improve its robustness while also enhancing
debug capability during potential memory offlining error conditions.

This applies on 5.10-rc3

Changes in V5:

- Added some more documentation in [PATCH 2/3]
- Used for_each_mem_range() as for_each_memblock() has been dropped
- validate_bootmem_online() just prints non-compliant early sections
- validate_bootmem_online() does not prevent notifier registration
- Folded two pr_err() statements into just a single one per Gavin

Changes in V4: 
(https://lore.kernel.org/linux-arm-kernel/1601387687-6077-1-git-send-email-anshuman.khand...@arm.com/)

- Dropped additional return in prevent_bootmem_remove_init() per Gavin
- Rearranged memory section loop in prevent_bootmem_remove_notifier() per Gavin
- Call out boot memory ranges for attempted offline or offline events

Changes in V3: 
(https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=352717)

- Split the single patch into three patch series per Catalin
- Trigger changed from setup_arch() to early_initcall() per Catalin
- Renamed back memory_hotremove_notifier() as prevent_bootmem_remove_init()
- validate_bootmem_online() is now called from prevent_bootmem_remove_init() 
per Catalin
- Skip registering the notifier if validate_bootmem_online() returns negative

Changes in V2: (https://patchwork.kernel.org/patch/11732161/)

- Dropped all generic changes wrt MEM_CANCEL_OFFLINE reasons enumeration
- Dropped all related (processing MEM_CANCEL_OFFLINE reasons) changes on arm64
- Added validate_boot_mem_online_state() that gets called with early_initcall()
- Added CONFIG_MEMORY_HOTREMOVE check before registering memory notifier
- Moved notifier registration i.e memory_hotremove_notifier into setup_arch()

Changes in V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=271237)

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: Gavin Shan 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (3):
  arm64/mm/hotplug: Register boot memory hot remove notifier earlier
  arm64/mm/hotplug: Enable MEM_OFFLINE event handling
  arm64/mm/hotplug: Ensure early memory sections are all online

 arch/arm64/mm/mmu.c | 95 +++--
 1 file changed, 91 insertions(+), 4 deletions(-)

-- 
2.20.1



[PATCH V5 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-11-08 Thread Anshuman Khandual
This enables MEM_OFFLINE memory event handling. It will help intercept any
possible error condition, such as boot memory somehow still getting offlined
even after an explicit notifier failure, potentially due to a future change
in the generic hotplug framework. This would help detect such scenarios and
aid further debugging. While here, also call out the first section that is
attempted for offline or actually got offlined.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Gavin Shan 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 71dd9d753b8b..ca6d4952b733 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1493,13 +1493,43 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
unsigned long pfn = arg->start_pfn;
 
-   if (action != MEM_GOING_OFFLINE)
+   if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
return NOTIFY_OK;
 
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   unsigned long start = PFN_PHYS(pfn);
+   unsigned long end = start + (1UL << PA_SECTION_SHIFT);
+
ms = __pfn_to_section(pfn);
-   if (early_section(ms))
+   if (!early_section(ms))
+   continue;
+
+   if (action == MEM_GOING_OFFLINE) {
+   /*
+* Boot memory removal is not supported. Prevent
+* it via blocking any attempted offline request
+* for the boot memory and just report it.
+*/
+   pr_warn("Boot memory [%lx %lx] offlining attempted\n", 
start, end);
return NOTIFY_BAD;
+   } else if (action == MEM_OFFLINE) {
+   /*
+* This should have never happened. Boot memory
+* offlining should have been prevented by this
+* very notifier. Probably some memory removal
+* procedure might have changed which would then
+* require further debug.
+*/
+   pr_err("Boot memory [%lx %lx] offlined\n", start, end);
+
+   /*
+* Core memory hotplug does not process a return
+* code from the notifier for MEM_OFFLINE events.
+* The error condition has been reported. Return
+* from here as if ignored.
+*/
+   return NOTIFY_DONE;
+   }
}
return NOTIFY_OK;
 }
-- 
2.20.1



[PATCH V5 3/3] arm64/mm/hotplug: Ensure early memory sections are all online

2020-11-08 Thread Anshuman Khandual
This adds a validation function that scans the entire boot memory and makes
sure that all early memory sections are online. This check is essential for
the memory notifier to work properly, as it cannot prevent any boot memory
from being offlined if all sections are not online to begin with. The boot
section scanning is selectively enabled only with DEBUG_VM.
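
For readers unfamiliar with the section state bits involved, the early/online
checks used throughout this series boil down to flag tests on
mem_section->section_mem_map, roughly as in include/linux/mmzone.h of this era
(paraphrased sketch; exact bit values and guards may differ):

#define SECTION_IS_ONLINE	(1UL << 2)
#define SECTION_IS_EARLY	(1UL << 3)

/* Section was part of the memory map discovered during boot */
static inline int early_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_IS_EARLY));
}

/* Section is currently online, i.e. its pages are handed to the page allocator */
static inline int online_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_IS_ONLINE));
}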

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 48 +
 1 file changed, 48 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ca6d4952b733..f293ff50 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1538,6 +1538,53 @@ static struct notifier_block prevent_bootmem_remove_nb = {
.notifier_call = prevent_bootmem_remove_notifier,
 };
 
+/*
+ * This ensures that boot memory sections on the platform are online
+ * from early boot. Memory sections cannot be prevented from being
+ * offlined if, for some reason, they are not online to begin with.
+ * This helps validate the basic assumption on which the above memory
+ * event notifier works to prevent boot memory section offlining and
+ * its possible removal.
+ */
+static void validate_bootmem_online(void)
+{
+   phys_addr_t start, end, addr;
+   struct mem_section *ms;
+   u64 i;
+
+   /*
+* Scanning across all memblock might be expensive
+* on some big memory systems. Hence enable this
+* validation only with DEBUG_VM.
+*/
+   if (!IS_ENABLED(CONFIG_DEBUG_VM))
+   return;
+
+   for_each_mem_range(i, &start, &end) {
+   for (addr = start; addr < end; addr += (1UL << PA_SECTION_SHIFT)) {
+   ms = __pfn_to_section(PHYS_PFN(addr));
+
+   /*
+* All memory ranges in the system at this point
+* should have been marked as early sections.
+*/
+   WARN_ON(!early_section(ms));
+
+   /*
+* Memory notifier mechanism here to prevent boot
+* memory offlining depends on the fact that each
+* early section memory on the system is initially
+* online. Otherwise a given memory section which
+* is already offline will be overlooked and can
+* be removed completely. Call out such sections.
+*/
+   if (!online_section(ms))
+   pr_err("Boot memory [%llx %llx] is offline, can 
be removed\n",
+   addr, addr + (1UL << PA_SECTION_SHIFT));
+   }
+   }
+}
+
 static int __init prevent_bootmem_remove_init(void)
 {
int ret = 0;
@@ -1545,6 +1592,7 @@ static int __init prevent_bootmem_remove_init(void)
if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return ret;
 
+   validate_bootmem_online();
ret = register_memory_notifier(&prevent_bootmem_remove_nb);
if (ret)
pr_err("%s: Notifier registration failed %d\n", __func__, ret);
-- 
2.20.1



[PATCH V5 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier

2020-11-08 Thread Anshuman Khandual
This moves memory notifier registration earlier in the boot process, from
device_initcall() to early_initcall(), which will help guard against potential
early boot memory offline requests. Even though there should not be any actual
offlining requests until memory block devices are initialized with
memory_dev_init(), the generic init sequence might change in the future. Hence
an early registration for the memory event notifier would be helpful. While
here, just skip the registration if CONFIG_MEMORY_HOTREMOVE is not enabled and
also call out when memory notifier registration fails.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Gavin Shan 
Reviewed-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1c0f3e02f731..71dd9d753b8b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1510,7 +1510,16 @@ static struct notifier_block prevent_bootmem_remove_nb = {
 
 static int __init prevent_bootmem_remove_init(void)
 {
-   return register_memory_notifier(_bootmem_remove_nb);
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+   return ret;
+
+   ret = register_memory_notifier(&prevent_bootmem_remove_nb);
+   if (ret)
+   pr_err("%s: Notifier registration failed %d\n", __func__, ret);
+
+   return ret;
 }
-device_initcall(prevent_bootmem_remove_init);
+early_initcall(prevent_bootmem_remove_init);
 #endif
-- 
2.20.1



Re: [PATCH] arm64: NUMA: Kconfig: Increase max number of nodes

2020-10-20 Thread Anshuman Khandual



On 10/20/2020 11:39 PM, Valentin Schneider wrote:
> 
> Hi,
> 
> Nit on the subject: this only increases the default, the max is still 2¹⁰.

Agreed.

> 
> On 20/10/20 18:34, Vanshidhar Konda wrote:
>> The current arm64 max NUMA nodes default to 4. Today's arm64 systems can
>> reach or exceed 16. Increase the number to 64 (matching x86_64).
>>
>> Signed-off-by: Vanshidhar Konda 
>> ---
>>  arch/arm64/Kconfig | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 893130ce1626..3e69d3c981be 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -980,7 +980,7 @@ config NUMA
>>  config NODES_SHIFT
>>   int "Maximum NUMA Nodes (as a power of 2)"
>>   range 1 10
>> -default "2"
>> +default "6"
> 
> This leads to more statically allocated memory for things like node to CPU
> maps (see uses of MAX_NUMNODES), but that shouldn't be too much of an
> issue.

Smaller systems should not be required to waste that memory by default,
unless there is a real and available larger system with that many nodes.
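
For context, the statically sized structures in question scale with
MAX_NUMNODES, which is derived directly from this Kconfig value, roughly as
in include/linux/numa.h (paraphrased; exact guards may differ by version):

#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT	CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT	0
#endif

/* current default "2" => 4 nodes, proposed default "6" => 64 nodes */
#define MAX_NUMNODES	(1 << NODES_SHIFT)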

> 
> AIUI this also directly correlates to how many more page->flags bits are
> required: are we sure the max 10 works on any aarch64 platform? I'm

We will have to test that. Besides, 256 (2^8) is the first threshold to
be crossed here.

> genuinely asking here, given that I'm mostly a stranger to the mm
> world. The default should be something we're somewhat confident works
> everywhere.

Agreed. Do we really need to match x86 right now? Do we really have
systems with 64 nodes? We should not increase the default node value
and then try to solve new problems, when there might not be any system
that could even use it. I would suggest increasing the NODES_SHIFT
value only up to what a real and available system requires.

> 
>>   depends on NEED_MULTIPLE_NODES
>>   help
>> Specify the maximum number of NUMA Nodes available on the target
>


Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-10-19 Thread Anshuman Khandual



On 10/07/2020 02:09 PM, David Hildenbrand wrote:
>>> We do have __add_pages()->check_hotplug_memory_addressable() where we
>>> already check against MAX_PHYSMEM_BITS.
>>
>> Initially, I thought about check_hotplug_memory_addressable() but the
>> existing check that asserts end of hotplug wrt MAX_PHYSMEM_BITS, is
>> generic in nature. AFAIK the linear mapping problem is arm64 specific,
>> hence I was not sure whether to add an arch specific callback which
>> will give platform an opportunity to weigh in for these ranges.
> 
> Also on s390x, the range where you can create an identity mapping depends on
> - early kernel setup
> - kasan
> 
> (I assume it's the same for all archs)
> 
> See arch/s390/mm/vmem.c:vmem_add_mapping(), which contains similar
> checks (VMEM_MAX_PHYS).

Once there is a high level function, all these platform specific
checks should go in their arch_get_mappable_range() instead.

> 
>>
>> But hold on, check_hotplug_memory_addressable() only gets called from
>> __add_pages() after linear mapping creation in arch_add_memory(). How
>> would it help ? We need some thing for add_memory(), its variants and
>> also possibly for memremap_pages() when it calls arch_add_memory().
>>
> 
> Good point. We chose that place for simplicity when adding it (I was
> favoring calling it at two places back then). Now, we might have good
> reason to move the checks further up the call chain.

The check_hotplug_memory_addressable() check in add_pages() does not add
much, as linear mapping creation must have been completed by then. I
guess moving this check inside the single high level function would be
better.

But checking against MAX_PHYSMEM_BITS might no longer be required, as
the range would already have been validated against the applicable memhp_range.

> 
> Most probably,
> 
> struct range memhp_get_addressable_range(bool need_mapping)
> {
>   ...
> }

Something like this...

+struct memhp_range {
+   u64 start;
+   u64 end;
+};
+
+#ifndef arch_get_mappable_range
+static inline struct memhp_range arch_get_mappable_range(bool need_mapping)
+{
+   struct memhp_range range = {
+   .start = 0UL,
+   .end = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1,
+   };
+   return range;
+}
+#endif
+
+static inline struct memhp_range memhp_get_mappable_range(bool need_mapping)
+{
+   const u64 max_phys = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1;
+   struct memhp_range range = arch_get_mappable_range(need_mapping);
+
+   if (range.start > max_phys) {
+   range.start = 0;
+   range.end = 0;
+   }
+   range.end = min_t(u64, range.end, max_phys);
+   return range;
+}
+
+static inline bool memhp_range_allowed(u64 start, u64 end, bool need_mapping)
+{
+   struct memhp_range range = memhp_get_mappable_range(need_mapping);
+
+   return (start <= end) && (start >= range.start) && (end <= range.end);
+}
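
To make the intended call site concrete, here is a minimal usage sketch of the
helper proposed above, assuming the inclusive end-address convention used in
memhp_range_allowed() and that hotplugged RAM needs a linear mapping (the
exact hook points, add_memory() vs arch_add_memory(), were still being
discussed in this thread):

/* Sketch only: how a caller might consume the proposed helper. */
static int example_validate_hotplug_range(u64 start, u64 size)
{
	/* 'true' => the range must also fit inside the linear mapping */
	if (!memhp_range_allowed(start, start + size - 1, true))
		return -ERANGE;

	return 0;
}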

> 
> Would make sense, to deal with memremap_pages() without identity mappings.
> 
> We have two options:
> 
> 1. Generalize the checks, check early in applicable functions. Have a
> single way to get applicable ranges, both in callers, and inside the
> functions.
Inside the functions, check_hotplug_memory_addressable() in add_pages() ?
We could just drop that. Single generalized check with an arch callback
makes more sense IMHO.

> 
> 2. Keep the checks where they are. Add memhp_get_addressable_range() so
> callers can figure limits out. It's less clear what the relation between
> the different checks is. And it's likely if things change at one place
> that we miss the other place.

Right, does not sound like a good idea :)

> 
>>> struct range memhp_get_addressable_range(void)
>>> {
>>> const u64 max_phys = (1ull << (MAX_PHYSMEM_BITS + 1)) - 1;
>>> struct range range = arch_get_mappable_range();
>>
>> What would you suggest as the default fallback range if a platform
>> does not define this callback.
> 
> Just the largest possible range until we implement them. IIRC, an s390x
> version should be easy to add.

[0UL...(1ull << (MAX_PHYSMEM_BITS + 1)) - 1] is the largest possible
hotplug range.

> 
>>
>>>
>>> if (range.start > max_phys) {
>>> range.start = 0;
>>> range.end = 0;
>>> }
>>> range.end = max_t(u64, range.end, max_phys);
>>
>> min_t instead ?
> 
> Yeah :)
> 
>>
>>>
>>> return range;
>>> }
>>>
>>>
>>> That, we can use in check_hotplug_memory_addressable(), and also allow
>>> add_memory*() users to make use of it.
>>
>> So this check would happen twice during a hotplug ?
> 
> Right now it's like calling a function with wrong arguments - you just
> don't have a clue what valid arguments are, because non-obvious errors
> (besides -ENOMEM, which is a temporary error) pop up deep down the call
> chain.
> 
> For example, virito-mem would use it to detect during device
> initialization the usable device range, and warn the user accordingly.
> It currently manually checks for MAX_PHYSMEM_BITS, but 

Re: [PATCH 2/2] arm64: allow hotpluggable sections to be offlined

2020-10-19 Thread Anshuman Khandual



On 10/17/2020 01:04 PM, David Hildenbrand wrote:
> 
>> Am 17.10.2020 um 04:03 schrieb Sudarshan Rajagopalan 
>> :
>>
>> On receiving the MEM_GOING_OFFLINE notification, we disallow offlining of
>> any boot memory by checking if section_early or not. With the introduction
>> of SECTION_MARK_HOTPLUGGABLE, allow boot mem sections that are marked as
>> hotpluggable with this bit set to be offlined and removed. This now allows
>> certain boot mem sections to be offlined.
>>
> 
> The check (notifier) is in arm64 code. I don‘t see why you cannot make such 
> decisions completely in arm64 code? Why would you have to mark sections?
> 
> Also, I think I am missing from *where* the code that marks sections 
> removable is even called? Who makes such decisions?

From the previous patch.

+EXPORT_SYMBOL_GPL(mark_memory_hotpluggable);

> 
> This feels wrong. 
> 
>> Signed-off-by: Sudarshan Rajagopalan 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Anshuman Khandual 
>> Cc: Mark Rutland 
>> Cc: Gavin Shan 
>> Cc: Logan Gunthorpe 
>> Cc: David Hildenbrand 
>> Cc: Andrew Morton 
>> Cc: Steven Price 
>> Cc: Suren Baghdasaryan 
>> ---
>> arch/arm64/mm/mmu.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 75df62fea1b6..fb8878698672 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1487,7 +1487,7 @@ static int prevent_bootmem_remove_notifier(struct 
>> notifier_block *nb,
>>
>>for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>ms = __pfn_to_section(pfn);
>> -if (early_section(ms))
>> +if (early_section(ms) && !removable_section(ms))

Until the challenges related to boot memory removal on the arm64 platform
are resolved, no portion of boot memory can be offlined, let alone via a
driver making such decisions.

>>return NOTIFY_BAD;
>>}
>>return NOTIFY_OK;
>> -- 
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>> a Linux Foundation Collaborative Project
>>
> 
> 


Re: [PATCH 0/2] mm/memory_hotplug, arm64: allow certain bootmem sections to be offlinable

2020-10-19 Thread Anshuman Khandual
Hello Sudarshan,

On 10/17/2020 07:32 AM, Sudarshan Rajagopalan wrote:
> In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: 
> Enable memory hot remove")) for arm64, there’s a notifier put in place that 
> prevents boot memory from being offlined and removed. The commit text 
> mentions that boot memory on arm64 cannot be removed. But x86 and other archs 
> doesn’t seem to do this prevention.
> 
> The current logic is that only “new” memory blocks which are hot-added can 
> later be offlined and removed. The memory that system booted up with cannot 
> be offlined and removed. But there could be many usercases such as inter-VM 
> memory sharing where a primary VM could offline and hot-remove a 
> block/section of memory and lend it to secondary VM where it could hot-add 
> it. And after usecase is done, the reverse happens where secondary VM 
> hot-removes and gives it back to primary which can hot-add it back. In such 
> cases, the present logic for arm64 doesn’t allow this hot-remove in primary 
> to happen.
> 
> Also, on systems with movable zone that sort of guarantees pages to be 
> migrated and isolated so that blocks can be offlined, this logic also defeats 
> the purpose of having a movable zone which system can rely on memory 
> hot-plugging, which say virt-io mem also relies on for fully plugged memory 
> blocks.
> 
> This patch tries to solve this by introducing a new section mem map bit
> 'SECTION_MARK_HOTPLUGGABLE' which allows the concerned module drivers to
> mark required sections as "hotpluggable" by setting this bit. Also this
> marking is only allowed for sections which are in movable zone and have 
> unmovable pages. The arm64 mmu code on receiving the MEM_GOING_OFFLINE 
> notification, we disallow offlining of any boot memory by checking if 
> section_early or not. With the introduction of SECTION_MARK_HOTPLUGGABLE, we 
> allow boot mem sections that are marked as hotpluggable with this bit set to 
> be offlined and removed. Thereby allowing required bootmem sections to be 
> offlinable.

This series was posted right after another thread you initiated in this regard
but without even waiting for it to conclude in any manner.

https://lore.kernel.org/linux-arm-kernel/de8388df2fbc5a6a33aab95831ba7...@codeaurora.org/
 

Inter-VM memory migration could be solved by other methods, as David has
mentioned. Boot memory cannot be removed, and hence offlined, on arm64 due
to multiple reasons, including making kexec non-functional afterwards.
Besides, these intrusive core MM changes are not really required.

- Anshuman


Re: arm64: dropping prevent_bootmem_remove_notifier

2020-10-19 Thread Anshuman Khandual



On 10/17/2020 03:05 PM, David Hildenbrand wrote:
> On 17.10.20 01:11, Sudarshan Rajagopalan wrote:
>>
>> Hello Anshuman,
>>
> David here,
> 
> in general, if your driver offlines+removes random memory, it is doing
> something *very* wrong and dangerous. You shouldn't ever be
> offlining+removing memory unless
> a) you own that boot memory after boot. E.g., the ACPI driver owns DIMMs
> after a reboot.
> b) you added that memory via add_memory() and friends.

Right.

> 
> Even trusting that offline memory can be used by your driver is wrong.

Right.

> 
> Just imagine you racing with actual memory hot(un)plug, you'll be in
> *big* trouble. For example,
> 
> 1. You offlined memory and assume you can use it. A DIMM can simply get
> unplugged. you're doomed.
> 2. You offlined+removed memory and assume you can use it. A DIMM can
> simply get unplugged and the whole machine would crash.
> 
> Or imagine your driver running on a system that has virtio-mem, which
> will try to remove/offline+remove memory that was added by virtio-mem/
> is under its control.
> 
> Long story short: don't do it.
> 
> There is *one* instance in Linux where we currently allow it for legacy
> reasons. It is powernv/memtrace code that offlines+removes boot memory.
> But here we are guaranteed to run in an environment (HW) without any
> actual memory hot(un)plug.
> 
> I guess you're going to say "but in our environment we don't have ..." -
> this is not a valid argument to change such generic things upstream /
> introducing such hacks.

Agreed.

> 
>> In the patch that enables memory hot-remove (commit bbd6ec605c0f 
>> ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier 
>> put in place that prevents boot memory from being offlined and removed. 
>> Also commit text mentions that boot memory on arm64 cannot be removed. 
>> We wanted to understand more about the reasoning for this. X86 and other 
>> archs doesn’t seem to do this prevention. There’s also comment in the 
>> code that this notifier could be dropped in future if and when boot 
>> memory can be removed.
> 
> The issue is that with *actual* memory hotunplug (for what the whole
> machinery should be used for), that memory/DIMM will be gone. And as you
> cannot fixup the initial memmap, if you were to reboot that machine, you
> would simply crash immediately.

Right.

> 
> On x86, you can have that easily: hotplug DIMMs on bare metal and
> reboot. The DIMMs will be exposed via e820 during boot, so they are
> "early", although if done right (movable_node, movable_core and
> similar), they can get hotunplugged later. Important in environments
> where you want to hotunplug whole nodes. But has HW on x86 will properly
> adjust the initial memmap / e820, there is no such issue as on arm64.

That is the primary problem.

> 
>>
>> The current logic is that only “new” memory blocks which are hot-added 
>> can later be offlined and removed. The memory that system booted up with 
>> cannot be offlined and removed. But there could be many usercases such 
>> as inter-VM memory sharing where a primary VM could offline and 
>> hot-remove a block/section of memory and lend it to secondary VM where 
>> it could hot-add it. And after usecase is done, the reverse happens 
> 
> That use case is using the wrong mechanisms. It shouldn't be
> offlining+removing memory. Read below.
> 
>> where secondary VM hot-removes and gives it back to primary which can 
>> hot-add it back. In such cases, the present logic for arm64 doesn’t 
>> allow this hot-remove in primary to happen.
>>
>> Also, on systems with movable zone that sort of guarantees pages to be 
>> migrated and isolated so that blocks can be offlined, this logic also 
>> defeats the purpose of having a movable zone which system can rely on 
>> memory hot-plugging, which say virt-io mem also relies on for fully 
>> plugged memory blocks.
> 
> The MOVABLE_ZONE is *not* just for better guarantees when trying to
> hotunplug memory. It also increases the number of THP/huge pages. And
> that part works just fine.

Right.

> 
>>
>> So we’re trying to understand the reasoning for such a prevention put in 
>> place for arm64 arch alone.
>>
>> One possible way to solve this is by marking the required sections as 
>> “non-early” by removing the SECTION_IS_EARLY bit in its section_mem_map. 
>> This puts these sections in the context of “memory hotpluggable” which 
>> can be offlined-removed and added-onlined which are part of boot RAM 
>> itself and doesn’t need any extra blocks to be hot added. This way of 
>> marking certain sections as “non-early” could be exported so that module 
>> drivers can set the required number of sections as “memory 
> 
> Oh please no. No driver should be doing that. That's just hacking around
> the root issue: you're not supposed to do that.
> 
>> hotpluggable”. This could have certain checks put in place to see which 
>> sections are allowed, example only movable zone sections can be marked 
>> as “non-early”.

Re: arm64: dropping prevent_bootmem_remove_notifier

2020-10-18 Thread Anshuman Khandual
Hello Sudarshan,

On 10/17/2020 04:41 AM, Sudarshan Rajagopalan wrote:
> 
> Hello Anshuman,
> 
> In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: 
> Enable memory hot remove")) for arm64, there’s a notifier put in place that 
> prevents boot memory from being offlined and removed. Also commit text 
> mentions that boot memory on arm64 cannot be removed. We wanted to understand 
> more about the reasoning for this. X86 and other archs doesn’t seem to do 
> this prevention. There’s also comment in the code that this notifier could be 
> dropped in future if and when boot memory can be removed.

Right, and till then the notifier cannot be dropped. There were many
discussions around this topic during multiple iterations of the memory hot
remove series. Hence, I would just request you to please go through them
first. This list here is from one such series
(https://lwn.net/Articles/809179/) but might not be exhaustive.

-
On arm64 platform, it is essential to ensure that the boot time discovered
memory couldn't be hot-removed so that,

1. FW data structures used across kexec are idempotent
   e.g. the EFI memory map.

2. linear map or vmemmap would not have to be dynamically split, and can
   map boot memory at a large granularity

3. Avoid penalizing paths that have to walk page tables, where we can be
   certain that the memory is not hot-removable
-

The primary reason is kexec, which would otherwise need substantial rework.

> 
> The current logic is that only “new” memory blocks which are hot-added can 
> later be offlined and removed. The memory that system booted up with cannot 
> be offlined and removed. But there could be many usercases such as inter-VM 
> memory sharing where a primary VM could offline and hot-remove a 
> block/section of memory and lend it to secondary VM where it could hot-add 
> it. And after usecase is done, the reverse happens where secondary VM 
> hot-removes and gives it back to primary which can hot-add it back. In such 
> cases, the present logic for arm64 doesn’t allow this hot-remove in primary 
> to happen.

That is not true. Each VM could just boot with a minimum amount of boot
memory, which cannot be offlined or removed, and a possibly larger portion of
memory can then be hot added during the boot process itself, making it
available for any future inter-VM sharing purposes. Hence this problem could
easily be solved in user space itself.

> 
> Also, on systems with movable zone that sort of guarantees pages to be 
> migrated and isolated so that blocks can be offlined, this logic also defeats 
> the purpose of having a movable zone which system can rely on memory 
> hot-plugging, which say virt-io mem also relies on for fully plugged memory 
> blocks.
ZONE_MOVABLE does not really guarantee migration, isolation and removal.
There are reasons an offline request might just fail. I agree that those
reasons are normally not platform related, but core memory hotplug gives the
platform an opportunity to decline an offlining request via a notifier. Hence
a ZONE_MOVABLE offline can be denied. Semantics-wise we are still okay.

This might look a bit inconsistent: with movablecore/kernelcore/movable_node,
or with firmware sending in 'hot pluggable' memory (IIRC arm64 does not really
support this yet), the system might end up with ZONE_MOVABLE-marked boot
memory which cannot be offlined or removed. But an offline notifier action is
orthogonal. Hence, to preserve existing behavior, those kernel command line
paths that create ZONE_MOVABLE during boot were not blocked.

> 
> I understand that some region of boot RAM shouldn’t be allowed to be removed, 
> but such regions won’t be allowed to be offlined in first place since pages 
> cannot be migrated and isolated, example reserved pages.
> 
> So we’re trying to understand the reasoning for such a prevention put in 
> place for arm64 arch alone.

The primary reason is kexec. During kexec on arm64, the next kernel's memory
map is derived from firmware and not from the currently running kernel. So the
next kernel will crash if it accesses memory that was removed in the running
kernel. Until kexec on arm64 changes substantially and takes into account the
memory actually available in the current kernel, boot memory cannot be removed.

> 
> One possible way to solve this is by marking the required sections as 
> “non-early” by removing the SECTION_IS_EARLY bit in its section_mem_map.

That is too intrusive from core memory perspective.

 This puts these sections in the context of “memory hotpluggable” which can be 
offlined-removed and added-onlined which are part of boot RAM itself and 
doesn’t need any extra blocks to be hot added. This way of marking certain 
sections as “non-early” could be exported so that module drivers can set the 
required number of sections as “memory hotpluggable”. This could have certain 
checks put in place to see which sections are allowed, example only movable 
zone sections can be 

Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-10-13 Thread Anshuman Khandual



On 10/12/2020 12:59 PM, Ard Biesheuvel wrote:
> On Tue, 6 Oct 2020 at 08:36, Anshuman Khandual
>  wrote:
>>
>>
>>
>> On 09/30/2020 01:32 PM, Anshuman Khandual wrote:
>>> But if __is_lm_address() checks against the effective linear range instead
>>> i.e [_PAGE_OFFSET(vabits_actual)..(PAGE_END - 1)], it can be used for hot
>>> plug physical range check there after. Perhaps something like this, though
>>> not tested properly.
>>>
>>> diff --git a/arch/arm64/include/asm/memory.h 
>>> b/arch/arm64/include/asm/memory.h
>>> index afa722504bfd..6da046b479d4 100644
>>> --- a/arch/arm64/include/asm/memory.h
>>> +++ b/arch/arm64/include/asm/memory.h
>>> @@ -238,7 +238,10 @@ static inline const void *__tag_set(const void *addr, 
>>> u8 tag)
>>>   * space. Testing the top bit for the start of the region is a
>>>   * sufficient check and avoids having to worry about the tag.
>>>   */
>>> -#define __is_lm_address(addr)  (!(((u64)addr) & BIT(vabits_actual - 1)))
>>> +static inline bool __is_lm_address(unsigned long addr)
>>> +{
>>> +   return ((addr >= _PAGE_OFFSET(vabits_actual)) && (addr <= (PAGE_END 
>>> - 1)));
>>> +}
>>>
>>>  #define __lm_to_phys(addr) (((addr) + physvirt_offset))
>>>  #define __kimg_to_phys(addr)   ((addr) - kimage_voffset)
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index d59ffabb9c84..5750370a7e8c 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1451,8 +1451,7 @@ static bool inside_linear_region(u64 start, u64 size)
>>>  * address range mapped by the linear map, the start address should
>>>  * be calculated using vabits_actual.
>>>  */
>>> -   return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
>>> -   && ((start + size) <= __pa(PAGE_END - 1)));
>>> +   return __is_lm_address(__va(start)) && __is_lm_address(__va(start + 
>>> size));
>>>  }
>>>
>>>  int arch_add_memory(int nid, u64 start, u64 size,
>>
>> Will/Ard,
>>
>> Any thoughts about this ? __is_lm_address() now checks for a range instead
>> of a bit. This will be compatible later on, even if linear mapping range
>> changes from current lower half scheme.
>>
> 
> As I'm sure you have noticed, I sent out some patches that get rid of
> physvirt_offset, and which simplify __is_lm_address() to only take
> compile time constants into account (unless KASAN is enabled). This
> means that in the 52-bit VA case, __is_lm_address() does not
> distinguish between virtual addresses that can be mapped by the
> hardware and ones that cannot.

Yeah, though I was a bit late in getting to the series. So with that change
there might be areas in the linear mapping which cannot be addressed by the
hardware, and hence would also need to be checked during memory hotplug,
apart from the proposed linear mapping coverage test?

> 
> In the memory hotplug case, we need to decide whether the added memory
> will appear in the addressable area, which is a different question. So
> it makes sense to duplicate some of the logic that exists in
> arm64_memblock_init() (or factor it out) to decide whether this newly
> added memory will appear in the addressable window or not.

It seems unlikely that any hotplug agent (e.g. firmware) will ever push
through a memory range which is not accessible by the hardware, but then it
is not impossible either. In summary, arch_add_memory() should check that

1. The range can be covered inside the linear mapping
2. The range is accessible by the hardware

Before the VA space organization series, (2) was not necessary as it was
contained inside (1)?

> 
> So I think your original approach makes more sense here, although I
> think you want '(start + size - 1) <= __pa(PAGE_END - 1)' in the
> comparison above (and please drop the redundant parens)
> 

Sure, will accommodate these changes.
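
With Ard's two corrections folded in (inclusive end address, redundant
parentheses dropped), the check would look roughly like this, as a sketch only
against the code being discussed above:

static bool inside_linear_region(u64 start, u64 size)
{
	return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
	       (start + size - 1) <= __pa(PAGE_END - 1);
}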


Re: [PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-10-13 Thread Anshuman Khandual



On 10/13/2020 04:35 AM, Sudarshan Rajagopalan wrote:
> When section mappings are enabled, we allocate vmemmap pages from physically
> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section
> mappings are good to reduce TLB pressure. But when system is highly fragmented
> and memory blocks are being hot-added at runtime, its possible that such
> physically continuous memory allocations can fail. Rather than failing the
> memory hot-add procedure, add a fallback option to allocate vmemmap pages from
> discontinuous pages using vmemmap_populate_basepages().

There is a checkpatch warning here, which could be fixed while merging ?

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
When section mappings are enabled, we allocate vmemmap pages from physically

total: 0 errors, 1 warnings, 13 lines checked

> 
> Signed-off-by: Sudarshan Rajagopalan 
> Reviewed-by: Gavin Shan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Anshuman Khandual 
> Cc: Mark Rutland 
> Cc: Logan Gunthorpe 
> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: Steven Price 

Nonetheless, this looks fine. I did not see any particular problem while
creating an experimental vmemmap with interleaved section and base page
mappings.

Reviewed-by: Anshuman Khandual 

> ---
>  arch/arm64/mm/mmu.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 75df62fea1b6..44486fd0e883 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1121,8 +1121,11 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>   void *p = NULL;
>  
>   p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> - if (!p)
> - return -ENOMEM;
> + if (!p) {
> + if (vmemmap_populate_basepages(addr, next, 
> node, altmap))
> + return -ENOMEM;
> + continue;
> + }
>  
>   pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
>   } else
>


Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-10-06 Thread Anshuman Khandual



On 10/06/2020 09:04 PM, David Hildenbrand wrote:
> On 17.09.20 10:46, Anshuman Khandual wrote:
>> During memory hotplug process, the linear mapping should not be created for
>> a given memory range if that would fall outside the maximum allowed linear
>> range. Else it might cause memory corruption in the kernel virtual space.
>>
>> Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] accommodating
>> both its ends but excluding PAGE_END. Max physical range that can be mapped
>> inside this linear mapping range, must also be derived from its end points.
>>
>> When CONFIG_ARM64_VA_BITS_52 is enabled, PAGE_OFFSET is computed with the
>> assumption of 52 bits virtual address space. However, if the CPU does not
>> support 52 bits, then it falls back using 48 bits instead and the PAGE_END
>> is updated to reflect this using the vabits_actual. As for PAGE_OFFSET,
>> bits [51..48] are ignored by the MMU and remain unchanged, even though the
>> effective start address of linear map is now slightly different. Hence, to
>> reliably check the physical address range mapped by the linear map, the
>> start address should be calculated using vabits_actual. This ensures that
>> arch_add_memory() validates memory hot add range for its potential linear
>> mapping requirement, before creating it with __create_pgd_mapping().
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Ard Biesheuvel 
>> Cc: Steven Price 
>> Cc: Robin Murphy 
>> Cc: David Hildenbrand 
>> Cc: Andrew Morton 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  arch/arm64/mm/mmu.c | 27 +++
>>  1 file changed, 27 insertions(+)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 75df62fea1b6..d59ffabb9c84 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1433,11 +1433,38 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
>> unsigned long start, u64 size)
>>  free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
>>  }
>>  
>> +static bool inside_linear_region(u64 start, u64 size)
>> +{
>> +/*
>> + * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
>> + * accommodating both its ends but excluding PAGE_END. Max physical
>> + * range which can be mapped inside this linear mapping range, must
>> + * also be derived from its end points.
>> + *
>> + * With CONFIG_ARM64_VA_BITS_52 enabled, PAGE_OFFSET is defined with
>> + * the assumption of 52 bits virtual address space. However, if the
>> + * CPU does not support 52 bits, it falls back using 48 bits and the
>> + * PAGE_END is updated to reflect this using the vabits_actual. As
>> + * for PAGE_OFFSET, bits [51..48] are ignored by the MMU and remain
>> + * unchanged, even though the effective start address of linear map
>> + * is now slightly different. Hence, to reliably check the physical
>> + * address range mapped by the linear map, the start address should
>> + * be calculated using vabits_actual.
>> + */
>> +return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
>> +&& ((start + size) <= __pa(PAGE_END - 1)));
>> +}
>> +
>>  int arch_add_memory(int nid, u64 start, u64 size,
>>  struct mhp_params *params)
>>  {
>>  int ret, flags = 0;
>>  
>> +if (!inside_linear_region(start, size)) {
>> +pr_err("[%llx %llx] is outside linear mapping region\n", start, 
>> start + size);
>> +return -EINVAL;
>> +}
>> +
>>  if (rodata_full || debug_pagealloc_enabled())
>>  flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>  
>>
> 
> Can we please provide a generic way to figure limits like that out,
> especially, before calling add_memory() and friends?
> 
> We do have __add_pages()->check_hotplug_memory_addressable() where we
> already check against MAX_PHYSMEM_BITS.

Initially, I thought about check_hotplug_memory_addressable() but the
existing check that asserts end of hotplug wrt MAX_PHYSMEM_BITS, is
generic in nature. AFAIK the linear mapping problem is arm64 specific,
hence I was not sure whether to add an arch specific callback which
will give platform an opportunity to weigh in for these ranges.

But hold on, check_hotplug_memory_addressable() only gets called from
__add_pages() after li

Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-10-06 Thread Anshuman Khandual



On 09/30/2020 01:32 PM, Anshuman Khandual wrote:
> But if __is_lm_address() checks against the effective linear range instead
> i.e [_PAGE_OFFSET(vabits_actual)..(PAGE_END - 1)], it can be used for hot
> plug physical range check there after. Perhaps something like this, though
> not tested properly.
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index afa722504bfd..6da046b479d4 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -238,7 +238,10 @@ static inline const void *__tag_set(const void *addr, u8 
> tag)
>   * space. Testing the top bit for the start of the region is a
>   * sufficient check and avoids having to worry about the tag.
>   */
> -#define __is_lm_address(addr)  (!(((u64)addr) & BIT(vabits_actual - 1)))
> +static inline bool __is_lm_address(unsigned long addr)
> +{
> +   return ((addr >= _PAGE_OFFSET(vabits_actual)) && (addr <= (PAGE_END - 
> 1)));
> +}
>  
>  #define __lm_to_phys(addr) (((addr) + physvirt_offset))
>  #define __kimg_to_phys(addr)   ((addr) - kimage_voffset)
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index d59ffabb9c84..5750370a7e8c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1451,8 +1451,7 @@ static bool inside_linear_region(u64 start, u64 size)
>  * address range mapped by the linear map, the start address should
>  * be calculated using vabits_actual.
>  */
> -   return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
> -   && ((start + size) <= __pa(PAGE_END - 1)));
> +   return __is_lm_address(__va(start)) && __is_lm_address(__va(start + 
> size));
>  }
>  
>  int arch_add_memory(int nid, u64 start, u64 size,

Will/Ard,

Any thoughts about this ? __is_lm_address() now checks for a range instead
of a bit. This will be compatible later on, even if linear mapping range
changes from current lower half scheme.

- Anshuman


Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-10-06 Thread Anshuman Khandual



On 09/30/2020 04:31 PM, Ard Biesheuvel wrote:
> On Wed, 30 Sep 2020 at 10:03, Anshuman Khandual
>  wrote:
>>
>>
>> On 09/29/2020 08:52 PM, Will Deacon wrote:
>>> On Tue, Sep 29, 2020 at 01:34:24PM +0530, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 09/29/2020 02:05 AM, Will Deacon wrote:
>>>>> On Thu, Sep 17, 2020 at 02:16:42PM +0530, Anshuman Khandual wrote:
>>>>>> During memory hotplug process, the linear mapping should not be created 
>>>>>> for
>>>>>> a given memory range if that would fall outside the maximum allowed 
>>>>>> linear
>>>>>> range. Else it might cause memory corruption in the kernel virtual space.
>>>>>>
>>>>>> Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] 
>>>>>> accommodating
>>>>>> both its ends but excluding PAGE_END. Max physical range that can be 
>>>>>> mapped
>>>>>> inside this linear mapping range, must also be derived from its end 
>>>>>> points.
>>>>>>
>>>>>> When CONFIG_ARM64_VA_BITS_52 is enabled, PAGE_OFFSET is computed with the
>>>>>> assumption of 52 bits virtual address space. However, if the CPU does not
>>>>>> support 52 bits, then it falls back using 48 bits instead and the 
>>>>>> PAGE_END
>>>>>> is updated to reflect this using the vabits_actual. As for PAGE_OFFSET,
>>>>>> bits [51..48] are ignored by the MMU and remain unchanged, even though 
>>>>>> the
>>>>>> effective start address of linear map is now slightly different. Hence, 
>>>>>> to
>>>>>> reliably check the physical address range mapped by the linear map, the
>>>>>> start address should be calculated using vabits_actual. This ensures that
>>>>>> arch_add_memory() validates memory hot add range for its potential linear
>>>>>> mapping requirement, before creating it with __create_pgd_mapping().
>>>>>>
>>>>>> Cc: Catalin Marinas 
>>>>>> Cc: Will Deacon 
>>>>>> Cc: Mark Rutland 
>>>>>> Cc: Ard Biesheuvel 
>>>>>> Cc: Steven Price 
>>>>>> Cc: Robin Murphy 
>>>>>> Cc: David Hildenbrand 
>>>>>> Cc: Andrew Morton 
>>>>>> Cc: linux-arm-ker...@lists.infradead.org
>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>> Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
>>>>>> Signed-off-by: Anshuman Khandual 
>>>>>> ---
>>>>>>  arch/arm64/mm/mmu.c | 27 +++
>>>>>>  1 file changed, 27 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>> index 75df62fea1b6..d59ffabb9c84 100644
>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>> @@ -1433,11 +1433,38 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
>>>>>> unsigned long start, u64 size)
>>>>>>free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
>>>>>>  }
>>>>>>
>>>>>> +static bool inside_linear_region(u64 start, u64 size)
>>>>>> +{
>>>>>> +  /*
>>>>>> +   * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
>>>>>> +   * accommodating both its ends but excluding PAGE_END. Max physical
>>>>>> +   * range which can be mapped inside this linear mapping range, must
>>>>>> +   * also be derived from its end points.
>>>>>> +   *
>>>>>> +   * With CONFIG_ARM64_VA_BITS_52 enabled, PAGE_OFFSET is defined with
>>>>>> +   * the assumption of 52 bits virtual address space. However, if the
>>>>>> +   * CPU does not support 52 bits, it falls back using 48 bits and the
>>>>>> +   * PAGE_END is updated to reflect this using the vabits_actual. As
>>>>>> +   * for PAGE_OFFSET, bits [51..48] are ignored by the MMU and remain
>>>>>> +   * unchanged, even though the effective start address of linear map
>>>>>> +   * is now slightly different. Hence, to reliably check the physical
>>>>>> +   * address range mapped by the linear map, the start address should
>>>>>> +

Re: [PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-10-05 Thread Anshuman Khandual



On 10/02/2020 01:46 AM, Sudarshan Rajagopalan wrote:
> When section mappings are enabled, we allocate vmemmap pages from physically
> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section
> mappings are good to reduce TLB pressure. But when system is highly fragmented
> and memory blocks are being hot-added at runtime, its possible that such
> physically continuous memory allocations can fail. Rather than failing the
> memory hot-add procedure, add a fallback option to allocate vmemmap pages from
> discontinuous pages using vmemmap_populate_basepages().
> 
> Signed-off-by: Sudarshan Rajagopalan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Anshuman Khandual 
> Cc: Mark Rutland 
> Cc: Logan Gunthorpe 
> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: Steven Price 
> ---
>  arch/arm64/mm/mmu.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 75df62f..11f8639 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1121,8 +1121,15 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>   void *p = NULL;
>  
>   p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> - if (!p)
> - return -ENOMEM;
> + if (!p) {
> + /*
> +  * fallback allocating with virtually
> +  * contiguous memory for this section
> +  */

Mapping is always virtually contiguous with or without huge pages.
Please drop this comment here, as it's obvious.

> + if (vmemmap_populate_basepages(addr, next, 
> node, NULL))
> + return -ENOMEM;

Please send in the 'altmap' instead of NULL for allocation from
device memory if and when requested.
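
Concretely, the requested change is just to thread the altmap through to the
fallback path, for example (this is what the later respin shown earlier in
this digest ends up doing):

	if (!p) {
		if (vmemmap_populate_basepages(addr, next, node, altmap))
			return -ENOMEM;
		continue;
	}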


Re: [PATCH V4 3/3] arm64/mm/hotplug: Ensure early memory sections are all online

2020-10-05 Thread Anshuman Khandual



On 10/01/2020 06:23 AM, Gavin Shan wrote:
> Hi Anshuman,
> 
> On 9/29/20 11:54 PM, Anshuman Khandual wrote:
>> This adds a validation function that scans the entire boot memory and makes
>> sure that all early memory sections are online. This check is essential for
>> the memory notifier to work properly, as it cannot prevent any boot memory
>> from offlining, if all sections are not online to begin with. The notifier
>> registration is skipped, if this validation does not go through. Although
>> the boot section scanning is selectively enabled with DEBUG_VM.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Marc Zyngier 
>> Cc: Steve Capper 
>> Cc: Mark Brown 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   arch/arm64/mm/mmu.c | 59 +
>>   1 file changed, 59 insertions(+)
> 
> I don't understand why this is necessary. The core already ensure the
> corresponding section is online when trying to offline it. It's guranteed
> that section is online when the notifier is triggered. I'm not sure if
> there is anything I missed?

The current memory notifier blocks any boot memory hot removal attempt by
blocking its offlining step itself. So if some sections in boot memory are
not online (because of a bug or a change in the init sequence) by the time
the memory block device can be removed, the notifier loses the ability to
prevent its removal. This validation ensures that the entire boot memory is
in the online state; otherwise it calls out the sections that are not, with
a warning that such boot memory can be removed.

>  
> 
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 90a30f5ebfc0..b67a657ea1ad 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1522,6 +1522,62 @@ static struct notifier_block 
>> prevent_bootmem_remove_nb = {
>>   .notifier_call = prevent_bootmem_remove_notifier,
>>   };
>>   +/*
>> + * This ensures that boot memory sections on the plaltform are online

Will fix.

>     ^
>> + * during early boot. They could not be prevented from being offlined
>> + * if for some reason they are not brought online to begin with. This
>> + * help validate the basic assumption on which the above memory event
>> + * notifier works to prevent boot memory offlining and it's possible
>> + * removal.
>> + */
>> +static bool validate_bootmem_online(void)
>> +{
>> +    struct memblock_region *mblk;
>> +    struct mem_section *ms;
>> +    unsigned long pfn, end_pfn, start, end;
>> +    bool all_online = true;
>> +
>> +    /*
>> + * Scanning across all memblock might be expensive
>> + * on some big memory systems. Hence enable this
>> + * validation only with DEBUG_VM.
>> + */
>> +    if (!IS_ENABLED(CONFIG_DEBUG_VM))
>> +    return all_online;
>> +
>> +    for_each_memblock(memory, mblk) {
>> +    pfn = PHYS_PFN(mblk->base);
>> +    end_pfn = PHYS_PFN(mblk->base + mblk->size);
>> +
> 
> It's not a good idea to access @mblk->{base, size}. There are two
> accessors: memblock_region_memory_{base, end}_pfn().

Sure, will replace.

> 
>> +    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> +    ms = __pfn_to_section(pfn);
>> +
>> +    /*
>> + * All memory ranges in the system at this point
>> + * should have been marked early sections.
>> + */
>> +    WARN_ON(!early_section(ms));
>> +
>> +    /*
>> + * Memory notifier mechanism here to prevent boot
>> + * memory offlining depends on the fact that each
>> + * early section memory on the system is intially
>> + * online. Otherwise a given memory section which
>> + * is already offline will be overlooked and can
>> + * be removed completely. Call out such sections.
>> + */
> 
> s/intially/initially

Will change.

> 
>> +    if (!online_section(ms)) {
>> +    start = PFN_PHYS(pfn);
>> +    end = start + (1UL << PA_SECTION_SHIFT);
>> +    pr_err("Memory range [%lx %lx] is offline\n", start, end);
>> +    pr_err("Memory range [%lx %lx] can be removed\n", start, 
>> end);
>> +    all_online = false;
> 
> These two error messages can be c

Re: [PATCH V4 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-10-05 Thread Anshuman Khandual



On 10/01/2020 05:27 AM, Gavin Shan wrote:
> Hi Anshuman,
> 
> On 9/29/20 11:54 PM, Anshuman Khandual wrote:
>> This enables MEM_OFFLINE memory event handling. It will help intercept any
>> possible error condition such as if boot memory some how still got offlined
>> even after an explicit notifier failure, potentially by a future change in
>> generic hot plug framework. This would help detect such scenarios and help
>> debug further. While here, also call out the first section being attempted
>> for offline or got offlined.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Marc Zyngier 
>> Cc: Steve Capper 
>> Cc: Mark Brown 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   arch/arm64/mm/mmu.c | 29 +++--
>>   1 file changed, 27 insertions(+), 2 deletions(-)
>>
> 
> This looks good to me except a nit and it can be improved if
> that looks reasonable and only when you get a chance for
> respin.
> 
> Reviewed-by: Gavin Shan 
> 
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 4e70f4fea06c..90a30f5ebfc0 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1482,13 +1482,38 @@ static int prevent_bootmem_remove_notifier(struct 
>> notifier_block *nb,
>>   unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
>>   unsigned long pfn = arg->start_pfn;
>>   -    if (action != MEM_GOING_OFFLINE)
>> +    if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
>>   return NOTIFY_OK;
>>     for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> +    unsigned long start = PFN_PHYS(pfn);
>> +    unsigned long end = start + (1UL << PA_SECTION_SHIFT);
>> +
>>   ms = __pfn_to_section(pfn);
>> -    if (early_section(ms))
>> +    if (!early_section(ms))
>> +    continue;
>> +
> 
> The discussion here is irrelevant to this patch itself. It seems
> early_section() is coarse, which means all memory detected during
> boot time won't be hotpluggable?

Right, that's the policy being enforced on the arm64 platform for various
critical reasons. Please refer to the earlier discussions around memory hot
remove development on arm64.

> 
>> +    if (action == MEM_GOING_OFFLINE) {
>> +    pr_warn("Boot memory [%lx %lx] offlining attempted\n", start, 
>> end);
>>   return NOTIFY_BAD;
>> +    } else if (action == MEM_OFFLINE) {
>> +    /*
>> + * This should have never happened. Boot memory
>> + * offlining should have been prevented by this
>> + * very notifier. Probably some memory removal
>> + * procedure might have changed which would then
>> + * require further debug.
>> + */
>> +    pr_err("Boot memory [%lx %lx] offlined\n", start, end);
>> +
>> +    /*
>> + * Core memory hotplug does not process a return
>> + * code from the notifier for MEM_OFFLINE event.
>> + * Error condition has been reported. Report as
>> + * ignored.
>> + */
>> +    return NOTIFY_DONE;
>> +    }
>>   }
>>   return NOTIFY_OK;
>>   }
>>
> 
> I think NOTIFY_BAD is returned for MEM_OFFLINE wouldn't be a
> bad idea, even the core isn't handling the errno. With this,
> the code can be simplified. However, it's not a big deal and
> you probably evaluate and change when you need another respin:
> 
>     pr_warn("Boot memory [%lx %lx] %s\n",
>     (action == MEM_GOING_OFFLINE) ? "offlining attempted" : 
> "offlined",
>     start, end);
>     return NOTIFY_BAD;

Wondering whether returning NOTIFY_BAD for the MEM_OFFLINE event could
be somewhat risky, should the generic hotplug mechanism change later. But
again, it would probably just be OK.

Regardless, I also wanted to differentiate the messages for the two cases:
a warning message i.e pr_warn() for MEM_GOING_OFFLINE, which suggests an
unexpected user action, but an error message i.e pr_err() for MEM_OFFLINE,
which clearly indicates an error condition that needs to be debugged
further.
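
For reference, if the two cases were ever folded into one block as suggested,
the arguments would need to follow the format string order, roughly like the
below (just a sketch for comparison, not what was posted):

	/* Folded variant: one message and an unconditional veto */
	pr_warn("Boot memory [%lx %lx] %s\n", start, end,
		(action == MEM_GOING_OFFLINE) ?
			"offlining attempted" : "offlined");
	return NOTIFY_BAD;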


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-10-05 Thread Anshuman Khandual



On 10/05/2020 11:35 AM, Michal Hocko wrote:
> On Mon 05-10-20 07:59:12, Anshuman Khandual wrote:
>>
>>
>> On 10/02/2020 05:34 PM, Michal Hocko wrote:
>>> On Wed 30-09-20 11:30:49, Anshuman Khandual wrote:
>>>> Add following new vmstat events which will track HugeTLB page migration.
>>>>
>>>> 1. HUGETLB_MIGRATION_SUCCESS
>>>> 2. HUGETLB_MIGRATION_FAILURE
>>>>
>>>> It follows the existing semantics to accommodate HugeTLB subpages in total
>>>> page migration statistics. While here, this updates current trace event
>>>> 'mm_migrate_pages' to accommodate now available HugeTLB based statistics.
>>> What is the actual usecase? And how do you deal with the complexity
>>> introduced by many different hugetlb page sizes. Really, what is the
>>> point to having such a detailed view on hugetlb migration?
>>>
>>
>> It helps differentiate various page migration events i.e normal, THP and
>> HugeTLB and gives us more reliable and accurate measurement. Current stats
>> as per PGMIGRATE_SUCCESS and PGMIGRATE_FAIL are misleading, as they contain
>> both normal and HugeTLB pages as single entities, which is not accurate.
> 
> Yes this is true. But why does it really matter? Do you have a specific
> example?

An example which demonstrates that the mixing and misrepresentation in these
stats creates a problem ? Well, we could just create one such scenario via
an application with different VMA types and trigger some migrations. But the
fact remains that these stats are inaccurate and misleading, which is very
clear and apparent.

> 
>> After this change, PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will contain page
>> migration statistics in terms of normal pages irrespective of whether any
>> previous migrations until that point involved normal pages, THP or HugeTLB
>> (any size) pages. At the least, this fixes existing misleading stats with
>> PGMIGRATE_SUCCESS and PGMIGRATE_FAIL.
>>
>> Besides, it helps us understand HugeTLB migrations in more detail. Even
>> though HugeTLB can be of various sizes on a given platform, these new
>> stats HUGETLB_MIGRATION_SUCCESS and HUGETLB_MIGRATION_FAILURE give enough
>> overall insight into HugeTLB migration events.
> 
> While true this all is way too vague to add yet another imprecise
> counter.

Given that the user knows about all the HugeTLB mappings it has got, these
counters are not really vague and can easily be related back. Moreover, this
change completes the migration stats restructuring which was started with the
THP counters, and which otherwise remains incomplete.


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-10-04 Thread Anshuman Khandual



On 10/02/2020 05:34 PM, Michal Hocko wrote:
> On Wed 30-09-20 11:30:49, Anshuman Khandual wrote:
>> Add following new vmstat events which will track HugeTLB page migration.
>>
>> 1. HUGETLB_MIGRATION_SUCCESS
>> 2. HUGETLB_MIGRATION_FAILURE
>>
>> It follows the existing semantics to accommodate HugeTLB subpages in total
>> page migration statistics. While here, this updates current trace event
>> 'mm_migrate_pages' to accommodate now available HugeTLB based statistics.
> What is the actual usecase? And how do you deal with the complexity
> introduced by many different hugetlb page sizes. Really, what is the
> point to having such a detailed view on hugetlb migration?
>

It helps differentiate various page migration events i.e normal, THP and
HugeTLB and gives us more reliable and accurate measurement. Current stats
as per PGMIGRATE_SUCCESS and PGMIGRATE_FAIL are misleading, as they contain
both normal and HugeTLB pages as single entities, which is not accurate.

After this change, PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will contain page
migration statistics in terms of normal pages irrespective of whether any
previous migrations until that point involved normal pages, THP or HugeTLB
(any size) pages. At the least, this fixes existing misleading stats with
PGMIGRATE_SUCCESS and PGMIGRATE_FAIL.

Besides, it helps us understand HugeTLB migrations in more detail. Even
though HugeTLB can be of various sizes on a given platform, these new
stats HUGETLB_MIGRATION_SUCCESS and HUGETLB_MIGRATION_FAILURE give enough
overall insight into HugeTLB migration events.

Though these new stats accumulate HugeTLB migration successes and failures
irrespective of page size, they will still be helpful as HugeTLB is user
driven and the user should be able to decipher the accumulated stats. This
is a limitation nonetheless, as it might be difficult to determine the
available HugeTLB sizes at compile time.
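
For clarity, the accounting intended here mirrors the existing THP semantics,
roughly as below (just a sketch of the success path; the exact bookkeeping is
in the migrate_pages() hunk of the RFC, and the subpage arithmetic is my
reading of it):

	case MIGRATEPAGE_SUCCESS:
		if (is_hugetlb) {
			nr_hugetlb_succeeded++;		/* one event per HugeTLB page */
			nr_succeeded += nr_subpages;	/* PGMIGRATE_SUCCESS in base pages */
		} else if (is_thp) {
			nr_thp_succeeded++;
			nr_succeeded += nr_subpages;
		} else {
			nr_succeeded++;
		}
		break;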


Re: [PATCH v2] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-09-30 Thread Anshuman Khandual



On 10/01/2020 04:43 AM, Sudarshan Rajagopalan wrote:
> When section mappings are enabled, we allocate vmemmap pages from physically
> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section
> mappings are good to reduce TLB pressure. But when system is highly fragmented
> and memory blocks are being hot-added at runtime, its possible that such
> physically continuous memory allocations can fail. Rather than failing the
> memory hot-add procedure, add a fallback option to allocate vmemmap pages from
> discontinuous pages using vmemmap_populate_basepages().
> 
> Signed-off-by: Sudarshan Rajagopalan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Anshuman Khandual 
> Cc: Mark Rutland 
> Cc: Logan Gunthorpe 
> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: Steven Price 
> ---
>  arch/arm64/mm/mmu.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 75df62f..9edbbb8 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1121,8 +1121,18 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>   void *p = NULL;
>  
>   p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> - if (!p)
> - return -ENOMEM;
> + if (!p) {
> + if (altmap)
> + return -ENOMEM; /* no fallback */

Why ? If huge pages inside a vmemmap section might have been allocated
from altmap, the base pages could also fall back on altmap. If this patch
has just followed the existing x86 semantics, that code was written [1] long
before vmemmap_populate_basepages() supported altmap allocation. While
adding that support [2] recently, it was deliberate not to change the x86
semantics as it was a platform decision. Nonetheless, it makes sense to
fall back on altmap base pages if and when required.

[1] 4b94ffdc4163 (x86, mm: introduce vmem_altmap to augment vmemmap_populate())
[2] 1d9cfee7535c (mm/sparsemem: enable vmem_altmap support in 
vmemmap_populate_basepages())
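
In other words, something like the below inside vmemmap_populate(), where
addr, next, node and altmap are the surrounding loop variables (a minimal
sketch of the suggestion only, not the posted patch):

	p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
	if (!p) {
		/*
		 * PMD_SIZE allocation failed, possibly due to fragmentation
		 * or an exhausted altmap - fall back to base pages while
		 * reusing the same altmap, instead of failing the hot add.
		 */
		if (vmemmap_populate_basepages(addr, next, node, altmap))
			return -ENOMEM;
		continue;
	}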


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-09-30 Thread Anshuman Khandual
On 09/30/2020 01:16 PM, Oscar Salvador wrote:
> On Wed, Sep 30, 2020 at 11:30:49AM +0530, Anshuman Khandual wrote:
>> -is_thp = PageTransHuge(page) && !PageHuge(page);
>> -nr_subpages = thp_nr_pages(page);
>> +is_thp = false;
>> +is_hugetlb = false;
>> +if (PageTransHuge(page)) {
>> +if (PageHuge(page))
>> +is_hugetlb = true;
>> +else
>> +is_thp = true;
>> +}
> 
> Since PageHuge only returns true for hugetlb pages, I think the following is
> more simple?
> 
>   if (PageHuge(page))
>   is_hugetlb = true;
>   else if (PageTransHuge(page))
>   is_thp = true

Right, it would be simple. But as Mike had mentioned before, the PageHuge()
check is more expensive than PageTransHuge(). This proposal just tries
not to call PageHuge() unless the page first clears PageTransHuge(),
saving some potential CPU cycles on normal pages.

> 
> 
> Besides that, it looks good to me:
> 
> Reviewed-by: Oscar Salvador 
> 


Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-09-30 Thread Anshuman Khandual


On 09/29/2020 08:52 PM, Will Deacon wrote:
> On Tue, Sep 29, 2020 at 01:34:24PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 09/29/2020 02:05 AM, Will Deacon wrote:
>>> On Thu, Sep 17, 2020 at 02:16:42PM +0530, Anshuman Khandual wrote:
>>>> During memory hotplug process, the linear mapping should not be created for
>>>> a given memory range if that would fall outside the maximum allowed linear
>>>> range. Else it might cause memory corruption in the kernel virtual space.
>>>>
>>>> Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] accommodating
>>>> both its ends but excluding PAGE_END. Max physical range that can be mapped
>>>> inside this linear mapping range, must also be derived from its end points.
>>>>
>>>> When CONFIG_ARM64_VA_BITS_52 is enabled, PAGE_OFFSET is computed with the
>>>> assumption of 52 bits virtual address space. However, if the CPU does not
>>>> support 52 bits, then it falls back using 48 bits instead and the PAGE_END
>>>> is updated to reflect this using the vabits_actual. As for PAGE_OFFSET,
>>>> bits [51..48] are ignored by the MMU and remain unchanged, even though the
>>>> effective start address of linear map is now slightly different. Hence, to
>>>> reliably check the physical address range mapped by the linear map, the
>>>> start address should be calculated using vabits_actual. This ensures that
>>>> arch_add_memory() validates memory hot add range for its potential linear
>>>> mapping requirement, before creating it with __create_pgd_mapping().
>>>>
>>>> Cc: Catalin Marinas 
>>>> Cc: Will Deacon 
>>>> Cc: Mark Rutland 
>>>> Cc: Ard Biesheuvel 
>>>> Cc: Steven Price 
>>>> Cc: Robin Murphy 
>>>> Cc: David Hildenbrand 
>>>> Cc: Andrew Morton 
>>>> Cc: linux-arm-ker...@lists.infradead.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
>>>> Signed-off-by: Anshuman Khandual 
>>>> ---
>>>>  arch/arm64/mm/mmu.c | 27 +++
>>>>  1 file changed, 27 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index 75df62fea1b6..d59ffabb9c84 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -1433,11 +1433,38 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
>>>> unsigned long start, u64 size)
>>>>free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
>>>>  }
>>>>  
>>>> +static bool inside_linear_region(u64 start, u64 size)
>>>> +{
>>>> +  /*
>>>> +   * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
>>>> +   * accommodating both its ends but excluding PAGE_END. Max physical
>>>> +   * range which can be mapped inside this linear mapping range, must
>>>> +   * also be derived from its end points.
>>>> +   *
>>>> +   * With CONFIG_ARM64_VA_BITS_52 enabled, PAGE_OFFSET is defined with
>>>> +   * the assumption of 52 bits virtual address space. However, if the
>>>> +   * CPU does not support 52 bits, it falls back using 48 bits and the
>>>> +   * PAGE_END is updated to reflect this using the vabits_actual. As
>>>> +   * for PAGE_OFFSET, bits [51..48] are ignored by the MMU and remain
>>>> +   * unchanged, even though the effective start address of linear map
>>>> +   * is now slightly different. Hence, to reliably check the physical
>>>> +   * address range mapped by the linear map, the start address should
>>>> +   * be calculated using vabits_actual.
>>>> +   */
>>>> +  return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
>>>> +  && ((start + size) <= __pa(PAGE_END - 1)));
>>>> +}
>>>
>>> Why isn't this implemented using the existing __is_lm_address()?
>>
>> Not sure, if I understood your suggestion here. The physical address range
>> [start..start + size] needs to be checked against maximum physical range
>> that can be represented inside effective boundaries for the linear mapping
>> i.e [__pa(_PAGE_OFFSET(vabits_actual)..__pa(PAGE_END - 1)].
>>
>> Are you suggesting [start..start + size] should be first be converted into
>> a virtual address range and then checked against __is_lm_addresses() ? But
>> is not deriving th

[RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-09-30 Thread Anshuman Khandual
Add following new vmstat events which will track HugeTLB page migration.

1. HUGETLB_MIGRATION_SUCCESS
2. HUGETLB_MIGRATION_FAILURE

It follows the existing semantics to accommodate HugeTLB subpages in total
page migration statistics. While here, this updates current trace event
'mm_migrate_pages' to accommodate now available HugeTLB based statistics.

Cc: Daniel Jordan 
Cc: Zi Yan 
Cc: John Hubbard 
Cc: Mike Kravetz 
Cc: Oscar Salvador 
Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
Applies on linux-next and v5.9-rc7.

Changes in RFC V2:

- Added the missing hugetlb_retry in the loop per Oscar
- Changed HugeTLB and THP detection sequence per Mike
- Changed nr_subpages fetch from compound_nr() instead per Mike

Changes in RFC V1: (https://patchwork.kernel.org/patch/11799395/)

 include/linux/vm_event_item.h  |  2 ++
 include/trace/events/migrate.h | 13 ++---
 mm/migrate.c   | 48 +-
 mm/vmstat.c|  2 ++
 4 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..d1ddad835c19 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -60,6 +60,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_MIGRATION_SUCCESS,
THP_MIGRATION_FAIL,
THP_MIGRATION_SPLIT,
+   HUGETLB_MIGRATION_SUCCESS,
+   HUGETLB_MIGRATION_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..f8ffb8aece48 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -47,10 +47,11 @@ TRACE_EVENT(mm_migrate_pages,
 
TP_PROTO(unsigned long succeeded, unsigned long failed,
 unsigned long thp_succeeded, unsigned long thp_failed,
-unsigned long thp_split, enum migrate_mode mode, int reason),
+unsigned long thp_split, unsigned long hugetlb_succeeded,
+unsigned long hugetlb_failed, enum migrate_mode mode, int 
reason),
 
TP_ARGS(succeeded, failed, thp_succeeded, thp_failed,
-   thp_split, mode, reason),
+   thp_split, hugetlb_succeeded, hugetlb_failed, mode, reason),
 
TP_STRUCT__entry(
__field(unsigned long,  succeeded)
@@ -58,6 +59,8 @@ TRACE_EVENT(mm_migrate_pages,
__field(unsigned long,  thp_succeeded)
__field(unsigned long,  thp_failed)
__field(unsigned long,  thp_split)
+   __field(unsigned long,  hugetlb_succeeded)
+   __field(unsigned long,  hugetlb_failed)
__field(enum migrate_mode,  mode)
__field(int,reason)
),
@@ -68,16 +71,20 @@ TRACE_EVENT(mm_migrate_pages,
__entry->thp_succeeded  = thp_succeeded;
__entry->thp_failed = thp_failed;
__entry->thp_split  = thp_split;
+   __entry->hugetlb_succeeded  = hugetlb_succeeded;
+   __entry->hugetlb_failed = hugetlb_failed;
__entry->mode   = mode;
__entry->reason = reason;
),
 
-   TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu 
nr_thp_failed=%lu nr_thp_split=%lu mode=%s reason=%s",
+   TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu 
nr_thp_failed=%lu nr_thp_split=%lu nr_hugetlb_succeeded=%lu 
nr_hugetlb_failed=%lu mode=%s reason=%s",
__entry->succeeded,
__entry->failed,
__entry->thp_succeeded,
__entry->thp_failed,
__entry->thp_split,
+   __entry->hugetlb_succeeded,
+   __entry->hugetlb_failed,
__print_symbolic(__entry->mode, MIGRATE_MODE),
__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
diff --git a/mm/migrate.c b/mm/migrate.c
index 5ca5842df5db..0aac9d39778c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1415,13 +1415,17 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
 {
int retry = 1;
int thp_retry = 1;
+   int hugetlb_retry = 1;
int nr_failed = 0;
int nr_succeeded = 0;
int nr_thp_succeeded = 0;
int nr_thp_failed = 0;
int nr_thp_split = 0;
+   int nr_hugetlb_succeeded = 0;
+   int nr_hugetlb_failed = 0;
int pass = 0;
bool is_thp = false;
+   bool is_hugetlb = false;
struct page *page;
struct page *page2;
int s

[PATCH V4 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-09-29 Thread Anshuman Khandual
This enables MEM_OFFLINE memory event handling. It will help intercept any
possible error condition such as boot memory somehow still getting offlined
even after an explicit notifier failure, potentially by a future change in
the generic hotplug framework. This would help detect such scenarios and aid
further debugging. While here, also call out the first section that is being
attempted for offline or that actually got offlined.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 4e70f4fea06c..90a30f5ebfc0 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1482,13 +1482,38 @@ static int prevent_bootmem_remove_notifier(struct 
notifier_block *nb,
unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
unsigned long pfn = arg->start_pfn;
 
-   if (action != MEM_GOING_OFFLINE)
+   if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
return NOTIFY_OK;
 
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   unsigned long start = PFN_PHYS(pfn);
+   unsigned long end = start + (1UL << PA_SECTION_SHIFT);
+
ms = __pfn_to_section(pfn);
-   if (early_section(ms))
+   if (!early_section(ms))
+   continue;
+
+   if (action == MEM_GOING_OFFLINE) {
+   pr_warn("Boot memory [%lx %lx] offlining attempted\n", 
start, end);
return NOTIFY_BAD;
+   } else if (action == MEM_OFFLINE) {
+   /*
+* This should have never happened. Boot memory
+* offlining should have been prevented by this
+* very notifier. Probably some memory removal
+* procedure might have changed which would then
+* require further debug.
+*/
+   pr_err("Boot memory [%lx %lx] offlined\n", start, end);
+
+   /*
+* Core memory hotplug does not process a return
+* code from the notifier for MEM_OFFLINE event.
+* Error condition has been reported. Report as
+* ignored.
+*/
+   return NOTIFY_DONE;
+   }
}
return NOTIFY_OK;
 }
-- 
2.20.1



[PATCH V4 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier

2020-09-29 Thread Anshuman Khandual
This moves memory notifier registration earlier in the boot process from
device_initcall() to early_initcall() which will help in guarding against
potential early boot memory offline requests. Even though there should not
be any actual offlinig requests till memory block devices are initialized
with memory_dev_init() but then generic init sequence might just change in
future. Hence an early registration for the memory event notifier would be
helpful. While here, just skip the registration if CONFIG_MEMORY_HOTREMOVE
is not enabled and also call out when memory notifier registration fails.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Gavin Shan 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..4e70f4fea06c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1499,7 +1499,16 @@ static struct notifier_block prevent_bootmem_remove_nb = 
{
 
 static int __init prevent_bootmem_remove_init(void)
 {
-   return register_memory_notifier(&prevent_bootmem_remove_nb);
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+   return ret;
+
+   ret = register_memory_notifier(&prevent_bootmem_remove_nb);
+   if (ret)
+   pr_err("%s: Notifier registration failed %d\n", __func__, ret);
+
+   return ret;
 }
-device_initcall(prevent_bootmem_remove_init);
+early_initcall(prevent_bootmem_remove_init);
 #endif
-- 
2.20.1



[PATCH V4 3/3] arm64/mm/hotplug: Ensure early memory sections are all online

2020-09-29 Thread Anshuman Khandual
This adds a validation function that scans the entire boot memory and makes
sure that all early memory sections are online. This check is essential for
the memory notifier to work properly, as it cannot prevent any boot memory
from offlining if all sections are not online to begin with. The notifier
registration is skipped if this validation does not go through, although
the boot section scanning itself is enabled only with DEBUG_VM.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 59 +
 1 file changed, 59 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 90a30f5ebfc0..b67a657ea1ad 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1522,6 +1522,62 @@ static struct notifier_block prevent_bootmem_remove_nb = 
{
.notifier_call = prevent_bootmem_remove_notifier,
 };
 
+/*
+ * This ensures that boot memory sections on the platform are online
+ * during early boot. They could not be prevented from being offlined
+ * if for some reason they are not brought online to begin with. This
+ * helps validate the basic assumption on which the above memory event
+ * notifier works to prevent boot memory offlining and its possible
+ * removal.
+ */
+static bool validate_bootmem_online(void)
+{
+   struct memblock_region *mblk;
+   struct mem_section *ms;
+   unsigned long pfn, end_pfn, start, end;
+   bool all_online = true;
+
+   /*
+* Scanning across all memblock might be expensive
+* on some big memory systems. Hence enable this
+* validation only with DEBUG_VM.
+*/
+   if (!IS_ENABLED(CONFIG_DEBUG_VM))
+   return all_online;
+
+   for_each_memblock(memory, mblk) {
+   pfn = PHYS_PFN(mblk->base);
+   end_pfn = PHYS_PFN(mblk->base + mblk->size);
+
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+
+   /*
+* All memory ranges in the system at this point
+* should have been marked early sections.
+*/
+   WARN_ON(!early_section(ms));
+
+   /*
+* Memory notifier mechanism here to prevent boot
+* memory offlining depends on the fact that each
+    * early section memory on the system is initially
+* online. Otherwise a given memory section which
+* is already offline will be overlooked and can
+* be removed completely. Call out such sections.
+*/
+   if (!online_section(ms)) {
+   start = PFN_PHYS(pfn);
+   end = start + (1UL << PA_SECTION_SHIFT);
+   pr_err("Memory range [%lx %lx] is offline\n", 
start, end);
+   pr_err("Memory range [%lx %lx] can be 
removed\n", start, end);
+   all_online = false;
+   }
+   }
+   }
+   return all_online;
+}
+
 static int __init prevent_bootmem_remove_init(void)
 {
int ret = 0;
@@ -1529,6 +1585,9 @@ static int __init prevent_bootmem_remove_init(void)
if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return ret;
 
+   if (!validate_bootmem_online())
+   return -EINVAL;
+
	ret = register_memory_notifier(&prevent_bootmem_remove_nb);
if (ret)
pr_err("%s: Notifier registration failed %d\n", __func__, ret);
-- 
2.20.1



[PATCH V4 0/3] arm64/mm/hotplug: Improve memory offline event notifier

2020-09-29 Thread Anshuman Khandual
This series brings three different changes to the only memory event notifier on
the arm64 platform. These changes improve its robustness while also enhancing debug
capabilities during potential memory offlining error conditions.

This applies on 5.9-rc7

Changes in V4:

- Dropped additional return in prevent_bootmem_remove_init() per Gavin
- Rearranged memory section loop in prevent_bootmem_remove_notifier() per Gavin
- Call out boot memory ranges for attempted offline or offline events

Changes in V3: 
(https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=352717)

- Split the single patch into three patch series per Catalin
- Trigger changed from setup_arch() to early_initcall() per Catalin
- Renamed back memory_hotremove_notifier() as prevent_bootmem_remove_init()
- validate_bootmem_online() is now called from prevent_bootmem_remove_init() 
per Catalin
- Skip registering the notifier if validate_bootmem_online() returns negative

Changes in V2: (https://patchwork.kernel.org/patch/11732161/)

- Dropped all generic changes wrt MEM_CANCEL_OFFLINE reasons enumeration
- Dropped all related (processing MEM_CANCEL_OFFLINE reasons) changes on arm64
- Added validate_boot_mem_online_state() that gets called with early_initcall()
- Added CONFIG_MEMORY_HOTREMOVE check before registering memory notifier
- Moved notifier registration i.e memory_hotremove_notifier into setup_arch()

Changes in V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=271237)

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: Gavin Shan 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (3):
  arm64/mm/hotplug: Register boot memory hot remove notifier earlier
  arm64/mm/hotplug: Enable MEM_OFFLINE event handling
  arm64/mm/hotplug: Ensure early memory sections are all online

 arch/arm64/mm/mmu.c | 101 ++--
 1 file changed, 97 insertions(+), 4 deletions(-)

-- 
2.20.1



Re: [PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-09-29 Thread Anshuman Khandual



On 09/29/2020 02:05 AM, Will Deacon wrote:
> On Thu, Sep 17, 2020 at 02:16:42PM +0530, Anshuman Khandual wrote:
>> During memory hotplug process, the linear mapping should not be created for
>> a given memory range if that would fall outside the maximum allowed linear
>> range. Else it might cause memory corruption in the kernel virtual space.
>>
>> Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] accommodating
>> both its ends but excluding PAGE_END. Max physical range that can be mapped
>> inside this linear mapping range, must also be derived from its end points.
>>
>> When CONFIG_ARM64_VA_BITS_52 is enabled, PAGE_OFFSET is computed with the
>> assumption of 52 bits virtual address space. However, if the CPU does not
>> support 52 bits, then it falls back using 48 bits instead and the PAGE_END
>> is updated to reflect this using the vabits_actual. As for PAGE_OFFSET,
>> bits [51..48] are ignored by the MMU and remain unchanged, even though the
>> effective start address of linear map is now slightly different. Hence, to
>> reliably check the physical address range mapped by the linear map, the
>> start address should be calculated using vabits_actual. This ensures that
>> arch_add_memory() validates memory hot add range for its potential linear
>> mapping requirement, before creating it with __create_pgd_mapping().
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Ard Biesheuvel 
>> Cc: Steven Price 
>> Cc: Robin Murphy 
>> Cc: David Hildenbrand 
>> Cc: Andrew Morton 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  arch/arm64/mm/mmu.c | 27 +++
>>  1 file changed, 27 insertions(+)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 75df62fea1b6..d59ffabb9c84 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1433,11 +1433,38 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
>> unsigned long start, u64 size)
>>  free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
>>  }
>>  
>> +static bool inside_linear_region(u64 start, u64 size)
>> +{
>> +/*
>> + * Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
>> + * accommodating both its ends but excluding PAGE_END. Max physical
>> + * range which can be mapped inside this linear mapping range, must
>> + * also be derived from its end points.
>> + *
>> + * With CONFIG_ARM64_VA_BITS_52 enabled, PAGE_OFFSET is defined with
>> + * the assumption of 52 bits virtual address space. However, if the
>> + * CPU does not support 52 bits, it falls back using 48 bits and the
>> + * PAGE_END is updated to reflect this using the vabits_actual. As
>> + * for PAGE_OFFSET, bits [51..48] are ignored by the MMU and remain
>> + * unchanged, even though the effective start address of linear map
>> + * is now slightly different. Hence, to reliably check the physical
>> + * address range mapped by the linear map, the start address should
>> + * be calculated using vabits_actual.
>> + */
>> +return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
>> +&& ((start + size) <= __pa(PAGE_END - 1)));
>> +}
> 
> Why isn't this implemented using the existing __is_lm_address()?

Not sure if I understood your suggestion here. The physical address range
[start..start + size] needs to be checked against maximum physical range
that can be represented inside effective boundaries for the linear mapping
i.e [__pa(_PAGE_OFFSET(vabits_actual)..__pa(PAGE_END - 1)].

Are you suggesting [start..start + size] should first be converted into
a virtual address range and then checked against __is_lm_address() ? But
is not deriving the physical range from the known limits of the linear
mapping much cleaner ?
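
For what it is worth, a rough sketch of that alternative (assuming
__is_lm_address() takes a kernel virtual address) would look like the below,
though the physical range comparison above still seems more direct:

	static bool inside_linear_region(u64 start, u64 size)
	{
		/*
		 * Check both inclusive end points of the physical range
		 * after translating them back into the linear map.
		 */
		return __is_lm_address((unsigned long)__va(start)) &&
		       __is_lm_address((unsigned long)__va(start + size - 1));
	}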


Re: [RFC] mm/vmstat: Add events for HugeTLB migration

2020-09-29 Thread Anshuman Khandual



On 09/29/2020 03:34 AM, Mike Kravetz wrote:
> On 9/25/20 2:12 AM, Anshuman Khandual wrote:
>> Add following new vmstat events which will track HugeTLB page migration.
>>
>> 1. HUGETLB_MIGRATION_SUCCESS
>> 2. HUGETLB_MIGRATION_FAILURE
>>
>> It follows the existing semantics to accommodate HugeTLB subpages in total
>> page migration statistics. While here, this updates current trace event
>> "mm_migrate_pages" to accommodate now available HugeTLB based statistics.
> 
> Thanks.  This makes sense with recent THP changes.
> 
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 3ab965f83029..d53dd101 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1415,13 +1415,17 @@ int migrate_pages(struct list_head *from, new_page_t 
>> get_new_page,
>>  {
>>  int retry = 1;
>>  int thp_retry = 1;
>> +int hugetlb_retry = 1;
>>  int nr_failed = 0;
>>  int nr_succeeded = 0;
>>  int nr_thp_succeeded = 0;
>>  int nr_thp_failed = 0;
>>  int nr_thp_split = 0;
>> +int nr_hugetlb_succeeded = 0;
>> +int nr_hugetlb_failed = 0;
>>  int pass = 0;
>>  bool is_thp = false;
>> +bool is_hugetlb = false;
>>  struct page *page;
>>  struct page *page2;
>>  int swapwrite = current->flags & PF_SWAPWRITE;
>> @@ -1433,6 +1437,7 @@ int migrate_pages(struct list_head *from, new_page_t 
>> get_new_page,
>>  for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
>>  retry = 0;
>>  thp_retry = 0;
>> +hugetlb_retry = 0;
>>  
>>  list_for_each_entry_safe(page, page2, from, lru) {
>>  retry:
>> @@ -1442,7 +1447,12 @@ int migrate_pages(struct list_head *from, new_page_t 
>> get_new_page,
>>   * during migration.
>>   */
>>  is_thp = PageTransHuge(page) && !PageHuge(page);
>> +is_hugetlb = PageTransHuge(page) && PageHuge(page);
> 
> PageHuge does not depend on PageTransHuge.  So, this could just be
>   is_hugetlb = PageHuge(page);

Sure.

> 
> Actually, the current version of PageHuge is more expensive than 
> PageTransHuge.
> So, the most optimal way to set these would be something like.
>   if (PageTransHuge(page))
>   if (PageHuge(page))
>   is_hugetlb = true;
>   else
>   is_thp = true;
> 
> Although, the compiler may be able to optimize.  I did not check.

Both is_hugetlb and is_thp need to have either a true or false value
during each iteration as they are not getting reset otherwise. Hence
basically it should either be

is_thp = PageTransHuge(page) && !PageHuge(page);
is_hugetlb = PageHuge(page);

OR

is_hugetlb = false;
is_thp = false;
if (PageTransHuge(page)) {
	if (PageHuge(page))
		is_hugetlb = true;
	else
		is_thp = true;
}

> 
>> +
>>  nr_subpages = thp_nr_pages(page);
>> +if (is_hugetlb)
>> +nr_subpages = 
>> pages_per_huge_page(page_hstate(page));
> 
> Can we just use compound_order() here for all cases?

Sure, but we could also directly use compound_nr().
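
i.e. something like this (sketch only), which covers base pages, THP and
HugeTLB alike:

	/* compound_nr() returns 1 for a base page, else the number of subpages */
	nr_subpages = compound_nr(page);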


Re: [RFC] mm/vmstat: Add events for HugeTLB migration

2020-09-27 Thread Anshuman Khandual


On 09/25/2020 03:19 PM, Oscar Salvador wrote:
> On Fri, Sep 25, 2020 at 02:42:29PM +0530, Anshuman Khandual wrote:
>> Add following new vmstat events which will track HugeTLB page migration.
>>
>> 1. HUGETLB_MIGRATION_SUCCESS
>> 2. HUGETLB_MIGRATION_FAILURE
>>
>> It follows the existing semantics to accommodate HugeTLB subpages in total
>> page migration statistics. While here, this updates current trace event
>> "mm_migrate_pages" to accommodate now available HugeTLB based statistics.
>>
>> Cc: Daniel Jordan 
>> Cc: Zi Yan 
>> Cc: John Hubbard 
>> Cc: Mike Kravetz 
>> Cc: Andrew Morton 
>> Cc: linux...@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
> 
> Was this inspired by some usecase/debugging or just to follow THP's example?

Currently HugeTLB migration events get accommodated in PGMIGRATE_SUCCESS and
PGMIGRATE_FAIL event stats as normal single page instances. Previously this
might have just seemed okay as a HugeTLB page could be viewed as a single page
entity, even though it was not fully accurate as PGMIGRATE_[SUCCESS|FAILURE]
tracked statistics in terms of normal base pages.

But tracking HugeTLB pages as single pages does not make sense any more, now
that THP pages are accounted for properly. This would complete the revamped
page migration accounting where PGMIGRATE_[SUCCESS|FAILURE] will track entire
page migration events in terms of normal base pages and THP_*/HUGETLB_* will
track specialized events when applicable.

> 
>>  int retry = 1;
>>  int thp_retry = 1;
>> +int hugetlb_retry = 1;
>>  int nr_failed = 0;
>>  int nr_succeeded = 0;
>>  int nr_thp_succeeded = 0;
>>  int nr_thp_failed = 0;
>>  int nr_thp_split = 0;
>> +int nr_hugetlb_succeeded = 0;
>> +int nr_hugetlb_failed = 0;
>>  int pass = 0;
>>  bool is_thp = false;
>> +bool is_hugetlb = false;
>>  struct page *page;
>>  struct page *page2;
>>  int swapwrite = current->flags & PF_SWAPWRITE;
>> @@ -1433,6 +1437,7 @@ int migrate_pages(struct list_head *from, new_page_t 
>> get_new_page,
>>  for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
> 
> Should you not have put hugetlb_retry within the loop as well?
> Otherwise we might not rety for hugetlb pages now?
> 

Right, will fix it.
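
i.e. roughly the following, matching what the RFC V2 changelog later notes
(sketch only):

	/* Include hugetlb_retry in the retry condition and reset it per pass */
	for (pass = 0; pass < 10 && (retry || thp_retry || hugetlb_retry); pass++) {
		retry = 0;
		thp_retry = 0;
		hugetlb_retry = 0;
		/* ... per-page migration loop unchanged ... */
	}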


[RFC] mm/vmstat: Add events for HugeTLB migration

2020-09-25 Thread Anshuman Khandual
Add following new vmstat events which will track HugeTLB page migration.

1. HUGETLB_MIGRATION_SUCCESS
2. HUGETLB_MIGRATION_FAILURE

It follows the existing semantics to accommodate HugeTLB subpages in total
page migration statistics. While here, this updates current trace event
"mm_migrate_pages" to accommodate now available HugeTLB based statistics.

Cc: Daniel Jordan 
Cc: Zi Yan 
Cc: John Hubbard 
Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This is just for getting some early feedbacks. Applies on linux-next and
lightly tested for THP and HugeTLB migrations.

 include/linux/vm_event_item.h  |  2 ++
 include/trace/events/migrate.h | 13 +---
 mm/migrate.c   | 37 --
 mm/vmstat.c|  2 ++
 4 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..d1ddad835c19 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -60,6 +60,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_MIGRATION_SUCCESS,
THP_MIGRATION_FAIL,
THP_MIGRATION_SPLIT,
+   HUGETLB_MIGRATION_SUCCESS,
+   HUGETLB_MIGRATION_FAIL,
 #endif
 #ifdef CONFIG_COMPACTION
COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 4d434398d64d..f8ffb8aece48 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -47,10 +47,11 @@ TRACE_EVENT(mm_migrate_pages,
 
TP_PROTO(unsigned long succeeded, unsigned long failed,
 unsigned long thp_succeeded, unsigned long thp_failed,
-unsigned long thp_split, enum migrate_mode mode, int reason),
+unsigned long thp_split, unsigned long hugetlb_succeeded,
+unsigned long hugetlb_failed, enum migrate_mode mode, int 
reason),
 
TP_ARGS(succeeded, failed, thp_succeeded, thp_failed,
-   thp_split, mode, reason),
+   thp_split, hugetlb_succeeded, hugetlb_failed, mode, reason),
 
TP_STRUCT__entry(
__field(unsigned long,  succeeded)
@@ -58,6 +59,8 @@ TRACE_EVENT(mm_migrate_pages,
__field(unsigned long,  thp_succeeded)
__field(unsigned long,  thp_failed)
__field(unsigned long,  thp_split)
+   __field(unsigned long,  hugetlb_succeeded)
+   __field(unsigned long,  hugetlb_failed)
__field(enum migrate_mode,  mode)
__field(int,reason)
),
@@ -68,16 +71,20 @@ TRACE_EVENT(mm_migrate_pages,
__entry->thp_succeeded  = thp_succeeded;
__entry->thp_failed = thp_failed;
__entry->thp_split  = thp_split;
+   __entry->hugetlb_succeeded  = hugetlb_succeeded;
+   __entry->hugetlb_failed = hugetlb_failed;
__entry->mode   = mode;
__entry->reason = reason;
),
 
-   TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu 
nr_thp_failed=%lu nr_thp_split=%lu mode=%s reason=%s",
+   TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu 
nr_thp_failed=%lu nr_thp_split=%lu nr_hugetlb_succeeded=%lu 
nr_hugetlb_failed=%lu mode=%s reason=%s",
__entry->succeeded,
__entry->failed,
__entry->thp_succeeded,
__entry->thp_failed,
__entry->thp_split,
+   __entry->hugetlb_succeeded,
+   __entry->hugetlb_failed,
__print_symbolic(__entry->mode, MIGRATE_MODE),
__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
diff --git a/mm/migrate.c b/mm/migrate.c
index 3ab965f83029..d53dd101 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1415,13 +1415,17 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
 {
int retry = 1;
int thp_retry = 1;
+   int hugetlb_retry = 1;
int nr_failed = 0;
int nr_succeeded = 0;
int nr_thp_succeeded = 0;
int nr_thp_failed = 0;
int nr_thp_split = 0;
+   int nr_hugetlb_succeeded = 0;
+   int nr_hugetlb_failed = 0;
int pass = 0;
bool is_thp = false;
+   bool is_hugetlb = false;
struct page *page;
struct page *page2;
int swapwrite = current->flags & PF_SWAPWRITE;
@@ -1433,6 +1437,7 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
for (pass = 0; pass < 10 &am

Re: [PATCH V3 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-09-23 Thread Anshuman Khandual



On 09/23/2020 12:01 PM, Gavin Shan wrote:
> Hi Anshuman,
> 
> On 9/21/20 10:05 PM, Anshuman Khandual wrote:
>> This enables MEM_OFFLINE memory event handling. It will help intercept any
>> possible error condition such as if boot memory some how still got offlined
>> even after an explicit notifier failure, potentially by a future change in
>> generic hot plug framework. This would help detect such scenarios and help
>> debug further.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Marc Zyngier 
>> Cc: Steve Capper 
>> Cc: Mark Brown 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
> 
> I'm not sure if it makes sense since MEM_OFFLINE won't be triggered
> after NOTIFY_BAD is returned from MEM_GOING_OFFLINE. NOTIFY_BAD means
> the whole offline process is stopped. It would be guranteed by generic
> framework from syntax standpoint.

Right but the intent here is to catch any deviation in generic hotplug
semantics going forward.
 > 
> However, this looks good if MEM_OFFLINE is triggered without calling
> into MEM_GOING_OFFLINE previously, but it would be a bug from generic
> framework.

Exactly, this will just ensure that we know about any change or a bug
in the generic framework. But if required, this additional check can
be enabled only with DEBUG_VM.

> 
>>   arch/arm64/mm/mmu.c | 37 -
>>   1 file changed, 32 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index df3b7415b128..6b171bd88bcf 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct 
>> notifier_block *nb,
>>   unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
>>   unsigned long pfn = arg->start_pfn;
>>   -    if (action != MEM_GOING_OFFLINE)
>> +    if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
>>   return NOTIFY_OK;
>>   -    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> -    ms = __pfn_to_section(pfn);
>> -    if (early_section(ms))
>> -    return NOTIFY_BAD;
>> +    if (action == MEM_GOING_OFFLINE) {
>> +    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> +    ms = __pfn_to_section(pfn);
>> +    if (early_section(ms)) {
>> +    pr_warn("Boot memory offlining attempted\n");
>> +    return NOTIFY_BAD;
>> +    }
>> +    }
>> +    } else if (action == MEM_OFFLINE) {
>> +    for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> +    ms = __pfn_to_section(pfn);
>> +    if (early_section(ms)) {
>> +
>> +    /*
>> + * This should have never happened. Boot memory
>> + * offlining should have been prevented by this
>> + * very notifier. Probably some memory removal
>> + * procedure might have changed which would then
>> + * require further debug.
>> + */
>> +    pr_err("Boot memory offlined\n");
>> +
>> +    /*
>> + * Core memory hotplug does not process a return
>> + * code from the notifier for MEM_OFFLINE event.
>> + * Error condition has been reported. Report as
>> + * ignored.
>> + */
>> +    return NOTIFY_DONE;
>> +    }
>> +    }
>>   }
>>   return NOTIFY_OK;
>>   }
>>
> 
> It's pretty much irrelevant comment if the patch doesn't make sense:
> the logical block for MEM_GOING_OFFLINE would be reused by MEM_OFFLINE
> as they looks similar except the return value and error message :)

This can be reorganized in the above mentioned format as well. Without
much additional code or iteration, it might not need DEBUG_VM as well.

for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
	ms = __pfn_to_section(pfn);
	if (!early_section(ms))
		continue;

	if (action == MEM_GOING_OFFLINE) {
		pr_warn("Boot memory offlining attempted\n");
		return NOTIFY_BAD;
	} else if (action == MEM_OFFLINE) {
		pr_err("Boot memory offlined\n");
		return NOTIFY_DONE;
	}
}
return NOTIFY_OK;


Re: [PATCH V3 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier

2020-09-23 Thread Anshuman Khandual



On 09/23/2020 11:34 AM, Gavin Shan wrote:
> Hi Anshuman,
> 
> On 9/21/20 10:05 PM, Anshuman Khandual wrote:
>> This moves memory notifier registration earlier in the boot process from
>> device_initcall() to early_initcall() which will help in guarding against
>> potential early boot memory offline requests. Even though there should not
>> be any actual offlinig requests till memory block devices are initialized
>> with memory_dev_init() but then generic init sequence might just change in
>> future. Hence an early registration for the memory event notifier would be
>> helpful. While here, just skip the registration if CONFIG_MEMORY_HOTREMOVE
>> is not enabled and also call out when memory notifier registration fails.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: Marc Zyngier 
>> Cc: Steve Capper 
>> Cc: Mark Brown 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   arch/arm64/mm/mmu.c | 14 --
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
> 
> With the following nit-picky comments resolved:
> 
> Reviewed-by: Gavin Shan 
> 
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 75df62fea1b6..df3b7415b128 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1499,7 +1499,17 @@ static struct notifier_block 
>> prevent_bootmem_remove_nb = {
>>     static int __init prevent_bootmem_remove_init(void)
>>   {
>> -    return register_memory_notifier(&prevent_bootmem_remove_nb);
>> +    int ret = 0;
>> +
>> +    if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
>> +    return ret;
>> +
>> +    ret = register_memory_notifier(&prevent_bootmem_remove_nb);
>> +    if (!ret)
>> +    return ret;
>> +
>> +    pr_err("Notifier registration failed - boot memory can be removed\n");
>> +    return ret;
>>   }
> 
> It might be cleaner if the duplicated return statements can be
> avoided. Besides, it's always nice to print the errno even though

Thought about it, just that the error message was too long.

> zero is always returned from register_memory_notifier(). So I guess
> you probably need something like below:
> 
>     ret = register_memory_notifier(&prevent_bootmem_remove_nb);
>     if (ret)
>     pr_err("%s: Error %d registering notifier\n", __func__, ret)
> 
>     return ret;

Sure, will do.

> 
> 
> register_memory_notifier   # 0 is returned on 
> !CONFIG_MEMORY_HOTPLUG_SPARSE
>    blocking_notifier_chain_register
>   notifier_chain_register  # 0 is always returned
>  
>> -device_initcall(prevent_bootmem_remove_init);
>> +early_initcall(prevent_bootmem_remove_init);
>>   #endif
>>
> 
> Cheers,
> Gavin
> 
> 


[PATCH] mm/debug_vm_pgtable: Drop hugetlb_advanced_tests()

2020-09-23 Thread Anshuman Khandual
hugetlb_advanced_tests() has now stopped working on the i386 platform due to
some recent changes with respect to the page table lock. The test never worked
on the ppc64 platform, which resulted in disabling it selectively there. Let's
just drop hugetlb_advanced_tests() for now and in the process free the entire
test from the only platform specific test execution path.

https://lore.kernel.org/lkml/289c3fdb-1394-c1af-bdc4-554290708...@linux.ibm.com/#t

CC: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Aneesh Kumar K.V 
Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This applies on current linux-next (20200923).

 mm/debug_vm_pgtable.c | 55 ---
 1 file changed, 55 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b0edca2e2c73..c5ae822cc6bc 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -811,59 +811,8 @@ static void __init hugetlb_basic_tests(unsigned long pfn, 
pgprot_t prot)
WARN_ON(!pte_huge(pte_mkhuge(pte)));
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 }
-
-#ifndef CONFIG_PPC_BOOK3S_64
-static void __init hugetlb_advanced_tests(struct mm_struct *mm,
- struct vm_area_struct *vma,
- pte_t *ptep, unsigned long pfn,
- unsigned long vaddr, pgprot_t prot)
-{
-   struct page *page = pfn_to_page(pfn);
-   pte_t pte = ptep_get(ptep);
-   unsigned long paddr = __pfn_to_phys(pfn) & PMD_MASK;
-
-   pr_debug("Validating HugeTLB advanced\n");
-   pte = pte_mkhuge(mk_pte(pfn_to_page(PHYS_PFN(paddr)), prot));
-   set_huge_pte_at(mm, vaddr, ptep, pte);
-   barrier();
-   WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
-   huge_pte_clear(mm, vaddr, ptep, PMD_SIZE);
-   pte = huge_ptep_get(ptep);
-   WARN_ON(!huge_pte_none(pte));
-
-   pte = mk_huge_pte(page, prot);
-   set_huge_pte_at(mm, vaddr, ptep, pte);
-   barrier();
-   huge_ptep_set_wrprotect(mm, vaddr, ptep);
-   pte = huge_ptep_get(ptep);
-   WARN_ON(huge_pte_write(pte));
-
-   pte = mk_huge_pte(page, prot);
-   set_huge_pte_at(mm, vaddr, ptep, pte);
-   barrier();
-   huge_ptep_get_and_clear(mm, vaddr, ptep);
-   pte = huge_ptep_get(ptep);
-   WARN_ON(!huge_pte_none(pte));
-
-   pte = mk_huge_pte(page, prot);
-   pte = huge_pte_wrprotect(pte);
-   set_huge_pte_at(mm, vaddr, ptep, pte);
-   barrier();
-   pte = huge_pte_mkwrite(pte);
-   pte = huge_pte_mkdirty(pte);
-   huge_ptep_set_access_flags(vma, vaddr, ptep, pte, 1);
-   pte = huge_ptep_get(ptep);
-   WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
-}
-#endif
 #else  /* !CONFIG_HUGETLB_PAGE */
 static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
-static void __init hugetlb_advanced_tests(struct mm_struct *mm,
- struct vm_area_struct *vma,
- pte_t *ptep, unsigned long pfn,
- unsigned long vaddr, pgprot_t prot)
-{
-}
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -1073,10 +1022,6 @@ static int __init debug_vm_pgtable(void)
pud_populate_tests(mm, pudp, saved_pmdp);
spin_unlock(ptl);
 
-#ifndef CONFIG_PPC_BOOK3S_64
-   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-#endif
-
	spin_lock(&mm->page_table_lock);
p4d_clear_tests(mm, p4dp);
pgd_clear_tests(mm, pgdp);
-- 
2.20.1



Re: [PATCH V3 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-09-22 Thread Anshuman Khandual



On 09/21/2020 05:35 PM, Anshuman Khandual wrote:
> This enables MEM_OFFLINE memory event handling. It will help intercept any
> possible error condition such as if boot memory some how still got offlined
> even after an explicit notifier failure, potentially by a future change in
> generic hot plug framework. This would help detect such scenarios and help
> debug further.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Marc Zyngier 
> Cc: Steve Capper 
> Cc: Mark Brown 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/mm/mmu.c | 37 -
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index df3b7415b128..6b171bd88bcf 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct 
> notifier_block *nb,
>   unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
>   unsigned long pfn = arg->start_pfn;
>  
> - if (action != MEM_GOING_OFFLINE)
> + if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
>   return NOTIFY_OK;
>  
> - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> - ms = __pfn_to_section(pfn);
> - if (early_section(ms))
> - return NOTIFY_BAD;
> + if (action == MEM_GOING_OFFLINE) {
> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + ms = __pfn_to_section(pfn);
> + if (early_section(ms)) {
> + pr_warn("Boot memory offlining attempted\n");
> + return NOTIFY_BAD;
> + }
> + }
> + } else if (action == MEM_OFFLINE) {
> + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + ms = __pfn_to_section(pfn);
> + if (early_section(ms)) {
> +
> + /*
> +  * This should have never happened. Boot memory
> +  * offlining should have been prevented by this
> +  * very notifier. Probably some memory removal
> +  * procedure might have changed which would then
> +  * require further debug.
> +  */
> + pr_err("Boot memory offlined\n");

It returns on the first instance where a section inside the offline
range happens to be part of the boot memory. So wondering whether it
would be better to call out the entire attempted offline range here,
or just the first section inside it which overlaps with boot memory ?
Either way, some range information here will be helpful.
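
Something along these lines inside the section loop, calling out at least the
first overlapping section, which is roughly what the next respin does (sketch
only):

	unsigned long start = PFN_PHYS(pfn);
	unsigned long end = start + (1UL << PA_SECTION_SHIFT);

	pr_err("Boot memory [%lx %lx] offlined\n", start, end);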


Re: [mm/debug_vm_pgtable/locks] e2aad6f1d2: BUG:unable_to_handle_page_fault_for_address

2020-09-22 Thread Anshuman Khandual



On 09/22/2020 02:50 PM, Aneesh Kumar K.V wrote:
> On 9/22/20 2:22 PM, Anshuman Khandual wrote:
>>
>>
>> On 09/22/2020 09:33 AM, Aneesh Kumar K.V wrote:
>>> On 9/21/20 2:51 PM, kernel test robot wrote:
>>>> Greeting,
>>>>
>>>> FYI, we noticed the following commit (built with gcc-9):
>>>>
>>>> commit: e2aad6f1d232b457ea6a3194992dd4c0a83534a5 
>>>> ("mm/debug_vm_pgtable/locks: take correct page table lock")
>>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>>
>>>>
>>>> in testcase: trinity
>>>> version: trinity-i386
>>>> with following parameters:
>>>>
>>>>     runtime: 300s
>>>>
>>>> test-description: Trinity is a linux system call fuzz tester.
>>>> test-url: http://codemonkey.org.uk/projects/trinity/
>>>>
>>>>
>>>> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
>>>>
>>>> caused below changes (please refer to attached dmesg/kmsg for entire 
>>>> log/backtrace):
>>>>
>>>>
>>>> +--+++
>>>> |                                                                       | c50eb1ed65 | e2aad6f1d2 |
>>>> +--+++
>>>> | boot_successes                                                        | 0          | 0          |
>>>> | boot_failures                                                         | 61         | 17         |
>>>> | BUG:workqueue_lockup-pool                                             | 1          |            |
>>>> | BUG:sleeping_function_called_from_invalid_context_at_mm/page_alloc.c | 60         | 17         |
>>>> | BUG:unable_to_handle_page_fault_for_address                          | 0          | 17         |
>>>> | Oops:#[##]                                                            | 0          | 17         |
>>>> | EIP:ptep_get                                                          | 0          | 17         |
>>>> | Kernel_panic-not_syncing:Fatal_exception                             | 0          | 17         |
>>>> +--+++
>>>>
>>>>
>>>> If you fix the issue, kindly add following tag
>>>> Reported-by: kernel test robot 
>>>>
>>>>
>>>> [   28.726464] BUG: sleeping function called from invalid context at 
>>>> mm/page_alloc.c:4822
>>>> [   28.727835] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 
>>>> 1, name: swapper
>>>> [   28.729221] no locks held by swapper/1.
>>>> [   28.729954] CPU: 0 PID: 1 Comm: swapper Not tainted 
>>>> 5.9.0-rc3-00324-ge2aad6f1d232b4 #1
>>>> [   28.731484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
>>>> BIOS 1.12.0-1 04/01/2014
>>>> [   28.732891] Call Trace:
>>>> [   28.733295]  ? show_stack+0x48/0x50
>>>> [   28.733943]  dump_stack+0x1b/0x1d
>>>> [   28.734569]  _

Re: [mm/debug_vm_pgtable/locks] e2aad6f1d2: BUG:unable_to_handle_page_fault_for_address

2020-09-22 Thread Anshuman Khandual



On 09/22/2020 09:33 AM, Aneesh Kumar K.V wrote:
> On 9/21/20 2:51 PM, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed the following commit (built with gcc-9):
>>
>> commit: e2aad6f1d232b457ea6a3194992dd4c0a83534a5 
>> ("mm/debug_vm_pgtable/locks: take correct page table lock")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>>
>> in testcase: trinity
>> version: trinity-i386
>> with following parameters:
>>
>> runtime: 300s
>>
>> test-description: Trinity is a linux system call fuzz tester.
>> test-url: http://codemonkey.org.uk/projects/trinity/
>>
>>
>> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
>>
>> caused below changes (please refer to attached dmesg/kmsg for entire 
>> log/backtrace):
>>
>>
>> +--+++
>> |  | 
>> c50eb1ed65 | e2aad6f1d2 |
>> +--+++
>> | boot_successes   | 0   
>>    | 0  |
>> | boot_failures    | 61  
>>    | 17 |
>> | BUG:workqueue_lockup-pool    | 1   
>>    |    |
>> | BUG:sleeping_function_called_from_invalid_context_at_mm/page_alloc.c | 60  
>>    | 17 |
>> | BUG:unable_to_handle_page_fault_for_address  | 0   
>>    | 17 |
>> | Oops:#[##]   | 0   
>>    | 17 |
>> | EIP:ptep_get | 0   
>>    | 17 |
>> | Kernel_panic-not_syncing:Fatal_exception | 0   
>>    | 17 |
>> +--+++
>>
>>
>> If you fix the issue, kindly add following tag
>> Reported-by: kernel test robot 
>>
>>
>> [   28.726464] BUG: sleeping function called from invalid context at 
>> mm/page_alloc.c:4822
>> [   28.727835] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, 
>> name: swapper
>> [   28.729221] no locks held by swapper/1.
>> [   28.729954] CPU: 0 PID: 1 Comm: swapper Not tainted 
>> 5.9.0-rc3-00324-ge2aad6f1d232b4 #1
>> [   28.731484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.12.0-1 04/01/2014
>> [   28.732891] Call Trace:
>> [   28.733295]  ? show_stack+0x48/0x50
>> [   28.733943]  dump_stack+0x1b/0x1d
>> [   28.734569]  ___might_sleep+0x205/0x219
>> [   28.735292]  __might_sleep+0x106/0x10f
>> [   28.736022]  __alloc_pages_nodemask+0xe0/0x2c8
>> [   28.736845]  swap_migration_tests+0x62/0x295
>> [   28.737639]  debug_vm_pgtable+0x587/0x9b5
>> [   28.738374]  ? pte_advanced_tests+0x267/0x267
>> [   28.739318]  do_one_initcall+0x129/0x31c
>> [   28.740023]  ? rcu_read_lock_sched_held+0x46/0x74
>> [   28.740944]  kernel_init_freeable+0x201/0x250
>> [   28.741763]  ? rest_init+0xf8/0xf8
>> [   28.742401]  kernel_init+0xe/0x15d
>> [   28.743040]  ? rest_init+0xf8/0xf8
>> [   28.743694]  ret_from_fork+0x1c/0x30
> 
> 
> This should be fixed by
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/mm/debug_vm_pgtable.c?id=3a4f9a45eadb6ed5fc04686e8db4dc7bb1caec44
> 
>> [   28.744364] BUG: unable to handle page fault for address: fffbbea4
>> [   28.745465] #PF: supervisor read access in kernel mode
>> [   28.746373] #PF: error_code(0x) - not-present page
>> [   28.747275] *pde = 0492b067 *pte = 
>> [   28.748054] Oops:  [#1]
>> [   28.748548] CPU: 0 PID: 1 Comm: swapper Tainted: G    W 
>> 5.9.0-rc3-00324-ge2aad6f1d232b4 #1
>> [   28.750188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.12.0-1 04/01/2014
>> [   28.751641] EIP: ptep_get+0x0/0x3
>> [   28.752226] Code: 5d fc c9 c3 55 c1 e8 1a 89 e5 53 31 db 83 f8 1f 6a 00 
>> 0f 94 c3 b8 80 67 02 c4 31 c9 89 da e8 16 5c f1 ff 89 d8 8b 5d fc c9 c3 <8b> 
>> 00 c3 55 31 c9 89 e5 57 56 53 8b 70 04 89 c3 b8 10 68 02 c4 6a
>> [   28.755465] EAX: fffbbea4 EBX: fffbbea4 ECX: 47bd EDX: fffbbea4
>> [   28.756418] ESI: 47bd EDI: 0025 EBP: f406bed8 ESP: f406bebc
>> [   28.757522] DS: 007b ES: 007b FS:  GS:  SS: 0068 EFLAGS: 00010286
>> [   28.758739] CR0: 80050033 CR2: fffbbea4 CR3: 04928000 CR4: 000406d0
>> [   28.759828] Call Trace:
>> [   28.760235]  ? hugetlb_advanced_tests+0x2a/0x27f
>> [   28.761099]  ? do_raw_spin_unlock+0xd7/0x112
>> [   28.761872]  debug_vm_pgtable+0x927/0x9b5
>> [   28.762578]  ? pte_advanced_tests+0x267/0x267
>> [   28.763462]  do_one_initcall+0x129/0x31c
>> [   28.764134]  ? rcu_read_lock_sched_held+0x46/0x74
>> [   28.764948]  kernel_init_freeable+0x201/0x250
>> [   28.765654]  ? rest_init+0xf8/0xf8
>> [   28.766277]  

Re: [PATCH 2/2] arm64/mm: Enable color zero pages

2020-09-21 Thread Anshuman Khandual



On 09/21/2020 08:26 AM, Gavin Shan wrote:
> Hi Robin,
> 
> On 9/17/20 8:22 PM, Robin Murphy wrote:
>> On 2020-09-17 04:35, Gavin Shan wrote:
>>> On 9/16/20 6:28 PM, Will Deacon wrote:
 On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
> This enables color zero pages by allocating contiguous page frames
> for it. The number of pages for this is determined by the L1 dCache
> (or iCache) size, which is probed from the hardware.
>
>     * Add cache_total_size() to return L1 dCache (or iCache) size
>
>     * Implement setup_zero_pages(), which is called after the page
>   allocator begins to work, to allocate the contiguous pages
>   needed by color zero page.
>
>     * Reworked ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>
> Signed-off-by: Gavin Shan 
> ---
>   arch/arm64/include/asm/cache.h   | 22 
>   arch/arm64/include/asm/pgtable.h |  9 ++--
>   arch/arm64/kernel/cacheinfo.c    | 34 +++
>   arch/arm64/mm/init.c | 35 
>   arch/arm64/mm/mmu.c  |  7 ---
>   5 files changed, 98 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cache.h 
> b/arch/arm64/include/asm/cache.h
> index a4d1b5f771f6..420e9dde2c51 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -39,6 +39,27 @@
>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
> +#define CSSELR_TND_SHIFT    4
> +#define CSSELR_TND_MASK    (UL(1) << CSSELR_TND_SHIFT)
> +#define CSSELR_LEVEL_SHIFT    1
> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
> +#define CSSELR_IND_SHIFT    0
> +#define CSSERL_IND_MASK    (UL(1) << CSSELR_IND_SHIFT)
> +
> +#define CCSIDR_64_LS_SHIFT    0
> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
> +#define CCSIDR_64_ASSOC_SHIFT    3
> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1F) << CCSIDR_64_ASSOC_SHIFT)
> +#define CCSIDR_64_SET_SHIFT    32
> +#define CCSIDR_64_SET_MASK    (UL(0xFF) << CCSIDR_64_SET_SHIFT)
> +
> +#define CCSIDR_32_LS_SHIFT    0
> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
> +#define CCSIDR_32_ASSOC_SHIFT    3
> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
> +#define CCSIDR_32_SET_SHIFT    13
> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)

 I don't think we should be inferring cache structure from these register
 values. The Arm ARM helpfully says:

    | You cannot make any inference about the actual sizes of caches based
    | on these parameters.

 so we need to take the topology information from elsewhere.

>>>
>>> Yeah, I also noticed the statement in the spec. However, the L1 cache size
>>> figured out from the above registers matches what "lscpu" reports on the
>>> machine where I did my tests. Note "lscpu" depends on sysfs entries whose
>>> information is retrieved from the ACPI (PPTT) table. The number of cache
>>> levels is partially retrieved from a system register (clidr_el1).
>>>
>>> It's doable to retrieve the L1 cache size from the ACPI (PPTT) table. I'll
>>> change accordingly in v2 if this enablement is really needed. More clarity
>>> is provided below.
>>>
 But before we get into that, can you justify why we need to do this at all,
 please? Do you have data to show the benefit of adding this complexity?

>>>
>>> Initially, I found it's a feature that has already been enabled on
>>> mips/s390 but is missing here. Currently, all read-only anonymous VMAs are
>>> backed by the same zero page. It means all reads to these VMAs are cached
>>> by the same cache set, though still across multiple ways if supported. So
>>> it would be nice to have multiple zero pages to back these read-only
>>> anonymous VMAs, so that the reads on them can be cached by multiple sets
>>> (and multiple ways, if supported). It's overall beneficial to performance.
>>
>> Is this a concern for true PIPT caches, or is it really just working around 
>> a pathological case for alias-detecting VIPT caches?
>>
> 
> I think it's definitely a concern for PIPT caches. However, I'm not
> sure about VIPT caches because I failed to understand how they work
> from the ARMv8-A spec. If I'm correct, the index of a VIPT cache line is
> still determined by the physical address and the number of sets is
> another limitation? For example, if two virtual addresses (v1) and (v2)
> are translated to the same physical address (p1), there is still one
> cache line (from a particular set) for them. If so, this should help
> in terms of performance.
> 
> However, I'm not sure I understood VIPT caches correctly because there
> is one statement in the ARMv8-A spec as below. It seems (v1) 
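For reference, the colored zero page scheme under discussion (as already done on
s390/MIPS) roughly amounts to picking one of several physically contiguous zero
pages based on the faulting virtual address. A minimal sketch, with assumed
variable names rather than code from this patch:

	/*
	 * Hypothetical setup: empty_zero_page points at a physically
	 * contiguous block of zeroed pages sized to cover the L1 cache
	 * "colors", and zero_page_mask selects a page-aligned offset
	 * within that block. Both would be initialized once the page
	 * allocator is up.
	 */
	extern unsigned long empty_zero_page;
	extern unsigned long zero_page_mask;

	#define ZERO_PAGE(vaddr)					\
		virt_to_page((void *)(empty_zero_page +			\
			((unsigned long)(vaddr) & zero_page_mask)))

With something along these lines, read-only anonymous mappings at different
virtual offsets get backed by differently colored zero pages and hence spread
across more cache sets.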

[PATCH V3 0/3] arm64/mm/hotplug: Improve memory offline event notifier

2020-09-21 Thread Anshuman Khandual
This series brings three different changes to the only memory event notifier on
the arm64 platform. These changes improve its robustness while also enhancing
debug capabilities during potential memory offlining error conditions.

This applies on 5.9-rc6

Changes in V3:

- Split the single patch into three patch series per Catalin
- Trigger changed from setup_arch() to early_initcall() per Catalin
- Renamed back memory_hotremove_notifier() as prevent_bootmem_remove_init()
- validate_bootmem_online() is now called from prevent_bootmem_remove_init() 
per Catalin
- Skip registering the notifier if validate_bootmem_online() returns negative

Changes in V2: (https://patchwork.kernel.org/patch/11732161/)

- Dropped all generic changes wrt MEM_CANCEL_OFFLINE reasons enumeration
- Dropped all related (processing MEM_CANCEL_OFFLINE reasons) changes on arm64
- Added validate_boot_mem_online_state() that gets called with early_initcall()
- Added CONFIG_MEMORY_HOTREMOVE check before registering memory notifier
- Moved notifier registration i.e memory_hotremove_notifier into setup_arch()

Changes in V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=271237)

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (3):
  arm64/mm/hotplug: Register boot memory hot remove notifier earlier
  arm64/mm/hotplug: Enable MEM_OFFLINE event handling
  arm64/mm/hotplug: Ensure early memory sections are all online

 arch/arm64/mm/mmu.c | 110 +---
 1 file changed, 103 insertions(+), 7 deletions(-)

-- 
2.20.1



[PATCH V3 2/3] arm64/mm/hotplug: Enable MEM_OFFLINE event handling

2020-09-21 Thread Anshuman Khandual
This enables MEM_OFFLINE memory event handling. It will help intercept any
possible error condition such as if boot memory somehow still got offlined
even after an explicit notifier failure, potentially by a future change in
the generic hotplug framework. This would help detect such scenarios and help
debug further.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index df3b7415b128..6b171bd88bcf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct 
notifier_block *nb,
unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
unsigned long pfn = arg->start_pfn;
 
-   if (action != MEM_GOING_OFFLINE)
+   if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
return NOTIFY_OK;
 
-   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-   ms = __pfn_to_section(pfn);
-   if (early_section(ms))
-   return NOTIFY_BAD;
+   if (action == MEM_GOING_OFFLINE) {
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+   if (early_section(ms)) {
+   pr_warn("Boot memory offlining attempted\n");
+   return NOTIFY_BAD;
+   }
+   }
+   } else if (action == MEM_OFFLINE) {
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+   if (early_section(ms)) {
+
+   /*
+* This should have never happened. Boot memory
+* offlining should have been prevented by this
+* very notifier. Probably some memory removal
+* procedure might have changed which would then
+* require further debug.
+*/
+   pr_err("Boot memory offlined\n");
+
+   /*
+* Core memory hotplug does not process a return
+* code from the notifier for MEM_OFFLINE event.
+* Error condition has been reported. Report as
+* ignored.
+*/
+   return NOTIFY_DONE;
+   }
+   }
}
return NOTIFY_OK;
 }
-- 
2.20.1



[PATCH V3 1/3] arm64/mm/hotplug: Register boot memory hot remove notifier earlier

2020-09-21 Thread Anshuman Khandual
This moves memory notifier registration earlier in the boot process from
device_initcall() to early_initcall() which will help in guarding against
potential early boot memory offline requests. Even though there should not
be any actual offlining requests till memory block devices are initialized
with memory_dev_init(), the generic init sequence might just change in the
future. Hence an early registration for the memory event notifier would be
helpful. While here, just skip the registration if CONFIG_MEMORY_HOTREMOVE
is not enabled and also call out when memory notifier registration fails.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..df3b7415b128 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1499,7 +1499,17 @@ static struct notifier_block prevent_bootmem_remove_nb = 
{
 
 static int __init prevent_bootmem_remove_init(void)
 {
-   return register_memory_notifier(&prevent_bootmem_remove_nb);
+   int ret = 0;
+
+   if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+   return ret;
+
+   ret = register_memory_notifier(&prevent_bootmem_remove_nb);
+   if (!ret)
+   return ret;
+
+   pr_err("Notifier registration failed - boot memory can be removed\n");
+   return ret;
 }
-device_initcall(prevent_bootmem_remove_init);
+early_initcall(prevent_bootmem_remove_init);
 #endif
-- 
2.20.1



[PATCH V3 3/3] arm64/mm/hotplug: Ensure early memory sections are all online

2020-09-21 Thread Anshuman Khandual
This adds a validation function that scans the entire boot memory and makes
sure that all early memory sections are online. This check is essential for
the memory notifier to work properly, as it cannot prevent any boot memory
from being offlined if all sections are not online to begin with. The notifier
registration is skipped if this validation does not go through. The boot
section scanning itself is enabled only with DEBUG_VM.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 59 +
 1 file changed, 59 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6b171bd88bcf..124eeb84ec43 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1524,6 +1524,62 @@ static struct notifier_block prevent_bootmem_remove_nb = 
{
.notifier_call = prevent_bootmem_remove_notifier,
 };
 
+/*
+ * This ensures that boot memory sections on the platform are online
+ * during early boot. They could not be prevented from being offlined
+ * if for some reason they are not brought online to begin with. This
+ * helps validate the basic assumption on which the above memory event
+ * notifier works to prevent boot memory offlining and its possible
+ * removal.
+ */
+static bool validate_bootmem_online(void)
+{
+   struct memblock_region *mblk;
+   struct mem_section *ms;
+   unsigned long pfn, end_pfn, start, end;
+   bool all_online = true;
+
+   /*
+* Scanning across all memblock might be expensive
+* on some big memory systems. Hence enable this
+* validation only with DEBUG_VM.
+*/
+   if (!IS_ENABLED(CONFIG_DEBUG_VM))
+   return all_online;
+
+   for_each_memblock(memory, mblk) {
+   pfn = PHYS_PFN(mblk->base);
+   end_pfn = PHYS_PFN(mblk->base + mblk->size);
+
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+
+   /*
+* All memory ranges in the system at this point
+* should have been marked early sections.
+*/
+   WARN_ON(!early_section(ms));
+
+   /*
+* Memory notifier mechanism here to prevent boot
+* memory offlining depends on the fact that each
+* early section memory on the system is initially
+* online. Otherwise a given memory section which
+* is already offline will be overlooked and can
+* be removed completely. Call out such sections.
+*/
+   if (!online_section(ms)) {
+   start = PFN_PHYS(pfn);
+   end = start + (1UL << PA_SECTION_SHIFT);
+   pr_err("Memory range [%lx %lx] is offline\n", 
start, end);
+   pr_err("Memory range [%lx %lx] can be 
removed\n", start, end);
+   all_online = false;
+   }
+   }
+   }
+   return all_online;
+}
+
 static int __init prevent_bootmem_remove_init(void)
 {
int ret = 0;
@@ -1531,6 +1587,9 @@ static int __init prevent_bootmem_remove_init(void)
if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return ret;
 
+   if (!validate_bootmem_online())
+   return -EINVAL;
+
ret = register_memory_notifier(&prevent_bootmem_remove_nb);
if (!ret)
return ret;
-- 
2.20.1



Re: [PATCH v2] mm/migrate: correct thp migration stats.

2020-09-17 Thread Anshuman Khandual
Hi Zi,

On 09/18/2020 02:34 AM, Zi Yan wrote:
> From: Zi Yan 
> 
> PageTransHuge returns true for both thp and hugetlb, so the thp stats were
> counting both thp and hugetlb migrations. Exclude hugetlb migration by
> setting the is_thp variable correctly.

Coincidentally, I had just detected this problem last evening and was
in the process of sending a patch this morning :) Nonetheless, thanks
for the patch.

Earlier there was a similar THP-HugeTLB ambiguity down the error path
as well. In hindsight, I should have noticed or remembered this earlier
fix during the THP stats patch.

e6112fc30070 (mm/migrate.c: split only transparent huge pages when allocation 
fails)

> 
> Clean up thp handling code too when we are there.
> 
> Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split")
> Signed-off-by: Zi Yan 
> Reviewed-by: Daniel Jordan 
> Cc: Daniel Jordan 
> Cc: Anshuman Khandual 
> ---
>  mm/migrate.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 941b89383cf3..6bc9559afc70 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1445,7 +1445,7 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
>* Capture required information that might get lost
>* during migration.
>*/
> - is_thp = PageTransHuge(page);
> + is_thp = PageTransHuge(page) && !PageHuge(page);
>   nr_subpages = thp_nr_pages(page);
>   cond_resched();
>  
> @@ -1471,7 +1471,7 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
>* we encounter them after the rest of the list
>* is processed.
>*/
> - if (PageTransHuge(page) && !PageHuge(page)) {
> + if (is_thp) {
>   lock_page(page);
>   rc = split_huge_page_to_list(page, 
> from);
>   unlock_page(page);
> @@ -1480,8 +1480,7 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
>   nr_thp_split++;
>   goto retry;
>   }
> - }
> - if (is_thp) {
> +
>   nr_thp_failed++;
>   nr_failed += nr_subpages;
>       goto out;
> 

Moving the failure path inside the split path makes sense, now
that it is already established that the page is indeed a THP.

Reviewed-by: Anshuman Khandual 


[PATCH] arm64/mm: Validate hotplug range before creating linear mapping

2020-09-17 Thread Anshuman Khandual
During memory hotplug process, the linear mapping should not be created for
a given memory range if that would fall outside the maximum allowed linear
range. Else it might cause memory corruption in the kernel virtual space.

Maximum linear mapping region is [PAGE_OFFSET..(PAGE_END -1)] accommodating
both its ends but excluding PAGE_END. Max physical range that can be mapped
inside this linear mapping range, must also be derived from its end points.

When CONFIG_ARM64_VA_BITS_52 is enabled, PAGE_OFFSET is computed with the
assumption of 52 bits virtual address space. However, if the CPU does not
support 52 bits, then it falls back using 48 bits instead and the PAGE_END
is updated to reflect this using the vabits_actual. As for PAGE_OFFSET,
bits [51..48] are ignored by the MMU and remain unchanged, even though the
effective start address of linear map is now slightly different. Hence, to
reliably check the physical address range mapped by the linear map, the
start address should be calculated using vabits_actual. This ensures that
arch_add_memory() validates memory hot add range for its potential linear
mapping requirement, before creating it with __create_pgd_mapping().

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Ard Biesheuvel 
Cc: Steven Price 
Cc: Robin Murphy 
Cc: David Hildenbrand 
Cc: Andrew Morton 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Fixes: 4ab215061554 ("arm64: Add memory hotplug support")
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/mmu.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..d59ffabb9c84 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1433,11 +1433,38 @@ static void __remove_pgd_mapping(pgd_t *pgdir, unsigned 
long start, u64 size)
free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
 }
 
+static bool inside_linear_region(u64 start, u64 size)
+{
+   /*
+* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
+* accommodating both its ends but excluding PAGE_END. Max physical
+* range which can be mapped inside this linear mapping range, must
+* also be derived from its end points.
+*
+* With CONFIG_ARM64_VA_BITS_52 enabled, PAGE_OFFSET is defined with
+* the assumption of 52 bits virtual address space. However, if the
+* CPU does not support 52 bits, it falls back using 48 bits and the
+* PAGE_END is updated to reflect this using the vabits_actual. As
+* for PAGE_OFFSET, bits [51..48] are ignored by the MMU and remain
+* unchanged, even though the effective start address of linear map
+* is now slightly different. Hence, to reliably check the physical
+* address range mapped by the linear map, the start address should
+* be calculated using vabits_actual.
+*/
+   return ((start >= __pa(_PAGE_OFFSET(vabits_actual)))
+   && ((start + size) <= __pa(PAGE_END - 1)));
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
struct mhp_params *params)
 {
int ret, flags = 0;
 
+   if (!inside_linear_region(start, size)) {
+   pr_err("[%llx %llx] is outside linear mapping region\n", start, 
start + size);
+   return -EINVAL;
+   }
+
if (rodata_full || debug_pagealloc_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
-- 
2.20.1



Re: [PATCH v2] arm64/mm: Refactor {pgd, pud, pmd, pte}_ERROR()

2020-09-14 Thread Anshuman Khandual



On 09/14/2020 05:17 AM, Gavin Shan wrote:
> The functions __{pgd, pud, pmd, pte}_error() were introduced so that
> they can be called by {pgd, pud, pmd, pte}_ERROR(). However, some
> of the functions can never be called when the corresponding page
> table level isn't enabled. For example, __{pud, pmd}_error() are
> unused when PUD and PMD are folded into PGD.
> 
> This removes __{pgd, pud, pmd, pte}_error() and calls pr_err() from
> {pgd, pud, pmd, pte}_ERROR() directly, similar to what x86/powerpc
> are doing. With this, the code also looks a bit simpler.
> 
> Signed-off-by: Gavin Shan 
> ---
> v2: Fix build warning caused by wrong printk format
> ---
>  arch/arm64/include/asm/pgtable.h | 17 -
>  arch/arm64/kernel/traps.c| 20 
>  2 files changed, 8 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index d5d3fbe73953..e0ab81923c30 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -35,11 +35,6 @@
>  
>  extern struct page *vmemmap;
>  
> -extern void __pte_error(const char *file, int line, unsigned long val);
> -extern void __pmd_error(const char *file, int line, unsigned long val);
> -extern void __pud_error(const char *file, int line, unsigned long val);
> -extern void __pgd_error(const char *file, int line, unsigned long val);
> -
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>  
> @@ -57,7 +52,8 @@ extern void __pgd_error(const char *file, int line, 
> unsigned long val);
>  extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define ZERO_PAGE(vaddr) phys_to_page(__pa_symbol(empty_zero_page))
>  
> -#define pte_ERROR(pte)   __pte_error(__FILE__, __LINE__, 
> pte_val(pte))
> +#define pte_ERROR(e) \
> + pr_err("%s:%d: bad pte %016llx.\n", __FILE__, __LINE__, pte_val(e))
>  
>  /*
>   * Macros to convert between a physical address and its placement in a
> @@ -541,7 +537,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>  
>  #if CONFIG_PGTABLE_LEVELS > 2
>  
> -#define pmd_ERROR(pmd)   __pmd_error(__FILE__, __LINE__, 
> pmd_val(pmd))
> +#define pmd_ERROR(e) \
> + pr_err("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
>  
>  #define pud_none(pud)(!pud_val(pud))
>  #define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
> @@ -608,7 +605,8 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
>  
>  #if CONFIG_PGTABLE_LEVELS > 3
>  
> -#define pud_ERROR(pud)   __pud_error(__FILE__, __LINE__, 
> pud_val(pud))
> +#define pud_ERROR(e) \
> + pr_err("%s:%d: bad pud %016llx.\n", __FILE__, __LINE__, pud_val(e))
>  
>  #define p4d_none(p4d)(!p4d_val(p4d))
>  #define p4d_bad(p4d) (!(p4d_val(p4d) & 2))
> @@ -667,7 +665,8 @@ static inline unsigned long p4d_page_vaddr(p4d_t p4d)
>  
>  #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
>  
> -#define pgd_ERROR(pgd)   __pgd_error(__FILE__, __LINE__, 
> pgd_val(pgd))
> +#define pgd_ERROR(e) \
> + pr_err("%s:%d: bad pgd %016llx.\n", __FILE__, __LINE__, pgd_val(e))
>  
>  #define pgd_set_fixmap(addr) ((pgd_t *)set_fixmap_offset(FIX_PGD, addr))
>  #define pgd_clear_fixmap()   clear_fixmap(FIX_PGD)
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 13ebd5ca2070..12fba7136dbd 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -935,26 +935,6 @@ asmlinkage void enter_from_user_mode(void)
>  }
>  NOKPROBE_SYMBOL(enter_from_user_mode);
>  
> -void __pte_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pte %016lx.\n", file, line, val);
> -}
> -
> -void __pmd_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pmd %016lx.\n", file, line, val);
> -}
> -
> -void __pud_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pud %016lx.\n", file, line, val);
> -}
> -
> -void __pgd_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pgd %016lx.\n", file, line, val);
> -}
> -
>  /* GENERIC_BUG traps */
>  
>  int is_valid_bugaddr(unsigned long addr)
> 

Looks good to me. Seems like a sensible clean up which reduces code.
Tried booting on multiple page size configs and saw no regression.

Reviewed-by: Anshuman Khandual 


Re: [PATCH v2] arm64/mm: Refactor {pgd, pud, pmd, pte}_ERROR()

2020-09-13 Thread Anshuman Khandual



On 09/14/2020 05:17 AM, Gavin Shan wrote:
> The functions __{pgd, pud, pmd, pte}_error() were introduced so that
> they can be called by {pgd, pud, pmd, pte}_ERROR(). However, some
> of the functions can never be called when the corresponding page
> table level isn't enabled. For example, __{pud, pmd}_error() are
> unused when PUD and PMD are folded into PGD.

Right, it makes sense not to have these helpers generally available,
given that pxx_ERROR() is enabled only when the required page table
level is available, via a CONFIG_PGTABLE_LEVELS check.

> 
> This removes __{pgd, pud, pmd, pte}_error() and calls pr_err() from
> {pgd, pud, pmd, pte}_ERROR() directly, similar to what x86/powerpc
> are doing. With this, the code also looks a bit simpler.

Do we need p4d_ERROR() here as well !
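For completeness, a p4d variant following the same pattern would presumably
look like the sketch below (for illustration only, not taken from the posted
patch):

	/* hypothetical p4d variant, mirroring the pte/pmd/pud/pgd macros */
	#define p4d_ERROR(e) \
		pr_err("%s:%d: bad p4d %016llx.\n", __FILE__, __LINE__, p4d_val(e))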

> 
> Signed-off-by: Gavin Shan 
> ---
> v2: Fix build warning caused by wrong printk format
> ---
>  arch/arm64/include/asm/pgtable.h | 17 -
>  arch/arm64/kernel/traps.c| 20 
>  2 files changed, 8 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index d5d3fbe73953..e0ab81923c30 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -35,11 +35,6 @@
>  
>  extern struct page *vmemmap;
>  
> -extern void __pte_error(const char *file, int line, unsigned long val);
> -extern void __pmd_error(const char *file, int line, unsigned long val);
> -extern void __pud_error(const char *file, int line, unsigned long val);
> -extern void __pgd_error(const char *file, int line, unsigned long val);
> -
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>  
> @@ -57,7 +52,8 @@ extern void __pgd_error(const char *file, int line, 
> unsigned long val);
>  extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define ZERO_PAGE(vaddr) phys_to_page(__pa_symbol(empty_zero_page))
>  
> -#define pte_ERROR(pte)   __pte_error(__FILE__, __LINE__, 
> pte_val(pte))
> +#define pte_ERROR(e) \
> + pr_err("%s:%d: bad pte %016llx.\n", __FILE__, __LINE__, pte_val(e))
>  
>  /*
>   * Macros to convert between a physical address and its placement in a
> @@ -541,7 +537,8 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>  
>  #if CONFIG_PGTABLE_LEVELS > 2
>  
> -#define pmd_ERROR(pmd)   __pmd_error(__FILE__, __LINE__, 
> pmd_val(pmd))
> +#define pmd_ERROR(e) \
> + pr_err("%s:%d: bad pmd %016llx.\n", __FILE__, __LINE__, pmd_val(e))
>  
>  #define pud_none(pud)(!pud_val(pud))
>  #define pud_bad(pud) (!(pud_val(pud) & PUD_TABLE_BIT))
> @@ -608,7 +605,8 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
>  
>  #if CONFIG_PGTABLE_LEVELS > 3
>  
> -#define pud_ERROR(pud)   __pud_error(__FILE__, __LINE__, 
> pud_val(pud))
> +#define pud_ERROR(e) \
> + pr_err("%s:%d: bad pud %016llx.\n", __FILE__, __LINE__, pud_val(e))
>  
>  #define p4d_none(p4d)(!p4d_val(p4d))
>  #define p4d_bad(p4d) (!(p4d_val(p4d) & 2))
> @@ -667,7 +665,8 @@ static inline unsigned long p4d_page_vaddr(p4d_t p4d)
>  
>  #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
>  
> -#define pgd_ERROR(pgd)   __pgd_error(__FILE__, __LINE__, 
> pgd_val(pgd))
> +#define pgd_ERROR(e) \
> + pr_err("%s:%d: bad pgd %016llx.\n", __FILE__, __LINE__, pgd_val(e))

A line break in these macros might not be required any more, as checkpatch.pl
now accepts slightly longer lines.

>  
>  #define pgd_set_fixmap(addr) ((pgd_t *)set_fixmap_offset(FIX_PGD, addr))
>  #define pgd_clear_fixmap()   clear_fixmap(FIX_PGD)
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 13ebd5ca2070..12fba7136dbd 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -935,26 +935,6 @@ asmlinkage void enter_from_user_mode(void)
>  }
>  NOKPROBE_SYMBOL(enter_from_user_mode);
>  
> -void __pte_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pte %016lx.\n", file, line, val);
> -}
> -
> -void __pmd_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pmd %016lx.\n", file, line, val);
> -}
> -
> -void __pud_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pud %016lx.\n", file, line, val);
> -}
> -
> -void __pgd_error(const char *file, int line, unsigned long val)
> -{
> - pr_err("%s:%d: bad pgd %016lx.\n", file, line, val);
> -}

With the move, %016lx now becomes %016llx, which I guess should be okay.
It looks much cleaner to have these helpers removed from traps.c.

> -
>  /* GENERIC_BUG traps */
>  
>  int is_valid_bugaddr(unsigned long addr)
> 


Re: [PATCH V2] arm64/hotplug: Improve memory offline event notifier

2020-09-13 Thread Anshuman Khandual



On 09/11/2020 07:36 PM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> On Mon, Aug 24, 2020 at 09:34:29AM +0530, Anshuman Khandual wrote:
>> This brings about three different changes to the sole memory event notifier
>> for arm64 platform and improves it's robustness while also enhancing debug
>> capabilities during potential memory offlining error conditions.
>>
>> This moves the memory notifier registration bit earlier in the boot process
>> from device_initcall() to setup_arch() which will help in guarding against
>> potential early boot memory offline requests.
>>
>> This enables MEM_OFFLINE memory event handling. It will help intercept any
>> possible error condition such as if boot memory somehow still got offlined
>> even after an explicit notifier failure, potentially by a future change in
>> the generic hotplug framework. This would help detect such scenarios and help
>> debug further.
>>
>> It also adds a validation function which scans entire boot memory and makes
>> sure that early memory sections are online. This check is essential for the
>> memory notifier to work properly as it cannot prevent boot memory offlining
>> if they are not online to begin with. But this additional sanity check is
>> enabled only with DEBUG_VM.
> 
> Could you please split this in separate patches rather than having a
> single one doing three somewhat related things?

Sure, will do.

> 
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -376,6 +376,14 @@ void __init __no_sanitize_address setup_arch(char 
>> **cmdline_p)
>>  "This indicates a broken bootloader or old kernel\n",
>>  boot_args[1], boot_args[2], boot_args[3]);
>>  }
>> +
>> +/*
>> + * Register the memory notifier which will prevent boot
>> + * memory offlining requests - early enough. But there
>> + * should not be any actual offlinig request till memory
>> + * block devices are initialized with memory_dev_init().
>> + */
>> +memory_hotremove_notifier();
> 
> Why can this not be an early_initcall()? As you said, memory_dev_init()
> is called much later, after the SMP was initialised.

This proposal moves memory_hotremove_notifier() to setup_arch() because it
can be done there and there is no harm in calling it earlier than required
for now. But in case the generic MM sequence of events during memory init
changes later, this notifier will still work.

IIUC, the notifier chain registration can be called very early in the boot
process without much problem. There are some precedents on other platforms.

1. arch/s390/mm/init.c                    - In device_initcall() via s390_cma_mem_init()
2. arch/s390/mm/setup.c                   - In setup_arch() via reserve_crashkernel()
3. arch/powerpc/platforms/pseries/cmm.c   - In module_init() via cmm_init()
4. arch/powerpc/platforms/pseries/iommu.c - via iommu_init_early_pSeries()
                                            via pSeries_init()
                                            via pSeries_probe() aka ppc_md.probe()
                                            via probe_machine()
                                            via setup_arch()

> 
> You could even combine this with validate_bootmem_online_state() in a
> single early_initcall() which, after checking, registers the notifier.
> 

Yes, that will definitely be simpler but there might still be some value
in having this registration in setup_arch(), which guards against future
generic MM changes while keeping it separate from the sanity check i.e.
validate_bootmem_online_state(), which is enabled only with DEBUG_VM. But
I will combine both in early_initcall() with some name changes if that is
preferred.
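For illustration, such a combined early_initcall() could look roughly like the
sketch below (names are assumed from the discussion above, e.g.
validate_bootmem_online_state() is assumed to return false on failure, and may
well change in the split series):

	/* Sketch only: combined sanity check + notifier registration */
	static int __init prevent_bootmem_remove_init(void)
	{
		if (!IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
			return 0;

		/* DEBUG_VM only sanity check, then the actual registration */
		if (!validate_bootmem_online_state())
			return -EINVAL;

		return register_memory_notifier(&prevent_bootmem_remove_nb);
	}
	early_initcall(prevent_bootmem_remove_init);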


Re: [PATCH v2 3/3] arm64/mm: Unitify CONT_PMD_SHIFT

2020-09-11 Thread Anshuman Khandual



On 09/10/2020 03:29 PM, Gavin Shan wrote:
> Similar to how CONT_PTE_SHIFT is determined, this introduces a new
> kernel option (CONFIG_CONT_PMD_SHIFT) to determine CONT_PMD_SHIFT.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/Kconfig |  6 ++
>  arch/arm64/include/asm/pgtable-hwdef.h | 10 ++
>  2 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7ec30dd56300..d58e17fe9473 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -217,6 +217,12 @@ config ARM64_CONT_PTE_SHIFT
>   default 7 if ARM64_16K_PAGES
>   default 4
>  
> +config ARM64_CONT_PMD_SHIFT
> + int
> + default 5 if ARM64_64K_PAGES
> + default 5 if ARM64_16K_PAGES
> + default 4
> +
>  config ARCH_MMAP_RND_BITS_MIN
> default 14 if ARM64_64K_PAGES
> default 16 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 6c9c67f62551..94b3f2ac2e9d 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -82,17 +82,11 @@
>   * Contiguous page definitions.
>   */
>  #define CONT_PTE_SHIFT   (CONFIG_ARM64_CONT_PTE_SHIFT + 
> PAGE_SHIFT)
> -#ifdef CONFIG_ARM64_64K_PAGES
> -#define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
> -#elif defined(CONFIG_ARM64_16K_PAGES)
> -#define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
> -#else
> -#define CONT_PMD_SHIFT   (4 + PMD_SHIFT)
> -#endif
> -
>  #define CONT_PTES(1 << (CONT_PTE_SHIFT - PAGE_SHIFT))
>  #define CONT_PTE_SIZE(CONT_PTES * PAGE_SIZE)
>  #define CONT_PTE_MASK(~(CONT_PTE_SIZE - 1))
> +
> +#define CONT_PMD_SHIFT   (CONFIG_ARM64_CONT_PMD_SHIFT + 
> PMD_SHIFT)
>  #define CONT_PMDS(1 << (CONT_PMD_SHIFT - PMD_SHIFT))
>  #define CONT_PMD_SIZE(CONT_PMDS * PMD_SIZE)
>  #define CONT_PMD_MASK(~(CONT_PMD_SIZE - 1))
> 

This is cleaner and more uniform. Did not see any problem while
running some quick hugetlb tests across multiple page size configs
after applying all patches in this series.

Adding this new configuration ARM64_CONT_PMD_SHIFT makes sense, as
it eliminates existing constant values that are used in an ad hoc
manner, while computing contiguous page table entry properties.

Reviewed-by: Anshuman Khandual 


Re: [PATCH v2 2/3] arm64/mm: Unitify CONT_PTE_SHIFT

2020-09-11 Thread Anshuman Khandual


On 09/10/2020 03:29 PM, Gavin Shan wrote:
> CONT_PTE_SHIFT actually depends on CONFIG_ARM64_CONT_SHIFT. It's
> reasonable to reflect the dependency:

It is also always better to avoid direct numerical values such as 5, 7 and 4.
A config option with the right name (even with constant values) gives them
some meaning.

> 
>* This renames CONFIG_ARM64_CONT_SHIFT to CONFIG_ARM64_CONT_PTE_SHIFT,
>  so that we can introduce CONFIG_ARM64_CONT_PMD_SHIFT later.

Agreed.

> 
>* CONT_{SHIFT, SIZE, MASK}, defined in page-def.h are removed as they
>  are not used by anyone.

Makes sense.

> 
>* CONT_PTE_SHIFT is determined by CONFIG_ARM64_CONT_PTE_SHIFT.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/Kconfig | 2 +-
>  arch/arm64/include/asm/page-def.h  | 5 -
>  arch/arm64/include/asm/pgtable-hwdef.h | 4 +---
>  3 files changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6d232837cbee..7ec30dd56300 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -211,7 +211,7 @@ config ARM64_PAGE_SHIFT
>   default 14 if ARM64_16K_PAGES
>   default 12
>  
> -config ARM64_CONT_SHIFT
> +config ARM64_CONT_PTE_SHIFT
>   int
>   default 5 if ARM64_64K_PAGES
>   default 7 if ARM64_16K_PAGES
> diff --git a/arch/arm64/include/asm/page-def.h 
> b/arch/arm64/include/asm/page-def.h
> index f99d48ecbeef..2403f7b4cdbf 100644
> --- a/arch/arm64/include/asm/page-def.h
> +++ b/arch/arm64/include/asm/page-def.h
> @@ -11,13 +11,8 @@
>  #include 
>  
>  /* PAGE_SHIFT determines the page size */
> -/* CONT_SHIFT determines the number of pages which can be tracked together  
> */

This does not get added back in ; would you please
add a comment for both CONT_PTE_SHIFT and CONT_PMD_SHIFT in their
respective patches.

>  #define PAGE_SHIFT   CONFIG_ARM64_PAGE_SHIFT
> -#define CONT_SHIFT   CONFIG_ARM64_CONT_SHIFT
>  #define PAGE_SIZE(_AC(1, UL) << PAGE_SHIFT)
>  #define PAGE_MASK(~(PAGE_SIZE-1))
>  
> -#define CONT_SIZE(_AC(1, UL) << (CONT_SHIFT + PAGE_SHIFT))
> -#define CONT_MASK(~(CONT_SIZE-1))
> -
>  #endif /* __ASM_PAGE_DEF_H */
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 8a399e666837..6c9c67f62551 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -81,14 +81,12 @@
>  /*
>   * Contiguous page definitions.
>   */
> +#define CONT_PTE_SHIFT   (CONFIG_ARM64_CONT_PTE_SHIFT + 
> PAGE_SHIFT)
>  #ifdef CONFIG_ARM64_64K_PAGES
> -#define CONT_PTE_SHIFT   (5 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
>  #elif defined(CONFIG_ARM64_16K_PAGES)
> -#define CONT_PTE_SHIFT   (7 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
>  #else
> -#define CONT_PTE_SHIFT   (4 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (4 + PMD_SHIFT)
>  #endif
>  
> 

Looks good to me and there are no obvious regressions either.

Reviewed-by: Anshuman Khandual 


Re: [PATCH] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-09-10 Thread Anshuman Khandual



On 09/10/2020 01:57 PM, sudar...@codeaurora.org wrote:
> Hello Anshuman,
> 
>> On 09/10/2020 11:35 AM, Sudarshan Rajagopalan wrote:
>>> When section mappings are enabled, we allocate vmemmap pages from 
>>> physically continuous memory of size PMD_SIZE using 
>>> vmemmap_alloc_block_buf(). Section mappings are good to reduce TLB 
>>> pressure. But when system is highly fragmented and memory blocks are 
>>> being hot-added at runtime, its possible that such physically 
>>> continuous memory allocations can fail. Rather than failing the
>>
>> Did you really see this happen on a system ?
> 
> Thanks for the response.

There seems to be some text alignment problem in your response on this
thread, please have a look.

> 
> Yes, this happened on a system with very low RAM (size ~120MB) where no free 
> order-9 pages were present. Pasting a few kernel logs below. On systems with 
> low RAM, it is highly probable that memory is fragmented and no higher order 
> pages are free. In such scenarios, vmemmap alloc would fail for PMD_SIZE of 
> contiguous memory.
> 
> We have a usecase for memory sharing between VMs where one of the VM uses 
> add_memory() to add the memory that was donated by the other VM. This uses 
> something similar to VirtIO-Mem. And this requires memory to be _guaranteed_ 
> to be added in the VM so that the usecase can run without any failure.
> 
> vmemmap alloc failure: order:9, mode:0x4cc0(GFP_KERNEL|__GFP_RETRY_MAYFAIL), 
> nodemask=(null),cpuset=/,mems_allowed=0
> CPU: 1 PID: 294 Comm:  Tainted: G S5.4.50 #1
> Call trace:
>  dump_stack+0xa4/0xdc
>  warn_alloc+0x104/0x160
>  vmemmap_alloc_block+0xe4/0xf4
>  vmemmap_alloc_block_buf+0x34/0x38
>  vmemmap_populate+0xc8/0x224
>  __populate_section_memmap+0x34/0x54
>  sparse_add_section+0x16c/0x254
>  __add_pages+0xd0/0x138
>  arch_add_memory+0x114/0x1a8
> 
> DMA32: 2627*4kB (UMC) 23*8kB (UME) 6*16kB (UM) 8*32kB (UME) 2*64kB (ME) 
> 2*128kB (UE) 1*256kB (M) 2*512kB (ME) 1*1024kB (M) 0*2048kB 0*4096kB = 13732kB
> 30455 pages RAM
> 
> But keeping this usecase aside, won’t this be problematic on any systems with 
> low RAM where order-9 alloc would fail on a fragmented system, and any memory 
> hot-adding would fail? Or other similar users of VirtIO-Mem which uses 
> arch_add_memory.
> 
>>
>>> memory hot-add procedure, add a fallback option to allocate vmemmap 
>>> pages from discontinuous pages using vmemmap_populate_basepages().
>>
>> Which could lead to a mixed page size mapping in the VMEMMAP area.
> 
> Would this be problematic? We would only lose one section mapping per failure 
> and slightly increase TLB pressure. Also, we would anyway do discontinuous 
> page allocation for systems having non-4K pages (ARM64_SWAPPER_USES_SECTION_MAPS 
> will be 0). I only see a small cost to performance due to slight TLB pressure.
> 
>> Allocation failure in vmemmap_populate() should just cleanly fail the memory 
>> hot add operation, which can then be retried. Why the retry has to be 
>> offloaded to kernel ?
> 
> While a retry can be attempted again, it won't help in cases where there are 
> no order-9 pages available and any retry would just not succeed until an 
> order-9 page gets freed. Here we are just falling back to discontinuous page 
> allocation to help memory hot-add succeed as best as possible.

Understood, it seems like there are enough potential use cases and scenarios
right now to consider this fallback mechanism and a possible mixed page size
vmemmap. But I would let others weigh in on the performance impact.

> 
> Thanks and Regards,
> Sudarshan
> 
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
> Foundation Collaborative Project
> 
> -Original Message-
> From: Anshuman Khandual  
> Sent: Wednesday, September 9, 2020 11:45 PM
> To: Sudarshan Rajagopalan ; 
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org
> Cc: Catalin Marinas ; Will Deacon ; 
> Mark Rutland ; Logan Gunthorpe ; 
> David Hildenbrand ; Andrew Morton 
> ; Steven Price 
> Subject: Re: [PATCH] arm64/mm: add fallback option to allocate virtually 
> contiguous memory
> 
> Hello Sudarshan,
> 
> On 09/10/2020 11:35 AM, Sudarshan Rajagopalan wrote:
>> When section mappings are enabled, we allocate vmemmap pages from 
>> physically continuous memory of size PMD_SIZE using 
>> vmemmap_alloc_block_buf(). Section mappings are good to reduce TLB 
>> pressure. But when system is highly fragmented and memory blocks are 
>> being hot-added at runtime, its possible that such physically 
>> continuous memory allocations

Re: [PATCH] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-09-10 Thread Anshuman Khandual



On 09/10/2020 01:38 PM, David Hildenbrand wrote:
> On 10.09.20 08:45, Anshuman Khandual wrote:
>> Hello Sudarshan,
>>
>> On 09/10/2020 11:35 AM, Sudarshan Rajagopalan wrote:
>>> When section mappings are enabled, we allocate vmemmap pages from physically
>>> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). 
>>> Section mappings are good to reduce TLB pressure. But when system is 
>>> highly fragmented
>>> and memory blocks are being hot-added at runtime, its possible that such
>>> physically continuous memory allocations can fail. Rather than failing the
>>
>> Did you really see this happen on a system ?
>>
>>> memory hot-add procedure, add a fallback option to allocate vmemmap pages 
>>> from
>>> discontinuous pages using vmemmap_populate_basepages().
>>
>> Which could lead to a mixed page size mapping in the VMEMMAP area.
> 
> Right, which gives you a slight performance hit - nobody really cares,
> especially if it happens in corner cases only.

On the performance impact, I will probably let Catalin and others comment from
an arm64 platform perspective, because I might not have all the information
here. But I will do some more auditing regarding the possible impact of a mixed
page size vmemmap mapping.

> 
> At least x86_64 (see vmemmap_populate_hugepages()) and s390x (added
> recently by me) implement that behavior.
> 
> Assume you run in a virtualized environment where your hypervisor tries
> to do some smart dynamic guest resizing - like monitoring the guest
> memory consumption and adding more memory on demand. You much rather
> want hotadd to succeed (in these corner cases) that failing just because
> you weren't able to grab a huge page in one instance.
> 
> Examples include XEN balloon, Hyper-V balloon, and virtio-mem. We might
> see some of these for arm64 as well (if don't already do).

Makes sense.

> 
>> Allocation failure in vmemmap_populate() should just cleanly fail
>> the memory hot add operation, which can then be retried. Why the
>> retry has to be offloaded to kernel ?
> 
> (not sure what "offloaded to kernel" really means here - add_memory() is

Offloaded here referred to the responsibility to retry or just fall back,
i.e. whether the situation should be resolved by the user retrying the hot
add operation till it succeeds, rather than by the kernel falling back to
allocating normal pages.

> also just triggered from the kernel) I disagree, we should try our best
> to add memory and make it available, especially when short on memory
> already.

Okay.


Re: [PATCH] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-09-10 Thread Anshuman Khandual



On 09/10/2020 01:57 PM, Steven Price wrote:
> On 10/09/2020 07:05, Sudarshan Rajagopalan wrote:
>> When section mappings are enabled, we allocate vmemmap pages from physically
>> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section
>> mappings are good to reduce TLB pressure. But when system is highly 
>> fragmented
>> and memory blocks are being hot-added at runtime, its possible that such
>> physically continuous memory allocations can fail. Rather than failing the
>> memory hot-add procedure, add a fallback option to allocate vmemmap pages 
>> from
>> discontinuous pages using vmemmap_populate_basepages().
>>
>> Signed-off-by: Sudarshan Rajagopalan 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Anshuman Khandual 
>> Cc: Mark Rutland 
>> Cc: Logan Gunthorpe 
>> Cc: David Hildenbrand 
>> Cc: Andrew Morton 
>> Cc: Steven Price 
>> ---
>>   arch/arm64/mm/mmu.c | 15 ---
>>   1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 75df62f..a46c7d4 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1100,6 +1100,7 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>   p4d_t *p4dp;
>>   pud_t *pudp;
>>   pmd_t *pmdp;
>> +    int ret = 0;
>>     do {
>>   next = pmd_addr_end(addr, end);
>> @@ -1121,15 +1122,23 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>   void *p = NULL;
>>     p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
>> -    if (!p)
>> -    return -ENOMEM;
>> +    if (!p) {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +    vmemmap_free(start, end, altmap);
>> +#endif
>> +    ret = -ENOMEM;
>> +    break;
>> +    }
>>     pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
>>   } else
>>   vmemmap_verify((pte_t *)pmdp, node, addr, next);
>>   } while (addr = next, addr != end);
>>   -    return 0;
>> +    if (ret)
>> +    return vmemmap_populate_basepages(start, end, node, altmap);
>> +    else
>> +    return ret;
> 
> Style comment: I find this usage of 'ret' confusing. When we assign -ENOMEM 
> above that is never actually the return value of the function (in that case 
> vmemmap_populate_basepages() provides the actual return value).

Right.

> 
> Also the "return ret" is misleading since we know by that point that ret==0 
> (and the 'else' is redundant).

Right.

> 
> Can you not just move the call to vmemmap_populate_basepages() up to just 
> after the (possible) vmemmap_free() call and remove the 'ret' variable?
> 
> AFAICT the call to vmemmap_free() also doesn't need the #ifdef as the 
> function is a no-op if CONFIG_MEMORY_HOTPLUG isn't set. I also feel you 

Right, CONFIG_MEMORY_HOTPLUG is not required.

> need at least a comment to explain Anshuman's point that it looks like you're 
> freeing an unmapped area. Although if I'm reading the code correctly it seems 
> like the unmapped area will just be skipped.

The proposed vmemmap_free() attempts to free the entire requested vmemmap range
[start, end] when an intermediate PMD entry cannot be allocated. Hence even
if vmemmap_free() could skip an unmapped area (will double check on that), it
unnecessarily goes through large sections of unmapped range, which could not
have been mapped.

So, basically there could be two different methods for doing this fallback.

1. Call vmemmap_populate_basepages() for sections when PMD_SIZE allocation fails

- vmemmap_free() need not be called

2. Abort at the first instance of PMD_SIZE allocation failure

- Call vmemmap_free() to unmap all sections mapped till that point
- Call vmemmap_populate_basepages() to map the entire request section

The proposed patch tried to mix both approaches. Regardless, the first approach
here seems better and is what the vmemmap_populate_hugepages() implementation
on x86 does as well.
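For illustration, the first approach would roughly turn the PMD population loop
of arm64's vmemmap_populate() into something like the sketch below (modelled on
the x86 fallback behaviour, not the posted patch; the pgd/p4d/pud population
steps stay unchanged and are elided):

	/* Sketch of approach 1: per-PMD fallback to base page mappings */
	do {
		next = pmd_addr_end(addr, end);

		/* pgd/p4d/pud population as before ... */

		pmdp = pmd_offset(pudp, addr);
		if (pmd_none(READ_ONCE(*pmdp))) {
			void *p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);

			if (p)
				pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
			else if (vmemmap_populate_basepages(addr, next, node, altmap))
				return -ENOMEM;	/* even base pages failed */
		} else {
			vmemmap_verify((pte_t *)pmdp, node, addr, next);
		}
	} while (addr = next, addr != end);

	return 0;

This way only the section whose PMD_SIZE allocation failed falls back to base
page mappings, and vmemmap_free() never needs to be called on ranges that were
never mapped.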


Re: [PATCH 2/2] arm64/mm: Use CONT_SHIFT to define CONT_PTE_SHIFT

2020-09-10 Thread Anshuman Khandual



On 09/10/2020 02:01 PM, Gavin Shan wrote:
> Hi Anshuman,
> 
> On 9/10/20 4:17 PM, Anshuman Khandual wrote:
>> On 09/08/2020 12:49 PM, Gavin Shan wrote:
>>> The macro CONT_PTE_SHIFT actually depends on CONT_SHIFT, which has
>>> been defined in page-def.h, based on CONFIG_ARM64_CONT_SHIFT. Lets
>>> reflect the dependency.
>>>
>>> Signed-off-by: Gavin Shan 
>>> ---
>>>   arch/arm64/include/asm/pgtable-hwdef.h | 4 +---
>>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
>>> b/arch/arm64/include/asm/pgtable-hwdef.h
>>> index 8a399e666837..0bd9469f4323 100644
>>> --- a/arch/arm64/include/asm/pgtable-hwdef.h
>>> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
>>> @@ -81,14 +81,12 @@
>>>   /*
>>>    * Contiguous page definitions.
>>>    */
>>> +#define CONT_PTE_SHIFT    (CONT_SHIFT + PAGE_SHIFT)
>>>   #ifdef CONFIG_ARM64_64K_PAGES
>>> -#define CONT_PTE_SHIFT    (5 + PAGE_SHIFT)
>>>   #define CONT_PMD_SHIFT    (5 + PMD_SHIFT)
>>>   #elif defined(CONFIG_ARM64_16K_PAGES)
>>> -#define CONT_PTE_SHIFT    (7 + PAGE_SHIFT)
>>>   #define CONT_PMD_SHIFT    (5 + PMD_SHIFT)
>>>   #else
>>> -#define CONT_PTE_SHIFT    (4 + PAGE_SHIFT)
>>>   #define CONT_PMD_SHIFT    (4 + PMD_SHIFT)
>>>   #endif
>> Could not a similar CONT_PMD be created from a new CONFIG_ARM64_CONT_PMD
>> config option, which would help unify CONT_PMD_SHIFT here as well ?
>>
> 
> I was thinking of it, to have CONFIG_ARM64_CONT_PMD and defined the
> following macros in arch/arm64/include/asm/page-def.h:
> 
>    #define CONT_PMD_SHIFT    CONFIG_ARM64_CONT_PMD_SHIFT
>    #define CONT_PMD_SIZE    (_AC(1, UL) << (CONT_PMD_SHIFT + PMD_SHIFT)
>    #define CONT_PMD_MASK    (~(CONT_PMD_SIZE - 1))
> 
> PMD_SHIFT is variable because PMD could be folded into PUD or PGD,
> depending on the kernel configuration. PMD_SHIFT is declared

Even CONT_PMD_SHIFT via the new CONFIG_ARM64_CONT_PMD_SHIFT will
be a variable as well depending on page size.

> in arch/arm64/include/asm/pgtable-types.h, which isn't supposed
> to be included in "page-def.h".

Are there build failures if  is included from  ?

> 
> So the proper way to handle this might be to drop the contiguous page
> macros in page-def.h and introduce the following ones into pgtable-hwdef.h.
> I will post v2 to do this if it sounds good to you.

Sure, go ahead if that builds. But unifying both these macros seems cleaner.

> 
>    #define CONT_PTE_SHIFT (CONFIG_ARM64_CONT_PTE_SHIFT + PAGE_SHIFT)
>    #define CONT_PMD_SHIFT (CONFIG_ARM64_CONT_PMD_SHIFT + PMD_SHIFT)
> 
> Thanks,
> Gavin
> 
>


Re: [PATCH] arm64/mm: add fallback option to allocate virtually contiguous memory

2020-09-10 Thread Anshuman Khandual
Hello Sudarshan,

On 09/10/2020 11:35 AM, Sudarshan Rajagopalan wrote:
> When section mappings are enabled, we allocate vmemmap pages from physically
> continuous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section
> mappings are good to reduce TLB pressure. But when system is highly fragmented
> and memory blocks are being hot-added at runtime, its possible that such
> physically continuous memory allocations can fail. Rather than failing the

Did you really see this happen on a system ?

> memory hot-add procedure, add a fallback option to allocate vmemmap pages from
> discontinuous pages using vmemmap_populate_basepages().

Which could lead to a mixed page size mapping in the VMEMMAP area.
Allocation failure in vmemmap_populate() should just cleanly fail
the memory hot add operation, which can then be retried. Why the
retry has to be offloaded to kernel ?

> 
> Signed-off-by: Sudarshan Rajagopalan 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Anshuman Khandual 
> Cc: Mark Rutland 
> Cc: Logan Gunthorpe 
> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: Steven Price 
> ---
>  arch/arm64/mm/mmu.c | 15 ---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 75df62f..a46c7d4 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1100,6 +1100,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>   p4d_t *p4dp;
>   pud_t *pudp;
>   pmd_t *pmdp;
> + int ret = 0;
>  
>   do {
>   next = pmd_addr_end(addr, end);
> @@ -1121,15 +1122,23 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>   void *p = NULL;
>  
>   p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> - if (!p)
> - return -ENOMEM;
> + if (!p) {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> + vmemmap_free(start, end, altmap);
> +#endif

The mapping was never created in the first place, as the allocation
failed. vmemmap_free() here will free an unmapped area !

> + ret = -ENOMEM;
> + break;
> + }
>  
>   pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
>   } else
>   vmemmap_verify((pte_t *)pmdp, node, addr, next);
>   } while (addr = next, addr != end);
>  
> - return 0;
> + if (ret)
> + return vmemmap_populate_basepages(start, end, node, altmap);
> + else
> + return ret;
>  }
>  #endif   /* !ARM64_SWAPPER_USES_SECTION_MAPS */
>  void vmemmap_free(unsigned long start, unsigned long end,
> 


Re: [PATCH 2/2] arm64/mm: Use CONT_SHIFT to define CONT_PTE_SHIFT

2020-09-10 Thread Anshuman Khandual



On 09/08/2020 12:49 PM, Gavin Shan wrote:
> The macro CONT_PTE_SHIFT actually depends on CONT_SHIFT, which has
> been defined in page-def.h, based on CONFIG_ARM64_CONT_SHIFT. Lets
> reflect the dependency.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/include/asm/pgtable-hwdef.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index 8a399e666837..0bd9469f4323 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -81,14 +81,12 @@
>  /*
>   * Contiguous page definitions.
>   */
> +#define CONT_PTE_SHIFT   (CONT_SHIFT + PAGE_SHIFT)
>  #ifdef CONFIG_ARM64_64K_PAGES
> -#define CONT_PTE_SHIFT   (5 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
>  #elif defined(CONFIG_ARM64_16K_PAGES)
> -#define CONT_PTE_SHIFT   (7 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (5 + PMD_SHIFT)
>  #else
> -#define CONT_PTE_SHIFT   (4 + PAGE_SHIFT)
>  #define CONT_PMD_SHIFT   (4 + PMD_SHIFT)
>  #endif
Could not a similar CONT_PMD be created from a new CONFIG_ARM64_CONT_PMD
config option, which would help unify CONT_PMD_SHIFT here as well ?


Re: [PATCH 1/2] arm64/mm: Remove CONT_RANGE_OFFSET

2020-09-10 Thread Anshuman Khandual



On 09/08/2020 12:49 PM, Gavin Shan wrote:
> The macro was introduced by commit  ("arm64: PTE/PMD
> contiguous bit definition") at the beginning. It's only used by
> commit <348a65cdcbbf> ("arm64: Mark kernel page ranges contiguous"),
> which was reverted later by commit <667c27597ca8>. This makes the
> macro unused.
> 
> This removes the unused macro (CONT_RANGE_OFFSET).
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/include/asm/pgtable-hwdef.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> b/arch/arm64/include/asm/pgtable-hwdef.h
> index d400a4d9aee2..8a399e666837 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -98,8 +98,6 @@
>  #define CONT_PMDS(1 << (CONT_PMD_SHIFT - PMD_SHIFT))
>  #define CONT_PMD_SIZE(CONT_PMDS * PMD_SIZE)
>  #define CONT_PMD_MASK(~(CONT_PMD_SIZE - 1))
> -/* the numerical offset of the PTE within a range of CONT_PTES */
> -#define CONT_RANGE_OFFSET(addr) (((addr)>>PAGE_SHIFT)&(CONT_PTES-1))
>  
>  /*
>   * Hardware page table definitions.
> 

Reviewed-by: Anshuman Khandual 


[PATCH V4] arm64/cpuinfo: Define HWCAP name arrays per their actual bit definitions

2020-09-08 Thread Anshuman Khandual
HWCAP name arrays (hwcap_str, compat_hwcap_str, compat_hwcap2_str) that are
scanned for /proc/cpuinfo are detached from their bit definitions, making them
vulnerable and difficult to correlate. It is also a bit problematic because
during a /proc/cpuinfo dump these arrays get traversed sequentially, assuming
they reflect and match the actual HWCAP bit sequence, to test various features
for a given CPU. This redefines the name arrays per their HWCAP bit definitions.
It also warns after detecting any feature which is not expected on arm64.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Brown 
Cc: Dave Martin 
Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Suzuki K Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This applies on 5.9-rc4

Mark, since the patch has changed I have dropped your Acked-by: tag. Are you
happy to give a new one ?

Changes in V4:

- Unified all three HWCAP array traversal per Will

Changes in V3: (https://patchwork.kernel.org/patch/11718113/)

- Moved name arrays to (arch/arm64/kernel/cpuinfo.c) to prevent a build warning
- Replaced string values with NULL for all compat features not possible on arm64
- Changed compat_hwcap_str[] iteration on size as some NULL values are expected
- Warn once after detecting any feature on arm64 that is not expected

Changes in V2: (https://patchwork.kernel.org/patch/11533755/)

- Defined COMPAT_KERNEL_HWCAP[2] and updated the name arrays per Mark
- Updated the commit message as required

Changes in V1: (https://patchwork.kernel.org/patch/11532945/)

 arch/arm64/include/asm/hwcap.h |   9 ++
 arch/arm64/kernel/cpuinfo.c| 176 +
 2 files changed, 101 insertions(+), 84 deletions(-)

diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h
index 22f73fe09030..6493a4c63a2f 100644
--- a/arch/arm64/include/asm/hwcap.h
+++ b/arch/arm64/include/asm/hwcap.h
@@ -8,18 +8,27 @@
 #include 
 #include 
 
+#define COMPAT_HWCAP_SWP   (1 << 0)
 #define COMPAT_HWCAP_HALF  (1 << 1)
 #define COMPAT_HWCAP_THUMB (1 << 2)
+#define COMPAT_HWCAP_26BIT (1 << 3)
 #define COMPAT_HWCAP_FAST_MULT (1 << 4)
+#define COMPAT_HWCAP_FPA   (1 << 5)
 #define COMPAT_HWCAP_VFP   (1 << 6)
 #define COMPAT_HWCAP_EDSP  (1 << 7)
+#define COMPAT_HWCAP_JAVA  (1 << 8)
+#define COMPAT_HWCAP_IWMMXT(1 << 9)
+#define COMPAT_HWCAP_CRUNCH(1 << 10)
+#define COMPAT_HWCAP_THUMBEE   (1 << 11)
 #define COMPAT_HWCAP_NEON  (1 << 12)
 #define COMPAT_HWCAP_VFPv3 (1 << 13)
+#define COMPAT_HWCAP_VFPV3D16  (1 << 14)
 #define COMPAT_HWCAP_TLS   (1 << 15)
 #define COMPAT_HWCAP_VFPv4 (1 << 16)
 #define COMPAT_HWCAP_IDIVA (1 << 17)
 #define COMPAT_HWCAP_IDIVT (1 << 18)
 #define COMPAT_HWCAP_IDIV  (COMPAT_HWCAP_IDIVA|COMPAT_HWCAP_IDIVT)
+#define COMPAT_HWCAP_VFPD32(1 << 19)
 #define COMPAT_HWCAP_LPAE  (1 << 20)
 #define COMPAT_HWCAP_EVTSTRM   (1 << 21)
 
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index d0076c2159e6..04640f5f9f0f 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -43,94 +43,93 @@ static const char *icache_policy_str[] = {
 unsigned long __icache_flags;
 
 static const char *const hwcap_str[] = {
-   "fp",
-   "asimd",
-   "evtstrm",
-   "aes",
-   "pmull",
-   "sha1",
-   "sha2",
-   "crc32",
-   "atomics",
-   "fphp",
-   "asimdhp",
-   "cpuid",
-   "asimdrdm",
-   "jscvt",
-   "fcma",
-   "lrcpc",
-   "dcpop",
-   "sha3",
-   "sm3",
-   "sm4",
-   "asimddp",
-   "sha512",
-   "sve",
-   "asimdfhm",
-   "dit",
-   "uscat",
-   "ilrcpc",
-   "flagm",
-   "ssbs",
-   "sb",
-   "paca",
-   "pacg",
-   "dcpodp",
-   "sve2",
-   "sveaes",
-   "svepmull",
-   "svebitperm",
-   "svesha3",
-   "svesm4",
-   "flagm2",
-   "frint",
-   "svei8mm",
-   "svef32mm",
-   "svef64mm",
-   "svebf16",
-   "i8mm",
-   "bf16",
-   "dgh",
-   "rng",
-   "bti",
+   [KERNEL_HWCAP_FP]   = "fp",
+   [KERNEL_HWCAP_ASIMD]= "asimd",
+   [KERNEL_HWCAP_EVTSTRM]  = "evtstrm",
+   [KERNEL_HWCAP_AES]   

[PATCH V2 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-09-08 Thread Anshuman Khandual
pmd_present() and pmd_trans_huge() are expected to behave in the following
manner during various phases of a given PMD. It is derived from a previous
detailed discussion on this topic [1] and present THP documentation [2].

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd refers to a migration or swap entry

pmd_trans_huge(pmd):

- Returns true if pmd refers to system RAM and is a trans huge mapping

-------------------------------------------------------------------------
|   PMD states      |   pmd_present     |   pmd_trans_huge  |
-------------------------------------------------------------------------
|   Mapped          |   Yes             |   Yes             |
-------------------------------------------------------------------------
|   Splitting       |   Yes             |   Yes             |
-------------------------------------------------------------------------
|   Migration/Swap  |   No              |   No              |
-------------------------------------------------------------------------

The problem:

PMD is first invalidated with pmdp_invalidate() before it's splitting. This
invalidation clears PMD_SECT_VALID as below.

PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID

Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
affirm pmd_present() as true during the THP split process. To comply with
above mentioned semantics, pmd_trans_huge() should also check pmd_present()
first before testing presence of an actual transparent huge mapping.

The solution:

Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
it will not be there for pmd_present() check after pmdp_invalidate().

A new software defined PMD_PRESENT_INVALID (bit 59) can be set on the PMD
entry during invalidation which can help pmd_present() return true and in
recognizing the fact that it still points to memory.

This bit is transient. During the split process it will be overridden by a
page table page representing normal pages in place of erstwhile huge page.
Other pmdp_invalidate() callers always write a fresh PMD value on the entry
overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.

[1]: https://lkml.org/lkml/2018/10/17/231
[2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt
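
The invalidation side is not visible in the hunks below but, as a rough sketch
of what the above implies (using the clear_pmd_bit()/set_pmd_bit() helpers
added by this patch), pmd_mkinvalid() would end up looking something like:

static inline pmd_t pmd_mkinvalid(pmd_t pmd)
{
	/* Mark the entry as still present for pmd_present() ... */
	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
	/* ... while making it invalid from the MMU's point of view */
	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));

	return pmd;
}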

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/include/asm/pgtable-prot.h |  7 ++
 arch/arm64/include/asm/pgtable.h  | 34 ---
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h 
b/arch/arm64/include/asm/pgtable-prot.h
index 4d867c6446c4..2df4b75fce3c 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,13 @@
 #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE  (_AT(pteval_t, 1) << 58) /* only when 
!PTE_VALID */
 
+/*
+ * This bit indicates that the entry is present i.e. pmd_page()
+ * still points to a valid huge page in memory even if the pmd
+ * has been invalidated.
+ */
+#define PMD_PRESENT_INVALID(_AT(pteval_t, 1) << 59) /* only when 
!PMD_SECT_VALID */
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..d8258ae8fce0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -145,6 +145,18 @@ static inline pte_t set_pte_bit(pte_t pte, pgprot_t prot)
return pte;
 }
 
+static inline pmd_t clear_pmd_bit(pmd_t pmd, pgprot_t prot)
+{
+   pmd_val(pmd) &= ~pgprot_val(prot);
+   return pmd;
+}
+
+static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
+{
+   pmd_val(pmd) |= pgprot_val(prot);
+   return pmd;
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
@@ -363,15 +375,24 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
+
+static inline int pmd_present(pmd_t pmd)
+{
+   return pte_present(pmd_pte(pmd)) || pmd_present_invalid(pmd);
+}
+
 /*
  * THP definitions.
  */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_trans_huge(pmd)(pmd_val(pmd) && !(pmd_val(pmd) & 
PMD_TABLE_BIT))
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+   return pmd_val(pmd) && pmd_present(pmd) && !(pmd_val(pmd) & 
PMD_TABLE_BIT);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define 

[PATCH V2 0/2] arm64/mm: Enable THP migration

2020-09-08 Thread Anshuman Khandual
This series enables THP migration on arm64 via ARCH_ENABLE_THP_MIGRATION.
But first this modifies all existing THP helpers like pmd_present() and
pmd_trans_huge() etc per expected generic memory semantics as concluded
from a previous discussion here.

https://lkml.org/lkml/2018/10/9/220

This series is based on v5.9-rc4.

Changes in V2:

- Renamed clr_pmd_bit() as clear_pmd_bit() per Catalin
- Updated in-code documentation per Catalin and Ralph
- Updated commit message in the first patch per Catalin
- Updated commit message in the second patch per Catalin
- Added tags from Catalin

Changes in V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=333627)

- Used new PMD_PRESENT_INVALID (bit 59) to represent invalidated PMD state per 
Catalin

Changes in RFC V2: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=302965)

- Used PMD_TABLE_BIT to represent splitting PMD state per Catalin

Changes in RFC V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=138797)

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: Zi Yan 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  arm64/mm: Change THP helpers to comply with generic MM semantics
  arm64/mm: Enable THP migration

 arch/arm64/Kconfig|  4 +++
 arch/arm64/include/asm/pgtable-prot.h |  7 +
 arch/arm64/include/asm/pgtable.h  | 39 ---
 3 files changed, 47 insertions(+), 3 deletions(-)

-- 
2.20.1



[PATCH V2 2/2] arm64/mm: Enable THP migration

2020-09-08 Thread Anshuman Khandual
In certain page migration situations, a THP page can be migrated without
being split into its constituent subpages. This saves the time required to
split a THP and put it back together when required. It also preserves the
wider address range translation covered by a single TLB entry, reducing
future page fault costs.

A previous patch changed platform THP helpers per generic memory semantics,
clearing the path for THP migration support. This adds two more THP helpers
required to create PMD migration swap entries. Now enable THP migration via
ARCH_ENABLE_THP_MIGRATION.
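
As a rough usage sketch (paraphrased from generic mm code, not part of this
patch), these conversions are what let set_pmd_migration_entry() park a
migration entry in the PMD slot:

	/* Simplified from mm/huge_memory.c:set_pmd_migration_entry() */
	pmd_t pmdval = pmdp_invalidate(vma, address, pmdp);
	swp_entry_t entry = make_migration_entry(page, pmd_write(pmdval));
	pmd_t pmdswp = swp_entry_to_pmd(entry);	/* wraps __swp_entry_to_pmd() */

	set_pmd_at(mm, address, pmdp, pmdswp);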

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/Kconfig   | 4 
 arch/arm64/include/asm/pgtable.h | 5 +
 2 files changed, 9 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..e21b94061780 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1876,6 +1876,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on HUGETLB_PAGE && MIGRATION
 
+config ARCH_ENABLE_THP_MIGRATION
+   def_bool y
+   depends on TRANSPARENT_HUGEPAGE
+
 menu "Power management options"
 
 source "kernel/power/Kconfig"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d8258ae8fce0..bc68da9f5706 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -875,6 +875,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct 
*vma,
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
 #define __swp_entry_to_pte(swp)((pte_t) { (swp).val })
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
+#define __swp_entry_to_pmd(swp)__pmd((swp).val)
+#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 /*
  * Ensure that there are not more swap files than can be encoded in the kernel
  * PTEs.
-- 
2.20.1



Re: [PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-09-08 Thread Anshuman Khandual



On 09/03/2020 11:01 PM, Ralph Campbell wrote:
> 
> On 9/3/20 9:56 AM, Catalin Marinas wrote:
>> On Mon, Aug 17, 2020 at 02:49:43PM +0530, Anshuman Khandual wrote:
>>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>>> manner during various phases of a given PMD. It is derived from a previous
>>> detailed discussion on this topic [1] and present THP documentation [2].
>>>
>>> pmd_present(pmd):
>>>
>>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>
>> The second bullet doesn't make much sense. If you have a pmd mapping of
>> some I/O memory, pmd_present() still returns true (as does
>> pte_present()).
>>
>>> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
>>> b/arch/arm64/include/asm/pgtable-prot.h
>>> index 4d867c6446c4..28792fdd9627 100644
>>> --- a/arch/arm64/include/asm/pgtable-prot.h
>>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>>> @@ -19,6 +19,13 @@
>>>   #define PTE_DEVMAP    (_AT(pteval_t, 1) << 57)
>>>   #define PTE_PROT_NONE    (_AT(pteval_t, 1) << 58) /* only when 
>>> !PTE_VALID */
>>>   +/*
>>> + * This help indicate that the entry is present i.e pmd_page()
>>
>> Nit: add another . after i.e
> 
> Another nit: "This help indicate" => "This helper indicates"
> 
> Maybe I should look at the series more. :-)

It is talking about the new PTE bit being used here not any
helper. Though the following replacement might be better.

s/This help indicate/This bit indicates/

/*
 * This help indicate that the entry is present i.e pmd_page()
 * still points to a valid huge page in memory even if the pmd
 * has been invalidated.
 */
#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when 
!PMD_SECT_VALID */


Re: [PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-09-08 Thread Anshuman Khandual



On 09/03/2020 10:26 PM, Catalin Marinas wrote:
> On Mon, Aug 17, 2020 at 02:49:43PM +0530, Anshuman Khandual wrote:
>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>> manner during various phases of a given PMD. It is derived from a previous
>> detailed discussion on this topic [1] and present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
> 
> The second bullet doesn't make much sense. If you have a pmd mapping of
> some I/O memory, pmd_present() still returns true (as does
> pte_present()).

Derived this from an earlier discussion (https://lkml.org/lkml/2018/10/17/231)
but current representation here might not be accurate.

Would this be any better ?

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd refers to a migration or swap entry

> 
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
>> b/arch/arm64/include/asm/pgtable-prot.h
>> index 4d867c6446c4..28792fdd9627 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -19,6 +19,13 @@
>>  #define PTE_DEVMAP  (_AT(pteval_t, 1) << 57)
>>  #define PTE_PROT_NONE   (_AT(pteval_t, 1) << 58) /* only when 
>> !PTE_VALID */
>>  
>> +/*
>> + * This help indicate that the entry is present i.e pmd_page()
> 
> Nit: add another . after i.e

Will fix.

> 
>> + * still points to a valid huge page in memory even if the pmd
>> + * has been invalidated.
>> + */
>> +#define PMD_PRESENT_INVALID (_AT(pteval_t, 1) << 59) /* only when 
>> !PMD_SECT_VALID */
>> +
>>  #ifndef __ASSEMBLY__
>>  
>>  #include 
>> diff --git a/arch/arm64/include/asm/pgtable.h 
>> b/arch/arm64/include/asm/pgtable.h
>> index d5d3fbe73953..7aa69cace784 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -145,6 +145,18 @@ static inline pte_t set_pte_bit(pte_t pte, pgprot_t 
>> prot)
>>  return pte;
>>  }
>>  
>> +static inline pmd_t clr_pmd_bit(pmd_t pmd, pgprot_t prot)
>> +{
>> +pmd_val(pmd) &= ~pgprot_val(prot);
>> +return pmd;
>> +}
> 
> Could you use clear_pmd_bit (instead of clr) for consistency with
> clear_pte_bit()?

Sure, will do.

> 
> It would be good if the mm folk can do a sanity check on the assumptions
> about pmd_present/pmdp_invalidate/pmd_trans_huge.
> 
> The patch looks fine to me otherwise, feel free to add:
> 
> Reviewed-by: Catalin Marinas 
>


Re: [PATCH V2] arm64/hotplug: Improve memory offline event notifier

2020-09-08 Thread Anshuman Khandual



On 08/24/2020 09:34 AM, Anshuman Khandual wrote:
> This brings about three different changes to the sole memory event notifier
> for the arm64 platform and improves its robustness while also enhancing debug
> capabilities during potential memory offlining error conditions.
> 
> This moves the memory notifier registration bit earlier in the boot process
> from device_initcall() to setup_arch() which will help in guarding against
> potential early boot memory offline requests.
> 
> This enables MEM_OFFLINE memory event handling. It will help intercept any
> possible error condition such as if boot memory somehow still got offlined
> even after an explicit notifier failure, potentially by a future change in
> the generic hotplug framework. This would help detect such scenarios and help
> debug further.
> 
> It also adds a validation function which scans entire boot memory and makes
> sure that early memory sections are online. This check is essential for the
> memory notifier to work properly as it cannot prevent boot memory offlining
> if they are not online to begin with. But this additional sanity check is
> enabled only with DEBUG_VM.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Marc Zyngier 
> Cc: Steve Capper 
> Cc: Mark Brown 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
> This applies on 5.9-rc2
> 
> Changes in V2:
> 
> - Dropped all generic changes wrt MEM_CANCEL_OFFLINE reasons enumeration
> - Dropped all related (processing MEM_CANCEL_OFFLINE reasons) changes on arm64
> - Added validate_boot_mem_online_state() that gets called with 
> early_initcall()
> - Added CONFIG_MEMORY_HOTREMOVE check before registering memory notifier
> - Moved notifier registration i.e memory_hotremove_notifier into setup_arch()
Gentle ping, any updates on this ?


Re: [PATCH V3] arm64/cpuinfo: Define HWCAP name arrays per their actual bit definitions

2020-09-08 Thread Anshuman Khandual



On 09/08/2020 01:11 PM, Will Deacon wrote:
> On Tue, Sep 08, 2020 at 10:43:12AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 09/07/2020 05:46 PM, Will Deacon wrote:
>>> On Mon, Aug 17, 2020 at 05:34:23PM +0530, Anshuman Khandual wrote:
>>>> HWCAP name arrays (hwcap_str, compat_hwcap_str, compat_hwcap2_str) that are
>>>> scanned for /proc/cpuinfo are detached from their bit definitions making it
>>>> vulnerable and difficult to correlate. It is also bit problematic because
>>>> during /proc/cpuinfo dump these arrays get traversed sequentially assuming
>>>> they reflect and match actual HWCAP bit sequence, to test various features
>>>> for a given CPU. This redefines name arrays per their HWCAP bit definitions
>>>> . It also warns after detecting any feature which is not expected on arm64.
>>>>
>>>> Cc: Catalin Marinas 
>>>> Cc: Will Deacon 
>>>> Cc: Mark Brown 
>>>> Cc: Dave Martin 
>>>> Cc: Ard Biesheuvel 
>>>> Cc: Mark Rutland 
>>>> Cc: Suzuki K Poulose 
>>>> Cc: linux-arm-ker...@lists.infradead.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> Signed-off-by: Anshuman Khandual 
>>>> ---
>>>> This applies on 5.9-rc1
>>>>
>>>> Mark, since the patch has changed I have dropped your Acked-by: tag. Are 
>>>> you
>>>> happy to give a new one ?
>>>>
>>>> Changes in V3:
>>>>
>>>> - Moved name arrays to (arch/arm64/kernel/cpuinfo.c) to prevent a build 
>>>> warning
>>>> - Replaced string values with NULL for all compat features not possible on 
>>>> arm64
>>>> - Changed compat_hwcap_str[] iteration on size as some NULL values are 
>>>> expected
>>>> - Warn once after detecting any feature on arm64 that is not expected
>>>>
>>>> Changes in V2: (https://patchwork.kernel.org/patch/11533755/)
>>>>
>>>> - Defined COMPAT_KERNEL_HWCAP[2] and updated the name arrays per Mark
>>>> - Updated the commit message as required
>>>>
>>>> Changes in V1: (https://patchwork.kernel.org/patch/11532945/)
>>>>
>>>>  arch/arm64/include/asm/hwcap.h |   9 +++
>>>>  arch/arm64/kernel/cpuinfo.c| 172 
>>>> ++---
>>>>  2 files changed, 100 insertions(+), 81 deletions(-)
>>>
>>> [...]
>>>
>>>> +  [KERNEL_HWCAP_FP]   = "fp",
>>>> +  [KERNEL_HWCAP_ASIMD]= "asimd",
>>>> +  [KERNEL_HWCAP_EVTSTRM]  = "evtstrm",
>>>> +  [KERNEL_HWCAP_AES]  = "aes",
>>>
>>> It would be nice if the cap and the string were generated by the same
>>> macro, along the lines of:
>>>
>>> #define KERNEL_HWCAP(c) [KERNEL_HWCAP_##c] = #c,
>>>
>>> Does making the constants mixed case break anything, or is it just really
>>> churny to do?
>>
>> Currently all existing HWCAP feature strings are lower case, above change
>> will make them into upper case instead. I could not find a method to force
>> convert #c into lower case constant strings in the macro definition. Would
>> not changing the HWCAP string case here, break user interface ?
> 
> Yes, we can't change the user-visible strings, but what's wrong with
> having e.g. KERNEL_HWCAP_fp instead of KERNEL_HWCAP_FP?

Unlike the new compat macros, i.e. COMPAT_KERNEL_HWCAP[2] in this patch, the
KERNEL_HWCAP_XXX macros are already defined and are also used elsewhere
(arch/arm64/kernel/cpufeature.c). [KERNEL_HWCAP_##c] can only be used here
if the input string is in upper case. Otherwise all these existing macros
would need to be changed first, which would result in too much code churn.
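
For what it is worth, a sketch of one possible middle ground (the HWCAP_STR
helper below is hypothetical, not something in this patch) that keeps the
existing upper-case constants and the lower-case user-visible strings while
still tying each string to its bit definition in a single place:

/* Hypothetical helper - constant and string are still spelled out explicitly */
#define HWCAP_STR(cap, str)	[KERNEL_HWCAP_##cap] = str

static const char *const hwcap_str[] = {
	HWCAP_STR(FP,      "fp"),
	HWCAP_STR(ASIMD,   "asimd"),
	HWCAP_STR(EVTSTRM, "evtstrm"),
	/* ... */
};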


Re: [PATCH V3] arm64/cpuinfo: Define HWCAP name arrays per their actual bit definitions

2020-09-07 Thread Anshuman Khandual



On 09/07/2020 05:46 PM, Will Deacon wrote:
> On Mon, Aug 17, 2020 at 05:34:23PM +0530, Anshuman Khandual wrote:
>> HWCAP name arrays (hwcap_str, compat_hwcap_str, compat_hwcap2_str) that are
>> scanned for /proc/cpuinfo are detached from their bit definitions making it
>> vulnerable and difficult to correlate. It is also bit problematic because
>> during /proc/cpuinfo dump these arrays get traversed sequentially assuming
>> they reflect and match actual HWCAP bit sequence, to test various features
>> for a given CPU. This redefines name arrays per their HWCAP bit definitions
>> . It also warns after detecting any feature which is not expected on arm64.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Brown 
>> Cc: Dave Martin 
>> Cc: Ard Biesheuvel 
>> Cc: Mark Rutland 
>> Cc: Suzuki K Poulose 
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>> This applies on 5.9-rc1
>>
>> Mark, since the patch has changed I have dropped your Acked-by: tag. Are you
>> happy to give a new one ?
>>
>> Changes in V3:
>>
>> - Moved name arrays to (arch/arm64/kernel/cpuinfo.c) to prevent a build 
>> warning
>> - Replaced string values with NULL for all compat features not possible on 
>> arm64
>> - Changed compat_hwcap_str[] iteration on size as some NULL values are 
>> expected
>> - Warn once after detecting any feature on arm64 that is not expected
>>
>> Changes in V2: (https://patchwork.kernel.org/patch/11533755/)
>>
>> - Defined COMPAT_KERNEL_HWCAP[2] and updated the name arrays per Mark
>> - Updated the commit message as required
>>
>> Changes in V1: (https://patchwork.kernel.org/patch/11532945/)
>>
>>  arch/arm64/include/asm/hwcap.h |   9 +++
>>  arch/arm64/kernel/cpuinfo.c| 172 
>> ++---
>>  2 files changed, 100 insertions(+), 81 deletions(-)
> 
> [...]
> 
>> +[KERNEL_HWCAP_FP]   = "fp",
>> +[KERNEL_HWCAP_ASIMD]= "asimd",
>> +[KERNEL_HWCAP_EVTSTRM]  = "evtstrm",
>> +[KERNEL_HWCAP_AES]  = "aes",
> 
> It would be nice if the cap and the string were generated by the same
> macro, along the lines of:
> 
> #define KERNEL_HWCAP(c)   [KERNEL_HWCAP_##c] = #c,
> 
> Does making the constants mixed case break anything, or is it just really
> churny to do?

Currently all existing HWCAP feature strings are lower case, and the above
change would make them upper case instead. I could not find a way to force
convert #c into a lower-case string constant in the macro definition. Wouldn't
changing the HWCAP string case here break the user interface ?

> 
>> @@ -166,9 +167,18 @@ static int c_show(struct seq_file *m, void *v)
>>  seq_puts(m, "Features\t:");
>>  if (compat) {
>>  #ifdef CONFIG_COMPAT
>> -for (j = 0; compat_hwcap_str[j]; j++)
>> -if (compat_elf_hwcap & (1 << j))
>> +for (j = 0; j < ARRAY_SIZE(compat_hwcap_str); j++) {
>> +if (compat_elf_hwcap & (1 << j)) {
>> +/*
>> + * Warn once if any feature should not
>> + * have been present on arm64 platform.
>> + */
>> +if (WARN_ON_ONCE(!compat_hwcap_str[j]))
>> +continue;
>> +
>>  seq_printf(m, " %s", 
>> compat_hwcap_str[j]);
>> +}
>> +}
>>  
>>  for (j = 0; compat_hwcap2_str[j]; j++)
> 
> Hmm, I find this pretty confusing now as compat_hwcap_str is not NULL
> terminated and must be traversed with a loop bounded by ARRAY_SIZE(...),

Right. That's because unlike before, it can now have some intermediate NULL
entries. Hence NULL-sentinel-based traversal won't be possible any more.


> whereas compat_hwcap2_str *is* NULL terminated and is traversed until you
> hit the sentinel.
> 
> I think hwcap_str, compat_hwcap_str and compat_hwcap2_str should be
> identical in this regard.

Sure, will make the traversal based on ARRAY_SIZE() for all three arrays
here, to make that uniform.

> 
> Will
> 


[PATCH] arm64/mm/ptdump: Add address markers for BPF regions

2020-09-04 Thread Anshuman Khandual
Kernel virtual region [BPF_JIT_REGION_START..BPF_JIT_REGION_END] is missing
from address_markers[], hence relevant page table entries are not displayed
with /sys/kernel/debug/kernel_page_tables. This adds those missing markers.
While here, also rename arch/arm64/mm/dump.c, which sounds a bit ambiguous,
to arch/arm64/mm/ptdump.c instead.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Ard Biesheuvel 
Cc: Steven Price 
Cc: Andrew Morton 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/Makefile | 2 +-
 arch/arm64/mm/{dump.c => ptdump.c} | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)
 rename arch/arm64/mm/{dump.c => ptdump.c} (99%)

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index d91030f0ffee..2a1d275cd4d7 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -4,7 +4,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   ioremap.o mmap.o pgd.o mmu.o \
   context.o proc.o pageattr.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
-obj-$(CONFIG_PTDUMP_CORE)  += dump.o
+obj-$(CONFIG_PTDUMP_CORE)  += ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)   += ptdump_debugfs.o
 obj-$(CONFIG_NUMA) += numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)+= physaddr.o
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/ptdump.c
similarity index 99%
rename from arch/arm64/mm/dump.c
rename to arch/arm64/mm/ptdump.c
index 0b8da1cc1c07..265284dc942d 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -41,6 +41,8 @@ static struct addr_marker address_markers[] = {
{ 0 /* KASAN_SHADOW_START */,   "Kasan shadow start" },
{ KASAN_SHADOW_END, "Kasan shadow end" },
 #endif
+   { BPF_JIT_REGION_START, "BPF start" },
+   { BPF_JIT_REGION_END,   "BPF end" },
{ MODULES_VADDR,"Modules start" },
{ MODULES_END,  "Modules end" },
{ VMALLOC_START,"vmalloc() area" },
-- 
2.20.1



Re: Warning on Kernel 5.9.0-rc1 on PowerBook G4 (ppc32), bisected to a5c3b9ffb0f4

2020-08-31 Thread Anshuman Khandual



On 08/29/2020 06:40 AM, Larry Finger wrote:
> In kernel 5.9.0-rc1 on a PowerBook G4 (ppc32), several warnings of the 
> following type are logged:
> 
>  [ cut here ]
>  WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x20/0x100

Did all those warnings trigger at the same place, i.e.
arch/powerpc/mm/pgtable.c:185 ?

>  Modules linked in:
>  CPU: 0 PID: 1 Comm: swapper Not tainted 5.9.0-rc2 #2
>  NIP:  c002add4 LR: c07dba40 CTR: 
>  REGS: f1019d70 TRAP: 0700   Not tainted  (5.9.0-rc2)
>  MSR:  00029032   CR: 22000888  XER: 
> 
>    GPR00: c07dba40 f1019e28 eeca3220 eef7ace0 4e999000 eef7d664 f1019e50 
> 
>    GPR08: 007c2315 0001 007c2315 f1019e48 22000888  c00054dc 
> 
>    GPR16:   2ef7d000 07c2 fff0 eef7b000 04e8 
> eef7d000
>    GPR24: eef7c5c0  007c2315 4e999000 c05ef548 eef7d664 c087cda8 
> 007c2315
>  NIP [c002add4] set_pte_at+0x20/0x100
>  LR [c07dba40] debug_vm_pgtable+0x29c/0x654
>  Call Trace:
>  [f1019e28] [c002b4ac] pte_fragment_alloc+0x24/0xe4 (unreliable)
>  [f1019e48] [c07dba40] debug_vm_pgtable+0x29c/0x654
>  [f1019e98] [c0005160] do_one_initcall+0x70/0x158
>  [f1019ef8] [c07c352c] kernel_init_freeable+0x1f4/0x1f8
>  [f1019f28] [c00054f0] kernel_init+0x14/0xfc
>  [f1019f38] [c001516c] ret_from_kernel_thread+0x14/0x1c
>  Instruction dump:
>  57ff053e 39610010 7c63fa14 4800308c 9421ffe0 7c0802a6 8125 bfa10014
>  7cbd2b78 90010024 552907fe 83e6 <0f09> 3d20c089 83c91280 813e0018
>  ---[ end trace 4ef67686e5133716 ]---
> 
> Although the warnings do no harm, I suspect that they should be fixed in case 
> some future modification turns the warning statements into BUGS.

These warnings are from the mm/debug_vm_pgtable.c test and won't be converted
into BUGs. Nonetheless, they need to be addressed.

> 
> The problem was bisected to commit a5c3b9ffb0f4 ("mm/debug_vm_pgtable: add 
> tests validating advanced arch page table helpers") by Anshuman Khandual 
> 

There are some known issues wrt DEBUG_VM_PGTABLE on certain ppc64 platforms, but
I thought it worked all right on ppc32 platforms. Adding Christophe Leroy
here. Currently, there is a series under review that makes DEBUG_VM_PGTABLE work
correctly on ppc64 platforms. Could you please give it a try and see if it fixes
these warnings ?

https://patchwork.kernel.org/project/linux-mm/list/?series=339387
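
In case it helps with reproducing locally, the test in question is the one
compiled in via the standard config symbols (just a reminder, nothing new):

CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGTABLE=y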

- Anshuman


Re: [PATCH V2] arm64/hotplug: Improve memory offline event notifier

2020-08-23 Thread Anshuman Khandual



On 08/24/2020 09:34 AM, Anshuman Khandual wrote:
> This brings about three different changes to the sole memory event notifier
> for the arm64 platform and improves its robustness while also enhancing debug
> capabilities during potential memory offlining error conditions.
> 
> This moves the memory notifier registration bit earlier in the boot process
> from device_initcall() to setup_arch() which will help in guarding against
> potential early boot memory offline requests.
> 
> This enables MEM_OFFLINE memory event handling. It will help intercept any
> possible error condition such as if boot memory somehow still got offlined
> even after an explicit notifier failure, potentially by a future change in
> the generic hotplug framework. This would help detect such scenarios and help
> debug further.
> 
> It also adds a validation function which scans entire boot memory and makes
> sure that early memory sections are online. This check is essential for the
> memory notifier to work properly as it cannot prevent boot memory offlining
> if they are not online to begin with. But this additional sanity check is
> enabled only with DEBUG_VM.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Wrong email address here for Will.

+ Will Deacon 

s/w...@kernel.com/w...@kernel.org next time around.


[PATCH V2] arm64/hotplug: Improve memory offline event notifier

2020-08-23 Thread Anshuman Khandual
This brings about three different changes to the sole memory event notifier
for the arm64 platform and improves its robustness while also enhancing debug
capabilities during potential memory offlining error conditions.

This moves the memory notifier registration a bit earlier in the boot process
from device_initcall() to setup_arch() which will help in guarding against
potential early boot memory offline requests.

This enables MEM_OFFLINE memory event handling. It will help intercept any
possible error condition such as if boot memory somehow still got offlined
even after an explicit notifier failure, potentially by a future change in
the generic hotplug framework. This would help detect such scenarios and help
debug further.

It also adds a validation function which scans entire boot memory and makes
sure that early memory sections are online. This check is essential for the
memory notifier to work properly as it cannot prevent boot memory offlining
if they are not online to begin with. But this additional sanity check is
enabled only with DEBUG_VM.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Steve Capper 
Cc: Mark Brown 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This applies on 5.9-rc2

Changes in V2:

- Dropped all generic changes wrt MEM_CANCEL_OFFLINE reasons enumeration
- Dropped all related (processing MEM_CANCEL_OFFLINE reasons) changes on arm64
- Added validate_boot_mem_online_state() that gets called with early_initcall()
- Added CONFIG_MEMORY_HOTREMOVE check before registering memory notifier
- Moved notifier registration i.e memory_hotremove_notifier into setup_arch()

Changes in V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=271237)

 arch/arm64/include/asm/mmu.h |   8 +++
 arch/arm64/kernel/setup.c|   8 +++
 arch/arm64/mm/mmu.c  | 108 ---
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index a7a5ecaa2e83..b7e99b528766 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -73,6 +73,14 @@ static inline struct bp_hardening_data 
*arm64_get_bp_hardening_data(void)
 static inline void arm64_apply_bp_hardening(void)  { }
 #endif /* CONFIG_HARDEN_BRANCH_PREDICTOR */
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void memory_hotremove_notifier(void);
+#else
+static inline void memory_hotremove_notifier(void)
+{
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
 extern void arm64_memblock_init(void);
 extern void paging_init(void);
 extern void bootmem_init(void);
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 77c4c9bad1b8..44406c9f8d83 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -376,6 +376,14 @@ void __init __no_sanitize_address setup_arch(char 
**cmdline_p)
"This indicates a broken bootloader or old kernel\n",
boot_args[1], boot_args[2], boot_args[3]);
}
+
+   /*
+* Register the memory notifier which will prevent boot
+* memory offlining requests - early enough. But there
+* should not be any actual offlinig request till memory
+* block devices are initialized with memory_dev_init().
+*/
+   memory_hotremove_notifier();
 }
 
 static inline bool cpu_can_disable(unsigned int cpu)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..8cdb0b02089f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1482,13 +1482,40 @@ static int prevent_bootmem_remove_notifier(struct 
notifier_block *nb,
unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
unsigned long pfn = arg->start_pfn;
 
-   if (action != MEM_GOING_OFFLINE)
+   if ((action != MEM_GOING_OFFLINE) && (action != MEM_OFFLINE))
return NOTIFY_OK;
 
-   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-   ms = __pfn_to_section(pfn);
-   if (early_section(ms))
-   return NOTIFY_BAD;
+   if (action == MEM_GOING_OFFLINE) {
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+   if (early_section(ms)) {
+   pr_warn("Boot memory offlining attempted\n");
+   return NOTIFY_BAD;
+   }
+   }
+   } else if (action == MEM_OFFLINE) {
+   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+   ms = __pfn_to_section(pfn);
+   if (early_section(ms)) {
+
+   /*
+* This should have never happened. Boot memory
+* offlining should have been prevented by this
+

Re: [RFC/RFT PATCH 1/6] numa: Move numa implementation to common code

2020-08-19 Thread Anshuman Khandual



On 08/20/2020 12:48 AM, Atish Patra wrote:
> On Tue, Aug 18, 2020 at 8:19 PM Anshuman Khandual
>  wrote:
>>
>>
>>
>> On 08/15/2020 03:17 AM, Atish Patra wrote:
>>> ARM64 numa implementation is generic enough that RISC-V can reuse that
>>> implementation with very minor cosmetic changes. This will help both
>>> ARM64 and RISC-V in terms of maintanace and feature improvement
>>>
>>> Move the numa implementation code to common directory so that both ISAs
>>> can reuse this. This doesn't introduce any function changes for ARM64.
>>>
>>> Signed-off-by: Atish Patra 
>>> ---
>>>  arch/arm64/Kconfig|  1 +
>>>  arch/arm64/include/asm/numa.h | 45 +---
>>>  arch/arm64/mm/Makefile|  1 -
>>>  drivers/base/Kconfig  |  6 +++
>>>  drivers/base/Makefile |  1 +
>>>  .../mm/numa.c => drivers/base/arch_numa.c |  0
>>>  include/asm-generic/numa.h| 51 +++
>>>  7 files changed, 60 insertions(+), 45 deletions(-)
>>>  rename arch/arm64/mm/numa.c => drivers/base/arch_numa.c (100%)
>>>  create mode 100644 include/asm-generic/numa.h
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index 6d232837cbee..955a0cf75b16 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -960,6 +960,7 @@ config HOTPLUG_CPU
>>>  # Common NUMA Features
>>>  config NUMA
>>>   bool "NUMA Memory Allocation and Scheduler Support"
>>> + select GENERIC_ARCH_NUMA
>>
>> So this introduces a generic NUMA framework selectable with 
>> GENERIC_ARCH_NUMA.
>>
>>>   select ACPI_NUMA if ACPI
>>>   select OF_NUMA
>>>   help
>>> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
>>> index 626ad01e83bf..8c8cf4297cc3 100644
>>> --- a/arch/arm64/include/asm/numa.h
>>> +++ b/arch/arm64/include/asm/numa.h
>>> @@ -3,49 +3,6 @@
>>>  #define __ASM_NUMA_H
>>>
>>>  #include 
>>> -
>>> -#ifdef CONFIG_NUMA
>>> -
>>> -#define NR_NODE_MEMBLKS  (MAX_NUMNODES * 2)
>>> -
>>> -int __node_distance(int from, int to);
>>> -#define node_distance(a, b) __node_distance(a, b)
>>> -
>>> -extern nodemask_t numa_nodes_parsed __initdata;
>>> -
>>> -extern bool numa_off;
>>> -
>>> -/* Mappings between node number and cpus on that node. */
>>> -extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>>> -void numa_clear_node(unsigned int cpu);
>>> -
>>> -#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>>> -const struct cpumask *cpumask_of_node(int node);
>>> -#else
>>> -/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
>>> -static inline const struct cpumask *cpumask_of_node(int node)
>>> -{
>>> - return node_to_cpumask_map[node];
>>> -}
>>> -#endif
>>> -
>>> -void __init arm64_numa_init(void);
>>> -int __init numa_add_memblk(int nodeid, u64 start, u64 end);
>>> -void __init numa_set_distance(int from, int to, int distance);
>>> -void __init numa_free_distance(void);
>>> -void __init early_map_cpu_to_node(unsigned int cpu, int nid);
>>> -void numa_store_cpu_info(unsigned int cpu);
>>> -void numa_add_cpu(unsigned int cpu);
>>> -void numa_remove_cpu(unsigned int cpu);
>>> -
>>> -#else/* CONFIG_NUMA */
>>> -
>>> -static inline void numa_store_cpu_info(unsigned int cpu) { }
>>> -static inline void numa_add_cpu(unsigned int cpu) { }
>>> -static inline void numa_remove_cpu(unsigned int cpu) { }
>>> -static inline void arm64_numa_init(void) { }
>>> -static inline void early_map_cpu_to_node(unsigned int cpu, int nid) { }
>>> -
>>> -#endif   /* CONFIG_NUMA */
>>> +#include 
>>>
>>>  #endif   /* __ASM_NUMA_H */
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index d91030f0ffee..928c308b044b 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -6,7 +6,6 @@ obj-y := dma-mapping.o extable.o 
>>> fault.o init.o \
>>>  obj-$(CONFIG_HUGETLB_PAGE)   += hugetlbpage.o
>>>  obj-$(CONFIG_PTDUMP_CORE)+= dump.o
>>>  obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_d

Re: [PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-08-19 Thread Anshuman Khandual



On 08/18/2020 05:56 PM, Jonathan Cameron wrote:
> On Tue, 18 Aug 2020 15:11:58 +0530
> Anshuman Khandual  wrote:
> 
>> On 08/18/2020 02:43 PM, Jonathan Cameron wrote:
>>> On Mon, 17 Aug 2020 14:49:43 +0530
>>> Anshuman Khandual  wrote:
>>>   
>>>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>>>> manner during various phases of a given PMD. It is derived from a previous
>>>> detailed discussion on this topic [1] and present THP documentation [2].
>>>>
>>>> pmd_present(pmd):
>>>>
>>>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>>>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>>>
>>>> pmd_trans_huge(pmd):
>>>>
>>>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>>>
>>>> -------------------------------------------------------------------------
>>>> |  PMD states      |   pmd_present    |   pmd_trans_huge  |
>>>> -------------------------------------------------------------------------
>>>> |  Mapped          |   Yes            |   Yes             |
>>>> -------------------------------------------------------------------------
>>>> |  Splitting       |   Yes            |   Yes             |
>>>> -------------------------------------------------------------------------
>>>> |  Migration/Swap  |   No             |   No              |
>>>> -------------------------------------------------------------------------
>>>>
>>>> The problem:
>>>>
>>>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>>>> invalidation clears PMD_SECT_VALID as below.
>>>>
>>>> PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID
>>>>
>>>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>>>> on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
>>>> affirm pmd_present() as true during the THP split process. To comply with
>>>> above mentioned semantics, pmd_trans_huge() should also check pmd_present()
>>>> first before testing presence of an actual transparent huge mapping.
>>>>
>>>> The solution:
>>>>
>>>> Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
>>>> bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
>>>> it will not be there for pmd_present() check after pmdp_invalidate().
>>>>
>>>> A new software defined PMD_PRESENT_INVALID (bit 59) can be set on the PMD
>>>> entry during invalidation which can help pmd_present() return true and in
>>>> recognizing the fact that it still points to memory.
>>>>
>>>> This bit is transient. During the split process it will be overridden by a
>>>> page table page representing normal pages in place of erstwhile huge page.
>>>> Other pmdp_invalidate() callers always write a fresh PMD value on the entry
>>>> overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.
>>>>
>>>> [1]: https://lkml.org/lkml/2018/10/17/231
>>>> [2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt  
>>>
>>> Hi Anshuman,
>>>
>>> One query on this.  From my reading of the ARM ARM, bit 59 is not
>>> an ignored bit.  The exact requirements for hardware to be using
>>> it are a bit complex though.
>>>
>>> It 'might' be safe to use it for this, but if so can we have a comment
>>> explaining why.  Also more than possible I'm misunderstanding things!   
>>
We are using this bit 59 only when the entry is not active from the MMU
perspective, i.e. PMD_SECT_VALID is clear.
>>
> 
> Understood. I guess we ran out of bits that were always ignored so had
> to start using ones that are ignored in this particular state.

Right, there are no more available SW PTE bits.

#define PTE_DIRTY   (_AT(pteval_t, 1) << 55)
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
#define PTE_DEVMAP  (_AT(pteval_t, 1) << 57)
#define PTE_PROT_NONE   (_AT(pteval_t, 1) << 58) /* only when 
!PTE_VALID */

Earlier I had proposed using PTE_SPECIAL at the PMD level for this purpose.
But Catalin prefers these unused bits, as the entry is invalid anyway, which
also leaves PTE_SPECIAL free at the mapped PMD level for later use.
There is already a comment near the PMD_PRESENT_INVALID definition which
explains this situation.

+/*
+ * This help indicate that the entry is present i.e pmd_page()
+ * still points to a valid huge page in memory even if the pmd
+ * has been invalidated.
+ */
+#define PMD_PRESENT_INVALID(_AT(pteval_t, 1) << 59) /* only when 
!PMD_SECT_VALID */


Re: [PATCH] mm: Fix missing function declaration

2020-08-19 Thread Anshuman Khandual



On 08/19/2020 01:30 PM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> The compilation with CONFIG_DEBUG_RODATA_TEST set produces the following
> warning due to the missing include.
> 
>  mm/rodata_test.c:15:6: warning: no previous prototype for 'rodata_test' 
> [-Wmissing-prototypes]
> 15 | void rodata_test(void)
>   |  ^~~
> 
> Fixes: 2959a5f726f6 ("mm: add arch-independent testcases for RODATA")
> Signed-off-by: Leon Romanovsky 

This build warning appears only with W=1 and gets fixed with this.

Reviewed-by: Anshuman Khandual 

> ---
>  mm/rodata_test.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/rodata_test.c b/mm/rodata_test.c
> index 2a99df7beeb3..2613371945b7 100644
> --- a/mm/rodata_test.c
> +++ b/mm/rodata_test.c
> @@ -7,6 +7,7 @@
>   */
>  #define pr_fmt(fmt) "rodata_test: " fmt
> 
> +#include 
>  #include 
>  #include 
> 
> --
> 2.26.2
> 
> 
> 


Re: [PATCH v2 2/2] mm/pageblock: remove false sharing in pageblock_flags

2020-08-19 Thread Anshuman Khandual



On 08/19/2020 11:17 AM, Alex Shi wrote:
> Current pageblock_flags is only 4 bits, so it has to share a char size
> in cmpxchg when get set, the false sharing cause perf drop.
> 
> If we incrase the bits up to 8, false sharing would gone in cmpxchg. and
> the only cost is half char per pageblock, which is half char per 128MB
> on x86, 4 chars in 1 GB.

Agreed that the increase in memory utilization is negligible here, but does
this really improve performance ?


Re: [PATCH v2 1/2] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags

2020-08-19 Thread Anshuman Khandual



On 08/19/2020 11:17 AM, Alex Shi wrote:
> pageblock_flags is used as long, since every pageblock_flags is just 4
> bits, 'long' size will include 8(32bit machine) or 16 pageblocks' flags,
> that flag setting has to sync in cmpxchg with 7 or 15 other pageblock
> flags. It would cause long waiting for sync.
> 
> If we could change the pageblock_flags variable as char, we could use
> char size cmpxchg, which just sync up with 2 pageblock flags. it could
> relief much false sharing in cmpxchg.

Do you have numbers demonstrating the claimed performance improvement
after this change ?


Re: [RFC/RFT PATCH 1/6] numa: Move numa implementation to common code

2020-08-18 Thread Anshuman Khandual



On 08/15/2020 03:17 AM, Atish Patra wrote:
> ARM64 numa implementation is generic enough that RISC-V can reuse that
> implementation with very minor cosmetic changes. This will help both
> ARM64 and RISC-V in terms of maintenance and feature improvement
> 
> Move the numa implementation code to common directory so that both ISAs
> can reuse this. This doesn't introduce any function changes for ARM64.
> 
> Signed-off-by: Atish Patra 
> ---
>  arch/arm64/Kconfig|  1 +
>  arch/arm64/include/asm/numa.h | 45 +---
>  arch/arm64/mm/Makefile|  1 -
>  drivers/base/Kconfig  |  6 +++
>  drivers/base/Makefile |  1 +
>  .../mm/numa.c => drivers/base/arch_numa.c |  0
>  include/asm-generic/numa.h| 51 +++
>  7 files changed, 60 insertions(+), 45 deletions(-)
>  rename arch/arm64/mm/numa.c => drivers/base/arch_numa.c (100%)
>  create mode 100644 include/asm-generic/numa.h
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6d232837cbee..955a0cf75b16 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -960,6 +960,7 @@ config HOTPLUG_CPU
>  # Common NUMA Features
>  config NUMA
>   bool "NUMA Memory Allocation and Scheduler Support"
> + select GENERIC_ARCH_NUMA

So this introduces a generic NUMA framework selectable with GENERIC_ARCH_NUMA.

>   select ACPI_NUMA if ACPI
>   select OF_NUMA
>   help
> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
> index 626ad01e83bf..8c8cf4297cc3 100644
> --- a/arch/arm64/include/asm/numa.h
> +++ b/arch/arm64/include/asm/numa.h
> @@ -3,49 +3,6 @@
>  #define __ASM_NUMA_H
>  
>  #include 
> -
> -#ifdef CONFIG_NUMA
> -
> -#define NR_NODE_MEMBLKS  (MAX_NUMNODES * 2)
> -
> -int __node_distance(int from, int to);
> -#define node_distance(a, b) __node_distance(a, b)
> -
> -extern nodemask_t numa_nodes_parsed __initdata;
> -
> -extern bool numa_off;
> -
> -/* Mappings between node number and cpus on that node. */
> -extern cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
> -void numa_clear_node(unsigned int cpu);
> -
> -#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> -const struct cpumask *cpumask_of_node(int node);
> -#else
> -/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
> -static inline const struct cpumask *cpumask_of_node(int node)
> -{
> - return node_to_cpumask_map[node];
> -}
> -#endif
> -
> -void __init arm64_numa_init(void);
> -int __init numa_add_memblk(int nodeid, u64 start, u64 end);
> -void __init numa_set_distance(int from, int to, int distance);
> -void __init numa_free_distance(void);
> -void __init early_map_cpu_to_node(unsigned int cpu, int nid);
> -void numa_store_cpu_info(unsigned int cpu);
> -void numa_add_cpu(unsigned int cpu);
> -void numa_remove_cpu(unsigned int cpu);
> -
> -#else/* CONFIG_NUMA */
> -
> -static inline void numa_store_cpu_info(unsigned int cpu) { }
> -static inline void numa_add_cpu(unsigned int cpu) { }
> -static inline void numa_remove_cpu(unsigned int cpu) { }
> -static inline void arm64_numa_init(void) { }
> -static inline void early_map_cpu_to_node(unsigned int cpu, int nid) { }
> -
> -#endif   /* CONFIG_NUMA */
> +#include 
>  
>  #endif   /* __ASM_NUMA_H */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index d91030f0ffee..928c308b044b 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -6,7 +6,6 @@ obj-y := dma-mapping.o extable.o 
> fault.o init.o \
>  obj-$(CONFIG_HUGETLB_PAGE)   += hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)+= dump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
> -obj-$(CONFIG_NUMA)   += numa.o
>  obj-$(CONFIG_DEBUG_VIRTUAL)  += physaddr.o
>  KASAN_SANITIZE_physaddr.o+= n
>  
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 8d7001712062..73c2151de194 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -210,4 +210,10 @@ config GENERIC_ARCH_TOPOLOGY
> appropriate scaling, sysfs interface for reading capacity values at
> runtime.
>  
> +config GENERIC_ARCH_NUMA
> + bool
> + help
> +   Enable support for generic numa implementation. Currently, RISC-V
> +   and ARM64 uses it.
> +
>  endmenu
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index 157452080f3d..c3d02c644222 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -23,6 +23,7 @@ obj-$(CONFIG_PINCTRL) += pinctrl.o
>  obj-$(CONFIG_DEV_COREDUMP) += devcoredump.o
>  obj-$(CONFIG_GENERIC_MSI_IRQ_DOMAIN) += platform-msi.o
>  obj-$(CONFIG_GENERIC_ARCH_TOPOLOGY) += arch_topology.o
> +obj-$(CONFIG_GENERIC_ARCH_NUMA) += arch_numa.o
>  
>  obj-y+= test/
>  
> diff --git a/arch/arm64/mm/numa.c b/drivers/base/arch_numa.c
> similarity index 100%
> rename from arch/arm64/mm/numa.c
> rename to 

Re: [PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-08-18 Thread Anshuman Khandual



On 08/18/2020 02:43 PM, Jonathan Cameron wrote:
> On Mon, 17 Aug 2020 14:49:43 +0530
> Anshuman Khandual  wrote:
> 
>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>> manner during various phases of a given PMD. It is derived from a previous
>> detailed discussion on this topic [1] and present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>
>> pmd_trans_huge(pmd):
>>
>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>
>> -------------------------------------------------------------------------
>> |    PMD states      |   pmd_present    |   pmd_trans_huge  |
>> -------------------------------------------------------------------------
>> |    Mapped          |   Yes            |   Yes             |
>> -------------------------------------------------------------------------
>> |    Splitting       |   Yes            |   Yes             |
>> -------------------------------------------------------------------------
>> |    Migration/Swap  |   No             |   No              |
>> -------------------------------------------------------------------------
>>
>> The problem:
>>
>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>> invalidation clears PMD_SECT_VALID as below.
>>
>> PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID
>>
>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>> on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
>> affirm pmd_present() as true during the THP split process. To comply with
>> above mentioned semantics, pmd_trans_huge() should also check pmd_present()
>> first before testing presence of an actual transparent huge mapping.
>>
>> The solution:
>>
>> Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
>> bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
>> it will not be there for pmd_present() check after pmdp_invalidate().
>>
>> A new software defined PMD_PRESENT_INVALID (bit 59) can be set on the PMD
>> entry during invalidation which can help pmd_present() return true and in
>> recognizing the fact that it still points to memory.
>>
>> This bit is transient. During the split process it will be overridden by a
>> page table page representing normal pages in place of erstwhile huge page.
>> Other pmdp_invalidate() callers always write a fresh PMD value on the entry
>> overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.
>>
>> [1]: https://lkml.org/lkml/2018/10/17/231
>> [2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt
> 
> Hi Anshuman,
> 
> One query on this.  From my reading of the ARM ARM, bit 59 is not
> an ignored bit.  The exact requirements for hardware to be using
> it are a bit complex though.
> 
> It 'might' be safe to use it for this, but if so can we have a comment
> explaining why.  Also more than possible I'm misunderstanding things! 

We are using this bit 59 only when the entry is not active from the MMU
perspective, i.e. PMD_SECT_VALID is clear.


Re: [PATCH] mm/hotplug: Enumerate memory range offlining failure reasons

2020-08-18 Thread Anshuman Khandual



On 08/18/2020 11:35 AM, Michal Hocko wrote:
> On Tue 18-08-20 09:52:02, Anshuman Khandual wrote:
>> Currently a debug message is printed describing the reason for memory range
>> offline failure. This just enumerates existing reason codes which improves
>> overall readability and makes it cleaner. This does not add any functional
>> change.
> 
> Wasn't something like that posted already? To be honest I do not think

There was a similar one regarding bad page reason.

https://patchwork.kernel.org/patch/11464713/

> this is worth the additional LOC. We are talking about few strings used
> at a single place. I really do not see any simplification, constants are
> sometimes even longer than the strings they are describing.

I am still trying to understand why enumerating all potential offline
failure reasons in a single place (i.e. via an enum) is not a better idea
than strings scattered across the function. Besides being cleaner, it
classifies, organizes and provides a structure to the set of reasons.
It is not just about replacing strings with constants.

> 
>> Cc: Andrew Morton 
>> Cc: David Hildenbrand 
>> Cc: Michal Hocko 
>> Cc: Dan Williams 
>> Cc: linux...@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>> This is based on 5.9-rc1
>>
>>  include/linux/memory.h | 28 
>>  mm/memory_hotplug.c| 18 +-
>>  2 files changed, 37 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/memory.h b/include/linux/memory.h
>> index 439a89e758d8..4b52d706edc1 100644
>> --- a/include/linux/memory.h
>> +++ b/include/linux/memory.h
>> @@ -44,6 +44,34 @@ int set_memory_block_size_order(unsigned int order);
>>  #define MEM_CANCEL_ONLINE   (1<<4)
>>  #define MEM_CANCEL_OFFLINE  (1<<5)
>>  
>> +/*
>> + * Memory offline failure reasons
>> + */
>> +enum offline_failure_reason {
>> +OFFLINE_FAILURE_MEMHOLES,
>> +OFFLINE_FAILURE_MULTIZONE,
>> +OFFLINE_FAILURE_ISOLATE,
>> +OFFLINE_FAILURE_NOTIFIER,
>> +OFFLINE_FAILURE_SIGNAL,
>> +OFFLINE_FAILURE_UNMOVABLE,
>> +OFFLINE_FAILURE_DISSOLVE,
>> +};
>> +
>> +static const char *const offline_failure_names[] = {
>> +[OFFLINE_FAILURE_MEMHOLES]  = "memory holes",
>> +[OFFLINE_FAILURE_MULTIZONE] = "multizone range",
>> +[OFFLINE_FAILURE_ISOLATE]   = "failure to isolate range",
>> +[OFFLINE_FAILURE_NOTIFIER]  = "notifier failure",
>> +[OFFLINE_FAILURE_SIGNAL]= "signal backoff",
>> +[OFFLINE_FAILURE_UNMOVABLE] = "unmovable page",
>> +[OFFLINE_FAILURE_DISSOLVE]  = "failure to dissolve huge pages",
>> +};
>> +
>> +static inline const char *offline_failure(enum offline_failure_reason 
>> reason)
>> +{
>> +return offline_failure_names[reason];
>> +}
>> +
>>  struct memory_notify {
>>  unsigned long start_pfn;
>>  unsigned long nr_pages;
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index e9d5ab5d3ca0..b3fa36a09d7f 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1484,7 +1484,7 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>  unsigned long flags;
>>  struct zone *zone;
>>  struct memory_notify arg;
>> -char *reason;
>> +enum offline_failure_reason reason;
>>  
>>  mem_hotplug_begin();
>>  
>> @@ -1500,7 +1500,7 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>count_system_ram_pages_cb);
>>  if (nr_pages != end_pfn - start_pfn) {
>>  ret = -EINVAL;
>> -reason = "memory holes";
>> +reason = OFFLINE_FAILURE_MEMHOLES;
>>  goto failed_removal;
>>  }
>>  
>> @@ -1509,7 +1509,7 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>  zone = test_pages_in_a_zone(start_pfn, end_pfn);
>>  if (!zone) {
>>  ret = -EINVAL;
>> -reason = "multizone range";
>> +reason = OFFLINE_FAILURE_MULTIZONE;
>>  goto failed_removal;
>>  }
>>  node = zone_to_nid(zone);
>> @@ -1519,7 +1519,7 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>> MIGRATE_MOVABLE,
>> 

[PATCH] mm/hotplug: Enumerate memory range offlining failure reasons

2020-08-17 Thread Anshuman Khandual
Currently a debug message is printed describing the reason for a memory range
offline failure. This just replaces the existing reason strings with enumerated
reason codes, which improves overall readability and makes the code cleaner.
This does not add any functional change.

Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This is based on 5.9-rc1

 include/linux/memory.h | 28 
 mm/memory_hotplug.c| 18 +-
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/include/linux/memory.h b/include/linux/memory.h
index 439a89e758d8..4b52d706edc1 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -44,6 +44,34 @@ int set_memory_block_size_order(unsigned int order);
 #defineMEM_CANCEL_ONLINE   (1<<4)
 #defineMEM_CANCEL_OFFLINE  (1<<5)
 
+/*
+ * Memory offline failure reasons
+ */
+enum offline_failure_reason {
+   OFFLINE_FAILURE_MEMHOLES,
+   OFFLINE_FAILURE_MULTIZONE,
+   OFFLINE_FAILURE_ISOLATE,
+   OFFLINE_FAILURE_NOTIFIER,
+   OFFLINE_FAILURE_SIGNAL,
+   OFFLINE_FAILURE_UNMOVABLE,
+   OFFLINE_FAILURE_DISSOLVE,
+};
+
+static const char *const offline_failure_names[] = {
+   [OFFLINE_FAILURE_MEMHOLES]  = "memory holes",
+   [OFFLINE_FAILURE_MULTIZONE] = "multizone range",
+   [OFFLINE_FAILURE_ISOLATE]   = "failure to isolate range",
+   [OFFLINE_FAILURE_NOTIFIER]  = "notifier failure",
+   [OFFLINE_FAILURE_SIGNAL]= "signal backoff",
+   [OFFLINE_FAILURE_UNMOVABLE] = "unmovable page",
+   [OFFLINE_FAILURE_DISSOLVE]  = "failure to dissolve huge pages",
+};
+
+static inline const char *offline_failure(enum offline_failure_reason reason)
+{
+   return offline_failure_names[reason];
+}
+
 struct memory_notify {
unsigned long start_pfn;
unsigned long nr_pages;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e9d5ab5d3ca0..b3fa36a09d7f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1484,7 +1484,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
unsigned long flags;
struct zone *zone;
struct memory_notify arg;
-   char *reason;
+   enum offline_failure_reason reason;
 
mem_hotplug_begin();
 
@@ -1500,7 +1500,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
  count_system_ram_pages_cb);
if (nr_pages != end_pfn - start_pfn) {
ret = -EINVAL;
-   reason = "memory holes";
+   reason = OFFLINE_FAILURE_MEMHOLES;
goto failed_removal;
}
 
@@ -1509,7 +1509,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
zone = test_pages_in_a_zone(start_pfn, end_pfn);
if (!zone) {
ret = -EINVAL;
-   reason = "multizone range";
+   reason = OFFLINE_FAILURE_MULTIZONE;
goto failed_removal;
}
node = zone_to_nid(zone);
@@ -1519,7 +1519,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
   MIGRATE_MOVABLE,
   MEMORY_OFFLINE | REPORT_FAILURE);
if (ret < 0) {
-   reason = "failure to isolate range";
+   reason = OFFLINE_FAILURE_ISOLATE;
goto failed_removal;
}
nr_isolate_pageblock = ret;
@@ -1531,7 +1531,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
ret = memory_notify(MEM_GOING_OFFLINE, );
ret = notifier_to_errno(ret);
if (ret) {
-   reason = "notifier failure";
+   reason = OFFLINE_FAILURE_NOTIFIER;
goto failed_removal_isolated;
}
 
@@ -1540,7 +1540,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
do {
if (signal_pending(current)) {
ret = -EINTR;
-   reason = "signal backoff";
+   reason = OFFLINE_FAILURE_SIGNAL;
goto failed_removal_isolated;
}
 
@@ -1558,7 +1558,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
} while (!ret);
 
if (ret != -ENOENT) {
-   reason = "unmovable page";
+   reason = OFFLINE_FAILURE_UNMOVABLE;
goto failed_removal_isolated;
}
 
@@ -1569,7 +1569,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 */
ret = dissolve_free_huge_pages(start_pfn, end_pfn);
if (ret) {
-   reason = 

[PATCH V3] arm64/cpuinfo: Define HWCAP name arrays per their actual bit definitions

2020-08-17 Thread Anshuman Khandual
HWCAP name arrays (hwcap_str, compat_hwcap_str, compat_hwcap2_str) that are
scanned for /proc/cpuinfo are detached from their bit definitions, making them
fragile and difficult to correlate. It is also a bit problematic because during
a /proc/cpuinfo dump these arrays get traversed sequentially, assuming they
reflect and match the actual HWCAP bit sequence, to test various features for
a given CPU. This redefines the name arrays per their HWCAP bit definitions.
It also warns after detecting any feature which is not expected on arm64.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Brown 
Cc: Dave Martin 
Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Suzuki K Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This applies on 5.9-rc1

Mark, since the patch has changed I have dropped your Acked-by: tag. Are you
happy to give a new one ?

Changes in V3:

- Moved name arrays to (arch/arm64/kernel/cpuinfo.c) to prevent a build warning
- Replaced string values with NULL for all compat features not possible on arm64
- Changed compat_hwcap_str[] iteration on size as some NULL values are expected
- Warn once after detecting any feature on arm64 that is not expected

Changes in V2: (https://patchwork.kernel.org/patch/11533755/)

- Defined COMPAT_KERNEL_HWCAP[2] and updated the name arrays per Mark
- Updated the commit message as required

Changes in V1: (https://patchwork.kernel.org/patch/11532945/)

 arch/arm64/include/asm/hwcap.h |   9 +++
 arch/arm64/kernel/cpuinfo.c| 172 ++---
 2 files changed, 100 insertions(+), 81 deletions(-)

diff --git a/arch/arm64/include/asm/hwcap.h b/arch/arm64/include/asm/hwcap.h
index 22f73fe..6493a4c 100644
--- a/arch/arm64/include/asm/hwcap.h
+++ b/arch/arm64/include/asm/hwcap.h
@@ -8,18 +8,27 @@
 #include 
 #include 
 
+#define COMPAT_HWCAP_SWP   (1 << 0)
 #define COMPAT_HWCAP_HALF  (1 << 1)
 #define COMPAT_HWCAP_THUMB (1 << 2)
+#define COMPAT_HWCAP_26BIT (1 << 3)
 #define COMPAT_HWCAP_FAST_MULT (1 << 4)
+#define COMPAT_HWCAP_FPA   (1 << 5)
 #define COMPAT_HWCAP_VFP   (1 << 6)
 #define COMPAT_HWCAP_EDSP  (1 << 7)
+#define COMPAT_HWCAP_JAVA  (1 << 8)
+#define COMPAT_HWCAP_IWMMXT(1 << 9)
+#define COMPAT_HWCAP_CRUNCH(1 << 10)
+#define COMPAT_HWCAP_THUMBEE   (1 << 11)
 #define COMPAT_HWCAP_NEON  (1 << 12)
 #define COMPAT_HWCAP_VFPv3 (1 << 13)
+#define COMPAT_HWCAP_VFPV3D16  (1 << 14)
 #define COMPAT_HWCAP_TLS   (1 << 15)
 #define COMPAT_HWCAP_VFPv4 (1 << 16)
 #define COMPAT_HWCAP_IDIVA (1 << 17)
 #define COMPAT_HWCAP_IDIVT (1 << 18)
 #define COMPAT_HWCAP_IDIV  (COMPAT_HWCAP_IDIVA|COMPAT_HWCAP_IDIVT)
+#define COMPAT_HWCAP_VFPD32(1 << 19)
 #define COMPAT_HWCAP_LPAE  (1 << 20)
 #define COMPAT_HWCAP_EVTSTRM   (1 << 21)
 
diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 393c6fb..382cb4c 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -43,94 +43,95 @@ static const char *icache_policy_str[] = {
 unsigned long __icache_flags;
 
 static const char *const hwcap_str[] = {
-   "fp",
-   "asimd",
-   "evtstrm",
-   "aes",
-   "pmull",
-   "sha1",
-   "sha2",
-   "crc32",
-   "atomics",
-   "fphp",
-   "asimdhp",
-   "cpuid",
-   "asimdrdm",
-   "jscvt",
-   "fcma",
-   "lrcpc",
-   "dcpop",
-   "sha3",
-   "sm3",
-   "sm4",
-   "asimddp",
-   "sha512",
-   "sve",
-   "asimdfhm",
-   "dit",
-   "uscat",
-   "ilrcpc",
-   "flagm",
-   "ssbs",
-   "sb",
-   "paca",
-   "pacg",
-   "dcpodp",
-   "sve2",
-   "sveaes",
-   "svepmull",
-   "svebitperm",
-   "svesha3",
-   "svesm4",
-   "flagm2",
-   "frint",
-   "svei8mm",
-   "svef32mm",
-   "svef64mm",
-   "svebf16",
-   "i8mm",
-   "bf16",
-   "dgh",
-   "rng",
-   "bti",
+   [KERNEL_HWCAP_FP]   = "fp",
+   [KERNEL_HWCAP_ASIMD]= "asimd",
+   [KERNEL_HWCAP_EVTSTRM]  = "evtstrm",
+   [KERNEL_HWCAP_AES]  = "aes",
+   [KERNEL_HWCAP_PMULL]= "pmull",
+   [KERNEL_HWCAP_SHA1]  

[PATCH 2/2] arm64/mm: Enable THP migration

2020-08-17 Thread Anshuman Khandual
In certain page migration situations, a THP page can be migrated without
being split into its constituent subpages. This saves the time required to
split a THP and put it back together when required. It also preserves the
wider address range translation covered by a single TLB entry, reducing
future page fault costs.

A previous patch changed the platform THP helpers per generic memory semantics,
clearing the path for THP migration support. This adds two more THP helpers
required to create PMD migration swap entries. Now just enable THP migration
via ARCH_ENABLE_THP_MIGRATION.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/Kconfig   | 4 
 arch/arm64/include/asm/pgtable.h | 5 +
 2 files changed, 9 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..e21b94061780 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1876,6 +1876,10 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
def_bool y
depends on HUGETLB_PAGE && MIGRATION
 
+config ARCH_ENABLE_THP_MIGRATION
+   def_bool y
+   depends on TRANSPARENT_HUGEPAGE
+
 menu "Power management options"
 
 source "kernel/power/Kconfig"
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7aa69cace784..c54334bca4e2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -875,6 +875,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct 
*vma,
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val(pte) })
 #define __swp_entry_to_pte(swp)((pte_t) { (swp).val })
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val(pmd) })
+#define __swp_entry_to_pmd(swp)__pmd((swp).val)
+#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 /*
  * Ensure that there are not more swap files than can be encoded in the kernel
  * PTEs.
-- 
2.20.1



[PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-08-17 Thread Anshuman Khandual
pmd_present() and pmd_trans_huge() are expected to behave in the following
manner during various phases of a given PMD. This is derived from a previous
detailed discussion on the topic [1] and the current THP documentation [2].

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)

pmd_trans_huge(pmd):

- Returns true if pmd refers to system RAM and is a trans huge mapping

--------------------------------------------------------------
|   PMD states      |   pmd_present     |   pmd_trans_huge  |
--------------------------------------------------------------
|   Mapped          |   Yes             |   Yes             |
--------------------------------------------------------------
|   Splitting       |   Yes             |   Yes             |
--------------------------------------------------------------
|   Migration/Swap  |   No              |   No              |
--------------------------------------------------------------

The problem:

PMD is first invalidated with pmdp_invalidate() before it is split. This
invalidation clears PMD_SECT_VALID as below.

PMD Split -> pmdp_invalidate() -> pmd_mkinvalid -> Clears PMD_SECT_VALID

Once PMD_SECT_VALID gets cleared, pmd_present() returns false for the PMD
entry. Another bit apart from PMD_SECT_VALID is needed to re-affirm
pmd_present() as true during the THP split process. To comply with the above
mentioned semantics, pmd_trans_huge() should also check pmd_present() first
before testing for an actual transparent huge mapping.

The solution:

Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
it will not be there for pmd_present() check after pmdp_invalidate().

A new software defined PMD_PRESENT_INVALID (bit 59) can be set on the PMD
entry during invalidation which can help pmd_present() return true and in
recognizing the fact that it still points to memory.

This bit is transient. During the split process it will be overridden by a
page table page representing normal pages in place of the erstwhile huge page.
Other pmdp_invalidate() callers always write a fresh PMD value to the entry,
overriding this transient PMD_PRESENT_INVALID bit, which makes it safe.

[1]: https://lkml.org/lkml/2018/10/17/231
[2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/include/asm/pgtable-prot.h |  7 ++
 arch/arm64/include/asm/pgtable.h  | 34 ---
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h 
b/arch/arm64/include/asm/pgtable-prot.h
index 4d867c6446c4..28792fdd9627 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,13 @@
 #define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE  (_AT(pteval_t, 1) << 58) /* only when 
!PTE_VALID */
 
+/*
+ * This help indicate that the entry is present i.e pmd_page()
+ * still points to a valid huge page in memory even if the pmd
+ * has been invalidated.
+ */
+#define PMD_PRESENT_INVALID(_AT(pteval_t, 1) << 59) /* only when 
!PMD_SECT_VALID */
+
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d5d3fbe73953..7aa69cace784 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -145,6 +145,18 @@ static inline pte_t set_pte_bit(pte_t pte, pgprot_t prot)
return pte;
 }
 
+static inline pmd_t clr_pmd_bit(pmd_t pmd, pgprot_t prot)
+{
+   pmd_val(pmd) &= ~pgprot_val(prot);
+   return pmd;
+}
+
+static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
+{
+   pmd_val(pmd) |= pgprot_val(prot);
+   return pmd;
+}
+
 static inline pte_t pte_wrprotect(pte_t pte)
 {
pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
@@ -363,15 +375,24 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
+
+static inline int pmd_present(pmd_t pmd)
+{
+   return pte_present(pmd_pte(pmd)) || pmd_present_invalid(pmd);
+}
+
 /*
  * THP definitions.
  */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_trans_huge(pmd)(pmd_val(pmd) && !(pmd_val(pmd) & 
PMD_TABLE_BIT))
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+   return pmd_val(pmd) && pmd_present(pmd) && !(pmd_val(pmd) & 
PMD_TABLE_BIT);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define pmd_

[PATCH 0/2] arm64/mm: Enable THP migration

2020-08-17 Thread Anshuman Khandual
This series enables THP migration on arm64 via ARCH_ENABLE_THP_MIGRATION.
But first this modifies all existing THP helpers like pmd_present() and
pmd_trans_huge() etc per expected generic memory semantics as concluded
from a previous discussion here.

https://lkml.org/lkml/2018/10/9/220

This series is based on v5.9-rc1.

Changes in V1:

- Used new PMD_PRESENT_INVALID (bit 59) to represent invalidated PMD state per 
Catalin

Changes in RFC V2: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=302965)

- Used PMD_TABLE_BIT to represent splitting PMD state per Catalin

Changes in RFC V1: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=138797)

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Suzuki Poulose 
Cc: Zi Yan 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  arm64/mm: Change THP helpers to comply with generic MM semantics
  arm64/mm: Enable THP migration

 arch/arm64/Kconfig|  4 +++
 arch/arm64/include/asm/pgtable-prot.h |  7 +
 arch/arm64/include/asm/pgtable.h  | 39 ---
 3 files changed, 47 insertions(+), 3 deletions(-)

-- 
2.20.1



Re: [RFC V2 1/2] arm64/mm: Change THP helpers per generic memory semantics

2020-08-16 Thread Anshuman Khandual



On 07/07/2020 11:14 PM, Catalin Marinas wrote:
> On Mon, Jul 06, 2020 at 09:27:04AM +0530, Anshuman Khandual wrote:
>> On 07/02/2020 05:41 PM, Catalin Marinas wrote:
>>> On Mon, Jun 15, 2020 at 06:45:17PM +0530, Anshuman Khandual wrote:
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -353,15 +353,92 @@ static inline int pmd_protnone(pmd_t pmd)
>>>>  }
>>>>  #endif
>>>>  
>>>> +#define pmd_table(pmd)((pmd_val(pmd) & PMD_TYPE_MASK) ==  
>>>> PMD_TYPE_TABLE)
>>>> +#define pmd_sect(pmd) ((pmd_val(pmd) & PMD_TYPE_MASK) ==  
>>>> PMD_TYPE_SECT)
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>  /*
>>>> - * THP definitions.
>>>> + * PMD Level Encoding (THP Enabled)
>>>> + *
>>>> + * 0b00 - Not valid   Not present NA
>>>> + * 0b10 - Not valid   Present Huge  (Splitting)
>>>> + * 0b01 - Valid   Present Huge  (Mapped)
>>>> + * 0b11 - Valid   Present Table (Mapped)
>>>>   */
>>>
>>> I wonder whether it would be easier to read if we add a dedicated
>>> PMD_SPLITTING bit, only when bit 0 is cleared. This bit can be high (say
>>> 59), it doesn't really matter as the entry is not valid.
>>
>> Could make (PMD[0b00] = 0b10) be represented as PMD_SPLITTING just for
>> better reading purpose. But if possible, IMHO it is efficient and less
>> vulnerable to use HW defined PTE attribute bit positions including SW
>> usable ones than the reserved bits, for a PMD state representation.
>>
>> Earlier proposal used PTE_SPECIAL (bit 56) instead. Using PMD_TABLE_BIT
>> helps save bit 56 for later. Thinking about it again, would not these
>> unused higher bits [59..63] create any problem ? For example while
>> enabling THP swapping without split via ARCH_WANTS_THP_SWAP or something
>> else later when these higher bits might be required. I am not sure, just
>> speculating.
> 
> The swap encoding goes to bit 57, so going higher shouldn't be an issue.
> 
>> But, do you see any particular problem with PMD_TABLE_BIT ?
> 
> No. Only that we have some precedent like PTE_PROT_NONE (bit 58) and
> wondering whether we could use a high bit as well here. If we can get
> them to overlap, it simplifies this patch further.
> 
>>> The only doubt I have is that pmd_mkinvalid() is used in other contexts
>>> when it's not necessarily splitting a pmd (search for the
>>> pmdp_invalidate() calls). So maybe a better name like PMD_PRESENT with a
>>> comment that pmd_to_page() is valid (i.e. no migration or swap entry).
>>> Feel free to suggest a better name.
>>
>> PMD_INVALID_PRESENT sounds better ?
> 
> No strong opinion either way. Yours is clearer.
> 
>>>> +static inline pmd_t pmd_mksplitting(pmd_t pmd)
>>>> +{
>>>> +  unsigned long val = pmd_val(pmd);
>>>>  
>>>> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> -#define pmd_trans_huge(pmd)   (pmd_val(pmd) && !(pmd_val(pmd) & 
>>>> PMD_TABLE_BIT))
>>>> +  return __pmd((val & ~PMD_TYPE_MASK) | PMD_TABLE_BIT);
>>>> +}
>>>> +
>>>> +static inline pmd_t pmd_clrsplitting(pmd_t pmd)
>>>> +{
>>>> +  unsigned long val = pmd_val(pmd);
>>>> +
>>>> +  return __pmd((val & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
>>>> +}
>>>> +
>>>> +static inline bool pmd_splitting(pmd_t pmd)
>>>> +{
>>>> +  unsigned long val = pmd_val(pmd);
>>>> +
>>>> +  if ((val & PMD_TYPE_MASK) == PMD_TABLE_BIT)
>>>> +  return true;
>>>> +  return false;
>>>> +}
>>>> +
>>>> +static inline bool pmd_mapped(pmd_t pmd)
>>>> +{
>>>> +  return pmd_sect(pmd);
>>>> +}
>>>> +
>>>> +static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>>> +{
>>>> +  /*
>>>> +   * Invalidation should not have been invoked on
>>>> +   * a PMD table entry. Just warn here otherwise.
>>>> +   */
>>>> +  WARN_ON(pmd_table(pmd));
>>>> +  return pmd_mksplitting(pmd);
>>>> +}
>>>
>>> And here we wouldn't need t worry about table checks.
>>
>> This is just a temporary sanity check validating the assumption
>> that a table entry would never be called with pmdp_invalidate().
>&

Re: [PATCH -next] arm64: Export __cpu_logical_map

2020-07-24 Thread Anshuman Khandual



On 07/24/2020 03:00 PM, Catalin Marinas wrote:
> On Fri, Jul 24, 2020 at 01:46:18PM +0530, Anshuman Khandual wrote:
>> On 07/24/2020 08:38 AM, Kefeng Wang wrote:
>>> On 2020/7/24 11:04, Kefeng Wang wrote:
>>>> ERROR: modpost: "__cpu_logical_map" [drivers/cpufreq/tegra194-cpufreq.ko] 
>>>> undefined!
>>>>
>>>> ARM64 tegra194-cpufreq driver use cpu_logical_map, export
>>>> __cpu_logical_map to fix build issue.
>> Commit 887d5fc82cb4 ("cpufreq: Add Tegra194 cpufreq driver") which adds
>> this particular driver is present just on linux-next. But as expected,
>> the driver does not use __cpu_logical_map directly but instead accesses
>> it via cpu_logical_map() wrapper. Wondering, how did you even trigger
>> the modpost error ?
> Since the wrapper is a macro, it just expands to __cpu_logical_map[].
> 

Ahh, right. Existing cpu_logical_map() is not a true wrapper and it
makes sense to convert that into one.


Re: [PATCH -next] arm64: Export __cpu_logical_map

2020-07-24 Thread Anshuman Khandual



On 07/24/2020 03:05 PM, Catalin Marinas wrote:
> On Fri, Jul 24, 2020 at 10:13:52AM +0100, Mark Rutland wrote:
>> On Fri, Jul 24, 2020 at 01:46:18PM +0530, Anshuman Khandual wrote:
>>> On 07/24/2020 08:38 AM, Kefeng Wang wrote:
>>>>> Reported-by: Hulk Robot 
>>>>> Signed-off-by: Kefeng Wang 
>>>>> ---
>>>>>  arch/arm64/kernel/setup.c | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>>>>> index c793276ec7ad9..3aea05fbb9998 100644
>>>>> --- a/arch/arm64/kernel/setup.c
>>>>> +++ b/arch/arm64/kernel/setup.c
>>>>> @@ -275,6 +275,7 @@ static int __init 
>>>>> reserve_memblock_reserved_regions(void)
>>>>>  arch_initcall(reserve_memblock_reserved_regions);
>>>>>  u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = 
>>>>> INVALID_HWID };
>>>>> +EXPORT_SYMBOL(__cpu_logical_map);
>>
>> If modules are using cpu_logical_map(), this looks sane ot me, but I
>> wonder if we should instead turn cpu_logical_map() into a C wrapper in
>> smp.c, or at least mark __cpu_logical_map as __ro_after_init lest
>> someone have the bright idea to fiddle with it.
> 
> I'd go for a C wrapper and also change a couple of instances where we
> assign a value directly to cpu_logical_map(cpu).

Probably also create a set_cpu_logical_map(cpu, hwid) for those instances
as well.
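
For illustration, a rough sketch of what such a conversion could look like.
This is a sketch under assumptions only: the placement, the export and the
set_cpu_logical_map() name are not the final upstream change, and the existing
cpu_logical_map() macro in asm/smp.h would have to be dropped in favour of the
function declaration.

/* Sketch: turn the cpu_logical_map() macro into a real C wrapper */
u64 cpu_logical_map(unsigned int cpu)
{
	return __cpu_logical_map[cpu];
}
EXPORT_SYMBOL_GPL(cpu_logical_map);

/* And a setter for the few places that currently assign to it directly */
void set_cpu_logical_map(unsigned int cpu, u64 hwid)
{
	__cpu_logical_map[cpu] = hwid;
}

With that in place, __cpu_logical_map itself could stay private to the core
code, and marking it __ro_after_init could be considered separately.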


Re: [PATCH -next] arm64: Export __cpu_logical_map

2020-07-24 Thread Anshuman Khandual


On 07/24/2020 08:38 AM, Kefeng Wang wrote:
> +maillist

This does not seem to be a correct method of posting any patch.

> 
> On 2020/7/24 11:04, Kefeng Wang wrote:
>> ERROR: modpost: "__cpu_logical_map" [drivers/cpufreq/tegra194-cpufreq.ko] 
>> undefined!


>>
>> ARM64 tegra194-cpufreq driver use cpu_logical_map, export
>> __cpu_logical_map to fix build issue.

Commit 887d5fc82cb4 ("cpufreq: Add Tegra194 cpufreq driver"), which adds
this particular driver, is present just on linux-next. But as expected,
the driver does not use __cpu_logical_map directly and instead accesses
it via the cpu_logical_map() wrapper. I am wondering how you even
triggered the modpost error.

>>
>> Reported-by: Hulk Robot 
>> Signed-off-by: Kefeng Wang 
>> ---
>>   arch/arm64/kernel/setup.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>> index c793276ec7ad9..3aea05fbb9998 100644
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -275,6 +275,7 @@ static int __init reserve_memblock_reserved_regions(void)
>>   arch_initcall(reserve_memblock_reserved_regions);
>>     u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
>> +EXPORT_SYMBOL(__cpu_logical_map);
>>     void __init setup_arch(char **cmdline_p)
>>   {
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 


Re: [PATCH v2 2/4] mm/hugetlb.c: Remove the unnecessary non_swap_entry()

2020-07-23 Thread Anshuman Khandual



On 07/23/2020 11:44 AM, Baoquan He wrote:
> On 07/23/20 at 10:36am, Anshuman Khandual wrote:
>>
>>
>> On 07/23/2020 08:52 AM, Baoquan He wrote:
>>> The checking is_migration_entry() and is_hwpoison_entry() are stricter
>>> than non_swap_entry(), means they have covered the conditional check
>>> which non_swap_entry() is doing.
>>
>> They are no stricter as such but implicitly contains non_swap_entry() in 
>> itself.
>> If a swap entry tests positive for either is_[migration|hwpoison]_entry(), 
>> then
>> its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and 
>> SWP_HWPOISON.
>> All these types >= MAX_SWAPFILES, exactly what is asserted with 
>> non_swap_entry().
>>
>>>
>>> Hence remove the unnecessary non_swap_entry() in 
>>> is_hugetlb_entry_migration()
>>> and is_hugetlb_entry_hwpoisoned() to simplify code.
>>>
>>> Signed-off-by: Baoquan He 
>>> Reviewed-by: Mike Kravetz 
>>> Reviewed-by: David Hildenbrand 
>>> ---
>>>  mm/hugetlb.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 3569e731e66b..c14837854392 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -3748,7 +3748,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
>>> if (huge_pte_none(pte) || pte_present(pte))
>>> return false;
>>> swp = pte_to_swp_entry(pte);
>>> -   if (non_swap_entry(swp) && is_migration_entry(swp))
>>> +   if (is_migration_entry(swp))
>>> return true;
>>> else
>>> return false;
>>> @@ -3761,7 +3761,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
>>> if (huge_pte_none(pte) || pte_present(pte))
>>> return false;
>>> swp = pte_to_swp_entry(pte);
>>> -   if (non_swap_entry(swp) && is_hwpoison_entry(swp))
>>> +   if (is_hwpoison_entry(swp))
>>> return true;
>>> else
>>> return false;
>>>
>>
>> It would be better if the commit message contains details about
>> the existing redundant check. But either way.
> 
> Thanks for your advice. Do you think updating the log as below is OK?
> 
> 
> If a swap entry tests positive for either is_[migration|hwpoison]_entry(), 
> then
> its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and 
> SWP_HWPOISON.
> All these types >= MAX_SWAPFILES, exactly what is asserted with 
> non_swap_entry().
> 
> So the checking non_swap_entry() in is_hugetlb_entry_migration() and
> is_hugetlb_entry_hwpoisoned() is redundant.
> 
> Let's remove it to optimize code.
> 

Something like above would be good.
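
For readers following along, a tiny self-contained model of the containment
argument. The numeric values below are placeholders, not the kernel's actual
macro expansions; only the ordering relative to MAX_SWAPFILES matters, which
mirrors how the special swap types are carved out above it in the headers:

#include <assert.h>

/* Placeholder values -- only the ordering above MAX_SWAPFILES matters */
#define MAX_SWAPFILES		29
#define SWP_HWPOISON		(MAX_SWAPFILES + 0)
#define SWP_MIGRATION_READ	(MAX_SWAPFILES + 1)
#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 2)

static int non_swap_entry(unsigned int type)	{ return type >= MAX_SWAPFILES; }
static int is_hwpoison_entry(unsigned int type)	{ return type == SWP_HWPOISON; }
static int is_migration_entry(unsigned int type)
{
	return type == SWP_MIGRATION_READ || type == SWP_MIGRATION_WRITE;
}

int main(void)
{
	for (unsigned int type = 0; type < 64; type++)
		if (is_migration_entry(type) || is_hwpoison_entry(type))
			assert(non_swap_entry(type));	/* always holds */
	return 0;
}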


Re: [PATCH v2 4/4] mm/hugetl.c: warn out if expected count of huge pages adjustment is not achieved

2020-07-23 Thread Anshuman Khandual



On 07/23/2020 08:52 AM, Baoquan He wrote:
> A customer complained that no message is logged when the number of
> persistent huge pages is not changed to the exact value written to
> the sysfs or proc nr_hugepages file.
> 
> In the current code, a best effort is made to satisfy requests made
> via the nr_hugepages file.  However, requests may be only partially
> satisfied.
> 
> Log a message if the code was unsuccessful in fully satisfying a
> request. This includes both increasing and decreasing the number
> of persistent huge pages.

But is the kernel expected to warn in all such situations where the user
requested resources could not be allocated completely? Otherwise, it does
not make sense to add a warning for just this one situation.


Re: [PATCH v2 3/4] doc/vm: fix typo in the hugetlb admin documentation

2020-07-22 Thread Anshuman Khandual



On 07/23/2020 08:52 AM, Baoquan He wrote:
> Change 'pecify' to 'Specify'.
> 
> Signed-off-by: Baoquan He 
> Reviewed-by: Mike Kravetz 
> Reviewed-by: David Hildenbrand 
> ---
>  Documentation/admin-guide/mm/hugetlbpage.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst 
> b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 015a5f7d7854..f7b1c7462991 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -131,7 +131,7 @@ hugepages
>   parameter is preceded by an invalid hugepagesz parameter, it will
>   be ignored.
>  default_hugepagesz
> - pecify the default huge page size.  This parameter can
> + Specify the default huge page size.  This parameter can
>   only be specified once on the command line.  default_hugepagesz can
>   optionally be followed by the hugepages parameter to preallocate a
>   specific number of huge pages of default size.  The number of default
> 

This does not apply on 5.8-rc6 and the original typo seems to be absent
there as well. This section was introduced recently with the following commit.

 282f4214384e ("hugetlbfs: clean up command line processing")


Re: [PATCH v2 2/4] mm/hugetlb.c: Remove the unnecessary non_swap_entry()

2020-07-22 Thread Anshuman Khandual



On 07/23/2020 08:52 AM, Baoquan He wrote:
> The checking is_migration_entry() and is_hwpoison_entry() are stricter
> than non_swap_entry(), means they have covered the conditional check
> which non_swap_entry() is doing.

They are not stricter as such but implicitly contain the non_swap_entry()
check. If a swap entry tests positive for either is_[migration|hwpoison]_entry(),
then its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and
SWP_HWPOISON. All these types are >= MAX_SWAPFILES, which is exactly what
non_swap_entry() asserts.

> 
> Hence remove the unnecessary non_swap_entry() in is_hugetlb_entry_migration()
> and is_hugetlb_entry_hwpoisoned() to simplify code.
> 
> Signed-off-by: Baoquan He 
> Reviewed-by: Mike Kravetz 
> Reviewed-by: David Hildenbrand 
> ---
>  mm/hugetlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3569e731e66b..c14837854392 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3748,7 +3748,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
>   if (huge_pte_none(pte) || pte_present(pte))
>   return false;
>   swp = pte_to_swp_entry(pte);
> - if (non_swap_entry(swp) && is_migration_entry(swp))
> + if (is_migration_entry(swp))
>   return true;
>   else
>   return false;
> @@ -3761,7 +3761,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
>   if (huge_pte_none(pte) || pte_present(pte))
>   return false;
>   swp = pte_to_swp_entry(pte);
> - if (non_swap_entry(swp) && is_hwpoison_entry(swp))
> + if (is_hwpoison_entry(swp))
>   return true;
>   else
>   return false;
> 

It would be better if the commit message contains details about
the existing redundant check. But either way.

Reviewed-by: Anshuman Khandual 


Re: [PATCH v2 1/4] mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool

2020-07-22 Thread Anshuman Khandual



On 07/23/2020 08:52 AM, Baoquan He wrote:
> Just like its neighbour is_hugetlb_entry_migration() has done.
> 
> Signed-off-by: Baoquan He 
> Reviewed-by: Mike Kravetz 
> Reviewed-by: David Hildenbrand 
> ---
>  mm/hugetlb.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24acb3af741..3569e731e66b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3754,17 +3754,17 @@ bool is_hugetlb_entry_migration(pte_t pte)
>   return false;
>  }
>  
> -static int is_hugetlb_entry_hwpoisoned(pte_t pte)
> +static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
>  {
>   swp_entry_t swp;
>  
>   if (huge_pte_none(pte) || pte_present(pte))
> - return 0;
> + return false;
>   swp = pte_to_swp_entry(pte);
>   if (non_swap_entry(swp) && is_hwpoison_entry(swp))
> - return 1;
> + return true;
>   else
> - return 0;
> + return false;
>  }
>  
>  int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> 

Reviewed-by: Anshuman Khandual 


Re: [PATCH] arm64: mm: free unused memmap for sparse memory model that define VMEMMAP

2020-07-22 Thread Anshuman Khandual



On 07/21/2020 01:02 PM, Wei Li wrote:
> For the memory hole, sparse memory model that define SPARSEMEM_VMEMMAP
> do not free the reserved memory for the page map, this patch do it.
> 
> Signed-off-by: Wei Li 
> Signed-off-by: Chen Feng 
> Signed-off-by: Xia Qing 
> ---
>  arch/arm64/mm/init.c | 81 
> +---
>  1 file changed, 71 insertions(+), 10 deletions(-)

This patch does not compile on 5.8-rc6 with defconfig.


Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-20 Thread Anshuman Khandual



On 07/17/2020 10:32 PM, Mike Kravetz wrote:
> On 7/16/20 10:02 PM, Anshuman Khandual wrote:
>>
>>
>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>> From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz 
>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>  hstate
>>>
>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>> or as the result of hugetlb command line processing.  In either case,
>>> this is late enough in the init process that all numa memory information
>>> should be initialized.  And, it is early enough to still use early
>>> memory allocator.
>>
>> This assumes that hugetlb_add_hstate() is called from the arch code at
>> the right point in time for the generic HugeTLB to do the required CMA
>> reservation which is not ideal. I guess it must have been a reason why
>> CMA reservation should always called by the platform code which knows
>> the boot sequence timing better.
> 
> Actually, the code does not make the assumption that hugetlb_add_hstate
> is called from arch specific huge page setup.  It can even be called later
> at the time of hugetlb command line processing.

Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate().
But then there is an explicit warning while trying to mix both the command
line options i.e hugepagesz= and hugetlb_cma=. The proposed code here have
not changed that behavior and hence the following warning should have been
triggered here as well.

1) hugepagesz_setup()
hugetlb_add_hstate()
 hugetlb_cma_reserve()

2) hugepages_setup()
hugetlb_hstate_alloc_pages()when order >= MAX_ORDER

if (hstate_is_gigantic(h)) {
if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip 
boot time allocation\n");
break;
}
if (!alloc_bootmem_huge_page(h))
break;
}

Nonetheless, it does not make sense to mix both memblock and CMA based huge
page pre-allocations. But looking at this again, could this warning ever have
been triggered until now? Only if a given platform calls hugetlb_cma_reserve()
before _setup("hugepages=", hugepages_setup). Anyway, there seem to be good
reasons to keep both memblock and CMA based pre-allocations in place. But
mixing them together (as done in the proposed code here) does not seem right.

> 
> My 'reasoning' is that gigantic pages can currently be preallocated from
> bootmem/memblock_alloc at the time of command line processing.  Therefore,
> we should be able to reserve bootmem for CMA at the same time.  Is there
> something wrong with this reasoning?  I tested this on x86 by removing the
> call to hugetlb_add_hstate from arch specific code and instead forced the
> call at command line processing time.  The ability to reserve CMA was the
> same.

There is no problem with that reasoning. A __setup() triggered function should
be able to perform the CMA reservation. But as pointed out before, it does not
make sense to mix CMA reservation and memblock based pre-allocation.

> 
> Yes, the CMA reservation interface says it should be called from arch
> specific code.  However, if we currently depend on the ability to do
> memblock_alloc at hugetlb command line processing time for gigantic page
> preallocation, then I think we can do the CMA reservation here as well.

IIUC, CMA reservation and memblock alloc have some differences in terms of
how the memory can be used later on; I will have to dig deeper into this. But
the comment section near cma_declare_contiguous_nid() is a concern.

 * This function reserves memory from early allocator. It should be
 * called by arch specific code once the early allocator (memblock or bootmem)
 * has been activated and all other subsystems have already allocated/reserved
 * memory. This function allows to create custom reserved areas.

> 
> Thinking about it some more, I suppose there could be some arch code that
> could call hugetlb_add_hstate too early in the boot process.  But, I do
> not think we have an issue with calling it too late.
> 

Calling it too late might mean the page allocator has already been initialized
completely, after which CMA reservation would no longer be possible. Also,
calling it too early could prevent other subsystems which might need memory
reservations in specific physical ranges.


Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-20 Thread Anshuman Khandual



On 07/17/2020 11:07 PM, Mike Kravetz wrote:
> On 7/17/20 2:51 AM, Anshuman Khandual wrote:
>>
>>
>> On 07/17/2020 02:06 PM, Will Deacon wrote:
>>> On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>>>> From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>>>> From: Mike Kravetz 
>>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>>>  hstate
>>>>>
>>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>>>> or as the result of hugetlb command line processing.  In either case,
>>>>> this is late enough in the init process that all numa memory information
>>>>> should be initialized.  And, it is early enough to still use early
>>>>> memory allocator.
>>>>
>>>> This assumes that hugetlb_add_hstate() is called from the arch code at
>>>> the right point in time for the generic HugeTLB to do the required CMA
>>>> reservation which is not ideal. I guess it must have been a reason why
>>>> CMA reservation should always called by the platform code which knows
>>>> the boot sequence timing better.
>>>
>>> Ha, except we've moved it around two or three times already in the last
>>> month or so, so I'd say we don't have a clue when to call it in the arch
>>> code.
>>
>> The arch dependency is not going way with this change either. Just that
>> its getting transferred to hugetlb_add_hstate() which gets called from
>> arch_initcall() in every architecture.
>>
>> The perfect timing here happens to be because of arch_initcall() instead.
>> This is probably fine, as long as
>>
>> 0. hugetlb_add_hstate() is always called at arch_initcall()
> 
> In another reply, I give reasoning why it would be safe to call even later
> at hugetlb command line processing time.

Understood, but there is a limited time window in which CMA reservation is
available, irrespective of whether it is called from arch or generic code.
Finding this right time window and ensuring that the N_MEMORY nodemask is
initialized is easier done in the platform code.

> 
>> 1. N_MEMORY mask is guaranteed to be initialized at arch_initcall()
> 
> This is a bit more difficult to guarantee.  I find the init sequence hard to
> understand.  Looking at the arm code, arch_initcall(hugetlbpage_init)
> happens after N_MEMORY mask is setup.  I can't imagine any arch code setting
> up huge pages before N_MEMORY.  But, I suppose it is possible and we would
> need to somehow guarantee this.

Ensuring that N_MEMORY nodemask is initialized from the generic code is even
more difficult.

> 
>> 2. CMA reservation is available to be called at arch_initcall()
> 
> Since I am pretty sure we can delay the reservation until hugetlb command
> line processing time, it would be great if it was always done there.

But moving the hugetlb CMA reservation entirely into command line processing
raises another concern about mixing it with the existing memblock based
pre-allocation.

> Unfortunately, I can not immediately think of an easy way to do this.
> 

It is reasonable to move the CMA reservation stuff into generic HugeTLB but
there are some challenges which need to be solved comprehensively. The patch
here from Barry does solve a short term problem (N_ONLINE ---> N_MEMORY) for
now, which IMHO should be considered. Moving CMA reservation into generic
HugeTLB would require some more thought and can be attempted later.


Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-17 Thread Anshuman Khandual



On 07/17/2020 02:06 PM, Will Deacon wrote:
> On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 07/16/2020 11:55 PM, Mike Kravetz wrote:
>>> From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz 
>>> Date: Tue, 14 Jul 2020 15:54:46 -0700
>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>>>  hstate
>>>
>>> Instead of calling hugetlb_cma_reserve() directly from arch specific
>>> code, call from hugetlb_add_hstate when adding a gigantic hstate.
>>> hugetlb_add_hstate is either called from arch specific huge page setup,
>>> or as the result of hugetlb command line processing.  In either case,
>>> this is late enough in the init process that all numa memory information
>>> should be initialized.  And, it is early enough to still use early
>>> memory allocator.
>>
>> This assumes that hugetlb_add_hstate() is called from the arch code at
>> the right point in time for the generic HugeTLB to do the required CMA
>> reservation which is not ideal. I guess it must have been a reason why
>> CMA reservation should always called by the platform code which knows
>> the boot sequence timing better.
> 
> Ha, except we've moved it around two or three times already in the last
> month or so, so I'd say we don't have a clue when to call it in the arch
> code.

The arch dependency is not going away with this change either. It just
gets transferred to hugetlb_add_hstate(), which gets called from
arch_initcall() in every architecture.

The right timing here happens to come from arch_initcall() instead.
This is probably fine, as long as

0. hugetlb_add_hstate() is always called at arch_initcall()
1. N_MEMORY mask is guaranteed to be initialized at arch_initcall()
2. CMA reservation is available to be called at arch_initcall()


Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-16 Thread Anshuman Khandual



On 07/16/2020 11:55 PM, Mike Kravetz wrote:
> From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001
> From: Mike Kravetz 
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic
>  hstate
> 
> Instead of calling hugetlb_cma_reserve() directly from arch specific
> code, call from hugetlb_add_hstate when adding a gigantic hstate.
> hugetlb_add_hstate is either called from arch specific huge page setup,
> or as the result of hugetlb command line processing.  In either case,
> this is late enough in the init process that all numa memory information
> should be initialized.  And, it is early enough to still use early
> memory allocator.

This assumes that hugetlb_add_hstate() is called from the arch code at
the right point in time for the generic HugeTLB to do the required CMA
reservation, which is not ideal. I guess there must have been a reason why
the CMA reservation should always be called by the platform code, which
knows the boot sequence timing better.


Re: [PATCH V5 1/4] mm/debug_vm_pgtable: Add tests validating arch helpers for core MM features

2020-07-16 Thread Anshuman Khandual



On 07/16/2020 07:44 PM, Steven Price wrote:
> On 13/07/2020 04:23, Anshuman Khandual wrote:
>> This adds new tests validating arch page table helpers for these following
>> core memory features. These tests create and test specific mapping types at
>> various page table levels.
>>
>> 1. SPECIAL mapping
>> 2. PROTNONE mapping
>> 3. DEVMAP mapping
>> 4. SOFTDIRTY mapping
>> 5. SWAP mapping
>> 6. MIGRATION mapping
>> 7. HUGETLB mapping
>> 8. THP mapping
>>
>> Cc: Andrew Morton 
>> Cc: Gerald Schaefer 
>> Cc: Christophe Leroy 
>> Cc: Mike Rapoport 
>> Cc: Vineet Gupta 
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: Heiko Carstens 
>> Cc: Vasily Gorbik 
>> Cc: Christian Borntraeger 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: "H. Peter Anvin" 
>> Cc: Kirill A. Shutemov 
>> Cc: Paul Walmsley 
>> Cc: Palmer Dabbelt 
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linuxppc-...@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux-ri...@lists.infradead.org
>> Cc: x...@kernel.org
>> Cc: linux...@kvack.org
>> Cc: linux-a...@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Tested-by: Vineet Gupta     #arc
>> Reviewed-by: Zi Yan 
>> Suggested-by: Catalin Marinas 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   mm/debug_vm_pgtable.c | 302 +-
>>   1 file changed, 301 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 61ab16fb2e36..2fac47db3eb7 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
> [...]
>> +
>> +static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot)
>> +{
>> +    swp_entry_t swp;
>> +    pte_t pte;
>> +
>> +    pte = pfn_pte(pfn, prot);
>> +    swp = __pte_to_swp_entry(pte);
> 
> Minor issue: this doesn't look necessarily valid - there's no reason a normal 
> PTE can be turned into a swp_entry. In practise this is likely to work on all 
> architectures because there's no reason not to use (at least) all the PFN 
> bits for the swap entry, but it doesn't exactly seem correct.

Agreed that it is a simple test, but nonetheless a valid one which
makes sure that the PFN value remains unchanged during the pte <---> swp
conversion.

> 
> Can we start with a swp_entry_t (from __swp_entry()) and check the round trip 
> of that?
> 
> It would also seem sensible to have a check that 
> is_swap_pte(__swp_entry_to_pte(__swp_entry(x,y))) is true.

From past experience, getting any of these new tests involving platform
helpers working on all existing enabled archs is neither trivial nor
going to be quick. The existing tests here are known to succeed on the
enabled platforms. Nonetheless, the tests proposed in the above suggestions
do make sense, but I will try to accommodate them in a later patch.


Re: [PATCH] riscv: Select ARCH_HAS_DEBUG_VM_PGTABLE

2020-07-14 Thread Anshuman Khandual



On 07/15/2020 02:56 AM, Emil Renner Berthing wrote:
> This allows the pgtable tests to be built.
> 
> Signed-off-by: Emil Renner Berthing 
> ---
> 
> The tests seem to succeed both in Qemu and on the HiFive Unleashed
> 
> Both with and without the recent additions in
> https://lore.kernel.org/linux-riscv/1594610587-4172-1-git-send-email-anshuman.khand...@arm.com/

That's great, thanks for testing.


[PATCH V5 4/4] Documentation/mm: Add descriptions for arch page table helpers

2020-07-12 Thread Anshuman Khandual
This adds a specific description file for all arch page table helpers which
is in sync with the semantics being tested via CONFIG_DEBUG_VM_PGTABLE. All
future changes either to these descriptions here or the debug test should
always remain in sync.

Cc: Jonathan Corbet 
Cc: Andrew Morton 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Rapoport 
Cc: Vineet Gupta 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Kirill A. Shutemov 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux-ri...@lists.infradead.org
Cc: x...@kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Mike Rapoport 
Suggested-by: Mike Rapoport 
Signed-off-by: Anshuman Khandual 
---
 Documentation/vm/arch_pgtable_helpers.rst | 258 ++
 mm/debug_vm_pgtable.c |   6 +
 2 files changed, 264 insertions(+)
 create mode 100644 Documentation/vm/arch_pgtable_helpers.rst

diff --git a/Documentation/vm/arch_pgtable_helpers.rst 
b/Documentation/vm/arch_pgtable_helpers.rst
new file mode 100644
index ..f3591ee3aaa8
--- /dev/null
+++ b/Documentation/vm/arch_pgtable_helpers.rst
@@ -0,0 +1,258 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _arch_page_table_helpers:
+
+===
+Architecture Page Table Helpers
+===
+
+Generic MM expects architectures (with MMU) to provide helpers to create, 
access
+and modify page table entries at various level for different memory functions.
+These page table helpers need to conform to a common semantics across 
platforms.
+Following tables describe the expected semantics which can also be tested 
during
+boot via CONFIG_DEBUG_VM_PGTABLE option. All future changes in here or the 
debug
+test need to be in sync.
+
+==
+PTE Page Table Helpers
+==
+
++---+--+
+| pte_same  | Tests whether both PTE entries are the same  
|
++---+--+
+| pte_bad   | Tests a non-table mapped PTE 
|
++---+--+
+| pte_present   | Tests a valid mapped PTE 
|
++---+--+
+| pte_young | Tests a young PTE
|
++---+--+
+| pte_dirty | Tests a dirty PTE
|
++---+--+
+| pte_write | Tests a writable PTE 
|
++---+--+
+| pte_special   | Tests a special PTE  
|
++---+--+
+| pte_protnone  | Tests a PROT_NONE PTE
|
++---+--+
+| pte_devmap| Tests a ZONE_DEVICE mapped PTE   
|
++---+--+
+| pte_soft_dirty| Tests a soft dirty PTE   
|
++---+--+
+| pte_swp_soft_dirty| Tests a soft dirty swapped PTE   
|
++---+--+
+| pte_mkyoung   | Creates a young PTE  
|
++---+--+
+| pte_mkold | Creates an old PTE   
|
++---+--+
+| pte_mkdirty   | Creates a dirty PTE  
|
++---+--+
+| pte_mkclean   | Creates a clean PTE  
|
++---+--+
+| pte_mkwrite   | Creates a writable PTE   
|
++---+--+
+| pte_m
