[PATCH V3] mm/page_alloc: Add alloc_contig_pages()

2019-10-17 Thread Anshuman Khandual
HugeTLB helper alloc_gigantic_page() implements a fairly generic allocation
method where it scans over various zones looking for a large contiguous pfn
range before trying to allocate it with alloc_contig_range(). Other than
deriving the requested order from 'struct hstate', there is nothing HugeTLB
specific in there. This can be made available for general use to allocate
contiguous memory which could not have been allocated through the buddy
allocator.

alloc_gigantic_page() has been split, carving out the actual allocation
method, which is then made available via the new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been
replaced with the more generic term 'contig'. Allocated pages here should be
freed with free_contig_range() or by calling __free_page() on each allocated
page.
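
For reference, a minimal caller sketch of the new interface (hypothetical,
not part of this patch; assumes CONFIG_CONTIG_ALLOC=y):

	/*
	 * Hypothetical caller: allocate 16 physically contiguous pages
	 * from any node and release them again.
	 */
	struct page *page;

	page = alloc_contig_pages(16, GFP_KERNEL, NUMA_NO_NODE, NULL);
	if (page)
		free_contig_range(page_to_pfn(page), 16);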

Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Matthew Wilcox 
Cc: David Hildenbrand 
Cc: linux-kernel@vger.kernel.org
Acked-by: David Hildenbrand 
Acked-by: Michal Hocko 
Signed-off-by: Anshuman Khandual 
---
This is based on https://patchwork.kernel.org/patch/11190213/

Changes in V3:

- Added an in-code comment per Michal and David

Changes in V2:

- Rephrased patch subject per David
- Fixed all typos per David
- s/order/contiguous

Changes from [V5,1/2] mm/hugetlb: Make alloc_gigantic_page()...

- alloc_contig_pages() takes nr_pages instead of order per Michal
- s/gigantic/contig on all related functions

 include/linux/gfp.h |   2 +
 mm/hugetlb.c        |  77 +-
 mm/page_alloc.c     | 101 +
 3 files changed, 105 insertions(+), 75 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..1a11d4857027 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,8 @@ static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
  unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask);
 #endif
 void free_contig_range(unsigned long pfn, unsigned int nr_pages);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 985ee15eb04b..a5c2c880af27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,85 +1023,12 @@ static void free_gigantic_page(struct page *page, 
unsigned int order)
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static int __alloc_gigantic_page(unsigned long start_pfn,
-   unsigned long nr_pages, gfp_t gfp_mask)
-{
-   unsigned long end_pfn = start_pfn + nr_pages;
-   return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   page = pfn_to_online_page(i);
-   if (!page)
-   return false;
-
-   if (page_zone(page) != z)
-   return false;
-
-   if (PageReserved(page))
-   return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
-   }
-
-   return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long last_pfn = start_pfn + nr_pages - 1;
-   return zone_spans_pfn(zone, last_pfn);
-}
-
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
-   unsigned long nr_pages = 1 << order;
-   unsigned long ret, pfn, flags;
-   struct zonelist *zonelist;
-   struct zone *zone;
-   struct zoneref *z;
-
-   zonelist = node_zonelist(nid, gfp_mask);
-   for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), 
nodemask) {
-   spin_lock_irqsave(&zone->lock, flags);
+   unsigned long nr_pages = 1UL << huge_page_order(h);
 
-   pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-   while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-   if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-   /*
-* We release the zone lock here because
-* alloc_contig_range() will also lock the zone

Re: [PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Anshuman Khandual



On 10/16/2019 10:18 PM, David Hildenbrand wrote:
> On 16.10.19 17:31, Anshuman Khandual wrote:
>>
>>
>> On 10/16/2019 06:11 PM, Michal Hocko wrote:
>>> On Wed 16-10-19 14:29:05, David Hildenbrand wrote:
>>>> On 16.10.19 13:51, Michal Hocko wrote:
>>>>> On Wed 16-10-19 16:43:57, Anshuman Khandual wrote:
>>>>>>
>>>>>>
>>>>>> On 10/16/2019 04:39 PM, David Hildenbrand wrote:
>>>>> [...]
>>>>>>> Just to make sure, you ignored my comment regarding alignment
>>>>>>> although I explicitly mentioned it a second time? Thanks.
>>>>>>
>>>>>> I had asked Michal explicitly what to be included for the respin. Anyways
>>>>>> seems like the previous thread is active again. I am happy to incorporate
>>>>>> anything new getting agreed on there.
>>>>>
>>>>> Your patch is using the same alignment as the original code would do. If
>>>>> an explicit alignement is needed then this can be added on top, right?
>>>>>
>>>>
>>>> Again, the "issue" I see here is that we could now pass in numbers that are
>>>> not a power of two. For gigantic pages it was clear that we always have a
>>>> power of two. The alignment does not make any sense otherwise.
>>
>> ALIGN() does expect nr_pages to be a power of two; otherwise the mask
>> value might not be correct, affecting the start pfn value for a zone.
>>
>> #define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
>> #define __ALIGN_KERNEL(x, a)	__ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
>> #define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
>>
>>>>
>>>> What I'm asking for is
>>>>
>>>> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
>>>> be a power of two".
>>>
>>> OK, this makes sense.
>> Sure, will add this to the alloc_contig_pages() helper description and
>> in the commit message as well.
> 
> As long as it is documented that implicit alignment will happen, fine with me.
> 
> The thing about !is_power_of2() is that we usually don't need an alignment 
> there (or instead an explicit one). And as I mentioned, the current function 
> might fail easily to allocate a suitable range due to the way the search 
> works (== check aligned blocks only). The search really only provides 
> reliable results when size==alignment and it's a power of two IMHO. Not 
> documenting that is in my opinion misleading - somebody who wants 
> !is_power_of2() and has no alignment requirements should probably rework the 
> function first.
> 
> So with some documentation regarding that

Does this add-on documentation look okay? Should we also mention the possible
reduction in the chances of success during the pfn block search for the
non-power-of-two cases, as the implicit alignment will probably turn out to be
bigger than nr_pages itself?

 * Requested nr_pages may or may not be a power of two. The search for a
 * suitable memory range in a zone happens in nr_pages aligned pfn blocks.
 * But when nr_pages is not a power of two, an implicitly aligned pfn block
 * search will happen, which in turn will impact the allocated memory block's
 * alignment. In these cases, the size (i.e. nr_pages) and the alignment of
 * the allocated memory will be different. This problem does not exist when
 * nr_pages is a power of two, where the size and the alignment of the
 * allocated memory will always be nr_pages.

> 
> Acked-by: David Hildenbrand 
> 


Re: [PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Anshuman Khandual



On 10/17/2019 06:20 AM, Mike Kravetz wrote:
> On 10/16/19 4:02 AM, Anshuman Khandual wrote:
>> HugeTLB helper alloc_gigantic_page() implements fairly generic allocation
>> method where it scans over various zones looking for a large contiguous pfn
>> range before trying to allocate it with alloc_contig_range(). Other than
>> deriving the requested order from 'struct hstate', there is nothing HugeTLB
>> specific in there. This can be made available for general use to allocate
>> contiguous memory which could not have been allocated through the buddy
>> allocator.
>>
>> alloc_gigantic_page() has been split carving out actual allocation method
>> which is then made available via new alloc_contig_pages() helper wrapped
>> under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced
>> with more generic term 'contig'. Allocated pages here should be freed with
>> free_contig_range() or by calling __free_page() on each allocated page.
> 
> I had a 'test harness' used when previously working on such an interface.
> It simply provides a user interface to call the allocator with random
> values for nr_pages.  Several tasks are then started doing random allocations
> in parallel.  The new interface is pretty straight forward, but the idea
> was to stress the underlying code.  In fact, it did identify issues with
> isolation which were corrected.
> 
> I exercised this new interface in the same way and am happy to report that
> no issues were detected.

Cool, thanks Mike.
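
A stress loop in that spirit might look like this (hypothetical sketch, not
Mike's actual harness; assumes the new interface is available):

static int __init contig_stress_init(void)
{
	unsigned long nr;
	struct page *page;
	int i;

	for (i = 0; i < 1000; i++) {
		/* Random request size between 1 and 1024 pages. */
		nr = 1 + get_random_long() % 1024;
		page = alloc_contig_pages(nr, GFP_KERNEL, NUMA_NO_NODE, NULL);
		if (page)
			free_contig_range(page_to_pfn(page), nr);
	}
	return 0;
}
late_initcall(contig_stress_init);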


Re: [PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Anshuman Khandual



On 10/16/2019 06:11 PM, Michal Hocko wrote:
> On Wed 16-10-19 14:29:05, David Hildenbrand wrote:
>> On 16.10.19 13:51, Michal Hocko wrote:
>>> On Wed 16-10-19 16:43:57, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 10/16/2019 04:39 PM, David Hildenbrand wrote:
>>> [...]
>>>>> Just to make sure, you ignored my comment regarding alignment
>>>>> although I explicitly mentioned it a second time? Thanks.
>>>>
>>>> I had asked Michal explicitly what to be included for the respin. Anyways
>>>> seems like the previous thread is active again. I am happy to incorporate
>>>> anything new getting agreed on there.
>>>
>>> Your patch is using the same alignment as the original code would do. If
>>> an explicit alignement is needed then this can be added on top, right?
>>>
>>
>> Again, the "issue" I see here is that we could now pass in numbers that are
>> not a power of two. For gigantic pages it was clear that we always have a
>> power of two. The alignment does not make any sense otherwise.

ALIGN() does expect nr_pages to be a power of two; otherwise the mask
value might not be correct, affecting the start pfn value for a zone.

#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
#define __ALIGN_KERNEL(x, a)	__ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)	(((x) + (mask)) & ~(mask))
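
A concrete illustration of why the mask based math needs a power of two
(hypothetical numbers):

	ALIGN(10, 8) = (10 + 7) & ~7 = 16	/* a multiple of 8, as expected */
	ALIGN(10, 6) = (10 + 5) & ~5 = 10	/* 15 & ~5 == 10, not a multiple of 6 */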

>>
>> What I'm asking for is
>>
>> a) Document "The resulting PFN is aligned to nr_pages" and "nr_pages should
>> be a power of two".
> 
> OK, this makes sense.
Sure, will add this to the alloc_contig_pages() helper description and
in the commit message as well.

> 
>> b) Eventually adding something like
>>
>> if (WARN_ON_ONCE(!is_power_of_2(nr_pages)))
>>  return NULL;
> 
> I am not sure this is really needed.
> 
Just wondering why not? Ideally, if we are documenting that nr_pages should be
a power of two, then we should abort an erring allocation request with a
warning. Is it because the allocation can still succeed for non-power-of-two
requests despite the possible problem with the mask and alignment? Hence there
is no need to abort it.


Re: [PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Anshuman Khandual



On 10/16/2019 04:39 PM, David Hildenbrand wrote:
> On 16.10.19 13:02, Anshuman Khandual wrote:
>> HugeTLB helper alloc_gigantic_page() implements fairly generic allocation
>> method where it scans over various zones looking for a large contiguous pfn
>> range before trying to allocate it with alloc_contig_range(). Other than
>> deriving the requested order from 'struct hstate', there is nothing HugeTLB
>> specific in there. This can be made available for general use to allocate
>> contiguous memory which could not have been allocated through the buddy
>> allocator.
>>
>> alloc_gigantic_page() has been split carving out actual allocation method
>> which is then made available via new alloc_contig_pages() helper wrapped
>> under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced
>> with more generic term 'contig'. Allocated pages here should be freed with
>> free_contig_range() or by calling __free_page() on each allocated page.
>>
>> Cc: Mike Kravetz 
>> Cc: Andrew Morton 
>> Cc: Vlastimil Babka 
>> Cc: Michal Hocko 
>> Cc: David Rientjes 
>> Cc: Andrea Arcangeli 
>> Cc: Oscar Salvador 
>> Cc: Mel Gorman 
>> Cc: Mike Rapoport 
>> Cc: Dan Williams 
>> Cc: Pavel Tatashin 
>> Cc: Matthew Wilcox 
>> Cc: David Hildenbrand 
>> Cc: linux-kernel@vger.kernel.org
>> Acked-by: Michal Hocko 
>> Signed-off-by: Anshuman Khandual 
>> ---
>> This is based on https://patchwork.kernel.org/patch/11190213/
>>
>> Changes in V2:
>>
>> - Rephrased patch subject per David
>> - Fixed all typos per David
>> - s/order/contiguous
> 
> Just to make sure, you ignored my comment regarding alignment although I 
> explicitly mentioned it a second time? Thanks.

I had asked Michal explicitly what to be included for the respin. Anyways
seems like the previous thread is active again. I am happy to incorporate
anything new getting agreed on there.

- Anshuman


[PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Anshuman Khandual
HugeTLB helper alloc_gigantic_page() implements a fairly generic allocation
method where it scans over various zones looking for a large contiguous pfn
range before trying to allocate it with alloc_contig_range(). Other than
deriving the requested order from 'struct hstate', there is nothing HugeTLB
specific in there. This can be made available for general use to allocate
contiguous memory which could not have been allocated through the buddy
allocator.

alloc_gigantic_page() has been split, carving out the actual allocation
method, which is then made available via the new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been
replaced with the more generic term 'contig'. Allocated pages here should be
freed with free_contig_range() or by calling __free_page() on each allocated
page.

Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Matthew Wilcox 
Cc: David Hildenbrand 
Cc: linux-kernel@vger.kernel.org
Acked-by: Michal Hocko 
Signed-off-by: Anshuman Khandual 
---
This is based on https://patchwork.kernel.org/patch/11190213/

Changes in V2:

- Rephrased patch subject per David
- Fixed all typos per David
- s/order/contiguous

Changes from [V5,1/2] mm/hugetlb: Make alloc_gigantic_page()...

- alloc_contig_pages() takes nr_pages instead of order per Michal
- s/gigantic/contig on all related functions

 include/linux/gfp.h |  2 +
 mm/hugetlb.c        | 77 +--
 mm/page_alloc.c     | 97 +
 3 files changed, 101 insertions(+), 75 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..1a11d4857027 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,8 @@ static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
  unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask);
 #endif
 void free_contig_range(unsigned long pfn, unsigned int nr_pages);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 985ee15eb04b..a5c2c880af27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,85 +1023,12 @@ static void free_gigantic_page(struct page *page, 
unsigned int order)
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static int __alloc_gigantic_page(unsigned long start_pfn,
-   unsigned long nr_pages, gfp_t gfp_mask)
-{
-   unsigned long end_pfn = start_pfn + nr_pages;
-   return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   page = pfn_to_online_page(i);
-   if (!page)
-   return false;
-
-   if (page_zone(page) != z)
-   return false;
-
-   if (PageReserved(page))
-   return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
-   }
-
-   return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long last_pfn = start_pfn + nr_pages - 1;
-   return zone_spans_pfn(zone, last_pfn);
-}
-
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
-   unsigned long nr_pages = 1 << order;
-   unsigned long ret, pfn, flags;
-   struct zonelist *zonelist;
-   struct zone *zone;
-   struct zoneref *z;
-
-   zonelist = node_zonelist(nid, gfp_mask);
-   for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), 
nodemask) {
-   spin_lock_irqsave(&zone->lock, flags);
+   unsigned long nr_pages = 1UL << huge_page_order(h);
 
-   pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-   while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-   if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-   /*
-* We release the zone lock here because
-* alloc_contig_range() will also lock the zone
-* at some point. 

Re: [PATCH] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-16 Thread Anshuman Khandual



On 10/16/2019 02:28 PM, Michal Hocko wrote:
> On Wed 16-10-19 13:04:53, Anshuman Khandual wrote:
>> HugeTLB helper alloc_gigantic_page() implements fairly generic allocation
>> method where it scans over various zones looking for a large contiguous pfn
>> range before trying to allocate it with alloc_contig_range(). Other than
>> deriving the requested order from 'struct hstate', there is nothing HugeTLB
>> specific in there. This can be made available for general use to allocate
>> contiguous memory which could not have been allocated through the buddy
>> allocator.
>>
>> alloc_gigantic_page() has been split carving out actual allocation method
>> which is then made available via new alloc_contig_pages() helper wrapped
>> under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced
>> with more generic term 'contig'. Allocated pages here should be freed with
>> free_contig_range() or by calling __free_page() on each allocated page.
> 
> Do we want to export this to modules? Apart from mostly styling issues
> pointed by David this looks fine.
> 
> I do agree that a general allocator api belongs to page_alloc.c rather
> than force people to invent their own and broken instances.
>  
>> Cc: Mike Kravetz 
>> Cc: Andrew Morton 
>> Cc: Vlastimil Babka 
>> Cc: Michal Hocko 
>> Cc: David Rientjes 
>> Cc: Andrea Arcangeli 
>> Cc: Oscar Salvador 
>> Cc: Mel Gorman 
>> Cc: Mike Rapoport 
>> Cc: Dan Williams 
>> Cc: Pavel Tatashin 
>> Cc: Matthew Wilcox 
>> Cc: David Hildenbrand 
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
> 
> After other issues mentioned by David get resolved you can add

Just to confirm, only the styling issues, right? pfn_range_valid_contig(),
pfn alignment and zone scanning all remain the same as before?

> Acked-by: Michal Hocko 
> 
>> ---
>> This is based on https://patchwork.kernel.org/patch/11190213/
>>
>> Changes from [V5,1/2] mm/hugetlb: Make alloc_gigantic_page()...
>>
>> - alloc_contig_page() takes nr_pages instead of order per Michal
>> - s/gigantic/contig on all related functions
>>
>>  include/linux/gfp.h |  2 +
>>  mm/hugetlb.c| 77 +--
>>  mm/page_alloc.c | 97 +
>>  3 files changed, 101 insertions(+), 75 deletions(-)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index fb07b503dc45..1a11d4857027 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -589,6 +589,8 @@ static inline bool pm_suspended_storage(void)
>>  /* The below functions must be run on a range from a single zone. */
>>  extern int alloc_contig_range(unsigned long start, unsigned long end,
>>unsigned migratetype, gfp_t gfp_mask);
>> +extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t 
>> gfp_mask,
>> +   int nid, nodemask_t *nodemask);
>>  #endif
>>  void free_contig_range(unsigned long pfn, unsigned int nr_pages);
>>  
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 985ee15eb04b..a5c2c880af27 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -1023,85 +1023,12 @@ static void free_gigantic_page(struct page *page, 
>> unsigned int order)
>>  }
>>  
>>  #ifdef CONFIG_CONTIG_ALLOC
>> -static int __alloc_gigantic_page(unsigned long start_pfn,
>> -unsigned long nr_pages, gfp_t gfp_mask)
>> -{
>> -unsigned long end_pfn = start_pfn + nr_pages;
>> -return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
>> -  gfp_mask);
>> -}
>> -
>> -static bool pfn_range_valid_gigantic(struct zone *z,
>> -unsigned long start_pfn, unsigned long nr_pages)
>> -{
>> -unsigned long i, end_pfn = start_pfn + nr_pages;
>> -struct page *page;
>> -
>> -for (i = start_pfn; i < end_pfn; i++) {
>> -page = pfn_to_online_page(i);
>> -if (!page)
>> -return false;
>> -
>> -if (page_zone(page) != z)
>> -return false;
>> -
>> -if (PageReserved(page))
>> -return false;
>> -
>> -if (page_count(page) > 0)
>> -return false;
>> -
>> -if (PageHuge(page))
>> -return false;
>> -}
>> -
>> -return true;
>> -}

Re: [PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-16 Thread Anshuman Khandual



On 10/15/2019 11:39 PM, Qian Cai wrote:
> On Tue, 2019-10-15 at 14:51 +0530, Anshuman Khandual wrote:
>> +static unsigned long __init get_random_vaddr(void)
>> +{
>> +unsigned long random_vaddr, random_pages, total_user_pages;
>> +
>> +total_user_pages = (TASK_SIZE - FIRST_USER_ADDRESS) / PAGE_SIZE;
>> +
>> +random_pages = get_random_long() % total_user_pages;
>> +random_vaddr = FIRST_USER_ADDRESS + random_pages * PAGE_SIZE;
>> +
>> +WARN_ON(random_vaddr > TASK_SIZE);
>> +WARN_ON(random_vaddr < FIRST_USER_ADDRESS);
> 
> It would be nice if this patch does not introduce a new W=1 GCC warning here 
> on
> x86 because FIRST_USER_ADDRESS is 0, and GCC think the code is dumb because
> "random_vaddr" is unsigned,
> 
> In file included from ./arch/x86/include/asm/bug.h:83,
>  from ./include/linux/bug.h:5,
>  from ./include/linux/mmdebug.h:5,
>  from ./include/linux/gfp.h:5,
>  from mm/debug_vm_pgtable.c:13:
> mm/debug_vm_pgtable.c: In function ‘get_random_vaddr’:
> mm/debug_vm_pgtable.c:359:23: warning: comparison of unsigned expression < 0 
> is
> always false [-Wtype-limits]
>   WARN_ON(random_vaddr < FIRST_USER_ADDRESS);
>    ^
> ./include/asm-generic/bug.h:113:25: note: in definition of macro ‘WARN_ON’
>   int __ret_warn_on = !!(condition);\
>  ^

The test checks against an erroneous unsigned long overflow when
FIRST_USER_ADDRESS is not 0 but a positive number. Wondering if the
compiler will still complain if we merge both WARN_ON() checks with ||
into a single statement.
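
Something like this (hypothetical merged form of the two checks):

	WARN_ON((random_vaddr > TASK_SIZE) ||
		(random_vaddr < FIRST_USER_ADDRESS));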


Re: [PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-16 Thread Anshuman Khandual



On 10/16/2019 12:12 AM, Qian Cai wrote:
> On Tue, 2019-10-15 at 20:51 +0530, Anshuman Khandual wrote:
>>
>> On 10/15/2019 08:11 PM, Qian Cai wrote:
>>> The x86 will crash with linux-next during boot due to this series (v5) with 
>>> the
>>> config below plus CONFIG_DEBUG_VM_PGTABLE=y. I am not sure if v6 would 
>>> address
>>> it.
>>>
>>> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
>>>
>>> [   33.862600][T1] page:ea000900 is uninitialized and poisoned
>>> [   33.862608][T1] raw:   
>>> 
>>> ff871140][T1]  ? _raw_spin_unlock_irq+0x27/0x40
>>> [   33.871140][T1]  ? rest_init+0x307/0x307
>>> [   33.871140][T1]  kernel_init+0x11/0x139
>>> [   33.871140][T1]  ? rest_init+0x307/0x307
>>> [   33.871140][T1]  ret_from_fork+0x27/0x50
>>> [   33.871140][T1] Modules linked in:
>>> [   33.871140][T1] ---[ end trace e99d392b0f7befbd ]---
>>> [   33.871140][T1] RIP: 0010:alloc_gigantic_page_order+0x3fe/0x490
>>
>> Hmm, with defconfig (DEBUG_VM=y and DEBUG_VM_PGTABLE=y) it does not crash but
>> with the config above, it does. Just wondering if it is possible that these
>> pages might not have been initialized yet because DEFERRED_STRUCT_PAGE_INIT=y?
> 
> Yes, this patch works fine.
> 
> diff --git a/init/main.c b/init/main.c
> index 676d8020dd29..591be8f9e8e0 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1177,7 +1177,6 @@ static noinline void __init kernel_init_freeable(void)
> workqueue_init();
>  
> init_mm_internals();
> -   debug_vm_pgtable();
>  
> do_pre_smp_initcalls();
> lockup_detector_init();
> @@ -1186,6 +1185,8 @@ static noinline void __init kernel_init_freeable(void)
> sched_init_smp();
>  
> page_alloc_init_late();
> +   debug_vm_pgtable();
> +
> /* Initialize page ext after all struct pages are initialized. */
> page_ext_init();
> 

Sure, will keep this in mind if we end up with a memory allocation approach
for this test.

>>
>> [   13.898549][T1] page:ea000500 is uninitialized and poisoned
>> [   13.898549][T1] raw:   
>>  
>> [   13.898549][T1] raw:   
>>  
>> [   13.898549][T1] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>> [   13.898549][T1] [ cut here ]
>> [   13.898549][T1] kernel BUG at ./include/linux/mm.h:1107!
>> [   13.898549][T1] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN 
>> PTI
>> [   13.898549][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
>> 5.4.0-rc3-next-20191015+ #
> 


[PATCH] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-16 Thread Anshuman Khandual
HugeTLB helper alloc_gigantic_page() implements a fairly generic allocation
method where it scans over various zones looking for a large contiguous pfn
range before trying to allocate it with alloc_contig_range(). Other than
deriving the requested order from 'struct hstate', there is nothing HugeTLB
specific in there. This can be made available for general use to allocate
contiguous memory which could not have been allocated through the buddy
allocator.

alloc_gigantic_page() has been split, carving out the actual allocation
method, which is then made available via the new alloc_contig_pages() helper
wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been
replaced with the more generic term 'contig'. Allocated pages here should be
freed with free_contig_range() or by calling __free_page() on each allocated
page.

Cc: Mike Kravetz 
Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: Matthew Wilcox 
Cc: David Hildenbrand 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This is based on https://patchwork.kernel.org/patch/11190213/

Changes from [V5,1/2] mm/hugetlb: Make alloc_gigantic_page()...

- alloc_contig_pages() takes nr_pages instead of order per Michal
- s/gigantic/contig on all related functions

 include/linux/gfp.h |  2 +
 mm/hugetlb.c        | 77 +--
 mm/page_alloc.c     | 97 +
 3 files changed, 101 insertions(+), 75 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..1a11d4857027 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,8 @@ static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
  unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask);
 #endif
 void free_contig_range(unsigned long pfn, unsigned int nr_pages);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 985ee15eb04b..a5c2c880af27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,85 +1023,12 @@ static void free_gigantic_page(struct page *page, 
unsigned int order)
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static int __alloc_gigantic_page(unsigned long start_pfn,
-   unsigned long nr_pages, gfp_t gfp_mask)
-{
-   unsigned long end_pfn = start_pfn + nr_pages;
-   return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   page = pfn_to_online_page(i);
-   if (!page)
-   return false;
-
-   if (page_zone(page) != z)
-   return false;
-
-   if (PageReserved(page))
-   return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
-   }
-
-   return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long last_pfn = start_pfn + nr_pages - 1;
-   return zone_spans_pfn(zone, last_pfn);
-}
-
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
-   unsigned long nr_pages = 1 << order;
-   unsigned long ret, pfn, flags;
-   struct zonelist *zonelist;
-   struct zone *zone;
-   struct zoneref *z;
-
-   zonelist = node_zonelist(nid, gfp_mask);
-   for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), 
nodemask) {
-   spin_lock_irqsave(&zone->lock, flags);
+   unsigned long nr_pages = 1UL << huge_page_order(h);
 
-   pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-   while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-   if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-   /*
-* We release the zone lock here because
-* alloc_contig_range() will also lock the zone
-* at some point. If there's an allocation
-* spinning on this lock, it may win the race
-   

Re: [PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual



On 10/15/2019 08:11 PM, Qian Cai wrote:
> The x86 will crash with linux-next during boot due to this series (v5) with 
> the
> config below plus CONFIG_DEBUG_VM_PGTABLE=y. I am not sure if v6 would address
> it.
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 
> [   33.862600][T1] page:ea000900 is uninitialized and poisoned
> [   33.862608][T1] raw:   
> ff871140][T1]  ? _raw_spin_unlock_irq+0x27/0x40
> [   33.871140][T1]  ? rest_init+0x307/0x307
> [   33.871140][T1]  kernel_init+0x11/0x139
> [   33.871140][T1]  ? rest_init+0x307/0x307
> [   33.871140][T1]  ret_from_fork+0x27/0x50
> [   33.871140][T1] Modules linked in:
> [   33.871140][T1] ---[ end trace e99d392b0f7befbd ]---
> [   33.871140][T1] RIP: 0010:alloc_gigantic_page_order+0x3fe/0x490

Hmm, with defconfig (DEBUG_VM=y and DEBUG_VM_PGTABLE=y) it does not crash but
with the config above, it does. Just wondering if it is possible that these
pages might not have been initialized yet because DEFERRED_STRUCT_PAGE_INIT=y?

[   13.898549][T1] page:ea000500 is uninitialized and poisoned
[   13.898549][T1] raw:    

[   13.898549][T1] raw:    

[   13.898549][T1] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   13.898549][T1] [ cut here ]
[   13.898549][T1] kernel BUG at ./include/linux/mm.h:1107!
[   13.898549][T1] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[   13.898549][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.4.0-rc3-next-20191015+ #


Re: [PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual



On 10/15/2019 05:16 PM, Michal Hocko wrote:
> On Tue 15-10-19 14:51:42, Anshuman Khandual wrote:
>> This adds tests which will validate architecture page table helpers and
>> other accessors in their compliance with expected generic MM semantics.
>> This will help various architectures in validating changes to existing
>> page table helpers or addition of new ones.
>>
>> Test page table and memory pages creating it's entries at various level are
>> all allocated from system memory with required size and alignments. But if
>> memory pages with required size and alignment could not be allocated, then
>> all depending individual tests are just skipped afterwards. This test gets
>> called right after init_mm_internals() required for alloc_contig_range() to
>> work correctly.
>>
>> This gets build and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
>> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
>> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
>> arm64. Going forward, other architectures too can enable this after fixing
>> build or runtime problems (if any) with their page table helpers.
> 
> A highlevel description of tests and what they are testing for would be
> really appreciated. Who wants to run these tests and why/when? What kind
> of bugs would get detected? In short why do we really need/want this
> code in the tree?

Sure, will do.


[PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

The test page table and the memory pages used to create its entries at
various levels are all allocated from system memory with the required size
and alignment. But if memory pages with the required size and alignment
could not be allocated, then all dependent individual tests are just
skipped afterwards. This test gets called right after init_mm_internals(),
which is required for alloc_contig_range() to work correctly.

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need
to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86
and arm64. Going forward, other architectures too can enable this after
fixing build or runtime problems (if any) with their page table helpers.
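
To give a flavor of what gets exercised, a simplified sketch of one such
transformation test (not the exact code in this patch):

static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);

	/* Helper transformations must round-trip per generic MM semantics. */
	WARN_ON(!pte_same(pte, pte));
	WARN_ON(!pte_young(pte_mkyoung(pte)));
	WARN_ON(!pte_dirty(pte_mkdirty(pte)));
	WARN_ON(!pte_write(pte_mkwrite(pte)));
	WARN_ON(pte_young(pte_mkold(pte)));
	WARN_ON(pte_dirty(pte_mkclean(pte)));
	WARN_ON(pte_write(pte_wrprotect(pte)));
}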

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Tested-by: Christophe Leroy  # PPC32
Suggested-by: Catalin Marinas 
Signed-off-by: Andrew Morton 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---
 .../debug/debug-vm-pgtable/arch-support.txt   |  34 ++
 arch/arm64/Kconfig|   1 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable_64.h |   6 +
 include/asm-generic/pgtable.h |   6 +
 init/main.c   |   1 +
 lib/Kconfig.debug |  21 +
 mm/Makefile   |   1 +
 mm/debug_vm_pgtable.c | 450 ++
 9 files changed, 521 insertions(+)
 create mode 100644 
Documentation/features/debug/debug-vm-pgtable/arch-support.txt
 create mode 100644 mm/debug_vm_pgtable.c

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
new file mode 100644
index ..d6b8185dcf1e
--- /dev/null
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -0,0 +1,34 @@
+#
+# Feature name:  debug-vm-pgtable
+# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
+# description:   arch supports pgtable tests for semantics compliance
+#
+---
+| arch |status|
+---
+|   alpha: | TODO |
+| arc: | TODO |
+| arm: | TODO |
+|   arm64: |  ok  |
+| c6x: | TODO |
+|csky: | TODO |
+|   h8300: | TODO |
+| hexagon: | TODO |
+|ia64: | TODO |
+|m68k: | TODO |
+|  microblaze: | TODO |
+|mips: | TODO |
+|   nds32: | TODO |
+|   nios2: | TODO |
+|openrisc: | TODO |
+|  parisc: | TODO |
+| powerpc: | TODO |
+|   riscv: | TODO |
+|s390: | TODO |
+|  sh: | TODO |
+|   sparc: | TODO |
+|  um: | TODO |
+|   unicore32: | TODO |
+| x86: |  ok  |
+|  xtensa: | TODO |
+---
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b71ff0..8a3b3eaa49e9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,6 +11,7 @@ config ARM64
select ACPI_PPTT if ACPI
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_DMA_COHERENT_TO_PFN
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index abe822d52167..13c9bd950256 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_CLOCKSOURCE_INIT
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_

[PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org


Anshuman Khandual (2):
  mm/page_alloc: Make alloc_gigantic_page() available for general use
  mm/debug: Add tests validating architecture page table helpers

 .../debug/debug-vm-pgtable/arch-support.txt   |  34 ++
 arch/arm64/Kconfig|   1 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable_64.h |   6 +
 include/asm-generic/pgtable.h |   6 +
 include/linux/gfp.h   |   3 +
 init/main.c   |   1 +
 lib/Kconfig.debug |  21 +
 mm/Makefile   |   1 +
 mm/debug_vm_pgtable.c | 450 ++
 mm/hugetlb.c  |  76 +--
 mm/page_alloc.c   |  98 
 12 files changed, 623 insertions(+), 75 deletions(-)
 create mode 100644 
Documentation/features/debug/debug-vm-pgtable/arch-support.txt
 create mode 100644 mm/debug_vm_pgtable.c

-- 
2.20.1



[PATCH V10 2/2] arm64/mm: Enable memory hot remove

2019-10-11 Thread Anshuman Khandual
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to the memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free()
and ___remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.

It makes two distinct passes over the kernel page table. In the first pass
with unmap_hotplug_range() it unmaps and invalidates the applicable TLB
entries, and frees the backing memory if required (vmemmap), for each mapped
leaf entry. In the second pass with free_empty_tables() it looks for empty
page table sections whose page table page can be unmapped, TLB invalidated
and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously failed memory hot add attempt or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region. There
can be conflicts between hot remove freeing page table pages and a
concurrent vmalloc() walking the kernel page table. This conflict cannot
just be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear down in free_pgd_range(),
which skips freeing page table pages if the intermediate address range is
not aligned or the maximum floor-ceiling might not own the entire page
table page.

While here, update arch_add_memory() to handle __add_pages() failures by
just unmapping the recently added kernel linear mapping. Now enable memory
hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down
procedure on the x86 architecture and the user page table tear down method.
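
For illustration, this is roughly how the floor/ceiling rule gates the
freeing of an intermediate table page (hypothetical fragment built on the
patch's pgtable_range_aligned(); the variable names are assumptions):

	/*
	 * Only free the table page serving [addr, end) when the whole
	 * PUD sized region containing it lies inside [floor, ceiling),
	 * mirroring what free_pgd_range() does for user page tables.
	 */
	if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
		return;
	/* all entries already verified empty; unmap and free the page */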

Signed-off-by: Anshuman Khandual 
Reviewed-by: Catalin Marinas 
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 303 ++--
 3 files changed, 298 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b..c9a3c02 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -273,6 +273,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+   def_bool y
+
 config SMP
def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50b..615dcd0 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR  (BPF_JIT_REGION_END)
 #define MODULES_VSIZE  (SZ_128M)
 #define VMEMMAP_START  (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END (VMEMMAP_START - SZ_2M)
 #define PCI_IO_START   (PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d10247f..66ef63e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -725,6 +725,275 @@ int kern_addr_valid(unsigned long addr)
 
return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+   WARN_ON(PageReserved(page));
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+   free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+ unsigned long floor, unsigned long ceiling,
+ unsigned long mask)
+{
+   start &= mask;
+   if (start < floor)
+   return false;
+
+   if (ceiling) {
+   ceiling &= mask;
+   if (!ceiling)
+   return false;
+   }
+
+   if (end - 1 > ceiling - 1)
+   return false;
+   return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+   unsigned long end, bool free_mapped)
+{
+   pte_t *ptep, pte;
+
+   do {
+   ptep = pte_offset_kernel(pmdp, addr);
+   pte = READ_ONCE(*ptep);
+   if (pte_none(pte))
+   continue;
+
+   WARN_ON(!pte_present(pte));
+   pte_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+   i

[PATCH V10 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

2019-10-11 Thread Anshuman Khandual
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand 
Acked-by: Mark Rutland 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/ptdump_debugfs.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..b5eebc8 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
 #include 
 
 #include 
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
struct ptdump_info *info = m->private;
+
+   get_online_mems();
ptdump_walk_pgd(m, info);
+   put_online_mems();
return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4



[PATCH V10 0/2] arm64/mm: Enable memory hot remove

2019-10-11 Thread Anshuman Khandual
ge table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted 
polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (2):
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 303 ++--
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 4 files changed, 302 insertions(+), 9 deletions(-)

-- 
2.7.4



[PATCH V5 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-11 Thread Anshuman Khandual
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

The test page table and the memory pages used to create its entries at
various levels are all allocated from system memory with the required size
and alignment. But if memory pages with the required size and alignment
could not be allocated, then all dependent individual tests are just
skipped afterwards. This test gets called right after init_mm_internals(),
which is required for alloc_contig_range() to work correctly.

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need
to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86
and arm64. Going forward, other architectures too can enable this after
fixing build or runtime problems (if any) with their page table helpers.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Tested-by: Christophe Leroy  # PPC32
Suggested-by: Catalin Marinas 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---
 .../debug/debug-vm-pgtable/arch-support.txt|  34 ++
 arch/arm64/Kconfig |   1 +
 arch/x86/Kconfig   |   1 +
 arch/x86/include/asm/pgtable_64.h  |   6 +
 include/asm-generic/pgtable.h  |   6 +
 init/main.c|   1 +
 lib/Kconfig.debug  |  21 +
 mm/Makefile|   1 +
 mm/debug_vm_pgtable.c  | 438 +
 9 files changed, 509 insertions(+)
 create mode 100644 
Documentation/features/debug/debug-vm-pgtable/arch-support.txt
 create mode 100644 mm/debug_vm_pgtable.c

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
new file mode 100644
index 000..d6b8185
--- /dev/null
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -0,0 +1,34 @@
+#
+# Feature name:  debug-vm-pgtable
+# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
+# description:   arch supports pgtable tests for semantics compliance
+#
+---
+| arch |status|
+---
+|   alpha: | TODO |
+| arc: | TODO |
+| arm: | TODO |
+|   arm64: |  ok  |
+| c6x: | TODO |
+|csky: | TODO |
+|   h8300: | TODO |
+| hexagon: | TODO |
+|ia64: | TODO |
+|m68k: | TODO |
+|  microblaze: | TODO |
+|mips: | TODO |
+|   nds32: | TODO |
+|   nios2: | TODO |
+|openrisc: | TODO |
+|  parisc: | TODO |
+| powerpc: | TODO |
+|   riscv: | TODO |
+|s390: | TODO |
+|  sh: | TODO |
+|   sparc: | TODO |
+|  um: | TODO |
+|   unicore32: | TODO |
+| x86: |  ok  |
+|  xtensa: | TODO |
+---
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b..8a3b3ea 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,6 +11,7 @@ config ARM64
select ACPI_PPTT if ACPI
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_DMA_COHERENT_TO_PFN
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index abe822d..13c9bd9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_CLOCKSOURCE_INIT
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED

[PATCH V5 1/2] mm/hugetlb: Make alloc_gigantic_page() available for general use

2019-10-11 Thread Anshuman Khandual
alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
have been allocated through the buddy allocator. A subsequent patch which
tests arch page table helpers needs such a method to allocate a PUD_SIZE
sized memory block. In the future such methods might have other use cases
as well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is made available via the new
alloc_gigantic_page_order().
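
For reference, the subsequent test patch can then do something like this
(hypothetical caller sketch):

	/*
	 * Hypothetical caller: get a PUD_SIZE aligned memory block for
	 * the page table tests, from any node that has memory.
	 */
	struct page *page;

	page = alloc_gigantic_page_order(get_order(PUD_SIZE), GFP_KERNEL,
					 first_memory_node,
					 &node_states[N_MEMORY]);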

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Mike Kravetz 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 include/linux/hugetlb.h |  9 +
 mm/hugetlb.c            | 24 ++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4c5a16b..7ff1e36 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -298,6 +298,9 @@ static inline bool is_file_hugepages(struct file *file)
 }
 
 
+struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)	false
@@ -309,6 +312,12 @@ hugetlb_file_setup(const char *name, size_t size, 
vm_flags_t acctflag,
return ERR_PTR(-ENOSYS);
 }
 
+static inline struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 977f9a3..2996e44 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1066,10 +1066,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
unsigned long nr_pages = 1 << order;
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
@@ -1105,6 +1104,14 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
return NULL;
 }
 
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+   int nid, nodemask_t *nodemask)
+{
+   unsigned int order = huge_page_order(h);
+
+   return alloc_gigantic_page_order(order, gfp_mask, nid, nodemask);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
 static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 #else /* !CONFIG_CONTIG_ALLOC */
@@ -1113,6 +1120,12 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* CONFIG_CONTIG_ALLOC */
 
 #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
@@ -1121,6 +1134,13 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
+
 static inline void free_gigantic_page(struct page *page, unsigned int order) { 
}
 static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
-- 
2.7.4



[PATCH V5 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-11 Thread Anshuman Khandual
This series adds a test validation for architecture exported page table
helpers. The patch in the series adds basic transformation tests at various
levels of the page table. Before that, it exports the gigantic page
allocation function from HugeTLB.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Changes in V5:

- Redefined and moved X86 mm_p4d_folded() into a different header per 
Kirill/Ingo
- Updated the config option comment per Ingo and dropped 'kernel module' 
reference
- Updated the commit message and dropped 'kernel module' reference
- Changed DEBUG_ARCH_PGTABLE_TEST into DEBUG_VM_PGTABLE per Ingo
- Moved config option from mm/Kconfig.debug into lib/Kconfig.debug
- Renamed core test function arch_pgtable_tests() as debug_vm_pgtable()
- Renamed mm/arch_pgtable_test.c as mm/debug_vm_pgtable.c
- debug_vm_pgtable() gets called from kernel_init_freeable() after 
init_mm_internals()
- Added an entry in Documentation/features/debug/ per Ingo
- Enabled the test on arm64 and x86 platforms for now

Changes in V4: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=183465)

- Disable DEBUG_ARCH_PGTABLE_TEST for ARM and IA64 platforms

Changes in V3: 
(https://lore.kernel.org/patchwork/project/lkml/list/?series=411216)

- Changed test trigger from module format into late_initcall()
- Marked all functions with __init to be freed after completion
- Changed all __PGTABLE_PXX_FOLDED checks as mm_pxx_folded()
- Folded in PPC32 fixes from Christophe

Changes in V2:

https://lore.kernel.org/linux-mm/1568268173-31302-1-git-send-email-anshuman.khand...@arm.com/T/#t

- Fixed a small typo in MODULE_DESCRIPTION()
- Fixed m68k build problems for lvalue concerns in pmd_xxx_tests()
- Fixed dynamic page table level folding problems on x86 as per Kirill
- Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
- Allocate and free pte table with pte_alloc_one/pte_free per Kirill
- Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
- Changed RANDOM_NZVALUE value from 0xbe to 0xff
- Changed allocation, usage, free sequence for saved_ptep
- Renamed VMA_FLAGS as VMFLAGS
- Implemented a new method for random vaddr generation
- Implemented some other cleanups
- Dropped extern reference to mm_alloc()
- Created and exported new alloc_gigantic_page_order()
- Dropped the custom allocator and used new alloc_gigantic_page_order()

Changes in V1:

https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in RFC V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved the test module and its config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated it falls back on PMD aligned
  memory block from page allocator and pud_* tests are skipped
- Clear and populate tests now operate on real in-memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

Original RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Kravetz 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  mm/hugetlb: Make alloc_gigantic_page() available for general use
  mm/debug: Add tests validating architecture page table helpers

Re: [PATCH V9 2/2] arm64/mm: Enable memory hot remove

2019-10-10 Thread Anshuman Khandual
On 10/10/2019 05:04 PM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> On Wed, Oct 09, 2019 at 01:51:48PM +0530, Anshuman Khandual wrote:
>> +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
>> +				    unsigned long end, bool free_mapped)
>> +{
>> +	unsigned long next;
>> +	pmd_t *pmdp, pmd;
>> +
>> +	do {
>> +		next = pmd_addr_end(addr, end);
>> +		pmdp = pmd_offset(pudp, addr);
>> +		pmd = READ_ONCE(*pmdp);
>> +		if (pmd_none(pmd))
>> +			continue;
>> +
>> +		WARN_ON(!pmd_present(pmd));
>> +		if (pmd_sect(pmd)) {
>> +			pmd_clear(pmdp);
>> +			flush_tlb_kernel_range(addr, next);
> 
> The range here could be a whole PMD_SIZE. Since we are invalidating a
> single block entry, one TLBI should be sufficient:
> 
>   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

Sure, will change.

> 
>> +			if (free_mapped)
>> +				free_hotplug_page_range(pmd_page(pmd),
>> +							PMD_SIZE);
>> +			continue;
>> +		}
>> +		WARN_ON(!pmd_table(pmd));
>> +		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
>> +	} while (addr = next, addr < end);
>> +}
>> +
>> +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
>> +				    unsigned long end, bool free_mapped)
>> +{
>> +	unsigned long next;
>> +	pud_t *pudp, pud;
>> +
>> +	do {
>> +		next = pud_addr_end(addr, end);
>> +		pudp = pud_offset(pgdp, addr);
>> +		pud = READ_ONCE(*pudp);
>> +		if (pud_none(pud))
>> +			continue;
>> +
>> +		WARN_ON(!pud_present(pud));
>> +		if (pud_sect(pud)) {
>> +			pud_clear(pudp);
>> +			flush_tlb_kernel_range(addr, next);
>   ^^^
>   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

Will change.

> 
> [...]
>> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
>> +				 unsigned long end, unsigned long floor,
>> +				 unsigned long ceiling)
>> +{
>> +	pte_t *ptep, pte;
>> +	unsigned long i, start = addr;
>> +
>> +	do {
>> +		ptep = pte_offset_kernel(pmdp, addr);
>> +		pte = READ_ONCE(*ptep);
>> +		WARN_ON(!pte_none(pte));
>> +	} while (addr += PAGE_SIZE, addr < end);
> 
> So this loop is just a sanity check (pte clearing having been done by
> the unmap loops). That's fine, maybe a comment for future reference.

Sure, will add.
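
Roughly like this, as a sketch (final comment wording may differ):

	do {
		ptep = pte_offset_kernel(pmdp, addr);
		pte = READ_ONCE(*ptep);
		/*
		 * This is just a sanity check here which verifies that
		 * the pte clearing has been done by the earlier unmap
		 * loops; no entry should still be live at this point.
		 */
		WARN_ON(!pte_none(pte));
	} while (addr += PAGE_SIZE, addr < end);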
> 
>> +
>> +	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
>> +		return;
>> +
>> +	ptep = pte_offset_kernel(pmdp, 0UL);
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		if (!pte_none(READ_ONCE(ptep[i])))
>> +			return;
>> +	}
> 
> We could do with a comment for this loop along the lines of:
> 
>   Check whether we can free the pte page if the rest of the
>   entries are empty. Overlap with other regions have been handled
>   by the floor/ceiling check.

Sure, will add.
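
Something like the following sketch, then:

	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
		return;

	/*
	 * Check whether we can free the pte page if the rest of the
	 * entries are empty. Overlap with other regions has been
	 * handled by the floor/ceiling check above.
	 */
	ptep = pte_offset_kernel(pmdp, 0UL);
	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(READ_ONCE(ptep[i])))
			return;
	}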

> 
> Apart from the comments above, the rest of the patch looks fine. Once
> fixed:
> 
> Reviewed-by: Catalin Marinas 
> 
> Mark Rutland mentioned at some point that, as a preparatory patch to
> this series, we'd need to make sure we don't hot-remove memory already
> given to the kernel at boot. Any plans here?

Hmm, this series just enables platform memory hot remove as required by the
generic memory hotplug framework. The path here is triggered either from
remove_memory() or __remove_memory(), which take physical memory range
arguments like (nid, start, size) and perform the removal. arch_remove_memory()
should never be required to test a given memory range for anything, including
whether it is part of boot memory.

IIUC boot memory added to the system with memblock_add() loses all its
identity after the system is up and running. In order to reject any attempt
to hot remove boot memory, the platform needs to remember all the memory that
came in early during boot and then scan through it during arch_remove_memory().

Ideally, it is the responsibility of [_]remove_memory() callers like the ACPI
driver, DAX etc to make sure they never attempt to hot remove a memory range
which they never hot added in the first place. Also, unlike
/sys/devices/system/memory/probe there is no 'unprobe' interface where the
user can just trigger boot memory removal. Hence, unless there is a bug in
ACPI, DAX or another caller, there should never be any attempt to hot remove
boot memory in the first place.
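
That said, if the arch code ever wanted to reject such attempts itself, one
possible shape would be to snapshot memblock early and test for overlap during
removal. A rough sketch only; apart from the memblock iteration, all names
below are made up for illustration:

#define MAX_BOOT_RANGES	64		/* arbitrary for the sketch */

static struct {
	phys_addr_t start, end;
} boot_ranges[MAX_BOOT_RANGES] __ro_after_init;	/* hypothetical storage */
static int nr_boot_ranges __ro_after_init;

static int __init record_boot_memory(void)
{
	struct memblock_region *reg;

	/* snapshot all memory present before any hotplug can happen */
	for_each_memblock(memory, reg) {
		if (nr_boot_ranges == MAX_BOOT_RANGES)
			break;
		boot_ranges[nr_boot_ranges].start = reg->base;
		boot_ranges[nr_boot_ranges].end = reg->base + reg->size;
		nr_boot_ranges++;
	}
	return 0;
}
early_initcall(record_boot_memory);

/* arch_remove_memory() could then refuse any overlapping range */
static bool range_overlaps_boot_memory(phys_addr_t start, phys_addr_t end)
{
	int i;

	for (i = 0; i < nr_boot_ranges; i++)
		if (start < boot_ranges[i].end && end > boot_ranges[i].start)
			return true;
	return false;
}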

> 
> Thanks.
> 


[PATCH V9 2/2] arm64/mm: Enable memory hot remove

2019-10-09 Thread Anshuman Khandual
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free() and
___remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.

It makes two distinct passes over the kernel page table. In the first pass
with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and
frees backing memory if required (vmemmap) for each mapped leaf entry. In
the second pass with free_empty_tables() it looks for empty page table
sections whose page table page can be unmapped, TLB invalidated and freed.

While freeing an intermediate level page table page, bail out if any of its
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and
a concurrent vmalloc() walking the kernel page table. This conflict cannot
simply be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear down path free_pgd_range(),
which skips freeing a page table page if the intermediate address range is
not aligned or the maximum floor-ceiling might not own the entire page
table page.

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down
procedure on the x86 architecture and the user page table tear down method.
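
As an illustration of the two-pass scheme and the floor/ceiling clamping, the
vmemmap tear down path looks roughly like this (a sketch of the call site,
not a verbatim quote from the diff below):

void vmemmap_free(unsigned long start, unsigned long end,
		  struct vmem_altmap *altmap)
{
	/* pass 1: unmap leaf entries, free the backing vmemmap pages */
	unmap_hotplug_range(start, end, true);
	/* pass 2: free empty table pages, clamped to the vmemmap region */
	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
}

Clamping to VMEMMAP_START/VMEMMAP_END is what prevents freeing a table page
shared with the adjacent vmalloc region.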

Signed-off-by: Anshuman Khandual 
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 273 ++--
 3 files changed, 268 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b71ff0..c9a3c029c55f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -273,6 +273,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+   def_bool y
+
 config SMP
def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50bf68b1..615dcd08acfa 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR  (BPF_JIT_REGION_END)
 #define MODULES_VSIZE  (SZ_128M)
 #define VMEMMAP_START  (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END (VMEMMAP_START - SZ_2M)
 #define PCI_IO_START   (PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d10247fab0fd..cb0411b1e735 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -725,6 +725,245 @@ int kern_addr_valid(unsigned long addr)
 
return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+   WARN_ON(PageReserved(page));
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+   free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+ unsigned long floor, unsigned long ceiling,
+ unsigned long mask)
+{
+   start &= mask;
+   if (start < floor)
+   return false;
+
+   if (ceiling) {
+   ceiling &= mask;
+   if (!ceiling)
+   return false;
+   }
+
+   if (end - 1 > ceiling - 1)
+   return false;
+   return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+   unsigned long end, bool free_mapped)
+{
+   pte_t *ptep, pte;
+
+   do {
+   ptep = pte_offset_kernel(pmdp, addr);
+   pte = READ_ONCE(*ptep);
+   if (pte_none(pte))
+   continue;
+
+   WARN_ON(!pte_present(pte));
+   pte_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+   i

[PATCH V9 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

2019-10-09 Thread Anshuman Khandual
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand 
Acked-by: Mark Rutland 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/ptdump_debugfs.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..b5eebc8c4924 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
 #include 
 
 #include 
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
struct ptdump_info *info = m->private;
+
+   get_online_mems();
ptdump_walk_pgd(m, info);
+   put_online_mems();
return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.20.1



[PATCH V9 0/2] arm64/mm: Enable memory hot remove

2019-10-09 Thread Anshuman Khandual
and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted 
polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (2):
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 273 ++--
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 4 files changed, 272 insertions(+), 9 deletions(-)

-- 
2.20.1



Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove

2019-10-08 Thread Anshuman Khandual



On 10/08/2019 04:25 PM, Catalin Marinas wrote:
> On Tue, Oct 08, 2019 at 10:06:26AM +0530, Anshuman Khandual wrote:
>> On 10/07/2019 07:47 PM, Catalin Marinas wrote:
>>> On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote:
>>>> The arch code for hot-remove must tear down portions of the linear map and
>>>> vmemmap corresponding to memory being removed. In both cases the page
>>>> tables mapping these regions must be freed, and when sparse vmemmap is in
>>>> use the memory backing the vmemmap must also be freed.
>>>>
>>>> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
>>>> can be used to tear down either region and calls it from vmemmap_free() and
>>>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
>>>> backing memory will be freed.
>>>
>>> Can you change the 'sparse_vmap' name to something more meaningful which
>>> would suggest freeing of the backing memory?
>>
>> free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or
>> free_backed might do as well. Do you have a particular preference here ? But
>> yes, sparse_vmap has been very much specific to vmemmap for these functions
>> which are now very generic in nature.
> 
> free_mapped would do.

Sure.

> 
>>>> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
>>>> +  unsigned long end, bool sparse_vmap)
>>>> +{
>>>> +  struct page *page;
>>>> +  pte_t *ptep, pte;
>>>> +
>>>> +  do {
>>>> +  ptep = pte_offset_kernel(pmdp, addr);
>>>> +  pte = READ_ONCE(*ptep);
>>>> +  if (pte_none(pte))
>>>> +  continue;
>>>> +
>>>> +  WARN_ON(!pte_present(pte));
>>>> +  page = sparse_vmap ? pte_page(pte) : NULL;
>>>> +  pte_clear(&init_mm, addr, ptep);
>>>> +  flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>> +  if (sparse_vmap)
>>>> +  free_hotplug_page_range(page, PAGE_SIZE);
>>>
>>> You could only set 'page' if sparse_vmap (or even drop 'page' entirely).
>>
>> I am afraid 'page' is being used to hold pte_page(pte) extraction which
>> needs to be freed (sparse_vmap) as we are going to clear the ptep entry
>> in the next statement and lose access to it for good.
> 
> You clear *ptep, not pte.

Ahh, missed that. We have already captured the contents with READ_ONCE().

> 
>> We will need somewhere to hold onto pte_page(pte) across pte_clear() as
>> we cannot free it before clearing its entry and flushing the TLB. Hence
>> wondering how the 'page' can be completely dropped.
>>
>>> The compiler is probably smart enough to optimise it but using a
>>> pointless ternary operator just makes the code harder to follow.
>>
>> Not sure I got this but are you suggesting for an 'if' statement here
>>
>> if (sparse_vmap)
>>  page = pte_page(pte);
>>
>> instead of the current assignment ?
>>
>> page = sparse_vmap ? pte_page(pte) : NULL;
> 
> I suggest:
> 
>   if (sparse_vmap)
>   free_hotplug_pgtable_page(pte_page(pte), PAGE_SIZE);

Sure, will do.

> 
>>>> +  } while (addr += PAGE_SIZE, addr < end);
>>>> +}
>>> [...]
>>>> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
>>>> +   unsigned long end)
>>>> +{
>>>> +  pte_t *ptep, pte;
>>>> +
>>>> +  do {
>>>> +  ptep = pte_offset_kernel(pmdp, addr);
>>>> +  pte = READ_ONCE(*ptep);
>>>> +  WARN_ON(!pte_none(pte));
>>>> +  } while (addr += PAGE_SIZE, addr < end);
>>>> +}
>>>> +
>>>> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
>>>> +   unsigned long end, unsigned long floor,
>>>> +   unsigned long ceiling)
>>>> +{
>>>> +  unsigned long next;
>>>> +  pmd_t *pmdp, pmd;
>>>> +
>>>> +  do {
>>>> +  next = pmd_addr_end(addr, end);
>>>> +  pmdp = pmd_offset(pudp, addr);
>>>> +  pmd = READ_ONCE(*pmdp);
>>>> +  if (pmd_none(pmd))
>>>> +  continue;
>>>> +
>>>> +  WARN_ON(!pmd_present(p

Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-08 Thread Anshuman Khandual



On 10/07/2019 07:30 PM, Kirill A. Shutemov wrote:
> On Mon, Oct 07, 2019 at 03:51:58PM +0200, Ingo Molnar wrote:
>>
>> * Kirill A. Shutemov  wrote:
>>
>>> On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
>>>>
>>>> * Anshuman Khandual  wrote:
>>>>
>>>>> This adds a test module which will validate architecture page table 
>>>>> helpers
>>>>> and accessors regarding compliance with generic MM semantics expectations.
>>>>> This will help various architectures in validating changes to the existing
>>>>> page table helpers or addition of new ones.
>>>>>
>>>>> Test page table and memory pages creating it's entries at various level 
>>>>> are
>>>>> all allocated from system memory with required alignments. If memory pages
>>>>> with required size and alignment could not be allocated, then all 
>>>>> depending
>>>>> individual tests are skipped.
>>>>
>>>>> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>>>>> b/arch/x86/include/asm/pgtable_64_types.h
>>>>> index 52e5f5f2240d..b882792a3999 100644
>>>>> --- a/arch/x86/include/asm/pgtable_64_types.h
>>>>> +++ b/arch/x86/include/asm/pgtable_64_types.h
>>>>> @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
>>>>>  #define pgtable_l5_enabled() 0
>>>>>  #endif /* CONFIG_X86_5LEVEL */
>>>>>  
>>>>> +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
>>>>> +
>>>>>  extern unsigned int pgdir_shift;
>>>>>  extern unsigned int ptrs_per_p4d;
>>>>
>>>> Any deep reason this has to be a macro instead of proper C?
>>>
>>> It's a way to override the generic mm_p4d_folded(). It can be rewritten
>>> as inline function + define. Something like:
>>>
>>> #define mm_p4d_folded mm_p4d_folded
>>> static inline bool mm_p4d_folded(struct mm_struct *mm)
>>> {
>>> return !pgtable_l5_enabled();
>>> }
>>>
>>> But I don't see much reason to be more verbose here than needed.
>>
>> C type checking? Documentation? Yeah, I know it's just a one-liner, but 
>> the principle of the death by a thousand cuts applies here.
> 
> Okay, if you think it worth it. Anshuman, could you fix it up for the next
> submission?

Sure, will do.

> 
> 
>> BTW., any reason this must be in the low level pgtable_64_types.h type 
>> header, instead of one of the API level header files?
> 
> I defined it next pgtable_l5_enabled(). What is more appropriate place to
> you? pgtable_64.h? Yeah, it makes sense.


Needs to be moved to arch/x86/include/asm/pgtable_64.h as well?


Re: [PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-08 Thread Anshuman Khandual


On 10/07/2019 06:36 PM, Ingo Molnar wrote:
> 
> * Anshuman Khandual  wrote:
> 
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> Test page table and memory pages creating it's entries at various level are
>> all allocated from system memory with required alignments. If memory pages
>> with required size and alignment could not be allocated, then all depending
>> individual tests are skipped.
> 
>> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>> b/arch/x86/include/asm/pgtable_64_types.h
>> index 52e5f5f2240d..b882792a3999 100644
>> --- a/arch/x86/include/asm/pgtable_64_types.h
>> +++ b/arch/x86/include/asm/pgtable_64_types.h
>> @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
>>  #define pgtable_l5_enabled() 0
>>  #endif /* CONFIG_X86_5LEVEL */
>>  
>> +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
>> +
>>  extern unsigned int pgdir_shift;
>>  extern unsigned int ptrs_per_p4d;
> 
> Any deep reason this has to be a macro instead of proper C?
> 
>> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index 327b3ebf23bf..683131b1ee7d 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -117,3 +117,18 @@ config DEBUG_RODATA_TEST
>>  depends on STRICT_KERNEL_RWX
>>  ---help---
>>This option enables a testcase for the setting rodata read-only.
>> +
>> +config DEBUG_ARCH_PGTABLE_TEST
>> +bool "Test arch page table helpers for semantics compliance"
>> +depends on MMU
>> +depends on DEBUG_KERNEL
>> +depends on !(ARM || IA64)
> 
> Please add a proper enabling switch for architectures to opt in.

Sure, will do.

> 
> Please also add it to Documentation/features/list-arch.sh so that it's 
> listed as a 'TODO' entry on architectures where the tests are not enabled 
> yet.

Will do.

> 
>> +help
>> +  This options provides a kernel module which can be used to test
>> +  architecture page table helper functions on various platform in
>> +  verifying if they comply with expected generic MM semantics. This
>> +  will help architectures code in making sure that any changes or
>> +  new additions of these helpers will still conform to generic MM
>> +  expected semantics.
> 
> Typos and grammar fixed:
> 
>   help
> This option provides a kernel module which can be used to test
> architecture page table helper functions on various platforms in
> verifying if they comply with expected generic MM semantics. This
> will help architecture code in making sure that any changes or
> new additions of these helpers still conform to expected 
> semantics of the generic MM.

Sure, will update except the 'kernel module' part. Thank you.

> 
> Also, more fundamentally: isn't a kernel module too late for such a debug

It's not a kernel module anymore; my bad that the description still has these
words left over from previous versions, will fix it. The test now gets invoked
through a late_initcall().

> check, should something break due to a core MM change? Have these debug 
> checks caught any bugs or inconsistencies before?

Gerald Schaefer had reported earlier about a bug found on s390 with this test.

https://lkml.org/lkml/2019/9/4/1718

> 
> Why not call this as some earlier MM debug check, after enabling paging 
> but before executing user-space binaries or relying on complex MM ops 
> within the kernel, called at a stage when those primitives are all 
> expected to work fine?

At minimum we need the buddy allocator to be initialized for the allocations
to work. Would just after pgtable_init() or kmem_cache_init() in mm_init() be
a good place?
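
For reference, the current trigger is nothing more than an initcall; a
minimal sketch (function name assumed from this version's naming, body
elided):

static int __init arch_pgtable_tests_init(void)
{
	/* allocate test memory, run the pxx_*_tests() battery, free it */
	return 0;
}
late_initcall(arch_pgtable_tests_init);

An earlier hook, as discussed above, would simply be a direct call right
after those init points in mm_init().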

> 
> It seems to me that arch_pgtable_tests_init) won't even context-switch 
> normally, right?

Not sure I got this. Why would you expect it to context switch?

> 
> Finally, instead of inventing yet another randomly named .config debug 
> switch, please fit it into the regular MM debug options which go along 
> the CONFIG_DEBUG_VM* naming scheme.
> 
> Might even make sense to enable these new debug checks by default if 
> CONFIG_DEBUG_VM=y, that way we'll get a *lot* more debug coverage than 
> some random module somewhere that few people will know about, let alone 
> run.

All the configs with respect to memory debugging are generated from
lib/Kconfig.debug after fetching all that is in "mm/Kconfig.debug"

Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove

2019-10-07 Thread Anshuman Khandual



On 10/07/2019 07:47 PM, Catalin Marinas wrote:
> On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote:
>> The arch code for hot-remove must tear down portions of the linear map and
>> vmemmap corresponding to memory being removed. In both cases the page
>> tables mapping these regions must be freed, and when sparse vmemmap is in
>> use the memory backing the vmemmap must also be freed.
>>
>> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
>> can be used to tear down either region and calls it from vmemmap_free() and
>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
>> backing memory will be freed.
> 
> Can you change the 'sparse_vmap' name to something more meaningful which
> would suggest freeing of the backing memory?

free_mapped_mem or free_backed_mem? Even shorter forms like free_mapped or
free_backed might do as well. Do you have a particular preference here? But
yes, sparse_vmap has been very much specific to vmemmap for these functions,
which are now very generic in nature.

> 
>> It makes two distinct passes over the kernel page table. In the first pass
>> with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and
>> frees backing memory if required (vmemmap) for each mapped leaf entry. In
>> the second pass with free_empty_tables() it looks for empty page table
>> sections whose page table page can be unmapped, TLB invalidated and freed.
>>
>> While freeing intermediate level page table pages bail out if any of its
>> entries are still valid. This can happen for partially filled kernel page
>> table either from a previously attempted failed memory hot add or while
>> removing an address range which does not span the entire page table page
>> range.
>>
>> The vmemmap region may share levels of table with the vmalloc region.
>> There can be conflicts between hot remove freeing page table pages with
>> a concurrent vmalloc() walking the kernel page table. This conflict can
>> not just be solved by taking the init_mm ptl because of existing locking
>> scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
>> method which is borrowed from user page table tear with free_pgd_range()
>> which skips freeing page table pages if intermediate address range is not
>> aligned or maximum floor-ceiling might not own the entire page table page.
>>
>> While here update arch_add_memory() to handle __add_pages() failures by
>> just unmapping recently added kernel linear mapping. Now enable memory hot
>> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>
>> This implementation is overall inspired from kernel page table tear down
>> procedure on X86 architecture and user page table tear down method.
>>
>> Acked-by: Steve Capper 
>> Acked-by: David Hildenbrand 
>> Signed-off-by: Anshuman Khandual 
> 
> Given the amount of changes since version 7, do the acks still stand?

I had taken the liberty of carrying them till V7, where the implementation
remained structurally similar, but as you mention, there have been some basic
changes in the approach since V7. Will drop these tags in the next version
and request fresh ACKs once again.

> 
> [...]
>> +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long 
>> end,
>> +   unsigned long floor, unsigned long ceiling)
>> +{
>> +struct page *page;
>> +pte_t *ptep;
>> +int i;
>> +
>> +if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK))
>> +return;
>> +
>> +ptep = pte_offset_kernel(pmdp, 0UL);
>> +for (i = 0; i < PTRS_PER_PTE; i++) {
>> +if (!pte_none(READ_ONCE(ptep[i])))
>> +return;
>> +}
>> +
>> +page = pmd_page(READ_ONCE(*pmdp));
> 
> Arguably, that's not the pmd page we are freeing here. Even if you get
> the same result, pmd_page() is normally used for huge pages pointed at
> by the pmd entry. Since you have the ptep already, why not use
> virt_to_page(ptep)?

Makes sense, will do.
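
i.e. the tail of free_pte_table() would become, roughly:

	/* ptep already points into the pte page being freed; derive the
	 * page from it directly rather than via pmd_page() */
	page = virt_to_page(ptep);
	pmd_clear(pmdp);
	__flush_tlb_kernel_pgtable(addr);
	free_hotplug_pgtable_page(page);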

> 
>> +pmd_clear(pmdp);
>> +__flush_tlb_kernel_pgtable(addr);
>> +free_hotplug_pgtable_page(page);
>> +}
>> +
>> +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long 
>> end,
>> +   unsigned long floor, unsigned long ceiling)
>> +{
>> +struct page *page;
>> +pmd_t *pmdp;
>> +int i;
>> +
>> +if (CONFIG_PGTABLE_LEVELS <= 2)
>> +return;
>> +
>> +if (!pgtable_ran

[PATCH V4 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-10-06 Thread Anshuman Khandual
This adds a test module which will validate architecture page table helpers
and accessors regarding compliance with generic MM semantics expectations.
This will help various architectures in validating changes to the existing
page table helpers or addition of new ones.

The test page table and the memory pages for creating its entries at various
levels are all allocated from system memory with the required alignments. If
memory pages with the required size and alignment could not be allocated, then
all dependent individual tests are skipped.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Suggested-by: Catalin Marinas 
Signed-off-by: Christophe Leroy 
Tested-by: Christophe Leroy #PPC32
Signed-off-by: Anshuman Khandual 
---
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 mm/Kconfig.debug                        |  15 +
 mm/Makefile                             |   1 +
 mm/arch_pgtable_test.c                  | 440 
 4 files changed, 458 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..b882792a3999 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
 #define pgtable_l5_enabled() 0
 #endif /* CONFIG_X86_5LEVEL */
 
+#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
+
 extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 327b3ebf23bf..683131b1ee7d 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -117,3 +117,18 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_ARCH_PGTABLE_TEST
+   bool "Test arch page table helpers for semantics compliance"
+   depends on MMU
+   depends on DEBUG_KERNEL
+   depends on !(ARM || IA64)
+   help
+ This options provides a kernel module which can be used to test
+ architecture page table helper functions on various platform in
+ verifying if they comply with expected generic MM semantics. This
+ will help architectures code in making sure that any changes or
+ new additions of these helpers will still conform to generic MM
+ expected semantics.
+
+ If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..bb572c5aa8c5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
new file mode 100644
index ..2942a0484d63
--- /dev/null
+++ b/mm/arch_pgtable_test.c
@@ -0,0 +1,440 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel module validates architecture page table helpers &
+ * accessors and helps in verifying their continued compliance with
+ * generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual 
+ */
+#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)= An old and not a young entry
+ * mkyoung(entry)  = A young and not an old entry
+ * mkdirty(entry)  = A dirty and not a clean entry
+ * mkclean(entry)  = A clean and not a dirty entry
+ * mkwrite(entry)  = A write and not a w

[PATCH V4 1/2] mm/hugetlb: Make alloc_gigantic_page() available for general use

2019-10-06 Thread Anshuman Khandual
alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
have been allocated through the buddy allocator. A subsequent patch which
tests arch page table helpers needs such a method to allocate PUD_SIZE
sized memory block. In the future such methods might have other use cases
as well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is now made available via the new
alloc_gigantic_page_order().
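
As a usage sketch (the caller shown here is an assumption based on the cover
letter's described fallback, not code from this series):

	struct page *page;

	/* try a naturally aligned PUD_SIZE block first */
	page = alloc_gigantic_page_order(get_order(PUD_SIZE), GFP_KERNEL,
					 first_online_node,
					 &node_states[N_ONLINE]);
	if (!page) {
		/* fall back to a buddy PMD-sized block; the pud_* tests
		 * are then skipped */
		page = alloc_pages(GFP_KERNEL, get_order(PMD_SIZE));
	}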

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Mike Kravetz 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 include/linux/hugetlb.h |  9 +
 mm/hugetlb.c            | 24 ++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d0..cc50d5ad4885 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -299,6 +299,9 @@ static inline bool is_file_hugepages(struct file *file)
 }
 
 
+struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)false
@@ -310,6 +313,12 @@ hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
return ERR_PTR(-ENOSYS);
 }
 
+static inline struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef37c85423a5..3fb81252f52b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1112,10 +1112,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
unsigned long nr_pages = 1 << order;
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
@@ -1151,6 +1150,14 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
return NULL;
 }
 
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+   int nid, nodemask_t *nodemask)
+{
+   unsigned int order = huge_page_order(h);
+
+   return alloc_gigantic_page_order(order, gfp_mask, nid, nodemask);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
 static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 #else /* !CONFIG_CONTIG_ALLOC */
@@ -1159,6 +1166,12 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* CONFIG_CONTIG_ALLOC */
 
 #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
@@ -1167,6 +1180,13 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
+
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
-- 
2.20.1



[PATCH V4 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-10-06 Thread Anshuman Khandual
This series adds a test validation for architecture exported page table
helpers. Patch in the series adds basic transformation tests at various
levels of the page table. Before that it exports gigantic page allocation
function from HugeTLB.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Changes in V4:

- Disable DEBUG_ARCH_PGTABLE_TEST for ARM and IA64 platforms

Changes in V3: 
(https://lore.kernel.org/patchwork/project/lkml/list/?series=411216)

- Changed test trigger from module format into late_initcall()
- Marked all functions with __init to be freed after completion
- Changed all __PGTABLE_PXX_FOLDED checks as mm_pxx_folded()
- Folded in PPC32 fixes from Christophe

Changes in V2:

https://lore.kernel.org/linux-mm/1568268173-31302-1-git-send-email-anshuman.khand...@arm.com/T/#t

- Fixed a small typo in MODULE_DESCRIPTION()
- Fixed m68k build problems for lvalue concerns in pmd_xxx_tests()
- Fixed dynamic page table level folding problems on x86 as per Kirill
- Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
- Allocate and free pte table with pte_alloc_one/pte_free per Kirill
- Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
- Changed RANDOM_NZVALUE value from 0xbe to 0xff
- Changed allocation, usage, free sequence for saved_ptep
- Renamed VMA_FLAGS as VMFLAGS
- Implemented a new method for random vaddr generation
- Implemented some other cleanups
- Dropped extern reference to mm_alloc()
- Created and exported new alloc_gigantic_page_order()
- Dropped the custom allocator and used new alloc_gigantic_page_order()

Changes in V1:

https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in RFC V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved the test module and its config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated it falls back on PMD aligned
  memory block from page allocator and pud_* tests are skipped
- Clear and populate tests now operate on real in-memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

Original RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Kravetz 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  mm/hugetlb: Make alloc_gigantic_page() available for general use
  mm/pgtable/debug: Add test validating architecture page table helpers

 arch/x86/include/asm/pgtable_64_types.h |   2 +
 include/linux/hugetlb.h |   9 +
 mm/Kconfig.debug                        |  15 +
 mm/Makefile                             |   1 +
 mm/arch_pgtable_test.c                  | 440 
 mm/hugetlb.c                            |  24 +-
 6 files changed, 489 insertions(+), 2 deletions(-)
 create mode 100644 mm/arch_pgtable_test.c

-- 
2.20.1



Re: [PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-04 Thread Anshuman Khandual



On 10/04/2019 04:28 PM, Michal Hocko wrote:
> On Thu 03-10-19 13:40:57, Anshuman Khandual wrote:
>> Having unmovable pages on a given pageblock should be reported correctly
>> when required with REPORT_FAILURE flag. But there can be a scenario where a
>> reserved page in the page block will get reported as a generic "unmovable"
>> reason code. Instead this should be changed to a more appropriate reason
>> code like "Reserved page".
> 
> Others have already pointed out this is just redundant but I will have a

Sure.

> more generic comment on the changelog. There is essentially no
> information why the current state is bad/unhelpful and why the chnage is

The current state is not necessarily bad or unhelpful. I just thought that it
could be improved upon. Somehow calling out explicitly only the CMA page
failure case just felt ad hoc, whereas there are other reasons like HugeTLB
immovability which might depend on other factors apart from just page flags
(though I did not propose that originally).

> needed. All you claim is that something is a certain way and then assert
> that it should be done differently. That is not how changelogs should
> look like.
> 

Okay, probably I should have explained more about why "unmovable" is less than
adequate to capture the exact reason for specific failure cases and how
"Reserved page" instead would have been better. But got the point, will improve.


Re: [PATCH] arm64/mm: Poison initmem while freeing with free_reserved_area()

2019-10-04 Thread Anshuman Khandual



On 10/04/2019 03:49 PM, Steven Price wrote:
> On 04/10/2019 05:23, Anshuman Khandual wrote:
>> Platform implementation for free_initmem() should poison the memory while
>> freeing it up. Hence pass across POISON_FREE_INITMEM while calling into
>> free_reserved_area(). The same is being followed in the generic fallback
>> for free_initmem() and some other platforms overriding it.
>>
>> Cc: Catalin Marinas 
>> Cc: Will Deacon 
>> Cc: Mark Rutland 
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
> 
> Is there a good reason you haven't made a similar change to
> free_initrd_mem() - the same logic seems to apply. However this change
> looks fine to me.

We will use the generic free_initrd_mem() going forward, as proposed in a
recent patch which does call free_reserved_area() with POISON_FREE_INITMEM.

https://patchwork.kernel.org/patch/11165379/

> 
> Reviewed-by: Steven Price 
> 
>> ---
>>  arch/arm64/mm/init.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> index 45c00a54909c..ea7d38011e83 100644
>> --- a/arch/arm64/mm/init.c
>> +++ b/arch/arm64/mm/init.c
>> @@ -571,7 +571,7 @@ void free_initmem(void)
>>  {
>>  free_reserved_area(lm_alias(__init_begin),
>> lm_alias(__init_end),
>> -   0, "unused kernel");
>> +   POISON_FREE_INITMEM, "unused kernel");
>>  /*
>>   * Unmap the __init region but leave the VM area in place. This
>>   * prevents the region from being reused for kernel modules, which
>>
> 
> 


[PATCH] arm64/mm: Poison initmem while freeing with free_reserved_area()

2019-10-03 Thread Anshuman Khandual
Platform implementation for free_initmem() should poison the memory while
freeing it up. Hence pass across POISON_FREE_INITMEM while calling into
free_reserved_area(). The same is being followed in the generic fallback
for free_initmem() and some other platforms overriding it.
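
For reference, the generic fallback mentioned above reduces to the following
(shape per init/main.c and free_initmem_default(), quoted from memory):

void __weak free_initmem(void)
{
	free_initmem_default(POISON_FREE_INITMEM);
}

where free_initmem_default() ends up calling:

	free_reserved_area(&__init_begin, &__init_end, poison, "unused kernel");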

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 45c00a54909c..ea7d38011e83 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -571,7 +571,7 @@ void free_initmem(void)
 {
 	free_reserved_area(lm_alias(__init_begin),
 			   lm_alias(__init_end),
-			   0, "unused kernel");
+			   POISON_FREE_INITMEM, "unused kernel");
 	/*
 	 * Unmap the __init region but leave the VM area in place. This
 	 * prevents the region from being reused for kernel modules, which
-- 
2.20.1



Re: [PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-03 Thread Anshuman Khandual



On 10/03/2019 05:20 PM, Qian Cai wrote:
> 
> 
>> On Oct 3, 2019, at 7:31 AM, Anshuman Khandual  
>> wrote:
>>
>> Ohh, never meant that the 'Reserved' bit is anything special here but it
>> is a reason to make a page unmovable unlike many other flags.
> 
> But dump_page() is used everywhere, and it is better to reserve “reason” to 
> indicate something more important rather than duplicating the page flags.
> 
> Especially, it is trivial enough right now for developers look in the page 
> flags dumping from has_unmovable_pages(), and figure out the exact branching 
> in the code.
> 

Will something like this be better? hugepage_migration_supported() is
uncertain, as it depends on the platform and the huge page size.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..8dbc86696515 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8175,7 +8175,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 	unsigned long found;
 	unsigned long iter = 0;
 	unsigned long pfn = page_to_pfn(page);
-	const char *reason = "unmovable page";
+	const char *reason;
 
 	/*
 	 * TODO we could make this much more efficient by not checking every
@@ -8194,7 +8194,7 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		if (is_migrate_cma(migratetype))
 			return false;
 
-		reason = "CMA page";
+		reason = "Unmovable CMA page";
 		goto unmovable;
 	}
 
@@ -8206,8 +8206,10 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
-		if (PageReserved(page))
+		if (PageReserved(page)) {
+			reason = "Unmovable reserved page";
 			goto unmovable;
+		}
 
 		/*
 		 * If the zone is movable and we have ruled out all reserved
@@ -8226,8 +8228,10 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		struct page *head = compound_head(page);
 		unsigned int skip_pages;
 
-		if (!hugepage_migration_supported(page_hstate(head)))
+		if (!hugepage_migration_supported(page_hstate(head))) {
+			reason = "Unmovable HugeTLB page";
 			goto unmovable;
+		}
 
 		skip_pages = compound_nr(head) - (page - head);
 		iter += skip_pages - 1;
@@ -8271,8 +8275,10 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		 * is set to both of a memory hole page and a _used_ kernel
 		 * page at boot.
 		 */
-		if (found > count)
+		if (found > count) {
+			reason = "Unmovable non-LRU page";
 			goto unmovable;
+		}
 	}
 	return false;
 unmovable:
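
With the above, the dump_page() reason line becomes self-describing; an
illustrative dump would read, e.g.:

page dumped because: Unmovable reserved page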


Re: [PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-03 Thread Anshuman Khandual



On 10/03/2019 04:49 PM, Qian Cai wrote:
> 
> 
>> On Oct 3, 2019, at 5:32 AM, Anshuman Khandual  
>> wrote:
>>
>> Even though page flags does contain reserved bit information, the problem
>> is that we are explicitly printing the reason for this page dump. In this
>> case it is caused by the fact that it is a reserved page.
>>
>> page dumped because: 
>>
>> The proposed change makes it explicit that the dump is caused because a
>> non movable page with reserved bit set. It also helps in differentiating 
>> between reserved bit condition and the last one "if (found > count)".
> 
> Then, someone later would like to add a reason for every different page flags 
> just because they *think* they are special.
> 

Ohh, never meant that the 'Reserved' bit is anything special here but it
is a reason to make a page unmovable unlike many other flags.


Re: [PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-03 Thread Anshuman Khandual



On 10/03/2019 03:02 PM, Anshuman Khandual wrote:
> 
> On 10/03/2019 02:35 PM, Qian Cai wrote:
>>
>>> On Oct 3, 2019, at 4:10 AM, Anshuman Khandual  
>>> wrote:
>>>
>>> Having unmovable pages on a given pageblock should be reported correctly
>>> when required with REPORT_FAILURE flag. But there can be a scenario where a
>>> reserved page in the page block will get reported as a generic "unmovable"
>>> reason code. Instead this should be changed to a more appropriate reason
>>> code like "Reserved page".
>> Isn’t this redundant as it dumps the flags in dump_page() anyway?
> Even though page flags does contain reserved bit information, the problem
> is that we are explicitly printing the reason for this page dump. In this
> case it is caused by the fact that it is a reserved page.
> 
> page dumped because: 
> 
> The proposed change makes it explicit that the dump is caused because a
> non movable page with reserved bit set. It also helps in differentiating 
> between reserved bit condition and the last one "if (found > count)"

Instead, will it be better to rename the reason codes as

1. "Unmovable (CMA)"
2. "Unmovable (Reserved)"
3. "Unmovable (Private/non-LRU)"


Re: [PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-03 Thread Anshuman Khandual



On 10/03/2019 02:35 PM, Qian Cai wrote:
> 
> 
>> On Oct 3, 2019, at 4:10 AM, Anshuman Khandual  
>> wrote:
>>
>> Having unmovable pages on a given pageblock should be reported correctly
>> when required with REPORT_FAILURE flag. But there can be a scenario where a
>> reserved page in the page block will get reported as a generic "unmovable"
>> reason code. Instead this should be changed to a more appropriate reason
>> code like "Reserved page".
> 
> Isn’t this redundant as it dumps the flags in dump_page() anyway?

Even though the page flags do contain the reserved bit information, the
problem is that we are explicitly printing the reason for this page dump. In
this case it is caused by the fact that it is a reserved page.

page dumped because: <reason>

The proposed change makes it explicit that the dump is caused by a non-movable
page with the reserved bit set. It also helps in differentiating between the
reserved bit condition and the last one, "if (found > count)".

> 
>>
>> Cc: Andrew Morton 
>> Cc: Michal Hocko 
>> Cc: Vlastimil Babka 
>> Cc: Oscar Salvador 
>> Cc: Mel Gorman 
>> Cc: Mike Rapoport 
>> Cc: Dan Williams 
>> Cc: Pavel Tatashin 
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>> mm/page_alloc.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 15c2050c629b..fbf93ea119d2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -8206,8 +8206,10 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>
>> 	page = pfn_to_page(check);
>>
>> -	if (PageReserved(page))
>> +	if (PageReserved(page)) {
>> +		reason = "Reserved page";
>> 		goto unmovable;
>> +	}
>>
>> 	/*
>> 	 * If the zone is movable and we have ruled out all reserved
>> -- 
>> 2.20.1
>>
> 


[PATCH] mm/page_alloc: Add a reason for reserved pages in has_unmovable_pages()

2019-10-03 Thread Anshuman Khandual
Having unmovable pages on a given pageblock should be reported correctly
when requested with the REPORT_FAILURE flag. But there is a scenario where
a reserved page in the pageblock gets reported with the generic "unmovable"
reason code. This should instead be changed to a more appropriate reason
code like "Reserved page".

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..fbf93ea119d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8206,8 +8206,10 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
page = pfn_to_page(check);
 
-   if (PageReserved(page))
+   if (PageReserved(page)) {
+   reason = "Reserved page";
goto unmovable;
+   }
 
/*
 * If the zone is movable and we have ruled out all reserved
-- 
2.20.1



Re: [PATCH v4] arm64: use generic free_initrd_mem()

2019-09-29 Thread Anshuman Khandual


On 09/28/2019 01:32 PM, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> arm64 calls memblock_free() for the initrd area in its implementation of
> free_initrd_mem(), but this call has no actual effect that late in the boot
> process. By the time initrd is freed, all the reserved memory is managed by
> the page allocator and the memblock.reserved is unused, so the only purpose
> of the memblock_free() call is to keep track of initrd memory for debugging
> and accounting.

That's correct. memblock_free_all() gets called before free_initrd_mem().
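Roughly, assuming the usual init ordering (sketched in the same style as
the call sequences used elsewhere in this thread):

start_kernel()
  ..
  mm_init()
    mem_init()
      memblock_free_all()	/* reserved memory handed to the buddy allocator */
  ..
free_initrd_mem()		/* much later, from an initcall */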

> 
> Without the memblock_free() call the only difference between arm64 and the
> generic versions of free_initrd_mem() is the memory poisoning.
> 
> Move memblock_free() call to the generic code, enable it there
> for the architectures that define ARCH_KEEP_MEMBLOCK and use the generic
> implementation of free_initrd_mem() on arm64.

This improves the generic free_initrd_mem() implementation for other
architectures to use.

> 
> Signed-off-by: Mike Rapoport 

Tested-by: Anshuman Khandual #arm64
Reviewed-by: Anshuman Khandual 

> ---
> 
> v4:
> * memblock_free() aligned area around the initrd
> 
> v3:
> * fix powerpc build
> 
> v2:
> * add memblock_free() to the generic free_initrd_mem()
> * rebase on the current upstream
> 
> 
>  arch/arm64/mm/init.c | 12 
>  init/initramfs.c |  8 
>  2 files changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 45c00a5..87a0e3b 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -580,18 +580,6 @@ void free_initmem(void)
>   unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin));
>  }
>  
> -#ifdef CONFIG_BLK_DEV_INITRD
> -void __init free_initrd_mem(unsigned long start, unsigned long end)
> -{
> - unsigned long aligned_start, aligned_end;
> -
> - aligned_start = __virt_to_phys(start) & PAGE_MASK;
> - aligned_end = PAGE_ALIGN(__virt_to_phys(end));
> - memblock_free(aligned_start, aligned_end - aligned_start);
> - free_reserved_area((void *)start, (void *)end, 0, "initrd");
> -}
> -#endif
> -
>  /*
>   * Dump out memory limit information on panic.
>   */
> diff --git a/init/initramfs.c b/init/initramfs.c
> index c47dad0..8ec1be4 100644
> --- a/init/initramfs.c
> +++ b/init/initramfs.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static ssize_t __init xwrite(int fd, const char *p, size_t count)
>  {
> @@ -529,6 +530,13 @@ extern unsigned long __initramfs_size;
>  
>  void __weak free_initrd_mem(unsigned long start, unsigned long end)
>  {
> +#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
> + unsigned long aligned_start = ALIGN_DOWN(start, PAGE_SIZE);
> + unsigned long aligned_end = ALIGN(end, PAGE_SIZE);
> +
> + memblock_free(__pa(aligned_start), aligned_end - aligned_start);
> +#endif
> +
>   free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
>   "initrd");
>  }
> 


Re: [PATCH v3] arm64: use generic free_initrd_mem()

2019-09-27 Thread Anshuman Khandual


On 09/25/2019 10:39 AM, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> arm64 calls memblock_free() for the initrd area in its implementation of
> free_initrd_mem(), but this call has no actual effect that late in the boot
> process. By the time initrd is freed, all the reserved memory is managed by
> the page allocator and the memblock.reserved is unused, so the only purpose
> of the memblock_free() call is to keep track of initrd memory for debugging
> and accounting.
> 
> Without the memblock_free() call the only difference between arm64 and the
> generic versions of free_initrd_mem() is the memory poisoning.
> 
> Move memblock_free() call to the generic code, enable it there
> for the architectures that define ARCH_KEEP_MEMBLOCK and use the generic
> implementaion of free_initrd_mem() on arm64.

Small nit. s/implementaion/implementation.

> 
> Signed-off-by: Mike Rapoport 
> ---
> 
> v3:
> * fix powerpc build
> 
> v2: 
> * add memblock_free() to the generic free_initrd_mem()
> * rebase on the current upstream
> 
> 
>  arch/arm64/mm/init.c | 12 
>  init/initramfs.c |  5 +
>  2 files changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 45c00a5..87a0e3b 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -580,18 +580,6 @@ void free_initmem(void)
>   unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin));
>  }
>  
> -#ifdef CONFIG_BLK_DEV_INITRD
> -void __init free_initrd_mem(unsigned long start, unsigned long end)
> -{
> - unsigned long aligned_start, aligned_end;
> -
> - aligned_start = __virt_to_phys(start) & PAGE_MASK;
> - aligned_end = PAGE_ALIGN(__virt_to_phys(end));
> - memblock_free(aligned_start, aligned_end - aligned_start);
> - free_reserved_area((void *)start, (void *)end, 0, "initrd");
> -}
> -#endif
> -
>  /*
>   * Dump out memory limit information on panic.
>   */
> diff --git a/init/initramfs.c b/init/initramfs.c
> index c47dad0..3d61e13 100644
> --- a/init/initramfs.c
> +++ b/init/initramfs.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static ssize_t __init xwrite(int fd, const char *p, size_t count)
>  {
> @@ -531,6 +532,10 @@ void __weak free_initrd_mem(unsigned long start, unsigned long end)
>  {
>   free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
>   "initrd");
> +
> +#ifdef CONFIG_ARCH_KEEP_MEMBLOCK

Should not the addresses here be aligned first before calling memblock_free()?
Without alignment, it breaks the present behavior on arm64, which was
explicitly added with 13776f9d40a0 ("arm64: mm: free the initrd reserved
memblock in a aligned manner"). Or does initrd always get allocated with page
alignment on other architectures?
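Something like the following (a sketch only) would preserve that arm64
behaviour in the generic helper:

	unsigned long aligned_start = ALIGN_DOWN(start, PAGE_SIZE);
	unsigned long aligned_end = ALIGN(end, PAGE_SIZE);

	memblock_free(__pa(aligned_start), aligned_end - aligned_start);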

> + memblock_free(__pa(start), end - start);
> +#endif
>  }
>  
>  #ifdef CONFIG_KEXEC_CORE
> 


Re: [PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-24 Thread Anshuman Khandual



On 09/25/2019 08:43 AM, Andrew Morton wrote:
> On Mon, 23 Sep 2019 11:16:38 +0530 Anshuman Khandual  wrote:
> 
>>
>>
>> On 09/16/2019 11:17 AM, Anshuman Khandual wrote:
>>> In add_memory_resource() the memory range to be hot added first gets into
>>> the memblock via memblock_add() before arch_add_memory() is called on it.
>>> Reverse sequence should be followed during memory hot removal which already
>>> is being followed in add_memory_resource() error path. This now ensures
>>> required re-order between memblock_[free|remove]() and arch_remove_memory()
>>> during memory hot-remove.
>>>
>>> Cc: Andrew Morton 
>>> Cc: Oscar Salvador 
>>> Cc: Michal Hocko 
>>> Cc: David Hildenbrand 
>>> Cc: Pavel Tatashin 
>>> Cc: Dan Williams 
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>> Original patch https://lkml.org/lkml/2019/9/3/327
>>>
Memory hot remove now works on arm64 without this because of a recent commit
60bb462fc7ad ("drivers/base/node.c: simplify
unregister_memory_block_under_nodes()").
>>>
>>> David mentioned that re-ordering should still make sense for consistency
>>> purpose (removing stuff in the reverse order they were added). This patch
>>> is now detached from arm64 hot-remove series.
>>>
>>> https://lkml.org/lkml/2019/9/3/326
>>
>> ...
>>
>> Hello Andrew,
>>
>> Any feedback on this? Does it look okay?
>>
> 
> Well.  I'd parked this for 5.4-rc1 processing because it looked like a
> cleanup.

This does not fix a serious problem. It just removes an inconsistency in how
resources are freed during memory hot remove, which for now does not pose a
real problem.

> 
> But way down below the ^---$ line I see "Memory hot remove now works
> on arm64".  Am I correct in believing that 60bb462fc7ad broke arm64 mem
> hot remove?  And that this patch fixes a serious regression?  If so,

No. The [proposed] arm64 memory hot remove series no longer depends on this
particular patch because 60bb462fc7ad has already solved the problem.

> that should have been right there in the patch title and changelog!

V2 (https://patchwork.kernel.org/patch/11159939/) of this patch makes it
very clear in its commit message.

- Anshuman


Re: [PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-24 Thread Anshuman Khandual



On 09/24/2019 06:01 PM, Mike Rapoport wrote:
> On Tue, Sep 24, 2019 at 02:51:01PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Sep 20, 2019 at 12:03:21PM +0530, Anshuman Khandual wrote:
>>> This series adds a test validation for architecture exported page table
>>> helpers. Patch in the series adds basic transformation tests at various
>>> levels of the page table. Before that it exports gigantic page allocation
>>> function from HugeTLB.
>>>
>>> This test was originally suggested by Catalin during arm64 THP migration
>>> RFC discussion earlier. Going forward it can include more specific tests
>>> with respect to various generic MM functions like THP, HugeTLB etc and
>>> platform specific tests.
>>>
>>> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/
>>>
>>> Testing:
>>>
>>> Successfully build and boot tested on both arm64 and x86 platforms without
>>> any test failing. Only build tested on some other platforms. Build failed
>>> on some platforms (known) in pud_clear_tests() as there were no available
>>> __pgd() definitions.
>>>
>>> - ARM32
>>> - IA64
>>
>> Hm. Grep shows __pgd() definitions for both of them. Is it for specific
>> config?
>  
> For ARM32 it's defined only for 3-lelel page tables, i.e with LPAE on.
> For IA64 it's defined for !STRICT_MM_TYPECHECKS which is even not a config
> option, but a define in arch/ia64/include/asm/page.h

Right. So where do we go from here? We will need help from platform folks to
fix this unless it is trivial. I did propose this on the last thread (v2),
wondering if it would be a better idea to restrict DEBUG_ARCH_PGTABLE_TEST to
architectures which have fixed all pending issues, whether build or run time.
Though enabling all platforms where the test at least builds might make more
sense, we might have to just exclude arm32 and ia64 for now. Run time problems
can then be fixed later, platform by platform. Any thoughts?

BTW the test is known to run successfully on arm64, x86 and ppc32 platforms.
Gerald has been trying to get it working on s390. In the meantime, if there
are other volunteers to test this on ppc64, sparc, riscv, mips, m68k etc.
platforms, it would be really helpful.

- Anshuman


[PATCH V2] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-24 Thread Anshuman Khandual
Currently during the memory hot add procedure, memory gets into memblock
before arch_add_memory() is called on it, which creates its linear mapping.

add_memory_resource() {
..
memblock_add_node()
..
arch_add_memory()
..
}

But during the memory hot remove procedure, removal from memblock happens
first, before its linear mapping gets torn down with arch_remove_memory(),
which is not consistent. Resource removal should happen in the reverse order
in which resources were added. However this does not pose any problem for
now, unless there is an assumption regarding the linear mapping. One example
was a subtle failure on the arm64 platform [1], though that has since found
a different solution.

try_remove_memory() {
..
memblock_free()
memblock_remove()
..
arch_remove_memory()
..
}

This changes the sequence of resource removal, including memblock and linear
mapping tear down, during memory hot remove so that it is now the reverse of
the order in which they were added during memory hot add. The changed removal
order looks like the following.

try_remove_memory() {
..
arch_remove_memory()
..
memblock_free()
memblock_remove()
..
}

[1] https://patchwork.kernel.org/patch/11127623/

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: Anshuman Khandual 
---
Changes in V2:

- Changed the commit message as per Michal and David 

Changed in V1: https://patchwork.kernel.org/patch/11146361/

Original patch https://lkml.org/lkml/2019/9/3/327

Memory hot remove now works on arm64 without this because of a recent commit
60bb462fc7ad ("drivers/base/node.c: simplify
unregister_memory_block_under_nodes()").

David mentioned that re-ordering should still make sense for consistency
purpose (removing stuff in the reverse order they were added). This patch
is now detached from arm64 hot-remove series.

https://lkml.org/lkml/2019/9/3/326

 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 49f7bf91c25a..4f7d426a84d0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1763,13 +1763,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
-   memblock_free(start, size);
-   memblock_remove(start, size);
 
/* remove memory block devices before removing memory */
remove_memory_block_devices(start, size);
 
arch_remove_memory(nid, start, size, NULL);
+   memblock_free(start, size);
+   memblock_remove(start, size);
__release_memory_resource(start, size);
 
try_offline_node(nid);
-- 
2.20.1



Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove

2019-09-24 Thread Anshuman Khandual



On 09/23/2019 04:47 PM, Matthew Wilcox wrote:
> On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote:
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void free_hotplug_page_range(struct page *page, size_t size)
>> +{
>> +WARN_ON(!page || PageReserved(page));
> 
> WARN_ON(!page) isn't terribly useful.  You're going to crash on the very
> next line when you call page_address() anyway.  If this line were
> 
>   if (WARN_ON(!page || PageReserved(page)))
>   return;
> 
> it would make sense, or if it were just
> 
>   WARN_ON(PageReserved(page))
> 
> it would also make sense.

I guess WARN_ON(PageReserved(page)) should be good enough to make sure that
the page being freed here was originally allocated at runtime for a previous
memory hot add operation and did not somehow come from the memblock reserved
area. That was the original objective for this check. Will change it.
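i.e. something like this (a sketch of the planned change, not the final
patch):

static void free_hotplug_page_range(struct page *page, size_t size)
{
	/* must be a runtime allocation, never a memblock reserved page */
	WARN_ON(PageReserved(page));
	free_pages((unsigned long)page_address(page), get_order(size));
}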

> 
>> +free_pages((unsigned long)page_address(page), get_order(size));
>> +}
> 


[PATCH] mm/memremap: Drop unused SECTION_SIZE and SECTION_MASK

2019-09-24 Thread Anshuman Khandual
SECTION_SIZE and SECTION_MASK macros are not getting used anymore. But they
do conflict with existing definitions on the arm64 platform, causing the
following warning during build. Let's drop these unused macros.

mm/memremap.c:16: warning: "SECTION_MASK" redefined
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)

In file included from ./arch/arm64/include/asm/processor.h:34,
 from ./include/linux/mutex.h:19,
 from ./include/linux/kernfs.h:12,
 from ./include/linux/sysfs.h:16,
 from ./include/linux/kobject.h:20,
 from ./include/linux/device.h:16,
 from mm/memremap.c:3:
./arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of
the previous definition
 #define SECTION_MASK  (~(SECTION_SIZE-1))

mm/memremap.c:17: warning: "SECTION_SIZE" redefined
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

In file included from ./arch/arm64/include/asm/processor.h:34,
 from ./include/linux/mutex.h:19,
 from ./include/linux/kernfs.h:12,
 from ./include/linux/sysfs.h:16,
 from ./include/linux/kobject.h:20,
 from ./include/linux/device.h:16,
 from mm/memremap.c:3:
./arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of
the previous definition
 #define SECTION_SIZE  (_AC(1, UL) << SECTION_SHIFT)

Cc: Dan Williams 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-kernel@vger.kernel.org
Reported-by: kbuild test robot 
Signed-off-by: Anshuman Khandual 
---
Build and boot tested on an arm64 platform with ZONE_DEVICE enabled. Only
build tested on arm64 and some other platforms with allmodconfig.

 mm/memremap.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index f6c17339cd0d..bf2882bfbf48 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -13,8 +13,6 @@
 #include 
 
 static DEFINE_XARRAY(pgmap_array);
-#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
-#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-- 
2.20.1



Re: [PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-23 Thread Anshuman Khandual



On 09/23/2019 04:24 PM, David Hildenbrand wrote:
> On 23.09.19 12:52, Michal Hocko wrote:
>> On Mon 16-09-19 11:17:37, Anshuman Khandual wrote:
>>> In add_memory_resource() the memory range to be hot added first gets into
>>> the memblock via memblock_add() before arch_add_memory() is called on it.
>>> Reverse sequence should be followed during memory hot removal which already
>>> is being followed in add_memory_resource() error path. This now ensures
>>> required re-order between memblock_[free|remove]() and arch_remove_memory()
>>> during memory hot-remove.
>>
>> This changelog is not really easy to follow. First of all please make
>> sure to explain whether there is any actual problem to solve or this is
>> an aesthetic matter. Please think of people reading this changelog in
>> few years and scratching their heads what you were thinking back then...
>>
> 
> I think it would make sense to just draft the current call sequence in
> the add and the removal path (instead of describing it) - then it
> becomes obvious why this is a cosmetic change.

Does this look okay?

mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

Currently during the memory hot add procedure, memory gets into memblock
before arch_add_memory() is called on it, which creates its linear mapping.

add_memory_resource() {
..
memblock_add_node()
..
arch_add_memory()
..
}

But during the memory hot remove procedure, removal from memblock happens
first, before its linear mapping gets torn down with arch_remove_memory(),
which is not consistent. Resource removal should happen in the reverse order
in which resources were added.

try_remove_memory() {
..
memblock_free()
memblock_remove()
..
arch_remove_memory()
..
}

This changes the sequence of resource removal, including memblock and linear
mapping tear down, during memory hot remove so that it is now the reverse of
the order in which they were added during memory hot add. The changed removal
order looks like the following.

try_remove_memory() {
..
arch_remove_memory()
..
memblock_free()
memblock_remove()
..
}


Re: [PATCH] arm64: use generic free_initrd_mem()

2019-09-23 Thread Anshuman Khandual



On 09/16/2019 07:25 PM, Mike Rapoport wrote:
> (added linux-arch)
> 
> On Mon, Sep 16, 2019 at 08:23:29AM -0400, Laura Abbott wrote:
>> On 9/16/19 8:21 AM, Mike Rapoport wrote:
>>> From: Mike Rapoport 
>>>
>>> arm64 calls memblock_free() for the initrd area in its implementation of
>>> free_initrd_mem(), but this call has no actual effect that late in the boot
>>> process. By the time initrd is freed, all the reserved memory is managed by
>>> the page allocator and the memblock.reserved is unused, so there is no
>>> point to update it.
>>>
>>
>> People like to use memblock for keeping track of memory even if it has no
>> actual effect. We made this change explicitly (see 05c58752f9dc ("arm64: To 
>> remove
>> initrd reserved area entry from memblock") That said, moving to the generic
>> APIs would be nice. Maybe we can find another place to update the accounting?
> 
> Any other place in arch/arm64 would make it messy because it would have to
> duplicate keepinitrd logic.
> 
> We could put the memblock_free() in the generic free_initrd_mem() with
> something like:
> 
> diff --git a/init/initramfs.c b/init/initramfs.c
> index c47dad0..403c6a0 100644
> --- a/init/initramfs.c
> +++ b/init/initramfs.c
> @@ -531,6 +531,10 @@ void __weak free_initrd_mem(unsigned long start,
> unsigned long end)
>  {
> free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
> "initrd");
> +
> +#ifdef CONFIG_ARCH_KEEP_MEMBLOCK
> +   memblock_free(__virt_to_phys(start), end - start);
> +#endif
>  }

This makes sense.

>  
>  #ifdef CONFIG_KEXEC_CORE
> 
> 
> Then powerpc and s390 folks will also be able to track the initrd memory :)

Sure.

> 
>>> Without the memblock_free() call the only difference between arm64 and the
>>> generic versions of free_initrd_mem() is the memory poisoning. Switching
>>> arm64 to the generic version will enable the poisoning.
>>>
>>> Signed-off-by: Mike Rapoport 
>>> ---
>>>
>>> I've boot tested it on qemu and I've checked that kexec works.
>>>
>>>  arch/arm64/mm/init.c | 8 
>>>  1 file changed, 8 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>> index f3c7952..8ad2934 100644
>>> --- a/arch/arm64/mm/init.c
>>> +++ b/arch/arm64/mm/init.c
>>> @@ -567,14 +567,6 @@ void free_initmem(void)
>>> unmap_kernel_range((u64)__init_begin, (u64)(__init_end - __init_begin));
>>>  }
>>> -#ifdef CONFIG_BLK_DEV_INITRD
>>> -void __init free_initrd_mem(unsigned long start, unsigned long end)
>>> -{
>>> -   free_reserved_area((void *)start, (void *)end, 0, "initrd");
>>> -   memblock_free(__virt_to_phys(start), end - start);
>>> -}
>>> -#endif
>>> -
>>>  /*
>>>   * Dump out memory limit information on panic.
>>>   */
>>>
>>
> 


Re: [PATCH V3 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-23 Thread Anshuman Khandual



On 09/21/2019 09:30 PM, kbuild test robot wrote:
> Hi Anshuman,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [cannot apply to v5.3 next-20190919]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see 
> https://stackoverflow.com/a/37406982]
> 
> url:
> https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-debug-Add-tests-for-architecture-exported-page-table-helpers/20190920-143746
> config: ia64-allmodconfig (attached as .config)
> compiler: ia64-linux-gcc (GCC) 7.4.0
> reproduce:
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.4.0 make.cross ARCH=ia64 
> :: branch date: 3 hours ago
> :: commit date: 3 hours ago
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All error/warnings (new ones prefixed by >>):
> 
>In file included from include/asm-generic/pgtable-nopud.h:8:0,
> from arch/ia64/include/asm/pgtable.h:591,
> from include/linux/mm.h:99,
> from include/linux/highmem.h:8,
> from mm/arch_pgtable_test.c:14:
>mm/arch_pgtable_test.c: In function 'pud_clear_tests':
>>> include/asm-generic/pgtable-nop4d-hack.h:47:32: error: implicit declaration of function '__pgd'; did you mean '__p4d'? [-Werror=implicit-function-declaration]
> #define __pud(x)((pud_t) { __pgd(x) })
>^
>>> mm/arch_pgtable_test.c:162:8: note: in expansion of macro '__pud'
>  pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
>^
>>> include/asm-generic/pgtable-nop4d-hack.h:47:22: warning: missing braces around initializer [-Wmissing-braces]
> #define __pud(x)((pud_t) { __pgd(x) })
>  ^
>>> mm/arch_pgtable_test.c:162:8: note: in expansion of macro '__pud'
>  pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
>^
>cc1: some warnings being treated as errors
> 
> # https://github.com/0day-ci/linux/commit/49047f93b076974eefa5b019311bd3b734d61f8c
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout 49047f93b076974eefa5b019311bd3b734d61f8c
> vim +47 include/asm-generic/pgtable-nop4d-hack.h
> 
> 30ec842660bd0d Kirill A. Shutemov 2017-03-09  45  
> 30ec842660bd0d Kirill A. Shutemov 2017-03-09  46  #define pud_val(x)	(pgd_val((x).pgd))
> 30ec842660bd0d Kirill A. Shutemov 2017-03-09 @47  #define __pud(x)	((pud_t) { __pgd(x) })

I had mentioned this build failure in the cover letter. The same problem
also exists on the arm32 platform.

- Anshuman


Re: [PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-22 Thread Anshuman Khandual



On 09/16/2019 11:17 AM, Anshuman Khandual wrote:
> In add_memory_resource() the memory range to be hot added first gets into
> the memblock via memblock_add() before arch_add_memory() is called on it.
> Reverse sequence should be followed during memory hot removal which already
> is being followed in add_memory_resource() error path. This now ensures
> required re-order between memblock_[free|remove]() and arch_remove_memory()
> during memory hot-remove.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Signed-off-by: Anshuman Khandual 
> ---
> Original patch https://lkml.org/lkml/2019/9/3/327
> 
> Memory hot remove now works on arm64 without this because of a recent commit
> 60bb462fc7ad ("drivers/base/node.c: simplify
> unregister_memory_block_under_nodes()").
> 
> David mentioned that re-ordering should still make sense for consistency
> purpose (removing stuff in the reverse order they were added). This patch
> is now detached from arm64 hot-remove series.
> 
> https://lkml.org/lkml/2019/9/3/326
> 
>  mm/memory_hotplug.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c73f09913165..355c466e0621 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  
>   /* remove memmap entry */
>   firmware_map_remove(start, start + size, "System RAM");
> - memblock_free(start, size);
> - memblock_remove(start, size);
>  
>   /* remove memory block devices before removing memory */
>   remove_memory_block_devices(start, size);
>  
>   arch_remove_memory(nid, start, size, NULL);
> + memblock_free(start, size);
> + memblock_remove(start, size);
>   __release_memory_resource(start, size);
>  
>   try_offline_node(nid);
> 

Hello Andrew,

Any feedback on this? Does it look okay?

- Anshuman


[PATCH V8 0/2] arm64/mm: Enable memory hot remove

2019-09-22 Thread Anshuman Khandual
tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (2):
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 305 ++--
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 4 files changed, 304 insertions(+), 9 deletions(-)

-- 
2.7.4



[PATCH V8 2/2] arm64/mm: Enable memory hot remove

2019-09-22 Thread Anshuman Khandual
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free()
and ___remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

It makes two distinct passes over the kernel page table. In the first pass
with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and
frees backing memory if required (vmemmap) for each mapped leaf entry. In
the second pass with free_empty_tables() it looks for empty page table
sections whose page table page can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages bail out if any of its
entries are still valid. This can happen for partially filled kernel page
table either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and a
concurrent vmalloc() walking the kernel page table. This conflict cannot
be solved just by taking the init_mm ptl, because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear down path in free_pgd_range(),
which skips freeing page table pages if the intermediate address range is not
aligned or the maximum floor-ceiling might not own the entire page table page.

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture and user page table tear down method.

Acked-by: Steve Capper 
Acked-by: David Hildenbrand 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 305 ++--
 3 files changed, 300 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 41a9b42..511249f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -274,6 +274,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+   def_bool y
+
 config SMP
def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50b..615dcd0 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR  (BPF_JIT_REGION_END)
 #define MODULES_VSIZE  (SZ_128M)
 #define VMEMMAP_START  (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END (VMEMMAP_START - SZ_2M)
 #define PCI_IO_START   (PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f..6178837 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -725,6 +725,277 @@ int kern_addr_valid(unsigned long addr)
 
return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+   WARN_ON(!page || PageReserved(page));
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+   free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+ unsigned long floor, unsigned long ceiling,
+ unsigned long mask)
+{
+   start &= mask;
+   if (start < floor)
+   return false;
+
+   if (ceiling) {
+   ceiling &= mask;
+   if (!ceiling)
+   return false;
+   }
+
+   if (end - 1 > ceiling - 1)
+   return false;
+   return true;
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
+{
+   struct page *page;
+   pte_t *ptep;
+   int i;
+
+   if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK))
+   return;
+
+   ptep = pte_offset_kernel(pmdp, 0UL);
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   if (!pte_none(READ_ONCE(ptep[i])))
+   return;
+   }
+
+ 

[PATCH V8 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

2019-09-22 Thread Anshuman Khandual
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand 
Acked-by: Mark Rutland 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/ptdump_debugfs.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..b5eebc8 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
 #include 
 
 #include 
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
struct ptdump_info *info = m->private;
+
+   get_online_mems();
ptdump_walk_pgd(m, info);
+   put_online_mems();
return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4



[PATCH V3 1/2] mm/hugetlb: Make alloc_gigantic_page() available for general use

2019-09-20 Thread Anshuman Khandual
alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
have been allocated through the buddy allocator. A subsequent patch which
tests arch page table helpers needs such a method to allocate PUD_SIZE
sized memory block. In the future such methods might have other use cases
as well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is made available via the new
alloc_gigantic_page_order().

Cc: Mike Kravetz 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 include/linux/hugetlb.h |  9 +
 mm/hugetlb.c| 24 ++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f9..cc50d5a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -299,6 +299,9 @@ static inline bool is_file_hugepages(struct file *file)
 }
 
 
+struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)false
@@ -310,6 +313,12 @@ hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
return ERR_PTR(-ENOSYS);
 }
 
+static inline struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef37c85..3fb8125 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1112,10 +1112,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
unsigned long nr_pages = 1 << order;
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
@@ -1151,6 +1150,14 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
return NULL;
 }
 
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+   int nid, nodemask_t *nodemask)
+{
+   unsigned int order = huge_page_order(h);
+
+   return alloc_gigantic_page_order(order, gfp_mask, nid, nodemask);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
 static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 #else /* !CONFIG_CONTIG_ALLOC */
@@ -1159,6 +1166,12 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* CONFIG_CONTIG_ALLOC */
 
 #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
@@ -1167,6 +1180,13 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
+
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
 static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
-- 
2.7.4



[PATCH V3 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-20 Thread Anshuman Khandual
This adds a test module which will validate architecture page table helpers
and accessors regarding compliance with generic MM semantics expectations.
This will help various architectures in validating changes to the existing
page table helpers or addition of new ones.

The test page table, and the memory pages used for creating its entries at
various levels, are all allocated from system memory with the required
alignments. If memory pages with the required size and alignment cannot be
allocated, then all depending individual tests are skipped (see the sketch
below).
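An illustrative sketch (not part of the patch) of the allocate-or-skip
pattern described above, using the helper exported by patch 1/2; the
fallback order and the variable names here are assumptions:

	page = alloc_gigantic_page_order(get_order(PUD_SIZE),
					 GFP_KERNEL, nid, NULL);
	if (!page) {
		/* fall back to a PMD sized block; pud_* tests get skipped */
		page = alloc_pages(GFP_KERNEL, get_order(PMD_SIZE));
	}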

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Suggested-by: Catalin Marinas 
Signed-off-by: Christophe Leroy 
Tested-by: Christophe Leroy(PPC32)
Signed-off-by: Anshuman Khandual 
---
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 mm/Kconfig.debug|  14 +
 mm/Makefile |   1 +
 mm/arch_pgtable_test.c  | 440 
 4 files changed, 457 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f..b882792 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
 #define pgtable_l5_enabled() 0
 #endif /* CONFIG_X86_5LEVEL */
 
+#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
+
 extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 327b3eb..ce9c397 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_ARCH_PGTABLE_TEST
+   bool "Test arch page table helpers for semantics compliance"
+   depends on MMU
+   depends on DEBUG_KERNEL
+   help
+ This options provides a kernel module which can be used to test
+ architecture page table helper functions on various platform in
+ verifying if they comply with expected generic MM semantics. This
+ will help architectures code in making sure that any changes or
+ new additions of these helpers will still conform to generic MM
+ expected semantics.
+
+ If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index d996846..bb572c5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
new file mode 100644
index 000..2942a04
--- /dev/null
+++ b/mm/arch_pgtable_test.c
@@ -0,0 +1,440 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel module validates architecture page table helpers &
+ * accessors and helps in verifying their continued compliance with
+ * generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual 
+ */
+#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)= An old and not a young entry
+ * mkyoung(entry)  = A young and not an old entry
+ * mkdirty(entry)  = A dirty and not a clean entry
+ * mkclean(entry)  = A clean and not a dirty entry
+ * mkwrite(entry)  = A write and not a write protected entry
+ * wrprotect(entry)= A write protected and not a write entry

[PATCH V3 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-20 Thread Anshuman Khandual
This series adds a test validation for architecture exported page table
helpers. Patch in the series adds basic transformation tests at various
levels of the page table. Before that it exports gigantic page allocation
function from HugeTLB.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Testing:

Successfully build and boot tested on both arm64 and x86 platforms without
any test failing. Only build tested on some other platforms. Build failed
on some platforms (known) in pud_clear_tests() as there were no available
__pgd() definitions.

- ARM32
- IA64

But I would really appreciate if folks can help validate this test on other
architectures and report back problems. All suggestions, comments and inputs
welcome. Thank you.

Changes in V3:

- Changed test trigger from module format into late_initcall()
- Marked all functions with __init to be freed after completion
- Changed all __PGTABLE_PXX_FOLDED checks as mm_pxx_folded()
- Folded in PPC32 fixes from Christophe

Changes in V2:

https://lore.kernel.org/linux-mm/1568268173-31302-1-git-send-email-anshuman.khand...@arm.com/T/#t

- Fixed small typo error in MODULE_DESCRIPTION()
- Fixed m68k build problems for lvalue concerns in pmd_xxx_tests()
- Fixed dynamic page table level folding problems on x86 as per Kirill
- Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
- Allocate and free pte table with pte_alloc_one/pte_free per Kirill
- Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
- Changed RANDOM_NZVALUE value from 0xbe to 0xff
- Changed allocation, usage, free sequence for saved_ptep
- Renamed VMA_FLAGS as VMFLAGS
- Implemented a new method for random vaddr generation
- Implemented some other cleanups
- Dropped extern reference to mm_alloc()
- Created and exported new alloc_gigantic_page_order()
- Dropped the custom allocator and used new alloc_gigantic_page_order()

Changes in V1:

https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in RFC V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved test module and it's config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated it falls back on PMD aligned
  memory block from page allocator and pud_* tests are skipped
- Clear and populate tests now operate on real in memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

Original RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Kravetz 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  mm/hugetlb: Make alloc_gigantic_page() available for general use
  mm/pgtable/debug: Add test validating architecture page table helpers

 arch/x86/include/asm/pgtable_64_types.h |   2 +
 include/linux/hugetlb.h |   9 +
 mm/Kconfig.debug|  14 +
 mm/Makefile |   1 +
 mm/arch_pgtable_test.c  | 440 
 mm/hugetlb.c|  24 +-
 6 files changed, 488 insertions(+), 2 deletions(-)
 create mode 100644 mm/arch_pgtable_test.c

-- 
2.7.4



Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-19 Thread Anshuman Khandual



On 09/18/2019 11:52 PM, Gerald Schaefer wrote:
> On Wed, 18 Sep 2019 18:26:03 +0200
> Christophe Leroy  wrote:
> 
> [..] 
>> My suggestion was not to completely drop the #ifdef but to do like you 
>> did in pgd_clear_tests() for instance, ie to add the following test on 
>> top of the function:
>>
>>  if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>>  return;
>>
> 
> Ah, very nice, this would also fix the remaining issues for s390. Since
> we have dynamic page table folding, neither __PAGETABLE_PXX_FOLDED nor
> __ARCH_HAS_XLEVEL_HACK is defined, but mm_pxx_folded() will work.

Like Christophe mentioned earlier on the other thread, we will convert all
__PGTABLE_PXX_FOLDED checks to mm_pxx_folded(), but it looks like the
ARCH_HAS_[4 and 5]LEVEL_HACK macros will still be around. I will respin the
series with all agreed upon changes first, and we can then discuss the
pending issues from there.
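In C terms, Christophe's suggested guard would look something like the
sketch below. Note that is_defined() in the quoted suggestion is pseudocode;
__ARCH_HAS_5LEVEL_HACK is not a Kconfig symbol, so it cannot go through
IS_ENABLED() and an #ifndef is used instead. The function is also assumed
to gain an mm argument:

static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp)
{
#ifndef __ARCH_HAS_5LEVEL_HACK
	p4d_t p4d;

	if (mm_pud_folded(mm))
		return;

	p4d = READ_ONCE(*p4dp);
	/* ... rest of the test unchanged ... */
#endif
}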

> 
> mm_alloc() returns with a 3-level page table by default on s390, so we
> will run into issues in p4d_clear/populate_tests(), and also at the end
> with p4d/pud_free() (double free).
> 
> So, adding the mm_pud_folded() check to p4d_clear/populate_tests(),
> and also adding mm_p4d/pud_folded() checks at the end before calling
> p4d/pud_free(), would make it all work on s390.

At least the p4d_clear/populate_tests() tests will be taken care of.

> 
> BTW, regarding p4d/pud_free(), I'm not sure if we should rather check
> the folding inside our s390 functions, similar to how we do it for
> p4d/pud_free_tlb(), instead of relying on not being called for folded
> p4d/pud. So far, I see no problem with this behavior, all callers of
> p4d/pud_free() should be fine because of our folding check within
> p4d/pud_present/none(). But that doesn't mean that it is correct not
> to check for the folding inside p4d/pud_free(). At least, with this
> test module we do now have a caller of p4d/pud_free() on potentially
> folded entries, so instead of adding pxx_folded() checks to this
> test module, we could add them to our p4d/pud_free() functions.
> Any thoughts on this?

Agreed, it seems better to do the check inside the p4d/pud_free() functions,
for instance as sketched below.
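For instance (a sketch only, using s390's crst_table_free() as the example
primitive):

static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
{
	if (!mm_p4d_folded(mm))
		crst_table_free(mm, (unsigned long *) p4d);
}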


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Anshuman Khandual



On 09/18/2019 09:56 PM, Christophe Leroy wrote:
> 
> 
> Le 18/09/2019 à 07:04, Anshuman Khandual a écrit :
>>
>>
>> On 09/13/2019 03:31 PM, Christophe Leroy wrote:
>>>
>>>
>>> Le 13/09/2019 à 11:02, Anshuman Khandual a écrit :
>>>>
>>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>>>>>
>>>>> #ifdefs have to be avoided as much as possible, see below
>>>>
>>>> Yeah, but it has been a bit difficult to avoid all these #ifdefs because
>>>> of the availability (or lack of it) of all these pgtable helpers in
>>>> various config combinations on all platforms.
>>>
>>> As far as I can see these pgtable helpers should exist everywhere at least 
>>> via asm-generic/ files.
>>
>> But they might not actually do the right thing.
>>
>>>
>>> Can you spot a particular config which fails ?
>>
>> Let's consider the following example (after removing the #ifdefs around it),
>> which builds successfully but fails to pass the intended test. This is with
>> an arm64 config with 4K page size and a 39 bit VA space, which ends up with
>> a 3 level page table arrangement.
>>
>> static void __init p4d_clear_tests(p4d_t *p4dp)
>> {
>>  p4d_t p4d = READ_ONCE(*p4dp);
> 
> My suggestion was not to completely drop the #ifdef but to do like you did in 
> pgd_clear_tests() for instance, ie to add the following test on top of the 
> function:
> 
> if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>     return;
> 

Sometimes this does not really work. On some platforms, the combination of
__PAGETABLE_PUD_FOLDED and __ARCH_HAS_5LEVEL_HACK decides whether helpers
such as __pud() or __pgd() are even available for that platform. Ideally it
should have been handled through generic fallbacks in include/*/, but I guess
there might be bugs on the platform, or it has not been changed to adopt the
5 level page table framework with the required folding macros etc.

>>
>>  p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
>>  WRITE_ONCE(*p4dp, p4d);
>>  p4d_clear(p4dp);
>>  p4d = READ_ONCE(*p4dp);
>>  WARN_ON(!p4d_none(p4d));
>> }
>>
>> The following test hits an error at WARN_ON(!p4d_none(p4d))
>>
>> [   16.757333] [ cut here ]
>> [   16.758019] WARNING: CPU: 11 PID: 1 at mm/arch_pgtable_test.c:187 arch_pgtable_tests_init+0x24c/0x474
>> [   16.759455] Modules linked in:
>> [   16.759952] CPU: 11 PID: 1 Comm: swapper/0 Not tainted 5.3.0-next-20190916-5-g61c218153bb8-dirty #222
>> [   16.761449] Hardware name: linux,dummy-virt (DT)
>> [   16.762185] pstate: 0045 (nzcv daif +PAN -UAO)
>> [   16.762964] pc : arch_pgtable_tests_init+0x24c/0x474
>> [   16.763750] lr : arch_pgtable_tests_init+0x174/0x474
>> [   16.764534] sp : ffc011d7bd50
>> [   16.765065] x29: ffc011d7bd50 x28: 1756bac0
>> [   16.765908] x27: ff85ddaf3000 x26: 02e8
>> [   16.766767] x25: ffc0111ce000 x24: ff85ddaf32e8
>> [   16.767606] x23: ff85ddaef278 x22: 0045cc844000
>> [   16.768445] x21: 00065daef003 x20: 1754
>> [   16.769283] x19: ff85ddb6 x18: 0014
>> [   16.770122] x17: 980426bb x16: 698594c6
>> [   16.770976] x15: 66e25a88 x14: 
>> [   16.771813] x13: 1754 x12: 000a
>> [   16.772651] x11: ff85fcfd0a40 x10: 0001
>> [   16.773488] x9 : 0008 x8 : ffc01143ab26
>> [   16.774336] x7 :  x6 : 
>> [   16.775180] x5 :  x4 : 
>> [   16.776018] x3 : 1756bbe8 x2 : 00065daeb003
>> [   16.776856] x1 : 0065daeb x0 : f000
>> [   16.777693] Call trace:
>> [   16.778092]  arch_pgtable_tests_init+0x24c/0x474
>> [   16.778843]  do_one_initcall+0x74/0x1b0
>> [   16.779458]  kernel_init_freeable+0x1cc/0x290
>> [   16.780151]  kernel_init+0x10/0x100
>> [   16.780710]  ret_from_fork+0x10/0x18
>> [   16.781282] ---[ end trace 042e6c40c0a3b038 ]---
>>
>> On arm64 (4K page size|39 bits VA|3 level page table)
>>
>> #elif CONFIG_PGTABLE_LEVELS == 3    /* Applicable here */
>> #define __ARCH_USE_5LEVEL_HACK
>> #include 
>>
>> Which pulls in
>>
>> #include 
>>
>> which pulls in
>>
>> #include 
>>
>> which defines
>>
>> static inline int p4d_

Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-18 Thread Anshuman Khandual



On 09/16/2019 07:14 AM, Balbir Singh wrote:
> 
> 
> On 3/9/19 7:45 pm, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> 
> I could not find this path in the code; the only caller of get_nid_for_pfn()
> was register_mem_sect_under_node(), during system boot.
> 
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1,
>> skipping the subsequent sysfs_remove_link() calls and leaving the node <->
>> memory block sysfs entries as is. A subsequent memory add operation then
>> hits a BUG_ON() because of the existing sysfs entries.
>>
>> [   62.007176] NUMA: Unknown node for memory at 0x68000, assuming node 0
>> [   62.052517] [ cut here ]
> 
> This seems like arm64 is not ready for probe_store() via
> drivers/base/memory.c / node.c
> 
>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> 
> 
> 
>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [   62.054589] Modules linked in:
>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 
>> 5.1.0-rc2-4-g28cea40b2683 #41
>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>> [   62.057166] pstate: 4045 (nZcv daif +PAN -UAO)
>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>> [   62.059842] sp : 168b3ce0
>> [   62.060477] x29: 168b3ce0 x28: 8005db546c00
>> [   62.061501] x27:  x26: 
>> [   62.062509] x25: 111ef000 x24: 111ef5d0
>> [   62.063520] x23:  x22: 0006bfff
>> [   62.064540] x21: ffef x20: 006c
>> [   62.065558] x19: 0068 x18: 0024
>> [   62.066566] x17:  x16: 
>> [   62.067579] x15:  x14: 8005e412e890
>> [   62.068588] x13: 8005d6b105d8 x12: 
>> [   62.069610] x11: 8005d6b10490 x10: 0040
>> [   62.070615] x9 : 8005e412e898 x8 : 8005e412e890
>> [   62.071631] x7 : 8005d6b105d8 x6 : 8005db546c00
>> [   62.072640] x5 : 0001 x4 : 0002
>> [   62.073654] x3 : 8005d7049480 x2 : 0002
>> [   62.074666] x1 : 0003 x0 : ffef
>> [   62.075685] Process bash (pid: 3275, stack limit = 0xd754280f)
>> [   62.076930] Call trace:
>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>> [   62.078227]  __add_memory+0x70/0xa8
>> [   62.078901]  probe_store+0xa4/0xc8
>> [   62.079561]  dev_attr_store+0x18/0x28
>> [   62.080270]  sysfs_kf_write+0x40/0x58
>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>> [   62.081744]  __vfs_write+0x18/0x40
>> [   62.082400]  vfs_write+0xa4/0x1b0
>> [   62.083037]  ksys_write+0x5c/0xc0
>> [   62.083681]  __arm64_sys_write+0x18/0x20
>> [   62.084432]  el0_svc_handler+0x88/0x100
>> [   62.085177]  el0_svc+0x8/0xc
>>
>> Re-ordering memblock_[free|remove]() after arch_remove_memory() solves the
>> problem on arm64, as pfn_valid() then behaves correctly and returns true
>> because the memblock for the address range still exists. arch_remove_memory()
>> removes the applicable memory sections from the zone with __remove_pages()
>> and tears down the kernel linear mapping. Removing the memblock regions
>> afterwards is safe because there is no other memblock (bootmem) allocator
>> user that late. So nobody is going to allocate from the removed range and
>> blow up later. Also nobody should still be using the bootmem allocated
>> range, else we wouldn't allow removing it. So the reordering is indeed safe.
>>
>> Reviewed-by: David Hildenbrand 
>> Reviewed-by: Oscar Salvador 
>> Acked-by: Mark Rutland 
>> Acked-by: Michal Hocko 
>> Signed-off-by: Anshuman Khandual 
>> ---
> 
> Honestly, the issue is not clear from the changelog, largely
> because I can't find get_nid_for_pfn() being used anywhere
> in memory hotunplug. I can see why using pfn_valid() after
> memblock_free/remove is 

Re: [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

2019-09-18 Thread Anshuman Khandual



On 09/15/2019 08:05 AM, Balbir Singh wrote:
> 
> 
> On 3/9/19 7:45 pm, Anshuman Khandual wrote:
>> The arm64 page table dump code can race with concurrent modification of the
>> kernel page tables. When leaf entries are modified concurrently, the dump
>> code may log stale or inconsistent information for a VA range, but this is
>> otherwise not harmful.
>>
>> When intermediate levels of table are freed, the dump code will continue to
>> use memory which has been freed and potentially reallocated for another
>> purpose. In such cases, the dump code may dereference bogus addresses,
>> leading to a number of potential problems.
>>
>> Intermediate levels of table may be freed during memory hot-remove,
>> which will be enabled by a subsequent patch. To avoid racing with
>> this, take the memory hotplug lock when walking the kernel page table.
>>
>> Acked-by: David Hildenbrand 
>> Acked-by: Mark Rutland 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  arch/arm64/mm/ptdump_debugfs.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
>> index 064163f25592..b5eebc8c4924 100644
>> --- a/arch/arm64/mm/ptdump_debugfs.c
>> +++ b/arch/arm64/mm/ptdump_debugfs.c
>> @@ -1,5 +1,6 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include 
>> @@ -7,7 +8,10 @@
>>  static int ptdump_show(struct seq_file *m, void *v)
>>  {
>>  struct ptdump_info *info = m->private;
>> +
>> +get_online_mems();
>>  ptdump_walk_pgd(m, info);
>> +put_online_mems();
> 
> Looks sane. BTW, checking other arches, they might have the same race.

The problem can be present on any other architecture which can dump the
kernel page table during a memory hot-remove operation that actually frees
up page table pages. Even without any freeing involved, the race condition
here could cause inconsistent or garbage information to be captured for a
given VA range; the same is true for concurrent vmalloc() operations. But
removal of page table pages can make it worse. Freeing page table pages
during hot-remove is a platform decision, hence these locks are added
while walking the kernel page table during ptdump.

> Is there anything special about the arch?

AFAICS, no.

> 
> Acked-by: Balbir Singh 
> 
> 


Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-18 Thread Anshuman Khandual



On 09/13/2019 11:53 AM, Christophe Leroy wrote:
> Fix build failure on powerpc.
> 
> Fix preemption imbalance.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  mm/arch_pgtable_test.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> index 8b4a92756ad8..f2b3c9ec35fa 100644
> --- a/mm/arch_pgtable_test.c
> +++ b/mm/arch_pgtable_test.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
>   p4d_clear_tests(p4dp);
>   pgd_clear_tests(mm, pgdp);
>  
> + pte_unmap(ptep);
> +
>   pmd_populate_tests(mm, pmdp, saved_ptep);
>   pud_populate_tests(mm, pudp, saved_pmdp);
>   p4d_populate_tests(mm, p4dp, saved_pudp);
> 

Hello Christophe,

I am planning to fold this fix into the current patch and retain your
Signed-off-by. Are you okay with that?

- Anshuman


Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-17 Thread Anshuman Khandual



On 09/13/2019 03:31 PM, Christophe Leroy wrote:
> 
> 
> On 13/09/2019 11:02, Anshuman Khandual wrote:
>>
>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>>>
>>> #ifdefs have to be avoided as much as possible, see below
>>
>> Yeah, but it has been a bit difficult to avoid all these #ifdefs because of
>> the availability (or lack thereof) of these pgtable helpers across various
>> config combinations on all platforms.
> 
> As far as I can see these pgtable helpers should exist everywhere at least 
> via asm-generic/ files.

But they might not actually do the right thing.

> 
> Can you spot a particular config which fails?

Let's consider the following example (after removing the #ifdefs around it),
which builds successfully but fails the intended test. This is with an arm64
config using 4K page size and a 39-bit VA space, which ends up with a
3-level page table arrangement.

static void __init p4d_clear_tests(p4d_t *p4dp)
{
	p4d_t p4d = READ_ONCE(*p4dp);

	p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
	WRITE_ONCE(*p4dp, p4d);
	p4d_clear(p4dp);
	p4d = READ_ONCE(*p4dp);
	WARN_ON(!p4d_none(p4d));
}

The test above hits an error at WARN_ON(!p4d_none(p4d)):

[   16.757333] [ cut here ]
[   16.758019] WARNING: CPU: 11 PID: 1 at mm/arch_pgtable_test.c:187 
arch_pgtable_tests_init+0x24c/0x474
[   16.759455] Modules linked in:
[   16.759952] CPU: 11 PID: 1 Comm: swapper/0 Not tainted 
5.3.0-next-20190916-5-g61c218153bb8-dirty #222
[   16.761449] Hardware name: linux,dummy-virt (DT)
[   16.762185] pstate: 0045 (nzcv daif +PAN -UAO)
[   16.762964] pc : arch_pgtable_tests_init+0x24c/0x474
[   16.763750] lr : arch_pgtable_tests_init+0x174/0x474
[   16.764534] sp : ffc011d7bd50
[   16.765065] x29: ffc011d7bd50 x28: 1756bac0 
[   16.765908] x27: ff85ddaf3000 x26: 02e8 
[   16.766767] x25: ffc0111ce000 x24: ff85ddaf32e8 
[   16.767606] x23: ff85ddaef278 x22: 0045cc844000 
[   16.768445] x21: 00065daef003 x20: 1754 
[   16.769283] x19: ff85ddb6 x18: 0014 
[   16.770122] x17: 980426bb x16: 698594c6 
[   16.770976] x15: 66e25a88 x14:  
[   16.771813] x13: 1754 x12: 000a 
[   16.772651] x11: ff85fcfd0a40 x10: 0001 
[   16.773488] x9 : 0008 x8 : ffc01143ab26 
[   16.774336] x7 :  x6 :  
[   16.775180] x5 :  x4 :  
[   16.776018] x3 : 1756bbe8 x2 : 00065daeb003 
[   16.776856] x1 : 0065daeb x0 : f000 
[   16.777693] Call trace:
[   16.778092]  arch_pgtable_tests_init+0x24c/0x474
[   16.778843]  do_one_initcall+0x74/0x1b0
[   16.779458]  kernel_init_freeable+0x1cc/0x290
[   16.780151]  kernel_init+0x10/0x100
[   16.780710]  ret_from_fork+0x10/0x18
[   16.781282] ---[ end trace 042e6c40c0a3b038 ]---

On arm64 (4K page size|39 bits VA|3 level page table)

#elif CONFIG_PGTABLE_LEVELS == 3	/* Applicable here */
#define __ARCH_USE_5LEVEL_HACK
#include 

Which pulls in 

#include 

which pulls in

#include 

which defines

static inline int p4d_none(p4d_t p4d)
{
	return 0;
}

which will invariably trigger WARN_ON(!p4d_none(p4d)).

Similarly, the next test, p4d_populate_tests(), will always appear to be
successful because p4d_bad() invariably returns false.

static inline int p4d_bad(p4d_t p4d)
{
	return 0;
}

static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
				      pud_t *pudp)
{
	p4d_t p4d;

	/*
	 * This entry points to next level page table page.
	 * Hence this must not qualify as p4d_bad().
	 */
	pud_clear(pudp);
	p4d_clear(p4dp);
	p4d_populate(mm, p4dp, pudp);
	p4d = READ_ONCE(*p4dp);
	WARN_ON(p4d_bad(p4d));
}

We should not run these tests on such a config because they are not
applicable there and will invariably produce the same result.
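
As a sketch of the runtime-skip alternative (assuming mm_p4d_folded() is
actually available and correct on the platform, which as noted above is
not guaranteed everywhere):

static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp)
{
	p4d_t p4d = READ_ONCE(*p4dp);

	/* Folded p4d level: p4d_none() is hardwired to return 0. */
	if (mm_p4d_folded(mm))
		return;

	p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
	WRITE_ONCE(*p4dp, p4d);
	p4d_clear(p4dp);
	p4d = READ_ONCE(*p4dp);
	WARN_ON(!p4d_none(p4d));
}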

> 
>>
>>>
> 
> [...]
> 
>>>> +#if !defined(__PAGETABLE_PUD_FOLDED) && !defined(__ARCH_HAS_5LEVEL_HACK)
>>>
>>> The same can be done here.
>>
>> IIRC it is not only the page table helpers; there are data types (pxx_t)
>> which are not present on various configs, and these wrappers help prevent
>> build failures. Anyway, I will try to see if this can be improved further.
>> Meanwhile, if you have some suggestions, please do let me know.
> 
> pgd_t and pmd_t are everywhere I guess.
> Then pud_t and p4d_t have fallbacks in asm-generic files.

Let's take another example where it fails to compile: arm64 with 16K
page size, 48-bit VA, and a 4-level page table arrangement in the fol

Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove

2019-09-16 Thread Anshuman Khandual



On 09/13/2019 03:39 PM, Catalin Marinas wrote:
> On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
>> On 09/13/2019 01:45 AM, Catalin Marinas wrote:
>>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, 
>>>> unsigned long end, int node,
>>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>>struct vmem_altmap *altmap)
>>>>  {
>>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>>> +  /*
>>>> +   * FIXME: We should have called remove_pagetable(start, end, true).
>>>> +   * vmemmap and vmalloc virtual range might share intermediate kernel
>>>> +   * page table entries. Removing vmemmap range page table pages here
>>>> +   * can potentially conflict with a concurrent vmalloc() allocation.
>>>> +   *
>>>> +   * This is primarily because vmalloc() does not take init_mm ptl for
>>>> +   * the entire page table walk and its modification. Instead it just
>>>> +   * takes the lock while allocating and installing page table pages
>>>> +   * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>>> +   * entry via memory hot remove can cause vmalloc() kernel page table
>>>> +   * walk pointers to be invalid on the fly which can cause corruption
>>>> +   * or worse, a crash.
>>>> +   *
>>>> +   * So free_empty_tables() gets called where vmalloc and vmemmap range
>>>> +   * do not overlap at any intermediate level kernel page table entry.
>>>> +   */
>>>> +  unmap_hotplug_range(start, end, true);
>>>> +  if (!vmalloc_vmemmap_overlap)
>>>> +  free_empty_tables(start, end);
>>>> +#endif
>>>>  }
>>>
>>> So, I see the risk with overlapping and I guess for some kernel
>>> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
>>
>> I did not see the 64K config options having this overlap; do you suspect
>> they might? After the 52-bit KVA series was merged, the following
>> configurations have the vmalloc-vmemmap range overlap problem.
>>
>> - 4K  page size with 48 bit VA space
>> - 16K page size with 48 bit VA space
> 
> OK. I haven't checked, so it was just a guess that 64K has this problem
> since the pgd entry coverage is fairly large.
> 
>>> can, that's great, otherwise could we rewrite the above functions to
>>> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
>>> function works if you called it on init_mm and kernel address range). By
>>
>> Hmm, never tried that. Are you wondering if this can be used directly?
>> There are two distinct elements which make it very specific to user page
>> tables: mmu_gather based TLB tracking, and mm->pgtable_bytes accounting
>> with mm_dec_nr_pxx().
> 
> Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly.
> We could, however, use the same approach for kernel page tables.

Right.

> 
>>> having the vmemmap start/end information it avoids freeing partially
>>> filled page table pages.
>>
>> Did you mean page table pages which can partially overlap with vmalloc?
> 
> Overlapping with the vmalloc range, not necessarily with a mapped
> vmalloc area.
> 
>> The problem (race) is not because of an inability to deal with partially
>> filled tables. We can handle that correctly, as explained below [1]. The
>> problem is the inadequate kernel page table locking during vmalloc(),
>> which might be accessing intermediate kernel page table pointers that are
>> being freed concurrently by free_empty_tables(). Hence we cannot free
>> any page table page which could ever hold entries from the vmalloc() range.
> 
> The way you deal with the partially filled table in this patch is to
> avoid freeing if there is a non-empty entry (!p*d_none()). This is what
> causes the race with vmalloc. If you simply avoid freeing a pmd page,
> for example, if the range floor/ceiling is not aligned to PUD_SIZE,
> irrespective of whether the other entries are empty or not, you
> shouldn't have this problem. You do free the pte page if the range is

Right, the floor/ceiling alignment check should abort the process before
scanning for non-empty entries.

> aligned to PMD_SIZE but in this case it wouldn't overlap with the
> vmalloc space. That's how free_pgd_range() works.

Like in free_pgd_range(), page table pages can be freed at lower levels when
they have the right alignment w.r.t. floor/ceiling. I will change all existing
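
A hedged sketch of that floor/ceiling approach, modeled on the checks in
free_pgd_range()/free_pmd_range() from mm/memory.c (the function name here
is illustrative, not from the patch):

static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
				 unsigned long floor, unsigned long ceiling)
{
	unsigned long start = addr & PMD_MASK;
	unsigned long end = start + PMD_SIZE;

	/*
	 * Refuse to free the pte page unless the whole PMD_SIZE range
	 * lies inside [floor, ceiling); a partially covered table might
	 * still be reachable from the vmalloc range.
	 */
	if (start < floor || (ceiling && end > ceiling))
		return;

	/* ... safe to clear the pmd entry and free the pte page ... */
}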

Re: [PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-16 Thread Anshuman Khandual



On 09/16/2019 02:20 PM, Anshuman Khandual wrote:
> 
> 
> On 09/16/2019 12:06 PM, Mike Rapoport wrote:
>> On Mon, Sep 16, 2019 at 11:17:37AM +0530, Anshuman Khandual wrote:
>>> In add_memory_resource() the memory range to be hot added first gets into
>>> memblock via memblock_add() before arch_add_memory() is called on it. The
>>> reverse sequence should be followed during memory hot removal, which is
>>> already being followed in the add_memory_resource() error path. This now
>>> ensures the required re-ordering between memblock_[free|remove]() and
>>> arch_remove_memory() during memory hot-remove.
>>>
>>> Cc: Andrew Morton 
>>> Cc: Oscar Salvador 
>>> Cc: Michal Hocko 
>>> Cc: David Hildenbrand 
>>> Cc: Pavel Tatashin 
>>> Cc: Dan Williams 
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>> Original patch https://lkml.org/lkml/2019/9/3/327
>>>
>>> Memory hot remove now works on arm64 without this because of a recent
>>> commit 60bb462fc7ad ("drivers/base/node.c: simplify
>>> unregister_memory_block_under_nodes()").
>>>
>>> David mentioned that re-ordering should still make sense for consistency
>>> purpose (removing stuff in the reverse order they were added). This patch
>>> is now detached from arm64 hot-remove series.
>>>
>>> https://lkml.org/lkml/2019/9/3/326
>>>
>>>  mm/memory_hotplug.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index c73f09913165..355c466e0621 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>>>
>>> /* remove memmap entry */
>>> firmware_map_remove(start, start + size, "System RAM");
>>> -   memblock_free(start, size);
>>> -   memblock_remove(start, size);
>>>
>>> /* remove memory block devices before removing memory */
>>> remove_memory_block_devices(start, size);
>>>
>>> arch_remove_memory(nid, start, size, NULL);
>>> +   memblock_free(start, size);
>>
>> I don't see memblock_reserve() anywhere in memory_hotplug.c, so the
>> memblock_free() call here seems superfluous. I think it can be simply
>> dropped.
> 
> I had observed that previously but was not sure whether there are still
> scenarios where it might be true. The error path in add_memory_resource()
> just calls memblock_remove(), not memblock_free(). Unless there is any
> objection, I can just drop it.

Hello Mike,

Looks like we might need memblock_free() here as well. As you mentioned
before, there might not be any memblock_reserve() in mm/memory_hotplug.c,
but that does not guarantee that there could not have been a previous
memblock_reserve() or memblock_alloc_XXX() allocation which came from
the current hot-remove range.

memblock_free() followed by memblock_remove() on the entire hot-remove
range ensures that memblock.memory and memblock.reserved stay in sync and
that the entire range is guaranteed to be removed from both memblock types.
Or am I missing something here?
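
In other words, the ordering being argued for is simply (a sketch of the
try_remove_memory() tail, matching the diff in this thread):

	arch_remove_memory(nid, start, size, NULL);	/* needs pfn_valid() intact */
	memblock_free(start, size);	/* scrub any overlap in memblock.reserved */
	memblock_remove(start, size);	/* drop the range from memblock.memory */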

- Anshuman


Re: [PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-16 Thread Anshuman Khandual



On 09/16/2019 12:06 PM, Mike Rapoport wrote:
> On Mon, Sep 16, 2019 at 11:17:37AM +0530, Anshuman Khandual wrote:
>> In add_memory_resource() the memory range to be hot added first gets into
>> memblock via memblock_add() before arch_add_memory() is called on it. The
>> reverse sequence should be followed during memory hot removal, which is
>> already being followed in the add_memory_resource() error path. This now
>> ensures the required re-ordering between memblock_[free|remove]() and
>> arch_remove_memory() during memory hot-remove.
>>
>> Cc: Andrew Morton 
>> Cc: Oscar Salvador 
>> Cc: Michal Hocko 
>> Cc: David Hildenbrand 
>> Cc: Pavel Tatashin 
>> Cc: Dan Williams 
>> Signed-off-by: Anshuman Khandual 
>> ---
>> Original patch https://lkml.org/lkml/2019/9/3/327
>>
>> Memory hot remove now works on arm64 without this because of a recent
>> commit 60bb462fc7ad ("drivers/base/node.c: simplify
>> unregister_memory_block_under_nodes()").
>>
>> David mentioned that re-ordering should still make sense for consistency
>> purpose (removing stuff in the reverse order they were added). This patch
>> is now detached from arm64 hot-remove series.
>>
>> https://lkml.org/lkml/2019/9/3/326
>>
>>  mm/memory_hotplug.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index c73f09913165..355c466e0621 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>>
>>  /* remove memmap entry */
>>  firmware_map_remove(start, start + size, "System RAM");
>> -memblock_free(start, size);
>> -memblock_remove(start, size);
>>
>>  /* remove memory block devices before removing memory */
>>  remove_memory_block_devices(start, size);
>>
>>  arch_remove_memory(nid, start, size, NULL);
>> +memblock_free(start, size);
> 
> I don't see memblock_reserve() anywhere in memory_hotplug.c, so the
> memblock_free() call here seems superfluous. I think it can be simply
> dropped.

I had observed that previously but was not sure whether there are still
scenarios where it might be true. The error path in add_memory_resource()
just calls memblock_remove(), not memblock_free(). Unless there is any
objection, I can just drop it.


[PATCH] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-15 Thread Anshuman Khandual
In add_memory_resource() the memory range to be hot added first gets into
memblock via memblock_add() before arch_add_memory() is called on it. The
reverse sequence should be followed during memory hot removal, which is
already being followed in the add_memory_resource() error path. This now
ensures the required re-ordering between memblock_[free|remove]() and
arch_remove_memory() during memory hot-remove.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: Anshuman Khandual 
---
Original patch https://lkml.org/lkml/2019/9/3/327

Memory hot remove now works on arm64 without this because of a recent commit
60bb462fc7ad ("drivers/base/node.c: simplify
unregister_memory_block_under_nodes()").

David mentioned that re-ordering should still make sense for consistency
purpose (removing stuff in the reverse order they were added). This patch
is now detached from arm64 hot-remove series.

https://lkml.org/lkml/2019/9/3/326

 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c73f09913165..355c466e0621 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
-   memblock_free(start, size);
-   memblock_remove(start, size);
 
/* remove memory block devices before removing memory */
remove_memory_block_devices(start, size);
 
arch_remove_memory(nid, start, size, NULL);
+   memblock_free(start, size);
+   memblock_remove(start, size);
__release_memory_resource(start, size);
 
try_offline_node(nid);
-- 
2.20.1



Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-13 Thread Anshuman Khandual


On 09/12/2019 10:44 PM, Christophe Leroy wrote:
> 
> 
> On 12/09/2019 08:02, Anshuman Khandual wrote:
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> The test page table and the memory pages backing its entries at various
>> levels are all allocated from system memory with the required alignments.
>> If memory pages with the required size and alignment cannot be allocated,
>> then all the individual tests depending on them are skipped.
>>
> 
> [...]
> 
>>
>> Suggested-by: Catalin Marinas 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   arch/x86/include/asm/pgtable_64_types.h |   2 +
>>   mm/Kconfig.debug    |  14 +
>>   mm/Makefile |   1 +
>>   mm/arch_pgtable_test.c  | 429 
>>   4 files changed, 446 insertions(+)
>>   create mode 100644 mm/arch_pgtable_test.c
>>
>> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>> b/arch/x86/include/asm/pgtable_64_types.h
>> index 52e5f5f2240d..b882792a3999 100644
>> --- a/arch/x86/include/asm/pgtable_64_types.h
>> +++ b/arch/x86/include/asm/pgtable_64_types.h
>> @@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
>>   #define pgtable_l5_enabled() 0
>>   #endif /* CONFIG_X86_5LEVEL */
>>   +#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
>> +
> 
> This is specific to x86, should go in a separate patch.

I thought about it, but it's just a single line. Kirill suggested this in
the previous version. There is a generic fallback definition, but s390 has
its own. This change overrides the generic one for x86, probably as a fix
or an improvement. Kirill should be able to help classify it, in which case
it can become a separate patch.

> 
>>   extern unsigned int pgdir_shift;
>>   extern unsigned int ptrs_per_p4d;
>>   diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index 327b3ebf23bf..ce9c397f7b07 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
>>   depends on STRICT_KERNEL_RWX
>>   ---help---
>>     This option enables a testcase for the setting rodata read-only.
>> +
>> +config DEBUG_ARCH_PGTABLE_TEST
>> +    bool "Test arch page table helpers for semantics compliance"
>> +    depends on MMU
>> +    depends on DEBUG_KERNEL
>> +    help
>> +  This option provides a kernel module which can be used to test
>> +  architecture page table helper functions on various platforms,
>> +  verifying that they comply with expected generic MM semantics. This
>> +  will help architecture code in making sure that any changes to or
>> +  new additions of these helpers still conform to expected generic
>> +  MM semantics.
>> +
>> +  If unsure, say N.
>> diff --git a/mm/Makefile b/mm/Makefile
>> index d996846697ef..bb572c5aa8c5 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>   obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>   obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>>   obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>> +obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
>>   obj-$(CONFIG_PAGE_OWNER) += page_owner.o
>>   obj-$(CONFIG_CLEANCACHE) += cleancache.o
>>   obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
>> new file mode 100644
>> index ..8b4a92756ad8
>> --- /dev/null
>> +++ b/mm/arch_pgtable_test.c
>> @@ -0,0 +1,429 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * This kernel module validates architecture page table helpers &
>> + * accessors and helps in verifying their continued compliance with
>> + * generic MM semantics.
>> + *
>> + * Copyright (C) 2019 ARM Ltd.
>> + *
>> + * Author: Anshuman Khandual 
>> + */
>> +#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
> 
> Add  (see other mails, build failure on ppc book3s/32)

Ok

Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-13 Thread Anshuman Khandual



On 09/13/2019 12:41 PM, Christophe Leroy wrote:
> 
> 
> On 13/09/2019 09:03, Christophe Leroy wrote:
>>
>>
>> On 13/09/2019 08:58, Anshuman Khandual wrote:
>>> On 09/13/2019 11:53 AM, Christophe Leroy wrote:
>>>> Fix build failure on powerpc.
>>>>
>>>> Fix preemption imbalance.
>>>>
>>>> Signed-off-by: Christophe Leroy 
>>>> ---
>>>>   mm/arch_pgtable_test.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
>>>> index 8b4a92756ad8..f2b3c9ec35fa 100644
>>>> --- a/mm/arch_pgtable_test.c
>>>> +++ b/mm/arch_pgtable_test.c
>>>> @@ -24,6 +24,7 @@
>>>>   #include 
>>>>   #include 
>>>>   #include 
>>>> +#include 
>>>
>>> This is okay.
>>>
>>>>   #include 
>>>>   #include 
>>>> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
>>>>   p4d_clear_tests(p4dp);
>>>>   pgd_clear_tests(mm, pgdp);
>>>> +    pte_unmap(ptep);
>>>> +
>>>
>>> Now, about the preemption imbalance via the pte_alloc_map() path, i.e.
>>>
>>> pte_alloc_map() -> pte_offset_map() -> kmap_atomic()
>>>
>>> Is this not very much specific to 32-bit powerpc, or will it be applicable
>>> to all platforms which use kmap_XXX() to map high memory?
>>>
>>
>> See 
>> https://elixir.bootlin.com/linux/v5.3-rc8/source/include/linux/highmem.h#L91
>>
>> I think it applies at least to all arches using the generic implementation.
>>
>> Applies also to arm:
>> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/mm/highmem.c#L52
>>
>> Applies also to mips:
>> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/mips/mm/highmem.c#L47
>>
>> Same on sparc:
>> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/sparc/mm/highmem.c#L52
>>
>> Same on x86:
>> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/mm/highmem_32.c#L34
>>
>> I have not checked others, but I guess it is like that for all.
>>
> 
> 
> Seems like I answered too quickly. All kmap_atomic() implementations do
> preempt_disable(), but not all pte_alloc_map() implementations call
> kmap_atomic().
> 
> However, for instance ARM does:
> 
> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/include/asm/pgtable.h#L200
> 
> And X86 as well:
> 
> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/include/asm/pgtable_32.h#L51
> 
> Microblaze also:
> 
> https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/microblaze/include/asm/pgtable.h#L495

All the above platforms check out as using k[un]map_atomic(). I am wondering
whether any of the intermediate levels will have similar problems on any of
these 32-bit platforms, or on any other platform which might be using the
generic k[un]map_atomic(). There can be many permutations here.

p4dp = p4d_alloc(mm, pgdp, vaddr);
pudp = pud_alloc(mm, p4dp, vaddr);
pmdp = pmd_alloc(mm, pudp, vaddr);

Otherwise pte_alloc_map()/pte_unmap() looks good enough, as it will at least
take care of a known failure.
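
For reference, the balance the fix establishes looks like this (a sketch
using the test module's own names; on highmem configs pte_offset_map() ends
in kmap_atomic(), which disables preemption):

	ptep = pte_alloc_map(mm, pmdp, vaddr);	/* may kmap_atomic() the pte page */
	/* ... run the pte level tests ... */
	pte_unmap(ptep);			/* kunmap_atomic() + preempt_enable() */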


Re: [PATCH] mm/pgtable/debug: Fix test validating architecture page table helpers

2019-09-13 Thread Anshuman Khandual
On 09/13/2019 11:53 AM, Christophe Leroy wrote:
> Fix build failure on powerpc.
> 
> Fix preemption imbalance.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  mm/arch_pgtable_test.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
> index 8b4a92756ad8..f2b3c9ec35fa 100644
> --- a/mm/arch_pgtable_test.c
> +++ b/mm/arch_pgtable_test.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

This is okay.

>  #include 
>  #include 
>  
> @@ -400,6 +401,8 @@ static int __init arch_pgtable_tests_init(void)
>   p4d_clear_tests(p4dp);
>   pgd_clear_tests(mm, pgdp);
>  
> + pte_unmap(ptep);
> +

Now, about the preemption imbalance via the pte_alloc_map() path, i.e.

pte_alloc_map() -> pte_offset_map() -> kmap_atomic()

Is this not very much specific to 32-bit powerpc, or will it be applicable
to all platforms which use kmap_XXX() to map high memory?


Re: [PATCH V2 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-13 Thread Anshuman Khandual



On 09/12/2019 08:12 PM, Christophe Leroy wrote:
> Hi,
> 
> I didn't get patch 1 of this series, and it is not on linuxppc-dev patchwork 
> either. Can you resend?

It's there on the linux-mm patchwork and copied to linux-kernel@vger.kernel.org
as well. The CC list for the first patch was different from the second one's.

https://patchwork.kernel.org/patch/11142317/

Let me know if you can not find it either on MM or LKML list.

- Anshuman


Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove

2019-09-12 Thread Anshuman Khandual
On 09/13/2019 01:45 AM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> Thanks for the details on the need for removing the page tables and
> vmemmap backing. Some comments on the code below.
> 
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss 
>> __maybe_unused;
>>  
>>  static DEFINE_SPINLOCK(swapper_pgdir_lock);
>>  
>> +/*
>> + * This represents whether the vmalloc and vmemmap address ranges overlap
>> + * with each other at an intermediate level kernel page table entry, which
>> + * in turn helps in deciding whether empty kernel page table pages, if
>> + * any, can be freed during a memory hotplug operation.
>> + */
>> +static bool vmalloc_vmemmap_overlap;
> 
> I'd say just move the static find_vmalloc_vmemmap_overlap() function
> here; the compiler should be sufficiently smart to figure out
> that it's just a build-time constant.

Sure, will do.

> 
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +/*
>> + * FIXME: We should have called remove_pagetable(start, end, true).
>> + * vmemmap and vmalloc virtual range might share intermediate kernel
>> + * page table entries. Removing vmemmap range page table pages here
>> + * can potentially conflict with a concurrent vmalloc() allocation.
>> + *
>> + * This is primarily because vmalloc() does not take init_mm ptl for
>> + * the entire page table walk and its modification. Instead it just
>> + * takes the lock while allocating and installing page table pages
>> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> + * entry via memory hot remove can cause vmalloc() kernel page table
>> + * walk pointers to be invalid on the fly which can cause corruption
>> + * or worse, a crash.
>> + *
>> + * So free_empty_tables() gets called where vmalloc and vmemmap range
>> + * do not overlap at any intermediate level kernel page table entry.
>> + */
>> +unmap_hotplug_range(start, end, true);
>> +if (!vmalloc_vmemmap_overlap)
>> +free_empty_tables(start, end);
>> +#endif
>>  }
> 
> So, I see the risk with overlapping and I guess for some kernel
> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we

I did not see the 64K config options having this overlap; do you suspect
they might? After the 52-bit KVA series was merged, the following
configurations have the vmalloc-vmemmap range overlap problem.

- 4K  page size with 48 bit VA space
- 16K page size with 48 bit VA space

> can, that's great, otherwise could we rewrite the above functions to
> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
> function works if you called it on init_mm and kernel address range). By

Hmm, never tried that. Are you wondering if this can be used directly?
There are two distinct elements which make it very specific to user page
tables: mmu_gather based TLB tracking, and mm->pgtable_bytes accounting
with mm_dec_nr_pxx().
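
For reference, those two elements show up in free_pgd_range()'s pte-level
helper roughly like this (a close paraphrase of mm/memory.c, trimmed):

static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
			   unsigned long addr)
{
	pgtable_t token = pmd_pgtable(*pmd);

	pmd_clear(pmd);
	pte_free_tlb(tlb, token, addr);	/* mmu_gather based TLB tracking */
	mm_dec_nr_ptes(tlb->mm);	/* mm->pgtable_bytes accounting */
}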

> having the vmemmap start/end information it avoids freeing partially
> filled page table pages.

Did you mean page table pages which can partially overlap with vmalloc?

The problem (race) is not because of an inability to deal with partially
filled tables. We can handle that correctly, as explained below [1]. The
problem is the inadequate kernel page table locking during vmalloc(),
which might be accessing intermediate kernel page table pointers that are
being freed concurrently by free_empty_tables(). Hence we cannot free
any page table page which could ever hold entries from the vmalloc() range.
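
An illustrative interleaving of that race (comments only, not code from the
patch):

/*
 *   CPU0: vmalloc()                        CPU1: memory hot-remove
 *   ----------------                       -----------------------
 *   p4d = p4d_offset(pgd, addr);
 *                                          free_empty_tables(start, end);
 *                                            -> frees the pud page under CPU0
 *   pud = pud_alloc(&init_mm, p4d, addr);
 *     -> dereferences the freed pud page: corruption or a crash
 */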

Though I am not completely sure whether I have really understood the above
suggestion with respect to the floor/ceiling mechanism as in free_pgd_range().
Are you suggesting that we should only attempt to free up those vmemmap range
page table pages which *definitely* could never overlap with vmalloc, by
working on a modified vmemmap range instead (i.e. one cut down with
floor/ceiling to avoid the vmalloc range at each level)? This would be a more
restrictive version of free_empty_tables(), called in case there is an
overlap. So we would maintain two versions of free_empty_tables(). Please
correct me if any of the above assumptions or understanding is wrong.

But yes, with this we should be able to free up some possibly empty page
table pages which are being left out in the current proposal when

Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-12 Thread Anshuman Khandual



On 09/12/2019 04:30 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 12, 2019 at 11:32:53AM +0530, Anshuman Khandual wrote:
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR("Anshuman Khandual ");
>> +MODULE_DESCRIPTION("Test architecture page table helpers");
> 
> It's not a module. Why?

Not any more, for no particular reason. Just that module_init() code gets
executed after page allocator init, which is needed here. But I guess that
is probably not a great way to get this test started.

> 
> BTW, I think we should make all code here __init (or its variants) so it
> can be discarded after boot. It has no use after that.

Sounds good, will change. Will mark all these functions as __init and
will trigger the test with late_initcall().
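
A sketch of what that agreed change would look like (the entry point name
is the one this module already uses):

static int __init arch_pgtable_tests_init(void)
{
	/*
	 * Runs once the page allocator is up; with every test function
	 * marked __init, the whole thing is discarded after boot.
	 */
	return 0;
}
late_initcall(arch_pgtable_tests_init);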


Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove

2019-09-12 Thread Anshuman Khandual



On 09/12/2019 09:58 AM, Anshuman Khandual wrote:
> 
> On 09/10/2019 09:47 PM, Catalin Marinas wrote:
>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, 
>>> unsigned long end, int node,
>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>> struct vmem_altmap *altmap)
>>>  {
>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>> +   /*
>>> +* FIXME: We should have called remove_pagetable(start, end, true).
>>> +* vmemmap and vmalloc virtual range might share intermediate kernel
>>> +* page table entries. Removing vmemmap range page table pages here
>>> +* can potentially conflict with a concurrent vmalloc() allocation.
>>> +*
>>> +* This is primarily because vmalloc() does not take init_mm ptl for
>>> +* the entire page table walk and its modification. Instead it just
>>> +* takes the lock while allocating and installing page table pages
>>> +* via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>> +* entry via memory hot remove can cause vmalloc() kernel page table
>>> +* walk pointers to be invalid on the fly which can cause corruption
>>> +* or worse, a crash.
>>> +*
>>> +* So free_empty_tables() gets called where vmalloc and vmemmap range
>>> +* do not overlap at any intermediate level kernel page table entry.
>>> +*/
>>> +   unmap_hotplug_range(start, end, true);
>>> +   if (!vmalloc_vmemmap_overlap)
>>> +   free_empty_tables(start, end);
>>> +#endif
>>>  }
>>>  #endif /* CONFIG_SPARSEMEM_VMEMMAP */
> Hello Catalin,
> 
>> I wonder whether we could simply ignore the vmemmap freeing altogether,
>> just leave it around and not unmap it. This way, we could call
> This would have been an option (even if we ignore for a moment that it
> might not be the cleanest possible method) if present-day memory hot-remove
> scenarios involved just system RAM of comparable sizes.
> 
> But persistent memory, which will be plugged in as ZONE_DEVICE, might ask
> for a vmem_altmap based vmemmap mapping where the backing memory comes from
> the persistent memory range itself, not from existing system RAM. IIRC
> altmap support was originally added because the amount of persistent memory
> on a system might be an order of magnitude higher than that of regular
> system RAM. A normal memory hot add (without altmap) would have caused a
> great deal of consumption of system RAM just for the persistent memory
> range's vmemmap mapping. In order to avoid such a scenario, altmap was
> created to allocate the vmemmap mapping's backing memory from the device
> memory range itself.
> 
> In such cases vmemmap must be unmapped and its backing memory freed up for
> the complete removal of the persistent memory which originally requested
> the altmap based vmemmap backing.
> 
> Just as a reference, the upcoming series which enables altmap support on
> arm64 tries to allocate the vmemmap mapping's backing memory from the device
> range itself during memory hot add, and frees it up during memory hot remove.
> Those methods will not be possible if memory hot-remove does not really free
> up the vmemmap backing storage.
> 
> https://patchwork.kernel.org/project/linux-mm/list/?series=139299
> 

Just to add here: there is ongoing work which will enable allocating
memory from the hot-add range itself even for normal system RAM. So this
might not remain specific to ZONE_DEVICE based device/persistent memory
for long.

https://lore.kernel.org/lkml/20190725160207.19579-1-osalva...@suse.de/


[PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-12 Thread Anshuman Khandual
This adds a test module which will validate architecture page table helpers
and accessors regarding compliance with generic MM semantics expectations.
This will help various architectures in validating changes to the existing
page table helpers or addition of new ones.

The test page table and the memory pages backing its entries at various
levels are all allocated from system memory with the required alignments.
If memory pages with the required size and alignment cannot be allocated,
then all the individual tests depending on them are skipped.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Suggested-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 mm/Kconfig.debug|  14 +
 mm/Makefile |   1 +
 mm/arch_pgtable_test.c  | 429 
 4 files changed, 446 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 52e5f5f2240d..b882792a3999 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -40,6 +40,8 @@ static inline bool pgtable_l5_enabled(void)
 #define pgtable_l5_enabled() 0
 #endif /* CONFIG_X86_5LEVEL */
 
+#define mm_p4d_folded(mm) (!pgtable_l5_enabled())
+
 extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 327b3ebf23bf..ce9c397f7b07 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_ARCH_PGTABLE_TEST
+   bool "Test arch page table helpers for semantics compliance"
+   depends on MMU
+   depends on DEBUG_KERNEL
+   help
+ This option provides a kernel module which can be used to test
+ architecture page table helper functions on various platforms,
+ verifying that they comply with expected generic MM semantics. This
+ will help architecture code in making sure that any changes to or
+ new additions of these helpers still conform to expected generic
+ MM semantics.
+
+ If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..bb572c5aa8c5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
new file mode 100644
index ..8b4a92756ad8
--- /dev/null
+++ b/mm/arch_pgtable_test.c
@@ -0,0 +1,429 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel module validates architecture page table helpers &
+ * accessors and helps in verifying their continued compliance with
+ * generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual 
+ */
+#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)      = An old and not a young entry
+ * mkyoung(entry)    = A young and not an old entry
+ * mkdirty(entry)    = A dirty and not a clean entry
+ * mkclean(entry)    = A clean and not a dirty entry
+ * mkwrite(entry)    = A write and not a write protected entry
+ * wrprotect(entry)  = A write protected and not a write entry
+ * pxx_bad(entry)    = A mappe

[PATCH V2 1/2] mm/hugetlb: Make alloc_gigantic_page() available for general use

2019-09-12 Thread Anshuman Khandual
alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
have been allocated through the buddy allocator. A subsequent patch which
tests arch page table helpers needs such a method to allocate a PUD_SIZE
sized memory block. In the future such methods might have other use cases
as well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is made available via the new
alloc_gigantic_page_order().

Cc: Mike Kravetz 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
Should we move alloc_gigantic_page_order() to page_alloc.c and its
declaration to include/linux/gfp.h instead? Right now this is still very
much HugeTLB specific.
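
As a usage sketch (a hypothetical caller, mirroring what the follow-up test
patch needs; the node arguments are illustrative):

	struct page *page;

	page = alloc_gigantic_page_order(get_order(PUD_SIZE), GFP_KERNEL,
					 first_online_node,
					 &node_states[N_ONLINE]);
	if (!page)
		pr_info("PUD-sized block unavailable, pud tests will be skipped\n");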

 include/linux/hugetlb.h |  9 +
 mm/hugetlb.c| 24 ++--
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d0..cc50d5ad4885 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -299,6 +299,9 @@ static inline bool is_file_hugepages(struct file *file)
 }
 
 
+struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask);
 #else /* !CONFIG_HUGETLBFS */
 
 #define is_file_hugepages(file)false
@@ -310,6 +313,12 @@ hugetlb_file_setup(const char *name, size_t size, 
vm_flags_t acctflag,
return ERR_PTR(-ENOSYS);
 }
 
+static inline struct page *
+alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+ int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef37c85423a5..3fb81252f52b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1112,10 +1112,9 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
 }
 
-static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
-   unsigned int order = huge_page_order(h);
unsigned long nr_pages = 1 << order;
unsigned long ret, pfn, flags;
struct zonelist *zonelist;
@@ -1151,6 +1150,14 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
return NULL;
 }
 
+static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
+   int nid, nodemask_t *nodemask)
+{
+   unsigned int order = huge_page_order(h);
+
+   return alloc_gigantic_page_order(order, gfp_mask, nid, nodemask);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
 static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 #else /* !CONFIG_CONTIG_ALLOC */
@@ -1159,6 +1166,12 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
 #endif /* CONFIG_CONTIG_ALLOC */
 
 #else /* !CONFIG_ARCH_HAS_GIGANTIC_PAGE */
@@ -1167,6 +1180,13 @@ static struct page *alloc_gigantic_page(struct hstate 
*h, gfp_t gfp_mask,
 {
return NULL;
 }
+
+struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
+  int nid, nodemask_t *nodemask)
+{
+   return NULL;
+}
+
 static inline void free_gigantic_page(struct page *page, unsigned int order) { 
}
 static inline void destroy_compound_gigantic_page(struct page *page,
unsigned int order) { }
-- 
2.20.1



[PATCH V2 0/2] mm/debug: Add tests for architecture exported page table helpers

2019-09-12 Thread Anshuman Khandual
This series adds test validation for architecture exported page table
helpers. A patch in the series adds basic transformation tests at various
levels of the page table. Before that, it exports the gigantic page
allocation function from HugeTLB.

This test was originally suggested by Catalin during the arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB, etc., as
well as platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Testing:

Successfully build and boot tested on both arm64 and x86 platforms without
any test failing. Only build tested on some other platforms.

I would really appreciate it if folks could help validate this test on other
platforms and report back any problems. All suggestions, comments and inputs
are welcome. Thank you.

Changes in V2:

- Fixed small typo error in MODULE_DESCRIPTION()
- Fixed m68k build problems for lvalue concerns in pmd_xxx_tests()
- Fixed dynamic page table level folding problems on x86 as per Kirill
- Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
- Allocate and free pte table with pte_alloc_one/pte_free per Kirill
- Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
- Changed RANDOM_NZVALUE value from 0xbe to 0xff
- Changed allocation, usage, free sequence for saved_ptep
- Renamed VMA_FLAGS as VMFLAGS
- Implemented a new method for random vaddr generation
- Implemented some other cleanups
- Dropped extern reference to mm_alloc()
- Created and exported new alloc_gigantic_page_order()
- Dropped the custom allocator and used new alloc_gigantic_page_order()

Changes in V1:

https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in RFC V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved test module and it's config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated it falls back on PMD aligned
  memory block from page allocator and pud_* tests are skipped
- Clear and populate tests now operate on real in memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

Original RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Kravetz 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (2):
  mm/hugetlb: Make alloc_gigantic_page() available for general use
  mm/pgtable/debug: Add test validating architecture page table helpers

 arch/x86/include/asm/pgtable_64_types.h |   2 +
 include/linux/hugetlb.h |   9 +
 mm/Kconfig.debug|  14 +
 mm/Makefile |   1 +
 mm/arch_pgtable_test.c  | 429 
 mm/hugetlb.c|  24 +-
 6 files changed, 477 insertions(+), 2 deletions(-)
 create mode 100644 mm/arch_pgtable_test.c

-- 
2.20.1



Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove

2019-09-11 Thread Anshuman Khandual



On 09/10/2019 09:47 PM, Catalin Marinas wrote:
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, 
>> unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +/*
>> + * FIXME: We should have called remove_pagetable(start, end, true).
>> + * vmemmap and vmalloc virtual range might share intermediate kernel
>> + * page table entries. Removing vmemmap range page table pages here
>> + * can potentially conflict with a concurrent vmalloc() allocation.
>> + *
>> + * This is primarily because vmalloc() does not take init_mm ptl for
>> + * the entire page table walk and its modification. Instead it just
>> + * takes the lock while allocating and installing page table pages
>> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> + * entry via memory hot remove can cause vmalloc() kernel page table
>> + * walk pointers to be invalid on the fly which can cause corruption
>> + * or worse, a crash.
>> + *
>> + * So free_empty_tables() gets called where vmalloc and vmemmap range
>> + * do not overlap at any intermediate level kernel page table entry.
>> + */
>> +unmap_hotplug_range(start, end, true);
>> +if (!vmalloc_vmemmap_overlap)
>> +free_empty_tables(start, end);
>> +#endif
>>  }
>>  #endif  /* CONFIG_SPARSEMEM_VMEMMAP */

Hello Catalin,

> 
> I wonder whether we could simply ignore the vmemmap freeing altogether,
> just leave it around and not unmap it. This way, we could call

This would have been an option (even if we ignore for a moment that it
might not be the cleanest possible method) if present-day memory hot-remove
scenarios involved just system RAM of comparable sizes.

But persistent memory, which will be plugged in as ZONE_DEVICE, might ask
for a vmem_altmap based vmemmap mapping where the backing memory comes from
the persistent memory range itself, not from existing system RAM. IIRC
altmap support was originally added because the amount of persistent memory
on a system might be an order of magnitude higher than that of regular
system RAM. A normal memory hot add (without altmap) would have caused a
great deal of consumption of system RAM just for the persistent memory
range's vmemmap mapping. In order to avoid such a scenario, altmap was
created to allocate the vmemmap mapping's backing memory from the device
memory range itself.

In such cases vmemmap must be unmapped and its backing memory freed up for
the complete removal of the persistent memory which originally requested
the altmap based vmemmap backing.

Just as a reference, the upcoming series which enables altmap support on
arm64 tries to allocate the vmemmap mapping backing memory from the device
range itself during memory hot add, and frees it up during memory hot remove.
That will not be possible if memory hot remove does not really free up the
vmemmap backing storage.

https://patchwork.kernel.org/project/linux-mm/list/?series=139299
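
To make that concrete, here is a rough sketch of what such a hot add looks
like against the v5.3-era hotplug API (struct vmem_altmap and struct
mhp_restrictions are the real structures of that time; nid, dev_start and
dev_size are illustrative placeholders, not from the series above):

	struct vmem_altmap altmap = {
		.base_pfn = PHYS_PFN(dev_start),	/* device range start */
		.free	  = PHYS_PFN(dev_size),		/* pfns usable for vmemmap */
	};
	struct mhp_restrictions restrictions = {
		.altmap = &altmap,
	};

	/*
	 * vmemmap_populate() is then expected to carve the struct page
	 * array out of the device range itself instead of system RAM,
	 * which is exactly what memory hot remove must later give back.
	 */
	rc = arch_add_memory(nid, dev_start, dev_size, &restrictions);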

> unmap_kernel_range() for removing the linear map and we save some code.
> 
> For the linear map, I think we use just above 2MB of tables for 1GB of
> memory mapped (worst case with 4KB pages we need 512 pte pages). For
> vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we

You are right, the amount of memory required for kernel page table pages
is dependent on the mapping page size and the size of the range to be mapped.
But as explained below there might be hot remove situations where these
ranges will remain unused forever after the hot remove. There are chances
that some of these page table pages (even empty ones) might remain unused
for good.
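
(A quick sanity check of those numbers: with 4K pages, 1GB of linear map
needs 1GB / 4KB = 256K PTEs; at 512 PTEs per table page that is 512 PTE
pages, i.e. 2MB, plus a few PMD/PUD pages. For 64GB of hotplugged memory
the vmemmap itself is 64GB / 4KB * 64 bytes = 1GB of struct pages, and
mapping that 1GB again costs about 2MB of page table pages.)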

> expect such memory to be re-plugged again in the same range? If we do,
> then I shouldn't even bother with removing the vmmemmap.
> 
> I don't fully understand the use-case for memory hotremove, so any
> additional info would be useful to make a decision here.

Sure, these are some of the scenarios I could recollect.

Physical Environment:

A. Physical DIMM replacement

Platform detects memory errors and initiates a DIMM replacement.

- Hot remove selected DIMM with errors
- Hot add a new DIMM in its place on the same slot

In normal circumstances, the new DIMM will require the same linear
and vmemmap mapping. In such cases hot remove could just unmap the
linear mapping, leave everything else in place and be done with it.
Though I am not sure whether it's a good idea to leave behind
accessible struct pages which correspond to non-present pfns.

B. Physical DIMM movement

Platform can detect errors on a DIMM slot itself and initiates a
DIMM movement i

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Anshuman Khandual



On 09/10/2019 10:15 AM, Christophe Leroy wrote:
> 
> 
> On 09/10/2019 03:56 AM, Anshuman Khandual wrote:
>>
>>
>> On 09/09/2019 08:43 PM, Kirill A. Shutemov wrote:
>>> On Mon, Sep 09, 2019 at 11:56:50AM +0530, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 09/07/2019 12:33 AM, Gerald Schaefer wrote:
>>>>> On Fri, 6 Sep 2019 11:58:59 +0530
>>>>> Anshuman Khandual  wrote:
>>>>>
>>>>>> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
>>>>>>> On Thu, 5 Sep 2019 14:48:14 +0530
>>>>>>> Anshuman Khandual  wrote:
>>>>>>>   
>>>>>>>>> [...]
>>>>>>>>>> +
>>>>>>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && 
>>>>>>>>>> !defined(__ARCH_HAS_4LEVEL_HACK)
>>>>>>>>>> +static void pud_clear_tests(pud_t *pudp)
>>>>>>>>>> +{
>>>>>>>>>> +    memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>>>>>>>>>> +    pud_clear(pudp);
>>>>>>>>>> +    WARN_ON(!pud_none(READ_ONCE(*pudp)));
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> For pgd/p4d/pud_clear(), we only clear if the page table level is 
>>>>>>>>> present
>>>>>>>>> and not folded. The memset() here overwrites the table type bits, so
>>>>>>>>> pud_clear() will not clear anything on s390 and the pud_none() check 
>>>>>>>>> will
>>>>>>>>> fail.
>>>>>>>>> Would it be possible to OR a (larger) random value into the table, so 
>>>>>>>>> that
>>>>>>>>> the lower 12 bits would be preserved?
>>>>>>>>
>>>>>>>> So the suggestion is instead of doing memset() on entry with 
>>>>>>>> RANDOM_NZVALUE,
>>>>>>>> it should OR a large random value preserving lower 12 bits. Hmm, this 
>>>>>>>> should
>>>>>>>> still do the trick for other platforms, they just need non zero value. 
>>>>>>>> So on
>>>>>>>> s390, the lower 12 bits on the page table entry already has valid 
>>>>>>>> value while
>>>>>>>> entering this function which would make sure that pud_clear() really 
>>>>>>>> does
>>>>>>>> clear the entry ?
>>>>>>>
>>>>>>> Yes, in theory the table entry on s390 would have the type set in the 
>>>>>>> last
>>>>>>> 4 bits, so preserving those would be enough. If it does not conflict 
>>>>>>> with
>>>>>>> others, I would still suggest preserving all 12 bits since those would 
>>>>>>> contain
>>>>>>> arch-specific flags in general, just to be sure. For s390, the pte/pmd 
>>>>>>> tests
>>>>>>> would also work with the memset, but for consistency I think the same 
>>>>>>> logic
>>>>>>> should be used in all pxd_clear_tests.
>>>>>>
>>>>>> Makes sense but..
>>>>>>
>>>>>> There is a small challenge with this. Modifying individual bits on a 
>>>>>> given
>>>>>> page table entry from generic code like this test case is bit tricky. 
>>>>>> That
>>>>>> is because there are not enough helpers to create entries with an 
>>>>>> absolute
>>>>>> value. This would have been easier if all the platforms provided 
>>>>>> functions
>>>>>> like __pxx() which is not the case now. Otherwise something like this 
>>>>>> should
>>>>>> have worked.
>>>>>>
>>>>>>
>>>>>> pud_t pud = READ_ONCE(*pudp);
>>>>>> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
>>>>>> WRITE_ONCE(*pudp, pud);
>>>>>>
>>>>>> But __pud() will fail to build in many platforms.
>>>>>
>>>>> Hmm, I simply used this on my system to make pud_clear_tests() work, not
>>>>> sure if it works on all archs:
>>>>>
>>>>> pud_val(*pudp

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Anshuman Khandual



On 09/09/2019 08:43 PM, Kirill A. Shutemov wrote:
> On Mon, Sep 09, 2019 at 11:56:50AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 09/07/2019 12:33 AM, Gerald Schaefer wrote:
>>> On Fri, 6 Sep 2019 11:58:59 +0530
>>> Anshuman Khandual  wrote:
>>>
>>>> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
>>>>> On Thu, 5 Sep 2019 14:48:14 +0530
>>>>> Anshuman Khandual  wrote:
>>>>>   
>>>>>>> [...]
>>>>>>>> +
>>>>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && 
>>>>>>>> !defined(__ARCH_HAS_4LEVEL_HACK)
>>>>>>>> +static void pud_clear_tests(pud_t *pudp)
>>>>>>>> +{
>>>>>>>> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>>>>>>>> +  pud_clear(pudp);
>>>>>>>> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
>>>>>>>> +}
>>>>>>>
>>>>>>> For pgd/p4d/pud_clear(), we only clear if the page table level is 
>>>>>>> present
>>>>>>> and not folded. The memset() here overwrites the table type bits, so
>>>>>>> pud_clear() will not clear anything on s390 and the pud_none() check 
>>>>>>> will
>>>>>>> fail.
>>>>>>> Would it be possible to OR a (larger) random value into the table, so 
>>>>>>> that
>>>>>>> the lower 12 bits would be preserved?
>>>>>>
>>>>>> So the suggestion is instead of doing memset() on entry with 
>>>>>> RANDOM_NZVALUE,
>>>>>> it should OR a large random value preserving lower 12 bits. Hmm, this 
>>>>>> should
>>>>>> still do the trick for other platforms, they just need non zero value. 
>>>>>> So on
>>>>>> s390, the lower 12 bits on the page table entry already has valid value 
>>>>>> while
>>>>>> entering this function which would make sure that pud_clear() really does
>>>>>> clear the entry ?  
>>>>>
>>>>> Yes, in theory the table entry on s390 would have the type set in the last
>>>>> 4 bits, so preserving those would be enough. If it does not conflict with
>>>>> others, I would still suggest preserving all 12 bits since those would 
>>>>> contain
>>>>> arch-specific flags in general, just to be sure. For s390, the pte/pmd 
>>>>> tests
>>>>> would also work with the memset, but for consistency I think the same 
>>>>> logic
>>>>> should be used in all pxd_clear_tests.  
>>>>
>>>> Makes sense but..
>>>>
>>>> There is a small challenge with this. Modifying individual bits on a given
>>>> page table entry from generic code like this test case is bit tricky. That
>>>> is because there are not enough helpers to create entries with an absolute
>>>> value. This would have been easier if all the platforms provided functions
>>>> like __pxx() which is not the case now. Otherwise something like this 
>>>> should
>>>> have worked.
>>>>
>>>>
>>>> pud_t pud = READ_ONCE(*pudp);
>>>> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
>>>> WRITE_ONCE(*pudp, pud);
>>>>
>>>> But __pud() will fail to build in many platforms.
>>>
>>> Hmm, I simply used this on my system to make pud_clear_tests() work, not
>>> sure if it works on all archs:
>>>
>>> pud_val(*pudp) |= RANDOM_NZVALUE;
>>
>> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
>> has been defined there.
> 
> Use instead
> 
>   *pudp = __pud(pud_val(*pudp) | RANDOM_NZVALUE);

Agreed.

As I had mentioned before, this would really have been the cleanest approach.

> 
> It *should* be more portable.

Not really, because not all the platforms have __pxx() definitions right now.
Going with these will clearly cause build failures on the affected platforms.
Let's examine __pud() for instance. It is defined only on these platforms.

arch/arm64/include/asm/pgtable-types.h: #define __pud(x) ((pud_t) { (x) } )
arch/mips/include/asm/pgtable-64.h: #define __pud(x) ((pud_t) { (x) })
arch/powerpc/include/asm/pgtable-be-types.h:#define __pud(x) ((pud_t) { cpu_to_be64(x) })
arch/powerpc/include/asm/pgtable-types.h:   #define __

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Anshuman Khandual



On 09/07/2019 12:33 AM, Gerald Schaefer wrote:
> On Fri, 6 Sep 2019 11:58:59 +0530
> Anshuman Khandual  wrote:
> 
>> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
>>> On Thu, 5 Sep 2019 14:48:14 +0530
>>> Anshuman Khandual  wrote:
>>>   
>>>>> [...]
>>>>>> +
>>>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>>>>>> +static void pud_clear_tests(pud_t *pudp)
>>>>>> +{
>>>>>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>>>>>> +pud_clear(pudp);
>>>>>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
>>>>>> +}
>>>>>
>>>>> For pgd/p4d/pud_clear(), we only clear if the page table level is present
>>>>> and not folded. The memset() here overwrites the table type bits, so
>>>>> pud_clear() will not clear anything on s390 and the pud_none() check will
>>>>> fail.
>>>>> Would it be possible to OR a (larger) random value into the table, so that
>>>>> the lower 12 bits would be preserved?
>>>>
>>>> So the suggestion is instead of doing memset() on entry with 
>>>> RANDOM_NZVALUE,
>>>> it should OR a large random value preserving lower 12 bits. Hmm, this 
>>>> should
>>>> still do the trick for other platforms, they just need non zero value. So 
>>>> on
>>>> s390, the lower 12 bits on the page table entry already has valid value 
>>>> while
>>>> entering this function which would make sure that pud_clear() really does
>>>> clear the entry ?  
>>>
>>> Yes, in theory the table entry on s390 would have the type set in the last
>>> 4 bits, so preserving those would be enough. If it does not conflict with
>>> others, I would still suggest preserving all 12 bits since those would 
>>> contain
>>> arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
>>> would also work with the memset, but for consistency I think the same logic
>>> should be used in all pxd_clear_tests.  
>>
>> Makes sense but..
>>
>> There is a small challenge with this. Modifying individual bits on a given
>> page table entry from generic code like this test case is bit tricky. That
>> is because there are not enough helpers to create entries with an absolute
>> value. This would have been easier if all the platforms provided functions
>> like __pxx() which is not the case now. Otherwise something like this should
>> have worked.
>>
>>
>> pud_t pud = READ_ONCE(*pudp);
>> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
>> WRITE_ONCE(*pudp, pud);
>>
>> But __pud() will fail to build in many platforms.
> 
> Hmm, I simply used this on my system to make pud_clear_tests() work, not
> sure if it works on all archs:
> 
> pud_val(*pudp) |= RANDOM_NZVALUE;

Which compiles on arm64 but then fails on x86 because of the way pmd_val()
has been defined there. On arm64 and s390 (and many others) pmd_val() is
a macro that expands to the variable itself and can hence be used as an
lvalue, but that is not true for some other platforms like x86.

arch/arm64/include/asm/pgtable-types.h: #define pmd_val(x)  ((x).pmd)
arch/s390/include/asm/page.h:   #define pmd_val(x)  ((x).pmd)
arch/x86/include/asm/pgtable.h: #define pmd_val(x)  native_pmd_val(x)

static inline pmdval_t native_pmd_val(pmd_t pmd)
{
	return pmd.pmd;
}

Unless I am mistaken, the return value from this function cannot be used as
an lvalue for subsequent assignments.

mm/arch_pgtable_test.c: In function ‘pud_clear_tests’:
mm/arch_pgtable_test.c:156:17: error: lvalue required as left operand of assignment
  pud_val(*pudp) |= RANDOM_ORVALUE;
 ^~
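
For reference, a minimal standalone illustration of the lvalue difference
(plain userspace C, not kernel code; both names below are made up):

typedef struct { unsigned long pmd; } pmd_t;

#define pmd_val_macro(x)	((x).pmd)	/* expands to an lvalue */

static inline unsigned long pmd_val_func(pmd_t pmd)	/* returns an rvalue */
{
	return pmd.pmd;
}

int main(void)
{
	pmd_t entry = { 0 };

	pmd_val_macro(entry) |= 0xbe;	/* fine: assigns to entry.pmd */
	/* pmd_val_func(entry) |= 0xbe;	   error: lvalue required */
	return 0;
}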
AFAICS pxx_val() was never intended to be used as an lvalue, and using it
that way might just happen to work on all those platforms which define it as
a macro. It is meant to just provide the value of an entry as determined by
the platform.

In principle pxx_val() on an entry was not supposed to be modified directly
from generic code without going through (again) platform helpers for any
specific state change (write, old, dirty, special, huge etc). The current
use case is a deviation from that.

I originally went with memset() just to load up the entries with a non-zero
value so that we know pxx_clear() is really doing the clearing. The same is
being followed for all the pxx_same() checks.

Another way of fixing the problem would be to mark them with known attributes
like write/young/huge etc instead, which for sure 
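
A rough sketch of that attribute based alternative, shown for pte since
pte_mkwrite()/pte_mkdirty()/pte_mkyoung() exist on all platforms (the
pmd/pud variants generally depend on THP support):

	pte_t pte = READ_ONCE(*ptep);

	/* Build a known non-zero entry via the standard helpers only */
	pte = pte_mkyoung(pte_mkdirty(pte_mkwrite(pte)));
	WRITE_ONCE(*ptep, pte);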

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-06 Thread Anshuman Khandual



On 09/05/2019 02:29 PM, Kirill A. Shutemov wrote:
> On Thu, Sep 05, 2019 at 01:48:27PM +0530, Anshuman Khandual wrote:
>>>> +#define VADDR_TEST    (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
>>>
>>> What is special about this address? How do you know if it is not occupied
>>> yet?
>>
>> We are creating the page table from scratch after allocating an mm_struct
>> for a given random virtual address 'VADDR_TEST'. Hence nothing is occupied
>> just yet. There is nothing special about this address, just that it tries
>> to ensure the page table entries are being created with some offset from
>> beginning of respective page table page at all levels ? The idea is to
>> have a more representative form of page table structure for test.
> 
> Why P4D_SIZE is missing?

That was an omission, even though I was wondering whether it would be
applicable or even make sense on platforms which don't have a real P4D.

> 
> Are you sure it will not land into kernel address space on any arch?

Can it even cross the user virtual address range with just a single span
at each page table level ? TBH I did not think about that possibility.

> 
> I think more robust way to deal with this would be using
> get_unmapped_area() instead of fixed address.

Makes sense, and probably it's better to get a virtual address which
is known to have been checked against all boundary conditions. Will
explore around get_unmapped_area() in this regard.
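
Something along these lines, as a rough sketch (get_unmapped_area() works
against current->mm, so the test would have to run from a context with a
valid mm; the error handling is illustrative):

	unsigned long vaddr;

	vaddr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
	if (IS_ERR_VALUE(vaddr))
		return;		/* no usable VA found, skip the tests */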

> 
>> This makes sense for runtime cases but there is a problem here.
>>
>> On arm64, pgd_populate() which takes (pud_t *) as last argument instead of
>> (p4d_t *) will fail to build when not wrapped in !__PAGETABLE_P4D_FOLDED
>> on certain configurations.
>>
>> ./arch/arm64/include/asm/pgalloc.h:81:75: note:
>> expected ‘pud_t *’ {aka ‘struct <anonymous> *’}
>> but argument is of type ‘pgd_t *’ {aka ‘struct <anonymous> *’}
>> static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, pud_t *pudp)
>>
>> ~~~^~~~
>> Wondering if this is something to be fixed on arm64 or its more general
>> problem. Will look into this further.
> 
> I think you need wrap this into #ifndef __ARCH_HAS_5LEVEL_HACK.

Okay.

> 
>>>> +  pmd_populate_tests(mm, pmdp, (pgtable_t) page);
>>>
>>> This is not correct for architectures that defines pgtable_t as pte_t
>>> pointer, not struct page pointer.
>>
>> Right, a grep on the source confirms that.
>>
>> These platforms define pgtable_t as struct page *
>>
>> arch/alpha/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/arm/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/arm64/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/csky/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/hexagon/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/ia64/include/asm/page.h:  typedef struct page *pgtable_t;
>> arch/ia64/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/m68k/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/microblaze/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/mips/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/nds32/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/nios2/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/openrisc/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/parisc/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/riscv/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/sh/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/sparc/include/asm/page_32.h:typedef struct page *pgtable_t;
>> arch/um/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/unicore32/include/asm/page.h:typedef struct page *pgtable_t;
>> arch/x86/include/asm/pgtable_types.h:typedef struct page *pgtable_t;
>> arch/xtensa/include/asm/page.h:typedef struct page *pgtable_t;
>>
>> These platforms define pgtable_t as pte_t *
>>
>> arch/arc/include/asm/page.h:typedef pte_t * pgtable_t;
>> arch/powerpc/include/asm/mmu.h:typedef pte_t *pgtable_t;
>> arch/s390/include/asm/page.h:typedef pte_t *pgtable_t;
>> arch/sparc/include/asm/page_64.h:typedef pte_t *pgtable_t;
>>
>> Should we need have two pmd_populate_tests() definitions with
>> different arguments (struct page pointer or pte_t pointer) and then
>> call either one after detecting the given platform ?
> 
> Use pte_alloc_one() instead of alloc_mapped_page() to allocate the page
> table.

Right, the PTE page table page should come from pte_alloc_one() instead

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-06 Thread Anshuman Khandual
On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> On Thu, 5 Sep 2019 14:48:14 +0530
> Anshuman Khandual  wrote:
> 
>>> [...]  
>>>> +
>>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>>>> +static void pud_clear_tests(pud_t *pudp)
>>>> +{
>>>> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>>>> +  pud_clear(pudp);
>>>> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
>>>> +}  
>>>
>>> For pgd/p4d/pud_clear(), we only clear if the page table level is present
>>> and not folded. The memset() here overwrites the table type bits, so
>>> pud_clear() will not clear anything on s390 and the pud_none() check will
>>> fail.
>>> Would it be possible to OR a (larger) random value into the table, so that
>>> the lower 12 bits would be preserved?  
>>
>> So the suggestion is instead of doing memset() on entry with RANDOM_NZVALUE,
>> it should OR a large random value preserving lower 12 bits. Hmm, this should
>> still do the trick for other platforms, they just need non zero value. So on
>> s390, the lower 12 bits on the page table entry already has valid value while
>> entering this function which would make sure that pud_clear() really does
>> clear the entry ?
> 
> Yes, in theory the table entry on s390 would have the type set in the last
> 4 bits, so preserving those would be enough. If it does not conflict with
> others, I would still suggest preserving all 12 bits since those would contain
> arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
> would also work with the memset, but for consistency I think the same logic
> should be used in all pxd_clear_tests.

Makes sense but..

There is a small challenge with this. Modifying individual bits of a given
page table entry from generic code like this test case is a bit tricky. That
is because there are not enough helpers to create entries with an absolute
value. This would have been easier if all the platforms provided functions
like __pxx(), which is not the case now. Otherwise something like this should
have worked.


pud_t pud = READ_ONCE(*pudp);
pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
WRITE_ONCE(*pudp, pud);

But __pud() will fail to build on many platforms.

The other alternative would be to make sure memset() scribbles over all the
bytes except the ones holding the lower 12 bits, which will depend on
endianness. If s390 has a fixed endianness, we can still use either of these,
which will hold good for the others as well.

memset(pudp, RANDOM_NZVALUE, sizeof(pud_t) - 3);

OR

memset((char *)pudp + 3, RANDOM_NZVALUE, sizeof(pud_t) - 3);
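
Or, combining the two into one rough sketch that picks the window at
compile time from the kernel's byte order macros (the 3-byte window just
follows the snippets above; strictly only 12 bits need preserving):

#ifdef __BIG_ENDIAN
	/* The low 12 bits live in the last bytes: leave them untouched */
	memset(pudp, RANDOM_NZVALUE, sizeof(pud_t) - 3);
#else
	/* The low 12 bits live in the first bytes: skip over them */
	memset((char *)pudp + 3, RANDOM_NZVALUE, sizeof(pud_t) - 3);
#endif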

> 
> However, there is another issue on s390 which will make this only work
> for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
> mm_alloc() will only give you a 3-level page table initially on s390.
> This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
> both see the pud level (of course this also affects other tests).

Got it.

> 
> Not sure yet how to fix this, i.e. how to initialize/update the page table
> to 5 levels. We can handle 5 level page tables, and it would be good if
> all levels could be tested, but using mm_alloc() to establish the page
> tables might not work on s390. One option could be to provide an arch-hook
> or weak function to allocate/initialize the mm.

Sure, got it. Though I plan to add some arch specific tests or an init
sequence like the above later on, for now the idea is to get the smallest
possible set of test cases which builds and runs on all platforms, without
requiring any arch specific hooks or special casing (#ifdef), to be agreed
upon broadly and accepted.

Do you think this is absolutely necessary on s390 for the very first set of
test cases, or can we add this later on as an improvement ?
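
If it can wait, the hook itself could be as small as this rough sketch
(the function name here is made up):

struct mm_struct * __weak arch_pgtable_test_mm_alloc(void)
{
	/*
	 * Generic default; arch code (e.g. s390) can override this to
	 * hand back an mm whose page table already has all the levels
	 * it wants tested.
	 */
	return mm_alloc();
}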

> 
> IIUC, the (dummy) mm is really only needed to provide an mm->pgd as starting
> point, right?

Right.


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Anshuman Khandual


On 09/05/2019 01:46 AM, Gerald Schaefer wrote:
> On Tue,  3 Sep 2019 13:31:46 +0530
> Anshuman Khandual  wrote:
> 
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> Test page table and memory pages creating it's entries at various level are
>> all allocated from system memory with required alignments. If memory pages
>> with required size and alignment could not be allocated, then all depending
>> individual tests are skipped.
> 
> This looks very useful, thanks. Of course, s390 is quite special and does
> not work nicely with this patch (yet), mostly because of our dynamic page
> table levels/folding. Still need to figure out what can be fixed in the arch

Hmm.

> code and what would need to be changed in the test module. See below for some
> generic comments/questions.

Sure.

> 
> At least one real bug in the s390 code was already revealed by this, which
> is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
> instead of reporting them as bad, which is apparently not how it is expected.

Hmm, so it has already started being useful :)

> 
> [...]
>> +/*
>> + * Basic operations
>> + *
>> + * mkold(entry) = An old and not a young entry
>> + * mkyoung(entry)   = A young and not an old entry
>> + * mkdirty(entry)   = A dirty and not a clean entry
>> + * mkclean(entry)   = A clean and not a dirty entry
>> + * mkwrite(entry)   = A write and not a write protected entry
>> + * wrprotect(entry) = A write protected and not a write entry
>> + * pxx_bad(entry)   = A mapped and non-table entry
>> + * pxx_same(entry1, entry2) = Both entries hold the exact same value
>> + */
>> +#define VADDR_TEST  (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
> 
> Why is P4D_SIZE missing in the VADDR_TEST calculation?

This was a random possible virtual address, chosen to generate a
representative page table structure for the test. As there is a default
(PGDIR_SIZE) for P4D_SIZE on platforms which really do not have a P4D level,
it should be okay to add P4D_SIZE to the above calculation.

> 
> [...]
>> +
>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>> +static void pud_clear_tests(pud_t *pudp)
>> +{
>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>> +pud_clear(pudp);
>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
>> +}
> 
> For pgd/p4d/pud_clear(), we only clear if the page table level is present
> and not folded. The memset() here overwrites the table type bits, so
> pud_clear() will not clear anything on s390 and the pud_none() check will
> fail.
> Would it be possible to OR a (larger) random value into the table, so that
> the lower 12 bits would be preserved?

So the suggestion is that instead of doing memset() on the entry with
RANDOM_NZVALUE, it should OR in a large random value preserving the lower 12
bits. Hmm, this should still do the trick for other platforms, as they just
need a non-zero value. So on s390 the lower 12 bits of the page table entry
already hold a valid value while entering this function, which would make
sure that pud_clear() really does clear the entry ?

> 
>> +
>> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t 
>> *pmdp)
>> +{
>> +/*
>> + * This entry points to next level page table page.
>> + * Hence this must not qualify as pud_bad().
>> + */
>> +pmd_clear(pmdp);
>> +pud_clear(pudp);
>> +pud_populate(mm, pudp, pmdp);
>> +WARN_ON(pud_bad(READ_ONCE(*pudp)));
>> +}
> 
> This will populate the pud with a pmd pointer that does not point to the
> beginning of the pmd table, but to the second entry (because of how
> VADDR_TEST is constructed). This will result in failing pud_bad() check
> on s390. Not sure why/how it works on other archs, but would it be possible
> to align pmdp down to the beginning of the pmd table (and similar for the
> other pxd_populate_tests)?

Right, that was a problem. Will fix it by using the saved entries used for
freeing the page table pages at the end, which always point to the beginning
of a page table page.

> 
> [...]
>> +
>> +p4d_free(mm, saved_p4dp);
>> +pud_free(mm, saved_pudp);
>> +pmd_free(mm, saved_pmdp);
>> +pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));
> 
> pgtable_t is arch-specific, and on s390 it is not a struct page pointer,
> but a pte pointer. So this will go wrong, also on all other archs (if any)
> where pgtable_t is not struct page.
> Would it be possible to use pte_free_kernel() instead, and just pass
> saved_ptep directly?

It makes sense, will change.


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Anshuman Khandual
On 09/04/2019 07:49 PM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2019 at 01:31:46PM +0530, Anshuman Khandual wrote:
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> Test page table and memory pages creating it's entries at various level are
>> all allocated from system memory with required alignments. If memory pages
>> with required size and alignment could not be allocated, then all depending
>> individual tests are skipped.
> 
> See my comments below.
> 
>>
>> Cc: Andrew Morton 
>> Cc: Vlastimil Babka 
>> Cc: Greg Kroah-Hartman 
>> Cc: Thomas Gleixner 
>> Cc: Mike Rapoport 
>> Cc: Jason Gunthorpe 
>> Cc: Dan Williams 
>> Cc: Peter Zijlstra 
>> Cc: Michal Hocko 
>> Cc: Mark Rutland 
>> Cc: Mark Brown 
>> Cc: Steven Price 
>> Cc: Ard Biesheuvel 
>> Cc: Masahiro Yamada 
>> Cc: Kees Cook 
>> Cc: Tetsuo Handa 
>> Cc: Matthew Wilcox 
>> Cc: Sri Krishna chowdary 
>> Cc: Dave Hansen 
>> Cc: Russell King - ARM Linux 
>> Cc: Michael Ellerman 
>> Cc: Paul Mackerras 
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Cc: "David S. Miller" 
>> Cc: Vineet Gupta 
>> Cc: James Hogan 
>> Cc: Paul Burton 
>> Cc: Ralf Baechle 
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-m...@vger.kernel.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-i...@vger.kernel.org
>> Cc: linuxppc-...@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: x...@kernel.org
>> Cc: linux-kernel@vger.kernel.org
>>
>> Suggested-by: Catalin Marinas 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  mm/Kconfig.debug   |  14 ++
>>  mm/Makefile|   1 +
>>  mm/arch_pgtable_test.c | 425 +
>>  3 files changed, 440 insertions(+)
>>  create mode 100644 mm/arch_pgtable_test.c
>>
>> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index 327b3ebf23bf..ce9c397f7b07 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
>>  depends on STRICT_KERNEL_RWX
>>  ---help---
>>This option enables a testcase for the setting rodata read-only.
>> +
>> +config DEBUG_ARCH_PGTABLE_TEST
>> +bool "Test arch page table helpers for semantics compliance"
>> +depends on MMU
>> +depends on DEBUG_KERNEL
>> +help
>> +  This options provides a kernel module which can be used to test
>> +  architecture page table helper functions on various platform in
>> +  verifying if they comply with expected generic MM semantics. This
>> +  will help architectures code in making sure that any changes or
>> +  new additions of these helpers will still conform to generic MM
>> +  expected semantics.
>> +
>> +  If unsure, say N.
>> diff --git a/mm/Makefile b/mm/Makefile
>> index d996846697ef..bb572c5aa8c5 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>>  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>> +obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
>>  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
>>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
>> new file mode 100644
>> index ..f15be8a73723
>> --- /dev/null
>> +++ b/mm/arch_pgtable_test.c
>> @@ -0,0 +1,425 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * This kernel module validates architecture page table helpers &
>> + * accessors and helps in verifying their continued compliance with
>> + * generic MM semantics.
>> + *
>> + * Copyright (C) 2019 ARM Ltd.
>> + *
>> + * Author: Anshuman Khandual 
>> + */
>> +#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>

Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-04 Thread Anshuman Khandual



On 09/04/2019 01:46 PM, David Hildenbrand wrote:
> On 03.09.19 11:45, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
> Since
> 
> commit 60bb462fc7adb06ebee3beb5a4af6c7e6182e248
> Author: David Hildenbrand 
> Date:   Wed Aug 28 13:57:15 2019 +1000
> 
> drivers/base/node.c: simplify unregister_memory_block_under_nodes()
> 
> that problem should be gone. There is no get_nid_for_pfn() call anymore.

Yes, the problem is gone. The above commit was still not present in the
arm64 tree against which this series was rebased and tested while posting.

> 
> So this patch should no longer be necessary - but as I said during
> earlier versions of this patch, the re-ordering might still make sense
> for consistency (removing stuff in the reverse order they were added).
> You'll have to rephrase the description then.

Sure, will reword the commit message along these lines.


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Anshuman Khandual



On 09/03/2019 04:43 PM, kbuild test robot wrote:
> Hi Anshuman,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [cannot apply to v5.3-rc7 next-20190902]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-debug-Add-tests-for-architecture-exported-page-table-helpers/20190903-162959
> config: m68k-allmodconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 7.4.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.4.0 make.cross ARCH=m68k 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
> 
> All error/warnings (new ones prefixed by >>):
> 
>In file included from arch/m68k/include/asm/bug.h:32:0,
> from include/linux/bug.h:5,
> from include/linux/thread_info.h:12,
> from include/asm-generic/preempt.h:5,
> from ./arch/m68k/include/generated/asm/preempt.h:1,
> from include/linux/preempt.h:78,
> from arch/m68k/include/asm/irqflags.h:6,
> from include/linux/irqflags.h:16,
> from arch/m68k/include/asm/atomic.h:6,
> from include/linux/atomic.h:7,
> from include/linux/mm_types_task.h:13,
> from include/linux/mm_types.h:5,
> from include/linux/hugetlb.h:5,
> from mm/arch_pgtable_test.c:14:
>mm/arch_pgtable_test.c: In function 'pmd_clear_tests':
>>> arch/m68k/include/asm/page.h:31:22: error: lvalue required as unary '&' 
>>> operand
> #define pmd_val(x) ((&x)->pmd[0])
>  ^
>include/asm-generic/bug.h:124:25: note: in definition of macro 'WARN_ON'
>  int __ret_warn_on = !!(condition);\
> ^
>>> arch/m68k/include/asm/motorola_pgtable.h:138:26: note: in expansion of 
>>> macro 'pmd_val'
> #define pmd_none(pmd)  (!pmd_val(pmd))
>  ^~~
>>> mm/arch_pgtable_test.c:233:11: note: in expansion of macro 'pmd_none'
>  WARN_ON(!pmd_none(READ_ONCE(*pmdp)));
>   ^~~~
>mm/arch_pgtable_test.c: In function 'pmd_populate_tests':
>>> arch/m68k/include/asm/page.h:31:22: error: lvalue required as unary '&' 
>>> operand
> #define pmd_val(x) ((&x)->pmd[0])

Storing READ_ONCE(*pmdp) in a local pmd_t variable first solves the problem.
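
In other words, a sketch of the fix in the test's own style:

static void pmd_clear_tests(pmd_t *pmdp)
{
	pmd_t pmd;

	memset(pmdp, RANDOM_NZVALUE, sizeof(pmd_t));
	pmd_clear(pmdp);
	pmd = READ_ONCE(*pmdp);
	WARN_ON(!pmd_none(pmd));	/* '&' now applies to a local lvalue */
}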


Re: [PATCH] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-03 Thread Anshuman Khandual



On 09/04/2019 10:27 AM, Justin He (Arm Technology China) wrote:
> Hi Anshuman, thanks for the comments, see below please
> 
>> -Original Message-----
>> From: Anshuman Khandual 
>> Sent: 2019年9月4日 12:38
>> To: Justin He (Arm Technology China) ; Andrew
>> Morton ; Matthew Wilcox
>> ; Jérôme Glisse ; Ralph
>> Campbell ; Jason Gunthorpe ;
>> Peter Zijlstra ; Dave Airlie ;
>> Aneesh Kumar K.V ; Thomas Hellstrom
>> ; Souptick Joarder ;
>> linux...@kvack.org; linux-kernel@vger.kernel.org
>> Subject: Re: [PATCH] mm: fix double page fault on arm64 if PTE_AF is
>> cleared
>>
>>
>>
>> On 09/04/2019 08:49 AM, Anshuman Khandual wrote:
>>> /*
>>>  * This really shouldn't fail, because the page is there
>>>  * in the page tables. But it might just be unreadable,
>>>  * in which case we just give up and fill the result with
>>> -* zeroes.
>>> +* zeroes. If PTE_AF is cleared on arm64, it might
>>> +* cause double page fault here. so makes pte young here
>>>  */
>>> +   if (!pte_young(vmf->orig_pte)) {
>>> +   entry = pte_mkyoung(vmf->orig_pte);
>>> +   if (ptep_set_access_flags(vmf->vma, vmf->address,
>>> +   vmf->pte, entry, vmf->flags &
>> FAULT_FLAG_WRITE))
>>> +   update_mmu_cache(vmf->vma, vmf-
>>> address,
>>> +   vmf->pte);
>>> +   }
>>
>> This looks correct where it updates the pte entry with PTE_AF which
>> will prevent a subsequent page fault. But I think what we really need
>> here is to make sure 'uaddr' is mapped correctly at vma->pte. Probably
>> a generic function arch_map_pte() when defined for arm64 should check
>> CPU version and ensure continuance of PTE_AF if required. The comment
>> above also need to be updated saying not only the page should be there
>> in the page table, it needs to mapped appropriately as well.
> 
> I agree that a generic interface here is needed but not the arch_map_pte().
> In this case, I thought all the pgd/pud/pmd/pte had been set correctly except
> for the PTE_AF bit.
> How about arch_hw_access_flag()?

Sure, go ahead. I just meant 'map' to include not only the PFN but also
the appropriate HW attributes, so as not to cause a page fault.

> If non-arm64, arch_hw_access_flag() == true

The function does not need to return anything. A dummy default definition
in generic MM will do nothing when the arch does not override it.

> If arm64 with hardware-managed access flag supported, == true
> else == false?

On arm64 with the hardware-managed access flag supported, it will do
nothing. But in case it's not supported, the above mentioned pte update
from the current proposal needs to be executed. The details should be
hidden in the arch specific override.
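
So roughly, as a sketch (arch_hw_access_flag() is the name proposed in
this thread; the signature and placement are illustrative):

#ifndef arch_hw_access_flag
static inline void arch_hw_access_flag(struct vm_fault *vmf)
{
	/*
	 * Generic default: assume the hardware manages the access flag,
	 * so there is nothing to fix up before the uaccess copy.
	 */
}
#endif

arm64 would then override it to perform the pte_mkyoung() plus
ptep_set_access_flags() update only when the CPU lacks hardware access
flag management.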


Re: [PATCH] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-03 Thread Anshuman Khandual



On 09/04/2019 08:49 AM, Anshuman Khandual wrote:
>   /*
>* This really shouldn't fail, because the page is there
>* in the page tables. But it might just be unreadable,
>* in which case we just give up and fill the result with
> -  * zeroes.
> +  * zeroes. If PTE_AF is cleared on arm64, it might
> +  * cause double page fault here. so makes pte young here
>*/
> + if (!pte_young(vmf->orig_pte)) {
> + entry = pte_mkyoung(vmf->orig_pte);
> + if (ptep_set_access_flags(vmf->vma, vmf->address,
> + vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE))
> + update_mmu_cache(vmf->vma, vmf->address,
> + vmf->pte);
> + }

This looks correct in that it updates the pte entry with PTE_AF, which
will prevent a subsequent page fault. But I think what we really need
here is to make sure 'uaddr' is mapped correctly at vmf->pte. Probably
a generic function arch_map_pte(), when defined for arm64, should check
the CPU version and ensure continuance of PTE_AF if required. The comment
above also needs to be updated to say that not only should the page be
there in the page table, it also needs to be mapped appropriately.


Re: [PATCH] mm: fix double page fault on arm64 if PTE_AF is cleared

2019-09-03 Thread Anshuman Khandual
On 09/04/2019 06:28 AM, Jia He wrote:
> When we tested pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest,
> there is a double page fault in __copy_from_user_inatomic of cow_user_page.
> 
> Below call trace is from arm64 do_page_fault for debugging purpose
> [  110.016195] Call trace:
> [  110.016826]  do_page_fault+0x5a4/0x690
> [  110.017812]  do_mem_abort+0x50/0xb0
> [  110.018726]  el1_da+0x20/0xc4
> [  110.019492]  __arch_copy_from_user+0x180/0x280
> [  110.020646]  do_wp_page+0xb0/0x860
> [  110.021517]  __handle_mm_fault+0x994/0x1338
> [  110.022606]  handle_mm_fault+0xe8/0x180
> [  110.023584]  do_page_fault+0x240/0x690
> [  110.024535]  do_mem_abort+0x50/0xb0
> [  110.025423]  el0_da+0x20/0x24
> 
> The pte info before __copy_from_user_inatomic is(PTE_AF is cleared):
> [9b007000] pgd=00023d4f8003, pud=00023da9b003, 
> pmd=00023d4b3003, pte=36298607bd3
> 
> The key point is: we don't always have a hardware-managed access flag on
> arm64.
> 
> The root cause is in copy_one_pte(), which clears the PTE_AF for COW
> pages. Generally, when a COW page is accessed by the user, it will be
> marked accessed (the PTE_AF bit on arm64) by hardware, if the hardware
> feature is supported. But on some arm64 platforms, the PTE_AF needs to
> be set by software.
> 
> This patch fixes it by calling pte_mkyoung(). Also, the parameters are
> changed because vmf should be passed to cow_user_page()
> 
> [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> 
> Reported-by: Yibo Cai 
> Signed-off-by: Jia He 
> ---
>  mm/memory.c | 21 -
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e2bb51b6242e..b1f9ace2e943 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2140,7 +2140,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, 
> pmd_t *pmd,
>   return same;
>  }
>  
> -static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> +static inline void cow_user_page(struct page *dst, struct page *src,
> + struct vm_fault *vmf)
>  {
>   debug_dma_assert_idle(src);
>  
> @@ -2152,20 +2153,30 @@ static inline void cow_user_page(struct page *dst, 
> struct page *src, unsigned lo
>*/
>   if (unlikely(!src)) {
>   void *kaddr = kmap_atomic(dst);
> - void __user *uaddr = (void __user *)(va & PAGE_MASK);
> + void __user *uaddr = (void __user *)(vmf->address & PAGE_MASK);
> + pte_t entry;
>  
>   /*
>* This really shouldn't fail, because the page is there
>* in the page tables. But it might just be unreadable,
>* in which case we just give up and fill the result with
> -  * zeroes.
> +  * zeroes. If PTE_AF is cleared on arm64, it might
> +  * cause double page fault here. so makes pte young here
>*/
> + if (!pte_young(vmf->orig_pte)) {
> + entry = pte_mkyoung(vmf->orig_pte);
> + if (ptep_set_access_flags(vmf->vma, vmf->address,
> + vmf->pte, entry, vmf->flags & FAULT_FLAG_WRITE))
> + update_mmu_cache(vmf->vma, vmf->address,
> + vmf->pte);
> + }
> +
>   if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))

Should not page faults be disabled when doing this ? Ideally it should
have also called access_ok() on the user address range first. The point
is that the caller of __copy_from_user_inatomic() must make sure that
there cannot be any page fault while doing the actual copy. But it should
also be done in a generic way, something like in access_ok(). The current
proposal here seems very specific to the arm64 case.
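
For illustration, the pattern being asked about looks roughly like this
(a sketch only; note that kmap_atomic() in this path already disables page
faults, so this is about making the requirement explicit rather than
adding new protection):

	pagefault_disable();
	ret = __copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE);
	pagefault_enable();
	if (ret)
		memset(kaddr, 0, PAGE_SIZE);	/* unreadable: zero-fill */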


[PATCH V7 3/3] arm64/mm: Enable memory hot remove

2019-09-03 Thread Anshuman Khandual
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
___remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

remove_pagetable() makes two distinct passes over the kernel page table.
In the first pass it unmaps, invalidates applicable TLB cache and frees
backing memory if required (vmemmap) for each mapped leaf entry. In the
second pass it looks for empty page table sections whose page table page
can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and
a concurrent vmalloc() walking the kernel page table. This conflict cannot
just be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). Hence, unlike the linear mapping, skip freeing page
table pages while tearing down the vmemmap mapping when the vmalloc and
vmemmap ranges overlap.

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture.

Acked-by: Steve Capper 
Acked-by: David Hildenbrand 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 338 +++-
 3 files changed, 333 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6481964b6425..0079820af308 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -274,6 +274,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+   def_bool y
+
 config SMP
def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50bf68b1..615dcd08acfa 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR  (BPF_JIT_REGION_END)
 #define MODULES_VSIZE  (SZ_128M)
 #define VMEMMAP_START  (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END (VMEMMAP_START - SZ_2M)
 #define PCI_IO_START   (PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ee1bf416368d..980d761a7327 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
 
 static DEFINE_SPINLOCK(swapper_pgdir_lock);
 
+/*
+ * This represents if vmalloc and vmemmap address range overlap with
+ * each other on an intermediate level kernel page table entry which
+ * in turn helps in deciding whether empty kernel page table pages
+ * if any can be freed during memory hotplug operation.
+ */
+static bool vmalloc_vmemmap_overlap;
+
 void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
pgd_t *fixmap_pgdp;
@@ -723,6 +731,250 @@ int kern_addr_valid(unsigned long addr)
 
return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+   WARN_ON(!page || PageReserved(page));
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+   free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr)
+{
+   struct page *page;
+   pte_t *ptep;
+   int i;
+
+   ptep = pte_offset_kernel(pmdp, 0UL);
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   if (!pte_none(READ_ONCE(ptep[i])))
+   return;
+   }
+
+   page = pmd_page(READ_ONCE(*pmdp));
+   pmd_clear(pmdp);
+   __flush_tlb_kernel_pgtable(addr);
+   free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pud_t *pudp, unsigned long addr)
+{
+   struct page *page;
+   pmd_t *pmdp;
+   int i;
+
+   if (CONFIG_PGTABLE_LEVELS <= 2)
+   return;
+
+   pmdp = pmd_offset(pudp, 0UL);
+   for (i 

[PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

2019-09-03 Thread Anshuman Khandual
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand 
Acked-by: Mark Rutland 
Signed-off-by: Anshuman Khandual 
---
 arch/arm64/mm/ptdump_debugfs.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..b5eebc8c4924 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
 #include 
 
 #include 
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
struct ptdump_info *info = m->private;
+
+   get_online_mems();
ptdump_walk_pgd(m, info);
+   put_online_mems();
return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.20.1



[PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()

2019-09-03 Thread Anshuman Khandual
Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
entries between memory block and node. It first checks pfn validity with
pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
(arm64 has this enabled) pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in the memory hot remove path which has already removed the
given memory range from the memory block with memblock_[remove|free] before
arriving at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns
-1, skipping the subsequent sysfs_remove_link() calls and leaving the
node <-> memory block sysfs entries as is. A subsequent memory add operation
hits BUG_ON() because of the existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x68000, assuming node 0
[   62.052517] [ cut here ]
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 
5.1.0-rc2-4-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 4045 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : 168b3ce0
[   62.060477] x29: 168b3ce0 x28: 8005db546c00
[   62.061501] x27:  x26: 
[   62.062509] x25: 111ef000 x24: 111ef5d0
[   62.063520] x23:  x22: 0006bfff
[   62.064540] x21: ffef x20: 006c
[   62.065558] x19: 0068 x18: 0024
[   62.066566] x17:  x16: 
[   62.067579] x15:  x14: 8005e412e890
[   62.068588] x13: 8005d6b105d8 x12: 
[   62.069610] x11: 8005d6b10490 x10: 0040
[   62.070615] x9 : 8005e412e898 x8 : 8005e412e890
[   62.071631] x7 : 8005d6b105d8 x6 : 8005db546c00
[   62.072640] x5 : 0001 x4 : 0002
[   62.073654] x3 : 8005d7049480 x2 : 0002
[   62.074666] x1 : 0003 x0 : ffef
[   62.075685] Process bash (pid: 3275, stack limit = 0xd754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
problem on arm64, as pfn_valid() then behaves correctly and returns positive
because the memblock for the address range still exists. arch_remove_memory()
removes the applicable memory sections from the zone with __remove_pages() and
tears down the kernel linear mapping. Removing the memblock regions afterwards
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range, else
we wouldn't allow removing it. So the reordering is indeed safe.

Reviewed-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Acked-by: Mark Rutland 
Acked-by: Michal Hocko 
Signed-off-by: Anshuman Khandual 
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c73f09913165..355c466e0621 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
-   memblock_free(start, size);
-   memblock_remove(start, size);
 
/* remove memory block devices before removing memory */
remove_memory_block_devices(start, size);
 
arch_remove_memory(nid, start, size, NULL);
+   memblock_free(start, size);
+   memblock_remove(start, size);
__release_memory_resource(start, size);
 
try_offline_node(nid);
-- 
2.20.1



[PATCH V7 0/3] arm64/mm: Enable memory hot remove

2019-09-03 Thread Anshuman Khandual
pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (3):
  mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig  |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c | 338 +++-
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 mm/memory_hotplug.c |   4 +-
 5 files changed, 339 insertions(+), 11 deletions(-)

-- 
2.20.1



[PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-03 Thread Anshuman Khandual
This adds a test module which will validate architecture page table helpers
and accessors regarding compliance with generic MM semantics expectations.
This will help various architectures in validating changes to the existing
page table helpers or addition of new ones.

The test page table and the memory pages backing its entries at various
levels are all allocated from system memory with the required alignments. If
memory pages with the required size and alignment cannot be allocated, then
all dependent individual tests are skipped.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Suggested-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 mm/Kconfig.debug   |  14 ++
 mm/Makefile|   1 +
 mm/arch_pgtable_test.c | 425 +
 3 files changed, 440 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 327b3ebf23bf..ce9c397f7b07 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_ARCH_PGTABLE_TEST
+   bool "Test arch page table helpers for semantics compliance"
+   depends on MMU
+   depends on DEBUG_KERNEL
+   help
+ This option provides a kernel module which can be used to test
+ architecture page table helper functions on various platforms,
+ verifying whether they comply with expected generic MM semantics.
+ This will help architecture code ensure that any changes to, or
+ new additions of, these helpers still conform to the expected
+ generic MM semantics.
+
+ If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index d996846697ef..bb572c5aa8c5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
new file mode 100644
index ..f15be8a73723
--- /dev/null
+++ b/mm/arch_pgtable_test.c
@@ -0,0 +1,425 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel module validates architecture page table helpers &
+ * accessors and helps in verifying their continued compliance with
+ * generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual 
+ */
+#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)             = An old and not a young entry
+ * mkyoung(entry)           = A young and not an old entry
+ * mkdirty(entry)           = A dirty and not a clean entry
+ * mkclean(entry)           = A clean and not a dirty entry
+ * mkwrite(entry)           = A write and not a write protected entry
+ * wrprotect(entry)         = A write protected and not a write entry
+ * pxx_bad(entry)           = A mapped and non-table entry
+ * pxx_same(entry1, entry2) = Both entries hold the exact same value
+ */
+#define VADDR_TEST (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
+#define VMA_TEST_FLAGS (VM_READ|VM_WRITE|VM_EXEC)
+#define RANDOM_NZVALUE (0xbe)
+
+static bool pud_aligned;
+static bool pmd_aligned;
+
+extern struct mm_struct *mm_alloc(void);
+
+static void pte_basic_tests(struct page *page, pgprot_t prot)
+{
+   pte_t pte = mk_pte(page, prot);
+
+   WARN_ON(!pte_same(pte, pte));
+   WARN_ON(!pte_young(pte_mkyoung(pte)));
+   WARN_ON(!pte_dirty(pte_mkdirty(pte)));
+   WARN_ON(!pte_write(pte_mkwrite(pte)));
+   WARN_ON(pte_young(pte_mkold(pte)));
+   WARN_ON(pte_dirty(pte_mkclean(pte)));
+   WARN_ON(pte_write(pte_wrprotect(pte)));
+}

[PATCH 0/1] mm/debug: Add tests for architecture exported page table helpers

2019-09-03 Thread Anshuman Khandual
This series adds validation tests for architecture-exported page table
helpers. The patch in the series adds basic transformation tests at various
levels of the page table.

This test was originally suggested by Catalin during the arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests for
various generic MM functions like THP, HugeTLB etc., as well as platform
specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Questions:

Should alloc_gigantic_page() be made available as an interface for general
use in the kernel? The test module here uses a very similar implementation to
the one in HugeTLB to allocate a PUD aligned memory block. Similarly for
mm_alloc(), which needs to be exported through a header.

Matthew Wilcox had expressed concerns regarding memory allocation for mapped
page table entries at various levels. He also suggested using synthetic pfns
which can be derived from the virtual address of a known kernel text symbol
like kernel_init(). But as discussed previously, it seems like allocated
memory might still outweigh synthetic pfns. This proposal goes with allocated
memory, but if there is broader agreement in favor of synthetic pfns, I will
be happy to rework the test.
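
For reference, the synthetic pfn alternative would look something like this
(a sketch, assuming virt_to_pfn() or an equivalent exists on the platform):

	/* Derive a pfn from a known kernel text symbol ... */
	unsigned long pfn = virt_to_pfn(kernel_init);

	/* ... and clear the low bits to get the required alignment. The
	 * resulting pfn may no longer be backed by real memory, which is
	 * the concern noted above. */
	unsigned long pmd_pfn = pfn & ~((PMD_SIZE >> PAGE_SHIFT) - 1);
	unsigned long pud_pfn = pfn & ~((PUD_SIZE >> PAGE_SHIFT) - 1);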

Testing:

Build and boot tested on arm64 and x86 platforms. While arm64 passes all
these tests, the following errors were reported on x86.

1. WARN_ON(pud_bad(pud)) in pud_populate_tests()
2. WARN_ON(p4d_bad(p4d)) in p4d_populate_tests()

I would really appreciate it if folks could help validate this test on other
platforms and report back any problems. Suggestions, comments and inputs
are welcome. Thank you.
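
For anyone helping out, a minimal config fragment per the Kconfig entry in
the patch; failures surface as WARN_ON() splats in the kernel log:

	CONFIG_DEBUG_KERNEL=y
	CONFIG_DEBUG_ARCH_PGTABLE_TEST=y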

Changes in V3:

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved the test module and its config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated, it falls back on a PMD aligned
  memory block from the page allocator and the pud_* tests are skipped
- Clear and populate tests now operate on real, in-memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (1):
  mm/pgtable/debug: Add test validating architecture page table helpers

 mm/Kconfig.debug   |  14 ++
 mm/Makefile|   1 +
 mm/arch_pgtable_test.c | 425 +
 3 files changed, 440 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

-- 
2.20.1



Re: [RFC V2 0/1] mm/debug: Add tests for architecture exported page table helpers

2019-08-28 Thread Anshuman Khandual



On 08/26/2019 06:43 PM, Matthew Wilcox wrote:
> On Mon, Aug 26, 2019 at 08:07:13AM +0530, Anshuman Khandual wrote:
>> On 08/09/2019 07:22 PM, Matthew Wilcox wrote:
>>> On Fri, Aug 09, 2019 at 04:05:07PM +0530, Anshuman Khandual wrote:
>>>> On 08/09/2019 03:46 PM, Matthew Wilcox wrote:
>>>>> On Fri, Aug 09, 2019 at 01:03:17PM +0530, Anshuman Khandual wrote:
>>>>>> Should alloc_gigantic_page() be made available as an interface for 
>>>>>> general
>>>>>> use in the kernel. The test module here uses very similar implementation 
>>>>>> from
>>>>>> HugeTLB to allocate a PUD aligned memory block. Similar for mm_alloc() 
>>>>>> which
>>>>>> needs to be exported through a header.
>>>>>
>>>>> Why are you allocating memory at all instead of just using some
>>>>> known-to-exist PFNs like I suggested?
>>>>
>>>> We needed PFN to be PUD aligned for pfn_pud() and PMD aligned for mk_pmd().
>>>> Now walking the kernel page table for a known symbol like kernel_init()
>>>
>>> I didn't say to walk the kernel page table.  I said to call virt_to_pfn()
>>> for a known symbol like kernel_init().
>>>
>>>> as you had suggested earlier we might encounter page table page entries at 
>>>> PMD
>>>> and PUD which might not be PMD or PUD aligned respectively. It seemed to me
>>>> that alignment requirement is applicable only for mk_pmd() and pfn_pud()
>>>> which create large mappings at those levels but that requirement does not
>>>> exist for page table pages pointing to next level. Is not that correct ? Or
>>>> I am missing something here ?
>>>
>>> Just clear the bottom bits off the PFN until you get a PMD or PUD aligned
>>> PFN.  It's really not hard.
>>
>> As Mark pointed out earlier that might end up being just a synthetic PFN
>> which might not even exist on a given system.
> 
> And why would that matter?
> 

To start with, the test uses a struct page with mk_pte() and mk_pmd(), while
a raw pfn gets used in pfn_pud() during pXX_basic_tests(). So we will not be
able to derive a valid struct page from a synthetic pfn. Also, if a synthetic
pfn is going to be used anyway, why derive it from a real kernel symbol like
kernel_init()? Could one not just be made up with the right alignment?
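
Concretely, the asymmetry in question (a sketch of the calls involved):

	pte_t pte = mk_pte(page, prot);	/* consumes a struct page */
	pmd_t pmd = mk_pmd(page, prot);	/* consumes a struct page */
	pud_t pud = pfn_pud(pfn, prot);	/* consumes a raw pfn */

	/* Recovering a struct page needs the pfn to be backed by a real
	 * memmap entry, which a synthetic pfn would not be: */
	struct page *p = pfn_to_page(pfn);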

Currently the test allocates the 'mm_struct' and other page table pages from
real memory, so why should it use synthetic pfns while creating actual page
table entries? A couple of benefits of going with synthetic pfns would be:

- It simplifies the test a bit by removing the PUD_SIZE allocation helpers
- It might enable the test to be run on systems without adequate memory

In the current proposal the allocation happens during boot, making it much
more likely to succeed than not; when it does fail, the respective tests are
skipped.

I am just wondering whether being able to run the complete set of tests on
smaller systems with less memory weighs a lot more in favor of going with
synthetic pfns instead.


Re: [RFC V2 0/1] mm/debug: Add tests for architecture exported page table helpers

2019-08-09 Thread Anshuman Khandual



On 08/09/2019 03:46 PM, Matthew Wilcox wrote:
> On Fri, Aug 09, 2019 at 01:03:17PM +0530, Anshuman Khandual wrote:
>> Should alloc_gigantic_page() be made available as an interface for general
>> use in the kernel. The test module here uses very similar implementation from
>> HugeTLB to allocate a PUD aligned memory block. Similar for mm_alloc() which
>> needs to be exported through a header.
> 
> Why are you allocating memory at all instead of just using some
> known-to-exist PFNs like I suggested?

We needed the PFN to be PUD aligned for pfn_pud() and PMD aligned for
mk_pmd(). Now, walking the kernel page table for a known symbol like
kernel_init() as you had suggested earlier, we might encounter page table
entries at the PMD and PUD levels which are not PMD or PUD aligned
respectively. It seemed to me that the alignment requirement applies only to
mk_pmd() and pfn_pud(), which create large mappings at those levels, but that
it does not exist for page table pages pointing to the next level. Is that
not correct? Or am I missing something here?


Re: [PATCH] mm/sparse: use __nr_to_section(section_nr) to get mem_section

2019-08-09 Thread Anshuman Khandual



On 08/09/2019 06:32 AM, Wei Yang wrote:
> __pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)).

Right.

> 
> Since we already get section_nr, it is not necessary to get mem_section
> from start_pfn. By doing so, we reduce one redundant operation.
> 
> Signed-off-by: Wei Yang 

Looks right.

With this applied, memory hot add still works on arm64.

Reviewed-by: Anshuman Khandual 

> ---
>  mm/sparse.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 72f010d9bff5..95158a148cd1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -867,7 +867,7 @@ int __meminit sparse_add_section(int nid, unsigned long 
> start_pfn,
>*/
>   page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * 
> nr_pages);
>  
> - ms = __pfn_to_section(start_pfn);
> + ms = __nr_to_section(section_nr);
>   set_section_nid(section_nr, nid);
>   section_mark_present(ms);
>  
> 
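
In other words, given the definition quoted at the top, both expressions name
the same mem_section; the patch merely drops the redundant recomputation:

	ms = __pfn_to_section(start_pfn);  /* expands via pfn_to_section_nr() */
	ms = __nr_to_section(section_nr);  /* reuses the section_nr in hand */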


[RFC V2 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-08-09 Thread Anshuman Khandual
This adds a test module which will validate architecture page table helpers
and accessors for compliance with expected generic MM semantics. This will
help various architectures validate changes to existing page table helpers
or the addition of new ones.
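
For context (per the V2 changelog in the cover letter), the test builds its
fixtures roughly as follows; a simplified sketch, eliding the failure and
locking handling the real module needs:

	unsigned long vaddr = VADDR_TEST;
	struct mm_struct *mm = mm_alloc();

	pgd_t *pgd = pgd_offset(mm, vaddr);
	p4d_t *p4d = p4d_alloc(mm, pgd, vaddr);
	pud_t *pud = pud_alloc(mm, p4d, vaddr);
	pmd_t *pmd = pmd_alloc(mm, pud, vaddr);
	pte_t *pte = pte_alloc_map(mm, pmd, vaddr);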

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Suggested-by: Catalin Marinas 
Signed-off-by: Anshuman Khandual 
---
 mm/Kconfig.debug   |  14 ++
 mm/Makefile|   1 +
 mm/arch_pgtable_test.c | 400 +
 3 files changed, 415 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 82b6a20898bd..d3dfbe984d41 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -115,3 +115,17 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_ARCH_PGTABLE_TEST
+   bool "Test arch page table helpers for semantics compliance"
+   depends on MMU
+   depends on DEBUG_KERNEL
+   help
+ This option provides a kernel module which can be used to test
+ architecture page table helper functions on various platforms,
+ verifying whether they comply with expected generic MM semantics.
+ This will help architecture code ensure that any changes to, or
+ new additions of, these helpers still conform to the expected
+ generic MM semantics.
+
+ If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index 338e528ad436..0e6ac3789ca8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
new file mode 100644
index ..41d6fa78a620
--- /dev/null
+++ b/mm/arch_pgtable_test.c
@@ -0,0 +1,400 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel module validates architecture page table helpers &
+ * accessors and helps in verifying their continued compliance with
+ * generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual 
+ */
+#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)             = An old and not a young entry
+ * mkyoung(entry)           = A young and not an old entry
+ * mkdirty(entry)           = A dirty and not a clean entry
+ * mkclean(entry)           = A clean and not a dirty entry
+ * mkwrite(entry)           = A write and not a write protected entry
+ * wrprotect(entry)         = A write protected and not a write entry
+ * pxx_bad(entry)           = A mapped and non-table entry
+ * pxx_same(entry1, entry2) = Both entries hold the exact same value
+ */
+#define VADDR_TEST (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
+#define VMA_TEST_FLAGS (VM_READ|VM_WRITE|VM_EXEC)
+#define RANDOM_NZVALUE (0xbe)
+
+static bool pud_aligned;
+
+extern struct mm_struct *mm_alloc(void);
+
+static void pte_basic_tests(struct page *page, pgprot_t prot)
+{
+   pte_t pte = mk_pte(page, prot);
+
+   WARN_ON(!pte_same(pte, pte));
+   WARN_ON(!pte_young(pte_mkyoung(pte)));
+   WARN_ON(!pte_dirty(pte_mkdirty(pte)));
+   WARN_ON(!pte_write(pte_mkwrite(pte)));
+   WARN_ON(pte_young(pte_mkold(pte)));
+   WARN_ON(pte_dirty(pte_mkclean(pte)));
+   WARN_ON(pte_write(pte_wrprotect(pte)));
+}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE
+static void pmd_basic_tests(struct page *page, pgprot_t prot)
+{
+   pmd_t pmd = mk_pmd(page, prot);

[RFC V2 0/1] mm/debug: Add tests for architecture exported page table helpers

2019-08-09 Thread Anshuman Khandual
This series adds validation tests for architecture-exported page table
helpers. The patch in the series adds basic transformation tests at various
levels of the page table.

This test was originally suggested by Catalin during the arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests for
various generic MM functions like THP, HugeTLB etc., as well as platform
specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Questions:

Should alloc_gigantic_page() be made available as an interface for general
use in the kernel? The test module here uses a very similar implementation to
the one in HugeTLB to allocate a PUD aligned memory block. Similarly for
mm_alloc(), which needs to be exported through a header.

Testing:

Build and boot tested on arm64 and x86 platforms. While arm64 passes all
these tests, the following errors were reported on x86.

1. WARN_ON(pud_bad(pud)) in pud_populate_tests()
2. WARN_ON(p4d_bad(p4d)) in p4d_populate_tests()

I would really appreciate it if folks could help validate this test on other
platforms and report back any problems. Suggestions, comments and inputs
are welcome. Thank you.

Changes in V2:

- Moved the test module and its config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated, it falls back on a PMD aligned
  memory block from the page allocator and the pud_* tests are skipped
- Clear and populate tests now operate on real, in-memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-kernel@vger.kernel.org

Anshuman Khandual (1):
  mm/pgtable/debug: Add test validating architecture page table helpers

 mm/Kconfig.debug   |  14 ++
 mm/Makefile|   1 +
 mm/arch_pgtable_test.c | 400 +
 3 files changed, 415 insertions(+)
 create mode 100644 mm/arch_pgtable_test.c

-- 
2.20.1


