Re: [PATCH STABLE 4.9] mm: Avoid calling build_all_zonelists_init under hotplug context

2020-08-18 Thread David Hildenbrand
On 18.08.20 13:00, Oscar Salvador wrote:
> Recently a customer of ours experienced a crash when booting the
> system while enabling memory-hotplug.
> 
> The problem is that Normal zones on different nodes don't get their private
> zone->pageset allocated, and keep sharing the initial boot_pageset.
> The sharing between zones is normally safe as explained by the comment for
> boot_pageset - it's a percpu structure, and manipulations are done with
> disabled interrupts, and boot_pageset is set up in a way that any page placed
> on its pcplist is immediately flushed to shared zone's freelist, because
> pcp->high == 1.
> However, the hotplug operation updates pcp->high to a higher value as it
> expects to be operating on a private pageset.
> 
> The problem is in build_all_zonelists(), which is called when the first range
> of pages is onlined for the Normal zone of node X or Y:
> 
>   if (system_state == SYSTEM_BOOTING) {
>   build_all_zonelists_init();
>   } else {
>   #ifdef CONFIG_MEMORY_HOTPLUG
>   if (zone)
>   setup_zone_pageset(zone);
>   #endif
>   /* we have to stop all cpus to guarantee there is no user
>   of zonelist */
>   stop_machine(__build_all_zonelists, pgdat, NULL);
>   /* cpuset refresh routine should be here */
>   }
> 
> When called during hotplug, it should execute the setup_zone_pageset(zone)
> which allocates the private pageset.
> However, with memhp_default_state=online, this happens early while
> system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
> (and build_all_zonelists_init() is probably unsafe anyway at this point).
> 
> Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
> called from online_pages(), which updates the pcp->high for the shared
> boot_pageset to a value higher than 1.
> At that point, pages freed from Node X and Y Normal zones can end up on the 
> same
> pcplist and from there they can be freed to the wrong zone's freelist,
> leading to the corruption and crashes.
> 
> Please note that upstream has fixed that differently (and unintentionally) by
> adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
> That should happen before memory hotplug events even with 
> memhp_default_state=online.
> Backporting that would be too intrusive.
> 
> Signed-off-by: Oscar Salvador 
> Debugged-by: Vlastimil Babka 

So, we have ACPI running and already adding DIMMs while booting? Crazy.

Looks sane to me. Thanks!
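
For readers less familiar with pcplists, here is a much simplified userspace
sketch of the flush rule described in the quoted commit message (pcp->high == 1
forces every freed page straight to the zone's freelist). The structure and
function names are made up for illustration; the real logic lives in
mm/page_alloc.c.

  #include <stdio.h>

  struct pcp {
  	int count;	/* pages currently on this CPU's list */
  	int high;	/* flush threshold; boot_pageset uses 1 */
  };

  /* Toy model of freeing one page to a per-cpu pageset. */
  static void free_to_pcplist(struct pcp *pcp, int zone_id)
  {
  	pcp->count++;
  	if (pcp->count >= pcp->high) {
  		/*
  		 * With high == 1, every page is flushed immediately to the
  		 * freelist of the zone it came from, so sharing a single
  		 * boot_pageset between zones is harmless.  Once hotplug bumps
  		 * high above 1, pages from different zones can accumulate on
  		 * the same shared list and later be flushed to the wrong
  		 * zone's freelist.
  		 */
  		printf("flush %d page(s) to zone %d freelist\n",
  		       pcp->count, zone_id);
  		pcp->count = 0;
  	}
  }

  int main(void)
  {
  	struct pcp boot_pageset = { .count = 0, .high = 1 };

  	free_to_pcplist(&boot_pageset, 0);	/* flushed at once */
  	free_to_pcplist(&boot_pageset, 1);	/* flushed at once */
  	return 0;
  }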


-- 
Thanks,

David / dhildenb



Re: [PATCH v2 1/2] mm/pageblock: mitigation cmpxchg false sharing in pageblock flags

2020-08-19 Thread David Hildenbrand
On 19.08.20 09:55, Anshuman Khandual wrote:
> 
> 
> On 08/19/2020 11:17 AM, Alex Shi wrote:
>> pageblock_flags is accessed as a long. Since each pageblock's flags are
>> only 4 bits, a long covers the flags of 8 (on 32-bit machines) or 16
>> pageblocks, so setting one pageblock's flags has to cmpxchg against 7 or
>> 15 other pageblocks' flags. That can mean long waits for the cmpxchg to
>> succeed.
>>
>> If we changed pageblock_flags to be accessed as char, we could use a
>> char-sized cmpxchg, which only has to synchronize with one other
>> pageblock's flags. That would relieve much of the false sharing in
>> cmpxchg.
> 
> Do you have numbers demonstrating the claimed performance improvement
> after this change?
> 

I asked for that in v1 and there are no performance numbers to justify
the change. IMHO, they will be required to consider this for inclusion;
otherwise it's just code churn resulting in (albeit minimal) additional
memory consumption.

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 2/2] mm/pageblock: remove false sharing in pageblock_flags

2020-08-19 Thread David Hildenbrand
On 19.08.20 10:09, Alex Shi wrote:
> 
> 
> On 2020/8/19 3:57 PM, Anshuman Khandual wrote:
>>
>>
>> On 08/19/2020 11:17 AM, Alex Shi wrote:
>>> Each pageblock's flags are currently only 4 bits, so they have to share a
>>> char-sized cmpxchg with a neighbouring pageblock's flags when they get
>>> set; the false sharing causes a performance drop.
>>>
>>> If we increase the flags to 8 bits, the false sharing in cmpxchg is gone,
>>> and the only cost is half a char per pageblock, which is half a char per
>>> 128MB on x86, or 4 chars in 1 GB.
>>
>> Agreed that the increase in memory utilization is negligible here, but
>> does this really improve performance?
>>
> 
> In theory there is no doubt, and it would have a bad impact according to
> commit e380bebe4771548, mm, compaction: keep migration source private to a
> single
> 
> But I have some problems running thpscale/mmtest. I'd like to see if
> anyone could give it a try.
> 
> BTW, I naturally hate false sharing, even if only in theory. Anyone who
> doesn't? :)

I hate wasting memory, even if it's just a little bit, anyone who
doesn't? ;)


-- 
Thanks,

David / dhildenb



[PATCH v1 01/11] mm/memory_hotplug: inline __offline_pages() into offline_pages()

2020-08-19 Thread David Hildenbrand
There is only a single user, offline_pages(). Let's inline, to make
it look more similar to online_pages().

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7f62d69660e06..c781d386d87f9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1473,11 +1473,10 @@ static int count_system_ram_pages_cb(unsigned long 
start_pfn,
return 0;
 }
 
-static int __ref __offline_pages(unsigned long start_pfn,
- unsigned long end_pfn)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
-   unsigned long pfn, nr_pages = 0;
-   unsigned long offlined_pages = 0;
+   const unsigned long end_pfn = start_pfn + nr_pages;
+   unsigned long pfn, system_ram_pages = 0, offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
@@ -1494,9 +1493,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
 * memory holes PG_reserved, don't need pfn_valid() checks, and can
 * avoid using walk_system_ram_range() later.
 */
-   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+   walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
  count_system_ram_pages_cb);
-   if (nr_pages != end_pfn - start_pfn) {
+   if (system_ram_pages != nr_pages) {
ret = -EINVAL;
reason = "memory holes";
goto failed_removal;
@@ -1631,11 +1630,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
return ret;
 }
 
-int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
-{
-   return __offline_pages(start_pfn, start_pfn + nr_pages);
-}
-
 static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 {
int ret = !is_memblock_offlined(mem);
-- 
2.26.2



[PATCH v1 11/11] mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory

2020-08-19 Thread David Hildenbrand
Currently, it can happen that pages are allocated (and freed) via the buddy
before we finished basic memory onlining.

For example, pages are exposed to the buddy and can be allocated before
we actually mark the sections online. Allocated pages could suddenly
fail pfn_to_online_page() checks. We had similar issues with pcp
handling, when pages are allocated+freed before we reach
zone_pcp_update() in online_pages() [1].

Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.

[1] 
https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-chara...@codeaurora.org

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Charan Teja Reddy 
Signed-off-by: David Hildenbrand 
---
 mm/Kconfig  |  2 +-
 mm/memory_hotplug.c | 32 ++--
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f9..85a16ce1dbc49 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -152,6 +152,7 @@ config HAVE_BOOTMEM_INFO_NODE
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
+   select MEMORY_ISOLATION
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on 64BIT || BROKEN
@@ -178,7 +179,6 @@ config MEMORY_HOTPLUG_DEFAULT_ONLINE
 
 config MEMORY_HOTREMOVE
bool "Allow for memory hot remove"
-   select MEMORY_ISOLATION
select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64)
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1c16a5def781e..35d56cbd3e45b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -805,7 +805,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 
/* associate pfn range with the zone */
zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-   move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_MOVABLE);
+   move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
 
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
@@ -816,6 +816,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
if (ret)
goto failed_addition;
 
+   /*
+* Fixup the number of isolated pageblocks before marking the sections
+* onlining, such that undo_isolate_page_range() works correctly.
+*/
+   spin_lock_irqsave(&zone->lock, flags);
+   zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+   spin_unlock_irqrestore(&zone->lock, flags);
+
/*
 * If this zone is not populated, then it is not in zonelist.
 * This means the page allocator ignores this zone.
@@ -833,21 +841,25 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
zone->zone_pgdat->node_present_pages += nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+   node_states_set_node(nid, &arg);
+   if (need_zonelists_rebuild)
+   build_all_zonelists(NULL);
+   zone_pcp_update(zone);
+
+   /* Basic onlining is complete, allow allocation of onlined pages. */
+   undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+
/*
 * When exposing larger, physically contiguous memory areas to the
 * buddy, shuffling in the buddy (when freeing onlined pages, putting
 * them either to the head or the tail of the freelist) is only helpful
 * for maintaining the shuffle, but not for creating the initial
 * shuffle. Shuffle the whole zone to make sure the just onlined pages
-* are properly distributed across the whole freelist.
+* are properly distributed across the whole freelist. Make sure to
+* shuffle once pageblocks are no longer isolated.
 */
shuffle_zone(zone);
 
-   node_states_set_node(nid, &arg);
-   if (need_zonelists_rebuild)
-   build_all_zonelists(NULL);
-   zone_pcp_update(zone);
-
init_per_zone_wmark_min();
 
kswapd_run(nid);
@@ -1550,9 +1562,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
pr_info("Offlined Pages %ld\n", nr_pages);
 
/*
-* Onlining will reset pagetype flags and makes migrate type
-* MOVABLE, so just need to decrease the number of isolated
-* pageblocks zone counter here.
+* The memory sections are marked offline, and the pageblock flags
+* effectively stale; n

[PATCH v1 02/11] mm/memory_hotplug: enforce section granularity when onlining/offlining

2020-08-19 Thread David Hildenbrand
Already two people (including me) tried to offline subsections, because
the function looks like it can deal with it. But we really can only
online/offline full sections (e.g., we can only mark full sections
online/offline via SECTION_IS_ONLINE).

Add a simple safety net to document the restriction now. Current users
(core and powernv/memtrace) respect these restrictions.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c781d386d87f9..6856702af68d9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -801,6 +801,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
int ret;
struct memory_notify arg;
 
+   /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+   if (WARN_ON_ONCE(!nr_pages ||
+!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+   return -EINVAL;
+
mem_hotplug_begin();
 
/* associate pfn range with the zone */
@@ -1483,6 +1488,11 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
struct memory_notify arg;
char *reason;
 
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+   if (WARN_ON_ONCE(!nr_pages ||
+!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+   return -EINVAL;
+
mem_hotplug_begin();
 
/*
-- 
2.26.2



[PATCH v1 10/11] mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()

2020-08-19 Thread David Hildenbrand
On the memory hotplug path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Logan Gunthorpe 
Cc: Dan Williams 
Cc: Mike Rapoport 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michel Lespinasse 
Cc: linux-i...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 arch/ia64/mm/init.c|  4 ++--
 include/linux/memory_hotplug.h |  3 ++-
 include/linux/mm.h |  3 ++-
 mm/memory_hotplug.c| 11 ---
 mm/memremap.c  |  3 ++-
 mm/page_alloc.c| 21 -
 6 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 0b3fb4c7af292..82b7a46ddd23d 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -538,7 +538,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start),
-MEMMAP_EARLY, NULL);
+MEMMAP_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
@@ -548,7 +548,7 @@ memmap_init (unsigned long size, int nid, unsigned long 
zone,
 {
if (!vmem_map) {
memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY,
-   NULL);
+NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
struct memmap_init_callback_data args;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0b461691d1a49..cbafeda859380 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -346,7 +346,8 @@ extern int add_memory_resource(int nid, struct resource 
*resource);
 extern int add_memory_driver_managed(int nid, u64 start, u64 size,
 const char *resource_name);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-   unsigned long nr_pages, struct vmem_altmap *altmap);
+  unsigned long nr_pages,
+  struct vmem_altmap *altmap, int migratetype);
 extern void remove_pfn_range_from_zone(struct zone *zone,
   unsigned long start_pfn,
   unsigned long nr_pages);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ab941cf73f44..c842aa2a97ba2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2409,7 +2409,8 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-   enum memmap_context, struct vmem_altmap *);
+   enum memmap_context, struct vmem_altmap *,
+   int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3aba0d956f9b1..1c16a5def781e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -693,9 +693,14 @@ static void __meminit resize_pgdat_range(struct 
pglist_data *pgdat, unsigned lon
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
  * call, all affected pages are PG_reserved.
+ *
+ * All aligned pageblocks are initialized to the specified migratetype
+ * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
+ * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-   unsigned long nr_pages, struct vmem_altmap *altmap)
+ unsigned long nr_pages,
+ struct vmem_altmap *altmap, int migratetype)
 {
struct pglist_data *pgdat = zone->zone_pgdat;
int nid = pgdat->node_id;
@@ -720,7 +725,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * are reserved so nobody should be touching them so we should be safe
 */
memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
-   MEMMAP_HOTPLUG, altmap);
+MEMMAP_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
 }
@@ -800,7 +805,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 
/* associate pfn range with the zone */
zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-

[PATCH v1 05/11] mm/page_alloc: simplify __offline_isolated_pages()

2020-08-19 Thread David Hildenbrand
offline_pages() is the only user. __offline_isolated_pages() never gets
called with ranges that contain memory holes and we no longer care about
the return value. Drop the return value handling and all pfn_valid()
checks.

Update the documentation.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 include/linux/memory_hotplug.h |  4 ++--
 mm/page_alloc.c| 27 ---
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd83..0b461691d1a49 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,8 +103,8 @@ extern int online_pages(unsigned long pfn, unsigned long 
nr_pages,
int online_type, int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 unsigned long end_pfn);
-extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
-   unsigned long end_pfn);
+extern void __offline_isolated_pages(unsigned long start_pfn,
+unsigned long end_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf0b25161feae..03f585f95dc60 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8692,35 +8692,21 @@ void zone_pcp_reset(struct zone *zone)
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
- * All pages in the range must be in a single zone and isolated
- * before calling this.
+ * All pages in the range must be in a single zone, must not contain holes,
+ * must span full sections, and must be isolated before calling this function.
  */
-unsigned long
-__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
+   unsigned long pfn = start_pfn;
struct page *page;
struct zone *zone;
unsigned int order;
-   unsigned long pfn;
unsigned long flags;
-   unsigned long offlined_pages = 0;
-
-   /* find the first valid pfn */
-   for (pfn = start_pfn; pfn < end_pfn; pfn++)
-   if (pfn_valid(pfn))
-   break;
-   if (pfn == end_pfn)
-   return offlined_pages;
 
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
spin_lock_irqsave(&zone->lock, flags);
-   pfn = start_pfn;
while (pfn < end_pfn) {
-   if (!pfn_valid(pfn)) {
-   pfn++;
-   continue;
-   }
page = pfn_to_page(pfn);
/*
 * The HWPoisoned page may be not in buddy system, and
@@ -8728,7 +8714,6 @@ __offline_isolated_pages(unsigned long start_pfn, 
unsigned long end_pfn)
 */
if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
pfn++;
-   offlined_pages++;
continue;
}
/*
@@ -8739,20 +8724,16 @@ __offline_isolated_pages(unsigned long start_pfn, 
unsigned long end_pfn)
BUG_ON(page_count(page));
BUG_ON(PageBuddy(page));
pfn++;
-   offlined_pages++;
continue;
}
 
BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
order = page_order(page);
-   offlined_pages += 1 << order;
del_page_from_free_list(page, zone, order);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
-
-   return offlined_pages;
 }
 #endif
 
-- 
2.26.2



[PATCH v1 06/11] mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages()

2020-08-19 Thread David Hildenbrand
We make sure that we cannot have any memory holes right at the beginning
of offline_pages() and we only support to online/offline full sections.
Both sections and pageblocks are a power of two in size, and sections
always span full pageblocks.

We can directly calculate the number of isolated pageblocks from nr_pages.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 50aa5df696e9d..098361fcb4504 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1460,10 +1460,10 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn, system_ram_pages = 0;
-   int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
struct memory_notify arg;
+   int ret, node;
char *reason;
 
/* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
@@ -1507,7 +1507,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
reason = "failure to isolate range";
goto failed_removal;
}
-   nr_isolate_pageblock = ret;
 
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
@@ -1569,7 +1568,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
 * pageblocks zone counter here.
 */
spin_lock_irqsave(&zone->lock, flags);
-   zone->nr_isolate_pageblock -= nr_isolate_pageblock;
+   zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);
 
/* removal success */
-- 
2.26.2



[PATCH v1 03/11] mm/memory_hotplug: simplify checking if all pages are isolated in offline_pages()

2020-08-19 Thread David Hildenbrand
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(). We no longer need walk_system_ram_range() and can
call test_pages_isolated() directly.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6856702af68d9..f64478349148d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1384,17 +1384,6 @@ offline_isolated_pages_cb(unsigned long start, unsigned 
long nr_pages,
return 0;
 }
 
-/*
- * Check all pages in range, recorded as memory resource, are isolated.
- */
-static int
-check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
-   void *data)
-{
-   return test_pages_isolated(start_pfn, start_pfn + nr_pages,
-  MEMORY_OFFLINE);
-}
-
 static int __init cmdline_parse_movable_node(char *p)
 {
movable_node_enabled = true;
@@ -1579,10 +1568,7 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
reason = "failure to dissolve huge pages";
goto failed_removal_isolated;
}
-   /* check again */
-   ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
-   NULL, check_pages_isolated_cb);
-   } while (ret);
+   } while (test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE));
 
/* Ok, all of our target is isolated.
   We cannot do rollback at this point. */
-- 
2.26.2



[PATCH v1 08/11] mm/memory_hotplug: simplify page onlining

2020-08-19 Thread David Hildenbrand
We don't allow offlining memory with holes, all boot memory is online,
and hotplugged memory cannot have holes.

We can now simplify onlining of pages. As we only allow onlining/offlining
full sections and sections always span full MAX_ORDER_NR_PAGES, we can just
process MAX_ORDER - 1 pages without further special handling.

The number of onlined pages simply corresponds to the number of pages we
were requested to online.

While at it, refine the comment regarding the callback not exposing all
pages to the buddy.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 38 ++
 1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0011a1115381c..3aba0d956f9b1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -617,31 +617,22 @@ void generic_online_page(struct page *page, unsigned int 
order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
-   void *arg)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn;
-   int order;
 
/*
-* Online the pages. The callback might decide to keep some pages
-* PG_reserved (to add them to the buddy later), but we still account
-* them as being online/belonging to this zone ("present").
+* Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
+* decide to not expose all pages to the buddy (e.g., expose them
+* later). We account all pages as being online and belonging to this
+* zone ("present").
 */
-   for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
-   order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
-   /* __free_pages_core() wants pfns to be aligned to the order */
-   if (WARN_ON_ONCE(!IS_ALIGNED(pfn, 1ul << order)))
-   order = 0;
-   (*online_page_callback)(pfn_to_page(pfn), order);
-   }
+   for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+   (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
 
/* mark all involved sections as online */
online_mem_sections(start_pfn, end_pfn);
-
-   *(unsigned long *)arg += nr_pages;
-   return 0;
 }
 
 /* check which state of node_states will be changed when online memory */
@@ -795,7 +786,6 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
   int online_type, int nid)
 {
unsigned long flags;
-   unsigned long onlined_pages = 0;
struct zone *zone;
int need_zonelists_rebuild = 0;
int ret;
@@ -831,19 +821,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
setup_zone_pageset(zone);
}
 
-   ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
-   online_pages_range);
-   if (ret) {
-   /* not a single memory resource was applicable */
-   if (need_zonelists_rebuild)
-   zone_pcp_reset(zone);
-   goto failed_addition;
-   }
-
-   zone->present_pages += onlined_pages;
+   online_pages_range(pfn, nr_pages);
+   zone->present_pages += nr_pages;
 
pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages += onlined_pages;
+   zone->zone_pgdat->node_present_pages += nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
/*
-- 
2.26.2



[PATCH v1 09/11] mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()

2020-08-19 Thread David Hildenbrand
Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
set during struct page init") moved the actual zone range check, leaving
only the alignment check for pageblocks.

Let's drop the stale comment and make the pageblock check easier to read.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Signed-off-by: David Hildenbrand 
---
 mm/page_alloc.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 848664352dfe2..5db0b35f95e20 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6022,13 +6022,8 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 * to reserve their blocks rather than leaking throughout
 * the address space during boot when many long-lived
 * kernel allocations are made.
-*
-* bitmap is created for zone's valid pfn range. but memmap
-* can be created for invalid pages (for alignment)
-* check here not to call set_pageblock_migratetype() against
-* pfn out of zone.
 */
-   if (!(pfn & (pageblock_nr_pages - 1))) {
+   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
@@ -6091,15 +6086,10 @@ void __ref memmap_init_zone_device(struct zone *zone,
 * the address space during boot when many long-lived
 * kernel allocations are made.
 *
-* bitmap is created for zone's valid pfn range. but memmap
-* can be created for invalid pages (for alignment)
-* check here not to call set_pageblock_migratetype() against
-* pfn out of zone.
-*
 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
 * because this is done early in section_activate()
 */
-   if (!(pfn & (pageblock_nr_pages - 1))) {
+   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
-- 
2.26.2



[PATCH v1 07/11] mm/page_isolation: simplify return value of start_isolate_page_range()

2020-08-19 Thread David Hildenbrand
Callers no longer need the number of isolated pageblocks. Let's
simplify.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 2 +-
 mm/page_isolation.c | 7 ++-
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 098361fcb4504..0011a1115381c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1503,7 +1503,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
ret = start_isolate_page_range(start_pfn, end_pfn,
   MIGRATE_MOVABLE,
   MEMORY_OFFLINE | REPORT_FAILURE);
-   if (ret < 0) {
+   if (ret) {
reason = "failure to isolate range";
goto failed_removal;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03f585f95dc60..848664352dfe2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8456,7 +8456,7 @@ int alloc_contig_range(unsigned long start, unsigned long 
end,
 
ret = start_isolate_page_range(pfn_max_align_down(start),
   pfn_max_align_up(end), migratetype, 0);
-   if (ret < 0)
+   if (ret)
return ret;
 
/*
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 242c03121d731..b066c860e606e 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,8 +170,7 @@ __first_valid_page(unsigned long pfn, unsigned long 
nr_pages)
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
- * Return: the number of isolated pageblocks on success and -EBUSY if any part
- * of range cannot be isolated.
+ * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 unsigned migratetype, int flags)
@@ -179,7 +178,6 @@ int start_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
unsigned long pfn;
unsigned long undo_pfn;
struct page *page;
-   int nr_isolate_pageblock = 0;
 
BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
@@ -193,10 +191,9 @@ int start_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
undo_pfn = pfn;
goto undo;
}
-   nr_isolate_pageblock++;
}
}
-   return nr_isolate_pageblock;
+   return 0;
 undo:
for (pfn = start_pfn;
 pfn < undo_pfn;
-- 
2.26.2



[PATCH v1 04/11] mm/memory_hotplug: simplify offlining of pages in offline_pages()

2020-08-19 Thread David Hildenbrand
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(). We no longer need walk_system_ram_range() and can
call __offline_isolated_pages() directly.

offlined_pages always corresponds to nr_pages, so we can simplify that.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 28 
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f64478349148d..50aa5df696e9d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1373,17 +1373,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
end_pfn)
return ret;
 }
 
-/* Mark all sections offline and remove all free pages from the buddy. */
-static int
-offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
-   void *data)
-{
-   unsigned long *offlined_pages = (unsigned long *)data;
-
-   *offlined_pages += __offline_isolated_pages(start, start + nr_pages);
-   return 0;
-}
-
 static int __init cmdline_parse_movable_node(char *p)
 {
movable_node_enabled = true;
@@ -1470,7 +1459,7 @@ static int count_system_ram_pages_cb(unsigned long 
start_pfn,
 int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
-   unsigned long pfn, system_ram_pages = 0, offlined_pages = 0;
+   unsigned long pfn, system_ram_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
@@ -1570,11 +1559,10 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
}
} while (test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE));
 
-   /* Ok, all of our target is isolated.
-  We cannot do rollback at this point. */
-   walk_system_ram_range(start_pfn, end_pfn - start_pfn,
- &offlined_pages, offline_isolated_pages_cb);
-   pr_info("Offlined Pages %ld\n", offlined_pages);
+   /* Mark all sections offline and remove free pages from the buddy. */
+   __offline_isolated_pages(start_pfn, end_pfn);
+   pr_info("Offlined Pages %ld\n", nr_pages);
+
/*
 * Onlining will reset pagetype flags and makes migrate type
 * MOVABLE, so just need to decrease the number of isolated
@@ -1585,11 +1573,11 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
spin_unlock_irqrestore(&zone->lock, flags);
 
/* removal success */
-   adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
-   zone->present_pages -= offlined_pages;
+   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+   zone->present_pages -= nr_pages;
 
pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages -= offlined_pages;
+   zone->zone_pgdat->node_present_pages -= nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
init_per_zone_wmark_min();
-- 
2.26.2



[PATCH v1 00/11] mm/memory_hotplug: online_pages()/offline_pages() cleanups

2020-08-19 Thread David Hildenbrand
These are a bunch of cleanups for online_pages()/offline_pages() and
related code, mostly getting rid of memory hole handling that is no longer
necessary. There is only a single walk_system_ram_range() call left in
offline_pages(), to make sure we don't have any memory holes. I had some
of these patches lying around for a longer time but didn't have time to
polish them.

In addition, the last patch marks all pageblocks of the memory getting
onlined MIGRATE_ISOLATE, so pages that have just been exposed to the buddy
cannot get allocated before onlining is complete. Once the heavy lifting is
done, the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
possible.

I played with DIMMs and virtio-mem on x86-64 and didn't spot any surprises.
I verified that the number of isolated pageblocks is correctly handled when
onlining/offlining.
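
As a condensed orientation, the reworked onlining flow after this series ends
up looking roughly like the simplified pseudocode below. It is derived from
the diffs in the individual patches, not literal kernel code, and omits error
handling and locking.

  /* Simplified sketch of online_pages() after this series; not the real
   * function, details and error paths omitted. */
  int online_pages_sketch(unsigned long pfn, unsigned long nr_pages,
  			int online_type, int nid)
  {
  	struct zone *zone;

  	/* Full, section-aligned ranges only (patch 02). */
  	if (!nr_pages || !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))
  		return -EINVAL;

  	zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
  	/* Expose the range to the zone with all pageblocks isolated
  	 * (patches 10 and 11). */
  	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

  	/* ... memory notifier, nr_isolate_pageblock fixup, zonelist setup ... */

  	online_pages_range(pfn, nr_pages);	/* free pages to the buddy */
  	zone->present_pages += nr_pages;

  	/* ... zonelist/pcp updates, then allow allocations from the range. */
  	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
  	shuffle_zone(zone);

  	/* ... watermarks, kswapd, memory notifiers ... */
  	return 0;
  }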

David Hildenbrand (11):
  mm/memory_hotplug: inline __offline_pages() into offline_pages()
  mm/memory_hotplug: enforce section granularity when onlining/offlining
  mm/memory_hotplug: simplify checking if all pages are isolated in
offline_pages()
  mm/memory_hotplug: simplify offlining of pages in offline_pages()
  mm/page_alloc: simplify __offline_isolated_pages()
  mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages()
  mm/page_isolation: simplify return value of start_isolate_page_range()
  mm/memory_hotplug: simplify page onlining
  mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()
  mm: pass migratetype into memmap_init_zone() and
move_pfn_range_to_zone()
  mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining
memory

 arch/ia64/mm/init.c|   4 +-
 include/linux/memory_hotplug.h |   7 +-
 include/linux/mm.h |   3 +-
 mm/Kconfig |   2 +-
 mm/memory_hotplug.c| 156 ++---
 mm/memremap.c  |   3 +-
 mm/page_alloc.c|  64 --
 mm/page_isolation.c|   7 +-
 8 files changed, 98 insertions(+), 148 deletions(-)

-- 
2.26.2



Re: [PATCH v4 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-08-19 Thread David Hildenbrand
On 18.08.20 16:15, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Hi,
> 
> This is an implementation of "secret" mappings backed by a file descriptor. 
> 
> v4 changes:
> * rebase on v5.9-rc1
> * Do not redefine PMD_PAGE_ORDER in fs/dax.c, thanks Kirill
> * Make secret mappings exclusive by default and only require flags to
>   memfd_secret() system call for uncached mappings, thanks again Kirill :)
> 
> v3 changes:
> * Squash kernel-parameters.txt update into the commit that added the
>   command line option.
> * Make uncached mode explicitly selectable by architectures. For now enable
>   it only on x86.
> 
> v2 changes:
> * Follow Michael's suggestion and name the new system call 'memfd_secret'
> * Add kernel-parameters documentation about the boot option
> * Fix i386-tinyconfig regression reported by the kbuild bot.
>   CONFIG_SECRETMEM now depends on !EMBEDDED to disable it on small systems
>   from one side and still make it available unconditionally on
>   architectures that support SET_DIRECT_MAP.
> 
> 
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call. The desired protection mode for the
> memory is configured using the flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will have desired protection bits set in the user page
> table. For instance, current implementation allows uncached mappings.
> 
> Although normally Linux userspace mappings are protected from other users, 
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants'
> mappings.
> 
> Additionally, the secret mappings may be used as a mean to protect guest
> memory in a virtual machine host.
> 

Just a general question. I assume such pages (where the direct mapping
was changed) cannot get migrated - I can spot a simple alloc_page(). So
essentially a process can just allocate a whole bunch of memory that is
unmovable, correct? Is there any limit? Is it properly accounted towards
the process (memctl) ?
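
As a side note, my mental model of the proposed userspace interface is roughly
the sketch below. This is only a sketch: the syscall number macro and the
meaning of flags == 0 (exclusive mode, per the v4 changelog) are assumptions
based on the posted series, not a merged ABI.

  /* Rough usage sketch for the proposed interface: not a merged ABI.
   * __NR_memfd_secret and the flag values are assumptions based on the
   * posted series; adjust to whatever the final kernel headers define. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_memfd_secret
  #define __NR_memfd_secret 447	/* placeholder only; check your headers */
  #endif

  int main(void)
  {
  	const size_t len = 2 * 1024 * 1024;
  	int fd;
  	void *p;

  	fd = syscall(__NR_memfd_secret, 0);	/* 0: default (exclusive) mode */
  	if (fd < 0) {
  		perror("memfd_secret");
  		return 1;
  	}
  	if (ftruncate(fd, len) < 0) {
  		perror("ftruncate");
  		return 1;
  	}
  	/* Pages backing this mapping get dropped from the direct map. */
  	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  	if (p == MAP_FAILED) {
  		perror("mmap");
  		return 1;
  	}
  	strcpy(p, "secret");
  	munmap(p, len);
  	close(fd);
  	return 0;
  }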

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 6/6] mm: secretmem: add ability to reserve memory at boot

2020-08-19 Thread David Hildenbrand
On 18.08.20 16:15, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Taking pages out from the direct map and bringing them back may create
> undesired fragmentation and usage of the smaller pages in the direct
> mapping of the physical memory.
> 
> This can be avoided if a significantly large area of the physical memory
> would be reserved for secretmem purposes at boot time.
> 
> Add ability to reserve physical memory for secretmem at boot time using
> "secretmem" kernel parameter and then use that reserved memory as a global
> pool for secret memory needs.

Wouldn't something like CMA be the better fit? Just wondering. Then, the
memory can actually be reused for something else while not needed.

> 
> Signed-off-by: Mike Rapoport 
> ---
>  mm/secretmem.c | 134 ++---
>  1 file changed, 126 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 333eb18fb483..54067ea62b2d 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -45,6 +46,39 @@ struct secretmem_ctx {
>   unsigned int mode;
>  };
>  
> +struct secretmem_pool {
> + struct gen_pool *pool;
> + unsigned long reserved_size;
> + void *reserved;
> +};
> +
> +static struct secretmem_pool secretmem_pool;
> +
> +static struct page *secretmem_alloc_huge_page(gfp_t gfp)
> +{
> + struct gen_pool *pool = secretmem_pool.pool;
> + unsigned long addr = 0;
> + struct page *page = NULL;
> +
> + if (pool) {
> + if (gen_pool_avail(pool) < PMD_SIZE)
> + return NULL;
> +
> + addr = gen_pool_alloc(pool, PMD_SIZE);
> + if (!addr)
> + return NULL;
> +
> + page = virt_to_page(addr);
> + } else {
> + page = alloc_pages(gfp, PMD_PAGE_ORDER);
> +
> + if (page)
> + split_page(page, PMD_PAGE_ORDER);
> + }
> +
> + return page;
> +}
> +
>  static int secretmem_pool_increase(struct secretmem_ctx *ctx, gfp_t gfp)
>  {
>   unsigned long nr_pages = (1 << PMD_PAGE_ORDER);
> @@ -53,12 +87,11 @@ static int secretmem_pool_increase(struct secretmem_ctx 
> *ctx, gfp_t gfp)
>   struct page *page;
>   int err;
>  
> - page = alloc_pages(gfp, PMD_PAGE_ORDER);
> + page = secretmem_alloc_huge_page(gfp);
>   if (!page)
>   return -ENOMEM;
>  
>   addr = (unsigned long)page_address(page);
> - split_page(page, PMD_PAGE_ORDER);
>  
>   err = gen_pool_add(pool, addr, PMD_SIZE, NUMA_NO_NODE);
>   if (err) {
> @@ -267,11 +300,13 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
>   return err;
>  }
>  
> -static void secretmem_cleanup_chunk(struct gen_pool *pool,
> - struct gen_pool_chunk *chunk, void *data)
> +static void secretmem_recycle_range(unsigned long start, unsigned long end)
> +{
> + gen_pool_free(secretmem_pool.pool, start, PMD_SIZE);
> +}
> +
> +static void secretmem_release_range(unsigned long start, unsigned long end)
>  {
> - unsigned long start = chunk->start_addr;
> - unsigned long end = chunk->end_addr;
>   unsigned long nr_pages, addr;
>  
>   nr_pages = (end - start + 1) / PAGE_SIZE;
> @@ -281,6 +316,18 @@ static void secretmem_cleanup_chunk(struct gen_pool 
> *pool,
>   put_page(virt_to_page(addr));
>  }
>  
> +static void secretmem_cleanup_chunk(struct gen_pool *pool,
> + struct gen_pool_chunk *chunk, void *data)
> +{
> + unsigned long start = chunk->start_addr;
> + unsigned long end = chunk->end_addr;
> +
> + if (secretmem_pool.pool)
> + secretmem_recycle_range(start, end);
> + else
> + secretmem_release_range(start, end);
> +}
> +
>  static void secretmem_cleanup_pool(struct secretmem_ctx *ctx)
>  {
>   struct gen_pool *pool = ctx->pool;
> @@ -320,14 +367,85 @@ static struct file_system_type secretmem_fs = {
>   .kill_sb= kill_anon_super,
>  };
>  
> +static int secretmem_reserved_mem_init(void)
> +{
> + struct gen_pool *pool;
> + struct page *page;
> + void *addr;
> + int err;
> +
> + if (!secretmem_pool.reserved)
> + return 0;
> +
> + pool = gen_pool_create(PMD_SHIFT, NUMA_NO_NODE);
> + if (!pool)
> + return -ENOMEM;
> +
> + err = gen_pool_add(pool, (unsigned long)secretmem_pool.reserved,
> +secretmem_pool.reserved_size, NUMA_NO_NODE);
> + if (err)
> + goto err_destroy_pool;
> +
> + for (addr = secretmem_pool.reserved;
> +  addr < secretmem_pool.reserved + secretmem_pool.reserved_size;
> +  addr += PAGE_SIZE) {
> + page = virt_to_page(addr);
> + __ClearPageReserved(page);
> + set_page_count(page, 1);
> + }
> +
> + secretmem_pool.pool = pool;
> +   

Re: [PATCH v4 0/6] mm: introduce memfd_secret system call to create "secret" memory areas

2020-08-19 Thread David Hildenbrand
On 19.08.20 13:42, Mike Rapoport wrote:
> On Wed, Aug 19, 2020 at 12:47:54PM +0200, David Hildenbrand wrote:
>> On 18.08.20 16:15, Mike Rapoport wrote:
>>> From: Mike Rapoport 
>>>
>>> Hi,
>>>
>>> This is an implementation of "secret" mappings backed by a file descriptor. 
>>>
>>> v4 changes:
>>> * rebase on v5.9-rc1
>>> * Do not redefine PMD_PAGE_ORDER in fs/dax.c, thanks Kirill
>>> * Make secret mappings exclusive by default and only require flags to
>>>   memfd_secret() system call for uncached mappings, thanks again Kirill :)
>>>
>>> v3 changes:
>>> * Squash kernel-parameters.txt update into the commit that added the
>>>   command line option.
>>> * Make uncached mode explicitly selectable by architectures. For now enable
>>>   it only on x86.
>>>
>>> v2 changes:
>>> * Follow Michael's suggestion and name the new system call 'memfd_secret'
>>> * Add kernel-parameters documentation about the boot option
>>> * Fix i386-tinyconfig regression reported by the kbuild bot.
>>>   CONFIG_SECRETMEM now depends on !EMBEDDED to disable it on small systems
>>>   from one side and still make it available unconditionally on
>>>   architectures that support SET_DIRECT_MAP.
>>>
>>>
>>> The file descriptor backing secret memory mappings is created using a
>>> dedicated memfd_secret system call. The desired protection mode for the
>>> memory is configured using the flags parameter of the system call. The mmap()
>>> of the file descriptor created with memfd_secret() will create a "secret"
>>> memory mapping. The pages in that mapping will be marked as not present in
>>> the direct map and will have desired protection bits set in the user page
>>> table. For instance, current implementation allows uncached mappings.
>>>
>>> Although normally Linux userspace mappings are protected from other users, 
>>> such secret mappings are useful for environments where a hostile tenant is
>>> trying to trick the kernel into giving them access to other tenants'
>>> mappings.
>>>
>>> Additionally, the secret mappings may be used as a mean to protect guest
>>> memory in a virtual machine host.
>>>
>>
>> Just a general question. I assume such pages (where the direct mapping
>> was changed) cannot get migrated - I can spot a simple alloc_page(). So
>> essentially a process can just allocate a whole bunch of memory that is
>> unmovable, correct? Is there any limit? Is it properly accounted towards
>> the process (memctl) ?
> 
> The memory is accounted in the same way as with mlock(), so a normal
> user won't be able to allocate more than RLIMIT_MEMLOCK.

Okay, thanks. AFAIU the difference to mlock() is that the pages here are
not movable, fragment memory, and limit compaction. Hm.

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 6/6] mm: secretmem: add ability to reserve memory at boot

2020-08-19 Thread David Hildenbrand
On 19.08.20 13:53, Mike Rapoport wrote:
> On Wed, Aug 19, 2020 at 12:49:05PM +0200, David Hildenbrand wrote:
>> On 18.08.20 16:15, Mike Rapoport wrote:
>>> From: Mike Rapoport 
>>>
>>> Taking pages out from the direct map and bringing them back may create
>>> undesired fragmentation and usage of the smaller pages in the direct
>>> mapping of the physical memory.
>>>
>>> This can be avoided if a significantly large area of the physical memory
>>> would be reserved for secretmem purposes at boot time.
>>>
>>> Add ability to reserve physical memory for secretmem at boot time using
>>> "secretmem" kernel parameter and then use that reserved memory as a global
>>> pool for secret memory needs.
>>
>> Wouldn't something like CMA be the better fit? Just wondering. Then, the
>> memory can actually be reused for something else while not needed.
> 
> The memory allocated as secret is removed from the direct map and the
> boot time reservation is intended to reduce direct map fragmentatioan
> and to avoid splitting 1G pages there. So with CMA I'd still need to
> allocate 1G chunks for this and once 1G page is dropped from the direct
> map it still cannot be reused for anything else until it is freed.
> 
> I could use CMA to do the boot time reservation, but doing the
> reservesion directly seemed simpler and more explicit to me.

Well, using CMA would give you the possibility to let the memory be used
for other purposes until you decide it's the right time to take it +
remove the direct mapping etc.

I wonder if a sane approach would be to require allocating a pool during
boot and to only ever take pages from that pool. That would avoid spilling
many unmovable pages all over the place, locally limiting them to your area
here.

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 02/11] mm/memory_hotplug: enforce section granularity when onlining/offlining

2020-08-19 Thread David Hildenbrand
On 19.08.20 14:37, Michal Hocko wrote:
> On Wed 19-08-20 12:11:48, David Hildenbrand wrote:
>> Already two people (including me) tried to offline subsections, because
>> the function looks like it can deal with it. But we really can only
>> online/offline full sections (e.g., we can only mark full sections
>> online/offline via SECTION_IS_ONLINE).
>>
>> Add a simple safety net to document the restriction now. Current users
>> (core and powernv/memtrace) respect these restrictions.
> 
> I do agree with the warning because it clarifies our expectations
> indeed. Se below for more questions.
> 
>> Cc: Andrew Morton 
>> Cc: Michal Hocko 
>> Cc: Wei Yang 
>> Cc: Baoquan He 
>> Cc: Pankaj Gupta 
>> Cc: Oscar Salvador 
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index c781d386d87f9..6856702af68d9 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -801,6 +801,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
>> nr_pages,
>>  int ret;
>>  struct memory_notify arg;
>>  
>> +/* We can only online full sections (e.g., SECTION_IS_ONLINE) */
>> +if (WARN_ON_ONCE(!nr_pages ||
>> + !IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
>> +return -EINVAL;
> 
> This looks unnecessarily cryptic to me. Do you want to catch a full
> section operation that doesn't start at the usual section boundary? If
> yes, the above doesn't work unless I am missing something.
> 
> Why don't you simply WARN_ON_ONCE(nr_pages % PAGES_PER_SECTION).
> !nr_pages doesn't sound like something interesting to care about or why
> do we care?
> 

Also the start pfn has to be section aligned, so we always cover fully
aligned sections (e.g., not two partial ones).

It's essentially a compressed version of

!nr_pages || !IS_ALIGNED(pfn, PAGES_PER_SECTION) || !IS_ALIGNED(nr_pages,
PAGES_PER_SECTION)

which is the same as

!nr_pages || pfn % PAGES_PER_SECTION || nr_pages % PAGES_PER_SECTION

or

!nr_pages || (pfn | nr_pages) % PAGES_PER_SECTION

I consider IS_ALIGNED easier to read than % PAGES_PER_SECTION. I can
certainly un-compress, whatever you prefer, thanks.
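
If it helps, the equivalence is easy to check in isolation with a tiny
standalone program. The PAGES_PER_SECTION value below is just an example
(the x86-64 one); the identity only relies on it being a power of two.

  /* Quick standalone check: for a power-of-two section size,
   * (pfn | nr_pages) is section-aligned iff both pfn and nr_pages are. */
  #include <assert.h>
  #include <stdbool.h>

  #define PAGES_PER_SECTION	(1UL << 15)	/* 128 MiB sections, 4 KiB pages */
  #define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

  static bool combined(unsigned long pfn, unsigned long nr_pages)
  {
  	return nr_pages && IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION);
  }

  static bool expanded(unsigned long pfn, unsigned long nr_pages)
  {
  	return nr_pages && IS_ALIGNED(pfn, PAGES_PER_SECTION) &&
  	       IS_ALIGNED(nr_pages, PAGES_PER_SECTION);
  }

  int main(void)
  {
  	unsigned long pfn, nr;

  	/* Compare both forms over multiples of a quarter section to cover
  	 * aligned and unaligned combinations. */
  	for (pfn = 0; pfn < 8 * PAGES_PER_SECTION; pfn += PAGES_PER_SECTION / 4)
  		for (nr = 0; nr < 8 * PAGES_PER_SECTION; nr += PAGES_PER_SECTION / 4)
  			assert(combined(pfn, nr) == expanded(pfn, nr));
  	return 0;
  }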

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 04/11] mm/memory_hotplug: simplify offlining of pages in offline_pages()

2020-08-19 Thread David Hildenbrand
On 19.08.20 14:40, Michal Hocko wrote:
> On Wed 19-08-20 12:11:50, David Hildenbrand wrote:
>> We make sure that we cannot have any memory holes right at the beginning
>> of offline_pages(). We no longer need walk_system_ram_range() and can
>> call __offline_isolated_pages() directly.
>>
>> offlined_pages always corresponds to nr_pages, so we can simplify that.
> 
> I would probably fold this into the previous patch as they are dealing
> with the same thing.

Will do, shouldn't harm readability. Thanks!

-- 
Thanks,

David / dhildenb



Re: [PATCH v1 11/11] mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory

2020-08-19 Thread David Hildenbrand
On 19.08.20 15:16, Michal Hocko wrote:
> On Wed 19-08-20 12:11:57, David Hildenbrand wrote:
>> Currently, it can happen that pages are allocated (and freed) via the buddy
>> before we finished basic memory onlining.
>>
>> For example, pages are exposed to the buddy and can be allocated before
>> we actually mark the sections online. Allocated pages could suddenly
>> fail pfn_to_online_page() checks. We had similar issues with pcp
>> handling, when pages are allocated+freed before we reach
>> zone_pcp_update() in online_pages() [1].
>>
>> Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
>> impossible. Once done with the heavy lifting, use
>> undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
>> freelist, marking them ready for allocation. Similar to offline_pages(),
>> we have to manually adjust zone->nr_isolate_pageblock.
>>
>> [1] 
>> https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-chara...@codeaurora.org
>>
>> Cc: Andrew Morton 
>> Cc: Michal Hocko 
>> Cc: Wei Yang 
>> Cc: Baoquan He 
>> Cc: Pankaj Gupta 
>> Cc: Oscar Salvador 
>> Cc: Charan Teja Reddy 
>> Signed-off-by: David Hildenbrand 
> 
> Acked-by: Michal Hocko 
> 
> Yes this looks very sensible and we should have done that from the
> beginning. I just have one minor comment below
>> @@ -816,6 +816,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
>> nr_pages,
>>  if (ret)
>>  goto failed_addition;
>>  
>> +/*
>> + * Fixup the number of isolated pageblocks before marking the sections
>> + * onlining, such that undo_isolate_page_range() works correctly.
>> + */
>> +spin_lock_irqsave(&zone->lock, flags);
>> +zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
>> +spin_unlock_irqrestore(&zone->lock, flags);
>> +
> 
> I am not entirely happy about this. I am wondering whether it would make
> more sense to keep the counter in sync already in memmap_init_zone. Sure
> we add a branch to the boot time initialization - and it always fails
> there - but the code would be cleaner and we wouldn't have to do tricks
> like this in caller(s).

I had that in mind initially. The issue is that we would have to fix up the
counter in case onlining fails, which I consider even more ugly. Also:

1. It's the complete reverse of the offlining path now.
2. Pageblock flags are essentially stale unless the section is online; my
approach moves the handling to the point where nothing else will go wrong
and we are just about to mark the sections online. That looks a little
cleaner to me.

Unless there are strong opinions, I'd prefer to keep it like this.

Thanks for the very fast review Michal!

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 6/6] mm: secretmem: add ability to reserve memory at boot

2020-08-19 Thread David Hildenbrand
On 19.08.20 19:33, Mike Rapoport wrote:
> On Wed, Aug 19, 2020 at 02:10:43PM +0200, David Hildenbrand wrote:
>> On 19.08.20 13:53, Mike Rapoport wrote:
>>> On Wed, Aug 19, 2020 at 12:49:05PM +0200, David Hildenbrand wrote:
>>>> On 18.08.20 16:15, Mike Rapoport wrote:
>>>>> From: Mike Rapoport 
>>>>>
>>>>> Taking pages out from the direct map and bringing them back may create
>>>>> undesired fragmentation and usage of the smaller pages in the direct
>>>>> mapping of the physical memory.
>>>>>
>>>>> This can be avoided if a significantly large area of the physical memory
>>>>> would be reserved for secretmem purposes at boot time.
>>>>>
>>>>> Add ability to reserve physical memory for secretmem at boot time using
>>>>> "secretmem" kernel parameter and then use that reserved memory as a global
>>>>> pool for secret memory needs.
>>>>
>>>> Wouldn't something like CMA be the better fit? Just wondering. Then, the
>>>> memory can actually be reused for something else while not needed.
>>>
>>> The memory allocated as secret is removed from the direct map and the
>>> boot time reservation is intended to reduce direct map fragmentation
>>> and to avoid splitting 1G pages there. So with CMA I'd still need to
>>> allocate 1G chunks for this and once 1G page is dropped from the direct
>>> map it still cannot be reused for anything else until it is freed.
>>>
>>> I could use CMA to do the boot time reservation, but doing the
>>> reservation directly seemed simpler and more explicit to me.
>>
>> Well, using CMA would give you the possibility to let the memory be used
>> for other purposes until you decide it's the right time to take it +
>> remove the direct mapping etc.
> 
> I still can't say I follow you here. If I reserve a CMA area as a pool
> for secret memory 1G pages, it is still reserved and it still cannot be
> used for other purposes, right?

So, AFAIK, if you create a CMA pool it can be used for any MOVABLE
allocations (similar to ZONE_MOVABLE) until you actually allocate CMA
memory from that region. Other allocations in that area will then be
migrated away (using alloc_contig_range()).

For example, if you have a 1 GiB CMA area, you could allocate 4 MB pages
from that CMA area on demand (removing the direct mapping, etc.), and
free them when no longer needed (re-instantiating the direct mapping). The
free memory in that area could be used for MOVABLE allocations.
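
Just to illustrate what I mean, a rough and completely untested sketch:
the function names on the secretmem side (secretmem_cma_reserve() etc.)
are made up, only the cma_*() calls are the actual API as far as I
remember it, and the direct map handling is left as placeholder comments.

#include <linux/cma.h>
#include <linux/sizes.h>
#include <linux/mm.h>

static struct cma *secretmem_cma;

static int __init secretmem_cma_reserve(void)
{
        /*
         * Example: a 1 GiB pool; has to run early, like other CMA
         * reservations. Free pages in the area remain usable for
         * MOVABLE allocations until cma_alloc() grabs them.
         */
        return cma_declare_contiguous(0, SZ_1G, 0, 0, 0, false,
                                      "secretmem", &secretmem_cma);
}

static struct page *secretmem_alloc_chunk(unsigned long nr_pages)
{
        /* Movable allocations in the selected range get migrated away. */
        struct page *page = cma_alloc(secretmem_cma, nr_pages, 0, false);

        /* ... remove the chunk from the direct map here ... */
        return page;
}

static void secretmem_free_chunk(struct page *page, unsigned long nr_pages)
{
        /* ... re-instantiate the direct map here ... */
        cma_release(secretmem_cma, page, nr_pages);
}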

Please let me know if I am missing something important.

-- 
Thanks,

David / dhildenb



[PATCH v2 02/10] mm/memory_hotplug: enforce section granularity when onlining/offlining

2020-08-19 Thread David Hildenbrand
Already two people (including me) tried to offline subsections, because
the function looks like it can deal with it. But we really can only
online/offline full sections that are properly aligned (e.g., we can only
mark full sections online/offline via SECTION_IS_ONLINE).

Add a simple safety net to document the restriction now. Current users
(core and powernv/memtrace) respect these restrictions.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c781d386d87f9..6856702af68d9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -801,6 +801,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
int ret;
struct memory_notify arg;
 
+   /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+   if (WARN_ON_ONCE(!nr_pages ||
+!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+   return -EINVAL;
+
mem_hotplug_begin();
 
/* associate pfn range with the zone */
@@ -1483,6 +1488,11 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
struct memory_notify arg;
char *reason;
 
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+   if (WARN_ON_ONCE(!nr_pages ||
+!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+   return -EINVAL;
+
mem_hotplug_begin();
 
/*
-- 
2.26.2



[PATCH v2 01/10] mm/memory_hotplug: inline __offline_pages() into offline_pages()

2020-08-19 Thread David Hildenbrand
There is only a single user, offline_pages(). Let's inline, to make
it look more similar to online_pages().

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7f62d69660e06..c781d386d87f9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1473,11 +1473,10 @@ static int count_system_ram_pages_cb(unsigned long 
start_pfn,
return 0;
 }
 
-static int __ref __offline_pages(unsigned long start_pfn,
- unsigned long end_pfn)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
-   unsigned long pfn, nr_pages = 0;
-   unsigned long offlined_pages = 0;
+   const unsigned long end_pfn = start_pfn + nr_pages;
+   unsigned long pfn, system_ram_pages = 0, offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
@@ -1494,9 +1493,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
 * memory holes PG_reserved, don't need pfn_valid() checks, and can
 * avoid using walk_system_ram_range() later.
 */
-   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+   walk_system_ram_range(start_pfn, nr_pages, &system_ram_pages,
  count_system_ram_pages_cb);
-   if (nr_pages != end_pfn - start_pfn) {
+   if (system_ram_pages != nr_pages) {
ret = -EINVAL;
reason = "memory holes";
goto failed_removal;
@@ -1631,11 +1630,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
return ret;
 }
 
-int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
-{
-   return __offline_pages(start_pfn, start_pfn + nr_pages);
-}
-
 static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 {
int ret = !is_memblock_offlined(mem);
-- 
2.26.2



[PATCH v2 00/10] mm/memory_hotplug: online_pages()/offline_pages() cleanups

2020-08-19 Thread David Hildenbrand
These are a bunch of cleanups for online_pages()/offline_pages() and
related code, mostly getting rid of memory hole handling that is no longer
necessary. There is only a single walk_system_ram_range() call left in
offline_pages(), to make sure we don't have any memory holes. I had some
of these patches lying around for a longer time but didn't have time to
polish them.

In addition, the last patch marks all pageblocks of the memory to be onlined
MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
get allocated before onlining is complete. Once heavy lifting is done,
the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
possible.

I played with DIMMs and virtio-mem on x86-64 and didn't spot any surprises.
I verified that the number of isolated pageblocks is correctly handled when
onlining/offlining.

v1 -> v2:
- "mm/memory_hotplug: enforce section granularity when onlining/offlining"
-- Extended the patch description regarding alignment requirements
-- Fixed a typo
- Squashed "mm/memory_hotplug: simplify offlining of pages in
  offline_pages()" and "mm/memory_hotplug: simplify checking if all pages are
  isolated in offline_pages()", resulting in "mm/memory_hotplug: simplify page
  offlining"
- Added ACKs

David Hildenbrand (10):
  mm/memory_hotplug: inline __offline_pages() into offline_pages()
  mm/memory_hotplug: enforce section granularity when onlining/offlining
  mm/memory_hotplug: simplify page offlining
  mm/page_alloc: simplify __offline_isolated_pages()
  mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages()
  mm/page_isolation: simplify return value of start_isolate_page_range()
  mm/memory_hotplug: simplify page onlining
  mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()
  mm: pass migratetype into memmap_init_zone() and
move_pfn_range_to_zone()
  mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining
memory

 arch/ia64/mm/init.c|   4 +-
 include/linux/memory_hotplug.h |   7 +-
 include/linux/mm.h |   3 +-
 mm/Kconfig |   2 +-
 mm/memory_hotplug.c| 156 ++---
 mm/memremap.c  |   3 +-
 mm/page_alloc.c|  64 --
 mm/page_isolation.c|   7 +-
 8 files changed, 98 insertions(+), 148 deletions(-)

-- 
2.26.2



[PATCH v2 04/10] mm/page_alloc: simplify __offline_isolated_pages()

2020-08-19 Thread David Hildenbrand
offline_pages() is the only user. __offline_isolated_pages() never gets
called with ranges that contain memory holes and we no longer care about
the return value. Drop the return value handling and all pfn_valid()
checks.

Update the documentation.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 include/linux/memory_hotplug.h |  4 ++--
 mm/page_alloc.c| 27 ---
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd83..0b461691d1a49 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,8 +103,8 @@ extern int online_pages(unsigned long pfn, unsigned long 
nr_pages,
int online_type, int nid);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 unsigned long end_pfn);
-extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
-   unsigned long end_pfn);
+extern void __offline_isolated_pages(unsigned long start_pfn,
+unsigned long end_pfn);
 
 typedef void (*online_page_callback_t)(struct page *page, unsigned int order);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf0b25161feae..03f585f95dc60 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8692,35 +8692,21 @@ void zone_pcp_reset(struct zone *zone)
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
- * All pages in the range must be in a single zone and isolated
- * before calling this.
+ * All pages in the range must be in a single zone, must not contain holes,
+ * must span full sections, and must be isolated before calling this function.
  */
-unsigned long
-__offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
+void __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
+   unsigned long pfn = start_pfn;
struct page *page;
struct zone *zone;
unsigned int order;
-   unsigned long pfn;
unsigned long flags;
-   unsigned long offlined_pages = 0;
-
-   /* find the first valid pfn */
-   for (pfn = start_pfn; pfn < end_pfn; pfn++)
-   if (pfn_valid(pfn))
-   break;
-   if (pfn == end_pfn)
-   return offlined_pages;
 
offline_mem_sections(pfn, end_pfn);
zone = page_zone(pfn_to_page(pfn));
spin_lock_irqsave(&zone->lock, flags);
-   pfn = start_pfn;
while (pfn < end_pfn) {
-   if (!pfn_valid(pfn)) {
-   pfn++;
-   continue;
-   }
page = pfn_to_page(pfn);
/*
 * The HWPoisoned page may be not in buddy system, and
@@ -8728,7 +8714,6 @@ __offline_isolated_pages(unsigned long start_pfn, 
unsigned long end_pfn)
 */
if (unlikely(!PageBuddy(page) && PageHWPoison(page))) {
pfn++;
-   offlined_pages++;
continue;
}
/*
@@ -8739,20 +8724,16 @@ __offline_isolated_pages(unsigned long start_pfn, 
unsigned long end_pfn)
BUG_ON(page_count(page));
BUG_ON(PageBuddy(page));
pfn++;
-   offlined_pages++;
continue;
}
 
BUG_ON(page_count(page));
BUG_ON(!PageBuddy(page));
order = page_order(page);
-   offlined_pages += 1 << order;
del_page_from_free_list(page, zone, order);
pfn += (1 << order);
}
spin_unlock_irqrestore(&zone->lock, flags);
-
-   return offlined_pages;
 }
 #endif
 
-- 
2.26.2



[PATCH v2 05/10] mm/memory_hotplug: drop nr_isolate_pageblock in offline_pages()

2020-08-19 Thread David Hildenbrand
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(), and we only support onlining/offlining of full sections.
Both sections and pageblocks are a power of two in size, and sections
always span full pageblocks.

We can directly calculate the number of isolated pageblocks from nr_pages.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 50aa5df696e9d..098361fcb4504 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1460,10 +1460,10 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn, system_ram_pages = 0;
-   int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
struct memory_notify arg;
+   int ret, node;
char *reason;
 
/* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
@@ -1507,7 +1507,6 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
reason = "failure to isolate range";
goto failed_removal;
}
-   nr_isolate_pageblock = ret;
 
arg.start_pfn = start_pfn;
arg.nr_pages = nr_pages;
@@ -1569,7 +1568,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
 * pageblocks zone counter here.
 */
spin_lock_irqsave(&zone->lock, flags);
-   zone->nr_isolate_pageblock -= nr_isolate_pageblock;
+   zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);
 
/* removal success */
-- 
2.26.2



[PATCH v2 06/10] mm/page_isolation: simplify return value of start_isolate_page_range()

2020-08-19 Thread David Hildenbrand
Callers no longer need the number of isolated pageblocks. Let's
simplify.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 2 +-
 mm/page_alloc.c | 2 +-
 mm/page_isolation.c | 7 ++-
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 098361fcb4504..0011a1115381c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1503,7 +1503,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
ret = start_isolate_page_range(start_pfn, end_pfn,
   MIGRATE_MOVABLE,
   MEMORY_OFFLINE | REPORT_FAILURE);
-   if (ret < 0) {
+   if (ret) {
reason = "failure to isolate range";
goto failed_removal;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 03f585f95dc60..848664352dfe2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8456,7 +8456,7 @@ int alloc_contig_range(unsigned long start, unsigned long 
end,
 
ret = start_isolate_page_range(pfn_max_align_down(start),
   pfn_max_align_up(end), migratetype, 0);
-   if (ret < 0)
+   if (ret)
return ret;
 
/*
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 242c03121d731..b066c860e606e 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,8 +170,7 @@ __first_valid_page(unsigned long pfn, unsigned long 
nr_pages)
  * pageblocks we may have modified and return -EBUSY to caller. This
  * prevents two threads from simultaneously working on overlapping ranges.
  *
- * Return: the number of isolated pageblocks on success and -EBUSY if any part
- * of range cannot be isolated.
+ * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 unsigned migratetype, int flags)
@@ -179,7 +178,6 @@ int start_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
unsigned long pfn;
unsigned long undo_pfn;
struct page *page;
-   int nr_isolate_pageblock = 0;
 
BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
BUG_ON(!IS_ALIGNED(end_pfn, pageblock_nr_pages));
@@ -193,10 +191,9 @@ int start_isolate_page_range(unsigned long start_pfn, 
unsigned long end_pfn,
undo_pfn = pfn;
goto undo;
}
-   nr_isolate_pageblock++;
}
}
-   return nr_isolate_pageblock;
+   return 0;
 undo:
for (pfn = start_pfn;
 pfn < undo_pfn;
-- 
2.26.2



[PATCH v2 03/10] mm/memory_hotplug: simplify page offlining

2020-08-19 Thread David Hildenbrand
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(). We no longer need walk_system_ram_range() and can
call test_pages_isolated() and __offline_isolated_pages() directly.

offlined_pages always corresponds to nr_pages, so we can simplify that.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 46 ++---
 1 file changed, 10 insertions(+), 36 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6856702af68d9..50aa5df696e9d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1373,28 +1373,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
end_pfn)
return ret;
 }
 
-/* Mark all sections offline and remove all free pages from the buddy. */
-static int
-offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
-   void *data)
-{
-   unsigned long *offlined_pages = (unsigned long *)data;
-
-   *offlined_pages += __offline_isolated_pages(start, start + nr_pages);
-   return 0;
-}
-
-/*
- * Check all pages in range, recorded as memory resource, are isolated.
- */
-static int
-check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
-   void *data)
-{
-   return test_pages_isolated(start_pfn, start_pfn + nr_pages,
-  MEMORY_OFFLINE);
-}
-
 static int __init cmdline_parse_movable_node(char *p)
 {
movable_node_enabled = true;
@@ -1481,7 +1459,7 @@ static int count_system_ram_pages_cb(unsigned long 
start_pfn,
 int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
-   unsigned long pfn, system_ram_pages = 0, offlined_pages = 0;
+   unsigned long pfn, system_ram_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
struct zone *zone;
@@ -1579,16 +1557,12 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
reason = "failure to dissolve huge pages";
goto failed_removal_isolated;
}
-   /* check again */
-   ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
-   NULL, check_pages_isolated_cb);
-   } while (ret);
-
-   /* Ok, all of our target is isolated.
-  We cannot do rollback at this point. */
-   walk_system_ram_range(start_pfn, end_pfn - start_pfn,
- &offlined_pages, offline_isolated_pages_cb);
-   pr_info("Offlined Pages %ld\n", offlined_pages);
+   } while (test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE));
+
+   /* Mark all sections offline and remove free pages from the buddy. */
+   __offline_isolated_pages(start_pfn, end_pfn);
+   pr_info("Offlined Pages %ld\n", nr_pages);
+
/*
 * Onlining will reset pagetype flags and makes migrate type
 * MOVABLE, so just need to decrease the number of isolated
@@ -1599,11 +1573,11 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
spin_unlock_irqrestore(&zone->lock, flags);
 
/* removal success */
-   adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
-   zone->present_pages -= offlined_pages;
+   adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+   zone->present_pages -= nr_pages;
 
pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages -= offlined_pages;
+   zone->zone_pgdat->node_present_pages -= nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
init_per_zone_wmark_min();
-- 
2.26.2



[PATCH v2 07/10] mm/memory_hotplug: simplify page onlining

2020-08-19 Thread David Hildenbrand
We don't allow offlining memory with holes, all boot memory is online,
and hotplugged memory cannot have holes.

We can now simplify onlining of pages. As we only allow onlining/offlining
of full sections, and sections always span whole MAX_ORDER_NR_PAGES chunks,
we can just process the range in MAX_ORDER - 1 chunks (MAX_ORDER_NR_PAGES
pages each) without further special handling.

The number of onlined pages simply corresponds to the number of pages we
were requested to online.

While at it, refine the comment regarding the callback not exposing all
pages to the buddy.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 38 ++
 1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0011a1115381c..3aba0d956f9b1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -617,31 +617,22 @@ void generic_online_page(struct page *page, unsigned int 
order)
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
-static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
-   void *arg)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
 {
const unsigned long end_pfn = start_pfn + nr_pages;
unsigned long pfn;
-   int order;
 
/*
-* Online the pages. The callback might decide to keep some pages
-* PG_reserved (to add them to the buddy later), but we still account
-* them as being online/belonging to this zone ("present").
+* Online the pages in MAX_ORDER - 1 aligned chunks. The callback might
+* decide to not expose all pages to the buddy (e.g., expose them
+* later). We account all pages as being online and belonging to this
+* zone ("present").
 */
-   for (pfn = start_pfn; pfn < end_pfn; pfn += 1ul << order) {
-   order = min(MAX_ORDER - 1, get_order(PFN_PHYS(end_pfn - pfn)));
-   /* __free_pages_core() wants pfns to be aligned to the order */
-   if (WARN_ON_ONCE(!IS_ALIGNED(pfn, 1ul << order)))
-   order = 0;
-   (*online_page_callback)(pfn_to_page(pfn), order);
-   }
+   for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
+   (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
 
/* mark all involved sections as online */
online_mem_sections(start_pfn, end_pfn);
-
-   *(unsigned long *)arg += nr_pages;
-   return 0;
 }
 
 /* check which state of node_states will be changed when online memory */
@@ -795,7 +786,6 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
   int online_type, int nid)
 {
unsigned long flags;
-   unsigned long onlined_pages = 0;
struct zone *zone;
int need_zonelists_rebuild = 0;
int ret;
@@ -831,19 +821,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
setup_zone_pageset(zone);
}
 
-   ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
-   online_pages_range);
-   if (ret) {
-   /* not a single memory resource was applicable */
-   if (need_zonelists_rebuild)
-   zone_pcp_reset(zone);
-   goto failed_addition;
-   }
-
-   zone->present_pages += onlined_pages;
+   online_pages_range(pfn, nr_pages);
+   zone->present_pages += nr_pages;
 
pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages += onlined_pages;
+   zone->zone_pgdat->node_present_pages += nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
/*
-- 
2.26.2



[PATCH v2 09/10] mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()

2020-08-19 Thread David Hildenbrand
On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Logan Gunthorpe 
Cc: Dan Williams 
Cc: Mike Rapoport 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michel Lespinasse 
Cc: linux-i...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 arch/ia64/mm/init.c|  4 ++--
 include/linux/memory_hotplug.h |  3 ++-
 include/linux/mm.h |  3 ++-
 mm/memory_hotplug.c| 11 ---
 mm/memremap.c  |  3 ++-
 mm/page_alloc.c| 21 -
 6 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 0b3fb4c7af292..82b7a46ddd23d 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -538,7 +538,7 @@ virtual_memmap_init(u64 start, u64 end, void *arg)
if (map_start < map_end)
memmap_init_zone((unsigned long)(map_end - map_start),
 args->nid, args->zone, page_to_pfn(map_start),
-MEMMAP_EARLY, NULL);
+MEMMAP_EARLY, NULL, MIGRATE_MOVABLE);
return 0;
 }
 
@@ -548,7 +548,7 @@ memmap_init (unsigned long size, int nid, unsigned long 
zone,
 {
if (!vmem_map) {
memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY,
-   NULL);
+NULL, MIGRATE_MOVABLE);
} else {
struct page *start;
struct memmap_init_callback_data args;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0b461691d1a49..cbafeda859380 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -346,7 +346,8 @@ extern int add_memory_resource(int nid, struct resource 
*resource);
 extern int add_memory_driver_managed(int nid, u64 start, u64 size,
 const char *resource_name);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-   unsigned long nr_pages, struct vmem_altmap *altmap);
+  unsigned long nr_pages,
+  struct vmem_altmap *altmap, int migratetype);
 extern void remove_pfn_range_from_zone(struct zone *zone,
   unsigned long start_pfn,
   unsigned long nr_pages);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ab941cf73f44..c842aa2a97ba2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2409,7 +2409,8 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
-   enum memmap_context, struct vmem_altmap *);
+   enum memmap_context, struct vmem_altmap *,
+   int migratetype);
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3aba0d956f9b1..1c16a5def781e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -693,9 +693,14 @@ static void __meminit resize_pgdat_range(struct 
pglist_data *pgdat, unsigned lon
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
  * call, all affected pages are PG_reserved.
+ *
+ * All aligned pageblocks are initialized to the specified migratetype
+ * (usually MIGRATE_MOVABLE). Besides setting the migratetype, no related
+ * zone stats (e.g., nr_isolate_pageblock) are touched.
  */
 void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
-   unsigned long nr_pages, struct vmem_altmap *altmap)
+ unsigned long nr_pages,
+ struct vmem_altmap *altmap, int migratetype)
 {
struct pglist_data *pgdat = zone->zone_pgdat;
int nid = pgdat->node_id;
@@ -720,7 +725,7 @@ void __ref move_pfn_range_to_zone(struct zone *zone, 
unsigned long start_pfn,
 * are reserved so nobody should be touching them so we should be safe
 */
memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn,
-   MEMMAP_HOTPLUG, altmap);
+MEMMAP_HOTPLUG, altmap, migratetype);
 
set_zone_contiguous(zone);
 }
@@ -800,7 +805,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 
/* associate pfn range with the zone */
zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);

[PATCH v2 08/10] mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()

2020-08-19 Thread David Hildenbrand
Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
set during struct page init") moved the actual zone range check, leaving
only the alignment check for pageblocks.

Let's drop the stale comment and make the pageblock check easier to read.

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Signed-off-by: David Hildenbrand 
---
 mm/page_alloc.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 848664352dfe2..5db0b35f95e20 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6022,13 +6022,8 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 * to reserve their blocks rather than leaking throughout
 * the address space during boot when many long-lived
 * kernel allocations are made.
-*
-* bitmap is created for zone's valid pfn range. but memmap
-* can be created for invalid pages (for alignment)
-* check here not to call set_pageblock_migratetype() against
-* pfn out of zone.
 */
-   if (!(pfn & (pageblock_nr_pages - 1))) {
+   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
@@ -6091,15 +6086,10 @@ void __ref memmap_init_zone_device(struct zone *zone,
 * the address space during boot when many long-lived
 * kernel allocations are made.
 *
-* bitmap is created for zone's valid pfn range. but memmap
-* can be created for invalid pages (for alignment)
-* check here not to call set_pageblock_migratetype() against
-* pfn out of zone.
-*
 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
 * because this is done early in section_activate()
 */
-   if (!(pfn & (pageblock_nr_pages - 1))) {
+   if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
-- 
2.26.2



[PATCH v2 10/10] mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory

2020-08-19 Thread David Hildenbrand
Currently, it can happen that pages are allocated (and freed) via the buddy
before we finished basic memory onlining.

For example, pages are exposed to the buddy and can be allocated before
we actually mark the sections online. Allocated pages could suddenly
fail pfn_to_online_page() checks. We had similar issues with pcp
handling, when pages are allocated+freed before we reach
zone_pcp_update() in online_pages() [1].

Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.

[1] 
https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-chara...@codeaurora.org

Acked-by: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Wei Yang 
Cc: Baoquan He 
Cc: Pankaj Gupta 
Cc: Oscar Salvador 
Cc: Charan Teja Reddy 
Signed-off-by: David Hildenbrand 
---
 mm/Kconfig  |  2 +-
 mm/memory_hotplug.c | 32 ++--
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f9..85a16ce1dbc49 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -152,6 +152,7 @@ config HAVE_BOOTMEM_INFO_NODE
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
+   select MEMORY_ISOLATION
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on 64BIT || BROKEN
@@ -178,7 +179,6 @@ config MEMORY_HOTPLUG_DEFAULT_ONLINE
 
 config MEMORY_HOTREMOVE
bool "Allow for memory hot remove"
-   select MEMORY_ISOLATION
select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64)
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1c16a5def781e..35d56cbd3e45b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -805,7 +805,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 
/* associate pfn range with the zone */
zone = zone_for_pfn_range(online_type, nid, pfn, nr_pages);
-   move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_MOVABLE);
+   move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
 
arg.start_pfn = pfn;
arg.nr_pages = nr_pages;
@@ -816,6 +816,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
if (ret)
goto failed_addition;
 
+   /*
+* Fixup the number of isolated pageblocks before marking the sections
+* online, such that undo_isolate_page_range() works correctly.
+*/
+   spin_lock_irqsave(&zone->lock, flags);
+   zone->nr_isolate_pageblock += nr_pages / pageblock_nr_pages;
+   spin_unlock_irqrestore(&zone->lock, flags);
+
/*
 * If this zone is not populated, then it is not in zonelist.
 * This means the page allocator ignores this zone.
@@ -833,21 +841,25 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
zone->zone_pgdat->node_present_pages += nr_pages;
pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
+   node_states_set_node(nid, &arg);
+   if (need_zonelists_rebuild)
+   build_all_zonelists(NULL);
+   zone_pcp_update(zone);
+
+   /* Basic onlining is complete, allow allocation of onlined pages. */
+   undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
+
/*
 * When exposing larger, physically contiguous memory areas to the
 * buddy, shuffling in the buddy (when freeing onlined pages, putting
 * them either to the head or the tail of the freelist) is only helpful
 * for maintaining the shuffle, but not for creating the initial
 * shuffle. Shuffle the whole zone to make sure the just onlined pages
-* are properly distributed across the whole freelist.
+* are properly distributed across the whole freelist. Make sure to
+* shuffle once pageblocks are no longer isolated.
 */
shuffle_zone(zone);
 
-   node_states_set_node(nid, &arg);
-   if (need_zonelists_rebuild)
-   build_all_zonelists(NULL);
-   zone_pcp_update(zone);
-
init_per_zone_wmark_min();
 
kswapd_run(nid);
@@ -1550,9 +1562,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned 
long nr_pages)
pr_info("Offlined Pages %ld\n", nr_pages);
 
/*
-* Onlining will reset pagetype flags and makes migrate type
-* MOVABLE, so just need to decrease the number of isolated
-* pageblocks zone counter here.
+* The memory sections are marked offline, and the pageblock flags
+  

Re: [PATCH 1/1] arm64: make section size configurable for memory hotplug

2021-01-11 Thread David Hildenbrand
On 11.01.21 05:17, Anshuman Khandual wrote:
> 
> 
> On 1/8/21 9:00 PM, David Hildenbrand wrote:
>>> To summarize, the section size bits for each base page size config
>>> should always
>>>
>>> a. Avoid (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>
>> Pageblocks must also always fall completely into a section.
>>
>>>
>>> b. Provide minimum possible section size for a given base page config to
>>>have increased agility during memory hotplug operations and reduced
>>>vmemmap wastage for sections with holes.
>>
>> OTOH, making the section size too small (e.g., 16MB) creates way too many
>> memory block devices in /sys/devices/system/memory/, and might consume
>> too many page->flags bits in the !vmemmap case.
>>
>> For bigger setups, we might, similar to x86-64 (e.g., >= 64 GiB),
>> determine the memory_block_size_bytes() at runtime (spanning multiple
>> sections then), once it becomes relevant.
>>
>>>
>>> c. Allow 4K base page configs to have PMD based vmemmap mappings
>>
>> Agreed.
>>
>>>
>>> Because CONFIG_FORCE_MAX_ZONEORDER is always defined on arm64 platform,
>>> the following would always avoid the condition (a)
>>>
>>> SECTION_SIZE_BITS (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
>>>
>>> - 22 (11 - 1 + 12) for 4K pages
>>> - 24 (11 - 1 + 14) for 16K pages without THP
>>> - 25 (12 - 1 + 14) for 16K pages with THP
>>> - 26 (11 - 1 + 16) for 64K pages without THP
>>> - 29 (14 - 1 + 16) for 64K pages with THP
>>>
>>> Apart from overriding 4K base page size config to have 27 as section size
>>> bits, should not all other values be okay here ? But then wondering what
>>> benefit 128MB (27 bits) section size for 16K config would have ? OR the
>>> objective here is to match 16K page size config with default x86-64.
>>
>> We don't want to have sections that are too small. We don't want to have
>> sections that are too big :)
>>
>> Not sure if we really want to allow setting e.g., a section size of 4
>> MB. That's just going to hurt. IMHO, something in the range of 64..256
>> MB is quite a good choice, where possible.
>>
>>>
>>>>
>>>> (If we worry about the number of section bits in page->flags, we could
>>>> glue it to vmemmap support where that does not matter)
>>>
>>> Could you please elaborate ? Could smaller section size bits numbers like
>>> 22 or 24 create problems in page->flags without changing other parameters
>>> like NR_CPUS or NODES_SHIFT ? A quick test with 64K base page without THP
>>
>> Yes, in the !vmemmap case, we have to store the section_nr in there.
>> IIRC, it's less of an issue with section sizes like 128 MB.
>>
>>> i.e 26 bits in section size, fails to boot.
>>
>> 26 bits would mean 64 MB, no? Not sure if that's possible even without
>> THP (MAX_ORDER - 1, pageblock_order ...) on 64k pages. I'd assume 512 MB
>> is the lowest we can go. I'd assume this would crash :)
>>
>>>
>>> As you have suggested, probably constant defaults (128MB for 4K/16K, 512MB
>>> for 64K) might be better than depending on the CONFIG_FORCE_MAX_ZONEORDER,
>>> at least for now.
>>
>> That's also what I would prefer, keeping it simple.
> 
> Okay sure, will send a RFC to begin with.
> 

Note that there is

https://lkml.kernel.org/r/15cf9a2359197fee0168f820c5c904650d07939e.1610146597.git.sudar...@codeaurora.org

(Sudarshan missed to cc linux-mm)

-- 
Thanks,

David / dhildenb



Re: [RFC 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2021-01-11 Thread David Hildenbrand
On 04.01.21 07:18, Anshuman Khandual wrote:
> 
> On 12/22/20 2:41 PM, David Hildenbrand wrote:
>> On 22.12.20 08:12, Anshuman Khandual wrote:
>>> pfn_valid() validates a pfn but basically it checks for a valid struct page
>>> backing for that pfn. It should always return positive for memory ranges
>>> backed with struct page mapping. But currently pfn_valid() fails for all
>>> ZONE_DEVICE based memory types even though they have struct page mapping.
>>>
>>> pfn_valid() asserts that there is a memblock entry for a given pfn without
>>> MEMBLOCK_NOMAP flag being set. The problem with ZONE_DEVICE based memory is
>>> that they do not have memblock entries. Hence memblock_is_map_memory() will
>>> invariably fail via memblock_search() for a ZONE_DEVICE based address. This
>>> eventually fails pfn_valid() which is wrong. memblock_is_map_memory() needs
>>> to be skipped for such memory ranges. As ZONE_DEVICE memory gets hotplugged
>>> into the system via memremap_pages() called from a driver, their respective
>>> memory sections will not have SECTION_IS_EARLY set.
>>>
>>> Normal hotplug memory will never have MEMBLOCK_NOMAP set in their memblock
>>> regions. Because the flag MEMBLOCK_NOMAP was specifically designed and set
>>> for firmware reserved memory regions. memblock_is_map_memory() can just be
>>> skipped as it's always going to be positive and that will be an optimization
>>> for the normal hotplug memory. Like ZONE_DEVICE based memory, all hotplugged
>>> normal memory too will not have SECTION_IS_EARLY set for their sections.
>>>
>>> Skipping memblock_is_map_memory() for all non early memory sections would
>>> fix pfn_valid() problem for ZONE_DEVICE based memory and also improve its
>>> performance for normal hotplug memory as well.
>>>
>>> Cc: Catalin Marinas 
>>> Cc: Will Deacon 
>>> Cc: Ard Biesheuvel 
>>> Cc: Robin Murphy 
>>> Cc: linux-arm-ker...@lists.infradead.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
>>> Signed-off-by: Anshuman Khandual 
>>> ---
>>>  arch/arm64/mm/init.c | 12 
>>>  1 file changed, 12 insertions(+)
>>>
>>> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>>> index 75addb36354a..ee23bda00c28 100644
>>> --- a/arch/arm64/mm/init.c
>>> +++ b/arch/arm64/mm/init.c
>>> @@ -225,6 +225,18 @@ int pfn_valid(unsigned long pfn)
>>>  
>>> if (!valid_section(__pfn_to_section(pfn)))
>>> return 0;
>>> +
>>> +   /*
>>> +* ZONE_DEVICE memory does not have the memblock entries.
>>> +* memblock_is_map_memory() check for ZONE_DEVICE based
>>> +* addresses will always fail. Even the normal hotplugged
>>> +* memory will never have MEMBLOCK_NOMAP flag set in their
>>> +* memblock entries. Skip memblock search for all non early
>>> +* memory sections covering all of hotplug memory including
>>> +* both normal and ZONE_DEVICE based.
>>> +*/
>>> +   if (!early_section(__pfn_to_section(pfn)))
>>> +   return 1;
>>
>> Actually, I think we want to check for partially present sections.
>>
>> Maybe we can rather switch to generic pfn_valid() and tweak it to
>> something like
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index fb3bf696c05e..7b1fcce5bd5a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1382,9 +1382,13 @@ static inline int pfn_valid(unsigned long pfn)
>> return 0;
>> /*
>>  * Traditionally early sections always returned pfn_valid() for
>> -* the entire section-sized span.
>> +* the entire section-sized span. Some archs might have holes in
>> +* early sections, so double check with memblock if configured.
>>  */
>> -   return early_section(ms) || pfn_section_valid(ms, pfn);
>> +   if (early_section(ms))
>> +   return IS_ENABLED(CONFIG_EARLY_SECTION_MEMMAP_HOLES) ?
>> +  memblock_is_map_memory(pfn << PAGE_SHIFT) : 1;
>> +   return pfn_section_valid(ms, pfn);
>>  }
>>  #endif
> 
> Could not find CONFIG_EARLY_SECTION_MEMMAP_HOLES. Are you suggesting creating
> this config to track platform scenarios, such as arm64, where early
> sections might not have full memmap coverage?

Yes, a new config that states what

Re: [PATCH V2 3/3] s390/mm: Define arch_get_mappable_range()

2021-01-11 Thread David Hildenbrand
On 17.12.20 16:28, Anshuman Khandual wrote:
> This overrides arch_get_mappable_range() on the s390 platform, which will be
> used with recently added generic framework. It modifies the existing range
> check in vmem_add_mapping() using arch_get_mappable_range(). It also adds a
> VM_BUG_ON() check that would ensure that memhp_range_allowed() has already
> been called on the hotplug path.
> 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: David Hildenbrand 
> Cc: linux-s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Heiko Carstens 
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/s390/mm/init.c |  1 +
>  arch/s390/mm/vmem.c | 15 ++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 77767850d0d0..e0e78234ae57 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -291,6 +291,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
>   return -EINVAL;
>  
> + VM_BUG_ON(!memhp_range_allowed(start, size, 1));

s/1/true/

>   rc = vmem_add_mapping(start, size);
>   if (rc)
>   return rc;
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index b239f2ba93b0..e10e563ad2b4 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -4,6 +4,7 @@
>   *Author(s): Heiko Carstens 
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -532,11 +533,23 @@ void vmem_remove_mapping(unsigned long start, unsigned 
> long size)
>   mutex_unlock(&vmem_mutex);
>  }
>  
> +struct range arch_get_mappable_range(void)
> +{
> + struct range memhp_range;
> +
> + memhp_range.start = 0;
> + memhp_range.end =  VMEM_MAX_PHYS;

s/  / /

IIRC, the range is inclusive? "VMEM_MAX_PHYS - 1" then, and adjust the
check below accordingly (see the untested sketch at the end of this mail).

> + return memhp_range;
> +}
> +
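
I.e., something like this untested sketch, only to illustrate the
inclusive end (otherwise mirroring your code above):

struct range arch_get_mappable_range(void)
{
        struct range memhp_range;

        memhp_range.start = 0;
        /* struct range is inclusive, so the last mappable byte goes here. */
        memhp_range.end = VMEM_MAX_PHYS - 1;
        return memhp_range;
}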


-- 
Thanks,

David / dhildenb



Re: [PATCH V2 1/3] mm/hotplug: Prevalidate the address range being added with platform

2021-01-11 Thread David Hildenbrand
On 17.12.20 16:28, Anshuman Khandual wrote:
> This introduces memhp_range_allowed() which can be called in various memory
> hotplug paths to prevalidate the address range which is being added, with
> the platform. Then memhp_range_allowed() calls memhp_get_pluggable_range()
> which provides applicable address range depending on whether linear mapping
> is required or not. For ranges that require linear mapping, it calls a new
> arch callback arch_get_mappable_range() which the platform can override. So
> the new callback, in turn provides the platform an opportunity to configure
> acceptable memory hotplug address ranges in case there are constraints.
> 
> This mechanism will help prevent platform specific errors deep down during
> hotplug calls. This drops now redundant check_hotplug_memory_addressable()
> check in __add_pages() but instead adds a VM_BUG_ON() check which would
> ensure that the range has been validated with memhp_range_allowed() earlier
> in the call chain. Besides, memhp_get_pluggable_range() can also be used by
> potential memory hotplug callers to obtain the allowed physical range that
> would be accepted on a given platform.
> 
> This does not really add any new range check in generic memory hotplug but
> instead compensates for lost checks in arch_add_memory() where applicable
> and check_hotplug_memory_addressable(), with unified memhp_range_allowed().
> 

Subject s/mm\/hotplug/mm\/memory_hotplug/

Everywhere in this patch: Use "true/false" for boolean values.

> Cc: David Hildenbrand 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Suggested-by: David Hildenbrand 
> Signed-off-by: Anshuman Khandual 
> ---
>  include/linux/memory_hotplug.h | 10 +
>  mm/memory_hotplug.c| 79 +-
>  mm/memremap.c  |  6 +++
>  3 files changed, 75 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 551093b74596..8d72354758c8 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -70,6 +70,9 @@ typedef int __bitwise mhp_t;
>   */
>  #define MEMHP_MERGE_RESOURCE ((__force mhp_t)BIT(0))
>  
> +bool memhp_range_allowed(u64 start, u64 size, bool need_mapping);
> +struct range memhp_get_pluggable_range(bool need_mapping);

AFAIKs, all memhp_get_pluggable_range() users pass "1".

What about the "add_pages()-only" path?

-- 
Thanks,

David / dhildenb



[PATCH RFC] virtio-mem: check against memhp_get_pluggable_range() which memory we can hotplug

2021-01-11 Thread David Hildenbrand
Right now, we only check against MAX_PHYSMEM_BITS - but it turns out there
are more restrictions on which memory we can actually hotplug, especially
on arm64 or s390x once we support them: we might receive something like
-E2BIG or -ERANGE from add_memory_driver_managed(), stopping device
operation.

So, check which memory we can add right when initializing the device,
warning the user. Try only adding actually pluggable ranges: in the worst
case, no memory provided by our device is pluggable.

In the usual case, we expect all device memory to be pluggable, and in
corner cases only some memory at the end of the device-managed memory
region to not be pluggable.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Pankaj Gupta 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Wei Yang 
Cc: Andrew Morton 
Cc: catalin.mari...@arm.com
Cc: teawater 
Cc: Anshuman Khandual 
Cc: Pankaj Gupta 
Cc: Jonathan Cameron 
Cc: h...@linux.ibm.com
Cc: Vasily Gorbik 
Cc: Will Deacon 
Cc: Ard Biesheuvel 
Cc: Mark Rutland 
Cc: Heiko Carstens 
Cc: Michal Hocko 
Signed-off-by: David Hildenbrand 
---

This is an example how virito-mem intends to use an interface like
memhp_get_pluggable_range() once around. See:

"[PATCH V2 0/3] mm/hotplug: Pre-validate the address range with platform"
https://lkml.kernel.org/r/1608218912-28932-1-git-send-email-anshuman.khand...@arm.com

@Anshuman, feel free to pick up and carry this patch. I'll retest the final
result / new versions of you series.

---
 drivers/virtio/virtio_mem.c | 40 +
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 9fc9ec4a25f5..1fe40b2d7b6d 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -,7 +,7 @@ static int virtio_mem_unplug_pending_mb(struct virtio_mem 
*vm)
  */
 static void virtio_mem_refresh_config(struct virtio_mem *vm)
 {
-   const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
+   const struct range pluggable_range = memhp_get_pluggable_range(true);
uint64_t new_plugged_size, usable_region_size, end_addr;
 
/* the plugged_size is just a reflection of what _we_ did previously */
@@ -2234,15 +2234,25 @@ static void virtio_mem_refresh_config(struct virtio_mem 
*vm)
/* calculate the last usable memory block id */
virtio_cread_le(vm->vdev, struct virtio_mem_config,
usable_region_size, &usable_region_size);
-   end_addr = vm->addr + usable_region_size;
-   end_addr = min(end_addr, phys_limit);
+   end_addr = min(vm->addr + usable_region_size - 1,
+  pluggable_range.end);
 
-   if (vm->in_sbm)
-   vm->sbm.last_usable_mb_id =
-virtio_mem_phys_to_mb_id(end_addr) - 1;
-   else
-   vm->bbm.last_usable_bb_id =
-virtio_mem_phys_to_bb_id(vm, end_addr) - 1;
+   if (vm->in_sbm) {
+   vm->sbm.last_usable_mb_id = virtio_mem_phys_to_mb_id(end_addr);
+   if (!IS_ALIGNED(end_addr + 1, memory_block_size_bytes()))
+   vm->sbm.last_usable_mb_id--;
+   } else {
+   vm->bbm.last_usable_bb_id = virtio_mem_phys_to_bb_id(vm,
+end_addr);
+   if (!IS_ALIGNED(end_addr + 1, vm->bbm.bb_size))
+   vm->bbm.last_usable_bb_id--;
+   }
+   /*
+* If we cannot plug any of our device memory (e.g., nothing in the
+* usable region is addressable), the last usable memory block id will
+* be smaller than the first usable memory block id. We'll stop
+* attempting to add memory with -ENOSPC from our main loop.
+*/
 
/* see if there is a request to change the size */
virtio_cread_le(vm->vdev, struct virtio_mem_config, requested_size,
@@ -2364,6 +2374,7 @@ static int virtio_mem_init_vq(struct virtio_mem *vm)
 
 static int virtio_mem_init(struct virtio_mem *vm)
 {
+   const struct range pluggable_range = memhp_get_pluggable_range(true);
const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS;
uint64_t sb_size, addr;
uint16_t node_id;
@@ -2405,9 +2416,10 @@ static int virtio_mem_init(struct virtio_mem *vm)
if (!IS_ALIGNED(vm->addr + vm->region_size, memory_block_size_bytes()))
dev_warn(&vm->vdev->dev,
 "The alignment of the physical end address can make 
some memory unusable.\n");
-   if (vm->addr + vm->region_size > phys_limit)
+   if (vm->addr < pluggable_range.start ||
+   vm->addr + vm->region_size - 1 > pluggable_range.end)
dev_warn(&vm->vdev->dev,
-"Some memory is not addres

Re: [PATCH] mm/compaction: Remove duplicated VM_BUG_ON_PAGE !PageLocked

2021-01-11 Thread David Hildenbrand
On 09.01.21 09:14, Miaohe Lin wrote:
> The VM_BUG_ON_PAGE(!PageLocked(page), page) is also done in PageMovable.
> Remove this explicit one.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/compaction.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 02af220fb992..6d316eb99913 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -137,7 +137,6 @@ EXPORT_SYMBOL(__SetPageMovable);
>  
>  void __ClearPageMovable(struct page *page)
>  {
> - VM_BUG_ON_PAGE(!PageLocked(page), page);
>   VM_BUG_ON_PAGE(!PageMovable(page), page);
>   /*
>* Clear registered address_space val with keeping PAGE_MAPPING_MOVABLE
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-01-11 Thread David Hildenbrand
On 17.12.20 14:07, Oscar Salvador wrote:
> In order to use self-hosted memmap array, the platform needs to have
> support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
> Currently, only arm64, PPC and x86_64 have the support, so enable those
> platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.

"In the first version, only  will support it".

I'd probably split off enabling it per architecture in separate patches
and the end of the series.

Apart from that looks good.

> 
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE will be checked later on to see whether
> we can enable this feature or not.
> 
> Signed-off-by: Oscar Salvador 


-- 
Thanks,

David / dhildenb



Re: [PATCH 3/5] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2021-01-11 Thread David Hildenbrand
On 17.12.20 14:07, Oscar Salvador wrote:
> Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
> checking mhp_supports_memmap_on_memory().
> MHP_MEMMAP_ON_MEMORY can only be set in case
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
> altmap, and the range to be added spans a single memory block.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  drivers/acpi/acpi_memhotplug.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index b02fd51e5589..8cc195c4c861 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct 
> acpi_memory_device *mem_device)
>   acpi_handle handle = mem_device->device->handle;
>   int result, num_enabled = 0;
>   struct acpi_memory_info *info;
> + mhp_t mhp_flags = MHP_NONE;
>   int node;
>  
>   node = acpi_get_node(handle);
> @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct 
> acpi_memory_device *mem_device)
>   if (node < 0)
>   node = memory_add_physaddr_to_nid(info->start_addr);
>  
> + if (mhp_supports_memmap_on_memory(info->length))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>   result = __add_memory(node, info->start_addr, info->length,
> -   MHP_NONE);
> +   mhp_flags);
>  
>   /*
>* If the memory block has been used by the kernel, add_memory()
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH 4/5] powerpc/memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2021-01-11 Thread David Hildenbrand
On 17.12.20 14:07, Oscar Salvador wrote:
> Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
> checking mhp_supports_memmap_on_memory().
> MHP_MEMMAP_ON_MEMORY can only be set in case
> ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
> altmap, and the range to be added spans a single memory block.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 7efe6ec5d14a..a7f68e282ec1 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -610,6 +610,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
> u32 drc_index)
>  
>  static int dlpar_add_lmb(struct drmem_lmb *lmb)
>  {
> + mhp_t mhp_flags = MHP_NONE;
>   unsigned long block_sz;
>   int nid, rc;
>  
> @@ -629,8 +630,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
>   if (nid < 0 || !node_possible(nid))
>   nid = first_online_node;
>  
> + if (mhp_supports_memmap_on_memory(block_sz))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>   /* Add the memory */
> - rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
> + rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
>   if (rc) {
>   invalidate_lmb_associativity_index(lmb);
>   return rc;

With 16MB LMBs it's quite wasteful - you won't even have a huge page
fitting the remaining part.
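(Back-of-the-envelope, assuming 64 bytes per struct page: the memmap for a
16MB block is 256 KiB with 4k base pages, or 16 KiB with 64k base pages, so
once that is carved out of the block itself, the remaining < 16MB can no
longer back a 16MB huge page - and that repeats for every single LMB.)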

I do wonder if we want this on powerpc only with a bigger LMB/memory
block size (e.g., 256 MB, which is AFAIK the maximum usually found).

-- 
Thanks,

David / dhildenb



Re: [PATCH 2/2] resource: Make it possible to reserve memory on 64bit platform

2021-01-11 Thread David Hildenbrand
On 11.01.21 17:33, Wesley Zhao wrote:
> From: "Wesley.Zhao" 
> 
> For now "reserve=" is limitied to 32bit,not available on 64bit
> platform,so we change the get_option() to get_option_ull(added in
> patch: commit 4b6bfe96265e ("lib/cmdline: add new function
> get_option_ull()"))

Curious, what's the target use case? (did not receive a cover letter,
maybe it's buried in there)

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:18, Dan Williams wrote:
> On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand  wrote:
>>
>> [...]
>>
>>>>> Well, I would love to have no surprises either. So far there was not
>>>>> actual argument why the pmem reserved space cannot be fully initialized.
>>>>
>>>> Yes, I'm still hoping Dan can clarify that.
>>>
>>> Complexity and effective utility (once pfn_to_online_page() is fixed)
>>> are the roadblocks in my mind. The altmap is there to allow for PMEM
>>> capacity to be used as memmap space, so there would need to be code to
>>> break that circular dependency and allocate a memmap for the metadata
>>> space from DRAM and the rest of the memmap space for the data capacity
>>> from pmem itself. That memmap-for-pmem-metadata will still represent
>>> offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
>>> is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
>>
>> Assume I do
>>
>> pgmap = get_dev_pagemap(pfn, NULL);
>> if (pgmap)
>> return pfn_to_page(pfn);
>> return NULL;
>>
>> on a random pfn because I want to inspect ZONE_DEVICE PFNs.
> 
> I keep getting hung up on the motivation to do random pfn inspection?
> 
> The problems we have found to date have required different solutions.
> The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
> could rely on the fact that the page already had an elevated reference
> count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
> it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
> have been a problem, but with pfn_to_online_page() fixed what is the
> remaining motivation to inspect ZONE_DEVICE pfns?

1) Let's assume we want to implement zone shrinking
(remove_pfn_range_from_zone()->shrink_zone_span()) for ZONE_DEVICE at
some point.

A simple approach would be going via get_dev_pagemap(pfn,
NULL)->pfn_to_page(pfn), checking for the zone.
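
Something like this completely untested sketch (the helper name is made
up; relies on get_dev_pagemap()/put_dev_pagemap() from linux/memremap.h):

static bool dev_pfn_in_zone(unsigned long pfn, struct zone *zone)
{
        struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
        bool ret = false;

        if (!pgmap)
                return false;
        /* Only meaningful if the memmap of that pfn was actually initialized. */
        ret = page_zone(pfn_to_page(pfn)) == zone;
        put_dev_pagemap(pgmap);
        return ret;
}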

If that's not possible, then extending dev_pagemap (e.g., indicating the
nid) might also work (unless there is another way to get the nid).


2) Let's take a look at mm/memory-failure.c:memory_failure_dev_pagemap()

IIUC, we might end up doing pfn_to_page(pfn) on a pfn in the reserved
altmap space, so one with an uninitialized memmap.

E.g., in dax_lock_page() we access page->mapping, which might just be
garbage. dax_mapping() will de-reference garbage.

Most probably I am missing something here.



Question is: what are the expectations regarding the memmap if
get_dev_pagemap() succeeded.

I'm fine documenting that "get_dev_pagemap() does not guarantee that the
"struct page" returned by pfn_to_page() was initialized and can safely
be used. E.g., it might be a pfn in the reserved altmap space, for which
the memmap is never initialized. Accessing it might be dangerous.".

Then, relevant places (e.g., memory_failure_dev_pagemap()) have to
somehow check whether the memmap content can actually be used.
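
Something like the following (hypothetical helper, untested) is what I
have in mind; pfns in the reserved/self-hosted altmap space at the
beginning of the range don't have an initialized memmap:

static bool pgmap_memmap_initialized(struct dev_pagemap *pgmap,
				     unsigned long pfn)
{
	struct vmem_altmap *altmap = pgmap_altmap(pgmap);

	if (!altmap)
		return true;
	return pfn >= altmap->base_pfn + vmem_altmap_offset(altmap);
}

memory_failure_dev_pagemap() could then simply bail out if that returns
false.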

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 1/5] mm: Move pfn_to_online_page() out of line

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> pfn_to_online_page() is already too large to be a macro or an inline
> function. In anticipation of further logic changes / growth, move it out
> of line.
> 
> No functional change, just code movement.
> 
> Cc: David Hildenbrand 
> Reported-by: Michal Hocko 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/memory_hotplug.h |   17 +
>  mm/memory_hotplug.c|   16 
>  2 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..3d99de0db2dd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -16,22 +16,7 @@ struct resource;
>  struct vmem_altmap;
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -/*
> - * Return page for the valid pfn only if the page is online. All pfn
> - * walkers which rely on the fully initialized page->flags and others
> - * should use this rather than pfn_valid && pfn_to_page
> - */
> -#define pfn_to_online_page(pfn) \
> -({  \
> - struct page *___page = NULL;   \
> - unsigned long ___pfn = pfn;\
> - unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> -\
> - if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> - pfn_valid_within(___pfn))  \
> - ___page = pfn_to_page(___pfn); \
> - ___page;   \
> -})
> +struct page *pfn_to_online_page(unsigned long pfn);
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..55a69d4396e7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,22 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>   return 0;
>  }
>  
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> + unsigned long nr = pfn_to_section_nr(pfn);
> +
> + if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> + pfn_valid_within(pfn))
> +     return pfn_to_page(pfn);
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 2/5] mm: Teach pfn_to_online_page() to consider subsection validity

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity.
> 
> pfn_valid_within() internally uses pfn_section_valid(), but gates it
> with early_section() to preserve the traditional behavior of pfn_valid()
> before subsection support was added.
> 
> pfn_to_online_page() wants the explicit precision that pfn_valid() does
> not offer, so use pfn_section_valid() directly. Since
> pfn_to_online_page() already open codes the validity of the section
> number vs NR_MEM_SECTIONS, there's not much value to using
> pfn_valid_within(), just use pfn_section_valid(). This loses the
> valid_section() check that pfn_valid_within() was performing, but that
> was already redundant with the online check.
> 
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai 
> Cc: Michal Hocko 
> Reported-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c |   16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..a845b3979bc0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,19 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + if (!pfn_section_valid(ms, pfn))
> + return NULL;

That's not sufficient for alternative implementations of pfn_valid().

You still need a pfn_valid(pfn) check for such alternative
implementations. Consider arm64 memory holes in the memmap. See their
current (yet to be fixed/reworked) pfn_valid() implementation.
(pfn_valid_within() is implicitly active on arm64)

Actually, I think we should add something like the following, to make
this clearer (pfn_valid_within() is confusing):

#ifdef CONFIG_HAVE_ARCH_PFN_VALID
/* We might have to check for holes inside the memmap. */
if (!pfn_valid(pfn))
	return NULL;
#endif

>  
> - if (nr < NR_MEM_SECTIONS && online_section_nr(nr) &&
> - pfn_valid_within(pfn))
> - return pfn_to_page(pfn);
> - return NULL;
> + return pfn_to_page(pfn);
>  }
>  EXPORT_SYMBOL_GPL(pfn_to_online_page);
>  
> 


-- 
Thanks,

David / dhildenb



Re: [PATCH v2 3/5] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixes ZONE_{NORMAL,MOVABLE} with
> ZONE_DEVICE. This means that pfn_to_online_page() may return invalid
> @page objects. For example with a memory map like:
> 
> 1-1fbff : System RAM
>   14200-143002e16 : Kernel code
>   14320-143713fff : Kernel rodata
>   14380-143b15b7f : Kernel data
>   144227000-144ff : Kernel bss
> 1fc00-2fbff : Persistent Memory (legacy)
>   1fc00-2fbff : namespace0.0
> 
> This command:
> 
> echo 0x1fc00 > /sys/devices/system/memory/soft_offline_page
> 
> ...succeeds when it should fail. When it succeeds it touches
> an uninitialized page and may crash or cause other damage (see
> dissolve_free_huge_page()).
> 
> While the memory map above is contrived via the memmap=ss!nn kernel
> command line option, the collision happens in practice on shipping
> platforms. The memory controller resources that decode spans of
> physical address space are a limited resource. One technique
> platform-firmware uses to conserve those resources is to share a decoder
> across 2 devices to keep the address range contiguous. Unfortunately the
> unit of operation of a decoder is 64MiB while the Linux section size is
> 128MiB. This results in situations where, without subsection hotplug
> memory mappings with different lifetimes collide into one object that
> can only express one lifetime.
> 
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section. In
> the fast path online_section() for a full ZONE_DEVICE section returns
> false.
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton 
> Reported-by: Michal Hocko 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/mmzone.h |   22 +++---
>  mm/memory_hotplug.c|   38 ++
>  2 files changed, 53 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b593316bff3d..0b5c44f730b4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void);
>   *  which results in PFN_SECTION_SHIFT equal 6.
>   * To sum it up, at least 6 bits are available.
>   */
> -#define  SECTION_MARKED_PRESENT  (1UL<<0)
> -#define SECTION_HAS_MEM_MAP  (1UL<<1)
> -#define SECTION_IS_ONLINE(1UL<<2)
> -#define SECTION_IS_EARLY (1UL<<3)
> -#define SECTION_MAP_LAST_BIT (1UL<<4)
> -#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT3
> +#define SECTION_MARKED_PRESENT   (1UL<<0)
> +#define SECTION_HAS_MEM_MAP  (1UL<<1)
> +#define SECTION_IS_ONLINE(1UL<<2)
> +#define SECTION_IS_EARLY (1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE(1UL<<4)
> +#define SECTION_MAP_LAST_BIT (1UL<<5)
> +#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT3
>  
>  static inline struct page *__section_mem_map_addr(struct mem_section 
> *section)
>  {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section 
> *section)
>   return (section && (section->section_mem_map & SECTION_IS_ONLINE));
>  }
>  
> +static inline int online_device_section(struct mem_section *section)
> +{
> + unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> + return section && ((section->section_mem_map & flags) == flags);
> +}
> +
>  static inline int online_section_nr(unsigned long nr)
>  {
>   return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a845b3979bc0..b2ccb84c3082 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,6 +308,7 @@ static int check_hotplug_memory_addressable(unsigned long 
> pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct dev_pagemap *pgmap;
>   struct mem_section *ms;
>  

Re: [PATCH V2 1/3] mm/hotplug: Prevalidate the address range being added with platform

2021-01-12 Thread David Hildenbrand
On 12.01.21 04:51, Anshuman Khandual wrote:
> 
> 
> On 1/11/21 7:13 PM, Oscar Salvador wrote:
>> On Mon, Jan 11, 2021 at 11:51:47AM +0100, David Hildenbrand wrote:
>>> AFAIKs, all memhp_get_pluggable_range() users pass "1".
>>>
>>> What about the "add_pages()-only" path?
>>
>> I guess you refer to memremap_pages(), right?
> 
> Right, via pagemap_range().
> 
>> If so, moving the added memhp_range_allowed() check above the if-else might 
>> do
>> the trick
>>
> We had that code in the earlier version. But dropped it, as we did
> not want to add any new checks in the generic code. Can add it back
> if that is preferred.

I remember discussing replacing the check in __add_pages() instead. But
I don't really care where the check ends up being. As discussed, at some
point, we should provide versions of add_pages() and arch_add_pages()
that don't immediately end in arch-code.

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-01-12 Thread David Hildenbrand
On 12.01.21 08:26, Oscar Salvador wrote:
> On Mon, Jan 11, 2021 at 05:52:19PM +0100, David Hildenbrand wrote:
>> On 17.12.20 14:07, Oscar Salvador wrote:
>>> In order to use self-hosted memmap array, the platform needs to have
>>> support for CONFIG_SPARSEMEM_VMEMMAP and altmap.
>>> Currently, only arm64, PPC and x86_64 have the support, so enable those
>>> platforms with ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE.
>>
>> "In the first version, only  will support it".
> 
> I will try to be more specific.
> 
>>
>> I'd probably split off enabling it per architecture in separate patches
>> and the end of the series.
> 
> You mean introducing only mm/Kconfig change in this patch, and then
> arch/*/Kconfig changes in separate patches at the end of the series?

Yeah, or squashing the leftovers of this patch (3 LOC) into patch #2.

> 
> I can certainly do that, not sure how much will help with the review,
> but it might help when bisecting.

It's usually nicer to explicitly enable stuff per architecture, stating
why it works on that architecture (and in the best case, even was
tested!). :)

-- 
Thanks,

David / dhildenb



Re: [PATCH v2 4/5] mm: Fix page reference leak in soft_offline_page()

2021-01-12 Thread David Hildenbrand
On 12.01.21 10:34, Dan Williams wrote:
> The conversion to move pfn_to_online_page() internal to
> soft_offline_page() missed that the get_user_pages() reference needs to
> be dropped when pfn_to_online_page() fails.
> 
> When soft_offline_page() is handed a pfn_valid() &&
> !pfn_to_online_page() pfn the kernel hangs at dax-device shutdown due to
> a leaked reference.
> 
> Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: 
> Signed-off-by: Dan Williams 
> ---
>  mm/memory-failure.c |   20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5a38e9eade94..78b173c7190c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1885,6 +1885,12 @@ static int soft_offline_free_page(struct page *page)
>   return rc;
>  }
>  
> +static void put_ref_page(struct page *page)
> +{
> + if (page)
> + put_page(page);
> +}
> +
>  /**
>   * soft_offline_page - Soft offline a page.
>   * @pfn: pfn to soft-offline
> @@ -1910,20 +1916,26 @@ static int soft_offline_free_page(struct page *page)
>  int soft_offline_page(unsigned long pfn, int flags)
>  {
>   int ret;
> - struct page *page;
>   bool try_again = true;
> + struct page *page, *ref_page = NULL;
> +
> + WARN_ON_ONCE(!pfn_valid(pfn) && (flags & MF_COUNT_INCREASED));
>  
>   if (!pfn_valid(pfn))
>   return -ENXIO;
> + if (flags & MF_COUNT_INCREASED)
> + ref_page = pfn_to_page(pfn);
> +
>   /* Only online pages can be soft-offlined (esp., not ZONE_DEVICE). */
>   page = pfn_to_online_page(pfn);
> - if (!page)
> + if (!page) {
> + put_ref_page(ref_page);
>   return -EIO;
> + }
>  
>   if (PageHWPoison(page)) {
>   pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
> - if (flags & MF_COUNT_INCREASED)
> - put_page(page);
> + put_ref_page(ref_page);
>   return 0;
>   }

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 0/3] create hugetlb flags to consolidate state

2021-01-12 Thread David Hildenbrand
On 11.01.21 22:01, Mike Kravetz wrote:
> While discussing a series of hugetlb fixes in [1], it became evident
> that the hugetlb specific page state information is stored in a somewhat
> haphazard manner.  Code dealing with state information would be easier
> to read, understand and maintain if this information was stored in a
> consistent manner.
> 
> This RFC series uses page.private of the hugetlb head page for storing a
> set of hugetlb specific page flags.  Routines to manipulate the flags
> are copied from normal page flag manipulation code and use atomic
> operations.  This is likely overkill for the uses in hugetlb code, and
> can be changed after additional auditing of code.

Haven't looked into the code but that sounds like a good idea.

> 
> For now, only 3 state flags are added as part of this series.  However,
> the suggested fix in [2] could use another flag.  In addition, a flag
> could be used to track whether or not a huge page has been cleared to
> optimize code paths if the init_on_alloc security feature is enabled.

Right, that will be helpful: indicating pages that were either cleared
directly when allocating (init_on_alloc) or via some asynchronous
mechanism in the future.
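
Something like the following is what I'd imagine (untested sketch with
made-up names, not necessarily what the series does):

/* hugetlb-specific flags, stored in the head page's page.private */
enum hugetlb_page_flags {
	HPG_zeroed,	/* huge page content is known to be zero */
};

static inline void SetHPageZeroed(struct page *hpage)
{
	set_bit(HPG_zeroed, &hpage->private);
}

static inline bool HPageZeroed(struct page *hpage)
{
	return test_bit(HPG_zeroed, &hpage->private);
}

The allocation path could then skip clearing whenever HPageZeroed() is
already set.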

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 10.01.21 13:40, Muchun Song wrote:
> If the refcount is one when it is migrated, it means that the page
> was freed from under us. So we are done and do not need to migrate.
> 
> This optimization is consistent with the regular pages, just like
> unmap_and_move() does.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> Acked-by: Yang Shi 
> ---
>  mm/migrate.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4385f2fb5d18..a6631c4eb6a6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
> get_new_page,
>   return -ENOSYS;
>   }
>  
> + if (page_count(hpage) == 1) {
> + /* page was freed from under us. So we are done. */
> + putback_active_hugepage(hpage);
> + return MIGRATEPAGE_SUCCESS;
> + }
> +
>   new_hpage = get_new_page(hpage, private);
>   if (!new_hpage)
>   return -ENOMEM;
> 

Question: What if called via alloc_contig_range() where we even want to
"migrate" free pages, meaning, relocate it?

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 12.01.21 12:00, David Hildenbrand wrote:
> On 10.01.21 13:40, Muchun Song wrote:
>> If the refcount is one when it is migrated, it means that the page
>> was freed from under us. So we are done and do not need to migrate.
>>
>> This optimization is consistent with the regular pages, just like
>> unmap_and_move() does.
>>
>> Signed-off-by: Muchun Song 
>> Reviewed-by: Mike Kravetz 
>> Acked-by: Yang Shi 
>> ---
>>  mm/migrate.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 4385f2fb5d18..a6631c4eb6a6 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
>> get_new_page,
>>  return -ENOSYS;
>>  }
>>  
>> +if (page_count(hpage) == 1) {
>> +/* page was freed from under us. So we are done. */
>> +putback_active_hugepage(hpage);
>> +return MIGRATEPAGE_SUCCESS;
>> +}
>> +
>>  new_hpage = get_new_page(hpage, private);
>>  if (!new_hpage)
>>  return -ENOMEM;
>>
> 
> Question: What if called via alloc_contig_range() where we even want to
> "migrate" free pages, meaning, relocate it?
> 

To be more precise:

a) We don't have dissolve_free_huge_pages() calls on the
alloc_contig_range() path. So we *need* migration IIUC.

b) dissolve_free_huge_pages() will fail if going below the reservation.
In that case we really want to migrate free pages. This even applies to
memory offlining.

Either I am missing something important or this patch is more dangerous
than it looks like.

-- 
Thanks,

David / dhildenb



Re: [PATCH 1/5] mm: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-01-12 Thread David Hildenbrand
On 12.01.21 12:17, Oscar Salvador wrote:
> On Tue, Jan 12, 2021 at 11:12:30AM +0100, David Hildenbrand wrote:
>> On 12.01.21 08:26, Oscar Salvador wrote:
>>> You mean introducing only mm/Kconfig change in this patch, and then
>>> arch/*/Kconfig changes in separate patches at the end of the series?
>>
>> Yeah, or squashing the leftovers of this patch (3 LOC) into patch #2.
> 
> Ok, makes sense.
> 
>>> I can certainly do that, not sure how much will help with the review,
>>> but it might help when bisecting.
>>
>> It's usually nicer to explicitly enable stuff per architecture, stating
>> why it works on that architecture (and in the best case, even was
>> tested!). :)
> 
> Fine by me.
> I will prepare another re-spin with that in mind then.
> 
> It would be great to have some feedback on patch#2 before that (in case you 
> find
> some time ;-).

Yeah, will get to that soon. Mailbox is flooded right now :D


-- 
Thanks,

David / dhildenb



Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 12.01.21 12:27, Michal Hocko wrote:
> On Tue 12-01-21 12:11:21, David Hildenbrand wrote:
>> On 12.01.21 12:00, David Hildenbrand wrote:
>>> On 10.01.21 13:40, Muchun Song wrote:
>>>> If the refcount is one when it is migrated, it means that the page
>>>> was freed from under us. So we are done and do not need to migrate.
>>>>
>>>> This optimization is consistent with the regular pages, just like
>>>> unmap_and_move() does.
>>>>
>>>> Signed-off-by: Muchun Song 
>>>> Reviewed-by: Mike Kravetz 
>>>> Acked-by: Yang Shi 
>>>> ---
>>>>  mm/migrate.c | 6 ++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
>>>> get_new_page,
>>>>return -ENOSYS;
>>>>}
>>>>  
>>>> +  if (page_count(hpage) == 1) {
>>>> +  /* page was freed from under us. So we are done. */
>>>> +  putback_active_hugepage(hpage);
>>>> +  return MIGRATEPAGE_SUCCESS;
>>>> +  }
>>>> +
>>>>new_hpage = get_new_page(hpage, private);
>>>>if (!new_hpage)
>>>>return -ENOMEM;
>>>>
>>>
>>> Question: What if called via alloc_contig_range() where we even want to
>>> "migrate" free pages, meaning, relocate it?
>>>
>>
>> To be more precise:
>>
>> a) We don't have dissolve_free_huge_pages() calls on the
>> alloc_contig_range() path. So we *need* migration IIUC.
>>
>> b) dissolve_free_huge_pages() will fail if going below the reservation.
>> In that case we really want to migrate free pages. This even applies to
>> memory offlining.
>>
>> Either I am missing something important or this patch is more dangerous
>> than it looks like.
> 
> This is an interesting point. But do we try to migrate hugetlb pages in
> alloc_contig_range? isolate_migratepages_block !PageLRU need to be

I haven't tested this so far (especially in the context of virtio-mem or
CMA), but have a TODO item on my long list of things to look at in the
future.

> marked as PageMovable AFAICS. This would be quite easy to implement but
> a more fundamental question is whether we really want to mess with
> existing pools for alloc_contig_range.

Can these pages fall onto ZONE_MOVABLE or even MIGRATE_CMA? If yes, we
really want to. And I think both are the case for "ordinary" huge pages
allocated via the buddy.

> 
> Anyway you are quite right that this change has more side effects than
> it is easy to see while it doesn't really bring any major advantage
> other than the consistency.

Free hugetlbfs pages are special. E.g., they cannot simply be skipped
when offlining. So I don't think the consistency argument really applies here.


-- 
Thanks,

David / dhildenb



Re: [PATCH v8 06/12] mm/hugetlb: Allocate the vmemmap pages associated with each HugeTLB page

2020-12-11 Thread David Hildenbrand


> Am 11.12.2020 um 10:35 schrieb Oscar Salvador :
> 
> On Thu, Dec 10, 2020 at 11:55:20AM +0800, Muchun Song wrote:
>> When we free a HugeTLB page to the buddy allocator, we should allocate the
>> vmemmap pages associated with it. We can do that in the __free_hugepage()
> "vmemmap pages that describe the range" would look better to me, but it is ok.
> 
>> +#define GFP_VMEMMAP_PAGE\
>> +(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH | __GFP_NOWARN)
>> 
>> #ifndef VMEMMAP_HPAGE_SHIFT
>> #define VMEMMAP_HPAGE_SHIFTHPAGE_SHIFT
>> @@ -197,6 +200,11 @@
>>(__boundary - 1 < (end) - 1) ? __boundary : (end); \
>> })
>> 
>> +typedef void (*vmemmap_remap_pte_func_t)(struct page *reuse, pte_t *pte,
>> + unsigned long start, unsigned long end,
>> + void *priv);
> 
> Any reason to not have defined GFP_VMEMMAP_PAGE and the new typedef into
> hugetlb_vmemmap.h?
> 
> 
>> +static void vmemmap_restore_pte_range(struct page *reuse, pte_t *pte,
>> +  unsigned long start, unsigned long end,
>> +  void *priv)
>> +{
>> +pgprot_t pgprot = PAGE_KERNEL;
>> +void *from = page_to_virt(reuse);
>> +unsigned long addr;
>> +struct list_head *pages = priv;
> [...]
>> +
>> +/*
>> + * Make sure that any data that writes to the @to is made
>> + * visible to the physical page.
>> + */
>> +flush_kernel_vmap_range(to, PAGE_SIZE);
> 
> Correct me if I am wrong, but flush_kernel_vmap_range is a NOOP under arches 
> which
> do not have ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE.
> Since we only enable support for x86_64, and x86_64 is one of those arches,
> could we remove this, and introduced later on in case we enable this feature
> on an arch that needs it?
> 
> I am not sure if you need to flush the range somehow, as you did in
> vmemmap_remap_range.
> 
>> +retry:
>> +page = alloc_page(GFP_VMEMMAP_PAGE);
>> +if (unlikely(!page)) {
>> +msleep(100);
>> +/*
>> + * We should retry infinitely, because we cannot
>> + * handle allocation failures. Once we allocate
>> + * vmemmap pages successfully, then we can free
>> + * a HugeTLB page.
>> + */
>> +goto retry;
> 
> I think this is the trickiest part.
> With 2MB HugeTLB pages we only need 6 pages, but with 1GB, the number of pages
> we need to allocate increases significantly (4088 pages IIRC).
> And you are using __GFP_HIGH, which will allow us to use more memory (by
> cutting down the watermark), but it might lead to putting the system
> on its knees wrt. memory.
> And yes, I know that once we allocate the 4088 pages, 1GB gets freed, but
> still.

Similar to memory hotplug, no? I don't think this is really an issue that
cannot be mitigated. Yeah, we might want to tweak allocation flags.
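
(Just to illustrate what "tweaking" could mean, untested and not a
concrete proposal: retry hard, but don't dip into the atomic reserves.)

#define GFP_VMEMMAP_PAGE_ALT \
	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)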

> 
> I would like to hear Michal's thoughts on this one, but I wonder if it makes
> sense to not let 1GB-HugeTLB pages be freed.
> 
> -- 
> Oscar Salvador
> SUSE L3
> 



Re: [PATCH v3 5/6] mm/gup: migrate pinned pages out of movable zone

2020-12-11 Thread David Hildenbrand


> Am 11.12.2020 um 22:09 schrieb Pavel Tatashin :
> 
> On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe  wrote:
>> 
>>> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
>>> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe  wrote:
 
 On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
>  }
> 
>  if (!isolate_lru_page(head)) {
> - list_add_tail(&head->lru, 
> &cma_page_list);
> + list_add_tail(&head->lru, 
> &movable_page_list);
>  mod_node_page_state(page_pgdat(head),
>  NR_ISOLATED_ANON 
> +
>  
> page_is_file_lru(head),
> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
>  i += step;
>  }
> 
> - if (!list_empty(&cma_page_list)) {
> + if (!list_empty(&movable_page_list)) {
 
 You didn't answer my earlier question, is it OK that ZONE_MOVABLE
 pages leak out here if ioslate_lru_page() fails but the
 moval_page_list is empty?
 
 I think the answer is no, right?
>>> In my opinion it is OK. We are doing our best to not pin movable
>>> pages, but if isolate_lru_page() fails because pages are currently
>>> locked by someone else, we will end up long-term pinning them.
>>> See comment in this patch:
>>> +* 1. Pinned pages: (long-term) pinning of movable pages is avoided
>>> +*when pages are pinned and faulted, but it is still possible 
>>> that
>>> +*address space already has pages in ZONE_MOVABLE at the time 
>>> when
>>> +*pages are pinned (i.e. user has touches that memory before
>>> +*pinning). In such case we try to migrate them to a different 
>>> zone,
>>> +*but if migration fails the pages can still end-up pinned in
>>> +*ZONE_MOVABLE. In such case, memory offlining might retry a 
>>> long
>>> +*time and will only succeed once user application unpins pages.
>> 
>> It is not "retry a long time" it is "might never complete" because
>> userspace will hold the DMA pin indefinitely.
>> 
>> Confused what the point of all this is then ??
>> 
>> I thought to goal here is to make memory unplug reliable, if you leave
>> a hole like this then any hostile userspace can block it forever.
> 
> You are right, I used a wording from the previous comment, and it
> should be made clear that pin may be forever. Without these patches it
> is guaranteed that hot-remove will fail if there are pinned pages as
> ZONE_MOVABLE is actually the first to be searched. Now, it will fail
> only due to exceptions listed in ZONE_MOVABLE comment:
> 
> 1. pin + migration/isolation failure

Not sure what that really means. We have short-term pinnings (although we might 
have a better term for "pinning" here), for example when a process dies (IIRC).
There is a period where pages cannot get migrated and offlining code has to 
retry (which might take a while). This still applies after your change - are 
you referring to that?

> 2. memblock allocation due to limited amount of space for kernelcore
> 3. memory holes
> 4. hwpoison
> 5. Unmovable PG_offline pages (? need to study why this is a scenario).

Virtio-mem is the primary user in this context.

> Do you think we should unconditionally unpin pages, and return error
> when isolation/migration fails?

I'm not sure what you mean here. Who's supposed to unpin which pages?

> 
> Pasha
> 
>> 
>> Jason
> 



Re: [PATCH v3 5/6] mm/gup: migrate pinned pages out of movable zone

2020-12-11 Thread David Hildenbrand


> Am 11.12.2020 um 22:36 schrieb Pavel Tatashin :
> 
> On Fri, Dec 11, 2020 at 4:29 PM David Hildenbrand  wrote:
>> 
>> 
>>>> Am 11.12.2020 um 22:09 schrieb Pavel Tatashin :
>>> 
>>> On Fri, Dec 11, 2020 at 3:46 PM Jason Gunthorpe  wrote:
>>>> 
>>>>> On Fri, Dec 11, 2020 at 03:40:57PM -0500, Pavel Tatashin wrote:
>>>>> On Fri, Dec 11, 2020 at 3:23 PM Jason Gunthorpe  wrote:
>>>>>> 
>>>>>> On Fri, Dec 11, 2020 at 03:21:39PM -0500, Pavel Tatashin wrote:
>>>>>>> @@ -1593,7 +1592,7 @@ static long check_and_migrate_cma_pages(struct 
>>>>>>> mm_struct *mm,
>>>>>>> }
>>>>>>> 
>>>>>>> if (!isolate_lru_page(head)) {
>>>>>>> - list_add_tail(&head->lru, 
>>>>>>> &cma_page_list);
>>>>>>> + list_add_tail(&head->lru, 
>>>>>>> &movable_page_list);
>>>>>>> 
>>>>>>> mod_node_page_state(page_pgdat(head),
>>>>>>> 
>>>>>>> NR_ISOLATED_ANON +
>>>>>>> 
>>>>>>> page_is_file_lru(head),
>>>>>>> @@ -1605,7 +1604,7 @@ static long check_and_migrate_cma_pages(struct 
>>>>>>> mm_struct *mm,
>>>>>>> i += step;
>>>>>>> }
>>>>>>> 
>>>>>>> - if (!list_empty(&cma_page_list)) {
>>>>>>> + if (!list_empty(&movable_page_list)) {
>>>>>> 
>>>>>> You didn't answer my earlier question, is it OK that ZONE_MOVABLE
>>>>>> pages leak out here if ioslate_lru_page() fails but the
>>>>>> moval_page_list is empty?
>>>>>> 
>>>>>> I think the answer is no, right?
>>>>> In my opinion it is OK. We are doing our best to not pin movable
>>>>> pages, but if isolate_lru_page() fails because pages are currently
>>>>> locked by someone else, we will end up long-term pinning them.
>>>>> See comment in this patch:
>>>>> +* 1. Pinned pages: (long-term) pinning of movable pages is 
>>>>> avoided
>>>>> +*when pages are pinned and faulted, but it is still possible 
>>>>> that
>>>>> +*address space already has pages in ZONE_MOVABLE at the time 
>>>>> when
>>>>> +*pages are pinned (i.e. user has touches that memory before
>>>>> +*pinning). In such case we try to migrate them to a 
>>>>> different zone,
>>>>> +*but if migration fails the pages can still end-up pinned in
>>>>> +*ZONE_MOVABLE. In such case, memory offlining might retry a 
>>>>> long
>>>>> +*time and will only succeed once user application unpins 
>>>>> pages.
>>>> 
>>>> It is not "retry a long time" it is "might never complete" because
>>>> userspace will hold the DMA pin indefinitely.
>>>> 
>>>> Confused what the point of all this is then ??
>>>> 
>>>> I thought to goal here is to make memory unplug reliable, if you leave
>>>> a hole like this then any hostile userspace can block it forever.
>>> 
>>> You are right, I used a wording from the previous comment, and it
>>> should be made clear that pin may be forever. Without these patches it
>>> is guaranteed that hot-remove will fail if there are pinned pages as
>>> ZONE_MOVABLE is actually the first to be searched. Now, it will fail
>>> only due to exceptions listed in ZONE_MOVABLE comment:
>>> 
>>> 1. pin + migration/isolation failure
>> 
>> Not sure what that really means. We have short-term pinnings (although we 
>> might have a better term for "pinning" here) for example, when a process 
>> dies (IIRC). There is a period where pages cannot get migrated and offlining 
>> code has to retry (which might take a while). This still applies after your 
>> change - are you referring to that?
>> 
>>> 2. memblock allocation due to limited amount of space for kernelcore
>>> 3. memory holes
>>> 4. hwpoison
>>> 5. Unmovable PG_offline pages (? need to study why this is a scenario).
>> 
>> Virtio-mem is the primary user in this context.
>> 
>>> Do you think we should unconditionally unpin pages, and return error
>>> when isolation/migration fails?
>> 
>> I'm not sure what you mean here. Who's supposed to unpin which pages?
> 
> Hi David,
> 
> When check_and_migrate_movable_pages() is called, the pages are
> already pinned. If some of those pages are in movable zone, and we
> fail to migrate or isolate them what should we do: proceed, and keep
> it as exception of when movable zone can actually have pinned pages or
> unpin all pages in the array, and return an error, or unpin only pages
> in movable zone, and return an error?
> 

I guess revert what we did (unpin) and return an error. The interesting 
question is what can make migration/isolation fail:

a) out of memory: smells like a zone setup issue. Failures are acceptable I 
guess.

b) short term pinnings: process dying - not relevant I guess. Other cases? 
(Fork?)

c) ?

Once we have clarified that, we actually know how likely it will be to
return an error (and to make vfio pinnings fail, etc.).

> Pasha



Re: [PATCH v3 5/6] mm/gup: migrate pinned pages out of movable zone

2020-12-12 Thread David Hildenbrand


> Am 12.12.2020 um 00:50 schrieb Jason Gunthorpe :
> 
> On Fri, Dec 11, 2020 at 10:53:00PM +0100, David Hildenbrand wrote:
> 
>>> When check_and_migrate_movable_pages() is called, the pages are
>>> already pinned. If some of those pages are in movable zone, and we
>>> fail to migrate or isolate them what should we do: proceed, and
>>> keep it as exception of when movable zone can actually have pinned
>>> pages or unpin all pages in the array, and return an error, or
>>> unpin only pages in movable zone, and return an error?
>> 
>> I guess revert what we did (unpin) and return an error. The
>> interesting question is what can make migration/isolation fail
>> 
>> a) out of memory: smells like a zone setup issue. Failures are acceptable I 
>> guess.
> 
> Out of memory is reasonable..
> 
>> b) short term pinnings: process dying - not relevant I guess. Other cases? 
>> (Fork?)
> 
> Concurrent with non-longterm GUP users are less reasonable, fork is
> not reasonable, etc..

Concurrent alloc_contig_range(), memory offlining, compaction ... where we
migrate pages? Any experts on racing page migration in these scenarios?

(Also wondering what would happen if we are just about to swap)

> 
> Racing with another GUP in another thread is also not reasonable, so
> failing to isolate can't be a failure

Having VMs with multiple vfio containers is certainly realistic, and optimizing 
in user space to do vfio mappings concurrently doesn't sound too crazy to me.
But I haven't checked if vfio common code already handles such concurrency.

> 
> Jasnon
> 



Re: [RFC PATCH 2/3] virtio-balloon: add support for providing free huge page reports to host

2020-12-22 Thread David Hildenbrand
On 22.12.20 08:48, Liang Li wrote:
> Free page reporting only supports buddy pages, it can't report the
> free pages reserved for hugetlbfs case. On the other hand, hugetlbfs

The virtio-balloon free page reporting interface accepts a generic sg,
so it isn't glued to buddy pages. There is no need for a new interface.
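
Rough sketch (not actual driver code, untested) of how a hugetlb page
could be fed through the existing reporting callback:

static int report_one_huge_page(struct page_reporting_dev_info *prdev,
				struct page *hpage, unsigned int order)
{
	struct scatterlist sg;

	sg_init_table(&sg, 1);
	sg_set_page(&sg, hpage, PAGE_SIZE << order, 0);

	/* same callback virtio-balloon already registers for buddy pages */
	return prdev->report(prdev, &sg, 1);
}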


-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-22 Thread David Hildenbrand
On 22.12.20 08:49, Liang Li wrote:
> This patch add support of pre zero out free hugepage, we can use
> this feature to speed up page population and page fault handing.
> 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand   
> Cc: Michal Hocko  
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: Mike Kravetz 
> Cc: Liang Li 
> Signed-off-by: Liang Li 
> ---
>  mm/page_prezero.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/mm/page_prezero.c b/mm/page_prezero.c
> index c8ce720bfc54..dff4e0adf402 100644
> --- a/mm/page_prezero.c
> +++ b/mm/page_prezero.c
> @@ -26,6 +26,7 @@ static unsigned long delay_millisecs = 1000;
>  static unsigned long zeropage_enable __read_mostly;
>  static DEFINE_MUTEX(kzeropaged_mutex);
>  static struct page_reporting_dev_info zero_page_dev_info;
> +static struct page_reporting_dev_info zero_hugepage_dev_info;
>  
>  inline void clear_zero_page_flag(struct page *page, int order)
>  {
> @@ -69,9 +70,17 @@ static int start_kzeropaged(void)
>   zero_page_dev_info.delay_jiffies = 
> msecs_to_jiffies(delay_millisecs);
>  
>   err = page_reporting_register(&zero_page_dev_info);
> +
> + zero_hugepage_dev_info.report = zero_free_pages;
> + zero_hugepage_dev_info.mini_order = mini_page_order;
> + zero_hugepage_dev_info.batch_size = batch_size;
> + zero_hugepage_dev_info.delay_jiffies = 
> msecs_to_jiffies(delay_millisecs);
> +
> + err |= hugepage_reporting_register(&zero_hugepage_dev_info);
>   pr_info("Zero page enabled\n");
>   } else {
>   page_reporting_unregister(&zero_page_dev_info);
> + hugepage_reporting_unregister(&zero_hugepage_dev_info);
>   pr_info("Zero page disabled\n");
>   }
>  
> @@ -90,7 +99,15 @@ static int restart_kzeropaged(void)
>   zero_page_dev_info.batch_size = batch_size;
>   zero_page_dev_info.delay_jiffies = 
> msecs_to_jiffies(delay_millisecs);
>  
> + hugepage_reporting_unregister(&zero_hugepage_dev_info);
> +
> + zero_hugepage_dev_info.report = zero_free_pages;
> + zero_hugepage_dev_info.mini_order = mini_page_order;
> + zero_hugepage_dev_info.batch_size = batch_size;
> + zero_hugepage_dev_info.delay_jiffies = 
> msecs_to_jiffies(delay_millisecs);
> +
>   err = page_reporting_register(&zero_page_dev_info);
> + err |= hugepage_reporting_register(&zero_hugepage_dev_info);
>   pr_info("Zero page enabled\n");
>   }
>  
> 

Free page reporting in virtio-balloon doesn't give you any guarantees
regarding zeroing of pages. Take a look at the QEMU implementation -
e.g., with vfio all reports are simply ignored.

Also, I am not sure if mangling such details ("zeroing of pages") into
the page reporting infrastructure is a good idea.

-- 
Thanks,

David / dhildenb



Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread David Hildenbrand
On 21.12.20 17:25, Liang Li wrote:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
> 
> Zero out the page content usually happens when allocating pages with
> the flag of __GFP_ZERO, this is a time consuming operation, it makes
> the population of a large vma area very slowly. This patch introduce
> a new feature for zero out pages before page allocation, it can help
> to speed up page allocation with __GFP_ZERO.
> 
> My original intention for adding this feature is to shorten VM
> creation time when SR-IOV devicde is attached, it works good and the
> VM creation time is reduced by about 90%.
> 
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =
> QEMU use 4K pages, THP is off
>   round1  round2  round3
> w/o this patch:23.5s   24.7s   24.6s
> w/ this patch: 10.2s   10.3s   11.2s
> 
> QEMU use 4K pages, THP is on
>   round1  round2  round3
> w/o this patch:17.9s   14.8s   14.9s
> w/ this patch: 1.9s1.8s1.9s
> =
> 

I am still not convinced that we want/need this for this (main) use
case. Why can't we use huge pages for such use cases (that really care
about VM creation time) and rather deal with pre-zeroing of huge pages
instead?

If possible, I'd like to avoid GFP_ZERO (for reasons already discussed).

> Obviously, it can do more than this. We can benefit from this feature
> in the flowing case:
> 
> Interactive sence
> =
> Shorten application lunch time on desktop or mobile phone, it can help
> to improve the user experience. Test shows on a
> server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zero out 1GB RAM by
> the kernel will take about 200ms, while some mainly used application
> like Firefox browser, Office will consume 100 ~ 300 MB RAM just after
> launch, by pre zero out free pages, it means the application launch
> time will be reduced about 20~60ms (can be visual sensed?). May be
> we can make use of this feature to speed up the launch of Andorid APP
> (I didn't do any test for Android).

I am not really sure if you can actually visually sense a difference in
your examples. Startup time of an application is not just memory
allocation (page zeroing) time. It would be interesting to see how much of a
difference this actually makes in practice. (e.g., firefox startup time
etc.)

> 
> Virtulization
> =
> Speed up VM creation and shorten guest boot time, especially for PCI
> SR-IOV device passthrough scenario. Compared with some of the para
> vitalization solutions, it is easy to deploy because it’s transparent
> to guest and can handle DMA properly in BIOS stage, while the para
> virtualization solution can’t handle it well.

What is the "para virtualization" approach you are talking about?

> 
> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory
> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page
> to the VMM, VMM will unmap the corresponding host page for reclaim,
> when guest allocate a page just reclaimed, host will allocate a new page
> and zero it out for guest, in this case pre zero out free page will help
> to speed up the proccess of fault in and reduce the performance impaction.

Such faults in the VMM are no different to other faults, when first
accessing a page to be populated. Again, I wonder how much of a
difference it actually makes.

> 
> Speed up kernel routine
> ===
> This can’t be guaranteed because we don’t pre zero out all the free pages,
> but is true for most case. It can help to speed up some important system
> call just like fork, which will allocate zero pages for building page
> table. And speed up the process of page fault, especially for huge page
> fault. The POC of Hugetlb free page pre zero out has been done.

Would be interesting to have an actual example with some numbers.

> 
> Security
> 
> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> boot options", which zero out page in a asynchronous way. For users can't
> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings,
> this feauture provide another choice.

"we don’t pre zero out all the free pages" so this is of little actual use.

-- 
Thanks,

David / dhildenb



Re: [RFC PATCH 3/3] mm: support free hugepage pre zero out

2020-12-22 Thread David Hildenbrand
On 22.12.20 09:31, David Hildenbrand wrote:
> On 22.12.20 08:49, Liang Li wrote:
>> This patch add support of pre zero out free hugepage, we can use
>> this feature to speed up page population and page fault handing.
>>
>> Cc: Alexander Duyck 
>> Cc: Mel Gorman 
>> Cc: Andrea Arcangeli 
>> Cc: Dan Williams 
>> Cc: Dave Hansen 
>> Cc: David Hildenbrand   
>> Cc: Michal Hocko  
>> Cc: Andrew Morton 
>> Cc: Alex Williamson 
>> Cc: Michael S. Tsirkin 
>> Cc: Jason Wang 
>> Cc: Mike Kravetz 
>> Cc: Liang Li 
>> Signed-off-by: Liang Li 
>> ---
>>  mm/page_prezero.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/mm/page_prezero.c b/mm/page_prezero.c
>> index c8ce720bfc54..dff4e0adf402 100644
>> --- a/mm/page_prezero.c
>> +++ b/mm/page_prezero.c
>> @@ -26,6 +26,7 @@ static unsigned long delay_millisecs = 1000;
>>  static unsigned long zeropage_enable __read_mostly;
>>  static DEFINE_MUTEX(kzeropaged_mutex);
>>  static struct page_reporting_dev_info zero_page_dev_info;
>> +static struct page_reporting_dev_info zero_hugepage_dev_info;
>>  
>>  inline void clear_zero_page_flag(struct page *page, int order)
>>  {
>> @@ -69,9 +70,17 @@ static int start_kzeropaged(void)
>>  zero_page_dev_info.delay_jiffies = 
>> msecs_to_jiffies(delay_millisecs);
>>  
>>  err = page_reporting_register(&zero_page_dev_info);
>> +
>> +zero_hugepage_dev_info.report = zero_free_pages;
>> +zero_hugepage_dev_info.mini_order = mini_page_order;
>> +zero_hugepage_dev_info.batch_size = batch_size;
>> +zero_hugepage_dev_info.delay_jiffies = 
>> msecs_to_jiffies(delay_millisecs);
>> +
>> +err |= hugepage_reporting_register(&zero_hugepage_dev_info);
>>  pr_info("Zero page enabled\n");
>>  } else {
>>  page_reporting_unregister(&zero_page_dev_info);
>> +hugepage_reporting_unregister(&zero_hugepage_dev_info);
>>  pr_info("Zero page disabled\n");
>>  }
>>  
>> @@ -90,7 +99,15 @@ static int restart_kzeropaged(void)
>>  zero_page_dev_info.batch_size = batch_size;
>>  zero_page_dev_info.delay_jiffies = 
>> msecs_to_jiffies(delay_millisecs);
>>  
>> +hugepage_reporting_unregister(&zero_hugepage_dev_info);
>> +
>> +zero_hugepage_dev_info.report = zero_free_pages;
>> +zero_hugepage_dev_info.mini_order = mini_page_order;
>> +zero_hugepage_dev_info.batch_size = batch_size;
>> +zero_hugepage_dev_info.delay_jiffies = 
>> msecs_to_jiffies(delay_millisecs);
>> +
>>  err = page_reporting_register(&zero_page_dev_info);
>> +err |= hugepage_reporting_register(&zero_hugepage_dev_info);
>>  pr_info("Zero page enabled\n");
>>  }
>>  
>>
> 
> Free page reporting in virtio-balloon doesn't give you any guarantees
> regarding zeroing of pages. Take a look at the QEMU implementation -
> e.g., with vfio all reports are simply ignored.
> 
> Also, I am not sure if mangling such details ("zeroing of pages") into
> the page reporting infrastructure is a good idea.
> 

Oh, now I get what you are doing here: you rely on zero_free_pages()
from your other patch series and are not relying on virtio-balloon free
page reporting to do the zeroing.

You really should have mentioned that this patch series relies on the
other one and in which way.

-- 
Thanks,

David / dhildenb



Re: [RFC 1/2] arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory

2020-12-22 Thread David Hildenbrand
On 22.12.20 08:12, Anshuman Khandual wrote:
> pfn_valid() validates a pfn but basically it checks for a valid struct page
> backing for that pfn. It should always return positive for memory ranges
> backed with struct page mapping. But currently pfn_valid() fails for all
> ZONE_DEVICE based memory types even though they have struct page mapping.
> 
> pfn_valid() asserts that there is a memblock entry for a given pfn without
> MEMBLOCK_NOMAP flag being set. The problem with ZONE_DEVICE based memory is
> that they do not have memblock entries. Hence memblock_is_map_memory() will
> invariably fail via memblock_search() for a ZONE_DEVICE based address. This
> eventually fails pfn_valid() which is wrong. memblock_is_map_memory() needs
> to be skipped for such memory ranges. As ZONE_DEVICE memory gets hotplugged
> into the system via memremap_pages() called from a driver, their respective
> memory sections will not have SECTION_IS_EARLY set.
> 
> Normal hotplug memory will never have MEMBLOCK_NOMAP set in their memblock
> regions. Because the flag MEMBLOCK_NOMAP was specifically designed and set
> for firmware reserved memory regions. memblock_is_map_memory() can just be
> skipped as its always going to be positive and that will be an optimization
> for the normal hotplug memory. Like ZONE_DEVIE based memory, all hotplugged
> normal memory too will not have SECTION_IS_EARLY set for their sections.
> 
> Skipping memblock_is_map_memory() for all non early memory sections would
> fix pfn_valid() problem for ZONE_DEVICE based memory and also improve its
> performance for normal hotplug memory as well.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ard Biesheuvel 
> Cc: Robin Murphy 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/mm/init.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 75addb36354a..ee23bda00c28 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -225,6 +225,18 @@ int pfn_valid(unsigned long pfn)
>  
>   if (!valid_section(__pfn_to_section(pfn)))
>   return 0;
> +
> + /*
> +  * ZONE_DEVICE memory does not have the memblock entries.
> +  * memblock_is_map_memory() check for ZONE_DEVICE based
> +  * addresses will always fail. Even the normal hotplugged
> +  * memory will never have MEMBLOCK_NOMAP flag set in their
> +  * memblock entries. Skip memblock search for all non early
> +  * memory sections covering all of hotplug memory including
> +  * both normal and ZONE_DEVIE based.
> +  */
> + if (!early_section(__pfn_to_section(pfn)))
> + return 1;

Actually, I think we want to check for partially present sections.

Maybe we can rather switch to generic pfn_valid() and tweak it to
something like

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..7b1fcce5bd5a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1382,9 +1382,13 @@ static inline int pfn_valid(unsigned long pfn)
return 0;
/*
 * Traditionally early sections always returned pfn_valid() for
-* the entire section-sized span.
+* the entire section-sized span. Some archs might have holes in
+* early sections, so double check with memblock if configured.
 */
-   return early_section(ms) || pfn_section_valid(ms, pfn);
+   if (early_section(ms))
+   return IS_ENABLED(CONFIG_EARLY_SECTION_MEMMAP_HOLES) ?
+  memblock_is_map_memory(pfn << PAGE_SHIFT) : 1;
+   return pfn_section_valid(ms, pfn);
 }
 #endif



Which users are remaining that require us to add/remove memblocks when
hot(un)plugging memory?

 $ git grep KEEP_MEM | grep memory_hotplug
mm/memory_hotplug.c:if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
mm/memory_hotplug.c:if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
mm/memory_hotplug.c:if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {


I think one user we would have to handle is
arch/arm64/mm/mmap.c:valid_phys_addr_range(). AFAIS, powerpc at least
does not rely on memblock_is_map_memory.

-- 
Thanks,

David / dhildenb



Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread David Hildenbrand
> 
>>>
>>> Virtulization
>>> =
>>> Speed up VM creation and shorten guest boot time, especially for PCI
>>> SR-IOV device passthrough scenario. Compared with some of the para
>>> vitalization solutions, it is easy to deploy because it’s transparent
>>> to guest and can handle DMA properly in BIOS stage, while the para
>>> virtualization solution can’t handle it well.
>>
>> What is the "para virtualization" approach you are talking about?
> 
> I refer two topic in the KVM forum 2020, the doc can give more details :
> https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf
> https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf
> 
> and the flowing link is mine:
> https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf

Thanks for the pointers! I actually did watch your presentation.

>>
>>>
>>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory
>>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page
>>> to the VMM, VMM will unmap the corresponding host page for reclaim,
>>> when guest allocate a page just reclaimed, host will allocate a new page
>>> and zero it out for guest, in this case pre zero out free page will help
>>> to speed up the proccess of fault in and reduce the performance impaction.
>>
>> Such faults in the VMM are no different to other faults, when first
>> accessing a page to be populated. Again, I wonder how much of a
>> difference it actually makes.
>>
> 
> I am not just referring to faults in the VMM, I mean the whole process
> that handles guest page faults.
> without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero
> out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are
> reclaimed by the host and may return to the host buddy
> free list. When the pages are given back to the guest, the host kernel
> needs to zero out it again. It means
> with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be
> degraded for frequently
> zero out operation on host side. The performance degradation will be
> obvious for huge page case. Free
> page pre zero out can help to make guest memory performance almost the
> same as without
> VIRTIO_BALLOON_F_REPORTING.

Yes, what I am saying is that this fault handling is no different to
ordinary faults when accessing a virtual memory location the first time
and populating a page. The only difference is that it happens
continuously, not only the first time we touch a page.

And we might be able to improve handling in the hypervisor in the
future. We have been discussing using MADV_FREE instead of MADV_DONTNEED
in QEMU for handling free page reporting. Then, guest reported pages
will only get reclaimed by the hypervisor when there is actual memory
pressure in the hypervisor (e.g., when about to swap). And zeroing a
page is an obvious improvement over going to swap. The price for zeroing
pages has to be paid at some point.
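
Roughly (sketch only, not actual QEMU code):

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Discard a guest-reported range either immediately or lazily. */
static int discard_reported_range(void *hva, size_t len, bool lazy)
{
	/*
	 * MADV_DONTNEED drops the pages right away; the next guest access
	 * faults in a fresh, zeroed page. MADV_FREE only marks them as
	 * reclaimable: without memory pressure they are simply reused and
	 * no new page + zeroing is needed.
	 */
	return madvise(hva, len, lazy ? MADV_FREE : MADV_DONTNEED);
}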

Also note that we've been discussing cache-related things already. If
you zero out before giving the page to the guest, the page will already
be in the cache - where the guest directly wants to access it.

[...]

>>>
>>> Security
>>> 
>>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
>>> boot options", which zero out page in a asynchronous way. For users can't
>>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings,
>>> this feauture provide another choice.
>> "we don’t pre zero out all the free pages" so this is of little actual use.
> 
> OK. It seems none of the reasons listed above is strong enough for

I was rather saying that for security it's of little use IMHO.
Application/VM start up time might be improved by using huge pages (and
pre-zeroing these). Free page reporting might be improved by using
MADV_FREE instead of MADV_DONTNEED in the hypervisor.

> this feature, above all of them, which one is likely to become the
> most strong one?  From the implementation, you will find it is
> configurable, users don't want to use it can turn it off.  This is not
> an option?

Well, we have to maintain the feature and sacrifice a page flag. For
example, do we expect someone to explicitly enable the feature just to
speed up startup time of an app that consumes a lot of memory? I highly
doubt it.

I'd love to hear opinions of other people. (a lot of people are offline
until beginning of January, including, well, actually me :) )

-- 
Thanks,

David / dhildenb



Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-23 Thread David Hildenbrand
[...]

>> I was rather saying that for security it's of little use IMHO.
>> Application/VM start up time might be improved by using huge pages (and
>> pre-zeroing these). Free page reporting might be improved by using
>> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
>>
>>> this feature, above all of them, which one is likely to become the
>>> most strong one?  From the implementation, you will find it is
>>> configurable, users don't want to use it can turn it off.  This is not
>>> an option?
>>
>> Well, we have to maintain the feature and sacrifice a page flag. For
>> example, do we expect someone explicitly enabling the feature just to
>> speed up startup time of an app that consumes a lot of memory? I highly
>> doubt it.
> 
> In our production environment, there are three main applications have such
> requirement, one is QEMU [creating a VM with SR-IOV passthrough device],
> anther other two are DPDK related applications, DPDK OVS and SPDK vhost,
> for best performance, they populate memory when starting up. For SPDK vhost,
> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for
> vhost 'live' upgrade, which is done by killing the old process and
> starting a new
> one with the new binary. In this case, we want the new process started as 
> quick
> as possible to shorten the service downtime. We really enable this feature
> to speed up startup time for them  :)

Thanks for info on the use case!

All of these use cases either already use, or could use, huge pages
IMHO. It's not your ordinary proprietary gaming app :) This is where
pre-zeroing of huge pages could already help.

Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
creating a file and pre-zeroing it from another process, or am I missing
something important? At least for QEMU this should work AFAIK, where you
can just pass the file to use via memory-backend-file.
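
(Completely untested sketch of what I have in mind; huge page size
alignment etc. ignored, names made up:)

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Pre-zero (and thereby preallocate) a hugetlbfs/tmpfs file from a
 * helper process; QEMU would map the same file via memory-backend-file. */
int prezero_backing_file(const char *path, size_t size)
{
	void *addr;
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size)) {
		close(fd);
		return -1;
	}
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memset(addr, 0, size);	/* faults in + zeroes the pages */
	munmap(addr, size);
	close(fd);
	return 0;
}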

> 
>> I'd love to hear opinions of other people. (a lot of people are offline
>> until beginning of January, including, well, actually me :) )
> 
> OK. I will wait some time for others' feedback. Happy holidays!

To you too, cheers!


-- 
Thanks,

David / dhildenb



Re: [RFC V2 3/3] s390/mm: Define arch_get_mappable_range()

2020-12-07 Thread David Hildenbrand
On 07.12.20 05:38, Anshuman Khandual wrote:
> 
> 
> On 12/3/20 5:31 PM, David Hildenbrand wrote:
>> On 03.12.20 12:51, Heiko Carstens wrote:
>>> On Thu, Dec 03, 2020 at 06:03:00AM +0530, Anshuman Khandual wrote:
>>>>>> diff --git a/arch/s390/mm/extmem.c b/arch/s390/mm/extmem.c
>>>>>> index 5060956b8e7d..cc055a78f7b6 100644
>>>>>> --- a/arch/s390/mm/extmem.c
>>>>>> +++ b/arch/s390/mm/extmem.c
>>>>>> @@ -337,6 +337,11 @@ __segment_load (char *name, int do_nonshared, 
>>>>>> unsigned long *addr, unsigned long
>>>>>>  goto out_free_resource;
>>>>>>  }
>>>>>>  
>>>>>> +if (seg->end + 1 > VMEM_MAX_PHYS || seg->end + 1 < 
>>>>>> seg->start_addr) {
>>>>>> +rc = -ERANGE;
>>>>>> +goto out_resource;
>>>>>> +}
>>>>>> +
>>>>>>  rc = vmem_add_mapping(seg->start_addr, seg->end - 
>>>>>> seg->start_addr + 1);
>>>>>>  if (rc)
>>>>>>  goto out_resource;
>>>>>> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
>>>>>> index b239f2ba93b0..06dddcc0ce06 100644
>>>>>> --- a/arch/s390/mm/vmem.c
>>>>>> +++ b/arch/s390/mm/vmem.c
>>>>>> @@ -532,14 +532,19 @@ void vmem_remove_mapping(unsigned long start, 
>>>>>> unsigned long size)
>>>>>>  mutex_unlock(&vmem_mutex);
>>>>>>  }
>>>>>>  
>>>>>> +struct range arch_get_mappable_range(void)
>>>>>> +{
>>>>>> +struct range memhp_range;
>>>>>> +
>>>>>> +memhp_range.start = 0;
>>>>>> +memhp_range.end =  VMEM_MAX_PHYS;
>>>>>> +return memhp_range;
>>>>>> +}
>>>>>> +
>>>>>>  int vmem_add_mapping(unsigned long start, unsigned long size)
>>>>>>  {
>>>>>>  int ret;
>>>>>>  
>>>>>> -if (start + size > VMEM_MAX_PHYS ||
>>>>>> -start + size < start)
>>>>>> -return -ERANGE;
>>>>>> -
>>>>>
>>>>> I really fail to see how this could be considered an improvement for
>>>>> s390. Especially I do not like that the (central) range check is now
>>>>> moved to the caller (__segment_load). Which would mean potential
>>>>> additional future callers would have to duplicate that code as well.
>>>>
>>>> The physical range check is being moved to the generic hotplug code
>>>> via arch_get_mappable_range() instead, making the existing check in
>>>> vmem_add_mapping() redundant. Dropping the check there necessitates
>>>> adding back a similar check in __segment_load(). Otherwise there
>>>> will be a loss of functionality in terms of range check.
>>>>
>>>> May be we could just keep this existing check in vmem_add_mapping()
>>>> as well in order avoid this movement but then it would be redundant
>>>> check in every hotplug path.
>>>>
>>>> So I guess the choice is to either have redundant range checks in
>>>> all hotplug paths or future internal callers of vmem_add_mapping()
>>>> take care of the range check.
>>>
>>> The problem I have with this current approach from an architecture
>>> perspective: we end up having two completely different methods which
>>> are doing the same and must be kept in sync. This might be obvious
>>> looking at this patch, but I'm sure this will go out-of-sync (aka
>>> broken) sooner or later.
>>
>> Exactly, there should be one function only that was the whole idea of
>> arch_get_mappable_range().
>>
>>>
>>> Therefore I would really like to see a single method to do the range
>>> checking. Maybe you could add a callback into architecture code, so
>>> that such an architecture specific function could also be used
>>> elsewhere. Dunno.
>>>
>>
>> I think we can just switch to using "memhp_range_allowed()" here then
>> after implementing arch_get_mappable_range().
>>
>> Doesn't hurt to double check in vmem_add_mapping() - especially to keep

Re: 5.10 RC 7: grub2 out of memory

2020-12-07 Thread David Hildenbrand
On 07.12.20 10:16, Elias Carter wrote:
> I just compiled and installed 5.10 RC 7 and got a message from grub2:
> "out of memory, press any key to continue" shortly followed by a
> kernel panic (see attached screenshot).
> 
> The 5.4.0-56-generic kernel from Ubuntu works on my machine fine.
> 
> Things I have tried so far:
> - setting grub video mode to "console"
> - setting grub video resolution to 800x600
> - regenerating the initramfs for 5.10 RC 7
> - verifying that /boot has free space
> - changing boot mode from UEFI to legacy BIOS
> 
> I have attached the following:
> 1.) dmesg ran from using the 5.4 kernel (since I cant boot into 5.10 RC 7)
> 2.) my 5.10 RC 7 kernel .config
> 3.) screenshot of kernel panic after "out of memory" grub2 message
> 4.) my /etc/default/grub
> 
> Please let me know if you would like any more information or testing of 
> patches.

The kernel crashes because it's unable to mount root - I suspect the
initrd is not loaded, because I would have expected systemd messages
before trying to mount root (after loading additional drivers from the
initrd). I assume grub2 fails to load the (now too big?) initrd - you
could try compiling out debug symbols and see whether that helps.

What's the size difference between old vs. new kernel and old vs. new
initrd?


Similar report for aarch64 was at

https://bugzilla.redhat.com/show_bug.cgi?id=1615969

which turned out to be a grub2 issue. Which distro/grub2 version etc are
you running?

-- 
Thanks,

David / dhildenb



Re: [PATCH v3] mm/page_alloc: speeding up the iteration of max_order

2020-12-07 Thread David Hildenbrand
On 04.12.20 16:51, Muchun Song wrote:
> When we free a page whose order is very close to MAX_ORDER and greater
> than pageblock_order, it wastes some CPU cycles to increase max_order
> to MAX_ORDER one by one and check the pageblock migratetype of that page
> repeatedly especially when MAX_ORDER is much larger than pageblock_order.
> 
> We also should not be checking migratetype of buddy when "order ==
> MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition.
> With the new check, we don't need the max_order check anymore, so we
> replace it.
> 
> Also adjust max_order initialization so that it's lower by one than
> previously, which makes the code hopefully more clear.
> 
> Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and 
> other pageblocks")
> Signed-off-by: Muchun Song 
> Acked-by: Vlastimil Babka 
> ---
> Changes in v3:
>  - Update commit log.
> 
> Changes in v2:
>  - Rework the code suggested by Vlastimil. Thanks.
> 
>  mm/page_alloc.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f91df593bf71..56e603eea1dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1002,7 +1002,7 @@ static inline void __free_one_page(struct page *page,
>   struct page *buddy;
>   bool to_tail;
>  
> - max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
> + max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
>  
>   VM_BUG_ON(!zone_is_initialized(zone));
>   VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> @@ -1015,7 +1015,7 @@ static inline void __free_one_page(struct page *page,
>   VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  continue_merging:
> - while (order < max_order - 1) {
> + while (order < max_order) {
>   if (compaction_capture(capc, page, order, migratetype)) {
>   __mod_zone_freepage_state(zone, -(1 << order),
>   migratetype);
> @@ -1041,7 +1041,7 @@ static inline void __free_one_page(struct page *page,
>   pfn = combined_pfn;
>   order++;
>   }
> - if (max_order < MAX_ORDER) {
> + if (order < MAX_ORDER - 1) {
>   /* If we are here, it means order is >= pageblock_order.
>* We want to prevent merge between freepages on isolate
>* pageblock and normal pageblock. Without this, pageblock
> @@ -1062,7 +1062,7 @@ static inline void __free_one_page(struct page *page,
>   is_migrate_isolate(buddy_mt)))
>   goto done_merging;
>   }
> - max_order++;
> + max_order = order + 1;
>   goto continue_merging;
>   }
>  
> 


LGTM

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [RFC V2 00/37] Enhance memory utilization with DMEMFS

2020-12-07 Thread David Hildenbrand
On 07.12.20 12:30, yulei.ker...@gmail.com wrote:
> From: Yulei Zhang 
> 
> In the current system each physical memory page is associated with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> more and more remarkable. So is it possible that we could reclaim
> such memory and make it reusable?
> 
> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.

"is not managed by kernel" well, it's obviously is managed by the
kernel. It's not managed by the buddy ;)

How is this different to using "mem=X" and mapping the relevant memory
directly into applications? Is this "simply" a control instance on top
that makes sure unprivileged process can access it and not step onto
each others feet? Is that the reason why it's called  a "file system"?
(an example would have helped here, showing how it's used)

It's worth noting that memory hotunplug, memory poisoning and probably
more is currently fundamentally incompatible with this approach - which
should better be pointed out in the cover letter.

Also, I think something similar can be obtained by using dax/hmat
infrastructure with "memmap=", at least I remember a talk where this was
discussed (but not sure if they modified the firmware to expose selected
memory as soft-reserved - we would only need a cmdline parameter to
achieve the same - Dan might know more).

> 
> As the figure below shows, we use a kernel boot parameter 'dmem='
> to reserve the system memory when the host system boots up; the
> remaining system memory is still managed by system memory management
> which is associated with "struct page", the reserved memory
> will be managed by dmem and assigned to guest system, the details
> can be checked in /Documentation/admin-guide/kernel-parameters.txt.
> 
>+--+--+
>|  system memory   | memory for guest system  | 
>+--+--+
> |   |
> v   |
> struct page |
> |   |
> v   v
> system mem management dmem  
> 
> And during the usage, the dmemfs will handle the memory request to
> allocate and free the reserved memory on each NUMA node, the user 
> space application could leverage the mmap interface to access the 
> memory, and kernel modules such as kvm and vfio would be able to pin
> the memory through follow_pfn() and get_user_page() in different given
> page size granularities.

I cannot say that I really like this approach. I'd much prefer the
proposal to free up most vmemmap pages for huge/gigantic pages instead,
if reducing the memmap size is all this is about.


-- 
Thanks,

David / dhildenb



Re: [PATCH v7 01/15] mm/memory_hotplug: Move bootmem info registration API to bootmem_info.c

2020-12-07 Thread David Hildenbrand
On 30.11.20 16:18, Muchun Song wrote:
> Move bootmem info registration common API to individual bootmem_info.c
> for later patch use. This is just code movement without any functional
> change.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Mike Kravetz 
> Reviewed-by: Oscar Salvador 
> ---
>  arch/x86/mm/init_64.c  |  1 +
>  include/linux/bootmem_info.h   | 27 
>  include/linux/memory_hotplug.h | 23 --
>  mm/Makefile|  1 +
>  mm/bootmem_info.c  | 99 
> ++
>  mm/memory_hotplug.c| 91 +-
>  6 files changed, 129 insertions(+), 113 deletions(-)
>  create mode 100644 include/linux/bootmem_info.h
>  create mode 100644 mm/bootmem_info.c
> 

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb



Re: [PATCH v7 02/15] mm/memory_hotplug: Move {get,put}_page_bootmem() to bootmem_info.c

2020-12-07 Thread David Hildenbrand
On 30.11.20 16:18, Muchun Song wrote:
> In the later patch, we will use {get,put}_page_bootmem() to initialize
> the page for vmemmap or free vmemmap page to buddy. So move them out of
> CONFIG_MEMORY_HOTPLUG_SPARSE. This is just code movement without any
> functional change.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Mike Kravetz 
> Reviewed-by: Oscar Salvador 
> ---
>  arch/x86/mm/init_64.c  |  2 +-
>  include/linux/bootmem_info.h   | 13 +
>  include/linux/memory_hotplug.h |  4 
>  mm/bootmem_info.c  | 25 +
>  mm/memory_hotplug.c| 27 ---
>  mm/sparse.c|  1 +
>  6 files changed, 40 insertions(+), 32 deletions(-)
> 

I'd squash this into the previous patch and name it like

"mm/memory_hotplug: Factor out bootmem core functions to bootmem_info.c"


-- 
Thanks,

David / dhildenb



Re: [PATCH v7 03/15] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

2020-12-07 Thread David Hildenbrand
On 30.11.20 16:18, Muchun Song wrote:
> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
> whether to enable the feature of freeing unused vmemmap associated
> with HugeTLB pages. And this is just for dependency check. Now only
> support x86.

x86 - i386 and x86-64? (I assume the latter only ;) )

> 
> Signed-off-by: Muchun Song 
> ---
>  arch/x86/mm/init_64.c |  2 +-
>  fs/Kconfig| 14 ++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0a45f062826e..0435bee2e172 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>  
>  static void __init register_page_bootmem_info(void)
>  {
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>   int i;
>  

Why does this hunk belong into this patch? Looks like this should go
into another patch.

>   for_each_online_node(i)
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 976e8b9033c4..4961dd488444 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -245,6 +245,20 @@ config HUGETLBFS
>  config HUGETLB_PAGE
>   def_bool HUGETLBFS
>  
> +config HUGETLB_PAGE_FREE_VMEMMAP
> + def_bool HUGETLB_PAGE
> + depends on X86
> + depends on SPARSEMEM_VMEMMAP
> + depends on HAVE_BOOTMEM_INFO_NODE
> + help
> +   When using HUGETLB_PAGE_FREE_VMEMMAP, the system can save up some
> +   memory from pre-allocated HugeTLB pages when they are not used.
> +   6 pages per 2MB HugeTLB page and 4094 per 1GB HugeTLB page.

Calculations only apply to 4k base pages, no? (maybe generalize this a
bit or mention 4k base pages - I'm pretty sure we'll see the "depends on
X86" part go away fairly soon if this goes upstream)


-- 
Thanks,

David / dhildenb



Re: [PATCH v7 04/15] mm/hugetlb: Introduce nr_free_vmemmap_pages in the struct hstate

2020-12-07 Thread David Hildenbrand
On 30.11.20 16:18, Muchun Song wrote:
> Every HugeTLB has more than one struct page structure. The 2M HugeTLB
> has 512 struct page structure and 1G HugeTLB has 4096 struct page
> structures. We __know__ that we only use the first 4(HUGETLB_CGROUP_MIN_ORDER)
> struct page structures to store metadata associated with each HugeTLB.
> 
> There are a lot of struct page structures(8 page frames for 2MB HugeTLB
> page and 4096 page frames for 1GB HugeTLB page) associated with each
> HugeTLB page. For tail pages, the value of compound_head is the same.
> So we can reuse first page of tail page structures. We map the virtual
> addresses of the remaining pages of tail page structures to the first
> tail page struct, and then free these page frames. Therefore, we need
> to reserve two pages as vmemmap areas.
> 
> So we introduce a new nr_free_vmemmap_pages field in the hstate to
> indicate how many vmemmap pages associated with a HugeTLB page that we
> can free to buddy system.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Mike Kravetz 
> ---
>  include/linux/hugetlb.h |   3 ++
>  mm/Makefile |   1 +
>  mm/hugetlb.c|   3 ++
>  mm/hugetlb_vmemmap.c| 129 
> 
>  mm/hugetlb_vmemmap.h|  20 
>  5 files changed, 156 insertions(+)
>  create mode 100644 mm/hugetlb_vmemmap.c
>  create mode 100644 mm/hugetlb_vmemmap.h
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..4efeccb7192c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -492,6 +492,9 @@ struct hstate {
>   unsigned int nr_huge_pages_node[MAX_NUMNODES];
>   unsigned int free_huge_pages_node[MAX_NUMNODES];
>   unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> + unsigned int nr_free_vmemmap_pages;
> +#endif
>  #ifdef CONFIG_CGROUP_HUGETLB
>   /* cgroup control files */
>   struct cftype cgroup_files_dfl[7];
> diff --git a/mm/Makefile b/mm/Makefile
> index ed4b88fa0f5e..056801d8daae 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -71,6 +71,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
>  obj-$(CONFIG_ZSWAP)  += zswap.o
>  obj-$(CONFIG_HAS_DMA)+= dmapool.o
>  obj-$(CONFIG_HUGETLBFS)  += hugetlb.o
> +obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)  += hugetlb_vmemmap.o
>  obj-$(CONFIG_NUMA)   += mempolicy.o
>  obj-$(CONFIG_SPARSEMEM)  += sparse.o
>  obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f3bf1710b66..25f9e8e9fc4a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include "internal.h"
> +#include "hugetlb_vmemmap.h"
>  
>  int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
> @@ -3206,6 +3207,8 @@ void __init hugetlb_add_hstate(unsigned int order)
>   snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>   huge_page_size(h)/1024);
>  
> + hugetlb_vmemmap_init(h);
> +
>   parsed_hstate = h;
>  }
>  
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> new file mode 100644
> index ..51152e258f39
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.c
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Free some vmemmap pages of HugeTLB
> + *
> + * Copyright (c) 2020, Bytedance. All rights reserved.
> + *
> + * Author: Muchun Song 
> + *
> + * The struct page structures (page structs) are used to describe a physical
> + * page frame. By default, there is a one-to-one mapping from a page frame to
> + * it's corresponding page struct.
> + *
> + * The HugeTLB pages consist of multiple base page size pages and is 
> supported
> + * by many architectures. See hugetlbpage.rst in the Documentation directory
> + * for more details. On the x86 architecture, HugeTLB pages of size 2MB and 
> 1GB
> + * are currently supported. Since the base page size on x86 is 4KB, a 2MB
> + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> + * 4096 base pages. For each base page, there is a corresponding page struct.
> + *
> + * Within the HugeTLB subsystem, only the first 4 page structs are used to
> + * contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> + * provides this upper limit. The only 'useful' information in the remaining
> + * page structs is the compound_head field, and this field is the same for 
> all
> + * tail pages.
> + *
> + * By removing redundant page structs for HugeTLB pages, memory can returned 
> to
> + * the buddy allocator for other uses.
> + *
> + * When the system boot up, every 2M HugeTLB has 512 struct page structs 
> which
> + * size is 8 pages(sizeof(struct page) * 512 / PAGE_SIZE).


You should try to generalize all descriptions regarding differing base
page sizes. E.g., arm64 supports 4k, 16k, and 64k base pages.

[...]

> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifie

Re: [PATCH v7 05/15] mm/bootmem_info: Introduce {free,prepare}_vmemmap_page()

2020-12-07 Thread David Hildenbrand
On 30.11.20 16:18, Muchun Song wrote:
> In the later patch, we can use the free_vmemmap_page() to free the
> unused vmemmap pages and initialize a page for vmemmap use
> via prepare_vmemmap_page().
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/bootmem_info.h | 24 
>  1 file changed, 24 insertions(+)
> 
> diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> index 4ed6dee1adc9..239e3cc8f86c 100644
> --- a/include/linux/bootmem_info.h
> +++ b/include/linux/bootmem_info.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_BOOTMEM_INFO_H
>  
>  #include 
> +#include 
>  
>  /*
>   * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -22,6 +23,29 @@ void __init register_page_bootmem_info_node(struct 
> pglist_data *pgdat);
>  void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type);
>  void put_page_bootmem(struct page *page);
> +
> +static inline void free_vmemmap_page(struct page *page)
> +{
> + VM_WARN_ON(!PageReserved(page) || page_ref_count(page) != 2);
> +
> + /* bootmem page has reserved flag in the reserve_bootmem_region */
> + if (PageReserved(page)) {
> + unsigned long magic = (unsigned long)page->freelist;
> +
> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> + put_page_bootmem(page);
> + else
> + WARN_ON(1);
> + }
> +}
> +
> +static inline void prepare_vmemmap_page(struct page *page)
> +{
> + unsigned long section_nr = pfn_to_section_nr(page_to_pfn(page));
> +
> + get_page_bootmem(section_nr, page, SECTION_INFO);
> + mark_page_reserved(page);
> +}

Can you clarify in the description when exactly these functions are
called and on which type of pages?

Would indicating "bootmem" in the function names make it clearer what we
are dealing with?

E.g., any memory allocated via the memblock allocator and not via the
buddy will be marked reserved already in the memmap. It's unclear to me
why we need the mark_page_reserved() here - can you enlighten me? :)

-- 
Thanks,

David / dhildenb



Re: [External] Re: [PATCH v7 03/15] mm/hugetlb: Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

2020-12-07 Thread David Hildenbrand
On 07.12.20 13:42, Muchun Song wrote:
> On Mon, Dec 7, 2020 at 8:19 PM David Hildenbrand  wrote:
>>
>> On 30.11.20 16:18, Muchun Song wrote:
>>> The purpose of introducing HUGETLB_PAGE_FREE_VMEMMAP is to configure
>>> whether to enable the feature of freeing unused vmemmap associated
>>> with HugeTLB pages. And this is just for dependency check. Now only
>>> support x86.
>>
>> x86 - i386 and x86-64? (I assume the latter only ;) )
> 
> Yeah, you are right. Only the latter supports SPARSEMEM_VMEMMAP.
> 
>>
>>>
>>> Signed-off-by: Muchun Song 
>>> ---
>>>  arch/x86/mm/init_64.c |  2 +-
>>>  fs/Kconfig| 14 ++
>>>  2 files changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index 0a45f062826e..0435bee2e172 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -1225,7 +1225,7 @@ static struct kcore_list kcore_vsyscall;
>>>
>>>  static void __init register_page_bootmem_info(void)
>>>  {
>>> -#ifdef CONFIG_NUMA
>>> +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
>>>   int i;
>>>
>>
>> Why does this hunk belong into this patch? Looks like this should go
>> into another patch.
> 
> Of course it can. But Mike suggested that it is better to use it when
> introducing the new config, because this config depends on
> HAVE_BOOTMEM_INFO_NODE and register_page_bootmem_info()
> is aimed at registering bootmem info. So maybe it is reasonable from
> this point of view. What is your opinion?
>

Ah, I see. Maybe mention that in the patch description - stumbling over
this change with only "Introduce a new config HUGETLB_PAGE_FREE_VMEMMAP"
as context left me rather clueless.

>>
>>>   for_each_online_node(i)
>>> diff --git a/fs/Kconfig b/fs/Kconfig
>>> index 976e8b9033c4..4961dd488444 100644
>>> --- a/fs/Kconfig
>>> +++ b/fs/Kconfig
>>> @@ -245,6 +245,20 @@ config HUGETLBFS
>>>  config HUGETLB_PAGE
>>>   def_bool HUGETLBFS
>>>
>>> +config HUGETLB_PAGE_FREE_VMEMMAP
>>> + def_bool HUGETLB_PAGE
>>> + depends on X86
>>> + depends on SPARSEMEM_VMEMMAP
>>> + depends on HAVE_BOOTMEM_INFO_NODE
>>> + help
>>> +   When using HUGETLB_PAGE_FREE_VMEMMAP, the system can save up some
>>> +   memory from pre-allocated HugeTLB pages when they are not used.
>>> +   6 pages per 2MB HugeTLB page and 4094 per 1GB HugeTLB page.
>>
>> Calculations only apply to 4k base pages, no?
> 
> No, if the base page is not 4k, we also can free 6 pages.
> 
> For example:
> 
> If the base page size is 64k, the PMD huge page size is 512MB. We also

Note that 2MB huge pages on arm64 with 64k base pages are possible as
well. Also, I think powerpc always has 16MB huge pages, independent of
base page sizes.
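
Just to put rough numbers on it (assuming sizeof(struct page) == 64):

  2 MB hugepage,   4k base pages:   512 * 64 B =  32 KB of vmemmap = 8 x 4k pages  -> 6 freeable
  512 MB hugepage, 64k base pages: 8192 * 64 B = 512 KB of vmemmap = 8 x 64k pages -> 6 freeable
  2 MB hugepage,   64k base pages:   32 * 64 B =   2 KB of vmemmap -> less than one page, nothing to free
  16 MB hugepage,  64k base pages:  256 * 64 B =  16 KB of vmemmap -> less than one page, nothing to free

So the savings depend on both the hugepage size and the base page size,
not just on "2 MB vs. 1 GB".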


-- 
Thanks,

David / dhildenb



Re: [RFC V2 3/3] s390/mm: Define arch_get_mappable_range()

2020-12-08 Thread David Hildenbrand
>>
>> Both changes only make sense with an in-tree user. I'm planning on using
>> this functionality in virtio-mem code. I can pickup your patches, drop
>> the superfluous checks, and use it from virtio-mem code. Makes sense
>> (BTW, looks like we'll see aarch64 support for virtio-mem soon)?
> 
> I have not been following virtio-mem closely. But is there something pending
> on arm64 platform which prevents virtio-mem enablement ?

Regarding enablement, I expect things to be working out of the box
mostly. Jonathan is currently doing some testing and wants to send a
simple unlock patch once done. [1]


Now, there are some things to improve in the future. virtio-mem
adds/removes individual Linux memory blocks and logically plugs/unplugs
MAX_ORDER - 1/pageblock_order pages inside Linux memory blocks.

1. memblock

On arm64 and powerpc, we create/delete memblocks when adding/removing
memory, which is suboptimal (and the code is quite fragile as we don't
handle errors ...). Hotplugged memory never has holes, so we can tweak
relevant code to not check via the memblock api.

For example, pfn_valid() only has to check for memblock_is_map_memory()
in case of early_section() - otherwise it can just fall back to our
generic pfn_valid() function.
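
Roughly what I have in mind - an untested sketch only, using the generic
sparsemem helpers, not a proper patch:

int pfn_valid(unsigned long pfn)
{
	struct mem_section *ms;

	/* reject pfns that don't survive the phys <-> pfn round trip */
	if (PHYS_PFN(PFN_PHYS(pfn)) != pfn)
		return 0;
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	ms = __nr_to_section(pfn_to_section_nr(pfn));
	if (!valid_section(ms))
		return 0;
	/* hotplugged memory has no holes, the generic check is sufficient */
	if (!early_section(ms))
		return pfn_section_valid(ms, pfn);
	/* early (boot) sections can have holes, keep asking memblock */
	return memblock_is_map_memory(PFN_PHYS(pfn));
}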

2. MAX_ORDER - 1 / pageblock_order

With 64k base pages, virtio-mem can only logically plug/unplug in 512MB
granularity, which is sub-optimal and inflexible. 4/2MB would be much
better - however this would require always using 2MB THP on arm64 (IIRC
via "cont" bits). Luckily, only some distributions use 64k base pages as
default nowadays ... :)

3. Section size

virtio-mem benefits from small section sizes. Currently, we have 1G.
With 4k base pages we could easily reduce it to something what x86 has
(128 MB) - and I remember discussions regarding that already in other
(IIRC NVDIMM / DIMM) context. Again, with 64k base pages we cannot go
below 512 MB right now.

[1] https://lkml.kernel.org/r/20201125145659.4...@huawei.com

-- 
Thanks,

David / dhildenb



Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 12.01.21 14:40, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand  wrote:
>>
>> On 12.01.21 12:00, David Hildenbrand wrote:
>>> On 10.01.21 13:40, Muchun Song wrote:
>>>> If the refcount is one when it is migrated, it means that the page
>>>> was freed from under us. So we are done and do not need to migrate.
>>>>
>>>> This optimization is consistent with the regular pages, just like
>>>> unmap_and_move() does.
>>>>
>>>> Signed-off-by: Muchun Song 
>>>> Reviewed-by: Mike Kravetz 
>>>> Acked-by: Yang Shi 
>>>> ---
>>>>  mm/migrate.c | 6 ++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
>>>> get_new_page,
>>>>  return -ENOSYS;
>>>>  }
>>>>
>>>> +if (page_count(hpage) == 1) {
>>>> +/* page was freed from under us. So we are done. */
>>>> +putback_active_hugepage(hpage);
>>>> +return MIGRATEPAGE_SUCCESS;
>>>> +}
>>>> +
>>>>  new_hpage = get_new_page(hpage, private);
>>>>  if (!new_hpage)
>>>>  return -ENOMEM;
>>>>
>>>
>>> Question: What if called via alloc_contig_range() where we even want to
>>> "migrate" free pages, meaning, relocate it?
>>>
>>
>> To be more precise:
>>
>> a) We don't have dissolve_free_huge_pages() calls on the
>> alloc_contig_range() path. So we *need* migration IIUC.
> 
> Without this patch, if you want to migrate a HugeTLB page,
> the page is freed to the hugepage pool. With this patch,
> the page is also freed to the hugepage pool.
> I didn't see any difference. Am I missing something?

I am definitely not an expert on hugetlb pools, that's why I am asking.

Isn't it, that with your code, no new page is allocated - so
dissolve_free_huge_pages() might just refuse to dissolve due to
reservations, bailing out, no?

(as discussed, looks like alloc_contig_range() needs to be fixed to
handle this correctly)

-- 
Thanks,

David / dhildenb



Re: [External] Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 12.01.21 15:17, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 9:51 PM David Hildenbrand  wrote:
>>
>> On 12.01.21 14:40, Muchun Song wrote:
>>> On Tue, Jan 12, 2021 at 7:11 PM David Hildenbrand  wrote:
>>>>
>>>> On 12.01.21 12:00, David Hildenbrand wrote:
>>>>> On 10.01.21 13:40, Muchun Song wrote:
>>>>>> If the refcount is one when it is migrated, it means that the page
>>>>>> was freed from under us. So we are done and do not need to migrate.
>>>>>>
>>>>>> This optimization is consistent with the regular pages, just like
>>>>>> unmap_and_move() does.
>>>>>>
>>>>>> Signed-off-by: Muchun Song 
>>>>>> Reviewed-by: Mike Kravetz 
>>>>>> Acked-by: Yang Shi 
>>>>>> ---
>>>>>>  mm/migrate.c | 6 ++
>>>>>>  1 file changed, 6 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>>>>> index 4385f2fb5d18..a6631c4eb6a6 100644
>>>>>> --- a/mm/migrate.c
>>>>>> +++ b/mm/migrate.c
>>>>>> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
>>>>>> get_new_page,
>>>>>>  return -ENOSYS;
>>>>>>  }
>>>>>>
>>>>>> +if (page_count(hpage) == 1) {
>>>>>> +/* page was freed from under us. So we are done. */
>>>>>> +putback_active_hugepage(hpage);
>>>>>> +return MIGRATEPAGE_SUCCESS;
>>>>>> +}
>>>>>> +
>>>>>>  new_hpage = get_new_page(hpage, private);
>>>>>>  if (!new_hpage)
>>>>>>  return -ENOMEM;
>>>>>>
>>>>>
>>>>> Question: What if called via alloc_contig_range() where we even want to
>>>>> "migrate" free pages, meaning, relocate it?
>>>>>
>>>>
>>>> To be more precise:
>>>>
>>>> a) We don't have dissolve_free_huge_pages() calls on the
>>>> alloc_contig_range() path. So we *need* migration IIUC.
>>>
>>> Without this patch, if you want to migrate a HugeTLB page,
>>> the page is freed to the hugepage pool. With this patch,
>>> the page is also freed to the hugepage pool.
>>> I didn't see any difference. Am I missing something?
>>
>> I am definitely not an expert on hugetlb pools, that's why I am asking.
>>
>> Isn't it, that with your code, no new page is allocated - so
>> dissolve_free_huge_pages() might just refuse to dissolve due to
>> reservations, bailing out, no?
> 
> Without this patch, the new page can be allocated from the
> hugepage pool. The dissolve_free_huge_pages() also
> can refuse to dissolve due to reservations. Right?

Oh, you mean the migration target might be coming from the pool? I guess
yes, looking at alloc_migration_target()->alloc_huge_page_nodemask().

In that case, yes, I think we run into a similar issue already.

Instead of trying to allocate new huge pages in
dissolve_free_huge_pages() to "relocate free pages", we bail out.

This all feels kind of wrong. After we migrated a huge page we should
free it back to the buddy, so most of our machinery just keeps working
without caring about free huge pages.


I can see how your patch will not change the current (IMHO broken) behavior.

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-12 Thread David Hildenbrand
On 12.01.21 15:23, Michal Hocko wrote:
> On Tue 12-01-21 13:16:45, Michal Hocko wrote:
> [...]
>> Well, currently pool pages are not migrateable but you are right that
>> this is likely something that we will need to look into in the future
>> and this optimization would stand in the way.
> 
> After some more thinking I believe I was wrong in my last statement.
> This optimization shouldn't have any effect on pages on the pool as
> those stay at reference count 0 and they cannot be isolated either
> (clear_page_huge_active before it is enqueued).
> 
> That being said, the migration code would still have to learn about
> about this pages but that is out of scope of this discussion.
> 
> Sorry about the confusion from my side.
> 

At this point I am fairly confused about what's working and what's not :D

I think this will require more thought, on how to teach
alloc_contig_range() (and eventually in some cases offline_pages()?) to
do the right thing.

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 2/6] mm: Teach pfn_to_online_page() to consider subsection validity

2021-01-13 Thread David Hildenbrand
On 13.01.21 08:35, Dan Williams wrote:
> pfn_section_valid() determines pfn validity on subsection granularity
> where pfn_valid() may be limited to coarse section granularity.
> Explicitly validate subsections after pfn_valid() succeeds.
> 
> Fixes: b13bc35193d9 ("mm/hotplug: invalid PFNs from pfn_to_online_page()")
> Cc: Qian Cai 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  mm/memory_hotplug.c |   24 
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 55a69d4396e7..9f37f8a68da4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -308,11 +308,27 @@ static int check_hotplug_memory_addressable(unsigned 
> long pfn,
>  struct page *pfn_to_online_page(unsigned long pfn)
>  {
>   unsigned long nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + if (nr >= NR_MEM_SECTIONS)
> + return NULL;
> +
> + ms = __nr_to_section(nr);
> + if (!online_section(ms))
> + return NULL;
> +
> + /*
> +  * Save some code text when online_section() +
> +  * pfn_section_valid() are sufficient.
> +  */
> + if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID))
> + if (!pfn_valid(pfn))
> + return NULL;

Nit:

if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) &&
!pfn_valid(pfn))

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 5/6] mm: Fix memory_failure() handling of dax-namespace metadata

2021-01-13 Thread David Hildenbrand
On 13.01.21 08:35, Dan Williams wrote:
> Given 'struct dev_pagemap' spans both data pages and metadata pages be
> careful to consult the altmap if present to delineate metadata. In fact
> the pfn_first() helper already identifies the first valid data pfn, so
> export that helper for other code paths via pgmap_pfn_valid().
> 
> Other usages of get_dev_pagemap() are not a concern because those are
> operating on known data pfns having been looked up by get_user_pages().
> I.e. metadata pfns are never user mapped.
> 
> Cc: Naoya Horiguchi 
> Cc: Andrew Morton 
> Reported-by: David Hildenbrand 
> Signed-off-by: Dan Williams 
> ---
>  include/linux/memremap.h |6 ++
>  mm/memory-failure.c  |6 ++
>  mm/memremap.c|   15 +++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..f5b464daeeca 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct 
> dev_pagemap *pgmap);
>  void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   struct dev_pagemap *pgmap);
> +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> @@ -165,6 +166,11 @@ static inline struct dev_pagemap 
> *get_dev_pagemap(unsigned long pfn,
>   return NULL;
>  }
>  
> +static inline bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long 
> pfn)
> +{
> + return false;
> +}
> +
>  static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
>  {
>   return 0;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 78b173c7190c..541569cb4a99 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1308,6 +1308,12 @@ static int memory_failure_dev_pagemap(unsigned long 
> pfn, int flags,
>*/
>   put_page(page);
>  
> + /* device metadata space is not recoverable */
> + if (!pgmap_pfn_valid(pgmap, pfn)) {
> + rc = -ENXIO;
> + goto out;
> + }
> +
>   /*
>* Prevent the inode from being freed while we are interrogating
>* the address_space, typically this would be handled by
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 16b2fb482da1..2455bac89506 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -80,6 +80,21 @@ static unsigned long pfn_first(struct dev_pagemap *pgmap, 
> int range_id)
>   return pfn + vmem_altmap_offset(pgmap_altmap(pgmap));
>  }
>  
> +bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn)
> +{
> + int i;
> +
> + for (i = 0; i < pgmap->nr_range; i++) {
> + struct range *range = &pgmap->ranges[i];
> +
> + if (pfn >= PHYS_PFN(range->start) &&
> + pfn <= PHYS_PFN(range->end))
> + return pfn >= pfn_first(pgmap, i);
> + }
> +
> + return false;
> +}
> +
>  static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>  {
>   const struct range *range = &pgmap->ranges[range_id];
> 

LGTM

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 2/4] mm: failfast mode with __GFP_NORETRY in alloc_contig_range

2021-01-13 Thread David Hildenbrand
On 13.01.21 02:21, Minchan Kim wrote:
> Contiguous memory allocation can be stalled due to waiting
> on page writeback and/or page lock which causes unpredictable
> delay. It's a unavoidable cost for the requestor to get *big*
> contiguous memory but it's expensive for *small* contiguous
> memory(e.g., order-4) because caller could retry the request
> in diffrent range where would have easy migratable pages
> without stalling.

s/diffrent/different/

> 
> This patch introduce __GFP_NORETRY as compaction gfp_mask in
> alloc_contig_range so it will fail fast without blocking
> when it encounters pages needed waitting.

s/waitting/waiting/

> 
> Signed-off-by: Minchan Kim 
> ---
>  mm/page_alloc.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b3923db9158..ff41ceb4db51 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8489,12 +8489,16 @@ static int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   unsigned int nr_reclaimed;
>   unsigned long pfn = start;
>   unsigned int tries = 0;
> + unsigned int max_tries = 5;
>   int ret = 0;
>   struct migration_target_control mtc = {
>   .nid = zone_to_nid(cc->zone),
>   .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
>   };
>  
> + if (cc->alloc_contig && cc->mode == MIGRATE_ASYNC)
> + max_tries = 1;
> +
>   migrate_prep();
>  
>   while (pfn < end || !list_empty(&cc->migratepages)) {
> @@ -8511,7 +8515,7 @@ static int __alloc_contig_migrate_range(struct 
> compact_control *cc,
>   break;
>   }
>   tries = 0;
> - } else if (++tries == 5) {
> + } else if (++tries == max_tries) {
>   ret = ret < 0 ? ret : -EBUSY;
>   break;
>   }
> @@ -8562,7 +8566,7 @@ int alloc_contig_range(unsigned long start, unsigned 
> long end,
>   .nr_migratepages = 0,
>   .order = -1,
>   .zone = page_zone(pfn_to_page(start)),
> - .mode = MIGRATE_SYNC,
> + .mode = gfp_mask & __GFP_NORETRY ? MIGRATE_ASYNC : MIGRATE_SYNC,
>   .ignore_skip_hint = true,
>   .no_set_skip_hint = true,
>   .gfp_mask = current_gfp_context(gfp_mask),
> 

I'm fine with using gfp flags (e.g., __GFP_NORETRY) as long as they
don't enable other implicit behavior (e.g., move draining X to the
caller) that's hard to get from the flag name.

IMHO, if we ever want to move draining to the caller, or change the
behavior of alloc_contig_range() in different ways (e.g., disable PCP),
we won't get around introducing a separate set of flags for
alloc_contig_range().
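
Something along these lines - purely hypothetical, the names are made up
just to illustrate the idea:

/* alloc_contig_range()-specific behavior flags, independent of the gfp_mask */
#define ACR_FLAG_NORETRY	(1 << 0)	/* fail fast instead of retrying migration */
#define ACR_FLAG_NO_DRAIN	(1 << 1)	/* the caller takes care of draining LRU/PCP */

int alloc_contig_range(unsigned long start, unsigned long end,
		       unsigned int migratetype, gfp_t gfp_mask,
		       unsigned int acr_flags);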

Let's see what Michal thinks. Thanks!

-- 
Thanks,

David / dhildenb



Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-13 Thread David Hildenbrand
On 13.01.21 04:31, Linus Torvalds wrote:
> On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox  wrote:
>>
>> The thing about the speculative page cache references is that they can
>> temporarily bump a refcount on a page which _used_ to be in the page
>> cache and has now been reallocated as some other kind of page.
> 
> Oh, and thinking about this made me think we might actually have a
> serious bug here, and it has nothing what-so-ever to do with COW, GUP,
> or even the page count itself.
> 
> It's unlikely enough that I think it's mostly theoretical, but tell me
> I'm wrong.
> 
> PLEASE tell me I'm wrong:
> 
> CPU1 does page_cache_get_speculative under RCU lock
> 
> CPU2 frees and re-uses the page
> 
> CPU1CPU2
> 
> 
> page = xas_load(&xas);
> if (!page_cache_get_speculative(page))
> goto repeat;
> .. succeeds ..
> 
> remove page from XA
> release page
> reuse for something else
> 
> .. and then re-check ..
> if (unlikely(page != xas_reload(&xas))) {
> put_page(page);
> goto repeat;
> }
> 
> ok, the above all looks fine. We got the speculative ref, but then we
> noticed that its' not valid any more, so we put it again. All good,
> right?
> 
> Wrong.
> 
> What if that "reuse for something else" was actually really quick, and
> both allocated and released it?
> 
> That still sounds good, right? Yes, now the "put_page()" will be the
> one that _actually_ releases the page, but we're still fine, right?
> 
> Very very wrong.
> 
> The "reuse for something else" on CPU2 might have gotten not an
> order-0 page, but a *high-order* page. So it allocated (and then
> immediately free'd) maybe an order-2 allocation with _four_ pages, and
> the re-use happened when we had coalesced the buddy pages.
> 
> But when we release the page on CPU1, we will release just _one_ page,
> and the other three pages will be lost forever.
> 
> IOW, we restored the page count perfectly fine, but we screwed up the
> page sizes and buddy information.
> 
> Ok, so the above is so unlikely from a timing standpoint that I don't
> think it ever happens, but I don't see why it couldn't happen in
> theory.
> 
> Please somebody tell me I'm missing some clever thing we do to make
> sure this can actually not happen..

Wasn't that tackled by the latest (not merged AFAIK) __free_pages() changes?

I'm only able to come up with the doc update, not with the original
fix/change

https://lkml.kernel.org/r/20201027025523.3235-1-wi...@infradead.org

-- 
Thanks,

David / dhildenb



Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

2021-01-13 Thread David Hildenbrand
On 13.01.21 09:52, David Hildenbrand wrote:
> On 13.01.21 04:31, Linus Torvalds wrote:
>> On Tue, Jan 12, 2021 at 6:16 PM Matthew Wilcox  wrote:
>>>
>>> The thing about the speculative page cache references is that they can
>>> temporarily bump a refcount on a page which _used_ to be in the page
>>> cache and has now been reallocated as some other kind of page.
>>
>> Oh, and thinking about this made me think we might actually have a
>> serious bug here, and it has nothing what-so-ever to do with COW, GUP,
>> or even the page count itself.
>>
>> It's unlikely enough that I think it's mostly theoretical, but tell me
>> I'm wrong.
>>
>> PLEASE tell me I'm wrong:
>>
>> CPU1 does page_cache_get_speculative under RCU lock
>>
>> CPU2 frees and re-uses the page
>>
>> CPU1CPU2
>> 
>>
>> page = xas_load(&xas);
>> if (!page_cache_get_speculative(page))
>> goto repeat;
>> .. succeeds ..
>>
>> remove page from XA
>> release page
>> reuse for something else
>>
>> .. and then re-check ..
>> if (unlikely(page != xas_reload(&xas))) {
>> put_page(page);
>> goto repeat;
>> }
>>
>> ok, the above all looks fine. We got the speculative ref, but then we
>> noticed that its' not valid any more, so we put it again. All good,
>> right?
>>
>> Wrong.
>>
>> What if that "reuse for something else" was actually really quick, and
>> both allocated and released it?
>>
>> That still sounds good, right? Yes, now the "put_page()" will be the
>> one that _actually_ releases the page, but we're still fine, right?
>>
>> Very very wrong.
>>
>> The "reuse for something else" on CPU2 might have gotten not an
>> order-0 page, but a *high-order* page. So it allocated (and then
>> immediately free'd) maybe an order-2 allocation with _four_ pages, and
>> the re-use happened when we had coalesced the buddy pages.
>>
>> But when we release the page on CPU1, we will release just _one_ page,
>> and the other three pages will be lost forever.
>>
>> IOW, we restored the page count perfectly fine, but we screwed up the
>> page sizes and buddy information.
>>
>> Ok, so the above is so unlikely from a timing standpoint that I don't
>> think it ever happens, but I don't see why it couldn't happen in
>> theory.
>>
>> Please somebody tell me I'm missing some clever thing we do to make
>> sure this can actually not happen..
> 
> Wasn't that tackled by the latest (not merged AFAIK) __free_pages() changes?
> 
> I'm only able to come up with the doc update, not with the original
> fix/change
> 
> https://lkml.kernel.org/r/20201027025523.3235-1-wi...@infradead.org
> 

Sorry, found it, it's in v5.10

commit e320d3012d25b1fb5f3df4edb7bd44a1c362ec10
Author: Matthew Wilcox (Oracle) 
Date:   Tue Oct 13 16:56:04 2020 -0700

mm/page_alloc.c: fix freeing non-compound pages

and

commit 7f194fbb2dd75e9346b305b8902e177b423b1062
Author: Matthew Wilcox (Oracle) 
Date:   Mon Dec 14 19:11:09 2020 -0800

mm/page_alloc: add __free_pages() documentation

is in v5.11-rc1

-- 
Thanks,

David / dhildenb



Re: [PATCH v4 1/6] mm: migrate: do not migrate HugeTLB page whose refcount is one

2021-01-13 Thread David Hildenbrand
On 13.01.21 06:22, Muchun Song wrote:
> All pages isolated for the migration have an elevated reference count
> and therefore seeing a reference count equal to 1 means that the last
> user of the page has dropped the reference and the page has become
> unused and it doesn't make much sense to migrate it anymore. This has
> been done for regular pages and this patch does the same for hugetlb
> pages. Although the likelihood of the race is rather small for hugetlb
> pages, it makes sense to keep the two code paths in sync.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Mike Kravetz 
> Acked-by: Yang Shi 
> Acked-by: Michal Hocko 
> ---
>  mm/migrate.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4385f2fb5d18..a6631c4eb6a6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1279,6 +1279,12 @@ static int unmap_and_move_huge_page(new_page_t 
> get_new_page,
>   return -ENOSYS;
>   }
>  
> + if (page_count(hpage) == 1) {
> + /* page was freed from under us. So we are done. */
> + putback_active_hugepage(hpage);
> + return MIGRATEPAGE_SUCCESS;
> + }
> +
>   new_hpage = get_new_page(hpage, private);
>   if (!new_hpage)
>       return -ENOMEM;
> 

Happy to say that I now understand what's going on here

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH v3 1/2] x86/setup: don't remove E820_TYPE_RAM for pfn 0

2021-01-13 Thread David Hildenbrand
On 11.01.21 20:40, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The first 4Kb of memory is a BIOS owned area and to avoid its allocation
> for the kernel it was not listed in e820 tables as memory. As the result,
> pfn 0 was never recognised by the generic memory management and it is not a
> part of neither node 0 nor ZONE_DMA.
> 
> If set_pfnblock_flags_mask() would be ever called for the pageblock
> corresponding to the first 2Mbytes of memory, having pfn 0 outside of
> ZONE_DMA would trigger
> 
>   VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> Along with reserving the first 4Kb in e820 tables, several first pages are
> reserved with memblock in several places during setup_arch(). These
> reservations are enough to ensure the kernel does not touch the BIOS area
> and it is not necessary to remove E820_TYPE_RAM for pfn 0.
> 
> Remove the update of e820 table that changes the type of pfn 0 and move the
> comment describing why it was done to trim_low_memory_range() that reserves
> the beginning of the memory.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/x86/kernel/setup.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 740f3bdb3f61..3412c4595efd 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -660,17 +660,6 @@ static void __init trim_platform_memory_ranges(void)
>  
>  static void __init trim_bios_range(void)
>  {
> - /*
> -  * A special case is the first 4Kb of memory;
> -  * This is a BIOS owned area, not kernel ram, but generally
> -  * not listed as such in the E820 table.
> -  *
> -  * This typically reserves additional memory (64KiB by default)
> -  * since some BIOSes are known to corrupt low memory.  See the
> -  * Kconfig help text for X86_RESERVE_LOW.
> -  */
> - e820__range_update(0, PAGE_SIZE, E820_TYPE_RAM, E820_TYPE_RESERVED);
> -
>   /*
>* special case: Some BIOSes report the PC BIOS
>* area (640Kb -> 1Mb) as RAM even though it is not.
> @@ -728,6 +717,15 @@ early_param("reservelow", parse_reservelow);
>  
>  static void __init trim_low_memory_range(void)
>  {
> + /*
> +  * A special case is the first 4Kb of memory;
> +  * This is a BIOS owned area, not kernel ram, but generally
> +  * not listed as such in the E820 table.
> +  *
> +  * This typically reserves additional memory (64KiB by default)
> +  * since some BIOSes are known to corrupt low memory.  See the
> +  * Kconfig help text for X86_RESERVE_LOW.
> +  */
>   memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
>  }
>   
> 

The only somewhat-confusing thing is that in-between
e820__memblock_setup() and trim_low_memory_range(), we already have
memblock allocations. So [0..4095] might look like ordinary memory until
we reserve it later on.

E.g., reserve_real_mode() does a

mem = memblock_find_in_range(0, 1<<20, size, PAGE_SIZE);
...
memblock_reserve(mem, size);
set_real_mode_mem(mem);

which looks kind of suspicious to me. Most probably I am missing
something, just wanted to point that out. We might want to do such
trimming/adjustments before any kind of allocations.

-- 
Thanks,

David / dhildenb



Re: [PATCH] hugetlbfs: Use helper macro default_hstate in init_hugetlbfs_fs

2021-01-18 Thread David Hildenbrand
On 16.01.21 10:18, Miaohe Lin wrote:
> Since commit e5ff215941d5 ("hugetlb: multiple hstates for multiple page
> sizes"), we can use macro default_hstate to get the struct hstate which
> we use by default. But init_hugetlbfs_fs() forgot to use it.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 9b221b87fbea..88751e35e69d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1544,7 +1544,7 @@ static int __init init_hugetlbfs_fs(void)
>   goto out_free;
>  
>   /* default hstate mount is required */
> - mnt = mount_one_hugetlbfs(&hstates[default_hstate_idx]);
> + mnt = mount_one_hugetlbfs(&default_hstate);
>   if (IS_ERR(mnt)) {
>   error = PTR_ERR(mnt);
>   goto out_unreg;
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH] hugetlbfs: remove meaningless variable avoid_reserve

2021-01-18 Thread David Hildenbrand
On 16.01.21 10:26, Miaohe Lin wrote:
> The variable avoid_reserve is meaningless because we never changed its
> value and just passed it to alloc_huge_page(). So remove it to make it
> clearer that in hugetlbfs_fallocate(), we never avoid reserves when
> allocating a hugepage.
> 
> Signed-off-by: Miaohe Lin 
> ---
>  fs/hugetlbfs/inode.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 88751e35e69d..23ad6ed8b75f 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -680,7 +680,6 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>*/
>   struct page *page;
>   unsigned long addr;
> - int avoid_reserve = 0;
>  
>   cond_resched();
>  
> @@ -717,7 +716,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
> mode, loff_t offset,
>   }
>  
>   /* Allocate page and add to page cache */
> - page = alloc_huge_page(&pseudo_vma, addr, avoid_reserve);
> + page = alloc_huge_page(&pseudo_vma, addr, 0);
>   hugetlb_drop_vma_policy(&pseudo_vma);
>   if (IS_ERR(page)) {
>   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH V3 1/3] mm/memory_hotplug: Prevalidate the address range being added with platform

2021-01-19 Thread David Hildenbrand
On 18.01.21 14:12, Anshuman Khandual wrote:
> This introduces memhp_range_allowed() which can be called in various memory
> hotplug paths to prevalidate the address range which is being added, with
> the platform. Then memhp_range_allowed() calls memhp_get_pluggable_range()
> which provides applicable address range depending on whether linear mapping
> is required or not. For ranges that require linear mapping, it calls a new
> arch callback arch_get_mappable_range() which the platform can override. So
> the new callback, in turn provides the platform an opportunity to configure
> acceptable memory hotplug address ranges in case there are constraints.
> 
> This mechanism will help prevent platform specific errors deep down during
> hotplug calls. This drops now redundant check_hotplug_memory_addressable()
> check in __add_pages() but instead adds a VM_BUG_ON() check which would

In this patch, you keep the __add_pages() checks. But as discussed, we
could perform that check in mm/memremap.c:pagemap_range() instead and convert it
to a VM_BUG_ON().
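
I.e., something like this (untested, just to illustrate what I mean;
whether the third parameter has to be true or false there depends on
whether a linear mapping is required for the pgmap type):

/* mm/memremap.c:pagemap_range(), before actually adding the pages */
if (!memhp_range_allowed(range->start, range_len(range), false))
	return -ERANGE;

/* ... and in __add_pages(), replacing check_hotplug_memory_addressable() */
VM_BUG_ON(!memhp_range_allowed(PFN_PHYS(pfn), nr_pages * PAGE_SIZE, false));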

Apart from that looks good to me.

-- 
Thanks,

David / dhildenb



Re: [PATCH V3 2/3] arm64/mm: Define arch_get_mappable_range()

2021-01-19 Thread David Hildenbrand
On 18.01.21 14:13, Anshuman Khandual wrote:
> This overrides arch_get_mappable_range() on arm64 platform which will be
> used with recently added generic framework. It drops inside_linear_region()
> and subsequent check in arch_add_memory() which are no longer required. It
> also adds a VM_BUG_ON() check that would ensure that memhp_range_allowed()
> has already been called.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Ard Biesheuvel 
> Cc: Mark Rutland 
> Cc: David Hildenbrand 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/arm64/mm/mmu.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ae0c3d023824..f2e1770c9f29 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1442,16 +1442,19 @@ static void __remove_pgd_mapping(pgd_t *pgdir, 
> unsigned long start, u64 size)
>   free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
>  }
>  
> -static bool inside_linear_region(u64 start, u64 size)
> +struct range arch_get_mappable_range(void)
>  {
> + struct range memhp_range;
> +
>   /*
>* Linear mapping region is the range [PAGE_OFFSET..(PAGE_END - 1)]
>* accommodating both its ends but excluding PAGE_END. Max physical
>* range which can be mapped inside this linear mapping range, must
>* also be derived from its end points.
>*/
> - return start >= __pa(_PAGE_OFFSET(vabits_actual)) &&
> -(start + size - 1) <= __pa(PAGE_END - 1);
> + memhp_range.start = __pa(_PAGE_OFFSET(vabits_actual));
> + memhp_range.end =  __pa(PAGE_END - 1);
> + return memhp_range;
>  }
>  
>  int arch_add_memory(int nid, u64 start, u64 size,
> @@ -1459,11 +1462,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  {
>   int ret, flags = 0;
>  
> - if (!inside_linear_region(start, size)) {
> - pr_err("[%llx %llx] is outside linear mapping region\n", start, 
> start + size);
> - return -EINVAL;
> - }
> -
> + VM_BUG_ON(!memhp_range_allowed(start, size, true));
>   if (rodata_full || debug_pagealloc_enabled())
>   flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH RFC] virtio-mem: check against memhp_get_pluggable_range() which memory we can hotplug

2021-01-19 Thread David Hildenbrand
On 18.01.21 14:21, Anshuman Khandual wrote:
> 
> 
> On 1/18/21 6:43 PM, Anshuman Khandual wrote:
>> From: David Hildenbrand 
>>
>> Right now, we only check against MAX_PHYSMEM_BITS - but turns out there
>> are more restrictions of which memory we can actually hotplug, especially
>> on arm64 or s390x once we support them: we might receive something like
>> -E2BIG or -ERANGE from add_memory_driver_managed(), stopping device
>> operation.
>>
>> So, check right when initializing the device which memory we can add,
>> warning the user. Try only adding actually pluggable ranges: in the worst
>> case, no memory provided by our device is pluggable.
>>
>> In the usual case, we expect all device memory to be pluggable, and in
>> corner cases only some memory at the end of the device-managed memory
>> region to not be pluggable.
>>
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: Pankaj Gupta 
>> Cc: Michal Hocko 
>> Cc: Oscar Salvador 
>> Cc: Wei Yang 
>> Cc: Andrew Morton 
>> Cc: catalin.mari...@arm.com
>> Cc: teawater 
>> Cc: Anshuman Khandual 
>> Cc: Pankaj Gupta 
>> Cc: Jonathan Cameron 
>> Cc: h...@linux.ibm.com
>> Cc: Vasily Gorbik 
>> Cc: Will Deacon 
>> Cc: Ard Biesheuvel 
>> Cc: Mark Rutland 
>> Cc: Heiko Carstens 
>> Cc: Michal Hocko 
>> Signed-off-by: David Hildenbrand 
>> Signed-off-by: Anshuman Khandual 
> 
> Hello David,
> 
> As your original patch was in the RFC state, I have just maintained
> the same here as well. But once you test this patch along with the
> new series, please do let me know if this needs to be converted to
> a normal PATCH instead. Thank you.

I'll give it a churn on x86-64, where not that much should change. It
will be interesting to test with arm64 in such corner cases in the future.

Thanks

-- 
Thanks,

David / dhildenb



Re: [PATCH V3 3/3] s390/mm: Define arch_get_mappable_range()

2021-01-19 Thread David Hildenbrand
On 18.01.21 14:13, Anshuman Khandual wrote:
> This overrides arch_get_mappable_range() on s390 platform which will be
> used with recently added generic framework. It modifies the existing range
> check in vmem_add_mapping() using arch_get_mappable_range(). It also adds a
> VM_BUG_ON() check that would ensure that memhp_range_allowed() has already
> been called on the hotplug path.
> 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: David Hildenbrand 
> Cc: linux-s...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Heiko Carstens 
> Signed-off-by: Anshuman Khandual 
> ---
>  arch/s390/mm/init.c |  1 +
>  arch/s390/mm/vmem.c | 15 ++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 73a163065b95..97017a4bcc90 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -297,6 +297,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
>   return -EINVAL;
>  
> + VM_BUG_ON(!memhp_range_allowed(start, size, true));
>   rc = vmem_add_mapping(start, size);
>   if (rc)
>   return rc;
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 01f3a5f58e64..afc39ff1cc8d 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -4,6 +4,7 @@
>   *Author(s): Heiko Carstens 
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -532,11 +533,23 @@ void vmem_remove_mapping(unsigned long start, unsigned 
> long size)
>   mutex_unlock(&vmem_mutex);
>  }
>  
> +struct range arch_get_mappable_range(void)
> +{
> + struct range memhp_range;

You could do:

memhp_range = {
.start = 0,
.end =  VMEM_MAX_PHYS - 1,
};

Similar in the arm64 patch.
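
Spelled out, that's just your code with an initializer (no functional
change intended):

    struct range arch_get_mappable_range(void)
    {
        struct range memhp_range = {
            .start = 0,
            .end   = VMEM_MAX_PHYS - 1,
        };

        return memhp_range;
    }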

> +
> + memhp_range.start = 0;
> + memhp_range.end =  VMEM_MAX_PHYS - 1;
> + return memhp_range;
> +}
> +
>  int vmem_add_mapping(unsigned long start, unsigned long size)
>  {
> + struct range range;
>   int ret;
>  
> - if (start + size > VMEM_MAX_PHYS ||
> + range = arch_get_mappable_range();

You could do

struct range range = arch_get_mappable_range();

> + if (start < range.start ||
> + start + size > range.end + 1 ||
>   start + size < start)
>   return -ERANGE;
>  
> 


-- 
Thanks,

David / dhildenb



Re: [PATCH V3 0/3] mm/memory_hotplug: Pre-validate the address range with platform

2021-01-19 Thread David Hildenbrand
On 18.01.21 14:12, Anshuman Khandual wrote:
> This series adds a mechanism allowing platforms to weigh in and prevalidate
> incoming address range before proceeding further with the memory hotplug.
> This helps prevent potential platform errors for the given address range,
> down the hotplug call chain, which inevitably fails the hotplug itself.
> 
> This mechanism was suggested by David Hildenbrand during another discussion
> with respect to a memory hotplug fix on arm64 platform.
> 
> https://lore.kernel.org/linux-arm-kernel/1600332402-30123-1-git-send-email-anshuman.khand...@arm.com/
> 
> This mechanism focuses on the addressability aspect and not [sub] section
> alignment aspect. Hence check_hotplug_memory_range() and check_pfn_span()
> have been left unchanged. Wondering if all these can still be unified in
> an expanded memhp_range_allowed() check, that can be called from multiple
> memory hot add and remove paths.
> 
> This series applies on v5.11-rc4 and has been tested on arm64. But only
> build tested on s390.
> 
> Changes in V3
> 
> - Updated the commit message in [PATCH 1/3]
> - Replaced 1 with 'true' and 0 with 'false' in memhp_range_allowed()
> - Updated memhp_range.end as VMEM_MAX_PHYS - 1 and updated vmem_add_mapping() 
> on s390
> - Changed memhp_range_allowed() behaviour in __add_pages()
> - Updated __add_pages() to return E2BIG when memhp_range_allowed() fails for 
> non-linear mapping based requests

Minor thing, we should make up our mind if we want to call stuff
internally "memhp_" or "mhp". I prefer the latter, because it is shorter.


-- 
Thanks,

David / dhildenb



Re: [PATCH 2/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-01-19 Thread David Hildenbrand
On 17.12.20 14:07, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]   = Initialized and sent to buddy
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador 

IIRC, there is a conflict with the hpage vmemmap freeing patch set,
right? How are we going to handle that?

[...]

> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 439a89e758d8..f09d60ef8ad9 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -30,6 +30,11 @@ struct memory_block {
>   int phys_device;/* to which fru does this belong? */
>   struct device dev;
>   int nid;/* NID for this memory block */
> + unsigned long nr_vmemmap_pages; /*
> +  * Number of vmemmap pages. These pages
> +  * lay at the beginning of the memory
> +  * block.
> +  */

I suggest moving this comment above nr_vmemmap_pages.
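
I.e.:

    /*
     * Number of vmemmap pages. These pages lay at the
     * beginning of the memory block.
     */
    unsigned long nr_vmemmap_pages;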

>  };
>  
>  int arch_get_memory_phys_device(unsigned long start_pfn);
> @@ -81,7 +86,8 @@ static inline int memory_notify(unsigned long val, void *v)
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> -int create_memory_block_devices(unsigned long start, unsigned long size);
> +int create_memory_block_devices(unsigned long start, unsigned long size,
> + unsigned long vmemmap_pages);
>  void remove_memory_block_devices(unsigned long start, unsigned long size);
>  extern void memory_dev_init(void);
>  extern int memory_notify(unsigned long val, void *v);
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..67ea4a0b25bd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -70,6 +70,14 @@ typedef int __bitwise mhp_t;
>   */
>  #define MEMHP_MERGE_RESOURCE ((__force mhp_t)BIT(0))
>  
> +/*
> + * We want memmap (struct page array) to be self contained.
> + * To do so, we will use the beginning of the hot-added range to build
> + * the page tables for the memmap array that describes the entire range.
> + * Only selected architectures support it with SPARSE_VMEMMAP.
> + */
> +#define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
> +
>  /*
>   * Extended parameters for memory hotplug:
>   * altmap: alternative allocator for memmap array (optional)
> @@ -113,11 +121,13 @@ extern int zone_grow_waitqueues(struct zone *zone, 
> unsigned long nr_pages);
>  extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
>  /* VM interface that may be used by firmware interface */
>  extern int online_pages(unsigned long pfn, unsigned long nr_pages,
> - int online_type, int nid);
> + unsigned long nr_vmemmap_pages, int online_type,
> + int nid);
>  extern struct zone *test_pages_in_a_zone(

Re: [PATCH 1/1] arm64: make section size configurable for memory hotplug

2021-01-07 Thread David Hildenbrand
On 06.01.21 07:11, Anshuman Khandual wrote:
> Hi Sudershan,
> 
> This patch (and the cover letter) does not copy LAKML even though the
> entire change here is arm64 specific. Please do copy all applicable
> mailing lists for a given patch.
> 
> On 1/6/21 6:58 AM, Sudarshan Rajagopalan wrote:
>> Currently on arm64, memory section size is hard-coded to 1GB.
>> Make this configurable if memory-hotplug is enabled, to support
>> finer granularity for hotplug-able memory.
> 
> Section size has always been decided by the platform. It cannot be a
> configurable option because the user would not know the constraints
> for memory representation on the platform and besides it also cannot
> be trusted.
> 
>>
>> Signed-off-by: Sudarshan Rajagopalan 
>> ---
>>  arch/arm64/Kconfig | 11 +++
>>  arch/arm64/include/asm/sparsemem.h |  4 
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 6d232837cbee..34124eee65da 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -294,6 +294,17 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
>>  config SMP
>>  def_bool y
>>  
>> +config HOTPLUG_SIZE_BITS
>> +int "Memory hotplug block size(29 => 512MB 30 => 1GB)"

"Memory hotplug block size" and "HOTPLUG_SIZE_BITS" is confusing. It's
the section size. Please use that terminology.

(if at all, it would have to be "minimum memory hot(un)plug
granularity", but even that is somewhat misleading)

"SECTION_SIZE_BITS"

But I agree that letting the user configure it is sub-optimal.

>> +depends on SPARSEMEM
>> +depends on MEMORY_HOTPLUG
>> +range 28 30
> 
> 28 would not work for 64K pages.

@Anshuman, what's your feeling about changing this to 128 MB for 4k/16k
base pages (as we have on x86-64 right now) and 512 MB for 64k as
default for now?

(If we worry about the number of section bits in page->flags, we could
glue it to vmemmap support where that does not matter)
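
IOW, something along these lines in arch/arm64/include/asm/sparsemem.h
(just a sketch of the proposal, exact values and dependencies to be
discussed):

    #ifdef CONFIG_ARM64_64K_PAGES
    #define SECTION_SIZE_BITS    29    /* 512 MiB sections */
    #else
    #define SECTION_SIZE_BITS    27    /* 128 MiB sections for 4K/16K pages */
    #endif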


-- 
Thanks,

David / dhildenb



Re: [PATCH 1/1] arm64: make section size configurable for memory hotplug

2021-01-08 Thread David Hildenbrand
> To summarize, the section size bits for each base page size config
> should always
> 
> a. Avoid (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS

Pageblocks must also always fall completely into a section.

> 
> b. Provide minimum possible section size for a given base page config to
>have increased agility during memory hotplug operations and reduced
>vmemmap wastage for sections with holes.

OTOH, making the section size too small (e.g., 16MB) creates way too many
memory block devices in /sys/devices/system/memory/, and might consume
too many page->flags bits in the !vmemmap case.

For bigger setups, we might, similar to x86-64 (e.g., >= 64 GiB),
determine the memory_block_size_bytes() at runtime (spanning multiple
sections then), once it becomes relevant.
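
memory_block_size_bytes() is a __weak hook, so an architecture can
override it at runtime; a purely hypothetical (untested) variant, with
the 64 GiB threshold and the 4x factor made up for illustration:

    unsigned long memory_block_size_bytes(void)
    {
        /* group multiple sections per memory block on big machines */
        if (totalram_pages() >= (64UL << (30 - PAGE_SHIFT)))
            return 4UL * MIN_MEMORY_BLOCK_SIZE;
        return MIN_MEMORY_BLOCK_SIZE;    /* one section per block */
    }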

> 
> c. Allow 4K base page configs to have PMD based vmemmap mappings

Agreed.

> 
> Because CONFIG_FORCE_MAX_ZONEORDER is always defined on arm64 platform,
> the following would always avoid the condition (a)
> 
> SECTION_SIZE_BITS = (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
> 
>   - 22 (11 - 1 + 12) for 4K pages
>   - 24 (11 - 1 + 14) for 16K pages without THP
>   - 25 (12 - 1 + 14) for 16K pages with THP
>   - 26 (11 - 1 + 16) for 64K pages without THP
>   - 29 (14 - 1 + 16) for 64K pages with THP
> 
> Apart from overriding 4K base page size config to have 27 as section size
> bits, should not all other values be okay here ? But then wondering what
> benefit 128MB (27 bits) section size for 16K config would have ? OR the
> objective here is to match 16K page size config with default x86-64.

We don't want to have sections that are too small. We don't want to have
sections that are too big :)

Not sure if we really want to allow setting e.g., a section size of 4
MB. That's just going to hurt. IMHO, something in the range of 64..256
MB is quite a good choice, where possible.

> 
>>
>> (If we worry about the number of section bits in page->flags, we could
>> glue it to vmemmap support where that does not matter)
> 
> Could you please elaborate ? Could smaller section size bits numbers like
> 22 or 24 create problems in page->flags without changing other parameters
> like NR_CPUS or NODES_SHIFT ? A quick test with 64K base page without THP

Yes, in the !vmemmap case, we have to store the section_nr in there.
IIRC, it's less of an issue with section sizes like 128 MB.
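
For reference, classic sparsemem needs roughly (see
include/linux/page-flags-layout.h)

    SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS

bits in page->flags for the section number: with the default 48-bit PA
range that is 18 bits at SECTION_SIZE_BITS == 30, 21 bits at 27, and
already 26 bits at 22, which together with the node/zone fields and the
actual page flags quickly gets too tight.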

> i.e 26 bits in section size, fails to boot.

26 bits would mean 64 MB, no? Not sure if that's possible even without
THP (MAX_ORDER - 1, pageblock_order ...) on 64k pages. I'd assume 512 MB
is the lowest we can go. I'd assume this would crash :)

> 
> As you have suggested, probably constant defaults (128MB for 4K/16K, 512MB
> for 64K) might be better than depending on the CONFIG_FORCE_MAX_ZONEORDER,
> at least for now.

That's also what I would prefer, keeping it simple.

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm: memblock: remove return value of memblock_free_all()

2021-01-14 Thread David Hildenbrand
On 14.01.21 08:08, Daeseok Youn wrote:
> No one checks the return value of memblock_free_all().
> Make the return value void.
> 
> memblock_free_all() is used on mem_init() for each
> architecture, and the total count of freed pages will be added
> to _totalram_pages variable by calling totalram_pages_add().
> 
> so there is no need to return the total count of freed pages.
> 
> Signed-off-by: Daeseok Youn 
> ---
>  include/linux/memblock.h | 2 +-
>  mm/memblock.c| 6 +-
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 9c5cc95c7cee..076fda398dff 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -117,7 +117,7 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t 
> size);
>  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
>  
> -unsigned long memblock_free_all(void);
> +void memblock_free_all(void);
>  void reset_node_managed_pages(pg_data_t *pgdat);
>  void reset_all_zones_managed_pages(void);
>  void memblock_enforce_memory_reserved_overlap(void);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 40ca30bfa387..2a2b1fe4b659 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -2074,10 +2074,8 @@ void __init reset_all_zones_managed_pages(void)
>  
>  /**
>   * memblock_free_all - release free pages to the buddy allocator
> - *
> - * Return: the number of pages actually released.
>   */
> -unsigned long __init memblock_free_all(void)
> +void __init memblock_free_all(void)
>  {
>   unsigned long pages;
>  
> @@ -2086,8 +2084,6 @@ unsigned long __init memblock_free_all(void)
>  
>   pages = free_low_memory_core_early();
>   totalram_pages_add(pages);
> -
> - return pages;
>  }
>  
>  #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
> 

Reviewed-by: David Hildenbrand 

-- 
Thanks,

David / dhildenb



Re: [PATCH] mm/hugetlb: avoid unnecessary hugetlb_acct_memory() call

2021-01-14 Thread David Hildenbrand
On 14.01.21 12:31, Miaohe Lin wrote:
> When gbl_reserve is 0, hugetlb_acct_memory() will do nothing except holding
> and releasing hugetlb_lock.

So, what's the deal then? Adding more code?

If this is a performance improvement, we should spell it out. Otherwise
I don't see a real benefit of this patch.

> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/hugetlb.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 737b2dce19e6..fe2da9ad6233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5241,7 +5241,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
> start, long end,
>* reservations to be released may be adjusted.
>*/
>   gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
> - hugetlb_acct_memory(h, -gbl_reserve);
> + if (gbl_reserve)
> + hugetlb_acct_memory(h, -gbl_reserve);
>  
>   return 0;
>  }
> 


-- 
Thanks,

David / dhildenb


