[PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2019-03-28 Thread Oscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
(~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages,
which have performance drawbacks.

a) has turned out to be a problem for memory-hotplug-based ballooning
because userspace might not react in time to online memory while
the memory consumed during physical hotadd is enough to push the
system to OOM. Commit 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") was added to work around that
problem.

I have also seen hot-add operations failing on powerpc because we try
to use order-8 pages. If the base page size is 64KB, this gives us 16MB,
and if we run out of those, we simply fail.
One could argue that we could fall back to base pages, as we do on x86_64.

But we can do much better when CONFIG_SPARSEMEM_VMEMMAP=y because vmemmap
page tables can map arbitrary memory. That means that we can simply
use the beginning of each memory section and map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Add {_Set,_Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose, so we use
a combination: the PageReserved bit, which already tells the core mm code
to ignore the page, plus VMEMMAP_PAGE (which sets all bits but
PAGE_MAPPING_FLAGS) stored in page->mapping.

There is one case where we cannot check for PageReserved: when poisoning
is enabled, VM_BUG_ON_PGFLAGS is on, and the page is not yet initialized.
This happens in __init_single_page(), where we have to preserve the state
of PageVmemmap pages and therefore cannot zero the page.
I added __PageVmemmap for that purpose, as it only checks the
page->mapping field.
That should be enough as these pages are not yet onlined.
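
For illustration, a rough sketch of what these helpers look like
conceptually (not the literal patch; the exact names and the VMEMMAP_PAGE
definition are approximations of the description above):

/* Sketch only -- approximates the helpers described above. */
#include <linux/page-flags.h>
#include <linux/mm_types.h>

#define VMEMMAP_PAGE	(~PAGE_MAPPING_FLAGS)

static __always_inline int PageVmemmap(struct page *page)
{
	return PageReserved(page) &&
	       (unsigned long)page->mapping == VMEMMAP_PAGE;
}

/* Cheap variant for __init_single_page(): flags may not be valid yet. */
static __always_inline int __PageVmemmap(struct page *page)
{
	return (unsigned long)page->mapping == VMEMMAP_PAGE;
}

static __always_inline void __SetPageVmemmap(struct page *page)
{
	__SetPageReserved(page);
	page->mapping = (void *)VMEMMAP_PAGE;
}

static __always_inline void __ClearPageVmemmap(struct page *page)
{
	__ClearPageReserved(page);
	page->mapping = NULL;
}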

On the memory hotplug front, add a new MHP_MEMMAP_FROM_RANGE restriction
flag. The user is supposed to set the flag if the memmap should be allocated
from the hotadded range.
Right now, this is passed to add_memory(), __add_memory() and
add_memory_resource().
Unfortunately we do not have a single entry point: Hyper-V, ACPI and Xen
use those three functions, so all of those callers have to specify whether
they want the memmap array allocated from the hot-added range.
For the time being, only ACPI enables it.
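
For illustration, a hypothetical caller could look roughly like this
(how exactly the flag travels in this series is an assumption here;
treat it as a sketch rather than the real ACPI code):

#include <linux/memory_hotplug.h>

/*
 * Hypothetical caller: ask for the memmap to be allocated from the
 * hot-added range itself.  Architecture code (e.g. s390) may still
 * veto the flag and fall back to the regular allocation.
 */
static int hotadd_self_hosted_memmap(int nid, u64 start, u64 size)
{
	return __add_memory(nid, start, size, MHP_MEMMAP_FROM_RANGE);
}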

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by __vmemmap_populate. Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation.
If MHP_MEMMAP_FROM_RANGE was passed, we set up the layout of the altmap
structure at the beginning of __add_pages(), and then we call
mark_vmemmap_pages() after the memory has been added.
mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:

The current layout of the Vmemmap pages is:

- There is a head Vmemmap (first page), which has the following fields set:
  * page->_refcount: number of sections that used this altmap
  * page->private: total number of vmemmap pages
- The remaining vmemmap pages have:
  * page->freelist: pointer to the head vmemmap page

This is done to ease the computations we need in some places.

E.g.: let us say we hot-add 9GB on x86_64. That is 9GB / 128MB = 72 sections,
and the memmap needs (9GB / 4KB) * 64 bytes = 144MB, i.e. 36864 vmemmap pages:

head->_refcount = 72 sections
head->private = 36864 vmemmap pages
tail pages' ->freelist = head
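
A condensed sketch of how mark_vmemmap_pages() could record that metadata
(illustrative only; the parameters and surrounding helpers are assumptions,
the field usage is as described above):

#include <linux/mm.h>
#include <linux/page_ref.h>

/*
 * Sketch: mark 'nr_pages' vmemmap pages starting at 'head' and record
 * the metadata described above.  'nr_sects' is the number of sections
 * whose memmap lives in this range.
 */
static void mark_vmemmap_pages_sketch(struct page *head,
				      unsigned long nr_pages,
				      unsigned int nr_sects)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		__SetPageVmemmap(head + i);

	/* Head page: how many sections use this altmap ... */
	page_ref_add(head, nr_sects);
	/* ... and the total number of vmemmap pages. */
	set_page_private(head, nr_pages);

	/* Tail pages point back to the head page. */
	for (i = 1; i < nr_pages; i++)
		(head + i)->freelist = head;
}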

We keep a _refcount of the used sections to know how long we have to defer
the call to vmemmap_free().
The first pages of the hot-added range are used to create the memmap mapping,
so we cannot remove those pages first, otherwise we would blow up.

Since sections are removed sequentially when we hot-remove a memory range,
we wait until we hit the last section and then free the whole range with
vmemmap_free(), walking backwards.
We know that it is the last section because we decrease head->_refcount on
every pass, and when it reaches 0 we have hit the last section.
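
The deferral itself can be pictured like this (a sketch under the
assumptions above; 'vmemmap_start'/'vmemmap_end' are the virtual addresses
of the struct page array backing the whole hot-added range, and the helper
name is made up):

#include <linux/mm.h>
#include <linux/page_ref.h>

/*
 * Sketch: called once per removed section.  Only the removal of the
 * last section actually tears the vmemmap down, and it does so
 * backwards so the pages that host the memmap go away last.
 */
static void put_section_vmemmap_sketch(struct page *head,
				       unsigned long vmemmap_start,
				       unsigned long vmemmap_end)
{
	unsigned long memmap_size = PAGES_PER_SECTION * sizeof(struct page);
	unsigned long addr;

	if (!page_ref_dec_and_test(head))
		return;		/* other sections still rely on this vmemmap */

	/* Last user: free the whole range, one section's worth at a time. */
	for (addr = vmemmap_end; addr > vmemmap_start; addr -= memmap_size)
		vmemmap_free(addr - memmap_size, addr, NULL);
}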

We also have to be careful about those pages during online and offline
operations. They are simply skipped, so online keeps them reserved (and
thus unusable for any other purpose) and offline ignores them so they do
not block the offline operation.
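
Illustration of the skip in a pfn walker on the online path (a simplified
sketch; the real onlining code does more than hand pages to the allocator):

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Simplified: online a pfn range, leaving self-hosted vmemmap pages alone. */
static void online_range_sketch(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* Vmemmap pages stay Reserved and never reach the buddy. */
		if (PageVmemmap(page))
			continue;

		__ClearPageReserved(page);
		__free_page(page);	/* hand the page to the page allocator */
	}
}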

Signed-off-by: Oscar Salvador 
---
 arch/arm64/mm/mmu.c |   5 +-
 arch/powerpc/mm/init_64.c   |   7 ++
 arch/powerpc/platforms/powernv/memtrace.c   |   2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c |   2 +-
 arch/s390/mm/init.c |   6 ++
 arch/x86/mm/init_64.c   |  10 +++
 drivers/acpi/acpi_memhotplug.c  |   2 +-
 drivers/base/memory.c   |   2 +-
 drivers/dax/kmem.c  |   2 +-
 

Re: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2018-11-18 Thread osalvador
On Fri, 2018-11-16 at 14:41 -0800, Dave Hansen wrote:
> On 11/16/18 2:12 AM, Oscar Salvador wrote:
> > Physical memory hotadd has to allocate a memmap (struct page array)
> > for
> > the newly added memory section. Currently, kmalloc is used for
> > those
> > allocations.
> 
> Did you literally mean kmalloc?  I thought we had a bunch of ways of
> allocating memmaps, but I didn't think kmalloc() was actually used.

No, sorry.
The names of the functions used for allocating a memmap contain the word
kmalloc, which caused the confusion.
Indeed, vmemmap_alloc_block() ends up calling alloc_pages_node().
__kmalloc_section_usemap() is the one that calls kmalloc.

> 
> So, can the ZONE_DEVICE altmaps move over to this infrastructure?
> Doesn't this effectively duplicate that code?

Actually, we are recycling/reusing part of the ZONE_DEVICE altmap code,
and "struct vmem_altmap" itself.

The only thing we added in that regard is the callback function
mark_vmemmap_pages(), which controls the refcount and marks the pages as
Vmemmap.


> ...
> > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> > index 7a9886f98b0c..03f014abd4eb 100644
> > --- a/arch/powerpc/mm/init_64.c
> > +++ b/arch/powerpc/mm/init_64.c
> > @@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start,
> > unsigned long end,
> > continue;
> >  
> > page = pfn_to_page(addr >> PAGE_SHIFT);
> > +   if (PageVmemmap(page))
> > +   continue;
> > section_base =
> > pfn_to_page(vmemmap_section_start(start));
> > nr_pages = 1 << page_order;
> 
> Reading this, I'm wondering if PageVmemmap() could be named better.
> From this is reads like "skip PageVmemmap() pages if freeing
> vmemmap",
> which does not make much sense.
> 
> This probably at _least_ needs a comment to explain why the pages are
> being skipped.

The thing is that we do not need to send Vmemmap pages to the buddy
system by means of free_pages/free_page_reserved, as those pages reside
within the memory section.
The only thing we need is to clear the mapping (page tables).

I just realized that that piece of code is wrong, as it does not allow
the mapping to be cleared.
That is one of the consequences of only having tested this on x86_64.

> 
> > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> > index 4139affd6157..bc1523bcb09d 100644
> > --- a/arch/s390/mm/init.c
> > +++ b/arch/s390/mm/init.c
> > @@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64
> > size,
> > unsigned long size_pages = PFN_DOWN(size);
> > int rc;
> >  
> > +   /*
> > +* Physical memory is added only later during the memory
> > online so we
> > +* cannot use the added range at this stage
> > unfortunatelly.
> 
>   unfortunately ^
> 
> > +*/
> > +   restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;
> 
> Could you also add to the  comment about this being specific to s390?

Sure, will do.

> > rc = vmem_add_mapping(start, size);
> > if (rc)
> > return rc;
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index fd06bcbd9535..d5234ca5c483 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct
> > page *page, int order)
> > unsigned long magic;
> > unsigned int nr_pages = 1 << order;
> >  
> > +   /*
> > +* runtime vmemmap pages are residing inside the memory
> > section so
> > +* they do not have to be freed anywhere.
> > +*/
> > +   if (PageVmemmap(page))
> > +   return;
> 
> Thanks for the comment on this one, this one is right on.
> 
> > @@ -16,13 +18,18 @@ struct device;
> >   * @free: free pages set aside in the mapping for memmap storage
> >   * @align: pages reserved to meet allocation alignments
> >   * @alloc: track pages consumed, private to vmemmap_populate()
> > + * @flush_alloc_pfns: callback to be called on the allocated range
> > after it
> > + * @nr_sects: nr of sects filled with memmap allocations
> > + * is mapped to the vmemmap - see mark_vmemmap_pages
> >   */
> 
> I think you split up the "@flush_alloc_pfns" comment accidentally.

Indeed, looks "broken".
I will fix it.


> > +   /*
> > +* We keep track of the sections using this altmap by
> > means
> > +* of a refcount, so we know how much do we have to defer
> > +* the call to vmemmap_free for this memory range.
> > +* The refcount is kept in the first vmemmap page.
> > +* For example:
> > +* We add 10GB: (ea000400 - ea000427ffc0)
> > +* ea000400 will have a refcount of 80.
> > +*/
> 
> The example is good, but it took me a minute to realize that 80 is
> because 10GB is roughly 80 sections.

I will try to make it more clear.

> 
> > +   head = (struct page *)ALIGN_DOWN((unsigned
> > long)pfn_to_page(pfn), align);
> 
> Is this ALIGN_DOWN() OK?  It seems like it might be aligning 'pfn'
> down
> into the 

Re: [RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2018-11-16 Thread Dave Hansen
On 11/16/18 2:12 AM, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, kmalloc is used for those
> allocations.

Did you literally mean kmalloc?  I thought we had a bunch of ways of
allocating memmaps, but I didn't think kmalloc() was actually used.

Like vmemmap_alloc_block(), for instance, uses alloc_pages_node().

So, can the ZONE_DEVICE altmaps move over to this infrastructure?
Doesn't this effectively duplicate that code?

...
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 7a9886f98b0c..03f014abd4eb 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -278,6 +278,8 @@ void __ref vmemmap_free(unsigned long start, unsigned 
> long end,
>   continue;
>  
>   page = pfn_to_page(addr >> PAGE_SHIFT);
> + if (PageVmemmap(page))
> + continue;
>   section_base = pfn_to_page(vmemmap_section_start(start));
>   nr_pages = 1 << page_order;

Reading this, I'm wondering if PageVmemmap() could be named better.
From this it reads like "skip PageVmemmap() pages if freeing vmemmap",
which does not make much sense.

This probably at _least_ needs a comment to explain why the pages are
being skipped.

> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 4139affd6157..bc1523bcb09d 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -231,6 +231,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   unsigned long size_pages = PFN_DOWN(size);
>   int rc;
>  
> + /*
> +  * Physical memory is added only later during the memory online so we
> +  * cannot use the added range at this stage unfortunatelly.

unfortunately ^

> +  */
> + restrictions->flags &= ~MHP_MEMMAP_FROM_RANGE;

Could you also add to the  comment about this being specific to s390?

>   rc = vmem_add_mapping(start, size);
>   if (rc)
>   return rc;
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index fd06bcbd9535..d5234ca5c483 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -815,6 +815,13 @@ static void __meminit free_pagetable(struct page *page, 
> int order)
>   unsigned long magic;
>   unsigned int nr_pages = 1 << order;
>  
> + /*
> +  * runtime vmemmap pages are residing inside the memory section so
> +  * they do not have to be freed anywhere.
> +  */
> + if (PageVmemmap(page))
> + return;

Thanks for the comment on this one, this one is right on.

> @@ -16,13 +18,18 @@ struct device;
>   * @free: free pages set aside in the mapping for memmap storage
>   * @align: pages reserved to meet allocation alignments
>   * @alloc: track pages consumed, private to vmemmap_populate()
> + * @flush_alloc_pfns: callback to be called on the allocated range after it
> + * @nr_sects: nr of sects filled with memmap allocations
> + * is mapped to the vmemmap - see mark_vmemmap_pages
>   */

I think you split up the "@flush_alloc_pfns" comment accidentally.

>  struct vmem_altmap {
> - const unsigned long base_pfn;
> + unsigned long base_pfn;
>   const unsigned long reserve;
>   unsigned long free;
>   unsigned long align;
>   unsigned long alloc;
> + int nr_sects;
> + void (*flush_alloc_pfns)(struct vmem_altmap *self);
>  };
>  
>  /*
> @@ -133,8 +140,62 @@ void *devm_memremap_pages(struct device *dev, struct 
> dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>   struct dev_pagemap *pgmap);
>  
> -unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> +static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
> +{
> + /* number of pfns from base where pfn_to_page() is valid */
> + return altmap->reserve + altmap->free;
> +}
>  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> +
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> + unsigned long pfn = self->base_pfn + self->reserve;
> + unsigned long nr_pages = self->alloc;
> + unsigned long align = PAGES_PER_SECTION * sizeof(struct page);
> + struct page *head;
> + unsigned long i;
> +
> + pr_debug("%s: marking %px - %px as Vmemmap\n", __func__,
> + pfn_to_page(pfn),
> + pfn_to_page(pfn + nr_pages - 
> 1));
> + /*
> +  * We keep track of the sections using this altmap by means
> +  * of a refcount, so we know how much do we have to defer
> +  * the call to vmemmap_free for this memory range.
> +  * The refcount is kept in the first vmemmap page.
> +  * For example:
> +  * We add 10GB: (ea000400 - ea000427ffc0)
> +  * ea000400 will have a refcount of 80.
> +  */

The example 

[RFC PATCH 3/4] mm, memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

2018-11-16 Thread Oscar Salvador
From: Oscar Salvador 

Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, kmalloc is used for those
allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose (~2MB per 128MB memory
section)
 b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.

a) has turned out to be a problem for memory-hotplug-based ballooning
because userspace might not react in time to online memory while
the memory consumed during physical hotadd is enough to push the
system to OOM. Commit 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory") was added to work around that
problem.

We can do much better when CONFIG_SPARSEMEM_VMEMMAP=y because vmemmap
page tables can map arbitrary memory. That means that we can simply
use the beginning of each memory section and map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Add {_Set,_Clear}PageVmemmap helpers to distinguish those pages in pfn
walkers. We do not have any spare page flag for this purpose, so we use
a combination: the PageReserved bit, which already tells the core mm code
to ignore the page, plus VMEMMAP_PAGE (which sets all bits but
PAGE_MAPPING_FLAGS) stored in page->mapping.

On the memory hotplug front add a new MHP_MEMMAP_FROM_RANGE restriction
flag. User is supposed to set the flag if the memmap should be allocated
from the hotadded range. Please note that this is just a hint and
architecture code can veto this if this cannot be supported. E.g. s390
cannot support this currently because the physical memory range is made
accessible only during memory online.

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by __vmemmap_populate. Once the memmap is
allocated we need a way to mark altmap pfns used for the allocation
and this is done by a new vmem_altmap::flush_alloc_pfns callback.
mark_vmemmap_pages implementation then simply __SetPageVmemmap()s all
struct pages backing those pfns.
The callback is called from sparse_add_one_section.

mark_vmemmap_pages will take care of marking the pages as PageVmemmap
and of increasing the refcount of the first Vmemmap page.
This is done so we know how long we have to defer the call to vmemmap_free().
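
A sketch of where the hook plugs in (the surrounding helper name here is a
placeholder; only the callback shape follows the description above):

#include <linux/memremap.h>

/*
 * Placeholder for the spot in sparse_add_one_section() where the new
 * section's memmap has just been populated from the altmap.
 */
static void section_memmap_populated_sketch(struct vmem_altmap *altmap)
{
	/*
	 * Let the altmap mark the pfns it handed out: with this series the
	 * callback is mark_vmemmap_pages(), which __SetPageVmemmap()s the
	 * backing struct pages and bumps the head page's refcount.
	 */
	if (altmap && altmap->flush_alloc_pfns)
		altmap->flush_alloc_pfns(altmap);
}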

We also have to be careful about those pages during online and offline
operations. They are simply skipped now so online will keep them
reserved and so unusable for any other purpose and offline ignores them
so they do not block the offline operation.

When hot-removing the range, since sections are removed sequentially
starting from the first one and moving on, __kfree_section_memmap will
catch the first Vmemmap page and read its reference count.
In this way, __kfree_section_memmap knows how long it has to defer
the call to vmemmap_free().

In case we are hot-removing a range that used an altmap, the calls to
vmemmap_free() must be done backwards, because the beginning of the memory
range is used for the page tables.
Doing it this way, we ensure that by the time we remove the page tables,
those pages no longer need to be referenced.

Please note that only the memory hotplug is currently using this
allocation scheme. The boot time memmap allocation could use the same
trick as well but this is not done yet.

Signed-off-by: Oscar Salvador 
---
 arch/arm64/mm/mmu.c|  5 ++-
 arch/powerpc/mm/init_64.c  |  2 ++
 arch/s390/mm/init.c|  6 
 arch/x86/mm/init_64.c  |  7 
 include/linux/memory_hotplug.h |  8 -
 include/linux/memremap.h   | 65 +++--
 include/linux/page-flags.h | 18 ++
 kernel/memremap.c  |  6 
 mm/compaction.c|  3 ++
 mm/hmm.c   |  6 ++--
 mm/memory_hotplug.c| 81 +++---
 mm/page_alloc.c| 22 ++--
 mm/page_isolation.c| 13 ++-
 mm/sparse.c| 46 
 14 files changed, 268 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 394b8d554def..8fa6e2ade5be 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -733,7 +733,10 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
if (pmd_none(READ_ONCE(*pmdp))) {
void *p = NULL;
 
-   p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+   if (altmap)
+   p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+   else
+   p = vmemmap_alloc_block_buf(PMD_SIZE, node);
if (!p)
return -ENOMEM;
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 
