Re: [PATCH v11 12/13] mm/vmalloc: Hugepage vmalloc mappings

2021-01-26 Thread Ding Tianhong
On 2021/1/26 17:47, Nicholas Piggin wrote:
> Excerpts from Ding Tianhong's message of January 26, 2021 4:59 pm:
>> On 2021/1/26 12:45, Nicholas Piggin wrote:
>>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>>> support PMD-sized vmap mappings.
>>>
>>> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
>>> or larger, and fall back to small pages if that was unsuccessful.
>>>
>>> Architectures must ensure that any arch specific vmalloc allocations
>>> that require PAGE_SIZE mappings (e.g., module allocations vs strict
>>> module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>>>
>>> When hugepage vmalloc mappings are enabled in the next patch, this
>>> reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>>>
>>> This can result in more internal fragmentation and memory overhead for a
>>> given allocation; an option, nohugevmalloc, is added to disable it at boot.
>>>
>>> Signed-off-by: Nicholas Piggin 
>>> ---
>>>  arch/Kconfig|  11 ++
>>>  include/linux/vmalloc.h |  21 
>>>  mm/page_alloc.c |   5 +-
>>>  mm/vmalloc.c| 215 +++-
>>>  4 files changed, 205 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/arch/Kconfig b/arch/Kconfig
>>> index 24862d15f3a3..eef170e0c9b8 100644
>>> --- a/arch/Kconfig
>>> +++ b/arch/Kconfig
>>> @@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>  config HAVE_ARCH_HUGE_VMAP
>>> bool
>>>  
>>> +#
>>> +#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>>> +#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>>> +#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
>>> +#  can be used to prohibit arch-specific allocations from using hugepages to
>>> +#  help with this (e.g., modules may require it).
>>> +#
>>> +config HAVE_ARCH_HUGE_VMALLOC
>>> +   depends on HAVE_ARCH_HUGE_VMAP
>>> +   bool
>>> +
>>>  config ARCH_WANT_HUGE_PMD_SHARE
>>> bool
>>>  
>>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>>> index 99ea72d547dc..93270adf5db5 100644
>>> --- a/include/linux/vmalloc.h
>>> +++ b/include/linux/vmalloc.h
>>> @@ -25,6 +25,7 @@ struct notifier_block;  /* in notifier.h */
>>>  #define VM_NO_GUARD        0x0040  /* don't add guard page */
>>>  #define VM_KASAN           0x0080  /* has allocated kasan shadow memory */
>>>  #define VM_MAP_PUT_PAGES   0x0100  /* put pages and free array in vfree */
>>> +#define VM_NO_HUGE_VMAP    0x0200  /* force PAGE_SIZE pte mapping */
>>>
>>>  /*
>>>   * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
>>> @@ -59,6 +60,9 @@ struct vm_struct {
>>> unsigned long   size;
>>> unsigned long   flags;
>>> struct page **pages;
>>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>>> +   unsigned intpage_order;
>>> +#endif
>>> unsigned intnr_pages;
>>> phys_addr_t phys_addr;
>>> const void  *caller;
>> Hi Nicholas:
>>
>> I'd like to make a suggestion :)
>>
>> The page order is only used to indicate a huge page mapping for the vm area,
>> and it is only valid when the size is bigger than PMD_SIZE, so could we use the
>> vm flags instead, e.g., define a new flag named VM_HUGEPAGE? It would not break
>> the vm struct, and it would be easier for me to backport the patch series to our
>> own branches (based on the LTS version).
> 
> Hmm, it might be possible. I'm not sure if 1GB vmallocs will be used any 
> time soon (or maybe they will for edge case configurations? It would be 
> trivial to add support for).
> 

1GB vmallocs are really crazy, but maybe they will be used in the future. :)

> The other concern I have is that Christophe IIRC was asking about 
> implementing a mapping for PPC which used TLB mappings that were 
> different than kernel page table tree size. Although I guess we could 
> deal with that when it comes.
> 

I didn't check the PPC platform, but I agree with you.

> I like the flexibility of page_order though. How hard would it be for 
> you to do the backport with VM_HUGEPAGE yourself?
> 

Yes, I can fix it with VM_HUGEPAGE in my own branch.
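
For what it's worth, a rough, untested sketch of what I have in mind (not from
this series; the VM_HUGEPAGE name and its 0x0400 value are only placeholders,
whatever bit is still free in the target LTS branch would do):

/*
 * Backport sketch: replace vm_struct::page_order with a vm flag.
 * Only the PMD case is supported, so the order is implied by the flag.
 */
#define VM_HUGEPAGE	0x0400	/* area is backed by PMD-sized pages */

static inline unsigned int vm_area_page_order(const struct vm_struct *vm)
{
	return (vm->flags & VM_HUGEPAGE) ? PMD_SHIFT - PAGE_SHIFT : 0;
}

static inline void set_vm_area_page_order(struct vm_struct *vm,
					  unsigned int order)
{
	if (order)
		vm->flags |= VM_HUGEPAGE;
}

Since only PMD-sized mappings are ever used on our branch, the order is implied
by the flag, which is exactly why this is enough for the backport.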

> I should also say, thanks for all the review and testing from the Huawei 
> team. Do you have an x86 patch?
I have only enabled and used it on the x86 and aarch64 platforms; this patch
series really helps us a lot. Thanks.

Ding

> Thanks,
> Nick
> 



Re: [PATCH v11 12/13] mm/vmalloc: Hugepage vmalloc mappings

2021-01-26 Thread Nicholas Piggin
Excerpts from Ding Tianhong's message of January 26, 2021 4:59 pm:
> On 2021/1/26 12:45, Nicholas Piggin wrote:
>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>> support PMD-sized vmap mappings.
>> 
>> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
>> or larger, and fall back to small pages if that was unsuccessful.
>> 
>> Architectures must ensure that any arch specific vmalloc allocations
>> that require PAGE_SIZE mappings (e.g., module allocations vs strict
>> module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>> 
>> When hugepage vmalloc mappings are enabled in the next patch, this
>> reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>> 
>> This can result in more internal fragmentation and memory overhead for a
>> given allocation; an option, nohugevmalloc, is added to disable it at boot.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/Kconfig|  11 ++
>>  include/linux/vmalloc.h |  21 
>>  mm/page_alloc.c |   5 +-
>>  mm/vmalloc.c| 215 +++-
>>  4 files changed, 205 insertions(+), 47 deletions(-)
>> 
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 24862d15f3a3..eef170e0c9b8 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>  config HAVE_ARCH_HUGE_VMAP
>>  bool
>>  
>> +#
>> +#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
>> +#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
>> +#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
>> +#  can be used to prohibit arch-specific allocations from using hugepages to
>> +#  help with this (e.g., modules may require it).
>> +#
>> +config HAVE_ARCH_HUGE_VMALLOC
>> +depends on HAVE_ARCH_HUGE_VMAP
>> +bool
>> +
>>  config ARCH_WANT_HUGE_PMD_SHARE
>>  bool
>>  
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 99ea72d547dc..93270adf5db5 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -25,6 +25,7 @@ struct notifier_block; /* in notifier.h */
>>  #define VM_NO_GUARD 0x0040  /* don't add guard page */
>>  #define VM_KASAN            0x0080  /* has allocated kasan shadow memory */
>>  #define VM_MAP_PUT_PAGES    0x0100  /* put pages and free array in vfree */
>> +#define VM_NO_HUGE_VMAP     0x0200  /* force PAGE_SIZE pte mapping */
>> 
>>  /*
>>   * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
>> @@ -59,6 +60,9 @@ struct vm_struct {
>>  unsigned long   size;
>>  unsigned long   flags;
>>  struct page **pages;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +unsigned intpage_order;
>> +#endif
>>  unsigned intnr_pages;
>>  phys_addr_t phys_addr;
>>  const void  *caller;
> Hi Nicholas:
> 
> I'd like to make a suggestion :)
> 
> The page order is only used to indicate a huge page mapping for the vm area,
> and it is only valid when the size is bigger than PMD_SIZE, so could we use the
> vm flags instead, e.g., define a new flag named VM_HUGEPAGE? It would not break
> the vm struct, and it would be easier for me to backport the patch series to our
> own branches (based on the LTS version).

Hmm, it might be possible. I'm not sure if 1GB vmallocs will be used any 
time soon (or maybe they will for edge case configurations? It would be 
trivial to add support for).

The other concern I have is that Christophe IIRC was asking about 
implementing a mapping for PPC which used TLB mappings that were 
different than kernel page table tree size. Although I guess we could 
deal with that when it comes.

I like the flexibility of page_order though. How hard would it be for 
you to do the backport with VM_HUGEPAGE yourself?

I should also say, thanks for all the review and testing from the Huawei 
team. Do you have an x86 patch?

Thanks,
Nick


Re: [PATCH v11 12/13] mm/vmalloc: Hugepage vmalloc mappings

2021-01-25 Thread Ding Tianhong
On 2021/1/26 12:45, Nicholas Piggin wrote:
> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
> support PMD-sized vmap mappings.
> 
> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
> or larger, and fall back to small pages if that was unsuccessful.
> 
> Architectures must ensure that any arch specific vmalloc allocations
> that require PAGE_SIZE mappings (e.g., module allocations vs strict
> module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
> 
> When hugepage vmalloc mappings are enabled in the next patch, this
> reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
> 
> This can result in more internal fragmentation and memory overhead for a
> given allocation; an option, nohugevmalloc, is added to disable it at boot.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/Kconfig|  11 ++
>  include/linux/vmalloc.h |  21 
>  mm/page_alloc.c |   5 +-
>  mm/vmalloc.c| 215 +++-
>  4 files changed, 205 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..eef170e0c9b8 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>  config HAVE_ARCH_HUGE_VMAP
>   bool
>  
> +#
> +#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
> +#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
> +#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
> +#  can be used to prohibit arch-specific allocations from using hugepages to
> +#  help with this (e.g., modules may require it).
> +#
> +config HAVE_ARCH_HUGE_VMALLOC
> + depends on HAVE_ARCH_HUGE_VMAP
> + bool
> +
>  config ARCH_WANT_HUGE_PMD_SHARE
>   bool
>  
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 99ea72d547dc..93270adf5db5 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -25,6 +25,7 @@ struct notifier_block;  /* in notifier.h */
>  #define VM_NO_GUARD  0x0040  /* don't add guard page */
>  #define VM_KASAN         0x0080  /* has allocated kasan shadow memory */
>  #define VM_MAP_PUT_PAGES 0x0100  /* put pages and free array in vfree */
> +#define VM_NO_HUGE_VMAP  0x0200  /* force PAGE_SIZE pte mapping */
> 
>  /*
>   * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
> @@ -59,6 +60,9 @@ struct vm_struct {
>   unsigned long   size;
>   unsigned long   flags;
>   struct page **pages;
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> + unsigned intpage_order;
> +#endif
>   unsigned intnr_pages;
>   phys_addr_t phys_addr;
>   const void  *caller;
Hi Nicholas:

I'd like to make a suggestion :)

The page order is only used to indicate a huge page mapping for the vm area,
and it is only valid when the size is bigger than PMD_SIZE, so could we use the
vm flags instead, e.g., define a new flag named VM_HUGEPAGE? It would not break
the vm struct, and it would be easier for me to backport the patch series to our
own branches (based on the LTS version).

Tianhong

> @@ -193,6 +197,22 @@ void free_vm_area(struct vm_struct *area);
>  extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>  
> +static inline bool is_vm_area_hugepages(const void *addr)
> +{
> + /*
> +  * This may not 100% tell if the area is mapped with > PAGE_SIZE
> +  * page table entries, if for some reason the architecture indicates
> +  * larger sizes are available but decides not to use them, nothing
> +  * prevents that. This only indicates the size of the physical page
> +  * allocated in the vmalloc layer.
> +  */
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> + return find_vm_area(addr)->page_order > 0;
> +#else
> + return false;
> +#endif
> +}
> +
>  #ifdef CONFIG_MMU
>  int vmap_range(unsigned long addr, unsigned long end,
>   phys_addr_t phys_addr, pgprot_t prot,
> @@ -210,6 +230,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
>   if (vm)
>   vm->flags |= VM_FLUSH_RESET_PERMS;
>  }
> +
>  #else
>  static inline int
>  map_kernel_range_noflush(unsigned long start, unsigned long size,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 027f6481ba59..b7a9661fa232 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -72,6 +72,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -8238,6 +8239,7 @@ void *__init alloc_large_system_hash(const char *tablename,
>   void *table = NULL;
>   gfp_t gfp_flags;
>   bool virt;

[PATCH v11 12/13] mm/vmalloc: Hugepage vmalloc mappings

2021-01-25 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
support PMD-sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
or larger, and fall back to small pages if that was unsuccessful.

Architectures must ensure that any arch specific vmalloc allocations
that require PAGE_SIZE mappings (e.g., module allocations vs strict
module rwx) use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
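
(For illustration only, not part of this patch: an architecture that must keep
PAGE_SIZE ptes for module text could opt out roughly like the sketch below,
assuming the __vmalloc_node_range() prototype used in this series; the exact
call site and the other flags are arch-specific.)

/* Sketch: opt a module_alloc()-style allocation out of huge mappings
 * by passing VM_NO_HUGE_VMAP in vm_flags.
 */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL_EXEC,
				    VM_NO_HUGE_VMAP | VM_FLUSH_RESET_PERMS,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}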

When hugepage vmalloc mappings are enabled in the next patch, this
reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation; an option, nohugevmalloc, is added to disable it at boot.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig|  11 ++
 include/linux/vmalloc.h |  21 
 mm/page_alloc.c |   5 +-
 mm/vmalloc.c| 215 +++-
 4 files changed, 205 insertions(+), 47 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 24862d15f3a3..eef170e0c9b8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+#
+#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
+#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
+#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
+#  can be used to prohibit arch-specific allocations from using hugepages to
+#  help with this (e.g., modules may require it).
+#
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 99ea72d547dc..93270adf5db5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -25,6 +25,7 @@ struct notifier_block;  /* in notifier.h */
 #define VM_NO_GUARD        0x0040  /* don't add guard page */
 #define VM_KASAN           0x0080  /* has allocated kasan shadow memory */
 #define VM_MAP_PUT_PAGES   0x0100  /* put pages and free array in vfree */
+#define VM_NO_HUGE_VMAP    0x0200  /* force PAGE_SIZE pte mapping */
 
 /*
  * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
@@ -59,6 +60,9 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   unsigned intpage_order;
+#endif
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
@@ -193,6 +197,22 @@ void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
+static inline bool is_vm_area_hugepages(const void *addr)
+{
+   /*
+* This may not 100% tell if the area is mapped with > PAGE_SIZE
+* page table entries, if for some reason the architecture indicates
+* larger sizes are available but decides not to use them, nothing
+* prevents that. This only indicates the size of the physical page
+* allocated in the vmalloc layer.
+*/
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+   return find_vm_area(addr)->page_order > 0;
+#else
+   return false;
+#endif
+}
+
 #ifdef CONFIG_MMU
 int vmap_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
@@ -210,6 +230,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
if (vm)
vm->flags |= VM_FLUSH_RESET_PERMS;
 }
+
 #else
 static inline int
 map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 027f6481ba59..b7a9661fa232 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8238,6 +8239,7 @@ void *__init alloc_large_system_hash(const char *tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8305,6 +8307,7 @@ void *__init alloc_large_system_hash(const char *tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = is_vm_area_hugepages(table);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8321,7 +8324,7 @@ void *__init alloc_large_system_hash(const char *tablename,