RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory

2021-03-03 Thread Nitin Gupta



> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of pi...@codeaurora.org
> Sent: Wednesday, March 3, 2021 6:34 AM
> To: Nitin Gupta 
> Cc: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> m...@kvack.org; linux-fsde...@vger.kernel.org; iamjoonsoo@lge.com;
> sh_...@163.com; mateusznos...@gmail.com; b...@redhat.com;
> vba...@suse.cz; yzai...@google.com; keesc...@chromium.org;
> mcg...@kernel.org; mgor...@techsingularity.net; pintu.p...@gmail.com
> Subject: Re: [PATCH] mm/compaction: remove unused variable
> sysctl_compact_memory
> 
> On 2021-03-03 01:48, Nitin Gupta wrote:
> >> -Original Message-
> >> From: pintu=codeaurora@mg.codeaurora.org
> >>  On Behalf Of Pintu Kumar
> >> Sent: Tuesday, March 2, 2021 9:56 AM
> >> To: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> >> m...@kvack.org; linux-fsde...@vger.kernel.org; pi...@codeaurora.org;
> >> iamjoonsoo@lge.com; sh_...@163.com;
> mateusznos...@gmail.com;
> >> b...@redhat.com; Nitin Gupta ; vba...@suse.cz;
> >> yzai...@google.com; keesc...@chromium.org; mcg...@kernel.org;
> >> mgor...@techsingularity.net
> >> Cc: pintu.p...@gmail.com
> >> Subject: [PATCH] mm/compaction: remove unused variable
> >> sysctl_compact_memory
> >>
> >> The sysctl_compact_memory is mostly unused in mm/compaction.c. It just
> >> acts as a placeholder for sysctl.
> >>
> >> Thus we can remove it from here and move the declaration directly into
> >> kernel/sysctl.c itself.
> >> This will also eliminate the extern declaration from the header file.
> >
> >
> > I prefer keeping the existing pattern of listing all compaction
> > related tunables together in compaction.h:
> >
> >   extern int sysctl_compact_memory;
> >   extern unsigned int sysctl_compaction_proactiveness;
> >   extern int sysctl_extfrag_threshold;
> >   extern int sysctl_compact_unevictable_allowed;
> >
> 
> Thanks Nitin for your review.
> You mean you just want to retain this extern declaration?
> Is there any real benefit to keeping this declaration if it's not used elsewhere?
> 

I see that sysctl_compaction_handler() doesn't use the sysctl value at all.
So, we can get rid of it completely as Vlastimil suggested.
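For illustration only (not the actual patch): with the dummy variable gone, the
vm_table entry can stop pointing .data at anything and rely purely on the
handler. A rough sketch, reusing the ctl_table fields that appear in the
kernel/sysctl.c hunks quoted below:

	{
		.procname	= "compact_memory",
		.data		= NULL,		/* the written value is ignored by the handler */
		.maxlen		= sizeof(int),
		.mode		= 0200,		/* assumed write-only; treat the whole entry as a sketch */
		.proc_handler	= sysctl_compaction_handler,
	},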

> >
> >> No functionality is broken or changed this way.
> >>
> >> Signed-off-by: Pintu Kumar 
> >> Signed-off-by: Pintu Agarwal 
> >> ---
> >>  include/linux/compaction.h | 1 -
> >>  kernel/sysctl.c            | 1 +
> >>  mm/compaction.c            | 3 ---
> >>  3 files changed, 1 insertion(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> >> index ed4070e..4221888 100644
> >> --- a/include/linux/compaction.h
> >> +++ b/include/linux/compaction.h
> >> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
> >>  }
> >>
> >>  #ifdef CONFIG_COMPACTION
> >> -extern int sysctl_compact_memory;
> >>  extern unsigned int sysctl_compaction_proactiveness;
> >>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >>  		void *buffer, size_t *length, loff_t *ppos);
> >> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> >> index c9fbdd8..66aff21 100644
> >> --- a/kernel/sysctl.c
> >> +++ b/kernel/sysctl.c
> >> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
> >>  #ifdef CONFIG_COMPACTION
> >>  static int min_extfrag_threshold;
> >>  static int max_extfrag_threshold = 1000;
> >> +static int sysctl_compact_memory;
> >>  #endif
> >>
> >>  #endif /* CONFIG_SYSCTL */
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index 190ccda..ede2886 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
> >>  		compact_node(nid);
> >>  }
> >>
> >> -/* The written value is actually unused, all memory is compacted */
> >> -int sysctl_compact_memory;
> >> -
> >
> >
> > Please retain this comment for the tunable.
> 
> Sorry, I could not understand.
> You mean to say just retain this last comment and only remove the
> variable?
> Again, is there any real benefit in retaining this even if it's not used?
> 
> 

You are just moving the declaration of sysctl_compact_memory from compaction.c
to sysctl.c. So, I wanted the comment "... all memory is compacted" to be
retained with the sysctl variable. Since you are now getting rid of the
variable completely, the comment goes away too.

Thanks,
Nitin



RE: [PATCH] mm/compaction: remove unused variable sysctl_compact_memory

2021-03-02 Thread Nitin Gupta



> -Original Message-
> From: pintu=codeaurora@mg.codeaurora.org
>  On Behalf Of Pintu Kumar
> Sent: Tuesday, March 2, 2021 9:56 AM
> To: linux-kernel@vger.kernel.org; a...@linux-foundation.org; linux-
> m...@kvack.org; linux-fsde...@vger.kernel.org; pi...@codeaurora.org;
> iamjoonsoo@lge.com; sh_...@163.com; mateusznos...@gmail.com;
> b...@redhat.com; Nitin Gupta ; vba...@suse.cz;
> yzai...@google.com; keesc...@chromium.org; mcg...@kernel.org;
> mgor...@techsingularity.net
> Cc: pintu.p...@gmail.com
> Subject: [PATCH] mm/compaction: remove unused variable
> sysctl_compact_memory
> 
> The sysctl_compact_memory is mostly unused in mm/compaction.c. It just acts
> as a placeholder for sysctl.
> 
> Thus we can remove it from here and move the declaration directly into
> kernel/sysctl.c itself.
> This will also eliminate the extern declaration from the header file.


I prefer keeping the existing pattern of listing all compaction related tunables
together in compaction.h:

extern int sysctl_compact_memory;
extern unsigned int sysctl_compaction_proactiveness;
extern int sysctl_extfrag_threshold;
extern int sysctl_compact_unevictable_allowed;


> No functionality is broken or changed this way.
> 
> Signed-off-by: Pintu Kumar 
> Signed-off-by: Pintu Agarwal 
> ---
>  include/linux/compaction.h | 1 -
>  kernel/sysctl.c            | 1 +
>  mm/compaction.c            | 3 ---
>  3 files changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ed4070e..4221888 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -81,7 +81,6 @@ static inline unsigned long compact_gap(unsigned int order)
>  }
> 
>  #ifdef CONFIG_COMPACTION
> -extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  		void *buffer, size_t *length, loff_t *ppos);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index c9fbdd8..66aff21 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -198,6 +198,7 @@ static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
> +static int sysctl_compact_memory;
>  #endif
> 
>  #endif /* CONFIG_SYSCTL */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccda..ede2886 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2650,9 +2650,6 @@ static void compact_nodes(void)
>  		compact_node(nid);
>  }
> 
> -/* The written value is actually unused, all memory is compacted */
> -int sysctl_compact_memory;
> -


Please retain this comment for the tunable.

-Nitin


[PATCH] mm: Fix compile error due to COMPACTION_HPAGE_ORDER

2020-06-23 Thread Nitin Gupta
Fix compile error when COMPACTION_HPAGE_ORDER is assigned
to HUGETLB_PAGE_ORDER. The correct way to check if this
constant is defined is to check for CONFIG_HUGETLBFS.

Signed-off-by: Nitin Gupta 
To: Andrew Morton 
Reported-by: Nathan Chancellor 
Tested-by: Nathan Chancellor 
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45fd24a0ea0b..02963ffb9e70 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -62,7 +62,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
  */
 #if defined CONFIG_TRANSPARENT_HUGEPAGE
 #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
-#elif defined HUGETLB_PAGE_ORDER
+#elif defined CONFIG_HUGETLBFS
 #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
 #else
 #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
-- 
2.27.0



Re: [PATCH v8] mm: Proactive compaction

2020-06-23 Thread Nitin Gupta
On 6/22/20 9:57 PM, Nathan Chancellor wrote:
> On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote:
>> On 6/22/20 7:26 PM, Nathan Chancellor wrote:
>>> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
>>>> For some applications, we need to allocate almost all memory as
>>>> hugepages. However, on a running system, higher-order allocations can
>>>> fail if the memory is fragmented. Linux kernel currently does on-demand
>>>> compaction as we request more hugepages, but this style of compaction
>>>> incurs very high latency. Experiments with one-time full memory
>>>> compaction (followed by hugepage allocations) show that kernel is able
>>>> to restore a highly fragmented memory state to a fairly compacted memory
>>>> state within <1 sec for a 32G system. Such data suggests that a more
>>>> proactive compaction can help us allocate a large fraction of memory as
>>>> hugepages keeping allocation latencies low.
>>>>
>>>> For a more proactive compaction, the approach taken here is to define a
>>>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>>>> for external fragmentation which kcompactd tries to maintain.
>>>>
>>>> The tunable takes a value in range [0, 100], with a default of 20.
>>>>
>>>> Note that a previous version of this patch [1] was found to introduce
>>>> too many tunables (per-order extfrag{low, high}), but this one reduces
>>>> them to just one sysctl. Also, the new tunable is an opaque value
>>>> instead of asking for specific bounds of "external fragmentation", which
>>>> would have been difficult to estimate. The internal interpretation of
>>>> this opaque value allows for future fine-tuning.
>>>>
>>>> Currently, we use a simple translation from this tunable to [low, high]
>>>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>>>> The score for a node is defined as weighted mean of per-zone external
>>>> fragmentation. A zone's present_pages determines its weight.
>>>>
>>>> To periodically check per-node score, we reuse per-node kcompactd
>>>> threads, which are woken up every 500 milliseconds to check the same. If
>>>> a node's score exceeds its high threshold (as derived from user-provided
>>>> proactiveness value), proactive compaction is started until its score
>>>> reaches its low threshold value. By default, proactiveness is set to 20,
>>>> which implies threshold values of low=80 and high=90.
>>>>
>>>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>>>> LWN article [3].
>>>>
>>>> Performance data
>>>> 
>>>>
> >>>> System: x86_64, 1T RAM, 80 CPU threads.
>>>> Kernel: 5.6.0-rc3 + this patch
>>>>
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>>>
>>>> Before starting the driver, the system was fragmented from a userspace
>>>> program that allocates all memory and then for each 2M aligned section,
>>>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>>>> userspace pages, which are easy to move around. I intentionally avoided
>>>> unmovable pages in this test to see how much latency we incur when
>>>> hugepage allocations hit direct compaction.
>>>>
>>>> 1. Kernel hugepage allocation latencies
>>>>
>>>> With the system in such a fragmented state, a kernel driver then
>>>> allocates as many hugepages as possible and measures allocation
>>>> latency:
>>>>
>>>> (all latency values are in microseconds)
>>>>
>>>> - With vanilla 5.6.0-rc3
>>>>
>>>>   percentile latency
>>>>   –– –––
>>>>   57894
>>>>  109496
>>>>  25   12561
>>>>  30   15295
>>>>  40   18244
>>>>  50   21229
>>>>  60   27556
>>>>  75   30147
>>>>  80   31047
>>>>  90   32859
>>>>  95   33799
>>>>
>>>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>>>> 762G total free => 98% of free memory could be allocated as hugepages)

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nitin Gupta
On 6/22/20 7:26 PM, Nathan Chancellor wrote:
> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation", which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same. If
>> a node's score exceeds its high threshold (as derived from user-provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then for each 2M aligned section,
>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>> userspace pages, which are easy to move around. I intentionally avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
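A minimal userspace sketch of that fragmentation step, purely for illustration
(the actual test program is not part of this thread; the sizes and the pause()
at the end are arbitrary choices here):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t total = 1UL << 30;	/* 1G for the sketch; the real test touched nearly all RAM */
	size_t sect  = 2UL << 20;	/* 2M sections */
	char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, total);		/* write every base page so each one is truly allocated */

	/* Round up to the first 2M boundary, then punch a hole covering
	 * 3/4 of the base pages in every 2M-aligned section. */
	char *start = (char *)(((uintptr_t)p + sect - 1) & ~(sect - 1));
	for (char *s = start; s + sect <= p + total; s += sect)
		munmap(s, (sect / 4) * 3);

	pause();			/* hold the leftover pages so memory stays fragmented */
	return 0;
}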
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5    7894
>>   10    9496
>>   25   12561
>>   30   15295
>>   40   18244
>>   50   21229
>>   60   27556
>>   75   30147
>>   80   31047
>>   90   32859
>>   95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5     2
>>   10     2
>>   25     3
>>   30     3
>>   40     3
>>   50     4
>>   60     4
>>   75     4
>>   80     4
>>   90     5
>>   95   429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap allocation
>>
>> In this test, we firs

Re: [PATCH] mm: Use unsigned types for fragmentation score

2020-06-18 Thread Nitin Gupta
On 6/18/20 6:41 AM, Baoquan He wrote:
> On 06/17/20 at 06:03pm, Nitin Gupta wrote:
>> Proactive compaction uses per-node/zone "fragmentation score" which
>> is always in range [0, 100], so use an unsigned type for these scores
>> as well as for related constants.
>>
>> Signed-off-by: Nitin Gupta 
>> ---
>>  include/linux/compaction.h |  4 ++--
>>  kernel/sysctl.c            |  2 +-
>>  mm/compaction.c            | 18 +-
>>  mm/vmstat.c                |  2 +-
>>  4 files changed, 13 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 7a242d46454e..25a521d299c1 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int 
>> order)
>>  
>>  #ifdef CONFIG_COMPACTION
>>  extern int sysctl_compact_memory;
>> -extern int sysctl_compaction_proactiveness;
>> +extern unsigned int sysctl_compaction_proactiveness;
>>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>>  void *buffer, size_t *length, loff_t *ppos);
>>  extern int sysctl_extfrag_threshold;
>>  extern int sysctl_compact_unevictable_allowed;
>>  
>> -extern int extfrag_for_order(struct zone *zone, unsigned int order);
>> +extern unsigned int extfrag_for_order(struct zone *zone, unsigned int 
>> order);
>>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>>  unsigned int order, unsigned int alloc_flags,
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 58b0a59c9769..40180cdde486 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
>>  {
>>  .procname   = "compaction_proactiveness",
>>  .data   = &sysctl_compaction_proactiveness,
>> -.maxlen = sizeof(int),
>> +.maxlen = sizeof(sysctl_compaction_proactiveness),
> 
> Patch looks good to me. Wondering why not using 'unsigned int' here,
> just curious.
> 


It's just a coding style preference. I see the same style used for many
other sysctls too (min_free_kbytes etc.).
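A trivial standalone illustration of the point (not kernel code): sizeof on the
variable keeps .maxlen correct if the variable's type changes again, while
sizeof(int) has to be edited by hand:

#include <stdio.h>

/* Hypothetical stand-in for the real sysctl variable. */
static unsigned int sysctl_compaction_proactiveness = 20;

int main(void)
{
	/* Tracks the declaration automatically, whatever its type. */
	printf("maxlen via variable: %zu\n", sizeof(sysctl_compaction_proactiveness));
	/* Must be kept in sync with the declaration by hand. */
	printf("maxlen via type:     %zu\n", sizeof(int));
	return 0;
}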

Thanks,
Nitin



[PATCH] mm: Use unsigned types for fragmentation score

2020-06-17 Thread Nitin Gupta
Proactive compaction uses per-node/zone "fragmentation score" which
is always in range [0, 100], so use an unsigned type for these scores
as well as for related constants.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h |  4 ++--
 kernel/sysctl.c            |  2 +-
 mm/compaction.c            | 18 +-
 mm/vmstat.c                |  2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7a242d46454e..25a521d299c1 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -85,13 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
-extern int sysctl_compaction_proactiveness;
+extern unsigned int sysctl_compaction_proactiveness;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void *buffer, size_t *length, loff_t *ppos);
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
-extern int extfrag_for_order(struct zone *zone, unsigned int order);
+extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58b0a59c9769..40180cdde486 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2833,7 +2833,7 @@ static struct ctl_table vm_table[] = {
{
.procname   = "compaction_proactiveness",
	.data   = &sysctl_compaction_proactiveness,
-   .maxlen = sizeof(int),
+   .maxlen = sizeof(sysctl_compaction_proactiveness),
.mode   = 0644,
.proc_handler   = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
diff --git a/mm/compaction.c b/mm/compaction.c
index ac2030814edb..45fd24a0ea0b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,7 +53,7 @@ static inline void count_compact_events(enum vm_event_item 
item, long delta)
 /*
  * Fragmentation score check interval for proactive compaction purposes.
  */
-static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
+static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
 
 /*
  * Page order with-respect-to which proactive compaction
@@ -1890,7 +1890,7 @@ static bool kswapd_is_running(pg_data_t *pgdat)
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone(struct zone *zone)
 {
unsigned long score;
 
@@ -1906,9 +1906,9 @@ static int fragmentation_score_zone(struct zone *zone)
  * the node's score falls below the low threshold, or one of the back-off
  * conditions is met.
  */
-static int fragmentation_score_node(pg_data_t *pgdat)
+static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 {
-   unsigned long score = 0;
+   unsigned int score = 0;
int zoneid;
 
for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
@@ -1921,17 +1921,17 @@ static int fragmentation_score_node(pg_data_t *pgdat)
return score;
 }
 
-static int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
+static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 {
-   int wmark_low;
+   unsigned int wmark_low;
 
/*
	 * Cap the low watermark to avoid excessive compaction
	 * activity in case a user sets the proactiveness tunable
 * close to 100 (maximum).
 */
-   wmark_low = max(100 - sysctl_compaction_proactiveness, 5);
-   return low ? wmark_low : min(wmark_low + 10, 100);
+   wmark_low = max(100U - sysctl_compaction_proactiveness, 5U);
+   return low ? wmark_low : min(wmark_low + 10, 100U);
 }
 
 static bool should_proactive_compact_node(pg_data_t *pgdat)
@@ -2604,7 +2604,7 @@ int sysctl_compact_memory;
  * aggressively the kernel should compact memory in the
  * background. It takes values in the range [0, 100].
  */
-int __read_mostly sysctl_compaction_proactiveness = 20;
+unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
 
 /*
  * This is the entry point for compacting all nodes via
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3e7ba8bce2ba..b1de695b826d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1079,7 +1079,7 @@ static int __fragmentation_index(unsigned int order, 
struct contig_page_info *in
  * It is defined as the percentage of pages found in blocks of size
  * less than 1 << order. It returns values in range [0, 100].
  */
-int extfrag_for_order(struct zone *zone, unsigned int order)
+unsigned int extfrag_for_order(struct zone *zone, unsigned

Re: [PATCH v8] mm: Proactive compaction

2020-06-17 Thread Nitin Gupta




On 6/17/20 1:53 PM, Andrew Morton wrote:

On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta  wrote:


For some applications, we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if the memory is fragmented. Linux kernel currently does on-demand
compaction as we request more hugepages, but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) show that kernel is able
to restore a highly fragmented memory state to a fairly compacted memory
state within <1 sec for a 32G system. Such data suggests that a more
proactive compaction can help us allocate a large fraction of memory as
hugepages keeping allocation latencies low.

...



All looks straightforward to me and easy to disable if it goes wrong.

All the hard-coded magic numbers are a worry, but such is life.

One teeny complaint:



...

@@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
unsigned long pflags;
  
  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);

-   wait_event_freezable(pgdat->kcompactd_wait,
-   kcompactd_work_requested(pgdat));
+   if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
+   kcompactd_work_requested(pgdat),
+   msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
+
+   psi_memstall_enter(&pflags);
+   kcompactd_do_work(pgdat);
+   psi_memstall_leave(&pflags);
+   continue;
+   }
  
-   psi_memstall_enter(&pflags);
-   kcompactd_do_work(pgdat);
-   psi_memstall_leave(&pflags);
+   /* kcompactd wait timeout */
+   if (should_proactive_compact_node(pgdat)) {
+   unsigned int prev_score, score;


Everywhere else, scores have type `int'.  Here they are unsigned.  How come?

Would it be better to make these unsigned throughout?  I don't think a
score can ever be negative?



The score is always in [0, 100], so yes, it should be unsigned.
I will send another patch which fixes this.
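For completeness, here is that arithmetic as a standalone sketch (made-up zone
numbers, not kernel code): each zone contributes its extfrag value weighted by
present_pages over the node total, so the sum can never leave [0, 100] or go
negative:

#include <stdio.h>

struct zone_sample {
	unsigned long present_pages;	/* the weight */
	unsigned int extfrag;		/* per-zone external fragmentation, 0..100 */
};

int main(void)
{
	/* Purely illustrative numbers. */
	struct zone_sample zones[] = {
		{   4096, 10 },		/* a small zone barely moves the score */
		{ 819200, 85 },		/* a large zone dominates it */
	};
	unsigned long node_present = 0, score = 0;
	unsigned int i;

	for (i = 0; i < 2; i++)
		node_present += zones[i].present_pages;
	for (i = 0; i < 2; i++)
		score += zones[i].present_pages * zones[i].extfrag / node_present;

	/* Each term is at most (present_pages / node_present) * 100. */
	printf("node fragmentation score: %lu\n", score);
	return 0;
}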

Thanks,
Nitin



[PATCH v8] mm: Proactive compaction

2020-06-16 Thread Nitin Gupta
igher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).
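
(For the arithmetic behind the "~30 seconds": assuming the proactive defer
counter reuses the usual compaction limit of 1 << COMPACT_MAX_DEFER_SHIFT = 64
skipped checks, 64 checks x 500 ms per check gives 32 s between retries.)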

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Khalid Aziz 
Reviewed-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 
To: Andrew Morton 
CC: Vlastimil Babka 
CC: Khalid Aziz 
CC: Michal Hocko 
CC: Mel Gorman 
CC: Matthew Wilcox 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: Oleksandr Natalenko 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v8 vs v7:
 - Rebase to 5.8-rc1

Changelog v7 vs v6:
 - Fix compile error while THP is disabled (Oleksandr)

Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated
   the patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil
   Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream
   of hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index d46d5b7013c6..4b7c496199ca 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available 
in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may

Re: [PATCH v7] mm: Proactive compaction

2020-06-16 Thread Nitin Gupta
On 6/16/20 2:46 AM, Oleksandr Natalenko wrote:
> Hello.
> 
> Please see the notes inline.
> 
> On Mon, Jun 15, 2020 at 07:36:14AM -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation", which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low, high]
>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same. If
>> a node's score exceeds its high threshold (as derived from user-provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a userspace
>> program that allocates all memory and then for each 2M aligned section,
>> frees 3/4 of base pages using munmap. The workload is mainly anonymous
>> userspace pages, which are easy to move around. I intentionally avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5    7894
>>   10    9496
>>   25   12561
>>   30   15295
>>   40   18244
>>   50   21229
>>   60   27556
>>   75   30147
>>   80   31047
>>   90   32859
>>   95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5     2
>>   10     2
>>   25     3
>>   30     3
>>   40     3
>>   50     4
>>   60     4
>>   75     4
>>   80     4
>>   90     5
>>   95   429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as hugepages)
>>
>> 2. JAVA heap

[PATCH v7] mm: Proactive compaction

2020-06-15 Thread Nitin Gupta
igher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Khalid Aziz 
To: Andrew Morton 
CC: Vlastimil Babka 
CC: Khalid Aziz 
CC: Michal Hocko 
CC: Mel Gorman 
CC: Matthew Wilcox 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: Oleksandr Natalenko 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v7 vs v6:
 - Fix compile error while THP is disabled (Oleksandr)

Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated
   the patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil
   Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream
   of hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..360914b4f346 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available 
in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
 
 compact_unevictable_allowed
 ===
diff --git a/include

Re: [PATCH v6] mm: Proactive compaction

2020-06-15 Thread Nitin Gupta
On 6/15/20 7:25 AM, Oleksandr Natalenko wrote:
> On Mon, Jun 15, 2020 at 10:29:01AM +0200, Oleksandr Natalenko wrote:
>> Just to let you know, this fails to compile for me with THP disabled on
>> v5.8-rc1:
>>
>>   CC  mm/compaction.o
>> In file included from ./include/linux/dev_printk.h:14,
>>  from ./include/linux/device.h:15,
>>  from ./include/linux/node.h:18,
>>  from ./include/linux/cpu.h:17,
>>  from mm/compaction.c:11:
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘__compact_finished’ at mm/compaction.c:1982:11,
>> inlined from ‘compact_zone’ at mm/compaction.c:2062:8:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   |^~
>> ./include/linux/compiler.h:392:2: note: in expansion of macro 
>> ‘_compiletime_assert’
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^~~
>> ./include/linux/build_bug.h:39:37: note: in expansion of macro 
>> ‘compiletime_assert’
>>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>   | ^~
>> ./include/linux/build_bug.h:59:21: note: in expansion of macro 
>> ‘BUILD_BUG_ON_MSG’
>>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>>   | ^~~~
>> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’
>>   319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>   |^
>> ./include/linux/huge_mm.h:115:26: note: in expansion of macro 
>> ‘HPAGE_PMD_SHIFT’
>>   115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>   |  ^~~
>> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’
>>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
>>   |^~~
>> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’
>>  1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>>   |^~
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘kcompactd’ at mm/compaction.c:1918:12:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   |^~
>> ./include/linux/compiler.h:392:2: note: in expansion of macro 
>> ‘_compiletime_assert’
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^~~
>> ./include/linux/build_bug.h:39:37: note: in expansion of macro 
>> ‘compiletime_assert’
>>39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>   | ^~
>> ./include/linux/build_bug.h:59:21: note: in expansion of macro 
>> ‘BUILD_BUG_ON_MSG’
>>59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>>   | ^~~~
>> ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’
>>   319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>   |^
>> ./include/linux/huge_mm.h:115:26: note: in expansion of macro 
>> ‘HPAGE_PMD_SHIFT’
>>   115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>   |  ^~~
>> mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’
>>64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
>>   |^~~
>> mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’
>>  1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
>>   |^~
>> In function ‘fragmentation_score_zone’,
>> inlined from ‘kcompactd’ at mm/compaction.c:1918:12:
>> ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ 
>> declared with attribute error: BUILD_BUG failed
>>   392 |  _compiletime_assert(condition, msg, __compiletime_assert_, 
>> __COUNTER__)
>>   |  ^
>> ./include/linux/compiler.h:373:4: note: in definition of macro 
>> ‘__compiletime_assert’
>>   373 |prefix ## suffix();\
>>   

Re: [PATCH v6] mm: Proactive compaction

2020-06-11 Thread Nitin Gupta
On 6/9/20 12:23 PM, Khalid Aziz wrote:
> On Mon, 2020-06-01 at 12:48 -0700, Nitin Gupta wrote:
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-
>> demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is
>> able
>> to restore a highly fragmented memory state to a fairly compacted
>> memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory
>> as
>> hugepages keeping allocation latencies low.
>>
>> For a more proactive compaction, the approach taken here is to define
>> a
>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
>> for external fragmentation which kcompactd tries to maintain.
>>
>> The tunable takes a value in range [0, 100], with a default of 20.
>>
>> Note that a previous version of this patch [1] was found to introduce
>> too many tunables (per-order extfrag{low, high}), but this one
>> reduces
>> them to just one sysctl. Also, the new tunable is an opaque value
>> instead of asking for specific bounds of "external fragmentation",
>> which
>> would have been difficult to estimate. The internal interpretation of
>> this opaque value allows for future fine-tuning.
>>
>> Currently, we use a simple translation from this tunable to [low,
>> high]
>> "fragmentation score" thresholds (low=100-proactiveness,
>> high=low+10%).
>> The score for a node is defined as weighted mean of per-zone external
>> fragmentation. A zone's present_pages determines its weight.
>>
>> To periodically check per-node score, we reuse per-node kcompactd
>> threads, which are woken up every 500 milliseconds to check the same.
>> If
>> a node's score exceeds its high threshold (as derived from user-
>> provided
>> proactiveness value), proactive compaction is started until its score
>> reaches its low threshold value. By default, proactiveness is set to
>> 20,
>> which implies threshold values of low=80 and high=90.
>>
>> This patch is largely based on ideas from Michal Hocko [2]. See also
>> the
>> LWN article [3].
>>
>> Performance data
>> 
>>
>> System: x86_64, 1T RAM, 80 CPU threads.
>> Kernel: 5.6.0-rc3 + this patch
>>
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>>
>> Before starting the driver, the system was fragmented from a
>> userspace
>> program that allocates all memory and then for each 2M aligned
>> section,
>> frees 3/4 of base pages using munmap. The workload is mainly
>> anonymous
>> userspace pages, which are easy to move around. I intentionally
>> avoided
>> unmovable pages in this test to see how much latency we incur when
>> hugepage allocations hit direct compaction.
>>
>> 1. Kernel hugepage allocation latencies
>>
>> With the system in such a fragmented state, a kernel driver then
>> allocates as many hugepages as possible and measures allocation
>> latency:
>>
>> (all latency values are in microseconds)
>>
>> - With vanilla 5.6.0-rc3
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5    7894
>>   10    9496
>>   25   12561
>>   30   15295
>>   40   18244
>>   50   21229
>>   60   27556
>>   75   30147
>>   80   31047
>>   90   32859
>>   95   33799
>>
>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
>> 762G total free => 98% of free memory could be allocated as
>> hugepages)
>>
>> - With 5.6.0-rc3 + this patch, with proactiveness=20
>>
>> sysctl -w vm.compaction_proactiveness=20
>>
>>   percentile latency
>>   –––––––––– –––––––
>>    5     2
>>   10     2
>>   25     3
>>   30     3
>>   40     3
>>   50     4
>>   60     4
>>   75     4
>>   80     4
>>   90     5
>>   95   429
>>
>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
>> 762G total 

Re: [PATCH v6] mm: Proactive compaction

2020-06-09 Thread Nitin Gupta
On Mon, Jun 1, 2020 at 12:48 PM Nitin Gupta  wrote:
>
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>

> Signed-off-by: Nitin Gupta 
> Reviewed-by: Vlastimil Babka 

(+CC Khalid)

Can this be pipelined for upstream inclusion now? Sorry, I'm a bit
rusty on upstream flow these days.

Thanks,
Nitin


[PATCH v6] mm: Proactive compaction

2020-06-01 Thread Nitin Gupta
igher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while they try to bring a node's score within thresholds.

Backoff behavior


Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Reviewed-by: Vlastimil Babka 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v6 vs v5:
 - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and
   some cleanups (Vlastimil)
 - Cap min threshold to avoid excess compaction load in case user sets
   extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid)
 - Add some more explanation about the effect of tunable on compaction
   behavior in user guide (Khalid)

Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated
   the patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil
   Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream
   of hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  15 ++
 include/linux/compaction.h              |   2 +
 kernel/sysctl.c                         |   9 ++
 mm/compaction.c                         | 183 +++-
 mm/internal.h                           |   1 +
 mm/vmstat.c                             |  18 +++
 6 files changed, 223 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..360914b4f346 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,21 @@ all zones are compacted such that free memory is available 
in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
+Be careful when setting it to extreme values like 100, as that may
+cause excessive background compaction activity.
 
 compact_unevictable_allowed
 ===
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..ccd28978b296 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compact

Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Thu, May 28, 2020 at 4:32 PM Khalid Aziz  wrote:
>
> This looks good to me. I like the idea overall of controlling
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at what a reasonable
> value would be for this based upon their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like 'swappiness'
will always require some experimentation from the user.


> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does on-
> > demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that kernel is
> > able
> > to restore a highly fragmented memory state to a fairly compacted
> > memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory
> > as
> > hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness' which dictates bounds for
> > external
> > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
> > maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean system would
> continuously strive to drive level of fragmentation down to 0 which can
> not be reasonable and would bog the system down. A cap lower than 100
> might be a good idea to keep kcompactd from dragging system down.
>

Yes, I understand that a value of 100 would be a continuous compaction
storm but I still don't want to artificially cap the tunable. The interpretation
of this tunable can change in future, and a range of [0, 100] seems
more intuitive than, say [0, 90]. Still, I think a word of caution should
be added to its documentation (admin-guide/sysctl/vm.rst).
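
To make the trade-off concrete, a small standalone sketch of the low/high
translation plus the 5% floor that later versions add (it mirrors
fragmentation_score_wmark() as quoted earlier in this archive; not the kernel
code itself):

#include <stdio.h>

/* low = max(100 - proactiveness, 5), high = min(low + 10, 100). */
static unsigned int wmark_low(unsigned int proactiveness)
{
	unsigned int low = 100U - proactiveness;

	return low < 5U ? 5U : low;
}

int main(void)
{
	unsigned int samples[] = { 0, 20, 40, 100 };
	int i;

	for (i = 0; i < 4; i++) {
		unsigned int low = wmark_low(samples[i]);
		unsigned int high = low + 10U > 100U ? 100U : low + 10U;

		printf("proactiveness=%3u -> low=%3u high=%3u\n",
		       samples[i], low, high);
	}
	return 0;
}

Even at proactiveness=100 the target score stays at 5, so kcompactd stops short
of chasing zero fragmentation.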


> >

> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> > 762G total free => 98% of free memory could be allocated as
> > hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> > echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

oops... I forgot to update the patch description. This is from the v4 patch
which used sysfs but v5 switched to using sysctl.


> >

> > diff --git a/Documentation/admin-guide/sysctl/vm.rst
> > b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is
> > available in contiguous
> >  blocks where possible. This can be important for example in the
> > allocation of
> >  huge pages although processes will also directly compact memory as
> > required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default
> > value of
> > +20. This tunable determines how aggressively compaction is done in
> > the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also
> > lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> Value of 100 would cause kcompactd to try to bring fragmentation down
> to 0. If hugepages are being consumed and released continuously by the
> workload, it is possible that kcompactd keeps making progress (and
> hence passes the test "proactive_defer = score < prev_score ?")
> continuously but can not reach a fragmentation score of 0 and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to something
> less than 100 or set a floor for wmark_low above 0.
>

Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Wed, May 27, 2020 at 3:18 AM Vlastimil Babka  wrote:
>
> On 5/18/20 8:14 PM, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if the memory is fragmented. Linux kernel currently does on-demand
> > compaction as we request more hugepages, but this style of compaction
> > incurs very high latency. Experiments with one-time full memory
> > compaction (followed by hugepage allocations) show that kernel is able
> > to restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory as
> > hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > a new tunable called 'proactiveness' which dictates bounds for external
> > fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
>
> HPAGE_PMD_ORDER
>

Since HPAGE_PMD_ORDER is not always defined, we may have to fall back
to HUGETLB_PAGE_ORDER or even PMD_ORDER, so I think I should remove
references to the order from the patch description entirely.

I also need to change the tunable name from 'proactiveness' to the
'vm.compaction_proactiveness' sysctl.

modified description:
===
For a more proactive compaction, the approach taken here is to define
a new sysctl called 'vm.compaction_proactiveness' which dictates
bounds for external fragmentation which kcompactd tries to ...
===


> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes value in range [0, 100], with a default of 20.
> >


> >
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
>
> Make this link a [2] reference? I would also add: "See also the LWN article
> [3]." where [3] is https://lwn.net/Articles/817905/
>
>

Sounds good. I will turn these into [2] and [3] references.



>
> Reviewed-by: Vlastimil Babka 
>

> With some smaller nitpicks below.
>
> But as we are adding a new API, I would really appreciate others comment about
> the approach at least.
>



> > +/*
> > + * A zone's fragmentation score is the external fragmentation wrt to the
> > + * HUGETLB_PAGE_ORDER scaled by the zone's size. It returns a value in the
>
> HPAGE_PMD_ORDER
>

Maybe just remove reference to the order as I mentioned above?
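
For illustration only, here is a rough sketch of how such a zone score could
be computed, i.e. the zone's extfrag for the target order weighted by the
zone's share of the node's memory (the helper name and signature are
assumptions for this sketch, not the patch's code):

static unsigned int fragmentation_score_zone(struct zone *zone,
					     unsigned int order)
{
	unsigned long score;

	/* extfrag_for_order() returns [0, 100]; weight it by zone size. */
	score = zone->present_pages * extfrag_for_order(zone, order);
	return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
}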



> > +/*
> > + * Tunable for proactive compaction. It determines how
> > + * aggressively the kernel should compact memory in the
> > + * background. It takes values in the range [0, 100].
> > + */
> > +int sysctl_compaction_proactiveness = 20;
>
> These are usually __read_mostly
>

Ok.
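
For reference, a minimal sketch of that declaration with the annotation
applied (illustrative, not the exact respin):

/* Tunable for proactive compaction, range [0, 100]. */
int __read_mostly sysctl_compaction_proactiveness = 20;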


> > +
> >  /*
> >   * This is the entry point for compacting all nodes via
> >   * /proc/sys/vm/compact_memory
> > @@ -2637,6 +2769,7 @@ static int kcompactd(void *p)
> >  {
> >   pg_data_t *pgdat = (pg_data_t*)p;
> >   struct task_struct *tsk = current;
> > + unsigned int proactive_defer = 0;
> >
> >   const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> >
> > @@ -2652,12 +2785,34 @@ static int kcompactd(void *p)
> >   unsigned long pflags;
> >
> >   trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> > - wait_event_freezable(pgdat->kcompactd_wait,
> > - kcompactd_work_requested(pgdat));
> > + if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
> > + kcompactd_work_requested(pgdat),
> > + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>
> Hmm perhaps the wakeups should also backoff if there's nothing to do?


Perhaps. I just wanted to keep it simple, and waking a thread to do a
quick calculation didn't seem expensive to me, so I prefer this simpler
approach for now.
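
To make the flow concrete, here is a rough sketch of the kcompactd loop being
discussed; it is illustrative only, and helpers such as
should_proactive_compact_node() and fragmentation_score_node() are assumed
names for this sketch rather than code taken verbatim from the patch:

	/* Inside kcompactd()'s while (!kthread_should_stop()) loop: */
	if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
			kcompactd_work_requested(pgdat),
			msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
		/* Woken up: do targeted, on-demand compaction as before. */
		kcompactd_do_work(pgdat);
	} else if (should_proactive_compact_node(pgdat)) {
		unsigned int prev_score, score;

		if (proactive_defer) {
			/* A previous attempt made no progress; back off. */
			proactive_defer--;
			continue;
		}
		prev_score = fragmentation_score_node(pgdat);
		proactive_compact_node(pgdat);
		score = fragmentation_score_node(pgdat);
		/* Defer further attempts if the score did not improve. */
		proactive_defer = score < prev_score ?
				0 : 1 << COMPACT_MAX_DEFER_SHIFT;
	}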


> > +/*
> > + * Calculates external fragmentation within a zone wrt the given order.
> > + * It is defined as the percentage of pages found in blocks of size
> > + * less than 1 << order. It returns values in range [0, 100].
> > + */
> > +int extfrag_for_order(struct zone *zone, unsigned int order)
> > +{
> > + struct contig_page_info info;
> > +
> > + fill_contig_page_info(zone, order, &info);
> > + if (info.free_pages == 0)
> > + return 0;
> > +
> > + return (info.free_pages - (info.free_blocks_suitable << order)) * 100
> > + / info.free_pages;
>
> I guess this should also use div_u64() like __fragmentation_index() does.
>

Ok.
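
For illustration, the helper reworked with div_u64() might look roughly like
this (a sketch, not the actual respin):

int extfrag_for_order(struct zone *zone, unsigned int order)
{
	struct contig_page_info info;

	fill_contig_page_info(zone, order, &info);
	if (info.free_pages == 0)
		return 0;

	return div_u64((info.free_pages -
			(info.free_blocks_suitable << order)) * 100,
			info.free_pages);
}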


> > +}
> > +
> >  /* Same as __fragmentation index but allocs contig_page_info on stack */
> >  int fragmentation_index(struct zone *zone, unsigned int order)
> >  {
> >
>


Thanks,
Nitin


Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Nitin Gupta
On Thu, May 28, 2020 at 2:50 AM Vlastimil Babka  wrote:
>
> On 5/28/20 11:15 AM, Holger Hoffstätte wrote:
> >
> > On 5/18/20 8:14 PM, Nitin Gupta wrote:
> > [patch v5 :)]
> >
> > I've been successfully using this in my tree and it works great, but a 
> > friend
> > who also uses my tree just found a bug (actually an improvement ;) due to 
> > the
> > change from HUGETLB_PAGE_ORDER to HPAGE_PMD_ORDER in v5.
> >
> > When building with CONFIG_TRANSPARENT_HUGEPAGE=n (for some reason it was 
> > off)
> > HPAGE_PMD_SHIFT expands to BUILD_BUG() and compilation fails like this:
>
> Oops, I forgot about this. Still I believe HPAGE_PMD_ORDER is the best choice 
> as
> long as THP's are enabled. I guess fallback to HUGETLB_PAGE_ORDER would be
> possible if THPS are not enabled, but AFAICS some architectures don't define
> that. Such architectures perhaps won't benefit from proactive compaction 
> anyway?
>

I am not sure about such architectures, but in such cases we would end up
calculating the "fragmentation score" based on a page size which does not
match the architecture's view of the "default hugepage size". That is not a
terrible thing in itself, as compaction can still be done in the background,
after all.

Since we always need a target order to calculate the fragmentation score, how
about this fallback scheme:

HPAGE_PMD_ORDER -> HUGETLB_PAGE_ORDER -> PMD_ORDER
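
A minimal sketch of what such a fallback chain could look like (the
COMPACTION_HPAGE_ORDER name is an assumption for illustration):

#if defined CONFIG_TRANSPARENT_HUGEPAGE
#define COMPACTION_HPAGE_ORDER	HPAGE_PMD_ORDER
#elif defined HUGETLB_PAGE_ORDER
#define COMPACTION_HPAGE_ORDER	HUGETLB_PAGE_ORDER
#else
#define COMPACTION_HPAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* i.e. PMD order */
#endif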

Thanks,
Nitin


[PATCH v5] mm: Proactive compaction

2020-05-18 Thread Nitin Gupta
uation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.
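
As an aside, the 80/90 thresholds above follow directly from proactiveness=20;
a rough sketch of that mapping (the helper name is assumed for illustration):

/*
 * Map proactiveness [0, 100] to node fragmentation-score watermarks:
 * low = 100 - proactiveness, high = low + 10 (capped at 100).
 * proactiveness=20 gives low=80, high=90, as described above.
 */
static unsigned int fragmentation_score_wmark(bool low)
{
	unsigned int wmark_low = 100U - sysctl_compaction_proactiveness;

	return low ? wmark_low : min(wmark_low + 10U, 100U);
}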

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. A kcompactd thread consumes 100% of one of
the CPUs while it tries to bring its node's score within the thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v5 vs v4:
 - Change tunable from sysfs to sysctl (Vlastimil)
 - HUGETLB_PAGE_ORDER -> HPAGE_PMD_ORDER (Vlastimil)
 - Minor cleanups (remove redundant initializations, ...)

Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 Documentation/admin-guide/sysctl/vm.rst |  13 ++
 include/linux/compaction.h  |   2 +
 kernel/sysctl.c |   9 ++
 mm/compaction.c | 165 +++-
 mm/internal.h   |   1 +
 mm/vmstat.c |  17 +++
 6 files changed, 202 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst 
b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..e5d88cabe980 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -119,6 +119,19 @@ all zones are compacted such that free memory is available 
in contiguous
 blocks where possible. This can be important for example in the allocation of
 huge pages although processes will also directly compact memory as required.
 
+compaction_proactiveness
+========================
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
+
 
 compact_unevictable_allowed
 ===========================
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..ccd28978b296 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order)
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
+extern int sysctl_compaction_proactiveness;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
 extern int sysctl_extfrag_thresh

[PATCH v4] mm: Proactive compaction

2020-04-28 Thread Nitin Gupta
uation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the
low threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. A kcompactd thread consumes 100% of one of
the CPUs while it tries to bring its node's score within the thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.
However, if memory is filled with unmovable pages, proactive compaction
should essentially back off. To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/

Signed-off-by: Nitin Gupta 
To: Mel Gorman 
To: Michal Hocko 
To: Vlastimil Babka 
CC: Matthew Wilcox 
CC: Andrew Morton 
CC: Mike Kravetz 
CC: Joonsoo Kim 
CC: David Rientjes 
CC: Nitin Gupta 
CC: linux-kernel 
CC: linux-mm 
CC: Linux API 

---
Changelog v4 vs v3:
 - Document various functions.
 - Added admin-guide for the new tunable `proactiveness`.
 - Rename proactive_compaction_score to fragmentation_score for clarity.

Changelog v3 vs v2:
 - Make proactiveness a global tunable and not per-node. Also updated the
   patch description to reflect the same (Vlastimil Babka).
 - Don't start proactive compaction if kswapd is running (Vlastimil Babka).
 - Clarified in the description that compaction runs in parallel with
   the workload, instead of a one-time compaction followed by a stream of
   hugepage allocations.

Changelog v2 vs v1:
 - Introduce per-node and per-zone "proactive compaction score". This
   score is compared against watermarks which are set according to
   user provided proactiveness value.
 - Separate code-paths for proactive compaction from targeted compaction
   i.e. where pgdat->kcompactd_max_order is non-zero.
 - Renamed hpage_compaction_effort -> proactiveness. In future we may
   use more than extfrag wrt hugepage size to determine proactive
   compaction score.
---
 .../admin-guide/mm/proactive-compaction.rst   |  26 ++
 MAINTAINERS   |   6 +
 include/linux/compaction.h|   1 +
 mm/compaction.c   | 236 +-
 mm/internal.h |   1 +
 mm/page_alloc.c   |   1 +
 mm/vmstat.c   |  17 ++
 7 files changed, 282 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/proactive-compaction.rst

diff --git a/Documentation/admin-guide/mm/proactive-compaction.rst 
b/Documentation/admin-guide/mm/proactive-compaction.rst
new file mode 100644
index ..510f47e38238
--- /dev/null
+++ b/Documentation/admin-guide/mm/proactive-compaction.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _proactive_compaction:
+
+====================
+Proactive Compaction
+====================
+
+Many applications benefit significantly from the use of huge pages.
+However, huge-page allocations often incur a high latency or even fail
+under fragmented memory conditions. Proactive compaction provides an
+effective solution to these problems by doing memory compaction in the
+background.
+
+The process of proactive compaction is controlled by a single tunable:
+
+/sys/kernel/mm/compaction/proactiveness
+
+This tunable takes a value in the range [0, 100] with a default value of
+20. This tunable determines how aggressively compaction is done in the
+background. Setting it to 0 disables proactive compaction.
+
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.
diff --git a/MAINTAINERS b/MAINTAINERS
index 26f281d9f32a..e448c0b35ecb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18737,6 +18737,12 @@ L: linux...@kvack.org
 S: Maintained
 F: mm/zswap.c
 
+PROACTIVE COMPACTION
+M: Nitin Gupta 
+L: linux...@kvack.org
+S: Maintained
+F: Documentation/admin-guide/mm/proactive-compa

Re: [RFC] mm: Proactive compaction

2019-09-19 Thread Nitin Gupta
On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >   - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> >   - Use a test program to fragment memory: the program allocates all
> > memory
> >   and then for each 2M aligned section, frees 3/4 of base pages using
> >   munmap.
> >   - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >   compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their
> needs?


Yes, it's difficult for an admin to get so many tunables right unless
targeting a very specific workload.

How about a simpler solution where we expose just one tunable per node:
   /sys/.../node-x/compaction_effort
which accepts [0, 100]

This parallels /proc/sys/vm/swappiness but for compaction. With this
single number, we can estimate per-order [low, high] watermarks for external
fragmentation like this:
 - For now, map this range to [low, medium, high], which corresponds to
   specific low, high thresholds for extfrag.
 - Apply more relaxed thresholds for higher orders than for lower orders.

With this single tunable we remove the burden of setting per-order explicit
[low, high] thresholds and it should be easier to experiment with.
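
Purely to illustrate the idea (the helper and the exact formula below are
assumptions, not a worked-out proposal), the mapping could look something
like:

/*
 * Sketch: derive per-order [low, high] extfrag thresholds from a single
 * per-node compaction_effort value in [0, 100]. Higher effort tolerates
 * less fragmentation; higher orders get more relaxed thresholds.
 */
static void effort_to_extfrag_thresholds(unsigned int effort,
					 unsigned int order,
					 unsigned int *low,
					 unsigned int *high)
{
	unsigned int base = 100 - effort;

	*low  = min(base + order, 100U);
	*high = min(*low + 10, 100U);
}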

-Nitin





Re: [RFC] mm: Proactive compaction

2019-09-19 Thread Nitin Gupta
On Thu, 2019-08-22 at 09:51 +0100, Mel Gorman wrote:
> As unappealing as it sounds, I think it is better to try improve the
> allocation latency itself instead of trying to hide the cost in a kernel
> thread. It's far harder to implement as compaction is not easy but it
> would be more obvious what the savings are by looking at a histogram of
> allocation latencies -- there are other metrics that could be considered
> but that's the obvious one.
> 

Do you mean reducing allocation latency especially when it hits the direct
compaction path? Do you have any ideas in mind for this? I'm open to
working on them and reporting back latency numbers, while I think more about
less tunable-heavy background (pro-active) compaction approaches.

-Nitin



Re: [RFC] mm: Proactive compaction

2019-09-16 Thread Nitin Gupta
On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
> 
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> > 
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> > 
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> > 
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > 
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> > 
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> > even if extfrag thresholds are not met.
> > 
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >  respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >  and then for each 2M aligned section, frees 3/4 of base pages using
> >  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >  compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> > 
> 
> Is there an update to this proposal or non-RFC patch that has been posted 
> for proactive compaction?
> 
> We've had good success with periodically compacting memory on a regular 
> cadence on systems with hugepages enabled.  The cadence itself is defined 
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> compaction in an attempt to keep zones as defragmented as possible 
> (perhaps more "proactive" than what is proposed here in an attempt to keep 
> all memory as unfragmented as possible regardless of extfrag thresholds).  
> It also avoids corner-cases where kcompactd could become more expensive 
> than what is anticipated because it is unsuccessful at compacting memory 
> yet the extfrag threshold is still exceeded.
> 
>  [*] Khugepaged instead of kcompactd only because this is only enabled
>  for systems where transparent hugepages are enabled, probably better
>  off in kcompactd to avoid duplicating work between two kthreads if
>  there is already a need for background compaction.
> 


Discussion on this RFC patch revolved around the issue of exposing too
many tunables (per-node, per-order, [low-high] extfrag thresholds). It
was sort-of concluded that no admin will get these tunables right for
a variety of workloads.

To eliminate the need for tunables, I proposed another patch:

https://patchwork.kernel.org/patch/11140067/

which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a
callback function which allows any driver to implement ad-hoc
compaction policies. There is also a sample driver which makes use
of this interface to keep hugepage external fragmentation within
specified range (exposed through debugfs):

https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin



Re: [PATCH] mm: Add callback for defining compaction completion

2019-09-12 Thread Nitin Gupta
On Thu, 2019-09-12 at 17:11 +0530, Bharath Vedartham wrote:
> Hi Nitin,
> On Wed, Sep 11, 2019 at 10:33:39PM +, Nitin Gupta wrote:
> > On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > > [...]
> > > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > > For some applications we need to allocate almost all memory as
> > > > > > hugepages.
> > > > > > However, on a running system, higher order allocations can fail if
> > > > > > the
> > > > > > memory is fragmented. Linux kernel currently does on-demand
> > > > > > compaction
> > > > > > as we request more hugepages but this style of compaction incurs
> > > > > > very
> > > > > > high latency. Experiments with one-time full memory compaction
> > > > > > (followed by hugepage allocations) shows that kernel is able to
> > > > > > restore a highly fragmented memory state to a fairly compacted
> > > > > > memory
> > > > > > state within <1 sec for a 32G system. Such data suggests that a
> > > > > > more
> > > > > > proactive compaction can help us allocate a large fraction of
> > > > > > memory
> > > > > > as hugepages keeping allocation latencies low.
> > > > > > 
> > > > > > In general, compaction can introduce unexpected latencies for
> > > > > > applications that don't even have strong requirements for
> > > > > > contiguous
> > > > > > allocations.
> > > 
> > > Could you expand on this a bit please? Gfp flags allow to express how
> > > much the allocator try and compact for a high order allocations. Hugetlb
> > > allocations tend to require retrying and heavy compaction to succeed and
> > > the success rate tends to be pretty high from my experience.  Why that
> > > is not case in your case?
> > > 
> The link to the driver you send on gitlab is not working :(

Sorry about that, here's the correct link:
https://gitlab.com/nigupta/linux/snippets/1894161

> > Yes, I have the same observation: with `GFP_TRANSHUGE |
> > __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
> > allocated as hugepages). However, what I'm trying to point out is that
> > this
> > high success rate comes with high allocation latencies (90th percentile
> > latency of 2206us). On the same system, the same high-order allocations
> > which hit the fast path have latency <5us.
> > 
> > > > > > It is also hard to efficiently determine if the current
> > > > > > system state can be easily compacted due to mixing of unmovable
> > > > > > memory. Due to these reasons, automatic background compaction by
> > > > > > the
> > > > > > kernel itself is hard to get right in a way which does not hurt
> > > > > > unsuspecting
> > > > > applications or waste CPU cycles.
> > > > > 
> > > > > We do trigger background compaction on a high order pressure from
> > > > > the
> > > > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > > > 
> > > > 
> > > > Whenever kcompactd is woken up, it does just enough work to create
> > > > one free page of the given order (compaction_control.order) or higher.
> > > 
> > > This is an implementation detail IMHO. I am pretty sure we can do a
> > > better auto tuning when there is an indication of a constant flow of
> > > high order requests. This is no different from the memory reclaim in
> > > principle. Just because the kswapd autotuning not fitting with your
> > > particular workload you wouldn't want to export direct reclaim
> > > functionality and call it from a random module. That is just doomed to
> > > fail because different subsystems in control just leads to decisions
> > > going against each other.
> > > 
> > 
> > I don't want to go the route of adding any auto-tuning/perdiction code to
> > control compaction in the kernel. I'm more inclined towards extending
> > existing interfaces to allow compaction behavior to be controlled either
> > from userspace or a kernel driver. Letting a random module control
> > compaction or a root process pumping new tunables from sysfs is the same
> > in
> > principle.
> Do you think a kernel 

Re: [PATCH] mm: Add callback for defining compaction completion

2019-09-11 Thread Nitin Gupta
On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> [...]
> > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > For some applications we need to allocate almost all memory as
> > > > hugepages.
> > > > However, on a running system, higher order allocations can fail if the
> > > > memory is fragmented. Linux kernel currently does on-demand
> > > > compaction
> > > > as we request more hugepages but this style of compaction incurs very
> > > > high latency. Experiments with one-time full memory compaction
> > > > (followed by hugepage allocations) shows that kernel is able to
> > > > restore a highly fragmented memory state to a fairly compacted memory
> > > > state within <1 sec for a 32G system. Such data suggests that a more
> > > > proactive compaction can help us allocate a large fraction of memory
> > > > as hugepages keeping allocation latencies low.
> > > > 
> > > > In general, compaction can introduce unexpected latencies for
> > > > applications that don't even have strong requirements for contiguous
> > > > allocations.
> 
> Could you expand on this a bit please? Gfp flags allow to express how
> much the allocator try and compact for a high order allocations. Hugetlb
> allocations tend to require retrying and heavy compaction to succeed and
> the success rate tends to be pretty high from my experience.  Why that
> is not case in your case?
> 

Yes, I have the same observation: with `GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
allocated as hugepages). However, what I'm trying to point out is that this
high success rate comes with high allocation latencies (90th percentile
latency of 2206us). On the same system, the same high-order allocations
which hit the fast path have latency <5us.

> > > > It is also hard to efficiently determine if the current
> > > > system state can be easily compacted due to mixing of unmovable
> > > > memory. Due to these reasons, automatic background compaction by the
> > > > kernel itself is hard to get right in a way which does not hurt
> > > > unsuspecting
> > > applications or waste CPU cycles.
> > > 
> > > We do trigger background compaction on a high order pressure from the
> > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > 
> > 
> > Whenever kcompactd is woken up, it does just enough work to create
> > one free page of the given order (compaction_control.order) or higher.
> 
> This is an implementation detail IMHO. I am pretty sure we can do a
> better auto tuning when there is an indication of a constant flow of
> high order requests. This is no different from the memory reclaim in
> principle. Just because the kswapd autotuning not fitting with your
> particular workload you wouldn't want to export direct reclaim
> functionality and call it from a random module. That is just doomed to
> fail because different subsystems in control just leads to decisions
> going against each other.
> 

I don't want to go the route of adding any auto-tuning/prediction code to
control compaction in the kernel. I'm more inclined towards extending
existing interfaces to allow compaction behavior to be controlled either
from userspace or a kernel driver. Letting a random module control
compaction or a root process pumping new tunables from sysfs is the same in
principle.

This patch is in the spirit of simple extension to existing
compaction_zone_order() which allows either a kernel driver or userspace
(through sysfs) to control compaction.

Also, we should avoid driving hard parallels between reclaim and
compaction: the former is often necessary for forward progress while the
latter is often an optimization. Since contiguous allocations are mostly
optimizations, it's good to expose hooks from the kernel that let the user
(through a driver or userspace) control it using their own heuristics.


I thought hard about what's lacking in the current userspace interface (sysfs):
 - /proc/sys/vm/compact_memory: full system compaction is not an option as
   a viable pro-active compaction strategy.
 - possibly expose [low, high] threshold values for each node and let
   kcompactd act on them. This was my approach in the original patch I
   linked earlier. The problem here is that it introduces too many tunables.

Considering the above, I came up with this callback approach, which makes it
trivial to introduce user-specific policies for compaction. It puts the
onus of system stability and responsiveness in the hands of the user without
burdening admins with more tunables or adding crystal balls to the kernel.

>

RE: [PATCH] mm: Add callback for defining compaction completion

2019-09-10 Thread Nitin Gupta
> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Michal Hocko
> Sent: Tuesday, September 10, 2019 1:19 PM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz;
> mgor...@techsingularity.net; dan.j.willi...@intel.com;
> khalid.a...@oracle.com; Matthew Wilcox ; Yu Zhao
> ; Qian Cai ; Andrey Ryabinin
> ; Allison Randal ; Mike
> Rapoport ; Thomas Gleixner
> ; Arun KS ; Wei Yang
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [PATCH] mm: Add callback for defining compaction completion
> 
> On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> hugepages.
> > However, on a running system, higher order allocations can fail if the
> > memory is fragmented. Linux kernel currently does on-demand
> compaction
> > as we request more hugepages but this style of compaction incurs very
> > high latency. Experiments with one-time full memory compaction
> > (followed by hugepage allocations) shows that kernel is able to
> > restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory
> > as hugepages keeping allocation latencies low.
> >
> > In general, compaction can introduce unexpected latencies for
> > applications that don't even have strong requirements for contiguous
> > allocations. It is also hard to efficiently determine if the current
> > system state can be easily compacted due to mixing of unmovable
> > memory. Due to these reasons, automatic background compaction by the
> > kernel itself is hard to get right in a way which does not hurt unsuspecting
> applications or waste CPU cycles.
> 
> We do trigger background compaction on a high order pressure from the
> page allocator by waking up kcompactd. Why is that not sufficient?
> 

Whenever kcompactd is woken up, it does just enough work to create
one free page of the given order (compaction_control.order) or higher.

Such a design causes very high latency for workloads where we want
to allocate lots of hugepages in a short period of time. With pro-active
compaction we can hide much of this latency. For some more background
discussion and data, please see this thread:

https://patchwork.kernel.org/patch/11098289/

> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
> 
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overal MM situation.
>

Embedding any specific policy (like: keep external fragmentation for order-9
between 30-40%) within the MM core looks like a bad idea. As a driver, we
can easily measure parameters like system load, current fragmentation level
for any order in any zone, etc., to make an informed decision.
See the thread I referred to above for more background discussion.

> If you need to control compaction from the userspace you have an interface
> for that.  It is also completely unexplained why you need a completion
> callback.
> 

/proc/sys/vm/compact_memory does whole-system compaction, which is
often too much as a pro-active compaction strategy. To get more control
over how much compaction work to do, I have added a compaction callback
which controls how much work is done in one compaction cycle.
 
For example, as a test for this patch, I have a small test driver which defines
[low, high] external fragmentation thresholds for HPAGE_ORDER. Whenever
extfrag is within this range, I run compact_zone_order with a callback which
returns COMPACT_CONTINUE while extfrag > the low threshold and returns
COMPACT_PARTIAL_SKIPPED once extfrag <= low.
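
Roughly, such a completion callback could look like this (a sketch with
assumed names; extfrag_for_order() and the threshold variable are stand-ins
rather than the actual driver code):

static int extfrag_low_threshold = 25;	/* stand-in for the driver's knob */

static enum compact_result hpage_compact_finished(struct zone *zone, int order)
{
	/* Keep compacting while fragmentation is above the low threshold. */
	if (extfrag_for_order(zone, order) > extfrag_low_threshold)
		return COMPACT_CONTINUE;

	return COMPACT_PARTIAL_SKIPPED;
}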

Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847

Maybe this code can be added to Documentation/...

Thanks,
Nitin

> 
> > Signed-off-by: Nitin Gupta 
> > ---
> >  include/linux/compaction.h | 10 ++
> >  mm/compaction.c| 20 ++--
> >  mm/internal.h  |  2 ++
> >  3 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..1ea828450fa2 100644
> > --- a/include/linux/compaction.h
> >

[PATCH] mm: Add callback for defining compaction completion

2019-09-10 Thread Nitin Gupta
For some applications we need to allocate almost all memory as hugepages.
However, on a running system, higher order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) shows that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping allocation
latencies low.

In general, compaction can introduce unexpected latencies for applications
that don't even have strong requirements for contiguous allocations. It is
also hard to efficiently determine if the current system state can be
easily compacted due to mixing of unmovable memory. Due to these reasons,
automatic background compaction by the kernel itself is hard to get right
in a way which does not hurt unsuspecting applications or waste CPU cycles.

Even with these caveats, pro-active compaction can still be very useful in
certain scenarios to reduce hugepage allocation latencies. This callback
interface allows drivers to drive compaction based on their own policies
like the current level of external fragmentation for a particular order,
system load etc.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h | 10 ++
 mm/compaction.c| 20 ++--
 mm/internal.h  |  2 ++
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..1ea828450fa2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,16 @@ enum compact_result {
COMPACT_SUCCESS,
 };
 
+/* Callback function to determine if compaction is finished. */
+typedef enum compact_result (*compact_finished_cb)(
+   struct zone *zone, int order);
+
+enum compact_result compact_zone_order(struct zone *zone, int order,
+   gfp_t gfp_mask, enum compact_priority prio,
+   unsigned int alloc_flags, int classzone_idx,
+   struct page **capture,
+   compact_finished_cb compact_finished_cb);
+
 struct alloc_context; /* in mm/internal.h */
 
 /*
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..73e2e9246bc4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct 
compact_control *cc)
return COMPACT_PARTIAL_SKIPPED;
}
 
+   if (cc->compact_finished_cb)
+   return cc->compact_finished_cb(cc->zone, cc->order);
+
if (is_via_compact_memory(cc->order))
return COMPACT_CONTINUE;
 
@@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct 
capture_control *capc)
return ret;
 }
 
-static enum compact_result compact_zone_order(struct zone *zone, int order,
+enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int classzone_idx,
-   struct page **capture)
+   struct page **capture,
+   compact_finished_cb compact_finished_cb)
 {
enum compact_result ret;
struct compact_control cc = {
@@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct 
zone *zone, int order,
MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
.alloc_flags = alloc_flags,
.classzone_idx = classzone_idx,
-   .direct_compaction = true,
+   .direct_compaction = !compact_finished_cb,
.whole_zone = (prio == MIN_COMPACT_PRIORITY),
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
-   .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
+   .ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
+   .compact_finished_cb = compact_finished_cb
};
struct capture_control capc = {
.cc = &cc,
@@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct 
zone *zone, int order,
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
 
-   *capture = capc.page;
+   if (capture)
+   *capture = capc.page;
current->capture_control = NULL;
 
return ret;
 }
+EXPORT_SYMBOL(compact_zone_order);
 
 int sysctl_extfrag_threshold = 500;
 
@@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, 
unsigned int order,
}
 
status = compact_zone_order(zone, order, gfp_mask, prio,
-   alloc_flags, ac_classzone_idx(ac), capture);
+   alloc_fl

Re: [RFC] mm: Proactive compaction

2019-08-27 Thread Nitin Gupta
On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but
> > > it is not
> > > free either. Even though the scanning and migration may happen in
> > > a kernel
> > > thread, tasks can incur faults while waiting for compaction to
> > > complete if the
> > > task accesses data being migrated. This means that costs are
> > > incurred by
> > > applications on a system that may never care about high-order
> > > allocation
> > > latency -- particularly if the allocations typically happen at
> > > application
> > > initialisation time.  I recognise that kcompactd makes a bit of
> > > effort to
> > > compact memory out-of-band but it also is typically triggered in
> > > response to
> > > reclaim that was triggered by a high-order allocation request.
> > > i.e. the work
> > > done by the thread is triggered by an allocation request that hit
> > > the slow
> > > paths and not a preemptive measure.
> > > 
> > 
> > Hitting the slow path for every higher-order allocation is a
> > signification
> > performance/latency issue for applications that requires a large
> > number of
> > these allocations to succeed in bursts. To get some concrete
> > numbers, I
> > made a small driver that allocates as many hugepages as possible
> > and
> > measures allocation latency:
> > 
> 
> Every higher-order allocation does not necessarily hit the slow path
> nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense.
I meant to say: higher-order allocations *tend* to hit the slow path
with a high probability under a reasonably fragmented memory state,
and when they do, they incur high latency.


> 
> > The driver first tries to allocate hugepage using
> > GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails,
> > tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in table below). We stop the allocation loop if both
> > methods
> > fail.
> > 
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies
> > are in microsec.
> > 
> > > GFP/Stat |Any |   Light |   Fallback |
> > > : | -: | --: | -: |
> > >count |   9908 | 788 |   9120 |
> > >  min |0.0 | 0.0 | 1726.0 |
> > >  max |   135387.0 |   142.0 |   135387.0 |
> > > mean |5494.66 |1.83 |5969.26 |
> > >   stddev |   21624.04 |7.58 |   22476.06 |
> 
> Given that it is expected that there would be significant tail
> latencies,
> it would be better to analyse this in terms of percentiles. A very
> small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
> 

Here is the same data in terms of percentiles:

- with vanilla kernel 5.3.0-rc5:

percentile   latency
––––––––––   –––––––
         5         1
        10      1790
        25      1829
        30      1838
        40      1854
        50      1871
        60      1890
        75      1924
        80      1945
        90      2206
        95      2302


- Now with kernel 5.3.0-rc5 + this patch:

percentile   latency
––––––––––   –––––––
         5         3
        10         4
        25         4
        30         4
        40         4
        50         4
        60         4
        75         5
        80         5
        90         9
        95      1154


> > As you can see, the mean and stddev of allocation is extremely high
> > with
> > the current approach of on-demand compaction.
> > 
> > The system was fragmented from a userspace program as I described
> > in this
> > patch description. The workload is mainly anonymous userspace pages
> > which
> > as easy to move around. I intentionally avoided unmovable pages in
> > this
> > test to see how much latency do we incur just by hitting the slow
> > path for
> > a majority of allocations.
> > 
> 
> Even though, the penalty for proactive compaction is that
> applications
> that may have no interest in higher-order pages may still stall while
> their data is migrated if the data is hot. This is why I think the
> focus
> should be on reducing the latency of compaction -- it benefits
> applications that require higher-order latencies without increasing
> the
> overhead for unrelated applications.
> 

Sure, reducing compaction latency 

Re: [RFC] mm: Proactive compaction

2019-08-22 Thread Nitin Gupta
> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Mel Gorman
> Sent: Thursday, August 22, 2019 1:52 AM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz; mho...@suse.com;
> dan.j.willi...@intel.com; Yu Zhao ; Matthew Wilcox
> ; Qian Cai ; Andrey Ryabinin
> ; Roman Gushchin ; Greg Kroah-
> Hartman ; Kees Cook
> ; Jann Horn ; Johannes
> Weiner ; Arun KS ; Janne
> Huttunen ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> 
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time.  I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
> 

Hitting the slow path for every higher-order allocation is a significant
performance/latency issue for applications that require a large number of
these allocations to succeed in bursts. To get some concrete numbers, I
made a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
(referred to as "Light" in the table below) and if that fails, tries to
allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
"Fallback" in table below). We stop the allocation loop if both methods
fail.

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microsec.

| GFP/Stat |Any |   Light |   Fallback |
|: | -: | --: | -: |
|count |   9908 | 788 |   9120 |
|  min |0.0 | 0.0 | 1726.0 |
|  max |   135387.0 |   142.0 |   135387.0 |
| mean |5494.66 |1.83 |5969.26 |
|   stddev |   21624.04 |7.58 |   22476.06 |

As you can see, the mean and stddev of allocation latency are extremely high
with the current approach of on-demand compaction.

The system was fragmented from a userspace program as I described in this
patch description. The workload is mainly anonymous userspace pages, which
are easy to move around. I intentionally avoided unmovable pages in this
test to see how much latency we incur just by hitting the slow path for
a majority of allocations.
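
For concreteness, a minimal userspace sketch of such a fragmenter is below
(details like the mapping size and which 3/4 of each 2M section gets unmapped
are assumptions, not the actual test program):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	size_t len = 24UL << 30;	/* size this to roughly the free RAM */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t start, off;

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);		/* fault in every base page */

	/* Walk 2M-aligned sections; keep 1/4, munmap the other 3/4. */
	start = ((uintptr_t)p + SZ_2M - 1) & ~(SZ_2M - 1);
	for (off = start; off + SZ_2M <= (uintptr_t)p + len; off += SZ_2M)
		munmap((void *)(off + SZ_2M / 4), SZ_2M * 3 / 4);

	pause();	/* keep the remaining pages mapped */
	return 0;
}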


> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> 
> These will be difficult for an admin to tune that is not extremely familiar 
> with
> how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will 
> always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to empirically
determined values on highly specialized systems like database appliances.
However, on a generic system, there is no real reasonable value.


Still, at the very least, I would like an interface that allows compacting
the system to a reasonable state. Something like:

compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > high, a

RE: [RFC] mm: Proactive compaction

2019-08-21 Thread Nitin Gupta



> -Original Message-
> From: owner-linux...@kvack.org  On Behalf
> Of Matthew Wilcox
> Sent: Tuesday, August 20, 2019 3:21 PM
> To: Nitin Gupta 
> Cc: a...@linux-foundation.org; vba...@suse.cz;
> mgor...@techsingularity.net; mho...@suse.com;
> dan.j.willi...@intel.com; Yu Zhao ; Qian Cai
> ; Andrey Ryabinin ; Roman
> Gushchin ; Greg Kroah-Hartman
> ; Kees Cook ; Jann
> Horn ; Johannes Weiner ; Arun
> KS ; Janne Huttunen
> ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> 
> Your test program is a good idea, but I worry it may produce unrealistically
> optimistic outcomes.  Page cache is readily reclaimable, so you're setting up
> a situation where 2MB pages can once again be produced.
> 
> How about this:
> 
> One program which creates a file several times the size of memory (or
> several files which total the same amount).  Then read the file(s).  Maybe by
> mmap(), and just do nice easy sequential accesses.
> 
> A second program which causes slab allocations.  eg
> 
> for (;;) {
>   for (i = 0; i < n * 1000 * 1000; i++) {
>   char fname[64];
> 
>   sprintf(fname, "/tmp/missing.%d", i);
>   open(fname, O_RDWR);
>   }
> }
> 
> The first program should thrash the pagecache, causing pages to
> continuously be allocated, reclaimed and freed.  The second will create
> millions of dentries, causing the slab allocator to allocate a lot of
> order-0 pages which are harder to free.  If you really want to make it work
> hard, mix in opening some files whihc actually exist, preventing the pages
> which contain those dentries from being evicted.
> 
> This feels like it's simulating a more normal workload than your test.
> What do you think?

This combination of workloads for mixing movable and unmovable
pages sounds good.   I coded up these two and here's what I observed:

- kernel: 5.3.0-rc5 + this patch, x86_64, 32G RAM.
- Set extfrag_{low,high} = {25,30} for order-9
- Run pagecache and dentry thrash test programs as you described
- for the pagecache test: mmap and sequentially read a 128G file on a 32G system.
- for the dentry test: set n=100. I created /tmp/missing.[0-1] so these
  dentries stay allocated.
- Start linux kernel compile for further pagecache thrashing.

With the above workload, fragmentation for order-9 stayed at 80-90%, which kept
kcompactd0 working, but it couldn't make progress due to unmovable pages
from dentries. As expected, we keep hitting compaction_deferred() as
compaction attempts fail.

After a manual `echo 3 > /proc/sys/vm/drop_caches` and stopping the dentry
thrasher, kcompactd succeeded in bringing extfrag below the set thresholds.


With unmovable pages spread across memory, there is little that compaction
can do. Maybe we should have a knob like 'compactness' (like swappiness) which
defines how aggressive compaction can be. For high values, maybe allow
freeing dentries too? This way, hugepage-sensitive applications can trade
higher I/O latencies for better hugepage availability.

Thanks,
Nitin








RE: [RFC] mm: Proactive compaction

2019-08-20 Thread Nitin Gupta
> -Original Message-
> From: Vlastimil Babka 
> Sent: Tuesday, August 20, 2019 1:46 AM
> To: Nitin Gupta ; a...@linux-foundation.org;
> mgor...@techsingularity.net; mho...@suse.com;
> dan.j.willi...@intel.com
> Cc: Yu Zhao ; Matthew Wilcox ;
> Qian Cai ; Andrey Ryabinin ; Roman
> Gushchin ; Greg Kroah-Hartman
> ; Kees Cook ; Jann
> Horn ; Johannes Weiner ; Arun
> KS ; Janne Huttunen
> ; Konstantin Khlebnikov
> ; linux-kernel@vger.kernel.org; linux-
> m...@kvack.org; Khalid Aziz 
> Subject: Re: [RFC] mm: Proactive compaction
> 
> +CC Khalid Aziz who proposed a different approach:
> https://lore.kernel.org/linux-mm/20190813014012.30232-1-
> khalid.a...@oracle.com/T/#u
> 
> On 8/16/19 11:43 PM, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> 
> Could you define what exactly extfrag is, in the changelog?
> 

extfrag for order-n = ((total free pages) - (free pages for order >= n)) /
                      (total free pages) * 100

I will add this to the v2 changelog.
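
A tiny sketch of that formula in code, for clarity (illustrative only):

/* Percentage of free pages sitting in free blocks of order < n. */
static unsigned int extfrag_percent(unsigned long total_free_pages,
				    unsigned long free_pages_order_ge_n)
{
	if (!total_free_pages)
		return 0;
	return (total_free_pages - free_pages_order_ge_n) * 100 /
	       total_free_pages;
}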


> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays
> > inactive even if extfrag thresholds are not met.
> 
> How does it translate to e.g. the number of free pages of order?
> 

Watermarks are checked as follows (see: __compaction_suitable)

watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
            low_wmark_pages(zone) : min_wmark_pages(zone);

If a zone does not satisfy this watermark, we don't start compaction.

> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-
> mm/20161230131412.gi13...@dhcp22.suse.cz
> > /
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their needs?


I expect that a workload would typically care about just a particular page order
(say, order-9 on x86 for the default hugepage size). An admin can set
extfrag_{low,high} for just that order (say, low=25, high=30) and leave these
thresholds at their default values (low=100, high=100) for all other orders.
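
For example, with the sysfs layout proposed in this RFC, on a system where
only 2M hugepages matter that would be something like:

  echo 25 > /sys/kernel/mm/compaction/order-9/extfrag_low
  echo 30 > /sys/kernel/mm/compaction/order-9/extfrag_high

with every other order left at its default of 100.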

Thanks,
Nitin


> 
> (keeping the rest for reference)
> 
> > Signed-off-by: Nitin Gupta 
> > ---
> >  include/linux/compaction.h |  12 ++
> >  mm/compaction.c| 250 ++---
> >  mm/vmstat.c|  12 ++
> >  3 files changed, 228 insertions(+), 46 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..26bfedbbc64b 100

[RFC] mm: Proactive compaction

2019-08-16 Thread Nitin Gupta
For some applications we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if the memory is fragmented. The Linux kernel currently does
on-demand compaction as we request more hugepages, but this style of
compaction incurs very high latency. Experiments with one-time full
memory compaction (followed by hugepage allocations) show that the kernel
is able to restore a highly fragmented memory state to a fairly
compacted memory state within <1 sec for a 32G system. Such data
suggests that more proactive compaction can help us allocate a large
fraction of memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define
per page-order external fragmentation thresholds and let kcompactd
threads act on these thresholds.

The low and high thresholds are defined per page-order and exposed
through sysfs:

  /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}

Per-node kcompactd thread is woken up every few seconds to check if
any zone on its node has extfrag above the extfrag_high threshold for
any order, in which case the thread starts compaction in the background
till all zones are below the extfrag_low level for all orders. By default
both these thresholds are set to 100 for all orders, which essentially
disables kcompactd.

To avoid wasting CPU cycles when compaction cannot help, such as when
memory is full, we check both extfrag > extfrag_high and
compaction_suitable(zone). This allows the kcompactd thread to stay inactive
even if extfrag thresholds are not met.

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/

Testing done (on x86):
 - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
 respectively.
 - Use a test program to fragment memory: the program allocates all memory
 and then, for each 2M-aligned section, frees 3/4 of the base pages using
 munmap (a rough sketch of such a fragmenter follows this list).
 - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
 compaction till extfrag < extfrag_low for order-9.
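
Not the exact program used above, just a minimal sketch of that kind of
fragmenter (the region size, MAP_POPULATE, and which base pages get freed
are my assumptions, not part of the patch):

/* Map and populate a large region, then free 3 out of every 4 base pages
 * in each 2M-aligned section, so only small free blocks go back to the
 * buddy allocator and order-9 extfrag stays high.
 */
#include <unistd.h>
#include <sys/mman.h>

#define REGION_SIZE  (16UL << 30)   /* amount of memory to fragment (illustrative) */
#define SECTION_SIZE (2UL << 20)    /* 2M sections */
#define BASE_PAGE    4096UL

int main(void)
{
        unsigned long sec, off;
        char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        for (sec = 0; sec < REGION_SIZE; sec += SECTION_SIZE)
                for (off = 0; off < SECTION_SIZE; off += 4 * BASE_PAGE)
                        /* keep one base page, free the next three */
                        munmap(p + sec + off + BASE_PAGE, 3 * BASE_PAGE);

        pause();        /* keep the remaining mappings alive */
        return 0;
}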

The patch has plenty of rough edges but posting it early to see if I'm
going in the right direction and to get some early feedback.

Signed-off-by: Nitin Gupta 
---
 include/linux/compaction.h |  12 ++
 mm/compaction.c| 250 ++---
 mm/vmstat.c|  12 ++
 3 files changed, 228 insertions(+), 46 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..26bfedbbc64b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -60,6 +60,17 @@ enum compact_result {
 
 struct alloc_context; /* in mm/internal.h */
 
+// "order-%d"
+#define COMPACTION_ORDER_STATE_NAME_LEN 16
+// Per-order compaction state
+struct compaction_order_state {
+   unsigned int order;
+   unsigned int extfrag_low;
+   unsigned int extfrag_high;
+   unsigned int extfrag_curr;
+   char name[COMPACTION_ORDER_STATE_NAME_LEN];
+};
+
 /*
  * Number of free order-0 pages that should be available above given watermark
  * to make sure compaction has reasonable chance of not running out of free
@@ -90,6 +101,7 @@ extern int sysctl_compaction_handler(struct ctl_table 
*table, int write,
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
+extern int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, unsigned int alloc_flags,
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..21866b1ad249 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -25,6 +25,10 @@
 #include 
 #include "internal.h"
 
+#ifdef CONFIG_COMPACTION
+struct compaction_order_state compaction_order_states[MAX_ORDER+1];
+#endif
+
 #ifdef CONFIG_COMPACTION
 static inline void count_compact_event(enum vm_event_item item)
 {
@@ -1846,6 +1850,49 @@ static inline bool is_via_compact_memory(int order)
return order == -1;
 }
 
+static int extfrag_wmark_high(struct zone *zone)
+{
+   int order;
+
+   for (order = 1; order <= MAX_ORDER; order++) {
+   int extfrag = extfrag_for_order(zone, order);
+   int threshold = compaction_order_states[order].extfrag_high;
+
+   if (extfrag > threshold)
+   return order;
+   }
+   return 0;
+}
+
+static bool node_should_compact(pg_data_t *pgdat)
+{
+   struct zone *zone;
+
+   for_each_populated_zone(zone) {
+   int order = extfrag_wmark_high(zone);
+
+   if (order && compaction_suitable(zone, order,
+   0, zone_idx(zone)) == COMPACT_CONTINUE) {
+   return true;
+   }
+ 

Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-31 Thread Nitin Gupta


On 01/25/2018 01:13 PM, Mel Gorman wrote:
> On Thu, Jan 25, 2018 at 11:41:03AM -0800, Nitin Gupta wrote:
>>>> It's not really about memory scarcity but a more efficient use of it.
>>>> Applications may want hugepage benefits without requiring any changes to
>>>> app code which is what THP is supposed to provide, while still avoiding
>>>> memory bloat.
>>>>
>>> I read these links and find that there are mainly two complains:
>>> 1. THP causes latency spikes, because direction compaction slows down THP 
>>> allocation,
>>> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return 
>>> memory ranges smaller than
>>>THP size and fails because of THP.
>>>
>>> The first complain is not related to this patch.
>>
>> I'm trying to address many different THP issues and memory bloat is
>> first among them.
> 
> Expecting userspace to get this right is probably going to go sideways.
> It'll be screwed up and be sub-optimal or have odd semantics for existing
> madvise flags. The fact is that an application may not even know if it's
> going to be sparsely using memory in advance if it's a computation load
> modelling from unknown input data.
> 
> I suggest you read the old Talluri paper "Superpassing the TLB Performance
> of Superpages with Less Operating System Support" and pay attention to
> Section 4. There it discusses a page reservation scheme whereby on fault
> a naturally aligned set of base pages are reserved and only one correctly
> placed base page is inserted into the faulting address. It was tied into
> a hypothetical piece of hardware that doesn't exist to give best-effort
> support for superpages so it does not directly help you but the initial
> idea is sound. There are holes in the paper from todays perspective but
> it was written in the 90's.
> 
> From there, read "Transparent operating system support for superpages"
> by Navarro, particularly chapter 4 paying attention to the parts where
> it talks about opportunism and promotion threshold.
> 
> Superficially, it goes like this
> 
> 1. On fault, reserve a THP in the allocator and use one base page that
>is correctly-aligned for the faulting addresses. By correctly-aligned,
>I mean that you use base page whose offset would be naturally contiguous
>if it ever was part of a huge page.
> 2. On subsequent faults, attempt to use a base page that is naturally
>aligned to be a THP
> 3. When a "threshold" of base pages are inserted, allocate the remaining
>pages and promote it to a THP
> 4. If there is memory pressure, spill "reserved" pages into the main
>allocation pool and lose the opportunity to promote (which will need
>khugepaged to recover)
> 
> By definition, a promotion threshold of 1 would be the existing scheme
> of allocation a THP on the first fault and some users will want that. It
> also should be the default to avoid unexpected overhead.  For workloads
> where memory is being sparsely addressed and the increased overhead of
> THP is unwelcome then the threshold should be tuned higher with a maximum
> possible value of HPAGE_PMD_NR.
> 
> It's non-trivial to do this because at minimum a page fault has to check
> if there is a potential promotion candidate by checking the PTEs around
> the faulting address searching for a correctly-aligned base page that is
> already inserted. If there is, then check if the correctly aligned base
> page for the current faulting address is free and if so use it. It'll
> also then need to check the remaining PTEs to see if both the promotion
> threshold has been reached and if so, promote it to a THP (or else teach
> khugepaged to do an in-place promotion if possible). In other words,
> implementing the promotion threshold is both hard and it's not free.
> 
> However, if it did exist then the only tunable would be the "promotion
> threshold" and applications would not need any special awareness of their
> address space.
> 

I went through both references you mentioned and I really like the
idea of reservation-based hugepage allocation.  Navarro also extends
the idea to allow multiple hugepage sizes to be used (as supported by the
underlying hardware), which was next on my list of things to do for
THP.

So, please ignore this patch; I will work towards implementing the
ideas in these papers.

Thanks for the feedback.

Nitin


Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-25 Thread Nitin Gupta


On 01/24/2018 04:47 PM, Zi Yan wrote:
>>>> With this change, whenever an application issues MADV_DONTNEED on a
>>>> memory region, the region is marked as "space-efficient". For such
>>>> regions, a hugepage is not immediately allocated on first write.
>>> Kirill didn't like it in the previous version and I do not like this
>>> either. You are adding a very subtle side effect which might completely
>>> unexpected. Consider userspace memory allocator which uses MADV_DONTNEED
>>> to free up unused memory. Now you have put it out of THP usage
>>> basically.
>>>
>> Userspace may want a region to be considered by khugepaged while opting
>> out of hugepage allocation on first touch. Asking userspace memory
>> allocators to have to track and reclaim unused parts of a THP-allocated
>> hugepage does not seem right, as the kernel can use simple userspace
>> hints to avoid allocating extra memory in the first place.
>>
>> I agree that this patch is adding a subtle side-effect which may take
>> some applications by surprise. However, I often see the opposite too:
>> for many workloads, disabling THP is the first advise as this aggressive
>> allocation of hugepages on first touch is unexpected and is too
>> wasteful. For e.g.:
>>
>> 1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
>> http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/
>>
>> 2) Disable THP on MongoDB
>> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
>>
>> 3) Disable THP for Couchbase Server
>> https://blog.couchbase.com/often-overlooked-linux-os-tweaks/
>>
>> 4) Redis
>> http://antirez.com/news/84
>>
>>
>>> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
>>>
>> It's not really about memory scarcity but a more efficient use of it.
>> Applications may want hugepage benefits without requiring any changes to
>> app code which is what THP is supposed to provide, while still avoiding
>> memory bloat.
>>
> I read these links and find that there are mainly two complains:
> 1. THP causes latency spikes, because direction compaction slows down THP 
> allocation,
> 2. THP bloats memory footprint when jemalloc uses MADV_DONTNEED to return 
> memory ranges smaller than
>THP size and fails because of THP.
>
> The first complain is not related to this patch.

I'm trying to address many different THP issues and memory bloat is
first among them.
> For second one, at least with recent kernels, MADV_DONTNEED splits THPs and 
> returns the memory range you
> specified in madvise(). Am I missing anything?
>

Yes, MADV_DONTNEED splits THPs and releases the requested range, but this
does not solve the issue of the aggressive alloc-hugepage-on-first-touch
policy of THP=madvise on MADV_HUGEPAGE regions. Sure, some workloads may
prefer that policy, but for applications that don't, this patch gives them
an option to hint to the kernel to go for gradual hugepage promotion via
khugepaged only (and not on first touch).

It's not good if an application has to track which parts of its (implicitly
allocated) hugepage are in use and which sub-parts are free so it can issue
MADV_DONTNEED calls on them. This approach really does not make THP
"transparent" and requires a lot of mm tracking code in userspace.

Nitin



Re: [PATCH v2] mm: Reduce memory bloat with THP

2018-01-24 Thread Nitin Gupta
On 1/19/18 4:49 AM, Michal Hocko wrote:
> On Thu 18-01-18 15:33:16, Nitin Gupta wrote:
>> From: Nitin Gupta <nitin.m.gu...@oracle.com>
>>
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
> 
> Yes, this is true but hardly unexpected for MADV_HUGEPAGE or THP always
> users.
>  

Yes, allocating a hugepage on first touch is the current behavior for the
above two cases. However, I see issues with this current behavior.
Firstly, THP=always mode is often too aggressive/wasteful to be useful
for any realistic workload. For THP=madvise, users may want to back
active parts of a memory region with hugepages while avoiding aggressive
hugepage allocation on first touch. Or, they may really want the current
behavior.

With this patch, users would have the option to pick what behavior they
want by passing hints to the kernel in the form of MADV_HUGEPAGE and
MADV_DONTNEED madvise calls.


>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
> 
> Is that really true? We have 
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> This is not reflected during the PF of course but you can control the
> behavior there as well. Either by the global setting or a per proces
> prctl.
> 

I think this part of patch description needs some rewording. This patch
is to change *only* the page fault behavior.

Once pages are installed, khugepaged does its job as usual, using
max_ptes_none and other config values. I'm not trying to change any
khugepaged behavior here.


>> With this change, whenever an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For such
>> regions, a hugepage is not immediately allocated on first write.
> 
> Kirill didn't like it in the previous version and I do not like this
> either. You are adding a very subtle side effect which might be completely
> unexpected. Consider a userspace memory allocator which uses MADV_DONTNEED
> to free up unused memory. Now you have put it out of THP usage
> basically.
>

Userspace may want a region to be considered by khugepaged while opting
out of hugepage allocation on first touch. Asking userspace memory
allocators to have to track and reclaim unused parts of a THP-allocated
hugepage does not seem right, as the kernel can use simple userspace
hints to avoid allocating extra memory in the first place.

I agree that this patch is adding a subtle side-effect which may take
some applications by surprise. However, I often see the opposite too:
for many workloads, disabling THP is the first advice, as this aggressive
allocation of hugepages on first touch is unexpected and too
wasteful. For example:

1) Disabling THP for TokuDB (Storage engine for MySQL, MariaDB)
http://www.chriscalender.com/disabling-transparent-hugepages-for-tokudb/

2) Disable THP on MongoDB
https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

3) Disable THP for Couchbase Server
https://blog.couchbase.com/often-overlooked-linux-os-tweaks/

4) Redis
http://antirez.com/news/84


> If the memory is used really scarce then we have MADV_NOHUGEPAGE.
> 

It's not really about memory scarcity but a more efficient use of it.
Applications may want hugepage benefits without requiring any changes to
app code which is what THP is supposed to provide, while still avoiding
memory bloat.

-Nitin


Re: [PATCH] mm: Reduce memory bloat with THP

2017-12-15 Thread Nitin Gupta
On 12/15/17 2:01 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 751e97a..b2ec07b 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -508,6 +508,7 @@ static long madvise_dontneed_single_vma(struct 
>> vm_area_struct *vma,
>>  unsigned long start, unsigned long end)
>>  {
>>  zap_page_range(vma, start, end - start);
>> +vma->space_efficient = true;
>>  return 0;
>>  }
>>  
> 
> And this modifies vma without down_write(mmap_sem).
> 

I thought this function was always called with mmap_sem write locked.
I will check again.

- Nitin




Re: [PATCH] mm: Reduce memory bloat with THP

2017-12-15 Thread Nitin Gupta
On 12/15/17 2:00 AM, Kirill A. Shutemov wrote:
> On Thu, Dec 14, 2017 at 05:28:52PM -0800, Nitin Gupta wrote:
>> Currently, if the THP enabled policy is "always", or the mode
>> is "madvise" and a region is marked as MADV_HUGEPAGE, a hugepage
>> is allocated on a page fault if the pud or pmd is empty.  This
>> yields the best VA translation performance, but increases memory
>> consumption if some small page ranges within the huge page are
>> never accessed.
>>
>> An alternate behavior for such page faults is to install a
>> hugepage only when a region is actually found to be (almost)
>> fully mapped and active.  This is a compromise between
>> translation performance and memory consumption.  Currently there
>> is no way for an application to choose this compromise for the
>> page fault conditions above.
>>
>> With this change, when an application issues MADV_DONTNEED on a
>> memory region, the region is marked as "space-efficient". For
>> such regions, a hugepage is not immediately allocated on first
>> write.  Instead, it is left to the khugepaged thread to do
>> delayed hugepage promotion depending on whether the region is
>> actually mapped and active. When application issues
>> MADV_HUGEPAGE, the region is marked again as non-space-efficient
>> wherein hugepage is allocated on first touch.
> 
> I think this would be NAK. At least in this form.
> 
> What performance testing have you done? Any numbers?
> 

I wrote throw-away code which mmaps a 128G area and writes to a random
address in a loop. Together with the writes, madvise(MADV_DONTNEED) calls
are issued at other random addresses. Writes are issued with 70%
probability and DONTNEED with 30%. With this test, I'm trying to emulate
the workload of a large in-memory hash table.

With the patch, I see that memory bloat is much less severe.
I've uploaded the test program with the memory usage plot here:

https://gist.github.com/nitingupta910/42ddf969e17556d74a14fbd84640ddb3

THP was set to 'always' mode in both cases but the result would be the
same if madvise mode was used instead.
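
Roughly, the test loop looks like this (a sketch, not the exact code in the
gist above; the 2M step for writes/DONTNEED and MAP_NORESERVE are my
assumptions):

#include <stdlib.h>
#include <sys/mman.h>

#define REGION (128UL << 30)    /* 128G mapping, as in the test */
#define STEP   (2UL << 20)      /* granularity for writes/DONTNEED (assumed) */

int main(void)
{
        char *base = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (base == MAP_FAILED)
                return 1;

        for (;;) {
                size_t off = ((size_t)rand() % (REGION / STEP)) * STEP;

                if (rand() % 100 < 70)
                        base[off] = 1;  /* write: may fault in a hugepage */
                else
                        madvise(base + off, STEP, MADV_DONTNEED);
        }
}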

> Making whole vma "space_efficient" just because somebody freed one page
> from it is just wrong. And there's no way back after this.
>

I'm using MADV_DONTNEED as a hint that the user wants to transparently use
hugepages but at the same time wants to be more conservative with respect
to memory usage. If MADV_HUGEPAGE is issued for a VMA range after any
DONTNEEDs, the space_efficient bit is cleared again, so we revert to
allocating a hugepage on a fault on an empty pud/pmd.

>>
>> Orabug: 26910556
> 
> Wat?
> 

It's an Oracle-internal identifier used to track this work.

Thanks,
Nitin



[PATCH] sparc64: Fix page table walk for PUD hugepages

2017-11-03 Thread Nitin Gupta
For a PUD hugepage entry, we need to propagate bits [32:22]
from virtual address to resolve at 4M granularity. However,
the current code was incorrectly propagating bits [29:19].
This bug can cause incorrect data to be returned for pages
backed with 16G hugepages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
Reported-by: Al Viro <v...@zeniv.linux.org.uk>
Cc: Al Viro <v...@zeniv.linux.org.uk>

diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index acf55063aa3d..ca0de1646f1e 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -216,7 +216,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
sllxREG2, 32, REG2; \
andcc   REG1, REG2, %g0;\
be,pt   %xcc, 700f; \
-sethi  %hi(0x1ffc), REG2;  \
+sethi  %hi(0xffe0), REG2;  \
sllxREG2, 1, REG2;  \
brgez,pnREG1, FAIL_LABEL;   \
 andn   REG1, REG2, REG1;   \
-- 
2.13.1



Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-22 Thread Nitin Gupta
On Sun, Oct 22, 2017 at 8:10 PM, Minchan Kim <minc...@kernel.org> wrote:
> On Fri, Oct 20, 2017 at 10:59:31PM +0300, Kirill A. Shutemov wrote:
>> With boot-time switching between paging mode we will have variable
>> MAX_PHYSMEM_BITS.
>>
>> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
>> configuration to define zsmalloc data structures.
>>
>> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
>> It also suits well to handle PAE special case.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
>> Cc: Minchan Kim <minc...@kernel.org>
>> Cc: Nitin Gupta <ngu...@vflare.org>
>> Cc: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>
> Acked-by: Minchan Kim <minc...@kernel.org>
>
> Nitin:
>
> I think this patch works and it would be best for Kirill to be able to do.
> So if you have better idea to clean it up, let's make it as another patch
> regardless of this patch series.
>


I was looking into dynamically allocating size_class array to avoid that
compile error, but yes, that can be done in a future patch. So, for this patch:

Reviewed-by: Nitin Gupta <ngu...@vflare.org>


Re: [PATCH 1/4] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-20 Thread Nitin Gupta
On Fri, Oct 20, 2017 at 12:59 PM, Kirill A. Shutemov
 wrote:
> With boot-time switching between paging mode we will have variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
> It also suits well to handle PAE special case.
>


I see that with your upcoming patch, MAX_PHYSMEM_BITS is turned into a
variable for x86_64 case as: (pgtable_l5_enabled ? 52 : 46).

Even with this change, I don't see a need for this new
MAX_POSSIBLE_PHYSMEM_BITS constant.


> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else


This ifdef on HIGHMEM64G is redundant, as x86 already defines
MAX_PHYSMEM_BITS = 36 in the PAE case. So, all that zsmalloc should do is:

#ifndef MAX_PHYSMEM_BITS
#define MAX_PHYSMEM_BITS BITS_PER_LONG
#endif

.. and then no change is needed for rest of derived constants like _PFN_BITS.

It is up to every arch to define the correct MAX_PHYSMEM_BITS (variable or
constant) based on whatever configurations the arch supports. If it is not
defined, zsmalloc picks a reasonable default of BITS_PER_LONG.

I will send a patch which makes the change to remove ifdef on CONFIG_HIGHMEM64G.

Thanks,
Nitin


Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-18 Thread Nitin Gupta
On Mon, Oct 16, 2017 at 7:44 AM, Kirill A. Shutemov
<kir...@shutemov.name> wrote:
> On Fri, Oct 13, 2017 at 05:00:12PM -0700, Nitin Gupta wrote:
>> On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
>> <kirill.shute...@linux.intel.com> wrote:
>> > With boot-time switching between paging mode we will have variable
>> > MAX_PHYSMEM_BITS.
>> >
>> > Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
>> > configuration to define zsmalloc data structures.
>> >
>> > The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
>> > It also suits well to handle PAE special case.
>> >
>> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
>> > Cc: Minchan Kim <minc...@kernel.org>
>> > Cc: Nitin Gupta <ngu...@vflare.org>
>> > Cc: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>
>> > ---
>> >  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>> >  arch/x86/include/asm/pgtable_64_types.h |  2 ++
>> >  mm/zsmalloc.c   | 13 +++--
>> >  3 files changed, 10 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/arch/x86/include/asm/pgtable-3level_types.h 
>> > b/arch/x86/include/asm/pgtable-3level_types.h
>> > index b8a4341faafa..3fe1d107a875 100644
>> > --- a/arch/x86/include/asm/pgtable-3level_types.h
>> > +++ b/arch/x86/include/asm/pgtable-3level_types.h
>> > @@ -43,5 +43,6 @@ typedef union {
>> >   */
>> >  #define PTRS_PER_PTE   512
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS  36
>> >
>> >  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
>> > diff --git a/arch/x86/include/asm/pgtable_64_types.h 
>> > b/arch/x86/include/asm/pgtable_64_types.h
>> > index 06470da156ba..39075df30b8a 100644
>> > --- a/arch/x86/include/asm/pgtable_64_types.h
>> > +++ b/arch/x86/include/asm/pgtable_64_types.h
>> > @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>> >  #define P4D_SIZE   (_AC(1, UL) << P4D_SHIFT)
>> >  #define P4D_MASK   (~(P4D_SIZE - 1))
>> >
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS  52
>> > +
>> >  #else /* CONFIG_X86_5LEVEL */
>> >
>> >  /*
>> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> > index 7c38e850a8fc..7bde01c55c90 100644
>> > --- a/mm/zsmalloc.c
>> > +++ b/mm/zsmalloc.c
>> > @@ -82,18 +82,19 @@
>> >   * This is made more complicated by various memory models and PAE.
>> >   */
>> >
>> > -#ifndef MAX_PHYSMEM_BITS
>> > -#ifdef CONFIG_HIGHMEM64G
>> > -#define MAX_PHYSMEM_BITS 36
>> > -#else /* !CONFIG_HIGHMEM64G */
>> > +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
>> > +#ifdef MAX_PHYSMEM_BITS
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
>> > +#else
>> >  /*
>> >   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will 
>> > just
>> >   * be PAGE_SHIFT
>> >   */
>> > -#define MAX_PHYSMEM_BITS BITS_PER_LONG
>> > +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>> >  #endif
>> >  #endif
>> > -#define _PFN_BITS  (MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> > +
>> > +#define _PFN_BITS  (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>> >
>>
>>
>> I think we can avoid using this new constant in zsmalloc.
>>
>> The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
>> bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
>> for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
>> would remain 32 bytes.
>>
>> So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
>> thus OBJ_INDEX_BITS = PAGE_SHIFT.
>
> As you understand the topic better than me, could you prepare the patch?
>


Actually no changes are necessary.

As long as physical address bits <= BITS_PER_LONG, setting
_PFN_BITS to the most conservative value of BITS_PER_LONG is
fine. AFAIK, this condition does not hold on x86 PAE, where PA
bits (36) > BITS_PER_LONG (32), so only that case needs special
handling to make sure PFN bits are not lost when encoding the
allocated object location in an unsigned long.
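
To put rough numbers on it (assuming PAGE_SHIFT = 12 and ignoring any tag
bits zsmalloc reserves), using _PFN_BITS = MAX_PHYSMEM_BITS - PAGE_SHIFT
from the code above:

  /* If MAX_PHYSMEM_BITS were simply BITS_PER_LONG (32) on x86 PAE: */
  _PFN_BITS = 32 - 12 = 20;   /* but PAE PFNs need 36 - 12 = 24 bits */

  /* On 64-bit, physical address bits <= BITS_PER_LONG, so the
   * conservative MAX_PHYSMEM_BITS = BITS_PER_LONG choice loses nothing:
   */
  _PFN_BITS = 64 - 12 = 52;

so PAE is the one case where PFN bits would be dropped and special handling
is needed.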

Thanks,
Nitin


Re: [PATCH 2/6] mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS

2017-10-13 Thread Nitin Gupta
On Fri, Sep 29, 2017 at 7:08 AM, Kirill A. Shutemov
<kirill.shute...@linux.intel.com> wrote:
> With boot-time switching between paging mode we will have variable
> MAX_PHYSMEM_BITS.
>
> Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y
> configuration to define zsmalloc data structures.
>
> The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case.
> It also suits well to handle PAE special case.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Nitin Gupta <ngu...@vflare.org>
> Cc: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>
> ---
>  arch/x86/include/asm/pgtable-3level_types.h |  1 +
>  arch/x86/include/asm/pgtable_64_types.h |  2 ++
>  mm/zsmalloc.c   | 13 +++--
>  3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable-3level_types.h 
> b/arch/x86/include/asm/pgtable-3level_types.h
> index b8a4341faafa..3fe1d107a875 100644
> --- a/arch/x86/include/asm/pgtable-3level_types.h
> +++ b/arch/x86/include/asm/pgtable-3level_types.h
> @@ -43,5 +43,6 @@ typedef union {
>   */
>  #define PTRS_PER_PTE   512
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS  36
>
>  #endif /* _ASM_X86_PGTABLE_3LEVEL_DEFS_H */
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 06470da156ba..39075df30b8a 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -39,6 +39,8 @@ typedef struct { pteval_t pte; } pte_t;
>  #define P4D_SIZE   (_AC(1, UL) << P4D_SHIFT)
>  #define P4D_MASK   (~(P4D_SIZE - 1))
>
> +#define MAX_POSSIBLE_PHYSMEM_BITS  52
> +
>  #else /* CONFIG_X86_5LEVEL */
>
>  /*
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7c38e850a8fc..7bde01c55c90 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -82,18 +82,19 @@
>   * This is made more complicated by various memory models and PAE.
>   */
>
> -#ifndef MAX_PHYSMEM_BITS
> -#ifdef CONFIG_HIGHMEM64G
> -#define MAX_PHYSMEM_BITS 36
> -#else /* !CONFIG_HIGHMEM64G */
> +#ifndef MAX_POSSIBLE_PHYSMEM_BITS
> +#ifdef MAX_PHYSMEM_BITS
> +#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
> +#else
>  /*
>   * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
>   * be PAGE_SHIFT
>   */
> -#define MAX_PHYSMEM_BITS BITS_PER_LONG
> +#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
>  #endif
>  #endif
> -#define _PFN_BITS  (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +
> +#define _PFN_BITS  (MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT)
>


I think we can avoid using this new constant in zsmalloc.

The reason for trying to save on MAX_PHYSMEM_BITS is just to gain more
bits for OBJ_INDEX_BITS which would reduce ZS_MIN_ALLOC_SIZE. However,
for all practical values of ZS_MAX_PAGES_PER_ZSPAGE, this min size
would remain 32 bytes.

So, we can unconditionally use MAX_PHYSMEM_BITS = BITS_PER_LONG and
thus OBJ_INDEX_BITS = PAGE_SHIFT.
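
Rough arithmetic behind the 32-byte claim, assuming PAGE_SHIFT = 12,
ZS_MAX_PAGES_PER_ZSPAGE = 4, and (if I read the definition correctly)
ZS_MIN_ALLOC_SIZE = MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT) >> OBJ_INDEX_BITS):

  OBJ_INDEX_BITS    = PAGE_SHIFT = 12
  ZS_MIN_ALLOC_SIZE = MAX(32, (4 << 12) >> 12) = MAX(32, 4) = 32

so the computed minimum stays well under the 32-byte floor for any practical
pages-per-zspage value.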

- Nitin


[PATCH v6 3/3] sparc64: Cleanup hugepage table walk functions

2017-08-11 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v6 1/3] sparc64: Support huge PUD case in get_user_pages

2017-08-11 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 15 +++--
 arch/sparc/mm/gup.c | 45 -
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d809099 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v6 2/3] sparc64: Add 16GB hugepage support

2017-08-11 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark, which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Cc: Anthony Yznaga <anthony.yzn...@oracle.com>
Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/hugetlb.h|  7 
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 36 ++
 arch/sparc/kernel/head_64.S |  2 +-
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/kernel/vmlinux.lds.S |  5 +++
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 54 +++
 9 files changed, 157 insertions(+), 31 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index d1f837d..0ca7caa 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct pud_huge_patch_entry {
+   unsigned int addr;
+   unsigned int insn;
+};
+extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end;
+#endif
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte);
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..acf5506 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+700:   ba 700f;\
+nop;   \
+   .section.pud_huge_patch, "ax";  \
+   .word   700b;   \
+   nop;\
+   .previous;  \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one


[PATCH v5 1/3] sparc64: Support huge PUD case in get_user_pages

2017-07-29 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 15 +++--
 arch/sparc/mm/gup.c | 45 -
 2 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d809099 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,45 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   page = pud_page(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +180,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v5 2/3] sparc64: Add 16GB hugepage support

2017-07-29 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark, which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/hugetlb.h|  7 
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 36 ++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/kernel/vmlinux.lds.S |  5 +++
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 54 +++
 8 files changed, 156 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/hugetlb.h b/arch/sparc/include/asm/hugetlb.h
index d1f837d..0ca7caa 100644
--- a/arch/sparc/include/asm/hugetlb.h
+++ b/arch/sparc/include/asm/hugetlb.h
@@ -4,6 +4,13 @@
 #include 
 #include 
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct pud_huge_patch_entry {
+   unsigned int addr;
+   unsigned int insn;
+};
+extern struct pud_huge_patch_entry __pud_huge_patch, __pud_huge_patch_end;
+#endif
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte);
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..acf5506 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,41 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+700:   ba 700f;\
+nop;   \
+   .section.pud_huge_patch, "ax";  \
+   .word   700b;   \
+   nop;\
+   .previous;  \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +277,7 @@ extern struct tsb_phys_patch_entry __tsb_phy


[PATCH v5 3/3] sparc64: Cleanup hugepage table walk functions

2017-07-29 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



Re: [PATCH 2/3] sparc64: Add 16GB hugepage support

2017-07-26 Thread Nitin Gupta


On 07/20/2017 01:04 PM, David Miller wrote:
> From: Nitin Gupta <nitin.m.gu...@oracle.com>
> Date: Thu, 13 Jul 2017 14:53:24 -0700
> 
>> Testing:
>>
>> Tested with the stream benchmark which allocates 48G of
>> arrays backed by 16G hugepages and does RW operation on
>> them in parallel.
> 
> It would be great if we started adding tests under
> tools/testing/selftests so that other people can recreate
> your tests/benchmarks.
> 

Yes, I would like to add the stream benchmark to selftests too.
I will check if our internal version of stream can be released.


>> diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
>> index 32258e0..7b240a3 100644
>> --- a/arch/sparc/include/asm/tsb.h
>> +++ b/arch/sparc/include/asm/tsb.h
>> @@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
>> __tsb_phys_patch_end;
>>   nop; \
>>  699:
>>  
>> +/* PUD has been loaded into REG1, interpret the value, seeing
>> + * if it is a HUGE PUD or a normal one.  If it is not valid
>> + * then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
>> + * translates to a valid PTE, branch to PTE_LABEL.
>> + *
>> + * We have to propagate bits [32:22] from the virtual address
>> + * to resolve at 4M granularity.
>> + */
>> +#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 
>> PTE_LABEL) \
>> +brz,pn  REG1, FAIL_LABEL;   \
>> + sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
>> +sllxREG2, 32, REG2; \
>> +andcc   REG1, REG2, %g0;\
>> +be,pt   %xcc, 700f; \
>> + sethi  %hi(0x1ffc), REG2;  \
>> +sllxREG2, 1, REG2;  \
>> +brgez,pnREG1, FAIL_LABEL;   \
>> + andn   REG1, REG2, REG1;   \
>> +and VADDR, REG2, REG2;  \
>> +brlz,pt REG1, PTE_LABEL;\
>> + or REG1, REG2, REG1;   \
>> +700:
>> +#else
>> +#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 
>> PTE_LABEL) \
>> +brz,pn  REG1, FAIL_LABEL; \
>> + nop;
>> +#endif
>> +
>>  /* PMD has been loaded into REG1, interpret the value, seeing
>>   * if it is a HUGE PMD or a normal one.  If it is not valid
>>   * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
>> @@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
>> __tsb_phys_patch_end;
>>  srlxREG2, 64 - PAGE_SHIFT, REG2; \
>>  andnREG2, 0x7, REG2; \
>>  ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
>> +USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
>>  brz,pn  REG1, FAIL_LABEL; \
>>   sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
>>  srlxREG2, 64 - PAGE_SHIFT, REG2; \
> 
> This macro is getting way out of control, every TLB/TSB miss is
> going to invoke this sequence of code.
> 
> Yes, it's just a two cycle constant load, a test modifying the
> condition codes, and an easy to predict branch.
> 
> But every machine will eat this overhead, even if they don't use
> hugepages or don't set the 16GB knob.
> 
> I think we can do better, using code patching or similar.
> 
> Once the knob is set, you can know for sure that this code path
> will never actually be taken.

The simplest way I can think of is to add a CONFIG_SPARC_16GB_HUGEPAGE
option and compile the PUD check out when it is not enabled.  Would this
be okay?
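
To make that concrete, a rough sketch of what such gating could look like
in asm/tsb.h (hypothetical only: no CONFIG_SPARC_16GB_HUGEPAGE symbol
exists, and the v5/v6 revisions that appear earlier in this archive went
with a .pud_huge_patch code-patching section instead); the #else branch is
the existing two-instruction sequence quoted above:

/* Hypothetical sketch, not a merged interface. */
#ifdef CONFIG_SPARC_16GB_HUGEPAGE
#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
	/* full PUD-huge decode, exactly as in the patch quoted above */
#else
#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
	brz,pn	REG1, FAIL_LABEL; \
	 nop;
#endif

With that, kernels built without the option keep the old two-instruction
TSB-miss path and pay no extra cost.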

Thanks,
Nitin



[PATCH] sparc64: Register hugepages during arch init

2017-07-19 Thread Nitin Gupta
Add an hstate for each supported hugepage size using an
arch initcall. This change fixes some hugepage
parameter parsing inconsistencies:

case 1: no hugepage parameters

 Without any hugepage parameters, only a hugepages-8192kB entry is visible
 in sysfs.  This differs from x86_64, where both 2M and 1G hugepage
 sizes are available.

case 2: default_hugepagesz=[64K|256M|2G]

 When only a default_hugepagesz parameter is specified, the default
 hugepage size isn't actually changed and stays at 8M. This again
 differs from x86_64.

Orabug: 25869946

Reviewed-by: Bob Picco <bob.pi...@oracle.com>
Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/init_64.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 3c40ebd..fed73f1 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -325,6 +325,29 @@ static void __update_mmu_tsb_insert(struct mm_struct *mm, 
unsigned long tsb_inde
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+static void __init add_huge_page_size(unsigned long size)
+{
+   unsigned int order;
+
+   if (size_to_hstate(size))
+   return;
+
+   order = ilog2(size) - PAGE_SHIFT;
+   hugetlb_add_hstate(order);
+}
+
+static int __init hugetlbpage_init(void)
+{
+   add_huge_page_size(1UL << HPAGE_64K_SHIFT);
+   add_huge_page_size(1UL << HPAGE_SHIFT);
+   add_huge_page_size(1UL << HPAGE_256MB_SHIFT);
+   add_huge_page_size(1UL << HPAGE_2GB_SHIFT);
+
+   return 0;
+}
+
+arch_initcall(hugetlbpage_init);
+
 static int __init setup_hugepagesz(char *string)
 {
unsigned long long hugepage_size;
@@ -364,7 +387,7 @@ static int __init setup_hugepagesz(char *string)
goto out;
}
 
-   hugetlb_add_hstate(hugepage_shift - PAGE_SHIFT);
+   add_huge_page_size(hugepage_size);
rc = 1;
 
 out:
-- 
2.9.2



[PATCH 2/3] sparc64: Add 16GB hugepage support

2017-07-13 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark, which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2579f5a..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-   sethi   %uhi(_PAGE_PMD_HUGE), %g7
+   sethi   %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7
sllx%g7, 32, %g7
 
andcc   %g5, %g7, %g0
diff --git a/arch/sparc/mm/hugetlbpa


[PATCH 3/3] sparc64: Cleanup hugepage table walk functions

2017-07-13 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 7acb84d..bcd8cdb 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -295,27 +287,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH 1/3] sparc64: Support huge PUD case in get_user_pages

2017-07-13 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2579f5a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -687,6 +687,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -823,9 +825,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index f80cfc6..d777594 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2
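
As a minimal illustration of the path this patch extends (not part of
the patch; the file path is a placeholder and error handling is
trimmed), an O_DIRECT read into a MAP_HUGETLB buffer is the kind of
request that ends up pinning a hugepage-backed range through
get_user_pages():

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 8UL << 20;	/* one 8M default hugepage on sparc64 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	int fd = open("/tmp/datafile", O_RDONLY | O_DIRECT);

	if (buf == MAP_FAILED || fd < 0) {
		perror("setup");
		return 1;
	}
	/* O_DIRECT wants an aligned buffer and length; a hugepage is both */
	if (pread(fd, buf, len, 0) < 0)
		perror("pread");

	close(fd);
	munmap(buf, len);
	return 0;
}

With a PUD-sized (16G) hugepage backing the buffer instead of an 8M one,
the same request goes through the gup_huge_pud() path added above.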



[PATCH v2] sparc64: Fix gup_huge_pmd

2017-06-22 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that the correct head page is used for any
PMD-mapped huge page.

Cc: Julian Calaby <julian.cal...@gmail.com>
Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 Changes since v1
 - Clarify use of 'head' variable (Julian Calaby)

 arch/sparc/mm/gup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..f80cfc6 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -78,8 +78,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
return 0;
 
refs = 0;
-   head = pmd_page(pmd);
-   page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   head = compound_head(page);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



Re: [PATCH] sparc64: Fix gup_huge_pmd

2017-06-22 Thread Nitin Gupta

Hi Julian,


On 6/22/17 3:53 AM, Julian Calaby wrote:

On Thu, Jun 22, 2017 at 7:50 AM, Nitin Gupta <nitin.m.gu...@oracle.com> wrote:

The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that the correct head page is used for any
PMD-mapped huge page.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
  arch/sparc/mm/gup.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..9116a6f 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
 refs = 0;
 head = pmd_page(pmd);
 page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);

Stupid question: shouldn't this go before the page calculation?


No, it should come after the page calculation: first, 'head' points to
the base of the PMD page, then 'page' points to an offset within that
page. Finally, we make sure the 'head' variable points to the head of
the compound page which contains addr.

I think the confusion comes from using 'head' to refer to what may be a
non-head page. So maybe it would be clearer to write that part of the
function this way:

page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
head = compound_head(page);

Thanks,
Nitin
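
To make the arithmetic above concrete, here is a standalone sketch (not
kernel code; the shift values follow sparc64 with 8K base pages and 8M
per PMD entry, and 256M is just the example hugepage size from the
commit message). It shows that the page named by a PMD entry is, in
general, a tail page of the compound hugepage, which is why the head
has to be derived from the page actually being pinned:

#include <stdio.h>

#define PAGE_SHIFT	13UL			/* 8K base pages */
#define PMD_SHIFT	23UL			/* one PMD entry maps 8M */
#define PMD_MASK	(~((1UL << PMD_SHIFT) - 1))
#define HPAGE_SHIFT	28UL			/* example: 256M hugepage */
#define HPAGE_MASK	(~((1UL << HPAGE_SHIFT) - 1))

int main(void)
{
	/* an address 3 PMD regions + 5 base pages into a 256M hugepage */
	unsigned long addr = (1UL << HPAGE_SHIFT)
			     + (3UL << PMD_SHIFT) + (5UL << PAGE_SHIFT);

	/* index of the base page within its 8M PMD region ... */
	unsigned long idx_in_pmd = (addr & ~PMD_MASK) >> PAGE_SHIFT;
	/* ... and within the whole 256M compound page */
	unsigned long idx_in_hpage = (addr & ~HPAGE_MASK) >> PAGE_SHIFT;

	printf("base page %lu of its PMD region, %lu of the hugepage\n",
	       idx_in_pmd, idx_in_hpage);
	/*
	 * pmd_page() names base page 3072 of the hugepage for this address,
	 * a tail page, so the head must come from compound_head(page).
	 */
	return 0;
}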



[PATCH] sparc64: Fix gup_huge_pmd

2017-06-21 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that the correct head page is used for any
PMD-mapped huge page.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..9116a6f 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH 3/4] sparc64: Fix gup_huge_pmd

2017-06-20 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that the correct head page is used for any
PMD-mapped huge page.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-20 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH 1/4] sparc64: Add 16GB hugepage support

2017-06-20 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the STREAM benchmark, which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-   sethi   %uhi(_PAGE_PMD_HUGE), %g7
+   sethi   %uhi(_PAGE_PMD_HUGE | _PAGE_PUD_HUGE), %g7
sllx%g7, 32, %g7
 
andcc   %g5, %g7, %g0
diff --git a/arch/sparc/mm/hugetlbpage.c
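
As a usage note for the parameters quoted at the top of this patch
description: with default_hugepagesz=16G, a plain MAP_HUGETLB mapping is
backed by 16G pages, so the new paths can be exercised from userspace
with something as small as the following sketch (illustrative only; it
assumes the 16G pages were successfully reserved at boot):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 34;			/* one 16G hugepage */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(p, 0, len);			/* touch the mapping */
	munmap(p, len);
	return 0;
}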

[PATCH 4/4] sparc64: Cleanup hugepage table walk functions

2017-06-20 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v3 3/4] sparc64: Fix gup_huge_pmd

2017-06-19 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that the correct head page is used for any
PMD-mapped huge page.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long 
addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2



[PATCH v3 4/4] sparc64: Cleanup hugepage table walk functions

2017-06-19 Thread Nitin Gupta
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/hugetlbpage.c | 54 +
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f0bb42d..e8b7245 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -266,27 +266,19 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (!pud)
return NULL;
-
if (sz >= PUD_SIZE)
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_alloc(mm, pud, addr);
-   if (!pmd)
-   return NULL;
-
-   if (sz >= PMD_SIZE)
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_alloc_map(mm, pmd, addr);
-   }
-
-   return pte;
+   return (pte_t *)pud;
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return NULL;
+   if (sz >= PMD_SIZE)
+   return (pte_t *)pmd;
+   return pte_alloc_map(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
@@ -294,27 +286,21 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
long addr)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte = NULL;
 
pgd = pgd_offset(mm, addr);
-   if (!pgd_none(*pgd)) {
-   pud = pud_offset(pgd, addr);
-   if (!pud_none(*pud)) {
-   if (is_hugetlb_pud(*pud))
-   pte = (pte_t *)pud;
-   else {
-   pmd = pmd_offset(pud, addr);
-   if (!pmd_none(*pmd)) {
-   if (is_hugetlb_pmd(*pmd))
-   pte = (pte_t *)pmd;
-   else
-   pte = pte_offset_map(pmd, addr);
-   }
-   }
-   }
-   }
-
-   return pte;
+   if (pgd_none(*pgd))
+   return NULL;
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (is_hugetlb_pud(*pud))
+   return (pte_t *)pud;
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+   if (is_hugetlb_pmd(*pmd))
+   return (pte_t *)pmd;
+   return pte_offset_map(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.9.2



[PATCH v3 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-19 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v3 1/4] sparc64: Add 16GB hugepage support

2017-06-19 Thread Nitin Gupta
Adds support for a 16GB hugepage size. To use this page size,
pass kernel parameters such as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the STREAM benchmark, which allocates 48G of
arrays backed by 16G hugepages and does RW operations on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
Changelog v3 vs v2:
 - Fixed email headers so the subject shows up correctly

Changelog v2 vs v1:
 - Remove redundant brgez,pn (Bob Picco)
 - Remove unnecessary label rename from 700 to 701 (Rob Gardner)
 - Add patch description (Paul)
 - Add 16G case to get_user_pages()

arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL;   \
+sethi  %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllxREG2, 32, REG2; \
+   andcc   REG1, REG2, %g0;\
+   be,pt   %xcc, 700f; \
+sethi  %hi(0x1ffc), REG2;  \
+   sllxREG2, 1, REG2;  \
+   brgez,pnREG1, FAIL_LABEL;   \
+andn   REG1, REG2, REG1;   \
+   and VADDR, REG2, REG2;  \
+   brlz,pt REG1, PTE_LABEL;\
+or REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn  REG1, FAIL_LABEL; \
+nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, 
__tsb_phys_patch_end;
srlxREG2, 64 - PAGE_SHIFT, REG2; \
andnREG2, 0x7, REG2; \
ldxa[REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+   USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
brz,pn  REG1, FAIL_LABEL; \
 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
srlxREG2, 64 - PAGE_SHIFT, REG2; \
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)

[PATCH v3 2/4] sparc64: Support huge PUD case in get_user_pages

2017-06-19 Thread Nitin Gupta
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta 
---
 arch/sparc/include/asm/pgtable_64.h | 15 ++--
 arch/sparc/mm/gup.c | 47 -
 2 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 2444b02..4fefe37 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -692,6 +692,8 @@ static inline unsigned long pmd_write(pmd_t pmd)
return pte_write(pte);
 }
 
+#define pud_write(pud) pte_write(__pte(pud_val(pud)))
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline unsigned long pmd_dirty(pmd_t pmd)
 {
@@ -828,9 +830,18 @@ static inline unsigned long __pmd_page(pmd_t pmd)
 
return ((unsigned long) __va(pfn << PAGE_SHIFT));
 }
+
+static inline unsigned long pud_page_vaddr(pud_t pud)
+{
+   pte_t pte = __pte(pud_val(pud));
+   unsigned long pfn;
+
+   pfn = pte_pfn(pte);
+
+   return ((unsigned long) __va(pfn << PAGE_SHIFT));
+}
+
 #define pmd_page(pmd)  virt_to_page((void *)__pmd_page(pmd))
-#define pud_page_vaddr(pud)\
-   ((unsigned long) __va(pud_val(pud)))
 #define pud_page(pud)  virt_to_page((void 
*)pud_page_vaddr(pud))
 #define pmd_clear(pmdp)(pmd_val(*(pmdp)) = 0UL)
 #define pud_present(pud)   (pud_val(pud) != 0U)
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index cd0e32b..7cfa9c5 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -103,6 +103,47 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned 
long addr,
return 1;
 }
 
+static int gup_huge_pud(pud_t *pudp, pud_t pud, unsigned long addr,
+   unsigned long end, int write, struct page **pages,
+   int *nr)
+{
+   struct page *head, *page;
+   int refs;
+
+   if (!(pud_val(pud) & _PAGE_VALID))
+   return 0;
+
+   if (write && !pud_write(pud))
+   return 0;
+
+   refs = 0;
+   head = pud_page(pud);
+   page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
+   do {
+   VM_BUG_ON(compound_head(page) != head);
+   pages[*nr] = page;
+   (*nr)++;
+   page++;
+   refs++;
+   } while (addr += PAGE_SIZE, addr != end);
+
+   if (!page_cache_add_speculative(head, refs)) {
+   *nr -= refs;
+   return 0;
+   }
+
+   if (unlikely(pud_val(pud) != pud_val(*pudp))) {
+   *nr -= refs;
+   while (refs--)
+   put_page(head);
+   return 0;
+   }
+
+   return 1;
+}
+
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
@@ -141,7 +182,11 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, 
unsigned long end,
next = pud_addr_end(addr, end);
if (pud_none(pud))
return 0;
-   if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+   if (unlikely(pud_large(pud))) {
+   if (!gup_huge_pud(pudp, pud, addr, next,
+ write, pages, nr))
+   return 0;
+   } else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
return 0;
} while (pudp++, addr = next, addr != end);
 
-- 
2.9.2



[PATCH v3 1/4] sparc64: Add 16GB hugepage support

2017-06-19 Thread Nitin Gupta
Adds support for 16GB hugepage size. To use this page size
use kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operation on
them in parallel.

Orabug: 25362942

Signed-off-by: Nitin Gupta 
---
Changelog v3 vs v2:
 - Fixed email headers so the subject shows up correctly

Changelog v2 vs v1:
 - Remove redundant brgez,pn (Bob Picco)
 - Remove unncessary label rename from 700 to 701 (Rob Gardner)
 - Add patch description (Paul)
 - Add 16G case to get_user_pages()

arch/sparc/include/asm/page_64.h|  3 +-
 arch/sparc/include/asm/pgtable_64.h |  5 +++
 arch/sparc/include/asm/tsb.h| 30 +++
 arch/sparc/kernel/tsb.S |  2 +-
 arch/sparc/mm/hugetlbpage.c | 74 ++---
 arch/sparc/mm/init_64.c | 41 
 6 files changed, 125 insertions(+), 30 deletions(-)

diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index 5961b2d..8ee1f97 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -17,6 +17,7 @@
 
 #define HPAGE_SHIFT23
 #define REAL_HPAGE_SHIFT   22
+#define HPAGE_16GB_SHIFT   34
 #define HPAGE_2GB_SHIFT31
 #define HPAGE_256MB_SHIFT  28
 #define HPAGE_64K_SHIFT16
@@ -28,7 +29,7 @@
 #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #define REAL_HPAGE_PER_HPAGE   (_AC(1,UL) << (HPAGE_SHIFT - REAL_HPAGE_SHIFT))
-#define HUGE_MAX_HSTATE4
+#define HUGE_MAX_HSTATE5
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 6fbd931..2444b02 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -414,6 +414,11 @@ static inline bool is_hugetlb_pmd(pmd_t pmd)
return !!(pmd_val(pmd) & _PAGE_PMD_HUGE);
 }
 
+static inline bool is_hugetlb_pud(pud_t pud)
+{
+   return !!(pud_val(pud) & _PAGE_PUD_HUGE);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
index 32258e0..7b240a3 100644
--- a/arch/sparc/include/asm/tsb.h
+++ b/arch/sparc/include/asm/tsb.h
@@ -195,6 +195,35 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
 nop; \
 699:
 
+   /* PUD has been loaded into REG1, interpret the value, seeing
+* if it is a HUGE PUD or a normal one.  If it is not valid
+* then jump to FAIL_LABEL.  If it is a HUGE PUD, and it
+* translates to a valid PTE, branch to PTE_LABEL.
+*
+* We have to propagate bits [32:22] from the virtual address
+* to resolve at 4M granularity.
+*/
+#if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn      REG1, FAIL_LABEL;   \
+    sethi      %uhi(_PAGE_PUD_HUGE), REG2; \
+   sllx        REG2, 32, REG2; \
+   andcc       REG1, REG2, %g0;    \
+   be,pt       %xcc, 700f; \
+    sethi      %hi(0x1ffc), REG2;  \
+   sllx        REG2, 1, REG2;  \
+   brgez,pn    REG1, FAIL_LABEL;   \
+    andn       REG1, REG2, REG1;   \
+   and         VADDR, REG2, REG2;  \
+   brlz,pt     REG1, PTE_LABEL;    \
+    or         REG1, REG2, REG1;   \
+700:
+#else
+#define USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, PTE_LABEL) \
+   brz,pn      REG1, FAIL_LABEL; \
+    nop;
+#endif
+
/* PMD has been loaded into REG1, interpret the value, seeing
 * if it is a HUGE PMD or a normal one.  If it is not valid
 * then jump to FAIL_LABEL.  If it is a HUGE PMD, and it
@@ -242,6 +271,7 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
	srlx    REG2, 64 - PAGE_SHIFT, REG2; \
	andn    REG2, 0x7, REG2; \
	ldxa    [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+	USER_PGTABLE_CHECK_PUD_HUGE(VADDR, REG1, REG2, FAIL_LABEL, 800f) \
	brz,pn  REG1, FAIL_LABEL; \
	 sllx   VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
	srlx    REG2, 64 - PAGE_SHIFT, REG2; \
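
At the C level, the huge-PUD case handled by the USER_PGTABLE_CHECK_PUD_HUGE
macro added above amounts roughly to the sketch below. This is an illustration
only: the helper name and the VA_BITS_32_22 macro are made up, and the mask
simply spells out the bits [32:22] named in the comment rather than the exact
constant built by the sethi/sllx pair.

/* Merge VA bits [32:22] into the huge-PUD value so the resulting PTE
 * resolves the huge page at 4M granularity.
 */
#define VA_BITS_32_22  (((1UL << 33) - 1) & ~((1UL << 22) - 1))

static unsigned long pud_huge_pte(unsigned long pud_val, unsigned long vaddr)
{
	unsigned long pte = pud_val & ~VA_BITS_32_22;  /* andn REG1, REG2, REG1 */

	pte |= vaddr & VA_BITS_32_22;                  /* and VADDR + or into REG1 */
	return pte;
}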
diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
index 07c0df9..5f42ac0 100644
--- a/arch/sparc/kernel/tsb.S
+++ b/arch/sparc/kernel/tsb.S
@@ -117,7 +117,7 @@ tsb_miss_page_table_walk_sun4v_fastpath:
/* Valid PTE is now in %g5.  */
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)

Re: From: Nitin Gupta <nitin.m.gu...@oracle.com>

2017-06-19 Thread Nitin Gupta
Please ignore this patch series. I will resend it with correct email
headers.


Nitin


[PATCH v2 3/4] sparc64: Fix gup_huge_pmd

2017-06-19 Thread Nitin Gupta
The function assumes that each PMD points to the head of a
huge page. This is not correct: a PMD can point to the start
of any 8M region within a, say, 256M hugepage. The fix
ensures that 'head' is the correct compound head page for any
PMD-mapped huge page; the arithmetic is sketched below.
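
For illustration only (a kernel-context sketch, not part of the patch; the
helper below is hypothetical and assumes sparc64's 8K base pages):

/* A 256M hugepage spans 32768 8K base pages but is mapped by 8M PMDs,
 * i.e. 1024 base pages per PMD.  For any PMD other than the first,
 * pmd_page() lands 1024, 2048, ... pages past the compound head, which
 * is a tail page, so references must be redirected to the head.
 */
static struct page *gup_pmd_head(pmd_t pmd)
{
	struct page *head = pmd_page(pmd);      /* may be a tail page */

	if (PageTail(head))
		head = compound_head(head);     /* real head of the 256M page */
	return head;
}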

Signed-off-by: Nitin Gupta <nitin.m.gu...@oracle.com>
---
 arch/sparc/mm/gup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 7cfa9c5..b1c649d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -80,6 +80,8 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (PageTail(head))
+   head = compound_head(head);
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
-- 
2.9.2


