Re: [patch v2] mm, thp: add new defer+madvise defrag option

2017-01-12 Thread Michal Hocko
On Wed 11-01-17 08:35:27, Vlastimil Babka wrote:
> [+CC linux-api]
> 
> On 01/11/2017 01:15 AM, David Rientjes wrote:
> > There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> > to do direct compaction and reclaim while all other thp allocations simply 
> > trigger kswapd and kcompactd in the background and fail immediately.
> > 
> > The "defer" setting simply triggers background reclaim and compaction for 
> > all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> > userspace where MADV_HUGEPAGE is being used to indicate the application is 
> > willing to wait for work for thp memory to be available.
> > 
> > The "madvise" setting will do direct compaction and reclaim for these
> > MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> > background for anybody else.
> > 
> > For reasonable usage, there needs to be a mesh between the two options.  
> > This patch introduces a fifth mode, "defer+madvise", that will do direct 
> > reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> > reclaim and compaction for everybody else so that hugepages may be 
> > available in the near future.
> > 
> > A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
> > regions as part of the "defer" mode, making it a very powerful setting
> > that avoids breaking userspace, was offered:
> > http://marc.info/?t=14823661273.  This additional mode is a
> > compromise.
> > 
> > A second proposal to allow both "defer" and "madvise" to be selected at
> > the same time was also offered: http://marc.info/?t=14835734531.
> > This is possible, but there was a concern that it might break existing
> > userspaces that parse the output of the defrag mode, so the fifth option
> > was introduced instead.
> > 
> > This patch also cleans up the helper function for storing to "enabled" 
> > and "defrag" since the former supports three modes while the latter 
> > supports five and triple_flag_store() was getting unnecessarily messy.
> > 
> > Signed-off-by: David Rientjes 
> 
> alloc_hugepage_direct_gfpmask() would have been IMHO simpler if a new
> internal flag wasn't added, and a combination of the two existing flags
> for defer and madvise was used,

I agree with Vlastimil here. The patch can do without touching anything
outside of the sysfs handling.
-- 
Michal Hocko
SUSE Labs


Re: [patch v2] mm, thp: add new defer+madvise defrag option

2017-01-11 Thread Andrew Morton
On Tue, 10 Jan 2017 16:15:27 -0800 (PST) David Rientjes wrote:

> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
> regions as part of the "defer" mode, making it a very powerful setting
> that avoids breaking userspace, was offered:
> http://marc.info/?t=14823661273.  This additional mode is a
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=14835734531.
> This is possible, but there was a concern that it might break existing
> userspaces that parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
>  
>  echo always >/sys/kernel/mm/transparent_hugepage/defrag
>  echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo never >/sys/kernel/mm/transparent_hugepage/defrag
>  
> @@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
>  to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
> -to reclaim pages and wake kcompact to compact memory so that THP is
> +to reclaim pages and wake kcompactd to compact memory so that THP is
>  available in the near future. It's the responsibility of khugepaged
>  to then install the THP pages later.
>  
> +"defer+madvise" will enter direct reclaim and compaction like "always", but
> +only for regions that have used madvise(MADV_HUGEPAGE); all other regions
> +will wake kswapd in the background to reclaim pages and wake kcompactd to
> +compact memory so that THP is available in the near future.
> +

It would be helpful if this text were to tell the reader why they may
choose to use this option: runtime effects, advantages, when-to-use,
when-not-to-use, etc.




Re: [patch v2] mm, thp: add new defer+madvise defrag option

2017-01-11 Thread Mel Gorman
On Tue, Jan 10, 2017 at 04:15:27PM -0800, David Rientjes wrote:
> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
> regions as part of the "defer" mode, making it a very powerful setting
> that avoids breaking userspace, was offered:
> http://marc.info/?t=14823661273.  This additional mode is a
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=14835734531.
> This is possible, but there was a concern that it might break existing
> userspaces that parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> Signed-off-by: David Rientjes 
> ---
>  v2: uses new naming suggested by Vlastimil
>  (defer+madvise order looks better in
>   "... defer defer+madvise madvise ...")
> 
>  v1 was acked by Mel, and it probably could have been preserved but it was
>  removed in case there is an issue with the name change.
> 

There isn't

Acked-by: Mel Gorman 

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [patch v2] mm, thp: add new defer+madvise defrag option

2017-01-10 Thread Vlastimil Babka
[+CC linux-api]

On 01/11/2017 01:15 AM, David Rientjes wrote:
> There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
> to do direct compaction and reclaim while all other thp allocations simply 
> trigger kswapd and kcompactd in the background and fail immediately.
> 
> The "defer" setting simply triggers background reclaim and compaction for 
> all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
> userspace where MADV_HUGEPAGE is being used to indicate the application is 
> willing to wait for work for thp memory to be available.
> 
> The "madvise" setting will do direct compaction and reclaim for these
> MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
> background for anybody else.
> 
> For reasonable usage, there needs to be a mesh between the two options.  
> This patch introduces a fifth mode, "defer+madvise", that will do direct 
> reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
> reclaim and compaction for everybody else so that hugepages may be 
> available in the near future.
> 
> A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
> regions as part of the "defer" mode, making it a very powerful setting
> that avoids breaking userspace, was offered:
> http://marc.info/?t=14823661273.  This additional mode is a
> compromise.
> 
> A second proposal to allow both "defer" and "madvise" to be selected at
> the same time was also offered: http://marc.info/?t=14835734531.
> This is possible, but there was a concern that it might break existing
> userspaces that parse the output of the defrag mode, so the fifth option
> was introduced instead.
> 
> This patch also cleans up the helper function for storing to "enabled" 
> and "defrag" since the former supports three modes while the latter 
> supports five and triple_flag_store() was getting unnecessarily messy.
> 
> Signed-off-by: David Rientjes 

alloc_hugepage_direct_gfpmask() would have been IMHO simpler if a new
internal flag wasn't added, and a combination of the two existing flags
for defer and madvise was used, but whatever, I won't nak the patch over that.
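
To make the suggestion concrete, here is a minimal userspace sketch of how
"defer+madvise" could be expressed as two existing bits set together
instead of a new internal flag. This is an illustration only, not the
kernel's alloc_hugepage_direct_gfpmask(): the DEFRAG_* bit names, the
reclaim enum and reclaim_for_fault() are hypothetical names invented for
the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bits standing in for the three existing defrag flags. */
#define DEFRAG_DIRECT   (1u << 0)	/* set by "always" */
#define DEFRAG_KSWAPD   (1u << 1)	/* set by "defer" */
#define DEFRAG_REQ_MADV (1u << 2)	/* set by "madvise" */
#define DEFRAG_ALL      (DEFRAG_DIRECT | DEFRAG_KSWAPD | DEFRAG_REQ_MADV)

enum reclaim { RECLAIM_NONE, RECLAIM_BACKGROUND, RECLAIM_DIRECT };

/*
 * Decide how hard a THP fault should try, given the active flag bits and
 * whether the faulting region used madvise(MADV_HUGEPAGE).
 * "defer+madvise" is modelled as DEFRAG_KSWAPD | DEFRAG_REQ_MADV.
 */
static enum reclaim reclaim_for_fault(unsigned int flags, bool madvised)
{
	assert((flags & ~DEFRAG_ALL) == 0);	/* no unknown bits */

	if (flags & DEFRAG_DIRECT)
		return RECLAIM_DIRECT;
	if ((flags & DEFRAG_REQ_MADV) && madvised)
		return RECLAIM_DIRECT;
	if (flags & DEFRAG_KSWAPD)
		return RECLAIM_BACKGROUND;
	return RECLAIM_NONE;
}
```

In this model "defer" alone still gives background work for madvised
regions, "madvise" alone gives nothing to non-madvised regions, and only
the combination gives both behaviours, so only the sysfs handling has to
know about the fifth name.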

> ---
>  v2: uses new naming suggested by Vlastimil
>  (defer+madvise order looks better in
>   "... defer defer+madvise madvise ...")

OK.

>  v1 was acked by Mel, and it probably could have been preserved but it was
>  removed in case there is an issue with the name change.
> 
>  Documentation/vm/transhuge.txt |   8 ++-
>  include/linux/huge_mm.h        |   1 +
>  mm/huge_memory.c               | 146 +
>  3 files changed, 82 insertions(+), 73 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
>  
>  echo always >/sys/kernel/mm/transparent_hugepage/defrag
>  echo defer >/sys/kernel/mm/transparent_hugepage/defrag
> +echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
>  echo never >/sys/kernel/mm/transparent_hugepage/defrag
>  
> @@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
>  to utilise them.
>  
>  "defer" means that an application will wake kswapd in the background
> -to reclaim pages and wake kcompact to compact memory so that THP is
> +to reclaim pages and wake kcompactd to compact memory so that THP is
>  available in the near future. It's the responsibility of khugepaged
>  to then install the THP pages later.
>  
> +"defer+madvise" will enter direct reclaim and compaction like "always", but
> +only for regions that have used madvise(MADV_HUGEPAGE); all other regions
> +will wake kswapd in the background to reclaim pages and wake kcompactd to
> +compact memory so that THP is available in the near future.
> +
>  "madvise" will enter direct reclaim like "always" but only for regions
>  that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,7 @@ enum transparent_hugepage_flag {
>   TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>   TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -142,42 +142,6 @@ static struct shrinker huge_zero_page_shrinker = {
>  };
>  
>  #ifdef CONFIG_SYSFS
> -
> -static ssize_t triple_flag_store(struct kobject *kobj,
> -  struct kobj_attribute *attr,
> -  const 

[patch v2] mm, thp: add new defer+madvise defrag option

2017-01-10 Thread David Rientjes
There is no thp defrag option that currently allows MADV_HUGEPAGE regions 
to do direct compaction and reclaim while all other thp allocations simply 
trigger kswapd and kcompactd in the background and fail immediately.

The "defer" setting simply triggers background reclaim and compaction for 
all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our 
userspace where MADV_HUGEPAGE is being used to indicate the application is 
willing to wait for work for thp memory to be available.

The "madvise" setting will do direct compaction and reclaim for these
MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the 
background for anybody else.

For reasonable usage, there needs to be a mesh between the two options.  
This patch introduces a fifth mode, "defer+madvise", that will do direct 
reclaim and compaction for MADV_HUGEPAGE regions and trigger background 
reclaim and compaction for everybody else so that hugepages may be 
available in the near future.
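
As a rough illustration of how the five modes differ, the following
userspace sketch maps each mode and the MADV_HUGEPAGE state of a region to
the work done at fault time. It is not code from this patch: the enum and
function names are made up for the example, and the "always" and "never"
rows follow the existing documentation rather than this changelog.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; they are not kernel identifiers. */
enum defrag_mode {
	MODE_ALWAYS,
	MODE_DEFER,
	MODE_DEFER_MADVISE,
	MODE_MADVISE,
	MODE_NEVER,
};

enum fault_action {
	ACT_FAIL_FAST,	/* no reclaim or compaction is triggered */
	ACT_BACKGROUND,	/* wake kswapd and kcompactd, then fail immediately */
	ACT_DIRECT,	/* direct reclaim and compaction; the fault may stall */
};

/* What a THP fault does under a given mode for a (non-)madvised region. */
static enum fault_action thp_fault_action(enum defrag_mode mode, bool madvised)
{
	assert(mode <= MODE_NEVER);

	switch (mode) {
	case MODE_ALWAYS:
		return ACT_DIRECT;
	case MODE_DEFER:
		return ACT_BACKGROUND;	/* regardless of MADV_HUGEPAGE */
	case MODE_DEFER_MADVISE:
		return madvised ? ACT_DIRECT : ACT_BACKGROUND;
	case MODE_MADVISE:
		return madvised ? ACT_DIRECT : ACT_FAIL_FAST;
	case MODE_NEVER:
	default:
		return ACT_FAIL_FAST;
	}
}
```

With this model, thp_fault_action(MODE_DEFER_MADVISE, true) gives
ACT_DIRECT and thp_fault_action(MODE_DEFER_MADVISE, false) gives
ACT_BACKGROUND, which is the pairing none of the other four modes provides.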

A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE
regions as part of the "defer" mode, making it a very powerful setting
that avoids breaking userspace, was offered:
http://marc.info/?t=14823661273.  This additional mode is a
compromise.

A second proposal to allow both "defer" and "madvise" to be selected at
the same time was also offered: http://marc.info/?t=14835734531.
This is possible, but there was a concern that it might break existing
userspaces that parse the output of the defrag mode, so the fifth option
was introduced instead.
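
For reference, a parser of the kind this concern is about might look like
the sketch below. It assumes the usual sysfs convention that the file lists
every mode on one line with the active one in square brackets; the function
name active_defrag_mode() is hypothetical. A parser written this way
expects exactly one bracketed token, which is why selecting two modes at
once could confuse it while a new single name would not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) value from a line such as
 * "always defer defer+madvise [madvise] never" into out.
 * Returns 0 on success, -1 if there is no complete "[value]" or if the
 * value plus its terminating NUL does not fit in outlen bytes.
 */
static int active_defrag_mode(const char *line, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(line != NULL && out != NULL);

	lb = strchr(line, '[');
	if (!lb)
		return -1;
	rb = strchr(lb + 1, ']');
	if (!rb)
		return -1;

	len = (size_t)(rb - lb - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

A caller would read /sys/kernel/mm/transparent_hugepage/defrag into a
buffer and pass it in; the '+' in "defer+madvise" needs no special
handling because only the brackets are treated as delimiters.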

This patch also cleans up the helper function for storing to "enabled" 
and "defrag" since the former supports three modes while the latter 
supports five and triple_flag_store() was getting unnecessarily messy.
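
One detail any store helper for five names has to get right is that
"defer" is a prefix of "defer+madvise". The sketch below is a userspace
illustration of a table-driven matcher, not the code in this patch;
defrag_names[] and match_defrag_mode() are hypothetical. Unlike the
min(sizeof(...)-1, count) comparison in the removed helper, it requires the
whole name to be present in the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Longer names come before their prefixes so "defer" cannot shadow
 * "defer+madvise". */
static const char *const defrag_names[] = {
	"always", "defer+madvise", "defer", "madvise", "never",
};

/*
 * Return the index in defrag_names[] of the mode that buf starts with,
 * or -1 if none matches. Trailing bytes such as the newline written by
 * echo are ignored.
 */
static int match_defrag_mode(const char *buf, size_t count)
{
	size_t i;

	assert(buf != NULL);

	for (i = 0; i < sizeof(defrag_names) / sizeof(defrag_names[0]); i++) {
		size_t n = strlen(defrag_names[i]);

		if (count >= n && !memcmp(defrag_names[i], buf, n))
			return (int)i;
	}
	return -1;
}
```

If "defer" were listed first, writing "defer+madvise" would match it and
the suffix would be silently ignored, so the ordering of the table is part
of the interface.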

Signed-off-by: David Rientjes 
---
 v2: uses new naming suggested by Vlastimil
 (defer+madvise order looks better in
  "... defer defer+madvise madvise ...")

 v1 was acked by Mel, and it probably could have been preserved but it was
 removed in case there is an issue with the name change.

 Documentation/vm/transhuge.txt |   8 ++-
 include/linux/huge_mm.h        |   1 +
 mm/huge_memory.c               | 146 +
 3 files changed, 82 insertions(+), 73 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -110,6 +110,7 @@ MADV_HUGEPAGE region.
 
 echo always >/sys/kernel/mm/transparent_hugepage/defrag
 echo defer >/sys/kernel/mm/transparent_hugepage/defrag
+echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
 echo never >/sys/kernel/mm/transparent_hugepage/defrag
 
@@ -120,10 +121,15 @@ that benefit heavily from THP use and are willing to delay the VM start
 to utilise them.
 
 "defer" means that an application will wake kswapd in the background
-to reclaim pages and wake kcompact to compact memory so that THP is
+to reclaim pages and wake kcompactd to compact memory so that THP is
 available in the near future. It's the responsibility of khugepaged
 to then install the THP pages later.
 
+"defer+madvise" will enter direct reclaim and compaction like "always", but
+only for regions that have used madvise(MADV_HUGEPAGE); all other regions
+will wake kswapd in the background to reclaim pages and wake kcompactd to
+compact memory so that THP is available in the near future.
+
 "madvise" will enter direct reclaim like "always" but only for regions
 that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,7 @@ enum transparent_hugepage_flag {
    TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
    TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
    TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+   TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
    TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
    TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
    TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -142,42 +142,6 @@ static struct shrinker huge_zero_page_shrinker = {
 };
 
 #ifdef CONFIG_SYSFS
-
-static ssize_t triple_flag_store(struct kobject *kobj,
-                                 struct kobj_attribute *attr,
-                                 const char *buf, size_t count,
-                                 enum transparent_hugepage_flag enabled,
-                                 enum transparent_hugepage_flag deferred,
-                                 enum transparent_hugepage_flag req_madv)
-{
-       if (!memcmp("defer", buf,
-                   min(sizeof("defer")-1, count))) {
-               if (enabled == deferred)
-                       return -EINVAL;
-               clear_bit(enabled,