Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-05 Thread Andrea Arcangeli
On Wed, Dec 05, 2018 at 09:15:28PM +0100, Michal Hocko wrote:
> If __GFP_THISNODE really should be used then it should be applied to
> all other types of pages, not only THP, and done as a separate patch,
> not as part of the revert. The cleanup was meant to unify THP
> allocations and that is why I object to reverting it as part of this
> work.

I also wonder if using __GFP_THISNODE for all pages, not only THP, may
help David's workload performance further, given it is so NUMA
sensitive and short lived (plus we know it fits in a single node most
of the time). It would get rid of the cache in the local node before
allocating remote memory. Of course, the swap storms and pathological
performance for processes that don't fit in a node would then also
materialize even with THP disabled. If this is true, it would point
further to the need for an MPOL that can set __GFP_THISNODE not just
for THP but for all allocations, with the only difference being that
if the regular page size allocation is the one failing, instead of
hitting OOM it should do one last fallback to a regular page size
allocation without __GFP_THISNODE set.
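
To make that fallback order concrete, here is a minimal user-space
sketch of the idea (the flag value, the orders and the try_alloc()
helper below are made up for illustration only; they are not the
kernel's gfp API):

#include <stdio.h>

/* Illustrative stand-ins, not real gfp flags. */
#define FAKE_GFP_THISNODE	0x1u	/* stay on the local node */
#define ORDER_HUGE		9	/* 2MB THP on x86-64 */
#define ORDER_BASE		0	/* native page size */

/* Pretend allocator: NULL means the request could not be satisfied. */
static void *try_alloc(int order, unsigned int flags)
{
	(void)order;
	(void)flags;
	return NULL;	/* always fail, so every fallback is exercised */
}

/*
 * The policy described above: the huge page and the base page are both
 * tried node-locally first; only the final base page attempt drops the
 * local-node restriction, mirroring the "one last fallback" instead of
 * hitting OOM.
 */
static void *alloc_with_local_preference(void)
{
	void *page;

	page = try_alloc(ORDER_HUGE, FAKE_GFP_THISNODE);
	if (page)
		return page;
	page = try_alloc(ORDER_BASE, FAKE_GFP_THISNODE);
	if (page)
		return page;
	return try_alloc(ORDER_BASE, 0);	/* last resort: any node */
}

int main(void)
{
	printf("result: %p\n", alloc_with_local_preference());
	return 0;
}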

Regardless of the above, I think it's still interesting to look into
adding more NUMA affinity to THP somehow, but I doubt we want to do
that at the price of crippling compaction so that it can't generate
THP anymore once a node is full of cache, and we certainly don't want
to cripple compaction on non-NUMA hosts either, as would happen with
the currently proposed patch. Furthermore, whatever we do should also
work for orders 8, 7, 6, etc. If we did an order-9-specialized trick
that cripples compaction effectiveness, it would be a short term
approach tailored to David's use case, which seems to be sensitive to
allocation latency. Being sensitive to allocation latency doesn't make
that process a good fit for MADV_HUGEPAGE to begin with.
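
For context, the MADV_HUGEPAGE opt-in this keeps coming back to is set
per mapping, roughly as in the minimal example below (error handling
trimmed; the 4MB size is an arbitrary choice for the example):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4UL << 20;	/* room for two 2MB huge pages */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* Hint that this range should be backed by THP even when the
	 * global default is "madvise"; this is the hint whose gfp
	 * behavior the thread is arguing about. */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise");
	/* Touch the range so the fault path actually allocates. */
	for (size_t i = 0; i < len; i += 4096)
		buf[i] = 1;
	munmap(buf, len);
	return 0;
}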


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 11:24:53, David Rientjes wrote:
> On Wed, 5 Dec 2018, Michal Hocko wrote:
> 
> > > > At minimum do not remove the cleanup part which consolidates the gfp
> > > > handling to a single place. There is no real reason to have the
> > > > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > > > 
> > > 
> > > The __GFP_THISNODE usage is still confined to 
> > > alloc_hugepage_direct_gfpmask() for the thp fault path; we no longer set 
> > > it in alloc_pages_vma() as done before the cleanup.
> > 
> > Why should new_page be any different?
> > 
> 
> To match alloc_new_node_page(), which does it correctly and does not
> change the behavior of mbind() the way the cleanup did; before the
> cleanup, mbind() used alloc_hugepage_vma() to get the __GFP_THISNODE
> behavior.  If there is a reason mbind() is somehow different with
> respect to allocating hugepages locally, I think that should be a
> separate patch, but the goal of this patch is to revert all of the
> behavioral change that caused hugepages to be allocated remotely.

If __GFP_THISNODE really should be used then it should be applied to
all other types of pages, not only THP, and done as a separate patch,
not as part of the revert. The cleanup was meant to unify THP
allocations and that is why I object to reverting it as part of this
work.

-- 
Michal Hocko
SUSE Labs


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Michal Hocko wrote:

> > > At minimum do not remove the cleanup part which consolidates the gfp
> > > handling to a single place. There is no real reason to have the
> > > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > > 
> > 
> > The __GFP_THISNODE usage is still confined to 
> > alloc_hugepage_direct_gfpmask() for the thp fault path; we no longer set 
> > it in alloc_pages_vma() as done before the cleanup.
> 
> Why should new_page be any different?
> 

To match alloc_new_node_page(), which does it correctly and does not
change the behavior of mbind() the way the cleanup did; before the
cleanup, mbind() used alloc_hugepage_vma() to get the __GFP_THISNODE
behavior.  If there is a reason mbind() is somehow different with
respect to allocating hugepages locally, I think that should be a
separate patch, but the goal of this patch is to revert all of the
behavioral change that caused hugepages to be allocated remotely.
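
As a rough illustration, the mbind() path in question is the one hit
when userspace rebinds an already-populated (possibly THP-backed)
range and asks for migration, e.g. something like the sketch below
(real syscalls, but the size and node choice are arbitrary; link with
-lnuma for the mbind() wrapper):

#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4UL << 20;
	unsigned long nodemask = 1UL << 0;	/* node 0 only */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, len, MADV_HUGEPAGE);
	/* Populate first, then rebind with MPOL_MF_MOVE: migrating the
	 * already-faulted pages goes through the mbind() migration
	 * allocation callback (the new_page() being discussed above). */
	for (size_t i = 0; i < len; i += 4096)
		buf[i] = 1;
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), MPOL_MF_MOVE))
		perror("mbind");
	munmap(buf, len);
	return 0;
}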


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 13:56:30, David Rientjes wrote:
> On Tue, 4 Dec 2018, Michal Hocko wrote:
> 
> > > This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > > MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> > > consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> > > 
> > > By not setting __GFP_THISNODE, applications can allocate remote hugepages
> > > when the local node is fragmented or low on memory when either the thp
> > > defrag setting is "always" or the vma has been madvised with
> > > MADV_HUGEPAGE.
> > > 
> > > Remote access to hugepages often has much higher latency than local pages
> > > of the native page size.  On Haswell, ac5b2c18911f was shown to have a
> > > 13.9% access regression after this commit for binaries that remap their
> > > text segment to be backed by transparent hugepages.
> > > 
> > > The intent of ac5b2c18911f is to address an issue where a local node is
> > > low on memory or fragmented such that a hugepage cannot be allocated.  In
> > > every scenario where this was described as a fix, there is abundant and
> > > unfragmented remote memory available to allocate from, even with a greater
> > > access latency.
> > > 
> > > If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> > > also measured on Haswell to have a 40% regression in allocation latency.
> > > 
> > > Restore __GFP_THISNODE for thp allocations.
> > > 
> > > Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> > > Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
> > 
> > At minimum do not remove the cleanup part which consolidates the gfp
> > handling to a single place. There is no real reason to have the
> > __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> > 
> 
> The __GFP_THISNODE usage is still confined to 
> alloc_hugepage_direct_gfpmask() for the thp fault path; we no longer set 
> it in alloc_pages_vma() as done before the cleanup.

Why should new_page be any different?

-- 
Michal Hocko
SUSE Labs


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-04 Thread David Rientjes
On Tue, 4 Dec 2018, Michal Hocko wrote:

> > This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> > MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> > consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> > 
> > By not setting __GFP_THISNODE, applications can allocate remote hugepages
> > when the local node is fragmented or low on memory when either the thp
> > defrag setting is "always" or the vma has been madvised with
> > MADV_HUGEPAGE.
> > 
> > Remote access to hugepages often has much higher latency than local pages
> > of the native page size.  On Haswell, ac5b2c18911f was shown to have a
> > 13.9% access regression after this commit for binaries that remap their
> > text segment to be backed by transparent hugepages.
> > 
> > The intent of ac5b2c18911f is to address an issue where a local node is
> > low on memory or fragmented such that a hugepage cannot be allocated.  In
> > every scenario where this was described as a fix, there is abundant and
> > unfragmented remote memory available to allocate from, even with a greater
> > access latency.
> > 
> > If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> > also measured on Haswell to have a 40% regression in allocation latency.
> > 
> > Restore __GFP_THISNODE for thp allocations.
> > 
> > Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> > Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
> 
> At minimum do not remove the cleanup part which consolidates the gfp
> handling to a single place. There is no real reason to have the
> __GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.
> 

The __GFP_THISNODE usage is still confined to 
alloc_hugepage_direct_gfpmask() for the thp fault path; we no longer set 
it in alloc_pages_vma() as done before the cleanup.


Re: [patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-03 Thread Michal Hocko
On Mon 03-12-18 15:50:24, David Rientjes wrote:
> This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
> consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").
> 
> By not setting __GFP_THISNODE, applications can allocate remote hugepages
> when the local node is fragmented or low on memory when either the thp
> defrag setting is "always" or the vma has been madvised with
> MADV_HUGEPAGE.
> 
> Remote access to hugepages often has much higher latency than local pages
> of the native page size.  On Haswell, ac5b2c18911f was shown to have a
> 13.9% access regression after this commit for binaries that remap their
> text segment to be backed by transparent hugepages.
> 
> The intent of ac5b2c18911f is to address an issue where a local node is
> low on memory or fragmented such that a hugepage cannot be allocated.  In
> every scenario where this was described as a fix, there is abundant and
> unfragmented remote memory available to allocate from, even with a greater
> access latency.
> 
> If remote memory is also low or fragmented, not setting __GFP_THISNODE was
> also measured on Haswell to have a 40% regression in allocation latency.
> 
> Restore __GFP_THISNODE for thp allocations.
> 
> Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
> Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")

At minimum do not remove the cleanup part which consolidates the gfp
handling to a single place. There is no real reason to have the
__GFP_THISNODE ugliness outside of alloc_hugepage_direct_gfpmask.

I still hate the __GFP_THISNODE part, as mentioned before. It is an
ugly hack, but I can learn to live with it if this is indeed the only
option as a short-term workaround until we find a proper solution.

> Signed-off-by: David Rientjes 
> ---
>  include/linux/mempolicy.h |  2 --
>  mm/huge_memory.c  | 42 +++
>  mm/mempolicy.c|  7 ---
>  3 files changed, 20 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   unsigned long addr);
> -struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> - unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>   const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> - gfp_t this_node = 0;
> -
> -#ifdef CONFIG_NUMA
> - struct mempolicy *pol;
> - /*
> -  * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
> -  * specified, to express a general desire to stay on the current
> -  * node for optimistic allocation attempts. If the defrag mode
> -  * and/or madvise hint requires the direct reclaim then we prefer
> -  * to fallback to other node rather than node reclaim because that
> -  * can lead to excessive reclaim even though there is free memory
> -  * on other nodes. We expect that NUMA preferences are specified
> -  * by memory policies.
> -  */
> - pol = get_vma_policy(vma, addr);
> - if (pol->mode != MPOL_BIND)
> - this_node = __GFP_THISNODE;
> - mpol_cond_put(pol);
> -#endif
> + const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
>  
> + /* Always do synchronous compaction */
>   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> + return GFP_TRANSHUGE | __GFP_THISNODE |
> +(vma_madvised ? 0 : __GFP_NORETRY);
> +
> + /* Kick kcompactd and fail quickly */
>   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
> + return gfp_mask | __GFP_KSWAPD_RECLAIM;
> +
> + /* Synchronous compaction if madvised, otherwise kick kcompactd */
>   if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
> - return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -  __GFP_KSWAPD_REC

[patch 1/2 for-4.20] mm, thp: restore node-local hugepage allocations

2018-12-03 Thread David Rientjes
This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

By not setting __GFP_THISNODE, applications can allocate remote hugepages
when the local node is fragmented or low on memory when either the thp
defrag setting is "always" or the vma has been madvised with
MADV_HUGEPAGE.

Remote access to hugepages often has much higher latency than local pages
of the native page size.  On Haswell, ac5b2c18911f was shown to have a
13.9% access regression after this commit for binaries that remap their
text segment to be backed by transparent hugepages.

The intent of ac5b2c18911f is to address an issue where a local node is
low on memory or fragmented such that a hugepage cannot be allocated.  In
every scenario where this was described as a fix, there is abundant and
unfragmented remote memory available to allocate from, even with a greater
access latency.

If remote memory is also low or fragmented, not setting __GFP_THISNODE was
also measured on Haswell to have a 40% regression in allocation latency.

Restore __GFP_THISNODE for thp allocations.

Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")

Signed-off-by: David Rientjes 
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c  | 42 +++
 mm/mempolicy.c|  7 ---
 3 files changed, 20 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-   unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -632,37 +632,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-   gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-   struct mempolicy *pol;
-   /*
-* __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-* specified, to express a general desire to stay on the current
-* node for optimistic allocation attempts. If the defrag mode
-* and/or madvise hint requires the direct reclaim then we prefer
-* to fallback to other node rather than node reclaim because that
-* can lead to excessive reclaim even though there is free memory
-* on other nodes. We expect that NUMA preferences are specified
-* by memory policies.
-*/
-   pol = get_vma_policy(vma, addr);
-   if (pol->mode != MPOL_BIND)
-   this_node = __GFP_THISNODE;
-   mpol_cond_put(pol);
-#endif
+   const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+   /* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-   return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+   return GFP_TRANSHUGE | __GFP_THISNODE |
+  (vma_madvised ? 0 : __GFP_NORETRY);
+
+   /* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-   return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+   return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+   /* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+   return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+ __GFP_KSWAPD_RECLAIM);
+
+   /* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						     this_node);
-   return GFP_TRANSHUGE_LIGHT | this_node;
+   return gfp_mask | (vma_madvised ? __GFP_DIRECT_