Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
Vlastimil Babka writes:

> On 04/21/2015 09:31 AM, Aneesh Kumar K.V wrote:
>> Vlastimil Babka writes:
>>
>>> On 25.2.2015 22:24, David Rientjes wrote:
>
> alloc_pages_preferred_node() variant, change the exact_node() variant to
> pass __GFP_THISNODE, and audit and adjust all callers accordingly.
>
...
>>> Right, we might be changing behavior not just for slab allocators, but
>>> also others using such combination of flags.
>>
>> Any update on this ? Did we reach a conclusion on how to go forward here ?
>
> I believe David's later version was merged already. Or what exactly are
> you asking about?

When I checked last time I didn't find it. Hence I asked here. Now I see
that it got committed as 5265047ac30191ea24b16503165000c225f54feb

Thanks
-aneesh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
On 04/21/2015 09:31 AM, Aneesh Kumar K.V wrote:
> Vlastimil Babka writes:
>
>> On 25.2.2015 22:24, David Rientjes wrote:
>>>> alloc_pages_preferred_node() variant, change the exact_node() variant
>>>> to pass __GFP_THISNODE, and audit and adjust all callers accordingly.
>>>
>>> Sounds like that should be done as part of a cleanup after the 4.0
>>> issues are addressed. alloc_pages_exact_node() does seem to suggest
>>> that we want exactly that node, implying __GFP_THISNODE behavior
>>> already, so it would be good to avoid having this come up again in
>>> the future.
>>
>> Oh lovely, just found out that there's alloc_pages_node which should be
>> the preferred-only version, but in fact does not differ from
>> alloc_pages_exact_node in any relevant way. I agree we should do some
>> larger cleanup for next version.
>>
>>>> Also, you pass __GFP_NOWARN but that should be covered by
>>>> GFP_TRANSHUGE already. Of course, nothing guarantees that
>>>> hugepage == true implies that gfp == GFP_TRANSHUGE... but current
>>>> in-tree callers conform to that.
>>>
>>> Ah, good point, and it includes __GFP_NORETRY as well which means that
>>> this patch is busted. It won't try compaction or direct reclaim in the
>>> page allocator slowpath because of this:
>>>
>>>         /*
>>>          * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
>>>          * __GFP_NOWARN set) should not cause reclaim since the subsystem
>>>          * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
>>>          * using a larger set of nodes after it has established that the
>>>          * allowed per node queues are empty and that nodes are
>>>          * over allocated.
>>>          */
>>>         if (IS_ENABLED(CONFIG_NUMA) &&
>>>             (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>>>                 goto nopage;
>>>
>>> Hmm. It would be disappointing to have to pass the nodemask of the
>>> exact node that we want to allocate from into the page allocator to
>>> avoid using __GFP_THISNODE.
>>
>> Yeah.
>>
>>> There's a sneaky way around it by just removing __GFP_NORETRY from
>>> GFP_TRANSHUGE so the condition above fails and since the page allocator
>>> won't retry for such a high-order allocation, but that probably just
>>> papers over this stuff too much already. I think what we want to do is
>>
>> Alternatively alloc_pages_exact_node() adds __GFP_THISNODE just to
>> node_zonelist() call and not to __alloc_pages() gfp_mask proper? Unless
>> __GFP_THISNODE was given *also* in the incoming gfp_mask, this should
>> give us the right combination? But it's also subtle
>>
>>> cause the slab allocators to not use __GFP_WAIT if they want to avoid
>>> reclaim.
>>
>> Yes, the fewer subtle heuristics we have that include combinations of
>> flags (*cough* GFP_TRANSHUGE *cough*), the better.
>>
>>> This is probably going to be a much more invasive patch than
>>> originally thought.
>>
>> Right, we might be changing behavior not just for slab allocators, but
>> also others using such combination of flags.
>
> Any update on this ? Did we reach a conclusion on how to go forward here ?

I believe David's later version was merged already. Or what exactly are
you asking about?

> -aneesh
Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
Vlastimil Babka writes:

> On 25.2.2015 22:24, David Rientjes wrote:
>>> alloc_pages_preferred_node() variant, change the exact_node() variant
>>> to pass __GFP_THISNODE, and audit and adjust all callers accordingly.
>>
>> Sounds like that should be done as part of a cleanup after the 4.0
>> issues are addressed. alloc_pages_exact_node() does seem to suggest
>> that we want exactly that node, implying __GFP_THISNODE behavior
>> already, so it would be good to avoid having this come up again in the
>> future.
>
> Oh lovely, just found out that there's alloc_pages_node which should be
> the preferred-only version, but in fact does not differ from
> alloc_pages_exact_node in any relevant way. I agree we should do some
> larger cleanup for next version.
>
>>> Also, you pass __GFP_NOWARN but that should be covered by
>>> GFP_TRANSHUGE already. Of course, nothing guarantees that
>>> hugepage == true implies that gfp == GFP_TRANSHUGE... but current
>>> in-tree callers conform to that.
>>
>> Ah, good point, and it includes __GFP_NORETRY as well which means that
>> this patch is busted. It won't try compaction or direct reclaim in the
>> page allocator slowpath because of this:
>>
>>         /*
>>          * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
>>          * __GFP_NOWARN set) should not cause reclaim since the subsystem
>>          * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
>>          * using a larger set of nodes after it has established that the
>>          * allowed per node queues are empty and that nodes are
>>          * over allocated.
>>          */
>>         if (IS_ENABLED(CONFIG_NUMA) &&
>>             (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>>                 goto nopage;
>>
>> Hmm. It would be disappointing to have to pass the nodemask of the
>> exact node that we want to allocate from into the page allocator to
>> avoid using __GFP_THISNODE.
>
> Yeah.
>
>> There's a sneaky way around it by just removing __GFP_NORETRY from
>> GFP_TRANSHUGE so the condition above fails and since the page allocator
>> won't retry for such a high-order allocation, but that probably just
>> papers over this stuff too much already. I think what we want to do is
>
> Alternatively alloc_pages_exact_node() adds __GFP_THISNODE just to
> node_zonelist() call and not to __alloc_pages() gfp_mask proper? Unless
> __GFP_THISNODE was given *also* in the incoming gfp_mask, this should
> give us the right combination? But it's also subtle
>
>> cause the slab allocators to not use __GFP_WAIT if they want to avoid
>> reclaim.
>
> Yes, the fewer subtle heuristics we have that include combinations of
> flags (*cough* GFP_TRANSHUGE *cough*), the better.
>
>> This is probably going to be a much more invasive patch than originally
>> thought.
>
> Right, we might be changing behavior not just for slab allocators, but
> also others using such combination of flags.

Any update on this ? Did we reach a conclusion on how to go forward here ?

-aneesh
Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
On 25.2.2015 22:24, David Rientjes wrote:
>> alloc_pages_preferred_node() variant, change the exact_node() variant
>> to pass __GFP_THISNODE, and audit and adjust all callers accordingly.
>
> Sounds like that should be done as part of a cleanup after the 4.0
> issues are addressed. alloc_pages_exact_node() does seem to suggest that
> we want exactly that node, implying __GFP_THISNODE behavior already, so
> it would be good to avoid having this come up again in the future.

Oh lovely, just found out that there's alloc_pages_node which should be
the preferred-only version, but in fact does not differ from
alloc_pages_exact_node in any relevant way. I agree we should do some
larger cleanup for next version.

>> Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
>> already. Of course, nothing guarantees that hugepage == true implies
>> that gfp == GFP_TRANSHUGE... but current in-tree callers conform to
>> that.
>
> Ah, good point, and it includes __GFP_NORETRY as well which means that
> this patch is busted. It won't try compaction or direct reclaim in the
> page allocator slowpath because of this:
>
>         /*
>          * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
>          * __GFP_NOWARN set) should not cause reclaim since the subsystem
>          * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
>          * using a larger set of nodes after it has established that the
>          * allowed per node queues are empty and that nodes are
>          * over allocated.
>          */
>         if (IS_ENABLED(CONFIG_NUMA) &&
>             (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>                 goto nopage;
>
> Hmm. It would be disappointing to have to pass the nodemask of the exact
> node that we want to allocate from into the page allocator to avoid
> using __GFP_THISNODE.

Yeah.

> There's a sneaky way around it by just removing __GFP_NORETRY from
> GFP_TRANSHUGE so the condition above fails and since the page allocator
> won't retry for such a high-order allocation, but that probably just
> papers over this stuff too much already. I think what we want to do is

Alternatively alloc_pages_exact_node() adds __GFP_THISNODE just to
node_zonelist() call and not to __alloc_pages() gfp_mask proper? Unless
__GFP_THISNODE was given *also* in the incoming gfp_mask, this should
give us the right combination? But it's also subtle

> cause the slab allocators to not use __GFP_WAIT if they want to avoid
> reclaim.

Yes, the fewer subtle heuristics we have that include combinations of
flags (*cough* GFP_TRANSHUGE *cough*), the better.

> This is probably going to be a much more invasive patch than originally
> thought.

Right, we might be changing behavior not just for slab allocators, but
also others using such combination of flags.
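The alternative floated in this message (use __GFP_THISNODE only when picking the zonelist, and leave the mask handed to the allocator untouched) can be pictured with a small userspace model. This is only a sketch of the idea, not kernel code: every DEMO_* name, the struct, and the bit values are hypothetical, and only the relationship between the masks is taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical bit values; only the relationships between masks matter. */
#define DEMO_GFP_NORETRY  0x1u
#define DEMO_GFP_NOWARN   0x2u
#define DEMO_GFP_THISNODE 0x4u

/* Stand-in for the composite GFP_THISNODE described in the quoted comment. */
#define DEMO_GFP_THISNODE_COMBO \
	(DEMO_GFP_THISNODE | DEMO_GFP_NORETRY | DEMO_GFP_NOWARN)

/* The two masks a hypothetical exact-node helper derives from the caller's gfp. */
struct demo_exact_request {
	demo_gfp_t zonelist_gfp; /* used only to choose the node-local zonelist */
	demo_gfp_t alloc_gfp;    /* passed on to the allocator proper */
};

/* Add the this-node bit for zonelist selection only (assumed behavior). */
struct demo_exact_request demo_exact_node_request(demo_gfp_t gfp)
{
	struct demo_exact_request req = {
		.zonelist_gfp = gfp | DEMO_GFP_THISNODE,
		.alloc_gfp = gfp,
	};
	return req;
}

/* Same shape as the slowpath test quoted in the thread. */
bool demo_slowpath_bails_out(demo_gfp_t gfp)
{
	return (gfp & DEMO_GFP_THISNODE_COMBO) == DEMO_GFP_THISNODE_COMBO;
}
```

In this model a caller passing only the no-retry and no-warn bits still gets a node-local zonelist, while the allocator mask no longer matches the composite, so the early bail-out is not taken. A caller that passes the this-node bit itself keeps the existing behavior, which is the "unless given also in the incoming gfp_mask" case.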
Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
On Wed, 25 Feb 2015, Vlastimil Babka wrote:

> > Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
> > node") restructured alloc_hugepage_vma() with the intent of only
> > allocating transparent hugepages locally when there was not an effective
> > interleave mempolicy.
> >
> > alloc_pages_exact_node() does not limit the allocation to the single
> > node, however, but rather prefers it. This is because __GFP_THISNODE is
> > not set which would cause the node-local nodemask to be passed. Without
> > it, only a nodemask that prefers the local node is passed.
>
> Oops, good catch.
> But I believe we have the same problem with khugepaged_alloc_page(),
> rendering the recent node determination and zone_reclaim strictness
> patches partially useless.
>

Indeed.

> Then I start to wonder about other alloc_pages_exact_node() users. Some
> do pass __GFP_THISNODE, others not - are they also mistaken? I guess the
> function is a misnomer - when I see "exact_node", I expect the
> __GFP_THISNODE behavior.
>

I looked through these yesterday as well and could only find the
do_migrate_pages() case for page migration where __GFP_THISNODE was
missing. I proposed that separately as
http://marc.info/?l=linux-mm&m=142481989722497 -- I couldn't find any
other users that looked wrong.

> I think to avoid such hidden catches, we should create
> alloc_pages_preferred_node() variant, change the exact_node() variant to
> pass __GFP_THISNODE, and audit and adjust all callers accordingly.
>

Sounds like that should be done as part of a cleanup after the 4.0 issues
are addressed. alloc_pages_exact_node() does seem to suggest that we want
exactly that node, implying __GFP_THISNODE behavior already, so it would
be good to avoid having this come up again in the future.

> Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
> already. Of course, nothing guarantees that hugepage == true implies
> that gfp == GFP_TRANSHUGE... but current in-tree callers conform to that.
>

Ah, good point, and it includes __GFP_NORETRY as well which means that
this patch is busted. It won't try compaction or direct reclaim in the
page allocator slowpath because of this:

        /*
         * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
         * __GFP_NOWARN set) should not cause reclaim since the subsystem
         * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
         * using a larger set of nodes after it has established that the
         * allowed per node queues are empty and that nodes are
         * over allocated.
         */
        if (IS_ENABLED(CONFIG_NUMA) &&
            (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
                goto nopage;

Hmm. It would be disappointing to have to pass the nodemask of the exact
node that we want to allocate from into the page allocator to avoid using
__GFP_THISNODE.

There's a sneaky way around it by just removing __GFP_NORETRY from
GFP_TRANSHUGE so the condition above fails and since the page allocator
won't retry for such a high-order allocation, but that probably just
papers over this stuff too much already. I think what we want to do is
cause the slab allocators to not use __GFP_WAIT if they want to avoid
reclaim.

This is probably going to be a much more invasive patch than originally
thought.
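The interaction described above is easy to miss, so here is a minimal userspace sketch of the mask arithmetic. It is an illustration under stated assumptions rather than kernel code: the DEMO_* names and bit values are hypothetical, and DEMO_GFP_TRANSHUGE is simplified to the two flags the thread says GFP_TRANSHUGE contains plus one placeholder bit for everything else.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical bit values; the kernel's real values are not reproduced here. */
#define DEMO_GFP_OTHER    0x1u /* placeholder for all unrelated flags */
#define DEMO_GFP_NORETRY  0x2u
#define DEMO_GFP_NOWARN   0x4u
#define DEMO_GFP_THISNODE 0x8u

/* Composite mask, as described in the quoted slowpath comment. */
#define DEMO_GFP_THISNODE_COMBO \
	(DEMO_GFP_THISNODE | DEMO_GFP_NORETRY | DEMO_GFP_NOWARN)

/* Simplified THP mask: per the thread it includes no-retry and no-warn. */
#define DEMO_GFP_TRANSHUGE \
	(DEMO_GFP_OTHER | DEMO_GFP_NORETRY | DEMO_GFP_NOWARN)

/* True when the quoted check would jump to nopage, skipping reclaim. */
bool demo_goes_to_nopage(demo_gfp_t gfp_mask)
{
	return (gfp_mask & DEMO_GFP_THISNODE_COMBO) == DEMO_GFP_THISNODE_COMBO;
}
```

With these definitions, DEMO_GFP_TRANSHUGE alone does not trigger the bail-out, but DEMO_GFP_TRANSHUGE | DEMO_GFP_THISNODE does, because the THP mask already supplies the other two bits of the composite. Clearing DEMO_GFP_NORETRY from the THP mask makes the comparison fail again, which is the "sneaky way around it" mentioned in the message.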
Re: [patch v2 for-4.0] mm, thp: really limit transparent hugepage allocation to local node
On 02/25/2015 12:24 AM, David Rientjes wrote:
> From: Greg Thelen
>
> Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
> node") restructured alloc_hugepage_vma() with the intent of only
> allocating transparent hugepages locally when there was not an effective
> interleave mempolicy.
>
> alloc_pages_exact_node() does not limit the allocation to the single
> node, however, but rather prefers it. This is because __GFP_THISNODE is
> not set which would cause the node-local nodemask to be passed. Without
> it, only a nodemask that prefers the local node is passed.

Oops, good catch.
But I believe we have the same problem with khugepaged_alloc_page(),
rendering the recent node determination and zone_reclaim strictness
patches partially useless.

Then I start to wonder about other alloc_pages_exact_node() users. Some
do pass __GFP_THISNODE, others not - are they also mistaken? I guess the
function is a misnomer - when I see "exact_node", I expect the
__GFP_THISNODE behavior.

I think to avoid such hidden catches, we should create
alloc_pages_preferred_node() variant, change the exact_node() variant to
pass __GFP_THISNODE, and audit and adjust all callers accordingly.

Also, you pass __GFP_NOWARN but that should be covered by GFP_TRANSHUGE
already. Of course, nothing guarantees that hugepage == true implies that
gfp == GFP_TRANSHUGE... but current in-tree callers conform to that.

> Fix this by passing __GFP_THISNODE and falling back to small pages when
> the allocation fails.
>
> Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
> Signed-off-by: Greg Thelen
> Signed-off-by: David Rientjes
> ---
>  v2: GFP_THISNODE actually defers compaction and reclaim entirely based
>      on the combination of gfp flags. We want to try compaction and
>      reclaim, so only set __GFP_THISNODE. We still set __GFP_NOWARN to
>      suppress oom warnings in the kernel log when we can simply fallback
>      to small pages.
>
>  mm/mempolicy.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1985,7 +1985,10 @@ retry_cpuset:
>                  nmask = policy_nodemask(gfp, pol);
>                  if (!nmask || node_isset(node, *nmask)) {
>                          mpol_cond_put(pol);
> -                        page = alloc_pages_exact_node(node, gfp, order);
> +                        page = alloc_pages_exact_node(node, gfp |
> +                                                            __GFP_THISNODE |
> +                                                            __GFP_NOWARN,
> +                                                      order);
>                          goto out;
>                  }
>          }
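The difference between "prefers the local node" and "limited to the local node" that the patch description relies on can be modelled in a few lines of userspace C. This is a toy sketch, not the kernel allocator: the node count, the demo_node_state struct, and demo_alloc_huge() are all hypothetical, and the fallback order is simply ascending node index.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_NODES 2
#define DEMO_NO_NODE  (-1)

/* Hypothetical per-node count of free huge-page-sized blocks. */
struct demo_node_state {
	int free_huge[DEMO_NR_NODES];
};

/*
 * Returns the node the allocation was satisfied from, or DEMO_NO_NODE.
 * The caller must pass a valid node index. With this_node_only == false
 * the requested node is merely preferred and other nodes are tried in
 * index order; with this_node_only == true there is no fallback.
 */
int demo_alloc_huge(struct demo_node_state *s, int node, bool this_node_only)
{
	if (s->free_huge[node] > 0) {
		s->free_huge[node]--;
		return node;
	}
	if (this_node_only)
		return DEMO_NO_NODE;
	for (int n = 0; n < DEMO_NR_NODES; n++) {
		if (n != node && s->free_huge[n] > 0) {
			s->free_huge[n]--;
			return n;
		}
	}
	return DEMO_NO_NODE;
}
```

When node 0 is exhausted and node 1 still has a free block, the preferred-only mode silently returns remote memory, which is the behavior the patch description objects to. The this-node-only mode returns DEMO_NO_NODE instead, and per the patch description the caller would then fall back to small pages.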