Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-10 Thread David Hildenbrand
On 10.09.20 13:05, Vlastimil Babka wrote:
> On 9/10/20 12:29 PM, David Hildenbrand wrote:
>> On 09.09.20 13:55, Vlastimil Babka wrote:
>>> On 9/9/20 1:36 PM, Michal Hocko wrote:
 On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
> Here's a version that will apply on top of next-20200908. The first 4 
> patches need no change.
> For callers that pair start_isolate_page_range() with
> undo_isolated_page_range() properly, this is transparent. Currently that's
> alloc_contig_range(). __offline_pages() doesn't call
> undo_isolated_page_range() in the success case, so it has to be careful
> to handle restoring pcp->high and batch and unlocking
> pcp_batch_high_lock.

 I was hoping that it would be possible to have this completely hidden
 inside start_isolate_page_range code path.
>>>
>>> I hoped so too, but we can't know the moment when all processes that
>>> were in the critical part of freeing pages to pcplists have moved on
>>> (they might have been rescheduled).
>>> We could change free_unref_page() to disable IRQs sooner, before
>>> free_unref_page_prepare(), or at least the get_pfnblock_migratetype()
>>> part. Then after the single drain, we should be safe, AFAICS?
>>
>> At least moving it before getting the migratetype should not be that severe?
>>
>>> RT guys might not be happy though, but it's much simpler than this patch. I
>>> still like some of the cleanups in 1-4 though tbh :)
>>
>> It would certainly make this patch much simpler. Do you have a prototype
>> lying around?
> 
> Below is the critical part; while writing it I realized that there's also
> free_unref_page_list(), where it's more ugly.
> 
> So free_unref_page() simply moves the "determine migratetype" part under
> disabled interrupts. For free_unref_page_list() we optimistically determine
> them without disabling interrupts, and with disabled interrupts we check if
> zone has isolated pageblocks and thus we should not trust that value
> anymore. That's the same pattern as free_pcppages_bulk uses.
> 
> But unfortunately it means an extra page_zone() plus a test for each page
> on the list :/

That function is mostly called from mm/vmscan.c AFAIKS.

The first thing free_unref_page_commit() does is call page_zone().
That's easy to optimize if needed.

> 
> 8<
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index defefed79cfb..57e2a341c95c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3103,18 +3103,21 @@ void mark_free_pages(struct zone *zone)
>  }
>  #endif /* CONFIG_PM */
>  
> -static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
> +static bool free_unref_page_prepare(struct page *page)
>  {
> - int migratetype;
> -
>   if (!free_pcp_prepare(page))
>   return false;
>  
> - migratetype = get_pfnblock_migratetype(page, pfn);
> - set_pcppage_migratetype(page, migratetype);
>   return true;
>  }
>  
> +static inline void free_unref_page_set_migratetype(struct page *page, unsigned long pfn)
> +{
> + int migratetype = get_pfnblock_migratetype(page, pfn);
> +
> + set_pcppage_migratetype(page, migratetype);
> +}
> +
>  static void free_unref_page_commit(struct page *page, unsigned long pfn)
>  {
>   struct zone *zone = page_zone(page);
> @@ -3156,10 +3159,17 @@ void free_unref_page(struct page *page)
>   unsigned long flags;
>   unsigned long pfn = page_to_pfn(page);
>  
> - if (!free_unref_page_prepare(page, pfn))
> + if (!free_unref_page_prepare(page))
>   return;
>  
> + /*
> +  * By disabling interrupts before reading the pageblock's migratetype,
> +  * we can guarantee that after changing a pageblock to MIGRATE_ISOLATE
> +  * and drain_all_pages(), there's nobody who would read the old
> +  * migratetype and put a page from an isolated pageblock on the pcplists.
> +  */
>   local_irq_save(flags);
> + free_unref_page_set_migratetype(page, pfn);
>   free_unref_page_commit(page, pfn);
>   local_irq_restore(flags);
>  }
> @@ -3176,9 +3186,10 @@ void free_unref_page_list(struct list_head *list)
>   /* Prepare pages for freeing */
>   list_for_each_entry_safe(page, next, list, lru) {
>   pfn = page_to_pfn(page);
> - if (!free_unref_page_prepare(page, pfn))
> + if (!free_unref_page_prepare(page))
> list_del(&page->lru);
>   set_page_private(page, pfn);
> + free_unref_page_set_migratetype(page, pfn);
>   }
>  
>   local_irq_save(flags);
> @@ -3187,6 +3198,8 @@ void free_unref_page_list(struct list_head *list)
>  
>   set_page_private(page, 0);
>   trace_mm_page_free_batched(page);
> + if (has_isolate_pageblock(page_zone(page)))
> + free_unref_page_set_migratetype(page, pfn);
>   free_unref_page_commit(page, pfn);
>  

FWIW: You can safely add unlikely() for the 

Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-10 Thread Vlastimil Babka
On 9/10/20 12:29 PM, David Hildenbrand wrote:
> On 09.09.20 13:55, Vlastimil Babka wrote:
>> On 9/9/20 1:36 PM, Michal Hocko wrote:
>>> On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
 Here's a version that will apply on top of next-20200908. The first 4 
 patches need no change.
 For callers that pair start_isolate_page_range() with
 undo_isolated_page_range() properly, this is transparent. Currently that's
 alloc_contig_range(). __offline_pages() doesn't call
 undo_isolated_page_range() in the success case, so it has to be careful
 to handle restoring pcp->high and batch and unlocking pcp_batch_high_lock.
>>>
>>> I was hoping that it would be possible to have this completely hidden
>>> inside start_isolate_page_range code path.
>> 
>> I hoped so too, but we can't know the moment when all processes that were
>> in the critical part of freeing pages to pcplists have moved on (they
>> might have been rescheduled).
>> We could change free_unref_page() to disable IRQs sooner, before
>> free_unref_page_prepare(), or at least the get_pfnblock_migratetype()
>> part. Then after the single drain, we should be safe, AFAICS?
> 
> At least moving it before getting the migratetype should not be that severe?
> 
>> RT guys might not be happy though, but it's much simpler than this patch. I
>> still like some of the cleanups in 1-4 though tbh :)
> 
> It would certainly make this patch much simpler. Do you have a prototype
> lying around?

Below is the critical part; while writing it I realized that there's also
free_unref_page_list(), where it's more ugly.

So free_unref_page() simply moves the "determine migratetype" part under
disabled interrupts. For free_unref_page_list() we optimistically determine them
without disabling interrupts, and with disabled interrupts we check if zone has
isolated pageblocks and thus we should not trust that value anymore. That's the
same pattern as free_pcppages_bulk uses.

But unfortunately it means an extra page_zone() plus a test for each page on the
list :/

8<
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index defefed79cfb..57e2a341c95c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3103,18 +3103,21 @@ void mark_free_pages(struct zone *zone)
 }
 #endif /* CONFIG_PM */
 
-static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
+static bool free_unref_page_prepare(struct page *page)
 {
-   int migratetype;
-
if (!free_pcp_prepare(page))
return false;
 
-   migratetype = get_pfnblock_migratetype(page, pfn);
-   set_pcppage_migratetype(page, migratetype);
return true;
 }
 
+static inline void free_unref_page_set_migratetype(struct page *page, unsigned long pfn)
+{
+   int migratetype = get_pfnblock_migratetype(page, pfn);
+
+   set_pcppage_migratetype(page, migratetype);
+}
+
 static void free_unref_page_commit(struct page *page, unsigned long pfn)
 {
struct zone *zone = page_zone(page);
@@ -3156,10 +3159,17 @@ void free_unref_page(struct page *page)
unsigned long flags;
unsigned long pfn = page_to_pfn(page);
 
-   if (!free_unref_page_prepare(page, pfn))
+   if (!free_unref_page_prepare(page))
return;
 
+	/*
+	 * By disabling interrupts before reading the pageblock's migratetype,
+	 * we can guarantee that after changing a pageblock to MIGRATE_ISOLATE
+	 * and drain_all_pages(), there's nobody who would read the old
+	 * migratetype and put a page from an isolated pageblock on the pcplists.
+	 */
local_irq_save(flags);
+   free_unref_page_set_migratetype(page, pfn);
free_unref_page_commit(page, pfn);
local_irq_restore(flags);
 }
@@ -3176,9 +3186,10 @@ void free_unref_page_list(struct list_head *list)
/* Prepare pages for freeing */
list_for_each_entry_safe(page, next, list, lru) {
pfn = page_to_pfn(page);
-   if (!free_unref_page_prepare(page, pfn))
+   if (!free_unref_page_prepare(page))
list_del(&page->lru);
set_page_private(page, pfn);
+   free_unref_page_set_migratetype(page, pfn);
}
 
local_irq_save(flags);
@@ -3187,6 +3198,8 @@ void free_unref_page_list(struct list_head *list)
 
set_page_private(page, 0);
trace_mm_page_free_batched(page);
+   if (has_isolate_pageblock(page_zone(page)))
+   free_unref_page_set_migratetype(page, pfn);
free_unref_page_commit(page, pfn);
 
/*




Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-10 Thread David Hildenbrand
On 09.09.20 13:55, Vlastimil Babka wrote:
> On 9/9/20 1:36 PM, Michal Hocko wrote:
>> On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
>>> Here's a version that will apply on top of next-20200908. The first 4 
>>> patches need no change.
>>>
>>> 8<
>>> From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
>>> From: Vlastimil Babka 
>>> Date: Mon, 7 Sep 2020 17:20:39 +0200
>>> Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation
>>>
>>> Page isolation can race with process freeing pages to pcplists in a way that
>>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>>> per-cpu pages again during memory offline" in [1].
>>>
>>> David and Michal would prefer that this race was closed in a way that 
>>> callers
>>> of page isolation don't need to care about drain. David suggested disabling
>>> pcplists usage completely during page isolation, instead of repeatedly 
>>> draining
>>> them.
>>>
>>> To achieve this without adding special cases in alloc/free fastpath, we can 
>>> use
>>> the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist 
>>> addition
>>> will be immediately flushed.
>>>
>>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>>> once in start_isolate_page_range(). The draining will serialize after 
>>> processes
>>> that already disabled interrupts and read the old value of pcp->high in
>>> free_unref_page_commit(), and processes that have not yet disabled 
>>> interrupts,
>>> will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
>>> This guarantees no stray pages on pcplists in zones where isolation happens.
>>>
>>> We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
>>> to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
>>> and from 1 to 0 (to restore original pcp->high and batch values cached in
>>> struct zone). We have to avoid external updates to high and batch by taking
>>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this
>>> lock from mutex to rwsem.
>>>
>>> For callers that pair start_isolate_page_range() with
>>> undo_isolated_page_range() properly, this is transparent. Currently that's
>>> alloc_contig_range(). __offline_pages() doesn't call
>>> undo_isolated_page_range() in the success case, so it has to be careful
>>> to handle restoring pcp->high and batch and unlocking pcp_batch_high_lock.
>>
>> I was hoping that it would be possible to have this completely hidden
>> inside start_isolate_page_range code path.
> 
> I hoped so too, but we can't know the moment when all processes that were
> in the critical part of freeing pages to pcplists have moved on (they
> might have been rescheduled).
> We could change free_unref_page() to disable IRQs sooner, before
> free_unref_page_prepare(), or at least the get_pfnblock_migratetype()
> part. Then after the single drain, we should be safe, AFAICS?

At least moving it before getting the migratetype should not be that severe?

> RT guys might not be happy though, but it's much simpler than this patch. I
> still like some of the cleanups in 1-4 though tbh :)

It would certainly make this patch much simpler. Do you have a prototype
lying around?

-- 
Thanks,

David / dhildenb



Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread David Hildenbrand
On 09.09.20 13:44, David Hildenbrand wrote:
> On 09.09.20 13:36, Michal Hocko wrote:
>> On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
>>> Here's a version that will apply on top of next-20200908. The first 4 
>>> patches need no change.
>>>
>>> 8<
>>> From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
>>> From: Vlastimil Babka 
>>> Date: Mon, 7 Sep 2020 17:20:39 +0200
>>> Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation
>>>
>>> Page isolation can race with process freeing pages to pcplists in a way that
>>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>>> per-cpu pages again during memory offline" in [1].
>>>
>>> David and Michal would prefer that this race was closed in a way that 
>>> callers
>>> of page isolation don't need to care about drain. David suggested disabling
>>> pcplists usage completely during page isolation, instead of repeatedly 
>>> draining
>>> them.
>>>
>>> To achieve this without adding special cases in alloc/free fastpath, we can 
>>> use
>>> the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist 
>>> addition
>>> will be immediately flushed.
>>>
>>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>>> once in start_isolate_page_range(). The draining will serialize after 
>>> processes
>>> that already disabled interrupts and read the old value of pcp->high in
>>> free_unref_page_commit(), and processes that have not yet disabled 
>>> interrupts,
>>> will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
>>> This guarantees no stray pages on pcplists in zones where isolation happens.
>>>
>>> We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
>>> to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
>>> and from 1 to 0 (to restore original pcp->high and batch values cached in
>>> struct zone). We have to avoid external updates to high and batch by taking
>>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this
>>> lock from mutex to rwsem.
>>>
>>> For callers that pair start_isolate_page_range() with
>>> undo_isolated_page_range() properly, this is transparent. Currently that's
>>> alloc_contig_range(). __offline_pages() doesn't call
>>> undo_isolated_page_range() in the success case, so it has to be careful
>>> to handle restoring pcp->high and batch and unlocking pcp_batch_high_lock.
>>
>> I was hoping that it would be possible to have this completely hidden
>> inside start_isolate_page_range code path. If we need some sort of
>> disable_pcp_free/enable_pcp_free then it seems like a better fit to have
>> an explicit API for that (the naming would be obviously different
>> because we do not want to call out pcp free lists). I strongly suspect
>> that only the memory hotplug really cares for this hard guarantee.
>> alloc_contig_range simply goes with EBUSY.
> 
> There will be different alloc_contig_range() demands in the future: try
> fast (e.g., loads of small CMA allocations) vs. try hard (e.g.,
> virtio-mem). We can add ways to specify that.
> 

A reference to a related discussion regarding the "try fast" use case in
CMA for the future:
https://lkml.kernel.org/r/20200818164934.gf3852...@google.com

-- 
Thanks,

David / dhildenb



Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread Vlastimil Babka
On 9/9/20 1:36 PM, Michal Hocko wrote:
> On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
>> Here's a version that will apply on top of next-20200908. The first 4 
>> patches need no change.
>> 
>> 8<
>> From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
>> From: Vlastimil Babka 
>> Date: Mon, 7 Sep 2020 17:20:39 +0200
>> Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation
>> 
>> Page isolation can race with process freeing pages to pcplists in a way that
>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>> per-cpu pages again during memory offline" in [1].
>> 
>> David and Michal would prefer that this race was closed in a way that callers
>> of page isolation don't need to care about drain. David suggested disabling
>> pcplists usage completely during page isolation, instead of repeatedly 
>> draining
>> them.
>> 
>> To achieve this without adding special cases in alloc/free fastpath, we can 
>> use
>> the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist addition
>> will be immediately flushed.
>> 
>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>> once in start_isolate_page_range(). The draining will serialize after 
>> processes
>> that already disabled interrupts and read the old value of pcp->high in
>> free_unref_page_commit(), and processes that have not yet disabled 
>> interrupts,
>> will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
>> This guarantees no stray pages on pcplists in zones where isolation happens.
>> 
>> We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
>> to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
>> and from 1 to 0 (to restore original pcp->high and batch values cached in
>> struct zone). We have to avoid external updates to high and batch by taking
>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this
>> lock from mutex to rwsem.
>> 
>> For callers that pair start_isolate_page_range() with
>> undo_isolated_page_range() properly, this is transparent. Currently that's
>> alloc_contig_range(). __offline_pages() doesn't call undo_isolated_page_range()
>> in the success case, so it has to be careful to handle restoring pcp->high
>> and batch and unlocking pcp_batch_high_lock.
> 
> I was hoping that it would be possible to have this completely hidden
> inside start_isolate_page_range code path.

I hoped so too, but we can't know the moment when all processes that were in the
critical part of freeing pages to pcplists have moved on (they might have been
rescheduled).
We could change free_unref_page() to disable IRQs sooner, before
free_unref_page_prepare(), or at least the get_pfnblock_migratetype() part. Then
after the single drain, we should be safe, AFAICS?
RT guys might not be happy though, but it's much simpler than this patch. I
still like some of the cleanups in 1-4 though tbh :)

> If we need some sort of
> disable_pcp_free/enable_pcp_free then it seems like a better fit to have
> an explicit API for that (the naming would be obviously different
> because we do not want to call out pcp free lists). I strongly suspect
> that only the memory hotplug really cares for this hard guarantee.
> alloc_contig_range simply goes with EBUSY.
>  
>> This commit also changes drain_all_pages() to not trust reading pcp->count
>> during drain for page isolation - I believe that could be racy and lead to
>> some CPUs being missed during the drain. If others agree, this can be
>> separated and potentially backported.
>> 
>> [1] 
>> https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatas...@soleen.com/
>> 
>> Suggested-by: David Hildenbrand 
>> Suggested-by: Michal Hocko 
>> Signed-off-by: Vlastimil Babka 
>> ---
>>  include/linux/gfp.h |  1 +
>>  mm/internal.h   |  4 +++
>>  mm/memory_hotplug.c | 55 -
>>  mm/page_alloc.c | 60 +
>>  mm/page_isolation.c | 45 --
>>  5 files changed, 119 insertions(+), 46 deletions(-)
> 
> This has turned out much larger than I would expect.
> 



Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread Pavel Tatashin
> because we do not want to call out pcp free lists). I strongly suspect
> that only the memory hotplug really cares for this hard guarantee.
> alloc_contig_range simply goes with EBUSY.
>

You are right, the strong requirement is for hotplug case only, but if
PCP disable/enable functionality is properly implemented, and we are
still draining the lists in alloc_contig_range case, I do not see a
reason not to disable/enable PCP during CMA as well, just to reduce the
number of spurious errors.

Thanks,
Pasha


Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread David Hildenbrand
On 09.09.20 13:36, Michal Hocko wrote:
> On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
>> Here's a version that will apply on top of next-20200908. The first 4 
>> patches need no change.
>>
>> 8<
>> From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
>> From: Vlastimil Babka 
>> Date: Mon, 7 Sep 2020 17:20:39 +0200
>> Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation
>>
>> Page isolation can race with process freeing pages to pcplists in a way that
>> a page from isolated pageblock can end up on pcplist. This can be fixed by
>> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
>> per-cpu pages again during memory offline" in [1].
>>
>> David and Michal would prefer that this race was closed in a way that callers
>> of page isolation don't need to care about drain. David suggested disabling
>> pcplists usage completely during page isolation, instead of repeatedly 
>> draining
>> them.
>>
>> To achieve this without adding special cases in alloc/free fastpath, we can 
>> use
>> the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist addition
>> will be immediately flushed.
>>
>> The race can thus be closed by setting pcp->high to 0 and draining pcplists
>> once in start_isolate_page_range(). The draining will serialize after 
>> processes
>> that already disabled interrupts and read the old value of pcp->high in
>> free_unref_page_commit(), and processes that have not yet disabled 
>> interrupts,
>> will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
>> This guarantees no stray pages on pcplists in zones where isolation happens.
>>
>> We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
>> to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
>> and from 1 to 0 (to restore original pcp->high and batch values cached in
>> struct zone). We have to avoid external updates to high and batch by taking
>> pcp_batch_high_lock. To allow multiple isolations in parallel, change this
>> lock from mutex to rwsem.
>>
>> For callers that pair start_isolate_page_range() with
>> undo_isolated_page_range() properly, this is transparent. Currently that's
>> alloc_contig_range(). __offline_pages() doesn't call undo_isolated_page_range()
>> in the success case, so it has to be careful to handle restoring pcp->high
>> and batch and unlocking pcp_batch_high_lock.
> 
> I was hoping that it would be possible to have this completely hidden
> inside start_isolate_page_range code path. If we need some sort of
> disable_pcp_free/enable_pcp_free then it seems like a better fit to have
> an explicit API for that (the naming would be obviously different
> because we do not want to call out pcp free lists). I strongly suspect
>> that only the memory hotplug really cares for this hard guarantee.
> alloc_contig_range simply goes with EBUSY.

There will be different alloc_contig_range() demands in the future: try
fast (e.g., loads of small CMA allocations) vs. try hard (e.g.,
virtio-mem). We can add ways to specify that.

-- 
Thanks,

David / dhildenb



Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread Michal Hocko
On Wed 09-09-20 12:48:54, Vlastimil Babka wrote:
> Here's a version that will apply on top of next-20200908. The first 4 patches 
> need no change.
> 
> 8<
> From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka 
> Date: Mon, 7 Sep 2020 17:20:39 +0200
> Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation
> 
> Page isolation can race with process freeing pages to pcplists in a way that
> a page from isolated pageblock can end up on pcplist. This can be fixed by
> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> per-cpu pages again during memory offline" in [1].
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation don't need to care about drain. David suggested disabling
> pcplists usage completely during page isolation, instead of repeatedly 
> draining
> them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can 
> use
> the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once in start_isolate_page_range(). The draining will serialize after 
> processes
> that already disabled interrupts and read the old value of pcp->high in
> free_unref_page_commit(), and processes that have not yet disabled interrupts,
> will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
> This guarantees no stray pages on pcplists in zones where isolation happens.
> 
> We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
> to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
> and from 1 to 0 (to restore original pcp->high and batch values cached in
> struct zone). We have to avoid external updates to high and batch by taking
> pcp_batch_high_lock. To allow multiple isolations in parallel, change this
> lock from mutex to rwsem.
> 
> For callers that pair start_isolate_page_range() with
> undo_isolated_page_range() properly, this is transparent. Currently that's
> alloc_contig_range(). __offline_pages() doesn't call undo_isolated_page_range()
> in the success case, so it has to be careful to handle restoring pcp->high
> and batch and unlocking pcp_batch_high_lock.

I was hoping that it would be possible to have this completely hidden
inside start_isolate_page_range code path. If we need some sort of
disable_pcp_free/enable_pcp_free then it seems like a better fit to have
an explicit API for that (the naming would be obviously different
because we do not want to call out pcp free lists). I strongly suspect
that only the memory hotplug really cares for this hard guarantee.
alloc_contig_range simply goes with EBUSY.
 
> This commit also changes drain_all_pages() to not trust reading pcp->count
> during drain for page isolation - I believe that could be racy and lead to
> some CPUs being missed during the drain. If others agree, this can be
> separated and potentially backported.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatas...@soleen.com/
> 
> Suggested-by: David Hildenbrand 
> Suggested-by: Michal Hocko 
> Signed-off-by: Vlastimil Babka 
> ---
>  include/linux/gfp.h |  1 +
>  mm/internal.h   |  4 +++
>  mm/memory_hotplug.c | 55 -
>  mm/page_alloc.c | 60 +
>  mm/page_isolation.c | 45 --
>  5 files changed, 119 insertions(+), 46 deletions(-)

This has turned out much larger than I would expect.
-- 
Michal Hocko
SUSE Labs


Re: [RFC 5/5] mm, page_alloc: disable pcplists during page isolation

2020-09-09 Thread Vlastimil Babka
Here's a version that will apply on top of next-20200908. The first 4 patches 
need no change.

8<
From 8febc17272b8e8b378e2e5ea5e76b2616f029c5b Mon Sep 17 00:00:00 2001
From: Vlastimil Babka 
Date: Mon, 7 Sep 2020 17:20:39 +0200
Subject: [PATCH] mm, page_alloc: disable pcplists during page isolation

Page isolation can race with process freeing pages to pcplists in a way that
a page from isolated pageblock can end up on pcplist. This can be fixed by
repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
per-cpu pages again during memory offline" in [1].

David and Michal would prefer that this race was closed in a way that callers
of page isolation don't need to care about drain. David suggested disabling
pcplists usage completely during page isolation, instead of repeatedly draining
them.

To achieve this without adding special cases in alloc/free fastpath, we can use
the same 'trick' as boot pagesets - when pcp->high is 0, any pcplist addition
will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once in start_isolate_page_range(). The draining will serialize after processes
that already disabled interrupts and read the old value of pcp->high in
free_unref_page_commit(), and processes that have not yet disabled interrupts,
will observe pcp->high == 0 when they are rescheduled, and skip pcplists.
This guarantees no stray pages on pcplists in zones where isolation happens.

We can use the variable zone->nr_isolate_pageblock (protected by zone->lock)
to detect transitions from 0 to 1 (to change pcp->high to 0 and issue drain)
and from 1 to 0 (to restore original pcp->high and batch values cached in
struct zone). We have to avoid external updates to high and batch by taking
pcp_batch_high_lock. To allow multiple isolations in parallel, change this
lock from mutex to rwsem.

For callers that pair start_isolate_page_range() with
undo_isolated_page_range() properly, this is transparent. Currently that's
alloc_contig_range(). __offline_pages() doesn't call undo_isolated_page_range()
in the success case, so it has to be careful to handle restoring pcp->high
and batch and unlocking pcp_batch_high_lock.

This commit also changes drain_all_pages() to not trust reading pcp->count
during drain for page isolation - I believe that could be racy and lead to
some CPUs being missed during the drain. If others agree, this can be
separated and potentially backported.

[1] 
https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatas...@soleen.com/

Suggested-by: David Hildenbrand 
Suggested-by: Michal Hocko 
Signed-off-by: Vlastimil Babka 
---
 include/linux/gfp.h |  1 +
 mm/internal.h   |  4 +++
 mm/memory_hotplug.c | 55 -
 mm/page_alloc.c | 60 +
 mm/page_isolation.c | 45 --
 5 files changed, 119 insertions(+), 46 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..cc52c5cc9fab 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -592,6 +592,7 @@ extern void page_frag_free(void *addr);
 
 void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
+void __drain_all_pages(struct zone *zone, bool page_isolation);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
diff --git a/mm/internal.h b/mm/internal.h
index ab4beb7c5cd2..149822747db7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -7,6 +7,7 @@
 #ifndef __MM_INTERNAL_H
 #define __MM_INTERNAL_H
 
+#include 
 #include 
 #include 
 #include 
@@ -196,8 +197,11 @@ extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
 extern int user_min_free_kbytes;
 
+extern struct rw_semaphore pcp_batch_high_lock;
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_update_pageset_high_and_batch(struct zone *zone,
+   unsigned long high, unsigned long batch);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index baded53b9ff9..c433565a782c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -803,6 +803,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
int need_zonelists_rebuild = 0;
int ret;
struct memory_notify arg;
+   bool first_isolated_pageblock = false;
 
/* We can only online full sections (e.g., SECTION_IS_ONLINE) */
if (WARN_ON_ONCE(!nr_pages ||
@@ -826,9 +827,13 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 
/*
 * Fixup the number of isolated pageblocks before marking the sections
-* onlining, such that undo_isolate_page_range() works correctly.
+* onlining, such that undo_isolate_page_range() works correctly. We
+*