Re: [RFC 2/3] mm, compaction: introduce kcompactd

Vlastimil Babka Wed, 27 Jan 2016 04:59:52 -0800

On 01/27/2016 10:10 AM, Mel Gorman wrote:
> On Tue, Jan 26, 2016 at 04:36:14PM +0100, Vlastimil Babka wrote:
>> Memory compaction can be currently performed in several contexts:
>> 
>> - kswapd balancing a zone after a high-order allocation failure
>> - direct compaction to satisfy a high-order allocation, including THP page
>>   fault attemps
>> - khugepaged trying to collapse a hugepage
>> - manually from /proc
>> 
>> The purpose of compaction is two-fold. The obvious purpose is to satisfy a
>> (pending or future) high-order allocation, and is easy to evaluate. The other
>> purpose is to keep overal memory fragmentation low and help the
>> anti-fragmentation mechanism. The success wrt the latter purpose is more
>> difficult to evaluate though.
>> 
>> The current situation wrt the purposes has a few drawbacks:
>> 
>> - compaction is invoked only when a high-order page or hugepage is not
>>   available (or manually). This might be too late for the purposes of keeping
>>   memory fragmentation low.
>> - direct compaction increases latency of allocations. Again, it would be
>>   better if compaction was performed asynchronously to keep fragmentation 
>> low,
>>   before the allocation itself comes.
>> - (a special case of the previous) the cost of compaction during THP page
>>   faults can easily offset the benefits of THP.
>> - kswapd compaction appears to be complex, fragile and not working in some
>>   scenarios
>> 
> 
> An addendum to that is that kswapd can be compacting for a high-order
> allocation request when it should be reclaiming memory for an order-0
> request.


Right, thanks.

> My recollection is that kswapd compacting was meant to help atomic
> high-order allocations but I wonder if the same problem even exists with
> the revised watermark handling.

Well, certainly nobody noticed kswapd compaction being dysfunctional.

> 
>> - the target order used for kswapd is passed to kcompactd
>> 
>> The kswapd compact/reclaim loop for high-order pages will be removed in the
>> next patch with the description of what's wrong with it.
>> 
>> In this patch, kcompactd uses the standard compaction_suitable() and
>> compact_finished() criteria, which means it will most likely have nothing 
>> left
>> to do after kswapd finishes, until the next patch. Kcompactd also mimics
>> direct compaction somewhat by trying async compaction first and sync 
>> compaction
>> afterwards, and uses the deferred compaction functionality.
>> 
> 
> Why should it try async compaction first? The deferred compaction makes
> sense as kcompact will need some sort of limitation on the amount of
> CPU it can use.

I was just being conservative, but good point. Unlike direct compaction, latency
doesn't bother kcompactd.

> 
>> @@ -1759,4 +1763,227 @@ void compaction_unregister_node(struct node *node)
>>  }
>>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>>  
>> +static bool kcompactd_work_requested(pg_data_t *pgdat)
>> +{
>> +    return pgdat->kcompactd_max_order > 0;
>> +}
>> +
> 
> inline
> 
>> +static bool kcompactd_node_suitable(pg_data_t *pgdat)
>> +{
>> +    int zoneid;
>> +    struct zone *zone;
>> +
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>> +            zone = &pgdat->node_zones[zoneid];
>> +
>> +            if (!populated_zone(zone))
>> +                    continue;
>> +
>> +            if (compaction_suitable(zone, pgdat->kcompactd_max_order, 0,
>> +                                    pgdat->kcompactd_classzone_idx)
>> +                                                    == COMPACT_CONTINUE)
>> +                    return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
> 
> Why does this traverse all zones and not just the ones within the
> classzone_idx?

Hmm, guess I didn't revisit it after previous submission where kswapd compaction
wasn't being replaced by kcompactd. But kswapd also tries to balance higher
zones than those given by classzone_idx, if they needed. I'll rethink this.

>> +static void kcompactd_do_work(pg_data_t *pgdat)
>> +{
>> +    /*
>> +     * With no special task, compact all zones so that a page of requested
>> +     * order is allocatable.
>> +     */
>> +    int zoneid;
>> +    struct zone *zone;
>> +    struct compact_control cc = {
>> +            .order = pgdat->kcompactd_max_order,
>> +            .classzone_idx = pgdat->kcompactd_classzone_idx,
>> +            .mode = MIGRATE_ASYNC,
>> +            .ignore_skip_hint = true,
>> +
>> +    };
>> +    bool success = false;
>> +
>> +    trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
>> +                                                    cc.classzone_idx);
>> +    count_vm_event(KCOMPACTD_WAKE);
>> +
>> +retry:
>> +    for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> 
> Again, why is classzone_idx not taken into account?

Might be worth to just do everything once we've woken up, like kswapd.
Deferred+suitable checks should prevent wasted attempts in either case.

[...]

>> +
>> +    if (!success && cc.mode == MIGRATE_ASYNC) {
>> +            cc.mode = MIGRATE_SYNC_LIGHT;
>> +            goto retry;
>> +    }
>> +
> 
> Still not getting why kcompactd should concern itself with async
> compaction. It's really direct compaction that cared and was trying to
> avoid stalls.

Right

>> +     * Regardless of success, we are done until woken up next. But remember
>> +     * the requested order/classzone_idx in case it was higher/tighter than
>> +     * our current ones
>> +     */
>> +    if (pgdat->kcompactd_max_order <= cc.order)
>> +            pgdat->kcompactd_max_order = 0;
>> +    if (pgdat->classzone_idx >= cc.classzone_idx)
>> +            pgdat->classzone_idx = pgdat->nr_zones - 1;
>> +}
>> +
>>
>> <SNIP>
>>
>> @@ -1042,7 +1043,7 @@ int __ref online_pages(unsigned long pfn, unsigned 
>> long nr_pages, int online_typ
>>      arg.nr_pages = nr_pages;
>>      node_states_check_changes_online(nr_pages, zone, &arg);
>>  
>> -    nid = pfn_to_nid(pfn);
>> +    nid = zone_to_nid(zone);
>>  
>>      ret = memory_notify(MEM_GOING_ONLINE, &arg);
>>      ret = notifier_to_errno(ret);
>> @@ -1082,7 +1083,7 @@ int __ref online_pages(unsigned long pfn, unsigned 
>> long nr_pages, int online_typ
>>      pgdat_resize_unlock(zone->zone_pgdat, &flags);
>>  
>>      if (onlined_pages) {
>> -            node_states_set_node(zone_to_nid(zone), &arg);
>> +            node_states_set_node(nid, &arg);
>>              if (need_zonelists_rebuild)
>>                      build_all_zonelists(NULL, NULL);
>>              else
> 
> Why are these two hunks necessary?

Just a drive-by cleanup/optimization that didn't seem worth separate patch. But
I probably should?

> 
>> @@ -1093,8 +1094,10 @@ int __ref online_pages(unsigned long pfn, unsigned 
>> long nr_pages, int online_typ
>>  
>>      init_per_zone_wmark_min();
>>  
>> -    if (onlined_pages)
>> -            kswapd_run(zone_to_nid(zone));
>> +    if (onlined_pages) {
>> +            kswapd_run(nid);
>> +            kcompactd_run(nid);
>> +    }
>>  
>>      vm_total_pages = nr_free_pagecache_pages();
>>  
>> @@ -1858,8 +1861,10 @@ static int __ref __offline_pages(unsigned long 
>> start_pfn,
>>              zone_pcp_update(zone);
>>  
>>      node_states_clear_node(node, &arg);
>> -    if (arg.status_change_nid >= 0)
>> +    if (arg.status_change_nid >= 0) {
>>              kswapd_stop(node);
>> +            kcompactd_stop(node);
>> +    }
>>  
>>      vm_total_pages = nr_free_pagecache_pages();
>>      writeback_set_ratelimit();
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 63358d9f9aa9..7747eb36e789 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5212,6 +5212,9 @@ static void __paginginit free_area_init_core(struct 
>> pglist_data *pgdat)
>>  #endif
>>      init_waitqueue_head(&pgdat->kswapd_wait);
>>      init_waitqueue_head(&pgdat->pfmemalloc_wait);
>> +#ifdef CONFIG_COMPACTION
>> +    init_waitqueue_head(&pgdat->kcompactd_wait);
>> +#endif
>>      pgdat_page_ext_init(pgdat);
>>  
>>      for (j = 0; j < MAX_NR_ZONES; j++) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 72d52d3aef74..1449e21c55cc 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3408,6 +3408,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
>> order, int classzone_idx)
>>               */
>>              reset_isolation_suitable(pgdat);
>>  
>> +            /*
>> +             * We have freed the memory, now we should compact it to make
>> +             * allocation of the requested order possible.
>> +             */
>> +            wakeup_kcompactd(pgdat, order, classzone_idx);
>> +
>>              if (!kthread_should_stop())
>>                      schedule();
>>  
> 
> This initially confused me but it's due to patch ordering. It's silly
> but when this patch is applied then both kswapd and kcompactd are
> compacting memory. I would prefer if the patches were in reverse order
> but that is purely taste.

In reverse order there would be a case where neither is compacting. Guess I'll
just move the wakeup to the next patch. The separation is mainly for making
review more tractable.

> While this was not a comprehensive review, I think the patch is ok in
> principal. While deferred compaction will keep the CPU usage under control,
> the main concern is that kcompactd consumes too much CPU but I do not
> see a case where that would trigger that kswapd would not have
> encountered already.

Thanks! On the opposite, kswapd didn't consider deferred compaction, so it could
consume too much CPU if it wasn't otherwise broken.

Re: [RFC 2/3] mm, compaction: introduce kcompactd

Reply via email to