Re: compaction is still too expensive for thp

2014-05-22 Thread Vlastimil Babka

On 05/22/2014 10:55 AM, David Rientjes wrote:

On Thu, 22 May 2014, Vlastimil Babka wrote:


With -mm, it turns out that while egregious thp fault latencies were
reduced, faulting 64MB of memory backed by thp on a fragmented 128GB
machine can result in latencies of 1-3s for the entire 64MB.  Collecting
compaction stats from older kernels that give more insight into
regressions, one such incident is as follows.

Baseline:
compact_blocks_moved 8181986
compact_pages_moved 6549560
compact_pagemigrate_failed 1612070
compact_stall 101959
compact_fail 100895
compact_success 1064

5s later:
compact_blocks_moved 8182447
compact_pages_moved 6550286
compact_pagemigrate_failed 1612092
compact_stall 102023
compact_fail 100959
compact_success 1064

This represents faulting two 64MB ranges of anonymous memory.  As you can
see, it results in falling back to 4KB pages because all 64 hugepage faults
end up triggering compaction and failing to allocate.  Over the
64 async compactions, we scan on average 7.2 pageblocks per call,
successfully migrate 11.3 pages per call, and fail migrating 0.34 pages
per call.

If each async compaction scans 7.2 pageblocks per call, it would have to
be called 9103 times to scan all memory on this 128GB machine.  We're
simply not scanning enough memory as a result of ISOLATE_ABORT due to
need_resched().
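
For reference, the per-call figures above fall straight out of the deltas
between the two snapshots. A standalone sketch of the arithmetic (counter
values hard-coded from the quote above; this is not kernel code):

  /* Reproduce the per-call averages from the two compaction counter
   * snapshots quoted above. */
  #include <stdio.h>

  int main(void)
  {
          /* deltas between the "5s later" and "Baseline" snapshots */
          long blocks_moved = 8182447 - 8181986;  /* 461 pageblocks scanned */
          long pages_moved  = 6550286 - 6549560;  /* 726 pages migrated     */
          long migrate_fail = 1612092 - 1612070;  /*  22 failed migrations  */
          long stalls       = 102023  - 101959;   /*  64 compaction stalls  */

          printf("pageblocks scanned per call: %.1f\n",
                 (double)blocks_moved / stalls);
          printf("pages migrated per call:     %.1f\n",
                 (double)pages_moved / stalls);
          printf("failed migrations per call:  %.2f\n",
                 (double)migrate_fail / stalls);
          /* 128GB of 2MB pageblocks = 65536; at ~7.2 per call that is
           * roughly the 9103 calls cited above */
          printf("calls to cover 65536 pageblocks: ~%.0f\n", 65536 / 7.2);
          return 0;
  }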


Well, the two objectives of not being expensive and at the same time scanning
"enough memory" (which is hard to define as well) are clearly quite opposite
:/



Agreed.


So the net result is that -mm is much better than Linus's tree, where such
faulting of 64MB ranges could stall 8-9s, but we're still very expensive.


So I guess the difference here is mainly thanks to not doing sync compaction?


Not doing sync compaction for thp and caching the migration pfn for async
so that it doesn't iterate over a ton of memory that may not be eligible
for async compaction every time it is called.  But when we avoid sync
compaction, we also lose deferred compaction.
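
A toy illustration of the migration pfn caching mentioned here (this is not
the mm/compaction.c code, and the per-attempt scan budget is an arbitrary
number picked for the demo): each async attempt resumes the migration
scanner where the previous one left off instead of restarting from the
beginning of the zone.

  #include <stdio.h>

  #define NR_PAGEBLOCKS   65536UL /* 128GB of 2MB pageblocks */
  #define SCAN_BUDGET     4096UL  /* arbitrary per-attempt budget */

  static unsigned long cached_migrate_pfn;        /* persists across attempts */

  /* model of one async attempt: scan from the cached position, stop when
   * the budget is spent, and leave the position for the next attempt */
  static unsigned long compact_attempt(void)
  {
          unsigned long scanned = 0;

          while (scanned < SCAN_BUDGET && cached_migrate_pfn < NR_PAGEBLOCKS) {
                  /* per-pageblock scanning work would happen here */
                  cached_migrate_pfn++;
                  scanned++;
          }
          return scanned;
  }

  int main(void)
  {
          for (int i = 0; i < 3; i++) {
                  unsigned long scanned = compact_attempt();

                  printf("attempt %d scanned %lu, scanner now at pageblock %lu\n",
                         i, scanned, cached_migrate_pfn);
          }
          return 0;
  }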


So if I understand correctly, your intention is to scan more in a single scan,


More will be scanned than the ~7 pageblocks per call to async compaction
in the data that I presented, but the idea is also to reduce how expensive
every pageblock scan is by avoiding needlessly migrating memory (and
dealing with rmap locks) when it will not result in 2MB of contiguous
memory for thp faults.


but balance the increased latencies by introducing deferring for async
compaction.

Offhand I can think of two issues with that.

1) the result will be that often the latency will be low thanks to defer, but
then there will be a huge (?) spike by scanning whole 1GB (as you suggest
further in the mail) at once. I think that's similar to what you had now with
the sync compaction?



Not at all: with MIGRATE_SYNC_LIGHT before, there was no termination other
than an entire scan of memory, so we were potentially scanning 128GB and
failing if the thp cannot be allocated.


OK.


If we are to avoid migrating memory needlessly that will not result in
cc->order memory being allocated, then the cost should be relatively
constant for a span of memory.  My 32GB system can iterate all memory with
MIGRATE_ASYNC and no need_resched() aborts in ~530ms.
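
As a rough cross-check of that number (my arithmetic, not from the thread,
and it assumes the per-pageblock cost stays uniform): 32GB is 16384
pageblocks of 2MB, so ~530ms works out to roughly 32us per pageblock, and a
naive linear extrapolation to the 128GB machine is about 2s per full pass.

  #include <stdio.h>

  int main(void)
  {
          double ms_total = 530.0;                  /* "~530ms" for 32GB */
          long pageblocks_32gb  = 32L  * 1024 / 2;  /* 2MB pageblocks */
          long pageblocks_128gb = 128L * 1024 / 2;
          double us_per_pageblock = ms_total * 1000.0 / pageblocks_32gb;

          printf("cost per pageblock: ~%.0f us\n", us_per_pageblock);
          printf("full 128GB pass:    ~%.1f s\n",
                 us_per_pageblock * pageblocks_128gb / 1e6);
          return 0;
  }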


OK


2) 1GB could have a good chance of being successful (?) so there would be no
defer anyway.



If we terminate early because order-9 is allocatable or we end up scanning
the entire 1GB and the hugepage is allocated, then we have prevented 511
other pagefaults in my testcase where faulting 64MB of memory with thp
enabled can currently take 1-3s on a 128GB machine with fragmentation.  I
think the deferral is unnecessary in such a case.

Are you suggesting we should try without the deferral first?


Might be an option, or less aggressive back off than the sync deferral, 
as it's limited to 1GB.
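
For reference, the sync deferral being discussed works roughly as follows.
This is a simplified userspace model of the defer_compaction() /
compaction_deferred() logic in mm/compaction.c (the real code keys off
per-zone counters and the failed order), with COMPACT_MAX_DEFER_SHIFT at
its kernel value of 6:

  #include <stdio.h>
  #include <stdbool.h>

  #define COMPACT_MAX_DEFER_SHIFT 6       /* skip at most 1 << 6 attempts */

  struct zone_model {
          unsigned int considered;
          unsigned int defer_shift;
  };

  /* called after a failed compaction: back off harder, up to the cap */
  static void defer_compaction(struct zone_model *z)
  {
          z->considered = 0;
          if (z->defer_shift < COMPACT_MAX_DEFER_SHIFT)
                  z->defer_shift++;
  }

  /* true while the next (1 << defer_shift) attempts should be skipped */
  static bool compaction_deferred(struct zone_model *z)
  {
          unsigned int limit = 1U << z->defer_shift;

          if (++z->considered > limit)
                  z->considered = limit;
          return z->considered < limit;
  }

  int main(void)
  {
          struct zone_model z = { 0, 0 };
          int attempted = 0;

          /* pretend every attempted compaction fails, as in the test case */
          for (int fault = 0; fault < 64; fault++) {
                  if (compaction_deferred(&z))
                          continue;       /* fall back to 4KB pages cheaply */
                  attempted++;
                  defer_compaction(&z);
          }
          printf("compactions attempted out of 64 faults: %d\n", attempted);
          return 0;
  }

Run as is this attempts only 6 of the 64 compactions; a less aggressive
back off would amount to capping defer_shift below COMPACT_MAX_DEFER_SHIFT.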



I have a few improvements in mind, but thought it would be better to
get feedback on it first because it's a substantial rewrite of the
pageblock migration:

   - For all async compaction, avoid migrating memory unless enough
 contiguous memory is migrated to allow a cc->order allocation.


Yes I suggested something like this earlier. Also in the scanner, skip to the
next cc->order aligned block as soon as any page fails the isolation and is
not PageBuddy.
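
A toy model of that skip heuristic (page state is faked with an array; this
is not the kernel scanner, and it only shows the skip, not giving back pages
already isolated from the abandoned chunk): as soon as a page in a
cc->order aligned chunk is neither movable nor free, the rest of that chunk
cannot become a free order-9 block, so the scanner jumps to the next chunk.

  #include <stdio.h>

  enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_PINNED };

  #define ORDER           9               /* THP: order-9, 512 pages */
  #define CHUNK           (1UL << ORDER)
  #define NR_PAGES        (4 * CHUNK)

  static enum page_state page[NR_PAGES];

  int main(void)
  {
          unsigned long pfn = 0, scanned = 0, isolated = 0;

          for (unsigned long i = 0; i < NR_PAGES; i++)
                  page[i] = PAGE_MOVABLE;
          page[CHUNK + 7] = PAGE_PINNED;  /* poison the second chunk */

          while (pfn < NR_PAGES) {
                  scanned++;
                  if (page[pfn] == PAGE_PINNED) {
                          /* give up on this chunk, jump to the next one */
                          pfn = (pfn + CHUNK) & ~(CHUNK - 1);
                          continue;
                  }
                  if (page[pfn] == PAGE_MOVABLE)
                          isolated++;     /* would go to cc->migratepages */
                  pfn++;
          }
          printf("pages scanned: %lu, isolated for migration: %lu\n",
                 scanned, isolated);
          return 0;
  }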


Agreed.


I would just distinguish kswapd and direct compaction, not "all async
compaction". Or maybe kswapd could be switched to sync compaction.



To generalize this, I'm thinking that it is pointless for async compaction
to migrate memory in a contiguous span if it will not cause a cc->order
page allocation to succeed.
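
A toy two-pass version of that idea (again with faked page state, not the
kernel code), under the assumption that "will not cause a cc->order page
allocation to succeed" can be decided by checking every page of the aligned
chunk up front: only chunks that could actually become free get the
expensive migration treatment.

  #include <stdio.h>
  #include <stdbool.h>

  enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_PINNED };

  #define ORDER   9
  #define CHUNK   (1UL << ORDER)

  /* pass 1: an unmovable page anywhere in the chunk blocks the whole chunk */
  static bool chunk_can_become_free(const enum page_state *p)
  {
          for (unsigned long i = 0; i < CHUNK; i++)
                  if (p[i] == PAGE_PINNED)
                          return false;
          return true;
  }

  int main(void)
  {
          static enum page_state mem[2 * CHUNK];  /* two order-9 chunks */
          unsigned long migrated = 0;

          for (unsigned long i = 0; i < 2 * CHUNK; i++)
                  mem[i] = PAGE_MOVABLE;
          mem[3] = PAGE_PINNED;           /* first chunk can never be freed */

          for (unsigned long c = 0; c < 2; c++) {
                  enum page_state *chunk = mem + c * CHUNK;

                  if (!chunk_can_become_free(chunk))
                          continue;       /* skip the rmap/migration cost */
                  /* pass 2: stand-in for isolating and migrating the pages */
                  for (unsigned long i = 0; i < CHUNK; i++)
                          if (chunk[i] == PAGE_MOVABLE)
                                  migrated++;
          }
          printf("pages migrated: %lu (only from chunks that can yield "
                 "an order-%d block)\n", migrated, ORDER);
          return 0;
  }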


Well I think it's pointless for page faults and maybe khugepaged. But 
still there should be some daemon trying to migrate even pages that do 
not immediately lead to a contiguous block of some order, as a general 
incremental defragmentation. For example I believe (and hope to analyze 
and improve that eventually) that MOVABLE pages allocated in 

Re: compaction is still too expensive for thp

2014-05-22 Thread David Rientjes
On Thu, 22 May 2014, Vlastimil Babka wrote:

> > With -mm, it turns out that while egregious thp fault latencies were
> > reduced, faulting 64MB of memory backed by thp on a fragmented 128GB
> > machine can result in latencies of 1-3s for the entire 64MB.  Collecting
> > compaction stats from older kernels that give more insight into
> > regressions, one such incident is as follows.
> > 
> > Baseline:
> > compact_blocks_moved 8181986
> > compact_pages_moved 6549560
> > compact_pagemigrate_failed 1612070
> > compact_stall 101959
> > compact_fail 100895
> > compact_success 1064
> > 
> > 5s later:
> > compact_blocks_moved 8182447
> > compact_pages_moved 6550286
> > compact_pagemigrate_failed 1612092
> > compact_stall 102023
> > compact_fail 100959
> > compact_success 1064
> > 
> > This represents faulting two 64MB ranges of anonymous memory.  As you can
> > see, it results in falling back to 4KB pages because all 64 hugepage faults
> > end up triggering compaction and failing to allocate.  Over the
> > 64 async compactions, we scan on average 7.2 pageblocks per call,
> > successfully migrate 11.3 pages per call, and fail migrating 0.34 pages
> > per call.
> > 
> > If each async compaction scans 7.2 pageblocks per call, it would have to
> > be called 9103 times to scan all memory on this 128GB machine.  We're
> > simply not scanning enough memory as a result of ISOLATE_ABORT due to
> > need_resched().
> 
> Well, the two objectives of not being expensive and at the same time scanning
> "enough memory" (which is hard to define as well) are clearly quite opposite
> :/
> 

Agreed.

> > So the net result is that -mm is much better than Linus's tree, where such
> > faulting of 64MB ranges could stall 8-9s, but we're still very expensive.
> 
> So I guess the difference here is mainly thanks to not doing sync compaction?

Not doing sync compaction for thp and caching the migration pfn for async 
so that it doesn't iterate over a ton of memory that may not be eligible 
for async compaction every time it is called.  But when we avoid sync 
compaction, we also lose deferred compaction.

> So if I understand correctly, your intention is to scan more in a single scan,

More will be scanned than the ~7 pageblocks per call to async compaction 
in the data that I presented, but the idea is also to reduce how expensive 
every pageblock scan is by avoiding needlessly migrating memory (and 
dealing with rmap locks) when it will not result in 2MB of contiguous 
memory for thp faults.

> but balance the increased latencies by introducing deferring for async
> compaction.
> 
> Offhand I can think of two issues with that.
> 
> 1) the result will be that often the latency will be low thanks to defer, but
> then there will be a huge (?) spike by scanning whole 1GB (as you suggest
> further in the mail) at once. I think that's similar to what you had now with
> the sync compaction?
> 

Not at all: with MIGRATE_SYNC_LIGHT before, there was no termination other 
than an entire scan of memory, so we were potentially scanning 128GB and 
failing if the thp cannot be allocated.

If we are to avoid migrating memory needlessly that will not result in 
cc->order memory being allocated, then the cost should be relatively 
constant for a span of memory.  My 32GB system can iterate all memory with 
MIGRATE_ASYNC and no need_resched() aborts in ~530ms.

> 2) 1GB could have a good chance of being successful (?) so there would be no
> defer anyway.
> 

If we terminate early because order-9 is allocatable or we end up scanning 
the entire 1GB and the hugepage is allocated, then we have prevented 511 
other pagefaults in my testcase where faulting 64MB of memory with thp 
enabled can currently take 1-3s on a 128GB machine with fragmentation.  I 
think the deferral is unnecessary in such a case.

Are you suggesting we should try without the deferral first?

> > I have a few improvements in mind, but thought it would be better to
> > get feedback on it first because it's a substantial rewrite of the
> > pageblock migration:
> > 
> >   - For all async compaction, avoid migrating memory unless enough
> > contiguous memory is migrated to allow a cc->order allocation.
> 
> Yes I suggested something like this earlier. Also in the scanner, skip to the
> next cc->order aligned block as soon as any page fails the isolation and is
> not PageBuddy.

Agreed.

> I would just distinguish kswapd and direct compaction, not "all async
> compaction". Or maybe kswapd could be switched to sync compaction.
> 

To generalize this, I'm thinking that it is pointless for async compaction 
to migrate memory in a contiguous span if it will not cause a cc->order 
page allocation to succeed.

> > This
> > would remove the COMPACT_CLUSTER_MAX restriction on pageblock
> > compaction
> 
> Yes.
> 
> > and keep pages on the cc->migratepages list between
> > calls to isolate_migratepages_range().
> 
> This might not be needed. It's called within a single pageblock (except maybe
> CMA but that's quite a different thing) and I think we can ignore order >
> pageblock_nr_order here.
> 

Ok, I guess pageblocks 
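
For scale, on the COMPACT_CLUSTER_MAX point above (constants as in kernels
of that era, where COMPACT_CLUSTER_MAX is SWAP_CLUSTER_MAX, i.e. 32):

  #include <stdio.h>

  int main(void)
  {
          int compact_cluster_max = 32;           /* == SWAP_CLUSTER_MAX */
          int pages_per_pageblock = 1 << 9;       /* 2MB pageblock */

          /* with 32-page batches, one pageblock takes 16 separate
           * isolate + migrate rounds; batching per cc->order chunk would
           * make the work per THP candidate all-or-nothing */
          printf("migrate batches per pageblock: %d\n",
                 pages_per_pageblock / compact_cluster_max);
          return 0;
  }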

Re: compaction is still too expensive for thp

2014-05-22 Thread Vlastimil Babka

On 05/22/2014 05:20 AM, David Rientjes wrote:

On Fri, 16 May 2014, Vlastimil Babka wrote:


Compaction uses the compact_checklock_irqsave() function to periodically check for
lock contention and need_resched() to either abort async compaction, or to
free the lock, schedule and retake the lock. When aborting, cc->contended is
set to signal the contended state to the caller. Two problems have been
identified in this mechanism.

First, compaction also calls cond_resched() directly in both scanners when no
lock is yet taken. This call neither aborts async compaction nor sets
cc->contended appropriately. This patch introduces a new compact_should_abort()
function to achieve both. In isolate_freepages(), the check frequency is
reduced to once per SWAP_CLUSTER_MAX pageblocks to match what the migration
scanner does in the preliminary page checks. In case a pageblock is found
suitable for calling isolate_freepages_block(), the checks within it are
done at a higher frequency.

Second, isolate_freepages() does not check if isolate_freepages_block()
aborted due to contention, and advances to the next pageblock. This violates
the principle of aborting on contention, and might result in pageblocks not
being scanned completely, since the scanning cursor is advanced. This patch
makes isolate_freepages_block() check the cc->contended flag and abort.

In case isolate_freepages() has already isolated some pages before aborting
due to contention, page migration will proceed, which is OK since we do not
want to waste the work that has been done, and page migration has its own
checks for contention. However, we do not want another isolation attempt by
either of the scanners, so a cc->contended flag check is also added to
compaction_alloc() and compact_finished() to make sure compaction is aborted
right after the migration.
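
To make the two changes concrete, here is a small userspace model (not the
kernel code; need_resched() and the per-pageblock work are simulated) of
the control flow the patch describes: poll for contention only once per
SWAP_CLUSTER_MAX pageblocks, and when async compaction aborts, record it in
a contended flag so the rest of compaction can bail out too.

  #include <stdio.h>
  #include <stdbool.h>

  #define SWAP_CLUSTER_MAX 32

  struct compact_control_model {
          bool async;
          bool contended;
  };

  static bool fake_need_resched(unsigned long pageblock)
  {
          return pageblock >= 200;        /* pretend contention shows up here */
  }

  /* mirrors the intent of compact_should_abort(): abort async compaction on
   * contention and remember it, otherwise just (pretend to) reschedule */
  static bool should_abort(struct compact_control_model *cc, unsigned long pb)
  {
          if (fake_need_resched(pb)) {
                  if (cc->async) {
                          cc->contended = true;
                          return true;
                  }
                  /* sync: would cond_resched() and carry on */
          }
          return false;
  }

  int main(void)
  {
          struct compact_control_model cc = { .async = true, .contended = false };
          unsigned long pb, nr_pageblocks = 16384;

          for (pb = 0; pb < nr_pageblocks; pb++) {
                  /* cheap preliminary checks every pageblock would go here */
                  if (!(pb % SWAP_CLUSTER_MAX) && should_abort(&cc, pb))
                          break;
                  /* isolate_freepages_block()-style work would go here */
          }
          printf("stopped at pageblock %lu, contended=%d\n", pb, cc.contended);
          return 0;
  }

The caller checking cc.contended afterwards is the analogue of the checks
the patch adds to compaction_alloc() and compact_finished().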



We have a pretty significant problem with async compaction related to thp
faults and it's not limited to this patch but was intended to be addressed
in my series as well.  Since this is the latest patch to be proposed for
aborting async compaction when it's too expensive, it's probably a good
idea to discuss it here.


I already tried to call for some higher level discussion earlier in your 
series; good to hear that now you also agree it might be useful :)



With -mm, it turns out that while egregious thp fault latencies were
reduced, faulting 64MB of memory backed by thp on a fragmented 128GB
machine can result in latencies of 1-3s for the entire 64MB.  Collecting
compaction stats from older kernels that give more insight into
regressions, one such incident is as follows.

Baseline:
compact_blocks_moved 8181986
compact_pages_moved 6549560
compact_pagemigrate_failed 1612070
compact_stall 101959
compact_fail 100895
compact_success 1064

5s later:
compact_blocks_moved 8182447
compact_pages_moved 6550286
compact_pagemigrate_failed 1612092
compact_stall 102023
compact_fail 100959
compact_success 1064

This represents faulting two 64MB ranges of anonymous memory.  As you can
see, it results in falling back to 4KB pages because all 64 hugepage faults
end up triggering compaction and failing to allocate.  Over the
64 async compactions, we scan on average 7.2 pageblocks per call,
successfully migrate 11.3 pages per call, and fail migrating 0.34 pages
per call.

If each async compaction scans 7.2 pageblocks per call, it would have to
be called 9103 times to scan all memory on this 128GB machine.  We're
simply not scanning enough memory as a result of ISOLATE_ABORT due to
need_resched().


Well, the two objectives of not being expensive and at the same time 
scanning "enough memory" (which is hard to define as well) are clearly 
quite opposite :/



So the net result is that -mm is much better than Linus's tree, where such
faulting of 64MB ranges could stall 8-9s, but we're still very expensive.


So I guess the difference here is mainly thanks to not doing sync 
compaction? Or do you have any insight which patch helped the most?



We may need to consider scanning more memory on a single call to async
compaction even when need_resched(), and, if we are unsuccessful in
allocating a hugepage, to defer async compaction in subsequent calls up to
1 << COMPACT_MAX_DEFER_SHIFT.  Today, we defer on sync compaction but that
is now never done for thp faults since it is reliant solely on async
compaction.


So if I understand correctly, your intention is to scan more in a single 
scan, but balance the increased latencies by introducing deferring for 
async compaction.


Offhand I can think of two issues with that.

1) the result will be that often the latency will be low thanks to 
defer, but then there will be a huge (?) spike by scanning whole 1GB (as 
you suggest further in the mail) at once. I think that's similar to what 
you had now with the sync compaction?


2) 1GB could have a good chance of being successful (?) so there would 
be no defer anyway.


I have some other suggestion at the end of my mail.


I have a 
