Re: [PATCH v8] mm: Proactive compaction

2020-06-23 Thread Nitin Gupta
On 6/22/20 9:57 PM, Nathan Chancellor wrote:
> On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote:
>> On 6/22/20 7:26 PM, Nathan Chancellor wrote:
>>> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
For some applications, we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if the memory is fragmented. The Linux kernel currently does
on-demand compaction as we request more hugepages, but this style of
compaction incurs very high latency. Experiments with one-time full
memory compaction (followed by hugepage allocations) show that the
kernel is able to restore a highly fragmented memory state to a fairly
compacted state within <1 sec for a 32G system. Such data suggests that
proactive compaction can help us allocate a large fraction of memory as
hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness', which dictates the
bounds of external fragmentation that kcompactd tries to maintain.

 The tunable takes a value in range [0, 100], with a default of 20.

 Note that a previous version of this patch [1] was found to introduce
 too many tunables (per-order extfrag{low, high}), but this one reduces
 them to just one sysctl. Also, the new tunable is an opaque value
 instead of asking for specific bounds of "external fragmentation", which
 would have been difficult to estimate. The internal interpretation of
 this opaque value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10).
The score for a node is defined as the weighted mean of per-zone
external fragmentation, where a zone's present_pages determines its
weight.
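
For illustration, the score and threshold translation can be sketched in
plain C as follows. This is not the kernel implementation: the struct
fields and helper names are invented for the example, and the per-zone
numbers stand in for what the kernel derives from its free lists.

  #include <stdio.h>

  /* Illustrative zone snapshot; field names are ours, not the kernel's. */
  struct zone_snap {
          unsigned long present_pages;        /* pages spanned by the zone */
          unsigned long free_pages;           /* total free base pages */
          unsigned long free_hugepage_pages;  /* free pages sitting in
                                                 hugepage-order blocks */
  };

  /* Per-zone external fragmentation w.r.t. hugepage-order blocks, in
   * [0, 100]: the share of free memory NOT usable for a hugepage. */
  static unsigned int zone_extfrag(const struct zone_snap *z)
  {
          if (z->free_pages == 0)
                  return 0;
          return (z->free_pages - z->free_hugepage_pages) * 100 /
                 z->free_pages;
  }

  int main(void)
  {
          /* A two-zone node; weights are present_pages, as above. */
          struct zone_snap zones[] = {
                  { 1UL << 20, 1UL << 18, 1UL << 16 },
                  { 1UL << 18, 1UL << 17, 1UL << 15 },
          };
          unsigned long node_present = 0, score = 0;
          unsigned int proactiveness = 20, low, high;
          int i;

          for (i = 0; i < 2; i++)
                  node_present += zones[i].present_pages;
          for (i = 0; i < 2; i++)
                  score += zones[i].present_pages *
                           zone_extfrag(&zones[i]) / node_present;

          low = 100 - proactiveness;      /* 80 for the default of 20 */
          high = low + 10;                /* 90 */

          printf("node score=%lu low=%u high=%u\n", score, low, high);
          return 0;
  }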

To periodically check per-node scores, we reuse the per-node kcompactd
threads, which are woken up every 500 milliseconds to run this check.
If a node's score exceeds its high threshold (as derived from the
user-provided proactiveness value), proactive compaction is started and
runs until the score drops to the low threshold. By default,
proactiveness is set to 20, which implies threshold values of low=80
and high=90.

 This patch is largely based on ideas from Michal Hocko [2]. See also the
 LWN article [3].

Performance data
----------------

System: x86_64, 1T RAM, 80 CPU threads.
 Kernel: 5.6.0-rc3 + this patch

 echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
 echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented by a userspace
program that allocates all memory and then, for each 2M-aligned section,
frees 3/4 of the base pages using munmap. The workload is mainly
anonymous userspace pages, which are easy to move around. I
intentionally avoided unmovable pages in this test to see how much
latency we incur when hugepage allocations hit direct compaction.
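
The fragmentation program itself is not included in the thread; a
minimal sketch of the method described above could look like the
following (the default region size, the CLI handling, and the trailing
pause() are our additions):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define SECTION (2UL << 20)     /* 2M-aligned sections */
  #define PAGE    4096UL          /* base page size */

  int main(int argc, char **argv)
  {
          /* Size to fragment; the real test used (nearly) all memory. */
          unsigned long total = argc > 1 ? strtoul(argv[1], NULL, 0)
                                         : 1UL << 30;
          uintptr_t s, pg, start;
          char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                         -1, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* In each 2M-aligned section, keep every 4th base page and
           * munmap the other 3, freeing 3/4 of the pages while the
           * survivors keep the section from forming a free hugepage. */
          start = ((uintptr_t)p + SECTION - 1) & ~(SECTION - 1);
          for (s = start; s + SECTION <= (uintptr_t)p + total; s += SECTION)
                  for (pg = s; pg < s + SECTION; pg += 4 * PAGE)
                          munmap((void *)(pg + PAGE), 3 * PAGE);

          puts("memory fragmented; sleeping");
          pause();        /* keep the sparse mappings alive */
          return 0;
  }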

 1. Kernel hugepage allocation latencies

 With the system in such a fragmented state, a kernel driver then
 allocates as many hugepages as possible and measures allocation
 latency:

 (all latency values are in microseconds)

 - With vanilla 5.6.0-rc3

  percentile latency
  ---------- -------
           5    7894
          10    9496
          25   12561
          30   15295
          40   18244
          50   21229
          60   27556
          75   30147
          80   31047
          90   32859
          95   33799

 Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
 762G total free => 98% of free memory could be allocated as hugepages)

 - With 5.6.0-rc3 + this patch, with proactiveness=20

 sysctl -w vm.compaction_proactiveness=20

  percentile latency
  ---------- -------
           5       2
          10       2
          25       3
          30       3
          40       3
          50       4
          60       4
          75       4
          80       4
          90       5
          95     429

 Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
 762G total free => 98% of free memory could be allocated as hugepages)

2. Java heap allocation

 In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with its heap size set to 700G and
request that the heap be allocated with THP hugepages. THP is set to
madvise, as above, to allow hugepage backing of this heap.

 /usr/bin/time
  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15 as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for the following allocation stream, can probably happen for
more extreme proactiveness values, like 80 or 90.
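
As context for the madvise mode used in these tests: the kernel backs
only regions that explicitly opt in with MADV_HUGEPAGE, and
-XX:+UseTransparentHugePages makes the JVM issue the equivalent
madvise() on its heap. A minimal standalone sketch of such opting-in
(ours, not from the patch):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1UL << 30;         /* 1G region, for illustration */
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Mark the range as a THP candidate; required when THP is
           * in madvise mode, as in the setup above. */
          if (madvise(p, len, MADV_HUGEPAGE))
                  perror("madvise(MADV_HUGEPAGE)");

          memset(p, 0, len);      /* fault pages in, hopefully as THP */
          return 0;
  }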

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nathan Chancellor

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nitin Gupta

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread maobibo

Re: [PATCH v8] mm: Proactive compaction

2020-06-22 Thread Nathan Chancellor

Re: [PATCH v8] mm: Proactive compaction

2020-06-17 Thread Nitin Gupta
On 6/17/20 1:53 PM, Andrew Morton wrote:
> Everywhere else, scores have type `int'.  Here they are unsigned.  How come?
>
> Would it be better to make these unsigned throughout?  I don't think a
> score can ever be negative?

The score is always in [0, 100], so yes, it should be unsigned.
I will send another patch which fixes this.

Thanks,
Nitin



Re: [PATCH v8] mm: Proactive compaction

2020-06-17 Thread Andrew Morton
On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta  wrote:

> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) show that the
> kernel is able to restore a highly fragmented memory state to a fairly
> compacted state within <1 sec for a 32G system. Such data suggests that
> proactive compaction can help us allocate a large fraction of memory as
> hugepages while keeping allocation latencies low.
>
> ...
>

All looks straightforward to me and easy to disable if it goes wrong.

All the hard-coded magic numbers are a worry, but such is life.

One teeny complaint:

>
> ...
>
> @@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
>   unsigned long pflags;
>  
>   trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> - wait_event_freezable(pgdat->kcompactd_wait,
> - kcompactd_work_requested(pgdat));
> + if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
> + kcompactd_work_requested(pgdat),
> + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
> +
> + psi_memstall_enter(&pflags);
> + kcompactd_do_work(pgdat);
> + psi_memstall_leave(&pflags);
> + continue;
> + }
>  
> - psi_memstall_enter(&pflags);
> - kcompactd_do_work(pgdat);
> - psi_memstall_leave(&pflags);
> + /* kcompactd wait timeout */
> + if (should_proactive_compact_node(pgdat)) {
> + unsigned int prev_score, score;

Everywhere else, scores have type `int'.  Here they are unsigned.  How come?

Would it be better to make these unsigned throughout?  I don't think a
score can ever be negative?

> + if (proactive_defer) {
> + proactive_defer--;
> + continue;
> + }
> + prev_score = fragmentation_score_node(pgdat);
> + proactive_compact_node(pgdat);
> + score = fragmentation_score_node(pgdat);
> + /*
> +  * Defer proactive compaction if the fragmentation
> +  * score did not go down i.e. no progress made.
> +  */
> + proactive_defer = score < prev_score ?
> + 0 : 1 << COMPACT_MAX_DEFER_SHIFT;
> + }
>   }