[PATCH 3/5] mm: Reclaim small amounts of memory when an external fragmentation event occurs

2018-11-07 Thread Mel Gorman
An external fragmentation event was previously described as

When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.
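
The classification quoted above reduces to a single comparison. A minimal sketch, assuming order-9 pageblocks as on 64-bit x86; the macro and helper names are hypothetical and are not kernel symbols:

```c
#include <stdbool.h>

/* Assumption: pageblock order 9, the 64-bit x86 value quoted above. */
#define SKETCH_PAGEBLOCK_ORDER 9U

/*
 * Hypothetical helper, not a kernel function: returns true when the
 * fallback_order reported by mm_page_alloc_extfrag is smaller than a
 * pageblock, which is the case counted as a fragmentation-causing event.
 */
static inline bool sketch_is_extfrag_event(unsigned int fallback_order)
{
	return fallback_order < SKETCH_PAGEBLOCK_ORDER;
}
```

With these assumptions, a fallback at order 0 through 8 counts as an event and a fallback at order 9 or above does not.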

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if there
are enough sparsely populated pageblocks then the problem can still occur
as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a
zone watermark to be temporarily boosted when an external fragmentation
causing event occurs. The boosting will stall allocations that would
decrease free memory below the boosted low watermark and kswapd is woken
unconditionally to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost
is cleared. When kswapd finishes, it wakes kcompactd at the pageblock
order to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback or swap from reclaim
context during this operation to avoid excessive system disruption in
the name of fragmentation avoidance. Care is taken so that kswapd will
do normal reclaim work if the system is really low on memory.
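
The boosting described above can be modelled in userspace as follows. This is only a sketch under stated assumptions, not the patch itself: it assumes the factor is expressed in units of 1/10000 of the high watermark, that each fragmentation event adds one pageblock worth of pages to the boost, and that a pageblock is 512 pages. None of those details are stated in this message, and every name below is hypothetical.

```c
#include <stdbool.h>

/* Assumption: order-9 pageblock, i.e. 1 << 9 base pages. */
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL

/* Hypothetical stand-in for the per-zone watermark state, in pages. */
struct sketch_zone {
	unsigned long low_wmark;
	unsigned long high_wmark;
	unsigned long boost;	/* temporary boost, cleared once reclaim is done */
};

/* Raise the boost by one pageblock per event, capped relative to the high watermark. */
static inline void sketch_boost_watermark(struct sketch_zone *z,
					  unsigned long boost_factor)
{
	unsigned long long max_boost;

	if (!boost_factor)
		return;		/* a factor of 0 disables boosting in this model */

	/* Assumption: factor is in units of 1/10000 of the high watermark. */
	max_boost = (unsigned long long)z->high_wmark * boost_factor / 10000;
	if (max_boost < SKETCH_PAGEBLOCK_NR_PAGES)
		max_boost = SKETCH_PAGEBLOCK_NR_PAGES;

	z->boost += SKETCH_PAGEBLOCK_NR_PAGES;
	if (z->boost > max_boost)
		z->boost = (unsigned long)max_boost;
}

/* True when free memory is below the boosted low watermark. */
static inline bool sketch_below_boosted_low(const struct sketch_zone *z,
					    unsigned long free_pages)
{
	return free_pages < z->low_wmark + z->boost;
}
```

In this model, repeated events keep raising the boost until the cap is reached, and the boost would be reset to zero once kswapd has reclaimed the boosted amount.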

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)
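
The reduction percentages are consistent with (baseline - patched) / baseline, rounded to the nearest whole percent. A small sketch of that arithmetic; the helper name is hypothetical and the rounding rule is an assumption about how the figures were derived:

```c
/*
 * Hypothetical helper: percentage reduction from baseline to patched,
 * rounded to the nearest integer. Requires baseline > 0 and
 * patched <= baseline.
 */
static inline unsigned int
sketch_reduction_percent(unsigned long long baseline, unsigned long long patched)
{
	return (unsigned int)((100ULL * (baseline - patched) + baseline / 2) / baseline);
}
```

For the counts above, sketch_reduction_percent(1023463, 358574) gives 65 and sketch_reduction_percent(1023463, 19274) gives 98.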

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1      663.65 (   0.00%)      659.85 *   0.57%*
Amean     fault-huge-1        0.00 (   0.00%)      172.19 * -99.00%*

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Percentage huge-1             0.00 (   0.00%)        1.68 ( 100.00%)

Note that external fragmentation causing events are massively reduced
by this patch whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch: 337890 ( 1% reduction)
4.20-rc1+patch1-3:   12801 (96% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1     1531.37 (   0.00%)     1578.91 (  -3.10%)
Amean     fault-huge-1     1160.95 (   0.00%)     1090.23 *   6.09%*

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Percentage huge-1            78.97 (   0.00%)       82.59 (   4.58%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads


4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch: 185923 (11% reduction)
4.20-rc1+patch1-3:   11240 (95% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-5     1334.99 (   0.00%)     1395.28 (  -4.52%)
Amean     fault-huge-5     2428.43 (   0.00%)      539.69 (  77.78%)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Percentage huge-5             1.13 (   0.00%)        0.53 ( -52.94%)

This is an illustration of why latencies are not the primary metric.
There is a 95% reduction in fragmentation causing events. The huge page
latencies look fantastic, but only until you account for the fact that
the improvement may simply reflect a lower success rate. Given how low
the success rate was initially, this is partially down to luck.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

Re: [PATCH 3/5] mm: Reclaim small amounts of memory when an external fragmentation event occurs

2018-10-31 Thread Mel Gorman
On Wed, Oct 31, 2018 at 04:06:43PM +, Mel Gorman wrote:
> An external fragmentation event was previously described as
> 
> When the page allocator fragments memory, it records the event using
> the mm_page_alloc_extfrag event. If the fallback_order is smaller
> than a pageblock order (order-9 on 64-bit x86) then it's considered
> an event that will cause external fragmentation issues in the future.
> 

This had a build error reported by the 0-day bot. It's trivially fixed
with

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77bcc35903e0..e36c279dfade 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3317,8 +3317,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * probably too small. It only makes sense to spread allocations to avoid
  * fragmentation between the Normal and DMA32 zones.
  */
-static inline unsigned int alloc_flags_nofragment(struct zone *zone,
-							gfp_t gfp_mask)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	if (zone_idx(zone) != ZONE_NORMAL)
 		return 0;
@@ -3340,7 +3340,8 @@ static inline unsigned int alloc_flags_nofragment(struct zone *zone,
 	return ALLOC_NOFRAGMENT;
 }
 #else
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	return 0;
 }
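
The class of build error fixed by the diff above can be shown in miniature: when a function has a real implementation and a stub selected by the preprocessor, both must take the same parameters, because callers pass the same arguments in either configuration. The sketch below uses hypothetical stand-in types, a hypothetical SKETCH_FEATURE_ENABLED switch and a made-up flag value; it is not the kernel code.

```c
typedef unsigned int sketch_gfp_t;	/* stand-in for gfp_t */
struct sketch_zone_stub;		/* opaque stand-in for struct zone */

#define SKETCH_ALLOC_NOFRAGMENT 0x100U	/* hypothetical flag value */

#ifdef SKETCH_FEATURE_ENABLED
static inline unsigned int
sketch_alloc_flags_nofragment(struct sketch_zone_stub *zone, sketch_gfp_t gfp_mask)
{
	(void)zone;
	(void)gfp_mask;
	return SKETCH_ALLOC_NOFRAGMENT;
}
#else
/* The stub takes the same two parameters so callers build in both configs. */
static inline unsigned int
sketch_alloc_flags_nofragment(struct sketch_zone_stub *zone, sketch_gfp_t gfp_mask)
{
	(void)zone;
	(void)gfp_mask;
	return 0;
}
#endif

/* A caller always passes both arguments, whichever branch was compiled. */
static inline unsigned int
sketch_caller(struct sketch_zone_stub *zone, sketch_gfp_t gfp_mask)
{
	return sketch_alloc_flags_nofragment(zone, gfp_mask);
}
```

If the stub took only the zone argument, sketch_caller() would fail to compile whenever SKETCH_FEATURE_ENABLED is undefined, which is the kind of configuration a build bot exercises.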

-- 
Mel Gorman
SUSE Labs


[PATCH 3/5] mm: Reclaim small amounts of memory when an external fragmentation event occurs

2018-10-31 Thread Mel Gorman
An external fragmentation event was previously described as

When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if there
are enough sparsely populated pageblocks then the problem can still occur
as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
event occurs. The boosting will stall allocations below the boosted low
watermark and kswapd is woken unconditionally to reclaim an amount of
memory relative to the size of the high watermark and the
watermark_boost_factor until the boost is cleared. When kswapd finishes,
it wakes kcompactd at the pageblock order to clean some of the pageblocks
that may have been affected by the fragmentation event. kswapd avoids
any writeback or swap from reclaim context during this operation to avoid
excessive system disruption in the name of fragmentation avoidance. Care
is taken so that kswapd will do normal reclaim work if the system is
really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--

4.19 extfrag events < order 9:  71227
4.19+patch1:                    36456 (49% reduction)
4.19+patch1-3:                   4510 (94% reduction)

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-1      599.92 (   0.00%)      630.44 *  -5.09%*
Amean     fault-huge-1      179.84 (   0.00%)      179.22 (   0.35%)

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Percentage huge-1             1.08 (   0.00%)        2.89 ( 168.75%)

Note that external fragmentation causing events are massively reduced
by this patch whether in comparison to the previous kernel or the vanilla
kernel. There is some jitter in the fault latencies and they are a bit
more variable but the slight increase in THP allocation success rates
would account for some of that.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.19 extfrag events < order 9:  40761
4.19+patch1:                    36085 (11% reduction)
4.19+patch1-3:                   1887 (95% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-1     1938.47 (   0.00%)     1863.70 *   3.86%*
Amean     fault-huge-1      749.40 (   0.00%)      776.07 *  -3.56%*

thpfioscale Percentage Faults Huge
                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Percentage huge-1            83.79 (   0.00%)       86.92 (   3.73%)

As before, massive reduction in external fragmentation events, some
jitter on latencies and a slight increase in THP allocation success
rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads


4.19 extfrag events < order 9:  882868
4.19+patch1:                    476937 (46% reduction)
4.19+patch1-3:                   29044 (97% reduction)

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-5     1602.01 (   0.00%)     1595.28 (   0.42%)
Amean     fault-huge-5        0.00 (   0.00%)      435.67 * -99.00%*

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Percentage huge-5             0.00 (   0.00%)        0.15 ( 100.00%)

This is an illustration of why latencies are not the primary metric.
There is a 97% reduction in fragmentation causing events but the
huge page latencies are much higher because they went from never
succeeding to a small success.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-

4.19 extfrag events < order 9: 803099
4.19+patch1:   65467