[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-03-08 Thread Mel Gorman
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken for
in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
zone will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx,
which is 0 without a recent wakeup request, and that classzone does not
account for lowmem reserves. This allows kswapd to sleep when a small
zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream
of allocations cannot use that zone. While kswapd may be woken again
shortly, there are two consequences -- the pgdat bits that control
congestion are cleared prematurely and direct reclaim is more likely
because kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
index) when there have been no recent wakeups. If there are no wakeups,
it decides whether to sleep based on the highest possible zone available
(MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
decisions during reclaim and when deciding to sleep are the same. If there is
a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of
results. Otherwise this patch looks artificially excellent.

                                          4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                             vanilla              clear-v2          keepawake-v2
Amean    p50-Read      21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
Amean    p95-Read      25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
Amean    p99-Read      29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
Amean    p50-Write          1390.30 (  0.00%)     1011.91 ( 27.22%)     924.91 ( 33.47%)
Amean    p95-Write        412901.57 (  0.00%)    34874.98 ( 91.55%)    1362.62 ( 99.67%)
Amean    p99-Write       6668722.09 (  0.00%)   575449.60 ( 91.37%)   16854.04 ( 99.75%)
Amean    p50-Allocation    78714.31 (  0.00%)    84246.26 ( -7.03%)   74729.74 (  5.06%)
Amean    p95-Allocation   175533.51 (  0.00%)   400058.43 (-127.91%) 101609.74 ( 42.11%)
Amean    p99-Allocation   247003.02 (  0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)

With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired but it's worth noting
that this is mostly due to the IO scheduler and not directly related to
reclaim. The vmstats are a bit of a mix but the relevant ones are as follows:

                           4.10.0-rc7      4.10.0-rc7      4.10.0-rc7
                       mmots-20170209     clear-v1r25 keepawake-v1r25
Swap Ins                            0               0               0
Swap Outs                           0             608               0
Direct pages scanned          6910672         3132699         6357298
Kswapd pages scanned         57036946        82488665        56986286
Kswapd pages reclaimed       55993488        63474329        55939113
Direct pages reclaimed        6905990         2964843         6352115
Kswapd efficiency                 98%             76%             98%
Kswapd velocity             12494.375       17597.507       12488.065
Direct efficiency                 99%             94%             99%
Direct velocity              1513.835         668.306        1393.148
Page writes by reclaim          0.000     4410243.000           0.000
Page writes file                    0         4409635               0
Page writes anon                    0             608               0
Page reclaim immediate        1036792        14175203         1042571

                           4.11.0-rc1      4.11.0-rc1      4.11.0-rc1
                              vanilla        clear-v2    keepawake-v2
Swap Ins                            0              12               0
Swap Outs                           0             838               0
Direct pages scanned          6579706         3237270         6256811
Kswapd pages scanned         61853702        79961486        54837791
Kswapd pages reclaimed       60768764        60755788        53849586
Direct pages reclaimed        6579055         2987453         6256151
Kswapd efficiency                 98%             75%             98%
Page writes by reclaim          0.000     4389496.000           0.000
Page writes file                    0         4388658               0
Page writes anon                    0             838               0
Page reclaim immediate        1073573        14473009          982507

Swap-outs are equivalent to baseline.
Direct reclaim is reduced but not eliminated. It's worth noting
that there are two periods of direct reclaim for this workload. The
first is when it switches from preparing the files for the actual
test itself.


Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-03-01 Thread Vlastimil Babka
On 02/23/2017 04:01 PM, Mel Gorman wrote:
> On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
>>> With this patch on top, all the latencies relative to the baseline are
>>> improved, particularly write latencies. The read latencies are still high
>>> for the number of threads but it's worth noting that this is mostly due
>>> to the IO scheduler and not directly related to reclaim. The vmstats are
>>> a bit of a mix but the relevant ones are as follows;
>>>
>>> 4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
>>>   mmots-20170209 clear-v1r25keepawake-v1r25
>>> Swap Ins 0   0   0
>>> Swap Outs0 608   0
>>> Direct pages scanned   6910672 3132699 6357298
>>> Kswapd pages scanned         57036946        82488665        56986286
>>> Kswapd pages reclaimed       55993488        63474329        55939113
>>> Direct pages reclaimed 6905990 2964843 6352115
>>
>> These stats are confusing me. The earlier description suggests that this 
>> patch
>> should cause less direct reclaim and more kswapd reclaim, but compared to
>> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? 
>> (when
>> considering direct + kswapd combined)
>>
> 
> The description is referring to the impact relative to baseline. It is
> true that relative to patch that direct reclaim is higher but there are
> a number of anomalies.
> 
> Note that kswapd is scanning very aggressively in "clear-v1" and overall
> efficiency is down to 76%. It's also not clear in the stats but in
> "clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
> due to the patch "mm, vmscan: fix zone balance check in
> prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
> pages from reclaim context and some swapping due to the aggressiveness
> of the scan.
> 
> While direct reclaim activity might be lower, it's due to kswapd scanning
> aggressively and trying to reclaim the world which is not the right thing
> to do.  With the patches applied, there is still direct reclaim but the vast
> bulk of it comes when the workload changes phase from "creating work files"
> to starting multiple threads that allocate a lot of anonymous memory with
> a sudden spike in memory pressure that kswapd does not keep ahead of with
> multiple allocating threads.

Thanks for the explanation.

> 
>>> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int 
>>> order, int classzone_idx)
>>> return sc.order;
>>>  }
>>>  
>>> +/*
>>> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
>>> + * allocation request woke kswapd for. When kswapd has not woken recently,
>>> + * the value is MAX_NR_ZONES which is not a valid index. This compares a
>>> + * given classzone and returns it or the highest classzone index kswapd
>>> + * was recently woke for.
>>> + */
>>> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
>>> +  enum zone_type classzone_idx)
>>> +{
>>> +   if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
>>> +   return classzone_idx;
>>> +
>>> +   return max(pgdat->kswapd_classzone_idx, classzone_idx);
>>
>> A bit paranoid comment: this should probably read 
>> pgdat->kswapd_classzone_idx to
>> a local variable with READ_ONCE(), otherwise something can set it to
>> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
>> Probably not an issue with current callers, but I'd rather future-proof it.
>>
> 
> I'm a little wary of adding READ_ONCE unless there is a definite
> problem. Even if it was an issue, I think it would be better to protect
> these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
> if an update is required or a read to fully guarantee the ordering.
> 
> The consequences as they are is that kswapd may miss reclaiming at a
> higher order or classzone than it should have although it is very
> unlikely and the update and read are made with a workqueue wake and
> scheduler wakeup which should be sufficient in terms of barriers.

OK then.

Acked-by: Vlastimil Babka 




Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-23 Thread Mel Gorman
On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
> > With this patch on top, all the latencies relative to the baseline are
> > improved, particularly write latencies. The read latencies are still high
> > for the number of threads but it's worth noting that this is mostly due
> > to the IO scheduler and not directly related to reclaim. The vmstats are
> > a bit of a mix but the relevant ones are as follows;
> > 
> > 4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
> >   mmots-20170209 clear-v1r25keepawake-v1r25
> > Swap Ins 0   0   0
> > Swap Outs0 608   0
> > Direct pages scanned   6910672 3132699 6357298
> > Kswapd pages scanned         57036946        82488665        56986286
> > Kswapd pages reclaimed       55993488        63474329        55939113
> > Direct pages reclaimed 6905990 2964843 6352115
> 
> These stats are confusing me. The earlier description suggests that this patch
> should cause less direct reclaim and more kswapd reclaim, but compared to
> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
> considering direct + kswapd combined)
> 

The description is referring to the impact relative to baseline. It is
true that relative to patch that direct reclaim is higher but there are
a number of anomalies.

Note that kswapd is scanning very aggressively in "clear-v1" and overall
efficiency is down to 76%. It's also not clear in the stats but in
"clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
due to the patch "mm, vmscan: fix zone balance check in
prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
pages from reclaim context and some swapping due to the aggressiveness
of the scan.

While direct reclaim activity might be lower, it's due to kswapd scanning
aggressively and trying to reclaim the world which is not the right thing
to do.  With the patches applied, there is still direct reclaim but the vast
bulk of it comes when the workload changes phase from "creating work files"
to starting multiple threads that allocate a lot of anonymous memory with
a sudden spike in memory pressure that kswapd does not keep ahead of with
multiple allocating threads.

> > @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > order, int classzone_idx)
> > return sc.order;
> >  }
> >  
> > +/*
> > + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> > + * allocation request woke kswapd for. When kswapd has not woken recently,
> > + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> > + * given classzone and returns it or the highest classzone index kswapd
> > + * was recently woke for.
> > + */
> > +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> > +  enum zone_type classzone_idx)
> > +{
> > +   if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> > +   return classzone_idx;
> > +
> > +   return max(pgdat->kswapd_classzone_idx, classzone_idx);
> 
> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx 
> to
> a local variable with READ_ONCE(), otherwise something can set it to
> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
> Probably not an issue with current callers, but I'd rather future-proof it.
> 

I'm a little wary of adding READ_ONCE unless there is a definite
problem. Even if it was an issue, I think it would be better to protect
these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
if an update is required or a read to fully guarantee the ordering.

The consequences as they are is that kswapd may miss reclaiming at a
higher order or classzone than it should have although it is very
unlikely and the update and read are made with a workqueue wake and
scheduler wakeup which should be sufficient in terms of barriers.

-- 
Mel Gorman
SUSE Labs




Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-20 Thread Hillf Danton
On February 21, 2017 12:34 AM Vlastimil Babka wrote:
> On 02/16/2017 09:21 AM, Hillf Danton wrote:
> > Right, but the order-3 request can also come up while kswapd is active and
> > gives up order-5.
> 
> "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
> order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way 
> how
> kswapd could help an order-3 allocation at that point - it's up to kcompactd.
> 
cpu0                                cpu1
give up order-5
fall back to order-0
                                    wake up kswapd for order-3
                                    wake up kswapd for order-5
fall in sleep
                                    wake up kswapd for order-3
what order would we try?

It is order-5 in the patch. 

Given the fresh new world after napping, whether for a tenth of a second
or 3 minutes, with no ban on order hikes carried over, IMHO we are free
to select any order and go another round of reclaiming pages.

thanks
Hillf





Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-20 Thread Vlastimil Babka
On 02/15/2017 10:22 AM, Mel Gorman wrote:
> kswapd is woken to reclaim a node based on a failed allocation request
> from any eligible zone. Once reclaiming in balance_pgdat(), it will
> continue reclaiming until there is an eligible zone available for the
> zone it was woken for. kswapd tracks what zone it was recently woken for
> in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
> zone will be 0.
> 
> However, the decision on whether to sleep is made on kswapd_classzone_idx
> which is 0 without a recent wakeup request and that classzone does not
> account for lowmem reserves.  This allows kswapd to sleep when a low
> small zone such as ZONE_DMA is balanced for a GFP_DMA request even if
> a stream of allocations cannot use that zone. While kswapd may be woken
> again shortly in the near future there are two consequences -- the pgdat
> bits that control congestion are cleared prematurely and direct reclaim
> is more likely as kswapd slept prematurely.
> 
> This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
> index) when there has been no recent wakeups. If there are no wakeups,
> it'll decide whether to sleep based on the highest possible zone available
> (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
> decisions during reclaim and when deciding to sleep are the same. If there is
> a mismatch, kswapd can stay awake continually trying to balance tiny zones.
> 
> simoop was used to evaluate it again. Two of the preparation patches regressed
> the workload so they are included as the second set of results. Otherwise
> this patch looks artificially excellent
> 
>                                        4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
>                                    mmots-20170209           clear-v1r25       keepawake-v1r25
> Amean    p50-Read      22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
> Amean    p95-Read      26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
> Amean    p99-Read      30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
> Amean    p50-Write           976.44 (  0.00%)     1077.22 ( -10.32%)     952.73 (  2.43%)
> Amean    p95-Write         15471.29 (  0.00%)    36419.56 (-135.40%)    3140.27 ( 79.70%)
> Amean    p99-Write         35108.62 (  0.00%)   102000.36 (-190.53%)    8843.73 ( 74.81%)
> Amean    p50-Allocation    76382.61 (  0.00%)    87485.22 ( -14.54%)   76349.22 (  0.04%)
> Amean    p95-Allocation       12.39 (  0.00%)   204588.52 ( -60.11%)  108630.26 ( 14.98%)
> Amean    p99-Allocation   187937.39 (  0.00%)   631657.74 (-236.10%)  139094.26 ( 25.99%)
> 
> With this patch on top, all the latencies relative to the baseline are
> improved, particularly write latencies. The read latencies are still high
> for the number of threads but it's worth noting that this is mostly due
> to the IO scheduler and not directly related to reclaim. The vmstats are
> a bit of a mix but the relevant ones are as follows;
> 
> 4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
>   mmots-20170209 clear-v1r25keepawake-v1r25
> Swap Ins 0   0   0
> Swap Outs0 608   0
> Direct pages scanned   6910672 3132699 6357298
> Kswapd pages scanned         57036946        82488665        56986286
> Kswapd pages reclaimed       55993488        63474329        55939113
> Direct pages reclaimed 6905990 2964843 6352115

These stats are confusing me. The earlier description suggests that this patch
should cause less direct reclaim and more kswapd reclaim, but compared to
"clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
considering direct + kswapd combined)

> Kswapd efficiency  98% 76% 98%
> Kswapd velocity  12494.375   17597.507   12488.065
> Direct efficiency  99% 94% 99%
> Direct velocity   1513.835 668.3061393.148
> Page writes by reclaim   0.000 4410243.000   0.000
> Page writes file 0 4409635   0
> Page writes anon 0 608   0
> Page reclaim immediate        1036792        14175203         1042571
> 
> Swap-outs are equivalent to baseline
> Direct reclaim is reduced but not eliminated. It's worth noting
>   that there are two periods of direct reclaim for this workload. The
>   first is when it switches from preparing the files for the actual
>   test itself. It's a lot of file IO followed by a lot of allocs
>   that reclaims heavily for a brief window. After that, direct
>   reclaim is intermittent when the workload spawns a number of
>   threads periodically to do work. kswapd simply cannot wake and
>   reclaim 

Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-20 Thread Vlastimil Babka
On 02/15/2017 10:22 AM, Mel Gorman wrote:
> kswapd is woken to reclaim a node based on a failed allocation request
> from any eligible zone. Once reclaiming in balance_pgdat(), it will
> continue reclaiming until there is an eligible zone available for the
> zone it was woken for. kswapd tracks what zone it was recently woken for
> in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
> zone will be 0.
> 
> However, the decision on whether to sleep is made on kswapd_classzone_idx,
> which is 0 without a recent wakeup request, and that classzone does not
> account for lowmem reserves.  This allows kswapd to sleep when a small
> lowmem zone such as ZONE_DMA is balanced for a GFP_DMA request even if
> a stream of allocations cannot use that zone. While kswapd may be woken
> again in the near future, there are two consequences -- the pgdat
> bits that control congestion are cleared prematurely, and direct reclaim
> is more likely because kswapd slept prematurely.
> 
> This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
> index) when there have been no recent wakeups. If there are no wakeups,
> it'll decide whether to sleep based on the highest possible zone available
> (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
> decisions during reclaim and when deciding to sleep are the same. If there is
> a mismatch, kswapd can stay awake continually trying to balance tiny zones.
> 
> simoop was used to evaluate it again. Two of the preparation patches regressed
> the workload so they are included as the second set of results. Otherwise
> this patch looks artificially excellent.
> 
>                           4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
>                       mmots-20170209           clear-v1r25       keepawake-v1r25
> Amean    p50-Read        22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
> Amean    p95-Read        26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
> Amean    p99-Read        30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
> Amean    p50-Write            976.44 (  0.00%)     1077.22 (-10.32%)     952.73 (  2.43%)
> Amean    p95-Write          15471.29 (  0.00%)    36419.56 (-135.40%)   3140.27 ( 79.70%)
> Amean    p99-Write          35108.62 (  0.00%)   102000.36 (-190.53%)   8843.73 ( 74.81%)
> Amean    p50-Allocation     76382.61 (  0.00%)    87485.22 (-14.54%)   76349.22 (  0.04%)
> Amean    p95-Allocation        12.39 (  0.00%)   204588.52 (-60.11%)  108630.26 ( 14.98%)
> Amean    p99-Allocation    187937.39 (  0.00%)   631657.74 (-236.10%) 139094.26 ( 25.99%)
> 
> With this patch on top, all the latencies relative to the baseline are
> improved, particularly write latencies. The read latencies are still high
> for the number of threads but it's worth noting that this is mostly due
> to the IO scheduler and not directly related to reclaim. The vmstats are
> a bit of a mix but the relevant ones are as follows;
> 
>                            4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
>                        mmots-20170209 clear-v1r25 keepawake-v1r25
> Swap Ins                            0           0           0
> Swap Outs                           0         608           0
> Direct pages scanned          6910672     3132699     6357298
> Kswapd pages scanned         57036946    82488665    56986286
> Kswapd pages reclaimed       55993488    63474329    55939113
> Direct pages reclaimed        6905990     2964843     6352115

These stats are confusing me. The earlier description suggests that this patch
should cause less direct reclaim and more kswapd reclaim, but compared to
"clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
considering direct + kswapd combined)

> Kswapd efficiency                 98%         76%         98%
> Kswapd velocity             12494.375   17597.507   12488.065
> Direct efficiency                 99%         94%         99%
> Direct velocity              1513.835     668.306    1393.148
> Page writes by reclaim          0.000 4410243.000       0.000
> Page writes file                    0     4409635           0
> Page writes anon                    0         608           0
> Page reclaim immediate        1036792    14175203     1042571
> 
> Swap-outs are equivalent to baseline
> Direct reclaim is reduced but not eliminated. It's worth noting
>   that there are two periods of direct reclaim for this workload. The
>   first is when it switches from preparing the files for the actual
>   test itself. It's a lot of file IO followed by a lot of allocs
>   that reclaims heavily for a brief window. After that, direct
>   reclaim is intermittent when the workload spawns a number of
>   threads periodically to do work. kswapd simply cannot wake and
>   reclaim fast enough between the low and min watermarks.

Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-20 Thread Vlastimil Babka
On 02/16/2017 09:21 AM, Hillf Danton wrote:
> 
> On February 16, 2017 4:11 PM Mel Gorman wrote:
>> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
>> > On February 15, 2017 5:23 PM Mel Gorman wrote:
>> > >   */
>> > >  static int kswapd(void *p)
>> > >  {
>> > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
>> > > +	unsigned int alloc_order, reclaim_order;
>> > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
>> > >  	pg_data_t *pgdat = (pg_data_t*)p;
>> > >  	struct task_struct *tsk = current;
>> > >
>> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>> > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>> > >  	set_freezable();
>> > >
>> > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>> > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
>> > > +	pgdat->kswapd_order = 0;
>> > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>> > >  	for ( ; ; ) {
>> > >  		bool ret;
>> > >
>> > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
>> > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>> > > +
>> > >  kswapd_try_sleep:
>> > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>> > >  					classzone_idx);
>> > >
>> > >  		/* Read the new order and classzone_idx */
>> > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
>> > > -		classzone_idx = pgdat->kswapd_classzone_idx;
>> > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
>> > >  		pgdat->kswapd_order = 0;
>> > > -		pgdat->kswapd_classzone_idx = 0;
>> > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>> > >
>> > >  		ret = try_to_freeze();
>> > >  		if (kthread_should_stop())
>> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>> > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
>> > >  		if (reclaim_order < alloc_order)
>> > >  			goto kswapd_try_sleep;
>> >
>> > If we fail order-5 request,  can we then give up order-5, and
>> > try order-3 if requested, after napping?
>> >
>> 
>> That has no bearing upon this patch. At this point, kswapd has stopped
>> reclaiming at the requested order and is preparing to sleep. If there is
>> a parallel request for order-3 while it's sleeping, it'll wake and start
>> reclaiming at order-3 as requested.
>> 
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.

"Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
order-0 watermarks are OK) and wake kcompactd for order-5. There's no way
kswapd could help an order-3 allocation at that point - it's up to kcompactd.

> thanks
> Hillf
> 



Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-16 Thread Mel Gorman
On Thu, Feb 16, 2017 at 04:21:04PM +0800, Hillf Danton wrote:
> 
> On February 16, 2017 4:11 PM Mel Gorman wrote:
> > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> > > On February 15, 2017 5:23 PM Mel Gorman wrote:
> > > >   */
> > > >  static int kswapd(void *p)
> > > >  {
> > > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > > > +	unsigned int alloc_order, reclaim_order;
> > > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> > > >  	pg_data_t *pgdat = (pg_data_t*)p;
> > > >  	struct task_struct *tsk = current;
> > > >
> > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> > > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > > >  	set_freezable();
> > > >
> > > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > > > +	pgdat->kswapd_order = 0;
> > > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > > >  	for ( ; ; ) {
> > > >  		bool ret;
> > > >
> > > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > > > +
> > > >  kswapd_try_sleep:
> > > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> > > >  					classzone_idx);
> > > >
> > > >  		/* Read the new order and classzone_idx */
> > > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > > >  		pgdat->kswapd_order = 0;
> > > > -		pgdat->kswapd_classzone_idx = 0;
> > > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > > >
> > > >  		ret = try_to_freeze();
> > > >  		if (kthread_should_stop())
> > > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> > > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > > >  		if (reclaim_order < alloc_order)
> > > >  			goto kswapd_try_sleep;
> > >
> > > If we fail order-5 request,  can we then give up order-5, and
> > > try order-3 if requested, after napping?
> > >
> > 
> > That has no bearing upon this patch. At this point, kswapd has stopped
> > reclaiming at the requested order and is preparing to sleep. If there is
> > a parallel request for order-3 while it's sleeping, it'll wake and start
> > reclaiming at order-3 as requested.
> > 
>
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.
> 

And then it'll be in pgdat->kswapd_order and be picked up on the next
wakeup. It won't be immediate but it's also unlikely to be worth picking
up immediately. The context here is that a high-order reclaim request
failed and, rather than keeping kswapd awake reclaiming the world, it goes
to sleep until another wakeup request comes in. Staying awake continually
for high orders caused problems with excessive reclaim in the past.

It could be revisited again but it's not related to what this patch is
aimed for -- avoiding reclaim going to sleep because ZONE_DMA is balanced
for a GFP_DMA request which is nowhere in the request stream.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-16 Thread Hillf Danton

On February 16, 2017 4:11 PM Mel Gorman wrote:
> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> > On February 15, 2017 5:23 PM Mel Gorman wrote:
> > >   */
> > >  static int kswapd(void *p)
> > >  {
> > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > > +	unsigned int alloc_order, reclaim_order;
> > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> > >  	pg_data_t *pgdat = (pg_data_t*)p;
> > >  	struct task_struct *tsk = current;
> > >
> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > >  	set_freezable();
> > >
> > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > > +	pgdat->kswapd_order = 0;
> > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >  	for ( ; ; ) {
> > >  		bool ret;
> > >
> > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > > +
> > >  kswapd_try_sleep:
> > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> > >  					classzone_idx);
> > >
> > >  		/* Read the new order and classzone_idx */
> > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > >  		pgdat->kswapd_order = 0;
> > > -		pgdat->kswapd_classzone_idx = 0;
> > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >
> > >  		ret = try_to_freeze();
> > >  		if (kthread_should_stop())
> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > >  		if (reclaim_order < alloc_order)
> > >  			goto kswapd_try_sleep;
> >
> > If we fail order-5 request,  can we then give up order-5, and
> > try order-3 if requested, after napping?
> >
> 
> That has no bearing upon this patch. At this point, kswapd has stopped
> reclaiming at the requested order and is preparing to sleep. If there is
> a parallel request for order-3 while it's sleeping, it'll wake and start
> reclaiming at order-3 as requested.
> 
Right, but the order-3 request can also come up while kswapd is active and
gives up order-5.

thanks
Hillf



Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-16 Thread Mel Gorman
On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> On February 15, 2017 5:23 PM Mel Gorman wrote: 
> >   */
> >  static int kswapd(void *p)
> >  {
> > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > +	unsigned int alloc_order, reclaim_order;
> > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> >  	pg_data_t *pgdat = (pg_data_t*)p;
> >  	struct task_struct *tsk = current;
> >
> > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >  	set_freezable();
> >
> > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > +	pgdat->kswapd_order = 0;
> > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> >  	for ( ; ; ) {
> >  		bool ret;
> >
> > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > +
> >  kswapd_try_sleep:
> >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> >  					classzone_idx);
> >
> >  		/* Read the new order and classzone_idx */
> >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> >  		pgdat->kswapd_order = 0;
> > -		pgdat->kswapd_classzone_idx = 0;
> > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> >
> >  		ret = try_to_freeze();
> >  		if (kthread_should_stop())
> > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> >  		if (reclaim_order < alloc_order)
> >  			goto kswapd_try_sleep;
> 
> If we fail order-5 request,  can we then give up order-5, and
> try order-3 if requested, after napping?
> 

That has no bearing upon this patch. At this point, kswapd has stopped
reclaiming at the requested order and is preparing to sleep. If there is
a parallel request for order-3 while it's sleeping, it'll wake and start
reclaiming at order-3 as requested.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-15 Thread Hillf Danton
On February 15, 2017 5:23 PM Mel Gorman wrote: 
>   */
>  static int kswapd(void *p)
>  {
> -	unsigned int alloc_order, reclaim_order, classzone_idx;
> +	unsigned int alloc_order, reclaim_order;
> +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
>  	pg_data_t *pgdat = (pg_data_t*)p;
>  	struct task_struct *tsk = current;
>
> @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>
> -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>
> +		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +
>  kswapd_try_sleep:
>  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>  					classzone_idx);
>
>  		/* Read the new order and classzone_idx */
>  		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
> +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
>  		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = 0;
> +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
>  		if (reclaim_order < alloc_order)
>  			goto kswapd_try_sleep;

If we fail order-5 request,  can we then give up order-5, and
try order-3 if requested, after napping?

> -
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
>  	}
> 




[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-15 Thread Mel Gorman
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken for
in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
zone will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx,
which is 0 without a recent wakeup request, and that classzone does not
account for lowmem reserves.  This allows kswapd to sleep when a small
lowmem zone such as ZONE_DMA is balanced for a GFP_DMA request even if
a stream of allocations cannot use that zone. While kswapd may be woken
again in the near future, there are two consequences -- the pgdat
bits that control congestion are cleared prematurely, and direct reclaim
is more likely because kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
index) when there have been no recent wakeups. If there are no wakeups,
it'll decide whether to sleep based on the highest possible zone available
(MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
decisions during reclaim and when deciding to sleep are the same. If there is
a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches regressed
the workload so they are included as the second set of results. Otherwise
this patch looks artificially excellent.

                          4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
                      mmots-20170209           clear-v1r25       keepawake-v1r25
Amean    p50-Read        22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
Amean    p95-Read        26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
Amean    p99-Read        30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
Amean    p50-Write            976.44 (  0.00%)     1077.22 (-10.32%)     952.73 (  2.43%)
Amean    p95-Write          15471.29 (  0.00%)    36419.56 (-135.40%)   3140.27 ( 79.70%)
Amean    p99-Write          35108.62 (  0.00%)   102000.36 (-190.53%)   8843.73 ( 74.81%)
Amean    p50-Allocation     76382.61 (  0.00%)    87485.22 (-14.54%)   76349.22 (  0.04%)
Amean    p95-Allocation        12.39 (  0.00%)   204588.52 (-60.11%)  108630.26 ( 14.98%)
Amean    p99-Allocation    187937.39 (  0.00%)   631657.74 (-236.10%) 139094.26 ( 25.99%)

With this patch on top, all the latencies relative to the baseline are
improved, particularly write latencies. The read latencies are still high
for the number of threads but it's worth noting that this is mostly due
to the IO scheduler and not directly related to reclaim. The vmstats are
a bit of a mix but the relevant ones are as follows;

                           4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                       mmots-20170209 clear-v1r25 keepawake-v1r25
Swap Ins                            0           0           0
Swap Outs                           0         608           0
Direct pages scanned          6910672     3132699     6357298
Kswapd pages scanned         57036946    82488665    56986286
Kswapd pages reclaimed       55993488    63474329    55939113
Direct pages reclaimed        6905990     2964843     6352115
Kswapd efficiency                 98%         76%         98%
Kswapd velocity             12494.375   17597.507   12488.065
Direct efficiency                 99%         94%         99%
Direct velocity              1513.835     668.306    1393.148
Page writes by reclaim          0.000 4410243.000       0.000
Page writes file                    0     4409635           0
Page writes anon                    0         608           0
Page reclaim immediate        1036792    14175203     1042571

Swap-outs are equivalent to baseline
Direct reclaim is reduced but not eliminated. It's worth noting
that there are two periods of direct reclaim for this workload. The
first is when it switches from preparing the files for the actual
test itself. It's a lot of file IO followed by a lot of allocs
that reclaims heavily for a brief window. After that, direct
reclaim is intermittent when the workload spawns a number of
threads periodically to do work. kswapd simply cannot wake and
reclaim fast enough between the low and min watermarks. It could
be mitigated using vm.watermark_scale_factor but not through
special tricks in kswapd.
Page writes from reclaim context are at 0 which is the ideal
Pages immediately reclaimed after IO completes is back at the baseline

On UMA, there is almost no change so this is not expected to be a universal
win.

Signed-off-by: Mel Gorman 
---
 mm/memory_hotplug.c |   2 

76349.22 (  0.04%)
Ameanp95-Allocation 12.39 (  0.00%)   204588.52 (-60.11%)   
108630.26 ( 14.98%)
Ameanp99-Allocation 187937.39 (  0.00%)   631657.74 (-236.10%)   
139094.26 ( 25.99%)

With this patch on top, all the latencies relative to the baseline are
improved, particularly write latencies. The read latencies are still high
for the number of threads but it's worth noting that this is mostly due
to the IO scheduler and not directly related to reclaim. The vmstats are
a bit of a mix but the relevant ones are as follows;

4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
  mmots-20170209 clear-v1r25keepawake-v1r25
Swap Ins 0   0   0
Swap Outs0 608   0
Direct pages scanned   6910672 3132699 6357298
Kswapd pages scanned  570369468248866556986286
Kswapd pages reclaimed559934886347432955939113
Direct pages reclaimed 6905990 2964843 6352115
Kswapd efficiency  98% 76% 98%
Kswapd velocity  12494.375   17597.507   12488.065
Direct efficiency  99% 94% 99%
Direct velocity   1513.835 668.3061393.148
Page writes by reclaim   0.000 4410243.000   0.000
Page writes file 0 4409635   0
Page writes anon 0 608   0
Page reclaim immediate 103679214175203 1042571

Swap-outs are equivalent to baseline
Direct reclaim is reduced but not eliminated. It's worth noting
that there are two periods of direct reclaim for this workload. The
first is when it switches from preparing the files for the actual
test itself. It's a lot of file IO followed by a lot of allocs
that reclaims heavily for a brief window. After that, direct
reclaim is intermittent when the workload spawns a number of
threads periodically to do work. kswapd simply cannot wake and
reclaim fast enough between the low and min watermarks. It could
be mitigated using vm.watermark_scale_factor but not through
special tricks in kswapd.
Page writes from reclaim context are at 0 which is the ideal
Pages immediately reclaimed after IO completes is back at the baseline

On UMA, there is almost no change so this is not expected to be a universal
win.

Signed-off-by: Mel Gorman 
---
 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c | 118