Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-11 Thread Dave Chinner
On Mon, Jul 11, 2016 at 10:02:24AM +0100, Mel Gorman wrote:
> On Mon, Jul 11, 2016 at 10:47:57AM +1000, Dave Chinner wrote:
> > > I had tested XFS with earlier releases and noticed no major problems
> > > so later releases tested only one filesystem.  Given the changes since,
> > > a retest is desirable. I've posted the current version of the series but
> > > I'll queue the tests to run over the weekend. They are quite time
> > > consuming to run unfortunately.
> > 
> > Understood. I'm not following the patchset all that closely, so I
> > didn't know you'd already tested XFS.
> > 
> 
> It was needed anyway. Not all of them completed over the weekend. In
> particular, the NUMA machine is taking its time because many of the
> workloads are scaled by memory size and it takes longer.
> 
> > > On the fsmark configuration, I configured the test to use 4K files
> > > instead of 0-sized files that normally would be used to stress inode
> > > creation/deletion. This is to have a mix of page cache and slab
> > > allocations. Shout if this does not suit your expectations.
> > 
> > Sounds fine. I usually limit that test to 10 million inodes - that's
> > my "10-4" test.
> > 
> 
> Thanks.
> 
> 
> I'm not going to go through most of the results in detail. The raw data
> is verbose and not necessarily useful in most cases.

Yup, numbers look pretty good and all my concerns have gone away.
Thanks for testing, Mel! :P

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-11 Thread Mel Gorman
On Mon, Jul 11, 2016 at 10:47:57AM +1000, Dave Chinner wrote:
> > I had tested XFS with earlier releases and noticed no major problems
> > so later releases tested only one filesystem.  Given the changes since,
> > a retest is desirable. I've posted the current version of the series but
> > I'll queue the tests to run over the weekend. They are quite time consuming
> > to run unfortunately.
> 
> Understood. I'm not following the patchset all that closely, so I
> didn't know you'd already tested XFS.
> 

It was needed anyway. Not all of them completed over the weekend. In
particular, the NUMA machine is taking its time because many of the
workloads are scaled by memory size and it takes longer.

> > On the fsmark configuration, I configured the test to use 4K files
> > instead of 0-sized files that normally would be used to stress inode
> > creation/deletion. This is to have a mix of page cache and slab
> > allocations. Shout if this does not suit your expectations.
> 
> Sounds fine. I usually limit that test to 10 million inodes - that's
> my "10-4" test.
> 

Thanks.


I'm not going to go through most of the results in detail. The raw data
is verbose and not necessarily useful in most cases.

tiobench
	Similar results to ext4, similar performance, similar reclaim
	activity

pgbench
	Similar performance results to ext4. Minor differences in
	reclaim activity. The series did enter direct reclaim where the
	mmotm kernel did not. However, it was one minor spike. kswapd
	activity was almost identical.

bonnie
	Similar performance results to ext4, minor differences in
	reclaim activity

parallel dd
	Similar performance results to ext4. Small differences in reclaim
	activity. Again, there was a slight increase in direct reclaim
	activity but negligible in comparison to the overall workload.
	Average direct reclaim velocity was 1.8 pages per second and
	direct reclaim page scans were 0.018% of all scans.

stutter
	Similar performance results to ext4, similar reclaim activity

These observations are all based on two UMA machines.

fsmark 50m-inodes-4k-files-16-threads
=====================================

As fsmark can be variable, this is reported as quartiles. This is one of
the UMA machines:

                             4.7.0-rc4            4.7.0-rc4
                        mmotm-20160623          approx-v9r6
Min        files/sec-16   2354.80 (  0.00%)   2255.40 ( -4.22%)
1st-qrtle  files/sec-16   3254.90 (  0.00%)   3249.40 ( -0.17%)
2nd-qrtle  files/sec-16   3310.10 (  0.00%)   3306.70 ( -0.10%)
3rd-qrtle  files/sec-16   3353.40 (  0.00%)   3329.00 ( -0.73%)
Max-90%    files/sec-16   3435.70 (  0.00%)   3426.90 ( -0.26%)
Max-93%    files/sec-16   3437.80 (  0.00%)   3462.50 (  0.72%)
Max-95%    files/sec-16   3471.60 (  0.00%)   3536.50 (  1.87%)
Max-99%    files/sec-16   5383.90 (  0.00%)   5900.00 (  9.59%)
Max        files/sec-16   5383.90 (  0.00%)   5900.00 (  9.59%)
Mean       files/sec-16   3342.99 (  0.00%)   3329.64 ( -0.40%)

              4.7.0-rc4      4.7.0-rc4
         mmotm-20160623    approx-v9r6
User             188.46         187.14
System          2964.26        2972.35
Elapsed        10222.83        9865.87

Direct pages scanned             144365     189738
Kswapd pages scanned           13147349   12965288
Kswapd pages reclaimed         13144543   12962266
Direct pages reclaimed           144365     189738
Kswapd efficiency                   99%        99%
Kswapd velocity                1286.077   1314.156
Direct efficiency                  100%       100%
Direct velocity                  14.122     19.232
Percentage direct scans              1%         1%
Slabs scanned                  52563968   52672128
Direct inode steals                 132         24
Kswapd inode steals               18234      12096

The performance is comparable and so is slab reclaim activity. The NUMA
machine completed the same test; there, too, direct reclaim activity
increased slightly, but only as a tiny percentage of the overall activity.
Slab scan and reclaim activity is almost identical.
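
As a sanity check, the velocity figures above can be reproduced from the
raw counters: a reclaim velocity is simply pages scanned divided by
elapsed seconds. A quick sketch:

```shell
# Reproduce the reported reclaim velocities from the raw counters:
# velocity = pages scanned / elapsed seconds.
awk 'BEGIN {
    printf "mmotm  kswapd velocity: %.3f\n", 13147349 / 10222.83
    printf "mmotm  direct velocity: %.3f\n",   144365 / 10222.83
    printf "v9r6   kswapd velocity: %.3f\n", 12965288 /  9865.87
    printf "v9r6   direct velocity: %.3f\n",   189738 /  9865.87
}'
# prints 1286.077, 14.122, 1314.156 and 19.232 in turn,
# matching the table above
```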

fsmark 50m-inodes-0k-files-16-threads
=====================================

I also tested with zero-sized files. The UMA machine showed nothing
interesting; the NUMA machine results were as follows:

                            4.7.0-rc4             4.7.0-rc4
                       mmotm-20160623           approx-v9r6
Min        files/sec-16  108235.50 (  0.00%)  120783.20 ( 11.59%)
1st-qrtle  files/sec-16  129569.40 (  0.00%)  132300.70 (  2.11%)
2nd-qrtle  files/sec-16  135544.90 (  0.00%)  141198.40 (  4.17%)
3rd-qrtle  files/sec-16  139634.90 (  0.00%)  148242.50 (  6.16%)
Max-90%    files/sec-16  144203.60 (  0.00%)  152247.10 (  5.58%)
Max-93%    files/sec-16  145294.50 (  0.00%)  152642.20 (  5.06%)
Max-95%    files/sec-16


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-10 Thread Dave Chinner
On Fri, Jul 08, 2016 at 10:52:03AM +0100, Mel Gorman wrote:
> On Fri, Jul 08, 2016 at 09:27:13AM +1000, Dave Chinner wrote:
> > .
> > > This series is not without its hazards. There are at least three areas
> > > that I'm concerned with even though I could not reproduce any problems in
> > > that area.
> > > 
> > > 1. Reclaim/compaction is going to be affected because the amount of
> > >    reclaim is no longer targeted at a specific zone. Compaction works
> > >    on a per-zone basis so there is no guarantee that reclaiming a few
> > >    THPs' worth of pages will have a positive impact on compaction
> > >    success rates.
> > > 
> > > 2. The Slab/LRU reclaim ratio is affected because the frequency the
> > >    shrinkers are called is now different. This may or may not be a
> > >    problem but if it is, it'll be because shrinkers are not called
> > >    enough and some balancing is required.
> > 
> > Given that XFS has a much more complex set of shrinkers and has a
> > much more finely tuned balancing between LRU and shrinker reclaim,
> > I'd be interested to see if you get the same results on XFS for the
> > tests you ran on ext4. It might also be worth running some highly
> > concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
> > concurrent fsmark tests) to see what impact heavy slab cache
> > pressure has on shrinker behaviour and system balance...
> > 
> 
> I had tested XFS with earlier releases and noticed no major problems
> so later releases tested only one filesystem.  Given the changes since,
> a retest is desirable. I've posted the current version of the series but
> I'll queue the tests to run over the weekend. They are quite time consuming
> to run unfortunately.

Understood. I'm not following the patchset all that closely, so I
didn't know you'd already tested XFS.

> On the fsmark configuration, I configured the test to use 4K files
> instead of 0-sized files that normally would be used to stress inode
> creation/deletion. This is to have a mix of page cache and slab
> allocations. Shout if this does not suit your expectations.

Sounds fine. I usually limit that test to 10 million inodes - that's
my "10-4" test.

> Finally, not all the machines I'm using can store 50 million inodes
> of this size. The benchmark has been configured to use as many inodes
> as it estimates will fit in the disk. In all cases, it'll exert memory
> pressure. Unfortunately, the storage is simple so there is no guarantee
> it'll find all problems but that's standard unfortunately.

Yup. But it's really the system balance that matters, and if the
balance is maintained then XFS will optimise the IO patterns to get
decent throughput regardless of the storage (i.e. the 10-4 test
should still run at tens of MB/s on a single spinning disk).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-08 Thread Mel Gorman
On Fri, Jul 08, 2016 at 09:27:13AM +1000, Dave Chinner wrote:
> .
> > This series is not without its hazards. There are at least three areas
> > that I'm concerned with even though I could not reproduce any problems in
> > that area.
> > 
> > 1. Reclaim/compaction is going to be affected because the amount of
> >    reclaim is no longer targeted at a specific zone. Compaction works on
> >    a per-zone basis so there is no guarantee that reclaiming a few THPs'
> >    worth of pages will have a positive impact on compaction success
> >    rates.
> > 
> > 2. The Slab/LRU reclaim ratio is affected because the frequency the
> >    shrinkers are called is now different. This may or may not be a
> >    problem but if it is, it'll be because shrinkers are not called
> >    enough and some balancing is required.
> 
> Given that XFS has a much more complex set of shrinkers and has a
> much more finely tuned balancing between LRU and shrinker reclaim,
> I'd be interested to see if you get the same results on XFS for the
> tests you ran on ext4. It might also be worth running some highly
> concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
> concurrent fsmark tests) to see what impact heavy slab cache
> pressure has on shrinker behaviour and system balance...
> 

I had tested XFS with earlier releases and noticed no major problems
so later releases were tested with only one filesystem. Given the changes since,
a retest is desirable. I've posted the current version of the series but
I'll queue the tests to run over the weekend. They are quite time consuming
to run unfortunately.

On the fsmark configuration, I configured the test to use 4K files
instead of 0-sized files that normally would be used to stress inode
creation/deletion. This is to have a mix of page cache and slab
allocations. Shout if this does not suit your expectations.
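
For reference, the described configuration corresponds to an fs_mark
invocation along these lines. The flags, path and the idea of splitting the
inode count across threads are assumptions based on common fs_mark usage,
not the actual mmtests configuration:

```shell
# Hypothetical fs_mark setup: 4K files rather than zero-length, 16
# threads, inode count scaled down on smaller disks.  Flags are an
# assumption, not Mel's exact config.
TOTAL_FILES=50000000                  # scaled to what the disk can hold
NFILES=$((TOTAL_FILES / 16))          # files per thread
if command -v fs_mark >/dev/null 2>&1; then
    fs_mark -S0 -s 4096 -n "$NFILES" -t 16 -D 10000 -d /mnt/scratch
else
    echo "fs_mark not installed; would create $NFILES files per thread"
fi
```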

Finally, not all the machines I'm using can store 50 million inodes
of this size. The benchmark has been configured to use as many inodes
as it estimates will fit on the disk. In all cases, it'll exert memory
pressure. Unfortunately, the storage is simple so there is no guarantee
it'll find all problems, but that's standard.

Thanks.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-07 Thread Dave Chinner
On Fri, Jul 01, 2016 at 04:37:15PM +0100, Mel Gorman wrote:
> Previous releases double accounted LRU stats on the zone and the node
> because it was required by should_reclaim_retry. The last patch in the
> series removes the double accounting. It's not integrated with the series
> as reviewers may not like the solution. If not, it can be safely dropped
> without a major impact to the results.

> tiobench on ext4
> 

[snip other tests on ext4 which show good results]

.
> This series is not without its hazards. There are at least three areas
> that I'm concerned with even though I could not reproduce any problems in
> that area.
> 
> 1. Reclaim/compaction is going to be affected because the amount of reclaim
>    is no longer targeted at a specific zone. Compaction works on a per-zone
>    basis so there is no guarantee that reclaiming a few THPs' worth of
>    pages will have a positive impact on compaction success rates.
> 
> 2. The Slab/LRU reclaim ratio is affected because the frequency the
>    shrinkers are called is now different. This may or may not be a problem
>    but if it is, it'll be because shrinkers are not called enough and some
>    balancing is required.

Given that XFS has a much more complex set of shrinkers and has a
much more finely tuned balancing between LRU and shrinker reclaim,
I'd be interested to see if you get the same results on XFS for the
tests you ran on ext4. It might also be worth running some highly
concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
concurrent fsmark tests) to see what impact heavy slab cache
pressure has on shrinker behaviour and system balance...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 10:55:09AM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote:
> > > > How big a ratio between highmem and lowmem do you think is a problem?
> > > > 
> > > 
> > > That's a "how long is a piece of string" type question.  The ratio does
> > > not matter as much as whether the workload is both under memory pressure
> > > and requires large amounts of lowmem pages. Even on systems with very high
> > > ratios, it may not be a problem if HIGHPTE is enabled.
> > 
> > As well as page tables, every kernel allocation that wants to mask
> > __GFP_HIGHMEM off (pgd, kernel stack, zbud, slab and so on) would be a
> > problem on a 32-bit system.
> > 
> > 
> 
> The same point applies -- it depends on the rate of these allocations,
> not the ratio of highmem:lowmem per se.
> 
> > It also depends on how many lowmem-only drivers we have in the system.
> > 
> > I don't know how many such drivers exist in the world. When I simply do
> > a grep, I find several cases which mask __GFP_HIGHMEM off and, among
> > them, I guess DRM might be a popular one for us. However, it might be a
> > really rare use case among the various i915 use cases.
> > 
> 
> It's also perfectly possible that such allocations are long-lived in which
> case they are not going to cause many skips. Hence why I cannot make a
> general prediction.
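
The kind of search described in the quote above can be sketched as follows;
the sample file is fabricated purely to make the pattern concrete, and in a
real kernel tree you would grep drivers/ directly:

```shell
# A sketch of the search: find code that strips __GFP_HIGHMEM from an
# allocation mask.  The sample file here stands in for real driver code.
mkdir -p /tmp/gfp-demo
cat > /tmp/gfp-demo/sample.c <<'EOF'
gfp_t gfp = mapping_gfp_mask(mapping);
gfp &= ~__GFP_HIGHMEM;  /* this allocation needs lowmem pages */
EOF
grep -rn -- '~__GFP_HIGHMEM' /tmp/gfp-demo
# In a kernel tree: grep -rn -- '~__GFP_HIGHMEM' drivers/ | wc -l
```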
> 
> > > > > Conceptually, moving to node LRUs should be easier to understand. The
> > > > > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > > > > similarly on all nodes. 
> > > > > 
> > > > > The series has been tested on a 16 core UMA machine and a 2-socket
> > > > > 48 core NUMA machine. The UMA results are presented in most cases
> > > > > as the NUMA machine behaved similarly.
> > > > 
> > > > I guess you would already test below with various highmem system(e.g.,
> > > > 2:1, 3:1, 4:1 and so on). If you have, could you mind sharing it?
> > > > 
> > > 
> > > I don't have that data; the baseline distribution used doesn't even have
> > > 32-bit support. Even if it did, the results may not be that interesting.
> > > The workloads used were not necessarily going to trigger lowmem pressure
> > > as HIGHPTE was set on the 32-bit configs.
> > 
> > That means we didn't test this on 32-bit with highmem.
> > 
> 
> No. I tested the skip logic and noticed that, when it was forced on
> purpose, system CPU usage was higher but it worked functionally.

Yep, it would work well functionally. I meant not functionality but the
performance point of view: system CPU usage, major fault rate and so on.

> 
> > I'm not sure it's really too rare a case to spend time testing.
> > In fact, I really want to test the whole series on our production
> > system, which is 32-bit with highmem, but as we know well, most
> > embedded system kernels are rather old so backporting needs lots of
> > time and care. However, if we miss testing on those systems now,
> > we will be surprised in 1-2 years.
> > 
> 
> It would be appreciated if it could be tested on such platforms if at all
> possible. Even if I did set up a 32-bit x86 system, it wouldn't have the
> same allocation/reclaim profile as the platforms you are considering.

Yep. I just finished reviewing all the patches and found no *big* problems,
so my remaining homework is the testing that will find whatever my review
missed.

I will give backporting to our old 32-bit production kernel a shot and
report if something strange happens.

Thanks for great work, Mel!


> 
> > I don't know what kind of benchmark we can check it with, so I cannot
> > insist on it, but you might know one.
> > 
> 
> One method would be to use fsmark with very large numbers of small files
> to force slab to require low memory. It's not representative of many real
> workloads unfortunately. Usually such a configuration is for checking the
> slab shrinker is working as expected.

Thanks for the suggestion.

> 
> > Okay, do you have any ideas for fixing it if we see such a regression
> > report on a 32-bit system in future?
> 
> Two options, neither of whose complexity is justified without a "real"
> workload to use as a reference.
> 
> 1. Long-term isolation of highmem pages when reclaim is lowmem
> 
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    reclaim keeps those pages on a separate list until a reclaim for
>    highmem pages arrives and splices the highmem pages back onto the LRU.
> 
>    That would reduce the skip rate; the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab
>    pages.
> 
> 2. Linear scan lowmem pages if the initial LRU shrink fails
> 
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.

Okay. I guess it would be better to include this in the description of [4/31].

> 
> -- 

Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-04 Thread Mel Gorman
On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote:
> > > How big a highmem:lowmem ratio do you think is a problem?
> > > 
> > 
> > That's a "how long is a piece of string" type question.  The ratio does
> > not matter as much as whether the workload is both under memory pressure
> > and requires large amounts of lowmem pages. Even on systems with very high
> > ratios, it may not be a problem if HIGHPTE is enabled.
> 
> As well as page tables, pgd/kernel stack/zbud/slab allocations and so on,
> every kernel allocation that wants to mask __GFP_HIGHMEM off would be a
> problem on a 32-bit system.
> 

The same point applies -- it depends on the rate of these allocations,
not the ratio of highmem:lowmem per se.

> It also depends on how many lowmem-only drivers we have in the system.
> 
> I don't know how many such drivers there are in the world. When I simply
> do a grep, I find several cases which mask __GFP_HIGHMEM off and, among
> them, I guess DRM might be a popular one for us. However, it might be a
> really rare usecase among the various i915 usecases.
> 

It's also perfectly possible that such allocations are long-lived in which
case they are not going to cause many skips. Hence why I cannot make a
general prediction.

> > > > Conceptually, moving to node LRUs should be easier to understand. The
> > > > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > > > similarly on all nodes. 
> > > > 
> > > > The series has been tested on a 16 core UMA machine and a 2-socket 48
> > > > core NUMA machine. The UMA results are presented in most cases as the
> > > > NUMA machine behaved similarly.
> > > 
> > > I guess you would already have tested below with various highmem
> > > systems (e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind
> > > sharing it?
> > > 
> > 
> > I don't have that data; the baseline distribution used doesn't even have
> > 32-bit support. Even if it did, the results may not be that interesting.
> > The workloads used were not necessarily going to trigger lowmem pressure
> > as HIGHPTE was set on the 32-bit configs.
> 
> That means we didn't test this on 32-bit with highmem.
> 

No. I tested the skip logic and noticed that, when it was forced on
purpose, system CPU usage was higher but it functionally worked.

> I'm not sure it's really too rare a case to spend time testing.
> In fact, I really want to test the whole series on our production system,
> which is 32-bit with highmem, but as we know well, most embedded
> system kernels are rather old so backporting needs lots of time and
> care. However, if we miss testing on those systems at the moment,
> we will be surprised after 1~2 years.
> 

It would be appreciated if it could be tested on such platforms if at all
possible. Even if I did set up a 32-bit x86 system, it won't have the same
allocation/reclaim profile as the platforms you are considering.

> I don't know what kinds of benchmark we could check it with, so I cannot
> insist on it, but you might know one.
> 

One method would be to use fsmark with very large numbers of small files
to force slab to require low memory. It's not representative of many real
workloads unfortunately. Usually such a configuration is for checking the
slab shrinker is working as expected.
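To put rough numbers on why many small files force lowmem slab pressure, a back-of-envelope estimate follows. Every per-object size here is an assumption (ballpark figures for an in-memory XFS inode and a dentry; real sizes depend on kernel version and config), so treat the result as illustrative only.

```python
# Rough estimate of slab pressure from an fsmark-style small-file run.
# The per-object sizes are assumptions, not kernel-verified constants.
XFS_INODE_BYTES = 1024      # assumed in-memory xfs_inode incl. VFS inode
DENTRY_BYTES = 192          # assumed struct dentry footprint
LOWMEM_BYTES = 896 << 20    # classic 32-bit x86 lowmem limit

nfiles = 10_000_000         # e.g. a 10-million-inode fsmark run
slab = nfiles * (XFS_INODE_BYTES + DENTRY_BYTES)
print(f"slab footprint ~{slab / (1 << 30):.1f} GiB, "
      f"{slab / LOWMEM_BYTES:.1f}x of 896 MiB lowmem")
# -> slab footprint ~11.3 GiB, 12.9x of 896 MiB lowmem
```

Even with these rough figures, the candidate slab objects dwarf lowmem many times over, which is exactly the pressure the shrinker-oriented configuration is meant to create.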

> Okay, do you have any idea how to fix it if we see such a regression
> report on 32-bit systems in the future?

Two options, neither of whose complexity is justified without a "real"
workload to use as a reference.

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU.

   That would reduce the skip rate; the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-04 Thread Minchan Kim
On Mon, Jul 04, 2016 at 05:34:05AM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2016 at 10:37:03AM +0900, Minchan Kim wrote:
> > > The reason we have zone-based reclaim is that we used to have
> > > large highmem zones in common configurations and it was necessary
> > > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> > > less of a concern as machines with lots of memory will (or should) use
> > > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> > > rare. Machines that do use highmem should have relatively lower
> > > highmem:lowmem ratios than we worried about in the past.
> > 
> > Hello Mel,
> > 
> > I absolutely agree with the direction. However, I have a concern about
> > highmem systems, as you already mentioned.
> > 
> > Embedded products still use 2:1 ~ 3:1 (highmem:lowmem) ratios.
> > In such systems, LRU churning from frequently skipping other zones'
> > pages might be significant for performance.
> > 
> > How big a highmem:lowmem ratio do you think is a problem?
> > 
> 
> That's a "how long is a piece of string" type question.  The ratio does
> not matter as much as whether the workload is both under memory pressure
> and requires large amounts of lowmem pages. Even on systems with very high
> ratios, it may not be a problem if HIGHPTE is enabled.

As well as page tables, pgd/kernel stack/zbud/slab allocations and so on,
every kernel allocation that wants to mask __GFP_HIGHMEM off would be a
problem on a 32-bit system.

It also depends on how many lowmem-only drivers we have in the system.

I don't know how many such drivers there are in the world. When I simply
do a grep, I find several cases which mask __GFP_HIGHMEM off and, among
them, I guess DRM might be a popular one for us. However, it might be a
really rare usecase among the various i915 usecases.
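As a small illustration of the allocation pattern being grepped for, clearing __GFP_HIGHMEM from a gfp mask pins the allocation to lowmem. The bit values below are made up for the sketch (the real kernel constants differ); only the masking idiom mirrors the kernel's `& ~__GFP_HIGHMEM`.

```python
# Illustrative gfp-mask arithmetic with invented flag values; the real
# kernel constants and mask composition are different.
GFP_HIGHMEM = 0x02                   # stand-in for the ___GFP_HIGHMEM bit
GFP_HIGHUSER = 0x100 | GFP_HIGHMEM   # stand-in "user + highmem" mask

def is_lowmem_only(gfp):
    """An allocation that cannot use highmem must be satisfied from lowmem."""
    return not (gfp & GFP_HIGHMEM)

pinned = GFP_HIGHUSER & ~GFP_HIGHMEM  # the driver pattern a grep turns up
print(is_lowmem_only(GFP_HIGHUSER), is_lowmem_only(pinned))  # -> False True
```

Each such call site is a potential source of lowmem-only pressure on a 32-bit highmem system.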

> 
> > > 
> > > Conceptually, moving to node LRUs should be easier to understand. The
> > > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > > similarly on all nodes. 
> > > 
> > > The series has been tested on a 16 core UMA machine and a 2-socket 48
> > > core NUMA machine. The UMA results are presented in most cases as the NUMA
> > > machine behaved similarly.
> > 
> > I guess you would already have tested below with various highmem
> > systems (e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind
> > sharing it?
> > 
> 
> I don't have that data; the baseline distribution used doesn't even have
> 32-bit support. Even if it did, the results may not be that interesting.
> The workloads used were not necessarily going to trigger lowmem pressure
> as HIGHPTE was set on the 32-bit configs.

That means we didn't test this on 32-bit with highmem.

I'm not sure it's really too rare a case to spend time testing.
In fact, I really want to test the whole series on our production system,
which is 32-bit with highmem, but as we know well, most embedded
system kernels are rather old so backporting needs lots of time and
care. However, if we miss testing on those systems at the moment,
we will be surprised after 1~2 years.

I don't know what kinds of benchmark we could check it with, so I cannot
insist on it, but you might know one.

Okay, do you have any idea how to fix it if we see such a regression
report on 32-bit systems in the future?


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-03 Thread Mel Gorman
On Mon, Jul 04, 2016 at 10:37:03AM +0900, Minchan Kim wrote:
> > The reason we have zone-based reclaim is that we used to have
> > large highmem zones in common configurations and it was necessary
> > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> > less of a concern as machines with lots of memory will (or should) use
> > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> > rare. Machines that do use highmem should have relatively lower
> > highmem:lowmem ratios than we worried about in the past.
> 
> Hello Mel,
> 
> I absolutely agree with the direction. However, I have a concern about
> highmem systems, as you already mentioned.
> 
> Embedded products still use 2:1 ~ 3:1 (highmem:lowmem) ratios.
> In such systems, LRU churning from frequently skipping other zones'
> pages might be significant for performance.
> 
> How big a highmem:lowmem ratio do you think is a problem?
> 

That's a "how long is a piece of string" type question.  The ratio does
not matter as much as whether the workload is both under memory pressure
and requires large amounts of lowmem pages. Even on systems with very high
ratios, it may not be a problem if HIGHPTE is enabled.

> > 
> > Conceptually, moving to node LRUs should be easier to understand. The
> > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > similarly on all nodes. 
> > 
> > The series has been tested on a 16 core UMA machine and a 2-socket 48
> > core NUMA machine. The UMA results are presented in most cases as the NUMA
> > machine behaved similarly.
> 
> I guess you would already have tested below with various highmem systems
> (e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it?
> 

I don't have that data; the baseline distribution used doesn't even have
32-bit support. Even if it did, the results may not be that interesting.
The workloads used were not necessarily going to trigger lowmem pressure
as HIGHPTE was set on the 32-bit configs.

The skip logic has been checked and it does work. This was done during
development by forcing the "wrong" reclaim index to be used. It was
noticeable in system CPU usage and in the "skip" stats. I didn't preserve
this data.
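Conceptually, the skip check can be modelled as below. The names (reclaim_idx, zone indexes) mirror the series' sc->reclaim_idx, but this is a simplified userspace model, not the kernel code: pages from a zone higher than the reclaim index are ineligible and are counted as skipped rather than isolated.

```python
# Simplified model of the node-reclaim skip check: a page whose zone
# index exceeds the reclaim index is skipped and counted per zone.
# Zone numbering is illustrative only.
ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM = 1, 2, 3

def isolate(lru_tail, reclaim_idx):
    taken, skipped = [], {}
    for page_zone in lru_tail:
        if page_zone > reclaim_idx:            # ineligible zone: skip
            skipped[page_zone] = skipped.get(page_zone, 0) + 1
        else:
            taken.append(page_zone)
    return taken, skipped

tail = [ZONE_HIGHMEM, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_DMA32]
print(isolate(tail, ZONE_NORMAL))   # -> ([2, 1], {3: 2})

# Forcing a too-low ("wrong") reclaim index turns every page into a skip,
# which is what shows up as extra system CPU time and skip stats.
print(isolate(tail, 0))             # -> ([], {3: 2, 2: 1, 1: 1})
```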

> >                              4.7.0-rc4   4.7.0-rc4
> >                         mmotm-20160623  nodelru-v8
> > Minor Faults                    645838      644036
> > Major Faults                       573         593
> > Swap Ins                             0           0
> > Swap Outs                            0           0
> > Allocation stalls                   24           0
> > DMA allocs                           0           0
> > DMA32 allocs                  46041453    44154171
> > Normal allocs                 78053072    79865782
> > Movable allocs                       0           0
> > Direct pages scanned             10969       54504
> > Kswapd pages scanned          93375144    93250583
> > Kswapd pages reclaimed        93372243    93247714
> > Direct pages reclaimed           10969       54504
> > Kswapd efficiency                  99%         99%
> > Kswapd velocity              13741.015   13711.950
> > Direct efficiency                 100%        100%
> > Direct velocity                  1.614       8.014
> > Percentage direct scans             0%          0%
> > Zone normal velocity          8641.875   13719.964
> > Zone dma32 velocity           5100.754       0.000
> > Zone dma velocity                0.000       0.000
> > Page writes by reclaim           0.000       0.000
> > Page writes file                     0           0
> > Page writes anon                     0           0
> > Page reclaim immediate              37          54
> > 
> > kswapd activity was roughly comparable. There were differences in direct
> > reclaim activity but negligible in the context of the overall workload
> > (velocity of 8 pages per second with the patches applied, 1.6 pages per
> > second in the baseline kernel).
> 
> Hmm, nodelru's allocation stalls are zero above, so how do the direct
> page scans/reclaims happen?
> 

Good spot. It's because I used the wrong comparison script -- one that
doesn't understand the different skip and allocation stats -- and I was
looking primarily at the scanning activity. This is a corrected version:

                             4.7.0-rc4       4.7.0-rc4
                        mmotm-20160623   nodelru-v8r26
Minor Faults                    645838          643815
Major Faults                       573             493
Swap Ins                             0               0
Swap Outs                            0               0
DMA allocs                           0               0
DMA32 allocs                  46041453        44174923
Normal allocs                 78053072        79816443
Movable allocs                       0               0
Allocation stalls                   24              31
Stall zone DMA                       0               0
Stall zone DMA32                     0               0
Stall zone Normal                    0               1
Stall 

Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-03 Thread Minchan Kim
On Fri, Jul 01, 2016 at 09:01:08PM +0100, Mel Gorman wrote:
> (Sorry for the resend, I accidentally sent the branch that still had the
> Signed-off-by's from mmotm still applied which is incorrect.)
> 
> Previous releases double accounted LRU stats on the zone and the node
> because it was required by should_reclaim_retry. The last patch in the
> series removes the double accounting. It's not integrated with the series
> as reviewers may not like the solution. If not, it can be safely dropped
> without a major impact to the results.
> 
> Changelog since v7
> o Rebase onto current mmots
> o Avoid double accounting of stats in node and zone
> o Kswapd will avoid more reclaim if an eligible zone is available
> o Remove some duplications of sc->reclaim_idx and classzone_idx
> o Print per-node stats in zoneinfo
> 
> Changelog since v6
> o Correct reclaim_idx when direct reclaiming for memcg
> o Also account LRU pages per zone for compaction/reclaim
> o Add page_pgdat helper with more efficient lookup
> o Init pgdat LRU lock only once
> o Slight optimisation to wake_all_kswapds
> o Always wake kcompactd when kswapd is going to sleep
> o Rebase to mmotm as of June 15th, 2016
> 
> Changelog since v5
> o Rebase and adjust to changes
> 
> Changelog since v4
> o Rebase on top of v3 of page allocator optimisation series
> 
> Changelog since v3
> o Rebase on top of the page allocator optimisation series
> o Remove RFC tag
> 
> This is the latest version of a series that moves LRUs from the zones to
> the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
> is a current rebase, the test results were based on mmotm as of June 23rd.
> Conceptually, this series is simple but there are a lot of details. Some
> of the broad motivations for this are;
> 
> 1. The residency of a page partially depends on what zone the page was
>    allocated from.  This is partially combatted by the fair zone allocation
>    policy but that is a partial solution that introduces overhead in the
>    page allocator paths.
> 
> 2. Currently, reclaim on node 0 behaves slightly different to node 1. For
>    example, direct reclaim scans in zonelist order and reclaims even if
>    the zone is over the high watermark regardless of the age of pages
>    in that LRU. Kswapd on the other hand starts reclaim on the highest
>    unbalanced zone. A difference in the distribution of file/anon pages
>    due to when they were allocated can result in a difference in ageing.
>    While the fair zone allocation policy mitigates some of the problems
>    here, the page reclaim results on a multi-zone node will always be
>    different to a single-zone node.
> 
> 3. kswapd and the page allocator scan zones in the opposite order to
>    avoid interfering with each other, but this is sensitive to timing.
>    In the ideal case it mitigates the page allocator using pages that
>    were allocated very recently. When kswapd is allocating from lower
>    zones then it's great, but during the rebalancing of the highest
>    zone the page allocator and kswapd interfere with each other. It's
>    worse if the highest zone is small and difficult to balance.
> 
> 4. slab shrinkers are node-based which makes it harder to identify the
>    exact relationship between slab reclaim and LRU reclaim.
> 
> The reason we have zone-based reclaim is that we used to have
> large highmem zones in common configurations and it was necessary
> to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
> less of a concern as machines with lots of memory will (or should) use
> 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
> rare. Machines that do use highmem should have relatively lower
> highmem:lowmem ratios than we worried about in the past.

Hello Mel,

I absolutely agree with the direction. However, I have a concern about
highmem systems, as you already mentioned.

Embedded products still use 2:1 ~ 3:1 (highmem:lowmem) ratios.
In such systems, LRU churning from frequently skipping other zones'
pages might be significant for performance.

How big a highmem:lowmem ratio do you think is a problem?
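For context, the ratios can be worked out as follows. The 896 MiB lowmem figure is the classic 32-bit x86 split and is an assumption here; a given embedded board may use a different split, so the numbers are illustrative only.

```python
# Worked example of the highmem:lowmem ratios cited above. The 896 MiB
# lowmem value is the conventional 32-bit x86 split (an assumption for
# any specific board).
LOWMEM_MIB = 896
for total_gib in (2, 3, 4):
    highmem = total_gib * 1024 - LOWMEM_MIB
    print(f"{total_gib} GiB RAM -> highmem:lowmem ~ {highmem / LOWMEM_MIB:.1f}:1")
# -> 2 GiB RAM -> highmem:lowmem ~ 1.3:1
# -> 3 GiB RAM -> highmem:lowmem ~ 2.4:1
# -> 4 GiB RAM -> highmem:lowmem ~ 3.6:1
```

So a 3 GiB board already sits in the 2:1 ~ 3:1 range where lowmem-only allocations compete for under a third of RAM.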

> 
> Conceptually, moving to node LRUs should be easier to understand. The
> page allocator plays fewer tricks to game reclaim and reclaim behaves
> similarly on all nodes. 
> 
> The series has been tested on a 16 core UMA machine and a 2-socket 48
> core NUMA machine. The UMA results are presented in most cases as the NUMA
> machine behaved similarly.

I guess you have already run the tests below on various highmem systems
(e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing the
results?

> 
> pagealloc
> ---------
> 
> This is a microbenchmark that shows the benefit of removing the fair zone
> allocation policy. It was tested up to order-4 but only orders 0 and 1 are
> shown as the other orders were comparable.
> 
>4.7.0-rc4  


[PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-01 Thread Mel Gorman
(Sorry for the resend, I accidentally sent the branch that still had the
Signed-off-by's from mmotm still applied which is incorrect.)

Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated can result in a difference in reclaim
   behaviour depending on which node a process was scheduled on. While
   the fair zone allocation policy mitigates some of the problems here,
   the page reclaim results on a multi-zone node will always be
   different to a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other. This mitigates the page allocator
   using pages that were allocated very recently in the ideal case, but
   it's sensitive to timing. When kswapd is reclaiming from lower zones
   it works well, but during the rebalancing of the highest zone, the
   page allocator and kswapd interfere with each other. It's worse if
   the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have lower highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

   4.7.0-rc4  4.7.0-rc4
  mmotm-20160623 nodelru-v8
Min  total-odr0-1   490.00 (  0.00%)   463.00 (  5.51%)
Min  total-odr0-2   349.00 (  0.00%)   325.00 (  6.88%)
Min  total-odr0-4   288.00 (  0.00%)   272.00 (  5.56%)
Min  total-odr0-8   250.00 (  0.00%)   235.00 (  6.00%)
Min  total-odr0-16  234.00 (  0.00%)   222.00 (  5.13%)
Min  total-odr0-32  223.00 (  0.00%)   205.00 (  8.07%)
Min  total-odr0-64  217.00 (  0.00%)   202.00 (  6.91%)
Min  total-odr0-128 214.00 (  0.00%)   207.00 (  3.27%)
Min  
