Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 11, 2016 at 10:02:24AM +0100, Mel Gorman wrote:
> On Mon, Jul 11, 2016 at 10:47:57AM +1000, Dave Chinner wrote:
> > > I had tested XFS with earlier releases and noticed no major problems
> > > so later releases tested only one filesystem. Given the changes since,
> > > a retest is desirable. I've posted the current version of the series
> > > but I'll queue the tests to run over the weekend. They are quite time
> > > consuming to run unfortunately.
> >
> > Understood. I'm not following the patchset all that closely, so I
> > didn't know you'd already tested XFS.
> >
> It was needed anyway. Not all of them completed over the weekend. In
> particular, the NUMA machine is taking its time because many of the
> workloads are scaled by memory size and it takes longer.
>
> > > On the fsmark configuration, I configured the test to use 4K files
> > > instead of the 0-sized files that normally would be used to stress
> > > inode creation/deletion. This is to have a mix of page cache and slab
> > > allocations. Shout if this does not suit your expectations.
> >
> > Sounds fine. I usually limit that test to 10 million inodes - that's
> > my "10-4" test.
> >
> Thanks.
>
> I'm not going to go through most of the results in detail. The raw data
> is verbose and not necessarily useful in most cases.

Yup, the numbers look pretty good and all my concerns have gone away.
Thanks for testing, Mel! :P

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 11, 2016 at 10:47:57AM +1000, Dave Chinner wrote:
> > I had tested XFS with earlier releases and noticed no major problems
> > so later releases tested only one filesystem. Given the changes since,
> > a retest is desirable. I've posted the current version of the series
> > but I'll queue the tests to run over the weekend. They are quite time
> > consuming to run unfortunately.
>
> Understood. I'm not following the patchset all that closely, so I
> didn't know you'd already tested XFS.
>

It was needed anyway. Not all of them completed over the weekend. In
particular, the NUMA machine is taking its time because many of the
workloads are scaled by memory size and it takes longer.

> > On the fsmark configuration, I configured the test to use 4K files
> > instead of the 0-sized files that normally would be used to stress
> > inode creation/deletion. This is to have a mix of page cache and slab
> > allocations. Shout if this does not suit your expectations.
>
> Sounds fine. I usually limit that test to 10 million inodes - that's
> my "10-4" test.
>

Thanks.

I'm not going to go through most of the results in detail. The raw data
is verbose and not necessarily useful in most cases.

tiobench
	Similar results to ext4: similar performance, similar reclaim
	activity.

pgbench
	Similar performance results to ext4. Minor differences in reclaim
	activity. The series did enter direct reclaim, which the mmotm
	kernel did not, but it was one minor spike. kswapd activity was
	almost identical.

bonnie
	Similar performance results to ext4, minor differences in reclaim
	activity.

parallel dd
	Similar performance results to ext4. Small differences in reclaim
	activity. Again, there was a slight increase in direct reclaim
	activity but it was negligible in comparison to the overall
	workload. Average direct reclaim velocity was 1.8 pages per second
	and direct reclaim page scans were 0.018% of all scans.

stutter
	Similar performance results to ext4, similar reclaim activity.

These observations are all based on two UMA machines.

fsmark 50m-inodes-4k-files-16-threads
=====================================

As fsmark can be variable, this is reported as quartiles. This is one of
the UMA machines:

                                      4.7.0-rc4            4.7.0-rc4
                                 mmotm-20160623          approx-v9r6
Min       files/sec-16     2354.80 (  0.00%)    2255.40 ( -4.22%)
1st-qrtle files/sec-16     3254.90 (  0.00%)    3249.40 ( -0.17%)
2nd-qrtle files/sec-16     3310.10 (  0.00%)    3306.70 ( -0.10%)
3rd-qrtle files/sec-16     3353.40 (  0.00%)    3329.00 ( -0.73%)
Max-90%   files/sec-16     3435.70 (  0.00%)    3426.90 ( -0.26%)
Max-93%   files/sec-16     3437.80 (  0.00%)    3462.50 (  0.72%)
Max-95%   files/sec-16     3471.60 (  0.00%)    3536.50 (  1.87%)
Max-99%   files/sec-16     5383.90 (  0.00%)    5900.00 (  9.59%)
Max       files/sec-16     5383.90 (  0.00%)    5900.00 (  9.59%)
Mean      files/sec-16     3342.99 (  0.00%)    3329.64 ( -0.40%)

                              4.7.0-rc4      4.7.0-rc4
                         mmotm-20160623    approx-v9r6
User                             188.46         187.14
System                          2964.26        2972.35
Elapsed                        10222.83        9865.87
Direct pages scanned             144365         189738
Kswapd pages scanned           13147349       12965288
Kswapd pages reclaimed         13144543       12962266
Direct pages reclaimed           144365         189738
Kswapd efficiency                   99%            99%
Kswapd velocity                1286.077       1314.156
Direct efficiency                  100%           100%
Direct velocity                  14.122         19.232
Percentage direct scans              1%             1%
Slabs scanned                  52563968       52672128
Direct inode steals                 132             24
Kswapd inode steals               18234          12096

The performance is comparable and so is slab reclaim activity. The NUMA
machine had completed the same test. On the NUMA machine, there is also
a slight increase in direct reclaim activity but as a tiny percentage
overall. Slab scan and reclaim activity is almost identical.

fsmark 50m-inodes-0k-files-16-threads
=====================================

I also tested with zero-sized files. The UMA machine showed nothing
interesting; the NUMA machine results were as follows:

                                       4.7.0-rc4             4.7.0-rc4
                                  mmotm-20160623           approx-v9r6
Min       files/sec-16   108235.50 (  0.00%)   120783.20 ( 11.59%)
1st-qrtle files/sec-16   129569.40 (  0.00%)   132300.70 (  2.11%)
2nd-qrtle files/sec-16   135544.90 (  0.00%)   141198.40 (  4.17%)
3rd-qrtle files/sec-16   139634.90 (  0.00%)   148242.50 (  6.16%)
Max-90%   files/sec-16   144203.60 (  0.00%)   152247.10 (  5.58%)
Max-93%   files/sec-16   145294.50 (  0.00%)   152642.20 (  5.06%)
Max-95%   files/sec-16
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Fri, Jul 08, 2016 at 10:52:03AM +0100, Mel Gorman wrote:
> On Fri, Jul 08, 2016 at 09:27:13AM +1000, Dave Chinner wrote:
> > ....
> > > This series is not without its hazards. There are at least three
> > > areas that I'm concerned with even though I could not reproduce any
> > > problems in that area.
> > >
> > > 1. Reclaim/compaction is going to be affected because the amount of
> > >    reclaim is no longer targeted at a specific zone. Compaction works
> > >    on a per-zone basis so there is no guarantee that reclaiming a few
> > >    THP's worth of pages will have a positive impact on compaction
> > >    success rates.
> > >
> > > 2. The Slab/LRU reclaim ratio is affected because the frequency the
> > >    shrinkers are called is now different. This may or may not be a
> > >    problem but if it is, it'll be because shrinkers are not called
> > >    enough and some balancing is required.
> >
> > Given that XFS has a much more complex set of shrinkers and has a
> > much more finely tuned balancing between LRU and shrinker reclaim,
> > I'd be interested to see if you get the same results on XFS for the
> > tests you ran on ext4. It might also be worth running some highly
> > concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
> > concurrent fsmark tests) to see what impact heavy slab cache
> > pressure has on shrinker behaviour and system balance...
> >
> I had tested XFS with earlier releases and noticed no major problems
> so later releases tested only one filesystem. Given the changes since,
> a retest is desirable. I've posted the current version of the series
> but I'll queue the tests to run over the weekend. They are quite time
> consuming to run unfortunately.

Understood. I'm not following the patchset all that closely, so I
didn't know you'd already tested XFS.

> On the fsmark configuration, I configured the test to use 4K files
> instead of the 0-sized files that normally would be used to stress
> inode creation/deletion. This is to have a mix of page cache and slab
> allocations. Shout if this does not suit your expectations.

Sounds fine. I usually limit that test to 10 million inodes - that's
my "10-4" test.

> Finally, not all the machines I'm using can store 50 million inodes
> of this size. The benchmark has been configured to use as many inodes
> as it estimates will fit in the disk. In all cases, it'll exert memory
> pressure. Unfortunately, the storage is simple so there is no guarantee
> it'll find all problems but that's standard unfortunately.

Yup. But it's really the system balance that matters, and if the
balance is maintained then XFS will optimise the IO patterns to get
decent throughput regardless of the storage (i.e. the 10-4 test should
still run at tens of MB/s on a single spinning disk).

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Fri, Jul 08, 2016 at 09:27:13AM +1000, Dave Chinner wrote:
> ....
> > This series is not without its hazards. There are at least three areas
> > that I'm concerned with even though I could not reproduce any problems
> > in that area.
> >
> > 1. Reclaim/compaction is going to be affected because the amount of
> >    reclaim is no longer targeted at a specific zone. Compaction works
> >    on a per-zone basis so there is no guarantee that reclaiming a few
> >    THP's worth of pages will have a positive impact on compaction
> >    success rates.
> >
> > 2. The Slab/LRU reclaim ratio is affected because the frequency the
> >    shrinkers are called is now different. This may or may not be a
> >    problem but if it is, it'll be because shrinkers are not called
> >    enough and some balancing is required.
>
> Given that XFS has a much more complex set of shrinkers and has a
> much more finely tuned balancing between LRU and shrinker reclaim,
> I'd be interested to see if you get the same results on XFS for the
> tests you ran on ext4. It might also be worth running some highly
> concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
> concurrent fsmark tests) to see what impact heavy slab cache
> pressure has on shrinker behaviour and system balance...
>

I had tested XFS with earlier releases and noticed no major problems
so later releases tested only one filesystem. Given the changes since,
a retest is desirable. I've posted the current version of the series
but I'll queue the tests to run over the weekend. They are quite time
consuming to run unfortunately.

On the fsmark configuration, I configured the test to use 4K files
instead of the 0-sized files that normally would be used to stress
inode creation/deletion. This is to have a mix of page cache and slab
allocations. Shout if this does not suit your expectations.

Finally, not all the machines I'm using can store 50 million inodes
of this size. The benchmark has been configured to use as many inodes
as it estimates will fit in the disk. In all cases, it'll exert memory
pressure. Unfortunately, the storage is simple so there is no guarantee
it'll find all problems but that's standard unfortunately.

Thanks.

--
Mel Gorman
SUSE Labs
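[ For readers unfamiliar with the workload, a 16-way 4K-file fsmark run
  along these lines could be invoked roughly as follows. This is only a
  sketch: the thread count and file size come from the discussion above,
  but the per-directory file count and the mount point are placeholders,
  not the actual mmtests configuration. ]

```shell
#!/bin/sh
# Hypothetical fs_mark invocation approximating the described setup:
# 16 concurrent writers (one target directory each), 4K files, no sync
# between creates (-S0), enough files to exert memory pressure.
FILE_SIZE=4096            # 4K files: mixes page cache and slab pressure
FILES_PER_DIR=3125000     # ~50 million inodes total (scaled to the disk)
TARGET=/mnt/testfs        # placeholder mount point

CMD="fs_mark -S0 -s $FILE_SIZE -n $FILES_PER_DIR"
for i in $(seq 1 16); do
    CMD="$CMD -d $TARGET/dir$i"
done

# Dry-run: print the command rather than executing it here.
echo "$CMD"
```

One directory per thread is the usual way to get 16-way concurrency out
of fs_mark; the actual harness also loops the run to produce the
per-iteration files/sec samples summarised as quartiles.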
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Fri, Jul 01, 2016 at 04:37:15PM +0100, Mel Gorman wrote:
> Previous releases double accounted LRU stats on the zone and the node
> because it was required by should_reclaim_retry. The last patch in the
> series removes the double accounting. It's not integrated with the
> series as reviewers may not like the solution. If not, it can be safely
> dropped without a major impact to the results.
>
> tiobench on ext4
> [snip other tests on ext4 which show good results]
....
> This series is not without its hazards. There are at least three areas
> that I'm concerned with even though I could not reproduce any problems
> in that area.
>
> 1. Reclaim/compaction is going to be affected because the amount of
>    reclaim is no longer targeted at a specific zone. Compaction works
>    on a per-zone basis so there is no guarantee that reclaiming a few
>    THP's worth of pages will have a positive impact on compaction
>    success rates.
>
> 2. The Slab/LRU reclaim ratio is affected because the frequency the
>    shrinkers are called is now different. This may or may not be a
>    problem but if it is, it'll be because shrinkers are not called
>    enough and some balancing is required.

Given that XFS has a much more complex set of shrinkers and has a
much more finely tuned balancing between LRU and shrinker reclaim,
I'd be interested to see if you get the same results on XFS for the
tests you ran on ext4. It might also be worth running some highly
concurrent inode cache benchmarks (e.g. the 50-million inode, 16-way
concurrent fsmark tests) to see what impact heavy slab cache
pressure has on shrinker behaviour and system balance...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 04, 2016 at 10:55:09AM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote:
> > > > How big a ratio between highmem:lowmem do you think is a problem?
> > >
> > > That's a "how long is a piece of string" type question. The ratio
> > > does not matter as much as whether the workload is both under memory
> > > pressure and requires large amounts of lowmem pages. Even on systems
> > > with very high ratios, it may not be a problem if HIGHPTE is enabled.
> >
> > As well as page tables, pgd/kernel stack/zbud/slab and so on, every
> > kernel allocation that wants to mask __GFP_HIGHMEM off would be a
> > problem on a 32-bit system.
> >
> The same point applies -- it depends on the rate of these allocations,
> not the ratio of highmem:lowmem per se.
>
> > It also depends on how many lowmem-only drivers we have in the system.
> >
> > I don't know how many such drivers there are in the world. When I
> > simply do a grep, I find several cases which mask __GFP_HIGHMEM off
> > and among them, I guess DRM might be popular for us. However, it might
> > be a really rare usecase among the various i915 usecases.
> >
> It's also perfectly possible that such allocations are long-lived in
> which case they are not going to cause many skips. Hence why I cannot
> make a general prediction.
>
> > > > > Conceptually, moving to node LRUs should be easier to understand.
> > > > > The page allocator plays fewer tricks to game reclaim and reclaim
> > > > > behaves similarly on all nodes.
> > > > >
> > > > > The series has been tested on a 16 core UMA machine and a
> > > > > 2-socket 48 core NUMA machine. The UMA results are presented in
> > > > > most cases as the NUMA machine behaved similarly.
> > > >
> > > > I guess you have already tested this with various highmem systems
> > > > (e.g. 2:1, 3:1, 4:1 and so on). If you have, would you mind
> > > > sharing it?
> > >
> > > I haven't that data, the baseline distribution used doesn't even have
> > > 32-bit support. Even if it had, the results may not be that
> > > interesting. The workloads used were not necessarily going to trigger
> > > lowmem pressure as HIGHPTE was set on the 32-bit configs.
> >
> > That means we didn't test this on 32-bit with highmem.
> >
> No. I tested the skip logic and noticed that when it was forced on
> purpose, system CPU usage was higher but it functionally worked.

Yeb, it would work well functionally. I meant not functionally but from
a performance point of view: system CPU usage, major fault rate and so
on.

> > I'm not sure it's really too rare a case to spend time testing.
> > In fact, I really want to test the whole series on our production
> > system, which is 32-bit with highmem, but as we know well, most
> > embedded system kernels are rather old so backporting needs lots of
> > time and care. However, if we miss testing on those systems now, we
> > will be surprised in 1~2 years.
> >
> It would be appreciated if it could be tested on such platforms if at
> all possible. Even if I did set up a 32-bit x86 system, it won't have
> the same allocation/reclaim profile as the platforms you are
> considering.

Yeb. I just finished reviewing all the patches and found no *big*
problem with my brain, so my remaining homework is just testing, which
should find whatever my brain has missed. I will give backporting to
the old 32-bit production kernel a shot and report if something strange
happens.

Thanks for the great work, Mel!

> > I don't know what kinds of benchmark we can use to check it so I
> > cannot insist on it, but you might know.
> >
> One method would be to use fsmark with very large numbers of small
> files to force slab to require low memory. It's not representative of
> many real workloads unfortunately. Usually such a configuration is for
> checking the slab shrinker is working as expected.

Thanks for the suggestion.

> > Okay, do you have any idea how to fix it if we see such a regression
> > report on a 32-bit system in future?
> >
> Two options, neither whose complexity is justified without a "real"
> workload to use as a reference.
>
> 1. Long-term isolation of highmem pages when reclaim is lowmem
>
>    When pages are skipped, they are immediately added back onto the LRU
>    list. If lowmem reclaim persisted for long periods of time, the same
>    highmem pages get continually scanned. The idea would be that lowmem
>    keeps those pages on a separate list until a reclaim for highmem
>    pages arrives that splices the highmem pages back onto the LRU.
>
>    That would reduce the skip rate; the potential corner case is that
>    highmem pages have to be scanned and reclaimed to free lowmem slab
>    pages.
>
> 2. Linear scan lowmem pages if the initial LRU shrink fails
>
>    This will break LRU ordering but may be preferable and faster during
>    memory pressure than skipping LRU pages.

Okay. I guess it would be better to include this in the description of
[4/31].
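[ Option 1 above can be sketched as a toy user-space simulation. This is
  illustrative only -- real kernel code works with list_head, lruvecs and
  the isolate/putback machinery, none of which appear here, and every
  name below is invented for the sketch. ]

```c
#include <stdlib.h>

/* Toy page: in this model, highmem pages cannot satisfy lowmem reclaim. */
struct page {
	int highmem;
	struct page *next;
};

/* Minimal singly-linked list standing in for an LRU. */
struct pglist {
	struct page *head;
};

static void push(struct pglist *l, struct page *p)
{
	p->next = l->head;
	l->head = p;
}

static struct page *pop(struct pglist *l)
{
	struct page *p = l->head;

	if (p)
		l->head = p->next;
	return p;
}

/*
 * Lowmem reclaim pass. Instead of putting skipped highmem pages straight
 * back on the LRU (where the next lowmem scan would rescan them), park
 * them on a separate "skipped" list -- the long-term isolation idea.
 */
static int reclaim_lowmem(struct pglist *lru, struct pglist *skipped,
			  int nr_to_reclaim)
{
	struct page *p;
	int reclaimed = 0;

	while (reclaimed < nr_to_reclaim && (p = pop(lru)) != NULL) {
		if (p->highmem) {
			push(skipped, p);	/* park, do not rescan */
		} else {
			free(p);		/* "reclaim" the lowmem page */
			reclaimed++;
		}
	}
	return reclaimed;
}

/*
 * A highmem-capable reclaim splices the parked pages back onto the LRU
 * so they become visible to scanning again.
 */
static void splice_skipped(struct pglist *lru, struct pglist *skipped)
{
	struct page *p;

	while ((p = pop(skipped)) != NULL)
		push(lru, p);
}
```

The corner case Mel mentions is visible even in the toy model: while
pages sit on the skipped list nothing ages or reclaims them, so a
highmem-capable reclaim must eventually run to splice them back.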
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 04, 2016 at 10:55:09AM +0100, Mel Gorman wrote: > On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote: > > > > How big ratio between highmem:lowmem do you think a problem? > > > > > > > > > > That's a "how long is a piece of string" type question. The ratio does > > > not matter as much as whether the workload is both under memory pressure > > > and requires large amounts of lowmem pages. Even on systems with very high > > > ratios, it may not be a problem if HIGHPTE is enabled. > > > > As well page table, pgd/kernelstack/zbud/slab and so on, every kernel > > allocations wanted to mask __GFP_HIGHMEM off would be a problem in > > 32bit system. > > > > The same point applies -- it depends on the rate of these allocations, > not the ratio of highmem:lowmem per se. > > > It also depends on that how many drivers needed lowmem only we have > > in the system. > > > > I don't know how many such driver in the world. When I simply do grep, > > I found several cases which mask __GFP_HIGHMEM off and among them, > > I guess DRM might be a popular for us. However, it might be really rare > > usecase among various i915 usecases. > > > > It's also perfectly possible that such allocations are long-lived in which > case they are not going to cause many skips. Hence why I cannot make a > general prediction. > > > > > > Conceptually, moving to node LRUs should be easier to understand. The > > > > > page allocator plays fewer tricks to game reclaim and reclaim behaves > > > > > similarly on all nodes. > > > > > > > > > > The series has been tested on a 16 core UMA machine and a 2-socket 48 > > > > > core NUMA machine. The UMA results are presented in most cases as the > > > > > NUMA > > > > > machine behaved similarly. > > > > > > > > I guess you would already test below with various highmem system(e.g., > > > > 2:1, 3:1, 4:1 and so on). If you have, could you mind sharing it? 
> > > > > > > > > I don't have that data; the baseline distribution used doesn't even have > > > 32-bit support. Even if it did, the results may not be that interesting. > > > The workloads used were not necessarily going to trigger lowmem pressure > > > as HIGHPTE was set on the 32-bit configs. > > > > That means we didn't test this on 32-bit with highmem. > > > > No. I tested the skip logic and noticed that, when forced on purpose, > system CPU usage was higher but it functionally worked. Yeb, it would work well functionally. I meant not functionally but from a performance point of view: system CPU usage and major fault rate and so on. > > > I'm not sure it's really too rare a case to spend time testing. > > In fact, I really want to test the whole series on our production system, > > which is 32-bit with highmem, but as we know well, most embedded > > system kernels are rather old, so backporting needs lots of time and > > care. However, if we skip testing on those systems now, > > we will be surprised after 1~2 years. > > > > It would be appreciated if it could be tested on such platforms if at all > possible. Even if I did set up a 32-bit x86 system, it won't have the same > allocation/reclaim profile as the platforms you are considering. Yeb. I just finished reviewing all the patches and found no *big* problem with my brain, so my remaining homework is just testing, which will find what my brain has missed. I will give backporting to the old 32-bit production kernel a shot and report if something strange happens. Thanks for the great work, Mel! > > > I don't know what kinds of benchmark we can use to check it, so I cannot > > insist on it, but you might know. > > > > One method would be to use fsmark with very large numbers of small files > to force slab to require low memory. It's not representative of many real > workloads unfortunately. Usually such a configuration is for checking that the > slab shrinker is working as expected. Thanks for the suggestion. 
> > > Okay, do you have any idea to fix it if we see such a regression report > > from a 32-bit system in the future? > > Two options, neither of which has complexity that is justified without a "real" > workload to use as a reference. > > 1. Long-term isolation of highmem pages when reclaim is lowmem > > When pages are skipped, they are immediately added back onto the LRU > list. If lowmem reclaim persisted for long periods of time, the same > highmem pages get continually scanned. The idea would be that lowmem > keeps those pages on a separate list until a reclaim for highmem pages > arrives that splices the highmem pages back onto the LRU. > > That would reduce the skip rate; the potential corner case is that > highmem pages have to be scanned and reclaimed to free lowmem slab pages. > > 2. Linear scan lowmem pages if the initial LRU shrink fails > > This will break LRU ordering but may be preferable and faster during > memory pressure than skipping LRU pages. Okay. I guess it would be better to include this in the description of [4/31]. > > --
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote: > > > How big a highmem:lowmem ratio do you think is a problem? > > > > > > > That's a "how long is a piece of string" type question. The ratio does > > not matter as much as whether the workload is both under memory pressure > > and requires large amounts of lowmem pages. Even on systems with very high > > ratios, it may not be a problem if HIGHPTE is enabled. > > As well as page tables, pgd/kernel stack/zbud/slab and so on, every kernel > allocation that wants to mask __GFP_HIGHMEM off would be a problem on a > 32-bit system. > The same point applies -- it depends on the rate of these allocations, not the ratio of highmem:lowmem per se. > It also depends on how many lowmem-only drivers we have > in the system. > > I don't know how many such drivers exist in the world. When I simply do a grep, > I find several cases which mask __GFP_HIGHMEM off and, among them, > I guess DRM might be a popular one for us. However, it might be a really rare > usecase among various i915 usecases. > It's also perfectly possible that such allocations are long-lived, in which case they are not going to cause many skips. Hence why I cannot make a general prediction. > > > > Conceptually, moving to node LRUs should be easier to understand. The > > > > page allocator plays fewer tricks to game reclaim and reclaim behaves > > > > similarly on all nodes. > > > > > > > > The series has been tested on a 16 core UMA machine and a 2-socket 48 > > > > core NUMA machine. The UMA results are presented in most cases as the > > > > NUMA > > > > machine behaved similarly. > > > I guess you have already tested below with various highmem systems (e.g., > > > 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it? > > > > > > > I don't have that data; the baseline distribution used doesn't even have > > 32-bit support. Even if it did, the results may not be that interesting. 
> > The workloads used were not necessarily going to trigger lowmem pressure > > as HIGHPTE was set on the 32-bit configs. > > That means we didn't test this on 32-bit with highmem. > No. I tested the skip logic and noticed that, when forced on purpose, system CPU usage was higher but it functionally worked. > I'm not sure it's really too rare a case to spend time testing. > In fact, I really want to test the whole series on our production system, > which is 32-bit with highmem, but as we know well, most embedded > system kernels are rather old, so backporting needs lots of time and > care. However, if we skip testing on those systems now, > we will be surprised after 1~2 years. > It would be appreciated if it could be tested on such platforms if at all possible. Even if I did set up a 32-bit x86 system, it won't have the same allocation/reclaim profile as the platforms you are considering. > I don't know what kinds of benchmark we can use to check it, so I cannot > insist on it, but you might know. > One method would be to use fsmark with very large numbers of small files to force slab to require low memory. It's not representative of many real workloads unfortunately. Usually such a configuration is for checking that the slab shrinker is working as expected. > Okay, do you have any idea to fix it if we see such a regression report > from a 32-bit system in the future? Two options, neither of which has complexity that is justified without a "real" workload to use as a reference. 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. 
That would reduce the skip rate; the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. -- Mel Gorman SUSE Labs
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 04, 2016 at 05:34:05AM +0100, Mel Gorman wrote: > On Mon, Jul 04, 2016 at 10:37:03AM +0900, Minchan Kim wrote: > > > The reason we have zone-based reclaim is that we used to have > > > large highmem zones in common configurations and it was necessary > > > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much > > > less of a concern as machines with lots of memory will (or should) use > > > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are > > > rare. Machines that do use highmem should have relatively lower > > > highmem:lowmem > > > ratios than we worried about in the past. > > > > Hello Mel, > > > > I absolutely agree with the direction. However, I have a concern about highmem > > systems, as you already mentioned. > > > > Embedded products still use a 2~3 highmem:lowmem ratio. > > On such systems, LRU churning from frequently skipping other zones' pages > > might be significant for performance. > > > > How big a highmem:lowmem ratio do you think is a problem? > > > > That's a "how long is a piece of string" type question. The ratio does > not matter as much as whether the workload is both under memory pressure > and requires large amounts of lowmem pages. Even on systems with very high > ratios, it may not be a problem if HIGHPTE is enabled. As well as page tables, pgd/kernel stack/zbud/slab and so on, every kernel allocation that wants to mask __GFP_HIGHMEM off would be a problem on a 32-bit system. It also depends on how many lowmem-only drivers we have in the system. I don't know how many such drivers exist in the world. When I simply do a grep, I find several cases which mask __GFP_HIGHMEM off and, among them, I guess DRM might be a popular one for us. However, it might be a really rare usecase among various i915 usecases. > > > > > > > Conceptually, moving to node LRUs should be easier to understand. The > > > page allocator plays fewer tricks to game reclaim and reclaim behaves > > > similarly on all nodes. 
> > > > > > The series has been tested on a 16 core UMA machine and a 2-socket 48 > > > core NUMA machine. The UMA results are presented in most cases as the NUMA > > > machine behaved similarly. > > > > I guess you have already tested below with various highmem systems (e.g., > > 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it? > > > > I don't have that data; the baseline distribution used doesn't even have > 32-bit support. Even if it did, the results may not be that interesting. > The workloads used were not necessarily going to trigger lowmem pressure > as HIGHPTE was set on the 32-bit configs. That means we didn't test this on 32-bit with highmem. I'm not sure it's really too rare a case to spend time testing. In fact, I really want to test the whole series on our production system, which is 32-bit with highmem, but as we know well, most embedded system kernels are rather old, so backporting needs lots of time and care. However, if we skip testing on those systems now, we will be surprised after 1~2 years. I don't know what kinds of benchmark we can use to check it, so I cannot insist on it, but you might know. Okay, do you have any idea to fix it if we see such a regression report from a 32-bit system in the future?
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Mon, Jul 04, 2016 at 10:37:03AM +0900, Minchan Kim wrote: > > The reason we have zone-based reclaim is that we used to have > > large highmem zones in common configurations and it was necessary > > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much > > less of a concern as machines with lots of memory will (or should) use > > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are > > rare. Machines that do use highmem should have relatively lower highmem:lowmem > > ratios than we worried about in the past. > > Hello Mel, > > I absolutely agree with the direction. However, I have a concern about highmem > systems, as you already mentioned. > > Embedded products still use a 2~3 highmem:lowmem ratio. > On such systems, LRU churning from frequently skipping other zones' pages > might be significant for performance. > > How big a highmem:lowmem ratio do you think is a problem? > That's a "how long is a piece of string" type question. The ratio does not matter as much as whether the workload is both under memory pressure and requires large amounts of lowmem pages. Even on systems with very high ratios, it may not be a problem if HIGHPTE is enabled. > > > > Conceptually, moving to node LRUs should be easier to understand. The > > page allocator plays fewer tricks to game reclaim and reclaim behaves > > similarly on all nodes. > > > > The series has been tested on a 16 core UMA machine and a 2-socket 48 > > core NUMA machine. The UMA results are presented in most cases as the NUMA > > machine behaved similarly. > > I guess you have already tested below with various highmem systems (e.g., > 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it? > I don't have that data; the baseline distribution used doesn't even have 32-bit support. Even if it did, the results may not be that interesting. The workloads used were not necessarily going to trigger lowmem pressure as HIGHPTE was set on the 32-bit configs. 
The skip logic has been checked and it does work. This was done during development, by forcing the "wrong" reclaim index to use. It was noticeable in system CPU usage and in the "skip" stats. I didn't preserve this data.

> >                                4.7.0-rc4      4.7.0-rc4
> >                           mmotm-20160623     nodelru-v8
> > Minor Faults                      645838         644036
> > Major Faults                         573            593
> > Swap Ins                               0              0
> > Swap Outs                              0              0
> > Allocation stalls                     24              0
> > DMA allocs                             0              0
> > DMA32 allocs                    46041453       44154171
> > Normal allocs                   78053072       79865782
> > Movable allocs                         0              0
> > Direct pages scanned               10969          54504
> > Kswapd pages scanned            93375144       93250583
> > Kswapd pages reclaimed          93372243       93247714
> > Direct pages reclaimed             10969          54504
> > Kswapd efficiency                    99%            99%
> > Kswapd velocity                13741.015      13711.950
> > Direct efficiency                   100%           100%
> > Direct velocity                    1.614          8.014
> > Percentage direct scans               0%             0%
> > Zone normal velocity            8641.875      13719.964
> > Zone dma32 velocity             5100.754          0.000
> > Zone dma velocity                  0.000          0.000
> > Page writes by reclaim             0.000          0.000
> > Page writes file                       0              0
> > Page writes anon                       0              0
> > Page reclaim immediate                37             54
> >
> > kswapd activity was roughly comparable. There were differences in direct
> > reclaim activity but negligible in the context of the overall workload
> > (velocity of 8 pages per second with the patches applied, 1.6 pages per
> > second in the baseline kernel).

> Hmm, nodelru's allocation stall is zero above but how does direct page
> scanning/reclaiming happen?

Good spot, it's because I used the wrong comparison script -- one that doesn't understand the different skip and allocation stats and I was looking primarily at the scanning activity. This is the correct version

                               4.7.0-rc4      4.7.0-rc4
                          mmotm-20160623  nodelru-v8r26
Minor Faults                      645838         643815
Major Faults                         573            493
Swap Ins                               0              0
Swap Outs                              0              0
DMA allocs                             0              0
DMA32 allocs                    46041453       44174923
Normal allocs                   78053072       79816443
Movable allocs                         0              0
Allocation stalls                     24             31
Stall zone DMA                         0              0
Stall zone DMA32                       0              0
Stall zone Normal                      0              1
Stall
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Fri, Jul 01, 2016 at 09:01:08PM +0100, Mel Gorman wrote: > (Sorry for the resend, I accidentally sent the branch that still had the > Signed-off-by's from mmotm still applied which is incorrect.) > > Previous releases double accounted LRU stats on the zone and the node > because it was required by should_reclaim_retry. The last patch in the > series removes the double accounting. It's not integrated with the series > as reviewers may not like the solution. If not, it can be safely dropped > without a major impact to the results. > > Changelog since v7 > o Rebase onto current mmots > o Avoid double accounting of stats in node and zone > o Kswapd will avoid more reclaim if an eligible zone is available > o Remove some duplications of sc->reclaim_idx and classzone_idx > o Print per-node stats in zoneinfo > > Changelog since v6 > o Correct reclaim_idx when direct reclaiming for memcg > o Also account LRU pages per zone for compaction/reclaim > o Add page_pgdat helper with more efficient lookup > o Init pgdat LRU lock only once > o Slight optimisation to wake_all_kswapds > o Always wake kcompactd when kswapd is going to sleep > o Rebase to mmotm as of June 15th, 2016 > > Changelog since v5 > o Rebase and adjust to changes > > Changelog since v4 > o Rebase on top of v3 of page allocator optimisation series > > Changelog since v3 > o Rebase on top of the page allocator optimisation series > o Remove RFC tag > > This is the latest version of a series that moves LRUs from the zones to > the node that is based upon 4.7-rc4 with Andrew's tree applied. While this > is a current rebase, the test results were based on mmotm as of June 23rd. > Conceptually, this series is simple but there are a lot of details. Some > of the broad motivations for this are: > > 1. The residency of a page partially depends on what zone the page was > allocated from. 
This is partially combatted by the fair zone allocation > policy but that is a partial solution that introduces overhead in the > page allocator paths. > > 2. Currently, reclaim on node 0 behaves slightly differently to node 1. For > example, direct reclaim scans in zonelist order and reclaims even if > the zone is over the high watermark regardless of the age of pages > in that LRU. Kswapd on the other hand starts reclaim on the highest > unbalanced zone. A difference in distribution of file/anon pages due > to when they were allocated can result in a difference in > ageing. While the fair zone allocation policy mitigates some of the > problems here, the page reclaim results on a multi-zone node will > always be different to a single-zone node. > > 3. kswapd and the page allocator scan zones in the opposite order to > avoid interfering with each other but it's sensitive to timing. This > mitigates the page allocator using pages that were allocated very recently > in the ideal case. When kswapd is allocating > from lower zones then it's great but during the rebalancing of the highest > zone, the page allocator and kswapd interfere with each other. It's worse > if the highest zone is small and difficult to balance. > > 4. slab shrinkers are node-based which makes it harder to identify the exact > relationship between slab reclaim and LRU reclaim. > > The reason we have zone-based reclaim is that we used to have > large highmem zones in common configurations and it was necessary > to quickly find ZONE_NORMAL pages for reclaim. Today, this is much > less of a concern as machines with lots of memory will (or should) use > 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are > rare. Machines that do use highmem should have relatively lower highmem:lowmem > ratios than we worried about in the past. Hello Mel, I absolutely agree with the direction. 
However, I have a concern about highmem systems, as you already mentioned. Embedded products still use a 2~3 highmem:lowmem ratio. On such systems, LRU churning from frequently skipping other zones' pages might be significant for performance. How big a highmem:lowmem ratio do you think is a problem? > > Conceptually, moving to node LRUs should be easier to understand. The > page allocator plays fewer tricks to game reclaim and reclaim behaves > similarly on all nodes. > > The series has been tested on a 16 core UMA machine and a 2-socket 48 > core NUMA machine. The UMA results are presented in most cases as the NUMA > machine behaved similarly. I guess you have already tested below with various highmem systems (e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing it? > > pagealloc > - > > This is a microbenchmark that shows the benefit of removing the fair zone > allocation policy. It was tested up to order-4 but only orders 0 and 1 are > shown as the other orders were comparable. > > 4.7.0-rc4
Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
On Fri, Jul 01, 2016 at 09:01:08PM +0100, Mel Gorman wrote:
> (Sorry for the resend, I accidentally sent the branch that still had the
> Signed-off-by's from mmotm still applied, which is incorrect.)
>
> Previous releases double accounted LRU stats on the zone and the node
> because it was required by should_reclaim_retry. The last patch in the
> series removes the double accounting. It's not integrated with the series
> as reviewers may not like the solution. If not, it can be safely dropped
> without a major impact to the results.
>
> Changelog since v7
> o Rebase onto current mmots
> o Avoid double accounting of stats in node and zone
> o Kswapd will avoid more reclaim if an eligible zone is available
> o Remove some duplications of sc->reclaim_idx and classzone_idx
> o Print per-node stats in zoneinfo
>
> Changelog since v6
> o Correct reclaim_idx when direct reclaiming for memcg
> o Also account LRU pages per zone for compaction/reclaim
> o Add page_pgdat helper with more efficient lookup
> o Init pgdat LRU lock only once
> o Slight optimisation to wake_all_kswapds
> o Always wake kcompactd when kswapd is going to sleep
> o Rebase to mmotm as of June 15th, 2016
>
> Changelog since v5
> o Rebase and adjust to changes
>
> Changelog since v4
> o Rebase on top of v3 of page allocator optimisation series
>
> Changelog since v3
> o Rebase on top of the page allocator optimisation series
> o Remove RFC tag
>
> This is the latest version of a series that moves LRUs from the zones to
> the node and is based upon 4.7-rc4 with Andrew's tree applied. While this
> is a current rebase, the test results were based on mmotm as of June 23rd.
> Conceptually, this series is simple but there are a lot of details. Some
> of the broad motivations for this are:
>
> 1. The residency of a page partially depends on what zone the page was
>    allocated from. The fair zone allocation policy combats this, but it
>    is only a partial solution and it introduces overhead in the page
>    allocator paths.
>
> 2. Currently, reclaim on node 0 behaves slightly differently to node 1.
>    For example, direct reclaim scans in zonelist order and reclaims even
>    if the zone is over the high watermark, regardless of the age of pages
>    in that LRU. Kswapd, on the other hand, starts reclaim on the highest
>    unbalanced zone. A difference in the distribution of file/anon pages
>    due to when they were allocated can result in a difference in ageing.
>    While the fair zone allocation policy mitigates some of the problems
>    here, the page reclaim results on a multi-zone node will always be
>    different to a single-zone node.
>
> 3. kswapd and the page allocator scan zones in the opposite order to
>    avoid interfering with each other, but this is sensitive to timing.
>    In the ideal case this stops the page allocator using pages that were
>    allocated very recently. When kswapd is allocating from lower zones it
>    works well, but during the rebalancing of the highest zone the page
>    allocator and kswapd interfere with each other. It's worse if the
>    highest zone is small and difficult to balance.
>
> 4. slab shrinkers are node-based, which makes it harder to identify the
>    exact relationship between slab reclaim and LRU reclaim.
>
> The reason we have zone-based reclaim is that we used to have large
> highmem zones in common configurations and it was necessary to quickly
> find ZONE_NORMAL pages for reclaim. Today, this is much less of a
> concern as machines with lots of memory will (or should) use 64-bit
> kernels. Combinations of 32-bit hardware and 64-bit hardware are rare.
> Machines that do use highmem should have relatively lower highmem:lowmem
> ratios than we worried about in the past.

Hello Mel,

I absolutely agree with the direction.
However, I have a concern about highmem systems, as you already mentioned.
Embedded products still use highmem:lowmem ratios of around 2:1 to 3:1. On
such systems, the LRU churn caused by frequently skipping pages from other
zones might be significant for performance. How big a highmem:lowmem ratio
do you think would be a problem?

> Conceptually, moving to node LRUs should be easier to understand. The
> page allocator plays fewer tricks to game reclaim and reclaim behaves
> similarly on all nodes.
>
> The series has been tested on a 16 core UMA machine and a 2-socket 48
> core NUMA machine. The UMA results are presented in most cases as the
> NUMA machine behaved similarly.

I guess you have already tested the below with various highmem systems
(e.g., 2:1, 3:1, 4:1 and so on). If you have, would you mind sharing the
results?

> pagealloc
> ---------
>
> This is a microbenchmark that shows the benefit of removing the fair zone
> allocation policy. It was tested up to order-4 but only orders 0 and 1 are
> shown as the other orders were comparable.
>
>                            4.7.0-rc4
[PATCH 00/31] Move LRU page reclaim from zones to nodes v8
(Sorry for the resend, I accidentally sent the branch that still had the
Signed-off-by's from mmotm still applied, which is incorrect.)

Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node and is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was
   allocated from. The fair zone allocation policy combats this, but it
   is only a partial solution and it introduces overhead in the page
   allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently to node 1.
   For example, direct reclaim scans in zonelist order and reclaims even
   if the zone is over the high watermark, regardless of the age of pages
   in that LRU. Kswapd, on the other hand, starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages
   due to when they were allocated can result in a difference in ageing.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always be
   different to a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other, but this is sensitive to timing.
   In the ideal case this stops the page allocator using pages that were
   allocated very recently. When kswapd is allocating from lower zones it
   works well, but during the rebalancing of the highest zone the page
   allocator and kswapd interfere with each other. It's worse if the
   highest zone is small and difficult to balance.

4. slab shrinkers are node-based, which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large
highmem zones in common configurations and it was necessary to quickly
find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern
as machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and 64-bit hardware are rare. Machines
that do use highmem should have relatively lower highmem:lowmem ratios
than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the
NUMA machine behaved similarly.
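As an aside, the uneven ageing described in point 2 can be illustrated
with a toy model (plain Python, not kernel code; the zone names and page
counts are made up for illustration): pages are allocated alternately to
two zones, and reclaiming from only one zone evicts pages that are
globally younger than pages still resident in the other.

```python
from collections import deque

# Each zone keeps its own LRU. Lower number == allocated earlier (older).
zones = {"Highmem": deque(), "Normal": deque()}
for age, zone in zip(range(8), ["Highmem", "Normal"] * 4):
    zones[zone].appendleft(age)  # newest page sits at the left

# Zone-based reclaim: evict the two oldest pages of Normal only.
evicted = [zones["Normal"].pop() for _ in range(2)]

print(evicted)                 # pages 1 and 3 go, though globally young
print(list(zones["Highmem"]))  # the globally oldest page (0) survives
```

A node-wide LRU would instead have evicted the two globally oldest pages
(0 and 1), regardless of which zone they came from.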
pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
Min      total-odr0-1      490.00 (  0.00%)      463.00 (  5.51%)
Min      total-odr0-2      349.00 (  0.00%)      325.00 (  6.88%)
Min      total-odr0-4      288.00 (  0.00%)      272.00 (  5.56%)
Min      total-odr0-8      250.00 (  0.00%)      235.00 (  6.00%)
Min      total-odr0-16     234.00 (  0.00%)      222.00 (  5.13%)
Min      total-odr0-32     223.00 (  0.00%)      205.00 (  8.07%)
Min      total-odr0-64     217.00 (  0.00%)      202.00 (  6.91%)
Min      total-odr0-128    214.00 (  0.00%)      207.00 (  3.27%)
Min      total-odr0-256    242.00 (  0.00%)      242.00 (  0.00%)
Min      total-odr0-512    272.00 (  0.00%)
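The percentage column is simply the relative improvement of nodelru-v8
over the mmotm-20160623 baseline; as a quick sanity check (an illustrative
script, not part of the posted results), it can be recomputed from the raw
values:

```python
def gain(baseline, nodelru):
    """Relative improvement in percent, as reported in the table."""
    return round((baseline - nodelru) / baseline * 100, 2)

# (baseline, nodelru) pairs from a few rows of the table above
rows = [(490.00, 463.00), (223.00, 205.00), (214.00, 207.00)]
print([gain(b, n) for b, n in rows])  # [5.51, 8.07, 3.27]
```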