Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote:
> On Thu, 12 Jul 2018 14:54:08 +0200 Michal Hocko wrote:
> > [CC Jesper - I remember he was really concerned about the worst case
> > latencies for highspeed network workloads.]
>
> Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s),
> where we are contending on the page allocator lock, in a CPU scaling
> netperf test AFAIK. I also have some special-case micro-benchmarks
> where I can hit it, but it is a micro-bench...

Thanks! Looks good.

Indeed, I simulated the page allocation rate of a 200Gbps NIC and hit a
major PCP/buddy bottleneck, where spinning on the zone lock took up to
80% CPU, with dramatic BW degradation. The test ran a relatively small
number of TCP streams (4-16) with an unpinned application (iperf).
Larger batching reduces the contention on the zone lock and improves
CPU utilization.

I also considered increasing percpu_pagelist_fraction to a larger value
(I thought of 512, see patch below), which also affects the batch size
(in pageset_set_high_and_batch).

As far as I see it, to fully solve the page allocation bottleneck for
the increasing networking speeds, the following is still required:
1) Optimize order-0 allocations (even at the cost of higher-order
   allocations).
2) A bulking API for page allocations.
3) SKB remote-release (on the originating core).

Regards,
Tariq

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 697ef8c225df..88763bd716a5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -741,9 +741,9 @@ of hot per cpu pagelists. User can specify a number like 100 to allocate
 The batch value of each per cpu pagelist is also updated as a result.
 It is set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
 
-The initial value is zero. Kernel does not use this value at boot time to set
+The initial value is 512. Kernel uses this value at boot time to set
 the high water marks for each per cpu page list. If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+sysctl, it will revert to a behavior based on batchsize calculation.
 
 ==

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..c88e8eb50bcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,7 +129,7 @@ unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;
 
-int percpu_pagelist_fraction;
+int percpu_pagelist_fraction = 512;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 /*
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On Thu, 12 Jul 2018 14:54:08 +0200 Michal Hocko wrote:
> [CC Jesper - I remember he was really concerned about the worst case
> latencies for highspeed network workloads.]

Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s),
where we are contending on the page allocator lock, in a CPU scaling
netperf test AFAIK. I also have some special-case micro-benchmarks
where I can hit it, but it is a micro-bench...

> Sorry for top posting but I do not want to torture anybody to scroll
> down the long changelog which I want to preserve for Jesper.

Thanks! It looks like a very detailed and exhaustive test for a very
small change, I'm impressed.

> I personally do not mind this change. I usually find performance
> improvements solely based on microbenchmarks without any real world
> workload numbers a bit dubious, especially when the numbers are
> single cpu vendor based (even though the number of measurements is
> quite impressive here and much better than what we can usually get).
> On the other hand this change is really simple. Should we ever find a
> regression it will be trivial to reconsider/revert.

I do think it is a good idea to increase this batch size. In network
micro-benchmarks it is usually hard to hit the page_alloc lock, because
drivers have all kinds of page recycle schemes to avoid talking to the
page allocator. Most of these schemes depend on pages getting recycled
fast enough, which is true for these micro-benchmarks. For real-world
workloads, where packets/pages are "outstanding" longer, these
page-recycle schemes break down and drivers start to talk to the page
allocator. Thus, this change might help networking for real workloads
(but will not show up in network micro-benchmarks).

> One could argue that the size of the batch should scale with the
> number of CPUs or even the uarch, but considering how much we suck in
> that regard and that the differences are not that large, I agree that
> simply bumping up the number is the most viable way forward.
>
> That being said, feel free to add
> Acked-by: Michal Hocko

I will also happily ACK this patch, you can add:

Acked-by: Jesper Dangaard Brouer

> Thanks. The whole patch including the changelog follows below.
>
> On Wed 11-07-18 13:58:55, Aaron Lu wrote:
> > To improve the page allocator's performance for order-0 pages, each
> > CPU has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page
> > is needed, the PCP will be checked first before asking for pages
> > from Buddy. When the PCP is used up, a batch of pages will be
> > fetched from Buddy, and the size of the batch can affect
> > performance.
> >
> > The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
> > page_alloc: increase size of per-cpu-pages") over ten years ago.
> > Since then, CPUs have evolved a lot and CPU cache sizes have also
> > increased.
> >
> > Dave Hansen is concerned that the current batch size doesn't fit
> > well with modern hardware and suggested I do two things: first, use
> > a page allocator intensive benchmark, e.g.
> > will-it-scale/page_fault1, to find out how performance changes with
> > different batch sizes on various machines and then choose a new
> > default batch size; second, see how this new batch size works with
> > other workloads.
> >
> > From the first test, we saw performance gains on high-core-count
> > systems and little to no effect on older systems with more modest
> > core counts. From this phase's test data, two candidates, 63 and
> > 127, were chosen.
> >
> > In the second step, ebizzy, oltp, kbuild, pigz, netperf,
> > vm-scalability and more will-it-scale sub-tests were run to see how
> > these two candidates work with those workloads, and a new default
> > was decided according to their results.
> >
> > Most test results are flat. will-it-scale/page_fault2 process mode
> > has a 10%-18% performance increase on 4-socket Skylake and
> > Broadwell. vm-scalability/lru-file-mmap-read has a 17%-47%
> > performance increase for 4-socket servers, while for 2-socket
> > servers it caused a 3%-8% performance drop. Further analysis showed
> > that, with a larger pcp->batch and thus larger pcp->high (the
> > relationship pcp->high = 6 * pcp->batch is maintained in this
> > patch), zone lock contention shifted to LRU-add lock contention and
> > that caused the performance drop. This performance drop might be
> > mitigated by others' work on optimizing the LRU lock.
> >
> > Another downside of increasing pcp->batch is that, when the PCP is
> > used up and a batch of pages must be fetched from Buddy, that fetch
> > can take longer than before since the batch is larger. My
> > understanding is that this doesn't affect the slowpath, where
> > direct reclaim and compaction dominate. For the fastpath,
> > throughput is a win (according to will-it-scale/page_fault1) but
> > worst-case latency can be larger now.
> >
> > Overall, I think doubling the batch size from 31 to 63 is
> > relatively safe and provides a good performance boost for
> > high-core-count systems.
> >
> > The
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
[CC Jesper - I remember he was really concerned about the worst case
latencies for highspeed network workloads.]

Sorry for top posting but I do not want to torture anybody to scroll
down the long changelog which I want to preserve for Jesper.

I personally do not mind this change. I usually find performance
improvements solely based on microbenchmarks without any real world
workload numbers a bit dubious, especially when the numbers are single
cpu vendor based (even though the number of measurements is quite
impressive here and much better than what we can usually get). On the
other hand, this change is really simple. Should we ever find a
regression, it will be trivial to reconsider/revert.

One could argue that the size of the batch should scale with the number
of CPUs or even the uarch, but considering how much we suck in that
regard and that the differences are not that large, I agree that simply
bumping up the number is the most viable way forward.

That being said, feel free to add
Acked-by: Michal Hocko

Thanks. The whole patch including the changelog follows below.

On Wed 11-07-18 13:58:55, Aaron Lu wrote:
> To improve the page allocator's performance for order-0 pages, each
> CPU has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is
> needed, the PCP will be checked first before asking for pages from
> Buddy. When the PCP is used up, a batch of pages will be fetched from
> Buddy, and the size of the batch can affect performance.
>
> The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
> page_alloc: increase size of per-cpu-pages") over ten years ago.
> Since then, CPUs have evolved a lot and CPU cache sizes have also
> increased.
>
> Dave Hansen is concerned that the current batch size doesn't fit well
> with modern hardware and suggested I do two things: first, use a page
> allocator intensive benchmark, e.g. will-it-scale/page_fault1, to
> find out how performance changes with different batch sizes on
> various machines and then choose a new default batch size; second,
> see how this new batch size works with other workloads.
>
> From the first test, we saw performance gains on high-core-count
> systems and little to no effect on older systems with more modest
> core counts. From this phase's test data, two candidates, 63 and 127,
> were chosen.
>
> In the second step, ebizzy, oltp, kbuild, pigz, netperf,
> vm-scalability and more will-it-scale sub-tests were run to see how
> these two candidates work with those workloads, and a new default was
> decided according to their results.
>
> Most test results are flat. will-it-scale/page_fault2 process mode
> has a 10%-18% performance increase on 4-socket Skylake and Broadwell.
> vm-scalability/lru-file-mmap-read has a 17%-47% performance increase
> for 4-socket servers, while for 2-socket servers it caused a 3%-8%
> performance drop. Further analysis showed that, with a larger
> pcp->batch and thus larger pcp->high (the relationship
> pcp->high = 6 * pcp->batch is maintained in this patch), zone lock
> contention shifted to LRU-add lock contention and that caused the
> performance drop. This performance drop might be mitigated by others'
> work on optimizing the LRU lock.
>
> Another downside of increasing pcp->batch is that, when the PCP is
> used up and a batch of pages must be fetched from Buddy, that fetch
> can take longer than before since the batch is larger. My
> understanding is that this doesn't affect the slowpath, where direct
> reclaim and compaction dominate. For the fastpath, throughput is a
> win (according to will-it-scale/page_fault1) but worst-case latency
> can be larger now.
>
> Overall, I think doubling the batch size from 31 to 63 is relatively
> safe and provides a good performance boost for high-core-count
> systems.
>
> The two phases' test results are listed below (all tests were done
> with THP disabled).
>
> Phase one (will-it-scale/page_fault1) test results:
>
> Skylake-EX: the increased batch size has a good effect on zone->lock
> contention, though LRU contention will rise at the same time, which
> limited the final performance increase.
>
> batch   score      change    zone_contention  lru_contention  total_contention
> 31      15345900   +0.00%    64%              8%              72%
> 53      17903847   +16.67%   32%              38%             70%
> 63      17992886   +17.25%   24%              45%             69%
> 73      18022825   +17.44%   10%              61%             71%
> 119     18023401   +17.45%   4%               66%             70%
> 127     18029012   +17.48%   3%               66%             69%
> 137     18036075   +17.53%   4%               66%             70%
> 165     18035964   +17.53%   2%               67%             69%
> 188     18101105   +17.95%   2%               67%             69%
> 223     18130951   +18.15%   2%               67%             69%
> 255     18118898   +18.07%   2%               67%             69%
> 267     18101559   +17.96%   2%               67%             69%
> 299
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On Thu, 12 Jul 2018 01:40:41 + "Lu, Aaron" wrote:
> Thanks Andrew.
> I think the credit goes to Dave Hansen

Oh. In that case, I take it all back. The patch sucks!
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On Wed, 2018-07-11 at 14:35 -0700, Andrew Morton wrote:
> On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu wrote:
> > [550 lines of changelog]
>
> OK, I'm convinced ;) That was a lot of work - thanks for being
> exhaustive.

Thanks Andrew. I think the credit goes to Dave Hansen, since he has
been very careful about this change and wanted me to do all those
2nd-phase tests to make sure we didn't get any surprises after doubling
the batch size :-)

I think the LKP robot will run even more tests to capture possible
regressions, if any.
Re: [RFC PATCH] mm, page_alloc: double zone's batchsize
On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu wrote:
> [550 lines of changelog]

OK, I'm convinced ;) That was a lot of work - thanks for being
exhaustive.

Of course, not all the world is x86, but I think we can be confident
that other architectures are unlikely to be harmed by the change, at
least.
[RFC PATCH] mm, page_alloc: double zone's batchsize
To improve the page allocator's performance for order-0 pages, each CPU
has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is
needed, the PCP will be checked first before asking for pages from
Buddy. When the PCP is used up, a batch of pages will be fetched from
Buddy, and the size of the batch can affect performance.

The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago. Since
then, CPUs have evolved a lot and CPU cache sizes have also increased.

Dave Hansen is concerned that the current batch size doesn't fit well
with modern hardware and suggested I do two things: first, use a page
allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size works with other workloads.

From the first test, we saw performance gains on high-core-count
systems and little to no effect on older systems with more modest core
counts. From this phase's test data, two candidates, 63 and 127, were
chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests were run to see how these two
candidates work with those workloads, and a new default was decided
according to their results.

Most test results are flat. will-it-scale/page_fault2 process mode has
a 10%-18% performance increase on 4-socket Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has a 17%-47% performance increase
for 4-socket servers, while for 2-socket servers it caused a 3%-8%
performance drop. Further analysis showed that, with a larger
pcp->batch and thus larger pcp->high (the relationship
pcp->high = 6 * pcp->batch is maintained in this patch), zone lock
contention shifted to LRU-add lock contention and that caused the
performance drop. This performance drop might be mitigated by others'
work on optimizing the LRU lock.

Another downside of increasing pcp->batch is that, when the PCP is used
up and a batch of pages must be fetched from Buddy, that fetch can take
longer than before since the batch is larger. My understanding is that
this doesn't affect the slowpath, where direct reclaim and compaction
dominate. For the fastpath, throughput is a win (according to
will-it-scale/page_fault1) but worst-case latency can be larger now.

Overall, I think doubling the batch size from 31 to 63 is relatively
safe and provides a good performance boost for high-core-count systems.

The two phases' test results are listed below (all tests were done with
THP disabled).

Phase one (will-it-scale/page_fault1) test results:

Skylake-EX: the increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time, which
limited the final performance increase.

batch   score      change    zone_contention  lru_contention  total_contention
31      15345900   +0.00%    64%              8%              72%
53      17903847   +16.67%   32%              38%             70%
63      17992886   +17.25%   24%              45%             69%
73      18022825   +17.44%   10%              61%             71%
119     18023401   +17.45%   4%               66%             70%
127     18029012   +17.48%   3%               66%             69%
137     18036075   +17.53%   4%               66%             70%
165     18035964   +17.53%   2%               67%             69%
188     18101105   +17.95%   2%               67%             69%
223     18130951   +18.15%   2%               67%             69%
255     18118898   +18.07%   2%               67%             69%
267     18101559   +17.96%   2%               67%             69%
299     18160468   +18.34%   2%               68%             70%
320     18139845   +18.21%   2%               67%             69%
393     18160869   +18.34%   2%               68%             70%
424     18170999   +18.41%   2%               68%             70%
458     18144868   +18.24%   2%               68%             70%
467     18142366   +18.22%   2%               68%             70%
498     18154549   +18.30%   1%               68%             69%
511     18134525   +18.17%   1%               69%             70%

Broadwell-EX: similar pattern as Skylake-EX.

batch   score      change    zone_contention  lru_contention  total_contention
31      16703983   +0.00%    67%              7%              74%
53      18195393   +8.93%    43%              28%             71%
63      1825       +9.49%    38%              33%             71%
73      18344329   +9.82%    35%              37%             72%
119     18535529   +10.96%   24%              46%             70%
127     18513596   +10.83%   23%              48%             71%
137     18514327   +10.84%   23%              48%             71%
165     18511840   +10.82%   22%              49%             71%
188     18593478   +11.31%   17%              53%