Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-12 Thread Tariq Toukan




On 12/07/2018 4:55 PM, Jesper Dangaard Brouer wrote:

On Thu, 12 Jul 2018 14:54:08 +0200
Michal Hocko  wrote:


[CC Jesper - I remember he was really concerned about the worst case
  latencies for highspeed network workloads.]


Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s)
where we are contending on the page allocator lock, in a CPU scaling
netperf test AFAIK.  I also have some special-case micro-benchmarks
where I can hit it, but it is a micro-bench...



Thanks! Looks good.

Indeed, I simulated the page allocation rate of a 200Gbps NIC, and hit a 
major PCP/buddy bottleneck, where spinning on the zone lock took up to 80% 
CPU, with dramatic BW degradation.


The test ran a relatively small number of TCP streams (4-16) with an 
unpinned application (iperf).


Larger batching reduces the contention on the zone lock and improves 
CPU utilization. I also considered increasing percpu_pagelist_fraction to a 
larger value (I thought of 512, see patch below), which also affects the 
batch size (in pageset_set_high_and_batch).
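
To make the effect concrete, here is a small standalone sketch of that
arithmetic as I understand it (pcp->high becomes a 1/fraction share of the
zone's managed pages, and batch follows as high/4 with the cap mentioned in
the vm.txt hunk below). It is a userspace illustration with an assumed 4K
PAGE_SHIFT and a hypothetical 16 GB zone, not the kernel's
pageset_set_high_and_batch():

#include <stdio.h>

#define PAGE_SHIFT 12   /* assumption: 4K pages */

/* high = managed_pages / fraction, per the sysctl description */
static unsigned long pcp_high(unsigned long managed_pages, int fraction)
{
        return managed_pages / fraction;
}

/* batch = high / 4, clamped to [1, PAGE_SHIFT * 8], per the vm.txt text */
static unsigned long pcp_batch(unsigned long high)
{
        unsigned long batch = high / 4;

        if (batch < 1)
                batch = 1;
        if (batch > PAGE_SHIFT * 8)
                batch = PAGE_SHIFT * 8;
        return batch;
}

int main(void)
{
        /* hypothetical zone: 16 GB of managed 4K pages */
        unsigned long managed = 16UL << (30 - PAGE_SHIFT);
        unsigned long high = pcp_high(managed, 512);

        printf("high=%lu pages, batch=%lu pages\n", high, pcp_batch(high));
        return 0;
}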


As far as I see it, to totally solve the page allocation bottleneck for 
the increasing networking speeds, the following is still required:
1) optimize order-0 allocations (even at the cost of higher-order 
allocations).
2) a bulking API for page allocations (a rough sketch follows below).
3) do SKB remote-release (on the originating core).
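
To make item 2 concrete, here is a purely hypothetical sketch of what a bulk
order-0 API could look like. It is a userspace mock with made-up names and
malloc() standing in for the page allocator; nothing like this is implied to
exist in the tree, the point is only that one call amortizes the allocator
locking over a whole RX-refill burst:

#include <stdio.h>
#include <stdlib.h>

#define MOCK_PAGE_SIZE 4096

/* Fill 'pages' with up to 'nr' "pages"; return how many were allocated. */
static unsigned int alloc_pages_bulk_sketch(unsigned int nr, void **pages)
{
        unsigned int i;

        /* A real implementation would refill from the PCP/buddy under a
         * single zone->lock acquisition; this mock just loops over malloc(). */
        for (i = 0; i < nr; i++) {
                pages[i] = malloc(MOCK_PAGE_SIZE);
                if (!pages[i])
                        break;
        }
        return i;
}

int main(void)
{
        void *ring[64];
        unsigned int got = alloc_pages_bulk_sketch(64, ring);

        printf("refilled %u RX buffers in one call\n", got);
        while (got)
                free(ring[--got]);
        return 0;
}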

Regards,
Tariq

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 697ef8c225df..88763bd716a5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -741,9 +741,9 @@ of hot per cpu pagelists.  User can specify a number like 100 to allocate
 The batch value of each per cpu pagelist is also updated as a result.  It is
 set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

-The initial value is zero.  Kernel does not use this value at boot time to set
+The initial value is 512.  Kernel uses this value at boot time to set
 the high water marks for each per cpu page list.  If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+sysctl, it will revert to a behavior based on batchsize calculation.

 ==

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..c88e8eb50bcb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,7 +129,7 @@
 unsigned long totalreserve_pages __read_mostly;
 unsigned long totalcma_pages __read_mostly;

-int percpu_pagelist_fraction;
+int percpu_pagelist_fraction = 512;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

 /*


Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-12 Thread Jesper Dangaard Brouer
On Thu, 12 Jul 2018 14:54:08 +0200
Michal Hocko  wrote:

> [CC Jesper - I remember he was really concerned about the worst case
>  latencies for highspeed network workloads.]

Cc. Tariq as he has hit some networking benchmarks (around 100Gbit/s)
where we are contending on the page allocator lock, in a CPU scaling
netperf test AFAIK.  I also have some special-case micro-benchmarks
where I can hit it, but it is a micro-bench...

> Sorry for top posting but I do not want to torture anybody to scroll
> down the long changelog which I want to preserve for Jesper.

Thanks! - It looks like a very detailed and exhaustive test for a very
small change, I'm impressed.

> I personally do not mind this change. I usually find performance
> improvements based solely on microbenchmarks, without any real-world
> workload numbers, a bit dubious. Especially when the numbers come from a
> single CPU vendor (even though the number of measurements is quite
> impressive here and much better than what we can usually get). On
> the other hand this change is really simple. Should we ever find a
> regression it will be trivial to reconsider/revert.

I do think it is a good idea to increase this batch size.

In network micro-benchmarks it is usually hard to hit the page_alloc
lock, because drivers have all kinds of page-recycle schemes to avoid
talking to the page allocator.  Most of these schemes depend on pages
getting recycled fast enough, which is true for these micro-benchmarks,
but for real-world workloads, where packets/pages are "outstanding"
longer, these page-recycle schemes break down, and drivers start to
talk to the page allocator.  Thus, this change might help networking
for real workloads (but will not show up in network micro-benchmarks).
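
As a simplified illustration of such a recycle scheme (a userspace mock with
made-up names, malloc()/free() standing in for the page allocator; real
drivers differ in the details), the fallback path in rx_get_page() is what
starts getting exercised once pages stay outstanding too long:

#include <stdlib.h>

#define RECYCLE_RING_SIZE 256
#define MOCK_PAGE_SIZE 4096

struct recycle_cache {
        void *ring[RECYCLE_RING_SIZE];
        unsigned int count;
};

/* RX refill: prefer a recycled page, fall back to the "page allocator". */
static void *rx_get_page(struct recycle_cache *c)
{
        if (c->count)
                return c->ring[--c->count];     /* fast path: recycled page */
        return malloc(MOCK_PAGE_SIZE);          /* slow path: hits the allocator */
}

/* Packet freed: put the page back on the ring if there is room. */
static void rx_put_page(struct recycle_cache *c, void *page)
{
        if (c->count < RECYCLE_RING_SIZE)
                c->ring[c->count++] = page;
        else
                free(page);
}

int main(void)
{
        struct recycle_cache cache = { 0 };
        void *p = rx_get_page(&cache);  /* cache empty: falls back to malloc */

        rx_put_page(&cache, p);         /* recycled ... */
        p = rx_get_page(&cache);        /* ... and reused without the allocator */
        free(p);
        return 0;
}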


> One could argue that the size of the batch should scale with the number
> of CPUs or even the uarch, but considering how much we suck in that
> regard and that the differences are not that large, I agree that simply
> bumping up the number is the most viable way forward.
> 
> That being said, feel free to add
> Acked-by: Michal Hocko 

I will also happily ACK this patch, you can add:

Acked-by: Jesper Dangaard Brouer 


> Thanks. The whole patch including the changelog follows below.
> 
> On Wed 11-07-18 13:58:55, Aaron Lu wrote:
> > To improve the page allocator's performance for order-0 pages, each CPU has
> > a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed,
> > the PCP will be checked first before asking Buddy for pages. When the PCP is
> > used up, a batch of pages will be fetched from Buddy to improve
> > performance, and the size of this batch can affect performance.
> > 
> > The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
> > page_alloc: increase size of per-cpu-pages") over ten years ago. Since
> > then, CPUs have evolved a lot and CPU cache sizes have also increased.
> > 
> > Dave Hansen is concerned that the current batch size doesn't fit well with
> > modern hardware and suggested I do two things: first, use a page
> > allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
> > out how performance changes with different batch sizes on various
> > machines and then choose a new default batch size; second, see how
> > this new batch size works with other workloads.
> > 
> > From the first test, we saw performance gains on high-core-count systems
> > and little to no effect on older systems with more modest core counts.
> > From this phase's test data, two candidates, 63 and 127, were chosen.
> > 
> > In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
> > and more will-it-scale sub-tests are run to see how these two
> > candidates work with these workloads, and a new default is decided
> > according to their results.
> > 
> > Most test results are flat. will-it-scale/page_fault2 process mode has
> > a 10%-18% performance increase on 4-socket Skylake and Broadwell.
> > vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for
> > 4-socket servers, while for 2-socket servers it caused a 3%-8%
> > performance drop. Further analysis showed that, with a larger pcp->batch
> > and thus larger pcp->high (the relationship pcp->high = 6 * pcp->batch
> > is maintained in this patch), zone lock contention shifted to LRU add
> > side lock contention and that caused the performance drop. This performance
> > drop might be mitigated by others' work on optimizing the LRU lock.
> > 
> > Another downside of increasing pcp->batch is that, when the PCP is used up
> > and a batch of pages needs to be fetched from Buddy, that refill can take
> > longer than before since the batch is larger. My understanding is this
> > doesn't affect the slowpath, where direct reclaim and compaction dominate.
> > For the fastpath, throughput is a win (according to
> > will-it-scale/page_fault1) but worst-case latency can be larger now.
> > 
> > Overall, I think doubling the batch size from 31 to 63 is relatively
> > safe and provides a good performance boost for high-core-count systems.
> > 
> > The 

Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-12 Thread Michal Hocko
[CC Jesper - I remember he was really concerned about the worst case
 latencies for highspeed network workloads.]

Sorry for top posting but I do not want to torture anybody to scroll
down the long changelog which I want to preserve for Jesper.

I personally do not mind this change. I usually find performance
improvements based solely on microbenchmarks, without any real-world
workload numbers, a bit dubious. Especially when the numbers come from a
single CPU vendor (even though the number of measurements is quite
impressive here and much better than what we can usually get). On
the other hand this change is really simple. Should we ever find a
regression it will be trivial to reconsider/revert.

One could argue that the size of the batch should scale with the number
of CPUs or even the uarch, but considering how much we suck in that
regard and that the differences are not that large, I agree that simply
bumping up the number is the most viable way forward.

That being said, feel free to add
Acked-by: Michal Hocko 

Thanks. The whole patch including the changelog follows below.

On Wed 11-07-18 13:58:55, Aaron Lu wrote:
> To improve the page allocator's performance for order-0 pages, each CPU has
> a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed,
> the PCP will be checked first before asking Buddy for pages. When the PCP is
> used up, a batch of pages will be fetched from Buddy to improve
> performance, and the size of this batch can affect performance.
> 
> The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
> page_alloc: increase size of per-cpu-pages") over ten years ago. Since
> then, CPUs have evolved a lot and CPU cache sizes have also increased.
> 
> Dave Hansen is concerned that the current batch size doesn't fit well with
> modern hardware and suggested I do two things: first, use a page
> allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
> out how performance changes with different batch sizes on various
> machines and then choose a new default batch size; second, see how
> this new batch size works with other workloads.
> 
> From the first test, we saw performance gains on high-core-count systems
> and little to no effect on older systems with more modest core counts.
> From this phase's test data, two candidates, 63 and 127, were chosen.
> 
> In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
> and more will-it-scale sub-tests are run to see how these two
> candidates work with these workloads, and a new default is decided
> according to their results.
> 
> Most test results are flat. will-it-scale/page_fault2 process mode has
> a 10%-18% performance increase on 4-socket Skylake and Broadwell.
> vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for
> 4-socket servers, while for 2-socket servers it caused a 3%-8%
> performance drop. Further analysis showed that, with a larger pcp->batch
> and thus larger pcp->high (the relationship pcp->high = 6 * pcp->batch
> is maintained in this patch), zone lock contention shifted to LRU add
> side lock contention and that caused the performance drop. This performance
> drop might be mitigated by others' work on optimizing the LRU lock.
> 
> Another downside of increasing pcp->batch is that, when the PCP is used up
> and a batch of pages needs to be fetched from Buddy, that refill can take
> longer than before since the batch is larger. My understanding is this
> doesn't affect the slowpath, where direct reclaim and compaction dominate.
> For the fastpath, throughput is a win (according to
> will-it-scale/page_fault1) but worst-case latency can be larger now.
> 
> Overall, I think doubling the batch size from 31 to 63 is relatively
> safe and provides a good performance boost for high-core-count systems.
> 
> The two phases' test results are listed below (all tests were done with
> THP disabled).
> 
> Phase one (will-it-scale/page_fault1) test results:
> 
> Skylake-EX: the increased batch size has a good effect on zone->lock
> contention, though LRU contention rises at the same time and
> limits the final performance increase.
> 
> batch   score change   zone_contention   lru_contention   total_contention
>  31   15345900+0.00%   64% 8%   72%
>  53   17903847   +16.67%   32%38%   70%
>  63   17992886   +17.25%   24%45%   69%
>  73   18022825   +17.44%   10%61%   71%
> 119   18023401   +17.45%4%66%   70%
> 127   18029012   +17.48%3%66%   69%
> 137   18036075   +17.53%4%66%   70%
> 165   18035964   +17.53%2%67%   69%
> 188   18101105   +17.95%2%67%   69%
> 223   18130951   +18.15%2%67%   69%
> 255   18118898   +18.07%2%67%   69%
> 267   18101559   +17.96%2%67%   69%
> 299   

Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-11 Thread Andrew Morton
On Thu, 12 Jul 2018 01:40:41 + "Lu, Aaron"  wrote:

> Thanks Andrew.
> I think the credit goes to Dave Hansen

Oh.  In that case, I take it all back.  The patch sucks!



Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-11 Thread Lu, Aaron
On Wed, 2018-07-11 at 14:35 -0700, Andrew Morton wrote:
> On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu  wrote:
> 
> > [550 lines of changelog]
> 
> OK, I'm convinced ;)  That was a lot of work - thanks for being exhaustive.

Thanks Andrew.
I think the credit goes to Dave Hansen since he has been very careful
about this change and wanted me to do all those 2nd-phase tests to
make sure we didn't get any surprises after doubling the batch size :-)

I think the LKP robot will run even more tests to capture possible
regressions, if any.

Re: [RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-11 Thread Andrew Morton
On Wed, 11 Jul 2018 13:58:55 +0800 Aaron Lu  wrote:

> [550 lines of changelog]

OK, I'm convinced ;)  That was a lot of work - thanks for being exhaustive.

Of course, not all the world is x86 but I think we can be confident
that other architectures are unlikely to be harmed by the change, at least.



[RFC PATCH] mm, page_alloc: double zone's batchsize

2018-07-10 Thread Aaron Lu
To improve the page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed,
the PCP will be checked first before asking Buddy for pages. When the PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance, and the size of this batch can affect performance.
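
As a rough illustration of that fast path, here is a simplified userspace
sketch (made-up names, malloc() standing in for Buddy; not the actual
mm/page_alloc.c code):

#include <stdio.h>
#include <stdlib.h>

#define MOCK_PCP_MAX 1024

struct mock_pcp {
        unsigned int count;     /* pages currently on the per-CPU list */
        unsigned int batch;     /* how many to pull from "Buddy" at a time */
        void *pages[MOCK_PCP_MAX];
};

/* Stand-in for the buddy refill; in the kernel this runs under zone->lock,
 * which is why the batch size matters for lock contention. */
static unsigned int mock_buddy_refill(void **pages, unsigned int n)
{
        unsigned int i;

        for (i = 0; i < n; i++) {
                pages[i] = malloc(4096);
                if (!pages[i])
                        break;
        }
        return i;
}

static void *mock_alloc_order0(struct mock_pcp *pcp)
{
        if (!pcp->count) {
                /* PCP empty: one refill brings in a whole batch, so a larger
                 * batch means fewer (but longer) zone->lock hold periods. */
                pcp->count = mock_buddy_refill(pcp->pages, pcp->batch);
                if (!pcp->count)
                        return NULL;
        }
        return pcp->pages[--pcp->count];        /* fast path: no lock taken */
}

int main(void)
{
        struct mock_pcp pcp = { .count = 0, .batch = 63 };
        void *page = mock_alloc_order0(&pcp);

        printf("got page %p, %u left on the mock PCP\n", page, pcp.count);
        return 0;
}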

The zone's batch size was last doubled by commit ba56e91c9401 ("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago. Since
then, CPUs have evolved a lot and CPU cache sizes have also increased.

Dave Hansen is concerned that the current batch size doesn't fit well with
modern hardware and suggested I do two things: first, use a page
allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how
this new batch size works with other workloads.

From the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
From this phase's test data, two candidates, 63 and 127, were chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are run to see how these two
candidates work with these workloads, and a new default is decided
according to their results.

Most test results are flat. will-it-scale/page_fault2 process mode has
a 10%-18% performance increase on 4-socket Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for
4-socket servers, while for 2-socket servers it caused a 3%-8%
performance drop. Further analysis showed that, with a larger pcp->batch
and thus larger pcp->high (the relationship pcp->high = 6 * pcp->batch
is maintained in this patch), zone lock contention shifted to LRU add
side lock contention and that caused the performance drop. This performance
drop might be mitigated by others' work on optimizing the LRU lock.
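
For a sense of scale (assuming 4 KB pages): batch = 31 gives
pcp->high = 6 * 31 = 186 pages, roughly 744 KB that can sit on one CPU's
PCP per zone, while batch = 63 gives pcp->high = 378 pages, roughly 1.5 MB
per CPU per zone.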

Another downside of increasing pcp->batch is that, when the PCP is used up
and a batch of pages needs to be fetched from Buddy, that refill can take
longer than before since the batch is larger. My understanding is this
doesn't affect the slowpath, where direct reclaim and compaction dominate.
For the fastpath, throughput is a win (according to
will-it-scale/page_fault1) but worst-case latency can be larger now.

Overall, I think doubling the batch size from 31 to 63 is relatively
safe and provides a good performance boost for high-core-count systems.

The two phases' test results are listed below (all tests were done with
THP disabled).

Phase one (will-it-scale/page_fault1) test results:

Skylake-EX: the increased batch size has a good effect on zone->lock
contention, though LRU contention rises at the same time and
limits the final performance increase.

batch   score change   zone_contention   lru_contention   total_contention
 31   15345900+0.00%   64% 8%   72%
 53   17903847   +16.67%   32%38%   70%
 63   17992886   +17.25%   24%45%   69%
 73   18022825   +17.44%   10%61%   71%
119   18023401   +17.45%4%66%   70%
127   18029012   +17.48%3%66%   69%
137   18036075   +17.53%4%66%   70%
165   18035964   +17.53%2%67%   69%
188   18101105   +17.95%2%67%   69%
223   18130951   +18.15%2%67%   69%
255   18118898   +18.07%2%67%   69%
267   18101559   +17.96%2%67%   69%
299   18160468   +18.34%2%68%   70%
320   18139845   +18.21%2%67%   69%
393   18160869   +18.34%2%68%   70%
424   18170999   +18.41%2%68%   70%
458   18144868   +18.24%2%68%   70%
467   18142366   +18.22%2%68%   70%
498   18154549   +18.30%1%68%   69%
511   18134525   +18.17%1%69%   70%

Broadwell-EX: similar pattern as Skylake-EX.

batch   score change   zone_contention   lru_contention   total_contention
 31   16703983+0.00%   67% 7%   74%
 53   18195393+8.93%   43%28%   71%
 63   1825+9.49%   38%33%   71%
 73   18344329+9.82%   35%37%   72%
119   18535529   +10.96%   24%46%   70%
127   18513596   +10.83%   23%48%   71%
137   18514327   +10.84%   23%48%   71%
165   18511840   +10.82%   22%49%   71%
188   18593478   +11.31%   17%53% 
