Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-26 Thread Christopher Lameter
On Fri, 22 Dec 2017, kemi wrote:

> > I think you are fighting a lost battle there. As evident from the timing
> > constraints on packet processing in a 10/40G you will have a hard time to
> > process data if the packets are of regular Ethernet size. And we already
> > have 100G NICs in operation here.
> >
>
> Not really.
> For 10/40G NICs or even 100G, I admit DPDK is widely used in data center
> networks rather than kernel drivers in production environments.

Shudder. I would rather have a user-space API that is vendor-neutral and
that allows the use of multiple NICs. The Linux kernel has an RDMA
subsystem that does just that.

But the time budget is difficult to deal with even using RDMA or DPDK, where we
can avoid the OS overhead.

> That's due to the slow page allocator and long pipeline processing in the
> network protocol stack.

Right, the timing budget there for processing a single packet gets below a
microsecond at some point, and there it's going to be difficult to do much.
Some aggregation / offloading is required and that increases as speeds
become higher.

> It's not easy to change this state in a short time, but if we can do something
> here to change it a little, why not?

How much of an improvement is this going to be? If it is significant then
by all means let's do it.

> > We can try to get the performance as high as possible but full rate high
> > speed networking invariably must use offload mechanisms and thus the
> > statistics would only be available from the hardware devices that can do
> > wire speed processing.
> >
>
> I think you may be talking about SmartNICs (e.g. OpenVswitch offload +
> VF pass-through). That's usually used in virtualization environments to
> eliminate the overhead of device emulation and packet processing in the
> software virtual switch (OVS or Linux bridge).

The switch offloads can also be used elsewhere. Also, the RDMA subsystem
has counters like that.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-22 Thread Michal Hocko
On Thu 21-12-17 18:31:19, kemi wrote:
> 
> 
> > On 2017-12-21 16:59, Michal Hocko wrote:
> > On Thu 21-12-17 16:23:23, kemi wrote:
> >>
> >>
> >> On 2017-12-21 16:17, Michal Hocko wrote:
> > [...]
> >>> Can you see any difference with a more generic workload?
> >>>
> >>
> >> I didn't see an obvious improvement for will-it-scale.page_fault1.
> >> Two reasons for that:
> >> 1) too long a code path
> >> 2) severe zone lock and LRU lock contention (the buddy system is
> >> accessed frequently)
> > 
> > OK. So does the patch help for anything other than a microbenchmark?
> > 
>  Some thinking about that:
>  a) the overhead due to cache bouncing caused by NUMA counter updates in
>  the fast path increases severely with more and more CPU cores
> >>>
> >>> What is the effect on a smaller system with fewer CPUs?
> >>>
> >>
> >> Several CPU cycles can be saved using a single thread for that.
> >>
>  b) AFAIK, the typical usage scenario (or similar, at least) for which this
>  optimization can benefit is a 10/40G NIC used in the high-speed data
>  center networks of cloud service providers.
> >>>
> >>> I would expect those would disable the numa accounting altogether.
> >>>
> >>
> >> Yes, but it is still worth doing some optimization, isn't it?
> > 
> > Ohh, I am not opposing optimizations but you should make sure that they
> > are worth the additional code and special-casing. As I've said, I am not
> > convinced special-casing numa counters is good. You can play with the
> > threshold scaling for larger CPU counts but let's make sure that the
> > benefit is really measurable for normal workloads. Special ones will
> > disable the numa accounting anyway.
> > 
> 
> Understood. Could you give me some suggestions for those normal workloads?
> Thanks. I will have a try and post the data ASAP.

Well, to be honest, I am really confused about what your objective for
these optimizations is, then. I hope we have agreed that workloads which
really need to squeeze every single CPU cycle in the allocation path
will simply disable the whole numa stat thing. I haven't yet heard about
any use case which would really require numa stats and suffer from the
numa stats overhead.

I can see some arguments for better threshold scaling, but that
requires checking a wider range of tests to show there are no unintended
changes. I am not really confident you understand that when you are
asking for "those normal workloads".

So please, try to step back, rethink who you are optimizing for and act
accordingly. If I were you I would repost the first patch which only
integrates numa stats because that removes a lot of pointless code and
that is a win of its own.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017-12-22 01:10, Christopher Lameter wrote:
> On Thu, 21 Dec 2017, kemi wrote:
> 
>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter updates in the
>> fast path increases severely with more and more CPU cores
>> b) AFAIK, the typical usage scenario (or similar, at least) for which this
>> optimization can benefit is a 10/40G NIC used in the high-speed data center
>> networks of cloud service providers.
> 
> I think you are fighting a lost battle there. As evident from the timing
> constraints on packet processing in a 10/40G you will have a hard time to
> process data if the packets are of regular Ethernet size. And we already
> have 100G NICs in operation here.
> 

Not really.
For 10/40G NICs or even 100G, I admit DPDK is widely used in data center
networks rather than kernel drivers in production environments.
That's due to the slow page allocator and long pipeline processing in the
network protocol stack.
It's not easy to change this state in a short time, but if we can do something
here to change it a little, why not?

> We can try to get the performance as high as possible but full rate high
> speed networking invariably must use offload mechanisms and thus the
> statistics would only be available from the hardware devices that can do
> wire speed processing.
> 

I think you may be talking about SmartNICs (e.g. OpenVswitch offload +
VF pass-through). That's usually used in virtualization environments to
eliminate the overhead of device emulation and packet processing in the
software virtual switch (OVS or Linux bridge).

What I have done in this patch series is improve page allocator performance,
which is also helpful in an offload environment (the guest kernel at least), IMHO.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread Christopher Lameter
On Thu, 21 Dec 2017, kemi wrote:

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter updates in the
> fast path increases severely with more and more CPU cores
> b) AFAIK, the typical usage scenario (or similar, at least) for which this
> optimization can benefit is a 10/40G NIC used in the high-speed data center
> networks of cloud service providers.

I think you are fighting a lost battle there. As evident from the timing
constraints on packet processing in a 10/40G you will have a hard time to
process data if the packets are of regular Ethernet size. And we already
have 100G NICs in operation here.

We can try to get the performance as high as possible but full rate high
speed networking invariably must use offload mechanisms and thus the
statistics would only be available from the hardware devices that can do
wire speed processing.



Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017-12-21 16:59, Michal Hocko wrote:
> On Thu 21-12-17 16:23:23, kemi wrote:
>>
>>
>> On 2017-12-21 16:17, Michal Hocko wrote:
> [...]
>>> Can you see any difference with a more generic workload?
>>>
>>
>> I didn't see an obvious improvement for will-it-scale.page_fault1.
>> Two reasons for that:
>> 1) too long a code path
>> 2) severe zone lock and LRU lock contention (the buddy system is
>> accessed frequently)
> 
> OK. So does the patch help for anything other than a microbenchmark?
> 
 Some thinking about that:
 a) the overhead due to cache bouncing caused by NUMA counter updates in
 the fast path increases severely with more and more CPU cores
>>>
>>> What is the effect on a smaller system with fewer CPUs?
>>>
>>
>> Several CPU cycles can be saved using a single thread for that.
>>
 b) AFAIK, the typical usage scenario (or similar, at least) for which this
 optimization can benefit is a 10/40G NIC used in the high-speed data
 center networks of cloud service providers.
>>>
>>> I would expect those would disable the numa accounting altogether.
>>>
>>
>> Yes, but it is still worth doing some optimization, isn't it?
> 
> Ohh, I am not opposing optimizations but you should make sure that they
> are worth the additional code and special-casing. As I've said, I am not
> convinced special-casing numa counters is good. You can play with the
> threshold scaling for larger CPU counts but let's make sure that the
> benefit is really measurable for normal workloads. Special ones will
> disable the numa accounting anyway.
> 

Understood. Could you give me some suggestions for those normal workloads?
Thanks. I will have a try and post the data ASAP.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread Michal Hocko
On Thu 21-12-17 16:23:23, kemi wrote:
> 
> 
> On 2017-12-21 16:17, Michal Hocko wrote:
[...]
> > Can you see any difference with a more generic workload?
> > 
> 
> I didn't see an obvious improvement for will-it-scale.page_fault1.
> Two reasons for that:
> 1) too long a code path
> 2) severe zone lock and LRU lock contention (the buddy system is
> accessed frequently)

OK. So does the patch help for anything other than a microbenchmark?

> >> Some thinking about that:
> >> a) the overhead due to cache bouncing caused by NUMA counter updates in
> >> the fast path increases severely with more and more CPU cores
> > 
> > What is the effect on a smaller system with fewer CPUs?
> > 
> 
> Several CPU cycles can be saved using a single thread for that.
> 
> >> b) AFAIK, the typical usage scenario (or similar, at least) for which this
> >> optimization can benefit is a 10/40G NIC used in the high-speed data
> >> center networks of cloud service providers.
> > 
> > I would expect those would disable the numa accounting altogether.
> > 
> 
> Yes, but it is still worth doing some optimization, isn't it?

Ohh, I am not opposing optimizations but you should make sure that they
are worth the additional code and special-casing. As I've said, I am not
convinced special-casing numa counters is good. You can play with the
threshold scaling for larger CPU counts but let's make sure that the
benefit is really measurable for normal workloads. Special ones will
disable the numa accounting anyway.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017-12-21 16:17, Michal Hocko wrote:
> On Thu 21-12-17 16:06:50, kemi wrote:
>>
>>
>> On 2017-12-20 18:12, Michal Hocko wrote:
>>> On Wed 20-12-17 13:52:14, kemi wrote:


 On 2017-12-19 20:40, Michal Hocko wrote:
> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>> We have seen significant overhead in cache bouncing caused by NUMA counter
>> updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
>> update NUMA counter threshold size") for more details.
>>
>> This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
>> deals with global counter updates using a different threshold size for node
>> page stats.
>
> Again, no numbers.

 Compared to the vanilla kernel, I don't think it has a performance improvement,
 so I didn't post performance data here.
 But if you would like to see the performance gain from enlarging the threshold
 size for NUMA stats (compared to the first patch), I will do that later.
>>>
>>> Please do. I would also like to hear _why_ all counters cannot simply
>>> behave the same. In other words, why can we not simply increase
>>> stat_threshold? Maybe calculate_normal_threshold needs better scaling
>>> for larger machines.
>>>
>>
>> I will add this performance data to the changelog in the V3 patch series.
>>
>> Test machine: 2-socket Skylake platform (112 CPUs, 62G RAM)
>> Benchmark: page_bench03
>> Description: 112 threads do single page allocation/deallocation in parallel.
>>
>>               before    after (enlarged threshold size)
>> CPU cycles    722       379 (-47.5%)
> 
> Please describe the numbers some more. Is this an average?

Yes

> What is the std? 

I increased the loop count to 10m, so the std is quite low (repeated 3 times).

> Can you see any difference with a more generic workload?
> 

I didn't see an obvious improvement for will-it-scale.page_fault1.
Two reasons for that:
1) too long a code path
2) severe zone lock and LRU lock contention (the buddy system is accessed frequently)

>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter updates in the
>> fast path increases severely with more and more CPU cores
> 
> What is the effect on a smaller system with fewer CPUs?
> 

Several CPU cycles can be saved using a single thread for that.

>> b) AFAIK, the typical usage scenario (or similar, at least) for which this
>> optimization can benefit is a 10/40G NIC used in the high-speed data center
>> networks of cloud service providers.
> 
> I would expect those would disable the numa accounting altogether.
> 

Yes, but it is still worth doing some optimization, isn't it?


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread Michal Hocko
On Thu 21-12-17 16:06:50, kemi wrote:
> 
> 
> > On 2017-12-20 18:12, Michal Hocko wrote:
> > On Wed 20-12-17 13:52:14, kemi wrote:
> >>
> >>
> >> On 2017-12-19 20:40, Michal Hocko wrote:
> >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>  We have seen significant overhead in cache bouncing caused by NUMA counter
>  updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
>  update NUMA counter threshold size") for more details.
> 
>  This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
>  deals with global counter updates using a different threshold size for node
>  page stats.
> >>>
> >>> Again, no numbers.
> >>
> >> Compared to the vanilla kernel, I don't think it has a performance
> >> improvement, so I didn't post performance data here.
> >> But if you would like to see the performance gain from enlarging the
> >> threshold size for NUMA stats (compared to the first patch), I will do
> >> that later.
> > 
> > Please do. I would also like to hear _why_ all counters cannot simply
> > behave the same. In other words, why can we not simply increase
> > stat_threshold? Maybe calculate_normal_threshold needs better scaling
> > for larger machines.
> > 
> 
> I will add this performance data to the changelog in the V3 patch series.
> 
> Test machine: 2-socket Skylake platform (112 CPUs, 62G RAM)
> Benchmark: page_bench03
> Description: 112 threads do single page allocation/deallocation in parallel.
>
>               before    after (enlarged threshold size)
> CPU cycles    722       379 (-47.5%)

Please describe the numbers some more. Is this an average? What is the
std? Can you see any difference with a more generic workload?

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter updates in the
> fast path increases severely with more and more CPU cores

What is the effect on a smaller system with fewer CPUs?

> b) AFAIK, the typical usage scenario (or similar, at least) for which this
> optimization can benefit is a 10/40G NIC used in the high-speed data center
> networks of cloud service providers.

I would expect those would disable the numa accounting altogether.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-21 Thread kemi


On 2017-12-20 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017-12-19 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
 We have seen significant overhead in cache bouncing caused by NUMA counter
 updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
 update NUMA counter threshold size") for more details.

 This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
 deals with global counter updates using a different threshold size for node
 page stats.
>>>
>>> Again, no numbers.
>>
>> Compared to the vanilla kernel, I don't think it has a performance improvement,
>> so I didn't post performance data here.
>> But if you would like to see the performance gain from enlarging the threshold
>> size for NUMA stats (compared to the first patch), I will do that later.
> 
> Please do. I would also like to hear _why_ all counters cannot simply
> behave the same. In other words, why can we not simply increase
> stat_threshold? Maybe calculate_normal_threshold needs better scaling
> for larger machines.
> 

I will add this performance data to the changelog in the V3 patch series.

Test machine: 2-socket Skylake platform (112 CPUs, 62G RAM)
Benchmark: page_bench03
Description: 112 threads do single page allocation/deallocation in parallel.

              before    after (enlarged threshold size)
CPU cycles    722       379 (-47.5%)

Some thinking about that:
a) the overhead due to cache bouncing caused by NUMA counter updates in the
fast path increases severely with more and more CPU cores
b) AFAIK, the typical usage scenario (or similar, at least) for which this
optimization can benefit is a 10/40G NIC used in the high-speed data center
networks of cloud service providers.


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-20 Thread kemi


On 2017-12-20 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>>
>> On 2017-12-19 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
 We have seen significant overhead in cache bouncing caused by NUMA counter
 updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
 update NUMA counter threshold size") for more details.

 This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
 deals with global counter updates using a different threshold size for node
 page stats.
>>>
>>> Again, no numbers.
>>
>> Compared to the vanilla kernel, I don't think it has a performance improvement,
>> so I didn't post performance data here.
>> But if you would like to see the performance gain from enlarging the threshold
>> size for NUMA stats (compared to the first patch), I will do that later.
> 
> Please do. I would also like to hear _why_ all counters cannot simply
> behave the same. In other words, why can we not simply increase
> stat_threshold? Maybe calculate_normal_threshold needs better scaling
> for larger machines.
> 

Agree, we may consider that.
But unlike NUMA counters, which do not affect system decisions, we need to
consider very carefully before increasing stat_threshold for all the counters
on larger machines. BTW, this is another topic that we may discuss in a
different thread.
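
A back-of-envelope sketch of why the threshold matters for accuracy: the
worst-case drift a reader of a global counter can observe is roughly
ncpus * threshold, since every CPU may hold one un-folded batch. A minimal,
hypothetical stand-alone program (numbers assume the 112-CPU test box
discussed in this thread):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	long ncpus = 112;
	long normal = 125;		/* typical stat_threshold cap */
	long numa = INT16_MAX - 2;	/* the proposed NUMA threshold */

	/* 112 * 125 = 14000; 112 * 32765 = 3669680 events of drift */
	printf("max drift, normal threshold: %ld\n", ncpus * normal);
	printf("max drift, NUMA threshold:   %ld\n", ncpus * numa);
	return 0;
}

NUMA hit/miss counters can tolerate the larger drift because nothing in the
kernel acts on them, which is exactly why raising stat_threshold for the
other counters needs much more care.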


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-20 Thread Michal Hocko
On Wed 20-12-17 13:52:14, kemi wrote:
> 
> 
> On 2017-12-19 20:40, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> >> We have seen significant overhead in cache bouncing caused by NUMA counter
> >> updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
> >> update NUMA counter threshold size") for more details.
> >>
> >> This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
> >> deals with global counter updates using a different threshold size for node
> >> page stats.
> > 
> > Again, no numbers.
> 
> Compared to the vanilla kernel, I don't think it has a performance improvement,
> so I didn't post performance data here.
> But if you would like to see the performance gain from enlarging the threshold
> size for NUMA stats (compared to the first patch), I will do that later.

Please do. I would also like to hear _why_ all counters cannot simply
behave the same. In other words, why can we not simply increase
stat_threshold? Maybe calculate_normal_threshold needs better scaling
for larger machines.
-- 
Michal Hocko
SUSE Labs
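
A hypothetical sketch of what better CPU-count-aware scaling could look
like, in the spirit of calculate_normal_threshold() (illustrative only,
not the kernel's actual formula; fls_int() mimics the kernel's fls()):

/* Grow the batching threshold roughly logarithmically with CPU count
 * and memory size, then clamp it so the per-CPU counter cannot
 * overflow. */
static int fls_int(unsigned int x)
{
	return x ? 32 - (int)__builtin_clz(x) : 0;
}

static int scaled_threshold(unsigned int ncpus, unsigned int managed_mb)
{
	int t = 2 * fls_int(ncpus) * (1 + fls_int(managed_mb >> 7));

	return t < 125 ? t : 125;	/* keep within the s8 per-CPU diff */
}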


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-19 Thread kemi


On 2017-12-19 20:40, Michal Hocko wrote:
> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>> We have seen significant overhead in cache bouncing caused by NUMA counter
>> updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
>> update NUMA counter threshold size") for more details.
>>
>> This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
>> deals with global counter updates using a different threshold size for node
>> page stats.
> 
> Again, no numbers.

Compared to the vanilla kernel, I don't think it has a performance improvement,
so I didn't post performance data here.
But if you would like to see the performance gain from enlarging the threshold
size for NUMA stats (compared to the first patch), I will do that later.

> To be honest, I do not really like the special-casing
> here. Why are numa counters any different from PGALLOC, which is
> incremented for _every_ single page allocation?
> 

I guess you meant the PGALLOC event.
The count for this event is kept per CPU and summed up (for_each_online_cpu)
when needed. It uses a similar approach to what I used before for NUMA stats
in the V1 patch series. Good enough.
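
For illustration, a minimal user-space analogue of that "count locally,
sum on read" pattern (hypothetical and self-contained; for_each_online_cpu
is the kernel idiom, nothing below is kernel API):

/* Writers touch only their own padded slot, so the hot path never
 * bounces a shared cache line; readers sum all slots, which is fine
 * for rarely-read statistics such as event counters. */
#include <stdint.h>

#define NCPUS 112

struct percpu_event {
	uint64_t count;
	char pad[64 - sizeof(uint64_t)];	/* avoid false sharing */
};

static struct percpu_event pgalloc_evt[NCPUS];

static inline void count_event(int cpu)
{
	pgalloc_evt[cpu].count++;		/* local, uncontended */
}

static uint64_t sum_event(void)			/* cf. for_each_online_cpu */
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < NCPUS; cpu++)
		sum += pgalloc_evt[cpu].count;
	return sum;
}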

>> ---
>>  mm/vmstat.c | 13 +++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 9c681cc..64e08ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>  
>>  #include "internal.h"
>>  
>> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
>> +
>>  #ifdef CONFIG_NUMA
>>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>  
>> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  s16 v, t;
>>  
>>  v = __this_cpu_inc_return(*p);
>> -t = __this_cpu_read(pcp->stat_threshold);
>> +if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +t = __this_cpu_read(pcp->stat_threshold);
>> +else
>> +t = VM_NUMA_STAT_THRESHOLD;
>> +
>>  if (unlikely(v > t)) {
>>  s16 overstep = t >> 1;
>>  
>> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>>   * Most of the time the thresholds are the same anyways
>>   * for all cpus in a node.
>>   */
>> -t = this_cpu_read(pcp->stat_threshold);
>> +if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +t = this_cpu_read(pcp->stat_threshold);
>> +else
>> +t = VM_NUMA_STAT_THRESHOLD;
>>  
>>  o = this_cpu_read(*p);
>>  n = delta + o;
>> -- 
>> 2.7.4
>>
> 


Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-19 Thread Michal Hocko
On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> We have seen significant overhead in cache bouncing caused by NUMA counter
> updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
> update NUMA counter threshold size") for more details.
> 
> This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
> deals with global counter updates using a different threshold size for node
> page stats.

Again, no numbers. To be honest, I do not really like the special-casing
here. Why are numa counters any different from PGALLOC, which is
incremented for _every_ single page allocation?

> ---
>  mm/vmstat.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9c681cc..64e08ae 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -30,6 +30,8 @@
>  
>  #include "internal.h"
>  
> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
> +
>  #ifdef CONFIG_NUMA
>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>  
> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>   s16 v, t;
>  
>   v = __this_cpu_inc_return(*p);
> - t = __this_cpu_read(pcp->stat_threshold);
> + if (item >= NR_VM_NUMA_STAT_ITEMS)
> + t = __this_cpu_read(pcp->stat_threshold);
> + else
> + t = VM_NUMA_STAT_THRESHOLD;
> +
>   if (unlikely(v > t)) {
>   s16 overstep = t >> 1;
>  
> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>* Most of the time the thresholds are the same anyways
>* for all cpus in a node.
>*/
> - t = this_cpu_read(pcp->stat_threshold);
> + if (item >= NR_VM_NUMA_STAT_ITEMS)
> + t = this_cpu_read(pcp->stat_threshold);
> + else
> + t = VM_NUMA_STAT_THRESHOLD;
>  
>   o = this_cpu_read(*p);
>   n = delta + o;
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


[PATCH v2 3/5] mm: enlarge NUMA counters threshold size

2017-12-18 Thread Kemi Wang
We have seen significant overhead in cache bouncing caused by NUMA counter
updates in multi-threaded page allocation. See commit 1d90ca897cb0 ("mm:
update NUMA counter threshold size") for more details.

This patch updates NUMA counters to a fixed threshold of (S16_MAX - 2) and
deals with global counter updates using a different threshold size for node
page stats.

Signed-off-by: Kemi Wang 
---
 mm/vmstat.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9c681cc..64e08ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@
 
 #include "internal.h"
 
+#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
+
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
 
@@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
s16 v, t;
 
v = __this_cpu_inc_return(*p);
-   t = __this_cpu_read(pcp->stat_threshold);
+   if (item >= NR_VM_NUMA_STAT_ITEMS)
+   t = __this_cpu_read(pcp->stat_threshold);
+   else
+   t = VM_NUMA_STAT_THRESHOLD;
+
if (unlikely(v > t)) {
s16 overstep = t >> 1;
 
@@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 * Most of the time the thresholds are the same anyways
 * for all cpus in a node.
 */
-   t = this_cpu_read(pcp->stat_threshold);
+   if (item >= NR_VM_NUMA_STAT_ITEMS)
+   t = this_cpu_read(pcp->stat_threshold);
+   else
+   t = VM_NUMA_STAT_THRESHOLD;
 
o = this_cpu_read(*p);
n = delta + o;
-- 
2.7.4
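
For illustration, a minimal user-space sketch of the per-CPU threshold-fold
pattern the patch tunes (hypothetical and self-contained; the real variants
are the mm/vmstat.c functions patched above):

#include <stdatomic.h>
#include <stdint.h>

#define NUMA_STAT_THRESHOLD (INT16_MAX - 2)

static _Atomic long global_numa_hit;		/* shared, contended */
static _Thread_local int16_t local_numa_hit;	/* per-thread, cheap */

/* Count locally and fold into the shared counter only when the local
 * delta crosses the threshold. A larger threshold means fewer writes
 * to the contended cache line, at the cost of the global value
 * drifting by up to (ncpus * threshold) between folds. */
static void inc_numa_hit(void)
{
	if (++local_numa_hit > NUMA_STAT_THRESHOLD) {
		atomic_fetch_add_explicit(&global_numa_hit, local_numa_hit,
					  memory_order_relaxed);
		local_numa_hit = 0;
	}
}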


