Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Fri, 22 Dec 2017, kemi wrote:

> > I think you are fighting a lost battle there. As evident from the timing
> > constraints on packet processing in a 10/40G you will have a hard time to
> > process data if the packets are of regular ethernet size. And we already
> > have 100G NICs in operation here.
>
> Not really.
> For 10/40G NIC or even 100G, I admit DPDK is widely used in data center
> network rather than kernel driver in production environment.

Shudder. I would rather have a user space API that is vendor neutral and
that allows the use of multiple NICs. The Linux kernel has an RDMA
subsystem that does just that. But the time budget is difficult to deal
with even using RDMA or DPDK where we can avoid the OS overhead.

> That's due to the slow page allocator and long pipeline processing in the
> network protocol stack.

Right, the timing budget for processing a single packet gets below a
microsecond at some point, and there it is going to be difficult to do
much. Some aggregation / offloading is required, and that increases as
speeds become higher.

> That's not easy to change this state in a short time, but if we can do
> something here to change it a little, why not.

How much of an improvement is this going to be? If it is significant then
by all means let's do it.

> > We can try to get the performance as high as possible but full rate high
> > speed networking invariably must use offload mechanisms and thus the
> > statistics would only be available from the hardware devices that can do
> > wire speed processing.
>
> I think you may be talking about something like a SmartNIC (e.g.
> OpenVswitch offload + VF pass through). That's usually used in
> virtualization environments to eliminate the overhead from device
> emulation and packet processing in the software virtual switch (OVS or
> linux bridge).

The switch offloads can also be used elsewhere. Also the RDMA subsystem
has counters like that.
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Thu 21-12-17 18:31:19, kemi wrote:
>
> On 2017年12月21日 16:59, Michal Hocko wrote:
> > On Thu 21-12-17 16:23:23, kemi wrote:
> >>
> >> On 2017年12月21日 16:17, Michal Hocko wrote:
> > [...]
> >>> Can you see any difference with a more generic workload?
> >>
> >> I didn't see obvious improvement for will-it-scale.page_fault1
> >> Two reasons for that:
> >> 1) too long code path
> >> 2) severe zone lock and lru lock contention (access to buddy system
> >> frequently)
> >
> > OK. So does the patch help for anything other than a microbenchmark?
> >
> >>>> Some thinking about that:
> >>>> a) the overhead due to cache bouncing caused by NUMA counter update
> >>>> in fast path severely increases with more and more CPU cores
> >>>
> >>> What is the effect on a smaller system with fewer CPUs?
> >>
> >> Several CPU cycles can be saved using a single thread for that.
> >>
> >>>> b) AFAIK, the typical usage scenario (similar at least) for which
> >>>> this optimization can benefit is a 10/40G NIC used in the high-speed
> >>>> data center network of cloud service providers.
> >>>
> >>> I would expect those would disable the numa accounting altogether.
> >>
> >> Yes, but it is still worthy to do some optimization, isn't it?
> >
> > Ohh, I am not opposing optimizations but you should make sure that they
> > are worth the additional code and special casing. As I've said I am not
> > convinced special casing numa counters is good. You can play with the
> > threshold scaling for larger CPU count but let's make sure that the
> > benefit is really measurable for normal workloads. Special ones will
> > disable the numa accounting anyway.
>
> I understood. Could you give me some suggestions for those normal
> workloads? Thanks.
> I will have a try and post the data ASAP.

Well, to be honest, I am really confused about what your objective for
these optimizations is, then.

I hope we have agreed that workloads which really need to squeeze every
single CPU cycle in the allocation path will simply disable the whole numa
stat thing. I haven't yet heard about any use case which would really
require numa stats and suffer from the numa stats overhead.

I can see some arguments for a better threshold scaling but that requires
checking a wider range of tests to show there are no unintended changes. I
am not really confident you understand that when you are asking for "those
normal workloads".

So please, try to step back, rethink who you are optimizing for and act
accordingly. If I were you I would repost the first patch which only
integrates numa stats, because that removes a lot of pointless code and
that is a win of its own.
--
Michal Hocko
SUSE Labs
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017年12月22日 01:10, Christopher Lameter wrote:
> On Thu, 21 Dec 2017, kemi wrote:
>
>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in
>> fast path severely increases with more and more CPU cores
>> b) AFAIK, the typical usage scenario (similar at least) for which this
>> optimization can benefit is a 10/40G NIC used in the high-speed data
>> center network of cloud service providers.
>
> I think you are fighting a lost battle there. As evident from the timing
> constraints on packet processing in a 10/40G you will have a hard time to
> process data if the packets are of regular ethernet size. And we already
> have 100G NICs in operation here.

Not really.
For 10/40G NIC or even 100G, I admit DPDK is widely used in data center
network rather than kernel driver in production environment. That's due to
the slow page allocator and long pipeline processing in the network
protocol stack. It's not easy to change this state in a short time, but if
we can do something here to change it a little, why not?

> We can try to get the performance as high as possible but full rate high
> speed networking invariably must use offload mechanisms and thus the
> statistics would only be available from the hardware devices that can do
> wire speed processing.

I think you may be talking about something like a SmartNIC (e.g.
OpenVswitch offload + VF pass through). That's usually used in
virtualization environments to eliminate the overhead from device
emulation and packet processing in the software virtual switch (OVS or
linux bridge).

What I have done in this patch series is to improve page allocator
performance; that's also helpful in an offload environment (guest kernel
at least), IMHO.
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Thu, 21 Dec 2017, kemi wrote:

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter update in
> fast path severely increases with more and more CPU cores
> b) AFAIK, the typical usage scenario (similar at least) for which this
> optimization can benefit is a 10/40G NIC used in the high-speed data
> center network of cloud service providers.

I think you are fighting a lost battle there. As evident from the timing
constraints on packet processing in a 10/40G you will have a hard time to
process data if the packets are of regular ethernet size. And we already
have 100G NICs in operation here.

We can try to get the performance as high as possible but full rate high
speed networking invariably must use offload mechanisms and thus the
statistics would only be available from the hardware devices that can do
wire speed processing.
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017年12月21日 16:59, Michal Hocko wrote:
> On Thu 21-12-17 16:23:23, kemi wrote:
>>
>> On 2017年12月21日 16:17, Michal Hocko wrote:
> [...]
>>> Can you see any difference with a more generic workload?
>>
>> I didn't see obvious improvement for will-it-scale.page_fault1
>> Two reasons for that:
>> 1) too long code path
>> 2) severe zone lock and lru lock contention (access to buddy system
>> frequently)
>
> OK. So does the patch help for anything other than a microbenchmark?
>
>>>> Some thinking about that:
>>>> a) the overhead due to cache bouncing caused by NUMA counter update in
>>>> fast path severely increases with more and more CPU cores
>>>
>>> What is the effect on a smaller system with fewer CPUs?
>>
>> Several CPU cycles can be saved using a single thread for that.
>>
>>>> b) AFAIK, the typical usage scenario (similar at least) for which this
>>>> optimization can benefit is a 10/40G NIC used in the high-speed data
>>>> center network of cloud service providers.
>>>
>>> I would expect those would disable the numa accounting altogether.
>>
>> Yes, but it is still worthy to do some optimization, isn't it?
>
> Ohh, I am not opposing optimizations but you should make sure that they
> are worth the additional code and special casing. As I've said I am not
> convinced special casing numa counters is good. You can play with the
> threshold scaling for larger CPU count but let's make sure that the
> benefit is really measurable for normal workloads. Special ones will
> disable the numa accounting anyway.

I understood. Could you give me some suggestions for those normal
workloads? Thanks.
I will have a try and post the data ASAP.
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Thu 21-12-17 16:23:23, kemi wrote:
>
> On 2017年12月21日 16:17, Michal Hocko wrote:
[...]
> > Can you see any difference with a more generic workload?
>
> I didn't see obvious improvement for will-it-scale.page_fault1
> Two reasons for that:
> 1) too long code path
> 2) severe zone lock and lru lock contention (access to buddy system
> frequently)

OK. So does the patch help for anything other than a microbenchmark?

> >> Some thinking about that:
> >> a) the overhead due to cache bouncing caused by NUMA counter update in
> >> fast path severely increases with more and more CPU cores
> >
> > What is the effect on a smaller system with fewer CPUs?
>
> Several CPU cycles can be saved using a single thread for that.
>
> >> b) AFAIK, the typical usage scenario (similar at least) for which this
> >> optimization can benefit is a 10/40G NIC used in the high-speed data
> >> center network of cloud service providers.
> >
> > I would expect those would disable the numa accounting altogether.
>
> Yes, but it is still worthy to do some optimization, isn't it?

Ohh, I am not opposing optimizations but you should make sure that they
are worth the additional code and special casing. As I've said I am not
convinced special casing numa counters is good. You can play with the
threshold scaling for larger CPU count but let's make sure that the
benefit is really measurable for normal workloads. Special ones will
disable the numa accounting anyway.

Thanks!
--
Michal Hocko
SUSE Labs
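[For readers following the thread: the "threshold scaling" being debated is
the batching threshold the kernel computes per zone so per-CPU counters
only fold into the shared counter occasionally. The sketch below is an
illustrative toy, not the kernel's actual calculate_normal_threshold()
from mm/vmstat.c; the doubling curve and the 125 cap are assumptions made
for the example.]

```c
#include <assert.h>

/*
 * Toy model of size-based stat threshold scaling: bigger zones get a
 * larger per-CPU batching threshold, so large machines touch the shared
 * counter less often. The exact curve here is a hypothetical stand-in;
 * the cap at 125 mirrors the idea that the per-CPU diff is a small
 * signed integer that must not overflow.
 */
static int scale_threshold(unsigned long zone_managed_pages)
{
	int threshold = 2;

	/* Roughly double the threshold for every 4x growth in zone size. */
	while (zone_managed_pages > 4096 && threshold < 125) {
		threshold *= 2;
		zone_managed_pages /= 4;
	}
	return threshold > 125 ? 125 : threshold;
}
```

A fixed cap like this is exactly what the thread is arguing about: NUMA
event counters could in principle use a much larger cap than general zone
stats, because nothing in the allocator makes decisions based on them.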
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017年12月21日 16:17, Michal Hocko wrote:
> On Thu 21-12-17 16:06:50, kemi wrote:
>>
>> On 2017年12月20日 18:12, Michal Hocko wrote:
>>> On Wed 20-12-17 13:52:14, kemi wrote:
>>>>
>>>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>>>> We have seen significant overhead in cache bouncing caused by NUMA
>>>>>> counters update in multi-threaded page allocation. See commit
>>>>>> 1d90ca897cb0 ("mm: update NUMA counter threshold size") for more
>>>>>> details.
>>>>>>
>>>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2)
>>>>>> and deals with global counter update using a different threshold
>>>>>> size for node page stats.
>>>>>
>>>>> Again, no numbers.
>>>>
>>>> Compared to the vanilla kernel, I don't think it has a performance
>>>> improvement, so I didn't post performance data here.
>>>> But, if you would like to see the performance gain from enlarging the
>>>> threshold size for NUMA stats (compared to the first patch), I will do
>>>> that later.
>>>
>>> Please do. I would also like to hear _why_ all counters cannot simply
>>> behave the same. In other words why we cannot simply increase
>>> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
>>> for larger machines.
>>
>> I will add this performance data to the changelog in the V3 patch
>> series.
>>
>> Test machine: 2-socket skylake platform (112 CPUs, 62G RAM)
>> Benchmark: page_bench03
>> Description: 112 threads do single page allocation/deallocation in
>> parallel.
>>
>>                 before      after (enlarge threshold size)
>> CPU cycles      722         379 (-47.5%)
>
> Please describe the numbers some more. Is this an average?

Yes

> What is the std?

I increased the loop count to 10m, so the std is quite low (repeated 3
times).

> Can you see any difference with a more generic workload?

I didn't see obvious improvement for will-it-scale.page_fault1
Two reasons for that:
1) too long code path
2) severe zone lock and lru lock contention (access to buddy system
frequently)

>> Some thinking about that:
>> a) the overhead due to cache bouncing caused by NUMA counter update in
>> fast path severely increases with more and more CPU cores
>
> What is the effect on a smaller system with fewer CPUs?

Several CPU cycles can be saved using a single thread for that.

>> b) AFAIK, the typical usage scenario (similar at least) for which this
>> optimization can benefit is a 10/40G NIC used in the high-speed data
>> center network of cloud service providers.
>
> I would expect those would disable the numa accounting altogether.

Yes, but it is still worthy to do some optimization, isn't it?
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Thu 21-12-17 16:06:50, kemi wrote:
>
> On 2017年12月20日 18:12, Michal Hocko wrote:
> > On Wed 20-12-17 13:52:14, kemi wrote:
> >>
> >> On 2017年12月19日 20:40, Michal Hocko wrote:
> >>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> >>>> We have seen significant overhead in cache bouncing caused by NUMA
> >>>> counters update in multi-threaded page allocation. See commit
> >>>> 1d90ca897cb0 ("mm: update NUMA counter threshold size") for more
> >>>> details.
> >>>>
> >>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2)
> >>>> and deals with global counter update using a different threshold
> >>>> size for node page stats.
> >>>
> >>> Again, no numbers.
> >>
> >> Compared to the vanilla kernel, I don't think it has a performance
> >> improvement, so I didn't post performance data here.
> >> But, if you would like to see the performance gain from enlarging the
> >> threshold size for NUMA stats (compared to the first patch), I will do
> >> that later.
> >
> > Please do. I would also like to hear _why_ all counters cannot simply
> > behave the same. In other words why we cannot simply increase
> > stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> > for larger machines.
>
> I will add this performance data to the changelog in the V3 patch series.
>
> Test machine: 2-socket skylake platform (112 CPUs, 62G RAM)
> Benchmark: page_bench03
> Description: 112 threads do single page allocation/deallocation in
> parallel.
>
>                 before      after (enlarge threshold size)
> CPU cycles      722         379 (-47.5%)

Please describe the numbers some more. Is this an average? What is the
std? Can you see any difference with a more generic workload?

> Some thinking about that:
> a) the overhead due to cache bouncing caused by NUMA counter update in
> fast path severely increases with more and more CPU cores

What is the effect on a smaller system with fewer CPUs?

> b) AFAIK, the typical usage scenario (similar at least) for which this
> optimization can benefit is a 10/40G NIC used in the high-speed data
> center network of cloud service providers.

I would expect those would disable the numa accounting altogether.
--
Michal Hocko
SUSE Labs
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017年12月20日 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>>
>> On 2017年12月19日 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA
>>>> counters update in multi-threaded page allocation. See commit
>>>> 1d90ca897cb0 ("mm: update NUMA counter threshold size") for more
>>>> details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and
>>>> deals with global counter update using a different threshold size for
>>>> node page stats.
>>>
>>> Again, no numbers.
>>
>> Compared to the vanilla kernel, I don't think it has a performance
>> improvement, so I didn't post performance data here.
>> But, if you would like to see the performance gain from enlarging the
>> threshold size for NUMA stats (compared to the first patch), I will do
>> that later.
>
> Please do. I would also like to hear _why_ all counters cannot simply
> behave the same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.

I will add this performance data to the changelog in the V3 patch series.

Test machine: 2-socket skylake platform (112 CPUs, 62G RAM)
Benchmark: page_bench03
Description: 112 threads do single page allocation/deallocation in
parallel.

                before      after (enlarge threshold size)
CPU cycles      722         379 (-47.5%)

Some thinking about that:
a) the overhead due to cache bouncing caused by NUMA counter update in
fast path severely increases with more and more CPU cores
b) AFAIK, the typical usage scenario (similar at least) for which this
optimization can benefit is a 10/40G NIC used in the high-speed data
center network of cloud service providers.
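[For context, a toy model of the batching scheme the patch enlarges: each
CPU accumulates NUMA events in a small local diff and only folds it into
the shared node counter when the diff crosses a threshold; raising that
threshold toward the s16 maximum is what cuts the cache bouncing. This is
an illustrative single-threaded sketch with hypothetical names, not the
patch's actual code.]

```c
#include <assert.h>
#include <limits.h>

#define NCPU 4
/* The patch under discussion pins the NUMA threshold near the s16 max. */
#define NUMA_STAT_THRESHOLD (SHRT_MAX - 2)

static short pcpu_diff[NCPU]; /* per-CPU deltas, cheap to bump locally  */
static long global_numa_hit;  /* shared counter, costly cache line      */

/*
 * Each event bumps the CPU-local diff; the shared counter is touched
 * only when the local diff reaches the threshold. A larger threshold
 * means fewer writes to the shared cache line, at the price of staler
 * globally-visible numbers between folds.
 */
static void inc_numa_hit(int cpu)
{
	short v = ++pcpu_diff[cpu];

	if (v >= NUMA_STAT_THRESHOLD) {
		global_numa_hit += v; /* fold into the global counter */
		pcpu_diff[cpu] = 0;
	}
}

/* Readers sum in the pending per-CPU deltas for an up-to-date view. */
static long read_numa_hit(void)
{
	long sum = global_numa_hit;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += pcpu_diff[cpu];
	return sum;
}
```

The trade-off the thread keeps circling is visible here: readers can
always reconstruct an exact total by summing the diffs, so a huge
threshold costs precision of the cheap global snapshot, not correctness.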
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017-12-20 18:12, Michal Hocko wrote:
> On Wed 20-12-17 13:52:14, kemi wrote:
>> On 2017-12-19 20:40, Michal Hocko wrote:
>>> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>>>> We have seen significant overhead in cache bouncing caused by NUMA
>>>> counters update in multi-threaded page allocation. See 'commit
>>>> 1d90ca897cb0 ("mm: update NUMA counter threshold size")' for more
>>>> details.
>>>>
>>>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and
>>>> deals with global counter update using different threshold size for
>>>> node page stats.
>>>
>>> Again, no numbers.
>>
>> Compare to vanilla kernel, I don't think it has performance improvement, so
>> I didn't post performance data here.
>> But, if you would like to see performance gain from enlarging threshold size
>> for NUMA stats (compare to the first patch), I will do that later.
>
> Please do. I would also like to hear _why_ all counters cannot simply
> behave same. In other words why we cannot simply increase
> stat_threshold? Maybe calculate_normal_threshold needs a better scaling
> for larger machines.
>

Agree. We may consider that. But unlike NUMA counters, which do not
affect system decisions, we need to think very carefully before
increasing stat_threshold for all the counters on larger machines.
BTW, this is another topic that we may discuss in a different thread.
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Wed 20-12-17 13:52:14, kemi wrote:
> On 2017-12-19 20:40, Michal Hocko wrote:
> > On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> > > We have seen significant overhead in cache bouncing caused by NUMA counters
> > > update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
> > > update NUMA counter threshold size")' for more details.
> > >
> > > This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
> > > with global counter update using different threshold size for node page
> > > stats.
> >
> > Again, no numbers.
>
> Compare to vanilla kernel, I don't think it has performance improvement, so
> I didn't post performance data here.
> But, if you would like to see performance gain from enlarging threshold size
> for NUMA stats (compare to the first patch), I will do that later.

Please do. I would also like to hear _why_ all counters cannot simply
behave same. In other words why we cannot simply increase
stat_threshold? Maybe calculate_normal_threshold needs a better scaling
for larger machines.
--
Michal Hocko
SUSE Labs
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On 2017-12-19 20:40, Michal Hocko wrote:
> On Tue 19-12-17 14:39:24, Kemi Wang wrote:
>> We have seen significant overhead in cache bouncing caused by NUMA counters
>> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
>> update NUMA counter threshold size")' for more details.
>>
>> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
>> with global counter update using different threshold size for node page
>> stats.
>
> Again, no numbers.

Compare to vanilla kernel, I don't think it has performance improvement, so
I didn't post performance data here.
But, if you would like to see performance gain from enlarging threshold size
for NUMA stats (compare to the first patch), I will do that later.

> To be honest I do not really like the special casing
> here. Why are numa counters any different from PGALLOC which is
> incremented for _every_ single page allocation?
>

I guess you meant the PGALLOC event. The count of this event is kept in a
local per-cpu variable and summed up (for_each_online_cpu) when needed.
That is similar to the approach I used for NUMA stats in the V1 patch
series. Good enough.

>> ---
>>  mm/vmstat.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 9c681cc..64e08ae 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -30,6 +30,8 @@
>>
>>  #include "internal.h"
>>
>> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
>> +
>>  #ifdef CONFIG_NUMA
>>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>>
>> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>>  	s16 v, t;
>>
>>  	v = __this_cpu_inc_return(*p);
>> -	t = __this_cpu_read(pcp->stat_threshold);
>> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +		t = __this_cpu_read(pcp->stat_threshold);
>> +	else
>> +		t = VM_NUMA_STAT_THRESHOLD;
>> +
>>  	if (unlikely(v > t)) {
>>  		s16 overstep = t >> 1;
>>
>> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>>  	 * Most of the time the thresholds are the same anyways
>>  	 * for all cpus in a node.
>>  	 */
>> -	t = this_cpu_read(pcp->stat_threshold);
>> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
>> +		t = this_cpu_read(pcp->stat_threshold);
>> +	else
>> +		t = VM_NUMA_STAT_THRESHOLD;
>>
>>  	o = this_cpu_read(*p);
>>  	n = delta + o;
>> --
>> 2.7.4
>>
>
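The PGALLOC-style scheme kemi refers to can be sketched in userspace C: each CPU bumps only its own counter in the fast path, and readers sum across CPUs (the for_each_online_cpu walk), so no shared cacheline is touched at allocation time. Names below are illustrative, not the kernel API:

```c
#include <assert.h>

#define NR_CPUS 4  /* illustrative; the kernel sizes this per-config */

/* One slot per CPU; the kernel uses true per-cpu storage so each slot
 * lives on its own cacheline and is only written by its owner. */
static unsigned long pgalloc_event[NR_CPUS];

/* Fast path: touch only this CPU's slot -- no cache bouncing. */
void count_event(int cpu)
{
    pgalloc_event[cpu]++;
}

/* Slow path (readers, e.g. /proc/vmstat): sum all per-CPU slots.
 * Reads are racy but statistics tolerate that. */
unsigned long sum_event(void)
{
    unsigned long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += pgalloc_event[cpu];
    return sum;
}
```

The design choice is to move all cost to the (rare) read side; the threshold-folding scheme in the patch below is a middle ground that keeps a cheap approximate global value available at all times.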
Re: [PATCH v2 3/5] mm: enlarge NUMA counters threshold size
On Tue 19-12-17 14:39:24, Kemi Wang wrote:
> We have seen significant overhead in cache bouncing caused by NUMA counters
> update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
> update NUMA counter threshold size")' for more details.
>
> This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
> with global counter update using different threshold size for node page
> stats.

Again, no numbers. To be honest I do not really like the special casing
here. Why are numa counters any different from PGALLOC which is
incremented for _every_ single page allocation?

> ---
>  mm/vmstat.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 9c681cc..64e08ae 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -30,6 +30,8 @@
>
>  #include "internal.h"
>
> +#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
> +
>  #ifdef CONFIG_NUMA
>  int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;
>
> @@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
>  	s16 v, t;
>
>  	v = __this_cpu_inc_return(*p);
> -	t = __this_cpu_read(pcp->stat_threshold);
> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
> +		t = __this_cpu_read(pcp->stat_threshold);
> +	else
> +		t = VM_NUMA_STAT_THRESHOLD;
> +
>  	if (unlikely(v > t)) {
>  		s16 overstep = t >> 1;
>
> @@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
>  	 * Most of the time the thresholds are the same anyways
>  	 * for all cpus in a node.
>  	 */
> -	t = this_cpu_read(pcp->stat_threshold);
> +	if (item >= NR_VM_NUMA_STAT_ITEMS)
> +		t = this_cpu_read(pcp->stat_threshold);
> +	else
> +		t = VM_NUMA_STAT_THRESHOLD;
>
>  	o = this_cpu_read(*p);
>  	n = delta + o;
> --
> 2.7.4
>
--
Michal Hocko
SUSE Labs
[PATCH v2 3/5] mm: enlarge NUMA counters threshold size
We have seen significant overhead in cache bouncing caused by NUMA counters
update in multi-threaded page allocation. See 'commit 1d90ca897cb0 ("mm:
update NUMA counter threshold size")' for more details.

This patch updates NUMA counters to a fixed size of (MAX_S16 - 2) and deals
with global counter update using different threshold size for node page
stats.

Signed-off-by: Kemi Wang
---
 mm/vmstat.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9c681cc..64e08ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@

 #include "internal.h"

+#define VM_NUMA_STAT_THRESHOLD (S16_MAX - 2)
+
 #ifdef CONFIG_NUMA
 int sysctl_vm_numa_stat = ENABLE_NUMA_STAT;

@@ -394,7 +396,11 @@ void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 	s16 v, t;

 	v = __this_cpu_inc_return(*p);
-	t = __this_cpu_read(pcp->stat_threshold);
+	if (item >= NR_VM_NUMA_STAT_ITEMS)
+		t = __this_cpu_read(pcp->stat_threshold);
+	else
+		t = VM_NUMA_STAT_THRESHOLD;
+
 	if (unlikely(v > t)) {
 		s16 overstep = t >> 1;

@@ -549,7 +555,10 @@ static inline void mod_node_state(struct pglist_data *pgdat,
 	 * Most of the time the thresholds are the same anyways
 	 * for all cpus in a node.
 	 */
-	t = this_cpu_read(pcp->stat_threshold);
+	if (item >= NR_VM_NUMA_STAT_ITEMS)
+		t = this_cpu_read(pcp->stat_threshold);
+	else
+		t = VM_NUMA_STAT_THRESHOLD;

 	o = this_cpu_read(*p);
 	n = delta + o;
--
2.7.4
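The folding logic the patch modifies can be sketched in userspace C: a per-CPU s16 delta is folded into the global counter only when it crosses the item's threshold, and NUMA items get the large fixed threshold so they fold far less often. Names and sizes below are illustrative (this is not the kernel code, which uses true per-cpu storage and this_cpu ops):

```c
#include <assert.h>
#include <limits.h>

#define NUMA_STAT_THRESHOLD (SHRT_MAX - 2)  /* mirrors S16_MAX - 2 */
#define NR_NUMA_STAT_ITEMS  6               /* NUMA items come first */

struct pcp_stat {
    short diff;            /* per-CPU delta (s16 in the kernel) */
    short stat_threshold;  /* small dynamic threshold, non-NUMA items */
};

/* Global per-item counters, touched only when a per-CPU delta folds. */
static long global_stat[32];

/* Mirrors the patched __inc_node_state(): bump the per-CPU delta and
 * fold it into the global counter once it exceeds the item's threshold. */
void inc_state(struct pcp_stat *pcp, int item)
{
    short v, t;

    v = ++pcp[item].diff;
    t = (item >= NR_NUMA_STAT_ITEMS) ? pcp[item].stat_threshold
                                     : NUMA_STAT_THRESHOLD;
    if (v > t) {
        /* Overstep past zero so the next fold is not immediate. */
        short overstep = t >> 1;

        global_stat[item] += v + overstep;
        pcp[item].diff = -overstep;
    }
}
```

With a threshold of 4, the fifth increment folds 5 + 2 into the global counter and leaves the per-CPU delta at -2; a NUMA item (index below NR_NUMA_STAT_ITEMS) essentially never folds in the fast path, which is where the cycle savings come from.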