Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
On Wed, Mar 31 2021 at 09:52, Mel Gorman wrote:
> Ingo, Thomas or Peter, is there any chance one of you could take a look
> at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
> local_lock" from this series? It's partially motivated by PREEMPT_RT. More
> details below.

Sure.
Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
Ingo, Thomas or Peter, is there any chance one of you could take a look
at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
local_lock" from this series? It's partially motivated by PREEMPT_RT. More
details below.

On Mon, Mar 29, 2021 at 01:06:42PM +0100, Mel Gorman wrote:
> This series requires patches in Andrew's tree so the series is also
> available at
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>

The bulk page allocation series that the local_lock work depends on had an
additional fix so I've rebased this onto

  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r16

> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
>   local_irq_disable();
>   spin_lock();
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>

It's this part I'm interested in even though it only partially addresses
the preempt-rt tree concerns. More legwork is needed for preempt-rt which
is outside the context of this series. At minimum, it involves

1. Splitting the locking of the pcp and buddy allocators instead of using
   spin_lock() when it's "known" that IRQs are disabled (not necessarily a
   valid assumption on PREEMPT_RT)

2. Splitting the zone lock into what protects the zone metadata and what
   protects the free lists

This looks straight-forward but it involves audit work and it may be
difficult to avoid regressing non-PREEMPT_RT kernels by disabling/enabling
IRQs when switching between the pcp allocator and the buddy allocator.

> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowly getting the PREEMPT tree into mainline.
>
>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++-
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173
>  mm/vmstat.c            | 254 +++--
>  6 files changed, 254 insertions(+), 287 deletions(-)
>
> --
> 2.26.2

--
Mel Gorman
SUSE Labs
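The unsafe nesting and the conversion the series performs can be sketched in
kernel-style pseudocode. This is a simplified illustration and not the actual
patch; the lock name `pcp_lock` and the elided arguments are invented here
for clarity:

```
/* Before: one IRQ-disabled region covers the pcp lists, the nested
 * zone->lock taken in rmqueue_bulk(), and the vmstat updates.  This is
 * exactly the local_irq_disable(); spin_lock(); nesting that
 * Documentation/locking/locktypes.rst says PREEMPT_RT cannot support. */
local_irq_save(flags);
page = __rmqueue_pcplist(...);   /* may call rmqueue_bulk() -> spin_lock(&zone->lock) */
zone_statistics(...);            /* accurate counters require IRQs off */
local_irq_restore(flags);

/* After: the pcp lists get their own local_lock.  On !PREEMPT_RT this
 * still maps to disabling interrupts, so behaviour is unchanged; on
 * PREEMPT_RT it becomes a per-CPU lock inside which a spin_lock may
 * legally be taken.  With the NUMA stats converted to inaccurate
 * counters, disabling preemption suffices for the stat update. */
local_lock_irqsave(&pcp_lock, flags);    /* 'pcp_lock' is an illustrative name */
page = __rmqueue_pcplist(...);
local_unlock_irqrestore(&pcp_lock, flags);
zone_statistics(...);            /* inaccurate counters: preempt_disable suffices */
```

The point of the split is that each lock now documents exactly what it
protects, which is what the PREEMPT_RT audit work described above needs.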
Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
On Wed, 31 Mar 2021 08:38:05 +0100 Mel Gorman wrote:
> On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 29 Mar 2021 13:06:42 +0100 Mel Gorman wrote:
> > > This series requires patches in Andrew's tree so the series is also
> > > available at
> > >
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> > >
> > > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > > the allocation rate of the bulk page allocator. RT people, this
> > > *partially* addresses some problems PREEMPT_RT has with the page
> > > allocator but it needs review.
> >
> > I've run a new micro-benchmark[1] which shows:
> > (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
> >
> > BASELINE
> >  single_page alloc+put: 194 cycles(tsc) 54.106 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
> >  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
> >  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
> >  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
> >  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
> >  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
> >  Per elem:  99 cycles(tsc) 27.728 ns (step:32)
> >  Per elem:  98 cycles(tsc) 27.252 ns (step:64)
> >  Per elem:  97 cycles(tsc) 27.090 ns (step:128)
> >
> > This should be seen in comparison with the older micro-benchmark[2]
> > done on branch mm-bulk-rebase-v5r9.
> >
> > BASELINE
> >  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> >  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> >  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> >  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> >  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> >  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> >  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> >  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> >  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
>
> Ok, so bulk allocation is faster than allocating single pages, no surprise
> there. Putting the array figures for bulk allocation into tabular format
> and comparing we get;
>
> Array variant (time to allocate a page in nanoseconds, lower is better)
>        Baseline  Patched
>   1    56.383    54.225 (+3.83%)
>   2    40.047    35.492 (+11.38%)
>   3    37.339    32.643 (+12.58%)
>   4    35.578    30.992 (+12.89%)
>   8    33.592    29.606 (+11.87%)
>  16    32.362    28.532 (+11.85%)
>  32    31.476    27.728 (+11.91%)
>  64    30.633    27.252 (+11.04%)
> 128    30.596    27.090 (+11.46%)
>
> The series is 11-12% faster when allocating multiple pages. That's a
> fairly positive outcome and I'll include this in the series leader if
> you have no objections.

That is fine by me to add this to the cover letter. I like your tabular
format as it makes it easier to compare.

If you use the nanosec measurements and not the cycles, you should state
that this was run on a CPU E5-1650 v4 @ 3.60GHz. You might notice that
the factor between cycles(tsc) and ns is very close to 3.6.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 29 Mar 2021 13:06:42 +0100 Mel Gorman wrote:
> > This series requires patches in Andrew's tree so the series is also
> > available at
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> >
> > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > the allocation rate of the bulk page allocator. RT people, this
> > *partially* addresses some problems PREEMPT_RT has with the page
> > allocator but it needs review.
>
> I've run a new micro-benchmark[1] which shows:
> (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
>
> BASELINE
>  single_page alloc+put: 194 cycles(tsc) 54.106 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
>  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
>  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
>  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
>  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
>  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
>  Per elem:  99 cycles(tsc) 27.728 ns (step:32)
>  Per elem:  98 cycles(tsc) 27.252 ns (step:64)
>  Per elem:  97 cycles(tsc) 27.090 ns (step:128)
>
> This should be seen in comparison with the older micro-benchmark[2]
> done on branch mm-bulk-rebase-v5r9.
>
> BASELINE
>  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
>  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
>  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
>  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
>  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
>  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
>  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
>  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
>  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
>

Ok, so bulk allocation is faster than allocating single pages, no surprise
there. Putting the array figures for bulk allocation into tabular format
and comparing we get;

Array variant (time to allocate a page in nanoseconds, lower is better)
       Baseline  Patched
  1    56.383    54.225 (+3.83%)
  2    40.047    35.492 (+11.38%)
  3    37.339    32.643 (+12.58%)
  4    35.578    30.992 (+12.89%)
  8    33.592    29.606 (+11.87%)
 16    32.362    28.532 (+11.85%)
 32    31.476    27.728 (+11.91%)
 64    30.633    27.252 (+11.04%)
128    30.596    27.090 (+11.46%)

The series is 11-12% faster when allocating multiple pages. That's a
fairly positive outcome and I'll include this in the series leader if
you have no objections.

Thanks Jesper!

--
Mel Gorman
SUSE Labs
Re: [RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
On Mon, 29 Mar 2021 13:06:42 +0100 Mel Gorman wrote:
> This series requires patches in Andrew's tree so the series is also
> available at
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> tldr: Jesper and Chuck, it would be nice to verify if this series helps
> the allocation rate of the bulk page allocator. RT people, this
> *partially* addresses some problems PREEMPT_RT has with the page
> allocator but it needs review.

I've run a new micro-benchmark[1] which shows:
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

BASELINE
 single_page alloc+put: 194 cycles(tsc) 54.106 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size
 Per elem: 200 cycles(tsc) 55.667 ns (step:1)
 Per elem: 143 cycles(tsc) 39.755 ns (step:2)
 Per elem: 132 cycles(tsc) 36.758 ns (step:3)
 Per elem: 128 cycles(tsc) 35.795 ns (step:4)
 Per elem: 123 cycles(tsc) 34.339 ns (step:8)
 Per elem: 120 cycles(tsc) 33.396 ns (step:16)
 Per elem: 118 cycles(tsc) 32.806 ns (step:32)
 Per elem: 115 cycles(tsc) 32.169 ns (step:64)
 Per elem: 116 cycles(tsc) 32.258 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
 Per elem: 195 cycles(tsc) 54.225 ns (step:1)
 Per elem: 127 cycles(tsc) 35.492 ns (step:2)
 Per elem: 117 cycles(tsc) 32.643 ns (step:3)
 Per elem: 111 cycles(tsc) 30.992 ns (step:4)
 Per elem: 106 cycles(tsc) 29.606 ns (step:8)
 Per elem: 102 cycles(tsc) 28.532 ns (step:16)
 Per elem:  99 cycles(tsc) 27.728 ns (step:32)
 Per elem:  98 cycles(tsc) 27.252 ns (step:64)
 Per elem:  97 cycles(tsc) 27.090 ns (step:128)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk-local_lock-v1r15

This should be seen in comparison with the older micro-benchmark[2]
done on branch mm-bulk-rebase-v5r9.

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size
 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk

This new patchset does show some improvements in the micro-benchmark.

> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>
> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
>   local_irq_disable();
>   spin_lock();
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>
> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator.
[RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead
This series requires patches in Andrew's tree so the series is also
available at

  git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15

tldr: Jesper and Chuck, it would be nice to verify if this series helps
the allocation rate of the bulk page allocator. RT people, this
*partially* addresses some problems PREEMPT_RT has with the page
allocator but it needs review.

The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
with vmstat which is inconvenient and causes some issues. Possibly because
of that, the PCP list and vmstat share the same per-cpu space meaning that
it's possible that vmstat updates dirty cache lines holding per-cpu lists
across CPUs unless padding is used. The series splits that structure and
separates the locking.

Second, PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

  local_irq_disable();
  spin_lock();

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
separates the locking requirements for the PCP list (local_lock) and stat
updates (irqs disabled). Once that is done, the length of time IRQs are
disabled can be reduced and in some cases, IRQ disabling can be replaced
with preempt_disable.

After that, it was very obvious that zone_statistics in particular has way
too much overhead and leaves IRQs disabled for longer than necessary. It
has perfectly accurate counters requiring IRQs be disabled for parallel
RMW sequences when inaccurate ones like vm_events would do. The series
makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
only require preempt be disabled.

Finally the bulk page allocator can then do all the stat updates in bulk
with IRQs enabled which should improve the efficiency of the bulk page
allocator. Technically, this could have been done without the local_lock
and vmstat conversion work and the order simply reflects the timing of
when different series were implemented.

No performance data is included because despite the overhead of the
stats, it's within the noise for most workloads but Jesper and Chuck may
observe a significant difference with the same tests used for the bulk
page allocator. The series is more likely to be interesting to the RT
folk in terms of slowly getting the PREEMPT tree into mainline.

 drivers/base/node.c    |  18 +--
 include/linux/mmzone.h |  29 +++--
 include/linux/vmstat.h |  65 ++-
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        | 173
 mm/vmstat.c            | 254 +++--
 6 files changed, 254 insertions(+), 287 deletions(-)

--
2.26.2