Re: [PATCH v12 00/31] Speculative page faults
On Mon, Dec 14, 2020 at 10:36:29AM +0100, Laurent Dufour wrote:
> Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> > On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> > [..]
> > > > > Hi Laurent,
> > > > >
> > > > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > > > several experiments, we observed SPF has obvious improvements on the
> > > > > launch time of applications, especially for the high-TLP ones.
> > > > >
> > > > > # launch time of applications (s):
> > > > >
> > > > > package         version   w/ SPF  w/o SPF  improve(%)
> > > > > ------------------------------------------------------
> > > > > Baidu maps      10.13.3   0.887   0.98     9.49
> > > > > Taobao          8.4.0.35  1.227   1.293    5.10
> > > > > Meituan         9.12.401  1.107   1.543    28.26
> > > > > WeChat          7.0.3     2.353   2.68     12.20
> > > > > Honor of Kings  1.43.1.6  6.63    6.713    1.24
> > > >
> > > > That's great news, thanks for reporting this!
> > > >
> > > > > By the way, we have verified our platforms with those patches and
> > > > > achieved the goal of mass production.
> > > >
> > > > Another piece of good news!
> > > > For my information, what is your targeted hardware?
> > > >
> > > > Cheers,
> > > > Laurent.
> > >
> > > Hi Laurent,
> > >
> > > Our targeted hardware belongs to the ARM64 multi-core series.
> >
> > Hello!
> >
> > I was trying to develop an intuition about why SPF gives an improvement
> > for you on small CPU systems. This is just a high-level theory, but:
> >
> > 1. Assume the improvement comes from the elimination of "blocking" on
> > mmap_sem. Could it be that the mmap_sem is acquired in write mode
> > unnecessarily in some places, thus causing blocking on mmap_sem in
> > other paths? If so, is it feasible to convert such usages to acquiring
> > it in read mode?
>
> That's correct, and the goal of this series is to try not holding the
> mmap_sem in read mode during page fault processing.
>
> Converting an mmap_sem holder from write to read mode is not so easy, and
> that work has already been done in some places. If you think there are
> areas where this could still be done, you're welcome to send patches
> fixing that.
>
> > 2. Assume the improvement comes from lower read-side contention on
> > mmap_sem. On small CPU systems, I would not expect reducing cache-line
> > bouncing to give such a dramatic improvement in performance as you are
> > seeing.
>
> I don't think cache-line bouncing reduction is the main source of the
> performance improvement; I would rather think this is the lesser part
> here. I guess this is mainly because a lot of page faults occur during
> loading time, and thus SPF is reducing the contention on the mmap_sem.

Thanks for the reply. I think I also wrongly assumed that acquiring the
mmap rwsem in write mode in a syscall makes SPF moot. Peter explained to
me on IRC that there's still a performance improvement in write mode if
an unrelated VMA is modified while another VMA is faulting. CMIIW - not
an mm expert by any stretch.

Thanks!

- Joel
Re: [PATCH v12 00/31] Speculative page faults
Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> [..]
> > Hi Laurent,
> >
> > Our targeted hardware belongs to the ARM64 multi-core series.
>
> Hello!
>
> I was trying to develop an intuition about why SPF gives an improvement
> for you on small CPU systems. This is just a high-level theory, but:
>
> 1. Assume the improvement comes from the elimination of "blocking" on
> mmap_sem. Could it be that the mmap_sem is acquired in write mode
> unnecessarily in some places, thus causing blocking on mmap_sem in other
> paths? If so, is it feasible to convert such usages to acquiring it in
> read mode?

That's correct, and the goal of this series is to try not holding the
mmap_sem in read mode during page fault processing.

Converting an mmap_sem holder from write to read mode is not so easy, and
that work has already been done in some places. If you think there are
areas where this could still be done, you're welcome to send patches
fixing that.

> 2. Assume the improvement comes from lower read-side contention on
> mmap_sem. On small CPU systems, I would not expect reducing cache-line
> bouncing to give such a dramatic improvement in performance as you are
> seeing.

I don't think cache-line bouncing reduction is the main source of the
performance improvement; I would rather think this is the lesser part
here. I guess this is mainly because a lot of page faults occur during
loading time, and thus SPF is reducing the contention on the mmap_sem.

> Thanks for any insight on this!
>
> - Joel
Re: [PATCH v12 00/31] Speculative page faults
On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]
> > > Hi Laurent,
> > >
> > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > several experiments, we observed SPF has obvious improvements on the
> > > launch time of applications, especially for the high-TLP ones.
> > > [..]
> > > By the way, we have verified our platforms with those patches and
> > > achieved the goal of mass production.
> >
> > Another piece of good news!
> > For my information, what is your targeted hardware?
> >
> > Cheers,
> > Laurent.
>
> Hi Laurent,
>
> Our targeted hardware belongs to the ARM64 multi-core series.

Hello!

I was trying to develop an intuition about why SPF gives an improvement
for you on small CPU systems. This is just a high-level theory, but:

1. Assume the improvement comes from the elimination of "blocking" on
mmap_sem. Could it be that the mmap_sem is acquired in write mode
unnecessarily in some places, thus causing blocking on mmap_sem in other
paths? If so, is it feasible to convert such usages to acquiring it in
read mode?

2. Assume the improvement comes from lower read-side contention on
mmap_sem. On small CPU systems, I would not expect reducing cache-line
bouncing to give such a dramatic improvement in performance as you are
seeing.

Thanks for any insight on this!

- Joel
Re: [PATCH v12 00/31] Speculative page faults
On Mon, 2020-07-06 at 14:27 +0200, Laurent Dufour wrote:
> Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> > On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> > [..]
> >
> > Hi Laurent,
> >
> > We merged SPF v11 and some patches from v12 into our platforms. After
> > several experiments, we observed SPF has obvious improvements on the
> > launch time of applications, especially for the high-TLP ones.
> > [..]
>
> That's great news, thanks for reporting this!
>
> > By the way, we have verified our platforms with those patches and
> > achieved the goal of mass production.
>
> Another piece of good news!
> For my information, what is your targeted hardware?
>
> Cheers,
> Laurent.

Hi Laurent,

Our targeted hardware belongs to the ARM64 multi-core series.

Thanks.
Chinwen
Re: [PATCH v12 00/31] Speculative page faults
Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> [..]
>
> Hi Laurent,
>
> We merged SPF v11 and some patches from v12 into our platforms. After
> several experiments, we observed SPF has obvious improvements on the
> launch time of applications, especially for the high-TLP ones.
> [..]

That's great news, thanks for reporting this!

> By the way, we have verified our platforms with those patches and
> achieved the goal of mass production.

Another piece of good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.
Re: [PATCH v12 00/31] Speculative page faults
On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> Hi Laurent,
>
> I downloaded your script and ran it on an Intel 2-socket Skylake
> platform with the spf-v12 patch series.
>
> Here attached the output results of this script.
>
> The following comparison result is statistics from the script outputs.
>
> a). Enable THP
>                                           SPF_0      change  SPF_1
> will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
> will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9
>
> b). Disable THP
>                                           SPF_0      change  SPF_1
> will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
> will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2
>
> Thanks,
> Haiyan Song
>
> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> > Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > > Please find attached the script I run to get these numbers.
> > > This would be nice if you could give it a try on your victim node
> > > and share the result.
> >
> > Sounds like the Intel mail filtering system doesn't like the attached
> > shell script. Please find it here:
> > https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> >
> > Thanks,
> > Laurent.

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for the high-TLP ones.

# launch time of applications (s):

package         version   w/ SPF  w/o SPF  improve(%)
------------------------------------------------------
Baidu maps      10.13.3   0.887   0.98     9.49
Taobao          8.4.0.35  1.227   1.293    5.10
Meituan         9.12.401  1.107   1.543    28.26
WeChat          7.0.3     2.353   2.68     12.20
Honor of Kings  1.43.1.6  6.63    6.713    1.24

By the way, we have verified our platforms with those patches and
achieved the goal of mass production.

Thanks.
Chinwen Chang
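[Editor's note: the improve(%) column above is the relative reduction in
launch time, (w/o SPF - w/ SPF) / (w/o SPF) * 100. A quick sketch (values
copied from the table above; the dict and variable names are illustrative,
not part of the original report) reproduces the reported percentages:]

```python
# Recompute the improve(%) column of the launch-time table.
# Times in seconds: (with_spf, without_spf).
launch_times = {
    "Baidu maps":     (0.887, 0.98),
    "Taobao":         (1.227, 1.293),
    "Meituan":        (1.107, 1.543),
    "WeChat":         (2.353, 2.68),
    "Honor of Kings": (6.63,  6.713),
}

for app, (with_spf, without_spf) in launch_times.items():
    # Relative launch-time reduction brought by SPF.
    improve = (without_spf - with_spf) / without_spf * 100
    print(f"{app:15s} {improve:5.2f}%")
# Matches the table: 9.49, 5.10, 28.26, 12.20, 1.24
```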
Re: [PATCH v12 00/31] Speculative page faults
Hi Laurent,

I downloaded your script and ran it on an Intel 2-socket Skylake platform
with the spf-v12 patch series.

Here attached the output results of this script.

The following comparison result is statistics from the script outputs.

a). Enable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9

b). Disable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2

Thanks,
Haiyan Song

On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > Please find attached the script I run to get these numbers.
> > This would be nice if you could give it a try on your victim node and
> > share the result.
>
> Sounds like the Intel mail filtering system doesn't like the attached
> shell script. Please find it here:
> https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
>
> Thanks,
> Laurent.
[Attached script outputs - per-run averages, 10 runs per configuration]

page_fault2, THP always:
  SPF 0: 2628818 2732209 2728392 2550695 2689873 2691963 2627612 2558295 2707877 2726174
  SPF 1: 2426260 2145674 2117769 2292502 2350403 2483327 2467324 2335393 2437859 2479865

page_fault2, THP never:
  SPF 0: 2712575 2711447 2672362 2701981 2668073 2579296 2662048 2637422 2579143 2608260
  SPF 1: 2348782 2203349 2312960 2402995 2318914 2543129 2390337 2490178 2416798 2424216

page_fault3, THP always:
  SPF 0: 4370143 4245754 4678884 4665759 4665809 4639132 4210755 4330552 4290469 4703015
  SPF 1: 3810608 3918890 3758003 3965024 3578151 3822748 3687293 3998701 3915771 3738130

page_fault3, THP never:
  SPF 0: 4505598 4672023 4701787 4355885 4338397 4446350 4360811 4653767 4016352 4312331
  SPF 1: 3685383 4029413 4051615 3747588 4058557 4042340 3971295 3752943 3750626 3777582
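[Editor's note: the summary means in the parent message follow directly
from these per-run averages. A sketch for the page_fault2 / THP always
case (numbers copied from the data above; variable names are
illustrative):]

```python
from statistics import mean

# Per-run page_fault2 averages with THP always, from the attached data.
spf0 = [2628818, 2732209, 2728392, 2550695, 2689873,
        2691963, 2627612, 2558295, 2707877, 2726174]
spf1 = [2426260, 2145674, 2117769, 2292502, 2350403,
        2483327, 2467324, 2335393, 2437859, 2479865]

m0, m1 = mean(spf0), mean(spf1)
change = (m1 - m0) / m0 * 100  # relative change brought by enabling SPF
print(f"SPF_0 mean: {m0:.1f}")    # 2664190.8, matching the summary table
print(f"SPF_1 mean: {m1:.1f}")    # 2353637.6
print(f"change: {change:+.1f}%")  # -11.7%
```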
Re: [PATCH v12 00/31] Speculative page faults
Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> Please find attached the script I run to get these numbers.
> This would be nice if you could give it a try on your victim node and
> share the result.

Sounds like the Intel mail filtering system doesn't like the attached
shell script. Please find it here:
https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.
Re: [PATCH v12 00/31] Speculative page faults
Le 06/06/2019 à 08:51, Haiyan Song a écrit :
> Hi Laurent,
>
> Regression tests for the v12 patch series have been run on an Intel
> 2-socket Skylake platform; some regressions were found by LKP-tools
> (Linux Kernel Performance). We only tested the cases that had been run
> and had shown regressions with the v11 patch series.
>
> Get the patch series from https://github.com/ldu4/linux/tree/spf-v12.
>
> Kernel commit:
>   base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c
>         (v5.1-rc4-mmotm-2019-04-09-17-51)
>   head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)
>
> Benchmark: will-it-scale
> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
> Metrics: will-it-scale.per_thread_ops = threads / nr_cpu
> Test box: lkp-skl-2sp8 (nr_cpu=72, memory=192G)
> THP: enable / disable
> nr_task: 100%
>
> The following are the benchmark results; each case was tested 4 times.
>
> a). Enable THP
>                                           base   %stddev  change  head   %stddev
> will-it-scale.page_fault3.per_thread_ops  63216  ±3%      -16.9%  52537  ±4%
> will-it-scale.page_fault2.per_thread_ops  36862           -9.8%   33256
>
> b). Disable THP
>                                           base   %stddev  change  head   %stddev
> will-it-scale.page_fault3.per_thread_ops  65111           -18.6%  53023  ±2%
> will-it-scale.page_fault2.per_thread_ops  38164           -12.0%  33565

Hi Haiyan,

Thanks for running these tests on your systems.

I ran the same tests on my systems (x86 and PowerPC) and I didn't get the
same numbers. My x86 system has fewer CPUs but a larger amount of memory,
but I don't think this matters a lot since my numbers are far from yours.

x86_64 48 CPUs 755G

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault2_threads                   SPF OFF              SPF ON
THP always           2200902.3 [5%]   2152618.8 -2% [4%]   2136316   -3% [7%]
THP never            2185616.5 [6%]   2099274.2 -4% [3%]   2123275.1 -3% [7%]

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault3_threads                   SPF OFF              SPF ON
THP always           2700078.7 [5%]   2789437.1 +3%  [4%]  2944806.8 +12% [3%]
THP never            2625756.7 [4%]   2944806.8 +12% [8%]  2876525.5 +10% [4%]

PowerPC P8 80 CPUs 31G

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault2_threads                   SPF OFF              SPF ON
THP always           171732   [0%]    170762.8 -1% [0%]    170450.9 -1% [0%]
THP never            171808.4 [0%]    170600.3 -1% [0%]    170231.6 -1% [0%]

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault3_threads                   SPF OFF              SPF ON
THP always           2499.6 [13%]     2624.5 +5% [11%]     2734.5 +9% [3%]
THP never            2732.5 [2%]      2791.1 +2% [1%]      2695   -3% [4%]

Numbers in brackets are the standard deviation percent. I ran each test
10 times and then computed the average and deviation.

Please find attached the script I run to get these numbers. This would be
nice if you could give it a try on your victim node and share the result.

Thanks,
Laurent.

> Best regards,
> Haiyan Song
>
> On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> > This is a port on kernel 5.1 of the work done by Peter Zijlstra to
> > handle page faults without holding the mm semaphore [1].
> > [..]
Re: [PATCH v12 00/31] Speculative page faults
Hi Laurent,

Regression tests for the v12 patch series have been run on an Intel
2-socket Skylake platform; some regressions were found by LKP-tools
(Linux Kernel Performance). We only tested the cases that had been run
and had shown regressions with the v11 patch series.

Get the patch series from https://github.com/ldu4/linux/tree/spf-v12.

Kernel commit:
  base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c
        (v5.1-rc4-mmotm-2019-04-09-17-51)
  head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops = threads / nr_cpu
Test box: lkp-skl-2sp8 (nr_cpu=72, memory=192G)
THP: enable / disable
nr_task: 100%

The following are the benchmark results; each case was tested 4 times.

a). Enable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  63216  ±3%      -16.9%  52537  ±4%
will-it-scale.page_fault2.per_thread_ops  36862           -9.8%   33256

b). Disable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  65111           -18.6%  53023  ±2%
will-it-scale.page_fault2.per_thread_ops  38164           -12.0%  33565

Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> This is a port on kernel 5.1 of the work done by Peter Zijlstra to
> handle page faults without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes, since the page fault handler will not wait for other threads'
> memory layout changes to be done, assuming that the change is done in
> another part of the process's memory space. This type of page fault is
> named speculative page fault. If the speculative page fault fails
> because concurrency has been detected, or because the underlying PMD or
> PTE tables are not yet allocated, its processing fails and a regular
> page fault is then tried.
>
> The speculative page fault (SPF) handler has to look for the VMA
> matching the fault address without holding the mmap_sem; this is done by
> protecting the MM RB tree with RCU and by using a reference counter on
> each VMA. When fetching a VMA under RCU protection, the VMA's reference
> counter is incremented to ensure that the VMA will not be freed behind
> our back during the SPF processing. Once that processing is done, the
> VMA's reference counter is decremented. To ensure that a VMA is still
> present when walking the RB tree locklessly, the VMA's reference counter
> is incremented when that VMA is linked into the RB tree. When the VMA is
> unlinked from the RB tree, its reference counter will be decremented at
> the end of the RCU grace period, ensuring it will be available during
> this time. This means that the VMA's freeing could be delayed and could
> delay the file closing for file mappings. Since the SPF handler is not
> able to manage file mappings, the file is closed synchronously and not
> during the RCU cleaning. This is safe since the page fault handler
> aborts if a file pointer is associated with the VMA.
>
> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
> benchmark [2].
>
> The VMA's attributes checked during the speculative page fault
> processing have to be protected against parallel changes. This is done
> by using a per-VMA sequence lock. This sequence lock allows the
> speculative page fault handler to quickly check for parallel changes in
> progress and to abort the speculative page fault in that case.
>
> Once the VMA has been found, the speculative page fault handler checks
> the VMA's attributes to verify whether the page fault can be handled
> correctly or not. Thus, the VMA is protected through a sequence lock
> which allows fast detection of concurrent VMA changes. If such a change
> is detected, the speculative page fault is aborted and a *classic* page
> fault is tried instead. VMA sequence lockings are added when the VMA
> attributes which are checked during the page fault are modified.
>
> When the PTE is fetched, the VMA is checked to see if it has been
> changed; so once the page table is locked, the VMA is valid, and any
> other change leading to touching this PTE will need to lock the page
> table, so no parallel change is possible at this time.
>
> The locking of the PTE is done with interrupts disabled; this allows
> checking the PMD to ensure that there is no ongoing collapsing
> operation. Since khugepaged first sets the PMD to pmd_none and then
> waits for the other CPUs to have caught the IPI interrupt, if the PMD is
> valid at the time the PTE is locked, we have the guarantee that the
> collapsing operation will have to wait on the PTE lock to move forward.
> This allows the SPF handler to map the PTE safely. If the PMD value is
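[Editor's note: the per-VMA sequence-lock validation described in the
cover letter above can be sketched in miniature as follows. This is an
illustrative user-space model, not kernel code - the class and function
names are invented, and the memory barriers and RCU of the real
implementation are omitted - showing only the snapshot/validate/retry
shape.]

```python
import threading

class VmaSeqCount:
    """Toy analogue of the per-VMA sequence lock: a writer bumps the
    counter to an odd value while modifying the VMA and back to even
    when done; a speculative reader snapshots the counter, reads the
    attributes, then rechecks the counter and bails out to the classic
    (mmap_sem-holding) path if it moved or was odd."""
    def __init__(self):
        self.seq = 0
        self._write_lock = threading.Lock()  # stands in for write-mode mmap_sem
        self.vm_start, self.vm_end = 0x1000, 0x9000

    def write_begin(self):
        self._write_lock.acquire()
        self.seq += 1  # odd: modification in progress

    def write_end(self):
        self.seq += 1  # even again: attributes stable
        self._write_lock.release()

def speculative_fault(vma, addr):
    seq = vma.seq
    if seq & 1:                            # writer in progress: abort
        return "retry-classic"
    start, end = vma.vm_start, vma.vm_end  # speculative attribute reads
    if vma.seq != seq:                     # VMA changed under us: abort
        return "retry-classic"
    if not (start <= addr < end):
        return "bad-address"
    return "handled-speculatively"

vma = VmaSeqCount()
print(speculative_fault(vma, 0x2000))  # handled-speculatively
vma.write_begin()
print(speculative_fault(vma, 0x2000))  # retry-classic (seq is odd)
vma.write_end()
```

The real handler performs the second check again after taking the PTE
lock, which is what makes the speculatively read attributes safe to act
on.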
Re: [PATCH v12 00/31] Speculative page faults
Le 22/04/2019 à 23:29, Michel Lespinasse a écrit :
> Hi Laurent,
>
> Thanks a lot for copying me on this patchset. It took me a few days to
> go through it - I had not been following the previous iterations of this
> series so I had to catch up. I will be sending comments for individual
> commits, but before that I would like to discuss the series as a whole.

Hi Michel,

Thanks for reviewing this series.

> I think these changes are a big step in the right direction. My main
> reservation about them is that they are additive - adding some
> complexity for speculative page faults - and I wonder if it'd be
> possible, over the long term, to replace the existing complexity we have
> in mmap_sem retry mechanisms instead of adding to it. This is not
> something that should block your progress, but I think it would be good,
> as we introduce spf, to evaluate whether we could eventually get all the
> way to removing the mmap_sem retry mechanism, or if we will actually
> have to keep both.

Until we get rid of the mmap_sem, which seems to be a very long story, I
can't see how we could get rid of the retry mechanism.

> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too? My
> understanding is that the mechanism of verifying the vma after taking
> back the ptl at the end of the fault would work there too? The file has
> to stay referenced during the fault, but holding the vma's refcount
> could be made to cover that? The vm_file refcount would have to be
> released in __free_vma() instead of remove_vma(); I'm not quite sure if
> that has more implications than I realize?

The only concern is the flow of operations done in the vm_ops->fault()
processing. Most of the file systems rely on the generic filemap_fault(),
which should be safe to use. But we need a clever way to identify fault
processing which is compatible with the SPF handler. This could be done
using a tag/flag in the vm_ops structure or in the vma's flags. This
would be the next step.

> The proposed spf mechanism only works at the pte level after the page
> tables have already been created. The non-spf page fault path takes the
> mm->page_table_lock to protect against concurrent page table allocation
> by multiple page faults; I think unmapping/freeing page tables could be
> done under mm->page_table_lock too, so that spf could implement
> allocating new page tables by verifying the vma after taking the
> mm->page_table_lock?

I have to admit that I didn't dig further here. Do you have a patch? ;)

> The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL. I am not
> sure what is the issue there - is this due to the vma->vm_start and
> vma->vm_pgoff reads in *__vm_normal_page()?

Yes, that's the reason: there is no way to guarantee the value of these
fields in the SPF path.

> My last potential concern is about performance. The numbers you have
> look great, but I worry about potential regressions in PF performance
> for threaded processes that don't currently encounter contention (i.e.
> there may be just one thread actually doing all the work while the
> others are blocked). I think one good proxy for measuring that would be
> to measure a single threaded workload - kernbench would be fine -
> without the special-case optimization in patch 22 where
> handle_speculative_fault() immediately aborts in the single-threaded
> case.

I'll have to give it a try.

> Reviewed-by: Michel Lespinasse
>
> This is for the series as a whole; I expect to do another review pass on
> individual commits in the series when we have agreement on the toplevel
> stuff (I noticed a few things like out-of-date commit messages but
> that's really minor stuff).

Thanks a lot for reviewing this long series.

> I want to add a note about mmap_sem. In the past there have been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were too
> expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may be
> a great way to avoid blocking between unrelated mmap_sem writers (for
> example, do not delay stack creation for new threads while a large mmap
> or munmap may be going on), and probably also to handle mmap_sem readers
> that can't easily use the spf mechanism (for example, gup callers which
> make use of the returned vmas). But again that is a separate topic to
> explore which doesn't have to get resolved before spf goes in.
Re: [PATCH v12 00/31] Speculative page faults
Le 23/04/2019 à 11:38, Peter Zijlstra a écrit :
> On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> > The proposed spf mechanism only handles anon vmas. Is there a
> > fundamental reason why it couldn't handle mapped files too? My
> > understanding is that the mechanism of verifying the vma after taking
> > back the ptl at the end of the fault would work there too? The file
> > has to stay referenced during the fault, but holding the vma's
> > refcount could be made to cover that? The vm_file refcount would have
> > to be released in __free_vma() instead of remove_vma(); I'm not quite
> > sure if that has more implications than I realize?
>
> IIRC (and I really don't remember all that much) the trickiest bit was
> vs unmount. Since files can stay open past the 'expected' duration,
> umount could be delayed.
>
> But yes, I think I had a version that did all that just 'fine'. Like
> mentioned, I didn't keep the refcount because it sucked just as hard as
> the mmap_sem contention, but the SRCU callback did the fput() just fine
> (esp. now that we have delayed_fput).

I had to use a refcount for the VMA because I'm using RCU in place of
SRCU and only protecting the RB tree using RCU.

Regarding the file pointer, I decided to release it synchronously to
avoid the latency of RCU during the file closing. As you mentioned, this
could delay the umount, but not only that, as Linus Torvalds demonstrated
in the past [1]. Anyway, since the file support is not here yet, there is
no need for that currently.

Regarding the file mapping support, the concern is to ensure that
vm_ops->fault() will not try to release the mmap_sem. This is true for
most of the file system operations using the generic implementation, but
there is currently no clever way to identify that except by checking the
vm_ops->fault pointer. Adding a flag to the vm_operations_struct
structure is another option; that's doable as long as the underlying
fault() function is not dealing with the mmap_sem. I made an attempt in
the past, but was thinking that the anonymous case should be accepted
first before moving forward this way.

[1] https://lore.kernel.org/linux-mm/alpine.LFD.2.00.1001041904250.3630@localhost.localdomain/
Re: [PATCH v12 00/31] Speculative page faults
On Tue 23-04-19 05:41:48, Matthew Wilcox wrote: > On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote: > > On Mon 22-04-19 14:29:16, Michel Lespinasse wrote: > > [...] > > > I want to add a note about mmap_sem. In the past there has been > > > discussions about replacing it with an interval lock, but these never > > > went anywhere because, mostly, of the fact that such mechanisms were > > > too expensive to use in the page fault path. I think adding the spf > > > mechanism would invite us to revisit this issue - interval locks may > > > be a great way to avoid blocking between unrelated mmap_sem writers > > > (for example, do not delay stack creation for new threads while a > > > large mmap or munmap may be going on), and probably also to handle > > > mmap_sem readers that can't easily use the spf mechanism (for example, > > > gup callers which make use of the returned vmas). But again that is a > > > separate topic to explore which doesn't have to get resolved before > > > spf goes in. > > > > Well, I believe we should _really_ re-evaluate the range locking sooner > > rather than later. Why? Because it looks like the most straightforward > > approach to the mmap_sem contention for most usecases I have heard of > > (mostly a mm{unm}ap, mremap standing in the way of page faults). > > On a plus side it also makes us think about the current mmap (ab)users > > which should lead to an overall code improvements and maintainability. > > Dave Chinner recently did evaluate the range lock for solving a problem > in XFS and didn't like what he saw: > > https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c Thank you, will have a look. > I think scaling the lock needs to be tied to the actual data structure > and not have a second tree on-the-side to fake-scale the locking. Anyway, > we're going to have a session on this at LSFMM, right? 
I thought we had something for the mmap_sem scaling but I do not see it in the list of proposed topics. But we can certainly add it there.

> > SPF sounds like a good idea but it is a really big and intrusive surgery
> > to the #PF path. And more importantly without any real world usecase
> > numbers which would justify this. That being said I am not opposed to
> > this change I just think it is a large hammer while we haven't seen
> > attempts to tackle problems in a simpler way.
>
> I don't think the "no real world usecase numbers" is fair. Laurent quoted:
>
> > Ebizzy:
> > -------
> > The test counts the number of records per second it can manage; the
> > higher, the better. I run it like this: 'ebizzy -mTt '. To get a
> > consistent result I repeated the test 100 times and measured the
> > average. The number is the records processed per second; the higher,
> > the better.
> >
> >                     BASE      SPF       delta
> > 24 CPUs x86         5492.69   9383.07   70.83%
> > 1024 CPUs P8 VM     8476.74   17144.38  102%
>
> and cited 30% improvement for you-know-what product from an earlier
> version of the patch.

Well, we are talking about 45 files changed, 1277 insertions(+), 196 deletions(-), which is a _major_ surgery in my book. Asking for real life workload numbers is nothing unfair IMHO. And let me remind you that I am not really opposing SPF in general. I would just like to see a simpler approach before we go for such a large change. If the range locking is not really a scalable approach then all right, but from what I've seen it should help with most of the bottlenecks I have seen.
--
Michal Hocko
SUSE Labs
Re: [PATCH v12 00/31] Speculative page faults
On Tue, Apr 23, 2019 at 05:41:48AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On a plus side it also makes us think about the current mmap (ab)users
> > which should lead to an overall code improvements and maintainability.
>
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
>
> https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c
>
> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.

Right, which is how I ended up using the split PT locks. They already provide fine(r)-grained locking.
Re: [PATCH v12 00/31] Speculative page faults
On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> [...]
> > I want to add a note about mmap_sem. In the past there has been
> > discussions about replacing it with an interval lock, but these never
> > went anywhere because, mostly, of the fact that such mechanisms were
> > too expensive to use in the page fault path. I think adding the spf
> > mechanism would invite us to revisit this issue - interval locks may
> > be a great way to avoid blocking between unrelated mmap_sem writers
> > (for example, do not delay stack creation for new threads while a
> > large mmap or munmap may be going on), and probably also to handle
> > mmap_sem readers that can't easily use the spf mechanism (for example,
> > gup callers which make use of the returned vmas). But again that is a
> > separate topic to explore which doesn't have to get resolved before
> > spf goes in.
>
> Well, I believe we should _really_ re-evaluate the range locking sooner
> rather than later. Why? Because it looks like the most straightforward
> approach to the mmap_sem contention for most usecases I have heard of
> (mostly a mm{unm}ap, mremap standing in the way of page faults).
> On a plus side it also makes us think about the current mmap (ab)users
> which should lead to an overall code improvements and maintainability.

Dave Chinner recently did evaluate the range lock for solving a problem in XFS and didn't like what he saw:

https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

I think scaling the lock needs to be tied to the actual data structure and not have a second tree on-the-side to fake-scale the locking. Anyway, we're going to have a session on this at LSFMM, right?

> SPF sounds like a good idea but it is a really big and intrusive surgery
> to the #PF path. And more importantly without any real world usecase
> numbers which would justify this.
> That being said I am not opposed to
> this change I just think it is a large hammer while we haven't seen
> attempts to tackle problems in a simpler way.

I don't think the "no real world usecase numbers" is fair. Laurent quoted:

> Ebizzy:
> -------
> The test counts the number of records per second it can manage; the
> higher, the better. I run it like this: 'ebizzy -mTt '. To get a
> consistent result I repeated the test 100 times and measured the
> average. The number is the records processed per second; the higher,
> the better.
>
>                     BASE      SPF       delta
> 24 CPUs x86         5492.69   9383.07   70.83%
> 1024 CPUs P8 VM     8476.74   17144.38  102%

and cited 30% improvement for you-know-what product from an earlier version of the patch.
Re: [PATCH v12 00/31] Speculative page faults
On 04/16/2019 07:14 PM, Laurent Dufour wrote:
> In pseudo code, this could be seen as:
>
> speculative_page_fault()
> {
>     vma = find_vma_rcu()
>     check vma sequence count
>     check vma's support
>     disable interrupt
>         check pgd,p4d,...,pte
>         save pmd and pte in vmf
>         save vma sequence counter in vmf
>     enable interrupt
>     check vma sequence count
>     handle_pte_fault(vma)
>         ..
>         page = alloc_page()
>         pte_map_lock()
>             disable interrupt
>                 abort if sequence counter has changed
>                 abort if pmd or pte has changed
>                 pte map and lock
>             enable interrupt
>         if abort
>             free page
>             abort

Wouldn't it be better if the 'page' allocated here could be passed on to handle_pte_fault() below, so that in the fallback path it does not have to enter the buddy allocator again? Of course, it would require changes to handle_pte_fault() to accommodate a pre-allocated non-NULL struct page to operate on, or to free back into the buddy if the fallback path fails for some other reason. This would probably reduce the SPF path's overhead for cases where it has to fall back on handle_pte_fault() after pte_map_lock() in speculative_page_fault().

> ...
> put_vma(vma)
> }
>
> arch_fault_handler()
> {
>     if (speculative_page_fault())
>         goto done
> again:
>     lock(mmap_sem)
>     vma = find_vma();
>     handle_pte_fault(vma);
>     if retry
>         unlock(mmap_sem)
>         goto again;
> done:
>     handle fault error
> }

- Anshuman
Re: [PATCH v12 00/31] Speculative page faults
On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
[...]
> I want to add a note about mmap_sem. In the past there have been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.

Well, I believe we should _really_ re-evaluate the range locking sooner rather than later. Why? Because it looks like the most straightforward approach to the mmap_sem contention for most usecases I have heard of (mostly mm{unm}ap, mremap standing in the way of page faults). On the plus side it also makes us think about the current mmap (ab)users, which should lead to overall code improvements and maintainability.

SPF sounds like a good idea but it is a really big and intrusive surgery to the #PF path, and more importantly one without any real world usecase numbers which would justify it. That being said, I am not opposed to this change; I just think it is a large hammer while we haven't seen attempts to tackle the problems in a simpler way.
--
Michal Hocko
SUSE Labs
Re: [PATCH v12 00/31] Speculative page faults
On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

IIRC (and I really don't remember all that much) the trickiest bit was vs unmount. Since files can stay open past the 'expected' duration, umount could be delayed. But yes, I think I had a version that did all that just 'fine'. Like mentioned, I didn't keep the refcount because it sucked just as hard as the mmap_sem contention, but the SRCU callback did the fput() just fine (esp. now that we have delayed_fput).
Re: [PATCH v12 00/31] Speculative page faults
Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to go through it - I had not been following the previous iterations of this series so I had to catch up. I will be sending comments for individual commits, but before that I would like to discuss the series as a whole.

I think these changes are a big step in the right direction. My main reservation about them is that they are additive - adding some complexity for speculative page faults - and I wonder if it'd be possible, over the long term, to replace the existing complexity we have in mmap_sem retry mechanisms instead of adding to it. This is not something that should block your progress, but I think it would be good, as we introduce spf, to evaluate whether we could eventually get all the way to removing the mmap_sem retry mechanism, or if we will actually have to keep both.

The proposed spf mechanism only handles anon vmas. Is there a fundamental reason why it couldn't handle mapped files too? My understanding is that the mechanism of verifying the vma after taking back the ptl at the end of the fault would work there too? The file has to stay referenced during the fault, but holding the vma's refcount could be made to cover that? The vm_file refcount would have to be released in __free_vma() instead of remove_vma(); I'm not quite sure if that has more implications than I realize.

The proposed spf mechanism only works at the pte level after the page tables have already been created. The non-spf page fault path takes the mm->page_table_lock to protect against concurrent page table allocation by multiple page faults; I think unmapping/freeing page tables could be done under mm->page_table_lock too, so that spf could implement allocating new page tables by verifying the vma after taking the mm->page_table_lock?

The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what the issue is there - is this due to the vma->vm_start and vma->vm_pgoff reads in *__vm_normal_page()?

My last potential concern is about performance. The numbers you have look great, but I worry about potential regressions in PF performance for threaded processes that don't currently encounter contention (i.e. there may be just one thread actually doing all the work while the others are blocked). I think one good proxy for measuring that would be to measure a single-threaded workload - kernbench would be fine - without the special-case optimization in patch 22 where handle_speculative_fault() immediately aborts in the single-threaded case.

Reviewed-by: Michel Lespinasse

This is for the series as a whole; I expect to do another review pass on individual commits in the series when we have agreement on the toplevel stuff (I noticed a few things like out-of-date commit messages but that's really minor stuff).

I want to add a note about mmap_sem. In the past there have been discussions about replacing it with an interval lock, but these never went anywhere because, mostly, of the fact that such mechanisms were too expensive to use in the page fault path. I think adding the spf mechanism would invite us to revisit this issue - interval locks may be a great way to avoid blocking between unrelated mmap_sem writers (for example, do not delay stack creation for new threads while a large mmap or munmap may be going on), and probably also to handle mmap_sem readers that can't easily use the spf mechanism (for example, gup callers which make use of the returned vmas). But again that is a separate topic to explore which doesn't have to get resolved before spf goes in.
[PATCH v12 00/31] Speculative page faults
This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the mmap_sem. This should allow better concurrency for massively threaded processes, since the page fault handler will not wait for other threads' memory layout changes to be done, assuming that those changes are done in another part of the process's memory space. This type of page fault is named speculative page fault. If the speculative page fault fails because concurrency has been detected, or because the underlying PMD or PTE tables are not yet allocated, its processing is aborted and a regular page fault is then tried.

The speculative page fault (SPF) handler has to look for the VMA matching the fault address without holding the mmap_sem; this is done by protecting the MM RB tree with RCU and by using a reference counter on each VMA. When fetching a VMA under RCU protection, the VMA's reference counter is incremented to ensure that the VMA will not be freed behind our back during the SPF processing. Once that processing is done, the VMA's reference counter is decremented. To ensure that a VMA is still present when walking the RB tree locklessly, the VMA's reference counter is incremented when that VMA is linked into the RB tree. When the VMA is unlinked from the RB tree, its reference counter will be decremented at the end of the RCU grace period, ensuring it will be available during this time. This means that VMA freeing could be delayed, which could in turn delay file closing for file mappings. Since the SPF handler is not able to manage file mappings, the file is closed synchronously and not during the RCU cleanup. This is safe since the page fault handler aborts if a file pointer is associated with the VMA. Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale benchmark [2].
The VMA's attributes checked during the speculative page fault processing have to be protected against parallel changes. This is done by using a per-VMA sequence lock. This sequence lock allows the speculative page fault handler to quickly check for parallel changes in progress and to abort the speculative page fault in that case. Once the VMA has been found, the speculative page fault handler checks the VMA's attributes to verify whether the page fault can be handled this way or not; if a concurrent VMA change is detected through the sequence lock, the speculative page fault is aborted and a *classic* page fault is tried instead. VMA sequence locking is added around modifications of the VMA attributes which are checked during the page fault. When the PTE is fetched, the VMA is checked to see if it has been changed, so once the page table is locked, the VMA is valid; any other change leading to touching this PTE would need to lock the page table, so no parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled; this allows checking the PMD to ensure that there is no ongoing collapse operation. Since khugepaged first sets the PMD to pmd_none and then waits for the other CPUs to have caught the IPI interrupt, if the PMD is valid at the time the PTE is locked, we have the guarantee that the collapse operation will have to wait on the PTE lock to move forward. This allows the SPF handler to map the PTE safely. If the PMD value is different from the one recorded at the beginning of the SPF operation, the classic page fault handler is called to handle the fault while holding the mmap_sem.

As the PTE lock is taken with interrupts disabled, the lock is taken using spin_trylock() to avoid a deadlock when handling a page fault while a TLB invalidate is requested by another CPU holding the PTE lock.
In pseudo code, this could be seen as:

speculative_page_fault()
{
	vma = find_vma_rcu()
	check vma sequence count
	check vma's support
	disable interrupt
		check pgd,p4d,...,pte
		save pmd and pte in vmf
		save vma sequence counter in vmf
	enable interrupt
	check vma sequence count
	handle_pte_fault(vma)
		..
		page = alloc_page()
		pte_map_lock()
			disable interrupt
				abort if sequence counter has changed
				abort if pmd or pte has changed
				pte map and lock
			enable interrupt
		if abort
			free page
			abort
	...
	put_vma(vma)
}

arch_fault_handler()
{
	if (speculative_page_fault())
		goto done
again:
	lock(mmap_sem)
	vma = find_vma();