Re: [PATCH v12 00/31] Speculative page faults

2020-12-14 Thread Joel Fernandes
On Mon, Dec 14, 2020 at 10:36:29AM +0100, Laurent Dufour wrote:
> Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> > On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> > [..]
> > > > > Hi Laurent,
> > > > > 
> > > > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > > > several experiments, we observed SPF has obvious improvements on the
> > > > > launch time of applications, especially for those high-TLP ones,
> > > > > 
> > > > > # launch time of applications(s):
> > > > > 
> > > > > package         version   w/ SPF  w/o SPF  improve(%)
> > > > > ---------------------------------------------------------
> > > > > Baidu maps      10.13.3   0.887   0.98     9.49
> > > > > Taobao          8.4.0.35  1.227   1.293    5.10
> > > > > Meituan         9.12.401  1.107   1.543    28.26
> > > > > WeChat          7.0.3     2.353   2.68     12.20
> > > > > Honor of Kings  1.43.1.6  6.63    6.713    1.24
> > > > 
> > > > That's great news, thanks for reporting this!
> > > > 
> > > > > 
> > > > > By the way, we have verified our platforms with those patches and
> > > > > achieved the goal of mass production.
> > > > 
> > > > More good news!
> > > > For my information, what is your targeted hardware?
> > > > 
> > > > Cheers,
> > > > Laurent.
> > > 
> > > Hi Laurent,
> > > 
> > Our targeted hardware is an ARM64 multi-core platform.
> > 
> > Hello!
> > 
> > I was trying to develop an intuition about why SPF gives an improvement for
> > you on small CPU systems. This is just a high-level theory, but:
> > 
> > 1. Assume the improvement comes from the elimination of "blocking" on
> > mmap_sem. Could it be that mmap_sem is acquired in write mode unnecessarily
> > in some places, thus causing blocking on mmap_sem in other paths? If so, is
> > it feasible to convert such usages to read mode?
> 
> That's correct, and the goal of this series is to avoid holding the
> mmap_sem in read mode during page fault processing.
> 
> Converting mmap_sem holders from write to read mode is not so easy, and
> that work has already been done in some places. If you think there are
> areas where this could still be done, you're welcome to send patches
> fixing that.
> 
> > 2. Assume the improvement comes from reduced read-side contention on
> > mmap_sem. On small CPU systems, I would not expect reducing cache-line
> > bouncing to give such a dramatic improvement in performance as you are
> > seeing.
> 
> I don't think reduced cache-line bouncing is the main source of the
> performance improvement; I would rather expect it to play the smaller part
> here. I guess this is mainly because a lot of page faults occur during
> loading time, and SPF thus reduces the contention on the mmap_sem.

Thanks for the reply. I think I also wrongly assumed that acquiring the mmap
rwsem in write mode in a syscall makes SPF moot. Peter explained to me on IRC
that there's still a perf improvement in write mode if an unrelated VMA is
modified while another VMA is faulting. CMIIW; I'm not an mm expert by any
stretch.

Thanks!

 - Joel



Re: [PATCH v12 00/31] Speculative page faults

2020-12-14 Thread Laurent Dufour

Le 14/12/2020 à 03:03, Joel Fernandes a écrit :

On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package         version   w/ SPF  w/o SPF  improve(%)
---------------------------------------------------------
Baidu maps      10.13.3   0.887   0.98     9.49
Taobao          8.4.0.35  1.227   1.293    5.10
Meituan         9.12.401  1.107   1.543    28.26
WeChat          7.0.3     2.353   2.68     12.20
Honor of Kings  1.43.1.6  6.63    6.713    1.24


That's great news, thanks for reporting this!



By the way, we have verified our platforms with those patches and
achieved the goal of mass production.


More good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.


Hi Laurent,

Our targeted hardware is an ARM64 multi-core platform.


Hello!

I was trying to develop an intuition about why SPF gives an improvement for
you on small CPU systems. This is just a high-level theory, but:

1. Assume the improvement comes from the elimination of "blocking" on
mmap_sem. Could it be that mmap_sem is acquired in write mode unnecessarily
in some places, thus causing blocking on mmap_sem in other paths? If so, is
it feasible to convert such usages to read mode?


That's correct, and the goal of this series is to avoid holding the mmap_sem
in read mode during page fault processing.


Converting mmap_sem holders from write to read mode is not so easy, and that
work has already been done in some places. If you think there are areas where
this could still be done, you're welcome to send patches fixing that.



2. Assume the improvement comes from reduced read-side contention on
mmap_sem. On small CPU systems, I would not expect reducing cache-line
bouncing to give such a dramatic improvement in performance as you are
seeing.


I don't think reduced cache-line bouncing is the main source of the
performance improvement; I would rather expect it to play the smaller part
here. I guess this is mainly because a lot of page faults occur during
loading time, and SPF thus reduces the contention on the mmap_sem.



Thanks for any insight on this!

- Joel





Re: [PATCH v12 00/31] Speculative page faults

2020-12-13 Thread Joel Fernandes
On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]
> > > Hi Laurent,
> > > 
> > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > several experiments, we observed SPF has obvious improvements on the
> > > launch time of applications, especially for those high-TLP ones,
> > > 
> > > # launch time of applications(s):
> > > 
> > > package         version   w/ SPF  w/o SPF  improve(%)
> > > ---------------------------------------------------------
> > > Baidu maps      10.13.3   0.887   0.98     9.49
> > > Taobao          8.4.0.35  1.227   1.293    5.10
> > > Meituan         9.12.401  1.107   1.543    28.26
> > > WeChat          7.0.3     2.353   2.68     12.20
> > > Honor of Kings  1.43.1.6  6.63    6.713    1.24
> > 
> > That's great news, thanks for reporting this!
> > 
> > > 
> > > By the way, we have verified our platforms with those patches and
> > > achieved the goal of mass production.
> > 
> > More good news!
> > For my information, what is your targeted hardware?
> > 
> > Cheers,
> > Laurent.
> 
> Hi Laurent,
> 
> Our targeted hardware is an ARM64 multi-core platform.

Hello!

I was trying to develop an intuition about why SPF gives an improvement for
you on small CPU systems. This is just a high-level theory, but:

1. Assume the improvement comes from the elimination of "blocking" on
mmap_sem. Could it be that mmap_sem is acquired in write mode unnecessarily
in some places, thus causing blocking on mmap_sem in other paths? If so, is
it feasible to convert such usages to read mode?

2. Assume the improvement comes from reduced read-side contention on
mmap_sem. On small CPU systems, I would not expect reducing cache-line
bouncing to give such a dramatic improvement in performance as you are
seeing.

Thanks for any insight on this!

- Joel



Re: [PATCH v12 00/31] Speculative page faults

2020-07-06 Thread Chinwen Chang
On Mon, 2020-07-06 at 14:27 +0200, Laurent Dufour wrote:
> Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> > On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> >> Hi Laurent,
> >>
> >> I downloaded your script and ran it on an Intel 2-socket Skylake platform
> >> with the spf-v12 patch series.
> >>
> >> Attached are the output results of this script.
> >>
> >> The following comparison is computed from the script outputs.
> >>
> >> a). Enable THP
> >>                                           SPF_0      change  SPF_1
> >> will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
> >> will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9
> >>
> >>
> >> b). Disable THP
> >>                                           SPF_0      change  SPF_1
> >> will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
> >> will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2
> >>
> >>
> >> Thanks,
> >> Haiyan Song
> >>
> >>
> >> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> >>> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
>  Please find attached the script I run to get these numbers.
>  It would be nice if you could give it a try on your victim node and
>  share the result.
> >>>
> >>> It sounds like the Intel mail filtering system doesn't like the attached
> >>> shell script.
> >>> Please find it here:
> >>> https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> >>>
> >>> Thanks,
> >>> Laurent.
> >>>
> > 
> > Hi Laurent,
> > 
> > We merged SPF v11 and some patches from v12 into our platforms. After
> > several experiments, we observed SPF has obvious improvements on the
> > launch time of applications, especially for those high-TLP ones,
> > 
> > # launch time of applications(s):
> > 
> > package         version   w/ SPF  w/o SPF  improve(%)
> > ---------------------------------------------------------
> > Baidu maps      10.13.3   0.887   0.98     9.49
> > Taobao          8.4.0.35  1.227   1.293    5.10
> > Meituan         9.12.401  1.107   1.543    28.26
> > WeChat          7.0.3     2.353   2.68     12.20
> > Honor of Kings  1.43.1.6  6.63    6.713    1.24
> 
> That's great news, thanks for reporting this!
> 
> > 
> > By the way, we have verified our platforms with those patches and
> > achieved the goal of mass production.
> 
> More good news!
> For my information, what is your targeted hardware?
> 
> Cheers,
> Laurent.

Hi Laurent,

Our targeted hardware is an ARM64 multi-core platform.

Thanks.
Chinwen
> 



Re: [PATCH v12 00/31] Speculative page faults

2020-07-06 Thread Laurent Dufour

Le 06/07/2020 à 11:25, Chinwen Chang a écrit :

On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:

Hi Laurent,

I downloaded your script and ran it on an Intel 2-socket Skylake platform
with the spf-v12 patch series.

Attached are the output results of this script.

The following comparison is computed from the script outputs.

a). Enable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9


b). Disable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2


Thanks,
Haiyan Song


On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:

Le 14/06/2019 à 10:37, Laurent Dufour a écrit :

Please find attached the script I run to get these numbers.
It would be nice if you could give it a try on your victim node and share
the result.


It sounds like the Intel mail filtering system doesn't like the attached
shell script.
Please find it here:
https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.



Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package         version   w/ SPF  w/o SPF  improve(%)
---------------------------------------------------------
Baidu maps      10.13.3   0.887   0.98     9.49
Taobao          8.4.0.35  1.227   1.293    5.10
Meituan         9.12.401  1.107   1.543    28.26
WeChat          7.0.3     2.353   2.68     12.20
Honor of Kings  1.43.1.6  6.63    6.713    1.24


That's great news, thanks for reporting this!



By the way, we have verified our platforms with those patches and
achieved the goal of mass production.


More good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.



Re: [PATCH v12 00/31] Speculative page faults

2020-07-06 Thread Chinwen Chang
On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> Hi Laurent,
> 
> I downloaded your script and ran it on an Intel 2-socket Skylake platform
> with the spf-v12 patch series.
> 
> Attached are the output results of this script.
> 
> The following comparison is computed from the script outputs.
> 
> a). Enable THP
>                                           SPF_0      change  SPF_1
> will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
> will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9
> 
> 
> b). Disable THP
>                                           SPF_0      change  SPF_1
> will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
> will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2
> 
> 
> Thanks,
> Haiyan Song
> 
> 
> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> > Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > > Please find attached the script I run to get these numbers.
> > > It would be nice if you could give it a try on your victim node and
> > > share the result.
> > 
> > It sounds like the Intel mail filtering system doesn't like the attached
> > shell script.
> > Please find it here:
> > https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> > 
> > Thanks,
> > Laurent.
> > 

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package         version   w/ SPF  w/o SPF  improve(%)
---------------------------------------------------------
Baidu maps      10.13.3   0.887   0.98     9.49
Taobao          8.4.0.35  1.227   1.293    5.10
Meituan         9.12.401  1.107   1.543    28.26
WeChat          7.0.3     2.353   2.68     12.20
Honor of Kings  1.43.1.6  6.63    6.713    1.24


By the way, we have verified our platforms with those patches and
achieved the goal of mass production.

Thanks.
Chinwen Chang


Re: [PATCH v12 00/31] Speculative page faults

2019-06-20 Thread Haiyan Song
Hi Laurent,

I downloaded your script and ran it on an Intel 2-socket Skylake platform
with the spf-v12 patch series.

Attached are the output results of this script.

The following comparison is computed from the script outputs.

a). Enable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2664190.8  -11.7%  2353637.6
will-it-scale.page_fault3.per_thread_ops  4480027.2  -14.7%  3819331.9


b). Disable THP
                                          SPF_0      change  SPF_1
will-it-scale.page_fault2.per_thread_ops  2653260.7  -10%    2385165.8
will-it-scale.page_fault3.per_thread_ops  4436330.1  -12.4%  3886734.2


Thanks,
Haiyan Song


On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > Please find attached the script I run to get these numbers.
> > It would be nice if you could give it a try on your victim node and share
> > the result.
> 
> It sounds like the Intel mail filtering system doesn't like the attached
> shell script.
> Please find it here:
> https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> 
> Thanks,
> Laurent.
> 
 THP always
 SPF 0
average:2628818
average:2732209
average:2728392
average:2550695
average:2689873
average:2691963
average:2627612
average:2558295
average:2707877
average:2726174
 SPF 1
average:2426260
average:2145674
average:2117769
average:2292502
average:2350403
average:2483327
average:2467324
average:2335393
average:2437859
average:2479865
 THP never
 SPF 0
average:2712575
average:2711447
average:2672362
average:2701981
average:2668073
average:2579296
average:2662048
average:2637422
average:2579143
average:2608260
 SPF 1
average:2348782
average:2203349
average:2312960
average:2402995
average:2318914
average:2543129
average:2390337
average:2490178
average:2416798
average:2424216
 THP always
 SPF 0
average:4370143
average:4245754
average:4678884
average:4665759
average:4665809
average:4639132
average:4210755
average:4330552
average:4290469
average:4703015
 SPF 1
average:3810608
average:3918890
average:3758003
average:3965024
average:3578151
average:3822748
average:3687293
average:3998701
average:3915771
average:3738130
 THP never
 SPF 0
average:4505598
average:4672023
average:4701787
average:4355885
average:4338397
average:4446350
average:4360811
average:4653767
average:4016352
average:4312331
 SPF 1
average:3685383
average:4029413
average:4051615
average:3747588
average:4058557
average:4042340
average:3971295
average:3752943
average:3750626
average:3777582


Re: [PATCH v12 00/31] Speculative page faults

2019-06-14 Thread Laurent Dufour

Le 14/06/2019 à 10:37, Laurent Dufour a écrit :

Please find attached the script I run to get these numbers.
It would be nice if you could give it a try on your victim node and share
the result.


It sounds like the Intel mail filtering system doesn't like the attached
shell script.
Please find it here:
https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.



Re: [PATCH v12 00/31] Speculative page faults

2019-06-14 Thread Laurent Dufour

Le 06/06/2019 à 08:51, Haiyan Song a écrit :

Hi Laurent,

Regression tests for the v12 patch series have been run on an Intel
2-socket Skylake platform; some regressions were found by LKP-tools (Linux
Kernel Performance). Only the cases that had been run and shown regressions
on the v11 patch series were tested.

Get the patch series from https://github.com/ldu4/linux/tree/spf-v12.
Kernel commits:
   base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
   head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
THP: enable / disable
nr_task: 100%

The following are the benchmark results; every case was tested 4 times.

a). Enable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  63216  ±3%      -16.9%  52537  ±4%
will-it-scale.page_fault2.per_thread_ops  36862           -9.8%   33256

b). Disable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  65111           -18.6%  53023  ±2%
will-it-scale.page_fault2.per_thread_ops  38164           -12.0%  33565


Hi Haiyan,

Thanks for running these tests on your systems.

I did the same tests on my systems (x86 and PowerPC) and I didn't get the
same numbers.
My x86 system has fewer CPUs but a larger amount of memory, but I don't
think this matters a lot since my numbers are far from yours.

x86_64 48CPUs 755G
                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault2_threads                   SPF OFF              SPF ON
THP always           2200902.3 [5%]   2152618.8 -2% [4%]   2136316   -3% [7%]
THP never            2185616.5 [6%]   2099274.2 -4% [3%]   2123275.1 -3% [7%]

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault3_threads                   SPF OFF              SPF ON
THP always           2700078.7 [5%]   2789437.1 +3% [4%]   2944806.8 +12% [3%]
THP never            2625756.7 [4%]   2944806.8 +12% [8%]  2876525.5 +10% [4%]

PowerPC P8 80CPUs 31G
                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault2_threads                   SPF OFF              SPF ON
THP always           171732   [0%]    170762.8 -1% [0%]    170450.9 -1% [0%]
THP never            171808.4 [0%]    170600.3 -1% [0%]    170231.6 -1% [0%]

                     5.1.0-rc4-mm1    5.1.0-rc4-mm1-spf
page_fault3_threads                   SPF OFF              SPF ON
THP always           2499.6 [13%]     2624.5 +5% [11%]     2734.5 +9% [3%]
THP never            2732.5 [2%]      2791.1 +2% [1%]      2695   -3% [4%]

Numbers in brackets are the standard deviation percentage.

I ran each test 10 times and then computed the average and deviation.

Please find attached the script I run to get these numbers.
It would be nice if you could give it a try on your victim node and share
the result.

Thanks,
Laurent.


Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:

This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
page fault without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will not wait for other threads'
memory layout changes to be done, assuming that those changes happen in
another part of the process's memory space. This type of page fault is named
a speculative page fault. If the speculative page fault fails, because
concurrency has been detected or because the underlying PMD or PTE tables
are not yet allocated, its processing is aborted and a regular page fault is
then tried.

The speculative page fault (SPF) handler has to look up the VMA matching the
fault address without holding the mmap_sem. This is done by protecting the
MM RB tree with RCU and by using a reference counter on each VMA. When
fetching a VMA under RCU protection, the VMA's reference counter is
incremented to ensure that the VMA will not be freed behind our back during
SPF processing. Once that processing is done, the VMA's reference counter is
decremented. To ensure that a VMA is still present when walking the RB tree
locklessly, the VMA's reference counter is incremented when that VMA is
linked into the RB tree. When the VMA is unlinked from the RB tree, its
reference counter is decremented at the end of the RCU grace period,
ensuring it stays available during this time. This means that VMA freeing
could be delayed, which could in turn delay file closing for file mappings.
Since the SPF handler is not able to manage file mappings, files are closed
synchronously and not during the RCU cleanup. This is safe since the 

Re: [PATCH v12 00/31] Speculative page faults

2019-06-06 Thread Haiyan Song
Hi Laurent,

Regression tests for the v12 patch series have been run on an Intel
2-socket Skylake platform; some regressions were found by LKP-tools (Linux
Kernel Performance). Only the cases that had been run and shown regressions
on the v11 patch series were tested.

Get the patch series from https://github.com/ldu4/linux/tree/spf-v12.
Kernel commits:
  base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
  head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
THP: enable / disable
nr_task: 100%

The following are the benchmark results; every case was tested 4 times.

a). Enable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  63216  ±3%      -16.9%  52537  ±4%
will-it-scale.page_fault2.per_thread_ops  36862           -9.8%   33256

b). Disable THP
                                          base   %stddev  change  head   %stddev
will-it-scale.page_fault3.per_thread_ops  65111           -18.6%  53023  ±2%
will-it-scale.page_fault2.per_thread_ops  38164           -12.0%  33565

Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
> page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes, since the page fault handler will not wait for other threads'
> memory layout changes to be done, assuming that those changes happen in
> another part of the process's memory space. This type of page fault is
> named a speculative page fault. If the speculative page fault fails,
> because concurrency has been detected or because the underlying PMD or PTE
> tables are not yet allocated, its processing is aborted and a regular page
> fault is then tried.
> 
> The speculative page fault (SPF) handler has to look up the VMA matching
> the fault address without holding the mmap_sem. This is done by protecting
> the MM RB tree with RCU and by using a reference counter on each VMA. When
> fetching a VMA under RCU protection, the VMA's reference counter is
> incremented to ensure that the VMA will not be freed behind our back
> during SPF processing. Once that processing is done, the VMA's reference
> counter is decremented. To ensure that a VMA is still present when walking
> the RB tree locklessly, the VMA's reference counter is incremented when
> that VMA is linked into the RB tree. When the VMA is unlinked from the RB
> tree, its reference counter is decremented at the end of the RCU grace
> period, ensuring it stays available during this time. This means that VMA
> freeing could be delayed, which could in turn delay file closing for file
> mappings. Since the SPF handler is not able to manage file mappings, files
> are closed synchronously and not during the RCU cleanup. This is safe
> since the page fault handler aborts if a file pointer is associated with
> the VMA.
> 
> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
> benchmark [2].
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA has been found, the speculative page fault handler checks the
> VMA's attributes to verify whether the page fault can be handled correctly
> or not. Thus, the VMA is protected through a sequence lock which allows
> fast detection of concurrent VMA changes. If such a change is detected,
> the speculative page fault is aborted and a *classic* page fault is tried.
> VMA sequence locking is added where VMA attributes which are checked
> during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked the VMA is known to be valid. Any other
> change touching this PTE would need to lock the page table, so no parallel
> change is possible at this time.
> 
> The locking of the PTE is done with interrupts disabled; this allows
> checking the PMD to ensure that there is no ongoing collapse operation.
> Since khugepaged first sets the PMD to pmd_none and then waits for the
> other CPUs to have caught the IPI interrupt, if the PMD is valid at the
> time the PTE is locked, we have the guarantee that the collapse operation
> will have to wait on the PTE lock to move forward. This allows the SPF
> handler to map the PTE safely. If the PMD value is 

Re: [PATCH v12 00/31] Speculative page faults

2019-04-24 Thread Laurent Dufour

Le 22/04/2019 à 23:29, Michel Lespinasse a écrit :

Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to
go through it; I had not been following the previous iterations of
this series, so I had to catch up. I will be sending comments for
individual commits, but before that I would like to discuss the series
as a whole.


Hi Michel,

Thanks for reviewing this series.


I think these changes are a big step in the right direction. My main
reservation about them is that they are additive - adding some complexity
for speculative page faults - and I wonder if it'd be possible, over the
long term, to replace the existing complexity we have in mmap_sem retry
mechanisms instead of adding to it. This is not something that should
block your progress, but I think it would be good, as we introduce spf,
to evaluate whether we could eventually get all the way to removing the
mmap_sem retry mechanism, or if we will actually have to keep both.


Until we get rid of the mmap_sem, which seems to be a very long story, I 
can't see how we could get rid of the retry mechanism.



The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?


The only concern is the flow of operations done in the vm_ops->fault() 
processing. Most file systems rely on the generic filemap_fault(), which 
should be safe to use. But we need a clever way to identify fault handlers 
which are compatible with the SPF handler. This could be done using a 
tag/flag in the vm_ops structure or in the vma's flags.


This would be the next step.



The proposed spf mechanism only works at the pte level after the page
tables have already been created. The non-spf page fault path takes the
mm->page_table_lock to protect against concurrent page table allocation
by multiple page faults; I think unmapping/freeing page tables could
be done under mm->page_table_lock too so that spf could implement
allocating new page tables by verifying the vma after taking the
mm->page_table_lock ?


I have to admit that I didn't dig further here.
Do you have a patch? ;)



The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what is the issue there - is this due to the vma->vm_start
and vma->vm_pgoff reads in *__vm_normal_page() ?


Yes, that's the reason: there is no way to guarantee the value of these 
fields in the SPF path.




My last potential concern is about performance. The numbers you have
look great, but I worry about potential regressions in PF performance
for threaded processes that don't currently encounter contention
(i.e. there may be just one thread actually doing all the work while
the others are blocked). I think one good proxy for measuring that
would be to measure a single threaded workload - kernbench would be
fine - without the special-case optimization in patch 22 where
handle_speculative_fault() immediately aborts in the single-threaded case.


I'll have to give it a try.


Reviewed-by: Michel Lespinasse 
This is for the series as a whole; I expect to do another review pass on
individual commits in the series when we have agreement on the toplevel
stuff (I noticed a few things like out-of-date commit messages but that's
really minor stuff).


Thanks a lot for reviewing this long series.



I want to add a note about mmap_sem. In the past there has been
discussions about replacing it with an interval lock, but these never
went anywhere because, mostly, of the fact that such mechanisms were
too expensive to use in the page fault path. I think adding the spf
mechanism would invite us to revisit this issue - interval locks may
be a great way to avoid blocking between unrelated mmap_sem writers
(for example, do not delay stack creation for new threads while a
large mmap or munmap may be going on), and probably also to handle
mmap_sem readers that can't easily use the spf mechanism (for example,
gup callers which make use of the returned vmas). But again that is a
separate topic to explore which doesn't have to get resolved before
spf goes in.





Re: [PATCH v12 00/31] Speculative page faults

2019-04-24 Thread Laurent Dufour

Le 23/04/2019 à 11:38, Peter Zijlstra a écrit :

On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:

The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?


IIRC (and I really don't remember all that much) the trickiest bit was
vs unmount. Since files can stay open past the 'expected' duration,
umount could be delayed.

But yes, I think I had a version that did all that just 'fine'. Like
mentioned, I didn't keep the refcount because it sucked just as hard as
the mmap_sem contention, but the SRCU callback did the fput() just fine
(esp. now that we have delayed_fput).


I had to use a refcount for the VMA because I'm using RCU in place of 
SRCU and only protecting the RB tree using RCU.


Regarding the file pointer, I decided to release it synchronously to 
avoid the RCU latency during file closing. As you mentioned, this could 
delay the umount, but not only that, as Linus Torvalds demonstrated in 
the past [1]. Anyway, since file support is not here yet, there is no 
need for that currently.


Regarding the file mapping support, the concern is to ensure that 
vm_ops->fault() will not try to release the mmap_sem. This is true for 
most of the file system operations using the generic ones, but there is 
currently no clever way to identify that except by checking the 
vm_ops->fault pointer. Adding a flag to the vm_operations_struct 
structure is another option.


That's doable as long as the underlying fault() function is not dealing 
with the mmap_sem. I made an attempt in the past, but was thinking that 
the anonymous case should be accepted first before moving forward this way.


[1] 
https://lore.kernel.org/linux-mm/alpine.LFD.2.00.1001041904250.3630@localhost.localdomain/




Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Michal Hocko
On Tue 23-04-19 05:41:48, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> > [...]
> > > I want to add a note about mmap_sem. In the past there has been
> > > discussions about replacing it with an interval lock, but these never
> > > went anywhere because, mostly, of the fact that such mechanisms were
> > > too expensive to use in the page fault path. I think adding the spf
> > > mechanism would invite us to revisit this issue - interval locks may
> > > be a great way to avoid blocking between unrelated mmap_sem writers
> > > (for example, do not delay stack creation for new threads while a
> > > large mmap or munmap may be going on), and probably also to handle
> > > mmap_sem readers that can't easily use the spf mechanism (for example,
> > > gup callers which make use of the returned vmas). But again that is a
> > > separate topic to explore which doesn't have to get resolved before
> > > spf goes in.
> > 
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On a plus side it also makes us think about the current mmap (ab)users
> > which should lead to an overall code improvements and maintainability.
> 
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
> 
> https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

Thank you, will have a look.

> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.  Anyway,
> we're going to have a session on this at LSFMM, right?

I thought we had something for the mmap_sem scaling but I do not see
it in the list of proposed topics. We can certainly add it there.

> > SPF sounds like a good idea but it is a really big and intrusive surgery
> > to the #PF path. And more importantly without any real world usecase
> > numbers which would justify this. That being said I am not opposed to
> > this change I just think it is a large hammer while we haven't seen
> > attempts to tackle problems in a simpler way.
> 
> I don't think the "no real world usecase numbers" is fair.  Laurent quoted:
> 
> > Ebizzy:
> > ---
> > The test counts the number of records per second it can manage; the
> > higher the better. I run it like this: 'ebizzy -mTt '. To get a
> > consistent result I repeated the test 100 times and measured the
> > average. The number is the records processed per second; the higher
> > the better.
> > 
> >                 BASE     SPF       delta
> > 24 CPUs x86     5492.69  9383.07   70.83%
> > 1024 CPUs P8 VM 8476.74  17144.38  102%
> 
> and cited 30% improvement for you-know-what product from an earlier
> version of the patch.

Well, we are talking about
45 files changed, 1277 insertions(+), 196 deletions(-)

which is _major_ surgery in my book. Having real-life workload numbers
is not an unfair thing to ask for, IMHO.

And let me remind you that I am not really opposing SPF in general. I
would just like to see a simpler approach tried before we go for such a
large change. If the range locking is not really a scalable approach then
all right, but from what I've seen it should help with a lot of the
bottlenecks I have seen.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Peter Zijlstra
On Tue, Apr 23, 2019 at 05:41:48AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On a plus side it also makes us think about the current mmap (ab)users
> > which should lead to an overall code improvements and maintainability.
> 
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
> 
> https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c
> 
> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.

Right, which is how I ended up using the split PT locks. They already
provide fine(r) grained locking.



Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Matthew Wilcox
On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> [...]
> > I want to add a note about mmap_sem. In the past there has been
> > discussions about replacing it with an interval lock, but these never
> > went anywhere because, mostly, of the fact that such mechanisms were
> > too expensive to use in the page fault path. I think adding the spf
> > mechanism would invite us to revisit this issue - interval locks may
> > be a great way to avoid blocking between unrelated mmap_sem writers
> > (for example, do not delay stack creation for new threads while a
> > large mmap or munmap may be going on), and probably also to handle
> > mmap_sem readers that can't easily use the spf mechanism (for example,
> > gup callers which make use of the returned vmas). But again that is a
> > separate topic to explore which doesn't have to get resolved before
> > spf goes in.
> 
> Well, I believe we should _really_ re-evaluate the range locking sooner
> rather than later. Why? Because it looks like the most straightforward
> approach to the mmap_sem contention for most usecases I have heard of
> (mostly a mm{unm}ap, mremap standing in the way of page faults).
> On a plus side it also makes us think about the current mmap (ab)users
> which should lead to an overall code improvements and maintainability.

Dave Chinner recently did evaluate the range lock for solving a problem
in XFS and didn't like what he saw:

https://lore.kernel.org/linux-fsdevel/20190418031013.gx29...@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

I think scaling the lock needs to be tied to the actual data structure
and not have a second tree on-the-side to fake-scale the locking.  Anyway,
we're going to have a session on this at LSFMM, right?

> SPF sounds like a good idea but it is a really big and intrusive surgery
> to the #PF path. And more importantly without any real world usecase
> numbers which would justify this. That being said I am not opposed to
> this change I just think it is a large hammer while we haven't seen
> attempts to tackle problems in a simpler way.

I don't think the "no real world usecase numbers" is fair.  Laurent quoted:

> Ebizzy:
> ---
> The test counts the number of records per second it can manage; the
> higher the better. I run it like this: 'ebizzy -mTt '. To get a
> consistent result I repeated the test 100 times and measured the
> average. The number is the records processed per second; the higher the better.
> 
>                 BASE     SPF       delta
> 24 CPUs x86     5492.69  9383.07   70.83%
> 1024 CPUs P8 VM 8476.74  17144.38  102%

and cited 30% improvement for you-know-what product from an earlier
version of the patch.



Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Anshuman Khandual
On 04/16/2019 07:14 PM, Laurent Dufour wrote:
> In pseudo code, this could be seen as:
> speculative_page_fault()
> {
>     vma = find_vma_rcu()
>     check vma sequence count
>     check vma's support
>     disable interrupt
>         check pgd,p4d,...,pte
>         save pmd and pte in vmf
>         save vma sequence counter in vmf
>     enable interrupt
>     check vma sequence count
>     handle_pte_fault(vma)
>         ..
>         page = alloc_page()
>         pte_map_lock()
>             disable interrupt
>             abort if sequence counter has changed
>             abort if pmd or pte has changed
>             pte map and lock
>             enable interrupt
>         if abort
>             free page
>             abort

Would it not be better if the 'page' allocated here could be passed on to
handle_pte_fault() below, so that in the fallback path it does not have to
enter the buddy allocator again? Of course it would require changes to
handle_pte_fault() to accommodate a pre-allocated non-NULL struct page to
operate on, or free back into the buddy if the fallback path fails for
some other reason. This would make the SPF path less of an overhead for
cases where it has to fall back on handle_pte_fault() after
pte_map_lock() in speculative_page_fault().

>     ...
>     put_vma(vma)
> }
> 
> arch_fault_handler()
> {
>     if (speculative_page_fault())
>         goto done
> again:
>     lock(mmap_sem)
>     vma = find_vma();
>     handle_pte_fault(vma);
>     if retry
>         unlock(mmap_sem)
>         goto again;
> done:
>     handle fault error
> }

- Anshuman


Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Michal Hocko
On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
[...]
> I want to add a note about mmap_sem. In the past there has been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.

Well, I believe we should _really_ re-evaluate the range locking sooner
rather than later. Why? Because it looks like the most straightforward
approach to the mmap_sem contention for most usecases I have heard of
(mostly a mm{unm}ap, mremap standing in the way of page faults).
On a plus side it also makes us think about the current mmap (ab)users
which should lead to an overall code improvements and maintainability.

SPF sounds like a good idea but it is a really big and intrusive surgery
to the #PF path. And more importantly without any real world usecase
numbers which would justify this. That being said I am not opposed to
this change I just think it is a large hammer while we haven't seen
attempts to tackle problems in a simpler way.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v12 00/31] Speculative page faults

2019-04-23 Thread Peter Zijlstra
On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

IIRC (and I really don't remember all that much) the trickiest bit was
vs unmount. Since files can stay open past the 'expected' duration,
umount could be delayed.

But yes, I think I had a version that did all that just 'fine'. Like
mentioned, I didn't keep the refcount because it sucked just as hard as
the mmap_sem contention, but the SRCU callback did the fput() just fine
(esp. now that we have delayed_fput).


Re: [PATCH v12 00/31] Speculative page faults

2019-04-22 Thread Michel Lespinasse
Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to
go through it - I had not been following the previous iterations of
this series so I had to catch up. I will be sending comments for
individual commits, but before that I would like to discuss the series
as a whole.

I think these changes are a big step in the right direction. My main
reservation about them is that they are additive - adding some complexity
for speculative page faults - and I wonder if it'd be possible, over the
long term, to replace the existing complexity we have in mmap_sem retry
mechanisms instead of adding to it. This is not something that should
block your progress, but I think it would be good, as we introduce spf,
to evaluate whether we could eventually get all the way to removing the
mmap_sem retry mechanism, or if we will actually have to keep both.


The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?

The proposed spf mechanism only works at the pte level after the page
tables have already been created. The non-spf page fault path takes the
mm->page_table_lock to protect against concurrent page table allocation
by multiple page faults; I think unmapping/freeing page tables could
be done under mm->page_table_lock too so that spf could implement
allocating new page tables by verifying the vma after taking the
mm->page_table_lock ?

The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what the issue is there - is this due to the vma->vm_start
and vma->vm_pgoff reads in *__vm_normal_page() ?


My last potential concern is about performance. The numbers you have
look great, but I worry about potential regressions in PF performance
for threaded processes that don't currently encounter contention
(i.e. there may be just one thread actually doing all the work while
the others are blocked). I think one good proxy for measuring that
would be to measure a single threaded workload - kernbench would be
fine - without the special-case optimization in patch 22 where
handle_speculative_fault() immediately aborts in the single-threaded case.

Reviewed-by: Michel Lespinasse 
This is for the series as a whole; I expect to do another review pass on
individual commits in the series when we have agreement on the toplevel
stuff (I noticed a few things like out-of-date commit messages but that's
really minor stuff).


I want to add a note about mmap_sem. In the past there has been
discussions about replacing it with an interval lock, but these never
went anywhere because, mostly, of the fact that such mechanisms were
too expensive to use in the page fault path. I think adding the spf
mechanism would invite us to revisit this issue - interval locks may
be a great way to avoid blocking between unrelated mmap_sem writers
(for example, do not delay stack creation for new threads while a
large mmap or munmap may be going on), and probably also to handle
mmap_sem readers that can't easily use the spf mechanism (for example,
gup callers which make use of the returned vmas). But again that is a
separate topic to explore which doesn't have to get resolved before
spf goes in.


[PATCH v12 00/31] Speculative page faults

2019-04-16 Thread Laurent Dufour
This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
page fault without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will not wait for other threads'
memory layout changes to be done, assuming that the change is done in
another part of the process's memory space. This type of page fault is
named a speculative page fault. If the speculative page fault fails
because concurrency has been detected or because the underlying PMD or
PTE tables are not yet allocated, its processing is aborted and a regular
page fault is then tried.

The speculative page fault (SPF) has to look up the VMA matching the fault
address without holding the mmap_sem; this is done by protecting the MM RB
tree with RCU and by using a reference counter on each VMA. When fetching a
VMA under RCU protection, the VMA's reference counter is incremented to
ensure that the VMA will not be freed behind our back during the SPF
processing. Once that processing is done, the VMA's reference counter is
decremented. To ensure that a VMA is still present when walking the RB tree
locklessly, the VMA's reference counter is incremented when that VMA is
linked in the RB tree. When the VMA is unlinked from the RB tree, its
reference counter will be decremented at the end of the RCU grace period,
ensuring it will be available during this time. This means that the VMA
freeing could be delayed, which could in turn delay the file closing for
file mappings. Since the SPF handler is not able to manage file mappings,
the file is closed synchronously and not during the RCU cleanup. This is
safe since the page fault handler aborts if a file pointer is associated
with the VMA.

Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
benchmark [2].

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a per
VMA sequence lock. This sequence lock allows the speculative page fault
handler to quickly check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA has been found, the speculative page fault handler checks the
VMA's attributes to verify whether the page fault can be handled this way
or not. The VMA is protected through a sequence lock which allows fast
detection of concurrent VMA changes. If such a change is detected, the
speculative page fault is aborted and a *classic* page fault is tried.
VMA sequence locking is added wherever VMA attributes which are checked
during the page fault are modified.

When the PTE is fetched, the VMA is checked to see if it has been changed,
so once the page table is locked the VMA is known to be valid. Any other
change leading to touching this PTE would need to take the page table
lock, so no parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled; this allows
checking the PMD to ensure that there is no ongoing collapse
operation. Since khugepaged first sets the PMD to pmd_none and then
waits for the other CPUs to have caught the IPI interrupt, if the PMD is
valid at the time the PTE is locked, we have the guarantee that the
collapse operation will have to wait on the PTE lock to move
forward. This allows the SPF handler to map the PTE safely. If the PMD
value is different from the one recorded at the beginning of the SPF
operation, the classic page fault handler will be called to handle the
fault while holding the mmap_sem. As the PTE is locked with interrupts
disabled, the locking uses spin_trylock() to avoid a deadlock when
handling a page fault while a TLB invalidation is requested by another
CPU holding the PTE lock.

In pseudo code, this could be seen as:
speculative_page_fault()
{
    vma = find_vma_rcu()
    check vma sequence count
    check vma's support
    disable interrupt
        check pgd,p4d,...,pte
        save pmd and pte in vmf
        save vma sequence counter in vmf
    enable interrupt
    check vma sequence count
    handle_pte_fault(vma)
        ..
        page = alloc_page()
        pte_map_lock()
            disable interrupt
            abort if sequence counter has changed
            abort if pmd or pte has changed
            pte map and lock
            enable interrupt
        if abort
            free page
            abort
    ...
    put_vma(vma)
}

arch_fault_handler()
{
    if (speculative_page_fault())
        goto done
again:
    lock(mmap_sem)
    vma = find_vma();