Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

2018-03-22 Thread Laurent Dufour
On 22/03/2018 17:13, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
>> On 3/22/18 2:10 AM, Michal Hocko wrote:
>>> On Wed 21-03-18 15:36:12, Yang Shi wrote:
 On 3/21/18 2:23 PM, Michal Hocko wrote:
> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> Ohh, absolutely. mmap_sem is unfortunately abused and it would be great
> to remove that. munmap should perform much better. How to do that safely
>>> The full vma will have to be range locked. So there is nothing small or large.
>>
>> It doesn't sound helpful for the single large vma case, since with just one
>> range lock for the whole vma it is effectively equivalent to the mmap_sem.
> 
> But splitting mmap_sem into pieces is beneficial for this case.  Imagine
> we have a spinlock / rwlock to protect the rbtree 

Which is more or less what I'm proposing in the speculative page fault series:
https://lkml.org/lkml/2018/3/13/1158

This being said, having a per-VMA lock could lead to tricky deadlock cases when
multiple VMAs are merged in parallel, since several VMAs will have to be locked
at the same time; grabbing those locks in a well-defined order will be required
(see the sketch below).

> ... / arg_start / arg_end
> / ...  and then each VMA has a rwsem (or equivalent).  access_remote_vm()
> would walk the tree and grab the VMA's rwsem for read while reading
> out the arguments.  The munmap code would have a completely different
> VMA write-locked.
> 
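To illustrate the lock-ordering concern above, here is a minimal sketch of
taking several per-VMA rwsems in ascending address order, so that two
concurrent mergers can never acquire the same pair of locks in opposite
order. The vm_lock field and both helpers are hypothetical, not taken from
any posted series:

	/*
	 * Sketch only: vm_lock is a hypothetical per-VMA rw_semaphore and
	 * vmas[] is assumed sorted by vm_start, so every locker follows
	 * the same global order and deadlock is impossible.
	 */
	static void lock_vmas_for_merge(struct vm_area_struct **vmas, int n)
	{
		int i;

		for (i = 0; i < n; i++)
			down_write(&vmas[i]->vm_lock);
	}

	static void unlock_vmas_for_merge(struct vm_area_struct **vmas, int n)
	{
		/* Release in reverse order (order does not matter for unlock). */
		while (n--)
			up_write(&vmas[n]->vm_lock);
	}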



Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

2018-03-22 Thread Laurent Dufour


On 22/03/2018 17:05, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not rely on the PTE locking?
>>>>
>>>> When munmap() unsets the PTE it will have to hold the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the mapped page will
>>>> be removed when mmap(MAP_FIXED) does the cleanup. Fair enough.
>>>
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>
>> I'm wondering if such complexity is required.
>> If the user space process tries to access a page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or the *new* one.
> 
> Right; but it must return one or the other, it can't segfault.

Good point, I missed that...

> 
>> I'd think it is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns, the old
>> pages are no longer there, which is done through the mmap_sem and PTE locking.
> 
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.

The PTE locking should prevent that.

> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced.  Why are you trying to optimise it?

I was not trying to optimize it, but to avoid waiting in the page fault handler.
This could become tricky in the case where the VMA is removed once mmap(MAP_FIXED)
is done and before the waiting page fault is woken up. This means that the
removed VMA structure will have to remain until all the waiters are woken up,
which implies a ref_count or similar.

> 
>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>>
>> Such a lock would only make sense in the case of mmap(MAP_FIXED), since when
>> the VMA is removed there is no need to wait. Isn't it?
> 
> I can't think of another reason.  I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time.  Just always have the pagefault sleep for a deleted VMA.
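To make the proposal above concrete, here is a minimal sketch of a
completion-based "locked for replacement" VMA. The vm_deleted flag and
vm_replaced completion are hypothetical field names, not from a posted patch:

	/* Fault path, after the VMA lookup (sketch): */
	if (vma->vm_deleted) {
		/* Sleep until mmap(MAP_FIXED) has linked the new VMA in place. */
		wait_for_completion(&vma->vm_replaced);
		return VM_FAULT_RETRY;		/* retry: walk the tree again */
	}

	/* mmap(MAP_FIXED) path, once the replacement VMA is visible: */
	complete_all(&old_vma->vm_replaced);	/* wake every blocked fault */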



Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

2018-03-22 Thread Laurent Dufour


On 22/03/2018 16:40, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>> On 21/03/2018 23:46, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>>>> Marking the vma as deleted sounds good. The problem with my current approach
>>>> is that a concurrent page fault may succeed if it accesses the not yet
>>>> unmapped section. Marking the vma deleted could tell the page fault handler
>>>> that the vma is no longer valid, and then it can return SIGSEGV.
>>>>
>>>>> does not care; munmap will need to wait for the existing munmap operation
>>>>
>>>> Why doesn't mmap care? How about MAP_FIXED? It may fail unexpectedly, right?
>>>
>>> The other thing about MAP_FIXED that we'll need to handle is unmapping
>>> conflicts atomically.  Say a program has a 200GB mapping and then
>>> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
>>> are also going to have to wait for deleted vmas (then retry the fault)
>>> rather than immediately raising SIGSEGV.
>>
>> Regarding the page fault, why not rely on the PTE locking?
>>
>> When munmap() unsets the PTE it will have to hold the PTE lock, so this
>> will serialize the access.
>> If the page fault occurs before the mmap(MAP_FIXED), the mapped page will be
>> removed when mmap(MAP_FIXED) does the cleanup. Fair enough.
> 
> The page fault handler will walk the VMA tree to find the correct
> VMA and then find that the VMA is marked as deleted.  If it assumes
> that the VMA has been deleted because of munmap(), then it can raise
> SIGSEGV immediately.  But if the VMA is marked as deleted because of
> mmap(MAP_FIXED), it must wait until the new VMA is in place.

I'm wondering if such complexity is required.
If the user space process tries to access a page being overwritten through
mmap(MAP_FIXED) by another thread, there is no guarantee that it will
manipulate the *old* page or the *new* one.
I'd think it is up to the user process to handle that concurrency.
What needs to be guaranteed is that once mmap(MAP_FIXED) returns, the old pages
are no longer there, which is done through the mmap_sem and PTE locking.

> I think I was wrong to describe VMAs as being *deleted*.  I think we
> instead need the concept of a *locked* VMA that page faults will block on.
> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> an rwsem since the only reason to write-lock the VMA is because it is
> being deleted.

Such a lock would only make sense in the case of mmap(MAP_FIXED), since when
the VMA is removed there is no need to wait. Isn't it?



Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section

2018-03-22 Thread Laurent Dufour


On 21/03/2018 23:46, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> Marking the vma as deleted sounds good. The problem with my current approach
>> is that a concurrent page fault may succeed if it accesses the not yet
>> unmapped section. Marking the vma deleted could tell the page fault handler
>> that the vma is no longer valid, and then it can return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>>
>> Why doesn't mmap care? How about MAP_FIXED? It may fail unexpectedly, right?
> 
> The other thing about MAP_FIXED that we'll need to handle is unmapping
> conflicts atomically.  Say a program has a 200GB mapping and then
> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
> are also going to have to wait for deleted vmas (then retry the fault)
> rather than immediately raising SIGSEGV.

Regarding the page fault, why not rely on the PTE locking?

When munmap() unsets the PTE it will have to hold the PTE lock, so this
will serialize the access.
If the page fault occurs before the mmap(MAP_FIXED), the mapped page will be
removed when mmap(MAP_FIXED) does the cleanup. Fair enough.
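For reference, a minimal sketch of the PTE-lock serialization described above,
using the standard helpers (illustrative fragment only, not a patch):

	spinlock_t *ptl;
	pte_t *pte;

	/*
	 * Fault side: taking the PTE lock serializes against munmap(),
	 * which clears the same entry under zap_pte_range() while
	 * holding that lock.
	 */
	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (pte_none(*pte)) {
		/* munmap() got there first: nothing is mapped here any more. */
	}
	pte_unmap_unlock(pte, ptl);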



Re: [mm] b1f0502d04: INFO:trying_to_register_non-static_key

2018-03-21 Thread Laurent Dufour


On 17/03/2018 08:51, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: b1f0502d04537ef55b0c296823affe332b100eb5 ("mm: VMA sequence count")
> url: 
> https://github.com/0day-ci/linux/commits/Laurent-Dufour/Speculative-page-faults/20180316-151833
> 
> 
> in testcase: trinity
> with following parameters:
> 
>   runtime: 300s
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -m 512M
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> +-----------------------------------------+------------+------------+
> |                                         | 6a4ce82339 | b1f0502d04 |
> +-----------------------------------------+------------+------------+
> | boot_successes                          | 8          | 4          |
> | boot_failures                           | 0          | 4          |
> | INFO:trying_to_register_non-static_key  | 0          | 4          |
> +-----------------------------------------+------------+------------+
> 
> 
> 
> [   22.212940] INFO: trying to register non-static key.
> [   22.213687] the code is fine but needs lockdep annotation.
> [   22.214459] turning off the locking correctness validator.
> [   22.227459] CPU: 0 PID: 547 Comm: trinity-main Not tainted 
> 4.16.0-rc4-next-20180309-7-gb1f0502 #239
> [   22.228904] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   22.230043] Call Trace:
> [   22.230409]  dump_stack+0x5d/0x79
> [   22.231025]  register_lock_class+0x226/0x45e
> [   22.231827]  ? kvm_clock_read+0x21/0x30
> [   22.232544]  ? kvm_sched_clock_read+0x5/0xd
> [   22.20]  __lock_acquire+0xa2/0x774
> [   22.234152]  lock_acquire+0x4b/0x66
> [   22.234805]  ? unmap_vmas+0x30/0x3d
> [   22.245680]  unmap_page_range+0x56/0x48c
> [   22.248127]  ? unmap_vmas+0x30/0x3d
> [   22.248741]  ? lru_deactivate_file_fn+0x2c6/0x2c6
> [   22.249537]  ? pagevec_lru_move_fn+0x9a/0xa9
> [   22.250244]  unmap_vmas+0x30/0x3d
> [   22.250791]  unmap_region+0xad/0x105
> [   22.251419]  mmap_region+0x3cc/0x455
> [   22.252011]  do_mmap+0x394/0x3e9
> [   22.261224]  vm_mmap_pgoff+0x9c/0xe5
> [   22.261798]  SyS_mmap_pgoff+0x19a/0x1d4
> [   22.262475]  ? task_work_run+0x5e/0x9c
> [   22.263163]  do_syscall_64+0x6d/0x103
> [   22.263814]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [   22.264697] RIP: 0033:0x4573da
> [   22.267248] RSP: 002b:7fffa22f1398 EFLAGS: 0246 ORIG_RAX: 
> 0009
> [   22.274720] RAX: ffda RBX: 0001 RCX: 
> 004573da
> [   22.276083] RDX: 0001 RSI: 1000 RDI: 
> 
> [   22.277343] RBP: 001c R08: 001c R09: 
> 
> [   22.278686] R10: 0002 R11: 0246 R12: 
> 
> [   22.279930] R13: 1000 R14: 0002 R15: 
> 
> [   22.391866] trinity-main uses obsolete (PF_INET,SOCK_PACKET)
> [  327.566956] sysrq: SysRq : Emergency Sync
> [  327.567849] Emergency Sync complete
> [  327.569975] sysrq: SysRq : Resetting

I found the root cause of this lockdep warning.

In mmap_region(), unmap_region() may be called while vma_link() has not been
called yet. This happens on the error path, when call_mmap() fails.

The only way to fix that particular case is to call
seqcount_init(&vma->vm_sequence) when initializing the vma in mmap_region().
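That is, something along these lines in mmap_region() (sketch only; in 4.16
the vma is allocated with kmem_cache_zalloc()):

	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
	...
	/*
	 * Initialize the seqcount before call_mmap() can fail, so the
	 * error path's unmap_region() sees a registered lockdep key.
	 */
	seqcount_init(&vma->vm_sequence);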

Thanks,
Laurent.



Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

2018-03-21 Thread Laurent Dufour
On 20/03/2018 22:35, Mike Kravetz wrote:
> On 03/20/2018 02:26 PM, Mike Kravetz wrote:
>> Thanks Laurent!
>>
>> This bug was introduced by 31383c6865a5.  Dan's changes for 31383c6865a5
>> seem pretty straightforward.  They simply replace an explicit check made when
>> splitting a vma with a new vm_ops split callout.  Unfortunately, mappings
>> created via shmget/shmat have their vm_ops replaced.  Therefore, this
>> split callout is never made.
>>
>> The shm vm_ops do indirectly call the original vm_ops routines as needed.
>> Therefore, I would suggest a patch something like the following instead.
>> If we move forward with the patch, we should include Laurent's BUG output
>> and perhaps test program in the commit message.
> 
> Sorry, patch in previous mail was a mess
> 
> From 7a19414319c7937fd2757c27f936258f16c1f61d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.krav...@oracle.com>
> Date: Tue, 20 Mar 2018 13:56:57 -0700
> Subject: [PATCH] shm: add split function to shm_vm_ops
> 
> The split function was added to vm_operations_struct to determine
> if a mapping can be split.  This was mostly for device-dax and
> hugetlbfs mappings which have specific alignment constraints.
> 
> mappings initiated via shmget/shmat have their original vm_ops
> overwritten with shm_vm_ops.  shm_vm_ops functions will call back
> to the original vm_ops if needed.  Add such a split function.

FWIW,
Reviewed-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>

> Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
> Reported-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---
>  ipc/shm.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 7acda23430aa..50e88fc060b1 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -386,6 +386,17 @@ static int shm_fault(struct vm_fault *vmf)
>   return sfd->vm_ops->fault(vmf);
>  }
> 
> +static int shm_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct file *file = vma->vm_file;
> + struct shm_file_data *sfd = shm_file_data(file);
> +
> + if (sfd->vm_ops && sfd->vm_ops->split)
> + return sfd->vm_ops->split(vma, addr);
> +
> + return 0;
> +}
> +
>  #ifdef CONFIG_NUMA
>  static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
>  {
> @@ -510,6 +521,7 @@ static const struct vm_operations_struct shm_vm_ops = {
>   .open   = shm_open, /* callback for a new vm-area open */
>   .close  = shm_close,/* callback for when the vm-area is released */
>   .fault  = shm_fault,
> + .split  = shm_split,
>  #if defined(CONFIG_NUMA)
>   .set_policy = shm_set_policy,
>   .get_policy = shm_get_policy,
> 
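For context, the ->split() callout that this patch hooks into is invoked from
__split_vma() in mm/mmap.c, roughly as follows (sketch of the check added by
31383c6865a5; surrounding code elided):

	if (vma->vm_ops && vma->vm_ops->split) {
		err = vma->vm_ops->split(vma, addr);
		if (err)
			return err;
	}

Since shm_vm_ops previously had no .split entry, this check was skipped for
shmget/shmat mappings and the hugetlb alignment test was never reached.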



Re: [PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

2018-03-21 Thread Laurent Dufour
On 20/03/2018 22:26, Mike Kravetz wrote:
> On 03/20/2018 10:25 AM, Laurent Dufour wrote:
>> When running the sampler detailed below, the kernel, if built with the VM
>> debug option turned on (as many distros do), panics with the following
>> message:
>> kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> LE SMP NR_CPUS=2048 NUMA PowerNV
>> Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
>>  8<--8<--8<--8< snip 8<--8<--8<--8<
>> CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C  E
>> 4.15.0-10-generic #11-Ubuntu
>> NIP:  c036e764 LR: c036ee48 CTR: 0009
>> REGS: c03fbcdcf810 TRAP: 0700   Tainted: G C  E
>> (4.15.0-10-generic)
>> MSR:  90029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 2400  XER:
>> 2004
>> CFAR: c036ee44 SOFTE: 1
>> GPR00: c036ee48 c03fbcdcfa90 c16ea600 c03fbcdcfc40
>> GPR04: c03fd9858950 7115e4e0 7115e4e1 
>> GPR08: 0010 0001  
>> GPR12: 2000 c7a2c600 0fe3985954d0 7115e4e0
>> GPR16:    
>> GPR20: 0fe398595a94 a6fc c03fd9858950 00018554
>> GPR24: c03fdcd84500 c19acd00 7115e4e1 c03fbcdcfc40
>> GPR28: 0020 7115e4e0 c03fbc9ac600 c03fd9858950
>> NIP [c036e764] __unmap_hugepage_range+0xa4/0x760
>> LR [c036ee48] __unmap_hugepage_range_final+0x28/0x50
>> Call Trace:
>> [c03fbcdcfa90] [7115e4e0] 0x7115e4e0 (unreliable)
>> [c03fbcdcfb50] [c036ee48]
>> __unmap_hugepage_range_final+0x28/0x50
>> [c03fbcdcfb80] [c033497c] unmap_single_vma+0x11c/0x190
>> [c03fbcdcfbd0] [c0334e14] unmap_vmas+0x94/0x140
>> [c03fbcdcfc20] [c034265c] exit_mmap+0x9c/0x1d0
>> [c03fbcdcfce0] [c0105448] mmput+0xa8/0x1d0
>> [c03fbcdcfd10] [c010fad0] do_exit+0x360/0xc80
>> [c03fbcdcfdd0] [c01104c0] do_group_exit+0x60/0x100
>> [c03fbcdcfe10] [c0110584] SyS_exit_group+0x24/0x30
>> [c03fbcdcfe30] [c000b184] system_call+0x58/0x6c
>> Instruction dump:
>> 552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b09 e9390010
>> 7d2948f8 7d2a2838 0b0a 7d293038 <0b09> e9230086 2fa9 419e0468
>> ---[ end trace ee88f958a1c62605 ]---
>>
>> The panic is due to a VMA pointing to a hugetlb area while the
>> vma->vm_start or vma->vm_end fields are not aligned to the huge page
>> boundaries. The sampler is just unmapping a part of the hugetlb area,
>> leading to 2 VMAs which are not well aligned.  The same can be achieved
>> by calling madvise(), as is the case when running:
>> stress-ng --shm-sysv 1
>>
>> The hugetlb code assumes that the VMA will be well aligned when it is
>> unmapped, so we must prevent such a VMA from being split or shrunk to a
>> misaligned address.
>>
>> This patch prevents this by checking the new VMA's boundaries when a
>> VMA is modified by calling vma_adjust().
>>
>> If this patch is applied, stable should be Cced.
> 
> Thanks Laurent!
> 
> This bug was introduced by 31383c6865a5.  Dan's changes for 31383c6865a5
> seem pretty straightforward.  They simply replace an explicit check made when
> splitting a vma with a new vm_ops split callout.  Unfortunately, mappings
> created via shmget/shmat have their vm_ops replaced.  Therefore, this
> split callout is never made.
> 
> The shm vm_ops do indirectly call the original vm_ops routines as needed.
> Therefore, I would suggest a patch something like the following instead.
> If we move forward with the patch, we should include Laurent's BUG output
> and perhaps test program in the commit message.

Hi Mike,

That's definitely smarter! I missed that new split() vm op...

Cheers,
Laurent.



[PATCH] mm/hugetlb: prevent hugetlb VMA to be misaligned

2018-03-20 Thread Laurent Dufour
When running the sampler detailed below, the kernel, if built with the VM
debug option turned on (as many distros do), panics with the following
message:
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
8<--8<--8<--8< snip 8<--8<--8<--8<
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C  E
4.15.0-10-generic #11-Ubuntu
NIP:  c036e764 LR: c036ee48 CTR: 0009
REGS: c03fbcdcf810 TRAP: 0700   Tainted: G C  E
(4.15.0-10-generic)
MSR:  90029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 2400  XER:
2004
CFAR: c036ee44 SOFTE: 1
GPR00: c036ee48 c03fbcdcfa90 c16ea600 c03fbcdcfc40
GPR04: c03fd9858950 7115e4e0 7115e4e1 
GPR08: 0010 0001  
GPR12: 2000 c7a2c600 0fe3985954d0 7115e4e0
GPR16:    
GPR20: 0fe398595a94 a6fc c03fd9858950 00018554
GPR24: c03fdcd84500 c19acd00 7115e4e1 c03fbcdcfc40
GPR28: 0020 7115e4e0 c03fbc9ac600 c03fd9858950
NIP [c036e764] __unmap_hugepage_range+0xa4/0x760
LR [c036ee48] __unmap_hugepage_range_final+0x28/0x50
Call Trace:
[c03fbcdcfa90] [7115e4e0] 0x7115e4e0 (unreliable)
[c03fbcdcfb50] [c036ee48]
__unmap_hugepage_range_final+0x28/0x50
[c03fbcdcfb80] [c033497c] unmap_single_vma+0x11c/0x190
[c03fbcdcfbd0] [c0334e14] unmap_vmas+0x94/0x140
[c03fbcdcfc20] [c034265c] exit_mmap+0x9c/0x1d0
[c03fbcdcfce0] [c0105448] mmput+0xa8/0x1d0
[c03fbcdcfd10] [c010fad0] do_exit+0x360/0xc80
[c03fbcdcfdd0] [c01104c0] do_group_exit+0x60/0x100
[c03fbcdcfe10] [c0110584] SyS_exit_group+0x24/0x30
[c03fbcdcfe30] [c000b184] system_call+0x58/0x6c
Instruction dump:
552907fe e94a0028 e94a0408 eb2a0018 81590008 7f9c5036 0b09 e9390010
7d2948f8 7d2a2838 0b0a 7d293038 <0b09> e9230086 2fa9 419e0468
---[ end trace ee88f958a1c62605 ]---

The panic is due to a VMA pointing to a hugetlb area while the
vma->vm_start or vma->vm_end fields are not aligned to the huge page
boundaries. The sampler is just unmapping a part of the hugetlb area,
leading to 2 VMAs which are not well aligned.  The same can be achieved
by calling madvise(), as is the case when running:
stress-ng --shm-sysv 1

The hugetlb code assumes that the VMA will be well aligned when it is
unmapped, so we must prevent such a VMA from being split or shrunk to a
misaligned address.

This patch prevents this by checking the new VMA's boundaries when a
VMA is modified by calling vma_adjust().
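As an illustration of the check below, with assumed values and a 2MB huge
page size:

	huge_page_mask(h)          = ~(0x200000 - 1) = ...ffe00000
	start                      = 0x7f0000201000
	start & ~huge_page_mask(h) = 0x1000 != 0  =>  misaligned  =>  -EINVAL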

If this patch is applied, stable should be Cced.

--- Sampler used to hit the panic
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

/*
 * Note: the original #include and #define lines were mangled by the list
 * archive; the values below are assumed so the sampler builds. HPSIZE is
 * the huge page size, LENGTH the size of the SysV segment.
 */
#define HPSIZE (2UL * 1024 * 1024)
#define LENGTH (4 * HPSIZE)

unsigned long page_size;

int main(void)
{
	int shmid, ret = 1;
	void *addr;

	setbuf(stdout, NULL);
	page_size = getpagesize();

	shmid = shmget(0x1410, LENGTH, IPC_CREAT | SHM_HUGETLB | SHM_R | SHM_W);
	if (shmid < 0) {
		perror("shmget");
		exit(1);
	}

	printf("shmid: %d\n", shmid);

	addr = shmat(shmid, NULL, 0);
	if (addr == (void *)-1) {
		perror("shmat");
		goto out;
	}

	/*
	 * The following munmap() call will split the VMA in 2, leading to
	 * unaligned to huge page size VMAs which will trigger a check when
	 * shmdt() is called.
	 */
	if (munmap(addr + HPSIZE + page_size, page_size)) {
		perror("munmap");
		goto out;
	}

	if (shmdt(addr)) {
		perror("shmdt");
		goto out;
	}

	printf("test done.\n");
	ret = 0;

out:
	shmctl(shmid, IPC_RMID, NULL);
	return ret;
}
--- End of code

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 mm/mmap.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 188f195883b9..5dbf4b69a798 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -692,6 +692,17 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	if (is_vm_hugetlb_page(vma)) {
+		/*
+		 * We must check against the huge page boundary to not
+		 * create a misaligned VMA.
+		 */
+		struct hstate *h = hstate_vma(vma);
+
+		if (start & ~huge_page_mask(h) || end & ~huge_page_mask(h))
+			return -EINVAL;
+	}
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
-- 
2.7.4



Re: [mm] b33ddf50eb: INFO:trying_to_register_non-static_key

2018-03-16 Thread Laurent Dufour
On 16/03/2018 11:23, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: b33ddf50ebcc740b990dd2e0e8ff0b92c7acf58e ("mm: Protect mm_rb tree 
> with a rwlock")
> url: 
> https://github.com/0day-ci/linux/commits/Laurent-Dufour/Speculative-page-faults/20180316-151833
> 
> 
> in testcase: boot
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 4G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> +-----------------------------------------+------------+------------+
> |                                         | 7f3f7b4e80 | b33ddf50eb |
> +-----------------------------------------+------------+------------+
> | boot_successes                          | 8          | 0          |
> | boot_failures                           | 0          | 6          |
> | INFO:trying_to_register_non-static_key  | 0          | 6          |
> +-----------------------------------------+------------+------------+
> 
> 
> 
> [   22.218186] INFO: trying to register non-static key.
> [   22.220252] the code is fine but needs lockdep annotation.
> [   22.222471] turning off the locking correctness validator.
> [   22.224839] CPU: 0 PID: 1 Comm: init Not tainted 
> 4.16.0-rc4-next-20180309-00017-gb33ddf5 #1
> [   22.228528] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   22.232443] Call Trace:
> [   22.234234]  dump_stack+0x85/0xbc
> [   22.236085]  register_lock_class+0x237/0x477
> [   22.238057]  __lock_acquire+0xd0/0xf15
> [   22.240032]  lock_acquire+0x19c/0x1ce
> [   22.241927]  ? do_mmap+0x3aa/0x3ff
> [   22.243749]  mmap_region+0x37a/0x4c0
> [   22.245619]  ? do_mmap+0x3aa/0x3ff
> [   22.247425]  do_mmap+0x3aa/0x3ff
> [   22.249175]  vm_mmap_pgoff+0xa1/0xea
> [   22.251083]  elf_map+0x6d/0x134
> [   22.252873]  load_elf_binary+0x56f/0xe07
> [   22.254853]  search_binary_handler+0x75/0x1f8
> [   22.256934]  do_execveat_common+0x661/0x92b
> [   22.259164]  ? rest_init+0x22e/0x22e
> [   22.261082]  do_execve+0x1f/0x21
> [   22.262884]  kernel_init+0x5a/0xf0
> [   22.264722]  ret_from_fork+0x3a/0x50
> [   22.303240] systemd[1]: RTC configured in localtime, applying delta of 480 
> minutes to system time.
> [   22.313544] systemd[1]: Failed to insert module 'autofs4': No such file or 
> directory

Thanks a lot for reporting this.

I found the issue introduced in that patch.
I mistakenly removed the call to seqcount_init(&vma->vm_sequence) in
__vma_link_rb().
This doesn't have a functional impact as the vm_sequence is incremented
monotonically.
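That is, the fix is simply to restore the dropped initialization (sketch):

	static void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, ...)
	{
		...
		seqcount_init(&vma->vm_sequence);	/* re-add: registers the lockdep key */
		...
	}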

I'll fix that in the next series.

Laurent.



Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-03-14 Thread Laurent Dufour
On 14/03/2018 09:48, Peter Zijlstra wrote:
> On Tue, Mar 13, 2018 at 06:59:47PM +0100, Laurent Dufour wrote:
>> This change is inspired by the Peter's proposal patch [1] which was
>> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
>> that particular case, and it is introducing major performance degradation
>> due to excessive scheduling operations.
> 
> Do you happen to have a little more detail on that?

This has been reported by kemi, who saw bad performance when running some
benchmarks on top of the v5 series:
https://patchwork.kernel.org/patch/687/

It appears that SRCU generates a lot of additional scheduling to manage the
freeing of the VMA structures. SRCU deals with per-CPU resources, but since we
are handling a per-process resource (the VMA) through a global resource (SRCU),
this leads to a lot of overhead when scheduling the SRCU callback.
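For reference, the SRCU-based scheme from [1] looked roughly like this on the
read and update sides (sketch; vma_srcu, vm_rcu and the free callback are
named here for illustration only, not quoted from Peter's patch):

	int idx;

	/* Read side: speculative VMA lookup under SRCU. */
	idx = srcu_read_lock(&vma_srcu);
	vma = find_vma(mm, address);
	/* ... speculative fault handling ... */
	srcu_read_unlock(&vma_srcu, idx);

	/* Update side: VMA freeing deferred through the global SRCU state. */
	call_srcu(&vma_srcu, &vma->vm_rcu, free_vma_cb);

It is this funneling of every per-process VMA free through one global SRCU
instance that generates the scheduling overhead described above.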

>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 34fde7111e88..28c763ea1036 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -335,6 +335,7 @@ struct vm_area_struct {
>>  struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>>  #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>  seqcount_t vm_sequence;
>> +atomic_t vm_ref_count;  /* see vma_get(), vma_put() */
>>  #endif
>>  } __randomize_layout;
>>   
>> @@ -353,6 +354,9 @@ struct kioctx_table;
>>  struct mm_struct {
>>  struct vm_area_struct *mmap;/* list of VMAs */
>>  struct rb_root mm_rb;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +rwlock_t mm_rb_lock;
>> +#endif
>>  u32 vmacache_seqnum;   /* per-thread vmacache */
>>  #ifdef CONFIG_MMU
>>  unsigned long (*get_unmapped_area) (struct file *filp,
> 
> When I tried this, it simply traded contention on mmap_sem for
> contention on these two cachelines.
> 
> This was for the concurrent fault benchmark, where mmap_sem is only ever
> acquired for reading (so no blocking ever happens) and the bottle-neck
> was really pure cacheline access.

I'd say this is expected if multiple threads are operating on the same VMA, but
if the VMAs differ then this contention disappears, while it remains when using
the mmap_sem.

This being said, tests I did on PowerPC using will-it-scale/page_fault1_threads
showed that the number of cache misses generated in get_vma() is very low
(less than 5%). Am I missing something?

> Only by using RCU can you avoid that thrashing.

I agree, but this kind of test is the best use case for SRCU because there are
not many updates, so not a lot of calls to the SRCU asynchronous callback.

Honestly, I can't see an ideal solution here: RCU is not optimal when there is a
high number of updates, and using a rwlock may introduce a bottleneck there. I
get better results when using the rwlock than when using SRCU in that case, but
if you have another proposal, please advise, I'll give it a try.

> Also note that if your database allocates the one giant mapping, it'll
> be _one_ VMA and that vm_ref_count gets _very_ hot indeed.

In the case of the database product I mentioned in the series header, it's the
opposite: the number of VMAs is very high, so this doesn't happen. But in the
case of a single VMA, there will clearly be contention on vm_ref_count, though
this would still be better than blocking on the mmap_sem.

Laurent.
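A sketch of the rwlock + refcount lookup being discussed, using the mm_rb_lock
and vm_ref_count fields from the quoted patch (the helper bodies are
illustrative, not the exact series code):

	static struct vm_area_struct *get_vma(struct mm_struct *mm,
					      unsigned long addr)
	{
		struct vm_area_struct *vma;

		read_lock(&mm->mm_rb_lock);
		vma = find_vma(mm, addr);	/* walk mm_rb under the read lock */
		if (vma)
			atomic_inc(&vma->vm_ref_count);
		read_unlock(&mm->mm_rb_lock);
		return vma;
	}

	static void put_vma(struct vm_area_struct *vma)
	{
		if (atomic_dec_and_test(&vma->vm_ref_count))
			__free_vma(vma);	/* hypothetical free helper */
	}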



Re: [PATCH v9 00/24] Speculative page faults

2018-03-14 Thread Laurent Dufour
On 14/03/2018 14:11, Michal Hocko wrote:
> On Tue 13-03-18 18:59:30, Laurent Dufour wrote:
>> Changes since v8:
>>  - Don't check PMD when locking the pte when THP is disabled
>>Thanks to Daniel Jordan for reporting this.
>>  - Rebase on 4.16
> 
> Is this really worth reposting the whole pile? I mean this is at v9,
> each doing little changes. It is quite tiresome to barely get to a
> bookmarked version just to find out that there are 2 new versions out.

I agree, I could have sent only a change for the concerned patch. But the
previous series was sent a month ago, and this one is rebased on the 4.16
kernel.

> I am sorry to be grumpy and I can understand some frustration it doesn't
> move forward that easily but this is a _big_ change. We should start
> with a real high level review rather than doing small changes here and
> there and reach v20 quickly.
> 
> I am planning to find some time to look at it but the spare cycles are
> so rare these days...

I understand that this is a big change, and I'll try not to post a new series
until I get more feedback on this one.

Thanks,
Laurent.



[PATCH v9 01/24] mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour
This configuration variable will be used to build the code needed to
handle speculative page faults.

By default it is turned off, and activated depending on architecture
support.

Suggested-by: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 mm/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index abefa573bcd8..07c566c88faf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -759,3 +759,6 @@ config GUP_BENCHMARK
  performance of get_user_pages_fast().
 
  See tools/testing/selftests/vm/gup_benchmark.c
+
+config SPECULATIVE_PAGE_FAULT
+   bool
-- 
2.7.4



[PATCH v9 03/24] powerpc/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour
Define CONFIG_SPECULATIVE_PAGE_FAULT for BOOK3S_64 and SMP. This enables
the Speculative Page Fault handler.

Support is currently only provided for BOOK3S_64 because:
- it requires CONFIG_PPC_STD_MMU because of the checks done in
  set_access_flags_filter()
- it requires BOOK3S because we can't support book3e_hugetlb_preload()
  called by update_mmu_cache()

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..acf2696a6505 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -233,6 +233,7 @@ config PPC
select OLD_SIGACTIONif PPC32
select OLD_SIGSUSPEND
select SPARSE_IRQ
+   select SPECULATIVE_PAGE_FAULT   if PPC_BOOK3S_64 && SMP
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS  if !PPC64
#
-- 
2.7.4



[PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-03-13 Thread Laurent Dufour
migrate_misplaced_page() is only called during page fault handling, so it's
better to pass the pointer to the struct vm_fault instead of the vma.

This way, during the speculative page fault path, the saved vma->vm_flags
can be used.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..fd4c3ab7bd9c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -126,14 +126,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 46fe92b93682..412014d5785b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3880,7 +3880,7 @@ static int do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5d0dc7b85f90..ad8692ca6a4f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1900,7 +1900,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1913,7 +1913,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.7.4



[PATCH v9 08/24] mm: Protect VMA modifications using VMA sequence count

2018-03-13 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE to prevent the write from being
split and intermediate values from being seen by other CPUs.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 17 ++---
 mm/mprotect.c  |  4 +++-
 mm/swap_state.c|  8 ++--
 9 files changed, 86 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 65ae54659833..a2d9c87b7b0b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cec550c8468f..b8212ba17695 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
up_write(&mm->mmap_sem);
mmput(mm);
@@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7e2268dfc9a..32314e9e48dd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma)

[PATCH v9 06/24] mm: make pte_unmap_same compatible with SPF

2018-03-13 Thread Laurent Dufour
pte_unmap_same() is making the assumption that the page tables are still
around because the mmap_sem is held.
This is no longer the case when running a speculative page fault, and an
additional check must be made to ensure that the final page table is still
there.

This is now done by calling pte_spinlock() to check for the VMA's
consistency while locking the page tables.

This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been touched behind our back, pte_unmap_same() now returns one of 3
cases:
1. the PTEs are the same (0)
2. the PTEs are different (VM_FAULT_PTNOTSAME)
3. a change of the VMA has been detected (VM_FAULT_RETRY)

Case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.
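
pte_spinlock() itself is introduced in an earlier patch of the series. A
minimal sketch of the behaviour relied upon here, assuming a vmf->sequence
field holding the sequence count sampled when the speculative fault started
(the names and checks are illustrative, not the actual patch code):

static bool pte_spinlock(struct vm_fault *vmf)
{
	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	/* Re-validate the VMA now that the PTE lock is held. */
	if (read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence)) {
		/* VMA changed behind our back: caller maps this to VM_FAULT_RETRY. */
		spin_unlock(vmf->ptl);
		return false;
	}
	return true;
}

On the regular, non-speculative path it simply takes the lock and returns
true.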

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c| 29 +++--
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f3e98edc94a..b6432a261e63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1199,6 +1199,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
 * and needs fsync() to complete (for
 * synchronous page faults in DAX) */
+#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */
 
 #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
diff --git a/mm/memory.c b/mm/memory.c
index 21b1212a0892..4bc7b0bdcb40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
  * parts, do_swap_page must check under lock before unmapping the pte and
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0   if the PTE are the same
+ * VM_FAULT_PTNOTSAME  if the PTE are different
+ * VM_FAULT_RETRY  if the VMA has changed in our back during
+ * a speculative page fault handling.
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
-   int same = 1;
+   int ret = 0;
+
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
+   if (pte_spinlock(vmf)) {
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   ret = VM_FAULT_PTNOTSAME;
+   spin_unlock(vmf->ptl);
+   } else
+   ret = VM_FAULT_RETRY;
}
 #endif
-   pte_unmap(page_table);
-   return same;
+   pte_unmap(vmf->pte);
+   return ret;
 }
 
static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
@@ -2913,7 +2921,8 @@ int do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
int ret = 0;
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+   ret = pte_unmap_same(vmf);
+   if (ret)
goto out;
 
entry = pte_to_swp_entry(vmf->orig_pte);
-- 
2.7.4



[PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure

2018-03-13 Thread Laurent Dufour
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no longer a guarantee that these fields will not change behind our
back. They will be saved in the vm_fault structure before the VMA is checked
for changes.

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if it is not needed by the callee.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |  6 ++
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 38 --
 4 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef6ef0627090..dfa81a638b7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -359,6 +359,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 446427cafa19..f71db2b42b30 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
.vma = vma,
.address = address,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 32314e9e48dd..a946d5306160 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
};
 
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index 0200340ef089..46fe92b93682 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Don't let another task, with possibly unlocked vma,
 * keep the mlocked page.
 */
-   if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+   if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page);/* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
  */
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
-   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+   WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
 * We should not cow pages in a shared writeable mapping.
 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 */
-   if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
 (VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);
 
@@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)

[PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable

2018-03-13 Thread Laurent Dufour
The speculative page fault handler, which is run without holding the
mmap_sem, is calling lru_cache_add_active_or_unevictable() but the vm_flags
is not guaranteed to remain constant.
This patch introduces __lru_cache_add_active_or_unevictable() which takes
the vma flags value as a parameter instead of the vma pointer.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/swap.h | 10 --
 mm/memory.c  |  8 
 mm/swap.c|  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1985940af479..a7dc37e0e405 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,14 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 412014d5785b..0a0a483d9a65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2560,7 +2560,7 @@ static int wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3083,7 +3083,7 @@ int do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3234,7 +3234,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3486,7 +3486,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 3dd518832096..f2f9c587246f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -455,12 +455,12 @@ void lru_cache_add(struct page *page)
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
-- 
2.7.4



[PATCH v9 10/24] mm: Protect SPF handler against anon_vma changes

2018-03-13 Thread Laurent Dufour
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try a speculative page fault if the VMA doesn't have an
anon_vma structure allocated, because its allocation should be protected by
the mmap_sem.

In __vma_adjust(), when importer->anon_vma is set, there is no need to
protect against speculative page faults since the speculative page fault is
aborted if the vma->anon_vma is not set.

When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily valid
since we checked for it when locking the pte, and the anon_vma is only
removed once the pte is unlocked. So even if the speculative page fault
handler is running concurrently with do_unmap(), the pte is locked in
unmap_region() - through unmap_vmas() - and the anon_vma is unlinked later.
Because we check the vma sequence counter, which is updated in
unmap_page_range() before locking the pte and then in free_pgtables(), the
change will be detected when locking the pte.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index d57749966fb8..0200340ef089 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -624,7 +624,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
@@ -638,7 +640,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
-- 
2.7.4



[PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-03-13 Thread Laurent Dufour
When dealing with the speculative fault path we should use the VMA's cached
field values stored in the vm_fault structure.

Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which receives
the vm_flags value as a parameter.

Note: the speculative path is only turned on for architectures providing
support for the special PTE flag. So only the first block of
vm_normal_page() is used during the speculative path.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |  7 +--
 mm/memory.c| 18 ++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a84ddc218bbd..73b8b99f482b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1263,8 +1263,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
 };
 
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device);
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+#define _vm_normal_page(vma, addr, pte, with_public_device) \
+   __vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
 #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
 
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index af0338fbc34d..184a0d663a76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
 {
unsigned long pfn = pte_pfn(pte);
 
@@ -836,7 +837,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -868,8 +869,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
/* !HAVE_PTE_SPECIAL case follows: */
 
-   if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-   if (vma->vm_flags & VM_MIXEDMAP) {
+   if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+   if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -878,7 +879,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
-   if (!is_cow_mapping(vma->vm_flags))
+   if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2742,7 +2743,8 @@ static int do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+vmf->vma_flags);
if (!vmf->page) {
/*
 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -3839,7 +3841,7 @@ static int do_numa_page(struct vm_fault *vmf)
ptep_modify_prot_commit(vma->vm_mm, vmf->address, vmf->pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);
 
-   page = vm_normal_page(vma, vmf->address, pte);
+   page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
if (!page) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
-- 
2.7.4



[PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-03-13 Thread Laurent Dufour
When dealing with the speculative page fault handler, we may race with a VMA
being split or merged. In this case the vma->vm_start and vma->vm_end fields
may not match the address at which the page fault is occurring.

This can only happen when the VMA is split, but in that case the anon_vma
pointer of the new VMA will be the same as the original one, because in
__split_vma() the new->anon_vma is set to src->anon_vma when *new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA it has been merged into must have
the same anon_vma pointer, otherwise the merge can't be done.

So in all cases we know that the anon_vma is valid: we checked before
starting the speculative page fault that the anon_vma pointer is valid for
this VMA, and since there is an anon_vma this means that at one time a page
has been backed. Before the VMA is cleaned, the page table lock would have
to be grabbed to clean the PTE, and the anon_vma field is checked once the
PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one which does
the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index 184a0d663a76..66517535514b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,7 +2559,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -3083,7 +3083,7 @@ int do_swap_page(struct vm_fault *vmf)
 
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3234,7 +3234,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3486,7 +3486,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/

[PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-03-13 Thread Laurent Dufour
This change is inspired by Peter's proposal patch [1] which was protecting
the VMA using SRCU. Unfortunately, SRCU does not scale well in that
particular case, and it introduces major performance degradation due to
excessive scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
protects its access using a rwlock.  As the mm_rb tree is a O(log n) search
it is safe to protect it using such a lock.  The VMA cache is not protected
by the new rwlock and it should not be used without holding the mmap_sem.

To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is set
to 1.  Each time the VMA is picked with the rwlock held, its use count is
incremented. Each time the VMA is released it is decremented. When the use
count hits zero, this means that the VMA is no longer used and should be
freed.

This patch is preparing for 2 kinds of VMA access:
 - as usual, under the control of the mmap_sem,
 - without holding the mmap_sem for the speculative page fault handler.

Access done under the control of the mmap_sem doesn't require grabbing the
rwlock for read access to the mm_rb tree, but write access must be done
under the protection of the rwlock too. This affects inserting and removing
elements in the RB tree.

The patch is introducing 2 new functions:
 - vma_get() to find a VMA based on an address by holding the new rwlock.
 - vma_put() to release the VMA when it is no longer used.
These services are designed to be used when accesses are made to the RB
tree without holding the mmap_sem.

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.

When free_vma() is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remain in use until the
VMA's use count reaches 0, which may happen later when exiting an
in-progress speculative page fault.

[1] https://patchwork.kernel.org/patch/5108281/

Cc: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Matthew Wilcox <wi...@infradead.org>
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |   4 ++
 kernel/fork.c|   3 ++
 mm/init-mm.c |   3 ++
 mm/internal.h|   6 +++
 mm/mmap.c| 122 ++-
 5 files changed, 106 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 34fde7111e88..28c763ea1036 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_t vm_sequence;
+   atomic_t vm_ref_count;  /* see get_vma(), put_vma() */
 #endif
 } __randomize_layout;
 
@@ -353,6 +354,9 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_t mm_rb_lock;
+#endif
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index a32e1c4311b2..9ecac4f725b9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -889,6 +889,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_init(&mm->mm_rb_lock);
+#endif
   atomic_set(&mm->mm_users, 1);
   atomic_set(&mm->mm_count, 1);
   init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d15ebc0..e71ac37a98c4 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -17,6 +17,9 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   .mm_rb_lock = __RW_LOCK_UNLOCKED(init_mm.mm_rb_lock),
+#endif
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..fb2667b20f0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,12 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern struct vm_area_struct *get_vma(struct mm_struct *mm,
+ unsigned long addr);
+extern void put_vma(struct vm_area_struct *vma);
+#endif
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ac32b577a0c9..182359a5445c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c

[PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-03-13 Thread Laurent Dufour
When dealing with the speculative page fault handler, we may race with a
VMA being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address the page fault is occurring at.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma the new->anon_vma is set to src->anon_vma when
*new = *vma.

If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer, otherwise the merge couldn't be done.

So in all cases we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA, and since there is an anon_vma this
means that at one time a page has been backed and that before the VMA
is cleaned, the page table lock would have to be grabbed to clear the
PTE, and the anon_vma field is checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one which
does the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but the anon_vma is still valid, as
explained above.

Signed-off-by: Laurent Dufour 
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct 
vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index 184a0d663a76..66517535514b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,7 +2559,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -3083,7 +3083,7 @@ int do_swap_page(struct vm_fault *vmf)
 
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3234,7 +3234,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3486,7 +3486,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup 
*memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
ind

[PATCH v9 18/24] mm: Provide speculative fault infrastructure

2018-03-13 Thread Laurent Dufour
From: Peter Zijlstra <pet...@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
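
For illustration, a minimal userspace sketch of the validate-or-retry
scheme described above, using a seqcount-like counter (hypothetical
names, not the kernel API, and it ignores the memory-barrier details
the real code has to handle):

#include <stdatomic.h>
#include <stdbool.h>

struct guarded {
	atomic_uint seq;	/* odd => writer in progress */
	unsigned long value;
};

static void write_begin(struct guarded *g) { atomic_fetch_add(&g->seq, 1); }
static void write_end(struct guarded *g)   { atomic_fetch_add(&g->seq, 1); }

/* Returns true if the speculative read is usable, false when the
 * caller must retry through the slow (locked) path. */
static bool speculative_read(struct guarded *g, unsigned long *out)
{
	unsigned int seq = atomic_load(&g->seq);

	if (seq & 1)
		return false;		/* writer active: fall back */

	*out = g->value;		/* the speculative "fault" work */

	/* Re-check before committing: any change means retry. */
	return atomic_load(&g->seq) == seq;
}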

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmf.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try to spin-lock the pte during the speculative path to avoid a deadlock
 with other CPUs invalidating the TLB and requiring this CPU to catch
 the inter-processor interrupt]
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |   8 +
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  16 +-
 mm/memory.c| 342 -
 5 files changed, 364 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73b8b99f482b..1acc3f4e07d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -329,6 +329,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations 
*/
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1351,6 +1355,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int handle_speculative_fault(struct mm_struct *mm,
+   unsigned long address, unsigned int flags);
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34ce3ebf97d5..70e4d2688e7b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,

[PATCH v9 20/24] perf: Add a speculative page fault sw event

2018-03-13 Thread Laurent Dufour
Add a new software event to count succeeded speculative page faults.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 6f873503552d..a6ddab9edeec 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.7.4
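
For illustration, a hedged userspace sketch of consuming the new counter
via perf_event_open(2), assuming a kernel with this series applied (the
raw config value 11 mirrors the PERF_COUNT_SW_SPF value added above; on
an unpatched kernel the open simply fails with EINVAL):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr = {
		.type	 = PERF_TYPE_SOFTWARE,
		.size	 = sizeof(attr),
		.config	 = 11,	/* PERF_COUNT_SW_SPF on a patched kernel */
		.inherit = 1,	/* include child threads */
	};
	uint64_t count;
	int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	/* ... run the multithreaded workload to be measured ... */
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("speculative faults: %llu\n",
		       (unsigned long long)count);
	close(fd);
	return 0;
}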



[PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-03-13 Thread Laurent Dufour
Add support for the new speculative faults event.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index 6f873503552d..a6ddab9edeec 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ef351688b797..45b954019118 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -428,6 +428,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 34589c427e52..2a8189c6d5fc 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 655ecff636a8..5d6782426b30 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -308,6 +308,7 @@ emulation-faults	{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); }
 dummy			{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 duration_time		{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output		{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf	{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index 2918cac7a142..00dd227959e6 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1174,6 +1174,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.7.4



[PATCH v9 22/24] mm: Speculative page fault handler return VMA

2018-03-13 Thread Laurent Dufour
When the speculative page fault handler is returning VM_FAULT_RETRY, there
is a chance that the VMA fetched without grabbing the mmap_sem can be
reused by the legacy page fault handler.  By reusing it, we avoid calling
find_vma() again. To achieve that, we must ensure that the VMA structure
will not be freed behind our back. This is done by getting a reference on
it (get_vma()) and by assuming that the caller will call the new service
can_reuse_spf_vma() once it has grabbed the mmap_sem.

can_reuse_spf_vma() first checks that the VMA is still in the RB tree,
then that the VMA's boundaries match the passed address, and releases
the reference on the VMA so that it can be freed if needed.

In the case the VMA is freed, can_reuse_spf_vma() will have returned false
as the VMA is no longer in the RB tree.
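
For illustration, a minimal userspace sketch of the reuse check described
above (hypothetical names, not the kernel API): the cached object may be
reused only if it is still live in the tree and still covers the faulting
address, and the speculative reference is dropped in every case.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct area {
	unsigned long start, end;
	atomic_int ref;		/* freed when this drops to zero */
	atomic_bool in_tree;	/* cleared when removed from the tree */
};

static void area_put(struct area *a)
{
	if (atomic_fetch_sub(&a->ref, 1) == 1)
		free(a);
}

static bool can_reuse(struct area *a, unsigned long addr)
{
	bool ok = atomic_load(&a->in_tree) &&
		  addr >= a->start && addr < a->end;

	area_put(a);	/* drop the reference taken on the lockless path */
	return ok;
}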

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |   5 +-
 mm/memory.c| 136 +
 2 files changed, 88 insertions(+), 53 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1acc3f4e07d1..38a8c0041fd0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1357,7 +1357,10 @@ extern int handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
unsigned int flags);
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 extern int handle_speculative_fault(struct mm_struct *mm,
-   unsigned long address, unsigned int flags);
+   unsigned long address, unsigned int flags,
+   struct vm_area_struct **vma);
+extern bool can_reuse_spf_vma(struct vm_area_struct *vma,
+ unsigned long address);
 #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
diff --git a/mm/memory.c b/mm/memory.c
index f39c4a4df703..16d3f5f4ffdd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4292,13 +4292,22 @@ static int __handle_mm_fault(struct vm_area_struct 
*vma, unsigned long address,
 /* This is required by vm_normal_page() */
 #error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
 #endif
-
 /*
  * vm_normal_page() adds some processing which should be done while
  * hodling the mmap_sem.
  */
+
+/*
+ * Tries to handle the page fault in a speculative way, without grabbing the
+ * mmap_sem.
+ * When VM_FAULT_RETRY is returned, the vma pointer is valid and this vma must
+ * be checked later when the mmap_sem has been grabbed by calling
+ * can_reuse_spf_vma().
+ * This is needed as the returned vma is kept in memory until the call to
+ * can_reuse_spf_vma() is made.
+ */
 int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
-unsigned int flags)
+unsigned int flags, struct vm_area_struct **vma)
 {
struct vm_fault vmf = {
.address = address,
@@ -4307,7 +4316,6 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
p4d_t *p4d, p4dval;
pud_t pudval;
int seq, ret = VM_FAULT_RETRY;
-   struct vm_area_struct *vma;
 #ifdef CONFIG_NUMA
struct mempolicy *pol;
 #endif
@@ -4316,14 +4324,16 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
flags |= FAULT_FLAG_SPECULATIVE;
 
-   vma = get_vma(mm, address);
-   if (!vma)
+   *vma = get_vma(mm, address);
+   if (!*vma)
return ret;
+   vmf.vma = *vma;
 
-   seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
+   /* rmb <-> seqlock,vma_rb_erase() */
+   seq = raw_read_seqcount(&vmf.vma->vm_sequence);
if (seq & 1) {
-   trace_spf_vma_changed(_RET_IP_, vma, address);
-   goto out_put;
+   trace_spf_vma_changed(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4331,9 +4341,9 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * with the VMA.
 * This include huge page from hugetlbfs.
 */
-   if (vma->vm_ops) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;
+   if (vmf.vma->vm_ops) {
+   trace_spf_vma_notsup(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4341,18 +4351,18 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * because vm_next and vm_prev must be safe. This can't be guaranteed
 * in the speculative path.
 */
-   if (unlikely(!vma->anon_vma)) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;

[PATCH v9 24/24] powerpc/mm: Add speculative page fault

2018-03-13 Thread Laurent Dufour
This patch enables the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem;
if it returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded processes, as there
is no risk of contention on the mmap_sem otherwise.

Built only if CONFIG_SPECULATIVE_PAGE_FAULT is defined (currently for
BOOK3S_64 && SMP).

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 866446cf2d9a..104f3cc86b51 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -392,6 +392,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
   unsigned long error_code)
 {
struct vm_area_struct * vma;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct vm_area_struct *spf_vma = NULL;
+#endif
struct mm_struct *mm = current->mm;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
int is_exec = TRAP(regs) == 0x400;
@@ -459,6 +462,20 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (is_user && (atomic_read(&mm->mm_users) > 1)) {
+   /* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, address, flags, &spf_vma);
+   if (!(fault & VM_FAULT_RETRY)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -489,7 +506,16 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
might_sleep();
}
 
-   vma = find_vma(mm, address);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (spf_vma) {
+   if (can_reuse_spf_vma(spf_vma, address))
+   vma = spf_vma;
+   else
+   vma =  find_vma(mm, address);
+   spf_vma = NULL;
+   } else
+#endif
+   vma = find_vma(mm, address);
if (unlikely(!vma))
return bad_area(regs, address);
if (likely(vma->vm_start <= address))
@@ -568,6 +594,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(&current->mm->mmap_sem);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.7.4



[PATCH v9 23/24] x86/mm: Add speculative pagefault handling

2018-03-13 Thread Laurent Dufour
From: Peter Zijlstra <pet...@infradead.org>

Try a speculative fault before acquiring mmap_sem; if it returns with
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with usual fault path in the case VM_FAULT_ERROR is returned by
 handle_speculative_fault(). This allows signals to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Try speculative fault path only for multi threaded processes]
[Try to reuse the VMA fetched during the speculative path in case of retry]
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/x86/mm/fault.c | 38 +-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e6af2b464c3d..a73cf227edd6 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1239,6 +1239,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
unsigned long address)
 {
struct vm_area_struct *vma;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct vm_area_struct *spf_vma = NULL;
+#endif
struct task_struct *tsk;
struct mm_struct *mm;
int fault, major = 0;
@@ -1332,6 +1335,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
if (error_code & X86_PF_INSTR)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if ((error_code & X86_PF_USER) && (atomic_read(&mm->mm_users) > 1)) {
+   fault = handle_speculative_fault(mm, address, flags,
+    &spf_vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   if (!(fault & VM_FAULT_ERROR)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   /*
+* In case of error we need the pkey value, but
+* can't get it from the spf_vma as it is only returned
+* when VM_FAULT_RETRY is returned. So we have to
+* retry the page fault with the mmap_sem grabbed.
+*/
+   }
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
 * When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in
@@ -1365,7 +1389,16 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
might_sleep();
}
 
-   vma = find_vma(mm, address);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (spf_vma) {
+   if (can_reuse_spf_vma(spf_vma, address))
+   vma = spf_vma;
+   else
+   vma = find_vma(mm, address);
+   spf_vma = NULL;
+   } else
+#endif
+   vma = find_vma(mm, address);
if (unlikely(!vma)) {
bad_area(regs, error_code, address);
return;
@@ -1451,6 +1484,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
return;
}
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
/*
 * Major/minor page fault accounting. If any of the events
 * returned VM_FAULT_MAJOR, we account it as a major fault.
-- 
2.7.4



[PATCH v9 19/24] mm: Adding speculative page fault failure trace events

2018-03-13 Thread Laurent Dufour
This patch adds a set of new trace events to collect the speculative page
fault event failures.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/trace/events/pagefault.h | 87 
 mm/memory.c  | 62 ++--
 2 files changed, 136 insertions(+), 13 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..1d793f8c739b
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,87 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include <linux/tracepoint.h>
+#include <linux/mm.h>
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_pte_lock,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memory.c b/mm/memory.c
index f0f2caa11282..f39c4a4df703 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -80,6 +80,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/pagefault.h>
+
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2312,8 +2315,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
}
 
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2321,16 +2326,21 @@ static bool pte_spinlock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   if (unlikely(!spin_trylock(vmf->ptl)))
+   if (unlikely(!spin_trylock(vmf->ptl))) {
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2363,8 +2373,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * block on the PTL and thus we're safe.
 */
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2372,8 +2384,10 @@ static bool pte_map_lock(struct vm_fault *vmf)

[PATCH v9 14/24] mm: Introduce __maybe_mkwrite()

2018-03-13 Thread Laurent Dufour
The current maybe_mkwrite() is getting passed the pointer to the vma
structure to fetch the vm_flags field.

When dealing with the speculative page fault handler, it will be better to
rely on the cached vm_flags value stored in the vm_fault structure.

This patch introduces a __maybe_mkwrite() service which can be called by
passing the value of the vm_flags field.

There are no functional changes expected for the other callers of
maybe_mkwrite().

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h | 9 +++--
 mm/memory.c| 6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dfa81a638b7c..a84ddc218bbd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
 
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+   return __maybe_mkwrite(pte, vma->vm_flags);
+}
+
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
diff --git a/mm/memory.c b/mm/memory.c
index 0a0a483d9a65..af0338fbc34d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-   entry = mk_pte(new_page, vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(new_page, vmf->vma_page_prot);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
 * Clear the pte entry and flush it first, before updating the
 * pte with the new entry. This will avoid a race condition
-- 
2.7.4



[PATCH v9 19/24] mm: Adding speculative page fault failure trace events

2018-03-13 Thread Laurent Dufour
This patch a set of new trace events to collect the speculative page fault
event failures.

Signed-off-by: Laurent Dufour 
---
 include/trace/events/pagefault.h | 87 
 mm/memory.c  | 62 ++--
 2 files changed, 136 insertions(+), 13 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..1d793f8c739b
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,87 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include 
+#include 
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_pte_lock,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memory.c b/mm/memory.c
index f0f2caa11282..f39c4a4df703 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -80,6 +80,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/pagefault.h>
+
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
 #endif
@@ -2312,8 +2315,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
}
 
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2321,16 +2326,21 @@ static bool pte_spinlock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   if (unlikely(!spin_trylock(vmf->ptl)))
+   if (unlikely(!spin_trylock(vmf->ptl))) {
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2363,8 +2373,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * block on the PTL and thus we're safe.
 */
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2372,8 +2384,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * is not a hug

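For reference, the vma_has_changed() check instrumented above is the
series' cheap revalidation of the VMA snapshot. A hedged sketch, assuming
the vm_sequence seqcount and the vmf->sequence field added earlier in the
series (the exact helper may differ):

	static inline bool vma_has_changed(struct vm_fault *vmf)
	{
		/* Unlinked from the RB tree means unmapped behind us. */
		int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
		unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);

		/* Pairs with the write side in vm_write_begin()/end(). */
		smp_rmb();

		return ret || seq != vmf->sequence;
	}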

[PATCH v9 09/24] mm: protect mremap() against SPF handler

2018-03-13 Thread Laurent Dufour
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the pte has been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.

CPU 1   CPU2
enter mremap()
unmap the dest area
copy_vma()  Enter speculative page fault handler
   >> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler

move_ptes()
  > it is assumed that the dest area is empty,
  > the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
vma_adjust().
The assumption is that copy_vma() returns a vma which should be released
by the callee calling vm_raw_write_end() once the ptes have been moved.

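For illustration, the mm/mremap.c side (whose hunk is elided from the
truncated diff below) would then look roughly like this sketch; the names
follow this series, but the exact code may differ:

	/* move_vma(): copy_vma() now returns the new VMA still
	 * raw-write-locked, so the SPF handler cannot use it yet. */
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
			   &need_rmap_locks);
	if (!new_vma)
		return -ENOMEM;

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
				     old_len, need_rmap_locks);
	/* ... error handling elided ... */

	/* Only now may the speculative handler see the new VMA. */
	vm_raw_write_end(new_vma);
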
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h | 16 
 mm/mmap.c  | 47 ---
 mm/mremap.c| 13 +
 3 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 88042d843668..ef6ef0627090 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand);
+   struct vm_area_struct *expand, bool keep_locked);
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-   return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+   return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+extern struct vm_area_struct *__vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *, struct vm_userfaultfd_ctx);
+   struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
+static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
+   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+   return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
+  pol, uff, false);
+}
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index d6533cb85213..ac32b577a0c9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand)
+   struct vm_area_struct *expand, bool keep_locked)
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 
if (next && next != vma)
vm_raw_write_end(next);
-   vm_raw_write_end(vma);
+   if (!keep_locked)
+   vm_raw_write_end(vma);
 
validate_mm(mm);
 
@@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
  * parameter) may establish ptes with the wrong permissions of 
  * instead of the right permissions of .
  */
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,


[PATCH v9 02/24] x86/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT

2018-03-13 Thread Laurent Dufour
Introduce CONFIG_SPECULATIVE_PAGE_FAULT which turns on the Speculative Page
Fault handler when building for 64-bit with SMP.

Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a0a777ce4c7c..4c018c48d414 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86_64
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select X86_DEV_DMA_OPS
+   select SPECULATIVE_PAGE_FAULT if SMP
 
 #
 # Arch settings
-- 
2.7.4




[PATCH v9 05/24] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2018-03-13 Thread Laurent Dufour
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring that
the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.

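The bool return anticipates the speculative variant of this helper (see
the instrumented pte_spinlock() in the tracing patch earlier in this
archive), which can fail to revalidate the VMA; callers then simply do:

	/* Sketch of the caller pattern this refactoring enables. */
	if (!pte_spinlock(vmf))
		return VM_FAULT_RETRY;
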
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8ac241b9f370..21b1212a0892 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3798,8 +3805,8 @@ static int do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3992,8 +3999,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.7.4




[PATCH v9 00/24] Speculative page faults

2018-03-13 Thread Laurent Dufour
  473  pagefault:spf_vma_notsup
 0  pagefault:spf_vma_access
 0  pagefault:spf_pmd_changed

Most of the processes involved are single-threaded so SPF is not activated,
but there is no impact on performance.

Ebizzy:
---
The test counts the number of records per second it can manage; the higher
the better. I run it like this: 'ebizzy -mTRp'. To get consistent results I
repeated the test 100 times and measured the average. The number is records
processed per second; the higher the better.

                    BASE      SPF       delta
16 CPUs x86 VM      14902.6   95905.16  543.55%
80 CPUs P8 node     37240.24  78185.67  109.95%

Here are the performance counters read during a run on a 16 CPUs x86 VM:
 Performance counter stats for './ebizzy -mRTp':
888157  faults
884773  spf
92  pagefault:spf_pte_lock
  2379  pagefault:spf_vma_changed
 0  pagefault:spf_vma_noanon
80  pagefault:spf_vma_notsup
 0  pagefault:spf_vma_access
 0  pagefault:spf_pmd_changed

And the ones captured during a run on an 80 CPUs Power node:
 Performance counter stats for './ebizzy -mRTp':
762134  faults
728663  spf
 19101  pagefault:spf_pte_lock
 13969  pagefault:spf_vma_changed
 0  pagefault:spf_vma_noanon
   272  pagefault:spf_vma_notsup
 0  pagefault:spf_vma_access
 0  pagefault:spf_pmd_changed

In ebizzy's case most of the page faults were handled in a speculative way,
leading to the ebizzy performance boost.

--
Changes since v8:
 - Don't check PMD when locking the pte when THP is disabled
   Thanks to Daniel Jordan for reporting this.
 - Rebase on 4.16
Changes since v7:
 - move pte_map_lock() and pte_spinlock() upper in mm/memory.c (patch 4 &
   5)
 - make pte_unmap_same() compatible with the speculative page fault (patch
   6)
Changes since v6:
 - Rename config variable to CONFIG_SPECULATIVE_PAGE_FAULT (patch 1)
 - Review the way the config variable is set (patch 1 to 3)
 - Introduce mm_rb_write_*lock() in mm/mmap.c (patch 18)
 - Merge patch introducing pte try locking in the patch 18.
Changes since v5:
 - use rwlock against the mm RB tree in place of SRCU
 - add a VMA's reference count to protect VMA while using it without
   holding the mmap_sem.
 - check PMD value to detect collapsing operation
 - don't try speculative page fault for mono threaded processes
 - try to reuse the fetched VMA if VM_RETRY is returned
 - go directly to the error path if an error is detected during the SPF
   path
 - fix race window when moving VMA in move_vma()
Changes since v4:
 - As requested by Andrew Morton, use CONFIG_SPF and define it earlier in
 the series to ease bisection.
Changes since v3:
 - Don't build when CONFIG_SMP is not set
 - Fixed a lock dependency warning in __vma_adjust()
 - Use READ_ONCE to access p*d values in handle_speculative_fault()
 - Call memcg_oom() service in handle_speculative_fault()
Changes since v2:
 - Perf event is renamed in PERF_COUNT_SW_SPF
 - On Power handle do_page_fault()'s cleaning
 - On Power if the VM_FAULT_ERROR is returned by
 handle_speculative_fault(), do not retry but jump to the error path
 - If VMA's flags are not matching the fault, directly returns
 VM_FAULT_SIGSEGV and not VM_FAULT_RETRY
 - Check for pud_trans_huge() to avoid speculative path
 - Handles _vm_normal_page()'s introduced by 6f16211df3bf
 ("mm/device-public-memory: device memory cache coherent with CPU")
 - add and review few comments in the code
Changes since v1:
 - Remove PERF_COUNT_SW_SPF_FAILED perf event.
 - Add tracing events to details speculative page fault failures.
 - Cache VMA fields values which are used once the PTE is unlocked at the
 end of the page fault events.
 - Ensure that fields read during the speculative path are written and read
 using WRITE_ONCE and READ_ONCE.
 - Add checks at the beginning of the speculative path to abort it if the
 VMA is known to not be supported.
Changes since RFC V5 [5]
 - Port to 4.13 kernel
 - Merging patch fixing lock dependency into the original patch
 - Replace the 2 parameters of vma_has_changed() with the vmf pointer
 - In patch 7, don't call __do_fault() in the speculative path as it may
 want to unlock the mmap_sem.
 - In patch 11-12, don't check for vma boundaries when
 page_add_new_anon_rmap() is called during the spf path and protect against
 anon_vma pointer's update.
 - In patch 13-16, add performance events to report number of successful
 and failed speculative events.

[1]
http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
[2] https://patchwork.kernel.org/patch/687/


Laurent Dufour (20):
  mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT


Re: [PATCH v8 18/24] mm: Provide speculative fault infrastructure

2018-03-05 Thread Laurent Dufour
Hi Jordan,

Thanks for reporting this.

On 26/02/2018 18:16, Daniel Jordan wrote:
> Hi Laurent,
> 
> This series doesn't build for me[*] when CONFIG_TRANSPARENT_HUGEPAGE is unset.
> 
> The problem seems to be that the BUILD_BUG() version of pmd_same is called in
> pte_map_lock:
> 
> On 02/16/2018 10:25 AM, Laurent Dufour wrote:
>> +static bool pte_map_lock(struct vm_fault *vmf)
>> +{
> ...snip...
>> +    if (!pmd_same(pmdval, vmf->orig_pmd))
>> +    goto out;
> 
> Since SPF can now call pmd_same without THP, maybe the way to fix it is just
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 2cfa3075d148..e130692db24a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -375,7 +375,8 @@ static inline int pte_unused(pte_t pte)
>  #endif
>  
>  #ifndef __HAVE_ARCH_PMD_SAME
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
> +    defined(CONFIG_SPECULATIVE_PAGE_FAULT)
>  static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
>  {
>     return pmd_val(pmd_a) == pmd_val(pmd_b);

We can't fix it this way because some architectures define their own
pmd_same() function (like ppc64), so forcing the define here would be
useless since __HAVE_ARCH_PMD_SAME is set in that case.

The right way to fix it is to _not check_ the PMD value in pte_spinlock()
when CONFIG_TRANSPARENT_HUGEPAGE is not set, since there is no risk of a
collapse operation behind our back if THP is disabled.
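
Something along these lines (a sketch only; compare the v9 pte_spinlock()
earlier in this archive, which only dereferences the PMD under
CONFIG_TRANSPARENT_HUGEPAGE):

	/* Sketch: guard the PMD revalidation so that pmd_same() is
	 * never instantiated in !THP builds where it is BUILD_BUG(). */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	pmdval = READ_ONCE(*vmf->pmd);
	if (!pmd_same(pmdval, vmf->orig_pmd))
		goto out;
#endif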

I'll fix that in the next version.

Laurent.

> ?
> 
> Daniel
> 
> 
> [*]  The errors are:
> 
> In file included from /home/dmjordan/src/linux/include/linux/kernel.h:10:0,
>  from /home/dmjordan/src/linux/include/linux/list.h:9,
>  from /home/dmjordan/src/linux/include/linux/smp.h:12,
>  from /home/dmjordan/src/linux/include/linux/kernel_stat.h:5,
>  from /home/dmjordan/src/linux/mm/memory.c:41:
> In function ‘pmd_same.isra.104’,
>     inlined from ‘pte_map_lock’ at 
> /home/dmjordan/src/linux/mm/memory.c:2380:7:
> /home/dmjordan/src/linux/include/linux/compiler.h:324:38: error: call to
> ‘__compiletime_assert_391’ declared with attribute error: BUILD_BUG failed
>   _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>   ^
> /home/dmjordan/src/linux/include/linux/compiler.h:304:4: note: in definition 
> of
> macro ‘__compiletime_assert’
>     prefix ## suffix();    \
>     ^~
> /home/dmjordan/src/linux/include/linux/compiler.h:324:2: note: in expansion of
> macro ‘_compiletime_assert’
>   _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>   ^~~
> /home/dmjordan/src/linux/include/linux/build_bug.h:45:37: note: in expansion 
> of
> macro ‘compiletime_assert’
>  #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>  ^~
> /home/dmjordan/src/linux/include/linux/build_bug.h:79:21: note: in expansion 
> of
> macro ‘BUILD_BUG_ON_MSG’
>  #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>  ^~~~
> /home/dmjordan/src/linux/include/asm-generic/pgtable.h:391:2: note: in
> expansion of macro ‘BUILD_BUG’
>   BUILD_BUG();
>   ^
>   CC  block/elevator.o
>   CC  crypto/crypto_wq.o
> In function ‘pmd_same.isra.104’,
>     inlined from ‘pte_spinlock’ at 
> /home/dmjordan/src/linux/mm/memory.c:2326:7,
>     inlined from ‘handle_pte_fault’ at
> /home/dmjordan/src/linux/mm/memory.c:4181:7:
> /home/dmjordan/src/linux/include/linux/compiler.h:324:38: error: call to
> ‘__compiletime_assert_391’ declared with attribute error: BUILD_BUG failed
>   _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>   ^
> /home/dmjordan/src/linux/include/linux/compiler.h:304:4: note: in definition 
> of
> macro ‘__compiletime_assert’
>     prefix ## suffix();    \
>     ^~
> /home/dmjordan/src/linux/include/linux/compiler.h:324:2: note: in expansion of
> macro ‘_compiletime_assert’
>   _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>   ^~~
> /home/dmjordan/src/linux/include/linux/build_bug.h:45:37: note: in expansion 
> of
> macro ‘compiletime_assert’
>  #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>  ^~
> /home/dmjordan/src/linux/include/linux/build_bug.h:79:21: note: in expansion 
> of
> macro ‘BUILD_BUG_ON_MSG’
>  #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
>  ^~~~
> /home/dmjordan/src/linux/include/asm-generic/pgtable.h:391:2: note: in
> expansion of macro ‘BUILD_BUG’
>   BUILD_BUG();
>   ^
> ...
> make[2]: *** [/home/dmjordan/src/linux/scripts/Makefile.build:316: 
> mm/memory.o]
> Error 1
> make[1]: *** [/home/dmjordan/src/linux/Makefile:1047: mm] Error 2
> 




[PATCH v8 05/24] mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2018-02-16 Thread Laurent Dufour
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring that
the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 42824b504434..1ca289f53dd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3815,8 +3822,8 @@ static int do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -4009,8 +4016,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.7.4




[PATCH v8 02/24] x86/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT

2018-02-16 Thread Laurent Dufour
Introduce CONFIG_SPECULATIVE_PAGE_FAULT which turns on the Speculative Page
Fault handler when building for 64-bit with SMP.

Cc: Thomas Gleixner <t...@linutronix.de>
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52221529abf9..e7f0c6f9f877 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86_64
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
select X86_DEV_DMA_OPS
+   select SPECULATIVE_PAGE_FAULT if SMP
 
 #
 # Arch settings
-- 
2.7.4




[PATCH v8 06/24] mm: make pte_unmap_same compatible with SPF

2018-02-16 Thread Laurent Dufour
pte_unmap_same() is making the assumption that the page tables are still
around because the mmap_sem is held.
This is no longer the case when running a speculative page fault, and an
additional check must be made to ensure that the final page tables are
still there.

This is now done by calling pte_spinlock() to check the VMA's consistency
while locking the page tables.

This requires passing a vm_fault structure, which contains all the needed
parameters, to pte_unmap_same().

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been modified behind our back, pte_unmap_same() now returns three
cases:
1. the ptes are the same (0)
2. the ptes are different (VM_FAULT_PTNOTSAME)
3. a VMA change has been detected (VM_FAULT_RETRY)

Case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME, which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.

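Putting it together, callers now follow this pattern (compare the
do_swap_page() hunk below):

	ret = pte_unmap_same(vmf);
	if (ret) {
		/* Different ptes: another CPU already serviced the
		 * fault, so report success. */
		if (ret == VM_FAULT_PTNOTSAME)
			ret = 0;
		/* VM_FAULT_RETRY propagates up so the fault is retried
		 * while holding the mmap_sem. */
		goto out;
	}
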
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c| 37 ++---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 51d950cac772..e869adec9023 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1197,6 +1197,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
 * and needs fsync() to complete (for
 * synchronous page faults in DAX) */
+#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */
 
 #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
diff --git a/mm/memory.c b/mm/memory.c
index 1ca289f53dd6..a301b9003200 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2309,21 +2309,29 @@ static bool pte_map_lock(struct vm_fault *vmf)
  * parts, do_swap_page must check under lock before unmapping the pte and
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0   if the PTE are the same
+ * VM_FAULT_PTNOTSAME  if the PTE are different
+ * VM_FAULT_RETRY  if the VMA has changed in our back during
+ * a speculative page fault handling.
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
-   int same = 1;
+   int ret = 0;
+
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
+   if (pte_spinlock(vmf)) {
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   ret = VM_FAULT_PTNOTSAME;
+   spin_unlock(vmf->ptl);
+   } else
+   ret = VM_FAULT_RETRY;
}
 #endif
-   pte_unmap(page_table);
-   return same;
+   pte_unmap(vmf->pte);
+   return ret;
 }
 
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
@@ -2912,7 +2920,7 @@ int do_swap_page(struct vm_fault *vmf)
pte_t pte;
int locked;
int exclusive = 0;
-   int ret = 0;
+   int ret;
bool vma_readahead = swap_use_vma_readahead();
 
if (vma_readahead) {
@@ -2920,9 +2928,16 @@ int do_swap_page(struct vm_fault *vmf)
swapcache = page;
}
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
+   ret = pte_unmap_same(vmf);
+   if (ret) {
if (page)
put_page(page);
+   /*
+* In the case the PTE are different, meaning that the
+* page has already been processed by another CPU, we return 0.
+*/
+   if (ret == VM_FAULT_PTNOTSAME)
+   ret = 0;
goto out;
}
 
-- 
2.7.4




[PATCH v8 09/24] mm: protect mremap() against SPF handler

2018-02-16 Thread Laurent Dufour
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the pte has been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.

CPU 1   CPU2
enter mremap()
unmap the dest area
copy_vma()  Enter speculative page fault handler
   >> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler

move_ptes()
  > it is assumed that the dest area is empty,
  > the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
vma_adjust().
The assumption is that copy_vma() returns a vma which should be released
by the callee calling vm_raw_write_end() once the ptes have been moved.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h | 16 
 mm/mmap.c  | 47 ---
 mm/mremap.c| 13 +
 3 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c22b4134c5d..dcc2a6db8419 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2187,16 +2187,24 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand);
+   struct vm_area_struct *expand, bool keep_locked);
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-   return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+   return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+extern struct vm_area_struct *__vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *, struct vm_userfaultfd_ctx);
+   struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
+static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
+   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+   return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
+  pol, uff, false);
+}
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index c4d944e1be23..13c799710a8a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand)
+   struct vm_area_struct *expand, bool keep_locked)
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 
if (next && next != vma)
vm_raw_write_end(next);
-   vm_raw_write_end(vma);
+   if (!keep_locked)
+   vm_raw_write_end(vma);
 
validate_mm(mm);
 
@@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
  * parameter) may establish ptes with the wrong permissions of 
  * instead of the right permissions of .
  */
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,


[PATCH v8 08/24] mm: Protect VMA modifications using VMA sequence count

2018-02-16 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, the VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent the writes from being
split and intermediate values from becoming visible to other CPUs.
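
As an illustration of the protocol (a sketch, not part of this patch: the
helper names below are hypothetical, vm_write_begin()/vm_write_end() come
from an earlier patch in the series, and the reader-side helper is an
assumed consumer of the sequence count):

    /* Writer: mutate the VMA under its sequence count (sketch). */
    static void vma_clear_flags(struct vm_area_struct *vma, unsigned long clr)
    {
            vm_write_begin(vma);            /* sequence becomes odd */
            WRITE_ONCE(vma->vm_flags, vma->vm_flags & ~clr);
            vm_write_end(vma);              /* sequence becomes even again */
    }

    /* Reader: snapshot a field and tell whether a writer raced (sketch). */
    static bool vma_read_flags(struct vm_area_struct *vma, unsigned long *flags)
    {
            unsigned int seq = read_seqcount_begin(&vma->vm_sequence);

            *flags = READ_ONCE(vma->vm_flags);
            /* false means a writer ran concurrently: retry or fall back */
            return !read_seqcount_retry(&vma->vm_sequence, seq);
    }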

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 17 ++---
 mm/mprotect.c  |  4 +++-
 mm/swap_state.c|  8 ++--
 9 files changed, 86 insertions(+), 38 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b66fc8de7d34..adb66e1989ee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1158,8 +1158,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+      vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 87a13a7c8270..1da1ba63c7dd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
up_write(&mm->mmap_sem);
mmput(mm);
@@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7e2268dfc9a..32314e9e48dd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1006,6 +1006,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1041,6 +1042,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma);

[PATCH v8 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-02-16 Thread Laurent Dufour
migrate_misplaced_page() is only called during page fault handling, so
it's better to pass a pointer to the struct vm_fault instead of the vma.

This way, the vma->vm_flags value saved in the vm_fault structure can be
used during the speculative page fault path.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0c6fe904bc97..08960ec74246 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -126,14 +126,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 66a79f44f303..bd5de1aa3f7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3903,7 +3903,7 @@ static int do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5d0dc7b85f90..ad8692ca6a4f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1900,7 +1900,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1913,7 +1913,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.7.4



[PATCH v8 11/24] mm: Cache some VMA fields in the vm_fault structure

2018-02-16 Thread Laurent Dufour
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no longer any guarantee that these fields will not change behind
our back. They will be saved in the vm_fault structure before the VMA is
checked for changes.

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if they are not needed by the callee.
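
As a minimal sketch of the snapshot pattern (the helper name below is
hypothetical; the two fields are the ones added by this patch):

    /* Hypothetical helper: snapshot the VMA fields the fault path relies
     * on, so all later decisions use one consistent set of values. */
    static void vmf_snapshot_vma(struct vm_fault *vmf,
                                 struct vm_area_struct *vma)
    {
            vmf->vma = vma;
            vmf->vma_flags = READ_ONCE(vma->vm_flags);
            vmf->vma_page_prot = vma->vm_page_prot;
    }

Later checks, such as the VM_WRITE|VM_SHARED tests in do_wp_page(), then
operate on vmf->vma_flags, so a concurrent change to the VMA cannot flip
the answer halfway through the fault.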

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |  6 ++
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 38 --
 4 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dcc2a6db8419..fd4eda5195b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -361,6 +361,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c204e3d132b..22a818c7a6de 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3716,6 +3716,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
.vma = vma,
.address = address,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 32314e9e48dd..a946d5306160 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
};
 
/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index e37af9af4202..66a79f44f303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * Don't let another task, with possibly unlocked vma,
 * keep the mlocked page.
 */
-   if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+   if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page);/* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
  */
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
-   WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+   WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
 * We should not cow pages in a shared writeable mapping.
 * Just mark the pages writable and/or call ops->pfn_mkwrite.
 */
-   if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
 (VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);
 
@@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+   } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -3090,7 +3090,7 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3116,7 +3116,7 @@ int do_swap_page(struct vm_fault *vmf)

[PATCH v8 14/24] mm: Introduce __maybe_mkwrite()

2018-02-16 Thread Laurent Dufour
The current maybe_mkwrite() is passed a pointer to the vma structure in
order to fetch its vm_flags field.

When dealing with the speculative page fault handler, it will be better to
rely on the cached vm_flags value stored in the vm_fault structure.

This patch introduces a __maybe_mkwrite() service which can be called by
passing the value of the vm_flags field directly.

No functional change is expected for the other callers of
maybe_mkwrite().

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h | 9 +++--
 mm/memory.c| 6 +++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fd4eda5195b9..49f160e56a0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -685,13 +685,18 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
 
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+   return __maybe_mkwrite(pte, vma->vm_flags);
+}
+
 int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
struct page *page);
 int finish_fault(struct vm_fault *vmf);
diff --git a/mm/memory.c b/mm/memory.c
index bc14973b4eba..8fd33d05ef1e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-   entry = mk_pte(new_page, vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(new_page, vmf->vma_page_prot);
+   entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
 * Clear the pte entry and flush it first, before updating the
 * pte with the new entry. This will avoid a race condition
-- 
2.7.4



[PATCH v8 13/24] mm: Introduce __lru_cache_add_active_or_unevictable

2018-02-16 Thread Laurent Dufour
The speculative page fault handler, which runs without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vm_flags
value is not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable(), which takes the vma
flags value as a parameter instead of the vma pointer.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/swap.h | 10 --
 mm/memory.c  |  8 
 mm/swap.c|  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a1a3f4ed94ce..99377b66ea93 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -337,8 +337,14 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index bd5de1aa3f7a..bc14973b4eba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2560,7 +2560,7 @@ static int wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3106,7 +3106,7 @@ int do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3257,7 +3257,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3509,7 +3509,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 2d337710218f..85fc6e78ca99 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -455,12 +455,12 @@ void lru_cache_add(struct page *page)
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
-- 
2.7.4



[PATCH v8 17/24] mm: Protect mm_rb tree with a rwlock

2018-02-16 Thread Laurent Dufour
This change is inspired by Peter's proposal patch [1], which protected the
VMA using SRCU. Unfortunately, SRCU does not scale well in that particular
case, and it introduces major performance degradation due to excessive
scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
protects access to it using a rwlock.  As the mm_rb tree search is
O(log n), it is safe to protect it with such a lock.  The VMA cache is not
protected by the new rwlock and it should not be used without holding the
mmap_sem.

To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1.  Each time the VMA is picked with the rwlock held, its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, the VMA is no longer in use and should be freed.

This patch prepares for two kinds of VMA access:
 - as usual, under the control of the mmap_sem,
 - without holding the mmap_sem for the speculative page fault handler.

Accesses done under the control of the mmap_sem don't require grabbing the
rwlock for read access to the mm_rb tree, but write accesses must be done
under the protection of the rwlock too. This affects the inserting and
removing of elements in the RB tree.

The patch introduces 2 new functions:
 - get_vma() to find a VMA based on an address, holding the new rwlock.
 - put_vma() to release the VMA when it is no longer used.
These services are designed to be used when accesses are made to the RB
tree without holding the mmap_sem.

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.

When free_vma is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remain in use until
the VMA's use count reaches 0, which may happen later when exiting an
in-progress speculative page fault.

[1] https://patchwork.kernel.org/patch/5108281/
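
The mm/mmap.c hunk is cut short in this archive, so as a rough sketch only,
a lookup/release pair consistent with the description above could look like
the following (the rb-tree walk is simplified and __free_vma() stands in
for the actual release path):

    /* Sketch: find and pin a VMA without holding the mmap_sem. */
    struct vm_area_struct *get_vma(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma = NULL;
            struct rb_node *node;

            read_lock(&mm->mm_rb_lock);
            for (node = mm->mm_rb.rb_node; node; ) {
                    struct vm_area_struct *tmp;

                    tmp = rb_entry(node, struct vm_area_struct, vm_rb);
                    if (tmp->vm_end > addr) {
                            vma = tmp;
                            if (tmp->vm_start <= addr)
                                    break;
                            node = node->rb_left;
                    } else {
                            node = node->rb_right;
                    }
            }
            if (vma && vma->vm_start <= addr)
                    atomic_inc(&vma->vm_ref_count); /* pin it past unlock */
            else
                    vma = NULL;
            read_unlock(&mm->mm_rb_lock);
            return vma;
    }

    /* Sketch: drop the reference taken by get_vma(). */
    void put_vma(struct vm_area_struct *vma)
    {
            if (atomic_dec_and_test(&vma->vm_ref_count))
                    __free_vma(vma);        /* placeholder release helper */
    }

Holding the reference across read_unlock() is what allows a speculative
fault to keep using the VMA even if a concurrent munmap() removes it from
the tree in the meantime.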

Cc: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Matthew Wilcox <wi...@infradead.org>
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |   4 ++
 kernel/fork.c|   3 ++
 mm/init-mm.c |   3 ++
 mm/internal.h|   6 +++
 mm/mmap.c| 122 ++-
 5 files changed, 106 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 34fde7111e88..28c763ea1036 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,7 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_t vm_sequence;
+   atomic_t vm_ref_count;  /* see vma_get(), vma_put() */
 #endif
 } __randomize_layout;
 
@@ -353,6 +354,9 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_t mm_rb_lock;
+#endif
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 505195d26744..6c4dc1aa6f17 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -887,6 +887,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
	mm->mmap = NULL;
	mm->mm_rb = RB_ROOT;
	mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+	rwlock_init(&mm->mm_rb_lock);
+#endif
	atomic_set(&mm->mm_users, 1);
	atomic_set(&mm->mm_count, 1);
	init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d15ebc0..e71ac37a98c4 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -17,6 +17,9 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   .mm_rb_lock = __RW_LOCK_UNLOCKED(init_mm.mm_rb_lock),
+#endif
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..fb2667b20f0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,12 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern struct vm_area_struct *get_vma(struct mm_struct *mm,
+ unsigned long addr);
+extern void put_vma(struct vm_area_struct *vma);
+#endif
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 13c799710a8a..220ba8cb65fc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c

[PATCH v8 18/24] mm: Provide speculative fault infrastructure

2018-02-16 Thread Laurent Dufour
From: Peter Zijlstra <pet...@infradead.org>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
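
The mm/memory.c body is truncated below; the overall shape of the new entry
point, as described above, is roughly the following (a control-flow sketch,
not the actual implementation; the page-table walk is elided):

    int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
                                 unsigned int flags)
    {
            struct vm_fault vmf = { .address = address };
            struct vm_area_struct *vma;
            int ret = VM_FAULT_RETRY;

            vma = get_vma(mm, address);     /* refcounted lookup, no mmap_sem */
            if (!vma)
                    return ret;

            vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
            if (vmf.sequence & 1)           /* a writer is active: bail out */
                    goto out_put;

            /* Snapshot the fields the fault path will rely on. */
            vmf.vma = vma;
            vmf.vma_flags = READ_ONCE(vma->vm_flags);
            vmf.vma_page_prot = vma->vm_page_prot;

            /* ... gup_fast()-style walk of p4d/pud/pmd using READ_ONCE() ... */

            /* Revalidate just before handling the pte; retry on any change. */
            if (read_seqcount_retry(&vma->vm_sequence, vmf.sequence))
                    goto out_put;

            /* ret = handle_pte_fault(&vmf); */
    out_put:
            put_vma(vma);
            return ret;
    }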

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmf.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
 other CPU's invalidating the TLB and requiring this CPU to catch the
 inter processor's interrupt]
Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |   8 +
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  16 +-
 mm/memory.c| 334 -
 5 files changed, 356 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5f17fcd2c32..c383a4e2ceb3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -331,6 +331,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations */
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1349,6 +1353,10 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int handle_speculative_fault(struct mm_struct *mm,
+   unsigned long address, unsigned int flags);
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34ce3ebf97d5..70e4d2688e7b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
	pgoff_t pgoff;

[PATCH v8 21/24] perf tools: Add support for the SPF perf event

2018-02-16 Thread Laurent Dufour
Add support for the new speculative faults event.
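
With the kernel-side event (added earlier in the series) and this tooling
support in place, the event can be requested by its full name or its alias,
for example (illustrative command lines; ./mt_app is a placeholder
workload):

    perf stat -e speculative-faults -- ./mt_app
    perf record -e spf -a -- sleep 10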

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index e0739a1aa4b2..164383273147 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66fa45198a11..7fddb37f44ca 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -438,6 +438,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 34589c427e52..2a8189c6d5fc 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 655ecff636a8..5d6782426b30 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -308,6 +308,7 @@ emulation-faults{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
 dummy					{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 duration_time				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output				{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf			{ return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index b1e999bd21ef..100507d632fa 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1142,6 +1142,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.7.4
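
With the tools side wired up, the new counter is reachable as
"perf stat -e speculative-faults" (or its "spf" alias) per the
parse-events.l hunk above. For completeness, a minimal userspace
sketch reading the same counter through the raw perf_event_open()
syscall; PERF_COUNT_SW_SPF is the value added by this series, and the
fallback define only covers builds against pre-series headers:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#ifndef PERF_COUNT_SW_SPF
#define PERF_COUNT_SW_SPF	11	/* from the uapi hunk above */
#endif

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_SPF;
	attr.disabled = 1;

	/* count speculative faults of the calling task, on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* ... run a fault-heavy multithreaded workload here ... */

	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("speculative-faults: %llu\n",
		       (unsigned long long)count);
	close(fd);
	return 0;
}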



[PATCH v8 24/24] powerpc/mm: Add speculative page fault

2018-02-16 Thread Laurent Dufour
This patch enables the speculative page fault on the PowerPC
architecture.

It first tries a speculative page fault without holding the mmap_sem;
if that returns VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded processes, as
there is no risk of contention on the mmap_sem otherwise.

This is only built when CONFIG_SPECULATIVE_PAGE_FAULT is defined
(currently for BOOK3S_64 && SMP).

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/fault.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 866446cf2d9a..104f3cc86b51 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -392,6 +392,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
   unsigned long error_code)
 {
struct vm_area_struct * vma;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   struct vm_area_struct *spf_vma = NULL;
+#endif
struct mm_struct *mm = current->mm;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
int is_exec = TRAP(regs) == 0x400;
@@ -459,6 +462,20 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (is_user && (atomic_read(&mm->mm_users) > 1)) {
+   /* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, address, flags, &spf_vma);
+   if (!(fault & VM_FAULT_RETRY)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1,
+ regs, address);
+   goto done;
+   }
+   }
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -489,7 +506,16 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
might_sleep();
}
 
-   vma = find_vma(mm, address);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   if (spf_vma) {
+   if (can_reuse_spf_vma(spf_vma, address))
+   vma = spf_vma;
+   else
vma = find_vma(mm, address);
+   spf_vma = NULL;
+   } else
+#endif
+   vma = find_vma(mm, address);
if (unlikely(!vma))
return bad_area(regs, address);
if (likely(vma->vm_start <= address))
@@ -568,6 +594,9 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(&current->mm->mmap_sem);
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.7.4
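
Reduced to its skeleton, the protocol the hunk above implements, and
which another architecture would mirror, looks roughly like this; a
sketch only, with the real __do_page_fault() context and error paths
trimmed:

/*
 * Sketch: try the speculative path first, fall back to the classic
 * path under the mmap_sem on VM_FAULT_RETRY.
 * handle_speculative_fault() and can_reuse_spf_vma() come from this
 * series; everything else stands in for the arch fault handler.
 */
static int spf_fault_skeleton(struct pt_regs *regs, unsigned long address,
			      unsigned int flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma, *spf_vma = NULL;
	int fault;

	/* Fast path, no mmap_sem: only worth trying when other threads
	 * could actually contend on the semaphore. */
	if (atomic_read(&mm->mm_users) > 1) {
		fault = handle_speculative_fault(mm, address, flags,
						 &spf_vma);
		if (!(fault & VM_FAULT_RETRY)) {
			perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
			return fault;
		}
	}

	/* Slow path under the mmap_sem.  A VMA kept by the speculative
	 * path holds a reference and must be validated (and released)
	 * through can_reuse_spf_vma(). */
	down_read(&mm->mmap_sem);
	if (spf_vma && can_reuse_spf_vma(spf_vma, address))
		vma = spf_vma;
	else
		vma = find_vma(mm, address);
	if (unlikely(!vma)) {
		up_read(&mm->mmap_sem);
		return VM_FAULT_SIGSEGV;
	}

	fault = handle_mm_fault(vma, address, flags);
	up_read(&mm->mmap_sem);
	return fault;
}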



[PATCH v8 19/24] mm: Adding speculative page fault failure trace events

2018-02-16 Thread Laurent Dufour
This patch adds a set of new trace events to collect the speculative
page fault failures.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/trace/events/pagefault.h | 87 
 mm/memory.c  | 62 ++--
 2 files changed, 136 insertions(+), 13 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..1d793f8c739b
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,87 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include <linux/tracepoint.h>
+#include <linux/mm.h>
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_pte_lock,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memory.c b/mm/memory.c
index efc32173264e..2ef686405154 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -80,6 +80,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/pagefault.h>
+
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2310,23 +2313,30 @@ static bool pte_spinlock(struct vm_fault *vmf)
}
 
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
/*
 * We check if the pmd value is still the same to ensure that there
 * is a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   if (unlikely(!spin_trylock(vmf->ptl)))
+   if (unlikely(!spin_trylock(vmf->ptl))) {
+   trace_spf_pte_lock(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2357,16 +2367,20 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * block on the PTL and thus we're safe.
 */
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
/*
 * We check if the pmd value is still the same to ensure that there
 * is a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
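
The new events land under the "pagefault" group in tracefs
(TRACE_SYSTEM above), so the usual tracing interfaces apply. A small
userspace consumer sketch, assuming tracefs is mounted at
/sys/kernel/debug/tracing and root privileges:

#include <stdio.h>
#include <string.h>

static int echo_to(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[512];
	FILE *fp;

	/* enable every event in the group: spf_pte_lock,
	 * spf_vma_changed, spf_vma_noanon, spf_vma_notsup,
	 * spf_vma_access, spf_pmd_changed */
	if (echo_to("/sys/kernel/debug/tracing/events/pagefault/enable",
		    "1")) {
		perror("enable");
		return 1;
	}

	/* each line reports ip, vma bounds and the faulting address,
	 * per the TP_printk() format above */
	fp = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!fp)
		return 1;
	while (fgets(line, sizeof(line), fp))
		if (strstr(line, "spf_"))
			fputs(line, stdout);
	fclose(fp);
	return 0;
}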

[PATCH v8 22/24] mm: Speculative page fault handler return VMA

2018-02-16 Thread Laurent Dufour
When the speculative page fault handler returns VM_FAULT_RETRY, there is
a chance that the VMA fetched without grabbing the mmap_sem can be reused
by the legacy page fault handler.  By reusing it, we avoid calling
find_vma() again. To achieve that, we must ensure that the VMA structure
will not be freed behind our back. This is done by taking a reference on
it (get_vma()) and by assuming that the caller will call the new service
can_reuse_spf_vma() once it has grabbed the mmap_sem.

can_reuse_spf_vma() first checks that the VMA is still in the RB tree,
then that the VMA's boundaries match the passed address, and releases the
reference on the VMA so that it can be freed if needed.

If the VMA has been freed, can_reuse_spf_vma() will have returned false,
as the VMA is no longer in the RB tree.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/linux/mm.h |   5 +-
 mm/memory.c| 136 +
 2 files changed, 88 insertions(+), 53 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c383a4e2ceb3..0cd31a37bb3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1355,7 +1355,10 @@ extern int handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
unsigned int flags);
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
 extern int handle_speculative_fault(struct mm_struct *mm,
-   unsigned long address, unsigned int flags);
+   unsigned long address, unsigned int flags,
+   struct vm_area_struct **vma);
+extern bool can_reuse_spf_vma(struct vm_area_struct *vma,
+ unsigned long address);
 #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
diff --git a/mm/memory.c b/mm/memory.c
index 2ef686405154..1f5ce5ff79af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4307,13 +4307,22 @@ static int __handle_mm_fault(struct vm_area_struct 
*vma, unsigned long address,
 /* This is required by vm_normal_page() */
 #error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
 #endif
-
 /*
  * vm_normal_page() adds some processing which should be done while
 * holding the mmap_sem.
  */
+
+/*
+ * Tries to handle the page fault in a speculative way, without grabbing the
+ * mmap_sem.
+ * When VM_FAULT_RETRY is returned, the vma pointer is valid and this vma must
+ * be checked later when the mmap_sem has been grabbed by calling
+ * can_reuse_spf_vma().
+ * This is needed as the returned vma is kept in memory until the call to
+ * can_reuse_spf_vma() is made.
+ */
 int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
-unsigned int flags)
+unsigned int flags, struct vm_area_struct **vma)
 {
struct vm_fault vmf = {
.address = address,
@@ -4322,7 +4331,6 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
p4d_t *p4d, p4dval;
pud_t pudval;
int seq, ret = VM_FAULT_RETRY;
-   struct vm_area_struct *vma;
 #ifdef CONFIG_NUMA
struct mempolicy *pol;
 #endif
@@ -4331,14 +4339,16 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
flags |= FAULT_FLAG_SPECULATIVE;
 
-   vma = get_vma(mm, address);
-   if (!vma)
+   *vma = get_vma(mm, address);
+   if (!*vma)
return ret;
+   vmf.vma = *vma;
 
-   seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> 
seqlock,vma_rb_erase() */
+   /* rmb <-> seqlock,vma_rb_erase() */
+   seq = raw_read_seqcount(&vmf.vma->vm_sequence);
if (seq & 1) {
-   trace_spf_vma_changed(_RET_IP_, vma, address);
-   goto out_put;
+   trace_spf_vma_changed(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4346,9 +4356,9 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * with the VMA.
 * This includes huge pages from hugetlbfs.
 */
-   if (vma->vm_ops) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;
+   if (vmf.vma->vm_ops) {
+   trace_spf_vma_notsup(_RET_IP_, vmf.vma, address);
+   return ret;
}
 
/*
@@ -4356,18 +4366,18 @@ int handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 * because vm_next and vm_prev must be safe. This can't be guaranteed
 * in the speculative path.
 */
-   if (unlikely(!vma->anon_vma)) {
-   trace_spf_vma_notsup(_RET_IP_, vma, address);
-   goto out_put;
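
The contract above boils down to a validate-and-release helper. A
sketch of the shape of can_reuse_spf_vma(), assuming the
get_vma()/put_vma() reference counting introduced earlier in the
series and that the vm_rb node is cleared when the VMA is unlinked
from the RB tree (the actual implementation may differ in detail):

/* Validate a VMA kept from the speculative path now that the
 * mmap_sem is held, and drop the reference taken by get_vma()
 * either way. */
bool can_reuse_spf_vma_sketch(struct vm_area_struct *vma,
			      unsigned long address)
{
	bool ret;

	ret = !RB_EMPTY_NODE(&vma->vm_rb) &&
	      vma->vm_start <= address && address < vma->vm_end;

	put_vma(vma);	/* the VMA may be freed here if it was unlinked */
	return ret;
}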

[PATCH v8 20/24] perf: Add a speculative page fault sw event

2018-02-16 Thread Laurent Dufour
Add a new software event to count successful speculative page faults.

Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e0739a1aa4b2..164383273147 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.7.4


