[PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-10-11 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.
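
For reference, the vm_write_begin()/vm_write_end() helpers used below are
assumed to simply wrap that per-VMA sequence count; a minimal sketch (the
exact definitions live in an earlier patch of this series and may differ in
detail):

static inline void vm_write_begin(struct vm_area_struct *vma)
{
        /* Mark the VMA as being modified; speculative readers will retry. */
        write_seqcount_begin(&vma->vm_sequence);
}

static inline void vm_write_end(struct vm_area_struct *vma)
{
        /* Publish the updated VMA fields and let readers proceed. */
        write_seqcount_end(&vma->vm_sequence);
}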

This patch provides protection against the VMA modifications done in:
- madvise()
- mremap()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE to prevent the writes from being
split and intermediate values from becoming visible to other CPUs.
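
As an illustration only (not the actual speculative fault path code), the
reader side would pair those stores with READ_ONCE under the sequence count,
along these lines:

        unsigned int seq;
        unsigned long flags;

        do {
                seq = read_seqcount_begin(&vma->vm_sequence);
                /* Snapshot the fields the speculative fault needs. */
                flags = READ_ONCE(vma->vm_flags);
        } while (read_seqcount_retry(&vma->vm_sequence, seq));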

Signed-off-by: Laurent Dufour 
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 17 ++---
 mm/mprotect.c  |  4 +++-
 mm/mremap.c|  6 ++
 9 files changed, 86 insertions(+), 36 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0bf9e423aa99..4fc37f88437c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1155,8 +1155,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c713fd5b3e6..426af4fd407c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -640,8 +640,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -866,8 +869,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
up_write(&mm->mmap_sem);
mmput(mm);
@@ -1425,8 +1430,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1583,8 +1590,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c01f177a1120..f723d42140db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1040,6 +1041,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma);
result = SCAN_FAIL;
goto out;
  

Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-10-26 Thread Andrea Arcangeli
Hello Laurent,

Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows
significant slowdown even for brk/malloc ops both single and
multi threaded.

The single threaded case I think is the most important because it has
zero chance of getting back any benefit later during page faults.

Could you check if:

1. it's possible to change vm_write_begin to be a noop if mm->mm_count is
   <= 1? Hint: clone() will run single threaded so there's no way it can run
   in the middle of a begin/end critical section (clone could set an
   MMF flag to possibly keep the sequence counter activated if a child
   thread exits and mm_count drops to 1 while the other cpu is in the
   middle of a critical section in the other thread).

2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
   freeing happened only once a MMF flag is set? That will at least
   reduce the risk of temporary memory waste until the next RCU grace
   period. The read of the MMF will scale fine. Of course to allow
   point 1 and 2 then the page fault should also take the mmap_sem
   until the MMF flag is set.
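
Purely as an illustration of point 1 above (keeping the mm->mm_count test as
suggested, and ignoring for now the begin/end imbalance if the count changes
mid-section), the short-circuit could look something like:

static inline void vm_write_begin(struct vm_area_struct *vma)
{
        /* Single user of the mm: no speculative reader can race with us. */
        if (atomic_read(&vma->vm_mm->mm_count) <= 1)
                return;
        write_seqcount_begin(&vma->vm_sequence);
}

/* vm_write_end() would need the exact same check, which is where the
 * begin/end imbalance risk discussed later in this thread comes from. */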

Could you also investigate a much bigger change: I wonder if it's
possible to drop the sequence number entirely from the vma and stop
using sequence numbers entirely (which is likely the source of the
single threaded regression in point 1 that may explain the report in
the above message-id), and just call the vma rbtree lookup once again
and check that everything is still the same in the vma and the PT lock
obtained is still a match to finish the anon page fault and fill the
pte?
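
A rough sketch of what that revalidation might look like (hypothetical
helper, just to make the idea concrete; the real thing would have to run
under whatever protects the VMA from being freed, e.g. the SRCU scheme in
this series):

static bool spf_vma_still_valid(struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                unsigned long address,
                                spinlock_t *ptl, pmd_t *pmd)
{
        /* Look the VMA up again and check nothing relevant changed. */
        struct vm_area_struct *cur = find_vma(mm, address);

        if (cur != vma)
                return false;
        if (address < vma->vm_start || address >= vma->vm_end)
                return false;
        /* ... plus whatever other VMA fields the fault path consumed ... */

        /* The page table lock we took must still map to this address. */
        return ptl == pte_lockptr(mm, pmd);
}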

Then of course we also need to add a method to the read-write
semaphore so it tells us if there's already one user holding the read
mmap_sem and we're the second one.  If we're the second one (or more
than second) only then we should skip taking the down_read mmap_sem.
Even a multithreaded app won't ever skip taking the mmap_sem until
there's sign of runtime contention, and it won't have to run the way
more expensive sequence number-less revalidation during page faults,
unless we get an immediate scalability payoff because we already know
the mmap_sem is already contended and there are multiple nested
threads in the page fault handler of the same mm.

Perhaps we'd need something more advanced than a
down_read_trylock_if_not_hold() (which has to be guaranteed not to write
to any cacheline) and we'll have to count the per-thread exponential
backoff of mmap_sem frequency, but starting with
down_read_trylock_if_not_hold() would be good I think.

This is not how the current patch works: the current patch uses a
sequence number because it pretends to always go lockless, and in turn
has to slow down all vma update fast paths, or else the revalidation
slows down page fault performance too much (as it always
revalidates).

I think it would be much better to go speculative only when there's
"detected" runtime contention on the mmap_sem with
down_read_trylock_if_not_hold(); that will make the revalidation
cost a non-issue because normally we won't have to
revalidate the vma at all during page faults. In turn, by making the
revalidation more expensive by starting a vma rbtree lookup from
scratch, we can drop the sequence number entirely, and that should
simplify the patch tremendously because all vm_write_begin/end would
disappear from the patch, and in turn the mmap/brk slowdown measured by
the message-id above should disappear as well.
   
Thanks,
Andrea


Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-11-02 Thread Laurent Dufour
Hi Andrea,

Thanks for reviewing this series, and sorry for the late answer, I took a few
days off...

On 26/10/2017 12:18, Andrea Arcangeli wrote:
> Hello Laurent,
> 
> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows
> significant slowdown even for brk/malloc ops both single and
> multi threaded.
> 
> The single threaded case I think is the most important because it has
> zero chance of getting back any benefit later during page faults.
> 
> Could you check if:
> 
> 1. it's possible change vm_write_begin to be a noop if mm->mm_count is
><= 1? Hint: clone() will run single threaded so there's no way it can run
>in the middle of a being/end critical section (clone could set an
>MMF flag to possibly keep the sequence counter activated if a child
>thread exits and mm_count drops to 1 while the other cpu is in the
>middle of a critical section in the other thread).

This sounds like a good idea, I'll dig into that.
The major risk here is having a thread call vm_*_begin() with
mm->mm_count > 1 and later call vm_*_end() with mm->mm_count <= 1, but
as you mentioned we should find a way to work around this.

> 
> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
>freeing happened only once a MMF flag is set? That will at least
>reduce the risk of temporary memory waste until the next RCU grace
>period. The read of the MMF will scale fine. Of course to allow
>point 1 and 2 then the page fault should also take the mmap_sem
>until the MMF flag is set.
> 

I think we could also deal with the mm->mm_count value here: if there is
only one thread, there is no need to postpone the VMA's free operation, is there?
Also, if mm->mm_count <= 1, there is no need to try the speculative path.

> Could you also investigate a much bigger change: I wonder if it's
> possible to drop the sequence number entirely from the vma and stop
> using sequence numbers entirely (which is likely the source of the
> single threaded regression in point 1 that may explain the report in
> the above message-id), and just call the vma rbtree lookup once again
> and check that everything is still the same in the vma and the PT lock
> obtained is still a match to finish the anon page fault and fill the
> pte?

That's an interesting idea. The big deal here would be to detect that the
VMA has been touched behind our back, but there are not so many VMA fields
involved in the speculative path, so that sounds reasonable. The other point
is to identify the impact of the vma rbtree lookup; its cost is of a known
order, but there is the vma_srcu lock involved.
> 
> Then of course we also need to add a method to the read-write
> semaphore so it tells us if there's already one user holding the read
> mmap_sem and we're the second one.  If we're the second one (or more
> than second) only then we should skip taking the down_read mmap_sem.
> Even a multithreaded app won't ever skip taking the mmap_sem until
> there's sign of runtime contention, and it won't have to run the way
> more expensive sequence number-less revalidation during page faults,
> unless we get an immediate scalability payoff because we already know
> the mmap_sem is already contended and there are multiple nested
> threads in the page fault handler of the same mm.

The problem is that we may have a thread entering the page fault path,
seeing that the mmap_sem is free, grabbing it and continuing to process the
page fault. Then another thread enters mprotect or any other mm service
which grabs the mmap_sem, and it will be blocked until the page fault is
done. The idea with the speculative page fault is also to not block the
other thread which may need to grab the mmap_sem.

> 
> Perhaps we'd need something more advanced than a
> down_read_trylock_if_not_hold() (which has to guaranteed not to write
> to any cacheline) and we'll have to count the per-thread exponential
> backoff of mmap_sem frequency, but starting with
> down_read_trylock_if_not_hold() would be good I think.
> 
> This is not how the current patch works, the current patch uses a
> sequence number because it pretends to go lockless always and in turn
> has to slow down all vma updates fast paths or the revalidation
> slowsdown performance for page fault too much (as it always
> revalidates).
> 
> I think it would be much better to go speculative only when there's
> "detected" runtime contention on the mmap_sem with
> down_read_trylock_if_not_hold() and that will make the revalidation
> cost not an issue to worry about because normally we won't have to
> revalidate the vma at all during page fault. In turn by making the
> revalidation more expensive by starting a vma rbtree lookup from
> scratch, we can drop the sequence number entirely and that should
> simplify the patch tremendously because all vm_write_begin/end would
> disappear from the patch and in turn the mmap/brk slowdown measured by
> the message-id above, should disappear as well.

As I mentioned above, I'm not s

Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-11-02 Thread Laurent Dufour


On 02/11/2017 16:16, Laurent Dufour wrote:
> Hi Andrea,
> 
> Thanks for reviewing this series, and sorry for the late answer, I took few
> days off...
> 
> On 26/10/2017 12:18, Andrea Arcangeli wrote:
>> Hello Laurent,
>>
>> Message-ID: <7ca80231-fe02-a3a7-84bc-ce81690ea...@intel.com> shows
>> significant slowdown even for brk/malloc ops both single and
>> multi threaded.
>>
>> The single threaded case I think is the most important because it has
>> zero chance of getting back any benefit later during page faults.
>>
>> Could you check if:
>>
>> 1. it's possible change vm_write_begin to be a noop if mm->mm_count is
>><= 1? Hint: clone() will run single threaded so there's no way it can run
>>in the middle of a being/end critical section (clone could set an
>>MMF flag to possibly keep the sequence counter activated if a child
>>thread exits and mm_count drops to 1 while the other cpu is in the
>>middle of a critical section in the other thread).
> 
> This sounds to be a good idea, I'll dig on that.
> The major risk here is to have a thread calling vm_*_begin() with
> mm->mm_count > 1 and later calling vm_*_end() with mm->mm_count <= 1, but
> as you mentioned we should find a way to work around this.
> 
>>
>> 2. Same thing with RCU freeing of vmas. Wouldn't it be nicer if RCU
>>freeing happened only once a MMF flag is set? That will at least
>>reduce the risk of temporary memory waste until the next RCU grace
>>period. The read of the MMF will scale fine. Of course to allow
>>point 1 and 2 then the page fault should also take the mmap_sem
>>until the MMF flag is set.
>>
> 
> I think we could also deal with the mm->mm_count value here, if there is
> only one thread, no need to postpone the VMA's free operation. Isn't it ?
> Also, if mm->mm_count <= 1, there is no need to try the speculative path.
> 
>> Could you also investigate a much bigger change: I wonder if it's
>> possible to drop the sequence number entirely from the vma and stop
>> using sequence numbers entirely (which is likely the source of the
>> single threaded regression in point 1 that may explain the report in
>> the above message-id), and just call the vma rbtree lookup once again
>> and check that everything is still the same in the vma and the PT lock
>> obtained is still a match to finish the anon page fault and fill the
>> pte?
> 
> That's an interesting idea. The big deal here would be to detect that the
> VMA has been touched in our back, but there are not so much VMA's fields
> involved in the speculative path so that sounds reasonable. The other point
> is to identify the impact of the vma rbtree lookup, it's also a known
> order, but there is the vma_srcu's lock involved.

I think there is a memory barrier missing when the VMA is modified, so
currently the modifications done to the VMA structure may not be visible
by the time the pte is locked. So doing that change will also require
calling smp_wmb() before locking the page tables. In the current patch this
is ensured by the call to write_seqcount_end().
Doing so will still require a memory barrier when touching the VMA, so I'm
not sure we get far better performance compared to the sequence count
change. But I'll give it a try anyway ;)

>>
>> Then of course we also need to add a method to the read-write
>> semaphore so it tells us if there's already one user holding the read
>> mmap_sem and we're the second one.  If we're the second one (or more
>> than second) only then we should skip taking the down_read mmap_sem.
>> Even a multithreaded app won't ever skip taking the mmap_sem until
>> there's sign of runtime contention, and it won't have to run the way
>> more expensive sequence number-less revalidation during page faults,
>> unless we get an immediate scalability payoff because we already know
>> the mmap_sem is already contended and there are multiple nested
>> threads in the page fault handler of the same mm.
> 
> The problem is that we may have a thread entering the page fault path,
> seeing that the mmap_sem is free, grab it and continue processing the page
> fault. Then another thread is entering mprotect or any other mm service
> which grab the mmap_sem and it will be blocked until the page fault is
> done. The idea with the speculative page fault is also to not block the
> other thread which may need to grab the mmap_sem.
> 
>>
>> Perhaps we'd need something more advanced than a
>> down_read_trylock_if_not_hold() (which has to guaranteed not to write
>> to any cacheline) and we'll have to count the per-thread exponential
>> backoff of mmap_sem frequency, but starting with
>> down_read_trylock_if_not_hold() would be good I think.
>>
>> This is not how the current patch works, the current patch uses a
>> sequence number because it pretends to go lockless always and in turn
>> has to slow down all vma updates fast paths or the revalidation
>> slowsdown performance for page fault too much (as it always
>> revalidates).
>>
>> 

Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-11-02 Thread Andrea Arcangeli
On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
> I think there is some memory barrier missing when the VMA is modified so
> currently the modifications done in the VMA structure may not be written
> down at the time the pte is locked. So doing that change will also requires
> to call smp_wmb() before locking the page tables. In the current patch this
> is ensured by the call to write_seqcount_end().
> Doing so will still require to have a memory barrier when touching the VMA.
> Not sure we get far better performance compared to the sequence count
> change. But I'll give it a try anyway ;)

Luckily smp_wmb is a noop on x86. I would suggest ignoring the above
issue completely if you give it a try, and then if this performs, we
can just embed a smp_wmb() before spin_lock() somewhere in
pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
spin_lock isn't a smp_wmb() equivalent. I would focus on flushing
writes before every pagetable spin_lock for non-x86 archs, rather than
after all vma modifications. That should be easier to keep under
control and it's going to be more efficient too since, if anything, there
are fewer spin locks than vma modifications.

For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
looks more self-contained than surrounding all vma modifications and
it's a noop on x86 anyway.
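
To make that concrete, a sketch of the suggested pattern (the helper name
comes from the suggestion above and does not exist yet; an arch whose
spin_lock() already orders prior stores would override it to a no-op):

#ifndef smp_wmb__before_spin_lock
#define smp_wmb__before_spin_lock()     smp_wmb()       /* arch may override */
#endif

        /* Illustrative call site, e.g. inside pte_offset_map_lock():
         * flush prior VMA stores before taking the page table lock on
         * archs where spin_lock() does not imply a write barrier. */
        ptl = pte_lockptr(mm, pmd);
        smp_wmb__before_spin_lock();
        spin_lock(ptl);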

I thought about the contention detection logic too yesterday: to
detect contention we could have a mm->mmap_sem_contention_jiffies and
if down_read_trylock_exclusive() [same as down_read_if_not_hold in
prev mail] fails (and it'll fail if either the read or write mmap_sem is
held, so also covering mremap/mprotect etc..) we set
mm->mmap_sem_contention_jiffies = jiffies and then, to know if you must
not touch the mmap_sem at all, you compare jiffies against
mmap_sem_contention_jiffies; if it's equal we go speculative. If
that's not enough we can just keep going speculative for a few more
jiffies with time_before(). The srcu lock is not a concern because the
inc/dec of the fast path is in a per-cpu cacheline of course, no false
sharing possible there or it wouldn't be any better than a normal lock.
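
A rough sketch of that trigger (mm->mmap_sem_contention_jiffies and
down_read_trylock_exclusive() are hypothetical, per the description above;
plain down_read_trylock() is used as a stand-in even though the proposed
variant would also fail when another reader already holds the lock):

static bool mmap_sem_try_or_mark_contended(struct mm_struct *mm)
{
        if (down_read_trylock(&mm->mmap_sem))
                return true;    /* got it: take the normal, non-speculative path */

        /* mmap_sem_contention_jiffies is a hypothetical new mm_struct field. */
        mm->mmap_sem_contention_jiffies = jiffies;
        return false;
}

static bool mmap_sem_recently_contended(struct mm_struct *mm)
{
        /* Keep going speculative for a short window after contention. */
        return time_before(jiffies, mm->mmap_sem_contention_jiffies + HZ / 100);
}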

The vma revalidation is already done by khugepaged and mm/userfaultfd,
both need to drop the mmap_sem and continue working on the pagetables,
so we already know it's workable and not too slow.

Summarizing: by using a runtime-contention-triggered speculative
design that goes speculative only when contention is runtime-detected
using the above logic (or equivalent), and by having to revalidate the
vma by hand with find_vma without knowing instantly whether the vma became
stale, we will run with a substantially slower speculative page fault
than with your current speculative always-on design, but the slower
speculative page fault runtime will still scale 100% in SMP so it
should still be faster on large SMP systems. The pro is that it won't
regress the mmap/brk vma modifications. The whole complexity of
tracking the vma modifications should also go away and the resulting
code should be more maintainable and less risky to break in subtle
ways that are impossible to reproduce.

Thanks!
Andrea


Re: [PATCH v5 07/22] mm: Protect VMA modifications using VMA sequence count

2017-11-06 Thread Laurent Dufour
Hi Andrea,

On 02/11/2017 21:08, Andrea Arcangeli wrote:
> On Thu, Nov 02, 2017 at 06:25:11PM +0100, Laurent Dufour wrote:
>> I think there is some memory barrier missing when the VMA is modified so
>> currently the modifications done in the VMA structure may not be written
>> down at the time the pte is locked. So doing that change will also requires
>> to call smp_wmb() before locking the page tables. In the current patch this
>> is ensured by the call to write_seqcount_end().
>> Doing so will still require to have a memory barrier when touching the VMA.
>> Not sure we get far better performance compared to the sequence count
>> change. But I'll give it a try anyway ;)
> 
> Luckily smp_wmb is a noop on x86. I would suggest to ignore the above
> issue completely if you give it a try, and then if this performs, we
> can just embed a smp_wmb() before spin_lock() somewhere in
> pte_offset_map_lock/pte_lockptr/spin_lock_nested for those archs whose
> spin_lock isn't a smp_wmb() equivalent. I would focus at flushing
> writes before every pagetable spin_lock for non-x86 archs, rather than
> after all vma modifications. That should be easier to keep under
> control and it's going to be more efficient too as if something there
> are fewer spin locks than vma modifications.

I do agree that would simplify the patch series a lot.
I'll double check that the pte lock is not taken in a loop, otherwise having
smp_wmb() there will be bad.

Another point I'm trying to double check is that we may see inconsistencies
while reading the vma's flags in the page fault path until the memory
barrier is reached in the VMA-modification path. In particular we may see
vm_flags and vm_page_prot not matching at all, which couldn't happen when
checking the vm_sequence count.

> 
> For non-x86 archs we may then need a smp_wmb__before_spin_lock. That
> looks more self contained than surrounding all vma modifications and
> it's a noop on x86 anyway.
> 
> I thought about the contention detection logic too yesterday: to
> detect contention we could have a mm->mmap_sem_contention_jiffies and
> if down_read_trylock_exclusive() [same as down_read_if_not_hold in
> prev mail] fails (and it'll fail if either read or write mmap_sem is
> hold, so also convering mremap/mprotect etc..) we set
> mm->mmap_sem_contention_jiffies = jiffies and then to know if you must
> not touch the mmap_sem at all, you compare jiffies against
> mmap_sem_contention_jiffies, if it's equal we go speculative. If
> that's not enough we can just keep going speculative for a few more
> jiffies with time_before(). The srcu lock is non concerning because the
> inc/dec of the fast path is in per-cpu cacheline of course, no false
> sharing possible there or it wouldn't be any better than a normal lock.

I'm sorry, I must have missed something here. I can't see how this would
help fix the case where a thread enters the page fault handler,
sees that no one else has the mmap_sem, and then grabs it. While it is
processing the page fault, another thread enters mprotect for instance
and thus will wait for the mmap_sem to be released by the thread processing
the page fault.

Cheers,
Laurent.

> The vma revalidation is already done by khugepaged and mm/userfaultfd,
> both need to drop the mmap_sem and continue working on the pagetables,
> so we already know it's workable and not too slow.
> 
> Summarizing.. by using a runtime contention triggered speculative
> design that goes speculative only when contention is runtime-detected
> using the above logic (or equivalent), and by having to revalidate the
> vma by hand with find_vma without knowing instantly if the vma become
> stale, we will run with a substantially slower speculative page fault
> than with your current speculative always-on design, but the slower
> speculative page fault runtime will still scale 100% in SMP so it
> should still be faster on large SMP systems. The pros is that it won't
> regress the mmap/brk vma modifications. The whole complexity of
> tracking the vma modifications should also go away and the resulting
> code should be more maintainable and less risky to break in subtle
> ways impossible to reproduce.
> 
> Thanks!
> Andrea
>