Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

2018-11-06 Thread Vinayak Menon
On 11/5/2018 11:52 PM, Laurent Dufour wrote:
> On 05/11/2018 at 08:04, vinayak menon wrote:
>> Hi Laurent,
>>
>> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
>>  wrote:
>>>
>>> The VMA sequence count has been introduced to allow fast detection of
>>> VMA modification when running a page fault handler without holding
>>> the mmap_sem.
>>>
>>> This patch provides protection against the VMA modifications done in:
>>>  - madvise()
>>>  - mpol_rebind_policy()
>>>  - vma_replace_policy()
>>>  - change_prot_numa()
>>>  - mlock(), munlock()
>>>  - mprotect()
>>>  - mmap_region()
>>>  - collapse_huge_page()
>>>  - userfaultfd registering services
>>>
>>> In addition, VMA fields which will be read during the speculative fault
>>> path need to be written using WRITE_ONCE to prevent the writes from being
>>> split and intermediate values from becoming visible to other CPUs.
>>>
>>> Signed-off-by: Laurent Dufour 
>>> ---
>>>   fs/proc/task_mmu.c |  5 -
>>>   fs/userfaultfd.c   | 17 +
>>>   mm/khugepaged.c    |  3 +++
>>>   mm/madvise.c   |  6 +-
>>>   mm/mempolicy.c | 51 ++-
>>>   mm/mlock.c | 13 -
>>>   mm/mmap.c  | 22 +-
>>>   mm/mprotect.c  |  4 +++-
>>>   mm/swap_state.c    |  8 ++--
>>>   9 files changed, 89 insertions(+), 40 deletions(-)
>>>
>>>   struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>>  struct vm_fault *vmf)
>>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>>   unsigned long *start,
>>>   unsigned long *end)
>>>   {
>>> -   *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>>> +   *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>>    PFN_DOWN(faddr & PMD_MASK));
>>> -   *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>>> +   *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>>  PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>>   }
>>>
>>> -- 
>>> 2.7.4
>>>
>>
>> I have got a crash on a 4.14 kernel with speculative page faults enabled,
>> and here is my analysis of the problem.
>> The issue has been reported only once.
>
> Hi Vinayak,
>
> Thanks for reporting this.
>
>>
>> [23409.303395]  el1_da+0x24/0x84
>> [23409.303400]  __radix_tree_lookup+0x8/0x90
>> [23409.303407]  find_get_entry+0x64/0x14c
>> [23409.303410]  pagecache_get_page+0x5c/0x27c
>> [23409.303416]  __read_swap_cache_async+0x80/0x260
>> [23409.303420]  swap_vma_readahead+0x264/0x37c
>> [23409.303423]  swapin_readahead+0x5c/0x6c
>> [23409.303428]  do_swap_page+0x128/0x6e4
>> [23409.303431]  handle_pte_fault+0x230/0xca4
>> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
>> [23409.303438]  do_page_fault+0x228/0x3e8
>> [23409.303442]  do_translation_fault+0x50/0x6c
>> [23409.303445]  do_mem_abort+0x5c/0xe0
>> [23409.303447]  el0_da+0x20/0x24
>>
>> Process A accesses address ADDR (part of VMA A) and that results in a
>> translation fault.
>> The kernel enters __handle_speculative_fault to fix the fault.
>> Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
>> via the speculative path.
>> During this time, another process B, which shares the same mm, does an
>> mprotect from another CPU; this goes through mprotect_fixup->__split_vma
>> and splits VMA A into VMAs A and B.
>> After the split, ADDR falls into VMA B, but process A is still using
>> VMA A.
>> Now ADDR is greater than both VMA_A->vm_start and VMA_A->vm_end.
>> swap_vma_readahead->swap_ra_info uses the start and end of the vma to
>> calculate ptes and nr_pte, which goes wrong because of this, finally
>> resulting in a wrong "entry" being passed to
>> swap_vma_readahead->__read_swap_cache_async, which in turn passes an
>> invalid swapper_space to __read_swap_cache_async->find_get_page,
>> causing an abort.
>>
>> The fix I have tried is to cache vm_start and vm_end also in vmf and
>> use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
>> send the patch I am using if you feel that is the right thing to do.

Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

2018-11-04 Thread vinayak menon
Hi Laurent,

On Thu, May 17, 2018 at 4:37 PM Laurent Dufour
 wrote:
>
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
>
> This patch provides protection against the VMA modifications done in:
> - madvise()
> - mpol_rebind_policy()
> - vma_replace_policy()
> - change_prot_numa()
> - mlock(), munlock()
> - mprotect()
> - mmap_region()
> - collapse_huge_page()
> - userfaultfd registering services
>
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE to prevent the writes from being
> split and intermediate values from becoming visible to other CPUs.
>
> Signed-off-by: Laurent Dufour 
> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 +
>  mm/khugepaged.c|  3 +++
>  mm/madvise.c   |  6 +-
>  mm/mempolicy.c | 51 ++-
>  mm/mlock.c | 13 -
>  mm/mmap.c  | 22 +-
>  mm/mprotect.c  |  4 +++-
>  mm/swap_state.c|  8 ++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
>
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf)
> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>  unsigned long *start,
>  unsigned long *end)
>  {
> -   *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> +   *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>   PFN_DOWN(faddr & PMD_MASK));
> -   *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> +   *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
> PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>
> --
> 2.7.4
>
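
To make the protection scheme described above concrete, here is a minimal
read-side sketch (illustrative only, not code from the series): the
speculative path samples the per-VMA sequence count and reads
vm_start/vm_end with READ_ONCE(), falling back to the regular mmap_sem path
whenever a writer is, or has been, active. Only vm_sequence, vm_start and
vm_end are taken from the quoted patch; the helper name and structure are
assumptions.

#include <linux/mm.h>
#include <linux/seqlock.h>

/*
 * Illustrative only: snapshot the VMA bounds without holding the mmap_sem.
 * Returns false if a concurrent writer may have modified the VMA, in which
 * case the caller should retry the fault under the mmap_sem.
 */
static bool spf_snapshot_vma_bounds(struct vm_area_struct *vma,
				    unsigned long addr,
				    unsigned long *start, unsigned long *end)
{
	unsigned int seq;

	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)		/* a writer is inside its critical section */
		return false;

	*start = READ_ONCE(vma->vm_start); /* pairs with WRITE_ONCE() on the writer side */
	*end   = READ_ONCE(vma->vm_end);

	if (addr < *start || addr >= *end)
		return false;

	/* retry if a writer ran while the fields were being sampled */
	return !read_seqcount_retry(&vma->vm_sequence, seq);
}

On the writer side, the series wraps the VMA modifications listed in the
changelog above in the corresponding vm_sequence write section and updates
the fields with WRITE_ONCE().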

I have got a crash on a 4.14 kernel with speculative page faults enabled,
and here is my analysis of the problem.
The issue has been reported only once.

[23409.303395]  el1_da+0x24/0x84
[23409.303400]  __radix_tree_lookup+0x8/0x90
[23409.303407]  find_get_entry+0x64/0x14c
[23409.303410]  pagecache_get_page+0x5c/0x27c
[23409.303416]  __read_swap_cache_async+0x80/0x260
[23409.303420]  swap_vma_readahead+0x264/0x37c
[23409.303423]  swapin_readahead+0x5c/0x6c
[23409.303428]  do_swap_page+0x128/0x6e4
[23409.303431]  handle_pte_fault+0x230/0xca4
[23409.303435]  __handle_speculative_fault+0x57c/0x7c8
[23409.303438]  do_page_fault+0x228/0x3e8
[23409.303442]  do_translation_fault+0x50/0x6c
[23409.303445]  do_mem_abort+0x5c/0xe0
[23409.303447]  el0_da+0x20/0x24

Process A accesses address ADDR (part of VMA A) and that results in a
translation fault.
The kernel enters __handle_speculative_fault to fix the fault.
Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
via the speculative path.
During this time, another process B, which shares the same mm, does an
mprotect from another CPU; this goes through mprotect_fixup->__split_vma
and splits VMA A into VMAs A and B.
After the split, ADDR falls into VMA B, but process A is still using
VMA A.
Now ADDR is greater than both VMA_A->vm_start and VMA_A->vm_end.
swap_vma_readahead->swap_ra_info uses the start and end of the vma to
calculate ptes and nr_pte, which goes wrong because of this, finally
resulting in a wrong "entry" being passed to
swap_vma_readahead->__read_swap_cache_async, which in turn passes an
invalid swapper_space to __read_swap_cache_async->find_get_page,
causing an abort.

The fix I have tried is to cache vm_start and vm_end also in vmf and
use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can
send the patch I am using if you feel that is the right thing to do.

Thanks,
Vinayak
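
To make the proposal above concrete, here is a rough sketch of what clamping
against cached bounds could look like. The vmf_start/vmf_end parameters
stand in for values snapshotted into struct vm_fault earlier in the
speculative path and are hypothetical; this is not the patch actually being
used.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Hypothetical variant of swap_ra_clamp_pfn(): instead of re-reading
 * vma->vm_start/vm_end (which may describe a VMA that has been split
 * behind our back), clamp the readahead window against bounds that were
 * snapshotted when the speculative fault started.
 */
static inline void swap_ra_clamp_pfn_cached(unsigned long vmf_start,
					    unsigned long vmf_end,
					    unsigned long faddr,
					    unsigned long lpfn,
					    unsigned long rpfn,
					    unsigned long *start,
					    unsigned long *end)
{
	*start = max3(lpfn, PFN_DOWN(vmf_start),
		      PFN_DOWN(faddr & PMD_MASK));
	*end = min3(rpfn, PFN_DOWN(vmf_end),
		    PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}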


Re: [PATCH v10 18/25] mm: provide speculative fault infrastructure

2018-05-15 Thread vinayak menon
On Tue, Apr 17, 2018 at 8:03 PM, Laurent Dufour
 wrote:
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +
> +#ifndef __HAVE_ARCH_PTE_SPECIAL
> +/* This is required by vm_normal_page() */
> +#error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
> +#endif
> +
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * holding the mmap_sem.
> + */
> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +  unsigned int flags)
> +{
> +   struct vm_fault vmf = {
> +   .address = address,
> +   };
> +   pgd_t *pgd, pgdval;
> +   p4d_t *p4d, p4dval;
> +   pud_t pudval;
> +   int seq, ret = VM_FAULT_RETRY;
> +   struct vm_area_struct *vma;
> +
> +   /* Clear flags that may lead to release the mmap_sem to retry */
> +   flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +   flags |= FAULT_FLAG_SPECULATIVE;
> +
> +   vma = get_vma(mm, address);
> +   if (!vma)
> +   return ret;
> +
> +   seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +   if (seq & 1)
> +   goto out_put;
> +
> +   /*
> +* Can't call vm_ops services as we don't know what they would do
> +* with the VMA.
> +* This includes huge pages from hugetlbfs.
> +*/
> +   if (vma->vm_ops)
> +   goto out_put;
> +
> +   /*
> +* __anon_vma_prepare() requires the mmap_sem to be held
> +* because vm_next and vm_prev must be safe. This can't be guaranteed
> +* in the speculative path.
> +*/
> +   if (unlikely(!vma->anon_vma))
> +   goto out_put;
> +
> +   vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +   vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +   /* Can't call userland page fault handler in the speculative path */
> +   if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +   goto out_put;
> +
> +   if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +   /*
> +* This could be detected by checking the address against the
> +* VMA's boundaries, but we want to trace it as not supported
> +* instead of changed.
> +*/
> +   goto out_put;
> +
> +   if (address < READ_ONCE(vma->vm_start)
> +   || READ_ONCE(vma->vm_end) <= address)
> +   goto out_put;
> +
> +   if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +  flags & FAULT_FLAG_INSTRUCTION,
> +  flags & FAULT_FLAG_REMOTE)) {
> +   ret = VM_FAULT_SIGSEGV;
> +   goto out_put;
> +   }
> +
> +   /* This one is required to check that the VMA has write access set */
> +   if (flags & FAULT_FLAG_WRITE) {
> +   if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
> +   ret = VM_FAULT_SIGSEGV;
> +   goto out_put;
> +   }
> +   } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
> +   ret = VM_FAULT_SIGSEGV;
> +   goto out_put;
> +   }
> +
> +   if (IS_ENABLED(CONFIG_NUMA)) {
> +   struct mempolicy *pol;
> +
> +   /*
> +* MPOL_INTERLEAVE implies additional checks in
> +* mpol_misplaced() which are not compatible with the
> +* speculative page fault processing.
> +*/
> +   pol = __get_vma_policy(vma, address);


This gives a compile-time error when CONFIG_NUMA is disabled, as there
is no definition of __get_vma_policy in that configuration.
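
One way to avoid that build error, sketched below under the assumption that
__get_vma_policy() and get_task_policy() are only declared when CONFIG_NUMA
is set: guard the whole block with #ifdef CONFIG_NUMA instead of
IS_ENABLED(), since the compiler still needs a declaration even for the
branch it will eliminate.

#ifdef CONFIG_NUMA
	{
		struct mempolicy *pol;

		/*
		 * MPOL_INTERLEAVE implies additional checks in
		 * mpol_misplaced() which are not compatible with the
		 * speculative page fault processing.
		 */
		pol = __get_vma_policy(vma, address);
		if (!pol)
			pol = get_task_policy(current);
		if (pol && pol->mode == MPOL_INTERLEAVE)
			goto out_put;
	}
#endif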


> +   if (!pol)
> +   pol = get_task_policy(current);
> +   if (pol && pol->mode == MPOL_INTERLEAVE)
> +   goto out_put;
> +   }
> +
> +   /*
> +* Do a speculative lookup of the PTE entry.
> +*/
> +   local_irq_disable();
> +   pgd = pgd_offset(mm, address);
> +   pgdval = READ_ONCE(*pgd);
> +   if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
> +   goto out_walk;
> +
> +   p4d = p4d_offset(pgd, address);
> +   p4dval = READ_ONCE(*p4d);
> +   if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
> +   goto out_walk;
> +
> +   vmf.pud = pud_offset(p4d, address);
> +   pudval = READ_ONCE(*vmf.pud);
> +   if (pud_none(pudval) || unlikely(pud_bad(pudval)))
> +   goto out_walk;
> +
> +   /* Huge pages at PUD level are not supported. */
> +   if (unlikely(pud_trans_huge(pudval)))
> +   goto out_walk;
> +
> +   vmf.pmd = pmd_offset(vmf.pud, address);
> +   vmf.orig_pmd = READ_ONCE(*vmf.pmd);
> +   /*
> +* pmd_none could mean that a hugepage collapse is in progress
> +* in our 
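
The checks quoted above (the seq & 1 test, the READ_ONCE of vm_start/vm_end,
and the later re-validation) only work if every VMA modifier enters a write
section of the same vm_sequence count, which is what the "protect VMA
modifications" patch adds. A minimal write-side sketch, using the raw
seqcount API rather than the series' own wrappers (the helper below is
hypothetical):

/*
 * Illustrative write-side counterpart, not code from the series: a path
 * that modifies a VMA (mprotect, madvise, mremap, ...) makes the per-VMA
 * sequence count odd for the duration of the update, so that a concurrent
 * speculative reader either sees an odd count or notices that the count
 * changed, and retries under the mmap_sem.
 */
static void spf_example_update_vma_bounds(struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end)
{
	write_seqcount_begin(&vma->vm_sequence);	/* count becomes odd */

	/* WRITE_ONCE() pairs with the READ_ONCE() in the speculative path */
	WRITE_ONCE(vma->vm_start, start);
	WRITE_ONCE(vma->vm_end, end);

	write_seqcount_end(&vma->vm_sequence);		/* count even again */
}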

Re: [PATCH v10 06/25] mm: make pte_unmap_same compatible with SPF

2018-05-10 Thread vinayak menon
On Tue, Apr 17, 2018 at 8:03 PM, Laurent Dufour
 wrote:
> pte_unmap_same() makes the assumption that the page tables are still
> around because the mmap_sem is held.
> This is no longer the case when running a speculative page fault, and
> additional checks must be made to ensure that the final page tables are
> still there.
>
> This is now done by calling pte_spinlock() to check the VMA's
> consistency while locking the page tables.
>
> This requires passing a vm_fault structure to pte_unmap_same(), which
> contains all the needed parameters.
>
> As pte_spinlock() may fail in the case of a speculative page fault, if
> the VMA has been touched behind our back, pte_unmap_same() now returns
> 3 cases:
> 1. ptes are the same (0)
> 2. ptes are different (VM_FAULT_PTNOTSAME)
> 3. a VMA change has been detected (VM_FAULT_RETRY)
>
> Case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry
> the page fault while holding the mmap_sem.
>
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 39 ---
>  2 files changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4d1aff80669c..714da99d77a3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1208,6 +1208,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
>  * and needs fsync() to complete (for
>  * synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */


Doesn't this also have to be added to VM_FAULT_RESULT_TRACE?
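
For context on that question: VM_FAULT_RESULT_TRACE is the flag-to-string
table used by the fault tracepoints, so a new VM_FAULT_* code that can
propagate out of the fault handlers would want an entry there. An
abbreviated, illustrative version of such an addition (the existing entries
shown are only a subset of the real macro):

#define VM_FAULT_RESULT_TRACE					\
	{ VM_FAULT_OOM,			"OOM" },		\
	{ VM_FAULT_SIGBUS,		"SIGBUS" },		\
	{ VM_FAULT_NOPAGE,		"NOPAGE" },		\
	{ VM_FAULT_RETRY,		"RETRY" },		\
	{ VM_FAULT_PTNOTSAME,		"PTNOTSAME" }	/* proposed addition */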