Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
On 11/5/2018 11:52 PM, Laurent Dufour wrote:
> Le 05/11/2018 à 08:04, vinayak menon a écrit :
>> Hi Laurent,
>>
>> On Thu, May 17, 2018 at 4:37 PM Laurent Dufour wrote:
>>>
>>> The VMA sequence count has been introduced to allow fast detection of
>>> VMA modification when running a page fault handler without holding
>>> the mmap_sem.
>>>
>>> This patch provides protection against the VMA modifications done in:
>>>  - madvise()
>>>  - mpol_rebind_policy()
>>>  - vma_replace_policy()
>>>  - change_prot_numa()
>>>  - mlock(), munlock()
>>>  - mprotect()
>>>  - mmap_region()
>>>  - collapse_huge_page()
>>>  - userfaultfd registering services
>>>
>>> In addition, VMA fields which will be read during the speculative fault
>>> path need to be written using WRITE_ONCE to prevent writes from being
>>> split and intermediate values from being pushed to other CPUs.
>>>
>>> Signed-off-by: Laurent Dufour
>>> ---
>>>  fs/proc/task_mmu.c |  5 -
>>>  fs/userfaultfd.c   | 17 +
>>>  mm/khugepaged.c    |  3 +++
>>>  mm/madvise.c       |  6 +-
>>>  mm/mempolicy.c     | 51 ++-
>>>  mm/mlock.c         | 13 -
>>>  mm/mmap.c          | 22 +-
>>>  mm/mprotect.c      |  4 +++-
>>>  mm/swap_state.c    |  8 ++--
>>>  9 files changed, 89 insertions(+), 40 deletions(-)
>>>
>>>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>>                                      struct vm_fault *vmf)
>>> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>>>                                       unsigned long *start,
>>>                                       unsigned long *end)
>>>  {
>>> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
>>> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>>>                       PFN_DOWN(faddr & PMD_MASK));
>>> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
>>> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>>>                     PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>>>  }
>>>
>>> --
>>> 2.7.4
>>>
>>
>> I have got a crash on a 4.14 kernel with speculative page faults enabled,
>> and here is my analysis of the problem. The issue was reported only once.
>
> Hi Vinayak,
>
> Thanks for reporting this.
>
>> [23409.303395]  el1_da+0x24/0x84
>> [23409.303400]  __radix_tree_lookup+0x8/0x90
>> [23409.303407]  find_get_entry+0x64/0x14c
>> [23409.303410]  pagecache_get_page+0x5c/0x27c
>> [23409.303416]  __read_swap_cache_async+0x80/0x260
>> [23409.303420]  swap_vma_readahead+0x264/0x37c
>> [23409.303423]  swapin_readahead+0x5c/0x6c
>> [23409.303428]  do_swap_page+0x128/0x6e4
>> [23409.303431]  handle_pte_fault+0x230/0xca4
>> [23409.303435]  __handle_speculative_fault+0x57c/0x7c8
>> [23409.303438]  do_page_fault+0x228/0x3e8
>> [23409.303442]  do_translation_fault+0x50/0x6c
>> [23409.303445]  do_mem_abort+0x5c/0xe0
>> [23409.303447]  el0_da+0x20/0x24
>>
>> Process A accesses address ADDR (part of VMA A) and that results in a
>> translation fault. Kernel enters __handle_speculative_fault to fix the
>> fault. Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
>> from the speculative path. During this time, another process B which
>> shares the same mm does an mprotect from another CPU, which follows
>> mprotect_fixup->__split_vma and splits VMA A into VMAs A and B. After the
>> split, ADDR falls into VMA B, but process A is still using VMA A. Now
>> ADDR is greater than both VMA_A->vm_start and VMA_A->vm_end.
>> swap_vma_readahead->swap_ra_info uses the start and end of the vma to
>> calculate ptes and nr_pte, which goes wrong because of this, finally
>> resulting in a wrong "entry" being passed to
>> swap_vma_readahead->__read_swap_cache_async, and in turn an invalid
>> swapper_space being passed to __read_swap_cache_async->find_get_page,
>> causing the abort.
>>
>> The fix I have tried is to cache vm_start and vm_end also in vmf and
>> use them in swap_ra_clamp_pfn. Let me know your thoughts on this. I
>> can send the patch I
Re: [PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count
Hi Laurent,

On Thu, May 17, 2018 at 4:37 PM Laurent Dufour wrote:
>
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
>
> This patch provides protection against the VMA modifications done in:
>  - madvise()
>  - mpol_rebind_policy()
>  - vma_replace_policy()
>  - change_prot_numa()
>  - mlock(), munlock()
>  - mprotect()
>  - mmap_region()
>  - collapse_huge_page()
>  - userfaultfd registering services
>
> In addition, VMA fields which will be read during the speculative fault
> path need to be written using WRITE_ONCE to prevent writes from being
> split and intermediate values from being pushed to other CPUs.
>
> Signed-off-by: Laurent Dufour
> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 +
>  mm/khugepaged.c    |  3 +++
>  mm/madvise.c       |  6 +-
>  mm/mempolicy.c     | 51 ++-
>  mm/mlock.c         | 13 -
>  mm/mmap.c          | 22 +-
>  mm/mprotect.c      |  4 +++-
>  mm/swap_state.c    |  8 ++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
>
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>                                      struct vm_fault *vmf)
> @@ -665,9 +669,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
>                                       unsigned long *start,
>                                       unsigned long *end)
>  {
> -       *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> +       *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
>                       PFN_DOWN(faddr & PMD_MASK));
> -       *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> +       *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>                     PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>
> --
> 2.7.4
>

I have got a crash on a 4.14 kernel with speculative page faults enabled,
and here is my analysis of the problem. The issue was reported only once.

[23409.303395]  el1_da+0x24/0x84
[23409.303400]  __radix_tree_lookup+0x8/0x90
[23409.303407]  find_get_entry+0x64/0x14c
[23409.303410]  pagecache_get_page+0x5c/0x27c
[23409.303416]  __read_swap_cache_async+0x80/0x260
[23409.303420]  swap_vma_readahead+0x264/0x37c
[23409.303423]  swapin_readahead+0x5c/0x6c
[23409.303428]  do_swap_page+0x128/0x6e4
[23409.303431]  handle_pte_fault+0x230/0xca4
[23409.303435]  __handle_speculative_fault+0x57c/0x7c8
[23409.303438]  do_page_fault+0x228/0x3e8
[23409.303442]  do_translation_fault+0x50/0x6c
[23409.303445]  do_mem_abort+0x5c/0xe0
[23409.303447]  el0_da+0x20/0x24

Process A accesses address ADDR (part of VMA A) and that results in a
translation fault. Kernel enters __handle_speculative_fault to fix the
fault. Process A enters do_swap_page->swapin_readahead->swap_vma_readahead
from the speculative path. During this time, another process B which shares
the same mm does an mprotect from another CPU, which follows
mprotect_fixup->__split_vma and splits VMA A into VMAs A and B. After the
split, ADDR falls into VMA B, but process A is still using VMA A. Now ADDR
is greater than both VMA_A->vm_start and VMA_A->vm_end.
swap_vma_readahead->swap_ra_info uses the start and end of the vma to
calculate ptes and nr_pte, which goes wrong because of this, finally
resulting in a wrong "entry" being passed to
swap_vma_readahead->__read_swap_cache_async, and in turn an invalid
swapper_space being passed to __read_swap_cache_async->find_get_page,
causing the abort.

The fix I have tried is to cache vm_start and vm_end also in vmf and use
them in swap_ra_clamp_pfn. Let me know your thoughts on this. I can send
the patch I am using if you feel that is the right thing to do.

Thanks,
Vinayak
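A minimal userspace sketch of the idea behind the proposed fix: snapshot vm_start/vm_end into the fault context once, at fault entry, so any later clamping works on a consistent pair even if a concurrent __split_vma() moves the live VMA's bounds. The structure and function names here are illustrative, not the actual kernel API.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel structures; layout is simplified. */
struct vm_area {
	unsigned long vm_start, vm_end;
};

struct fault_ctx {
	unsigned long address;
	unsigned long vma_start, vma_end;	/* snapshot taken at fault entry */
};

#define PAGE_SHIFT 12
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Snapshot the VMA bounds once, before a concurrent split can move them. */
static void fault_ctx_init(struct fault_ctx *fc, const struct vm_area *vma,
			   unsigned long address)
{
	fc->address = address;
	fc->vma_start = vma->vm_start;
	fc->vma_end = vma->vm_end;
}

/* Clamp a readahead pfn window to the snapshotted bounds, in the spirit of
 * swap_ra_clamp_pfn(); later changes to the live VMA are not observed. */
static void clamp_pfn(const struct fault_ctx *fc,
		      unsigned long lpfn, unsigned long rpfn,
		      unsigned long *start, unsigned long *end)
{
	unsigned long lo = PFN_DOWN(fc->vma_start);
	unsigned long hi = PFN_DOWN(fc->vma_end);

	*start = lpfn > lo ? lpfn : lo;
	*end = rpfn < hi ? rpfn : hi;
}
```

The point of the snapshot is that even if the live vma->vm_end shrinks underneath the faulting task, the window computed from the cached bounds stays self-consistent, so ptes/nr_pte cannot be derived from a mismatched start/end pair.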
Re: [PATCH v10 18/25] mm: provide speculative fault infrastructure
On Tue, Apr 17, 2018 at 8:03 PM, Laurent Dufour wrote:
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +
> +#ifndef __HAVE_ARCH_PTE_SPECIAL
> +/* This is required by vm_normal_page() */
> +#error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
> +#endif
> +
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * holding the mmap_sem.
> + */
> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +                              unsigned int flags)
> +{
> +       struct vm_fault vmf = {
> +               .address = address,
> +       };
> +       pgd_t *pgd, pgdval;
> +       p4d_t *p4d, p4dval;
> +       pud_t pudval;
> +       int seq, ret = VM_FAULT_RETRY;
> +       struct vm_area_struct *vma;
> +
> +       /* Clear flags that may lead to release the mmap_sem to retry */
> +       flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +       flags |= FAULT_FLAG_SPECULATIVE;
> +
> +       vma = get_vma(mm, address);
> +       if (!vma)
> +               return ret;
> +
> +       seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +       if (seq & 1)
> +               goto out_put;
> +
> +       /*
> +        * Can't call vm_ops services since we don't know what they would do
> +        * with the VMA.
> +        * This includes huge pages from hugetlbfs.
> +        */
> +       if (vma->vm_ops)
> +               goto out_put;
> +
> +       /*
> +        * __anon_vma_prepare() requires the mmap_sem to be held
> +        * because vm_next and vm_prev must be safe. This can't be guaranteed
> +        * in the speculative path.
> +        */
> +       if (unlikely(!vma->anon_vma))
> +               goto out_put;
> +
> +       vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +       vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +       /* Can't call userland page fault handler in the speculative path */
> +       if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +               goto out_put;
> +
> +       if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +               /*
> +                * This could be detected by the check address against VMA's
> +                * boundaries but we want to trace it as not supported instead
> +                * of changed.
> +                */
> +               goto out_put;
> +
> +       if (address < READ_ONCE(vma->vm_start)
> +           || READ_ONCE(vma->vm_end) <= address)
> +               goto out_put;
> +
> +       if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +                                      flags & FAULT_FLAG_INSTRUCTION,
> +                                      flags & FAULT_FLAG_REMOTE)) {
> +               ret = VM_FAULT_SIGSEGV;
> +               goto out_put;
> +       }
> +
> +       /* This one is required to check that the VMA has write access set */
> +       if (flags & FAULT_FLAG_WRITE) {
> +               if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
> +                       ret = VM_FAULT_SIGSEGV;
> +                       goto out_put;
> +               }
> +       } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
> +               ret = VM_FAULT_SIGSEGV;
> +               goto out_put;
> +       }
> +
> +       if (IS_ENABLED(CONFIG_NUMA)) {
> +               struct mempolicy *pol;
> +
> +               /*
> +                * MPOL_INTERLEAVE implies additional checks in
> +                * mpol_misplaced() which are not compatible with the
> +                * speculative page fault processing.
> +                */
> +               pol = __get_vma_policy(vma, address);

This gives a compile time error when CONFIG_NUMA is disabled, as there
is no definition for __get_vma_policy.

> +               if (!pol)
> +                       pol = get_task_policy(current);
> +               if (pol && pol->mode == MPOL_INTERLEAVE)
> +                       goto out_put;
> +       }
> +
> +       /*
> +        * Do a speculative lookup of the PTE entry.
> +        */
> +       local_irq_disable();
> +       pgd = pgd_offset(mm, address);
> +       pgdval = READ_ONCE(*pgd);
> +       if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
> +               goto out_walk;
> +
> +       p4d = p4d_offset(pgd, address);
> +       p4dval = READ_ONCE(*p4d);
> +       if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
> +               goto out_walk;
> +
> +       vmf.pud = pud_offset(p4d, address);
> +       pudval = READ_ONCE(*vmf.pud);
> +       if (pud_none(pudval) || unlikely(pud_bad(pudval)))
> +               goto out_walk;
> +
> +       /* Huge pages at PUD level are not supported. */
> +       if (unlikely(pud_trans_huge(pudval)))
> +               goto out_walk;
> +
> +       vmf.pmd = pmd_offset(vmf.pud, address);
> +       vmf.orig_pmd = READ_ONCE(*vmf.pmd);
> +       /*
> +        * pmd_none could mean that a hugepage collapse is in progress
> +        * in our back as collapse_huge_page()
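The `seq & 1` check at the top of the function is the reader half of the seqcount protocol the whole speculative path depends on. A stripped-down userspace model of that protocol, using C11 atomics rather than the kernel's seqcount primitives (names and memory-ordering choices here are illustrative, not the kernel implementation):

```c
#include <stdatomic.h>

/* A vm_sequence analogue: writers bump the counter to an odd value before
 * modifying the fields and back to even afterwards; a lock-free reader takes
 * a snapshot and then revalidates the counter, mirroring the "seq & 1" test
 * and the retry decision in the speculative fault path. */
struct vma_sketch {
	atomic_uint seq;
	unsigned long vm_start, vm_end;
};

static void vma_write_begin(struct vma_sketch *v)
{
	atomic_fetch_add_explicit(&v->seq, 1, memory_order_acq_rel); /* even -> odd */
}

static void vma_write_end(struct vma_sketch *v)
{
	atomic_fetch_add_explicit(&v->seq, 1, memory_order_acq_rel); /* odd -> even */
}

/* Returns 1 with a consistent snapshot in *start/*end, or 0 meaning
 * "give up and retry under mmap_sem" (the VM_FAULT_RETRY outcome). */
static int vma_read_bounds(struct vma_sketch *v,
			   unsigned long *start, unsigned long *end)
{
	unsigned int seq = atomic_load_explicit(&v->seq, memory_order_acquire);

	if (seq & 1)		/* a writer is in the middle of an update */
		return 0;
	*start = v->vm_start;
	*end = v->vm_end;
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&v->seq, memory_order_relaxed) == seq;
}
```

The design point: the reader never blocks the writer, it only detects that a modification happened (counter odd, or changed between the two loads) and falls back to the slow path, which is exactly why all the VMA fields read on this path need READ_ONCE/WRITE_ONCE pairing.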
Re: [PATCH v10 06/25] mm: make pte_unmap_same compatible with SPF
On Tue, Apr 17, 2018 at 8:03 PM, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page tables are still
> around because the mmap_sem is held.
> This is no longer the case when running a speculative page fault, and an
> additional check must be made to ensure that the final page table is still
> there.
>
> This is now done by calling pte_spinlock() to check the VMA's consistency
> while locking the page tables.
>
> This requires passing a vm_fault structure to pte_unmap_same(), which
> contains all the needed parameters.
>
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases:
> 1. ptes are the same (0)
> 2. ptes are different (VM_FAULT_PTNOTSAME)
> 3. a VMA change has been detected (VM_FAULT_RETRY)
>
> Case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME, which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
>
> Acked-by: David Rientjes
> Signed-off-by: Laurent Dufour
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c        | 39 ---
>  2 files changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4d1aff80669c..714da99d77a3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1208,6 +1208,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
>                                      * and needs fsync() to complete (for
>                                      * synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */

This has to be added to VM_FAULT_RESULT_TRACE?
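The three-way contract described in the commit message can be sketched as a small decision function. This is a userspace model of the control flow only; the names, the `vma_changed` field standing in for a failed pte_spinlock(), and the constant values are illustrative, not the kernel's VM_FAULT_* bits.

```c
/* Illustrative return codes mirroring the three cases described above. */
#define FAULT_OK	0	/* ptes are the same */
#define FAULT_PTNOTSAME	1	/* pte changed behind us */
#define FAULT_RETRY	2	/* VMA changed: retry with mmap_sem held */

struct fault_sketch {
	unsigned long orig_pte;	/* pte value snapshotted at fault time */
	int vma_changed;	/* stands in for a failed pte_spinlock() */
};

/* Sketch of the pte_unmap_same() contract: first validate the VMA (the
 * pte_spinlock() step), then compare the live pte with the snapshot. */
static int pte_unmap_same_sketch(const struct fault_sketch *f,
				 unsigned long current_pte)
{
	if (f->vma_changed)
		return FAULT_RETRY;
	if (current_pte != f->orig_pte)
		return FAULT_PTNOTSAME;
	return FAULT_OK;
}
```

In the kernel flow, the PTNOTSAME case lets cow_user_page() abort the copy cleanly, while the RETRY case propagates to the caller, which restarts the fault under the mmap_sem.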