Re: [PATCH 3.12 78/78] mm: let mm_find_pmd fix buggy race with THP fault

Hugh Dickins Fri, 09 Jan 2015 21:02:30 -0800

On Fri, 9 Jan 2015, Jiri Slaby wrote:

> From: Hugh Dickins <[email protected]>
> 
> 3.12-stable review patch.  If anyone has any objections, please let me know.
> 
> ===============
> 
> commit f72e7dcdd25229446b102e587ef2f826f76bff28 upstream.
> 
> Trinity has reported:
> 
>     BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>     IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
>     CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
>                             3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
>     lock_acquire (arch/x86/include/asm/current.h:14
>                   kernel/locking/lockdep.c:3602)
>     _raw_spin_lock (include/linux/spinlock_api_smp.h:143
>                     kernel/locking/spinlock.c:151)
>     remove_migration_pte (mm/migrate.c:137)
>     rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
>     remove_migration_ptes (mm/migrate.c:224)
>     migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
>     migrate_misplaced_page (mm/migrate.c:1733)
>     __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
>     handle_mm_fault (mm/memory.c:3948)
>     __get_user_pages (mm/memory.c:1851)
>     __mlock_vma_pages_range (mm/mlock.c:255)
>     __mm_populate (mm/mlock.c:711)
>     SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
> 
> I believe this comes about because, whereas collapsing and splitting THP
> functions take anon_vma lock in write mode (which excludes concurrent
> rmap walks), faulting THP functions (write protection and misplaced
> NUMA) do not - and mostly they do not need to.
> 
> But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
> an instant (indeed, for a long instant, given the inter-CPU TLB flush in
> there), leaves *pmd neither present not trans_huge.
> 
> Which can confuse a concurrent rmap walk, as when removing migration
> ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
> to insert, anon_vmas containing THPs are in no way segregated from
> 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
> that instant when a trans_huge pmd is temporarily absent.
> 
> I don't think we need strengthen the locking at the THP end: it's easily
> handled with an ACCESS_ONCE() before testing both conditions.
> 
> And since mm_find_pmd() had only one caller who wanted a THP rather than
> a pmd, let's slightly repurpose it to fail when it hits a THP or
> non-present pmd, and open code split_huge_page_address() again.
> 
> Signed-off-by: Hugh Dickins <[email protected]>
> Reported-by: Sasha Levin <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Bob Liu <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Dave Jones <[email protected]>
> Cc: David Rientjes <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
> Signed-off-by: Jiri Slaby <[email protected]>
> ---
>  mm/huge_memory.c | 18 ++++++++++++------
>  mm/ksm.c         |  1 -
>  mm/migrate.c     |  2 --
>  mm/rmap.c        | 12 ++++++++----
>  4 files changed, 20 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e497843f5f65..04d17ba00893 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2408,8 +2408,6 @@ static void collapse_huge_page(struct mm_struct *mm,
>       pmd = mm_find_pmd(mm, address);
>       if (!pmd)
>               goto out;
> -     if (pmd_trans_huge(*pmd))
> -             goto out;
>  
>       anon_vma_lock_write(vma->anon_vma);
>  
> @@ -2508,8 +2506,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>       pmd = mm_find_pmd(mm, address);
>       if (!pmd)
>               goto out;
> -     if (pmd_trans_huge(*pmd))
> -             goto out;
>  
>       memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
>       pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> @@ -2863,12 +2859,22 @@ void split_huge_page_pmd_mm(struct mm_struct *mm, 
> unsigned long address,
>  static void split_huge_page_address(struct mm_struct *mm,
>                                   unsigned long address)
>  {
> +     pgd_t *pgd;
> +     pud_t *pud;
>       pmd_t *pmd;
>  
>       VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
>  
> -     pmd = mm_find_pmd(mm, address);
> -     if (!pmd)
> +     pgd = pgd_offset(mm, address);
> +     if (!pgd_present(*pgd))
> +             return;
> +
> +     pud = pud_offset(pgd, address);
> +     if (!pud_present(*pud))
> +             return;
> +
> +     pmd = pmd_offset(pud, address);
> +     if (!pmd_present(*pmd))
>               return;
>       /*
>        * Caller holds the mmap_sem write mode, so a huge pmd cannot
> diff --git a/mm/ksm.c b/mm/ksm.c
> index c78fff1e9eae..29cbd06c4884 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -945,7 +945,6 @@ static int replace_page(struct vm_area_struct *vma, 
> struct page *page,
>       pmd = mm_find_pmd(mm, addr);
>       if (!pmd)
>               goto out;
> -     BUG_ON(pmd_trans_huge(*pmd));
>  
>       mmun_start = addr;
>       mmun_end   = addr + PAGE_SIZE;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d5c84b0a5243..fac5fa0813c4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -136,8 +136,6 @@ static int remove_migration_pte(struct page *new, struct 
> vm_area_struct *vma,
>               pmd = mm_find_pmd(mm, addr);
>               if (!pmd)
>                       goto out;
> -             if (pmd_trans_huge(*pmd))
> -                     goto out;
>  
>               ptep = pte_offset_map(pmd, addr);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5b8675ccc1ef..440c71c43b8d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -571,6 +571,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long 
> address)
>       pgd_t *pgd;
>       pud_t *pud;
>       pmd_t *pmd = NULL;
> +     pmd_t pmde;
>  
>       pgd = pgd_offset(mm, address);
>       if (!pgd_present(*pgd))
> @@ -581,7 +582,13 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long 
> address)
>               goto out;
>  
>       pmd = pmd_offset(pud, address);
> -     if (!pmd_present(*pmd))
> +     /*
> +      * Some THP functions use the sequence pmdp_clear_flush(), set_pmd_at()
> +      * without holding anon_vma lock for write.  So when looking for a
> +      * genuine pmde (in which to find pte), test present and !THP together.
> +      */
> +     pmde = ACCESS_ONCE(*pmd);
> +     if (!pmd_present(pmde) || pmd_trans_huge(pmde))
>               pmd = NULL;
>  out:
>       return pmd;
> @@ -617,9 +624,6 @@ pte_t *__page_check_address(struct page *page, struct 
> mm_struct *mm,
>       if (!pmd)
>               return NULL;
>  
> -     if (pmd_trans_huge(*pmd))
> -             return NULL;
> -
>       pte = pte_offset_map(pmd, address);
>       /* Make a quick check before getting the lock */
>       if (!sync && !pte_present(*pte)) {
> -- 
> 2.2.1


Fine for this to go in, but there is one catch, which I discovered when
backporting to v3.11: it needed one more hunk.  I haven't checked your
base tree, but if this applies then I believe you need it - most of the
time no problem, but it can case page migration to fail to find a
migration entry it inserted earlier, then BUG_ON(!PageLocked(p)) in
migration_entry_to_page() soon after.  Here's what I wrote back then:

Note on rebase to v3.11: added a hunk to replace the use of mm_find_pmd()
in page_check_address_pmd().  This call had been similarly replaced by
the time of my v3.16 commit, in Kirill Shutemov's v3.15 b5a8cad376ee
("thp: close race between split and zap huge pages"): which we do not
need as such, since it's fixing v3.13 117b0791ac42 ("mm, thp: move ptl
taking inside page_check_address_pmd()"), from a split page-table-lock
series we are not backporting.  But without this additional hunk, rmap
sometimes broke when the new semantic for mm_find_pmd() was used here.

(Adding Kirill to Cc: shouldn't he have been Cc'ed already?)

Hugh
    
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1584,12 +1584,20 @@ pmd_t *page_check_address_pmd(struct page *page,
                              unsigned long address,
                              enum page_check_address_pmd_flag flag)
 {
+       pgd_t *pgd;
+       pud_t *pud;
        pmd_t *pmd, *ret = NULL;
 
        if (address & ~HPAGE_PMD_MASK)
                goto out;
 
-       pmd = mm_find_pmd(mm, address);
+       pgd = pgd_offset(mm, address);
+       if (!pgd_present(*pgd))
+               goto out;
+       pud = pud_offset(pgd, address);
+       if (!pud_present(*pud))
+               goto out;
+       pmd = pmd_offset(pud, address);
        if (!pmd)
                goto out;
        if (pmd_none(*pmd))
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 78/78] mm: let mm_find_pmd fix buggy race with THP fault

Reply via email to