On Tue, Jan 08, 2013 at 08:17:59PM -0800, Hugh Dickins wrote:
> On Mon, 7 Jan 2013, Mel Gorman wrote:
> 
> > Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does not
> > check page_count before migrating like base page migration and khugepage. He
> > could not see why this was safe and he is right.
> > 
> > The potential impact of the bug is avoided due to the limitations of NUMA
> > balancing.  The page_mapcount() check ensures that only a single address
> > space is using this page and as THPs are typically private it should not be
> > possible for another address space to fault it in parallel. If the address
> > space has one associated task then it's difficult to have both a GUP pin
> > and be referencing the page at the same time. If there are multiple tasks
> > then a buggy scenario requires that another thread be accessing the page
> > while the direct IO is in flight. This is dodgy behaviour as there is
> > a possibility of corruption with or without THP migration. It would be
> > difficult to identify the corruption as being a migration bug.
> > 
> > While we happen to be safe for the most part it is shoddy to depend on
> > such "safety" so this patch checks the page count similar to anonymous
> > pages. Note that this does not mean that the page_mapcount() check can go
> > away. If we were to remove the page_mapcount() check the the THP would
> > have to be unmapped from all referencing PTEs, replaced with migration
> > PTEs and restored properly afterwards.
> > 
> > Reported-by: Hugh Dickins <hu...@google.com>
> > Signed-off-by: Mel Gorman <mgor...@suse.de>
> 
> Sorry, Mel, it's a NAK:

Don't be sorry. It's not your fault I am a muppet.

> you will have expected an ack from me two weeks
> or more ago; but somehow I had an intuition that if I sat on it for
> long enough, a worm would crawl out.  Got down to looking again today,
> and I notice that although the putback_lru_page() is right,
> NR_ISOLATED_ANON is not restored on this path, so that would leak.
> 
> I expect you'll want to do something like:
>       if (isolated) {
>               putback_lru_page(page);
>               isolated = 0;
>               goto out;
>       }
> and that may be the appropriate fix right now.
> 

Yes, I sent what should be a correction so we can treat this cleanup as
a potential replacemet for it.

> But I do still dislike the way you always put_page in
> numamigrate_isolate_page():

At one point there was a conditional put_page depending on different
failures and it was a very difficult to follow. This was easier to
follow but could still be improved.

> it makes sense in the case when
> isolate_lru_page() succeeds (I've long thought that weird both to
> insist on an existing page reference and add one of its own), but
> I find it very confusing on the failure paths, to have the put_page
> far away from the unlock_page - and I get worried when I see put_page
> followed by unlock_page rather than vice versa (it happens on !pmd_same
> paths: if the pmd is not the same, then can we be sure that the put_page
> does not free the page?)
> 

I'm depending on the page table reference to prevent the put_page()
freeing the page. As mmap_sem is held, there cannot be an munmap()
freeing it. As we hold the page lock there cannot be a parallel reclaim
as trylock_page() in shrink_page_list() will fail.

> At the bottom I've put my own cleanup for this, which simplifies by
> doing the putback_lru_page() inside numamigrate_isolate_page(), and
> doesn't put_page when it doesn't isolate.
> 
> I think the only functional difference from yours (aside from fixing
> up NR_ISOLATED) is that migrate_misplaced_transhuge_page() doesn't
> have to pretend to its caller that it succeeded when actually it
> failed at the last hurdle (because it already did the unlock_page,
> which in yours the caller expects to do on failure).  Oh, and I'm
> not holding page lock (sometimes) at clear_pmdnuma: I didn't see
> the reason for that, perhaps I'm missing something important there.
> 
> Maybe our tastes differ, and you won't see mine as an improvement.
> And I've hardly tested, so haven't signed off, and won't be
> surprised if its own worms crawl out.
> 
> <SNIP>
>
> Not-signed-off-by: Hugh Dickins <hu...@google.com>
> ---
> 
>  mm/huge_memory.c |   28 +++++----------
>  mm/migrate.c     |   79 ++++++++++++++++++++++-----------------------
>  2 files changed, 48 insertions(+), 59 deletions(-)
> 
> --- 3.8-rc2/mm/huge_memory.c  2012-12-22 09:43:27.616015582 -0800
> +++ linux/mm/huge_memory.c    2013-01-08 17:39:06.340407864 -0800
> @@ -1298,7 +1298,6 @@ int do_huge_pmd_numa_page(struct mm_stru
>       int target_nid;
>       int current_nid = -1;
>       bool migrated;
> -     bool page_locked = false;
>  
>       spin_lock(&mm->page_table_lock);
>       if (unlikely(!pmd_same(pmd, *pmdp)))
> @@ -1320,7 +1319,6 @@ int do_huge_pmd_numa_page(struct mm_stru
>       /* Acquire the page lock to serialise THP migrations */
>       spin_unlock(&mm->page_table_lock);
>       lock_page(page);
> -     page_locked = true;
>  
>       /* Confirm the PTE did not while locked */
>       spin_lock(&mm->page_table_lock);
> @@ -1333,34 +1331,26 @@ int do_huge_pmd_numa_page(struct mm_stru
>  
>       /* Migrate the THP to the requested node */
>       migrated = migrate_misplaced_transhuge_page(mm, vma,
> -                             pmdp, pmd, addr,
> -                             page, target_nid);
> -     if (migrated)
> -             current_nid = target_nid;
> -     else {
> -             spin_lock(&mm->page_table_lock);
> -             if (unlikely(!pmd_same(pmd, *pmdp))) {
> -                     unlock_page(page);
> -                     goto out_unlock;
> -             }
> -             goto clear_pmdnuma;
> -     }
> +                             pmdp, pmd, addr, page, target_nid);
> +     if (!migrated)
> +             goto check_same;
>  
> -     task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
> +     task_numa_fault(target_nid, HPAGE_PMD_NR, true);
>       return 0;
>  

Ok.

> +check_same:
> +     spin_lock(&mm->page_table_lock);
> +     if (unlikely(!pmd_same(pmd, *pmdp)))
> +             goto out_unlock;

Ok.

>  clear_pmdnuma:
>       pmd = pmd_mknonnuma(pmd);
>       set_pmd_at(mm, haddr, pmdp, pmd);
>       VM_BUG_ON(pmd_numa(*pmdp));
>       update_mmu_cache_pmd(vma, addr, pmdp);
> -     if (page_locked)
> -             unlock_page(page);
> -

This is ok too. The lock page is to prevent the page being reclaimed in
parallel during migration. In the two cases you end up in clear_pmdnuma
the page table lock is taken and you've done a pmd_same check and the page
lock is not necessary.

>  out_unlock:
>       spin_unlock(&mm->page_table_lock);
>       if (current_nid != -1)
> -             task_numa_fault(current_nid, HPAGE_PMD_NR, migrated);
> +             task_numa_fault(current_nid, HPAGE_PMD_NR, false);
>       return 0;
>  }
>  
> --- 3.8-rc2/mm/migrate.c      2012-12-22 09:43:27.636015582 -0800
> +++ linux/mm/migrate.c        2013-01-08 18:17:02.664144777 -0800
> @@ -1555,39 +1555,38 @@ bool numamigrate_update_ratelimit(pg_dat
>  
>  int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
>  {
> -     int ret = 0;
> +     int page_lru;
>  
>       /* Avoid migrating to a node that is nearly full */
> -     if (migrate_balanced_pgdat(pgdat, 1)) {
> -             int page_lru;
> +     if (!migrate_balanced_pgdat(pgdat, 1))
> +             return 0;
>  
> -             if (isolate_lru_page(page)) {
> -                     put_page(page);
> -                     return 0;
> -             }
> -
> -             /* Page is isolated */
> -             ret = 1;
> -             page_lru = page_is_file_cache(page);
> -             if (!PageTransHuge(page))
> -                     inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
> -             else
> -                     mod_zone_page_state(page_zone(page),
> -                                     NR_ISOLATED_ANON + page_lru,
> -                                     HPAGE_PMD_NR);
> +     if (isolate_lru_page(page))
> +             return 0;
> +
> +     /*
> +      * migrate_misplaced_transhuge_page() skips page migration's usual
> +      * check on page_count(), so we must do it here, now that the page
> +      * has been isolated: a GUP pin, or any other pin, prevents migration.
> +      * The expected page count is 3: 1 for page's mapcount and 1 for the
> +      * caller's pin and 1 for the reference taken by isolate_lru_page().
> +      */
> +     if (PageTransHuge(page) && page_count(page) != 3) {
> +             putback_lru_page(page);
> +             return 0;
>       }
>  

Ok so you putback the page before the isolated count is updated. Makes
sense.

> +     page_lru = page_is_file_cache(page);
> +     mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
> +                             hpage_nr_pages(page));
> +
>       /*
> -      * Page is either isolated or there is not enough space on the target
> -      * node. If isolated, then it has taken a reference count and the
> -      * callers reference can be safely dropped without the page
> -      * disappearing underneath us during migration. Otherwise the page is
> -      * not to be migrated but the callers reference should still be
> -      * dropped so it does not leak.
> +      * Isolating the page has taken another reference, so the
> +      * caller's reference can be safely dropped without the page
> +      * disappearing underneath us during migration.
>        */
>       put_page(page);
> -
> -     return ret;
> +     return 1;
>  }
>  
>  /*
> @@ -1598,7 +1597,7 @@ int numamigrate_isolate_page(pg_data_t *
>  int migrate_misplaced_page(struct page *page, int node)
>  {
>       pg_data_t *pgdat = NODE_DATA(node);
> -     int isolated = 0;
> +     int isolated;
>       int nr_remaining;
>       LIST_HEAD(migratepages);
>  
> @@ -1606,20 +1605,16 @@ int migrate_misplaced_page(struct page *
>        * Don't migrate pages that are mapped in multiple processes.
>        * TODO: Handle false sharing detection instead of this hammer
>        */
> -     if (page_mapcount(page) != 1) {
> -             put_page(page);
> +     if (page_mapcount(page) != 1)
>               goto out;
> -     }
>  
>       /*
>        * Rate-limit the amount of data that is being migrated to a node.
>        * Optimal placement is no good if the memory bus is saturated and
>        * all the time is being spent migrating!
>        */
> -     if (numamigrate_update_ratelimit(pgdat, 1)) {
> -             put_page(page);
> +     if (numamigrate_update_ratelimit(pgdat, 1))
>               goto out;
> -     }
>  
>       isolated = numamigrate_isolate_page(pgdat, page);
>       if (!isolated)
> @@ -1636,8 +1631,11 @@ int migrate_misplaced_page(struct page *
>       } else
>               count_vm_numa_event(NUMA_PAGE_MIGRATE);
>       BUG_ON(!list_empty(&migratepages));
> -out:
>       return isolated;
> +
> +out:
> +     put_page(page);
> +     return 0;
>  }

Ok.

>  #endif /* CONFIG_NUMA_BALANCING */
>  
> @@ -1672,17 +1670,15 @@ int migrate_misplaced_transhuge_page(str
>  
>       new_page = alloc_pages_node(node,
>               (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
> -     if (!new_page) {
> -             count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
> -             goto out_dropref;
> -     }
> +     if (!new_page)
> +             goto out_fail;
> +
>       page_xchg_last_nid(new_page, page_last_nid(page));
>  
>       isolated = numamigrate_isolate_page(pgdat, page);
>       if (!isolated) {
> -             count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
>               put_page(new_page);
> -             goto out_keep_locked;
> +             goto out_fail;
>       }
>  
>       /* Prepare a page as a migration target */
> @@ -1714,6 +1710,7 @@ int migrate_misplaced_transhuge_page(str
>               putback_lru_page(page);
>  
>               count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
> +             isolated = 0;
>               goto out;
>       }
>  
> @@ -1758,9 +1755,11 @@ out:
>                       -HPAGE_PMD_NR);
>       return isolated;
>  
> +out_fail:
> +     count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
>  out_dropref:
> +     unlock_page(page);
>       put_page(page);
> -out_keep_locked:
>       return 0;
>  }
>  #endif /* CONFIG_NUMA_BALANCING */

I haven't tested this but I cannot see a problem with it either and the
flow does look nicer. I'll test it this evening and look at it some
more.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to