On Wed, Apr 01, 2015 at 12:08:35PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shute...@linux.intel.com> writes:
> 
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> >
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> >
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> >
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> >
> 
> With this we now have pte mapping part of a compound page(). Now the
> gneric gup implementation does
> 
> gup_pte_range()
>       ptem = ptep = pte_offset_map(&pmd, addr);
>       do {
> 
> ....
> ...
>               if (!page_cache_get_speculative(page))
>                       goto pte_unmap;
> .....
>         }
> 
> That page_cache_get_speculative will fail in our case because it does
> if (unlikely(!get_page_unless_zero(page))) on a tail page. ??

IIUC, something as simple as patch below should work fine with migration
entries.

The reason I'm talking about migration enties is that with new refcounting
split_huge_page() breaks this generic fast GUP invariant:

 *  *) THP splits will broadcast an IPI, this can be achieved by overriding
 *      pmdp_splitting_flush.

We don't necessary trigger IPI during split. The page can be mapped only
with ptes by split time. That's fine for migration entries since we
re-check pte value after taking the pin. But it seems we don't have
anything in place for compound_lock case.

Hm. If I will not find any way to get it work with compound_lock, I would
need to implement new split_huge_page() on migration entries without
intermediate step with compound_lock.

Any comments?

diff --git a/mm/gup.c b/mm/gup.c
index d58af0785d24..b45edb8e6455 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1047,7 +1047,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
                 * for an example see gup_get_pte in arch/x86/mm/gup.c
                 */
                pte_t pte = READ_ONCE(*ptep);
-               struct page *page;
+               struct page *head, *page;
 
                /*
                 * Similar to the PMD case below, NUMA hinting must take slow
@@ -1059,15 +1059,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
 
                VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
                page = pte_page(pte);
+               head = compound_head(page);
 
-               if (!page_cache_get_speculative(page))
+               if (!page_cache_get_speculative(head))
                        goto pte_unmap;
 
                if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-                       put_page(page);
+                       put_page(head);
                        goto pte_unmap;
                }
 
+               VM_BUG_ON_PAGE(compound_head(page) != head, page);
                pages[*nr] = page;
                (*nr)++;
 
-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to