Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Kirill A. Shutemov
On Tue, Aug 18, 2015 at 06:41:13PM +0200, Michal Hocko wrote:
> On Tue 18-08-15 13:20:22, Michal Hocko wrote:
> > On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
> > > Hugh has pointed that compound_head() call can be unsafe in some
> > > context. There's one example:
> > > 
> > >   CPU0CPU1
> > > 
> > > isolate_migratepages_block()
> > >   page_count()
> > > compound_head()
> > >   !!PageTail() == true
> > >   put_page()
> > > tail->first_page = NULL
> > >   head = tail->first_page
> > >   alloc_pages(__GFP_COMP)
> > >  prep_compound_page()
> > >tail->first_page = head
> > >__SetPageTail(p);
> > >   !!PageTail() == true
> > > 
> > > 
> > > The race is pure theoretical. I don't it's possible to trigger it in
> > > practice. But who knows.
> > > 
> > > We can fix the race by changing how encode PageTail() and compound_head()
> > > within struct page to be able to update them in one shot.
> > > 
> > > The patch introduces page->compound_head into third double word block in
> > > front of compound_dtor and compound_order. That means it shares storage
> > > space with:
> > > 
> > >  - page->lru.next;
> > >  - page->next;
> > >  - page->rcu_head.next;
> > >  - page->pmd_huge_pte;
> > > 
> > > That's too long list to be absolutely sure, but looks like nobody uses
> > > bit 0 of the word. It can be used to encode PageTail(). And if the bit
> > > set, rest of the word is pointer to head page.
> > 
> > I didn't look too closely but the general idea makes sense to me and the
> > overal code simplification is sound. I will give it more detailed review
> > after I sort out other stuff.
> 
> AFICS page::first_page wasn't used outside of compound page logic so you
> should remove it in this patch. The rest looks good to me.

I missed it by accident during rework for v2. Will fix.

> Acked-by: Michal Hocko 

Thanks!

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Michal Hocko
On Tue 18-08-15 13:20:22, Michal Hocko wrote:
> On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
> > Hugh has pointed that compound_head() call can be unsafe in some
> > context. There's one example:
> > 
> > CPU0CPU1
> > 
> > isolate_migratepages_block()
> >   page_count()
> > compound_head()
> >   !!PageTail() == true
> > put_page()
> >   tail->first_page = NULL
> >   head = tail->first_page
> > alloc_pages(__GFP_COMP)
> >prep_compound_page()
> >  tail->first_page = head
> >  __SetPageTail(p);
> >   !!PageTail() == true
> > 
> > 
> > The race is pure theoretical. I don't it's possible to trigger it in
> > practice. But who knows.
> > 
> > We can fix the race by changing how encode PageTail() and compound_head()
> > within struct page to be able to update them in one shot.
> > 
> > The patch introduces page->compound_head into third double word block in
> > front of compound_dtor and compound_order. That means it shares storage
> > space with:
> > 
> >  - page->lru.next;
> >  - page->next;
> >  - page->rcu_head.next;
> >  - page->pmd_huge_pte;
> > 
> > That's too long list to be absolutely sure, but looks like nobody uses
> > bit 0 of the word. It can be used to encode PageTail(). And if the bit
> > set, rest of the word is pointer to head page.
> 
> I didn't look too closely but the general idea makes sense to me and the
> overal code simplification is sound. I will give it more detailed review
> after I sort out other stuff.

AFICS page::first_page wasn't used outside of compound page logic so you
should remove it in this patch. The rest looks good to me.

Acked-by: Michal Hocko 
 
> > Signed-off-by: Kirill A. Shutemov 
> > Cc: Hugh Dickins 
> > Cc: David Rientjes 
> > Cc: Vlastimil Babka 
> > Signed-off-by: Kirill A. Shutemov 
> > ---
> >  Documentation/vm/split_page_table_lock |  4 +-
> >  arch/xtensa/configs/iss_defconfig  |  1 -
> >  include/linux/mm.h | 53 ++
> >  include/linux/mm_types.h   |  8 +++-
> >  include/linux/page-flags.h | 80 
> > --
> >  mm/Kconfig | 12 -
> >  mm/debug.c |  5 ---
> >  mm/huge_memory.c   |  3 +-
> >  mm/hugetlb.c   |  8 +---
> >  mm/internal.h  |  4 +-
> >  mm/memory-failure.c|  7 ---
> >  mm/page_alloc.c| 28 +++-
> >  mm/swap.c  |  4 +-
> >  13 files changed, 53 insertions(+), 164 deletions(-)
> > 
> > diff --git a/Documentation/vm/split_page_table_lock 
> > b/Documentation/vm/split_page_table_lock
> > index 6dea4fd5c961..62842a857dab 100644
> > --- a/Documentation/vm/split_page_table_lock
> > +++ b/Documentation/vm/split_page_table_lock
> > @@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
> > pgtable_page_dtor(),
> >  which must be called on PTE table allocation / freeing.
> >  
> >  Make sure the architecture doesn't use slab allocator for page table
> > -allocation: slab uses page->slab_cache and page->first_page for its pages.
> > -These fields share storage with page->ptl.
> > +allocation: slab uses page->slab_cache for its pages.
> > +This field shares storage with page->ptl.
> >  
> >  PMD split lock only makes sense if you have more than two page table
> >  levels.
> > diff --git a/arch/xtensa/configs/iss_defconfig 
> > b/arch/xtensa/configs/iss_defconfig
> > index e4d193e7a300..5c7c385f21c4 100644
> > --- a/arch/xtensa/configs/iss_defconfig
> > +++ b/arch/xtensa/configs/iss_defconfig
> > @@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
> >  # CONFIG_SPARSEMEM_MANUAL is not set
> >  CONFIG_FLATMEM=y
> >  CONFIG_FLAT_NODE_MEM_MAP=y
> > -CONFIG_PAGEFLAGS_EXTENDED=y
> >  CONFIG_SPLIT_PTLOCK_CPUS=4
> >  # CONFIG_PHYS_ADDR_T_64BIT is not set
> >  CONFIG_ZONE_DMA_FLAG=1
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 9c21bbb8875a..5090a0b9bb43 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct 
> > page *page,
> >  #endif
> >  }
> >  
> > -static inline struct page *compound_head_by_tail(struct page *tail)
> > -{
> > -   struct page *head = tail->first_page;
> > -
> > -   /*
> > -* page->first_page may be a dangling pointer to an old
> > -* compound page, so recheck that it is still a tail
> > -* page before returning.
> > -*/
> > -   smp_rmb();
> > -   if (likely(PageTail(tail)))
> > -   return head;
> > -   return tail;
> > -}
> > -
> > -/*
> > - * Since either compound page could be dismantled asynchronously in THP
> > - * 

Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Michal Hocko
On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
> 
>   CPU0CPU1
> 
> isolate_migratepages_block()
>   page_count()
> compound_head()
>   !!PageTail() == true
>   put_page()
> tail->first_page = NULL
>   head = tail->first_page
>   alloc_pages(__GFP_COMP)
>  prep_compound_page()
>tail->first_page = head
>__SetPageTail(p);
>   !!PageTail() == true
> 
> 
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
> 
> We can fix the race by changing how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
> 
> The patch introduces page->compound_head into third double word block in
> front of compound_dtor and compound_order. That means it shares storage
> space with:
> 
>  - page->lru.next;
>  - page->next;
>  - page->rcu_head.next;
>  - page->pmd_huge_pte;
> 
> That's too long list to be absolutely sure, but looks like nobody uses
> bit 0 of the word. It can be used to encode PageTail(). And if the bit
> set, rest of the word is pointer to head page.

I didn't look too closely but the general idea makes sense to me and the
overal code simplification is sound. I will give it more detailed review
after I sort out other stuff.

> 
> Signed-off-by: Kirill A. Shutemov 
> Cc: Hugh Dickins 
> Cc: David Rientjes 
> Cc: Vlastimil Babka 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  Documentation/vm/split_page_table_lock |  4 +-
>  arch/xtensa/configs/iss_defconfig  |  1 -
>  include/linux/mm.h | 53 ++
>  include/linux/mm_types.h   |  8 +++-
>  include/linux/page-flags.h | 80 
> --
>  mm/Kconfig | 12 -
>  mm/debug.c |  5 ---
>  mm/huge_memory.c   |  3 +-
>  mm/hugetlb.c   |  8 +---
>  mm/internal.h  |  4 +-
>  mm/memory-failure.c|  7 ---
>  mm/page_alloc.c| 28 +++-
>  mm/swap.c  |  4 +-
>  13 files changed, 53 insertions(+), 164 deletions(-)
> 
> diff --git a/Documentation/vm/split_page_table_lock 
> b/Documentation/vm/split_page_table_lock
> index 6dea4fd5c961..62842a857dab 100644
> --- a/Documentation/vm/split_page_table_lock
> +++ b/Documentation/vm/split_page_table_lock
> @@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
> pgtable_page_dtor(),
>  which must be called on PTE table allocation / freeing.
>  
>  Make sure the architecture doesn't use slab allocator for page table
> -allocation: slab uses page->slab_cache and page->first_page for its pages.
> -These fields share storage with page->ptl.
> +allocation: slab uses page->slab_cache for its pages.
> +This field shares storage with page->ptl.
>  
>  PMD split lock only makes sense if you have more than two page table
>  levels.
> diff --git a/arch/xtensa/configs/iss_defconfig 
> b/arch/xtensa/configs/iss_defconfig
> index e4d193e7a300..5c7c385f21c4 100644
> --- a/arch/xtensa/configs/iss_defconfig
> +++ b/arch/xtensa/configs/iss_defconfig
> @@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
>  # CONFIG_SPARSEMEM_MANUAL is not set
>  CONFIG_FLATMEM=y
>  CONFIG_FLAT_NODE_MEM_MAP=y
> -CONFIG_PAGEFLAGS_EXTENDED=y
>  CONFIG_SPLIT_PTLOCK_CPUS=4
>  # CONFIG_PHYS_ADDR_T_64BIT is not set
>  CONFIG_ZONE_DMA_FLAG=1
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9c21bbb8875a..5090a0b9bb43 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct 
> page *page,
>  #endif
>  }
>  
> -static inline struct page *compound_head_by_tail(struct page *tail)
> -{
> - struct page *head = tail->first_page;
> -
> - /*
> -  * page->first_page may be a dangling pointer to an old
> -  * compound page, so recheck that it is still a tail
> -  * page before returning.
> -  */
> - smp_rmb();
> - if (likely(PageTail(tail)))
> - return head;
> - return tail;
> -}
> -
> -/*
> - * Since either compound page could be dismantled asynchronously in THP
> - * or we access asynchronously arbitrary positioned struct page, there
> - * would be tail flag race. To handle this race, we should call
> - * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
> - */
> -static inline struct page *compound_head(struct page *page)
> -{
> - if (unlikely(PageTail(page)))
> - return compound_head_by_tail(page);
> - return page;
> -}
> -
> -/*

Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Michal Hocko
On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
 Hugh has pointed that compound_head() call can be unsafe in some
 context. There's one example:
 
   CPU0CPU1
 
 isolate_migratepages_block()
   page_count()
 compound_head()
   !!PageTail() == true
   put_page()
 tail-first_page = NULL
   head = tail-first_page
   alloc_pages(__GFP_COMP)
  prep_compound_page()
tail-first_page = head
__SetPageTail(p);
   !!PageTail() == true
 head == NULL dereferencing
 
 The race is pure theoretical. I don't it's possible to trigger it in
 practice. But who knows.
 
 We can fix the race by changing how encode PageTail() and compound_head()
 within struct page to be able to update them in one shot.
 
 The patch introduces page-compound_head into third double word block in
 front of compound_dtor and compound_order. That means it shares storage
 space with:
 
  - page-lru.next;
  - page-next;
  - page-rcu_head.next;
  - page-pmd_huge_pte;
 
 That's too long list to be absolutely sure, but looks like nobody uses
 bit 0 of the word. It can be used to encode PageTail(). And if the bit
 set, rest of the word is pointer to head page.

I didn't look too closely but the general idea makes sense to me and the
overal code simplification is sound. I will give it more detailed review
after I sort out other stuff.

 
 Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
 Cc: Hugh Dickins hu...@google.com
 Cc: David Rientjes rient...@google.com
 Cc: Vlastimil Babka vba...@suse.cz
 Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
 ---
  Documentation/vm/split_page_table_lock |  4 +-
  arch/xtensa/configs/iss_defconfig  |  1 -
  include/linux/mm.h | 53 ++
  include/linux/mm_types.h   |  8 +++-
  include/linux/page-flags.h | 80 
 --
  mm/Kconfig | 12 -
  mm/debug.c |  5 ---
  mm/huge_memory.c   |  3 +-
  mm/hugetlb.c   |  8 +---
  mm/internal.h  |  4 +-
  mm/memory-failure.c|  7 ---
  mm/page_alloc.c| 28 +++-
  mm/swap.c  |  4 +-
  13 files changed, 53 insertions(+), 164 deletions(-)
 
 diff --git a/Documentation/vm/split_page_table_lock 
 b/Documentation/vm/split_page_table_lock
 index 6dea4fd5c961..62842a857dab 100644
 --- a/Documentation/vm/split_page_table_lock
 +++ b/Documentation/vm/split_page_table_lock
 @@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
 pgtable_page_dtor(),
  which must be called on PTE table allocation / freeing.
  
  Make sure the architecture doesn't use slab allocator for page table
 -allocation: slab uses page-slab_cache and page-first_page for its pages.
 -These fields share storage with page-ptl.
 +allocation: slab uses page-slab_cache for its pages.
 +This field shares storage with page-ptl.
  
  PMD split lock only makes sense if you have more than two page table
  levels.
 diff --git a/arch/xtensa/configs/iss_defconfig 
 b/arch/xtensa/configs/iss_defconfig
 index e4d193e7a300..5c7c385f21c4 100644
 --- a/arch/xtensa/configs/iss_defconfig
 +++ b/arch/xtensa/configs/iss_defconfig
 @@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
  # CONFIG_SPARSEMEM_MANUAL is not set
  CONFIG_FLATMEM=y
  CONFIG_FLAT_NODE_MEM_MAP=y
 -CONFIG_PAGEFLAGS_EXTENDED=y
  CONFIG_SPLIT_PTLOCK_CPUS=4
  # CONFIG_PHYS_ADDR_T_64BIT is not set
  CONFIG_ZONE_DMA_FLAG=1
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 9c21bbb8875a..5090a0b9bb43 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct 
 page *page,
  #endif
  }
  
 -static inline struct page *compound_head_by_tail(struct page *tail)
 -{
 - struct page *head = tail-first_page;
 -
 - /*
 -  * page-first_page may be a dangling pointer to an old
 -  * compound page, so recheck that it is still a tail
 -  * page before returning.
 -  */
 - smp_rmb();
 - if (likely(PageTail(tail)))
 - return head;
 - return tail;
 -}
 -
 -/*
 - * Since either compound page could be dismantled asynchronously in THP
 - * or we access asynchronously arbitrary positioned struct page, there
 - * would be tail flag race. To handle this race, we should call
 - * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
 - */
 -static inline struct page *compound_head(struct page *page)
 -{
 - if (unlikely(PageTail(page)))
 - return compound_head_by_tail(page);
 - return page;
 -}
 -
 -/*
 - * 

Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Kirill A. Shutemov
On Tue, Aug 18, 2015 at 06:41:13PM +0200, Michal Hocko wrote:
 On Tue 18-08-15 13:20:22, Michal Hocko wrote:
  On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
   Hugh has pointed that compound_head() call can be unsafe in some
   context. There's one example:
   
 CPU0CPU1
   
   isolate_migratepages_block()
 page_count()
   compound_head()
 !!PageTail() == true
 put_page()
   tail-first_page = NULL
 head = tail-first_page
 alloc_pages(__GFP_COMP)
prep_compound_page()
  tail-first_page = head
  __SetPageTail(p);
 !!PageTail() == true
   head == NULL dereferencing
   
   The race is pure theoretical. I don't it's possible to trigger it in
   practice. But who knows.
   
   We can fix the race by changing how encode PageTail() and compound_head()
   within struct page to be able to update them in one shot.
   
   The patch introduces page-compound_head into third double word block in
   front of compound_dtor and compound_order. That means it shares storage
   space with:
   
- page-lru.next;
- page-next;
- page-rcu_head.next;
- page-pmd_huge_pte;
   
   That's too long list to be absolutely sure, but looks like nobody uses
   bit 0 of the word. It can be used to encode PageTail(). And if the bit
   set, rest of the word is pointer to head page.
  
  I didn't look too closely but the general idea makes sense to me and the
  overal code simplification is sound. I will give it more detailed review
  after I sort out other stuff.
 
 AFICS page::first_page wasn't used outside of compound page logic so you
 should remove it in this patch. The rest looks good to me.

I missed it by accident during rework for v2. Will fix.

 Acked-by: Michal Hocko mho...@suse.com

Thanks!

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCHv2 4/4] mm: make compound_head() robust

2015-08-18 Thread Michal Hocko
On Tue 18-08-15 13:20:22, Michal Hocko wrote:
 On Mon 17-08-15 18:09:05, Kirill A. Shutemov wrote:
  Hugh has pointed that compound_head() call can be unsafe in some
  context. There's one example:
  
  CPU0CPU1
  
  isolate_migratepages_block()
page_count()
  compound_head()
!!PageTail() == true
  put_page()
tail-first_page = NULL
head = tail-first_page
  alloc_pages(__GFP_COMP)
 prep_compound_page()
   tail-first_page = head
   __SetPageTail(p);
!!PageTail() == true
  head == NULL dereferencing
  
  The race is pure theoretical. I don't it's possible to trigger it in
  practice. But who knows.
  
  We can fix the race by changing how encode PageTail() and compound_head()
  within struct page to be able to update them in one shot.
  
  The patch introduces page-compound_head into third double word block in
  front of compound_dtor and compound_order. That means it shares storage
  space with:
  
   - page-lru.next;
   - page-next;
   - page-rcu_head.next;
   - page-pmd_huge_pte;
  
  That's too long list to be absolutely sure, but looks like nobody uses
  bit 0 of the word. It can be used to encode PageTail(). And if the bit
  set, rest of the word is pointer to head page.
 
 I didn't look too closely but the general idea makes sense to me and the
 overal code simplification is sound. I will give it more detailed review
 after I sort out other stuff.

AFICS page::first_page wasn't used outside of compound page logic so you
should remove it in this patch. The rest looks good to me.

Acked-by: Michal Hocko mho...@suse.com
 
  Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
  Cc: Hugh Dickins hu...@google.com
  Cc: David Rientjes rient...@google.com
  Cc: Vlastimil Babka vba...@suse.cz
  Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
  ---
   Documentation/vm/split_page_table_lock |  4 +-
   arch/xtensa/configs/iss_defconfig  |  1 -
   include/linux/mm.h | 53 ++
   include/linux/mm_types.h   |  8 +++-
   include/linux/page-flags.h | 80 
  --
   mm/Kconfig | 12 -
   mm/debug.c |  5 ---
   mm/huge_memory.c   |  3 +-
   mm/hugetlb.c   |  8 +---
   mm/internal.h  |  4 +-
   mm/memory-failure.c|  7 ---
   mm/page_alloc.c| 28 +++-
   mm/swap.c  |  4 +-
   13 files changed, 53 insertions(+), 164 deletions(-)
  
  diff --git a/Documentation/vm/split_page_table_lock 
  b/Documentation/vm/split_page_table_lock
  index 6dea4fd5c961..62842a857dab 100644
  --- a/Documentation/vm/split_page_table_lock
  +++ b/Documentation/vm/split_page_table_lock
  @@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
  pgtable_page_dtor(),
   which must be called on PTE table allocation / freeing.
   
   Make sure the architecture doesn't use slab allocator for page table
  -allocation: slab uses page-slab_cache and page-first_page for its pages.
  -These fields share storage with page-ptl.
  +allocation: slab uses page-slab_cache for its pages.
  +This field shares storage with page-ptl.
   
   PMD split lock only makes sense if you have more than two page table
   levels.
  diff --git a/arch/xtensa/configs/iss_defconfig 
  b/arch/xtensa/configs/iss_defconfig
  index e4d193e7a300..5c7c385f21c4 100644
  --- a/arch/xtensa/configs/iss_defconfig
  +++ b/arch/xtensa/configs/iss_defconfig
  @@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
   # CONFIG_SPARSEMEM_MANUAL is not set
   CONFIG_FLATMEM=y
   CONFIG_FLAT_NODE_MEM_MAP=y
  -CONFIG_PAGEFLAGS_EXTENDED=y
   CONFIG_SPLIT_PTLOCK_CPUS=4
   # CONFIG_PHYS_ADDR_T_64BIT is not set
   CONFIG_ZONE_DMA_FLAG=1
  diff --git a/include/linux/mm.h b/include/linux/mm.h
  index 9c21bbb8875a..5090a0b9bb43 100644
  --- a/include/linux/mm.h
  +++ b/include/linux/mm.h
  @@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct 
  page *page,
   #endif
   }
   
  -static inline struct page *compound_head_by_tail(struct page *tail)
  -{
  -   struct page *head = tail-first_page;
  -
  -   /*
  -* page-first_page may be a dangling pointer to an old
  -* compound page, so recheck that it is still a tail
  -* page before returning.
  -*/
  -   smp_rmb();
  -   if (likely(PageTail(tail)))
  -   return head;
  -   return tail;
  -}
  -
  -/*
  - * Since either compound page could be dismantled asynchronously in THP
  - * or we access asynchronously arbitrary positioned struct page, there
  - * would be tail flag race. To 

[PATCHv2 4/4] mm: make compound_head() robust

2015-08-17 Thread Kirill A. Shutemov
Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

CPU0CPU1

isolate_migratepages_block()
  page_count()
compound_head()
  !!PageTail() == true
put_page()
  tail->first_page = NULL
  head = tail->first_page
alloc_pages(__GFP_COMP)
   prep_compound_page()
 tail->first_page = head
 __SetPageTail(p);
  !!PageTail() == true


The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. That means it shares storage
space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;
 - page->pmd_huge_pte;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word. It can be used to encode PageTail(). And if the bit
set, rest of the word is pointer to head page.

Signed-off-by: Kirill A. Shutemov 
Cc: Hugh Dickins 
Cc: David Rientjes 
Cc: Vlastimil Babka 
Signed-off-by: Kirill A. Shutemov 
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig  |  1 -
 include/linux/mm.h | 53 ++
 include/linux/mm_types.h   |  8 +++-
 include/linux/page-flags.h | 80 --
 mm/Kconfig | 12 -
 mm/debug.c |  5 ---
 mm/huge_memory.c   |  3 +-
 mm/hugetlb.c   |  8 +---
 mm/internal.h  |  4 +-
 mm/memory-failure.c|  7 ---
 mm/page_alloc.c| 28 +++-
 mm/swap.c  |  4 +-
 13 files changed, 53 insertions(+), 164 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock 
b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page->slab_cache and page->first_page for its pages.
-These fields share storage with page->ptl.
+allocation: slab uses page->slab_cache for its pages.
+This field shares storage with page->ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig 
b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c21bbb8875a..5090a0b9bb43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page 
*page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-   struct page *head = tail->first_page;
-
-   /*
-* page->first_page may be a dangling pointer to an old
-* compound page, so recheck that it is still a tail
-* page before returning.
-*/
-   smp_rmb();
-   if (likely(PageTail(tail)))
-   return head;
-   return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-   if (unlikely(PageTail(page)))
-   return compound_head_by_tail(page);
-   return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-   if (unlikely(PageTail(page)))
-   return page->first_page;
-   return page;
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that 

[PATCHv2 4/4] mm: make compound_head() robust

2015-08-17 Thread Kirill A. Shutemov
Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

CPU0CPU1

isolate_migratepages_block()
  page_count()
compound_head()
  !!PageTail() == true
put_page()
  tail-first_page = NULL
  head = tail-first_page
alloc_pages(__GFP_COMP)
   prep_compound_page()
 tail-first_page = head
 __SetPageTail(p);
  !!PageTail() == true
head == NULL dereferencing

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page-compound_head into third double word block in
front of compound_dtor and compound_order. That means it shares storage
space with:

 - page-lru.next;
 - page-next;
 - page-rcu_head.next;
 - page-pmd_huge_pte;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word. It can be used to encode PageTail(). And if the bit
set, rest of the word is pointer to head page.

Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
Cc: Hugh Dickins hu...@google.com
Cc: David Rientjes rient...@google.com
Cc: Vlastimil Babka vba...@suse.cz
Signed-off-by: Kirill A. Shutemov kirill.shute...@linux.intel.com
---
 Documentation/vm/split_page_table_lock |  4 +-
 arch/xtensa/configs/iss_defconfig  |  1 -
 include/linux/mm.h | 53 ++
 include/linux/mm_types.h   |  8 +++-
 include/linux/page-flags.h | 80 --
 mm/Kconfig | 12 -
 mm/debug.c |  5 ---
 mm/huge_memory.c   |  3 +-
 mm/hugetlb.c   |  8 +---
 mm/internal.h  |  4 +-
 mm/memory-failure.c|  7 ---
 mm/page_alloc.c| 28 +++-
 mm/swap.c  |  4 +-
 13 files changed, 53 insertions(+), 164 deletions(-)

diff --git a/Documentation/vm/split_page_table_lock 
b/Documentation/vm/split_page_table_lock
index 6dea4fd5c961..62842a857dab 100644
--- a/Documentation/vm/split_page_table_lock
+++ b/Documentation/vm/split_page_table_lock
@@ -54,8 +54,8 @@ everything required is done by pgtable_page_ctor() and 
pgtable_page_dtor(),
 which must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
-allocation: slab uses page-slab_cache and page-first_page for its pages.
-These fields share storage with page-ptl.
+allocation: slab uses page-slab_cache for its pages.
+This field shares storage with page-ptl.
 
 PMD split lock only makes sense if you have more than two page table
 levels.
diff --git a/arch/xtensa/configs/iss_defconfig 
b/arch/xtensa/configs/iss_defconfig
index e4d193e7a300..5c7c385f21c4 100644
--- a/arch/xtensa/configs/iss_defconfig
+++ b/arch/xtensa/configs/iss_defconfig
@@ -169,7 +169,6 @@ CONFIG_FLATMEM_MANUAL=y
 # CONFIG_SPARSEMEM_MANUAL is not set
 CONFIG_FLATMEM=y
 CONFIG_FLAT_NODE_MEM_MAP=y
-CONFIG_PAGEFLAGS_EXTENDED=y
 CONFIG_SPLIT_PTLOCK_CPUS=4
 # CONFIG_PHYS_ADDR_T_64BIT is not set
 CONFIG_ZONE_DMA_FLAG=1
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c21bbb8875a..5090a0b9bb43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,46 +437,6 @@ static inline void compound_unlock_irqrestore(struct page 
*page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-   struct page *head = tail-first_page;
-
-   /*
-* page-first_page may be a dangling pointer to an old
-* compound page, so recheck that it is still a tail
-* page before returning.
-*/
-   smp_rmb();
-   if (likely(PageTail(tail)))
-   return head;
-   return tail;
-}
-
-/*
- * Since either compound page could be dismantled asynchronously in THP
- * or we access asynchronously arbitrary positioned struct page, there
- * would be tail flag race. To handle this race, we should call
- * smp_rmb() before checking tail flag. compound_head_by_tail() did it.
- */
-static inline struct page *compound_head(struct page *page)
-{
-   if (unlikely(PageTail(page)))
-   return compound_head_by_tail(page);
-   return page;
-}
-
-/*
- * If we access compound page synchronously such as access to
- * allocated page, there is no need to handle tail flag race, so we can
- * check tail flag directly without any synchronization primitive.
- */
-static inline struct page *compound_head_fast(struct page *page)
-{
-   if (unlikely(PageTail(page)))
-