Hi Hugh,

Thank you very much for you close looking and valuable comments.
And I can't help feeling shame on many mistakes/misunderstandings
and lack of thoughts throughout the patchset.
I promise that all these will be fixed in the next version.

On Wed, Sep 03, 2014 at 02:17:41PM -0700, Hugh Dickins wrote:
> On Thu, 28 Aug 2014, Naoya Horiguchi wrote:
> 
> > We have a race condition between move_pages() and freeing hugepages,
> > where move_pages() calls follow_page(FOLL_GET) for hugepages internally
> > and tries to get its refcount without preventing concurrent freeing.
> > This race crashes the kernel, so this patch fixes it by moving FOLL_GET
> > code for hugepages into follow_huge_pmd() with taking the page table lock.
> 
> You really ought to mention how you are intentionally dropping the
> unnecessary check for NULL pte_page() in this patch: we agree on that,
> but it does need to be mentioned somewhere in the comment.

OK, I'll add it.

> > 
> > This patch also adds the similar locking to follow_huge_(addr|pud)
> > for consistency.
> > 
> > Here is the reproducer:
> > 
> >   $ cat movepages.c
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <numaif.h>
> > 
> >   #define ADDR_INPUT      0x700000000000UL
> >   #define HPS             0x200000
> >   #define PS              0x1000
> > 
> >   int main(int argc, char *argv[]) {
> >           int i;
> >           int nr_hp = strtol(argv[1], NULL, 0);
> >           int nr_p  = nr_hp * HPS / PS;
> >           int ret;
> >           void **addrs;
> >           int *status;
> >           int *nodes;
> >           pid_t pid;
> > 
> >           pid = strtol(argv[2], NULL, 0);
> >           addrs  = malloc(sizeof(char *) * nr_p + 1);
> >           status = malloc(sizeof(char *) * nr_p + 1);
> >           nodes  = malloc(sizeof(char *) * nr_p + 1);
> > 
> >           while (1) {
> >                   for (i = 0; i < nr_p; i++) {
> >                           addrs[i] = (void *)ADDR_INPUT + i * PS;
> >                           nodes[i] = 1;
> >                           status[i] = 0;
> >                   }
> >                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
> >                                         MPOL_MF_MOVE_ALL);
> >                   if (ret == -1)
> >                           err("move_pages");
> > 
> >                   for (i = 0; i < nr_p; i++) {
> >                           addrs[i] = (void *)ADDR_INPUT + i * PS;
> >                           nodes[i] = 0;
> >                           status[i] = 0;
> >                   }
> >                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
> >                                         MPOL_MF_MOVE_ALL);
> >                   if (ret == -1)
> >                           err("move_pages");
> >           }
> >           return 0;
> >   }
> > 
> >   $ cat hugepage.c
> >   #include <stdio.h>
> >   #include <sys/mman.h>
> >   #include <string.h>
> > 
> >   #define ADDR_INPUT      0x700000000000UL
> >   #define HPS             0x200000
> > 
> >   int main(int argc, char *argv[]) {
> >           int nr_hp = strtol(argv[1], NULL, 0);
> >           char *p;
> > 
> >           while (1) {
> >                   p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | 
> > PROT_WRITE,
> >                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 
> > 0);
> >                   if (p != (void *)ADDR_INPUT) {
> >                           perror("mmap");
> >                           break;
> >                   }
> >                   memset(p, 0, nr_hp * HPS);
> >                   munmap(p, nr_hp * HPS);
> >           }
> >   }
> > 
> >   $ sysctl vm.nr_hugepages=40
> >   $ ./hugepage 10 &
> >   $ ./movepages 10 $(pgrep -f hugepage)
> > 
> > Note for stable inclusion:
> >   This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
> >   to move_pages()"), so is applicable to -stable kernels which includes it.
> 
> Just say
> Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to 
> move_pages()")

I just found that Documentation/SubmittingPatches started to state about
Fixes: tag. I'll use it from now.

> > 
> > ChangeLog v3:
> > - remove unnecessary if (page) check
> > - check (pmd|pud)_huge again after holding ptl
> > - do the same change also on follow_huge_pud()
> > - take page table lock also in follow_huge_addr()
> > 
> > ChangeLog v2:
> > - introduce follow_huge_pmd_lock() to do locking in arch-independent code.
> 
> ChangeLog vN info belongs below the ---

OK.
I didn't know this but it's written in SubmittingPatches, so I'll keep it
in mind.

> > 
> > Reported-by: Hugh Dickins <hu...@google.com>
> > Signed-off-by: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
> > Cc: <sta...@vger.kernel.org>  # [3.12+]
> 
> No ack to this one yet, I'm afraid.

OK, I defer Reported-by until all the problems in this patch are solved.
I added this Reported-by because Andrew asked how In found this problem,
and advised me to show the reporter.
And I didn't intend by this Reported-by that you acked the patch.
In this case, should I have used some unofficial tag like
"Not-yet-Reported-by:" to avoid being rude?

> > ---
> >  arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
> >  arch/metag/mm/hugetlbpage.c   |  4 ++--
> >  arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
> >  include/linux/hugetlb.h       | 12 ++++++------
> >  mm/gup.c                      | 25 ++++---------------------
> >  mm/hugetlb.c                  | 43 
> > +++++++++++++++++++++++++++++++------------
> >  6 files changed, 61 insertions(+), 54 deletions(-)
> > 
> > diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c 
> > mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > index 52b7604b5215..6170381bf074 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> > @@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
> >  
> >  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, 
> > int write)
> >  {
> > -   struct page *page;
> > +   struct page *page = NULL;
> >     pte_t *ptep;
> > +   spinlock_t *ptl;
> >  
> >     if (REGION_NUMBER(addr) != RGN_HPAGE)
> >             return ERR_PTR(-EINVAL);
> >  
> >     ptep = huge_pte_offset(mm, addr);
> > +   ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> 
> It was a mistake to lump this follow_huge_addr() change in with the
> rest: please defer it to your 6/6 (or send 5 and leave 6th to later).
> 
> Unless I'm missing something, all you succeed in doing here is break
> the build on ia64 and powerpc, by introducing undeclared "vma" variable.
> 
> There is no point whatever in taking and dropping this lock: the
> point was to do the get_page while holding the relevant page table lock,
> but you're not doing any get_page, and you still have an "int write"
> argument instead of "int flags" to pass down the FOLL_GET flag,
> and you still have the BUG_ON(flags & FOLL_GET) in follow_page_mask().
> 
> So, please throw these follow_huge_addr() parts out this patch.

Sorry, I'll drop them all.

> >     if (!ptep || pte_none(*ptep))
> > -           return NULL;
> > +           goto out;
> > +
> >     page = pte_page(*ptep);
> >     page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
> > +out:
> > +   spin_unlock(ptl);
> >     return page;
> >  }
> >  int pmd_huge(pmd_t pmd)
> > diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c 
> > mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> > index 745081427659..5e96ef096df9 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> > @@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
> >     return 0;
> >  }
> >  
> > -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -                        pmd_t *pmd, int write)
> > +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long 
> > address,
> > +                        pmd_t *pmd, int flags)
> 
> Change from "write" to "flags" is good, but I question below whether
> we actually need to change from mm to vma in follow_huge_pmd() and
> follow_huge_pud().

Without changing mm with vma, we need call find_vma() to get the relevant
vma to get ptl, which looks expensive than getting mm from vma.
The caller already has vma, so I thought that passing vma is better.

... but as you wrote below, there's a better way to get ptl.
With your suggestion, there's no need to change mm.

> >  {
> >     return NULL;
> >  }
> > diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c 
> > mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > index 9517a93a315c..1d8854a56309 100644
> > --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> > +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> > @@ -677,38 +677,38 @@ struct page *
> >  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> >  {
> >     pte_t *ptep;
> > -   struct page *page;
> > +   struct page *page = ERR_PTR(-EINVAL);
> >     unsigned shift;
> >     unsigned long mask;
> > +   spinlock_t *ptl;
> >     /*
> >      * Transparent hugepages are handled by generic code. We can skip them
> >      * here.
> >      */
> >     ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> > -
> > +   ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
> 
> As above, you're breaking the build with a lock that serves no purpose
> in the current patch.

I just drop it, sorry for the silly code.

> >     /* Verify it is a huge page else bail. */
> >     if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
> > -           return ERR_PTR(-EINVAL);
> > +           goto out;
> >  
> >     mask = (1UL << shift) - 1;
> > -   page = pte_page(*ptep);
> > -   if (page)
> > -           page += (address & mask) / PAGE_SIZE;
> > -
> > +   page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
> > +out:
> > +   spin_unlock(ptl);
> >     return page;
> >  }
> >  
> >  struct page *
> > -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -           pmd_t *pmd, int write)
> > +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +           pmd_t *pmd, int flags)
> >  {
> >     BUG();
> >     return NULL;
> >  }
> >  
> >  struct page *
> > -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -           pmd_t *pmd, int write)
> > +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > +           pud_t *pud, int flags)
> >  {
> >     BUG();
> >     return NULL;
> > diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h 
> > mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > index 6e6d338641fe..b3200fce07aa 100644
> > --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> > +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> > @@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
> > long addr);
> >  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t 
> > *ptep);
> >  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >                           int write);
> > -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -                           pmd_t *pmd, int write);
> > -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -                           pud_t *pud, int write);
> > +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long 
> > address,
> > +                           pmd_t *pmd, int flags);
> > +struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long 
> > address,
> > +                           pud_t *pud, int flags);
> >  int pmd_huge(pmd_t pmd);
> >  int pud_huge(pud_t pmd);
> >  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> > @@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct 
> > seq_file *m)
> >  static inline void hugetlb_show_meminfo(void)
> >  {
> >  }
> > -#define follow_huge_pmd(mm, addr, pmd, write)      NULL
> > -#define follow_huge_pud(mm, addr, pud, write)      NULL
> > +#define follow_huge_pmd(vma, addr, pmd, flags)     NULL
> > +#define follow_huge_pud(vma, addr, pud, flags)     NULL
> >  #define prepare_hugepage_range(file, addr, len)    (-EINVAL)
> >  #define pmd_huge(x)        0
> >  #define pud_huge(x)        0
> > diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c 
> > mmotm-2014-08-25-16-52/mm/gup.c
> > index 91d044b1600d..597a5e92e265 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> > +++ mmotm-2014-08-25-16-52/mm/gup.c
> > @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct 
> > *vma,
> >     pud = pud_offset(pgd, address);
> >     if (pud_none(*pud))
> >             return no_page_table(vma, flags);
> > -   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> > -           if (flags & FOLL_GET)
> > -                   return NULL;
> > -           page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> > -           return page;
> > -   }
> > +   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
> > +           return follow_huge_pud(vma, address, pud, flags);
> 
> Yes, this part is good, except I think mm rather than vma.

I'll fix it.

> >     if (unlikely(pud_bad(*pud)))
> >             return no_page_table(vma, flags);
> >  
> >     pmd = pmd_offset(pud, address);
> >     if (pmd_none(*pmd))
> >             return no_page_table(vma, flags);
> > -   if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -           page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > -           if (flags & FOLL_GET) {
> > -                   /*
> > -                    * Refcount on tail pages are not well-defined and
> > -                    * shouldn't be taken. The caller should handle a NULL
> > -                    * return when trying to follow tail pages.
> > -                    */
> > -                   if (PageHead(page))
> > -                           get_page(page);
> > -                   else
> > -                           page = NULL;
> > -           }
> > -           return page;
> > -   }
> > +   if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > +           return follow_huge_pmd(vma, address, pmd, flags);
> 
> And this part is good, except I think mm rather than vma.

I'll fix it, too.

> >     if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> >             return no_page_table(vma, flags);
> >     if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c 
> > mmotm-2014-08-25-16-52/mm/hugetlb.c
> > index 022767506c7b..c5345c5edb50 100644
> > --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> > +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> > @@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned 
> > long address,
> >  }
> >  
> >  struct page * __weak
> > -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -           pmd_t *pmd, int write)
> > +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> > +           pmd_t *pmd, int flags)
> >  {
> > -   struct page *page;
> > +   struct page *page = NULL;
> > +   spinlock_t *ptl;
> >  
> > -   page = pte_page(*(pte_t *)pmd);
> > -   if (page)
> > -           page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +   ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
> 
> So, this is why you have had to change from "mm" to "vma" throughout.
> And we might end up deciding that that is the right thing to do.
> 
> But here we are deep in page table code, dealing with a huge pmd entry:
> I protest that it's very lame to be asking vma->vm_file to tell us what
> lock the page table code needs at this level.  Isn't it pmd_lockptr()?

Right, inside huge_pte_lock() we call pmd_lockptr() when huge_page_size(h)
== PMD_SIZE to get the ptl. And this code can assume that it's true, so
calling pmd_lockptr() directly is better/faster.

> Now, I'm easily confused, and there may be reasons why it's more subtle
> than that, and you really are forced to use huge_pte_lockptr(); but I'd
> much rather not if we can avoid doing so, just as a matter of principle.

Using huge_pte_lockptr() is useful when we can't assume the hugepage's
info like hugepage size or whether it's pmd/pud-based or not.

> One subtlety to take care over: it's a long time since I've had to
> worry about pmd folding and pud folding (what happens when you only
> have 2 or 3 levels of page table instead of the full 4): macros get
> defined to each other, and levels get optimized out (perhaps
> differently on different architectures).
> 
> So although at first sight the lock to take in follow_huge_pud()
> would seem to be mm->page_table_lock, I am not at this point certain
> that that's necessarily so - sometimes pud_huge might be pmd_huge,
> and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
> to be the pud level.  Maybe: needs checking through the architectures
> and their configs, not obvious to me.

I think that every architecture uses mm->page_table_lock for pud-level
locking at least for now, but that could be changed in the future,
for example when 1GB hugepages or pud-based hugepages become common and
someone are interested in splitting lock for pud level.
So it would be helpful to introduce pud_lockptr() which just returns
mm->page_table_lock now, so that developers never forget to update it
when considering splitting pud lock.

> 
> I realize that I am asking for you (or I) to do more work, when using
> huge_pte_lock(hstate_vma(vma),,) would work it out "automatically";
> but I do feel quite strongly that that's the right approach here
> (and I'm not just trying to avoid a few edits of "mm" to "vma").

Yes, I agree.

> Cc'ing Kirill, who may have a strong view to the contrary,
> or a good insight on where the problems if any might be.
> 
> Also Cc'ing Kirill because I'm not convinced that huge_pte_lockptr()
> necessarily does the right thing on follow_huge_addr() architectures,
> ia64 and powerpc.  Do they, for example, allocate the memory for their
> hugetlb entries in such a way that we can indeed use pmd_lockptr() to
> point to a useable spinlock, in the case when huge_page_size(h) just
> happens to equal PMD_SIZE?
> 
> I don't know if this was thought through thoroughly
> (now that's a satisfying phrase hugh thinks hugh never wrote before!)
> when huge_pte_lockptr() was invented or not.  I think it would be safer
> if huge_pte_lockptr() just gave mm->page_table_lock on follow_huge_addr()
> architectures.

Yes, this seems a real problem and is worth discussing with maintainers
of these architectures. Maybe we can do this as a separate work.

> 
> > +
> > +   if (!pmd_huge(*pmd))
> > +           goto out;
> > +
> > +   page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +
> > +   if (flags & FOLL_GET)
> > +           if (!get_page_unless_zero(page))
> > +                   page = NULL;
> 
> get_page() should be quite good enough, shouldn't it?  We are holding
> the necessary lock, and have tested pmd_huge(*pmd), so it would be a
> bug if page_count(page) were zero here.

Yes, get_page() is enough, I'll fix it.

> > +out:
> > +   spin_unlock(ptl);
> >     return page;
> >  }
> >  
> >  struct page * __weak
> > -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> > -           pud_t *pud, int write)
> > +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> > +           pud_t *pud, int flags)
> >  {
> > -   struct page *page;
> > +   struct page *page = NULL;
> > +   spinlock_t *ptl;
> >  
> > -   page = pte_page(*(pte_t *)pud);
> > -   if (page)
> > -           page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> > +   if (flags & FOLL_GET)
> > +           return NULL;
> > +
> > +   ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);
> 
> Well, you do have vma declared here, but otherwise it's like what you
> had in follow_huge_addr(): there is no point in taking and dropping
> the lock if you're not getting the page while the lock is held.
> 
> So, which way to go on follow_huge_pud()?  I certainly think that we
> should implement FOLL_GET on it, as we should for follow_huge_addr(),
> simply for completeness, and so we don't need to come back here.

Right, this will become important when thinking of 1GB hugepage migration,

> But whether we should do so in a patch which is Cc'ed to stable is not
> so clear.  And leaving follow_huge_pmd() and follow_huge_addr() out
> of this patch may avoid those awkward where-is-the-lock questions
> for now.  Convert follow_huge_pmd() in a separate patch?

... but 1GB hugepage migration is not available now, so no reason to
send follow_huge_pud to stable. I agree to separate that part.

Thanks,
Naoya Horiguchi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to