Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2

2012-07-27 Thread Mel Gorman
On Thu, Jul 26, 2012 at 05:00:28PM -0400, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> >This V2 is still the mmap_sem approach that fixes a potential deadlock
> >problem pointed out by Michal.
> 
> Larry and I were looking around the hugetlb code some
> more, and found what looks like yet another race.
> 
> In hugetlb_no_page, we have the following code:
> 
> 
> spin_lock(&mm->page_table_lock);
> size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
>         goto backout;
> 
> ret = 0;
> if (!huge_pte_none(huge_ptep_get(ptep)))
>         goto backout;
> 
> if (anon_rmap)
>         hugepage_add_new_anon_rmap(page, vma, address);
> else
>         page_dup_rmap(page);
> new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
>                                 && (vma->vm_flags & VM_SHARED)));
> set_huge_pte_at(mm, address, ptep, new_pte);
> ...
> spin_unlock(&mm->page_table_lock);
> 
> Notice how we check !huge_pte_none with our own
> mm->page_table_lock held.
> 
> This offers no protection at all against other
> processes, which also hold their own page_table_lock.
> 

Yes, the page_table_lock is close to useless once shared page tables are
involved. It's why, if we ever wanted to make shared page tables a core MM
feature, we'd have to revisit how PTE locking works at any level that can
share page tables.
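
To illustrate with a minimal userspace sketch (not kernel code; the
names are invented for illustration), each task locking only its own mm
does nothing to serialise updates to a page-table page that two mms
share:

        /* Illustrative sketch only: task A locks A's "mm", task B locks
         * B's, so both can update the shared entry at the same time. */
        #include <pthread.h>
        #include <stdio.h>

        struct mm { pthread_mutex_t page_table_lock; };
        static unsigned long shared_pmd_entry; /* stands in for a shared PMD page */

        static void *fault_in(void *arg)
        {
                struct mm *mm = arg;

                pthread_mutex_lock(&mm->page_table_lock);
                /* Neither task excludes the other here: this update can
                 * race with the update made under the other mm's lock. */
                shared_pmd_entry++;
                pthread_mutex_unlock(&mm->page_table_lock);
                return NULL;
        }

        int main(void)
        {
                struct mm a = { PTHREAD_MUTEX_INITIALIZER };
                struct mm b = { PTHREAD_MUTEX_INITIALIZER };
                pthread_t ta, tb;

                pthread_create(&ta, NULL, fault_in, &a);
                pthread_create(&tb, NULL, fault_in, &b);
                pthread_join(ta, NULL);
                pthread_join(tb, NULL);
                printf("entry = %lu\n", shared_pmd_entry); /* may lose an update */
                return 0;
        }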

> In short, it looks like it is possible for multiple
> processes to go through the above code simultaneously,
> potentially resulting in:
> 
> 1) one process overwriting the pte just created by
>another process
> 
> 2) data corruption, as one partially written page
>    gets superseded by a newly zeroed page, but no
>TLB invalidates get sent to other CPUs
> 
> 3) a memory leak of a huge page
> 
> Is there anything that would make this race impossible,
> or is this a real bug?
> 

In this case it all happens under the hugetlb instantiation mutex in
hugetlb_fault(). It's yet another reason why removing that mutex would
be a serious undertaking and the gain for doing so is marginal.
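
For reference, the serialisation in question looks roughly like this
(paraphrased and abbreviated from mm/hugetlb.c of that era, so treat it
as a sketch rather than the exact code):

        /* A single global mutex serialises hugetlb fault instantiation,
         * so the huge_pte_none() check and set_huge_pte_at() in
         * hugetlb_no_page() cannot race with another hugetlb fault. */
        static DEFINE_MUTEX(hugetlb_instantiation_mutex);

        int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                          unsigned long address, unsigned int flags)
        {
                ...
                mutex_lock(&hugetlb_instantiation_mutex);
                entry = huge_ptep_get(ptep);
                if (huge_pte_none(entry)) {
                        ret = hugetlb_no_page(mm, vma, address, ptep, flags);
                        goto out_mutex;
                }
                ...
        out_mutex:
                mutex_unlock(&hugetlb_instantiation_mutex);
                return ret;
        }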

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2

2012-07-27 Thread Mel Gorman
On Thu, Jul 26, 2012 at 12:01:04PM -0400, Larry Woodman wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> >+retry:
> > 	mutex_lock(&mapping->i_mmap_mutex);
> > 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> > 		if (svma == vma)
> > 			continue;
> >+		if (svma->vm_mm == vma->vm_mm)
> >+			continue;
> >+
> >+		/*
> >+		 * The target mm could be in the process of tearing down
> >+		 * its page tables and the i_mmap_mutex on its own is
> >+		 * not sufficient. To prevent races against teardown and
> >+		 * pagetable updates, we acquire the mmap_sem and pagetable
> >+		 * lock of the remote address space. down_read_trylock()
> >+		 * is necessary as the other process could also be trying
> >+		 * to share pagetables with the current mm. In the fork
> >+		 * case, we are already both mm's so check for that
> >+		 */
> >+		if (locked_mm != svma->vm_mm) {
> >+			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> >+				mutex_unlock(&mapping->i_mmap_mutex);
> >+				goto retry;
> >+			}
> >+			smmap_sem = &svma->vm_mm->mmap_sem;
> >+		}
> >+
> >+		spage_table_lock = &svma->vm_mm->page_table_lock;
> >+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >
> > 		saddr = page_table_shareable(svma, vma, addr, idx);
> > 		if (saddr) {
> 
> Hi Mel, FYI I tried this and ran into a problem.  When there are
> multiple processes in huge_pmd_share() just faulting in the same
> i_mmap they all have their mmap_sem down for write so the
> down_read_trylock(&svma->vm_mm->mmap_sem) never succeeds.  What am
> I missing?
> 

Probably nothing; this version of the patch is flawed. The final
(unreleased) version of this approach had to track how long it had been
retrying the trylock and, if it spun for too long, bail out and fail to
share the page tables. I've dropped this approach to the problem as
better alternatives exist.
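
As a hypothetical sketch of that bail-out (the retry bound and its name
are invented here; this is not the unreleased patch):

        #define HUGE_SHARE_RETRIES 10  /* invented bound, for illustration */

                int retries = 0;
        retry:
                mutex_lock(&mapping->i_mmap_mutex);
                ...
                if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
                        mutex_unlock(&mapping->i_mmap_mutex);
                        if (++retries > HUGE_SHARE_RETRIES)
                                return; /* give up, fall back to unshared page tables */
                        goto retry;
                }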

Thanks Larry!

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2

2012-07-26 Thread Hugh Dickins
On Thu, 26 Jul 2012, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> > This V2 is still the mmap_sem approach that fixes a potential deadlock
> > problem pointed out by Michal.
> 
> Larry and I were looking around the hugetlb code some
> more, and found what looks like yet another race.
> 
> In hugetlb_no_page, we have the following code:
> 
> 
> spin_lock(&mm->page_table_lock);
> size = i_size_read(mapping->host) >> huge_page_shift(h);
> if (idx >= size)
>         goto backout;
> 
> ret = 0;
> if (!huge_pte_none(huge_ptep_get(ptep)))
>         goto backout;
> 
> if (anon_rmap)
>         hugepage_add_new_anon_rmap(page, vma, address);
> else
>         page_dup_rmap(page);
> new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
>                                 && (vma->vm_flags & VM_SHARED)));
> set_huge_pte_at(mm, address, ptep, new_pte);
> ...
> spin_unlock(&mm->page_table_lock);
> 
> Notice how we check !huge_pte_none with our own
> mm->page_table_lock held.
> 
> This offers no protection at all against other
> processes, which also hold their own page_table_lock.
> 
> In short, it looks like it is possible for multiple
> processes to go through the above code simultaneously,
> potentially resulting in:
> 
> 1) one process overwriting the pte just created by
>another process
> 
> 2) data corruption, as one partially written page
>    gets superseded by a newly zeroed page, but no
>TLB invalidates get sent to other CPUs
> 
> 3) a memory leak of a huge page
> 
> Is there anything that would make this race impossible,
> or is this a real bug?

I believe it's protected by the unloved hugetlb_instantiation_mutex.

> 
> If so, are there more like it in the hugetlbfs code?

What, more than one bug in that code?
Surely that would defy the laws of probability ;)

Hugh


Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2

2012-07-26 Thread Rik van Riel

On 07/20/2012 09:49 AM, Mel Gorman wrote:

This V2 is still the mmap_sem approach that fixes a potential deadlock
problem pointed out by Michal.


Larry and I were looking around the hugetlb code some
more, and found what looks like yet another race.

In hugetlb_no_page, we have the following code:


spin_lock(&mm->page_table_lock);
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
        goto backout;

ret = 0;
if (!huge_pte_none(huge_ptep_get(ptep)))
        goto backout;

if (anon_rmap)
        hugepage_add_new_anon_rmap(page, vma, address);
else
        page_dup_rmap(page);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
                                && (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);
...
spin_unlock(&mm->page_table_lock);

Notice how we check !huge_pte_none with our own
mm->page_table_lock held.

This offers no protection at all against other
processes, which also hold their own page_table_lock.

In short, it looks like it is possible for multiple
processes to go through the above code simultaneously,
potentially resulting in:

1) one process overwriting the pte just created by
   another process

2) data corruption, as one partially written page
   gets superseded by a newly zeroed page, but no
   TLB invalidates get sent to other CPUs

3) a memory leak of a huge page

Is there anything that would make this race impossible,
or is this a real bug?

If so, are there more like it in the hugetlbfs code?


Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2

2012-07-26 Thread Larry Woodman

On 07/20/2012 09:49 AM, Mel Gorman wrote:

+retry:
 	mutex_lock(&mapping->i_mmap_mutex);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
+		if (svma->vm_mm == vma->vm_mm)
+			continue;
+
+		/*
+		 * The target mm could be in the process of tearing down
+		 * its page tables and the i_mmap_mutex on its own is
+		 * not sufficient. To prevent races against teardown and
+		 * pagetable updates, we acquire the mmap_sem and pagetable
+		 * lock of the remote address space. down_read_trylock()
+		 * is necessary as the other process could also be trying
+		 * to share pagetables with the current mm. In the fork
+		 * case, we are already both mm's so check for that
+		 */
+		if (locked_mm != svma->vm_mm) {
+			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
+				mutex_unlock(&mapping->i_mmap_mutex);
+				goto retry;
+			}
+			smmap_sem = &svma->vm_mm->mmap_sem;
+		}
+
+		spage_table_lock = &svma->vm_mm->page_table_lock;
+		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
 
 		saddr = page_table_shareable(svma, vma, addr, idx);
 		if (saddr) {


Hi Mel, FYI I tried this and ran into a problem.  When there are
multiple processes in huge_pmd_share() just faulting in the same
i_mmap they all have their mmap_sem down for write so the
down_read_trylock(&svma->vm_mm->mmap_sem) never succeeds.  What am
I missing?
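
The failure mode has a simple userspace analogue: a read-trylock always
fails while another task holds the lock for write. An illustrative,
non-kernel demonstration:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

        static void *try_share(void *arg)
        {
                /* Corresponds to down_read_trylock(&svma->vm_mm->mmap_sem) */
                if (pthread_rwlock_tryrdlock(&mmap_sem) != 0)
                        printf("trylock fails while a writer holds the lock\n");
                else
                        pthread_rwlock_unlock(&mmap_sem);
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_rwlock_wrlock(&mmap_sem); /* "mmap_sem down for write" */
                pthread_create(&t, NULL, try_share, NULL);
                pthread_join(t, NULL);
                pthread_rwlock_unlock(&mmap_sem);
                return 0;
        }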

Thanks, Larry



Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

2012-07-20 Thread Michal Hocko
On Fri 20-07-12 15:37:53, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > > 
> > > 
> > > Signed-off-by: Mel Gorman 
> > 
> > Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> > free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> > see non-NULL spte on the way out.
> 
> Yes.
> 
> > I am just wondering whether we need
> > the page_table_lock as well. It is not harmful but I guess we can drop
> > it because both exit_mmap and shmdt are not taking it and mmap_sem is
> > sufficient for them.
> 
> While it is true that we don't *really* need page_table_lock here, we are
> still updating page tables and it's in line with the ordinary locking
> rules.  There are other cases in hugetlb.c where we do pte_same() checks even
> though we are protected from the related races by the instantiation_mutex.
> 
> page_table_lock is actually a bit useless for shared page tables. If shared
> page tables were ever to be a general thing then I think we'd have to
> revisit how PTE update locking is done but I doubt anyone wants to dive
> down that rat-hole.
> 
> For now, I'm going to keep taking it even if strictly speaking it's not
> necessary.

Fair enough

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic


Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

2012-07-20 Thread Mel Gorman
On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > 
> > 
> > Signed-off-by: Mel Gorman 
> 
> Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> see non-NULL spte on the way out.

Yes.

> I am just wondering whether we need
> the page_table_lock as well. It is not harmful but I guess we can drop
> it because both exit_mmap and shmdt are not taking it and mmap_sem is
> sufficient for them.

While it is true that we don't *really* need page_table_lock here, we are
still updating page tables and it's in line with the ordinary locking
rules.  There are other cases in hugetlb.c where we do pte_same() checks even
though we are protected from the related races by the instantiation_mutex.

page_table_lock is actually a bit useless for shared page tables. If shared
page tables were ever to be a general thing then I think we'd have to
revisit how PTE update locking is done but I doubt anyone wants to dive
down that rat-hole.

For now, I'm going to keep taking it even if strictly speaking it's not
necessary.
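
(For reference, the pte_same() pattern in question looks roughly like
this, paraphrased from hugetlb.c rather than quoted exactly:)

        spin_lock(&mm->page_table_lock);
        /* Re-check under the lock that the PTE is still the value we
         * sampled before deciding to act on it. */
        if (likely(pte_same(huge_ptep_get(ptep), pte))) {
                /* safe to proceed with the update */
        }
        spin_unlock(&mm->page_table_lock);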

> One more nit below.
> 
> I will send my version of the fix which took another path as a reply to
> this email to have something to compare with.
> 

Thanks.

> [...]
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index f6679a7..944b2df 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  /*
> >   * search for a shareable pmd page for hugetlb.
> >   */
> > -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> > +			   unsigned long addr, pud_t *pud)
> >  {
> >  	struct vm_area_struct *vma = find_vma(mm, addr);
> >  	struct address_space *mapping = vma->vm_file->f_mapping;
> > @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  	struct vm_area_struct *svma;
> >  	unsigned long saddr;
> >  	pte_t *spte = NULL;
> > +	spinlock_t *spage_table_lock = NULL;
> > +	struct rw_semaphore *smmap_sem = NULL;
> >  
> >  	if (!vma_shareable(vma, addr))
> >  		return;
> >  
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm. In the fork
> > +		 * case, we are already both mm's so check for that
> > +		 */
> > +		if (locked_mm != svma->vm_mm) {
> > +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +				mutex_unlock(&mapping->i_mmap_mutex);
> > +				goto retry;
> > +			}
> > +			smmap_sem = &svma->vm_mm->mmap_sem;
> > +		}
> > +
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >  
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
> [...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ae8f708..4832277 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		src_pte = huge_pte_offset(src, addr);
> >  		if (!src_pte)
> >  			continue;
> > -		dst_pte = huge_pte_alloc(dst, addr, sz);
> > +		dst_pte = huge_pte_alloc(dst, src, addr, sz);
> >  		if (!dst_pte)
> >  			goto nomem;
> >  
> > @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  				VM_FAULT_SET_HINDEX(h - hstates);
> >  	}
> >  
> > -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> > +	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
> 
> strictly speaking we should provide current->mm here because we are in
> the page fault path and mmap_sem is held for reading. This doesn't
> matter here though because huge_pmd_share will take it for reading so
> nesting wouldn't hurt. Maybe a small comment that this is intentional
> and correct would be nice.
> 

Fair point. If we go with this version of the fix, I'll improve the
documentation a bit.

Thanks!
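
A sketch of the kind of comment that could be added there (hypothetical
wording, not from the posted patch):

        /*
         * Strictly this should pass current->mm since we are in the
         * fault path with mmap_sem held for read, but NULL is harmless
         * because huge_pmd_share() only takes the remote mm's mmap_sem
         * with down_read_trylock(), and nested read acquisition does
         * not deadlock.
         */
        ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));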

Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)

2012-07-20 Thread Michal Hocko
On Fri 20-07-12 15:11:08, Mel Gorman wrote:
> Sorry for the resend, I did not properly refresh Cong Wang's suggested
> fix. This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.
> 
> Changelog since V1
>  o Correct cut & paste error in race description (hugh)
>  o Handle potential deadlock during fork  (mhocko)
>  o Reorder unlocking  (wangcong)
> 
> If a process creates a large hugetlbfs mapping that is eligible for page
> table sharing and forks heavily with children some of whom fault and
> others which destroy the mapping then it is possible for page tables to
> get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
> output a message to the kernel log. The final teardown will trigger a
> BUG_ON in mm/filemap.c.
> 
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
> 
> [  ..] Lots of bad pmd messages followed by this
> [  127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [  127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [  127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [  127.186778] [ cut here ]
> [  127.186781] kernel BUG at mm/filemap.c:134!
> [  127.186782] invalid opcode:  [#1] SMP
> [  127.186783] CPU 7
> [  127.186784] Modules linked in: af_packet cpufreq_conservative 
> cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod 
> coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 
> r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw 
> cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode 
> ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm 
> i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button 
> i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon 
> ata_generic pata_atiixp libata scsi_mod
> [  127.186801]
> [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild 
> #53 Dell Inc. OptiPlex 990/06D7TR
> [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>]
> __delete_from_page_cache+0x15e/0x160
> [  127.186809] RSP: :8804144b5c08  EFLAGS: 00010002
> [  127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: 
> ffc0
> [  127.186811] RDX:  RSI: 0009 RDI: 
> 88042dfdad00
> [  127.186812] RBP: 8804144b5c18 R08: 0009 R09: 
> 0003
> [  127.186813] R10:  R11: 002d R12: 
> 880412ff83d8
> [  127.186814] R13: 880412ff83d8 R14:  R15: 
> 880412ff83d8
> [  127.186815] FS:  7fe18ed2c700() GS:88042dce() 
> knlGS:
> [  127.186816] CS:  0010 DS:  ES:  CR0: 8005003b
> [  127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 
> 000407e0
> [  127.186818] DR0:  DR1:  DR2: 
> 
> [  127.186819] DR3:  DR6: 0ff0 DR7: 
> 0400
> [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 
> 8804144b4000, task 880417f803c0)
> [  127.186821] Stack:
> [  127.186822]  ea000a5c9000  8804144b5c48 
> 810ed83b
> [  127.186824]  8804144b5c48 138a 1387 
> 8804144b5c98
> [  127.186825]  8804144b5d48 811bc925 8804144b5cb8 
> 
> [  127.186827] Call Trace:
> [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
> [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
> [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
> [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
> [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
> [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
> [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
> [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
> [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
> [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
> [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
> [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
> [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
> [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
> [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
> [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 
> 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 
> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
> [  127.186870]  RSP <ffff8804144b5c08>
> [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
> 
> The bug is a race and not always easy to reproduce. To reproduce it I was
> doing the following on a single socket I7-based machine with 16G of RAM.
> 
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo 
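
(For the curious, the fork/fault/teardown pattern the changelog
describes has roughly the following shape. This is an invented sketch,
not the actual test; the mount point, sizes, counts and the missing
error handling are all placeholders:)

        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <fcntl.h>
        #include <unistd.h>

        #define LEN (512UL << 20) /* large enough to be eligible for PMD sharing */

        int main(void)
        {
                int fd = open("/mnt/huge/racefile", O_CREAT | O_RDWR, 0600);
                char *p;

                ftruncate(fd, LEN);
                p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

                for (int i = 0; i < 64; i++) {
                        if (fork() == 0) {
                                if (i & 1)
                                        p[(long)i << 21] = 1; /* some children fault */
                                else
                                        munmap(p, LEN);       /* others tear down */
                                _exit(0);
                        }
                }
                while (wait(NULL) > 0)
                        ;
                return 0;
        }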
