Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

2014-11-10 Thread Mel Gorman
On Mon, Nov 03, 2014 at 10:35:04PM -0800, Hugh Dickins wrote:
> On Thu, 30 Oct 2014, Davidlohr Bueso wrote:
> 
> > The i_mmap_rwsem protects shared pages against races
> > when sharing and unsharing them, ultimately calling
> > huge_pmd_share/unshare() for PMD pages -- it is also
> > needed to avoid races when populating the pud for pmd
> > allocation while looking for a shareable pmd page for
> > hugetlb. In all cases the interval tree itself remains intact.
> > 
> > Signed-off-by: Davidlohr Bueso 
> > Acked-by: Kirill A. Shutemov 
> 
> I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
> but that could easily be that I'm just not thinking hard enough - I'd
> rather leave the heavy thinking to someone else!
> 
> The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
> iffy.  It gets into huge page table sharing territory, which is very
> tricky and surprising territory indeed (take a look at my
> __unmap_hugepage_range_final() comment, for one example).
> 
> You're right that the interval tree remains intact, but I've a feeling
> we end up using i_mmap_mutex for more exclusion than just that (rather
> like how huge_memory.c finds anon_vma lock useful for other exclusions).
> 
> I think Mel (already Cc'ed) and Michal (adding him) both have past
> experience with the shared page table (as do I, but I'm in denial).
> 

I dealt with it far in the past when it was still buried under arch/x86
and it was a whole pile of no fun. In this case I think there is little or
no value in trying to convert the lock for page table sharing. The benefit
is marginal (database initialisation maybe) while the potential for
surprises is high.

The __unmap_hugepage_range_final() concern is valid. If this is converted to
read then I am fairly sure that the bug fixed by commit d833352a4338 ("mm:
hugetlbfs: close race during teardown of hugetlbfs shared page tables")
gets reintroduced. We also potentially see races between huge_pmd_unshare's
refcounting and huge_pmd_share, as huge_pmd_unshare does a race-prone check
on the refcount if it's not serialised by i_mmap_lock_write. On a race, it
will leak pages, which will be hard to detect.
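
To illustrate the hazard: it is the classic check-then-act pattern on a
reference count, made under a lock that (after the conversion) no longer
excludes the other side. Below is a minimal userspace sketch -- an analogy
with made-up names, not the actual mm/hugetlb.c code -- where do_share()
and do_unshare() stand in for huge_pmd_share()/huge_pmd_unshare() and the
rwlock stands in for i_mmap_rwsem taken only for read on both sides:

/* build with: cc -pthread pmd-refcount-race.c */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t i_mmap_lock = PTHREAD_RWLOCK_INITIALIZER;
static int pmd_page_count = 1;	/* one mapping currently uses the pmd page */

/* stand-in for huge_pmd_share(): take a reference to the shared pmd page */
static void *do_share(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&i_mmap_lock);
	__sync_fetch_and_add(&pmd_page_count, 1);	/* get_page() analogue */
	pthread_rwlock_unlock(&i_mmap_lock);
	return NULL;
}

/* stand-in for huge_pmd_unshare(): check-then-act on the reference count */
static void *do_unshare(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&i_mmap_lock);	/* read lock does NOT exclude do_share() */
	if (pmd_page_count == 1) {
		/*
		 * "Sole user" path: the caller assumes nobody else holds a
		 * reference.  If do_share() took one between the check and
		 * here, that reference is never dropped -- the analogue of
		 * the leaked page table page.
		 */
	} else {
		__sync_fetch_and_sub(&pmd_page_count, 1);	/* put_page() analogue */
	}
	pthread_rwlock_unlock(&i_mmap_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, do_share, NULL);
	pthread_create(&b, NULL, do_unshare, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("final count %d (varies from run to run)\n", pmd_page_count);
	return 0;
}

With the lock taken for write on at least one side the two paths cannot
interleave, so the count the unshare side tests is the same count it acts
on; with both sides only holding it for read, the "sole user" decision can
be made against a stale value.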

Considering the upside of this particular conversion, I don't think it's
worth the loss of hair, or the will to live, to try to fix it up.

> I wonder if the huge shared page table would be a good next target
> for Kirill's removal of mm nastiness.  (Removing it wouldn't hurt
> Google for one: we have it "#if 0"ed out, though I forget why at
> this moment.)
> 

I think the only benefit was reducing TLB pressure on databases with
very large shared memory before 1G pages existed and when 2M TLB entries
were a very limited resource. I doubt it's been quantified on anything
resembling recent hardware. If it did get killed though, it would need a
spin through a database test that used the particular database software
that benefitted.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

2014-11-05 Thread Hugh Dickins
On Tue, 4 Nov 2014, Davidlohr Bueso wrote:
> 8<
> From: Davidlohr Bueso 
> Subject: [PATCH 11/10] mm/memory.c: share the i_mmap_rwsem
> 
> The unmap_mapping_range family of functions does the unmapping
> of user pages (ultimately via zap_page_range_single) without
> touching the actual interval tree, so the lock can be shared.
> 
> Signed-off-by: Davidlohr Bueso 

Acked-by: Hugh Dickins 

Yes, thanks, let's get this 11/10 into mmotm along with the rest,
but put the hugetlb 10/10 on the shelf for now, until we've had
time to contemplate it more deeply.

> ---
>  mm/memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2ca3105..06f2458 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2396,12 +2396,12 @@ void unmap_mapping_range(struct address_space *mapping,
>   details.last_index = ULONG_MAX;
>  
>  
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
>   if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
>   unmap_mapping_range_tree(&mapping->i_mmap, &details);
>   if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
>   unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> -- 
> 1.8.4.5


Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

2014-11-04 Thread Davidlohr Bueso
On Mon, 2014-11-03 at 22:35 -0800, Hugh Dickins wrote:
> On Thu, 30 Oct 2014, Davidlohr Bueso wrote:
> 
> > The i_mmap_rwsem protects shared pages against races
> > when sharing and unsharing them, ultimately calling
> > huge_pmd_share/unshare() for PMD pages -- it is also
> > needed to avoid races when populating the pud for pmd
> > allocation while looking for a shareable pmd page for
> > hugetlb. In all cases the interval tree itself remains intact.
> > 
> > Signed-off-by: Davidlohr Bueso 
> > Acked-by: Kirill A. Shutemov 
> 
> I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
> but that could easily be that I'm just not thinking hard enough - I'd
> rather leave the heavy thinking to someone else!
> 
> The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
> iffy.  It gets into huge page table sharing territory, which is very
> tricky and surprising territory indeed (take a look at my
> __unmap_hugepage_range_final() comment, for one example).
> 
> You're right that the interval tree remains intact, but I've a feeling
> we end up using i_mmap_mutex for more exclusion than just that (rather
> like how huge_memory.c finds anon_vma lock useful for other exclusions).

Yeah, that certainly wouldn't surprise me, and this particular patch was
the one I was most unsure about for that exact same reason. Hopefully
others could confirm if this is truly doable and safe.

> I think Mel (already Cc'ed) and Michal (adding him) both have past
> experience with the shared page table (as do I, but I'm in denial).
> 
> I wonder if the huge shared page table would be a good next target
> for Kirill's removal of mm nastiness.  (Removing it wouldn't hurt
> Google for one: we have it "#if 0"ed out, though I forget why at
> this moment.)
> 
> But, returning to the fs/hugetlbfs/inode.c part of it, that reminds
> me: you're missing one patch from the series, aren't you?  Why no
> i_mmap_lock_read() in mm/memory.c unmap_mapping_range()?  I doubt
> it will add much useful parallelism, but it would be correct.

Oh yes, not sure why I didn't update that function; I had marked it as
safe to share the lock. Thanks for taking a close look at the series.

8<
From: Davidlohr Bueso 
Subject: [PATCH 11/10] mm/memory.c: share the i_mmap_rwsem

The unmap_mapping_range family of functions does the unmapping
of user pages (ultimately via zap_page_range_single) without
touching the actual interval tree, so the lock can be shared.

Signed-off-by: Davidlohr Bueso 
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2ca3105..06f2458 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2396,12 +2396,12 @@ void unmap_mapping_range(struct address_space *mapping,
details.last_index = ULONG_MAX;
 
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
1.8.4.5





Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

2014-11-03 Thread Hugh Dickins
On Thu, 30 Oct 2014, Davidlohr Bueso wrote:

> The i_mmap_rwsem protects shared pages against races
> when sharing and unsharing them, ultimately calling
> huge_pmd_share/unshare() for PMD pages -- it is also
> needed to avoid races when populating the pud for pmd
> allocation while looking for a shareable pmd page for
> hugetlb. In all cases the interval tree itself remains intact.
> 
> Signed-off-by: Davidlohr Bueso 
> Acked-by: Kirill A. Shutemov 

I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
but that could easily be that I'm just not thinking hard enough - I'd
rather leave the heavy thinking to someone else!

The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
iffy.  It gets into huge page table sharing territory, which is very
tricky and surprising territory indeed (take a look at my
__unmap_hugepage_range_final() comment, for one example).

You're right that the interval tree remains intact, but I've a feeling
we end up using i_mmap_mutex for more exclusion than just that (rather
like how huge_memory.c finds anon_vma lock useful for other exclusions).

I think Mel (already Cc'ed) and Michal (adding him) both have past
experience with the shared page table (as do I, but I'm in denial).

I wonder if the huge shared page table would be a good next target
for Kirill's removal of mm nastiness.  (Removing it wouldn't hurt
Google for one: we have it "#if 0"ed out, though I forget why at
this moment.)

But, returning to the fs/hugetlbfs/inode.c part of it, that reminds
me: you're missing one patch from the series, aren't you?  Why no
i_mmap_lock_read() in mm/memory.c unmap_mapping_range()?  I doubt
it will add much useful parallelism, but it would be correct.

Hugh

> ---
>  fs/hugetlbfs/inode.c |  4 ++--
>  mm/hugetlb.c | 12 ++--
>  mm/memory.c  |  4 ++--
>  3 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 5eba47f..0dca54d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>   pgoff = offset >> PAGE_SHIFT;
>  
>   i_size_write(inode, offset);
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
>   if (!RB_EMPTY_ROOT(&mapping->i_mmap))
>   hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
>   truncate_hugepages(inode, offset);
>   return 0;
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2071cf4..80349f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2775,7 +2775,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>* this mapping should be shared between all the VMAs,
>* __unmap_hugepage_range() is called as the lock is already held
>*/
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
>   vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
>   /* Do not unmap the current VMA */
>   if (iter_vma == vma)
> @@ -2792,7 +2792,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>   unmap_hugepage_range(iter_vma, address,
>address + huge_page_size(h), page);
>   }
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
>  }
>  
>  /*
> @@ -3350,7 +3350,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>   flush_cache_range(vma, address, end);
>  
>   mmu_notifier_invalidate_range_start(mm, start, end);
> - i_mmap_lock_write(vma->vm_file->f_mapping);
> + i_mmap_lock_read(vma->vm_file->f_mapping);
>   for (; address < end; address += huge_page_size(h)) {
>   spinlock_t *ptl;
>   ptep = huge_pte_offset(mm, address);
> @@ -3379,7 +3379,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>*/
>   flush_tlb_range(vma, start, end);
>   mmu_notifier_invalidate_range(mm, start, end);
> - i_mmap_unlock_write(vma->vm_file->f_mapping);
> + i_mmap_unlock_read(vma->vm_file->f_mapping);
>   mmu_notifier_invalidate_range_end(mm, start, end);
>  
>   return pages << h->order;
> @@ -3547,7 +3547,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   if (!vma_shareable(vma, addr))
>   return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
>   vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>   if (svma == vma)
>   continue;
> @@ -3575,7 +3575,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   spin_unlock(ptl);
>  out:
>   pte