Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
On Thu, Jul 26, 2012 at 05:00:28PM -0400, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> > This V2 is still the mmap_sem approach that fixes a potential deadlock
> > problem pointed out by Michal.
>
> Larry and I were looking around the hugetlb code some more, and found
> what looks like yet another race.
>
> In hugetlb_no_page, we have the following code:
>
> 	spin_lock(&mm->page_table_lock);
> 	size = i_size_read(mapping->host) >> huge_page_shift(h);
> 	if (idx >= size)
> 		goto backout;
>
> 	ret = 0;
> 	if (!huge_pte_none(huge_ptep_get(ptep)))
> 		goto backout;
>
> 	if (anon_rmap)
> 		hugepage_add_new_anon_rmap(page, vma, address);
> 	else
> 		page_dup_rmap(page);
> 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> 				&& (vma->vm_flags & VM_SHARED)));
> 	set_huge_pte_at(mm, address, ptep, new_pte);
> 	...
> 	spin_unlock(&mm->page_table_lock);
>
> Notice how we check !huge_pte_none with our own mm->page_table_lock held.
>
> This offers no protection at all against other processes, which also
> hold their own page_table_lock.

Yes, the page_table_lock is close to useless once shared page tables are
involved. It's why, if we ever wanted to make shared page tables a core MM
thing, we'd have to revisit how PTE locking works at any level that can
share page tables.

> In short, it looks like it is possible for multiple processes to go
> through the above code simultaneously, potentially resulting in:
>
> 1) one process overwriting the pte just created by another process
>
> 2) data corruption, as one partially written page gets superseded by a
>    newly zeroed page, but no TLB invalidates get sent to other CPUs
>
> 3) a memory leak of a huge page
>
> Is there anything that would make this race impossible, or is this a
> real bug?

In this case it all happens under the hugetlb instantiation mutex in
hugetlb_fault(). It's yet another reason why removing that mutex would be
a serious undertaking, and the gain for doing so is marginal.
--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
On Thu, Jul 26, 2012 at 12:01:04PM -0400, Larry Woodman wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm. In the fork
> > +		 * case, we are already both mm's so check for that
> > +		 */
> > +		if (locked_mm != svma->vm_mm) {
> > +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +				mutex_unlock(&mapping->i_mmap_mutex);
> > +				goto retry;
> > +			}
> > +			smmap_sem = &svma->vm_mm->mmap_sem;
> > +		}
> > +
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
>
> Hi Mel, FYI I tried this and ran into a problem. When there are multiple
> processes in huge_pmd_share() all faulting in the same i_mmap, they all
> have their mmap_sem down for write, so the
> down_read_trylock(&svma->vm_mm->mmap_sem) never succeeds. What am I
> missing?

Probably nothing; this version of the patch is flawed. The final
(unreleased) version of this approach had to check whether it had retried
the trylock for too long and, if so, bail out and fail to share the page
tables. I've dropped this approach to the problem as better alternatives
exist.

Thanks Larry!
--
Mel Gorman
SUSE Labs
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
On Thu, 26 Jul 2012, Rik van Riel wrote:
> On 07/20/2012 09:49 AM, Mel Gorman wrote:
> > This V2 is still the mmap_sem approach that fixes a potential deadlock
> > problem pointed out by Michal.
>
> Larry and I were looking around the hugetlb code some more, and found
> what looks like yet another race.
>
> In hugetlb_no_page, we have the following code:
>
> 	spin_lock(&mm->page_table_lock);
> 	size = i_size_read(mapping->host) >> huge_page_shift(h);
> 	if (idx >= size)
> 		goto backout;
>
> 	ret = 0;
> 	if (!huge_pte_none(huge_ptep_get(ptep)))
> 		goto backout;
>
> 	if (anon_rmap)
> 		hugepage_add_new_anon_rmap(page, vma, address);
> 	else
> 		page_dup_rmap(page);
> 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> 				&& (vma->vm_flags & VM_SHARED)));
> 	set_huge_pte_at(mm, address, ptep, new_pte);
> 	...
> 	spin_unlock(&mm->page_table_lock);
>
> Notice how we check !huge_pte_none with our own mm->page_table_lock held.
>
> This offers no protection at all against other processes, which also
> hold their own page_table_lock.
>
> In short, it looks like it is possible for multiple processes to go
> through the above code simultaneously, potentially resulting in:
>
> 1) one process overwriting the pte just created by another process
>
> 2) data corruption, as one partially written page gets superseded by a
>    newly zeroed page, but no TLB invalidates get sent to other CPUs
>
> 3) a memory leak of a huge page
>
> Is there anything that would make this race impossible, or is this a
> real bug?

I believe it's protected by the unloved hugetlb_instantiation_mutex.

> If so, are there more like it in the hugetlbfs code?

What, more than one bug in that code? Surely that would defy the laws of
probability ;)

Hugh
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
On 07/20/2012 09:49 AM, Mel Gorman wrote:
> This V2 is still the mmap_sem approach that fixes a potential deadlock
> problem pointed out by Michal.

Larry and I were looking around the hugetlb code some more, and found
what looks like yet another race.

In hugetlb_no_page, we have the following code:

	spin_lock(&mm->page_table_lock);
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size)
		goto backout;

	ret = 0;
	if (!huge_pte_none(huge_ptep_get(ptep)))
		goto backout;

	if (anon_rmap)
		hugepage_add_new_anon_rmap(page, vma, address);
	else
		page_dup_rmap(page);
	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
				&& (vma->vm_flags & VM_SHARED)));
	set_huge_pte_at(mm, address, ptep, new_pte);
	...
	spin_unlock(&mm->page_table_lock);

Notice how we check !huge_pte_none with our own mm->page_table_lock held.

This offers no protection at all against other processes, which also hold
their own page_table_lock.

In short, it looks like it is possible for multiple processes to go
through the above code simultaneously, potentially resulting in:

1) one process overwriting the pte just created by another process

2) data corruption, as one partially written page gets superseded by a
   newly zeroed page, but no TLB invalidates get sent to other CPUs

3) a memory leak of a huge page

Is there anything that would make this race impossible, or is this a real
bug?

If so, are there more like it in the hugetlbfs code?
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables v2
On 07/20/2012 09:49 AM, Mel Gorman wrote:
> +retry:
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> +		if (svma->vm_mm == vma->vm_mm)
> +			continue;
> +
> +		/*
> +		 * The target mm could be in the process of tearing down
> +		 * its page tables and the i_mmap_mutex on its own is
> +		 * not sufficient. To prevent races against teardown and
> +		 * pagetable updates, we acquire the mmap_sem and pagetable
> +		 * lock of the remote address space. down_read_trylock()
> +		 * is necessary as the other process could also be trying
> +		 * to share pagetables with the current mm. In the fork
> +		 * case, we are already both mm's so check for that
> +		 */
> +		if (locked_mm != svma->vm_mm) {
> +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> +				mutex_unlock(&mapping->i_mmap_mutex);
> +				goto retry;
> +			}
> +			smmap_sem = &svma->vm_mm->mmap_sem;
> +		}
> +
> +		spage_table_lock = &svma->vm_mm->page_table_lock;
> +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
>
>  		saddr = page_table_shareable(svma, vma, addr, idx);
>  		if (saddr) {

Hi Mel, FYI I tried this and ran into a problem. When there are multiple
processes in huge_pmd_share() all faulting in the same i_mmap, they all
have their mmap_sem down for write, so the
down_read_trylock(&svma->vm_mm->mmap_sem) never succeeds. What am I
missing?

Thanks, Larry
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri 20-07-12 15:37:53, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> > free_pgtables are executed atomically wrt. huge_pmd_share so it
> > doesn't see non-NULL spte on the way out.
>
> Yes.
>
> > I am just wondering whether we need the page_table_lock as well. It
> > is not harmful but I guess we can drop it because both exit_mmap and
> > shmdt are not taking it and mmap_sem is sufficient for them.
>
> While it is true that we don't *really* need page_table_lock here, we
> are still updating page tables and it's in line with the ordinary
> locking rules. There are other cases in hugetlb.c where we do
> pte_same() checks even though we are protected from the related races
> by the instantiation_mutex.
>
> page_table_lock is actually a bit useless for shared page tables. If
> shared page tables were ever to be a general thing then I think we'd
> have to revisit how PTE update locking is done, but I doubt anyone
> wants to dive down that rat-hole.
>
> For now, I'm going to keep taking it even if strictly speaking it's
> not necessary.

Fair enough

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> [...]
> > Signed-off-by: Mel Gorman
>
> Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> see non-NULL spte on the way out.

Yes.

> I am just wondering whether we need the page_table_lock as well. It is
> not harmful but I guess we can drop it because both exit_mmap and shmdt
> are not taking it and mmap_sem is sufficient for them.

While it is true that we don't *really* need page_table_lock here, we are
still updating page tables and it's in line with the ordinary locking
rules. There are other cases in hugetlb.c where we do pte_same() checks
even though we are protected from the related races by the
instantiation_mutex.

page_table_lock is actually a bit useless for shared page tables. If
shared page tables were ever to be a general thing then I think we'd have
to revisit how PTE update locking is done, but I doubt anyone wants to
dive down that rat-hole.

For now, I'm going to keep taking it even if strictly speaking it's not
necessary.

> One more nit below.
>
> I will send my version of the fix, which took another path, as a reply
> to this email to have something to compare with.

Thanks.

> [...]
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index f6679a7..944b2df 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  /*
> >   * search for a shareable pmd page for hugetlb.
> >   */
> > -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> > +			   unsigned long addr, pud_t *pud)
> >  {
> >  	struct vm_area_struct *vma = find_vma(mm, addr);
> >  	struct address_space *mapping = vma->vm_file->f_mapping;
> > @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  	struct vm_area_struct *svma;
> >  	unsigned long saddr;
> >  	pte_t *spte = NULL;
> > +	spinlock_t *spage_table_lock = NULL;
> > +	struct rw_semaphore *smmap_sem = NULL;
> >
> >  	if (!vma_shareable(vma, addr))
> >  		return;
> >
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm. In the fork
> > +		 * case, we are already both mm's so check for that
> > +		 */
> > +		if (locked_mm != svma->vm_mm) {
> > +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +				mutex_unlock(&mapping->i_mmap_mutex);
> > +				goto retry;
> > +			}
> > +			smmap_sem = &svma->vm_mm->mmap_sem;
> > +		}
> > +
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
> [...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ae8f708..4832277 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		src_pte = huge_pte_offset(src, addr);
> >  		if (!src_pte)
> >  			continue;
> > -		dst_pte = huge_pte_alloc(dst, addr, sz);
> > +		dst_pte = huge_pte_alloc(dst, src, addr, sz);
> >  		if (!dst_pte)
> >  			goto nomem;
> >
> > @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			VM_FAULT_SET_HINDEX(h - hstates);
> >  	}
> >
> > -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> > +	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
>
> Strictly speaking we should provide current->mm here because we are in
> the page fault path and mmap_sem is held for reading. This doesn't
> matter here though because huge_pmd_share will take it for reading so
> nesting wouldn't hurt. Maybe a small comment that this is intentional
> and correct would be nice.
>
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri 20-07-12 15:11:08, Mel Gorman wrote:
> Sorry for the resend, I did not properly refresh Cong Wang's suggested
> fix. This V2 is still the mmap_sem approach that fixes a potential
> deadlock problem pointed out by Michal.
>
> Changelog since V1
> o Correct cut & paste error in race description (hugh)
> o Handle potential deadlock during fork (mhocko)
> o Reorder unlocking (wangcong)
>
> If a process creates a large hugetlbfs mapping that is eligible for
> page table sharing and forks heavily with children, some of whom fault
> and others of which destroy the mapping, then it is possible for page
> tables to get corrupted. Some teardowns of the mapping encounter a
> "bad pmd" and output a message to the kernel log. The final teardown
> will trigger a BUG_ON in mm/filemap.c.
>
> This was reproduced in 3.4 but is known to have existed for a long time
> and goes back at least as far as 2.6.37. It was probably introduced in
> 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
> look like this;
>
> [ ..] Lots of bad pmd messages followed by this
> [ 127.164256] mm/memory.c:391: bad pmd 880412e04fe8(8003de4000e7).
> [ 127.164257] mm/memory.c:391: bad pmd 880412e04ff0(8003de6000e7).
> [ 127.164258] mm/memory.c:391: bad pmd 880412e04ff8(8003dee7).
> [ 127.186778] [ cut here ]
> [ 127.186781] kernel BUG at mm/filemap.c:134!
> [ 127.186782] invalid opcode: [#1] SMP
> [ 127.186783] CPU 7
> [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
> [ 127.186801]
> [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
> [ 127.186804] RIP: 0010:[810ed6ce] [810ed6ce] __delete_from_page_cache+0x15e/0x160
> [ 127.186809] RSP: :8804144b5c08 EFLAGS: 00010002
> [ 127.186810] RAX: 0001 RBX: ea000a5c9000 RCX: ffc0
> [ 127.186811] RDX: RSI: 0009 RDI: 88042dfdad00
> [ 127.186812] RBP: 8804144b5c18 R08: 0009 R09: 0003
> [ 127.186813] R10: R11: 002d R12: 880412ff83d8
> [ 127.186814] R13: 880412ff83d8 R14: R15: 880412ff83d8
> [ 127.186815] FS: 7fe18ed2c700() GS:88042dce() knlGS:
> [ 127.186816] CS: 0010 DS: ES: CR0: 8005003b
> [ 127.186817] CR2: 7fe34503 CR3: 000417a14000 CR4: 000407e0
> [ 127.186818] DR0: DR1: DR2:
> [ 127.186819] DR3: DR6: 0ff0 DR7: 0400
> [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo 8804144b4000, task 880417f803c0)
> [ 127.186821] Stack:
> [ 127.186822]  ea000a5c9000 8804144b5c48 810ed83b
> [ 127.186824]  8804144b5c48 138a 1387 8804144b5c98
> [ 127.186825]  8804144b5d48 811bc925 8804144b5cb8
> [ 127.186827] Call Trace:
> [ 127.186829] [810ed83b] delete_from_page_cache+0x3b/0x80
> [ 127.186832] [811bc925] truncate_hugepages+0x115/0x220
> [ 127.186834] [811bca43] hugetlbfs_evict_inode+0x13/0x30
> [ 127.186837] [811655c7] evict+0xa7/0x1b0
> [ 127.186839] [811657a3] iput_final+0xd3/0x1f0
> [ 127.186840] [811658f9] iput+0x39/0x50
> [ 127.186842] [81162708] d_kill+0xf8/0x130
> [ 127.186843] [81162812] dput+0xd2/0x1a0
> [ 127.186845] [8114e2d0] __fput+0x170/0x230
> [ 127.186848] [81236e0e] ? rb_erase+0xce/0x150
> [ 127.186849] [8114e3ad] fput+0x1d/0x30
> [ 127.186851] [81117db7] remove_vma+0x37/0x80
> [ 127.186853] [81119182] do_munmap+0x2d2/0x360
> [ 127.186855] [811cc639] sys_shmdt+0xc9/0x170
> [ 127.186857] [81410a39] system_call_fastpath+0x16/0x1b
> [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
> [ 127.186868] RIP [810ed6ce] __delete_from_page_cache+0x15e/0x160
> [ 127.186870] RSP 8804144b5c08
> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---
>
> The bug is a race and not always easy to reproduce. To reproduce it I
> was doing the following on a single socket I7-based machine with 16G of
> RAM.
>
> $ hugeadm --pool-pages-max DEFAULT:13G
> $ echo
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> <SNIP>
> > Signed-off-by: Mel Gorman mgor...@suse.de
>
> Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> free_pgtables are executed atomically wrt. huge_pmd_share so it doesn't
> see non-NULL spte on the way out.

Yes.

> I am just wondering whether we need the page_table_lock as well. It is
> not harmful but I guess we can drop it because both exit_mmap and shmdt
> are not taking it and mmap_sem is sufficient for them.

While it is true that we don't *really* need page_table_lock here, we
are still updating page tables and it's in line with the ordinary
locking rules. There are other cases in hugetlb.c where we do pte_same()
checks even though we are protected from the related races by the
instantiation_mutex.

page_table_lock is actually a bit useless for shared page tables. If
shared page tables were ever to be a general thing then I think we'd
have to revisit how PTE update locking is done, but I doubt anyone wants
to dive down that rat-hole. For now, I'm going to keep taking it even if
strictly speaking it's not necessary.

> One more nit below. I will send my version of the fix which took
> another path as a reply to this email to have something to compare
> with. Thanks.
>
> [...]
>
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index f6679a7..944b2df 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -58,7 +58,8 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> >  /*
> >   * search for a shareable pmd page for hugetlb.
> >   */
> > -static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > +static void huge_pmd_share(struct mm_struct *mm, struct mm_struct *locked_mm,
> > +			   unsigned long addr, pud_t *pud)
> >  {
> >  	struct vm_area_struct *vma = find_vma(mm, addr);
> >  	struct address_space *mapping = vma->vm_file->f_mapping;
> > @@ -68,14 +69,40 @@ static void huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >  	struct vm_area_struct *svma;
> >  	unsigned long saddr;
> >  	pte_t *spte = NULL;
> > +	spinlock_t *spage_table_lock = NULL;
> > +	struct rw_semaphore *smmap_sem = NULL;
> >
> >  	if (!vma_shareable(vma, addr))
> >  		return;
> >
> > +retry:
> >  	mutex_lock(&mapping->i_mmap_mutex);
> >  	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
> >  		if (svma == vma)
> >  			continue;
> > +		if (svma->vm_mm == vma->vm_mm)
> > +			continue;
> > +
> > +		/*
> > +		 * The target mm could be in the process of tearing down
> > +		 * its page tables and the i_mmap_mutex on its own is
> > +		 * not sufficient. To prevent races against teardown and
> > +		 * pagetable updates, we acquire the mmap_sem and pagetable
> > +		 * lock of the remote address space. down_read_trylock()
> > +		 * is necessary as the other process could also be trying
> > +		 * to share pagetables with the current mm. In the fork
> > +		 * case, we are already both mm's so check for that
> > +		 */
> > +		if (locked_mm != svma->vm_mm) {
> > +			if (!down_read_trylock(&svma->vm_mm->mmap_sem)) {
> > +				mutex_unlock(&mapping->i_mmap_mutex);
> > +				goto retry;
> > +			}
> > +			smmap_sem = &svma->vm_mm->mmap_sem;
> > +		}
> > +
> > +		spage_table_lock = &svma->vm_mm->page_table_lock;
> > +		spin_lock_nested(spage_table_lock, SINGLE_DEPTH_NESTING);
> >
> >  		saddr = page_table_shareable(svma, vma, addr, idx);
> >  		if (saddr) {
>
> [...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ae8f708..4832277 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2244,7 +2244,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> >  		src_pte = huge_pte_offset(src, addr);
> >  		if (!src_pte)
> >  			continue;
> > -		dst_pte = huge_pte_alloc(dst, addr, sz);
> > +		dst_pte = huge_pte_alloc(dst, src, addr, sz);
> >  		if (!dst_pte)
> >  			goto nomem;
> > @@ -2745,7 +2745,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			VM_FAULT_SET_HINDEX(h - hstates);
> >  	}
> >
> > -	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
> > +	ptep = huge_pte_alloc(mm, NULL, address, huge_page_size(h));
>
> strictly speaking we should provide current->mm here because we are in
> the page fault path and mmap_sem is held for reading. This doesn't
> matter here though because huge_pmd_share will take it for reading so
> nesting wouldn't hurt. Maybe a small comment that this is intentional
> and correct would be nice.

Fair point. If we go with this version of the fix, I'll improve the
documentation a bit. Thanks!

> >  	if (!ptep)
> >  		return
Re: [PATCH] mm: hugetlbfs: Close race during teardown of hugetlbfs shared page tables V2 (resend)
On Fri 20-07-12 15:37:53, Mel Gorman wrote:
> On Fri, Jul 20, 2012 at 04:29:20PM +0200, Michal Hocko wrote:
> > <SNIP>
> > > Signed-off-by: Mel Gorman mgor...@suse.de
> >
> > Yes this looks correct. mmap_sem will make sure that unmap_vmas and
> > free_pgtables are executed atomically wrt. huge_pmd_share so it
> > doesn't see non-NULL spte on the way out.
>
> Yes.
>
> > I am just wondering whether we need the page_table_lock as well. It
> > is not harmful but I guess we can drop it because both exit_mmap and
> > shmdt are not taking it and mmap_sem is sufficient for them.
>
> While it is true that we don't *really* need page_table_lock here, we
> are still updating page tables and it's in line with the ordinary
> locking rules. There are other cases in hugetlb.c where we do
> pte_same() checks even though we are protected from the related races
> by the instantiation_mutex.
>
> page_table_lock is actually a bit useless for shared page tables. If
> shared page tables were ever to be a general thing then I think we'd
> have to revisit how PTE update locking is done, but I doubt anyone
> wants to dive down that rat-hole. For now, I'm going to keep taking it
> even if strictly speaking it's not necessary.

Fair enough

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic