Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On Fri, May 24, 2019 at 09:07:11AM -0700, Larry Bassel wrote:
> Again, I don't think this can happen in DAX. The only sharing allowed
> is for FS/DAX/2MiB pagesize.

Hm. I still don't follow. How do you guarantee that DAX actually allocated
contiguous space for the file on backing storage, so that you can map it
with a PMD page? I believe you don't have such a guarantee.

-- 
Kirill A. Shutemov
Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On Fri, May 24, 2019 at 9:07 AM Larry Bassel wrote:
> On 14 May 19 16:01, Kirill A. Shutemov wrote:
> > On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
[..]
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index f7d962d..4c1814c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > >  	return 0;
> > >  }
> > >  
> > > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > > +static pmd_t *huge_pmd_offset(struct mm_struct *mm,
> > > +			      unsigned long addr, unsigned long sz)
> >
> > Could you explain what this function is supposed to do?
> >
> > As far as I can see vma_mmu_pagesize() is always PAGE_SIZE of DAX
> > filesystem. So we have 'sz' == PAGE_SIZE here.
>
> I thought so too, but in my testing I found that vma_mmu_pagesize() returns
> 4KiB, which differs from the DAX filesystem's 2MiB pagesize.

A given filesystem-dax vma is allowed to support both 4K and 2M mappings,
so vma_mmu_pagesize() is not granular enough to describe the capabilities
of a filesystem-dax vma. In the device-dax case, where there are mapping
guarantees, the implementation does arrange for vma_mmu_pagesize() to
reflect the right page size.
Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On 14 May 19 16:01, Kirill A. Shutemov wrote:
> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
[trim]
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1747,6 +1747,33 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> >  	mm_dec_nr_ptes(mm);
> >  }
> >  
> > +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> > +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> > +			    pmd_t *pmdp)
> > +{
> > +	pgd_t *pgd = pgd_offset(mm, addr);
> > +	p4d_t *p4d = p4d_offset(pgd, addr);
> > +	pud_t *pud = pud_offset(p4d, addr);
> > +
> > +	WARN_ON(page_count(virt_to_page(pmdp)) == 0);
> > +	if (page_count(virt_to_page(pmdp)) == 1)
> > +		return 0;
> > +
> > +	pud_clear(pud);
>
> You don't have proper locking in place to do this.

This code is based on and very similar to the code in mm/hugetlb.c
(huge_pmd_unshare()). I asked Mike Kravetz why the locking in
huge_pmd_share() and huge_pmd_unshare() is correct. The issue (as you
point out later in your email) is whether in both of those cases it is
OK to take the PMD table lock and then modify the PUD table. He responded
with the following analysis:

- I went back and looked at the locking in the hugetlb code. Here is most
  of the code for huge_pmd_share():

	i_mmap_lock_write(mapping);
	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			spte = huge_pte_offset(svma->vm_mm, saddr,
					       vma_mmu_pagesize(svma));
			if (spte) {
				get_page(virt_to_page(spte));
				break;
			}
		}
	}

	if (!spte)
		goto out;

	ptl = huge_pte_lock(hstate_vma(vma), mm, spte);

The primary reason the page table lock is taken here is for the purpose
of checking and possibly updating the PUD (pointer to PMD page). Note
that by the time we get here we already have found a PMD page to share.
Also note that the lock taken is the one associated with the PMD page.
The synchronization question to ask is: can anyone else modify the PUD
value while I am holding the PMD lock? In general, the answer is yes.
However, we can infer something subtle about the shared PMD case. Suppose
someone else wanted to set the PUD value. The only value they could set
it to is the PMD page we found in this routine. They also would need to
go through this routine to set the value, and they would need to take the
lock on the same shared PMD. Actually, they would hit the
mapping->i_mmap_rwsem first. But the bottom line is that nobody else can
set it.

What about clearing? In the hugetlb case, the only places where the PUD
gets cleared are final page table tear down and huge_pmd_unshare(). The
final page table tear down case is not interesting, as the process is
exiting. All callers of huge_pmd_unshare() must hold the (PMD) page table
lock. This is a requirement. Therefore, within a single process this
synchronizes two threads: one calling huge_pmd_share() and another
huge_pmd_unshare().

I assert that the same analysis applies to pmd_share() and
unshare_huge_pmd(), which are added in this patch.

> > +	put_page(virt_to_page(pmdp));
> > +	mm_dec_nr_pmds(mm);
> > +	return 1;
> > +}
> > +
> > +#else
> > +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> > +			    pmd_t *pmdp)
> > +{
> > +	return 0;
> > +}
> > +
> > +#endif
> > +
> >  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		 pmd_t *pmd, unsigned long addr)
> >  {
> > @@ -1764,6 +1791,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  	 * pgtable_trans_huge_withdraw after finishing pmdp related
> >  	 * operations.
> >  	 */
> > +	if (unshare_huge_pmd(vma->vm_mm, addr, pmd)) {
> > +		spin_unlock(ptl);
> > +		return 1;
> > +	}
> > +
> >  	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> >  						tlb->fullmm);
> >  	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 641cedf..919a290 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4594,9 +4594,9 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> >  }
> >  
> >  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> > -static unsigned long page_table_shareable(struct vm_area_struct *svma,
> > -					  struct vm_area_struct *vma,
> > -					  unsigned long addr, pgoff_t idx)
Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
> This is based on (but somewhat different from) what hugetlbfs
> does to share/unshare page tables.
>
> Signed-off-by: Larry Bassel
> ---
>  include/linux/hugetlb.h |   4 ++
>  mm/huge_memory.c        |  32 ++++++++
>  mm/hugetlb.c            |  21 ++++--
>  mm/memory.c             | 108 +++++++++++++++-
>  4 files changed, 160 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 11943b6..9ed9542 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -142,6 +142,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end);
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx);
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
>  struct page *follow_huge_pd(struct vm_area_struct *vma,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b6a34b3..e1627c3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1747,6 +1747,33 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
>  	mm_dec_nr_ptes(mm);
>  }
>  
> +#ifdef CONFIG_MAY_SHARE_FSDAX_PMD
> +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> +			    pmd_t *pmdp)
> +{
> +	pgd_t *pgd = pgd_offset(mm, addr);
> +	p4d_t *p4d = p4d_offset(pgd, addr);
> +	pud_t *pud = pud_offset(p4d, addr);
> +
> +	WARN_ON(page_count(virt_to_page(pmdp)) == 0);
> +	if (page_count(virt_to_page(pmdp)) == 1)
> +		return 0;
> +
> +	pud_clear(pud);

You don't have proper locking in place to do this.
> +	put_page(virt_to_page(pmdp));
> +	mm_dec_nr_pmds(mm);
> +	return 1;
> +}
> +
> +#else
> +static int unshare_huge_pmd(struct mm_struct *mm, unsigned long addr,
> +			    pmd_t *pmdp)
> +{
> +	return 0;
> +}
> +
> +#endif
> +
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 pmd_t *pmd, unsigned long addr)
>  {
> @@ -1764,6 +1791,11 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	 * pgtable_trans_huge_withdraw after finishing pmdp related
>  	 * operations.
>  	 */
> +	if (unshare_huge_pmd(vma->vm_mm, addr, pmd)) {
> +		spin_unlock(ptl);
> +		return 1;
> +	}
> +
>  	orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  						tlb->fullmm);
>  	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 641cedf..919a290 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4594,9 +4594,9 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  }
>  
>  #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> -static unsigned long page_table_shareable(struct vm_area_struct *svma,
> -					  struct vm_area_struct *vma,
> -					  unsigned long addr, pgoff_t idx)
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx)
>  {
>  	unsigned long saddr = ((idx - svma->vm_pgoff) << PAGE_SHIFT) +
>  				svma->vm_start;
> @@ -4619,7 +4619,7 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>  	return saddr;
>  }
>  
> -static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	unsigned long base = addr & PUD_MASK;
>  	unsigned long end = base + PUD_SIZE;
> @@ -4763,6 +4763,19 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
>  {
>  }
> +
> +unsigned long page_table_shareable(struct vm_area_struct *svma,
> +				   struct vm_area_struct *vma,
> +				   unsigned long addr, pgoff_t idx)
> +{
> +	return 0;
> +}
> +
> +bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return false;
> +}
> +
>  #define want_pmd_share()	(0)
>  #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index f7d962d..4c1814c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3845,6 +3845,109 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>
Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX
On 5/10/19 9:16 AM, Larry Bassel wrote:
> On 09 May 19 09:49, Matthew Wilcox wrote:
>> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
>>> This is based on (but somewhat different from) what hugetlbfs
>>> does to share/unshare page tables.
>>
>> Wow, that worked out far more cleanly than I was expecting to see.
>
> Yes, I was pleasantly surprised. As I've mentioned already, I
> think this is at least partially due to the nature of DAX.

I have not looked in detail to make sure this is indeed all the places
you need to hook and special case for sharing/unsharing. Since this
scheme is somewhat like that used for hugetlb, I just wanted to point
out some nasty bugs related to hugetlb PMD sharing that were fixed
last year:

5e41540c8a0f hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
dff11abe280b hugetlb: take PMD sharing into account when flushing tlb/caches
017b1660df89 mm: migration: fix migration of huge PMD shared pages

The common issue in these is that when unmapping a page with a shared
PMD mapping, you need to flush the entire shared range and not just the
unmapped page. The above changes were hugetlb specific; I do not know
whether any of this applies in the case of DAX.

-- 
Mike Kravetz