Re: [RFC PATCH 2/3] hugetlbfs: introduce hinode_rwsem for pmd sharing synchronization

2020-10-15 Thread Mike Kravetz
On 10/15/20 4:05 PM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Oct 13, 2020 at 04:10:59PM -0700, Mike Kravetz wrote:
>> Due to pmd sharing, the huge PTE pointer returned by huge_pte_alloc
>> may not be valid.  This can happen if a call to huge_pmd_unshare for
>> the same pmd is made in another thread.
>>
>> To address this issue, add a rw_semaphore (hinode_rwsem) to the hugetlbfs
>> inode.
>> - hinode_rwsem is taken in read mode before calling huge_pte_alloc, and
>>   held until finished with the returned pte pointer.
>> - hinode_rwsem is held in write mode whenever huge_pmd_unshare is called.
>>
>> In the locking hierarchy, hinode_rwsem must be taken before a page lock.
>>
>> In an effort to minimize performance impacts, hinode_rwsem is not taken
>> if the caller knows the target can not possibly be part of a shared pmd.
>> lockdep_assert calls are added to huge_pmd_share and huge_pmd_unshare to
>> help catch callers not using the proper locking.
>>
>> Signed-off-by: Mike Kravetz 
> 
> Hi Mike,
> 
> I didn't find a problem with the main idea of introducing hinode_rwsem, so
> I'm fine once the known problems are fixed.

Thank you for taking a look, Naoya!

I have been trying to address these race issues for some time.  The issues
have been there since the pmd sharing code was introduced.  Fortunately,
it is not easy to hit the issue.  However, targeted test programs can cause
BUGs.

> I have one question.  This patch seems to make sure that huge_pmd_unshare()
> is called while holding hinode_rwsem in write mode in some cases.  Some
> callers of try_to_unmap() do not seem to hold it, like shrink_page_list() and
> unmap_page(), which is OK because they never call try_to_unmap() for hugetlb
> pages.  And unmap_ref_private() doesn't take hinode_rwsem either, which is
> also OK because this function never handles the pmd sharing case.  So
> what about unmap_single_vma()?  It seems that this generic function could
> reach huge_pmd_unshare() without hinode_rwsem, so what prevents the race here?
> (Maybe I'm missing some assumption or condition around this race...)

You are not missing anything.  I mistakenly left out the locking code in
unmap_single_vma().  Had I run some tests with lockdep enabled, the new
lock checking code would have caught it.

> 
> I left a few other minor comments below ...

I will address the below issues in the next revision.

Thanks again for taking a look.
-- 
Mike Kravetz

> 
>> @@ -4424,6 +4442,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct 
>> vm_area_struct *vma,
>>  
>>  ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>>  if (ptep) {
>> +/*
>> + * Since we hold no locks, ptep could be stale.  That is
>> + * OK as we are only making decisions based on content and
>> + * not actually modifying content here.
>> + */
> 
> nice comment, thank you.
> 
>>  entry = huge_ptep_get(ptep);
>>  if (unlikely(is_hugetlb_entry_migration(entry))) {
>>  migration_entry_wait_huge(vma, mm, ptep);
>> @@ -4431,20 +4454,32 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, 
>> struct vm_area_struct *vma,
>>  } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>>  return VM_FAULT_HWPOISON_LARGE |
>>  VM_FAULT_SET_HINDEX(hstate_index(h));
>> -} else {
>> -ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
>> -if (!ptep)
>> -return VM_FAULT_OOM;
>>  }
>>  
>> +/*
>> + * Acquire hinode_sem before calling huge_pte_alloc and hold
> 
>hinode_rwsem?
> 
>> + * until finished with ptep.  This prevents huge_pmd_unshare from
>> + * being called elsewhere and making the ptep no longer valid.
>> + *
>> + * ptep could have already be assigned via huge_pte_offset.  That
>> + * is OK, as huge_pte_alloc will return the same value unless
>> + * something has changed.
>> + */
> 
> ... 
> 
>> @@ -278,10 +278,14 @@ static __always_inline ssize_t 
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  BUG_ON(dst_addr >= dst_start + len);
>>  
>>  /*
>> - * Serialize via hugetlb_fault_mutex
>> + * Serialize via hinode_rwsem hugetlb_fault_mutex.
>  ^ "and" here?
> 
>> + * hinode_rwsem ensures the dst_pte remains valid even
>> + * in the case of shared pmds.  fault mutex prevents
>> + * races with other faulting threads.
>>   */
>>  idx = linear_page_index(dst_vma, dst_addr);
>>  mapping = dst_vma->vm_file->f_mapping;
>> +hinode_lock_read(mapping, dst_vma, dst_addr);
>>  hash = hugetlb_fault_mutex_hash(mapping, idx);
>>  mutex_lock(&hugetlb_fault_mutex_table[hash]);
> 
> 
> Thanks,
> Naoya Horiguchi
> 


[RFC PATCH 2/3] hugetlbfs: introduce hinode_rwsem for pmd sharing synchronization

2020-10-13 Thread Mike Kravetz
Due to pmd sharing, the huge PTE pointer returned by huge_pte_alloc
may not be valid.  This can happen if a call to huge_pmd_unshare for
the same pmd is made in another thread.

To address this issue, add a rw_semaphore (hinode_rwsem) to the hugetlbfs
inode.
- hinode_rwsem is taken in read mode before calling huge_pte_alloc, and
  held until finished with the returned pte pointer.
- hinode_rwsem is held in write mode whenever huge_pmd_unshare is called.

In the locking hierarchy, hinode_rwsem must be taken before a page lock.

In an effort to minimize performance impacts, hinode_rwsem is not taken
if the caller knows the target can not possibly be part of a shared pmd.
lockdep_assert calls are added to huge_pmd_share and huge_pmd_unshare to
help catch callers not using the proper locking.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c|  8 
 include/linux/hugetlb.h | 66 +--
 mm/hugetlb.c| 86 +
 mm/memory-failure.c | 15 ++-
 mm/migrate.c| 15 +++
 mm/rmap.c   |  3 +-
 mm/userfaultfd.c|  9 -
 7 files changed, 171 insertions(+), 31 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3a1246aeedc4..d6bb675d4872 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -499,13 +499,17 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 * Getting here in a truncate operation is a bug.
 */
if (unlikely(page_mapped(page))) {
+   struct hugetlbfs_inode_info *info =
+   HUGETLBFS_I(inode);
BUG_ON(truncate_op);
 
+   down_write(&info->hinode_rwsem);
i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
(index + 1) * pages_per_huge_page(h));
i_mmap_unlock_write(mapping);
+   up_write(&info->hinode_rwsem);
}
 
lock_page(page);
@@ -562,15 +566,18 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t 
offset)
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
struct hstate *h = hstate_inode(inode);
+   struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 
BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
 
+   down_write(&info->hinode_rwsem);
i_size_write(inode, offset);
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
+   up_write(&info->hinode_rwsem);
remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
 }
@@ -829,6 +836,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block 
*sb,
inode->i_atime = inode->i_mtime = inode->i_ctime = 
current_time(inode);
inode->i_mapping->private_data = resv_map;
info->seals = F_SEAL_SEAL;
+   init_rwsem(&info->hinode_rwsem);
switch (mode & S_IFMT) {
default:
init_special_inode(inode, mode, dev);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bf79a5601091..10322fdcf3d9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -163,7 +163,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
 pte_t *huge_pte_offset(struct mm_struct *mm,
   unsigned long addr, unsigned long sz);
-int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
+   unsigned long *addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
@@ -196,8 +197,9 @@ static inline unsigned long hugetlb_total_pages(void)
return 0;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
-   pte_t *ptep)
+static inline int huge_pmd_unshare(struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   unsigned long *addr, pte_t *ptep)
 {
return 0;
 }
@@ -414,6 +416,7 @@ struct hugetlbfs_inode_info {
struct shared_policy policy;
struct inode vfs_inode;
unsigned int seals;
+   struct rw_semaphore hinode_rws