Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races

2018-10-08 Thread Mike Kravetz
On 10/8/18 1:03 AM, Kirill A. Shutemov wrote:
> On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
>> The following hugetlbfs truncate/page fault race can be recreated
>> with programs doing something like this.
>>
>> A hugetlbfs file is mmap(MAP_SHARED) with a size of 4 pages.  At
>> mmap time, 4 huge pages are reserved for the file/mapping.  So,
>> the global reserve count is 4.  In addition, since this is a shared
>> mapping, an entry for 4 pages is added to the file's reserve map.
>> The first 3 of the 4 pages are faulted into the file.  As a result,
>> the global reserve count is now 1.
>>
>> Task A starts to fault in the last page (routines hugetlb_fault,
>> hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
>> The reserve map indicates there is a reserved page, so this is
>> used and the global reserve count goes to 0.
>>
>> Now, task B truncates the file to size 0.  It starts by setting
>> inode size to 0 (hugetlb_vmtruncate).  It then unmaps all mappings
>> of the file (hugetlb_vmdelete_list).  Since task A's page table
>> lock is not held at the time, truncation is not blocked.  Truncation
>> removes the 3 pages from the file (remove_inode_hugepages).  When
>> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
>> the reserve map was for 4 pages.  However, it has only freed 3 pages.
>> So it assumes there is still (4 - 3) = 1 reserved page.  It then
>> decrements the global reserve count by 1 and it goes negative.
>>
>> Task A then continues the page fault process and adds its newly
>> acquired page to the page cache.  Note that the index of this page
>> is beyond the size of the truncated file (0).  The page fault process
>> then notices the file has been truncated and exits.  However, the
>> page is left in the cache associated with the file.
>>
>> Now, if the file is immediately deleted the truncate code runs again.
>> It will find and free the one page associated with the file.  When
>> cleaning up reserves, it notices the reserve map is empty.  Yet, one
>> page was freed.  So, the global reserve count is decremented by (0 - 1) = -1.
>> This returns the global count to 0 as it should be.  But, it is
>> possible for someone else to mmap this file/range before it is deleted.
>> If this happens, a reserve map entry for the allocated page is created
>> and the reserved page is forever leaked.
>>
>> To avoid all these conditions, let's simply prevent faults to a file
>> while it is being truncated.  Add a new truncation-specific rw mutex
>> to hugetlbfs inode extensions.  Faults take the mutex in read mode;
>> truncation takes it in write mode.
> 
> Hm. Don't we already have a lock for this? I mean i_mmap_lock.
> 

Thanks Kirill,

Yes, we could use i_mmap_rwsem for this synchronization.  I don't
see anyone else using the mutex in this manner.  hugetlb code only
explicitly takes this mutex in write mode today.  I suspect that is not
optimal and could be improved.  Certainly, the use within
hugetlb_fault->huge_pte_alloc->huge_pmd_share would need to be changed
if we always wanted to take the mutex in read mode during faults.

I'll work on the changes to use i_mmap_rwsem.
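
Roughly, the direction is sketched below.  This is illustration only,
not the eventual patch; the exact call sites and lock ordering still
need to be worked out.

	/* truncation side: already takes i_mmap_rwsem in write mode */
	i_mmap_lock_write(mapping);	/* down_write(&mapping->i_mmap_rwsem) */
	hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
	i_mmap_unlock_write(mapping);

	/*
	 * Fault side: hugetlb_fault/hugetlb_no_page would take it in read
	 * mode instead, so faults can run concurrently with each other
	 * but are excluded while a truncation holds it in write mode.
	 */
	i_mmap_lock_read(mapping);	/* down_read(&mapping->i_mmap_rwsem) */
	/* ... allocate the huge page and set up the mapping ... */
	i_mmap_unlock_read(mapping);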

However, right now our DB team informs me that the truncate/fault race
is not the cause of their issue with the huge page reserve count going
negative.
So, I am searching for more bugs in this area.  Found another where an
allocation for migration could race with a fault in a VM_NORESERVE vma.
But, there were no migrations noted on the system, so there must be another
bug.  Sigh!
-- 
Mike Kravetz


Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races

2018-10-08 Thread Kirill A. Shutemov
On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
> The following hugetlbfs truncate/page fault race can be recreated
> with programs doing something like this.
> 
> A hugetlbfs file is mmap(MAP_SHARED) with a size of 4 pages.  At
> mmap time, 4 huge pages are reserved for the file/mapping.  So,
> the global reserve count is 4.  In addition, since this is a shared
> mapping, an entry for 4 pages is added to the file's reserve map.
> The first 3 of the 4 pages are faulted into the file.  As a result,
> the global reserve count is now 1.
> 
> Task A starts to fault in the last page (routines hugetlb_fault,
> hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
> The reserve map indicates there is a reserved page, so this is
> used and the global reserve count goes to 0.
> 
> Now, task B truncates the file to size 0.  It starts by setting
> inode size to 0 (hugetlb_vmtruncate).  It then unmaps all mappings
> of the file (hugetlb_vmdelete_list).  Since task A's page table
> lock is not held at the time, truncation is not blocked.  Truncation
> removes the 3 pages from the file (remove_inode_hugepages).  When
> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
> the reserve map was for 4 pages.  However, it has only freed 3 pages.
> So it assumes there is still (4 - 3) = 1 reserved page.  It then
> decrements the global reserve count by 1 and it goes negative.
> 
> Task A then continues the page fault process and adds its newly
> acquired page to the page cache.  Note that the index of this page
> is beyond the size of the truncated file (0).  The page fault process
> then notices the file has been truncated and exits.  However, the
> page is left in the cache associated with the file.
> 
> Now, if the file is immediately deleted the truncate code runs again.
> It will find and free the one page associated with the file.  When
> cleaning up reserves, it notices the reserve map is empty.  Yet, one
> page was freed.  So, the global reserve count is decremented by (0 - 1) = -1.
> This returns the global count to 0 as it should be.  But, it is
> possible for someone else to mmap this file/range before it is deleted.
> If this happens, a reserve map entry for the allocated page is created
> and the reserved page is forever leaked.
> 
> To avoid all these conditions, let's simply prevent faults to a file
> while it is being truncated.  Add a new truncation-specific rw mutex
> to hugetlbfs inode extensions.  Faults take the mutex in read mode;
> truncation takes it in write mode.

Hm. Don't we already have a lock for this? I mean i_mmap_lock.

-- 
 Kirill A. Shutemov


[PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races

2018-10-07 Thread Mike Kravetz
The following hugetlbfs truncate/page fault race can be recreated
with programs doing something like this.

A hugetlbfs file is mmap(MAP_SHARED) with a size of 4 pages.  At
mmap time, 4 huge pages are reserved for the file/mapping.  So,
the global reserve count is 4.  In addition, since this is a shared
mapping, an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file.  As a result,
the global reserve count is now 1.

Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.

Now, task B truncates the file to size 0.  It starts by setting
inode size to 0 (hugetlb_vmtruncate).  It then unmaps all mappings
of the file (hugetlb_vmdelete_list).  Since task A's page table
lock is not held at the time, truncation is not blocked.  Truncation
removes the 3 pages from the file (remove_inode_hugepages).  When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages.  However, it has only freed 3 pages.
So it assumes there is still (4 - 3) = 1 reserved page.  It then
decrements the global reserve count by 1 and it goes negative.

Task A then continues the page fault process and adds its newly
acquired page to the page cache.  Note that the index of this page
is beyond the size of the truncated file (0).  The page fault process
then notices the file has been truncated and exits.  However, the
page is left in the cache associated with the file.

Now, if the file is immediately deleted the truncate code runs again.
It will find and free the one page associated with the file.  When
cleaning up reserves, it notices the reserve map is empty.  Yet, one
page was freed.  So, the global reserve count is decremented by (0 - 1) = -1.
This returns the global count to 0 as it should be.  But, it is
possible for someone else to mmap this file/range before it is deleted.
If this happens, a reserve map entry for the allocated page is created
and the reserved page is forever leaked.
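
For illustration, a reproducer is sketched below.  The hugetlbfs mount
point, the 2MB huge page size, and the fork() based timing are
assumptions, and since the race window is small it may take many
iterations to hit:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */
#define NR_PAGES	4UL

/* If the truncate wins the race outright, the final touch SIGBUSes. */
static void sigbus(int sig)
{
	_exit(0);
}

int main(void)
{
	char *addr;
	pid_t pid;
	int fd;

	signal(SIGBUS, sigbus);

	/* assumed hugetlbfs mount point */
	fd = open("/dev/hugepages/race", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* reserves 4 huge pages; file's reserve map covers 4 pages */
	addr = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* fault in the first 3 pages; global reserve count drops to 1 */
	for (unsigned long i = 0; i < NR_PAGES - 1; i++)
		addr[i * HPAGE_SIZE] = 1;

	pid = fork();
	if (pid == 0) {				/* task B */
		ftruncate(fd, 0);		/* truncate file to size 0 */
		_exit(0);
	}

	/* task A: fault in the last page, racing with the truncate */
	addr[(NR_PAGES - 1) * HPAGE_SIZE] = 1;

	waitpid(pid, NULL, 0);
	unlink("/dev/hugepages/race");
	return 0;
}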

To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated.  Add a new truncation-specific rw mutex
to hugetlbfs inode extensions.  Faults take the mutex in read mode;
truncation takes it in write mode.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c    | 24 ++++++++++++++++++------
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 25 +++++++++++++++++++++----
 mm/userfaultfd.c        |  8 +++++++-
 4 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 40d4c66c7751..07b0ba049c37 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -427,10 +427,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
+			/*
+			 * Only need to acquire fault mutex in hole punch case.
+			 * For truncation, we are synchronized via truncation
+			 * mutex.
+			 */
+			if (!truncate_op) {
+				hash = hugetlb_fault_mutex_hash(h, current->mm,
 							&pseudo_vma,
 							mapping, index, 0);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			}
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -471,7 +478,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			if (!truncate_op)
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -498,16 +506,19 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
+	down_write(&info->trunc_rwsem);
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);