Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
On 10/8/18 1:03 AM, Kirill A. Shutemov wrote:
> On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
>> The following hugetlbfs truncate/page fault race can be recreated
>> with programs doing something like the following.
>>
>> A hugetlbfs file is mmap(MAP_SHARED)ed with a size of 4 pages. At
>> mmap time, 4 huge pages are reserved for the file/mapping, so the
>> global reserve count is 4. In addition, since this is a shared
>> mapping, an entry for 4 pages is added to the file's reserve map.
>> The first 3 of the 4 pages are faulted into the file. As a result,
>> the global reserve count is now 1.
>>
>> Task A starts to fault in the last page (routines hugetlb_fault,
>> hugetlb_no_page). It allocates a huge page (alloc_huge_page).
>> The reserve map indicates there is a reserved page, so this is
>> used and the global reserve count goes to 0.
>>
>> Now, task B truncates the file to size 0. It starts by setting the
>> inode size to 0 (hugetlb_vmtruncate). It then unmaps all mappings
>> of the file (hugetlb_vmdelete_list). Since task A's page table
>> lock is not held at the time, truncation is not blocked. Truncation
>> removes the 3 pages from the file (remove_inode_hugepages). When
>> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
>> the reserve map was for 4 pages but only 3 pages were freed, so it
>> assumes there is still (4 - 3) = 1 reserved page. It then
>> decrements the global reserve count by 1, and the count goes
>> negative.
>>
>> Task A then continues the page fault process and adds its newly
>> acquired page to the page cache. Note that the index of this page
>> is beyond the size of the truncated file (0). The page fault process
>> then notices the file has been truncated and exits. However, the
>> page is left in the cache associated with the file.
>>
>> Now, if the file is immediately deleted, the truncate code runs
>> again. It will find and free the one page associated with the file.
>> When cleaning up reserves, it notices the reserve map is empty, yet
>> one page was freed. So, the global reserve count is decremented by
>> (0 - 1) = -1. This returns the global count to 0 as it should be.
>> But, it is possible for someone else to mmap this file/range before
>> it is deleted. If this happens, a reserve map entry for the
>> allocated page is created and the reserved page is forever leaked.
>>
>> To avoid all these conditions, let's simply prevent faults to a file
>> while it is being truncated. Add a new truncation-specific rw mutex
>> to the hugetlbfs inode extensions. Faults take the mutex in read
>> mode; truncation takes it in write mode.
>
> Hm. Don't we already have a lock for this? I mean i_mmap_lock.

Thanks Kirill,

Yes, we could use i_mmap_rwsem for this synchronization. I don't see
anyone else using the mutex in this manner. hugetlb code only
explicitly takes this mutex in write mode today. I suspect that is not
optimal and could be improved. Certainly, the use within
hugetlb_fault->huge_pte_alloc->huge_pmd_share would need to be changed
if we always wanted to take the mutex in read mode during faults.

I'll work on the changes to use i_mmap_rwsem. However, right now our
DB team informs me that the truncate/fault race is not the cause of
their huge page reserve count going negative. So, I am searching for
more bugs in this area. Found another where an allocation for
migration could race with a fault in a VM_NORESERVE vma. But, there
were no migrations noted on the system, so there must be another bug.
Sigh!
--
Mike Kravetz
Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
> The following hugetlbfs truncate/page fault race can be recreated
> with programs doing something like the following.
>
> A hugetlbfs file is mmap(MAP_SHARED)ed with a size of 4 pages. At
> mmap time, 4 huge pages are reserved for the file/mapping, so the
> global reserve count is 4. In addition, since this is a shared
> mapping, an entry for 4 pages is added to the file's reserve map.
> The first 3 of the 4 pages are faulted into the file. As a result,
> the global reserve count is now 1.
>
> Task A starts to fault in the last page (routines hugetlb_fault,
> hugetlb_no_page). It allocates a huge page (alloc_huge_page).
> The reserve map indicates there is a reserved page, so this is
> used and the global reserve count goes to 0.
>
> Now, task B truncates the file to size 0. It starts by setting the
> inode size to 0 (hugetlb_vmtruncate). It then unmaps all mappings
> of the file (hugetlb_vmdelete_list). Since task A's page table
> lock is not held at the time, truncation is not blocked. Truncation
> removes the 3 pages from the file (remove_inode_hugepages). When
> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
> the reserve map was for 4 pages but only 3 pages were freed, so it
> assumes there is still (4 - 3) = 1 reserved page. It then
> decrements the global reserve count by 1, and the count goes
> negative.
>
> Task A then continues the page fault process and adds its newly
> acquired page to the page cache. Note that the index of this page
> is beyond the size of the truncated file (0). The page fault process
> then notices the file has been truncated and exits. However, the
> page is left in the cache associated with the file.
>
> Now, if the file is immediately deleted, the truncate code runs
> again. It will find and free the one page associated with the file.
> When cleaning up reserves, it notices the reserve map is empty, yet
> one page was freed. So, the global reserve count is decremented by
> (0 - 1) = -1. This returns the global count to 0 as it should be.
> But, it is possible for someone else to mmap this file/range before
> it is deleted. If this happens, a reserve map entry for the
> allocated page is created and the reserved page is forever leaked.
>
> To avoid all these conditions, let's simply prevent faults to a file
> while it is being truncated. Add a new truncation-specific rw mutex
> to the hugetlbfs inode extensions. Faults take the mutex in read
> mode; truncation takes it in write mode.

Hm. Don't we already have a lock for this? I mean i_mmap_lock.

--
Kirill A. Shutemov
[PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
The following hugetlbfs truncate/page fault race can be recreated
with programs doing something like the following.

A hugetlbfs file is mmap(MAP_SHARED)ed with a size of 4 pages. At
mmap time, 4 huge pages are reserved for the file/mapping, so the
global reserve count is 4. In addition, since this is a shared
mapping, an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file. As a result,
the global reserve count is now 1.

Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page). It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.

Now, task B truncates the file to size 0. It starts by setting the
inode size to 0 (hugetlb_vmtruncate). It then unmaps all mappings
of the file (hugetlb_vmdelete_list). Since task A's page table
lock is not held at the time, truncation is not blocked. Truncation
removes the 3 pages from the file (remove_inode_hugepages). When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages but only 3 pages were freed, so it
assumes there is still (4 - 3) = 1 reserved page. It then decrements
the global reserve count by 1, and the count goes negative.

Task A then continues the page fault process and adds its newly
acquired page to the page cache. Note that the index of this page
is beyond the size of the truncated file (0). The page fault process
then notices the file has been truncated and exits. However, the
page is left in the cache associated with the file.

Now, if the file is immediately deleted, the truncate code runs
again. It will find and free the one page associated with the file.
When cleaning up reserves, it notices the reserve map is empty, yet
one page was freed. So, the global reserve count is decremented by
(0 - 1) = -1. This returns the global count to 0 as it should be.
But, it is possible for someone else to mmap this file/range before
it is deleted. If this happens, a reserve map entry for the allocated
page is created and the reserved page is forever leaked.

To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated. Add a new truncation-specific rw mutex
to the hugetlbfs inode extensions. Faults take the mutex in read
mode; truncation takes it in write mode.

Signed-off-by: Mike Kravetz
---
 fs/hugetlbfs/inode.c    | 24 ++++++++++++++++++------
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 25 +++++++++++++++++--------
 mm/userfaultfd.c        |  8 +++++---
 4 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 40d4c66c7751..07b0ba049c37 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -427,10 +427,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
+			/*
+			 * Only need to acquire fault mutex in hole punch case.
+			 * For truncation, we are synchronized via truncation
+			 * mutex.
+			 */
+			if (!truncate_op) {
+				hash = hugetlb_fault_mutex_hash(h, current->mm,
 							&pseudo_vma,
 							mapping, index, 0);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			}
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -471,7 +478,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			if (!truncate_op)
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -498,16 +506,19 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
+	down_write(&info->trunc_rwsem);
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offse