Re: [RESEND][PATCH 0/6] Constify struct page arguments
Looks good to me and I like the cleanup. For the series: Reviewed-by: William Kucharski > On Apr 16, 2021, at 5:15 PM, Matthew Wilcox (Oracle) > wrote: > > [I'm told that patches 2-6 did not make it to the list; resending and > cc'ing lkml this time] > > While working on various solutions to the 32-bit struct page size > regression, one of the problems I found was the networking stack expects > to be able to pass const struct page pointers around, and the mm doesn't > provide a lot of const-friendly functions to call. The root tangle of > problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls > dump_page(), which calls a lot of functions which don't take a const > struct page (but could be const). > > I have other things I need to work on, but I offer these patches as a few > steps towards being able to make dump_page() take a const page pointer. > > Matthew Wilcox (Oracle) (6): > mm: Make __dump_page static > mm/debug: Factor PagePoisoned out of __dump_page > mm/page_owner: Constify dump_page_owner > mm: Make compound_head const-preserving > mm: Constify get_pfnblock_flags_mask and get_pfnblock_migratetype > mm: Constify page_count and page_ref_count > > include/linux/mmdebug.h | 3 +-- > include/linux/page-flags.h | 10 +- > include/linux/page_owner.h | 6 +++--- > include/linux/page_ref.h| 4 ++-- > include/linux/pageblock-flags.h | 2 +- > mm/debug.c | 25 +++-- > mm/page_alloc.c | 16 > mm/page_owner.c | 2 +- > 8 files changed, 28 insertions(+), 40 deletions(-) > > -- > 2.30.2 > >
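As an illustration of the kind of change the series makes (a sketch, not the exact patch text): a read-only helper such as page_ref_count() can take a const pointer because atomic_read() already accepts a const atomic_t *, so callers like the networking code that hold a const struct page * no longer need to cast.

        static inline int page_ref_count(const struct page *page)
        {
                return atomic_read(&page->_refcount);
        }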
Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
Sounds good. Reviewed-by: William Kucharski > On Apr 6, 2021, at 11:48 AM, Collin Fijalkovich > wrote: > > Instrumenting filemap_nr_thps_inc() should be sufficient for ensuring > writable file mappings will not be THP-backed. > > If filemap_nr_thps_dec() in unaccount_page_cache_page() and > filemap_nr_thps() in do_dentry_open() race, the worst case is an > unnecessary truncation. We could introduce a memory barrier in > unaccount_page_cache_page(), but I'm unsure it would significantly > mitigate the risk of spurious truncations. Barring synchronization > between do_dentry_open() and ongoing page cache operations, there > could be an in-progress delete_from_page_cache_batch() that has not > yet updated accounting for THPs in its targeted range. > > -- Collin > > On Mon, Apr 5, 2021 at 7:05 PM William Kucharski > wrote: >> >> >> I saw a similar change a few years ago with my prototype: >> >> https://lore.kernel.org/linux-mm/5bb682e1-dd52-4aa9-83e9-def091e0c...@oracle.com/ >> >> the key being a very nice drop in iTLB-load-misses, so it looks like your >> code is >> having the right effect. >> >> What about the call to filemap_nr_thps_dec() in unaccount_page_cache_page() >> - does >> that need an smp_mb() as well? >> >> -- Bill >> >>> On Apr 5, 2021, at 6:15 PM, Collin Fijalkovich >>> wrote: >>> >>> v2 has been uploaded with performance testing results: >>> https://lore.kernel.org/patchwork/patch/1408266/ >>> >>> >>> >>> On Mon, Mar 22, 2021 at 2:59 PM Collin Fijalkovich >>> wrote: >>>> >>>> Transparent huge pages are supported for read-only non-shmem filesystems, >>>> but are only used for vmas with VM_DENYWRITE. This condition ensures that >>>> file THPs are protected from writes while an application is running >>>> (ETXTBSY). Any existing file THPs are then dropped from the page cache >>>> when a file is opened for write in do_dentry_open(). Since sys_mmap >>>> ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas >>>> produced by execve(). >>>> >>>> Systems that make heavy use of shared libraries (e.g. Android) are unable >>>> to apply VM_DENYWRITE through the dynamic linker, preventing them from >>>> benefiting from the resultant reduced contention on the TLB. >>>> >>>> This patch reduces the constraint on file THPs allowing use with any >>>> executable mapping from a file not opened for write (see >>>> inode_is_open_for_write()). It also introduces additional conditions to >>>> ensure that files opened for write will never be backed by file THPs. >>>> >>>> Restricting the use of THPs to executable mappings eliminates the risk that >>>> a read-only file later opened for write would encounter significant >>>> latencies due to page cache truncation. >>>> >>>> The ld linker flag '-z max-page-size=(hugepage size)' can be used to >>>> produce executables with the necessary layout. The dynamic linker must >>>> map these file's segments at a hugepage size aligned vma for the mapping to >>>> be backed with THPs. >>>> >>>> Signed-off-by: Collin Fijalkovich >>>> --- >>>> fs/open.c | 13 +++-- >>>> mm/khugepaged.c | 16 +++- >>>> 2 files changed, 26 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/fs/open.c b/fs/open.c >>>> index e53af13b5835..f76e960d10ea 100644 >>>> --- a/fs/open.c >>>> +++ b/fs/open.c >>>> @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f, >>>>* XXX: Huge page cache doesn't support writing yet. Drop all page >>>>* cache for this file before processing writes. 
>>>>*/ >>>> - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) >>>> - truncate_pagecache(inode, 0); >>>> + if (f->f_mode & FMODE_WRITE) { >>>> + /* >>>> +* Paired with smp_mb() in collapse_file() to ensure >>>> nr_thps >>>> +* is up to date and the update to i_writecount by >>>> +* get_write_access() is visible. Ensures subsequent >>>> insertion >>>> +* of THPs into the page cache will fail. >>
Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
I saw a similar change a few years ago with my prototype: https://lore.kernel.org/linux-mm/5bb682e1-dd52-4aa9-83e9-def091e0c...@oracle.com/ the key being a very nice drop in iTLB-load-misses, so it looks like your code is having the right effect. What about the call to filemap_nr_thps_dec() in unaccount_page_cache_page() - does that need an smp_mb() as well? -- Bill > On Apr 5, 2021, at 6:15 PM, Collin Fijalkovich > wrote: > > v2 has been uploaded with performance testing results: > https://lore.kernel.org/patchwork/patch/1408266/ > > > > On Mon, Mar 22, 2021 at 2:59 PM Collin Fijalkovich > wrote: >> >> Transparent huge pages are supported for read-only non-shmem filesystems, >> but are only used for vmas with VM_DENYWRITE. This condition ensures that >> file THPs are protected from writes while an application is running >> (ETXTBSY). Any existing file THPs are then dropped from the page cache >> when a file is opened for write in do_dentry_open(). Since sys_mmap >> ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas >> produced by execve(). >> >> Systems that make heavy use of shared libraries (e.g. Android) are unable >> to apply VM_DENYWRITE through the dynamic linker, preventing them from >> benefiting from the resultant reduced contention on the TLB. >> >> This patch reduces the constraint on file THPs allowing use with any >> executable mapping from a file not opened for write (see >> inode_is_open_for_write()). It also introduces additional conditions to >> ensure that files opened for write will never be backed by file THPs. >> >> Restricting the use of THPs to executable mappings eliminates the risk that >> a read-only file later opened for write would encounter significant >> latencies due to page cache truncation. >> >> The ld linker flag '-z max-page-size=(hugepage size)' can be used to >> produce executables with the necessary layout. The dynamic linker must >> map these file's segments at a hugepage size aligned vma for the mapping to >> be backed with THPs. >> >> Signed-off-by: Collin Fijalkovich >> --- >> fs/open.c | 13 +++-- >> mm/khugepaged.c | 16 +++- >> 2 files changed, 26 insertions(+), 3 deletions(-) >> >> diff --git a/fs/open.c b/fs/open.c >> index e53af13b5835..f76e960d10ea 100644 >> --- a/fs/open.c >> +++ b/fs/open.c >> @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f, >> * XXX: Huge page cache doesn't support writing yet. Drop all page >> * cache for this file before processing writes. >> */ >> - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) >> - truncate_pagecache(inode, 0); >> + if (f->f_mode & FMODE_WRITE) { >> + /* >> +* Paired with smp_mb() in collapse_file() to ensure nr_thps >> +* is up to date and the update to i_writecount by >> +* get_write_access() is visible. Ensures subsequent >> insertion >> +* of THPs into the page cache will fail. >> +*/ >> + smp_mb(); >> + if (filemap_nr_thps(inode->i_mapping)) >> + truncate_pagecache(inode, 0); >> + } >> >>return 0; >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index a7d6cb912b05..4c7cc877d5e3 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -459,7 +459,8 @@ static bool hugepage_vma_check(struct vm_area_struct >> *vma, >> >>/* Read-only file mappings need to be aligned for THP to work. 
*/ >>if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && >> - (vm_flags & VM_DENYWRITE)) { >> + !inode_is_open_for_write(vma->vm_file->f_inode) && >> + (vm_flags & VM_EXEC)) { >>return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - >> vma->vm_pgoff, >>HPAGE_PMD_NR); >>} >> @@ -1872,6 +1873,19 @@ static void collapse_file(struct mm_struct *mm, >>else { >>__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); >>filemap_nr_thps_inc(mapping); >> + /* >> +* Paired with smp_mb() in do_dentry_open() to ensure >> +* i_writecount is up to date and the update to nr_thps is >> +* visible. Ensures the page cache will be truncated if the >> +* file is opened writable. >> +*/ >> + smp_mb(); >> + if (inode_is_open_for_write(mapping->host)) { >> + result = SCAN_FAIL; >> + __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); >> + filemap_nr_thps_dec(mapping); >> + goto xa_locked; >> + } >>} >> >>if (nr_none) { >> -- >> 2.31.0.rc2.261.g7f71774620-goog >> >
Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
I like this, it reminds me of the changes I proposed a few years ago to try to automatically map read-only text regions of appropriate sizes and alignment with THPs. My concern had always been whether commercial software and distro vendors would buy into supplying the appropriate linker flags when compiling; your solution is at the very least a good head start down that road. Matthew Wilcox and I had noticed a lot of ordinary apps such as gcc, Chrome and Firefox would benefit from such mappings; have you tried building any of those with the appropriate linker flag to see how they might benefit from the change? Thanks, -- Bill > On Mar 22, 2021, at 3:58 PM, Collin Fijalkovich > wrote: > > Transparent huge pages are supported for read-only non-shmem filesystems, > but are only used for vmas with VM_DENYWRITE. This condition ensures that > file THPs are protected from writes while an application is running > (ETXTBSY). Any existing file THPs are then dropped from the page cache > when a file is opened for write in do_dentry_open(). Since sys_mmap > ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas > produced by execve(). > > Systems that make heavy use of shared libraries (e.g. Android) are unable > to apply VM_DENYWRITE through the dynamic linker, preventing them from > benefiting from the resultant reduced contention on the TLB. > > This patch reduces the constraint on file THPs allowing use with any > executable mapping from a file not opened for write (see > inode_is_open_for_write()). It also introduces additional conditions to > ensure that files opened for write will never be backed by file THPs. > > Restricting the use of THPs to executable mappings eliminates the risk that > a read-only file later opened for write would encounter significant > latencies due to page cache truncation. > > The ld linker flag '-z max-page-size=(hugepage size)' can be used to > produce executables with the necessary layout. The dynamic linker must > map these file's segments at a hugepage size aligned vma for the mapping to > be backed with THPs. > > Signed-off-by: Collin Fijalkovich > --- > fs/open.c | 13 +++-- > mm/khugepaged.c | 16 +++- > 2 files changed, 26 insertions(+), 3 deletions(-) > > diff --git a/fs/open.c b/fs/open.c > index e53af13b5835..f76e960d10ea 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f, >* XXX: Huge page cache doesn't support writing yet. Drop all page >* cache for this file before processing writes. >*/ > - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) > - truncate_pagecache(inode, 0); > + if (f->f_mode & FMODE_WRITE) { > + /* > + * Paired with smp_mb() in collapse_file() to ensure nr_thps > + * is up to date and the update to i_writecount by > + * get_write_access() is visible. Ensures subsequent insertion > + * of THPs into the page cache will fail. > + */ > + smp_mb(); > + if (filemap_nr_thps(inode->i_mapping)) > + truncate_pagecache(inode, 0); > + } > > return 0; > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index a7d6cb912b05..4c7cc877d5e3 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -459,7 +459,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, > > /* Read-only file mappings need to be aligned for THP to work. 
*/ > if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && > - (vm_flags & VM_DENYWRITE)) { > + !inode_is_open_for_write(vma->vm_file->f_inode) && > + (vm_flags & VM_EXEC)) { > return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, > HPAGE_PMD_NR); > } > @@ -1872,6 +1873,19 @@ static void collapse_file(struct mm_struct *mm, > else { > __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); > filemap_nr_thps_inc(mapping); > + /* > + * Paired with smp_mb() in do_dentry_open() to ensure > + * i_writecount is up to date and the update to nr_thps is > + * visible. Ensures the page cache will be truncated if the > + * file is opened writable. > + */ > + smp_mb(); > + if (inode_is_open_for_write(mapping->host)) { > + result = SCAN_FAIL; > + __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); > + filemap_nr_thps_dec(mapping); > + goto xa_locked; > + } > } > > if (nr_none) { > -- > 2.31.0.rc2.261.g7f71774620-goog > >
[PATCH] net/rds: correct socket tunable error in rds_tcp_tune()
Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched. Signed-off-by: William Kucharski --- net/rds/tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 43db0eca911f..130e5988f956 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -500,7 +500,7 @@ void rds_tcp_tune(struct socket *sock) sk->sk_userlocks |= SOCK_SNDBUF_LOCK; } if (rtn->rcvbuf_size > 0) { - sk->sk_sndbuf = rtn->rcvbuf_size; + sk->sk_rcvbuf = rtn->rcvbuf_size; sk->sk_userlocks |= SOCK_RCVBUF_LOCK; } release_sock(sk); -- 2.30.2
Re: [PATCH v2] fb_defio: Remove custom address_space_operations
Looks good; my apologies for missing the leftover declaration of struct page in the same routine which you also found and removed this time around. > On Mar 10, 2021, at 11:55 AM, Matthew Wilcox (Oracle) > wrote: > > There's no need to give the page an address_space. Leaving the > page->mapping as NULL will cause the VM to handle set_page_dirty() > the same way that it's handled now, and that was the only reason to > set the address_space in the first place. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Christoph Hellwig > Reviewed-by: William Kucharski > --- > v2: Delete local variable definitions > drivers/video/fbdev/core/fb_defio.c | 35 - > drivers/video/fbdev/core/fbmem.c| 4 > include/linux/fb.h | 3 --- > 3 files changed, 42 deletions(-) > > diff --git a/drivers/video/fbdev/core/fb_defio.c > b/drivers/video/fbdev/core/fb_defio.c > index a591d291b231..b292887a2481 100644 > --- a/drivers/video/fbdev/core/fb_defio.c > +++ b/drivers/video/fbdev/core/fb_defio.c > @@ -52,13 +52,6 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault > *vmf) > return VM_FAULT_SIGBUS; > > get_page(page); > - > - if (vmf->vma->vm_file) > - page->mapping = vmf->vma->vm_file->f_mapping; > - else > - printk(KERN_ERR "no mapping available\n"); > - > - BUG_ON(!page->mapping); > page->index = vmf->pgoff; > > vmf->page = page; > @@ -151,17 +144,6 @@ static const struct vm_operations_struct > fb_deferred_io_vm_ops = { > .page_mkwrite = fb_deferred_io_mkwrite, > }; > > -static int fb_deferred_io_set_page_dirty(struct page *page) > -{ > - if (!PageDirty(page)) > - SetPageDirty(page); > - return 0; > -} > - > -static const struct address_space_operations fb_deferred_io_aops = { > - .set_page_dirty = fb_deferred_io_set_page_dirty, > -}; > - > int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma) > { > vma->vm_ops = &fb_deferred_io_vm_ops; > @@ -212,29 +194,12 @@ void fb_deferred_io_init(struct fb_info *info) > } > EXPORT_SYMBOL_GPL(fb_deferred_io_init); > > -void fb_deferred_io_open(struct fb_info *info, > - struct inode *inode, > - struct file *file) > -{ > - file->f_mapping->a_ops = &fb_deferred_io_aops; > -} > -EXPORT_SYMBOL_GPL(fb_deferred_io_open); > - > void fb_deferred_io_cleanup(struct fb_info *info) > { > struct fb_deferred_io *fbdefio = info->fbdefio; > - struct page *page; > - int i; > > BUG_ON(!fbdefio); > cancel_delayed_work_sync(&info->deferred_work); > - > - /* clear out the mapping that we setup */ > - for (i = 0 ; i < info->fix.smem_len; i += PAGE_SIZE) { > - page = fb_deferred_io_page(info, i); > - page->mapping = NULL; > - } > - > mutex_destroy(&fbdefio->lock); > } > EXPORT_SYMBOL_GPL(fb_deferred_io_cleanup); > diff --git a/drivers/video/fbdev/core/fbmem.c > b/drivers/video/fbdev/core/fbmem.c > index 06f5805de2de..372b52a2befa 100644 > --- a/drivers/video/fbdev/core/fbmem.c > +++ b/drivers/video/fbdev/core/fbmem.c > @@ -1415,10 +1415,6 @@ __releases(&info->lock) > if (res) > module_put(info->fbops->owner); > } > -#ifdef CONFIG_FB_DEFERRED_IO > - if (info->fbdefio) > - fb_deferred_io_open(info, inode, file); > -#endif > out: > unlock_fb_info(info); > if (res) > diff --git a/include/linux/fb.h b/include/linux/fb.h > index ecfbcc0553a5..a8dccd23c249 100644 > --- a/include/linux/fb.h > +++ b/include/linux/fb.h > @@ -659,9 +659,6 @@ static inline void __fb_pad_aligned_buffer(u8 *dst, u32 > d_pitch, > /* drivers/video/fb_defio.c */ > int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma); > extern void fb_deferred_io_init(struct fb_info *info); > -extern 
void fb_deferred_io_open(struct fb_info *info, > - struct inode *inode, > - struct file *file); > extern void fb_deferred_io_cleanup(struct fb_info *info); > extern int fb_deferred_io_fsync(struct file *file, loff_t start, > loff_t end, int datasync); > -- > 2.30.0 > >
Re: [PATCH] fb_defio: Remove custom address_space_operations
Looks good, just one super minor nit inline. Reviewed-by: William Kucharski > On Mar 10, 2021, at 6:51 AM, Matthew Wilcox (Oracle) > wrote: > > There's no need to give the page an address_space. Leaving the > page->mapping as NULL will cause the VM to handle set_page_dirty() > the same way that it's set now, and that was the only reason to > set the address_space in the first place. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > drivers/video/fbdev/core/fb_defio.c | 33 - > drivers/video/fbdev/core/fbmem.c| 4 > include/linux/fb.h | 3 --- > 3 files changed, 40 deletions(-) > > diff --git a/drivers/video/fbdev/core/fb_defio.c > b/drivers/video/fbdev/core/fb_defio.c > index a591d291b231..1bb208b3c4bb 100644 > --- a/drivers/video/fbdev/core/fb_defio.c > +++ b/drivers/video/fbdev/core/fb_defio.c > @@ -52,13 +52,6 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault > *vmf) > return VM_FAULT_SIGBUS; > > get_page(page); > - > - if (vmf->vma->vm_file) > - page->mapping = vmf->vma->vm_file->f_mapping; > - else > - printk(KERN_ERR "no mapping available\n"); > - > - BUG_ON(!page->mapping); > page->index = vmf->pgoff; > > vmf->page = page; > @@ -151,17 +144,6 @@ static const struct vm_operations_struct > fb_deferred_io_vm_ops = { > .page_mkwrite = fb_deferred_io_mkwrite, > }; > > -static int fb_deferred_io_set_page_dirty(struct page *page) > -{ > - if (!PageDirty(page)) > - SetPageDirty(page); > - return 0; > -} > - > -static const struct address_space_operations fb_deferred_io_aops = { > - .set_page_dirty = fb_deferred_io_set_page_dirty, > -}; > - > int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma) > { > vma->vm_ops = &fb_deferred_io_vm_ops; > @@ -212,14 +194,6 @@ void fb_deferred_io_init(struct fb_info *info) > } > EXPORT_SYMBOL_GPL(fb_deferred_io_init); > > -void fb_deferred_io_open(struct fb_info *info, > - struct inode *inode, > - struct file *file) > -{ > - file->f_mapping->a_ops = &fb_deferred_io_aops; > -} > -EXPORT_SYMBOL_GPL(fb_deferred_io_open); > - > void fb_deferred_io_cleanup(struct fb_info *info) > { > struct fb_deferred_io *fbdefio = info->fbdefio; > @@ -228,13 +202,6 @@ void fb_deferred_io_cleanup(struct fb_info *info) > > BUG_ON(!fbdefio); > cancel_delayed_work_sync(&info->deferred_work); > - > - /* clear out the mapping that we setup */ > - for (i = 0 ; i < info->fix.smem_len; i += PAGE_SIZE) { > - page = fb_deferred_io_page(info, i); > - page->mapping = NULL; > - } > - > mutex_destroy(&fbdefio->lock); > } We no longer need the definition of "int i" right before the BUG_ON(). 
> EXPORT_SYMBOL_GPL(fb_deferred_io_cleanup); > diff --git a/drivers/video/fbdev/core/fbmem.c > b/drivers/video/fbdev/core/fbmem.c > index 06f5805de2de..372b52a2befa 100644 > --- a/drivers/video/fbdev/core/fbmem.c > +++ b/drivers/video/fbdev/core/fbmem.c > @@ -1415,10 +1415,6 @@ __releases(&info->lock) > if (res) > module_put(info->fbops->owner); > } > -#ifdef CONFIG_FB_DEFERRED_IO > - if (info->fbdefio) > - fb_deferred_io_open(info, inode, file); > -#endif > out: > unlock_fb_info(info); > if (res) > diff --git a/include/linux/fb.h b/include/linux/fb.h > index ecfbcc0553a5..a8dccd23c249 100644 > --- a/include/linux/fb.h > +++ b/include/linux/fb.h > @@ -659,9 +659,6 @@ static inline void __fb_pad_aligned_buffer(u8 *dst, u32 > d_pitch, > /* drivers/video/fb_defio.c */ > int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma); > extern void fb_deferred_io_init(struct fb_info *info); > -extern void fb_deferred_io_open(struct fb_info *info, > - struct inode *inode, > - struct file *file); > extern void fb_deferred_io_cleanup(struct fb_info *info); > extern int fb_deferred_io_fsync(struct file *file, loff_t start, > loff_t end, int datasync); > -- > 2.30.0 > >
Re: [PATCH v2] include: Remove pagemap.h from blkdev.h
Nice cleanup, IMHO. Reviewed-by: William Kucharski > On Mar 9, 2021, at 12:57 PM, Matthew Wilcox (Oracle) > wrote: > > My UEK-derived config has 1030 files depending on pagemap.h before > this change. Afterwards, just 326 files need to be rebuilt when I > touch pagemap.h. I think blkdev.h is probably included too widely, > but untangling that dependency is harder and this solves my problem. > x86 allmodconfig builds, but there may be implicit include problems > on other architectures. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > v2: Fix CONFIG_SWAP=n implicit use of pagemap.h by swap.h. Increases >the number of files from 240, but that's still a big win -- 68% >reduction instead of 77%. > > block/blk-settings.c | 1 + > drivers/block/brd.c | 1 + > drivers/block/loop.c | 1 + > drivers/md/bcache/super.c | 1 + > drivers/nvdimm/btt.c | 1 + > drivers/nvdimm/pmem.c | 1 + > drivers/scsi/scsicam.c| 1 + > include/linux/blkdev.h| 1 - > include/linux/swap.h | 1 + > 9 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/block/blk-settings.c b/block/blk-settings.c > index b4aa2f37fab6..976085a44fb8 100644 > --- a/block/blk-settings.c > +++ b/block/blk-settings.c > @@ -7,6 +7,7 @@ > #include > #include > #include > +#include > #include/* for max_pfn/max_low_pfn */ > #include > #include > diff --git a/drivers/block/brd.c b/drivers/block/brd.c > index 18bf99906662..2a5a1933826b 100644 > --- a/drivers/block/brd.c > +++ b/drivers/block/brd.c > @@ -18,6 +18,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index a370cde3ddd4..d58d68f3c7cd 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -53,6 +53,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index 71691f32959b..f154c89d1326 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -16,6 +16,7 @@ > #include "features.h" > > #include > +#include > #include > #include > #include > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c > index 41aa1f01fc07..18a267d5073f 100644 > --- a/drivers/nvdimm/btt.c > +++ b/drivers/nvdimm/btt.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > #include > #include > #include > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index b8a85bfb2e95..16760b237229 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -8,6 +8,7 @@ > */ > > #include > +#include > #include > #include > #include > diff --git a/drivers/scsi/scsicam.c b/drivers/scsi/scsicam.c > index f1553a453616..0ffdb8f2995f 100644 > --- a/drivers/scsi/scsicam.c > +++ b/drivers/scsi/scsicam.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index c032cfe133c7..1e2a95599390 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -11,7 +11,6 @@ > #include > #include > #include > -#include > #include > #include > #include > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 4cc6ec3bf0ab..ae194bb7ddb4 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > #include > -- > 2.30.0 > >
Re: [PATCH] mm/filemap: Drop check for truncated page after I/O
LGTM. Reviewed-by: William Kucharski > On Mar 3, 2021, at 3:25 PM, Matthew Wilcox (Oracle) > wrote: > > If the I/O completed successfully, the page will remain Uptodate, > even if it is subsequently truncated. If the I/O completed with an error, > this check would cause us to retry the I/O if the page were truncated > before we woke up. There is no need to retry the I/O; the I/O to fill > the page failed, so we can legitimately just return -EIO. > > This code was originally added by commit 56f0d5fe6851 ("[PATCH] > readpage-vs-invalidate fix") in 2005 (this commit ID is from the > linux-fullhistory tree; it is also commit ba1f08f14b52 in tglx-history). > > At the time, truncate_complete_page() called ClearPageUptodate(), and > so this was fixing a real bug. In 2008, commit 84209e02de48 > ("mm: dont clear PG_uptodate on truncate/invalidate") removed the > call to ClearPageUptodate, and this check has been unnecessary ever > since. > > It doesn't do any real harm, but there's no need to keep it. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/filemap.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 8d3e0daed7c9..3d1635d3be3e 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2238,8 +2238,6 @@ static int filemap_read_page(struct file *file, struct > address_space *mapping, > return error; > if (PageUptodate(page)) > return 0; > - if (!page->mapping) /* page truncated */ > - return AOP_TRUNCATED_PAGE; > shrink_readahead_size_eio(&file->f_ra); > return -EIO; > } > -- > 2.30.0 > >
Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]
> On Oct 21, 2020, at 6:49 PM, Matthew Wilcox wrote: > > On Wed, Oct 21, 2020 at 08:30:18PM -0400, Qian Cai wrote: >> Today's linux-next starts to trigger this wondering if anyone has any clue. > > I've seen that occasionally too. I changed that BUG_ON to VM_BUG_ON_PAGE > to try to get a clue about it. Good to know it's not the THP patches > since they aren't in linux-next. > > I don't understand how it can happen. We have the page locked, and then we > do: > >if (PageWriteback(page)) { >if (wbc->sync_mode != WB_SYNC_NONE) >wait_on_page_writeback(page); >else >goto continue_unlock; >} > >VM_BUG_ON_PAGE(PageWriteback(page), page); > > Nobody should be able to put this page under writeback while we have it > locked ... right? The page can be redirtied by the code that's supposed > to be writing it back, but I don't see how anyone can make PageWriteback > true while we're holding the page lock. Looking at __test_set_page_writeback(), I see that it (and most other callers to lock_page_memcg()) do the following: lock_page_memcg(page) /* do other stuff */ ret = TestSetPageWriteback(page); /* do more stuff */ unlock_page_memcg(page) yet lock_page_memcg() does have a few cases where it can (silently) return NULL to indicate an error. Only test_clear_page_writeback() actually saves off the return value (but it too never bothers to check whether it is NULL or not.) Could it be one of those error conditions is occurring leading to no lock actually being taken? The conditions would be extremely rare, but it feels wrong not to check somewhere: struct page *head = compound_head(page); /* rmap on tail pages */ [ ... ] if (mem_cgroup_disabled()) return NULL; again: memcg = head->mem_cgroup; if (unlikely(!memcg)) return NULL;
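For concreteness, a sketch of the kind of check being suggested (purely hypothetical; callers such as __test_set_page_writeback() currently discard the return value of lock_page_memcg()):

        struct mem_cgroup *memcg = lock_page_memcg(page);

        if (!memcg && !mem_cgroup_disabled())
                pr_warn_once("%s: lock_page_memcg() returned NULL\n", __func__);
        ret = TestSetPageWriteback(page);
        /* ... */
        unlock_page_memcg(page);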
Re: [PATCH v2 00/12] Overhaul multi-page lookups for THP
Looks good to me; I really like the addition of the "end" parameter to find_get_entries() and the conversion of pagevec_lookup_entries(). For the series: Reviewed-by: William Kucharski > On Sep 14, 2020, at 7:00 AM, Matthew Wilcox (Oracle) > wrote: > > The critical patch to review here is patch 11, "Handle truncates that > split THPs". This code is shared with shmem, and while xfstests passes > (both with the tmpfs filesystem and with THPs enabled for XFS), it is > terribly subtle. > > I posted a similar patch series a few weeks ago [1], but this goes a few > steps further than that one did. In addition to the unification of > find_get_entries() and pagevec_lookup_entries(), this patch series > includes: > > - Only return the head pages from tagged lookups > - Factor a lot of common code out of the various batch lookup routines > - Add mapping_seek_hole_data() > - Only return head pages from find_get_entries > > I also have a patch to iomap to use mapping_seek_hole_data(), but I'm > not including that as part of this batch of patches -- I'll send it > through the iomap tree once mapping_seek_hole_data() lands upstream. > > [1] > https://lore.kernel.org/linux-mm/20200819184850.24779-1-wi...@infradead.org/ > > Matthew Wilcox (Oracle) (12): > mm: Make pagecache tagged lookups return only head pages > mm/shmem: Use pagevec_lookup in shmem_unlock_mapping > mm/filemap: Add helper for finding pages > mm/filemap: Add mapping_seek_hole_data > mm: Add and use find_lock_entries > mm: Add an 'end' parameter to find_get_entries > mm: Add an 'end' parameter to pagevec_lookup_entries > mm: Remove nr_entries parameter from pagevec_lookup_entries > mm: Pass pvec directly to find_get_entries > mm: Remove pagevec_lookup_entries > mm/truncate,shmem: Handle truncates that split THPs > mm/filemap: Return only head pages from find_get_entries > > include/linux/pagemap.h | 5 +- > include/linux/pagevec.h | 4 - > mm/filemap.c| 267 +++- > mm/internal.h | 5 + > mm/shmem.c | 214 +++- > mm/swap.c | 38 +- > mm/truncate.c | 249 ++--- > 7 files changed, 329 insertions(+), 453 deletions(-) > > -- > 2.28.0 >
Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag
I really like that as it’s self-documenting and anyone debugging it can see what is actually being used at a glance. > On Sep 20, 2020, at 09:15, Matthew Wilcox wrote: > > On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote: >> Add a flag to force processing a syscall as a compat syscall. This is >> required so that in_compat_syscall() works for I/O submitted by io_uring >> helper threads on behalf of compat syscalls. > > Al doesn't like this much, but my suggestion is to introduce two new > opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32. The compat code > can translate IORING_OP_READV to IORING_OP_READV32 and then the core > code can know what that user pointer is pointing to.
Re: [RFC PATCH 00/16] 1GB THP support on x86_64
> On Sep 9, 2020, at 7:27 AM, David Hildenbrand wrote: > > On 09.09.20 15:14, Jason Gunthorpe wrote: >> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote: >> >>> But here's the thing ... we already allow >>> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB) >>> >>> So if we're not doing THP, what's the point of this thread? >> >> I wondered that too.. >> >>> An madvise flag is a different beast; that's just letting the kernel >>> know what the app thinks its behaviour will be. The kernel can pay >> >> But madvise is too late, the VMA already has an address, if it is not >> 1G aligned it cannot be 1G THP already. > > That's why user space (like QEMU) is THP-aware and selects an address > that is aligned to the expected THP granularity (e.g., 2MB on x86_64). To me it's always seemed like there are two major divisions among THP use cases: 1) Applications that KNOW they would benefit from use of THPs, so they call madvise() with an appropriate parameter and explicitly inform the kernel of such 2) Applications that know nothing about THP but there may be an advantage that comes from "automatic" THP mapping when possible. This is an approach that I am more familiar with that comes down to: 1) Is a VMA properly aligned for a (whatever size) THP? 2) Is the mapping request for a length >= (whatever size) THP? 3) Let's try allocating memory to map the space using (whatever size) THP, and: -- If we succeed, great, awesome, let's do it. -- If not, no big deal, map using as large a page as we CAN get. There of course are myriad performance implications to this. Processes that start early after boot have a better chance of getting a THP, but that also means frequently mapped large memory spaces have a better chance of being mapped in a shared manner via a THP, e.g. libc, X servers or Firefox/Chrome. It also means that processes that would be mapped using THPs early in boot may not be if they should crash and need to be restarted. There are all sorts of tunables that would likely need to be in place to make the second approach more viable, but I think it's certainly worth investigating. The address selection you suggest is the basis of one of the patches I wrote for a previous iteration of THP support (and that is in Matthew's THP tree) that will try to round VM addresses to the proper alignment if possible so a THP can then be used to map the area.
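A small self-contained sketch of the arithmetic behind steps 1 and 2 of the "automatic" approach above (constants assume x86_64; the names are illustrative, not kernel API):

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        #define PMD_THP_SIZE    (2UL << 20)     /* 2MB */
        #define PUD_THP_SIZE    (1UL << 30)     /* 1GB */

        /* Is this (address, length) pair even a candidate for the given THP size? */
        static bool thp_candidate(uintptr_t addr, size_t len, size_t thp_size)
        {
                return (addr & (thp_size - 1)) == 0 && len >= thp_size;
        }

        /* Round a hint address up to the next boundary so a THP could be used. */
        static uintptr_t thp_align_hint(uintptr_t addr, size_t thp_size)
        {
                return (addr + thp_size - 1) & ~(thp_size - 1);
        }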
Re: [PATCH 00/11] iomap/fs/block patches for 5.11
Really nice improvements here. Reviewed-by: William Kucharski > On Aug 24, 2020, at 9:16 AM, Matthew Wilcox (Oracle) > wrote: > > As promised earlier [1], here are the patches which I would like to > merge into 5.11 to support THPs. They depend on that earlier series. > If there's anything in here that you'd like to see pulled out and added > to that earlier series, let me know. > > There are a couple of pieces in here which aren't exactly part of > iomap, but I think make sense to take through the iomap tree. > > [1] > https://lore.kernel.org/linux-fsdevel/20200824145511.10500-1-wi...@infradead.org/ > > Matthew Wilcox (Oracle) (11): > fs: Make page_mkwrite_check_truncate thp-aware > mm: Support THPs in zero_user_segments > mm: Zero the head page, not the tail page > block: Add bio_for_each_thp_segment_all > iomap: Support THPs in iomap_adjust_read_range > iomap: Support THPs in invalidatepage > iomap: Support THPs in read paths > iomap: Change iomap_write_begin calling convention > iomap: Support THPs in write paths > iomap: Inline data shouldn't see THPs > iomap: Handle tail pages in iomap_page_mkwrite > > fs/iomap/buffered-io.c | 178 > include/linux/bio.h | 13 +++ > include/linux/bvec.h| 27 ++ > include/linux/highmem.h | 15 +++- > include/linux/pagemap.h | 10 +-- > mm/highmem.c| 62 +- > mm/shmem.c | 7 ++ > mm/truncate.c | 7 ++ > 8 files changed, 236 insertions(+), 83 deletions(-) > > -- > 2.28.0 > >
Re: [PATCH 0/7] Overhaul find_get_entries and pagevec_lookup_entries
> On Aug 19, 2020, at 9:05 AM, Matthew Wilcox (Oracle) > wrote: > > This started out as part of the THP patchset, but it's turned into a > nice simplification in its own right. Essentially we end up unifying > find_get_entries() and pagevec_lookup_entries() into one function that's > better than either, and we get rid of a lot of code in the callers as > a result. > > I'm running this through xfstests right now, but something similar to > this has already passed xfstests as part of the THP patchset. > > I've done my best to avoid off-by-one errors for 'end', but I wouldn't be > surprised if I made a mistake. We're not consistent with whether 'end' > is inclusive or exclusive and I didn't want to make extensive changes > to ensure they were consistent. > > Matthew Wilcox (Oracle) (7): > mm: Use pagevec_lookup in shmem_unlock_mapping > mm: Rewrite shmem_seek_hole_data > mm: Add an 'end' parameter to find_get_entries > mm: Add an 'end' parameter to pagevec_lookup_entries > mm: Remove nr_entries parameter from pagevec_lookup_entries > mm: Pass pvec directly to find_get_entries > mm: Remove pagevec_lookup_entries > > include/linux/pagemap.h | 3 +- > include/linux/pagevec.h | 4 -- > mm/filemap.c| 19 + > mm/shmem.c | 85 ++--- > mm/swap.c | 38 +- > mm/truncate.c | 33 +++- > 6 files changed, 45 insertions(+), 137 deletions(-) Very nice cleanups and the code makes more sense, thanks. Reviewed-by: William Kucharski
Re: [PATCH 0/8] Return head pages from find_get_entry and find_lock_entry
> On Aug 19, 2020, at 12:48 PM, Matthew Wilcox (Oracle) > wrote: > > This patch seris started out as part of the THP patch set, but it has > some nice effects along the way and it seems worth splitting it out and > submitting separately. > > Currently find_get_entry() and find_lock_entry() return the page > corresponding to the requested index, but the first thing most callers do > is find the head page, which we just threw away. As part of auditing > all the callers, I found some misuses of the APIs and some plain > inefficiencies that I've fixed. > > The diffstat is unflattering, but I added more kernel-doc. > > Matthew Wilcox (Oracle) (8): > mm: Factor find_get_swap_page out of mincore_page > mm: Use find_get_swap_page in memcontrol > mm: Optimise madvise WILLNEED > proc: Optimise smaps for shmem entries > i915: Use find_lock_page instead of find_lock_entry > mm: Convert find_get_entry to return the head page > mm: Return head page from find_lock_entry > mm: Hoist find_subpage call further up in pagecache_get_page > > drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 4 +-- > fs/proc/task_mmu.c| 8 + > include/linux/pagemap.h | 16 +++-- > include/linux/swap.h | 7 > mm/filemap.c | 41 +++ > mm/madvise.c | 21 +++- > mm/memcontrol.c | 25 ++ > mm/mincore.c | 28 ++-- > mm/shmem.c| 15 + > mm/swap_state.c | 31 +++++ > 10 files changed, 98 insertions(+), 98 deletions(-) For the series: Reviewed-by: William Kucharski
Re: [PATCH] memremap: Convert devmap static branch to {inc,dec}
Looks good to me. Reviewed-by: William Kucharski > On Aug 10, 2020, at 5:53 PM, ira.we...@intel.com wrote: > > From: Ira Weiny > > While reviewing Protection Key Supervisor support it was pointed out > that using a counter to track static branch enable was an anti-pattern > which was better solved using the provided static_branch_{inc,dec} > functions.[1] > > Fix up devmap_managed_key to work the same way. Also this should be > safer because there is a very small (very unlikely) race when multiple > callers try to enable at the same time. > > [1] > https://lore.kernel.org/lkml/20200714194031.gi5...@worktop.programming.kicks-ass.net/ > > Cc: Dan Williams > Cc: Vishal Verma > Signed-off-by: Ira Weiny > --- > mm/memremap.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/mm/memremap.c b/mm/memremap.c > index 03e38b7a38f1..9fb9ad500e78 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -40,12 +40,10 @@ EXPORT_SYMBOL_GPL(memremap_compat_align); > #ifdef CONFIG_DEV_PAGEMAP_OPS > DEFINE_STATIC_KEY_FALSE(devmap_managed_key); > EXPORT_SYMBOL(devmap_managed_key); > -static atomic_t devmap_managed_enable; > > static void devmap_managed_enable_put(void) > { > - if (atomic_dec_and_test(&devmap_managed_enable)) > - static_branch_disable(&devmap_managed_key); > + static_branch_dec(&devmap_managed_key); > } > > static int devmap_managed_enable_get(struct dev_pagemap *pgmap) > @@ -56,8 +54,7 @@ static int devmap_managed_enable_get(struct dev_pagemap > *pgmap) > return -EINVAL; > } > > - if (atomic_inc_return(&devmap_managed_enable) == 1) > - static_branch_enable(&devmap_managed_key); > + static_branch_inc(&devmap_managed_key); > return 0; > } > #else > -- > 2.28.0.rc0.12.gb6a658bd00c9 > >
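A minimal usage sketch of the pattern the patch moves to (the key and helper names here are illustrative):

        DEFINE_STATIC_KEY_FALSE(example_key);

        static void example_get(void)
        {
                static_branch_inc(&example_key);        /* refcount and enable in one step */
        }

        static void example_put(void)
        {
                static_branch_dec(&example_key);        /* disables when the count hits zero */
        }

        static bool example_fast_path(void)
        {
                return static_branch_unlikely(&example_key);
        }

With the old open-coded counter, a second caller could see atomic_inc_return() > 1 and proceed before the first caller had actually flipped the key; static_branch_inc() keeps the count and the key state consistent.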
[PATCH] mm: ksize() should silently accept a NULL pointer
Other mm routines such as kfree() and kzfree() silently do the right thing if passed a NULL pointer, so ksize() should do the same. Signed-off-by: William Kucharski --- mm/slab_common.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 9e72ba224175..2bff01ad94d8 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1660,10 +1660,9 @@ static __always_inline void *__do_krealloc(const void *p, size_t new_size, gfp_t flags) { void *ret; - size_t ks = 0; + size_t ks; - if (p) - ks = ksize(p); + ks = ksize(p); if (ks >= new_size) { p = kasan_krealloc((void *)p, new_size, flags); @@ -1723,10 +1722,9 @@ void kzfree(const void *p) size_t ks; void *mem = (void *)p; - if (unlikely(ZERO_OR_NULL_PTR(mem))) - return; ks = ksize(mem); - memset(mem, 0, ks); + if (ks) + memset(mem, 0, ks); kfree(mem); } EXPORT_SYMBOL(kzfree); @@ -1749,8 +1747,6 @@ size_t ksize(const void *objp) { size_t size; - if (WARN_ON_ONCE(!objp)) - return 0; /* * We need to check that the pointed to object is valid, and only then * unpoison the shadow memory below. We use __kasan_check_read(), to @@ -1764,7 +1760,7 @@ size_t ksize(const void *objp) * We want to perform the check before __ksize(), to avoid potentially * crashing in __ksize() due to accessing invalid metadata. */ - if (unlikely(objp == ZERO_SIZE_PTR) || !__kasan_check_read(objp, 1)) + if (unlikely(ZERO_OR_NULL_PTR(objp)) || !__kasan_check_read(objp, 1)) return 0; size = __ksize(objp); -- 2.26.2
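As a usage illustration (hypothetical caller), the simplification this buys mirrors the __do_krealloc() hunk above:

        /* With ksize(NULL) == 0, a grow-in-place helper needs no NULL check. */
        static void *try_grow_in_place(void *p, size_t new_size)
        {
                if (ksize(p) >= new_size)       /* safe for NULL and ZERO_SIZE_PTR */
                        return p;
                return NULL;                    /* caller allocates afresh and copies */
        }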
Re: [PATCH v4 00/36] Large pages in the page cache
Except for [PATCH v4 36/36], which I can't approve for obvious reasons: Reviewed-by: William Kucharski
Re: [PATCH v4 36/36] mm: Align THP mappings for non-DAX
Thinking about this, if the intent is to make THP usable for any greater than PAGESIZE page size, this routine should probably go back to taking a size or perhaps order parameter so it could be called to align addresses accordingly rather than hard code PMD_SIZE. > On May 15, 2020, at 7:16 AM, Matthew Wilcox wrote: > > From: William Kucharski > > When we have the opportunity to use transparent huge pages to map a > file, we want to follow the same rules as DAX. > > Signed-off-by: William Kucharski > [Inline __thp_get_unmapped_area() into thp_get_unmapped_area()] > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/huge_memory.c | 40 +--- > 1 file changed, 13 insertions(+), 27 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 15a86b06befc..e78686b628ae 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -535,30 +535,30 @@ bool is_transparent_hugepage(struct page *page) > } > EXPORT_SYMBOL_GPL(is_transparent_hugepage); > > -static unsigned long __thp_get_unmapped_area(struct file *filp, > - unsigned long addr, unsigned long len, > - loff_t off, unsigned long flags, unsigned long size) > +unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > + unsigned long len, unsigned long pgoff, unsigned long flags) > { > + loff_t off = (loff_t)pgoff << PAGE_SHIFT; > loff_t off_end = off + len; > - loff_t off_align = round_up(off, size); > + loff_t off_align = round_up(off, PMD_SIZE); > unsigned long len_pad, ret; > > - if (off_end <= off_align || (off_end - off_align) < size) > - return 0; > + if (off_end <= off_align || (off_end - off_align) < PMD_SIZE) > + goto regular; > > - len_pad = len + size; > + len_pad = len + PMD_SIZE; > if (len_pad < len || (off + len_pad) < off) > - return 0; > + goto regular; > > ret = current->mm->get_unmapped_area(filp, addr, len_pad, > off >> PAGE_SHIFT, flags); > > /* > - * The failure might be due to length padding. The caller will retry > - * without the padding. > + * The failure might be due to length padding. Retry without > + * the padding. >*/ > if (IS_ERR_VALUE(ret)) > - return 0; > + goto regular; > > /* >* Do not try to align to THP boundary if allocation at the address > @@ -567,23 +567,9 @@ static unsigned long __thp_get_unmapped_area(struct file > *filp, > if (ret == addr) > return addr; > > - ret += (off - ret) & (size - 1); > + ret += (off - ret) & (PMD_SIZE - 1); > return ret; > -} > - > -unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > - unsigned long len, unsigned long pgoff, unsigned long flags) > -{ > - unsigned long ret; > - loff_t off = (loff_t)pgoff << PAGE_SHIFT; > - > - if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) > - goto out; > - > - ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE); > - if (ret) > - return ret; > -out: > +regular: > return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags); > } > EXPORT_SYMBOL_GPL(thp_get_unmapped_area); > -- > 2.26.2 >
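Concretely, the shape being suggested is roughly the helper the patch folds away, kept parameterized by size (a sketch based on the removed lines above; the signature is only a suggestion):

        static unsigned long __thp_get_unmapped_area(struct file *filp,
                        unsigned long addr, unsigned long len,
                        loff_t off, unsigned long flags, unsigned long size)
        {
                loff_t off_end = off + len;
                loff_t off_align = round_up(off, size);
                unsigned long len_pad, ret;

                if (off_end <= off_align || (off_end - off_align) < size)
                        return 0;

                len_pad = len + size;
                if (len_pad < len || (off + len_pad) < off)
                        return 0;

                ret = current->mm->get_unmapped_area(filp, addr, len_pad,
                                                     off >> PAGE_SHIFT, flags);
                if (IS_ERR_VALUE(ret))
                        return 0;

                if (ret == addr)
                        return addr;

                ret += (off - ret) & (size - 1);
                return ret;
        }

thp_get_unmapped_area() could then call it with PMD_SIZE while future callers pass whatever alignment they need.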
Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
> On Oct 1, 2019, at 8:20 AM, Kirill A. Shutemov wrote: > > On Tue, Oct 01, 2019 at 06:18:28AM -0600, William Kucharski wrote: >> >> >> On 10/1/19 5:32 AM, Kirill A. Shutemov wrote: >>> On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote: >>>> >>>> >>>>> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov >>>>> wrote: >>>>> >>>>> On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote: >>>>>> >>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>>>> index cbe7d0619439..670a1780bd2f 100644 >>>>>> --- a/mm/huge_memory.c >>>>>> +++ b/mm/huge_memory.c >>>>>> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file >>>>>> *filp, unsigned long addr, >>>>>> >>>>>> if (addr) >>>>>> goto out; >>>>>> -if (!IS_DAX(filp->f_mapping->host) || >>>>>> !IS_ENABLED(CONFIG_FS_DAX_PMD)) >>>>>> -goto out; >>>>>> >>>>>> addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE); >>>>>> if (addr) >>>>> >>>>> I think you reducing ASLR without any real indication that THP is relevant >>>>> for the VMA. We need to know if any huge page allocation will be >>>>> *attempted* for the VMA or the file. >>>> >>>> Without a properly aligned address the code will never even attempt >>>> allocating >>>> a THP. >>>> >>>> I don't think rounding an address to one that would be properly aligned to >>>> map >>>> to a THP if possible is all that detrimental to ASLR and without the >>>> ability to >>>> pick an aligned address it's rather unlikely anyone would ever map >>>> anything to >>>> a THP unless they explicitly designate an address with MAP_FIXED. >>>> >>>> If you do object to the slight reduction of the ASLR address space, what >>>> alternative would you prefer to see? >>> >>> We need to know by the time if THP is allowed for this >>> file/VMA/process/whatever. Meaning that we do not give up ASLR entropy for >>> nothing. >>> >>> For instance, if THP is disabled globally, there is no reason to align the >>> VMA to the THP requirements. >> >> I understand, but this code is in thp_get_unmapped_area(), which is only >> called >> if THP is configured and the VMA can support it. >> >> I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently >> missed in his merge or if he has other ideas for how it would eventually be >> called, but in my last patch revision the code calling it in do_mmap() >> looked like this: >> >> #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP >>/* >> * If THP is enabled, it's a read-only executable that is >> * MAP_PRIVATE mapped, the length is larger than a PMD page >> * and either it's not a MAP_FIXED mapping or the passed address is >> * properly aligned for a PMD page, attempt to get an appropriate >> * address at which to map a PMD-sized THP page, otherwise call the >> * normal routine. >> */ >>if ((prot & PROT_READ) && (prot & PROT_EXEC) && >>(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) && >>(!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) { > > len and MAP_FIXED is already handled by thp_get_unmapped_area(). > > if (prot & (PROT_READ|PROT_WRITE|PROT_READ) == (PROT_READ|PROT_EXEC) && > (flags & MAP_PRIVATE)) { It is, but I wanted to avoid even calling it if conditions weren't right. Checking twice is non-optimal but I didn't want to alter the existing use of the routine for anon THP. > > >>addr = thp_get_unmapped_area(file, addr, len, pgoff, flags); >> >>if (addr && (!(addr & ~HPAGE_PMD_MASK))) { > > This check is broken. 
> > For instance, if pgoff is one, (addr & ~HPAGE_PMD_MASK) has to be equal to > PAGE_SIZE to have chance to get a huge page in the mapping. > If the address isn't PMD-aligned, we will never be able to map it with a THP anyway. The current code is designed to only map a THP if the VMA allows for it and it can map the entire THP starting at an aligned address. You can't map a THP at the PMD level at an address that isn't PMD aligned. Perhaps I'm missing a use case here. >>/* >> * If we got a suitable THP mapping address, shut off >> * VM_MAYWRITE for the region, since it's never what >> * we would want. >> */ >>vm_maywrite = 0; > > Wouldn't it break uprobe, for instance? I'm not sure; does uprobe allow COW to insert the probe even for mappings explicitly marked read-only? Thanks, Bill
Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
On 10/1/19 5:32 AM, Kirill A. Shutemov wrote: On Tue, Oct 01, 2019 at 05:21:26AM -0600, William Kucharski wrote: On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov wrote: On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cbe7d0619439..670a1780bd2f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, if (addr) goto out; - if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) - goto out; addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE); if (addr) I think you reducing ASLR without any real indication that THP is relevant for the VMA. We need to know if any huge page allocation will be *attempted* for the VMA or the file. Without a properly aligned address the code will never even attempt allocating a THP. I don't think rounding an address to one that would be properly aligned to map to a THP if possible is all that detrimental to ASLR and without the ability to pick an aligned address it's rather unlikely anyone would ever map anything to a THP unless they explicitly designate an address with MAP_FIXED. If you do object to the slight reduction of the ASLR address space, what alternative would you prefer to see? We need to know by the time if THP is allowed for this file/VMA/process/whatever. Meaning that we do not give up ASLR entropy for nothing. For instance, if THP is disabled globally, there is no reason to align the VMA to the THP requirements. I understand, but this code is in thp_get_unmapped_area(), which is only called if THP is configured and the VMA can support it. I don't see it in Matthew's patchset, so I'm not sure if it was inadvertently missed in his merge or if he has other ideas for how it would eventually be called, but in my last patch revision the code calling it in do_mmap() looked like this: #ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP /* * If THP is enabled, it's a read-only executable that is * MAP_PRIVATE mapped, the length is larger than a PMD page * and either it's not a MAP_FIXED mapping or the passed address is * properly aligned for a PMD page, attempt to get an appropriate * address at which to map a PMD-sized THP page, otherwise call the * normal routine. */ if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) && (!(flags & MAP_FIXED)) && len >= HPAGE_PMD_SIZE) { addr = thp_get_unmapped_area(file, addr, len, pgoff, flags); if (addr && (!(addr & ~HPAGE_PMD_MASK))) { /* * If we got a suitable THP mapping address, shut off * VM_MAYWRITE for the region, since it's never what * we would want. */ vm_maywrite = 0; } else addr = get_unmapped_area(file, addr, len, pgoff, flags); } else { #endif So I think that meets your expectations regarding ASLR. -- Bill
Re: [PATCH 14/15] mm: Align THP mappings for non-DAX
> On Oct 1, 2019, at 4:45 AM, Kirill A. Shutemov wrote: > > On Tue, Sep 24, 2019 at 05:52:13PM -0700, Matthew Wilcox wrote: >> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >> index cbe7d0619439..670a1780bd2f 100644 >> --- a/mm/huge_memory.c >> +++ b/mm/huge_memory.c >> @@ -563,8 +563,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, >> unsigned long addr, >> >> if (addr) >> goto out; >> -if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD)) >> -goto out; >> >> addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE); >> if (addr) > > I think you reducing ASLR without any real indication that THP is relevant > for the VMA. We need to know if any huge page allocation will be > *attempted* for the VMA or the file. Without a properly aligned address the code will never even attempt allocating a THP. I don't think rounding an address to one that would be properly aligned to map to a THP if possible is all that detrimental to ASLR and without the ability to pick an aligned address it's rather unlikely anyone would ever map anything to a THP unless they explicitly designate an address with MAP_FIXED. If you do object to the slight reduction of the ASLR address space, what alternative would you prefer to see? -- Bill
Re: [PATCH v5 1/2] mm: Allow the page cache to allocate large pages
> On Sep 3, 2019, at 5:57 AM, Michal Hocko wrote: > > On Mon 02-09-19 03:23:40, William Kucharski wrote: >> Add an 'order' argument to __page_cache_alloc() and >> do_read_cache_page(). Ensure the allocated pages are compound pages. > > Why do we need to touch all the existing callers and change them to use > order 0 when none is actually converted to a different order? This just > seem to add a lot of code churn without a good reason. If anything I > would simply add __page_cache_alloc_order and make __page_cache_alloc > call it with order 0 argument. All the EXISTING code in patch [1/2] is changed to call it with an order of 0, as you would expect. However, new code in part [2/2] of the patch calls it with an order of HPAGE_PMD_ORDER, as it seems cleaner to have those routines operate on a page, regardless of the order of the page desired. I certainly can change this as you request, but once again the question is whether "page" should MEAN "page" regardless of the order desired, or whether the assumption will always be "page" means base PAGESIZE. Either approach works, but what is the semantic we want going forward? Thanks again!
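For clarity, the alternative Michal describes would look roughly like this for the !CONFIG_NUMA inline version (sketch only; __page_cache_alloc_order() is the hypothetical name from his suggestion):

        static inline struct page *__page_cache_alloc_order(gfp_t gfp,
                                                            unsigned int order)
        {
                if (order > 0)
                        gfp |= __GFP_COMP;
                return alloc_pages(gfp, order);
        }

        static inline struct page *__page_cache_alloc(gfp_t gfp)
        {
                return __page_cache_alloc_order(gfp, 0);
        }

Existing callers keep the one-argument form unchanged, and only the new THP paths in patch [2/2] would need the order-taking variant.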
Re: [PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
> On Sep 3, 2019, at 1:15 PM, Michal Hocko wrote: > > Then I would suggest mentioning all this in the changelog so that the > overall intention is clear. It is also up to you fs developers to find a > consensus on how to move forward. I have brought that up mostly because > I really hate seeing new config options added due to shortage of > confidence in the code. That really smells like working around standard > code quality inclusion process. I do mention a good deal of this in the blurb in part [0/2] of the patch, though I don't cover the readpage/readpages() debate. Ideally readpage() should do just that, read a page, based on the size of the page passed, and not assume "page" means "PAGESIZE." I can also make the "help" text for the option more descriptive if desired. Thanks for your comments!
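A sketch of the readpage() semantic being argued for (hypothetical filesystem hook, written against helpers of that era such as hpage_nr_pages(); example_fill_page() is a made-up stand-in for the filesystem's actual I/O path):

        static int example_readpage(struct file *file, struct page *page)
        {
                unsigned long nr = hpage_nr_pages(page);        /* 1, or HPAGE_PMD_NR for a THP */
                loff_t pos = (loff_t)page->index << PAGE_SHIFT;
                size_t len = (size_t)nr << PAGE_SHIFT;

                /* read 'len' bytes of the file at 'pos' into the (compound) page */
                return example_fill_page(file, page, pos, len);
        }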
[PATCH v5 1/2] mm: Allow the page cache to allocate large pages
Add an 'order' argument to __page_cache_alloc() and do_read_cache_page(). Ensure the allocated pages are compound pages. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: William Kucharski Reported-by: kbuild test robot --- fs/afs/dir.c| 2 +- fs/btrfs/compression.c | 2 +- fs/cachefiles/rdwr.c| 4 ++-- fs/ceph/addr.c | 2 +- fs/ceph/file.c | 2 +- include/linux/pagemap.h | 10 ++ mm/filemap.c| 20 +++- mm/readahead.c | 2 +- net/ceph/pagelist.c | 4 ++-- net/ceph/pagevec.c | 2 +- 10 files changed, 27 insertions(+), 23 deletions(-) diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 139b4e3cc946..ca8f8e77e012 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key) afs_stat_v(dvnode, n_inval); ret = -ENOMEM; - req->pages[i] = __page_cache_alloc(gfp); + req->pages[i] = __page_cache_alloc(gfp, 0); if (!req->pages[i]) goto error; ret = add_to_page_cache_lru(req->pages[i], diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 60c47b417a4b..5280e7477b7e 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, } page = __page_cache_alloc(mapping_gfp_constraint(mapping, -~__GFP_FS)); +~__GFP_FS), 0); if (!page) break; diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index 44a3ce1e4ce4..11d30212745f 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem_monitor; } @@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem; } diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index b3c8b886bf64..7c1c3857fbb9 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1708,7 +1708,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page) if (len > PAGE_SIZE) len = PAGE_SIZE; } else { - page = __page_cache_alloc(GFP_NOFS); + page = __page_cache_alloc(GFP_NOFS, 0); if (!page) { err = -ENOMEM; goto out; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 685a03cc4b77..ae58d7c31aa4 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) struct page *page = NULL; loff_t i_size; if (retry_op == READ_INLINE) { - page = __page_cache_alloc(GFP_KERNEL); + page = __page_cache_alloc(GFP_KERNEL, 0); if (!page) return -ENOMEM; } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c7552459a15f..92e026d9a6b7 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -208,17 +208,19 @@ static inline int page_cache_add_speculative(struct page *page, int count) } #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order) { - return alloc_pages(gfp, 0); + if (order > 0) + gfp |= __GFP_COMP; + return alloc_pages(gfp, order); } #endif static inline struct page *page_cache_alloc(struct address_space *x) { - return 
__page_cache_alloc(mapping_gfp_mask(x)); + return __page_cache_alloc(mapping_gfp_mask(x), 0); } static inline gfp_t readahead_gfp_mask(struct address_space *x) diff --git a/mm/fi
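For illustration, a caller of the reworked allocator might request a PMD-sized compound page and fall back to a base page when that fails; the gfp flags and fallback policy below are assumptions made for the sketch, not part of the patch:

/*
 * Hypothetical caller (assumes CONFIG_TRANSPARENT_HUGEPAGE for
 * HPAGE_PMD_ORDER): try to get a PMD-sized compound page for the page
 * cache, falling back to a single base page if that fails.
 */
static struct page *example_alloc_pagecache_page(struct address_space *mapping)
{
	gfp_t gfp = mapping_gfp_mask(mapping);
	struct page *page;

	/* Higher-order allocations can fail; don't retry hard or warn. */
	page = __page_cache_alloc(gfp | __GFP_NORETRY | __GFP_NOWARN,
				  HPAGE_PMD_ORDER);
	if (page)
		return page;

	/* Fall back to a conventional order-0 page. */
	return __page_cache_alloc(gfp, 0);
}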
[PATCH v5 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
Add filemap_huge_fault() to attempt to satisfy page faults on memory-mapped read-only text pages using THP when possible. Signed-off-by: William Kucharski --- include/linux/mm.h | 2 + mm/Kconfig | 15 ++ mm/filemap.c | 398 +++-- mm/huge_memory.c | 3 + mm/mmap.c | 39 - mm/rmap.c | 4 +- mm/vmscan.c| 2 +- 7 files changed, 446 insertions(+), 17 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0334ca97c584..2a5311721739 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2433,6 +2433,8 @@ extern void truncate_inode_pages_final(struct address_space *); /* generic vm_area_ops exported for stackable file systems */ extern vm_fault_t filemap_fault(struct vm_fault *vmf); +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf, + enum page_entry_size pe_size); extern void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); diff --git a/mm/Kconfig b/mm/Kconfig index 56cec636a1fc..2debaded0e4d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL config ARCH_HAS_HUGEPD bool +config RO_EXEC_FILEMAP_HUGE_FAULT_THP + bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)" + depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM + + help + Introduce filemap_huge_fault() to automatically map executable + read-only pages of mapped files of suitable size and alignment + using THP if possible. + + This is marked experimental because it is a new feature and is + dependent upon filesystmes implementing readpages() in a way + that will recognize large THP pages and read file content to + them without polluting the pagecache with PAGESIZE pages due + to readahead. + endmenu diff --git a/mm/filemap.c b/mm/filemap.c index 38b46fc00855..5947d432a4e6 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -199,13 +199,12 @@ static void unaccount_page_cache_page(struct address_space *mapping, nr = hpage_nr_pages(page); __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); - if (PageSwapBacked(page)) { + + if (PageSwapBacked(page)) __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr); - if (PageTransHuge(page)) - __dec_node_page_state(page, NR_SHMEM_THPS); - } else { - VM_BUG_ON_PAGE(PageTransHuge(page), page); - } + + if (PageTransHuge(page)) + __dec_node_page_state(page, NR_SHMEM_THPS); /* * At this point page must be either written or cleaned by @@ -303,6 +302,9 @@ static void page_cache_delete_batch(struct address_space *mapping, break; if (xa_is_value(page)) continue; + +VM_BUG_ON_PAGE(xa_is_internal(page), page); + if (!tail_pages) { /* * Some page got inserted in our range? Skip it. We @@ -315,6 +317,11 @@ static void page_cache_delete_batch(struct address_space *mapping, continue; } WARN_ON_ONCE(!PageLocked(page)); + + /* +* If a THP is in the page cache, set the succeeding +* cache entries for the PMD-sized page to NULL. 
+*/ if (PageTransHuge(page) && !PageHuge(page)) tail_pages = HPAGE_PMD_NR - 1; page->mapping = NULL; @@ -324,8 +331,6 @@ static void page_cache_delete_batch(struct address_space *mapping, */ i++; } else { - VM_BUG_ON_PAGE(page->index + HPAGE_PMD_NR - tail_pages - != pvec->pages[i]->index, page); tail_pages--; } xas_store(&xas, NULL); @@ -881,7 +886,10 @@ static int __add_to_page_cache_locked(struct page *page, mapping->nrpages++; /* hugetlb pages do not participate in page cache accounting */ - if (!huge) + if (PageTransHuge(page) && !huge) + __mod_node_page_state(page_pgdat(page), + NR_FILE_PAGES, HPAGE_PMD_NR); + else __inc_node_page_state(page, NR_FILE_PAGES); unlock: xas_unlock_irq(&xas); @@ -1663,7 +1671,8 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, no_page: if (!page && (fgp_flags & FGP_CREAT))
[PATCH v5 0/2] mm,thp: Add filemap_huge_fault() for THP
This set of patches is the first step towards a mechanism for automatically mapping read-only text areas of appropriate size and alignment to THPs whenever possible.

For now, the central routine, filemap_huge_fault(), and various support routines are only included if the experimental kernel configuration option RO_EXEC_FILEMAP_HUGE_FAULT_THP is enabled. This is because filemap_huge_fault() is dependent upon the address_space_operations vector readpage() pointing to a routine that will read and fill an entire large page at a time without polluting the page cache with PAGESIZE entries for the large page being mapped or performing readahead that would pollute the page cache entries for succeeding large pages.

Unfortunately, there is no good way to determine how many bytes were read by readpage(). At present, if filemap_huge_fault() were to call a conventional readpage() routine, it would only fill the first PAGESIZE bytes of the large page, which is definitely NOT the desired behavior. However, by making the code available now it is hoped that filesystem maintainers who have pledged to provide such a mechanism will do so more rapidly.

The first part of the patch adds an 'order' argument to __page_cache_alloc(), allowing callers to directly request page cache pages of various sizes. This code was provided by Matthew Wilcox.

The second part of the patch implements the filemap_huge_fault() mechanism as described above. As this code is only run when the experimental config option is set, there are some issues that need to be resolved, but this is a good step that will enable further development.

Changes since v4:
1. More code review comments addressed, fixed bug in rcu logic
2. Add code to delete hugepage from page cache upon failure within filemap_huge_fault
3. Remove improperly crafted VM_BUG_ON when removing a THP from the page cache
4. Fixed mapping count issue

Changes since v3:
1. Multiple code review comments addressed
2. filemap_huge_fault() now does rcu locking when possible
3. filemap_huge_fault() now properly adds the THP to the page cache before calling readpage()

Changes since v2:
1. FGP changes were pulled out to enable submission as an independent patch
2. Inadvertent tab spacing and comment changes were reverted

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  Add an 'order' argument to __page_cache_alloc() and do_read_cache_page().
  Ensure the allocated pages are compound pages.

William Kucharski (1):
  Add filemap_huge_fault() to attempt to satisfy page faults on
  memory-mapped read-only text pages using THP when possible.

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/mm.h      |   2 +
 include/linux/pagemap.h |  10 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 418 ++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  39 +++-
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   4 +-
 mm/vmscan.c             |   2 +-
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 16 files changed, 473 insertions(+), 40 deletions(-)

--
2.21.0
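As a rough illustration of the filesystem-side contract described above, a readpage() implementation that is safe to pair with filemap_huge_fault() would have to notice that it was handed a compound page and fill all of it in one pass. The sketch below is illustrative only; fs_read_extent() is a made-up stand-in rather than an existing API, and EOF handling is omitted:

/*
 * Sketch of the contract only: fs_read_extent() is a made-up stand-in
 * for however a given filesystem reads file data into pages.
 */
static int example_readpage(struct file *file, struct page *page)
{
	struct inode *inode = page->mapping->host;
	loff_t pos = page_offset(page);
	/* Fill the entire compound page, not just its first PAGESIZE bytes. */
	size_t len = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_SIZE;
	int err;

	/*
	 * The read must not add PAGESIZE pages or readahead entries for
	 * this range to the page cache; EOF handling is omitted here.
	 */
	err = fs_read_extent(inode, page, pos, len);
	if (!err)
		SetPageUptodate(page);
	unlock_page(page);
	return err;
}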
[PATCH v4 0/2] mm,thp: Add filemap_huge_fault() for THP
This set of patches is the first step towards a mechanism for automatically mapping read-only text areas of appropriate size and alignment to THPs whenever possible.

For now, the central routine, filemap_huge_fault(), and various support routines are only included if the experimental kernel configuration option RO_EXEC_FILEMAP_HUGE_FAULT_THP is enabled. This is because filemap_huge_fault() is dependent upon the address_space_operations vector readpage() pointing to a routine that will read and fill an entire large page at a time without polluting the page cache with PAGESIZE entries for the large page being mapped or performing readahead that would pollute the page cache entries for succeeding large pages.

Unfortunately, there is no good way to determine how many bytes were read by readpage(). At present, if filemap_huge_fault() were to call a conventional readpage() routine, it would only fill the first PAGESIZE bytes of the large page, which is definitely NOT the desired behavior. However, by making the code available now it is hoped that filesystem maintainers who have pledged to provide such a mechanism will do so more rapidly.

The first part of the patch adds an 'order' argument to __page_cache_alloc(), allowing callers to directly request page cache pages of various sizes. This code was provided by Matthew Wilcox.

The second part of the patch implements the filemap_huge_fault() mechanism as described above. As this code is only run when the experimental config option is set, there are some issues that need to be resolved, but this is a good step that will enable further development.

Changes since v3:
1. Multiple code review comments addressed
2. filemap_huge_fault() now does rcu locking when possible
3. filemap_huge_fault() now properly adds the THP to the page cache before calling readpage()

Changes since v2:
1. FGP changes were pulled out to enable submission as an independent patch
2. Inadvertent tab spacing and comment changes were reverted

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  Add an 'order' argument to __page_cache_alloc() and do_read_cache_page().
  Ensure the allocated pages are compound pages.

William Kucharski (1):
  Add filemap_huge_fault() to attempt to satisfy page faults on
  memory-mapped read-only text pages using THP when possible.

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/mm.h      |   2 +
 include/linux/pagemap.h |  10 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 357 ++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  38 -
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   4 +-
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 15 files changed, 413 insertions(+), 36 deletions(-)

--
2.21.0
[PATCH v4 1/2] mm: Allow the page cache to allocate large pages
Add an 'order' argument to __page_cache_alloc() and do_read_cache_page(). Ensure the allocated pages are compound pages. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: William Kucharski Reported-by: kbuild test robot --- fs/afs/dir.c| 2 +- fs/btrfs/compression.c | 2 +- fs/cachefiles/rdwr.c| 4 ++-- fs/ceph/addr.c | 2 +- fs/ceph/file.c | 2 +- include/linux/pagemap.h | 10 ++ mm/filemap.c| 20 +++- mm/readahead.c | 2 +- net/ceph/pagelist.c | 4 ++-- net/ceph/pagevec.c | 2 +- 10 files changed, 27 insertions(+), 23 deletions(-) diff --git a/fs/afs/dir.c b/fs/afs/dir.c index e640d67274be..0a392214f71e 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key) afs_stat_v(dvnode, n_inval); ret = -ENOMEM; - req->pages[i] = __page_cache_alloc(gfp); + req->pages[i] = __page_cache_alloc(gfp, 0); if (!req->pages[i]) goto error; ret = add_to_page_cache_lru(req->pages[i], diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 60c47b417a4b..5280e7477b7e 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, } page = __page_cache_alloc(mapping_gfp_constraint(mapping, -~__GFP_FS)); +~__GFP_FS), 0); if (!page) break; diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index 44a3ce1e4ce4..11d30212745f 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem_monitor; } @@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem; } diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index e078cc55b989..bcb41fbee533 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1707,7 +1707,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page) if (len > PAGE_SIZE) len = PAGE_SIZE; } else { - page = __page_cache_alloc(GFP_NOFS); + page = __page_cache_alloc(GFP_NOFS, 0); if (!page) { err = -ENOMEM; goto out; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 685a03cc4b77..ae58d7c31aa4 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) struct page *page = NULL; loff_t i_size; if (retry_op == READ_INLINE) { - page = __page_cache_alloc(GFP_KERNEL); + page = __page_cache_alloc(GFP_KERNEL, 0); if (!page) return -ENOMEM; } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c7552459a15f..92e026d9a6b7 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -208,17 +208,19 @@ static inline int page_cache_add_speculative(struct page *page, int count) } #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order) { - return alloc_pages(gfp, 0); + if (order > 0) + gfp |= __GFP_COMP; + return alloc_pages(gfp, order); } #endif static inline struct page *page_cache_alloc(struct address_space *x) { - return 
__page_cache_alloc(mapping_gfp_mask(x)); + return __page_cache_alloc(mapping_gfp_mask(x), 0); } static inline gfp_t readahead_gfp_mask(struct address_space *x) diff --git a/mm/fi
[PATCH v4 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
Add filemap_huge_fault() to attempt to satisfy page faults on memory-mapped read-only text pages using THP when possible. Signed-off-by: William Kucharski --- include/linux/mm.h | 2 + mm/Kconfig | 15 ++ mm/filemap.c | 337 +++-- mm/huge_memory.c | 3 + mm/mmap.c | 38 - mm/rmap.c | 4 +- 6 files changed, 386 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 0334ca97c584..2a5311721739 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2433,6 +2433,8 @@ extern void truncate_inode_pages_final(struct address_space *); /* generic vm_area_ops exported for stackable file systems */ extern vm_fault_t filemap_fault(struct vm_fault *vmf); +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf, + enum page_entry_size pe_size); extern void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); diff --git a/mm/Kconfig b/mm/Kconfig index 56cec636a1fc..2debaded0e4d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -736,4 +736,19 @@ config ARCH_HAS_PTE_SPECIAL config ARCH_HAS_HUGEPD bool +config RO_EXEC_FILEMAP_HUGE_FAULT_THP + bool "read-only exec filemap_huge_fault THP support (EXPERIMENTAL)" + depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM + + help + Introduce filemap_huge_fault() to automatically map executable + read-only pages of mapped files of suitable size and alignment + using THP if possible. + + This is marked experimental because it is a new feature and is + dependent upon filesystmes implementing readpages() in a way + that will recognize large THP pages and read file content to + them without polluting the pagecache with PAGESIZE pages due + to readahead. + endmenu diff --git a/mm/filemap.c b/mm/filemap.c index 38b46fc00855..aebf2f54f52e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -199,13 +199,12 @@ static void unaccount_page_cache_page(struct address_space *mapping, nr = hpage_nr_pages(page); __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); - if (PageSwapBacked(page)) { + + if (PageSwapBacked(page)) __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr); - if (PageTransHuge(page)) - __dec_node_page_state(page, NR_SHMEM_THPS); - } else { - VM_BUG_ON_PAGE(PageTransHuge(page), page); - } + + if (PageTransHuge(page)) + __dec_node_page_state(page, NR_SHMEM_THPS); /* * At this point page must be either written or cleaned by @@ -1663,7 +1662,8 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, no_page: if (!page && (fgp_flags & FGP_CREAT)) { int err; - if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping)) + if ((fgp_flags & FGP_WRITE) && + mapping_cap_account_dirty(mapping)) gfp_mask |= __GFP_WRITE; if (fgp_flags & FGP_NOFS) gfp_mask &= ~__GFP_FS; @@ -2643,6 +2643,326 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP +/* + * Check for an entry in the page cache which would conflict with the address + * range we wish to map using a THP or is otherwise unusable to map a large + * cached page. + * + * The routine will return true if a usable page is found in the page cache + * (and *pagep will be set to the address of the cached page), or if no + * cached page is found (and *pagep will be set to NULL). 
+ */ +static bool +filemap_huge_check_pagecache_usable(struct xa_state *xas, + struct page **pagep, pgoff_t hindex, pgoff_t hindex_max) +{ + struct page *page; + + while (1) { + page = xas_find(xas, hindex_max); + + if (xas_retry(xas, page)) { + xas_set(xas, hindex); + continue; + } + + /* +* A found entry is unusable if: +* + the entry is an Xarray value, not a pointer +* + the entry is an internal Xarray node +* + the entry is not a Transparent Huge Page +* + the entry is not a compound page +* + the entry is not the head of a compound page +* + the entry is a page page with an order other than +*HPAGE_PMD_ORDER +* + the page's index is not what we expect it to be +* + the page is not up-to-date +*/ +
Re: [PATCH v3 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
> On Aug 6, 2019, at 5:12 AM, Kirill A. Shutemov wrote:
>
> IIUC, you are missing ->vm_pgoff from the picture. The newly allocated
> page must land into page cache aligned on HPAGE_PMD_NR boundary. In other
> word you cannout have huge page with ->index, let say, 1.
>
> VMA is only suitable for at least one file-THP page if:
>
> - (vma->vm_start >> PAGE_SHIFT) % (HPAGE_PMD_NR - 1) is equal to
>   vma->vm_pgoff % (HPAGE_PMD_NR - 1)
>
>   This guarantees right alignment in the backing page cache.
>
> - *and* vma->vm_end - round_up(vma->vm_start, HPAGE_PMD_SIZE) is equal or
>   greater than HPAGE_PMD_SIZE.
>
> Does it make sense?

It makes sense, but what I was thinking of was, say, a vma->vm_start of 0x1ff000 and a vma->vm_end of 0x400000. Assuming x86, that range can be mapped with a PAGESIZE page at 0x1ff000 and then a PMD page mapping 0x200000 - 0x400000.

That doesn't mean a vma IS or COULD ever be configured that way, so you are correct with your comment, and I will change my check accordingly.

>> In the current code, it's assumed it is not exposed, because a single read
>> of a large page that does no readahead before the page is inserted into the
>> cache means there are no external users of the page.
>
> You've exposed the page to the filesystem once you call ->readpage().
> It *may* track the page somehow after the call.

OK, thanks again. I'll try to have a V4 available with these changes soon.
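The alignment rule described above corresponds roughly to the following check (a sketch modeled on the kernel's transhuge_vma_suitable() logic; the function name here is made up and the code is an approximation, not the exact kernel source):

/*
 * Sketch: can the fault at 'address' within 'vma' be mapped with a
 * PMD-sized page whose page cache ->index is HPAGE_PMD_NR aligned?
 */
static bool example_huge_fault_suitable(struct vm_area_struct *vma,
					unsigned long address)
{
	unsigned long haddr = address & HPAGE_PMD_MASK;

	/*
	 * The file offset and the virtual address must be congruent modulo
	 * HPAGE_PMD_NR pages, or the THP's ->index could not be aligned.
	 */
	if (((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff) &
	    (HPAGE_PMD_NR - 1))
		return false;

	/* The PMD-sized range must lie entirely within the VMA. */
	return haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end;
}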
Re: [PATCH v3 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
> On Aug 5, 2019, at 7:28 AM, Kirill A. Shutemov wrote:
>
>> Is there different terminology you'd prefer to see me use here to clarify
>> this?
>
> My point is that maybe we should just use ~HPAGE_P?D_MASK in code. The new
> HPAGE_P?D_OFFSET doesn't add much for readability in my opinion.

Fair enough, I'll make that change.

>> OK, I can do that; I didn't want to unnecessarily eliminate the
>> VM_BUG_ON_PAGE(PageTransHuge(page)) call for everyone given this
>> is CONFIG experimental code.
>
> If you bring the feature, you bring the feature. Just drop these VM_BUGs.

OK.

> I'm not sure. It will be costly comparing to PageTransCompound/Huge as we
> need to check more than flags.
>
> To exclude hugetlb pages here, use VM_BUG_ON_PAGE(PageHuge(page), page).
> It will allow to catch wrong usage of the function.

That will also do it, and that way we will better know if it ever happens, which of course it shouldn't.

>> This routine's main function other than validation is to make sure the page
>> cache has not been polluted between when we go out to read the large page
>> and when the page is added to the cache (more on that coming up.) For
>> example, the way I was able to tell readahead() was polluting future
>> possible THP mappings is because after a buffered read I would typically see
>> 52 (the readahead size) PAGESIZE pages for the next 2M range in the page
>> cache.
>
> My point is that you should only see compound pages here if they are
> HPAGE_PMD_ORDER, shouldn't you? Any other order of compound page would be
> unexpected to say the least.

Yes, compound pages should only be HPAGE_PMD_ORDER. The routine and the check will need to be updated if we ever can allocate/cache larger pages.

>> It's my understanding that pages in the page cache should be locked, so I
>> wanted to check for that.
>
> No. They are get locked temporary for some operation, but not all the
> time.

OK, thanks for that.

>> I don't really care if the start of the VMA is suitable, just whether I can
>> map the current faulting page with a THP. As far as I know, there's nothing
>> wrong with mapping all the pages before the VMA hits a properly aligned
>> bound with PAGESIZE pages and then aligned chunks in the middle with THP.
>
> You cannot map any paged as huge into wrongly aligned VMA.
>
> THP's ->index must be aligned to HPAGE_PMD_NR, so if the combination VMA's
> ->vm_start and ->vm_pgoff doesn't allow for this, you must fallback to
> mapping the page with PTEs. I don't see it handled properly here.

It was my assumption that if, say, a VMA started at an address one page before a large page alignment, you could map that first page with a PAGESIZE page and then, if the VMA size allowed and a fault occurred on the next page, map that next range with a large page, taking the approach of mapping chunks of the VMA with the largest page possible.

Is it that the start of the VMA must always align, or that the entire VMA must be properly aligned and a multiple of the PMD size (so you either map with all large pages or none)?

>> This is the page that content was just read to; readpage() will unlock the
>> page when it is done with I/O, but the page needs to be locked before it's
>> inserted into the page cache.
>
> Then you must to lock the page properly with lock_page().
>
> __SetPageLocked() is fine for just allocated pages that was not exposed
> anywhere. After ->readpage() it's not the case and it's not safe to use
> __SetPageLocked() for them.
In the current code, it's assumed it is not exposed, because a single read of a large page that does no readahead before the page is inserted into the cache means there are no external users of the page. Regardless, I can make this change as part of the changes I will need to to reorder when the page is inserted into the cache. >> I can make that change; originally alloc_set_pte() didn't use the second >> parameter at all when mapping a read-only page. >> >> Even now, if the page isn't writable, it would only be dereferenced by a >> VM_BUG_ON_PAGE() call if it's COW. > > Please do change this. It has to be future-proof. OK, thanks. >> My thinking had been if any part of reading a large page and mapping it had >> failed, the code could just put_page() the newly allocated page and fallback >> to mapping the page with PAGESIZE pages rather than add a THP to the cache. > > I think it's must change. We should not allow inconsistent view on page > cache. Yes, I can see where it would be confusing to anyone looking at it that assumed the page must already be in the cache before readpage() is called. >> If mprotect() is called, wouldn't the pages be COWed to PAGESIZE pages the >> first time the area was written to? I may be way off on this assumption. > > Okay, fair enough. COW will happen for private mappings. > > But for private mappings you don't need to enforce even RO. All permission > mask should be fine. Thanks. >> Once agai
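To illustrate the locking point raised earlier in this exchange, the two idioms differ only in when they are safe to use; the helper functions below are illustrative, not code from the patch:

/* Illustrative only: a freshly allocated page is still private to us. */
static struct page *example_alloc_locked_page(gfp_t gfp, unsigned int order)
{
	struct page *page = __page_cache_alloc(gfp, order);

	/*
	 * Safe here: the page is not yet visible to the page cache or the
	 * filesystem, so the non-atomic __SetPageLocked() cannot race.
	 */
	if (page)
		__SetPageLocked(page);
	return page;
}

/* Illustrative only: once the page has been exposed, lock it for real. */
static void example_touch_exposed_page(struct page *page)
{
	/*
	 * After the page has been added to the page cache or handed to
	 * ->readpage(), other contexts may hold references, so only the
	 * atomic lock_page()/unlock_page() pair is safe.
	 */
	lock_page(page);
	/* ... inspect or modify the locked page ... */
	unlock_page(page);
}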
Re: [PATCH v3 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
On 8/1/19 6:36 AM, Kirill A. Shutemov wrote: #ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define HPAGE_PMD_SHIFT PMD_SHIFT -#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT) -#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1)) - -#define HPAGE_PUD_SHIFT PUD_SHIFT -#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT) -#define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1)) +#define HPAGE_PMD_SHIFTPMD_SHIFT +#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT) +#define HPAGE_PMD_OFFSET (HPAGE_PMD_SIZE - 1) +#define HPAGE_PMD_MASK (~(HPAGE_PMD_OFFSET)) + +#define HPAGE_PUD_SHIFTPUD_SHIFT +#define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT) +#define HPAGE_PUD_OFFSET (HPAGE_PUD_SIZE - 1) +#define HPAGE_PUD_MASK (~(HPAGE_PUD_OFFSET)) OFFSET vs MASK semantics can be confusing without reading the definition. We don't have anything similar for base page size, right (PAGE_OFFSET is completely different thing :P)? I came up with the OFFSET definitions, the MASK definitions already existed in huge_mm.h, e.g.: #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1)) Is there different terminology you'd prefer to see me use here to clarify this? +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf, + enum page_entry_size pe_size); +#endif + No need for #ifdef here. I wanted to avoid referencing an extern that wouldn't exist if the config option wasn't set; I can remove it. + +#ifndefCONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP if (PageSwapBacked(page)) { __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr); if (PageTransHuge(page)) @@ -206,6 +208,13 @@ static void unaccount_page_cache_page(struct address_space *mapping, } else { VM_BUG_ON_PAGE(PageTransHuge(page), page); } +#else + if (PageSwapBacked(page)) + __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr); + + if (PageTransHuge(page)) + __dec_node_page_state(page, NR_SHMEM_THPS); +#endif Again, no need for #ifdef: the new definition should be fine for everybody. OK, I can do that; I didn't want to unnecessarily eliminate the VM_BUG_ON_PAGE(PageTransHuge(page)) call for everyone given this is CONFIG experimental code. PageCompound() and PageTransCompound() are the same thing if THP is enabled at compile time. PageHuge() check here is looking out of place. I don't thing we can ever will see hugetlb pages here. What I'm trying to do is sanity check that what the cache contains is a THP page. I added the PageHuge() check simply because PageTransCompound() is true for both THP and hugetlbfs pages, and there's no routine that returns true JUST for THP pages; perhaps there should be? +* + the enbry is a page page with an order other than Typo. Thanks, fixed. +*HPAGE_PMD_ORDER If you see unexpected page order in page cache, something went horribly wrong, right? This routine's main function other than validation is to make sure the page cache has not been polluted between when we go out to read the large page and when the page is added to the cache (more on that coming up.) For example, the way I was able to tell readahead() was polluting future possible THP mappings is because after a buffered read I would typically see 52 (the readahead size) PAGESIZE pages for the next 2M range in the page cache. +* + the page's index is not what we expect it to be Same here. Same rationale. +* + the page is not up-to-date +* + the page is unlocked Confused here. These should never be true, but I wanted to double check for them anyway. I can eliminate the checks if we are satisfied these states can "never" happen for a cached page. 
Do you expect caller to lock page before the check? If so, state it in the comment for the function. It's my understanding that pages in the page cache should be locked, so I wanted to check for that. This routine is validating entries we find in the page cache to see whether they are conflicts or valid cached THP pages. Wow. That's unreadable. Can we rewrite it something like (commenting each check): I can definitely break it down into multiple checks; it is a bit dense, thus the comment but you're correct, it will read better if broken down more. You also need to check that VMA alignment is suitable for huge pages. See transhuge_vma_suitable(). I don't really care if the start of the VMA is suitable, just whether I can map the current faulting page with a THP. As far as I know, there's nothing wrong with mapping all the pages before the VMA hits a properly aligned bound with PAGESIZE pages and then aligned chunks in the middle with THP. + if (unlikely(!(PageCompound(new_page { How can it happen? That check was alrea
Re: [PATCH v3 0/2] mm,thp: Add filemap_huge_fault() for THP
On 7/31/19 2:35 AM, Song Liu wrote: Could you please explain how to test/try this? Would it automatically map all executables to THPs? Until there is filesystem support you can't actually try this, though I have tested it through some hacks during development and am also working on some other methods to be able to test this before large page filesystem read support is in place. The end goal is that if enabled, when a fault occurs for an RO executable where the faulting address lies within a vma properly aligned/sized for the fault to be satisfied by mapping a THP, and the kernel can allocate a THP, the fault WILL be satisfied by mapping the THP. It's not expected that all executables nor even all pages of all executables would be THP-mapped, just those executables and ranges where alignment and size permit. Future optimizations may include fine-tuning these checks to try to better determine whether an application would actually benefit from THP mapping. From some quick and dirty experiments I performed, I've seen that there are a surprising number of applications that may end up with THP-mapped pages, including Perl, Chrome and Firefox. However I don't yet know what the actual vs. theoretical benefits would be. -- Bill
[PATCH v3 1/2] mm: Allow the page cache to allocate large pages
Add an 'order' argument to __page_cache_alloc() and do_read_cache_page(). Ensure the allocated pages are compound pages. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: William Kucharski Reported-by: kbuild test robot --- fs/afs/dir.c| 2 +- fs/btrfs/compression.c | 2 +- fs/cachefiles/rdwr.c| 4 ++-- fs/ceph/addr.c | 2 +- fs/ceph/file.c | 2 +- include/linux/pagemap.h | 10 ++ mm/filemap.c| 20 +++- mm/readahead.c | 2 +- net/ceph/pagelist.c | 4 ++-- net/ceph/pagevec.c | 2 +- 10 files changed, 27 insertions(+), 23 deletions(-) diff --git a/fs/afs/dir.c b/fs/afs/dir.c index e640d67274be..0a392214f71e 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -274,7 +274,7 @@ static struct afs_read *afs_read_dir(struct afs_vnode *dvnode, struct key *key) afs_stat_v(dvnode, n_inval); ret = -ENOMEM; - req->pages[i] = __page_cache_alloc(gfp); + req->pages[i] = __page_cache_alloc(gfp, 0); if (!req->pages[i]) goto error; ret = add_to_page_cache_lru(req->pages[i], diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 60c47b417a4b..5280e7477b7e 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -466,7 +466,7 @@ static noinline int add_ra_bio_pages(struct inode *inode, } page = __page_cache_alloc(mapping_gfp_constraint(mapping, -~__GFP_FS)); +~__GFP_FS), 0); if (!page) break; diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index 44a3ce1e4ce4..11d30212745f 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -259,7 +259,7 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem_monitor; } @@ -495,7 +495,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object, goto backing_page_already_present; if (!newpage) { - newpage = __page_cache_alloc(cachefiles_gfp); + newpage = __page_cache_alloc(cachefiles_gfp, 0); if (!newpage) goto nomem; } diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index e078cc55b989..bcb41fbee533 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1707,7 +1707,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page) if (len > PAGE_SIZE) len = PAGE_SIZE; } else { - page = __page_cache_alloc(GFP_NOFS); + page = __page_cache_alloc(GFP_NOFS, 0); if (!page) { err = -ENOMEM; goto out; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 685a03cc4b77..ae58d7c31aa4 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -1305,7 +1305,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) struct page *page = NULL; loff_t i_size; if (retry_op == READ_INLINE) { - page = __page_cache_alloc(GFP_KERNEL); + page = __page_cache_alloc(GFP_KERNEL, 0); if (!page) return -ENOMEM; } diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c7552459a15f..92e026d9a6b7 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -208,17 +208,19 @@ static inline int page_cache_add_speculative(struct page *page, int count) } #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +extern struct page *__page_cache_alloc(gfp_t gfp, unsigned int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct page *__page_cache_alloc(gfp_t gfp, unsigned int order) { - return alloc_pages(gfp, 0); + if (order > 0) + gfp |= __GFP_COMP; + return alloc_pages(gfp, order); } #endif static inline struct page *page_cache_alloc(struct address_space *x) { - return 
__page_cache_alloc(mapping_gfp_mask(x)); + return __page_cache_alloc(mapping_gfp_mask(x), 0); } static inline gfp_t readahead_gfp_mask(struct address_space *x) diff --git a/mm/fi
[PATCH v3 0/2] mm,thp: Add filemap_huge_fault() for THP
This set of patches is the first step towards a mechanism for automatically mapping read-only text areas of appropriate size and alignment to THPs whenever possible.

For now, the central routine, filemap_huge_fault(), and various support routines are only included if the experimental kernel configuration option RO_EXEC_FILEMAP_HUGE_FAULT_THP is enabled. This is because filemap_huge_fault() is dependent upon the address_space_operations vector readpage() pointing to a routine that will read and fill an entire large page at a time without polluting the page cache with PAGESIZE entries for the large page being mapped or performing readahead that would pollute the page cache entries for succeeding large pages.

Unfortunately, there is no good way to determine how many bytes were read by readpage(). At present, if filemap_huge_fault() were to call a conventional readpage() routine, it would only fill the first PAGESIZE bytes of the large page, which is definitely NOT the desired behavior. However, by making the code available now it is hoped that filesystem maintainers who have pledged to provide such a mechanism will do so more rapidly.

The first part of the patch adds an 'order' argument to __page_cache_alloc(), allowing callers to directly request page cache pages of various sizes. This code was provided by Matthew Wilcox.

The second part of the patch implements the filemap_huge_fault() mechanism as described above.

Changes since v2:
1. FGP changes were pulled out to enable submission as an independent patch
2. Inadvertent tab spacing and comment changes were reverted

Changes since v1:
1. Fix improperly generated patch for v1 PATCH 1/2

Matthew Wilcox (1):
  mm: Allow the page cache to allocate large pages

William Kucharski (1):
  Add filemap_huge_fault() to attempt to satisfy page faults on
  memory-mapped read-only text pages using THP when possible.

 fs/afs/dir.c            |   2 +-
 fs/btrfs/compression.c  |   2 +-
 fs/cachefiles/rdwr.c    |   4 +-
 fs/ceph/addr.c          |   2 +-
 fs/ceph/file.c          |   2 +-
 include/linux/huge_mm.h |  16 +-
 include/linux/mm.h      |   6 +
 include/linux/pagemap.h |  10 +-
 mm/Kconfig              |  15 ++
 mm/filemap.c            | 320 ++--
 mm/huge_memory.c        |   3 +
 mm/mmap.c               |  36 -
 mm/readahead.c          |   2 +-
 mm/rmap.c               |   8 +
 net/ceph/pagelist.c     |   4 +-
 net/ceph/pagevec.c      |   2 +-
 16 files changed, 401 insertions(+), 33 deletions(-)

--
2.21.0
[PATCH v3 2/2] mm,thp: Add experimental config option RO_EXEC_FILEMAP_HUGE_FAULT_THP
Add filemap_huge_fault() to attempt to satisfy page faults on memory-mapped read-only text pages using THP when possible. Signed-off-by: William Kucharski --- include/linux/huge_mm.h | 16 ++- include/linux/mm.h | 6 + mm/Kconfig | 15 ++ mm/filemap.c| 300 +++- mm/huge_memory.c| 3 + mm/mmap.c | 36 - mm/rmap.c | 8 ++ 7 files changed, 374 insertions(+), 10 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 45ede62aa85b..b1e5fd3179fd 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -79,13 +79,15 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_NR (1<index != hindex || (!PageUptodate(page)) || + (!PageLocked(page + return false; + + break; + } + + xas_set(xas, hindex); + *pagep = page; + return true; +} + +/** + * filemap_huge_fault - read in file data for page fault handling to THP + * @vmf: struct vm_fault containing details of the fault + * @pe_size: large page size to map, currently this must be PE_SIZE_PMD + * + * filemap_huge_fault() is invoked via the vma operations vector for a + * mapped memory region to read in file data to a transparent huge page during + * a page fault. + * + * If for any reason we can't allocate a THP, map it or add it to the page + * cache, VM_FAULT_FALLBACK will be returned which will cause the fault + * handler to try mapping the page using a PAGESIZE page, usually via + * filemap_fault() if so speicifed in the vma operations vector. + * + * Returns either VM_FAULT_FALLBACK or the result of calling allcc_set_pte() + * to map the new THP. + * + * NOTE: This routine depends upon the file system's readpage routine as + * specified in the address space operations vector to recognize when it + * is being passed a large page and to read the approprate amount of data + * in full and without polluting the page cache for the large page itself + * with PAGESIZE pages to perform a buffered read or to pollute what + * would be the page cache space for any succeeding pages with PAGESIZE + * pages due to readahead. + * + * It is VITAL that this routine not be enabled without such filesystem + * support. As there is no way to determine how many bytes were read by + * the readpage() operation, if only a PAGESIZE page is read, this routine + * will map the THP containing only the first PAGESIZE bytes of file data + * to satisfy the fault, which is never the result desired. 
+ */ +vm_fault_t filemap_huge_fault(struct vm_fault *vmf, + enum page_entry_size pe_size) +{ + struct file *filp = vmf->vma->vm_file; + struct address_space *mapping = filp->f_mapping; + struct vm_area_struct *vma = vmf->vma; + + unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + pgoff_t hindex = round_down(vmf->pgoff, HPAGE_PMD_NR); + pgoff_t hindex_max = hindex + HPAGE_PMD_NR; + + struct page *cached_page, *hugepage; + struct page *new_page = NULL; + + vm_fault_t ret = VM_FAULT_FALLBACK; + int error; + + XA_STATE_ORDER(xas, &mapping->i_pages, hindex, HPAGE_PMD_ORDER); + + /* +* Return VM_FAULT_FALLBACK if: +* +* + pe_size != PE_SIZE_PMD +* + FAULT_FLAG_WRITE is set in vmf->flags +* + vma isn't aligned to allow a PMD mapping +* + PMD would extend beyond the end of the vma +*/ + if (pe_size != PE_SIZE_PMD || (vmf->flags & FAULT_FLAG_WRITE) || + (haddr < vma->vm_start || + (haddr + HPAGE_PMD_SIZE > vma->vm_end))) + return ret; + + xas_lock_irq(&xas); + +retry_xas_locked: + if (!filemap_huge_check_pagecache_usable(&xas, &cached_page, hindex, + hindex_max)) { + /* found a conflicting entry in the page cache, so fallback */ + goto unlock; + } else if (cached_page) { + /* found a valid cached page, so map it */ + hugepage = cached_page; + goto map_huge; + } + + xas_unlock_irq(&xas); + + /* allocate huge THP page in VMA */ + new_page = __page_cache_alloc(vmf->gfp_mask | __GFP_COMP | + __GFP_NOWARN | __GFP_NORETRY, HPAGE_PMD_ORDER); + + if (unlikely(!new_page)) + return ret; + + if (unlikely(!(PageCompound(new_page { + put_page(new_page); + return ret; + } + + prep_transhuge_page(new_page); + new_page->index = hindex; + new_page->mapping = mapping; + + __SetPageLocked(new_page); + + /* +* The readpage() operation below is expected to fill the large +
Re: [PATCH 2/3] sgi-gru: Remove CONFIG_HUGETLB_PAGE ifdef
> On Jul 22, 2019, at 11:50 AM, Bharath Vedartham wrote: > >> >> >> In all likelihood, these questions are no-ops, and the optimizer may even >> make my questions completely moot, but I thought I might as well ask anyway. >> > That sounds reasonable. I am not really sure as to how much of > an improvement it would be, the condition will be evaluated eitherways > AFAIK? Eitherways, the ternary operator does not look good. I ll make a > version 2 of this. In THEORY the "unlikely" hints to the compiler that that leg of the "if" can be made the branch and jump leg, though in reality optimization is much more complex than that. Still, the unlikely() call is also nicely self-documenting as to what the expected outcome is. Reviewed-by: William Kucharski
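A minimal example of the pattern under discussion (illustrative only, not the actual sgi-gru code):

/*
 * Mark the hugetlb case as the unlikely one so the compiler can lay out
 * the common PAGE_SHIFT path as the straight-line, fall-through branch.
 */
static unsigned int example_pageshift(struct vm_area_struct *vma)
{
	if (unlikely(is_vm_hugetlb_page(vma)))
		return HPAGE_SHIFT;
	return PAGE_SHIFT;
}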
Re: [PATCH 2/3] sgi-gru: Remove CONFIG_HUGETLB_PAGE ifdef
I suspect I'm being massively pedantic here, but the comments for atomic_pte_lookup() note: * Only supports Intel large pages (2MB only) on x86_64. * ZZZ - hugepage support is incomplete That makes me wonder how many systems using this hardware are actually configured with CONFIG_HUGETLB_PAGE. I ask as in the most common case, this is likely introducing a few extra instructions and possibly an additional branch to a routine that is called per-fault. So the nit-picky questions are: 1) Does the code really need to be cleaned up in this way? 2) If it does, does it make more sense (given the way pmd_large() is handled now in atomic_pte_lookup()) for this to be coded as: if (unlikely(is_vm_hugetlb_page(vma))) *pageshift = HPAGE_SHIFT; else *pageshift = PAGE_SHIFT; In all likelihood, these questions are no-ops, and the optimizer may even make my questions completely moot, but I thought I might as well ask anyway. > On Jul 21, 2019, at 9:58 AM, Bharath Vedartham wrote: > > is_vm_hugetlb_page has checks for whether CONFIG_HUGETLB_PAGE is defined > or not. If CONFIG_HUGETLB_PAGE is not defined is_vm_hugetlb_page will > always return false. There is no need to have an uneccessary > CONFIG_HUGETLB_PAGE check in the code. > > Cc: Ira Weiny > Cc: John Hubbard > Cc: Jérôme Glisse > Cc: Greg Kroah-Hartman > Cc: Dimitri Sivanich > Cc: Arnd Bergmann > Cc: linux-kernel@vger.kernel.org > Cc: linux...@kvack.org > Signed-off-by: Bharath Vedartham > --- > drivers/misc/sgi-gru/grufault.c | 11 +++ > 1 file changed, 3 insertions(+), 8 deletions(-) > > diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c > index 61b3447..75108d2 100644 > --- a/drivers/misc/sgi-gru/grufault.c > +++ b/drivers/misc/sgi-gru/grufault.c > @@ -180,11 +180,8 @@ static int non_atomic_pte_lookup(struct vm_area_struct > *vma, > { > struct page *page; > > -#ifdef CONFIG_HUGETLB_PAGE > *pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT; > -#else > - *pageshift = PAGE_SHIFT; > -#endif > + > if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0) > return -EFAULT; > *paddr = page_to_phys(page); > @@ -238,11 +235,9 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, > unsigned long vaddr, > return 1; > > *paddr = pte_pfn(pte) << PAGE_SHIFT; > -#ifdef CONFIG_HUGETLB_PAGE > + > *pageshift = is_vm_hugetlb_page(vma) ? HPAGE_SHIFT : PAGE_SHIFT; > -#else > - *pageshift = PAGE_SHIFT; > -#endif > + > return 0; > > err: > -- > 2.7.4 >
Re: [PATCH uprobe, thp 3/4] uprobe: support huge page by only splitting the pmd
Is there any reason to worry about supporting PUD-sized uprobe pages if CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD is defined? I would prefer not to bake in the assumption that "huge" means PMD-sized any more than it already is.

For example, if CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD is configured, mm_address_trans_huge() should either make the call to pud_trans_huge() if appropriate, or a VM_BUG_ON_PAGE should be added in case the routine is ever called with a PUD-sized page.

Otherwise it looks pretty reasonable to me.

-- Bill
Re: [PATCH] mm: vmscan: correct nr_reclaimed for THP
> On May 10, 2019, at 10:36 AM, Matthew Wilcox wrote: > > Please don't. That embeds the knowledge that we can only swap out either > normal pages or THP sized pages. I'm trying to make the VM capable of > supporting arbitrary-order pages, and this would be just one more place > to fix. > > I'm sympathetic to the "self documenting" argument. My current tree has > a patch in it: > >mm: Introduce compound_nr > >Replace 1 << compound_order(page) with compound_nr(page). Minor >improvements in readability. > > It goes along with this patch: > >mm: Introduce page_size() > >It's unnecessarily hard to find out the size of a potentially huge page. >Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). > > Better suggestions on naming gratefully received. I'm more happy with > page_size() than I am with compound_nr(). page_nr() gives the wrong > impression; page_count() isn't great either. I like page_size() as well. At least to me, page_nr() or page_count() would imply a basis of PAGESIZE, or that you would need to do something like: page_size = page_nr() << PAGE_SHIFT; to get the size in bytes; page_size() is more straightforward in that respect.
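For reference, the helpers Matthew describes amount to thin wrappers along these lines (a sketch of the idea, not necessarily the exact code that was merged):

/* Sketch: number of base pages in a (possibly compound) page. */
static inline unsigned long compound_nr(struct page *page)
{
	return 1UL << compound_order(page);
}

/* Sketch: size in bytes of a (possibly compound) page. */
static inline unsigned long page_size(struct page *page)
{
	return PAGE_SIZE << compound_order(page);
}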
Re: [PATCH] mm: vmscan: correct nr_reclaimed for THP
> On May 9, 2019, at 9:03 PM, Huang, Ying wrote: > > Yang Shi writes: > >> On 5/9/19 7:12 PM, Huang, Ying wrote: >>> >>> How about to change this to >>> >>> >>> nr_reclaimed += hpage_nr_pages(page); >> >> Either is fine to me. Is this faster than "1 << compound_order(page)"? > > I think the readability is a little better. And this will become > >nr_reclaimed += 1 > > if CONFIG_TRANSPARENT_HUAGEPAGE is disabled. I find this more legible and self documenting, and it avoids the bit shift operation completely on the majority of systems where THP is not configured.
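For context, hpage_nr_pages() is roughly the following, which is why the expression collapses to a constant 1 when THP is compiled out (sketch only, not a verbatim copy of the header):

static inline int hpage_nr_pages(struct page *page)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* A THP head page accounts for HPAGE_PMD_NR base pages. */
	if (unlikely(PageTransHuge(page)))
		return HPAGE_PMD_NR;
#endif
	return 1;
}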
Re: [PATCH 5/4] 9p: pass the correct prototype to read_cache_page
> On May 2, 2019, at 7:04 AM, Christoph Hellwig wrote: > > Except that we don't pass v9fs_vfs_readpage as the filler any more, > we now pass v9fs_fid_readpage. True, so never mind. :-)
Re: [PATCH 5/4] 9p: pass the correct prototype to read_cache_page
1) You need to pass "filp" rather than "filp->private_data" to read_cache_pages() in v9fs_fid_readpage(). The patched code passes "filp->private_data" as the "data" parameter to read_cache_pages(), which would generate a call to: filler(data, page) which would become a call to: static int v9fs_vfs_readpage(struct file *filp, struct page *page) { return v9fs_fid_readpage(filp->private_data, page); } which would then effectively become: v9fs_fid_readpage(filp->private_data->private_data, page) Which isn't correct; because data is a void *, no error is thrown when v9fs_vfs_readpages treats filp->private_data as if it is filp. 2) I'd also like to see an explicit comment in do_read_cache_page() along the lines of: /* * If a custom page filler was passed in use it, otherwise use the * standard readpage() routine defined for the address_space. * */ 3) Patch 5/4? Otherwise it looks good. Reviewed-by: William Kucharski > On May 1, 2019, at 11:34 AM, Christoph Hellwig wrote: > > Fix the callback 9p passes to read_cache_page to actually have the > proper type expected. Casting around function pointers can easily > hide typing bugs, and defeats control flow protection. > > Signed-off-by: Christoph Hellwig > --- > fs/9p/vfs_addr.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c > index 0bcbcc20f769..02e0fc51401e 100644 > --- a/fs/9p/vfs_addr.c > +++ b/fs/9p/vfs_addr.c > @@ -50,8 +50,9 @@ > * @page: structure to page > * > */ > -static int v9fs_fid_readpage(struct p9_fid *fid, struct page *page) > +static int v9fs_fid_readpage(void *data, struct page *page) > { > + struct p9_fid *fid = data; > struct inode *inode = page->mapping->host; > struct bio_vec bvec = {.bv_page = page, .bv_len = PAGE_SIZE}; > struct iov_iter to; > @@ -122,7 +123,8 @@ static int v9fs_vfs_readpages(struct file *filp, struct > address_space *mapping, > if (ret == 0) > return ret; > > - ret = read_cache_pages(mapping, pages, (void *)v9fs_vfs_readpage, filp); > + ret = read_cache_pages(mapping, pages, v9fs_fid_readpage, > + filp->private_data); > p9_debug(P9_DEBUG_VFS, " = %d\n", ret); > return ret; > }
Re: [PATCH] mm/gup.c: fix the wrong comments
> On Apr 4, 2019, at 1:23 AM, Huang Shijie wrote: > > > + * This function is different from the get_user_pages_unlocked(): > + * The @pages may has different page order with the result > + * got by get_user_pages_unlocked(). > + * I suggest a slight rewrite of the comment, something like: * Note this routine may fill the pages array with entries in a * different order than get_user_pages_unlocked(), which may cause * issues for callers expecting the routines to be equivalent.
Re: CONFIG_DEBUG_VIRTUAL breaks boot on x86-32
The dmesg output you posted confirms that max_low_pfn is indeed 0x373fe, and it appears that the value of phys_addr being checked may be 0x3f401ff1, which translates to pfn 0x3f401, at least if what's still in registers can be believed. Since that is indeed greater than max_low_pfn, VIRTUAL_BUG_ON triggers:

VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);

Looking at the call stack of copy_strings+0x220 / __check_object_size+0xef, that looks to translate to this sequence:

copy_from_user(kaddr+offset, str, bytes_to_copy)
  check_copy_size(kaddr+offset, bytes_to_copy, FALSE)
    check_object_size(kaddr+offset, bytes_to_copy, FALSE)
      __check_object_size(kaddr+offset, bytes_to_copy, FALSE)
        check_heap_object(kaddr+offset, bytes_to_copy, FALSE)
          virt_to_head_page(kaddr+offset)
            virt_to_page(kaddr+offset)
              pfn_to_page(kaddr+offset)
                __pa(kaddr+offset)
                  __phys_addr(kaddr+offset)

so it appears the address is "too high" for low memory at that point.
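For reference, a condensed paraphrase of the x86-32 check being discussed; the real arch/x86/mm/physaddr.c function contains additional validation that is omitted here:

/* Condensed sketch, not an exact quote of the source. */
unsigned long __phys_addr(unsigned long x)
{
	unsigned long phys_addr = x - PAGE_OFFSET;

	VIRTUAL_BUG_ON(x < PAGE_OFFSET);
	/* The check that appears to fire here: the pfn must be low memory. */
	VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
	return phys_addr;
}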
Re: CONFIG_DEBUG_VIRTUAL breaks boot on x86-32
Does this still happen on 5.1-rc2? Do you have any idea as to what max_low_pfn gets set to on your system at boot time? From the screen shot I'm guessing it MIGHT be 0x373fe, but it's hard to know for sure.

> On Mar 21, 2019, at 2:22 PM, Meelis Roos wrote:
>
> I tried to debug another problem and turned on most debug options for memory.
> The resulting kernel failed to boot.
>
> Bisecting the configurations led to CONFIG_DEBUG_VIRTUAL - if I turned it on
> in addition to some other debug options, the machine crashed with
>
> kernel BUG at arch/x86/mm/physaddr.c:79!
>
> Screenshot at http://kodu.ut.ee/~mroos/debug_virtual-boot-hang-1.jpg
>
> The machine was Athlon XP with VIA KT600 chipset and 2G RAM.
>
> --
> Meelis Roos
>
Re: [PATCH 1/3] mm/sparse: Clean up the obsolete code comment
> On Mar 21, 2019, at 4:35 AM, Michal Hocko wrote: > > I am sorry to be snarky but hasn't this generated way much more email > traffic than it really deserves? A simply and trivial clean up in the > beginning that was it, right? That's rather the point; that it did generate a fair amount of email traffic indicates it's worthy of at least a passing mention in a comment somewhere.
Re: [PATCH 1/3] mm/sparse: Clean up the obsolete code comment
> On Mar 21, 2019, at 3:21 AM, Baoquan He wrote:

It appears, as is so often the case, that the usage has far outpaced the documentation, and -EEXIST may be the proper code to return. The correct answer here may be to modify the documentation to note the additional semantic, though if the usage is solely within the kernel it may be sufficient to explain its use in the header comment for the routine (in this case sparse_add_one_section()).
Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
> On Mar 19, 2019, at 10:33 PM, Jerome Glisse wrote: > > So i believe best we could do is send a SIGBUS to the process that has > GUPed a range of a file that is being truncated this would match what > we do for CPU acces. There is no reason access through GUP should be > handled any differently. This should be done lazily, as there's no need to send the SIGBUS unless the GUPed page is actually accessed post-truncate.
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
> On Mar 14, 2019, at 7:30 AM, Jan Kara wrote: > > Well I have some crash reports couple years old and they are not from QA > departments. So I'm pretty confident there are real users that use this in > production... and just reboot their machine in case it crashes. Do you know what the use case in those crashes actually was? I'm curious to know they were actually cases of say DMA from a video capture card or if the uses posited to date are simply theoretical. It's always good to know who might be doing this and why if for no other reason than as something to keep in mind when designing future interfaces.
Re: [PATCH] mm/filemap: fix minor typo
> Cc: Matthew Wilcox > Signed-off-by: Laurent Dufour > --- > mm/filemap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index cace3eb8069f..377cedaa3ae5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1440,7 +1440,7 @@ pgoff_t page_cache_next_miss(struct address_space > *mapping, > EXPORT_SYMBOL(page_cache_next_miss); > > /** > - * page_cache_prev_miss() - Find the next gap in the page cache. > + * page_cache_prev_miss() - Find the previous gap in the page cache. > * @mapping: Mapping. > * @index: Index. > * @max_scan: Maximum range to search. > -- > 2.21.0 > Reviewed-by: William Kucharski
Re: [PATCH v3] page cache: Store only head pages in i_pages
Other than the bug Song found in memfd_tag_pins(), I'd like to suggest two quick but pedantic changes to mm/filemap.c: Though not modified in this patch, in line 284, the parenthesis should be moved to after the period: * modified.) The function expects only THP head pages to be present in the > + * Move to the next page in the vector if this is a small page > + * or the index is of the last page in this compound page). A few lines later, there is an extraneous parenthesis, and the comment could be a bit clearer. Might I suggest: * Move to the next page in the vector if this is a PAGESIZE * page or if the index is of the last PAGESIZE page within * this compound page. You can say "base" instead of "PAGESIZE," but "small" seems open to interpretation. I haven't run across any problems and have been hammering the code for over five days without issue; all my testing was with transparent_hugepage/enabled set to "always." Tested-by: William Kucharski Reviewed-by: William Kucharski
Re: [PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly
> On Feb 28, 2019, at 11:22 AM, Andrew Morton wrote: > > I don't think so. This kernedoc comment was missing its leading /**. > The patch fixes that. That makes sense; it had looked like just an extraneous asterisk.
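For anyone unfamiliar with the distinction: a comment block opened with the double-star form is picked up as kernel-doc by scripts/kernel-doc, while a single-star opener is just an ordinary comment. A minimal, made-up example of the expected shape (the function name and arguments are invented for illustration):

/**
 * example_isolate_pages - Example kernel-doc header (name and args made up).
 * @nr_to_scan: How many pages to examine.
 * @lruvec: Which LRU vector to scan.
 *
 * A comment opened with the double-star form is parsed as kernel-doc;
 * opened with a single star it is ignored by the documentation build.
 */
static unsigned long example_isolate_pages(unsigned long nr_to_scan,
					   struct lruvec *lruvec)
{
	return 0;	/* placeholder body; only the comment format matters */
}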
Re: [PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly
> On Feb 28, 2019, at 1:33 AM, Andrey Ryabinin wrote: > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a9852ed7b97f..2d081a32c6a8 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1614,8 +1614,8 @@ static __always_inline void update_lru_sizes(struct > lruvec *lruvec, > > } > > -/* > - * zone_lru_lock is heavily contended. Some of the functions that > +/** Nit: Remove the extra asterisk here; the line should then revert to being unchanged from the original.
Re: [PATCH 06/13] mm: pagewalk: Add 'depth' parameter to pte_hole
> On Feb 15, 2019, at 10:02 AM, Steven Price wrote: > > The pte_hole() callback is called at multiple levels of the page tables. > Code dumping the kernel page tables needs to know what at what depth > the missing entry is. Add this is an extra parameter to pte_hole(). > When the depth isn't know (e.g. processing a vma) then -1 is passed. > > Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their > natural numbers as levels 2/3/4. Nit: Could you add a comment noting this for anyone wondering how to calculate the level numbers in the future? Thanks!
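As an illustration of the suggested comment, a pte_hole callback could document the depth convention inline; the callback signature below is assumed from the patch description rather than taken from the patch itself:

/*
 * Depth convention (assumed from the patch description): 0 = PGD, 1 = P4D,
 * 2 = PUD, 3 = PMD, 4 = PTE, and -1 when the depth is not known, e.g. when
 * an entire VMA is being processed.
 */
static int example_pte_hole(unsigned long addr, unsigned long next,
			    int depth, struct mm_walk *walk)
{
	pr_debug("hole at %#lx-%#lx, level %d\n", addr, next, depth);
	return 0;
}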
Re: [PATCH] mm/shmem: make find_get_pages_range() work for huge page
> On Feb 12, 2019, at 4:57 PM, Yu Zhao wrote: > > It seems to me it's pefectly fine to use fields of xas directly, > and it's being done this way throughout the file. Fair enough. Reviewed-by: William Kucharski
Re: [PATCH v3 1/2] x86: respect memory size limiting via mem= parameter
> On Feb 14, 2019, at 3:42 AM, Juergen Gross wrote: > > When limiting memory size via kernel parameter "mem=" this should be > respected even in case of memory made accessible via a PCI card. > > Today this kind of memory won't be made usable in initial memory > setup as the memory won't be visible in E820 map, but it might be > added when adding PCI devices due to corresponding ACPI table entries. > > Not respecting "mem=" can be corrected by adding a global max_mem_size > variable set by parse_memopt() which will result in rejecting adding > memory areas resulting in a memory size above the allowed limit. > > Signed-off-by: Juergen Gross > Acked-by: Ingo Molnar Reviewed-by: William Kucharski
Re: [LSF/MM TOPIC] (again) THP for file systems
> On Feb 13, 2019, at 4:59 PM, Matthew Wilcox wrote: > > I believe the direction is clear. It needs people to do the work. > We're critically short of reviewers. I got precious little review of > the original XArray work, which made Andrew nervous and delayed its > integration. Now I'm getting little review of the followup patches > to lay the groundwork for filesystems to support larger page sizes. > I have very little patience for this situation. I'll be happy to dive in and look at the changes from an mm point of view, but I don't feel qualified to comment on all the file system considerations. Perhaps someone from the fs side could volunteer; I know well how frustrating it can be to be trapped in code-review suspended animation.
Re: [PATCH v2 0/3] slub: Do trivial comments fixes
If you need it: Reviewed-by: William Kucharski
Re: [PATCH 1/1] mm/vmalloc: convert vmap_lazy_nr to atomic_long_t
> On Jan 31, 2019, at 9:24 AM, Uladzislau Rezki (Sony) wrote: > > vmap_lazy_nr variable has atomic_t type that is 4 bytes integer > value on both 32 and 64 bit systems. lazy_max_pages() deals with > "unsigned long" that is 8 bytes on 64 bit system, thus vmap_lazy_nr > should be 8 bytes on 64 bit as well. Looks good. Reviewed-by: William Kucharski
Re: [PATCH 1/3] slub: Fix comment spelling mistake
> On Jan 30, 2019, at 9:10 PM, Tobin C. Harding wrote: > > Signed-off-by: Tobin C. Harding > --- > include/linux/slub_def.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h > index 3a1a1dbc6f49..201a635be846 100644 > --- a/include/linux/slub_def.h > +++ b/include/linux/slub_def.h > @@ -81,7 +81,7 @@ struct kmem_cache_order_objects { > */ > struct kmem_cache { > struct kmem_cache_cpu __percpu *cpu_slab; > - /* Used for retriving partial slabs etc */ > + /* Used for retrieving partial slabs etc */ > slab_flags_t flags; > unsigned long min_partial; > unsigned int size; /* The size of an object including meta data */ > -- If you're going to do this cleanup, make the comment in line 84 grammatical: /* Used for retrieving partial slabs, etc. */ Then change lines 87 and 88 to remove the space between "meta" and "data" as the word is "metadata" (as can be seen at line 102) and remove the period at the end of the comment on line 89 ("Free pointer offset.") You might also want to change lines 125-127 to be a single line comment: /* Defragmentation by allocating from a remote node */ so the commenting style is consistent throughout.
Re: [PATCH v2 2/2] x86/xen: dont add memory above max allowed allocation
> On Jan 30, 2019, at 1:22 AM, Juergen Gross wrote: > > +#ifdef CONFIG_MEMORY_HOTPLUG > + /* > + * Don't allow adding memory not in E820 map while > + * booting the system. Once the balloon driver is up > + * it will remove that restriction again. > + */ > + max_mem_size = xen_e820_table.entries[i].addr + > +xen_e820_table.entries[i].size; > +#endif > } > > if (!discard) Reviewed-by: William Kucharski
Re: [PATCH 1/2] x86: respect memory size limiting via mem= parameter
> On Jan 22, 2019, at 1:06 AM, Juergen Gross wrote: > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index b9a667d36c55..7fc2a87110a3 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -96,10 +96,16 @@ void mem_hotplug_done(void) > cpus_read_unlock(); > } > > +u64 max_mem_size = -1; This may be pedantic, but I'd rather see U64_MAX used here.
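To illustrate the suggestion (a sketch only, assuming U64_MAX, defined in include/linux/kernel.h at the time, is visible in mm/memory_hotplug.c), the declaration would become:

	/* No limit unless overridden by "mem=" parsing. */
	u64 max_mem_size = U64_MAX;

Functionally this is identical to the -1 initializer; it just states the intent explicitly.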
Re: [PATCH 1/2] mm/mmap.c: Remove redundant variable 'addr' in arch_get_unmapped_area_topdown()
> On Jan 20, 2019, at 1:13 AM, Yang Fan wrote: > > The variable 'addr' is redundant in arch_get_unmapped_area_topdown(), > just use parameter 'addr0' directly. Then remove the const qualifier > of the parameter, and change its name to 'addr'. > > Signed-off-by: Yang Fan These seem similar enough I question whether they really need to be two distinct patches, given both involve removing const keywords from the same routine, and the shift to using the passed addr directly rather than declaring and assigning addr from addr0 is a direct consequence of removing the const. I could be wrong though and easily persuaded otherwise.
Re: [PATCH 1/3] treewide: Lift switch variables out of switches
> On Jan 23, 2019, at 5:09 AM, Jann Horn wrote: > > AFAICS this only applies to switch statements (because they jump to a > case and don't execute stuff at the start of the block), not blocks > after if/while/... . It bothers me that we are going out of our way to deprecate valid C constructs in favor of placing the declarations elsewhere. As current compiler warnings would catch any use before initialization anyway, it seems like we are letting a compiler warning, rather than the language standard, dictate syntax. Certainly, if we want to make this a best-practice coding-style issue we can, but then an appropriate note explaining why should be added to Documentation/process/coding-style.rst.
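For anyone following along, this is the construct under discussion -- a declaration at the top of a switch block, which is valid C even though the jump to a case label skips any initializer. The names are hypothetical; it is only a sketch:

	static int handle(int cmd)
	{
		switch (cmd) {
			int tmp;	/* valid C, though any initializer here would be skipped */
		case 1:
			tmp = cmd * 2;	/* compilers can warn if tmp were read before being assigned */
			return tmp;
		default:
			return 0;
		}
	}

The series simply lifts such declarations out to the enclosing function scope.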
Re: [PATCH] mm/page_alloc: check return value of memblock_alloc_node_nopanic()
> On Jan 17, 2019, at 4:26 AM, Mike Rapoport wrote: > > On Thu, Jan 17, 2019 at 03:19:35AM -0700, William Kucharski wrote: >> >> This seems very reasonable, but if the code is just going to panic if the >> allocation fails, why not call memblock_alloc_node() instead? > > I've sent patches [1] that remove panic() from memblock_alloc*() and drop > _nopanic variants. After they will be (hopefully) merged, > memblock_alloc_node() will return NULL on error. > >> If there is a reason we'd prefer to call memblock_alloc_node_nopanic(), >> I'd like to see pgdat->nodeid printed in the panic message as well. > > Sure. Thanks for the quick response. Reviewed-by: William Kucharski
Re: [PATCH] mm/page_alloc: check return value of memblock_alloc_node_nopanic()
This seems very reasonable, but if the code is just going to panic if the allocation fails, why not call memblock_alloc_node() instead? If there is a reason we'd prefer to call memblock_alloc_node_nopanic(), I'd like to see pgdat->nodeid printed in the panic message as well.
Re: [PATCH] mm: hwpoison: use do_send_sig_info() instead of force_sig() (Re: PMEM error-handling forces SIGKILL causes kernel panic)
> On Jan 16, 2019, at 6:07 PM, Jane Chu wrote: > > It's just coding style I'm used to, no big deal. > Up to you to decide. :) Personally I like a (void) cast as it's pretty long-standing syntactic sugar to cast a call that returns a value we don't care about to (void) to show we know it returns a value and we don't care. Without it, it may suggest we either didn't know it returned a value or that we neglected to check the return value. However, in current use elsewhere (e.g. in send_sig_all() and __oom_kill_process()), no such (void) cast is added, so it seems better to match current usage elsewhere in the kernel. Reviewed-by: William Kucharski
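As a concrete illustration of the idiom (a sketch only; the surrounding code is hypothetical and the argument list is from memory rather than copied from the actual patch):

	/*
	 * The (void) cast documents that do_send_sig_info() returns a
	 * value and that we are deliberately ignoring it.
	 */
	(void) do_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk, PIDTYPE_PID);

Without the cast, a future reader may wonder whether the return value was simply overlooked.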
Re: [PATCH] mm: memcontrol: use struct_size() in kmalloc()
> On Jan 4, 2019, at 11:37 AM, Gustavo A. R. Silva > wrote: > > One of the more common cases of allocation size calculations is finding > the size of a structure that has a zero-sized array at the end, along > with memory for some number of elements for that array. For example: > > struct foo { >int stuff; >void *entry[]; > }; > > instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); > > Instead of leaving these open-coded and prone to type mistakes, we can > now use the new struct_size() helper: > > instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); > > This code was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva > --- > mm/memcontrol.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index af7f18b32389..ad256cf7da47 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3626,8 +3626,7 @@ static int __mem_cgroup_usage_register_event(struct > mem_cgroup *memcg, > size = thresholds->primary ? thresholds->primary->size + 1 : 1; > > /* Allocate memory for new array of thresholds */ > - new = kmalloc(sizeof(*new) + size * sizeof(struct mem_cgroup_threshold), > - GFP_KERNEL); > + new = kmalloc(struct_size(new, entries, size), GFP_KERNEL); > if (!new) { > ret = -ENOMEM; > goto unlock; > -- > 2.20.1 > Reviewed-by: William Kucharski
Re: [v3 PATCH 5/5] doc: memcontrol: add description for wipe_on_offline
Just a few grammar corrections since this is going into Documentation: > On Jan 9, 2019, at 12:14 PM, Yang Shi wrote: > > Add desprition of wipe_on_offline interface in cgroup documents. Add a description of the wipe_on_offline interface to the cgroup documents. > Cc: Michal Hocko > Cc: Johannes Weiner > Cc: Shakeel Butt > Signed-off-by: Yang Shi > --- > Documentation/admin-guide/cgroup-v2.rst | 9 + > Documentation/cgroup-v1/memory.txt | 10 ++ > 2 files changed, 19 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst > b/Documentation/admin-guide/cgroup-v2.rst > index 0290c65..e4ef08c 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1303,6 +1303,15 @@ PAGE_SIZE multiple when read back. > memory pressure happens. If you want to avoid that, force_empty will > be > useful. > > + memory.wipe_on_offline > + > +This is similar to force_empty, but it just does memory reclaim > +asynchronously in css offline kworker. > + > +Writing into 1 will enable it, disable it by writing into 0. Writing a 1 will enable it; writing a 0 will disable it. > + > +It would reclaim as much as possible memory just as what force_empty > does. It will reclaim as much memory as possible, just as force_empty does. > + > > Usage Guidelines > > diff --git a/Documentation/cgroup-v1/memory.txt > b/Documentation/cgroup-v1/memory.txt > index 8e2cb1d..1c6e1ca 100644 > --- a/Documentation/cgroup-v1/memory.txt > +++ b/Documentation/cgroup-v1/memory.txt > @@ -71,6 +71,7 @@ Brief summary of control files. > memory.stat # show various statistics > memory.use_hierarchy # set/show hierarchical account enabled > memory.force_empty# trigger forced page reclaim > + memory.wipe_on_offline # trigger forced page reclaim when > offlining > memory.pressure_level # set memory pressure notifications > memory.swappiness # set/show swappiness parameter of vmscan >(See sysctl's vm.swappiness) > @@ -581,6 +582,15 @@ hierarchical_= N0= > N1= ... > > The "total" count is sum of file + anon + unevictable. > > +5.7 wipe_on_offline > + > +This is similar to force_empty, but it just does memory reclaim > asynchronously > +in css offline kworker. > + > +Writing into 1 will enable it, disable it by writing into 0. Writing a 1 will enable it; writing a 0 will disable it. > + > +It would reclaim as much as possible memory just as what force_empty does. It will reclaim as much memory as possible, just as force_empty does. > + > 6. Hierarchy support > > The memory controller supports a deep hierarchy and hierarchical accounting. > -- > 1.8.3.1 >
Re: [PATCH] mm/shmem: make find_get_pages_range() work for huge page
> On Jan 9, 2019, at 8:08 PM, Yu Zhao wrote: > > find_get_pages_range() and find_get_pages_range_tag() already > correctly increment reference count on head when seeing compound > page, but they may still use page index from tail. Page index > from tail is always zero, so these functions don't work on huge > shmem. This hasn't been a problem because, AFAIK, nobody calls > these functions on (huge) shmem. Fix them anyway just in case. > > Signed-off-by: Yu Zhao > --- > mm/filemap.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 81adec8ee02c..cf5fd773314a 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1704,7 +1704,7 @@ unsigned find_get_pages_range(struct address_space > *mapping, pgoff_t *start, > > pages[ret] = page; > if (++ret == nr_pages) { > - *start = page->index + 1; > + *start = xas.xa_index + 1; > goto out; > } > continue; > @@ -1850,7 +1850,7 @@ unsigned find_get_pages_range_tag(struct address_space > *mapping, pgoff_t *index, > > pages[ret] = page; > if (++ret == nr_pages) { > - *index = page->index + 1; > + *index = xas.xa_index + 1; > goto out; > } > continue; > -- While this works, it seems like this would be more readable for future maintainers were it to instead squirrel away the value for *start/*index when ret was zero on the first iteration through the loop. Though xa_index is designed to hold the first index of the entry, it seems inappropriate to have these routines dereference elements of xas directly; I guess it depends on how opaque we want to keep xas and struct xa_state. Does anyone else have a feeling one way or the other? I could be persuaded either way.
Re: [PATCH v2] netfilter: account ebt_table_info to kmemcg
> On Jan 2, 2019, at 8:14 PM, Shakeel Butt wrote: > > countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids; > - newinfo = vmalloc(sizeof(*newinfo) + countersize); > + newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT, > + PAGE_KERNEL); > if (!newinfo) > return -ENOMEM; > > if (countersize) > memset(newinfo->counters, 0, countersize); > > - newinfo->entries = vmalloc(tmp.entries_size); > + newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT, > + PAGE_KERNEL); > if (!newinfo->entries) { > ret = -ENOMEM; > goto free_newinfo; > -- Just out of curiosity, what are the actual sizes of these areas in typical use given __vmalloc() will be allocating by the page?
Re: Bug with report THP eligibility for each vma
> On Dec 24, 2018, at 12:49 AM, Michal Hocko wrote: > > [Cc-ing mailing list and people involved in the original patch] > > On Fri 21-12-18 13:42:24, Paul Oppenheimer wrote: >> Hello! I've never reported a kernel bug before, and since its on the >> "next" tree I was told to email the author of the relevant commit. >> Please redirect me to the correct place if I've made a mistake. >> >> When opening firefox or chrome, and using it for a good 7 seconds, it >> hangs in "uninterruptible sleep" and I recieve a "BUG" in dmesg. This >> doesn't occur when reverting this commit: >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=48cf516f8c. >> Ive attached the output of decode_stacktrace.sh and the relevant dmesg >> log to this email. >> >> Thanks > >> BUG: unable to handle kernel NULL pointer dereference at 00e8 > > Thanks for the bug report! This is offset 232 and that matches > file->f_mapping as per pahole > pahole -C file ./vmlinux | grep f_mapping >struct address_space * f_mapping;/* 232 8 */ > > I thought that each file really has to have a mapping. But the following > should heal the issue and add an extra care. > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index f64733c23067..fc9d70a9fbd1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -66,6 +66,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct > *vma) > { > if (vma_is_anonymous(vma)) > return __transparent_hugepage_enabled(vma); > + if (!vma->vm_file || !vma->vm_file->f_mapping) > + return false; > if (shmem_mapping(vma->vm_file->f_mapping) && shmem_huge_enabled(vma)) > return __transparent_hugepage_enabled(vma); From what I see in code in mm/mmap.c, it seems if vma->vm_file is non-zero vma->vm_file->f_mapping may be assumed to be non-NULL; see unlink_file_vma() and __vma_link_file() for two examples, which both use the construct: file = vma->vm_file; if (file) { struct address_space *mapping = file->f_mapping; [ ... ] [ code that dereferences "mapping" without further checks ] } I see nothing wrong with your second check but a few extra instructions performed, but depending upon how often transparent_hugepage_enabled() is called there may be at least theoretical performance concerns. William Kucharski william.kuchar...@oracle.com
Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
> On Dec 20, 2018, at 1:31 PM, Qian Cai wrote: > > diff --git a/mm/page_ext.c b/mm/page_ext.c > index ae44f7adbe07..d76fd51e312a 100644 > --- a/mm/page_ext.c > +++ b/mm/page_ext.c > @@ -399,9 +399,8 @@ void __init page_ext_init(void) >* -pfn--> >* N0 | N1 | N2 | N0 | N1 | N2| >* > - * Take into account DEFERRED_STRUCT_PAGE_INIT. >*/ > - if (early_pfn_to_nid(pfn) != nid) > + if (pfn_to_nid(pfn) != nid) > continue; > if (init_section_page_ext(pfn, nid)) > goto oom; > -- > 2.17.2 (Apple Git-113) > Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the same issue seen in v2?
Re: [PATCH 1/2] ARC: show_regs: avoid page allocator
> On Dec 18, 2018, at 11:53 AM, Vineet Gupta wrote: > > Use on-stack smaller buffers instead of dynamic pages. > > The motivation for this change was to address lockdep splat when > signal handling code calls show_regs (with preemption disabled) and > ARC show_regs calls into sleepable page allocator. > > | potentially unexpected fatal signal 11. > | BUG: sleeping function called from invalid context at > ../mm/page_alloc.c:4317 > | in_atomic(): 1, irqs_disabled(): 0, pid: 57, name: segv > | no locks held by segv/57. > | Preemption disabled at: > | [<8182f17e>] get_signal+0x4a6/0x7c4 > | CPU: 0 PID: 57 Comm: segv Not tainted 4.17.0+ #23 > | > | Stack Trace: > | arc_unwind_core.constprop.1+0xd0/0xf4 > | __might_sleep+0x1f6/0x234 > | __get_free_pages+0x174/0xca0 > | show_regs+0x22/0x330 > | get_signal+0x4ac/0x7c4 # print_fatal_signals() -> preempt_disable() > | do_signal+0x30/0x224 > | resume_user_mode_begin+0x90/0xd8 > > Despite this, lockdep still barfs (see next change), but this patch > still has merit as in we use smaller/localized buffers now and there's > less instructoh trace to sift thru when debugging pesky issues. > > Signed-off-by: Vineet Gupta I would rather see 256 as a #define somewhere rather than a magic number sprinkled around arch/arc/kernel/troubleshoot.c. Still, that's what the existing code does, so I suppose it's OK. Otherwise the change looks good. Reviewed-by: William Kucharski
Re: [PATCH v3] mm: thp: fix flags for pmd migration when split
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index f2d19e4fe854..aebade83cec9 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2145,23 +2145,25 @@ static void __split_huge_pmd_locked(struct > vm_area_struct *vma, pmd_t *pmd, >*/ > old_pmd = pmdp_invalidate(vma, haddr, pmd); > > -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION > pmd_migration = is_pmd_migration_entry(old_pmd); > - if (pmd_migration) { > + if (unlikely(pmd_migration)) { > swp_entry_t entry; > > entry = pmd_to_swp_entry(old_pmd); > page = pfn_to_page(swp_offset(entry)); > - } else > -#endif > + write = is_write_migration_entry(entry); > + young = false; > + soft_dirty = pmd_swp_soft_dirty(old_pmd); > + } else { > page = pmd_page(old_pmd); > + if (pmd_dirty(old_pmd)) > + SetPageDirty(page); > + write = pmd_write(old_pmd); > + young = pmd_young(old_pmd); > + soft_dirty = pmd_soft_dirty(old_pmd); > + } > VM_BUG_ON_PAGE(!page_count(page), page); > page_ref_add(page, HPAGE_PMD_NR - 1); > - if (pmd_dirty(old_pmd)) > - SetPageDirty(page); > - write = pmd_write(old_pmd); > - young = pmd_young(old_pmd); > - soft_dirty = pmd_soft_dirty(old_pmd); > > /* > * Withdraw the table only after we mark the pmd entry invalid. > -- Looks good. Reviewed-by: William Kucharski
Re: [PATCH 2/2] swap: Deal with PTE mapped THP when unuse PTE
> --- > mm/swapfile.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 7464d0a92869..9e6da494781f 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1921,10 +1921,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, > pmd_t *pmd, > goto out; > } > > - if (PageSwapCache(page) && (swap_count(*swap_map) == 0)) > - delete_from_swap_cache(compound_head(page)); > + try_to_free_swap(page); > > - SetPageDirty(page); > unlock_page(page); > put_page(page); > > -- > 2.18.1 > Since try_to_free_swap() can return 0 under certain error conditions, you should check for a return status of 1 before calling unlock_page() and put_page(). Reviewed-by: William Kucharski
Re: [PATCH v2] userfaultfd: clear flag if remap event not enabled
> On Dec 10, 2018, at 10:34 PM, Peter Xu wrote: > > --- > fs/userfaultfd.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index cd58939dc977..4567b5b6fd32 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -736,10 +736,18 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma, > struct userfaultfd_ctx *ctx; > > ctx = vma->vm_userfaultfd_ctx.ctx; > - if (ctx && (ctx->features & UFFD_FEATURE_EVENT_REMAP)) { > + > + if (!ctx) > + return; > + > + if (ctx->features & UFFD_FEATURE_EVENT_REMAP) { > vm_ctx->ctx = ctx; > userfaultfd_ctx_get(ctx); > WRITE_ONCE(ctx->mmap_changing, true); > + } else { > + /* Drop uffd context if remap feature not enabled */ > + vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX; > + vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING); > } > } > > -- > 2.17.1 > Looks good. Reviewed-by: William Kucharski
Re: [PATCH] sysctl: clean up nr_pdflush_threads leftover
> On Nov 28, 2018, at 8:24 AM, Rafael Aquini wrote: > > nr_pdflush_threads has been long deprecated and > removed, but a remnant of its glorious past is > still around in CTL_VM names enum. This patch > is a minor clean-up to that case. > > Signed-off-by: Rafael Aquini > --- > include/uapi/linux/sysctl.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h > index d71013fffaf6..dad5a8f93343 100644 > --- a/include/uapi/linux/sysctl.h > +++ b/include/uapi/linux/sysctl.h > @@ -174,7 +174,7 @@ enum > VM_DIRTY_RATIO=12, /* dirty_ratio */ > VM_DIRTY_WB_CS=13, /* dirty_writeback_centisecs */ > VM_DIRTY_EXPIRE_CS=14, /* dirty_expire_centisecs */ > - VM_NR_PDFLUSH_THREADS=15, /* nr_pdflush_threads */ > + VM_UNUSED15=15, /* was: int nr_pdflush_threads */ > VM_OVERCOMMIT_RATIO=16, /* percent of RAM to allow overcommit in */ > VM_PAGEBUF=17, /* struct: Control pagebuf parameters */ > VM_HUGETLB_PAGES=18,/* int: Number of available Huge Pages */ > -- > 2.17.2 > Please reword the comment to add a colon after the word "int" to match earlier comments in the enum: + VM_UNUSED15=15, /* was: int: nr_pdflush_threads */ Also, as long as you're changing this file, please fix the typo earlier in the same enum: - VM_UNUSED2=2, /* was; int: Linear or sqrt() swapout for hogs */ + VM_UNUSED2=2, /* was: int: Linear or sqrt() swapout for hogs */ Reviewed-by: William Kucharski
Re: [PATCH] hugetlbfs: Call VM_BUG_ON_PAGE earlier in free_huge_page
Reviewed-by: William Kucharski > On Nov 29, 2018, at 4:44 AM, Yongkai Wu wrote: > > A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), > page) in free_huge_page(). Unfortunately, the page->mapping field > was set to NULL before this test. This made it more difficult to > determine the root cause of the problem. > > Move the VM_BUG_ON_PAGE tests earlier in the function so that if > they do trigger more information is present in the page struct. > > Signed-off-by: Yongkai Wu > Acked-by: Michal Hocko > Acked-by: Mike Kravetz > --- > mm/hugetlb.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 7f2a28a..14ef274 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1248,10 +1248,11 @@ void free_huge_page(struct page *page) > (struct hugepage_subpool *)page_private(page); > bool restore_reserve; > > - set_page_private(page, 0); > - page->mapping = NULL; > VM_BUG_ON_PAGE(page_count(page), page); > VM_BUG_ON_PAGE(page_mapcount(page), page); > + > + set_page_private(page, 0); > + page->mapping = NULL; > restore_reserve = PagePrivate(page); > ClearPagePrivate(page); > > -- > 1.8.3.1 >
Re: [PATCH] Small typo fix
> On Nov 27, 2018, at 2:04 PM, Emre Ates wrote: > > --- > mm/vmstat.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 9c624595e904..cc7d04928c2e 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1106,7 +1106,7 @@ int fragmentation_index(struct zone *zone, unsigned int > order) > TEXT_FOR_HIGHMEM(xx) xx "_movable", > > const char * const vmstat_text[] = { > - /* enum zone_stat_item countes */ > + /* enum zone_stat_item counters */ > "nr_free_pages", > "nr_zone_inactive_anon", > "nr_zone_active_anon", > -- > 2.19.1 > > Signed-off-by: Emre Ates Reviewed-by: William Kucharski
Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
> On Nov 27, 2018, at 9:50 AM, Vlastimil Babka wrote: > > On 11/27/18 3:50 PM, William Kucharski wrote: >> >> I was just double checking that this was meant to be more of a check done >> before code elsewhere performs additional checks and does the actual THP >> mapping, not an all-encompassing go/no go check for THP mapping. > > Yes, the code doing the actual mapping is still checking also alignment etc. Thanks, yes, that is what I was getting at.
Re: [PATCH] mm: warn only once if page table misaccounting is detected
> On Nov 27, 2018, at 8:52 AM, Sean Christopherson > wrote: > > On Tue, Nov 27, 2018 at 09:36:03AM +0100, Heiko Carstens wrote: >> Use pr_alert_once() instead of pr_alert() if page table misaccounting >> has been detected. >> >> If this happens once it is very likely that there will be numerous >> other occurrence as well, which would flood dmesg and the console with >> hardly any added information. Therefore print the warning only once. >> >> Cc: Kirill A. Shutemov >> Cc: Martin Schwidefsky >> Signed-off-by: Heiko Carstens >> --- >> kernel/fork.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/fork.c b/kernel/fork.c >> index 07cddff89c7b..c887e9eba89f 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -647,8 +647,8 @@ static void check_mm(struct mm_struct *mm) >> } >> >> if (mm_pgtables_bytes(mm)) >> -pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", >> -mm_pgtables_bytes(mm)); >> +pr_alert_once("BUG: non-zero pgtables_bytes on freeing mm: >> %ld\n", >> + mm_pgtables_bytes(mm)); > > I found the print-always behavior to be useful when developing a driver > that mucked with PTEs directly via vmf_insert_pfn() and had issues with > racing against exit_mmap(). It was nice to be able to recompile only > the driver and rely on dmesg to let me know when I messed up yet again. > > Would pr_alert_ratelimited() suffice? Actually, I really like that idea. There are certainly times when it is useful to see a cascade of messages, within reason; once there are so many that they overflow the dmesg buffer, they're of limited usefulness. Something like a pr_alert() that could rate limit to a preset value, perhaps a default of fifty or so, could prove quite useful indeed without being an all-or-once choice. William Kucharski
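To make the idea concrete, here is a hypothetical sketch of such a capped variant -- this is not an existing kernel API (pr_alert_once() and printk_ratelimited() are what exist today), just an illustration of limiting output to a preset count:

	/* Hypothetical: emit at most 'limit' alerts, then go quiet. */
	#define pr_alert_limited(limit, fmt, ...)			\
	do {								\
		static atomic_t __count = ATOMIC_INIT(0);		\
		if (atomic_inc_return(&__count) <= (limit))		\
			pr_alert(fmt, ##__VA_ARGS__);			\
	} while (0)

	/* Usage in check_mm(), assuming a default cap of 50: */
	if (mm_pgtables_bytes(mm))
		pr_alert_limited(50, "BUG: non-zero pgtables_bytes on freeing mm: %ld\n",
				 mm_pgtables_bytes(mm));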
Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
> On Nov 27, 2018, at 6:17 AM, Michal Hocko wrote: > > This is only about the process wide flag to disable THP. I do not see > how this can be alighnement related. I suspect you wanted to ask in the > smaps patch? No, answered below. > >> I'm having to deal with both these issues in the text page THP >> prototype I've been working on for some time now. > > Could you be more specific about the issue and how the alignment comes > into the game? The only thing I can think of is to not report VMAs > smaller than the THP as eligible. Is this what you are looking for? Basically, if the faulting VA is one that cannot be mapped with a THP due to alignment or size constraints, it may be "eligible" for THP mapping but ultimately can't be. I was just double checking that this was meant to be more of a check done before code elsewhere performs additional checks and does the actual THP mapping, not an all-encompassing go/no go check for THP mapping. Thanks, William Kucharski
Re: [RFC PATCH 3/3] mm, proc: report PR_SET_THP_DISABLE in proc
This determines whether the page can theoretically be THP-mapped, but is the intention to also check for proper alignment and/or preexisting PAGESIZE page cache mappings for the address range? I'm having to deal with both these issues in the text page THP prototype I've been working on for some time now. Thanks, William Kucharski
Re: [RFC PATCH 3/3] mm, fault_around: do not take a reference to a locked page
> On Nov 20, 2018, at 7:12 AM, Michal Hocko wrote: > > + /* > + * Check the locked pages before taking a reference to not > + * go in the way of migration. > + */ Could you make this a tiny bit more explanatory, something like: + /* +* Check for a locked page first, as a speculative +* reference may adversely influence page migration. + */ Reviewed-by: William Kucharski
Re: [PATCH 1/5] mm: print more information about mapping in __dump_page
> On Nov 16, 2018, at 1:30 AM, Michal Hocko wrote: > > From: Michal Hocko > > __dump_page prints the mapping pointer but that is quite unhelpful > for many reports because the pointer itself only helps to distinguish > anon/ksm mappings from other ones (because of lowest bits > set). Sometimes it would be much more helpful to know what kind of > mapping that is actually and if we know this is a file mapping then also > try to resolve the dentry name. I really, really like this - the more information available in the dump output, the easier it is to know where to start looking for the problem. Reviewed-by: William Kucharski
Re: [PATCH] mm/usercopy: Use memory range to be accessed for wraparound check
> On Nov 14, 2018, at 10:32 AM, isa...@codeaurora.org wrote: > > Thank you and David for your feedback. The check_bogus_address() routine is > only invoked from one place in the kernel, which is __check_object_size(). > Before invoking check_bogus_address, __check_object_size ensures that n is > non-zero, so it is not possible to call this routine with n being 0. > Therefore, we shouldn't run into the scenario you described. Also, in the > case where we are copying a page's contents into a kernel space buffer and > will not have that buffer interacting with userspace at all, this change to > that check should still be valid, correct? Having fixed more than one bug resulting from an "only called in one place" routine later being called elsewhere, I am wary, but ultimately it's likely not worth the performance hit of a check or BUG_ON(). It's a generic math check for overflow, so it should work with any address. Reviewed-by: William Kucharski
Re: [PATCH] mm/usercopy: Use memory range to be accessed for wraparound check
> On Nov 14, 2018, at 4:09 AM, David Laight wrote: > > From: William Kucharski >> Sent: 14 November 2018 10:35 >> >>> On Nov 13, 2018, at 5:51 PM, Isaac J. Manjarres >>> wrote: >>> >>> diff --git a/mm/usercopy.c b/mm/usercopy.c >>> index 852eb4e..0293645 100644 >>> --- a/mm/usercopy.c >>> +++ b/mm/usercopy.c >>> @@ -151,7 +151,7 @@ static inline void check_bogus_address(const unsigned >>> long ptr, unsigned long n, >>>bool to_user) >>> { >>> /* Reject if object wraps past end of memory. */ >>> - if (ptr + n < ptr) >>> + if (ptr + (n - 1) < ptr) >>> usercopy_abort("wrapped address", NULL, to_user, 0, ptr + n); >> >> I'm being paranoid, but is it possible this routine could ever be passed "n" >> set to zero? >> >> If so, it will erroneously abort indicating a wrapped address as (n - 1) >> wraps to ULONG_MAX. >> >> Easily fixed via: >> >> if ((n != 0) && (ptr + (n - 1) < ptr)) > > Ugg... you don't want a double test. > > I'd guess that a length of zero is likely, but a usercopy that includes > the highest address is going to be invalid because it is a kernel address > (on most archs, and probably illegal on others). > What you really want to do is add 'ptr + len' and check the carry flag. The extra test is only a few extra instructions, but I understand the concern. (Though I don't know how you'd access the carry flag from C in a machine-independent way. Also, for the calculation to be correct you still need to check 'ptr + (len - 1)' for the wrap.) You could also theoretically call gcc's __builtin_uadd_overflow() if you want to get carried away. As I mentioned, I was just being paranoid, but the passed zero length issue stood out to me. William Kucharski
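For reference, a sketch of the builtin-based form mentioned above, using GCC/Clang's generic __builtin_add_overflow() (the kernel also wraps these builtins as check_add_overflow() in include/linux/overflow.h). This keeps the explicit zero-length guard:

	unsigned long end;

	/* Reject if [ptr, ptr + n) wraps past the end of memory. */
	if (n != 0 && __builtin_add_overflow(ptr, n - 1, &end))
		usercopy_abort("wrapped address", NULL, to_user, 0, ptr + n);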
Re: [PATCH] mm/usercopy: Use memory range to be accessed for wraparound check
> On Nov 13, 2018, at 5:51 PM, Isaac J. Manjarres wrote: > > diff --git a/mm/usercopy.c b/mm/usercopy.c > index 852eb4e..0293645 100644 > --- a/mm/usercopy.c > +++ b/mm/usercopy.c > @@ -151,7 +151,7 @@ static inline void check_bogus_address(const unsigned > long ptr, unsigned long n, > bool to_user) > { > /* Reject if object wraps past end of memory. */ > - if (ptr + n < ptr) > + if (ptr + (n - 1) < ptr) > usercopy_abort("wrapped address", NULL, to_user, 0, ptr + n); I'm being paranoid, but is it possible this routine could ever be passed "n" set to zero? If so, it will erroneously abort indicating a wrapped address as (n - 1) wraps to ULONG_MAX. Easily fixed via: if ((n != 0) && (ptr + (n - 1) < ptr)) William Kucharski
Re: [PATCH v3] mm: Create the new vm_fault_t type
> On Nov 13, 2018, at 10:13 PM, Souptick Joarder wrote: > > On Tue, Nov 6, 2018 at 5:33 PM Souptick Joarder wrote: >> >> Page fault handlers are supposed to return VM_FAULT codes, >> but some drivers/file systems mistakenly return error >> numbers. Now that all drivers/file systems have been converted >> to use the vm_fault_t return type, change the type definition >> to no longer be compatible with 'int'. By making it an unsigned >> int, the function prototype becomes incompatible with a function >> which returns int. Sparse will detect any attempts to return a >> value which is not a VM_FAULT code. >> >> VM_FAULT_SET_HINDEX and VM_FAULT_GET_HINDEX values are changed >> to avoid conflict with other VM_FAULT codes. >> >> Signed-off-by: Souptick Joarder > > Any further comment on this patch ? Reviewed-by: William Kucharski
Re: [PATCH] slab.h: Avoid using & for logical and of booleans
> On Nov 5, 2018, at 14:13, Andrew Morton wrote: > >> On Mon, 5 Nov 2018 12:40:00 -0800 Bart Van Assche >> wrote: >> -return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM; >> +return type_dma + is_reclaimable * !is_dma * KMALLOC_RECLAIM; >> } >> >> /* > > I suppose so. > > That function seems too clever for its own good :(. I wonder if these > branch-avoiding tricks are really worthwhile. At the very least I'd like to see some comments added as to why that approach was taken for the sake of future maintainers. William Kucharski
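Something along these lines is what I have in mind -- the wording is mine, just a sketch of a comment above the quoted return statement:

	/*
	 * Branchless selection of the kmalloc cache type: type_dma is
	 * KMALLOC_DMA when __GFP_DMA was requested (and 0 otherwise), and
	 * KMALLOC_RECLAIM is added only for reclaimable, non-DMA
	 * allocations; everything else falls through to KMALLOC_NORMAL.
	 */
	return type_dma + is_reclaimable * !is_dma * KMALLOC_RECLAIM;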
Re: [PATCH v4] mm/page_owner: clamp read count to PAGE_SIZE
> On Nov 1, 2018, at 3:47 PM, Andrew Morton wrote: > > - count = count > PAGE_SIZE ? PAGE_SIZE : count; > + count = min_t(size_t, count, PAGE_SIZE); > kbuf = kmalloc(count, GFP_KERNEL); > if (!kbuf) > return -ENOMEM; Is the use of min_t vs. the C conditional mostly to be more self-documenting? The compiler-generated assembly between the two versions seems mostly a wash. William Kucharski
Re: Regression: Approximate 34% performance hit in receive throughput over ixgbe seen due to build_skb patch
> On May 22, 2018, at 12:23 PM, Alexander Duyck wrote:
>
> 3. There should be a private flag that can be updated via "ethtool
> --set-priv-flags" called "legacy-rx" that you can enable that will
> roll back to the original that did the copy-break type approach for
> small packets and the headers of the frame.

With legacy-rx enabled, most of the regression goes away, but it's still present as compared to the code without the patch; the regression then drops to about 6%:

    # ethtool --show-priv-flags eno1
    Private flags for eno1:
    legacy-rx: on

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     35934709      0      306.64
     65536           60.00     33791739             288.35

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     39254351      0      334.97
     65536           60.00     36761069             313.69

Is this variance to be expected, or do you think modification of the interrupt delay would achieve better results?

William Kucharski
Regression: Approximate 34% performance hit in receive throughput over ixgbe seen due to build_skb patch
A performance hit of approximately 34% in receive numbers for some packet sizes is seen when testing traffic over ixgbe links using the network test netperf.

Starting with the top of tree commit 7addb3e4ad3db6a95a953c59884921b5883dcdec, a git bisect narrowed the issue down to:

    commit 6f429223b31c550b835b4f066ac034d0cf0cc71e

        ixgbe: Add support for build_skb

        This patch adds build_skb support to the Rx path. There are
        several advantages to this change.

        1. It avoids the memcpy and skb->head allocation for small packets
           which improves performance by about 5% in my tests.
        2. It avoids the memcpy, skb->head allocation, and eth_get_headlen
           for larger packets improving performance by about 10% in my tests.
        3. For VXLAN packets it allows the full header to be in skb->data
           which improves the performance by as much as 30% in some of my
           tests.

Netperf was sourced from:

    https://hewlettpackard.github.io/netperf/

Two machines were directly connected via ixgbe links. The process "netserver" was started on 10.196.11.8, and running this test:

    # netperf -l 60 -H 10.196.11.8 -i 10,2 -I 99,10 -t UDP_STREAM -- -m 64 -s 32768 -S 32768

showed that on machines without the patch, we typically see performance like:

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     35435847      0      302.38    <-- SEND
     65536           60.00     35391087             302.00    <-- RECEIVE

or

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     33708816      0      287.65
     65536           60.00     33706010             287.62

However, on machines with the patch, receive performance is seen to fall by an average of 34%, e.g.:

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     35179881      0      300.20
     65536           60.00     21418471             182.77

or

    Socket  Message  Elapsed      Messages
    Size    Size     Time         Okay Errors   Throughput
    bytes   bytes    secs            #      #   10^6bits/sec

     65536      64   60.00     36937716      0      315.20
     65536           60.00     16838656             143.69

William Kucharski
william.kuchar...@oracle.com
Re: [RFC] mm, THP: Map read-only text segments using large THP pages
> On May 17, 2018, at 9:23 AM, Matthew Wilcox wrote: > > I'm certain it is. The other thing I believe is true that we should be > able to share page tables (my motivation is thousands of processes each > mapping the same ridiculously-sized file). I was hoping this prototype > would have code that would be stealable for that purpose, but you've > gone in a different direction. Which is fine for a prototype; you've > produced useful numbers. Definitely, and that's why I mentioned integration with the page cache would be crucial. This prototype allocates pages for each invocation of the executable, which would never fly on a real system. > I think the first step is to get variable sized pages in the page cache > working. Then the map-around functionality can probably just notice if > they're big enough to map with a PMD and make that happen. I don't > immediately > see anything from this PoC that can be used, but it at least gives us a > good point of comparison for any future work. Yes, that's the first step to getting actual usable code designed and working; this prototype was designed just to get something working and to get a first swag at some performance numbers. I do think that adding code to map larger pages as a fault_around variant is a good start as the code is already going to potentially map in fault_around_bytes from the file to satisfy the fault. It makes sense to extend that paradigm to be able to tune when large pages might be read in and/or mapped using large pages extant in the page cache. Filesystem support becomes more important once writing to large pages is allowed. > I think that really tells the story. We almost entirely eliminate > dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4% > of what they were. Does this test represent any kind of real world load, > or is it designed to show the best possible improvement? It's admittedly designed to thrash the caches pretty hard and doesn't represent any type of actual workload I'm aware of. It basically calls various routines within a huge text area while scribbling to automatic arrays declared at the top of each routine. It wasn't designed as a worst case scenario, but rather as one that would hopefully show some obvious degree of difference when large text pages were supported. Thanks for your comments. -- Bill
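To give a feel for the test program's structure, here is a hypothetical, stripped-down sketch of one of its routines; the real program is simply a very large number of generated functions of this shape, which is what produces the huge text segment:

	/*
	 * Each routine scribbles on a large automatic array (stressing D$)
	 * and then calls code located far away in the text segment
	 * (stressing I$ and the iTLB).
	 */
	static __attribute__((noinline)) long helper_far_away(long v)
	{
		return v * 2654435761L;
	}

	long routine_0042(long seed)
	{
		long scratch[1024];
		int i;

		for (i = 0; i < 1024; i++)
			scratch[i] = seed + i;

		return helper_far_away(scratch[seed & 1023]);
	}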
Re: [RFC] mm, THP: Map read-only text segments using large THP pages
> On May 17, 2018, at 1:57 AM, Michal Hocko wrote: > > [CCing Kirill and fs-devel] > > On Mon 14-05-18 07:12:13, William Kucharski wrote: >> One of the downsides of THP as currently implemented is that it only supports >> large page mappings for anonymous pages. > > There is a support for shmem merged already. ext4 was next on the plan > AFAIR but I haven't seen any patches and Kirill was busy with other > stuff IIRC. I couldn't find anything that would specifically map text pages with large pages, so perhaps this could be integrated with that or I may have simply missed changes that would ultimately provide that functionality. > >> I embarked upon this prototype on the theory that it would be advantageous >> to >> be able to map large ranges of read-only text pages using THP as well. > > Can the fs really support THP only for read mappings? What if those > pages are to be shared in a writable mapping as well? In other words > can this all work without a full THP support for a particular fs? The integration with the page cache would indeed require filesystem support. The end result I'd like to see is full R/W support for large THP pages; I thought the RO text mapping proof of concept worthwhile to see what kind of results we might see and what the thoughts of the community were. Thanks for the feedback. -- Bill
Re: [RFC] mm, THP: Map read-only text segments using large THP pages
> On May 14, 2018, at 9:19 AM, Christopher Lameter wrote: > > Cool. This could be controlled by the faultaround logic right? If we get > fault_around_bytes up to huge page size then it is reasonable to use a > huge page directly. It isn't presently but certainly could be; for the prototype it tries to map a large page when needed and, should that fail, it will fall through to the normal fault around code. I would think we would want a separate parameter, as I can see the usefulness of more fine-grained control. Many users may want to try mapping a large page if possible, but would prefer a smaller number of bytes to be read in fault around should we need to fall back to using PAGESIZE pages. > fault_around_bytes can be set via sysfs so there is a natural way to > control this feature there I think. I agree; perhaps I could use "fault_around_thp_bytes" or something similar. >> Since this approach will map a PMD size block of the memory map at a time, we >> should see a slight uptick in time spent in disk I/O but a substantial drop >> in >> page faults as well as a reduction in iTLB misses as address ranges will be >> mapped with the larger page. Analysis of a test program that consists of a >> very >> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows >> this >> does occur and there is a slight reduction in program execution time. > > I think we would also want such a feature for regular writable pages as > soon as possible. That is my ultimate long-term goal for this project - full r/w support of large THP pages; prototyping with read-only text pages seemed like the best first step to get a sense of the possible benefits. -- Bill