Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Nov 11, 2016 at 01:42:47AM +0800, kbuild test robot wrote:
> Hi Kirill,
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.9-rc4 next-20161110]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:      https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/2016-005428
> config:   i386-randconfig-s0-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
>    mm/shmem.c: In function 'shmem_getpage_gfp':
> >> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
>        pgoff_t off;

From f0a582888ac6dcb56c6134611c83edfb091bbcb6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov"
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to the size of a huge page, if the filesystem is mounted
with the huge=within_size option. This covers most of the cases where
huge pages cause performance regressions.

The limit doesn't affect khugepaged behaviour: it can still collapse
pages based on its own settings.

Signed-off-by: Kirill A. Shutemov
---
 Documentation/vm/transhuge.txt | 7 ++++++-
 mm/shmem.c                     | 7 ++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..14c911c56f4a 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
   - "always":
     Attempt to allocate huge pages every time we need a new page;
 
+This option can lead to significant overhead if the filesystem is used
+to store small files.
+
   - "never":
     Do not allocate huge pages;
 
   - "within_size":
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate a huge page if the size of the file exceeds the size
+    of a huge page. This helps to avoid overhead for small files.
+
     Also respect fadvise()/madvise() hints;
 
   - "advise:
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..3e2c0912c587 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                         goto alloc_huge;
                 switch (sbinfo->huge) {
                         loff_t i_size;
-                        pgoff_t off;
                 case SHMEM_HUGE_NEVER:
                         goto alloc_nohuge;
                 case SHMEM_HUGE_WITHIN_SIZE:
-                        off = round_up(index, HPAGE_PMD_NR);
-                        i_size = round_up(i_size_read(inode), PAGE_SIZE);
-                        if (i_size >= HPAGE_PMD_SIZE &&
-                                        i_size >> PAGE_SHIFT >= off)
+                        i_size = i_size_read(inode);
+                        if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
                                 goto alloc_huge;
                         /* fallthrough */
                 case SHMEM_HUGE_ADVISE:
-- 
2.9.3

-- 
 Kirill A. Shutemov
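For readers following the logic change: the revised SHMEM_HUGE_WITHIN_SIZE test in the patch above can be modelled in a few lines of self-contained userspace C. The HPAGE_PMD_* values are stubbed for a 4kB base page / 2MB PMD configuration, and want_huge_page() is an illustrative name, not a kernel function.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define HPAGE_PMD_SIZE  (2UL * 1024 * 1024)
#define HPAGE_PMD_NR    (HPAGE_PMD_SIZE / PAGE_SIZE)    /* 512 */

/* index: page index being faulted; i_size: current file size in bytes */
static bool want_huge_page(unsigned long index, long long i_size)
{
        /*
         * Allow a huge page only once the file already spans a huge page,
         * or the access itself reaches past the first PMD-sized unit;
         * small files keep getting 4k pages.
         */
        return index >= HPAGE_PMD_NR || i_size >= (long long)HPAGE_PMD_SIZE;
}

int main(void)
{
        printf("4kB file, first page : %d\n", want_huge_page(0, 4096));        /* 0 */
        printf("3MB file, first page : %d\n", want_huge_page(0, 3LL << 20));   /* 1 */
        printf("4kB file, index 600  : %d\n", want_huge_page(600, 4096));      /* 1 */
        return 0;
}

With these inputs, a 4kB file faulting its first page stays on small pages, while a file already larger than 2MB, or an access beyond the first PMD-sized unit, is allowed a huge page.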
Re: [PATCH] shmem: avoid huge pages for small files
Hi Kirill,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc4 next-20161110]
[if your patch is applied to the wrong git tree, please drop us a note to
help improve the system]

url:      https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/2016-005428
config:   i386-randconfig-s0-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   mm/shmem.c: In function 'shmem_getpage_gfp':
>> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
       pgoff_t off;
               ^~~

vim +/off +1680 mm/shmem.c

66d2f4d2  Hugh Dickins        2014-07-02  1664                 mark_page_accessed(page);
66d2f4d2  Hugh Dickins        2014-07-02  1665
54af6042  Hugh Dickins        2011-08-03  1666                 delete_from_swap_cache(page);
27ab7006  Hugh Dickins        2011-07-25  1667                 set_page_dirty(page);
27ab7006  Hugh Dickins        2011-07-25  1668                 swap_free(swap);
27ab7006  Hugh Dickins        2011-07-25  1669
54af6042  Hugh Dickins        2011-08-03  1670         } else {
800d8c63  Kirill A. Shutemov  2016-07-26  1671                 /* shmem_symlink() */
800d8c63  Kirill A. Shutemov  2016-07-26  1672                 if (mapping->a_ops != &shmem_aops)
800d8c63  Kirill A. Shutemov  2016-07-26  1673                         goto alloc_nohuge;
657e3038  Kirill A. Shutemov  2016-07-26  1674                 if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
800d8c63  Kirill A. Shutemov  2016-07-26  1675                         goto alloc_nohuge;
800d8c63  Kirill A. Shutemov  2016-07-26  1676                 if (shmem_huge == SHMEM_HUGE_FORCE)
800d8c63  Kirill A. Shutemov  2016-07-26  1677                         goto alloc_huge;
800d8c63  Kirill A. Shutemov  2016-07-26  1678                 switch (sbinfo->huge) {
800d8c63  Kirill A. Shutemov  2016-07-26  1679                         loff_t i_size;
800d8c63  Kirill A. Shutemov  2016-07-26 @1680                         pgoff_t off;
800d8c63  Kirill A. Shutemov  2016-07-26  1681                 case SHMEM_HUGE_NEVER:
800d8c63  Kirill A. Shutemov  2016-07-26  1682                         goto alloc_nohuge;
800d8c63  Kirill A. Shutemov  2016-07-26  1683                 case SHMEM_HUGE_WITHIN_SIZE:
bb89f249  Kirill A. Shutemov  2016-11-10  1684                         i_size = i_size_read(inode);
bb89f249  Kirill A. Shutemov  2016-11-10  1685                         if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
800d8c63  Kirill A. Shutemov  2016-07-26  1686                                 goto alloc_huge;
800d8c63  Kirill A. Shutemov  2016-07-26  1687                         /* fallthrough */
800d8c63  Kirill A. Shutemov  2016-07-26  1688                 case SHMEM_HUGE_ADVISE:

:: The code at line 1680 was first introduced by commit
:: 800d8c63b2e989c2e349632d1648119bf5862f01 shmem: add huge pages support
::
:: TO: Kirill A. Shutemov
:: CC: Linus Torvalds

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote: > On 10/21/2016 03:50 PM, Dave Chinner wrote: > > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > >> To me, most of things you're talking about is highly dependent on access > >> pattern generated by userspace: > >> > >> - we may want to allocate huge pages from byte 1 if we know that file > >> will grow; > > > > delayed allocation takes care of that. We use a growing speculative > > delalloc size that kicks in at specific sizes and can be used > > directly to determine if a large page shoul dbe allocated. This code > > is aware of sparse files, sparse writes, etc. > > OK, so somebody does a write() of 1 byte. We can delay the underlying > block allocation for a long time, but we can *not* delay the memory > allocation. We've got to decide before the write() returns. > How does delayed allocation help with that decision? You (and Kirill) have likely misunderstood what I'm saying, based on the fact you are thinking I'm talking about delayed allocation of page cache pages. I'm not. The current code does this for a sequential write: write( off, len) for each PAGE_SIZE chunk grab page cache page alloc + insert if not found map block to page get_block(off, PAGE_SIZE); filesystem does allocation update bufferhead attached to page write data into page Essentially, delayed block allocation occurs inside the get_block() call, completely hidden from the page cache and IO layers. In XFS, we special stuff based on the offset being written to, the size of the existing extent we are adding, etc. to specualtively /reserve/ more blocks that the write actually needs and keep them in a delalloc extent that extends beyond EOF. The next page is grabbed, get_block is called again, and we find we've already got a delalloc reservation for that file offset, so we return immediately. And we repeat that until the delalloc extent runs out. When it runs out, we allocate a bigger delalloc extent beyond EOF so that as the file grows we do fewer and fewer larger delayed allocation reservations. These grow out to /gigabytes/ if there is that much data to write. i.e. the filesystem clearly knows when using large pages would be appropriate, but because it's inside the page cache allocation, it can't influence it at all. Here's what the new fs/iomap.c code does for the same sequential write: write(off, len) iomap_begin(off, len) filesystem does delayed allocation of at least len bytes << returns an iomap with single mapping iomap_apply() for each PAGE_SIZE chunk grab page cache page alloc + insert if not found map iomap to page write data into page iomap_end() Hence if the write was for 1 byte into an empty file, we'd get a single block extent back, which would match to a single PAGE_SIZE page cache allocation required. If the app is doing sequential 1 byte IO, and we're at offset 2MB and the filesystem returns a 2MB delalloc extent (i.e. extends 2MB byte beyond EOF), we know we're getting sequential write IO and we could use a 2MB page in the page cache for this. Similarly, if the app is doing large IO - say 16MB at a time, we'll get at least a 16MB delalloc extent returned from the filesystem, and we know we could map that quickly and easily to 8 x 2MB huge pages in the page cache. 
But if we get random 4k writes into a sparse file, the filesystem will be allocating single blocks, so the iomaps being returned would be for a single block, and we know that PAGE_SIZE pages would be best to allocate. Or we could have a 2 MB extent size hint set, so every iomap returned from the filesystem is going to be 2MB aligned and sized, in which case we could always map the returned iomap to a huge page rather than worry about IO sizes and incoming IO patterns. Now do you see the difference? The filesystem now has the ability to optimise allocation based on user application IO patterns rather than trying to guess from single page size block mapping requests. And because this happens before we look at the page cache, we can use that information to influence what we do with the page cache. > >> I'm not convinced that filesystem is in better position to see access > >> patterns than mm for page cache. It's not all about on-disk layout. > > > > Spoken like a true mm developer. IO performance is all about IO > > patterns, and the primary contributor to bad IO patterns is bad > > filesystem allocation patterns :P > > For writes, I think you have a good point. Managing a horribly > fragmented file with larger pages and eating the associated write > magnification that comes along with it seems like a recipe for disaster. > > But, Isn't some level of disconnection between the page cache and the > underlying IO patterns a *good* thing? Up to a point. Buffered IO only
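To make the argument above concrete, here is a rough, self-contained C sketch of the heuristic being described: let the extent the filesystem maps or reserves for the IO (what an iomap_begin()-style call would return) drive the page-cache allocation size. struct mapped_extent and choose_page_size() are illustrative names rather than kernel APIs, and the rule shown (huge page only when the mapping covers a whole aligned 2MB unit) is just one possible policy.

#include <stdio.h>

#define PAGE_SIZE       4096ULL
#define HPAGE_PMD_SIZE  (2ULL * 1024 * 1024)

struct mapped_extent {
        unsigned long long offset;      /* file offset the mapping starts at */
        unsigned long long length;      /* bytes the filesystem mapped/reserved */
};

/* Pick the page-cache allocation size for an access at byte position 'pos'. */
static unsigned long long choose_page_size(const struct mapped_extent *ext,
                                           unsigned long long pos)
{
        unsigned long long start = pos & ~(HPAGE_PMD_SIZE - 1);

        /* Use a huge page only if the mapping covers the whole aligned 2MB unit. */
        if (ext->offset <= start &&
            ext->offset + ext->length >= start + HPAGE_PMD_SIZE)
                return HPAGE_PMD_SIZE;
        return PAGE_SIZE;
}

int main(void)
{
        struct mapped_extent one_block = { .offset = 0, .length = 4096 };
        struct mapped_extent delalloc  = { .offset = 0, .length = 16ULL << 20 };

        printf("1-byte write, single-block map : %llu\n", choose_page_size(&one_block, 0));
        printf("16MB delalloc reservation      : %llu\n", choose_page_size(&delalloc, 0));
        return 0;
}

A 1-byte write backed by a single-block mapping gets a 4kB page; a write the filesystem answers with a 16MB delalloc reservation becomes a candidate for 2MB pages.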
Re: [PATCH] shmem: avoid huge pages for small files
On 10/21/2016 03:50 PM, Dave Chinner wrote: > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: >> To me, most of things you're talking about is highly dependent on access >> pattern generated by userspace: >> >> - we may want to allocate huge pages from byte 1 if we know that file >> will grow; > > delayed allocation takes care of that. We use a growing speculative > delalloc size that kicks in at specific sizes and can be used > directly to determine if a large page shoul dbe allocated. This code > is aware of sparse files, sparse writes, etc. OK, so somebody does a write() of 1 byte. We can delay the underlying block allocation for a long time, but we can *not* delay the memory allocation. We've got to decide before the write() returns. How does delayed allocation help with that decision? I guess we could (always?) allocate small pages up front, and then only bother promoting them once the FS delayed-allocation code kicks in and is *also* giving us underlying large allocations. That punts the logic to the filesystem, which is a bit counterintuitive, but it seems relatively sane. >>> As such, there is no way we should be considering different >>> interfaces and methods for configuring the /same functionality/ just >>> because DAX is enabled or not. It's the /same decision/ that needs >>> to be made, and the filesystem knows an awful lot more about whether >>> huge pages can be used efficiently at the time of access than just >>> about any other actor you can name >> >> I'm not convinced that filesystem is in better position to see access >> patterns than mm for page cache. It's not all about on-disk layout. > > Spoken like a true mm developer. IO performance is all about IO > patterns, and the primary contributor to bad IO patterns is bad > filesystem allocation patterns :P For writes, I think you have a good point. Managing a horribly fragmented file with larger pages and eating the associated write magnification that comes along with it seems like a recipe for disaster. But, Isn't some level of disconnection between the page cache and the underlying IO patterns a *good* thing? Once we've gone to the trouble of bringing some (potentially very fragmented) data into the page cache, why _not_ manage it in a lower-overhead way if we can? For read-only data it seems like a no-brainer that we'd want things in as large of a management unit as we can get. IOW, why let the underlying block allocation layout hamstring how the memory is managed?
Re: [PATCH] shmem: avoid huge pages for small files
On Sat, Oct 22, 2016 at 09:50:13AM +1100, Dave Chinner wrote: > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > > > Ugh, no, please don't use mount options for file specific behaviours > > > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > > > behaviour that should either just work automatically (i.e. be > > > > > completely controlled by the filesystem) or only be applied to files > > > > > > > > Can you explain what you mean? How would the file system control it? > > > > > > There's no point in asking for huge pages when populating the page > > > cache if the file is: > > > > > > - significantly smaller than the huge page size > > > - largely sparse > > > - being randomly accessed in small chunks > > > - badly fragmented and so takes hundreds of IO to read/write > > > a huge page > > > - able to optimise delayed allocation to match huge page > > > sizes and alignments > > > > > > These are all constraints the filesystem knows about, but the > > > application and user don't. > > > > Really? > > > > To me, most of things you're talking about is highly dependent on access > > pattern generated by userspace: > > > > - we may want to allocate huge pages from byte 1 if we know that file > > will grow; > > delayed allocation takes care of that. We use a growing speculative > delalloc size that kicks in at specific sizes and can be used > directly to determine if a large page shoul dbe allocated. This code > is aware of sparse files, sparse writes, etc. I'm confused here. How can we delay allocation of page cache? Delalloc is helpful to have reasonable on-disk layout, but my understanding is that it uses page cache as buffering to postpone block allocation. Later on writeback we see access pattern using pages from page cache. I'm likely missing something important here. Hm? > > - it will be beneficial to allocate huge page even for fragmented files, > > if it's read-mostly; > > No, no it won't. The IO latency impact here can be massive. > read-ahead of single 4k pages hides most of this latency from the > application, but with a 2MB page, we can't use readhead to hide this > IO latency because the first access could stall for hundreds of > small random read IOs to be completed instead of just 1. I agree that it will lead to initial latency spike. But don't we have workloads which would tolerate it to get faster hot-cache behaviour? > > > Further, we are moving the IO path to a model where we use extents > > > for mapping, not blocks. We're optimising for the fact that modern > > > filesystems use extents and so massively reduce the number of block > > > mapping lookup calls we need to do for a given IO. > > > > > > i.e. instead of doing "get page, map block to page" over and over > > > again until we've alked over the entire IO range, we're doing > > > "map extent for entire IO range" once, then iterating "get page" > > > until we've mapped the entire range. > > > > That's great, but it's not how IO path works *now*. And will takes a long > > time (if ever) to flip it over to what you've described. > > Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted > right now, GFS2 will use parts of it in the next release, DAX > already uses it and PMD support in DAX is being built on top of it. That's interesting. I've managed to miss whole fs/iomap.c thing... 
> > > As such, there is no way we should be considering different > > > interfaces and methods for configuring the /same functionality/ just > > > because DAX is enabled or not. It's the /same decision/ that needs > > > to be made, and the filesystem knows an awful lot more about whether > > > huge pages can be used efficiently at the time of access than just > > > about any other actor you can name > > > > I'm not convinced that filesystem is in better position to see access > > patterns than mm for page cache. It's not all about on-disk layout. > > Spoken like a true mm developer. Guilty. -- Kirill A. Shutemov
Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > > Ugh, no, please don't use mount options for file specific behaviours > > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > > behaviour that should either just work automatically (i.e. be > > > > completely controlled by the filesystem) or only be applied to files > > > > > > Can you explain what you mean? How would the file system control it? > > > > There's no point in asking for huge pages when populating the page > > cache if the file is: > > > > - significantly smaller than the huge page size > > - largely sparse > > - being randomly accessed in small chunks > > - badly fragmented and so takes hundreds of IO to read/write > > a huge page > > - able to optimise delayed allocation to match huge page > > sizes and alignments > > > > These are all constraints the filesystem knows about, but the > > application and user don't. > > Really? > > To me, most of things you're talking about is highly dependent on access > pattern generated by userspace: > > - we may want to allocate huge pages from byte 1 if we know that file > will grow; delayed allocation takes care of that. We use a growing speculative delalloc size that kicks in at specific sizes and can be used directly to determine if a large page shoul dbe allocated. This code is aware of sparse files, sparse writes, etc. > - the same for sparse file that will be filled; See above. > - it will be beneficial to allocate huge page even for fragmented files, > if it's read-mostly; No, no it won't. The IO latency impact here can be massive. read-ahead of single 4k pages hides most of this latency from the application, but with a 2MB page, we can't use readhead to hide this IO latency because the first access could stall for hundreds of small random read IOs to be completed instead of just 1. > > Further, we are moving the IO path to a model where we use extents > > for mapping, not blocks. We're optimising for the fact that modern > > filesystems use extents and so massively reduce the number of block > > mapping lookup calls we need to do for a given IO. > > > > i.e. instead of doing "get page, map block to page" over and over > > again until we've alked over the entire IO range, we're doing > > "map extent for entire IO range" once, then iterating "get page" > > until we've mapped the entire range. > > That's great, but it's not how IO path works *now*. And will takes a long > time (if ever) to flip it over to what you've described. Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted right now, GFS2 will use parts of it in the next release, DAX already uses it and PMD support in DAX is being built on top of it. > > As such, there is no way we should be considering different > > interfaces and methods for configuring the /same functionality/ just > > because DAX is enabled or not. It's the /same decision/ that needs > > to be made, and the filesystem knows an awful lot more about whether > > huge pages can be used efficiently at the time of access than just > > about any other actor you can name > > I'm not convinced that filesystem is in better position to see access > patterns than mm for page cache. It's not all about on-disk layout. Spoken like a true mm developer. 
IO performance is all about IO patterns, and the primary contributor to bad IO patterns is bad filesystem allocation patterns :P We're rapidly moving away from the world where a page cache is needed to give applications decent performance. DAX doesn't have a page cache, applications wanting to use high IOPS (hundreds of thousands to millions) storage are using direct IO, because the page cache just introduces latency, memory usage issues and non-deterministic IO behaviour. I we try to make the page cache the "one true IO optimisation source" then we're screwing ourselves because the incoming IO technologies simply don't require it anymore. We need to be ahead of that curve, not playing catchup, and that's why this sort of "what should the page cache do" decisions really need to come from the IO path where we see /all/ the IO, not just buffered IO Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] shmem: avoid huge pages for small files
On Fri 21-10-16 18:00:07, Kirill A. Shutemov wrote: > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: [...] > > None of these aspects can be optimised sanely by a single threshold, > > especially when considering the combination of access patterns vs file > > layout. > > I agree. > > Here I tried to address the particular performance regression I see with > huge pages enabled on tmpfs. It doesn't mean to fix all possible issues. So can we start simple and use huge pages on shmem mappings only when they are larger than the huge page? Without any tunable which might turn out to be misleading/wrong later on. If I understand Dave's comments it is really not all that clear that a mount option makes sense. I cannot comment on those but they clearly show that there are multiple points of view here. -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > Ugh, no, please don't use mount options for file specific behaviours > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > behaviour that should either just work automatically (i.e. be > > > completely controlled by the filesystem) or only be applied to files > > > > Can you explain what you mean? How would the file system control it? > > There's no point in asking for huge pages when populating the page > cache if the file is: > > - significantly smaller than the huge page size > - largely sparse > - being randomly accessed in small chunks > - badly fragmented and so takes hundreds of IO to read/write > a huge page > - able to optimise delayed allocation to match huge page > sizes and alignments > > These are all constraints the filesystem knows about, but the > application and user don't. Really? To me, most of things you're talking about is highly dependent on access pattern generated by userspace: - we may want to allocate huge pages from byte 1 if we know that file will grow; - the same for sparse file that will be filled; - it will be beneficial to allocate huge page even for fragmented files, if it's read-mostly; > None of these aspects can be optimised sanely by a single threshold, > especially when considering the combination of access patterns vs file > layout. I agree. Here I tried to address the particular performance regression I see with huge pages enabled on tmpfs. It doesn't mean to fix all possible issues. > Further, we are moving the IO path to a model where we use extents > for mapping, not blocks. We're optimising for the fact that modern > filesystems use extents and so massively reduce the number of block > mapping lookup calls we need to do for a given IO. > > i.e. instead of doing "get page, map block to page" over and over > again until we've alked over the entire IO range, we're doing > "map extent for entire IO range" once, then iterating "get page" > until we've mapped the entire range. That's great, but it's not how IO path works *now*. And will takes a long time (if ever) to flip it over to what you've described. > Hence if we have a 2MB IO come in from userspace, and the iomap > returned is a covers that entire range, it's a no-brainer to ask the > page cache for a huge page instead of iterating 512 times to map all > the 4k pages needed. Yeah, it's no-brainier. But do we want to limit huge page allocation only to such best-possible cases? I hardly ever seen 2MB IOs in real world... And this approach will put too much decision power on the first access to the file range. It may or may not represent future access pattern. > > > specifically configured with persistent hints to reliably allocate > > > extents in a way that can be easily mapped to huge pages. > > > > > e.g. on XFS you will need to apply extent size hints to get large > > > page sized/aligned extent allocation to occur, and so this > > > > It sounds like you're confusing alignment in memory with alignment > > on disk here? I don't see why on disk alignment would be needed > > at all, unless we're talking about DAX here (which is out of > > scope currently) Kirill's changes are all about making the memory > > access for cached data more efficient, it's not about disk layout > > optimizations. > > No, I'm not confusing this with DAX. However, this automatic use > model for huge pages fits straight into DAX as well. 
Same > mechanisms, same behaviours, slightly stricter alignment > characteristics. All stuff the filesystem already knows about. > > Mount options are, quite frankly, a terrible mechanism for > specifying filesystem policy. Setting up DAX this way was a mistake, > and it's a mount option I plan to remove from XFS once we get nearer > to having DAX feature complete and stablised. We've already got > on-disk "use DAX for this file" flags in XFS, so we can easier and > cleanly support different methods of accessing PMEM from the same > filesystem. > > As such, there is no way we should be considering different > interfaces and methods for configuring the /same functionality/ just > because DAX is enabled or not. It's the /same decision/ that needs > to be made, and the filesystem knows an awful lot more about whether > huge pages can be used efficiently at the time of access than just > about any other actor you can name I'm not convinced that filesystem is in better position to see access patterns than mm for page cache. It's not all about on-disk layout. -- Kirill A. Shutemov
Re: [PATCH] shmem: avoid huge pages for small files
On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > Ugh, no, please don't use mount options for file specific behaviours > > in filesystems like ext4 and XFS. This is exactly the sort of > > behaviour that should either just work automatically (i.e. be > > completely controlled by the filesystem) or only be applied to files > > Can you explain what you mean? How would the file system control it? There's no point in asking for huge pages when populating the page cache if the file is: - significantly smaller than the huge page size - largely sparse - being randomly accessed in small chunks - badly fragmented and so takes hundreds of IO to read/write a huge page - able to optimise delayed allocation to match huge page sizes and alignments These are all constraints the filesystem knows about, but the application and user don't. None of these aspects can be optimised sanely by a single threshold, especially when considering the combination of access patterns vs file layout. Further, we are moving the IO path to a model where we use extents for mapping, not blocks. We're optimising for the fact that modern filesystems use extents and so massively reduce the number of block mapping lookup calls we need to do for a given IO. i.e. instead of doing "get page, map block to page" over and over again until we've alked over the entire IO range, we're doing "map extent for entire IO range" once, then iterating "get page" until we've mapped the entire range. Hence if we have a 2MB IO come in from userspace, and the iomap returned is a covers that entire range, it's a no-brainer to ask the page cache for a huge page instead of iterating 512 times to map all the 4k pages needed. > > specifically configured with persistent hints to reliably allocate > > extents in a way that can be easily mapped to huge pages. > > > e.g. on XFS you will need to apply extent size hints to get large > > page sized/aligned extent allocation to occur, and so this > > It sounds like you're confusing alignment in memory with alignment > on disk here? I don't see why on disk alignment would be needed > at all, unless we're talking about DAX here (which is out of > scope currently) Kirill's changes are all about making the memory > access for cached data more efficient, it's not about disk layout > optimizations. No, I'm not confusing this with DAX. However, this automatic use model for huge pages fits straight into DAX as well. Same mechanisms, same behaviours, slightly stricter alignment characteristics. All stuff the filesystem already knows about. Mount options are, quite frankly, a terrible mechanism for specifying filesystem policy. Setting up DAX this way was a mistake, and it's a mount option I plan to remove from XFS once we get nearer to having DAX feature complete and stablised. We've already got on-disk "use DAX for this file" flags in XFS, so we can easier and cleanly support different methods of accessing PMEM from the same filesystem. As such, there is no way we should be considering different interfaces and methods for configuring the /same functionality/ just because DAX is enabled or not. It's the /same decision/ that needs to be made, and the filesystem knows an awful lot more about whether huge pages can be used efficiently at the time of access than just about any other actor you can name > > persistent extent size hint should trigger the filesystem to use > > large pages if supported, the hint is correctly sized and aligned, > > and there are large pages available for allocation. 
> > That would be ioctls and similar? You can, but existing filesystem admin tools can already set up allocation policies without the apps being aware that they even exist. If you want to use huge page mappings with DAX you'll already need to do this because of the physical alignment requirements of DAX. Further, such techniques are already used by many admins for things like limiting fragmentation of sparse vm image files. So while you may not know it, extent size hints and per-file inheritable attributes are quire widely used already to manage filesystem behaviour without users or applications even being aware that the filesystem policies have been modified by the admin... > That would imply that every application wanting to use large pages > would need to be especially enabled. That would seem awfully limiting > to me and needlessly deny benefits to most existing code. No change to applications will be necessary (see above), though there's no reason why couldn't directly use the VFS interfaces to explicitly ask for such behaviour themselves Cheers, Dave. -- Dave Chinner da...@fromorbit.com
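As a concrete example of the persistent, per-file allocation hints referred to above: an extent size hint can be set on an (ideally still empty) XFS file through the generic FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls. The file path below is made up and error handling is minimal.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        /* Path is an example; set the hint before the file gets any data. */
        int fd = open("/mnt/xfs/vm-image.raw", O_RDWR | O_CREAT, 0644);
        struct fsxattr fsx;

        if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("open/FS_IOC_FSGETXATTR");
                return 1;
        }

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* honour fsx_extsize for this file */
        fsx.fsx_extsize = 2 * 1024 * 1024;      /* 2MB; must be a multiple of the fs block size */

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                return 1;
        }
        close(fd);
        return 0;
}

The same hint can be applied by an administrator with existing tools, e.g. xfs_io -c "extsize 2m" <file>, without the application being aware of it, which is the point being made about admin-controlled policy.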
Re: [PATCH] shmem: avoid huge pages for small files
> Ugh, no, please don't use mount options for file specific behaviours > in filesystems like ext4 and XFS. This is exactly the sort of > behaviour that should either just work automatically (i.e. be > completely controlled by the filesystem) or only be applied to files Can you explain what you mean? How would the file system control it? > specifically configured with persistent hints to reliably allocate > extents in a way that can be easily mapped to huge pages. > e.g. on XFS you will need to apply extent size hints to get large > page sized/aligned extent allocation to occur, and so this It sounds like you're confusing alignment in memory with alignment on disk here? I don't see why on disk alignment would be needed at all, unless we're talking about DAX here (which is out of scope currently) Kirill's changes are all about making the memory access for cached data more efficient, it's not about disk layout optimizations. > persistent extent size hint should trigger the filesystem to use > large pages if supported, the hint is correctly sized and aligned, > and there are large pages available for allocation. That would be ioctls and similar? That would imply that every application wanting to use large pages would need to be especially enabled. That would seem awfully limiting to me and needlessly deny benefits to most existing code. -Andi
Re: [PATCH] shmem: avoid huge pages for small files
On Thu, Oct 20, 2016 at 01:39:46PM +0300, Kirill A. Shutemov wrote: > On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote: > > On Tue, 18 Oct 2016, Michal Hocko wrote: > > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > > > [...] > > > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > > > > > - sysfs file > > > > > > > > /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > > > > > Could you explain who might like to change the minimum value > > > > > > > (other than > > > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, > > > > > > you > > > > > > might need to set it higher in order to balance out overhead with > > > > > > benefit > > > > > > of huge pages. > > > > > > > > > > I am not sure this is a good argument. How do a user know and what > > > > > will > > > > > help to make that decision? Why we cannot autotune that? In other > > > > > words, > > > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > > > > > Well, I don't see a reasonable way to autotune it. We can just let > > > > arch-specific code to redefine it, but the argument below still stands. > > > > > > > > > > In other case, if it's known in advance that specific mount would be > > > > > > populated with large files, you might want to set it to zero to get > > > > > > huge > > > > > > pages allocated from the beginning. > > > > > > Do you think this is a sufficient reason to provide a tunable with such a > > > precision? In other words why cannot we simply start by using an > > > internal only limit at the huge page size for the initial transition > > > (with a way to disable THP altogether for a mount point) and only add a > > > more fine grained tunning if there ever is a real need for it with a use > > > case description. In other words can we be less optimistic about > > > tunables than we used to be in the past and often found out that those > > > were mistakes much later? > > > > I'm not sure whether I'm arguing in the same or the opposite direction > > as you, Michal, but what makes me unhappy is not so much the tunable, > > as the proliferation of mount options. > > > > Kirill, this issue is (not exactly but close enough) what the mount > > option "huge=within_size" was supposed to be about: not wasting huge > > pages on small files. I'd be much happier if you made huge_min_size > > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, > > and used it to govern "huge=within_size" mounts only. > > Well, you're right that I tried originally address the issue with > huge=within_size, but this option makes much more sense for filesystem > with persistent storage. For ext4, it would be pretty usable option. Ugh, no, please don't use mount options for file specific behaviours in filesystems like ext4 and XFS. This is exactly the sort of behaviour that should either just work automatically (i.e. 
be completely controlled by the filesystem) or only be applied to files specifically configured with persistent hints to reliably allocate extents in a way that can be easily mapped to huge pages. e.g. on XFS you will need to apply extent size hints to get large page sized/aligned extent allocation to occur, and so this persistent extent size hint should trigger the filesystem to use large pages if supported, the hint is correctly sized and aligned, and there are large pages available for allocation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] shmem: avoid huge pages for small files
On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote: > On Tue, 18 Oct 2016, Michal Hocko wrote: > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > > [...] > > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size > > > > > > > for > > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > > > Could you explain who might like to change the minimum value (other > > > > > > than > > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > > > might need to set it higher in order to balance out overhead with > > > > > benefit > > > > > of huge pages. > > > > > > > > I am not sure this is a good argument. How do a user know and what will > > > > help to make that decision? Why we cannot autotune that? In other words, > > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > > > Well, I don't see a reasonable way to autotune it. We can just let > > > arch-specific code to redefine it, but the argument below still stands. > > > > > > > > In other case, if it's known in advance that specific mount would be > > > > > populated with large files, you might want to set it to zero to get > > > > > huge > > > > > pages allocated from the beginning. > > > > Do you think this is a sufficient reason to provide a tunable with such a > > precision? In other words why cannot we simply start by using an > > internal only limit at the huge page size for the initial transition > > (with a way to disable THP altogether for a mount point) and only add a > > more fine grained tunning if there ever is a real need for it with a use > > case description. In other words can we be less optimistic about > > tunables than we used to be in the past and often found out that those > > were mistakes much later? > > I'm not sure whether I'm arguing in the same or the opposite direction > as you, Michal, but what makes me unhappy is not so much the tunable, > as the proliferation of mount options. > > Kirill, this issue is (not exactly but close enough) what the mount > option "huge=within_size" was supposed to be about: not wasting huge > pages on small files. I'd be much happier if you made huge_min_size > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, > and used it to govern "huge=within_size" mounts only. Well, you're right that I tried originally address the issue with huge=within_size, but this option makes much more sense for filesystem with persistent storage. For ext4, it would be pretty usable option. What you propose would change the semantics of the option and it will diverge from how it works on ext4. I guess it may have sense, taking into account that shmem/tmpfs is special, in sense that we always start with empty filesystem. If everybody agree, I'll respin the patch with single tunable that manage all huge=within_size mounts. -- Kirill A. Shutemov
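For context, this is how the existing tmpfs option under discussion is selected at mount time; a shell invocation of "mount -t tmpfs -o huge=within_size,size=1G tmpfs /mnt/thp-tmpfs" is equivalent. The mount point and size below are arbitrary examples and the call needs CAP_SYS_ADMIN.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* /mnt/thp-tmpfs must already exist. */
        if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
                  "huge=within_size,size=1G") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}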
Re: [PATCH] shmem: avoid huge pages for small files
On Tue, 18 Oct 2016, Michal Hocko wrote: > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > [...] > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size > > > > > > for > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > Could you explain who might like to change the minimum value (other > > > > > than > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > > might need to set it higher in order to balance out overhead with > > > > benefit > > > > of huge pages. > > > > > > I am not sure this is a good argument. How do a user know and what will > > > help to make that decision? Why we cannot autotune that? In other words, > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > Well, I don't see a reasonable way to autotune it. We can just let > > arch-specific code to redefine it, but the argument below still stands. > > > > > > In other case, if it's known in advance that specific mount would be > > > > populated with large files, you might want to set it to zero to get huge > > > > pages allocated from the beginning. > > Do you think this is a sufficient reason to provide a tunable with such a > precision? In other words why cannot we simply start by using an > internal only limit at the huge page size for the initial transition > (with a way to disable THP altogether for a mount point) and only add a > more fine grained tunning if there ever is a real need for it with a use > case description. In other words can we be less optimistic about > tunables than we used to be in the past and often found out that those > were mistakes much later? I'm not sure whether I'm arguing in the same or the opposite direction as you, Michal, but what makes me unhappy is not so much the tunable, as the proliferation of mount options. Kirill, this issue is (not exactly but close enough) what the mount option "huge=within_size" was supposed to be about: not wasting huge pages on small files. I'd be much happier if you made huge_min_size into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, and used it to govern "huge=within_size" mounts only. Hugh
Re: [PATCH] shmem: avoid huge pages for small files
On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > [...] > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > > in-kernel tmpfs mountpoint; > > > > > > > > Could you explain who might like to change the minimum value (other than > > > > disable the feautre for the mount point) and for what reason? > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > might need to set it higher in order to balance out overhead with benefit > > > of huge pages. > > > > I am not sure this is a good argument. How do a user know and what will > > help to make that decision? Why we cannot autotune that? In other words, > > adding new knobs just in case turned out to be a bad idea in the past. > > Well, I don't see a reasonable way to autotune it. We can just let > arch-specific code to redefine it, but the argument below still stands. > > > > In other case, if it's known in advance that specific mount would be > > > populated with large files, you might want to set it to zero to get huge > > > pages allocated from the beginning. Do you think this is a sufficient reason to provide a tunable with such a precision? In other words why cannot we simply start by using an internal only limit at the huge page size for the initial transition (with a way to disable THP altogether for a mount point) and only add a more fine grained tunning if there ever is a real need for it with a use case description. In other words can we be less optimistic about tunables than we used to be in the past and often found out that those were mistakes much later? -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > [...] > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > in-kernel tmpfs mountpoint; > > > > > > Could you explain who might like to change the minimum value (other than > > > disable the feautre for the mount point) and for what reason? > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > might need to set it higher in order to balance out overhead with benefit > > of huge pages. > > I am not sure this is a good argument. How do a user know and what will > help to make that decision? Why we cannot autotune that? In other words, > adding new knobs just in case turned out to be a bad idea in the past. Well, I don't see a reasonable way to autotune it. We can just let arch-specific code to redefine it, but the argument below still stands. > > In other case, if it's known in advance that specific mount would be > > populated with large files, you might want to set it to zero to get huge > > pages allocated from the beginning. > > Cannot we use [mf]advise for that purpose? There's no fadvise for this at the moment. We can use madvise, except that the patch makes it lower priority than the limit :P. I'll fix that. But in general, it would require change to the program which is not always desirable or even possible. -- Kirill A. Shutemov
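The madvise() route mentioned above looks like this from an application: map a tmpfs/shmem file and mark the range with MADV_HUGEPAGE. The path and sizes are examples only; the call expresses a preference rather than a guarantee, and per Documentation/vm/transhuge.txt the hint matters when the huge= policy is "advise" (and is also respected by "within_size").

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (4UL << 20)     /* 4MB: two PMD-sized units on x86-64 */

int main(void)
{
        /* Any tmpfs file works; /dev/shm is just an example location. */
        int fd = open("/dev/shm/thp-example", O_RDWR | O_CREAT, 0600);
        char *p;

        if (fd < 0 || ftruncate(fd, MAP_LEN) < 0) {
                perror("open/ftruncate");
                return 1;
        }

        p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        if (madvise(p, MAP_LEN, MADV_HUGEPAGE) < 0)     /* a hint, not a guarantee */
                perror("madvise(MADV_HUGEPAGE)");

        p[0] = 1;       /* the faulted range may now be backed by 2MB pages */

        munmap(p, MAP_LEN);
        close(fd);
        return 0;
}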
Re: [PATCH] shmem: avoid huge pages for small files
On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: [...] > > > We add two handle to specify minimal file size for huge pages: > > > > > > - mount option 'huge_min_size'; > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > in-kernel tmpfs mountpoint; > > > > Could you explain who might like to change the minimum value (other than > > disable the feautre for the mount point) and for what reason? > > Depending on how well CPU microarchitecture deals with huge pages, you > might need to set it higher in order to balance out overhead with benefit > of huge pages. I am not sure this is a good argument. How do a user know and what will help to make that decision? Why we cannot autotune that? In other words, adding new knobs just in case turned out to be a bad idea in the past. > In other case, if it's known in advance that specific mount would be > populated with large files, you might want to set it to zero to get huge > pages allocated from the beginning. Cannot we use [mf]advise for that purpose? -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > [...] > > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 > > From: "Kirill A. Shutemov" > > Date: Mon, 17 Oct 2016 14:44:47 +0300 > > Subject: [PATCH] shmem: avoid huge pages for small files > > > > Huge pages are detrimental for small file: they causes noticible > > overhead on both allocation performance and memory footprint. > > > > This patch aimed to address this issue by avoiding huge pages until file > > grown to specified size. This would cover most of the cases where huge > > pages causes regressions in performance. > > > > By default the minimal file size to allocate huge pages is equal to size > > of huge page. > > ok > > > We add two handle to specify minimal file size for huge pages: > > > > - mount option 'huge_min_size'; > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > in-kernel tmpfs mountpoint; > > Could you explain who might like to change the minimum value (other than > disable the feautre for the mount point) and for what reason? Depending on how well CPU microarchitecture deals with huge pages, you might need to set it higher in order to balance out overhead with benefit of huge pages. In other case, if it's known in advance that specific mount would be populated with large files, you might want to set it to zero to get huge pages allocated from the beginning. > > @@ -238,6 +238,12 @@ values: > >- "force": > > Force the huge option on for all - very useful for testing; > > > > +Tehre's limit on minimal file size before kenrel starts allocate huge > > +pages for it. By default it's size of huge page. > > Smoe tyopse Wlil fxi! -- Kirill A. Shutemov
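For completeness, this is how the two knobs proposed in this patch would be used. Note that they exist only with this (unmerged) patch applied: /sys/kernel/mm/transparent_hugepage/shmem_min_size and the huge_min_size= mount option are not part of mainline kernels, so the sketch below is purely illustrative of the proposed interface; paths and values are examples.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* Proposed sysfs knob for the in-kernel tmpfs mount: raise the limit to 4MB. */
        int fd = open("/sys/kernel/mm/transparent_hugepage/shmem_min_size",
                      O_WRONLY);
        if (fd >= 0) {
                const char *val = "4194304";
                if (write(fd, val, strlen(val)) < 0)
                        perror("write shmem_min_size");
                close(fd);
        } else {
                perror("open shmem_min_size (patch not applied?)");
        }

        /* Proposed per-mount variant: no huge pages until a file reaches 1MB. */
        if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
                  "huge=always,huge_min_size=1M") < 0)
                perror("mount huge_min_size");
        return 0;
}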
Re: [PATCH] shmem: avoid huge pages for small files
On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: [...] > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" > Date: Mon, 17 Oct 2016 14:44:47 +0300 > Subject: [PATCH] shmem: avoid huge pages for small files > > Huge pages are detrimental for small file: they causes noticible > overhead on both allocation performance and memory footprint. > > This patch aimed to address this issue by avoiding huge pages until file > grown to specified size. This would cover most of the cases where huge > pages causes regressions in performance. > > By default the minimal file size to allocate huge pages is equal to size > of huge page. ok > We add two handle to specify minimal file size for huge pages: > > - mount option 'huge_min_size'; > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > in-kernel tmpfs mountpoint; Could you explain who might like to change the minimum value (other than disable the feautre for the mount point) and for what reason? [...] > @@ -238,6 +238,12 @@ values: >- "force": > Force the huge option on for all - very useful for testing; > > +Tehre's limit on minimal file size before kenrel starts allocate huge > +pages for it. By default it's size of huge page. Smoe tyopse -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 17, 2016 at 03:18:09PM +0300, Kirill A. Shutemov wrote: > diff --git a/mm/shmem.c b/mm/shmem.c > index ad7813d73ea7..c69047386e2f 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space > *mapping, > /* ifdef here to avoid bloating shmem.o when not necessary */ > > int shmem_huge __read_mostly; > +unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly; Arghh.. Last second changes... This should be unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE; >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Mon, 17 Oct 2016 14:44:47 +0300 Subject: [PATCH] shmem: avoid huge pages for small files Huge pages are detrimental for small file: they causes noticible overhead on both allocation performance and memory footprint. This patch aimed to address this issue by avoiding huge pages until file grown to specified size. This would cover most of the cases where huge pages causes regressions in performance. By default the minimal file size to allocate huge pages is equal to size of huge page. We add two handle to specify minimal file size for huge pages: - mount option 'huge_min_size'; - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel tmpfs mountpoint; Few notes: - if shmem_enabled is set to 'force', the limit is ignored. We still want to generate as many pages as possible for functional testing. - the limit doesn't affect khugepaged behaviour: it still can collapse pages based on its settings; - remount of the filesystem doesn't affect previously allocated pages, but the limit is applied for new allocations; Signed-off-by: Kirill A. Shutemov --- Documentation/vm/transhuge.txt | 6 + include/linux/huge_mm.h| 1 + include/linux/shmem_fs.h | 1 + mm/huge_memory.c | 1 + mm/shmem.c | 56 ++ 5 files changed, 60 insertions(+), 5 deletions(-) diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 2ec6adb5a4ce..40006d193687 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -238,6 +238,12 @@ values: - "force": Force the huge option on for all - very useful for testing; +Tehre's limit on minimal file size before kenrel starts allocate huge +pages for it. By default it's size of huge page. + +You can adjust the limit using "huge_min_size=" mount option or +/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount. 
+ == Need of application restart == The transparent_hugepage/enabled values and tmpfs mount option only affect diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 9b9f65d99873..515b96a5a592 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf, enum transparent_hugepage_flag flag); extern struct kobj_attribute shmem_enabled_attr; +extern struct kobj_attribute shmem_min_size_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1huge_min_size && + index < (sbinfo->huge_min_size >> PAGE_SHIFT)) + goto alloc_nohuge; switch (sbinfo->huge) { - loff_t i_size; pgoff_t off; case SHMEM_HUGE_NEVER: goto alloc_nohuge; case SHMEM_HUGE_WITHIN_SIZE: off = round_up(index, HPAGE_PMD_NR); - i_size = round_up(i_size_read(inode), PAGE_SIZE); + i_size = round_up(i_size, PAGE_SIZE); if (i_size >= HPAGE_PMD_SIZE && i_size >> PAGE_SHIFT >= off) goto alloc_huge; @@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, huge != SHMEM_HUGE_NEVER) goto bad_val; sbinfo->huge = huge; + } else if (!strcmp(this_char, "huge_min_size")) { + sbinfo->huge_min_size = memparse(value, &rest); + if (*rest) + goto bad_val; #endif #ifdef CONFIG_NUMA } else if (!strcmp