Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Nov 11, 2016 at 01:42:47AM +0800, kbuild test robot wrote:
> Hi Kirill,
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.9-rc4 next-20161110]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:      https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/2016-005428
> config:   i386-randconfig-s0-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
>    mm/shmem.c: In function 'shmem_getpage_gfp':
> >> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
>        pgoff_t off;

From f0a582888ac6dcb56c6134611c83edfb091bbcb6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov"
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to the size of a huge page, if the filesystem is mounted
with the huge=within_size option. This covers most of the cases where
huge pages cause performance regressions.

The limit doesn't affect khugepaged behaviour: it can still collapse
pages based on its own settings.

Signed-off-by: Kirill A. Shutemov
---
 Documentation/vm/transhuge.txt | 7 ++++++-
 mm/shmem.c                     | 7 ++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..14c911c56f4a 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
   - "always":
     Attempt to allocate huge pages every time we need a new page;
 
+This option can lead to significant overhead if the filesystem is used
+to store small files.
+
   - "never":
     Do not allocate huge pages;
 
   - "within_size":
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate a huge page if the size of the file exceeds the size
+    of a huge page. This helps to avoid overhead for small files.
+
     Also respect fadvise()/madvise() hints;
 
   - "advise:
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..3e2c0912c587 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
                         goto alloc_huge;
                 switch (sbinfo->huge) {
                         loff_t i_size;
-                        pgoff_t off;
                 case SHMEM_HUGE_NEVER:
                         goto alloc_nohuge;
                 case SHMEM_HUGE_WITHIN_SIZE:
-                        off = round_up(index, HPAGE_PMD_NR);
-                        i_size = round_up(i_size_read(inode), PAGE_SIZE);
-                        if (i_size >= HPAGE_PMD_SIZE &&
-                                        i_size >> PAGE_SHIFT >= off)
+                        i_size = i_size_read(inode);
+                        if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
                                 goto alloc_huge;
                         /* fallthrough */
                 case SHMEM_HUGE_ADVISE:
-- 
2.9.3

-- 
 Kirill A. Shutemov
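For readers following the logic change: the revised SHMEM_HUGE_WITHIN_SIZE test in the patch above can be modelled in a few lines of self-contained userspace C. The HPAGE_PMD_* values are stubbed for a 4kB base page / 2MB PMD configuration, and want_huge_page() is an illustrative name, not a kernel function.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT      12
#define PAGE_SIZE       (1UL << PAGE_SHIFT)
#define HPAGE_PMD_SIZE  (2UL * 1024 * 1024)
#define HPAGE_PMD_NR    (HPAGE_PMD_SIZE / PAGE_SIZE)    /* 512 */

/* index: page index being faulted; i_size: current file size in bytes */
static bool want_huge_page(unsigned long index, long long i_size)
{
        /*
         * Allow a huge page only once the file already spans a huge page,
         * or the access itself reaches past the first PMD-sized unit;
         * small files keep getting 4k pages.
         */
        return index >= HPAGE_PMD_NR || i_size >= (long long)HPAGE_PMD_SIZE;
}

int main(void)
{
        printf("4kB file, first page : %d\n", want_huge_page(0, 4096));        /* 0 */
        printf("3MB file, first page : %d\n", want_huge_page(0, 3LL << 20));   /* 1 */
        printf("4kB file, index 600  : %d\n", want_huge_page(600, 4096));      /* 1 */
        return 0;
}

With these inputs, a 4kB file faulting its first page stays on small pages, while a file already larger than 2MB, or an access beyond the first PMD-sized unit, is allowed a huge page.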
Re: [PATCH] shmem: avoid huge pages for small files
Hi Kirill,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc4 next-20161110]
[if your patch is applied to the wrong git tree, please drop us a note to
help improve the system]

url:      https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/2016-005428
config:   i386-randconfig-s0-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   mm/shmem.c: In function 'shmem_getpage_gfp':
>> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
       pgoff_t off;
               ^~~

vim +/off +1680 mm/shmem.c

66d2f4d2  Hugh Dickins        2014-07-02  1664                 mark_page_accessed(page);
66d2f4d2  Hugh Dickins        2014-07-02  1665
54af6042  Hugh Dickins        2011-08-03  1666                 delete_from_swap_cache(page);
27ab7006  Hugh Dickins        2011-07-25  1667                 set_page_dirty(page);
27ab7006  Hugh Dickins        2011-07-25  1668                 swap_free(swap);
27ab7006  Hugh Dickins        2011-07-25  1669
54af6042  Hugh Dickins        2011-08-03  1670         } else {
800d8c63  Kirill A. Shutemov  2016-07-26  1671                 /* shmem_symlink() */
800d8c63  Kirill A. Shutemov  2016-07-26  1672                 if (mapping->a_ops != &shmem_aops)
800d8c63  Kirill A. Shutemov  2016-07-26  1673                         goto alloc_nohuge;
657e3038  Kirill A. Shutemov  2016-07-26  1674                 if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
800d8c63  Kirill A. Shutemov  2016-07-26  1675                         goto alloc_nohuge;
800d8c63  Kirill A. Shutemov  2016-07-26  1676                 if (shmem_huge == SHMEM_HUGE_FORCE)
800d8c63  Kirill A. Shutemov  2016-07-26  1677                         goto alloc_huge;
800d8c63  Kirill A. Shutemov  2016-07-26  1678                 switch (sbinfo->huge) {
800d8c63  Kirill A. Shutemov  2016-07-26  1679                         loff_t i_size;
800d8c63  Kirill A. Shutemov  2016-07-26 @1680                         pgoff_t off;
800d8c63  Kirill A. Shutemov  2016-07-26  1681                 case SHMEM_HUGE_NEVER:
800d8c63  Kirill A. Shutemov  2016-07-26  1682                         goto alloc_nohuge;
800d8c63  Kirill A. Shutemov  2016-07-26  1683                 case SHMEM_HUGE_WITHIN_SIZE:
bb89f249  Kirill A. Shutemov  2016-11-10  1684                         i_size = i_size_read(inode);
bb89f249  Kirill A. Shutemov  2016-11-10  1685                         if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
800d8c63  Kirill A. Shutemov  2016-07-26  1686                                 goto alloc_huge;
800d8c63  Kirill A. Shutemov  2016-07-26  1687                         /* fallthrough */
800d8c63  Kirill A. Shutemov  2016-07-26  1688                 case SHMEM_HUGE_ADVISE:

:: The code at line 1680 was first introduced by commit
:: 800d8c63b2e989c2e349632d1648119bf5862f01 shmem: add huge pages support
::
:: TO: Kirill A. Shutemov
:: CC: Linus Torvalds

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote: > On 10/21/2016 03:50 PM, Dave Chinner wrote: > > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > >> To me, most of things you're talking about is highly dependent on access > >> pattern generated by userspace: > >> > >> - we may want to allocate huge pages from byte 1 if we know that file > >> will grow; > > > > delayed allocation takes care of that. We use a growing speculative > > delalloc size that kicks in at specific sizes and can be used > > directly to determine if a large page shoul dbe allocated. This code > > is aware of sparse files, sparse writes, etc. > > OK, so somebody does a write() of 1 byte. We can delay the underlying > block allocation for a long time, but we can *not* delay the memory > allocation. We've got to decide before the write() returns. > How does delayed allocation help with that decision? You (and Kirill) have likely misunderstood what I'm saying, based on the fact you are thinking I'm talking about delayed allocation of page cache pages. I'm not. The current code does this for a sequential write: write( off, len) for each PAGE_SIZE chunk grab page cache page alloc + insert if not found map block to page get_block(off, PAGE_SIZE); filesystem does allocation update bufferhead attached to page write data into page Essentially, delayed block allocation occurs inside the get_block() call, completely hidden from the page cache and IO layers. In XFS, we special stuff based on the offset being written to, the size of the existing extent we are adding, etc. to specualtively /reserve/ more blocks that the write actually needs and keep them in a delalloc extent that extends beyond EOF. The next page is grabbed, get_block is called again, and we find we've already got a delalloc reservation for that file offset, so we return immediately. And we repeat that until the delalloc extent runs out. When it runs out, we allocate a bigger delalloc extent beyond EOF so that as the file grows we do fewer and fewer larger delayed allocation reservations. These grow out to /gigabytes/ if there is that much data to write. i.e. the filesystem clearly knows when using large pages would be appropriate, but because it's inside the page cache allocation, it can't influence it at all. Here's what the new fs/iomap.c code does for the same sequential write: write(off, len) iomap_begin(off, len) filesystem does delayed allocation of at least len bytes << returns an iomap with single mapping iomap_apply() for each PAGE_SIZE chunk grab page cache page alloc + insert if not found map iomap to page write data into page iomap_end() Hence if the write was for 1 byte into an empty file, we'd get a single block extent back, which would match to a single PAGE_SIZE page cache allocation required. If the app is doing sequential 1 byte IO, and we're at offset 2MB and the filesystem returns a 2MB delalloc extent (i.e. extends 2MB byte beyond EOF), we know we're getting sequential write IO and we could use a 2MB page in the page cache for this. Similarly, if the app is doing large IO - say 16MB at a time, we'll get at least a 16MB delalloc extent returned from the filesystem, and we know we could map that quickly and easily to 8 x 2MB huge pages in the page cache. 
But if we get random 4k writes into a sparse file, the filesystem will be allocating single blocks, so the iomaps being returned would be for a single block, and we know that PAGE_SIZE pages would be best to allocate. Or we could have a 2 MB extent size hint set, so every iomap returned from the filesystem is going to be 2MB aligned and sized, in which case we could always map the returned iomap to a huge page rather than worry about IO sizes and incoming IO patterns. Now do you see the difference? The filesystem now has the ability to optimise allocation based on user application IO patterns rather than trying to guess from single page size block mapping requests. And because this happens before we look at the page cache, we can use that information to influence what we do with the page cache. > >> I'm not convinced that filesystem is in better position to see access > >> patterns than mm for page cache. It's not all about on-disk layout. > > > > Spoken like a true mm developer. IO performance is all about IO > > patterns, and the primary contributor to bad IO patterns is bad > > filesystem allocation patterns :P > > For writes, I think you have a good point. Managing a horribly > fragmented file with larger pages and eating the associated write > magnification that comes along with it seems like a recipe for disaster. > > But, Isn't some level of disconnection between the page cache and the > underlying IO patterns a *good* thing? Up to a point. Buffered IO only
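To make the argument above concrete, here is a rough, self-contained C sketch of the heuristic being described: let the extent the filesystem maps or reserves for the IO (what an iomap_begin()-style call would return) drive the page-cache allocation size. struct mapped_extent and choose_page_size() are illustrative names rather than kernel APIs, and the rule shown (huge page only when the mapping covers a whole aligned 2MB unit) is just one possible policy.

#include <stdio.h>

#define PAGE_SIZE       4096ULL
#define HPAGE_PMD_SIZE  (2ULL * 1024 * 1024)

struct mapped_extent {
        unsigned long long offset;      /* file offset the mapping starts at */
        unsigned long long length;      /* bytes the filesystem mapped/reserved */
};

/* Pick the page-cache allocation size for an access at byte position 'pos'. */
static unsigned long long choose_page_size(const struct mapped_extent *ext,
                                           unsigned long long pos)
{
        unsigned long long start = pos & ~(HPAGE_PMD_SIZE - 1);

        /* Use a huge page only if the mapping covers the whole aligned 2MB unit. */
        if (ext->offset <= start &&
            ext->offset + ext->length >= start + HPAGE_PMD_SIZE)
                return HPAGE_PMD_SIZE;
        return PAGE_SIZE;
}

int main(void)
{
        struct mapped_extent one_block = { .offset = 0, .length = 4096 };
        struct mapped_extent delalloc  = { .offset = 0, .length = 16ULL << 20 };

        printf("1-byte write, single-block map : %llu\n", choose_page_size(&one_block, 0));
        printf("16MB delalloc reservation      : %llu\n", choose_page_size(&delalloc, 0));
        return 0;
}

A 1-byte write backed by a single-block mapping gets a 4kB page; a write the filesystem answers with a 16MB delalloc reservation becomes a candidate for 2MB pages.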
Re: [PATCH] shmem: avoid huge pages for small files
On 10/21/2016 03:50 PM, Dave Chinner wrote: > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: >> To me, most of things you're talking about is highly dependent on access >> pattern generated by userspace: >> >> - we may want to allocate huge pages from byte 1 if we know that file >> will grow; > > delayed allocation takes care of that. We use a growing speculative > delalloc size that kicks in at specific sizes and can be used > directly to determine if a large page shoul dbe allocated. This code > is aware of sparse files, sparse writes, etc. OK, so somebody does a write() of 1 byte. We can delay the underlying block allocation for a long time, but we can *not* delay the memory allocation. We've got to decide before the write() returns. How does delayed allocation help with that decision? I guess we could (always?) allocate small pages up front, and then only bother promoting them once the FS delayed-allocation code kicks in and is *also* giving us underlying large allocations. That punts the logic to the filesystem, which is a bit counterintuitive, but it seems relatively sane. >>> As such, there is no way we should be considering different >>> interfaces and methods for configuring the /same functionality/ just >>> because DAX is enabled or not. It's the /same decision/ that needs >>> to be made, and the filesystem knows an awful lot more about whether >>> huge pages can be used efficiently at the time of access than just >>> about any other actor you can name >> >> I'm not convinced that filesystem is in better position to see access >> patterns than mm for page cache. It's not all about on-disk layout. > > Spoken like a true mm developer. IO performance is all about IO > patterns, and the primary contributor to bad IO patterns is bad > filesystem allocation patterns :P For writes, I think you have a good point. Managing a horribly fragmented file with larger pages and eating the associated write magnification that comes along with it seems like a recipe for disaster. But, Isn't some level of disconnection between the page cache and the underlying IO patterns a *good* thing? Once we've gone to the trouble of bringing some (potentially very fragmented) data into the page cache, why _not_ manage it in a lower-overhead way if we can? For read-only data it seems like a no-brainer that we'd want things in as large of a management unit as we can get. IOW, why let the underlying block allocation layout hamstring how the memory is managed?
Re: [PATCH] shmem: avoid huge pages for small files
On Sat, Oct 22, 2016 at 09:50:13AM +1100, Dave Chinner wrote: > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > > > Ugh, no, please don't use mount options for file specific behaviours > > > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > > > behaviour that should either just work automatically (i.e. be > > > > > completely controlled by the filesystem) or only be applied to files > > > > > > > > Can you explain what you mean? How would the file system control it? > > > > > > There's no point in asking for huge pages when populating the page > > > cache if the file is: > > > > > > - significantly smaller than the huge page size > > > - largely sparse > > > - being randomly accessed in small chunks > > > - badly fragmented and so takes hundreds of IO to read/write > > > a huge page > > > - able to optimise delayed allocation to match huge page > > > sizes and alignments > > > > > > These are all constraints the filesystem knows about, but the > > > application and user don't. > > > > Really? > > > > To me, most of things you're talking about is highly dependent on access > > pattern generated by userspace: > > > > - we may want to allocate huge pages from byte 1 if we know that file > > will grow; > > delayed allocation takes care of that. We use a growing speculative > delalloc size that kicks in at specific sizes and can be used > directly to determine if a large page shoul dbe allocated. This code > is aware of sparse files, sparse writes, etc. I'm confused here. How can we delay allocation of page cache? Delalloc is helpful to have reasonable on-disk layout, but my understanding is that it uses page cache as buffering to postpone block allocation. Later on writeback we see access pattern using pages from page cache. I'm likely missing something important here. Hm? > > - it will be beneficial to allocate huge page even for fragmented files, > > if it's read-mostly; > > No, no it won't. The IO latency impact here can be massive. > read-ahead of single 4k pages hides most of this latency from the > application, but with a 2MB page, we can't use readhead to hide this > IO latency because the first access could stall for hundreds of > small random read IOs to be completed instead of just 1. I agree that it will lead to initial latency spike. But don't we have workloads which would tolerate it to get faster hot-cache behaviour? > > > Further, we are moving the IO path to a model where we use extents > > > for mapping, not blocks. We're optimising for the fact that modern > > > filesystems use extents and so massively reduce the number of block > > > mapping lookup calls we need to do for a given IO. > > > > > > i.e. instead of doing "get page, map block to page" over and over > > > again until we've alked over the entire IO range, we're doing > > > "map extent for entire IO range" once, then iterating "get page" > > > until we've mapped the entire range. > > > > That's great, but it's not how IO path works *now*. And will takes a long > > time (if ever) to flip it over to what you've described. > > Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted > right now, GFS2 will use parts of it in the next release, DAX > already uses it and PMD support in DAX is being built on top of it. That's interesting. I've managed to miss whole fs/iomap.c thing... 
> > > As such, there is no way we should be considering different > > > interfaces and methods for configuring the /same functionality/ just > > > because DAX is enabled or not. It's the /same decision/ that needs > > > to be made, and the filesystem knows an awful lot more about whether > > > huge pages can be used efficiently at the time of access than just > > > about any other actor you can name > > > > I'm not convinced that filesystem is in better position to see access > > patterns than mm for page cache. It's not all about on-disk layout. > > Spoken like a true mm developer. Guilty. -- Kirill A. Shutemov
Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote: > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > > Ugh, no, please don't use mount options for file specific behaviours > > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > > behaviour that should either just work automatically (i.e. be > > > > completely controlled by the filesystem) or only be applied to files > > > > > > Can you explain what you mean? How would the file system control it? > > > > There's no point in asking for huge pages when populating the page > > cache if the file is: > > > > - significantly smaller than the huge page size > > - largely sparse > > - being randomly accessed in small chunks > > - badly fragmented and so takes hundreds of IO to read/write > > a huge page > > - able to optimise delayed allocation to match huge page > > sizes and alignments > > > > These are all constraints the filesystem knows about, but the > > application and user don't. > > Really? > > To me, most of things you're talking about is highly dependent on access > pattern generated by userspace: > > - we may want to allocate huge pages from byte 1 if we know that file > will grow; delayed allocation takes care of that. We use a growing speculative delalloc size that kicks in at specific sizes and can be used directly to determine if a large page shoul dbe allocated. This code is aware of sparse files, sparse writes, etc. > - the same for sparse file that will be filled; See above. > - it will be beneficial to allocate huge page even for fragmented files, > if it's read-mostly; No, no it won't. The IO latency impact here can be massive. read-ahead of single 4k pages hides most of this latency from the application, but with a 2MB page, we can't use readhead to hide this IO latency because the first access could stall for hundreds of small random read IOs to be completed instead of just 1. > > Further, we are moving the IO path to a model where we use extents > > for mapping, not blocks. We're optimising for the fact that modern > > filesystems use extents and so massively reduce the number of block > > mapping lookup calls we need to do for a given IO. > > > > i.e. instead of doing "get page, map block to page" over and over > > again until we've alked over the entire IO range, we're doing > > "map extent for entire IO range" once, then iterating "get page" > > until we've mapped the entire range. > > That's great, but it's not how IO path works *now*. And will takes a long > time (if ever) to flip it over to what you've described. Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted right now, GFS2 will use parts of it in the next release, DAX already uses it and PMD support in DAX is being built on top of it. > > As such, there is no way we should be considering different > > interfaces and methods for configuring the /same functionality/ just > > because DAX is enabled or not. It's the /same decision/ that needs > > to be made, and the filesystem knows an awful lot more about whether > > huge pages can be used efficiently at the time of access than just > > about any other actor you can name > > I'm not convinced that filesystem is in better position to see access > patterns than mm for page cache. It's not all about on-disk layout. Spoken like a true mm developer. 
IO performance is all about IO patterns, and the primary contributor to bad IO patterns is bad filesystem allocation patterns :P We're rapidly moving away from the world where a page cache is needed to give applications decent performance. DAX doesn't have a page cache, applications wanting to use high IOPS (hundreds of thousands to millions) storage are using direct IO, because the page cache just introduces latency, memory usage issues and non-deterministic IO behaviour. I we try to make the page cache the "one true IO optimisation source" then we're screwing ourselves because the incoming IO technologies simply don't require it anymore. We need to be ahead of that curve, not playing catchup, and that's why this sort of "what should the page cache do" decisions really need to come from the IO path where we see /all/ the IO, not just buffered IO Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] shmem: avoid huge pages for small files
On Fri 21-10-16 18:00:07, Kirill A. Shutemov wrote: > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: [...] > > None of these aspects can be optimised sanely by a single threshold, > > especially when considering the combination of access patterns vs file > > layout. > > I agree. > > Here I tried to address the particular performance regression I see with > huge pages enabled on tmpfs. It doesn't mean to fix all possible issues. So can we start simple and use huge pages on shmem mappings only when they are larger than the huge page? Without any tunable which might turn out to be misleading/wrong later on. If I understand Dave's comments it is really not all that clear that a mount option makes sense. I cannot comment on those but they clearly show that there are multiple points of view here. -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote: > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > > Ugh, no, please don't use mount options for file specific behaviours > > > in filesystems like ext4 and XFS. This is exactly the sort of > > > behaviour that should either just work automatically (i.e. be > > > completely controlled by the filesystem) or only be applied to files > > > > Can you explain what you mean? How would the file system control it? > > There's no point in asking for huge pages when populating the page > cache if the file is: > > - significantly smaller than the huge page size > - largely sparse > - being randomly accessed in small chunks > - badly fragmented and so takes hundreds of IO to read/write > a huge page > - able to optimise delayed allocation to match huge page > sizes and alignments > > These are all constraints the filesystem knows about, but the > application and user don't. Really? To me, most of things you're talking about is highly dependent on access pattern generated by userspace: - we may want to allocate huge pages from byte 1 if we know that file will grow; - the same for sparse file that will be filled; - it will be beneficial to allocate huge page even for fragmented files, if it's read-mostly; > None of these aspects can be optimised sanely by a single threshold, > especially when considering the combination of access patterns vs file > layout. I agree. Here I tried to address the particular performance regression I see with huge pages enabled on tmpfs. It doesn't mean to fix all possible issues. > Further, we are moving the IO path to a model where we use extents > for mapping, not blocks. We're optimising for the fact that modern > filesystems use extents and so massively reduce the number of block > mapping lookup calls we need to do for a given IO. > > i.e. instead of doing "get page, map block to page" over and over > again until we've alked over the entire IO range, we're doing > "map extent for entire IO range" once, then iterating "get page" > until we've mapped the entire range. That's great, but it's not how IO path works *now*. And will takes a long time (if ever) to flip it over to what you've described. > Hence if we have a 2MB IO come in from userspace, and the iomap > returned is a covers that entire range, it's a no-brainer to ask the > page cache for a huge page instead of iterating 512 times to map all > the 4k pages needed. Yeah, it's no-brainier. But do we want to limit huge page allocation only to such best-possible cases? I hardly ever seen 2MB IOs in real world... And this approach will put too much decision power on the first access to the file range. It may or may not represent future access pattern. > > > specifically configured with persistent hints to reliably allocate > > > extents in a way that can be easily mapped to huge pages. > > > > > e.g. on XFS you will need to apply extent size hints to get large > > > page sized/aligned extent allocation to occur, and so this > > > > It sounds like you're confusing alignment in memory with alignment > > on disk here? I don't see why on disk alignment would be needed > > at all, unless we're talking about DAX here (which is out of > > scope currently) Kirill's changes are all about making the memory > > access for cached data more efficient, it's not about disk layout > > optimizations. > > No, I'm not confusing this with DAX. However, this automatic use > model for huge pages fits straight into DAX as well. 
Same > mechanisms, same behaviours, slightly stricter alignment > characteristics. All stuff the filesystem already knows about. > > Mount options are, quite frankly, a terrible mechanism for > specifying filesystem policy. Setting up DAX this way was a mistake, > and it's a mount option I plan to remove from XFS once we get nearer > to having DAX feature complete and stablised. We've already got > on-disk "use DAX for this file" flags in XFS, so we can easier and > cleanly support different methods of accessing PMEM from the same > filesystem. > > As such, there is no way we should be considering different > interfaces and methods for configuring the /same functionality/ just > because DAX is enabled or not. It's the /same decision/ that needs > to be made, and the filesystem knows an awful lot more about whether > huge pages can be used efficiently at the time of access than just > about any other actor you can name I'm not convinced that filesystem is in better position to see access patterns than mm for page cache. It's not all about on-disk layout. -- Kirill A. Shutemov
Re: [PATCH] shmem: avoid huge pages for small files
On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote: > > Ugh, no, please don't use mount options for file specific behaviours > > in filesystems like ext4 and XFS. This is exactly the sort of > > behaviour that should either just work automatically (i.e. be > > completely controlled by the filesystem) or only be applied to files > > Can you explain what you mean? How would the file system control it? There's no point in asking for huge pages when populating the page cache if the file is: - significantly smaller than the huge page size - largely sparse - being randomly accessed in small chunks - badly fragmented and so takes hundreds of IO to read/write a huge page - able to optimise delayed allocation to match huge page sizes and alignments These are all constraints the filesystem knows about, but the application and user don't. None of these aspects can be optimised sanely by a single threshold, especially when considering the combination of access patterns vs file layout. Further, we are moving the IO path to a model where we use extents for mapping, not blocks. We're optimising for the fact that modern filesystems use extents and so massively reduce the number of block mapping lookup calls we need to do for a given IO. i.e. instead of doing "get page, map block to page" over and over again until we've alked over the entire IO range, we're doing "map extent for entire IO range" once, then iterating "get page" until we've mapped the entire range. Hence if we have a 2MB IO come in from userspace, and the iomap returned is a covers that entire range, it's a no-brainer to ask the page cache for a huge page instead of iterating 512 times to map all the 4k pages needed. > > specifically configured with persistent hints to reliably allocate > > extents in a way that can be easily mapped to huge pages. > > > e.g. on XFS you will need to apply extent size hints to get large > > page sized/aligned extent allocation to occur, and so this > > It sounds like you're confusing alignment in memory with alignment > on disk here? I don't see why on disk alignment would be needed > at all, unless we're talking about DAX here (which is out of > scope currently) Kirill's changes are all about making the memory > access for cached data more efficient, it's not about disk layout > optimizations. No, I'm not confusing this with DAX. However, this automatic use model for huge pages fits straight into DAX as well. Same mechanisms, same behaviours, slightly stricter alignment characteristics. All stuff the filesystem already knows about. Mount options are, quite frankly, a terrible mechanism for specifying filesystem policy. Setting up DAX this way was a mistake, and it's a mount option I plan to remove from XFS once we get nearer to having DAX feature complete and stablised. We've already got on-disk "use DAX for this file" flags in XFS, so we can easier and cleanly support different methods of accessing PMEM from the same filesystem. As such, there is no way we should be considering different interfaces and methods for configuring the /same functionality/ just because DAX is enabled or not. It's the /same decision/ that needs to be made, and the filesystem knows an awful lot more about whether huge pages can be used efficiently at the time of access than just about any other actor you can name > > persistent extent size hint should trigger the filesystem to use > > large pages if supported, the hint is correctly sized and aligned, > > and there are large pages available for allocation. 
> > That would be ioctls and similar? You can, but existing filesystem admin tools can already set up allocation policies without the apps being aware that they even exist. If you want to use huge page mappings with DAX you'll already need to do this because of the physical alignment requirements of DAX. Further, such techniques are already used by many admins for things like limiting fragmentation of sparse vm image files. So while you may not know it, extent size hints and per-file inheritable attributes are quire widely used already to manage filesystem behaviour without users or applications even being aware that the filesystem policies have been modified by the admin... > That would imply that every application wanting to use large pages > would need to be especially enabled. That would seem awfully limiting > to me and needlessly deny benefits to most existing code. No change to applications will be necessary (see above), though there's no reason why couldn't directly use the VFS interfaces to explicitly ask for such behaviour themselves Cheers, Dave. -- Dave Chinner da...@fromorbit.com
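As a concrete example of the persistent, per-file allocation hints referred to above: an extent size hint can be set on an (ideally still empty) XFS file through the generic FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls. The file path below is made up and error handling is minimal.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
        /* Path is an example; set the hint before the file gets any data. */
        int fd = open("/mnt/xfs/vm-image.raw", O_RDWR | O_CREAT, 0644);
        struct fsxattr fsx;

        if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("open/FS_IOC_FSGETXATTR");
                return 1;
        }

        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;     /* honour fsx_extsize for this file */
        fsx.fsx_extsize = 2 * 1024 * 1024;      /* 2MB; must be a multiple of the fs block size */

        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                return 1;
        }
        close(fd);
        return 0;
}

The same hint can be applied by an administrator with existing tools, e.g. xfs_io -c "extsize 2m" <file>, without the application being aware of it, which is the point being made about admin-controlled policy.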
Re: [PATCH] shmem: avoid huge pages for small files
> Ugh, no, please don't use mount options for file specific behaviours > in filesystems like ext4 and XFS. This is exactly the sort of > behaviour that should either just work automatically (i.e. be > completely controlled by the filesystem) or only be applied to files Can you explain what you mean? How would the file system control it? > specifically configured with persistent hints to reliably allocate > extents in a way that can be easily mapped to huge pages. > e.g. on XFS you will need to apply extent size hints to get large > page sized/aligned extent allocation to occur, and so this It sounds like you're confusing alignment in memory with alignment on disk here? I don't see why on disk alignment would be needed at all, unless we're talking about DAX here (which is out of scope currently) Kirill's changes are all about making the memory access for cached data more efficient, it's not about disk layout optimizations. > persistent extent size hint should trigger the filesystem to use > large pages if supported, the hint is correctly sized and aligned, > and there are large pages available for allocation. That would be ioctls and similar? That would imply that every application wanting to use large pages would need to be especially enabled. That would seem awfully limiting to me and needlessly deny benefits to most existing code. -Andi
Re: [PATCH] shmem: avoid huge pages for small files
On Thu, Oct 20, 2016 at 01:39:46PM +0300, Kirill A. Shutemov wrote: > On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote: > > On Tue, 18 Oct 2016, Michal Hocko wrote: > > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > > > [...] > > > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > > > > > - sysfs file > > > > > > > > /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > > > > > Could you explain who might like to change the minimum value > > > > > > > (other than > > > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, > > > > > > you > > > > > > might need to set it higher in order to balance out overhead with > > > > > > benefit > > > > > > of huge pages. > > > > > > > > > > I am not sure this is a good argument. How do a user know and what > > > > > will > > > > > help to make that decision? Why we cannot autotune that? In other > > > > > words, > > > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > > > > > Well, I don't see a reasonable way to autotune it. We can just let > > > > arch-specific code to redefine it, but the argument below still stands. > > > > > > > > > > In other case, if it's known in advance that specific mount would be > > > > > > populated with large files, you might want to set it to zero to get > > > > > > huge > > > > > > pages allocated from the beginning. > > > > > > Do you think this is a sufficient reason to provide a tunable with such a > > > precision? In other words why cannot we simply start by using an > > > internal only limit at the huge page size for the initial transition > > > (with a way to disable THP altogether for a mount point) and only add a > > > more fine grained tunning if there ever is a real need for it with a use > > > case description. In other words can we be less optimistic about > > > tunables than we used to be in the past and often found out that those > > > were mistakes much later? > > > > I'm not sure whether I'm arguing in the same or the opposite direction > > as you, Michal, but what makes me unhappy is not so much the tunable, > > as the proliferation of mount options. > > > > Kirill, this issue is (not exactly but close enough) what the mount > > option "huge=within_size" was supposed to be about: not wasting huge > > pages on small files. I'd be much happier if you made huge_min_size > > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, > > and used it to govern "huge=within_size" mounts only. > > Well, you're right that I tried originally address the issue with > huge=within_size, but this option makes much more sense for filesystem > with persistent storage. For ext4, it would be pretty usable option. Ugh, no, please don't use mount options for file specific behaviours in filesystems like ext4 and XFS. This is exactly the sort of behaviour that should either just work automatically (i.e. 
be completely controlled by the filesystem) or only be applied to files specifically configured with persistent hints to reliably allocate extents in a way that can be easily mapped to huge pages. e.g. on XFS you will need to apply extent size hints to get large page sized/aligned extent allocation to occur, and so this persistent extent size hint should trigger the filesystem to use large pages if supported, the hint is correctly sized and aligned, and there are large pages available for allocation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] shmem: avoid huge pages for small files
On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote: > On Tue, 18 Oct 2016, Michal Hocko wrote: > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > > [...] > > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size > > > > > > > for > > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > > > Could you explain who might like to change the minimum value (other > > > > > > than > > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > > > might need to set it higher in order to balance out overhead with > > > > > benefit > > > > > of huge pages. > > > > > > > > I am not sure this is a good argument. How do a user know and what will > > > > help to make that decision? Why we cannot autotune that? In other words, > > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > > > Well, I don't see a reasonable way to autotune it. We can just let > > > arch-specific code to redefine it, but the argument below still stands. > > > > > > > > In other case, if it's known in advance that specific mount would be > > > > > populated with large files, you might want to set it to zero to get > > > > > huge > > > > > pages allocated from the beginning. > > > > Do you think this is a sufficient reason to provide a tunable with such a > > precision? In other words why cannot we simply start by using an > > internal only limit at the huge page size for the initial transition > > (with a way to disable THP altogether for a mount point) and only add a > > more fine grained tunning if there ever is a real need for it with a use > > case description. In other words can we be less optimistic about > > tunables than we used to be in the past and often found out that those > > were mistakes much later? > > I'm not sure whether I'm arguing in the same or the opposite direction > as you, Michal, but what makes me unhappy is not so much the tunable, > as the proliferation of mount options. > > Kirill, this issue is (not exactly but close enough) what the mount > option "huge=within_size" was supposed to be about: not wasting huge > pages on small files. I'd be much happier if you made huge_min_size > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, > and used it to govern "huge=within_size" mounts only. Well, you're right that I tried originally address the issue with huge=within_size, but this option makes much more sense for filesystem with persistent storage. For ext4, it would be pretty usable option. What you propose would change the semantics of the option and it will diverge from how it works on ext4. I guess it may have sense, taking into account that shmem/tmpfs is special, in sense that we always start with empty filesystem. If everybody agree, I'll respin the patch with single tunable that manage all huge=within_size mounts. -- Kirill A. Shutemov
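For context, this is how the existing tmpfs option under discussion is selected at mount time; a shell invocation of "mount -t tmpfs -o huge=within_size,size=1G tmpfs /mnt/thp-tmpfs" is equivalent. The mount point and size below are arbitrary examples and the call needs CAP_SYS_ADMIN.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* /mnt/thp-tmpfs must already exist. */
        if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
                  "huge=within_size,size=1G") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}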
Re: [PATCH] shmem: avoid huge pages for small files
On Tue, 18 Oct 2016, Michal Hocko wrote: > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > > [...] > > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size > > > > > > for > > > > > > in-kernel tmpfs mountpoint; > > > > > > > > > > Could you explain who might like to change the minimum value (other > > > > > than > > > > > disable the feautre for the mount point) and for what reason? > > > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > > might need to set it higher in order to balance out overhead with > > > > benefit > > > > of huge pages. > > > > > > I am not sure this is a good argument. How do a user know and what will > > > help to make that decision? Why we cannot autotune that? In other words, > > > adding new knobs just in case turned out to be a bad idea in the past. > > > > Well, I don't see a reasonable way to autotune it. We can just let > > arch-specific code to redefine it, but the argument below still stands. > > > > > > In other case, if it's known in advance that specific mount would be > > > > populated with large files, you might want to set it to zero to get huge > > > > pages allocated from the beginning. > > Do you think this is a sufficient reason to provide a tunable with such a > precision? In other words why cannot we simply start by using an > internal only limit at the huge page size for the initial transition > (with a way to disable THP altogether for a mount point) and only add a > more fine grained tunning if there ever is a real need for it with a use > case description. In other words can we be less optimistic about > tunables than we used to be in the past and often found out that those > were mistakes much later? I'm not sure whether I'm arguing in the same or the opposite direction as you, Michal, but what makes me unhappy is not so much the tunable, as the proliferation of mount options. Kirill, this issue is (not exactly but close enough) what the mount option "huge=within_size" was supposed to be about: not wasting huge pages on small files. I'd be much happier if you made huge_min_size into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable, and used it to govern "huge=within_size" mounts only. Hugh
Re: [PATCH] shmem: avoid huge pages for small files
On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote: > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > > [...] > > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > > in-kernel tmpfs mountpoint; > > > > > > > > Could you explain who might like to change the minimum value (other than > > > > disable the feautre for the mount point) and for what reason? > > > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > > might need to set it higher in order to balance out overhead with benefit > > > of huge pages. > > > > I am not sure this is a good argument. How do a user know and what will > > help to make that decision? Why we cannot autotune that? In other words, > > adding new knobs just in case turned out to be a bad idea in the past. > > Well, I don't see a reasonable way to autotune it. We can just let > arch-specific code to redefine it, but the argument below still stands. > > > > In other case, if it's known in advance that specific mount would be > > > populated with large files, you might want to set it to zero to get huge > > > pages allocated from the beginning. Do you think this is a sufficient reason to provide a tunable with such a precision? In other words why cannot we simply start by using an internal only limit at the huge page size for the initial transition (with a way to disable THP altogether for a mount point) and only add a more fine grained tunning if there ever is a real need for it with a use case description. In other words can we be less optimistic about tunables than we used to be in the past and often found out that those were mistakes much later? -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote: > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > [...] > > > > We add two handle to specify minimal file size for huge pages: > > > > > > > > - mount option 'huge_min_size'; > > > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > > in-kernel tmpfs mountpoint; > > > > > > Could you explain who might like to change the minimum value (other than > > > disable the feautre for the mount point) and for what reason? > > > > Depending on how well CPU microarchitecture deals with huge pages, you > > might need to set it higher in order to balance out overhead with benefit > > of huge pages. > > I am not sure this is a good argument. How do a user know and what will > help to make that decision? Why we cannot autotune that? In other words, > adding new knobs just in case turned out to be a bad idea in the past. Well, I don't see a reasonable way to autotune it. We can just let arch-specific code to redefine it, but the argument below still stands. > > In other case, if it's known in advance that specific mount would be > > populated with large files, you might want to set it to zero to get huge > > pages allocated from the beginning. > > Cannot we use [mf]advise for that purpose? There's no fadvise for this at the moment. We can use madvise, except that the patch makes it lower priority than the limit :P. I'll fix that. But in general, it would require change to the program which is not always desirable or even possible. -- Kirill A. Shutemov
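The madvise() route mentioned above looks like this from an application: map a tmpfs/shmem file and mark the range with MADV_HUGEPAGE. The path and sizes are examples only; the call expresses a preference rather than a guarantee, and per Documentation/vm/transhuge.txt the hint matters when the huge= policy is "advise" (and is also respected by "within_size").

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (4UL << 20)     /* 4MB: two PMD-sized units on x86-64 */

int main(void)
{
        /* Any tmpfs file works; /dev/shm is just an example location. */
        int fd = open("/dev/shm/thp-example", O_RDWR | O_CREAT, 0600);
        char *p;

        if (fd < 0 || ftruncate(fd, MAP_LEN) < 0) {
                perror("open/ftruncate");
                return 1;
        }

        p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        if (madvise(p, MAP_LEN, MADV_HUGEPAGE) < 0)     /* a hint, not a guarantee */
                perror("madvise(MADV_HUGEPAGE)");

        p[0] = 1;       /* the faulted range may now be backed by 2MB pages */

        munmap(p, MAP_LEN);
        close(fd);
        return 0;
}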
Re: [PATCH] shmem: avoid huge pages for small files
On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote: > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: [...] > > > We add two handle to specify minimal file size for huge pages: > > > > > > - mount option 'huge_min_size'; > > > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > > in-kernel tmpfs mountpoint; > > > > Could you explain who might like to change the minimum value (other than > > disable the feautre for the mount point) and for what reason? > > Depending on how well CPU microarchitecture deals with huge pages, you > might need to set it higher in order to balance out overhead with benefit > of huge pages. I am not sure this is a good argument. How do a user know and what will help to make that decision? Why we cannot autotune that? In other words, adding new knobs just in case turned out to be a bad idea in the past. > In other case, if it's known in advance that specific mount would be > populated with large files, you might want to set it to zero to get huge > pages allocated from the beginning. Cannot we use [mf]advise for that purpose? -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote: > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: > [...] > > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 > > From: "Kirill A. Shutemov" > > Date: Mon, 17 Oct 2016 14:44:47 +0300 > > Subject: [PATCH] shmem: avoid huge pages for small files > > > > Huge pages are detrimental for small file: they causes noticible > > overhead on both allocation performance and memory footprint. > > > > This patch aimed to address this issue by avoiding huge pages until file > > grown to specified size. This would cover most of the cases where huge > > pages causes regressions in performance. > > > > By default the minimal file size to allocate huge pages is equal to size > > of huge page. > > ok > > > We add two handle to specify minimal file size for huge pages: > > > > - mount option 'huge_min_size'; > > > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > > in-kernel tmpfs mountpoint; > > Could you explain who might like to change the minimum value (other than > disable the feautre for the mount point) and for what reason? Depending on how well CPU microarchitecture deals with huge pages, you might need to set it higher in order to balance out overhead with benefit of huge pages. In other case, if it's known in advance that specific mount would be populated with large files, you might want to set it to zero to get huge pages allocated from the beginning. > > @@ -238,6 +238,12 @@ values: > >- "force": > > Force the huge option on for all - very useful for testing; > > > > +Tehre's limit on minimal file size before kenrel starts allocate huge > > +pages for it. By default it's size of huge page. > > Smoe tyopse Wlil fxi! -- Kirill A. Shutemov
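For completeness, this is how the two knobs proposed in this patch would be used. Note that they exist only with this (unmerged) patch applied: /sys/kernel/mm/transparent_hugepage/shmem_min_size and the huge_min_size= mount option are not part of mainline kernels, so the sketch below is purely illustrative of the proposed interface; paths and values are examples.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* Proposed sysfs knob for the in-kernel tmpfs mount: raise the limit to 4MB. */
        int fd = open("/sys/kernel/mm/transparent_hugepage/shmem_min_size",
                      O_WRONLY);
        if (fd >= 0) {
                const char *val = "4194304";
                if (write(fd, val, strlen(val)) < 0)
                        perror("write shmem_min_size");
                close(fd);
        } else {
                perror("open shmem_min_size (patch not applied?)");
        }

        /* Proposed per-mount variant: no huge pages until a file reaches 1MB. */
        if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
                  "huge=always,huge_min_size=1M") < 0)
                perror("mount huge_min_size");
        return 0;
}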
Re: [PATCH] shmem: avoid huge pages for small files
On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote: [...] > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" > Date: Mon, 17 Oct 2016 14:44:47 +0300 > Subject: [PATCH] shmem: avoid huge pages for small files > > Huge pages are detrimental for small file: they causes noticible > overhead on both allocation performance and memory footprint. > > This patch aimed to address this issue by avoiding huge pages until file > grown to specified size. This would cover most of the cases where huge > pages causes regressions in performance. > > By default the minimal file size to allocate huge pages is equal to size > of huge page. ok > We add two handle to specify minimal file size for huge pages: > > - mount option 'huge_min_size'; > > - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for > in-kernel tmpfs mountpoint; Could you explain who might like to change the minimum value (other than disable the feautre for the mount point) and for what reason? [...] > @@ -238,6 +238,12 @@ values: >- "force": > Force the huge option on for all - very useful for testing; > > +Tehre's limit on minimal file size before kenrel starts allocate huge > +pages for it. By default it's size of huge page. Smoe tyopse -- Michal Hocko SUSE Labs
Re: [PATCH] shmem: avoid huge pages for small files
On Mon, Oct 17, 2016 at 03:18:09PM +0300, Kirill A. Shutemov wrote: > diff --git a/mm/shmem.c b/mm/shmem.c > index ad7813d73ea7..c69047386e2f 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space > *mapping, > /* ifdef here to avoid bloating shmem.o when not necessary */ > > int shmem_huge __read_mostly; > +unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly; Arghh.. Last second changes... This should be unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE; >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Mon, 17 Oct 2016 14:44:47 +0300 Subject: [PATCH] shmem: avoid huge pages for small files Huge pages are detrimental for small file: they causes noticible overhead on both allocation performance and memory footprint. This patch aimed to address this issue by avoiding huge pages until file grown to specified size. This would cover most of the cases where huge pages causes regressions in performance. By default the minimal file size to allocate huge pages is equal to size of huge page. We add two handle to specify minimal file size for huge pages: - mount option 'huge_min_size'; - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel tmpfs mountpoint; Few notes: - if shmem_enabled is set to 'force', the limit is ignored. We still want to generate as many pages as possible for functional testing. - the limit doesn't affect khugepaged behaviour: it still can collapse pages based on its settings; - remount of the filesystem doesn't affect previously allocated pages, but the limit is applied for new allocations; Signed-off-by: Kirill A. Shutemov --- Documentation/vm/transhuge.txt | 6 + include/linux/huge_mm.h| 1 + include/linux/shmem_fs.h | 1 + mm/huge_memory.c | 1 + mm/shmem.c | 56 ++ 5 files changed, 60 insertions(+), 5 deletions(-) diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt index 2ec6adb5a4ce..40006d193687 100644 --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -238,6 +238,12 @@ values: - "force": Force the huge option on for all - very useful for testing; +Tehre's limit on minimal file size before kenrel starts allocate huge +pages for it. By default it's size of huge page. + +You can adjust the limit using "huge_min_size=" mount option or +/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount. 
+ == Need of application restart == The transparent_hugepage/enabled values and tmpfs mount option only affect diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 9b9f65d99873..515b96a5a592 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf, enum transparent_hugepage_flag flag); extern struct kobj_attribute shmem_enabled_attr; +extern struct kobj_attribute shmem_min_size_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1huge_min_size && + index < (sbinfo->huge_min_size >> PAGE_SHIFT)) + goto alloc_nohuge; switch (sbinfo->huge) { - loff_t i_size; pgoff_t off; case SHMEM_HUGE_NEVER: goto alloc_nohuge; case SHMEM_HUGE_WITHIN_SIZE: off = round_up(index, HPAGE_PMD_NR); - i_size = round_up(i_size_read(inode), PAGE_SIZE); + i_size = round_up(i_size, PAGE_SIZE); if (i_size >= HPAGE_PMD_SIZE && i_size >> PAGE_SHIFT >= off) goto alloc_huge; @@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, huge != SHMEM_HUGE_NEVER) goto bad_val; sbinfo->huge = huge; + } else if (!strcmp(this_char, "huge_min_size")) { + sbinfo->huge_min_size = memparse(value, &rest); + if (*rest) + goto bad_val; #endif #ifdef CONFIG_NUMA } else if (!strcmp