Re: [LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-02-01 Thread Kirill A. Shutemov
On Wed, Jan 31, 2018 at 12:42:45PM -0500, Jerome Glisse wrote:
> The overall idea i have is that in any place in the kernel (except memory
> reclaim but that's ok) we can either get mapping or buffer_head information
> without relying on struct page and if we have either one and a struct page
> then we can find the other one.

Why is it okay for reclaim?

And what about physical memory scanners that don't have any side information
about the page they step onto?

-- 
 Kirill A. Shutemov


Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 01:55:05PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:50PM +0300, Kirill A. Shutemov wrote:
> > +++ b/mm/filemap.c
> > @@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file 
> > *filp, loff_t *ppos,
> > if (unlikely(page == NULL))
> > goto no_cached_page;
> > }
> > +   page = compound_head(page);
> 
> We got this page from find_get_page(), which gets it from
> pagecache_get_page(), which gets it from find_get_entry() ... which
> (unless I'm lost in your patch series) returns the head page.  So this
> line is redundant, right?

No. pagecache_get_page() returns subpage. See description of the first
patch.
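
(Illustrative restatement only, using the calls from the hunk above: with this
series the lookup can hand back a tail page, so a caller that wants to work on
the whole THP goes to the head itself.)

	struct page *page = find_get_page(mapping, index);

	if (page)
		page = compound_head(page);	/* head covers the whole THP */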

> But then down in filemap_fault, we have:
> 
> VM_BUG_ON_PAGE(page->index != offset, page);
> 
> ... again, maybe I'm lost somewhere in your patch series, but I don't see
> anywhere you remove that line (or modify it).

This should be fine as find_get_page() returns subpage.

> So are you not testing
> with VM debugging enabled, or are you not doing a test which includes
> mapping a file with huge pages, reading from it (to get the page in cache),
> then faulting on an address that is not in the first 4kB of that 2MB?
> 

-- 
 Kirill A. Shutemov


Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote:
> > Later we can add logic to accumulate information from shadow entries to
> > return to the caller (average eviction time?).
> 
> I would say minimum rather than average.  That will become the refault
> time of the entire page, so minimum would probably have us making better
> decisions?

Yes, makes sense.

> > +   /* Wipe shadow entries */
> > +   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
> > +   page->index) {
> > +   if (iter.index >= page->index + hpage_nr_pages(page))
> > +   break;
> >  
> > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
> > -   if (!radix_tree_exceptional_entry(p))
> > +   if (!p)
> > +   continue;
> 
> Just FYI, this can't happen.  You're holding the tree lock so nobody
> else gets to remove things from the tree.  radix_tree_for_each_slot()
> only gives you the full slots; it skips the empty ones for you.  I'm
> OK if you want to leave it in out of an abundance of caution.

I'll drop it.

> > +   __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
> > +   workingset_update_node, mapping);
> 
> I may add an update_node argument to radix_tree_join at some point,
> so you can use it here.  Or maybe we don't need to do that, and what
> you have here works just fine.
> 
> > mapping->nrexceptional--;
> 
> ... because adjusting the exceptional count is going to be a pain.

Yeah..

> > +   error = __radix_tree_insert(&mapping->page_tree,
> > +   page->index, compound_order(page), page);
> > +   /* This shouldn't happen */
> > +   if (WARN_ON_ONCE(error))
> > +   return error;
> 
> A lesser man would have just ignored the return value from
> __radix_tree_insert.  I salute you.
> 
> > @@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, 
> > pgoff_t offset, gfp_t gfp_mask)
> >  {
> > struct address_space *mapping = file->f_mapping;
> > struct page *page;
> > +   pgoff_t hoffset;
> > int ret;
> >  
> > do {
> > -   page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> > +   page = page_cache_alloc_huge(mapping, offset, gfp_mask);
> > +no_huge:
> > +   if (!page)
> > +   page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> > if (!page)
> > return -ENOMEM;
> >  
> > -   ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & 
> > GFP_KERNEL);
> > -   if (ret == 0)
> > +   if (PageTransHuge(page))
> > +   hoffset = round_down(offset, HPAGE_PMD_NR);
> > +   else
> > +   hoffset = offset;
> > +
> > +   ret = add_to_page_cache_lru(page, mapping, hoffset,
> > +   gfp_mask & GFP_KERNEL);
> > +
> > +   if (ret == -EEXIST && PageTransHuge(page)) {
> > +   put_page(page);
> > +   page = NULL;
> > +   goto no_huge;
> > +   } else if (ret == 0) {
> > ret = mapping->a_ops->readpage(file, page);
> > -   else if (ret == -EEXIST)
> > +   } else if (ret == -EEXIST) {
> > ret = 0; /* losing race to add is OK */
> > +   }
> >  
> > put_page(page);
> 
> If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop
> again trying the huge page again, even if the huge page didn't work
> the first time.  I would tend to think that if the huge page failed the
> first time, we shouldn't try it again, so I propose this:

AOP_TRUNCATED_PAGE is positive, so I don't see how this avoids trying the
huge page again on the second iteration. Hm?

> 
> struct address_space *mapping = file->f_mapping;
> struct page *page;
> pgoff_t index;
> int ret;
> bool try_huge = true;
> 
> do {
> if (try_huge) {
> page = page_cache_alloc_huge(gfp_mask|__GFP_COLD);
> if (page)
> index = round_down(offset, HPAGE_PMD_NR);
> else
> try_huge = false;
> }
> 
> if (!try_huge) {
> page = __page_cache_alloc(gfp_mask|__GFP_COLD);
>  
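
To make the concern concrete, a minimal sketch (not the posted patch;
page_cache_alloc_huge() and hoffset follow this series' naming, and the
-EEXIST fallback for the huge case is elided) where try_huge is cleared
unconditionally after the first pass, so an AOP_TRUNCATED_PAGE retry stays on
small pages:

	bool try_huge = true;
	int ret;

	do {
		struct page *page = NULL;
		pgoff_t hoffset = offset;

		if (try_huge) {
			page = page_cache_alloc_huge(mapping, offset, gfp_mask);
			try_huge = false;		/* one huge attempt only */
		}
		if (page)
			hoffset = round_down(offset, HPAGE_PMD_NR);
		else
			page = __page_cache_alloc(gfp_mask | __GFP_COLD);
		if (!page)
			return -ENOMEM;

		ret = add_to_page_cache_lru(page, mapping, hoffset,
				gfp_mask & GFP_KERNEL);
		if (ret == 0)
			ret = mapping->a_ops->readpage(file, page);
		else if (ret == -EEXIST)
			ret = 0;	/* lost the race to add; OK */

		put_page(page);
	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;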

Re: [PATCHv6 03/37] page-flags: relax page flag policy for few flags

2017-02-13 Thread Kirill A. Shutemov
On Wed, Feb 08, 2017 at 08:01:13PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:45PM +0300, Kirill A. Shutemov wrote:
> > These flags are in use for filesystems with backing storage: PG_error,
> > PG_writeback and PG_readahead.
> 
> Oh ;-)  Then I amend my comment on patch 1 to be "patch 3 needs to go
> ahead of patch 1" ;-)

It doesn't really matter as long as both are before patch 37 :P

-- 
 Kirill A. Shutemov


Re: [PATCHv6 01/37] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 07:58:20PM +0300, Kirill A. Shutemov wrote:
> I'll look into it.

I ended up with this (I'll test it more later):

void filemap_map_pages(struct vm_fault *vmf,
		pgoff_t start_pgoff, pgoff_t end_pgoff)
{
	struct radix_tree_iter iter;
	void **slot;
	struct file *file = vmf->vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	pgoff_t last_pgoff = start_pgoff;
	loff_t size;
	struct page *page;
	bool mapped;

	rcu_read_lock();
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
			start_pgoff) {
		unsigned long index = iter.index;
		if (index < start_pgoff)
			index = start_pgoff;
		if (index > end_pgoff)
			break;
repeat:
		page = radix_tree_deref_slot(slot);
		if (unlikely(!page))
			continue;
		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				slot = radix_tree_iter_retry(&iter);
			continue;
		}

		if (!page_cache_get_speculative(page))
			goto repeat;

		/* Has the page moved? */
		if (unlikely(page != *slot)) {
			put_page(page);
			goto repeat;
		}

		/* For multi-order entries, find relevant subpage */
		page = find_subpage(page, index);

		if (!PageUptodate(page) || PageReadahead(page))
			goto skip;
		if (!trylock_page(page))
			goto skip;

		if (page_mapping(page) != mapping || !PageUptodate(page))
			goto skip_unlock;

		size = round_up(i_size_read(mapping->host), PAGE_SIZE);
		if (compound_head(page)->index >= size >> PAGE_SHIFT)
			goto skip_unlock;

		if (file->f_ra.mmap_miss > 0)
			file->f_ra.mmap_miss--;
map_next_subpage:
		if (PageHWPoison(page))
			goto next;

		vmf->address += (index - last_pgoff) << PAGE_SHIFT;
		if (vmf->pte)
			vmf->pte += index - last_pgoff;
		last_pgoff = index;
		mapped = !alloc_set_pte(vmf, NULL, page);

		/* Huge page is mapped or last index? No need to proceed. */
		if (pmd_trans_huge(*vmf->pmd) ||
				index == end_pgoff) {
			unlock_page(page);
			break;
		}
next:
		if (page && PageCompound(page)) {
			/* Last subpage handled? */
			if ((index & (compound_nr_pages(page) - 1)) ==
					compound_nr_pages(page) - 1)
				goto skip_unlock;
			index++;
			page++;

			/*
			 * One page reference goes to page table mapping.
			 * Need additional reference, if last alloc_set_pte()
			 * succeed.
			 */
			if (mapped)
				get_page(page);
			goto map_next_subpage;
		}
skip_unlock:
		unlock_page(page);
skip:
		iter.index = compound_head(page)->index +
			compound_nr_pages(page) - 1;
		/* Only give up reference if alloc_set_pte() failed. */
		if (!mapped)
			put_page(page);
	}
	rcu_read_unlock();
}

-- 
 Kirill A. Shutemov


Re: [PATCHv6 01/37] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2017-02-09 Thread Kirill A. Shutemov
On Wed, Feb 08, 2017 at 07:57:27PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:43PM +0300, Kirill A. Shutemov wrote:
> > +++ b/include/linux/pagemap.h
> > @@ -332,6 +332,15 @@ static inline struct page 
> > *grab_cache_page_nowait(struct address_space *mapping,
> > mapping_gfp_mask(mapping));
> >  }
> >  
> > +static inline struct page *find_subpage(struct page *page, pgoff_t offset)
> > +{
> > +   VM_BUG_ON_PAGE(PageTail(page), page);
> > +   VM_BUG_ON_PAGE(page->index > offset, page);
> > +   VM_BUG_ON_PAGE(page->index + (1 << compound_order(page)) < offset,
> > +   page);
> > +   return page - page->index + offset;
> > +}
> 
> What would you think to:
> 
> static inline void check_page_index(struct page *page, pgoff_t offset)
> {
>   VM_BUG_ON_PAGE(PageTail(page), page);
>   VM_BUG_ON_PAGE(page->index > offset, page);
>   VM_BUG_ON_PAGE(page->index + (1 << compound_order(page)) <= offset,
>   page);
> }
> 
> (I think I fixed an off-by-one error up there ...  if
> index + (1 << order) == offset, this is also a bug, right?
> because offset would then refer to the next page, not this page)

Right, thanks.

> 
> static inline struct page *find_subpage(struct page *page, pgoff_t offset)
> {
>   check_page_index(page, offset);
>   return page + (offset - page->index);
> }
> 
> ... then you can use check_page_index down ...

Okay, makes sense.

> 
> > @@ -1250,7 +1233,6 @@ struct page *find_lock_entry(struct address_space 
> > *mapping, pgoff_t offset)
> > put_page(page);
> > goto repeat;
> > }
> > -   VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
> 
> ... here?

Ok.

> > @@ -1472,25 +1451,35 @@ unsigned find_get_pages(struct address_space 
> > *mapping, pgoff_t start,
> > goto repeat;
> > }
> >  
> > +   /* For multi-order entries, find relevant subpage */
> > +   if (PageTransHuge(page)) {
> > +   VM_BUG_ON(index - page->index < 0);
> > +   VM_BUG_ON(index - page->index >= HPAGE_PMD_NR);
> > +   page += index - page->index;
> > +   }
> 
> Use find_subpage() here?

Ok.

> > pages[ret] = page;
> > if (++ret == nr_pages)
> > break;
> > +   if (!PageTransCompound(page))
> > +   continue;
> > +   for (refs = 0; ret < nr_pages &&
> > +   (index + 1) % HPAGE_PMD_NR;
> > +   ret++, refs++, index++)
> > +   pages[ret] = ++page;
> > +   if (refs)
> > +   page_ref_add(compound_head(page), refs);
> > +   if (ret == nr_pages)
> > +   break;
> 
> Can we avoid referencing huge pages specifically in the page cache?  I'd
> like us to get to the point where we can put arbitrary compound pages into
> the page cache.  For example, I think this can be written as:
> 
>   if (!PageCompound(page))
>   continue;
>   for (refs = 0; ret < nr_pages; refs++, index++) {
>   if (index > page->index + (1 << compound_order(page)))
>   break;
>   pages[ret++] = ++page;
>   }
>   if (refs)
>   page_ref_add(compound_head(page), refs);
>   if (ret == nr_pages)
>   break;

That's slightly more costly, but I guess that's fine.

> > @@ -1541,19 +1533,12 @@ unsigned find_get_pages_contig(struct address_space 
> > *mapping, pgoff_t index,
> >  
> > +   /* For multi-order entries, find relevant subpage */
> > +   if (PageTransHuge(page)) {
> > +   VM_BUG_ON(index - page->index < 0);
> > +   VM_BUG_ON(index - page->index >= HPAGE_PMD_NR);
> > +   page += index - page->index;
> > +   }
> > +
> > pages[ret] = page;
> > if (++ret == nr_pages)
> > break;
> > +   if (!PageTransCompound(page))
> > +   continue;
> > +   for (refs = 0; ret < nr_pages &&
> > +   (index + 1) % HPAGE_PMD_NR;
> > +   ret++, refs++, index++

Re: [PATCHv6 06/37] thp: handle write-protection faults for file THP

2017-01-26 Thread Kirill A. Shutemov
On Thu, Jan 26, 2017 at 07:44:39AM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:48PM +0300, Kirill A. Shutemov wrote:
> > For filesystems that want to be write-notified (have mkwrite), we will
> > encounter write-protection faults for huge PMDs in shared mappings.
> > 
> > The easiest way to handle them is to clear the PMD and let it refault as
> > writable.
> 
> ... of course, the filesystem could implement ->pmd_fault, and then it
> wouldn't hit this case ...

I would rather get rid of ->pmd_fault/->huge_fault :)

->fault should be flexible enough to provide for all of them...
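
(For illustration only -- a minimal sketch of the "clear the PMD and let it
refault" idea against a roughly 4.10-era fault path; the function name is made
up and this is not the posted patch.)

	static int wp_huge_file_pmd(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		spinlock_t *ptl = pmd_lock(vma->vm_mm, vmf->pmd);

		/* Zap the read-only PMD; the retried fault goes through
		 * ->fault and ->page_mkwrite and can map it back writable. */
		if (pmd_trans_huge(*vmf->pmd))
			pmdp_huge_clear_flush(vma,
					vmf->address & HPAGE_PMD_MASK,
					vmf->pmd);
		spin_unlock(ptl);
		return 0;
	}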

-- 
 Kirill A. Shutemov


[PATCHv6 04/37] mm, rmap: account file thp pages

2017-01-26 Thread Kirill A. Shutemov
Let's add FileHugePages and FilePmdMapped fields to meminfo and smaps.
They indicate how much file-backed THP is allocated and PMD-mapped.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 drivers/base/node.c|  6 ++
 fs/proc/meminfo.c  |  4 
 fs/proc/task_mmu.c |  5 -
 include/linux/mmzone.h |  2 ++
 mm/filemap.c   |  3 ++-
 mm/huge_memory.c   |  5 -
 mm/page_alloc.c|  5 +
 mm/rmap.c  | 12 
 mm/vmstat.c|  2 ++
 9 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..45be0ddb84ed 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,6 +116,8 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d AnonHugePages:  %8lu kB\n"
   "Node %d ShmemHugePages: %8lu kB\n"
   "Node %d ShmemPmdMapped: %8lu kB\n"
+  "Node %d FileHugePages: %8lu kB\n"
+  "Node %d FilePmdMapped: %8lu kB\n"
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -139,6 +141,10 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
   HPAGE_PMD_NR),
   nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_THPS) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
   HPAGE_PMD_NR));
 #else
   nid, K(sum_zone_node_page_state(nid, 
NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a428498d6b2..8396843be7a7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -146,6 +146,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
show_val_kb(m, "ShmemPmdMapped: ",
global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
+   show_val_kb(m, "FileHugePages: ",
+   global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR);
+   show_val_kb(m, "FilePmdMapped: ",
+   global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
 #endif
 
 #ifdef CONFIG_CMA
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8f96a49178d0..bdaf557dd953 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -442,6 +442,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long shmem_thp;
+   unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -577,7 +578,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
else if (is_zone_device_page(page))
/* pass */;
else
-   VM_BUG_ON_PAGE(1, page);
+   mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
@@ -772,6 +773,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   "Anonymous:  %8lu kB\n"
   "AnonHugePages:  %8lu kB\n"
   "ShmemPmdMapped: %8lu kB\n"
+  "FilePmdMapped:  %8lu kB\n"
   "Shared_Hugetlb: %8lu kB\n"
   "Private_Hugetlb: %7lu kB\n"
   "Swap:   %8lu kB\n"
@@ -790,6 +792,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   mss.anonymous >> 10,
   mss.anonymous_thp >> 10,
   mss.shmem_thp >> 10,
+  mss.file_thp >> 10,
   mss.shared_hugetlb >> 10,
   mss.private_hugetlb >> 10,
   mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 36d9896fbc1e..a29f6a9aefe4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,8 @@ enum node_stat_item {
NR_SHMEM,   /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
+   NR_FILE_THPS,
+   NR_FILE_PMDMAPPED,
NR_ANON_THPS,
NR_UNSTABLE_NFS,/* NFS unstable pages */
NR_VMSCAN_WRITE,
diff --git a/mm/filemap.c b/mm/filemap.c
index 837a71a2a412..5c8d912e891d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -240,7 +240,8 @@ void __delete_from_page_cache(struct page *page, void 
*shadow)

[PATCHv6 22/37] mm, hugetlb: switch hugetlbfs to multi-order radix-tree entries

2017-01-26 Thread Kirill A. Shutemov
From: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>

Currently, hugetlb pages are linked to the page cache on the basis of hugepage
offset (derived from vma_hugecache_offset()) for historical reasons, which
doesn't match the generic usage of the page cache and requires some routines
to convert page offset <=> hugepage offset in common paths. This patch
adjusts the code for multi-order radix-tree entries to avoid that.

The main change is in the behaviour of page->index for hugetlbfs. Before this
patch it represented the hugepage offset, but with this patch it represents
the page offset, so index-related code has to be updated.
Note that hugetlb_fault_mutex_hash() and reservation region handling still
work with hugepage offsets.
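
To illustrate the index change (values are hypothetical, for a 2MB hugetlb
page):

	loff_t off = 5UL << 20;		/* 5MB into the file */

	before: page->index == off >> huge_page_shift(h) ==    2
	after:  page->index == off >> PAGE_SHIFT         == 1280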

Signed-off-by: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
[kirill.shute...@linux.intel.com: reject fixed]
Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/hugetlbfs/inode.c| 22 ++
 include/linux/pagemap.h | 23 +++
 mm/filemap.c| 12 +---
 mm/hugetlb.c| 19 ++-
 mm/truncate.c   |  8 
 5 files changed, 28 insertions(+), 56 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 54de77e78775..d0da752ba7bc 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -388,8 +388,8 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 {
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
-   const pgoff_t start = lstart >> huge_page_shift(h);
-   const pgoff_t end = lend >> huge_page_shift(h);
+   const pgoff_t start = lstart >> PAGE_SHIFT;
+   const pgoff_t end = lend >> PAGE_SHIFT;
struct vm_area_struct pseudo_vma;
struct pagevec pvec;
pgoff_t next;
@@ -446,8 +446,7 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
 
i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(&mapping->i_mmap,
-   next * pages_per_huge_page(h),
-   (next + 1) * pages_per_huge_page(h));
+   next, next + 1);
i_mmap_unlock_write(mapping);
}
 
@@ -466,7 +465,8 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
freed++;
if (!truncate_op) {
if (unlikely(hugetlb_unreserve_pages(inode,
-   next, next + 1, 1)))
+   (next) << huge_page_order(h),
+   (next + 1) << 
huge_page_order(h), 1)))
hugetlb_fix_reserve_counts(inode);
}
 
@@ -550,8 +550,6 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
struct hstate *h = hstate_inode(inode);
struct vm_area_struct pseudo_vma;
struct mm_struct *mm = current->mm;
-   loff_t hpage_size = huge_page_size(h);
-   unsigned long hpage_shift = huge_page_shift(h);
pgoff_t start, index, end;
int error;
u32 hash;
@@ -567,8 +565,8 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
 * For this range, start is rounded down and end is rounded up
 * as well as being converted to page offsets.
 */
-   start = offset >> hpage_shift;
-   end = (offset + len + hpage_size - 1) >> hpage_shift;
+   start = offset >> PAGE_SHIFT;
+   end = (offset + len + huge_page_size(h) - 1) >> PAGE_SHIFT;
 
inode_lock(inode);
 
@@ -586,7 +584,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pseudo_vma.vm_file = file;
 
-   for (index = start; index < end; index++) {
+   for (index = start; index < end; index += pages_per_huge_page(h)) {
/*
 * This is supposed to be the vaddr where the page is being
 * faulted in, but we have no vaddr here.
@@ -607,10 +605,10 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
}
 
/* Set numa allocation policy based on index */
-   hugetlb_set_vma_policy(&pseudo_vma, inode, index);
+   hugetlb_set_vma_policy(&pseudo_vma, inode, index >> 
huge_page_order(h));
 
/* addr is the offset within the file (zero based) */
-   addr = index * hpage_size;
+   addr = index << PAGE_SHIFT & ~huge_page_mask(h);
 
/* mutex taken here, fault path and hole

[PATCHv6 09/37] filemap: allocate huge page in pagecache_get_page(), if allowed

2017-01-26 Thread Kirill A. Shutemov
The write path allocates pages using pagecache_get_page(). We should be able
to allocate huge pages there, if allowed. As usual, fall back to small pages
if that fails.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cba69176ea9..4e398d5e4134 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1374,13 +1374,16 @@ struct page *pagecache_get_page(struct address_space 
*mapping, pgoff_t offset,
 
 no_page:
if (!page && (fgp_flags & FGP_CREAT)) {
+   pgoff_t hoffset;
int err;
if ((fgp_flags & FGP_WRITE) && 
mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
gfp_mask &= ~__GFP_FS;
 
-   page = __page_cache_alloc(gfp_mask);
+   page = page_cache_alloc_huge(mapping, offset, gfp_mask);
+no_huge:   if (!page)
+   page = __page_cache_alloc(gfp_mask);
if (!page)
return NULL;
 
@@ -1391,9 +1394,19 @@ struct page *pagecache_get_page(struct address_space 
*mapping, pgoff_t offset,
if (fgp_flags & FGP_ACCESSED)
__SetPageReferenced(page);
 
-   err = add_to_page_cache_lru(page, mapping, offset,
+   if (PageTransHuge(page))
+   hoffset = round_down(offset, HPAGE_PMD_NR);
+   else
+   hoffset = offset;
+
+   err = add_to_page_cache_lru(page, mapping, hoffset,
gfp_mask & GFP_RECLAIM_MASK);
if (unlikely(err)) {
+   if (PageTransHuge(page)) {
+   put_page(page);
+   page = NULL;
+   goto no_huge;
+   }
put_page(page);
page = NULL;
if (err == -EEXIST)
-- 
2.11.0



[PATCHv6 25/37] ext4: make ext4_writepage() work on huge pages

2017-01-26 Thread Kirill A. Shutemov
Change ext4_writepage() and underlying ext4_bio_write_page().

It basically removes the assumption on page size and infers it from struct
page instead.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c   | 10 +-
 fs/ext4/page-io.c | 11 +--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 88d57af1b516..8d1b5e63cb15 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2036,10 +2036,10 @@ static int ext4_writepage(struct page *page,
 
trace_ext4_writepage(page);
size = i_size_read(inode);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
 
page_bufs = page_buffers(page);
/*
@@ -2063,7 +2063,7 @@ static int ext4_writepage(struct page *page,
   ext4_bh_delay_or_unwritten)) {
redirty_page_for_writepage(wbc, page);
if ((current->flags & PF_MEMALLOC) ||
-   (inode->i_sb->s_blocksize == PAGE_SIZE)) {
+   (inode->i_sb->s_blocksize == hpage_size(page))) {
/*
 * For memory cleaning there's no point in writing only
 * some buffers. So just bail out. Warn if we came here
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index d83b0f3c5fe9..360c74daec5c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -413,6 +413,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
+   BUG_ON(PageTail(page));
 
if (keep_towrite)
set_page_writeback_keepwrite(page);
@@ -429,8 +430,14 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 * the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   if (len < PAGE_SIZE)
-   zero_user_segment(page, len, PAGE_SIZE);
+   if (len < hpage_size(page)) {
+   page += len / PAGE_SIZE;
+   if (len % PAGE_SIZE)
+   zero_user_segment(page, len % PAGE_SIZE, PAGE_SIZE);
+   while (page + 1 == compound_head(page))
+   clear_highpage(++page);
+   page = compound_head(page);
+   }
/*
 * In the first loop we prepare and mark buffers to submit. We have to
 * mark all buffers in the page before submitting so that
-- 
2.11.0



[PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed

2017-01-26 Thread Kirill A. Shutemov
This patch adds basic functionality to put a huge page into the page cache.

At the moment we only put huge pages into the radix-tree if the range covered
by the huge page is empty.

We ignore shadow entries for now, just remove them from the tree before
inserting the huge page.

Later we can add logic to accumulate information from shadow entries to
return to the caller (average eviction time?).

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/fs.h  |   5 ++
 include/linux/pagemap.h |  21 ++-
 mm/filemap.c| 155 ++--
 3 files changed, 147 insertions(+), 34 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2ba074328894..dd858a858203 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1806,6 +1806,11 @@ struct super_operations {
 #else
 #define S_DAX  0   /* Make all the DAX code disappear */
 #endif
+#define S_HUGE_MODE		0xc000
+#define S_HUGE_NEVER		0x0000
+#define S_HUGE_ALWAYS		0x4000
+#define S_HUGE_WITHIN_SIZE	0x8000
+#define S_HUGE_ADVISE		0xc000
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad63a7be5a5e..9a93b9c3d662 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -201,14 +201,20 @@ static inline int page_cache_add_speculative(struct page 
*page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc_order(gfp_t gfp,
+   unsigned int order)
 {
-   return alloc_pages(gfp, 0);
+   return alloc_pages(gfp, order);
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+   return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
return __page_cache_alloc(mapping_gfp_mask(x));
@@ -225,6 +231,15 @@ static inline gfp_t readahead_gfp_mask(struct 
address_space *x)
  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
 }
 
+extern bool __page_cache_allow_huge(struct address_space *x, pgoff_t offset);
+static inline bool page_cache_allow_huge(struct address_space *x,
+   pgoff_t offset)
+{
+   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+   return false;
+   return __page_cache_allow_huge(x, offset);
+}
+
 typedef int filler_t(void *, struct page *);
 
 pgoff_t page_cache_next_hole(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c8d912e891d..301327685a71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -113,37 +113,50 @@
 static int page_cache_tree_insert(struct address_space *mapping,
  struct page *page, void **shadowp)
 {
-   struct radix_tree_node *node;
-   void **slot;
+   struct radix_tree_iter iter;
+   void **slot, *p;
int error;
 
-   error = __radix_tree_create(&mapping->page_tree, page->index, 0,
-   &node, &slot);
-   if (error)
-   return error;
-   if (*slot) {
-   void *p;
+   /* Wipe shadow entries */
+   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+   page->index) {
+   if (iter.index >= page->index + hpage_nr_pages(page))
+   break;
 
p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
-   if (!radix_tree_exceptional_entry(p))
+   if (!p)
+   continue;
+
+   if (!radix_tree_exception(p))
return -EEXIST;
 
+   __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
+   workingset_update_node, mapping);
+
mapping->nrexceptional--;
-   if (!dax_mapping(mapping)) {
-   if (shadowp)
-   *shadowp = p;
-   } else {
+   if (dax_mapping(mapping)) {
/* DAX can replace empty locked entry with a hole */
WARN_ON_ONCE(p !=
dax_radix_locked_entry(0, RADIX_DAX_EMPTY));
/* Wakeup waiters for exceptional entry lock */
dax_wake_mapping_entry_waiter(mapping, page->index, p,
  true);
+   } else if (!PageTransHuge(page) && shadowp) {
+   *shadowp = p;
}
}
-   __radix_tree_replace(&mapping->page_tree, node, slot, page,
-workingset_update_node, mapping);
-   mapping->nrpages++;
+
+   error = __radix_tr

[PATCHv6 10/37] filemap: handle huge pages in filemap_fdatawait_range()

2017-01-26 Thread Kirill A. Shutemov
We write back the whole huge page at a time.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e398d5e4134..f5cd654b3662 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -405,9 +405,14 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
if (page->index > end)
continue;
 
+   page = compound_head(page);
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR;
+   i += index - pvec.pages[i]->index - 1;
+   }
}
pagevec_release(&pvec);
cond_resched();
-- 
2.11.0



[PATCHv6 16/37] thp: make thp_get_unmapped_area() respect S_HUGE_MODE

2017-01-26 Thread Kirill A. Shutemov
We want mmap(NULL) to return a PMD-aligned address if the inode can have
huge pages in the page cache.
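
(For illustration, a userspace check of the intended effect; the path is
hypothetical and error handling is omitted.)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* hypothetical file on a filesystem with huge pages enabled */
		int fd = open("/mnt/test/file", O_RDONLY);
		void *p = mmap(NULL, 16UL << 20, PROT_READ, MAP_SHARED, fd, 0);

		printf("addr %p, 2MB-aligned: %d\n", p,
		       !((unsigned long)p & ((1UL << 21) - 1)));
		return 0;
	}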

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55aee62e8444..2b1d8d13e2c3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -528,10 +528,12 @@ unsigned long thp_get_unmapped_area(struct file *filp, 
unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
 {
loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+   struct inode *inode = filp->f_mapping->host;
 
if (addr)
goto out;
-   if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
+   if ((inode->i_flags & S_HUGE_MODE) == S_HUGE_NEVER &&
+   (!IS_DAX(inode) || !IS_ENABLED(CONFIG_FS_DAX_PMD)))
goto out;
 
addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
-- 
2.11.0



[PATCHv6 26/37] ext4: handle huge pages in ext4_page_mkwrite()

2017-01-26 Thread Kirill A. Shutemov
Trivial: remove assumption on page size.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8d1b5e63cb15..a25be1cf4506 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5778,7 +5778,7 @@ static int ext4_bh_unmapped(handle_t *handle, struct 
buffer_head *bh)
 
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
loff_t size;
unsigned long len;
int ret;
@@ -5814,10 +5814,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
goto out;
}
 
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
/*
 * Return if we have all the buffers mapped. This avoids the need to do
 * journal_start/journal_stop which can block and take a long time
@@ -5848,7 +5848,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
ret = block_page_mkwrite(vma, vmf, get_block);
if (!ret && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
- PAGE_SIZE, NULL, do_journal_get_write_access)) {
+ hpage_size(page), NULL,
+ do_journal_get_write_access)) {
unlock_page(page);
ret = VM_FAULT_SIGBUS;
ext4_journal_stop(handle);
-- 
2.11.0



[PATCHv6 15/37] thp: do not treat slab pages as huge in hpage_{nr_pages,size,mask}

2017-01-26 Thread Kirill A. Shutemov
Slab pages can be compound, but we shouldn't treat them as THP for the
purpose of the hpage_* helpers, otherwise it would lead to confusing results.

For instance, ext4 uses slab pages for journal pages and we shouldn't
confuse them with THPs. The easiest way is to exclude them in hpage_*
helpers.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e5c9c26d2439..5e6c408f5b47 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -137,21 +137,21 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 }
 static inline int hpage_nr_pages(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_NR;
return 1;
 }
 
 static inline int hpage_size(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_SIZE;
return PAGE_SIZE;
 }
 
 static inline unsigned long hpage_mask(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_MASK;
return PAGE_MASK;
 }
-- 
2.11.0



[PATCHv6 11/37] HACK: readahead: alloc huge pages, if allowed

2017-01-26 Thread Kirill A. Shutemov
Most page cache allocation happens via readahead (sync or async), so if
we want to have a significant number of huge pages in the page cache we
need to find a way to allocate them from readahead.

Unfortunately, huge pages don't fit into the current readahead design:
the 128 max readahead window, assumptions on page size, PageReadahead()
to track hit/miss.

I haven't found a way to get it right yet.

This patch just allocates a huge page if allowed, but doesn't really
provide any readahead if a huge page is allocated. We read out 2M at a
time and I would expect spikes in latency without readahead.

Therefore HACK.

Having said that, I don't think it should prevent huge page support from
being applied. The future will show whether lacking readahead is a big
deal with huge pages in the page cache.

Any suggestions are welcome.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/readahead.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca70239233..289527a06254 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,6 +174,21 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
if (page_offset > end_index)
break;
 
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
+   (!page_idx || !(page_offset % HPAGE_PMD_NR)) &&
+   page_cache_allow_huge(mapping, page_offset)) {
+   page = __page_cache_alloc_order(gfp_mask | __GFP_COMP,
+   HPAGE_PMD_ORDER);
+   if (page) {
+   prep_transhuge_page(page);
+   page->index = round_down(page_offset,
+   HPAGE_PMD_NR);
+   list_add(&page->lru, &page_pool);
+   ret++;
+   goto start_io;
+   }
+   }
+
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
@@ -189,7 +204,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
SetPageReadahead(page);
ret++;
}
-
+start_io:
/*
 * Now start the IO.  We ignore I/O errors - if the page is not
 * uptodate then the caller will launch readpage again, and
-- 
2.11.0



[PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-01-26 Thread Kirill A. Shutemov
Most of the work happens on the head page. Only when we need to copy data to
userspace do we find the relevant subpage.

We are still limited by PAGE_SIZE per iteration. Lifting this limitation
would require some more work.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 301327685a71..6cba69176ea9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (unlikely(page == NULL))
goto no_cached_page;
}
+   page = compound_head(page);
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -1967,7 +1968,8 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 * now we can copy it to user space...
 */
 
-   ret = copy_page_to_iter(page, offset, nr, iter);
+   ret = copy_page_to_iter(page + index - page->index, offset,
+   nr, iter);
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
@@ -2385,6 +2387,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 * because there really aren't any performance issues here
 * and we need to check for errors.
 */
+   page = compound_head(page);
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
if (!error) {
-- 
2.11.0



[PATCHv6 17/37] fs: make block_read_full_page() be able to read huge page

2017-01-26 Thread Kirill A. Shutemov
The approach is straightforward: for compound pages we read out the whole
huge page.

For a huge page we cannot have the array of buffer_head pointers on the
stack -- it's 4096 pointers on x86-64 -- so 'arr' is allocated with
kmalloc() for huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 22 +-
 include/linux/buffer_head.h |  9 +
 include/linux/page-flags.h  |  2 +-
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0e87401cf335..72462beca909 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -871,7 +871,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
 
 try_again:
head = NULL;
-   offset = PAGE_SIZE;
+   offset = hpage_size(page);
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
@@ -1466,7 +1466,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
bh->b_page = page;
-   BUG_ON(offset >= PAGE_SIZE);
+   BUG_ON(offset >= hpage_size(page));
if (PageHighMem(page))
/*
 * This catches illegal uses and preserves the offset:
@@ -2280,11 +2280,13 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
 {
struct inode *inode = page->mapping->host;
sector_t iblock, lblock;
-   struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+   struct buffer_head *arr_on_stack[MAX_BUF_PER_PAGE];
+   struct buffer_head *bh, *head, **arr = arr_on_stack;
unsigned int blocksize, bbits;
int nr, i;
int fully_mapped = 1;
 
+   VM_BUG_ON_PAGE(PageTail(page), page);
head = create_page_buffers(page, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
@@ -2295,6 +2297,11 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
nr = 0;
i = 0;
 
+   if (PageTransHuge(page)) {
+   arr = kmalloc(sizeof(struct buffer_head *) * HPAGE_PMD_NR *
+   MAX_BUF_PER_PAGE, GFP_NOFS);
+   }
+
do {
if (buffer_uptodate(bh))
continue;
@@ -2310,7 +2317,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
SetPageError(page);
}
if (!buffer_mapped(bh)) {
-   zero_user(page, i * blocksize, blocksize);
+   zero_user(page + (i * blocksize / PAGE_SIZE),
+   i * blocksize % PAGE_SIZE,
+   blocksize);
if (!err)
set_buffer_uptodate(bh);
continue;
@@ -2336,7 +2345,7 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
if (!PageError(page))
SetPageUptodate(page);
unlock_page(page);
-   return 0;
+   goto out;
}
 
/* Stage two: lock the buffers */
@@ -2358,6 +2367,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
else
submit_bh(REQ_OP_READ, 0, bh);
}
+out:
+   if (arr != arr_on_stack)
+   kfree(arr);
return 0;
 }
 EXPORT_SYMBOL(block_read_full_page);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index fd4134ce9c54..f12f6293ed44 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -131,13 +131,14 @@ BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
-#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~hpage_mask(bh->b_page))
 
 /* If we *know* page->private refers to buffer_heads */
-#define page_buffers(page) \
+#define page_buffers(__page)   \
({  \
-   BUG_ON(!PagePrivate(page)); \
-   ((struct buffer_head *)page_private(page)); \
+   struct page *p = compound_head(__page); \
+   BUG_ON(!PagePrivate(p));\
+   ((struct buffer_head *)page_private(p));\
})
 #define page_has_buffers(page) PagePrivate(page)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b85b73cfb1b3..23534bd47c08 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -746,7 +746,7 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)

[PATCHv6 20/37] truncate: make truncate_inode_pages_range() aware about huge pages

2017-01-26 Thread Kirill A. Shutemov
As with shmem_undo_range(), truncate_inode_pages_range() removes huge
pages if they are fully within the range.

Partial truncation of a huge page zeroes out that part of the THP.

Unlike with shmem, this doesn't prevent us from having holes in the middle
of a huge page: we can still skip writeback of untouched buffers.

With memory-mapped IO we would lose holes in some cases when we have
THP in the page cache, since we cannot track access on the 4k level in
this case.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c|  2 +-
 include/linux/mm.h |  9 +-
 mm/truncate.c  | 86 --
 3 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 17167b299d0f..f92090fed933 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1534,7 +1534,7 @@ void block_invalidatepage(struct page *page, unsigned int 
offset,
/*
 * Check for overflow
 */
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e87155af456..41a97260f865 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1328,8 +1328,15 @@ int get_kernel_page(unsigned long start, int write, 
struct page **pages);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
-extern void do_invalidatepage(struct page *page, unsigned int offset,
+extern void __do_invalidatepage(struct page *page, unsigned int offset,
  unsigned int length);
+static inline void do_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   if (page_has_private(page))
+   __do_invalidatepage(page, offset, length);
+}
+
 
 int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 3a1a1c1a654e..81e1a13acb63 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -112,12 +112,12 @@ static int invalidate_exceptional_entry2(struct 
address_space *mapping,
  * point.  Because the caller is about to free (and possibly reuse) those
  * blocks on-disk.
  */
-void do_invalidatepage(struct page *page, unsigned int offset,
+void __do_invalidatepage(struct page *page, unsigned int offset,
   unsigned int length)
 {
void (*invalidatepage)(struct page *, unsigned int, unsigned int);
 
-   invalidatepage = page->mapping->a_ops->invalidatepage;
+   invalidatepage = page_mapping(page)->a_ops->invalidatepage;
 #ifdef CONFIG_BLOCK
if (!invalidatepage)
invalidatepage = block_invalidatepage;
@@ -142,8 +142,7 @@ truncate_complete_page(struct address_space *mapping, 
struct page *page)
if (page->mapping != mapping)
return -EIO;
 
-   if (page_has_private(page))
-   do_invalidatepage(page, 0, PAGE_SIZE);
+   do_invalidatepage(page, 0, hpage_size(page));
 
/*
 * Some filesystems seem to re-dirty the page even after
@@ -316,13 +315,35 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
unlock_page(page);
continue;
}
+
+   if (PageTransHuge(page)) {
+   int j, first = 0, last = HPAGE_PMD_NR - 1;
+
+   if (start > page->index)
+   first = start & (HPAGE_PMD_NR - 1);
+   if (index == round_down(end, HPAGE_PMD_NR))
+   last = (end - 1) & (HPAGE_PMD_NR - 1);
+
+   /* Range starts or ends in the middle of THP */
+   if (first != 0 || last != HPAGE_PMD_NR - 1) {
+   int off, len;
+   for (j = first; j <= last; j++)
+   clear_highpage(page + j);
+   off = first * PAGE_SIZE;
+   len = (last + 1) * PAGE_SIZE - off;
+   do_invalidatepage(page, off, len);
+   unlock_page(page);
+   continue;
+   }
+   }
+
truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
+   index += pvec.nr ? hpage_nr_pages(pvec.pages[pvec.nr - 1]) : 1;
pagevec_release(&pvec);
cond_resched();
-   index++;
}
 
 

[PATCHv6 23/37] mm: account huge pages to dirty, writeback, reclaimable, etc.

2017-01-26 Thread Kirill A. Shutemov
We need to account huge pages according to their size to get background
writeback working properly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/fs-writeback.c   | 10 +++---
 include/linux/backing-dev.h | 10 ++
 include/linux/memcontrol.h  | 22 ++---
 mm/migrate.c|  1 +
 mm/page-writeback.c | 80 +
 mm/rmap.c   |  4 +--
 6 files changed, 74 insertions(+), 53 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef600591d96f..e1c9faddc9e1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -366,8 +366,9 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page) && PageDirty(page)) {
-   __dec_wb_stat(old_wb, WB_RECLAIMABLE);
-   __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+   int nr = hpage_nr_pages(page);
+   __add_wb_stat(old_wb, WB_RECLAIMABLE, -nr);
+   __add_wb_stat(new_wb, WB_RECLAIMABLE, nr);
}
}
 
@@ -376,9 +377,10 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page)) {
+   int nr = hpage_nr_pages(page);
WARN_ON_ONCE(!PageWriteback(page));
-   __dec_wb_stat(old_wb, WB_WRITEBACK);
-   __inc_wb_stat(new_wb, WB_WRITEBACK);
+   __add_wb_stat(old_wb, WB_WRITEBACK, -nr);
+   __add_wb_stat(new_wb, WB_WRITEBACK, nr);
}
}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a947e61..e63487f78824 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -61,6 +61,16 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
__percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
+static inline void add_wb_stat(struct bdi_writeback *wb,
+enum wb_stat_item item, s64 amount)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __add_wb_stat(wb, item, amount);
+   local_irq_restore(flags);
+}
+
 static inline void __inc_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 254698856b8f..7a341b01937f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct page;
@@ -517,18 +518,6 @@ static inline void mem_cgroup_update_page_stat(struct page 
*page,
this_cpu_add(page->mem_cgroup->stat->count[idx], val);
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-   mem_cgroup_update_page_stat(page, idx, 1);
-}
-
-static inline void mem_cgroup_dec_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-   mem_cgroup_update_page_stat(page, idx, -1);
-}
-
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
@@ -739,13 +728,8 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
return false;
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-}
-
-static inline void mem_cgroup_dec_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
+static inline void mem_cgroup_update_page_stat(struct page *page,
+enum mem_cgroup_stat_index idx, int val)
 {
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 366466ed7fdc..20a9ce2fcc64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,6 +485,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 * are mapped to swap space.
 */
if (newzone != oldzone) {
+   BUG_ON(PageTransHuge(page));
__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
if (PageSwapBacked(page) && !PageSwapCache(page)) {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 47d5b12c460e..d7b905d66add 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2430,19 +24

[PATCHv6 29/37] ext4: handle huge pages in ext4_da_write_end()

2017-01-26 Thread Kirill A. Shutemov
Call ext4_da_should_update_i_disksize() for the head page, with the offset
relative to the head page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3eae2d058fd0..bdd62dcaa0b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3019,7 +3019,6 @@ static int ext4_da_write_end(struct file *file,
int ret = 0, ret2;
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
-   unsigned long start, end;
int write_mode = (int)(unsigned long)fsdata;
 
if (write_mode == FALL_BACK_TO_NONDELALLOC)
@@ -3027,8 +3026,6 @@ static int ext4_da_write_end(struct file *file,
  len, copied, page, fsdata);
 
trace_ext4_da_write_end(inode, pos, len, copied);
-   start = pos & (PAGE_SIZE - 1);
-   end = start + copied - 1;
 
/*
 * generic_write_end() will run mark_inode_dirty() if i_size
@@ -3037,8 +3034,10 @@ static int ext4_da_write_end(struct file *file,
 */
new_i_size = pos + copied;
if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
+   struct page *head = compound_head(page);
+   unsigned long end = (pos & ~hpage_mask(head)) + copied - 1;
if (ext4_has_inline_data(inode) ||
-   ext4_da_should_update_i_disksize(page, end)) {
+   ext4_da_should_update_i_disksize(head, end)) {
ext4_update_i_disksize(inode, new_i_size);
/* We need to mark inode dirty even if
 * new_i_size is less that inode->i_size
-- 
2.11.0



[PATCHv6 35/37] ext4: reserve larger journal transaction for huge pages

2017-01-26 Thread Kirill A. Shutemov
If huge pages are enabled, then in the worst case, with 2048 blocks
underlying a page, each possibly in a different block group, we have much
more metadata to commit.

Let's update the estimates accordingly.

I was not able to trigger the bad situation without the patch, as it's hard
to construct a very fragmented filesystem, but hopefully this change is
enough to address the concern.
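
As a rough worked example of the new estimate (assuming 1k filesystem blocks
-- the 2048-blocks-per-page case above -- and a 2MB THP):

	jbd2_journal_blocks_per_page()              ~= 1 << (PAGE_SHIFT - blkbits)
	                                             = 1 << (12 - 10)    = 4
	__ext4_journal_blocks_per_page(inode, true)  = 4 << HPAGE_PMD_ORDER
	                                             = 4 << 9            = 2048

ext4_meta_trans_blocks() is then called with lblocks = pextents = 2048, which
is why the journal-size check added in ext4_set_inode_flags() below can end up
disabling huge pages on a small journal.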

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/ext4_jbd2.h | 16 +---
 fs/ext4/inode.c | 34 +++---
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f97611171023..6e4e534d6e98 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -353,11 +353,21 @@ static inline int ext4_journal_restart(handle_t *handle, 
int nblocks)
return 0;
 }
 
+static inline int __ext4_journal_blocks_per_page(struct inode *inode, bool thp)
+{
+   int bpp = 0;
+   if (EXT4_JOURNAL(inode) != NULL) {
+   bpp = jbd2_journal_blocks_per_page(inode);
+   if (thp)
+   bpp <<= HPAGE_PMD_ORDER;
+   }
+   return bpp;
+}
+
 static inline int ext4_journal_blocks_per_page(struct inode *inode)
 {
-   if (EXT4_JOURNAL(inode) != NULL)
-   return jbd2_journal_blocks_per_page(inode);
-   return 0;
+   return __ext4_journal_blocks_per_page(inode,
+   (inode->i_flags & S_HUGE_MODE) != S_HUGE_NEVER);
 }
 
 static inline int ext4_journal_force_commit(journal_t *journal)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5bf68bbe65ec..c30562b6e685 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -141,6 +141,7 @@ static int __ext4_journalled_writepage(struct page *page, 
unsigned int len);
 static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head 
*bh);
 static int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
  int pextents);
+static int __ext4_writepage_trans_blocks(struct inode *inode, int bpp);
 
 /*
  * Test whether an inode is a fast symlink.
@@ -4496,6 +4497,21 @@ void ext4_set_inode_flags(struct inode *inode)
!ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
!ext4_encrypted_inode(inode))
new_fl |= S_DAX;
+
+   if ((new_fl & S_HUGE_MODE) != S_HUGE_NEVER &&
+   EXT4_JOURNAL(inode) != NULL) {
+   int bpp = __ext4_journal_blocks_per_page(inode, true);
+   int credits = __ext4_writepage_trans_blocks(inode, bpp);
+
+   if (EXT4_JOURNAL(inode)->j_max_transaction_buffers < credits) {
+   pr_warn_once("EXT4-fs (%s): "
+   "journal is too small for huge pages. "
+   "Disable huge pages support.\n",
+   inode->i_sb->s_id);
+   new_fl &= ~S_HUGE_MODE;
+   }
+   }
+
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
 }
@@ -5471,6 +5487,16 @@ static int ext4_meta_trans_blocks(struct inode *inode, 
int lblocks,
return ret;
 }
 
+static int __ext4_writepage_trans_blocks(struct inode *inode, int bpp)
+{
+   int ret = ext4_meta_trans_blocks(inode, bpp, bpp);
+
+   /* Account for data blocks for journalled mode */
+   if (ext4_should_journal_data(inode))
+   ret += bpp;
+   return ret;
+}
+
 /*
  * Calculate the total number of credits to reserve to fit
  * the modification of a single pages into a single transaction,
@@ -5484,14 +5510,8 @@ static int ext4_meta_trans_blocks(struct inode *inode, 
int lblocks,
 int ext4_writepage_trans_blocks(struct inode *inode)
 {
int bpp = ext4_journal_blocks_per_page(inode);
-   int ret;
-
-   ret = ext4_meta_trans_blocks(inode, bpp, bpp);
 
-   /* Account for data blocks for journalled mode */
-   if (ext4_should_journal_data(inode))
-   ret += bpp;
-   return ret;
+   return __ext4_writepage_trans_blocks(inode, bpp);
 }
 
 /*
-- 
2.11.0



[PATCHv6 19/37] fs: make block_page_mkwrite() aware about huge pages

2017-01-26 Thread Kirill A. Shutemov
Adjust the check on whether part of the page is beyond the file size,
and apply compound_head() and page_mapping() where appropriate.
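
To illustrate the new "end" calculation, here is a minimal userspace
sketch (assumed constant, 2MB HPAGE_PMD_SIZE; not kernel code) of the
case where a huge page straddles EOF:

#include <stdio.h>

#define HPAGE_PMD_SIZE	(2UL * 1024 * 1024)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	unsigned long size = 3 * 1024 * 1024 + 12345;	/* i_size */
	unsigned long pgoff = 2 * 1024 * 1024;		/* huge page starts at 2MB */
	unsigned long end;

	if (pgoff + HPAGE_PMD_SIZE > size)
		end = size & ~HPAGE_PMD_MASK;	/* EOF falls inside this page */
	else
		end = HPAGE_PMD_SIZE;		/* page wholly inside EOF */

	/* prints 1060921: one megabyte plus 12345 bytes into the huge page */
	printf("end = %lu\n", end);
	return 0;
}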

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d05524f14846..17167b299d0f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2544,7 +2544,7 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 get_block_t get_block)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
struct inode *inode = file_inode(vma->vm_file);
unsigned long end;
loff_t size;
@@ -2552,7 +2552,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
lock_page(page);
size = i_size_read(inode);
-   if ((page->mapping != inode->i_mapping) ||
+   if ((page_mapping(page) != inode->i_mapping) ||
(page_offset(page) > size)) {
/* We overload EFAULT to mean page got truncated */
ret = -EFAULT;
@@ -2560,10 +2560,10 @@ int block_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf,
}
 
/* page is wholly or partially inside EOF */
-   if (((page->index + 1) << PAGE_SHIFT) > size)
-   end = size & ~PAGE_MASK;
+   if (((page->index + hpage_nr_pages(page)) << PAGE_SHIFT) > size)
+   end = size & ~hpage_mask(page);
else
-   end = PAGE_SIZE;
+   end = hpage_size(page);
 
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
-- 
2.11.0



[PATCHv6 37/37] ext4, vfs: add huge= mount option

2017-01-26 Thread Kirill A. Shutemov
The same four values as in the tmpfs case.

Encryption code is not yet ready to handle huge pages, so we disable
huge page support if the inode has EXT4_INODE_ENCRYPT.
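
With the series applied and CONFIG_TRANSPARENT_HUGE_PAGECACHE enabled,
usage is the same as for tmpfs, e.g. (device and mountpoint here are
hypothetical):

	mount -t ext4 -o huge=always /dev/vdb /mnt/test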

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/ext4.h  |  5 +
 fs/ext4/inode.c | 32 +++-
 fs/ext4/super.c | 24 
 3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2163c1e69f2a..19bb9995fa96 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1134,6 +1134,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK  0x40 /* Enable support for dio read 
nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM0x80 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT0x100 /* Journal Async 
Commit */
+#define EXT4_MOUNT_HUGE_MODE   0x600 /* Huge support mode: */
+#define EXT4_MOUNT_HUGE_NEVER  0x000
+#define EXT4_MOUNT_HUGE_ALWAYS 0x200
+#define EXT4_MOUNT_HUGE_WITHIN_SIZE0x400
+#define EXT4_MOUNT_HUGE_ADVISE 0x600
 #define EXT4_MOUNT_DELALLOC0x800 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT  0x1000 /* Abort on file data write 
*/
 #define EXT4_MOUNT_BLOCK_VALIDITY  0x2000 /* Block validity checking */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e24ccf4c3694..120d32bcb6af 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4481,7 +4481,7 @@ int ext4_get_inode_loc(struct inode *inode, struct 
ext4_iloc *iloc)
 void ext4_set_inode_flags(struct inode *inode)
 {
unsigned int flags = EXT4_I(inode)->i_flags;
-   unsigned int new_fl = 0;
+   unsigned int mask, new_fl = 0;
 
if (flags & EXT4_SYNC_FL)
new_fl |= S_SYNC;
@@ -4493,11 +4493,25 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
-   !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
-   !ext4_encrypted_inode(inode))
-   new_fl |= S_DAX;
-
+   if (S_ISREG(inode->i_mode) && !ext4_encrypted_inode(inode)) {
+   if (test_opt(inode->i_sb, DAX) &&
+   !ext4_should_journal_data(inode) &&
+   !ext4_has_inline_data(inode))
+   new_fl |= S_DAX;
+   switch (test_opt(inode->i_sb, HUGE_MODE)) {
+   case EXT4_MOUNT_HUGE_NEVER:
+   break;
+   case EXT4_MOUNT_HUGE_ALWAYS:
+   new_fl |= S_HUGE_ALWAYS;
+   break;
+   case EXT4_MOUNT_HUGE_WITHIN_SIZE:
+   new_fl |= S_HUGE_WITHIN_SIZE;
+   break;
+   case EXT4_MOUNT_HUGE_ADVISE:
+   new_fl |= S_HUGE_ADVISE;
+   break;
+   }
+   }
if ((new_fl & S_HUGE_MODE) != S_HUGE_NEVER &&
EXT4_JOURNAL(inode) != NULL) {
int bpp = __ext4_journal_blocks_per_page(inode, true);
@@ -4511,9 +4525,9 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl &= ~S_HUGE_MODE;
}
}
-
-   inode_set_flags(inode, new_fl,
-   S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
+   mask = S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+   S_DIRSYNC | S_DAX | S_HUGE_MODE;
+   inode_set_flags(inode, new_fl, mask);
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 66845a08a87a..13376a72050c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1296,6 +1296,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum,
+   Opt_huge_never, Opt_huge_always, Opt_huge_within_size, Opt_huge_advise,
 };
 
 static const match_table_t tokens = {
@@ -1376,6 +1377,10 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
+   {Opt_huge_never, "huge=never"},
+   {Opt_huge_always, "huge=always"},
+   {Opt_huge_within_size, "huge=within_size"},
+   {Opt_huge_advise, "huge=advise"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_removed, "check=none"},/* mount option from ext2/3 */
{Opt_removed, "nocheck"},   /* mount option from ext2/3 */

[PATCHv6 13/37] mm: make write_cache_pages() work on huge pages

2017-01-26 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Let's adjust the iteration
accordingly.
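
A minimal userspace sketch of the index math (assumed constant,
HPAGE_PMD_NR == 512; not kernel code): when the lookup hands us a
subpage in the middle of a THP, the next index to look up is rounded up
to the following huge page boundary and the remaining subpage slots in
the pagevec are skipped:

#include <stdio.h>

#define HPAGE_PMD_NR	512UL
#define round_up(x, y)	(((x) + (y) - 1) & ~((y) - 1))

int main(void)
{
	unsigned long index = 700;		/* where the lookup started */
	unsigned long done_index = 700;		/* pgoff of the subpage we got */
	unsigned long skip;

	index = round_up(index + 1, HPAGE_PMD_NR);
	skip = HPAGE_PMD_NR - done_index % HPAGE_PMD_NR - 1;

	/* prints "next index 1024, slots skipped 323" */
	printf("next index %lu, slots skipped %lu\n", index, skip);
	return 0;
}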

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/page-writeback.c | 17 -
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..9e87155af456 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1045,6 +1045,7 @@ extern pgoff_t __page_file_index(struct page *page);
  */
 static inline pgoff_t page_index(struct page *page)
 {
+   page = compound_head(page);
if (unlikely(PageSwapCache(page)))
return __page_file_index(page);
return page->index;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9a93b9c3d662..e3eb6dc03286 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -545,6 +545,7 @@ static inline int wait_on_page_locked_killable(struct page 
*page)
  */
 static inline void wait_on_page_writeback(struct page *page)
 {
+   page = compound_head(page);
if (PageWriteback(page))
wait_on_page_bit(page, PG_writeback);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 290e8b7d3181..47d5b12c460e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2209,7 +2209,7 @@ int write_cache_pages(struct address_space *mapping,
 * mapping. However, page->index will not change
 * because we have a reference on the page.
 */
-   if (page->index > end) {
+   if (page_to_pgoff(page) > end) {
/*
 * can't be range_cyclic (1st pass) because
 * end == -1 in that case.
@@ -2218,7 +2218,12 @@ int write_cache_pages(struct address_space *mapping,
break;
}
 
-   done_index = page->index;
+   done_index = page_to_pgoff(page);
+   if (PageTransCompound(page)) {
+   index = round_up(index + 1, HPAGE_PMD_NR);
+   i += HPAGE_PMD_NR -
+   done_index % HPAGE_PMD_NR - 1;
+   }
 
lock_page(page);
 
@@ -2230,7 +2235,7 @@ int write_cache_pages(struct address_space *mapping,
 * even if there is now a new, dirty page at the same
 * pagecache address.
 */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
 continue_unlock:
unlock_page(page);
continue;
@@ -2268,7 +2273,8 @@ int write_cache_pages(struct address_space *mapping,
 * not be suitable for data integrity
 * writeout).
 */
-   done_index = page->index + 1;
+   done_index = compound_head(page)->index
+   + hpage_nr_pages(page);
done = 1;
break;
}
@@ -2280,7 +2286,8 @@ int write_cache_pages(struct address_space *mapping,
 * keep going until we have written all the pages
 * we tagged for writeback prior to entering this loop.
 */
-   if (--wbc->nr_to_write <= 0 &&
+   wbc->nr_to_write -= hpage_nr_pages(page);
+   if (wbc->nr_to_write <= 0 &&
wbc->sync_mode == WB_SYNC_NONE) {
done = 1;
break;
-- 
2.11.0



[PATCHv6 36/37] mm, fs, ext4: expand use of page_mapping() and page_to_pgoff()

2017-01-26 Thread Kirill A. Shutemov
With huge pages in the page cache we see tail pages in more code paths.
This patch replaces direct access to struct page fields with helpers
which can handle tail pages properly.
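
For example (a sketch of the convention, not new code): for the subpage
seven slots into a THP whose head page sits at index 512,
page_to_pgoff() returns 519 and page_mapping() returns the head page's
mapping, while the tail page's own ->index and ->mapping fields are not
meaningful for this purpose.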

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c |  2 +-
 fs/ext4/inode.c |  4 ++--
 mm/filemap.c| 24 +---
 mm/memory.c |  2 +-
 mm/page-writeback.c |  2 +-
 mm/truncate.c   |  5 +++--
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index f92090fed933..b2c220dd83b5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -631,7 +631,7 @@ static void __set_page_dirty(struct page *page, struct 
address_space *mapping,
unsigned long flags;
 
 spin_lock_irqsave(&mapping->tree_lock, flags);
-   if (page->mapping) {/* Race with truncate? */
+   if (page_mapping(page)) {   /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
 radix_tree_tag_set(&mapping->page_tree,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c30562b6e685..e24ccf4c3694 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1237,7 +1237,7 @@ static int ext4_write_begin(struct file *file, struct 
address_space *mapping,
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
@@ -2975,7 +2975,7 @@ static int ext4_da_write_begin(struct file *file, struct 
address_space *mapping,
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
diff --git a/mm/filemap.c b/mm/filemap.c
index 01a0f63fa597..7921a7f3cd2e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -399,7 +399,7 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
struct page *page = pvec.pages[i];
 
/* until radix tree lookup accepts end_index */
-   if (page->index > end)
+   if (page_to_pgoff(page) > end)
continue;
 
page = compound_head(page);
@@ -1364,7 +1364,7 @@ struct page *pagecache_get_page(struct address_space 
*mapping, pgoff_t offset,
}
 
/* Has the page been truncated? */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto repeat;
@@ -1641,7 +1641,8 @@ unsigned find_get_pages_contig(struct address_space 
*mapping, pgoff_t start,
 * otherwise we can get both false positives and false
 * negatives, which is just confusing to the caller.
 */
-   if (page->mapping == NULL || page_to_pgoff(page) != index) {
+   if (page_mapping(page) == NULL ||
+   page_to_pgoff(page) != index) {
put_page(page);
break;
}
@@ -1929,7 +1930,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (!trylock_page(page))
goto page_not_up_to_date;
/* Did it get truncated before we got the lock? */
-   if (!page->mapping)
+   if (!page_mapping(page))
goto page_not_up_to_date_locked;
if (!mapping->a_ops->is_partially_uptodate(page,
offset, iter->count))
@@ -2009,7 +2010,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 
 page_not_up_to_date_locked:
/* Did it get truncated before we got the lock? */
-   if (!page->mapping) {
+   if (!page_mapping(page)) {
unlock_page(page);
put_page(page);
continue;
@@ -2045,7 +2046,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
-   if (page->mapping == NULL) {
+   if (page_mapping(page) == NULL) {
/*
 * invalidate_mapping_pages got it
 */
@@ -2344,12 +2345,12 @@ int f

[PATCHv6 33/37] ext4: fix SEEK_DATA/SEEK_HOLE for huge pages

2017-01-26 Thread Kirill A. Shutemov
ext4_find_unwritten_pgoff() needs a few tweaks to work with huge pages:
mostly trivial page_mapping()/page_to_pgoff() conversions, plus an
adjustment to how we find the relevant block.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d663d3d7c81c..0b11aadfb75f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -519,7 +519,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 * range, it will be a hole.
 */
if (lastoff < endoff && whence == SEEK_HOLE &&
-   page->index > end) {
+   page_to_pgoff(page) > end) {
found = 1;
*offset = lastoff;
goto out;
@@ -527,7 +527,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 
lock_page(page);
 
-   if (unlikely(page->mapping != inode->i_mapping)) {
+   if (unlikely(page_mapping(page) != inode->i_mapping)) {
unlock_page(page);
continue;
}
@@ -538,8 +538,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
}
 
if (page_has_buffers(page)) {
+   int diff;
lastoff = page_offset(page);
bh = head = page_buffers(page);
+   diff = (page - compound_head(page)) << 
inode->i_blkbits;
+   while (diff--)
+   bh = bh->b_this_page;
do {
if (buffer_uptodate(bh) ||
buffer_unwritten(bh)) {
@@ -560,8 +564,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
} while (bh != head);
}
 
-   lastoff = page_offset(page) + PAGE_SIZE;
+   lastoff = page_offset(page) + hpage_size(page);
unlock_page(page);
+   if (PageTransCompound(page)) {
+   i++;
+   break;
+   }
}
 
/*
@@ -574,7 +582,9 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
break;
}
 
-   index = pvec.pages[i - 1]->index + 1;
+   index = page_to_pgoff(pvec.pages[i - 1]) + 1;
+   if (PageTransCompound(pvec.pages[i - 1]))
+   index = round_up(index, HPAGE_PMD_NR);
 pagevec_release(&pvec);
} while (index <= end);
 
-- 
2.11.0



[PATCHv6 01/37] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2017-01-26 Thread Kirill A. Shutemov
We would need to use multi-order radix-tree entries for ext4 and other
filesystems to have a coherent view of tags (dirty/towrite) in the tree.

This patch converts the huge tmpfs implementation to multi-order
entries, so we will be able to use the same code path for all
filesystems.

We also change the interface of the page-cache lookup functions:

  - functions that look up pages[1] return the subpage of a THP
    relevant to the requested index;

  - functions that look up entries[2] return one entry per THP, with
    the index pointing to the head page (basically, rounded down to a
    multiple of HPAGE_PMD_NR);

This provides balanced exposure of multi-order entries to the rest of
the kernel (see the sketch below).

[1] find_get_pages(), pagecache_get_page(), pagevec_lookup(), etc.
[2] find_get_entry(), find_get_entries(), pagevec_lookup_entries(), etc.
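
A minimal userspace sketch of the index convention (assumed constant,
HPAGE_PMD_NR == 512; not kernel code):

#include <stdio.h>

#define HPAGE_PMD_NR	512UL

int main(void)
{
	unsigned long requested = 1000;	/* index the caller asked for */
	/* entry-returning lookups report the head page's index */
	unsigned long head_index = requested & ~(HPAGE_PMD_NR - 1);
	/* page-returning lookups hand back head + this offset */
	unsigned long subpage = requested - head_index;

	/* prints "entry index 512, subpage offset 488" */
	printf("entry index %lu, subpage offset %lu\n", head_index, subpage);
	return 0;
}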

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/pagemap.h |   9 ++
 mm/filemap.c| 236 ++--
 mm/huge_memory.c|  48 +++---
 mm/khugepaged.c |  26 ++
 mm/shmem.c  | 117 ++--
 mm/truncate.c   |  15 ++-
 6 files changed, 235 insertions(+), 216 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 324c8dbad1e1..ad63a7be5a5e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -332,6 +332,15 @@ static inline struct page *grab_cache_page_nowait(struct 
address_space *mapping,
mapping_gfp_mask(mapping));
 }
 
+static inline struct page *find_subpage(struct page *page, pgoff_t offset)
+{
+   VM_BUG_ON_PAGE(PageTail(page), page);
+   VM_BUG_ON_PAGE(page->index > offset, page);
+   VM_BUG_ON_PAGE(page->index + (1 << compound_order(page)) < offset,
+   page);
+   return page - page->index + offset;
+}
+
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
diff --git a/mm/filemap.c b/mm/filemap.c
index b772a33ef640..837a71a2a412 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -150,7 +150,9 @@ static int page_cache_tree_insert(struct address_space 
*mapping,
 static void page_cache_tree_delete(struct address_space *mapping,
   struct page *page, void *shadow)
 {
-   int i, nr;
+   struct radix_tree_node *node;
+   void **slot;
+   int nr;
 
/* hugetlb pages are represented by one entry in the radix tree */
nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
@@ -159,19 +161,12 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(nr != 1 && shadow, page);
 
-   for (i = 0; i < nr; i++) {
-   struct radix_tree_node *node;
-   void **slot;
+   __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+   VM_BUG_ON_PAGE(!node && nr != 1, page);
 
-   __radix_tree_lookup(&mapping->page_tree, page->index + i,
-   &node, &slot);
-
-   VM_BUG_ON_PAGE(!node && nr != 1, page);
-
-   radix_tree_clear_tags(&mapping->page_tree, node, slot);
-   __radix_tree_replace(&mapping->page_tree, node, slot, shadow,
-workingset_update_node, mapping);
-   }
+   radix_tree_clear_tags(&mapping->page_tree, node, slot);
+   __radix_tree_replace(&mapping->page_tree, node, slot, shadow,
+   workingset_update_node, mapping);
 
if (shadow) {
mapping->nrexceptional += nr;
@@ -285,12 +280,7 @@ void delete_from_page_cache(struct page *page)
if (freepage)
freepage(page);
 
-   if (PageTransHuge(page) && !PageHuge(page)) {
-   page_ref_sub(page, HPAGE_PMD_NR);
-   VM_BUG_ON_PAGE(page_count(page) <= 0, page);
-   } else {
-   put_page(page);
-   }
+   put_page(page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
@@ -1172,7 +1162,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
void **pagep;
-   struct page *head, *page;
+   struct page *page;
 
rcu_read_lock();
 repeat:
@@ -1193,15 +1183,8 @@ struct page *find_get_entry(struct address_space 
*mapping, pgoff_t offset)
goto out;
}
 
-   head = compound_head(page);
-   if (!page_cache_get_speculative(head))
-   goto repeat;
-
-   /* The page was split under us? */
-   if (compound_head(page) != head) {
-   put_page(head);
+   if (!page_cache_get_speculative(page))
  

[PATCHv6 28/37] ext4: make ext4_block_write_begin() aware about huge pages

2017-01-26 Thread Kirill A. Shutemov
It simply matches changes to __block_write_begin_int().

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 35 +--
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7e65a5b78cf1..3eae2d058fd0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1093,9 +1093,8 @@ int do_journal_get_write_access(handle_t *handle,
 static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
  get_block_t *get_block)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
@@ -1103,10 +1102,14 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
unsigned bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;
bool decrypt = false;
+   bool uptodate = PageUptodate(page);
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
if (!page_has_buffers(page))
@@ -1119,10 +1122,8 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
block++, block_start = block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
-   if (PageUptodate(page)) {
-   if (!buffer_uptodate(bh))
-   set_buffer_uptodate(bh);
-   }
+   if (uptodate && !buffer_uptodate(bh))
+   set_buffer_uptodate(bh);
continue;
}
if (buffer_new(bh))
@@ -1134,19 +1135,25 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
break;
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
-   if (PageUptodate(page)) {
+   if (uptodate) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page, to, block_end,
-  block_start, from);
+   if (block_end > to || block_start < from) {
+   BUG_ON(to - from  > PAGE_SIZE);
+   zero_user_segments(page +
+   block_start / PAGE_SIZE,
+   to % PAGE_SIZE,
+   (block_start % 
PAGE_SIZE) + blocksize,
+   block_start % PAGE_SIZE,
+   from % PAGE_SIZE);
+   }
continue;
}
}
-   if (PageUptodate(page)) {
+   if (uptodate) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
continue;
-- 
2.11.0



[PATCHv6 14/37] thp: introduce hpage_size() and hpage_mask()

2017-01-26 Thread Kirill A. Shutemov
Introduce new helpers which return size/mask of the page:
HPAGE_PMD_SIZE/HPAGE_PMD_MASK if the page is PageTransHuge() and
PAGE_SIZE/PAGE_MASK otherwise.
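
A minimal userspace sketch of the intended use (assumed constants; not
kernel code): callers compute the offset of a file position within the
(possibly huge) page as "pos & ~hpage_mask(page)" and bound lengths by
hpage_size(page):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define HPAGE_PMD_SIZE	(2UL * 1024 * 1024)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	unsigned long pos = 5 * 1024 * 1024 + 300;	/* file position */

	/* small page: offset within the 4k page -> 300 */
	printf("small page offset: %lu\n", pos & ~PAGE_MASK);
	/* huge page: offset within the 2MB page -> 1048876 */
	printf("huge page offset:  %lu\n", pos & ~HPAGE_PMD_MASK);
	return 0;
}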

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/huge_mm.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 97e478d6b690..e5c9c26d2439 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -142,6 +142,20 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
 }
 
+static inline int hpage_size(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_SIZE;
+   return PAGE_SIZE;
+}
+
+static inline unsigned long hpage_mask(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_MASK;
+   return PAGE_MASK;
+}
+
 extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
@@ -167,6 +181,8 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_size(x) PAGE_SIZE
+#define hpage_mask(x) PAGE_MASK
 
 #define transparent_hugepage_enabled(__vma) 0
 
-- 
2.11.0



[PATCHv6 30/37] ext4: make ext4_da_page_release_reservation() aware about huge pages

2017-01-26 Thread Kirill A. Shutemov
For huge pages 'stop' must be within HPAGE_PMD_SIZE.
Let's use hpage_size() in the BUG_ON().

We also need to change how we calculate lblk for cluster deallocation.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bdd62dcaa0b2..afba41b65a15 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1571,7 +1571,7 @@ static void ext4_da_page_release_reservation(struct page 
*page,
int num_clusters;
ext4_fsblk_t lblk;
 
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
@@ -1606,7 +1606,8 @@ static void ext4_da_page_release_reservation(struct page 
*page,
 * need to release the reserved space for that cluster. */
num_clusters = EXT4_NUM_B2C(sbi, to_release);
while (num_clusters > 0) {
-   lblk = (page->index << (PAGE_SHIFT - inode->i_blkbits)) +
+   lblk = ((page->index + offset / PAGE_SIZE) <<
+   (PAGE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
!ext4_find_delalloc_cluster(inode, lblk))
-- 
2.11.0



[PATCHv6 32/37] ext4: make EXT4_IOC_MOVE_EXT work with huge pages

2017-01-26 Thread Kirill A. Shutemov
Adjust how we find the relevant block within the page and how we clear
the required part of the page.
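
As a rough worked example (assuming 4k pages and a 1k block size, so
blocks_per_page == 4): if the 4k page we are moving is the third
subpage of its THP (pagep[0] - compound_head(pagep[0]) == 2), then
diff == 2 * 4 == 8, so the buffer-head walk below skips eight extra
buffers to reach the buffers backing that subpage before
data_offset_in_page is applied.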

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/move_extent.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 6fc14def0c70..2efa9deb47a9 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -210,7 +210,9 @@ mext_page_mkuptodate(struct page *page, unsigned from, 
unsigned to)
return err;
}
if (!buffer_mapped(bh)) {
-   zero_user(page, block_start, blocksize);
+   zero_user(page + block_start / PAGE_SIZE,
+   block_start % PAGE_SIZE,
+   blocksize);
set_buffer_uptodate(bh);
continue;
}
@@ -267,10 +269,11 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
unsigned int tmp_data_size, data_size, replaced_size;
int i, err2, jblocks, retries = 0;
int replaced_count = 0;
-   int from = data_offset_in_page << orig_inode->i_blkbits;
+   int from;
int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
struct super_block *sb = orig_inode->i_sb;
struct buffer_head *bh = NULL;
+   int diff;
 
/*
 * It needs twice the amount of ordinary journal buffers because
@@ -355,6 +358,9 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
goto unlock_pages;
}
 data_copy:
+   diff = (pagep[0] - compound_head(pagep[0])) * blocks_per_page;
+   from = (data_offset_in_page + diff) << orig_inode->i_blkbits;
+   pagep[0] = compound_head(pagep[0]);
*err = mext_page_mkuptodate(pagep[0], from, from + replaced_size);
if (*err)
goto unlock_pages;
@@ -384,7 +390,7 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
if (!page_has_buffers(pagep[0]))
create_empty_buffers(pagep[0], 1 << orig_inode->i_blkbits, 0);
bh = page_buffers(pagep[0]);
-   for (i = 0; i < data_offset_in_page; i++)
+   for (i = 0; i < data_offset_in_page + diff; i++)
bh = bh->b_this_page;
for (i = 0; i < block_len_in_page; i++) {
*err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
-- 
2.11.0



[PATCHv6 34/37] ext4: make fallocate() operations work with huge pages

2017-01-26 Thread Kirill A. Shutemov
__ext4_block_zero_page_range() is adjusted to calculate the starting
iblock correctly for huge pages.

ext4_{collapse,insert}_range() requires page cache invalidation. The
invalidation needs to be aligned to the huge page boundary if huge
pages are possible in the page cache.
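
As a rough example (assuming 2MB huge pages): a collapse_range starting
at offset 3MB now rounds ioffset down to 2MB rather than to a 4k
boundary, so the whole huge page covering the start of the removed
range is flushed and invalidated, not just a 4k chunk of it.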

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/extents.c | 10 --
 fs/ext4/inode.c   |  3 +--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3e295d3350a9..f743e772b44f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5501,7 +5501,10 @@ int ext4_collapse_range(struct inode *inode, loff_t 
offset, loff_t len)
 * Need to round down offset to be aligned with page size boundary
 * for page size > block size.
 */
-   ioffset = round_down(offset, PAGE_SIZE);
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+   ioffset = round_down(offset, HPAGE_PMD_SIZE);
+   else
+   ioffset = round_down(offset, PAGE_SIZE);
/*
 * Write tail of the last page before removed range since it will get
 * removed from the page cache below.
@@ -5650,7 +5653,10 @@ int ext4_insert_range(struct inode *inode, loff_t 
offset, loff_t len)
 * Need to round down to align start offset to page size boundary
 * for page size > block size.
 */
-   ioffset = round_down(offset, PAGE_SIZE);
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+   ioffset = round_down(offset, HPAGE_PMD_SIZE);
+   else
+   ioffset = round_down(offset, PAGE_SIZE);
/* Write out all dirty pages */
ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
LLONG_MAX);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 409ebd81e436..5bf68bbe65ec 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3816,7 +3816,6 @@ void ext4_set_aops(struct inode *inode)
 static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
 {
-   ext4_fsblk_t index = from >> PAGE_SHIFT;
unsigned offset;
unsigned blocksize, pos;
ext4_lblk_t iblock;
@@ -3835,7 +3834,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 
blocksize = inode->i_sb->s_blocksize;
 
-   iblock = index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
+   iblock = page->index << (PAGE_SHIFT - inode->i_sb->s_blocksize_bits);
 
if (!page_has_buffers(page))
create_empty_buffers(page, blocksize, 0);
-- 
2.11.0



[PATCHv6 06/37] thp: handle write-protection faults for file THP

2017-01-26 Thread Kirill A. Shutemov
For filesystems that want to be write-notified (have ->mkwrite), we will
encounter write-protection faults for huge PMDs in shared mappings.

The easiest way to handle them is to clear the PMD and let it refault
as writable.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 mm/memory.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..903d9d3e01c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3488,8 +3488,16 @@ static int wp_huge_pmd(struct vm_fault *vmf, pmd_t 
orig_pmd)
return vmf->vma->vm_ops->pmd_fault(vmf->vma, vmf->address,
   vmf->pmd, vmf->flags);
 
+   if (vmf->vma->vm_flags & VM_SHARED) {
+   /* Clear PMD */
+   zap_page_range_single(vmf->vma, vmf->address & HPAGE_PMD_MASK,
+   HPAGE_PMD_SIZE, NULL);
+
+   /* Refault to establish writable PMD */
+   return 0;
+   }
+
/* COW handled on pte level: split pmd */
-   VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
 
return VM_FAULT_FALLBACK;
-- 
2.11.0



[PATCHv6 00/37] ext4: support of huge pages

2017-01-26 Thread Kirill A. Shutemov
acking storage;

From f523dd3aad026f5a3f8cbabc0ec69958a0618f6b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Fri, 12 Aug 2016 19:44:30 +0300
Subject: [PATCH] Add a few more configurations to test ext4 with huge pages

Four new configurations: huge_4k, huge_1k, huge_bigalloc, huge_encrypt.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 .../test-appliance/files/root/fs/ext4/cfg/all.list   |  4 
 .../test-appliance/files/root/fs/ext4/cfg/huge_1k|  6 ++
 .../test-appliance/files/root/fs/ext4/cfg/huge_4k|  6 ++
 .../test-appliance/files/root/fs/ext4/cfg/huge_bigalloc  | 14 ++
 .../files/root/fs/ext4/cfg/huge_bigalloc.exclude |  7 +++
 .../test-appliance/files/root/fs/ext4/cfg/huge_encrypt   |  5 +
 .../files/root/fs/ext4/cfg/huge_encrypt.exclude  | 16 
 kvm-xfstests/util/parse_cli  |  1 +
 8 files changed, 59 insertions(+)
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
 create mode 100644 kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
 create mode 100644 
kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude

diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
index 7ec37f4bafaa..14a8e72d2e6e 100644
--- a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/all.list
@@ -9,3 +9,7 @@ dioread_nolock
 data_journal
 bigalloc
 bigalloc_1k
+huge_4k
+huge_1k
+huge_bigalloc
+huge_encrypt
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
new file mode 100644
index ..209c76a8a6c1
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_1k
@@ -0,0 +1,6 @@
+export FS=ext4
+export TEST_DEV=$SM_TST_DEV
+export TEST_DIR=$SM_TST_MNT
+export MKFS_OPTIONS="-q -b 1024"
+export EXT_MOUNT_OPTIONS="huge=always"
+TESTNAME="Ext4 1k block with huge pages"
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
new file mode 100644
index ..bae901cb2bab
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_4k
@@ -0,0 +1,6 @@
+export FS=ext4
+export TEST_DEV=$PRI_TST_DEV
+export TEST_DIR=$PRI_TST_MNT
+export MKFS_OPTIONS="-q"
+export EXT_MOUNT_OPTIONS="huge=always"
+TESTNAME="Ext4 4k block with huge pages"
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
new file mode 100644
index ..b3d87562bce6
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc
@@ -0,0 +1,14 @@
+SIZE=large
+export MKFS_OPTIONS="-O bigalloc"
+export EXT_MOUNT_OPTIONS="huge=always"
+
+# Until we can teach xfstests the difference between cluster size and
+# block size, avoid collapse_range, insert_range, and zero_range since
+# these will fail due the fact that these operations require
+# cluster-aligned ranges.
+export FSX_AVOID="-C -I -z"
+export FSSTRESS_AVOID="-f collapse=0 -f insert=0 -f zero=0"
+export XFS_IO_AVOID="fcollapse finsert zero"
+
+TESTNAME="Ext4 4k block w/bigalloc"
+
diff --git 
a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
new file mode 100644
index ..bd779be99518
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_bigalloc.exclude
@@ -0,0 +1,7 @@
+# bigalloc does not support on-line defrag
+ext4/301
+ext4/302
+ext4/303
+ext4/304
+ext4/307
+ext4/308
diff --git a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
new file mode 100644
index ..29f058ba937d
--- /dev/null
+++ b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt
@@ -0,0 +1,5 @@
+SIZE=small
+export MKFS_OPTIONS=""
+export EXT_MOUNT_OPTIONS="test_dummy_encryption,huge=always"
+REQUIRE_FEATURE=encryption
+TESTNAME="Ext4 encryption"
diff --git 
a/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude 
b/kvm-xfstests/test-appliance/files/root/fs/ext4/cfg/huge_encrypt.exclude
new file mode 100644
index ..b91cc58b5aa3
--- /dev/null
+++ b/kvm-xfstests/t

[PATCHv6 18/37] fs: make block_write_{begin,end}() be able to handle huge pages

2017-01-26 Thread Kirill A. Shutemov
It's more or less straightforward.

Most changes are around getting the offset/len within the page right
and zeroing out the desired part of the page.
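
A minimal userspace sketch of the subpage/offset split used for zeroing
(assumed 4k PAGE_SIZE; not kernel code): zero_user() still operates on
a single small page, so for a THP the caller selects the subpage with
block_start / PAGE_SIZE and reduces offsets modulo PAGE_SIZE:

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	/* a buffer that starts 10 pages plus 1k into a huge page */
	unsigned long block_start = 10 * PAGE_SIZE + 1024;
	unsigned long len = 512;

	/* prints "subpage 10, offset 1024, len 512" */
	printf("subpage %lu, offset %lu, len %lu\n",
	       block_start / PAGE_SIZE, block_start % PAGE_SIZE, len);
	return 0;
}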

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 70 +++--
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 72462beca909..d05524f14846 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1902,6 +1902,7 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
 {
unsigned int block_start, block_end;
struct buffer_head *head, *bh;
+   bool uptodate = PageUptodate(page);
 
BUG_ON(!PageLocked(page));
if (!page_has_buffers(page))
@@ -1912,21 +1913,21 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
do {
block_end = block_start + bh->b_size;
 
-   if (buffer_new(bh)) {
-   if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (buffer_new(bh) && block_end > from && block_start < to) {
+   if (!uptodate) {
+   unsigned start, size;
 
-   start = max(from, block_start);
-   size = min(to, block_end) - start;
+   start = max(from, block_start);
+   size = min(to, block_end) - start;
 
-   zero_user(page, start, size);
-   set_buffer_uptodate(bh);
-   }
-
-   clear_buffer_new(bh);
-   mark_buffer_dirty(bh);
+   zero_user(page + block_start / PAGE_SIZE,
+   start % PAGE_SIZE,
+   size);
+   set_buffer_uptodate(bh);
}
+
+   clear_buffer_new(bh);
+   mark_buffer_dirty(bh);
}
 
block_start = block_end;
@@ -1992,18 +1993,21 @@ iomap_to_bh(struct inode *inode, sector_t block, struct 
buffer_head *bh,
 int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
+   bool uptodate = PageUptodate(page);
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
head = create_page_buffers(page, inode, 0);
@@ -2016,10 +2020,8 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
-   if (PageUptodate(page)) {
-   if (!buffer_uptodate(bh))
-   set_buffer_uptodate(bh);
-   }
+   if (uptodate && !buffer_uptodate(bh))
+   set_buffer_uptodate(bh);
continue;
}
if (buffer_new(bh))
@@ -2036,23 +2038,28 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
 
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
-   if (PageUptodate(page)) {
+   if (uptodate) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page,
-   to, block_end,
-   

Re: [PATCHv5 22/36] mm, hugetlb: switch hugetlbfs to multi-order radix-tree entries

2016-11-30 Thread Kirill A. Shutemov
On Wed, Nov 30, 2016 at 05:48:05PM +0800, Hillf Danton wrote:
> On Tuesday, November 29, 2016 7:23 PM Kirill A. Shutemov wrote:
> > @@ -607,10 +605,10 @@ static long hugetlbfs_fallocate(struct file *file, 
> > int mode, loff_t offset,
> > }
> > 
> > /* Set numa allocation policy based on index */
> > -   hugetlb_set_vma_policy(_vma, inode, index);
> > +   hugetlb_set_vma_policy(_vma, inode, index >> 
> > huge_page_order(h));
> > 
> > /* addr is the offset within the file (zero based) */
> > -   addr = index * hpage_size;
> > +   addr = index << PAGE_SHIFT & ~huge_page_mask(h);
> > 
> > /* mutex taken here, fault path and hole punch */
> > hash = hugetlb_fault_mutex_hash(h, mm, _vma, mapping,
> 
> Seems we can't use index in computing hash as long as it isn't in huge page 
> size.

Look at the changes in hugetlb_fault_mutex_hash(): we shift the index
right by huge_page_order() before calculating the hash. I don't see a
problem here.

-- 
 Kirill A. Shutemov


[PATCHv5 02/36] Revert "radix-tree: implement radix_tree_maybe_preload_order()"

2016-11-29 Thread Kirill A. Shutemov
This reverts commit 356e1c23292a4f63cfdf1daf0e0ddada51f32de8.

After conversion of huge tmpfs to multi-order entries, we don't need
this anymore.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/radix-tree.h |  1 -
 lib/radix-tree.c   | 74 --
 2 files changed, 75 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index d0690691d9bf..6563fe64cf69 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -323,7 +323,6 @@ unsigned int radix_tree_gang_lookup_slot(struct 
radix_tree_root *root,
unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 5e8fc32697b1..d298ddbbbfec 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -39,9 +39,6 @@
 #include 
 
 
-/* Number of nodes in fully populated tree of given height */
-static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
-
 /*
  * Radix tree node cache.
  */
@@ -523,51 +520,6 @@ int radix_tree_split_preload(unsigned int old_order, 
unsigned int new_order,
 }
 #endif
 
-/*
- * The same as function above, but preload number of nodes required to insert
- * (1 << order) continuous naturally-aligned elements.
- */
-int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
-{
-   unsigned long nr_subtrees;
-   int nr_nodes, subtree_height;
-
-   /* Preloading doesn't help anything with this gfp mask, skip it */
-   if (!gfpflags_allow_blocking(gfp_mask)) {
-   preempt_disable();
-   return 0;
-   }
-
-   /*
-* Calculate number and height of fully populated subtrees it takes to
-* store (1 << order) elements.
-*/
-   nr_subtrees = 1 << order;
-   for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
-   subtree_height++)
-   nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
-
-   /*
-* The worst case is zero height tree with a single item at index 0 and
-* then inserting items starting at ULONG_MAX - (1 << order).
-*
-* This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
-* 0-index item.
-*/
-   nr_nodes = RADIX_TREE_MAX_PATH;
-
-   /* Plus branch to fully populated subtrees. */
-   nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
-
-   /* Root node is shared. */
-   nr_nodes--;
-
-   /* Plus nodes required to build subtrees. */
-   nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
-
-   return __radix_tree_preload(gfp_mask, nr_nodes);
-}
-
 static unsigned radix_tree_load_root(struct radix_tree_root *root,
struct radix_tree_node **nodep, unsigned long *maxindex)
 {
@@ -2454,31 +2406,6 @@ radix_tree_node_ctor(void *arg)
INIT_LIST_HEAD(>private_list);
 }
 
-static __init unsigned long __maxindex(unsigned int height)
-{
-   unsigned int width = height * RADIX_TREE_MAP_SHIFT;
-   int shift = RADIX_TREE_INDEX_BITS - width;
-
-   if (shift < 0)
-   return ~0UL;
-   if (shift >= BITS_PER_LONG)
-   return 0UL;
-   return ~0UL >> shift;
-}
-
-static __init void radix_tree_init_maxnodes(void)
-{
-   unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1];
-   unsigned int i, j;
-
-   for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
-   height_to_maxindex[i] = __maxindex(i);
-   for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
-   for (j = i; j > 0; j--)
-   height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
-   }
-}
-
 static int radix_tree_cpu_dead(unsigned int cpu)
 {
struct radix_tree_preload *rtp;
@@ -2502,7 +2429,6 @@ void __init radix_tree_init(void)
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
-   radix_tree_init_maxnodes();
ret = cpuhp_setup_state_nocalls(CPUHP_RADIX_DEAD, "lib/radix:dead",
NULL, radix_tree_cpu_dead);
WARN_ON(ret < 0);
-- 
2.10.2



[PATCHv5 14/36] thp: introduce hpage_size() and hpage_mask()

2016-11-29 Thread Kirill A. Shutemov
Introduce new helpers which return size/mask of the page:
HPAGE_PMD_SIZE/HPAGE_PMD_MASK if the page is PageTransHuge() and
PAGE_SIZE/PAGE_MASK otherwise.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/huge_mm.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 97e478d6b690..e5c9c26d2439 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -142,6 +142,20 @@ static inline int hpage_nr_pages(struct page *page)
return 1;
 }
 
+static inline int hpage_size(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_SIZE;
+   return PAGE_SIZE;
+}
+
+static inline unsigned long hpage_mask(struct page *page)
+{
+   if (unlikely(PageTransHuge(page)))
+   return HPAGE_PMD_MASK;
+   return PAGE_MASK;
+}
+
 extern int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
@@ -167,6 +181,8 @@ void mm_put_huge_zero_page(struct mm_struct *mm);
 #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
 
 #define hpage_nr_pages(x) 1
+#define hpage_size(x) PAGE_SIZE
+#define hpage_mask(x) PAGE_MASK
 
 #define transparent_hugepage_enabled(__vma) 0
 
-- 
2.10.2



[PATCHv5 04/36] mm, rmap: account file thp pages

2016-11-29 Thread Kirill A. Shutemov
Let's add FileHugePages and FilePmdMapped fields to meminfo and smaps.
They indicate how many file THPs are allocated and mapped.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 drivers/base/node.c|  6 ++
 fs/proc/meminfo.c  |  4 
 fs/proc/task_mmu.c |  5 -
 include/linux/mmzone.h |  2 ++
 mm/filemap.c   |  3 ++-
 mm/huge_memory.c   |  5 -
 mm/page_alloc.c|  5 +
 mm/rmap.c  | 12 
 mm/vmstat.c|  2 ++
 9 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..45be0ddb84ed 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,6 +116,8 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d AnonHugePages:  %8lu kB\n"
   "Node %d ShmemHugePages: %8lu kB\n"
   "Node %d ShmemPmdMapped: %8lu kB\n"
+  "Node %d FileHugePages: %8lu kB\n"
+  "Node %d FilePmdMapped: %8lu kB\n"
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -139,6 +141,10 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
   HPAGE_PMD_NR),
   nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_THPS) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
   HPAGE_PMD_NR));
 #else
   nid, K(sum_zone_node_page_state(nid, 
NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a428498d6b2..8396843be7a7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -146,6 +146,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
show_val_kb(m, "ShmemPmdMapped: ",
global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR);
+   show_val_kb(m, "FileHugePages: ",
+   global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR);
+   show_val_kb(m, "FilePmdMapped: ",
+   global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR);
 #endif
 
 #ifdef CONFIG_CMA
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d47c723e7bc2..06840421fae3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -442,6 +442,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long shmem_thp;
+   unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -581,7 +582,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
else if (is_zone_device_page(page))
/* pass */;
else
-   VM_BUG_ON_PAGE(1, page);
+   mss->file_thp += HPAGE_PMD_SIZE;
mss->rss_pmd += PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
@@ -848,6 +849,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   "Anonymous:  %8lu kB\n"
   "AnonHugePages:  %8lu kB\n"
   "ShmemPmdMapped: %8lu kB\n"
+  "FilePmdMapped:  %8lu kB\n"
   "Shared_Hugetlb: %8lu kB\n"
   "Private_Hugetlb: %7lu kB\n"
   "Swap:   %8lu kB\n"
@@ -866,6 +868,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   mss.anonymous >> 10,
   mss.anonymous_thp >> 10,
   mss.shmem_thp >> 10,
+  mss.file_thp >> 10,
   mss.shared_hugetlb >> 10,
   mss.private_hugetlb >> 10,
   mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0f088f3a2fed..44a43f576d52 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,8 @@ enum node_stat_item {
NR_SHMEM,   /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
+   NR_FILE_THPS,
+   NR_FILE_PMDMAPPED,
NR_ANON_THPS,
NR_UNSTABLE_NFS,/* NFS unstable pages */
NR_VMSCAN_WRITE,
diff --git a/mm/filemap.c b/mm/filemap.c
index f8607ab7b7e4..16d39340c106 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -240,7 +240,8 @@ void __delete_from_page_cache(struct page *page, 

[PATCHv5 21/36] truncate: make invalidate_inode_pages2_range() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
For huge pages we need to unmap the whole range covered by the huge page.
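
For example (assuming 2MB huge pages): invalidating a single 4k index
that happens to sit inside a THP now unmaps the full 2MB covered by
that page, so no PMD mapping can survive and still point at the
invalidated range.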

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/truncate.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index d2d95f283ec3..6df4b06a190f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -656,27 +656,32 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
continue;
}
wait_on_page_writeback(page);
+
if (page_mapped(page)) {
+   loff_t begin, len;
+
+   begin = index << PAGE_SHIFT;
if (!did_range_unmap) {
/*
 * Zap the rest of the file in one hit.
 */
+   len = (loff_t)(1 + end - index) <<
+   PAGE_SHIFT;
+   if (len < hpage_size(page))
+   len = hpage_size(page);
unmap_mapping_range(mapping,
-  (loff_t)index << PAGE_SHIFT,
-  (loff_t)(1 + end - index)
-<< PAGE_SHIFT,
-0);
+   begin, len, 0);
did_range_unmap = 1;
} else {
/*
 * Just zap this page
 */
-   unmap_mapping_range(mapping,
-  (loff_t)index << PAGE_SHIFT,
-  PAGE_SIZE, 0);
+   len = hpage_size(page);
+   unmap_mapping_range(mapping, begin,
+   len, 0);
}
}
-   BUG_ON(page_mapped(page));
+   VM_BUG_ON_PAGE(page_mapped(page), page);
ret2 = do_launder_page(mapping, page);
if (ret2 == 0) {
if (!invalidate_complete_page2(mapping, page))
@@ -687,9 +692,9 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
unlock_page(page);
}
 pagevec_remove_exceptionals(&pvec);
+   index += pvec.nr ? hpage_nr_pages(pvec.pages[pvec.nr - 1]) : 1;
 pagevec_release(&pvec);
cond_resched();
-   index++;
}
cleancache_invalidate_inode(mapping);
return ret;
-- 
2.10.2



[PATCHv5 19/36] fs: make block_page_mkwrite() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
Adjust the check on whether part of the page is beyond the file size,
and apply compound_head() and page_mapping() where appropriate.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7d333621ccfb..8e21513c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2544,7 +2544,7 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 get_block_t get_block)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
struct inode *inode = file_inode(vma->vm_file);
unsigned long end;
loff_t size;
@@ -2552,7 +2552,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
lock_page(page);
size = i_size_read(inode);
-   if ((page->mapping != inode->i_mapping) ||
+   if ((page_mapping(page) != inode->i_mapping) ||
(page_offset(page) > size)) {
/* We overload EFAULT to mean page got truncated */
ret = -EFAULT;
@@ -2560,10 +2560,10 @@ int block_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf,
}
 
/* page is wholly or partially inside EOF */
-   if (((page->index + 1) << PAGE_SHIFT) > size)
-   end = size & ~PAGE_MASK;
+   if (((page->index + hpage_nr_pages(page)) << PAGE_SHIFT) > size)
+   end = size & ~hpage_mask(page);
else
-   end = PAGE_SIZE;
+   end = hpage_size(page);
 
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
-- 
2.10.2



[PATCHv5 30/36] ext4: make ext4_da_page_release_reservation() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
For huge pages 'stop' must be within HPAGE_PMD_SIZE.
Let's use hpage_size() in the BUG_ON().

We also need to change how we calculate lblk for cluster deallocation.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e89249c03d2f..035256019e16 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1572,7 +1572,7 @@ static void ext4_da_page_release_reservation(struct page 
*page,
int num_clusters;
ext4_fsblk_t lblk;
 
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
@@ -1607,7 +1607,8 @@ static void ext4_da_page_release_reservation(struct page 
*page,
 * need to release the reserved space for that cluster. */
num_clusters = EXT4_NUM_B2C(sbi, to_release);
while (num_clusters > 0) {
-   lblk = (page->index << (PAGE_SHIFT - inode->i_blkbits)) +
+   lblk = ((page->index + offset / PAGE_SIZE) <<
+   (PAGE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
!ext4_find_delalloc_cluster(inode, lblk))
-- 
2.10.2



[PATCHv5 18/36] fs: make block_write_{begin,end}() be able to handle huge pages

2016-11-29 Thread Kirill A. Shutemov
It's more or less straightforward.

Most changes are around getting the offset/len within the page right
and zeroing out the desired part of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 70 +++--
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 090f7edfa6b7..7d333621ccfb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1902,6 +1902,7 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
 {
unsigned int block_start, block_end;
struct buffer_head *head, *bh;
+   bool uptodate = PageUptodate(page);
 
BUG_ON(!PageLocked(page));
if (!page_has_buffers(page))
@@ -1912,21 +1913,21 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
do {
block_end = block_start + bh->b_size;
 
-   if (buffer_new(bh)) {
-   if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (buffer_new(bh) && block_end > from && block_start < to) {
+   if (!uptodate) {
+   unsigned start, size;
 
-   start = max(from, block_start);
-   size = min(to, block_end) - start;
+   start = max(from, block_start);
+   size = min(to, block_end) - start;
 
-   zero_user(page, start, size);
-   set_buffer_uptodate(bh);
-   }
-
-   clear_buffer_new(bh);
-   mark_buffer_dirty(bh);
+   zero_user(page + block_start / PAGE_SIZE,
+   start % PAGE_SIZE,
+   size);
+   set_buffer_uptodate(bh);
}
+
+   clear_buffer_new(bh);
+   mark_buffer_dirty(bh);
}
 
block_start = block_end;
@@ -1992,18 +1993,21 @@ iomap_to_bh(struct inode *inode, sector_t block, struct 
buffer_head *bh,
 int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
+   bool uptodate = PageUptodate(page);
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
head = create_page_buffers(page, inode, 0);
@@ -2016,10 +2020,8 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
-   if (PageUptodate(page)) {
-   if (!buffer_uptodate(bh))
-   set_buffer_uptodate(bh);
-   }
+   if (uptodate && !buffer_uptodate(bh))
+   set_buffer_uptodate(bh);
continue;
}
if (buffer_new(bh))
@@ -2036,23 +2038,28 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
 
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
-   if (PageUptodate(page)) {
+   if (uptodate) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page,
-   to, block_end,
-   

[PATCHv5 26/36] ext4: handle huge pages in ext4_page_mkwrite()

2016-11-29 Thread Kirill A. Shutemov
Trivial: remove assumption on page size.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fa4467e4b129..387aa857770b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5759,7 +5759,7 @@ static int ext4_bh_unmapped(handle_t *handle, struct 
buffer_head *bh)
 
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
loff_t size;
unsigned long len;
int ret;
@@ -5795,10 +5795,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
goto out;
}
 
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
/*
 * Return if we have all the buffers mapped. This avoids the need to do
 * journal_start/journal_stop which can block and take a long time
@@ -5829,7 +5829,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
ret = block_page_mkwrite(vma, vmf, get_block);
if (!ret && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
- PAGE_SIZE, NULL, do_journal_get_write_access)) {
+ hpage_size(page), NULL,
+ do_journal_get_write_access)) {
unlock_page(page);
ret = VM_FAULT_SIGBUS;
ext4_journal_stop(handle);
-- 
2.10.2



[PATCHv5 35/36] mm, fs, ext4: expand use of page_mapping() and page_to_pgoff()

2016-11-29 Thread Kirill A. Shutemov
With huge pages in page cache we see tail pages in more code paths.
This patch replaces direct access to struct page fields with macros
which can handle tail pages properly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c |  2 +-
 fs/ext4/inode.c |  4 ++--
 mm/filemap.c| 24 +---
 mm/memory.c |  2 +-
 mm/page-writeback.c |  2 +-
 mm/truncate.c   |  5 +++--
 6 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 24daf7b9bdb0..c7fe6c9bae25 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -631,7 +631,7 @@ static void __set_page_dirty(struct page *page, struct 
address_space *mapping,
unsigned long flags;
 
spin_lock_irqsave(&mapping->tree_lock, flags);
-   if (page->mapping) {/* Race with truncate? */
+   if (page_mapping(page)) {   /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 263b53ace613..17a767c21dc3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1237,7 +1237,7 @@ static int ext4_write_begin(struct file *file, struct 
address_space *mapping,
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
@@ -2974,7 +2974,7 @@ static int ext4_da_write_begin(struct file *file, struct 
address_space *mapping,
}
 
lock_page(page);
-   if (page->mapping != mapping) {
+   if (page_mapping(page) != mapping) {
/* The page got truncated from under us */
unlock_page(page);
put_page(page);
diff --git a/mm/filemap.c b/mm/filemap.c
index 33974ad1a8ec..be8ccadb915f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -399,7 +399,7 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
struct page *page = pvec.pages[i];
 
/* until radix tree lookup accepts end_index */
-   if (page->index > end)
+   if (page_to_pgoff(page) > end)
continue;
 
page = compound_head(page);
@@ -1227,7 +1227,7 @@ struct page *pagecache_get_page(struct address_space 
*mapping, pgoff_t offset,
}
 
/* Has the page been truncated? */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
unlock_page(page);
put_page(page);
goto repeat;
@@ -1504,7 +1504,8 @@ unsigned find_get_pages_contig(struct address_space 
*mapping, pgoff_t start,
 * otherwise we can get both false positives and false
 * negatives, which is just confusing to the caller.
 */
-   if (page->mapping == NULL || page_to_pgoff(page) != index) {
+   if (page_mapping(page) == NULL ||
+   page_to_pgoff(page) != index) {
put_page(page);
break;
}
@@ -1792,7 +1793,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (!trylock_page(page))
goto page_not_up_to_date;
/* Did it get truncated before we got the lock? */
-   if (!page->mapping)
+   if (!page_mapping(page))
goto page_not_up_to_date_locked;
if (!mapping->a_ops->is_partially_uptodate(page,
offset, iter->count))
@@ -1872,7 +1873,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 
 page_not_up_to_date_locked:
/* Did it get truncated before we got the lock? */
-   if (!page->mapping) {
+   if (!page_mapping(page)) {
unlock_page(page);
put_page(page);
continue;
@@ -1908,7 +1909,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
-   if (page->mapping == NULL) {
+   if (page_mapping(page) == NULL) {
/*
 * invalidate_mapping_pages got it
 */
@@ -2207,12 +2208,12 @@ int f

[PATCHv5 03/36] page-flags: relax page flag policy for few flags

2016-11-29 Thread Kirill A. Shutemov
These flags are in use for filesystems with backing storage: PG_error,
PG_writeback and PG_readahead.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/page-flags.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..a2bef9a41bcf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,7 +253,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
-PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
+PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
@@ -293,15 +293,15 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
-   TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
+   TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
 PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-   TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_NO_TAIL)
+   TESTCLEARFLAG(Readahead, reclaim, PF_NO_TAIL)
 
 #ifdef CONFIG_HIGHMEM
 /*
-- 
2.10.2



[PATCHv5 33/36] ext4: fix SEEK_DATA/SEEK_HOLE for huge pages

2016-11-29 Thread Kirill A. Shutemov
ext4_find_unwritten_pgoff() needs a few tweaks to work with huge pages.
Mostly trivial page_mapping()/page_to_pgoff() conversions, plus an
adjustment to how we find the relevant block.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..7998ac1483c4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -547,7 +547,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 * range, it will be a hole.
 */
if (lastoff < endoff && whence == SEEK_HOLE &&
-   page->index > end) {
+   page_to_pgoff(page) > end) {
found = 1;
*offset = lastoff;
goto out;
@@ -555,7 +555,7 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
 
lock_page(page);
 
-   if (unlikely(page->mapping != inode->i_mapping)) {
+   if (unlikely(page_mapping(page) != inode->i_mapping)) {
unlock_page(page);
continue;
}
@@ -566,8 +566,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
}
 
if (page_has_buffers(page)) {
+   int diff;
lastoff = page_offset(page);
bh = head = page_buffers(page);
+   diff = (page - compound_head(page)) << inode->i_blkbits;
+   while (diff--)
+   bh = bh->b_this_page;
do {
if (buffer_uptodate(bh) ||
buffer_unwritten(bh)) {
@@ -588,8 +592,12 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
} while (bh != head);
}
 
-   lastoff = page_offset(page) + PAGE_SIZE;
+   lastoff = page_offset(page) + hpage_size(page);
unlock_page(page);
+   if (PageTransCompound(page)) {
+   i++;
+   break;
+   }
}
 
/*
@@ -602,7 +610,9 @@ static int ext4_find_unwritten_pgoff(struct inode *inode,
break;
}
 
-   index = pvec.pages[i - 1]->index + 1;
+   index = page_to_pgoff(pvec.pages[i - 1]) + 1;
+   if (PageTransCompound(pvec.pages[i - 1]))
+   index = round_up(index, HPAGE_PMD_NR);
pagevec_release(&pvec);
} while (index <= end);
 
-- 
2.10.2



[PATCHv5 32/36] ext4: make EXT4_IOC_MOVE_EXT work with huge pages

2016-11-29 Thread Kirill A. Shutemov
Adjust how we find the relevant block within the page and how we clear the
required part of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/move_extent.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 6fc14def0c70..2efa9deb47a9 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -210,7 +210,9 @@ mext_page_mkuptodate(struct page *page, unsigned from, 
unsigned to)
return err;
}
if (!buffer_mapped(bh)) {
-   zero_user(page, block_start, blocksize);
+   zero_user(page + block_start / PAGE_SIZE,
+   block_start % PAGE_SIZE,
+   blocksize);
set_buffer_uptodate(bh);
continue;
}
@@ -267,10 +269,11 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
unsigned int tmp_data_size, data_size, replaced_size;
int i, err2, jblocks, retries = 0;
int replaced_count = 0;
-   int from = data_offset_in_page << orig_inode->i_blkbits;
+   int from;
int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
struct super_block *sb = orig_inode->i_sb;
struct buffer_head *bh = NULL;
+   int diff;
 
/*
 * It needs twice the amount of ordinary journal buffers because
@@ -355,6 +358,9 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
goto unlock_pages;
}
 data_copy:
+   diff = (pagep[0] - compound_head(pagep[0])) * blocks_per_page;
+   from = (data_offset_in_page + diff) << orig_inode->i_blkbits;
+   pagep[0] = compound_head(pagep[0]);
*err = mext_page_mkuptodate(pagep[0], from, from + replaced_size);
if (*err)
goto unlock_pages;
@@ -384,7 +390,7 @@ move_extent_per_page(struct file *o_filp, struct inode 
*donor_inode,
if (!page_has_buffers(pagep[0]))
create_empty_buffers(pagep[0], 1 << orig_inode->i_blkbits, 0);
bh = page_buffers(pagep[0]);
-   for (i = 0; i < data_offset_in_page; i++)
+   for (i = 0; i < data_offset_in_page + diff; i++)
bh = bh->b_this_page;
for (i = 0; i < block_len_in_page; i++) {
*err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
-- 
2.10.2



[PATCHv5 31/36] ext4: handle writeback with huge pages

2016-11-29 Thread Kirill A. Shutemov
Modify mpage_map_and_submit_buffers() and mpage_release_unused_pages()
to deal with huge pages.

Mostly the result of trial and error. A critical review would be appreciated.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 61 -
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 035256019e16..ff4f460d3625 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1666,20 +1666,32 @@ static void mpage_release_unused_pages(struct 
mpage_da_data *mpd,
if (nr_pages == 0)
break;
for (i = 0; i < nr_pages; i++) {
-   struct page *page = pvec.pages[i];
+   struct page *page = compound_head(pvec.pages[i]);
+
if (page->index > end)
break;
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
if (invalidate) {
+   unsigned long offset, len;
+
+   offset = (index % hpage_nr_pages(page));
+   len = min_t(unsigned long, end - page->index,
+   hpage_nr_pages(page));
+
if (page_mapped(page))
clear_page_dirty_for_io(page);
-   block_invalidatepage(page, 0, PAGE_SIZE);
+   block_invalidatepage(page, offset << PAGE_SHIFT,
+   len << PAGE_SHIFT);
ClearPageUptodate(page);
}
unlock_page(page);
+   if (PageTransHuge(page))
+   break;
}
-   index = pvec.pages[nr_pages - 1]->index + 1;
+   index = page_to_pgoff(pvec.pages[nr_pages - 1]) + 1;
+   if (PageTransCompound(pvec.pages[nr_pages - 1]))
+   index = round_up(index, HPAGE_PMD_NR);
pagevec_release(&pvec);
}
 }
@@ -2113,16 +2125,16 @@ static int mpage_submit_page(struct mpage_da_data *mpd, 
struct page *page)
loff_t size = i_size_read(mpd->inode);
int err;
 
-   BUG_ON(page->index != mpd->first_page);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   page = compound_head(page);
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
clear_page_dirty_for_io(page);
err = ext4_bio_write_page(&mpd->io_submit, page, len, mpd->wbc, false);
if (!err)
-   mpd->wbc->nr_to_write--;
-   mpd->first_page++;
+   mpd->wbc->nr_to_write -= hpage_nr_pages(page);
+   mpd->first_page = round_up(mpd->first_page + 1, hpage_nr_pages(page));
 
return err;
 }
@@ -2270,12 +2282,16 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
+   unsigned long diff;
 
-   if (page->index > end)
+   if (page_to_pgoff(page) > end)
break;
/* Up to 'end' pages must be contiguous */
-   BUG_ON(page->index != start);
+   BUG_ON(page_to_pgoff(page) != start);
+   diff = (page - compound_head(page)) << bpp_bits;
bh = head = page_buffers(page);
+   while (diff--)
+   bh = bh->b_this_page;
do {
if (lblk < mpd->map.m_lblk)
continue;
@@ -2312,7 +2328,10 @@ static int mpage_map_and_submit_buffers(struct 
mpage_da_data *mpd)
 * supports blocksize < pagesize as we will try to
 * convert potentially unmapped parts of inode.
 */
-   mpd->io_submit.io_end->size += PAGE_SIZE;
+   if (PageTransCompound(page))
+   mpd->io_submit.io_end->size += HPAGE_PMD_SIZE;
+   else
+   mpd->io_submit.io_end->size += PAGE_SIZE;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
if

[PATCHv5 16/36] thp: make thp_get_unmapped_area() respect S_HUGE_MODE

2016-11-29 Thread Kirill A. Shutemov
We want mmap(NULL) to return a PMD-aligned address if the inode can have
huge pages in the page cache.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a15d566b14f6..9c6ba124ba50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -518,10 +518,12 @@ unsigned long thp_get_unmapped_area(struct file *filp, 
unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
 {
loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+   struct inode *inode = filp->f_mapping->host;
 
if (addr)
goto out;
-   if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
+   if ((inode->i_flags & S_HUGE_MODE) == S_HUGE_NEVER &&
+   (!IS_DAX(inode) || !IS_ENABLED(CONFIG_FS_DAX_PMD)))
goto out;
 
addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
-- 
2.10.2



[PATCHv5 07/36] filemap: allocate huge page in page_cache_read(), if allowed

2016-11-29 Thread Kirill A. Shutemov
This patch adds basic functionality to put a huge page into the page cache.

At the moment we only put huge pages into the radix-tree if the range
covered by the huge page is empty.

We ignore shadow entries for now, just remove them from the tree before
inserting the huge page.

Later we can add logic to accumulate information from shadow entries to
return to the caller (average eviction time?).

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/fs.h  |   5 ++
 include/linux/pagemap.h |  21 ++-
 mm/filemap.c| 155 ++--
 3 files changed, 147 insertions(+), 34 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03a5a398ae83..be94b922a22f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1799,6 +1799,11 @@ struct super_operations {
 #else
 #define S_DAX  0   /* Make all the DAX code disappear */
 #endif
+#define S_HUGE_MODE0xc000
+#define S_HUGE_NEVER   0x
+#define S_HUGE_ALWAYS  0x4000
+#define S_HUGE_WITHIN_SIZE 0x8000
+#define S_HUGE_ADVISE  0xc000
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f88d69e2419d..e530e7b3b6b2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -201,14 +201,20 @@ static inline int page_cache_add_speculative(struct page 
*page, int count)
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc_order(gfp_t gfp, unsigned int order);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc_order(gfp_t gfp,
+   unsigned int order)
 {
-   return alloc_pages(gfp, 0);
+   return alloc_pages(gfp, order);
 }
 #endif
 
+static inline struct page *__page_cache_alloc(gfp_t gfp)
+{
+   return __page_cache_alloc_order(gfp, 0);
+}
+
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
return __page_cache_alloc(mapping_gfp_mask(x));
@@ -225,6 +231,15 @@ static inline gfp_t readahead_gfp_mask(struct 
address_space *x)
  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
 }
 
+extern bool __page_cache_allow_huge(struct address_space *x, pgoff_t offset);
+static inline bool page_cache_allow_huge(struct address_space *x,
+   pgoff_t offset)
+{
+   if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+   return false;
+   return __page_cache_allow_huge(x, offset);
+}
+
 typedef int filler_t(void *, struct page *);
 
 pgoff_t page_cache_next_hole(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 16d39340c106..74341f8b831e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -113,37 +113,50 @@
 static int page_cache_tree_insert(struct address_space *mapping,
  struct page *page, void **shadowp)
 {
-   struct radix_tree_node *node;
-   void **slot;
+   struct radix_tree_iter iter;
+   void **slot, *p;
int error;
 
-   error = __radix_tree_create(&mapping->page_tree, page->index, 0,
-   &node, &slot);
-   if (error)
-   return error;
-   if (*slot) {
-   void *p;
+   /* Wipe shadow entries */
+   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+   page->index) {
+   if (iter.index >= page->index + hpage_nr_pages(page))
+   break;
 
p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
-   if (!radix_tree_exceptional_entry(p))
+   if (!p)
+   continue;
+
+   if (!radix_tree_exception(p))
return -EEXIST;
 
+   __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
+   workingset_update_node, mapping);
+
mapping->nrexceptional--;
-   if (!dax_mapping(mapping)) {
-   if (shadowp)
-   *shadowp = p;
-   } else {
+   if (dax_mapping(mapping)) {
/* DAX can replace empty locked entry with a hole */
WARN_ON_ONCE(p !=
dax_radix_locked_entry(0, RADIX_DAX_EMPTY));
/* Wakeup waiters for exceptional entry lock */
dax_wake_mapping_entry_waiter(mapping, page->index, p,
  false);
+   } else if (!PageTransHuge(page) && shadowp) {
+   *shadowp = p;
}
}
-   __radix_tree_replace(&mapping->page_tree, node, slot, page,
-workingset_update_node, mapping);
-   mapping->nrpages++;
+
+   error = __radix_tr

[PATCHv5 20/36] truncate: make truncate_inode_pages_range() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
As with shmem_undo_range(), truncate_inode_pages_range() removes a huge
page if it is fully within the range.

A partial truncate of a huge page zeroes out the truncated part of the THP.

Unlike with shmem, this doesn't prevent us from having holes in the middle
of a huge page: we can still skip writeback of untouched buffers.

With memory-mapped IO we would lose holes in some cases when we have a
THP in the page cache, since we cannot track accesses at the 4k level in
this case.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c|  2 +-
 include/linux/mm.h |  9 +-
 mm/truncate.c  | 86 --
 3 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8e21513c..24daf7b9bdb0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1534,7 +1534,7 @@ void block_invalidatepage(struct page *page, unsigned int 
offset,
/*
 * Check for overflow
 */
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 582844ca0b23..59e74dc57359 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1328,8 +1328,15 @@ int get_kernel_page(unsigned long start, int write, 
struct page **pages);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
-extern void do_invalidatepage(struct page *page, unsigned int offset,
+extern void __do_invalidatepage(struct page *page, unsigned int offset,
  unsigned int length);
+static inline void do_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   if (page_has_private(page))
+   __do_invalidatepage(page, offset, length);
+}
+
 
 int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index eb3a3a45feb6..d2d95f283ec3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,12 +70,12 @@ static void clear_exceptional_entry(struct address_space 
*mapping,
  * point.  Because the caller is about to free (and possibly reuse) those
  * blocks on-disk.
  */
-void do_invalidatepage(struct page *page, unsigned int offset,
+void __do_invalidatepage(struct page *page, unsigned int offset,
   unsigned int length)
 {
void (*invalidatepage)(struct page *, unsigned int, unsigned int);
 
-   invalidatepage = page->mapping->a_ops->invalidatepage;
+   invalidatepage = page_mapping(page)->a_ops->invalidatepage;
 #ifdef CONFIG_BLOCK
if (!invalidatepage)
invalidatepage = block_invalidatepage;
@@ -100,8 +100,7 @@ truncate_complete_page(struct address_space *mapping, 
struct page *page)
if (page->mapping != mapping)
return -EIO;
 
-   if (page_has_private(page))
-   do_invalidatepage(page, 0, PAGE_SIZE);
+   do_invalidatepage(page, 0, hpage_size(page));
 
/*
 * Some filesystems seem to re-dirty the page even after
@@ -273,13 +272,35 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
unlock_page(page);
continue;
}
+
+   if (PageTransHuge(page)) {
+   int j, first = 0, last = HPAGE_PMD_NR - 1;
+
+   if (start > page->index)
+   first = start & (HPAGE_PMD_NR - 1);
+   if (index == round_down(end, HPAGE_PMD_NR))
+   last = (end - 1) & (HPAGE_PMD_NR - 1);
+
+   /* Range starts or ends in the middle of THP */
+   if (first != 0 || last != HPAGE_PMD_NR - 1) {
+   int off, len;
+   for (j = first; j <= last; j++)
+   clear_highpage(page + j);
+   off = first * PAGE_SIZE;
+   len = (last + 1) * PAGE_SIZE - off;
+   do_invalidatepage(page, off, len);
+   unlock_page(page);
+   continue;
+   }
+   }
+
truncate_inode_page(mapping, page);
unlock_page(page);
}
pagevec_remove_exceptionals(&pvec);
+   index += pvec.nr ? hpage_nr_pages(pvec.pages[pvec.nr - 1]) : 1;
pagevec_release(&pvec);
cond_resched();
-   index++;
}
 
 

[PATCHv5 36/36] ext4, vfs: add huge= mount option

2016-11-29 Thread Kirill A. Shutemov
The same four values as in the tmpfs case.

The encryption code is not yet ready to handle huge pages, so we disable
huge page support if the inode has EXT4_INODE_ENCRYPT.
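
Usage example (the device and mount point below are placeholders, not
taken from this patch):

  mount -t ext4 -o huge=always /dev/sdb1 /mnt/test

With huge=always, regular non-encrypted files on that filesystem get
S_HUGE_ALWAYS and become eligible for huge pages in the page cache.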

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/ext4.h  |  5 +
 fs/ext4/inode.c | 30 +++---
 fs/ext4/super.c | 24 
 3 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index aff204f040fc..fb3f81863b53 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1133,6 +1133,11 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DIOREAD_NOLOCK  0x40 /* Enable support for dio read 
nolocking */
 #define EXT4_MOUNT_JOURNAL_CHECKSUM0x80 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT0x100 /* Journal Async 
Commit */
+#define EXT4_MOUNT_HUGE_MODE   0x600 /* Huge support mode: */
+#define EXT4_MOUNT_HUGE_NEVER  0x000
+#define EXT4_MOUNT_HUGE_ALWAYS 0x200
+#define EXT4_MOUNT_HUGE_WITHIN_SIZE0x400
+#define EXT4_MOUNT_HUGE_ADVISE 0x600
 #define EXT4_MOUNT_DELALLOC0x800 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT  0x1000 /* Abort on file data write 
*/
 #define EXT4_MOUNT_BLOCK_VALIDITY  0x2000 /* Block validity checking */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 17a767c21dc3..4c37fd9fb219 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4472,7 +4472,7 @@ int ext4_get_inode_loc(struct inode *inode, struct 
ext4_iloc *iloc)
 void ext4_set_inode_flags(struct inode *inode)
 {
unsigned int flags = EXT4_I(inode)->i_flags;
-   unsigned int new_fl = 0;
+   unsigned int mask, new_fl = 0;
 
if (flags & EXT4_SYNC_FL)
new_fl |= S_SYNC;
@@ -4484,12 +4484,28 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
-   !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
-   !ext4_encrypted_inode(inode))
-   new_fl |= S_DAX;
-   inode_set_flags(inode, new_fl,
-   S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
+   if (S_ISREG(inode->i_mode) && !ext4_encrypted_inode(inode)) {
+   if (test_opt(inode->i_sb, DAX) &&
+   !ext4_should_journal_data(inode) &&
+   !ext4_has_inline_data(inode))
+   new_fl |= S_DAX;
+   switch (test_opt(inode->i_sb, HUGE_MODE)) {
+   case EXT4_MOUNT_HUGE_NEVER:
+   break;
+   case EXT4_MOUNT_HUGE_ALWAYS:
+   new_fl |= S_HUGE_ALWAYS;
+   break;
+   case EXT4_MOUNT_HUGE_WITHIN_SIZE:
+   new_fl |= S_HUGE_WITHIN_SIZE;
+   break;
+   case EXT4_MOUNT_HUGE_ADVISE:
+   new_fl |= S_HUGE_ADVISE;
+   break;
+   }
+   }
+   mask = S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+   S_DIRSYNC | S_DAX | S_HUGE_MODE;
+   inode_set_flags(inode, new_fl, mask);
 }
 
 /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 72b459d2b244..127ddfeae1e0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1296,6 +1296,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum,
+   Opt_huge_never, Opt_huge_always, Opt_huge_within_size, Opt_huge_advise,
 };
 
 static const match_table_t tokens = {
@@ -1376,6 +1377,10 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
+   {Opt_huge_never, "huge=never"},
+   {Opt_huge_always, "huge=always"},
+   {Opt_huge_within_size, "huge=within_size"},
+   {Opt_huge_advise, "huge=advise"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_removed, "check=none"},/* mount option from ext2/3 */
{Opt_removed, "nocheck"},   /* mount option from ext2/3 */
@@ -1494,6 +1499,11 @@ static int clear_qf_name(struct super_block *sb, int 
qtype)
 #define MOPT_NO_EXT3   0x0200
 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING0x0400
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+#define MOPT_HUGE  0x1000
+#else
+#define MOPT_HUGE  MOPT_NOSUPPORT
+#endif
 
 static const struct 

[PATCHv5 11/36] HACK: readahead: alloc huge pages, if allowed

2016-11-29 Thread Kirill A. Shutemov
Most page cache allocation happens via readahead (sync or async), so if
we want to have a significant number of huge pages in the page cache we
need to find a way to allocate them from readahead.

Unfortunately, huge pages don't fit into the current readahead design:
128k max readahead window, assumptions about page size, PageReadahead()
to track hit/miss.

I haven't found a way to get it right yet.

This patch just allocates a huge page if allowed, but doesn't really
provide any readahead if a huge page is allocated. We read out 2M at a
time and I would expect spikes in latency without readahead.

Therefore HACK.

Having said that, I don't think it should prevent huge page support from
being applied. The future will show whether lacking readahead is a big
deal with huge pages in the page cache.

Any suggestions are welcome.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/readahead.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index fb4c99f85618..87e38b522645 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,6 +174,21 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
if (page_offset > end_index)
break;
 
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
+   (!page_idx || !(page_offset % HPAGE_PMD_NR)) &&
+   page_cache_allow_huge(mapping, page_offset)) {
+   page = __page_cache_alloc_order(gfp_mask | __GFP_COMP,
+   HPAGE_PMD_ORDER);
+   if (page) {
+   prep_transhuge_page(page);
+   page->index = round_down(page_offset,
+   HPAGE_PMD_NR);
+   list_add(&page->lru, &page_pool);
+   ret++;
+   goto start_io;
+   }
+   }
+
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
@@ -189,7 +204,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
SetPageReadahead(page);
ret++;
}
-
+start_io:
/*
 * Now start the IO.  We ignore I/O errors - if the page is not
 * uptodate then the caller will launch readpage again, and
-- 
2.10.2



[PATCHv5 28/36] ext4: make ext4_block_write_begin() aware about huge pages

2016-11-29 Thread Kirill A. Shutemov
It simply mirrors the changes made to __block_write_begin_int().

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 35 +--
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d3143dfe9962..21662bcbbbcb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1094,9 +1094,8 @@ int do_journal_get_write_access(handle_t *handle,
 static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
  get_block_t *get_block)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
@@ -1104,10 +1103,14 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
unsigned bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh = wait;
bool decrypt = false;
+   bool uptodate = PageUptodate(page);
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
if (!page_has_buffers(page))
@@ -1120,10 +1123,8 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
block++, block_start = block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
-   if (PageUptodate(page)) {
-   if (!buffer_uptodate(bh))
-   set_buffer_uptodate(bh);
-   }
+   if (uptodate && !buffer_uptodate(bh))
+   set_buffer_uptodate(bh);
continue;
}
if (buffer_new(bh))
@@ -1135,19 +1136,25 @@ static int ext4_block_write_begin(struct page *page, 
loff_t pos, unsigned len,
break;
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
-   if (PageUptodate(page)) {
+   if (uptodate) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page, to, block_end,
-  block_start, from);
+   if (block_end > to || block_start < from) {
+   BUG_ON(to - from  > PAGE_SIZE);
+   zero_user_segments(page +
+   block_start / PAGE_SIZE,
+   to % PAGE_SIZE,
+   (block_start % 
PAGE_SIZE) + blocksize,
+   (block_start % PAGE_SIZE) + blocksize,
+   }
continue;
}
}
-   if (PageUptodate(page)) {
+   if (uptodate) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
continue;
-- 
2.10.2



[PATCHv5 25/36] ext4: make ext4_writepage() work on huge pages

2016-11-29 Thread Kirill A. Shutemov
Change ext4_writepage() and the underlying ext4_bio_write_page().

It basically removes the assumption about page size, inferring it from
struct page instead.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c   | 10 +-
 fs/ext4/page-io.c | 11 +--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ebccc535b15e..fa4467e4b129 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2037,10 +2037,10 @@ static int ext4_writepage(struct page *page,
 
trace_ext4_writepage(page);
size = i_size_read(inode);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
 
page_bufs = page_buffers(page);
/*
@@ -2064,7 +2064,7 @@ static int ext4_writepage(struct page *page,
   ext4_bh_delay_or_unwritten)) {
redirty_page_for_writepage(wbc, page);
if ((current->flags & PF_MEMALLOC) ||
-   (inode->i_sb->s_blocksize == PAGE_SIZE)) {
+   (inode->i_sb->s_blocksize == hpage_size(page))) {
/*
 * For memory cleaning there's no point in writing only
 * some buffers. So just bail out. Warn if we came here
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index d83b0f3c5fe9..360c74daec5c 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -413,6 +413,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
+   BUG_ON(PageTail(page));
 
if (keep_towrite)
set_page_writeback_keepwrite(page);
@@ -429,8 +430,14 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 * the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   if (len < PAGE_SIZE)
-   zero_user_segment(page, len, PAGE_SIZE);
+   if (len < hpage_size(page)) {
+   page += len / PAGE_SIZE;
+   if (len % PAGE_SIZE)
+   zero_user_segment(page, len % PAGE_SIZE, PAGE_SIZE);
+   while (page + 1 == compound_head(page))
+   clear_highpage(++page);
+   page = compound_head(page);
+   }
/*
 * In the first loop we prepare and mark buffers to submit. We have to
 * mark all buffers in the page before submitting so that
-- 
2.10.2



[PATCHv5 08/36] filemap: handle huge pages in do_generic_file_read()

2016-11-29 Thread Kirill A. Shutemov
Most of the work happens on the head page. Only when we need to copy data
to userspace do we find the relevant subpage.

We are still limited to PAGE_SIZE per iteration. Lifting this limitation
would require some more work.
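
For example (indexes chosen only for illustration): when the read position
is at page index 515 and the THP's head page has index 512, the copy below
is fed page + (515 - 512), i.e. the fourth subpage of the compound page.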

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 74341f8b831e..6a2f9ea521fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1749,6 +1749,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (unlikely(page == NULL))
goto no_cached_page;
}
+   page = compound_head(page);
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -1830,7 +1831,8 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 * now we can copy it to user space...
 */
 
-   ret = copy_page_to_iter(page, offset, nr, iter);
+   ret = copy_page_to_iter(page + index - page->index, offset,
+   nr, iter);
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
@@ -2248,6 +2250,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 * because there really aren't any performance issues here
 * and we need to check for errors.
 */
+   page = compound_head(page);
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
if (!error) {
-- 
2.10.2



[PATCHv5 05/36] thp: try to free page's buffers before attempt split

2016-11-29 Thread Kirill A. Shutemov
We want the page to be isolated from the rest of the system before
splitting it. We rely on the page count being 2 for file pages to make
sure nobody uses the page: one pin for the caller, one for the radix-tree.

Filesystems with backing storage can have the page count increased if the
page has buffers.

Let's try to free them before attempting the split, and remove one
guarding VM_BUG_ON_PAGE().

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/buffer_head.h |  1 +
 mm/huge_memory.c| 19 ++-
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d67ab83823ad..fd4134ce9c54 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -400,6 +400,7 @@ extern int __set_page_dirty_buffers(struct page *page);
 #else /* CONFIG_BLOCK */
 
 static inline void buffer_init(void) {}
+static inline int page_has_buffers(struct page *page) { return 0; }
 static inline int try_to_free_buffers(struct page *page) { return 1; }
 static inline int inode_has_buffers(struct inode *inode) { return 0; }
 static inline void invalidate_inode_buffers(struct inode *inode) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91dbab9644be..a15d566b14f6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2111,7 +2112,6 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 
VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
-   VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
if (PageAnon(head)) {
@@ -2140,6 +2140,23 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
goto out;
}
 
+   /* Try to free buffers before attempt split */
+   if (!PageSwapBacked(head) && PagePrivate(page)) {
+   /*
+* We cannot trigger writeback from here due possible
+* recursion if triggered from vmscan, only wait.
+*
+* Caller can trigger writeback it on its own, if safe.
+*/
+   wait_on_page_writeback(head);
+
+   if (page_has_buffers(head) && !try_to_release_page(head,
+   GFP_KERNEL)) {
+   ret = -EBUSY;
+   goto out;
+   }
+   }
+
/* Addidional pin from radix tree */
extra_pins = 1;
anon_vma = NULL;
-- 
2.10.2



[PATCHv5 00/36] ext4: support of huge pages

2016-11-29 Thread Kirill A. Shutemov
Here's a respin of my huge ext4 patchset on top of Matthew's patchset with
a few changes and fixes (see below).

Please review and consider applying.

I don't see any xfstests regressions with huge pages enabled. Patch with
new configurations for xfstests-bld is below.

The basics are the same as with tmpfs[1], which is in Linus' tree now, and
ext4 is built on top of it. The main difference is that we need to handle
read-out from and write-back to backing storage.

As with other THPs, the implementation is built around compound pages:
a naturally aligned collection of pages that the memory management
subsystem [in most cases] treats as a single entity:

  - head page (the first subpage) on LRU represents whole huge page;
  - head page's flags represent state of whole huge page (with few
exceptions);
  - mm can't migrate subpages of the compound page individually;

For THP, we use PMD-sized huge pages.

The head page links buffer heads for the whole huge page.
Dirty/writeback/etc. tracking happens at the per-hugepage level as all
subpages share the same page flags.

lock_page() on any subpage locks the whole huge page for the same reason.

On the radix-tree, a huge page is represented as a multi-order entry of
the corresponding order (HPAGE_PMD_ORDER). This allows us to track
dirty/writeback via radix-tree tags with the same granularity as on
struct page.
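
To illustrate the calling convention this implies (a sketch only: the
function below and its accounting are made up for the example, everything
else is standard kernel API):

/*
 * Illustrative only: a page-cache lookup may hand back any subpage of a
 * THP; per-hugepage state (flags, buffers, dirty/writeback) is reached
 * through the head page.
 */
static unsigned long count_dirty_at(struct address_space *mapping,
				    pgoff_t index)
{
	struct page *page, *head;
	unsigned long nr_dirty = 0;

	page = find_get_page(mapping, index);	/* may return a tail page */
	if (!page)
		return 0;

	head = compound_head(page);	/* head carries the page flags */
	lock_page(head);		/* locking any subpage locks the whole THP */
	if (PageDirty(head))
		nr_dirty = hpage_nr_pages(head);	/* account by real size */
	unlock_page(head);
	put_page(page);

	return nr_dirty;
}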

On IO via syscalls, we are still limited to copying up to PAGE_SIZE per
iteration. The limitation here comes from how copy_page_to_iter() and
copy_page_from_iter() work wrt. highmem: they can only handle one small
page at a time.

On the write side, we also have a problem with assuming small pages: write
length and offset within the page are calculated before we know whether a
small or huge page will be allocated. It's not easy to fix. It looks like
it would require a change to the ->write_begin() interface to accept
len > PAGE_SIZE.

On split_huge_page() we need to free buffers before splitting the page.
Page buffers take an additional pin on the page and can be a vector for
messing with the page during the split; we want to avoid this.
If try_to_free_buffers() fails, split_huge_page() returns -EBUSY.

Readahead doesn't play well with huge pages: 128k max readahead window,
assumptions about page size, PageReadahead() to track hit/miss.  I've got
it to allocate huge pages, but it doesn't provide any readahead as such.
I don't know how to do this right. It's not clear at this point whether we
really need readahead with huge pages. I guess it's good enough for now.

Shadow entries are ignored on allocation -- a recently evicted page is not
promoted to the active list. I'm not sure whether the current workingset
logic is adequate for huge pages. On eviction, we split the huge page and
set up 4k shadow entries as usual.

Unlike tmpfs, ext4 makes use of tags in the radix-tree. The approach I
used for tmpfs -- 512 entries in the radix-tree per huge page -- doesn't
work well if we want a coherent view of the tags. So the first patch
converts tmpfs to use multi-order entries in the radix-tree. The same
infrastructure is then used for ext4.

Encryption doesn't handle huge pages yet. To avoid regressions we just
disable huge pages for the inode if it has EXT4_INODE_ENCRYPT.

Tested with 4k and 1k block sizes, encryption and bigalloc, all with and
without huge=always. I think that's reasonable coverage.

The patchset is also in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugeext4/v5

[1] 
http://lkml.kernel.org/r/1465222029-45942-1-git-send-email-kirill.shute...@linux.intel.com

Changes since v4:
  - Rebase onto updated radix-tree interface;
  - Change interface to page cache lookups wrt. multi-order entries;
  - Do not mess with BIO_MAX_PAGES: ext4_mpage_readpages() now uses
block_read_full_page() for THP read out;
  - Fix work with memcg enabled;
  - Drop bogus VM_BUG_ON() from wp_huge_pmd();

Changes since v3:
  - account huge page to dirty/writeback/reclaimable/etc. according to its
    size. It fixes background writeback.
  - move code that adds huge page to radix-tree to
page_cache_tree_insert() (Jan);
  - make ramdisk work with huge pages;
  - fix unaccounting of shadow entries (Jan);
  - use try_to_release_page() instead of try_to_free_buffers() in
split_huge_page() (Jan);
  -  make thp_get_unmapped_area() respect S_HUGE_MODE;
  - use huge-page aligned address to zap page range in wp_huge_pmd();
  - use ext4_kvmalloc in ext4_mpage_readpages() instead of
kmalloc() (Andreas);

Changes since v2:
  - fix intermittent crash in generic/299;
  - typo (condition inversion) in do_generic_file_read(),
reported by Jitendra;

TODO:
  - on IO via syscalls, copy more than PAGE_SIZE per iteration to/from
userspace;
  - readahead ?;
  - wire up madvise()/fadvise();
  - encryption with huge pages;
  - reclaim of file huge pages can be optimized -- split_huge_page() is not
required for pages with backing storage;

>From f523dd3aad026f5a3f8cbabc0ec69958a0618f6b Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: 

[PATCHv5 10/36] filemap: handle huge pages in filemap_fdatawait_range()

2016-11-29 Thread Kirill A. Shutemov
We write back a whole huge page at a time.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ec976ddcb88a..52be2b457208 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -405,9 +405,14 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
if (page->index > end)
continue;
 
+   page = compound_head(page);
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR;
+   i += index - pvec.pages[i]->index - 1;
+   }
}
pagevec_release(&pvec);
cond_resched();
-- 
2.10.2



[PATCHv5 23/36] mm: account huge pages to dirty, writeback, reclaimable, etc.

2016-11-29 Thread Kirill A. Shutemov
We need to account huge pages according to their size to get background
writeback working properly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/fs-writeback.c   | 10 +++---
 include/linux/backing-dev.h | 10 ++
 include/linux/memcontrol.h  | 22 ++---
 mm/migrate.c|  1 +
 mm/page-writeback.c | 80 +
 mm/rmap.c   |  4 +--
 6 files changed, 74 insertions(+), 53 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef600591d96f..e1c9faddc9e1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -366,8 +366,9 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page) && PageDirty(page)) {
-   __dec_wb_stat(old_wb, WB_RECLAIMABLE);
-   __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+   int nr = hpage_nr_pages(page);
+   __add_wb_stat(old_wb, WB_RECLAIMABLE, -nr);
+   __add_wb_stat(new_wb, WB_RECLAIMABLE, nr);
}
}
 
@@ -376,9 +377,10 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
&mapping->tree_lock);
if (likely(page)) {
+   int nr = hpage_nr_pages(page);
WARN_ON_ONCE(!PageWriteback(page));
-   __dec_wb_stat(old_wb, WB_WRITEBACK);
-   __inc_wb_stat(new_wb, WB_WRITEBACK);
+   __add_wb_stat(old_wb, WB_WRITEBACK, -nr);
+   __add_wb_stat(new_wb, WB_WRITEBACK, nr);
}
}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a947e61..e63487f78824 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -61,6 +61,16 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
__percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
+static inline void add_wb_stat(struct bdi_writeback *wb,
+enum wb_stat_item item, s64 amount)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __add_wb_stat(wb, item, amount);
+   local_irq_restore(flags);
+}
+
 static inline void __inc_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c17f3b7..df014eff82da 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct page;
@@ -503,18 +504,6 @@ static inline void mem_cgroup_update_page_stat(struct page 
*page,
this_cpu_add(page->mem_cgroup->stat->count[idx], val);
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-   mem_cgroup_update_page_stat(page, idx, 1);
-}
-
-static inline void mem_cgroup_dec_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-   mem_cgroup_update_page_stat(page, idx, -1);
-}
-
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
@@ -719,13 +708,8 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
return false;
 }
 
-static inline void mem_cgroup_inc_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
-{
-}
-
-static inline void mem_cgroup_dec_page_stat(struct page *page,
-   enum mem_cgroup_stat_index idx)
+static inline void mem_cgroup_update_page_stat(struct page *page,
+enum mem_cgroup_stat_index idx, int val)
 {
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 0ed24b1fa77b..c274f9d8ac2b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -505,6 +505,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 * are mapped to swap space.
 */
if (newzone != oldzone) {
+   BUG_ON(PageTransHuge(page));
__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
if (PageSwapBacked(page) && !PageSwapCache(page)) {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 47d5b12c460e..d7b905d66add 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2430,19 +24

[PATCHv5 13/36] mm: make write_cache_pages() work on huge pages

2016-11-29 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Let's adjust the iteration accordingly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/page-writeback.c | 17 -
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..582844ca0b23 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1045,6 +1045,7 @@ extern pgoff_t __page_file_index(struct page *page);
  */
 static inline pgoff_t page_index(struct page *page)
 {
+   page = compound_head(page);
if (unlikely(PageSwapCache(page)))
return __page_file_index(page);
return page->index;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e530e7b3b6b2..faa3fa173939 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -546,6 +546,7 @@ static inline void wait_on_page_locked(struct page *page)
  */
 static inline void wait_on_page_writeback(struct page *page)
 {
+   page = compound_head(page);
if (PageWriteback(page))
wait_on_page_bit(page, PG_writeback);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 290e8b7d3181..47d5b12c460e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2209,7 +2209,7 @@ int write_cache_pages(struct address_space *mapping,
 * mapping. However, page->index will not change
 * because we have a reference on the page.
 */
-   if (page->index > end) {
+   if (page_to_pgoff(page) > end) {
/*
 * can't be range_cyclic (1st pass) because
 * end == -1 in that case.
@@ -2218,7 +2218,12 @@ int write_cache_pages(struct address_space *mapping,
break;
}
 
-   done_index = page->index;
+   done_index = page_to_pgoff(page);
+   if (PageTransCompound(page)) {
+   index = round_up(index + 1, HPAGE_PMD_NR);
+   i += HPAGE_PMD_NR -
+   done_index % HPAGE_PMD_NR - 1;
+   }
 
lock_page(page);
 
@@ -2230,7 +2235,7 @@ int write_cache_pages(struct address_space *mapping,
 * even if there is now a new, dirty page at the same
 * pagecache address.
 */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
 continue_unlock:
unlock_page(page);
continue;
@@ -2268,7 +2273,8 @@ int write_cache_pages(struct address_space *mapping,
 * not be suitable for data integrity
 * writeout).
 */
-   done_index = page->index + 1;
+   done_index = compound_head(page)->index
+   + hpage_nr_pages(page);
done = 1;
break;
}
@@ -2280,7 +2286,8 @@ int write_cache_pages(struct address_space *mapping,
 * keep going until we have written all the pages
 * we tagged for writeback prior to entering this loop.
 */
-   if (--wbc->nr_to_write <= 0 &&
+   wbc->nr_to_write -= hpage_nr_pages(page);
+   if (wbc->nr_to_write <= 0 &&
wbc->sync_mode == WB_SYNC_NONE) {
done = 1;
break;
-- 
2.10.2



Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-07 Thread Kirill A. Shutemov
On Mon, Nov 07, 2016 at 07:01:03AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 07, 2016 at 02:13:05PM +0300, Kirill A. Shutemov wrote:
> > It looks like a huge limitation to me.
> 
> The DAX PMD fault code can live just fine with it.

There's no way out for DAX as we map backing storage directly into
userspace. There's no such limitation for the page cache. And I don't see a
point in introducing such a limitation artificially.

Backing storage fragmentation can weigh on the decision of whether to
allocate a huge page, but it shouldn't be a show-stopper.

> And without it performance would suck anyway.

It depends on workload, obviously.

-- 
 Kirill A. Shutemov


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-07 Thread Kirill A. Shutemov
On Wed, Nov 02, 2016 at 07:36:12AM -0700, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> > 
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck. 
> 
> buffer_head handling always sucks.  For the iomap based bufferd write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 latest, but if I get enough other things of my
> plate in time even for 4.10.  I think that's the right way to go for
> THP, especially if we require the fs to allocate the whole huge page
> as a single extent, similar to the DAX PMD mapping case.
> 
> > 2) PMD-sized pages result in increased space & memory usage.
> 
> How so?
> 
> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.

It looks like a huge limitation to me.

> > 4) As you have noted some places like write_begin() still depend on 4k
> > pages which creates a strange mix of places that use subpages and that use
> > head pages.
> 
> Just use the iomap bufferd I/O code and all these issues will go away.

Not really.

I'm looking at iomap_write_actor(): we still calculate 'offset' and
'bytes' based on PAGE_SIZE before we even get the page.
This way we limit ourselves to PAGE_SIZE per iteration.
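For reference, roughly what that loop does today (paraphrased from memory,
not a verbatim quote -- only the shape matters):

	offset = (pos & (PAGE_SIZE - 1));		/* offset within a small page */
	bytes = min_t(unsigned long, PAGE_SIZE - offset,
			iov_iter_count(i));		/* capped before we have the page */
	status = iomap_write_begin(inode, pos, bytes, flags, &page, iomap);

Both values are fixed before we get hold of the page, so even if that page
turns out to be huge we never copy more than PAGE_SIZE per iteration.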

-- 
 Kirill A. Shutemov


Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-11-02 Thread Kirill A. Shutemov
On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to 
> > > mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could 
> > > just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> > 
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> > 
> > PG_dirty and PG_uptodate behaviour inhered from anon-THP (where handling
> > it otherwise doesn't make sense) and handling it differently for file-THP
> > is nightmare from maintenance POV.
> 
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty of
memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale. That's the main reason (along with 4)
why syscall-based IO sucks. We spend a lot of time looking for the desired
block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place?
Why not just an array?

I will look into it, but this sounds like a separate infrastructure change
project.
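To illustrate the difference (hypothetical sketch, not a proposal yet):
today, finding the buffer for a given byte offset means walking the
circular list,

	/* current scheme: follow b_this_page until we reach the right block */
	struct buffer_head *bh = page_buffers(page);

	while (offset >= bh->b_size) {
		offset -= bh->b_size;
		bh = bh->b_this_page;
	}

while with an array it would collapse to a single indexed lookup along the
lines of bh = page_bh_array(page)[offset / blocksize] (page_bh_array() is
made up here). The cost of the walk obviously grows with 512 or 2048
buffers per huge page.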

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap()-IO as we have less
granularity, but the same can be seen just between different
architectures: 4k vs. 64k base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests look fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages.

But again, it's an infrastructure change that would likely affect the
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively small,
so it fits into the page cache, and the write set doesn't hit dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it's useful for workloads that target mmap()-IO.
It will take time to get the feature mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload
 read/write      baseline   stddev   huge=always   stddev   change

sync-read
 read            21439.00   348.14      20297.33   259.62    -5.33%
sync-write
 write            6833.20   147.08       3630.13    52.86   -46.88%
sync-readwrite
 read             4377.17    17.53       2366.33    19.52   -45.94%
 write            4378.50    17.83

Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

2016-10-31 Thread Kirill A. Shutemov
[ My mail system got broken and the original reply didn't get through. Resent. ]

On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > Most of work happans on head page. Only when we need to do copy data to
> > userspace we find relevant subpage.
> > 
> > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > would require some more work.
>
> Hum, I'm kind of lost.

The limitation here comes from how copy_page_to_iter() and
copy_page_from_iter() work wrt. highmem: they can only handle one small
page at a time.

On the write side, we also have a problem with assuming small pages: the write
length and offset within the page are calculated before we know whether a small
or huge page is allocated. It's not easy to fix. It looks like it would require
a change in the ->write_begin() interface to accept len > PAGE_SIZE.
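That's also why, on the read side, the do_generic_file_read() patch in this
series keeps working on the head page and only picks the relevant subpage
right before the copy:

	ret = copy_page_to_iter(page + index - page->index, offset, nr, iter);

with 'nr' still capped at PAGE_SIZE per iteration.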

> Can you point me to some design document / email that would explain some
> high level ideas how are huge pages in page cache supposed to work?

I'll elaborate more in the cover letter of the next revision.

> When are we supposed to operate on the head page and when on subpage?

It's case-by-case. See the explanation above for why we're limited to
PAGE_SIZE here.

> What is protected by the page lock of the head page?

Whole huge page. As with anon pages.

> Do page locks of subpages play any role?

lock_page() on any subpage would lock the whole huge page.

> If understand right, e.g.  pagecache_get_page() will return subpages but
> is it generally safe to operate on subpages individually or do we have
> to be aware that they are part of a huge page?

I tried to make it as transparent as possible: page flag operations will
be redirected to head page, if necessary. Things like page_mapping() and
page_to_pgoff() know about huge pages.

Direct access to struct page fields must be avoided for tail pages as most
of them don't have the meaning you would expect for small pages.
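Roughly, the "redirect to head" part looks like this (simplified sketch of
how the page flag helpers are wired up, not the exact mainline macros):

	/* a flag test on a tail page is answered by its head page */
	static inline int my_page_dirty(struct page *page)
	{
		return test_bit(PG_dirty, &compound_head(page)->flags);
	}

so callers can keep passing whichever subpage they happen to hold.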

> If I understand the motivation right, it is mostly about being able to mmap
> PMD-sized chunks to userspace. So my naive idea would be that we could just
> implement it by allocating PMD sized chunks of pages when adding pages to
> page cache, we don't even have to read them all unless we come from PMD
> fault path.

Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
per-hugepage, one common list of buffer heads...

PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
it otherwise doesn't make sense) and handling it differently for file-THP
is a nightmare from a maintenance POV.

> Reclaim may need to be aware not to split pages unnecessarily
> but that's about it. So I'd like to understand what's wrong with this
> naive idea and why do filesystems need to be aware that someone wants to
> map in PMD sized chunks...

In addition to flags, THP uses some space in struct page of tail pages to
encode additional information. See compound_{mapcount,head,dtor,order},
page_deferred_list().

--
 Kirill A. Shutemov



Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

2016-10-25 Thread Kirill A. Shutemov
On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > > invalidate_inode_page() has expectation about page_count() of the page
> > > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > > dropped. That condition almost never met for THPs -- tail pages are
> > > > pinned to the pagevec.
> > > > 
> > > > Let's drop them, before calling invalidate_inode_page().
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > > > ---
> > > >  mm/truncate.c | 11 +++
> > > >  1 file changed, 11 insertions(+)
> > > > 
> > > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > > index a01cce450a26..ce904e4b1708 100644
> > > > --- a/mm/truncate.c
> > > > +++ b/mm/truncate.c
> > > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct 
> > > > address_space *mapping,
> > > > /* 'end' is in the middle of THP */
> > > > if (index ==  round_down(end, 
> > > > HPAGE_PMD_NR))
> > > > continue;
> > > > +   /*
> > > > +* invalidate_inode_page() expects
> > > > +* page_count(page) == 2 to drop page 
> > > > from page
> > > > +* cache -- drop tail pages references.
> > > > +*/
> > > > +   get_page(page);
> > > > +   pagevec_release();
> > > 
> > > I'm not quite sure why this is needed. When you have multiorder entry in
> > > the radix tree for your huge page, then you should not get more entries in
> > > the pagevec for your huge page. What do I miss?
> > 
> > For compatibility reason find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of huge page in the range
> > (head/tails). See patch [07/41]
> > 
> > So huge page, which is fully in the range it will be pinned up to
> > PAGEVEC_SIZE times.
> 
> Yeah, I see. But then won't it be cleaner to provide iteration method that
> would add to pagevec each radix tree entry (regardless of its order) only
> once and then use it in places where we care? Instead of strange dances
> like you do here?

Maybe. It would require doubling the number of find_get_*() helpers or adding
a flag to each. We have too many already.

And the multi-order entry interface for the radix tree has not settled yet.
I would rather defer such a rework until it is fully shaped.

Let's come back to this later.

> Ultimately we could convert all the places to use these new iteration
> methods but I don't see that as immediately necessary and maybe there are
> places where getting all the subpages in the pagevec actually makes life
> simpler for us (please point me if you know about such place).

I did it the way I did so that I wouldn't have to evaluate each use of
find_get_*() one-by-one. I guessed most of the callers of find_get_page()
would be confused by getting the head page instead of the relevant subpage.
Maybe I was wrong and it would have been easier to make callers work with
that. I don't know...

> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That actually
> breaks some assumptions filesystems make. In particular direct IO code
> assumes that if you do
> 
>   filemap_write_and_wait_range(inode, start, end);
>   invalidate_inode_pages2_range(inode, start, end);
> 
> all the page cache covering start-end *will* be invalidated. Your skipping
> of partial pages breaks this assumption and thus can bring consistency
> issues (e.g. write done using direct IO won't be seen by following buffered
> read).

Actually, invalidate_inode_pages2_range() does invalidate the whole page if
part of it is in the range. I caught this problem during testing.

-- 
 Kirill A. Shutemov


Re: [PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-25 Thread Kirill A. Shutemov
On Thu, Oct 13, 2016 at 03:18:02PM +0200, Jan Kara wrote:
> On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> > On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > > We writeback whole huge page a time.
> > > 
> > > This is one of the things I don't understand. Firstly I didn't see where
> > > changes of writeback like this would happen (maybe they come later).
> > > Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> > > pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> > > at that granularity?
> > 
> > We track dirty/writeback on per-compound pages: meaning we have one
> > dirty/writeback flag for whole compound page, not on every individual
> > 4k subpage. The same story for radix-tree tags.
> > 
> > > BTW, can you also explain why do we need multiorder entries? What do
> > > they solve for us?
> > 
> > It helps us having coherent view on tags in radix-tree: no matter which
> > index we refer from the range huge page covers we will get the same
> > answer on which tags set.
> 
> OK, understand that. But why do we need a coherent view? For which purposes
> exactly do we care that it is not just a bunch of 4k pages that happen to
> be physically contiguous and thus can be mapped in one PMD?

My understanding is that things like PageDirty() should be handled at the
same granularity as PAGECACHE_TAG_DIRTY, otherwise things can go horribly
wrong...

-- 
 Kirill A. Shutemov


Re: [PATCHv4 18/43] block: define BIO_MAX_PAGES to HPAGE_PMD_NR if huge page cache enabled

2016-10-25 Thread Kirill A. Shutemov
On Tue, Oct 25, 2016 at 12:21:22AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 25, 2016 at 03:13:17AM +0300, Kirill A. Shutemov wrote:
> > We are going to do IO a huge page a time. So we need BIO_MAX_PAGES to be
> > at least HPAGE_PMD_NR. For x86-64, it's 512 pages.
> 
> NAK.  The maximum bio size should not depend on an obscure vm config,
> please send a standalone patch increasing the size to the block list,
> with a much long explanation.  Also you can't simply increase the size
> of the largers pool, we'll probably need more pools instead, or maybe
> even implement a similar chaining scheme as we do for struct
> scatterlist.

The size of the required pool depends on the architecture: different
architectures have a different (huge page size)/(base page size) ratio.

Would it be okay if I add one more pool with size equal to HPAGE_PMD_NR,
if it's bigger than BIO_MAX_PAGES and huge pages are enabled?

-- 
 Kirill A. Shutemov


[PATCHv4 22/43] thp: do not treat slab pages as huge in hpage_{nr_pages,size,mask}

2016-10-24 Thread Kirill A. Shutemov
Slab pages can be compound, but we shouldn't treat them as THP for the
purpose of the hpage_* helpers, otherwise it would lead to confusing results.

For instance, ext4 uses slab pages for journal pages and we shouldn't
confuse them with THPs. The easiest way is to exclude them in the hpage_*
helpers.
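To illustrate the confusion this avoids (hypothetical example, the variable
name is made up): a high-order slab allocation is a compound page too, so
without the PageSlab() check

	nr = hpage_nr_pages(journal_slab_page); /* want 1, would get HPAGE_PMD_NR */

would silently over-account such a page as if it were a 2M THP.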

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/huge_mm.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 42934769f256..1300f8bb7523 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -137,21 +137,21 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 }
 static inline int hpage_nr_pages(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_NR;
return 1;
 }
 
 static inline int hpage_size(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_SIZE;
return PAGE_SIZE;
 }
 
 static inline unsigned long hpage_mask(struct page *page)
 {
-   if (unlikely(PageTransHuge(page)))
+   if (unlikely(!PageSlab(page) && PageTransHuge(page)))
return HPAGE_PMD_MASK;
return PAGE_MASK;
 }
-- 
2.9.3



[PATCHv4 18/43] block: define BIO_MAX_PAGES to HPAGE_PMD_NR if huge page cache enabled

2016-10-24 Thread Kirill A. Shutemov
We are going to do IO a huge page at a time, so we need BIO_MAX_PAGES to be
at least HPAGE_PMD_NR. For x86-64, that's 512 pages.
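(Quick arithmetic: on x86-64 HPAGE_PMD_SIZE is 2M and PAGE_SIZE is 4K, so
HPAGE_PMD_NR = 2M / 4K = 512 -- double the current BIO_MAX_PAGES of 256.)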

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 block/bio.c | 3 ++-
 include/linux/bio.h | 4 
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index db85c5753a76..a69062bda3e0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -44,7 +44,8 @@
  */
 #define BV(x) { .nr_vecs = x, .name = "biovec-"__stringify(x) }
 static struct biovec_slab bvec_slabs[BVEC_POOL_NR] __read_mostly = {
-   BV(1), BV(4), BV(16), BV(64), BV(128), BV(BIO_MAX_PAGES),
+   BV(1), BV(4), BV(16), BV(64), BV(128),
+   { .nr_vecs = BIO_MAX_PAGES, .name ="biovec-max_pages" },
 };
 #undef BV
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 97cb48f03dc7..19d0fae9cdd0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -38,7 +38,11 @@
 #define BIO_BUG_ON
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+#define BIO_MAX_PAGES  (HPAGE_PMD_NR > 256 ? HPAGE_PMD_NR : 256)
+#else
 #define BIO_MAX_PAGES  256
+#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.3



[PATCHv4 30/43] mm: account huge pages to dirty, writeback, reclaimable, etc.

2016-10-24 Thread Kirill A. Shutemov
We need to account huge pages according to their size to get background
writeback working properly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/fs-writeback.c   | 10 ---
 include/linux/backing-dev.h | 10 +++
 include/linux/memcontrol.h  |  5 ++--
 mm/migrate.c|  1 +
 mm/page-writeback.c | 67 +
 5 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5da083..2feb8677e69e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -366,8 +366,9 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
						&mapping->tree_lock);
if (likely(page) && PageDirty(page)) {
-   __dec_wb_stat(old_wb, WB_RECLAIMABLE);
-   __inc_wb_stat(new_wb, WB_RECLAIMABLE);
+   int nr = hpage_nr_pages(page);
+   __add_wb_stat(old_wb, WB_RECLAIMABLE, -nr);
+   __add_wb_stat(new_wb, WB_RECLAIMABLE, nr);
}
}
 
@@ -376,9 +377,10 @@ static void inode_switch_wbs_work_fn(struct work_struct 
*work)
struct page *page = radix_tree_deref_slot_protected(slot,
						&mapping->tree_lock);
if (likely(page)) {
+   int nr = hpage_nr_pages(page);
WARN_ON_ONCE(!PageWriteback(page));
-   __dec_wb_stat(old_wb, WB_WRITEBACK);
-   __inc_wb_stat(new_wb, WB_WRITEBACK);
+   __add_wb_stat(old_wb, WB_WRITEBACK, -nr);
+   __add_wb_stat(new_wb, WB_WRITEBACK, nr);
}
}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 43b93a947e61..e63487f78824 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -61,6 +61,16 @@ static inline void __add_wb_stat(struct bdi_writeback *wb,
__percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
 }
 
+static inline void add_wb_stat(struct bdi_writeback *wb,
+enum wb_stat_item item, s64 amount)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __add_wb_stat(wb, item, amount);
+   local_irq_restore(flags);
+}
+
 static inline void __inc_wb_stat(struct bdi_writeback *wb,
 enum wb_stat_item item)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c17f3b7..d24092581442 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct page;
@@ -506,13 +507,13 @@ static inline void mem_cgroup_update_page_stat(struct 
page *page,
 static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
 {
-   mem_cgroup_update_page_stat(page, idx, 1);
+   mem_cgroup_update_page_stat(page, idx, hpage_nr_pages(page));
 }
 
 static inline void mem_cgroup_dec_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
 {
-   mem_cgroup_update_page_stat(page, idx, -1);
+   mem_cgroup_update_page_stat(page, idx, -hpage_nr_pages(page));
 }
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
diff --git a/mm/migrate.c b/mm/migrate.c
index 99250aee1ac1..bfc722959d3e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -505,6 +505,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 * are mapped to swap space.
 */
if (newzone != oldzone) {
+   BUG_ON(PageTransHuge(page));
__dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
__inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
if (PageSwapBacked(page) && !PageSwapCache(page)) {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c76fc90b7039..f903c09940c4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2421,19 +2421,22 @@ void account_page_dirtied(struct page *page, struct 
address_space *mapping)
 
if (mapping_cap_account_dirty(mapping)) {
struct bdi_writeback *wb;
+   struct zone *zone = page_zone(page);
+   pg_data_t *pgdat = page_pgdat(page);
+   int nr = hpage_nr_pages(page);
 
inode_attach_wb(inode, page);
wb = inode_to_wb(inode);
 
mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
-   __inc_node_page_state(page, NR_FILE_DIRTY);
-   __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-   __inc_node_page_state(page, NR_DIRTI

[PATCHv4 17/43] HACK: readahead: alloc huge pages, if allowed

2016-10-24 Thread Kirill A. Shutemov
Most page cache allocation happens via readahead (sync or async), so if
we want to have a significant number of huge pages in the page cache we need
to find a way to allocate them from readahead.

Unfortunately, huge pages don't fit into the current readahead design:
a maximum readahead window of 128, assumptions on page size, and
PageReadahead() to track hits/misses.

I haven't found a way to get it right yet.

This patch just allocates a huge page if allowed, but doesn't really
provide any readahead once a huge page is allocated. We read out 2M at a time
and I would expect spikes in latency without readahead.

Therefore HACK.

Having said that, I don't think it should prevent huge page support from
being applied. The future will show whether lacking readahead is a big deal
with huge pages in the page cache.

Any suggestions are welcome.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/readahead.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index c8a955b1297e..f46a9080f6a9 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,6 +174,21 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
if (page_offset > end_index)
break;
 
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
+   (!page_idx || !(page_offset % HPAGE_PMD_NR)) &&
+   page_cache_allow_huge(mapping, page_offset)) {
+   page = __page_cache_alloc_order(gfp_mask | __GFP_COMP,
+   HPAGE_PMD_ORDER);
+   if (page) {
+   prep_transhuge_page(page);
+   page->index = round_down(page_offset,
+   HPAGE_PMD_NR);
+   list_add(&page->lru, &page_pool);
+   ret++;
+   goto start_io;
+   }
+   }
+
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
@@ -189,7 +204,7 @@ int __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,
SetPageReadahead(page);
ret++;
}
-
+start_io:
/*
 * Now start the IO.  We ignore I/O errors - if the page is not
 * uptodate then the caller will launch readpage again, and
-- 
2.9.3



[PATCHv4 20/43] mm: make write_cache_pages() work on huge pages

2016-10-24 Thread Kirill A. Shutemov
We write back a whole huge page at a time. Let's adjust the iteration accordingly.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/mm.h  |  1 +
 include/linux/pagemap.h |  1 +
 mm/page-writeback.c | 17 -
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a191853faaa..315df8051d06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1056,6 +1056,7 @@ extern pgoff_t __page_file_index(struct page *page);
  */
 static inline pgoff_t page_index(struct page *page)
 {
+   page = compound_head(page);
if (unlikely(PageSwapCache(page)))
return __page_file_index(page);
return page->index;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 712343108d31..f9aa8bede15e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -528,6 +528,7 @@ static inline void wait_on_page_locked(struct page *page)
  */
 static inline void wait_on_page_writeback(struct page *page)
 {
+   page = compound_head(page);
if (PageWriteback(page))
wait_on_page_bit(page, PG_writeback);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..c76fc90b7039 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2200,7 +2200,7 @@ int write_cache_pages(struct address_space *mapping,
 * mapping. However, page->index will not change
 * because we have a reference on the page.
 */
-   if (page->index > end) {
+   if (page_to_pgoff(page) > end) {
/*
 * can't be range_cyclic (1st pass) because
 * end == -1 in that case.
@@ -2209,7 +2209,12 @@ int write_cache_pages(struct address_space *mapping,
break;
}
 
-   done_index = page->index;
+   done_index = page_to_pgoff(page);
+   if (PageTransCompound(page)) {
+   index = round_up(index + 1, HPAGE_PMD_NR);
+   i += HPAGE_PMD_NR -
+   done_index % HPAGE_PMD_NR - 1;
+   }
 
lock_page(page);
 
@@ -2221,7 +2226,7 @@ int write_cache_pages(struct address_space *mapping,
 * even if there is now a new, dirty page at the same
 * pagecache address.
 */
-   if (unlikely(page->mapping != mapping)) {
+   if (unlikely(page_mapping(page) != mapping)) {
 continue_unlock:
unlock_page(page);
continue;
@@ -2259,7 +2264,8 @@ int write_cache_pages(struct address_space *mapping,
 * not be suitable for data integrity
 * writeout).
 */
-   done_index = page->index + 1;
+   done_index = compound_head(page)->index
+   + hpage_nr_pages(page);
done = 1;
break;
}
@@ -2271,7 +2277,8 @@ int write_cache_pages(struct address_space *mapping,
 * keep going until we have written all the pages
 * we tagged for writeback prior to entering this loop.
 */
-   if (--wbc->nr_to_write <= 0 &&
+   wbc->nr_to_write -= hpage_nr_pages(page);
+   if (wbc->nr_to_write <= 0 &&
wbc->sync_mode == WB_SYNC_NONE) {
done = 1;
break;
-- 
2.9.3



[PATCHv4 24/43] fs: make block_read_full_page() be able to read huge page

2016-10-24 Thread Kirill A. Shutemov
The approach is straightforward: for compound pages we read out the whole
huge page.

For a huge page we cannot have the array of buffer head pointers on the
stack -- it's 4096 pointers on x86-64 -- so 'arr' is allocated with kmalloc()
for huge pages.
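(For scale, assuming x86-64: MAX_BUF_PER_PAGE is PAGE_SIZE / 512 = 8, so a
2M compound page needs up to 8 * 512 = 4096 buffer_head pointers, i.e. 32KB
of pointers -- far too much for the stack.)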

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 22 +-
 include/linux/buffer_head.h |  9 +
 include/linux/page-flags.h  |  2 +-
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b205a629001d..35b76b1c0308 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -870,7 +870,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
 
 try_again:
head = NULL;
-   offset = PAGE_SIZE;
+   offset = hpage_size(page);
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
@@ -1465,7 +1465,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
bh->b_page = page;
-   BUG_ON(offset >= PAGE_SIZE);
+   BUG_ON(offset >= hpage_size(page));
if (PageHighMem(page))
/*
 * This catches illegal uses and preserves the offset:
@@ -2238,11 +2238,13 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
 {
struct inode *inode = page->mapping->host;
sector_t iblock, lblock;
-   struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+   struct buffer_head *arr_on_stack[MAX_BUF_PER_PAGE];
+   struct buffer_head *bh, *head, **arr = arr_on_stack;
unsigned int blocksize, bbits;
int nr, i;
int fully_mapped = 1;
 
+   VM_BUG_ON_PAGE(PageTail(page), page);
head = create_page_buffers(page, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
@@ -2253,6 +2255,11 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
nr = 0;
i = 0;
 
+   if (PageTransHuge(page)) {
+   arr = kmalloc(sizeof(struct buffer_head *) * HPAGE_PMD_NR *
+   MAX_BUF_PER_PAGE, GFP_NOFS);
+   }
+
do {
if (buffer_uptodate(bh))
continue;
@@ -2268,7 +2275,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
SetPageError(page);
}
if (!buffer_mapped(bh)) {
-   zero_user(page, i * blocksize, blocksize);
+   zero_user(page + (i * blocksize / PAGE_SIZE),
+   i * blocksize % PAGE_SIZE,
+   blocksize);
if (!err)
set_buffer_uptodate(bh);
continue;
@@ -2294,7 +2303,7 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
if (!PageError(page))
SetPageUptodate(page);
unlock_page(page);
-   return 0;
+   goto out;
}
 
/* Stage two: lock the buffers */
@@ -2316,6 +2325,9 @@ int block_read_full_page(struct page *page, get_block_t 
*get_block)
else
submit_bh(REQ_OP_READ, 0, bh);
}
+out:
+   if (arr != arr_on_stack)
+   kfree(arr);
return 0;
 }
 EXPORT_SYMBOL(block_read_full_page);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 006a8a42acfb..194a85822d5f 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -131,13 +131,14 @@ BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
-#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+#define bh_offset(bh)  ((unsigned long)(bh)->b_data & ~hpage_mask(bh->b_page))
 
 /* If we *know* page->private refers to buffer_heads */
-#define page_buffers(page) \
+#define page_buffers(__page)   \
({  \
-   BUG_ON(!PagePrivate(page)); \
-   ((struct buffer_head *)page_private(page)); \
+   struct page *p = compound_head(__page); \
+   BUG_ON(!PagePrivate(p));\
+   ((struct buffer_head *)page_private(p));\
})
 #define page_has_buffers(page) PagePrivate(page)
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a2bef9a41bcf..20b7684e9298 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -730,7 +730,7 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)

[PATCHv4 32/43] ext4: make ext4_writepage() work on huge pages

2016-10-24 Thread Kirill A. Shutemov
Change ext4_writepage() and the underlying ext4_bio_write_page().

It basically removes the assumption on page size, inferring it from
struct page instead.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c   | 10 +-
 fs/ext4/page-io.c | 11 +--
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9c064727ed62..c36296fbaa23 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2030,10 +2030,10 @@ static int ext4_writepage(struct page *page,
 
trace_ext4_writepage(page);
size = i_size_read(inode);
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
 
page_bufs = page_buffers(page);
/*
@@ -2057,7 +2057,7 @@ static int ext4_writepage(struct page *page,
   ext4_bh_delay_or_unwritten)) {
redirty_page_for_writepage(wbc, page);
if ((current->flags & PF_MEMALLOC) ||
-   (inode->i_sb->s_blocksize == PAGE_SIZE)) {
+   (inode->i_sb->s_blocksize == hpage_size(page))) {
/*
 * For memory cleaning there's no point in writing only
 * some buffers. So just bail out. Warn if we came here
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 0094923e5ebf..20c9635a782e 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -413,6 +413,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 
BUG_ON(!PageLocked(page));
BUG_ON(PageWriteback(page));
+   BUG_ON(PageTail(page));
 
if (keep_towrite)
set_page_writeback_keepwrite(page);
@@ -429,8 +430,14 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 * the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   if (len < PAGE_SIZE)
-   zero_user_segment(page, len, PAGE_SIZE);
+   if (len < hpage_size(page)) {
+   page += len / PAGE_SIZE;
+   if (len % PAGE_SIZE)
+   zero_user_segment(page, len % PAGE_SIZE, PAGE_SIZE);
+   while (page + 1 == compound_head(page))
+   clear_highpage(++page);
+   page = compound_head(page);
+   }
/*
 * In the first loop we prepare and mark buffers to submit. We have to
 * mark all buffers in the page before submitting so that
-- 
2.9.3



[PATCHv4 14/43] filemap: handle huge pages in do_generic_file_read()

2016-10-24 Thread Kirill A. Shutemov
Most of the work happens on the head page. Only when we need to copy data
to userspace do we find the relevant subpage.

We are still limited to PAGE_SIZE per iteration. Lifting this limitation
would require some more work.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f8387488636f..ca4536f2035e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1906,6 +1906,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
if (unlikely(page == NULL))
goto no_cached_page;
}
+   page = compound_head(page);
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -1984,7 +1985,8 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 * now we can copy it to user space...
 */
 
-   ret = copy_page_to_iter(page, offset, nr, iter);
+   ret = copy_page_to_iter(page + index - page->index, offset,
+   nr, iter);
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
@@ -2402,6 +2404,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf)
 * because there really aren't any performance issues here
 * and we need to check for errors.
 */
+   page = compound_head(page);
ClearPageError(page);
error = mapping->a_ops->readpage(file, page);
if (!error) {
-- 
2.9.3



[PATCHv4 19/43] brd: make it handle huge pages

2016-10-24 Thread Kirill A. Shutemov
Do not assume the length of a bio segment is never larger than PAGE_SIZE.
With huge pages it can be HPAGE_PMD_SIZE (2M on x86-64).

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 drivers/block/brd.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..4214163350d2 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -202,12 +202,15 @@ static int copy_to_brd_setup(struct brd_device *brd, 
sector_t sector, size_t n)
size_t copy;
 
copy = min_t(size_t, n, PAGE_SIZE - offset);
+   n -= copy;
if (!brd_insert_page(brd, sector))
return -ENOSPC;
-   if (copy < n) {
+   while (n) {
sector += copy >> SECTOR_SHIFT;
if (!brd_insert_page(brd, sector))
return -ENOSPC;
+   copy = min_t(size_t, n, PAGE_SIZE);
+   n -= copy;
}
return 0;
 }
@@ -242,6 +245,7 @@ static void copy_to_brd(struct brd_device *brd, const void 
*src,
size_t copy;
 
copy = min_t(size_t, n, PAGE_SIZE - offset);
+   n -= copy;
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
 
@@ -249,10 +253,11 @@ static void copy_to_brd(struct brd_device *brd, const 
void *src,
memcpy(dst + offset, src, copy);
kunmap_atomic(dst);
 
-   if (copy < n) {
+   while (n) {
src += copy;
sector += copy >> SECTOR_SHIFT;
-   copy = n - copy;
+   copy = min_t(size_t, n, PAGE_SIZE);
+   n -= copy;
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
 
@@ -274,6 +279,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
size_t copy;
 
copy = min_t(size_t, n, PAGE_SIZE - offset);
+   n -= copy;
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page);
@@ -282,10 +288,11 @@ static void copy_from_brd(void *dst, struct brd_device 
*brd,
} else
memset(dst, 0, copy);
 
-   if (copy < n) {
+   while (n) {
dst += copy;
sector += copy >> SECTOR_SHIFT;
-   copy = n - copy;
+   copy = min_t(size_t, n, PAGE_SIZE);
+   n -= copy;
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page);
-- 
2.9.3



[PATCHv4 25/43] fs: make block_write_{begin,end}() be able to handle huge pages

2016-10-24 Thread Kirill A. Shutemov
It's more or less straightforward.

Most changes are around getting the offset/len within the page right and
zeroing out the desired part of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c | 70 +++--
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 35b76b1c0308..c078f5d74a2a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1859,6 +1859,7 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
 {
unsigned int block_start, block_end;
struct buffer_head *head, *bh;
+   bool uptodate = PageUptodate(page);
 
BUG_ON(!PageLocked(page));
if (!page_has_buffers(page))
@@ -1869,21 +1870,21 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
do {
block_end = block_start + bh->b_size;
 
-   if (buffer_new(bh)) {
-   if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (buffer_new(bh) && block_end > from && block_start < to) {
+   if (!uptodate) {
+   unsigned start, size;
 
-   start = max(from, block_start);
-   size = min(to, block_end) - start;
+   start = max(from, block_start);
+   size = min(to, block_end) - start;
 
-   zero_user(page, start, size);
-   set_buffer_uptodate(bh);
-   }
-
-   clear_buffer_new(bh);
-   mark_buffer_dirty(bh);
+   zero_user(page + block_start / PAGE_SIZE,
+   start % PAGE_SIZE,
+   size);
+   set_buffer_uptodate(bh);
}
+
+   clear_buffer_new(bh);
+   mark_buffer_dirty(bh);
}
 
block_start = block_end;
@@ -1949,18 +1950,21 @@ iomap_to_bh(struct inode *inode, sector_t block, struct 
buffer_head *bh,
 int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block, struct iomap *iomap)
 {
-   unsigned from = pos & (PAGE_SIZE - 1);
-   unsigned to = from + len;
-   struct inode *inode = page->mapping->host;
+   unsigned from, to;
+   struct inode *inode = page_mapping(page)->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
+   bool uptodate = PageUptodate(page);
 
+   page = compound_head(page);
+   from = pos & ~hpage_mask(page);
+   to = from + len;
BUG_ON(!PageLocked(page));
-   BUG_ON(from > PAGE_SIZE);
-   BUG_ON(to > PAGE_SIZE);
+   BUG_ON(from > hpage_size(page));
+   BUG_ON(to > hpage_size(page));
BUG_ON(from > to);
 
head = create_page_buffers(page, inode, 0);
@@ -1973,10 +1977,8 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
-   if (PageUptodate(page)) {
-   if (!buffer_uptodate(bh))
-   set_buffer_uptodate(bh);
-   }
+   if (uptodate && !buffer_uptodate(bh))
+   set_buffer_uptodate(bh);
continue;
}
if (buffer_new(bh))
@@ -1994,23 +1996,28 @@ int __block_write_begin_int(struct page *page, loff_t 
pos, unsigned len,
if (buffer_new(bh)) {
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
-   if (PageUptodate(page)) {
+   if (uptodate) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
-   if (block_end > to || block_start < from)
-   zero_user_segments(page,
- 

[PATCHv4 31/43] ext4: make ext4_mpage_readpages() hugepage-aware

2016-10-24 Thread Kirill A. Shutemov
This patch modifies ext4_mpage_readpages() to deal with huge pages.

We read out 2M at once, so we have to allocate (HPAGE_PMD_NR *
blocks_per_page) sector_t entries for that. I'm not entirely happy with a
kmalloc in this codepath, but I don't see any other option.
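(Rough numbers, assuming x86-64: with 4k blocks that's HPAGE_PMD_NR * 1 = 512
sector_t entries, i.e. 4KB; with 1k blocks it grows to 512 * 4 = 2048 entries,
i.e. 16KB -- hence the allocation instead of the on-stack array.)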

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/readpage.c | 39 +--
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index a81b829d56de..af8436a4702c 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -104,12 +104,12 @@ int ext4_mpage_readpages(struct address_space *mapping,
 
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
-   const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
const unsigned blocksize = 1 << blkbits;
sector_t block_in_file;
sector_t last_block;
sector_t last_block_in_file;
-   sector_t blocks[MAX_BUF_PER_PAGE];
+   sector_t blocks_on_stack[MAX_BUF_PER_PAGE];
+   sector_t *blocks = blocks_on_stack;
unsigned page_block;
struct block_device *bdev = inode->i_sb->s_bdev;
int length;
@@ -122,8 +122,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
map.m_flags = 0;
 
for (; nr_pages; nr_pages--) {
-   int fully_mapped = 1;
-   unsigned first_hole = blocks_per_page;
+   int fully_mapped = 1, nr = nr_pages;
+   unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+   unsigned first_hole;
 
prefetchw(&page->flags);
if (pages) {
@@ -138,10 +139,32 @@ int ext4_mpage_readpages(struct address_space *mapping,
goto confused;
 
block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits);
-   last_block = block_in_file + nr_pages * blocks_per_page;
+
+   if (PageTransHuge(page) &&
+   IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
+   BUILD_BUG_ON(BIO_MAX_PAGES < HPAGE_PMD_NR);
+   nr = HPAGE_PMD_NR * blocks_per_page;
+   /* XXX: need a better solution ? */
+   blocks = ext4_kvmalloc(sizeof(sector_t) * nr, GFP_NOFS);
+   if (!blocks) {
+   if (pages) {
+   delete_from_page_cache(page);
+   goto next_page;
+   }
+   return -ENOMEM;
+   }
+
+   blocks_per_page *= HPAGE_PMD_NR;
+   last_block = block_in_file + blocks_per_page;
+   } else {
+   blocks = blocks_on_stack;
+   last_block = block_in_file + nr * blocks_per_page;
+   }
+
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> 
blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
+   first_hole = blocks_per_page;
page_block = 0;
 
/*
@@ -213,6 +236,8 @@ int ext4_mpage_readpages(struct address_space *mapping,
}
}
if (first_hole != blocks_per_page) {
+   if (PageTransHuge(page))
+   goto confused;
zero_user_segment(page, first_hole << blkbits,
  PAGE_SIZE);
if (first_hole == 0) {
@@ -248,7 +273,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
goto set_error_page;
}
bio = bio_alloc(GFP_KERNEL,
-   min_t(int, nr_pages, BIO_MAX_PAGES));
+   min_t(int, nr, BIO_MAX_PAGES));
if (!bio) {
if (ctx)
fscrypt_release_ctx(ctx);
@@ -289,5 +314,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
BUG_ON(pages && !list_empty(pages));
if (bio)
submit_bio(bio);
+   if (blocks != blocks_on_stack)
+   kfree(blocks);
return 0;
 }
-- 
2.9.3



[PATCHv4 33/43] ext4: handle huge pages in ext4_page_mkwrite()

2016-10-24 Thread Kirill A. Shutemov
Trivial: remove assumption on page size.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c36296fbaa23..5ceb72c7bac1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5649,7 +5649,7 @@ static int ext4_bh_unmapped(handle_t *handle, struct 
buffer_head *bh)
 
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-   struct page *page = vmf->page;
+   struct page *page = compound_head(vmf->page);
loff_t size;
unsigned long len;
int ret;
@@ -5685,10 +5685,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
goto out;
}
 
-   if (page->index == size >> PAGE_SHIFT)
-   len = size & ~PAGE_MASK;
-   else
-   len = PAGE_SIZE;
+   len = hpage_size(page);
+   if (page->index + hpage_nr_pages(page) - 1 == size >> PAGE_SHIFT)
+   len = size & ~hpage_mask(page);
+
/*
 * Return if we have all the buffers mapped. This avoids the need to do
 * journal_start/journal_stop which can block and take a long time
@@ -5719,7 +5719,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf)
ret = block_page_mkwrite(vma, vmf, get_block);
if (!ret && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
- PAGE_SIZE, NULL, do_journal_get_write_access)) {
+ hpage_size(page), NULL,
+ do_journal_get_write_access)) {
unlock_page(page);
ret = VM_FAULT_SIGBUS;
ext4_journal_stop(handle);
-- 
2.9.3



[PATCHv4 16/43] filemap: handle huge pages in filemap_fdatawait_range()

2016-10-24 Thread Kirill A. Shutemov
We write back a whole huge page at a time.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 954720092cf8..ecf5c2dba3fb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -509,9 +509,14 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
if (page->index > end)
continue;
 
+   page = compound_head(page);
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR;
+   i += index - pvec.pages[i]->index - 1;
+   }
}
pagevec_release(&pvec);
cond_resched();
-- 
2.9.3



[PATCHv4 04/43] radix-tree: Add radix_tree_split

2016-10-24 Thread Kirill A. Shutemov
From: Matthew Wilcox <wi...@linux.intel.com>

This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries).  These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused.  The caller should then call radix_tree_for_each_slot() and
radix_tree_replace_slot() in order to turn these retry entries into the
intended new entries.  Tags are replicated from the original multiorder
entry into each new entry.

Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/radix-tree.h|   3 +
 lib/radix-tree.c  | 109 --
 tools/testing/radix-tree/multiorder.c |  26 
 3 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1efd81f21241..f5518f1fe3d7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -319,8 +319,11 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_split(struct radix_tree_root *, unsigned long index,
+   unsigned new_order);
 int radix_tree_join(struct radix_tree_root *, unsigned long index,
unsigned new_order, void *);
+
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6a76252c93a6..c1b835c979ed 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -231,7 +231,10 @@ static void dump_node(struct radix_tree_node *node, 
unsigned long index)
void *entry = node->slots[i];
if (!entry)
continue;
-   if (is_sibling_entry(node, entry)) {
+   if (entry == RADIX_TREE_RETRY) {
+   pr_debug("radix retry offset %ld indices %ld-%ld\n",
+   i, first, last);
+   } else if (is_sibling_entry(node, entry)) {
pr_debug("radix sblng %p offset %ld val %p indices 
%ld-%ld\n",
entry, i,
*(void **)entry_to_node(entry),
@@ -641,7 +644,10 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
unsigned i, n, tag, offset, tags = 0;
 
if (node) {
-   n = 1 << (order - node->shift);
+   if (order > node->shift)
+   n = 1 << (order - node->shift);
+   else
+   n = 1;
offset = get_slot_offset(node, slot);
} else {
n = 1;
@@ -680,7 +686,8 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
tag_set(node, tag, offset);
}
if (radix_tree_is_internal_node(old) &&
-   !is_sibling_entry(node, old))
+   !is_sibling_entry(node, old) &&
+   (old != RADIX_TREE_RETRY))
radix_tree_free_nodes(old);
}
if (node)
@@ -843,6 +850,98 @@ int radix_tree_join(struct radix_tree_root *root, unsigned 
long index,
 
return error;
 }
+
+int radix_tree_split(struct radix_tree_root *root, unsigned long index,
+   unsigned order)
+{
+   struct radix_tree_node *parent, *node, *child;
+   void **slot;
+   unsigned int offset, end;
+   unsigned n, tag, tags = 0;
+
+   if (!__radix_tree_lookup(root, index, &parent, &slot))
+   return -ENOENT;
+   if (!parent)
+   return -ENOENT;
+
+   offset = get_slot_offset(parent, slot);
+
+   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+   if (tag_get(parent, tag, offset))
+   tags |= 1 << tag;
+
+   for (end = offset + 1; end < RADIX_TREE_MAP_SIZE; end++) {
+   if (!is_sibling_entry(parent, parent->slots[end]))
+   break;
+   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+   if (tags & (1 << tag))
+   tag_set(parent, tag, end);
+   /* tags must be set before RETRY is set */
+   rcu_assign_pointer(parent->slots[end], RADIX_TREE_RETRY);
+   }
+
+   if (order == parent->shift)
+   return 0;
+   if (order > parent->shift) {
+   while (offset < end)
+   offset += insert_entries(parent, &parent->slots[offset],
+   RADIX_TREE_RETRY, order, true);
+   return 0;
+   }
+
+   node = parent;
+
+   for (;;) {
+  

[PATCHv4 36/43] ext4: handle huge pages in ext4_da_write_end()

2016-10-24 Thread Kirill A. Shutemov
Call ext4_da_should_update_i_disksize() for the head page, with the offset
relative to the head page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1eae6801846c..59cd2b113eb2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3015,7 +3015,6 @@ static int ext4_da_write_end(struct file *file,
int ret = 0, ret2;
handle_t *handle = ext4_journal_current_handle();
loff_t new_i_size;
-   unsigned long start, end;
int write_mode = (int)(unsigned long)fsdata;
 
if (write_mode == FALL_BACK_TO_NONDELALLOC)
@@ -3023,8 +3022,6 @@ static int ext4_da_write_end(struct file *file,
  len, copied, page, fsdata);
 
trace_ext4_da_write_end(inode, pos, len, copied);
-   start = pos & (PAGE_SIZE - 1);
-   end = start + copied - 1;
 
/*
 * generic_write_end() will run mark_inode_dirty() if i_size
@@ -3033,8 +3030,10 @@ static int ext4_da_write_end(struct file *file,
 */
new_i_size = pos + copied;
if (copied && new_i_size > EXT4_I(inode)->i_disksize) {
+   struct page *head = compound_head(page);
+   unsigned long end = (pos & ~hpage_mask(head)) + copied - 1;
if (ext4_has_inline_data(inode) ||
-   ext4_da_should_update_i_disksize(page, end)) {
+   ext4_da_should_update_i_disksize(head, end)) {
ext4_update_i_disksize(inode, new_i_size);
/* We need to mark inode dirty even if
 * new_i_size is less that inode->i_size
-- 
2.9.3



Re: [PATCHv3 13/41] truncate: make sure invalidate_mapping_pages() can discard huge pages

2016-10-11 Thread Kirill A. Shutemov
On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > invalidate_inode_page() has an expectation about the page_count() of the
> > page -- if it's not 2 (one for the caller, one for the radix-tree), the
> > page will not be dropped. That condition is almost never met for THPs --
> > tail pages are pinned to the pagevec.
> > 
> > Let's drop them, before calling invalidate_inode_page().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > ---
> >  mm/truncate.c | 11 +++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index a01cce450a26..ce904e4b1708 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct 
> > address_space *mapping,
> > /* 'end' is in the middle of THP */
> > if (index ==  round_down(end, HPAGE_PMD_NR))
> > continue;
> > +   /*
> > +* invalidate_inode_page() expects
> > +* page_count(page) == 2 to drop page from page
> > +* cache -- drop tail pages references.
> > +*/
> > +   get_page(page);
> > +   pagevec_release();
> 
> I'm not quite sure why this is needed. When you have multiorder entry in
> the radix tree for your huge page, then you should not get more entries in
> the pagevec for your huge page. What do I miss?

For compatibility reasons, find_get_entries() (which is called by
pagevec_lookup_entries()) collects all subpages of a huge page in the range
(head and tails). See patch [07/41].

So a huge page which lies fully in the range will be pinned up to
PAGEVEC_SIZE times.

-- 
 Kirill A. Shutemov


Re: [PATCHv3 14/41] filemap: allocate huge page in page_cache_read(), if allowed

2016-10-11 Thread Kirill A. Shutemov
On Tue, Oct 11, 2016 at 06:15:45PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:56, Kirill A. Shutemov wrote:
> > This patch adds basic functionality to put huge page into page cache.
> > 
> > At the moment we only put huge pages into radix-tree if the range covered
> > by the huge page is empty.
> > 
> > We ignore shadow entries for now, just remove them from the tree before
> > inserting the huge page.
> > 
> > Later we can add logic to accumulate information from shadow entries to
> > return to the caller (average eviction time?).
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > ---
> >  include/linux/fs.h  |   5 ++
> >  include/linux/pagemap.h |  21 ++-
> >  mm/filemap.c| 148 
> > +++-
> >  3 files changed, 157 insertions(+), 17 deletions(-)
> > 
> ...
> > @@ -663,16 +663,55 @@ static int __add_to_page_cache_locked(struct page 
> > *page,
> > page->index = offset;
> >  
> > spin_lock_irq(>tree_lock);
> > -   error = page_cache_tree_insert(mapping, page, shadowp);
> > +   if (PageTransHuge(page)) {
> > +   struct radix_tree_iter iter;
> > +   void **slot;
> > +   void *p;
> > +
> > +   error = 0;
> > +
> > +   /* Wipe shadow entires */
> > +   radix_tree_for_each_slot(slot, >page_tree, , 
> > offset) {
> > +   if (iter.index >= offset + HPAGE_PMD_NR)
> > +   break;
> > +
> > +   p = radix_tree_deref_slot_protected(slot,
> > +   >tree_lock);
> > +   if (!p)
> > +   continue;
> > +
> > +   if (!radix_tree_exception(p)) {
> > +   error = -EEXIST;
> > +   break;
> > +   }
> > +
> > +   mapping->nrexceptional--;
> > +   rcu_assign_pointer(*slot, NULL);
> 
> I think you also need something like workingset_node_shadows_dec(node)
> here. It would be even better if you used something like
> clear_exceptional_entry() to have the logic in one place (you obviously
> need to factor out only part of clear_exceptional_entry() first).

Good point. Will do.

> > +   }
> > +
> > +   if (!error)
> > +   error = __radix_tree_insert(>page_tree, offset,
> > +   compound_order(page), page);
> > +
> > +   if (!error) {
> > +   count_vm_event(THP_FILE_ALLOC);
> > +   mapping->nrpages += HPAGE_PMD_NR;
> > +   *shadowp = NULL;
> > +   __inc_node_page_state(page, NR_FILE_THPS);
> > +   }
> > +   } else {
> > +   error = page_cache_tree_insert(mapping, page, shadowp);
> > +   }
> 
> And I'd prefer to have this logic moved to page_cache_tree_insert() because
> logically it IMHO belongs there - it is a simply another case of handling
> of radix tree used for page cache.

Okay.

-- 
 Kirill A. Shutemov


Re: [PATCHv3 12/41] thp: handle write-protection faults for file THP

2016-10-11 Thread Kirill A. Shutemov
On Tue, Oct 11, 2016 at 05:47:50PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:54, Kirill A. Shutemov wrote:
> > For filesystems that want to be write-notified (have mkwrite), we will
> > encounter write-protection faults for huge PMDs in shared mappings.
> > 
> > The easiest way to handle them is to clear the PMD and let it refault as
> > writable.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> > ---
> >  mm/memory.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 83be99d9d8a1..aad8d5c6311f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3451,8 +3451,17 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t 
> > orig_pmd)
> > return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
> > fe->flags);
> >  
> > +   if (fe->vma->vm_flags & VM_SHARED) {
> > +   /* Clear PMD */
> > +   zap_page_range_single(fe->vma, fe->address,
> > +   HPAGE_PMD_SIZE, NULL);
> > +   VM_BUG_ON(!pmd_none(*fe->pmd));
> > +
> > +   /* Refault to establish writable PMD */
> > +   return 0;
> > +   }
> > +
> 
> Since we want to write-protect the page table entry on each page writeback
> and write-enable it on the next write, this is relatively expensive.
> Would it be that complicated to handle this fully in ->pmd_fault handler
> like we do for DAX?
> 
> Maybe it doesn't have to be done now but longer term I guess it might make
> sense.

Right. This approach is just simpler to implement. We can rework it if it
shows up in traces.

> Otherwise the patch looks good so feel free to add:
> 
> Reviewed-by: Jan Kara <j...@suse.cz>

Thanks!

-- 
 Kirill A. Shutemov


Re: [PATCHv3 11/41] thp: try to free page's buffers before attempt split

2016-10-11 Thread Kirill A. Shutemov
On Tue, Oct 11, 2016 at 05:40:31PM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:53, Kirill A. Shutemov wrote:
> > We want the page to be isolated from the rest of the system before
> > splitting it. We rely on the page count being 2 for file pages to make
> > sure nobody uses the page: one pin for the caller, one for the radix-tree.
> > 
> > Filesystems with backing storage can have the page count increased if the
> > page has buffers.
> > 
> > Let's try to free them before attempting the split, and remove one
> > guarding VM_BUG_ON_PAGE().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> ...
> > @@ -2041,6 +2041,23 @@ int split_huge_page_to_list(struct page *page, 
> > struct list_head *list)
> > goto out;
> > }
> >  
> > +   /* Try to free buffers before attempt split */
> > +   if (!PageSwapBacked(head) && PagePrivate(page)) {
> > +   /*
> > +* We cannot trigger writeback from here due possible
> > +* recursion if triggered from vmscan, only wait.
> > +*
> > +* Caller can trigger writeback it on its own, if safe.
> > +*/
> > +   wait_on_page_writeback(head);
> > +
> > +   if (page_has_buffers(head) &&
> > +   !try_to_free_buffers(head)) {
> > +   ret = -EBUSY;
> > +   goto out;
> > +   }
> 
> Shouldn't you rather use try_to_release_page() here? Because filesystems
> have their ->releasepage() callbacks for freeing data associated with a
> page. It is not guaranteed page private data are buffers although it is
> true for ext4...

Fair enough. Will fix this.
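
Roughly, the reworked check could look like this (an untested sketch on top
of the hunk quoted above, not the final patch; ret/out come from the
surrounding split_huge_page_to_list()):

	/* Try to free buffers before attempting the split */
	if (!PageSwapBacked(head) && PagePrivate(page)) {
		/*
		 * We cannot trigger writeback from here due to possible
		 * recursion if triggered from vmscan, only wait.
		 */
		wait_on_page_writeback(head);

		/* Let the filesystem's ->releasepage() free page private data */
		if (!try_to_release_page(head, GFP_KERNEL)) {
			ret = -EBUSY;
			goto out;
		}
	}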

-- 
 Kirill A. Shutemov


Re: [PATCHv3 29/41] ext4: make ext4_mpage_readpages() hugepage-aware

2016-09-16 Thread Kirill A. Shutemov
On Thu, Sep 15, 2016 at 02:55:11PM +0300, Kirill A. Shutemov wrote:
> This patch modifies ext4_mpage_readpages() to deal with huge pages.
> 
> We read out 2M at once, so we have to alloc (HPAGE_PMD_NR *
> blocks_per_page) sector_t for that. I'm not entirely happy with kmalloc
> in this codepath, but don't see any other option.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

0-DAY reported this:

compiler: powerpc64-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout d8bfe8f327288810a9a099b15f3c89a834d419a4
# save the attached .config to linux build tree
make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

   In file included from include/linux/linkage.h:4:0,
from include/linux/kernel.h:6,
from fs/ext4/readpage.c:30:
   fs/ext4/readpage.c: In function 'ext4_mpage_readpages':
>> include/linux/compiler.h:491:38: error: call to '__compiletime_assert_144' declared with attribute error: BUILD_BUG_ON failed: BIO_MAX_PAGES < HPAGE_PMD_NR
 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
 ^
   include/linux/compiler.h:474:4: note: in definition of macro 
'__compiletime_assert'
   prefix ## suffix();\ 
  
   ^   
   include/linux/compiler.h:491:2: note: in expansion of macro 
'_compiletime_assert'
 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)   

 ^   
   include/linux/bug.h:51:37: note: in expansion of macro 'compiletime_assert'
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)  
^
   include/linux/bug.h:75:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
 BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)   
 ^  
   fs/ext4/readpage.c:144:4: note: in expansion of macro 'BUILD_BUG_ON'
   BUILD_BUG_ON(BIO_MAX_PAGES < HPAGE_PMD_NR); 
   ^  

The fixup:

diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 6d7cbddceeb2..75b2a7700c9a 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -140,7 +140,8 @@ int ext4_mpage_readpages(struct address_space *mapping,
 
block_in_file = (sector_t)page->index << (PAGE_SHIFT - blkbits);
 
-   if (PageTransHuge(page)) {
+   if (PageTransHuge(page) &&
+   IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
BUILD_BUG_ON(BIO_MAX_PAGES < HPAGE_PMD_NR);
nr = HPAGE_PMD_NR * blocks_per_page;
    /* XXX: need a better solution ? */
-- 
 Kirill A. Shutemov


Re: [PATCHv3 07/41] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2016-09-16 Thread Kirill A. Shutemov
On Thu, Sep 15, 2016 at 02:54:49PM +0300, Kirill A. Shutemov wrote:
> We would need to use multi-order radix-tree entries for ext4 and other
> filesystems to have a coherent view of tags (dirty/towrite) in the tree.
> 
> This patch converts the huge tmpfs implementation to multi-order entries, so
> we will be able to use the same code path for all filesystems.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

0-DAY reported this:

reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   lib/crc32.c:148: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:148: warning: Excess function parameter 'tab' description in 
'crc32_le_generic'
   lib/crc32.c:293: warning: No description found for parameter 'tab)[256]'
   lib/crc32.c:293: warning: Excess function parameter 'tab' description in 
'crc32_be_generic'
   lib/crc32.c:1: warning: no structured comments found
>> mm/filemap.c:1434: warning: No description found for parameter 'start'
>> mm/filemap.c:1434: warning: Excess function parameter 'index' description in 
>> 'find_get_pages_contig'
>> mm/filemap.c:1525: warning: No description found for parameter 'indexp'
>> mm/filemap.c:1525: warning: Excess function parameter 'index' description in 
>> 'find_get_pages_tag'

The fixup:

diff --git a/mm/filemap.c b/mm/filemap.c
index c69b1204744a..1ef20dd45b6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1548,7 +1548,7 @@ repeat:
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
  * @mapping:   The address_space to search
- * @index: The starting page index
+ * @start: The starting page index
  * @nr_pages:  The maximum number of pages
  * @pages: Where the resulting pages are placed
  *
@@ -1641,7 +1641,7 @@ EXPORT_SYMBOL(find_get_pages_contig);
 /**
  * find_get_pages_tag - find and return pages that match @tag
  * @mapping:   the address_space to search
- * @index: the starting page index
+ * @indexp:the starting page index
  * @tag:   the tag index
  * @nr_pages:  the maximum number of pages
  * @pages: where the resulting pages are placed
-- 
 Kirill A. Shutemov


[PATCHv3 04/41] radix-tree: Add radix_tree_split

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox <wi...@linux.intel.com>

This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries).  These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused.  The caller should then call radix_tree_for_each_slot() and
radix_tree_replace_slot() in order to turn these retry entries into the
intended new entries.  Tags are replicated from the original multiorder
entry into each new entry.
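
A minimal caller sketch (illustrative only, with a hypothetical helper name;
locking, preloading and error handling are the caller's responsibility):

/*
 * Sketch only: split one huge-page entry at @index into order-0
 * entries and point each slot at the corresponding subpage of @huge.
 */
static int split_huge_entry(struct radix_tree_root *root,
			    struct page *huge, unsigned long index)
{
	struct radix_tree_iter iter;
	void **slot;
	int err;

	err = radix_tree_split(root, index, 0);
	if (err)
		return err;

	radix_tree_for_each_slot(slot, root, &iter, index) {
		if (iter.index >= index + HPAGE_PMD_NR)
			break;
		/* Turn the RADIX_TREE_RETRY placeholder into a real entry */
		radix_tree_replace_slot(slot, huge + (iter.index - index));
	}
	return 0;
}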

Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/radix-tree.h|   6 +-
 lib/radix-tree.c  | 109 --
 tools/testing/radix-tree/multiorder.c |  26 
 3 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 75ae4648d13d..459e8a152c8a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -280,8 +280,7 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
  struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
-struct radix_tree_node *radix_tree_replace_clear_tags(
-   struct radix_tree_root *root,
+struct radix_tree_node *radix_tree_replace_clear_tags(struct radix_tree_root *,
unsigned long index, void *entry);
 unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
void **results, unsigned long first_index,
@@ -319,8 +318,11 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_split(struct radix_tree_root *, unsigned long index,
+   unsigned new_order);
 int radix_tree_join(struct radix_tree_root *, unsigned long index,
unsigned new_order, void *);
+
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3157f223c268..ad3116cbe61b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -231,7 +231,10 @@ static void dump_node(struct radix_tree_node *node, 
unsigned long index)
void *entry = node->slots[i];
if (!entry)
continue;
-   if (is_sibling_entry(node, entry)) {
+   if (entry == RADIX_TREE_RETRY) {
+   pr_debug("radix retry offset %ld indices %ld-%ld\n",
+   i, first, last);
+   } else if (is_sibling_entry(node, entry)) {
pr_debug("radix sblng %p offset %ld val %p indices 
%ld-%ld\n",
entry, i,
*(void **)entry_to_node(entry),
@@ -641,7 +644,10 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
unsigned i, n, tag, offset, tags = 0;
 
if (node) {
-   n = 1 << (order - node->shift);
+   if (order > node->shift)
+   n = 1 << (order - node->shift);
+   else
+   n = 1;
offset = get_slot_offset(node, slot);
} else {
n = 1;
@@ -680,7 +686,8 @@ static inline int insert_entries(struct radix_tree_node 
*node, void **slot,
tag_set(node, tag, offset);
}
if (radix_tree_is_internal_node(old) &&
-   !is_sibling_entry(node, old))
+   !is_sibling_entry(node, old) &&
+   (old != RADIX_TREE_RETRY))
radix_tree_free_nodes(old);
}
if (node)
@@ -843,6 +850,98 @@ int radix_tree_join(struct radix_tree_root *root, unsigned 
long index,
 
return error;
 }
+
+int radix_tree_split(struct radix_tree_root *root, unsigned long index,
+   unsigned order)
+{
+   struct radix_tree_node *parent, *node, *child;
+   void **slot;
+   unsigned int offset, end;
+   unsigned n, tag, tags = 0;
+
+   if (!__radix_tree_lookup(root, index, , ))
+   return -ENOENT;
+   if (!parent)
+   return -ENOENT;
+
+   offset = get_slot_offset(parent, slot);
+
+   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+   if (tag_get(parent, tag, offset))
+   tags |= 1 << tag;
+
+   for (end = offset + 1; end < RADIX_TREE_MAP_SIZE; end++) {
+   if (!is_sibling_entry(parent, parent->slots[end]))
+   break;
+   

[PATCHv3 03/41] radix-tree: Add radix_tree_join

2016-09-15 Thread Kirill A. Shutemov
From: Matthew Wilcox <wi...@linux.intel.com>

This new function allows for the replacement of many smaller entries in
the radix tree with one larger multiorder entry.  From the point of view
of an RCU walker, they may see a mixture of the smaller entries and the
large entry during the same walk, but they will never see NULL for an
index which was populated before the join.
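
A minimal usage sketch (hypothetical caller; the real page-cache conversion
also has to fix up accounting and tags):

/*
 * Sketch only: replace whatever order-0 entries currently cover the
 * 2MB-aligned range at @index with a single order-9 entry for @huge.
 */
static int join_huge_entry(struct radix_tree_root *root,
			   struct page *huge, unsigned long index)
{
	return radix_tree_join(root, index, HPAGE_PMD_ORDER, huge);
}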

Signed-off-by: Matthew Wilcox <wi...@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/radix-tree.h|   2 +
 lib/radix-tree.c  | 159 +++---
 tools/testing/radix-tree/multiorder.c |  32 +++
 3 files changed, 163 insertions(+), 30 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4c45105dece3..75ae4648d13d 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -319,6 +319,8 @@ static inline void radix_tree_preload_end(void)
preempt_enable();
 }
 
+int radix_tree_join(struct radix_tree_root *, unsigned long index,
+   unsigned new_order, void *);
 /**
  * struct radix_tree_iter - radix tree iterator state
  *
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 1b7bf7314141..3157f223c268 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -314,18 +314,14 @@ static void radix_tree_node_rcu_free(struct rcu_head 
*head)
 {
struct radix_tree_node *node =
container_of(head, struct radix_tree_node, rcu_head);
-   int i;
 
/*
-* must only free zeroed nodes into the slab. radix_tree_shrink
-* can leave us with a non-NULL entry in the first slot, so clear
-* that here to make sure.
+* Must only free zeroed nodes into the slab.  We can be left with
+* non-NULL entries by radix_tree_free_nodes, so clear the entries
+* and tags here.
 */
-   for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
-   tag_clear(node, i, 0);
-
-   node->slots[0] = NULL;
-   node->count = 0;
+   memset(node->slots, 0, sizeof(node->slots));
+   memset(node->tags, 0, sizeof(node->tags));
 
kmem_cache_free(radix_tree_node_cachep, node);
 }
@@ -563,14 +559,14 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
shift = radix_tree_load_root(root, , );
 
/* Make sure the tree is high enough.  */
+   if (order > 0 && max == ((1UL << order) - 1))
+   max++;
if (max > maxindex) {
int error = radix_tree_extend(root, max, shift);
if (error < 0)
return error;
shift = error;
child = root->rnode;
-   if (order == shift)
-   shift += RADIX_TREE_MAP_SHIFT;
}
 
while (shift > order) {
@@ -582,6 +578,7 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
return -ENOMEM;
child->shift = shift;
child->offset = offset;
+   child->count = 0;
child->parent = node;
rcu_assign_pointer(*slot, node_to_entry(child));
if (node)
@@ -595,31 +592,113 @@ int __radix_tree_create(struct radix_tree_root *root, 
unsigned long index,
slot = >slots[offset];
}
 
+   if (nodep)
+   *nodep = node;
+   if (slotp)
+   *slotp = slot;
+   return 0;
+}
+
 #ifdef CONFIG_RADIX_TREE_MULTIORDER
-   /* Insert pointers to the canonical entry */
-   if (order > shift) {
-   unsigned i, n = 1 << (order - shift);
+/*
+ * Free any nodes below this node.  The tree is presumed to not need
+ * shrinking, and any user data in the tree is presumed to not need a
+ * destructor called on it.  If we need to add a destructor, we can
+ * add that functionality later.  Note that we may not clear tags or
+ * slots from the tree as an RCU walker may still have a pointer into
+ * this subtree.  We could replace the entries with RADIX_TREE_RETRY,
+ * but we'll still have to clear those in rcu_free.
+ */
+static void radix_tree_free_nodes(struct radix_tree_node *node)
+{
+   unsigned offset = 0;
+   struct radix_tree_node *child = entry_to_node(node);
+
+   for (;;) {
+   void *entry = child->slots[offset];
+   if (radix_tree_is_internal_node(entry) &&
+   !is_sibling_entry(child, entry)) {
+   child = entry_to_node(entry);
+   offset = 0;
+   continue;
+   }
+   offset++;
+   while (offset == RADIX_TREE_MAP_SIZE) {
+   struct radix_tree_node *old = c

[PATCHv3 07/41] mm, shmem: switch huge tmpfs to multi-order radix-tree entries

2016-09-15 Thread Kirill A. Shutemov
We would need to use multi-order radix-tree entries for ext4 and other
filesystems to have a coherent view of tags (dirty/towrite) in the tree.

This patch converts the huge tmpfs implementation to multi-order entries, so
we will be able to use the same code path for all filesystems.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 320 +--
 mm/huge_memory.c |  47 +---
 mm/khugepaged.c  |  26 ++---
 mm/shmem.c   |  36 ++-
 4 files changed, 247 insertions(+), 182 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287dfc5372..ac3a39b1fe6d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -114,7 +114,7 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
   struct page *page, void *shadow)
 {
struct radix_tree_node *node;
-   int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+   int nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
@@ -132,36 +132,32 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
}
mapping->nrpages -= nr;
 
-   for (i = 0; i < nr; i++) {
-   node = radix_tree_replace_clear_tags(>page_tree,
-   page->index + i, shadow);
-   if (!node) {
-   VM_BUG_ON_PAGE(nr != 1, page);
-   return;
-   }
+   node = radix_tree_replace_clear_tags(>page_tree,
+   page->index, shadow);
+   if (!node)
+   return;
 
-   workingset_node_pages_dec(node);
-   if (shadow)
-   workingset_node_shadows_inc(node);
-   else
-   if (__radix_tree_delete_node(>page_tree, node))
-   continue;
+   workingset_node_pages_dec(node);
+   if (shadow)
+   workingset_node_shadows_inc(node);
+   else
+   if (__radix_tree_delete_node(>page_tree, node))
+   return;
 
-   /*
-* Track node that only contains shadow entries. DAX mappings
-* contain no shadow entries and may contain other exceptional
-* entries so skip those.
-*
-* Avoid acquiring the list_lru lock if already tracked.
-* The list_empty() test is safe as node->private_list is
-* protected by mapping->tree_lock.
-*/
-   if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
-   list_empty(>private_list)) {
-   node->private_data = mapping;
-   list_lru_add(_shadow_nodes,
-   >private_list);
-   }
+   /*
+* Track node that only contains shadow entries. DAX mappings
+* contain no shadow entries and may contain other exceptional
+* entries so skip those.
+*
+* Avoid acquiring the list_lru lock if already tracked.
+* The list_empty() test is safe as node->private_list is
+* protected by mapping->tree_lock.
+*/
+   if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
+   list_empty(>private_list)) {
+   node->private_data = mapping;
+   list_lru_add(_shadow_nodes,
+   >private_list);
}
 }
 
@@ -264,12 +260,7 @@ void delete_from_page_cache(struct page *page)
if (freepage)
freepage(page);
 
-   if (PageTransHuge(page) && !PageHuge(page)) {
-   page_ref_sub(page, HPAGE_PMD_NR);
-   VM_BUG_ON_PAGE(page_count(page) <= 0, page);
-   } else {
-   put_page(page);
-   }
+   put_page(page);
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
@@ -1073,7 +1064,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
void **pagep;
-   struct page *head, *page;
+   struct page *page;
 
rcu_read_lock();
 repeat:
@@ -1094,25 +1085,25 @@ repeat:
goto out;
}
 
-   head = compound_head(page);
-   if (!page_cache_get_speculative(head))
+   if (!page_cache_get_speculative(page))
goto repeat;
 
-   /* The page was split under us? */
-   if (compound_head(page) != head) {
-   put_page(head);
-   goto repeat;
-   }
-
/*
 * Has the page moved?
 * This

[PATCHv3 09/41] page-flags: relax page flag policy for a few flags

2016-09-15 Thread Kirill A. Shutemov
These flags are used by filesystems with backing storage: PG_error,
PG_writeback and PG_readahead. Relax their policy from PF_NO_COMPOUND to
PF_NO_TAIL so they can also be used on file THP head pages.
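
For context, the two policies involved are roughly the following (paraphrased
from include/linux/page-flags.h of this era, not part of the patch):

/* Modifying the flag on a compound page is a bug; no head-page redirect */
#define PF_NO_COMPOUND(page, enforce) ({				\
	VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
	PF_POISONED_CHECK(page); })

/* Operations are redirected to the head page; modifying a tail page is a bug */
#define PF_NO_TAIL(page, enforce) ({					\
	VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);		\
	PF_POISONED_CHECK(compound_head(page)); })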

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/page-flags.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..a2bef9a41bcf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,7 +253,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
-PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
PF_NO_COMPOUND)
+PAGEFLAG(Error, error, PF_NO_TAIL) TESTCLEARFLAG(Error, error, PF_NO_TAIL)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
__SETPAGEFLAG(Referenced, referenced, PF_HEAD)
@@ -293,15 +293,15 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
  * Only test-and-set exist for PG_writeback.  The unconditional operators are
  * risky: they bypass page accounting.
  */
-TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
-   TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
+TESTPAGEFLAG(Writeback, writeback, PF_NO_TAIL)
+   TESTSCFLAG(Writeback, writeback, PF_NO_TAIL)
 PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
 PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
-PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
-   TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Readahead, reclaim, PF_NO_TAIL)
+   TESTCLEARFLAG(Readahead, reclaim, PF_NO_TAIL)
 
 #ifdef CONFIG_HIGHMEM
 /*
-- 
2.9.3



[PATCHv3 10/41] mm, rmap: account file thp pages

2016-09-15 Thread Kirill A. Shutemov
Let's add FileHugePages and FilePmdMapped fields to meminfo and smaps.
They indicate how many file THPs are allocated and PMD-mapped.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 drivers/base/node.c|  6 ++
 fs/proc/meminfo.c  |  4 
 fs/proc/task_mmu.c |  5 -
 include/linux/mmzone.h |  2 ++
 mm/filemap.c   |  3 ++-
 mm/huge_memory.c   |  5 -
 mm/page_alloc.c|  5 +
 mm/rmap.c  | 12 
 mm/vmstat.c|  2 ++
 9 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f9686016..45be0ddb84ed 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -116,6 +116,8 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d AnonHugePages:  %8lu kB\n"
   "Node %d ShmemHugePages: %8lu kB\n"
   "Node %d ShmemPmdMapped: %8lu kB\n"
+  "Node %d FileHugePages: %8lu kB\n"
+  "Node %d FilePmdMapped: %8lu kB\n"
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -139,6 +141,10 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
   HPAGE_PMD_NR),
   nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_THPS) *
+  HPAGE_PMD_NR),
+  nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED) *
   HPAGE_PMD_NR));
 #else
   nid, K(sum_zone_node_page_state(nid, 
NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b9a8c813e5e6..fc8a487bc7ed 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -107,6 +107,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"AnonHugePages:  %8lu kB\n"
"ShmemHugePages: %8lu kB\n"
"ShmemPmdMapped: %8lu kB\n"
+   "FileHugePages:  %8lu kB\n"
+   "FilePmdMapped:  %8lu kB\n"
 #endif
 #ifdef CONFIG_CMA
"CmaTotal:   %8lu kB\n"
@@ -167,6 +169,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
, K(global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
, K(global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR)
, K(global_node_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+   , K(global_node_page_state(NR_FILE_THPS) * HPAGE_PMD_NR)
+   , K(global_node_page_state(NR_FILE_PMDMAPPED) * HPAGE_PMD_NR)
 #endif
 #ifdef CONFIG_CMA
, K(totalcma_pages)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f6fa99eca515..9a1cc4a3407a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,6 +449,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long shmem_thp;
+   unsigned long file_thp;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -584,7 +585,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
else if (is_zone_device_page(page))
/* pass */;
else
-   VM_BUG_ON_PAGE(1, page);
+   mss->file_thp += HPAGE_PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
@@ -779,6 +780,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   "Anonymous:  %8lu kB\n"
   "AnonHugePages:  %8lu kB\n"
   "ShmemPmdMapped: %8lu kB\n"
+  "FilePmdMapped:  %8lu kB\n"
   "Shared_Hugetlb: %8lu kB\n"
   "Private_Hugetlb: %7lu kB\n"
   "Swap:   %8lu kB\n"
@@ -797,6 +799,7 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
   mss.anonymous >> 10,
   mss.anonymous_thp >> 10,
   mss.shmem_thp >> 10,
+  mss.file_thp >> 10,
   mss.shared_hugetlb >> 10,
   mss.private_hugetlb >> 10,
   mss.swap >> 10,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99e5daf..20c5fce13697 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,8 @@ enum node_stat_item {
NR_SHMEM,   /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAP

[PATCHv3 17/41] filemap: handle huge pages in filemap_fdatawait_range()

2016-09-15 Thread Kirill A. Shutemov
We write back the whole huge page at a time.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/filemap.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 05b42d3e5ed8..53da93156e60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -372,9 +372,14 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
if (page->index > end)
continue;
 
+   page = compound_head(page);
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
+   if (PageTransHuge(page)) {
+   index = page->index + HPAGE_PMD_NR;
+   i += index - pvec.pages[i]->index - 1;
+   }
}
pagevec_release();
cond_resched();
-- 
2.9.3



[PATCHv3 26/41] truncate: make truncate_inode_pages_range() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
As with shmem_undo_range(), truncate_inode_pages_range() removes huge
pages if they fall fully within the range.

A partial truncate of a huge page zeroes out that part of the THP.

Unlike with shmem, this doesn't prevent holes in the middle of a huge
page: we can still skip writeback of untouched buffers.

With memory-mapped IO we would lose holes in some cases when we have a
THP in the page cache, since we cannot track accesses at the 4k level in
that case.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/buffer.c   |  2 +-
 mm/truncate.c | 95 ++-
 2 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e53808e790e2..20898b051044 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1534,7 +1534,7 @@ void block_invalidatepage(struct page *page, unsigned int 
offset,
/*
 * Check for overflow
 */
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
diff --git a/mm/truncate.c b/mm/truncate.c
index ce904e4b1708..9c339e6255f2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -90,7 +90,7 @@ void do_invalidatepage(struct page *page, unsigned int offset,
 {
void (*invalidatepage)(struct page *, unsigned int, unsigned int);
 
-   invalidatepage = page->mapping->a_ops->invalidatepage;
+   invalidatepage = page_mapping(page)->a_ops->invalidatepage;
 #ifdef CONFIG_BLOCK
if (!invalidatepage)
invalidatepage = block_invalidatepage;
@@ -116,7 +116,7 @@ truncate_complete_page(struct address_space *mapping, 
struct page *page)
return -EIO;
 
if (page_has_private(page))
-   do_invalidatepage(page, 0, PAGE_SIZE);
+   do_invalidatepage(page, 0, hpage_size(page));
 
/*
 * Some filesystems seem to re-dirty the page even after
@@ -288,6 +288,36 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
unlock_page(page);
continue;
}
+
+   if (PageTransTail(page)) {
+   /* Middle of THP: zero out the page */
+   clear_highpage(page);
+   if (page_has_private(page)) {
+   int off = page - compound_head(page);
+   do_invalidatepage(compound_head(page),
+   off * PAGE_SIZE,
+   PAGE_SIZE);
+   }
+   unlock_page(page);
+   continue;
+   } else if (PageTransHuge(page)) {
+   if (index == round_down(end, HPAGE_PMD_NR)) {
+   /*
+* Range ends in the middle of THP:
+* zero out the page
+*/
+   clear_highpage(page);
+   if (page_has_private(page)) {
+   do_invalidatepage(page, 0,
+   PAGE_SIZE);
+   }
+   unlock_page(page);
+   continue;
+   }
+   index += HPAGE_PMD_NR - 1;
+   i += HPAGE_PMD_NR - 1;
+   }
+
truncate_inode_page(mapping, page);
unlock_page(page);
}
@@ -309,9 +339,12 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
wait_on_page_writeback(page);
zero_user_segment(page, partial_start, top);
cleancache_invalidate_page(mapping, page);
-   if (page_has_private(page))
-   do_invalidatepage(page, partial_start,
- top - partial_start);
+   if (page_has_private(page)) {
+   int off = page - compound_head(page);
+   do_invalidatepage(compound_head(page),
+   off * PAGE_SIZE + partial_start,
+   top - partial_start);
+   }
unlock_page(page);
put_page(page);
}
@@ -322,9 +355,12 @@ void truncate_inode_pages_range(struct address_space 
*mapping,
 

[PATCHv3 35/41] ext4: make ext4_da_page_release_reservation() aware about huge pages

2016-09-15 Thread Kirill A. Shutemov
For huge pages 'stop' must be within HPAGE_PMD_SIZE.
Let's use hpage_size() in the BUG_ON().

We also need to change how we calculate lblk for cluster deallocation.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 fs/ext4/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index deacd3499ec7..6a8da1a8409c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1558,7 +1558,7 @@ static void ext4_da_page_release_reservation(struct page 
*page,
int num_clusters;
ext4_fsblk_t lblk;
 
-   BUG_ON(stop > PAGE_SIZE || stop < length);
+   BUG_ON(stop > hpage_size(page) || stop < length);
 
head = page_buffers(page);
bh = head;
@@ -1593,7 +1593,8 @@ static void ext4_da_page_release_reservation(struct page 
*page,
 * need to release the reserved space for that cluster. */
num_clusters = EXT4_NUM_B2C(sbi, to_release);
while (num_clusters > 0) {
-   lblk = (page->index << (PAGE_SHIFT - inode->i_blkbits)) +
+   lblk = ((page->index + offset / PAGE_SIZE) <<
+   (PAGE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
!ext4_find_delalloc_cluster(inode, lblk))
-- 
2.9.3


