Re: [PATCH v1] mm: hugetlb: fix hugepage memory leak caused by wrong reserve count

2015-11-20 Thread Mike Kravetz
On 11/19/2015 11:57 PM, Hillf Danton wrote:
>>
>> When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back to
>> alloc_buddy_huge_page() to directly create a hugepage from the buddy
>> allocator. In that case, however, if alloc_buddy_huge_page() succeeds we
>> don't decrement h->resv_huge_pages, which means that a successful
>> hugetlb_fault() returns without releasing the reserve count. As a result,
>> a subsequent hugetlb_fault() might fail even though there are still free
>> hugepages.
>>
>> This patch simply adds decrementing code on that code path.

In general, I agree with the patch.  If we allocate a huge page via the
buddy allocator and that page will be used to satisfy a reservation, then
we need to decrement the reservation count.

As Hillf mentions, this code is not exactly the same in linux-next.
Specifically, there is a new call to take the memory policy of the
vma into account when calling the buddy allocator.  I do not think
this impacts your proposed change, but you may want to test with that
in place.

>>
>> I reproduced this problem when testing a v4.3 kernel in the following
>> situation:
>> - the test machine/VM is a NUMA system,
>> - hugepage overcommitting is enabled,
>> - most of hugepages are allocated and there's only one free hugepage
>>   which is on node 0 (for example),
>> - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
>>   node 1, tries to allocate a hugepage,

I am curious about this scenario.  When this second program attempts to
allocate the page, I assume it creates a reservation first.  Is this
reservation before or after setting mempolicy?  If the mempolicy was set
first, I would have expected the reservation to allocate a page on
node 1 to satisfy the reservation.
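
For reference, here is a minimal userspace sketch of the second program in
the quoted scenario as I read it (the node number, the 2MB huge page size,
and the use of anonymous MAP_HUGETLB are my assumptions; it needs libnuma's
<numaif.h>, build with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* bind to node 1 only */
	size_t len = 2UL * 1024 * 1024;		/* one 2MB huge page */
	char *p;

	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
		perror("set_mempolicy");

	/* Without MAP_NORESERVE, the reservation is taken at mmap() time. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, len);	/* hugetlb_fault(); may fall back to the buddy allocator */
	munmap(p, len);
	return 0;
}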

-- 
Mike Kravetz

>> - the allocation should fail but the reserve count is still held.
>>
>> Signed-off-by: Naoya Horiguchi 
>> Cc:  [3.16+]
>> ---
>> - the reason why I set stable target to "3.16+" is that this patch can be
>>   applied easily/automatically to these versions. But this bug seems to be
>>   an old one, so if you are interested in backporting to older kernels,
>>   please let me know.
>> ---
>>  mm/hugetlb.c |5 -
>>  1 files changed, 4 insertions(+), 1 deletions(-)
>>
>> diff --git v4.3/mm/hugetlb.c v4.3_patched/mm/hugetlb.c
>> index 9cc7734..77c518c 100644
>> --- v4.3/mm/hugetlb.c
>> +++ v4.3_patched/mm/hugetlb.c
>> @@ -1790,7 +1790,10 @@ struct page *alloc_huge_page(struct vm_area_struct 
>> *vma,
>>  page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
>>  if (!page)
>>  goto out_uncharge_cgroup;
>> -
>> +if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
>> +SetPagePrivate(page);
>> +h->resv_huge_pages--;
>> +}
> 
> I am wondering if this patch was prepared against the next tree.
> 
>>  spin_lock(&hugetlb_lock);
>>  list_move(&page->lru, &h->hugepage_activelist);
>>  /* Fall through */
>> --
>> 1.7.1
> 


[PATCH -next RESEND] sparc: Hook up userfaultfd system call

2015-11-20 Thread Mike Kravetz
After hooking up the system call, the userfaultfd selftest was successful
for both the 32 and 64 bit versions of the test.
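
As a quick smoke test (just a sketch, not the kselftest itself), the new
entry can be exercised directly; __NR_userfaultfd is 352 on sparc per this
patch:

#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_userfaultfd
#define __NR_userfaultfd 352	/* sparc value added by this patch */
#endif

int main(void)
{
	long fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0) {
		perror("userfaultfd");	/* ENOSYS if the syscall is not wired up */
		return 1;
	}
	printf("userfaultfd fd = %ld\n", fd);
	return 0;
}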

Signed-off-by: Mike Kravetz 
---
 arch/sparc/include/uapi/asm/unistd.h | 3 ++-
 arch/sparc/kernel/systbls_32.S   | 2 +-
 arch/sparc/kernel/systbls_64.S   | 4 ++--
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/include/uapi/asm/unistd.h 
b/arch/sparc/include/uapi/asm/unistd.h
index efe9479..f31a124 100644
--- a/arch/sparc/include/uapi/asm/unistd.h
+++ b/arch/sparc/include/uapi/asm/unistd.h
@@ -417,8 +417,9 @@
 #define __NR_bpf   349
 #define __NR_execveat  350
 #define __NR_membarrier351
+#define __NR_userfaultfd   352
 
-#define NR_syscalls352
+#define NR_syscalls353
 
 /* Bitmask values returned from kern_features system call.  */
 #define KERN_FEATURE_MIXED_MODE_STACK  0x0001
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index cc23b62..78e8029 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -87,4 +87,4 @@ sys_call_table:
 /*335*/.long sys_syncfs, sys_sendmmsg, sys_setns, 
sys_process_vm_readv, sys_process_vm_writev
 /*340*/.long sys_ni_syscall, sys_kcmp, sys_finit_module, 
sys_sched_setattr, sys_sched_getattr
 /*345*/.long sys_renameat2, sys_seccomp, sys_getrandom, 
sys_memfd_create, sys_bpf
-/*350*/.long sys_execveat, sys_membarrier
+/*350*/.long sys_execveat, sys_membarrier, sys_userfaultfd
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index f229468..2549c2c 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -88,7 +88,7 @@ sys_call_table32:
.word sys_syncfs, compat_sys_sendmmsg, sys_setns, 
compat_sys_process_vm_readv, compat_sys_process_vm_writev
 /*340*/.word sys_kern_features, sys_kcmp, sys_finit_module, 
sys_sched_setattr, sys_sched_getattr
.word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, 
sys_bpf
-/*350*/.word sys32_execveat, sys_membarrier
+/*350*/.word sys32_execveat, sys_membarrier, sys_userfaultfd
 
 #endif /* CONFIG_COMPAT */
 
@@ -168,4 +168,4 @@ sys_call_table:
.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, 
sys_process_vm_writev
 /*340*/.word sys_kern_features, sys_kcmp, sys_finit_module, 
sys_sched_setattr, sys_sched_getattr
.word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, 
sys_bpf
-/*350*/.word sys64_execveat, sys_membarrier
+/*350*/.word sys64_execveat, sys_membarrier, sys_userfaultfd
-- 
2.4.3



[PATCH 00/10] hugetlbfs: add fallocate support

2015-07-02 Thread Mike Kravetz
This patch set has gone through several revisions as an RFC which
is documented in the log below.  The major change in this version
is the addition of code to handle the region_chg/region_add calling
sequence with the addition of fallocate hole punch.  This is the
first patch in the series.  As suggested during the RFC process,
tests have been proposed to libhugetlbfs as described at:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
fallocate(2) man page modifications are also necessary to specify
that fallocate for hugetlbfs only operates on whole pages.  This
change will be submitted once the code has stabilized and been
proposed for merging.

hugetlbfs is used today by applications that want a high degree of
control over huge page usage.  Often, large hugetlbfs files are used
to map a large number of huge pages into the application processes.
The applications know when page ranges within these large files will
no longer be used, and ideally would like to release them back to
the subpool or global pools for other uses.  The fallocate() system
call provides an interface for preallocation and hole punching within
files.  This patch set adds fallocate functionality to hugetlbfs.
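
For illustration, the intended usage from userspace looks like the
following sketch (the mount point, file name, and 2MB huge page size are
examples only, not part of this series):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const long hpage = 2L * 1024 * 1024;	/* huge page size */
	int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Preallocate 8 huge pages worth of the file. */
	if (fallocate(fd, 0, 0, 8 * hpage))
		perror("fallocate(preallocate)");
	/* Punch out huge pages 2 and 3, returning them to the pools. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      2 * hpage, 2 * hpage))
		perror("fallocate(punch hole)");
	close(fd);
	return 0;
}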

v1:
  Add a cache of region descriptors to the resv_map for use by
region_add in case hole punch deletes entries necessary for
a successful operation.
RFC v4:
  Removed alloc_huge_page/hugetlb_reserve_pages race patches as already
in mmotm
  Moved hugetlb_fix_reserve_counts in series as suggested by Naoya Horiguchi
  Inline'ed hugetlb_fault_mutex routines as suggested by Davidlohr Bueso and
existing code changed to use new interfaces as suggested by Naoya
  fallocate preallocation code cleaned up and made simpler
  Modified alloc_huge_page to handle special case where allocation is
for a hole punched area with spool reserves
RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
in current code
  fallocate allocation and hole punch is synchronized with page
faults via existing mutex table
   hole punch uses existing hugetlb_vmtruncate_list instead of more
generic unmap_mapping_range for unmapping
   Error handling for the case when region_del() fails
RFC v2:
  Addressed alignment and error handling issues noticed by Hillf Danton
  New region_del() routine for region tracking/resv_map of ranges
  Fixed several issues found during more extensive testing
  Error handling in region_del() when kmalloc() fails still needs
to be addressed
  madvise remove support remains

Mike Kravetz (10):
  mm/hugetlb: add cache of descriptors to resv_map for region_add
  mm/hugetlb: add region_del() to delete a specific range of entries
  mm/hugetlb: expose hugetlb fault mutex for use by fallocate
  hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
  hugetlbfs: truncate_hugepages() takes a range of pages
  mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
  mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
  hugetlbfs: New huge_add_to_page_cache helper routine
  hugetlbfs: add hugetlbfs_fallocate()
  mm: madvise allow remove operation for hugetlbfs

 fs/hugetlbfs/inode.c| 281 +---
 include/linux/hugetlb.h |  17 +-
 mm/hugetlb.c| 422 ++--
 mm/madvise.c|   2 +-
 4 files changed, 618 insertions(+), 104 deletions(-)

-- 
2.1.0



[PATCH 06/10] mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch

2015-07-02 Thread Mike Kravetz
In vma_has_reserves(), the current assumption is that reserves are
always present for shared mappings.  However, this will not be the
case with fallocate hole punch.  When punching a hole, the present
page will be deleted as well as the region/reserve map entry (and
hence any reservation).  vma_has_reserves is passed "chg" which
indicates whether or not a region/reserve map is present.  Use
this to determine if reserves are actually present or were removed
via hole punch.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1c3049c..5923c21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -800,9 +800,19 @@ static int vma_has_reserves(struct vm_area_struct *vma, 
long chg)
return 0;
}
 
-   /* Shared mappings always use reserves */
-   if (vma->vm_flags & VM_MAYSHARE)
-   return 1;
+   if (vma->vm_flags & VM_MAYSHARE) {
+   /*
+* We know VM_NORESERVE is not set.  Therefore, there SHOULD
+* be a region map for all pages.  The only situation where
+* there is no region map is if a hole was punched via
+* fallocate.  In this case, there really are no reserves to
+* use.  This situation is indicated if chg != 0.
+*/
+   if (chg)
+   return 0;
+   else
+   return 1;
+   }
 
/*
 * Only the process that called mmap() has reserves for
-- 
2.1.0



[PATCH 09/10] hugetlbfs: add hugetlbfs_fallocate()

2015-07-02 Thread Mike Kravetz
This is based on the shmem version, but it has diverged quite
a bit.  We have no swap to worry about, nor the new file sealing.
Add synchronization via the fault mutex table to coordinate
page faults, fallocate allocation and fallocate hole punch.

What this allows us to do is move physical memory in and out of
a hugetlbfs file without having it mapped.  This also gives us
the ability to support MADV_REMOVE since it is currently
implemented using fallocate().  MADV_REMOVE lets madvise() remove
pages from the middle of a hugetlbfs file, which wasn't possible
before.

hugetlbfs fallocate only operates on whole huge pages.
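
To make the whole-page behavior concrete, here is a toy calculation of the
rounding the punch-hole path performs (the values are hypothetical):

#include <stdio.h>

#define ROUND_UP(x, a)   ((((x) + (a) - 1) / (a)) * (a))
#define ROUND_DOWN(x, a) (((x) / (a)) * (a))

int main(void)
{
	long hpage = 2L * 1024 * 1024;		/* 2MB huge page size */
	long offset = 1L * 1024 * 1024;		/* fallocate offset */
	long len = 4L * 1024 * 1024;		/* fallocate length */

	/* Hole start is rounded up, hole end rounded down, so partial
	 * huge pages at either edge are left untouched. */
	long hole_start = ROUND_UP(offset, hpage);
	long hole_end = ROUND_DOWN(offset + len, hpage);

	printf("hole punched: [%ldMB, %ldMB)\n",
	       hole_start >> 20, hole_end >> 20);	/* [2MB, 4MB) */
	return 0;
}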

Based-on code-by: Dave Hansen 
Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 158 +++-
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c|   2 +-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 160d58d4..05b9d47 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include/* remove ASAP */
+#include 
 #include 
 #include 
 #include 
@@ -504,6 +505,160 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t 
offset)
return 0;
 }
 
+static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t 
len)
+{
+   struct hstate *h = hstate_inode(inode);
+   loff_t hpage_size = huge_page_size(h);
+   loff_t hole_start, hole_end;
+
+   /*
+* For hole punch round up the beginning offset of the hole and
+* round down the end.
+*/
+   hole_start = round_up(offset, hpage_size);
+   hole_end = round_down(offset + len, hpage_size);
+
+   if (hole_end > hole_start) {
+   struct address_space *mapping = inode->i_mapping;
+
+   mutex_lock(&inode->i_mutex);
+   i_mmap_lock_write(mapping);
+   if (!RB_EMPTY_ROOT(&mapping->i_mmap))
+   hugetlb_vmdelete_list(&mapping->i_mmap,
+   hole_start >> PAGE_SHIFT,
+   hole_end  >> PAGE_SHIFT);
+   i_mmap_unlock_write(mapping);
+   remove_inode_hugepages(inode, hole_start, hole_end);
+   mutex_unlock(&inode->i_mutex);
+   }
+
+   return 0;
+}
+
+static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
+   loff_t len)
+{
+   struct inode *inode = file_inode(file);
+   struct address_space *mapping = inode->i_mapping;
+   struct hstate *h = hstate_inode(inode);
+   struct vm_area_struct pseudo_vma;
+   struct mm_struct *mm = current->mm;
+   loff_t hpage_size = huge_page_size(h);
+   unsigned long hpage_shift = huge_page_shift(h);
+   pgoff_t start, index, end;
+   int error;
+   u32 hash;
+
+   if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+   return -EOPNOTSUPP;
+
+   if (mode & FALLOC_FL_PUNCH_HOLE)
+   return hugetlbfs_punch_hole(inode, offset, len);
+
+   /*
+* Default preallocate case.
+* For this range, start is rounded down and end is rounded up
+* as well as being converted to page offsets.
+*/
+   start = offset >> hpage_shift;
+   end = (offset + len + hpage_size - 1) >> hpage_shift;
+
+   mutex_lock(&inode->i_mutex);
+
+   /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+   error = inode_newsize_ok(inode, offset + len);
+   if (error)
+   goto out;
+
+   /*
+* Initialize a pseudo vma that just contains the policy used
+* when allocating the huge pages.  The actual policy field
+* (vm_policy) is determined based on the index in the loop below.
+*/
+   memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+   pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
+   pseudo_vma.vm_file = file;
+
+   for (index = start; index < end; index++) {
+   /*
+* This is supposed to be the vaddr where the page is being
+* faulted in, but we have no vaddr here.
+*/
+   struct page *page;
+   unsigned long addr;
+   int avoid_reserve = 0;
+
+   cond_resched();
+
+   /*
+* fallocate(2) manpage permits EINTR; we may have been
+* interrupted because we are using up too much memory.
+*/
+   if (signal_pending(current)) {
+   error = -EINTR;
+   break;
+   }
+
+   /* Get policy based on index */
+   pseudo_vma.vm_policy =
+   mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->po

[PATCH 08/10] hugetlbfs: New huge_add_to_page_cache helper routine

2015-07-02 Thread Mike Kravetz
Currently, there is only a single place where hugetlbfs pages are
added to the page cache.  The new fallocate code will be adding a second
one, so break the functionality out into its own helper.

Signed-off-by: Dave Hansen 
Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  2 ++
 mm/hugetlb.c| 27 ++-
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 95cd716..12e90f1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -333,6 +333,8 @@ struct huge_bootmem_page {
 struct page *alloc_huge_page_node(struct hstate *h, int nid);
 struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+   pgoff_t idx);
 
 /* arch callback */
 int __init alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 90d8250..d6e47ac 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3373,6 +3373,23 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
return page != NULL;
 }
 
+int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+  pgoff_t idx)
+{
+   struct inode *inode = mapping->host;
+   struct hstate *h = hstate_inode(inode);
+   int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+
+   if (err)
+   return err;
+   ClearPagePrivate(page);
+
+   spin_lock(&inode->i_lock);
+   inode->i_blocks += blocks_per_huge_page(h);
+   spin_unlock(&inode->i_lock);
+   return 0;
+}
+
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
   struct address_space *mapping, pgoff_t idx,
   unsigned long address, pte_t *ptep, unsigned int 
flags)
@@ -3420,21 +3437,13 @@ retry:
set_page_huge_active(page);
 
if (vma->vm_flags & VM_MAYSHARE) {
-   int err;
-   struct inode *inode = mapping->host;
-
-   err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
+   int err = huge_add_to_page_cache(page, mapping, idx);
if (err) {
put_page(page);
if (err == -EEXIST)
goto retry;
goto out;
}
-   ClearPagePrivate(page);
-
-   spin_lock(&inode->i_lock);
-   inode->i_blocks += blocks_per_huge_page(h);
-   spin_unlock(&inode->i_lock);
} else {
lock_page(page);
if (unlikely(anon_vma_prepare(vma))) {
-- 
2.1.0



[PATCH 07/10] mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate

2015-07-02 Thread Mike Kravetz
Areas hole punched by fallocate will not have entries in the
region/reserve map.  However, shared mappings with min_size subpool
reservations may still have reserved pages.  alloc_huge_page needs
to handle this special case and do the proper accounting.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 54 +++---
 1 file changed, 39 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5923c21..90d8250 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1731,34 +1731,58 @@ static struct page *alloc_huge_page(struct 
vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct page *page;
-   long chg, commit;
+   long map_chg, map_commit;
+   long gbl_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg;
 
idx = hstate_index(h);
/*
-* Processes that did not create the mapping will have no
-* reserves and will not have accounted against subpool
-* limit. Check that the subpool limit can be made before
-* satisfying the allocation MAP_NORESERVE mappings may also
-* need pages and subpool limit allocated allocated if no reserve
-* mapping overlaps.
+* Examine the region/reserve map to determine if the process
+* has a reservation for the page to be allocated.  A return
+* code of zero indicates a reservation exists (no change).
 */
-   chg = vma_needs_reservation(h, vma, addr);
-   if (chg < 0)
+   map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
+   if (map_chg < 0)
return ERR_PTR(-ENOMEM);
-   if (chg || avoid_reserve)
-   if (hugepage_subpool_get_pages(spool, 1) < 0) {
+
+   /*
+* Processes that did not create the mapping will have no
+* reserves as indicated by the region/reserve map. Check
+* that the allocation will not exceed the subpool limit.
+* Allocations for MAP_NORESERVE mappings also need to be
+* checked against any subpool limit.
+*/
+   if (map_chg || avoid_reserve) {
+   gbl_chg = hugepage_subpool_get_pages(spool, 1);
+   if (gbl_chg < 0) {
vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
 
+   /*
+* Even though there was no reservation in the region/reserve
+* map, there could be reservations associated with the
+* subpool that can be used.  This would be indicated if the
+* return value of hugepage_subpool_get_pages() is zero.
+* However, if avoid_reserve is specified we still avoid even
+* the subpool reservations.
+*/
+   if (avoid_reserve)
+   gbl_chg = 1;
+   }
+
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
if (ret)
goto out_subpool_put;
 
spin_lock(&hugetlb_lock);
-   page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
+   /*
+* gbl_chg is passed to indicate whether or not a page must be taken
+* from the global free pool (global change).  gbl_chg == 0 indicates
+* a reservation exists for the allocation.
+*/
+   page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
if (!page) {
spin_unlock(&hugetlb_lock);
page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
@@ -1774,8 +1798,8 @@ static struct page *alloc_huge_page(struct vm_area_struct 
*vma,
 
set_page_private(page, (unsigned long)spool);
 
-   commit = vma_commit_reservation(h, vma, addr);
-   if (unlikely(chg > commit)) {
+   map_commit = vma_commit_reservation(h, vma, addr);
+   if (unlikely(map_chg > map_commit)) {
/*
 * The page was added to the reservation map between
 * vma_needs_reservation and vma_commit_reservation.
@@ -1795,7 +1819,7 @@ static struct page *alloc_huge_page(struct vm_area_struct 
*vma,
 out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_subpool_put:
-   if (chg || avoid_reserve)
+   if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
vma_abort_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
-- 
2.1.0



[PATCH 10/10] mm: madvise allow remove operation for hugetlbfs

2015-07-02 Thread Mike Kravetz
Now that we have hole punching support for hugetlbfs, we can
also support the MADV_REMOVE interface to it.
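
For illustration, with this change the following userspace sketch now frees
the middle of a hugetlbfs file instead of getting -EINVAL from madvise()
(the mount point, file name and 2MB page size are examples only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t hpage = 2UL * 1024 * 1024;
	int fd = open("/dev/hugepages/demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ftruncate(fd, 4 * hpage))		/* size the file to 4 huge pages */
		perror("ftruncate");
	p = mmap(NULL, 4 * hpage, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 1, 4 * hpage);		/* fault in all four huge pages */
	/* Remove the middle two huge pages; was -EINVAL for hugetlbfs before. */
	if (madvise(p + hpage, 2 * hpage, MADV_REMOVE))
		perror("madvise(MADV_REMOVE)");
	munmap(p, 4 * hpage);
	close(fd);
	return 0;
}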

Signed-off-by: Dave Hansen 
Signed-off-by: Mike Kravetz 
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea9..3c1b7f0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -467,7 +467,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 
*prev = NULL;   /* tell sys_madvise we drop mmap_sem */
 
-   if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB))
+   if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
 
f = vma->vm_file;
-- 
2.1.0



[PATCH 01/10] mm/hugetlb: add cache of descriptors to resv_map for region_add

2015-07-02 Thread Mike Kravetz
fallocate hole punch will want to remove a specific range of
pages.  When pages are removed, their associated entries in
the region/reserve map will also be removed.  This will break
an assumption in the region_chg/region_add calling sequence.
If a new region descriptor must be allocated, it is done as
part of the region_chg processing.  In this way, region_add
can not fail because it does not need to attempt an allocation.

To prepare for fallocate hole punch, create a "cache" of
descriptors that can be used by region_add if necessary.
region_chg will ensure there are sufficient entries in the
cache.  It will be necessary to track the number of in-progress
add operations to know that a sufficient number of descriptors
reside in the cache.  A new routine region_abort is added to
adjust this in progress count when add operations are aborted.
vma_abort_reservation is also added for callers creating
reservations with vma_needs_reservation/vma_commit_reservation.

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c| 166 ++--
 2 files changed, 150 insertions(+), 19 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2050261..f9c8cb2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,9 @@ struct resv_map {
struct kref refs;
spinlock_t lock;
struct list_head regions;
+   long adds_in_progress;
+   struct list_head rgn_cache;
+   long rgn_cache_count;
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..189c0d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -240,11 +240,14 @@ struct file_region {
 
 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  Existing regions will be expanded to accommodate the
- * specified range.  We know only existing regions need to be
- * expanded, because region_add is only called after region_chg
- * with the same range.  If a new file_region structure must
- * be allocated, it is done in region_chg.
+ * map.  In the normal case, existing regions will be expanded
+ * to accommodate the specified range.  Sufficient regions should
+ * exist for expansion due to the previous call to region_chg
+ * with the same range.  However, it is possible that region_del
+ * could have been called after region_chg and modified the map
+ * in such a way that no region exists to be expanded.  In this
+ * case, pull a region descriptor from the cache associated with
+ * the map and use that for the new range.
  *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
@@ -261,6 +264,27 @@ static long region_add(struct resv_map *resv, long f, long 
t)
if (f <= rg->to)
break;
 
+   if (&rg->link == head || t < rg->from) {
+   /*
+* No region exists which can be expanded to include the
+* specified range.  Pull a region descriptor from the
+* cache, and use it for this range.
+*/
+   VM_BUG_ON(!resv->rgn_cache_count);
+
+   resv->rgn_cache_count--;
+   nrg = list_first_entry(&resv->rgn_cache, struct file_region,
+   link);
+   list_del(&nrg->link);
+
+   nrg->from = f;
+   nrg->to = t;
+   list_add(&nrg->link, rg->link.prev);
+
+   add += t - f;
+   goto out_locked;
+   }
+
/* Round our left edge to the current segment if it encloses us. */
if (f > rg->from)
f = rg->from;
@@ -294,6 +318,8 @@ static long region_add(struct resv_map *resv, long f, long 
t)
add += t - nrg->to; /* Added to end of region */
nrg->to = t;
 
+out_locked:
+   resv->adds_in_progress--;
spin_unlock(&resv->lock);
VM_BUG_ON(add < 0);
return add;
@@ -312,11 +338,16 @@ static long region_add(struct resv_map *resv, long f, 
long t)
  * so that the subsequent region_add call will have all the
  * regions it needs and will not fail.
  *
+ * Upon entry, region_chg will also examine the cache of
+ * region descriptors associated with the map.  If there are
+ * not enough descriptors cached, one will be allocated
+ * for the in progress add operation.
+ *
  * Returns the number of huge pages that need to be added
  * to the existing reservation map for the range [f, t).
  * This number is greater or equal to zero.  -ENOMEM is
- * returned if a new file_region structure is needed and can
- * not be allocated.
+ * returned if a new file_region structure or cache entry
+ * is needed and can not be allocated.
  */
 static long region_chg(struct resv_map *resv, long 

[PATCH 04/10] hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete

2015-07-02 Thread Mike Kravetz
fallocate hole punch will want to unmap a specific range of pages.
Modify the existing hugetlb_vmtruncate_list() routine to take a
start/end range.  If end is 0, this indicates all pages after start
should be unmapped.  This is the same as the existing truncate
functionality.  Modify existing callers to add 0 as end of range.

Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e5a93d8..e9d4c8d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -374,11 +374,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
 }
 
 static inline void
-hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t pgoff)
+hugetlb_vmdelete_list(struct rb_root *root, pgoff_t start, pgoff_t end)
 {
struct vm_area_struct *vma;
 
-   vma_interval_tree_foreach(vma, root, pgoff, ULONG_MAX) {
+   /*
+* end == 0 indicates that the entire range after
+* start should be unmapped.
+*/
+   vma_interval_tree_foreach(vma, root, start, end ? end : ULONG_MAX) {
unsigned long v_offset;
 
/*
@@ -387,13 +391,20 @@ hugetlb_vmtruncate_list(struct rb_root *root, pgoff_t 
pgoff)
 * which overlap the truncated area starting at pgoff,
 * and no vma on a 32-bit arch can span beyond the 4GB.
 */
-   if (vma->vm_pgoff < pgoff)
-   v_offset = (pgoff - vma->vm_pgoff) << PAGE_SHIFT;
+   if (vma->vm_pgoff < start)
+   v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
else
v_offset = 0;
 
-   unmap_hugepage_range(vma, vma->vm_start + v_offset,
-vma->vm_end, NULL);
+   if (end) {
+   end = ((end - start) << PAGE_SHIFT) +
+  vma->vm_start + v_offset;
+   if (end > vma->vm_end)
+   end = vma->vm_end;
+   } else
+   end = vma->vm_end;
+
+   unmap_hugepage_range(vma, vma->vm_start + v_offset, end, NULL);
}
 }
 
@@ -409,7 +420,7 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t 
offset)
i_size_write(inode, offset);
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
-   hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
+   hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
i_mmap_unlock_write(mapping);
truncate_hugepages(inode, offset);
return 0;
-- 
2.1.0



[PATCH 03/10] mm/hugetlb: expose hugetlb fault mutex for use by fallocate

2015-07-02 Thread Mike Kravetz
hugetlb page faults are currently synchronized by the table of
mutexes (htlb_fault_mutex_table).  fallocate code will need to
synchronize with the page fault code when it allocates or
deletes pages.  Expose interfaces so that fallocate operations
can be synchronized with page faults.  Minor name changes to
be more consistent with other global hugetlb symbols.
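
The pattern being exposed is roughly the following (a userspace pthreads
sketch, not kernel code; the table size and hash are arbitrary, build with
-pthread): hash the (mapping, index) pair to one mutex in a fixed table so
that a page fault and a fallocate operation on the same file offset
serialize on the same lock.

#include <pthread.h>
#include <stdio.h>

#define NUM_FAULT_MUTEXES 8

static pthread_mutex_t fault_mutex_table[NUM_FAULT_MUTEXES];

static unsigned int fault_mutex_hash(const void *mapping, unsigned long idx)
{
	/* Toy hash: mix the mapping pointer with the huge page index. */
	return (unsigned int)(((unsigned long)mapping ^ idx) % NUM_FAULT_MUTEXES);
}

int main(void)
{
	int dummy_mapping;		/* stands in for struct address_space */
	unsigned long idx = 42;		/* huge page index within the file */
	unsigned int i, hash;

	for (i = 0; i < NUM_FAULT_MUTEXES; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);

	hash = fault_mutex_hash(&dummy_mapping, idx);
	pthread_mutex_lock(&fault_mutex_table[hash]);
	/* fault or fallocate work on this index would happen here */
	printf("index %lu serialized under mutex %u\n", idx, hash);
	pthread_mutex_unlock(&fault_mutex_table[hash]);
	return 0;
}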

Signed-off-by: Mike Kravetz 
---
 include/linux/hugetlb.h |  5 +
 mm/hugetlb.c| 20 ++--
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f9c8cb2..43e8511f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -88,6 +88,11 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
 bool isolate_huge_page(struct page *page, struct list_head *list);
 void putback_active_hugepage(struct page *page);
 void free_huge_page(struct page *page);
+extern struct mutex *hugetlb_fault_mutex_table;
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+   struct vm_area_struct *vma,
+   struct address_space *mapping,
+   pgoff_t idx, unsigned long address);
 
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8c7f68..21e957b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -64,7 +64,7 @@ DEFINE_SPINLOCK(hugetlb_lock);
  * prevent spurious OOMs when the hugepage pool is fully utilized.
  */
 static int num_fault_mutexes;
-static struct mutex *htlb_fault_mutex_table cacheline_aligned_in_smp;
+struct mutex *hugetlb_fault_mutex_table cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
@@ -2481,7 +2481,7 @@ static void __exit hugetlb_exit(void)
}
 
kobject_put(hugepages_kobj);
-   kfree(htlb_fault_mutex_table);
+   kfree(hugetlb_fault_mutex_table);
 }
 module_exit(hugetlb_exit);
 
@@ -2514,12 +2514,12 @@ static int __init hugetlb_init(void)
 #else
num_fault_mutexes = 1;
 #endif
-   htlb_fault_mutex_table =
+   hugetlb_fault_mutex_table =
kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
-   BUG_ON(!htlb_fault_mutex_table);
+   BUG_ON(!hugetlb_fault_mutex_table);
 
for (i = 0; i < num_fault_mutexes; i++)
-   mutex_init(&htlb_fault_mutex_table[i]);
+   mutex_init(&hugetlb_fault_mutex_table[i]);
return 0;
 }
 module_init(hugetlb_init);
@@ -3453,7 +3453,7 @@ backout_unlocked:
 }
 
 #ifdef CONFIG_SMP
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3478,7 +3478,7 @@ static u32 fault_mutex_hash(struct hstate *h, struct 
mm_struct *mm,
  * For uniprocesor systems we always use a single mutex, so just
  * return 0 and avoid the hashing overhead.
  */
-static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping,
pgoff_t idx, unsigned long address)
@@ -3526,8 +3526,8 @@ int hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * get spurious allocation failures if two CPUs race to instantiate
 * the same page in the page cache.
 */
-   hash = fault_mutex_hash(h, mm, vma, mapping, idx, address);
-   mutex_lock(&htlb_fault_mutex_table[hash]);
+   hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
+   mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
entry = huge_ptep_get(ptep);
if (huge_pte_none(entry)) {
@@ -3612,7 +3612,7 @@ out_ptl:
put_page(pagecache_page);
}
 out_mutex:
-   mutex_unlock(&htlb_fault_mutex_table[hash]);
+   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
/*
 * Generally it's safe to hold refcount during waiting page lock. But
 * here we just wait to defer the next page fault to avoid busy loop and
-- 
2.1.0



[PATCH 05/10] hugetlbfs: truncate_hugepages() takes a range of pages

2015-07-02 Thread Mike Kravetz
Modify truncate_hugepages() to take a range of pages (start, end)
instead of simply start. If an end value of -1 is passed, the
current "truncate" functionality is maintained. Existing callers
are modified to pass -1 as end of range. By keying off end == -1,
the routine behaves differently for truncate and hole punch.
Page removal is now synchronized with page allocation via faults
by using the fault mutex table. The hole punch case can experience
the rare region_del error and must handle accordingly.

Add the routine hugetlb_fix_reserve_counts to fix up reserve counts
in the case where region_del returns an error.

Since the routine handles more than just the truncate case, it is
renamed to remove_inode_hugepages().  To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().

Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from
region_del and pass it back to the caller.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c| 98 -
 include/linux/hugetlb.h |  4 +-
 mm/hugetlb.c| 40 ++--
 3 files changed, 128 insertions(+), 14 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e9d4c8d..160d58d4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -318,26 +318,61 @@ static int hugetlbfs_write_end(struct file *file, struct 
address_space *mapping,
return -EINVAL;
 }
 
-static void truncate_huge_page(struct page *page)
+static void remove_huge_page(struct page *page)
 {
ClearPageDirty(page);
ClearPageUptodate(page);
delete_from_page_cache(page);
 }
 
-static void truncate_hugepages(struct inode *inode, loff_t lstart)
+
+/*
+ * remove_inode_hugepages handles two distinct cases: truncation and hole
+ * punch.  There are subtle differences in operation for each case.
+
+ * truncation is indicated by end of range being -1
+ * In this case, we first scan the range and release found pages.
+ * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
+ * maps and global counts.
+ * hole punch is indicated if end is not -1
+ * In the hole punch case we scan the range and release found pages.
+ * Only when releasing a page is the associated region/reserv map
+ * deleted.  The region/reserv map for ranges without associated
+ * pages are not modified.
+ * Note: If the passed end of range value is beyond the end of file, but
+ * not -1 this routine still performs a hole punch operation.
+ */
+static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
+  loff_t lend)
 {
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
+   const pgoff_t end = lend >> huge_page_shift(h);
+   struct vm_area_struct pseudo_vma;
struct pagevec pvec;
pgoff_t next;
int i, freed = 0;
+   long lookup_nr = PAGEVEC_SIZE;
+   bool truncate_op = (lend == -1);
 
+   memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
+   pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
pagevec_init(&pvec, 0);
next = start;
-   while (1) {
-   if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+   while (next < end) {
+   /*
+* Make sure to never grab more pages that we
+* might possibly need.
+*/
+   if (end - next < lookup_nr)
+   lookup_nr = end - next;
+
+   /*
+* This pagevec_lookup() may return pages past 'end',
+* so we must check for page->index > end.
+*/
+   if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
if (next == start)
break;
next = start;
@@ -346,26 +381,69 @@ static void truncate_hugepages(struct inode *inode, 
loff_t lstart)
 
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
+   u32 hash;
+
+   hash = hugetlb_fault_mutex_hash(h, current->mm,
+   &pseudo_vma,
+   mapping, next, 0);
+   mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
lock_page(page);
+   if (page->index >= end) {
+   unlock_page(page);
+   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+   

[PATCH 02/10] mm/hugetlb: add region_del() to delete a specific range of entries

2015-07-02 Thread Mike Kravetz
fallocate hole punch will want to remove a specific range of pages.
The existing region_truncate() routine deletes all region/reserve
map entries after a specified offset.  region_del() will provide
this same functionality if the end of region is specified as -1.
Hence, region_del() can replace region_truncate().

Unlike region_truncate(), region_del() can return an error in the
rare case where it can not allocate memory for a region descriptor.
This ONLY happens in the case where an existing region must be split.
Current callers passing -1 as end of range will never experience
this error and do not need to deal with error handling.  Future
callers of region_del() (such as fallocate hole punch) will need to
handle this error.
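
A toy example of the split case described above (hypothetical values, not
kernel code): deleting [3, 6) from an existing region [0, 10) leaves two
regions and therefore needs one new descriptor, which is why only this case
can return -ENOMEM:

#include <stdio.h>

int main(void)
{
	long from = 0, to = 10;		/* existing reserve map region [0, 10) */
	long f = 3, t = 6;		/* range passed to region_del() */

	if (f > from && t < to) {
		/* Original entry is trimmed; a new entry covers the tail. */
		printf("split into [%ld, %ld) and [%ld, %ld), %ld pages deleted\n",
		       from, f, t, to, t - f);
	}
	return 0;
}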

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 101 ---
 1 file changed, 75 insertions(+), 26 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 189c0d7..e8c7f68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -460,43 +460,92 @@ static void region_abort(struct resv_map *resv, long f, 
long t)
 }
 
 /*
- * Truncate the reserve map at index 'end'.  Modify/truncate any
- * region which contains end.  Delete any regions past end.
- * Return the number of huge pages removed from the map.
+ * Delete the specified range [f, t) from the reserve map.  If the
+ * t parameter is -1, this indicates that ALL regions after f should
+ * be deleted.  Locate the regions which intersect [f, t) and either
+ * trim, delete or split the existing regions.
+ *
+ * Returns the number of huge pages deleted from the reserve map.
+ * In the normal case, the return value is zero or more.  In the
+ * case where a region must be split, a new region descriptor must
+ * be allocated.  If the allocation fails, -ENOMEM will be returned.
+ * NOTE: If the parameter t == -1, then we will never split a region
+ * and possibly return -ENOMEM.  Callers specifying t == -1 do not
+ * need to check for -ENOMEM error.
  */
-static long region_truncate(struct resv_map *resv, long end)
+static long region_del(struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
struct file_region *rg, *trg;
-   long chg = 0;
+   struct file_region *nrg = NULL;
+   long del = 0;
 
+   if (t == -1)
+   t = LONG_MAX;
+retry:
spin_lock(&resv->lock);
-   /* Locate the region we are either in or before. */
-   list_for_each_entry(rg, head, link)
-   if (end <= rg->to)
+   list_for_each_entry_safe(rg, trg, head, link) {
+   if (rg->to <= f)
+   continue;
+   if (rg->from >= t)
break;
-   if (&rg->link == head)
-   goto out;
 
-   /* If we are in the middle of a region then adjust it. */
-   if (end > rg->from) {
-   chg = rg->to - end;
-   rg->to = end;
-   rg = list_entry(rg->link.next, typeof(*rg), link);
-   }
+   if (f > rg->from && t < rg->to) { /* Must split region */
+   /*
+* Check for an entry in the cache before dropping
+* lock and attempting allocation.
+*/
+   if (!nrg &&
+   resv->rgn_cache_count > resv->adds_in_progress) {
+   nrg = list_first_entry(&resv->rgn_cache,
+   struct file_region,
+   link);
+   list_del(&nrg->link);
+   resv->rgn_cache_count--;
+   }
 
-   /* Drop any remaining regions. */
-   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
+   if (!nrg) {
+   spin_unlock(&resv->lock);
+   nrg = kmalloc(sizeof(*nrg), GFP_KERNEL);
+   if (!nrg)
+   return -ENOMEM;
+   goto retry;
+   }
+
+   del += t - f;
+
+   /* New entry for end of split region */
+   nrg->from = t;
+   nrg->to = rg->to;
+   INIT_LIST_HEAD(&nrg->link);
+
+   /* Original entry is trimmed */
+   rg->to = f;
+
+   list_add(&nrg->link, &rg->link);
+   nrg = NULL;
break;
-   chg += rg->to - rg->from;
-   list_del(&rg->link);
-   kfree(rg);
+ 

Re: [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch

2015-11-09 Thread Mike Kravetz
On 11/08/2015 11:42 PM, Hugh Dickins wrote:
> On Fri, 30 Oct 2015, Mike Kravetz wrote:
>>
>> The 'next = start' code is actually from the original truncate_hugepages
>> routine.  This functionality was combined with that needed for hole punch
>> to create remove_inode_hugepages().
>>
>> The following code was in truncate_hugepages:
>>
>>  next = start;
>>  while (1) {
>>  if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>>  if (next == start)
>>  break;
>>  next = start;
>>  continue;
>>  }
>>
>>
>> So, in the truncate case pages starting at 'start' are deleted until
>> pagevec_lookup fails.  Then, we call pagevec_lookup() again.  If no
>> pages are found we are done.  Else, we repeat the whole process.
>>
>> Does anyone recall the reason for going back and looking for pages at
>> index'es already deleted?  Git doesn't help as that was part of initial
>> commit.  My thought is that truncate can race with page faults.  The
>> truncate code sets inode offset before unmapping and deleting pages.
>> So, faults after the new offset is set should fail.  But, I suppose a
>> fault could race with setting offset and deleting of pages.  Does this
>> sound right?  Or, is there some other reason I am missing?
> 
> I believe your thinking is correct.  But remember that
> truncate_inode_pages_range() is shared by almost all filesystems,
> and different filesystems have different internal locking conventions,
> and different propensities to such a race: it's trying to cover for
> all of them.
> 
> Typically, writing is well serialized (by i_mutex) against truncation,
> but faulting (like reading) sails through without enough of a lock.
> We resort to i_size checks to avoid the worst of it, but there's often
> a corner or two in which those checks are not quite good enough -
> it's easy to check i_size at the beginning, but it needs to be checked
> again at the end too, and what's been done undone - can be awkward.

Well, it looks like the hugetlb_no_page() routine is checking i_size both
before and after.  It appears to be doing the right thing to handle the
race, but I need to stare at the code some more to make sure.

Because of the way the truncate code went back and did an extra lookup
when done with the range, I assumed it was covering some race.  However,
that may not be the case.

> 
> I hope that in the case of hugetlbfs, since you already have the
> additional fault_mutex to handle races between faults and punching,
> it should be possible to get away without that "pincer" restarting.

Yes, it looks like this may work as a straight loop over the range of
pages.  I just need to study the code some more to make sure I am not
missing something.

-- 
Mike Kravetz

> 
> Hugh
> 
>>
>> I would like to continue having remove_inode_hugepages handle both the
>> truncate and hole punch cases.  So, I want to make sure the code correctly
>> handles both cases.
>>
>> -- 
>> Mike Kravetz


Re: [PATCH] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes

2015-11-09 Thread Mike Kravetz
On 11/08/2015 11:09 PM, Hugh Dickins wrote:
> Sorry for the delay, I needed some time set aside to look through.

No problem.  I really appreciate your comments.

> On Fri, 30 Oct 2015, Mike Kravetz wrote:
> 
>> Hugh Dickins pointed out problems with the new hugetlbfs fallocate
>> hole punch code.  These problems are in the routine remove_inode_hugepages
>> and mostly occur in the case where there are holes in the range of
>> pages to be removed.  These holes could be the result of a previous hole
>> punch or simply sparse allocation.
>>
>> remove_inode_hugepages handles both hole punch and truncate operations.
>> Page index handling was fixed/cleaned up so that holes are properly
>> handled.  In addition, code was changed to ensure multiple passes of the
>> address range only happens in the truncate case.  More comments were added
>> to explain the different actions in each case.  A cond_resched() was added
>> after removing up to PAGEVEC_SIZE pages.
>>
>> Some totally unnecessary code in hugetlbfs_fallocate() that remained from
>> early development was also removed.
> 
> Yes, I agree with most of that comment, and with removing the unnecessary
> leftover; and you were right to make the patch against v4.3 as you did.
> 
>>
> 
> Should have
> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> Cc: sta...@vger.kernel.org [4.3]
> when it's finished.

Will do.

> 
>> Signed-off-by: Mike Kravetz 
>> ---
>>  fs/hugetlbfs/inode.c | 44 +---
>>  1 file changed, 29 insertions(+), 15 deletions(-)
>>
> 
> I agree that this is an improvement, but I'm afraid it still
> has (perhaps) a serious bug that I didn't notice before.

Yes, I think most of the issues revolve around the question of whether
or not page faults can race with truncate.  As mentioned in the other
e-mail, this may not be an issue and would result in simpler/cleaner code.

> It'll be clearer if I comment, not on your patch, but on the patched
> remove_inode_hugepages() itself.  Yes, most of what I say could have
> been said when you asked for review of that originally - sorry,
> but I just didn't have time to spare.
> 
> static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>  loff_t lend)
> {
>   struct hstate *h = hstate_inode(inode);
>   struct address_space *mapping = &inode->i_data;
>   const pgoff_t start = lstart >> huge_page_shift(h);
>   const pgoff_t end = lend >> huge_page_shift(h);
>   struct vm_area_struct pseudo_vma;
>   struct pagevec pvec;
>   pgoff_t next;
>   int i, freed = 0;
>   long lookup_nr = PAGEVEC_SIZE;
>   bool truncate_op = (lend == LLONG_MAX);
> 
>   memset(&pseudo_vma, 0, sizeof(struct vm_area_struct));
>   pseudo_vma.vm_flags = (VM_HUGETLB | VM_MAYSHARE | VM_SHARED);
> 
> (I have to say in passing that this is horrid: what's needed is to
> replace hugetlb_fault_mutex_hash()'s "vma" arg by a "bool shared";
> or something else - it's irritating how half its args are irrelevant.
> But you're absolutely right not to do so in this patch, this being
> a fix for stable which should be kept minimal.  Maybe even leave
> out your i_lock/i_private cleanup for now.)

Ok, I'm happy to drop the i_lock/i_private cleanup as well.

> 
>   pagevec_init(&pvec, 0);
>   next = start;
>   while (next < end) {
> 
> Okay: that confused me, but I think you're right to keep it that way for
> the holepunch break (and you don't expect to reach "end" in truncation).
> 
>   /*
>* Make sure to never grab more pages that we
> 
> The next comment makes clear that you cannot "Make sure" of that:
> "Try not to grab more pages than we would need" perhaps.

Agree, comment will be updated.

> 
>* might possibly need.
>*/
>   if (end - next < lookup_nr)
>   lookup_nr = end - next;
> 
> If you are going to restart for truncation (but it's not clear to me
> that you should), then you ought to reinit lookup_nr to PAGEVEC_SIZE
> before restarting; though I suppose that restart finding anything
> will be so rare as not to matter in practice.

I'm pretty sure we will not need to restart once I confirm that this
routine does not need to handle races with faults in the truncate case.

> 
>   /*
>* When no more pages are found, take different action for
>* hole punc

Re: [PATCH] mm/hugetlb: Unmap pages if page fault raced with hole punch

2015-11-10 Thread Mike Kravetz
On 11/09/2015 02:55 PM, Mike Kravetz wrote:
> On 11/08/2015 11:42 PM, Hugh Dickins wrote:
>> On Fri, 30 Oct 2015, Mike Kravetz wrote:
>>>
>>> The 'next = start' code is actually from the original truncate_hugepages
>>> routine.  This functionality was combined with that needed for hole punch
>>> to create remove_inode_hugepages().
>>>
>>> The following code was in truncate_hugepages:
>>>
>>> next = start;
>>> while (1) {
>>> if (!pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
>>> if (next == start)
>>> break;
>>> next = start;
>>> continue;
>>> }
>>>
>>>
>>> So, in the truncate case pages starting at 'start' are deleted until
>>> pagevec_lookup fails.  Then, we call pagevec_lookup() again.  If no
>>> pages are found we are done.  Else, we repeat the whole process.
>>>
>>> Does anyone recall the reason for going back and looking for pages at
>>> index'es already deleted?  Git doesn't help as that was part of initial
>>> commit.  My thought is that truncate can race with page faults.  The
>>> truncate code sets inode offset before unmapping and deleting pages.
>>> So, faults after the new offset is set should fail.  But, I suppose a
>>> fault could race with setting offset and deleting of pages.  Does this
>>> sound right?  Or, is there some other reason I am missing?
>>
>> I believe your thinking is correct.  But remember that
>> truncate_inode_pages_range() is shared by almost all filesystems,
>> and different filesystems have different internal locking conventions,
>> and different propensities to such a race: it's trying to cover for
>> all of them.
>>
>> Typically, writing is well serialized (by i_mutex) against truncation,
>> but faulting (like reading) sails through without enough of a lock.
>> We resort to i_size checks to avoid the worst of it, but there's often
>> a corner or two in which those checks are not quite good enough -
>> it's easy to check i_size at the beginning, but it needs to be checked
>> again at the end too, and what's been done undone - can be awkward.
> 
> Well, it looks like the hugetlb_no_page() routine is checking i_size both
> before and after.  It appears to be doing the right thing to handle the
> race, but I need to stare at the code some more to make sure.
> 
> Because of the way the truncate code went back and did an extra lookup
> when done with the range, I assumed it was covering some race.  However,
> that may not be the case.
> 
>>
>> I hope that in the case of hugetlbfs, since you already have the
>> additional fault_mutex to handle races between faults and punching,
>> it should be possible to get away without that "pincer" restarting.
> 
> Yes, it looks like this may work as a straight loop over the range of
> pages.  I just need to study the code some more to make sure I am not
> missing something.

I have convinced myself that hugetlb_no_page is coded such that page
faults can not race with truncate.  hugetlb_no_page handles the case
where there is no PTE for a faulted in address.  The general flow in
hugetlb_no_page for the no page found case is:
- check index against i_size, end if beyond
- allocate huge page
- take page table lock for huge page
- check index against i_size again; if beyond, free the page and return
- add huge page to page table
- unlock page table lock for huge page

The flow for the truncate operation in hugetlb_vmtruncate is:
- set i_size
- take inode/mapping write lock
- hugetlb_vmdelete_list() which removes page table entries.  The page
  table lock will be taken for each huge page in the range
- release inode/mapping write lock
- remove_inode_hugepages() to actually remove pages

The truncate/page fault race we are concerned with is if a page is faulted
in after hugetlb_vmtruncate sets i_size and unmaps the page, but before
actually removing the page.  Obviously, any entry into hugetlb_no_page
after i_size is set will check the value and not allow the fault.  In
addition, if the value of i_size is set before the second check in
hugetlb_no_page, it will do the right thing.  Therefore, the only place to
race is after the second i_size check in hugetlb_no_page.

Note that the second check for i_size is with the page table lock for
the huge page held.  It is not possible for hugetlb_vmtruncate to unmap
the huge page before the page fault completes, as it must acquire the page
table lock.  This is the same as a fault happening before the truncate
operation start

[PATCH v2] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes

2015-11-10 Thread Mike Kravetz
This is against linux-stable 4.3.  Will send to sta...@vger.kernel.org
when Ack'ed here.

Hugh Dickins pointed out problems with the new hugetlbfs fallocate
hole punch code.  These problems are in the routine remove_inode_hugepages
and mostly occur in the case where there are holes in the range of
pages to be removed.  These holes could be the result of a previous hole
punch or simply sparse allocation.

remove_inode_hugepages handles both hole punch and truncate operations.
Page index handling was fixed/cleaned up so that the loop index always
matches the page being processed.  The code now only makes a single pass
through the range of pages as it was determined page faults could not
race with truncate.  A cond_resched() was added after removing up to
PAGEVEC_SIZE pages.

Some totally unnecessary code in hugetlbfs_fallocate() that remained from
early development was also removed.

v2:
  Make remove_inode_hugepages simpler after verifying truncate can not
  race with page faults here.

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c | 57 ++--
 1 file changed, 29 insertions(+), 28 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316adb9..8290f39 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -332,12 +332,15 @@ static void remove_huge_page(struct page *page)
  * truncation is indicated by end of range being LLONG_MAX
  * In this case, we first scan the range and release found pages.
  * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
- * maps and global counts.
+ * maps and global counts.  Page faults can not race with truncation
+ * in this routine.  hugetlb_no_page() prevents page faults in the
+ * truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  * In the hole punch case we scan the range and release found pages.
  * Only when releasing a page is the associated region/reserv map
  * deleted.  The region/reserv map for ranges without associated
- * pages are not modified.
+ * pages are not modified.  Page faults can race with hole punch.
+ * This is indicated if we find a mapped page.
  * Note: If the passed end of range value is beyond the end of file, but
  * not LLONG_MAX this routine still performs a hole punch operation.
  */
@@ -361,44 +364,38 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
next = start;
while (next < end) {
/*
-* Make sure to never grab more pages that we
-* might possibly need.
+* Don't grab more pages than the number left in the range.
 */
if (end - next < lookup_nr)
lookup_nr = end - next;
 
/*
-* This pagevec_lookup() may return pages past 'end',
-* so we must check for page->index > end.
+* When no more pages are found, we are done.
 */
-   if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
-   if (next == start)
-   break;
-   next = start;
-   continue;
-   }
+   if (!pagevec_lookup(&pvec, mapping, next, lookup_nr))
+   break;
 
for (i = 0; i < pagevec_count(&pvec); ++i) {
struct page *page = pvec.pages[i];
u32 hash;
 
+   /*
+* The page (index) could be beyond end.  This is
+* only possible in the punch hole case as end is
+* max page offset in the truncate case.
+*/
+   next = page->index;
+   if (next >= end)
+   break;
+
hash = hugetlb_fault_mutex_hash(h, current->mm,
&pseudo_vma,
mapping, next, 0);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
lock_page(page);
-   if (page->index >= end) {
-   unlock_page(page);
-   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-   next = end; /* we are done */
-   break;
-   }
-
/*
-* If page is mapped, it was faulted in after being
-* unmapped.  Do nothing in this race case.  In the
-* normal case page

Re: [PATCH v2] mm/hugetlbfs Fix bugs in fallocate hole punch of areas with holes

2015-11-10 Thread Mike Kravetz
On 11/10/2015 06:58 PM, Naoya Horiguchi wrote:
> Hello Mike,
> 
> On Tue, Nov 10, 2015 at 05:38:01PM -0800, Mike Kravetz wrote:
>> This is against linux-stable 4.3.  Will send to sta...@vger.kernel.org
>> when Ack'ed here.
> 
> This is not how stable stuff works, please see
> Documentation/stable_kernel_rules.txt.

Ok. I'll resend with the Cc.

> 
>> Hugh Dickins pointed out problems with the new hugetlbfs fallocate
>> hole punch code.  These problems are in the routine remove_inode_hugepages
>> and mostly occur in the case where there are holes in the range of
>> pages to be removed.  These holes could be the result of a previous hole
>> punch or simply sparse allocation.
>>
>> remove_inode_hugepages handles both hole punch and truncate operations.
>> Page index handling was fixed/cleaned up so that the loop index always
>> matches the page being processed.  The code now only makes a single pass
>> through the range of pages as it was determined page faults could not
>> race with truncate.  A cond_resched() was added after removing up to
>> PAGEVEC_SIZE pages.
>>
>> Some totally unnecessary code in hugetlbfs_fallocate() that remained from
>> early development was also removed.
>>
>> v2:
>>   Make remove_inode_hugepages simpler after verifying truncate can not
>>   race with page faults here.
>>
>> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of 
>> pages")
> 
> Cc: sta...@vger.kernel.org [4.3]

Will add.

> 
>> Signed-off-by: Mike Kravetz 
>> ---
>>  fs/hugetlbfs/inode.c | 57 
>> ++--
>>  1 file changed, 29 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 316adb9..8290f39 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -332,12 +332,15 @@ static void remove_huge_page(struct page *page)
>>   * truncation is indicated by end of range being LLONG_MAX
>>   *  In this case, we first scan the range and release found pages.
>>   *  After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
>> - *  maps and global counts.
>> + *  maps and global counts.  Page faults can not race with truncation
>> + *  in this routine.  hugetlb_no_page() prevents page faults in the
>> + *  truncated range.
> 
> Could you be specific about how/why? Maybe hugetlb_fault_mutex_hash and/or
> i_size check should be mentioned, because it's not so obvious.

The long explanation is here:
http://marc.info/?l=linux-mm&m=144719585221409&w=2
I will include a brief summary here.

> 
>>   * hole punch is indicated if end is not LLONG_MAX
>>   *  In the hole punch case we scan the range and release found pages.
>>   *  Only when releasing a page is the associated region/reserv map
>>   *  deleted.  The region/reserv map for ranges without associated
>> - *  pages are not modified.
>> + *  pages are not modified.  Page faults can race with hole punch.
>> + *  This is indicated if we find a mapped page.
>>   * Note: If the passed end of range value is beyond the end of file, but
>>   * not LLONG_MAX this routine still performs a hole punch operation.
>>   */
>> @@ -361,44 +364,38 @@ static void remove_inode_hugepages(struct inode 
>> *inode, loff_t lstart,
>>  next = start;
>>  while (next < end) {
>>  /*
>> - * Make sure to never grab more pages that we
>> - * might possibly need.
>> + * Don't grab more pages than the number left in the range.
>>   */
>>  if (end - next < lookup_nr)
>>  lookup_nr = end - next;
>>  
>>  /*
>> - * This pagevec_lookup() may return pages past 'end',
>> - * so we must check for page->index > end.
>> + * When no more pages are found, we are done.
>>   */
>> -if (!pagevec_lookup(&pvec, mapping, next, lookup_nr)) {
>> -if (next == start)
>> -break;
>> -next = start;
>> -continue;
>> -}
>> +if (!pagevec_lookup(&pvec, mapping, next, lookup_nr))
>> +break;
>>  
>>  for (i = 0; i < pagevec_count(&pvec); ++i) {
>>  struct page *page = pvec.pages[i];
>>  u32 hash;
>>  
>> +/*

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-11 Thread Mike Kravetz
On 2/7/19 11:31 PM, Naoya Horiguchi wrote:
> On Thu, Feb 07, 2019 at 09:50:30PM -0800, Mike Kravetz wrote:
>> On 2/7/19 6:31 PM, Naoya Horiguchi wrote:
>>> On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
>>>> On 1/30/19 1:14 PM, Mike Kravetz wrote:
>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>> @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct 
>>>>> address_space *mapping,
>>>>>   rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>>>>>   if (rc != MIGRATEPAGE_SUCCESS)
>>>>>   return rc;
>>>>> +
>>>>> + /*
>>>>> +  * page_private is subpool pointer in hugetlb pages, transfer
>>>>> +  * if needed.
>>>>> +  */
>>>>> + if (page_private(page) && !page_private(newpage)) {
>>>>> + set_page_private(newpage, page_private(page));
>>>>> + set_page_private(page, 0);
>>>
>>> You don't have to copy PagePrivate flag?
>>>
>>
>> Well my original thought was no.  For hugetlb pages, PagePrivate is not
>> associated with page_private.  It indicates a reservation was consumed.
>> It is set  when a hugetlb page is newly allocated and the allocation is
>> associated with a reservation and the global reservation count is
>> decremented.  When the page is added to the page cache or rmap,
>> PagePrivate is cleared.  If the page is free'ed before being added to page
>> cache or rmap, PagePrivate tells free_huge_page to restore (increment) the
>> reserve count as we did not 'instantiate' the page.
>>
>> So, PagePrivate is only set from the time a huge page is allocated until
>> it is added to page cache or rmap.  My original thought was that the page
>> could not be migrated during this time.  However, I am not sure if that
>> reasoning is correct.  The page is not locked, so it would appear that it
>> could be migrated?  But, if it can be migrated at this time then perhaps
>> there are bigger issues for the (hugetlb) page fault code?
> 
> In my understanding, free hugetlb pages are not expected to be passed to
> migrate_pages(), and currently that's ensured by each migration caller
> which checks and avoids free hugetlb pages on its own.
> migrate_pages() and its internal code are probably not aware of handling
> free hugetlb pages, so if they are accidentally passed to migration code,
> that's a big problem as you are concerned.
> So the above reasoning should work as long as this assumption is correct.
> 
> Most of migration callers are not interested in moving free hugepages.
> The one I'm not sure of is the code path from alloc_contig_range().
> If someone think it's worthwhile to migrate free hugepage to get bigger
> contiguous memory, he/she tries to enable that code path and the assumption
> will be broken.

You are correct.  We do not migrate free huge pages.  I was thinking more
about problems if we migrate a page while it is being added to a task's page
table as in hugetlb_no_page.

Commit bcc54222309c ("mm: hugetlb: introduce page_huge_active") addresses
this issue, but I believe there is a bug in the implementation.
isolate_huge_page contains this test:

if (!page_huge_active(page) || !get_page_unless_zero(page)) {
ret = false;
goto unlock;
}

If the condition is not met, then the huge page can be isolated and migrated.

In hugetlb_no_page, there is this block of code:

page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
ret = vmf_error(PTR_ERR(page));
goto out;
}
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
set_page_huge_active(page);

if (vma->vm_flags & VM_MAYSHARE) {
int err = huge_add_to_page_cache(page, mapping, idx);
if (err) {
put_page(page);
if (err == -EEXIST)
goto retry;
goto out;
}
} else {
lock_page(page);
if (unlikely(anon_vma_prepare(vma))) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
}
anon_rmap = 1;
}
} else {

Note that we call set_page_huge_active BEFORE locking the page.  This
means that we can isolate the page and have migration take place before the
page has been added to the page cache or anon rmap.

Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-11 Thread Mike Kravetz
On 2/11/19 6:24 PM, Naoya Horiguchi wrote:
> On Mon, Feb 11, 2019 at 03:06:27PM -0800, Mike Kravetz wrote:
>> While looking at this, I think there is another issue.  When a hugetlb
>> page is migrated, we do not migrate the 'page_huge_active' state of the
>> page.  That should be moved as the page is migrated.  Correct?
> 
> Yes, and I think that putback_active_hugepage(new_hpage) at the last step
> of migration sequence handles the copying of 'page_huge_active' state.
> 

Thanks!  I missed the putback_active_hugepage that takes care of making
the target migration page active.

-- 
Mike Kravetz


Re: [PATCH] include/linux/hugetlb.h: Convert to use vm_fault_t

2019-03-18 Thread Mike Kravetz


On 3/18/19 9:26 AM, Souptick Joarder wrote:
> kbuild produces the below warning ->
> 
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   5453a3df2a5eb49bc24615d4cf0d66b2aae05e5f
> commit 3d3539018d2c ("mm: create the new vm_fault_t type")
> reproduce:
> # apt-get install sparse
> git checkout 3d3539018d2cbd12e5af4a132636ee7fd8d43ef0
> make ARCH=x86_64 allmodconfig
> make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'
> 
>>> mm/memory.c:3968:21: sparse: incorrect type in assignment (different
>>> base types) @@ expected restricted vm_fault_t [usertype] ret @@
>>> got int @@
>mm/memory.c:3968:21:expected restricted vm_fault_t [usertype] ret
>mm/memory.c:3968:21:got int
> 
> This patch will convert to return vm_fault_t type for hugetlb_fault()
> when CONFIG_HUGETLB_PAGE =n.
> 
> Signed-off-by: Souptick Joarder 

Thanks for fixing this.

The BUG() here and in several other places in this file is unnecessary
and IMO should be cleaned up.  But that is beyond the scope of this fix.
Added to my to do list.
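
For reference, my reading is that the stub change boils down to something
like the following sketch (assumed shape only; see the actual patch for the
exact conversion):

	static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
				struct vm_area_struct *vma, unsigned long address,
				unsigned int flags)
	{
		/* !CONFIG_HUGETLB_PAGE stub: should never be reached */
		BUG();
		return 0;
	}

With the stub returning vm_fault_t, the assignment in mm/memory.c no longer
mixes int with the restricted type, which is what sparse complained about.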

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-03 Thread Mike Kravetz
On 8/29/19 12:18 AM, Michal Hocko wrote:
> [Cc cgroups maintainers]
> 
> On Wed 28-08-19 10:58:00, Mina Almasry wrote:
>> On Wed, Aug 28, 2019 at 4:23 AM Michal Hocko  wrote:
>>>
>>> On Mon 26-08-19 16:32:34, Mina Almasry wrote:
>>>>  mm/hugetlb.c  | 493 --
>>>>  mm/hugetlb_cgroup.c   | 187 +--
>>>
>>> This is a lot of changes to an already subtle code which hugetlb
>>> reservations undoubtedly are.
>>
>> For what it's worth, I think this patch series is a net decrease in
>> the complexity of the reservation code, especially the region_*
>> functions, which is where a lot of the complexity lies. I removed the
>> race between region_del and region_{add|chg}, refactored the main
>> logic into smaller code, moved common code to helpers and deleted the
>> duplicates, and finally added lots of comments to the hard to
>> understand pieces. I hope that when folks review the changes they will
>> see that! :)
> 
> Post those improvements as standalone patches and sell them as
> improvements. We can talk about the net additional complexity of the
> controller much easier then.

All such changes appear to be in patch 4 of this series.  The commit message
says "region_add() and region_chg() are heavily refactored to in this commit
to make the code easier to understand and remove duplication.".  However, the
modifications were also added to accommodate the new cgroup reservation
accounting.  I think it would be helpful to explain why the existing code does
not work with the new accounting.  For example, one change is because
"existing code coalesces resv_map entries for shared mappings.  new cgroup
accounting requires that resv_map entries be kept separate for proper
uncharging."

I am starting to review the changes, but it would help if there was a high
level description.  I also like Michal's idea of calling out the region_*
changes separately.  If not a standalone patch, at least the first patch of
the series.  This new code will be exercised even if cgroup reservation
accounting is not enabled, so it is very important that no subtle regressions
be introduced.

-- 
Mike Kravetz


Re: [PATCH v2] mm/hugetlb: avoid looping to the same hugepage if !pages and !vmas

2019-09-03 Thread Mike Kravetz
On 8/29/19 6:50 AM, Zhigang Lu wrote:
> From: Zhigang Lu 
> 
> When mmapping an existing hugetlbfs file with MAP_POPULATE, we find
> it is very time consuming. For example, mmapping a 128GB file takes
> about 50 milliseconds. Sampling with perfevent shows it spends 99%
> time in the same_page loop in follow_hugetlb_page().
> 
> samples: 205  of event 'cycles', Event count (approx.): 136686374
> -  99.04%  test_mmap_huget  [kernel.kallsyms]  [k] follow_hugetlb_page
> follow_hugetlb_page
> __get_user_pages
> __mlock_vma_pages_range
> __mm_populate
> vm_mmap_pgoff
> sys_mmap_pgoff
> sys_mmap
> system_call_fastpath
> __mmap64
> 
> follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for
> each hugepage, we run into the same_page loop for pages_per_huge_page()
> times, but doing nothing. With this change, it takes less than 1
> millisecond to mmap a 128GB file in hugetlbfs.

Thanks for the analysis!

Just curious, do you have an application that does this (mmap(MAP_POPULATE)
for an existing hugetlbfs file), or was this part of some test suite or
debug code?

> Signed-off-by: Zhigang Lu 
> Reviewed-by: Haozhong Zhang 
> Reviewed-by: Zongming Zhang 
> Acked-by: Matthew Wilcox 
> ---
>  mm/hugetlb.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6d7296d..2df941a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4391,6 +4391,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   break;
>   }
>   }

It might be helpful to add a comment here to help readers of the code.
Something like:

/*
 * If subpage information not requested, update counters
 * and skip the same_page loop below.
 */
> +
> + if (!pages && !vmas && !pfn_offset &&
> + (vaddr + huge_page_size(h) < vma->vm_end) &&
> + (remainder >= pages_per_huge_page(h))) {
> + vaddr += huge_page_size(h);
> + remainder -= pages_per_huge_page(h);
> + i += pages_per_huge_page(h);
> + spin_unlock(ptl);
> +     continue;
> +         }
> +
>  same_page:
>   if (pages) {
>   pages[i] = mem_map_offset(page, pfn_offset);
> 

With a comment added to the code,
Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-03 Thread Mike Kravetz
On 9/3/19 10:57 AM, Mike Kravetz wrote:
> On 8/29/19 12:18 AM, Michal Hocko wrote:
>> [Cc cgroups maintainers]
>>
>> On Wed 28-08-19 10:58:00, Mina Almasry wrote:
>>> On Wed, Aug 28, 2019 at 4:23 AM Michal Hocko  wrote:
>>>>
>>>> On Mon 26-08-19 16:32:34, Mina Almasry wrote:
>>>>>  mm/hugetlb.c  | 493 --
>>>>>  mm/hugetlb_cgroup.c   | 187 +--
>>>>
>>>> This is a lot of changes to an already subtle code which hugetlb
>>>> reservations undoubtedly are.
>>>
>>> For what it's worth, I think this patch series is a net decrease in
>>> the complexity of the reservation code, especially the region_*
>>> functions, which is where a lot of the complexity lies. I removed the
>>> race between region_del and region_{add|chg}, refactored the main
>>> logic into smaller code, moved common code to helpers and deleted the
>>> duplicates, and finally added lots of comments to the hard to
>>> understand pieces. I hope that when folks review the changes they will
>>> see that! :)
>>
>> Post those improvements as standalone patches and sell them as
>> improvements. We can talk about the net additional complexity of the
>> controller much easier then.
> 
> All such changes appear to be in patch 4 of this series.  The commit message
> says "region_add() and region_chg() are heavily refactored to in this commit
> to make the code easier to understand and remove duplication.".  However, the
> modifications were also added to accommodate the new cgroup reservation
> accounting.  I think it would be helpful to explain why the existing code does
> not work with the new accounting.  For example, one change is because
> "existing code coalesces resv_map entries for shared mappings.  new cgroup
> accounting requires that resv_map entries be kept separate for proper
> uncharging."
> 
> I am starting to review the changes, but it would help if there was a high
> level description.  I also like Michal's idea of calling out the region_*
> changes separately.  If not a standalone patch, at least the first patch of
> the series.  This new code will be exercised even if cgroup reservation
> accounting is not enabled, so it is very important that no subtle regressions
> be introduced.

While looking at the region_* changes, I started thinking about this no
coalesce change for shared mappings which I think is necessary.  Am I
mistaken, or is this a requirement?

If it is a requirement, then think about some of the possible scenarios
such as:
- There is a hugetlbfs file of size 10 huge pages.
- Task A has reservations for pages at offset 1 3 5 7 and 9
- Task B then mmaps the entire file which should result in reservations
  at 0 2 4 6 and 8.
- region_chg will return 5, but will also need to allocate 5 resv_map
  entries for the subsequent region_add which can not fail.  Correct?
  The code does not appear to handle this.

BTW, this series will BUG when running libhugetlbfs test suite.  It will
hit this in resv_map_release().

VM_BUG_ON(resv_map->adds_in_progress);

-- 
Mike Kravetz


Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-16 Thread Mike Kravetz
On 8/15/19 4:04 PM, Mina Almasry wrote:
> On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz  wrote:
>>
>> On 8/13/19 4:54 PM, Mike Kravetz wrote:
>>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
>>>> in the resv_map entries, in file_region->reservation_counter.
>>>>
>>>> When a file_region entry is added to the resv_map via region_add, we
>>>> also charge the appropriate hugetlb_cgroup and put the pointer to that
>>>> in file_region->reservation_counter. This is slightly delicate since we
>>>> need to not modify the resv_map until we know that charging the
>>>> reservation has succeeded. If charging doesn't succeed, we report the
>>>> error to the caller, so that the kernel fails the reservation.
>>>
>>> I wish we did not need to modify these region_() routines as they are
>>> already difficult to understand.  However, I see no other way with the
>>> desired semantics.
>>>
>>
>> I suspect you have considered this, but what about using the return value
>> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
>> There is a VERY SMALL race where the value could be too large, but that
>> can be checked and adjusted at region_add time as is done with normal
>> accounting today.
> 
> I have not actually until now; I didn't consider doing stuff with the
> resv_map while not holding onto the resv_map->lock. I guess that's the
> small race you're talking about. Seems fine to me, but I'm more
> worried about hanging off the vma below.

This race is already handled for other 'reservation like' things in
hugetlb_reserve_pages.  So, I don't think the race is much of an issue.

>> If the question is, where would we store the information
>> to uncharge?, then we can hang a structure off the vma.  This would be
>> similar to what is done for private mappings.  In fact, I would suggest
>> making them both use a new cgroup reserve structure hanging off the vma.
>>
> 
> I actually did consider hanging off the info to uncharge off the vma,
> but I didn't for a couple of reasons:
> 
> 1. region_del is called from hugetlb_unreserve_pages, and I don't have
> access to the vma there. Maybe there is a way to query the proper vma
> I don't know about?

I am still thinking about closely tying cgroup reservation limits/usage
to existing reservation accounting.  Of most concern (to me) is handling
shared mappings.  Reservations created for shared mappings are more
associated with the inode/file than individual mappings.  For example,
consider a task which mmaps(MAP_SHARED) a hugetlbfs file.  At mmap time
reservations are created based on the size of the mmap.  Now, if the task
unmaps and/or exits the reservations still exist as they are associated
with the file rather than the mapping.

Honestly, I think this existing reservation behavior is wrong or at least
not desirable.  Because there are outstanding reservations, the number of
reserved huge pages can not be used for other purposes.  It is also very
difficult for a user or admin to determine the source of the reservations.
No one is currently complaining about this behavior.  This proposal just
made me think about it.

Tying cgroup reservation limits/usage to existing reservation accounting
will introduce the same issues there.  We will need to clearly document the
behavior.

> 2. hugetlb_reserve_pages seems to be able to conduct a reservation
> with a NULL *vma. Not sure what to do in that case.
> 
> Is there a way to get around these that I'm missing here?

You are correct.  The !vma case is there for System V shared memory such
as a call to shmget(SHM_HUGETLB).  In this case, reservations for the
entire shared memory segment are taken at shmget time.

In your model, the caller of shmget who creates the shared memory segment
would get charged for all the reservations.  Users, (those calling shmat)
would not be charged.

> FWIW I think tracking is better in resv_map since the reservations are
> in resv_map themselves. If I do another structure, then for each
> reservation there will be an entry in resv_map and an entry in the new
> structure and they need to be kept in sync and I need to handle errors
> for when they get out of sync.

I think you may be correct.  However, this implies that we will need to
change the way we do reservation in the resv_map for shared mappings.
I will comment on that in reply to patch 4.

-- 
Mike Kravetz


Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-16 Thread Mike Kravetz
On 8/15/19 4:08 PM, Mina Almasry wrote:
> On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz  wrote:
>>>  mm/hugetlb.c | 208 +--
>>>  1 file changed, 170 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 235996aef6618..d76e3137110ab 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -242,8 +242,72 @@ struct file_region {
>>>   struct list_head link;
>>>   long from;
>>>   long to;
>>> +#ifdef CONFIG_CGROUP_HUGETLB
>>> + /*
>>> +  * On shared mappings, each reserved region appears as a struct
>>> +  * file_region in resv_map. These fields hold the info needed to
>>> +  * uncharge each reservation.
>>> +  */
>>> + struct page_counter *reservation_counter;
>>> + unsigned long pages_per_hpage;
>>> +#endif
>>>  };
>>>
>>> +/* Must be called with resv->lock held. Calling this with dry_run == true 
>>> will
>>> + * count the number of pages added but will not modify the linked list.
>>> + */
>>> +static long consume_regions_we_overlap_with(struct file_region *rg,
>>> + struct list_head *head, long f, long *t,
>>> + struct hugetlb_cgroup *h_cg,
>>> + struct hstate *h,
>>> + bool dry_run)
>>> +{
>>> + long add = 0;
>>> + struct file_region *trg = NULL, *nrg = NULL;
>>> +
>>> + /* Consume any regions we now overlap with. */
>>> + nrg = rg;
>>> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
>>> + if (&rg->link == head)
>>> + break;
>>> + if (rg->from > *t)
>>> + break;
>>> +
>>> + /* If this area reaches higher then extend our area to
>>> +  * include it completely.  If this is not the first area
>>> +  * which we intend to reuse, free it.
>>> +  */
>>> + if (rg->to > *t)
>>> + *t = rg->to;
>>> + if (rg != nrg) {
>>> + /* Decrement return value by the deleted range.
>>> +  * Another range will span this area so that by
>>> +  * end of routine add will be >= zero
>>> +  */
>>> + add -= (rg->to - rg->from);
>>> + if (!dry_run) {
>>> +     list_del(&rg->link);
>>> + kfree(rg);
>>
>> Is it possible that the region struct we are deleting pointed to
>> a reservation_counter?  Perhaps even for another cgroup?
>> Just concerned with the way regions are coalesced that we may be
>> deleting counters.
>>
> 
> Yep, that needs to be handled I think. Thanks for catching!
> 

I believe that we will no longer be able to coalesce resv_map entries
for shared mappings.  That is because we need to record who is responsible
for creating reservation entries.

-- 
Mike Kravetz


Re: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

2019-09-05 Thread Mike Kravetz
On 9/5/19 4:22 AM, Vlastimil Babka wrote:
> On 9/5/19 11:00 AM, Michal Hocko wrote:
>> [Ccing Mike for checking on the hugetlb side of this change]
>> On Wed 04-09-19 12:54:22, David Rientjes wrote:
>>> @@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
>>> order,
>>> if (page)
>>> goto got_pg;
>>>  
>>> +if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
>>> +   /*
>>> +* If allocating entire pageblock(s) and compaction
>>> +* failed because all zones are below low watermarks
>>> +* or is prohibited because it recently failed at this
>>> +* order, fail immediately.
>>> +*
>>> +* Reclaim is
>>> +*  - potentially very expensive because zones are far
>>> +*below their low watermarks or this is part of very
>>> +*bursty high order allocations,
>>> +*  - not guaranteed to help because isolate_freepages()
>>> +*may not iterate over freed pages as part of its
>>> +*linear scan, and
>>> +*  - unlikely to make entire pageblocks free on its
>>> +*own.
>>> +*/
>>> +   if (compact_result == COMPACT_SKIPPED ||
>>> +   compact_result == COMPACT_DEFERRED)
>>> +   goto nopage;
> 
> As I said, I expect this will make hugetlbfs reservations fail
> prematurely - Mike can probably confirm or disprove that.

I don't have a specific test for this.  It is somewhat common for people
to want to allocate "as many hugetlb pages as possible".  Therefore, they
will try to allocate more pages than reasonable for their environment and
take what they can get.  I 'tested' by simply creating some background
activity and then seeing how many hugetlb pages could be allocated.  Of
course, many tries over time in a loop.

This patch did not cause premature allocation failures in my limited testing.
The number of pages which could be allocated with and without patch were
pretty much the same.

Do note that I tested on top of Andrew's tree which contains this series:
http://lkml.kernel.org/r/20190806014744.15446-1-mike.krav...@oracle.com
Patch 3 in that series causes allocations to fail sooner in the case of
COMPACT_DEFERRED:
http://lkml.kernel.org/r/20190806014744.15446-4-mike.krav...@oracle.com

hugetlb allocations have the __GFP_RETRY_MAYFAIL flag set.  They are willing
to retry and wait and callers are aware of this.  Even though my limited
testing did not show regressions caused by this patch, I would prefer if the
quick exit did not apply to __GFP_RETRY_MAYFAIL requests.
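
Concretely, the kind of exemption I have in mind would look roughly like the
sketch below, layered on the quoted hunk above (untested, for illustration
only):

	if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
	    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
		/*
		 * Callers such as hugetlb which set __GFP_RETRY_MAYFAIL have
		 * explicitly asked for the more expensive reclaim/compaction
		 * retry loop, so do not bail out early for them.
		 */
		if (compact_result == COMPACT_SKIPPED ||
		    compact_result == COMPACT_DEFERRED)
			goto nopage;
	}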
-- 
Mike Kravetz


Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD

2019-09-11 Thread Mike Kravetz
On 9/11/19 8:44 AM, Waiman Long wrote:
> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>> degradation (random multi-second delays) was observed when thousands
>>> of processes are trying to fault in the data into the huge pages. The
>>> likelihood of the delay increases with the number of sockets and hence
>>> the CPUs a system has.  This only happens in the initial setup phase
>>> and will be gone after all the necessary data are faulted in.
>> Can;t the application just specify MAP_POPULATE?
> 
> Originally, I thought that this happened in the startup phase when the
> pages were faulted in. The problem persists after steady state had been
> reached though. Every time you have a new user process created, it will
> have its own page table.

This is still at fault time.  Although, for the particular application it
may be after the 'startup phase'.

>  It is the sharing of the of huge page shared
> memory that is causing problem. Of course, it depends on how the
> application is written.

It may be the case that some applications would find the delays acceptable
for the benefit of shared pmds once they reach steady state.  As you say, of
course this depends on how the application is written.

I know that Oracle DB would not like it if PMD sharing is disabled for them.
Based on what I know of their model, all processes which share PMDs perform
faults (write or read) during the startup phase.  This is in environments as
big or bigger than you describe above.  I have never looked at/for delays in
these environments around pmd sharing (page faults), but that does not mean
they do not exist.  I will try to get the DB group to give me access to one
of their large environments for analysis.

We may want to consider making the timeout value and disable threshold user
configurable.
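
Something as simple as the sketch below would do for experimentation
(parameter names are made up; a sysctl would work just as well):

	/* Hypothetical knobs replacing the hard coded 10ms and 256 values */
	static unsigned int pmd_share_lock_timeout_ms = 10;
	module_param(pmd_share_lock_timeout_ms, uint, 0644);

	static unsigned int pmd_share_disable_threshold = 256;
	module_param(pmd_share_disable_threshold, uint, 0644);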
-- 
Mike Kravetz


Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD

2019-09-11 Thread Mike Kravetz
On 9/11/19 8:05 AM, Waiman Long wrote:
> When allocating a large amount of static hugepages (~500-1500GB) on a
> system with large number of CPUs (4, 8 or even 16 sockets), performance
> degradation (random multi-second delays) was observed when thousands
> of processes are trying to fault in the data into the huge pages. The
> likelihood of the delay increases with the number of sockets and hence
> the CPUs a system has.  This only happens in the initial setup phase
> and will be gone after all the necessary data are faulted in.
> 
> These random delays, however, are deemed unacceptable. The cause of
> that delay is the long wait time in acquiring the mmap_sem when trying
> to share the huge PMDs.
> 
> To remove the unacceptable delays, we have to limit the amount of wait
> time on the mmap_sem. So the new down_write_timedlock() function is
> used to acquire the write lock on the mmap_sem with a timeout value of
> 10ms which should not cause a perceivable delay. If timeout happens,
> the task will abandon its effort to share the PMD and allocate its own
> copy instead.
> 

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6d7296dd11b8..445af661ae29 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4750,6 +4750,8 @@ void adjust_range_if_pmd_sharing_possible(struct 
> vm_area_struct *vma,
>   }
>  }
>  
> +#define PMD_SHARE_DISABLE_THRESHOLD  (1 << 8)
> +
>  /*
>   * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
>   * and returns the corresponding pte. While this is not necessary for the
> @@ -4770,11 +4772,24 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned 
> long addr, pud_t *pud)
>   pte_t *spte = NULL;
>   pte_t *pte;
>   spinlock_t *ptl;
> + static atomic_t timeout_cnt;
>  
> - if (!vma_shareable(vma, addr))
> - return (pte_t *)pmd_alloc(mm, pud, addr);
> + /*
> +  * Don't share if it is not sharable or locking attempt timed out
> +  * after 10ms. After 256 timeouts, PMD sharing will be permanently
> +  * disabled as it is just too slow.
> +  */
> + if (!vma_shareable(vma, addr) ||
> +(atomic_read(&timeout_cnt) >= PMD_SHARE_DISABLE_THRESHOLD))
> + goto out_no_share;
> +
> + if (!i_mmap_timedlock_write(mapping, ms_to_ktime(10))) {
> + if (atomic_inc_return(&timeout_cnt) ==
> + PMD_SHARE_DISABLE_THRESHOLD)
> + pr_info("Hugetlbfs PMD sharing disabled because of 
> timeouts!\n");
> + goto out_no_share;
> + }
>  
> - i_mmap_lock_write(mapping);

All this got me wondering if we really need to take i_mmap_rwsem in write
mode here.  We are not changing the tree, only traversing it looking for
a suitable vma.

Unless I am missing something, the hugetlb code only ever takes the semaphore
in write mode; never read.  Could this have been the result of changing the
tree semaphore to read/write?  Instead of analyzing all the code, the easiest
and safest thing would have been to take all accesses in write mode.

I can investigate more, but wanted to ask the question in case someone already
knows.

At one time, I thought it was safe to acquire the semaphore in read mode for
huge_pmd_share, but write mode for huge_pmd_unshare.  See commit b43a99900559.
This was reverted along with another patch for other reasons.

If we change from write to read mode, this may have a significant impact
on the stalls.
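
For reference, the experiment I am suggesting is only a change of lock mode
around the interval tree walk in huge_pmd_share(), roughly as below
(untested sketch; whether read mode is actually safe against
huge_pmd_unshare is exactly the question above):

	i_mmap_lock_read(mapping);	/* was i_mmap_lock_write() */
	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;

		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			spte = huge_pte_offset(svma->vm_mm, saddr,
					       vma_mmu_pagesize(svma));
			if (spte) {
				get_page(virt_to_page(spte));
				break;
			}
		}
	}
	/* ... populate the pud from spte under the page table lock, as today ... */
	i_mmap_unlock_read(mapping);	/* was i_mmap_unlock_write() */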
-- 
Mike Kravetz


Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD

2019-09-12 Thread Mike Kravetz
On 9/12/19 2:06 AM, Waiman Long wrote:
> If we can take the rwsem in read mode, that should solve the problem
> AFAICS. As I don't have a full understanding of the history of that
> code, I didn't try to do that in my patch.

Do you still have access to an environment that creates the long stalls?
If so, can you try the simple change of taking the semaphore in read mode
in huge_pmd_share.

-- 
Mike Kravetz


Re: [PATCH v4 4/9] hugetlb: region_chg provides only cache entry

2019-09-16 Thread Mike Kravetz
On 9/10/19 4:31 PM, Mina Almasry wrote:
> Current behavior is that region_chg provides both a cache entry in
> resv->region_cache, AND a placeholder entry in resv->regions. region_add
> first tries to use the placeholder, and if it finds that the placeholder
> has been deleted by a racing region_del call, it uses the cache entry.
> 
> This behavior is completely unnecessary and is removed in this patch for
> a couple of reasons:
> 
> 1. region_add needs to either find a cached file_region entry in
>resv->region_cache, or find an entry in resv->regions to expand. It
>does not need both.
> 2. region_chg adding a placeholder entry in resv->regions opens up
>a possible race with region_del, where region_chg adds a placeholder
>region in resv->regions, and this region is deleted by a racing call
>to region_del during region_chg execution or before region_add is
>called. Removing the race makes the code easier to reason about and
>maintain.
> 
> In addition, a follow up patch in this series disables region
> coalescing, which would be further complicated if the race with
> region_del exists.
> 
> Signed-off-by: Mina Almasry 

Thanks.  I like this modification as it does simplify the code and could
be added as a general cleanup independent of the other changes.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 63 +---
>  1 file changed, 11 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index fbd7c52e17348..bea51ae422f63 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -246,14 +246,10 @@ struct file_region {
> 
>  /*
>   * Add the huge page range represented by [f, t) to the reserve
> - * map.  In the normal case, existing regions will be expanded
> - * to accommodate the specified range.  Sufficient regions should
> - * exist for expansion due to the previous call to region_chg
> - * with the same range.  However, it is possible that region_del
> - * could have been called after region_chg and modifed the map
> - * in such a way that no region exists to be expanded.  In this
> - * case, pull a region descriptor from the cache associated with
> - * the map and use that for the new range.
> + * map.  Existing regions will be expanded to accommodate the specified
> + * range, or a region will be taken from the cache.  Sufficient regions
> + * must exist in the cache due to the previous call to region_chg with
> + * the same range.
>   *
>   * Return the number of new huge pages added to the map.  This
>   * number is greater than or equal to zero.
> @@ -272,9 +268,8 @@ static long region_add(struct resv_map *resv, long f, 
> long t)
> 
>   /*
>* If no region exists which can be expanded to include the
> -  * specified range, the list must have been modified by an
> -  * interleving call to region_del().  Pull a region descriptor
> -  * from the cache and use it for this range.
> +  * specified range, pull a region descriptor from the cache
> +  * and use it for this range.
>*/
>   if (&rg->link == head || t < rg->from) {
>   VM_BUG_ON(resv->region_cache_count <= 0);
> @@ -339,15 +334,9 @@ static long region_add(struct resv_map *resv, long f, 
> long t)
>   * call to region_add that will actually modify the reserve
>   * map to add the specified range [f, t).  region_chg does
>   * not change the number of huge pages represented by the
> - * map.  However, if the existing regions in the map can not
> - * be expanded to represent the new range, a new file_region
> - * structure is added to the map as a placeholder.  This is
> - * so that the subsequent region_add call will have all the
> - * regions it needs and will not fail.
> - *
> - * Upon entry, region_chg will also examine the cache of region descriptors
> - * associated with the map.  If there are not enough descriptors cached, one
> - * will be allocated for the in progress add operation.
> + * map.  A new file_region structure is added to the cache
> + * as a placeholder, so that the subsequent region_add
> + * call will have all the regions it needs and will not fail.
>   *
>   * Returns the number of huge pages that need to be added to the existing
>   * reservation map for the range [f, t).  This number is greater or equal to
> @@ -357,10 +346,9 @@ static long region_add(struct resv_map *resv, long f, 
> long t)
>  static long region_chg(struct resv_map *resv, long f, long t)
>  {
>   struct list_head *head = &resv->regions;
> - struct file_region *rg, *nrg = NULL;
> + struct file_region *rg;
>   long chg = 0;
> 
> -retry:
>   

Re: [PATCH v4 5/9] hugetlb: remove duplicated code

2019-09-16 Thread Mike Kravetz
On 9/10/19 4:31 PM, Mina Almasry wrote:
> Remove duplicated code between region_chg and region_add, and refactor it into
> a common function, add_reservation_in_range. This is mostly done because
> there is a follow up change in this series that disables region
> coalescing in region_add, and I want to make that change in one place
> only. It should improve maintainability anyway on its own.
> 
> Signed-off-by: Mina Almasry 

Like the previous patch, this is a good improvement independent of the
rest of the series.  Thanks!

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 116 ---
>  1 file changed, 54 insertions(+), 62 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bea51ae422f63..ce5ed1056fefd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -244,6 +244,57 @@ struct file_region {
>   long to;
>  };
> 
> +static long add_reservation_in_range(
> + struct resv_map *resv, long f, long t, bool count_only)
> +{
> +
> + long chg = 0;
> + struct list_head *head = &resv->regions;
> + struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> +
> + /* Locate the region we are before or in. */
> + list_for_each_entry(rg, head, link)
> + if (f <= rg->to)
> + break;
> +
> + /* Round our left edge to the current segment if it encloses us. */
> + if (f > rg->from)
> + f = rg->from;
> +
> + chg = t - f;
> +
> + /* Check for and consume any regions we now overlap with. */
> + nrg = rg;
> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> + if (&rg->link == head)
> + break;
> + if (rg->from > t)
> + break;
> +
> + /* We overlap with this area, if it extends further than
> +  * us then we must extend ourselves.  Account for its
> +  * existing reservation.
> +  */
> + if (rg->to > t) {
> + chg += rg->to - t;
> + t = rg->to;
> + }
> + chg -= rg->to - rg->from;
> +
> + if (!count_only && rg != nrg) {
> + list_del(&rg->link);
> + kfree(rg);
> + }
> + }
> +
> + if (!count_only) {
> + nrg->from = f;
> + nrg->to = t;
> + }
> +
> + return chg;
> +}
> +
>  /*
>   * Add the huge page range represented by [f, t) to the reserve
>   * map.  Existing regions will be expanded to accommodate the specified
> @@ -257,7 +308,7 @@ struct file_region {
>  static long region_add(struct resv_map *resv, long f, long t)
>  {
>   struct list_head *head = &resv->regions;
> - struct file_region *rg, *nrg, *trg;
> + struct file_region *rg, *nrg;
>   long add = 0;
> 
>   spin_lock(&resv->lock);
> @@ -287,38 +338,7 @@ static long region_add(struct resv_map *resv, long f, 
> long t)
>   goto out_locked;
>   }
> 
> - /* Round our left edge to the current segment if it encloses us. */
> - if (f > rg->from)
> - f = rg->from;
> -
> - /* Check for and consume any regions we now overlap with. */
> - nrg = rg;
> - list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> - if (&rg->link == head)
> - break;
> - if (rg->from > t)
> - break;
> -
> - /* If this area reaches higher then extend our area to
> -  * include it completely.  If this is not the first area
> -  * which we intend to reuse, free it. */
> - if (rg->to > t)
> - t = rg->to;
> - if (rg != nrg) {
> - /* Decrement return value by the deleted range.
> -  * Another range will span this area so that by
> -  * end of routine add will be >= zero
> -  */
> - add -= (rg->to - rg->from);
> - list_del(&rg->link);
> - kfree(rg);
> - }
> - }
> -
> - add += (nrg->from - f); /* Added to beginning of region */
> - nrg->from = f;
> - add += t - nrg->to; /* Added to end of region */
> - nrg->to = t;
> + add = add_reservation_in_range(resv, f, t, false);
> 
>  out_locked:
>   resv->adds_in_progress--;

Re: [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing

2019-09-16 Thread Mike Kravetz
On 9/10/19 4:31 PM, Mina Almasry wrote:
> A follow up patch in this series adds hugetlb cgroup uncharge info the
> file_region entries in resv->regions. The cgroup uncharge info may
> differ for different regions, so they can no longer be coalesced at
> region_add time. So, disable region coalescing in region_add in this
> patch.
> 
> Behavior change:
> 
> Say a resv_map exists like this [0->1], [2->3], and [5->6].
> 
> Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> 
> Old code would generate resv->regions: [0->5], [5->6].
> New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> [5->6].
> 
> Special care needs to be taken to handle the resv->adds_in_progress
> variable correctly. In the past, only 1 region would be added for every
> region_chg and region_add call. But now, each call may add multiple
> regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> region_chg calls add_reservation_in_range() to count the number of regions
> needed and allocates those, and that info is passed to region_add and
> region_abort to decrement adds_in_progress correctly.

Hate to throw more theoretical examples at you but ...

Consider an existing resv_map like [3-10]
Then a region_chg/add call comes in region_chg/add(f=0, t=10).
The region_chg is going to return 3 (additional reservations needed), and
also out_regions_needed = 1 as it would want to create a region [0-3].
Correct?
But, there is nothing to prevent another thread from doing a region_del [5-7]
after the region_chg and before region_add.  Correct?
If so, it seems the region_add would need to create two regions, but there
is only one in the cache and we would BUG in get_file_region_entry_from_cache.
Am I reading the code correctly?

The existing code wants to make sure region_add called after region_chg will
never return an error.  This is why all needed allocations were done in the
region_chg call, and it was relatively easy to do in existing code when
region_chg would only need one additional region at most.

I'm thinking that we may have to make region_chg allocate the worst case
number of regions (t - f)/2, OR change the code such that region_add
could return an error.
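
The first option could be as simple as the sketch below in region_chg
(following the existing allocate-outside-the-lock pattern; in the worst
case every other page in [f, t) is a hole, hence the rounded-up half):

	/* Sketch only: pre-allocate cache entries for the worst case. */
	long max_needed = (t - f + 1) / 2;

	while (resv->region_cache_count <
	       resv->adds_in_progress + max_needed) {
		struct file_region *trg;

		/* Must drop lock to allocate a new descriptor. */
		spin_unlock(&resv->lock);
		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
		if (!trg)
			return -ENOMEM;

		spin_lock(&resv->lock);
		list_add(&trg->link, &resv->region_cache);
		resv->region_cache_count++;
	}

The obvious downside is allocating more than needed in the common case,
which is the tradeoff against letting region_add return an error.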
-- 
Mike Kravetz


Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

2019-09-27 Thread Mike Kravetz
> + nrg = get_file_region_entry_from_cache(
> + resv, last_accounted_offset, rg->from);
> + list_add(&nrg->link, rg->link.prev);
> + }
>   }
> - chg -= rg->to - rg->from;
> 
> - if (!count_only && rg != nrg) {
> - list_del(&rg->link);
> - kfree(rg);
> - }
> + last_accounted_offset = rg->to;
>   }
> 
> - if (!count_only) {
> - nrg->from = f;
> - nrg->to = t;
> + /* Handle the case where our range extends beyond
> +  * last_accounted_offset.
> +  */
> + if (last_accounted_offset < t) {
> + add += t - last_accounted_offset;
> + if (!count_only) {
> + nrg = get_file_region_entry_from_cache(
> + resv, last_accounted_offset, t);
> + list_add(&nrg->link, rg->link.prev);
> + }
> + last_accounted_offset = t;
>   }
> 
> - return chg;
> + return add;
>  }
> 
>  /*
> @@ -305,46 +321,24 @@ static long add_reservation_in_range(struct resv_map 
> *resv, long f, long t,

The start of this comment block says,

/*
 * Add the huge page range represented by [f, t) to the reserve
 * map.  Existing regions will be expanded to accommodate the specified
 * range, or a region will be taken from the cache.

We are no longer expanding existing regions.  Correct?
As an optimization, I guess we could coalesce/combine region entries as
long as they are for the same cgroup.  However, it may not be worth the
effort.
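
If we ever did want that optimization, a sketch of the check it would need,
based on the uncharge fields this series adds to struct file_region (helper
name made up; field names as in the earlier RFC and may have changed since):

	static bool same_uncharge_info(struct file_region *a,
				       struct file_region *b)
	{
	#ifdef CONFIG_CGROUP_HUGETLB
		/* Only merge entries that would be uncharged the same way. */
		return a && b &&
		       a->reservation_counter == b->reservation_counter &&
		       a->pages_per_hpage == b->pages_per_hpage;
	#else
		return true;
	#endif
	}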

>   * must exist in the cache due to the previous call to region_chg with
>   * the same range.
>   *
> + * regions_needed is the out value provided by a previous
> + * call to region_chg.
> + *
>   * Return the number of new huge pages added to the map.  This
>   * number is greater than or equal to zero.
>   */
> -static long region_add(struct resv_map *resv, long f, long t)
> +static long region_add(struct resv_map *resv, long f, long t,
> +long regions_needed)
>  {
> - struct list_head *head = &resv->regions;
> - struct file_region *rg, *nrg;
>   long add = 0;
> 
>   spin_lock(&resv->lock);
> - /* Locate the region we are either in or before. */
> - list_for_each_entry(rg, head, link)
> - if (f <= rg->to)
> - break;
> 
> - /*
> -  * If no region exists which can be expanded to include the
> -  * specified range, pull a region descriptor from the cache
> -  * and use it for this range.
> -  */
> - if (&rg->link == head || t < rg->from) {
> - VM_BUG_ON(resv->region_cache_count <= 0);
> -
> - resv->region_cache_count--;
> - nrg = list_first_entry(&resv->region_cache, struct file_region,
> - link);
> - list_del(&nrg->link);
> -
> - nrg->from = f;
> - nrg->to = t;
> - list_add(&nrg->link, rg->link.prev);
> -
> - add += t - f;
> - goto out_locked;
> - }
> + VM_BUG_ON(resv->region_cache_count < regions_needed);
> 
>   add = add_reservation_in_range(resv, f, t, false);
> + resv->adds_in_progress -= regions_needed;

Consider this example,

- region_chg(1,2)
adds_in_progress = 1
cache entries 1
- region_chg(3,4)
adds_in_progress = 2
cache entries 2
- region_chg(5,6)
adds_in_progress = 3
cache entries 3

At this point, no region descriptors are in the map because only
region_chg has been called.

- region_chg(0,6)
adds_in_progress = 4
cache entries 4

Is that correct so far?

Then the following sequence happens,

- region_add(1,2)
adds_in_progress = 3
cache entries 3
- region_add(3,4)
adds_in_progress = 2
cache entries 2
- region_add(5,6)
adds_in_progress = 1
cache entries 1

list of region descriptors is:
[1->2] [3->4] [5->6]

- region_add(0,6)
This is going to require 3 cache entries but only one is in the cache.
I think we are going to BUG in get_file_region_entry_from_cache() the
second time it is called from add_reservation_in_range().

I stopped looking at the code here as things will need to change if this
is a real issue.
-- 
Mike Kravetz


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mike Kravetz
On 9/26/19 5:55 PM, Mina Almasry wrote:
> Provided we keep the existing controller untouched, should the new
> controller track:
> 
> 1. only reservations, or
> 2. both reservations and allocations for which no reservations exist
> (such as the MAP_NORESERVE case)?
> 
> I like the 'both' approach. Seems to me a counter like that would work
> automatically regardless of whether the application is allocating
> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> into reserved hugetlb pages, correct?

Correct.  One other easy way to allocate huge pages without reserves
(that I know is used today) is via the fallocate system call.

>   If so, then applications that
> allocate with NORESERVE will get sigbused when they hit their limit,
> and applications that allocate without NORESERVE may get an error at
> mmap time but will always be within their limits while they access the
> mmap'd memory, correct?

Correct.  At page allocation time we can easily check to see if a reservation
exists and not charge.  For any specific page within a hugetlbfs file,
a charge would happen at mmap time or allocation time.

One exception (that I can think of) to this mmap(RESERVE) will not cause
a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
a file, not only do they remove pages associated with the file but the
reservation information as well.  Therefore, a subsequent fault will be
the same as an allocation without reservation.

I 'think' the code to remove/truncate a file will work correctly as it
is today, but I need to think about this some more.

> mmap'd memory, correct? So the 'both' counter seems like a one size
> fits all.
> 
> I think the only sticking point left is whether an added controller
> can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> on that I'll provide a patchset.

Sorry, but I can not provide cgroup expertise.
-- 
Mike Kravetz


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mike Kravetz
On 9/27/19 3:51 PM, Mina Almasry wrote:
> On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz  wrote:
>>
>> On 9/26/19 5:55 PM, Mina Almasry wrote:
>>> Provided we keep the existing controller untouched, should the new
>>> controller track:
>>>
>>> 1. only reservations, or
>>> 2. both reservations and allocations for which no reservations exist
>>> (such as the MAP_NORESERVE case)?
>>>
>>> I like the 'both' approach. Seems to me a counter like that would work
>>> automatically regardless of whether the application is allocating
>>> hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
>>> into reserved hugetlb pages, correct?
>>
>> Correct.  One other easy way to allocate huge pages without reserves
>> (that I know is used today) is via the fallocate system call.
>>
>>>   If so, then applications that
>>> allocate with NORESERVE will get sigbused when they hit their limit,
>>> and applications that allocate without NORESERVE may get an error at
>>> mmap time but will always be within their limits while they access the
>>> mmap'd memory, correct?
>>
>> Correct.  At page allocation time we can easily check to see if a reservation
>> exists and not charge.  For any specific page within a hugetlbfs file,
>> a charge would happen at mmap time or allocation time.
>>
>> One exception (that I can think of) to this mmap(RESERVE) will not cause
>> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
>> a file, not only do they remove pages associated with the file but the
>> reservation information as well.  Therefore, a subsequent fault will be
>> the same as an allocation without reservation.
>>
> 
> I don't think it causes a sigbus. This is the scenario, right:
> 
> 1. Make cgroup with limit X bytes.
> 2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
> 3. A hole of size Y is punched in the file, causing the cgroup to get
> uncharged Y bytes.
> 4. The task faults in memory from the hole, getting charged up to Y
> bytes again. But they will be still within their limits.
> 
> IIUC userspace only gets sigbus'd if the limit is lowered between
> steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.

You are correct.  That was my mistake.  I was still thinking of behavior
for a reservation only cgroup model.  It would behave as you describe
above (no SIGBUS) for a combined reservation/allocate model.
-- 
Mike Kravetz


Re: [PATCH] mm, hugetlb: allow hugepage allocations to excessively reclaim

2019-10-07 Thread Mike Kravetz
On 10/7/19 12:55 AM, Michal Hocko wrote:
> From: David Rientjes 
> 
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") has chnaged the allocator to bail out from the
> allocator early to prevent from a potentially excessive memory
> reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
> reclaim and compaction loop as long as there is a reasonable chance to
> make a forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED
> at the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
> 
> The most obvious affected subsystem is hugetlbfs which allocates huge
> pages based on an admin request (or via admin configured overcommit).
> I have done a simple test which tries to allocate half of the memory
> for hugetlb pages while the memory is full of a clean page cache. This
> is not an unusual situation because we try to cache as much of the
> memory as possible and sysctl/sysfs interface to allocate huge pages is
> there for flexibility to allocate hugetlb pages at any time.
> 
> System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
> after the memory is prefilled by a clean page cache:
> root@test1:~# cat hugetlb_test.sh
> 
> set -x
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/compact_memory
> dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> TS=$(date +%s)
> echo 256 > /proc/sys/vm/nr_hugepages
> cat /proc/sys/vm/nr_hugepages
> 
> The results for 2 consecutive runs on clean 5.3
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> + date +%s
> + TS=1569905284
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> + date +%s
> + TS=1569905311
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> 
> Now with b39d0ee2632d applied
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> + date +%s
> + TS=1569905516
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 11
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> + date +%s
> + TS=1569905541
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 12
> 
> The success rate went down by factor of 20!
> 
> Although hugetlb allocation requests might fail, and it is reasonable to
> expect them to under extremely fragmented memory or when the memory is
> under heavy pressure, the above situation is not such a case.
> 
> Fix the regression by reverting to the previous behavior for
> __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
> those requests.

Thank you Michal for doing this.

hugetlbfs allocations are commonly done via sysctl/sysfs shortly after boot
where this may not be as much of an issue.  However, I am aware of at least
three use cases where allocations are made after the system has been up and
running for quite some time:
- DB reconfiguration.  If sysctl/sysfs fails to get the required number of huge
  pages, the system is rebooted to perform the allocation after boot.
- VM provisioning.  If unable to get the required number of huge pages, fall
  back to base pages.
- An application that does not preallocate a pool, but rather allocates pages
  at fault time for optimal NUMA locality.
In all cases, I would expect b39d0ee2632d to cause regressions and noticeable
behavior changes.

My quick/limited testing in [1] was insufficient.  It was also mentioned that
if something like b39d0ee2632d went forward, I would like exemptions for
__GFP_RETRY_MAYFAIL requests as in this patch.
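
For illustration, the idea amounts to something like the fragment below in
the page allocator slow path.  This is a sketch of the intent only, not
necessarily the exact hunk in the patch:

	/* after the initial compaction attempt for costly orders */
	if (compact_result == COMPACT_SKIPPED ||
	    compact_result == COMPACT_DEFERRED) {
		/*
		 * __GFP_RETRY_MAYFAIL callers (hugetlb pool allocations,
		 * for example) explicitly asked for the full reclaim and
		 * compaction retry loop, so do not take the cheap
		 * bail-out on their behalf.
		 */
		if (!(gfp_mask & __GFP_RETRY_MAYFAIL))
			goto nopage;
	}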

> 
> [mho...@suse.com: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction 
> may not succeed")
> Cc: Mike Kravetz 
> Signed-off-by: David Rientjes 
> Signed-off-by: Michal Hocko 

FWIW,
Reviewed-by: Mike Kravetz 

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90b...@oracle.com
-- 
Mike Kravetz


Re: [rfc] mm, hugetlb: allow hugepage allocations to excessively reclaim

2019-10-07 Thread Mike Kravetz
On 10/4/19 11:02 AM, David Rientjes wrote:
> On Fri, 4 Oct 2019, Michal Hocko wrote:
> 
>> Requesting the userspace to drop _all_ page cache in order allocate a
>> number of hugetlb pages or any other affected __GFP_RETRY_MAYFAIL
>> requests is simply not reasonable IMHO.
> 
> It can be used as a fallback when writing to nr_hugepages and the amount 
> allocated did not match expectation.  Again, I'll defer all of this to 
> Mike when he returns: he expressed his preference, I suggested an 
> alternative to consider, and he can make the decision to ack or nack this 
> patch because he has a better understanding of that expectation from users 
> who use hugetlb pages.

I believe these modifications to commit b39d0ee2632d are absolutely necessary
to maintain expected hugetlbfs functionality.  Michal's simple test in the
rewritten commit message shows the type of regressions that I expect some
hugetlbfs users to experience.  The expectation today is that the kernel will
try hard to allocate the requested number of hugetlb pages.  These pages are
often used for very long running processes.  Therefore, the tradeoff of more
reclaim (and compaction) activity up front to create the pages is generally
acceptable.

My apologies if the 'testing' I did in [1] was taken as an endorsement of
b39d0ee2632d working well with hugetlbfs.  It was a limited test that I knew
did not cover all cases.  Therefore, I suggested that if b39d0ee2632d went
forward there should be an exception for __GFP_RETRY_MAYFAIL requests.

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90b...@oracle.com
-- 
Mike Kravetz


Re: [PATCH -next] userfaultfd: remove set but not used variable 'h'

2019-10-09 Thread Mike Kravetz
On 10/9/19 5:27 AM, YueHaibing wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> mm/userfaultfd.c: In function '__mcopy_atomic_hugetlb':
> mm/userfaultfd.c:217:17: warning:
>  variable 'h' set but not used [-Wunused-but-set-variable]
> 
> It is not used since commit 78911d0e18ac ("userfaultfd: use vma_pagesize
> for all huge page size calculation")
> 

Thanks!  That should have been removed with the recent cleanups.

> Signed-off-by: YueHaibing 

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH -next] userfaultfd: remove set but not used variable 'h'

2019-10-09 Thread Mike Kravetz
On 10/9/19 6:23 PM, Wei Yang wrote:
> On Wed, Oct 09, 2019 at 05:45:57PM -0700, Mike Kravetz wrote:
>> On 10/9/19 5:27 AM, YueHaibing wrote:
>>> Fixes gcc '-Wunused-but-set-variable' warning:
>>>
>>> mm/userfaultfd.c: In function '__mcopy_atomic_hugetlb':
>>> mm/userfaultfd.c:217:17: warning:
>>>  variable 'h' set but not used [-Wunused-but-set-variable]
>>>
>>> It is not used since commit 78911d0e18ac ("userfaultfd: use vma_pagesize
>>> for all huge page size calculation")
>>>
>>
>> Thanks!  That should have been removed with the recent cleanups.
>>
>>> Signed-off-by: YueHaibing 
>>
>> Reviewed-by: Mike Kravetz 
> 
> If I am correct, this is removed in a recent patch.

I'm having a hard time figuring out what is actually in the latest mmotm
tree.  Andrew added a build fixup patch ab169389eb5 in linux-next which
adds the reference to h.  Is there a patch after that to remove the reference?

-- 
Mike Kravetz


Re: [PATCH -next] userfaultfd: remove set but not used variable 'h'

2019-10-09 Thread Mike Kravetz
On 10/9/19 8:30 PM, Wei Yang wrote:
> On Wed, Oct 09, 2019 at 07:25:18PM -0700, Mike Kravetz wrote:
>> On 10/9/19 6:23 PM, Wei Yang wrote:
>>> On Wed, Oct 09, 2019 at 05:45:57PM -0700, Mike Kravetz wrote:
>>>> On 10/9/19 5:27 AM, YueHaibing wrote:
>>>>> Fixes gcc '-Wunused-but-set-variable' warning:
>>>>>
>>>>> mm/userfaultfd.c: In function '__mcopy_atomic_hugetlb':
>>>>> mm/userfaultfd.c:217:17: warning:
>>>>>  variable 'h' set but not used [-Wunused-but-set-variable]
>>>>>
>>>>> It is not used since commit 78911d0e18ac ("userfaultfd: use vma_pagesize
>>>>> for all huge page size calculation")
>>>>>
>>>>
>>>> Thanks!  That should have been removed with the recent cleanups.
>>>>
>>>>> Signed-off-by: YueHaibing 
>>>>
>>>> Reviewed-by: Mike Kravetz 
>>>
>>> If I am correct, this is removed in a recent patch.
>>
>> I'm having a hard time figuring out what is actually in the latest mmotm
>> tree.  Andrew added a build fixup patch ab169389eb5 in linux-next which
>> adds the reference to h.  Is there a patch after that to remove the 
>> reference?
>>
> 
> I checked linux-next tree, this commit removes the reference.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=add4eaeef3766b7491d70d473c48c0b6d6ca5cb7
> 

Yes, but unless I am mistaken this adds it back,

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=ab169389eb5ff9da7113a21737574edc6d22c072

-- 
Mike Kravetz


[PATCH] hugetlbfs: hugetlb_fault_mutex_hash cleanup

2019-09-18 Thread Mike Kravetz
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
to determine the number of u32's in an array of unsigned longs. Suppress the
warning by adding parentheses.

While looking at the above issue, noticed that the 'address' parameter
to hugetlb_fault_mutex_hash is no longer used. So, remove it from the
definition and all callers.

No functional change.
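
For reference, a tiny standalone program shows what the expression computes
and why clang complains; it is only a demonstration and not part of the
patch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	unsigned long key[2];

	/*
	 * Bytes in 'key' divided by bytes in a u32.  clang's
	 * -Wsizeof-array-div warns because u32 is not the element type of
	 * 'key'; parenthesizing the divisor documents that the division
	 * is intentional.
	 */
	printf("u32 words in key: %zu\n", sizeof(key) / (sizeof(uint32_t)));
	return 0;
}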

Reported-by: Nathan Chancellor 
Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c|  4 ++--
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c| 10 +-
 mm/userfaultfd.c|  2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a478df035651..6e5eadee6b0d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -440,7 +440,7 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
u32 hash;
 
index = page->index;
-   hash = hugetlb_fault_mutex_hash(h, mapping, index, 0);
+   hash = hugetlb_fault_mutex_hash(h, mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
/*
@@ -644,7 +644,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
addr = index * hpage_size;
 
/* mutex taken here, fault path and hole punch */
-   hash = hugetlb_fault_mutex_hash(h, mapping, index, addr);
+   hash = hugetlb_fault_mutex_hash(h, mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
/* See if already present in mapping to avoid alloc/free */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca4278319..5bf11fffbbd4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -106,7 +106,7 @@ void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
-   pgoff_t idx, unsigned long address);
+   pgoff_t idx);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b8..3705d3c69e32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3847,7 +3847,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 * handling userfault.  Reacquire after handling
 * fault to make calling code simpler.
 */
-   hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
+   hash = hugetlb_fault_mutex_hash(h, mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -3975,7 +3975,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 #ifdef CONFIG_SMP
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
-   pgoff_t idx, unsigned long address)
+   pgoff_t idx)
 {
unsigned long key[2];
u32 hash;
@@ -3983,7 +3983,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct 
address_space *mapping,
key[0] = (unsigned long) mapping;
key[1] = idx;
 
-   hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
+   hash = jhash2((u32 *)&key, sizeof(key)/(sizeof(u32)), 0);
 
return hash & (num_fault_mutexes - 1);
 }
@@ -3993,7 +3993,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct 
address_space *mapping,
  * return 0 and avoid the hashing overhead.
  */
 u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
-   pgoff_t idx, unsigned long address)
+   pgoff_t idx)
 {
return 0;
 }
@@ -4037,7 +4037,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * get spurious allocation failures if two CPUs race to instantiate
 * the same page in the page cache.
 */
-   hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
+   hash = hugetlb_fault_mutex_hash(h, mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
entry = huge_ptep_get(ptep);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c7ae74ce5ff3..640ff2bd9a69 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -269,7 +269,7 @@ static __always_inline ssize_t 
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 */
idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
-   hash = hugetlb_fault

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-26 Thread Mike Kravetz
On 9/26/19 12:28 PM, David Rientjes wrote:
> On Tue, 24 Sep 2019, Mina Almasry wrote:
> 
>>> I personally prefer the one counter approach only for the reason that it
>>> exposes less information about hugetlb reservations.  I was not around
>>> for the introduction of hugetlb reservations, but I have fixed several
>>> issues having to do with reservations.  IMO, reservations should be hidden
>>> from users as much as possible.  Others may disagree.
>>>
>>> I really hope that Aneesh will comment.  He added the existing hugetlb
>>> cgroup code.  I was not involved in that effort, but it looks like there
>>> might have been some thought given to reservations in early versions of
>>> that code.  It would be interesting to get his perspective.
>>>
>>> Changes included in patch 4 (disable region_add file_region coalescing)
>>> would be needed in a one counter approach as well, so I do plan to
>>> review those changes.
>>
>> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
>> not really opposed. David, maybe chime in if you see a problem here?
>> From the perspective of hiding reservations from the user as much as
>> possible, it is an improvement.
>>
>> I'm only wary about changing the behavior of the current and having
>> that regress applications. I'm hoping you and Aneesh can shed light on
>> this.
>>
> 
> I think neither Aneesh nor myself are going to be able to provide a 
> complete answer on the use of hugetlb cgroup today, anybody could be using 
> it without our knowledge and that opens up the possibility that combining 
> the limits would adversely affect a real system configuration.

I agree that nobody can provide complete information on hugetlb cgroup usage
today.  My interest was in anything Aneesh could remember about development
of the current cgroup code.  It 'appears' that the idea of including
reservations or mmap ranges was considered or at least discussed.  But, those
discussions happened more than 7 years ago and my searches are not providing
a complete picture.  My hope was that Aneesh may remember those discussions.

> If that is a possibility, I think we need to do some due diligence and try 
> to deprecate allocation limits if possible.  One of the benefits to 
> separate limits is that we can make reasonable steps to deprecating the 
> actual allocation limits, if possible: we could add warnings about the 
> deprecation of allocation limits and see if anybody complains.
> 
> That could take the form of two separate limits or a tunable in the root 
> hugetlb cgroup that defines whether the limits are for allocation or 
> reservation.
> 
> Combining them in the first pass seems to be very risky and could cause 
> pain for users that will not detect this during an rc cycle and will 
> report the issue only when their distro gets it.  Then we are left with no 
> alternative other than stable backports and the separation of the limits 
> anyway.

I agree that changing behavior of the existing controller is too risky.
Such a change is likely to break someone.  The more I think about it, the
best way forward will be to retain the existing controller and create a
new controller that satisfies the new use cases.  The question remains as
to what that new controller will be.  Does it control reservations only?
Is it a combination of reservations and allocations?  If a combined
controller will work for new use cases, that would be my preference.  Of
course, I have not prototyped such a controller so there may be issues when
we get into the details.  For a reservation only or combined controller,
the region_* changes proposed by Mina would be used.
-- 
Mike Kravetz


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-23 Thread Mike Kravetz
On 9/19/19 3:24 PM, Mina Almasry wrote:
> Patch series implements hugetlb_cgroup reservation usage and limits, which
> track hugetlb reservations rather than hugetlb memory faulted in. Details of
> the approach is 1/7.

Thanks for your continued efforts Mina.

One thing that has bothered me with this approach from the beginning is that
hugetlb reservations are related to, but somewhat distinct from hugetlb
allocations.  The original (existing) hugetlb cgroup implementation does not
take reservations into account.  This is an issue you are trying to address
by adding a cgroup support for hugetlb reservations.  However, this new
reservation cgroup ignores hugetlb allocations at fault time.

I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
of hugetlb pages.  Both the existing cgroup code and the reservation approach
have what I think are some serious flaws.  Consider a system with 100 hugetlb
pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
usage to 50 pages each.

With the existing implementation, a task in group A could create a mmap of
100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
nobody in group B can allocate ANY huge pages.  This is true even though
no pages have been allocated in A (or B).

With the reservation implementation, a task in group A could use MAP_NORESERVE
and allocate all 100 pages without taking any reservations.
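
Roughly what such a task would look like, assuming 2MB huge pages and a
sufficiently large pool (sizes are illustrative only):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	const size_t huge = 2UL << 20;
	const size_t len = 100 * huge;
	size_t i;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
		       -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/*
	 * No reservation was taken at mmap time; each touch allocates a
	 * huge page at fault time, so a reservation-only cgroup never
	 * sees these pages.
	 */
	for (i = 0; i < len; i += huge)
		p[i] = 1;

	return 0;
}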

As mentioned in your documentation, it would be possible to use both the
existing (allocation) and new reservation cgroups together.  Perhaps if both
are setup for the 50/50 split things would work a little better.

However, instead of creating a new reservation cgroup, how about adding
reservation support to the existing allocation cgroup?  One could
even argue that a reservation is an allocation as it sets aside huge pages
that can only be used for a specific purpose.  Here is something that
may work.

Starting with the existing allocation cgroup.
- When hugetlb pages are reserved, the cgroup of the task making the
  reservations is charged.  Tracking for the charged cgroup is done in the
  reservation map in the same way proposed by this patch set.
- At page fault time,
  - If a reservation already exists for that specific area do not charge the
faulting task.  No tracking in page, just the reservation map.
  - If no reservation exists, charge the group of the faulting task.  Tracking
of this information is in the page itself as implemented today.
- When the hugetlb object is removed, compare the reservation map with any
  allocated pages.  If cgroup tracking information exists in the page, uncharge
  that group.  Otherwise, uncharge the group (if any) in the reservation map.

One of the advantages of a separate reservation cgroup is that the existing
code is unmodified.  Combining the two provides a more complete/accurate
solution IMO.  But, it has the potential to break existing users.

I really would like to get feedback from anyone that knows how the existing
hugetlb cgroup controller may be used today.  Comments from Aneesh would
be very welcome to know if reservations were considered in development of the
existing code.
-- 
Mike Kravetz


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-23 Thread Mike Kravetz
On 9/23/19 12:18 PM, Mina Almasry wrote:
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  wrote:
>>
>> On 9/19/19 3:24 PM, Mina Almasry wrote:
>>> Patch series implements hugetlb_cgroup reservation usage and limits, which
>>> track hugetlb reservations rather than hugetlb memory faulted in. Details of
>>> the approach is 1/7.
>>
>> Thanks for your continued efforts Mina.
>>
> 
> And thanks for your reviews so far.
> 
>> One thing that has bothered me with this approach from the beginning is that
>> hugetlb reservations are related to, but somewhat distinct from hugetlb
>> allocations.  The original (existing) huegtlb cgroup implementation does not
>> take reservations into account.  This is an issue you are trying to address
>> by adding a cgroup support for hugetlb reservations.  However, this new
>> reservation cgroup ignores hugetlb allocations at fault time.
>>
>> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
>> of hugetlb pages.  Both the existing cgroup code and the reservation approach
>> have what I think are some serious flaws.  Consider a system with 100 hugetlb
>> pages available.  A sysadmin, has two groups A and B and wants to limit 
>> hugetlb
>> usage to 50 pages each.
>>
>> With the existing implementation, a task in group A could create a mmap of
>> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
>> nobody in group B can allocate ANY huge pages.  This is true even though
>> no pages have been allocated in A (or B).
>>
>> With the reservation implementation, a task in group A could use 
>> MAP_NORESERVE
>> and allocate all 100 pages without taking any reservations.
>>
>> As mentioned in your documentation, it would be possible to use both the
>> existing (allocation) and new reservation cgroups together.  Perhaps if both
>> are setup for the 50/50 split things would work a little better.
>>
>> However, instead of creating a new reservation cgroup, how about adding
>> reservation support to the existing allocation cgroup?  One could
>> even argue that a reservation is an allocation as it sets aside huge pages
>> that can only be used for a specific purpose.  Here is something that
>> may work.
>>
>> Starting with the existing allocation cgroup.
>> - When hugetlb pages are reserved, the cgroup of the task making the
>>   reservations is charged.  Tracking for the charged cgroup is done in the
>>   reservation map in the same way proposed by this patch set.
>> - At page fault time,
>>   - If a reservation already exists for that specific area do not charge the
>> faulting task.  No tracking in page, just the reservation map.
>>   - If no reservation exists, charge the group of the faulting task.  
>> Tracking
>> of this information is in the page itself as implemented today.
>> - When the hugetlb object is removed, compare the reservation map with any
>>   allocated pages.  If cgroup tracking information exists in the page, uncharge
>>   that group.  Otherwise, uncharge the group (if any) in the reservation map.
>>
>> One of the advantages of a separate reservation cgroup is that the existing
>> code is unmodified.  Combining the two provides a more complete/accurate
>> solution IMO.  But, it has the potential to break existing users.
>>
>> I really would like to get feedback from anyone that knows how the existing
>> hugetlb cgroup controller may be used today.  Comments from Aneesh would
>> be very welcome to know if reservations were considered in development of the
>> existing code.
>> --
> 
> FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:
> 
> AFAICT, the 2 counter approach we have here is strictly superior to
> the 1 upgraded counter approach. Consider these points:
> 
> - From what I can tell so far, everything you can do with the 1
> counter approach, you can do with the two counter approach by setting
> both limit_in_bytes and reservation_limit_in_bytes to the limit value.
> That will limit both reservations and at fault allocations.
> 
> - The 2 counter approach preserves existing usage of hugetlb cgroups,
> so no need to muck around with reverting the feature some time from
> now because of broken users. No existing users of hugetlb cgroups need
> to worry about the effect of this on their usage.
> 
> - Users that use hugetlb memory strictly through reservations can use
> only reservation_limit_in_bytes and enjoy cgroup limits that never
> SIGBUS the application. This is our usage for example.
> 
> -

Re: [PATCH] hugetlbfs: add O_TMPFILE support

2019-10-22 Thread Mike Kravetz
On 10/22/19 12:09 AM, Piotr Sarna wrote:
> On 10/21/19 7:17 PM, Mike Kravetz wrote:
>> On 10/15/19 4:37 PM, Mike Kravetz wrote:
>>> On 10/15/19 3:50 AM, Michal Hocko wrote:
>>>> On Tue 15-10-19 11:01:12, Piotr Sarna wrote:
>>>>> With hugetlbfs, a common pattern for mapping anonymous huge pages
>>>>> is to create a temporary file first.
>>>>
>>>> Really? I though that this is normally done by shmget(SHM_HUGETLB) or
>>>> mmap(MAP_HUGETLB). Or maybe I misunderstood your definition on anonymous
>>>> huge pages.
>>>>
>>>>> Currently libraries like
>>>>> libhugetlbfs and seastar create these with a standard mkstemp+unlink
>>>>> trick,
>>>
>>> I would guess that much of libhugetlbfs was written before MAP_HUGETLB
>>> was implemented.  So, that is why it does not make (more) use of that
>>> option.
>>>
>>> The implementation looks to be straightforward.  However, I really do
>>> not want to add more functionality to hugetlbfs unless there is a specific
>>> use case that needs it.
>>
>> It was not my intention to shut down discussion on this patch.  I was just
>> asking if there was a (new) use case for such a change.  I am checking with
>> our DB team as I seem to remember them using the create/unlink approach for
>> hugetlbfs in one of their upcoming models.
>>
>> Is there a new use case you were thinking about?
>>
> 
> Oh, I indeed thought it was a shutdown. The use case I was thinking about was 
> in Seastar, where the create+unlink trick is used for creating temporary 
> files (in a generic way, not only for hugetlbfs). I simply intended to 
> migrate it to a newer approach - O_TMPFILE. However,
> for the specific case of hugetlbfs it indeed makes more sense to skip it and 
> use mmap's MAP_HUGETLB, so perhaps it's not worth it to patch a perfectly 
> good and stable file system just to provide a semi-useful flag support. My 
> implementation of tmpfile for hugetlbfs is straightforward indeed, but the 
> MAP_HUGETLB argument made me realize that it may not be worth the trouble - 
> especially that MAP_HUGETLB is here since 2.6 and O_TMPFILE was introduced 
> around v3.11, so the mmap way looks more portable.
> 
> tldr: I'd be very happy to get my patch accepted, but the use case I had in 
> mind can be easily solved with MAP_HUGETLB, so I don't insist.

If you really are after something like 'anonymous memory' for Seastar,
then MAP_HUGETLB would be the better approach.
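
For comparison, a minimal sketch of the two patterns being discussed; it
assumes a hugetlbfs mount at /dev/hugepages and does only basic error
handling:

#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define LEN	(8UL << 20)	/* illustrative size, multiple of 2MB */

/* today's mkstemp+unlink trick: the file lives only as long as the fd */
static void *huge_via_tmpfile(void)
{
	char path[] = "/dev/hugepages/tmp-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return MAP_FAILED;
	unlink(path);
	return mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* the file-less alternative for truly anonymous huge pages */
static void *huge_anon(void)
{
	return mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

int main(void)
{
	return huge_via_tmpfile() == MAP_FAILED || huge_anon() == MAP_FAILED;
}

An O_TMPFILE open of the mount point would replace the mkstemp+unlink pair
while still yielding an open fd to work with.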

I'm still checking with Oracle DB team as they may have a use for O_TMPFILE
in an upcoming release.  In their use case, they want an open fd to work with.
If it looks like they will proceed in this direction, we can work to get
your patch moved forward.

Thanks,
-- 
Mike Kravetz


[PATCH v4 1/4] hugetlbfs: add arch_hugetlb_valid_size

2020-04-28 Thread Mike Kravetz
The architecture independent routine hugetlb_default_setup sets up
the default huge page size.  It has no way to verify if the passed
value is valid, so it accepts it and attempts to validate at a later
time.  This requires undocumented cooperation between the arch specific
and arch independent code.

For architectures that support more than one huge page size, provide
a routine arch_hugetlb_valid_size to validate a huge page size.
hugetlb_default_setup can use this to validate passed values.

arch_hugetlb_valid_size will also be used in a subsequent patch to
move processing of the "hugepagesz=" in arch specific code to a common
routine in arch independent code.

Signed-off-by: Mike Kravetz 
Acked-by: Gerald Schaefer   [s390]
Acked-by: Will Deacon 
---
 arch/arm64/mm/hugetlbpage.c   | 17 +
 arch/powerpc/mm/hugetlbpage.c | 20 +---
 arch/riscv/mm/hugetlbpage.c   | 26 +-
 arch/s390/mm/hugetlbpage.c| 16 
 arch/sparc/mm/init_64.c   | 24 
 arch/x86/mm/hugetlbpage.c | 17 +
 include/linux/hugetlb.h   |  1 +
 mm/hugetlb.c  | 21 ++---
 8 files changed, 103 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bbeb6a5a6ba6..069b96ee2aec 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -462,17 +462,26 @@ static int __init hugetlbpage_init(void)
 }
 arch_initcall(hugetlbpage_init);
 
-static __init int setup_hugepagesz(char *opt)
+bool __init arch_hugetlb_valid_size(unsigned long size)
 {
-   unsigned long ps = memparse(opt, &opt);
-
-   switch (ps) {
+   switch (size) {
 #ifdef CONFIG_ARM64_4K_PAGES
case PUD_SIZE:
 #endif
case CONT_PMD_SIZE:
case PMD_SIZE:
case CONT_PTE_SIZE:
+   return true;
+   }
+
+   return false;
+}
+
+static __init int setup_hugepagesz(char *opt)
+{
+   unsigned long ps = memparse(opt, &opt);
+
+   if (arch_hugetlb_valid_size(ps)) {
add_huge_page_size(ps);
return 1;
}
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 33b3461d91e8..de54d2a37830 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
return vma_kernel_pagesize(vma);
 }
 
-static int __init add_huge_page_size(unsigned long long size)
+bool __init arch_hugetlb_valid_size(unsigned long size)
 {
int shift = __ffs(size);
int mmu_psize;
@@ -566,20 +566,26 @@ static int __init add_huge_page_size(unsigned long long 
size)
/* Check that it is a page size supported by the hardware and
 * that it fits within pagetable and slice limits. */
if (size <= PAGE_SIZE || !is_power_of_2(size))
-   return -EINVAL;
+   return false;
 
mmu_psize = check_and_get_huge_psize(shift);
if (mmu_psize < 0)
-   return -EINVAL;
+   return false;
 
BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
 
-   /* Return if huge page size has already been setup */
-   if (size_to_hstate(size))
-   return 0;
+   return true;
+}
 
-   hugetlb_add_hstate(shift - PAGE_SHIFT);
+static int __init add_huge_page_size(unsigned long long size)
+{
+   int shift = __ffs(size);
+
+   if (!arch_hugetlb_valid_size((unsigned long)size))
+   return -EINVAL;
 
+   if (!size_to_hstate(size))
+   hugetlb_add_hstate(shift - PAGE_SHIFT);
return 0;
 }
 
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index a6189ed36c5f..da1f516bc451 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -12,21 +12,29 @@ int pmd_huge(pmd_t pmd)
return pmd_leaf(pmd);
 }
 
+bool __init arch_hugetlb_valid_size(unsigned long size)
+{
+   if (size == HPAGE_SIZE)
+   return true;
+   else if (IS_ENABLED(CONFIG_64BIT) && size == PUD_SIZE)
+   return true;
+   else
+   return false;
+}
+
 static __init int setup_hugepagesz(char *opt)
 {
unsigned long ps = memparse(opt, &opt);
 
-   if (ps == HPAGE_SIZE) {
-   hugetlb_add_hstate(HPAGE_SHIFT - PAGE_SHIFT);
-   } else if (IS_ENABLED(CONFIG_64BIT) && ps == PUD_SIZE) {
-   hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
-   } else {
-   hugetlb_bad_size();
-   pr_err("hugepagesz: Unsupported page size %lu M\n", ps >> 20);
-   return 0;
+   if (arch_hugetlb_valid_size(ps)) {
+   hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
+   return 1;
}
 
-   return 1;
+   hugetlb_bad_size();
+   pr_err("hugepagesz: Unsupport

[PATCH v4 0/4] Clean up hugetlb boot command line processing

2020-04-28 Thread Mike Kravetz
v4 -
   Fixed huge page order definitions for arm64 (Qian Cai)
   Removed hugepages_supported() checks in command line processing as
 powerpc does not set hugepages_supported until later in boot (Sandipan)
   Added Acks, Reviews and Tested (Will, Gerald, Anders, Sandipan)

v3 -
   Used weak attribute method of defining arch_hugetlb_valid_size.
 This eliminates changes to arch specific hugetlb.h files (Peter)
   Updated documentation (Peter, Randy)
   Fixed handling of implicitly specified gigantic page preallocation
 in existing code and removed documentation of such.  There is now
 no difference between handling of gigantic and non-gigantic pages.
 (Peter, Nitesh).
 This requires the most review as there is a small change to
 undocumented behavior.  See patch 4 commit message for details.
   Added Acks and Reviews (Mina, Peter)

v2 -
   Fix build errors with patch 1 (Will)
   Change arch_hugetlb_valid_size arg to unsigned long and remove
 irrelevant 'extern' keyword (Christophe)
   Documentation and other misc changes (Randy, Christophe, Mina)
   Do not process command line options if !hugepages_supported()
 (Dave, but it sounds like we may want to additional changes to
  hugepages_supported() for x86?  If that is needed I would prefer
  a separate patch.)

Longpeng(Mike) reported a weird message from hugetlb command line processing
and proposed a solution [1].  While the proposed patch does address the
specific issue, there are other related issues in command line processing.
As hugetlbfs evolved, updates to command line processing have been made to
meet immediate needs and not necessarily in a coordinated manner.  The result
is that some processing is done in arch specific code, some is done in arch
independent code and coordination is problematic.  Semantics can vary between
architectures.

The patch series does the following:
- Define arch specific arch_hugetlb_valid_size routine used to validate
  passed huge page sizes.
- Move hugepagesz= command line parsing out of arch specific code and into
  an arch independent routine.
- Clean up command line processing to follow desired semantics and
  document those semantics.

[1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpe...@huawei.com

Mike Kravetz (4):
  hugetlbfs: add arch_hugetlb_valid_size
  hugetlbfs: move hugepagesz= parsing to arch independent code
  hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate
  hugetlbfs: clean up command line processing

 .../admin-guide/kernel-parameters.txt |  40 ++--
 Documentation/admin-guide/mm/hugetlbpage.rst  |  35 
 arch/arm64/mm/hugetlbpage.c   |  30 +--
 arch/powerpc/mm/hugetlbpage.c |  30 +--
 arch/riscv/mm/hugetlbpage.c   |  24 +--
 arch/s390/mm/hugetlbpage.c|  24 +--
 arch/sparc/mm/init_64.c   |  43 +
 arch/x86/mm/hugetlbpage.c |  23 +--
 include/linux/hugetlb.h   |   2 +-
 mm/hugetlb.c  | 180 ++
 10 files changed, 260 insertions(+), 171 deletions(-)

-- 
2.25.4



[PATCH v4 3/4] hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate

2020-04-28 Thread Mike Kravetz
The routine hugetlb_add_hstate prints a warning if the hstate already
exists.  This was originally done as part of kernel command line
parsing.  If 'hugepagesz=' was specified more than once, the warning
pr_warn("hugepagesz= specified twice, ignoring\n");
would be printed.

Some architectures want to enable all huge page sizes.  They would
call hugetlb_add_hstate for all supported sizes.  However, this was
done after command line processing and as a result hstates could have
already been created for some sizes.  To make sure no warnings were
printed, there would often be code like:
	if (!size_to_hstate(size))
		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);

The only time we want to print the warning is as the result of command
line processing.  So, remove the warning from hugetlb_add_hstate and
add it to the single arch independent routine processing "hugepagesz=".
After this, calls to size_to_hstate() in arch specific code can be
removed and hugetlb_add_hstate can be called without worrying about
warning messages.

Signed-off-by: Mike Kravetz 
Acked-by: Mina Almasry 
Acked-by: Gerald Schaefer   [s390]
Acked-by: Will Deacon 
Tested-by: Anders Roxell 
---
 arch/arm64/mm/hugetlbpage.c   | 16 
 arch/powerpc/mm/hugetlbpage.c |  3 +--
 arch/riscv/mm/hugetlbpage.c   |  2 +-
 arch/sparc/mm/init_64.c   | 19 ---
 arch/x86/mm/hugetlbpage.c |  2 +-
 mm/hugetlb.c  |  9 ++---
 6 files changed, 17 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index f706b821aba6..14bed8f4674a 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -441,22 +441,14 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
 }
 
-static void __init add_huge_page_size(unsigned long size)
-{
-   if (size_to_hstate(size))
-   return;
-
-   hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
-}
-
 static int __init hugetlbpage_init(void)
 {
 #ifdef CONFIG_ARM64_4K_PAGES
-   add_huge_page_size(PUD_SIZE);
+   hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
 #endif
-   add_huge_page_size(CONT_PMD_SIZE);
-   add_huge_page_size(PMD_SIZE);
-   add_huge_page_size(CONT_PTE_SIZE);
+   hugetlb_add_hstate((CONT_PMD_SHIFT + PMD_SHIFT) - PAGE_SHIFT);
+   hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+   hugetlb_add_hstate((CONT_PTE_SHIFT + PAGE_SHIFT) - PAGE_SHIFT);
 
return 0;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 2c3fa0a7787b..4d5ed1093615 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -584,8 +584,7 @@ static int __init add_huge_page_size(unsigned long long 
size)
if (!arch_hugetlb_valid_size((unsigned long)size))
return -EINVAL;
 
-   if (!size_to_hstate(size))
-   hugetlb_add_hstate(shift - PAGE_SHIFT);
+   hugetlb_add_hstate(shift - PAGE_SHIFT);
return 0;
 }
 
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 4e5d7e9f0eef..932dadfdca54 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -26,7 +26,7 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
 static __init int gigantic_pages_init(void)
 {
/* With CONTIG_ALLOC, we can allocate gigantic pages at runtime */
-   if (IS_ENABLED(CONFIG_64BIT) && !size_to_hstate(1UL << PUD_SHIFT))
+   if (IS_ENABLED(CONFIG_64BIT))
hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
return 0;
 }
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 4618f96fd30f..ae819a16d07a 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -325,23 +325,12 @@ static void __update_mmu_tsb_insert(struct mm_struct *mm, 
unsigned long tsb_inde
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void __init add_huge_page_size(unsigned long size)
-{
-   unsigned int order;
-
-   if (size_to_hstate(size))
-   return;
-
-   order = ilog2(size) - PAGE_SHIFT;
-   hugetlb_add_hstate(order);
-}
-
 static int __init hugetlbpage_init(void)
 {
-   add_huge_page_size(1UL << HPAGE_64K_SHIFT);
-   add_huge_page_size(1UL << HPAGE_SHIFT);
-   add_huge_page_size(1UL << HPAGE_256MB_SHIFT);
-   add_huge_page_size(1UL << HPAGE_2GB_SHIFT);
+   hugetlb_add_hstate(HPAGE_64K_SHIFT - PAGE_SHIFT);
+   hugetlb_add_hstate(HPAGE_SHIFT - PAGE_SHIFT);
+   hugetlb_add_hstate(HPAGE_256MB_SHIFT - PAGE_SHIFT);
+   hugetlb_add_hstate(HPAGE_2GB_SHIFT - PAGE_SHIFT);
 
return 0;
 }
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 937d640a89e3..cf5781142716 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -195,7 +195,7 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
 st

[PATCH v4 4/4] hugetlbfs: clean up command line processing

2020-04-28 Thread Mike Kravetz
With all hugetlb page processing done in a single file clean up code.
- Make code match desired semantics
  - Update documentation with semantics
- Make all warnings and errors messages start with 'HugeTLB:'.
- Consistently name command line parsing routines.
- Warn if !hugepages_supported() and command line parameters have
  been specified.
- Add comments to code
  - Describe some of the subtle interactions
  - Describe semantics of command line arguments

This patch also fixes issues with implicitly setting the number of
gigantic huge pages to preallocate.  Previously on X86 command line,
hugepages=2 default_hugepagesz=1G
would result in zero 1G pages being preallocated and,
# grep HugePages_Total /proc/meminfo
HugePages_Total:   0
# sysctl -a | grep nr_hugepages
vm.nr_hugepages = 2
vm.nr_hugepages_mempolicy = 2
# cat /proc/sys/vm/nr_hugepages
2
After this patch 2 gigantic pages will be preallocated and all the
proc, sysfs, sysctl and meminfo files will accurately reflect this.

To address the issue with gigantic pages, a small change in behavior
was made to command line processing.  Previously the command line,
hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256
would result in the allocation of 256 2M huge pages.  The value 128
would be ignored without any warning.  After this patch, 128 2M pages
will be allocated and a warning message will be displayed indicating
the value of 256 is ignored.  This change in behavior is required
because allocation of implicitly specified gigantic pages must be done
when the default_hugepagesz= is encountered for gigantic pages.
Previously the code waited until later in the boot process (hugetlb_init)
to allocate pages of the default size.  However, the bootmem allocator required
for gigantic allocations is not available at this time.

Signed-off-by: Mike Kravetz 
Acked-by: Gerald Schaefer   [s390]
Acked-by: Will Deacon 
Tested-by: Sandipan Das 
---
 .../admin-guide/kernel-parameters.txt |  40 +++--
 Documentation/admin-guide/mm/hugetlbpage.rst  |  35 
 mm/hugetlb.c  | 149 ++
 3 files changed, 179 insertions(+), 45 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..cbe657b86d0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -834,12 +834,15 @@
See also Documentation/networking/decnet.txt.
 
default_hugepagesz=
-   [same as hugepagesz=] The size of the default
-   HugeTLB page size. This is the size represented by
-   the legacy /proc/ hugepages APIs, used for SHM, and
-   default size when mounting hugetlbfs filesystems.
-   Defaults to the default architecture's huge page size
-   if not specified.
+   [HW] The size of the default HugeTLB page. This is
+   the size represented by the legacy /proc/ hugepages
+   APIs.  In addition, this is the default hugetlb size
+   used for shmget(), mmap() and mounting hugetlbfs
+   filesystems.  If not specified, defaults to the
+   architecture's default huge page size.  Huge page
+   sizes are architecture dependent.  See also
+   Documentation/admin-guide/mm/hugetlbpage.rst.
+   Format: size[KMG]
 
deferred_probe_timeout=
[KNL] Debugging option to set a timeout in seconds for
@@ -1479,13 +1482,24 @@
hugepages using the cma allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
 
-   hugepages=  [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
-   hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
-   On x86-64 and powerpc, this option can be specified
-   multiple times interleaved with hugepages= to reserve
-   huge pages of different sizes. Valid pages sizes on
-   x86-64 are 2M (when the CPU supports "pse") and 1G
-   (when the CPU supports the "pdpe1gb" cpuinfo flag).
+   hugepages=  [HW] Number of HugeTLB pages to allocate at boot.
+   If this follows hugepagesz (below), it specifies
+   the number of pages of hugepagesz to be allocated.
+   If this is the first HugeTLB parameter on the command
+   line, it specifies the number of pages to allocate for
+   the default huge page size.  See also
+   

[PATCH v4 2/4] hugetlbfs: move hugepagesz= parsing to arch independent code

2020-04-28 Thread Mike Kravetz
Now that architectures provide arch_hugetlb_valid_size(), parsing
of "hugepagesz=" can be done in architecture independent code.
Create a single routine to handle hugepagesz= parsing and remove
all arch specific routines.  We can also remove the interface
hugetlb_bad_size() as this is no longer used outside arch independent
code.

This also provides consistent behavior of hugetlbfs command line
options.  The hugepagesz= option should only be specified once for
a specific size, but some architectures allow multiple instances.
This appears to be more of an oversight when code was added by some
architectures to set up ALL huge page sizes.
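
As a rough sketch, the consolidated parser ends up looking something like
the fragment below; it is modeled on the per-arch versions removed in this
patch, and the exact messages and details in mm/hugetlb.c may differ:

	static int __init hugepagesz_setup(char *s)
	{
		unsigned long size;

		size = (unsigned long)memparse(s, NULL);

		if (!arch_hugetlb_valid_size(size)) {
			pr_err("HugeTLB: unsupported hugepagesz=%s\n", s);
			return 0;
		}

		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
		return 1;
	}
	__setup("hugepagesz=", hugepagesz_setup);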

Signed-off-by: Mike Kravetz 
Acked-by: Mina Almasry 
Reviewed-by: Peter Xu 
Acked-by: Gerald Schaefer   [s390]
Acked-by: Will Deacon 
---
 arch/arm64/mm/hugetlbpage.c   | 15 ---
 arch/powerpc/mm/hugetlbpage.c | 15 ---
 arch/riscv/mm/hugetlbpage.c   | 16 
 arch/s390/mm/hugetlbpage.c| 18 --
 arch/sparc/mm/init_64.c   | 22 --
 arch/x86/mm/hugetlbpage.c | 16 
 include/linux/hugetlb.h   |  1 -
 mm/hugetlb.c  | 23 +--
 8 files changed, 17 insertions(+), 109 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 069b96ee2aec..f706b821aba6 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -476,18 +476,3 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
 
return false;
 }
-
-static __init int setup_hugepagesz(char *opt)
-{
-   unsigned long ps = memparse(opt, &opt);
-
-   if (arch_hugetlb_valid_size(ps)) {
-   add_huge_page_size(ps);
-   return 1;
-   }
-
-   hugetlb_bad_size();
-   pr_err("hugepagesz: Unsupported page size %lu K\n", ps >> 10);
-   return 0;
-}
-__setup("hugepagesz=", setup_hugepagesz);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index de54d2a37830..2c3fa0a7787b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -589,21 +589,6 @@ static int __init add_huge_page_size(unsigned long long 
size)
return 0;
 }
 
-static int __init hugepage_setup_sz(char *str)
-{
-   unsigned long long size;
-
-   size = memparse(str, &str);
-
-   if (add_huge_page_size(size) != 0) {
-   hugetlb_bad_size();
-   pr_err("Invalid huge page size specified(%llu)\n", size);
-   }
-
-   return 1;
-}
-__setup("hugepagesz=", hugepage_setup_sz);
-
 static int __init hugetlbpage_init(void)
 {
bool configured = false;
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index da1f516bc451..4e5d7e9f0eef 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -22,22 +22,6 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
return false;
 }
 
-static __init int setup_hugepagesz(char *opt)
-{
-   unsigned long ps = memparse(opt, &opt);
-
-   if (arch_hugetlb_valid_size(ps)) {
-   hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
-   return 1;
-   }
-
-   hugetlb_bad_size();
-   pr_err("hugepagesz: Unsupported page size %lu M\n", ps >> 20);
-   return 0;
-
-}
-__setup("hugepagesz=", setup_hugepagesz);
-
 #ifdef CONFIG_CONTIG_ALLOC
 static __init int gigantic_pages_init(void)
 {
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index ac25b207624c..242dfc0d462d 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -261,24 +261,6 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
return false;
 }
 
-static __init int setup_hugepagesz(char *opt)
-{
-   unsigned long size;
-   char *string = opt;
-
-   size = memparse(opt, &opt);
-   if (arch_hugetlb_valid_size(size)) {
-   hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
-   } else {
-   hugetlb_bad_size();
-   pr_err("hugepagesz= specifies an unsupported page size %s\n",
-   string);
-   return 0;
-   }
-   return 1;
-}
-__setup("hugepagesz=", setup_hugepagesz);
-
 static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 2bfe8e22b706..4618f96fd30f 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -397,28 +397,6 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
 
return true;
 }
-
-static int __init setup_hugepagesz(char *string)
-{
-   unsigned long long hugepage_size;
-   int rc = 0;
-
-   hugepage_size = memparse(string, &string);
-
- 

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-10-14 Thread Mike Kravetz
On 10/11/19 1:41 PM, Mina Almasry wrote:
> On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry  wrote:
>>
>> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  
>> wrote:
>>>
>>> On 9/19/19 3:24 PM, Mina Almasry wrote:
>>
>> Mike, note your suggestion above to check if the page hugetlb_cgroup
>> is null doesn't work if we want to keep the current counter working
>> the same: the page will always have a hugetlb_cgroup that points that
>> contains the old counter. Any ideas how to apply this new counter
>> behavior to a private NORESERVE mappings? Is there maybe a flag I can
>> set on the pages at allocation time that I can read on free time to
>> know whether to uncharge the hugetlb_cgroup or not?
> 
> Reading the code and asking around a bit, it seems the pointer to the
> hugetlb_cgroup is in page[2].private. Is it reasonable to use
> page[3].private to store the hugetlb_cgroup to uncharge for the new
> counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
> would solve my problem. When allocating a private NORESERVE page, set
> page[3].private to the hugetlb_cgroup to uncharge, then on
> free_huge_page, check page[3].private, if it is non-NULL, uncharge the
> new counter on it.

Sorry for not responding sooner.  This approach should work, and it looks like
you have a v6 of the series.  I'll take a look.
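
In case it helps the v6 review, here is a hypothetical sketch of the
page[3].private idea described above.  The helper names are made up for
illustration and HUGETLB_CGROUP_MIN_ORDER would have to become 3:

	static inline struct hugetlb_cgroup *
	hugetlb_cgroup_from_page_rsvd(struct page *page)
	{
		if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
			return NULL;
		return (struct hugetlb_cgroup *)page[3].private;
	}

	static inline void set_hugetlb_cgroup_rsvd(struct page *page,
						   struct hugetlb_cgroup *h_cg)
	{
		if (compound_order(page) >= HUGETLB_CGROUP_MIN_ORDER)
			page[3].private = (unsigned long)h_cg;
	}

free_huge_page() would then uncharge the new counter only when the stored
pointer is non-NULL, which is exactly the private NORESERVE case you
describe.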

-- 
Mike Kravetz


Re: [PATCH] hugetlb: Add nohugepages parameter to prevent hugepages creation

2019-10-14 Thread Mike Kravetz
On 10/11/19 3:39 PM, Guilherme G. Piccoli wrote:
> Currently there are two ways of setting HugeTLB hugepages in the kernel:
> either users pass parameters on the kernel command line or they write to
> sysfs files (which is effectively the sysctl way).
> 
> Kdump kernels won't benefit from hugepages - in fact it's quite the opposite:
> hugepages on a kdump kernel can lead to OOM if the kernel is unable to
> allocate needed pages because the preallocated hugepages are consuming a
> lot of memory.
> 
> This patch proposes a new kernel parameter to prevent the creation of
> HugeTLB hugepages - we currently don't have a way to do that. We can
> even have kdump scripts removing the kernel command-line options to
> set hugepages, but it's not straightforward to prevent sysctl/sysfs
> configuration, given it happens in later boot or anytime when the
> system is running.
> 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> About some decisions took in this patch:
> 
> * early_param() was used because I couldn't find a way to enforce
> parameters' ordering when using __setup(), and we need nohugepages
> processed before all other hugepages options.

I don't know much about early_param(), so I will assume this works as you
describe.  However, a quick grep shows hugepage options for ia64 also with
early_param.

> * The return when sysctl handler is prevented to progress due to
> nohugepages is -EINVAL, but could be changed; I've just followed
> present code there, but I'm OK changing that if we have suggestions.

It looks like you have only short-circuited/prevented nr_hugepages via
sysfs/sysctl.  Theoretically, one could set nr_overcommit_hugepages and
still allocate hugetlb pages.  So, if you REALLY want to shut things down
you need to stop this as well.

There is already a macro hugepages_supported() that can be defined by arch
specific code.  I wonder how difficult it would be to 'overwrite' the
macro if nohugepages is specified.  Perhaps just a level of naming
indirection.  This would use the existing code to prevent all hugetlb usage.
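
Something along these lines is what I have in mind.  This is purely a
hypothetical sketch; none of it exists today and the names are made up:

	static bool hugetlb_disabled_on_cmdline;

	static int __init nohugepages_setup(char *s)
	{
		hugetlb_disabled_on_cmdline = true;
		return 0;
	}
	early_param("nohugepages", nohugepages_setup);

	/* rename the existing per-arch definition to arch_hugepages_supported() */
	#define hugepages_supported()				\
		(!hugetlb_disabled_on_cmdline &&		\
		 arch_hugepages_supported())

The existing hugepages_supported() checks would then refuse hugetlb setup
everywhere they are consulted today.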

It seems like there may be some discussion about 'the right' way to
do kdump.  I can't add to that discussion, but if such an option as
nohugepages is needed, I can help.
-- 
Mike Kravetz


Re: [PATCH v1] hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()

2019-10-15 Thread Mike Kravetz
On 10/15/19 5:07 AM, David Hildenbrand wrote:
> Uninitialized memmaps contain garbage and in the worst case trigger
> kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
> touched.
> 
> Let's make sure that we only consider online memory (managed by the
> buddy) that has initialized memmaps. ZONE_DEVICE is not applicable.
> 
> page_zone() will call page_to_nid(), which will trigger
> VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
> and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This
> can be the case when an offline memory block (e.g., never onlined) is
> spanned by a zone.
> 
> Note: As explained by Michal in [1], alloc_contig_range() will verify
> the range. So it boils down to the wrong access in this function.
> 
> [1] http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz
> 
> Reported-by: Michal Hocko 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
> zones until online") # visible after d0dc12e86b319
> Cc: sta...@vger.kernel.org # v4.13+
> Cc: Anshuman Khandual 
> Cc: Mike Kravetz 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Signed-off-by: David Hildenbrand 

Thanks David,

Reviewed-by: Mike Kravetz 

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: add O_TMPFILE support

2019-10-15 Thread Mike Kravetz
On 10/15/19 3:50 AM, Michal Hocko wrote:
> On Tue 15-10-19 11:01:12, Piotr Sarna wrote:
>> With hugetlbfs, a common pattern for mapping anonymous huge pages
>> is to create a temporary file first.
> 
> Really? I though that this is normally done by shmget(SHM_HUGETLB) or
> mmap(MAP_HUGETLB). Or maybe I misunderstood your definition on anonymous
> huge pages.
> 
>> Currently libraries like
>> libhugetlbfs and seastar create these with a standard mkstemp+unlink
>> trick,

I would guess that much of libhugetlbfs was written before MAP_HUGETLB
was implemented.  So, that is why it does not make (more) use of that
option.

The implementation looks to be straightforward.  However, I really do
not want to add more functionality to hugetlbfs unless there is a specific
use case that needs it.
-- 
Mike Kravetz


Re: [PATCH] hugetlb: Fix clang compilation warning

2019-10-16 Thread Mike Kravetz
On 10/16/19 7:23 AM, Vincenzo Frascino wrote:
> Building the kernel with a recent version of clang I noticed the warning
> below:
> 
> mm/hugetlb.c:4055:40: warning: expression does not compute the number of
> elements in this array; element type is 'unsigned long', not 'u32'
> (aka 'unsigned int') [-Wsizeof-array-div]
> hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
>   ~~~ ^
> mm/hugetlb.c:4049:16: note: array 'key' declared here
> unsigned long key[2];
>   ^
> mm/hugetlb.c:4055:40: note: place parentheses around the 'sizeof(u32)'
> expression to silence this warning
> hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
>   ^  CC  fs/ext4/ialloc.o
> 
> Fix the warning adding parentheses around the sizeof(u32) expression.
> 
> Cc: Mike Kravetz 
> Signed-off-by: Vincenzo Frascino 

Thanks,

However, this is already addressed in Andrew's tree.
https://ozlabs.org/~akpm/mmotm/broken-out/hugetlbfs-hugetlb_fault_mutex_hash-cleanup.patch

-- 
Mike Kravetz


Re: [PATCH V2] mm/page_alloc: Add alloc_contig_pages()

2019-10-16 Thread Mike Kravetz
On 10/16/19 4:02 AM, Anshuman Khandual wrote:
> HugeTLB helper alloc_gigantic_page() implements fairly generic allocation
> method where it scans over various zones looking for a large contiguous pfn
> range before trying to allocate it with alloc_contig_range(). Other than
> deriving the requested order from 'struct hstate', there is nothing HugeTLB
> specific in there. This can be made available for general use to allocate
> contiguous memory which could not have been allocated through the buddy
> allocator.
> 
> alloc_gigantic_page() has been split carving out actual allocation method
> which is then made available via new alloc_contig_pages() helper wrapped
> under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced
> with more generic term 'contig'. Allocated pages here should be freed with
> free_contig_range() or by calling __free_page() on each allocated page.

I had a 'test harness' used when previously working on such an interface.
It simply provides a user interface to call the allocator with random
values for nr_pages.  Several tasks are then started doing random allocations
in parallel.  The new interface is pretty straight forward, but the idea
was to stress the underlying code.  In fact, it did identify issues with
isolation which were corrected.

I exercised this new interface in the same way and am happy to report that
no issues were detected.
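
For reference, the heart of that harness is little more than a loop like
the one below.  This is a simplified, untested sketch (the real thing took
its sizes from userspace and ran several tasks in parallel), and it assumes
the symbols are reachable, e.g. built in or exported:

	#include <linux/gfp.h>
	#include <linux/mm.h>
	#include <linux/module.h>
	#include <linux/numa.h>
	#include <linux/random.h>

	static int __init contig_stress_init(void)
	{
		int i;

		for (i = 0; i < 100; i++) {
			unsigned long nr = (prandom_u32() % 1024) + 1;
			struct page *page;

			page = alloc_contig_pages(nr, GFP_KERNEL | __GFP_NOWARN,
						  NUMA_NO_NODE, NULL);
			if (!page)
				continue;	/* failures are expected under pressure */

			free_contig_range(page_to_pfn(page), nr);
		}
		return 0;
	}
	module_init(contig_stress_init);
	MODULE_LICENSE("GPL");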
-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: fix error handling in init_hugetlbfs_fs()

2019-10-17 Thread Mike Kravetz
Cc: David
On 10/17/19 3:38 AM, Chengguang Xu wrote:
> In order to avoid using incorrect mnt, we should set
> mnt to NULL when we get error from mount_one_hugetlbfs().
> 
> Signed-off-by: Chengguang Xu 
> ---
>  fs/hugetlbfs/inode.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a478df035651..427d845e7706 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -1470,9 +1470,12 @@ static int __init init_hugetlbfs_fs(void)
>   i = 0;
>   for_each_hstate(h) {
>   mnt = mount_one_hugetlbfs(h);
> - if (IS_ERR(mnt) && i == 0) {
> - error = PTR_ERR(mnt);
> - goto out;
> + if (IS_ERR(mnt)) {
> + if (i == 0) {
> + error = PTR_ERR(mnt);
> + goto out;
> + }
> + mnt = NULL;
>   }
>   hugetlbfs_vfsmount[i] = mnt;
>   i++;

Thanks!

That should be fixed.  It was introduced with commit 32021982a324 ("hugetlbfs:
Convert to fs_context").  

That commit also changed the condition for which init_hugetlbfs_fs() would
'error' and remove the inode cache.  Previously, it would do that if there
was an error creating a mount for the default_hstate_idx hstate.  It now does
that for the '0' hstate, and 0 is not always equal to default_hstate_idx.

David was that intentional or an oversight?  I can fix up, just wanted to
make sure there was not some reason for the change.
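
If the old behavior is what we want, the fix on top of your patch would be
to key off the default hstate rather than index 0, roughly:

	if (IS_ERR(mnt)) {
		if (i == default_hstate_idx) {
			error = PTR_ERR(mnt);
			goto out;
		}
		mnt = NULL;	/* non-default mounts are optional */
	}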

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: fix error handling in init_hugetlbfs_fs()

2019-10-17 Thread Mike Kravetz
Sorry for noise, left off David

On 10/17/19 5:08 PM, Mike Kravetz wrote:
> Cc: David
> On 10/17/19 3:38 AM, Chengguang Xu wrote:
>> In order to avoid using incorrect mnt, we should set
>> mnt to NULL when we get error from mount_one_hugetlbfs().
>>
>> Signed-off-by: Chengguang Xu 
>> ---
>>  fs/hugetlbfs/inode.c | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index a478df035651..427d845e7706 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -1470,9 +1470,12 @@ static int __init init_hugetlbfs_fs(void)
>>  i = 0;
>>  for_each_hstate(h) {
>>  mnt = mount_one_hugetlbfs(h);
>> -if (IS_ERR(mnt) && i == 0) {
>> -error = PTR_ERR(mnt);
>> -goto out;
>> +if (IS_ERR(mnt)) {
>> +if (i == 0) {
>> +error = PTR_ERR(mnt);
>> +goto out;
>> +}
>> +mnt = NULL;
>>  }
>>  hugetlbfs_vfsmount[i] = mnt;
>>  i++;
> 
> Thanks!
> 
> That should be fixed.  It was introduced with commit 32021982a324 ("hugetlbfs:
> Convert to fs_context").  
> 
> That commit also changed the condition for which init_hugetlbfs_fs() would
> 'error' and remove the inode cache.  Previously, it would do that if there
> was an error creating a mount for the default_hstate_idx hstate.  It now does
> that for the '0' hstate, and 0 is not always equal to default_hstate_idx.
> 
> David was that intentional or an oversight?  I can fix up, just wanted to
> make sure there was not some reason for the change.
> 


-- 
Mike Kravetz


Re: [PATCH] hugetlb: Add nohugepages parameter to prevent hugepages creation

2019-10-18 Thread Mike Kravetz
On 10/15/19 7:09 AM, Guilherme G. Piccoli wrote:
> 
> 
> On 10/15/19 11:05 AM, Michal Hocko wrote:
>> On Tue 15-10-19 10:58:36, Guilherme G. Piccoli wrote:
>>> On 10/15/19 9:18 AM, Michal Hocko wrote:
>>>> I do agree with Qian Cai here. Kdump kernel requires a very tailored
>>>> environment considering it is running in a very restricted
>>>> configuration. The hugetlb pre-allocation sounds like a tooling problem
>>>> and should be fixed at that layer.
>>>>
>>>
>>> Hi Michal, thanks for your response. Can you suggest a current way of
>>> preventing hugepages from being created, using userspace? The goal for this
>>> patch is exactly this: to introduce such a way.
>>
>> Simply restrict the environment to not allocate any hugepages? Kdump
>> already controls the kernel command line and it also starts only a very
>> minimal subset of services. So who is allocating those hugepages?
>> sysctls should be already excluded by default as Qian mentioned.
>>
> 
> 
> OK, thanks Michal and Qian, I'll try to make things work from kdump 
> perspective. The tricky part is exactly preventing the sysctl from getting applied 
> heh
> 

Please do let us know if this can be done in tooling.

I am not opposed to the approach taken in your v2 patch as it essentially
uses the hugepages_supported() functionality that exists today.  However,
it seems that other distros have ways around this issue.  As such, I would
prefer if the issue was addressed in the tooling.
-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: add O_TMPFILE support

2019-10-21 Thread Mike Kravetz
On 10/15/19 4:37 PM, Mike Kravetz wrote:
> On 10/15/19 3:50 AM, Michal Hocko wrote:
>> On Tue 15-10-19 11:01:12, Piotr Sarna wrote:
>>> With hugetlbfs, a common pattern for mapping anonymous huge pages
>>> is to create a temporary file first.
>>
>> Really? I though that this is normally done by shmget(SHM_HUGETLB) or
>> mmap(MAP_HUGETLB). Or maybe I misunderstood your definition on anonymous
>> huge pages.
>>
>>> Currently libraries like
>>> libhugetlbfs and seastar create these with a standard mkstemp+unlink
>>> trick,
> 
> I would guess that much of libhugetlbfs was written before MAP_HUGETLB
> was implemented.  So, that is why it does not make (more) use of that
> option.
> 
> The implementation looks to be straight forward.  However, I really do
> not want to add more functionality to hugetlbfs unless there is specific
> use case that needs it.

It was not my intention to shut down discussion on this patch.  I was just
asking if there was a (new) use case for such a change.  I am checking with
our DB team as I seem to remember them using the create/unlink approach for
hugetlbfs in one of their upcoming models.  

Is there a new use case you were thinking about?
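
For readers following along, the mkstemp+unlink trick being discussed looks
roughly like the sketch below (the /dev/hugepages mount point is an
assumption, and len must be a multiple of the huge page size); the simpler
alternative is mmap() with MAP_ANONYMOUS|MAP_HUGETLB:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* "Anonymous" huge pages via a temporary hugetlbfs file. */
static void *map_anon_huge(size_t len)
{
        char tmpl[] = "/dev/hugepages/anon-XXXXXX";
        int fd = mkstemp(tmpl);
        void *p;

        if (fd < 0)
                return MAP_FAILED;
        unlink(tmpl);                   /* no name left behind */
        if (ftruncate(fd, len) < 0) {
                close(fd);
                return MAP_FAILED;
        }
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping keeps the pages alive */
        return p;
}
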
-- 
Mike Kravetz


Re: [PATCH v6 5/9] hugetlb: disable region_add file_region coalescing

2019-10-21 Thread Mike Kravetz
On 10/12/19 5:30 PM, Mina Almasry wrote:
> A follow up patch in this series adds hugetlb cgroup uncharge info the
> file_region entries in resv->regions. The cgroup uncharge info may
> differ for different regions, so they can no longer be coalesced at
> region_add time. So, disable region coalescing in region_add in this
> patch.
> 
> Behavior change:
> 
> Say a resv_map exists like this [0->1], [2->3], and [5->6].
> 
> Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> 
> Old code would generate resv->regions: [0->5], [5->6].
> New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> [5->6].
> 
> Special care needs to be taken to handle the resv->adds_in_progress
> variable correctly. In the past, only 1 region would be added for every
> region_chg and region_add call. But now, each call may add multiple
> regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
> region_chg calls add_reservation_in_range() to count the number of regions
> needed and allocates those, and that info is passed to region_add and
> region_abort to decrement adds_in_progress correctly.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> Changes in v6:
> - Fix bug in number of region_caches allocated by region_chg
> 
> ---
>  mm/hugetlb.c | 256 +--
>  1 file changed, 147 insertions(+), 109 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4a60d7d44b4c3..f9c1947925bb9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c

> -static long region_chg(struct resv_map *resv, long f, long t)
> +static long region_chg(struct resv_map *resv, long f, long t,
> +long *out_regions_needed)
>  {
> + struct file_region *trg = NULL;
>   long chg = 0;
> 
> + /* Allocate the maximum number of regions we're going to need for this
> +  * reservation. The maximum number of regions we're going to need is
> +  * (t - f) / 2 + 1, which corresponds to a region with alternating
> +  * reserved and unreserved pages.
> +  */
> + *out_regions_needed = (t - f) / 2 + 1;
> +
>   spin_lock(&resv->lock);
> -retry_locked:
> - resv->adds_in_progress++;
> +
> + resv->adds_in_progress += *out_regions_needed;
> 
>   /*
>* Check for sufficient descriptors in the cache to accommodate
>* the number of in progress add operations.
>*/
> - if (resv->adds_in_progress > resv->region_cache_count) {
> - struct file_region *trg;
> -
> - VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 
> 1);
> + while (resv->region_cache_count < resv->adds_in_progress) {
>   /* Must drop lock to allocate a new descriptor. */
> - resv->adds_in_progress--;
>   spin_unlock(&resv->lock);
> -
>   trg = kmalloc(sizeof(*trg), GFP_KERNEL);
>   if (!trg)
>   return -ENOMEM;
> @@ -393,9 +395,9 @@ static long region_chg(struct resv_map *resv, long f, 
> long t)
>   spin_lock(&resv->lock);
>   list_add(&trg->link, &resv->region_cache);
>   resv->region_cache_count++;
> - goto retry_locked;
>   }


I know that I suggested allocating the worst case number of entries, but this
is going to be too much of a hit for existing hugetlbfs users.  It is not
uncommon for DBs to have a shared areas in excess of 1TB mapped by hugetlbfs.
With this new scheme, the above while loop will allocate over a half million
file region entries and end up only using one.
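
To put rough numbers on that (purely illustrative: 2MB huge pages, a 2TB
shared area and a guessed per-entry size are all assumptions):

#include <stdio.h>

int main(void)
{
        unsigned long hpage = 2UL << 20;                /* 2MB huge pages */
        unsigned long area  = 2UL << 40;                /* 2TB shared mapping */
        unsigned long t_minus_f = area / hpage;         /* 1048576 huge pages */
        unsigned long worst = t_minus_f / 2 + 1;        /* (t - f) / 2 + 1 = 524289 */

        printf("worst-case file_region entries: %lu (~%lu KB at ~64 bytes each)\n",
               worst, worst * 64 / 1024);
        return 0;
}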

I think we need to step back and come up with a different approach.  Let me
give it some more thought before throwing out ideas that may waste more of
your time.  Sorry.
-- 
Mike Kravetz


Re: [PATCH] mm/hugetlb: Fix unsigned overflow in __nr_hugepages_store_common()

2019-02-19 Thread Mike Kravetz
On 2/18/19 1:27 AM, Michal Hocko wrote:
> On Sat 16-02-19 21:31:12, Jingxiangfeng wrote:
>> From: Jing Xiangfeng 
>>
>> We can use the following command to dynamically allocate huge pages:
>>  echo NR_HUGEPAGES > /proc/sys/vm/nr_hugepages
>> The count in  __nr_hugepages_store_common() is parsed from 
>> /proc/sys/vm/nr_hugepages,
>> The maximum number of count is ULONG_MAX,
>> the operation 'count += h->nr_huge_pages - h->nr_huge_pages_node[nid]' 
>> overflow and count will be a wrong number.
> 
> Could you be more specific of what is the runtime effect on the
> overflow? I haven't checked closer but I would assume that we will
> simply shrink the pool size because count will become a small number.
> 

Well, the first thing to note is that this code only applies to case where
someone is changing a node specific hugetlb count.  i.e.
/sys/devices/system/node/node1/hugepages/hugepages-2048kB
In this case, the calculated value of count is a maximum or minimum total
number of huge pages.  However, only the number of huge pages on the specified
node is adjusted to try and achieve this maximum or minimum.

So, in the case of overflow the number of huge pages on the specified node
could be reduced.  I say 'could' because it really is dependent on previous
values.  In some situations the node specific value will be increased.

Minor point is that the description in the commit message does not match
the code changed.

> Is there any reason to report an error in that case? We do not report
> errors when we cannot allocate the requested number of huge pages so why
> is this case any different?

Another issue to consider is that h->nr_huge_pages is an unsigned long,
and h->nr_huge_pages_node[] is an unsigned int.  The sysfs store routines
treat them both as unsigned longs.  Ideally, the store routines should
distinguish between the two.

In reality, an actual overflow is unlikely.  If my math is correct (not
likely) it would take something like 8 Petabytes to overflow the node specific
counts.

In the case of a user entering a crazy high value and causing an overflow,
an error return might not be out of line.  Another option would be to simply
set count to ULONG_MAX if we detect overflow (or UINT_MAX if we are paranoid)
and continue on.  This may be more in line with user's intention of allocating
as many huge pages as possible.
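
A minimal sketch of the clamping option (detect the unsigned wrap, then treat
it as "allocate as many as possible"), assuming the node specific adjustment
described above:

        /* sketch only: nid is a node specific request, count came from user space */
        if (nid != NUMA_NO_NODE) {
                unsigned long old_count = count;

                count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
                if (count < old_count)  /* wrapped */
                        count = ULONG_MAX;
        }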

Thoughts?
-- 
Mike Kravetz


Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation

2019-02-19 Thread Mike Kravetz
On 2/15/19 2:08 PM, Zi Yan wrote:

Thanks for working on this issue!

I have not yet had a chance to take a look at the code.  However, I do have
some general questions/comments on the approach.

> Patch structure 
>  
> 
> The patchset I developed to generate physically contiguous memory/arbitrary
> sized pages merely moves pages around. There are three components in this
> patchset:
> 
> 1) a new page migration mechanism, called exchange pages, that exchanges the
> content of two in-use pages instead of performing two back-to-back page
> migration. It saves on overheads and avoids page reclaim and memory compaction
> in the page allocation path, although it is not strictly required if enough
> free memory is available in the system.
> 
> 2) a new mechanism that utilizes both page migration and exchange pages to
> produce physically contiguous memory/arbitrary sized pages without allocating
> any new pages, unlike what khugepaged does. It works on per-VMA basis, 
> creating
> physically contiguous memory out of each VMA, which is virtually contiguous.
> A simple range tree is used to ensure no two VMAs are overlapping with each
> other in the physical address space.

This appears to be a new approach to generating contiguous areas.  Previous
attempts had relied on finding a contiguous area that can then be used for
various purposes including user mappings.  Here, you take an existing mapping
and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
talks about creating a (VPN, PFN) anchor pair for each vma and then using
this pair as the base for creating a contiguous area.

I'm curious, how 'fixed' is the anchor?  As you know, there could be a
non-movable page in the PFN range.  As a result, you will not be able to
create a contiguous area starting at that PFN.  In such a case, do we try
another PFN?  I know this could result in much page shuffling.  I'm just
trying to figure out how we satisfy a user who really wants a contiguous
area.  Is there some method to keep trying?

My apologies if this is addressed in the code.  This was just one of the
first thoughts that came to mine when giving the series a quick look.
-- 
Mike Kravetz


Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation

2019-02-19 Thread Mike Kravetz
On 2/19/19 6:33 PM, Zi Yan wrote:
> On 19 Feb 2019, at 17:42, Mike Kravetz wrote:
> 
>> On 2/15/19 2:08 PM, Zi Yan wrote:
>>
>> Thanks for working on this issue!
>>
>> I have not yet had a chance to take a look at the code.  However, I do have
>> some general questions/comments on the approach.
> 
> Thanks for replying. The code is very intrusive and has a lot of hacks, so it
> is OK for us to discuss the general idea first. :)
> 
> 
>>> Patch structure
>>> 
>>>
>>> The patchset I developed to generate physically contiguous memory/arbitrary
>>> sized pages merely moves pages around. There are three components in this
>>> patchset:
>>>
>>> 1) a new page migration mechanism, called exchange pages, that exchanges the
>>> content of two in-use pages instead of performing two back-to-back page
>>> migration. It saves on overheads and avoids page reclaim and memory 
>>> compaction
>>> in the page allocation path, although it is not strictly required if enough
>>> free memory is available in the system.
>>>
>>> 2) a new mechanism that utilizes both page migration and exchange pages to
>>> produce physically contiguous memory/arbitrary sized pages without 
>>> allocating
>>> any new pages, unlike what khugepaged does. It works on per-VMA basis, 
>>> creating
>>> physically contiguous memory out of each VMA, which is virtually contiguous.
>>> A simple range tree is used to ensure no two VMAs are overlapping with each
>>> other in the physical address space.
>>
>> This appears to be a new approach to generating contiguous areas.  Previous
>> attempts had relied on finding a contiguous area that can then be used for
>> various purposes including user mappings.  Here, you take an existing mapping
>> and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
>> talks about creating a (VPN, PFN) anchor pair for each vma and then using
>> this pair as the base for creating a contiguous area.
>>
>> I'm curious, how 'fixed' is the anchor?  As you know, there could be a
>> non-movable page in the PFN range.  As a result, you will not be able to
>> create a contiguous area starting at that PFN.  In such a case, do we try
>> another PFN?  I know this could result in much page shuffling.  I'm just
>> trying to figure out how we satisfy a user who really wants a contiguous
>> area.  Is there some method to keep trying?
> 
> Good question. The anchor is determined on a per-VMA basis, which can be
> changed easily, but in this patchset I used a very simple strategy — making
> all VMAs non-overlapping in the physical address space to get maximum
> overall contiguity, and not changing anchors even if non-moveable pages are
> encountered when generating physically contiguous pages.
> 
> Basically, the first VMA (VMA1) in the virtual address space has its anchor
> as (VMA1_start_VPN, ZONE_start_PFN), the second VMA (VMA2) has its anchor as
> (VMA2_start_VPN, ZONE_start_PFN + VMA1_size), and so on. This makes all VMAs
> non-overlapping in the physical address space during contiguous memory
> generation. When there is a non-moveable page, the anchor will not be
> changed, because no matter whether we assign a new anchor or not, the
> contiguous pages stop at the non-moveable page. If we are trying to get a
> new anchor, more effort is needed to avoid overlapping the new anchor with
> existing contiguous pages. Any overlap would nullify the existing contiguous
> pages.
> 
> To satisfy a user who wants a contiguous area with N pages, the minimal
> distance between any two non-moveable pages should be bigger than N pages in
> the system memory. Otherwise, nothing would work. If there is such an area
> (PFN1, PFN1+N) in the physical address space, you can set the anchor to
> (VPN_USER, PFN1) and use exchange_pages() to generate a contiguous area with
> N pages. Alternatively, alloc_contig_pages(PFN1, PFN1+N, …) could also work,
> but only at page allocation time. It also requires that the system has N
> free pages when alloc_contig_pages() is migrating the pages in (PFN1, PFN1+N)
> away, or you need to swap pages out to make the space.
> 
> Let me know if this makes sense to you.
> 

Yes, that is how I expected the implementation would work.  Thank you.

Another high level question.  One of the benefits of this approach is
that exchanging pages does not require N free pages as you describe
above.  This assumes that the vma which we are trying to make contiguous
is already populated.  If it is not populated, then you also need to
have N free pages.  Correct?  If this is true, then is the expected use
case to first populate a vma, and then try to make contiguous?  I would
assume that if it is not populated and a request to make contiguous is
given, we should try to allocate/populate the vma with contiguous pages
at that time?
-- 
Mike Kravetz


Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation

2019-02-19 Thread Mike Kravetz
On 2/19/19 9:19 PM, Zi Yan wrote:
> On 19 Feb 2019, at 19:18, Mike Kravetz wrote:
>> Another high level question.  One of the benefits of this approach is
>> that exchanging pages does not require N free pages as you describe
>> above.  This assumes that the vma which we are trying to make contiguous
>> is already populated.  If it is not populated, then you also need to
>> have N free pages.  Correct?  If this is true, then is the expected use
>> case to first populate a vma, and then try to make contiguous?  I would
>> assume that if it is not populated and a request to make contiguous is
>> given, we should try to allocate/populate the vma with contiguous pages
>> at that time?
> 
> Yes, I assume the pages within the VMA are already populated but not
> contiguous yet.
> 
> My approach considers memory contiguity as an on-demand resource. In some
> phases of an application, accelerators or RDMA controllers would
> process/transfer data in one or more VMAs, at which time contiguous memory
> can help reduce address translation overheads or lift certain constraints.
> And different VMAs could be processed at different program phases, thus it
> might be hard to get contiguous memory for all these VMAs at allocation time
> using alloc_contig_pages(). My approach can help get contiguous memory
> later, when the demand comes.
> 
> For some cases, you definitely can use alloc_contig_pages() to give users a
> contiguous area at page allocation time, if you know the user is going to
> use this area for accelerator data processing or as an RDMA buffer and the
> area size is fixed.
> 
> In addition, we can also use the khugepaged approach, having a daemon
> periodically scan VMAs and use alloc_contig_pages() to convert
> non-contiguous pages in a VMA to contiguous pages, but it would require N
> free pages during the conversion.
> 
> In sum, my approach complements alloc_contig_pages() and provides more
> flexibility. It is not trying to replace alloc_contig_pages().

Thank you for the explanation.  That makes sense.  I have mostly been
thinking about contiguous memory from an allocation perspective and did
not really consider other use cases.

-- 
Mike Kravetz


Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Mike Kravetz
On 1/30/19 1:14 PM, Mike Kravetz wrote:
> Files can be created and mapped in an explicitly mounted hugetlbfs
> filesystem.  If pages in such files are migrated, the filesystem
> usage will not be decremented for the associated pages.  This can
> result in mmap or page allocation failures as it appears there are
> fewer pages in the filesystem than there should be.

Does anyone have a little time to take a look at this?

While migration of hugetlb pages 'should' not be a common issue, we
have seen it happen via soft memory errors/page poisoning in production
environments.  Didn't see a leak in that case as it was with pages in a
Sys V shared mem segment.  However, our DB code is starting to make use
of files in explicitly mounted hugetlbfs filesystems.  Therefore, we are
more likely to hit this bug in the field.
-- 
Mike Kravetz

> 
> For example, a test program which hole punches, faults and migrates
> pages in such a file (1G in size) will eventually fail because it
> can not allocate a page.  Reported counts and usage at time of failure:
> 
> node0
> 537   free_hugepages
> 1024  nr_hugepages
> 0 surplus_hugepages
> node1
> 1000  free_hugepages
> 1024  nr_hugepages
> 0 surplus_hugepages
> 
> Filesystem Size  Used Avail Use% Mounted on
> nodev  4.0G  4.0G 0 100% /var/opt/hugepool
> 
> Note that the filesystem shows 4G of pages used, while actual usage is
> 511 pages (just under 1G).  Failed trying to allocate page 512.
> 
> If a hugetlb page is associated with an explicitly mounted filesystem,
> this information in contained in the page_private field.  At migration
> time, this information is not preserved.  To fix, simply transfer
> page_private from old to new page at migration time if necessary. Also,
> migrate_page_states() unconditionally clears page_private and PagePrivate
> of the old page.  It is unlikely, but possible that these fields could
> be non-NULL and are needed at hugetlb free page time.  So, do not touch
> these fields for hugetlb pages.
> 
> Cc: 
> Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
> Signed-off-by: Mike Kravetz 
> ---
>  fs/hugetlbfs/inode.c | 10 ++
>  mm/migrate.c | 10 --
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..fb6de1db8806 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct address_space 
> *mapping,
>   rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>   if (rc != MIGRATEPAGE_SUCCESS)
>   return rc;
> +
> + /*
> +  * page_private is subpool pointer in hugetlb pages, transfer
> +  * if needed.
> +  */
> + if (page_private(page) && !page_private(newpage)) {
> + set_page_private(newpage, page_private(page));
> + set_page_private(page, 0);
> + }
> +
>   if (mode != MIGRATE_SYNC_NO_COPY)
>   migrate_page_copy(newpage, page);
>   else
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..0d9708803553 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, struct 
> page *page)
>*/
>   if (PageSwapCache(page))
>   ClearPageSwapCache(page);
> - ClearPagePrivate(page);
> - set_page_private(page, 0);
> + /*
> +  * Unlikely, but PagePrivate and page_private could potentially
> +  * contain information needed at hugetlb free page time.
> +  */
> + if (!PageHuge(page)) {
> + ClearPagePrivate(page);
> + set_page_private(page, 0);
> + }
>  
>   /*
>* If any waiters have accumulated on the new page then
> 


Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-07 Thread Mike Kravetz
On 2/7/19 6:31 PM, Naoya Horiguchi wrote:
> On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
>> On 1/30/19 1:14 PM, Mike Kravetz wrote:
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct address_space 
>>> *mapping,
>>> rc = migrate_huge_page_move_mapping(mapping, newpage, page);
>>> if (rc != MIGRATEPAGE_SUCCESS)
>>> return rc;
>>> +
>>> +   /*
>>> +* page_private is subpool pointer in hugetlb pages, transfer
>>> +* if needed.
>>> +*/
>>> +   if (page_private(page) && !page_private(newpage)) {
>>> +   set_page_private(newpage, page_private(page));
>>> +   set_page_private(page, 0);
> 
> You don't have to copy PagePrivate flag?
> 

Well my original thought was no.  For hugetlb pages, PagePrivate is not
associated with page_private.  It indicates a reservation was consumed.
It is set  when a hugetlb page is newly allocated and the allocation is
associated with a reservation and the global reservation count is
decremented.  When the page is added to the page cache or rmap,
PagePrivate is cleared.  If the page is free'ed before being added to page
cache or rmap, PagePrivate tells free_huge_page to restore (increment) the
reserve count as we did not 'instantiate' the page.

So, PagePrivate is only set from the time a huge page is allocated until
it is added to page cache or rmap.  My original thought was that the page
could not be migrated during this time.  However, I am not sure if that
reasoning is correct.  The page is not locked, so it would appear that it
could be migrated?  But, if it can be migrated at this time then perhaps
there are bigger issues for the (hugetlb) page fault code?

>>> +
>>> +   }
>>> +
>>> if (mode != MIGRATE_SYNC_NO_COPY)
>>> migrate_page_copy(newpage, page);
>>> else
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index f7e4bfdc13b7..0d9708803553 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, struct 
>>> page *page)
>>>  */
>>> if (PageSwapCache(page))
>>> ClearPageSwapCache(page);
>>> -   ClearPagePrivate(page);
>>> -   set_page_private(page, 0);
>>> +   /*
>>> +* Unlikely, but PagePrivate and page_private could potentially
>>> +* contain information needed at hugetlb free page time.
>>> +*/
>>> +   if (!PageHuge(page)) {
>>> +   ClearPagePrivate(page);
>>> +   set_page_private(page, 0);
>>> +   }
> 
> # This argument is mainly for existing code...
> 
> According to the comment on migrate_page():
> 
> /*
>  * Common logic to directly migrate a single LRU page suitable for
>  * pages that do not use PagePrivate/PagePrivate2.
>  *
>  * Pages are locked upon entry and exit.
>  */
> int migrate_page(struct address_space *mapping, ...
> 
> So this common logic assumes that page_private is not used, so why do
> we explicitly clear page_private in migrate_page_states()?

Perhaps someone else knows.  If not, I can do some git research and
try to find out why.

> buffer_migrate_page(), which is commonly used for the case when
> page_private is used, does that clearing outside migrate_page_states().
> So I thought that hugetlbfs_migrate_page() could do in the similar manner.
> IOW, migrate_page_states() should not do anything on PagePrivate.
> But there're a few other .migratepage callbacks, and I'm not sure all of
> them are safe for the change, so this approach might not fit for a small fix.

I will look at those as well unless someone knows without researching.

> 
> # BTW, there seems a typo in $SUBJECT.

Thanks!

-- 
Mike Kravetz


Re: [LSF/MM TOPIC] Page flags, can we free up space ?

2019-01-22 Thread Mike Kravetz
On 1/22/19 12:17 PM, Jerome Glisse wrote:
> So lattely i have been looking at page flags and we are using 6 flags
> for memory reclaim and compaction:
> 
> PG_referenced
> PG_lru
> PG_active
> PG_workingset
> PG_reclaim
> PG_unevictable
> 
> On top of which you can add the page anonymous flag (anonymous or
> share memory)
> PG_anon // does not exist, lower bit of page->mapping
> 
> And also the movable flag (which alias with KSM)
> PG_movable // does not exist, lower bit of page->mapping
> 
> 
> So i would like to explore if there is a way to express the same amount
> of information with less bits. My methodology is to exhaustively list
> all the possible states (valid combination of above flags) and then to
> see how we change from one state to another (what event trigger the change
> like mlock(), page being referenced, ...) and under which rules (ie do we
> hold the page lock, zone lock, ...).
> 
> My hope is that there might be someway to use less bits to express the
> same thing. I am doing this because for my work on generic page write
> protection (ie KSM for file back page) which i talk about last year and
> want to talk about again ;) I will need to unalias the movable bit from
> KSM bit.
> 
> 
> Right now this is more a temptative ie i do not know if i will succeed,
> in any case i can report on failure or success and discuss my finding to
> get people opinions on the matter.
> 
> 
> I think everyone interested in mm will be interested in this topic :)

Explicitly adding Matthew on Cc as I am pretty sure he has been working
in this area.

-- 
Mike Kravetz


Re: [PATCH] hugetlb: allow to free gigantic pages regardless of the configuration

2019-01-17 Thread Mike Kravetz
On 1/17/19 10:39 AM, Alexandre Ghiti wrote:
> From: Alexandre Ghiti 
> 
> On systems without CMA or (MEMORY_ISOLATION && COMPACTION) activated but
> that support gigantic pages, boottime reserved gigantic pages can not be
> freed at all. This patch simply enables the possibility to hand back
> those pages to the memory allocator.
> 
> This commit then renames gigantic_page_supported and
> ARCH_HAS_GIGANTIC_PAGE to make them more accurate. Indeed, those values
> being false does not mean that the system cannot use gigantic pages: it
> just means that runtime allocation of gigantic pages is not supported,
> one can still allocate boottime gigantic pages if the architecture supports
> it.
> 
> Signed-off-by: Alexandre Ghiti 

Thank you for doing this!

Reviewed-by: Mike Kravetz 

> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -589,8 +589,8 @@ static inline bool pm_suspended_storage(void)
>  /* The below functions must be run on a range from a single zone. */
>  extern int alloc_contig_range(unsigned long start, unsigned long end,
> unsigned migratetype, gfp_t gfp_mask);
> -extern void free_contig_range(unsigned long pfn, unsigned nr_pages);
>  #endif
> +extern void free_contig_range(unsigned long pfn, unsigned int nr_pages);

I think nr_pages should be an unsigned long in cma_release() and here
as well, but that is beyond the scope of this patch.  Most callers of
cma_release pass in a truncated unsigned long.  The truncation is unlikely
to cause any issues, just would be nice if types were consistent.  I have
a patch to do that as part of a contiguous allocation series that I will
get back to someday.

> @@ -2350,9 +2355,10 @@ static unsigned long set_max_huge_pages(struct hstate 
> *h, unsigned long count,
>   break;
>   }
>  out:
> - ret = persistent_huge_pages(h);
> + h->max_huge_pages = persistent_huge_pages(h);
>   spin_unlock(&hugetlb_lock);
> - return ret;
> +
> + return 0;
>  }
>  
>  #define HSTATE_ATTR_RO(_name) \
> @@ -2404,11 +2410,6 @@ static ssize_t __nr_hugepages_store_common(bool 
> obey_mempolicy,
>   int err;
>   NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
>  
> - if (hstate_is_gigantic(h) && !gigantic_page_supported()) {
> - err = -EINVAL;
> - goto out;
> - }
> -
>   if (nid == NUMA_NO_NODE) {
>   /*
>* global hstate attribute
> @@ -2428,7 +2429,9 @@ static ssize_t __nr_hugepages_store_common(bool 
> obey_mempolicy,
>   } else
>   nodes_allowed = &node_states[N_MEMORY];
>  
> - h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
> + err = set_max_huge_pages(h, count, nodes_allowed);
> + if (err)
> + goto out;
>  
>   if (nodes_allowed != &node_states[N_MEMORY])
>   NODEMASK_FREE(nodes_allowed);

Yeah!  Those changes causes max_huge_pages to be modified while holding
hugetlb_lock as it should be.
-- 
Mike Kravetz


Re: [RFC PATCH] mm: align anon mmap for THP

2019-01-14 Thread Mike Kravetz
On 1/14/19 7:35 AM, Steven Sistare wrote:
> On 1/11/2019 6:28 PM, Mike Kravetz wrote:
>> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>>> On Fri, Jan 11, 2019 at 08:10:03PM +, Mike Kravetz wrote:
>>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>>> that if mmap is asking for a mapping length greater than huge page
>>>> size, it should align the returned address to huge page size.
> 
> A better heuristic would be to return an aligned address if the length
> is a multiple of the huge page size.  The gap (if any) between the end of
> the previous VMA and the start of this VMA would be filled by subsequent
> smaller mmap requests.  The new behavior would need to become part of the
> mmap interface definition so apps can rely on it and omit their hoop-jumping
> code.

Yes, the heuristic really should be 'length is a multiple of the huge page
size'.  As you mention, this would still leave gaps.  I need to look closer
but this may not be any worse than the trick of mapping an area with rounded
up length and then unmapping pages at the beginning.

When I sent this out, the thought in the back of my mind was that this doesn't
really matter unless there is some type of alignment guarantee.  Otherwise,
user space code needs continue employing their code to check/force alignment.
Making matters somewhat worse is that I do not believe there is a C interface to
query huge page size.  I thought there was discussion about adding one, but I
can not find it.
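
For context, the trick mentioned above (map with a rounded-up length, then
unmap the unaligned edges) looks roughly like this in code, with the huge
page size hard-coded precisely because it cannot be queried from C:

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL << 20)     /* assumed, not queried */

static void *mmap_thp_aligned(size_t len)
{
        size_t maplen = len + HPAGE_SIZE;
        char *p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uintptr_t addr, aligned;

        if (p == MAP_FAILED)
                return NULL;

        addr = (uintptr_t)p;
        aligned = (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

        if (aligned > addr)                             /* trim unaligned head */
                munmap(p, aligned - addr);
        if (addr + maplen > aligned + len)              /* trim tail */
                munmap((void *)(aligned + len), addr + maplen - (aligned + len));

        madvise((void *)aligned, len, MADV_HUGEPAGE);   /* opt in under 'madvise' */
        return (void *)aligned;
}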

> Personally I would like to see a new MAP_ALIGN flag and treat the addr
> argument as the alignment (like Solaris), but I am told that adding flags
> is problematic because old kernels accept undefined flag bits from userland
> without complaint, so their behavior would change.

Well, a flag would clearly define desired behavior.

As others have been mentioned, there are mechanisms in place that allow user
space code to get the alignment it wants.  However, it is at the expense of
an additional system call or two.  Perhaps the question is, "Is it worth
defining new behavior to eliminate this overhead?".

One other thing to consider is that at mmap time, we likely do not know if
the vma will/can use THP.  We would know if system wide THP configuration
is set to never or always.  However, I 'think' the default for most distros
is madvize.  Therefore, it is not until a subsequent madvise call that we
know THP will be employed.  If the application code will need to make this
separate madvise call, then perhaps it is not too much to expect that it
take explicit action to optimally align the mapping.

-- 
Mike Kravetz


Re: [RFC PATCH] mm: align anon mmap for THP

2019-01-15 Thread Mike Kravetz
On 1/15/19 12:24 AM, Kirill A. Shutemov wrote:
> On Mon, Jan 14, 2019 at 10:54:45AM -0800, Mike Kravetz wrote:
>> On 1/14/19 7:35 AM, Steven Sistare wrote:
>>> On 1/11/2019 6:28 PM, Mike Kravetz wrote:
>>>> On 1/11/19 1:55 PM, Kirill A. Shutemov wrote:
>>>>> On Fri, Jan 11, 2019 at 08:10:03PM +, Mike Kravetz wrote:
>>>>>> At LPC last year, Boaz Harrosh asked why he had to 'jump through hoops'
>>>>>> to get an address returned by mmap() suitably aligned for THP.  It seems
>>>>>> that if mmap is asking for a mapping length greater than huge page
>>>>>> size, it should align the returned address to huge page size.
>>>
>>> A better heuristic would be to return an aligned address if the length
>>> is a multiple of the huge page size.  The gap (if any) between the end of
>>> the previous VMA and the start of this VMA would be filled by subsequent
>>> smaller mmap requests.  The new behavior would need to become part of the
>>> mmap interface definition so apps can rely on it and omit their hoop-jumping
>>> code.
>>
>> Yes, the heuristic really should be 'length is a multiple of the huge page
>> size'.  As you mention, this would still leave gaps.  I need to look closer
>> but this may not be any worse than the trick of mapping an area with rounded
>> up length and then unmapping pages at the beginning.
> 
> The question is why it is any better. Virtual address space is generally
> cheap, an additional VMA may be more significant due to find_vma() overhead.
> 
> And you don't *need* to unmap anything. Just use an aligned pointer.

You are correct, it is not any better.

I know you do not need to unmap anything.  However, I believe people are
writing code which does this today.  For example, qemu's qemu_ram_mmap()
utility routine does this, but it may have other reasons for creating
the gap.

Thanks for all of the feedback.  I do not think there is anything we can
or should do in this area.  As Steve said, 'power users' who want to get
optimal THP usage will write the code to make that happen.
-- 
Mike Kravetz


[PATCH RFC v3 2/2] hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race

2019-02-01 Thread Mike Kravetz
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race.  One obvious omission in the
current code is removing a page newly added to the page cache.  This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations.  To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out.  There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv.  Backing out a reservation may
require memory allocation which could fail so that needs to be taken
into account as well.

Instead of writing the required complicated code for this rare
occurrence, just eliminate the race.  i_mmap_rwsem is now held in read
mode for the duration of page fault processing.  Hold i_mmap_rwsem in
write mode when modifying i_size.  In this way, truncation can not
proceed when page faults are being processed.  In addition, i_size
will not change during fault processing so a single check can be made
to ensure faults are not beyond (proposed) end of file.  Faults can
still race with hole punch, but that race is handled by existing code
and the use of hugetlb_fault_mutex.

With this modification, checks for races with truncation in the page
fault path can be simplified and removed.  remove_inode_hugepages no
longer needs to take hugetlb_fault_mutex in the case of truncation.
Comments are expanded to explain reasoning behind locking.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c | 32 ++--
 mm/hugetlb.c | 23 +++
 2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5280fe3aad2f..039566175b48 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -384,10 +384,9 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t 
start, pgoff_t end)
  * In this case, we first scan the range and release found pages.
  * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv
  * maps and global counts.  Page faults can not race with truncation
- * in this routine.  hugetlb_no_page() prevents page faults in the
- * truncated range.  It checks i_size before allocation, and again after
- * with the page table lock for the page held.  The same lock must be
- * acquired to unmap a page.
+ * in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
+ * page faults in the truncated range by checking i_size.  i_size is
+ * modified while holding i_mmap_rwsem.
  * hole punch is indicated if end is not LLONG_MAX
  * In the hole punch case we scan the range and release found pages.
  * Only when releasing a page is the associated region/reserv map
@@ -425,11 +424,19 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
struct page *page = pvec.pages[i];
u32 hash;
 
-   index = page->index;
-   hash = hugetlb_fault_mutex_hash(h, current->mm,
+   if (!truncate_op) {
+   /*
+* Only need to hold the fault mutex in the
+* hole punch case.  This prevents races with
+* page faults.  Races are not possible in the
+* case of truncation.
+*/
+   index = page->index;
+   hash = hugetlb_fault_mutex_hash(h, current->mm,
&pseudo_vma,
mapping, index, 0);
-   mutex_lock(&hugetlb_fault_mutex_table[hash]);
+   mutex_lock(&hugetlb_fault_mutex_table[hash]);
+   }
 
/*
 * If page is mapped, it was faulted in after being
@@ -472,7 +479,8 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
}
 
unlock_page(page);
-   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+   if (!truncate_op)
+   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
huge_pagevec_release(&pvec);
cond_resched();
@@ -503,8 +511,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t 
offset)
BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
 
-   i_size_write(inode, offset);
i_mmap_lock_write(mapping);

[PATCH RFC v3 1/2] hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

2019-02-01 Thread Mike Kravetz
While looking at BUGs associated with invalid huge page map counts,
it was discovered and observed that a huge pte pointer could become
'invalid' and point to another task's page table.  Consider the
following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation,
it unmaps everyone who has the file mapped.  If the range being
truncated is covered by a shared pmd, huge_pmd_unshare will be called.
For all but the last user of the shared pmd, huge_pmd_unshare will
clear the pud pointing to the pmd.  If the task in the middle of the
page fault is not the last user, the ptep returned by huge_pte_alloc
now points to another task's page table or worse.  This leads to bad
things such as incorrect page map/reference counts or invalid memory
references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults.  This is not the order
specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
isolated today.  Therefore, we use this alternative locking order for
PageHuge() pages.
 mapping->i_mmap_rwsem
   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
 page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write()
is introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma.  A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c|   2 +
 include/linux/fs.h  |   5 ++
 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c| 167 
 mm/memory-failure.c |  29 ++-
 mm/migrate.c|  24 +-
 mm/rmap.c   |  17 +++-
 mm/userfaultfd.c|  11 ++-
 8 files changed, 239 insertions(+), 24 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fb6de1db8806..5280fe3aad2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -443,7 +443,9 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
 
+   mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_lock_write(mapping);
+   mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
(index + 1) * pages_per_huge_page(h));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..2369ae58a24d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -501,6 +501,11 @@ static inline void i_mmap_lock_write(struct address_space 
*mapping)
down_write(&mapping->i_mmap_rwsem);
 }
 
+static inline int i_mmap_trylock_write(struct address_space *mapping)
+{
+   return down_write_trylock(&mapping->i_mmap_rwsem);
+}
+
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
up_write(&mapping->i_mmap_rwsem);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..c6959eb23344 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -130,6 +130,8 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct 
mm_struct *mm,
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
 
+struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
+
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
@@ -172,6 +174,12 @@ static inline unsigned long hugetlb_total_pages(void)
return 0;
 }
 
+static inline struct address_space *hugetlb_page_mapping_lock_write(
+   struct page *hpage)
+{
+   return NULL;
+}
+
 static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
pte_t *ptep)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a80832487981..9198b00d25e6 100644

[PATCH RFC v3 0/2] hugetlbfs: use i_mmap_rwsem for more synchronization

2019-02-01 Thread Mike Kravetz
I'm kicking this back to RFC status as there may be a couple other
options we want to explore.  This expanded use of i_mmap_rwsem in
hugetlbfs was recently pushed upstream and reverted due to locking
issues.
http://lkml.kernel.org/r/20181218223557.5202-3-mike.krav...@oracle.com
http://lkml.kernel.org/r/2018123013.22193-3-mike.krav...@oracle.com

The biggest problem with those patches was lock ordering.  Recall, the
two issues we are trying to address:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

To effectively use i_mmap_rwsem to address these issues it needs to
be held (in read mode) during page fault processing.  However, during
fault processing we need to lock the page we will be adding.  Lock
ordering requires we take page lock before i_mmap_rwsem.  Waiting until
after taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the
lock ordering for hugetlb pages.  This is not difficult as hugetlbfs
processing is done separate from core mm in many places.  However, I
don't really like this idea.  Much ugliness is contained in the new
routine hugetlb_page_mapping_lock_write() of patch 1.
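
As a reading aid, the fault path ordering the two patches establish for
PageHuge() pages can be sketched as follows (simplified, lookup and error
handling omitted):

        /* hugetlb fault path only; everyone else keeps page lock -> i_mmap_rwsem */
        i_mmap_lock_read(mapping);              /* blocks huge_pmd_unshare and truncate */
        hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, address);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        lock_page(page);                        /* taken last for PageHuge() pages */

        /* ... i_size check, reservation handling, map the page ... */

        unlock_page(page);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        i_mmap_unlock_read(mapping);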

If this approach of extending the usage of i_mmap_rwsem is not acceptable,
there are two other options I can think of.
- Add a new rw-semaphore that lives in the hugetlbfs specific inode
  extension.  The good thing is that this semaphore would be hugetlbfs
  specific and not directly exposed to the rest of the mm code.  Therefore,
  interaction with other locking schemes is minimized.
- Don't really add any new synchronization, but notice and catch all the
  races.  After catching the race, cleanup, backout, retry ... etc, as
  needed.  This can get really ugly, especially for huge page reservations.
  At one time, I started writing some of the reservation backout code for
  page faults and it got so ugly and complicated I went down the path of
  adding synchronization to avoid the races.

Suggestions on how to proceed would be appreciated.  If you think the
following patches are not too ugly, comments on those would also be
welcome.

Mike Kravetz (2):
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race

 fs/hugetlbfs/inode.c|  34 +---
 include/linux/fs.h  |   5 ++
 include/linux/hugetlb.h |   8 ++
 mm/hugetlb.c| 186 ++--
 mm/memory-failure.c |  29 ++-
 mm/migrate.c|  24 +-
 mm/rmap.c   |  17 +++-
 mm/userfaultfd.c|  11 ++-
 8 files changed, 270 insertions(+), 44 deletions(-)

-- 
2.17.2



Re: [PATCH] huegtlbfs: fix page leak during migration of file pages

2019-02-01 Thread Mike Kravetz
On 1/31/19 6:12 AM, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 290408d4a250 hugetlb: hugepage migration core.
> 
> The bot has tested the following trees: v4.20.5, v4.19.18, v4.14.96, 
> v4.9.153, v4.4.172, v3.18.133.
> 
> v4.20.5: Build OK!
> v4.19.18: Build OK!
> v4.14.96: Build OK!
> v4.9.153: Failed to apply! Possible dependencies:
> 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
> 
> v4.4.172: Failed to apply! Possible dependencies:
> 09cbfeaf1a5a ("mm, fs: get rid of PAGE_CACHE_* and 
> page_cache_{get,release} macros")
> 0e749e54244e ("dax: increase granularity of dax_clear_blocks() 
> operations")
> 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
> 2a28900be206 ("udf: Export superblock magic to userspace")
> 4420cfd3f51c ("staging: lustre: format properly all comment blocks for 
> LNet core")
> 48b4800a1c6a ("zsmalloc: page migration support")
> 5057dcd0f1aa ("virtio_balloon: export 'available' memory to balloon 
> statistics")
> 52db400fcd50 ("pmem, dax: clean up clear_pmem()")
> 5b7a487cf32d ("f2fs: add customized migrate_page callback")
> 5fd88337d209 ("staging: lustre: fix all conditional comparison to zero in 
> LNet layer")
> a188222b6ed2 ("net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK")
> b1123ea6d3b3 ("mm: balloon: use general non-lru movable page feature")
> b2e0d1625e19 ("dax: fix lifetime of in-kernel dax mappings with 
> dax_map_atomic()")
> bda807d44454 ("mm: migrate: support non-lru movable page migration")
> c8b8e32d700f ("direct-io: eliminate the offset argument to ->direct_IO")
> d1a5f2b4d8a1 ("block: use DAX for partition table reads")
> e10624f8c097 ("pmem: fail io-requests to known bad blocks")
> 
> v3.18.133: Failed to apply! Possible dependencies:
> 0722b1011a5f ("f2fs: set page private for inmemory pages for truncation")
> 1601839e9e5b ("f2fs: fix to release count of meta page in 
> ->invalidatepage")
> 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
> 31a3268839c1 ("f2fs: cleanup if-statement of phase in gc_data_segment")
> 34ba94bac938 ("f2fs: do not make dirty any inmemory pages")
> 34d67debe02b ("f2fs: add infra struct and helper for inline dir")
> 4634d71ed190 ("f2fs: fix missing kmem_cache_free")
> 487261f39bcd ("f2fs: merge {invalidate,release}page for meta/node/data 
> pages")
> 5b7a487cf32d ("f2fs: add customized migrate_page callback")
> 67298804f344 ("f2fs: introduce struct inode_management to wrap inner 
> fields")
> 769ec6e5b7d4 ("f2fs: call radix_tree_preload before radix_tree_insert")
> 7dda2af83b2b ("f2fs: more fast lookup for gc_inode list")
> 8b26ef98da33 ("f2fs: use rw_semaphore for nat entry lock")
> 8c402946f074 ("f2fs: introduce the number of inode entries")
> 9be32d72becc ("f2fs: do retry operations with cond_resched")
> 9e4ded3f309e ("f2fs: activate f2fs_trace_pid")
> d5053a34a9cc ("f2fs: introduce -o fastboot for reducing booting time 
> only")
> e5e7ea3c86e5 ("f2fs: control the memory footprint used by ino entries")
> f68daeebba5a ("f2fs: keep PagePrivate during releasepage")
> 
> 
> How should we proceed with this patch?

Hello automated Sasha,

First, let's wait for review/ack.  However, the patch does not strictly
'depend' on the functionality of the commits in the lists above.  If/when
this goes upstream I can provide backports for 4.9, 4.4 and 3.18.

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: always use address space in inode for resv_map pointer

2019-05-08 Thread Mike Kravetz
On 5/8/19 12:10 AM, yuyufen wrote:
> On 2019/4/20 4:44, Mike Kravetz wrote:
>> Continuing discussion about commit 58b6e5e8f1ad ("hugetlbfs: fix memory
>> leak for resv_map") brought up the issue that inode->i_mapping may not
>> point to the address space embedded within the inode at inode eviction
>> time.  The hugetlbfs truncate routine handles this by explicitly using
>> inode->i_data.  However, code cleaning up the resv_map will still use
>> the address space pointed to by inode->i_mapping.  Luckily, private_data
>> is NULL for address spaces in all such cases today but, there is no
>> guarantee this will continue.
>>
>> Change all hugetlbfs code getting a resv_map pointer to explicitly get
>> it from the address space embedded within the inode.  In addition, add
>> more comments in the code to indicate why this is being done.
>>
>> Reported-by: Yufen Yu 
>> Signed-off-by: Mike Kravetz 
...
> 
> Has this patch been applied?

Andrew has pulled it into his tree.  However, I do not believe there has
been an ACK or Review, so am not sure of the exact status.

> I think it is better to add fixes label, like:
> Fixes: 58b6e5e8f1ad ("hugetlbfs: fix memory leak for resv_map")
> 
> Since commit 58b6e5e8f1a has been merged to stable, this patch is also needed.
> https://www.spinics.net/lists/stable/msg298740.html

It must have been the AI that decided 58b6e5e8f1a needed to go to stable.
Even though this technically does not fix 58b6e5e8f1a, I'm OK with adding
the Fixes: to force this to go to the same stable trees.

-- 
Mike Kravetz


Re: [PATCH] hugetlbfs: always use address space in inode for resv_map pointer

2019-05-09 Thread Mike Kravetz
On 5/9/19 4:11 PM, Andrew Morton wrote:
> On Wed, 8 May 2019 13:16:09 -0700 Mike Kravetz  
> wrote:
> 
>>> I think it is better to add fixes label, like:
>>> Fixes: 58b6e5e8f1ad ("hugetlbfs: fix memory leak for resv_map")
>>>
>>> Since commit 58b6e5e8f1a has been merged to stable, this patch is also needed.
>>> https://www.spinics.net/lists/stable/msg298740.html
>>
>> It must have been the AI that decided 58b6e5e8f1a needed to go to stable.
> 
> grr.
> 
>> Even though this technically does not fix 58b6e5e8f1a, I'm OK with adding
>> the Fixes: to force this to go to the same stable trees.
> 
> Why are we bothering with any of this, given that
> 
> : Luckily, private_data is NULL for address spaces in all such cases
> : today but, there is no guarantee this will continue.
> 
> ?

You are right.  For stable releases, I do not see any way for this to
be an issue.  We are lucky today (and in the past).  The patch is there
to guard against code changes which may cause this condition to change
in the future.

Yufen Yu, do you see this actually fixing a problem in stable releases?
I believe you originally said this is not a problem today, which would
also imply older releases.  Just want to make sure I am not missing something.
-- 
Mike Kravetz

> Even though 58b6e5e8f1ad was inappropriately backported, the above
> still holds, so what problem does a backport of "hugetlbfs: always use
> address space in inode for resv_map pointer" actually solve?
> 
> And yes, some review of this would be nice


Re: [PATCH, RFC 2/2] Implement sharing/unsharing of PMDs for FS/DAX

2019-05-10 Thread Mike Kravetz
On 5/10/19 9:16 AM, Larry Bassel wrote:
> On 09 May 19 09:49, Matthew Wilcox wrote:
>> On Thu, May 09, 2019 at 09:05:33AM -0700, Larry Bassel wrote:
>>> This is based on (but somewhat different from) what hugetlbfs
>>> does to share/unshare page tables.
>>
>> Wow, that worked out far more cleanly than I was expecting to see.
> 
> Yes, I was pleasantly surprised. As I've mentioned already, I 
> think this is at least partially due to the nature of DAX.

I have not looked in detail to make sure this is indeed all the places you
need to hook and special case for sharing/unsharing.  Since this scheme is
somewhat like that used for hugetlb, I just wanted to point out some nasty
bugs related to hugetlb PMD sharing that were fixed last year.

5e41540c8a0f hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
dff11abe280b hugetlb: take PMD sharing into account when flushing tlb/caches
017b1660df89 mm: migration: fix migration of huge PMD shared pages

The common issue in these is that when unmapping a page with a shared PMD
mapping you need to flush the entire shared range and not just the unmapped
page.  The above changes were hugetlb specific.  I do not know if any of
this applies in the case of DAX.
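
Very roughly, the gist of those hugetlb fixes is the sketch below; the helper
is the real hugetlb one, but haddr/h/vma are assumed context and whether an
equivalent is needed for DAX is exactly the open question:

        unsigned long start = haddr, end = haddr + huge_page_size(h);

        /* widen to the PUD_SIZE region a shared PMD covers, if sharing is possible */
        adjust_range_if_pmd_sharing_possible(vma, &start, &end);
        /* ... huge_pmd_unshare() / clear entries for the page being unmapped ... */
        flush_tlb_range(vma, start, end);       /* flush the whole shared range */
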
-- 
Mike Kravetz


[PATCH REBASED] hugetlbfs: fix potential over/underflow setting node specific nr_hugepages

2019-03-28 Thread Mike Kravetz
The number of node specific huge pages can be set via a file such as:
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
When a node specific value is specified, the global number of huge
pages must also be adjusted.  This adjustment is calculated as the
specified node specific value + (global value - current node value).
If the node specific value provided by the user is large enough, this
calculation could overflow an unsigned long leading to a smaller
than expected number of huge pages.

To fix, check the calculation for overflow.  If overflow is detected,
use ULONG_MAX as the requested value.  This is in line with the user
request to allocate as many huge pages as possible.
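
For reference, a minimal sketch of the overflow check being described
(illustrative only; the real check sits in set_max_huge_pages() under
hugetlb_lock, as in the patch below):

#include <limits.h>

static unsigned long adjusted_global_count(unsigned long requested,
					   unsigned long global_count,
					   unsigned long node_count)
{
	unsigned long count = requested;

	/* new global target = node request + (global - current node value) */
	count += global_count - node_count;

	/* wrapped past ULONG_MAX: treat as "as many pages as possible" */
	if (count < requested)
		count = ULONG_MAX;

	return count;
}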

It was also noticed that the above calculation was done outside the
hugetlb_lock.  Therefore, the values could be inconsistent and result
in underflow.  To fix, the calculation is moved within the routine
set_max_huge_pages() where the lock is held.

In addition, the code in __nr_hugepages_store_common() which tries to
handle the case of not being able to allocate a node mask would likely
result in incorrect behavior.  Luckily, it is very unlikely we will
ever take this path.  If we do, simply return ENOMEM.

Reported-by: Jing Xiangfeng 
Signed-off-by: Mike Kravetz 
---
This was sent upstream during 5.1 merge window, but dropped as it was
based on an earlier version of Alex Ghiti's patch which was dropped.
Now rebased on top of Alex Ghiti's "[PATCH v8 0/4] Fix free/allocation
of runtime gigantic pages" series which was just added to mmotm.

 mm/hugetlb.c | 41 ++---
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f3e84c1bef11..f79ae4e42159 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2287,13 +2287,33 @@ static int adjust_pool_surplus(struct hstate *h, 
nodemask_t *nodes_allowed,
 }
 
 #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
-static int set_max_huge_pages(struct hstate *h, unsigned long count,
+static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
  nodemask_t *nodes_allowed)
 {
unsigned long min_count, ret;
 
spin_lock(&hugetlb_lock);
 
+   /*
+* Check for a node specific request.
+* Changing node specific huge page count may require a corresponding
+* change to the global count.  In any case, the passed node mask
+* (nodes_allowed) will restrict alloc/free to the specified node.
+*/
+   if (nid != NUMA_NO_NODE) {
+   unsigned long old_count = count;
+
+   count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
+   /*
+* User may have specified a large count value which caused the
+* above calculation to overflow.  In this case, they wanted
+* to allocate as many huge pages as possible.  Set count to
+* largest possible value to align with their intention.
+*/
+   if (count < old_count)
+   count = ULONG_MAX;
+   }
+
/*
 * Gigantic pages runtime allocation depend on the capability for large
 * page range allocation.
@@ -2445,15 +2465,22 @@ static ssize_t __nr_hugepages_store_common(bool 
obey_mempolicy,
}
} else if (nodes_allowed) {
/*
-* per node hstate attribute: adjust count to global,
-* but restrict alloc/free to the specified node.
+* Node specific request.  count adjustment happens in
+* set_max_huge_pages() after acquiring hugetlb_lock.
 */
-   count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
init_nodemask_of_node(nodes_allowed, nid);
-   } else
-   nodes_allowed = &node_states[N_MEMORY];
+   } else {
+   /*
+* Node specific request, but we could not allocate the few
+* words required for a node mask.  We are unlikely to hit
+* this condition.  Since we can not pass down the appropriate
+* node mask, just return ENOMEM.
+*/
+   err = -ENOMEM;
+   goto out;
+   }
 
-   err = set_max_huge_pages(h, count, nodes_allowed);
+   err = set_max_huge_pages(h, count, nid, nodes_allowed);
 
 out:
if (nodes_allowed != &node_states[N_MEMORY])
-- 
2.20.1



[PATCH v2 0/2] A couple hugetlbfs fixes

2019-03-28 Thread Mike Kravetz
I stumbled on these two hugetlbfs issues while looking at other things:
- The 'restore reserve' functionality at page free time should not
  be adjusting subpool counts.
- A BUG can be triggered (not easily) due to temporarily mapping a
  page before doing a COW.

Both are described in detail in the commit message of the patches.
I would appreciate comments from Davidlohr Bueso as one patch is
directly related to code he added in commit 8382d914ebf7.

I did not cc stable as the first problem has been around since reserves
were added to hugetlbfs and nobody has noticed.  The second is very hard
to hit/reproduce.

v2 - Update definition and all callers of hugetlb_fault_mutex_hash as
 the arguments mm and vma are no longer used or necessary.

Mike Kravetz (2):
  hugetlbfs: on restore reserve error path retain subpool reservation
  hugetlb: use same fault hash key for shared and private mappings

 fs/hugetlbfs/inode.c|  7 ++-
 include/linux/hugetlb.h |  4 +---
 mm/hugetlb.c| 43 +
 mm/userfaultfd.c|  3 +--
 4 files changed, 26 insertions(+), 31 deletions(-)

-- 
2.20.1



[PATCH v2 1/2] hugetlbfs: on restore reserve error path retain subpool reservation

2019-03-28 Thread Mike Kravetz
When a huge page is allocated, PagePrivate() is set if the allocation
consumed a reservation.  When freeing a huge page, PagePrivate is checked.
If set, it indicates the reservation should be restored.  PagePrivate
being set at free huge page time mostly happens on error paths.

When huge page reservations are created, a check is made to determine if
the mapping is associated with an explicitly mounted filesystem.  If so,
pages are also reserved within the filesystem.  The default action when
freeing a huge page is to decrement the usage count in any associated
explicitly mounted filesystem.  However, if the reservation is to be
restored, the reservation/use count within the filesystem should not be
decremented.  Otherwise, a subsequent page allocation and free for the same
mapping location will cause the filesystem usage to go 'negative'.

Filesystem      Size   Used  Avail  Use%  Mounted on
nodev           4.0G  -4.0M   4.1G     -  /opt/hugepool

To fix, when freeing a huge page do not adjust filesystem usage if
PagePrivate() is set to indicate the reservation should be restored.

Signed-off-by: Mike Kravetz 
---
 mm/hugetlb.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f79ae4e42159..8651d6a602f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1268,12 +1268,23 @@ void free_huge_page(struct page *page)
ClearPagePrivate(page);
 
/*
-* A return code of zero implies that the subpool will be under its
-* minimum size if the reservation is not restored after page is free.
-* Therefore, force restore_reserve operation.
+* If PagePrivate() was set on page, page allocation consumed a
+* reservation.  If the page was associated with a subpool, there
+* would have been a page reserved in the subpool before allocation
+* via hugepage_subpool_get_pages().  Since we are 'restoring' the
+* reservation, do not call hugepage_subpool_put_pages() as this will
+* remove the reserved page from the subpool.
 */
-   if (hugepage_subpool_put_pages(spool, 1) == 0)
-   restore_reserve = true;
+   if (!restore_reserve) {
+   /*
+* A return code of zero implies that the subpool will be
+* under its minimum size if the reservation is not restored
+* after page is free.  Therefore, force restore_reserve
+* operation.
+*/
+   if (hugepage_subpool_put_pages(spool, 1) == 0)
+   restore_reserve = true;
+   }
 
spin_lock(&hugetlb_lock);
clear_page_huge_active(page);
-- 
2.20.1



[PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings

2019-03-28 Thread Mike Kravetz
hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared keys off address_space and file index.  Private
keys off mm and virtual address.  Consider a private mappings of a
populated hugetlbfs file.  A write fault will first map the page from
the file and then do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could temporarily map the file
page before COW.  This causes problems (BUG) for the hole punch code
as it expects the mutex to prevent additional uses/mappings of the page.

There seems to be another potential COW issue/race with this approach
of different private and shared keys, as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c|  7 ++-
 include/linux/hugetlb.h |  4 +---
 mm/hugetlb.c| 22 ++
 mm/userfaultfd.c|  3 +--
 4 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ec32fece5e1e..6189ba80b57b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -440,9 +440,7 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
u32 hash;
 
index = page->index;
-   hash = hugetlb_fault_mutex_hash(h, current->mm,
-   &pseudo_vma,
-   mapping, index, 0);
+   hash = hugetlb_fault_mutex_hash(h, mapping, index, 0);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
/*
@@ -639,8 +637,7 @@ static long hugetlbfs_fallocate(struct file *file, int 
mode, loff_t offset,
addr = index * hpage_size;
 
/* mutex taken here, fault path and hole punch */
-   hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
-   index, addr);
+   hash = hugetlb_fault_mutex_hash(h, mapping, index, addr);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
/* See if already present in mapping to avoid alloc/free */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ea35263eb76b..3bc0d02649fe 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -123,9 +123,7 @@ void move_hugetlb_state(struct page *oldpage, struct page 
*newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
pgoff_t idx, unsigned long address);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8651d6a602f9..4409a87434f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3837,8 +3837,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 * handling userfault.  Reacquire after handling
 * fault to make calling code simpler.
 */
-   hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
-   idx, haddr);
+   hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
ret = handle_userfault(&vmf, VM_UFFD_MISSING);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -3946,21 +3945,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_SMP
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-   struct vm_area_struct *vma,
-   struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
pgoff_t idx, unsigned long address)
 {
unsigned long key[2];
u32 hash;
 
-   if (vma->vm_flags & VM_SHARED) {
-   key[0] = (unsigned long) mapping;
-   key[1] = idx;
-   }
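
(The hunk above is truncated in this archive.  For reference, a sketch of the
unified key computation the patch describes -- shared and private mappings
both hash only the mapping and file index; this mirrors the patch's intent
rather than quoting it verbatim.)

u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
			     pgoff_t idx, unsigned long address)
{
	unsigned long key[2];
	u32 hash;

	/* same key for shared and private mappings: mapping + file index */
	key[0] = (unsigned long) mapping;
	key[1] = idx;

	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);

	return hash & (num_fault_mutexes - 1);
}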

[PATCH] hugetlbfs: always use address space in inode for resv_map pointer

2019-04-19 Thread Mike Kravetz
Continuing discussion about commit 58b6e5e8f1ad ("hugetlbfs: fix memory
leak for resv_map") brought up the issue that inode->i_mapping may not
point to the address space embedded within the inode at inode eviction
time.  The hugetlbfs truncate routine handles this by explicitly using
inode->i_data.  However, code cleaning up the resv_map will still use
the address space pointed to by inode->i_mapping.  Luckily, private_data
is NULL for address spaces in all such cases today but, there is no
guarantee this will continue.

Change all hugetlbfs code getting a resv_map pointer to explicitly get
it from the address space embedded within the inode.  In addition, add
more comments in the code to indicate why this is being done.

Reported-by: Yufen Yu 
Signed-off-by: Mike Kravetz 
---
 fs/hugetlbfs/inode.c | 11 +--
 mm/hugetlb.c | 19 ++-
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9285dd4f4b1c..cbc649cd1722 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -499,8 +499,15 @@ static void hugetlbfs_evict_inode(struct inode *inode)
struct resv_map *resv_map;
 
remove_inode_hugepages(inode, 0, LLONG_MAX);
-   resv_map = (struct resv_map *)inode->i_mapping->private_data;
-   /* root inode doesn't have the resv_map, so we should check it */
+
+   /*
+* Get the resv_map from the address space embedded in the inode.
+* This is the address space which points to any resv_map allocated
+* at inode creation time.  If this is a device special inode,
+* i_mapping may not point to the original address space.
+*/
+   resv_map = (struct resv_map *)(&inode->i_data)->private_data;
+   /* Only regular and link inodes have associated reserve maps */
if (resv_map)
resv_map_release(&resv_map->refs);
clear_inode(inode);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2d9100..b30e97b0ef37 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -740,7 +740,15 @@ void resv_map_release(struct kref *ref)
 
 static inline struct resv_map *inode_resv_map(struct inode *inode)
 {
-   return inode->i_mapping->private_data;
+   /*
+* At inode evict time, i_mapping may not point to the original
+* address space within the inode.  This original address space
+* contains the pointer to the resv_map.  So, always use the
+* address space embedded within the inode.
+* The VERY common case is inode->mapping == &inode->i_data but,
+* this may not be true for device special inodes.
+*/
+   return (struct resv_map *)(&inode->i_data)->private_data;
 }
 
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
@@ -4477,6 +4485,11 @@ int hugetlb_reserve_pages(struct inode *inode,
 * called to make the mapping read-write. Assume !vma is a shm mapping
 */
if (!vma || vma->vm_flags & VM_MAYSHARE) {
+   /*
+* resv_map can not be NULL as hugetlb_reserve_pages is only
+* called for inodes for which resv_maps were created (see
+* hugetlbfs_get_inode).
+*/
resv_map = inode_resv_map(inode);
 
chg = region_chg(resv_map, from, to);
@@ -4568,6 +4581,10 @@ long hugetlb_unreserve_pages(struct inode *inode, long 
start, long end,
struct hugepage_subpool *spool = subpool_inode(inode);
long gbl_reserve;
 
+   /*
+* Since this routine can be called in the evict inode path for all
+* hugetlbfs inodes, resv_map could be NULL.
+*/
if (resv_map) {
chg = region_del(resv_map, start, end);
/*
-- 
2.20.1



[Question] Should direct reclaim time be bounded?

2019-04-22 Thread Mike Kravetz
I was looking into an issue on our distro kernel where allocation of huge
pages via "echo X > /proc/sys/vm/nr_hugepages" was taking a LONG time.
In this particular case, we were actually allocating huge pages VERY slowly
at the rate of about one every 30 seconds.  I don't want to talk about the
code in our distro kernel, but the situation that caused this issue exists
upstream and appears to be worse there.

One thing to note is that hugetlb page allocation can really stress the
page allocator.  The routine alloc_pool_huge_page is of special concern.

/*
 * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
 * manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
struct page *page;
int nr_nodes, node;
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
if (page)
break;
}

if (!page)
return 0;

put_page(page); /* free it into the hugepage allocator */

return 1;
}

This routine is called for each huge page the user wants to allocate.  If
they do "echo 4096 > nr_hugepages", this is called 4096 times.
alloc_fresh_huge_page() will eventually call __alloc_pages_nodemask with
__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN in addition to __GFP_THISNODE.
That for_each_node_mask_to_alloc() macro is hugetlbfs specific and attempts
to allocate huge pages in a round robin fashion.  When asked to allocate a
huge page, it first tries the 'next_nid_to_alloc'.  If that fails, it goes
to the next allowed node.  This is 'documented' in kernel docs as:

"On a NUMA platform, the kernel will attempt to distribute the huge page pool
 over all the set of allowed nodes specified by the NUMA memory policy of the
 task that modifies nr_hugepages.  The default for the allowed nodes--when the
 task has default memory policy--is all on-line nodes with memory.  Allowed
 nodes with insufficient available, contiguous memory for a huge page will be
 silently skipped when allocating persistent huge pages.  See the discussion
 below of the interaction of task memory policy, cpusets and per node attributes
 with the allocation and freeing of persistent huge pages.

 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is present in system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any."

However, consider the case of a 2 node system where:
node 0 has 2GB memory
node 1 has 4GB memory

Now, if one wants to allocate 4GB of huge pages, they may be tempted to simply
"echo 2048 > nr_hugepages".  At first this will go well until node 0 is out
of memory.  When this happens, alloc_pool_huge_page() will continue to be
called.  Because of that for_each_node_mask_to_alloc() macro, it will likely
attempt to first allocate a page from node 0.  It will call direct reclaim and
compaction until it fails.  Then, it will successfully allocate from node 1.

In our distro kernel, I am thinking about making allocations try "less hard"
on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
I was going to try something like this on an upstream kernel when I noticed
that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
instrumented __alloc_pages_slowpath() and saw it take well over an hour
before I 'tricked' it into exiting.

[ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
[ 5916.249271]   reclaim 5295741  compact 1

This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
with 8GB total memory.
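
For illustration only, the "try less hard" idea mentioned above could look
something like the sketch below (hypothetical, not a proposed patch;
alloc_fresh_huge_page() as in the snippet earlier):

/*
 * Hypothetical: once allocations on a node start failing, drop the
 * heavyweight reclaim/compaction retries so the round-robin loop moves
 * on to the next allowed node quickly.
 */
static struct page *alloc_fresh_huge_page_less_hard(struct hstate *h,
				gfp_t gfp_mask, int nid,
				nodemask_t *nodes_allowed, bool node_failing)
{
	if (node_failing) {
		gfp_mask &= ~__GFP_RETRY_MAYFAIL;
		gfp_mask |= __GFP_NORETRY;	/* fail fast, skip long reclaim */
	}

	return alloc_fresh_huge_page(h, gfp_mask, nid, nodes_allowed);
}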

I have not started looking at the direct reclaim code to see exactly where
we may be stuck, or trying really hard.  My question is, "Is this expected
or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
getting 'stuck' in direct reclaim, the documented behavior for huge page
allocation is not going to happen.
-- 
Mike Kravetz


Re: [Question] Should direct reclaim time be bounded?

2019-04-23 Thread Mike Kravetz
On 4/23/19 12:19 AM, Michal Hocko wrote:
> On Mon 22-04-19 21:07:28, Mike Kravetz wrote:
>> In our distro kernel, I am thinking about making allocations try "less hard"
>> on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
>> I was going to try something like this on an upstream kernel when I noticed
>> that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
>> instrumented __alloc_pages_slowpath() and saw it take well over an hour
>> before I 'tricked' it into exiting.
>>
>> [ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
>> [ 5916.249271]   reclaim 5295741  compact 1
> 
> This is unexpected though. What does tries mean? Number of reclaim
> attempts? If yes could you enable tracing to see what takes so long in
> the reclaim path?

tries is the number of times we pass the 'retry:' label in
__alloc_pages_slowpath.  In this specific case, I am pretty sure all that
time is in one call to __alloc_pages_direct_reclaim.  My 'trick' to make this
succeed was to "echo 0 > nr_hugepages" in another shell.

>> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
>> with 8GB total memory.
>>
>> I have not started looking at the direct reclaim code to see exactly where
>> we may be stuck, or trying really hard.  My question is, "Is this expected
>> or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
>> getting 'stuck' in direct reclaim, the documented behavior for huge page
>> allocation is not going to happen.
> 
> Well, our "how hard to try for hugetlb pages" is quite arbitrary. We
> used to retry as long as at least order worth of pages have been
> reclaimed but that didn't make any sense since the lumpy reclaim was
> gone.

Yes, that is what I am seeing in our older distro kernel and I can at least
deal with that.

>   So the semantic has changed to reclaim&compact as long as there is
> some progress. From what I understand above it seems that you are not
> thrashing and calling reclaim again and again but rather one reclaim
> round takes ages.

Correct

> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
> looks like there is something wrong in the reclaim going on.

Ok, I will start digging into that.  Just wanted to make sure before I got
into it too deep.

BTW - This is very easy to reproduce.  Just try to allocate more huge pages
than will fit into memory.  I see this 'reclaim taking forever' behavior on
v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
-- 
Mike Kravetz


Re: [PATCH v2] mm/hugetlb: Don't put_page in lock of hugetlb_lock

2019-05-06 Thread Mike Kravetz
On 5/6/19 7:06 AM, Zhiqiang Liu wrote:
> From: Kai Shen 
> 
> spinlock recursion happened when running the LTP test:
> #!/bin/bash
> ./runltp -p -f hugetlb &
> ./runltp -p -f hugetlb &
> ./runltp -p -f hugetlb &
> ./runltp -p -f hugetlb &
> ./runltp -p -f hugetlb &
> 
> The dtor returned by get_compound_page_dtor in __put_compound_page
> may be the function of free_huge_page which will lock the hugetlb_lock,
> so don't put_page in lock of hugetlb_lock.
> 
>  BUG: spinlock recursion on CPU#0, hugemmap05/1079
>   lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, 
> .owner_cpu: 0
>  Call trace:
>   dump_backtrace+0x0/0x198
>   show_stack+0x24/0x30
>   dump_stack+0xa4/0xcc
>   spin_dump+0x84/0xa8
>   do_raw_spin_lock+0xd0/0x108
>   _raw_spin_lock+0x20/0x30
>   free_huge_page+0x9c/0x260
>   __put_compound_page+0x44/0x50
>   __put_page+0x2c/0x60
>   alloc_surplus_huge_page.constprop.19+0xf0/0x140
>   hugetlb_acct_memory+0x104/0x378
>   hugetlb_reserve_pages+0xe0/0x250
>   hugetlbfs_file_mmap+0xc0/0x140
>   mmap_region+0x3e8/0x5b0
>   do_mmap+0x280/0x460
>   vm_mmap_pgoff+0xf4/0x128
>   ksys_mmap_pgoff+0xb4/0x258
>   __arm64_sys_mmap+0x34/0x48
>   el0_svc_common+0x78/0x130
>   el0_svc_handler+0x38/0x78
>   el0_svc+0x8/0xc
> 
> Fixes: 9980d744a0 ("mm, hugetlb: get rid of surplus page accounting tricks")
> Signed-off-by: Kai Shen 
> Signed-off-by: Feilong Lin 
> Reported-by: Wang Wang 
> Acked-by: Michal Hocko 

Good catch.  Sorry for the late reply.

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz


Re: [Question] Should direct reclaim time be bounded?

2019-07-12 Thread Mike Kravetz
On 7/11/19 10:47 PM, Hillf Danton wrote:
> 
> On Thu, 11 Jul 2019 02:42:56 +0800 Mike Kravetz wrote:
>>
>> It is quite easy to hit the condition where:
>> nr_reclaimed == 0  && nr_scanned == 0 is true, but we skip the previous test
>>
> Then skipping check of __GFP_RETRY_MAYFAIL makes no sense in your case.
> It is restored in respin below.
> 
>> and the compaction check:
>> sc->nr_reclaimed < pages_for_compaction &&
>>  inactive_lru_pages > pages_for_compaction
>> is true, so we return true before the below check of costly_fg_reclaim
>>
> This check is placed after COMPACT_SUCCESS; the latter is used to
> replace sc->nr_reclaimed < pages_for_compaction.
> 
> And dryrun detection is added based on the result of last round of
> shrinking of inactive pages, particularly when their number is large
> enough.
> 

Thanks Hillf.

This change does appear to eliminate the issue with stalls by
should_continue_reclaim returning true too often.  I need to think
some more about exactly what is impacted with the change.

With this change, the problem moves to compaction with should_compact_retry
returning true too often.  It is the same behavior seen when I simply removed
the __GFP_RETRY_MAYFAIL special casing in should_continue_reclaim.

At Mel's suggestion I removed the compaction_zonelist_suitable() call
from should_compact_retry.  This eliminated the compaction stalls.  Thanks
Mel.

With both changes, stalls appear to be eliminated.  This is promising.
I'll try to refine these approaches and continue testing.
-- 
Mike Kravetz

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2571,18 +2571,6 @@ static inline bool should_continue_reclaim(struct 
> pglist_data *pgdat,
>   return false;
>   }
> 
> - /*
> -  * If we have not reclaimed enough pages for compaction and the
> -  * inactive lists are large enough, continue reclaiming
> -  */
> - pages_for_compaction = compact_gap(sc->order);
> - inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> - if (get_nr_swap_pages() > 0)
> - inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> - if (sc->nr_reclaimed < pages_for_compaction &&
> - inactive_lru_pages > pages_for_compaction)
> - return true;
> -
>   /* If compaction would go ahead or the allocation would succeed, stop */
>   for (z = 0; z <= sc->reclaim_idx; z++) {
>   struct zone *zone = &pgdat->node_zones[z];
> @@ -2598,7 +2586,21 @@ static inline bool should_continue_reclaim(struct 
> pglist_data *pgdat,
>   ;
>   }
>   }
> - return true;
> +
> + /*
> +  * If we have not reclaimed enough pages for compaction and the
> +  * inactive lists are large enough, continue reclaiming
> +  */
> + pages_for_compaction = compact_gap(sc->order);
> + inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> + if (get_nr_swap_pages() > 0)
> + inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> + return inactive_lru_pages > pages_for_compaction &&
> + /*
> +  * avoid dryrun with plenty of inactive pages
> +  */
> + nr_scanned && nr_reclaimed;
>  }
> 
>  static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> --
> 


Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-13 Thread Mike Kravetz
On 8/10/19 3:01 PM, Mina Almasry wrote:
> On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz  wrote:
>>
>> On 8/9/19 12:42 PM, Mina Almasry wrote:
>>> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz  
>>> wrote:
>>>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>>>> Problem:
>>>>> Currently tasks attempting to allocate more hugetlb memory than is 
>>>>> available get
>>>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations 
>>>>> [1].
>>>>> However, if a task attempts to allocate hugetlb memory only more than its
>>>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>>>> but will SIGBUS the task when it attempts to fault the memory in.
>> 
>>>> I believe tracking reservations for shared mappings can get quite 
>>>> complicated.
>>>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>>>> that shared mapping reservations are global.  As a result, reservations are
>>>> more associated with the inode than with the task making the reservation.
>>>
>>> FWIW, I found it not too bad. And my tests at least don't detect an
>>> anomaly around shared mappings. The key I think is that I'm tracking
>>> cgroup to uncharge on the file_region entry inside the resv_map, so we
>>> know who allocated each file_region entry exactly and we can uncharge
>>> them when the entry is region_del'd.
>>>
>>>> For example, consider a file of size 4 hugetlb pages.
>>>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>>>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>>>> of the desired semantics here for reservation limits if A and B are in 
>>>> separate
>>>> cgroups.  Should B be charged for 4 or 2 reservations?
>>>
>>> Task A's cgroup is charged 2 pages to its reservation usage.
>>> Task B's cgroup is charged 2 pages to its reservation usage.
>>
>> OK,
>> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
>> allocation.  The mmap would succeed, but Task B could potentially need to
>> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
>> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>>
>> Perhaps reservation charge should always be the same as map size/maximum
>> allocation size?
> 
> I'm thinking this would work similar to how other shared memory like
> tmpfs is accounted for right now. I.e. if a task conducts an operation
> that causes memory to be allocated then that task is charged for that
> memory, and if another task uses memory that has already been
> allocated and charged by another task, then it can use the memory
> without being charged.
> 
> So in case of hugetlb memory, if a task is mmaping memory that causes
> a new reservation to be made, and new entries to be created in the
> resv_map for the shared mapping, then that task gets charged. If the
> task is mmaping memory that is already reserved or faulted, then it
> reserves or faults it without getting charged.
> 
> In the example above, in chronological order:
> - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
> - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
> reservations because the first 2 are charged already and can be used
> without incurring a charge.
> - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
> since none of the 4 pages are faulted in yet. If the task is only
> allowed 2 hugetlb page faults then it will actually get a SIGBUS.
> - Task A accesses 4 hugetlb pages, gets charged no faults, since all
> the hugetlb faults is charged to Task B.
> 
> So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
> I think that's fine because:
> 1. Notice that the SIGBUS is due to the faulting limit, and not the
> reservation limit, so we're not regressing the status quo per say.
> Folks using the fault limit today understand the SIGBUS risk.
> 2. the way I expect folks to use this is to use 'reservation limits'
> to partition the available hugetlb memory on the machine using it and
> forgo using the existing fault limits. Using both at the same time I
> think would be a superuser feature for folks that really know what
> they are doing, and understand the risk of SIGBUS that comes with
> using the existing fault limits.
> 3. I expect userspace to in general handle this correctly because
> there are similar challenges with all shared memory and accounting of
> it, even in tmpfs, I think.

Ok, that helps explain your use case.  I agree that it would be difficult
to use both fault and reservation limits together.  Especially in the case
of shared mappings.
-- 
Mike Kravetz


Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-13 Thread Mike Kravetz
On 8/8/19 4:13 PM, Mina Almasry wrote:
> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> in the resv_map entries, in file_region->reservation_counter.
> 
> When a file_region entry is added to the resv_map via region_add, we
> also charge the appropriate hugetlb_cgroup and put the pointer to that
> in file_region->reservation_counter. This is slightly delicate since we
> need to not modify the resv_map until we know that charging the
> reservation has succeeded. If charging doesn't succeed, we report the
> error to the caller, so that the kernel fails the reservation.

I wish we did not need to modify these region_() routines as they are
already difficult to understand.  However, I see no other way with the
desired semantics.

> On region_del, which is when the hugetlb memory is unreserved, we delete
> the file_region entry in the resv_map, but also uncharge the
> file_region->reservation_counter.
> 
> ---
>  mm/hugetlb.c | 208 +--
>  1 file changed, 170 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 235996aef6618..d76e3137110ab 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -242,8 +242,72 @@ struct file_region {
>   struct list_head link;
>   long from;
>   long to;
> +#ifdef CONFIG_CGROUP_HUGETLB
> + /*
> +  * On shared mappings, each reserved region appears as a struct
> +  * file_region in resv_map. These fields hold the info needed to
> +  * uncharge each reservation.
> +  */
> + struct page_counter *reservation_counter;
> + unsigned long pages_per_hpage;
> +#endif
>  };
> 
> +/* Must be called with resv->lock held. Calling this with dry_run == true 
> will
> + * count the number of pages added but will not modify the linked list.
> + */
> +static long consume_regions_we_overlap_with(struct file_region *rg,
> + struct list_head *head, long f, long *t,
> + struct hugetlb_cgroup *h_cg,
> + struct hstate *h,
> + bool dry_run)
> +{
> + long add = 0;
> + struct file_region *trg = NULL, *nrg = NULL;
> +
> + /* Consume any regions we now overlap with. */
> + nrg = rg;
> + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> + if (&rg->link == head)
> + break;
> + if (rg->from > *t)
> + break;
> +
> + /* If this area reaches higher then extend our area to
> +  * include it completely.  If this is not the first area
> +  * which we intend to reuse, free it.
> +  */
> + if (rg->to > *t)
> + *t = rg->to;
> + if (rg != nrg) {
> + /* Decrement return value by the deleted range.
> +  * Another range will span this area so that by
> +  * end of routine add will be >= zero
> +  */
> + add -= (rg->to - rg->from);
> + if (!dry_run) {
> + list_del(&rg->link);
> + kfree(rg);

Is it possible that the region struct we are deleting pointed to
a reservation_counter?  Perhaps even for another cgroup?
I am just concerned that, given the way regions are coalesced, we may be
deleting counters.

-- 
Mike Kravetz


Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-14 Thread Mike Kravetz
On 8/13/19 4:54 PM, Mike Kravetz wrote:
> On 8/8/19 4:13 PM, Mina Almasry wrote:
>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
>> in the resv_map entries, in file_region->reservation_counter.
>>
>> When a file_region entry is added to the resv_map via region_add, we
>> also charge the appropriate hugetlb_cgroup and put the pointer to that
>> in file_region->reservation_counter. This is slightly delicate since we
>> need to not modify the resv_map until we know that charging the
>> reservation has succeeded. If charging doesn't succeed, we report the
>> error to the caller, so that the kernel fails the reservation.
> 
> I wish we did not need to modify these region_() routines as they are
> already difficult to understand.  However, I see no other way with the
> desired semantics.
> 

I suspect you have considered this, but what about using the return value
from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
There is a VERY SMALL race where the value could be too large, but that
can be checked and adjusted at region_add time as is done with normal
accounting today.  If the question is where we would store the information
to uncharge, then we can hang a structure off the vma.  This would be
similar to what is done for private mappings.  In fact, I would suggest
making them both use a new cgroup reserve structure hanging off the vma.

One issue I see is what to do if a vma is split?  The private mapping case
'should' handle this today, but I would not be surprised if such code is
missing or incorrect.
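
For illustration, one possible shape for such a per-vma structure (a sketch
with hypothetical names, not from any posted patch):

/*
 * Hypothetical per-vma info for uncharging reservation limits, usable by
 * both shared and private mappings.
 */
struct hugetlb_cgroup_resv_info {
	struct hugetlb_cgroup *h_cg;	/* cgroup charged at reserve time */
	long nr_reserved;		/* pages charged via region_chg() */
};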

-- 
Mike Kravetz


Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mike Kravetz
(+CC  Michal Koutný, cgro...@vger.kernel.org, Aneesh Kumar)

On 8/8/19 4:13 PM, Mina Almasry wrote:
> Problem:
> Currently tasks attempting to allocate more hugetlb memory than is available 
> get
> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> However, if a task attempts to allocate hugetlb memory only more than its
> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> but will SIGBUS the task when it attempts to fault the memory in.
> 
> We have developers interested in using hugetlb_cgroups, and they have 
> expressed
> dissatisfaction regarding this behavior. We'd like to improve this
> behavior such that tasks violating the hugetlb_cgroup limits get an error on
> mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> the excess memory in.
> 
> The underlying problem is that today's hugetlb_cgroup accounting happens
> at hugetlb memory *fault* time, rather than at *reservation* time.
> Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> the offending task gets SIGBUS'd.
> 
> Proposed Solution:
> A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
> counter has slightly different semantics than
> hugetlb.xMB.[limit|usage]_in_bytes:
> 
> - While usage_in_bytes tracks all *faulted* hugetlb memory,
> reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> 
> - If a task attempts to reserve more memory than limit_in_bytes allows,
> the kernel will allow it to do so. But if a task attempts to reserve
> more memory than reservation_limit_in_bytes, the kernel will fail this
> reservation.
> 
> This proposal is implemented in this patch, with tests to verify
> functionality and show the usage.

Thanks for taking on this effort Mina.

Before looking at the details of the code, it might be helpful to discuss
the expected semantics of the proposed reservation limits.

I see you took into account the differences between private and shared
mappings.  This is good, as the reservation behavior is different for each
of these cases.  First let's look at private mappings.

For private mappings, the reservation usage will be the size of the mapping.
This should be fairly simple.  As reservations are consumed in the hugetlbfs
code, reservations in the resv_map are removed.  I see you have a hook into
region_del.  So, the expectation is that as reservations are consumed the
reservation usage will drop for the cgroup.  Correct?
The only tricky thing about private mappings is COW because of fork.  Current
reservation semantics specify that all reservations stay with the parent.
If child faults and can not get page, SIGBUS.  I assume the new reservation
limits will work the same.

I believe tracking reservations for shared mappings can get quite complicated.
The hugetlbfs reservation code around shared mappings 'works' on the basis
that shared mapping reservations are global.  As a result, reservations are
more associated with the inode than with the task making the reservation.
For example, consider a file of size 4 hugetlb pages.
Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
all 4 pages, and 2 additional reservations are taken.  I am not really sure
of the desired semantics here for reservation limits if A and B are in separate
cgroups.  Should B be charged for 4 or 2 reservations?
Also in the example above, after both tasks create their mappings suppose
Task B faults in the first page.  Does the reservation usage of Task A go
down as it originally had the reservation?

It should also be noted that when hugetlbfs reservations are 'consumed' for
shared mappings there are no changes to the resv_map.  Rather the unmap code
compares the contents of the page cache to the resv_map to determine how
many reservations were actually consumed.  I did not look closely enough to
determine whether the code drops reservation usage counts as pages are added
to shared mappings.

-- 
Mike Kravetz


Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mike Kravetz
On 8/9/19 11:05 AM, Mina Almasry wrote:
> On Fri, Aug 9, 2019 at 4:27 AM Michal Koutný  wrote:
>>> Alternatives considered:
>>> [...]
>> (I did not try that but) have you considered:
>> 3) MAP_POPULATE while you're making the reservation,
> 
> I have tried this, and the behaviour is not great. Basically if
> userspace mmaps more memory than its cgroup limit allows with
> MAP_POPULATE, the kernel will reserve the total amount requested by
> the userspace, it will fault in up to the cgroup limit, and then it
> will SIGBUS the task when it tries to access the rest of its
> 'reserved' memory.
> 
> So for example:
> - if /proc/sys/vm/nr_hugepages == 10, and
> - your cgroup limit is 5 pages, and
> - you mmap(MAP_POPULATE) 7 pages.
> 
> Then the kernel will reserve 7 pages, and will fault in 5 of those 7
> pages, and will SIGBUS you when you try to access the remaining 2
> pages. So the problem persists. Folks would still like to know they
> are crossing the limits on mmap time.

If you got the failure at mmap time in the MAP_POPULATE case would this
be useful?

Just thinking that would be a relatively simple change.
-- 
Mike Kravetz


Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mike Kravetz
On 8/9/19 1:57 PM, Mina Almasry wrote:
> On Fri, Aug 9, 2019 at 1:39 PM Mike Kravetz  wrote:
>>
>> On 8/9/19 11:05 AM, Mina Almasry wrote:
>>> On Fri, Aug 9, 2019 at 4:27 AM Michal Koutný  wrote:
>>>>> Alternatives considered:
>>>>> [...]
>>>> (I did not try that but) have you considered:
>>>> 3) MAP_POPULATE while you're making the reservation,
>>>
>>> I have tried this, and the behaviour is not great. Basically if
>>> userspace mmaps more memory than its cgroup limit allows with
>>> MAP_POPULATE, the kernel will reserve the total amount requested by
>>> the userspace, it will fault in up to the cgroup limit, and then it
>>> will SIGBUS the task when it tries to access the rest of its
>>> 'reserved' memory.
>>>
>>> So for example:
>>> - if /proc/sys/vm/nr_hugepages == 10, and
>>> - your cgroup limit is 5 pages, and
>>> - you mmap(MAP_POPULATE) 7 pages.
>>>
>>> Then the kernel will reserve 7 pages, and will fault in 5 of those 7
>>> pages, and will SIGBUS you when you try to access the remaining 2
>>> pages. So the problem persists. Folks would still like to know they
>>> are crossing the limits on mmap time.
>>
>> If you got the failure at mmap time in the MAP_POPULATE case would this
>> be useful?
>>
>> Just thinking that would be a relatively simple change.
> 
> Not quite, unfortunately. A subset of the folks that want to use
> hugetlb memory, don't want to use MAP_POPULATE (IIRC, something about
> mmaping a huge amount of hugetlb memory at their jobs' startup, and
> doing that with MAP_POPULATE adds so much to their startup time that
> it is prohibitively expensive - but that's just what I vaguely recall
> offhand. I can get you the details if you're interested).

Yes, MAP_POPULATE can get expensive as you will need to zero all those
huge pages.

-- 
Mike Kravetz


Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-10 Thread Mike Kravetz
On 8/9/19 12:42 PM, Mina Almasry wrote:
> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz  wrote:
>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>> Problem:
>>> Currently tasks attempting to allocate more hugetlb memory than is 
>>> available get
>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
>>> However, if a task attempts to allocate hugetlb memory only more than its
>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>> but will SIGBUS the task when it attempts to fault the memory in.

>> I believe tracking reservations for shared mappings can get quite 
>> complicated.
>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>> that shared mapping reservations are global.  As a result, reservations are
>> more associated with the inode than with the task making the reservation.
> 
> FWIW, I found it not too bad. And my tests at least don't detect an
> anomaly around shared mappings. The key I think is that I'm tracking
> cgroup to uncharge on the file_region entry inside the resv_map, so we
> know who allocated each file_region entry exactly and we can uncharge
> them when the entry is region_del'd.
> 
>> For example, consider a file of size 4 hugetlb pages.
>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>> of the desired semantics here for reservation limits if A and B are in 
>> separate
>> cgroups.  Should B be charged for 4 or 2 reservations?
> 
> Task A's cgroup is charged 2 pages to its reservation usage.
> Task B's cgroup is charged 2 pages to its reservation usage.

OK,
Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
allocation.  The mmap would succeed, but Task B could potentially need to
allocate more than 2 huge pages.  So, when faulting in more than 2 huge
pages B would get a SIGBUS.  Correct?  Or, am I missing something?

Perhaps reservation charge should always be the same as map size/maximum
allocation size?
-- 
Mike Kravetz


Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders

2019-07-31 Thread Mike Kravetz
On 7/31/19 5:06 AM, Vlastimil Babka wrote:
> On 7/24/19 7:50 PM, Mike Kravetz wrote:
>> For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
>> minimum (highest priority).  Other places in the compaction code key off
>> of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
>> MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
>> costly order allocations.
>>
>> This was observed when hugetlb allocations could stall for minutes or
>> hours when should_compact_retry() would return true more often then it
>> should.  Specifically, this was in the case where compact_result was
>> COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
>> made.
> 
> Hmm, the point of MIN_COMPACT_COSTLY_PRIORITY was that costly
> allocations will not reach the priority where compaction becomes too
> expensive. With your patch, they still don't reach that priority value,
> but are allowed to be thorough anyway, even sooner. That just seems like
> a wrong way to fix the problem.

Thanks Vlastimil, here is why I took the approach I did.

I instrumented some of the long stalls.  Here is one common example:
should_compact_retry returned true 5000000 consecutive times.  However,
the variable compaction_retries is zero.  We never get to the code that
increments the compaction_retries count because compaction_made_progress
is false and compaction_withdrawn is true.  As suggested earlier, I noted
why compaction_withdrawn is true.  Of the 5000000 calls,
4921875 were COMPACT_DEFERRED
78125 were COMPACT_PARTIAL_SKIPPED
Note that 5000000/64 (1 << COMPACT_MAX_DEFER_SHIFT == 64) == 78125

I then started looking into why COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED
were being set/returned so often.
COMPACT_DEFERRED is set/returned in try_to_compact_pages.  Specifically,
if (prio > MIN_COMPACT_PRIORITY
&& compaction_deferred(zone, order)) {
rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
}
COMPACT_PARTIAL_SKIPPED is set/returned in __compact_finished. Specifically,
if (compact_scanners_met(cc)) {
/* Let the next compaction start anew. */
reset_cached_positions(cc->zone);

/* ... */

if (cc->direct_compaction)
cc->zone->compact_blockskip_flush = true;

if (cc->whole_zone)
return COMPACT_COMPLETE;
else
return COMPACT_PARTIAL_SKIPPED;
}

In both cases, compact_priority being MIN_COMPACT_COSTLY_PRIORITY and not
being able to go to MIN_COMPACT_PRIORITY caused the 'compaction_withdrawn'
result to be set/returned.

I do not know the subtleties of the compaction code, but it seems like
retrying in this manner does not make sense.

> If should_compact_retry() returns
> misleading results for costly allocations, then that should be fixed
> instead?
> 
> Alternatively, you might want to say that hugetlb allocations are not
> like other random costly allocations, because the admin setting
> nr_hugepages is prepared to take the cost (I thought that was indicated
> by the __GFP_RETRY_MAYFAIL flag, but seeing all the other users of it,
> I'm not sure anymore).

The example above resulted in a stall of a little over 5 minutes.  However,
I have seen them last for hours.  Sure, the caller (admin for hugetlbfs)
knows there may be high costs.  But, I think minutes/hours to try and allocate
a single huge page is too much.  We should fail sooner than that.

>In that case should_compact_retry() could take
> __GFP_RETRY_MAYFAIL into account and allow MIN_COMPACT_PRIORITY even for
> costly allocations.

I'll put something like this together to test.
-- 
Mike Kravetz


Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection

2019-07-31 Thread Mike Kravetz
On 7/31/19 4:08 AM, Vlastimil Babka wrote:
> 
> I agree this is an improvement overall, but perhaps the patch does too
> many things at once. The reshuffle is one thing and makes sense. The
> change of the last return condition could perhaps be separate. Also
> AFAICS the ultimate result is that when nr_reclaimed == 0, the function
> will now always return false. Which makes the initial test for
> __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
> longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
> yields no reclaimed page, we abort.

Can someone help me understand why nr_scanned == 0 guarantees a full
LRU scan?  FWICS, nr_scanned used in this context is only incremented
in shrink_page_list and potentially shrink_zones.  In the stall case I
am looking at, there are MANY cases in which nr_scanned is only a few
pages and none of those are reclaimed.

Can we not get nr_scanned == 0 on an arbitrary chunk of the LRU?

I must be missing something, because I do not see how nr_scanned == 0
guarantees a full scan.
-- 
Mike Kravetz


Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail

2019-07-31 Thread Mike Kravetz
On 7/31/19 6:23 AM, Vlastimil Babka wrote:
> On 7/25/19 7:15 PM, Mike Kravetz wrote:
>> On 7/25/19 1:13 AM, Mel Gorman wrote:
>>> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>>>
>>> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
>>> *but* in the event of an allocation failure this bug can silently recur.
>>> An informational message might be justified in that case in case the
>>> stall should recur with no hint as to why.
>>
>> Right.
>> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
>> If we can't allocate a node mask, it is unlikely we will be able to allocate
>> a/any huge pages.  And, the system must be extremely low on memory and there
>> are likely other bigger issues.
> 
> Agreed. But I would perhaps drop __GFP_NORETRY from the mask allocation
> as that can fail for transient conditions.

Thanks, I was unsure if adding __GFP_NORETRY would be a good idea.

-- 
Mike Kravetz


Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders

2019-08-01 Thread Mike Kravetz
On 8/1/19 6:01 AM, Vlastimil Babka wrote:
> Could you try testing the patch below instead? It should hopefully
> eliminate the stalls. If it makes hugepage allocation give up too early,
> we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
> MIN_COMPACT_PRIORITY priority. Thanks!

Thanks.  This patch does eliminate the stalls I was seeing.

In my testing, there is little difference in how many hugetlb pages are
allocated.  It does not appear to be giving up/failing too early.  But,
this is only with __GFP_RETRY_MAYFAIL.  The real concern would be with THP
requests.  Any suggestions on how to test that?

-- 
Mike Kravetz

> 8<
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..b8bfe8d5d2e9 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -129,11 +129,7 @@ static inline bool compaction_failed(enum compact_result 
> result)
>   return false;
>  }
>  
> -/*
> - * Compaction  has backed off for some reason. It might be throttling or
> - * lock contention. Retrying is still worthwhile.
> - */
> -static inline bool compaction_withdrawn(enum compact_result result)
> +static inline bool compaction_needs_reclaim(enum compact_result result)
>  {
>   /*
>* Compaction backed off due to watermark checks for order-0
> @@ -142,6 +138,15 @@ static inline bool compaction_withdrawn(enum 
> compact_result result)
>   if (result == COMPACT_SKIPPED)
>   return true;
>  
> + return false;
> +}
> +
> +/*
> + * Compaction  has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
>   /*
>* If compaction is deferred for high-order allocations, it is
>* because sync compaction recently failed. If this is the case
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 272c6de1bf4e..3dfce1f79112 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3965,6 +3965,11 @@ should_compact_retry(struct alloc_context *ac, int 
> order, int alloc_flags,
>   if (compaction_failed(compact_result))
>   goto check_priority;
>  
> + if (compaction_needs_reclaim(compact_result)) {
> + ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> + goto out;
> + }
> +
>   /*
>* make sure the compaction wasn't deferred or didn't bail out early
>* due to locks contention before we declare that we should give up.
> @@ -3972,8 +3977,7 @@ should_compact_retry(struct alloc_context *ac, int 
> order, int alloc_flags,
>* compaction.
>*/
>   if (compaction_withdrawn(compact_result)) {
> - ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> - goto out;
> + goto check_priority;
>   }
>  
>   /*
> 


Re: [PATCH] mm/hugetlb.c: check the failure case for find_vma

2019-07-25 Thread Mike Kravetz
On 7/24/19 6:39 PM, Navid Emamdoost wrote:
> find_vma may fail and return NULL. The null check is added.
> 
> Signed-off-by: Navid Emamdoost 
> ---
>  mm/hugetlb.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ede7e7f5d1ab..9c5e8b7a6476 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4743,6 +4743,9 @@ void adjust_range_if_pmd_sharing_possible(struct 
> vm_area_struct *vma,
>  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  {
>   struct vm_area_struct *vma = find_vma(mm, addr);
> + if (!vma)
> + return (pte_t *)pmd_alloc(mm, pud, addr);
> +

Hello Navid,

You should not mix declarations and code like this.  I am surprised that your
compiler did not issue a warning such as:

mm/hugetlb.c: In function ‘huge_pmd_share’:
mm/hugetlb.c:4815:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
  struct address_space *mapping = vma->vm_file->f_mapping;
  ^~

While it is true that the routine find_vma can return NULL, I do not
believe it is possible here within the context of huge_pmd_share.  Why?

huge_pmd_share is called from huge_pte_alloc to allocate a page table
entry for a huge page.  So, the calling code is attempting to populate
page tables.  There are three callers of huge_pte_alloc: hugetlb_fault,
copy_hugetlb_page_range and __mcopy_atomic_hugetlb.  In each of these
routines (or their callers) it has been verified that address is within
a vma.  In addition, mmap_sem is held so that vmas can not change.
Therefore, there should be no way for find_vma to return NULL here.

Please let me know if there is something I have overlooked.  Otherwise,
there is no need for such a modification.
-- 
Mike Kravetz

>   struct address_space *mapping = vma->vm_file->f_mapping;
>   pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) +
>   vma->vm_pgoff;
> 


<    4   5   6   7   8   9   10   11   12   13   >