Re: [PATCH] mm, hugetlb: use memory policy when available

2015-11-05 Thread Vlastimil Babka
On 10/20/2015 09:53 PM, Dave Hansen wrote:
> From: Dave Hansen 
> 
> I have a hugetlbfs user which is never explicitly allocating huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was not doing them any good and
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
>> struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
>> page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>> if (!page) {
>> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> 
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> _ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5,
> 8 years ago.
> 
> The fix is to pass vma/addr down in to the places where we actually call in
> to the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.
> 
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: Mike Kravetz 
> Cc: Hillf Danton 
> Cc: David Rientjes 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Dave Hansen 

Together with the fix and NUMA=n cleanup

Acked=by: Vlastimil Babka 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: use memory policy when available

2015-11-03 Thread Sasha Levin
On 10/22/2015 05:42 PM, Dave Hansen wrote:
> On 10/22/2015 02:39 PM, Sasha Levin wrote:
>> > Trinity seems to be able to hit the newly added warnings pretty easily:
> Kirill reported the same thing.  Is it fixed with this applied?
> 
>> > http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-use-memory-policy-when-available-fix.patch

Yup, that works for me.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: use memory policy when available

2015-10-22 Thread Dave Hansen
On 10/22/2015 02:39 PM, Sasha Levin wrote:
> Trinity seems to be able to hit the newly added warnings pretty easily:

Kirill reported the same thing.  Is it fixed with this applied?

> http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-use-memory-policy-when-available-fix.patch


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: use memory policy when available

2015-10-22 Thread Sasha Levin
On 10/20/2015 03:53 PM, Dave Hansen wrote:
> From: Dave Hansen 
> 
> I have a hugetlbfs user which is never explicitly allocating huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was not doing them any good and
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
>> > struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
>> > page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
>> > if (!page) {
>> > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> _ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5,
> 8 years ago.
> 
> The fix is to pass vma/addr down in to the places where we actually call in
> to the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.
> 
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: Mike Kravetz 
> Cc: Hillf Danton 
> Cc: David Rientjes 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Dave Hansen 

Hey Dave,

Trinity seems to be able to hit the newly added warnings pretty easily:

[  339.282065] WARNING: CPU: 4 PID: 10181 at mm/hugetlb.c:1520 
__alloc_buddy_huge_page+0xff/0xa80()
[  339.360228] Modules linked in:
[  339.360838] CPU: 4 PID: 10181 Comm: trinity-c291 Not tainted 
4.3.0-rc6-next-20151022-sasha-00040-g5ecc711-dirty #2608
[  339.362629]  88015e59c000 e6475701 88015e61f9a0 
9dd3ef48
[  339.363896]   88015e61f9e0 9c32d1ca 
9c7175bf
[  339.365167]  abddc0c8 88015e61faf0  

[  339.366387] Call Trace:
[  339.366831]  [] dump_stack+0x4e/0x86
[  339.367648]  [] warn_slowpath_common+0xfa/0x120
[  339.368635]  [] ? __alloc_buddy_huge_page+0xff/0xa80
[  339.369631]  [] warn_slowpath_null+0x1a/0x20
[  339.370574]  [] __alloc_buddy_huge_page+0xff/0xa80
[  339.371551]  [] ? return_unused_surplus_pages+0x120/0x120
[  339.372698]  [] ? debug_smp_processor_id+0x17/0x20
[  339.373683]  [] ? get_lock_stats+0x1b/0x80
[  339.374551]  [] ? 
__raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  339.375744]  [] ? do_raw_spin_unlock+0x1d0/0x1e0
[  339.376728]  [] hugetlb_acct_memory+0x193/0x990
[  339.377663]  [] ? dequeue_huge_page_node+0x260/0x260
[  339.378658]  [] ? trace_hardirqs_on_caller+0x540/0x5e0
[  339.379671]  [] hugetlb_reserve_pages+0x229/0x330
[  339.380738]  [] hugetlb_file_setup+0x54b/0x810
[  339.381689]  [] ? hugetlbfs_fallocate+0x9e0/0x9e0
[  339.382653]  [] ? scnprintf+0x100/0x100
[  339.383526]  [] newseg+0x49f/0xa70
[  339.384371]  [] ? debug_smp_processor_id+0x17/0x20
[  339.385345]  [] ? shm_try_destroy_orphaned+0x190/0x190
[  339.386365]  [] ? ipcget+0x60/0x510
[  339.387139]  [] ipcget+0x8f/0x510
[  339.387902]  [] ? do_audit_syscall_entry+0x2b0/0x2b0
[  339.388931]  [] SyS_shmget+0x11a/0x160
[  339.389737]  [] ? is_file_shm_hugepages+0x40/0x40
[  339.393268]  [] ? syscall_trace_enter_phase2+0x462/0x5f0
[  339.395643]  [] tracesys_phase2+0x88/0x8d


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: use memory policy when available

2015-10-21 Thread Kirill A. Shutemov
On Tue, Oct 20, 2015 at 12:53:17PM -0700, Dave Hansen wrote:
> @@ -1445,6 +1514,10 @@ static struct page *alloc_buddy_huge_pag
>   if (hstate_is_gigantic(h))
>   return NULL;
>  
> + if (vma || addr) {
> + WARN_ON_ONCE(!addr || addr == -1);

Trinity triggered the WARN for me:

[  118.647212] WARNING: CPU: 10 PID: 9621 at 
/home/kas/linux/mm/mm/hugetlb.c:1514 __alloc_buddy_huge_page+0x2c8/0x300()
[  118.648698] Modules linked in:
[  118.649105] CPU: 10 PID: 9621 Comm: trinity-c147 Not tainted 4.2.0-dirty #651
[  118.649909] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
Debian-1.8.2-1 04/01/2014
[  118.650929]  81ca6ad8 88081f7f3c68 818a9977 
8001
[  118.651889]   88081f7f3ca8 810574d6 
88081f7f3c98
[  118.652965]   830a87e0  

[  118.653988] Call Trace:
[  118.654315]  [] dump_stack+0x4f/0x7b
[  118.654936]  [] warn_slowpath_common+0x86/0xc0
[  118.655630]  [] warn_slowpath_null+0x1a/0x20
[  118.656427]  [] __alloc_buddy_huge_page+0x2c8/0x300
[  118.657185]  [] hugetlb_acct_memory+0xa1/0x3d0
[  118.657897]  [] ? region_chg+0x1f1/0x200
[  118.658559]  [] hugetlb_reserve_pages+0x92/0x250
[  118.659289]  [] hugetlb_file_setup+0x14c/0x320
[  118.659994]  [] newseg+0x135/0x370
[  118.660713]  [] ? ipcget+0x44/0x2d0
[  118.661306]  [] ipcget+0x270/0x2d0
[  118.661911]  [] SyS_shmget+0x45/0x50
[  118.663409]  [] tracesys_phase2+0x84/0x89
[  118.664199] ---[ end trace d2829191292b44ef ]---


> + WARN_ON_ONCE(nid != NUMA_NO_NODE);
> + }
>   /*
>* Assume we will successfully allocate the surplus page to
>* prevent racing processes from causing the surplus to exceed
-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, hugetlb: use memory policy when available

2015-10-20 Thread Andrew Morton
On Tue, 20 Oct 2015 12:53:17 -0700 Dave Hansen  wrote:

> 
> From: Dave Hansen 
> 
> I have a hugetlbfs user which is never explicitly allocating huge pages
> with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
> the pages be allocated from the buddy allocator at fault time.
> 
> This works, but they noticed that mbind() was not doing them any good and
> the pages were being allocated without respect for the policy they
> specified.
> 
> The code in question is this:
> 
> > struct page *alloc_huge_page(struct vm_area_struct *vma,
> ...
> > page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
> > if (!page) {
> > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
> 
> dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
> But, it only grabs _existing_ huge pages from the huge page pool.  If the
> pool is empty, we fall back to alloc_buddy_huge_page() which obviously
> can't do anything with the VMA's policy because it isn't even passed the
> VMA.
> 
> Almost everybody preallocates huge pages.  That's probably why nobody has
> ever noticed this.  Looking back at the git history, I don't think this
> _ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5,
> 8 years ago.
> 
> The fix is to pass vma/addr down in to the places where we actually call in
> to the buddy allocator.  It's fairly straightforward plumbing.  This has
> been lightly tested.

huh.  Fair enough.

>  b/mm/hugetlb.c |  111 
> ++---

Is it worth deporking this for the CONFIG_NUMA=n case?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm, hugetlb: use memory policy when available

2015-10-20 Thread Dave Hansen

From: Dave Hansen 

I have a hugetlbfs user which is never explicitly allocating huge pages
with 'nr_hugepages'.  They only set 'nr_overcommit_hugepages' and then let
the pages be allocated from the buddy allocator at fault time.

This works, but they noticed that mbind() was not doing them any good and
the pages were being allocated without respect for the policy they
specified.

The code in question is this:

> struct page *alloc_huge_page(struct vm_area_struct *vma,
...
> page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
> if (!page) {
> page = alloc_buddy_huge_page(h, NUMA_NO_NODE);

dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
But, it only grabs _existing_ huge pages from the huge page pool.  If the
pool is empty, we fall back to alloc_buddy_huge_page() which obviously
can't do anything with the VMA's policy because it isn't even passed the
VMA.

Almost everybody preallocates huge pages.  That's probably why nobody has
ever noticed this.  Looking back at the git history, I don't think this
_ever_ worked from when alloc_buddy_huge_page() was introduced in 7893d1d5,
8 years ago.

The fix is to pass vma/addr down in to the places where we actually call in
to the buddy allocator.  It's fairly straightforward plumbing.  This has
been lightly tested.

Cc: Andrew Morton 
Cc: Naoya Horiguchi 
Cc: Mike Kravetz 
Cc: Hillf Danton 
Cc: David Rientjes 
Cc: linux...@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Dave Hansen 
---

 b/mm/hugetlb.c |  111 ++---
 1 file changed, 99 insertions(+), 12 deletions(-)

diff -puN mm/hugetlb.c~hugetlbfs-fix-mbind-when-demand-allocating mm/hugetlb.c
--- a/mm/hugetlb.c~hugetlbfs-fix-mbind-when-demand-allocating   2015-10-20 
12:50:55.598628613 -0700
+++ b/mm/hugetlb.c  2015-10-20 12:50:55.605628929 -0700
@@ -1437,7 +1437,76 @@ void dissolve_free_huge_pages(unsigned l
dissolve_free_huge_page(pfn_to_page(pfn));
 }
 
-static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
+/*
+ * There are 3 ways this can get called:
+ * 1. With vma+addr: we use the VMA's memory policy
+ * 2. With !vma, but nid=NUMA_NO_NODE:  We try to allocate a huge
+ *page from any node, and let the buddy allocator itself figure
+ *it out.
+ * 3. With !vma, but nid!=NUMA_NO_NODE.  We allocate a huge page
+ *strictly from 'nid'
+ */
+static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
+   struct vm_area_struct *vma, unsigned long addr, int nid)
+{
+   int order = huge_page_order(h);
+   gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+   unsigned int cpuset_mems_cookie;
+
+   /*
+* We need a VMA to get a memory policy.  If we do not
+* have one, we use the 'nid' argument
+*/
+   if (!vma) {
+   /*
+* If a specific node is requested, make sure to
+* get memory from there, but only when a node
+* is explicitly specified.
+*/
+   if (nid != NUMA_NO_NODE)
+   gfp |= __GFP_THISNODE;
+   /*
+* Make sure to call something that can handle
+* nid=NUMA_NO_NODE
+*/
+   return alloc_pages_node(nid, gfp, order);
+   }
+
+   /*
+* OK, so we have a VMA.  Fetch the mempolicy and try to
+* allocate a huge page with it.
+*/
+   do {
+   struct page *page;
+   struct mempolicy *mpol;
+   struct zonelist *zl;
+   nodemask_t *nodemask;
+
+   cpuset_mems_cookie = read_mems_allowed_begin();
+   zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
+   mpol_cond_put(mpol);
+   page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
+   if (page)
+   return page;
+   } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+   return NULL;
+}
+
+/*
+ * There are two ways to allocate a huge page:
+ * 1. When you have a VMA and an address (like a fault)
+ * 2. When you have no VMA (like when setting /proc/.../nr_hugepages)
+ *
+ * 'vma' and 'addr' are only for (1).  'nid' is always NUMA_NO_NODE in
+ * this case which signifies that the allocation should be done with
+ * respect for the VMA's memory policy.
+ *
+ * For (2), we ignore 'vma' and 'addr' and use 'nid' exclusively. This
+ * implies that memory policies will not be taken in to account.
+ */
+static struct page *__alloc_buddy_huge_page(struct hstate *h,
+   struct vm_area_struct *vma, unsigned long addr, int nid)
 {
struct page *page;
unsigned int r_nid;
@@ -1445,6 +1514,10 @@ static struct page *alloc_buddy_huge_pag
if (hstate_is_gigantic(h))
return NULL;
 
+   if (vma || addr) {
+   WAR