Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

2019-04-15 Thread Yang Shi




On 4/15/19 3:14 PM, Dave Hansen wrote:

> On 4/15/19 3:10 PM, Yang Shi wrote:
>>> Also, I don't see anything in the code tying this to strictly demote
>>> from DRAM to PMEM.  Is that the end effect, or is it really implemented
>>> that way and I missed it?
>>
>> No, it is not restricted to PMEM. It just tries to demote from the
>> "preferred node" (also called the compute node) to a memory-only node.
>> On hardware with PMEM, PMEM would be the memory-only node.
>
> If that's the case, your patch subject is pretty criminal. :)


Aha, s/PMEM/cpuless would sound guiltless.




Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

2019-04-15 Thread Dave Hansen
On 4/15/19 3:10 PM, Yang Shi wrote:
>> Also, I don't see anything in the code tying this to strictly demote
>> from DRAM to PMEM.  Is that the end effect, or is it really implemented
>> that way and I missed it?
> 
> No, it is not restricted to PMEM. It just tries to demote from the
> "preferred node" (also called the compute node) to a memory-only node.
> On hardware with PMEM, PMEM would be the memory-only node.

If that's the case, your patch subject is pretty criminal. :)


Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

2019-04-15 Thread Yang Shi




On 4/11/19 7:31 AM, Dave Hansen wrote:

> On 4/10/19 8:56 PM, Yang Shi wrote:
>>  include/linux/gfp.h            |  12
>>  include/linux/migrate.h        |   1 +
>>  include/trace/events/migrate.h |   3 +-
>>  mm/debug.c                     |   1 +
>>  mm/internal.h                  |  13 +
>>  mm/migrate.c                   |  15 -
>>  mm/vmscan.c                    | 127 +++--
>>  7 files changed, 149 insertions(+), 23 deletions(-)
>
> Yikes, that's a lot of code.
>
> And it only handles anonymous pages?


Yes, for the time being. But it is easy to extend to all kinds of pages.



> Also, I don't see anything in the code tying this to strictly demote
> from DRAM to PMEM.  Is that the end effect, or is it really implemented
> that way and I missed it?


No, it is not restricted to PMEM. It just tries to demote from the
"preferred node" (also called the compute node) to a memory-only node.
On hardware with PMEM, PMEM would be the memory-only node.


Thanks,
Yang




Re: [v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

2019-04-11 Thread Dave Hansen
On 4/10/19 8:56 PM, Yang Shi wrote:
>  include/linux/gfp.h            |  12
>  include/linux/migrate.h        |   1 +
>  include/trace/events/migrate.h |   3 +-
>  mm/debug.c                     |   1 +
>  mm/internal.h                  |  13 +
>  mm/migrate.c                   |  15 -
>  mm/vmscan.c                    | 127 +++--
>  7 files changed, 149 insertions(+), 23 deletions(-)

Yikes, that's a lot of code.

And it only handles anonymous pages?

Also, I don't see anything in the code tying this to strictly demote
from DRAM to PMEM.  Is that the end effect, or is it really implemented
that way and I missed it?


[v2 PATCH 5/9] mm: vmscan: demote anon DRAM pages to PMEM node

2019-04-10 Thread Yang Shi
Since PMEM provides larger capacity than DRAM and much lower access
latency than disk, it is a good choice for a middle tier between DRAM
and disk in the page reclaim path.

With PMEM nodes, the demotion path of anonymous pages could be:

DRAM -> PMEM -> swap device

This patch demotes only anonymous pages for the time being, and demotes
THP to PMEM as a whole.  To avoid expensive page reclaim and/or
compaction on the PMEM node when it is under memory pressure, the most
conservative gfp flags are used: the allocation fails quickly under
memory pressure and just wakes up kswapd on failure.  If the THP
allocation fails, migrate_pages() splits the THP and migrates it one
base page at a time.

Demote pages to the closest non-DRAM node even if the system is
swapless.  The current page reclaim logic only scans the anon LRU when
swap is on and swappiness is set properly.  Demoting to PMEM doesn't
need to care whether swap is available or not.  However, reclaiming
from PMEM still skips the anon LRU if swap is not available.
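The swap-independence rule above can be written as a small decision function. This is a hypothetical helper (`anon_reclaim_action()` and `enum anon_action` are not in the patch) that assumes "DRAM node with a demotion target" and "swap usable" are already known booleans; in vmscan.c the swap check is derived from swap availability and swappiness, which are folded into `swap_usable` here.

```c
#include <assert.h>
#include <stdbool.h>

/* What reclaim does with a node's anonymous LRU in this model. */
enum anon_action {
    ANON_SKIP,   /* leave anon pages alone */
    ANON_DEMOTE, /* migrate anon pages to the closest PMEM node */
    ANON_SWAP,   /* reclaim anon pages to a swap device */
};

/*
 * Hypothetical helper, not a kernel function. A DRAM node with a
 * demotion target demotes regardless of swap; every other node
 * (including the PMEM node itself) keeps the usual rule of scanning
 * anon pages only when swap is usable.
 */
static enum anon_action anon_reclaim_action(bool is_dram_node,
                                            bool has_demotion_target,
                                            bool swap_usable)
{
    if (is_dram_node && has_demotion_target)
        return ANON_DEMOTE;
    return swap_usable ? ANON_SWAP : ANON_SKIP;
}
```

Together these give the DRAM -> PMEM -> swap device path: a swapless system still demotes out of DRAM, but its PMEM node leaves anon pages in place.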

Demotion only happens from a DRAM node to its closest PMEM node.
Demoting to a remote PMEM node, or migrating from PMEM to DRAM on
reclaim, is not allowed for now.
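The "closest PMEM node" rule can be sketched with a node distance table. The four-node topology, its distance values and the bitmask stand-ins for `node_states[N_MEMORY]` and `node_states[N_CPU_MEM]` are hypothetical; the memory-only set is computed the same way as `has_cpuless_node_online()` in mm/internal.h (N_MEMORY minus N_CPU_MEM), and only the single nearest memory-only node is ever chosen.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 4
#define NO_NODE   (-1)

/*
 * Hypothetical topology: nodes 0 and 1 have CPUs and DRAM, nodes 2 and 3
 * are memory-only (e.g. PMEM). Values follow the ACPI SLIT convention,
 * where 10 means local access.
 */
static const int node_distance[MAX_NODES][MAX_NODES] = {
    { 10, 21, 17, 28 },
    { 21, 10, 28, 17 },
    { 17, 28, 10, 28 },
    { 28, 17, 28, 10 },
};

/* Hypothetical stand-ins for node_states[N_MEMORY] and [N_CPU_MEM]. */
static const uint32_t nodes_with_memory = 0xf;      /* nodes 0-3 */
static const uint32_t nodes_with_cpu_and_mem = 0x3; /* nodes 0-1 */

/* Memory-only nodes: N_MEMORY and not N_CPU_MEM, as a bitmask. */
static uint32_t cpuless_nodes(void)
{
    return nodes_with_memory & ~nodes_with_cpu_and_mem;
}

/*
 * Return the closest memory-only node to @src, or NO_NODE when @src is
 * itself memory-only (reclaim there goes to swap, not to another node)
 * or when no memory-only node exists.
 */
static int closest_demotion_target(int src)
{
    uint32_t targets = cpuless_nodes();
    int best = NO_NODE;

    assert(src >= 0 && src < MAX_NODES);
    if (targets & (1u << src))
        return NO_NODE;

    for (int n = 0; n < MAX_NODES; n++) {
        if (!(targets & (1u << n)))
            continue;
        if (best == NO_NODE ||
            node_distance[src][n] < node_distance[src][best])
            best = n;
    }
    return best;
}
```

Here node 0 demotes to node 2 and node 1 to node 3, while nodes 2 and 3 get no target at all.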

Also, define a new migration reason for demotion, called MR_DEMOTE.
Demote pages via async migration to avoid blocking.

Signed-off-by: Yang Shi 
---
 include/linux/gfp.h            |  12
 include/linux/migrate.h        |   1 +
 include/trace/events/migrate.h |   3 +-
 mm/debug.c                     |   1 +
 mm/internal.h                  |  13 +
 mm/migrate.c                   |  15 -
 mm/vmscan.c                    | 127 +++--
 7 files changed, 149 insertions(+), 23 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de..57ced51 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -285,6 +285,14 @@
  * available and will not wake kswapd/kcompactd on failure. The _LIGHT
  * version does not attempt reclaim/compaction at all and is by default used
  * in page fault path, while the non-light is used by khugepaged.
+ *
+ * %GFP_DEMOTE is for migration on memory reclaim (a.k.a. demotion) allocations.
+ * The allocation might happen in kswapd or direct reclaim, so it is safer to
+ * assume __GFP_IO and __GFP_FS are not allowed.  Demotion only happens for
+ * user pages (on the LRU) and on a specific node.  Generally it will fail
+ * quickly if memory is not available, but may wake up kswapd on failure.
+ *
+ * %GFP_TRANSHUGE_DEMOTE is used for THP demotion allocation.
  */
 #define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -300,6 +308,10 @@
 #define GFP_TRANSHUGE_LIGHT    ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                          __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE  (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#define GFP_DEMOTE (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | \
+   __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | \
+   GFP_NOWAIT)
+#define GFP_TRANSHUGE_DEMOTE   (GFP_DEMOTE | __GFP_COMP)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
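A userspace check of what `GFP_DEMOTE` does and does not allow. The `M_*` bit values are hypothetical, and `GFP_NOWAIT` is assumed to expand to only the kswapd-wakeup bit; only the composition of `GFP_DEMOTE` and `GFP_TRANSHUGE_DEMOTE` is copied from the patch. The C11 `static_assert`s encode the properties the gfp.h comment relies on: no I/O, no filesystem recursion, no direct reclaim, but kswapd may still be woken.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical bit assignments for a userspace model. The real ___GFP_*
 * values live in include/linux/gfp.h and are not reproduced here.
 */
enum {
    M_HIGHMEM        = 1 << 0,
    M_MOVABLE        = 1 << 1,
    M_IO             = 1 << 2,
    M_FS             = 1 << 3,
    M_DIRECT_RECLAIM = 1 << 4,
    M_KSWAPD_RECLAIM = 1 << 5,
    M_NORETRY        = 1 << 6,
    M_NOMEMALLOC     = 1 << 7,
    M_NOWARN         = 1 << 8,
    M_THISNODE       = 1 << 9,
    M_COMP           = 1 << 10,
};

/* Assumed: GFP_NOWAIT only allows waking kswapd. */
#define M_GFP_NOWAIT M_KSWAPD_RECLAIM

/* Same composition as GFP_DEMOTE / GFP_TRANSHUGE_DEMOTE in the patch. */
#define M_GFP_DEMOTE (M_HIGHMEM | M_MOVABLE | M_NORETRY | M_NOMEMALLOC | \
                      M_NOWARN | M_THISNODE | M_GFP_NOWAIT)
#define M_GFP_TRANSHUGE_DEMOTE (M_GFP_DEMOTE | M_COMP)

/* Can an allocation with these flags block in direct reclaim/compaction? */
static bool may_direct_reclaim(unsigned int gfp)
{
    return (gfp & M_DIRECT_RECLAIM) != 0;
}

/* May the allocator start block I/O or re-enter the filesystem? */
static bool may_do_io(unsigned int gfp)
{
    return (gfp & (M_IO | M_FS)) != 0;
}

/* Compile-time checks of the properties the gfp.h comment relies on. */
static_assert((M_GFP_DEMOTE & (M_IO | M_FS | M_DIRECT_RECLAIM)) == 0,
              "demotion must not do I/O, FS calls or direct reclaim");
static_assert((M_GFP_DEMOTE & M_KSWAPD_RECLAIM) != 0,
              "a failed demotion allocation should still wake kswapd");
```

The THP variant differs from the base-page one only by the compound-page bit, which is why a THP demotion failure can fall back to base pages without changing any reclaim behaviour.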
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 837fdd1..cfb1f57 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
    MR_MEMPOLICY_MBIND,
    MR_NUMA_MISPLACED,
    MR_CONTIG_RANGE,
+   MR_DEMOTE,
    MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 705b33d..c1d5b36 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
    EM( MR_SYSCALL,         "syscall_or_cpuset")    \
    EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")      \
    EM( MR_NUMA_MISPLACED,  "numa_misplaced")       \
-   EMe(MR_CONTIG_RANGE,    "contig_range")
+   EM( MR_CONTIG_RANGE,    "contig_range")         \
+   EMe(MR_DEMOTE,          "demote")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6..cc0d7df 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -25,6 +25,7 @@
    "mempolicy_mbind",
    "numa_misplaced",
    "cma",
+   "demote",
 };
 
 const struct trace_print_flags pageflag_names[] = {
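The trace-header hunk above changes `EMe(MR_CONTIG_RANGE, ...)` to `EM(...)` and adds `EMe(MR_DEMOTE, ...)` because the list is expanded more than once with different definitions of `EM()` and `EMe()`, and the last entry must not leave a trailing separator. The sketch below is a userspace analogue, not the tracepoint machinery itself: the enum is abbreviated to the reasons visible in this patch (so its values differ from the kernel's), and the `{ -1, NULL }` sentinel and `reason_to_name()` helper are assumptions chosen to show why the last entry needs `EMe()`. Note that mm/debug.c keeps its own hand-written name table, which is why the patch adds "demote" there separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Abbreviated: only the reasons visible in this patch. */
enum migrate_reason {
    MR_SYSCALL,
    MR_MEMPOLICY_MBIND,
    MR_NUMA_MISPLACED,
    MR_CONTIG_RANGE,
    MR_DEMOTE,
    MR_TYPES
};

/* Same shape as MIGRATE_REASON: EM() for every entry except the last. */
#define MIGRATE_REASON                              \
    EM( MR_SYSCALL,         "syscall_or_cpuset")    \
    EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")      \
    EM( MR_NUMA_MISPLACED,  "numa_misplaced")       \
    EM( MR_CONTIG_RANGE,    "contig_range")         \
    EMe(MR_DEMOTE,          "demote")

struct reason_name {
    int value;
    const char *name;
};

/*
 * EM() emits a trailing comma and EMe() does not, so the list can be
 * followed directly by a sentinel entry. Ending the list with EM()
 * instead would produce ",," here and fail to compile.
 */
#define EM(a, b)  { a, b },
#define EMe(a, b) { a, b }
static const struct reason_name reason_names[] = {
    MIGRATE_REASON, { -1, NULL }
};
#undef EM
#undef EMe

/* Look up the printable name of a migration reason. */
static const char *reason_to_name(int reason)
{
    for (const struct reason_name *r = reason_names; r->name != NULL; r++)
        if (r->value == reason)
            return r->name;
    return "unknown";
}
```

`reason_to_name(MR_DEMOTE)` yields "demote", while `MR_TYPES`, which is only a count, falls through to "unknown".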
diff --git a/mm/internal.h b/mm/internal.h
index bee4d6c..8c424b5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -383,6 +383,19 @@ static inline int find_next_best_node(int node, nodemask_t *used_node_mask,
 }
 #endif
 
+static inline bool has_cpuless_node_online(void)
+{
+   nodemask_t nmask;
+
+   nodes_andnot(nmask, node_states[N_MEMORY],
+                node_states[N_CPU_MEM]);
+
+   if (nodes_empty(nmask))
+           return false;
+
+   return true;
+}