We have the NUMA Balancing feature, which always tries to move a task's
pages to the node it executes on most, but it still has issues:

* page cache is not handled
* there is no cgroup-level balancing

Suppose we have a box with 4 CPUs and two cgroups A & B, each running 4
tasks; the scenario below can easily be observed:

NODE0                   |       NODE1
                        |
CPU0            CPU1    |       CPU2            CPU3
task_A0         task_A1 |       task_A2         task_A3
task_B0         task_B1 |       task_B2         task_B3

and, when the tasks have similar behavior, memory consumption is usually
about equal on each node.

In this case NUMA balancing tries to move the pages of task_A0,1 &
task_B0,1 to node 0 and the pages of task_A2,3 & task_B2,3 to node 1,
but the page cache will be located randomly, depending on which CPU did
the first read/write.
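
To illustrate this first-touch placement, here is a minimal userspace
sketch (not part of this series; the file path and CPU number are made
up for illustration). It pins itself to CPU2, dirties a shared file
mapping, and then asks the kernel which node the resulting page cache
page landed on; build with -lnuma:

#define _GNU_SOURCE
#include <fcntl.h>
#include <numaif.h>		/* get_mempolicy(), MPOL_F_NODE */
#include <sched.h>		/* sched_setaffinity() */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t set;
	int node = -1;

	/* Run on CPU2 (NODE1 in the topology above). */
	CPU_ZERO(&set);
	CPU_SET(2, &set);
	sched_setaffinity(0, sizeof(set), &set);

	int fd = open("/tmp/cachefile", O_CREAT | O_RDWR, 0644);
	ftruncate(fd, 4096);

	/* Shared file mapping: the page we touch is the page cache page. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 1, 4096);		/* first touch happens on CPU2 */

	/* Ask which node backs this page cache page. */
	get_mempolicy(&node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR);
	printf("page cache page sits on node %d\n", node);

	munmap(p, 4096);
	close(fd);
	return 0;
}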

Now let's suppose another scenario:

NODE0                   |       NODE1
                        |
CPU0            CPU1    |       CPU2            CPU3
task_A0         task_A1 |       task_B0         task_B1
task_A2         task_A3 |       task_B2         task_B3

By swapping the CPU & memory resources of task_A0,1 and task_B0,1, the
workload of cgroup A is now entirely on node 0 and that of cgroup B on
node 1. Resource consumption stays the same, and related tasks can now
share a closer CPU cache, yet the page cache is still randomly located.

Now what if the workloads generate lots of page cache, and most of the
memory accesses are page cache writes?

Page cache generated by task_A0 on NODE1 won't follow it to NODE0, but
if task_A0 was already on NODE0 before it read/wrote the files, the
cache would be there. So how do we make sure that happens?

Usually we could solve this problem by binding the workload to a single
node: if cgroup A is bound to CPU0,1, then all the page cache it
generates will be on NODE0, and the NUMA bonus will be maximal.

However, this requires very careful administration of the particular
workloads. Suppose, in our case, that the CPU requirements of A & B vary
between 0% and 400%; then binding each to a single node would be a bad
idea.

So what we need is a way to detect memory topology at the cgroup level,
and to migrate a cgroup's CPU/memory resources to the node holding most
of its page cache, as long as that node has plenty of resources.
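
As a rough illustration of that decision, something along the lines of
the sketch below captures the idea (names, structures and thresholds are
invented here, this is not the actual Numa Balancer code): pick the node
that already holds the largest share of the cgroup's pages, but only if
it has enough spare memory and CPU:

#define NR_NODES	2

struct cg_node_stat {
	unsigned long	pages;		/* cgroup pages (incl. page cache) on this node */
	unsigned long	free_mem;	/* free memory left on this node */
	unsigned long	idle_cpu;	/* idle CPU capacity on this node (percent) */
};

static int pick_preferred_node(const struct cg_node_stat stat[NR_NODES],
			       unsigned long needed_mem,
			       unsigned long needed_cpu)
{
	int node, best = -1;

	for (node = 0; node < NR_NODES; node++) {
		/* Skip nodes which can't host the whole cgroup. */
		if (stat[node].free_mem < needed_mem ||
		    stat[node].idle_cpu < needed_cpu)
			continue;
		if (best < 0 || stat[node].pages > stat[best].pages)
			best = node;
	}
	return best;	/* -1: leave the cgroup unbound */
}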

This patch set introduces:
  * advanced per-cgroup NUMA statistics
  * a NUMA preferred node feature
  * a NUMA Balancer module

These help achieve easy and flexible NUMA resource assignment, to gain
as much NUMA bonus as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

-- 
2.14.4.44.g2045bb6
