Re: [RFC PATCH 5/5] numa: numa balancer

2019-04-23 Thread 王贇



On 2019/4/23 5:05 PM, Peter Zijlstra wrote:
[snip]
>>
>> TODO:
>>   * improve the logic to address the regression cases
>>   * find a way, maybe, to handle the page cache left on remote nodes
>>   * find more scenarios that could benefit
>>
>> Signed-off-by: Michael Wang 
>> ---
>>  drivers/Makefile |   1 +
>>  drivers/numa/Makefile|   1 +
>>  drivers/numa/numa_balancer.c | 715 +++
> 
> So I really think this is the wrong direction. Why introduce yet another
> balancer thingy and not extend the existing numa balancer with the
> additional information you got from the previous patches?
> 
> Also, this really should not be a module and not in drivers/
We presented the idea as a module because it does not suit every
situation; a module is clean and easier to deploy on demand.

Besides, we assume some people may prefer their own logic for
numa balancing, and a module gives them an easy way to DIY.

But we don't insist on this form: once the logic is mature
enough, we can merge the idea into CFS, and a per-cgroup switch
could be enough :-P

Regards,
Michael Wang



Re: [RFC PATCH 5/5] numa: numa balancer

2019-04-23 Thread Peter Zijlstra
On Mon, Apr 22, 2019 at 10:21:17AM +0800, 王贇 wrote:
> The numa balancer is a module that tries to tune NUMA balancing
> automatically to gain as much NUMA benefit as possible.
>
> For each memory cgroup, we process the work in two stages:
>
> In stage 1 we check the cgroup's exectime and memory topology to see
> whether there is a candidate node to settle down on; if we find one,
> we move on to stage 2.
>
> In stage 2 we try to settle down as much as possible by preferring the
> candidate node; if the node is no longer suitable or locality keeps
> declining, we reset things and a new round begins.
>
> Decisions are made by find_candidate_nid(), should_prefer() and
> keep_prefer(), which pick a candidate node, check whether we are allowed
> to prefer it, and decide whether to keep preferring it.
>
> Tested on a box with 96 CPUs using the sysbench-mysql-oltp_read_write
> test: 4 mysqld instances were created and attached to 4 cgroups, then 4
> sysbench instances were created and attached to the corresponding cgroups
> to test mysql with the oltp_read_write script. The average eps:
> 
>                                 origin   balancer    change
> 4 instances each 12 threads    5241.08    5375.59    +2.50%
> 4 instances each 24 threads    7497.29    7820.73    +4.13%
> 4 instances each 36 threads    8985.44    9317.04    +3.55%
> 4 instances each 48 threads    9716.50    9982.60    +2.66%
> 
> Other benchmarks like dbench, pgbench and perf bench numa were also
> tested, with different parameters and numbers of instances/threads; most
> cases show a gain, some show an acceptable regression, and some show no
> change.
>
> TODO:
>   * improve the logic to address the regression cases
>   * find a way, maybe, to handle the page cache left on remote nodes
>   * find more scenarios that could benefit
> 
> Signed-off-by: Michael Wang 
> ---
>  drivers/Makefile |   1 +
>  drivers/numa/Makefile|   1 +
>  drivers/numa/numa_balancer.c | 715 +++

So I really think this is the wrong direction. Why introduce yet another
balancer thingy and not extend the existing numa balancer with the
additional information you got from the previous patches?

Also, this really should not be a module and not in drivers/


[RFC PATCH 5/5] numa: numa balancer

2019-04-21 Thread 王贇
The numa balancer is a module that tries to tune NUMA balancing
automatically to gain as much NUMA benefit as possible.

For each memory cgroup, we process the work in two stages:

In stage 1 we check the cgroup's exectime and memory topology to see
whether there is a candidate node to settle down on; if we find one,
we move on to stage 2.

In stage 2 we try to settle down as much as possible by preferring the
candidate node; if the node is no longer suitable or locality keeps
declining, we reset things and a new round begins.

Decisions are made by find_candidate_nid(), should_prefer() and
keep_prefer(), which pick a candidate node, check whether we are allowed
to prefer it, and decide whether to keep preferring it.
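
A minimal user-space sketch of the two-stage flow described above may help
when reading the patch. Everything in it is an assumption made for
illustration: the sketch_* names, the struct layout, the fixed node count
and the exact conditions are hypothetical and are not the code in
numa_balancer.c. Only the meaning of the two thresholds follows the
descriptions of the candidate_wmark and prefer_level module parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node count, for illustration only. */
#define SKETCH_NR_NODES 4

enum sketch_stage { SKETCH_STAGE_SEARCH = 1, SKETCH_STAGE_SETTLE = 2 };

/* Hypothetical per-cgroup state; not the layout used in numa_balancer.c. */
struct sketch_cgroup {
	/* Share of execution time and of memory on each node, in percent. */
	unsigned int exec_pct[SKETCH_NR_NODES];
	unsigned int mem_pct[SKETCH_NR_NODES];
	enum sketch_stage stage;
	/* Preferred node id, -1 while no node is preferred. */
	int prefer_nid;
	/* Number of consecutive rounds in which locality fell. */
	unsigned int downturn;
};

/* A node qualifies when its execution or memory share is above the watermark. */
static int sketch_is_candidate(const struct sketch_cgroup *cg, int nid,
			       unsigned int candidate_wmark)
{
	return cg->exec_pct[nid] > candidate_wmark ||
	       cg->mem_pct[nid] > candidate_wmark;
}

/* Stage 1 helper: return the first qualifying node, or -1 if there is none. */
static int sketch_find_candidate(const struct sketch_cgroup *cg,
				 unsigned int candidate_wmark)
{
	for (int nid = 0; nid < SKETCH_NR_NODES; nid++)
		if (sketch_is_candidate(cg, nid, candidate_wmark))
			return nid;
	return -1;
}

/* Run one round; locality_fell is non-zero when locality dropped since the last round. */
static void sketch_round(struct sketch_cgroup *cg, int locality_fell,
			 unsigned int candidate_wmark, unsigned int prefer_level)
{
	assert(cg != NULL);

	if (cg->stage == SKETCH_STAGE_SEARCH) {
		int nid = sketch_find_candidate(cg, candidate_wmark);

		/* prefer_level == 0 means "no prefer", so stay in stage 1. */
		if (nid >= 0 && prefer_level > 0) {
			cg->prefer_nid = nid;
			cg->downturn = 0;
			cg->stage = SKETCH_STAGE_SETTLE;
		}
		return;
	}

	/* Stage 2 requires a valid preferred node. */
	assert(cg->prefer_nid >= 0 && cg->prefer_nid < SKETCH_NR_NODES);

	/* Count continuous downturns; any non-falling round clears the count. */
	cg->downturn = locality_fell ? cg->downturn + 1 : 0;

	if (!sketch_is_candidate(cg, cg->prefer_nid, candidate_wmark) ||
	    cg->downturn >= prefer_level) {
		/* Reset things and start a new round from stage 1. */
		cg->prefer_nid = -1;
		cg->downturn = 0;
		cg->stage = SKETCH_STAGE_SEARCH;
	}
}
```

Calling sketch_round() once per period with candidate_wmark = 60 and
prefer_level = 10 (the defaults in this patch) moves a cgroup to stage 2
as soon as one node holds more than 60% of its execution time or memory,
and drops back to stage 1 after 10 consecutive rounds of falling locality.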

Tested on a box with 96 CPUs using the sysbench-mysql-oltp_read_write
test: 4 mysqld instances were created and attached to 4 cgroups, then 4
sysbench instances were created and attached to the corresponding cgroups
to test mysql with the oltp_read_write script. The average eps:

                                origin   balancer    change
4 instances each 12 threads    5241.08    5375.59    +2.50%
4 instances each 24 threads    7497.29    7820.73    +4.13%
4 instances each 36 threads    8985.44    9317.04    +3.55%
4 instances each 48 threads    9716.50    9982.60    +2.66%
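
The percentage column can be recomputed from the two eps columns. The
posting does not say which denominator was used, so the helpers below
(hypothetical names, for illustration only) compute both variants. The
listed figures agree with (balancer - origin) / balancer truncated to two
decimals; measured against origin instead, the same rows come out slightly
higher, at roughly 2.57%, 4.31%, 3.69% and 2.74%.

```c
#include <assert.h>

/* Gain in percent, measured against the "origin" (baseline) column. */
static double sketch_gain_vs_origin(double origin, double balancer)
{
	assert(origin > 0.0);
	return (balancer - origin) / origin * 100.0;
}

/* Gain in percent, measured against the "balancer" column. */
static double sketch_gain_vs_balancer(double origin, double balancer)
{
	assert(balancer > 0.0);
	return (balancer - origin) / balancer * 100.0;
}
```

For the first row, sketch_gain_vs_balancer(5241.08, 5375.59) is about 2.50
and sketch_gain_vs_origin(5241.08, 5375.59) is about 2.57.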

Other benchmarks like dbench, pgbench and perf bench numa were also
tested, with different parameters and numbers of instances/threads; most
cases show a gain, some show an acceptable regression, and some show no
change.

TODO:
  * improve the logic to address the regression cases
  * find a way, maybe, to handle the page cache left on remote nodes
  * find more scenarios that could benefit

Signed-off-by: Michael Wang 
---
 drivers/Makefile |   1 +
 drivers/numa/Makefile|   1 +
 drivers/numa/numa_balancer.c | 715 +++
 3 files changed, 717 insertions(+)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

diff --git a/drivers/Makefile b/drivers/Makefile
index c61cde554340..f07936b03870 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -187,3 +187,4 @@ obj-$(CONFIG_UNISYS_VISORBUS)   += visorbus/
 obj-$(CONFIG_SIOX) += siox/
 obj-$(CONFIG_GNSS) += gnss/
 obj-$(CONFIG_INTERCONNECT) += interconnect/
+obj-$(CONFIG_NUMA_BALANCING)   += numa/
diff --git a/drivers/numa/Makefile b/drivers/numa/Makefile
new file mode 100644
index ..acf8a408
--- /dev/null
+++ b/drivers/numa/Makefile
@@ -0,0 +1 @@
+obj-m  += numa_balancer.o
diff --git a/drivers/numa/numa_balancer.c b/drivers/numa/numa_balancer.c
new file mode 100644
index ..25bbe08c82a2
--- /dev/null
+++ b/drivers/numa/numa_balancer.c
@@ -0,0 +1,715 @@
+/*
+ * NUMA Balancer
+ *
+ *  Copyright (C) 2019 Alibaba Group Holding Limited.
+ *  Author: Michael Wang 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static unsigned int debug_level;
+module_param(debug_level, uint, 0644);
+MODULE_PARM_DESC(debug_level, "1 to print decisions, 2 to print both decisions and node info");
+
+static int prefer_level = 10;
+module_param(prefer_level, int, 0644);
+MODULE_PARM_DESC(prefer_level, "stop numa prefer when reach this much continuous downturn, 0 means no prefer");
+
+static unsigned int locality_level = PERCENT_70_79;
+module_param(locality_level, uint, 0644);
+MODULE_PARM_DESC(locality_level, "consider locality as good when above this sector");
+
+static unsigned long period_max = (600 * HZ);
+module_param(period_max, ulong, 0644);
+MODULE_PARM_DESC(period_max, "maximum period between each stage");
+
+static unsigned long period_min = (5 * HZ);
+module_param(period_min, ulong, 0644);
+MODULE_PARM_DESC(period_min, "minimum period between each stage");
+
+static unsigned int cpu_high_wmark = 100;
+module_param(cpu_high_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_high_wmark, "respect the execution percent rather than memory percent when above this cpu usage");
+
+static unsigned int cpu_low_wmark = 10;
+module_param(cpu_low_wmark, uint, 0644);
+MODULE_PARM_DESC(cpu_low_wmark, "consider cgroup as active when above this cpu usage");
+
+static unsigned int free_low_wmark = 10;
+module_param(free_low_wmark, uint, 0644);
+MODULE_PARM_DESC(free_low_wmark, "consider node as consumed out when below this free percent");
+
+static unsigned int candidate_wmark = 60;
+module_param(candidate_wmark, uint, 0644);
+MODULE_PARM_DESC(candidate_wmark, "consider node as candidate when above this execution time or memory percent");
+
+static unsigned int settled_wmark = 90;
+module_param(settled_wmark, uint, 0644);
+MODULE_PARM_DESC(settled_wmark, "consider cgroup settle down on node when 
above this execution time and memory percent, or