Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.

2018-06-06 Thread Srikar Dronamraju
> > The commit does cause some performance regression but is needed from
> > a fairness/correctness perspective.
> > 
> 
> While it may cause some performance regressions, that may be because
> either a) some workloads benefit from overloading a node if the tasks
> idle frequently or b) convergence is delayed. I'm not 100% convinced
> this needs to be done from a correctness point of view based on just
> this microbenchmark.
> 

I will get back with SPECjbb2005 numbers, as suggested by Rik.



Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.

2018-06-05 Thread Mel Gorman
On Mon, Jun 04, 2018 at 03:30:20PM +0530, Srikar Dronamraju wrote:
> Since task migration under NUMA balancing can happen in parallel, more
> than one task might choose to move to the same node at the same time.
> This can cause load imbalances at the node level.
> 
> The problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per-node variable to indicate whether a task migration to the
> node under NUMA balancing is currently active. This per-node variable
> does not track swapping of tasks.
> 
> Testcase   Time:       Min       Max       Avg    StdDev
> numa01.sh  Real:    434.84    676.90    550.53    106.24
> numa01.sh   Sys:    125.98    217.34    179.41     30.35
> numa01.sh  User:  38318.48  53789.56  45864.17   6620.80
> numa02.sh  Real:     60.06     61.27     60.59      0.45
> numa02.sh   Sys:     14.25     17.86     16.09      1.28
> numa02.sh  User:   5190.13   5225.67   5209.24     13.19
> numa03.sh  Real:    748.21    960.25    823.15     73.51
> numa03.sh   Sys:     96.68    122.10    110.42     11.29
> numa03.sh  User:  58222.16  72595.27  63552.22   5048.87
> numa04.sh  Real:    433.08    630.55    499.30     68.15
> numa04.sh   Sys:    245.22    386.75    306.09     63.32
> numa04.sh  User:  35014.68  46151.72  38530.26   3924.65
> numa05.sh  Real:    394.77    410.07    401.41      5.99
> numa05.sh   Sys:    212.40    301.82    256.23     35.41
> numa05.sh  User:  33224.86  34201.40  33665.61    313.40
> 
> Testcase   Time:       Min       Max       Avg    StdDev   %Change
> numa01.sh  Real:    674.61    997.71    785.01    115.95   -29.86%
> numa01.sh   Sys:    180.87    318.88    270.13     51.32   -33.58%
> numa01.sh  User:  54001.30  71936.50  60495.48   6237.55   -24.18%
> numa02.sh  Real:     60.62     62.30     61.46      0.62   -1.415%
> numa02.sh   Sys:     15.01     33.63     24.38      6.81   -34.00%
> numa02.sh  User:   5234.20   5325.60   5276.23     38.85   -1.269%
> numa03.sh  Real:    827.62    946.85    914.48     44.58   -9.987%
> numa03.sh   Sys:    135.55    172.40    158.46     12.75   -30.31%
> numa03.sh  User:  64839.42  73195.44  70805.96   3061.20   -10.24%
> numa04.sh  Real:    481.01    608.76    521.14     47.28   -4.190%
> numa04.sh   Sys:    329.59    373.15    353.20     14.20   -13.33%
> numa04.sh  User:  37649.09  40722.94  38806.32   1072.32   -0.711%
> numa05.sh  Real:    399.21    415.38    409.88      5.54   -2.066%
> numa05.sh   Sys:    319.46    418.57    363.31     37.62   -29.47%
> numa05.sh  User:  33727.77  34732.68  34127.41    447.11   -1.353%
> 
> The commit does cause some performance regression but is needed from
> a fairness/correctness perspective.
> 

While it may cause some performance regressions, that may be because
either a) some workloads benefit from overloading a node if the tasks
idle frequently or b) convergence is delayed. I'm not 100% convinced
this needs to be done from a correctness point of view based on just
this microbenchmark.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.

2018-06-04 Thread Rik van Riel
On Mon, 2018-06-04 at 15:30 +0530, Srikar Dronamraju wrote:
> Since task migration under NUMA balancing can happen in parallel, more
> than one task might choose to move to the same node at the same time.
> This can cause load imbalances at the node level.
> 
> The problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per-node variable to indicate whether a task migration to the
> node under NUMA balancing is currently active. This per-node variable
> does not track swapping of tasks.

> The commit does cause some performance regression but is needed from
> a fairness/correctness perspective.

Does it help any "real workloads", even simple things
like SPECjbb2005?

If this patch only causes regressions, and does not help
any workloads, I would argue that it is not in fact needed.

-- 
All Rights Reversed.



[PATCH 11/19] sched/numa: Restrict migrating in parallel to the same node.

2018-06-04 Thread Srikar Dronamraju
Since task migration under NUMA balancing can happen in parallel, more
than one task might choose to move to the same node at the same time.
This can cause load imbalances at the node level.

The problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-node variable to indicate whether a task migration to the
node under NUMA balancing is currently active. This per-node variable
does not track swapping of tasks.
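
The per-node guard described above can be pictured with a small
userspace model. This is only a sketch under stated assumptions: it uses
C11 atomics in place of the kernel's xchg() and WRITE_ONCE(), and
NR_NODES, node_migrate_begin(), node_migrate_end() and node_flag_demo()
are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_NODES 2	/* hypothetical node count, for illustration only */

/* One flag per node: nonzero while a task move to that node is in flight. */
static atomic_int active_node_migrate[NR_NODES];

/* Try to claim a node as a migration target; only one caller can win. */
bool node_migrate_begin(int node)
{
	/* atomic_exchange() returns the previous value, much like xchg(). */
	return atomic_exchange(&active_node_migrate[node], 1) == 0;
}

/* Release the node once the move has completed or been abandoned. */
void node_migrate_end(int node)
{
	atomic_store(&active_node_migrate[node], 0);
}

/* Walk through the claim/refuse/release sequence; returns claims granted. */
int node_flag_demo(void)
{
	int granted = 0;

	granted += node_migrate_begin(0);	/* first mover claims node 0 */
	granted += node_migrate_begin(0);	/* second mover is refused */
	granted += node_migrate_begin(1);	/* node 1 is independent */
	node_migrate_end(0);
	granted += node_migrate_begin(0);	/* node 0 is free again */

	/* Leave both flags clear so the demo can be repeated. */
	node_migrate_end(0);
	node_migrate_end(1);

	assert(granted == 3);
	return granted;
}
```

The point of the exchange is that checking and setting the flag happen
in one atomic step, so two tasks racing for the same node cannot both
see it as free.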

Testcase   Time:       Min       Max       Avg    StdDev
numa01.sh  Real:    434.84    676.90    550.53    106.24
numa01.sh   Sys:    125.98    217.34    179.41     30.35
numa01.sh  User:  38318.48  53789.56  45864.17   6620.80
numa02.sh  Real:     60.06     61.27     60.59      0.45
numa02.sh   Sys:     14.25     17.86     16.09      1.28
numa02.sh  User:   5190.13   5225.67   5209.24     13.19
numa03.sh  Real:    748.21    960.25    823.15     73.51
numa03.sh   Sys:     96.68    122.10    110.42     11.29
numa03.sh  User:  58222.16  72595.27  63552.22   5048.87
numa04.sh  Real:    433.08    630.55    499.30     68.15
numa04.sh   Sys:    245.22    386.75    306.09     63.32
numa04.sh  User:  35014.68  46151.72  38530.26   3924.65
numa05.sh  Real:    394.77    410.07    401.41      5.99
numa05.sh   Sys:    212.40    301.82    256.23     35.41
numa05.sh  User:  33224.86  34201.40  33665.61    313.40

Testcase   Time:       Min       Max       Avg    StdDev   %Change
numa01.sh  Real:    674.61    997.71    785.01    115.95   -29.86%
numa01.sh   Sys:    180.87    318.88    270.13     51.32   -33.58%
numa01.sh  User:  54001.30  71936.50  60495.48   6237.55   -24.18%
numa02.sh  Real:     60.62     62.30     61.46      0.62   -1.415%
numa02.sh   Sys:     15.01     33.63     24.38      6.81   -34.00%
numa02.sh  User:   5234.20   5325.60   5276.23     38.85   -1.269%
numa03.sh  Real:    827.62    946.85    914.48     44.58   -9.987%
numa03.sh   Sys:    135.55    172.40    158.46     12.75   -30.31%
numa03.sh  User:  64839.42  73195.44  70805.96   3061.20   -10.24%
numa04.sh  Real:    481.01    608.76    521.14     47.28   -4.190%
numa04.sh   Sys:    329.59    373.15    353.20     14.20   -13.33%
numa04.sh  User:  37649.09  40722.94  38806.32   1072.32   -0.711%
numa05.sh  Real:    399.21    415.38    409.88      5.54   -2.066%
numa05.sh   Sys:    319.46    418.57    363.31     37.62   -29.47%
numa05.sh  User:  33727.77  34732.68  34127.41    447.11   -1.353%
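
The message does not define the %Change column. The listed values look
consistent with (first-table Avg - second-table Avg) / second-table Avg,
expressed as a percentage and truncated rather than rounded; treat that
as an inference from the numbers, not as the author's stated formula.
Under that assumption a negative value means the second run took longer.
The helper names pct_change() and pct_change_selfcheck() below are
hypothetical.

```c
#include <assert.h>

/*
 * Assumed formula: change of the first-table average relative to the
 * second-table average, in percent. Inferred from the posted numbers.
 */
double pct_change(double first_avg, double second_avg)
{
	return (first_avg - second_avg) / second_avg * 100.0;
}

/* Check the assumed formula against two rows of the posted tables. */
void pct_change_selfcheck(void)
{
	/* numa01.sh Real: Avg 550.53 vs 785.01, posted as -29.86% */
	double real01 = pct_change(550.53, 785.01);
	/* numa02.sh Sys: Avg 16.09 vs 24.38, posted as -34.00% */
	double sys02 = pct_change(16.09, 24.38);

	assert(real01 > -29.88 && real01 < -29.86);
	assert(sys02 > -34.01 && sys02 < -34.00);
}
```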


The commit does cause some performance regression but is needed from
a fairness/correctness perspective.

Signed-off-by: Srikar Dronamraju 
---
 include/linux/mmzone.h |  1 +
 kernel/sched/fair.c    | 14 ++++++++++++++
 mm/page_alloc.c        |  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..b0767703 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -677,6 +677,7 @@ struct zonelist {
 
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
+	int active_node_migrate;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e19e32..259c343 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1478,11 +1478,22 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	if (xchg(&rq->numa_migrate_on, 1))
 		return;
 
+	if (!env->best_task && env->best_cpu != -1)
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
+
+	if (!p) {
+		if (xchg(&pgdat->active_node_migrate, 1)) {
+			WRITE_ONCE(rq->numa_migrate_on, 0);
+			return;
+		}
+	}
+
 	if (env->best_cpu != -1) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
@@ -1819,8 +1830,11 @@ static int task_numa_migrate(struct task_struct *p)
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
+		pg_data_t *pgdat = NODE_DATA(cpu_to_node(env.dst_cpu));
+
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu,