Currently the preferred node is set to dst_nid, which is the last node in
the iteration whose group weight or task weight is greater than that of
the current node. However, this doesn't guarantee that dst_nid has the
NUMA capacity to accept the task. It also doesn't guarantee that dst_nid
contains best_cpu, which is the CPU (and hence node) ideal for migration.

Let's consider faults on a 4-node system with group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Suppose the task
is running on node 3 and node 0 is its preferred node, but node 0's
capacity is full, while nodes 1, 2 and 3 have capacity. Then the task
should be migrated to node 1. Currently the task gets moved to node 2,
because env.dst_nid points to the last node whose faults were greater
than those of the current node.

Modify task_numa_migrate() to set the preferred node based on best_cpu.
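
The difference between the two choices can be shown with a small userspace
toy model. This is only a sketch, not kernel code: struct node_info,
scan_nodes(), toy_cpu_to_node() and the fault counts are hypothetical, and
it assumes the ordering above means node 0 has the most faults and node 3
the fewest.

```c
#include <stdbool.h>

#define NR_NODES      4
#define CPUS_PER_NODE 4   /* assumption for the toy topology */

/* Hypothetical per-node state; these are not kernel structures. */
struct node_info {
	int  faults;        /* stand-in for the group/task weight on the node */
	bool has_capacity;  /* whether the node can take another task */
};

struct pick {
	int dst_nid;   /* last node scanned with more faults than the source */
	int best_cpu;  /* CPU on the best node that had capacity, -1 if none */
};

static int toy_cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/* Scan all nodes other than the source, roughly as the node loop does. */
static struct pick scan_nodes(const struct node_info nodes[], int src_nid)
{
	struct pick p = { .dst_nid = src_nid, .best_cpu = -1 };
	int best_faults = nodes[src_nid].faults;

	for (int nid = 0; nid < NR_NODES; nid++) {
		if (nid == src_nid || nodes[nid].faults <= nodes[src_nid].faults)
			continue;

		/* Overwritten for every qualifying node, capacity or not. */
		p.dst_nid = nid;

		/* best_cpu only moves to a node that has room and is better. */
		if (nodes[nid].has_capacity && nodes[nid].faults > best_faults) {
			best_faults = nodes[nid].faults;
			p.best_cpu = nid * CPUS_PER_NODE;
		}
	}
	return p;
}

/* Old behaviour: the preferred node follows dst_nid. */
static int preferred_old(const struct pick *p)
{
	return p->dst_nid;
}

/* New behaviour: the preferred node follows best_cpu, else stay put. */
static int preferred_new(const struct pick *p, int src_nid)
{
	return p->best_cpu == -1 ? src_nid : toy_cpu_to_node(p->best_cpu);
}
```

With faults of 40, 30, 20 and 10 on nodes 0 to 3, node 0 full, and the task
on node 3, preferred_old() yields node 2 while preferred_new() yields
node 1, matching the scenario described above.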

Also, while modifying task_numa_migrate(), use sched_setnuma() to set the
preferred node. This ensures our NUMA accounting is correct.

Before this patch:
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.78      653.81      534.58       83.20
numa01.sh       Sys:      121.93      187.18      145.90       23.47
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
numa02.sh      Real:       60.64       61.63       61.19        0.40
numa02.sh       Sys:       14.72       25.68       19.06        4.03
numa02.sh      User:     5210.95     5266.69     5233.30       20.82
numa03.sh      Real:      746.51      808.24      780.36       23.88
numa03.sh       Sys:       97.26      108.48      105.07        4.28
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
numa04.sh      Real:      465.97      519.27      484.81       19.62
numa04.sh       Sys:      304.43      359.08      334.68       20.64
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
numa05.sh      Real:      411.57      457.20      433.29       16.58
numa05.sh       Sys:      230.05      435.48      339.95       67.58
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64

With this patch:
Testcase       Time:         Min         Max         Avg      StdDev     %Change
numa01.sh      Real:      506.35      794.46      599.06      104.26     -10.76%
numa01.sh       Sys:      150.37      223.56      195.99       24.94     -25.55%
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33     -11.43%
numa02.sh      Real:       60.33       62.40       61.31        0.90     -0.195%
numa02.sh       Sys:       18.12       31.66       24.28        5.89     -21.49%
numa02.sh      User:     5203.91     5325.32     5260.29       49.98     -0.513%
numa03.sh      Real:      696.47      853.62      745.80       57.28     4.6339%
numa03.sh       Sys:       85.68      123.71       97.89       13.48     7.3347%
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97     1.5323%
numa04.sh      Real:      444.05      514.83      497.06       26.85     -2.464%
numa04.sh       Sys:      230.39      375.79      316.23       48.58     5.8343%
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08     -1.153%
numa05.sh      Real:      423.09      460.41      439.57       13.92     -1.428%
numa05.sh       Sys:      287.38      480.15      369.37       68.52     -7.964%
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51     -1.704%

While there is a performance hit, this is a correctness fix that is very
much needed on bigger systems.

Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea32a66..94091e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1725,8 +1725,9 @@ static int task_numa_migrate(struct task_struct *p)
         * Tasks that are "trapped" in such domains cannot be migrated
         * elsewhere, so there is no point in (re)trying.
         */
-       if (unlikely(!sd)) {
-               p->numa_preferred_nid = task_node(p);
+       if (unlikely(!sd) && p->numa_preferred_nid != task_node(p)) {
+               /* Set the new preferred node */
+               sched_setnuma(p, task_node(p));
                return -EINVAL;
        }
 
@@ -1785,15 +1786,13 @@ static int task_numa_migrate(struct task_struct *p)
         * trying for a better one later. Do not set the preferred node here.
         */
        if (p->numa_group) {
-               struct numa_group *ng = p->numa_group;
-
                if (env.best_cpu == -1)
                        nid = env.src_nid;
                else
-                       nid = env.dst_nid;
+                       nid = cpu_to_node(env.best_cpu);
 
-               if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
-                       sched_setnuma(p, env.dst_nid);
+               if (nid != p->numa_preferred_nid)
+                       sched_setnuma(p, nid);
        }
 
        /* No better CPU than the current one was found. */
-- 
1.8.3.1
