On an AMD Family 17h-based (EPYC) system, a NUMA node can contain
up to 8 cores (16 threads) with the following topology:

             ----------------------------
         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7
             ----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
With a full system configuration, the current scheduler creates 4 sched
domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent cpus spanning a NUMA node.
With this hierarchy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule one task per NUMA node. This is not the case for
the current scheduler.

Case2:
Sometimes, threads are scheduled on the same cpu, while other
cpus are idle. This results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on thread
placement, where 25.1s is when all 8 threads are balanced properly
across 8 cpus.

Introduce a NUMA identity node sched domain, which is based on how the
SRAT/SLIT tables define a NUMA node. This results in the following
hierarchy of sched domains on the system described above:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.

Cc: sta...@vger.kernel.org
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpa...@amd.com>
---
Changes from V1 (https://lkml.org/lkml/2017/8/10/540)
  * Update commit message to include performance numbers.
  * Change from NUMA_IDEN to NODE.
  * Fix code styling and update comments.
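
For illustration only (not part of the patch): a minimal userspace sketch
of the level-building logic described above. It assumes a small distance
table standing in for the SLIT; the names model_numa_levels() and
model_topology_names(), the MAX_* limits, and the node_equals_prev flag
(standing in for the cpumask_equal() check) are hypothetical and do not
exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES  8   /* hypothetical limit for this sketch */
#define MAX_LEVELS 8   /* hypothetical limit for this sketch */

/*
 * Collect the unique node distances in ascending order. The identity
 * distance dist[0][0] is stored first, so it becomes level 0.
 * Returns the number of levels written to levels[].
 */
static int model_numa_levels(int dist[MAX_NODES][MAX_NODES], int nr_nodes,
                             int levels[MAX_LEVELS])
{
    int level = 0;
    int curr = dist[0][0];

    assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);

    levels[level++] = curr;   /* identity node at level 0 */

    for (;;) {
        int next = curr;

        /* Find the smallest distance strictly greater than curr. */
        for (int i = 0; i < nr_nodes; i++) {
            for (int j = 0; j < nr_nodes; j++) {
                int d = dist[i][j];

                if (d > curr && (next == curr || d < next))
                    next = d;
            }
        }

        if (next == curr || level == MAX_LEVELS)
            break;

        levels[level++] = next;
        curr = next;
    }

    return level;
}

/*
 * Append the NUMA-related levels after the base levels (e.g. SMT, MC).
 * NODE is skipped when it would span the same cpus as the previous
 * level; the remaining levels (1..nr_numa_levels-1) are named NUMA.
 * Returns the number of names written to out[].
 */
static int model_topology_names(const char *base[], int nr_base,
                                int nr_numa_levels, bool node_equals_prev,
                                const char *out[], int max_out)
{
    int n = 0;

    for (int i = 0; i < nr_base && n < max_out; i++)
        out[n++] = base[i];

    if (!node_equals_prev && n < max_out)
        out[n++] = "NODE";

    for (int j = 1; j < nr_numa_levels && n < max_out; j++)
        out[n++] = "NUMA";

    return n;
}
```

With a hypothetical 8-node table using distance 10 for the local node, 16
within a socket and 32 across sockets, model_numa_levels() yields three
levels (10, 16, 32), and model_topology_names() with base levels SMT and
MC yields SMT, MC, NODE, NUMA, NUMA, matching the hierarchy listed in the
commit message.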

 kernel/sched/topology.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 79895ae..2dd5b11 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1335,6 +1335,10 @@ void sched_init_numa(void)
        if (!sched_domains_numa_distance)
                return;
 
+       /* Includes NUMA identity node at level 0. */
+       sched_domains_numa_distance[level++] = curr_distance;
+       sched_domains_numa_levels = level;
+
        /*
         * O(nr_nodes^2) deduplicating selection sort -- in order to find the
         * unique distances in the node_distance() table.
@@ -1382,8 +1386,7 @@ void sched_init_numa(void)
                return;
 
        /*
-        * 'level' contains the number of unique distances, excluding the
-        * identity distance node_distance(i,i).
+        * 'level' contains the number of unique distances
         *
         * The sched_domains_numa_distance[] array includes the actual distance
         * numbers.
@@ -1445,9 +1448,26 @@ void sched_init_numa(void)
                tl[i] = sched_domain_topology[i];
 
        /*
+        * Do not set up the NUMA node level if it has the same cpumask
+        * as the sched domain at the previous level.
+        * This is the case for systems with:
+        *  - LLC == NODE : LLC (MC) sched domain spans a NUMA node.
+        *  - DIE == NODE : DIE sched domain spans a NUMA node.
+        *
+        * Assume all NUMA nodes are identical, so only check node 0.
+        */
+       if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
+               tl[i++] = (struct sched_domain_topology_level){
+                       .mask = sd_numa_mask,
+                       .numa_level = 0,
+                       SD_INIT_NAME(NODE)
+               };
+       }
+
+       /*
         * .. and append 'j' levels of NUMA goodness.
         */
-       for (j = 0; j < level; i++, j++) {
+       for (j = 1; j < level; i++, j++) {
                tl[i] = (struct sched_domain_topology_level){
                        .mask = sd_numa_mask,
                        .sd_flags = cpu_numa_flags,
-- 
2.7.4
