On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + *   node   0   1   2   3
> + *     0:  10  20  30  20
> + *     1:  20  10  20  30
> + *     2:  30  20  10  20
> + *     3:  20  30  20  10
> + *
> + * which represents a 4-node ring topology like:
> + *
> + *   0 ----- 1
> + *   |       |
> + *   |       |
> + *   |       |
> + *   3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2    0-3             0-3             0-3             0-3
> + *  groups:  {0-1,3},{1-3}   {0-2},{0,2-3}   {1-3},{0-1,3}   {0,2-3},{0-2}
> + *
> + * NUMA-1    0-1,3           0-2             1-3             0,2-3
> + *  groups:  {0},{1},{3}     {0},{1},{2}     {1},{2},{3}     {0},{2},{3}
> + *
> + * NUMA-0    0               1               2               3
> + *
> + *
> + * As can be seen, things don't line up as nicely as with the regular
> + * topology. When we iterate a domain in child-domain chunks, some nodes can
> + * be represented multiple times -- hence the "overlap" naming for this part
> + * of the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover
> + * the domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and
> + * 1-3.
> + *
> + * Because:
> + *
> + *  - the first group of each domain is its child domain; this
> + *    gets us the first 0-1,3
> + *  - the only uncovered node is 2, whose child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of Node-1 NUMA-2: both
> + * groups include the CPUs of Node-0, while those CPUs would not in fact
> + * ever end up in those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
> +
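
To make the hop/mask construction and the group covering concrete, here is a
small standalone userspace sketch (my own illustration, not the kernel code;
the kernel builds per-CPU cpumasks for this, see sched_init_numa() and
build_overlap_sched_groups()). It derives the per-level node masks from the
distance table above and then covers each node's NUMA-2 domain:

#include <stdio.h>

#define N 4

/* Node distance table for the 4-node ring above. */
static const int dist[N][N] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* Distinct distances, sorted; one NUMA level per entry. */
static const int level_dist[] = { 10, 20, 30 };

/* Bitmask of all nodes within distance @ld of @node. */
static unsigned int node_mask(int node, int ld)
{
	unsigned int mask = 0;

	for (int i = 0; i < N; i++)
		if (dist[node][i] <= ld)
			mask |= 1u << i;
	return mask;
}

static void print_mask(unsigned int mask)
{
	printf(" {");
	for (int i = 0; i < N; i++)
		if (mask & (1u << i))
			printf(" %d", i);
	printf(" }");
}

int main(void)
{
	/* The per-level masks; this reproduces the NUMA-0/1/2 rows above. */
	for (int l = 0; l < 3; l++) {
		printf("NUMA-%d:", l);
		for (int node = 0; node < N; node++)
			print_mask(node_mask(node, level_dist[l]));
		printf("\n");
	}

	/*
	 * Groups of each node's NUMA-2 domain: start with the node's own
	 * child (NUMA-1) span, then keep adding the child span of the
	 * next uncovered node until the whole domain is covered.
	 */
	for (int node = 0; node < N; node++) {
		unsigned int span = node_mask(node, level_dist[2]);
		unsigned int covered = 0;

		printf("node%d NUMA-2 groups:", node);
		for (int i = node; covered != span; i = (i + 1) % N) {
			if (covered & (1u << i))
				continue;
			covered |= node_mask(i, level_dist[1]);
			print_mask(node_mask(i, level_dist[1]));
		}
		printf("\n");
	}
	return 0;
}

Running it reproduces the group rows above, e.g. node0's NUMA-2 domain gets
exactly { 0 1 3 } and { 1 2 3 }.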

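The iteration mask can be illustrated the same way. The rule below -- a node
reaches a group iff its own child domain spans exactly the group -- is my
node-level paraphrase of what the kernel's per-CPU group mask construction
does:

#include <stdio.h>

#define N 4

static const int dist[N][N] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* Span of @node's child (NUMA-1) domain: everything within distance 20. */
static unsigned int child_span(int node)
{
	unsigned int mask = 0;

	for (int i = 0; i < N; i++)
		if (dist[node][i] <= 20)
			mask |= 1u << i;
	return mask;
}

int main(void)
{
	/* Node-1's NUMA-2 groups from the example: {0-2} and {0,2-3}. */
	static const unsigned int groups[] = { 0x07, 0x0d };

	for (int g = 0; g < 2; g++) {
		unsigned int iter = 0;

		/*
		 * A node's CPUs can only end up iterating this group if
		 * the node's own child domain spans exactly the group;
		 * otherwise its domain tree routes them elsewhere.
		 */
		for (int i = 0; i < N; i++)
			if (child_span(i) == groups[g])
				iter |= 1u << i;

		printf("group %#x: iteration mask %#x\n", groups[g], iter);
	}
	return 0;
}

Group {0-2} always ends up with iteration mask {1} and group {0,2-3} with
{3}, regardless of which node's domain installed the group. The first CPU of
the mask is therefore unique per group span, which is what lets it anchor the
shared sched_group_capacity (the XXX above).
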
I added the below to clarify some of the asymmetric topology cases we have.


--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
  *
  * XXX include words on how balance_cpu is unique and therefore can be
  * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  20  30
+ *     1:  20  10  20  20
+ *     2:  20  20  10  20
+ *     3:  30  20  20  10
+ *
+ * Which looks a little like:
+ *
+ *   0 ----- 1
+ *   |     / |
+ *   |   /   |
+ *   | /     |
+ *   2 ----- 3
+ *
+ * This topology is asymmetric: nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the number of
+ * sched_domains is not the same for each CPU. Consider:
+ *
+ * NUMA-2      0-3                                             0-3
+ *  groups:    {0-2},{1-3}                                     {1-3},{0-2}
+ *
+ * NUMA-1      0-2             0-3             0-3             1-3
+ *
+ * NUMA-0      0               1               2               3
+ *
  */
 
 

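The same node-level toy as before, fed this asymmetric table, shows where the
differing domain counts come from: nodes 1 and 2 already span all nodes at
distance 20, so their NUMA-2 level would have the same span as their NUMA-1
level and is dropped as degenerate (much like the scheduler collapses
degenerate domains):

#include <stdio.h>

#define N 4

/* The asymmetric distance table from the comment above. */
static const int dist[N][N] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 30, 20, 20, 10 },
};

int main(void)
{
	static const int level_dist[] = { 10, 20, 30 };

	for (int node = 0; node < N; node++) {
		unsigned int prev = 0;

		printf("node%d:", node);
		for (int l = 0; l < 3; l++) {
			unsigned int span = 0;

			for (int i = 0; i < N; i++)
				if (dist[node][i] <= level_dist[l])
					span |= 1u << i;

			/* A level that does not grow the span is degenerate. */
			if (span == prev)
				continue;
			printf("  NUMA-%d=%#x", l, span);
			prev = span;
		}
		printf("\n");
	}
	return 0;
}

Nodes 0 and 3 print three levels, nodes 1 and 2 only two, matching the
diagram above.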