On 24/03/15 17:39, Morten Rasmussen wrote:
> On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
>> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
>>>>> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
>>>>> attach the energy information to the domains?
>>
>>> In the current domain hierarchy you don't have domains with just one cpu
>>> in them. If you attach the per-cpu energy data to the MC level domain
>>> which spans the whole cluster, you break the current idea of attaching
>>> information to the cpumask (currently sched_group, but could be
>>> sched_domain as we discuss here) the information is associated with. You
>>> would have to either introduce a level of single cpu domains at the
>>> lowest level or move away from the idea of attaching data to the cpumask
>>> that is associated with it.
>>>
>>> Using sched_groups we do already have single cpu groups that we can
>>> attach per-cpu data to, but we are missing a top level group spanning
>>> the entire system for system wide energy data. So from that point of
>>> view groups and domains are equally bad.
>>
>> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.
> 
> Yeah :( I don't really care which one we choose. Adding another top
> level domain with one big group spanning all cpus, but with all SD flags
> disabled seems less intrusive than adding a level at the bottom.
> 
> Better ideas are very welcome.
> 

I had a stab at integrating such a top level (SYS) domain w/ all known SD
flags disabled. This SYS sd exposes itself w/ all counters set to 0 in
/proc/schedstat.
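
To illustrate why the SYS sd still needs one flag of its own: a domain with no flags set at all would be treated as degenerate and collapsed. The userspace sketch below is a simplified model of that check, not the kernel's sd_degenerate(). The struct toy_sd, toy_sd_degenerate() and the groupless_flags parameter are made up for illustration; the flag values are taken from the patch.

```c
/* Flag values as used in the patch; everything else is hypothetical. */
#define SD_BALANCE_FORK  0x0008
#define SD_BALANCE_WAKE  0x0010
#define SD_WAKE_AFFINE   0x0020
#define SD_SHARE_ENERGY  0x0040

/* Simplified stand-in for struct sched_domain. */
struct toy_sd {
	unsigned int flags;
	int span_weight;	/* number of CPUs the domain spans */
	int nr_groups;		/* number of groups in the domain */
};

/*
 * Return 1 if the domain carries no useful information and could be
 * removed. groupless_flags lists the flags that keep a domain alive even
 * when it has only a single group.
 */
static int toy_sd_degenerate(const struct toy_sd *sd,
			     unsigned int groupless_flags)
{
	if (sd->span_weight == 1)
		return 1;

	/* Balancing flags only matter with at least two groups. */
	if ((sd->flags & (SD_BALANCE_FORK | SD_BALANCE_WAKE)) &&
	    sd->nr_groups > 1)
		return 0;

	/* Flags that don't use groups. */
	if (sd->flags & groupless_flags)
		return 0;

	return 1;
}
```

With groupless_flags set to SD_WAKE_AFFINE only, a four-cpu, one-group sd whose flags are just SD_SHARE_ENERGY comes out as degenerate; adding SD_SHARE_ENERGY to the mask keeps it.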

There are still some kludges in the patch below:

- A new topology SD flag is needed to tell sd_init() that we want to
  reset the default sd configuration.
- build_sched_domains() must not break at the first sd spanning cpu_map.
- rebalance_domains() skips the SYS sd so that its newidle max times are
  not decayed.
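
The second point can be sketched in userspace. The toy model below is an assumption-laden illustration, not the kernel loop: struct toy_level, toy_build_levels() and the stop_at_full_span switch are hypothetical, and each level is reduced to the cpumask it spans for one cpu. It shows that breaking at the first level spanning cpu_map would stop before the SYS level is ever built.

```c
#include <stdbool.h>

/* Hypothetical topology level: bit i of span set means cpu i is included. */
struct toy_level {
	const char *name;
	unsigned int span;
};

/*
 * Count how many levels get built for one cpu. With stop_at_full_span set,
 * mimic the existing behaviour of breaking out at the first level whose
 * span equals cpu_map; with it clear, walk every level.
 */
static int toy_build_levels(const struct toy_level *levels, int nr_levels,
			    unsigned int cpu_map, bool stop_at_full_span)
{
	int built = 0;

	for (int i = 0; i < nr_levels; i++) {
		built++;
		if (stop_at_full_span && levels[i].span == cpu_map)
			break;
	}

	return built;
}
```

For a dual cluster system with four cpus (MC spanning 0x3 for cpu 0, DIE and SYS both spanning 0xf), the early break stops after DIE, so SYS is only reached with the break removed.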

It survived booting on single cluster (MC-SYS) and dual cluster
(MC-DIE-SYS) ARM systems.
Would something like this be acceptable?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f984b4e58865..8fbc9976f5d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,6 +904,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_FORK                0x0008  /* Balance on fork, clone */
 #define SD_BALANCE_WAKE                0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE         0x0020  /* Wake task to waking CPU */
+#define SD_SHARE_ENERGY                0x0040  /* System-wide energy data */
 #define SD_SHARE_CPUCAPACITY   0x0080  /* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN   0x0100  /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f52c2e7484e..d058dc1e639f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5529,7 +5529,7 @@ static int sd_degenerate(struct sched_domain *sd)
        }
 
        /* Following flags don't use groups */
-       if (sd->flags & (SD_WAKE_AFFINE))
+       if (sd->flags & (SD_WAKE_AFFINE | SD_SHARE_ENERGY))
                return 0;
 
        return 1;
@@ -6215,8 +6215,9 @@ static int sched_domains_curr_level;
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  * SD_SHARE_CAP_STATES    - describes shared capacity states
  *
- * Odd one out:
+ * Odd two out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_SHARE_ENERGY        - describes EAS quirks
  */
 #define TOPOLOGY_SD_FLAGS              \
        (SD_SHARE_CPUCAPACITY |         \
@@ -6224,7 +6225,8 @@ static int sched_domains_curr_level;
         SD_NUMA |                      \
         SD_ASYM_PACKING |              \
         SD_SHARE_POWERDOMAIN |         \
-        SD_SHARE_CAP_STATES)
+        SD_SHARE_CAP_STATES |          \
+        SD_SHARE_ENERGY)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -6298,6 +6300,14 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
                sd->cache_nice_tries = 1;
                sd->busy_idx = 2;
 
+       } else if (sd->flags & SD_SHARE_ENERGY) {
+               /* Reset the default configuration completely */
+               memset(sd, 0, sizeof(*sd));
+
+               sd->flags = 1*SD_SHARE_ENERGY;
+#ifdef CONFIG_SCHED_DEBUG
+               sd->name = tl->name;
+#endif
 #ifdef CONFIG_NUMA
        } else if (sd->flags & SD_NUMA) {
                sd->cache_nice_tries = 2;
@@ -6826,8 +6836,6 @@ static int build_sched_domains(const struct cpumask *cpu_map,
                                *per_cpu_ptr(d.sd, i) = sd;
                        if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
                                sd->flags |= SD_OVERLAP;
-                       if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-                               break;
                }
        }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..8d4cc72f4778 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8073,6 +8073,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
        rcu_read_lock();
        for_each_domain(cpu, sd) {
+
+               if (sd->flags & SD_SHARE_ENERGY)
+                       continue;
+
                /*
                 * Decay the newidle max times here because this is a regular
                 * visit to all the domains. Decay ~1% per second.
