On 24/03/15 17:39, Morten Rasmussen wrote:
> On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
>> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
>>>>> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
>>>>> attach the energy information to the domains?
>>
>>> In the current domain hierarchy you don't have domains with just one cpu
>>> in them. If you attach the per-cpu energy data to the MC level domain
>>> which spans the whole cluster, you break the current idea of attaching
>>> information to the cpumask (currently sched_group, but could be
>>> sched_domain as we discuss here) the information is associated with. You
>>> would have to either introduce a level of single cpu domains at the
>>> lowest level or move away from the idea of attaching data to the cpumask
>>> that is associated with it.
>>>
>>> Using sched_groups we do already have single cpu groups that we can
>>> attach per-cpu data to, but we are missing a top level group spanning
>>> the entire system for system wide energy data. So from that point of
>>> view groups and domains are equally bad.
>>
>> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.
>
> Yeah :( I don't really care which one we choose. Adding another top
> level domain with one big group spanning all cpus, but with all SD flags
> disabled seems less intrusive than adding a level at the bottom.
>
> Better ideas are very welcome.
>
I had a stab at integrating such a top level (SYS) domain w/ all known SD
flags disabled. This SYS sd exposes itself w/ all counters set to 0 in
/proc/schedstat.

There are still some kludges in the patch below:

 - The need for a new topology SD flag to tell sd_init() that we want to
   reset the default sd configuration.

 - Don't break in build_sched_domains() at the first sd spanning cpu_map.

 - Don't decay newidle max times in rebalance_domains() by bailing out
   early on the SYS sd.

It survived booting on single-cluster (MC-SYS) and dual-cluster
(MC-DIE-SYS) ARM systems.

Would something like this be acceptable?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f984b4e58865..8fbc9976f5d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,6 +904,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_SHARE_ENERGY		0x0040	/* System-wide energy data */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f52c2e7484e..d058dc1e639f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5529,7 +5529,7 @@ static int sd_degenerate(struct sched_domain *sd)
 	}

 	/* Following flags don't use groups */
-	if (sd->flags & (SD_WAKE_AFFINE))
+	if (sd->flags & (SD_WAKE_AFFINE | SD_SHARE_ENERGY))
 		return 0;

 	return 1;
@@ -6215,8 +6215,9 @@ static int sched_domains_curr_level;
  * SD_SHARE_POWERDOMAIN - describes shared power domain
  * SD_SHARE_CAP_STATES  - describes shared capacity states
  *
- * Odd one out:
+ * Odd two out:
  * SD_ASYM_PACKING      - describes SMT quirks
+ * SD_SHARE_ENERGY      - describes EAS quirks
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
@@ -6224,7 +6225,8 @@ static int sched_domains_curr_level;
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
 	 SD_SHARE_POWERDOMAIN |		\
-	 SD_SHARE_CAP_STATES)
+	 SD_SHARE_CAP_STATES |		\
+	 SD_SHARE_ENERGY)

 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -6298,6 +6300,14 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;

+	} else if (sd->flags & SD_SHARE_ENERGY) {
+		/* Reset the default configuration completely */
+		memset(sd, 0, sizeof(*sd));
+
+		sd->flags = 1*SD_SHARE_ENERGY;
+#ifdef CONFIG_SCHED_DEBUG
+		sd->name = tl->name;
+#endif
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6826,8 +6836,6 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			*per_cpu_ptr(d.sd, i) = sd;
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |= SD_OVERLAP;
-			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-				break;
 		}
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..8d4cc72f4778 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8073,6 +8073,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+
+		if (sd->flags & SD_SHARE_ENERGY)
+			continue;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.