Re: [PATCH 2/2] sched: Document Energy Aware Scheduling

2019-01-18 Thread Juri Lelli
Hi,

On 18/01/19 10:34, Quentin Perret wrote:
> Hi Rafael,
> 
> On Friday 18 Jan 2019 at 10:57:08 (+0100), Rafael J. Wysocki wrote:
> > On Fri, Jan 18, 2019 at 10:16 AM Quentin Perret wrote:
> > >
> > > Hi Juri,
> > >
> > > On Thursday 17 Jan 2019 at 16:51:17 (+0100), Juri Lelli wrote:
> > > > On 10/01/19 11:05, Quentin Perret wrote:
> > > [...]
> > > > > +The idea behind introducing an EM is to allow the scheduler to evaluate the
> > > > > +implications of its decisions rather than blindly applying energy-saving
> > > > > +techniques that may have positive effects only on some platforms. At the same
> > > > > +time, the EM must be as simple as possible to minimize the scheduler latency
> > > > > +impact.
> > > > > +
> > > > > +In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
> > > >
> > > > Not sure if we want to remark on the fact that EAS is looking at
> > > > CFS tasks only ATM.
> > >
> > > Oh, what's wrong about mentioning it ? I mean, it is a fact ATM ...
> > 
> > But it won't hurt to mention that it may cover other scheduling
> > classes in the future.  IOW, the scope limit is not fundamental.
> 
> Agreed, I can do that.

Oh, sorry, bad phrasing from my side. I meant that we should probably
state clearly somewhere that EAS deals with CFS only ATM, but extending
it to other classes (DEADLINE in particular) certainly makes sense and
people are welcome to experiment with that.

So, yeah, I agree with both of you. :-)


Re: [PATCH 2/2] sched: Document Energy Aware Scheduling

2019-01-17 Thread Juri Lelli
Hi,

On 10/01/19 11:05, Quentin Perret wrote:
> Add some documentation detailing the main design points of EAS, as well
> as a list of its dependencies.
> 
> Parts of this documentation are taken from Morten Rasmussen's original
> EAS posting: https://lkml.org/lkml/2015/7/7/754
> 
> Reviewed-by: Qais Yousef 
> Co-authored-by: Morten Rasmussen 
> Signed-off-by: Quentin Perret 
> ---
>  Documentation/scheduler/sched-energy.txt | 425 +++
>  1 file changed, 425 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-energy.txt
> 
> diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
> new file mode 100644
> index ..197d81f4b836
> --- /dev/null
> +++ b/Documentation/scheduler/sched-energy.txt
> @@ -0,0 +1,425 @@
> +===
> +Energy Aware Scheduling
> +===
> +
> +1. Introduction
> +---
> +
> +Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
> +the impact of its decisions on the energy consumed by CPUs. EAS relies on an
> +Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
> +with a minimal impact on throughput. This document aims at providing an
> +introduction on how EAS works, what are the main design decisions behind it, and
> +details what is needed to get it to run.
> +
> +Before going any further, please note that at the time of writing:
> +
> +   /!\ EAS does not support platforms with symmetric CPU topologies /!\
> +
> +EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
> +because this is where the potential for saving energy through scheduling is
> +the highest.
> +
> +The actual EM used by EAS is _not_ maintained by the scheduler, but by a
> +dedicated framework. For details about this framework and what it provides,
> +please refer to its documentation (see Documentation/power/energy-model.txt).
> +
> +
> +2. Background and Terminology
> +-
> +
> +To make it clear from the start:
> + - energy = [joule] (resource like a battery on powered devices)
> + - power = energy/time = [joule/second] = [watt]
> +
> +The goal of EAS is to minimize energy, while still getting the job done. That
> +is, we want to maximize:
> +
> + performance [inst/s]
> + --------------------
> + power [W]
> +
> +which is equivalent to minimizing:
> +
> + energy [J]
> + -----------
> + instruction
> +
> +while still getting 'good' performance. It is essentially an alternative
> +optimization objective to the current performance-only objective for the
> +scheduler. This alternative considers two objectives: energy-efficiency and
> +performance.
> +
> +The idea behind introducing an EM is to allow the scheduler to evaluate the
> +implications of its decisions rather than blindly applying energy-saving
> +techniques that may have positive effects only on some platforms. At the same
> +time, the EM must be as simple as possible to minimize the scheduler latency
> +impact.
> +
> +In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time

Not sure if we want to remark on the fact that EAS is looking at CFS
tasks only ATM.

> +for the scheduler to decide where a task should run (during wake-up), the EM
> +is used to break the tie between several good CPU candidates and pick the one
> +that is predicted to yield the best energy consumption without harming the
> +system's throughput. The predictions made by EAS rely on specific elements of
> +knowledge about the platform's topology, which include the 'capacity' of CPUs,

Add a reference to DT bindings docs defining 'capacity' (or define it
somewhere)?

> +and their respective energy costs.
> +
> +
> +3. Topology information
> +---
> +
> +EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
> +differentiate CPUs with different computing throughput. The 'capacity' of a CPU
> +represents the amount of work it can absorb when running at its highest
> +frequency compared to the most capable CPU of the system. Capacity values are
> +normalized in a 1024 range, and are comparable with the utilization signals of
> +tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
> +to capacity and utilization values, EAS is able to estimate how big/busy a
> +task/CPU is, and to take this into consideration when evaluating performance vs
> +energy trade-offs. The capacity of CPUs is provided via arch-specific code
> +through the arch_scale_cpu_capacity() callback.

Ah, it's here. Mmm, maybe still introduce it earlier (or point here from
above), and still point to the DT bindings doc?

> +
> +The rest of platform knowledge used by EAS is directly read from the Energy
> +Model (EM) framework. The EM of a platform is composed of a power cost table
> +per 'performance domain'

Re: [PATCH 1/2] PM / EM: Document the Energy Model framework

2019-01-17 Thread Juri Lelli
Hi,

On 10/01/19 11:05, Quentin Perret wrote:
> Introduce a documentation file summarizing the key design points and
> APIs of the newly introduced Energy Model framework.
> 
> Signed-off-by: Quentin Perret 

Looks good to me.

Reviewed-by: Juri Lelli 

Best,

- Juri


Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-25 Thread Juri Lelli
On 25/05/18 11:31, Patrick Bellasi wrote:

[...]

> Right, so the problem seems to be that we "need" to call
> arch_update_cpu_topology() and we do that by calling
> partition_sched_domains() which was initially introduced by:
> 
>029190c515f1 ("cpuset sched_load_balance flag")
> 
> back in 2007, where it's also quite well explained the reasons behind
> the sched_load_balance flag and the idea to have "partitioned" SDs.
> 
> I also (hopefully) understood that there are at least two actors involved:
> 
>  - A) arch code
>which creates SDs and SGs, usually to group CPUs depending on the
>memory hierarchy, to support different time granularity of load
>balancing operations
> 
>Special cases here are HP and hibernation which, by on-/off-lining
>CPUs, directly affect the SDs/SGs definitions.
> 
>  - B) cpusets
>which expose to userspace the possibility to define,
>_if possible_, a finer granularity set of SGs to further restrict the
>scope of load balancing operations
> 
> Since B is a "possible finer granularity" refinement of A, then we
> trigger A's reconfigurations based on B's constraints.
> 
> That's why, for example, in consequence of an HP online event,
> we have:
> 
>--- core.c ---
> HP[sched:active]
>  | sched_cpu_activate()
>| cpuset_cpu_active()
>--- cpuset.c -
>  | cpuset_update_active_cpus()
>| schedule_work(&cpuset_hotplug_work)
> \.. System Kworker \
> | cpuset_hotplug_workfn()
>   if (cpus_updated || force_rebuild)
> | rebuild_sched_domains()
>   | rebuild_sched_domains_locked()
> | generate_sched_domains()
>--- topology.c ---
> | partition_sched_domains()
>   | arch_update_cpu_topology()
> 
> 
> IOW, we need to pass via cpusets to rebuild the SDs whenever there are
> HP events or we "need" to do an arch_update_cpu_topology() via the
> arch topology driver (drivers/base/arch_topology.c).

I don't think the arch topology driver is always involved in this (e.g.,
arch/x86/kernel/itmt::sched_itmt_update_handler()).

Still we need to check if topology changed, as you say.

> This last bit is also interesting, whenever we detect arch topology
> information that required an SD rebuild, we need to force a
> partition_sched_domains(). But, for that, in:
> 
>commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
> 
> we just introduced the support for the "force_rebuild" flag to be set.
> 
> Thus, potentially we can just extend the check I've proposed to consider the
> force rebuild flag, to be something like:
> 
> ---8<---
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 8f586e8bdc98..1f051fafaa3a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -874,11 +874,19 @@ static void rebuild_sched_domains_locked(void)
>!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> goto out;
>  
> +   /* Special case for the 99% of systems with one, full, sched domain */
> +   if (!force_rebuild &&
> +   !top_cpuset.isolation_count &&
> +   is_sched_load_balance(&top_cpuset))
> +   goto out;
> +   force_rebuild = false;
> +
> /* Generate domain masks and attrs */
> ndoms = generate_sched_domains(&doms, &attr);
>  
> /* Have scheduler rebuild the domains */
> partition_sched_domains(ndoms, doms, attr);
>  out:
> put_online_cpus();
> ---8<---
> 
> 
> Which would still allow using something like:
> 
>cpuset_force_rebuild()
>rebuild_sched_domains()
> 
> to actually rebuild SD in consequence of arch topology changes.

That might work.

> 
> > 
> > Maybe we could move the check you are proposing in update_cpumasks_hier() ?
> 
> Yes, that's another option... although there we are outside of
> get_online_cpus(). Could be a problem?

Mmm, using force_rebuild flag seems safer indeed.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Juri Lelli
On 24/05/18 11:09, Waiman Long wrote:
> On 05/24/2018 10:36 AM, Juri Lelli wrote:
> > On 17/05/18 16:55, Waiman Long wrote:
> >
> > [...]
> >
> >> +  A parent cgroup cannot distribute all its CPUs to child
> >> +  scheduling domain cgroups unless its load balancing flag is
> >> +  turned off.
> >> +
> >> +  cpuset.sched.load_balance
> >> +  A read-write single value file which exists on non-root
> >> +  cpuset-enabled cgroups.  It is a binary value flag that accepts
> >> +  either "0" (off) or a non-zero value (on).  This flag is set
> >> +  by the parent and is not delegatable.
> >> +
> >> +  When it is on, tasks within this cpuset will be load-balanced
> >> +  by the kernel scheduler.  Tasks will be moved from CPUs with
> >> +  high load to other CPUs within the same cpuset with less load
> >> +  periodically.
> >> +
> >> +  When it is off, there will be no load balancing among CPUs on
> >> +  this cgroup.  Tasks will stay in the CPUs they are running on
> >> +  and will not be moved to other CPUs.
> >> +
> >> +  The initial value of this flag is "1".  This flag is then
> >> +  inherited by child cgroups with cpuset enabled.  Its state
> >> +  can only be changed on a scheduling domain cgroup with no
> >> +  cpuset-enabled children.
> > [...]
> >
> >> +  /*
> >> +   * On default hierachy, a load balance flag change is only allowed
> >> +   * in a scheduling domain with no child cpuset.
> >> +   */
> >> +  if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> >> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> >> +  err = -EINVAL;
> >> +  goto out;
> >> +  }
> > The rule is actually
> >
> >  - no child cpuset
> >  - and it must be a scheduling domain
> >
> > Right?
> 
> Yes, because it doesn't make sense to have a cpu in one cpuset that has
> loading balance off while, at the same time, in another cpuset with load
> balancing turned on. This restriction is there to make sure that the
> above condition will not happen. I may be wrong if there is a realistic
> use case where the above condition is desired.

Yep, makes sense to me.

Maybe add the second condition to the comment and documentation.


Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Juri Lelli
On 17/05/18 16:55, Waiman Long wrote:

[...]

> + A parent cgroup cannot distribute all its CPUs to child
> + scheduling domain cgroups unless its load balancing flag is
> + turned off.
> +
> +  cpuset.sched.load_balance
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups.  It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on).  This flag is set
> + by the parent and is not delegatable.
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler.  Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup.  Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + The initial value of this flag is "1".  This flag is then
> + inherited by child cgroups with cpuset enabled.  Its state
> + can only be changed on a scheduling domain cgroup with no
> + cpuset-enabled children.

[...]

> + /*
> +  * On default hierachy, a load balance flag change is only allowed
> +  * in a scheduling domain with no child cpuset.
> +  */
> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> +(!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> + err = -EINVAL;
> + goto out;
> + }

The rule is actually

 - no child cpuset
 - and it must be a scheduling domain

Right?


Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-24 Thread Juri Lelli
On 24/05/18 10:04, Patrick Bellasi wrote:

[...]

> From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> From: Patrick Bellasi 
> Date: Wed, 23 May 2018 16:33:06 +0100
> Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
>  required
> 
> The generate_sched_domains() already addresses the "special case for 99%
> of systems" which require a single full sched domain at the root,
> spanning all the CPUs. However, the current support is based on an
> expensive sequence of operations which destroy and recreate the exact
> same scheduling domain configuration.
> 
> If we notice that:
> 
>  1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> isolcpus= kernel boot option, and will never be load balanced
> regardless of the value of "cpuset.sched_load_balance" in any
> cpuset.
> 
>  2) the root cpuset has load_balance enabled by default at boot and
> it's the only parameter which userspace can change at run-time.
> 
> we know that, by default, every system comes up with a complete and
> properly configured set of scheduling domains covering all the CPUs.
> 
> Thus, on every system, unless the user explicitly disables load balance
> for the top_cpuset, the scheduling domains already configured at boot
> time by the scheduler/topology code and updated in consequence of
> hotplug events, are already properly configured for cpuset too.
> 
> This configuration is the default one for 99% of the systems,
> and it's also the one used by most of the Android devices which never
> disable load balance from the top_cpuset.
> 
> Thus, while load balance is enabled for the top_cpuset,
> destroying/rebuilding the scheduling domains at every cpuset.cpus
> reconfiguration is a useless operation which will always produce the
> same result.
> 
> Let's anticipate the "special" optimization within:
> 
>rebuild_sched_domains_locked()
> 
> thus completely skipping the expensive:
> 
>generate_sched_domains()
>partition_sched_domains()
> 
> for all the cases we know that the scheduling domains already defined
> will not be affected by whatsoever value of cpuset.cpus.

[...]

> + /* Special case for the 99% of systems with one, full, sched domain */
> + if (!top_cpuset.isolation_count &&
> + is_sched_load_balance(&top_cpuset))
> + goto out;
> +

Mmm, looks like we still need to destroy and recreate if there is a
new_topology (see arch_update_cpu_topology() in
partition_sched_domains()).

Maybe we could move the check you are proposing in update_cpumasks_hier() ?


Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-24 Thread Juri Lelli
On 17/05/18 16:55, Waiman Long wrote:

[...]

> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
>* passing doms with offlined cpu to partition_sched_domains().
>* Anyways, hotplug work item will rebuild sched domains.
>*/
> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + if (!top_cpuset.isolation_count &&
> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + goto out;
> +
> + if (top_cpuset.isolation_count &&
> +!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>   goto out;

Do we cover the case in which hotplug removed one of the isolated cpus
from cpu_active_mask?


Re: [PATCH v8 6/6] cpuset: Allow reporting of sched domain generation info

2018-05-22 Thread Juri Lelli
Hi,

On 17/05/18 16:55, Waiman Long wrote:
> This patch enables us to report sched domain generation information.
> 
> If DYNAMIC_DEBUG is enabled, issuing the following command
> 
>   echo "file cpuset.c +p" > /sys/kernel/debug/dynamic_debug/control
> 
> and setting loglevel to 8 will allow the kernel to show what scheduling
> domain changes are being made.
> 
> Signed-off-by: Waiman Long 
> ---
>  kernel/cgroup/cpuset.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index fb8aa82b..8f586e8 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -820,6 +820,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
>   }
>   BUG_ON(nslot != ndoms);
>  
> +#ifdef CONFIG_DEBUG_KERNEL
> + for (i = 0; i < ndoms; i++)
> + pr_debug("generate_sched_domains dom %d: %*pbl\n", i,
> +  cpumask_pr_args(doms[i]));
> +#endif
> +

While I'm always in favor of adding debug output, in this case I'm not
sure it's adding much to what we already print when booting with the
sched_debug kernel command-line param, e.g.

--->8---
 Kernel command line: BOOT_IMAGE=/vmlinuz-4.17.0-rc5+ ... sched_debug

[...]

 smp: Bringing up secondary CPUs ...
 x86: Booting SMP configuration:
  node  #0, CPUs:#1  #2  #3  #4  #5
  node  #1, CPUs:#6  #7  #8  #9 #10 #11
 smp: Brought up 2 nodes, 12 CPUs
 smpboot: Max logical packages: 2
 smpboot: Total of 12 processors activated (45636.50 BogoMIPS)
 CPU0 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU1 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU2 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU3 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 3:{ span=3 cap=1023 }, 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU4 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 4:{ span=4 }, 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU5 attaching sched-domain(s):
  domain-0: span=0-5 level=MC
   groups: 5:{ span=5 }, 0:{ span=0 cap=1016 }, 1:{ span=1 cap=1011 }, 2:{ span=2 }, 3:{ span=3 cap=1023 }, 4:{ span=4 }
   domain-1: span=0-11 level=NUMA
    groups: 0:{ span=0-5 cap=6122 }, 6:{ span=6-11 cap=6141 }
 CPU6 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 CPU7 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 CPU8 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 CPU9 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 9:{ span=9 }, 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 CPU10 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 10:{ span=10 }, 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 CPU11 attaching sched-domain(s):
  domain-0: span=6-11 level=MC
   groups: 11:{ span=11 }, 6:{ span=6 }, 7:{ span=7 }, 8:{ span=8 cap=1021 }, 9:{ span=9 }, 10:{ span=10 }
   domain-1: span=0-11 level=NUMA
    groups: 6:{ span=6-11 cap=6141 }, 0:{ span=0-5 cap=6122 }
 span: 0-11 (max cpu_capacity = 1024)

[...]

 generate_sched_domains dom 0: 6-11 <-- this and the one below are what
 generate_sched_domains dom 1: 0-5      you are adding

Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-22 Thread Juri Lelli
Hi,

On 17/05/18 16:55, Waiman Long wrote:

[...]

>  /**
> + * update_isolated_cpumask - update the isolated_cpus mask of parent cpuset
> + * @cpuset:  The cpuset that requests CPU isolation
> + * @oldmask: The old isolated cpumask to be removed from the parent
> + * @newmask: The new isolated cpumask to be added to the parent
> + * Return: 0 if successful, an error code otherwise
> + *
> + * Changes to the isolated CPUs are not allowed if any of CPUs changing
> + * state are in any of the child cpusets of the parent except the requesting
> + * child.
> + *
> + * If the sched_domain flag changes, either the oldmask (0=>1) or the
> + * newmask (1=>0) will be NULL.
> + *
> + * Called with cpuset_mutex held.
> + */
> +static int update_isolated_cpumask(struct cpuset *cpuset,
> + struct cpumask *oldmask, struct cpumask *newmask)
> +{
> + int retval;
> + int adding, deleting;
> + cpumask_var_t addmask, delmask;
> + struct cpuset *parent = parent_cs(cpuset);
> + struct cpuset *sibling;
> + struct cgroup_subsys_state *pos_css;
> + int old_count = parent->isolation_count;
> + bool dying = cpuset->css.flags & CSS_DYING;
> +
> + /*
> +  * Parent must be a scheduling domain with non-empty cpus_allowed.
> +  */
> + if (!is_sched_domain(parent) || cpumask_empty(parent->cpus_allowed))
> + return -EINVAL;
> +
> + /*
> +  * The oldmask, if present, must be a subset of parent's isolated
> +  * CPUs.
> +  */
> + if (oldmask && !cpumask_empty(oldmask) && (!parent->isolation_count ||
> + !cpumask_subset(oldmask, parent->isolated_cpus))) {
> + WARN_ON_ONCE(1);
> + return -EINVAL;
> + }
> +
> + /*
> +  * A sched_domain state change is not allowed if there are
> +  * online children and the cpuset is not dying.
> +  */
> + if (!dying && (!oldmask || !newmask) &&
> + css_has_online_children(&cpuset->css))
> + return -EBUSY;
> +
> + if (!zalloc_cpumask_var(&addmask, GFP_KERNEL))
> + return -ENOMEM;
> + if (!zalloc_cpumask_var(&delmask, GFP_KERNEL)) {
> + free_cpumask_var(addmask);
> + return -ENOMEM;
> + }
> +
> + if (!old_count) {
> + if (!zalloc_cpumask_var(&parent->isolated_cpus, GFP_KERNEL)) {
> + retval = -ENOMEM;
> + goto out;
> + }
> + old_count = 1;
> + }
> +
> + retval = -EBUSY;
> + adding = deleting = false;
> + if (newmask)
> + cpumask_copy(addmask, newmask);
> + if (oldmask)
> + deleting = cpumask_andnot(delmask, oldmask, addmask);
> + if (newmask)
> + adding = cpumask_andnot(addmask, newmask, delmask);
> +
> + if (!adding && !deleting)
> + goto out_ok;
> +
> + /*
> +  * The cpus to be added must be in the parent's effective_cpus mask
> +  * but not in the isolated_cpus mask.
> +  */
> + if (!cpumask_subset(addmask, parent->effective_cpus))
> + goto out;
> + if (parent->isolation_count &&
> + cpumask_intersects(parent->isolated_cpus, addmask))
> + goto out;
> +
> + /*
> +  * Check if any CPUs in addmask or delmask are in a sibling cpuset.
> +  * An empty sibling cpus_allowed means it is the same as parent's
> +  * effective_cpus. This checking is skipped if the cpuset is dying.
> +  */
> + if (dying)
> + goto updated_isolated_cpus;
> +
> + cpuset_for_each_child(sibling, pos_css, parent) {
> + if ((sibling == cpuset) || !(sibling->css.flags & CSS_ONLINE))
> + continue;
> + if (cpumask_empty(sibling->cpus_allowed))
> + goto out;
> + if (adding &&
> + cpumask_intersects(sibling->cpus_allowed, addmask))
> + goto out;
> + if (deleting &&
> + cpumask_intersects(sibling->cpus_allowed, delmask))
> + goto out;
> + }

Just got the below by echoing 1 into cpuset.sched.domain of a sibling
with "isolated" cpuset.cpus. Guess proper locking is missing somewhere
around here.

--->8---
[ 7509.905005] =
[ 7509.905009] WARNING: suspicious RCU usage
[ 7509.905014] 4.17.0-rc5+ #11 Not tainted
[ 7509.905017] -
[ 7509.905023] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 cgroup_mutex or RCU read lock required!
[ 7509.905026] other info that might help us debug this:
[ 7509.905031] rcu_scheduler_active = 2, debug_locks = 1
[ 7509.905036] 4 locks held by bash/1480:
[ 7509.905039]  #0: bf288709 (sb_writers#6){.+.+}, at: vfs_write+0x18a/0x1b0
[ 7509.905072]  #1: ebf23fc9 (&of->mutex){+.+.}, at: kernfs_fop_write+0xe2/0x1a0
[ 7509.905098]  #2: de7c626e (kn->count#302){.+.+}, at: kernfs_fop_write+0

[PATCH] Documentation/admin-guide/pm/intel_pstate: fix Active Mode w/o HWP paragraph

2018-05-08 Thread Juri Lelli
P-state selection algorithm (powersave or performance) is selected by
echoing the desired choice to scaling_governor sysfs attribute and not
to scaling_cur_freq (as currently stated).

Fix it.

Signed-off-by: Juri Lelli 
Cc: Jonathan Corbet 
Cc: "Rafael J. Wysocki" 
Cc: Srinivas Pandruvada 
Cc: linux-doc@vger.kernel.org
Cc: linux...@vger.kernel.org

---
 Documentation/admin-guide/pm/intel_pstate.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index d2b6fda3d67b..ab2fe0eda1d7 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -145,7 +145,7 @@ feature enabled.]
 
 In this mode ``intel_pstate`` registers utilization update callbacks with the
 CPU scheduler in order to run a P-state selection algorithm, either
-``powersave`` or ``performance``, depending on the ``scaling_cur_freq`` policy
+``powersave`` or ``performance``, depending on the ``scaling_governor`` policy
 setting in ``sysfs``.  The current CPU frequency information to be made
 available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is
 periodically updated by those utilization update callbacks too.
-- 
2.14.3



Re: [PATCH v7 3/5] cpuset: Add a root-only cpus.isolated v2 control file

2018-04-23 Thread Juri Lelli
On 19/04/18 09:47, Waiman Long wrote:

[...]

> +  cpuset.cpus.isolated
> + A read-write multiple values file which exists on root cgroup
> + only.
> +
> + It lists the CPUs that have been withdrawn from the root cgroup
> + for load balancing.  These CPUs can still be allocated to child
> + cpusets with load balancing enabled, if necessary.
> +
> + If a child cpuset contains only an exclusive set of CPUs that are
> + a subset of the isolated CPUs and with load balancing enabled,
> + these CPUs will be load balanced on a separate root domain from
> + the one in the root cgroup.
> +
> + Just putting the CPUs into "cpuset.cpus.isolated" will be
> + enough to disable load balancing on those CPUs as long as they
> + do not appear in a child cpuset with load balancing enabled.

Tasks that were on those CPUs when they got isolated will stay there
(unless forcibly moved somewhere else). They will also "automatically"
belong to the default root domain (or potentially to a new root domain
created for a group using those CPUs). Both things may be unavoidable
(as discussed in previous versions, some tasks cannot be migrated at
all), but such "side effects" should probably be documented. What do you
think?


Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-23 Thread Juri Lelli
On 23/04/18 15:07, Juri Lelli wrote:
> Hi Waiman,
> 
> On 19/04/18 09:46, Waiman Long wrote:
> > v7:
> >  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
> >  - Enforce that load_balancing can only be turned off on cpusets with
> >CPUs from the isolated list.
> >  - Update sched domain generation to allow cpusets with CPUs only
> >from the isolated CPU list to be in separate root domains.
> 

Guess I'll be adding comments as soon as I stumble on something unclear
(to me :), hope that's OK (shout if I should do it differently).

The below looked unexpected to me:

root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.cpus
2-3
root@debian-kvm:/sys/fs/cgroup# cat g1/cpuset.mems

root@debian-kvm:~# echo $$ > /sys/fs/cgroup/g1/cgroup.threads
root@debian-kvm:/sys/fs/cgroup# cat g1/cgroup.threads
2312

So I can add tasks to groups with no mems? Or is this only true in my
case with a single mem node? Or maybe it's inherited from the root group
(slightly confusing IMHO if that's the case).


Re: [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy

2018-04-23 Thread Juri Lelli
Hi Waiman,

On 19/04/18 09:46, Waiman Long wrote:
> v7:
>  - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
>  - Enforce that load_balancing can only be turned off on cpusets with
>CPUs from the isolated list.
>  - Update sched domain generation to allow cpusets with CPUs only
>from the isolated CPU list to be in separate root domains.

Just got this while running

# echo 2-3 > /sys/fs/cgroup/cpuset.cpus.isolated

[ 6679.177826] =
[ 6679.178385] WARNING: suspicious RCU usage
[ 6679.178910] 4.16.0-rc6+ #151 Not tainted
[ 6679.179459] -
[ 6679.180082] /home/juri/work/kernel/linux/kernel/cgroup/cgroup.c:3826 
cgroup_mutex or RCU read lock required!
[ 6679.181402]
[ 6679.181402] other info that might help us debug this:
[ 6679.181402]
[ 6679.182407]
[ 6679.182407] rcu_scheduler_active = 2, debug_locks = 1
[ 6679.183278] 3 locks held by bash/2205:
[ 6679.183785]  #0:  (sb_writers#10){.+.+}, at: [<4e577fb9>] 
vfs_write+0x18a/0x1b0
[ 6679.184871]  #1:  (&of->mutex){+.+.}, at: [<5944c83f>] 
kernfs_fop_write+0xe2/0x1a0
[ 6679.185987]  #2:  (cpuset_mutex){+.+.}, at: [<879bfba0>] 
cpuset_write_resmask+0x72/0x1560
[ 6679.187112]
[ 6679.187112] stack backtrace:
[ 6679.187612] CPU: 3 PID: 2205 Comm: bash Not tainted 4.16.0-rc6+ #151
[ 6679.188318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-2.fc27 04/01/2014
[ 6679.189291] Call Trace:
[ 6679.189581]  dump_stack+0x85/0xc5
[ 6679.189963]  css_next_child+0x90/0xd0
[ 6679.190385]  cpuset_write_resmask+0x46f/0x1560
[ 6679.190885]  ? lock_acquire+0x9f/0x210
[ 6679.191315]  cgroup_file_write+0x94/0x230
[ 6679.191768]  kernfs_fop_write+0x113/0x1a0
[ 6679.192223]  __vfs_write+0x36/0x180
[ 6679.192617]  ? rcu_read_lock_sched_held+0x6b/0x80
[ 6679.193139]  ? rcu_sync_lockdep_assert+0x2e/0x60
[ 6679.193654]  ? __sb_start_write+0x154/0x1f0
[ 6679.194118]  ? __sb_start_write+0x16a/0x1f0
[ 6679.194607]  vfs_write+0xc1/0x1b0
[ 6679.194984]  SyS_write+0x55/0xc0
[ 6679.195365]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 6679.195839]  do_syscall_64+0x79/0x220
[ 6679.196212]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 6679.196729] RIP: 0033:0x7f03183ff780
[ 6679.197138] RSP: 002b:7ffeae336ca8 EFLAGS: 0246 ORIG_RAX: 
0001
[ 6679.197866] RAX: ffda RBX: 0004 RCX: 7f03183ff780
[ 6679.198550] RDX: 0004 RSI: 00eaf408 RDI: 0001
[ 6679.199235] RBP: 00eaf408 R08: 000a R09: 7f0318cff700
[ 6679.199928] R10:  R11: 0246 R12: 7f03186b57a0
[ 6679.200615] R13: 0004 R14: 0001 R15: 
[ 6679.201369] rebuild_sched_domains dom 0: 0-1
[ 6679.202196] span: 0-1 (max cpu_capacity = 1024)

Guess we should grab either lock from the writing path.

Best,

- Juri


Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-26 Thread Juri Lelli
On 26/03/18 16:28, Waiman Long wrote:
> On 03/26/2018 08:47 AM, Juri Lelli wrote:
> > On 23/03/18 14:44, Waiman Long wrote:
> >> On 03/23/2018 03:59 AM, Juri Lelli wrote:
> > [...]
> >
> >>> OK, thanks for confirming. Can you tell again however why do you think
> >>> we need to remove sched_load_balance from root level? Won't we end up
> >>> having tasks put on isolated sets?
> >> The root cgroup is special that it owns all the resources in the system.
> >> We generally don't want restriction be put on the root cgroup. A child
> >> cgroup has to be created to have constraints put on it. In fact, most of
> >> the controller files don't show up in the v2 cgroup root at all.
> >>
> >> An isolated cgroup has to be put under root, e.g.
> >>
> >>   Root
> >>  /\
> >> isolated  balanced
> >>
> >>> Also, I guess children groups with more than one CPU will need to be
> >>> able to load balance across their CPUs, no matter what their parent
> >>> group does?
> >> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
> >> be used by a certain application that makes its own scheduling decision
> >> by placing tasks explicitly on specific CPUs. It just doesn't make sense
> >> to have a CPU in an isolated cpuset to participated in load balancing in
> >> another cpuset. If one want load balancing in a child cpuset, the parent
> >> cpuset should have load balancing turned on as well.
> > Isolated with CPUs overlapping some other cpuset makes little sense, I
> > agree. What I have in mind however is an isolated set of CPUs that don't
> > overlap with any other cpuset (as your balanced set above). In this case
> > I think it makes sense to let the sys admin decide if "automatic" load
> > balancing has to be performed (by the scheduler) or no load balacing at
> > all has to take place?
> >
> > Further extending your example:
> >
> >  Root [0-3]
> >  /\
> > group1 [0-1] group2[2-3]
> >
> > Why should we prevent load balancing to be disabled at root level (so
> > that for example tasks still residing in root group are not freely
> > migrated around, potentially disturbing both sub-groups)?
> >
> > Then one can decide that group1 is a "userspace managed" group (no load
> > balancing takes place) and group2 is balanced by the scheduler.
> >
> > And this is not DEADLINE specific, IMHO.
> >
> >> As I look into the code, it seems like root domain is probably somewhat
> >> associated with cpu_exclusive only. Whether sched_load_balance is set
> >> doesn't really matter.  I will need to look further on the conditions
> >> where a new root domain is created.
> > I checked again myself (sched domains code is always a maze :) and I
> > believe that sched_load_balance flag indeed controls domains (sched and
> > root) creation and configuration . Changing the flag triggers potential
> > rebuild and separed sched/root domains are generated if subgroups have
> > non overlapping cpumasks.  cpu_exclusive only enforces this latter
> > condition.
> 
> Right, I ran some tests and figured out that to have root_domain in the
> child cgroup level, we do need to disable load balancing at the root
> cgroup level and enabling it in child cgroups that are mutually disjoint
> in their cpu lists. The cpu_exclusive flag isn't really needed.

It seems to make little sense at root level indeed. 

> I am not against doing that at the root cgroup, but it is kind of weird
> in term of semantics. If we disable load balancing in the root cgroup,
> but enabling it at child cgroups, what does that mean to the processes
> that are still in the root cgroup?

It might be up to the different scheduling classes I guess. See more on
this below.

> The sched_load_balance flag isn't something that is passed to the
> scheduler. It only only affects the CPU topology of the system. So I
> suspect that a process in the root cgroup will be load balanced among
> the CPUs in the one of the child cgroups. That doesn't look right unless
> we enforce that no process can be in the root cgroup in this case.
> 
> Real cpu isolation will then require that we disable load balancing at
> root, and enable load balancing in child cgroups that only contain CPUs
> outside of the isolated CPU list. Again, it is still possible that some
> tasks in the root cgroup, if present, may be using some of the isolated
> CPUs.

So, for DEADLINE this is currently a 

Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-26 Thread Juri Lelli
On 23/03/18 14:44, Waiman Long wrote:
> On 03/23/2018 03:59 AM, Juri Lelli wrote:

[...]

> > OK, thanks for confirming. Can you tell again however why do you think
> > we need to remove sched_load_balance from root level? Won't we end up
> > having tasks put on isolated sets?
> 
> The root cgroup is special that it owns all the resources in the system.
> We generally don't want restriction be put on the root cgroup. A child
> cgroup has to be created to have constraints put on it. In fact, most of
> the controller files don't show up in the v2 cgroup root at all.
> 
> An isolated cgroup has to be put under root, e.g.
> 
>   Root
>  /\
> isolated  balanced
> 
> >
> > Also, I guess children groups with more than one CPU will need to be
> > able to load balance across their CPUs, no matter what their parent
> > group does?
> 
> The purpose of an isolated cpuset is to have a dedicated set of CPUs to
> be used by a certain application that makes its own scheduling decision
> by placing tasks explicitly on specific CPUs. It just doesn't make sense
> to have a CPU in an isolated cpuset to participated in load balancing in
> another cpuset. If one want load balancing in a child cpuset, the parent
> cpuset should have load balancing turned on as well.

Isolated with CPUs overlapping some other cpuset makes little sense, I
agree. What I have in mind however is an isolated set of CPUs that don't
overlap with any other cpuset (as your balanced set above). In this case
I think it makes sense to let the sysadmin decide whether "automatic"
load balancing has to be performed (by the scheduler) or no load
balancing at all has to take place.

Further extending your example:

          Root [0-3]
         /          \
 group1 [0-1]   group2 [2-3]

Why should we prevent load balancing to be disabled at root level (so
that for example tasks still residing in root group are not freely
migrated around, potentially disturbing both sub-groups)?

Then one can decide that group1 is a "userspace managed" group (no load
balancing takes place) and group2 is balanced by the scheduler.

And this is not DEADLINE specific, IMHO.

> As I look into the code, it seems like root domain is probably somewhat
> associated with cpu_exclusive only. Whether sched_load_balance is set
> doesn't really matter.  I will need to look further on the conditions
> where a new root domain is created.

I checked again myself (sched domains code is always a maze :) and I
believe that the sched_load_balance flag indeed controls domain (sched
and root) creation and configuration. Changing the flag triggers a
potential rebuild, and separate sched/root domains are generated if
subgroups have non-overlapping cpumasks. cpu_exclusive only enforces
this latter condition.
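
The conditions above can be sketched as a toy predicate (illustrative
only — the names are invented for the example, and there is no actual
cgroup interaction):

```python
def gets_own_root_domain(root_load_balance, siblings):
    """Toy model: balanced child cpusets get separate root_domains when
    the parent doesn't load balance and the balanced siblings have
    pairwise non-overlapping cpumasks.

    `siblings` is a list of (cpus, load_balance) pairs, cpus as sets.
    """
    if root_load_balance:
        return False
    masks = [cpus for cpus, lb in siblings if lb]
    return all(a.isdisjoint(b)
               for i, a in enumerate(masks) for b in masks[i + 1:])

# Root [0-3] with balancing off; group1 [0-1] and group2 [2-3] balanced:
print(gets_own_root_domain(False, [({0, 1}, True), ({2, 3}, True)]))  # True
# Overlapping children do not get separate root domains:
print(gets_own_root_domain(False, [({0, 1}, True), ({1, 2}, True)]))  # False
```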

Best,

- Juri


Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-23 Thread Juri Lelli
On 22/03/18 17:50, Waiman Long wrote:
> On 03/22/2018 04:41 AM, Juri Lelli wrote:
> > On 21/03/18 12:21, Waiman Long wrote:

[...]

> >> +  cpuset.sched_load_balance
> >> +  A read-write single value file which exists on non-root cgroups.
> >> +  The default is "1" (on), and the other possible value is "0"
> >> +  (off).
> >> +
> >> +  When it is on, tasks within this cpuset will be load-balanced
> >> +  by the kernel scheduler.  Tasks will be moved from CPUs with
> >> +  high load to other CPUs within the same cpuset with less load
> >> +  periodically.
> >> +
> >> +  When it is off, there will be no load balancing among CPUs on
> >> +  this cgroup.  Tasks will stay in the CPUs they are running on
> >> +  and will not be moved to other CPUs.
> >> +
> >> +  This flag is hierarchical and is inherited by child cpusets. It
> >> +  can be turned off only when the CPUs in this cpuset aren't
> >> +  listed in the cpuset.cpus of other sibling cgroups, and all
> >> +  the child cpusets, if present, have this flag turned off.
> >> +
> >> +  Once it is off, it cannot be turned back on as long as the
> >> +  parent cgroup still has this flag in the off state.
> >> +
> > I'm afraid that this will not work for SCHED_DEADLINE (at least for how
> > it is implemented today). As you can see in Documentation [1] the only
> > way a user has to perform partitioned/clustered scheduling is to create
> > subset of exclusive cpusets and then assign deadline tasks to them. The
> > other thing to take into account here is that a root_domain is created
> > for each exclusive set and we use such root_domain to keep information
> > about admitted bandwidth and speed up load balancing decisions (there is
> > a max heap tracking deadlines of active tasks on each root_domain).
> > Now, AFAIR distinct root_domain(s) are created when parent group has
> > sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
> > is). So, what we normally do is create, say, cpus_exclusive groups for
> > the different clusters and then disable sched_load_balance at root level
> > (so that each cluster gets its own root_domain). Also,
> > sched_load_balance is enabled in children groups (as load balancing
> > inside clusters is what we actually needed :).
> 
> That looks like an undocumented side effect to me. I would rather see an
> explicit control file that enable root_domain and break it free from
> cpu_exclusive && !sched_load_balance, e.g. sched_root_domain(?).

Mmm, it actually makes some sort of sense to me that as long as parent
groups can't load balance (because !sched_load_balance) and this group
can't have CPUs overlapping with some other group (because
cpu_exclusive) a data structure (root_domain) is created to handle load
balancing for this isolated subsystem. I agree that it should be better
documented, though.

> > IIUC your proposal this will not be permitted with cgroup v2 because
> > sched_load_balance won't be present at root level and children groups
> > won't be able to set sched_load_balance back to 1 if that was set to 0
> > in some parent. Is that true?
> 
> Yes, that is the current plan.

OK, thanks for confirming. Can you tell again however why do you think
we need to remove sched_load_balance from root level? Won't we end up
having tasks put on isolated sets?

Also, I guess children groups with more than one CPU will need to be
able to load balance across their CPUs, no matter what their parent
group does?

Thanks,

- Juri


Re: [PATCH v6 2/2] cpuset: Add cpuset.sched_load_balance to v2

2018-03-22 Thread Juri Lelli
Hi Waiman,

On 21/03/18 12:21, Waiman Long wrote:
> The sched_load_balance flag is needed to enable CPU isolation similar
> to what can be done with the "isolcpus" kernel boot parameter.
> 
> The sched_load_balance flag implies an implicit !cpu_exclusive as
> it doesn't make sense to have an isolated CPU being load-balanced in
> another cpuset.
> 
> For v2, this flag is hierarchical and is inherited by child cpusets. It
> is not allowed to have this flag turn off in a parent cpuset, but on
> in a child cpuset.
> 
> This flag is set by the parent and is not delegatable.
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/cgroup-v2.txt | 22 ++
>  kernel/cgroup/cpuset.c  | 56 
> +++--
>  2 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index ed8ec66..c970bd7 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>   it is a subset of "cpuset.mems".  Its value will be affected
>   by memory nodes hotplug events.
>  
> +  cpuset.sched_load_balance
> + A read-write single value file which exists on non-root cgroups.
> + The default is "1" (on), and the other possible value is "0"
> + (off).
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler.  Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup.  Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + This flag is hierarchical and is inherited by child cpusets. It
> + can be turned off only when the CPUs in this cpuset aren't
> + listed in the cpuset.cpus of other sibling cgroups, and all
> + the child cpusets, if present, have this flag turned off.
> +
> + Once it is off, it cannot be turned back on as long as the
> + parent cgroup still has this flag in the off state.
> +

I'm afraid that this will not work for SCHED_DEADLINE (at least for how
it is implemented today). As you can see in Documentation [1] the only
way a user has to perform partitioned/clustered scheduling is to create
subset of exclusive cpusets and then assign deadline tasks to them. The
other thing to take into account here is that a root_domain is created
for each exclusive set and we use such root_domain to keep information
about admitted bandwidth and speed up load balancing decisions (there is
a max heap tracking deadlines of active tasks on each root_domain).

Now, AFAIR distinct root_domain(s) are created when parent group has
sched_load_balance disabled and cpus_exclusive set (in cgroup v1 that
is). So, what we normally do is create, say, cpus_exclusive groups for
the different clusters and then disable sched_load_balance at root level
(so that each cluster gets its own root_domain). Also,
sched_load_balance is enabled in children groups (as load balancing
inside clusters is what we actually needed :).

IIUC your proposal this will not be permitted with cgroup v2 because
sched_load_balance won't be present at root level and children groups
won't be able to set sched_load_balance back to 1 if that was set to 0
in some parent. Is that true?

Look, the way things work today is most probably not perfect (just to
say one thing, we need to disable load balancing for all classes at root
level just because DEADLINE wants to set restricted affinities for its
tasks :/) and we could probably think about how to change how this all
works. So, let's first see if I understand what you are proposing (and its
implications). :)

Best,

- Juri

[1] 
https://elixir.bootlin.com/linux/v4.16-rc6/source/Documentation/scheduler/sched-deadline.txt#L640


[PATCH 1/2] Documentation/locking/lockdep: update info about states

2018-02-13 Thread Juri Lelli
Commit d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation") removed
reclaim_fs lockdep STATE.

Reflect the change in documentation.

While we are at it, also clarify formula to calculate number of state
bits.

Signed-off-by: Juri Lelli 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: linux-ker...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/locking/lockdep-design.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/locking/lockdep-design.txt 
b/Documentation/locking/lockdep-design.txt
index 9de1c158d44c..e341c2f34e68 100644
--- a/Documentation/locking/lockdep-design.txt
+++ b/Documentation/locking/lockdep-design.txt
@@ -27,7 +27,8 @@ lock-class.
 State
 -
 
-The validator tracks lock-class usage history into 4n + 1 separate state bits:
+The validator tracks lock-class usage history into 4 * nSTATEs + 1 separate
+state bits:
 
 - 'ever held in STATE context'
 - 'ever held as readlock in STATE context'
@@ -37,7 +38,6 @@ The validator tracks lock-class usage history into 4n + 1 
separate state bits:
 Where STATE can be either one of (kernel/locking/lockdep_states.h)
  - hardirq
  - softirq
- - reclaim_fs
 
 - 'ever used'   [ == !unused]
 
-- 
2.14.3
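
As a worked instance of the updated formula: with the two remaining
STATEs, the validator ends up tracking nine bits per lock class. A quick
sketch (the four per-STATE bits listed here follow the lockdep doc; this
is only a numeric illustration, not kernel code):

```python
# After commit d92a8cfcb37e the STATEs are hardirq and softirq only.
states = ["hardirq", "softirq"]

# 4 usage bits per STATE: ever held in STATE context, ever held as
# readlock in STATE context, ever enabled, ever enabled as readlock.
bits_per_state = 4

n_state_bits = bits_per_state * len(states) + 1  # +1 for 'ever used'
print(n_state_bits)  # 9
```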



[RFC PATCH 2/2] Documentation/locking/lockdep: Add section about available annotations

2018-02-13 Thread Juri Lelli
Add section about annotations that can be used to perform additional runtime
checking of locking correctness: assert that certain locks are held and
prevent accidental unlocking.

Signed-off-by: Juri Lelli 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: linux-ker...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/locking/lockdep-design.txt | 47 
 1 file changed, 47 insertions(+)

diff --git a/Documentation/locking/lockdep-design.txt 
b/Documentation/locking/lockdep-design.txt
index e341c2f34e68..74347a24efc7 100644
--- a/Documentation/locking/lockdep-design.txt
+++ b/Documentation/locking/lockdep-design.txt
@@ -169,6 +169,53 @@ Note: When changing code to use the _nested() primitives, 
be careful and
 check really thoroughly that the hierarchy is correctly mapped; otherwise
 you can get false positives or false negatives.
 
+Annotations
+---
+
+Two constructs can be used to annotate and check where and if certain locks
+must be held: lockdep_assert_held*(&lock) and lockdep_*pin_lock(&lock).
+
+As the name suggests, the lockdep_assert_held* family of macros asserts that a
+particular lock is held at a certain time (and generates a WARN otherwise).
+This annotation is widely used all over the kernel, e.g. kernel/sched/
+core.c
+
+  void update_rq_clock(struct rq *rq)
+  {
+   s64 delta;
+
+   lockdep_assert_held(&rq->lock);
+   [...]
+  }
+
+where holding rq->lock is required to safely update a rq's clock.
+
+The other family of macros is lockdep_*pin_lock, which is admittedly only
+used for rq->lock ATM. Despite their limited adoption these annotations
+generate a WARN if the lock of interest is "accidentally" unlocked. This turns
+out to be especially helpful to debug code with callbacks, where an upper
+layer assumes a lock remains taken, but a lower layer thinks it can maybe drop
+and reacquire the lock ("unwittingly" introducing races). lockdep_pin_lock
+returns a 'struct pin_cookie' that is then used by lockdep_unpin_lock to check
+that nobody tampered with the lock, e.g. kernel/sched/sched.h
+
+  static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
+  {
+   rf->cookie = lockdep_pin_lock(&rq->lock);
+   [...]
+  }
+
+  static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
+  {
+   [...]
+   lockdep_unpin_lock(&rq->lock, rf->cookie);
+  }
+
+While comments about locking requirements might provide useful information,
+the runtime checks performed by annotations are invaluable when debugging
+locking problems and they carry the same level of detail when inspecting
+code.  Always prefer annotations when in doubt!
+
 Proof of 100% correctness:
 --
 
-- 
2.14.3
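
For readers outside the kernel, the lockdep_assert_held() idea can be
sketched in userspace with Python's threading primitives. This is purely
illustrative: real lockdep verifies that the *current* task holds the
lock and feeds into the full dependency validator, while this crude
analogue only checks that somebody holds it:

```python
import threading

rq_lock = threading.Lock()
rq_clock = 0

def assert_held(lock):
    # Crude analogue of lockdep_assert_held(): fail loudly if the
    # lock is not held at all.
    assert lock.locked(), "caller must hold the lock"

def update_rq_clock():
    global rq_clock
    assert_held(rq_lock)  # document and enforce the locking rule
    rq_clock += 1

with rq_lock:
    update_rq_clock()
print(rq_clock)  # 1

# Calling without the lock trips the assertion:
try:
    update_rq_clock()
except AssertionError as e:
    print("caught:", e)
```

As with the kernel macro, the point is that the locking requirement is
both documented and checked at runtime.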



[PATCH 0/2] Update lockdep doc and add section about annotations

2018-02-13 Thread Juri Lelli
Hi,

a couple of patches to lockdep docs.

First one is a small fix for that fact that Peter made documentation
stale (as usual :P).

Second one is an RFC. I thought that adding some info about lockdep
asserts might help spread adoption even wider (mostly regarding
lockdep_pin_lock stuff, which is pretty new and maybe got unnoticed
outside of the scheduler?).

Best,

- Juri 

Juri Lelli (2):
  Documentation/locking/lockdep: update info about states
  Documentation/locking/lockdep: Add section about available annotations

 Documentation/locking/lockdep-design.txt | 51 ++--
 1 file changed, 49 insertions(+), 2 deletions(-)

-- 
2.14.3



[RFC PATCH 3/3] Documentation/scheduler/sched-deadline: add info about cgroup support

2018-02-12 Thread Juri Lelli
Add documentation for SCHED_DEADLINE cgroup support (CONFIG_DEADLINE_
GROUP_SCHED config option).

Signed-off-by: Juri Lelli 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Jonathan Corbet 
Cc: Luca Abeni 
Cc: linux-ker...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/scheduler/sched-deadline.txt | 36 ++
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/Documentation/scheduler/sched-deadline.txt 
b/Documentation/scheduler/sched-deadline.txt
index 8ce78f82ae23..65d55c778976 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -528,11 +528,8 @@ CONTENTS
  to -deadline tasks is similar to the one already used for -rt
  tasks with real-time group scheduling (a.k.a. RT-throttling - see
  Documentation/scheduler/sched-rt-group.txt), and is based on readable/
- writable control files located in procfs (for system wide settings).
- Notice that per-group settings (controlled through cgroupfs) are still not
- defined for -deadline tasks, because more discussion is needed in order to
- figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
- level.
+ writable control files located in procfs (for system wide settings) and in
+ cgroupfs (per-group settings).
 
  A main difference between deadline bandwidth management and RT-throttling
  is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
@@ -553,9 +550,9 @@ CONTENTS
  For now the -rt knobs are used for -deadline admission control and the
  -deadline runtime is accounted against the -rt runtime. We realize that this
  isn't entirely desirable; however, it is better to have a small interface for
- now, and be able to change it easily later. The ideal situation (see 5.) is to
- run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
- direct subset of dl_bw.
+ now, and be able to change it easily later. The ideal situation (see 6.) is to
+ run -rt tasks from a -deadline server (H-CBS); in which case the -rt bandwidth
+ is a direct subset of dl_bw.
 
  This means that, for a root_domain comprising M CPUs, -deadline tasks
  can be created while the sum of their bandwidths stays below:
@@ -623,6 +620,27 @@ CONTENTS
  make the leftoever runtime available for reclamation by other
  SCHED_DEADLINE tasks.
 
+4.4 Grouping tasks
+--
+
+CONFIG_DEADLINE_GROUP_SCHED depends on CONFIG_RT_GROUP_SCHED, so go on and
+read Documentation/scheduler/sched-rt-group.txt first.
+
+Enabling CONFIG_DEADLINE_GROUP_SCHED lets you explicitly manage CPU bandwidth
+for task groups.
+
+This uses the cgroup virtual file system and "/cpu.rt_runtime_us
+/cpu.rt_period_us" to control the CPU time reserved for each control
+group. Yes, they are the same as in CONFIG_RT_GROUP_SCHED, since RT and DEADLINE
+share the same bandwidth. In addition to these, CONFIG_DEADLINE_GROUP_SCHED
+adds "/cpu.dl_bw" (maximum bandwidth on each CPU available to the
+group, corresponds to cpu.rt_runtime_us/cpu.rt_period_us) and
+"/cpu.dl_total_bw" (a group's current allocated bandwidth); both are
+non-writable.
+
+Group settings are checked against the same limits as CONFIG_RT_GROUP_SCHED:
+
+   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
 
 5. Tasks CPU affinity
 =
@@ -661,7 +679,7 @@ CONTENTS
 of retaining bandwidth isolation among non-interacting tasks. This is
 being studied from both theoretical and practical points of view, and
 hopefully we should be able to produce some demonstrative code soon;
-  - (c)group based bandwidth management, and maybe scheduling;
+  - (c)group based scheduling (Hierarchical-CBS);
   - access control for non-root users (and related security concerns to
 address), which is the best way to allow unprivileged use of the mechanisms
 and how to prevent non-root users "cheat" the system?
-- 
2.14.3
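
As an illustrative, non-kernel sketch of the admission test in the patch
above (function and variable names are invented for the example): since
both sides of the constraint share global_period, the check reduces to
comparing the summed runtimes against the global runtime.

```python
def dl_group_admissible(runtimes_us, global_runtime_us, global_period_us):
    """Toy model of:

        \\Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

    Multiplying both sides by global_period leaves a plain comparison
    of the summed per-group runtimes against the global runtime.
    """
    assert all(0 <= r <= global_period_us for r in runtimes_us)
    return sum(runtimes_us) <= global_runtime_us

# With the default rt limits (950ms runtime every 1s), three groups
# reserving 300ms each fit; a fourth one does not.
print(dl_group_admissible([300000] * 3, 950000, 1000000))  # True
print(dl_group_admissible([300000] * 4, 950000, 1000000))  # False
```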



[PATCH] Documentation/locking/mutex-design: update to reflect latest changes

2018-02-09 Thread Juri Lelli
Commit 3ca0ff571b09 ("locking/mutex: Rework mutex::owner") reworked the
basic mutex implementation to deal with several problems. Documentation
was however left unchanged and became stale.

Update mutex-design.txt to reflect changes introduced by the above commit.

Signed-off-by: Juri Lelli 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Davidlohr Bueso 
Cc: linux-ker...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/locking/mutex-design.txt | 49 --
 1 file changed, 17 insertions(+), 32 deletions(-)

diff --git a/Documentation/locking/mutex-design.txt 
b/Documentation/locking/mutex-design.txt
index 60c482df1a38..42de80cb2449 100644
--- a/Documentation/locking/mutex-design.txt
+++ b/Documentation/locking/mutex-design.txt
@@ -21,37 +21,23 @@ Implementation
 --
 
 Mutexes are represented by 'struct mutex', defined in include/linux/mutex.h
-and implemented in kernel/locking/mutex.c. These locks use a three
-state atomic counter (->count) to represent the different possible
-transitions that can occur during the lifetime of a lock:
-
- 1: unlocked
- 0: locked, no waiters
-   negative: locked, with potential waiters
-
-In its most basic form it also includes a wait-queue and a spinlock
-that serializes access to it. CONFIG_SMP systems can also include
-a pointer to the lock task owner (->owner) as well as a spinner MCS
-lock (->osq), both described below in (ii).
+and implemented in kernel/locking/mutex.c. These locks use an atomic variable
+(->owner) to keep track of the lock state during its lifetime.  The owner field
+actually contains a 'struct task_struct *' pointing to the current lock owner,
+and is therefore NULL if the lock is not currently owned. Since task_struct
+pointers are aligned to at least L1_CACHE_BYTES, the 3 low bits are used to
+store extra state (e.g., whether the waiter list is non-empty).  In its most
+basic form it also includes a
+wait-queue and a spinlock that serializes access to it. Furthermore,
+CONFIG_MUTEX_SPIN_ON_OWNER systems use a spinner MCS lock (->osq), described
+below in (ii).
 
 When acquiring a mutex, there are three possible paths that can be
 taken, depending on the state of the lock:
 
-(i) fastpath: tries to atomically acquire the lock by decrementing the
-counter. If it was already taken by another task it goes to the next
-possible path. This logic is architecture specific. On x86-64, the
-locking fastpath is 2 instructions:
-
-0e10 :
-e21:   f0 ff 0block decl (%rbx)
-e24:   79 08   jnse2e 
-
-   the unlocking fastpath is equally tight:
-
-0bc0 :
-bc8:   f0 ff 07lock incl (%rdi)
-bcb:   7f 0a   jg bd7 
-
+(i) fastpath: tries to atomically acquire the lock by cmpxchg'ing the owner
+with the current task. This only works in the uncontended case (cmpxchg checks
+against 0UL, so all 3 state bits above have to be 0). If the lock is
+contended it goes to the next possible path.
 
 (ii) midpath: aka optimistic spinning, tries to spin for acquisition
  while the lock owner is running and there are no other tasks ready
@@ -143,11 +129,10 @@ Test if the mutex is taken:
 Disadvantages
 -
 
-Unlike its original design and purpose, 'struct mutex' is larger than
-most locks in the kernel. E.g: on x86-64 it is 40 bytes, almost twice
-as large as 'struct semaphore' (24 bytes) and tied, along with rwsems,
-for the largest lock in the kernel. Larger structure sizes mean more
-CPU cache and memory footprint.
+Unlike its original design and purpose, 'struct mutex' is among the largest
+locks in the kernel. E.g.: on x86-64 it is 32 bytes, while 'struct semaphore'
+is 24 bytes and rw_semaphore is 40 bytes. Larger structure sizes mean more CPU
+cache and memory footprint.
 
 When to use mutexes
-------------------
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Documentation/scheduler/sched-rt-group: Update SCHED DEADLINE references

2018-02-01 Thread Juri Lelli
Hi Daniel,

On 01/02/18 11:31, Daniel Bristot de Oliveira wrote:
> The documentation was mentioning the "future SCHED EDF" as the
> solution for fine-grained control of deadline/period. This patch
> updates this citing the (now) existing SCHED_DEADLINE.
> 
> Signed-off-by: Daniel Bristot de Oliveira 
> Cc: Jonathan Corbet 
> Cc: Juri Lelli 
> Cc: Luca Abeni 
> Cc: Tommaso Cucinotta 
> Cc: Claudio Scordino 
> Cc: Peter Zijlstra 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> 
> ---
>  Documentation/scheduler/sched-rt-group.txt | 19 +--
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-rt-group.txt 
> b/Documentation/scheduler/sched-rt-group.txt
> index d8fce3e78457..ed763c243914 100644
> --- a/Documentation/scheduler/sched-rt-group.txt
> +++ b/Documentation/scheduler/sched-rt-group.txt
> @@ -153,7 +153,7 @@ There is work in progress to make the scheduling period 
> for each group
>  
>  The constraint on the period is that a subgroup must have a smaller or
>  equal period to its parent. But realistically its not very useful _yet_
> -as its prone to starvation without deadline scheduling.
> +as it's prone to starvation.
>  
>  Consider two sibling groups A and B; both have 50% bandwidth, but A's
>  period is twice the length of B's.
> @@ -168,16 +168,7 @@ This means that currently a while (1) loop in A will run 
> for the full period of
>  B and can starve B's tasks (assuming they are of lower priority) for a whole
>  period.
>  
> -The next project will be SCHED_EDF (Earliest Deadline First scheduling) to 
> bring
> -full deadline scheduling to the linux kernel. Deadline scheduling the above
> -groups and treating end of the period as a deadline will ensure that they 
> both
> -get their allocated time.
> -
> -Implementing SCHED_EDF might take a while to complete. Priority Inheritance 
> is
> -the biggest challenge as the current linux PI infrastructure is geared 
> towards
> -the limited static priority levels 0-99. With deadline scheduling you need to
> -do deadline inheritance (since priority is inversely proportional to the
> -deadline delta (deadline - now)).
> -
> -This means the whole PI machinery will have to be reworked - and that is one 
> of
> -the most complex pieces of code we have.
> +Nowadays it is possible to use the deadline scheduler (SCHED_DEADLINE) to
> +allocate runtime/period for tasks. The deadline scheduler does not suffer the
> +side effects explained above. For more information about the deadline
> +scheduler, you should read Documentation/scheduler/sched-deadline.txt.

True. But it doesn't have support for groups yet (while this doc seems
to refer to groups). Maybe add a paragraph about this?

Thanks,

- Juri