Re: [RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

2020-05-07 Thread John Mathew
On Thu, May 7, 2020 at 6:41 AM Randy Dunlap  wrote:
>
> Hi--
>
> On 5/6/20 7:39 AM, john mathew wrote:
> > From: John Mathew 
> >
> > Add documentation for
> >  -scheduler overview
> >  -scheduler state transition
> >  -CFS overview
> >  -scheduler data structs
> >
> > Add rst for scheduler APIs and modify sched/core.c
> > to add kernel-doc comments.
> >
> > Suggested-by: Lukas Bulwahn 
> > Co-developed-by: Mostafa Chamanara 
> > Signed-off-by: Mostafa Chamanara 
> > Co-developed-by: Oleg Tsymbal 
> > Signed-off-by: Oleg Tsymbal 
> > Signed-off-by: John Mathew 
> > ---
> >  Documentation/scheduler/cfs-overview.rst  | 110 +++
> >  Documentation/scheduler/index.rst |   3 +
> >  Documentation/scheduler/overview.rst  | 269 ++
> >  .../scheduler/sched-data-structs.rst  | 253 
> >  Documentation/scheduler/scheduler-api.rst |  30 ++
> >  kernel/sched/core.c   |  28 +-
> >  kernel/sched/sched.h  | 169 ++-
> >  7 files changed, 855 insertions(+), 7 deletions(-)
> >  create mode 100644 Documentation/scheduler/cfs-overview.rst
> >  create mode 100644 Documentation/scheduler/sched-data-structs.rst
> >  create mode 100644 Documentation/scheduler/scheduler-api.rst
> >
> > Request review from Valentin Schneider 
> > for the section describing Scheduler classes in:
> > .../scheduler/sched-data-structs.rst
> >
> > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> > new file mode 100644
> > index ..50d94b9bb60e
> > --- /dev/null
> > +++ b/Documentation/scheduler/cfs-overview.rst
> > @@ -0,0 +1,110 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +============
> > +CFS Overview
> > +============
> > +
> > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> > +CFS design is provided in :doc:`sched-design-CFS`.
> > +
> > +In addition, there have been many improvements to the CFS, a few of which are:
> > +
> > +**Thermal Pressure**:
> > +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> > +Thermal pressure on a CPU means this maximum possible capacity is
> > +unavailable due to thermal events. Average thermal pressure for a CPU
> > +is now subtracted from its maximum possible capacity so that cpu_capacity
> > +reflects the remaining maximum capacity.
> > +
> > +**Use Idle CPU for NUMA balancing**:
> > +Idle CPU is used as a migration target instead of comparing tasks.
> > +Information on an idle core is cached while gathering statistics
> > +and this is used to avoid a second scan of the node runqueues if load is
> > +not imbalanced. Preference is given to an idle core rather than an
> > +idle SMT sibling to avoid packing HT siblings due to linearly scanning
> > +the node cpumask.  Multiple tasks can attempt to select and idle CPU but
> > +fail, in this case instead of failing, an alternative idle CPU scanned.
>
> I'm having problems parsing that last sentence above.
Fixed as follows in v3:
Multiple tasks can attempt to select an idle CPU but fail because a NUMA
balance is active on that CPU; in that case, instead of failing, an
alternative idle CPU is scanned.
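Roughly, the retry can be pictured like this (a simplified standalone sketch
of the idea only, not the fair.c implementation; the arrays and
pick_numa_target() below are invented for illustration):

#include <stdbool.h>

#define NR_CPUS 8

/* Invented stand-ins for per-CPU state, for illustration only. */
static bool cpu_idle_flag[NR_CPUS];
static bool numa_migrate_on[NR_CPUS];	/* a NUMA balance already targets this CPU */

/* Use dst if it is idle and free, otherwise scan the node for another idle CPU. */
static int pick_numa_target(int dst, const int *node_cpus, int nr_node_cpus)
{
	if (cpu_idle_flag[dst] && !numa_migrate_on[dst])
		return dst;

	for (int i = 0; i < nr_node_cpus; i++) {
		int cpu = node_cpus[i];

		if (cpu == dst || !cpu_idle_flag[cpu] || numa_migrate_on[cpu])
			continue;
		return cpu;		/* alternative idle CPU */
	}
	return -1;			/* no free idle CPU on this node */
}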
>
> > +
> > +**Asymmetric CPU capacity wakeup scan**:
> > +The previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES
> > +domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE
> > +systems (DynamIQ) which can accommodate CPUs of different compute capacity
> > +within a single LLC domain. A new idle sibling helper function was added
> > +which took CPU capacity in to account. The policy is to pick the first idle
>
>into
Fixed in v3.
>
> > +CPU which is big enough for the task (task_util * margin < cpu_capacity).
>
> why not <= ?
This is how it is implemented in fair.c:
/*
 * The margin used when comparing utilization with CPU capacity.
 *
 * (default: ~20%)
 */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
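As a standalone illustration of the check and of the boundary case (not
kernel code; pick_idle_cpu() and the example capacities are invented for
this sketch):

#include <stdio.h>

/* Same check as fits_capacity() above: utilization must stay below ~80% of
 * the CPU's capacity. The strict '<' means the exact 80% boundary is treated
 * as not fitting (e.g. util 800 on a CPU of capacity 1000:
 * 800 * 1280 == 1000 * 1024). */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

/* Invented helper: first idle CPU that fits the task, otherwise the idle
 * CPU with the highest capacity. */
static int pick_idle_cpu(unsigned long util, const unsigned long *capacity,
			 const int *idle, int nr_cpus)
{
	int best = -1;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (!idle[cpu])
			continue;
		if (fits_capacity(util, capacity[cpu]))
			return cpu;
		if (best < 0 || capacity[cpu] > capacity[best])
			best = cpu;
	}
	return best;
}

int main(void)
{
	unsigned long capacity[4] = { 446, 446, 1024, 1024 };	/* LITTLE/big */
	int idle[4] = { 1, 1, 0, 1 };

	printf("%d\n", pick_idle_cpu(300, capacity, idle, 4));	/* 0: fits a LITTLE CPU */
	printf("%d\n", pick_idle_cpu(800, capacity, idle, 4));	/* 3: needs a big CPU */
	return 0;
}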
>
> > +If no idle CPU is big enough, the idle CPU with the highest capacity was
>
> s/was/is/
Fixed in v3.
>
> > +picked.
> > +
> > +**Optimized idle core selection**:
> > +Previously all threads of a core were looped through to evaluate if the
> > +core is idle or not. This was unnecessary. If a thread of a core is not
> > +idle, skip evaluating other threads of a core. Also while clearing the
> > +cpumask, bits of all CPUs of a core can be cleared in one-shot.
>
>   in one shot.
Fixed in v3.
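For readers, a standalone sketch of that optimization (illustration only,
not the select_idle_core() code; the bitmask layout and find_idle_core()
are invented here):

#include <stdbool.h>
#include <stdio.h>

#define THREADS_PER_CORE 2

/* Invented example state: one bit per CPU, two hyperthreads per core. */
static unsigned int scan_mask = 0xffu;		/* CPUs still to be scanned */
static unsigned int idle_bits = 0x3cu;		/* CPUs 2-5 are idle */

static bool cpu_is_idle(int cpu)
{
	return idle_bits & (1u << cpu);
}

/* Return the first fully idle core, or -1 if there is none. */
static int find_idle_core(int nr_cores)
{
	for (int core = 0; core < nr_cores; core++) {
		int first = core * THREADS_PER_CORE;
		unsigned int core_bits = 0x3u << first;
		bool idle = true;

		for (int t = 0; t < THREADS_PER_CORE; t++) {
			if (!cpu_is_idle(first + t)) {
				idle = false;
				break;		/* skip the remaining siblings */
			}
		}

		/* clear all of this core's CPUs from the mask in one shot */
		scan_mask &= ~core_bits;

		if (idle)
			return core;
	}
	return -1;
}

int main(void)
{
	printf("%d\n", find_idle_core(4));	/* prints 1: CPUs 2 and 3 are idle */
	return 0;
}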
>
> > +
> > +**Load balance aggressively for SCHED_IDLE CPUs**:
> > +The fair scheduler performs periodic load balance on every CPU to check
> > +if it can pull some tasks from other busy CPUs. The duration of this
> > +periodic load balance is set to the scheduler domain's balance_interval and
> > +multiplied by a busy_factor 

Re: [RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

2020-05-06 Thread Randy Dunlap
Hi--

On 5/6/20 7:39 AM, john mathew wrote:
> From: John Mathew 
> 
> Add documentation for
>  -scheduler overview
>  -scheduler state transition
>  -CFS overview
>  -scheduler data structs
> 
> Add rst for scheduler APIs and modify sched/core.c
> to add kernel-doc comments.
> 
> Suggested-by: Lukas Bulwahn 
> Co-developed-by: Mostafa Chamanara 
> Signed-off-by: Mostafa Chamanara 
> Co-developed-by: Oleg Tsymbal 
> Signed-off-by: Oleg Tsymbal 
> Signed-off-by: John Mathew 
> ---
>  Documentation/scheduler/cfs-overview.rst  | 110 +++
>  Documentation/scheduler/index.rst |   3 +
>  Documentation/scheduler/overview.rst  | 269 ++
>  .../scheduler/sched-data-structs.rst  | 253 
>  Documentation/scheduler/scheduler-api.rst |  30 ++
>  kernel/sched/core.c   |  28 +-
>  kernel/sched/sched.h  | 169 ++-
>  7 files changed, 855 insertions(+), 7 deletions(-)
>  create mode 100644 Documentation/scheduler/cfs-overview.rst
>  create mode 100644 Documentation/scheduler/sched-data-structs.rst
>  create mode 100644 Documentation/scheduler/scheduler-api.rst
> 
> Request review from Valentin Schneider  
> for the section describing Scheduler classes in:
> .../scheduler/sched-data-structs.rst
> 
> diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> new file mode 100644
> index ..50d94b9bb60e
> --- /dev/null
> +++ b/Documentation/scheduler/cfs-overview.rst
> @@ -0,0 +1,110 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============
> +CFS Overview
> +============
> +
> +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> +CFS design is provided in :doc:`sched-design-CFS`.
> +
> +In addition, there have been many improvements to the CFS, a few of which are:
> +
> +**Thermal Pressure**:
> +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> +Thermal pressure on a CPU means this maximum possible capacity is
> +unavailable due to thermal events. Average thermal pressure for a CPU
> +is now subtracted from its maximum possible capacity so that cpu_capacity
> +reflects the remaining maximum capacity.
> +
> +**Use Idle CPU for NUMA balancing**:
> +Idle CPU is used as a migration target instead of comparing tasks.
> +Information on an idle core is cached while gathering statistics
> +and this is used to avoid a second scan of the node runqueues if load is
> +not imbalanced. Preference is given to an idle core rather than an
> +idle SMT sibling to avoid packing HT siblings due to linearly scanning
> +the node cpumask.  Multiple tasks can attempt to select and idle CPU but
> +fail, in this case instead of failing, an alternative idle CPU scanned.

I'm having problems parsing that last sentence above.

> +
> +**Asymmetric CPU capacity wakeup scan**:
> +The previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES
> +domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE
> +systems (DynamIQ) which can accommodate CPUs of different compute capacity
> +within a single LLC domain. A new idle sibling helper function was added
> +which took CPU capacity in to account. The policy is to pick the first idle

   into

> +CPU which is big enough for the task (task_util * margin < cpu_capacity).

why not <= ?

> +If no idle CPU is big enough, the idle CPU with the highest capacity was

s/was/is/

> +picked.
> +
> +**Optimized idle core selection**:
> +Previously all threads of a core were looped through to evaluate if the
> +core is idle or not. This was unnecessary. If a thread of a core is not
> +idle, skip evaluating other threads of a core. Also while clearing the
> +cpumask, bits of all CPUs of a core can be cleared in one-shot.

  in one shot.

> +
> +**Load balance aggressively for SCHED_IDLE CPUs**:
> +The fair scheduler performs periodic load balance on every CPU to check
> +if it can pull some tasks from other busy CPUs. The duration of this
> +periodic load balance is set to the scheduler domain's balance_interval and
> +multiplied by a busy_factor (set to 32 by default) for the busy CPUs. This
> +multiplication is done for busy CPUs to avoid doing load balance too
> +often and rather spend more time executing actual tasks. While that is
> +the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
> +tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
> +tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE
> +CPUs, it is now preferred to enqueue a newly-woken task to a SCHED_IDLE
> +CPU instead of other busy or idle CPUs. The same reasoning is applied
> +to the load balancer as well to make it migrate tasks more aggressively
> +to a SCHED_IDLE CPU, as that will reduce 

[RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

2020-05-06 Thread john mathew
From: John Mathew 

Add documentation for
 -scheduler overview
 -scheduler state transition
 -CFS overview
 -scheduler data structs

Add rst for scheduler APIs and modify sched/core.c
to add kernel-doc comments.

Suggested-by: Lukas Bulwahn 
Co-developed-by: Mostafa Chamanara 
Signed-off-by: Mostafa Chamanara 
Co-developed-by: Oleg Tsymbal 
Signed-off-by: Oleg Tsymbal 
Signed-off-by: John Mathew 
---
 Documentation/scheduler/cfs-overview.rst  | 110 +++
 Documentation/scheduler/index.rst |   3 +
 Documentation/scheduler/overview.rst  | 269 ++
 .../scheduler/sched-data-structs.rst  | 253 
 Documentation/scheduler/scheduler-api.rst |  30 ++
 kernel/sched/core.c   |  28 +-
 kernel/sched/sched.h  | 169 ++-
 7 files changed, 855 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/scheduler/cfs-overview.rst
 create mode 100644 Documentation/scheduler/sched-data-structs.rst
 create mode 100644 Documentation/scheduler/scheduler-api.rst

Request review from Valentin Schneider  
for the section describing Scheduler classes in:
.../scheduler/sched-data-structs.rst

diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
new file mode 100644
index ..50d94b9bb60e
--- /dev/null
+++ b/Documentation/scheduler/cfs-overview.rst
@@ -0,0 +1,110 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+============
+CFS Overview
+============
+
+Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
+Scheduler (CFS) implemented as a scheduling module. A brief overview of the
+CFS design is provided in :doc:`sched-design-CFS`.
+
+In addition, there have been many improvements to the CFS, a few of which are:
+
+**Thermal Pressure**:
+cpu_capacity initially reflects the maximum possible capacity of a CPU.
+Thermal pressure on a CPU means this maximum possible capacity is
+unavailable due to thermal events. Average thermal pressure for a CPU
+is now subtracted from its maximum possible capacity so that cpu_capacity
+reflects the remaining maximum capacity.
+
+**Use Idle CPU for NUMA balancing**:
+Idle CPU is used as a migration target instead of comparing tasks.
+Information on an idle core is cached while gathering statistics
+and this is used to avoid a second scan of the node runqueues if load is
+not imbalanced. Preference is given to an idle core rather than an
+idle SMT sibling to avoid packing HT siblings due to linearly scanning
+the node cpumask.  Multiple tasks can attempt to select and idle CPU but
+fail, in this case instead of failing, an alternative idle CPU scanned.
+
+**Asymmetric CPU capacity wakeup scan**:
+The previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES
+domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE
+systems (DynamIQ) which can accommodate CPUs of different compute capacity
+within a single LLC domain. A new idle sibling helper function was added
+which took CPU capacity in to account. The policy is to pick the first idle
+CPU which is big enough for the task (task_util * margin < cpu_capacity).
+If no idle CPU is big enough, the idle CPU with the highest capacity was
+picked.
+
+**Optimized idle core selection**:
+Previously all threads of a core were looped through to evaluate if the
+core is idle or not. This was unnecessary. If a thread of a core is not
+idle, skip evaluating other threads of a core. Also while clearing the
+cpumask, bits of all CPUs of a core can be cleared in one-shot.
+
+**Load balance aggressively for SCHED_IDLE CPUs**:
+The fair scheduler performs periodic load balance on every CPU to check
+if it can pull some tasks from other busy CPUs. The duration of this
+periodic load balance is set to the scheduler domain's balance_interval and
+multiplied by a busy_factor (set to 32 by default) for the busy CPUs. This
+multiplication is done for busy CPUs to avoid doing load balance too
+often and rather spend more time executing actual tasks. While that is
+the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
+tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
+tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE
+CPUs, it is now preferred to enqueue a newly-woken task to a SCHED_IDLE
+CPU instead of other busy or idle CPUs. The same reasoning is applied
+to the load balancer as well to make it migrate tasks more aggressively
+to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the
+migrated (SCHED_OTHER) tasks. The fair scheduler now does the next
+load balance soon after the last non-SCHED_IDLE task is dequeued from a
+runqueue, i.e. making the CPU SCHED_IDLE.
+
+**Load balancing algorithm reworked**:
+The load balancing algorithm contained some heuristics which became
+meaningless since the rework of the scheduler's metrics, such as the
+introduction of PELT. The new load