Re: [RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

On Thu, May 7, 2020 at 6:41 AM Randy Dunlap wrote:
>
> Hi--
>
> On 5/6/20 7:39 AM, john mathew wrote:
> > From: John Mathew
> >
> > Add documentation for
> >  -scheduler overview
> >  -scheduler state transition
> >  -CFS overview
> >  -scheduler data structs
> >
> > Add rst for scheduler APIs and modify sched/core.c
> > to add kernel-doc comments.
> >
> > Suggested-by: Lukas Bulwahn
> > Co-developed-by: Mostafa Chamanara
> > Signed-off-by: Mostafa Chamanara
> > Co-developed-by: Oleg Tsymbal
> > Signed-off-by: Oleg Tsymbal
> > Signed-off-by: John Mathew
> > ---
> >  Documentation/scheduler/cfs-overview.rst   | 110 +++
> >  Documentation/scheduler/index.rst          |   3 +
> >  Documentation/scheduler/overview.rst       | 269 ++
> >  .../scheduler/sched-data-structs.rst       | 253
> >  Documentation/scheduler/scheduler-api.rst  |  30 ++
> >  kernel/sched/core.c                        |  28 +-
> >  kernel/sched/sched.h                       | 169 ++-
> >  7 files changed, 855 insertions(+), 7 deletions(-)
> >  create mode 100644 Documentation/scheduler/cfs-overview.rst
> >  create mode 100644 Documentation/scheduler/sched-data-structs.rst
> >  create mode 100644 Documentation/scheduler/scheduler-api.rst
> >
> > Request review from Valentin Schneider
> > for the section describing Scheduler classes in:
> > .../scheduler/sched-data-structs.rst
> >
> > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> > new file mode 100644
> > index ..50d94b9bb60e
> > --- /dev/null
> > +++ b/Documentation/scheduler/cfs-overview.rst
> > @@ -0,0 +1,110 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +============
> > +CFS Overview
> > +============
> > +
> > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> > +Scheduler (CFS) implemented as a scheduling module.
> > +A brief overview of the
> > +CFS design is provided in :doc:`sched-design-CFS`.
> > +
> > +In addition there have been many improvements to the CFS, a few of which are:
> > +
> > +**Thermal Pressure**:
> > +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> > +Thermal pressure on a CPU means this maximum possible capacity is
> > +unavailable due to thermal events. Average thermal pressure for a CPU
> > +is now subtracted from its maximum possible capacity so that cpu_capacity
> > +reflects the remaining maximum capacity.
> > +
> > +**Use Idle CPU for NUMA balancing**:
> > +Idle CPU is used as a migration target instead of comparing tasks.
> > +Information on an idle core is cached while gathering statistics
> > +and this is used to avoid a second scan of the node runqueues if load is
> > +not imbalanced. Preference is given to an idle core rather than an
> > +idle SMT sibling to avoid packing HT siblings due to linearly scanning
> > +the node cpumask. Multiple tasks can attempt to select and idle CPU but
> > +fail, in this case instead of failing, an alternative idle CPU scanned.
>
> I'm having problems parsing that last sentence above.

Fixed as follows in v3:

  Multiple tasks can attempt to select an idle CPU but fail because a
  NUMA balance is active on that CPU; in this case, instead of failing,
  an alternative idle CPU is scanned.

> > +
> > +**Asymmetric CPU capacity wakeup scan**:
> > +Previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES
> > +domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE
> > +systems (DynamIQ) which can accommodate CPUs of different compute capacity
> > +within a single LLC domain. A new idle sibling helper function was added
> > +which took CPU capacity in to account. The policy is to pick the first idle
>
> into

Fixed in v3.

> > +CPU which is big enough for the task (task_util * margin < cpu_capacity).
>
> why not <= ?
This is how it is implemented in fair.c:

  /*
   * The margin used when comparing utilization with CPU capacity.
   *
   * (default: ~20%)
   */
  #define fits_capacity(cap, max)  ((cap) * 1280 < (max) * 1024)

> > +If no idle CPU is big enough, the idle CPU with the highest capacity was
>
> s/was/is/

Fixed in v3.

> > +picked.
> > +
> > +**Optimized idle core selection**:
> > +Previously all threads of a core were looped through to evaluate if the
> > +core is idle or not. This was unnecessary. If a thread of a core is not
> > +idle, skip evaluating other threads of a core. Also while clearing the
> > +cpumask, bits of all CPUs of a core can be cleared in one-shot.
>
> in one shot.

Fixed in v3.

> > +
> > +**Load balance aggressively for SCHED_IDLE CPUs**:
> > +The fair scheduler performs periodic load balance on every CPU to check
> > +if it can pull some tasks from other busy CPUs. The duration of this
> > +periodic load balance is set to scheduler domain's balance_interval and
> > +multiplied by a busy_factor (set to 32 by default) for the busy CPUs. This
> > +multiplication is done for busy CPUs to avoid doing load balance too
> > +often and rather spend more time executing actual task. While that is
> > +the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
> > +tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
> > +tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE
> > +CPUs, it is now preferred to enqueue a newly-woken task to a SCHED_IDLE
> > +CPU instead of other busy or idle CPUs. The same reasoning is applied
> > +to the load balancer as well to make it migrate tasks more aggressively
> > +to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the
> > +migrated (SCHED_OTHER) tasks. Fair scheduler now does the next
> > +load balance soon after the last non SCHED_IDLE task is dequeued from a
> > +runqueue, i.e. making the CPU SCHED_IDLE.
> > +
> > +**Load balancing algorithm Reworked**:
> > +The load balancing algorithm contained some heuristics which became
> > +meaningless since the rework of the scheduler's metrics like the
> > +introduction of PELT. The new load