[PATCH 00/13] High performance balancing logic for big.LITTLE

Arseniy Krasnov Fri, 06 Nov 2015 04:03:33 -0800

                                Prologue

The patch set introduces an extension to the default Linux CPU scheduler (CFS).
The main purpose of the extension is to utilize a big.LITTLE CPU for maximum
performance. Such a solution may be useful for users of the OdroidXU-3 board
(8 cores) who do not care about power efficiency.


Maximum utilization was reached using the following policies:

1) A15 cores must be utilized as much as possible, i.e. an idle A15 core always
   pulls a task from an A7 core.

2) After a task has executed on an A7 core for some period of time, it should
   be swapped with an appropriate task from the A15 cluster in order to achieve
   fairness.

3) The load of the big and little clusters is balanced according to frequency
   and the A15/A7 slowdown coefficient.
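
Policy 3 can be illustrated with a small user-space sketch. It is not code from
the patches: the names 'struct cluster', 'cluster_capacity', 'hmp_compare' and
'SLOWDOWN_SCALE', the fixed-point representation, and the capacity formula
(CPU count x frequency x relative busy-loop speed) are all assumptions made
for illustration. The sketch only shows the idea of comparing cluster loads
relative to what each cluster can compute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: a relative speed of 1.0 is stored as 1024. */
#define SLOWDOWN_SCALE 1024u

struct cluster {
	uint64_t load;      /* summed runqueue load of the cluster */
	uint32_t freq_khz;  /* current cluster frequency */
	uint32_t perf;      /* relative busy-loop speed, A7 == SLOWDOWN_SCALE */
	uint32_t ncpus;     /* number of CPUs in the cluster */
};

/* Relative compute capacity of a cluster (assumed formula, not from the patches). */
static uint64_t cluster_capacity(const struct cluster *c)
{
	assert(c->freq_khz > 0 && c->perf > 0 && c->ncpus > 0);
	return (uint64_t)c->ncpus * c->freq_khz * c->perf / SLOWDOWN_SCALE;
}

/*
 * Compare load relative to capacity without dividing:
 *   big->load / cap(big)  vs.  little->load / cap(little)
 * Returns 1 if the big cluster is relatively more loaded, -1 if the little
 * cluster is, and 0 if both are equally loaded.
 */
static int hmp_compare(const struct cluster *big, const struct cluster *little)
{
	uint64_t lhs = big->load * cluster_capacity(little);
	uint64_t rhs = little->load * cluster_capacity(big);

	if (lhs > rhs)
		return 1;
	if (lhs < rhs)
		return -1;
	return 0;
}
```

A real implementation would most likely compare against a tolerance threshold
rather than exact equality, so that small load fluctuations do not trigger
migrations on every balancing pass.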

                        Approach Description

The scheduler creates a hierarchy of two domains: MC and HMP. The MC domain is
the default domain for the SCHED_MC config. The HMP domain contains two
clusters: the A15 CPUs and the A7 CPUs. Balancing between HMP domains is
performed by the new logic, while balancing within MC domains is still done by
the default logic of the 'load_balance()' function.

To perform balancing between HMP domains, the load of each cluster is calculated
in the scheduler's softirq handler. This value is then scaled according to each
cluster's frequency and a slowdown coefficient, which is the ratio of busy-loop
performance on A15 and A7.

There are three kinds of migration between the two clusters: from the A15
cluster to the A7 cluster (if the load on the A15 cluster is too high), from the
A7 cluster to the A15 cluster (in the opposite case), and task swapping when the
load on both clusters is the same.

To migrate a task from one cluster to another, the task must first be selected.
To find a task suitable for migration, the scheduler uses a special per-task
metric called 'druntime'. It is based on CFS's vruntime metric, but its
direction of growth depends on the core where the task executes: on an A15 core
it increases, and on an A7 core it decreases. A druntime value close to zero
therefore means that the task has executed on both clusters for about the same
amount of time. To pick a task for migration, the scheduler scans each runqueue
for the task with the highest or lowest druntime, depending on which cluster is
scanned; once the task is found, it is moved to the other cluster. These
balancing steps are performed in each scheduler balancing operation executed by
the softirq.
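
The druntime bookkeeping and task selection described above can be sketched in
user-space C as follows. This is an illustration under stated assumptions, not
the patch code: 'struct task', 'update_druntime' and 'pick_task' are
hypothetical names, the runqueue is a plain array, and the runtime delta is
added unweighted (the real metric is based on vruntime). The choice of
"highest druntime on A15, lowest on A7" is an inference from the fairness
goal; the text above does not spell out the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum cluster_type { CLUSTER_A15, CLUSTER_A7 };

/* Hypothetical, stripped-down task: only the metric discussed above. */
struct task {
	int64_t druntime;
};

/* Account 'delta' of runtime: druntime grows on A15 and shrinks on A7. */
static void update_druntime(struct task *t, enum cluster_type where,
			    int64_t delta)
{
	assert(delta >= 0);
	if (where == CLUSTER_A15)
		t->druntime += delta;
	else
		t->druntime -= delta;
}

/*
 * Pick a migration candidate from a runqueue of cluster 'from'.
 * Assumption: on A15 take the task with the highest druntime (it has had the
 * most big-core time); on A7 take the one with the lowest druntime.
 * Returns NULL for an empty runqueue.
 */
static struct task *pick_task(struct task *tasks, size_t n,
			      enum cluster_type from)
{
	struct task *best = NULL;

	for (size_t i = 0; i < n; i++) {
		struct task *t = &tasks[i];

		if (!best ||
		    (from == CLUSTER_A15 && t->druntime > best->druntime) ||
		    (from == CLUSTER_A7 && t->druntime < best->druntime))
			best = t;
	}
	return best;
}
```

With this convention, a task that alternates between clusters for equal
periods ends up with a druntime near zero and stops being the preferred
candidate on either side.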

To get maximum performance A15 cores must be fully utilized; this means that
idle A15 cores are always able to pull tasks from A7 cores while A7 cores cannot
do that from A15 cores.
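
The one-way nature of idle pull can be made concrete with a short sketch. The
names 'struct cpu', 'may_idle_pull' and 'find_pull_source' are hypothetical,
and choosing the A7 CPU with the most runnable tasks as the source is an
assumption; the text above only states that the pull goes from A7 to A15 and
never the other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cluster_type { CLUSTER_A15, CLUSTER_A7 };

/* Hypothetical per-CPU state: owning cluster and runnable task count. */
struct cpu {
	enum cluster_type cluster;
	unsigned int nr_running;
};

/* Only an idle A15 CPU may pull, and only from an A7 CPU that has tasks. */
static bool may_idle_pull(const struct cpu *idle, const struct cpu *src)
{
	return idle->nr_running == 0 &&
	       idle->cluster == CLUSTER_A15 &&
	       src->cluster == CLUSTER_A7 &&
	       src->nr_running > 0;
}

/*
 * Return the index of the A7 CPU to pull from, or -1 if CPU 'idle_idx' is
 * not allowed to pull. Assumption: prefer the A7 CPU with the most tasks.
 */
static int find_pull_source(const struct cpu *cpus, size_t n, size_t idle_idx)
{
	int best = -1;

	assert(idle_idx < n);
	for (size_t i = 0; i < n; i++) {
		if (!may_idle_pull(&cpus[idle_idx], &cpus[i]))
			continue;
		if (best < 0 || cpus[i].nr_running > cpus[best].nr_running)
			best = (int)i;
	}
	return best;
}
```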

And finally, fairness: it is provided by swapping tasks during every softirq
balancing. When the balance is broken, the scheduler tries to repair it by
moving tasks from one cluster to the other; once the clusters are balanced,
tasks are swapped during each softirq balancing. In addition to this logic,
'select_task_rq_fair' was modified to place woken tasks on the least loaded
CPU, because this does not break the balance between A15 and A7 cores.
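
The wake-up placement mentioned above amounts to a minimum search over the
CPUs the task may run on. The sketch below is a user-space illustration, not
the modified 'select_task_rq_fair': the function name 'least_loaded_cpu', the
flat load array and the byte-per-CPU affinity mask are assumptions, and
whether the patches compare raw or capacity-scaled load is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the least loaded CPU among those marked in 'allowed'
 * (non-zero means the task may run there), or -1 if no CPU is allowed.
 * Ties are resolved in favour of the lowest CPU index.
 */
static int least_loaded_cpu(const uint64_t *cpu_load, const uint8_t *allowed,
			    size_t ncpus)
{
	int best = -1;

	assert(cpu_load && allowed);
	for (size_t i = 0; i < ncpus; i++) {
		if (!allowed[i])
			continue;
		if (best < 0 || cpu_load[i] < cpu_load[best])
			best = (int)i;
	}
	return best;
}
```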

                                Test results

Several test kits were used to measure the performance of the solution.
All comparisons are made against the Linaro MP scheduler.

The first test case is the PARSEC benchmark suite. It contains different types
of tasks, such as cluster searching or pattern recognition, to test scheduler
performance. Results of some benchmarks are listed below (in seconds, with a
comma as the decimal separator):

Streamcluster:

Streamcluster was developed by Princeton University and solves the online
clustering problem.
Streamcluster was included in the PARSEC benchmark suite because of the
importance of data mining algorithms and the prevalence of problems with
streaming characteristics.

        Threads         HPERF_HMP       Linaro MP
        1               27,333          27,422
        2               14,162          14,197
        3               10,099          10,168
        4               8,227           8,332
        5               10,922          23,349
        6               10,85           22,507
        7               11,39           22,041
        8               12,307          21,181
        9               20,339          22,115
        10              21,33           23,746
        11              23,289          24,831
        12              25,363          26,699
        13              34,091          34,84
        14              34,758          38,661
        15              35,743          38,688
        16              38,1            44,735
        17              41,165          77,098
        18              44,223          102,633
        19              46,177          113,748
        20              48,22           119,146
        21              52,372          135,499
        22              54,319          136,454
        23              56,218          141,924
        24              57,843          145,727
        25              61,759          158,754
        26              63,179          163,915
        27              64,987          167,559
        28              67,329          171,203
        29              70,489          185,171
        30              73,084          189,303
        31              75,264          192,487
        32              77,015          197,27
        avg             40,373          87,543

Bodytrack:

This computer vision application is an Intel RMS workload which tracks a human
body with multiple cameras through an image sequence. This benchmark was
included due to the increasing significance of computer vision algorithms in
areas such as video surveillance, character animation and computer interfaces.

        Threads         HPERF_HMP       Linaro MP
        1               15,884          16,632
        2               8,536           9,42
        3               6,037           7,257
        4               4,84            6,076
        5               8,835           5,739
        6               4,437           5,513
        7               4,119           5,474
        8               3,992           5,115
        9               3,854           5,164
        10              3,92            4,911
        11              3,854           4,932
        12              3,83            4,816
        13              3,839           5,643
        14              3,861           4,816
        15              3,889           4,896
        16              3,845           4,854
        17              3,872           4,837
        18              3,852           4,876
        19              4,304           4,868
        20              3,915           4,928
        21              3,87            4,841
        22              3,858           4,995
        23              3,881           4,97
        24              3,876           4,899
        25              3,854           4,96
        26              3,869           4,902
        27              3,874           4,979
        28              3,88            4,928
        29              3,914           5,008
        30              3,889           5,216
        31              3,898           5,242
        32              3,894           5,199
        avg             4,689           5,653

Blackscholes:

This application is an Intel RMS benchmark. It calculates the prices for a
portfolio of European options analytically with the Black-Scholes partial
differential equation. There is no closed-form expression for the blackscholes
equation and as such it must be computed numerically.

        Threads         HPERF_HMP       Linaro MP
        1               7,293           6,807
        2               3,886           4,044
        3               2,906           2,911
        4               2,429           2,427
        5               2,58            2,985
        6               2,401           2,672
        7               2,205           2,411
        8               2,132           2,293
        9               2,074           2,41
        10              2,067           2,264
        11              2,054           2,205
        12              2,091           2,222
        13              2,042           2,28
        14              2,035           2,222
        15              2,026           2,25
        16              2,024           2,177
        17              2,021           2,173
        18              2,033           2,09
        19              2,03            2,05
        20              2,024           2,158
        21              2,002           2,175
        22              2,026           2,179
        23              2,017           2,134
        24              2,01            2,156
        25              2,009           2,155
        26              2,013           2,179
        27              2,017           2,177
        28              2,019           2,189
        29              2,013           2,158
        30              2,002           2,162
        31              2,016           2,16
        32              2,012           2,159
        avg             2,328           2,469

Also, the well-known Antutu benchmark was run on an Exynos 5433 board:

                                        HPERF_HMP       Linaro MP
        Integral benchmark result       42400           36860 
        Result: hperf_hmp is 15% better.


Arseniy Krasnov (13):
  hperf_hmp: add new config for arm and arm64.
  hperf_hmp: introduce new domain flag.
  hperf_hmp: add sched domains initialization.
  hperf_hmp: scheduler initialization routines.
  hperf_hmp: introduce druntime metric.
  hperf_hmp: is_hmp_imbalance introduced.
  hperf_hmp: migration auxiliary functions.
  hperf_hmp: swap tasks function.
  hperf_hmp: one way balancing function.
  hperf_hmp: idle pull function.
  hperf_hmp: task CPU selection logic.
  hperf_hmp: rest of logic.
  hperf_hmp: cpufreq routines.

 arch/arm/Kconfig           |   21 +
 arch/arm/kernel/topology.c |    6 +-
 arch/arm64/Kconfig         |   21 +
 include/linux/sched.h      |   17 +
 kernel/sched/core.c        |   65 +-
 kernel/sched/fair.c        | 1553 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h       |   16 +
 7 files changed, 1586 insertions(+), 113 deletions(-)

-- 
1.9.1

