On Fri, 20 Nov 2020 at 10:06, Mel Gorman <[email protected]> wrote:
>
> Currently, an imbalance is only allowed when a destination node
> is almost completely idle. This solved one basic class of problems
> and was the cautious approach.
>
> This patch revisits the possibility that NUMA nodes can be imbalanced
> until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
> superficial -- it's half the cores when HT is enabled. At higher
> utilisations, balancing should continue as normal and keep things even
> until scheduler domains are fully busy or over utilised.
>
> Note that this is not expected to be a universal win. Any benchmark
> that prefers spreading as wide as possible with limited communication
> will favour the old behaviour as there is more memory bandwidth.
> Workloads that communicate heavily in pairs such as netperf or tbench
> benefit. For the tests I ran, the vast majority of workloads saw
> a benefit so it seems to be a worthwhile trade-off.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Vincent Guittot <[email protected]>

> ---
>  kernel/sched/fair.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9aded12aaa90..e17e6c5da1d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1550,7 +1550,8 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running);
> +static inline long adjust_numa_imbalance(int imbalance,
> +					int dst_running, int dst_weight);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1930,7 +1931,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		src_running = env->src_stats.nr_running - 1;
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
> -		imbalance = adjust_numa_imbalance(imbalance, dst_running);
> +		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> +						  env->dst_stats.weight);
>
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -8995,16 +8997,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
>  #define NUMA_IMBALANCE_MIN 2
>
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running)
> +static inline long adjust_numa_imbalance(int imbalance,
> +				int dst_running, int dst_weight)
>  {
> -	unsigned int imbalance_min;
> -
>  	/*
>  	 * Allow a small imbalance based on a simple pair of communicating
> -	 * tasks that remain local when the source domain is almost idle.
> +	 * tasks that remain local when the destination is lightly loaded.
>  	 */
> -	imbalance_min = NUMA_IMBALANCE_MIN;
> -	if (dst_running <= imbalance_min)
> +	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
>  		return 0;
>
>  	return imbalance;
> @@ -9107,9 +9107,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	}
>
>  	/* Consider allowing a small imbalance between NUMA groups */
> -	if (env->sd->flags & SD_NUMA)
> +	if (env->sd->flags & SD_NUMA) {
>  		env->imbalance = adjust_numa_imbalance(env->imbalance,
> -			busiest->sum_nr_running);
> +			busiest->sum_nr_running, busiest->group_weight);
> +	}
>
>  	return;
>  }
> --
> 2.26.2
>
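For anyone skimming the thread, here is a minimal user-space sketch of the
new check. The helper name and its arguments mirror the patch, but the
main() and the 16-CPU node example are mine, just to show where the 25%
cut-off (dst_weight >> 2) kicks in; it is not kernel code.

/*
 * Standalone sketch of the new adjust_numa_imbalance() policy:
 * tolerate a small imbalance (<= NUMA_IMBALANCE_MIN tasks) only while
 * the destination node runs fewer tasks than a quarter of its CPUs.
 */
#include <stdio.h>

#define NUMA_IMBALANCE_MIN 2

static long adjust_numa_imbalance(int imbalance, int dst_running, int dst_weight)
{
	/* Under 25% occupancy, a small imbalance is ignored entirely. */
	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}

int main(void)
{
	/* 16-CPU node: the threshold is 4 running tasks (16 >> 2). */
	printf("%ld\n", adjust_numa_imbalance(2, 3, 16)); /* 0: imbalance tolerated */
	printf("%ld\n", adjust_numa_imbalance(2, 4, 16)); /* 2: node busy enough, balance */
	printf("%ld\n", adjust_numa_imbalance(3, 3, 16)); /* 3: imbalance above the minimum */
	return 0;
}

The second and third calls show the two ways the old behaviour is restored:
either the node crosses the 25% occupancy threshold or the imbalance itself
exceeds NUMA_IMBALANCE_MIN.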

