On Mon, 7 Sep 2020 at 14:00, Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
> Hi All,
> Consider a NUMA system with 4 nodes and 24 CPUs in each node, where
> all 96 CPUs are idle.
> Then we start a process with 4 threads in this totally idle system.
> Any one of the four NUMA nodes has enough capacity to run the 4 threads
> and would still have 20 idle CPUs afterwards.
> But right now the existing CFS load-balance code will spread the 4 threads
> across multiple nodes.
> This results in two negative side effects:
> 1. more NUMA nodes are woken up, although they could be saving power at the
> lowest frequency in a halted state
> 2. cache coherency overhead between NUMA nodes
>
> A proof-of-concept patch I made to "fix" this issue to some extent looks like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a05..f671e15 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 }
>
>                 /* Consider allowing a small imbalance between NUMA groups */
> -               if (env->sd->flags & SD_NUMA)
> +               if (env->sd->flags & SD_NUMA) {
> +                       /* if the src group uses only 1/4 capability and dst is idle
> +                        * don't move task
> +                        */
> +                       if (busiest->sum_nr_running <= busiest->group_weight/4 &&
> +                                       local->sum_nr_running == 0) {
> +                               env->imbalance = 0;
> +                               return;

Without considering whether that makes sense or not, such tests should be
in adjust_numa_imbalance(), which is there to decide whether it's worth
"fixing" the imbalance between NUMA nodes or not.

The default behavior of the load balancer is all about spreading tasks.
Then we have 2 NUMA hooks to prevent this from happening when it doesn't
make sense:
- The adjust_numa_imbalance() discussed above
- The fbq_type, which is used to skip some rqs (see the rough excerpt below)
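
For reference, the fbq_type side looks roughly like this in
kernel/sched/fair.c (paraphrased from memory, not a verbatim quote):
runqueues are classified and find_busiest_queue() skips the ones whose
class exceeds what the balancing env asks for:

        enum fbq_type { regular, remote, all };

        /* inside find_busiest_queue(), for each candidate rq: */
        rt = fbq_classify_rq(rq);
        if (rt > env->fbq_type)
                continue;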

Finally, there were several discussions around adjust_numa_imbalance()
when it was introduced, and one was how to define how much imbalance can
be allowed without regressing performance. The conclusion was that it
depends on a lot of inputs about the topology, like the number of CPUs,
the number of nodes, the distance between nodes and several other things.
So as a 1st step, it was decided to use the simple implementation that is
there today.

The 1/4 threshold that you use above may work for some use cases on
your system but will most probably be wrong for others. We must find
something that is not just a heuristic and can work on other systems
too.



> +                       }
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
>                                                 busiest->sum_nr_running);
> +               }
>
>                 return;
>         }
>
> And I wrote a simple process with 4 threads to measure the execution time:
>
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/time.h>
> #include <sys/types.h>
>
> struct foo {
>     int x;
>     int y;
> } f1;
>
> void* thread_fun1(void* param)
> {
>     int s = 0;
>     for (int i = 0; i < 1000000000; ++i)
>         s += f1.x;
>     return NULL;
> }
>
> void* thread_fun2(void* param)
> {
>     for (int i = 0; i < 1000000000; ++i)
>         ++f1.y;
>     return NULL;
> }
>
> /* returns (*start - *end) in seconds; called below as difftimeval(&end, &start) */
> double difftimeval(const struct timeval *start, const struct timeval *end)
> {
>         double d;
>         time_t s;
>         suseconds_t u;
>
>         s = start->tv_sec - end->tv_sec;
>         u = start->tv_usec - end->tv_usec;
>
>         d = s;
>         d += u/1000000.0;
>
>         return d;
> }
>
> int main(void)
> {
>         pthread_t tid1,tid2,tid3,tid4;
>         struct timeval start,end;
>
>         gettimeofday(&start, NULL);
>
>         pthread_create(&tid1,NULL,thread_fun1,NULL);
>         pthread_create(&tid2,NULL,thread_fun2,NULL);
>         pthread_create(&tid3,NULL,thread_fun1,NULL);
>         pthread_create(&tid4,NULL,thread_fun2,NULL);
>
>         pthread_join(tid1,NULL);
>         pthread_join(tid2,NULL);
>         pthread_join(tid3,NULL);
>         pthread_join(tid4,NULL);
>
>         gettimeofday(&end, NULL);
>
>         printf("execution time:%f\n", difftimeval(&end, &start));
> }
>
> Before the PoC patch, the test result:
> $ ./a.out
> execution time:10.734581
>
> After the PoC patch, the test result:
> $ ./a.out
> execution time:6.775150
>
> The execution time is reduced by around 30-40% because the 4 threads are 
> put on a single NUMA node.
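
(Not part of your test, but for anyone reproducing it: a quick way to
confirm where the threads actually ran is something like the helper below,
called at the end of each thread function. sched_getcpu() needs _GNU_SOURCE
and numa_node_of_cpu() needs libnuma, i.e. -lnuma; the helper name is just
a suggestion.)

#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>

/* print the CPU the calling thread last ran on and the node it belongs to */
static void report_placement(const char *name)
{
        int cpu = sched_getcpu();

        printf("%s finished on cpu %d (node %d)\n",
               name, cpu, numa_node_of_cpu(cpu));
}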
>
> On the other hand, the patch doesn't have to depend on NUMA; it could also 
> apply to SCHED_MC with some changes. If one CPU can still be idle after 
> handling all the tasks in the system, maybe we don't need to wake up the 
> 2nd CPU at all?
>
> I understand this PoC patch could have negative side effects in some corner 
> cases, for example, if the four threads running in one process want more 
> memory bandwidth by running on multiple nodes. But generally speaking, we make 
> a tradeoff between cache locality and better CPU utilization, as they are the 
> main concerns. If a process depends heavily on memory bandwidth, it can 
> change its mempolicy?
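
(A hypothetical illustration of that last point, separate from the test
program: a bandwidth-bound process could interleave its future allocations
across nodes with libnuma, for example:)

#include <numa.h>       /* link with -lnuma */

/* spread this process's future page allocations across all NUMA nodes */
static void interleave_all_nodes(void)
{
        if (numa_available() == -1)
                return;                         /* no NUMA support */
        numa_set_interleave_mask(numa_all_nodes_ptr);
}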
>
> Thanks
> Barry
