On Mon, 7 Sep 2020 at 14:00, Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
> Hi All,
> Consider a NUMA system with 4 nodes and 24 CPUs in each node, where all
> of the 96 cores are idle.
> Then we start a process with 4 threads on this totally idle system.
> Any one of the four NUMA nodes has enough capacity to run the 4 threads
> and would still have 20 idle CPUs afterwards.
> But right now the existing code in the CFS load balancer will spread the
> 4 threads across multiple nodes.
> This results in two negative side effects:
> 1. more NUMA nodes are woken up, while they could otherwise save power
> at the lowest frequency and in halt state
> 2. cache coherency overhead between NUMA nodes
>
> A proof-of-concept patch I made to "fix" this issue to some extent looks like this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a05..f671e15 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         }
>
>         /* Consider allowing a small imbalance between NUMA groups */
> -       if (env->sd->flags & SD_NUMA)
> +       if (env->sd->flags & SD_NUMA) {
> +               /* if the src group uses only 1/4 capability and dst is idle
> +                * don't move task
> +                */
> +               if (busiest->sum_nr_running <= busiest->group_weight/4 &&
> +                               local->sum_nr_running == 0) {
> +                       env->imbalance = 0;
> +                       return;
Without considering whether that makes sense or not, such tests should be
in adjust_numa_imbalance(), which is there to decide whether it's worth
"fixing" the imbalance between NUMA nodes or not.

The default behavior of the load balancer is all about spreading tasks.
Then we have 2 NUMA hooks to prevent this from happening when it doesn't
make sense:
- this adjust_numa_imbalance()
- the fbq_type, which is used to skip some rqs

Finally, there were several discussions around adjust_numa_imbalance()
when it was introduced, and one of them was how to define how much
imbalance can be allowed without regressing performance. The conclusion
was that it depends on a lot of inputs about the topology, like the
number of CPUs, the number of nodes, the distance between nodes and
several other things. So as a first step, it was decided to use the
simple, current implementation.

The 1/4 threshold that you use above may work for some use cases on your
system but will most probably be wrong for others. We must find something
that is not just a heuristic and can work on other systems too.
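
Purely to illustrate where such a check would live (not to endorse the
threshold itself), a minimal sketch of folding it into
adjust_numa_imbalance() could look like the code below. The extra
parameters are hypothetical: as the call site in your patch shows, the
function currently only receives the imbalance and
busiest->sum_nr_running, so both the signature and the caller would have
to change.

/* sketch only: the extended signature is hypothetical */
static inline long adjust_numa_imbalance(int imbalance, int busiest_nr_running,
                                         int busiest_weight, int local_nr_running)
{
        /*
         * Placeholder for the 1/4 rule from the patch above: if the busiest
         * group is mostly idle and the local group is completely idle,
         * report no imbalance so the tasks are not spread.
         */
        if (local_nr_running == 0 &&
            busiest_nr_running <= busiest_weight / 4)
                return 0;

        /* ... existing small-imbalance handling ... */
        return imbalance;
}

calculate_imbalance() would then pass busiest->group_weight and
local->sum_nr_running instead of open-coding the test before the call.
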
> +               }
>                 env->imbalance = adjust_numa_imbalance(env->imbalance,
>                                         busiest->sum_nr_running);
> +       }
>
>         return;
> }
>
> And I wrote a simple program with 4 threads to measure the execution time:
>
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/types.h>
> #include <sys/time.h>   /* gettimeofday(), struct timeval, suseconds_t */
>
> /* two fields sharing one cache line: read by two threads, written by the
>  * other two, so thread placement strongly affects coherency traffic
>  */
> struct foo {
>         int x;
>         int y;
> } f1;
>
> void* thread_fun1(void* param)
> {
>         /* reader: repeatedly loads f1.x */
>         int s = 0;
>         for (int i = 0; i < 1000000000; ++i)
>                 s += f1.x;
>         return NULL;
> }
>
> void* thread_fun2(void* param)
> {
>         /* writer: repeatedly stores to f1.y */
>         for (int i = 0; i < 1000000000; ++i)
>                 ++f1.y;
>         return NULL;
> }
>
> /* returns *start - *end in seconds; called below as difftimeval(&end, &start) */
> double difftimeval(const struct timeval *start, const struct timeval *end)
> {
>         double d;
>         time_t s;
>         suseconds_t u;
>
>         s = start->tv_sec - end->tv_sec;
>         u = start->tv_usec - end->tv_usec;
>
>         d = s;
>         d += u/1000000.0;
>
>         return d;
> }
>
> int main(void)
> {
>         pthread_t tid1, tid2, tid3, tid4;
>         struct timeval start, end;
>
>         gettimeofday(&start, NULL);
>
>         pthread_create(&tid1, NULL, thread_fun1, NULL);
>         pthread_create(&tid2, NULL, thread_fun2, NULL);
>         pthread_create(&tid3, NULL, thread_fun1, NULL);
>         pthread_create(&tid4, NULL, thread_fun2, NULL);
>
>         pthread_join(tid1, NULL);
>         pthread_join(tid2, NULL);
>         pthread_join(tid3, NULL);
>         pthread_join(tid4, NULL);
>
>         gettimeofday(&end, NULL);
>
>         printf("execution time:%f\n", difftimeval(&end, &start));
>         return 0;
> }
>
> Before the PoC patch, the test result:
> $ ./a.out
> execution time:10.734581
>
> After the PoC patch, the test result:
> $ ./a.out
> execution time:6.775150
>
> The execution time is reduced by around 30-40% because the 4 threads are
> placed on a single NUMA node.
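
For comparison on an unpatched kernel, a rough way to isolate the placement
effect is to force the test onto a single node with numactl (node 0 is just
an example):

    $ numactl --cpunodebind=0 --membind=0 ./a.out

That binds both the threads and their memory to one node, which should
behave similarly to the "after" case above.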
>
> On the other hand, the patch doesn't have to depend on NUMA; it could also
> apply to SCHED_MC with some changes. If one CPU can handle all the tasks in
> the system and still be partly idle, maybe we don't need to wake up a 2nd
> CPU at all?
>
> I understand this PoC patch could have negative side effects in some corner
> cases, for example if the four threads of one process want more memory
> bandwidth by running on multiple nodes. But generally speaking, we make a
> tradeoff between cache locality and better CPU utilization, as they are the
> main concerns. If a process highly depends on memory bandwidth, it may
> change its mempolicy?
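
On that last point, for illustration, a bandwidth-hungry process could opt
back into spreading its memory by setting an interleave mempolicy, e.g.
with libnuma. A minimal sketch (the helper name is mine; link with -lnuma):

#include <numa.h>        /* libnuma */

static void interleave_all_nodes(void)
{
        /* do nothing if the kernel/hardware has no NUMA support */
        if (numa_available() < 0)
                return;

        /* interleave this process's future allocations across all nodes */
        numa_set_interleave_mask(numa_all_nodes_ptr);
}

The same effect is available without code changes by launching the program
with "numactl --interleave=all ./a.out".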
>
> Thanks
> Barry