On Thu, Mar 4, 2021 at 7:24 PM Sedat Dilek <sedat.di...@gmail.com> wrote:
>
> On Thu, Mar 4, 2021 at 6:34 PM 'Nick Desaulniers' via Clang Built
> Linux <clang-built-li...@googlegroups.com> wrote:
> >
> > On Wed, Mar 3, 2021 at 2:48 PM Josh Don <josh...@google.com> wrote:
> > >
> > > From: Clement Courbet <cour...@google.com>
> > >
> > > A significant portion of __calc_delta() time is spent in the loop
> > > shifting a u64 by 32 bits. Use `fls` instead of iterating.
> > >
> > > This is ~7x faster on benchmarks.
> > >
> > > The generic `fls` implementation (`generic_fls`) is still ~4x faster
> > > than the loop.
> > > Architectures that have a better implementation will make use of it.
> > > For example, on x86 we get an additional factor of 2 in speed without
> > > a dedicated implementation.
> > >
> > > On gcc, the asm versions of `fls` are about the same speed as the
> > > builtin. On clang, the versions that use `fls` are more than twice as
> > > slow as the builtin. This is because of the way the `fls` function is
> > > written: clang puts the value in memory.
> > > https://godbolt.org/z/EfMbYe. This bug is filed at
> > > https://bugs.llvm.org/show_bug.cgi?id=49406.
> >
> > Hi Josh,
> > Thanks for helping get this patch across the finish line.
> > Would you mind updating the commit message to point to
> > https://bugs.llvm.org/show_bug.cgi?id=20197?
> > >
> > > ```
> > > name                                   cpu/op
> > > BM_Calc<__calc_delta_loop>             9.57ms ±12%
> > > BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
> > > BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
> > > BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
> > > BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
> > > BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
> > > BM_Calc<__calc_delta_builtin>          1.32ms ±11%
> > > ```
> > >
> > > Signed-off-by: Clement Courbet <cour...@google.com>
> > > Signed-off-by: Josh Don <josh...@google.com>
> > > ---
> > >  kernel/sched/fair.c  | 19 +++++++++++--------
> > >  kernel/sched/sched.h |  1 +
> > >  2 files changed, 12 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 8a8bd7b13634..a691371960ae 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -229,22 +229,25 @@ static void __update_inv_weight(struct load_weight *lw)
> > >  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
> > >  {
> > >          u64 fact = scale_load_down(weight);
> > > +        u32 fact_hi = (u32)(fact >> 32);
> > >          int shift = WMULT_SHIFT;
> > > +        int fs;
> > >
> > >          __update_inv_weight(lw);
> > >
> > > -        if (unlikely(fact >> 32)) {
> > > -                while (fact >> 32) {
> > > -                        fact >>= 1;
> > > -                        shift--;
> > > -                }
> > > +        if (unlikely(fact_hi)) {
> > > +                fs = fls(fact_hi);
> > > +                shift -= fs;
> > > +                fact >>= fs;
> > >          }
> > >
> > >          fact = mul_u32_u32(fact, lw->inv_weight);
> > >
> > > -        while (fact >> 32) {
> > > -                fact >>= 1;
> > > -                shift--;
> > > +        fact_hi = (u32)(fact >> 32);
> > > +        if (fact_hi) {
> > > +                fs = fls(fact_hi);
> > > +                shift -= fs;
> > > +                fact >>= fs;
> > >          }
> > >
> > >          return mul_u64_u32_shr(delta_exec, fact, shift);
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 10a1522b1e30..714af71cf983 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -36,6 +36,7 @@
> > >  #include <uapi/linux/sched/types.h>
> > >
> > >  #include <linux/binfmts.h>
> > > +#include <linux/bitops.h>
> >
> > This hunk of the patch is curious. I assume that bitops.h is needed
> > for fls(); if so, why not #include it in kernel/sched/fair.c?
> > Otherwise this potentially hurts compile time for all TUs that include
> > kernel/sched/sched.h.
>
> I have v2 as-is in my custom patchset and booted it right now on bare metal.
>
> As Nick points out, moving the include makes sense to me.
> We have a lot of includes in the wrong places, which increases build time.
>
I tried with the attached patch.

$ LC_ALL=C ll kernel/sched/fair.o
-rw-r--r-- 1 dileks dileks 1.2M Mar  4 20:11 kernel/sched/fair.o

- Sedat -
From afd45cd78c21960c6e937021f095e5f8f51fef7a Mon Sep 17 00:00:00 2001
From: Sedat Dilek <sedat.di...@gmail.com>
Date: Thu, 4 Mar 2021 20:05:30 +0100
Subject: [PATCH] sched/fair: Move include after __calc_delta optimization
 change

Signed-off-by: Sedat Dilek <sedat.di...@gmail.com>
---
 kernel/sched/fair.c  | 2 ++
 kernel/sched/sched.h | 1 -
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5fda1751fbd1..b9f10ae92e3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -20,6 +20,8 @@
  * Adaptive scheduling granularity, math enhancements by Peter Zijlstra
  * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
+#include <linux/bitops.h>
+
 #include "sched.h"

 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 714af71cf983..10a1522b1e30 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,7 +36,6 @@
 #include <uapi/linux/sched/types.h>

 #include <linux/binfmts.h>
-#include <linux/bitops.h>
 #include <linux/blkdev.h>
 #include <linux/compat.h>
 #include <linux/context_tracking.h>
--
2.30.1