[tip: sched/core] sched: Optimize __calc_delta()

2021-03-10 Thread tip-bot2 for Clement Courbet
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1e17fb8edc5ad6587e9303ccdebce853bc8cf30c
Gitweb:        https://git.kernel.org/tip/1e17fb8edc5ad6587e9303ccdebce853bc8cf30c
Author:        Clement Courbet
AuthorDate:    Wed, 03 Mar 2021 14:46:53 -08:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 10 Mar 2021 09:51:49 +01:00

sched: Optimize __calc_delta()

A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.

On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.

```
name                                   cpu/op
BM_Calc<__calc_delta_loop>             9.57ms ±12%
BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
BM_Calc<__calc_delta_builtin>          1.32ms ±11%
```

Signed-off-by: Clement Courbet 
Signed-off-by: Josh Don 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-josh...@google.com
---
 kernel/sched/fair.c  | 19 +++
 kernel/sched/sched.h |  1 +
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5d6541..2e2ab1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -229,22 +229,25 @@ static void __update_inv_weight(struct load_weight *lw)
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
 {
 	u64 fact = scale_load_down(weight);
+	u32 fact_hi = (u32)(fact >> 32);
 	int shift = WMULT_SHIFT;
+	int fs;
 
 	__update_inv_weight(lw);
 
-	if (unlikely(fact >> 32)) {
-		while (fact >> 32) {
-			fact >>= 1;
-			shift--;
-		}
+	if (unlikely(fact_hi)) {
+		fs = fls(fact_hi);
+		shift -= fs;
+		fact >>= fs;
 	}
 
 	fact = mul_u32_u32(fact, lw->inv_weight);
 
-	while (fact >> 32) {
-		fact >>= 1;
-		shift--;
+	fact_hi = (u32)(fact >> 32);
+	if (fact_hi) {
+		fs = fls(fact_hi);
+		shift -= fs;
+		fact >>= fs;
 	}
 
 	return mul_u64_u32_shr(delta_exec, fact, shift);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb8bb06..d2e09a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,6 +36,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
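
The equivalence between the removed loops and the `fls`-based version in the patch above can be checked in user space. The following is a minimal sketch, not kernel code: `fls32()` is a hypothetical stand-in for the kernel's `fls()` built on the GCC/Clang `__builtin_clz`, and `reduce_loop()`, `reduce_fls()` and `variants_agree()` are illustrative names for the two variants of the shift-reduction step and a comparison helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, or 0 if x is 0. Assumes GCC or Clang. */
static int fls32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Old variant: shift right one bit at a time until fact fits in 32 bits. */
static uint64_t reduce_loop(uint64_t fact, int *shift)
{
	while (fact >> 32) {
		fact >>= 1;
		(*shift)--;
	}
	return fact;
}

/* New variant: derive the shift amount directly from the high 32 bits. */
static uint64_t reduce_fls(uint64_t fact, int *shift)
{
	uint32_t fact_hi = (uint32_t)(fact >> 32);

	if (fact_hi) {
		int fs = fls32(fact_hi);

		*shift -= fs;
		fact >>= fs;
	}
	return fact;
}

/* Returns 1 if both variants agree on the reduced value and the shift. */
static int variants_agree(uint64_t fact)
{
	int s1 = 32, s2 = 32;	/* 32 mirrors WMULT_SHIFT */
	uint64_t r1 = reduce_loop(fact, &s1);
	uint64_t r2 = reduce_fls(fact, &s2);

	assert((r1 >> 32) == 0);	/* the result always fits in 32 bits */
	return r1 == r2 && s1 == s2;
}
```

Both variants shift by exactly the position of the highest set bit in the upper 32 bits, which is why the loop can be replaced by a single `fls` call.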


Re: [PATCH 0/4] -ffreestanding/-fno-builtin-* patches

2020-08-19 Thread Clement Courbet
On Tue, Aug 18, 2020 at 9:58 PM Nick Desaulniers wrote:
On Tue, Aug 18, 2020 at 12:25 PM Nick Desaulniers wrote:
>
> On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
>  wrote:
> >
> > And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> > the year 2020, we don't use bcmp. It's that simple. Fix your damn
> > broken compiler and use memcmp. The argument that memcmp is more
> > expensive than bcmp is garbage legacy thinking from four decades ago.
> >
> > It's likely the other way around, where people have actually spent
> > time on memcmp, but not on bcmp.
> >
> >                Linus
>
> You'll have to ask Clement about that.  I'm not sure I ever saw the
> "faster bcmp than memcmp" implementation, but I was told "it exists"
> when I asked for a revert when all of our kernel builds went red.

It **is** possible to make bcmp much faster than memcmp. We have one
such implementation internally (it's scheduled to be released as part of
llvm-libc some time this year), but most libc implementations just alias it to
memcmp.

Below is a graph showing the impact of releasing this compiler optimization
with our optimized bcmp on the Google fleet (the cumulative memcmp+bcmp usage
of all programs running in Google datacenters, including the kernel). Scale and
dates have been redacted for obvious reasons, but note that the graph starts at
y=0, so you can compare the values relative to each other. Note how, as memcmp
is progressively replaced by bcmp (more and more programs being
recompiled with the compiler patch), the cumulative usage of memory
comparison drops significantly.
 
https://drive.google.com/file/d/1p8z1ilw2xaAJEnx_5eu-vflp3tEOv0qY/view?usp=sharing

The reasons why bcmp can be faster are:
 - typical libc implementations use the hardware to its full capacity, e.g. for
bcmp we can use vector loads and compares, which can process up to 64 bytes
(avx512) in one instruction. It's harder to implement memcmp with these for
little-endian architectures as there is no vector bswap. Because the kernel
only uses GPRs I can see how that might not perfectly fit the kernel use case.
But the kernel really is a special case, the compiler is written for most
programs, not specifically for the kernel, and most programs should benefit from
this optimization.
 - bcmp() does not have to look at the bytes in order, e.g. it can look at the
first and last . This is useful when comparing buffers that have common
prefixes (as happens in mostly sorted containers, and we have data that shows
that this is a quite common instance).
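
A minimal sketch of the second point, assuming a hypothetical helper `bytes_differ()` (this is neither the llvm-libc implementation nor any existing libc routine): because the result only needs to be zero or non-zero, the tail can be checked before the head, and chunks can be compared for equality without locating the first differing byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes without any alignment assumption. */
static uint64_t load64(const unsigned char *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/*
 * Hypothetical bcmp-like helper: returns 0 if the buffers are equal and 1
 * otherwise. Unlike memcmp, no ordering is reported, so the bytes may be
 * examined in any order.
 */
static int bytes_differ(const void *a, const void *b, size_t n)
{
	const unsigned char *pa = a, *pb = b;
	size_t i;

	assert(a != NULL && b != NULL);

	if (n < 8) {
		/* Short buffers: plain byte loop. */
		for (i = 0; i < n; i++)
			if (pa[i] != pb[i])
				return 1;
		return 0;
	}

	/* Check the last 8 bytes first: cheap rejection for common prefixes. */
	if (load64(pa + n - 8) != load64(pb + n - 8))
		return 1;

	/* Then the head, 8 bytes at a time; the tail load above already
	 * covered any bytes past the last full chunk. */
	for (i = 0; i + 8 <= n; i += 8)
		if (load64(pa + i) != load64(pb + i))
			return 1;
	return 0;
}
```

A memcmp built the same way would additionally have to find the first differing byte and compare it in byte order, which is the extra work the first point refers to on little-endian machines.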
 

> Also, to Clement's credit, every patch I've ever seen from Clement is
> backed up by data; typically fleetwide profiles at Google.  "we spend
> a lot of time in memcmp, particularly comparing the result against
> zero and no other value; hmm...how do we spend less time in
> memcmp...oh, well there's another library function with slightly
> different semantics we can call instead."  I don't think anyone would
> consider the optimization batshit crazy given the number of cycles
> saved across the fleet.  That an embedded project didn't provide an
> implementation, is a footnote that can be fixed in the embedded
> project, either by using -ffreestanding or -fno-builtin-bcmp, which is
> what this series proposes to do.


[PATCH v5] lib: optimize cpumask_next_and()

2017-11-30 Thread Clement Courbet
> So I think it really worth to be separated patch. Really, it's
> completely nontrivial why adding new function in lib/find_bit.c
> requires including asm-generic/bitops/find.h in arm and uncore32
> asm/bitops.h headers (bug?). And why doing that makes you guard
> find_first_bit and find_first_zero_bit (another bug?).

OK, I'll send a separate patch for this.

> Linux-next is your choice.
>  [...]
> Again. test_find_next_and_bit is trimmed, but it is still based on
> get_cycles and uses tabs in printf(). Please fix it.

OK, I'll send a version of the patch rebased against linux-next.


[PATCH v6] lib: optimize cpumask_next_and()

2017-11-29 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join).
Also add generic bitmap benchmarks in the new `test_find_bit` module
for the new function (see `find_next_and_bit` in [2] and [3] below).

For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory
usage.
Note that on Arm, the new pure-C implementation still outperforms
the old one that uses a mix of C and asm (`find_next_bit`) [3].
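
To make the difference between the two joins concrete, here is a simplified user-space sketch. It is not the code from this patch: `find_next()`, `next_and_lookup()` and `next_and_direct()` are illustrative names, and `__builtin_ctzl` assumes GCC or Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Word-at-a-time search over (a[i] & b[i]); pass b == NULL to search a alone.
 * Returns the first matching bit at or after start, or nbits if none. */
static unsigned long find_next(const unsigned long *a, const unsigned long *b,
			       unsigned long nbits, unsigned long start)
{
	unsigned long idx, word;

	assert(a != NULL);
	if (start >= nbits)
		return nbits;

	idx = start / BITS_PER_LONG;
	word = a[idx];
	if (b)
		word &= b[idx];
	word &= ~0UL << (start % BITS_PER_LONG);	/* drop bits below start */

	while (!word) {
		idx++;
		if (idx * BITS_PER_LONG >= nbits)
			return nbits;
		word = a[idx];
		if (b)
			word &= b[idx];
	}

	start = idx * BITS_PER_LONG + (unsigned long)__builtin_ctzl(word);
	return start < nbits ? start : nbits;
}

/* Lookup join: walk the set bits of a and test each one in b. */
static unsigned long next_and_lookup(const unsigned long *a,
				     const unsigned long *b,
				     unsigned long nbits, unsigned long start)
{
	unsigned long i = find_next(a, NULL, nbits, start);

	while (i < nbits &&
	       !(b[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG))))
		i = find_next(a, NULL, nbits, i + 1);
	return i;
}

/* Direct join: search the AND of both bitmaps in a single pass. */
static unsigned long next_and_direct(const unsigned long *a,
				     const unsigned long *b,
				     unsigned long nbits, unsigned long start)
{
	return find_next(a, b, nbits, start);
}
```

The lookup join restarts a search for every set bit of `a` that is clear in `b`, while the direct join consumes each word pair once, which is consistent with the largest speedups showing up when the two masks overlap little.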

[1] Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```
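
The empty `asm volatile` statements above act as optimization barriers: the compiler has to assume the operand may have been read and modified, so it cannot fold the call away or hoist it out of the loop. A self-contained user-space sketch of the same idea follows, assuming GCC or Clang extended asm; `OPAQUE()`, `sum_to()` and `bench()` are illustrative names, and `sum_to()` is only a toy workload standing in for `cpumask_next_and()`.

```c
#include <assert.h>

/*
 * Tell the compiler that the value may have been read and modified by the
 * (empty) asm statement, so it cannot constant-fold or hoist work that
 * depends on it. Assumes GCC or Clang extended asm syntax.
 */
#define OPAQUE(val) __asm__ volatile("" : "+rm"(val))

/* Toy workload: sum of 0..n-1. */
static unsigned long sum_to(unsigned long n)
{
	unsigned long i, sum = 0;

	for (i = 0; i < n; i++)
		sum += i;
	return sum;
}

/* Runs the workload reps times; the barriers keep each call in the loop. */
static unsigned long bench(unsigned long n, unsigned long reps)
{
	unsigned long r, result = 0;

	assert(reps > 0);
	for (r = 0; r < reps; r++) {
		OPAQUE(n);		/* input looks unknown on each pass */
		result = sum_to(n);
		OPAQUE(result);		/* output looks used on each pass */
	}
	return result;
}
```

The barriers emit no instructions, so they do not change the computed result, only what the optimizer is allowed to assume.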

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
-
   geo.mean  2.06
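
The geometric mean can be recomputed from the sixteen ratios listed above. A small sketch, with illustrative names (`sqrt_newton()`, `geo_mean16()`, `speedups`) and a hand-written square root so that no libm is needed; the 16th root is taken as four nested square roots.

```c
#include <assert.h>
#include <stddef.h>

/* Square root by Newton's iteration; avoids a libm dependency. */
static double sqrt_newton(double x)
{
	double guess = x > 1.0 ? x : 1.0;	/* start at or above sqrt(x) */
	int i;

	assert(x > 0.0);
	for (i = 0; i < 100; i++)
		guess = 0.5 * (guess + x / guess);
	return guess;
}

/* Geometric mean of exactly 16 values: the 16th root of their product,
 * computed as four nested square roots. */
static double geo_mean16(const double v[16])
{
	double prod = 1.0;
	size_t i;

	for (i = 0; i < 16; i++)
		prod *= v[i];
	return sqrt_newton(sqrt_newton(sqrt_newton(sqrt_newton(prod))));
}

/* time_before/time_after ratios copied from the results table. */
static const double speedups[16] = {
	1.65, 2.24, 2.94, 14.0, 1.67, 1.71, 1.90, 6.58,
	1.46, 1.49, 1.45, 3.10, 1.18, 1.18, 1.17, 1.25,
};
```

`geo_mean16(speedups)` lands between 2.05 and 2.07, in line with the 2.06 reported in the table.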

[2] test_find_next_bit, X86 (skylake)

 [ 3913.477422] Start testing find_bit() with random-filled bitmap
 [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
 [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
 [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
 [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
 [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap
 [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
 [ 3913.481075] Start testing find_bit() with sparse bitmap
 [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
 [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
 [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
 [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
 [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
 [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations

[3] test_find_next_bit, arm (v7 odroid XU3).

[  267.206928] Start testing find_bit() with random-filled bitmap
[  267.214752] find_next_bit: 4474 cycles, 16419 iterations
[  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
[  267.229294] find_last_bit: 4209 cycles, 16419 iterations
[  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
[  267.286265] Start testing find_next_and_bit() with random-filled bitmap
[  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
[  267.309422] Start testing find_bit() with sparse bitmap
[  267.316054] find_next_bit: 191 cycles, 66 iterations
[  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
[  267.329803] find_last_bit: 84 cycles, 66 iterations
[  267.336169] find_first_bit: 4118 cycles, 66 iterations
[  267.342627] Start testing find_next_and_bit() with sparse bitmap
[  267.356919] find_next_and_bit: 91 cycles, 1 iterations

Signed-off-by: Clement Courbet <cour...@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit, as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

Changes in v3:
 - Fixes find_next_and_bit() declaration.
 - Synchronize _find_next_bit_le() with _find_next_bit()
 - Synchronize the code in tools/lib/find_bit.c
 - Add find_next_and_bit to guard code
 - Fix invert value (bad sync with our internal tree on which I'm doing
   the testing).

Changes in v4:
 - Mark _find_next_bit() inline.

Changes in v5:
 - Added benchmarks to test_find_bit.c
 - Fixed arm compilation: added missing header to arm bitops.h

Changes in v6:
 - Removed test for reference implementation, which was a temporary
   artifact.
 - Prettify test output.

 arch/arm/include/asm/bitops.h   |  1 +
 arch/unicore32/include/asm/bitops.h |  2 ++
 include/asm-generic/bitops/find.h   | 20 +++
 include/linux/bitmap.h  |  6 +++-
 lib/cpumask.c   |  9 ++---
 lib/find_bit.c  | 59 -
 lib/test_find_bit.c | 25 +-
 tools/i

[PATCH v5] lib: optimize cpumask_next_and()

2017-11-29 Thread Clement Courbet
> > Note that on Arm (), the new c implementation still outperforms the
> > old one that uses c+ the asm implementation of `find_next_bit` [3].
> What is 'c+'? Is it typo?

I meant "a mix of C and asm" ~(C + asm). Rephrased.

> If you find generic find_bit() on arm faster that asm one, we'd
> definitely drop that piece of asm. I have this check it in my
> long list.

What's faster for sure is the mix (the improvement in this commit minus the
possible hit from not using the ASM implementation). I can't tell whether the
latter is negligible or not (I only have one ARM board to try it out), but
that's definitely something to try.

> This is old version of test based on get_cycles. New one is based on
> ktime_get and has other minor changes. I think you'd rerun tests to
> not confuse readers. New version is already in linux-next.

So I'm not sure whether I should be submitting this against 'linux' or
'linux-next'. This patch is against 'linux', so I think it should
be consistent with the surrounding code.

> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> > #ifndef find_first_zero_bit
> > #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
> > #endif
> How this change related to the find_next_and_bit?

The arm header defines these symbols. Now that we're including
the generic implementation in the arm headers, we need to guard this to
avoid the duplicate definition.
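
The guard pattern can be illustrated in a single translation unit. This is only a sketch of the mechanism, not the actual header layout: `arch_find_first_bit()` is a hypothetical architecture-specific routine, and the simple bit loop in `find_next_bit()` stands in for the real implementation.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Generic bit search: first set bit at or after offset, or size if none. */
static unsigned long find_next_bit(const unsigned long *addr,
				   unsigned long size, unsigned long offset)
{
	assert(addr != 0);
	for (; offset < size; offset++)
		if (addr[offset / BITS_PER_LONG] &
		    (1UL << (offset % BITS_PER_LONG)))
			return offset;
	return size;
}

/* --- "arch" header: hypothetical architecture-specific override --- */
static unsigned long arch_find_first_bit(const unsigned long *addr,
					 unsigned long size)
{
	return find_next_bit(addr, size, 0);	/* stand-in for an asm routine */
}
#define find_first_bit(addr, size) arch_find_first_bit((addr), (size))

/* --- "generic" header: only fills in what the arch did not define --- */
#ifndef find_first_bit
#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
#endif
```

Without the `#ifndef`, the second `#define` would redefine a macro the architecture header already provides, which is the duplicate definition mentioned above.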

> > test_find_next_and_bit_ref
> I don't understand the purpose of this. It's obviously clear that
> test_find_next_and_bit cannot be slower than test_find_next_and_bit_ref

Fair enough :) That was to back my claim that this commit is worth it.
I've removed the "_ref" version.

> For sparse bitmaps it will be like traversing zero-bitmaps. I doubt
> this numbers will be representative. Do we need this test at all?

It's just two lines, and gives an interesting data point. Why not
keep it?



[PATCH v5] lib: optimize cpumask_next_and()

2017-11-28 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join).

For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory
usage.

Also added generic bitmap benchmarks in the new `test_find_bit` module
for the reference and new implementations (results: see
`find_next_and_bit` and `find_next_and_bit_ref` in [2] and [3] below).
Note that on Arm, the new C implementation still outperforms the
old one that uses a mix of C and the asm implementation of `find_next_bit` [3].

[1] Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
-
   geo.mean  2.06

[2] test_find_next_bit, X86 (skylake)

 [ 3913.477422] Start testing find_bit() with random-filled bitmap
 [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
 [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
 [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
 [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
 [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap
 [ 3913.481027] find_next_and_bit_ref: 319444 cycles, 8216 iterations
 [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
 [ 3913.481075] Start testing find_bit() with sparse bitmap
 [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
 [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
 [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
 [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
 [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
 [ 3913.481270] find_next_and_bit_ref: 2486 cycles, 1 iterations
 [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations

[3] test_find_next_bit, arm (v7 odroid XU3).

[  267.206928] Start testing find_bit() with random-filled bitmap
[  267.214752] find_next_bit: 4474 cycles, 16419 iterations
[  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
[  267.229294] find_last_bit: 4209 cycles, 16419 iterations
[  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
[  267.286265] Start testing find_next_and_bit() with random-filled bitmap
[  267.294895] find_next_and_bit_ref: 7572 cycles, 8140 iterations
[  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
[  267.309422] Start testing find_bit() with sparse bitmap
[  267.316054] find_next_bit: 191 cycles, 66 iterations
[  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
[  267.329803] find_last_bit: 84 cycles, 66 iterations
[  267.336169] find_first_bit: 4118 cycles, 66 iterations
[  267.342627] Start testing find_next_and_bit() with sparse bitmap
[  267.349992] find_next_and_bit_ref: 193 cycles, 1 iterations
[  267.356919] find_next_and_bit: 91 cycles, 1 iterations

Signed-off-by: Clement Courbet <cour...@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit, as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

Changes in v3:
 - Fixes find_next_and_bit() declaration.
 - Synchronize _find_next_bit_le() with _find_next_bit()
 - Synchronize the code in tools/lib/find_bit.c
 - Add find_next_and_bit to guard code
 - Fix invert value (bad sync with our internal tree on which I'm doing
   the testing).

Changes in v4:
 - Mark _find_next_bit() inline.

Changes in v5:
 - Added benchmarks to test_find_bit.c
 - Fixed arm compilation: added missing header to arm bitops.h

 arch/arm/include/asm/bitops.h   |  1 +
 arch/unicore32/include/asm/bitops.h |  2 ++
 include/asm-generic/bitops/find.h   | 20 +++
 include/linux/bitmap.h   

[PATCH] lib: test module for find_*_bit() functions

2017-11-09 Thread Clement Courbet
Reviewed-By: Clement Courbet <cour...@google.com>

Thanks for the addition, Yury! I've used a modified version of v1
for measuring improvements from find_next_and_bit() on x86 and arm
and found it very useful.



[PATCH] lib: hint GCC to inlilne _find_next_bit() helper

2017-10-30 Thread Clement Courbet
Hi Yury,

I've tried your benchmark on x86-64 (haswell). Inlining costs a pretty small
increase in binary size: 48B (2%).

In terms of speed, results are not very stable from one run to another
(I've included two runs to give you an idea), but overall there seems
to be a small improvement on the random-filled bitmap, and not so much on
the sparse one.

Before (2312B):
[  312.912746] Start testing find_bit() with random-filled bitmap
[  312.919066] find_next_bit: 170226 cycles, 16267 iterations
[  312.924657] find_next_zero_bit: 170826 cycles, 16502 iterations
[  312.930674] find_last_bit: 152900 cycles, 16266 iterations
[  312.938856] find_first_bit: 5335034 cycles, 16267 iterations
[  312.944533] Start testing find_bit() with sparse bitmap
[  312.949780] find_next_bit: 2644 cycles, 66 iterations
[  312.955016] find_next_zero_bit: 320294 cycles, 32703 iterations
[  312.960957] find_last_bit: 2170 cycles, 66 iterations
[  312.966048] find_first_bit: 21704 cycles, 66 iterations

[  515.310376] Start testing find_bit() with random-filled bitmap
[  515.316693] find_next_bit: 164854 cycles, 16350 iterations
[  515.322293] find_next_zero_bit: 173710 cycles, 16419 iterations
[  515.328312] find_last_bit: 155458 cycles, 16350 iterations
[  515.336584] find_first_bit: 5518332 cycles, 16351 iterations
[  515.342272] Start testing find_bit() with sparse bitmap
[  515.347519] find_next_bit: 2538 cycles, 66 iterations
[  515.352763] find_next_zero_bit: 334828 cycles, 32703 iterations
[  515.358703] find_last_bit: 2250 cycles, 66 iterations
[  515.363787] find_first_bit: 23804 cycles, 66 iterations

After (2360B):
[  183.844318] Start testing find_bit() with random-filled bitmap
[  183.850588] find_next_bit: 148976 cycles, 16342 iterations
[  183.856186] find_next_zero_bit: 173298 cycles, 16427 iterations
[  183.862202] find_last_bit: 148728 cycles, 16341 iterations
[  183.870404] find_first_bit: 5390470 cycles, 16342 iterations
[  183.876084] Start testing find_bit() with sparse bitmap
[  183.881341] find_next_bit: 2144 cycles, 66 iterations
[  183.886586] find_next_zero_bit: 335558 cycles, 32703 iterations
[  183.892535] find_last_bit: 2376 cycles, 66 iterations
[  183.897627] find_first_bit: 24814 cycles, 66 iterations

[  187.842232] Start testing find_bit() with random-filled bitmap
[  187.848505] find_next_bit: 164512 cycles, 16412 iterations
[  187.854101] find_next_zero_bit: 172770 cycles, 16357 iterations
[  187.860117] find_last_bit: 145050 cycles, 16412 iterations
[  187.868312] find_first_bit: 5374792 cycles, 16413 iterations
[  187.873996] Start testing find_bit() with sparse bitmap
[  187.879251] find_next_bit: 2422 cycles, 66 iterations
[  187.884500] find_next_zero_bit: 342548 cycles, 32703 iterations
[  187.890448] find_last_bit: 2150 cycles, 66 iterations
[  187.895539] find_first_bit: 21830 cycles, 66 iterations
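
Because the individual runs are noisy, one way to read the logs above is to pool the two runs of each configuration and compare cycles per iteration. The following is a minimal sketch of that arithmetic, not part of the benchmark module; the struct and helper names are hypothetical, and the numbers are the find_next_bit lines of the random-filled runs copied from the logs.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark line: total cycles and the number of iterations. */
struct sample {
	unsigned long cycles;
	unsigned long iterations;
};

/*
 * Hypothetical helper: cycles per iteration pooled over several runs,
 * scaled by 1000 so the computation stays in integer arithmetic.
 */
static unsigned long milli_cycles_per_iter(const struct sample *runs, size_t n)
{
	unsigned long cycles = 0, iters = 0;

	for (size_t i = 0; i < n; i++) {
		cycles += runs[i].cycles;
		iters += runs[i].iterations;
	}
	assert(iters != 0);
	return cycles * 1000 / iters;
}

/* find_next_bit, random-filled bitmap, two runs each, from the logs. */
static const struct sample before_runs[] = { { 170226, 16267 }, { 164854, 16350 } };
static const struct sample after_runs[]  = { { 148976, 16342 }, { 164512, 16412 } };
```

Pooled this way, find_next_bit on the random-filled bitmap goes from roughly 10.3 to roughly 9.6 cycles per iteration, which matches the "small improvement" reading, though with only two runs per side the difference is within the visible run-to-run noise.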




[PATCH v4] lib: optimize cpumask_next_and()

2017-10-26 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join). Direct benchmarking shows that it's 1.17x to 14x faster with a
geometric mean of 2.1 on 32 CPUs. No impact on memory usage.
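
The difference between the two strategies can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the code from the patch: the names next_bit, next_and_lookup, next_and_direct and BITS_PER_WORD are hypothetical stand-ins, next_bit scans bit by bit for brevity, and __builtin_ctzl is assumed to be available (GCC and Clang provide it).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static int test_bit(const unsigned long *map, size_t i)
{
	return (int)((map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL);
}

/* Next set bit at position >= start in one bitmap, or nbits if none. */
static size_t next_bit(const unsigned long *map, size_t nbits, size_t start)
{
	for (size_t i = start; i < nbits; i++)
		if (test_bit(map, i))
			return i;
	return nbits;
}

/* Lookup join: find a set bit in a, then look the same bit up in b. */
static size_t next_and_lookup(const unsigned long *a, const unsigned long *b,
			      size_t nbits, size_t start)
{
	size_t i = start;

	while ((i = next_bit(a, nbits, i)) < nbits) {
		if (test_bit(b, i))
			break;
		i++;
	}
	return i;
}

/* Direct join: AND the bitmaps one word at a time and search the result. */
static size_t next_and_direct(const unsigned long *a, const unsigned long *b,
			      size_t nbits, size_t start)
{
	size_t word, bit;
	unsigned long tmp;

	assert(a != NULL && b != NULL);
	if (start >= nbits)
		return nbits;

	word = start / BITS_PER_WORD;
	/* Combine the first word and drop the bits below start. */
	tmp = a[word] & b[word] & (~0UL << (start % BITS_PER_WORD));

	while (!tmp) {
		word++;
		if (word * BITS_PER_WORD >= nbits)
			return nbits;
		tmp = a[word] & b[word];
	}

	bit = word * BITS_PER_WORD + (size_t)__builtin_ctzl(tmp);
	return bit < nbits ? bit : nbits;
}
```

Both functions return the same index for every input. The lookup version pays for one search restart per bit that is set in a but clear in b, whereas the direct version touches each pair of words once, which is consistent with the largest speedups appearing where few bits are common to both masks.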

Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
---------
   geo.mean  2.06
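
The geo.mean line can be re-derived from the 16 ratios in the table. The sketch below is only a check of that arithmetic: geo_mean16 and newton_sqrt are hypothetical helper names, and the 16th root is taken as four nested square roots (computed by Newton's method) so that the example does not depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* Square root by Newton's method, to avoid linking against libm. */
static double newton_sqrt(double x)
{
	double r = x > 1.0 ? x : 1.0;

	assert(x >= 0.0);
	for (int i = 0; i < 100; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* Geometric mean of exactly 16 values: the 16th root of their product. */
static double geo_mean16(const double v[16])
{
	double prod = 1.0;

	for (size_t i = 0; i < 16; i++)
		prod *= v[i];
	/* x^(1/16) is four nested square roots. */
	for (int i = 0; i < 4; i++)
		prod = newton_sqrt(prod);
	return prod;
}

/* time_before/time_after column, copied from the table above. */
static const double speedups[16] = {
	1.65, 2.24, 2.94, 14.0, 1.67, 1.71, 1.90, 6.58,
	1.46, 1.49, 1.45, 3.10, 1.18, 1.18, 1.17, 1.25,
};
```

The geometric mean is the natural summary here because the values are ratios: the single 14x outlier pulls it up far less than it would pull an arithmetic mean.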

Signed-off-by: Clement Courbet <cour...@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit(), as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

Changes in v3:
 - Fixes find_next_and_bit() declaration.
 - Synchronize _find_next_bit_le() with _find_next_bit()
 - Synchronize the code in tools/lib/find_bit.c
 - Add find_next_and_bit to guard code
 - Fix invert value (bad sync with our internal tree on which I'm doing
   the testing).

Changes in v4:
 - Mark _find_next_bit() inline.

 include/asm-generic/bitops/find.h       | 16 ++
 include/linux/bitmap.h                  |  2 ++
 lib/cpumask.c                           |  9 +++---
 lib/find_bit.c                          | 55 -
 tools/include/asm-generic/bitops/find.h | 16 ++
 tools/lib/find_bit.c                    | 40 ++--
 6 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 998d4d544f18..130962f3a264 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -15,6 +15,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
 		size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+   const unsigned long *addr2, unsigned long size,
+   unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 700cf5f67118..b4606bfda52f 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -77,6 +77,8 @@
  * find_first_bit(addr, nbits)                  Position first set bit in *addr
  * find_next_zero_bit(addr, nbits, bit)         Position next zero bit in *addr >= bit
  * find_next_bit(addr, nbits, bit)              Position next set bit in *addr >= bit
+ * find_next_and_bit(addr1, addr2, nbits, bit)  Same as find_first_bit, but in
+ *                                              (*addr1 & *addr2)
  */
 
 /*
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 8b1a1bd77539..5602223837fa 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -32,10 +32,11 @@ EXPORT_SYMBOL(cpumask_next);
 int cpumask_next_and(int n, const struct cpumask *src1p,
 		     const struct cpumask *src2p)
 {
-	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
-		if (cpumask_test_cpu(n, src2p))
-			break;
-	return n;
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+			nr_cpumask_bits, n + 1);
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 6ed74f78380c..8d87d6cd2541 100644

[PATCH v2] lib: optimize cpumask_next_and()

2017-10-26 Thread Clement Courbet
Hi Alexey,

> Gentoo ships 5.4.0 which doesn't inline this code on x86_64 defconfig
> (which has OPTIMIZE_INLINING).

I have not actually marked _find_next_bit() inline; it just turns out
that my compiler inlines it.
I've tried out marking the function inline and OPTIMIZE_INLINING does
not un-inline it. I'll send a v4 with an explicit inline.
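
Why inlining matters for a shared helper can be sketched as follows. This is a userspace illustration under stated assumptions, not the code from the patch: find_next_common and the three wrappers are hypothetical names, and __builtin_ctzl is assumed to be available (GCC and Clang provide it). Each wrapper passes compile-time constants for the second bitmap and the invert mask, so once the helper is inlined the compiler can drop the NULL test and the XOR wherever they are no-ops.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical common helper: search (addr1 & addr2) ^ invert for a set
 * bit at position >= start. addr2 may be NULL, meaning "no second map".
 */
static inline size_t find_next_common(const unsigned long *addr1,
				      const unsigned long *addr2,
				      size_t nbits, size_t start,
				      unsigned long invert)
{
	size_t word, bit;
	unsigned long tmp;

	assert(addr1 != NULL);
	if (start >= nbits)
		return nbits;

	word = start / BITS_PER_WORD;
	tmp = addr1[word];
	if (addr2)
		tmp &= addr2[word];
	tmp ^= invert;
	/* Ignore the bits below start in the first word. */
	tmp &= ~0UL << (start % BITS_PER_WORD);

	while (!tmp) {
		word++;
		if (word * BITS_PER_WORD >= nbits)
			return nbits;
		tmp = addr1[word];
		if (addr2)
			tmp &= addr2[word];
		tmp ^= invert;
	}

	bit = word * BITS_PER_WORD + (size_t)__builtin_ctzl(tmp);
	return bit < nbits ? bit : nbits;
}

/* Thin wrappers: every extra argument is a constant at the call site. */
static size_t next_set(const unsigned long *addr, size_t nbits, size_t start)
{
	return find_next_common(addr, NULL, nbits, start, 0UL);
}

static size_t next_zero(const unsigned long *addr, size_t nbits, size_t start)
{
	return find_next_common(addr, NULL, nbits, start, ~0UL);
}

static size_t next_and(const unsigned long *addr1, const unsigned long *addr2,
		       size_t nbits, size_t start)
{
	return find_next_common(addr1, addr2, nbits, start, 0UL);
}
```

If the helper is not inlined, every call pays for the addr2 test and the XOR at run time, which is the situation the explicit inline is meant to rule out regardless of the compiler's own heuristics.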




[PATCH v3] lib: optimize cpumask_next_and()

2017-10-26 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join). Direct benchmarking shows that it's 1.17x to 14x faster with a
geometric mean of 2.1 on 32 CPUs. No impact on memory usage.

Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
---------
   geo.mean  2.06

Signed-off-by: Clement Courbet <cour...@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit(), as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

Changes in v3:
 - Fixes find_next_and_bit() declaration.
 - Synchronize _find_next_bit_le() with _find_next_bit()
 - Synchronize the code in tools/lib/find_bit.c
 - Add find_next_and_bit to guard code
 - Fix invert value (bad sync with our internal tree on which I'm doing
   the testing).

 include/asm-generic/bitops/find.h       | 16 ++
 include/linux/bitmap.h                  |  2 ++
 lib/cpumask.c                           |  9 +++---
 lib/find_bit.c                          | 55 -
 tools/include/asm-generic/bitops/find.h | 16 ++
 tools/lib/find_bit.c                    | 40 ++--
 6 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 998d4d544f18..130962f3a264 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -15,6 +15,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
 		size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+   const unsigned long *addr2, unsigned long size,
+   unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 700cf5f67118..b4606bfda52f 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -77,6 +77,8 @@
  * find_first_bit(addr, nbits)                  Position first set bit in *addr
  * find_next_zero_bit(addr, nbits, bit)         Position next zero bit in *addr >= bit
  * find_next_bit(addr, nbits, bit)              Position next set bit in *addr >= bit
+ * find_next_and_bit(addr1, addr2, nbits, bit)  Same as find_first_bit, but in
+ *                                              (*addr1 & *addr2)
  */
 
 /*
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 8b1a1bd77539..5602223837fa 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -32,10 +32,11 @@ EXPORT_SYMBOL(cpumask_next);
 int cpumask_next_and(int n, const struct cpumask *src1p,
 		     const struct cpumask *src2p)
 {
-	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
-		if (cpumask_test_cpu(n, src2p))
-			break;
-	return n;
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+			nr_cpumask_bits, n + 1);
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 6ed74f78380c..dfcea66b84f4 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@

Re: [PATCH v2] lib: optimize cpumask_next_and()

2017-10-25 Thread Clement Courbet
Thanks for the comments Yury.

> But I'd like also to keep _find_next_bit() consistent with
> _find_next_bit_le()

Not sure I understand what you're suggesting here: do you want a
find_next_and_bit_le(), or do you want to make _find_next_bit_le() more
like _find_next_bit()? In the latter case we might just want to merge
it with _find_next_bit() and end up with an extra is_le parameter :)



[PATCH v2] lib: optimize cpumask_next_and()

2017-10-24 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search for a non-zero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, lookup the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join). Direct benchmarking shows that it's 1.17x to 14x faster with a
geometric mean of 2.1 on 32 CPUs. No impact on memory usage.

Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
---------
   geo.mean  2.06

Signed-off-by: Clement Courbet <cour...@google.com>
---
Changes in v2:
 - Refactored _find_next_common_bit into _find_next_bit(), as suggested
   by Yury Norov. This has no adverse effects on the performance side,
   as the compiler successfully inlines the code.

 include/asm-generic/bitops/find.h       | 16 ++
 include/linux/bitmap.h                  |  2 ++
 lib/cpumask.c                           |  9
 lib/find_bit.c                          | 37 +
 tools/include/asm-generic/bitops/find.h | 16 ++
 5 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 998d4d544f18..130962f3a264 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -15,6 +15,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
 		size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+   const unsigned long *addr2, unsigned long size,
+   unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 700cf5f67118..b4606bfda52f 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -77,6 +77,8 @@
  * find_first_bit(addr, nbits)                  Position first set bit in *addr
  * find_next_zero_bit(addr, nbits, bit)         Position next zero bit in *addr >= bit
  * find_next_bit(addr, nbits, bit)              Position next set bit in *addr >= bit
+ * find_next_and_bit(addr1, addr2, nbits, bit)  Same as find_first_bit, but in
+ *                                              (*addr1 & *addr2)
  */
 
 /*
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 8b1a1bd77539..5602223837fa 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -32,10 +32,11 @@ EXPORT_SYMBOL(cpumask_next);
 int cpumask_next_and(int n, const struct cpumask *src1p,
 		     const struct cpumask *src2p)
 {
-	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
-		if (cpumask_test_cpu(n, src2p))
-			break;
-	return n;
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+			nr_cpumask_bits, n + 1);
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 6ed74f78380c..ebc08fd9fdf8 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -24,19 +24,25 @@
 #if !defined(find_next_bit) || !defined(find_next_zero_bit)
 
 /*
- * This is a common helper function for find_next_bit and
- * find_next_zero_bit.  The difference is the "invert" argument, which
- * is XORed with each fetched word before searching it for one bits.
+ * This is a common helper function for find_next_bit, find_next_z

[PATCH] lib: optimize cpumask_next_and()

2017-10-23 Thread Clement Courbet
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
It's essentially a joined iteration in search of a nonzero bit, which
is currently implemented as a lookup join (find a nonzero bit on the
lhs, then look up the rhs to see if it's set there).

Implement a direct join (find a nonzero bit on the incrementally built
join). Direct benchmarking shows that it's 1.17x to 14x faster with a
geometric mean of 2.1 on 32 CPUs. No impact on memory usage.
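
To make the two strategies concrete, here is a minimal userspace sketch of
both joins. It is not the kernel code: the names next_and_direct,
next_and_lookup, next_bit and bit_is_set are hypothetical, and the bitmaps
are assumed to be whole arrays of unsigned long covering nbits.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Direct join: AND one word from each map, then search the combined word. */
static unsigned long next_and_direct(const unsigned long *a,
                                     const unsigned long *b,
                                     unsigned long nbits, unsigned long start)
{
    unsigned long idx, word, bit = 0;

    assert(a && b);
    if (start >= nbits)
        return nbits;

    idx = start / BITS_PER_LONG;
    /* Drop the bits below 'start' in the first word. */
    word = a[idx] & b[idx] & (~0UL << (start % BITS_PER_LONG));

    while (!word) {
        idx++;
        if (idx * BITS_PER_LONG >= nbits)
            return nbits;
        word = a[idx] & b[idx];
    }

    /* Portable "count trailing zeros" on a nonzero word. */
    while (!((word >> bit) & 1UL))
        bit++;

    bit += idx * BITS_PER_LONG;
    return bit < nbits ? bit : nbits;
}

/* Next set bit in a single map: a map joined with itself (a & a == a). */
static unsigned long next_bit(const unsigned long *a, unsigned long nbits,
                              unsigned long start)
{
    return next_and_direct(a, a, nbits, start);
}

static int bit_is_set(const unsigned long *map, unsigned long bit)
{
    return (int)((map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL);
}

/* Lookup join: find a set bit in a, then probe b; repeat on a miss. */
static unsigned long next_and_lookup(const unsigned long *a,
                                     const unsigned long *b,
                                     unsigned long nbits, unsigned long start)
{
    while ((start = next_bit(a, nbits, start)) < nbits) {
        if (bit_is_set(b, start))
            break;
        start++;
    }
    return start;
}
```

Both functions return the same answer for every start value. The lookup
version restarts a search in a for every bit that is set in a but clear in
b, which is the cost the direct version avoids.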

Approximate benchmark code:

```
  unsigned long src1p[nr_cpumask_longs] = {pattern1};
  unsigned long src2p[nr_cpumask_longs] = {pattern2};
  for (/*a bunch of repetitions*/) {
    for (int n = -1; n <= nr_cpu_ids; ++n) {
      asm volatile("" : "+rm"(src1p)); // prevent any optimization
      asm volatile("" : "+rm"(src2p));
      unsigned long result = cpumask_next_and(n, src1p, src2p);
      asm volatile("" : "+rm"(result));
    }
  }
```
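
The empty asm statements in the snippet above act as optimizer barriers. A
minimal standalone sketch of the idiom follows, assuming GCC or Clang
extended asm. The names hide_from_optimizer, work and run_benchmark_body
are hypothetical, work() is only a stand-in for the function under test,
and the sketch uses a "+r" constraint on a scalar rather than the "+rm"
used above.

```c
/*
 * Empty asm with a read-write operand: the compiler must assume the
 * value may have changed, so it cannot constant-fold or hoist across
 * this point. No instruction is emitted for the asm itself.
 */
static inline unsigned long hide_from_optimizer(unsigned long v)
{
    __asm__ volatile("" : "+r"(v));
    return v;
}

/* Hypothetical stand-in for the function being benchmarked. */
static unsigned long work(unsigned long x)
{
    return x * 3 + 1;
}

/* Loop body shaped like the benchmark: hide inputs, call, hide result. */
static unsigned long run_benchmark_body(unsigned long reps)
{
    unsigned long i, acc = 0;

    for (i = 0; i < reps; i++) {
        unsigned long in = hide_from_optimizer(i);
        unsigned long out = work(in);

        acc += hide_from_optimizer(out);
    }
    return acc;
}
```

The barrier changes no values; it only keeps the compiler from proving
what they are, so each call is actually performed inside the timed loop.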

Results:
pattern1  pattern2  time_before/time_after
0x  0x   1.65
0x  0x   2.24
0x  0x   2.94
0x  0x   14.0
0x  0x   1.67
0x  0x   1.71
0x  0x   1.90
0x  0x   6.58
0x  0x   1.46
0x  0x   1.49
0x  0x   1.45
0x  0x   3.10
0x  0x   1.18
0x  0x   1.18
0x  0x   1.17
0x  0x   1.25
---------
   geo.mean  2.06
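
The geo.mean row can be re-derived from the sixteen ratios listed above. A
small sketch, with hypothetical names (newton_sqrt, geo_mean16, speedups):
because there are exactly 16 values, the 16th root is taken as four
successive square roots, and a Newton iteration is used so the example does
not depend on libm.

```c
#include <stddef.h>

/* Newton's iteration for sqrt(x), x >= 0; avoids linking against libm. */
static double newton_sqrt(double x)
{
    double r = x > 1.0 ? x : 1.0;
    int i;

    for (i = 0; i < 64; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Geometric mean of exactly 16 values: 16th root == sqrt applied 4 times. */
static double geo_mean16(const double v[16])
{
    double prod = 1.0;
    size_t i;

    for (i = 0; i < 16; i++)
        prod *= v[i];
    for (i = 0; i < 4; i++)
        prod = newton_sqrt(prod);
    return prod;
}

/* time_before/time_after ratios, copied from the table above. */
static const double speedups[16] = {
    1.65, 2.24, 2.94, 14.0,
    1.67, 1.71, 1.90, 6.58,
    1.46, 1.49, 1.45, 3.10,
    1.18, 1.18, 1.17, 1.25,
};
```

Calling geo_mean16(speedups) gives a value that rounds to the 2.06 reported
in the table.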

Signed-off-by: Clement Courbet <cour...@google.com>
---
 include/asm-generic/bitops/find.h       | 16 ++++++++++++++++
 include/linux/bitmap.h                  |  2 ++
 lib/cpumask.c                           |  9 +++++----
 lib/find_bit.c                          | 34 ++++++++++++++++++++++++++++++++++
 tools/include/asm-generic/bitops/find.h | 16 ++++++++++++++++
 5 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
index 998d4d544f18..130962f3a264 100644
--- a/include/asm-generic/bitops/find.h
+++ b/include/asm-generic/bitops/find.h
@@ -15,6 +15,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
         size, unsigned long offset);
 #endif
 
+#ifndef find_next_and_bit
+/**
+ * find_next_and_bit - find the next set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+extern unsigned long find_next_and_bit(const unsigned long *addr1,
+   const unsigned long *addr2, unsigned long size,
+   unsigned long offset);
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 700cf5f67118..b4606bfda52f 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -77,6 +77,8 @@
  * find_first_bit(addr, nbits) Position first set bit in *addr
  * find_next_zero_bit(addr, nbits, bit)  Position next zero bit in *addr >= bit
  * find_next_bit(addr, nbits, bit) Position next set bit in *addr >= bit
+ * find_next_and_bit(addr1, addr2, nbits, bit) Same as find_next_bit, but in
+ * (*addr1 & *addr2)
  */
 
 /*
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 8b1a1bd77539..5602223837fa 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -32,10 +32,11 @@ EXPORT_SYMBOL(cpumask_next);
 int cpumask_next_and(int n, const struct cpumask *src1p,
 const struct cpumask *src2p)
 {
-   while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
-       if (cpumask_test_cpu(n, src2p))
-           break;
-   return n;
+   /* -1 is a legal arg here. */
+   if (n != -1)
+       cpumask_check(n);
+   return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+           nr_cpumask_bits, n + 1);
 }
 EXPORT_SYMBOL(cpumask_next_and);
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 6ed74f78380c..83ea8b97ed3e 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -75,6 +75,40 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 EXPORT_SYMBOL(find_next_zero_bit);
 #endif
 
+#if !defined(find_next_and_bit)
+
+/*
+ * Find the next bit that is set in both memory regions.
+ */
+unsigned long find_next_and_bit(const unsigned long *addr1,
+   const unsigned long *addr2, unsigned long nbits,
+   unsigned long start)
+{
+   unsigned long tmp;
+
+   if (!nbits || start >= nbits)
+       return nbits;
+
+   tmp = addr1[start / BITS_PER_LONG] & addr2[start / BITS_PER_LONG];
