Re: considering L1 cache

2024-05-13 Thread Jim Meyering
You might be interested in this implementation:
https://engineering.fb.com/2019/04/25/developer-tools/f14/

On Mon, May 13, 2024 at 4:24 PM Bruno Haible  wrote:
>
> Paul Eggert wrote:
> > I installed the
> > attached. This is probably a win (over de Bruijn too), at least for some
> > apps and platforms, though I haven't benchmarked.
>
> Thanks! Replacing a table access with ca. 7 arithmetic instructions
> is definitely a win.
>
> I also love how this code makes use of conditional or carry instructions,
> avoiding conditional jumps, on most CPUs.
>
> Bruno
>



Re: considering L1 cache

2024-05-13 Thread Bruno Haible
Paul Eggert wrote:
> I installed the 
> attached. This is probably a win (over de Bruijn too), at least for some
> apps and platforms, though I haven't benchmarked.

Thanks! Replacing a table access with ca. 7 arithmetic instructions
is definitely a win.

I also love how this code makes use of conditional or carry instructions,
avoiding conditional jumps, on most CPUs.

Bruno


Re: considering L1 cache

2024-05-13 Thread Paul Eggert

On 5/13/24 09:17, Bruno Haible wrote:

> The reason is that such a 256-byte table will tend to occupy 256 bytes in the
> CPU's L1 cache, and thus reduce the ability of other code to use the L1 cache.


Yes, it partly depends on whether the function is called a lot (so the 
256-byte table is in the cache) or rarely (so it's not). I had optimized 
for the former.


The code can be changed to not use an extern data table. I installed the
attached. This is probably a win (over de Bruijn too), at least for some
apps and platforms, though I haven't benchmarked.

From 1b7e82be9dd80fbb67a5567cdf1d60d36dd40db2 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Mon, 13 May 2024 15:21:55 -0700
Subject: [PATCH] stdbit: redo clzll without lookup table

* lib/stdbit.c (__gl_stdbit_clztab):
* lib/stdbit.in.h (__gl_stdbit_clzll):
[!_GL_STDBIT_HAS_BUILTIN_CLZ && !_MSC_VER]: Rewrite to avoid the
need for a lookup table in memory, and remove the lookup table.
Do this by shrinking the table to 64 bits and putting it in a 64-bit
constant.  Although this needs another round of shifts, it avoids
the need for a multiplication and memory access a la de Bruijn,
and is probably a win.
---
 ChangeLog   | 12 
 lib/stdbit.c| 24 
 lib/stdbit.in.h |  5 ++---
 3 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 18aab772fc..3e4eba3796 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,15 @@
+2024-05-13  Paul Eggert  
+
+	stdbit: redo clzll without lookup table
+	* lib/stdbit.c (__gl_stdbit_clztab):
+	* lib/stdbit.in.h (__gl_stdbit_clzll):
+	[!_GL_STDBIT_HAS_BUILTIN_CLZ && !_MSC_VER]: Rewrite to avoid the
+	need for a lookup table in memory, and remove the lookup table.
+	Do this by shrinking the table to 64 bits and putting it in a 64-bit
+	constant.  Although this needs another round of shifts, it avoids
+	the need for a multiplication and memory access a la de Bruijn,
+	and is probably a win.
+
 2024-05-13  Bruno Haible  
 
 	stdbit tests: Adhere better to Gnulib naming conventions.
diff --git a/lib/stdbit.c b/lib/stdbit.c
index f2a51b10f7..a9f0ef5074 100644
--- a/lib/stdbit.c
+++ b/lib/stdbit.c
@@ -22,30 +22,6 @@
 #define _GL_STDBIT_INLINE _GL_EXTERN_INLINE
 #include <stdbit.h>
 
-#if !defined _GL_STDBIT_HAS_BUILTIN_CLZ && !_MSC_VER
-/* __gl_stdbit_clztab[B] is the number of leading zeros in
-   the 8-bit byte with value B.  */
-char const __gl_stdbit_clztab[256] =
-  {
-8,
-7,
-6, 6,
-5, 5, 5, 5,
-4, 4, 4, 4, 4, 4, 4, 4,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-
-/* The rest is zero.  */
-  };
-#endif
-
 #if 1500 <= _MSC_VER && (defined _M_IX86 || defined _M_X64)
 signed char __gl_stdbit_popcount_support;
 #endif
diff --git a/lib/stdbit.in.h b/lib/stdbit.in.h
index 15533dbbda..18efaf2d7c 100644
--- a/lib/stdbit.in.h
+++ b/lib/stdbit.in.h
@@ -122,8 +122,6 @@ __gl_stdbit_clzll (unsigned long long int n)
 
 #else /* !_MSC_VER */
 
-extern char const __gl_stdbit_clztab[256];
-
 _GL_STDBIT_INLINE int
 __gl_stdbit_clzll (unsigned long long int n)
 {
@@ -135,7 +133,8 @@ __gl_stdbit_clzll (unsigned long long int n)
   int a5 = (0xffffffff < n) << 5; n >>= a5; r += a5;
   int a4 = (0xffff < n) << 4; n >>= a4; r += a4;
   int a3 = (0x00ff < n) << 3; n >>= a3; r += a3;
-  return (8 * (sizeof n - 1) - r) + __gl_stdbit_clztab[n];
+  int a2 = (0x000f < n) << 2; n >>= a2; r += a2;
+  return (8 * sizeof n - (1 << 2) - r) + ((0x11112234ull >> (n << 2)) & 0xf);
 }
 _GL_STDBIT_INLINE int
 __gl_stdbit_clz (unsigned int n)
-- 
2.45.0



Re: considering L1 cache

2024-05-13 Thread Bruno Haible
Hi Paul,

> substituting something
> more straightforward than a de Bruijn hash (possibly faster?).
> ...
> +#if !defined _GL_STDBIT_HAS_BUILTIN_CLZ && !_MSC_VER
> +/* __gl_stdbit_clztab[B] is the number of leading zeros in
> +   the 8-bit byte with value B.  */
> +char const __gl_stdbit_clztab[256] =
> +  {
> +8,
> +7,
> +6, 6,
> +5, 5, 5, 5,
> +4, 4, 4, 4, 4, 4, 4, 4,
> +3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
> +
> +2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> +2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
> +
> +1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> +
> +/* The rest is zero.  */
> +  };
> +#endif

I can't agree with the claim that a lookup in a 256-byte table is preferable
to a de Bruijn hash because it is "faster".

The reason is that such a 256-byte table will tend to occupy 256 bytes in the
CPU's L1 cache, and thus reduce the ability of other code to use the L1 cache.
Knowing that an L1 cache miss is about 200x slower than an L1 cache hit, this
is a big penalty for other code.

The code with the de Bruijn hash uses 32 bytes from the L1 cache, 8 times less.

It is normal that in microbenchmarks such L1 cache considerations are ignored,
and thus code which uses fewer instructions but more memory comes out faster.
But that does not mean that it is preferable.

Another data point is that it's well-known that excessive inlining by a compiler
causes a slowdown. I don't know whether this is due to worse register allocation
(i.e. GCC spilling more variables to the stack rather than keeping them in
registers) or to increased code size. Do you have data on this?

Having said that, I have no objection to the patch, because most platforms
use GCC or clang-derived compilers nowadays.

Bruno