Hi Sam,
Thanks for working on this!
Sam Russell wrote:
> 85% time reduction on AMD Ryzen 5 5600:
>
> $ ./gltests/bench-crc 1000000
> real 1.740296
> user 1.740
> sys 0.000
>
> $ ../bench-crc-pclmul 1000000
> real 0.248324
> user 0.248
> sys 0.000
>
> This translates to a 13% time reduction for gzip:
>
> $ time ./gzip_sliceby8 -k -d -c large_file.gz > /dev/null
>
> real 0m0.310s
> user 0m0.310s
> sys 0m0.000s
>
> $ time ./gzip_pclmul -k -d -c large_file.gz > /dev/null
>
> real 0m0.267s
> user 0m0.267s
> sys 0m0.000s
This indicates that crc is no longer one of the biggest time consumer for
gzip. Still, a 13% speedup is something we would like to include.
Let me comment regarding the module structure and build system. I'll let
Simon speak up for the rest.
* I would suggest to rename the main source file from crc-pclmul.c to
crc-x86_64.c.
Rationale: So that immediately clear that the code is specific to the
x86_64 CPUs. Not everyone is an assembly language hacker, and even some
assembly language hackers (like me) don't know about the 'pclmul'
instruction and of which ISA it is part of. Whereas everyone knows what
x86_64 is.
* Similarly, I suggest to rename the module 'crc-pclmul' to 'crc-x86_64'.
* I like the fact that you have put the new code in a separate module.
This will allow packages to invoke gnulib-tool with options
--avoid=crc-x86_64 crc
if they don't want to include the CPU specific optimization.
* In the patch, one of the diffs ends with
\ No newline at end of file
Could you please add a newline at the end of that source file?
* In crc.c: Would it make sense to cache the value of pclmul_enabled
in a static variable?
You can probably guess it by looking at the assembly-language code
that gcc -O2 produces for the two __builtin_cpu_supports invocations.
* Are the options -mpclmul -mavx understood by both gcc and clang?
Or does clang use different options for the same thing?
* In the module description, when you write
CFLAGS += -mpclmul -mavx
it has an effect on all compilations of the same Makefile (e.g. in the
case of coreutils, on all the coreutils programs). Which means that the
resulting binaries will *NOT WORK* on CPUs that don't have AVX extensions.
So, these CFLAGS changes should be limited to the one compilation unit
that needs it. Unfortunately, this is not easy to do with Automake (it
requires an intermediate convenience library to be declared), and I don't
know whether gnulib-tool already supports this contortion.
But before I look into it in detail: Is there an alternative to these
command-line options? Maybe there are some #pragmas that have the same
effect?
> but could someone recommend a method for
> detecting alignment so we can safely upscale to a __m128i pointer?
How about ((uintptr_t) ptr & 0xF) == 0
or something like that?
Bruno