On 8/16/23 13:10, Alexander Monakov wrote:
On Tue, 15 Aug 2023, Jeff Law wrote:
Because if the compiler can optimize it automatically, then the projects have
to do literally nothing to take advantage of it. They just compile normally
and their bitwise CRC gets optimized down to either a table lookup or a clmul
variant. That's the real goal here.
The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':
https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c
It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).
If they don't want the table variant, there would obviously be ways to
turn that off. It's essentially no different from any speed-improving
optimization that makes the code larger.
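To make the size/speed trade-off concrete, here is a minimal sketch of the two forms under discussion: a bitwise CRC loop and the table-driven form a compiler could substitute for it. It uses the common reflected CRC-32 polynomial 0xEDB88320 purely for illustration; the function names are hypothetical, and this is neither the xz code nor what the proposed pass actually emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32 polynomial, chosen only as a familiar example. */
#define CRC32_POLY 0xEDB88320u

/* Bitwise form: no table, eight shift/xor steps per input byte. */
uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32_POLY & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Table form: 1 KiB of data, one lookup per input byte. */
static uint32_t crc32_table[256];

/* Fill the table with the same bitwise step used above. */
void crc32_table_init(void)
{
    for (uint32_t b = 0; b < 256; b++) {
        uint32_t r = b;
        for (int bit = 0; bit < 8; bit++)
            r = (r >> 1) ^ (CRC32_POLY & (0u - (r & 1u)));
        crc32_table[b] = r;
    }
}

uint32_t crc32_table_driven(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = (crc >> 8) ^ crc32_table[(crc ^ buf[i]) & 0xFFu];
    return ~crc;
}
```

Both functions return the same value for any input; the table form trades 1 KiB of data for roughly eight times fewer steps per byte, which is the trade a size-conscious build may not want.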
One final note. Elsewhere in this thread you described performance concerns.
Right now clmuls can be implemented in 4c, fully piped.
Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to
        li      a4,quotient
        li      a5,polynomial
        xor     a0,a1,a0
        clmul   a0,a0,a4
        srli    a0,a0,crc_size
        clmul   a0,a0,a5
        slli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
        srli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMULs). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
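The dependency chain in the sequence above can be followed in a small C model. This is only a sketch under stated assumptions: clmul() is a portable software stand-in for the instruction (low 64 bits of the carry-less product), the CRC is a non-reflected CRC-8 with polynomial 0x07 consuming one byte per step, and the quotient constant is taken to be floor(x^(2*crc_size) / P), which is what this two-multiply (Barrett-style) reduction needs. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { CRC_SIZE = 8, WORD_BITS = 64 };
#define CRC8_POLY 0x07u   /* x^8 + x^2 + x + 1, leading term implicit */

/* Software stand-in for clmul: low 64 bits of the carry-less product. */
static uint64_t clmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < WORD_BITS; i++)
        if ((b >> i) & 1u)
            r ^= a << i;
    return r;
}

/* Assumed "quotient" constant: floor(x^(2*CRC_SIZE) / P) over GF(2). */
static uint64_t barrett_quotient(void)
{
    const uint64_t full = (1ull << CRC_SIZE) | CRC8_POLY;
    uint64_t q = 0, rem = 1ull << (2 * CRC_SIZE);
    for (int i = 2 * CRC_SIZE; i >= CRC_SIZE; i--) {
        if ((rem >> i) & 1u) {
            q |= 1ull << (i - CRC_SIZE);
            rem ^= full << (i - CRC_SIZE);
        }
    }
    return q;
}

/* One pass of the loop body mirrors the quoted sequence; every
   operation consumes the result of the one before it. */
uint8_t crc8_clmul(const uint8_t *buf, size_t len)
{
    const uint64_t quotient = barrett_quotient();
    const uint64_t polynomial = CRC8_POLY;
    uint64_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint64_t a0 = crc ^ buf[i];        /* xor   */
        a0 = clmul(a0, quotient);          /* clmul */
        a0 >>= CRC_SIZE;                   /* srli  */
        a0 = clmul(a0, polynomial);        /* clmul */
        a0 <<= WORD_BITS - CRC_SIZE;       /* slli  */
        a0 >>= WORD_BITS - CRC_SIZE;       /* srli  */
        crc = a0;
    }
    return (uint8_t)crc;
}

/* Bitwise reference used to cross-check the model. */
uint8_t crc8_bitwise(const uint8_t *buf, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ CRC8_POLY)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}
```

Assuming single-cycle xor and shifts and a 3-cycle clmul, the chain per block is 1 + 3 + 1 + 3 + 1 + 1 = 10 cycles, and the next block cannot start until crc is available. That is why latency, not throughput, bounds this form.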
I expect it'll actually be 2c latency. We're approaching the point
where it just won't make that much sense to call out to a library when
you can emit the pair of clmuls and a couple shifts.
jeff