On 8/8/23 09:22, Alexander Monakov wrote:

Jeff, as I understand this all is happening only because Coremark contains
use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
That was my thinking at one time. Then we started looking at the distros and found enough crc implementations in there to change my mind about the overall utility.


Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.
If we need to do something to make it more useful, we're certainly open to that.



Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
   modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
   rather than latency-bound. The typical unroll factor is about 4x-8x.
We do have the ability to build longer chains. We actually use that in the coremark benchmark where the underlying primitives are 8-bit CRCs that are composed into 16/32 bit CRCs.


A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).
Absolutely.


Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.
Certianly not the intention. The intention is to provide a useful builtin_crc while at the same time putting one side of the infrastructure we need for automatic detection of CRC loops and turning them into table lookups or CLMULs.

With that in mind I'm certain Mariam & I would love feedback on a builtin API that would be more useful.

jeff

Reply via email to