On 8/16/23 13:10, Alexander Monakov wrote:

On Tue, 15 Aug 2023, Jeff Law wrote:

Because if the compiler can optimize it automatically, then the projects have
to do literally nothing to take advantage of it.  They just compile normally
and their bitwise CRC gets optimized down to either a table lookup or a clmul
variant.  That's the real goal here.
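
For readers following along, here is a minimal C sketch of the two forms being discussed: a bitwise CRC loop of the kind a project might carry, and the table-lookup form it could be turned into. The function names, the MSB-first bit order, and the CRC-32 polynomial 0x04C11DB7 are assumptions chosen for illustration; this is not the code the proposed pass recognizes or emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed polynomial for illustration (CRC-32, MSB-first, no reflection). */
#define CRC32_POLY 0x04C11DB7u

/* Bitwise form: one conditional shift/xor per input bit. */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32_POLY : crc << 1;
    }
    return crc;
}

/* 256-entry table: entry b is the bitwise CRC of the single byte b. */
static uint32_t crc32_table[256];

static void crc32_init_table(void)
{
    for (uint32_t b = 0; b < 256; b++) {
        uint32_t r = b << 24;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 0x80000000u) ? (r << 1) ^ CRC32_POLY : r << 1;
        crc32_table[b] = r;
    }
}

/* Table form: one lookup per input byte, at the cost of 1 KiB of data. */
static uint32_t crc32_tablewise(uint32_t crc, const uint8_t *buf, size_t len)
{
    /* Entry 1 equals the polynomial once the table has been initialized. */
    assert(crc32_table[1] == CRC32_POLY);
    for (size_t i = 0; i < len; i++)
        crc = (crc << 8) ^ crc32_table[(crc >> 24) ^ buf[i]];
    return crc;
}
```

Both functions return the same value for the same initial `crc` and input; the table form trades 1 KiB of data for eight fewer loop iterations per byte, which is the size/speed trade-off discussed in this thread.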

The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).
If they don't want the table variant, there would obviously be ways to turn
that off.  It's essentially no different from any speed-improving optimization
that makes things larger.



One final note.  Elsewhere in this thread you described performance concerns.
Right now clmuls can be implemented in 4c, fully piped.

Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

    li      a4,quotient
    li      a5,polynomial
    xor     a0,a1,a0
    clmul   a0,a0,a4
    srli    a0,a0,crc_size
    clmul   a0,a0,a5
    slli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
    srli    a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size


making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.
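
The dependency chain in the expansion above can be modelled in portable C. The sketch below is an illustration under stated assumptions, not the code GCC generates: it uses an 8-bit CRC with polynomial 0x07 over 8-bit data, a software loop standing in for the clmul instruction, and hypothetical helper names (clmul64, gf2_quotient, crc8_clmul). The "quotient" constant is taken to be floor(x^(2n) / P) over GF(2), as in a Barrett-style reduction.

```c
#include <assert.h>
#include <stdint.h>

/* Software stand-in for a carry-less multiply (low 64 bits of the product). */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* floor(x^(2n) / P) over GF(2); poly_full includes the x^n term. */
static uint64_t gf2_quotient(uint64_t poly_full, unsigned n)
{
    assert(n >= 1 && n <= 31);              /* keep x^(2n) inside 64 bits */
    uint64_t rem = (uint64_t)1 << (2 * n);  /* dividend x^(2n) */
    uint64_t q = 0;
    for (int i = (int)n; i >= 0; i--) {
        if ((rem >> (i + (int)n)) & 1) {
            q |= (uint64_t)1 << i;
            rem ^= poly_full << i;
        }
    }
    return q;
}

/* Reference: bitwise CRC-8 step, MSB-first, polynomial x^8 + x^2 + x + 1. */
static uint8_t crc8_bitwise(uint8_t crc, uint8_t data)
{
    crc ^= data;
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
    return crc;
}

/* Same step, following the xor / clmul / srli / clmul / mask sequence. */
static uint8_t crc8_clmul(uint8_t crc, uint8_t data, uint64_t quotient)
{
    uint64_t a = (uint64_t)(crc ^ data);  /* xor                          */
    a = clmul64(a, quotient);             /* first clmul                  */
    a >>= 8;                              /* srli by crc_size             */
    a = clmul64(a, 0x107);                /* second clmul, needs the first */
    return (uint8_t)a;                    /* slli/srli pair: keep low 8 bits */
}

/* Counts (crc, data) pairs where the two steps disagree. */
static int crc8_mismatches(void)
{
    uint64_t q = gf2_quotient(0x107, 8);
    int bad = 0;
    for (int c = 0; c < 256; c++)
        for (int d = 0; d < 256; d++)
            bad += crc8_bitwise((uint8_t)c, (uint8_t)d)
                   != crc8_clmul((uint8_t)c, (uint8_t)d, q);
    return bad;
}
```

Each line of crc8_clmul consumes the result of the previous one, and the next call consumes the returned crc, which is the serial chain described above. Counting one cycle for each of the xor and the three shifts plus two 3-cycle CLMULs gives the 10 cycles per block.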

I expect it'll actually be 2c latency. We're approaching the point where it just won't make that much sense to call out to a library when you can emit the pair of clmuls and a couple shifts.

jeff
