On 8/8/23 10:38, Alexander Monakov wrote:

On Tue, 8 Aug 2023, Jeff Law wrote:

That was my thinking at one time.  Then we started looking at the distros and
found enough crc implementations in there to change my mind about the overall
utility.

The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence are already fairly efficient compared to the
bitwise loop in CoreMark).
We found dozens that were the usual-looking bitwise loops and, IIRC, ~200 table-lookup implementations after analyzing about half of the packages in Fedora.
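
For readers who have not seen the two shapes side by side, here is a minimal sketch of a "usual-looking" bitwise loop and its table-lookup equivalent. The choice of the reflected CRC-32 polynomial (0xEDB88320) and all function names are assumptions made for illustration; they are not taken from any of the Fedora packages mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise form: eight shift/xor steps per input byte. */
static uint32_t crc32_bitwise(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            /* Xor in the polynomial only when the low bit is set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Table form: the inner loop is precomputed for every possible byte. */
static uint32_t crc32_table[256];

static void crc32_init_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
        crc32_table[n] = c;
    }
}

static uint32_t crc32_tablewise(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        /* One lookup replaces the eight-iteration inner loop. */
        crc = crc32_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}
```

Both functions return the same value for the same input once crc32_init_table() has been called; the table form trades 1 KiB of data for roughly one eighth of the per-byte work.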



So... just provide a library? Library code is easier to develop and audit,
it can be released independently, people can use it with their compiler of
choice. Not everything needs to be in libgcc.
If the compiler can identify a CRC and collapse it down to a table or clmul, that's a major win and such code does exist in the real world. That was the whole point behind the Fedora experiment -- to determine if these things are showing up in the real world or if this is just a benchmarking exercise.

And just to be clear, we're not proposing anything for libgcc.


I'm talking about factoring a long chain into multiple independent chains
for latency hiding.
And that could potentially be an extension. But even without this, a standard-looking CRC loop will be much faster using table lookups or simple generation with clmul.

Also note that latency of clmuls is improving on modern hardware. 4c isn't hard to achieve and I wouldn't be surprised to see 2c clmuls in the near future.
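
To make "simple generation with clmul" concrete, here is a rough sketch of a single-byte CRC-8 update done with two carry-less multiplies and a Barrett-style reduction. Everything in it is an assumption for illustration: the polynomial (x^8 + x^2 + x + 1), the function names, and in particular clmul_sw, which is a software stand-in for a hardware instruction such as x86 PCLMULQDQ or RISC-V clmul. It is not the expansion the proposed patch generates.

```c
#include <stdint.h>

#define CRC8_POLY 0x107u /* x^8 + x^2 + x + 1, leading term included */

/* Software carry-less multiply (hypothetical helper); a hardware
   clmul instruction would replace this loop. */
static uint32_t clmul_sw(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        if ((b >> i) & 1u)
            r ^= a << i;
    return r;
}

/* mu = floor(x^16 / P) over GF(2), computed by long division.
   A compiler would fold this to a constant. */
static uint32_t barrett_mu(void) {
    uint32_t rem = 1u << 16, q = 0;
    for (int i = 16; i >= 8; i--)
        if ((rem >> i) & 1u) {
            q |= 1u << (i - 8);
            rem ^= CRC8_POLY << (i - 8);
        }
    return q;
}

/* Reference: the usual MSB-first bitwise update for one byte. */
static uint8_t crc8_step_bitwise(uint8_t crc, uint8_t data) {
    crc ^= data;
    for (int k = 0; k < 8; k++)
        crc = (uint8_t)((crc & 0x80u) ? ((crc << 1) ^ 0x07u) : (crc << 1));
    return crc;
}

/* Same update as (crc ^ data) * x^8 mod P, using two multiplies. */
static uint8_t crc8_step_clmul(uint8_t crc, uint8_t data) {
    uint32_t a = (uint32_t)(crc ^ data);
    uint32_t q = clmul_sw(a, barrett_mu()) >> 8;      /* quotient */
    return (uint8_t)(clmul_sw(q, CRC8_POLY) & 0xFFu); /* remainder */
}
```

The dependent chain per element is two multiplies plus a shift and a mask, which is why the clmul latency figures above matter for this form.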



Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?
The point is that __builtin_crc would always be available. If there is no clmul, then the RTL backend can expand it to a table-lookup version.


while at the same time putting one side of the infrastructure we need for
automatic detection of CRC loops and turning them into table lookups or
CLMULs.

With that in mind I'm certain Mariam & I would love feedback on a builtin API
that would be more useful.

I think offering a conventional library for CRC has substantial advantages.
That's not what I asked. If you think there's room for improvement to a builtin API, I'd love to hear it.

But it seems you don't think this is worth the effort at all. That's unfortunate, but if that's the consensus, then so be it.

I'll note LLVM is likely going forward with CRC detection and optimization at some point in the next ~6 months (effectively moving the implementation from the Hexagon port into the generic parts of their loop optimizer).



Jeff
