On 8/9/23 00:32, Alexander Monakov wrote:

On Tue, 8 Aug 2023, Jeff Law wrote:

If the compiler can identify a CRC and collapse it down to a table or clmul,
that's a major win and such code does exist in the real world. That was the
whole point behind the Fedora experiment -- to determine if these things are
showing up in the real world or if this is just a benchmarking exercise.

Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).
That experiment was ~7 months ago. I don't think any of the data is still around except for some extracted testcases.


For those projects that need a better CRC, why is the chosen solution
to optimize it in the compiler instead of offering them a library they
could use with any compiler?
Because if the compiler can optimize it automatically, then the projects have to do literally nothing to take advantage of it. They just compile normally and their bitwise CRC gets optimized down to either a table lookup or a clmul variant. That's the real goal here.
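
To make the transformation concrete, here is a minimal sketch in C of the two forms involved: a bitwise CRC-32 loop of the kind a recognizer would have to detect, and a table-driven equivalent. The function names and the choice of the common reflected CRC-32 polynomial are assumptions for illustration only; this is not the code GCC would emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR of
   0xFFFFFFFF).  One conditional XOR per message bit. */
uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* 256-entry table: entry n is the result of running the eight inner-loop
   steps above on the single byte n. */
static uint32_t crc32_table[256];

/* Hypothetical helper that fills the table at startup; a compiler could
   instead emit the table as constant data. */
void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
        crc32_table[n] = c;
    }
}

/* Table-driven form: one lookup per message byte instead of eight
   shift/XOR steps.  Requires crc32_init_table() to have been called. */
uint32_t crc32_tabular(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ buf[i]) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}
```

Both functions return 0xCBF43926 for the nine ASCII bytes "123456789", the standard check value for this CRC-32 variant. The table costs 1 KiB of data, which is the space tradeoff that matters for small embedded targets.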

If a step where we provide the backend bits hooked up to a builtin isn't useful, then we won't pursue it. The thinking was it would provide value for those willing to make a slight change to their sources and at the same time we get real world exposure for the backend work of the CRC optimization effort while we polish the gimple detection bits.




Was there any thought given to embedded projects that use bitwise CRC
exactly because they have little space for a hardcoded table to spare?
It wasn't an explicit goal, but the ability to select between a table implementation and a clmul implementation in the backend seemed useful, so we wired up both.



No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.
You may be right. I don't think it's so clear cut, though.



I think offering a conventional library for CRC has substantial advantages.
That's not what I asked.  If you think there's room for improvement to a
builtin API, I'd love to hear it.

But it seems you don't think this is worth the effort at all.  That's
unfortunate, but if that's the consensus, then so be it.

I think it's a strange application of development effort. You'd get more
done coding a library.
Not if the end goal is to detect the CRC and optimize it into a table or clmul without the user having to do anything special.

Again, what we've proposed in this patch is a piece of that larger body of work, specifically the backend bits that we thought would have value independently. If the community doesn't see that carved out chunk as helpful we'll table it until the whole end-to-end path is ready for submission.



I'll note LLVM is likely going forward with CRC detection and optimization at
some point in the next ~6 months (effectively moving the implementation from
the hexagon port into the generic parts of their loop optimizer).

I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).
Yes, you need the recognizer so that you can detect a CRC loop, then with a bit of math you turn that into a carryless multiply sequence. I find the math here mindbending, but the Hexagon bits are precisely to optimize CRC loops. Sadly the Hexagon bits are fairly specific to the CRC implementation inside coremark. The GCC bits we've been working on are much more general.
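
As a rough illustration of how a division-style CRC can be expressed with carryless multiplies, the sketch below computes a single-byte CRC-8 (polynomial x^8 + x^2 + x + 1, zero initial value) using a Barrett-style reduction: one multiply by a precomputed reciprocal to get the quotient, one multiply by the polynomial to get the remainder. Everything here is an assumption for illustration: `clmul` is a software stand-in for a hardware instruction, the function names are made up, and neither the Hexagon code nor the GCC patches are claimed to work this way. Real implementations process much wider chunks.

```c
#include <stdint.h>

#define POLY 0x107u  /* x^8 + x^2 + x + 1, including the leading term */

/* Software carry-less multiply: polynomial multiplication over GF(2).
   Stand-in for a hardware clmul instruction. */
uint32_t clmul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (; b != 0; b >>= 1, a <<= 1)
        if (b & 1u)
            r ^= a;
    return r;
}

/* Barrett constant floor(x^16 / POLY), computed by GF(2) long division.
   In practice this would be a precomputed compile-time constant. */
uint32_t barrett_mu(void)
{
    uint32_t num = 1u << 16, q = 0;
    for (int bit = 16; bit >= 8; bit--) {
        if (num & (1u << bit)) {
            q |= 1u << (bit - 8);
            num ^= POLY << (bit - 8);
        }
    }
    return q;
}

/* Reference: bitwise CRC-8 of one byte, i.e. (m * x^8) mod POLY. */
uint8_t crc8_bitwise(uint8_t m)
{
    uint8_t crc = m;
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                            : (uint8_t)(crc << 1);
    return crc;
}

/* Same result with two carry-less multiplies and no loop over bits. */
uint8_t crc8_clmul(uint8_t m)
{
    /* Quotient of (m * x^8) / POLY via the reciprocal. */
    uint32_t q = clmul(m, barrett_mu()) >> 8;
    /* Remainder is (m * x^8) XOR q * POLY; the low 8 bits of m * x^8 are
       zero, so only the low byte of q * POLY is needed. */
    return (uint8_t)(clmul(q, POLY) & 0xFFu);
}
```

The quotient step is exact because polynomial arithmetic over GF(2) has no carries, so the usual off-by-one correction of integer Barrett reduction is not needed. `crc8_clmul` and `crc8_bitwise` can be compared over all 256 input bytes as a sanity check.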

One final note. Elsewhere in this thread you described performance concerns. Right now clmuls can be implemented with a 4-cycle latency, fully pipelined. I fully expect that latency to drop within the next 12-18 months. In that world, there's not going to be much benefit to using hand-coded libraries vs just letting the compiler do it.

Jeff
