https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116919

--- Comment #3 from Jeffrey A. Law <law at gcc dot gnu.org> ---
I don't know your extension set or pipeline, but one additional thing that
might improve things further would be to adjust the RISC-V expansion code to
alternate between a table lookup and a clmul variant.

What we found with our uarch was that the clmul variant would pretty quickly
fill up the queue for the execution unit that handles clmul (among various
other multi-cycle instructions, including returns).  Given that the table
lookup performs roughly the same as clmul on our design, the thought was
to ping-pong between the table lookup and clmul.  The
expectation is that when the crcu8 is used to compose crcu16/crcu32 we'd
ultimately get better performance as we're a lot less likely to serialize
waiting for space in that key queue.  This just hasn't gotten to the top of my
queue to evaluate.
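
As a source-level illustration of the ping-pong idea (not the GCC expander
code, and with no performance claim attached), here is a minimal C sketch.
Everything in it is an assumption made for the example: the CRC-8 polynomial
0x07, the MSB-first bit order, the zero initial value, and all function names.
The "clmul" step uses a portable software carry-less multiply with a Barrett
style reduction in place of the hardware instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC8_POLY 0x07u  /* assumed: x^8 + x^2 + x + 1, MSB-first */

static uint8_t  crc8_table[256];  /* filled by crc8_init() */
static uint32_t crc8_mu;          /* floor(x^16 / P), filled by crc8_init() */

/* Bit-at-a-time reference step, also used to build the table. */
static uint8_t crc8_step_bitwise(uint8_t crc, uint8_t data)
{
    crc ^= data;
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ CRC8_POLY)
                            : (uint8_t)(crc << 1);
    return crc;
}

/* Portable stand-in for a carry-less multiply instruction. */
static uint32_t clmul32(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1u)
            r ^= a << i;
    return r;
}

/* Build the lookup table and the Barrett constant for the 9-bit polynomial. */
static void crc8_init(void)
{
    for (int i = 0; i < 256; i++)
        crc8_table[i] = crc8_step_bitwise(0, (uint8_t)i);

    uint32_t full = 0x100u | CRC8_POLY;
    uint32_t rem = 1u << 16, q = 0;
    for (int bit = 16; bit >= 8; bit--) {
        if (rem & (1u << bit)) {
            q |= 1u << (bit - 8);
            rem ^= full << (bit - 8);
        }
    }
    crc8_mu = q;
}

/* One CRC-8 step via two carry-less multiplies (Barrett reduction). */
static uint8_t crc8_step_clmul(uint8_t crc, uint8_t data)
{
    uint32_t a = (uint8_t)(crc ^ data);
    uint32_t q = clmul32(a, crc8_mu) >> 8;          /* quotient of a*x^8 / P */
    return (uint8_t)clmul32(q, 0x100u | CRC8_POLY); /* low 8 bits = remainder */
}

/* Ping-pong: even bytes use the table, odd bytes use the clmul step. */
static uint8_t crc8_alternating(const uint8_t *buf, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = (i & 1u) ? crc8_step_clmul(crc, buf[i])
                       : crc8_table[crc ^ buf[i]];
    return crc;
}
```

Call crc8_init() once before using the other functions.  Both step variants
compute the same value, so which one handles which byte is purely a
scheduling choice; whether alternating actually relieves pressure on the
clmul queue is the part that still needs to be measured.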

I don't have a counterexample handy, but it would certainly involve a constant
that was loadable in a single instruction when sign extended, but required two
when zero extended.  It might be as simple as using 0x8001 in your code rather
than 0x8000.  But I haven't tested that.

Combine can handle multiple uses, but it'd wrap them in a PARALLEL which would
be unhelpful.  However late-combine and possibly fwprop would be able to use
that kind of define_insn_and_split effectively when there are multiple uses.
