https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116919
--- Comment #3 from Jeffrey A. Law <law at gcc dot gnu.org> ---
I don't know your extension set or pipeline, but one additional thing that might improve things further would be to adjust the RISC-V expansion code to alternate between a table lookup and a clmul variant.

What we found with our uarch was that the clmul variant would pretty quickly fill up the queue for the execution unit that handles clmul (among various other multi-cycle instructions, including returns). Given that the table lookup is roughly the same performance as clmul for our design, the thought was to ping-pong between using a table lookup and clmul alternately. The expectation is that when the crcu8 is used to compose crcu16/crcu32, we'd ultimately get better performance, as we're a lot less likely to serialize waiting for space in that key queue. This just hasn't gotten to the top of my queue to evaluate.

I don't have a counterexample handy, but it would certainly involve a constant that was loadable in a single instruction when sign extended, but required two when zero extended. It might be as simple as using 0x8001 in your code rather than 0x8000. But I haven't tested that.

Combine can handle multiple uses, but it'd wrap them in a PARALLEL, which would be unhelpful. However, late-combine and possibly fwprop would be able to use that kind of define_insn_and_split effectively when there are multiple uses.
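
For illustration only, here is a source-level sketch of the ping-pong idea described in the comment. It is not the GCC expansion code and says nothing about what the RISC-V backend actually emits. Everything named here is an assumption made for the example: the CRC-8 polynomial (x^8 + x^2 + x + 1, MSB first, initial value 0), the function names, and the software `clmul_sw` helper that stands in for a hardware carry-less multiply. The clmul step uses a Barrett-style reduction, and the per-byte loop alternates between the table step and the clmul step. Both steps compute the same value, so alternating changes only which execution resources are used, not the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: generator P(x) = x^8 + x^2 + x + 1, MSB first, init 0. */
#define CRC8_POLY_FULL 0x107u

/* Bit-at-a-time reference step: ((crc ^ byte) * x^8) mod P. */
uint8_t crc8_step_bitwise(uint8_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (uint8_t)((crc & 0x80u) ? ((crc << 1) ^ CRC8_POLY_FULL)
                                      : (crc << 1));
    return crc;
}

static uint8_t crc8_table[256];
static int crc8_table_ready;

/* Table-lookup step; the table is built lazily from the reference step. */
uint8_t crc8_step_table(uint8_t crc, uint8_t byte)
{
    if (!crc8_table_ready) {
        for (int i = 0; i < 256; i++)
            crc8_table[i] = crc8_step_bitwise(0, (uint8_t)i);
        crc8_table_ready = 1;
    }
    return crc8_table[crc ^ byte];
}

/* Software carry-less multiply, a stand-in for a hardware clmul. */
uint32_t clmul_sw(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1u)
            r ^= a << i;
    return r;
}

/* Barrett constant mu = floor(x^16 / P), by polynomial long division. */
uint32_t barrett_mu(void)
{
    uint32_t rem = 1u << 16, q = 0;
    for (int i = 16; i >= 8; i--)
        if ((rem >> i) & 1u) {
            q |= 1u << (i - 8);
            rem ^= CRC8_POLY_FULL << (i - 8);
        }
    return q;
}

/* clmul step: two carry-less multiplies replace the table lookup. */
uint8_t crc8_step_clmul(uint8_t crc, uint8_t byte)
{
    uint32_t a = (uint32_t)(crc ^ byte);
    uint32_t q = clmul_sw(a, barrett_mu()) >> 8; /* quotient of a*x^8 / P */
    return (uint8_t)clmul_sw(q, CRC8_POLY_FULL); /* low 8 bits: remainder */
}

/* Ping-pong: even-indexed bytes use the table, odd-indexed use clmul. */
uint8_t crc8_pingpong(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = (i & 1u) ? crc8_step_clmul(crc, data[i])
                       : crc8_step_table(crc, data[i]);
    return crc;
}

/* Exhaustive check that all three steps agree for every (crc, byte). */
void crc8_self_check(void)
{
    for (int c = 0; c < 256; c++)
        for (int b = 0; b < 256; b++) {
            uint8_t ref = crc8_step_bitwise((uint8_t)c, (uint8_t)b);
            assert(crc8_step_table((uint8_t)c, (uint8_t)b) == ref);
            assert(crc8_step_clmul((uint8_t)c, (uint8_t)b) == ref);
        }
}
```

Calling `crc8_self_check()` once confirms that the table and clmul steps are interchangeable for this polynomial. Whether alternating them actually helps depends on the microarchitecture, which is the part the comment says has not been evaluated yet.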