https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104315
Bug ID: 104315
Summary: [AArch64] Failure to optimize 8-bit bitreverse pattern
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

unsigned int stb_bitreverse8(unsigned char n)
{
    n = ((n & 0xAA) >> 1) + ((n & 0x55) << 1);
    n = ((n & 0xCC) >> 2) + ((n & 0x33) << 2);
    return (unsigned char) ((n >> 4) + (n << 4));
}

On AArch64, with -O3, GCC currently outputs this:

stb_bitreverse8(unsigned char):
  mov w2, 170
  mov w1, 85
  and w1, w1, w0, lsr 1
  and w0, w2, w0, lsl 1
  orr w0, w1, w0
  mov w1, -52
  mov w2, 51
  and w1, w1, w0, lsl 2
  and w0, w2, w0, lsr 2
  and w1, w1, 255
  orr w0, w0, w1
  lsr w1, w0, 4
  orr w0, w1, w0, lsl 4
  and w0, w0, 255
  ret

LLVM instead outputs this:

stb_bitreverse8(unsigned char):
  rbit w8, w0
  lsr w0, w8, #24
  ret

This optimization should be faster and quite useful, especially as there does
not seem to be any way to use `rbit` manually with intrinsics in GCC.
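For reference, the equivalence the LLVM output relies on can be checked exhaustively in portable C. The sketch below is not part of the original report. The helpers `rbit32_model`, `rbit_lsr24_model`, `count_mismatches`, and `check_bitreverse8` are hypothetical names introduced here for illustration, and `rbit32_model` is a plain-C stand-in for the 32-bit `rbit` instruction rather than a compiler intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* The function from the report, unchanged. */
unsigned int stb_bitreverse8(unsigned char n)
{
    n = ((n & 0xAA) >> 1) + ((n & 0x55) << 1);
    n = ((n & 0xCC) >> 2) + ((n & 0x33) << 2);
    return (unsigned char) ((n >> 4) + (n << 4));
}

/* Hypothetical helper: portable model of a 32-bit bit reversal
   (what `rbit w8, w0` computes), written as a simple loop. */
static uint32_t rbit32_model(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);  /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}

/* Hypothetical helper: model of the two-instruction sequence
   `rbit w8, w0` followed by `lsr w0, w8, #24`. */
static unsigned int rbit_lsr24_model(unsigned char n)
{
    return rbit32_model(n) >> 24;
}

/* Counts the 8-bit inputs on which the two versions disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (unsigned int v = 0; v < 256; v++) {
        if (stb_bitreverse8((unsigned char) v) !=
            rbit_lsr24_model((unsigned char) v))
            mismatches++;
    }
    return mismatches;
}

/* Aborts if the shift-and-mask pattern and the model ever differ. */
void check_bitreverse8(void)
{
    assert(count_mismatches() == 0);
}
```

Calling `check_bitreverse8()` from a test harness covers all 256 inputs, so it shows that a full 32-bit reversal followed by a right shift of 24 gives the same result as the three-step swap pattern.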