https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107006
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization
   Last reconfirmed|                            |2022-09-22
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
The reason is that the loops are not unrolled (early enough or at all) for the
bswap/load detection.  For this to work you have to unroll the loops manually
or direct GCC to do that, for example

inline uint64_t
get_le64 (const unsigned char x[64/8])
{
  uint64_t y = 0;
#pragma GCC unroll 8
  for (size_t i = 0; i < sizeof y; i++)
    if (0)
      y |= (uint64_t)x[i] << ((sizeof y - 1 - i)*8);
    else
      y |= (uint64_t)x[i] << i*8;
  return y;
}

produces

get_le64:
.LFB11:
        .cfi_startproc
        movq    (%rdi), %rax
        ret

The unroll heuristics do not anticipate that later bswap/load detection will
merge all the loads, so that unrolling would not grow code too much.
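
The two routes mentioned in the comment (unrolling by hand, or asking GCC to unroll with the pragma) can be compared side by side. The following is only a minimal sketch: the names get_le64_loop, get_le64_unrolled and get_le64_selfcheck are hypothetical and do not appear in the bug report, and the helpers are declared static inline so the file links at any optimization level. It checks that both forms compute the same little-endian value; it does not by itself show which instructions GCC emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop form, as in the comment, with the unroll request attached.
   Compilers that do not know this pragma are expected to ignore it. */
static inline uint64_t get_le64_loop(const unsigned char x[8])
{
    uint64_t y = 0;
#pragma GCC unroll 8
    for (size_t i = 0; i < sizeof y; i++)
        y |= (uint64_t)x[i] << (i * 8);
    return y;
}

/* Hand-unrolled form: the same eight byte loads written out explicitly. */
static inline uint64_t get_le64_unrolled(const unsigned char x[8])
{
    return  (uint64_t)x[0]
         | ((uint64_t)x[1] << 8)
         | ((uint64_t)x[2] << 16)
         | ((uint64_t)x[3] << 24)
         | ((uint64_t)x[4] << 32)
         | ((uint64_t)x[5] << 40)
         | ((uint64_t)x[6] << 48)
         | ((uint64_t)x[7] << 56);
}

/* Hypothetical self-check: byte 0 is the least significant byte. */
static void get_le64_selfcheck(void)
{
    const unsigned char bytes[8] = {0x01, 0x02, 0x03, 0x04,
                                    0x05, 0x06, 0x07, 0x08};
    assert(get_le64_loop(bytes) == UINT64_C(0x0807060504030201));
    assert(get_le64_unrolled(bytes) == UINT64_C(0x0807060504030201));
}
```

To see whether the loads are merged, compile a file containing these functions with something like gcc -O2 -S and read the generated assembly. Whether a single movq appears depends on the GCC version and target, so treat the output quoted in the comment as one observed result rather than a guarantee.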