https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107006

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|rtl-optimization            |tree-optimization
   Last reconfirmed|                            |2022-09-22
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
           Keywords|                            |missed-optimization

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
The reason is that the loops are not unrolled (early enough, or at all) before
bswap/load detection runs.  For this to work you have to unroll the loops
manually or direct GCC to do that, for example

inline uint64_t get_le64 (const unsigned char x[64/8])
{
  uint64_t y = 0;
#pragma GCC unroll 8
  for (size_t i = 0; i < sizeof y; i++)
    if (0)
      y |= (uint64_t)x[i] << ((sizeof y - 1 - i)*8);
    else
      y |= (uint64_t)x[i] << i*8;
  return y;
}

produces

get_le64:
.LFB11:
        .cfi_startproc
        movq    (%rdi), %rax
        ret

The unroll heuristics do not anticipate that the later bswap/load detection
will merge all the loads, so that unrolling would not grow code much.
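
For the other option mentioned above, unrolling manually, a minimal sketch
could look like the following.  The name get_le64_unrolled is hypothetical
and not part of the report; the intent is only that every shift amount is a
compile-time constant, which is the shape the bswap/load detection is
described as needing.  Whether it collapses to a single load depends on the
target and GCC version.

```c
#include <stdint.h>

/* Hypothetical hand-unrolled variant of get_le64 (name not from the report).
   Each byte is widened to 64 bits and shifted by a constant amount, so no
   loop remains for the compiler to unroll. */
static inline uint64_t get_le64_unrolled (const unsigned char x[8])
{
  return (uint64_t)x[0]
       | (uint64_t)x[1] << 8
       | (uint64_t)x[2] << 16
       | (uint64_t)x[3] << 24
       | (uint64_t)x[4] << 32
       | (uint64_t)x[5] << 40
       | (uint64_t)x[6] << 48
       | (uint64_t)x[7] << 56;
}
```

This trades the generic sizeof-based loop for a fixed 8-byte layout, so it
would need a separate variant for other widths or for big-endian order.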
