https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #11 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
With -O3 it regresses with r7-2009-g8d4fc2d3d0c8f87bb3e182be1a618a511f8f9465

__uint128_t bswap(__uint128_t a) { return __builtin_bswap128 (a); }

emits the optimal code but is only in GCC 11.1 and later.

One fix for this might be to handle
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _1 = __builtin_bswap64 (_8);
  y[0] = _1;
  _10 = BIT_FIELD_REF <a_3(D), 64, 0>;
  _2 = __builtin_bswap64 (_10);
  y[1] = _2;
  _7 = MEM <uint128_t> [(char * {ref-all})&y];
in bswap or store merging.
Though, current bswap infrastructure I'm afraid limits it to 64-bit size,
because it tracks the bytes in uint64_t vars and uses 8 bits to determine which
byte it is (0 value of zero, 1-8 byte index and 0xff unknown).
While that is 10 different values right now, if we handled uint128_t we'd need
18 different values times 16.

Note, even:
unsigned long long
bswap (unsigned long long a)
{
  unsigned int x[2];
  __builtin_memcpy (x, &a, 8);
  unsigned int y[2];
  y[0] = __builtin_bswap32 (x[1]);
  y[1] = __builtin_bswap32 (x[0]);
  __builtin_memcpy (&a, y, 8);
  return a;
}

unsigned long long
bswap2 (unsigned long long a)
{
  return __builtin_bswap64 (a);
}
emits better code in the latter function than in the former; store-merging
isn't able to handle even that.
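As a sanity check on the 64-bit pair above, here is a minimal sketch showing that the two forms compute the same value, which is what makes it legitimate for the compiler to rewrite the first into the second. The names bswap_via_halves, bswap_direct and check_bswap_variants are made up for this illustration, and the sample values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Former form: swap assembled from two 32-bit halves through memory.  */
uint64_t
bswap_via_halves (uint64_t a)
{
  uint32_t x[2], y[2];
  memcpy (x, &a, 8);
  y[0] = __builtin_bswap32 (x[1]);
  y[1] = __builtin_bswap32 (x[0]);
  memcpy (&a, y, 8);
  return a;
}

/* Latter form: the single builtin.  */
uint64_t
bswap_direct (uint64_t a)
{
  return __builtin_bswap64 (a);
}

/* Hypothetical helper: asserts that both forms agree on a few samples.  */
void
check_bswap_variants (void)
{
  static const uint64_t samples[] = {
    0, 1, 0x0102030405060708ULL, 0xffffffff00000000ULL,
    0xdeadbeefcafebabeULL
  };
  for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (bswap_via_halves (samples[i]) == bswap_direct (samples[i]));
}
```

Both forms reverse the eight bytes of the object in memory, so they agree on little-endian and big-endian targets alike; only the generated code differs.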
So if we want to handle it in store-merging, we should start with handling
  _8 = BIT_FIELD_REF <a_3(D), 32, 32>;
  _1 = __builtin_bswap32 (_8);
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR<unsigned long>(_11);
and
  _8 = BIT_FIELD_REF <a_3(D), 32, 32>;
  _1 = __builtin_bswap32 (_8);
  y[0] = _1;
  _10 = (unsigned int) a_3(D);
  _2 = __builtin_bswap32 (_10);
  y[1] = _2;
  _7 = MEM <unsigned long> [(char * {ref-all})&y];
and only once that is handled try
  _8 = BIT_FIELD_REF <a_3(D), 64, 64>;
  _1 = __builtin_bswap64 (_8);
  _10 = (long long unsigned int) a_3(D);
  _2 = __builtin_bswap64 (_10);
  _11 = {_1, _2};
  _5 = VIEW_CONVERT_EXPR<uint128_t>(_11);
Doesn't look like stage4 material though.  So in the meantime perhaps some
other improvements.
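To illustrate the 64-bit limit of the byte tracking described in this comment, here is a rough sketch of a marker word with one 8-bit marker per tracked byte (0 for a known zero, 1-8 for the index of a source byte, 0xff for unknown). All identifiers below are invented for the illustration and are not claimed to match the names used in the GCC sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: 8 bits per marker, 0xff means "unknown byte".  */
enum { MARKER_BITS = 8, MARKER_UNKNOWN = 0xff };

/* Marker word for an unmodified 64-bit value: result byte i (i == 0 is
   the least significant byte) holds source byte i + 1.  */
uint64_t
markers_identity (void)
{
  uint64_t n = 0;
  for (int i = 0; i < 8; i++)
    n |= (uint64_t) (i + 1) << (i * MARKER_BITS);
  return n;
}

/* A byte swap of the value reverses the order of the markers.  */
uint64_t
markers_bswap (uint64_t n)
{
  return __builtin_bswap64 (n);
}

/* Bits needed to track BYTES bytes at one 8-bit marker per byte.  */
unsigned
markers_bits_needed (unsigned bytes)
{
  return bytes * MARKER_BITS;
}
```

Under this scheme the identity pattern is 0x0807060504030201 and a full byte swap shows up as 0x0102030405060708. Tracking 16 bytes would take 128 marker bits at 8 bits each, and still 80 bits if each marker were packed into the 5 bits that 18 distinct values require, so neither fits in a uint64_t.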