https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102877

            Bug ID: 102877
           Summary: missed optimization: memcpy produces lots more asm
                    than otherwise
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jengelh at inai dot de
  Target Milestone: ---

Input (C++)
===========
struct GLOBCNT { unsigned char ab[6]; };
unsigned long long gc_to_num(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
#ifdef WITH_MEMCPY
        __builtin_memcpy(v + 2, gc.ab, 6);
#else
        v[2] = gc.ab[0]; v[3] = gc.ab[1]; v[4] = gc.ab[2];
        v[5] = gc.ab[3]; v[6] = gc.ab[4]; v[7] = gc.ab[5];
#endif
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

I hope this is UB-free.
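
One way to check that both branches compute the same value, so that the
difference is purely one of code generation, is to compare them against a
shift-only version. The sketch below is not part of the original test case.
The names gc_to_num_memcpy, gc_to_num_bytes and gc_to_num_shift are made up
for illustration, and it assumes a big- or little-endian target and the
GCC/clang builtins already used above.

```cpp
#include <cassert>

struct GLOBCNT { unsigned char ab[6]; };

// Same body as the WITH_MEMCPY branch of the test case.
unsigned long long gc_to_num_memcpy(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
        __builtin_memcpy(v + 2, gc.ab, 6);
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

// Same body as the non-memcpy branch (six single-byte stores).
unsigned long long gc_to_num_bytes(GLOBCNT gc)
{
        unsigned long long value;
        auto v = reinterpret_cast<unsigned char *>(&value);
        v[0] = 0;
        v[1] = 0;
        v[2] = gc.ab[0]; v[3] = gc.ab[1]; v[4] = gc.ab[2];
        v[5] = gc.ab[3]; v[6] = gc.ab[4]; v[7] = gc.ab[5];
        if (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
                value = __builtin_bswap64(value);
        return value;
}

// Hypothetical reference: reads ab[] as a 48-bit big-endian integer
// using shifts only, with no byte-level access to `value`.
unsigned long long gc_to_num_shift(GLOBCNT gc)
{
        unsigned long long value = 0;
        for (unsigned char b : gc.ab)
                value = (value << 8) | b;
        return value;
}
```

All three should return the same number for any input, for example when
called with GLOBCNT{{1, 2, 3, 4, 5, 6}} and compared with assert().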


Observed behavior
=================
The use of memcpy/__builtin_memcpy produces a function that is 28
instructions long (the final ret is at offset 0x5c).

► g++ -O2 -c t3.cpp -Wall -DWITH_MEMCPY -v
Target: x86_64-suse-linux
gcc version 11.2.1 20210816 [revision 056e324ce46a7924b5cf10f61010cf9dd2ca10e9]
(SUSE Linux)

► objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   89 f8                   mov    eax,edi
   2:   89 f9                   mov    ecx,edi
   4:   89 fa                   mov    edx,edi
   6:   44 0f b6 c7             movzx  r8d,dil
   a:   c1 e9 10                shr    ecx,0x10
   d:   0f b6 f4                movzx  esi,ah
  ...
  5c:   c3                      ret    


Expected behavior
=================
► g++ -O2 -c t3.cpp -Wall -UWITH_MEMCPY
► objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   0f b7 c7                movzx  eax,di
   3:   48 c1 ef 10             shr    rdi,0x10
   7:   48 c1 e7 20             shl    rdi,0x20
   b:   48 c1 e0 10             shl    rax,0x10
   f:   48 09 f8                or     rax,rdi
  12:   48 0f c8                bswap  rax
  15:   c3                      ret    


Other notes
===========
In a twist, clang 13.0.0 produces the short version for the memcpy case (even
shorter than gcc's short version) and a long version for the non-memcpy case
(even longer than gcc's long version).

► clang++ -O2 -c t3.cpp -Wall -DWITH_MEMCPY; objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   48 89 f8                mov    rax,rdi
   3:   48 c1 e0 10             shl    rax,0x10
   7:   48 0f c8                bswap  rax
   a:   c3                      ret    

► clang++ -O2 -c t3.cpp -Wall -UWITH_MEMCPY; objdump -Mintel -d t3.o
0000000000000000 <_Z9gc_to_num7GLOBCNT>:
   0:   48 89 f8                mov    rax,rdi
   3:   48 b9 ff ff ff ff ff    movabs rcx,0xffffffffffff
   a:   ff 00 00 
 ...
  6c:   c3                      ret
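
For illustration, the three-instruction clang output for the memcpy case
(mov, shl by 16, bswap) corresponds to the following source-level form. This
is a sketch under the assumption of a little-endian target, where the six
bytes end up in the low 48 bits of a 64-bit integer. The name
gc_to_num_shl_bswap is hypothetical and does not appear in the test case.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct GLOBCNT { unsigned char ab[6]; };

// Assumption: little-endian target. On other byte orders this does not
// compute the same value as gc_to_num() from the test case.
std::uint64_t gc_to_num_shl_bswap(GLOBCNT gc)
{
        std::uint64_t raw = 0;
        // ab[0] lands in the lowest byte, ab[5] in byte 5.
        std::memcpy(&raw, gc.ab, sizeof(gc.ab));
        // Shifting left by 16 moves the six bytes into bytes 2..7 and
        // clears bytes 0..1, matching the v[0] = v[1] = 0 stores.
        // The byte swap then yields the big-endian interpretation.
        return __builtin_bswap64(raw << 16);
}
```

The shift plays the role of the two zero stores plus the copy to v + 2, which
is why no separate masking is needed in that listing.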
