https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

            Bug ID: 84719
           Summary: gcc's __builtin_memcpy performance with certain number
                    of bytes is terrible compared to clang's
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: bootstrap
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gpnuma at centaurean dot com
  Target Milestone: ---

I am filing this bug report as a follow-up to my Stack Overflow question:
https://stackoverflow.com/questions/49098453/

To reproduce, save the following code as test.c, compile it (gcc -O3 test.c),
and run it (time ./a.out):

#include <sys/stat.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = (uint8_t*)malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);

    uint8_t block = 0;
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    for(block = 1; block <= 8; block ++) {
        printf("%u ...\n", block);
        counter = 0;
        while (counter < size - 8) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= (0xffffffffffffffffllu >> (64 - ((block) << 3)));
            total += ((receiver * 0x321654987cbafedllu) >> 48);
            counter += block;
        }
    }

    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

The gcc-compiled code runs almost 3x slower than the clang-compiled code. As a
side note, gcc does not seem to unroll the loop well here: forcing an unroll
in gcc 8 improves performance, whereas it makes no difference with clang.
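
For illustration, one way to force the unroll mentioned above is the
"#pragma GCC unroll" directive available since gcc 8. This is only a sketch:
the pragma placement and the function name sum_blocks_unrolled are
assumptions, not necessarily what was used for the measurements. Compilers
that do not know the pragma ignore it, possibly with a warning.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the two nested loops of the test program, with the
   outer loop over block sizes marked for unrolling so that each
   __builtin_memcpy sees a compile-time-constant size. */
uint64_t sum_blocks_unrolled(const uint8_t *mem, size_t size) {
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

#pragma GCC unroll 8
    for (unsigned block = 1; block <= 8; block++) {
        size_t counter = 0;
        /* Same bound as "counter < size - 8", written without underflow. */
        while (counter + 8 < size) {
            __builtin_memcpy(&receiver, &mem[counter], block);
            receiver &= 0xffffffffffffffffllu >> (64 - (block << 3));
            total += (receiver * 0x321654987cbafedllu) >> 48;
            counter += block;
        }
    }
    return total;
}
```

Calling sum_blocks_unrolled(mem, size) on the buffer from the test program
should produce the same total as the original loops.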

Even with complete manual unrolling, the gcc-compiled code is still 3x slower
than clang's. Further testing suggests that the problem is tied to specific
byte counts passed to __builtin_memcpy; in particular, __builtin_memcpy(,,3)
performs very poorly.
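
A reduced sketch that isolates the 3-byte case may make the generated code
easier to inspect (for example with gcc -O3 -S). The helper names below are
hypothetical and not part of the test program above; the byte-wise variant
assumes a little-endian target such as x86_64 and serves only as a reference
for comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: load 3 bytes with a constant-size __builtin_memcpy,
   the size singled out above. */
static inline uint64_t load3_memcpy(const uint8_t *p) {
    uint64_t r = 0;
    __builtin_memcpy(&r, p, 3);
    return r & 0xffffffu;
}

/* Hypothetical reference: the same value built from three byte loads
   (little-endian layout assumed). */
static inline uint64_t load3_bytes(const uint8_t *p) {
    return (uint64_t)p[0] | ((uint64_t)p[1] << 8) | ((uint64_t)p[2] << 16);
}

/* Inner loop of the test program, restricted to block == 3. */
uint64_t sum3_memcpy(const uint8_t *mem, size_t size) {
    assert(mem != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i + 8 < size; i += 3)
        total += (load3_memcpy(mem + i) * 0x321654987cbafedllu) >> 48;
    return total;
}

/* Same loop using the byte-wise reference load. */
uint64_t sum3_bytes(const uint8_t *mem, size_t size) {
    assert(mem != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i + 8 < size; i += 3)
        total += (load3_bytes(mem + i) * 0x321654987cbafedllu) >> 48;
    return total;
}
```

Timing sum3_memcpy against sum3_bytes over the same large buffer, with both
gcc and clang, should show whether the slowdown follows the 3-byte
__builtin_memcpy itself.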


My platform and compiler information:
gcc-7 -v
Using built-in specs.
COLLECT_GCC=gcc-7
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/7.3.0/libexec/gcc/x86_64-apple-darwin17.4.0/7.3.0/lto-wrapper
Target: x86_64-apple-darwin17.4.0
Configured with: ../configure --build=x86_64-apple-darwin17.4.0
--prefix=/usr/local/Cellar/gcc/7.3.0
--libdir=/usr/local/Cellar/gcc/7.3.0/lib/gcc/7
--enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-7
--with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr
--with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl
--with-system-zlib --enable-checking=release --with-pkgversion='Homebrew GCC
7.3.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues
--disable-nls
Thread model: posix
gcc version 7.3.0 (Homebrew GCC 7.3.0)

cc -v
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bi
