[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Manuel Lauss changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|         |manuel.lauss at googlemail dot com

--- Comment #12 from Manuel Lauss ---
clang-7 achieves an impressive level of IPC (AMD Zen):

 Performance counter stats for './memtime-clangO2' (10 runs):

        358,260795      task-clock:u (msec)       #  0,999 CPUs utilized          ( +- 1,43% )
                 0      context-switches:u        #  0,000 K/sec
                 0      cpu-migrations:u          #  0,000 K/sec
           244.191      page-faults:u             #  0,682 M/sec                  ( +- 0,00% )
     1.253.573.425      cycles:u                  #  3,499 GHz                    ( +- 1,50% )
       149.207.036      stalled-cycles-frontend:u # 11,90% frontend cycles idle   ( +- 2,04% )
       333.373.414      stalled-cycles-backend:u  # 26,59% backend cycles idle    ( +- 0,00% )
     4.333.767.562      instructions:u            #  3,46 insn per cycle
                                                  #  0,08 stalled cycles per insn ( +- 0,00% )
       333.621.304      branches:u                # 931,225 M/sec                 ( +- 0,00% )
           248.011      branch-misses:u           #  0,07% of all branches        ( +- 0,06% )

       0,358644336 seconds time elapsed                                           ( +- 1,43% )

compared to gcc-8 as of today:

 Performance counter stats for './memtime-gcc8O2' (10 runs):

       2087,357431      task-clock:u (msec)       #  1,000 CPUs utilized          ( +- 0,19% )
                 0      context-switches:u        #  0,000 K/sec
                 0      cpu-migrations:u          #  0,000 K/sec
           244.191      page-faults:u             #  0,117 M/sec                  ( +- 0,00% )
     8.273.911.027      cycles:u                  #  3,964 GHz                    ( +- 0,00% )
     3.691.281.142      stalled-cycles-frontend:u # 44,61% frontend cycles idle   ( +- 0,02% )
       333.373.414      stalled-cycles-backend:u  #  4,03% backend cycles idle    ( +- 0,00% )
     3.667.101.412      instructions:u            #  0,44 insn per cycle
                                                  #  1,01 stalled cycles per insn ( +- 0,00% )
       333.621.824      branches:u                # 159,830 M/sec                 ( +- 0,00% )
           248.423      branch-misses:u           #  0,07% of all branches        ( +- 0,01% )

       2,088370519 seconds time elapsed                                           ( +- 0,19% )
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #11 from gpnuma at centaurean dot com ---
Yes, the problem isn't the init loop. Just to make sure, with the following
code:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 10;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    //for (uint_fast64_t i = 0; i < size; i++)
    //    mem[i] = (uint8_t) (i >> 7);
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;
    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        //receiver &= (0xllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }
    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

The result is (the calculated sum is unreliable since we do not init memory):

gcc:
    time ./a.out
    3 ...
    => 81985529216486895

    real    0m3.180s
    user    0m2.822s
    sys     0m0.328s

clang:
    time ./a.out
    3 ...
    => 81985529216486895

    real    0m0.972s
    user    0m0.621s
    sys     0m0.338s

Still 4x faster.

(In reply to Richard Biener from comment #9)
> So with 2 bytes we get
>
> .L3:
>         movzwl  (%rax), %edx
>         addq    $3, %rax
>         movw    %dx, 8(%rsp)
>         movq    8(%rsp), %rdx
>         imulq   %rcx, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         cmpq    %rdi, %rax
>         jne     .L3
>
> while with 3 bytes we see
>
> .L3:
>         movzwl  (%rax), %edx
>         addq    $3, %rax
>         movw    %dx, 8(%rsp)
>         movzbl  -1(%rax), %edx
>         movb    %dl, 10(%rsp)
>         movq    8(%rsp), %rdx
>         imulq   %rcx, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         cmpq    %rdi, %rax
>         jne     .L3
>
> while clang outputs
>
> .LBB0_3:                                # =>This Inner Loop Header: Depth=1
>         movzwl  (%r14,%rcx), %edx
>         movzbl  2(%r14,%rcx), %edi
>         shlq    $16, %rdi
>         orq     %rdx, %rdi
>         andq    $-16777216, %rbx        # imm = 0xFF00
>         orq     %rdi, %rbx
>         movq    %rbx, %rdx
>         imulq   %rax, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         addq    $3, %rcx
>         cmpq    $2, %rcx                # imm = 0x3B9AC9F8
>         jb      .LBB0_3
>
> that _looks_ slower.
> Are you sure performance isn't dominated by the
> first init loop (both GCC and clang vectorize it)?  I notice we spill
> in the above loop for the bitfield insert, where clang uses register
> operations.  We refuse to inline the memcpy at the GIMPLE level
> and further refuse to optimize it to a BIT_INSERT_EXPR, which would
> be a possibility.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #10 from Marc Glisse ---
(In reply to Richard Biener from comment #9)
> So with 2 bytes we get

Try 3 bytes (the worst case).

> Are you sure performance isn't dominated by the
> first init loop (both GCC and clang vectorize it).

Replacing memcpy(,,block) with memcpy(,,8) (the next line masks the other
bytes anyway) gained a factor of 8 in running time when I tried the other day.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|         |rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener ---
So with 2 bytes we get

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while with 3 bytes we see

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movzbl  -1(%rax), %edx
        movb    %dl, 10(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while clang outputs

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movzwl  (%r14,%rcx), %edx
        movzbl  2(%r14,%rcx), %edi
        shlq    $16, %rdi
        orq     %rdx, %rdi
        andq    $-16777216, %rbx        # imm = 0xFF00
        orq     %rdi, %rbx
        movq    %rbx, %rdx
        imulq   %rax, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        addq    $3, %rcx
        cmpq    $2, %rcx                # imm = 0x3B9AC9F8
        jb      .LBB0_3

that _looks_ slower.  Are you sure performance isn't dominated by the
first init loop (both GCC and clang vectorize it)?  I notice we spill
in the above loop for the bitfield insert, where clang uses register
operations.  We refuse to inline the memcpy at the GIMPLE level and
further refuse to optimize it to a BIT_INSERT_EXPR, which would be a
possibility.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #8 from gpnuma at centaurean dot com ---
Just to make sure, I commented out the bit masking:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 10;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;
    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        //receiver &= (0xllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }
    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

Results are exactly the same:

gcc:
    time ./a.out
    3 ...
    => 81996806116422545

    real    0m3.771s
    user    0m3.292s
    sys     0m0.403s

clang:
    time ./a.out
    3 ...
    => 81996806116422545

    real    0m1.209s
    user    0m0.833s
    sys     0m0.359s

Still 4x faster.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #7 from Andrew Pinski ---
I wonder if this is not about __builtin_memcpy itself, but rather about how to
optimize inserting into the lower bytes of a uint64_t.  I think your benchmark
is not benchmarking what you think it is benchmarking.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #6 from gpnuma at centaurean dot com ---
If you compile the following code (-O3 being the only flag used):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 10;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);
    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;
    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        receiver &= (0xllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }
    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

Here are the results:

gcc:
    time ./a.out
    3 ...
    => 81996806116422545

    real    0m4.145s
    user    0m3.691s
    sys     0m0.396s

clang:
    time ./a.out
    3 ...
    => 81996806116422545

    real    0m1.246s
    user    0m0.855s
    sys     0m0.374s

4x faster.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #5 from gpnuma at centaurean dot com ---
Which gcc and which clang?  Because on my platform, with the above code, if
you isolate 3 bytes at a time or 5 bytes at a time (by doing manual
unrolling), it is way slower than clang.  Or maybe it's the interaction with
the bit masking that causes a problem?

(In reply to H.J. Lu from comment #4)
> I compared __builtin_memcpy one size at a time.  Here are results in
> cycles:
>
> clang 1 bytes: 17193410146
>   gcc 1 bytes: 15440244966
> clang 2 bytes:  8997535880
>   gcc 2 bytes:  8147449530
> clang 3 bytes:  6002276628
>   gcc 3 bytes:  5430387704
> clang 4 bytes:  4497121282
>   gcc 4 bytes:  4069604454
> clang 5 bytes:  3644879742
>   gcc 5 bytes:  3258094970
> clang 6 bytes:  3045612708
>   gcc 6 bytes:  2728410608
> clang 7 bytes:  2574110178
>   gcc 7 bytes:  2330365680
> clang 8 bytes:   969894432
>   gcc 8 bytes:  6436950208
>
> GCC is faster except for 8 byte size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

H.J. Lu changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|WAITING |NEW

--- Comment #4 from H.J. Lu ---
I compared __builtin_memcpy one size at a time.  Here are results in
cycles:

clang 1 bytes: 17193410146
  gcc 1 bytes: 15440244966
clang 2 bytes:  8997535880
  gcc 2 bytes:  8147449530
clang 3 bytes:  6002276628
  gcc 3 bytes:  5430387704
clang 4 bytes:  4497121282
  gcc 4 bytes:  4069604454
clang 5 bytes:  3644879742
  gcc 5 bytes:  3258094970
clang 6 bytes:  3045612708
  gcc 6 bytes:  2728410608
clang 7 bytes:  2574110178
  gcc 7 bytes:  2330365680
clang 8 bytes:   969894432
  gcc 8 bytes:  6436950208

GCC is faster except for 8 byte size.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

H.J. Lu changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
             Target|x86_64-apple-darwin17.4.0 |x86_64
                 CC|                          |hjl.tools at gmail dot com

--- Comment #3 from H.J. Lu ---
Confirmed on Linux/x86-64.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #2 from gpnuma at centaurean dot com ---
(In reply to Andrew Pinski from comment #1)
> Does -mcpu=native improve it?
> Also is GCC calling memcpy instead of doing an inline version?

No, -march=native does not make any difference.  And no, gcc is not calling
memcpy: when I replace __builtin_memcpy with memcpy in the above code it is
somewhat slower, but the timing then matches clang's memcpy.  It's only when
comparing gcc/__builtin_memcpy against clang/__builtin_memcpy that the
resulting code exhibits a considerable performance difference in favor of
clang.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Andrew Pinski changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |WAITING
   Last reconfirmed|            |2018-03-05
     Ever confirmed|0           |1

--- Comment #1 from Andrew Pinski ---
Does -mcpu=native improve it?
Also is GCC calling memcpy instead of doing an inline version?