[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-06 Thread manuel.lauss at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Manuel Lauss  changed:

           What    |Removed |Added
                 CC|        |manuel.lauss at googlemail dot com

--- Comment #12 from Manuel Lauss  ---
clang-7 achieves an impressive level of IPC (AMD Zen):

 Performance counter stats for './memtime-clangO2' (10 runs):

        358,260795      task-clock:u (msec)       #    0,999 CPUs utilized            ( +-  1,43% )
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
           244.191      page-faults:u             #    0,682 M/sec                    ( +-  0,00% )
     1.253.573.425      cycles:u                  #    3,499 GHz                      ( +-  1,50% )
       149.207.036      stalled-cycles-frontend:u #   11,90% frontend cycles idle     ( +-  2,04% )
       333.373.414      stalled-cycles-backend:u  #   26,59% backend cycles idle      ( +-  0,00% )
     4.333.767.562      instructions:u            #    3,46  insn per cycle
                                                  #    0,08  stalled cycles per insn  ( +-  0,00% )
       333.621.304      branches:u                #  931,225 M/sec                    ( +-  0,00% )
           248.011      branch-misses:u           #    0,07% of all branches          ( +-  0,06% )

       0,358644336 seconds time elapsed                                              ( +-  1,43% )


compared to gcc-8 as of today:
 Performance counter stats for './memtime-gcc8O2' (10 runs):

       2087,357431      task-clock:u (msec)       #    1,000 CPUs utilized            ( +-  0,19% )
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
           244.191      page-faults:u             #    0,117 M/sec                    ( +-  0,00% )
     8.273.911.027      cycles:u                  #    3,964 GHz                      ( +-  0,00% )
     3.691.281.142      stalled-cycles-frontend:u #   44,61% frontend cycles idle     ( +-  0,02% )
       333.373.414      stalled-cycles-backend:u  #    4,03% backend cycles idle      ( +-  0,00% )
     3.667.101.412      instructions:u            #    0,44  insn per cycle
                                                  #    1,01  stalled cycles per insn  ( +-  0,00% )
       333.621.824      branches:u                #  159,830 M/sec                    ( +-  0,00% )
           248.423      branch-misses:u           #    0,07% of all branches          ( +-  0,01% )

       2,088370519 seconds time elapsed                                              ( +-  0,19% )

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-06 Thread gpnuma at centaurean dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #11 from gpnuma at centaurean dot com ---
Yes, it's not the init loop that's the problem. Just to make sure, with the
following code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    //for (uint_fast64_t i = 0; i < size; i++)
    //    mem[i] = (uint8_t) (i >> 7);

    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        //receiver &= (0xffffffffffffffffllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }

    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

The result is (the computed sum is unreliable since we do not initialize the memory):
gcc
3 ...
=> 81985529216486895

real    0m3.180s
user    0m2.822s
sys     0m0.328s

clang
time ./a.out
3 ...
=> 81985529216486895

real    0m0.972s
user    0m0.621s
sys     0m0.338s

Still 4x faster

(In reply to Richard Biener from comment #9)
> So with 2 bytes we get
> 
> .L3:
>         movzwl  (%rax), %edx
>         addq    $3, %rax
>         movw    %dx, 8(%rsp)
>         movq    8(%rsp), %rdx
>         imulq   %rcx, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         cmpq    %rdi, %rax
>         jne     .L3
> 
> while with 3 bytes we see
> 
> .L3:
>         movzwl  (%rax), %edx
>         addq    $3, %rax
>         movw    %dx, 8(%rsp)
>         movzbl  -1(%rax), %edx
>         movb    %dl, 10(%rsp)
>         movq    8(%rsp), %rdx
>         imulq   %rcx, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         cmpq    %rdi, %rax
>         jne     .L3
> 
> while clang outputs
> 
> .LBB0_3:                                # =>This Inner Loop Header: Depth=1
>         movzwl  (%r14,%rcx), %edx
>         movzbl  2(%r14,%rcx), %edi
>         shlq    $16, %rdi
>         orq     %rdx, %rdi
>         andq    $-16777216, %rbx        # imm = 0xFFFFFFFFFF000000
>         orq     %rdi, %rbx
>         movq    %rbx, %rdx
>         imulq   %rax, %rdx
>         shrq    $48, %rdx
>         addq    %rdx, %rsi
>         addq    $3, %rcx
>         cmpq    $999999992, %rcx        # imm = 0x3B9AC9F8
>         jb      .LBB0_3
> 
> that _looks_ slower.  Are you sure performance isn't dominated by the
> first init loop (both GCC and clang vectorize it)?  I notice we spill
> in the above loop for the bitfield insert where clang uses register
> operations.  We refuse to inline the memcpy at the GIMPLE level
> and further refuse to optimize it to a BIT_INSERT_EXPR, which would
> be a possibility.

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-06 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #10 from Marc Glisse  ---
(In reply to Richard Biener from comment #9)
> So with 2 bytes we get

Try 3 bytes (the worst case).

> Are you sure performance isn't dominated by the
> first init loop (both GCC and clang vectorize it).

Replacing memcpy(,,block) with memcpy(,,8) (the next line masks the other bytes
anyway) gained a factor of 8 in running time when I tried it the other day.
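
As a concrete sketch of that workaround (a drop-in replacement for the
testcase's hot loop; illustrative code, not taken from the report): copy a
full 8 bytes, which GCC turns into a single 64-bit load, then mask down to the
3 bytes of interest. The wider copy reads 5 bytes past the current element,
which the testcase's "counter < size - 8" bound already keeps in range.

    /* Copy 8 bytes (one 64-bit load instead of the 2-byte + 1-byte
       store/reload sequence GCC emits) and mask down to 3 bytes. */
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 8);
        receiver &= 0xffffffllu;   /* keep the low 3 bytes */
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }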

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #9 from Richard Biener  ---
So with 2 bytes we get

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while with 3 bytes we see

.L3:
        movzwl  (%rax), %edx
        addq    $3, %rax
        movw    %dx, 8(%rsp)
        movzbl  -1(%rax), %edx
        movb    %dl, 10(%rsp)
        movq    8(%rsp), %rdx
        imulq   %rcx, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        cmpq    %rdi, %rax
        jne     .L3

while clang outputs

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        movzwl  (%r14,%rcx), %edx
        movzbl  2(%r14,%rcx), %edi
        shlq    $16, %rdi
        orq     %rdx, %rdi
        andq    $-16777216, %rbx        # imm = 0xFFFFFFFFFF000000
        orq     %rdi, %rbx
        movq    %rbx, %rdx
        imulq   %rax, %rdx
        shrq    $48, %rdx
        addq    %rdx, %rsi
        addq    $3, %rcx
        cmpq    $999999992, %rcx        # imm = 0x3B9AC9F8
        jb      .LBB0_3

that _looks_ slower.  Are you sure performance isn't dominated by the
first init loop (both GCC and clang vectorize it)?  I notice we spill
in the above loop for the bitfield insert where clang uses register
operations.  We refuse to inline the memcpy at the GIMPLE level
and further refuse to optimize it to a BIT_INSERT_EXPR, which would
be a possibility.
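
In C terms, the register-only insert clang emits corresponds roughly to the
sketch below (illustrative code, not from the report): the three bytes are
assembled with shifts and ors and merged into the low 24 bits of the
accumulator, with no store to a stack slot and no 8-byte reload.

    #include <stdint.h>

    /* Sketch of clang's loop body (movzwl + movzbl + shlq + orq, then
       andq $-16777216 + orq): build the 3-byte value in a register and
       merge it into the low 24 bits of receiver. */
    static inline uint64_t insert3(uint64_t receiver, const uint8_t *p)
    {
        uint64_t bytes = (uint64_t)p[0]
                       | ((uint64_t)p[1] << 8)
                       | ((uint64_t)p[2] << 16);
        return (receiver & ~0xffffffllu) | bytes;  /* ~0xffffff == -16777216 */
    }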

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread gpnuma at centaurean dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #8 from gpnuma at centaurean dot com ---
Just to make sure, I commented out the bit masking:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);

    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        //receiver &= (0xffffffffffffffffllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }

    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

The results are exactly the same:
gcc
time ./a.out
3 ...
=> 81996806116422545

real    0m3.771s
user    0m3.292s
sys     0m0.403s

clang
time ./a.out
3 ...
=> 81996806116422545

real    0m1.209s
user    0m0.833s
sys     0m0.359s

Still 4x faster

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #7 from Andrew Pinski  ---
I wonder if this is not about __builtin_memcpy itself but rather about how we
optimize inserting into the lower bytes of a uint64_t.  I think your benchmark
is not measuring what you think it is measuring.

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread gpnuma at centaurean dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #6 from gpnuma at centaurean dot com ---
If you compile the following code (-O3 being the only flag used):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    const uint64_t size = 1000000000;
    const size_t alloc_mem = size * sizeof(uint8_t);
    uint8_t *mem = malloc(alloc_mem);
    for (uint_fast64_t i = 0; i < size; i++)
        mem[i] = (uint8_t) (i >> 7);

    uint_fast64_t counter = 0;
    uint64_t total = 0x123456789abcdefllu;
    uint64_t receiver = 0;

    printf("%u ...\n", 3);
    counter = 0;
    while (counter < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        receiver &= (0xffffffffffffffffllu >> (64 - ((3) << 3)));
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 3;
    }

    printf("=> %llu\n", total);
    return EXIT_SUCCESS;
}

Here are the results:
gcc
time ./a.out
3 ...
=> 81996806116422545

real    0m4.145s
user    0m3.691s
sys     0m0.396s

clang
time ./a.out
3 ...
=> 81996806116422545

real    0m1.246s
user    0m0.855s
sys     0m0.374s

4x faster

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread gpnuma at centaurean dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #5 from gpnuma at centaurean dot com ---
Which gcc and which clang?
Because on my platform, with the above code, isolating 3 bytes at a time and
5 bytes at a time (by doing manual unrolling) is way slower than clang.
Or maybe it's the interaction with the bit masking that causes the problem?
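
For reference, "manual unrolling" here presumably means something like the
sketch below (an assumed shape, not code from the report): handle two 3-byte
chunks per iteration of the testcase's hot loop.

    /* Hypothetical 2-way manual unroll of the 3-byte loop. */
    while (counter + 3 < size - 8) {
        __builtin_memcpy(&receiver, &mem[counter], 3);
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        __builtin_memcpy(&receiver, &mem[counter + 3], 3);
        total += ((receiver * 0x321654987cbafedllu) >> 48);
        counter += 6;
    }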

(In reply to H.J. Lu from comment #4)
> I compared __builtin_memcpy one size at a time.  Here are results in
> cycles:
> 
> clang 1 bytes: 17193410146
> gcc   1 bytes: 15440244966
> clang 2 bytes: 8997535880
> gcc   2 bytes: 8147449530
> clang 3 bytes: 6002276628
> gcc   3 bytes: 5430387704
> clang 4 bytes: 4497121282
> gcc   4 bytes: 4069604454
> clang 5 bytes: 3644879742
> gcc   5 bytes: 3258094970
> clang 6 bytes: 3045612708
> gcc   6 bytes: 2728410608
> clang 7 bytes: 2574110178
> gcc   7 bytes: 2330365680
> clang 8 bytes: 969894432
> gcc   8 bytes: 6436950208
> 
> GCC is faster except for the 8-byte size.

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

H.J. Lu  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #4 from H.J. Lu  ---
I compared __builtin_memcpy one size at a time.  Here are results in
cycles:

clang 1 bytes: 17193410146
gcc   1 bytes: 15440244966
clang 2 bytes: 8997535880
gcc   2 bytes: 8147449530
clang 3 bytes: 6002276628
gcc   3 bytes: 5430387704
clang 4 bytes: 4497121282
gcc   4 bytes: 4069604454
clang 5 bytes: 3644879742
gcc   5 bytes: 3258094970
clang 6 bytes: 3045612708
gcc   6 bytes: 2728410608
clang 7 bytes: 2574110178
gcc   7 bytes: 2330365680
clang 8 bytes: 969894432
gcc   8 bytes: 6436950208

GCC is faster except for the 8-byte size.
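
The exact harness is not attached to the report; a minimal per-size benchmark
in the same spirit might look like the sketch below. Everything in it is
illustrative (the buffer size, the use of __rdtsc from <x86intrin.h> for the
cycle counts, the checksum), not H.J. Lu's actual test.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* __rdtsc */

    /* Time "__builtin_memcpy n bytes into a uint64_t" over the whole
       buffer, one compile-time-constant size at a time. */
    #define BENCH(n)                                                    \
        do {                                                            \
            uint64_t receiver = 0, total = 0;                           \
            unsigned long long start = __rdtsc();                       \
            for (uint64_t i = 0; i < size - 8; i += (n)) {              \
                __builtin_memcpy(&receiver, &mem[i], (n));              \
                total += (receiver * 0x321654987cbafedllu) >> 48;       \
            }                                                           \
            printf("%d bytes: %llu cycles (checksum %llu)\n", (n),      \
                   __rdtsc() - start, (unsigned long long)total);       \
        } while (0)

    int main(void) {
        const uint64_t size = 1000000000;    /* illustrative buffer size */
        uint8_t *mem = malloc(size);
        if (!mem)
            return EXIT_FAILURE;
        for (uint64_t i = 0; i < size; i++)  /* init as in the testcase */
            mem[i] = (uint8_t)(i >> 7);
        BENCH(1); BENCH(2); BENCH(3); BENCH(4);
        BENCH(5); BENCH(6); BENCH(7); BENCH(8);
        free(mem);
        return EXIT_SUCCESS;
    }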

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

H.J. Lu  changed:

   What|Removed |Added

 Target|x86_64-apple-darwin17.4.0   |x86_64
 CC||hjl.tools at gmail dot com

--- Comment #3 from H.J. Lu  ---
Confirmed on Linux/x86-64.

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread gpnuma at centaurean dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

--- Comment #2 from gpnuma at centaurean dot com ---
(In reply to Andrew Pinski from comment #1)
> Does -mcpu=native improve it?
> Also is GCC calling memcpy instead of doing an inline version?

No, -march=native does not make any difference.
And no, gcc is not calling memcpy: when I replace __builtin_memcpy with memcpy
in the above code it is somewhat slower, but the timing then matches
clang with memcpy. It's only when comparing gcc's __builtin_memcpy with
clang's __builtin_memcpy that the resulting code exhibits considerable
performance differences in favor of clang.

[Bug target/84719] gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's

2018-03-05 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2018-03-05
 Ever confirmed|0   |1

--- Comment #1 from Andrew Pinski  ---
Does -mcpu=native improve it?
Also is GCC calling memcpy instead of doing an inline version?