[Bug tree-optimization/85212] Parallelizable loop isn't unrolled [regression bug?]

2018-04-05 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212

--- Comment #2 from robertw89 at googlemail dot com ---
Thank you for your explanation :). The compiler indeed emits the expected
code with -funroll-loops.

[Bug tree-optimization/85212] Parallelizable loop isn't unrolled [regression bug?]

2018-04-05 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212

robertw89 at googlemail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |RESOLVED
         Resolution|---                         |WORKSFORME

[Bug tree-optimization/85212] New: Parallelizable loop isn't unrolled [regression bug?]

2018-04-04 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212

Bug ID: 85212
   Summary: Parallelizable loop isn't unrolled [regression bug?]
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: robertw89 at googlemail dot com
  Target Milestone: ---

The compiler fails to (partially) unroll the loop below.

Compiled with -O3 -mavx -mavx2 -mfma -fno-math-errno -ffast-math
-floop-parallelize-all -ftree-parallelize-loops=8

void testAutoParr(int *x) {
    for (int i = 0; i < 1000; i++) {
        x[2*i+1] = x[2*i];
    }
}
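
For illustration, this is roughly what a partial unroll by a factor of 4 looks
like when written by hand. The function name testAutoParrUnrolled4 and the
factor are assumptions made for this sketch, not something taken from GCC's
output. The follow-up comment on this bug reports that -funroll-loops makes the
compiler emit the expected code.

```cpp
// Hypothetical hand-unrolled variant of testAutoParr (unroll factor 4).
// The trip count 1000 is divisible by 4, so no remainder loop is needed.
// Each iteration reads an even index and writes the following odd index,
// so the four copies in the body are independent of one another.
void testAutoParrUnrolled4(int *x) {
    for (int i = 0; i < 1000; i += 4) {
        x[2 * i + 1] = x[2 * i];      // original iteration i
        x[2 * i + 3] = x[2 * i + 2];  // original iteration i + 1
        x[2 * i + 5] = x[2 * i + 4];  // original iteration i + 2
        x[2 * i + 7] = x[2 * i + 6];  // original iteration i + 3
    }
}
```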

[Bug tree-optimization/85143] Loop limit prevents (auto)vectorization

2018-03-31 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85143

--- Comment #2 from robertw89 at googlemail dot com ---
This change would be trivial.

To defend my case and rant a bit ;) ...

Indeed, but programmers can manually unroll a loop too ;). What if the code is
autogenerated? What if the constant comes from a devirtualized call, etc.? The
compiler could do this optimization.

[Bug tree-optimization/85143] New: Loop limit prevents (auto)vectorization

2018-03-31 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85143

Bug ID: 85143
   Summary: Loop limit prevents (auto)vectorization
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: robertw89 at googlemail dot com
  Target Milestone: ---

I expected it to generate a vectorized version, potentially specializing on
the boundary. LLVM produces strange-looking (vectorized) code, so I guess it's a
bug this time :)

Works fine if the hardcoded boundary is removed.

void boxIntersectionSimdNative(
    bool*__restrict__ res,
    double*__restrict__ a, double*__restrict__ b,
    int n
) {
    for (int i = 0; i < n && i < 1337; i++) {
        res[i] = a[i] > b[i];
    }
}


output

boxIntersectionSimdNative(bool*, double*, double*, int):
  test ecx, ecx
  jle .L34
  mov eax, 1
  jmp .L30
.L35:
  cmp r8d, 1336
  jg .L34
.L30:
  vmovsd xmm0, QWORD PTR [rsi-8+rax*8]
  mov r8d, eax
  vcomisd xmm0, QWORD PTR [rdx-8+rax*8]
  seta BYTE PTR [rdi-1+rax]
  add rax, 1
  cmp ecx, r8d
  jg .L35
.L34:
  rep ret
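
As a sketch of the transformation being asked for: the two exit conditions
i < n and i < 1337 are together equivalent to a single bound min(n, 1337),
which can be computed once before the loop. The function name
boxIntersectionSingleBound is hypothetical, and whether GCC 7.3 vectorizes
this exact form has not been verified here.

```cpp
#include <algorithm>

// Hypothetical rewrite of boxIntersectionSimdNative with a single loop bound.
// (i < n && i < 1337) holds exactly when i < min(n, 1337), so the loop
// visits the same indices as the original.
void boxIntersectionSingleBound(
    bool *__restrict__ res,
    double *__restrict__ a, double *__restrict__ b,
    int n
) {
    const int limit = std::min(n, 1337);  // computed once, outside the loop
    for (int i = 0; i < limit; i++) {
        res[i] = a[i] > b[i];
    }
}
```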

[Bug tree-optimization/85115] New: Failure to (auto)vectorize sqrtf

2018-03-28 Thread robertw89 at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85115

Bug ID: 85115
   Summary: Failure to (auto)vectorize sqrtf
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: robertw89 at googlemail dot com
  Target Milestone: ---

Fails to (auto)vectorize the code below with the flags

-O3 -mavx

#include <math.h>

void simdSqrt(
    float * __restrict__ a,
    float * __restrict__ res,
    int size)
{
    int i;

    float *aAligned = (float*)__builtin_assume_aligned(a, 32);
    float *resAligned = (float*)__builtin_assume_aligned(res, 32);

    for (i = 0; i < size; i++) {
        resAligned[i] = sqrtf(aAligned[i]);
    }
}

produces (as displayed by https://godbolt.org/)

simdSqrt(float*, float*, int):
test    edx, edx
jle .L8
lea eax, [rdx-1]
push    r12
vxorps  xmm2, xmm2, xmm2
lea r12, [rdi+4+rax*4]
sub rsp, 32
.L3:
vmovss  xmm0, DWORD PTR [rdi]
vucomiss xmm2, xmm0
vsqrtss xmm1, xmm1, xmm0
ja  .L12
add rdi, 4
vmovss  DWORD PTR [rsi], xmm1
add rsi, 4
cmp rdi, r12
jne .L3
.L6:
add rsp, 32
pop r12
ret
.L8:
rep ret
.L12:
vmovss  DWORD PTR [rsp+28], xmm2
mov QWORD PTR [rsp+16], rsi
mov QWORD PTR [rsp+8], rdi
vmovss  DWORD PTR [rsp+24], xmm1
call    sqrtf
mov rdi, QWORD PTR [rsp+8]
mov rsi, QWORD PTR [rsp+16]
vmovss  xmm1, DWORD PTR [rsp+24]
vmovss  xmm2, DWORD PTR [rsp+28]
add rdi, 4
vmovss  DWORD PTR [rsi], xmm1
add rsi, 4
cmp rdi, r12
jne .L3
jmp .L6
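
A likely explanation for the ja .L12 branch in the listing, which the report
itself does not state: for a negative input the library sqrtf may have to set
errno to EDOM, so the compiler keeps a fallback call to sqrtf for that case.
The helper below is a hypothetical sketch that only observes this errno
behaviour. It is meaningful only on platforms where math_errhandling includes
MATH_ERRNO, and only when the code is built without -fno-math-errno or
-ffast-math.

```cpp
#include <cerrno>
#include <cmath>

// Hypothetical helper: returns true if taking the square root of v
// reported a domain error through errno.
bool sqrtfReportsDomainError(float v) {
    errno = 0;
    volatile float r = std::sqrt(v);  // volatile keeps the result "used"
    (void)r;
    return errno == EDOM;
}
```

On such a platform, sqrtfReportsDomainError(-1.0f) is expected to return true,
while a non-negative input such as 4.0f leaves errno untouched.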