[Bug tree-optimization/85212] Parallelizable loop isn't unrolled [regression bug?]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212 --- Comment #2 from robertw89 at googlemail dot com --- Thank you for your explanation :) . The compiler indeed emits the expected code with -funroll-loops.
[Bug tree-optimization/85212] Parallelizable loop isn't unrolled [regression bug?]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212 robertw89 at googlemail dot com changed: What|Removed |Added Status|WAITING |RESOLVED Resolution|--- |WORKSFORME
[Bug tree-optimization/85212] New: Parallelizable loop isn't unrolled [regression bug?]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85212 Bug ID: 85212 Summary: Parallelizable loop isn't unrolled [regression bug?] Product: gcc Version: 7.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: robertw89 at googlemail dot com Target Milestone: --- The compiler fails to unroll the loop (even partially). Compiled with -O3 -mavx -mavx2 -mfma -fno-math-errno -ffast-math -floop-parallelize-all -ftree-parallelize-loops=8

void testAutoParr(int *x) {
    for (int i = 0; i < 1000; i++) {
        x[2*i+1] = x[2*i];
    }
}
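For reference, an unroll-by-two of the reported loop, which is roughly what -funroll-loops produces, can be sketched by hand. The function name testAutoParrUnrolled is illustrative, not from the report:

```c
#include <assert.h>

/* Hypothetical manually unrolled variant of the loop from the report:
   two original iterations per trip. The trip count 1000 is even, so
   no remainder loop is needed. */
void testAutoParrUnrolled(int *x) {
    for (int i = 0; i < 1000; i += 2) {
        x[2*i + 1] = x[2*i];         /* original iteration i   */
        x[2*i + 3] = x[2*i + 2];     /* original iteration i+1 */
    }
}
```

The store to x[2*i+1] and the load from x[2*i+2] never alias, so the two statements inside the body are independent and can be scheduled in parallel.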
[Bug tree-optimization/85143] Loop limit prevents (auto)vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85143 --- Comment #2 from robertw89 at googlemail dot com --- This change would be trivial. To defend my case and rant a bit ;) ... Indeed, but programmers can manually unroll a loop too ;) . What if the code is autogenerated? What if the constant comes from a devirtualized call, etc.? It could do this optimization.
[Bug tree-optimization/85143] New: Loop limit prevents (auto)vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85143 Bug ID: 85143 Summary: Loop limit prevents (auto)vectorization Product: gcc Version: 7.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: robertw89 at googlemail dot com Target Milestone: --- I expected that it generates a vectorized version, potentially specializing to the boundary. LLVM produces strange-looking (vectorized) code, so I guess it's a bug this time :) Works fine if the hardcoded boundary is removed.

void boxIntersectionSimdNative( bool*__restrict__ res, double*__restrict__ a, double*__restrict__ b, int n ) {
    for( int i = 0; i < n && i < 1337; i++) {
        res[i] = a[i] > b[i];
    }
}

output

boxIntersectionSimdNative(bool*, double*, double*, int):
        test    ecx, ecx
        jle     .L34
        mov     eax, 1
        jmp     .L30
.L35:
        cmp     r8d, 1336
        jg      .L34
.L30:
        vmovsd  xmm0, QWORD PTR [rsi-8+rax*8]
        mov     r8d, eax
        vcomisd xmm0, QWORD PTR [rdx-8+rax*8]
        seta    BYTE PTR [rdi-1+rax]
        add     rax, 1
        cmp     ecx, r8d
        jg      .L35
.L34:
        rep ret
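A possible source-level workaround (my own sketch, not suggested in the report; the function name boxIntersectionClamped is hypothetical): fold the two exit conditions into a single precomputed trip count, so the loop has one simple bound instead of the early-exit structure that trips up the vectorizer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rewrite of the reported function: clamp n against the
   hardcoded limit once, outside the loop, leaving a single-bound
   countable loop for the vectorizer. */
void boxIntersectionClamped(bool *restrict res, const double *restrict a,
                            const double *restrict b, int n) {
    int m = n < 1337 ? n : 1337;   /* combined trip count */
    for (int i = 0; i < m; i++) {
        res[i] = a[i] > b[i];
    }
}
```

The rewrite is semantically identical to the original, since `i < n && i < 1337` is exactly `i < min(n, 1337)`.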
[Bug tree-optimization/85115] New: Failure to (auto)vectorize sqrtf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85115 Bug ID: 85115 Summary: Failure to (auto)vectorize sqrtf Product: gcc Version: 7.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: robertw89 at googlemail dot com Target Milestone: --- Fails to (auto)vectorize the code below with the flags -O3 -mavx

#include <math.h>

void simdSqrt( float * __restrict__ a, float * __restrict__ res, int size) {
    int i;
    float *aAligned = (float*)__builtin_assume_aligned(a, 32);
    float *resAligned = (float*)__builtin_assume_aligned(res, 32);
    for (i = 0; i < size; i++) {
        resAligned[i] = sqrtf(aAligned[i]);
    }
}

produces (as displayed by https://godbolt.org/)

simdSqrt(float*, float*, int):
        test    edx, edx
        jle     .L8
        lea     eax, [rdx-1]
        push    r12
        vxorps  xmm2, xmm2, xmm2
        lea     r12, [rdi+4+rax*4]
        sub     rsp, 32
.L3:
        vmovss  xmm0, DWORD PTR [rdi]
        vucomiss xmm2, xmm0
        vsqrtss xmm1, xmm1, xmm0
        ja      .L12
        add     rdi, 4
        vmovss  DWORD PTR [rsi], xmm1
        add     rsi, 4
        cmp     rdi, r12
        jne     .L3
.L6:
        add     rsp, 32
        pop     r12
        ret
.L8:
        rep ret
.L12:
        vmovss  DWORD PTR [rsp+28], xmm2
        mov     QWORD PTR [rsp+16], rsi
        mov     QWORD PTR [rsp+8], rdi
        vmovss  DWORD PTR [rsp+24], xmm1
        call    sqrtf
        mov     rdi, QWORD PTR [rsp+8]
        mov     rsi, QWORD PTR [rsp+16]
        vmovss  xmm1, DWORD PTR [rsp+24]
        vmovss  xmm2, DWORD PTR [rsp+28]
        add     rdi, 4
        vmovss  DWORD PTR [rsi], xmm1
        add     rsi, 4
        cmp     rdi, r12
        jne     .L3
        jmp     .L6