[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.0
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

--- Comment #8 from Andrew Pinski ---
(In reply to Richard Biener from comment #7)
> It was fixed by adding another loop header copying pass before
> vectorization, aka ch_vect.

But that went in way back in GCC 6 (r6-1951), and the loop header copying was not happening until GCC 8.
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |RESOLVED
             Blocks|                            |53947
         Resolution|---                         |FIXED

--- Comment #7 from Richard Biener ---
It was fixed by adding another loop header copying pass before vectorization, aka ch_vect.

Of course it means we peel one iteration, which might not be 100% optimal. Optimally we'd teach PRE that those loop-carried dependences are bad(TM), just like we do for loads, and extend that to cover calls. The peeling means we need an epilogue, so we didn't really save a sqrt call.

That said, the situation is somewhat mitigated now and I'd declare it fixed anyway; the testcase is somewhat artificial (resolvable at compile time).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
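As a rough illustration of what loop header copying does in general, the exit test of a top-tested loop is duplicated in front of the loop so that the loop itself becomes bottom-tested. This is a hand-written source-level sketch, not the ch_vect pass or the code GCC emits for this testcase; the function names `sum_top_tested` and `sum_header_copied` are hypothetical.

```cpp
// Top-tested form: the exit condition lives in the loop header.
double sum_top_tested(const double* a, int n) {
    double s = 0.0;
    int i = 0;
    while (i < n) {
        s += a[i];
        ++i;
    }
    return s;
}

// After header copying (written by hand): the header test is copied
// in front of the loop as a guard, and the loop becomes a do-while
// whose exit test sits at the bottom.
double sum_header_copied(const double* a, int n) {
    double s = 0.0;
    int i = 0;
    if (i < n) {            // copied header test guards entry
        do {
            s += a[i];
            ++i;
        } while (i < n);    // exit test now at the bottom
    }
    return s;
}
```

Both functions compute the same result for any `n`, including `n == 0`; only the control-flow shape differs.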
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization

--- Comment #6 from Andrew Pinski ---
So this was fixed in GCC 8 but I cannot tell by what. ch_vect has been there since 2014, which should have done the copying of the header but did not until GCC 8. There is not enough debug output to tell what changed either.
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

--- Comment #5 from vincenzo Innocente ---
I remember something similar in the past:
--param max-completely-peel-times=1
sort of fixes it…
(why does PRE not recognize that 1/(1+0) == 1, btw??)

of course it is just a benchmark (and I can modify it to avoid the loop peeling), still
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

--- Comment #4 from Jakub Jelinek ---
Actually, it isn't vectorized at all, because PRE attempts to be smart: it figures out that for the first iteration of the loop it can avoid computing the sqrt because the result will be one, and thus moves the sqrt call into the latch, but we can't vectorize any loops that have non-empty latches.

So, either the vectorizer would need to undo this transformation, or PRE should not do it at all, or arrange for it to be done only after vectorization. Richard, any thoughts on this?
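To make the loop shape described above concrete, here is a minimal source-level sketch. It is an assumption-laden reconstruction: the original avx2sqrt.cc is not shown in this thread, so the loop body (`sqrt(1.0 + i)` over 1024 iterations) is inferred from the assembly in comment #2, and the names `sqr_straight` and `sqr_after_pre` are hypothetical. The second function is written by hand to mimic the control flow PRE is said to produce; it is not compiler output.

```cpp
#include <cmath>

// Assumed shape of the testcase loop: one sqrt per iteration,
// nothing executed between the exit test and the back edge.
double sqr_straight() {
    double sum = 0.0;
    for (int i = 0; i < 1024; ++i)
        sum += std::sqrt(1.0 + i);
    return sum;
}

// Hand-written equivalent of the shape after PRE: the first
// iteration's sqrt(1.0 + 0) is folded to 1.0, and the sqrt for each
// later iteration is computed on the way back to the loop top, so
// the latch is no longer empty.
double sqr_after_pre() {
    double sum = 0.0;
    double t = 1.0;                // sqrt(1.0 + 0), known at compile time
    int i = 0;
    for (;;) {
        sum += t;                  // loop body is entered "in the middle"
        ++i;
        if (i == 1024)
            break;
        t = std::sqrt(1.0 + i);    // sqrt now sits in the latch
    }
    return sum;
}
```

Both functions add the same values in the same order, so they return the same sum; the difference is only where the sqrt sits relative to the back edge, which is what blocks the vectorizer here.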
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

--- Comment #3 from Marc Glisse ---
-fno-tree-pre lets it vectorize sqr as well. PRE creates a jump to the middle of the loop body, which is nice but prevents vectorization.
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

--- Comment #2 from vincenzo Innocente ---
actually the code for div and sqr is different already for standard SSE

c++ -std=c++11 -Ofast -S avx2sqrt.cc -ftree-vectorizer-verbose=1 -Wall ; cat avx2sqrt.s

.L2:
    movdqa   %xmm0, %xmm1
    addl     $1, %eax
    movdqa   %xmm0, %xmm4
    cmpl     $256, %eax
    paddd    %xmm5, %xmm1
    pshufd   $238, %xmm1, %xmm0
    cvtdq2pd %xmm1, %xmm1
    movapd   %xmm3, %xmm7
    paddd    %xmm6, %xmm4
    cvtdq2pd %xmm0, %xmm0
    divpd    %xmm0, %xmm7
    movapd   %xmm7, %xmm0
    movapd   %xmm3, %xmm7
    divpd    %xmm1, %xmm7
    addpd    %xmm7, %xmm0
    addpd    %xmm0, %xmm2
    jne      .L3
    movapd   %xmm2, -24(%rsp)
    movsd    -16(%rsp), %xmm0
    addsd    %xmm2, %xmm0
    ret
    .cfi_endproc
.LFE3:
    .size    _Z3divv, .-_Z3divv
    .p2align 4,,15
    .globl   _Z3sqrv
    .type    _Z3sqrv, @function
_Z3sqrv:
.LFB4:
    .cfi_startproc
    movl     $1, %eax
    movsd    .LC4(%rip), %xmm1
    xorpd    %xmm0, %xmm0
    jmp      .L6
    .p2align 4,,10
    .p2align 3
.L7:
    cvtsi2sd %eax, %xmm1
    sqrtsd   %xmm1, %xmm1
.L6:
    addl     $1, %eax
    addsd    %xmm1, %xmm0
    cmpl     $1025, %eax
    jne      .L7
    rep; ret
    .cfi_endproc
[Bug tree-optimization/57858] AVX2: ymm used for div, not for sqrt
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57858

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek ---
I'll look at this.