[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #60 from CVS Commits ---
The master branch has been updated by H.J. Lu:

https://gcc.gnu.org/g:737355072af4cd0c24a4a8967e1485c1f3a80bfe

commit r11-2200-g737355072af4cd0c24a4a8967e1485c1f3a80bfe
Author: H.J. Lu
Date:   Mon Jul 13 09:07:00 2020 -0700

    x86: Rename VF_AVX512VL_VF1_128_256 to VF1_AVX512ER_128_256

    Since ix86_emit_swsqrtsf shouldn't be called with DF vector modes,
    rename VF_AVX512VL_VF1_128_256 to VF1_AVX512ER_128_256 and drop DF
    vector modes.

    gcc/

            PR target/96186
            PR target/88713
            * config/i386/sse.md (VF_AVX512VL_VF1_128_256): Renamed to ...
            (VF1_AVX512ER_128_256): This.  Drop DF vector modes.
            (rsqrt2): Replace VF_AVX512VL_VF1_128_256 with
            VF1_AVX512ER_128_256.

    gcc/testsuite/

            PR target/96186
            PR target/88713
            * gcc.target/i386/pr88713-3.c: New test.
--- Comment #59 from CVS Commits ---
The master branch has been updated by H.J. Lu:

https://gcc.gnu.org/g:fab263ab0fc10ea08409b80afa7e8569438b8d28

commit r11-1970-gfab263ab0fc10ea08409b80afa7e8569438b8d28
Author: H.J. Lu
Date:   Wed Jan 23 06:33:58 2019 -0800

    x86: Enable FMA in rsqrt2 expander

    Enable FMA in the rsqrt2 expander and fold the rsqrtv16sf2 expander
    into the rsqrt2 expander, which expands to UNSPEC_RSQRT28 for
    TARGET_AVX512ER.  Although it doesn't show a performance change in
    our workloads, FMA can improve other workloads.

    gcc/

            PR target/88713
            * config/i386/i386-expand.c (ix86_emit_swsqrtsf): Enable FMA.
            * config/i386/sse.md (VF_AVX512VL_VF1_128_256): New.
            (rsqrt2): Replace VF1_128_256 with VF_AVX512VL_VF1_128_256.
            (rsqrtv16sf2): Removed.

    gcc/testsuite/

            PR target/88713
            * gcc.target/i386/pr88713-1.c: New test.
            * gcc.target/i386/pr88713-2.c: Likewise.
--- Comment #58 from H.J. Lu ---
(In reply to Thomas Koenig from comment #57)
> (In reply to H.J. Lu from comment #56)
> > (In reply to Thomas Koenig from comment #55)
> > > (In reply to H.J. Lu from comment #45)
> > > > Created attachment 45510 [details]
> > > > An updated patch
> > >
> > > HJ, do you plan on committing these?
> >
> > We are collecting performance data before I submit it.
>
> Do you have the performance data by now?

A patch is posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549047.html
--- Comment #57 from Thomas Koenig ---
(In reply to H.J. Lu from comment #56)
> (In reply to Thomas Koenig from comment #55)
> > (In reply to H.J. Lu from comment #45)
> > > Created attachment 45510 [details]
> > > An updated patch
> >
> > HJ, do you plan on committing these?
>
> We are collecting performance data before I submit it.

Do you have the performance data by now?
--- Comment #56 from H.J. Lu ---
(In reply to Thomas Koenig from comment #55)
> (In reply to H.J. Lu from comment #45)
> > Created attachment 45510 [details]
> > An updated patch
>
> HJ, do you plan on committing these?

We are collecting performance data before I submit it.
--- Comment #55 from Thomas Koenig ---
(In reply to H.J. Lu from comment #45)
> Created attachment 45510 [details]
> An updated patch

HJ, do you plan on committing these?
--- Comment #54 from Chris Elrod ---
I commented elsewhere, but I built trunk a few days ago with H.J. Lu's
patches (attached here) and Thomas Koenig's inlining patches.

With these patches, g++ and all versions of the Fortran code produced
excellent asm, and the code performed excellently in benchmarks. Once those
are merged, the problems reported here will be solved.

I saw that Thomas Koenig's packing changes will wait for gcc-10. What about
H.J. Lu's fixes to rsqrt and allowing FMA use in those sections?
--- Comment #53 from rguenther at suse dot de ---
On Thu, 24 Jan 2019, glisse at gcc dot gnu.org wrote:

> --- Comment #52 from Marc Glisse ---
> (In reply to Thomas Koenig from comment #49)
> > Argh. Sacrificing performance for the sake of bugware...
>
> But note that in this PR (specifically for avx512 vectors on this cpu),
> the OP says that the recip version is slower than calling directly the
> right insn (it wasn't clear if that was for inverse or for sqrt).

Probably depends on the microarchitecture, yes. But I'd fully expect the
two-NR-step variant to be slower for a sensible HW implementation (even
more so if we need to fend off the exceptional cases).
--- Comment #52 from Marc Glisse ---
(In reply to Thomas Koenig from comment #49)
> Argh. Sacrificing performance for the sake of bugware...

But note that in this PR (specifically for avx512 vectors on this cpu), the
OP says that the recip version is slower than calling directly the right
insn (it wasn't clear if that was for inverse or for sqrt).
--- Comment #51 from rguenther at suse dot de ---
On Thu, 24 Jan 2019, tkoenig at gcc dot gnu.org wrote:

> --- Comment #49 from Thomas Koenig ---
> (In reply to Uroš Bizjak from comment #48)
> > (In reply to rguent...@suse.de from comment #47)
> > > > But why don't we generate sqrtps for vector sqrtf?
> > >
> > > That's the default for -mrecip; back in time we benchmarked it and
> > > scalar recip miscompares sth.
> >
> > It was polyhedron benchmark, in one benchmark, the index was calculated
> > from square root, and that was too sensitive for 2 ULP difference.
>
> Argh. Sacrificing performance for the sake of bugware...

Maybe use of FMA can recover 1 ULP and the benchmark ;)
--- Comment #50 from Uroš Bizjak ---
(In reply to Thomas Koenig from comment #49)
> (In reply to Uroš Bizjak from comment #48)
> > (In reply to rguent...@suse.de from comment #47)
> > > > But why don't we generate sqrtps for vector sqrtf?
> > >
> > > That's the default for -mrecip; back in time we benchmarked it and
> > > scalar recip miscompares sth.
> >
> > It was polyhedron benchmark, in one benchmark, the index was calculated
> > from square root, and that was too sensitive for 2 ULP difference.
>
> Argh. Sacrificing performance for the sake of bugware...

The details are in [1] and all the drama is documented in PR32352.

[1] https://gcc.gnu.org/ml/gcc-patches/2007-06/msg01044.html
--- Comment #49 from Thomas Koenig ---
(In reply to Uroš Bizjak from comment #48)
> (In reply to rguent...@suse.de from comment #47)
> > > But why don't we generate sqrtps for vector sqrtf?
> >
> > That's the default for -mrecip; back in time we benchmarked it and scalar
> > recip miscompares sth.
>
> It was polyhedron benchmark, in one benchmark, the index was calculated
> from square root, and that was too sensitive for 2 ULP difference.

Argh. Sacrificing performance for the sake of bugware...
--- Comment #48 from Uroš Bizjak ---
(In reply to rguent...@suse.de from comment #47)
> > But why don't we generate sqrtps for vector sqrtf?
>
> That's the default for -mrecip; back in time we benchmarked it and scalar
> recip miscompares sth.

It was polyhedron benchmark, in one benchmark, the index was calculated from
square root, and that was too sensitive for 2 ULP difference.
--- Comment #47 from rguenther at suse dot de ---
On January 23, 2019 5:13:12 PM GMT+01:00, "hjl.tools at gmail dot com" wrote:

> --- Comment #46 from H.J. Lu ---
> We generate sqrtps for scalar sqrtf:
> [...]
> But why don't we generate sqrtps for vector sqrtf?

That's the default for -mrecip; back in time we benchmarked it and scalar
recip miscompares sth.

> [...]
--- Comment #46 from H.J. Lu ---
We generate sqrtps for scalar sqrtf:

[hjl@gnu-skx-1 pr88713]$ cat s.i
extern float sqrtf(float x);

float
rsqrt(float r)
{
  return sqrtf (r);
}
[hjl@gnu-skx-1 pr88713]$ gcc -Ofast -S s.i
[hjl@gnu-skx-1 pr88713]$ cat s.s
        .file   "s.i"
        .text
        .p2align 4,,15
        .globl  rsqrt
        .type   rsqrt, @function
rsqrt:
.LFB0:
        .cfi_startproc
        sqrtss  %xmm0, %xmm0
        ret
        .cfi_endproc
.LFE0:
        .size   rsqrt, .-rsqrt
        .ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 pr88713]$

But why don't we generate sqrtps for vector sqrtf?

[hjl@gnu-skx-1 pr88713]$ cat y.i
extern float sqrtf(float x);

void
rsqrt(float* restrict r, float* restrict a){
  for (int i = 0; i < 16; i++){
    r[i] = sqrtf(a[i]);
  }
}
[hjl@gnu-skx-1 pr88713]$ gcc -S -Ofast y.i
[hjl@gnu-skx-1 pr88713]$ cat y.s
        .file   "y.i"
        .text
        .p2align 4,,15
        .globl  rsqrt
        .type   rsqrt, @function
rsqrt:
.LFB0:
        .cfi_startproc
        movups  (%rsi), %xmm1
        pxor    %xmm2, %xmm2
        movaps  .LC0(%rip), %xmm4
        movaps  %xmm2, %xmm3
        rsqrtps %xmm1, %xmm0
        cmpneqps        %xmm1, %xmm3
        movaps  %xmm1, %xmm5
        andps   %xmm3, %xmm0
        movaps  .LC1(%rip), %xmm3
        mulps   %xmm0, %xmm5
        mulps   %xmm5, %xmm0
        mulps   %xmm3, %xmm5
        movaps  %xmm0, %xmm1
        movups  16(%rsi), %xmm0
        addps   %xmm4, %xmm1
        mulps   %xmm5, %xmm1
        movaps  %xmm2, %xmm5
        cmpneqps        %xmm0, %xmm5
        movups  %xmm1, (%rdi)
        rsqrtps %xmm0, %xmm1
        andps   %xmm5, %xmm1
        movaps  %xmm2, %xmm5
        mulps   %xmm1, %xmm0
        mulps   %xmm0, %xmm1
        mulps   %xmm3, %xmm0
        addps   %xmm4, %xmm1
        mulps   %xmm0, %xmm1
        movups  32(%rsi), %xmm0
        cmpneqps        %xmm0, %xmm5
        movups  %xmm1, 16(%rdi)
        rsqrtps %xmm0, %xmm1
        andps   %xmm5, %xmm1
        mulps   %xmm1, %xmm0
        mulps   %xmm0, %xmm1
        mulps   %xmm3, %xmm0
        addps   %xmm4, %xmm1
        mulps   %xmm0, %xmm1
        movups  %xmm1, 32(%rdi)
        movups  48(%rsi), %xmm1
        rsqrtps %xmm1, %xmm0
        cmpneqps        %xmm1, %xmm2
        andps   %xmm2, %xmm0
        mulps   %xmm0, %xmm1
        mulps   %xmm1, %xmm0
        mulps   %xmm3, %xmm1
        addps   %xmm4, %xmm0
        mulps   %xmm1, %xmm0
        movups  %xmm0, 48(%rdi)
        ret
        .cfi_endproc
.LFE0:
        .size   rsqrt, .-rsqrt
        .section        .rodata.cst16,"aM",@progbits,16
        .align 16
.LC0:
        .long   3225419776
        .long   3225419776
        .long   3225419776
        .long   3225419776
        .align 16
.LC1:
        .long   3204448256
        .long   3204448256
        .long   3204448256
        .long   3204448256
        .ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 pr88713]$
H.J. Lu changed:

           What            |Removed |Added
--------------------------------------------
 Attachment #45509 is      |0       |1
 obsolete                  |        |

--- Comment #45 from H.J. Lu ---
Created attachment 45510
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45510&action=edit
An updated patch
H.J. Lu changed:

           What            |Removed |Added
--------------------------------------------
 Attachment #45508 is      |0       |1
 obsolete                  |        |

--- Comment #44 from H.J. Lu ---
Created attachment 45509
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45509&action=edit
A combined patch
H.J. Lu changed:

           What            |Removed |Added
--------------------------------------------
 Attachment #45507 is      |0       |1
 obsolete                  |        |

--- Comment #43 from H.J. Lu ---
Created attachment 45508
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45508&action=edit
A patch
--- Comment #42 from H.J. Lu ---
Created attachment 45507
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45507&action=edit
A patch
--- Comment #41 from Uroš Bizjak ---
(In reply to H.J. Lu from comment #40)
> Like this?
> [...]
>  {
>    ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
>    DONE;

<MODE>mode instead of V16SFmode.
--- Comment #40 from H.J. Lu ---
(In reply to rguent...@suse.de from comment #39)
> > > Yes. The lack of an expander for the rsqrt operation is probably
> > > more severe though (causing sqrt + approx recip to appear)
> >
> > Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?
>
> I think we can but we lack an expander for this. IIRC for the following
> existing expander the RTL is ignored and thus we could simply
> replace the TARGET_AVX512ER check with TARGET_AVX512F?
>
> (define_expand "rsqrtv16sf2"
>   [(set (match_operand:V16SF 0 "register_operand")
>         (unspec:V16SF
>           [(match_operand:V16SF 1 "vector_operand")]
>           UNSPEC_RSQRT28))]
>   "TARGET_SSE_MATH && TARGET_AVX512ER"
> {
>   ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
>   DONE;
> })

Like this?

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 3af4adc63dd..c9b4750ccc4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1969,21 +1969,11 @@
    (set_attr "mode" "<MODE>")])

 (define_expand "rsqrt<mode>2"
-  [(set (match_operand:VF1_128_256 0 "register_operand")
-       (unspec:VF1_128_256
-         [(match_operand:VF1_128_256 1 "vector_operand")] UNSPEC_RSQRT))]
+  [(set (match_operand:VF_AVX512VL 0 "register_operand")
+       (unspec:VF_AVX512VL
+         [(match_operand:VF_AVX512VL 1 "vector_operand")]
+         UNSPEC_RSQRT))]
   "TARGET_SSE_MATH"
-{
-  ix86_emit_swsqrtsf (operands[0], operands[1], <MODE>mode, true);
-  DONE;
-})
-
-(define_expand "rsqrtv16sf2"
-  [(set (match_operand:V16SF 0 "register_operand")
-       (unspec:V16SF
-         [(match_operand:V16SF 1 "vector_operand")]
-         UNSPEC_RSQRT28))]
-  "TARGET_SSE_MATH && TARGET_AVX512ER"
 {
   ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
   DONE;
--- Comment #39 from rguenther at suse dot de ---
On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:

> --- Comment #38 from H.J. Lu ---
> (In reply to rguent...@suse.de from comment #37)
> > [...]
> > Yes. The lack of an expander for the rsqrt operation is probably
> > more severe though (causing sqrt + approx recip to appear)
>
> Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?

I think we can but we lack an expander for this. IIRC for the following
existing expander the RTL is ignored and thus we could simply
replace the TARGET_AVX512ER check with TARGET_AVX512F?

(define_expand "rsqrtv16sf2"
  [(set (match_operand:V16SF 0 "register_operand")
        (unspec:V16SF
          [(match_operand:V16SF 1 "vector_operand")]
          UNSPEC_RSQRT28))]
  "TARGET_SSE_MATH && TARGET_AVX512ER"
{
  ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
  DONE;
})
--- Comment #38 from H.J. Lu ---
(In reply to rguent...@suse.de from comment #37)
> > (In reply to Richard Biener from comment #34)
> > > GCC definitely fails to see the FMA use as opportunity in
> > > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> > > expander w/o avx512er where we could still use the NR sequence
> > > with the other instruction. HJ?
> >
> > Like this?
>
> Yes. The lack of an expander for the rsqrt operation is probably
> more severe though (causing sqrt + approx recip to appear)

Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?
--- Comment #37 from rguenther at suse dot de ---
On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:

> --- Comment #36 from H.J. Lu ---
> (In reply to Richard Biener from comment #34)
> > GCC definitely fails to see the FMA use as opportunity in
> > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> > expander w/o avx512er where we could still use the NR sequence
> > with the other instruction. HJ?
>
> Like this?

Yes. The lack of an expander for the rsqrt operation is probably
more severe though (causing sqrt + approx recip to appear).

> [...]
--- Comment #36 from H.J. Lu ---
(In reply to Richard Biener from comment #34)
> GCC definitely fails to see the FMA use as opportunity in
> ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> expander w/o avx512er where we could still use the NR sequence
> with the other instruction. HJ?

Like this?

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e0d7c74fcec..0bbe3772ab7 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -44855,14 +44855,22 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, machine_mode mode, bool recip)
        }
     }

+  mthree = force_reg (mode, mthree);
+
   /* e0 = x0 * a */
   emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
-  /* e1 = e0 * x0 */
-  emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));

-  /* e2 = e1 - 3. */
-  mthree = force_reg (mode, mthree);
-  emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
+  if (TARGET_FMA || TARGET_AVX512F)
+    emit_insn (gen_rtx_SET (e2,
+                           gen_rtx_FMA (mode, e0, x0, mthree)));
+  else
+    {
+      /* e1 = e0 * x0 */
+      emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));
+
+      /* e2 = e1 - 3. */
+      emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
+    }

   mhalf = force_reg (mode, mhalf);
   if (recip)
--- Comment #35 from Chris Elrod ---
> rsqrt:
> .LFB12:
>         .cfi_startproc
>         vrsqrt28ps      (%rsi), %zmm0
>         vmovups %zmm0, (%rdi)
>         vzeroupper
>         ret
>
> (huh? isn't there a NR step missing?)

I assume because vrsqrt28ps is much more accurate than vrsqrt14ps, it wasn't
considered necessary. Unfortunately, -march=skylake-avx512 does not have
-mavx512er, and therefore should use the less accurate vrsqrt14ps + NR step.
I think vrsqrt14pd/s are -mavx512f or -mavx512vl.

> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without
> that I don't know how the machinery can guess how to use rsqrt (there are
> probably ways).

Looking at the asm from only r[i] = sqrtf(a[i]):

        vmovups (%rsi), %zmm1
        vxorps  %xmm0, %xmm0, %xmm0
        vcmpps  $4, %zmm1, %zmm0, %k1
        vrsqrt14ps      %zmm1, %zmm0{%k1}{z}
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmulps  .LC1(%rip), %zmm1, %zmm1
        vaddps  .LC0(%rip), %zmm0, %zmm0
        vmulps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

vs the asm from only r[i] = 1 / a[i]:

        vmovups (%rsi), %zmm1
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

it looks like the expander is there for sqrt, and for inverse, and we're just
getting both one after the other. So it does look like I could benchmark which
one is slower than the regular instruction on my platform, if that would be
useful.
Richard Biener changed:

           What            |Removed |Added
--------------------------------------------
                 CC        |        |hjl.tools at gmail dot com

--- Comment #34 from Richard Biener ---
So with -Ofast and -mprefer-vector-width=256 I get

  [local count: 63136019]:
  vect__4.2_3 = MEM[(float *)a_11(D)];
  vect__5.3_4 = RSQRT (vect__4.2_3);
  MEM[(float *)r_12(D)] = vect__5.3_4;
  vect__4.2_21 = MEM[(float *)a_11(D) + 32B];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D) + 32B] = vect__5.3_20;

while with -mprefer-vector-width=512 I need -mavx512er to trigger the
expander, then I also get

  [local count: 63136020]:
  vect__4.2_21 = MEM[(float *)a_11(D)];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D)] = vect__5.3_20;

and in that case

rsqrt:
.LFB12:
        .cfi_startproc
        vrsqrt28ps      (%rsi), %zmm0
        vmovups %zmm0, (%rdi)
        vzeroupper
        ret

(huh? isn't there a NR step missing?)

For -mprefer-vector-width=256 I get (irrespective of -mavx512er):

rsqrt:
.LFB12:
        .cfi_startproc
        vmovups (%rsi), %ymm1
        vmovaps .LC1(%rip), %ymm3
        vrsqrtps        %ymm1, %ymm2
        vmovaps .LC0(%rip), %ymm4
        vmovups 32(%rsi), %ymm0
        vmulps  %ymm1, %ymm2, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmulps  %ymm3, %ymm2, %ymm2
        vaddps  %ymm4, %ymm1, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmovups %ymm1, (%rdi)
        vrsqrtps        %ymm0, %ymm1
        vmulps  %ymm0, %ymm1, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmulps  %ymm3, %ymm1, %ymm1
        vaddps  %ymm4, %ymm0, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmovups %ymm0, 32(%rdi)
        vzeroupper

so the issue lies somewhere in the backend. Of the "fast" flags you need
-ffinite-math-only -fno-math-errno -funsafe-math-optimizations.

GCC definitely fails to see the FMA use as opportunity in
ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
expander w/o avx512er where we could still use the NR sequence
with the other instruction. HJ?
--- Comment #33 from Marc Glisse ---
(In reply to Chris Elrod from comment #32)
> (In reply to Marc Glisse from comment #31)
> > What we need to understand is why gcc doesn't try to generate rsqrt

Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without
that I don't know how the machinery can guess how to use rsqrt (there are
probably ways).

> The approximate sqrt, and then approximate reciprocal approximations were
> slower on my computer than just vsqrt followed by div.

We can probably split that into the speed of sqrt vs its approximation and
inverse (div) vs its approximation. At least one of them seems to be a
pessimization on that platform.
--- Comment #32 from Chris Elrod ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc calculates the rsqrt directly
>
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> different formula than rsqrt). vrcp14ps shows that it is computing an
> inverse later. What we need to understand is why gcc doesn't try to
> generate rsqrt (which would also have vrsqrt14ps, but a slightly different
> formula without the comparison with 0 and masking, and without needing an
> inversion afterwards).

Okay, I think I follow you. You're saying instead of doing this (from
rguenther), which we want (also without the comparison to 0 and masking, as
you note):

/* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

it is doing this, which also uses the rsqrt instruction:

/* sqrt(a) = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

and then calculating an inverse approximation of that?

The approximate sqrt, and then approximate reciprocal, approximations were
slower on my computer than just vsqrt followed by div.
--- Comment #31 from Marc Glisse ---
(In reply to Chris Elrod from comment #30)
> gcc calculates the rsqrt directly

No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
different formula than rsqrt). vrcp14ps shows that it is computing an inverse
later. What we need to understand is why gcc doesn't try to generate rsqrt
(which would also have vrsqrt14ps, but a slightly different formula without
the comparison with 0 and masking, and without needing an inversion
afterwards).
--- Comment #30 from Chris Elrod ---
(In reply to Marc Glisse from comment #29)
> The main difference I can see is that clang computes rsqrt directly, while
> gcc first computes sqrt and then computes the inverse. Also gcc seems
> afraid of getting NaN for sqrt(0) so it masks out this value.
> ix86_emit_swsqrtsf in gcc/config/i386/i386.c seems like a good place to
> look at.

gcc calculates the rsqrt directly with -funsafe-math-optimizations and a
couple other flags (or just -ffast-math):

        vmovups (%rsi), %zmm0
        vxorps  %xmm1, %xmm1, %xmm1
        vcmpps  $4, %zmm0, %zmm1, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
--- Comment #29 from Marc Glisse ---
The main difference I can see is that clang computes rsqrt directly, while
gcc first computes sqrt and then computes the inverse. Also gcc seems afraid
of getting NaN for sqrt(0), so it masks out this value. ix86_emit_swsqrtsf
in gcc/config/i386/i386.c seems like a good place to look at.
--- Comment #28 from Chris Elrod ---
Created attachment 45501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem.

I attached a minimum working example, demonstrating the problem of excessive
code generation for reciprocal square root, in the file rsqrt.c. You can
compile with:

gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s
clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt: https://godbolt.org/z/c7Z0En

For gcc:

        vmovups (%rsi), %zmm0
        vxorps  %xmm1, %xmm1, %xmm1
        vcmpps  $4, %zmm0, %zmm1, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  .LC1(%rip), %zmm2, %zmm2
        vaddps  .LC0(%rip), %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmulps  %zmm0, %zmm1, %zmm0
        vaddps  %zmm1, %zmm1, %zmm1
        vsubps  %zmm0, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)

For clang:

        vmovups (%rsi), %zmm0
        vrsqrt14ps      %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm0
        vfmadd213ps     .LCPI0_0(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * zmm0) + mem
        vmulps  .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmovups %zmm0, (%rdi)

Clang looks like it is doing

/* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5. gcc is doing much
more, and fairly different.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #27 from Chris Elrod ---

g++ -mrecip=all -O3 -fno-signed-zeros -fassociative-math -freciprocal-math -fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gppvectorization_test.s vectorization_test.cpp

is not enough to get vrsqrt. I need -funsafe-math-optimizations for the
instruction to appear in the asm.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #26 from Chris Elrod ---
> You can try enabling -mrecip to see RSQRT in .optimized - there's
> probably late 1/sqrt optimization on RTL.

No luck. The full commands I used:

gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gfortvectorizationdump.s vectorization_test.f90

g++ -mrecip -Ofast -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gppvectorization_test.s vectorization_test.cpp

g++'s output was similar:

  vect_U33_60.31_372 = SQRT (vect_S33_59.30_371);
  vect_Ui33_61.32_374 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 } / vect_U33_60.31_372;
  vect_U13_62.33_375 = vect_S13_47.24_359 * vect_Ui33_61.32_374;
  vect_U23_63.34_376 = vect_S23_53.27_365 * vect_Ui33_61.32_374;

and it has the same assembly as gfortran for the rsqrt:

        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm6, %zmm2, %zmm2
        vaddps  %zmm7, %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm10
        vmulps  %zmm0, %zmm10, %zmm0
        vmulps  %zmm0, %zmm10, %zmm0
        vaddps  %zmm10, %zmm10, %zmm10
        vsubps  %zmm0, %zmm10, %zmm10
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #25 from rguenther at suse dot de ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
>
> --- Comment #24 from Chris Elrod ---
> The dump looks like this:
>
>   vect__67.78_217 = SQRT (vect__213.77_225);
>   vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 } / vect__67.78_217;
>   vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248;
>   vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248;
>   vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248;
>
> so the vrsqrt optimization happens later. g++ shows the same problems with
> weird code generation. However this:
>
> /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
>    rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */
>
> does not match this:
>
>         vrsqrt14ps      %zmm1, %zmm2    # comparison and mask removed
>         vmulps  %zmm1, %zmm2, %zmm0
>         vmulps  %zmm2, %zmm0, %zmm1
>         vmulps  %zmm6, %zmm0, %zmm0
>         vaddps  %zmm7, %zmm1, %zmm1
>         vmulps  %zmm0, %zmm1, %zmm1
>         vrcp14ps        %zmm1, %zmm0
>         vmulps  %zmm1, %zmm0, %zmm1
>         vmulps  %zmm1, %zmm0, %zmm1
>         vaddps  %zmm0, %zmm0, %zmm0
>         vsubps  %zmm1, %zmm0, %zmm0
>
> Recommendations on the next place to look for what's going on?

You can try enabling -mrecip to see RSQRT in .optimized - there's
probably late 1/sqrt optimization on RTL.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #24 from Chris Elrod ---
The dump looks like this:

  vect__67.78_217 = SQRT (vect__213.77_225);
  vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 } / vect__67.78_217;
  vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248;
  vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248;
  vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248;

so the vrsqrt optimization happens later. g++ shows the same problems with
weird code generation. However this:

/* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
   rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

does not match this:

        vrsqrt14ps      %zmm1, %zmm2    # comparison and mask removed
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm2, %zmm0, %zmm1
        vmulps  %zmm6, %zmm0, %zmm0
        vaddps  %zmm7, %zmm1, %zmm1
        vmulps  %zmm0, %zmm1, %zmm1
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm0

Recommendations on the next place to look for what's going on?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #23 from rguenther at suse dot de ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
>
> --- Comment #22 from Chris Elrod ---
> Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
> microseconds. So that is an improvement, but accounts for only a small part
> of the difference with the LLVM compilers.
>
> -O3 -fno-math-errno
>
> was about 3.5 microseconds, so -funsafe-math-optimizations still results in
> a regression in this code.
>
> 3.5 microseconds is roughly as fast as you can get with vsqrt and div.
>
> My best guess now is that gcc does a lot more to improve the accuracy of
> vsqrt. If I understand correctly, these are all the involved instructions:
>
>         vmovaps .LC2(%rip), %zmm7
>         vmovaps .LC3(%rip), %zmm6
>         # for loop begins
>         vrsqrt14ps      %zmm1, %zmm2    # comparison and mask removed
>         vmulps  %zmm1, %zmm2, %zmm0
>         vmulps  %zmm2, %zmm0, %zmm1
>         vmulps  %zmm6, %zmm0, %zmm0
>         vaddps  %zmm7, %zmm1, %zmm1
>         vmulps  %zmm0, %zmm1, %zmm1
>         vrcp14ps        %zmm1, %zmm0
>         vmulps  %zmm1, %zmm0, %zmm1
>         vmulps  %zmm1, %zmm0, %zmm1
>         vaddps  %zmm0, %zmm0, %zmm0
>         vsubps  %zmm1, %zmm0, %zmm0
>         vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2
>
> If I understand this correctly:
>
> zmm2 = (approx) 1 / sqrt(zmm1)
> zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1)
> zmm1 = zmm0 * zmm2 = (approx) 1
> zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1)
> zmm1 = zmm7 * zmm1 = (approx) constant7
> zmm1 = zmm0 * zmm1 = (approx) constant6 * constant7 * sqrt(zmm1)
> zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
> zmm1 = zmm1 * zmm0 = (approx) 1
> zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
> zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 1 / (constant6 * constant7)
> zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7)
>
> which implies that constant6 * constant7 = approximately -1?

GCC implements

/* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
   rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

which looks similar to what LLVM does. You can look at the
-fdump-tree-optimized dump to see if there's anything fishy.

> LLVM seems to do a much simpler / briefer update of the output of vrsqrt.
>
> When I implemented a vrsqrt intrinsic in a Julia library, I just looked at
> Wikipedia and did (roughly):
>
> constant1 = -0.5
> constant2 = 1.5
>
> zmm2 = (approx) 1 / sqrt(zmm1)
> zmm3 = constant1 * zmm1
> zmm1 = zmm2 * zmm2
> zmm3 = zmm3 * zmm1 + constant2
> zmm2 = zmm2 * zmm3
>
> I am not a numerical analyst, so I can't comment on the relative validity
> or accuracy of these approaches. I also don't know what LLVM 7+ does;
> LLVM 6 doesn't use vrsqrt.
>
> I would be interested in reading explanations or discussions, if any are
> available.
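The rsqrt formula in that comment is one Newton-Raphson step for
f(r) = 1/r² − a. A quick numeric check (mine, not from the thread) that
iterating it converges to 1/sqrt(a):

```python
import math

def refine(a, r):
    # GCC's comment form of the Newton step:
    #   rsqrt(a) ~= -0.5 * r * (a*r*r - 3.0), with r the current estimate.
    # Algebraically this is r * (3 - a*r*r) / 2, the standard rsqrt step.
    return -0.5 * r * (a * r * r - 3.0)

a = 2.0
r = 0.7  # crude starting guess for 1/sqrt(2) ~= 0.70710678
for _ in range(3):
    r = refine(a, r)
print(abs(r - 1.0 / math.sqrt(a)) < 1e-9)  # True: quadratic convergence
```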
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #22 from Chris Elrod ---
Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
microseconds. So that is an improvement, but accounts for only a small part
of the difference with the LLVM compilers.

-O3 -fno-math-errno

was about 3.5 microseconds, so -funsafe-math-optimizations still results in a
regression in this code.

3.5 microseconds is roughly as fast as you can get with vsqrt and div.

My best guess now is that gcc does a lot more to improve the accuracy of
vsqrt. If I understand correctly, these are all the involved instructions:

        vmovaps .LC2(%rip), %zmm7
        vmovaps .LC3(%rip), %zmm6
        # for loop begins
        vrsqrt14ps      %zmm1, %zmm2    # comparison and mask removed
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm2, %zmm0, %zmm1
        vmulps  %zmm6, %zmm0, %zmm0
        vaddps  %zmm7, %zmm1, %zmm1
        vmulps  %zmm0, %zmm1, %zmm1
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm0
        vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2

If I understand this correctly:

zmm2 = (approx) 1 / sqrt(zmm1)
zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1)
zmm1 = zmm0 * zmm2 = (approx) 1
zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1)
zmm1 = zmm7 * zmm1 = (approx) constant7
zmm1 = zmm0 * zmm1 = (approx) constant6 * constant7 * sqrt(zmm1)
zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm1 = zmm1 * zmm0 = (approx) 1
zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7)

which implies that constant6 * constant7 = approximately -1?

LLVM seems to do a much simpler / briefer update of the output of vrsqrt.

When I implemented a vrsqrt intrinsic in a Julia library, I just looked at
Wikipedia and did (roughly):

constant1 = -0.5
constant2 = 1.5

zmm2 = (approx) 1 / sqrt(zmm1)
zmm3 = constant1 * zmm1
zmm1 = zmm2 * zmm2
zmm3 = zmm3 * zmm1 + constant2
zmm2 = zmm2 * zmm3

I am not a numerical analyst, so I can't comment on the relative validity or
accuracy of these approaches. I also don't know what LLVM 7+ does; LLVM 6
doesn't use vrsqrt.

I would be interested in reading explanations or discussions, if any are
available.
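The Wikipedia-style update above (constants -0.5 and 1.5) and GCC's
`-0.5 * r * (a*r*r - 3.0)` form are the same Newton step, just factored
differently. A quick check with my own example values:

```python
# Wikipedia form:  r' = r * (1.5 - 0.5 * a * r * r)
# GCC form:        r' = -0.5 * r * (a * r * r - 3.0)
# Multiplying the GCC form out gives r * (1.5 - 0.5*a*r*r): identical.
a, r = 2.0, 0.7
wiki = r * (1.5 - 0.5 * a * r * r)
gcc = -0.5 * r * (a * r * r - 3.0)
print(abs(wiki - gcc) < 1e-12)  # True, up to rounding
```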
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #21 from rguenther at suse dot de ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
>
> --- Comment #19 from Chris Elrod ---
> To add a little more: I used inline asm for direct access to the rsqrt
> instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the
> answers are wrong beyond just a couple significant digits. With the Newton
> step, the answers are correct.
>
> My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding
> the Newton step. They get the correct answer.
>
> That leaves my best guess for the performance difference as owing to the
> masked "vrsqrt14ps" that gcc is using:
>
>         vcmpps  $4, %zmm0, %zmm5, %k1
>         vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
>
> Is there any way for me to test that idea? Edit the asm to remove the
> vcmpps and mask, compile the asm with gcc, and benchmark it?

Usually it's easiest to compile to assembler with GCC (-S) and test this
kind of theory by editing the GCC-generated assembly and then benchmarking
that. Just use the assembler file as input to the gfortran compile command
instead of the .f when linking the program.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #20 from Chris Elrod ---
To add a little more: I used inline asm for direct access to the rsqrt
instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers
are wrong beyond just a couple significant digits. With the Newton step, the
answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the
masked "vrsqrt14ps" that gcc is using (g++ does this too):

        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Edit the asm to remove the vcmpps
and mask, compile the asm with gcc, and benchmark it?

Okay, I just tried playing around with flags and looking at asm. I compiled
with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math -fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgppvectorization_test.so vectorization_test.cpp

which is basically all the flags implied by "-ffast-math" except
"-funsafe-math-optimizations". This does include the flags implied by
-funsafe-math-optimizations, just not that flag itself.

This list can be simplified (only "-fno-math-errno" is needed) to:

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgppvectorization_test.so vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgfortvectorization_test.so vectorization_test.f90

This results in the following:

        vsqrtps (%r8,%rax), %zmm0
        vdivps  %zmm0, %zmm7, %zmm0

i.e., vsqrt and a division, rather than the masked reciprocal square root.

With N = 2827, that speeds gfortran and g++ up from about 4.3 microseconds to
3.5 microseconds. For comparison, Clang takes about 2 microseconds, and
Flang/ispc/an awful-looking unsafe Rust take 2.3-2.4 microseconds, using
vrsqrt14ps (without a mask) and a Newton step instead of vsqrtps followed by
a division.

So "-funsafe-math-optimizations" results in a regression here.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #19 from Chris Elrod ---
To add a little more: I used inline asm for direct access to the rsqrt
instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers
are wrong beyond just a couple significant digits. With the Newton step, the
answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the
masked "vrsqrt14ps" that gcc is using:

        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Edit the asm to remove the vcmpps
and mask, compile the asm with gcc, and benchmark it?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #18 from Chris Elrod ---
I can confirm that the inlined packing does allow gfortran to vectorize the
loop. So allowing packing to inline does seem (to me) like an optimization
well worth making. However, performance seems to be about the same as before,
still close to 2x slower than Flang.

There is definitely something interesting going on in Flang's SLP
vectorization, though. I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

subroutine vpdbacksolve(Uix, x, S)
    real, dimension(VECTORWIDTH,3)             :: Uix
    real, dimension(VECTORWIDTH,3), intent(in) :: x
    real, dimension(VECTORWIDTH,6), intent(in) :: S

    real, dimension(VECTORWIDTH) :: U11, U12, U22, U13, U23, U33, &
                                    Ui11, Ui12, Ui22, Ui33

    U33 = sqrt(S(:,6))
    Ui33 = 1 / U33
    U13 = S(:,4) * Ui33
    U23 = S(:,5) * Ui33
    U22 = sqrt(S(:,3) - U23**2)
    Ui22 = 1 / U22
    U12 = (S(:,2) - U13*U23) * Ui22
    U11 = sqrt(S(:,1) - U12**2 - U13**2)

    Ui11 = 1 / U11 ! u11
    Ui12 = - U12 * Ui11 * Ui22 ! u12

    Uix(:,3) = Ui33*x(:,3)
    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
    Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

end subroutine vpdbacksolve

in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling.
I wanted to modify the Fortran file to benchmark these, but I'm pretty sure
Flang cheated in the benchmarks.

So, compiling into a shared library and benchmarking from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.104 ns (0.00% GC)
  median time:      15.563 ns (0.00% GC)
  mean time:        16.017 ns (0.00% GC)
  maximum time:     49.524 ns (0.00% GC)
  --------------
  samples:          1
  evals/sample:     998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.394 ns (0.00% GC)
  median time:      24.562 ns (0.00% GC)
  mean time:        25.600 ns (0.00% GC)
  maximum time:     58.652 ns (0.00% GC)
  --------------
  samples:          1
  evals/sample:     996

That is over 60% faster for Flang, which would account for much, but not all,
of the runtime difference in the actual for loops.

For comparison, the vectorized loop in processbpp covers 16 samples per
iteration. The benchmarks above were with N = 1024, so 1024/16 = 64
iterations. For the three gfortran benchmarks (that averaged 100,000 runs of
the loop), each loop iteration averaged about

1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) = 21.230246197916664

For Flang, that was:

1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) = 9.99152083334

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang,
respectively.

Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

here are a few things I notice.

1. gfortran always uses masked reciprocal square root operations, to make
sure it only takes the square root of non-negative (positive?) numbers:

        vxorps  %xmm5, %xmm5, %xmm5
        ...
        vmovups (%rsi,%rax), %zmm0
        vmovups 0(%r13,%rax), %zmm9
        vcmpps  $4, %zmm0, %zmm5, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

This might be avx512f specific? Either way, Flang does not use masks:

        vmovups (%rcx,%r14), %zmm4
        vrsqrt14ps      %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact
of this may be. Agner Fog's instruction tables, for example, don't mention
mask arguments for vrsqrt14ps.

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps. There are 8
total, plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected
per loop iteration (fpdbacksolve's arguments are a vector of length 3 and
another of length 6; it returns a vector of length 3).

gfortran's loop body has 3 unnecessary vmovaps, copying register contents.
gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying
register contents.

Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple of
unnecessary memory accesses. Ouch! It also moved values on/off (the stack?):

        vmovaps %zmm2, .BSS4+192(%rip)
        ...
        vmovaps %zmm5, .BSS4+320(%rip)
        ...
        vmovaps .BSS4+192(%rip), %zmm5
        ... # zmm5 is overwritten in here; I just mean to show the sort of stuff that goes on
        vmulps  .BSS4+320(%rip), %zmm5, %zmm0

Some of those moves also don't get used again, and some other things are just
plain weird:

        vxorps  %xmm3, %xmm3, %xmm3
        vfnmsub231ps    %zmm2, %zmm0, %zmm3     # zmm3 = -(zmm0 * zmm2) - zmm3
        vmovaps %zmm3, .BSS4+576(%rip)

Like, why zero out the 128-bit portion of zmm3?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #17 from Thomas Koenig ---
What an inline packing would (approximately) produce is this:

subroutine processBPP(X, BPP, N)
    integer, intent(in)                :: N
    real, dimension(N,3), intent(out)  :: X
    real, dimension(N,10), intent(in)  :: BPP
    integer :: i
    real :: tmp1(3)
    real :: tmp2(6)
    integer :: k1, k2

    do concurrent (i = 1:N)
        k1 = 0
        do
            if (.not. k1 < 3) exit
            tmp1(k1+1) = BPP(i,k1+1)
            k1 = k1 + 1
        end do
        k2 = 0
        do
            if (.not. k2 < 6) exit
            tmp2(k2+1) = BPP(i,k2+5)
            k2 = k2 + 1
        end do
        X(i,:) = fpdbacksolve(tmp1, tmp2)
    end do
end subroutine processBPP

I see no timing difference for gfortran with this vs. the (:) version.

Chris, can you confirm this? And is flang still faster by a factor of two if
you use this version?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |koenigni at gcc dot gnu.org

--- Comment #16 from Thomas Koenig ---
(In reply to Richard Biener from comment #15)
> So can the fortran FE inline the _gfortran_internal_pack() call? It looks
> like flang manages to elide this when inlining the function at least?

In principle, this could be done. I guess it would be a good idea to create a
test case first which would emulate what inlining pack would do.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #15 from Richard Biener ---
So can the fortran FE inline the _gfortran_internal_pack() call? It looks
like flang manages to elide this when inlining the function at least?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #14 from Chris Elrod ---
It's not really reproducible across runs:

$ time ./gfortvectests
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:   0.188879877  0.377619117  -1.67841911E-02
 First record Xt:  0.10071  0.377619147  -1.67841911E-02
 Second record X:  -8.14126506E-02  -0.421755224  -0.199057430
 Second record Xt: -8.14126655E-02  -0.421755224  -0.199057430

real    0m2.414s
user    0m2.406s
sys     0m0.005s

$ time ./flangvectests
 Transpose benchmark completed in    7.630980
 SIMD benchmark completed in   0.6455200
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542  1.568364  0.1006735
 First record Xt:  0.5867541  1.568363  0.1006735
 Second record X:  0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt: 0.2894785  -0.1510675  -9.3419187E-02

real    0m0.839s
user    0m0.832s
sys     0m0.006s

$ time ./gfortvectests
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X:   -0.284217566  2.13768221E-02  -0.475293010
 First record Xt:  -0.284217596  2.13767942E-02  -0.475293040
 Second record X:  1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt: 1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real    0m2.344s
user    0m2.338s
sys     0m0.003s

$ time ./flangvectests
 Transpose benchmark completed in    7.881181
 SIMD benchmark completed in   0.6132510
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542  1.568364  0.1006735
 First record Xt:  0.5867541  1.568363  0.1006735
 Second record X:  0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt: 0.2894785  -0.1510675  -9.3419187E-02

real    0m0.861s
user    0m0.853s
sys     0m0.006s

It's also probably not quite right to call it "error", because it's comparing
the values from the scalar and vectorized versions. Although it is unsettling
if the differences are high; ideally there should be an exact match.

Back to Julia, using mpfr (set to 252 bits of precision), and rounding to
single precision for an exactly rounded answer:

X32gfort # calculated from gfortran
X32flang # calculated from flang
Xbf      # mpfr, 252-bit precision ("BigFloat" in Julia)

julia> Xbf32 = Float32.(Xbf) # correctly rounded result

julia> function ULP(x, correct) # calculates ULP error
           x == correct && return 0
           if x < correct
               error = 1
               while nextfloat(x, error) != correct
                   error += 1
               end
           else
               error = 1
               while prevfloat(x, error) != correct
                   error += 1
               end
           end
           error
       end
ULP (generic function with 1 method)

julia> ULP.(X32gfort, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 7  1  1  8  3  2  1  1  1  27  4  1  4  6  0  0  2  0  2  4  0  7  1  1  3  8  4  2  2  …
 [3×1024 matrix of per-element ULP errors; remaining columns truncated in the
 original REPL output]

julia> mean(ans)
1.9462890625

julia> ULP.(X32flang, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 4  1  0  3  0  0  0  1  1  5  2  1  1  6  3  0  1  0  0  1  1  21  0  1  2  8  2  3  0  0  …
 [3×1024 matrix of per-element ULP errors; remaining columns truncated in the
 original REPL output]

julia> mean(ans)
1.3388671875

So in that case, gfortran's version had about 1.95 ULP error on average, and
Flang about 1.34 ULP error.
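For reference, here is a Python analogue of the Julia ULP() helper above (my
sketch, not from the thread; it reinterprets float32 bit patterns as ordered
integers instead of looping nextfloat/prevfloat):

```python
import struct

def ulp_distance(x, correct):
    """Count of representable float32 values between x and correct."""
    def key(f):
        # Round f to float32 and read back its bits as a signed int32.
        (i,) = struct.unpack('<i', struct.pack('<f', f))
        # Map IEEE sign-magnitude bits to a monotonically ordered integer,
        # so adjacent floats differ by exactly 1 (treats -0.0 as +0.0).
        return i if i >= 0 else -0x80000000 - i
    return abs(key(x) - key(correct))

one = 1.0
(one_plus_ulp,) = struct.unpack('<f', struct.pack('<i', 0x3F800001))
print(ulp_distance(one, one))           # 0
print(ulp_distance(one_plus_ulp, one))  # 1
```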
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Jerry DeLisle changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jvdelisle at gcc dot gnu.org

--- Comment #13 from Jerry DeLisle ---
I noticed the maximum relative error in your benchmarks is significantly
larger in the flang test vs. the gfortran test. Is this a factor that
matters?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #12 from Chris Elrod ---
Created attachment 45363
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.

Okay, thank you. I attached a Fortran program you can run to benchmark the
code. It randomly generates valid inputs, and then times running the code
10^5 times. Finally, it reports the average time in microseconds. The SIMD
times are the vectorized version, and the transposed times are the
non-vectorized versions. In both cases, Flang produces much faster code. The
results seem in line with what I got benchmarking shared libraries from
Julia. I linked rt for access to the high-resolution clock.

$ gfortran -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o gfortvectests
$ time ./gfortvectests
 Transpose benchmark completed in   22.7799759
 SIMD benchmark completed in   1.34003162
 All are equal: F
 All are approximately equal: F
 Maximum relative error   8.27204276E-05
 First record X:   1.02466011  -0.689792156  -0.404027045
 First record Xt:  1.02465975  -0.689791918  -0.404026985
 Second record X:  -0.546353579  3.37308086E-03  1.15257287
 Second record Xt: -0.546353400  3.37312138E-03  1.15257275

real    0m2.418s
user    0m2.412s
sys     0m0.003s

$ flang -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o flangvectests
$ time ./flangvectests
 Transpose benchmark completed in    7.232568
 SIMD benchmark completed in   0.6596010
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542  1.568364  0.1006735
 First record Xt:  0.5867541  1.568363  0.1006735
 Second record X:  0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt: 0.2894785  -0.1510675  -9.3419187E-02

real    0m0.801s
user    0m0.794s
sys     0m0.005s
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig changed:

           What            |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
   Last reconfirmed|                            |2019-01-06
          Component|fortran                     |tree-optimization
             Blocks|36854                       |53947
         Resolution|WONTFIX                     |---
            Summary|_gfortran_internal_pack@PLT |Vectorized code slow vs.
                   |prevents vectorization      |flang
     Ever confirmed|0                           |1

--- Comment #11 from Thomas Koenig ---
OK, so I think it makes sense to reopen this bug as a missed optimization for
the vectorizer (reopened because it would be a shame to lose all the info you
already provided).

It seems like gcc could be much better, also possibly with some more help
from the gfortran front end. A factor of two is not to be ignored.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36854
[Bug 36854] [meta-bug] fortran front-end optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations