[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-07-17 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #60 from CVS Commits  ---
The master branch has been updated by H.J. Lu :

https://gcc.gnu.org/g:737355072af4cd0c24a4a8967e1485c1f3a80bfe

commit r11-2200-g737355072af4cd0c24a4a8967e1485c1f3a80bfe
Author: H.J. Lu 
Date:   Mon Jul 13 09:07:00 2020 -0700

x86: Rename VF_AVX512VL_VF1_128_256 to VF1_AVX512ER_128_256

Since ix86_emit_swsqrtsf shouldn't be called with DF vector modes, rename
VF_AVX512VL_VF1_128_256 to VF1_AVX512ER_128_256 and drop DF vector modes.

gcc/

PR target/96186
PR target/88713
* config/i386/sse.md (VF_AVX512VL_VF1_128_256): Renamed to ...
(VF1_AVX512ER_128_256): This.  Drop DF vector modes.
(rsqrt<mode>2): Replace VF_AVX512VL_VF1_128_256 with
VF1_AVX512ER_128_256.

gcc/testsuite/

PR target/96186
PR target/88713
* gcc.target/i386/pr88713-3.c: New test.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-07-09 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #59 from CVS Commits  ---
The master branch has been updated by H.J. Lu :

https://gcc.gnu.org/g:fab263ab0fc10ea08409b80afa7e8569438b8d28

commit r11-1970-gfab263ab0fc10ea08409b80afa7e8569438b8d28
Author: H.J. Lu 
Date:   Wed Jan 23 06:33:58 2019 -0800

x86: Enable FMA in rsqrt<mode>2 expander

Enable FMA in rsqrt<mode>2 expander and fold rsqrtv16sf2 expander into
rsqrt<mode>2 expander, which expands to UNSPEC_RSQRT28 for TARGET_AVX512ER.
Although it doesn't show performance change in our workloads, FMA can
improve other workloads.

gcc/

PR target/88713
* config/i386/i386-expand.c (ix86_emit_swsqrtsf): Enable FMA.
* config/i386/sse.md (VF_AVX512VL_VF1_128_256): New.
(rsqrt<mode>2): Replace VF1_128_256 with VF_AVX512VL_VF1_128_256.
(rsqrtv16sf2): Removed.

gcc/testsuite/

PR target/88713
* gcc.target/i386/pr88713-1.c: New test.
* gcc.target/i386/pr88713-2.c: Likewise.
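
The new tests themselves are not quoted in this thread. As a rough sketch of
what such a test checks (my guess at the shape, not the actual testsuite
file), it would be a compile-only dejagnu test scanning for the expected
instructions:

/* { dg-do compile } */
/* { dg-options "-Ofast -mrecip -mfma" } */

void
rsqrt (float *__restrict r, float *__restrict a)
{
  for (int i = 0; i < 16; i++)
    r[i] = 1.0f / __builtin_sqrtf (a[i]);
}

/* { dg-final { scan-assembler "vrsqrtps" } } */
/* { dg-final { scan-assembler "vfmadd" } } */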

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-06-28 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #58 from H.J. Lu  ---
(In reply to Thomas Koenig from comment #57)
> (In reply to H.J. Lu from comment #56)
> > (In reply to Thomas Koenig from comment #55)
> > > (In reply to H.J. Lu from comment #45)
> > > > Created attachment 45510 [details]
> > > > An updated patch
> > > 
> > > HJ, do you plan on committing these?
> > 
> > We are collecting performance data before I submit it.
> 
> Do you have the performance data by now?

A patch is posted at

https://gcc.gnu.org/pipermail/gcc-patches/2020-June/549047.html

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #57 from Thomas Koenig  ---
(In reply to H.J. Lu from comment #56)
> (In reply to Thomas Koenig from comment #55)
> > (In reply to H.J. Lu from comment #45)
> > > Created attachment 45510 [details]
> > > An updated patch
> > 
> > HJ, do you plan on committing these?
> 
> We are collecting performance data before I submit it.

Do you have the performance data by now?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-09-19 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #56 from H.J. Lu  ---
(In reply to Thomas Koenig from comment #55)
> (In reply to H.J. Lu from comment #45)
> > Created attachment 45510 [details]
> > An updated patch
> 
> HJ, do you plan on committing these?

We are collecting performance data before I submit it.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-09-19 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #55 from Thomas Koenig  ---
(In reply to H.J. Lu from comment #45)
> Created attachment 45510 [details]
> An updated patch

HJ, do you plan on committing these?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-02-12 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #54 from Chris Elrod  ---
I commented elsewhere, but I built trunk a few days ago with H.J. Lu's patches
(attached here) and Thomas Koenig's inlining patches.
With these patches, g++ and all versions of the Fortran code produced excellent
asm, and the code performed excellently in benchmarks.

Once those are merged, the problems reported here will be solved.

I saw that Thomas Koenig's packing changes will wait for GCC 10.
What about H.J. Lu's fixes to rsqrt and allowing FMA use in those sections?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #53 from rguenther at suse dot de  ---
On Thu, 24 Jan 2019, glisse at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #52 from Marc Glisse  ---
> (In reply to Thomas Koenig from comment #49)
> > Argh.  Sacrificing performance for the sake of bugware...
> 
> But note that in this PR (specifically for avx512 vectors on this cpu), the OP
> says that the recip version is slower than calling directly the right insn (it
> wasn't clear if that was for inverse or for sqrt).

Probably depends on the microarchitecture, yes.  But I'd fully
expect the two-NR-step variant to be slower for a sensible
HW implementation (even more so if we need to fend off the
exceptional cases).

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #52 from Marc Glisse  ---
(In reply to Thomas Koenig from comment #49)
> Argh.  Sacrificing performance for the sake of bugware...

But note that in this PR (specifically for avx512 vectors on this cpu), the OP
says that the recip version is slower than calling directly the right insn (it
wasn't clear if that was for inverse or for sqrt).

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #51 from rguenther at suse dot de  ---
On Thu, 24 Jan 2019, tkoenig at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #49 from Thomas Koenig  ---
> (In reply to Uroš Bizjak from comment #48)
> > (In reply to rguent...@suse.de from comment #47)
> > > >But why don't we generate sqrtps for vector sqrtf?
> > > 
> > > That's the default for -mrecip.  Back in time we benchmarked it and scalar
> > > recip miscompares something.
> > 
> > It was polyhedron benchmark, in one benchmark, the index was calculated from
> > square root, and that was too sensitive for 2 ULP difference.
> 
> Argh.  Sacrificing performance for the sake of bugware...

Maybe use of FMA can recover 1 ULP and the benchmark ;)

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #50 from Uroš Bizjak  ---
(In reply to Thomas Koenig from comment #49)
> (In reply to Uroš Bizjak from comment #48)
> > (In reply to rguent...@suse.de from comment #47)
> > > >But why don't we generate sqrtps for vector sqrtf?
> > > 
> > > That's the default for -mrecip.  Back in time we benchmarked it and scalar
> > > recip miscompares something.
> > 
> > It was polyhedron benchmark, in one benchmark, the index was calculated from
> > square root, and that was too sensitive for 2 ULP difference.
> 
> Argh.  Sacrificing performance for the sake of bugware...

The details are in [1] and all the drama is documented in PR32352.

[1] https://gcc.gnu.org/ml/gcc-patches/2007-06/msg01044.html

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #49 from Thomas Koenig  ---
(In reply to Uroš Bizjak from comment #48)
> (In reply to rguent...@suse.de from comment #47)
> > >But why don't we generate sqrtps for vector sqrtf?
> > 
> > That's the default for -mrecip.  Back in time we benchmarked it and scalar
> > recip miscompares something.
> 
> It was polyhedron benchmark, in one benchmark, the index was calculated from
> square root, and that was too sensitive for 2 ULP difference.

Argh.  Sacrificing performance for the sake of bugware...

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #48 from Uroš Bizjak  ---
(In reply to rguent...@suse.de from comment #47)
> >But why don't we generate sqrtps for vector sqrtf?
> 
> That's the default for -mrecip.  Back in time we benchmarked it and scalar
> recip miscompares something.

It was polyhedron benchmark, in one benchmark, the index was calculated from
square root, and that was too sensitive for 2 ULP difference.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #47 from rguenther at suse dot de  ---
On January 23, 2019 5:13:12 PM GMT+01:00, "hjl.tools at gmail dot com"
 wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
>
>--- Comment #46 from H.J. Lu  ---
>We generate sqrtss for scalar sqrtf:
>
>[hjl@gnu-skx-1 pr88713]$ cat s.i
>extern float sqrtf(float x);
>
>float
>rsqrt(float r)
>{
>  return sqrtf (r);
>}
>[hjl@gnu-skx-1 pr88713]$ gcc -Ofast -S s.i
>[hjl@gnu-skx-1 pr88713]$ cat s.s
>.file   "s.i"
>.text
>.p2align 4,,15
>.globl  rsqrt
>.type   rsqrt, @function
>rsqrt:
>.LFB0:
>.cfi_startproc
>sqrtss  %xmm0, %xmm0
>ret
>.cfi_endproc
>.LFE0:
>.size   rsqrt, .-rsqrt
>.ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
>.section    .note.GNU-stack,"",@progbits
>[hjl@gnu-skx-1 pr88713]$ 
>
>But why don't we generate sqrtps for vector sqrtf?

That's the default for -mrecip.  Back in time we benchmarked it and scalar
recip miscompares something.

>
>[hjl@gnu-skx-1 pr88713]$ cat y.i
>extern float sqrtf(float x);
>
>void
>rsqrt(float* restrict r, float* restrict a){
>for (int i = 0; i < 16; i++){
>r[i] = sqrtf(a[i]);
>}
>}
>[hjl@gnu-skx-1 pr88713]$ gcc -S -Ofast y.i 
>[hjl@gnu-skx-1 pr88713]$ cat y.s
>.file   "y.i"
>.text
>.p2align 4,,15
>.globl  rsqrt
>.type   rsqrt, @function
>rsqrt:
>.LFB0:
>.cfi_startproc
>movups  (%rsi), %xmm1
>pxor    %xmm2, %xmm2
>movaps  .LC0(%rip), %xmm4
>movaps  %xmm2, %xmm3
>rsqrtps %xmm1, %xmm0
>cmpneqps    %xmm1, %xmm3
>movaps  %xmm1, %xmm5
>andps   %xmm3, %xmm0
>movaps  .LC1(%rip), %xmm3
>mulps   %xmm0, %xmm5
>mulps   %xmm5, %xmm0
>mulps   %xmm3, %xmm5
>movaps  %xmm0, %xmm1
>movups  16(%rsi), %xmm0
>addps   %xmm4, %xmm1
>mulps   %xmm5, %xmm1
>movaps  %xmm2, %xmm5
>cmpneqps    %xmm0, %xmm5
>movups  %xmm1, (%rdi)
>rsqrtps %xmm0, %xmm1
>andps   %xmm5, %xmm1
>movaps  %xmm2, %xmm5
>mulps   %xmm1, %xmm0
>mulps   %xmm0, %xmm1
>mulps   %xmm3, %xmm0
>addps   %xmm4, %xmm1
>mulps   %xmm0, %xmm1
>movups  32(%rsi), %xmm0
>cmpneqps    %xmm0, %xmm5
>movups  %xmm1, 16(%rdi)
>rsqrtps %xmm0, %xmm1
>andps   %xmm5, %xmm1
>mulps   %xmm1, %xmm0
>mulps   %xmm0, %xmm1
>mulps   %xmm3, %xmm0
>addps   %xmm4, %xmm1
>mulps   %xmm0, %xmm1
>movups  %xmm1, 32(%rdi)
>movups  48(%rsi), %xmm1
>rsqrtps %xmm1, %xmm0
>cmpneqps    %xmm1, %xmm2
>andps   %xmm2, %xmm0
>mulps   %xmm0, %xmm1
>mulps   %xmm1, %xmm0
>mulps   %xmm3, %xmm1
>addps   %xmm4, %xmm0
>mulps   %xmm1, %xmm0
>movups  %xmm0, 48(%rdi)
>ret
>.cfi_endproc
>.LFE0:
>.size   rsqrt, .-rsqrt
>.section    .rodata.cst16,"aM",@progbits,16
>.align 16
>.LC0:
>.long   3225419776
>.long   3225419776
>.long   3225419776
>.long   3225419776
>.align 16
>.LC1:
>.long   3204448256
>.long   3204448256
>.long   3204448256
>.long   3204448256
>.ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
>.section    .note.GNU-stack,"",@progbits
>[hjl@gnu-skx-1 pr88713]$

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #46 from H.J. Lu  ---
We generate sqrtss for scalar sqrtf:

[hjl@gnu-skx-1 pr88713]$ cat s.i
extern float sqrtf(float x);

float
rsqrt(float r)
{
  return sqrtf (r);
}
[hjl@gnu-skx-1 pr88713]$ gcc -Ofast -S s.i
[hjl@gnu-skx-1 pr88713]$ cat s.s
.file   "s.i"
.text
.p2align 4,,15
.globl  rsqrt
.type   rsqrt, @function
rsqrt:
.LFB0:
.cfi_startproc
sqrtss  %xmm0, %xmm0
ret
.cfi_endproc
.LFE0:
.size   rsqrt, .-rsqrt
.ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
.section    .note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 pr88713]$ 

But why don't we generate sqrtps for vector sqrtf?


[hjl@gnu-skx-1 pr88713]$ cat y.i
extern float sqrtf(float x);

void
rsqrt(float* restrict r, float* restrict a){
for (int i = 0; i < 16; i++){
r[i] = sqrtf(a[i]);
}
}
[hjl@gnu-skx-1 pr88713]$ gcc -S -Ofast y.i 
[hjl@gnu-skx-1 pr88713]$ cat y.s
.file   "y.i"
.text
.p2align 4,,15
.globl  rsqrt
.type   rsqrt, @function
rsqrt:
.LFB0:
.cfi_startproc
movups  (%rsi), %xmm1
pxor    %xmm2, %xmm2
movaps  .LC0(%rip), %xmm4
movaps  %xmm2, %xmm3
rsqrtps %xmm1, %xmm0
cmpneqps    %xmm1, %xmm3
movaps  %xmm1, %xmm5
andps   %xmm3, %xmm0
movaps  .LC1(%rip), %xmm3
mulps   %xmm0, %xmm5
mulps   %xmm5, %xmm0
mulps   %xmm3, %xmm5
movaps  %xmm0, %xmm1
movups  16(%rsi), %xmm0
addps   %xmm4, %xmm1
mulps   %xmm5, %xmm1
movaps  %xmm2, %xmm5
cmpneqps    %xmm0, %xmm5
movups  %xmm1, (%rdi)
rsqrtps %xmm0, %xmm1
andps   %xmm5, %xmm1
movaps  %xmm2, %xmm5
mulps   %xmm1, %xmm0
mulps   %xmm0, %xmm1
mulps   %xmm3, %xmm0
addps   %xmm4, %xmm1
mulps   %xmm0, %xmm1
movups  32(%rsi), %xmm0
cmpneqps    %xmm0, %xmm5
movups  %xmm1, 16(%rdi)
rsqrtps %xmm0, %xmm1
andps   %xmm5, %xmm1
mulps   %xmm1, %xmm0
mulps   %xmm0, %xmm1
mulps   %xmm3, %xmm0
addps   %xmm4, %xmm1
mulps   %xmm0, %xmm1
movups  %xmm1, 32(%rdi)
movups  48(%rsi), %xmm1
rsqrtps %xmm1, %xmm0
cmpneqps    %xmm1, %xmm2
andps   %xmm2, %xmm0
mulps   %xmm0, %xmm1
mulps   %xmm1, %xmm0
mulps   %xmm3, %xmm1
addps   %xmm4, %xmm0
mulps   %xmm1, %xmm0
movups  %xmm0, 48(%rdi)
ret
.cfi_endproc
.LFE0:
.size   rsqrt, .-rsqrt
.section    .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long   3225419776
.long   3225419776
.long   3225419776
.long   3225419776
.align 16
.LC1:
.long   3204448256
.long   3204448256
.long   3204448256
.long   3204448256
.ident  "GCC: (GNU) 8.2.1 20190109 (Red Hat 8.2.1-7)"
.section    .note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 pr88713]$

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

H.J. Lu  changed:

           What    |Removed |Added
--------------------------------------------------------
 Attachment #45509 |0       |1
       is obsolete |        |

--- Comment #45 from H.J. Lu  ---
Created attachment 45510
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45510&action=edit
An updated patch

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

H.J. Lu  changed:

           What    |Removed |Added
--------------------------------------------------------
 Attachment #45508 |0       |1
       is obsolete |        |

--- Comment #44 from H.J. Lu  ---
Created attachment 45509
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45509&action=edit
A combined patch

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

H.J. Lu  changed:

           What    |Removed |Added
--------------------------------------------------------
 Attachment #45507 |0       |1
       is obsolete |        |

--- Comment #43 from H.J. Lu  ---
Created attachment 45508
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45508&action=edit
A patch

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #42 from H.J. Lu  ---
Created attachment 45507
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45507&action=edit
A patch

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #41 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #40)
> (In reply to rguent...@suse.de from comment #39)
> > > > 
> > > > Yes.  The lack of an expander for the rsqrt operation is probably
> > > > more severe though (causing sqrt + approx recip to appear).
> > > > 
> > > 
> > > Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?
> > 
> > I think we can but we lack an expander for this.  IIRC for the following
> > existing expander the RTL is ignored and thus we could simply
> > replace the TARGET_AVX512ER check with TARGET_AVX512F?
> > 
> > (define_expand "rsqrtv16sf2"
> >   [(set (match_operand:V16SF 0 "register_operand")
> >         (unspec:V16SF
> >           [(match_operand:V16SF 1 "vector_operand")]
> >           UNSPEC_RSQRT28))]
> >   "TARGET_SSE_MATH && TARGET_AVX512ER"
> > {
> >   ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
> >   DONE;
> > })
> 
> Like this?
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 3af4adc63dd..c9b4750ccc4 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1969,21 +1969,11 @@
>      (set_attr "mode" "<MODE>")])
>  
>  (define_expand "rsqrt<mode>2"
> -  [(set (match_operand:VF1_128_256 0 "register_operand")
> -        (unspec:VF1_128_256
> -          [(match_operand:VF1_128_256 1 "vector_operand")] UNSPEC_RSQRT))]
> +  [(set (match_operand:VF_AVX512VL 0 "register_operand")
> +        (unspec:VF_AVX512VL
> +          [(match_operand:VF_AVX512VL 1 "vector_operand")]
> +        UNSPEC_RSQRT))]
>    "TARGET_SSE_MATH"
> -{
> -  ix86_emit_swsqrtsf (operands[0], operands[1], <MODE>mode, true);
> -  DONE;
> -})
> -
> -(define_expand "rsqrtv16sf2"
> -  [(set (match_operand:V16SF 0 "register_operand")
> -        (unspec:V16SF
> -          [(match_operand:V16SF 1 "vector_operand")]
> -        UNSPEC_RSQRT28))]
> -  "TARGET_SSE_MATH && TARGET_AVX512ER"
>  {
>    ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
>    DONE;

<MODE>mode instead of V16SFmode.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #40 from H.J. Lu  ---
(In reply to rguent...@suse.de from comment #39)
> > > 
> > > Yes.  The lack of an expander for the rsqrt operation is probably
> > > more severe though (causing sqrt + approx recip to appear).
> > > 
> > 
> > Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?
> 
> I think we can but we lack an expander for this.  IIRC for the following
> existing expander the RTL is ignored and thus we could simply
> replace the TARGET_AVX512ER check with TARGET_AVX512F?
> 
> (define_expand "rsqrtv16sf2"
>   [(set (match_operand:V16SF 0 "register_operand")
>         (unspec:V16SF
>           [(match_operand:V16SF 1 "vector_operand")]
>           UNSPEC_RSQRT28))]
>   "TARGET_SSE_MATH && TARGET_AVX512ER"
> {
>   ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
>   DONE;
> })

Like this?

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 3af4adc63dd..c9b4750ccc4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1969,21 +1969,11 @@
     (set_attr "mode" "<MODE>")])
 
 (define_expand "rsqrt<mode>2"
-  [(set (match_operand:VF1_128_256 0 "register_operand")
-        (unspec:VF1_128_256
-          [(match_operand:VF1_128_256 1 "vector_operand")] UNSPEC_RSQRT))]
+  [(set (match_operand:VF_AVX512VL 0 "register_operand")
+        (unspec:VF_AVX512VL
+          [(match_operand:VF_AVX512VL 1 "vector_operand")]
+        UNSPEC_RSQRT))]
   "TARGET_SSE_MATH"
-{
-  ix86_emit_swsqrtsf (operands[0], operands[1], <MODE>mode, true);
-  DONE;
-})
-
-(define_expand "rsqrtv16sf2"
-  [(set (match_operand:V16SF 0 "register_operand")
-        (unspec:V16SF
-          [(match_operand:V16SF 1 "vector_operand")]
-        UNSPEC_RSQRT28))]
-  "TARGET_SSE_MATH && TARGET_AVX512ER"
 {
   ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
   DONE;

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #39 from rguenther at suse dot de  ---
On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #38 from H.J. Lu  ---
> (In reply to rguent...@suse.de from comment #37)
> > On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> > > 
> > > --- Comment #36 from H.J. Lu  ---
> > > (In reply to Richard Biener from comment #34)
> > > > GCC definitely fails to see the FMA use as opportunity in
> > > > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> > > > expander w/o avx512er where we could still use the NR sequence
> > > > with the other instruction.  HJ?
> > > 
> > > Like this?
> > 
> > Yes.  The lack of an expander for the rsqrt operation is probably
> > more severe though (causing sqrt + approx recip to appear).
> > 
> 
> Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?

I think we can but we lack an expander for this.  IIRC for the following
existing expander the RTL is ignored and thus we could simply
replace the TARGET_AVX512ER check with TARGET_AVX512F?

(define_expand "rsqrtv16sf2"
  [(set (match_operand:V16SF 0 "register_operand")
        (unspec:V16SF
          [(match_operand:V16SF 1 "vector_operand")]
          UNSPEC_RSQRT28))]
  "TARGET_SSE_MATH && TARGET_AVX512ER"
{
  ix86_emit_swsqrtsf (operands[0], operands[1], V16SFmode, true);
  DONE;
})

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #38 from H.J. Lu  ---
(In reply to rguent...@suse.de from comment #37)
> On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> > 
> > --- Comment #36 from H.J. Lu  ---
> > (In reply to Richard Biener from comment #34)
> > > GCC definitely fails to see the FMA use as opportunity in
> > > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> > > expander w/o avx512er where we could still use the NR sequence
> > > with the other instruction.  HJ?
> > 
> > Like this?
> 
> Yes.  The lack of an expander for the rsqrt operation is probably
> more severe though (causing sqrt + approx recip to appear).
> 

Can we use UNSPEC_RSQRT14 here if UNSPEC_RSQRT28 isn't available?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #37 from rguenther at suse dot de  ---
On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #36 from H.J. Lu  ---
> (In reply to Richard Biener from comment #34)
> > GCC definitely fails to see the FMA use as opportunity in
> > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> > expander w/o avx512er where we could still use the NR sequence
> > with the other instruction.  HJ?
> 
> Like this?

Yes.  The lack of an expander for the rsqrt operation is probably
more severe though (causing sqrt + approx recip to appear).

> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index e0d7c74fcec..0bbe3772ab7 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -44855,14 +44855,22 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, machine_mode mode, bool recip)
>         }
>      }
> 
> +  mthree = force_reg (mode, mthree);
> +
>    /* e0 = x0 * a */
>    emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
> -  /* e1 = e0 * x0 */
> -  emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));
> 
> -  /* e2 = e1 - 3. */
> -  mthree = force_reg (mode, mthree);
> -  emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
> +  if (TARGET_FMA || TARGET_AVX512F)
> +    emit_insn (gen_rtx_SET (e2,
> +                            gen_rtx_FMA (mode, e0, x0, mthree)));
> +  else
> +    {
> +      /* e1 = e0 * x0 */
> +      emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));
> +
> +      /* e2 = e1 - 3. */
> +      emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
> +    }
> 
>    mhalf = force_reg (mode, mhalf);
>    if (recip)

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #36 from H.J. Lu  ---
(In reply to Richard Biener from comment #34)
> GCC definitely fails to see the FMA use as opportunity in
> ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
> expander w/o avx512er where we could still use the NR sequence
> with the other instruction.  HJ?

Like this?

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e0d7c74fcec..0bbe3772ab7 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -44855,14 +44855,22 @@ void ix86_emit_swsqrtsf (rtx res, rtx a, machine_mode mode, bool recip)
        }
     }
 
+  mthree = force_reg (mode, mthree);
+
   /* e0 = x0 * a */
   emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
-  /* e1 = e0 * x0 */
-  emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));
 
-  /* e2 = e1 - 3. */
-  mthree = force_reg (mode, mthree);
-  emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
+  if (TARGET_FMA || TARGET_AVX512F)
+    emit_insn (gen_rtx_SET (e2,
+                            gen_rtx_FMA (mode, e0, x0, mthree)));
+  else
+    {
+      /* e1 = e0 * x0 */
+      emit_insn (gen_rtx_SET (e1, gen_rtx_MULT (mode, e0, x0)));
+
+      /* e2 = e1 - 3. */
+      emit_insn (gen_rtx_SET (e2, gen_rtx_PLUS (mode, e1, mthree)));
+    }
 
   mhalf = force_reg (mode, mhalf);
   if (recip)
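
As a scalar model (my own sketch, not code from the patch), the sequence
ix86_emit_swsqrtsf emits after this change computes the following:

#include <math.h>

/* Scalar model of the Newton-Raphson sequence, with the patched FMA form
   of the e2 step.  Here x0 stands for the hardware rsqrt estimate and
   fmaf() for the vfmadd instruction; -3.0f/-0.5f are the mthree/mhalf
   constants loaded above.  */
static float
nr_rsqrt (float a, int recip)
{
  float x0 = 1.0f / sqrtf (a);      /* placeholder for vrsqrt(14|28)ps */
  float e0 = x0 * a;                /* e0 = x0 * a */
  float e2 = fmaf (e0, x0, -3.0f);  /* one FMA instead of mul + add */
  /* rsqrt(a) = -0.5 * x0 * e2;  sqrt(a) = -0.5 * e0 * e2 */
  return -0.5f * (recip ? x0 : e0) * e2;
}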

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #35 from Chris Elrod  ---
> rsqrt:
> .LFB12:
> .cfi_startproc
> vrsqrt28ps  (%rsi), %zmm0
> vmovups %zmm0, (%rdi)
> vzeroupper
> ret
> 
> (huh?  isn't there a NR step missing?)
> 


I assume because vrsqrt28ps is much more accurate than vrsqrt14ps, it wasn't
considered necessary. Unfortunately, -march=skylake-avx512 does not have
-mavx512er, and therefore should use the less accurate vrsqrt14ps + NR step.

I think vrsqrt14pd/s are -mavx512f or -mavx512vl.

> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without 
> that I don't know how the machinery can guess how to use rsqrt (there are 
> probably ways).

Looking at the asm from only r[i] = sqrtf(a[i]):

vmovups (%rsi), %zmm1
vxorps  %xmm0, %xmm0, %xmm0
vcmpps  $4, %zmm1, %zmm0, %k1
vrsqrt14ps  %zmm1, %zmm0{%k1}{z}
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmulps  .LC1(%rip), %zmm1, %zmm1
vaddps  .LC0(%rip), %zmm0, %zmm0
vmulps  %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)

vs the asm from only r[i] = 1 /a[i]:

vmovups (%rsi), %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)

it looks like the expander is there for sqrt, and for inverse, and we're just
getting both one after the other. So it does look like I could benchmark which
one is slower than the regular instruction on my platform, if that would be
useful.
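
For reference, here is a scalar C model of what the two sequences above
compute (my sketch based on the quoted asm, not code from GCC):

#include <math.h>

/* Model of the two expansions: sqrtf() becomes a masked vrsqrt14ps plus
   one Newton-Raphson step, and 1/x becomes vrcp14ps plus one
   Newton-Raphson step.  */
static float
approx_sqrt (float a)
{
  float x0 = a != 0.0f ? 1.0f / sqrtf (a) : 0.0f; /* masked vrsqrt14ps */
  float t = a * x0;                     /* t ~= sqrt(a)          */
  return -0.5f * t * (t * x0 - 3.0f);   /* NR-corrected sqrt(a)  */
}

static float
approx_recip (float x)
{
  float r0 = 1.0f / x;              /* placeholder for vrcp14ps     */
  return 2.0f * r0 - x * r0 * r0;   /* NR step: r1 = r0 * (2 - x*r0) */
}

/* So r[i] = 1/sqrtf(a[i]) ends up as approx_recip (approx_sqrt (a[i])):
   two approximations back to back instead of one rsqrt + NR step.  */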

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Richard Biener  changed:

           What    |Removed |Added
--------------------------------------------------------
                CC |        |hjl.tools at gmail dot com

--- Comment #34 from Richard Biener  ---
So with -Ofast and -mprefer-vector-width=256 I get

   [local count: 63136019]:
  vect__4.2_3 = MEM[(float *)a_11(D)];
  vect__5.3_4 = RSQRT (vect__4.2_3);
  MEM[(float *)r_12(D)] = vect__5.3_4;
  vect__4.2_21 = MEM[(float *)a_11(D) + 32B];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D) + 32B] = vect__5.3_20;

while with -mprefer-vector-width=512 I need -mavx512er to trigger the
expander, then I also get

   [local count: 63136020]:
  vect__4.2_21 = MEM[(float *)a_11(D)];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D)] = vect__5.3_20;

and in that case

rsqrt:
.LFB12:
.cfi_startproc
vrsqrt28ps  (%rsi), %zmm0
vmovups %zmm0, (%rdi)
vzeroupper
ret

(huh?  isn't there a NR step missing?)

for -mprefer-vector-width=256 I get (irrespective of -mavx512er):

rsqrt:
.LFB12:
.cfi_startproc
vmovups (%rsi), %ymm1
vmovaps .LC1(%rip), %ymm3
vrsqrtps    %ymm1, %ymm2
vmovaps .LC0(%rip), %ymm4
vmovups 32(%rsi), %ymm0
vmulps  %ymm1, %ymm2, %ymm1
vmulps  %ymm2, %ymm1, %ymm1
vmulps  %ymm3, %ymm2, %ymm2
vaddps  %ymm4, %ymm1, %ymm1
vmulps  %ymm2, %ymm1, %ymm1
vmovups %ymm1, (%rdi)
vrsqrtps    %ymm0, %ymm1
vmulps  %ymm0, %ymm1, %ymm0
vmulps  %ymm1, %ymm0, %ymm0
vmulps  %ymm3, %ymm1, %ymm1
vaddps  %ymm4, %ymm0, %ymm0
vmulps  %ymm1, %ymm0, %ymm0
vmovups %ymm0, 32(%rdi)
vzeroupper

so the issue lies somewhere in the backend.

Of the "fast" you need -ffinite-math-only -fno-math-errno
-funsafe-math-optimizations.

GCC definitely fails to see the FMA use as opportunity in
ix86_emit_swsqrtsf, the a == 0 checking is because of the missing
expander w/o avx512er where we could still use the NR sequence
with the other instruction.  HJ?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #33 from Marc Glisse  ---
(In reply to Chris Elrod from comment #32)
> (In reply to Marc Glisse from comment #31)
> > What we need to understand is why gcc doesn't try to generate rsqrt

Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without
that I don't know how the machinery can guess how to use rsqrt (there are
probably ways).

> The approximate sqrt followed by the approximate reciprocal was
> slower on my computer than just vsqrt followed by div.

We can probably split that into the speed of sqrt vs its approximation and
inverse (div) vs its approximation. At least one of them seems to be a
pessimization on that platform.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #32 from Chris Elrod  ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc calculates the rsqrt directly
> 
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> different formula than rsqrt). vrcp14ps shows that it is computing an
> inverse later. What we need to understand is why gcc doesn't try to generate
> rsqrt (which would also have vrsqrt14ps, but a slightly different formula
> without the comparison with 0 and masking, and without needing an inversion
> afterwards).

Okay, I think I follow you. You're saying instead of doing this (from
rguenther), which we want (also without the comparison to 0 and masking, as you
note):

 /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

it is doing this, which also uses the rsqrt instruction:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

and then calculating an inverse approximation of that?

The approximate sqrt followed by the approximate reciprocal was
slower on my computer than just vsqrt followed by div.
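
For reference, the Newton step both formulas encode (a standard derivation,
restated here for clarity; nothing new from this PR): with $y \approx
1/\sqrt{a}$ the hardware estimate, one Newton-Raphson iteration on
$f(y) = y^{-2} - a$ gives

\[ y' \;=\; y - \frac{f(y)}{f'(y)} \;=\; \tfrac{1}{2}\,y\,(3 - a y^2)
      \;=\; -0.5\,y\,(a y^2 - 3), \]

and multiplying through by $a$ (since $\sqrt{a} = a/\sqrt{a}$) gives the sqrt
variant $-0.5\,a\,y\,(a y^2 - 3)$. The two differ only by that factor of $a$;
the extra vrcp14ps sequence in the gcc output is a second, independent
approximation used to invert the sqrt it just computed.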

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #31 from Marc Glisse  ---
(In reply to Chris Elrod from comment #30)
> gcc calculates the rsqrt directly

No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
different formula than rsqrt). vrcp14ps shows that it is computing an inverse
later. What we need to understand is why gcc doesn't try to generate rsqrt
(which would also have vrsqrt14ps, but a slightly different formula without the
comparison with 0 and masking, and without needing an inversion afterwards).

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #30 from Chris Elrod  ---
(In reply to Marc Glisse from comment #29)
> The main difference I can see is that clang computes rsqrt directly, while
> gcc first computes sqrt and then computes the inverse. Also gcc seems afraid
> of getting NaN for sqrt(0) so it masks out this value. ix86_emit_swsqrtsf in
> gcc/config/i386/i386.c seems like a good place to look at.

gcc calculates the rsqrt directly with -funsafe-math-optimizations and a couple
other flags (or just -ffast-math):

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #29 from Marc Glisse  ---
The main difference I can see is that clang computes rsqrt directly, while gcc
first computes sqrt and then computes the inverse. Also gcc seems afraid of
getting NaN for sqrt(0) so it masks out this value. ix86_emit_swsqrtsf in
gcc/config/i386/i386.c seems like a good place to look at.
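
(A note on the masking, as my reading of that code rather than anything
stated in this PR: the sqrt expansion multiplies the input a by its rsqrt
estimate, and for a = 0 the estimate is Inf, so 0 * Inf = NaN; masking the
zero lanes out of vrsqrt14ps makes sqrt(0) come out as 0. The direct rsqrt
formula returns Inf at 0 anyway, which is presumably why clang can skip the
compare-and-mask.)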

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #28 from Chris Elrod  ---
Created attachment 45501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast
-S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o
rsqrt.s

I attached a minimum working example, demonstrating the problem of excessive
code generation for reciprocal square root, in the file rsqrt.c.
You can compile with:

gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC
rsqrt.c -o rsqrt.s

clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC
rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt: https://godbolt.org/z/c7Z0En

For gcc:

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  .LC1(%rip), %zmm2, %zmm2
vaddps  .LC0(%rip), %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmulps  %zmm0, %zmm1, %zmm0
vaddps  %zmm1, %zmm1, %zmm1
vsubps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

for Clang:

vmovups (%rsi), %zmm0
vrsqrt14ps  %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm0
vfmadd213ps .LCPI0_0(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * zmm0) + mem
vmulps  .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

Clang looks like it is doing
 /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
*/

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5.
gcc is doing much more, and something fairly different.
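
(The attachment body is not quoted in this thread; for reading along, the
kernel is presumably equivalent to the following hypothetical
reconstruction, matching the asm above:)

extern float sqrtf(float x);

void
rsqrt(float* restrict r, float* restrict a)
{
  /* 16 floats = one zmm vector with -mprefer-vector-width=512 */
  for (int i = 0; i < 16; i++)
    r[i] = 1.0f / sqrtf(a[i]);
}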

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #27 from Chris Elrod  ---
g++ -mrecip=all -O3  -fno-signed-zeros -fassociative-math -freciprocal-math
-fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S
-march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o gppvectorization_test.s  vectorization_test.cpp

is not enough to get vrsqrt. I need -funsafe-math-optimizations for the
instruction to appear in the asm.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #26 from Chris Elrod  ---
> You can try enabling -mrecip to see RSQRT in .optimized - there's
> probably late 1/sqrt optimization on RTL.

No luck. The full commands I used:

gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
gfortvectorizationdump.s  vectorization_test.f90

g++ -mrecip -Ofast -fdump-tree-optimized -S -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
gppvectorization_test.s  vectorization_test.cpp

g++'s output was similar:

  vect_U33_60.31_372 = SQRT (vect_S33_59.30_371);
  vect_Ui33_61.32_374 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0
} / vect_U33_60.31_372;
  vect_U13_62.33_375 = vect_S13_47.24_359 * vect_Ui33_61.32_374;
  vect_U23_63.34_376 = vect_S23_53.27_365 * vect_Ui33_61.32_374;

and it has the same assembly as gfortran for the rsqrt:

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm6, %zmm2, %zmm2
vaddps  %zmm7, %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm10
vmulps  %zmm0, %zmm10, %zmm0
vmulps  %zmm0, %zmm10, %zmm0
vaddps  %zmm10, %zmm10, %zmm10
vsubps  %zmm0, %zmm10, %zmm10

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #25 from rguenther at suse dot de  ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #24 from Chris Elrod  ---
> The dump looks like this:
> 
>   vect__67.78_217 = SQRT (vect__213.77_225);
>   vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
> 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0
> } / vect__67.78_217;
>   vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248;
>   vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248;
>   vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248;
> 
> so the vrsqrt optimization happens later. g++ shows the same problems with
> weird code generation. However this:
> 
>  /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
> rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */
> 
> does not match this:
> 
> vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
> vmulps  %zmm1, %zmm2, %zmm0
> vmulps  %zmm2, %zmm0, %zmm1
> vmulps  %zmm6, %zmm0, %zmm0
> vaddps  %zmm7, %zmm1, %zmm1
> vmulps  %zmm0, %zmm1, %zmm1
> vrcp14ps    %zmm1, %zmm0
> vmulps  %zmm1, %zmm0, %zmm1
> vmulps  %zmm1, %zmm0, %zmm1
> vaddps  %zmm0, %zmm0, %zmm0
> vsubps  %zmm1, %zmm0, %zmm0
> 
> Recommendations on the next place to look for what's going on?

You can try enabling -mrecip to see RSQRT in .optimized - there's
probably late 1/sqrt optimization on RTL.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #24 from Chris Elrod  ---
The dump looks like this:

  vect__67.78_217 = SQRT (vect__213.77_225);
  vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0
} / vect__67.78_217;
  vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248;
  vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248;
  vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248;

so the vrsqrt optimization happens later. g++ shows the same problems with
weird code generation. However this:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

does not match this:

vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm2, %zmm0, %zmm1
vmulps  %zmm6, %zmm0, %zmm0
vaddps  %zmm7, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0

Recommendations on the next place to look for what's going on?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #23 from rguenther at suse dot de  ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #22 from Chris Elrod  ---
> Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
> microseconds. So that is an improvement, but accounts for only a small part of
> the difference with the LLVM-compilers.
> 
> -O3 -fno-math-errno
> 
> was about 3.5 microseconds, so -funsafe-math-optimizations still results in a
> regression in this code.
> 
> 3.5 microseconds is roughly as fast as you can get with vsqrt and div.
> 
> My best guess now is that gcc does a lot more to improve the accuracy of 
> vsqrt.
> If I understand correctly, these are all the involved instructions:
> 
> vmovaps .LC2(%rip), %zmm7
> vmovaps .LC3(%rip), %zmm6
> # for loop begins
> vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
> vmulps  %zmm1, %zmm2, %zmm0
> vmulps  %zmm2, %zmm0, %zmm1
> vmulps  %zmm6, %zmm0, %zmm0
> vaddps  %zmm7, %zmm1, %zmm1
> vmulps  %zmm0, %zmm1, %zmm1
> vrcp14ps    %zmm1, %zmm0
> vmulps  %zmm1, %zmm0, %zmm1
> vmulps  %zmm1, %zmm0, %zmm1
> vaddps  %zmm0, %zmm0, %zmm0
> vsubps  %zmm1, %zmm0, %zmm0
> vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2
> 
> If I understand this correctly:
> 
> zmm2 =(approx) 1 / sqrt(zmm1)
> zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1)
> zmm1 = zmm0 * zmm2 = (approx) 1
> zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1)
> zmm1 = zmm7 * zmm1 = (approx) constant7
> zmm1 = zmm0 * zmm1 = (approx) constant6 * constant7 * sqrt(zmm1)
> zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 *
> constant7)
> zmm1 = zmm1 * zmm0 = (approx) 1
> zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
> zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 1 / (constant6 * constant7)
> zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7)
> 
> which implies that constant6 * constant7 = approximately -1?

GCC implements

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

which looks similar to what LLVM does.  You can look at the
-fdump-tree-optimized dump to see if there's anything fishy.

> 
> LLVM seems to do a much simpler / briefer update of the output of vrsqrt.
> 
> When I implemented a vrsqrt intrinsic in a Julia library, I just looked at
> Wikipedia and did (roughly):
> 
> constant1 = -0.5
> constant2 = 1.5
> 
> zmm2 = (approx) 1 / sqrt(zmm1)
> zmm3 = constant * zmm1
> zmm1 = zmm2 * zmm2
> zmm3 = zmm3 * zmm1 + constant2
> zmm2 = zmm2 * zmm3
> 
> 
> I am not a numerical analyst, so I can't comment on relative validities or
> accuracies of these approaches.
> I also don't know what LLVM 7+ does. LLVM 6 doesn't use vrsqrt.
> 
> I would be interested in reading explanations or discussions, if any are
> available.
> 
>

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #22 from Chris Elrod  ---
Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
microseconds. So that is an improvement, but accounts for only a small part of
the difference with the LLVM-compilers.

-O3 -fno-math-errno

was about 3.5 microseconds, so -funsafe-math-optimizations still results in a
regression in this code.

3.5 microseconds is roughly as fast as you can get with vsqrt and div.

My best guess now is that gcc does a lot more to improve the accuracy of vsqrt.
If I understand correctly, these are all the involved instructions:

vmovaps .LC2(%rip), %zmm7
vmovaps .LC3(%rip), %zmm6
# for loop begins
vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm2, %zmm0, %zmm1
vmulps  %zmm6, %zmm0, %zmm0
vaddps  %zmm7, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0
vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2

If I understand this correctly:

zmm2 =(approx) 1 / sqrt(zmm1)
zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1)
zmm1 = zmm0 * zmm2 = (approx) 1
zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1)
zmm1 = zmm7 * zmm1 = (approx) constant7
zmm1 = zmm0 * zmm1 = (approx) constant6 * constant7 * sqrt(zmm1)
zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 *
constant7)
zmm1 = zmm1 * zmm0 = (approx) 1
zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7)

which implies that constant6 * constant7 = approximately -1?


LLVM seems to do a much simpler / briefer update of the output of vrsqrt.

When I implemented a vrsqrt intrinsic in a Julia library, I just looked at
Wikipedia and did (roughly):

constant1 = -0.5
constant2 = 1.5

zmm2 = (approx) 1 / sqrt(zmm1)
zmm3 = constant * zmm1
zmm1 = zmm2 * zmm2
zmm3 = zmm3 * zmm1 + constant2
zmm2 = zmm2 * zmm3


I am not a numerical analyst, so I can't comment on relative validities or
accuracies of these approaches.
I also don't know what LLVM 7+ does. LLVM 6 doesn't use vrsqrt.

I would be interested in reading explanations or discussions, if any are
available.
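
For what it's worth, the update above and GCC's comment formula are the same
Newton step algebraically (my restatement, not anything from the PR):

\[ y_1 \;=\; y_0\,(1.5 - 0.5\,a\,y_0^2) \;=\; 0.5\,y_0\,(3 - a\,y_0^2)
      \;=\; -0.5\,y_0\,(a\,y_0^2 - 3), \]

so any accuracy difference between the implementations comes from how the
step is scheduled and fused (FMA vs. separate multiply/add, masking, the
extra reciprocal), not from the mathematics of the iteration itself.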

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #21 from rguenther at suse dot de  ---
On Tue, 22 Jan 2019, elrodc at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
> 
> --- Comment #19 from Chris Elrod  ---
> To add a little more:
> I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
> Julia. Without adding a Newton step, the answers are wrong beyond just a 
> couple
> significant digits.
> With the Newton step, the answers are correct.
> 
> My point is that LLVM-compiled code (Clang/Flang/ispc) are definitely adding
> the Newton step. They get the correct answer.
> 
> That leaves my best guess for the performance difference as owing to the 
> masked
> "vrsqrt14ps" that gcc is using:
> 
> vcmpps  $4, %zmm0, %zmm5, %k1
> vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
> 
> Is there any way for me to test that idea?
> Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
> benchmark it?

Usually it's easiest to compile to assembler with GCC (-S) and test
this kind of theory by editing the GCC-generated assembly and
then benchmark that.  Just use the assembler as input to the
gfortran compile command instead of the .f for linking the program.
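
Concretely, the workflow would be something like this (illustrative commands;
"driver.f90" is a hypothetical name for whatever file holds the benchmark
harness):

gfortran -Ofast -march=native -mprefer-vector-width=512 -S vectorization_test.f90
# hand-edit vectorization_test.s, e.g. delete the vcmpps and the {%k1}{z} mask
gfortran -Ofast -march=native vectorization_test.s driver.f90 -o bench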

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #20 from Chris Elrod  ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the Newton step, the answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) are definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the masked
"vrsqrt14ps" that gcc is using (g++ does this too):

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea?
Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
benchmark it?


Okay, I just tried playing around with flags and looking at asm.
I compiled with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno
-fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math
-fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgppvectorization_test.so  vectorization_test.cpp

which is basically all flags implied by "-ffast-math", except
"-funsafe-math-optimizations". This does include the flags implied by the
unsafe-math optimizations, just not that flag itself.

This list can be simplified to (only "-fno-math-errno" is needed):

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o libgppvectorization_test.so 
vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgfortvectorization_test.so  vectorization_test.f90

This results in the following:

vsqrtps (%r8,%rax), %zmm0
vdivps  %zmm0, %zmm7, %zmm0

ie, vsqrt and a division, rather than the masked reciprocal square root.

With N = 2827, that speeds gfortran and g++ from about 4.3 microseconds to 3.5
microseconds.
For comparison, Clang takes about 2 microseconds, and Flang, ispc, and
awful-looking unsafe Rust take 2.3-2.4 microseconds, using vrsqrt14ps (without
a mask) and a Newton step, instead of vsqrtps followed by a division.


So, "-funsafe-math-optimizations" results in a regression here.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #19 from Chris Elrod  ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the Newton step, the answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) are definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the masked
"vrsqrt14ps" that gcc is using:

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea?
Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
benchmark it?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #18 from Chris Elrod  ---
I can confirm that the inlined packing does allow gfortran to vectorize the
loop. So allowing packing to inline does seem (to me) like an optimization well
worth making.




However, performance seems to be about the same as before, still close to 2x
slower than Flang.


There is definitely something interesting going on in Flang's SLP
vectorization, though.

I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

subroutine vpdbacksolve(Uix, x, S)

    real, dimension(VECTORWIDTH,3)              ::  Uix
    real, dimension(VECTORWIDTH,3), intent(in)  ::  x
    real, dimension(VECTORWIDTH,6), intent(in)  ::  S

    real, dimension(VECTORWIDTH)  ::  U11,  U12,  U22,  U13,  U23,  U33, &
                                      Ui11, Ui12, Ui22, Ui33

    U33 = sqrt(S(:,6))

    Ui33 = 1 / U33
    U13 = S(:,4) * Ui33
    U23 = S(:,5) * Ui33
    U22 = sqrt(S(:,3) - U23**2)
    Ui22 = 1 / U22
    U12 = (S(:,2) - U13*U23) * Ui22
    U11 = sqrt(S(:,1) - U12**2 - U13**2)

    Ui11 = 1 / U11 ! u11
    Ui12 = - U12 * Ui11 * Ui22 ! u12
    Uix(:,3) = Ui33*x(:,3)
    Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
    Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

end subroutine vpdbacksolve


in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling.

I wanted to modify the Fortran file to benchmark these, but I'm pretty sure
Flang cheated in the benchmarks. So compiling into a shared library, and
benchmarking from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time: 15.104 ns (0.00% GC)
  median time:  15.563 ns (0.00% GC)
  mean time:16.017 ns (0.00% GC)
  maximum time: 49.524 ns (0.00% GC)
  --------------
  samples:  1
  evals/sample: 998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time: 24.394 ns (0.00% GC)
  median time:  24.562 ns (0.00% GC)
  mean time:25.600 ns (0.00% GC)
  maximum time: 58.652 ns (0.00% GC)
  --------------
  samples:  1
  evals/sample: 996

That is over 60% faster for Flang, which would account for much, but not all,
of the runtime difference in the actual for loops.

For comparison, the vectorized loop in processbpp covers 16 samples per
iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations.

For the three gfortran benchmarks (that averaged 100,000 runs of the loop),
that means each loop iteration averaged at about
1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64)
21.230246197916664

For flang, that was:
1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64)
9.99152083334

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang,
respectively.


Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

Here are a few things I notice.
1. gfortran always uses masked reciprocal square root operations, to make sure
it only takes the square root of non-negative (positive?) numbers:
vxorps  %xmm5, %xmm5, %xmm5
...
vmovups (%rsi,%rax), %zmm0
vmovups 0(%r13,%rax), %zmm9
vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

This might be avx512f specific? 
Either way, Flang does not use masks:

vmovups (%rcx,%r14), %zmm4
vrsqrt14ps  %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact
of this may be.
Agner Fog's instruction tables, for example, don't mention mask arguments for
vrsqrt14ps.

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps. There are 8 total
plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected per loop
iteration (fpdbacksolve's arguments are a vector of length 3 and another of
length 6; it returns a vector of length 3).

gfortran's loop body has 3 unnecessary vmovaps, copying register contents.

gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register
contents.

Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple
unnecessary memory accesses. Ouch!
They also moved on/off (the stack?)

vmovaps %zmm2, .BSS4+192(%rip)
...
vmovaps %zmm5, .BSS4+320(%rip)
...
vmovaps .BSS4+192(%rip), %zmm5
... #zmm5 is overwritten in here, I just mean to show the sort of stuff that
goes on
vmulps  .BSS4+320(%rip), %zmm5, %zmm0

Some of those moves also don't get used again, and some other things are just
plain weird:
vxorps  %xmm3, %xmm3, %xmm3
vfnmsub231ps    %zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3
vmovaps %zmm3, .BSS4+576(%rip)

Like, why zero out the 128-bit portion of zmm3?
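
(A plausible answer to that last question, as my reading rather than anything
from the thread: vfnmsub231ps computes dst = -(src1 * src2) - dst, so zeroing
zmm3 first is simply how you get a plain negated product, -(zmm0 * zmm2), out
of an FMA instruction; and the xmm-width vxorps zeroes the full zmm register,
so no separate 512-bit zeroing is needed.)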

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #17 from Thomas Koenig  ---
What an inline packing would (approximately) produce is this:

subroutine processBPP(X, BPP, N)
    integer,intent(in)  ::  N
    real,   dimension(N,3), intent(out) ::  X
    real,   dimension(N,10),intent(in)  ::  BPP

    integer ::  i
    real :: tmp1(3)
    real :: tmp2(6)
    integer :: k1, k2

    do concurrent (i = 1:N)
       k1 = 0
       do
          if (.not. k1 < 3) exit
          tmp1(k1+1) = BPP(i,k1+1)
          k1 = k1 + 1
       end do

       k2 = 0
       do
          if (.not. k2 < 6) exit
          tmp2(k2+1) = BPP(i,k2+5)
          k2 = k2 + 1
       end do

       X(i,:) = fpdbacksolve(tmp1, tmp2)
    end do

end subroutine processBPP

I see no timing difference for gfortran between this and the (:) version.
Chris, can you confirm this?

And is flang still faster by a factor of two if you use this version?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig  changed:

           What    |Removed |Added
--------------------------------------------------------
                CC |        |koenigni at gcc dot gnu.org

--- Comment #16 from Thomas Koenig  ---
(In reply to Richard Biener from comment #15)
> So can the fortran FE inline the _gfortran_internal_pack() call?  It looks
> like
> flang manages to elide this when inlining the function at least?

In principle, this could be done.  I guess it would be a good idea to
create a test case first which would emulate what inlining pack would do.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Richard Biener  changed:

           What    |Removed |Added
--------------------------------------------------------
                CC |        |rguenth at gcc dot gnu.org

--- Comment #15 from Richard Biener  ---
So can the fortran FE inline the _gfortran_internal_pack() call?  It looks like
flang manages to elide this when inlining the function at least?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #14 from Chris Elrod  ---
It (the maximum relative error) is not really reproducible across runs:

$ time ./gfortvectests 
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:  0.188879877  0.377619117  -1.67841911E-02
 First record Xt:  0.10071  0.377619147  -1.67841911E-02
 Second record X:  -8.14126506E-02 -0.421755224 -0.199057430
 Second record Xt:  -8.14126655E-02 -0.421755224 -0.199057430

real    0m2.414s
user    0m2.406s
sys     0m0.005s

$ time ./flangvectests 
 Transpose benchmark completed in    7.630980
 SIMD benchmark completed in   0.6455200
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542    1.568364   0.1006735
 First record Xt:   0.5867541    1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.839s
user    0m0.832s
sys     0m0.006s

$ time ./gfortvectests 
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X: -0.284217566   2.13768221E-02 -0.475293010
 First record Xt: -0.284217596   2.13767942E-02 -0.475293040
 Second record X:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real    0m2.344s
user    0m2.338s
sys     0m0.003s

$ time ./flangvectests 
 Transpose benchmark completed in    7.881181
 SIMD benchmark completed in   0.6132510
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542    1.568364   0.1006735
 First record Xt:   0.5867541    1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.861s
user    0m0.853s
sys     0m0.006s


It also probably wasn't quite right to call it "error", because it's
comparing the values from the scalar and vectorized versions. Although it is
unsettling if the differences are high; ideally, they should match exactly.

Back to Julia, using mpfr (set to 252 bits of precision), and rounding to
single precision for an exactly rounded answer...

X32gfort # calculated from gfortran
X32flang # calculated from flang
Xbf  # mpfr, 252-bit precision ("BigFloat" in Julia)

julia> Xbf32 = Float32.(Xbf) # correctly rounded result

julia> function ULP(x, correct) # calculates ULP error
           x == correct && return 0
           if x < correct
               error = 1
               while nextfloat(x, error) != correct
                   error += 1
               end
           else
               error = 1
               while prevfloat(x, error) != correct
                   error += 1
               end
           end
           error
       end
ULP (generic function with 1 method)

julia> ULP.(X32gfort, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 7  1  1  8  3  2  1  1  1  27  4  1  4  6  0  0  2  0  2  4  0  7  1  1  3  8  4  2  2  …  1  0  2  0  0  1  2  3  1  5  1  1  0  0  0  2  3  2  1  2  3  1  0  1  1  0  2  0  41
 4  2  1  1  6  1  0  1  1   2  2  0  0  3  0  1  0  3  1  1  0  1  1  0  0  3  1  0  0  …  0  1  0  1  0  1  0  1  1  4  1  1  0  2  0  1  0  1  0  0  0  1  2  1  1  1  0  0   1
 1  1  0  1  1  0  0  0  0   1  1  0  0  1  0  1  1  1  0  1  1  0  0  1  0  1  0  0  0  …  0  0  1  0  0  0  0  0  1  0  0  1  1  1  0  0  1  0  1  1  0  1  1  0  0  0  0  0   1

julia> mean(ans)
1.9462890625

julia> ULP.(X32flang, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 4  1  0  3  0  0  0  1  1  5  2  1  1  6  3  0  1  0  0  1  1  21  0  1  2  8  2  3  0  0  …  1  1  1  15  2  1  1  5  1  1  1  0  0  0  0  0  2  1  3  1  1  1  1  1  1  1  0  11
 3  1  1  0  1  0  0  1  0  0  1  0  0  2  1  1  1  6  0  0  0   2  1  0  1  4  1  1  0  3  …  1  1  1   1  2  1  1  0  1  1  0  0  1  0  1  0  0  1  1  1  0  1  0  0   0
 1  0  1  0  0  0  1  1  0  1  0  0  0  1  1  0  0  1  1  0  1   1  0  1  0  1  0  0  1  0  …  0  0  1   0  0  0  0  0  0  2  0  0  0  0  0  1  1  1  1  0  1  0  0  0  0  0  0   1

julia> mean(ans)
1.3388671875


So in that case, gfortran's version had about 1.95 ULP error on average, and
Flang about 1.34 ULP error.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Jerry DeLisle  changed:

   What|Removed |Added

 CC||jvdelisle at gcc dot gnu.org

--- Comment #13 from Jerry DeLisle  ---
I noticed the maximum relative error in your benchmarks is significantly
larger in the flang test than in the gfortran test. Is this a factor that
matters?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #12 from Chris Elrod  ---
Created attachment 45363
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.

Okay, thank you.

I attached a Fortran program you can run to benchmark the code.
It randomly generates valid inputs and then times 10^5 runs of the code.
Finally, it reports the average time in microseconds.

The SIMD times are the vectorized version, and the transposed times are the
non-vectorized versions. In both cases, Flang produces much faster code.

The results seem in line with what I got when benchmarking the shared
libraries from Julia.
I linked librt for access to the high-resolution clock.
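
(The timing pattern boils down to something like the sketch below. This is my
reconstruction, not the attachment's code: it uses the portable system_clock
rather than the clock_gettime that linking librt suggests, so take it as
illustrative only.)

program bench_sketch
  implicit none
  integer, parameter :: niter = 10**5
  integer(kind=8) :: t0, t1, rate
  integer :: i
  call system_clock(t0, rate)
  do i = 1, niter
     ! call the kernel under test here (e.g. the SIMD or transposed version)
  end do
  call system_clock(t1)
  print *, 'benchmark completed in', &
       1.0e6 * real(t1 - t0) / real(rate) / real(niter), 'microseconds'
end program bench_sketch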


$ gfortran -Ofast -lrt -march=native -mprefer-vector-width=512
vectorization_tests.F90 -o gfortvectests

$ time ./gfortvectests 
 Transpose benchmark completed in   22.7799759
 SIMD benchmark completed in   1.34003162
 All are equal: F
 All are approximately equal: F
 Maximum relative error   8.27204276E-05
 First record X:   1.02466011 -0.689792156 -0.404027045
 First record Xt:   1.02465975 -0.689791918 -0.404026985
 Second record X: -0.546353579   3.37308086E-03   1.15257287
 Second record Xt: -0.546353400   3.37312138E-03   1.15257275

real    0m2.418s
user    0m2.412s
sys     0m0.003s

$ flang -Ofast -lrt -march=native -mprefer-vector-width=512
vectorization_tests.F90 -o flangvectests

$ time ./flangvectests 
 Transpose benchmark completed in    7.232568
 SIMD benchmark completed in   0.6596010
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542    1.568364   0.1006735
 First record Xt:   0.5867541    1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.801s
user    0m0.794s
sys     0m0.005s

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Thomas Koenig  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
   Last reconfirmed||2019-01-06
  Component|fortran |tree-optimization
 Blocks|36854   |53947
 Resolution|WONTFIX |---
Summary|_gfortran_internal_pack@PLT |Vectorized code slow vs.
   |prevents vectorization  |flang
 Ever confirmed|0   |1

--- Comment #11 from Thomas Koenig  ---
OK, so I think it makes sense to reopen this bug as a missed
optimization for the vectorizer (reopen because it would be a shame
to lose all the info you already provided).

It seems like gcc could do much better here, possibly with some more
help from the gfortran front end.  A factor of two is not to
be ignored.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36854
[Bug 36854] [meta-bug] fortran front-end optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations