[Bug tree-optimization/88713] Vectorized code slow vs. flang

rguenth at gcc dot gnu.org Wed, 23 Jan 2019 01:31:31 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com

--- Comment #34 from Richard Biener <rguenth at gcc dot gnu.org> ---
So with -Ofast and -mprefer-vector-width=256 I get

  <bb 2> [local count: 63136019]:
  vect__4.2_3 = MEM[(float *)a_11(D)];
  vect__5.3_4 = RSQRT (vect__4.2_3);
  MEM[(float *)r_12(D)] = vect__5.3_4;
  vect__4.2_21 = MEM[(float *)a_11(D) + 32B];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D) + 32B] = vect__5.3_20;

while with -mprefer-vector-width=512 I need -mavx512er to trigger the
expander, then I also get

  <bb 2> [local count: 63136020]:
  vect__4.2_21 = MEM[(float *)a_11(D)];
  vect__5.3_20 = RSQRT (vect__4.2_21);
  MEM[(float *)r_12(D)] = vect__5.3_20;

and in that case

rsqrt:
.LFB12:
        .cfi_startproc
        vrsqrt28ps      (%rsi), %zmm0
        vmovups %zmm0, (%rdi)
        vzeroupper
        ret

(huh?  isn't there a Newton-Raphson (NR) step missing?)

for -mprefer-vector-width=256 I get (irrespective of -mavx512er):

rsqrt:
.LFB12:
        .cfi_startproc
        vmovups (%rsi), %ymm1
        vmovaps .LC1(%rip), %ymm3
        vrsqrtps        %ymm1, %ymm2
        vmovaps .LC0(%rip), %ymm4
        vmovups 32(%rsi), %ymm0
        vmulps  %ymm1, %ymm2, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmulps  %ymm3, %ymm2, %ymm2
        vaddps  %ymm4, %ymm1, %ymm1
        vmulps  %ymm2, %ymm1, %ymm1
        vmovups %ymm1, (%rdi)
        vrsqrtps        %ymm0, %ymm1
        vmulps  %ymm0, %ymm1, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmulps  %ymm3, %ymm1, %ymm1
        vaddps  %ymm4, %ymm0, %ymm0
        vmulps  %ymm1, %ymm0, %ymm0
        vmovups %ymm0, 32(%rdi)
        vzeroupper

so the issue lies somewhere in the backend.
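
For reference, the refinement that follows each vrsqrtps in the 256-bit
listing can be sketched in scalar C as below.  This is only an illustration:
the function name is made up, x0 stands in for the hardware estimate, and the
constants -3.0f and -0.5f are assumed values for .LC0 and .LC1, whose contents
are not shown in the dump.

```c
/* Scalar sketch of one Newton-Raphson refinement of an approximate
   reciprocal square root, in the same operation order as the listing.
   rsqrt_nr_step is a hypothetical name; -3.0f and -0.5f are assumed
   values for .LC0 and .LC1.  */
float rsqrt_nr_step(float a, float x0)
{
    float e = a * x0;      /* vmulps: a * x0            */
    e = e * x0;            /* vmulps: a * x0 * x0       */
    float h = x0 * -0.5f;  /* vmulps: x0 * .LC1         */
    e = e + -3.0f;         /* vaddps: a * x0 * x0 + .LC0 */
    return e * h;          /* vmulps: 0.5 * x0 * (3 - a * x0 * x0) */
}
```

Feeding it a rough estimate such as x0 = 0.49f for a = 4.0f should move the
result closer to 0.5f, which is the point of the extra multiplies and the add.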

Of the flags implied by -Ofast, you need -ffinite-math-only -fno-math-errno
-funsafe-math-optimizations.

GCC definitely fails to see the opportunity to use FMA in
ix86_emit_swsqrtsf.  The a == 0 check is there because of the missing
expander without avx512er, where we could still use the NR sequence
with the other instruction.  HJ?
