[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-07-17 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #60 from CVS Commits --- The master branch has been updated by H.J. Lu : https://gcc.gnu.org/g:737355072af4cd0c24a4a8967e1485c1f3a80bfe commit r11-2200-g737355072af4cd0c24a4a8967e1485c1f3a80bfe Author: H.J. Lu Date: Mon Jul 13 09

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-07-09 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #59 from CVS Commits --- The master branch has been updated by H.J. Lu : https://gcc.gnu.org/g:fab263ab0fc10ea08409b80afa7e8569438b8d28 commit r11-1970-gfab263ab0fc10ea08409b80afa7e8569438b8d28 Author: H.J. Lu Date: Wed Jan 23 06

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-06-28 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #58 from H.J. Lu --- (In reply to Thomas Koenig from comment #57) > (In reply to H.J. Lu from comment #56) > > (In reply to Thomas Koenig from comment #55) > > > (In reply to H.J. Lu from comment #45) > > > > Created attachment 45510

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2020-06-28 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #57 from Thomas Koenig --- (In reply to H.J. Lu from comment #56) > (In reply to Thomas Koenig from comment #55) > > (In reply to H.J. Lu from comment #45) > > > Created attachment 45510 [details] > > > An updated patch > > > > HJ, d

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-09-19 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #56 from H.J. Lu --- (In reply to Thomas Koenig from comment #55) > (In reply to H.J. Lu from comment #45) > > Created attachment 45510 [details] > > An updated patch > > HJ, do you plan on committing these? We are collecting perfor

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-09-19 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #55 from Thomas Koenig --- (In reply to H.J. Lu from comment #45) > Created attachment 45510 [details] > An updated patch HJ, do you plan on committing these?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-02-12 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #54 from Chris Elrod --- I commented elsewhere, but I built trunk a few days ago with H.J.Lu's patches (attached here) and Thomas Koenig's inlining patches. With these patches, g++ and all versions of the Fortran code produced excelle

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #53 from rguenther at suse dot de --- On Thu, 24 Jan 2019, glisse at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #52 from Marc Glisse --- > (In reply to Thomas Koenig from comment #49

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #52 from Marc Glisse --- (In reply to Thomas Koenig from comment #49) > Argh. Sacrificing performance for the sake of bugware... But note that in this PR (specifically for avx512 vectors on this cpu), the OP says that the recip vers

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #51 from rguenther at suse dot de --- On Thu, 24 Jan 2019, tkoenig at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #49 from Thomas Koenig --- > (In reply to Uroš Bizjak from comment #4

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #50 from Uroš Bizjak --- (In reply to Thomas Koenig from comment #49) > (In reply to Uroš Bizjak from comment #48) > > (In reply to rguent...@suse.de from comment #47) > > > >But why don't we generate sqrtps for vector sqrtf? > > > >

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #49 from Thomas Koenig --- (In reply to Uroš Bizjak from comment #48) > (In reply to rguent...@suse.de from comment #47) > > >But why don't we generate sqrtps for vector sqrtf? > > > > That's the default for -mrecip back in time we

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #48 from Uroš Bizjak --- (In reply to rguent...@suse.de from comment #47) > >But why don't we generate sqrtps for vector sqrtf? > > That's the default for -mrecip back in time we benchmarked it and scalar > recip miscompares sth. I

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #47 from rguenther at suse dot de --- On January 23, 2019 5:13:12 PM GMT+01:00, "hjl.tools at gmail dot com" wrote: >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > >--- Comment #46 from H.J. Lu --- >We generate sqrtps for scal

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #46 from H.J. Lu --- We generate sqrtps for scalar sqrtf: [hjl@gnu-skx-1 pr88713]$ cat s.i extern float sqrtf(float x); float rsqrt(float r) { return sqrtf (r); } [hjl@gnu-skx-1 pr88713]$ gcc -Ofast -S s.i [hjl@gnu-skx-1 pr88713]$

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 H.J. Lu changed: What|Removed |Added Attachment #45509|0 |1 is obsolete|

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 H.J. Lu changed: What|Removed |Added Attachment #45508|0 |1 is obsolete|

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 H.J. Lu changed: What|Removed |Added Attachment #45507|0 |1 is obsolete|

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #42 from H.J. Lu --- Created attachment 45507 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45507&action=edit A patch

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #41 from Uroš Bizjak --- (In reply to H.J. Lu from comment #40) > (In reply to rguent...@suse.de from comment #39) > > > > > > > > Yes. The lack of an expander for the rsqrt operation is probably > > > > more severe though (causing

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #40 from H.J. Lu --- (In reply to rguent...@suse.de from comment #39) > > > > > > Yes. The lack of an expander for the rsqrt operation is probably > > > more severe though (causing sqrt + approx recip to appear) > > > > > > > Can

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #39 from rguenther at suse dot de --- On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #38 from H.J. Lu --- > (In reply to rguent...@suse.de from comment #3

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #38 from H.J. Lu --- (In reply to rguent...@suse.de from comment #37) > On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > > > --- Comment #36 from H.J. Lu --- > > (I

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #37 from rguenther at suse dot de --- On Wed, 23 Jan 2019, hjl.tools at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #36 from H.J. Lu --- > (In reply to Richard Biener from comment #34)

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #36 from H.J. Lu --- (In reply to Richard Biener from comment #34) > GCC definitely fails to see the FMA use as opportunity in > ix86_emit_swsqrtsf, the a == 0 checking is because of the missing > expander w/o avx512er where we could

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #35 from Chris Elrod --- > rsqrt: > .LFB12: > .cfi_startproc > vrsqrt28ps (%rsi), %zmm0 > vmovups %zmm0, (%rdi) > vzeroupper > ret > > (huh? isn't there a NR step missing?) > I assume

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 Richard Biener changed: What|Removed |Added CC||hjl.tools at gmail dot com --- Comment

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #33 from Marc Glisse --- (In reply to Chris Elrod from comment #32) > (In reply to Marc Glisse from comment #31) > > What we need to understand is why gcc doesn't try to generate rsqrt Without -mavx512er, we do not have an expander f

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #32 from Chris Elrod --- (In reply to Marc Glisse from comment #31) > (In reply to Chris Elrod from comment #30) > > gcc calculates the rsqrt directly > > No, vrsqrt14ps is just the first step in calculating sqrt here (slightly > dif

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #31 from Marc Glisse --- (In reply to Chris Elrod from comment #30) > gcc calculates the rsqrt directly No, vrsqrt14ps is just the first step in calculating sqrt here (slightly different formula than rsqrt). vrcp14ps shows that it is

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #30 from Chris Elrod --- gcc still (In reply to Marc Glisse from comment #29) > The main difference I can see is that clang computes rsqrt directly, while > gcc first computes sqrt and then computes the inverse. Also gcc seems afraid

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #29 from Marc Glisse --- The main difference I can see is that clang computes rsqrt directly, while gcc first computes sqrt and then computes the inverse. Also gcc seems afraid of getting NaN for sqrt(0) so it masks out this value. ix
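
Editor's sketch for the sqrt(0) remark in comment #29. This is not GCC's generated code, only a scalar illustration: when sqrt(x) is formed as x * r from a reciprocal square root estimate r, an input of zero gives 0 * inf, which is NaN, so the zero lane has to be masked. The function names are hypothetical, and r is passed in by the caller rather than taken from a hardware instruction.

```c
#include <math.h>   /* INFINITY */

/* sqrt(x) recovered from an estimate r of 1/sqrt(x) as x * r.
   A reciprocal square root estimate of 0 is +inf, and 0 * inf is NaN,
   so the zero input is masked out explicitly. */
float sqrt_from_rsqrt(float x, float r)
{
    return (x == 0.0f) ? 0.0f : x * r;
}

/* Same product without the mask, to show the failure mode. */
float sqrt_from_rsqrt_unmasked(float x, float r)
{
    return x * r;
}
```

Computing rsqrt directly, as clang is described as doing, skips the multiply by x and therefore needs no such mask.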

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #28 from Chris Elrod --- Created attachment 45501 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast -S -march=skylake-avx512 -mprefer-

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #27 from Chris Elrod --- g++ -mrecip=all -O3 -fno-signed-zeros -fassociative-math -freciprocal-math -fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #26 from Chris Elrod --- > You can try enabling -mrecip to see RSQRT in .optimized - there's > probably late 1/sqrt optimization on RTL. No luck. The full commands I used: gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=nati

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #25 from rguenther at suse dot de --- On Tue, 22 Jan 2019, elrodc at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #24 from Chris Elrod --- > The dump looks like this: > > vect__67.78_

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #24 from Chris Elrod --- The dump looks like this: vect__67.78_217 = SQRT (vect__213.77_225); vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #23 from rguenther at suse dot de --- On Tue, 22 Jan 2019, elrodc at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #22 from Chris Elrod --- > Okay. I did that, and the time went from abou

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #22 from Chris Elrod --- Okay. I did that, and the time went from about 4.25 microseconds down to 4.0 microseconds. So that is an improvement, but accounts for only a small part of the difference with the LLVM-compilers. -O3 -fno-mat

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #21 from rguenther at suse dot de --- On Tue, 22 Jan 2019, elrodc at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 > > --- Comment #19 from Chris Elrod --- > To add a little more: > I used inline asm for

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #20 from Chris Elrod --- To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple significant digits. With the N

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #19 from Chris Elrod --- To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple significant digits. With the N
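
Editor's sketch of the Newton step mentioned in the comment above. It assumes the usual Newton-Raphson iteration for f(y) = 1/y^2 - x and is written as plain scalar C. It is not the Julia inline asm from the comment, nor the sequence GCC emits. The function name is hypothetical, and the rough estimate y0 is supplied by the caller, standing in for the result of an instruction such as vrsqrt14ps.

```c
#include <assert.h>

/* One Newton-Raphson refinement of an approximate reciprocal square root.
   x  : input, assumed positive and finite
   y0 : rough estimate of 1/sqrt(x), e.g. from a hardware approximation
   A relative error e in y0 shrinks to roughly 1.5 * e * e in the result. */
float rsqrt_newton_step(float x, float y0)
{
    assert(x > 0.0f);
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

With an estimate that is good to about 14 bits, one such step brings the result close to full single precision, which is consistent with the accuracy difference the comment reports.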

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #18 from Chris Elrod --- I can confirm that the inlined packing does allow gfortran to vectorize the loop. So allowing packing to inline does seem (to me) like an optimization well worth making. However, performance seems to be ab

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #17 from Thomas Koenig --- What an inline packing would (approximately) produce is this: subroutine processBPP(X, BPP, N) integer,intent(in) :: N real, dimension(N,3), intent(out)

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 Thomas Koenig changed: What|Removed |Added CC||koenigni at gcc dot gnu.org --- Comment

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 Richard Biener changed: What|Removed |Added CC||rguenth at gcc dot gnu.org --- Comment

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #14 from Chris Elrod --- It's not really reproducible across runs: $ time ./gfortvectests Transpose benchmark completed in 22.7010765 SIMD benchmark completed in 1.37529969 All are equal: F All are approximately equa

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 Jerry DeLisle changed: What|Removed |Added CC||jvdelisle at gcc dot gnu.org --- Comment

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #12 from Chris Elrod --- Created attachment 45363 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit Fortran program for running benchmarks. Okay, thank you. I attached a Fortran program you can run to benchmark

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 Thomas Koenig changed: What|Removed |Added Status|RESOLVED|REOPENED Last reconfirmed|