[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #7 from Chris Elrod  ---
(In reply to Chris Elrod from comment #6)
> However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the
> 32 columns of A

Correction: it was the 16x13 version that used stack data after loading column
25 instead of 23 of A.

2018-07-23 Thread elrodc at gmail dot com

--- Comment #6 from Chris Elrod  ---
(In reply to Richard Biener from comment #3)
> If you see spilling on the manually unrolled loop register pressure is
> somehow an issue.

In the matmul kernel:
D = A * X
where D is 16x14, A is 16xN, and X is Nx14 (N arbitrarily set to 32)

The code holds all of D in registers.
16x14 doubles at 8 doubles per zmm register take up 28 of the 32 registers.

Then, it loads 1 column of A at a time (2 more registers), and broadcasts
elements from the corresponding row in each column of X, updating the
corresponding column of D with fma instructions.

By broadcasting 2 at a time, it should be using exactly 32 registers.
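
The register accounting above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `register_budget` and its parameters are made up for illustration, and it assumes every column of the tile occupies whole vector registers.

```python
def register_budget(rows, cols, lanes, broadcasts_in_flight):
    """Count vector registers used by a rows x cols accumulator tile.

    lanes is the number of doubles per vector register
    (8 for zmm, 4 for ymm). Hypothetical helper, not from the kernels.
    """
    regs_per_column = -(-rows // lanes)      # ceil(rows / lanes)
    accumulators = regs_per_column * cols    # all of D stays in registers
    a_column = regs_per_column               # one column of A at a time
    return accumulators + a_column + broadcasts_in_flight


# 16x14 tile, avx512, 2 broadcasts at a time: 28 + 2 + 2
print(register_budget(16, 14, 8, 2))   # 32
# 16x13 tile, avx512, 4 broadcasts at a time: 26 + 2 + 4
print(register_budget(16, 13, 8, 4))   # 32
# 8x6 tile, avx2 (4 doubles per register, 16 registers): 12 + 2 + 2
print(register_budget(8, 6, 4, 2))     # 16
```

In all three cases the total equals the size of the register file, so there is no slack for any extra temporary.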

For the most part, that is precisely what the manually unrolled code is doing
for each column of A.
However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the 32
columns of A, it suddenly spills (all the stack accesses happen for the same
column, and none of the others), even though the process is identical for each
column.
Switching to a smaller 16x13 output, freeing up 2 registers to allow 4
broadcast loads at a time, still resulted in 4 spills (down from 5) for only
column #23 or #25.
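
For reference, the column number comes from the displacement seen in the assembly. The sketch below assumes 2944 is a byte offset into A and that each column is 16 contiguous doubles; the function name is hypothetical.

```python
BYTES_PER_DOUBLE = 8
ROWS_OF_A = 16  # A is 16xN, stored column by column


def column_from_offset(byte_offset):
    """Map a byte offset into A to (zero-based column, bytes into that column)."""
    column_bytes = ROWS_OF_A * BYTES_PER_DOUBLE  # 128 bytes per column
    return divmod(byte_offset, column_bytes)


print(column_from_offset(2944))  # (23, 0)
```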

I couldn't reproduce the spills in the avx2 kernel.
The smaller kernel has an 8x6 output, taking up 12 of the 16 ymm registers.
That again leaves 4 registers in total: 2 for a column of A, and 2 for
broadcasts from X at a time. So it's the same pattern.


The smaller kernel does reproduce the problems with the loops: -O3 without
`-fdisable-tree-cunrolli` leads to a slow vectorization scheme, while -O3 with
it, or `-O2 -ftree-vectorize`, produces repetitive loads and stores within the
loop.

2018-07-23 Thread elrodc at gmail dot com

--- Comment #5 from Chris Elrod  ---
Created attachment 44424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44424&action=edit
Smaller avx512 kernel that still spills into the stack

This generated 18 total `vmovapd` (I think there'd ideally be 0) when compiled
with:

gfortran -march=skylake-avx512 -mprefer-vector-width=512 -O2 -ftree-vectorize \
    -shared -fPIC -S kernels16x32x13.f90 -o kernels16x32x13.s

4 of which moved onto the stack, and one moved from the stack back into a
register.
(The others were transferred from the stack within vfmadd instructions:
`vfmadd213pd 72(%rsp), %zmm11, %zmm15`
)


Similar to the larger kernel, using `-O3` instead of `-O2 -ftree-vectorize`
eliminated two of the `vmovapd` instructions between registers, but none of the
spills.

2018-07-23 Thread elrodc at gmail dot com

--- Comment #4 from Chris Elrod  ---
Created attachment 44423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit
8x16 * 16x6 kernel for avx2.

Here is a scaled-down version to reproduce most of the problem for
avx2-capable architectures.
I just used -march=haswell, but I think most recent architectures fall under
this.
For some, like znver1, you may need to add -mprefer-vector-width=256.


To get the inefficiently vectorized loop:

gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 \
    -o kernelsavx2bad.s

To get only the unnecessary loads/stores, use:

gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90 \
    -o kernelsavx2.s

This file compiles instantly, while with `-O3` the other one can take a couple
of seconds.
However, while it does `vmovapd` between registers, it no longer spills into the
stack in the manually unrolled version, like the avx512 kernel does.

2018-07-23 Thread rguenth at gcc dot gnu.org

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947

--- Comment #3 from Richard Biener  ---
If you see spilling on the manually unrolled loop register pressure is somehow
an issue.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

2018-07-22 Thread elrodc at gmail dot com

--- Comment #2 from Chris Elrod  ---
Created attachment 44418
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit
Code to reproduce slow vectorization pattern and unnecessary loads & stores

(Sorry if this goes to the bottom instead of top, trying to attach a file in
place of a link, but I can't edit the old comment.)

Attached is sample code to reproduce the problem in gcc 8.1.1.
As observed by amonakov, compiling with -O3/-Ofast reproduces the full problem,
e.g.:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops \
    -S kernels.f90 -o kernels.s

Compiling with -O3 -fdisable-tree-cunrolli or -O2 -ftree-vectorize fixes the
incorrect vectorization pattern, but leaves a lot of unnecessary broadcast loads
and stores.

2018-07-22 Thread amonakov at gcc dot gnu.org

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
          Component|rtl-optimization            |tree-optimization

--- Comment #1 from Alexander Monakov  ---
Please supply testcase(s) as Bugzilla attachments, not external links.

At -O3/-Ofast the main issue is early unrolling ('cunrolli') splatting all
simple 16-iteration inner loops. After that imho all hope is lost, and yeah,
looks like we try to vectorize across the other dimension.

With -O3 -fdisable-tree-cunrolli, or with -O2 -ftree-vectorize we do get the
correct vectorization pattern, but a couple of problems remain: after vect,
tree optimizations cannot hoist/sink memory references out of the outer loop,
leaving 2 loads, 1 load-broadcast and 1 store per each fma. Later, RTL PRE
cleans up redundant vector loads, but load-broadcasts and stores remain.
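
To make the cost of the missing hoist/sink concrete, here is a rough scalar model of the two loop shapes. It is a sketch under stated assumptions, not the attached Fortran kernel: the function names and the `ops` counters are invented for illustration, each element stands in for a vector lane, and the loop order (columns of A outermost) is assumed from the description in this thread.

```python
def accumulate_in_memory(A, X, M, N, P):
    """D is read and written in memory on every multiply-add."""
    D = [[0.0] * P for _ in range(M)]
    ops = {"load": 0, "broadcast": 0, "store": 0, "fma": 0}
    for n in range(N):
        for p in range(P):
            for m in range(M):
                d = D[m][p]; ops["load"] += 1       # load D
                a = A[m][n]; ops["load"] += 1       # load A
                x = X[n][p]; ops["broadcast"] += 1  # load-broadcast X
                D[m][p] = d + a * x
                ops["fma"] += 1
                ops["store"] += 1                   # store D
    return D, ops


def accumulate_in_registers(A, X, M, N, P):
    """D lives in a local tile; memory is touched only when needed."""
    acc = [[0.0] * P for _ in range(M)]  # stands in for the register tile
    ops = {"load": 0, "broadcast": 0, "store": 0, "fma": 0}
    for n in range(N):
        a_col = [A[m][n] for m in range(M)]  # one column of A, loaded once
        ops["load"] += M
        for p in range(P):
            x = X[n][p]; ops["broadcast"] += 1
            for m in range(M):
                acc[m][p] = acc[m][p] + a_col[m] * x
                ops["fma"] += 1
    ops["store"] += M * P                    # write D back once at the end
    return acc, ops


M, N, P = 16, 32, 14
A = [[float(m + n) for n in range(N)] for m in range(M)]
X = [[float(n - p) for p in range(P)] for n in range(N)]
for kernel in (accumulate_in_memory, accumulate_in_registers):
    _, ops = kernel(A, X, M, N, P)
    print(kernel.__name__, ops)
```

The first form performs 2 loads, 1 load-broadcast and 1 store per fma; the second performs the same number of fmas with far fewer memory operations.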