Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Jan Hubicka
> On Mon, Jan 07, 2019 at 09:29:09AM +0100, Richard Biener wrote:
> > On Sun, 6 Jan 2019, Jan Hubicka wrote:
> > > Even though it is late in the release cycle, I wonder if we can do that
> > > for GCC 9?  Performance of vectorization is very architecture specific; I
> > > would propose enabling vectorization for Zen, Core-based chips and
> > > generic in x86-64. I can also run benchmarks on Bulldozer. I can then
> > > tune down the cheap model to avoid some of the more expensive
> > > transformations.
> > 
> > I'd rather not do this now, it's _way_ too late (also considering
> > you are again doing inliner tuning so late).
> 
> This probably should be more generic than just x86 really, we have similar
> problems on Power (-O3 is almost always faster than -O2, which is bad).
> Likely other archs have the same problems.
> 
> But yes, too late for GCC 9.

Yep, I guessed so, still wanted to ask :)
I think this is similar to schedule-insns(2), where whether it is a win
depends on the subtarget. So I think it is good to leave it up to the
target to enable the pass - we probably have fewer targets that want
vectorizing than targets that don't.

I would suggest enabling it on x86 early in the next stage1 and trying to
run similar benchmarks on ppc and arm.  We can then try to tune the code
size/speed tradeoffs.

Honza
> 
> 
> Segher


Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Segher Boessenkool
On Mon, Jan 07, 2019 at 09:29:09AM +0100, Richard Biener wrote:
> On Sun, 6 Jan 2019, Jan Hubicka wrote:
> > Even though it is late in the release cycle, I wonder if we can do that
> > for GCC 9?  Performance of vectorization is very architecture specific; I
> > would propose enabling vectorization for Zen, Core-based chips and
> > generic in x86-64. I can also run benchmarks on Bulldozer. I can then
> > tune down the cheap model to avoid some of the more expensive
> > transformations.
> 
> I'd rather not do this now, it's _way_ too late (also considering
> you are again doing inliner tuning so late).

This probably should be more generic than just x86 really, we have similar
problems on Power (-O3 is almost always faster than -O2, which is bad).
Likely other archs have the same problems.

But yes, too late for GCC 9.


Segher


Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Jan Hubicka
> > Note that I benchmarked -ftree-slp-vectorize separately before and
> > the results were hit-and-miss, so perhaps enabling only -ftree-vectorize
> > would give better compile time tradeoffs. I was worried about partial
> > memory stalls, but I will benchmark it and also benchmark the difference
> > between cost models.
> 
> ; Alias to enable both -ftree-loop-vectorize and -ftree-slp-vectorize.
> ftree-vectorize
> Common Report Optimization
> Enable vectorization on trees.

Thanks! I would probably have fallen into that trap and run the same set
of benchmarks again.

Honza
> 
> -- 
> Eric Botcazou


Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Eric Botcazou
> Note that I benchmarked -ftree-slp-vectorize separately before and
> the results were hit-and-miss, so perhaps enabling only -ftree-vectorize
> would give better compile time tradeoffs. I was worried about partial
> memory stalls, but I will benchmark it and also benchmark the difference
> between cost models.

; Alias to enable both -ftree-loop-vectorize and -ftree-slp-vectorize.
ftree-vectorize
Common Report Optimization
Enable vectorization on trees.

-- 
Eric Botcazou
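
The distinction behind the common.opt entry quoted above can be illustrated
with a small sketch. It is not taken from the thread: the file name
vec_demo.c and both function names are made up for illustration, and whether
either function is actually vectorized depends on the GCC version, the
target and the cost model. The first function is the kind of loop
-ftree-loop-vectorize looks at; the second is straight-line code of the kind
-ftree-slp-vectorize looks at.

```c
/* vec_demo.c (hypothetical file name).
 *
 * Possible ways to compare the two passes; -fopt-info-vec asks GCC to
 * report what it vectorized:
 *
 *   gcc -O2 -ftree-loop-vectorize -fopt-info-vec -c vec_demo.c
 *   gcc -O2 -ftree-slp-vectorize  -fopt-info-vec -c vec_demo.c
 *   gcc -O2 -ftree-vectorize -fvect-cost-model=cheap -fopt-info-vec -c vec_demo.c
 */
#include <stddef.h>

/* Loop candidate: every iteration does the same independent update,
 * and restrict tells the compiler that a and b do not overlap. */
void scale_add(float *restrict a, const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i] * s;
}

/* SLP candidate: four isomorphic statements on adjacent elements,
 * with no loop around them. */
void add4(float *restrict out, const float *restrict x, const float *restrict y)
{
    out[0] = x[0] + y[0];
    out[1] = x[1] + y[1];
    out[2] = x[2] + y[2];
    out[3] = x[3] + y[3];
}
```

Since -ftree-vectorize enables both passes, benchmarking it against
-ftree-slp-vectorize alone does not isolate the loop vectorizer; that
requires -ftree-loop-vectorize on its own.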


Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

2019-01-07 Thread Richard Biener
On Sun, 6 Jan 2019, Jan Hubicka wrote:

> Hello,
> while running benchmarks for inliner tuning I also ran benchmarks
> comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
> Liska's LNT setup (https://lnt.opensuse.org/).  The results are
> summarized below, but you can also see the colorful table produced
> by Martin's LNT magic
> 
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3_percentage_change=0.02=746f%2C55f=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10_percentage_change=0.02=746f%2C55f
> 
> Overall we got the following SPECrate improvements:
> 
>  SPECfp2k6   kabylake generic  +7.15%
>  SPECfp2k6   kabylake native   +9.36%
>  SPECfp2k17  kabylake generic  +5.36%
>  SPECfp2k17  kabylake native   +6.03%
>  SPECint2k17 kabylake generic  +4.13%
> 
>  SPECfp2k6   zen      generic  +9.98%
>  SPECfp2k6   zen      native   +7.04%
>  SPECfp2k17  zen      generic  +6.11%
>  SPECfp2k17  zen      native   +5.46%
>  SPECint2k17 zen      generic  +3.61%
>  SPECint2k17 zen      native   +5.18%
> 
> The performance results seem to be surprisingly strongly in favor of
> vectorization.  Martin's setup also checks code size, which goes up
> by as much as 26% on leslie3d, but since many of the benchmarks are small,
> this is not very representative of the overall code size/compile time
> costs of vectorization.
> 
> I measured compile time/size on larger programs I have available, with
> notable changes on DealII, but otherwise sub 1% increases.  I also
> benchmarked Firefox, but there are no significant differences because
> the build system already uses -O3 for places where it matters (graphics
> library etc.)

Well, just as the compile-time/size of SPEC is not representative,
neither are the performance improvements.

>                    Compile time   code segment size
> Firefox mainline   in noise       0.8%
> gcc from spec2k6   0.5%           0.6%
> gdb                0.8%           0.3%
> crafty             0%             0%
> DealII             3.2%           4%
> 
> Note that I benchmarked -ftree-slp-vectorize separately before and
> the results were hit-and-miss, so perhaps enabling only -ftree-vectorize
> would give better compile time tradeoffs. I was worried about partial
> memory stalls, but I will benchmark it and also benchmark the difference
> between cost models.
>
> There are some performance regressions, most notably in SPEC
>  - exchange (all settings),
>  - gamess (all settings),
>  - calculix (zen native only),
>  - bwaves (zen native)
> and, from Polyhedron, induct2 (all settings) and ffft2 (zen only). Botan
> seems very noisy, but it is rather special code.
> 
> Exchange can be fixed by adding a heuristic that it is a bad idea to
> vectorize within a loop nest of depth 10 containing a recursive call. I
> believe gamess and calculix are understood and I can look into the
> remaining cases.
> 
> Overall I am surprised how many improvements vectorization at -O2 can
> bring - clearly more parallel CPUs depend on it.  In my experience
> from analyzing regressions of gcc -O2 compared to clang -O2 builds,
> vectorization is one of the most common reasons. Having gcc -O2 produce
> lower SPEC scores than clang -O2 at comparable binary sizes does not
> feel OK and I think the problem is not limited just to artificial
> benchmarks.
> 
> Even though it is late in the release cycle, I wonder if we can do that
> for GCC 9?  Performance of vectorization is very architecture specific; I
> would propose enabling vectorization for Zen, Core-based chips and
> generic in x86-64. I can also run benchmarks on Bulldozer. I can then
> tune down the cheap model to avoid some of the more expensive
> transformations.

I'd rather not do this now, it's _way_ too late (also considering
you are again doing inliner tuning so late).

See our last attempts at this btw.

Richard.
 
> Honza
> 
> 
> Kabylake Spec2k6, generic tuning
> 
>   improvements:
> SPEC2006/FP/481.wrf        -31.33%
> SPEC2006/FP/436.cactusADM  -28.17%
> SPEC2006/FP/437.leslie3d   -17.21%
> SPEC2006/FP/434.zeusmp     -12.90%
> SPEC2006/FP/454.calculix   -6.44%
> SPEC2006/FP/433.milc       -6.03%
> SPEC2006/FP/459.GemsFDTD   -4.65%
> SPEC2006/FP/450.soplex     -2.11%
> SPEC2006/INT/403.gcc       -6.54%
> SPEC2006/INT/456.hmmer     -5.45%
> SPEC2006/INT/464.h264ref   -2.23%
>   regressions:
> SPEC2006/FP/416.gamess     8.51%
> SPEC2006/FP/447.dealII     2.73%
> 
> Kabylake spec2k6 -march=native
> 
>   improvements:
> SPEC2006/FP/436.cactusADM  -45.52%
> SPEC2006/FP/481.wrf        -34.13%
> SPEC2006/FP/434.zeusmp     -20.25%
> SPEC2006/FP/437.leslie3d   -19.44%
>