[Bug tree-optimization/57315] LTO and/or vectorizer performance regression on salsa20 core, 4.7->4.8

2016-01-26 pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |5.0

--- Comment #5 from Andrew Pinski ---
Fixed, so closing.

2013-12-04 vmakarov at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

--- Comment #4 from Vladimir Makarov ---
Zack, thanks for reporting this.  Crypto algorithms are very interesting
cases for the register allocator (RA).  A lot of performance improvements
were made to RA during gcc-4.9 development.  On an Intel Haswell machine I
now get:

bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 779.132 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 778.976 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
1392.555 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1375.610 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1224.177 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1436.539 keys/s

Here, trunk5 is today's GCC trunk.

Unfortunately, the RA changes are too big to be backported to gcc-4.8.
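
To put the Haswell figures above side by side, here is a minimal Python sketch that expresses each result as a signed change relative to the 4.7 build. The keys/s values are copied from the runs above; the names `HASWELL` and `change_vs_baseline` are made up for illustration and are not part of any GCC tooling.

```python
# Throughput figures (keys/s) copied from the Haswell runs above.
# The names below are illustrative assumptions, not GCC tooling.
HASWELL = {
    "-O2": {"4.7": 779.132, "4.8": 778.976, "trunk": 1392.555},
    "-O3 -fwhole-program": {"4.7": 1375.610, "4.8": 1224.177, "trunk": 1436.539},
}


def change_vs_baseline(baseline: float, value: float) -> float:
    """Return the signed throughput change relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


for flags, runs in HASWELL.items():
    for compiler in ("4.8", "trunk"):
        delta = change_vs_baseline(runs["4.7"], runs[compiler])
        print(f"{flags:22s} {compiler:>5s} vs 4.7: {delta:+6.1f}%")
```

With these figures, trunk comes out roughly 79% faster than 4.7 at -O2 and about 4% faster at -O3 -fwhole-program, while 4.8 is about 11% slower than 4.7 in the latter configuration.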


2013-05-29 rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
The code coming out of the tree optimizers is nearly identical for 4.7 and
4.8 at -O3 -fwhole-program, so I believe this boils down to
spilling/register allocation (LRA vs. reload).

We inline everything into main () (even at -O2) and we don't
vectorize anything at -O3.


2013-05-28 zackw at panix dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

--- Comment #2 from Zack Weinberg ---
Created attachment 30210
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30210&action=edit
self-contained test case

Here's a self-contained test case.

$ gcc-4.7 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 875.178 keys/s
$ gcc-4.8 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 808.869 keys/s

$ gcc-4.7 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 867.879 keys/s
$ gcc-4.8 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 800.794 keys/s

$ gcc-4.7 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 606.605 keys/s
$ gcc-4.8 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 571.935 keys/s

These numbers are stable to within about 1 key/s.  So there's a 6-8% regression
from 4.7 to 4.8 regardless of optimization level, but also -O3 and -O3
-fwhole-program are inferior to -O2 for this program, with both compilers. 
(-O2 -fwhole-program is within noise of just -O2 for both.)
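
As a quick check on the percentages quoted above, the following Python sketch recomputes the 4.7 -> 4.8 slowdown from the reported keys/s figures. The helper name `regression_pct` and the `RESULTS` layout are illustrative only and are not part of the attached test case.

```python
# Throughput figures (keys/s) copied from the runs reported above,
# stored as (gcc-4.7, gcc-4.8) pairs per set of flags.
# The helper name and data layout are illustrative assumptions.
RESULTS = {
    "-O2": (875.178, 808.869),
    "-O3": (867.879, 800.794),
    "-O3 -fwhole-program": (606.605, 571.935),
}


def regression_pct(old: float, new: float) -> float:
    """Return the throughput loss going from old to new, as a percentage of old."""
    return (old - new) / old * 100.0


for flags, (gcc47, gcc48) in RESULTS.items():
    print(f"{flags:22s} {regression_pct(gcc47, gcc48):5.2f}% slower with 4.8")
```

The computed slowdowns fall between roughly 5.7% and 7.7%, which is consistent with the "6-8%" summary.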

With 4.8, -march=native on my computer expands to

-march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm
-mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx
-mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c
-mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt
--param l1-cache-size=0 --param l1-cache-line-size=0 --param l2-cache-size=256
-mtune=corei7-avx


2013-05-21 rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Please at least reproduce the "core function" as a separate compilable testcase
here, together with the flags used for the build.  Also please try to factor out
LTO ...