[Bug tree-optimization/57315] LTO and/or vectorizer performance regression on salsa20 core, 4.7->4.8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Andrew Pinski changed:

           What    |Removed     |Added
----------------------------------------
             Status|UNCONFIRMED |RESOLVED
         Resolution|---         |FIXED
   Target Milestone|---         |5.0

--- Comment #5 from Andrew Pinski ---
Fixed so closing.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

--- Comment #4 from Vladimir Makarov ---
Zack, thanks for reporting this. Crypto algorithms are very interesting cases for RA.

Many performance improvements were made to RA during gcc-4.9 development. On Intel Haswell I now get:

bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
779.132 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
778.976 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
1392.555 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1375.610 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1224.177 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1436.539 keys/s

Here, trunk5 is today's GCC trunk.

Unfortunately, the changes in RA are too big and cannot be ported to gcc-4.8.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------
           Keywords|            |ra
                 CC|            |vmakarov at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
The tree opt code is quite the same for 4.8 and 4.7 at -O3 -fwhole-program, so I believe this boils down to spilling/register allocation (LRA vs. reload).

We inline everything into main () (even at -O2) and we don't vectorize anything at -O3.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

--- Comment #2 from Zack Weinberg ---
Created attachment 30210
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30210&action=edit
self-contained test case

Here's a self-contained test case.

$ gcc-4.7 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
875.178 keys/s
$ gcc-4.8 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
808.869 keys/s
$ gcc-4.7 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
867.879 keys/s
$ gcc-4.8 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
800.794 keys/s
$ gcc-4.7 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out
606.605 keys/s
$ gcc-4.8 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out
571.935 keys/s

These numbers are stable to within about 1 key/s. So there's a 6-8% regression from 4.7 to 4.8 regardless of optimization level, but also -O3 and -O3 -fwhole-program are inferior to -O2 for this program, with both compilers. (-O2 -fwhole-program is within noise of just -O2 for both.)

With 4.8, -march=native on my computer expands to

  -march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt
  -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2
  -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm
  -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw
  -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=0
  --param l1-cache-line-size=0 --param l2-cache-size=256
  -mtune=corei7-avx
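
For anyone rechecking the percentages in comment #2, the relative slowdown can be recomputed from the raw keys/s figures. The following is only a sketch: the helper name regression_percent and the dictionary layout are illustrative and not part of the attached test case; the numbers are copied from the timings in that comment.

```python
# Throughput in keys/s as (gcc-4.7, gcc-4.8), copied from comment #2.
results = {
    "-O2": (875.178, 808.869),
    "-O3": (867.879, 800.794),
    "-O3 -fwhole-program": (606.605, 571.935),
}


def regression_percent(old, new):
    """Return how much lower `new` is than `old`, as a percentage of `old`."""
    return (old - new) / old * 100.0


# Print the 4.7 -> 4.8 slowdown for each set of flags.
for flags, (gcc47, gcc48) in results.items():
    print(f"{flags:22s} {regression_percent(gcc47, gcc48):.1f}%")
```

With these inputs the slowdown comes out at about 7.6% for -O2, 7.7% for -O3, and 5.7% for -O3 -fwhole-program, which is roughly the 6-8% range quoted.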
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57315

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------
           Keywords|            |missed-optimization
             Target|            |x86_64-*-*
                 CC|            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
Please at least reproduce the "core function" as a separate compilable testcase here, together with the flags used for the build. Also please try to factor out LTO ...