[Bug rtl-optimization/55278] [4.8/4.9 Regression] Botan performance regressions apparently due to LRA

jakub at gcc dot gnu.org Fri, 10 May 2013 13:20:04 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55278


Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #12 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Ok, I've looked a little bit at the #c6 testcase.
On i7-2600 CPU I get:
-O3 -mavx
1mo old gcc trunk        0m6.347s
1mo old gcc trunk reload 0m6.517s
curr trunk               0m6.049s
trunk unroll             0m5.800s
trunk unroll p1          0m5.671s
trunk unroll p1+p2       0m5.691s
clang 3.2                0m6.007s
clang 3.3 svn            0m6.003s
icc                      0m5.364s

where unroll is -funroll-loops --param max-completely-peeled-insns=500 (to
match what icc does, which aggressively unrolls the huge inner loop that
iterates exactly 4 times), and p1 is an experimental hack:
+(define_insn "*<code>hi_1"
+  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,rm")
+        (any_or:HI
+         (match_operand:HI 1 "nonimmediate_operand" "%0,0")
+         (match_operand:HI 2 "general_operand" "rn,ri")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (<CODE>, HImode, operands)"
+  "<logic>{l}\t{%k2, %k0|%k0, %k2}"
+  [(set_attr "type" "alu")
+   (set_attr "mode" "SI")])

(this forces gcc to avoid xorw memory, %hireg and instead use movzwl memory,
%sireg; ... xorl %sireg, %sireg2), and p2 was something similar for *xorqi_1.
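To illustrate the transformation p1 is aiming for, here is a minimal C sketch
of the two forms. The function names are hypothetical and not taken from the
testcase, and which instructions a given compiler actually emits for either
function depends on its version and flags; the point is only that the widened
form computes the same 16-bit result.

```c
#include <assert.h>
#include <stdint.h>

/* HImode form: a 16-bit xor with a memory operand. */
static uint16_t xor_himode(uint16_t acc, const uint16_t *p)
{
    acc ^= *p;
    return acc;
}

/* Widened form: zero-extending load (movzwl-style), then a 32-bit xor,
   truncating only when the 16-bit result is needed. */
static uint16_t xor_simode(uint16_t acc, const uint16_t *p)
{
    uint32_t wide = *p;                 /* zero-extended load */
    uint32_t r = (uint32_t)acc ^ wide;  /* 32-bit arithmetic */
    return (uint16_t)r;                 /* keep the low 16 bits */
}

/* Hypothetical self-check: both forms agree on a few sample values. */
static void check_xor_forms(void)
{
    const uint16_t table[4] = { 0x0000, 0x00ff, 0xa5a5, 0xffff };
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            assert(xor_himode(table[i], &table[j])
                   == xor_simode(table[i], &table[j]));
}
```

Calling check_xor_forms() from a test driver should pass silently.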

Looking at the icc generated assembly, it is interesting to see that the only
HImode instructions it ever uses are rolw and movw stores; for everything else
it uses movzwl loads and SImode arithmetic (well, I guess right shifts and
rotates, shrw/sarw/rorw, can't be avoided either).  Similarly, icc doesn't emit
any QImode instructions at all on the testcase, while gcc emits tons of them
and llvm is somewhere in between.
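A small C sketch of why rotates and right shifts are the exceptions, assuming
the value is held in a 32-bit register whose upper half may contain garbage.
The helper names are hypothetical. A left shift only moves bits away from the
low half, so the low 16 bits of the result are unaffected by the garbage; a
right shift pulls the garbage down into the low half, and a 16-bit rotate has
to wrap at bit 15 rather than bit 31.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left within 16 bits (what rolw computes). */
static uint16_t rotl16(uint16_t x, unsigned n)
{
    n &= 15;
    return (uint16_t)((x << n) | (x >> ((16 - n) & 15)));
}

/* Low 16 bits of a 32-bit left shift (n < 16 assumed). */
static uint16_t shl_low16(uint32_t wide, unsigned n)
{
    return (uint16_t)(wide << n);
}

/* Low 16 bits of a 32-bit logical right shift (n < 16 assumed). */
static uint16_t shr_low16(uint32_t wide, unsigned n)
{
    return (uint16_t)(wide >> n);
}

/* Hypothetical self-check with clean and garbage upper halves. */
static void check_shift_widths(void)
{
    const uint32_t clean = 0x00001234u;
    const uint32_t dirty = 0xdead1234u;   /* same low half, garbage above */

    /* Left shift: garbage in the upper half does not matter. */
    assert(shl_low16(clean, 4) == shl_low16(dirty, 4));

    /* Right shift: garbage leaks into the low 16 bits. */
    assert(shr_low16(clean, 4) != shr_low16(dirty, 4));

    /* A 16-bit rotate wraps bit 15 into bit 0. */
    assert(rotl16(0x1234, 4) == 0x2341);
}
```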

So perhaps this bug is not about LRA, but about instruction selection, and when
not optimizing for size, at least on some CPUs, we should consider using SImode
arithmetic instead of QImode/HImode much more aggressively than we do now.
I'm not sure whether this is better done by (Kai's?) type optimization pass,
which shortly before expansion would use target hints to get rid of as many
QImode and especially HImode operations as possible (I guess we can often keep
complete garbage in the upper bits), or whether it is better done at the *.md
level.
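As a rough illustration of the "garbage in the upper bits" point, the sketch
below compares a chain of 16-bit operations truncated at every step with the
same chain done in 32 bits and truncated once at the end. The function names
and the particular mix of operations are made up for this example. For add,
subtract and xor (and likewise and/or), the low 16 bits of the result depend
only on the low 16 bits of the inputs, so the two versions agree even when the
wide inputs carry arbitrary upper halves.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow version: every intermediate is truncated to 16 bits. */
static uint16_t mix_narrow(uint16_t a, uint16_t b, uint16_t c)
{
    uint16_t t = (uint16_t)(a + b);
    t = (uint16_t)(t ^ c);
    t = (uint16_t)(t - a);
    return t;
}

/* Wide version: 32-bit arithmetic throughout, one truncation at the end.
   The upper 16 bits of a, b and c may hold anything. */
static uint16_t mix_wide(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t t = a + b;
    t ^= c;
    t -= a;
    return (uint16_t)t;
}

/* Hypothetical self-check: pollute the upper halves and compare. */
static void check_mix(void)
{
    const uint16_t v[4] = { 0x0000, 0x0020, 0x00ff, 0xfff0 };
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                assert(mix_narrow(v[i], v[j], v[k])
                       == mix_wide(0xdead0000u | v[i],
                                   0xbeef0000u | v[j],
                                   0x12340000u | v[k]));
}
```

The same argument does not carry over to right shifts, division or comparisons,
which would need the upper bits cleaned up first.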

[Bug rtl-optimization/55278] [4.8/4.9 Regression] Botan performance regressions apparently due to LRA
