[Bug target/27855] reassociation pass produces ~30% slower matrix multiplication code
--- Comment #15 from ubizjak at gmail dot com 2007-07-09 18:16 ---
New timings on x86_64 core2 (from [1])

The tests were performed on core2 in 64bit mode, using '-DREPS=1 -O3 -msse3 -march=core2 -ffast-math' flags, with and without the newly introduced -fno-tree-reassoc flag. The results were _interesting_, showing extreme differences in the run times:

w/o -fno-tree-reassoc:

               ALGORITHM  NB  REPS   TIME   MFLOPS
               =========  ==  ====  =====  =======
-DTYPE=float:  atlasmm    60     1  2.000  2159.87
-DTYPE=double: atlasmm    60     1  2.500  1727.89

w/ -fno-tree-reassoc:

               ALGORITHM  NB  REPS   TIME   MFLOPS
               =========  ==  ====  =====  =======
-DTYPE=float:  atlasmm    60     1  0.932  4634.90
-DTYPE=double: atlasmm    60     1  1.520  2841.93

[1] http://gcc.gnu.org/ml/gcc-patches/2007-07/msg00849.html

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27855
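
For readers who want the size of the gap as a ratio, here is a small sketch that divides the MFLOPS figures reported in comment #15. The numbers are copied from the tables; the dictionary layout and the helper name `speedup` are purely illustrative and not part of the benchmark.

```python
# MFLOPS figures copied from comment #15 (atlasmm, NB=60, REPS=1).
mflops = {
    "float": {"reassoc": 2159.87, "no_reassoc": 4634.90},
    "double": {"reassoc": 1727.89, "no_reassoc": 2841.93},
}


def speedup(with_reassoc, without_reassoc):
    """Throughput ratio of the -fno-tree-reassoc build over the default build."""
    return without_reassoc / with_reassoc


ratios = {
    dtype: speedup(v["reassoc"], v["no_reassoc"]) for dtype, v in mflops.items()
}

for dtype, ratio in ratios.items():
    print(f"{dtype}: {ratio:.2f}x the throughput with -fno-tree-reassoc")
```

On these figures the float kernel is a bit more than twice as fast with the pass disabled, and the double kernel is roughly 1.6 times as fast.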

--- Comment #16 from uros at gcc dot gnu dot org 2007-07-09 19:22 ---
Subject: Bug 27855

Author: uros
Date: Mon Jul 9 19:22:03 2007
New Revision: 126491

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=126491
Log:
    PR target/27855
    * doc/extend.texi: Add ftree-reassoc flag.
    * common.opt (ftree-reassoc): New flag.
    * tree-ssa-reassoc.c (gate_tree_ssa_reassoc): New static function.
    (struct tree_opt_pass pass_reassoc): Use gate_tree_ssa_reassoc.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/common.opt
    trunk/gcc/doc/invoke.texi
    trunk/gcc/tree-ssa-reassoc.c

--- Comment #13 from ubizjak at gmail dot com 2007-03-02 15:34 ---
Any news about this problem? Current mainline still has severe problems:

-msse3 -O2 -mfpmath=sse -ffast-math
GCC 4.3 -ffast-math double performance:

ALGORITHM  NB  REPS   TIME   MFLOPS
=========  ==  ====  =====  =======
atlasmm    60  1000  0.288  1499.91

-msse3 -O2 -mfpmath=sse
GCC 4.3 double performance:

ALGORITHM  NB  REPS   TIME   MFLOPS
=========  ==  ====  =====  =======
atlasmm    60  1000  0.192  2249.86

-msse3 -O2 -mfpmath=sse -ffast-math
GCC 4.3 -ffast-math single performance:

ALGORITHM  NB  REPS   TIME   MFLOPS
=========  ==  ====  =====  =======
atlasmm    60  1000  0.304  1420.96

-msse3 -O2 -mfpmath=sse
GCC 4.3 single performance:

ALGORITHM  NB  REPS   TIME   MFLOPS
=========  ==  ====  =====  =======
atlasmm    60  1000  0.172  2511.48

Please consider the fact that all benchmarks are using -ffast-math nowadays. ;)

--- Comment #14 from pinskia at gcc dot gnu dot org 2007-03-02 17:42 ---
> Please consider the fact that all benchmarks are using -ffast-math nowadays. ;)

Please also consider the fact that the register allocator has been broken since 20 years ago :) :) :) :).

And I repeat again, this has nothing to do with -ffast-math, see my comment #6 and #7 where I prove -ffast-math is not the issue and it is just the register allocator going wrong. If anyone disables reassociation at the tree level, I am going to object loudly.

--- Comment #11 from steven at gcc dot gnu dot org 2006-10-07 10:05 ---
Would anyone object if I'd propose to disable reassociation for floating point thingies on x86 for GCC 4.2? We can re-enable it if/when amacleod's new out-of-ssa stuff fixes this for real...

--- Comment #12 from pinskia at gcc dot gnu dot org 2006-10-07 16:36 ---
(In reply to comment #11)
> Would anyone object if I'd propose to disable reassociation for floating point thingies on x86 for GCC 4.2? We can re-enable it if/when amacleod's new out-of-ssa stuff fixes this for real...

Yes, I do, because it helps PPC, which is why I added it in the first place.

--- Comment #9 from amacleod at redhat dot com 2006-06-05 15:46 ---
This thread is moving dangerously close to work in progress.. :-)

I'll have something more definitive to say about it in a few weeks, but I'm looking at doing significant register pressure reduction at out of ssa time as a component of a larger hunk of work.

--- Comment #10 from dberlin at gcc dot gnu dot org 2006-06-05 15:57 ---
(In reply to comment #9)
> This thread is moving dangerously close to work in progress.. :-)
>
> I'll have something more definitive to say about it in a few weeks, but I'm looking at doing significant register pressure reduction at out of ssa time as a component of a larger hunk of work.

Just be careful. The last time I tried something like this, it produced good code on x86, and really crappy code everywhere else.

--- Comment #8 from steven at gcc dot gnu dot org 2006-06-03 23:49 ---
You could add a basic block list scheduler at the tree level just before out-of-ssa, with heuristics to make life times as short as possible :-)
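
To make the suggestion in comment #8 concrete, here is a toy sketch of a greedy list scheduler for one straight-line block. It is not GCC code and not what anyone in this thread implemented: the function name `schedule_for_short_lifetimes`, the `(dest, operands)` tuple format, and the heuristic (pick the ready statement that changes the live-value count the least, ties broken by original order) are all assumptions made for illustration. It also assumes the input is already in a valid definition-before-use order.

```python
def schedule_for_short_lifetimes(stmts):
    """Greedily reorder a straight-line block to keep live ranges short.

    stmts: list of (dest, operands) tuples in a valid def-before-use order.
    Names that are never a dest are treated as block inputs.
    """
    defined = {dest for dest, _ in stmts}

    # How many not-yet-scheduled uses each value still has.
    pending_uses = {}
    for _, operands in stmts:
        for name in operands:
            pending_uses[name] = pending_uses.get(name, 0) + 1

    def pressure_delta(stmt):
        # +1 for the new value, -1 for every operand whose last use this is.
        _, operands = stmt
        freed = sum(
            1 for name in set(operands)
            if pending_uses[name] == operands.count(name)
        )
        return 1 - freed

    scheduled = []
    available = set()  # block-defined values that have been computed
    remaining = list(stmts)
    while remaining:
        ready = [
            s for s in remaining
            if all(n in available or n not in defined for n in s[1])
        ]
        best = min(ready, key=pressure_delta)  # first minimum wins ties
        remaining.remove(best)
        scheduled.append(best)
        available.add(best[0])
        for name in best[1]:
            pending_uses[name] -= 1
    return scheduled


# Three multiplies hoisted above the adds that consume them.
block = [
    ("D0", ["rA0", "rB0"]),
    ("D1", ["rA1", "rB0"]),
    ("D2", ["rA2", "rB0"]),
    ("rC0_1", ["D0", "rC0_0"]),
    ("rC1_1", ["D1", "rC1_0"]),
    ("rC2_1", ["D2", "rC2_0"]),
]
reordered = schedule_for_short_lifetimes(block)
for dest, operands in reordered:
    print(dest, "=", " op ".join(operands))
```

On this input the scheduler pulls the first add up directly behind its multiply, because the add retires two values and creates one. A real pass would also have to weigh latency and the target's register file, which is the concern raised in comment #10.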

--- Comment #2 from uros at kss-loka dot si 2006-06-02 10:04 ---
(In reply to comment #1)
> There is nothing special about reassociation at all. In fact what you are seeing is register allocator going funky. This is what you get with x87.

This is also what you get with SSE.

--- Comment #3 from pinskia at gcc dot gnu dot org 2006-06-02 10:19 ---
(In reply to comment #2)
> This is also what you get with SSE.

And how many registers does SSE have? Not many. Try it on PPC or any processor that has more registers.

--- Comment #4 from steven at gcc dot gnu dot org 2006-06-02 23:19 ---
Real bug, despite Andrew's usual portion of x86-hate.

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|-00-00 00:00:00             |2006-06-02 23:19:36
               date|                            |

--- Comment #5 from dberlin at gcc dot gnu dot org 2006-06-03 02:11 ---
Subject: Re: reassociation pass produces ~30% slower matrix multiplication code

steven at gcc dot gnu dot org wrote:
> --- Comment #4 from steven at gcc dot gnu dot org 2006-06-02 23:19 ---
> Real bug, despite Andrew's usual portion of x86-hate.

It'd be good to know what exactly is going wrong.

Reassociation only touches floating point because someone asked me to make it touch floating point. It still shouldn't have *this* much of an effect; my guess is it is triggering some bad behavior elsewhere.

--- Comment #6 from pinskia at gcc dot gnu dot org 2006-06-03 02:38 ---
What reassociation is doing is scheduling the instructions further down, before the use, but it extends the lifetime of some variables.

e.g.:

    D.1563_59 = rA0_49 * rB0_50;
    rC0_0_60 = D.1563_59 + rC0_0_516;
    D.1564_61 = rA1_52 * rB0_50;
    rC1_0_62 = D.1564_61 + rC1_0_517;
    D.1565_63 = rA2_54 * rB0_50;
    rC2_0_64 = D.1565_63 + rC2_0_518;
    D.1566_65 = rA3_56 * rB0_50;
    rC3_0_66 = D.1566_65 + rC3_0_519;
    D.1567_67 = rA4_58 * rB0_50;
    rC4_0_68 = D.1567_67 + rC4_0_520;

into:

    D.1563_59 = rB0_50 * rA0_49;
    D.1564_61 = rA1_52 * rB0_50;
    D.1565_63 = rA2_54 * rB0_50;
    D.1566_65 = rA3_56 * rB0_50;
    D.1567_67 = rA4_58 * rB0_50;
    . (with loads, etc here)
    D.1563_477 = rB0_468 * rA0_466;
    rC0_0_60 = D.1563_59 + rC0_0_516;
    rC0_0_82 = rC0_0_60 + D.1563_81;
    rC0_0_104 = rC0_0_82 + D.1563_103;
    rC0_0_126 = rC0_0_104 + D.1563_125;
    rC0_0_148 = rC0_0_126 + D.1563_147;
    rC0_0_170 = rC0_0_148 + D.1563_169;
    rC0_0_192 = rC0_0_170 + D.1563_191;
    rC0_0_214 = rC0_0_192 + D.1563_213;
    rC0_0_236 = rC0_0_214 + D.1563_235;
    rC0_0_258 = rC0_0_236 + D.1563_257;
    rC0_0_280 = rC0_0_258 + D.1563_279;
    rC0_0_302 = rC0_0_280 + D.1563_301;
    rC0_0_324 = rC0_0_302 + D.1563_323;
    rC0_0_346 = rC0_0_324 + D.1563_345;
    rC0_0_368 = rC0_0_346 + D.1563_367;
    rC0_0_390 = rC0_0_368 + D.1563_389;
    rC0_0_412 = rC0_0_390 + D.1563_411;
    rC0_0_434 = rC0_0_412 + D.1563_433;
    rC0_0_456 = rC0_0_434 + D.1563_455;
    rC0_0_478 = rC0_0_456 + D.1563_477;

Which in and of itself is not surprising, and not something which reassociation should handle specially. This is what we get with a semi bad register allocation which does nothing to reduce spilling.
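
The effect described in comment #6 can be approximated with a small live-range count. The sketch below is a simplified model, not GCC's liveness analysis: it assumes 5 accumulators and 20 unrolled steps (read off the dump, which shows rC0 to rC4 and a 20-add chain), treats each load as a statement with no operands, and uses made-up names (`max_live`, `rA{i}_{k}`, and so on) rather than the SSA names in the dump.

```python
def max_live(schedule, live_out=()):
    """Peak number of simultaneously live values in a straight-line block.

    schedule: list of (dest, operands) tuples in execution order.
    live_out: names that must remain live after the last statement.
    """
    last_use = {}
    for i, (_, operands) in enumerate(schedule):
        for name in operands:
            last_use[name] = i
    for name in live_out:
        last_use[name] = len(schedule)

    defined = {dest for dest, _ in schedule}
    live = {name for name in last_use if name not in defined}  # block inputs
    peak = len(live)
    for i, (dest, operands) in enumerate(schedule):
        live -= {n for n in operands if last_use[n] == i}  # operands die here
        if last_use.get(dest, -1) > i:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


M, K = 5, 20  # assumed: 5 accumulators, 20 unrolled steps

original, hoisted_muls, add_chains = [], [], []
for k in range(K):
    load_b = (f"rB{k}", [])
    original.append(load_b)
    hoisted_muls.append(load_b)
    for i in range(M):
        load_a = (f"rA{i}_{k}", [])
        mul = (f"D{i}_{k}", [f"rA{i}_{k}", f"rB{k}"])
        add = (f"rC{i}_{k + 1}", [f"rC{i}_{k}", f"D{i}_{k}"])
        original += [load_a, mul, add]     # add follows its multiply
        hoisted_muls += [load_a, mul]      # multiplies first ...
        add_chains.append((i, k, add))     # ... add chains afterwards

add_chains.sort(key=lambda t: (t[0], t[1]))  # one chain per accumulator
reassociated = hoisted_muls + [add for _, _, add in add_chains]

live_out = [f"rC{i}_{K}" for i in range(M)]
peak_original = max_live(original, live_out)
peak_reassociated = max_live(reassociated, live_out)
print("peak live values, original order:    ", peak_original)
print("peak live values, reassociated order:", peak_reassociated)
```

In this model the original order keeps only a handful of values live at once, while the reassociated order keeps every product alive until its add chain runs, so the peak grows with M*K. x86-64 SSE offers 16 XMM registers, so the second shape cannot avoid spilling unless something later shortens those live ranges again.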

--- Comment #7 from pinskia at gcc dot gnu dot org 2006-06-03 02:49 ---
If you change the code to use integers, reassociation causes the same drop even without -ffast-math, so it is unrelated to the fact that -ffast-math turns on reassociation for floating point.

So what is happening is that the add to rC[0-4]_0 is being moved further down, which causes the variables' lifetimes to be extended. Yes, there should be a pass at the tree level which optimizes variable lifetimes, but I don't know how much use that is without a better register allocator in the first place.

--- Comment #1 from pinskia at gcc dot gnu dot org 2006-06-01 15:26 ---
There is nothing special about reassociation at all. In fact what you are seeing is register allocator going funky. This is what you get with x87.

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target
           Keywords|                            |ra