On Fri, 5 Feb 2016, Vladimir Makarov wrote: > On 02/05/2016 04:25 AM, Richard Biener wrote: > > The following patch fixes the performance regression for 435.gromacs > > on x86_64 with AVX2 (Haswell or bdver2) caused by > > > > 2015-12-18 Andreas Krebbel <kreb...@linux.vnet.ibm.com> > > > > * ira.c (ira_setup_alts): Move the scan for commutative modifier > > to the first loop to make it work even with disabled alternatives. > > > > which in itself is a desirable change giving the RA more freedom. > > > > It turns out the fix makes an existing issue more severe in detecting > > more swappable alternatives and thus exiting ira_setup_alts with > > operands swapped in recog_data. This seems to give a slight preference > > to choose alternatives with the operands swapped (I didn't try to > > investigate how IRA handles the "merged" alternative mask and > > operand swapping in its further processing). > Alternative mask excludes alternatives which will be definitely rejected in > LRA. This approach is to speed up LRA (a lot was done to speed up RA but > still it consumes a big chunk of compiler time which is unusual for all > compilers). > > LRA and reload prefer insns without commutative operands swap when all other > costs are the same.
Ok, so leaving operands swapped in ira_setup_alts will prefer the swapped variants in case other costs are the same. > > Of course previous RTL optimizers and canonicalization rules as well > > as backend patterns are tuned towards the not swapped variant and thus > > it happens doing more swaps ends up in slower code (I didn't closely > > investigate). > > > > So I tested the following patch which simply makes sure that > > ira_setup_alts does not alter recog_data. > > > > On a Intel Haswell machine I get (base is with the patch, peak is with > > the above change reverted): > > > > Estimated > > Estimated > > Base Base Base Peak Peak Peak > > Benchmarks Ref. Run Time Ratio Ref. Run Time Ratio > > -------------- ------ --------- --------- ------ --------- > > --------- > > 435.gromacs 7140 264 27.1 S 7140 270 > > 26.5 S > > 435.gromacs 7140 264 27.1 * 7140 269 > > 26.6 S > > 435.gromacs 7140 263 27.1 S 7140 269 > > 26.5 * > > ============================================================================== > > 435.gromacs 7140 264 27.1 * 7140 269 > > 26.5 * > > > > which means the patched result is even better than before Andreas > > change. Current trunk homes in at a Run Time of 321s (which is > > the regression to fix). > Thanks for working on this, Richard. It is not easy to find reasons for > worse code on modern processors after such small change. As RA is based on > heuristics it hard to predict the change for a specific benchmark. I remember > I checked Andreas patch on SPEC2000 in a hope that it also improves x86-64 > code but I did not see a difference. > > It is even hard to say sometimes how a specific (non-heuristic) optimization > will affect a specific benchmark performance when a lot of unknown (from > alignments to CPU internals are involved). An year ago I tried to use ML to > choose best options. I used a set of about 100 C benchmarks (and even more > functions). For practically every benchmark, I had an option modification to > -Ofast resulting in faster code but ML prediction did not work at all. > > Bootstrap and regtest running on x86_64-unknown-linux-gnu, ok for trunk? > > > OK. Thanks again. Thanks. Over the weekend I did a full 3-run SPEC 2k6 with the following result. Base is r231814 while peak is r231814 patched, flags are -Ofast -march=native (on a Haswell machine). Estimated Estimated Base Base Base Peak Peak Peak Benchmarks Ref. Run Time Ratio Ref. Run Time Ratio -------------- ------ --------- --------- ------ --------- --------- 400.perlbench 9770 255 38.4 * 9770 251 39.0 S 400.perlbench 9770 258 37.8 S 9770 250 39.0 * 400.perlbench 9770 253 38.6 S 9770 250 39.0 S 401.bzip2 9650 407 23.7 * 9650 400 24.1 * 401.bzip2 9650 412 23.4 S 9650 417 23.1 S 401.bzip2 9650 396 24.4 S 9650 398 24.3 S 403.gcc 8050 252 31.9 S 8050 245 32.9 S 403.gcc 8050 240 33.6 S 8050 244 32.9 * 403.gcc 8050 242 33.2 * 8050 241 33.4 S 429.mcf 9120 243 37.6 S 9120 245 37.3 S 429.mcf 9120 224 40.7 S 9120 241 37.8 * 429.mcf 9120 225 40.5 * 9120 229 39.9 S 445.gobmk 10490 394 26.6 S 10490 393 26.7 S 445.gobmk 10490 394 26.6 * 10490 392 26.8 * 445.gobmk 10490 393 26.7 S 10490 392 26.8 S 456.hmmer 9330 340 27.4 S 9330 340 27.5 * 456.hmmer 9330 340 27.5 S 9330 340 27.5 S 456.hmmer 9330 340 27.5 * 9330 340 27.5 S 458.sjeng 12100 407 29.7 * 12100 407 29.8 * 458.sjeng 12100 406 29.8 S 12100 406 29.8 S 458.sjeng 12100 408 29.6 S 12100 407 29.8 S 462.libquantum 20720 300 69.0 S 20720 300 69.0 * 462.libquantum 20720 300 69.1 * 20720 300 69.2 S 462.libquantum 20720 300 69.1 S 20720 301 68.9 S 464.h264ref 22130 444 49.8 S 22130 444 49.9 S 464.h264ref 22130 443 50.0 S 22130 442 50.1 S 464.h264ref 22130 444 49.9 * 22130 443 50.0 * 471.omnetpp 6250 326 19.1 S 6250 328 19.1 S 471.omnetpp 6250 305 20.5 * 6250 324 19.3 * 471.omnetpp 6250 296 21.1 S 6250 305 20.5 S 473.astar 7020 313 22.4 * 7020 316 22.2 S 473.astar 7020 314 22.3 S 7020 309 22.7 S 473.astar 7020 308 22.8 S 7020 310 22.7 * 483.xalancbmk 6900 193 35.7 S 6900 193 35.7 S 483.xalancbmk 6900 189 36.5 * 6900 189 36.5 * 483.xalancbmk 6900 185 37.4 S 6900 189 36.6 S ============================================================================== 400.perlbench 9770 255 38.4 * 9770 250 39.0 * 401.bzip2 9650 407 23.7 * 9650 400 24.1 * 403.gcc 8050 242 33.2 * 8050 244 32.9 * 429.mcf 9120 225 40.5 * 9120 241 37.8 * 445.gobmk 10490 394 26.6 * 10490 392 26.8 * 456.hmmer 9330 340 27.5 * 9330 340 27.5 * 458.sjeng 12100 407 29.7 * 12100 407 29.8 * 462.libquantum 20720 300 69.1 * 20720 300 69.0 * 464.h264ref 22130 444 49.9 * 22130 443 50.0 * 471.omnetpp 6250 305 20.5 * 6250 324 19.3 * 473.astar 7020 313 22.4 * 7020 310 22.7 * 483.xalancbmk 6900 189 36.5 * 6900 189 36.5 * Est. SPECint(R)_base2006 32.8 Est. SPECint2006 32.5 Estimated Estimated Base Base Base Peak Peak Peak Benchmarks Ref. Run Time Ratio Ref. Run Time Ratio -------------- ------ --------- --------- ------ --------- --------- 410.bwaves 13590 196 69.4 S 13590 202 67.1 S 410.bwaves 13590 197 69.0 * 13590 197 69.1 S 410.bwaves 13590 199 68.3 S 13590 197 69.0 * 416.gamess 19580 592 33.1 * 19580 589 33.2 S 416.gamess 19580 592 33.1 S 19580 588 33.3 S 416.gamess 19580 593 33.0 S 19580 588 33.3 * 433.milc 9180 350 26.2 S 9180 347 26.5 S 433.milc 9180 335 27.4 * 9180 334 27.5 S 433.milc 9180 334 27.5 S 9180 337 27.2 * 434.zeusmp 9100 231 39.3 S 9100 232 39.2 S 434.zeusmp 9100 233 39.1 S 9100 233 39.1 * 434.zeusmp 9100 232 39.2 * 9100 235 38.8 S 435.gromacs 7140 316 22.6 S 7140 264 27.1 S 435.gromacs 7140 314 22.7 S 7140 263 27.2 S 435.gromacs 7140 316 22.6 * 7140 263 27.1 * 436.cactusADM 11950 201 59.3 * 11950 211 56.6 S 436.cactusADM 11950 214 55.7 S 11950 205 58.3 * 436.cactusADM 11950 198 60.4 S 11950 201 59.5 S 437.leslie3d 9400 218 43.1 S 9400 219 42.9 S 437.leslie3d 9400 219 42.9 * 9400 219 42.9 * 437.leslie3d 9400 220 42.8 S 9400 220 42.7 S 444.namd 8020 301 26.7 S 8020 302 26.5 S 444.namd 8020 300 26.7 * 8020 302 26.5 * 444.namd 8020 300 26.7 S 8020 302 26.6 S 447.dealII 11440 248 46.2 S 11440 245 46.7 S 447.dealII 11440 248 46.0 S 11440 246 46.4 S 447.dealII 11440 248 46.2 * 11440 246 46.5 * 450.soplex 8340 216 38.6 S 8340 223 37.5 S 450.soplex 8340 215 38.8 * 8340 215 38.7 S 450.soplex 8340 214 38.9 S 8340 216 38.6 * 453.povray 5320 121 44.1 S 5320 119 44.8 * 453.povray 5320 120 44.2 * 5320 117 45.3 S 453.povray 5320 120 44.3 S 5320 121 44.0 S 454.calculix 8250 277 29.8 S 8250 277 29.8 S 454.calculix 8250 277 29.8 * 8250 276 29.9 * 454.calculix 8250 277 29.7 S 8250 276 29.9 S 459.GemsFDTD 10610 326 32.5 S 10610 333 31.8 S 459.GemsFDTD 10610 316 33.6 S 10610 318 33.4 S 459.GemsFDTD 10610 317 33.5 * 10610 320 33.2 * 465.tonto 9840 421 23.4 S 9840 419 23.5 S 465.tonto 9840 419 23.5 S 9840 419 23.5 * 465.tonto 9840 420 23.5 * 9840 418 23.5 S 470.lbm 13740 253 54.2 * 13740 254 54.1 S 470.lbm 13740 251 54.6 S 13740 251 54.8 S 470.lbm 13740 254 54.0 S 13740 251 54.7 * 481.wrf 11170 291 38.4 S 11170 293 38.2 S 481.wrf 11170 288 38.8 S 11170 288 38.8 S 481.wrf 11170 289 38.6 * 11170 289 38.6 * 482.sphinx3 19490 398 49.0 S 19490 406 48.0 S 482.sphinx3 19490 406 48.1 S 19490 401 48.6 S 482.sphinx3 19490 399 48.9 * 19490 401 48.6 * ============================================================================== 410.bwaves 13590 197 69.0 * 13590 197 69.0 * 416.gamess 19580 592 33.1 * 19580 588 33.3 * 433.milc 9180 335 27.4 * 9180 337 27.2 * 434.zeusmp 9100 232 39.2 * 9100 233 39.1 * 435.gromacs 7140 316 22.6 * 7140 263 27.1 * 436.cactusADM 11950 201 59.3 * 11950 205 58.3 * 437.leslie3d 9400 219 42.9 * 9400 219 42.9 * 444.namd 8020 300 26.7 * 8020 302 26.5 * 447.dealII 11440 248 46.2 * 11440 246 46.5 * 450.soplex 8340 215 38.8 * 8340 216 38.6 * 453.povray 5320 120 44.2 * 5320 119 44.8 * 454.calculix 8250 277 29.8 * 8250 276 29.9 * 459.GemsFDTD 10610 317 33.5 * 10610 320 33.2 * 465.tonto 9840 420 23.5 * 9840 419 23.5 * 470.lbm 13740 253 54.2 * 13740 251 54.7 * 481.wrf 11170 289 38.6 * 11170 289 38.6 * 482.sphinx3 19490 399 48.9 * 19490 401 48.6 * Est. SPECfp(R)_base2006 38.0 Est. SPECfp2006 38.4 So overall the patch is a loss for SPEC CPU 2006 INT due to the 429.mcf and 471.omnetpp regressions and a win on SPEC FP. (I didn't test SPEC INT previously, only FP) But as you noted the patch only changes allocation preference in case of equal costs. I've also looked at stage2 binary sizes and patched we see a growth of cc1 from 35450264 bytes to 35459552 (executable size). There are both ups and downs for individual .o files though. The gcc.target/i386/addr-sel-1.c (for PR28940) seems to just started working at some point past in time and thus it was added and the bug closed. You could say RA does a better job after the patch as it uses 1 less register but that restricts the followup postreload combine attempts. Though I wonder about what's "better" RA here - isn't the best allocation one that avoids spills but uses as many registers as possible (at least when targeting a CPU that cannot to register renaming)? regrename doesn't help this testcase either (it runs too late and does a renaming that doesn't help). With all of the above I'm not sure what to do for GCC 6 (even though you just approved the patch). Going with the patch alternative (just revert swapping parts of the commutative operands) looks like completely bogus though it works for fixing the regression as well. So I have applied the patch now, giving us a few days to get other peoples (and my own) auto-testers the chance to pick up results with it and after that we can consider reverting or going with the (IMHO bogus) half-way variant. I've XFAILed half of gcc.target/i386/addr-sel-1.c. Thanks, Richard. > > 2016-02-05 Richard Biener <rguent...@suse.de> > > > > PR rtl-optimization/69274 > > * ira.c (ira_setup_alts): Do not change recog_data.operand > > order. > > > > Index: gcc/ira.c > > =================================================================== > > --- gcc/ira.c (revision 231814) > > +++ gcc/ira.c (working copy) > > @@ -1888,10 +1888,11 @@ ira_setup_alts (rtx_insn *insn, HARD_REG > > } > > if (commutative < 0) > > break; > > - if (curr_swapped) > > - break; > > + /* Swap forth and back to avoid changing recog_data. */ > > std::swap (recog_data.operand[commutative], > > recog_data.operand[commutative + 1]); > > + if (curr_swapped) > > + break; > > } > > } > > > > -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)