[Bug tree-optimization/88440] size optimization of memcpy-like code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Mel Chen changed:

           What    |Removed   |Added
   ----------------------------------------------------------
           CC      |          |bina2374 at gmail dot com

--- Comment #27 from Mel Chen ---
Related new bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #7 from Martin Liška ---
(In reply to Richard Biener from comment #6)
> Created attachment 45313 [details]
> patch
>
> This enables distribution of patterns at -O[2s]+ and optimizes the testcase
> at -Os by adjusting the guards in loop distribution.
>
> Note that the interesting bits are compile-time, binary-size and performance
> at mainly -O2, eventually size at -Os.
>
> I suspect that at -O2 w/o profiling most loops would be
> optimize_loop_for_speed anyways, so changing the heuristics isn't so bad,
> but of course enabling distribution at -O2 might incur a penalty.

I have build numbers so far on a Zen machine with -j16:

SPEC 2006:

Elapsed compile for '400.perlbench':    00:00:05 (5)
Elapsed compile for '401.bzip2':        00:00:02 (2)
Elapsed compile for '403.gcc':          00:00:11 (11)
Elapsed compile for '429.mcf':          00:00:01 (1)
Elapsed compile for '445.gobmk':        00:00:04 (4)
Elapsed compile for '456.hmmer':        00:00:01 (1)
Elapsed compile for '458.sjeng':        00:00:01 (1)
Elapsed compile for '462.libquantum':   00:00:01 (1)
Elapsed compile for '464.h264ref':      00:00:04 (4)
Elapsed compile for '471.omnetpp':      00:00:05 (5)
Elapsed compile for '473.astar':        00:00:01 (1)
Elapsed compile for '483.xalancbmk':    00:00:21 (21)
Elapsed compile for '410.bwaves':       00:00:01 (1)
Elapsed compile for '416.gamess':       00:00:20 (20)
Elapsed compile for '433.milc':         00:00:02 (2)
Elapsed compile for '434.zeusmp':       00:00:02 (2)
Elapsed compile for '435.gromacs':      00:00:06 (6)
Elapsed compile for '436.cactusADM':    00:00:04 (4)
Elapsed compile for '437.leslie3d':     00:00:04 (4)
Elapsed compile for '444.namd':         00:00:09 (9)
Elapsed compile for '447.dealII':       00:00:15 (15)
Elapsed compile for '450.soplex':       00:00:03 (3)
Elapsed compile for '453.povray':       00:00:04 (4)
Elapsed compile for '454.calculix':     00:00:06 (6)
Elapsed compile for '459.GemsFDTD':     00:00:09 (9)
Elapsed compile for '465.tonto':        00:00:53 (53)
Elapsed compile for '470.lbm':          00:00:02 (2)
Elapsed compile for '481.wrf':          00:00:38 (38)
Elapsed compile for '482.sphinx3':      00:00:01 (1)

All differences before and after are within 1s, which is the granularity.

SPEC 2017:

Elapsed compile for '503.bwaves_r':     00:00:01 (1)
Elapsed compile for '507.cactuBSSN_r':  00:00:25 (25)
Elapsed compile for '508.namd_r':       00:00:09 (9)
Elapsed compile for '510.parest_r':     00:00:46 (46)
Elapsed compile for '511.povray_r':     00:00:04 (4)
Elapsed compile for '519.lbm_r':        00:00:01 (1)
Elapsed compile for '521.wrf_r':        00:05:46 (346)
Elapsed compile for '526.blender_r':    00:00:25 (25)
Elapsed compile for '527.cam4_r':       00:00:37 (37)
Elapsed compile for '538.imagick_r':    00:00:11 (11)
Elapsed compile for '544.nab_r':        00:00:01 (1)
Elapsed compile for '549.fotonik3d_r':  00:00:07 (7)
Elapsed compile for '554.roms_r':       00:00:06 (6)
Elapsed compile for '500.perlbench_r':  00:00:09 (9)
Elapsed compile for '502.gcc_r':        00:00:44 (44)
Elapsed compile for '505.mcf_r':        00:00:01 (1)
Elapsed compile for '520.omnetpp_r':    00:00:12 (12)
Elapsed compile for '523.xalancbmk_r':  00:00:25 (25)
Elapsed compile for '525.x264_r':       00:00:09 (9)
Elapsed compile for '531.deepsjeng_r':  00:00:02 (2)
Elapsed compile for '541.leela_r':      00:00:03 (3)
Elapsed compile for '548.exchange2_r':  00:00:04 (4)
Elapsed compile for '557.xz_r':         00:00:01 (1)

There's only one difference:

521.wrf_r: 310 -> 346s
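For reference, the class of loop this patch targets can be sketched in C; this is illustrative code in the spirit of the bug, not one of the actual testcases (ldist-37/38):

```c
#include <stddef.h>

/* A memcpy-like loop: with -O2 -ftree-loop-distribute-patterns, GCC's loop
   distribution can recognize the whole loop body as a block copy and replace
   it by a single memcpy call.  */
void copy_bytes (char *dst, const char *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i];
}
```

Previously this replacement only happened at -O3; the patch under discussion enables it at -O2 and (for size) at -Os.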
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #8 from rguenther at suse dot de ---
On Thu, 16 May 2019, marxin at gcc dot gnu.org wrote:

> --- Comment #7 from Martin Liška ---
> (In reply to Richard Biener from comment #6)
> > This enables distribution of patterns at -O[2s]+ and optimizes the
> > testcase at -Os by adjusting the guards in loop distribution.
> ...
> I have build numbers so far on a Zen machine with -j16:
> ...
> There's only one difference:
>
> 521.wrf_r: 310 -> 346s

Ick.  I currently see no limiting on the size of loops in loop distribution.
One easy option would be to limit the worklist size in
find_seed_stmts_for_distribution with a --param we can lower at -O[2s];
another would be to limit loop nest depth similarly.  A profile might be
interesting here as well...
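The suggested worklist cap could look roughly like this; the constant 64, the function name, and its shape are all hypothetical illustrations, not GCC's actual find_seed_stmts_for_distribution interface or an existing --param:

```c
#include <stddef.h>

/* Illustrative cap standing in for a hypothetical --param; not an actual
   GCC parameter name.  */
enum { MAX_SEED_STMTS = 64 };

/* Collect at most MAX_SEED_STMTS seed statements into OUT.  Returns the
   number taken, or -1 to signal "loop too large, skip distribution" so
   compile time stays bounded.  */
int collect_seed_stmts (const int *stmts, size_t nstmts, int *out)
{
  if (nstmts > MAX_SEED_STMTS)
    return -1;                      /* bail out early instead of analyzing a huge loop */
  for (size_t i = 0; i < nstmts; i++)
    out[i] = stmts[i];
  return (int) nstmts;
}
```

The point of the design is that the pass gives up cheaply on oversized loops rather than paying super-linear analysis cost on them.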
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #9 from Martin Liška ---
Comparing all files in wrf, I've got:

  Filename                             | Before (s) | After (s) | Change
  -------------------------------------+------------+-----------+--------
  module_configure.fppized.f90         |     127.81 |    163.39 | 127.84%
  d1fgkb.fppized.f90                   |       0.21 |      0.23 | 109.52%
  solve_interface.fppized.f90          |       0.35 |      0.38 | 108.57%
  module_ltng_crmpr92.fppized.f90      |       0.28 |      0.3  | 107.14%
  module_cu_kf.fppized.f90             |       1.42 |      1.51 | 106.34%
  mradbg.fppized.f90                   |       0.32 |      0.34 | 106.25%
  module_sf_pxlsm.fppized.f90          |       0.55 |      0.58 | 105.45%
  module_domain_type.fppized.f90       |       0.19 |      0.2  | 105.26%
  module_shallowcu_driver.fppized.f90  |       0.19 |      0.2  | 105.26%
  module_bl_gfs.fppized.f90            |       0.78 |      0.82 | 105.13%
  module_bl_myjurb.fppized.f90         |       0.78 |      0.82 | 105.13%
  Meat.fppized.f90                     |       0.39 |      0.41 | 105.13%
  ...

So the only significant offender is the module_configure.fppized.f90 file.
Let me profile it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #10 from Martin Liška ---
> So the only significant offender is the module_configure.fppized.f90 file.
> Let me profile it.

Time profile before/after:

  PASS                      | Before (s) | After (s) | Change
  --------------------------+------------+-----------+---------
  backwards jump threading  |       6.29 |      6.16 |  97.93%
  integrated RA             |       6.76 |      6.41 |  94.82%
  tree SSA incremental      |       9.01 |     11.16 | 123.86%
  LRA create live ranges    |      15.68 |     40.02 | 255.23%
  PRE                       |      23.24 |     32.32 | 139.07%
  alias stmt walking        |      27.69 |     28.75 | 103.83%
  phase opt and generate    |     124.13 |    163.95 | 132.08%
  TOTAL                     |     125.39 |    165.17 | 131.73%

Richi, do you want a perf report, or will you come up with a patch that
introduces the aforementioned params?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #11 from rguenther at suse dot de ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> Richi, do you want a perf report, or will you come up with a patch that
> introduces the aforementioned params?

Can you share -fopt-info-loop differences?  From the above I would guess we
split a lot of loops, meaning the memcpy/memmove/memset calls are in the
"middle" and we have to split loops (how many calls are detected here?).
If that's true, another way would be to only allow calls at head or tail
position, thus a single non-builtin partition.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #12 from Martin Liška ---
> Can you share -fopt-info-loop differences?  From the above I would guess we
> split a lot of loops, meaning the memcpy/memmove/memset calls are in the
> "middle" and we have to split loops (how many calls are detected here?).
> If that's true, another way would be to only allow calls at head or tail
> position, thus a single non-builtin partition.

I newly see ~1400 lines:

module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split to 0 loops and 1 library calls.
...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #13 from rguenther at suse dot de ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> I newly see ~1400 lines:
>
> module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed:
> split to 0 loops and 1 library calls.
> ...

All with "0 loops"?  That disputes my theory :/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #14 from Martin Liška ---
(In reply to rguent...@suse.de from comment #13)
> All with "0 loops"?  That disputes my theory :/

Yep.  All these are of the form:

   [local count: 118163158]:
  # S.1565_41079 = PHI <1(2028), S.1565_32687(3351)>
  # ivtmp_38850 = PHI <11(2028), ivtmp_38848(3351)>
  _3211 = S.1565_41079 + -1;
  _3212 = fire_ignition_start_y1[_3211];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101040B][_3211] = _3212;
  S.1565_32687 = S.1565_41079 + 1;
  ivtmp_38848 = ivtmp_38850 - 1;
  if (ivtmp_38848 == 0)
    goto ; [9.09%]
  else
    goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:

   [local count: 118163158]:
  # S.1566_41080 = PHI <1(2027), S.1566_32689(3350)>
  # ivtmp_38856 = PHI <11(2027), ivtmp_38854(3350)>
  _3213 = S.1566_41080 + -1;
  _3214 = fire_ignition_end_x1[_3213];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101084B][_3213] = _3214;
  S.1566_32689 = S.1566_41080 + 1;
  ivtmp_38854 = ivtmp_38856 - 1;
  if (ivtmp_38854 == 0)
    goto ; [9.09%]
  else
    goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:

   [local count: 118163158]:
  # S.1567_41081 = PHI <1(2026), S.1567_32691(3349)>
  # ivtmp_38860 = PHI <11(2026), ivtmp_38858(3349)>
  _3215 = S.1567_41081 + -1;
  _3216 = fire_ignition_end_y1[_3215];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101128B][_3215] = _3216;
  S.1567_32691 = S.1567_41081 + 1;
  ivtmp_38858 = ivtmp_38860 - 1;
  if (ivtmp_38858 == 0)
    goto ; [9.09%]
  else
    goto ; [90.91%]

   [local count: 107425740]:
  goto ; [100.00%]

   [local count: 10737418]:
  ...

It's a configure module, so it probably contains so many loops for the
various configs.
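Each of the quoted GIMPLE loops is an 11-iteration copy of a real(kind=4) array into a field of model_config_rec. A C analogue, with the array and struct names as stand-ins for the Fortran module data, that pattern distribution would report as "split to 0 loops and 1 library calls":

```c
#define N 11

/* Stand-ins for the Fortran module data; shapes are illustrative only.  */
static float fire_ignition_start_y1[N];
static struct { float start_y1[N]; } model_config_rec;

/* One of the many generated copy loops: loop distribution recognizes the
   whole body as a block copy, so the loop disappears into one memcpy.  */
void copy_config_field (void)
{
  for (int i = 0; i < N; i++)
    model_config_rec.start_y1[i] = fire_ignition_start_y1[i];
}
```

With hundreds of such loops replaced, the function's CFG shrinks dramatically, which is what later explains the basic-block counts in comment #20.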
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #15 from rguenther at suse dot de ---
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> > All with "0 loops"?  That disputes my theory :/
>
> Yep.  All these are of the form:
> ...
> It's a configure module, so it probably contains so many loops for the
> various configs.

Hmm, so then it might be that we run into some CFG complexity cut-off
before, for PRE and RA, but not after, since the CFG should simplify a lot
if we make memcpy from all of the above loops...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #16 from Martin Liška ---
Created attachment 46393
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46393&action=edit
SPEC2006 and SPEC2017 report

The report presents the difference between master (first gray column) and
Richi's patch (last 2 columns; the tests were run twice).  There are both
some significant improvements and regressions.  Note that 436.cactusADM is
jumping around (21%)!
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška changed:

           What      |Removed                   |Added
   ----------------------------------------------------------------------------
           Status    |ASSIGNED                  |NEW
           Assignee  |marxin at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #17 from Martin Liška ---
> Hmm, so then it might be we run into some CFG complexity cut-off
> before for PRE and RA but not after since the CFG should simplify
> a lot if we make memcpy from all of the above loops...

I guess so.  Note that even without the patch the file takes 2 minutes to
compile.  It's somehow weird.

I'm done with the patch measurements.  Richi, do you see it as beneficial
to enable it at -O2?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener changed:

           What      |Removed                       |Added
   ----------------------------------------------------------------------------
           Status    |NEW                           |ASSIGNED
           Assignee  |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #18 from Richard Biener ---
OK, let's do it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #19 from rguenther at suse dot de ---
On Wed, 22 May 2019, marxin at gcc dot gnu.org wrote:

> I'm done with the patch measurements.  Richi, do you see it as beneficial
> to enable it at -O2?

OK, let's do it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener changed:

           What  |Removed |Added
   ----------------------------------------------------------
           CC    |        |vmakarov at gcc dot gnu.org

--- Comment #20 from Richard Biener ---
(In reply to rguent...@suse.de from comment #11)
> Can you share -fopt-info-loop differences?  From the above I would guess
> we split a lot of loops [...]

Some analysis, focusing on LRA lives, shows that unpatched we have

  lra live on 53 BBs for wrf_alt_nml_obsolete
  lra live on 5 BBs for set_config_as_buffer
  lra live on 5 BBs for get_config_as_buffer
  lra live on 3231 BBs for initial_config
  lra live on 3231 BBs for initial_config

while patched

  lra live on 53 BBs for wrf_alt_nml_obsolete
  lra live on 5 BBs for set_config_as_buffer
  lra live on 5 BBs for get_config_as_buffer
  lra live on 465 BBs for initial_config
  lra live on 465 BBs for initial_config

so it's the initial_config function.  We need 8 DF worklist iterations in
both cases, but eventually the amount of local work is larger, or the local
work isn't linear in the size of the BBs.

The "work" it does to not update hardregs by anding ~all_hard_regs_bitmap
seems somewhat pointless unless the functions do not handle those
correctly.  But that's micro-optimizing; likewise adding a
bitmap_ior_and_compl_and_compl function to avoid the temporary bitmap in
live_trans_fun.

perf tells us most time is spent in process_bb_lives, not in the dataflow
problem, though, and there in ix86_hard_regno_call_part_clobbered (the
function has a _lot_ of calls...).

Also, w/o pattern detection the lra_simple_p heuristic kicks in, since we
have a lot more BBs:

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

The code is auto-generated and large (I have a single source file using no
modules now, but it is still too large and too similar to SPEC to attach
here), so I wouldn't worry too much here.  The above magic constant should
be a --param though.
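Plugging the BB counts above into that condition shows why the unpatched compile behaves differently. A small mirror of the expression (illustrative, not GCC code):

```c
/* Mirrors the lra_simple_p condition: true when there are "too many"
   pseudos for the number of basic blocks, so LRA switches to its
   simplified, faster algorithms.  */
int lra_simple_heuristic (int max_regs, int n_blocks)
{
  return max_regs >= (1 << 26) / n_blocks;
}
```

With the unpatched 3231 BBs the pseudo threshold is (1 << 26) / 3231 = 20770, while with the patched 465 BBs it rises to 144320, so the pattern-distributed version is far less likely to hit the simplified-LRA path.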
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #21 from Richard Biener ---
Ick.

  static inline void
  check_pseudos_live_through_calls (int regno,
                                    HARD_REG_SET last_call_used_reg_set,
                                    rtx_insn *call_insn)
  {
  ...
    for (hr = 0; HARD_REGISTER_NUM_P (hr); hr++)
      if (targetm.hard_regno_call_part_clobbered (call_insn, hr,
                                                  PSEUDO_REGNO_MODE (regno)))
        add_to_hard_reg_set (&lra_reg_info[regno].conflict_hard_regs,
                             PSEUDO_REGNO_MODE (regno), hr);

This loop repeatedly computes an implicit hard-reg set of which hard regs
are partly clobbered by the call, for the _same_ actual instruction, since
check_pseudos_live_through_calls is called via

  /* Mark each defined value as live.  We need to do this for
     unused values because they still conflict with quantities
     that are live at the time of the definition.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
    {
      if (reg->type != OP_IN)
        {
          update_pseudo_point (reg->regno, curr_point, USE_POINT);
          mark_regno_live (reg->regno, reg->biggest_mode);
          check_pseudos_live_through_calls (reg->regno,
                                            last_call_used_reg_set,
                                            call_insn);
          ...
        }

and

  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, j)
    {
      IOR_HARD_REG_SET (lra_reg_info[j].actual_call_used_reg_set,
                        this_call_used_reg_set);
      if (flush)
        check_pseudos_live_through_calls (j, last_call_used_reg_set,
                                          last_call_insn);
    }

and

  /* Mark each used value as live.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
    if (reg->type != OP_OUT)
      {
        if (reg->type == OP_IN)
          update_pseudo_point (reg->regno, curr_point, USE_POINT);
        mark_regno_live (reg->regno, reg->biggest_mode);
        check_pseudos_live_through_calls (reg->regno,
                                          last_call_used_reg_set, call_insn);
      }

and

  EXECUTE_IF_SET_IN_BITMAP (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, j, bi)
    {
      if (sparseset_cardinality (pseudos_live_through_calls) == 0)
        break;
      if (sparseset_bit_p (pseudos_live_through_calls, j))
        check_pseudos_live_through_calls (j, last_call_used_reg_set,
                                          call_insn);
    }

The pseudo's mode may change, but I guess usually it doesn't.

I also wonder why the target hook doesn't return a hard-reg-set...  That
said, the above code doesn't scale well for functions with a lot of calls
at least; also, the passed call_insn isn't the current insn and might even
be NULL.  All targets but aarch64 do not even look at the actual
instruction (even more an argument for re-designing the hook with its use
in mind).

I guess an artificial testcase with a lot of calls and a lot of live
pseudos (even single-BB) should show this issue easily.

  Samples: 579  of event 'cycles:ppp', Event count (approx.): 257134187434191
  Overhead  Command  Shared Object  Symbol
    22.26%  f951     f951           [.] process_bb_lives
    15.06%  f951     f951           [.] ix86_hard_regno_call_part_clobbered
     8.55%  f951     f951           [.] concat
     6.88%  f951     f951           [.] find_base_term
     3.60%  f951     f951           [.] get_ref_base_and_extent
     3.27%  f951     f951           [.] find_base_term
     2.95%  f951     f951           [.] make_hard_regno_dead
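The redundancy pointed out above — recomputing the per-call part-clobbered set once per pseudo — could be avoided by computing the set once per call insn (and mode) and reusing it. A generic sketch; the uint64_t set, the 64-register limit, and all names here are illustrative assumptions, not LRA's real data structures or the real target hook:

```c
#include <stdint.h>

typedef uint64_t hard_reg_set;   /* one bit per hard register; 64 is illustrative */

/* Example target predicate used in the test below: pretend every
   odd-numbered hard reg is part-clobbered by the call.  */
int odd_reg_p (int hr) { return hr & 1; }

/* Compute the set of hard regs partly clobbered by a call, once per call
   insn, instead of re-running the hook loop for every live pseudo.  In real
   code the cache key would also include the pseudo's mode, since that can
   (rarely) change.  */
hard_reg_set compute_part_clobbered (int (*part_clobbered_p) (int hr))
{
  hard_reg_set set = 0;
  for (int hr = 0; hr < 64; hr++)
    if (part_clobbered_p (hr))
      set |= (hard_reg_set) 1 << hr;
  return set;
}

/* The per-pseudo update then shrinks to a single OR of the cached set into
   the pseudo's conflict set.  */
void record_conflicts (hard_reg_set *conflicts, hard_reg_set cached)
{
  *conflicts |= cached;
}
```

This is exactly the shape a hook returning a hard-reg-set (as mused above) would enable: one hook call per call insn, one OR per live pseudo.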
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #22 from Richard Biener ---
The code in question was originally added with r202721 by Vlad and likely
became more costly after making the target macro a hook (no inlining
anymore).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #23 from Richard Biener ---
Author: rguenth
Date: Thu May 23 11:35:16 2019
New Revision: 271553

URL: https://gcc.gnu.org/viewcvs?rev=271553&root=gcc&view=rev
Log:
2019-05-23  Richard Biener

        PR tree-optimization/88440
        * opts.c (default_options_table): Enable
        -ftree-loop-distribute-patterns at -O[2s]+.
        * tree-loop-distribution.c (generate_memset_builtin): Fold the
        generated call.
        (generate_memcpy_builtin): Likewise.
        (distribute_loop): Pass in whether to only distribute patterns.
        (prepare_perfect_loop_nest): Also allow size optimization.
        (pass_loop_distribution::execute): When optimizing a loop nest for
        size allow pattern replacement.

        * gcc.dg/tree-ssa/ldist-37.c: New testcase.
        * gcc.dg/tree-ssa/ldist-38.c: Likewise.
        * gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
        * gcc.dg/tree-ssa/ldist-37.c: Adjust.
        * gcc.dg/tree-ssa/ldist-38.c: Likewise.
        * g++.dg/tree-ssa/pr78847.C: Likewise.
        * gcc.dg/autopar/pr39500-1.c: Likewise.
        * gcc.dg/autopar/reduc-1char.c: Likewise.
        * gcc.dg/autopar/reduc-7.c: Likewise.
        * gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
        * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-7.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-8.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-9.c: Likewise.
        * gcc.dg/tree-ssa/scev-11.c: Likewise.
        * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
        * gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
        * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
        * gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
        * gcc.target/i386/pr30970.c: Likewise.
        * gcc.target/i386/vect-double-1.c: Likewise.
        * gcc.target/i386/vect-double-2.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
        * gfortran.dg/vect/vect-5.f90: Likewise.
        * gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/opts.c
        trunk/gcc/testsuite/ChangeLog
        trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
        trunk/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
        trunk/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
        trunk/gcc/testsuite/gcc.dg/autopar/reduc-7.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-2.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-26.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-28.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-32.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-3.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-4.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-7.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-8.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-9.c
        trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-11.c
        trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-31.c
        trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-33.c
        trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c
        trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c
        trunk/gcc/testsuite/gcc.dg/vect/vect.exp
        trunk/gcc/testsuite/gcc.target/i386/pr30970.c
        trunk/gcc/testsuite/gcc.target/i386/vect-double-1.c
        trunk/gcc/testsuite/gcc.target/i386/vect-double-2.c
        trunk/gcc/testsuite/gfortran.dg/vect/vect-5.f90
        trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
        trunk/gcc/tree-loop-distribution.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener changed:

           What    |Removed  |Added
----------------------------------------------------------------------------
             Status|ASSIGNED |RESOLVED
         Resolution|---      |FIXED

--- Comment #24 from Richard Biener ---
Fixed on trunk.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #25 from Richard Biener ---
Author: rguenth
Date: Fri May 24 08:48:14 2019
New Revision: 271595

URL: https://gcc.gnu.org/viewcvs?rev=271595&root=gcc&view=rev
Log:
2019-05-23  Richard Biener

        PR tree-optimization/88440
        * opts.c (default_options_table): Enable
        -ftree-loop-distribute-patterns at -O[2s]+.
        * tree-loop-distribution.c (generate_memset_builtin): Fold the
        generated call.
        (generate_memcpy_builtin): Likewise.
        (distribute_loop): Pass in whether to only distribute patterns.
        (prepare_perfect_loop_nest): Also allow size optimization.
        (pass_loop_distribution::execute): When optimizing a loop nest
        for size allow pattern replacement.

        * gcc.dg/tree-ssa/ldist-37.c: New testcase.
        * gcc.dg/tree-ssa/ldist-38.c: Likewise.
        * gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
        * gcc.dg/tree-ssa/ldist-37.c: Adjust.
        * gcc.dg/tree-ssa/ldist-38.c: Likewise.
        * g++.dg/tree-ssa/pr78847.C: Likewise.
        * gcc.dg/autopar/pr39500-1.c: Likewise.
        * gcc.dg/autopar/reduc-1char.c: Likewise.
        * gcc.dg/autopar/reduc-7.c: Likewise.
        * gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
        * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
        * gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-7.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-8.c: Likewise.
        * gcc.dg/tree-ssa/prefetch-9.c: Likewise.
        * gcc.dg/tree-ssa/scev-11.c: Likewise.
        * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
        * gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
        * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
        * gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
        * gcc.target/i386/pr30970.c: Likewise.
        * gcc.target/i386/vect-double-1.c: Likewise.
        * gcc.target/i386/vect-double-2.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
        * gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
        * gfortran.dg/vect/vect-5.f90: Likewise.
        * gfortran.dg/vect/vect-8.f90: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-37.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-38.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #26 from Christophe Lyon ---
Author: clyon
Date: Mon May 27 13:37:57 2019
New Revision: 271662

URL: https://gcc.gnu.org/viewcvs?rev=271662&root=gcc&view=rev
Log:
[testsuite,aarch64,arm] PR88440: Fix testcases

2019-05-27  Christophe Lyon

        PR tree-optimization/88440
        gcc/testsuite/
        * gcc.target/aarch64/sve/index_offset_1.c: Add
        -fno-tree-loop-distribute-patterns.
        * gcc.target/aarch64/sve/single_1.c: Likewise.
        * gcc.target/aarch64/sve/single_2.c: Likewise.
        * gcc.target/aarch64/sve/single_3.c: Likewise.
        * gcc.target/aarch64/sve/single_4.c: Likewise.
        * gcc.target/aarch64/sve/vec_init_1.c: Likewise.
        * gcc.target/aarch64/vect-fmovd-zero.c: Likewise.
        * gcc.target/aarch64/vect-fmovf-zero.c: Likewise.
        * gcc.target/arm/ivopts.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/aarch64/sve/index_offset_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_2.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_3.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_4.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/vec_init_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovd-zero.c
    trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovf-zero.c
    trunk/gcc/testsuite/gcc.target/arm/ivopts.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #1 from Andrew Pinski ---
I thought I had a dup of this bug somewhere which was asking for this
optimization to be moved to -O2 (and -Os) and above rather than keeping it at
-O3 and above.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Richard Biener changed:

           What        |Removed     |Added
----------------------------------------------------------------------------
           Keywords    |            |missed-optimization
             Status    |UNCONFIRMED |NEW
   Last reconfirmed    |            |2018-12-11
                 CC    |            |mliska at suse dot cz,
                       |            |rguenth at gcc dot gnu.org
     Ever confirmed    |0           |1

--- Comment #2 from Richard Biener ---
I think distributing patterns is reasonable for -O[2s]. I think we kept it at
-O3 because of compile-time concerns (dependence analysis is quadratic). But
then we still don't do basic vectorization at -O[2s] either... (same
compile-time for not too much gain issue).

So if somebody is willing to do some compile-time / effect numbers (bootstrap /
SPEC?) then we can consider enabling loop-distribute-patterns for -O[2s] for
GCC 10.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #3 from krux ---
Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
difference though.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #4 from rguenther at suse dot de ---
On Wed, 12 Dec 2018, hoganmeier at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
>
> --- Comment #3 from krux ---
> Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
> difference though.

Possibly because of

      /* Don't distribute multiple exit edges loop, or cold loop.  */
      if (!single_exit (loop)
          || !optimize_loop_for_speed_p (loop))
        continue;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška changed:

           What            |Removed |Added
----------------------------------------------------------------------------
   Target Milestone        |---     |10.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

Martin Liška changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |marxin at gcc dot gnu.org

--- Comment #5 from Martin Liška ---
Sure, I can help with measurements during the next stage1. Richi, can you
please attach a patch that enables the optimization for -O[2s]?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

--- Comment #6 from Richard Biener ---
Created attachment 45313
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45313&action=edit
patch

This enables distribution of patterns at -O[2s]+ and optimizes the testcase at
-Os by adjusting the guards in loop distribution.

Note that the interesting bits are compile-time, binary-size and performance,
mainly at -O2, and eventually size at -Os.

I suspect that at -O2 w/o profiling most loops would satisfy
optimize_loop_for_speed_p anyway, so changing the heuristics isn't so bad, but
of course enabling distribution at -O2 might incur a penalty.