On March 20, 2018 6:11:53 PM GMT+01:00, "Bin.Cheng" <amker.ch...@gmail.com> wrote:
>On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <al...@redhat.com> wrote:
>> Hi Richard.
>>
>> As discussed in the PR, the problem here is that we have two different
>> iterations of an IV live outside of a loop.  This inhibits us from using
>> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>>
>> An abbreviated example is this:
>>
>> loop:
>>   # p_9 = PHI <p_17(2), p_20(3)>
>>   p_20 = p_9 + 18446744073709551615;
>> goto loop
>>   p_24 = p_9 + 18446744073709551614;
>>   MEM[(char *)p_20 + -1B] = 45;
>>
>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
>> because one use is -2 and the other one is -1.
>>
>> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
>> terms of the current/last IV (p_20 in the case above).  With it, we end up
>> with:
>>
>>   p_24 = p_20 + 18446744073709551615;
>>   *p_24 = 45;
>>
>> ...which helps both x86 and Arm.
>>
>> As you have suggested in comment 38 on the PR, I handle specially
>> out-of-loop IV uses of the form IV+CST and propagate those accordingly
>> (along with the MEM_REF above).  Otherwise, in less specific cases, we un-do
>> the IV increment, and use that value in all out-of-loop uses.  For instance,
>> in the attached testcase, we rewrite:
>>
>>   george (p_9);
>>
>> into
>>
>>   _26 = p_20 + 1;
>>   ...
>>   george (_26);
>>
>> The attached testcase tests the IV+CST specific case, as well as the more
>> generic case with george().
>>
>> Although the original PR was for ARM, this behavior can be noticed on x86,
>> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
>> test on an x86 cross ARM build and made sure we had 2 auto-dec with the
>> test.  For the original test (slightly different than the testcase in this
>> patch), with this patch we are at 104 bytes versus 116 without it.  There is
>> still the issue of a division optimization which would further reduce the
>> code size.  I will discuss this separately as it is independent from this
>> patch.
>>
>> Oh yeah, we could make this more generic, and maybe handle any multiple of
>> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>>
>> OK for trunk?
>Just FYI, this looks similar to what I did in
>https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
>That change was non-trivial and didn't give obvious improvement back
>in time.  But I still wonder if this can be done at rewriting iv_use in
>a light-overhead way.

Certainly, but the issue is we wreck it again at forwprop time as ivopts runs too early.

Richard.

>
>Thanks,
>bin
>> Aldy
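
For readers following the quoted dump, here is a minimal C sketch of the address arithmetic behind the rewrite. It is not the patch or the attached testcase. All function names are hypothetical and chosen for illustration, and it assumes the offsets in the dump are 64-bit unsigned values, so 18446744073709551615 stands for -1 and 18446744073709551614 for -2.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the dump prints offsets as unsigned 64-bit values, so a
   signed offset of -1 appears as 18446744073709551615.  */
static uint64_t wrapped_offset(int64_t off)
{
    return (uint64_t)off;
}

/* Store address written in terms of the previous IV (p_9 in the dump).  */
static char *addr_from_prev_iv(char *p_prev)
{
    return p_prev - 2;
}

/* The same address written in terms of the current IV (p_20 = p_9 - 1).  */
static char *addr_from_cur_iv(char *p_cur)
{
    return p_cur - 1;
}

/* Generic case: recover the previous IV by undoing the decrement,
   as in "_26 = p_20 + 1".  */
static char *prev_iv_from_cur(char *p_cur)
{
    return p_cur + 1;
}

/* Returns 1 when both forms agree for one sample iteration.  */
static int rewrite_is_equivalent(void)
{
    char buf[8] = {0};
    char *p_prev = buf + 4;    /* plays the role of p_9 */
    char *p_cur = p_prev - 1;  /* plays the role of p_20 */

    *addr_from_cur_iv(p_cur) = 45;  /* the store of 45 from the dump */

    return addr_from_prev_iv(p_prev) == addr_from_cur_iv(p_cur)
           && prev_iv_from_cur(p_cur) == p_prev
           && buf[2] == 45;
}
```

Calling rewrite_is_equivalent() checks that p_9 - 2 and p_20 - 1 name the same byte, and that p_20 + 1 gives back p_9. With every out-of-loop use expressed through p_20, only one iteration of the IV needs to stay live after the loop.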