On Tue, Mar 20, 2018 at 5:56 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On March 20, 2018 6:11:53 PM GMT+01:00, "Bin.Cheng" <amker.ch...@gmail.com> wrote:
>>On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <al...@redhat.com> wrote:
>>> Hi Richard.
>>>
>>> As discussed in the PR, the problem here is that we have two different
>>> iterations of an IV live outside of a loop.  This inhibits us from using
>>> autoinc/dec addressing on ARM, and causes extra lea's on x86.
>>>
>>> An abbreviated example is this:
>>>
>>> loop:
>>>   # p_9 = PHI <p_17(2), p_20(3)>
>>>   p_20 = p_9 + 18446744073709551615;
>>>   goto loop
>>>   p_24 = p_9 + 18446744073709551614;
>>>   MEM[(char *)p_20 + -1B] = 45;
>>>
>>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>>> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
>>> because one use is -2 and the other one is -1.
>>>
>>> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
>>> terms of the current/last IV (p_20 in the case above).  With it, we end up
>>> with:
>>>
>>>   p_24 = p_20 + 18446744073709551615;
>>>   *p_24 = 45;
>>>
>>> ...which helps both x86 and Arm.
>>>
>>> As you have suggested in comment 38 on the PR, I handle specially
>>> out-of-loop IV uses of the form IV+CST and propagate those accordingly
>>> (along with the MEM_REF above).  Otherwise, in less specific cases, we un-do
>>> the IV increment, and use that value in all out-of-loop uses.  For instance,
>>> in the attached testcase, we rewrite:
>>>
>>>   george (p_9);
>>>
>>> into
>>>
>>>   _26 = p_20 + 1;
>>>   ...
>>>   george (_26);
>>>
>>> The attached testcase tests the IV+CST specific case, as well as the more
>>> generic case with george().
>>>
>>> Although the original PR was for ARM, this behavior can be noticed on x86,
>>> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
>>> test on an x86 cross ARM build and made sure we had 2 auto-dec with the
>>> test.  For the original test (slightly different than the testcase in this
>>> patch), with this patch we are at 104 bytes versus 116 without it.  There is
>>> still the issue of a division optimization which would further reduce the
>>> code size.  I will discuss this separately as it is independent from this
>>> patch.
>>>
>>> Oh yeah, we could make this more generic, and maybe handle any multiple of
>>> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>>>
>>> OK for trunk?
>>Just FYI, this looks similar to what I did in
>>https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
>>That change was non-trivial and didn't give obvious improvement back
>>in time.  But I still wonder if this
>>can be done at rewriting iv_use in a light-overhead way.
>
> Certainly, but the issue is we wreck it again at forwprop time as ivopts runs
> too early.

So both values of p_9/p_20 are used after loop.

loop:
  # p_9 = PHI <p_17(2), p_20(3)>
  p_20 = p_9 + 18446744073709551615;
  goto loop
  p_24 = p_20 + 18446744073709551615;
  MEM[(char *)p_20 + -1B] = 45;

This looks like a fwprop issue: it rewrites the use of p_20 in terms of
p_9, which results in the code below:

loop:
  # p_9 = PHI <p_17(2), p_20(3)>
  p_20 = p_9 + 18446744073709551615;
  goto loop
  p_24 = p_9 + 18446744073709551614;
  MEM[(char *)p_20 + -1B] = 45;

This creates intersecting/longer live ranges, yet it doesn't eliminate the
copy or the definition for p_9.  Ah, IIRC, RTL address forward propagation
also has this issue.

Thanks,
bin
>
> Richard.
>>
>>Thanks,
>>bin
>>> Aldy