On March 20, 2018 6:11:53 PM GMT+01:00, "Bin.Cheng" <amker.ch...@gmail.com> wrote:
>On Mon, Mar 19, 2018 at 5:08 PM, Aldy Hernandez <al...@redhat.com> wrote:
>> Hi Richard.
>>
>> As discussed in the PR, the problem here is that we have two different
>> iterations of an IV live outside of a loop.  This inhibits us from using
>> autoinc/dec addressing on ARM, and causes extra lea instructions on x86.
>>
>> An abbreviated example is this:
>>
>> loop:
>>   # p_9 = PHI <p_17(2), p_20(3)>
>>   p_20 = p_9 + 18446744073709551615;
>>   goto loop
>>   p_24 = p_9 + 18446744073709551614;
>>   MEM[(char *)p_20 + -1B] = 45;
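The large constants above are how GIMPLE prints pointer offsets of -1 and -2 when viewed as unsigned 64-bit values (2^64 - 1 and 2^64 - 2). The following is a minimal C sketch that assumes pointers can be modelled as `uint64_t` with wraparound arithmetic. The names `addr_t`, `store_addr` and `p24_from_prev` are hypothetical and do not come from GCC. It shows that the p_9-based value p_24 and the store through p_20 + -1B denote the same address, which is why both can be expressed in terms of a single IV:

```c
#include <stdint.h>

/* Hypothetical model: pointers are 64-bit unsigned integers, so adding
   18446744073709551615 (2^64 - 1) wraps around to subtracting 1.  */
typedef uint64_t addr_t;

/* Address of the out-of-loop store MEM[(char *)p_20 + -1B].  */
static addr_t store_addr(addr_t p_20)
{
    return p_20 + 18446744073709551615ULL;   /* p_20 - 1 */
}

/* Address computed from the previous IV:
   p_24 = p_9 + 18446744073709551614.  */
static addr_t p24_from_prev(addr_t p_9)
{
    return p_9 + 18446744073709551614ULL;    /* p_9 - 2 */
}
```

Because p_20 = p_9 - 1, both functions return p_9 - 2 for the same starting pointer. Only the base IV differs.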
>>
>> Here we have both the previous IV (p_9) and the current IV (p_20) used
>> outside of the loop.  On Arm this keeps us from using auto-dec addressing,
>> because one use is -2 and the other one is -1.
>>
>> With the attached patch we attempt to rewrite out-of-loop uses of the IV in
>> terms of the current/last IV (p_20 in the case above).  With it, we end up
>> with:
>>
>>   p_24 = p_20 + 18446744073709551615;
>>   *p_24 = 45;
>>
>> ...which helps both x86 and Arm.
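The IV+CST rewrite rests on a simple identity. If the loop computes cur = prev + step, then an out-of-loop use prev + c equals cur + (c - step). The following is a minimal C sketch of that arithmetic, assuming the same `uint64_t` model of pointers. `rebase_offset` is a hypothetical helper and not the function used in the patch:

```c
#include <stdint.h>

typedef uint64_t addr_t;

/* Hypothetical helper: given an out-of-loop use "prev + c" where the loop
   computes cur = prev + step, return the constant to add to cur instead.
   All arithmetic is modulo 2^64, so negative offsets appear as large
   unsigned constants, just as GIMPLE prints them.  */
static addr_t rebase_offset(addr_t c, addr_t step)
{
    return c - step;   /* prev + c == (cur - step) + c */
}
```

For the loop above, step is 18446744073709551615 (-1) and the use is p_9 + 18446744073709551614 (-2). The helper therefore returns 18446744073709551615, which gives the p_24 = p_20 + 18446744073709551615 form shown in the patch description.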
>>
>> As you have suggested in comment 38 on the PR, I specially handle
>> out-of-loop IV uses of the form IV+CST and propagate those accordingly
>> (along with the MEM_REF above).  Otherwise, in less specific cases, we undo
>> the IV increment, and use that value in all out-of-loop uses.  For instance,
>> in the attached testcase, we rewrite:
>>
>>   george (p_9);
>>
>> into
>>
>>   _26 = p_20 + 1;
>>   ...
>>   george (_26);
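In the generic case, undoing the increment just recovers the previous IV value from the current one: prev = cur - step. The following is a minimal C sketch under the same assumed `uint64_t` model. `undo_increment` is a hypothetical name used only for illustration:

```c
#include <stdint.h>

typedef uint64_t addr_t;

/* Hypothetical helper: recover the previous IV value from the current one
   by undoing the increment cur = prev + step (arithmetic modulo 2^64).  */
static addr_t undo_increment(addr_t cur, addr_t step)
{
    return cur - step;
}
```

With step = 18446744073709551615 (-1), undo_increment(p_20, step) equals p_20 + 1. That is the _26 value passed to george() above.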
>>
>> The attached testcase tests the IV+CST specific case, as well as the more
>> generic case with george().
>>
>> Although the original PR was for ARM, this behavior can be noticed on x86,
>> so I tested on x86 with a full bootstrap + tests.  I also ran the specific
>> test on an x86-hosted ARM cross build and made sure we got 2 auto-decs with
>> the test.  For the original test (slightly different from the testcase in
>> this patch), with this patch we are at 104 bytes versus 116 without it.
>> There is still the issue of a division optimization which would further
>> reduce the code size.  I will discuss this separately as it is independent
>> of this patch.
>>
>> Oh yeah, we could make this more generic, and maybe handle any multiple of
>> the constant, or perhaps *= and /=.  Perhaps something for next stage1...
>>
>> OK for trunk?
>Just FYI, this looks similar to what I did in
>https://gcc.gnu.org/ml/gcc-patches/2013-11/msg00535.html
>That change was non-trivial and didn't give an obvious improvement back
>at the time.  But I still wonder if this can be done when rewriting
>iv_use in a lightweight way.

Certainly, but the issue is that we wreck it again at forwprop time, as ivopts
runs too early.

Richard. 
>
>Thanks,
>bin
>> Aldy
