On Fri, May 27, 2016 at 11:45 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> As analyzed in PR68303 and PR69710, vectorizer generates duplicated 
>> computations in loop's pre-header basic block when creating base address for 
>> vector reference to the same memory object.  Because the duplicated code is 
>> out of loop, IVOPT fails to track base object for these vector references, 
>> resulting in missed strength reduction.
>> It's agreed that vectorizer should be improved to generate optimal 
>> (IVOPT-friendly) code, the difficult part is we want a generic 
>> infrastructure.  After investigation, I tried to introduce a generic/simple 
>> local CSE interface by reusing existing algorithm/data-structure from 
>> tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each 
>> basic block in a bitmap, customers of this interface only need to record 
>> basic blocks in the bitmap when necessary.  Note we don't need scopedtables' 
>> unwinding facility since the interface runs only for single basic block, 
>> this should be good in terms of compilation time.
>> Besides CSE issue, this patch also re-associates address expressions in 
>> vect_create_addr_base_for_vector_ref, specifically, it splits constant 
>> offset and adds it back near the expression root in IR.  This is necessary 
>> because GCC only handles re-association for commutative operators in CSE.
>>
>> I checked its impact on various test cases.
>> With this patch, PR68030's generated assembly is reduced from ~750 lines to 
>> ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
>> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for 
>> base address of the first reference is somehow moved outside of loop's 
>> pre-header, local CSE can't help in this case.  Though split_constant_offset 
>> can back track ssa def chain, it causes possible redundant when there is no 
>> CSE opportunities in pre-header.
>> 2) It causes regression for PR68030 on AArch64.  I think the regression is 
>> caused by IVOPT issues which are exposed by this patch.  Checks on offset 
>> validity in get_address_cost is wrong/inaccurate now.  It considers an 
>> offset as valid if it's within the maximum offset range that backend 
>> supports.  This is not true, for example, AArch64 requires aligned offset 
>> additionally.  For example, LDR [base + 2060] is invalid for V4SFmode, 
>> although "2060" is within the maximum offset range.  Another issue is also 
>> in get_address_cost.  Inaccurate cost is computed for "base + offset + 
>> INDEX" address expression.  When register pressure is low, "base+offset" can 
>> be hoisted out and we can use [base + INDEX] addressing mode, whichhis is 
>> current behavior.
>>
>> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>
> It looks quite straight-forward with the caveat that it has one
> obvious piece that is not in the order
> of the complexity of a basic-block.  threadedge_initialize_values
> creates the SSA value array
I noticed this too, and think it's better to get rid of this init/fini
functions by some kind re-design.  I found it's quite weird to call
threadege_X in tree-vrp.c.  I will keep investigating this.
> which is zero-initialized (upon use).  That's probably a non-issue for
> the use you propose for
> the vectorizer (call cse_bbs once per function).  As Ideally I would
> like this facility to replace
> the loop unrollers own propagate_constants_for_unrolling it might
> become an issue though.
> In that regard the unroller facility is also more powerful because it
> handles regions (including
> PHIs).
With the current data-structure, I think it's not very hard to extend
the interface to regions.  I will keep investigating this too.  BTW,
if it's okay, I tend to do that in following patches.
>
> Note that in particular for SLP vectorization the vectorizer itself
> may end up creating quite
> some redundancies so I wonder if it's worth to CSE the vectorized loop
> body as well
Maybe.  The next step is condition block created by vectorizer.  Both
prune_runtime_alias_test_list and generated alias checks are
sub-optimal now, even worse, somehow computations in alias checks can
be propagated to loop pre-header.  With help of this interface, alias
checks (and later code) can be simplified.

> (and given we have PHIs eventually CSE the whole vectorized loop with
> pre-header as a region...)
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>>
>> 2016-05-17 Bin Cheng  <bin.ch...@arm.com>
>>
>>         PR tree-optimization/68030
>>         PR tree-optimization/69710
>>         * tree-ssa-dom.c (cse_bbs): New function.
>>         * tree-ssa-dom.h (cse_bbs): New declaration.
>>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>         Re-associate address by splitting constant offset.
>>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>         basic block.
>>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>>         changed basic block.
>>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>         (changed_bbs): New variable.
>>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>         * tree-vectorizer.h (changed_bbs): New declaration.

Reply via email to