On Wed, May 25, 2016 at 1:22 PM, Bin Cheng <bin.ch...@arm.com> wrote:
> Hi,
> As analyzed in PR68303 and PR69710, vectorizer generates duplicated 
> computations in loop's pre-header basic block when creating base address for 
> vector reference to the same memory object.  Because the duplicated code is 
> out of loop, IVOPT fails to track base object for these vector references, 
> resulting in missed strength reduction.
> It's agreed that vectorizer should be improved to generate optimal 
> (IVOPT-friendly) code, the difficult part is we want a generic 
> infrastructure.  After investigation, I tried to introduce a generic/simple 
> local CSE interface by reusing existing algorithm/data-structure from 
> tree-ssa-dom (tree-ssa-scopedtables).  The interface runs local CSE for each 
> basic block in a bitmap, customers of this interface only need to record 
> basic blocks in the bitmap when necessary.  Note we don't need scopedtables' 
> unwinding facility since the interface runs only for single basic block, this 
> should be good in terms of compilation time.
> Besides CSE issue, this patch also re-associates address expressions in 
> vect_create_addr_base_for_vector_ref, specifically, it splits constant offset 
> and adds it back near the expression root in IR.  This is necessary because 
> GCC only handles re-association for commutative operators in CSE.
>
> I checked its impact on various test cases.
> With this patch, PR68030's generated assembly is reduced from ~750 lines to 
> ~580 lines on x86_64, with both pre-header and loop body simplified.  But,
> 1) It doesn't fix all the problem on x86_64.  Root cause is computation for 
> base address of the first reference is somehow moved outside of loop's 
> pre-header, local CSE can't help in this case.  Though split_constant_offset 
> can back track ssa def chain, it causes possible redundant when there is no 
> CSE opportunities in pre-header.
> 2) It causes regression for PR68030 on AArch64.  I think the regression is 
> caused by IVOPT issues which are exposed by this patch.  Checks on offset 
> validity in get_address_cost is wrong/inaccurate now.  It considers an offset 
> as valid if it's within the maximum offset range that backend supports.  This 
> is not true, for example, AArch64 requires aligned offset additionally.  For 
> example, LDR [base + 2060] is invalid for V4SFmode, although "2060" is within 
> the maximum offset range.  Another issue is also in get_address_cost.  
> Inaccurate cost is computed for "base + offset + INDEX" address expression.  
> When register pressure is low, "base+offset" can be hoisted out and we can 
> use [base + INDEX] addressing mode, whichhis is current behavior.
>
> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.

It looks quite straight-forward with the caveat that it has one
obvious piece that is not in the order
of the complexity of a basic-block.  threadedge_initialize_values
creates the SSA value array
which is zero-initialized (upon use).  That's probably a non-issue for
the use you propose for
the vectorizer (call cse_bbs once per function).  As Ideally I would
like this facility to replace
the loop unrollers own propagate_constants_for_unrolling it might
become an issue though.
In that regard the unroller facility is also more powerful because it
handles regions (including
PHIs).

Note that in particular for SLP vectorization the vectorizer itself
may end up creating quite
some redundancies so I wonder if it's worth to CSE the vectorized loop
body as well
(and given we have PHIs eventually CSE the whole vectorized loop with
pre-header as a region...)

Thanks,
Richard.

> Thanks,
> bin
>
> 2016-05-17 Bin Cheng  <bin.ch...@arm.com>
>
>         PR tree-optimization/68030
>         PR tree-optimization/69710
>         * tree-ssa-dom.c (cse_bbs): New function.
>         * tree-ssa-dom.h (cse_bbs): New declaration.
>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>         Re-associate address by splitting constant offset.
>         (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>         basic block.
>         * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>         changed basic block.
>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>         (changed_bbs): New variable.
>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>         * tree-vectorizer.h (changed_bbs): New declaration.

Reply via email to