On 05/25/2016 05:22 AM, Bin Cheng wrote:
Hi, As analyzed in PR68303 and PR69710, vectorizer generates
duplicated computations in loop's pre-header basic block when
creating base address for vector reference to the same memory object.
Not a huge surprise. Loop optimizations generally have a tendency to create and/or expose CSE opportunities. Unrolling is a common culprit, there's certainly the possibility for header duplication, code motions and IV rewriting to also expose/create redundant code.

We haven't really had great success squashing these redundancies within the loop optimizer itself. This has lead to running CSE like passes after the loop optimizers or mini-CSE passes like we see here.


Because the duplicated code is out of loop, IVOPT fails to track base
object for these vector references, resulting in missed strength
reduction. It's agreed that vectorizer should be improved to generate
optimal (IVOPT-friendly) code, the difficult part is we want a
generic infrastructure.  After investigation, I tried to introduce a
generic/simple local CSE interface by reusing existing
algorithm/data-structure from tree-ssa-dom (tree-ssa-scopedtables).
I believe you're the only person that's ever tried to re-use that infrastructure -- were there any API/usage patterns that would have worked better for your needs?

ISTM that conceptually if we broke out the basic hashing bits that's what you're most interested in re-using since you're really just building a local CSE.



The interface runs local CSE for each basic block in a bitmap,
customers of this interface only need to record basic blocks in the
bitmap when necessary.  Note we don't need scopedtables' unwinding
facility since the interface runs only for single basic block, this
should be good in terms of compilation time.
Right.



I checked its impact on various test cases. With this patch,
PR68030's generated assembly is reduced from ~750 lines to ~580 lines
on x86_64, with both pre-header and loop body simplified.
That's a good result :-)



 But, 1) It
doesn't fix all the problem on x86_64.  Root cause is computation for
base address of the first reference is somehow moved outside of
loop's pre-header, local CSE can't help in this case.
That's a bid odd -- have you investigated why this is outside the loop header?

 Though
split_constant_offset can back track ssa def chain, it causes
possible redundant when there is no CSE opportunities in pre-header.
Is the problem that when you split you don't know if doing so will be profitable until you you're actually CSE-ing the pre-header?

2) It causes regression for PR68030 on AArch64.  I think the
regression is caused by IVOPT issues which are exposed by this patch.
Checks on offset validity in get_address_cost is wrong/inaccurate
now.  It considers an offset as valid if it's within the maximum
offset range that backend supports.  This is not true, for example,
AArch64 requires aligned offset additionally.  For example, LDR [base
+ 2060] is invalid for V4SFmode, although "2060" is within the
maximum offset range.  Another issue is also in get_address_cost.
Inaccurate cost is computed for "base + offset + INDEX" address
expression.  When register pressure is low, "base+offset" can be
hoisted out and we can use [base + INDEX] addressing mode, whichhis
is current behavior.


> Bootstrap and test on x86_64 and AArch64.  Any comments appreciated.
>
> Thanks,
> bin
>
> 2016-05-17 Bin Cheng  <bin.ch...@arm.com>
>
>         PR tree-optimization/68030
>         PR tree-optimization/69710
>         * tree-ssa-dom.c (cse_bbs): New function.
>         * tree-ssa-dom.h (cse_bbs): New declaration.
>         * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>         Re-associate address by splitting constant offset.
> (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>         basic block.
> * tree-vect-loop-manip.c (vect_gen_niters_for_prolog_loop): Record
>         changed basic block.
>         * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>         (changed_bbs): New variable.
>         (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>         * tree-vectorizer.h (changed_bbs): New declaration.
>
>
> pr69710-20160523.txt
>
>
> diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
> index 8bf5b3c..b2a0b0c 100644
> --- a/gcc/tree-ssa-dom.c
> +++ b/gcc/tree-ssa-dom.c
> @@ -2090,3 +2090,50 @@ lookup_avail_expr (gimple *stmt, bool insert,
>
>    return lhs;
>  }
> +
> +/* A local CSE interface which runs CSE for basic blocks recorded in
> +   CHANGED_BBS.  */
> +
> +void
> +cse_bbs (bitmap changed_bbs)
> +{
> +  unsigned index;
> +  bitmap_iterator bi;
> +  gimple_stmt_iterator gsi;
> +
> +  hash_table<expr_elt_hasher> *avail_exprs;
> +  class avail_exprs_stack *avail_exprs_stack;
> +  class const_and_copies *const_and_copies;
> +
> +  avail_exprs = new hash_table<expr_elt_hasher> (1024);
> +  avail_exprs_stack = new class avail_exprs_stack (avail_exprs);
> +  const_and_copies = new class const_and_copies ();
> +
> +  threadedge_initialize_values ();
> + /* Push a marker on the stacks of local information so that we know how
> +     far to unwind when we finalize this block.  */
> +  avail_exprs_stack->push_marker ();
> +  const_and_copies->push_marker ();
> +
> +  EXECUTE_IF_SET_IN_BITMAP (changed_bbs, 0, index, bi)
> +    {
> +      basic_block bb = BASIC_BLOCK_FOR_FN (cfun, index);
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +  fprintf (dump_file, "\n\nRun local cse on block #%d\n\n", bb->index);
> +
> +      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +  optimize_stmt (bb, gsi, const_and_copies, avail_exprs_stack);
> +
> +      /* Pop stacks to keep it small.  */
> +      avail_exprs_stack->pop_to_marker ();
> +      const_and_copies->pop_to_marker ();
> +    }
> +
> +  delete avail_exprs;
> +  avail_exprs = NULL;
> +
> +  delete avail_exprs_stack;
> +  delete const_and_copies;
> +  threadedge_finalize_values ();
> +}
No real objections here or how it's used in tree-vectorizer.c. I'd like to get to a place where you didn't have to muck with the stacks or threadedge_* at all, but that to me is a cleanup item and not a requirement to move forward.

I'll assume the reassociations you do in tree-vect-data-refs are reasonable.

So how do you want to move forward on this? Are the issues you noted above serious enough to warrant holding this up?

jeff

Reply via email to