On Wed, Oct 2, 2013 at 4:24 AM, Richard Biener <rguent...@suse.de> wrote:
> On Tue, 1 Oct 2013, Cong Hou wrote:
>
>> When aliasing may exist between data refs in a loop, GCC vectorizes it
>> by versioning the loop and adding runtime alias checks. Basically, for
>> each pair of data refs with a possible data dependence, two comparisons
>> are generated to make sure there is no aliasing between them in each
>> iteration of the vectorized loop. If there are many such data ref
>> pairs, the number of comparisons can be very large, which is a big
>> overhead.
>>
>> However, in some cases it is possible to reduce the number of those
>> comparisons. For example, for the following loop, we can detect that
>> b[0] and b[1] are two consecutive memory accesses, so the alias checks
>> between a[0:100]&b[0] and a[0:100]&b[1] can be combined into a single
>> check of a[0:100]&b[0:2]:
>>
>> void foo(int*a, int* b)
>> {
>>    for (int i = 0; i < 100; ++i)
>>     a[i] = b[0] + b[1];
>> }
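>>
>> To illustrate (this is only a sketch of the versioning condition, not
>> the exact code the vectorizer emits, and foo_versioned is a made-up
>> name), the combined guard amounts to:
>>
>> /* One check of a[0:100] against b[0:2] replaces the two original
>>    checks of a[0:100] against b[0] and against b[1].  */
>> void foo_versioned (int *a, int *b)
>> {
>>   if (a + 100 <= b || b + 2 <= a)
>>     for (int i = 0; i < 100; ++i)  /* no overlap: vectorizable copy */
>>       a[i] = b[0] + b[1];
>>   else
>>     for (int i = 0; i < 100; ++i)  /* possible overlap: scalar fallback */
>>       a[i] = b[0] + b[1];
>> }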
>>
>> Actually, the requirement of consecutive memory accesses is stricter
>> than necessary. For the following loop, we can still combine the alias
>> checks between a[0:100]&b[0] and a[0:100]&b[100]:
>>
>> void foo(int*a, int* b)
>> {
>>    for (int i = 0; i < 100; ++i)
>>     a[i] = b[0] + b[100];
>> }
>>
>> This is because if b[0] is not in a[0:100] and b[100] is not in
>> a[0:100], then a[0:100] cannot lie between b[0] and b[100]. We only
>> need to check that a[0:100] and b[0:101] do not overlap.
>>
>> More generally, consider two pairs of data refs (a, b1) and (a, b2).
>> Suppose addr_b1 and addr_b2 are the base addresses of data refs b1 and
>> b2; offset_b1 and offset_b2 (offset_b1 < offset_b2) are the offsets of
>> b1 and b2; and segment_length_a, segment_length_b1, and
>> segment_length_b2 are the segment lengths of a, b1, and b2. Then we can
>> combine the two comparisons into one if the following condition is
>> satisfied:
>>
>> offset_b2 - offset_b1 - segment_length_b1 < segment_length_a
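>>
>> As an illustrative sketch (a hypothetical helper, not code from the
>> patch), the merge decision can be written as:
>>
>> /* Return true if the check of (a, b1) and the check of (a, b2) can be
>>    folded into a single check of a against the range covering both b1
>>    and b2, assuming offset_b1 < offset_b2 and a common base address
>>    for b1 and b2.  */
>> static bool
>> can_combine_alias_checks (long offset_b1, long offset_b2,
>>                           long segment_length_b1, long segment_length_a)
>> {
>>   /* The gap between the two accessed regions of b must be too small
>>      for the segment of a to fit entirely inside it.  */
>>   return offset_b2 - offset_b1 - segment_length_b1 < segment_length_a;
>> }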
>>
>>
>> This patch detects such combination opportunities to reduce the
>> number of alias checks. It has been tested on an x86-64 machine.
>
> Apart from the other comments you got (to which I agree) the patch
> seems to do two things, namely also:
>
> +  /* Extract load and store statements on pointers with zero-stride
> +     accesses.  */
> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
> +    {
>
> which I'd rather see in a separate patch (and done also when
> the loop doesn't require versioning for alias).

yes.

>
> Also combining the alias checks in vect_create_cond_for_alias_checks
> is nice but doesn't properly fix the use of the
> vect-max-version-for-alias-checks param

Yes. The handling of this should be moved to
'vect_prune_runtime_alias_test_list' to avoid premature decisions.



> which currently inhibits
> vectorization of the HIMENO benchmark by default (and makes us look bad
> compared to LLVM).

Here is a small reproducer:

struct A {
  int *base;
  int offset;
  int offset2;
  int offset3;
  int offset4;
  int offset5;
  int offset6;
  int offset7;
  int offset8;
};

void foo (struct A *ar1, struct A *ar2)
{
  int i;
  for (i = 0; i < 10000; i++)
    {
      ar1->base[i] = 2 * ar2->base[i] + ar2->offset + ar2->offset2
                     + ar2->offset3 + ar2->offset4 + ar2->offset5
                     + ar2->offset6; /* + ar2->offset7 + ar2->offset8; */
    }
}

GCC trunk won't vectorize it at -O2 due to the limit.
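
For reference, assuming the reproducer above is saved as foo.c, a
comparison along these lines should show the effect (the exact flags
may vary with the setup; raising the param is what makes the
difference):

  gcc -O2 -ftree-vectorize -c foo.c -fdump-tree-vect-details
  gcc -O2 -ftree-vectorize -c foo.c -fdump-tree-vect-details \
      --param vect-max-version-for-alias-checks=40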


There is another problem we should be tracking: GCC no longer
vectorizes the loop (even with a large
--param=vect-max-version-for-alias-checks=40) when -fno-strict-aliasing
is specified.  However, with additional runtime alias checks, the loop
should still be vectorizable.
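
(Illustrative only: the expectation is that a compile such as

  gcc -O2 -ftree-vectorize -fno-strict-aliasing -c foo.c \
      --param vect-max-version-for-alias-checks=40

should still version the loop with runtime alias checks instead of
giving up on vectorization.)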

David


>
> So I believe this merging should be done incrementally when
> we collect the DDRs we need to test in vect_mark_for_runtime_alias_test.
>
> Thanks for working on this,
> Richard.
>
>>
>> thanks,
>> Cong
>>
>>
>>
>> Index: gcc/tree-vect-loop-manip.c
>> ===================================================================
>> --- gcc/tree-vect-loop-manip.c (revision 202662)
>> +++ gcc/tree-vect-loop-manip.c (working copy)
>> @@ -19,6 +19,10 @@ You should have received a copy of the G
>>  along with GCC; see the file COPYING3.  If not see
>>  <http://www.gnu.org/licenses/>.  */
>>
>> +#include <vector>
>> +#include <utility>
>> +#include <algorithm>
>> +
>>  #include "config.h"
>>  #include "system.h"
>>  #include "coretypes.h"
>> @@ -2248,6 +2252,74 @@ vect_vfa_segment_size (struct data_refer
>>    return segment_length;
>>  }
>>
>> +namespace
>> +{
>> +
>> +/* struct dr_addr_with_seg_len
>> +
>> +   A struct storing information of a data reference, including the data
>> +   ref itself, its basic address, the access offset and the segment length
>> +   for aliasing checks.  */
>> +
>> +struct dr_addr_with_seg_len
>> +{
>> +  dr_addr_with_seg_len (data_reference* d, tree addr, tree off, tree len)
>> +    : dr (d), basic_addr (addr), offset (off), seg_len (len) {}
>> +
>> +  data_reference* dr;
>> +  tree basic_addr;
>> +  tree offset;
>> +  tree seg_len;
>> +};
>> +
>> +/* Operator == between two dr_addr_with_seg_len objects.
>> +
>> +   This equality operator is used to make sure two data refs
>> +   are the same, so that we will consider combining the
>> +   aliasing checks of those two pairs of data-dependent data
>> +   refs.  */
>> +
>> +bool operator == (const dr_addr_with_seg_len& d1,
>> +  const dr_addr_with_seg_len& d2)
>> +{
>> +  return operand_equal_p (d1.basic_addr, d2.basic_addr, 0)
>> + && operand_equal_p (d1.offset, d2.offset, 0)
>> + && operand_equal_p (d1.seg_len, d2.seg_len, 0);
>> +}
>> +
>> +typedef std::pair <dr_addr_with_seg_len, dr_addr_with_seg_len>
>> + dr_addr_with_seg_len_pair_t;
>> +
>> +
>> +/* Operator < between two dr_addr_with_seg_len_pair_t objects.
>> +
>> +   This operator is used to sort objects of dr_addr_with_seg_len_pair_t
>> +   so that we can combine aliasing checks during one scan.  */
>> +
>> +bool operator < (const dr_addr_with_seg_len_pair_t& p1,
>> + const dr_addr_with_seg_len_pair_t& p2)
>> +{
>> +  const dr_addr_with_seg_len& p11 = p1.first;
>> +  const dr_addr_with_seg_len& p12 = p1.second;
>> +  const dr_addr_with_seg_len& p21 = p2.first;
>> +  const dr_addr_with_seg_len& p22 = p2.second;
>> +
>> +  if (p11.basic_addr != p21.basic_addr)
>> +    return p11.basic_addr < p21.basic_addr;
>> +  if (p12.basic_addr != p22.basic_addr)
>> +    return p12.basic_addr < p22.basic_addr;
>> +  if (TREE_CODE (p11.offset) != INTEGER_CST
>> +      || TREE_CODE (p21.offset) != INTEGER_CST)
>> +    return p11.offset < p21.offset;
>> +  if (int_cst_value (p11.offset) != int_cst_value (p21.offset))
>> +    return int_cst_value (p11.offset) < int_cst_value (p21.offset);
>> +  if (TREE_CODE (p12.offset) != INTEGER_CST
>> +      || TREE_CODE (p22.offset) != INTEGER_CST)
>> +    return p12.offset < p22.offset;
>> +  return int_cst_value (p12.offset) < int_cst_value (p22.offset);
>> +}
>> +
>> +}
>>
>>  /* Function vect_create_cond_for_alias_checks.
>>
>> @@ -2292,20 +2364,51 @@ vect_create_cond_for_alias_checks (loop_
>>    if (may_alias_ddrs.is_empty ())
>>      return;
>>
>> +
>> +  /* Basically, for each pair of dependent data refs store_ptr_0
>> +     and load_ptr_0, we create an expression:
>> +
>> +     ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
>> +     || (load_ptr_0 + load_segment_length_0) <= store_ptr_0))
>> +
>> +     for aliasing checks. However, in some cases we can decrease
>> +     the number of checks by combining two checks into one. For
>> +     example, suppose we have another pair of data refs store_ptr_0
>> +     and load_ptr_1, and if the following condition is satisfied:
>> +
>> +     load_ptr_0 < load_ptr_1  &&
>> +     load_ptr_1 - load_ptr_0 - load_segment_length_0 < store_segment_length_0
>> +
>> +     (this condition means, in each iteration of vectorized loop,
>> +     the accessed memory of store_ptr_0 cannot be between the memory
>> +     of load_ptr_0 and load_ptr_1.)
>> +
>> +     we can then use only the following expression to finish the
>> +     aliasing checks between store_ptr_0 & load_ptr_0 and
>> +     store_ptr_0 & load_ptr_1:
>> +
>> +     ((store_ptr_0 + store_segment_length_0) <= load_ptr_0)
>> +     || (load_ptr_1 + load_segment_length_1 <= store_ptr_0))
>> +
>> +     Note that we only consider that load_ptr_0 and load_ptr_1 have the
>> +     same basic address.  */
>> +
>> +  std::vector<dr_addr_with_seg_len_pair_t> ddrs_with_seg_len;
>> +
>> +  /* First, we collect all data ref pairs for aliasing checks.  */
>> +
>>    FOR_EACH_VEC_ELT (may_alias_ddrs, i, ddr)
>>      {
>>        struct data_reference *dr_a, *dr_b;
>>        gimple dr_group_first_a, dr_group_first_b;
>> -      tree addr_base_a, addr_base_b;
>>        tree segment_length_a, segment_length_b;
>>        gimple stmt_a, stmt_b;
>> -      tree seg_a_min, seg_a_max, seg_b_min, seg_b_max;
>>
>>        dr_a = DDR_A (ddr);
>>        stmt_a = DR_STMT (DDR_A (ddr));
>>        dr_group_first_a = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_a));
>>        if (dr_group_first_a)
>> -        {
>> + {
>>    stmt_a = dr_group_first_a;
>>    dr_a = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt_a));
>>   }
>> @@ -2314,20 +2417,11 @@ vect_create_cond_for_alias_checks (loop_
>>        stmt_b = DR_STMT (DDR_B (ddr));
>>        dr_group_first_b = GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt_b));
>>        if (dr_group_first_b)
>> -        {
>> + {
>>    stmt_b = dr_group_first_b;
>>    dr_b = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt_b));
>>   }
>>
>> -      addr_base_a
>> - = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_a),
>> -   size_binop (PLUS_EXPR, DR_OFFSET (dr_a),
>> -       DR_INIT (dr_a)));
>> -      addr_base_b
>> - = fold_build_pointer_plus (DR_BASE_ADDRESS (dr_b),
>> -   size_binop (PLUS_EXPR, DR_OFFSET (dr_b),
>> -       DR_INIT (dr_b)));
>> -
>>        if (!operand_equal_p (DR_STEP (dr_a), DR_STEP (dr_b), 0))
>>   length_factor = scalar_loop_iters;
>>        else
>> @@ -2335,24 +2429,149 @@ vect_create_cond_for_alias_checks (loop_
>>        segment_length_a = vect_vfa_segment_size (dr_a, length_factor);
>>        segment_length_b = vect_vfa_segment_size (dr_b, length_factor);
>>
>> +      dr_addr_with_seg_len_pair_t dr_with_seg_len_pair
>> +  (dr_addr_with_seg_len
>> +       (dr_a, DR_BASE_ADDRESS (dr_a),
>> + size_binop (PLUS_EXPR, DR_OFFSET (dr_a), DR_INIT (dr_a)),
>> + segment_length_a),
>> +   dr_addr_with_seg_len
>> +       (dr_b, DR_BASE_ADDRESS (dr_b),
>> + size_binop (PLUS_EXPR, DR_OFFSET (dr_b), DR_INIT (dr_b)),
>> + segment_length_b));
>> +
>> +      if (dr_with_seg_len_pair.first.basic_addr >
>> +  dr_with_seg_len_pair.second.basic_addr)
>> + std::swap (dr_with_seg_len_pair.first, dr_with_seg_len_pair.second);
>> +
>> +      ddrs_with_seg_len.push_back (dr_with_seg_len_pair);
>> +    }
>> +
>> +  /* Second, we sort the collected data ref pairs so that we can scan
>> +     them once to combine all possible aliasing checks.  */
>> +
>> +  std::sort (ddrs_with_seg_len.begin(), ddrs_with_seg_len.end());
>> +
>> +  /* Remove duplicate data ref pairs.  */
>> +  ddrs_with_seg_len.erase (std::unique (ddrs_with_seg_len.begin(),
>> + ddrs_with_seg_len.end()),
>> +   ddrs_with_seg_len.end());
>> +
>> +  /* We then scan the sorted dr pairs and check if we can combine
>> +     alias checks of two neighbouring dr pairs.  */
>> +
>> +  for (size_t i = 1; i < ddrs_with_seg_len.size (); ++i)
>> +    {
>> +      dr_addr_with_seg_len& dr_a1 = ddrs_with_seg_len[i-1].first;
>> +      dr_addr_with_seg_len& dr_b1 = ddrs_with_seg_len[i-1].second;
>> +      dr_addr_with_seg_len& dr_a2 = ddrs_with_seg_len[i].first;
>> +      dr_addr_with_seg_len& dr_b2 = ddrs_with_seg_len[i].second;
>> +
>> +      if (dr_a1 == dr_a2)
>> + {
>> +  if (dr_b1.basic_addr != dr_b2.basic_addr
>> +      || TREE_CODE (dr_b1.offset) != INTEGER_CST
>> +      || TREE_CODE (dr_b2.offset) != INTEGER_CST)
>> +    continue;
>> +
>> +  int diff = int_cst_value (dr_b2.offset) -
>> +     int_cst_value (dr_b1.offset);
>> +
>> +  gcc_assert (diff > 0);
>> +
>> +  if (diff <= vect_factor
>> +      || (TREE_CODE (dr_b1.seg_len) == INTEGER_CST
>> +  && diff - int_cst_value (dr_b1.seg_len) < vect_factor)
>> +      || (TREE_CODE (dr_b1.seg_len) == INTEGER_CST
>> +  && TREE_CODE (dr_a1.seg_len) == INTEGER_CST
>> +  && diff - int_cst_value (dr_b1.seg_len) <
>> +     int_cst_value (dr_a1.seg_len)))
>> +    {
>> +      if (dump_enabled_p ())
>> + {
>> +  dump_printf_loc
>> +      (MSG_NOTE, vect_location,
>> +       "combining two runtime checks for data references ");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b1.dr));
>> +  dump_printf (MSG_NOTE, " and ");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b2.dr));
>> +  dump_printf (MSG_NOTE, "\n");
>> + }
>> +
>> +      dr_b1.seg_len = size_binop (PLUS_EXPR,
>> +  dr_b2.seg_len, size_int (diff));
>> +      ddrs_with_seg_len.erase (ddrs_with_seg_len.begin () + i);
>> +      --i;
>> +    }
>> + }
>> +      else if (dr_b1 == dr_b2)
>> + {
>> +  if (dr_a1.basic_addr != dr_a2.basic_addr
>> +      || TREE_CODE (dr_a1.offset) != INTEGER_CST
>> +      || TREE_CODE (dr_a2.offset) != INTEGER_CST)
>> +    continue;
>> +
>> +  int diff = int_cst_value (dr_a2.offset) -
>> +     int_cst_value (dr_a1.offset);
>> +
>> +  gcc_assert (diff > 0);
>> +
>> +  if (diff <= vect_factor
>> +      || (TREE_CODE (dr_a1.seg_len) == INTEGER_CST
>> +  && diff - int_cst_value (dr_a1.seg_len) < vect_factor)
>> +      || (TREE_CODE (dr_a1.seg_len) == INTEGER_CST
>> +  && TREE_CODE (dr_b1.seg_len) == INTEGER_CST
>> +  && diff - int_cst_value (dr_a1.seg_len) <
>> +     int_cst_value (dr_b1.seg_len)))
>> +    {
>> +      if (dump_enabled_p ())
>> + {
>> +  dump_printf_loc
>> +      (MSG_NOTE, vect_location,
>> +       "combining two runtime checks for data references ");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a1.dr));
>> +  dump_printf (MSG_NOTE, " and ");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a2.dr));
>> +  dump_printf (MSG_NOTE, "\n");
>> + }
>> +
>> +      dr_a1.seg_len = size_binop (PLUS_EXPR,
>> +  dr_a2.seg_len, size_int (diff));
>> +      ddrs_with_seg_len.erase (ddrs_with_seg_len.begin () + i);
>> +      --i;
>> +    }
>> + }
>> +    }
>> +
>> +  for (size_t i = 0, s = ddrs_with_seg_len.size (); i < s; ++i)
>> +    {
>> +      const dr_addr_with_seg_len& dr_a = ddrs_with_seg_len[i].first;
>> +      const dr_addr_with_seg_len& dr_b = ddrs_with_seg_len[i].second;
>> +      tree segment_length_a = dr_a.seg_len;
>> +      tree segment_length_b = dr_b.seg_len;
>> +
>> +      tree addr_base_a
>> + = fold_build_pointer_plus (dr_a.basic_addr, dr_a.offset);
>> +      tree addr_base_b
>> + = fold_build_pointer_plus (dr_b.basic_addr, dr_b.offset);
>> +
>>        if (dump_enabled_p ())
>>   {
>>    dump_printf_loc (MSG_NOTE, vect_location,
>> -                           "create runtime check for data references ");
>> -  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a));
>> +   "create runtime check for data references ");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_a.dr));
>>    dump_printf (MSG_NOTE, " and ");
>> -  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b));
>> -          dump_printf (MSG_NOTE, "\n");
>> +  dump_generic_expr (MSG_NOTE, TDF_SLIM, DR_REF (dr_b.dr));
>> +  dump_printf (MSG_NOTE, "\n");
>>   }
>>
>> -      seg_a_min = addr_base_a;
>> -      seg_a_max = fold_build_pointer_plus (addr_base_a, segment_length_a);
>> -      if (tree_int_cst_compare (DR_STEP (dr_a), size_zero_node) < 0)
>> +      tree seg_a_min = addr_base_a;
>> +      tree seg_a_max = fold_build_pointer_plus (addr_base_a, segment_length_a);
>> +      if (tree_int_cst_compare (DR_STEP (dr_a.dr), size_zero_node) < 0)
>>   seg_a_min = seg_a_max, seg_a_max = addr_base_a;
>>
>> -      seg_b_min = addr_base_b;
>> -      seg_b_max = fold_build_pointer_plus (addr_base_b, segment_length_b);
>> -      if (tree_int_cst_compare (DR_STEP (dr_b), size_zero_node) < 0)
>> +      tree seg_b_min = addr_base_b;
>> +      tree seg_b_max = fold_build_pointer_plus (addr_base_b, segment_length_b);
>> +      if (tree_int_cst_compare (DR_STEP (dr_b.dr), size_zero_node) < 0)
>>   seg_b_min = seg_b_max, seg_b_max = addr_base_b;
>>
>>        part_cond_expr =
>> @@ -2477,6 +2696,81 @@ vect_loop_versioning (loop_vec_info loop
>>        adjust_phi_and_debug_stmts (orig_phi, e, PHI_RESULT (new_phi));
>>      }
>>
>> +  /* Extract load and store statements on pointers with zero-stride
>> +     accesses.  */
>> +  if (LOOP_REQUIRES_VERSIONING_FOR_ALIAS (loop_vinfo))
>> +    {
>> +
>> +      /* In the loop body, we iterate each statement to check if it is a load
>> + or store. Then we check the DR_STEP of the data reference.  If
>> + DR_STEP is zero, then we will hoist the load statement to the loop
>> + preheader, and move the store statement to the loop exit.  */
>> +
>> +      for (gimple_stmt_iterator si = gsi_start_bb (loop->header);
>> +   !gsi_end_p (si); )
>> + {
>> +  gimple stmt = gsi_stmt (si);
>> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>> +  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
>> +
>> +
>> +  if (dr && integer_zerop (DR_STEP (dr)))
>> +    {
>> +      if (DR_IS_READ (dr))
>> + {
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file,
>> +       "Hoist the load to outside of the loop:\n");
>> +      print_gimple_stmt (dump_file, stmt, 0,
>> + TDF_VOPS|TDF_MEMSYMS);
>> +    }
>> +
>> +  basic_block preheader = loop_preheader_edge (loop)->src;
>> +  gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
>> +  gsi_move_after (&si, &si_dst);
>> + }
>> +      else
>> + {
>> +  gimple_stmt_iterator si_dst =
>> +      gsi_last_bb (single_exit (loop)->dest);
>> +  gsi_move_after (&si, &si_dst);
>> + }
>> +              continue;
>> +    }
>> +  else if (!dr)
>> +          {
>> +            bool hoist = true;
>> +            for (size_t i = 0; i < gimple_num_ops (stmt); i++)
>> +            {
>> +              tree op = gimple_op (stmt, i);
>> +              if (TREE_CODE (op) == INTEGER_CST
>> +                  || TREE_CODE (op) == REAL_CST)
>> +                continue;
>> +              if (TREE_CODE (op) == SSA_NAME)
>> +              {
>> +                gimple def = SSA_NAME_DEF_STMT (op);
>> +                if (def == stmt
>> +                    || gimple_nop_p (def)
>> +                    || !flow_bb_inside_loop_p (loop, gimple_bb (def)))
>> +                  continue;
>> +              }
>> +              hoist = false;
>> +              break;
>> +            }
>> +
>> +            if (hoist)
>> +            {
>> +              basic_block preheader = loop_preheader_edge (loop)->src;
>> +              gimple_stmt_iterator si_dst = gsi_last_bb (preheader);
>> +              gsi_move_after (&si, &si_dst);
>> +              continue;
>> +            }
>> +          }
>> +          gsi_next (&si);
>> + }
>> +    }
>> +
>>    /* End loop-exit-fixes after versioning.  */
>>
>>    if (cond_expr_stmt_list)
>> Index: gcc/ChangeLog
>> ===================================================================
>> --- gcc/ChangeLog (revision 202663)
>> +++ gcc/ChangeLog (working copy)
>> @@ -1,3 +1,8 @@
>> +2013-10-01  Cong Hou  <co...@google.com>
>> +
>> + * tree-vect-loop-manip.c (vect_create_cond_for_alias_checks): Combine
>> + alias checks if it is possible to amortize the runtime overhead.
>> +
>>
>>
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer
