Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-22 Thread Alan Hayward


On 22/10/2015 15:15, "Alan Lawrence"  wrote:

>Just one very small point...
>
>On 19/10/15 09:17, Alan Hayward wrote:
>
> > -  if (check_reduction
> > -  && (!commutative_tree_code (code) || !associative_tree_code
>(code)))
> > +  if (check_reduction)
> >  {
> > -  if (dump_enabled_p ())
> > -report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> > -   "reduction: not commutative/associative: ");
> > -  return NULL;
> > +  if (code != COND_EXPR
> > + && (!commutative_tree_code (code) || !associative_tree_code
>(code)))
> > +   {
> > + if (dump_enabled_p ())
> > +   report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> > +   "reduction: not commutative/associative: ");
> > + return NULL;
> > +   }
> > +
> > +  if (code == COND_EXPR)
> > +   *v_reduc_type = COND_REDUCTION;
>
>Wouldn't this be easier written as
>
>if (code == COND_EXPR)
>   *v_reduc_type = COND_REDUCTION;
>else if (!commutative_tree_code (code) || !associative_tree_code (code))
>   {...}
>
>? Your call!
>
>Cheers, Alan


Good spot! I suspect that’s slipped through when I’ve rewritten bits.
I’ll add this in with Richard’s suggestion of the change around the call
to vectorizable_condition.
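
For reference, the combined form would read roughly as follows (an
untested sketch assembled from the hunk above):

if (check_reduction)
  {
    if (code == COND_EXPR)
      *v_reduc_type = COND_REDUCTION;
    else if (!commutative_tree_code (code) || !associative_tree_code (code))
      {
        if (dump_enabled_p ())
          report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
                          "reduction: not commutative/associative: ");
        return NULL;
      }
  }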


Alan.




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-22 Thread Alan Lawrence

Just one very small point...

On 19/10/15 09:17, Alan Hayward wrote:

> -  if (check_reduction
> -  && (!commutative_tree_code (code) || !associative_tree_code (code)))
> +  if (check_reduction)
>  {
> -  if (dump_enabled_p ())
> -report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -  "reduction: not commutative/associative: ");
> -  return NULL;
> +  if (code != COND_EXPR
> +&& (!commutative_tree_code (code) || !associative_tree_code (code)))
> +  {
> +if (dump_enabled_p ())
> +  report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> +  "reduction: not commutative/associative: ");
> +return NULL;
> +  }
> +
> +  if (code == COND_EXPR)
> +  *v_reduc_type = COND_REDUCTION;

Wouldn't this be easier written as

if (code == COND_EXPR)
  *v_reduc_type = COND_REDUCTION;
else if (!commutative_tree_code (code) || !associative_tree_code (code))
  {...}

? Your call!

Cheers, Alan



Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-21 Thread Richard Biener
On Mon, Oct 19, 2015 at 10:17 AM, Alan Hayward  wrote:
>
>
>>On 30/09/2015 13:45, "Richard Biener"  wrote:
>>
>>>On Wed, Sep 23, 2015 at 5:51 PM, Alan Hayward 
>>>wrote:


 On 18/09/2015 14:53, "Alan Hayward"  wrote:

>
>
>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>
>>On 18/09/15 13:17, Richard Biener wrote:
>>>
>>> Ok, I see.
>>>
>>> That this case is already vectorized is because it implements
>>>MAX_EXPR,
>>> modifying it slightly to
>>>
>>> int foo (int *a)
>>> {
>>>int val = 0;
>>>for (int i = 0; i < 1024; ++i)
>>>  if (a[i] > val)
>>>val = a[i] + 1;
>>>return val;
>>> }
>>>
>>> makes it no longer handled by current code.
>>>
>>
>>Yes. I believe the idea for the patch is to handle arbitrary
>>expressions
>>like
>>
>>int foo (int *a)
>>{
>>int val = 0;
>>for (int i = 0; i < 1024; ++i)
>>  if (some_expression (i))
>>val = another_expression (i);
>>return val;
>>}
>
>Yes, that’s correct. Hopefully my new test cases should cover
>everything.
>

 Attached is a new version of the patch containing all the changes
 requested by Richard.
>>>
>>>+  /* Compare the max index vector to the vector of found indexes to
>>>find
>>>+the position of the max value.  This will result in either a
>>>single
>>>+match or all of the values.  */
>>>+  tree vec_compare = make_ssa_name (index_vec_type_signed);
>>>+  gimple vec_compare_stmt = gimple_build_assign (vec_compare,
>>>EQ_EXPR,
>>>+induction_index,
>>>+max_index_vec);
>>>
>>>I'm not sure all targets can handle this.  If I decipher the code
>>>correctly then we do
>>>
>>>  mask = induction_index == max_index_vec;
>>>  vec_and = mask & vec_data;
>>>
>>>plus some casts.  So this is basically
>>>
>>>  vec_and = induction_index == max_index_vec ? vec_data : {0, 0, ... };
>>>
>>>without the need to relate the induction index vector type to the data
>>>vector type.
>>>I believe this is also the form all targets support.
>>
>>
>>Ok, I’ll replace this.
>>
>>>
>>>I am missing a comment before all this code-generation that shows the
>>>transform
>>>result with the variable names used in the code-gen.  I have a hard
>>>time connecting
>>>things here.
>>
>>Ok, I’ll add some comments.
>>
>>>
>>>+  tree matched_data_reduc_cast = build1 (VIEW_CONVERT_EXPR,
>>>scalar_type,
>>>+matched_data_reduc);
>>>+  epilog_stmt = gimple_build_assign (new_scalar_dest,
>>>+matched_data_reduc_cast);
>>>+  new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
>>>+  gimple_assign_set_lhs (epilog_stmt, new_temp);
>>>
>>>this will leave the stmt unsimplified.  scalar sign-changes should use
>>>NOP_EXPR,
>>>not VIEW_CONVERT_EXPR.  The easiest fix is to use fold_convert instead.
>>>Also just do like before - first make_ssa_name and then directly use it
>>>in the
>>>gimple_build_assign.
>>
>>We need the VIEW_CONVERT_EXPR for the cases where we have float data
>>values. The index is always integer.
>>
>>
>>>
>>>The patch is somewhat hard to parse with all the indentation changes.  A
>>>context
>>>diff would be much easier to read in those contexts.
>>
>>Ok, I’ll make the next patch like that
>>
>>>
>>>+  if (v_reduc_type == COND_REDUCTION)
>>>+{
>>>+  widest_int ni;
>>>+
>>>+  if (! max_loop_iterations (loop, &ni))
>>>+   {
>>>+ if (dump_enabled_p ())
>>>+   dump_printf_loc (MSG_NOTE, vect_location,
>>>+"loop count not known, cannot create cond "
>>>+"reduction.\n");
>>>
>>>ugh.  That's bad.
>>>
>>>+  /* The additional index will be the same type as the condition.
>>>Check
>>>+that the loop can fit into this less one (because we'll use up
>>>the
>>>+zero slot for when there are no matches).  */
>>>+  tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type);
>>>+  if (wi::geu_p (ni, wi::to_widest (max_index)))
>>>+   {
>>>+ if (dump_enabled_p ())
>>>+   dump_printf_loc (MSG_NOTE, vect_location,
>>>+"loop size is greater than data size.\n");
>>>+ return false;
>>>
>>>Likewise.
>>
>>We could do better if we made the index type larger.
>>But as a first implementation of this optimisation, I didn’t want to
>>overcomplicate things more.
>>
>>>
>>>@@ -5327,6 +5540,8 @@ vectorizable_reduction (gimple stmt,
>>>gimple_stmt_iterator *gsi,
>>>   if (dump_enabled_p ())
>>> dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");
>>>
>>>+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-10-01 Thread Alan Hayward


On 30/09/2015 13:45, "Richard Biener"  wrote:

>On Wed, Sep 23, 2015 at 5:51 PM, Alan Hayward 
>wrote:
>>
>>
>> On 18/09/2015 14:53, "Alan Hayward"  wrote:
>>
>>>
>>>
>>>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>>>
On 18/09/15 13:17, Richard Biener wrote:
>
> Ok, I see.
>
> That this case is already vectorized is because it implements
>MAX_EXPR,
> modifying it slightly to
>
> int foo (int *a)
> {
>int val = 0;
>for (int i = 0; i < 1024; ++i)
>  if (a[i] > val)
>val = a[i] + 1;
>return val;
> }
>
> makes it no longer handled by current code.
>

Yes. I believe the idea for the patch is to handle arbitrary
expressions
like

int foo (int *a)
{
int val = 0;
for (int i = 0; i < 1024; ++i)
  if (some_expression (i))
val = another_expression (i);
return val;
}
>>>
>>>Yes, that’s correct. Hopefully my new test cases should cover
>>>everything.
>>>
>>
>> Attached is a new version of the patch containing all the changes
>> requested by Richard.
>
>+  /* Compare the max index vector to the vector of found indexes to
>find
>+the position of the max value.  This will result in either a
>single
>+match or all of the values.  */
>+  tree vec_compare = make_ssa_name (index_vec_type_signed);
>+  gimple vec_compare_stmt = gimple_build_assign (vec_compare,
>EQ_EXPR,
>+induction_index,
>+max_index_vec);
>
>I'm not sure all targets can handle this.  If I decipher the code
>correctly then we do
>
>  mask = induction_index == max_index_vec;
>  vec_and = mask & vec_data;
>
>plus some casts.  So this is basically
>
>  vec_and = induction_index == max_index_vec ? vec_data : {0, 0, ... };
>
>without the need to relate the induction index vector type to the data
>vector type.
>I believe this is also the form all targets support.


Ok, I’ll replace this.

>
>I am missing a comment before all this code-generation that shows the
>transform
>result with the variable names used in the code-gen.  I have a hard
>time connecting
>things here.

Ok, I’ll add some comments.

>
>+  tree matched_data_reduc_cast = build1 (VIEW_CONVERT_EXPR,
>scalar_type,
>+matched_data_reduc);
>+  epilog_stmt = gimple_build_assign (new_scalar_dest,
>+matched_data_reduc_cast);
>+  new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
>+  gimple_assign_set_lhs (epilog_stmt, new_temp);
>
>this will leave the stmt unsimplified.  scalar sign-changes should use
>NOP_EXPR,
>not VIEW_CONVERT_EXPR.  The easiest fix is to use fold_convert instead.
>Also just do like before - first make_ssa_name and then directly use it
>in the
>gimple_build_assign.

We need the VIEW_CONVERT_EXPR for the cases where we have float data
values. The index is always integer.
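
To illustrate the distinction (a sketch using the patch's variable
names; only the VIEW_CONVERT_EXPR line reflects the actual patch):

/* Integer data: a sign/size change is a value conversion, which
   fold_convert will emit (and simplify) as a NOP_EXPR.  */
tree as_value = fold_convert (scalar_type, matched_data_reduc);

/* Float data selected via the integer index vector: the bits must be
   reinterpreted rather than converted, hence VIEW_CONVERT_EXPR.  */
tree as_bits = build1 (VIEW_CONVERT_EXPR, scalar_type, matched_data_reduc);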


>
>The patch is somewhat hard to parse with all the indentation changes.  A
>context
>diff would be much easier to read in those contexts.

Ok, I’ll make the next patch like that

>
>+  if (v_reduc_type == COND_REDUCTION)
>+{
>+  widest_int ni;
>+
>+  if (! max_loop_iterations (loop, &ni))
>+   {
>+ if (dump_enabled_p ())
>+   dump_printf_loc (MSG_NOTE, vect_location,
>+"loop count not known, cannot create cond "
>+"reduction.\n");
>
>ugh.  That's bad.
>
>+  /* The additional index will be the same type as the condition.
>Check
>+that the loop can fit into this less one (because we'll use up
>the
>+zero slot for when there are no matches).  */
>+  tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type);
>+  if (wi::geu_p (ni, wi::to_widest (max_index)))
>+   {
>+ if (dump_enabled_p ())
>+   dump_printf_loc (MSG_NOTE, vect_location,
>+"loop size is greater than data size.\n");
>+ return false;
>
>Likewise.

We could do better if we made the index type larger.
But as a first implementation of this optimisation, I didn’t want to
overcomplicate things more.
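
(For example, with char-sized data the index vector elements are 8 bits
wide, so any loop with more than 254 iterations, i.e. the maximum index
255 less the reserved zero slot, already fails the check quoted above.)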

>
>@@ -5327,6 +5540,8 @@ vectorizable_reduction (gimple stmt,
>gimple_stmt_iterator *gsi,
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");
>
>+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>+
>   /* FORNOW: Multiple types are not supported for condition.  */
>   if (code == COND_EXPR)
>
>this change looks odd (or wrong).  The type should be _only_ set/changed
>during
>analysis.


The problem is, for COND_EXPRs, this function calls
vectorizable_condition(), which sets STMT_VINFO_TYPE to
condition_vec_info_type.

Therefore we need something to restore it to reduc_vec_info_type afterwards.

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-30 Thread Richard Biener
On Wed, Sep 23, 2015 at 5:51 PM, Alan Hayward  wrote:
>
>
> On 18/09/2015 14:53, "Alan Hayward"  wrote:
>
>>
>>
>>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>>
>>>On 18/09/15 13:17, Richard Biener wrote:

 Ok, I see.

 That this case is already vectorized is because it implements MAX_EXPR,
 modifying it slightly to

 int foo (int *a)
 {
int val = 0;
for (int i = 0; i < 1024; ++i)
  if (a[i] > val)
val = a[i] + 1;
return val;
 }

 makes it no longer handled by current code.

>>>
>>>Yes. I believe the idea for the patch is to handle arbitrary expressions
>>>like
>>>
>>>int foo (int *a)
>>>{
>>>int val = 0;
>>>for (int i = 0; i < 1024; ++i)
>>>  if (some_expression (i))
>>>val = another_expression (i);
>>>return val;
>>>}
>>
>>Yes, that’s correct. Hopefully my new test cases should cover everything.
>>
>
> Attached is a new version of the patch containing all the changes
> requested by Richard.

+  /* Compare the max index vector to the vector of found indexes to find
+the position of the max value.  This will result in either a single
+match or all of the values.  */
+  tree vec_compare = make_ssa_name (index_vec_type_signed);
+  gimple vec_compare_stmt = gimple_build_assign (vec_compare, EQ_EXPR,
+induction_index,
+max_index_vec);

I'm not sure all targets can handle this.  If I decipher the code
correctly then we do

  mask = induction_index == max_index_vec;
  vec_and = mask & vec_data;

plus some casts.  So this is basically

  vec_and = induction_index == max_index_vec ? vec_data : {0, 0, ... };

without the need to relate the induction index vector type to the data
vector type.
I believe this is also the form all targets support.
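
In the patch's variable names that would be roughly (sketch only,
untested; the compare type follows the patch's index_vec_type_signed):

  /* vec_and = (induction_index == max_index_vec) ? vec_data : {0, ...};  */
  tree zero_vec = build_zero_cst (TREE_TYPE (vec_data));
  tree eq = build2 (EQ_EXPR, index_vec_type_signed,
                    induction_index, max_index_vec);
  tree vec_and = make_ssa_name (TREE_TYPE (vec_data));
  gimple vec_and_stmt = gimple_build_assign (vec_and, VEC_COND_EXPR,
                                             eq, vec_data, zero_vec);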

I am missing a comment before all this code-generation that shows the transform
result with the variable names used in the code-gen.  I have a hard
time connecting
things here.

+  tree matched_data_reduc_cast = build1 (VIEW_CONVERT_EXPR, scalar_type,
+matched_data_reduc);
+  epilog_stmt = gimple_build_assign (new_scalar_dest,
+matched_data_reduc_cast);
+  new_temp = make_ssa_name (new_scalar_dest, epilog_stmt);
+  gimple_assign_set_lhs (epilog_stmt, new_temp);

this will leave the stmt unsimplified.  scalar sign-changes should use NOP_EXPR,
not VIEW_CONVERT_EXPR.  The easiest fix is to use fold_convert instead.
Also just do like before - first make_ssa_name and then directly use it in the
gimple_build_assign.

The patch is somewhat hard to parse with all the indentation changes.  A context
diff would be much easier to read in those contexts.

+  if (v_reduc_type == COND_REDUCTION)
+{
+  widest_int ni;
+
+  if (! max_loop_iterations (loop, &ni))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"loop count not known, cannot create cond "
+"reduction.\n");

ugh.  That's bad.

+  /* The additional index will be the same type as the condition.  Check
+that the loop can fit into this less one (because we'll use up the
+zero slot for when there are no matches).  */
+  tree max_index = TYPE_MAX_VALUE (cr_index_scalar_type);
+  if (wi::geu_p (ni, wi::to_widest (max_index)))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"loop size is greater than data size.\n");
+ return false;

Likewise.

@@ -5327,6 +5540,8 @@ vectorizable_reduction (gimple stmt,
gimple_stmt_iterator *gsi,
   if (dump_enabled_p ())
 dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");

+  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
+
   /* FORNOW: Multiple types are not supported for condition.  */
   if (code == COND_EXPR)

this change looks odd (or wrong).  The type should be _only_ set/changed during
analysis.

+
+  /* For cond reductions we need to add an additional conditional based on
+the loop index.  */
+  if (v_reduc_type == COND_REDUCTION)
+   {
+ int nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
+ int k;
...
+ STMT_VINFO_VECTYPE (index_vec_info) = cr_index_vector_type;
+ set_vinfo_for_stmt (index_condition, index_vec_info);
+
+ /* Update the phi with the vec cond.  */
+ add_phi_arg (new_phi, cond_name, loop_latch_edge (loop),
+  UNKNOWN_LOCATION);

same as before - I am missing a comment that shows the generated code
and connects
the local vars used.


+ tree ccompare_name = make_ssa_name (TREE_TYPE (ccompare));
+ gimple ccompare_stmt = gimple_build_assign (ccompare_name, ccompare);

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-23 Thread Alan Hayward


On 18/09/2015 14:53, "Alan Hayward"  wrote:

>
>
>On 18/09/2015 14:26, "Alan Lawrence"  wrote:
>
>>On 18/09/15 13:17, Richard Biener wrote:
>>>
>>> Ok, I see.
>>>
>>> That this case is already vectorized is because it implements MAX_EXPR,
>>> modifying it slightly to
>>>
>>> int foo (int *a)
>>> {
>>>int val = 0;
>>>for (int i = 0; i < 1024; ++i)
>>>  if (a[i] > val)
>>>val = a[i] + 1;
>>>return val;
>>> }
>>>
>>> makes it no longer handled by current code.
>>>
>>
>>Yes. I believe the idea for the patch is to handle arbitrary expressions
>>like
>>
>>int foo (int *a)
>>{
>>int val = 0;
>>for (int i = 0; i < 1024; ++i)
>>  if (some_expression (i))
>>val = another_expression (i);
>>return val;
>>}
>
>Yes, that’s correct. Hopefully my new test cases should cover everything.
>

Attached is a new version of the patch containing all the changes
requested by Richard.


Thanks,
Alan.




0001-Support-for-vectorizing-conditional-expressions.patch
Description: Binary data


Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-18 Thread Richard Biener
On Tue, Sep 15, 2015 at 5:32 PM, Alan Hayward  wrote:
>
>
> On 15/09/2015 13:09, "Richard Biener"  wrote:
>
>>
>>Now comments on the patch itself.
>>
>>+  if (code == COND_EXPR)
>>+   *v_reduc_type = COND_REDUCTION;
>>
>>so why not simply use COND_EXPR instead of the new v_reduc_type?
>
> v_reduc_type is also dependent on check_reduction (which comes from
> !nested_cycle in vectorizable_reduction).
> It seemed messy to keep on checking for both those things throughout.
>
> In my patch to catch simpler condition reductions, I’ll be adding another
> value to this enum too. v_reduc_type will be set to this new value based
> on the same properties for COND_REDUCTION plus some additional constraints.
>
>>
>>+  if (check_reduction && code != COND_EXPR &&
>>+  vect_is_slp_reduction (loop_info, phi, def_stmt))
>>
>>&& goes to the next line
>
> ok
>
>>
>>+ /* Reduction of the max index and a reduction of the found
>>+values.  */
>>+ epilogue_cost += add_stmt_cost (target_cost_data, 1,
>>+ vec_to_scalar, stmt_info, 0,
>>+ vect_epilogue);
>>
>>vec_to_scalar once isn't what the comment suggests.  Instead the
>>comment suggests twice what a regular reduction would do
>>but I guess we can "hide" the vec_to_scalar cost and "merge" it
>>with the broadcast.  Thus make the above two vector_stmt costs?
>>
>>+ /* A broadcast of the max value.  */
>>+ epilogue_cost += add_stmt_cost (target_cost_data, 2,
>>+ scalar_to_vec, stmt_info, 0,
>>+ vect_epilogue);
>>
>>comment suggests a single broadcast.
>
> I’ve made a copy/paste error here. Just need to swap the 1 and the 2.
>
>
>>
>>@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi)
>>  the final vector of induction results:  */
>>   exit_phi = NULL;
>>   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg)
>>-{
>>+   {
>>  gimple use_stmt = USE_STMT (use_p);
>>  if (is_gimple_debug (use_stmt))
>>continue;
>>
>>please avoid unrelated whitespace changes.
>
> Ok. I was changing “8 spaces” to a tab, but happy to revert.
>
>>
>>+  case COND_EXPR:
>>+   if (v_reduc_type == COND_REDUCTION)
>>+ {
>>...
>>+   /* Fall through.  */
>>+
>>   case MIN_EXPR:
>>   case MAX_EXPR:
>>-  case COND_EXPR:
>>
>>aww, so we could already handle COND_EXPR reductions?  How do they
>>differ from what you add?  Did you see if that path is even exercised
>>today?
>
> Today, COND_EXPRs are only supported when they are nested inside a loop.
> See the vect-cond-*.c tests.
> For example:
>
> for (j = 0; j < M; j++)
> {
>   x = x_in[j];
>   curr_a = a[0];
>
> for (i = 0; i < N; i++)
>   {
> next_a = a[i+1];
> curr_a = x > c[i] ? curr_a : next_a;
>   }
>   x_out[j] = curr_a;
> }
>
>
> In that case, only the outer loop is vectorised.
>
>>
>>+   /* Create a vector of {init_value, 0, 0, 0...}.  */
>>+   vec<constructor_elt, va_gc> *v;
>>+   vec_alloc (v, nunits);
>>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val);
>>+   if (SCALAR_FLOAT_TYPE_P (scalar_type))
>>+ for (i = 1; i < nunits; ++i)
>>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>>+   build_real (scalar_type,
>>dconst0));
>>+   else
>>+ for (i = 1; i < nunits; ++i)
>>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>>+   build_int_cst (scalar_type, 0));
>>+   init_def = build_constructor (vectype, v);
>>
>>you can unify the float/int case by using build_zero_cst (scalar_type).
>>Note that you should build a vector constant instead of a constructor
>>if init_val is a constant.  The convenient way is to build the vector
>>elements into a tree[] array and use build_vector_stat in that case.
>
> Ok, will switch to build_zero_cst.
> Also, I will switch my vector to {init_value, init_value, init_value…}.
> I had {init_value, 0, 0, 0…} because I was going to have the option of
> ADD_REDUC_EXPR, but that got removed along the way.

You can then simply use build_vector_from_val ().
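
i.e. something like this, assuming init_val's type matches the element
type of vectype:

  init_def = build_vector_from_val (vectype, init_val);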

>>
>>+  /* Find maximum value from the vector of found indexes.  */
>>+  tree max_index = make_temp_ssa_name (index_scalar_type, NULL, "");
>>
>>just use make_ssa_name (index_scalar_type);
>
> Ok
>
>>
>>+  /* Convert the vector of data to the same type as the EQ.  */
>>+  tree vec_data_cast;
>>+  if ( TYPE_UNSIGNED (index_vec_type))
>>+   {
>>
>>How come it never happens the element
>>sizes do not match?  (int index type and double data type?)
>
> This was a little unclear.
> The induction index is originally created as an unsigned version of the
> type of the data vector.
> 

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-18 Thread Alan Lawrence

On 18/09/15 13:17, Richard Biener wrote:


Ok, I see.

That this case is already vectorized is because it implements MAX_EXPR,
modifying it slightly to

int foo (int *a)
{
   int val = 0;
   for (int i = 0; i < 1024; ++i)
 if (a[i] > val)
   val = a[i] + 1;
   return val;
}

makes it no longer handled by current code.



Yes. I believe the idea for the patch is to handle arbitrary expressions like

int foo (int *a)
{
   int val = 0;
   for (int i = 0; i < 1024; ++i)
 if (some_expression (i))
   val = another_expression (i);
   return val;
}

Cheers,
Alan



Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-18 Thread Alan Hayward

On 18/09/2015 14:26, "Alan Lawrence"  wrote:

>On 18/09/15 13:17, Richard Biener wrote:
>>
>> Ok, I see.
>>
>> That this case is already vectorized is because it implements MAX_EXPR,
>> modifying it slightly to
>>
>> int foo (int *a)
>> {
>>int val = 0;
>>for (int i = 0; i < 1024; ++i)
>>  if (a[i] > val)
>>val = a[i] + 1;
>>return val;
>> }
>>
>> makes it no longer handled by current code.
>>
>
>Yes. I believe the idea for the patch is to handle arbitrary expressions
>like
>
>int foo (int *a)
>{
>int val = 0;
>for (int i = 0; i < 1024; ++i)
>  if (some_expression (i))
>val = another_expression (i);
>return val;
>}


Yes, that’s correct. Hopefully my new test cases should cover everything.


Alan.




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-15 Thread Richard Biener
On Fri, Sep 11, 2015 at 6:29 PM, Ramana Radhakrishnan
 wrote:
>
 Saying that all reductions have equivalent performance is unlikely to be
 true for many platforms.  On PowerPC, for example, a PLUS reduction has
 very different cost from a MAX reduction.  If the model isn't
 fine-grained enough, let's please be aggressive about fixing it.  I'm
 fine if it's a separate patch, but in my mind this shouldn't be allowed
 to languish.
>>>
>>> ...I agree that the general vectoriser cost model could probably be
>>> improved, but it seems fairer for that improvement to be done by whoever
>>> adds the patterns that need it.
>>
>> All right.  But in response to Ramana's comment, are all relevant
>> reductions of similar cost on each ARM platform?  Obviously they don't
>> have the same cost on different platforms, but the question is whether a
>> reduc_plus, reduc_max, etc., has identical cost on each individual
>> platform.  If not, ARM may have a concern as well.  I don't know the
>> situation for x86 either.
>
> From cauldron I have a note that we need to look at the vectorizer cost model
> for both the ARM and AArch64 backends and move away from
> the set of magic constants that it currently returns.

Indeed.

> On AArch32, all the reduc_ patterns are emulated with pair-wise operations
> while on AArch64 they aren't. Thus they aren't likely to be the same cost as a
> standard vector arithmetic instruction. What difference this makes in practice
> remains to be seen; however, the first step is moving towards the newer
> vectorizer cost model interface.
>
> I'll put this on a list of things for us to look at but I'm not sure who/when
> will get around to looking at this.

Note that the target should be able to "see" the cond reduction via the
add_stmt_cost hook calls.  It should see 'where' as the epilogue and
'kind' as a hint - recording stmt_info which should have sufficient info
for a good guess.  At finish_cost () time the target can compute a proper
overall cost.
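
A purely hypothetical sketch of such a hook (names, bookkeeping and
costs are all illustrative, not any real target's code):

/* Hypothetical target add_stmt_cost hook: spot reductions in the
   epilogue via 'where' and 'kind', and cost them as a log2(nunits)
   pairwise sequence instead of a single vector op.  */
static unsigned
example_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
                       stmt_vec_info stmt_info, int misalign,
                       enum vect_cost_model_location where)
{
  unsigned *costs = (unsigned *) data;  /* per-location accumulators */
  int stmt_cost = 1;                    /* default: one vector op */

  if (where == vect_epilogue && kind == vec_to_scalar && stmt_info)
    stmt_cost = exact_log2 (TYPE_VECTOR_SUBPARTS
                            (STMT_VINFO_VECTYPE (stmt_info)));

  unsigned retval = count * stmt_cost;
  costs[where] += retval;
  return retval;
}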

Yes, the vectorizer "IL" (the stmt-infos) isn't very powerful and esp. for
code not corresponding to existing gimple stmts it doesn't even exist.
We need to improve in that area to better represent the desired transform.

Richard.

> regards
> Ramana


Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-15 Thread Richard Biener
On Thu, Sep 10, 2015 at 4:51 PM, Alan Hayward  wrote:
> Hi,
> This patch (attached) adds support for vectorizing conditional expressions
> (PR 65947), for example:
>
> int condition_reduction (int *a, int min_v)
> {
>   int last = 0;
>   for (int i = 0; i < N; i++)
> if (a[i] < min_v)
>   last = a[i];
>   return last;
> }
>
> To do this the loop is vectorised to create a vector of data results (ie
> of matching a[i] values). Using an induction variable, an additional
> vector is added containing the indexes where the matches occurred. In the
> function epilogue this is reduced to a single max value and then used to
> index into the vector of data results.
> When no values are matched in the loop, the indexes vector will contain
> all zeroes, eventually matching the first entry in the data results vector.
>
> To vectorize successfully, support is required for REDUC_MAX_EXPR. This is
> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that
> REDUC_MAX_EXPR is not supported for the required modes, failing the
> vectorization. On mips it complains that the required vcond expression is
> not supported. It is suggested the relevant backend experts add the
> required backend support.
>
> Using a simple testcase based around a large value of N and run on an
> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of
> its original time.
>
> This patch caused binary differences in three spec2006 binaries on aarch64
> - 416.gamess, 435.gromacs and 456.hmmer. Running them on a juno board
> showed no improvement or degradation in runtime.
>
>
> In the near future I hope to submit a further patch (as PR 66558) which
> optimises the case where the result is simply the index of the loop, for
> example:
> int condition_reduction (int *a, int min_v)
> {
>   int last = 0;
>   for (int i = 0; i < N; i++)
> if (a[i] < min_v)
>   last = i;
>   return last;
> }
> In this case a lot of the new code can be optimized away.
>
> I have run check for aarch64, arm and x86 and have seen no regressions.
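
In rough per-lane pseudocode, the transform described above amounts to
the following (names are illustrative, not the patch's):

/* Vector body, for each lane j of the vector iteration starting at i:
     mask[j]  = a[i+j] < min_v;
     data[j]  = mask[j] ? a[i+j] : data[j];
     index[j] = mask[j] ? i+j+1  : index[j];    (0 is the no-match slot)
   Epilogue:
     k    = reduc_max (index);                  (0 iff no lane ever matched)
     last = data[lane whose index equals k];    (lane 0 carries the initial
                                                 value)  */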

Now comments on the patch itself.

+  if (code == COND_EXPR)
+   *v_reduc_type = COND_REDUCTION;

so why not simply use COND_EXPR instead of the new v_reduc_type?

+  if (check_reduction && code != COND_EXPR &&
+  vect_is_slp_reduction (loop_info, phi, def_stmt))

&& goes to the next line

+ /* Reduction of the max index and a reduction of the found
+values.  */
+ epilogue_cost += add_stmt_cost (target_cost_data, 1,
+ vec_to_scalar, stmt_info, 0,
+ vect_epilogue);

vec_to_scalar once isn't what the comment suggests.  Instead the
comment suggests twice what a regular reduction would do
but I guess we can "hide" the vec_to_scalar cost and "merge" it
with the broadcast.  Thus make the above two vector_stmt costs?

+ /* A broadcast of the max value.  */
+ epilogue_cost += add_stmt_cost (target_cost_data, 2,
+ scalar_to_vec, stmt_info, 0,
+ vect_epilogue);

comment suggests a single broadcast.

@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi)
  the final vector of induction results:  */
   exit_phi = NULL;
   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg)
-{
+   {
  gimple use_stmt = USE_STMT (use_p);
  if (is_gimple_debug (use_stmt))
continue;

please avoid unrelated whitespace changes.

+  case COND_EXPR:
+   if (v_reduc_type == COND_REDUCTION)
+ {
...
+   /* Fall through.  */
+
   case MIN_EXPR:
   case MAX_EXPR:
-  case COND_EXPR:

aww, so we could already handle COND_EXPR reductions?  How do they
differ from what you add?  Did you see if that path is even exercised today?

+   /* Create a vector of {init_value, 0, 0, 0...}.  */
+   vec<constructor_elt, va_gc> *v;
+   vec_alloc (v, nunits);
+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val);
+   if (SCALAR_FLOAT_TYPE_P (scalar_type))
+ for (i = 1; i < nunits; ++i)
+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+   build_real (scalar_type, dconst0));
+   else
+ for (i = 1; i < nunits; ++i)
+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+   build_int_cst (scalar_type, 0));
+   init_def = build_constructor (vectype, v);

you can unify the float/int case by using build_zero_cst (scalar_type).
Note that you should build a vector constant instead of a constructor
if init_val is a constant.  The convenient way is to build the vector
elements into a tree[] array and use build_vector_stat in that case.

+  /* Find maximum value from the vector of found indexes.  */
+  tree max_index = make_temp_ssa_name (index_scalar_type, NULL, "");

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-15 Thread Alan Hayward


On 15/09/2015 13:09, "Richard Biener"  wrote:

>
>Now comments on the patch itself.
>
>+  if (code == COND_EXPR)
>+   *v_reduc_type = COND_REDUCTION;
>
>so why not simply use COND_EXPR instead of the new v_reduc_type?

v_reduc_type is also dependent on check_reduction (which comes from
!nested_cycle in vectorizable_reduction).
It seemed messy to keep on checking for both those things throughout.

In my patch to catch simpler condition reductions, I’ll be adding another
value to this enum too. v_reduc_type will be set to this new value based
on the same properties for COND_REDUCTION plus some additional constraints.
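
For reference, the enum as the patch has it is just the following (a
reconstruction; the name of the default enumerator is an assumption):

enum vect_reduction_type {
  TREE_CODE_REDUCTION,  /* assumed default: reduction through a tree code */
  COND_REDUCTION        /* set for COND_EXPR reductions, as above */
};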

>
>+  if (check_reduction && code != COND_EXPR &&
>+  vect_is_slp_reduction (loop_info, phi, def_stmt))
>
>&& goes to the next line

ok

>
>+ /* Reduction of the max index and a reduction of the found
>+values.  */
>+ epilogue_cost += add_stmt_cost (target_cost_data, 1,
>+ vec_to_scalar, stmt_info, 0,
>+ vect_epilogue);
>
>vec_to_scalar once isn't what the comment suggests.  Instead the
>comment suggests twice what a regular reduction would do
>but I guess we can "hide" the vec_to_scalar cost and "merge" it
>with the broadcast.  Thus make the above two vector_stmt costs?
>
>+ /* A broadcast of the max value.  */
>+ epilogue_cost += add_stmt_cost (target_cost_data, 2,
>+ scalar_to_vec, stmt_info, 0,
>+ vect_epilogue);
>
>comment suggests a single broadcast.

I’ve made a copy/paste error here. Just need to swap the 1 and the 2.


>
>@@ -3705,7 +3764,7 @@ get_initial_def_for_induction (gimple iv_phi)
>  the final vector of induction results:  */
>   exit_phi = NULL;
>   FOR_EACH_IMM_USE_FAST (use_p, imm_iter, loop_arg)
>-{
>+   {
>  gimple use_stmt = USE_STMT (use_p);
>  if (is_gimple_debug (use_stmt))
>continue;
>
>please avoid unrelated whitespace changes.

Ok. I was changing “8 spaces” to a tab, but happy to revert.

>
>+  case COND_EXPR:
>+   if (v_reduc_type == COND_REDUCTION)
>+ {
>...
>+   /* Fall through.  */
>+
>   case MIN_EXPR:
>   case MAX_EXPR:
>-  case COND_EXPR:
>
>aww, so we could already handle COND_EXPR reductions?  How do they
>differ from what you add?  Did you see if that path is even exercised
>today?

Today, COND_EXPRs are only supported when they are nested inside a loop.
See the vect-cond-*.c tests.
For example:

for (j = 0; j < M; j++)
{
  x = x_in[j];
  curr_a = a[0];

for (i = 0; i < N; i++)
  {
next_a = a[i+1];
curr_a = x > c[i] ? curr_a : next_a;
  }
  x_out[j] = curr_a;
}


In that case, only the outer loop is vectorised.

>
>+   /* Create a vector of {init_value, 0, 0, 0...}.  */
>+   vec<constructor_elt, va_gc> *v;
>+   vec_alloc (v, nunits);
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, init_val);
>+   if (SCALAR_FLOAT_TYPE_P (scalar_type))
>+ for (i = 1; i < nunits; ++i)
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>+   build_real (scalar_type,
>dconst0));
>+   else
>+ for (i = 1; i < nunits; ++i)
>+   CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
>+   build_int_cst (scalar_type, 0));
>+   init_def = build_constructor (vectype, v);
>
>you can unify the float/int case by using build_zero_cst (scalar_type).
>Note that you should build a vector constant instead of a constructor
>if init_val is a constant.  The convenient way is to build the vector
>elements into a tree[] array and use build_vector_stat in that case.

Ok, will switch to build_zero_cst.
Also, I will switch my vector to {init_value, init_value, init_value…}.
I had {init_value, 0, 0, 0…} because I was going to have the option of
ADD_REDUC_EXPR, but that got removed along the way.

>
>+  /* Find maximum value from the vector of found indexes.  */
>+  tree max_index = make_temp_ssa_name (index_scalar_type, NULL, "");
>
>just use make_ssa_name (index_scalar_type);

Ok

>
>+  /* Convert the vector of data to the same type as the EQ.  */
>+  tree vec_data_cast;
>+  if ( TYPE_UNSIGNED (index_vec_type))
>+   {
>
>How come it never happens the element
>sizes do not match?  (int index type and double data type?)

This was a little unclear.
The induction index is originally created as an unsigned version of the
type of the data vector.
(see the definition of cr_index_vector_type in vectorizable_reduction(),
which is then used to create cond_name)

I will remove the if and replace with a gcc_checking_assert(TYPE_UNSIGNED
(index_vec_type))
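
For context, that creation is roughly the following (reconstructed from
the description above, not the exact patch text):

/* Build an unsigned index type with the same element width and lane
   count as the data vector (hypothetical reconstruction).  */
tree cr_index_scalar_type
  = make_unsigned_type (GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (vectype))));
tree cr_index_vector_type
  = build_vector_type (cr_index_scalar_type, TYPE_VECTOR_SUBPARTS (vectype));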


>
>+  /* Where the max index occurred, use the value from the data
>vector.  */
>+  tree 

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-14 Thread Bill Schmidt
On Mon, 2015-09-14 at 10:47 +0100, Alan Lawrence wrote:
> On 11/09/15 14:19, Bill Schmidt wrote:
> >
> > A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
> > that has to be broadcast back to a vector, and the best way to implement
> > it for us already has the max value in all positions of a vector.  But
> > that is something we should be able to fix with simplify-rtx in the back
> > end.
> 
> Reading this thread again, this bit stands out as unaddressed. Yes PowerPC 
> can 
> "fix" this with simplify-rtx, but the vector cost model will not take this 
> into 
> account - it will think that the broadcast-back-to-a-vector requires an extra 
> operation after the reduction, whereas in fact it will not.
> 
> Does that suggest we should have a new entry in vect_cost_for_stmt for 
> vec_to_scalar-and-back-to-vector (that defaults to 
> vec_to_scalar+scalar_to_vec, 
> but on some architectures e.g. PowerPC would be the same as vec_to_scalar)?

Ideally I think we need to do something for that, yeah.  The back ends
could try to patch up the cost when finishing costs for the loop body,
epilogue, etc., but that would be somewhat of a guess; it would be
better to just be up-front that we're doing a reduction to a vector.

As part of this, I dislike the term "vec_to_scalar", which is somewhat
vague about what's going on (it sounds like it could mean a vector
extract operation, which is more of an inverse of "scalar_to_vec" than a
reduction is).  GIMPLE calls it a reduction, and the optabs call it a
reduction, so we ought to call it a reduction in the vectorizer cost
model, too.

To cover our bases for PowerPC and AArch32, we probably need:

  plus_reduc_to_scalar
  plus_reduc_to_vector
  minmax_reduc_to_scalar
  minmax_reduc_to_vector

although I think plus_reduc_to_vector wouldn't be used yet, so could be
omitted.  If we go this route, then at that time we would change your
code to use minmax_reduc_to_vector and let the back ends determine
whether that requires a scalar reduction followed by a broadcast, or
whether it would be performed directly.

Using direct reduction to vector for MIN and MAX on PowerPC would be a
big cost savings over scalar reduction/broadcast.
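
In patch form that might look something like this (illustrative only;
the unmarked context lines are the existing vect_cost_for_stmt entries):

   vec_to_scalar,
   scalar_to_vec,
+  plus_reduc_to_scalar,
+  minmax_reduc_to_scalar,
+  minmax_reduc_to_vector,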

Thanks,
Bill

> 
> (I agree that if that's the limit of how "different" conditional reductions 
> may 
> be between architectures, then we should not have a vec_cost_for_stmt for a 
> whole conditional reduction.)
> 
> Cheers, Alan
> 




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-14 Thread Alan Lawrence

On 11/09/15 14:19, Bill Schmidt wrote:


A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
that has to be broadcast back to a vector, and the best way to implement
it for us already has the max value in all positions of a vector.  But
that is something we should be able to fix with simplify-rtx in the back
end.


Reading this thread again, this bit stands out as unaddressed. Yes PowerPC can 
"fix" this with simplify-rtx, but the vector cost model will not take this into 
account - it will think that the broadcast-back-to-a-vector requires an extra 
operation after the reduction, whereas in fact it will not.


Does that suggest we should have a new entry in vect_cost_for_stmt for 
vec_to_scalar-and-back-to-vector (that defaults to vec_to_scalar+scalar_to_vec, 
but on some architectures e.g. PowerPC would be the same as vec_to_scalar)?


(I agree that if that's the limit of how "different" conditional reductions may 
be between architectures, then we should not have a vec_cost_for_stmt for a 
whole conditional reduction.)


Cheers, Alan



Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Ramana Radhakrishnan

>>> Saying that all reductions have equivalent performance is unlikely to be
>>> true for many platforms.  On PowerPC, for example, a PLUS reduction has
>>> very different cost from a MAX reduction.  If the model isn't
>>> fine-grained enough, let's please be aggressive about fixing it.  I'm
>>> fine if it's a separate patch, but in my mind this shouldn't be allowed
>>> to languish.
>>
>> ...I agree that the general vectoriser cost model could probably be
>> improved, but it seems fairer for that improvement to be done by whoever
>> adds the patterns that need it.
> 
> All right.  But in response to Ramana's comment, are all relevant
> reductions of similar cost on each ARM platform?  Obviously they don't
> have the same cost on different platforms, but the question is whether a
> reduc_plus, reduc_max, etc., has identical cost on each individual
> platform.  If not, ARM may have a concern as well.  I don't know the
> situation for x86 either.

From cauldron I have a note that we need to look at the vectorizer cost model
for both the ARM and AArch64 backends and move away from
the set of magic constants that it currently returns.

On AArch32, all the reduc_ patterns are emulated with pair-wise operations
while on AArch64 they aren't. Thus they aren't likely to be the same cost as a
standard vector arithmetic instruction. What difference this makes in practice
remains to be seen; however, the first step is moving towards the newer
vectorizer cost model interface.

I'll put this on a list of things for us to look at but I'm not sure who/when
will get around to looking at this.

regards
Ramana


Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Alan Hayward
Hi Bill,

I’d be a bit worried about asking the backend for the cost of a
COND_REDUCTION, as that will rely on the backend understanding the
implementation the vectorizer is using - every time the vectorizer
changed, the backends would need to be updated too. I’m hoping soon to get
together a patch to reduce the stmts produced on the simpler cases, which
would require a different set of costings. I can also imagine further
improvements being added for other special cases over time. Having the
backends understand every variation would be a little cumbersome.

As it stands today, we correctly exit the optimisation if max reduction
isn’t supported in hardware, which is what the cost model is expecting.


If power wanted to use this implementation, then I think it’d probably
need some code in tree-vect-generic.c to implement an emulated max
reduction, which would then require updates to the costs modelling of
anything that uses max reduction (not just cond reduction). All of that is
outside the scope of this patch.
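
For illustration, what an emulated max reduction boils down to, modelled
in plain C for a four-lane vector (the shape of the pairwise sequence,
not GCC internals):

static int
reduc_max_v4si (const int v[4])
{
  int a = v[0] > v[1] ? v[0] : v[1];  /* round 1: pairwise max */
  int b = v[2] > v[3] ? v[2] : v[3];
  return a > b ? a : b;               /* round 2: final max */
}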


Thanks,
Alan.

On 10/09/2015 23:14, "Bill Schmidt"  wrote:

>Hi Alan,
>
>The cost modeling of the epilogue code seems pretty target-specific ("An
>EQ stmt and an AND stmt, reduction of the max index and a reduction of
>the found values, a broadcast of the max value," resulting in two
>vector_stmts, one vec_to_scalar, and two scalar_to_vecs).  On powerpc,
>this will not represent the cost accurately, and the cost will indeed be
>quite different depending on the mode (logarithmic in the number of
>elements).  I think that you need to create a new entry in
>vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
>each target to calculate the cost appropriately.
>
>(Powerpc doesn't have a max-reduction hardware instruction, but because
>the reduction will be only in the epilogue code, it may still be
>profitable for us to generate the somewhat expensive reduction sequence
>in order to vectorize the loop.  But we definitely want to model it as
>costly in and of itself.  Also, the sequence will produce the maximum
>value in all positions without a separate broadcast.)
>
>Thanks,
>Bill
>
>On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
>> Hi,
>> This patch (attached) adds support for vectorizing conditional
>>expressions
>> (PR 65947), for example:
>> 
>> int condition_reduction (int *a, int min_v)
>> {
>>   int last = 0;
>>   for (int i = 0; i < N; i++)
>> if (a[i] < min_v)
>>   last = a[i];
>>   return last;
>> }
>> 
>> To do this the loop is vectorised to create a vector of data results (ie
>> of matching a[i] values). Using an induction variable, an additional
>> vector is added containing the indexes where the matches occurred. In the
>> function epilogue this is reduced to a single max value and then used to
>> index into the vector of data results.
>> When no values are matched in the loop, the indexes vector will contain
>> all zeroes, eventually matching the first entry in the data results
>>vector.
>> 
>> To vectorize successfully, support is required for REDUC_MAX_EXPR. This
>>is
>> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that
>> REDUC_MAX_EXPR is not supported for the required modes, failing the
>> vectorization. On mips it complains that the required vcond expression
>>is
>> not supported. It is suggested the relevant backend experts add the
>> required backend support.
>> 
>> Using a simple testcase based around a large value of N and run on an
>> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of
>> its original time.
>> 
>> This patch caused binary differences in three spec2006 binaries on
>>aarch64
>> - 416.gamess, 435.gromacs and 456.hmmer. Running them on a juno board
>> showed no improvement or degradation in runtime.
>> 
>> 
>> In the near future I hope to submit a further patch (as PR 66558) which
>> optimises the case where the result is simply the index of the loop, for
>> example:
>> int condition_reduction (int *a, int min_v)
>> {
>>   int last = 0;
>>   for (int i = 0; i < N; i++)
>> if (a[i] < min_v)
>>   last = i;
>>   return last;
>> }
>> In this case a lot of the new code can be optimized away.
>> 
>> I have run check for aarch64, arm and x86 and have seen no regressions.
>> 
>> 
>> Changelog:
>> 
>> 2015-08-28  Alan Hayward 
>> 
>> PR tree-optimization/65947
>> * tree-vect-loop.c
>> (vect_is_simple_reduction_1): Find condition reductions.
>> (vect_model_reduction_cost): Add condition reduction costs.
>> (get_initial_def_for_reduction): Add condition reduction initial
>> var.
>> (vect_create_epilog_for_reduction): Add condition reduction
>>epilog.
>> (vectorizable_reduction): Condition reduction support.
>> * tree-vect-stmts.c
>> (vectorizable_condition): Add vect reduction arg
>> * doc/sourcebuild.texi (Vector-specific attributes): Document
>> 

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Ramana Radhakrishnan
On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
 wrote:
> Hi Alan,
>
> I probably wasn't clear enough.  The implementation in the vectorizer is
> fine and I'm not asking that to change per target.  What I'm objecting
> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> with vec_to_scalar.  This assumes that the back end will implement a
> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> But those back ends should be free to model the cost of the
> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
> not be; for powerpc, it certainly will not be.



>
> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> expansion, and therefore it is not correct for us to explode this in
> tree-vect-generic.  This would expand the code size without providing
> any significant optimization opportunity, and could reduce the ability
> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
> gimple vectorizers.
>
> I apologize if my loose use of language confused the issue.  It isn't
> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> that are used by it.
>
> (The costs in powerpc won't be enormous, but they are definitely
> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
> instructions, where n is the number of elements in the mode being
> vectorized.)


IIUC, on AArch64 a reduc_max_expr matches with a single reduction
operation but on AArch32 Neon a reduc_smax gets implemented as a
sequence of vpmax instructions, which sounds similar to the PowerPC
example. Thus mapping a reduc_smax expression to the cost of a
vec_to_scalar is probably not right in this particular situation.


regards
Ramana
>
> A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
> that has to be broadcast back to a vector, and the best way to implement
> it for us already has the max value in all positions of a vector.  But
> that is something we should be able to fix with simplify-rtx in the back
> end.
>
> Thanks,
> Bill
>
>
> On Fri, 2015-09-11 at 10:15 +0100, Alan Hayward wrote:
>> Hi Bill,
>>
>> I’d be a bit worried about asking the backend for the cost of a
>> COND_REDUCTION, as that will rely on the backend understanding the
>> implementation the vectorizer is using - every time the vectorizer
>> changed, the backends would need to be updated too. I’m hoping soon to get
>> together a patch to reduce the stmts produced on the simpler cases, which
>> would require a different set of costings. I can also imagine further
>> improvements being added for other special cases over time. Having the
>> backends understand every variation would be a little cumbersome.
>>
>> As it stands today, we correctly exit the optimisation if max reduction
>> isn’t supported in hardware, which is what the cost model is expecting.
>>
>>
>> If power wanted to use this implementation, then I think it’d probably
>> need some code in tree-vect-generic.c to implement an emulated max
>> reduction, which would then require updates to the costs modelling of
>> anything that uses max reduction (not just cond reduction). All of that is
>> outside the scope of this patch.
>>
>>
>> Thanks,
>> Alan.
>>
>> On 10/09/2015 23:14, "Bill Schmidt"  wrote:
>>
>> >Hi Alan,
>> >
>> >The cost modeling of the epilogue code seems pretty target-specific ("An
>> >EQ stmt and an AND stmt, reduction of the max index and a reduction of
>> >the found values, a broadcast of the max value," resulting in two
>> >vector_stmts, one vec_to_scalar, and two scalar_to_vecs).  On powerpc,
>> >this will not represent the cost accurately, and the cost will indeed be
>> >quite different depending on the mode (logarithmic in the number of
>> >elements).  I think that you need to create a new entry in
>> >vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
>> >each target to calculate the cost appropriately.
>> >
>> >(Powerpc doesn't have a max-reduction hardware instruction, but because
>> >the reduction will be only in the epilogue code, it may still be
>> >profitable for us to generate the somewhat expensive reduction sequence
>> >in order to vectorize the loop.  But we definitely want to model it as
>> >costly in and of itself.  Also, the sequence will produce the maximum
>> >value in all positions without a separate broadcast.)
>> >
>> >Thanks,
>> >Bill
>> >
>> >On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
>> >> Hi,
>> >> This patch (attached) adds support for vectorizing conditional
>> >>expressions
>> >> (PR 65947), for example:
>> >>
>> >> int condition_reduction (int *a, int min_v)
>> >> {
>> >>   int last = 0;
>> >>   for (int i = 0; i < N; i++)
>> >> if (a[i] < min_v)
>> >>   last = a[i];
> >>   return last;

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Bill Schmidt
Hi Alan,

I probably wasn't clear enough.  The implementation in the vectorizer is
fine and I'm not asking that to change per target.  What I'm objecting
to is the equivalence between a REDUC_MAX_EXPR and a cost associated
with vec_to_scalar.  This assumes that the back end will implement a
REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
But those back ends should be free to model the cost of the
REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
ARM, this cost will be the same as a vec_to_scalar.  For others, it may
not be; for powerpc, it certainly will not be.

We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
expansion, and therefore it is not correct for us to explode this in
tree-vect-generic.  This would expand the code size without providing
any significant optimization opportunity, and could reduce the ability
to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
gimple vectorizers.

I apologize if my loose use of language confused the issue.  It isn't
the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
that are used by it.

(The costs in powerpc won't be enormous, but they are definitely
mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
instructions, where n is the number of elements in the mode being
vectorized.)

A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
that has to be broadcast back to a vector, and the best way to implement
it for us already has the max value in all positions of a vector.  But
that is something we should be able to fix with simplify-rtx in the back
end.

Thanks,
Bill


On Fri, 2015-09-11 at 10:15 +0100, Alan Hayward wrote:
> Hi Bill,
> 
> I’d be a bit worried about asking the backend for the cost of a
> COND_REDUCTION, as that will rely on the backend understanding the
> implementation the vectorizer is using - every time the vectorizer
> changed, the backends would need to be updated too. I’m hoping soon to get
> together a patch to reduce the stmts produced on the simpler cases, which
> would require a different set of costings. I can also imagine further
> improvements being added for other special cases over time. Having the
> backends understand every variation would be a little cumbersome.
> 
> As it stands today, we correctly exit the optimisation if max reduction
> isn’t supported in hardware, which is what the cost model is expecting.
> 
> 
> If power wanted to use this implementation, then I think it’d probably
> need some code in tree-vect-generic.c to implement an emulated max
> reduction, which would then require updates to the costs modelling of
> anything that uses max reduction (not just cond reduction). All of that is
> outside the scope of this patch.
> 
> 
> Thanks,
> Alan.
> 
> On 10/09/2015 23:14, "Bill Schmidt"  wrote:
> 
> >Hi Alan,
> >
> >The cost modeling of the epilogue code seems pretty target-specific ("An
> >EQ stmt and an AND stmt, reduction of the max index and a reduction of
> >the found values, a broadcast of the max value," resulting in two
> >vector_stmts, one vec_to_scalar, and two scalar_to_vecs).  On powerpc,
> >this will not represent the cost accurately, and the cost will indeed be
> >quite different depending on the mode (logarithmic in the number of
> >elements).  I think that you need to create a new entry in
> >vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
> >each target to calculate the cost appropriately.
> >
> >(Powerpc doesn't have a max-reduction hardware instruction, but because
> >the reduction will be only in the epilogue code, it may still be
> >profitable for us to generate the somewhat expensive reduction sequence
> >in order to vectorize the loop.  But we definitely want to model it as
> >costly in and of itself.  Also, the sequence will produce the maximum
> >value in all positions without a separate broadcast.)
> >
> >Thanks,
> >Bill
> >
> >On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
> >> Hi,
> >> This patch (attached) adds support for vectorizing conditional
> >>expressions
> >> (PR 65947), for example:
> >> 
> >> int condition_reduction (int *a, int min_v)
> >> {
> >>   int last = 0;
> >>   for (int i = 0; i < N; i++)
> >> if (a[i] < min_v)
> >>   last = a[i];
> >>   return last;
> >> }
> >> 
> >> To do this the loop is vectorised to create a vector of data results (i.e.
> >> of matching a[i] values). Using an induction variable, an additional
> >> vector is added containing the indexes where the matches occurred. In the
> >> function epilogue this is reduced to a single max value and then used to
> >> index into the vector of data results.
> >> When no values are matched in the loop, the indexes vector will contain
> >> all zeroes, eventually matching the first entry in the data results
> >>vector.
> >> 
> >> To vectorize 
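
For illustration, a scalar model of the scheme described above (4 lanes;
a hedged sketch, not the code the patch actually generates):

  #define LANES 4

  int
  cond_reduction_sketch (int *a, int n, int min_v)
  {
    int data[LANES] = { 0 };   /* per-lane last matching value */
    int idx[LANES] = { 0 };    /* per-lane 1-based match position */
    for (int i = 0; i < n; i++)
      if (a[i] < min_v)
        {
          data[i % LANES] = a[i];
          idx[i % LANES] = i + 1;
        }
    /* Epilogue: max-reduce the index vector and pick that lane's data.
       With no match at all, idx stays all zeroes and data[0] (== 0),
       the first entry of the data results vector, is returned.  */
    int best = 0;
    for (int l = 1; l < LANES; l++)
      if (idx[l] > idx[best])
        best = l;
    return data[best];
  }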

Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Richard Sandiford
Ramana Radhakrishnan  writes:
> On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
>  wrote:
>> Hi Alan,
>>
>> I probably wasn't clear enough.  The implementation in the vectorizer is
>> fine and I'm not asking that to change per target.  What I'm objecting
>> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
>> with vec_to_scalar.  This assumes that the back end will implement a
>> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
>> But those back ends should be free to model the cost of the
>> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
>> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
>> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
>> not be; for powerpc, it certainly will not be.
>>
>> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
>> expansion, and therefore it is not correct for us to explode this in
>> tree-vect-generic.  This would expand the code size without providing
>> any significant optimization opportunity, and could reduce the ability
>> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
>> gimple vectorizers.
>>
>> I apologize if my loose use of language confused the issue.  It isn't
>> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
>> that are used by it.
>>
>> (The costs in powerpc won't be enormous, but they are definitely
>> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
>> instructions, where n is the number of elements in the mode being
>> vectorized.)
>
> IIUC, on AArch64 a reduc_max_expr matches with a single reduction
> operation but on AArch32 Neon a reduc_smax gets implemented as a
> sequence of vpmax instructions, which sounds similar to the PowerPC
> example as well. Thus mapping a reduc_smax expression to the cost of a
> vec_to_scalar is probably not right in this particular situation.

But AIUI vec_to_scalar exists to represent reduction operations.
(I see it was also used for strided stores.)  So for better or worse,
I think the interface that Alan's patch uses is the defined interface
for measuring the cost of a reduction.

If a backend implemented reduc_umax_scal_optab in current sources,
without Alan's patch, then that optab would be used for a "natural"
unsigned max reduction (i.e. a reduction of a MAX_EXPR with unsigned
inputs).  vec_to_scalar would be used to weigh the cost of the epilogue
reduction statement in that case.
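
For reference, the kind of "natural" max reduction meant here is simply:

  /* Vectorizable without Alan's patch when the target provides
     reduc_umax_scal_optab; the epilogue reduction is costed as
     vec_to_scalar.  Illustrative function, not taken from the patch.  */
  unsigned
  natural_umax (unsigned *a, int n)
  {
    unsigned m = 0;
    for (int i = 0; i < n; i++)
      m = a[i] > m ? a[i] : m;   /* recognised as a MAX_EXPR reduction */
    return m;
  }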

So if defining a new Power pattern might cause Alan's patch to trigger
in cases where the transformation is actually too expensive, I would
expect the same to be true for a natural umax without Alan's patch.
The two cases ought to underestimate the true cost by the same degree.

In other words, whether the cost interface is flexible enough is
definitely interesting but seems orthogonal to this patch.

Thanks,
Richard



Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Bill Schmidt
On Fri, 2015-09-11 at 15:29 +0100, Richard Sandiford wrote:
> Ramana Radhakrishnan  writes:
> > On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
> >  wrote:
> >> Hi Alan,
> >>
> >> I probably wasn't clear enough.  The implementation in the vectorizer is
> >> fine and I'm not asking that to change per target.  What I'm objecting
> >> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> >> with vec_to_scalar.  This assumes that the back end will implement a
> >> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> >> But those back ends should be free to model the cost of the
> >> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
> >> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
> >> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
> >> not be; for powerpc, it certainly will not be.
> >>
> >> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> >> expansion, and therefore it is not correct for us to explode this in
> >> tree-vect-generic.  This would expand the code size without providing
> >> any significant optimization opportunity, and could reduce the ability
> >> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
> >> gimple vectorizers.
> >>
> >> I apologize if my loose use of language confused the issue.  It isn't
> >> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> >> that are used by it.
> >>
> >> (The costs in powerpc won't be enormous, but they are definitely
> >> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
> >> instructions, where n is the number of elements in the mode being
> >> vectorized.)
> >
> > IIUC, on AArch64 a reduc_max_expr matches with a single reduction
> > operation but on AArch32 Neon a reduc_smax gets implemented as a
> > sequence of vpmax instructions, which sounds similar to the PowerPC
> > example as well. Thus mapping a reduc_smax expression to the cost of a
> > vec_to_scalar is probably not right in this particular situation.
> 
> But AIUI vec_to_scalar exists to represent reduction operations.
> (I see it was also used for strided stores.)  So for better or worse,
> I think the interface that Alan's patch uses is the defined interface
> for measuring the cost of a reduction.
>
> If a backend implemented reduc_umax_scal_optab in current sources,
> without Alan's patch, then that optab would be used for a "natural"
> unsigned max reduction (i.e. a reduction of a MAX_EXPR with unsigned
> inputs).  vec_to_scalar would be used to weigh the cost of the epilogue
> reduction statement in that case.
> 
> So if defining a new Power pattern might cause Alan's patch to trigger
> in cases where the transformation is actually too expensive, I would
> expect the same to be true for a natural umax without Alan's patch.
> The two cases ought to underestimate the true cost by the same degree.
> 
> In other words, whether the cost interface is flexible enough is
> definitely interesting but seems orthogonal to this patch.

That's a reasonable argument, but is this not a good opportunity to fix
an incorrect assumption in the vectorizer cost model?  I would prefer
for this issue not to get lost on a technicality.

The vectorizer cost model has many small flaws, and we all need to be
mindful of trying to improve it at every opportunity, rather than
allowing it to continue to degrade.  We just had a big discussion about
improving cost models at the last Cauldron, and my request is consistent
with that direction.

Saying that all reductions have equivalent performance is unlikely to be
true for many platforms.  On PowerPC, for example, a PLUS reduction has
very different cost from a MAX reduction.  If the model isn't
fine-grained enough, let's please be aggressive about fixing it.  I'm
fine if it's a separate patch, but in my mind this shouldn't be allowed
to languish.

Thanks,
Bill

> 
> Thanks,
> Richard
> 




Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Richard Sandiford
Bill Schmidt  writes:
> On Fri, 2015-09-11 at 15:29 +0100, Richard Sandiford wrote:
>> Ramana Radhakrishnan  writes:
>> > On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
>> >  wrote:
>> >> Hi Alan,
>> >>
>> >> I probably wasn't clear enough.  The implementation in the vectorizer is
>> >> fine and I'm not asking that to change per target.  What I'm objecting
>> >> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
>> >> with vec_to_scalar.  This assumes that the back end will implement a
>> >> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
>> >> But those back ends should be free to model the cost of the
>> >> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
>> >> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
>> >> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
>> >> not be; for powerpc, it certainly will not be.
>> >>
>> >> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
>> >> expansion, and therefore it is not correct for us to explode this in
>> >> tree-vect-generic.  This would expand the code size without providing
>> >> any significant optimization opportunity, and could reduce the ability
>> >> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
>> >> gimple vectorizers.
>> >>
>> >> I apologize if my loose use of language confused the issue.  It isn't
>> >> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
>> >> that are used by it.
>> >>
>> >> (The costs in powerpc won't be enormous, but they are definitely
>> >> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
>> >> instructions, where n is the number of elements in the mode being
>> >> vectorized.)
>> >
>> > IIUC, on AArch64 a reduc_max_expr matches with a single reduction
>> > operation but on AArch32 Neon a reduc_smax gets implemented as a
>> > sequence of vpmax instructions, which sounds similar to the PowerPC
>> > example as well. Thus mapping a reduc_smax expression to the cost of a
>> > vec_to_scalar is probably not right in this particular situation.
>> 
>> But AIUI vec_to_scalar exists to represent reduction operations.
>> (I see it was also used for strided stores.)  So for better or worse,
>> I think the interface that Alan's patch uses is the defined interface
>> for measuring the cost of a reduction.
>>
>> If a backend implemented reduc_umax_scal_optab in current sources,
>> without Alan's patch, then that optab would be used for a "natural"
>> unsigned max reduction (i.e. a reduction of a MAX_EXPR with unsigned
>> inputs).  vec_to_scalar would be used to weigh the cost of the epilogue
>> reduction statement in that case.
>> 
>> So if defining a new Power pattern might cause Alan's patch to trigger
>> in cases where the transformation is actually too expensive, I would
>> expect the same to be true for a natural umax without Alan's patch.
>> The two cases ought to underestimate the true cost by the same degree.
>> 
>> In other words, whether the cost interface is flexible enough is
>> definitely interesting but seems orthogonal to this patch.
>
> That's a reasonable argument, but is this not a good opportunity to fix
> an incorrect assumption in the vectorizer cost model?  I would prefer
> for this issue not to get lost on a technicality.

I think it's more than a technicality though.  I don't think it should be
Alan's responsibility to extend the cost model when (a) his patch uses the
current model in the way that it was intended to be used (at least AIUI) and
(b) in this case, the motivating example for the new model is a pattern
that hasn't been written yet. :-)

So...

> The vectorizer cost model has many small flaws, and we all need to be
> mindful of trying to improve it at every opportunity, rather than
> allowing it to continue to degrade.  We just had a big discussion about
> improving cost models at the last Cauldron, and my request is consistent
> with that direction.
>
> Saying that all reductions have equivalent performance is unlikely to be
> true for many platforms.  On PowerPC, for example, a PLUS reduction has
> very different cost from a MAX reduction.  If the model isn't
> fine-grained enough, let's please be aggressive about fixing it.  I'm
> fine if it's a separate patch, but in my mind this shouldn't be allowed
> to languish.

...I agree that the general vectoriser cost model could probably be
improved, but it seems fairer for that improvement to be done by whoever
adds the patterns that need it.

Thanks,
Richard



Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)

2015-09-11 Thread Bill Schmidt
On Fri, 2015-09-11 at 16:28 +0100, Richard Sandiford wrote:
> Bill Schmidt  writes:
> > On Fri, 2015-09-11 at 15:29 +0100, Richard Sandiford wrote:
> >> Ramana Radhakrishnan  writes:
> >> > On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
> >> >  wrote:
> >> >> Hi Alan,
> >> >>
> >> >> I probably wasn't clear enough.  The implementation in the vectorizer is
> >> >> fine and I'm not asking that to change per target.  What I'm objecting
> >> >> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> >> >> with vec_to_scalar.  This assumes that the back end will implement a
> >> >> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> >> >> But those back ends should be free to model the cost of the
> >> >> REDUC_MAX_EXPR appropriately.  Therefore I am asking for a new
> >> >> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR.  For
> >> >> ARM, this cost will be the same as a vec_to_scalar.  For others, it may
> >> >> not be; for powerpc, it certainly will not be.
> >> >>
> >> >> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> >> >> expansion, and therefore it is not correct for us to explode this in
> >> >> tree-vect-generic.  This would expand the code size without providing
> >> >> any significant optimization opportunity, and could reduce the ability
> >> >> to, for instance, common REDUC_MAX_EXPRs.  It would also slow down the
> >> >> gimple vectorizers.
> >> >>
> >> >> I apologize if my loose use of language confused the issue.  It isn't
> >> >> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> >> >> that are used by it.
> >> >>
> >> >> (The costs in powerpc won't be enormous, but they are definitely
> >> >> mode-dependent in a way that vec_to_scalar is not.  We'll need 2*log(n)
> >> >> instructions, where n is the number of elements in the mode being
> >> >> vectorized.)
> >> >
> >> > IIUC, on AArch64 a reduc_max_expr matches with a single reduction
> >> > operation but on AArch32 Neon a reduc_smax gets implemented as a
> >> > sequence of vpmax instructions, which sounds similar to the PowerPC
> >> > example as well. Thus mapping a reduc_smax expression to the cost of a
> >> > vec_to_scalar is probably not right in this particular situation.
> >> 
> >> But AIUI vec_to_scalar exists to represent reduction operations.
> >> (I see it was also used for strided stores.)  So for better or worse,
> >> I think the interface that Alan's patch uses is the defined interface
> >> for measuring the cost of a reduction.
> >>
> >> If a backend implemented reduc_umax_scal_optab in current sources,
> >> without Alan's patch, then that optab would be used for a "natural"
> >> unsigned max reduction (i.e. a reduction of a MAX_EXPR with unsigned
> >> inputs).  vec_to_scalar would be used to weigh the cost of the epilogue
> >> reduction statement in that case.
> >> 
> >> So if defining a new Power pattern might cause Alan's patch to trigger
> >> in cases where the transformation is actually too expensive, I would
> >> expect the same to be true for a natural umax without Alan's patch.
> >> The two cases ought to underestimate the true cost by the same degree.
> >> 
> >> In other words, whether the cost interface is flexible enough is
> >> definitely interesting but seems orthogonal to this patch.
> >
> > That's a reasonable argument, but is this not a good opportunity to fix
> > an incorrect assumption in the vectorizer cost model?  I would prefer
> > for this issue not to get lost on a technicality.
> 
> I think it's more than a technicality though.  I don't think it should be
> Alan's responsibility to extend the cost model when (a) his patch uses the
> current model in the way that it was intended to be used (at least AIUI) and
> (b) in this case, the motivating example for the new model is a pattern
> that hasn't been written yet. :-)

Agreed.  However, the original patch description said, in essence, that
this is good for everybody and that powerpc and x86 should go and
implement their patterns and use it.  It turns out not to be so simple,
unfortunately.

> 
> So...
> 
> > The vectorizer cost model has many small flaws, and we all need to be
> > mindful of trying to improve it at every opportunity, rather than
> > allowing it to continue to degrade.  We just had a big discussion about
> > improving cost models at the last Cauldron, and my request is consistent
> > with that direction.
> >
> > Saying that all reductions have equivalent performance is unlikely to be
> > true for many platforms.  On PowerPC, for example, a PLUS reduction has
> > very different cost from a MAX reduction.  If the model isn't
> > fine-grained enough, let's please be aggressive about fixing it.  I'm
> > fine if it's a separate patch, but in my mind this shouldn't be allowed
> > to languish.
> 
> ...I agree that the general vectoriser cost model could probably be
> improved, but it