Re: [RFC][PR82479] missing popcount builtin detection

2018-03-07 Thread Bin.Cheng
On Wed, Mar 7, 2018 at 8:26 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Mar 6, 2018 at 5:20 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Mar 5, 2018 at 3:24 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>> Hi Richard,
>>>>
>>>> On 1 February 2018 at 23:21, Richard Biener <richard.guent...@gmail.com> 
>>>> wrote:
>>>>> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> On 31 January 2018 at 21:39, Richard Biener <richard.guent...@gmail.com> 
>>>>>> wrote:
>>>>>>> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>>> Hi Richard,
>>>>>>>>
>>>>>>>> Thanks for the review.
>>>>>>>> On 25 January 2018 at 20:04, Richard Biener 
>>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>>>>>>>> would like to queue this for review for next stage 1.
>>>>>>>>>>
>>>>>>>>>> 1. This is done as part of loop distribution and is effective for -O3
>>>>>>>>>> and above.
>>>>>>>>>> 2. This does not distribute a loop to detect popcount (like
>>>>>>>>>> memcpy/memmove). I don't think that happens in practice. Please
>>>>>>>>>> correct me if I am wrong.
>>>>>>>>>
>>>>>>>>> But then it has no business inside loop distribution but instead is
>>>>>>>>> doing final value replacement, right?  You are pattern-matching the
>>>>>>>>> whole loop after all.  I think final value replacement would already
>>>>>>>>> do the correct thing if you taught number of iteration analysis that
>>>>>>>>> niter for
>>>>>>>>>
>>>>>>>>>[local count: 955630224]:
>>>>>>>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>>>>>>>   _1 = b_11 + -1;
>>>>>>>>>   b_8 = _1 & b_11;
>>>>>>>>>   if (b_8 != 0)
>>>>>>>>> goto ; [89.00%]
>>>>>>>>>   else
>>>>>>>>> goto ; [11.00%]
>>>>>>>>>
>>>>>>>>>[local count: 850510900]:
>>>>>>>>>   goto ; [100.00%]
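For reference, the loop under discussion is essentially Kernighan's
bit-counting idiom; a minimal C sketch (assuming the usual PR82479 test
shape, where b &= b - 1 clears the lowest set bit each iteration):

  int count_bits (unsigned int b)
  {
    int c = 0;
    while (b)
      {
        b &= b - 1;  /* drop the lowest set bit */
        c++;
      }
    return c;
  }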
>>>>>>>>
>>>>>>>> I am looking into this approach. What should the scalar evolution
>>>>>>>> for b_8 (i.e. b & (b - 1) in a loop) be? This is not clear to me;
>>>>>>>> can this be represented with the scev?
>>>>>>>
>>>>>>> No, it's not affine and thus cannot be represented.  You only need the
>>>>>>> scalar evolution of the counting IV which is already handled and
>>>>>>> the number of iteration analysis needs to handle the above IV - this
>>>>>>> is the missing part.
>>>>>> Thanks for the clarification. I am now matching this loop pattern in
>>>>>> number_of_iterations_exit when number_of_iterations_exit_assumptions
>>>>>> fails. If the pattern matches, I am inserting the __builtin_popcount in
>>>>>> the loop preheader and setting the loop niter with this. This will be
>>>>>> used by the final value replacement. Is this what you wanted?
>>>>>
>>>>> No, you shouldn't insert a popcount stmt but instead the niter
>>>>> GENERIC tree should be a CALL_EXPR to popcount with the
>>>>> appropriate argument.
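Concretely, that means building the niter GENERIC tree along these lines
(a sketch reusing the calls visible in the patch hunk quoted in the
follow-up below; 'src' stands for the counted value and is an assumption
here, not committed code):

  tree fn = builtin_decl_implicit (BUILT_IN_POPCOUNT);
  tree call = build_call_expr (fn, 1, src);
  niter->niter = fold_build2 (MINUS_EXPR, TREE_TYPE (call), call,
                              build_int_cst (TREE_TYPE (call), 1));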

Re: [RFC][PR82479] missing popcount builtin detection

2018-03-06 Thread Bin.Cheng
On Mon, Mar 5, 2018 at 3:24 PM, Richard Biener
 wrote:
> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi Richard,
>>
>> On 1 February 2018 at 23:21, Richard Biener  
>> wrote:
>>> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
>>>  wrote:
 Hi Richard,

 On 31 January 2018 at 21:39, Richard Biener  
 wrote:
> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi Richard,
>>
>> Thanks for the review.
>> On 25 January 2018 at 20:04, Richard Biener  
>> wrote:
>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>  wrote:
 Hi All,

 Here is a patch for popcount builtin detection similar to LLVM. I
 would like to queue this for review for next stage 1.

 1. This is done as part of loop distribution and is effective for -O3
 and above.
 2. This does not distribute a loop to detect popcount (like
 memcpy/memmove). I don't think that happens in practice. Please
 correct me if I am wrong.
>>>
>>> But then it has no business inside loop distribution but instead is
>>> doing final value replacement, right?  You are pattern-matching the
>>> whole loop after all.  I think final value replacement would already
>>> do the correct thing if you taught number of iteration analysis that
>>> niter for
>>>
>>>[local count: 955630224]:
>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>   _1 = b_11 + -1;
>>>   b_8 = _1 & b_11;
>>>   if (b_8 != 0)
>>> goto ; [89.00%]
>>>   else
>>> goto ; [11.00%]
>>>
>>>[local count: 850510900]:
>>>   goto ; [100.00%]
>>
>> I am looking into this approach. What should the scalar evolution
>> for b_8 (i.e. b & (b - 1) in a loop) be? This is not clear to me;
>> can this be represented with the scev?
>
> No, it's not affine and thus cannot be represented.  You only need the
> scalar evolution of the counting IV which is already handled and
> the number of iteration analysis needs to handle the above IV - this
> is the missing part.
 Thanks for the clarification. I am now matching this loop pattern in
 number_of_iterations_exit when number_of_iterations_exit_assumptions
 fails. If the pattern matches, I am inserting the __builtin_popcount in
 the loop preheader and setting the loop niter with this. This will be
 used by the final value replacement. Is this what you wanted?
>>>
>>> No, you shouldn't insert a popcount stmt but instead the niter
>>> GENERIC tree should be a CALL_EXPR to popcount with the
>>> appropriate argument.
>>
>> That's what I tried earlier but ran into some ICEs. I wasn't sure if
>> niter in tree_niter_desc could hold such an expression.
>>
>> The attached patch now does this. I also had to add support for CALL_EXPR
>> in a few places to handle a niter with a CALL_EXPR. Does this look OK?
>
> Overall this looks ok - the patch includes changes in places that I don't
> think need changes, such as chrec_convert_1 or extract_ops_from_tree.
> The expression_expensive_p change should be more specific than making
> all calls inexpensive as well.
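A sketch of the narrower expression_expensive_p check being asked for
(illustrative; the exact placement and predicate are assumptions):

  /* Treat only a call to the popcount builtin as inexpensive; any
     other CALL_EXPR stays expensive.  */
  if (TREE_CODE (expr) == CALL_EXPR)
    {
      tree fndecl = get_callee_fndecl (expr);
      if (!fndecl
          || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL
          || DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNT)
        return true;
    }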
>
> The verify_ssa change looks bogus, you do
>
> +  dest = gimple_phi_result (count_phi);
> +  tree var = make_ssa_name (TREE_TYPE (dest), NULL);
> +  tree fn = builtin_decl_implicit (BUILT_IN_POPCOUNT);
> +
> +  var = build_call_expr (fn, 1, src);
> +  *niter = fold_build2 (MINUS_EXPR, TREE_TYPE (dest), var,
> +   build_int_cst (TREE_TYPE (dest), 1));
>
> why do you allocate a new SSA name here?  It seems unused,
> as you overwrite 'var' with the CALL_EXPR immediately.
>
> I didn't review the pattern matching thoroughly nor the exact place you
> call it.  But
>
> +  if (check_popcount_pattern (loop, &count))
> +   {
> + niter->assumptions = boolean_false_node;
> + niter->control.base = NULL_TREE;
> + niter->control.step = NULL_TREE;
> + niter->control.no_overflow = false;
> + niter->niter = count;
> + niter->assumptions = boolean_true_node;
> + niter->may_be_zero = boolean_false_node;
> + niter->max = -1;
> + niter->bound = NULL_TREE;
> + niter->cmp = ERROR_MARK;
> + return true;
> +   }
>
> simply setting may_be_zero to false looks fishy.  Try
> with -fno-tree-loop-ch.  Also max should not be negative,
> it should be the number of bits in the IV type?
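A sketch of the suggested bound (assuming 'src' names the value being
counted): a popcount loop executes its latch at most "bit width of the
type" times, so

  niter->max = TYPE_PRECISION (TREE_TYPE (src));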
>
> A related testcase could be that we can completely peel
> a loop like the following which iterates at most 8 times:
>
> int a[8];
> void foo (unsigned char ctrl)
> {
>   int c = 0;
>   while (ctrl)
> {
>
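The digest cuts the example off here; presumably it continues with the
same ctrl &= ctrl - 1 idiom, along the lines of this hedged completion:

  int a[8];
  void foo (unsigned char ctrl)
  {
    int c = 0;
    while (ctrl)
      {
        a[c] = c;          /* c stays below 8: ctrl has only 8 bits */
        ctrl &= ctrl - 1;  /* so at most 8 iterations, fully peelable */
        c++;
      }
  }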

Re: Patch ping (Re: [PATCH PR82965/PR83991]Fix invalid profile count in vectorization peeling)

2018-02-26 Thread Bin.Cheng
Ping^2

Thanks,
bin

On Mon, Feb 19, 2018 at 5:14 PM, Jakub Jelinek  wrote:
> Hi!
>
> Honza, do you think you could have a look at this P1 fix?
>
> Thanks.
>
> On Wed, Jan 31, 2018 at 10:03:51AM +, Bin Cheng wrote:
>> Hi,
>> This patch fixes invalid profile count information in vectorization peeling.
>> The current implementation is a bit confusing to me since it tries to compute
>> an overall probability based on the scaling probability and the change of
>> estimated niters.  This patch does it in two steps.  First it does the
>> scaling; then it adjusts to the new estimated niters by adjusting the loop
>> latch's count information and scaling the loop's count information by the
>> proportion new_estimated_niters/old_estimated_niters.  Of course we have to
>> adjust the loop latch's count information back after scaling.
>>
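A worked illustration of that proportion, with invented numbers (not
from the patch):

  /* Scale a body count by new_estimated_niters/old_estimated_niters,
     e.g. a count of 100000 with the estimate dropping from 1000 to 8
     becomes 100000 * 8 / 1000 = 800.  */
  static gcov_type
  rescale_body_count (gcov_type count, gcov_type old_niters,
                      gcov_type new_niters)
  {
    return count * new_niters / old_niters;
  }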
>> Bootstrap and test on x86_64 and AArch64.  gcc.dg/vect/pr79347.c is fixed
>> for both PR82965 and PR83991.  Is this OK?
>>
>> Thanks,
>> bin
>>
>> 2018-01-30  Bin Cheng  
>>
>>   PR tree-optimization/82965
>>   PR tree-optimization/83991
>>   * cfgloopmanip.c (scale_loop_profile): Further scale loop's profile
>>   information if the loop was predicted to iterate too many times.
>
>> diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
>> index b9b76d8..1f560b8 100644
>> --- a/gcc/cfgloopmanip.c
>> +++ b/gcc/cfgloopmanip.c
>> @@ -509,7 +509,7 @@ scale_loop_profile (struct loop *loop, profile_probability p,
>>   gcov_type iteration_bound)
>>  {
>>gcov_type iterations = expected_loop_iterations_unbounded (loop);
>> -  edge e;
>> +  edge e, preheader_e;
>>edge_iterator ei;
>>
>>if (dump_file && (dump_flags & TDF_DETAILS))
>> @@ -521,77 +521,66 @@ scale_loop_profile (struct loop *loop, profile_probability p,
>>  (int)iteration_bound, (int)iterations);
>>  }
>>
>> +  /* Scale the probabilities.  */
>> +  scale_loop_frequencies (loop, p);
>> +
>>/* See if loop is predicted to iterate too many times.  */
>> -  if (iteration_bound && iterations > 0
>> -  && p.apply (iterations) > iteration_bound)
>> +  if (iteration_bound == 0 || iterations <= 0
>> +  || p.apply (iterations) <= iteration_bound)
>> +return;
>> +
>> +  e = single_exit (loop);
>> +  preheader_e = loop_preheader_edge (loop);
>> +  profile_count count_in = preheader_e->count ();
>> +  if (e && preheader_e
>> +  && count_in > profile_count::zero ()
>> +  && loop->header->count.initialized_p ())
>>  {
>> -  /* Fixing loop profile for different trip count is not trivial; the exit
>> -  probabilities has to be updated to match and frequencies propagated down
>> -  to the loop body.
>> -
>> -  We fully update only the simple case of loop with single exit that is
>> -  either from the latch or BB just before latch and leads from BB with
>> -  simple conditional jump.   This is OK for use in vectorizer.  */
>> -  e = single_exit (loop);
>> -  if (e)
>> - {
>> -   edge other_e;
>> -   profile_count count_delta;
>> +  edge other_e;
>> +  profile_count count_delta;
>>
>> -  FOR_EACH_EDGE (other_e, ei, e->src->succs)
>> - if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
>> - && e != other_e)
>> -   break;
>> +  FOR_EACH_EDGE (other_e, ei, e->src->succs)
>> + if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
>> + && e != other_e)
>> +   break;
>>
>> -   /* Probability of exit must be 1/iterations.  */
>> -   count_delta = e->count ();
>> -   e->probability = profile_probability::always ()
>> +  /* Probability of exit must be 1/iterations.  */
>> +  count_delta = e->count ();
>> +  e->probability = profile_probability::always ()
>>   .apply_scale (1, iteration_bound);
>> -   other_e->probability = e->probability.invert ();
>> -   count_delta -= e->count ();
>> -
>> -   /* If latch exists, change its count, since we changed
>> -  probability of exit.  Theoretically we should update everything from
>> -  source of exit edge to latch, but for vectorizer this is enough.  */
>> -   if (loop->latch
>> -   && loop->latch != e->src)
>> - {
>> -   loop->latch->count += count_delta;
>> - }
>> - }
>> +  other_e->probability = e->probability.invert ();
>>
>>/* Roughly speaking we want to reduce the loop body profile by the
>>difference of loop iterations.  We however can do better if
>>we look at the actual profile, if it is available.  */
>> -  p = p.apply_scale (iteration_bound, iterations);
>> -
>> -  if (loop->header->count.initialized_p ())
>> - {
>> -   profile_count count_in = profile_count::zero ();
>> +  p = profile_probability::always ();
>>
>> -   FOR_EACH_EDGE (e, ei, loop->header->preds)
>> - if (e->src != loop->latch)
>> 

Re: [PATCH BACKPORT]Backport r254778 and test case in r244815 to GCC6

2018-02-01 Thread Bin.Cheng
On Wed, Jan 31, 2018 at 10:55 AM, Richard Biener
 wrote:
> On Tue, Dec 19, 2017 at 4:36 PM, Bin Cheng  wrote:
>> Hi,
>> This patch backports r254778 and test case in r244815 to GCC6.  Bootstrap and
>> test on x86_64.  Is it OK?
>
> Ok.
Retested and applied on GCC6 branch.

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>
>> 2017-12-18  Bin Cheng  
>>
>> Backport from mainline
>> 2017-11-15  Bin Cheng  
>>
>> PR tree-optimization/82726
>> PR tree-optimization/70754
>> * tree-predcom.c (order_drefs_by_pos): New function.
>> (combine_chains): Move code setting has_max_use_after to...
>> (try_combine_chains): ...here.  New parameter.  Sort combined chains
>> according to position information.
>> (tree_predictive_commoning_loop): Update call to above function.
>> (update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.
>>
>> gcc/testsuite
>> 2017-12-18  Bin Cheng  
>>
>> Backport from mainline
>> 2017-11-15  Bin Cheng  
>>
>> PR tree-optimization/82726
>> * gcc.dg/tree-ssa/pr82726.c: New test.
>>
>> Backport from mainline
>> 2017-01-23  Bin Cheng  
>>
>> PR tree-optimization/70754
>> * gfortran.dg/pr70754.f90: New test.


Re: [PATCH PR82604]Fix regression in ftree-parallelize-loops

2018-01-20 Thread Bin.Cheng
On Fri, Jan 19, 2018 at 5:42 PM, Bin Cheng  wrote:
> Hi,
> This patch is supposed to fix a regression caused by loop distribution when
> -ftree-parallelize-loops is used.  The reason is that the distributed memset
> call can't be understood/analyzed by data reference analysis; as a result,
> parloop can only parallelize the innermost 2-level loop nest.  Before the
> distribution change, parloop could parallelize the innermost 3-level loop
> nest, i.e., more parallelization.
> As commented in the PR, ideally loop distribution should be able to
> distribute the memset call for the 3-level loop nest.  Unfortunately this
> requires sophisticated work proving equality between tree expressions, which
> gcc is not good at now.
> Another fix would be to improve data reference analysis so that memset calls
> can be supported.  We don't know how big this change is and it's definitely
> not a GCC 8 task.
>
> So this patch fixes the regression in a somewhat hacky way.  It first enables
> 3-level loop nest distribution when flag_tree_parloops > 1.  Second, it
> supports 3-level loop nest distribution for a ZERO-ing stmt which can only
> be distributed as a loop (nest) of memset, but can't be distributed as a
> single memset.  The overall effect is that the ZERO-ing stmt will be
> distributed one loop deeper than now, so parloop can parallelize as before.
>
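An illustrative shape of the problem (array and bounds invented): the
whole 3-level nest can't be proven equal to one flat memset when the
bounds are symbolic, but each innermost row still can, so the ZERO-ing
stmt is distributed as a loop (nest) of memset, one level deeper:

  #include <string.h>

  #define N 64
  static double a[N][N][N];

  void zero_partial (int n, int m, int l)
  {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        memset (&a[i][j][0], 0, l * sizeof (double));
  }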
> Bootstrap and test on x86_64 and AArch64 ongoing.  Is it OK if no errors?
Test finished without error.  Also I checked
-ftree-parallelize-loops=6 on AArch64 and can confirm the regression
is resolved.

Thanks,
bin
>
> Thanks,
> bin
> 2018-01-19  Bin Cheng  
>
> PR tree-optimization/82604
> * tree-loop-distribution.c (enum partition_kind): New enum item
> PKIND_PARTIAL_MEMSET.
> (partition_builtin_p): Support above new enum item.
> (generate_code_for_partition): Ditto.
> (compute_access_range): Differentiate cases that equality can be
> proven at all loops, the innermost loops or no loops.
> (classify_builtin_st, classify_builtin_ldst): Adjust call to above
> function.  Set PKIND_PARTIAL_MEMSET for partition appropriately.
> (finalize_partitions, distribute_loop): Don't fuse partition of
> PKIND_PARTIAL_MEMSET kind when distributing 3-level loop nest.
> (prepare_perfect_loop_nest): Distribute 3-level loop nest only if
> parloop is enabled.


Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-20 Thread Bin.Cheng
On Tue, Dec 19, 2017 at 12:56 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 12:58 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz <m...@suse.de> wrote:
>>> Hi,
>>>
>>> On Mon, 18 Dec 2017, Richard Biener wrote:
>>>
>>>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
>>>> loop.
>>>
>>> [0] is always outermost loop.
>>>
>>>> The vectorizer does way more complicated things and only looks at the
>>>> distance with respect to the outer loop as far as I can see which can be
>>>> negative.
>>>>
>>>> Not sure if fusion and vectorizer "interleaving" makes a difference
>>>> here. I think the idea was that when interleaving stmt-by-stmt then
>>>> forward dependences would be preserved and thus we don't need to check
>>>> the inner loop dependences.  speaking with "forward vs. backward"
>>>> dependences again, not distances...
>>>>
>>>> This also means that unroll-and-jam could be enhanced to "interleave"
>>>> stmts and thus cover more cases?
>>>>
>>>> Again, I hope Micha can have a look here...
>>>
>>> Haven't yet looked at the patch, but some comments anyway:
>>>
>>> fusion and interleaving interact in the following way in outer loop
>>> vectorization, conceptually:
>>> * (1) the outer loop is unrolled
>>> * (2) the inner loops are fused
>>> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>>>
>> Thanks Michael for explaining the issue more clearly; this is what I meant.
>> As for PR60276, I think it's actually the other side of the problem,
>> which only relates to the dependence validity of interleaving.
>
> Interleaving validity is what is checked by the current code, I don't
> see any checking for validity of (2).  Now, the current code only
> looks at the outer loop distances to verify interleaving validity.
>
> I think we need to verify whether fusion is valid, preferably clearly
> separated from the current code checking interleaving validity.
>
> I'm not 100% convinced the interleaving validity check is correct for
> the outer loop vectorization case.
>
> I think it helps to reduce the dependence checking code to what we do
> for unroll-and-jam:
>
> Index: gcc/tree-vect-data-refs.c
> ===
> --- gcc/tree-vect-data-refs.c   (revision 255777)
> +++ gcc/tree-vect-data-refs.c   (working copy)
> @@ -378,7 +378,26 @@ vect_analyze_data_ref_dependence (struct
> dump_printf_loc (MSG_NOTE, vect_location,
>   "dependence distance  = %d.\n", dist);
>
> -  if (dist == 0)
> +  if (dist < 0)
> +   gcc_unreachable ();
> +
> +  else if (dist >= *max_vf)
> +   {
> + /* Dependence distance does not create dependence, as far as
> +vectorization is concerned, in this case.  */
> + if (dump_enabled_p ())
> +   dump_printf_loc (MSG_NOTE, vect_location,
> +"dependence distance >= VF.\n");
> + continue;
> +   }
> +
> +  else if (DDR_NB_LOOPS (ddr) == 2
> +  && (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +  || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 1)
> +  && dist > 0)))
> +   continue;
> +
> +  else if (dist == 0)
> {
>   if (dump_enabled_p ())
> {
> @@ -427,26 +446,7 @@ vect_analyze_data_ref_dependence (struct
>   continue;
> }
>
> -  if (dist > 0 && DDR_REVERSED_P (ddr))
> -   {
> - /* If DDR_REVERSED_P the order of the data-refs in DDR was
> -reversed (to make distance vector positive), and the actual
> -distance is negative.  */
> - if (dump_enabled_p ())
> -   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> -"dependence distance negative.\n");
> - /* Record a negative dependence distance to later limit the
> -amount of stmt copying / unrolling we can perform.
> -Only need to handle read-after-write dependence.  */
> - if (DR_IS_READ (drb)
> - && (STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) == 0
> - || STMT_VINFO_MIN_NEG_DIST (stmtinfo_b) > (unsigned) dist))

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-19 Thread Bin.Cheng
On Mon, Dec 18, 2017 at 2:35 PM, Michael Matz  wrote:
> Hi,
>
> On Mon, 18 Dec 2017, Richard Biener wrote:
>
>> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost 
>> loop.
>
> [0] is always outermost loop.
>
>> The vectorizer does way more complicated things and only looks at the
>> distance with respect to the outer loop as far as I can see which can be
>> negative.
>>
>> Not sure if fusion and vectorizer "interleaving" makes a difference
>> here. I think the idea was that when interleaving stmt-by-stmt then
>> forward dependences would be preserved and thus we don't need to check
>> the inner loop dependences.  speaking with "forward vs. backward"
>> dependences again, not distances...
>>
>> This also means that unroll-and-jam could be enhanced to "interleave"
>> stmts and thus cover more cases?
>>
>> Again, I hope Micha can have a look here...
>
> Haven't yet looked at the patch, but some comments anyway:
>
> fusion and interleaving interact in the following way in outer loop
> vectorization, conceptually:
> * (1) the outer loop is unrolled
> * (2) the inner loops are fused
> * (3) the (now single) inner body is rescheduled/shuffled/interleaved.
>
Thanks Michael for explaining the issue more clearly; this is what I meant.
As for PR60276, I think it's actually the other side of the problem,
which only relates to the dependence validity of interleaving.

Thanks,
bin
> (1) is always okay.  But (2) and (3) as individual transformations must
> both be checked for validity.  If fusion isn't possible the whole
> transformation is invalid, and if interleaving isn't possible the same is
> true.  In the specific example:
>
>   for (b = 4; b >= 0; b--)
>     for (c = 0; c <= 6; c++)
>       {
>         t = a[c][b + 1];      // S1
>         a[c + 1][b + 2] = t;  // S2
>       }
>
> it's already the fusion step that's invalid.  There's a
> dependence between S1 and S2, e.g. for (b,c) = (4,1) comes-before (3,0)
> with S1(4,1) reading a[1][5] and S2(3,0) writing a[1][5].  So a
> write-after-read.  After fusing:
>
>    for (c = 0; c <= 6; c++)
>      {
>        t = a[c][5];      // S1
>        a[c + 1][6] = t;
>        t = a[c][4];
>        a[c + 1][5] = t;  // S2
>        a[c + 1][4] = a[c][3];
>        a[c + 1][3] = a[c][2];
>      }
>
> here we have at iterations (c) = (0) comes-before (1), at S2(0) writing
> a[1][5] and S1(1) writing a[1][5].  I.e. now it's a read-after-write (the
> write in iteration 0 overwrites the value that is going to be read at
> iteration 1, which wasn't the case in the original loop).  The dependence
> switched direction --> invalid.
>
> The simple interleaving of statements can't rectify this.
> Interleaving is an inner-body reordering but the brokenness comes from a
> cross-iteration ordering.
>
> This example can be unroll-jammed or outer-loop vectorized if one of the
> two loops is reversed.  Let's say we reverse the inner loop, so that it
> runs in the same direction as the outer loop (reversal is possible here).
>
> It'd then be something like:
>
>    for (c = 6; c >= 0; c--)
>      {
>        t = a[c][5];      // S1
>        a[c + 1][6] = t;
>        t = a[c][4];
>        a[c + 1][5] = t;  // S2
>        a[c + 1][4] = a[c][3];
>        a[c + 1][3] = a[c][2];
>      }
>
> The dependence between S1/S2 would still be a write-after-read, and all
> would be well.  This reversal of the inner loop can partly be simulated by
> not only interleaving the inner insns, but by also _reordering_ them.  But
> AFAIK the vectorizer doesn't do this?
>
>
> Ciao,
> Michael.


Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-15 Thread Bin.Cheng
On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> As explained in the PR, given below test case:
>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>
>>>>> int
>>>>> main ()
>>>>> {
>>>>>   short b;
>>>>>   int i, d;
>>>>>   for (b = 4; b >= 0; b--)
>>>>> for (c = 0; c <= 6; c++)
>>>>>   a[c + 1][b + 2] = a[c][b + 1];
>>>>>   for (i = 0; i < 8; i++)
>>>>> for (d = 0; d < 10; d++)
>>>>>   if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>> __builtin_abort ();
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> the loop nest is illegal for vectorization without reversing inner loop.  
>>>>> The issue
>>>>> is in data dependence checking of vectorizer, I believe the mentioned 
>>>>> revision just
>>>>> exposed this.  Previously the vectorization is skipped because of 
>>>>> unsupported memory
>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>>
>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>   {
>>>>> for (c = 0; c <= 6; c++)
>>>>>   a[c + 1][6] = a[c][5];
>>>>> for (c = 0; c <= 6; c++)
>>>>>   a[c + 1][5] = a[c][4];
>>>>> for (c = 0; c <= 6; c++)
>>>>>   a[c + 1][4] = a[c][3];
>>>>> for (c = 0; c <= 6; c++)
>>>>>   a[c + 1][3] = a[c][2];
>>>>>   }
>>>>> Then four inner loops are fused into:
>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>   {
>>>>> for (c = 0; c <= 6; c++)
>>>>> {
>>>>>   a[c + 1][6] = a[c][5];  // S1
>>>>>   a[c + 1][5] = a[c][4];  // S2
>>>>>   a[c + 1][4] = a[c][3];
>>>>>   a[c + 1][3] = a[c][2];
>>>>> }
>>>>>   }
>>>>
>>>> Note that they are not really "fused" but they are interleaved.  With
>>>> GIMPLE in mind
>>>> that makes a difference, you should get the equivalent of
>>>>
>>>>    for (c = 0; c <= 6; c++)
>>>>      {
>>>>        tem1 = a[c][5];
>>>>        tem2 = a[c][4];
>>>>        tem3 = a[c][3];
>>>>        tem4 = a[c][2];
>>>>        a[c+1][6] = tem1;
>>>>        a[c+1][5] = tem2;
>>>>        a[c+1][4] = tem3;
>>>>        a[c+1][3] = tem4;
>>>>      }
>>> Yeah, I will double check if this abstraction breaks the patch and how.
>> Hmm, I think this doesn't break it, well at least for part of the
>> analysis, because it is the loop-carried (backward) dependence that goes
>> wrong; interleaving or not within the same iteration doesn't matter here.
>
> I think the idea is that forward dependences are always fine (negative 
> distance)
> to vectorize.  But with backward dependences we have to adhere to max_vf.
>
> It looks like for outer loop vectorization we only look at the distances in 
> the
> outer loop but never at inner ones.  But here the same applies, and isn't that
> independent of the distances with respect to the outer loop?
>
> But maybe I'm misunderstanding how "distances" work here.
Hmm, I am not sure I understand "distance" correctly.  Going by the
description in books like "Optimizing Compilers for Modern
Architectures", distance is "# of iteration of sink ref - # of
iteration of source ref".  Given the example below:
  for (i = 0; i < N; ++i)
{
  x = arr[idx_1];  // S1
  arr[idx_2] = x;  // S2
}
if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
i;
If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
this is backward dependence.  For example idx_1 is i and idx_2 is i +
1;

In GCC, we always try to subtract idx_2 from idx_1 first in computing
the classic distance, so we could end up with a negative distance in
case of backward dependence.

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-15 Thread Bin.Cheng
On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> As explained in the PR, given below test case:
>>> int a[8][10] = { [2][5] = 4 }, c;
>>>
>>> int
>>> main ()
>>> {
>>>   short b;
>>>   int i, d;
>>>   for (b = 4; b >= 0; b--)
>>> for (c = 0; c <= 6; c++)
>>>   a[c + 1][b + 2] = a[c][b + 1];
>>>   for (i = 0; i < 8; i++)
>>> for (d = 0; d < 10; d++)
>>>   if (a[i][d] != (i == 3 && d == 6) * 4)
>>> __builtin_abort ();
>>>   return 0;
>>> }
>>>
>>> the loop nest is illegal for vectorization without reversing inner loop.  
>>> The issue
>>> is in data dependence checking of vectorizer, I believe the mentioned 
>>> revision just
>>> exposed this.  Previously the vectorization is skipped because of 
>>> unsupported memory
>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>
>>>   for (b = 4; b > 0; b -= 4)
>>>   {
>>> for (c = 0; c <= 6; c++)
>>>   a[c + 1][6] = a[c][5];
>>> for (c = 0; c <= 6; c++)
>>>   a[c + 1][5] = a[c][4];
>>> for (c = 0; c <= 6; c++)
>>>   a[c + 1][4] = a[c][3];
>>> for (c = 0; c <= 6; c++)
>>>   a[c + 1][3] = a[c][2];
>>>   }
>>> Then four inner loops are fused into:
>>>   for (b = 4; b > 0; b -= 4)
>>>   {
>>> for (c = 0; c <= 6; c++)
>>> {
>>>   a[c + 1][6] = a[c][5];  // S1
>>>   a[c + 1][5] = a[c][4];  // S2
>>>   a[c + 1][4] = a[c][3];
>>>   a[c + 1][3] = a[c][2];
>>> }
>>>   }
>>
>> Note that they are not really "fused" but they are interleaved.  With
>> GIMPLE in mind
>> that makes a difference, you should get the equivalent of
>>
>>    for (c = 0; c <= 6; c++)
>>      {
>>        tem1 = a[c][5];
>>        tem2 = a[c][4];
>>        tem3 = a[c][3];
>>        tem4 = a[c][2];
>>        a[c+1][6] = tem1;
>>        a[c+1][5] = tem2;
>>        a[c+1][4] = tem3;
>>        a[c+1][3] = tem4;
>>      }
> Yeah, I will double check if this abstraction breaks the patch and how.
Hmm, I think this doesn't break it, well at least for part of the
analysis, because it is the loop-carried (backward) dependence that goes
wrong; interleaving or not within the same iteration doesn't matter here.

Thanks,
bin
>
>>
>>> The loop fusion needs to meet the dependence requirement.  Basically,
>>> GCC's data dependence analyzer does not model dependences between
>>> references in sibling loops, but in practice the fusion requirement can
>>> be checked by analyzing all data references after fusion and verifying
>>> there is no backward data dependence.
>>>
>>> Apparently, the requirement is violated because we have backward data 
>>> dependence
>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the 
>>> inner
>>> loop, the outer loop would become legal for vectorization.
>>>
>>> This patch fixes the issue by enforcing the dependence check.  It also
>>> adds two tests, one that shouldn't be vectorized and one that should.
>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> I think you have identified the spot where things go wrong but I'm not
>> sure you fix the problem fully.  The spot you patch is (loop is the
>> outer loop):
>>
>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>> ...
>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>> {
>>   int dist = dist_v[loop_depth];
>> ...
>>   if (dist > 0 && DDR_REVERSED_P (ddr))
>> {
>>   /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>  reversed (to make distance vector positive), and the actual
>>  distance is negative.  */
>>   if (dump_enabled_p ())
>> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>  "dependence distance negative.\n");
>>
>> where you add
>>
>> + /* When doing outer loop vectorization, we need to check if there is
>> +backward dependence at inner loop level if dependence at the outer
>> +loop is reversed.  See PR81740 for more information.  */

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

2017-12-15 Thread Bin.Cheng
On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
 wrote:
> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng  wrote:
>> Hi,
>> As explained in the PR, given below test case:
>> int a[8][10] = { [2][5] = 4 }, c;
>>
>> int
>> main ()
>> {
>>   short b;
>>   int i, d;
>>   for (b = 4; b >= 0; b--)
>> for (c = 0; c <= 6; c++)
>>   a[c + 1][b + 2] = a[c][b + 1];
>>   for (i = 0; i < 8; i++)
>> for (d = 0; d < 10; d++)
>>   if (a[i][d] != (i == 3 && d == 6) * 4)
>> __builtin_abort ();
>>   return 0;
>> }
>>
>> the loop nest is illegal for vectorization without reversing inner loop.  
>> The issue
>> is in data dependence checking of vectorizer, I believe the mentioned 
>> revision just
>> exposed this.  Previously the vectorization is skipped because of 
>> unsupported memory
>> operation.  The outer loop vectorization unrolls the outer loop into:
>>
>>   for (b = 4; b > 0; b -= 4)
>>   {
>> for (c = 0; c <= 6; c++)
>>   a[c + 1][6] = a[c][5];
>> for (c = 0; c <= 6; c++)
>>   a[c + 1][5] = a[c][4];
>> for (c = 0; c <= 6; c++)
>>   a[c + 1][4] = a[c][3];
>> for (c = 0; c <= 6; c++)
>>   a[c + 1][3] = a[c][2];
>>   }
>> Then four inner loops are fused into:
>>   for (b = 4; b > 0; b -= 4)
>>   {
>> for (c = 0; c <= 6; c++)
>> {
>>   a[c + 1][6] = a[c][5];  // S1
>>   a[c + 1][5] = a[c][4];  // S2
>>   a[c + 1][4] = a[c][3];
>>   a[c + 1][3] = a[c][2];
>> }
>>   }
>
> Note that they are not really "fused" but they are interleaved.  With
> GIMPLE in mind
> that makes a difference, you should get the equivalent of
>
>    for (c = 0; c <= 6; c++)
>      {
>        tem1 = a[c][5];
>        tem2 = a[c][4];
>        tem3 = a[c][3];
>        tem4 = a[c][2];
>        a[c+1][6] = tem1;
>        a[c+1][5] = tem2;
>        a[c+1][4] = tem3;
>        a[c+1][3] = tem4;
>      }
Yeah, I will double check if this abstraction breaks the patch and how.

>
>> The loop fusion needs to meet the dependence requirement.  Basically,
>> GCC's data dependence analyzer does not model dependences between
>> references in sibling loops, but in practice the fusion requirement can
>> be checked by analyzing all data references after fusion and verifying
>> there is no backward data dependence.
>>
>> Apparently, the requirement is violated because we have backward data 
>> dependence
>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse the 
>> inner
>> loop, the outer loop would become legal for vectorization.
>>
>> This patch fixes the issue by enforcing the dependence check.  It also
>> adds two tests, one that shouldn't be vectorized and one that should.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> I think you have identified the spot where things go wrong but I'm not
> sure you fix the problem fully.  The spot you patch is (loop is the
> outer loop):
>
>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
> ...
>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
> {
>   int dist = dist_v[loop_depth];
> ...
>   if (dist > 0 && DDR_REVERSED_P (ddr))
> {
>   /* If DDR_REVERSED_P the order of the data-refs in DDR was
>  reversed (to make distance vector positive), and the actual
>  distance is negative.  */
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "dependence distance negative.\n");
>
> where you add
>
> + /* When doing outer loop vectorization, we need to check if there is
> +backward dependence at inner loop level if dependence at the outer
> +loop is reversed.  See PR81740 for more information.  */
> + if (nested_in_vect_loop_p (loop, DR_STMT (dra))
> + || nested_in_vect_loop_p (loop, DR_STMT (drb)))
> +   {
> + unsigned inner_depth = index_in_loop_nest (loop->inner->num,
> +DDR_LOOP_NEST (ddr));
> + if (dist_v[inner_depth] < 0)
> +   return true;
> +   }
>
> but I don't understand how the dependence direction with respect to the
> outer loop matters here.
If the direction wrt the outer loop is positive by itself, i.e.,
reversed_p equals false, then dist is checked against max_vf.  In
this case, it's not possible to have references refer to the same
object?
On the other hand, dist is not checked at all for the reversed case.
Maybe an additional check "dist < max_vf" can relax the patch a bit.
>
> Given there's DDR_REVERSED on the outer loop distance what does that
> mean for the inner loop distance given the quite non-obvious code handling
> this case in tree-data-ref.c:
>
>   /* Verify a basic constraint: classic distance vectors should
>  always be lexicographically positive.
>
>  Data references are collected in the order of execution of
>  the 

Re: [PATCH GCC]More conservative interchanging small loops with const initialized simple reduction

2017-12-12 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This simple patch makes interchange even more conservative for small loops
>>>> with constant initialized simple reduction.
>>>> The reason is that undoing such a reduction introduces a new data
>>>> reference and cond_expr, which could cost too much in a small loop.
>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if 
>>>> test passes?
>>>
>>> Shouldn't we do this even for non-constant initialized simple
>>> reduction?  Because for any simple reduction we add two DRs that are
>>> not innermost; for constant initialized we add an additional
>>> cond-expr.  So ...
>>>
>>> +  /* Conservatively skip interchange in cases only have few data references
>>> + and constant initialized simple reduction since it introduces new data
>>> + reference as well as ?: operation.  */
>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>> +return false;
>>> +
>>>
>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>> over m_reductions and classify them in this function accordingly?  I
>>> think we want to cost non-constant-init reductions as well.  The ?: can
>>> eventually count for another DR for cost purposes.
>> Number of non-constant-init reductions can still be carried in struct
>> loop_cand?  I am not very sure what's the advantage of an additional
>> loop over m_reductions getting the same information.
>> Perhaps the increase of stmts should be counted like:
>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>> Question is which number should this be compared against.  (we may
>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>
>>>
>>> It looks like we do count the existing DRs for the reduction?  Is that
>>> why you arrive
>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>> Yes.
>>> But we don't really know whether the DR was invariant in the outer
>>> loop (well, I suppose
>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>> invariant references with regard to the inner loop, rather than the
>> outer loop.  The same goes for num_new_inv_drs, which counts references
>> that become invariant with regard to the (new) inner loop after loop
>> interchange.  This invariant information is always known from the data
>> reference, right?
>> As for the DRs for reductions, we know they are invariant because we set
>> their inner loop stride to zero.
>>
>>> we could remember the DR in m_reductions).
>>>
>>> Note that the good thing is that the ?: has an invariant condition and
>>> thus vectorization
>>> can hoist the mask generation out of the vectorized loop which means
>>> it boils down to
>>> cheap operations.  My gut feeling is that just looking at the number
>>> of memory references
>>> isn't a good indicator of profitability as the regular stmt workload
>>> has a big impact on
>>> profitability of vectorization.
>> It's not specific to vectorization.  The generated new code also costs
>> too much in small loops without vectorization.  But yes, # of mem_refs
>> may be too inaccurate, maybe we should check against num_stmts.
>
> Not specific to vectorization but the interchange may pay off only when
> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
> interchanged?  If we remove the multiplication and just keep
> c[i][j] = c[i][j] + b[k][j];
> ?  That is, why is the constant init so special?  Even for non-constant init
> we're changing two outer loop DRs to two non-consecutive inner loop DRs.
Hi Richard,
This is an updated patch taking stmt cost into consideration.

First, the stmt cost (from # of stmts) of each loop is recorded.  Then the
stmt cost of the outer loop is adjusted: decreased by the number of IVs,
increased by the number of constant initialized simple reductions.
Lastly we compare stmt cost between the inner/outer loops and give up on
interchange if the outer loop has too many stmts.
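For concreteness, a hypothetical sketch of that check (all names
invented; the threshold is illustrative, not the patch's):

  /* Outer loop stmt cost: minus the IV stmts that disappear, plus the
     extra load and ?: per constant initialized simple reduction.  */
  static bool
  outer_loop_cheap_enough (int outer_stmt_cost, int inner_stmt_cost,
                           int num_ivs, int num_const_init_reduc)
  {
    outer_stmt_cost -= num_ivs;
    outer_stmt_cost += 2 * num_const_init_reduc;
    return outer_stmt_cost <= inner_stmt_cost;
  }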

Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Bootstrap and test
on x86_64 and AArch64.

Re: [PATCH GCC]More conservative interchanging small loops with const initialized simple reduction

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 3:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small
>>>>> loops with constant initialized simple reduction.
>>>>> The reason is that undoing such a reduction introduces a new data
>>>>> reference and cond_expr, which could cost too much in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if 
>>>>> test passes?
>>>>
>>>> Shouldn't we do this even for non-constant initialized simple
>>>> reduction?  Because for any simple reduction we add two DRs that are
>>>> not innermost; for constant initialized we add an additional
>>>> cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases only have few data 
>>>> references
>>>> + and constant initialized simple reduction since it introduces new 
>>>> data
>>>> + reference as well as ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>>> over m_reductions and classify them in this function accordingly?  I
>>>> think we want to cost non-constant-init reductions as well.  The ?: can
>>>> eventually count for another DR for cost purposes.
>>> Number of non-constant-init reductions can still be carried in struct
>>> loop_cand?  I am not very sure what's the advantage of an additional
>>> loop over m_reductions getting the same information.
>>> Perhaps the increase of stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> Question is which number should this be compared against.  (we may
>>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>>
>>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is that
>>>> why you arrive
>>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>>> invariant references with regard to the inner loop, rather than the
>>> outer loop.  The same goes for num_new_inv_drs, which counts references
>>> that become invariant with regard to the (new) inner loop after loop
>>> interchange.  This invariant information is always known from the data
>>> reference, right?
>>> As for the DRs for reductions, we know they are invariant because we set
>>> their inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition and
>>>> thus vectorization
>>>> can hoist the mask generation out of the vectorized loop which means
>>>> it boils down to
>>>> cheap operations.  My gut feeling is that just looking at the number
>>>> of memory references
>>>> isn't a good indicator of profitability as the regular stmt workload
>>>> has a big impact on
>>>> profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also costs
>>> too much in small loops without vectorization.  But yes, # of mem_refs
>>> may be too inaccurate, maybe we should check against num_stmts.
>>
>> Not specific to vectorization but the interchange may pay off only when
>> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
>> interchanged?  If we remove the multiplication and just keep
>> c[i][j] = c[i][j] + b[k][j];
> Both loop-interchange-5.c and the modified version are interchanged,
> because we check it against the number of all data references
> (including num_old_inv_drs):
>  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())

Re: [PATCH GCC]More conservative interchanging small loops with const initialized simple reduction

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This simple patch makes interchange even more conservative for small loops
>>>> with constant initialized simple reduction.
>>>> The reason is that undoing such a reduction introduces a new data
>>>> reference and cond_expr, which could cost too much in a small loop.
>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if 
>>>> test passes?
>>>
>>> Shouldn't we do this even for non-constant initialized simple
>>> reduction?  Because for any simple reduction we add two DRs that are
>>> not innermost; for constant initialized we add an additional
>>> cond-expr.  So ...
>>>
>>> +  /* Conservatively skip interchange in cases only have few data references
>>> + and constant initialized simple reduction since it introduces new data
>>> + reference as well as ?: operation.  */
>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>> +return false;
>>> +
>>>
>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>> over m_reductions and classify them in this function accordingly?  I
>>> think we want to cost non-constant-init reductions as well.  The ?: can
>>> eventually count for another DR for cost purposes.
>> Number of non-constant-init reductions can still be carried in struct
>> loop_cand?  I am not very sure what's the advantage of an additional
>> loop over m_reductions getting the same information.
>> Perhaps the increase of stmts should be counted like:
>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>> Question is which number should this be compared against.  (we may
>> need to shift num_new_inv_drs to the other side for wrapping issue).
>>
>>>
>>> It looks like we do count the existing DRs for the reduction?  Is that
>>> why you arrive
>>> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
>> Yes.
>>> But we don't really know whether the DR was invariant in the outer
>>> loop (well, I suppose
>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>> invariant references with regard to the inner loop, rather than the
>> outer loop.  The same goes for num_new_inv_drs, which counts references
>> that become invariant with regard to the (new) inner loop after loop
>> interchange.  This invariant information is always known from the data
>> reference, right?
>> As for the DRs for reductions, we know they are invariant because we set
>> their inner loop stride to zero.
>>
>>> we could remember the DR in m_reductions).
>>>
>>> Note that the good thing is that the ?: has an invariant condition and
>>> thus vectorization
>>> can hoist the mask generation out of the vectorized loop which means
>>> it boils down to
>>> cheap operations.  My gut feeling is that just looking at the number
>>> of memory references
>>> isn't a good indicator of profitability as the regular stmt workload
>>> has a big impact on
>>> profitability of vectorization.
>> It's not specific to vectorization.  The generated new code also costs
>> too much in small loops without vectorization.  But yes, # of mem_refs
>> may be too inaccurate, maybe we should check against num_stmts.
>
> Not specific to vectorization but the interchange may pay off only when
> vectorizing a loop.  Would the loop in loop-interchange-5.c be still
> interchanged?  If we remove the multiplication and just keep
> c[i][j] = c[i][j] + b[k][j];
Both loop-interchange-5.c and the modified version are interchanged,
because we check it against the number of all data references
(including num_old_inv_drs):
 if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())

> ?  That is, why is the constant init so special?  Even for non-constant init
> we're changing two outer loop DRs to two non-consecutive inner loop DRs.
No, the two outer loop DRs become consecutive with respect to the inner loop.
So for a typical matrix mul case, the interchange moves two outer loop
DRs into the inner loop and moves an inner loop DR out to the outer loop.
Overall it introduces an additional i
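To make the matmul case concrete (illustrative code, not from the
patch): before interchange the innermost k loop sees a[i][k]
consecutive, b[k][j] strided and c[i][j] invariant (the reduction);
after interchanging j and k, c[i][j] and b[k][j] are consecutive in the
new innermost j loop while a[i][k] becomes invariant:

  #define N 128
  static double a[N][N], b[N][N], c[N][N];

  void matmul_after_interchange (void)
  {
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)    /* was innermost before interchange */
        for (int j = 0; j < N; j++)  /* now innermost: unit stride */
          c[i][j] += a[i][k] * b[k][j];
  }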

Re: [PATCH GCC]More conservative interchanging small loops with const initialized simple reduction

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
 wrote:
> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng  wrote:
>> Hi,
>> This simple patch makes interchange even more conservative for small loops
>> with constant initialized simple reduction.
>> The reason is that undoing such a reduction introduces a new data
>> reference and cond_expr, which could cost too much in a small loop.
>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK if 
>> test passes?
>
> Shouldn't we do this even for non-constant initialized simple
> reduction?  Because for any simple reduction we add two DRs that are
> not innermost; for constant initialized we add an additional
> cond-expr.  So ...
>
> +  /* Conservatively skip interchange in cases only have few data references
> + and constant initialized simple reduction since it introduces new data
> + reference as well as ?: operation.  */
> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
> +return false;
> +
>
> can you, instead of carrying num_const_init_simple_reduc, simply loop
> over m_reductions and classify them in this function accordingly?  I
> think we want to cost non-constant-init reductions as well.  The ?: can
> eventually count for another DR for cost purposes.
Number of non-constant-init reductions can still be carried in struct
loop_cand?  I am not very sure what's the advantage of an additional
loop over m_reductions getting the same information.
Perhaps the increase of stmts should be counted like:
  num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
Question is which number should this be compared against.  (we may
need to shift num_new_inv_drs to the other side for wrapping issue).

>
> It looks like we do count the existing DRs for the reduction?  Is that
> why you arrive
> at the num_const_init_simple_reduc * 2 figure? (one extra load plus one ?:)
Yes.
> But we don't really know whether the DR was invariant in the outer
> loop (well, I suppose
Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
invariant references with regard to the inner loop, rather than the
outer loop.  The same goes for num_new_inv_drs, which counts references
that become invariant with regard to the (new) inner loop after loop
interchange.  This invariant information is always known from the data
reference, right?
As for the DRs for reductions, we know they are invariant because we set
their inner loop stride to zero.

> we could remember the DR in m_reductions).
>
> Note that the good thing is that the ?: has an invariant condition and
> thus vectorization
> can hoist the mask generation out of the vectorized loop which means
> it boils down to
> cheap operations.  My gut feeling is that just looking at the number
> of memory references
> isn't a good indicator of profitability as the regular stmt workload
> has a big impact on
> profitability of vectorization.
It's not specific to vectorization.  The generated new code also costs
too much in small loops without vectorization.  But yes, # of mem_refs
may be too inaccurate, maybe we should check against num_stmts.

Thanks,
bin
>
> So no ack nor nack...
>
> Richard.
>
>> Thanks,
>> bin
>> 2017-12-08  Bin Cheng  
>>
>> * gimple-loop-interchange.cc (struct loop_cand): New field.
>> (loop_cand::loop_cand): Init new field in constructor.
>> (loop_cand::classify_simple_reduction): Record simple reduction
>> initialized with constant value.
>> (should_interchange_loops): New parameter.  Skip interchange if loop
>> has few data references and constant initialized simple reduction.
>> (tree_loop_interchange::interchange): Update call to above function.
>> (should_interchange_loop_nest): Ditto.


Re: [PATCH] Fix vectorizer part of PR81303

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 10:39 AM, Richard Biener <rguent...@suse.de> wrote:
> On Fri, 8 Dec 2017, Bin.Cheng wrote:
>
>> On Fri, Dec 8, 2017 at 9:54 AM, Richard Biener <rguent...@suse.de> wrote:
>> > On Fri, 8 Dec 2017, Bin.Cheng wrote:
>> >
>> >> On Fri, Dec 8, 2017 at 8:29 AM, Richard Biener <rguent...@suse.de> wrote:
>> >> > On Fri, 8 Dec 2017, Christophe Lyon wrote:
>> >> >
>> >> >> On 8 December 2017 at 09:07, Richard Biener <rguent...@suse.de> wrote:
>> >> >> > On Thu, 7 Dec 2017, Bin.Cheng wrote:
>> >> >> >
>> >> >> >> On Wed, Dec 6, 2017 at 1:29 PM, Richard Biener <rguent...@suse.de> 
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> > The following fixes a vectorization issue that appears when trying
>> >> >> >> > to vectorize the bwaves mat_times_vec kernel after interchange was
>> >> >> >> > performed by the interchange pass.  That interchange inserts the
>> >> >> >> > following code for the former reduction created by LIM 
>> >> >> >> > store-motion:
>> >> >> >> I do observe more cases are vectorized by this patch on AArch64.
>> >> >> >> I still want to find a way to avoid generating the cond_expr, but
>> >> >> >> for the moment I will have another patch making interchange even
>> >> >> >> more conservative for small cases, in which the new cmp/select
>> >> >> >> instructions cost a lot against the small loop body.
>> >> >> >
>> >> >> > Yeah.  I thought about what it takes to avoid the conditional - 
>> >> >> > basically
>> >> >> > we'd need to turn the init value to a (non-nested) loop that we'd 
>> >> >> > need
>> >> >> > to insert on the preheader of the outer loop.
>> >> >> >
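In C-like form, the two variants under discussion look roughly like this
(illustrative; it mirrors the guarded-load GIMPLE quoted at the end of
this digest, where the lsm value is selected by m != 1):

  #define M 8
  #define L 8
  static double y[L], a[M][L];

  /* What interchange currently emits: the reduction's init value is
     chosen by a ?: inside the nest.  */
  void with_cond_expr (void)
  {
    for (int m = 0; m < M; m++)
      for (int l = 0; l < L; l++)
        {
          double t = (m != 0) ? y[l] : 0.0;
          t += a[m][l];
          y[l] = t;
        }
  }

  /* The alternative sketched above: a separate (non-nested) init loop
     on the outer loop's preheader, which drops the ?:.  */
  void with_init_loop (void)
  {
    for (int l = 0; l < L; l++)
      y[l] = 0.0;
    for (int m = 0; m < M; m++)
      for (int l = 0; l < L; l++)
        y[l] += a[m][l];
  }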
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> I noticed a regression on aarch64 after Bin's commit r255472:
>> >> >> gcc.target/aarch64/pr62178.c scan-assembler ldr\\tq[0-9]+,
>> >> >> \\[x[0-9]+\\], [0-9]+
>> >> >> gcc.target/aarch64/pr62178.c scan-assembler ldr\\ts[0-9]+,
>> >> >> \\[x[0-9]+, [0-9]+\\]!
>> >> >> gcc.target/aarch64/pr62178.c scan-assembler mla\\tv[0-9]+.4s,
>> >> >> v[0-9]+.4s, v[0-9]+.s\\[0\\]
>> >> >>
>> >> >> Is this patch supposed to fix it?
>> >> >
>> >> > No, from what I can see the patch shouldn't affect it.  But it's not
>> >> > clear what the testcase tests for - it just scans assembler.
>> >> > Clearly we want to interchange the loop here so the scan assembler
>> >> I am not very sure.  Though interchanging gives better cache behavior,
>> >> the loop is relatively small here; the introduced cond_expr results in
>> >> two more instructions, as well as one additional memory access from
>> >> undoing the reduction.  Together with the addressing mode chosen by
>> >> ivopts, it leads to an obvious regression.
>> >> Ah, another issue is that the cond_expr blocks vectorization without
>> >> your patch here.  This case is what I meant by small loops for which
>> >> more conservative interchange may be wanted.
>> >
>> > The loop has int data and int IVs so my patch shouldn't be necessary
>> > to vectorize the loop.
>> I haven't got time to look into the vectorizer part yet, but there is the
>> below in the dump file:
>>
>> pr62178.c:12:7: note: vect_is_simple_use: operand k_9
>> pr62178.c:12:7: note: def_stmt: k_9 = PHI <1(7), k_38(10)>
>> pr62178.c:12:7: note: type of def: external
>> pr62178.c:12:7: note: not vectorized: relevant stmt not supported:
>> r_I_I_lsm.0_19 = k_9 != 1 ? r_I_I_lsm.0_14 : 0;
>> pr62178.c:12:7: note: bad operation or unsupported loop bound.
>> pr62178.c:12:7: note: * Re-trying analysis with vector size 8
>
> Yes, so neon doesn't have a way to vectorize such conditionals?
> It does set vect_condition and even vect_cond_mixed though.
It should.  IIRC, it was me who enabled the vect_cond* stuff on AArch64.  I
will look into it after fixing more urgent bugs.
> You'd have to dive into why it thinks it can

Re: [PATCH] Fix vectorizer part of PR81303

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 9:54 AM, Richard Biener <rguent...@suse.de> wrote:
> On Fri, 8 Dec 2017, Bin.Cheng wrote:
>
>> On Fri, Dec 8, 2017 at 8:29 AM, Richard Biener <rguent...@suse.de> wrote:
>> > On Fri, 8 Dec 2017, Christophe Lyon wrote:
>> >
>> >> On 8 December 2017 at 09:07, Richard Biener <rguent...@suse.de> wrote:
>> >> > On Thu, 7 Dec 2017, Bin.Cheng wrote:
>> >> >
>> >> >> On Wed, Dec 6, 2017 at 1:29 PM, Richard Biener <rguent...@suse.de> 
>> >> >> wrote:
>> >> >> >
>> >> >> > The following fixes a vectorization issue that appears when trying
>> >> >> > to vectorize the bwaves mat_times_vec kernel after interchange was
>> >> >> > performed by the interchange pass.  That interchange inserts the
>> >> >> > following code for the former reduction created by LIM store-motion:
>> >> >> I do observe more cases are vectorized by this patch on AArch64.
>> >> >> Still want to find a way of not generating the cond_expr, but for
>> >> >> the moment I will have another patch making interchange even more
>> >> >> conservative for small cases, in which the new cmp/select
>> >> >> instructions cost a lot relative to the small loop body.
>> >> >
>> >> > Yeah.  I thought about what it takes to avoid the conditional - 
>> >> > basically
>> >> > we'd need to turn the init value to a (non-nested) loop that we'd need
>> >> > to insert on the preheader of the outer loop.
>> >> >
>> >>
>> >> Hi,
>> >>
>> >> I noticed a regression on aarch64 after Bin's commit r255472:
>> >> gcc.target/aarch64/pr62178.c scan-assembler ldr\\tq[0-9]+,
>> >> \\[x[0-9]+\\], [0-9]+
>> >> gcc.target/aarch64/pr62178.c scan-assembler ldr\\ts[0-9]+,
>> >> \\[x[0-9]+, [0-9]+\\]!
>> >> gcc.target/aarch64/pr62178.c scan-assembler mla\\tv[0-9]+.4s,
>> >> v[0-9]+.4s, v[0-9]+.s\\[0\\]
>> >>
>> >> Is this patch supposed to fix it?
>> >
>> > No, from what I can see the patch shouldn't affect it.  But it's not
>> > clear what the testcase tests for - it just scans assembler.
>> > Clearly we want to interchange the loop here so the scan assembler
>> I am not very sure.  Though interchanging gives better cache behavior,
>> the loop is relatively small here, and the introduced cond_expr
>> results in two more instructions, as well as one additional memory
>> access from undoing the reduction.  Together with the addressing mode
>> chosen in ivopts, it leads to an obvious regression.
>> Ah, another issue is that the cond_expr blocks vectorization without
>> your patch here.  This case is what I meant by small loops for which
>> more conservative interchange may be wanted.
>
> The loop has int data and int IVs so my patch shouldn't be necessary
> to vectorize the loop.
I haven't got time to look into the vectorizer part yet, but there is
the following in the dump file:

pr62178.c:12:7: note: vect_is_simple_use: operand k_9
pr62178.c:12:7: note: def_stmt: k_9 = PHI <1(7), k_38(10)>
pr62178.c:12:7: note: type of def: external
pr62178.c:12:7: note: not vectorized: relevant stmt not supported:
r_I_I_lsm.0_19 = k_9 != 1 ? r_I_I_lsm.0_14 : 0;
pr62178.c:12:7: note: bad operation or unsupported loop bound.
pr62178.c:12:7: note: * Re-trying analysis with vector size 8
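
For illustration, a reduced sketch of the kind of nest that ends up with
this guarded re-initialization after interchange (my own reconstruction,
not the actual pr62178.c source; names and bounds are invented):

  int a[64][64], x[64], y[64];

  void
  foo (void)
  {
    /* After interchange the store-motion temporary for y[i] is
       re-initialized under k != 1, which is the COND_EXPR the
       vectorizer rejects above.  */
    for (int k = 1; k < 64; k++)
      for (int i = 0; i < 64; i++)
        {
          int t = (k != 1) ? y[i] : 0;
          y[i] = t + a[k][i] * x[k];
        }
  }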

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin
>>
>> > needs to be adjusted and one has to revisit PR62178 to check whether
>> > the result is still ok (or simply add -fno-loop-interchange to it).
>> >
>> > Richard.
>> >
>> >> Thanks,
>> >>
>> >> Christophe
>> >>
>> >> > Richard.
>> >> >
>> >> >> Thanks,
>> >> >> bin
>> >> >> >
>> >> >> >[local count: 161061274]:
>> >> >> >   # m_58 = PHI <1(10), m_84(20)>
>> >> >> > ...
>> >> >> >[local count: 912680551]:
>> >> >> >   # l_35 = PHI <1(13), l_57(21)>
>> >> >> > ...
>> >> >> >   y__I_lsm.113_140 = *y_139(D)[_31];
>> >> >> >   y__I_lsm.113_94 = m_58 != 1 ? y__I_lsm.113_140 : 0.0;
>> >> >> > ...
>> >> >> >   *y_139(D)[_31] = _101;
>> >>

Re: [PATCH] Fix vectorizer part of PR81303

2017-12-08 Thread Bin.Cheng
On Fri, Dec 8, 2017 at 8:29 AM, Richard Biener <rguent...@suse.de> wrote:
> On Fri, 8 Dec 2017, Christophe Lyon wrote:
>
>> On 8 December 2017 at 09:07, Richard Biener <rguent...@suse.de> wrote:
>> > On Thu, 7 Dec 2017, Bin.Cheng wrote:
>> >
>> >> On Wed, Dec 6, 2017 at 1:29 PM, Richard Biener <rguent...@suse.de> wrote:
>> >> >
>> >> > The following fixes a vectorization issue that appears when trying
>> >> > to vectorize the bwaves mat_times_vec kernel after interchange was
>> >> > performed by the interchange pass.  That interchange inserts the
>> >> > following code for the former reduction created by LIM store-motion:
>> >> I do observe more cases are vectorized by this patch on AArch64.
>> >> Still want to find a way of not generating the cond_expr, but for
>> >> the moment I will have another patch making interchange even more
>> >> conservative for small cases, in which the new cmp/select
>> >> instructions cost a lot relative to the small loop body.
>> >
>> > Yeah.  I thought about what it takes to avoid the conditional - basically
>> > we'd need to turn the init value to a (non-nested) loop that we'd need
>> > to insert on the preheader of the outer loop.
>> >
>>
>> Hi,
>>
>> I noticed a regression on aarch64 after Bin's commit r255472:
>> gcc.target/aarch64/pr62178.c scan-assembler ldr\\tq[0-9]+,
>> \\[x[0-9]+\\], [0-9]+
>> gcc.target/aarch64/pr62178.c scan-assembler ldr\\ts[0-9]+,
>> \\[x[0-9]+, [0-9]+\\]!
>> gcc.target/aarch64/pr62178.c scan-assembler mla\\tv[0-9]+.4s,
>> v[0-9]+.4s, v[0-9]+.s\\[0\\]
>>
>> Is this patch supposed to fix it?
>
> No, from what I can see the patch shouldn't affect it.  But it's not
> clear what the testcase tests for - it just scans assembler.
> Clearly we want to interchange the loop here so the scan assembler
I am not very sure.  Though interchanging gives better cache behavior,
the loop is relatively small here, and the introduced cond_expr
results in two more instructions, as well as one additional memory
access from undoing the reduction.  Together with the addressing mode
chosen in ivopts, it leads to an obvious regression.
Ah, another issue is that the cond_expr blocks vectorization without
your patch here.  This case is what I meant by small loops for which
more conservative interchange may be wanted.

Thanks,
bin

> needs to be adjusted and one has to revisit PR62178 to check whether
> the result is still ok (or simply add -fno-loop-interchange to it).
>
> Richard.
>
>> Thanks,
>>
>> Christophe
>>
>> > Richard.
>> >
>> >> Thanks,
>> >> bin
>> >> >
>> >> >[local count: 161061274]:
>> >> >   # m_58 = PHI <1(10), m_84(20)>
>> >> > ...
>> >> >[local count: 912680551]:
>> >> >   # l_35 = PHI <1(13), l_57(21)>
>> >> > ...
>> >> >   y__I_lsm.113_140 = *y_139(D)[_31];
>> >> >   y__I_lsm.113_94 = m_58 != 1 ? y__I_lsm.113_140 : 0.0;
>> >> > ...
>> >> >   *y_139(D)[_31] = _101;
>> >> >
>> >> >
>> >> > so we have a COND_EXPR with a test on an integer IV m_58 with
>> >> > double values.  Note that the m_58 != 1 condition is invariant
>> >> > in the l loop.
>> >> >
>> >> > Currently we vectorize this condition using V8SImode vectors
>> >> > causing a vectorization factor of 8 and thus forcing the scalar
>> >> > path for the bwaves case (the loops have an iteration count of 5).
>> >> >
>> >> > The following patch makes the vectorizer handle invariant conditions
>> >> > in the first place and second handle widening of operands of invariant
>> >> > conditions transparently (the promotion will happen on the invariant
>> >> > scalars).  This makes it possible to use a vectorization factor of 4,
>> >> > reducing the bwaves runtime from 208s before interchange
>> >> > (via 190s after interchange) to 172s after interchange and vectorization
>> >> > with AVX256 (on a Haswell machine).
>> >> >
>> >> > For the vectorizable_condition part to work I need to avoid
>> >> > pulling apart the condition from the COND_EXPR during pattern
>> >> > detection.
>> >> >
>> >> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >> >

Re: [PATCH] Fix vectorizer part of PR81303

2017-12-07 Thread Bin.Cheng
On Wed, Dec 6, 2017 at 1:29 PM, Richard Biener  wrote:
>
> The following fixes a vectorization issue that appears when trying
> to vectorize the bwaves mat_times_vec kernel after interchange was
> performed by the interchange pass.  That interchange inserts the
> following code for the former reduction created by LIM store-motion:
I do observe more cases are vectorized by this patch on AArch64.
Still want to find a way of not generating the cond_expr, but for the
moment I will have another patch making interchange even more
conservative for small cases, in which the new cmp/select instructions
cost a lot relative to the small loop body.

Thanks,
bin
>
>[local count: 161061274]:
>   # m_58 = PHI <1(10), m_84(20)>
> ...
>[local count: 912680551]:
>   # l_35 = PHI <1(13), l_57(21)>
> ...
>   y__I_lsm.113_140 = *y_139(D)[_31];
>   y__I_lsm.113_94 = m_58 != 1 ? y__I_lsm.113_140 : 0.0;
> ...
>   *y_139(D)[_31] = _101;
>
>
> so we have a COND_EXPR with a test on an integer IV m_58 with
> double values.  Note that the m_58 != 1 condition is invariant
> in the l loop.
>
> Currently we vectorize this condition using V8SImode vectors
> causing a vectorization factor of 8 and thus forcing the scalar
> path for the bwaves case (the loops have an iteration count of 5).
>
> The following patch makes the vectorizer handle invariant conditions
> in the first place and second handle widening of operands of invariant
> conditions transparently (the promotion will happen on the invariant
> scalars).  This makes it possible to use a vectorization factor of 4,
> reducing the bwaves runtime from 208s before interchange
> (via 190s after interchange) to 172s after interchange and vectorization
> with AVX256 (on a Haswell machine).
>
> For the vectorizable_condition part to work I need to avoid
> pulling apart the condition from the COND_EXPR during pattern
> detection.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Richard.
>
> 2017-12-06  Richard Biener  
>
> PR tree-optimization/81303
> * tree-vect-stmts.c (vect_is_simple_cond): For invariant
> conditions try to create a comparison vector type matching
> the data vector type.
> (vectorizable_condition): Adjust.
> * tree-vect-patterns.c (vect_recog_mask_conversion_pattern):
> Leave invariant conditions alone in case we can vectorize those.
>
> * gcc.target/i386/vectorize9.c: New testcase.
> * gcc.target/i386/vectorize10.c: New testcase.
>
> Index: gcc/tree-vect-stmts.c
> ===
> --- gcc/tree-vect-stmts.c   (revision 255438)
> +++ gcc/tree-vect-stmts.c   (working copy)
> @@ -7792,7 +7792,8 @@ vectorizable_load (gimple *stmt, gimple_
>
>  static bool
>  vect_is_simple_cond (tree cond, vec_info *vinfo,
> -tree *comp_vectype, enum vect_def_type *dts)
> +tree *comp_vectype, enum vect_def_type *dts,
> +tree vectype)
>  {
>tree lhs, rhs;
>tree vectype1 = NULL_TREE, vectype2 = NULL_TREE;
> @@ -7845,6 +7846,20 @@ vect_is_simple_cond (tree cond, vec_info
>  return false;
>
>*comp_vectype = vectype1 ? vectype1 : vectype2;
> +  /* Invariant comparison.  */
> +  if (! *comp_vectype)
> +{
> +  tree scalar_type = TREE_TYPE (lhs);
> +  /* If we can widen the comparison to match vectype do so.  */
> +  if (INTEGRAL_TYPE_P (scalar_type)
> + && tree_int_cst_lt (TYPE_SIZE (scalar_type),
> + TYPE_SIZE (TREE_TYPE (vectype
> +   scalar_type = build_nonstandard_integer_type
> + (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (vectype))),
> +  TYPE_UNSIGNED (scalar_type));
> +  *comp_vectype = get_vectype_for_scalar_type (scalar_type);
> +}
> +
>return true;
>  }
>
> @@ -7942,7 +7957,7 @@ vectorizable_condition (gimple *stmt, gi
>else_clause = gimple_assign_rhs3 (stmt);
>
>if (!vect_is_simple_cond (cond_expr, stmt_info->vinfo,
> -   &comp_vectype, &dts[0])
> +   &comp_vectype, &dts[0], vectype)
>|| !comp_vectype)
>  return false;
>
> Index: gcc/tree-vect-patterns.c
> ===
> --- gcc/tree-vect-patterns.c(revision 255438)
> +++ gcc/tree-vect-patterns.c(working copy)
> @@ -3976,6 +3976,32 @@ vect_recog_mask_conversion_pattern (vec<
>   || TYPE_VECTOR_SUBPARTS (vectype1) == TYPE_VECTOR_SUBPARTS 
> (vectype2))
> return NULL;
>
> +  /* If rhs1 is invariant and we can promote it leave the COND_EXPR
> + in place, we can handle it in vectorizable_condition.  This avoids
> +unnecessary promotion stmts and increased vectorization factor.  */
> +  if (COMPARISON_CLASS_P (rhs1)
> + && INTEGRAL_TYPE_P (rhs1_type)
> + && TYPE_VECTOR_SUBPARTS (vectype1) < TYPE_VECTOR_SUBPARTS 
> (vectype2))
> 

Re: [PATCH GCC]Introduce loop interchange pass and enable it at -O3

2017-12-07 Thread Bin.Cheng
On Thu, Dec 7, 2017 at 11:39 AM, Richard Biener
 wrote:
> On Thu, Dec 7, 2017 at 11:28 AM, Bin Cheng  wrote:
>> Hi,
>> This is the overall loop interchange patch on the gimple-linterchange
>> branch.  Note the new pass is enabled at -O3 level by default.
>> Bootstrap and regtest on x86_64 and AArch64 (ongoing).
>> Note after the cost model change it is now far more conservative than
>> the original version.  It only interchanges 11 loops in spec2k6 (416
>> doesn't build at the moment), vs ~250 for the original version.  I
>> will collect compilation time data, though there shouldn't be any
>> surprise given few loops are actually interchanged.  I will also
>> collect spec2k6 data; it shouldn't affect cases other than bwaves
>> either.
>> So is it OK?
>
> Please omit the no longer needed change to gsi_remove in
> gimple-iterator.[ch].  The new
> --params need documenting in invoke.texi.
Here is the updated patch.  I added documentation for the new
parameters in invoke.texi, but the original patch doesn't have any
change in gimple-iterator.[ch]?
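
For illustration, a minimal testcase of the sort the patch adds (a
sketch of mine, not the committed loop-interchange-1.c; the dump string
scanned for is hypothetical):

  /* { dg-do compile } */
  /* { dg-options "-O3 -fdump-tree-linterchange-details" } */

  #define N 128
  double a[N][N], b[N][N];

  void
  bad_order (void)
  {
    /* Inner j walks rows of a[j][i] (stride N*8); after interchange
       the inner dimension is i and accesses are contiguous (stride 8).  */
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        a[j][i] = b[j][i] * 2.0;
  }

  /* { dg-final { scan-tree-dump-times "is interchanged" 1 "linterchange" } } */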

Thanks,
bin
>
> Ok with those changes.
>
> Thanks!
> Richard.
>
>> Thanks,
>> bin
>> 2017-12-07  Bin Cheng  
>> Richard Biener  
>>
>> PR tree-optimization/81303
>> * Makefile.in (gimple-loop-interchange.o): New object file.
>> * common.opt (floop-interchange): Reuse the option from graphite.
>> * doc/invoke.texi (-floop-interchange): Ditto.  New document for
>> -floop-interchange and mention it for -O3.
>> * opts.c (default_options_table): Enable -floop-interchange at -O3.
>> * gimple-loop-interchange.cc: New file.
>> * params.def (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS): New parameter.
>> (PARAM_LOOP_INTERCHANGE_STRIDE_RATIO): New parameter.
>> * passes.def (pass_linterchange): New pass.
>> * timevar.def (TV_LINTERCHANGE): New time var.
>> * tree-pass.h (make_pass_linterchange): New declaration.
>> * tree-ssa-loop-ivcanon.c (create_canonical_iv): Change to external
>> interchange.  Record IV before/after increment in new parameters.
>> * tree-ssa-loop-ivopts.h (create_canonical_iv): New declaration.
>> * tree-vect-loop.c (vect_is_simple_reduction): Factor out reduction
>> path check into...
>> (check_reduction_path): ...New function here.
>> * tree-vectorizer.h (check_reduction_path): New declaration.
>>
>> gcc/testsuite
>> 2017-12-07  Bin Cheng  
>> Richard Biener  
>>
>> PR tree-optimization/81303
>> * gcc.dg/tree-ssa/loop-interchange-1.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-1b.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-2.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-3.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-4.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-5.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-6.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-7.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-8.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-9.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-10.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-11.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-12.c: New test.
>> * gcc.dg/tree-ssa/loop-interchange-13.c: New test.
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index db43fc1..3297437 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1302,6 +1302,7 @@ OBJS = \
gimple-iterator.o \
gimple-fold.o \
gimple-laddress.o \
+   gimple-loop-interchange.o \
gimple-low.o \
gimple-pretty-print.o \
gimple-ssa-backprop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index ffcbf85..6b9e4ea 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1504,8 +1504,8 @@ Common Alias(floop-nest-optimize)
 Enable loop nest transforms.  Same as -floop-nest-optimize.
 
 floop-interchange
-Common Alias(floop-nest-optimize)
-Enable loop nest transforms.  Same as -floop-nest-optimize.
+Common Report Var(flag_loop_interchange) Optimization
+Enable loop interchange on trees.
 
 floop-block
 Common Alias(floop-nest-optimize)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index b8c8083..6a4e8aa 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7401,6 +7401,7 @@ by @option{-O2} and also turns on the following 
optimization flags:
 -ftree-loop-vectorize @gol
 -ftree-loop-distribution @gol
 -ftree-loop-distribute-patterns @gol
+-floop-interchange @gol
 -fsplit-paths @gol
 -ftree-slp-vectorize @gol
 -fvect-cost-model @gol
@@ -8500,12 +8501,10 @@ Perform loop optimizations on trees.  This flag is 
enabled by default
 at @option{-O} and higher.
 
 @item -ftree-loop-linear
-@itemx -floop-interchange
 @itemx 

Re: [PATCH][gimple-interchange] Final cleanup stuff

2017-12-05 Thread Bin.Cheng
On Tue, Dec 5, 2017 at 1:02 PM, Richard Biener  wrote:
>
> This is my final sweep through the code doing cleanup on-the-fly.
Hi,
Thanks very much for all your help!
>
> I think the code is ready to go now (after you committed your changes).
>
> What I'd eventually like to see is merge the two loop_cand objects
> into a single interchange_cand object, having two really confuses
> me when reading and trying to understand code.  But let's defer this
> for next stage1.
>
> A change that's still required is adjusting the graphite testcases
> to use -floop-nest-optimize (you promised to do that) and to enable
> the interchange pass at -O3 by default.
Yeah, there will be two separate patches, one adjusting the graphite
tests and the other enabling the pass at -O3 with miscellaneous test
changes.
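
For example, the graphite tests would change along these lines (an
illustrative hunk of mine, not the committed change):

  -/* { dg-options "-O2 -floop-interchange -fdump-tree-graphite-all" } */
  +/* { dg-options "-O2 -floop-nest-optimize -fdump-tree-graphite-all" } */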

>
> With that, can you prepare a patch that merges the changes on the
> branch to trunk and provide updated performance / statistics
> for SPEC (maybe also SPEC compile-time figures with/without the
> patch?  it's enough to look at the Elapsed Compile time numbers in
> the log file IMHO).
Will collect data for the final version.  Thanks again.

Thanks,
bin
>
> Thanks,
> Richard.
>
> 2017-12-05  Richard Biener  
>
> * gimple-loop-interchange.cc (AVG_LOOP_NITER): Remove.
> (loop_cand::supported_operations): Simplify.
> (loop_cand::analyze_iloop_reduction_var): Use m_exit.
> (loop_cand::analyze_oloop_reduction_var): Likewise.
> (loop_cand::analyze_lcssa_phis): Likewise.
> (find_deps_in_bb_for_stmt): Use gimple_seq_add_stmt_without_update.
> (loop_cand::undo_simple_reduction): Likewise, properly release
> virtual defs.
> (tree_loop_interchange::interchange_loops): Likewise.  Move code
> to innner loop here.
> (tree_loop_interchange::map_inductions_to_loop): Remove code moving
> code to inner loop.
> (insert_pos_at_inner_loop): Inline into single caller...
> (tree_loop_interchange::move_code_to_inner): ...here.  Properly
> release virtual defs.
> (proper_loop_form_for_interchange): Properly analyze/instantiate SCEV.
> (prepare_perfect_loop_nest): Do not explicitely allocate vectors.
>
> Index: gcc/gimple-loop-interchange.cc
> ===
> --- gcc/gimple-loop-interchange.cc  (revision 255414)
> +++ gcc/gimple-loop-interchange.cc  (working copy)
> @@ -81,8 +81,6 @@ along with GCC; see the file COPYING3.
>  #define MAX_NUM_STMT(PARAM_VALUE (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS))
>  /* Maximum number of data references in loop nest.  */
>  #define MAX_DATAREFS(PARAM_VALUE (PARAM_LOOP_MAX_DATAREFS_FOR_DATADEPS))
> -/* Default average number of loop iterations.  */
> -#define AVG_LOOP_NITER  (PARAM_VALUE (PARAM_AVG_LOOP_NITER))
>
>  /* Comparison ratio of access stride between inner/outer loops to be
> interchanged.  This is the minimum stride ratio for loop interchange
> @@ -105,7 +103,7 @@ typedef struct induction
>/* IV's base and step part of SCEV.  */
>tree base;
>tree step;
> -}* induction_p;
> +} *induction_p;
>
>  /* Enum type for loop reduction variable.  */
>  enum reduction_type
> @@ -136,7 +134,7 @@ typedef struct reduction
>   reference.  */
>tree fini_ref;
>enum reduction_type type;
> -}* reduction_p;
> +} *reduction_p;
>
>
>  /* Dump reduction RE.  */
> @@ -302,24 +300,17 @@ loop_cand::supported_operations (basic_b
>if (is_gimple_debug (stmt))
> continue;
>
> -  if (gimple_has_volatile_ops (stmt)
> - || gimple_has_side_effects (stmt))
> +  if (gimple_has_side_effects (stmt))
> return false;
>
>bb_num_stmts++;
> -  if (is_gimple_call (stmt))
> +  if (gcall *call = dyn_cast <gcall *> (stmt))
> {
> - int cflags = gimple_call_flags (stmt);
> - /* Only support const/pure calls.  */
> - if (!(cflags & (ECF_CONST | ECF_PURE)))
> -   return false;
> -
>   /* In basic block of outer loop, the call should be cheap since
>  it will be moved to inner loop.  */
>   if (iloop != NULL
> - && !gimple_inexpensive_call_p (as_a <gcall *> (stmt)))
> + && !gimple_inexpensive_call_p (call))
> return false;
> -
>   continue;
> }
>
> @@ -334,6 +325,7 @@ loop_cand::supported_operations (basic_b
>tree lhs;
>/* Support loop invariant memory reference if it's only used once by
>  inner loop.  */
> +  /* ???  How's this checking for invariantness?  */
>if (gimple_assign_single_p (stmt)
>   && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
>   && TREE_CODE (lhs) == SSA_NAME
> @@ -347,7 +339,7 @@ loop_cand::supported_operations (basic_b
>/* Allow PHI nodes in any basic block of inner loop, PHI nodes in outer
>   loop's header, or PHI nodes in dest bb of inner loop's exit edge.  

Re: [PATCH][gimple-interchange] Random cleanups

2017-12-04 Thread Bin.Cheng
On Mon, Dec 4, 2017 at 5:39 PM, Richard Biener <rguent...@suse.de> wrote:
> On December 4, 2017 5:01:45 PM GMT+01:00, "Bin.Cheng" <amker.ch...@gmail.com> 
> wrote:
>>On Mon, Dec 4, 2017 at 3:43 PM, Richard Biener <rguent...@suse.de>
>>wrote:
>>>
>>> When skimming through the code I noticed the following (chatted on
>>IRC
>>> about parts of the changes).
>>>
>>> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>>>
>>> Will commit tomorrow unless you beat me to that.
>>>
>>> Richard.
>>>
>>> 2017-12-04  Richard Biener  <rguent...@suse.de>
>>>
>>> * gimple-loop-interchange.cc
>>(loop_cand::classify_simple_reduction):
>>> Simplify.
>>> (loop_cand::analyze_iloop_reduction_var): Reject dead
>>reductions.
>>> (loop_cand::analyze_oloop_reduction_var): Likewise.
>>Simplify.
>>> (tree_loop_interchange::interchange_loops): Properly analyze
>>> scalar evolution before instantiating a SCEV.
>>>
>>> Index: gcc/gimple-loop-interchange.cc
>>> ===
>>> --- gcc/gimple-loop-interchange.cc  (revision 255383)
>>> +++ gcc/gimple-loop-interchange.cc  (working copy)
>>> @@ -444,50 +444,21 @@ loop_cand::classify_simple_reduction (re
>>>if (!bb || bb->loop_father != m_outer)
>>> return;
>>>
>>> -  if (!is_gimple_assign (producer))
>>> +  if (!gimple_assign_load_p (producer))
>>> return;
>>>
>>> -  code = gimple_assign_rhs_code (producer);
>>> -  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
>>> -   return;
>>> -
>>> -  lhs = gimple_assign_lhs (producer);
>>> -  if (lhs != re->init)
>>> -   return;
>>> -
>>> -  rhs = gimple_assign_rhs1 (producer);
>>> -  if (!REFERENCE_CLASS_P (rhs))
>>> -   return;
>>> -
>>> -  re->init_ref = rhs;
>>> +  re->init_ref = gimple_assign_rhs1 (producer);
>>>  }
>>>else if (!CONSTANT_CLASS_P (re->init))
>>>  return;
>>>
>>> -  /* Check how reduction variable is used.  Note usually reduction
>>variable
>>> - is used outside of its defining loop, we don't require that in
>>terms of
>>> - loop interchange.  */
>>> -  if (!re->lcssa_phi)
>>> -consumer = single_use_in_loop (re->next, m_loop);
>>> -  else
>>> -consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi),
>>m_outer);
>>> -
>>> -  if (!consumer || !is_gimple_assign (consumer))
>>> -return;
>>> -
>>> -  code = gimple_assign_rhs_code (consumer);
>>> -  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
>>> -return;
>>> -
>>> -  lhs = gimple_assign_lhs (consumer);
>>> -  if (!REFERENCE_CLASS_P (lhs))
>>> -return;
>>> -
>>> -  rhs = gimple_assign_rhs1 (consumer);
>>> -  if (rhs != PHI_RESULT (re->lcssa_phi))
>>> +  /* Check how reduction variable is used.  */
>>> +  consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi),
>>m_outer);
>>> +  if (!consumer
>>> +  || !gimple_store_p (consumer))
>>>  return;
>>>
>>> -  re->fini_ref = lhs;
>>> +  re->fini_ref = gimple_get_lhs (consumer);
>>>re->consumer = consumer;
>>>
>>>/* Simple reduction with constant initializer.  */
>>> @@ -608,6 +579,9 @@ loop_cand::analyze_iloop_reduction_var (
>>>else
>>> return false;
>>>  }
>>> +  if (!lcssa_phi)
>>> +return false;
>>> +
>>>re = XCNEW (struct reduction);
>>>re->var = var;
>>>re->init = init;
>>> @@ -681,15 +655,9 @@ loop_cand::analyze_oloop_reduction_var (
>>>
>>>/* Outer loop's reduction should only be used to initialize inner
>>loop's
>>>   simple reduction.  */
>>> -  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>>> -{
>>> -  stmt = USE_STMT (use_p);
>>> -  if (is_gimple_debug (stmt))
>>> -   continue;
>>> -
>>> -  if (stmt != inner_re->phi)
>>> -   return false;
>>> -}
>>> +  if (! single_imm_use (var, &use_p, &stmt)
>>> +  || 

Re: [PATCH][gimple-interchange] Random cleanups

2017-12-04 Thread Bin.Cheng
On Mon, Dec 4, 2017 at 3:43 PM, Richard Biener  wrote:
>
> When skimming through the code I noticed the following (chatted on IRC
> about parts of the changes).
>
> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>
> Will commit tomorrow unless you beat me to that.
>
> Richard.
>
> 2017-12-04  Richard Biener  
>
> * gimple-loop-interchange.cc (loop_cand::classify_simple_reduction):
> Simplify.
> (loop_cand::analyze_iloop_reduction_var): Reject dead reductions.
> (loop_cand::analyze_oloop_reduction_var): Likewise.  Simplify.
> (tree_loop_interchange::interchange_loops): Properly analyze
> scalar evolution before instantiating a SCEV.
>
> Index: gcc/gimple-loop-interchange.cc
> ===
> --- gcc/gimple-loop-interchange.cc  (revision 255383)
> +++ gcc/gimple-loop-interchange.cc  (working copy)
> @@ -444,50 +444,21 @@ loop_cand::classify_simple_reduction (re
>if (!bb || bb->loop_father != m_outer)
> return;
>
> -  if (!is_gimple_assign (producer))
> +  if (!gimple_assign_load_p (producer))
> return;
>
> -  code = gimple_assign_rhs_code (producer);
> -  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
> -   return;
> -
> -  lhs = gimple_assign_lhs (producer);
> -  if (lhs != re->init)
> -   return;
> -
> -  rhs = gimple_assign_rhs1 (producer);
> -  if (!REFERENCE_CLASS_P (rhs))
> -   return;
> -
> -  re->init_ref = rhs;
> +  re->init_ref = gimple_assign_rhs1 (producer);
>  }
>else if (!CONSTANT_CLASS_P (re->init))
>  return;
>
> -  /* Check how reduction variable is used.  Note usually reduction variable
> - is used outside of its defining loop, we don't require that in terms of
> - loop interchange.  */
> -  if (!re->lcssa_phi)
> -consumer = single_use_in_loop (re->next, m_loop);
> -  else
> -consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi), m_outer);
> -
> -  if (!consumer || !is_gimple_assign (consumer))
> -return;
> -
> -  code = gimple_assign_rhs_code (consumer);
> -  if (get_gimple_rhs_class (code) != GIMPLE_SINGLE_RHS)
> -return;
> -
> -  lhs = gimple_assign_lhs (consumer);
> -  if (!REFERENCE_CLASS_P (lhs))
> -return;
> -
> -  rhs = gimple_assign_rhs1 (consumer);
> -  if (rhs != PHI_RESULT (re->lcssa_phi))
> +  /* Check how reduction variable is used.  */
> +  consumer = single_use_in_loop (PHI_RESULT (re->lcssa_phi), m_outer);
> +  if (!consumer
> +  || !gimple_store_p (consumer))
>  return;
>
> -  re->fini_ref = lhs;
> +  re->fini_ref = gimple_get_lhs (consumer);
>re->consumer = consumer;
>
>/* Simple reduction with constant initializer.  */
> @@ -608,6 +579,9 @@ loop_cand::analyze_iloop_reduction_var (
>else
> return false;
>  }
> +  if (!lcssa_phi)
> +return false;
> +
>re = XCNEW (struct reduction);
>re->var = var;
>re->init = init;
> @@ -681,15 +655,9 @@ loop_cand::analyze_oloop_reduction_var (
>
>/* Outer loop's reduction should only be used to initialize inner loop's
>   simple reduction.  */
> -  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> -{
> -  stmt = USE_STMT (use_p);
> -  if (is_gimple_debug (stmt))
> -   continue;
> -
> -  if (stmt != inner_re->phi)
> -   return false;
> -}
> +  if (! single_imm_use (var, &use_p, &stmt)
> +  || stmt != inner_re->phi)
> +return false;
>
>/* Check this reduction is correctly used outside of loop via lcssa phi.  
> */
>FOR_EACH_IMM_USE_FAST (use_p, iterator, next)
> @@ -711,6 +679,8 @@ loop_cand::analyze_oloop_reduction_var (
>else
> return false;
>  }
> +  if (!lcssa_phi)
> +return false;
>
>re = XCNEW (struct reduction);
>re->var = var;
> @@ -1146,12 +1116,18 @@ tree_loop_interchange::interchange_loops
>edge instantiate_below = loop_preheader_edge (loop_nest);
>gsi = gsi_last_bb (loop_preheader_edge (loop_nest)->src);
>i_niters = number_of_latch_executions (iloop.m_loop);
> -  i_niters = instantiate_scev (instantiate_below, loop_nest, i_niters);
> +  i_niters = analyze_scalar_evolution (loop_outer (iloop.m_loop), i_niters);
> +  i_niters = instantiate_scev (instantiate_below, loop_outer (iloop.m_loop),
> +  i_niters);
>   i_niters = force_gimple_operand_gsi (&gsi, unshare_expr (i_niters), true,
>NULL_TREE, false, 
> GSI_CONTINUE_LINKING);
>o_niters = number_of_latch_executions (oloop.m_loop);
>if (oloop.m_loop != loop_nest)
> -o_niters = instantiate_scev (instantiate_below, loop_nest, o_niters);
> +{
> +  o_niters = analyze_scalar_evolution (loop_outer (oloop.m_loop), 
> o_niters);
> +  o_niters = instantiate_scev (instantiate_below, loop_outer 
> (oloop.m_loop),
> +  o_niters);
> +}
Hmm, sorry to disturb. 

Re: [PATCH][gimple-interchange] Add reduction validity check

2017-12-04 Thread Bin.Cheng
On Mon, Dec 4, 2017 at 1:11 PM, Richard Biener  wrote:
>
> I've noticed we perform FP reduction association without the required
> checks for associative math.  I've added
> gcc.dg/tree-ssa/loop-interchange-1b.c to cover this.
>
> I also noticed we happily interchange a loop with a reduction like
>
>  sum = a[i] - sum;
>
> where a change in order of elements isn't ok.  Unfortunately bwaves
> is exactly a case where single_use != next_def (tried to simply remove
> that case for now), because reassoc didn't have a chance to fix the
> operand order.  Thus this patch exports the relevant handling from
> the vectorizer (for stage1 having a separate infrastructure gathering /
> analyzing of reduction/induction infrastructure would be nice...)
> and uses it from interchange.  We then don't handle
> gcc.dg/tree-ssa/loop-interchange-4.c anymore (similar vectorizer
> missed-opt is PR65930).  I didn't bother to split up the vectorizer
> code further to implement relaxed validity checking but simply XFAILed
> this testcase.
>
> Earlier I simplified allocation stuff in the main loop which is why
> this part is included in this patch.
>
> Bootstrap running on x86_64-unknown-linux-gnu.
>
> I'll see to craft a testcase with the sum = a[i] - sum; mis-handling.
>
> Ok?
>
> Thanks,
> Richard.
>
> 2017-12-04  Richard Biener  
>
> * tree-vectorizer.h (check_reduction_path): Declare.
> * tree-vect-loop.c (check_reduction_path): New function, split out
> from ...
> (vect_is_simple_reduction): ... here.
> * gimple-loop-interchange.cc: Include tree-vectorizer.h.
> (loop_cand::analyze_iloop_reduction_var): Use single_imm_use.
> Properly check for a supported reduction operation and a
> valid expression if the reduction covers multiple stmts.
> (prepare_perfect_loop_nest): Simpify allocation.
> (pass_linterchange::execute): Likewise.
>
> * gcc.dg/tree-ssa/loop-interchange-1.c: Add fast-math flags.
> * gcc.dg/tree-ssa/loop-interchange-1b.c: New test variant.
> * gcc.dg/tree-ssa/loop-interchange-4.c: XFAIL.
>
>
> Index: gcc/gimple-loop-interchange.cc
> ===
> --- gcc/gimple-loop-interchange.cc  (revision 255375)
> +++ gcc/gimple-loop-interchange.cc  (working copy)
> @@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssa-loop-ivopts.h"
>  #include "tree-ssa-dce.h"
>  #include "tree-data-ref.h"
> +#include "tree-vectorizer.h"
>
>  /* This pass performs loop interchange: for example, the loop nest
>
> @@ -551,23 +552,29 @@ loop_cand::analyze_iloop_reduction_var (
>   in a way that reduction operation is seen as black box.  In general,
>   we can ignore reassociation of reduction operator; we can handle fake
>   reductions in which VAR is not even used to compute NEXT.  */
> -  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> -{
> -  stmt = USE_STMT (use_p);
> -  if (is_gimple_debug (stmt))
> -   continue;
> -
> -  if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt)))
> -   return false;
> -
> -  if (single_use != NULL)
> -   return false;
> +  if (! single_imm_use (var, &use_p, &single_use)
> +  || ! flow_bb_inside_loop_p (m_loop, gimple_bb (single_use)))
> +return false;
>
> -  single_use = stmt;
> -}
> +  /* Check the reduction operation.  We require a commutative or
> + left-associative operation.  For FP math we also need to be allowed
> + to associate operations.  */
> +  if (! is_gimple_assign (single_use)
> +  || ! (commutative_tree_code (gimple_assign_rhs_code (single_use))
> +   || (commutative_ternary_tree_code
> + (gimple_assign_rhs_code (single_use))
> +   && (use_p->use == gimple_assign_rhs1_ptr (single_use)
> +   || use_p->use == gimple_assign_rhs2_ptr (single_use)))
> +   || (gimple_assign_rhs_code (single_use) == MINUS_EXPR
> +   && use_p->use == gimple_assign_rhs1_ptr (single_use)))
> +  || (FLOAT_TYPE_P (TREE_TYPE (var))
> + && ! flag_associative_math))
> +return false;
>
> +  /* Handle and verify a series of stmts feeding the reduction op.  */
>if (single_use != next_def
> -  && !stmt_dominates_stmt_p (single_use, next_def))
> +  && !check_reduction_path (UNKNOWN_LOCATION, m_loop, phi, next,
> +   gimple_assign_rhs_code (single_use)))
>  return false;
>
>/* Only support cases in which INIT is used in inner loop.  */
> @@ -1964,7 +1971,7 @@ prepare_perfect_loop_nest (struct loop *
>   vec<data_reference_p> *datarefs, vec<ddr_p> *ddrs)
>  {
>struct loop *start_loop = NULL, *innermost = loop;
> -  struct loop *outermost = superloop_at_depth (loop, 0);
> +  struct loop *outermost = loops_for_fn (cfun)->tree_root;
>
>/* Find loop nest from the innermost loop.  The outermost is the innermost

Re: [PATCH][gimple-interchange] Add reduction validity check

2017-12-04 Thread Bin.Cheng
On Mon, Dec 4, 2017 at 1:11 PM, Richard Biener  wrote:
>
> I've noticed we perform FP reduction association without the required
> checks for associative math.  I've added
> gcc.dg/tree-ssa/loop-interchange-1b.c to cover this.
>
> I also noticed we happily interchange a loop with a reduction like
>
>  sum = a[i] - sum;
>
> where a change in order of elements isn't ok.  Unfortunately bwaves
> is exactly a case where single_use != next_def (tried to simply remove
> that case for now), because reassoc didn't have a chance to fix the
> operand order.  Thus this patch exports the relevant handling from
> the vectorizer (for stage1 having a separate infrastructure gathering /
> analyzing of reduction/induction infrastructure would be nice...)
> and uses it from interchange.  We then don't handle
> gcc.dg/tree-ssa/loop-interchange-4.c anymore (similar vectorizer
> missed-opt is PR65930).  I didn't bother to split up the vectorizer
> code further to implement relaxed validity checking but simply XFAILed
> this testcase.
>
> Earlier I simplified allocation stuff in the main loop which is why
> this part is included in this patch.
>
> Bootstrap running on x86_64-unknown-linux-gnu.
>
> I'll see to craft a testcase with the sum = a[i] - sum; mis-handling.
>
> Ok?
Sure.
Just for the record, there is also a similar associativity check in
predcom.  As you suggested, a path extraction/checking interface for
associativity checking would be great, given we have multiple users now.
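
To make the record concrete, here is a small standalone example of why
such a reduction must keep its order (my own illustration, not from the
patch):

  #include <stdio.h>

  int
  main (void)
  {
    int a[4] = { 1, 2, 3, 4 };
    int sum = 0, rsum = 0;

    /* Left to right: a[3] - a[2] + a[1] - a[0] = 2.  */
    for (int i = 0; i < 4; i++)
      sum = a[i] - sum;

    /* Reversed: a[0] - a[1] + a[2] - a[3] = -2, so reordering the
       iterations changes the result even in exact arithmetic.  */
    for (int i = 3; i >= 0; i--)
      rsum = a[i] - rsum;

    printf ("%d %d\n", sum, rsum);  /* prints "2 -2" */
    return 0;
  }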

Thanks,
bin
>
> Thanks,
> Richard.
>
> 2017-12-04  Richard Biener  
>
> * tree-vectorizer.h (check_reduction_path): Declare.
> * tree-vect-loop.c (check_reduction_path): New function, split out
> from ...
> (vect_is_simple_reduction): ... here.
> * gimple-loop-interchange.cc: Include tree-vectorizer.h.
> (loop_cand::analyze_iloop_reduction_var): Use single_imm_use.
> Properly check for a supported reduction operation and a
> valid expression if the reduction covers multiple stmts.
> (prepare_perfect_loop_nest): Simpify allocation.
> (pass_linterchange::execute): Likewise.
>
> * gcc.dg/tree-ssa/loop-interchange-1.c: Add fast-math flags.
> * gcc.dg/tree-ssa/loop-interchange-1b.c: New test variant.
> * gcc.dg/tree-ssa/loop-interchange-4.c: XFAIL.
>
>
> Index: gcc/gimple-loop-interchange.cc
> ===
> --- gcc/gimple-loop-interchange.cc  (revision 255375)
> +++ gcc/gimple-loop-interchange.cc  (working copy)
> @@ -41,6 +41,7 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssa-loop-ivopts.h"
>  #include "tree-ssa-dce.h"
>  #include "tree-data-ref.h"
> +#include "tree-vectorizer.h"
>
>  /* This pass performs loop interchange: for example, the loop nest
>
> @@ -551,23 +552,29 @@ loop_cand::analyze_iloop_reduction_var (
>   in a way that reduction operation is seen as black box.  In general,
>   we can ignore reassociation of reduction operator; we can handle fake
>   reductions in which VAR is not even used to compute NEXT.  */
> -  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
> -{
> -  stmt = USE_STMT (use_p);
> -  if (is_gimple_debug (stmt))
> -   continue;
> -
> -  if (!flow_bb_inside_loop_p (m_loop, gimple_bb (stmt)))
> -   return false;
> -
> -  if (single_use != NULL)
> -   return false;
> +  if (! single_imm_use (var, &use_p, &single_use)
> +  || ! flow_bb_inside_loop_p (m_loop, gimple_bb (single_use)))
> +return false;
>
> -  single_use = stmt;
> -}
> +  /* Check the reduction operation.  We require a commutative or
> + left-associative operation.  For FP math we also need to be allowed
> + to associate operations.  */
> +  if (! is_gimple_assign (single_use)
> +  || ! (commutative_tree_code (gimple_assign_rhs_code (single_use))
> +   || (commutative_ternary_tree_code
> + (gimple_assign_rhs_code (single_use))
> +   && (use_p->use == gimple_assign_rhs1_ptr (single_use)
> +   || use_p->use == gimple_assign_rhs2_ptr (single_use)))
> +   || (gimple_assign_rhs_code (single_use) == MINUS_EXPR
> +   && use_p->use == gimple_assign_rhs1_ptr (single_use)))
> +  || (FLOAT_TYPE_P (TREE_TYPE (var))
> + && ! flag_associative_math))
> +return false;
>
> +  /* Handle and verify a series of stmts feeding the reduction op.  */
>if (single_use != next_def
> -  && !stmt_dominates_stmt_p (single_use, next_def))
> +  && !check_reduction_path (UNKNOWN_LOCATION, m_loop, phi, next,
> +   gimple_assign_rhs_code (single_use)))
>  return false;
>
>/* Only support cases in which INIT is used in inner loop.  */
> @@ -1964,7 +1971,7 @@ prepare_perfect_loop_nest (struct loop *
>   vec<data_reference_p> *datarefs, vec<ddr_p> *ddrs)
>  {
>struct loop *start_loop = 

Re: [PATCH][gimple-linterchange] Rewrite compute_access_stride

2017-12-01 Thread Bin.Cheng
On Fri, Dec 1, 2017 at 2:26 PM, Richard Biener <rguent...@suse.de> wrote:
> On Fri, 1 Dec 2017, Bin.Cheng wrote:
>
>> On Fri, Dec 1, 2017 at 12:31 PM, Richard Biener <rguent...@suse.de> wrote:
>> >
>> > This is the access stride computation change.  Apart from the
>> > stride extraction I adjusted the cost model to handle non-constant
>> > strides by checking if either is a multiple of the other and
>> > simply fail interchanging if it's the wrong way around for one
>> > ref or if the simple method using multiple_of_p fails to determine
>> > either case.
>> >
>> > This still handles the bwaves case.
>> >
>> > I think we want additional testcases with variable strides for each
>> > case we add - I believe this is the most conservative way to treat
>> > variable strides.
>> >
>> > It may be inconsistent with the constant stride handling where you
>> > allow for many OK DRs to outweight a few not OK DRs, but as it
>> > worked for bwaves it must be good enough ;)
>> >
>> > Tested on x86_64-unknown-linux-gnu (just the interchange testcases).
>> >
>> > Currently running a bootstrap with -O3 -g -floop-interchange.
>> >
>> > Ok for the branch?
>> Ok.  This actually is closer to the motivation: a simple/conservative
>> cost model that only transforms code when it's known to be good.
>> I will check the impact on the number of interchanges in SPEC.
>
> Few high-level observations.
>
> In tree_loop_interchange::interchange we try interchanging adjacent
> loops, starting from innermost with outer of innermost.  This way
> the innermost loop will bubble up as much as possible.  But we
> don't seem to handle bubbling multiple loops like for
>
>  for (i=0; i<n; ++i)
>   for (j=0; j<n; ++j)
> for (k=0; k<n; ++k)
>   a[j][k][i] = 1.;
>
> because the innermost two loops are correctly ordered so we then
> try interchanging the k and the i loop which succeeds but then
> we stop.  So there's something wrong in the iteration scheme.
> I would have expected it to be quadratic, basically iterating
> the ::interchange loop until we didn't perform any interchange
> (or doing sth more clever).
Yes, I restricted it to be a single-pass process over the loop nest.
Ideally we could create a vector of loop_cand for the whole loop nest,
then sort/permute loops wrt the computed strides.
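
Something like the following would implement that quadratic bubbling
scheme (a sketch assuming a cand vector ordered outermost first; not the
pass's actual code):

  /* Keep sweeping over adjacent pairs until no interchange happens,
     so the innermost loop can bubble up more than one level.  */
  bool changed;
  do
    {
      changed = false;
      for (unsigned i = cand.length () - 1; i > 0; --i)
        if (should_interchange_loops (cand[i], cand[i - 1]))
          {
            interchange_loops (cand[i], cand[i - 1]);
            std::swap (cand[i], cand[i - 1]);
            changed = true;
          }
    }
  while (changed);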

>
> loop_cand::can_interchange_p seems to perform per BB checks
> (supported_operations, num_stmts) that with the way we interchange
> should disallow any such BB in a loop that we interchange or
> interchange across.  That means it looks like sth we should
> perform early, like during data dependence gathering by for
> example inlining find_data_references_in_bb and doing those
> per-stmt checks there?
Yes.  The only problem is the check on the reduction.  We can build up
all loop_cand earlier, or simply move non-reduction checks earlier.

>
> In prepare_perfect_loop_nest we seem to be somewhat quadratic
> in the way we re-compute dependences if doing so failed
> (we also always just strip the outermost loop while the failing
> DDR could involve a DR that is in an inner loop).  I think it
> should be possible to re-structure this computing dependences
> from inner loop body to outer loop bodies (the ddrs vector
> is, as opposed to the dr vector, unsorted I think).
Even more, we would want an interface in tree-data-ref.c so that data
dependence can be computed/assembled level by level in the loop nest.
Loop distribution could benefit as well.

> I haven't fully thought this out yet though - a similar
> iteration scheme could improve DR gathering though that's not
> so costly.
I can change this one now.

>
> Overall we should try improving on function names, we have
> valid_data_dependences, can_interchange_loops, should_interchange_loops,
> can_interchange_p which all are related but do slightly different
> things.  My usual approach is to inline all single-use functions
> to improve things (and make program flow more visible).  But
> I guess that's too much implementation detail.
>
> Didn't get to the IV re-mapping stuff yet but you (of course)
> end up with quite some dead IVs when iterating the interchange.
> You seem to add a new canonical IV just to avoid rewriting the
> existing exit test, right?  Defering that to the "final"
> interchange on a nest should avoid those dead IVs.
Hmm, with the help of the newly created dce interface, all dead IVs/REs
will be deleted after this pass.  Note for the current implementation,
dead code can be generated from the mapped IV, the canonical IV and the
reduction.  Deferring adding the canonical IV does not look practical
wrt the current level-by-level interchange, because the inner loop's IV
is needed for interchange?

>
> Will now leave for the weekend.
Have a nice WE!

Thanks,
bin
>
> Thanks,
> Richard.


Re: [PATCH][gimple-linterchange] Rewrite compute_access_stride

2017-12-01 Thread Bin.Cheng
On Fri, Dec 1, 2017 at 12:31 PM, Richard Biener  wrote:
>
> This is the access stride computation change.  Apart from the
> stride extraction I adjusted the cost model to handle non-constant
> strides by checking if either is a multiple of the other and
> simply fail interchanging if it's the wrong way around for one
> ref or if the simple method using multiple_of_p fails to determine
> either case.
>
> This still handles the bwaves case.
>
> I think we want additional testcases with variable strides for each
> case we add - I believe this is the most conservative way to treat
> variable strides.
>
> It may be inconsistent with the constant stride handling where you
> allow for many OK DRs to outweight a few not OK DRs, but as it
> worked for bwaves it must be good enough ;)
>
> Tested on x86_64-unknown-linux-gnu (just the interchange testcases).
>
> Currently running a bootstrap with -O3 -g -floop-interchange.
>
> Ok for the branch?
Ok.  This actually is closer to the motivation: a simple/conservative
cost model that only transforms code when it's known to be good.
I will check the impact on the number of interchanges in SPEC.
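
As a concrete example of the variable-stride case this is conservative
about (my own illustration, not from the testsuite):

  void
  f (double *a, int n)
  {
    /* For a[j * n + i] the inner stride is n*8 (symbolic) while the
       outer stride is 8; multiple_of_p can still prove n*8 is a
       multiple of 8, so interchange is known to be profitable even
       without a constant n.  */
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        a[j * n + i] = 0.0;
  }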

Thanks,
bin
>
> Richard.
>
> 2017-12-01  Richard Biener  
>
> * gimple-loop-interchange.cc (estimate_val_by_simplify_replace):
> Remove.
> (compute_access_stride): Rewrite using instantiate_scev,
> remove constant substitution.
> (should_interchange_loops): Adjust for non-constant strides.
>
> Index: gcc/gimple-loop-interchange.cc
> ===
> --- gcc/gimple-loop-interchange.cc  (revision 255303)
> +++ gcc/gimple-loop-interchange.cc  (working copy)
> @@ -1325,42 +1325,6 @@ tree_loop_interchange::move_code_to_inne
>  }
>  }
>
> -/* Estimate and return the value of EXPR by replacing variables in EXPR
> -   with CST_TREE and simplifying.  */
> -
> -static tree
> -estimate_val_by_simplify_replace (tree expr, tree cst_tree)
> -{
> -  unsigned i, n;
> -  tree ret = NULL_TREE, e, se;
> -
> -  if (!expr)
> -return NULL_TREE;
> -
> -  /* Do not bother to replace constants.  */
> -  if (CONSTANT_CLASS_P (expr))
> -return expr;
> -
> -  if (!EXPR_P (expr))
> -return cst_tree;
> -
> -  n = TREE_OPERAND_LENGTH (expr);
> -  for (i = 0; i < n; i++)
> -{
> -  e = TREE_OPERAND (expr, i);
> -  se = estimate_val_by_simplify_replace (e, cst_tree);
> -  if (e == se)
> -   continue;
> -
> -  if (!ret)
> -   ret = copy_node (expr);
> -
> -  TREE_OPERAND (ret, i) = se;
> -}
> -
> -  return (ret ? fold (ret) : expr);
> -}
> -
>  /* Given data reference DR in LOOP_NEST, the function computes DR's access
> stride at each level of loop from innermost LOOP to outer.  On success,
> it saves access stride at each level loop in a vector which is pointed
> @@ -1388,44 +1352,31 @@ compute_access_stride (struct loop *loop
>
>tree ref = DR_REF (dr);
>tree scev_base = build_fold_addr_expr (ref);
> -  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
> -  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
> -  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
> -
> -  do {
> -tree scev_fn = analyze_scalar_evolution (loop, scev_base);
> -if (chrec_contains_undetermined (scev_fn)
> -   || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
> -  break;
> -
> -if (TREE_CODE (scev_fn) != POLYNOMIAL_CHREC)
> -  {
> -   scev_base = scev_fn;
> -   strides->safe_push (build_int_cst (sizetype, 0));
> -   continue;
> -  }
> -
> -scev_base = CHREC_LEFT (scev_fn);
> -if (tree_contains_chrecs (scev_base, NULL))
> -  break;
> -
> -tree scev_step = fold_convert (sizetype, CHREC_RIGHT (scev_fn));
> -
> -enum ev_direction scev_dir = scev_direction (scev_fn);
> -/* Estimate if step isn't constant.  */
> -if (scev_dir == EV_DIR_UNKNOWN)
> -  {
> -   scev_step = estimate_val_by_simplify_replace (scev_step, niters);
> -   if (TREE_CODE (scev_step) != INTEGER_CST
> -   || tree_int_cst_lt (scev_step, access_size))
> - scev_step = access_size;
> -  }
> -/* Compute absolute value of scev step.  */
> -else if (scev_dir == EV_DIR_DECREASES)
> -  scev_step = fold_build1 (NEGATE_EXPR, sizetype, scev_step);
> -
> -strides->safe_push (scev_step);
> -  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
> +  tree scev = analyze_scalar_evolution (loop, scev_base);
> +  scev = instantiate_scev (loop_preheader_edge (loop_nest), loop, scev);
> +  if (! chrec_contains_undetermined (scev))
> +{
> +  tree sl = scev;
> +  struct loop *expected = loop;
> +  while (TREE_CODE (sl) == POLYNOMIAL_CHREC)
> +   {
> + struct loop *sl_loop = get_chrec_loop (sl);
> + while (sl_loop != expected)
> +   {
> + strides->safe_push (size_int (0));
> +  

Re: [PATCH GCC][V2]A simple implementation of loop interchange

2017-11-30 Thread Bin.Cheng
On Thu, Nov 30, 2017 at 3:51 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Nov 30, 2017 at 4:09 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Thu, Nov 30, 2017 at 3:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>
>> Ok, I'd like to "dumb" the pass down to the level we can solve the
>> bwave case (which I realize is already one of the more complicated ones).
>>
>> Just because it's already late for GCC 8.
>
> For reference I'll commit the following tomorrow, will play with adding
> a testcase for bwaves and doing the multiple_of_p thing we talked about.
Given an instantiated scev in the parameterized case like:
{{{p_19(D), +, 8}_1, +, (long unsigned int) n_16(D) * 8}_2, +, (long
unsigned int) (n_16(D) * n_16(D)) * 8}_3
it's ideal if we can relate the variable part of the stride to the loop
niters.  Unfortunately that's impractical because niter computation and
address computation may expand variables differently, as in the case of
bwaves.
This leaves us with checking the multiple relation between the strides
themselves:
(long unsigned int) (n_16(D) * n_16(D)) * 8  ;;stride X
 vs.
(long unsigned int) n_16(D) * 8 ;;stride Y
 vs.
8;;stride Z
From inner loop to outer, we check if Y is a multiple of Z and if X is
a multiple of Y, so the computed strides are like:
 vs.
8 * AVG_LOOP_NITERS
 vs.
8

To make it general, we also need to check if X is a multiple of the
previous stride (Z, in this case) if the check on Y fails.
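
A sketch of that innermost-to-outermost check (my reconstruction, not
the committed code; STRIDES is ordered innermost first, and
multiple_of_p is the fold-const.c predicate):

  static bool
  stride_multiples_p (vec<tree> *strides)
  {
    tree inner = (*strides)[0];
    for (unsigned i = 1; i < strides->length (); ++i)
      {
        /* Prefer the adjacent inner stride; fall back to the
           innermost stride if that check fails.  */
        if (!multiple_of_p (sizetype, (*strides)[i], (*strides)[i - 1])
            && !multiple_of_p (sizetype, (*strides)[i], inner))
          return false;
      }
    return true;
  }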

The multiple check on a tree expr is weak, so one question is what to
do if it fails.  Give up, or use a placeholder constant stride?

Thanks,
bin
>
> Richard.


Re: [PATCH GCC][V2]A simple implementation of loop interchange

2017-11-30 Thread Bin.Cheng
On Thu, Nov 30, 2017 at 1:01 PM, Richard Biener
 wrote:
> On Tue, Nov 28, 2017 at 4:26 PM, Bin Cheng  wrote:
>> Hi,
>> This is updated patch with review comments resolved.  Some explanation 
>> embedded below.
>>
>>> +
>>> +  iloop->nb_iterations = nb_iterations;
>>>
>>> use std::swap?  Also I think if you can keep nb_iterations you
>>> can certainly keep the upper bounds.  You're probably
>>> afraid of the ->stmt references in the nb_iter_bound entries?
>>>
>>> Anyway, either scrap everything or try to keep everything.
>> Yeah, not only the stmts, but also the control_iv information because the 
>> SCEV
>> information may be corrupted during code transformation.
>> Now I discarded all the information.
>
> Note that given you interchange the loops but not the CFG or the loop 
> structures
> you might want to swap loop->num and flags like ->force_vectorize.  That is,
> essentially change the ->header/latch association (and other CFG related stuff
> like recorded exits).
>
> It might also be we want to / need to disable interchange for, say,
> ->force_vectorize
> inner loops or ->unroll != 0?  Or we need to clear them, maybe
> optionally diagnosing
> that fact.
>
> At least we need to think about what it means to preserve loop
> structure (semantically,
> loop->num should maintain association to the same source-level loop
> throughout the
> compilation) for transforms like interchange.
>
>>>
>>> +  for (i = 0; oloop.reductions.iterate (i, &re); ++i)
>>> +{
>>> +  if (re->type != DOUBLE_RTYPE)
>>> +   gcc_unreachable ();
>>> +
>>> +  use_operand_p use_p;
>>> +  imm_use_iterator iterator;
>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, re->var)
>>> +   mark_or_remove_dbg_stmt (USE_STMT (use_p), re->var);
>>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, re->next)
>>> +   mark_or_remove_dbg_stmt (USE_STMT (use_p), re->next);
>>> +  if (TREE_CODE (re->init) == SSA_NAME)
>>> +   {
>>> + FOR_EACH_IMM_USE_FAST (use_p, iterator, re->init)
>>> +   mark_or_remove_dbg_stmt (USE_STMT (use_p), re->init);
>>> +   }
>>>
>>> can you add a comment what you are doing here?
>>>
>>> Note that other loop opts simply scrap all debug stmts ...
>> As mentioned above, updated patch doesn't try hard to maintain debug use 
>> info any more.
>>
>>>
>>> +static void
>>> +compute_access_stride (struct loop *loop_nest, struct loop *loop,
>>> +  data_reference_p dr)
>>> +{
>>> ...
>>> +  tree ref = DR_REF (dr);
>>> +  tree scev_base = build_fold_addr_expr (ref);
>>> +  tree access_size = TYPE_SIZE_UNIT (TREE_TYPE (ref));
>>> +  tree niters = build_int_cst (sizetype, AVG_LOOP_NITER);
>>> +  access_size = fold_build2 (MULT_EXPR, sizetype, niters, access_size);
>>> +
>>> +  do {
>>> +tree scev_fn = analyze_scalar_evolution (loop, scev_base);
>>> +if (chrec_contains_undetermined (scev_fn)
>>> +   || chrec_contains_symbols_defined_in_loop (scev_fn, loop->num))
>>> +  break;
>>> ...
>>> +strides->safe_push (scev_step);
>>> +  } while (loop != loop_nest && (loop = loop_outer (loop)) != NULL);
>>> +
>>>
>>> I _think_ you want to do
>>>
>>>scev_fn = analyze_scalar_evolution (loop, scev_base); // assuming
>>> DR_STMT (dr) is in loop
>>>scev_fn = instantiate_parameters (nest, scev_fn);
>>>if (chrec_contains_undetermined (scev_fn))
>>>  return; // false?
>>>
>>> and analyze the result which should be of the form
>>>
>>>   { { { init, +, step1 }_1, +, step2 }_2, + , step3 }_3 ...
>>>
>>> if canonical.  I think estimate_val_by_simplify_replace isn't needed
>>> if you do that
>>> (it also looks odd to replace all vairables in step by niter...).
>> I replied on this in previous message, instantiate_parameters doesn't always
>> give canonical form result as expected.  The loop here could be seen as a
>> local instantiate process, right?
>
> Kind of.  I'll see if I can reproduce the difference with any of your
> intercahnge
> testcases - any hint which one to look at?
For added tests, I think there will be no difference between the two.
I noticed the difference for
pointer cases like:
for (i...)
  for (j...)
    for (k...)
      p[i*n*n + j*n + k] = ...

>
>> Also estimate_val_by_simplify_replace is needed for pointers, where
>> strides are computed from niters of loops which may not be
>> compile-time constants.
>> But yes, it's an odd fixup after I failed to do anything better.
>
> But you are for example computing _1 - _2 to zero, right?  Because both _1
> and _2 are not constant and thus you replace it with the same (symbolical)
> constant 'niter'.
>
> I think that asks for garbage-in-garbage-out ...
>
> Which testcase is this important for so I can have a look?
So far this is only for the above pointer case.  Actually I don't
think it's that important, and thought about skipping it.
So we don't have to do estimate_val_by_simplify_replace.

Thanks,
bin
>
>>>
>>> I think keeping the chrec in the 

Re: [PATCH GCC]Rename and make remove_dead_inserted_code a simple dce interface

2017-11-29 Thread Bin.Cheng
On Wed, Nov 29, 2017 at 10:02 AM, Richard Biener
 wrote:
> On Tue, Nov 28, 2017 at 3:48 PM, Bin Cheng  wrote:
>> Hi,
>> This patch renames remove_dead_inserted_code to simple_dce_from_worklist,
>> moves it to tree-ssa-dce.c and makes it a simple public DCE interface.
>> Bootstrap and test along with loop interchange.  It's required for the
>> interchange pass.  Is it OK?
>
> +  /* ???  Re-use seeds as worklist not only as initial set.  This may end up
> + removing more code as well.  If we keep seeds unchanged we could 
> restrict
> + new worklist elements to members of seed.  */
>
> Please remove this comment, while it applies to PRE when one takes
> remove_dead_inserted_code
> literally it doesn't apply to a seeded DCE.
>
> Please also rename 'seeds' to 'worklist' directly and document that
> worklist is consumed by the function.
> The function has linear complexity in the number of dead stmts, the
> constant factor is the number of
> SSA use operands in those stmts (so 2 on average I'd say).
>
> Ok with that change.
Updated, will commit new patch as attached.
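
As a usage sketch for reference (a hypothetical caller; the SSA names
seeded here are invented):

  /* Seed the worklist with SSA names whose definitions may have become
     dead, e.g. after interchange, and let the helper remove dead code
     transitively.  */
  auto_bitmap worklist;
  bitmap_set_bit (worklist, SSA_NAME_VERSION (old_canonical_iv));
  bitmap_set_bit (worklist, SSA_NAME_VERSION (old_reduction_var));
  simple_dce_from_worklist (worklist);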

Thanks,
bin

>
> Thanks,
> Richard.
>
>> BTW, I will push this along with interchange to branch: 
>> gcc.gnu.org/svn/gcc/branches/gimple-linterchange.
>>
>> Thanks,
>> bin
>> 2017-11-27  Bin Cheng  
>>
>> * tree-ssa-dce.c (simple_dce_from_worklist): Move and rename from
>> tree-ssa-pre.c::remove_dead_inserted_code.
>> * tree-ssa-dce.h: New file.
>> * tree-ssa-pre.c (tree-ssa-dce.h): Include new header file.
>> (remove_dead_inserted_code): Move and rename to function
>> tree-ssa-dce.c::simple_dce_from_worklist.
>> (pass_pre::execute): Update use.
From 219f42625b89eb81e2beb6605c9d594e83ed5048 Mon Sep 17 00:00:00 2001
From: amker 
Date: Sun, 26 Nov 2017 20:56:19 +0800
Subject: [PATCH 01/41] simple-dce-interface

---
 gcc/tree-ssa-dce.c | 52 ++
 gcc/tree-ssa-dce.h | 22 ++
 gcc/tree-ssa-pre.c | 67 +++---
 3 files changed, 82 insertions(+), 59 deletions(-)
 create mode 100644 gcc/tree-ssa-dce.h

diff --git a/gcc/tree-ssa-dce.c b/gcc/tree-ssa-dce.c
index a5f0edf..8595dec 100644
--- a/gcc/tree-ssa-dce.c
+++ b/gcc/tree-ssa-dce.c
@@ -1723,3 +1723,55 @@ make_pass_cd_dce (gcc::context *ctxt)
 {
   return new pass_cd_dce (ctxt);
 }
+
+
+/* A cheap DCE interface.  WORKLIST is a list of possibly dead stmts and
+   is consumed by this function.  The function has linear complexity in
+   the number of dead stmts, with a constant factor proportional to the
+   average number of SSA use operands per stmt.  */
+
+void
+simple_dce_from_worklist (bitmap worklist)
+{
+  while (! bitmap_empty_p (worklist))
+{
+  /* Pop item.  */
+  unsigned i = bitmap_first_set_bit (worklist);
+  bitmap_clear_bit (worklist, i);
+
+  tree def = ssa_name (i);
+  /* Removed by somebody else or still in use.  */
+  if (! def || ! has_zero_uses (def))
+	continue;
+
+  gimple *t = SSA_NAME_DEF_STMT (def);
+  if (gimple_has_side_effects (t))
+	continue;
+
+  /* Add uses to the worklist.  */
+  ssa_op_iter iter;
+  use_operand_p use_p;
+  FOR_EACH_PHI_OR_STMT_USE (use_p, t, iter, SSA_OP_USE)
+	{
+	  tree use = USE_FROM_PTR (use_p);
+	  if (TREE_CODE (use) == SSA_NAME
+	  && ! SSA_NAME_IS_DEFAULT_DEF (use))
+	bitmap_set_bit (worklist, SSA_NAME_VERSION (use));
+	}
+
+  /* Remove stmt.  */
+  if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Removing dead stmt:");
+	  print_gimple_stmt (dump_file, t, 0);
+	}
+  gimple_stmt_iterator gsi = gsi_for_stmt (t);
+  if (gimple_code (t) == GIMPLE_PHI)
+	remove_phi_node (&gsi, true);
+  else
+	{
+	  gsi_remove (&gsi, true);
+	  release_defs (t);
+	}
+}
+}
diff --git a/gcc/tree-ssa-dce.h b/gcc/tree-ssa-dce.h
new file mode 100644
index 000..2adb086
--- /dev/null
+++ b/gcc/tree-ssa-dce.h
@@ -0,0 +1,22 @@
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef TREE_SSA_DCE_H
+#define TREE_SSA_DCE_H
+extern void simple_dce_from_worklist (bitmap);
+#endif
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 281f100..c19d486 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ 
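For reference, a minimal sketch of how a pass could drive the new helper
(the seeding SSA name is hypothetical; note the helper consumes the bitmap):

  bitmap worklist = BITMAP_ALLOC (NULL);
  /* Seed with SSA names whose defining stmts may have become dead.  */
  bitmap_set_bit (worklist, SSA_NAME_VERSION (maybe_dead_name));
  simple_dce_from_worklist (worklist);
  BITMAP_FREE (worklist);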

Re: [PATCH GCC][V2]A simple implementation of loop interchange

2017-11-28 Thread Bin.Cheng
On Tue, Nov 28, 2017 at 4:00 PM, David Malcolm <dmalc...@redhat.com> wrote:
> On Tue, 2017-11-28 at 15:26 +, Bin Cheng wrote:
>> Hi,
>> This is updated patch with review comments resolved.  Some
>> explanation embedded below.
>>
>> On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener <richard.guenther@gma
>> il.com> wrote:
>> > On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com>
>> > wrote:
>> > > On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de>
>> > > wrote:
>> > > > Hello,
>> > > >
>> > > > On Fri, 22 Sep 2017, Bin.Cheng wrote:
>> > > >
>> > > > > This is updated patch for loop interchange with review
>> > > > > suggestions
>> > > > > resolved.  Changes are:
>> > > > >   1) It does more light weight checks like rectangle loop
>> > > > > nest check
>> > > > > earlier than before.
>> > > > >   2) It checks profitability of interchange before data
>> > > > > dependence computation.
>> > > > >   3) It calls find_data_references_in_loop only once for a
>> > > > > loop nest now.
>> > > > >   4) Data dependence is open-computed so that we can skip
>> > > > > instantly at
>> > > > > unknown dependence.
>> > > > >   5) It improves code generation in mapping induction
>> > > > > variables for
>> > > > > loop nest, as well as
>> > > > >  adding a simple dead code elimination pass.
>> > > > >   6) It changes magic constants into parameters.
>> > > >
>> > > > So I have a couple comments/questions.  Something stylistic:
>> > >
>> > > Hi Michael,
>> > > Thanks for reviewing.
>> > >
>> > > >
>> > > > > +class loop_cand
>> > > > > +{
>> > > > > +public:
>> > > > > ...
>> > > > > +  friend class tree_loop_interchange;
>> > > > > +private:
>> > > >
>> > > > Just make this all public (and hence a struct, not class).
>> > > > No need for friends in file local classes.
>> > >
>> > > Done.
>> > >
>> > > >
>> > > > > +single_use_in_loop (tree var, struct loop *loop)
>> > > > > ...
>> > > > > +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>> > > > > +{
>> > > > > +  stmt = USE_STMT (use_p);
>> > > > > ...
>> > > > > +  basic_block bb = gimple_bb (stmt);
>> > > > > +  gcc_assert (bb != NULL);
>> > > >
>> > > > This pattern reoccurs often in your patch: you check for a bb
>> > > > associated
>> > > > for a USE_STMT.  Uses of SSA names always occur in basic
>> > > > blocks, no need
>> > > > for checking.
>> > >
>> > > Done.
>> > >
>> > > >
>> > > > Then, something about your handling of simple reductions:
>> > > >
>> > > > > +void
>> > > > > +loop_cand::classify_simple_reduction (reduction_p re)
>> > > > > +{
>> > > > > ...
>> > > > > +  /* Require memory references in producer and consumer are
>> > > > > the same so
>> > > > > + that we can undo reduction during interchange.  */
>> > > > > +  if (re->init_ref && !operand_equal_p (re->init_ref, re-
>> > > > > >fini_ref, 0))
>> > > > > +return;
>> > > >
>> > > > Where is it checked that the undoing transformation is legal
>> > > > also
>> > > > from a data dep point of view?  Think code like this:
>> > > >
>> > > >sum = X[i];
>> > > >for (j ...)
>> > > >  sum += X[j];
>> > > >X[i] = sum;
>> > > >
>> > > > Moving the store into the inner loop isn't always correct and I
>> > > > don't seem
>> > > > to find where the above situation is rejected.
>> > >
>> > > Yeah.  for the old patch, it's possible to have such loop wrongly
>> > > interchanged;
>> > > in practice, it's hard to create an example.  The pass will give up
>> > > when computing data dep between references in inner/outer loops.
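As an aside, a minimal illustration of the transformation itself (my
example, assuming a global int a[M][N]): interchange swaps the loop order
so the innermost accesses become stride-1.

  /* Before: the inner loop walks a column, stride N.  */
  for (int j = 0; j < N; j++)
    for (int i = 0; i < M; i++)
      a[i][j] = 0;

  /* After interchange: the inner loop walks a row, stride 1.  */
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      a[i][j] = 0;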

Re: [PATCH GCC]A simple implementation of loop interchange

2017-11-23 Thread Bin.Cheng
Hi Richard,
Thanks for reviewing.  It's quite a lot of comments; I am trying to resolve
them one by one.  Here I have some questions embedded below.

On Mon, Nov 20, 2017 at 2:46 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Nov 16, 2017 at 4:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> wrote:
>>
>>>
>>> I hope this is of some help to you :)
>> Thanks again, it's very helpful.
>>
>> I also fixed several bugs of previous implementation, mostly about debug info
>> statements and simple reductions.  As for test, I enabled this pass by 
>> default,
>> bootstrap and regtest GCC, I also build/run specs.  There must be some other
>> latent bugs in it, but guess we have to exercise it by enabling it at
>> some point.
>>
>> So any comments?
>
>  bool
> -gsi_remove (gimple_stmt_iterator *i, bool remove_permanently)
> +gsi_remove (gimple_stmt_iterator *i, bool remove_permanently, bool 
> insert_dbg)
>  {
>
> that you need this suggests you do stmt removal in wrong order (you need to
> do reverse dom order).
>
> +/* Maximum number of statements in loop nest for loop interchange.  */
> +
> +DEFPARAM (PARAM_LOOP_INTERCHANGE_MAX_NUM_STMTS,
> + "loop-interchange-max-num-stmts",
> + "The maximum number of stmts in loop nest for loop interchange.",
> + 64, 0, 0)
>
> is that to limit dependence computation?  In this case you should probably
> limit the number of data references instead?
No, this is to limit the number of statements in the loop nest.  We don't
want to do interchange on overly large loops, right?

>
> +ftree-loop-interchange
> +Common Report Var(flag_tree_loop_interchange) Optimization
> +Enable loop interchange on trees.
> +
>
> please re-use -floop-interchange instead and change the GRAPHITE tests
> to use -floop-nest-optimize.  You can do that as pre-approved thing now.
>
> Please enable the pass by default at O3 via opts.c.
There are quite a few (vectorizer) test cases affected by interchange
(which is correct I believe), so I will prepare another patch enabling
it at O3 and adjusting the tests, to keep this patch small.

>
> diff --git a/gcc/tree-ssa-loop-interchange.cc 
> b/gcc/tree-ssa-loop-interchange.cc
>
> gimple-loop-interchange.cc please.
>
> new file mode 100644
> index 000..abffbf6
> --- /dev/null
> +++ b/gcc/tree-ssa-loop-interchange.cc
> @@ -0,0 +1,2274 @@
> +/* Loop invariant motion.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
>
> Loop invariant motion? ... ;)
>
> Please add a "Contributed by ..." to have an easy way to figure people to 
> blame.
>
> +}*induction_p;
> +
>
> space after '*'
>
> +}*reduction_p;
> +
>
> likewise.
>
> +/* Return true if PHI is unsupported in loop interchange, i.e, PHI contains
> +   ssa var appearing in any abnormal phi node.  */
> +
> +static inline bool
> +unsupported_phi_node (gphi *phi)
> +{
> +  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)))
> +return true;
> +
> +  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> +{
> +  tree arg = PHI_ARG_DEF (phi, i);
> +  if (TREE_CODE (arg) == SSA_NAME
> + && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (arg))
> +   return true;
> +}
> +
> +  return false;
>
> I believe the above isn't necessary given you rule out abnormal edges
> into the loop.
> Did you have a testcase that broke?  A minor thing I guess if it is
> just for extra
> safety...
>
> +/* Return true if all stmts in BB can be supported by loop interchange,
> +   otherwise return false.  ILOOP is not NULL if this loop_cand is the
> +   outer loop in loop nest.  */
> +
> +bool
> +loop_cand::unsupported_operation (basic_block bb, loop_cand *iloop)
> +{
>
> docs and return value suggest this be named supported_operation
>
> +  /* Or it's invariant memory reference and only used by inner loop.  */
> +  if (gimple_assign_single_p (stmt)
> + && (lhs = gimple_assign_lhs (stmt)) != NULL_TREE
> + && TREE_CODE (lhs) == SSA_NAME
> + && single_use_in_loop (lhs, iloop->loop))
> +   continue;
>
> comment suggests multiple uses in loop would be ok?
>
> +  if ((lhs = gimple_assign_lhs (producer)) == NULL_TREE
> + || lhs != re->init)
> +   return;
> +
> +  if ((rhs = gimple_assign_rhs1 (producer)) == NULL_TREE
> + || !REFERENCE_CLASS_P (rhs))
> +   return;
>
> lhs and rhs are never NULL.  Please initialize them outside of the if.
> You want to disallow DECL_P rhs

Re: Improve canonicalisation of TARGET_MEM_REFs

2017-11-20 Thread Bin.Cheng
On Mon, Nov 20, 2017 at 11:02 AM, Richard Biener
 wrote:
> On Tue, Nov 7, 2017 at 7:04 PM, Richard Sandiford
>  wrote:
>> Richard Biener  writes:
>>> On Fri, Nov 3, 2017 at 5:32 PM, Richard Sandiford
>>>  wrote:
 A general TARGET_MEM_REF is:

 BASE + STEP * INDEX + INDEX2 + OFFSET

 After classifying the address in this way, the code that builds
 TARGET_MEM_REFs tries to simplify the address until it's valid
 for the current target and for the mode of memory being addressed.
 It does this in a fixed order:

 (1) add SYMBOL to BASE
 (2) add INDEX * STEP to the base, if STEP != 1
 (3) add OFFSET to INDEX or BASE (reverted if unsuccessful)
 (4) add INDEX to BASE
 (5) add OFFSET to BASE

 So suppose we had an address:

 symbol + offset + index * 8   (e.g. "a[i + 1]" for a global "a")

 on a target that only allows an index or an offset, not both.  Following
 the steps above, we'd first create:

 tmp = symbol
 tmp2 = tmp + index * 8

 Then if the given offset value was valid for the mode being addressed,
 we'd create:

 MEM[base:tmp2, offset:offset]

 while if it was invalid we'd create:

 tmp3 = tmp2 + offset
 MEM[base:tmp3, offset:0]

 The problem is that this could happen if ivopts had decided to use
 a scaled index for an address that happens to have a constant base.
 The old procedure failed to give an indexed TARGET_MEM_REF in that case,
 and adding the offset last prevented later passes from being able to
 fold the index back in.

 The patch avoids this by skipping (2) if BASE + INDEX * STEP
 is a legitimate address and if OFFSET is stopping the address
 being valid.

 Tested on aarch64-linux-gnu, x86_64-linux-gnu and powerpc64-linux-gnu.
 OK to install?

 Richard


 2017-10-31  Richard Sandiford  
 Alan Hayward  
 David Sherwood  

 gcc/
 * tree-ssa-address.c (keep_index_p): New function.
 (create_mem_ref): Use it.  Only split out the INDEX * STEP
 component if that is invalid even with the symbol and offset
 removed.

 Index: gcc/tree-ssa-address.c
 ===
 --- gcc/tree-ssa-address.c  2017-11-03 12:15:44.097060121 +
 +++ gcc/tree-ssa-address.c  2017-11-03 12:21:18.060359821 +
 @@ -746,6 +746,20 @@ gimplify_mem_ref_parts (gimple_stmt_iter
  true, GSI_SAME_STMT);
  }

 +/* Return true if the STEP in PARTS gives a valid BASE + INDEX * STEP
 +   address for type TYPE and if the offset is making it appear invalid.  */
 +
 +static bool
 +keep_index_p (tree type, mem_address parts)
>>>
>>> mem_ref_valid_without_offset_p (...)
>>>
>>> ?
>>
>> OK.
>>
 +{
 +  if (!parts.base)
 +return false;
 +
 +  gcc_assert (!parts.symbol);
 +  parts.offset = NULL_TREE;
 +  return valid_mem_ref_p (TYPE_MODE (type), TYPE_ADDR_SPACE (type), &parts);
 +}
 +
  /* Creates and returns a TARGET_MEM_REF for address ADDR.  If necessary
 computations are emitted in front of GSI.  TYPE is the mode
 of created memory reference. IV_CAND is the selected iv candidate in 
 ADDR,
 @@ -809,7 +823,8 @@ create_mem_ref (gimple_stmt_iterator *gs
>>>
>>> Which means all of the following would be more naturally written as
>>>
   into:
 index' = index << step;
 [... + index' + ,,,].  */
 -  if (parts.step && !integer_onep (parts.step))
 +  bool scaled_p = (parts.step && !integer_onep (parts.step));
 +  if (scaled_p && !keep_index_p (type, parts))
  {
>>>
>>>   if (mem_ref_valid_without_offset_p (...))
>>>{
>>>  ...
>>>  return create_mem_ref_raw (...);
>>>}
>>
>> Is this inside the test for a scale:
>>
>>   if (parts.step && !integer_onep (parts.step))
>> {
>>   if (mem_ref_valid_without_offset_p (...))
>> {
>>   tree tmp = parts.offset;
>>   if (parts.base)
>> {
>>   tmp = fold_build_pointer_plus (parts.base, tmp);
>>   tmp = force_gimple_operand_gsi_1 (gsi, tmp,
>> is_gimple_mem_ref_addr,
>> NULL_TREE, true,
>> GSI_SAME_STMT);
>> }
>>   parts.base = tmp;
>>   parts.offset = NULL_TREE;
>>   mem_ref = create_mem_ref_raw (type, alias_ptr_type, &parts, true);
>>   gcc_assert (mem_ref);
>>   

Re: Make ivopts handle calls to internal functions

2017-11-20 Thread Bin.Cheng
On Fri, Nov 17, 2017 at 3:03 PM, Richard Sandiford
 wrote:
> ivopts previously treated pointer arguments to internal functions
> like IFN_MASK_LOAD and IFN_MASK_STORE as normal gimple values.
> This patch makes it treat them as addresses instead.  This makes
> a significant difference to the code quality for SVE loops,
> since we can then use loads and stores with scaled indices.
Thanks for working on this.  This can be extended to other internal
functions which are eventually expanded into memory references.  I believe
(at least) both x86 and AArch64 have such requirements.
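A loop of the kind this targets (illustrative only): with masked
vectorization, the conditional store becomes an IFN_MASK_STORE, and
treating its pointer argument as an address lets ivopts choose a
base + index << scale form for it.

  void
  f (int *restrict a, int *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      if (b[i] > 0)
        a[i] = b[i] + 1;
  }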

>
> The patch also adds support for ADDR_EXPRs of TARGET_MEM_REFs,
> which are the natural way of representing the result of the
> ivopts transformation.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?
>
> Richard
>
>
> 2017-11-17  Richard Sandiford  
> Alan Hayward  
> David Sherwood  
>
> gcc/
> * expr.c (expand_expr_addr_expr_1): Handle ADDR_EXPRs of
> TARGET_MEM_REFs.
> * gimple-expr.h (is_gimple_addressable): Likewise.
> * gimple-expr.c (is_gimple_address): Likewise.
> * internal-fn.c (expand_call_mem_ref): New function.
> (expand_mask_load_optab_fn): Use it.
> (expand_mask_store_optab_fn): Likewise.
> * tree-ssa-loop-ivopts.c (USE_ADDRESS): Split into...
> (USE_REF_ADDRESS, USE_PTR_ADDRESS): ...these new use types.
> (dump_groups): Update accordingly.
> (iv_use::mem_type): New member variable.
> (address_p): New function.
> (record_use): Add a mem_type argument and initialize the new
> mem_type field.
> (record_group_use): Add a mem_type argument.  Use address_p.
> Update call to record_use.
> (find_interesting_uses_op): Update call to record_group_use.
> (find_interesting_uses_cond): Likewise.
> (find_interesting_uses_address): Likewise.
> (get_mem_type_for_internal_fn): New function.
> (find_address_like_use): Likewise.
> (find_interesting_uses_stmt): Try find_address_like_use before
> calling find_interesting_uses_op.
> (addr_offset_valid_p): Use the iv mem_type field as the type
> of the addressed memory.
> (add_autoinc_candidates): Likewise.
> (get_address_cost): Likewise.
> (split_small_address_groups_p): Use address_p.
> (split_address_groups): Likewise.
> (add_iv_candidate_for_use): Likewise.
> (autoinc_possible_for_pair): Likewise.
> (rewrite_groups): Likewise.
> (get_use_type): Check for USE_REF_ADDRESS instead of USE_ADDRESS.
> (determine_group_iv_cost): Update after split of USE_ADDRESS.
> (get_alias_ptr_type_for_ptr_address): New function.
> (rewrite_use_address): Rewrite address uses in calls that were
> identified by find_address_like_use.
>
> gcc/testsuite/
> * gcc.dg/tree-ssa/scev-9.c: Expected REFERENCE ADDRESS
> instead of just ADDRESS.
> * gcc.dg/tree-ssa/scev-10.c: Likewise.
> * gcc.dg/tree-ssa/scev-11.c: Likewise.
> * gcc.dg/tree-ssa/scev-12.c: Likewise.
> * gcc.target/aarch64/sve_index_offset_1.c: New test.
> * gcc.target/aarch64/sve_index_offset_1_run.c: Likewise.
> * gcc.target/aarch64/sve_loop_add_2.c: Likewise.
> * gcc.target/aarch64/sve_loop_add_3.c: Likewise.
> * gcc.target/aarch64/sve_while_1.c: Check for indexed addressing 
> modes.
> * gcc.target/aarch64/sve_while_2.c: Likewise.
> * gcc.target/aarch64/sve_while_3.c: Likewise.
> * gcc.target/aarch64/sve_while_4.c: Likewise.
>
> Index: gcc/expr.c
> ===
> --- gcc/expr.c  2017-11-17 09:49:36.191354637 +
> +++ gcc/expr.c  2017-11-17 15:02:12.868132458 +
> @@ -7814,6 +7814,9 @@ expand_expr_addr_expr_1 (tree exp, rtx t
> return expand_expr (tem, target, tmode, modifier);
>}
>
> +case TARGET_MEM_REF:
> +  return addr_for_mem_ref (exp, as, true);
> +
>  case CONST_DECL:
>/* Expand the initializer like constants above.  */
>result = XEXP (expand_expr_constant (DECL_INITIAL (exp),
> Index: gcc/gimple-expr.h
> ===
> --- gcc/gimple-expr.h   2017-11-17 09:40:43.520567009 +
> +++ gcc/gimple-expr.h   2017-11-17 15:02:12.868132458 +
> @@ -119,6 +119,7 @@ virtual_operand_p (tree op)
>  is_gimple_addressable (tree t)
>  {
>return (is_gimple_id (t) || handled_component_p (t)
> + || TREE_CODE (t) == TARGET_MEM_REF
>   || TREE_CODE (t) == MEM_REF);
>  }
>
> Index: gcc/gimple-expr.c
> ===
> --- gcc/gimple-expr.c   

Re: [PATCH GCC]A simple implementation of loop interchange

2017-11-16 Thread Bin.Cheng
On Tue, Oct 24, 2017 at 3:30 PM, Michael Matz <m...@suse.de> wrote:
> Hello,
>
> On Fri, 22 Sep 2017, Bin.Cheng wrote:
>
>> This is updated patch for loop interchange with review suggestions
>> resolved.  Changes are:
>>   1) It does more light weight checks like rectangle loop nest check
>> earlier than before.
>>   2) It checks profitability of interchange before data dependence 
>> computation.
>>   3) It calls find_data_references_in_loop only once for a loop nest now.
>>   4) Data dependence is open-computed so that we can skip instantly at
>> unknown dependence.
>>   5) It improves code generation in mapping induction variables for
>> loop nest, as well as
>>  adding a simple dead code elimination pass.
>>   6) It changes magic constants into parameters.
>
> So I have a couple comments/questions.  Something stylistic:
Hi Michael,
Thanks for reviewing.

>
>> +class loop_cand
>> +{
>> +public:
>> ...
>> +  friend class tree_loop_interchange;
>> +private:
>
> Just make this all public (and hence a struct, not class).
> No need for friends in file local classes.
Done.

>
>> +single_use_in_loop (tree var, struct loop *loop)
>> ...
>> +  FOR_EACH_IMM_USE_FAST (use_p, iterator, var)
>> +{
>> +  stmt = USE_STMT (use_p);
>> ...
>> +  basic_block bb = gimple_bb (stmt);
>> +  gcc_assert (bb != NULL);
>
> This pattern reoccurs often in your patch: you check for a bb associated
> for a USE_STMT.  Uses of SSA names always occur in basic blocks, no need
> for checking.
Done.

>
> Then, something about your handling of simple reductions:
>
>> +void
>> +loop_cand::classify_simple_reduction (reduction_p re)
>> +{
>> ...
>> +  /* Require memory references in producer and consumer are the same so
>> + that we can undo reduction during interchange.  */
>> +  if (re->init_ref && !operand_equal_p (re->init_ref, re->fini_ref, 0))
>> +return;
>
> Where is it checked that the undoing transformation is legal also
> from a data dep point of view?  Think code like this:
>
>sum = X[i];
>for (j ...)
>  sum += X[j];
>X[i] = sum;
>
> Moving the store into the inner loop isn't always correct and I don't seem
> to find where the above situation is rejected.
Yeah, with the old patch it's possible for such a loop to be wrongly
interchanged; in practice, it's hard to create an example.  The pass will
give up when computing data dependence between references in the
inner/outer loops.  In this updated patch, it's fixed by giving up if there
is any dependence between references of the inner/outer loops.

>
> Maybe I'm confused because I also don't see where you even can get into
> the above situation (though I do see testcases about this).  The thing is,
> for an 2d loop nest to contain something like the above reduction it can't
> be perfect:
>
>for (j) {
>  int sum = X[j];  // 1
>  for (i)
>sum += Y[j][i];
>  X[j] = sum;  // 2
>}
>
> But you do check for perfectness in proper_loop_form_for_interchange and
> prepare_perfect_loop_nest, so either you can't get into the situation or
> the checking can't be complete, or you define the above to be perfect
> nevertheless (probably because the load and store are in outer loop
> header/exit blocks?).  The latter would mean that you accept also other
> code in header/footer of loops from a pure CFG perspective, so where is it
> checked that that other code (which aren't simple reductions) isn't
> harmful to the transformation?
Yes, I used the name "perfect loop nest", but the pass can handle a
special-form imperfect loop nest for the simple reduction.  I added comments
describing this before function prepare_perfect_loop_nest.
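A sketch of that special form (my example, assuming X and Y do not alias):
the reduction's producer and consumer sit in the outer loop's header and
exit, and undoing the reduction yields a perfect nest that can then be
interchanged.

  /* Imperfect nest with a simple reduction.  */
  void
  reduc (int n, int m, int X[n], int Y[m][n])
  {
    for (int j = 0; j < n; j++)
      {
        int sum = X[j];                /* producer in outer header */
        for (int i = 0; i < m; i++)
          sum += Y[i][j];
        X[j] = sum;                    /* consumer in outer exit */
      }
  }

  /* After undoing the reduction and interchanging, the Y accesses
     are stride-1 in the inner loop.  */
  void
  reduc_interchanged (int n, int m, int X[n], int Y[m][n])
  {
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
        X[j] += Y[i][j];
  }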

>
> Then, the data dependence part of the new pass:
>
>> +bool
>> +tree_loop_interchange::valid_data_dependences (unsigned inner, unsigned 
>> outer)
>> +{
>> +  struct data_dependence_relation *ddr;
>> +
>> +  for (unsigned i = 0; ddrs.iterate (i, ); ++i)
>> +{
>> +  /* Skip no-dependence case.  */
>> +  if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
>> + continue;
>> +
>> +  for (unsigned j = 0; j < DDR_NUM_DIR_VECTS (ddr); ++j)
>> + {
>> +   lambda_vector dist_vect = DDR_DIST_VECT (ddr, j);
>> +   unsigned level = dependence_level (dist_vect, loop_nest.length ());
>> +
>> +   /* If there is no carried dependence.  */
>> +   if (level == 0)
>> + continue;
>> +
>> +   level --;
>> +   /* Skip case which has '>' as the lef

Re: [PATCH PR82726/PR70754][2/2]New fix by finding correct root reference in combined chains

2017-11-15 Thread Bin.Cheng
On Mon, Nov 13, 2017 at 1:20 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Sat, Nov 11, 2017 at 11:19 AM, Bernhard Reutner-Fischer
> <rep.dot@gmail.com> wrote:
>> On Fri, Nov 10, 2017 at 02:14:25PM +, Bin.Cheng wrote:
>>> Hmm, the patch...
>>
>> +  /* Setup UID for all statements in dominance order.  */
>> +  basic_block *bbs = get_loop_body (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +{
>> +  unsigned uid = 0;
>> +  basic_block bb = bbs[i];
>> +
>> +  for (gimple_stmt_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
>> +  gsi_next (&bsi))
>> +   {
>> + gimple *stmt = gsi_stmt (bsi);
>> + if (!virtual_operand_p (gimple_phi_result (as_a <gphi *> (stmt))))
>> +   gimple_set_uid (stmt, uid);
>> +   }
>> +
>> +  for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
>> +  gsi_next (&bsi))
>> +   {
>> + gimple *stmt = gsi_stmt (bsi);
>> + if (gimple_code (stmt) != GIMPLE_LABEL && !is_gimple_debug (stmt))
>> +   gimple_set_uid (stmt, ++uid);
>> +   }
>>
>>   for (gimple_stmt_iterator bsi = gsi_start_nondebug_after_labels_bb 
>> (bb);
>>!gsi_end_p (bsi);
>>gsi_next_nondebug (&bsi))
>>  gimple_set_uid (gsi_stmt (bsi), ++uid);
>
> Or even better instead of the whole loop
>
> renumber_gimple_stmt_uids_in_blocks (bbs, loop->num_nodes);
>
> Ok with that change.
Right, here is the updated patch.  Will commit it later.

Thanks,
bin
>
> Thanks,
> Richard.
>
>> thanks,
>>
>> +}
>> +  free (bbs);
>>
From 28a21f4a86ed4e1b5a174b004c45bd4b8ede944f Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 1 Nov 2017 17:43:55 +
Subject: [PATCH 2/2] pr82726-2017.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c |  26 ++
 gcc/tree-predcom.c  | 138 
 2 files changed, 148 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
new file mode 100644
index 000..22bc59d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82726.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param tree-reassoc-width=4" } */
+/* { dg-additional-options "-mavx2" { target { x86_64-*-* i?86-*-* } } } */
+
+#define N 40
+#define M 128
+unsigned int in[N+M];
+unsigned short out[N];
+
+/* Outer-loop vectorization. */
+
+void
+foo (){
+  int i,j;
+  unsigned int diff;
+
+  for (i = 0; i < N; i++) {
+diff = 0;
+for (j = 0; j < M; j+=8) {
+  diff += in[j+i];
+}
+out[i]=(unsigned short)diff;
+  }
+
+  return;
+}
diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index 24d7c9c..28dac82 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1020,6 +1020,17 @@ order_drefs (const void *a, const void *b)
   return (*da)->pos - (*db)->pos;
 }
 
+/* Compares two drefs A and B by their position.  Callback for qsort.  */
+
+static int
+order_drefs_by_pos (const void *a, const void *b)
+{
+  const dref *const da = (const dref *) a;
+  const dref *const db = (const dref *) b;
+
+  return (*da)->pos - (*db)->pos;
+}
+
 /* Returns root of the CHAIN.  */
 
 static inline dref
@@ -2633,7 +2644,6 @@ combine_chains (chain_p ch1, chain_p ch2)
   bool swap = false;
   chain_p new_chain;
   unsigned i;
-  gimple *root_stmt;
   tree rslt_type = NULL_TREE;
 
   if (ch1 == ch2)
@@ -2675,31 +2685,55 @@ combine_chains (chain_p ch1, chain_p ch2)
   new_chain->refs.safe_push (nw);
 }
 
-  new_chain->has_max_use_after = false;
-  root_stmt = get_chain_root (new_chain)->stmt;
-  for (i = 1; new_chain->refs.iterate (i, &nw); i++)
-{
-  if (nw->distance == new_chain->length
-	  && !stmt_dominates_stmt_p (nw->stmt, root_stmt))
-	{
-	  new_chain->has_max_use_after = true;
-	  break;
-	}
-}
-
   ch1->combined = true;
   ch2->combined = true;
   return new_chain;
 }
 
-/* Try to combine the CHAINS.  */
+/* Recursively update position information of all offspring chains to ROOT
+   chain's position information.  */
+
+static void
+update_pos_for_combined_chains (chain_p root)
+{
+  chain_p ch1 = root->ch1, ch2 = root->ch2;
+  dref ref, ref1, ref2;
+  for (unsigned j = 0; (root->refs.iterate (j, &ref)
+			&& ch1->refs.iterate (j, &ref1)
+			&& ch2->refs.iterate (j, &ref2)); ++j)
+ref1->pos = ref2->pos = ref->pos;
+
+  if (ch1->type == CT_COMBINATION)
+    update_pos_for_combined_chains (ch1);
+  if (ch2->type == CT_COMBINATION)
+    update_pos_for_combined_chains (ch2);
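For the record, with renumber_gimple_stmt_uids_in_blocks the whole UID
setup presumably reduces to (sketch):

  basic_block *bbs = get_loop_body (loop);
  renumber_gimple_stmt_uids_in_blocks (bbs, loop->num_nodes);
  free (bbs);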

Re: [PATCH PR82726/PR70754][2/2]New fix by finding correct root reference in combined chains

2017-11-10 Thread Bin.Cheng
Hmm, the patch...

Thanks,
bin

On Fri, Nov 10, 2017 at 2:13 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 10:53 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Nov 3, 2017 at 1:40 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> As described in message of previous patch:
>>>
>>> This patch set fixes both PRs in the opposite way: instead of finding the
>>> dominance insertion position for the root reference, we re-sort the
>>> zero-distance references of a combined chain by their position information
>>> so that the new root reference must dominate the others.  This should be
>>> more efficient because we avoid function calls to stmt_dominates_stmt_p.
>>> Bootstrap and test on x86_64 and AArch64 in patch set.  Is it OK?
>>
>> +/* { dg-additional-options "-mavx2" { target avx2_runtime } } */
>>
>> you don't need avx2_runtime for -mavx2 so please instead use
>> { target { x86_64-*-* i?86-*-* } }
>>
>> +#include 
>> +#define INCLUDE_ALGORITHM /* std::sort */
>>
>> can you please use GCCs own hash_map?  Btw...
>>
>> +  /* Setup UID for all statements in dominance order.  */
>> +  std::map<gimple *, int> stmts_map;
>> +  basic_block *bbs = get_loop_body_in_dom_order (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +{
>> +  int uid = 0;
>> +  basic_block bb = bbs[i];
>> +
>> +  for (gimple_stmt_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
>> +  gsi_next (&bsi))
>> +   {
>> + gimple *stmt = gsi_stmt (bsi);
>> + if (!virtual_operand_p (gimple_phi_result (as_a <gphi *> (stmt))))
>> +   stmts_map[stmt] = uid;
>>
>> why don't you use gimple_[set_]uid ()?  Given you do a dominance check
>> you don't even need to do this in dominance order - usually passes just
>> number UIDs in all relevant BBs.  There is a helper for that as well,
>> renumber_gimple_stmt_uids_in_blocks which can be used on
>> the get_loop_body result.
> Yea, I forgot gimple_[set_]uid interface when doing this.  All fixed now.
>>
>> +  /* Sort all ZERO distance references by position.  */
>> +  std::sort (&ch1->refs[0], &ch1->refs[0] + j, order_drefs_by_pos);
>> +
>>
>> given ch1->refs is a vec you can use the new vec::qsort_block you added
>> instead of including algorithm and using std::sort.
> Sorry, I haven't push that patch in.  In this updated patch, I fall
> back to generic qsort so algorithm is not included.
>
> Bootstrap and test on x86_64.  Is it OK?
> Thanks,
> bin
>
> 2017-11-10  Bin Cheng  <bin.ch...@arm.com>
>
> PR tree-optimization/82726
> PR tree-optimization/70754
> * tree-predcom.c (order_drefs_by_pos): New function.
> (combine_chains): Move code setting has_max_use_after to...
> (try_combine_chains): ...here.  New parameter.  Sort combined chains
> according to position information.
> (tree_predictive_commoning_loop): Update call to above function.
> (update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.
>
> gcc/testsuite
> 2017-11-10  Bin Cheng  <bin.ch...@arm.com>
>
> PR tree-optimization/82726
> * gcc.dg/tree-ssa/pr82726.c: New test.
>
>
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>> 2017-11-02  Bin Cheng  <bin.ch...@arm.com>
>>>
>>> PR tree-optimization/82726
>>> PR tree-optimization/70754
>>> * tree-predcom.c (, INCLUDE_ALGORITHM): New headers.
>>> (order_drefs_by_pos): New function.
>>> (combine_chains): Move code setting has_max_use_after to...
>>> (try_combine_chains): ...here.  New parameter.  Sort combined chains
>>> according to position information.
>>> (tree_predictive_commoning_loop): Update call to above function.
>>> (update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.
>>>
>>> gcc/testsuite
>>> 2017-11-02  Bin Cheng  <bin.ch...@arm.com>
>>>
>>> PR tree-optimization/82726
>>> * gcc.dg/tree-ssa/pr82726.c: New test.
From f7b9b4ac78f33aee60ecd37ca515f2f8773f5561 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 1 Nov 2017 17:43:55 +
Subject: [PATCH 2/2] pr82726-20171110.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c |  26 ++
 gcc/tree-predcom.c  | 158 
 2 files changed, 168 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr82726.c

diff --git

Re: [PATCH PR82726/PR70754][2/2]New fix by finding correct root reference in combined chains

2017-11-10 Thread Bin.Cheng
On Tue, Nov 7, 2017 at 10:53 AM, Richard Biener
 wrote:
> On Fri, Nov 3, 2017 at 1:40 PM, Bin Cheng  wrote:
>> Hi,
>> As described in message of previous patch:
>>
>> This patch set fixes both PRs in the opposite way: instead of finding the
>> dominance insertion position for the root reference, we re-sort the
>> zero-distance references of a combined chain by their position information
>> so that the new root reference must dominate the others.  This should be
>> more efficient because we avoid function calls to stmt_dominates_stmt_p.
>> Bootstrap and test on x86_64 and AArch64 in patch set.  Is it OK?
>
> +/* { dg-additional-options "-mavx2" { target avx2_runtime } } */
>
> you don't need avx2_runtime for -mavx2 so please instead use
> { target { x86_64-*-* i?86-*-* } }
>
> +#include 
> +#define INCLUDE_ALGORITHM /* std::sort */
>
> can you please use GCCs own hash_map?  Btw...
>
> +  /* Setup UID for all statements in dominance order.  */
> +  std::map<gimple *, int> stmts_map;
> +  basic_block *bbs = get_loop_body_in_dom_order (loop);
> +  for (i = 0; i < loop->num_nodes; i++)
> +{
> +  int uid = 0;
> +  basic_block bb = bbs[i];
> +
> +  for (gimple_stmt_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> +  gsi_next (&bsi))
> +   {
> + gimple *stmt = gsi_stmt (bsi);
> + if (!virtual_operand_p (gimple_phi_result (as_a <gphi *> (stmt))))
> +   stmts_map[stmt] = uid;
>
> why don't you use gimple_[set_]uid ()?  Given you do a dominance check
> you don't even need to do this in dominance order - usually passes just
> number UIDs in all relevant BBs.  There is a helper for that as well,
> renumber_gimple_stmt_uids_in_blocks which can be used on
> the get_loop_body result.
Yea, I forgot gimple_[set_]uid interface when doing this.  All fixed now.
>
> +  /* Sort all ZERO distance references by position.  */
> +  std::sort (&ch1->refs[0], &ch1->refs[0] + j, order_drefs_by_pos);
> +
>
> given ch1->refs is a vec you can use the new vec::qsort_block you added
> instead of including algorithm and using std::sort.
Sorry, I haven't pushed that patch in.  In this updated patch, I fall
back to generic qsort so algorithm is not included.

Bootstrap and test on x86_64.  Is it OK?
Thanks,
bin

2017-11-10  Bin Cheng  

PR tree-optimization/82726
PR tree-optimization/70754
* tree-predcom.c (order_drefs_by_pos): New function.
(combine_chains): Move code setting has_max_use_after to...
(try_combine_chains): ...here.  New parameter.  Sort combined chains
according to position information.
(tree_predictive_commoning_loop): Update call to above function.
(update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.

gcc/testsuite
2017-11-10  Bin Cheng  

PR tree-optimization/82726
* gcc.dg/tree-ssa/pr82726.c: New test.


>
> Richard.
>
>> Thanks,
>> bin
>> 2017-11-02  Bin Cheng  
>>
>> PR tree-optimization/82726
>> PR tree-optimization/70754
>> * tree-predcom.c (, INCLUDE_ALGORITHM): New headers.
>> (order_drefs_by_pos): New function.
>> (combine_chains): Move code setting has_max_use_after to...
>> (try_combine_chains): ...here.  New parameter.  Sort combined chains
>> according to position information.
>> (tree_predictive_commoning_loop): Update call to above function.
>> (update_pos_for_combined_chains, pcom_stmt_dominates_stmt_p): New.
>>
>> gcc/testsuite
>> 2017-11-02  Bin Cheng  
>>
>> PR tree-optimization/82726
>> * gcc.dg/tree-ssa/pr82726.c: New test.


Re: [PATCH PR82776]Exploit more undefined pointer overflow behavior in loop niter analysis

2017-11-08 Thread Bin.Cheng
On Wed, Nov 8, 2017 at 11:55 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 1:44 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Nov 7, 2017 at 12:23 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Nov 7, 2017 at 1:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Tue, Nov 7, 2017 at 10:44 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Fri, Nov 3, 2017 at 1:35 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>> Hi,
>>>>>> This is a simple patch exploiting more undefined pointer overflow 
>>>>>> behavior in
>>>>>> loop niter analysis.  Originally, it only supports POINTER_PLUS_EXPR if 
>>>>>> the
>>>>>> offset part is IV.  This patch also handles the case if pointer is IV.  
>>>>>> With
>>>>>> this patch, the while(true) loop in test can be removed by cddce pass 
>>>>>> now.
>>>>>>
>>>>>> Bootstrap and test on x86_64 and AArch64.  This patch introduces two 
>>>>>> failures:
>>>>>> FAIL: g++.dg/pr79095-1.C  -std=gnu++98 (test for excess errors)
>>>>>> FAIL: g++.dg/pr79095-2.C  -std=gnu++11 (test for excess errors)
>>>>>> I believe this exposes inaccurate value range information issue.  For 
>>>>>> below code:
>>>>>> /* { dg-do compile } */
>>>>>> /* { dg-options "-Wall -O3" } */
>>>>>>
>>>>>> typedef long unsigned int size_t;
>>>>>>
>>>>>> inline void
>>>>>> fill (int *p, size_t n, int)
>>>>>> {
>>>>>>   while (n--)
>>>>>> *p++ = 0;
>>>>>> }
>>>>>>
>>>>>> struct B
>>>>>> {
>>>>>>   int* p0, *p1, *p2;
>>>>>>
>>>>>>   size_t size () const {
>>>>>> return size_t (p1 - p0);
>>>>>>   }
>>>>>>
>>>>>>   void resize (size_t n) {
>>>>>> if (n > size())
>>>>>>   append (n - size());
>>>>>>   }
>>>>>>
>>>>>>   void append (size_t n)
>>>>>>   {
>>>>>> if (size_t (p2 - p1) >= n)   {
>>>>>>   fill (p1, n, 0);
>>>>>> }
>>>>>>   }
>>>>>> };
>>>>>>
>>>>>> void foo (B )
>>>>>> {
>>>>>>   if (b.size () != 0)
>>>>>> b.resize (b.size () - 1);
>>>>>> }
>>>>>>
>>>>>> GCC gives below warning with this patch:
>>>>>> pr79095-1.C: In function ‘void foo(B&)’:
>>>>>> pr79095-1.C:10:7: warning: iteration 4611686018427387903 invokes 
>>>>>> undefined behavior [-Waggressive-loop-optimizations]
>>>>>>  *p++ = 0;
>>>>>>   ~^~
>>>>>> pr79095-1.C:9:11: note: within this loop
>>>>>>while (n--)
>>>>>>^~
>>>>>>
>>>>>> Problem is VRP should understand that it's never the case with condition:
>>>>>>   (size_t (p2 - p1) >= n)
>>>>>> in function B::append.
>>>>>>
>>>>>> So, any comment?
>>>
>>> Does it warn when not inlining fill()?  Isn't the issue that one test
>> With this patch, yes.
>>> tests p2 - p1 and
>>> the loop goes from p1 to p1 + (p1 - p0)?
>> I don't follow here.  So the code is:
>>
>>>>>> inline void
>>>>>> fill (int *p, size_t n, int)
>>>>>> {
>>>>>>   while (n--)
>>>>>> *p++ = 0;
>>>>>> }
>>
>>>>>>   void append (size_t n)
>>>>>>   {
>>>>>> if (size_t (p2 - p1) >= n)   {
>>>>>>   fill (p1, n, 0);
>>>>>> }
>>
>> fill is only called if size_t (p2 - p1) >= n, so while loop in fill
>> can only zero-out memory range [p1, p2)?
>
> what happens if p1 is before p0?  The compare
> p2 - p1 >= p1 - p0 doesn't tell us much when
> iterating from p1 to p1 + ((p1 - p0) - 1), no?
I thought twice about this.  Looks like the warning message is not
s

Re: [PATCH PR82776]Exploit more undefined pointer overflow behavior in loop niter analysis

2017-11-07 Thread Bin.Cheng
On Tue, Nov 7, 2017 at 12:23 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 1:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Nov 7, 2017 at 10:44 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Nov 3, 2017 at 1:35 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This is a simple patch exploiting more undefined pointer overflow behavior 
>>>> in
>>>> loop niter analysis.  Originally, it only supports POINTER_PLUS_EXPR if the
>>>> offset part is IV.  This patch also handles the case if pointer is IV.  
>>>> With
>>>> this patch, the while(true) loop in test can be removed by cddce pass now.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  This patch introduces two 
>>>> failures:
>>>> FAIL: g++.dg/pr79095-1.C  -std=gnu++98 (test for excess errors)
>>>> FAIL: g++.dg/pr79095-2.C  -std=gnu++11 (test for excess errors)
>>>> I believe this exposes inaccurate value range information issue.  For 
>>>> below code:
>>>> /* { dg-do compile } */
>>>> /* { dg-options "-Wall -O3" } */
>>>>
>>>> typedef long unsigned int size_t;
>>>>
>>>> inline void
>>>> fill (int *p, size_t n, int)
>>>> {
>>>>   while (n--)
>>>> *p++ = 0;
>>>> }
>>>>
>>>> struct B
>>>> {
>>>>   int* p0, *p1, *p2;
>>>>
>>>>   size_t size () const {
>>>> return size_t (p1 - p0);
>>>>   }
>>>>
>>>>   void resize (size_t n) {
>>>> if (n > size())
>>>>   append (n - size());
>>>>   }
>>>>
>>>>   void append (size_t n)
>>>>   {
>>>> if (size_t (p2 - p1) >= n)   {
>>>>   fill (p1, n, 0);
>>>> }
>>>>   }
>>>> };
>>>>
>>>> void foo (B )
>>>> {
>>>>   if (b.size () != 0)
>>>> b.resize (b.size () - 1);
>>>> }
>>>>
>>>> GCC gives below warning with this patch:
>>>> pr79095-1.C: In function ‘void foo(B&)’:
>>>> pr79095-1.C:10:7: warning: iteration 4611686018427387903 invokes undefined 
>>>> behavior [-Waggressive-loop-optimizations]
>>>>  *p++ = 0;
>>>>   ~^~
>>>> pr79095-1.C:9:11: note: within this loop
>>>>while (n--)
>>>>^~
>>>>
>>>> Problem is VRP should understand that it's never the case with condition:
>>>>   (size_t (p2 - p1) >= n)
>>>> in function B::append.
>>>>
>>>> So, any comment?
>
> Does it warn when not inlining fill()?  Isn't the issue that one test
With this patch, yes.
> tests p2 - p1 and
> the loop goes from p1 to p1 + (p1 - p0)?
I don't follow here.  So the code is:

>>>> inline void
>>>> fill (int *p, size_t n, int)
>>>> {
>>>>   while (n--)
>>>> *p++ = 0;
>>>> }

>>>>   void append (size_t n)
>>>>   {
>>>> if (size_t (p2 - p1) >= n)   {
>>>>   fill (p1, n, 0);
>>>> }

fill is only called if size_t (p2 - p1) >= n, so while loop in fill
can only zero-out memory range [p1, p2)?


>
> What kind of optimization do we apply to the loop in fill?
Depending on some conditions, the loop could be distributed into a memset
call.  Anyway, the warning message is issued as long as niter analysis
takes advantage of undefined pointer-overflow behavior.

Thanks,
bin
>
>>> I'm looking hard but I can't see you changed anything in
>>> infer_loop_bounds_from_pointer_arith
>>> besides adding a expr_invariant_in_loop_p (loop, rhs2) check.
>> yes, that's enough for this fix?
>>
>> -  ptr = gimple_assign_rhs1 (stmt);
>> -  if (!expr_invariant_in_loop_p (loop, ptr))
>> +  rhs2 = gimple_assign_rhs2 (stmt);
>> +  if (TYPE_PRECISION (type) != TYPE_PRECISION (TREE_TYPE (rhs2)))
>>  return;
>>
>> -  var = gimple_assign_rhs2 (stmt);
>> -  if (TYPE_PRECISION (type) != TYPE_PRECISION (TREE_TYPE (var)))
>> +  rhs1 = gimple_assign_rhs1 (stmt);
>> +  if (!expr_invariant_in_loop_p (loop, rhs1)
>> +  && !expr_invariant_in_loop_p (loop, rhs2))
>>  return;
>>
>> Before this change, the function skips if ptr in "res = ptr +p offset"
>> is non-invariant.  This change only skips if both ptr and offset are
>> non-invariant, thus the PR is handled.
>
> Ah, of course.  Thanks for the explanation.
>
>> Thanks,
>> bin
>>
>>
>>>
>>> What am I missing?
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>> 2017-11-02  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>> PR tree-optimization/82776
>>>> * tree-ssa-loop-niter.c (infer_loop_bounds_from_pointer_arith): 
>>>> Handle
>>>> POINTER_PLUS_EXPR in which the pointer is an IV.
>>>> (infer_loop_bounds_from_signedness): Refine comment.
>>>>
>>>> gcc/testsuite
>>>> 2017-11-02  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>> PR tree-optimization/82776
>>>> * g++.dg/pr82776.C: New test.
>>>> * gcc.dg/tree-ssa/split-path-6.c: Refine test.


Re: [PATCH PR82776]Exploit more undefined pointer overflow behavior in loop niter analysis

2017-11-07 Thread Bin.Cheng
On Tue, Nov 7, 2017 at 10:44 AM, Richard Biener
 wrote:
> On Fri, Nov 3, 2017 at 1:35 PM, Bin Cheng  wrote:
>> Hi,
>> This is a simple patch exploiting more undefined pointer overflow behavior in
>> loop niter analysis.  Originally, it only supports POINTER_PLUS_EXPR if the
>> offset part is IV.  This patch also handles the case if pointer is IV.  With
>> this patch, the while(true) loop in test can be removed by cddce pass now.
>>
>> Bootstrap and test on x86_64 and AArch64.  This patch introduces two 
>> failures:
>> FAIL: g++.dg/pr79095-1.C  -std=gnu++98 (test for excess errors)
>> FAIL: g++.dg/pr79095-2.C  -std=gnu++11 (test for excess errors)
>> I believe this exposes inaccurate value range information issue.  For below 
>> code:
>> /* { dg-do compile } */
>> /* { dg-options "-Wall -O3" } */
>>
>> typedef long unsigned int size_t;
>>
>> inline void
>> fill (int *p, size_t n, int)
>> {
>>   while (n--)
>> *p++ = 0;
>> }
>>
>> struct B
>> {
>>   int* p0, *p1, *p2;
>>
>>   size_t size () const {
>> return size_t (p1 - p0);
>>   }
>>
>>   void resize (size_t n) {
>> if (n > size())
>>   append (n - size());
>>   }
>>
>>   void append (size_t n)
>>   {
>> if (size_t (p2 - p1) >= n)   {
>>   fill (p1, n, 0);
>> }
>>   }
>> };
>>
>> void foo (B )
>> {
>>   if (b.size () != 0)
>> b.resize (b.size () - 1);
>> }
>>
>> GCC gives below warning with this patch:
>> pr79095-1.C: In function ‘void foo(B&)’:
>> pr79095-1.C:10:7: warning: iteration 4611686018427387903 invokes undefined 
>> behavior [-Waggressive-loop-optimizations]
>>  *p++ = 0;
>>   ~^~
>> pr79095-1.C:9:11: note: within this loop
>>while (n--)
>>^~
>>
>> Problem is VRP should understand that it's never the case with condition:
>>   (size_t (p2 - p1) >= n)
>> in function B::append.
>>
>> So, any comment?
>
> I'm looking hard but I can't see you changed anything in
> infer_loop_bounds_from_pointer_arith
> besides adding a expr_invariant_in_loop_p (loop, rhs2) check.
yes, that's enough for this fix?

-  ptr = gimple_assign_rhs1 (stmt);
-  if (!expr_invariant_in_loop_p (loop, ptr))
+  rhs2 = gimple_assign_rhs2 (stmt);
+  if (TYPE_PRECISION (type) != TYPE_PRECISION (TREE_TYPE (rhs2)))
 return;

-  var = gimple_assign_rhs2 (stmt);
-  if (TYPE_PRECISION (type) != TYPE_PRECISION (TREE_TYPE (var)))
+  rhs1 = gimple_assign_rhs1 (stmt);
+  if (!expr_invariant_in_loop_p (loop, rhs1)
+  && !expr_invariant_in_loop_p (loop, rhs2))
 return;

Before this change, the function skips if ptr in "res = ptr +p offset"
is non-invariant.  This change only skips if both ptr and offset are
non-invariant, thus the PR is handled.

Thanks,
bin


>
> What am I missing?
>
> Richard.
>
>> Thanks,
>> bin
>> 2017-11-02  Bin Cheng  
>>
>> PR tree-optimization/82776
>> * tree-ssa-loop-niter.c (infer_loop_bounds_from_pointer_arith): 
>> Handle
>> POINTER_PLUS_EXPR in which the pointer is an IV.
>> (infer_loop_bounds_from_signedness): Refine comment.
>>
>> gcc/testsuite
>> 2017-11-02  Bin Cheng  
>>
>> PR tree-optimization/82776
>> * g++.dg/pr82776.C: New test.
>> * gcc.dg/tree-ssa/split-path-6.c: Refine test.
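A sketch of the newly handled shape (illustrative, not the PR testcase):
the pointer itself is the IV of the POINTER_PLUS_EXPR, so niter analysis
can now derive an iteration bound from the guarantee that the increment
does not overflow.

  void
  f (int *p, int *q)
  {
    while (1)
      {
        if (p == q)
          break;
        *p++ = 0;   /* res = p +p 4: the pointer operand is the IV */
      }
  }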


Re: Improve ivopts handling of forced scales

2017-11-06 Thread Bin.Cheng
On Fri, Nov 3, 2017 at 4:28 PM, Richard Sandiford
 wrote:
> This patch improves the ivopts address cost calculcation for modes
> in which an index must be scaled rather than unscaled.  Previously
> we would only try the scaled form if the unscaled form was valid.
>
> Many of the SVE tests rely on this when matching scaled indices.
>
> Tested on aarch64-linux-gnu, x86_64-linux-gnu and powerpc64-linux-gnu.
> OK to install?
OK.

Thanks,
bin
>
> Richard
>
>
> 2017-11-03  Richard Sandiford  
> Alan Hayward  
> David Sherwood  
>
> gcc/
> * tree-ssa-loop-ivopts.c (get_address_cost): Try using a
> scaled index even if the unscaled address was invalid.
> Don't increase the complexity of using a scale in that case.
>
> Index: gcc/tree-ssa-loop-ivopts.c
> ===
> --- gcc/tree-ssa-loop-ivopts.c  2017-11-03 12:20:07.041206480 +
> +++ gcc/tree-ssa-loop-ivopts.c  2017-11-03 12:20:07.193201997 +
> @@ -4333,18 +4333,25 @@ get_address_cost (struct ivopts_data *da
>machine_mode addr_mode = TYPE_MODE (type);
>machine_mode mem_mode = TYPE_MODE (TREE_TYPE (*use->op_p));
>addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (use->iv->base));
> +  /* Only true if ratio != 1.  */
> +  bool ok_with_ratio_p = false;
> +  bool ok_without_ratio_p = false;
>
>if (!aff_combination_const_p (aff_inv))
>  {
>parts.index = integer_one_node;
>/* Addressing mode "base + index".  */
> -  if (valid_mem_ref_p (mem_mode, as, &parts))
> +  ok_without_ratio_p = valid_mem_ref_p (mem_mode, as, &parts);
> +  if (ratio != 1)
> {
>   parts.step = wide_int_to_tree (type, ratio);
>   /* Addressing mode "base + index << scale".  */
> - if (ratio != 1 && !valid_mem_ref_p (mem_mode, as, &parts))
> + ok_with_ratio_p = valid_mem_ref_p (mem_mode, as, &parts);
> + if (!ok_with_ratio_p)
> parts.step = NULL_TREE;
> -
> +   }
> +  if (ok_with_ratio_p || ok_without_ratio_p)
> +   {
>   if (maybe_nonzero (aff_inv->offset))
> {
>   parts.offset = wide_int_to_tree (sizetype, aff_inv->offset);
> @@ -4444,7 +4451,9 @@ get_address_cost (struct ivopts_data *da
>
>if (parts.symbol != NULL_TREE)
>  cost.complexity += 1;
> -  if (parts.step != NULL_TREE && !integer_onep (parts.step))
> +  /* Don't increase the complexity of adding a scaled index if it's
> + the only kind of index that the target allows.  */
> +  if (parts.step != NULL_TREE && ok_without_ratio_p)
>  cost.complexity += 1;
>if (parts.base != NULL_TREE && parts.index != NULL_TREE)
>  cost.complexity += 1;


Re: [PATCH GCC][2/3]Simplify ((A +- CST1 CMP A +- CST2)) for undefined overflow type

2017-10-19 Thread Bin.Cheng
On Thu, Oct 19, 2017 at 4:33 PM, Marc Glisse  wrote:
> On Thu, 19 Oct 2017, Bin Cheng wrote:
>
>> * match.pd (A +- CST1 CMP A +- CST2): New pattern.
>
>
> Similarly, this has a very large overlap with "X + Z < Y + Z" transforms
> already in match.pd. It may handle X - CST CMP X + CST that the other
> doesn't (?), but we tend to canonicalize X-5 to X+-5 anyway.
And drop this one.

Thanks,
bin
>
> --
> Marc Glisse
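For concreteness, the shape of the folds in question (my example, for a
signed variable a where overflow is undefined):

  a + 2 <  a + 5   -->   2 < 5   -->   true
  a - 1 >= a       -->   false

which the existing "X + Z CMP Y + Z" patterns already subsume.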


Re: [PATCH GCC][1/3]Simplify (A + CST cmp A -> CST cmp zero) for undefined overflow type

2017-10-19 Thread Bin.Cheng
On Thu, Oct 19, 2017 at 4:22 PM, Marc Glisse  wrote:
> On Thu, 19 Oct 2017, Bin Cheng wrote:
>
>> * match.pd (A + CST cmp A  ->  CST cmp zero): New simplification
>> for undefined overflow types in (A + CST CMP A  ->  A CMP' CST').
>
>
> Could you check if you still need that? I recently added something very
> similar (search for "X + Y < Y" in match.pd).
Ah, yes indeed, the two patterns are unnecessary now on TOT.  I will drop this.

Thanks,
bin
>
> --
> Marc Glisse


Re: [PATCH GCC][4/7]Choose exit edge/path when removing inner loop's exit statement

2017-10-19 Thread Bin.Cheng
On Thu, Oct 19, 2017 at 9:31 AM, Tom de Vries  wrote:
> On 10/09/2017 03:34 PM, Richard Biener wrote:
>>
>> On Thu, Oct 5, 2017 at 3:16 PM, Bin Cheng  wrote:
>>>
>>> Hi,
>>> Function generate_loops_for_partition chooses arbitrary path when
>>> removing exit
>>> condition not in partition.  This is fine for now because it's impossible
>>> to have
>>> loop exit condition in case of innermost distribution.  After extending
>>> to loop
>>> nest distribution, we must choose exit edge/path for inner loop's exit
>>> condition,
>>> otherwise an infinite empty loop will be generated.  Test case added.
>>>
>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>
>>
>> Ok.
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>> 2017-10-04  Bin Cheng  
>>>
>>>  * tree-loop-distribution.c (generate_loops_for_partition):
>>> Remove
>>>  inner loop's exit stmt by making it always exit the loop,
>>> otherwise
>>>  we would generate an infinite empty loop.
>>>
>>> gcc/testsuite/ChangeLog
>>> 2017-10-04  Bin Cheng  
>>>
>>>  * gcc.dg/tree-ssa/ldist-27.c: New test.
>
>
> Hi,
>
> I've committed the patch below to specify the stack size requirements of this
> test-case (fixing the test failure for nvptx).
Hi,
Maybe we can simply make the structure a global variable?

Thanks,
bin
>
> Does it make sense to trim down the test-case using #ifdef STACK_SIZE?
>
> Thanks,
> - Tom


Re: [PATCH GCC][7/7]Merge adjacent memset builtin partitions

2017-10-17 Thread Bin.Cheng
On Mon, Oct 16, 2017 at 5:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>>> a useful special case optimization transforming below code:
>>>>>
>>>>> #define M (256)
>>>>> #define N (512)
>>>>>
>>>>> struct st
>>>>> {
>>>>>   int a[M][N];
>>>>>   int c[M];
>>>>>   int b[M][N];
>>>>> };
>>>>>
>>>>> void
>>>>> foo (struct st *p)
>>>>> {
>>>>>   for (unsigned i = 0; i < M; ++i)
>>>>> {
>>>>>   p->c[i] = 0;
>>>>>   for (unsigned j = N; j > 0; --j)
>>>>> {
>>>>>   p->a[i][j - 1] = 0;
>>>>>   p->b[i][j - 1] = 0;
>>>>> }
>>>>> }
>>>>>
>>>>> into a single memset function call, rather than three calls initializing
>>>>> the structure field by field.
>>>>>
>>>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>>>
>>>> +  /* Insertion sort is good enough for the small sub-array.  */
>>>> +  for (k = i + 1; k < j; ++k)
>>>> +   {
>>>> + part1 = (*partitions)[k];
>>>> + for (l = k; l > i; --l)
>>>> +   {
>>>> + part2 = (*partitions)[l - 1];
>>>> + if (part1->builtin->dst_base_offset
>>>> +   >= part2->builtin->dst_base_offset)
>>>> +   break;
>>>> +
>>>> + (*partitions)[l] = part2;
>>>> +   }
>>>> + (*partitions)[l] = part1;
>>>> +   }
>>>>
>>>> so we want to sort [i, j[ after dst_base_offset.  I realize you don't want
>>>> to write a qsort helper for this but I can't wrap my head around the above
>>>> in 5 minutes so ... please!
>>> Hmm, I thought twice about this and now I believe stable sorting (thus
>>> insertion sort)
>>> is required here.  Please see below for explanation.
>>>
>>>>
>>>> You don't seem to check the sizes of the memsets given that they
>>>> obviously cannot overlap(!?)
>>> Yes, given it's quite a special-case transformation, I did add code
>>> checking for overlap cases.
>>>>
>>>> Also why handle memset and not memcpy/memmove or combinations of
>>>> them (for sorting)?
>>>>
>>>>   for ()
>>>>{
>>>>   p->a[i] = 0;
>>>>   p->c[i] = q->c[i];
>>>>   p->b[i] = 0;
>>>>}
>>>>
>>>> with a and b adjacent.  I suppose p->c could be computed by a
>>>> non-builtin partition as well.
>>> Yes, the two memset builtin partitions can be merged in this case, but...
>>>> So don't we want to see if dependences allow sorting all builtin
>>>> partitions next to each other
>>>> as much as possible?  (maybe we do that already?)
>>> The answer to this, the above partition merging, and the use of qsort is no.
>>> I think all the three are the same question here.  For now we only do
>>> topological sort for partitions.  To maximize parallelism (either by merging
>>> normal parallel partitions or merging builtin partitions) requires 
>>> fine-tuned
>>> sorting between partitions that don't depend on each other.
>>> In order to sort all memset/memcpy/memmove, we need check dependence
>>> between all data references between different partitions.  For example, I
>>> created new test ldist-36.c illustrating sorting memcpy along with memset
>>> would generate wrong code because dependence is broken.  It's the same
>>> for qsort.  In extreme case, if the same array is set twice with different 
>>> rhs
>>> value, the order between the two sets needs to be preserved.  Unfortunately,
>>> qsort is unstable and could reorder different sets.  This would break
>>> output dependence.
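Hypothetically, for the example at the top of this thread the merge lets
the three partitions collapse into one call covering the whole object
(assuming no padding between the adjacent fields):

  void
  foo (struct st *p)
  {
    __builtin_memset (p, 0, sizeof (struct st));
  }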

Re: [PATCH GCC][7/7]Merge adjacent memset builtin partitions

2017-10-16 Thread Bin.Cheng
On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>> a useful special case optimization transforming below code:
>>>>
>>>> #define M (256)
>>>> #define N (512)
>>>>
>>>> struct st
>>>> {
>>>>   int a[M][N];
>>>>   int c[M];
>>>>   int b[M][N];
>>>> };
>>>>
>>>> void
>>>> foo (struct st *p)
>>>> {
>>>>   for (unsigned i = 0; i < M; ++i)
>>>> {
>>>>   p->c[i] = 0;
>>>>   for (unsigned j = N; j > 0; --j)
>>>> {
>>>>   p->a[i][j - 1] = 0;
>>>>   p->b[i][j - 1] = 0;
>>>> }
>>>> }
>>>>
>>>> into a single memset function call, rather than three calls initializing
>>>> the structure field by field.
>>>>
>>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>>
>>> +  /* Insertion sort is good enough for the small sub-array.  */
>>> +  for (k = i + 1; k < j; ++k)
>>> +   {
>>> + part1 = (*partitions)[k];
>>> + for (l = k; l > i; --l)
>>> +   {
>>> + part2 = (*partitions)[l - 1];
>>> + if (part1->builtin->dst_base_offset
>>> +   >= part2->builtin->dst_base_offset)
>>> +   break;
>>> +
>>> + (*partitions)[l] = part2;
>>> +   }
>>> + (*partitions)[l] = part1;
>>> +   }
>>>
>>> so we want to sort [i, j[ by dst_base_offset.  I realize you don't want
>>> to write a qsort helper for this but I can't wrap my head around the above
>>> in 5 minutes so ... please!
>> Hmm, I thought twice about this and now I believe stable sorting (thus
>> insertion sort)
>> is required here.  Please see below for explanation.
>>
>>>
>>> You don't seem to check the sizes of the memsets given that they
>>> obviously cannot overlap(!?)
>> Yes, given it's quite a special-case transformation, I did add code
>> checking for overlap cases.
>>>
>>> Also why handle memset and not memcpy/memmove or combinations of
>>> them (for sorting)?
>>>
>>>   for ()
>>>     {
>>>       p->a[i] = 0;
>>>       p->c[i] = q->c[i];
>>>       p->b[i] = 0;
>>>     }
>>>
>>> with a and b adjacent.  I suppose p->c could be computed by a
>>> non-builtin partition as well.
>> Yes, the two memset builtin partitions can be merged in this case, but...
>>> So don't we want to see if dependences allow sorting all builtin
>>> partitions next to each other
>>> as much as possible?  (maybe we do that already?)
>> The answer for this, for the above partition merging, and for the use
>> of qsort is no.  I think all three are the same question here.  For now
>> we only do a topological sort for partitions.  Maximizing parallelism
>> (either by merging normal parallel partitions or merging builtin
>> partitions) requires fine-tuned sorting between partitions that don't
>> depend on each other.
>> In order to sort all memset/memcpy/memmove, we need to check dependences
>> between all data references of different partitions.  For example, I
>> created the new test ldist-36.c illustrating that sorting memcpy along
>> with memset would generate wrong code because a dependence is broken.
>> It's the same for qsort.  In an extreme case, if the same array is set
>> twice with different rhs values, the order between the two sets needs to
>> be preserved.  Unfortunately, qsort is unstable and could reorder
>> different sets.  This would break the output dependence.
>> At the point of this function, the dependence graph is destroyed, so we
>> can't do much beyond special-case handling for memset.  A full solution
>> would require a customized topological sorting process.
>>
>> So, this updated patch keeps the insertion sort with an additional
>> comment explaining why.  Also, two test cases are added showing when
>> memset partitions should be merged (we can't for now) and when memset
>> partitions should not be merged.
> Hmm, I can factor out the sorting loop into a function; that might
> make it easier to read.
Okay, I will use std::stable_sort in this case; that's exactly what we
want here.
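
As a hedged sketch of the idea (a std::vector stand-in for the pass's
partition vector; the field name follows the patch, the helper itself is
hypothetical):

#include <algorithm>
#include <vector>

/* Hypothetical stand-in for the pass's partition representation.  */
struct partition
{
  long dst_base_offset;  /* offset recorded for the memset destination */
};

/* Stable-sort the sub-range [i, j) by dst_base_offset.  Partitions with
   equal offsets keep their original order, which preserves the output
   dependence that an unstable qsort could break.  */
static void
sort_builtin_partitions (std::vector<partition *> &partitions,
                         unsigned i, unsigned j)
{
  std::stable_sort (partitions.begin () + i, partitions.begin () + j,
                    [] (const partition *p1, const partition *p2)
                    { return p1->dst_base_offset < p2->dst_base_offset; });
}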

Thanks,
bin


Re: [PATCH GCC][7/7]Merge adjacent memset builtin partitions

2017-10-16 Thread Bin.Cheng
On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>> a useful special case optimization transforming below code:
>>>
>>> #define M (256)
>>> #define N (512)
>>>
>>> struct st
>>> {
>>>   int a[M][N];
>>>   int c[M];
>>>   int b[M][N];
>>> };
>>>
>>> void
>>> foo (struct st *p)
>>> {
>>>   for (unsigned i = 0; i < M; ++i)
>>>     {
>>>       p->c[i] = 0;
>>>       for (unsigned j = N; j > 0; --j)
>>>         {
>>>           p->a[i][j - 1] = 0;
>>>           p->b[i][j - 1] = 0;
>>>         }
>>>     }
>>> }
>>>
>>> into a single memset function call, rather than three calls initializing
>>> the structure field by field.
>>>
>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>
>> +  /* Insertion sort is good enough for the small sub-array.  */
>> +  for (k = i + 1; k < j; ++k)
>> +   {
>> + part1 = (*partitions)[k];
>> + for (l = k; l > i; --l)
>> +   {
>> + part2 = (*partitions)[l - 1];
>> + if (part1->builtin->dst_base_offset
>> +   >= part2->builtin->dst_base_offset)
>> +   break;
>> +
>> + (*partitions)[l] = part2;
>> +   }
>> + (*partitions)[l] = part1;
>> +   }
>>
>> so we want to sort [i, j[ by dst_base_offset.  I realize you don't want
>> to write a qsort helper for this but I can't wrap my head around the above
>> in 5 minutes so ... please!
> Hmm, I thought twice about this and now I believe stable sorting (thus
> insertion sort)
> is required here.  Please see below for explanation.
>
>>
>> You don't seem to check the sizes of the memsets given that they
>> obviously cannot overlap(!?)
> Yes, given it's quite a special-case transformation, I did add code
> checking for overlap cases.
>>
>> Also why handle memset and not memcpy/memmove or combinations of
>> them (for sorting)?
>>
>>   for ()
>>     {
>>       p->a[i] = 0;
>>       p->c[i] = q->c[i];
>>       p->b[i] = 0;
>>     }
>>
>> with a and b adjacent.  I suppose p->c could be computed by a
>> non-builtin partition as well.
> Yes, the two memset builtin partitions can be merged in this case, but...
>> So don't we want to see if dependences allow sorting all builtin
>> partitions next to each other
>> as much as possible?  (maybe we do that already?)
> The answer for this, for the above partition merging, and for the use
> of qsort is no.  I think all three are the same question here.  For now
> we only do a topological sort for partitions.  Maximizing parallelism
> (either by merging normal parallel partitions or merging builtin
> partitions) requires fine-tuned sorting between partitions that don't
> depend on each other.
> In order to sort all memset/memcpy/memmove, we need to check dependences
> between all data references of different partitions.  For example, I
> created the new test ldist-36.c illustrating that sorting memcpy along
> with memset would generate wrong code because a dependence is broken.
> It's the same for qsort.  In an extreme case, if the same array is set
> twice with different rhs values, the order between the two sets needs to
> be preserved.  Unfortunately, qsort is unstable and could reorder
> different sets.  This would break the output dependence.
> At the point of this function, the dependence graph is destroyed, so we
> can't do much beyond special-case handling for memset.  A full solution
> would require a customized topological sorting process.
>
> So, this updated patch keeps the insertion sort with an additional
> comment explaining why.  Also, two test cases are added showing when
> memset partitions should be merged (we can't for now) and when memset
> partitions should not be merged.
Hmm, I can factor out the sorting loop into a function; that might
make it easier to read.

Thanks,
bin
>
> Bootstrap and test.  Is it OK?
>
> Thanks,
> bin
>
2017-10-14  Bin Cheng  <bin.ch...@arm.com>
>
> * tree-loop-distribution.c (tree-ssa-loop-ivopts.h): New header file.
> (struct builtin_info): New fields.
> (classify_builtin_1): Compute and record base and offset parts for
> memset builtin partition by calling strip_offset.
> (fuse_memset_builtins): New function.
> (finalize_partitions): Fuse adjacent memset partitions by calling
> above function.
> * tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
> Expose the interface.
> * tree-ssa-loop-ivopts.h (strip_offset): New declaration.
>
> gcc/testsuite/ChangeLog
2017-10-14  Bin Cheng  <bin.ch...@arm.com>
>
> * gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
> * gcc.dg/tree-ssa/ldist-32.c: New test.
> * gcc.dg/tree-ssa/ldist-35.c: New test.
> * gcc.dg/tree-ssa/ldist-36.c: New test.


Re: [PATCH GCC][7/7]Merge adjacent memset builtin partitions

2017-10-16 Thread Bin.Cheng
On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> This patch merges adjacent memset builtin partitions if possible.  It is
>> a useful special case optimization transforming below code:
>>
>> #define M (256)
>> #define N (512)
>>
>> struct st
>> {
>>   int a[M][N];
>>   int c[M];
>>   int b[M][N];
>> };
>>
>> void
>> foo (struct st *p)
>> {
>>   for (unsigned i = 0; i < M; ++i)
>>     {
>>       p->c[i] = 0;
>>       for (unsigned j = N; j > 0; --j)
>>         {
>>           p->a[i][j - 1] = 0;
>>           p->b[i][j - 1] = 0;
>>         }
>>     }
>> }
>>
>> into a single memset function call, rather than three calls initializing
>> the structure field by field.
>>
>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>
> +  /* Insertion sort is good enough for the small sub-array.  */
> +  for (k = i + 1; k < j; ++k)
> +   {
> + part1 = (*partitions)[k];
> + for (l = k; l > i; --l)
> +   {
> + part2 = (*partitions)[l - 1];
> + if (part1->builtin->dst_base_offset
> +   >= part2->builtin->dst_base_offset)
> +   break;
> +
> + (*partitions)[l] = part2;
> +   }
> + (*partitions)[l] = part1;
> +   }
>
> so we want to sort [i, j[ by dst_base_offset.  I realize you don't want
> to write a qsort helper for this but I can't wrap my head around the above
> in 5 minutes so ... please!
Hmm, I thought twice about this and now I believe stable sorting (thus
insertion sort)
is required here.  Please see below for explanation.

>
> You don't seem to check the sizes of the memsets given that they
> obviously cannot overlap(!?)
Yes, given it's quite a special-case transformation, I did add code
checking for overlap cases.
>
> Also why handle memset and not memcpy/memmove or combinations of
> them (for sorting)?
>
>   for ()
>     {
>       p->a[i] = 0;
>       p->c[i] = q->c[i];
>       p->b[i] = 0;
>     }
>
> with a and b adjacent.  I suppose p->c could be computed by a
> non-builtin partition as well.
Yes, the two memset builtin partitions can be merged in this case, but...
> So don't we want to see if dependences allow sorting all builtin
> partitions next to each other
> as much as possible?  (maybe we do that already?)
The answer for this, for the above partition merging, and for the use
of qsort is no.  I think all three are the same question here.  For now
we only do a topological sort for partitions.  Maximizing parallelism
(either by merging normal parallel partitions or merging builtin
partitions) requires fine-tuned sorting between partitions that don't
depend on each other.
In order to sort all memset/memcpy/memmove, we need to check dependences
between all data references of different partitions.  For example, I
created the new test ldist-36.c illustrating that sorting memcpy along
with memset would generate wrong code because a dependence is broken.
It's the same for qsort.  In an extreme case, if the same array is set
twice with different rhs values, the order between the two sets needs to
be preserved.  Unfortunately, qsort is unstable and could reorder
different sets.  This would break the output dependence.
At the point of this function, the dependence graph is destroyed, so we
can't do much beyond special-case handling for memset.  A full solution
would require a customized topological sorting process.

So, this updated patch keeps the insertion sort with an additional
comment explaining why.  Also, two test cases are added showing when
memset partitions should be merged (we can't for now) and when memset
partitions should not be merged.
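
To make the output-dependence point concrete, a minimal example of my
own (illustrative only, not one of the new tests):

/* Both statements classify as memset partitions over the same bytes of
   a[] with equal dst_base_offset.  An unstable sort could swap them,
   leaving a[] filled with 1 instead of 2, so the original order, i.e.
   the output dependence, must be preserved.  */
char a[100];

void
foo (void)
{
  for (int i = 0; i < 100; ++i)
    {
      a[i] = 1;
      a[i] = 2;
    }
}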

Bootstrap and test.  Is it OK?

Thanks,
bin

2017-10-14  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (tree-ssa-loop-ivopts.h): New header file.
(struct builtin_info): New fields.
(classify_builtin_1): Compute and record base and offset parts for
memset builtin partition by calling strip_offset.
(fuse_memset_builtins): New function.
(finalize_partitions): Fuse adjacent memset partitions by calling
above function.
* tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
Expose the interface.
* tree-ssa-loop-ivopts.h (strip_offset): New declaration.

gcc/testsuite/ChangeLog
2017-10-14  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
* gcc.dg/tree-ssa/ldist-32.c: New test.
* gcc.dg/tree-ssa/ldist-35.c: New test.
* gcc.dg/tree-ssa/ldist-36.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
index 4efc0a4..b3617f6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
@@ -45,5 +45,5 @@ mad_synth_mute (struct mad_synth *synth)
   return;
 }
 
-/* { dg-final { scan-tree-dump "distributed: split to 0 loops and 4 library calls" "ldist" } } */
-/* { dg-final { scan-tree-dump-times "generated memset zero" 4 "ldist" } } */

Re: [PATCH GCC][6/7]Support loop nest distribution for builtin partition

2017-10-12 Thread Bin.Cheng
On Thu, Oct 12, 2017 at 2:32 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> This patch rewrites the classification part of builtin partitions so that
>> nested builtin partitions are supported.  With this extension, the loop
>> nest below:
>> void
>> foo (void)
>> {
>>   for (unsigned i = 0; i < M; ++i)
>>     for (unsigned j = 0; j < N; ++j)
>>       arr[i][j] = 0;
>> }
>>
>> will be distributed into a single memset, rather than a loop of memsets.
>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>
> +  tree access_size = fold_convert (sizetype, TYPE_SIZE_UNIT (TREE_TYPE (ref)));
> +
>
> TYPE_SIZE_UNIT should be always sizetype.
Done.

>
> +  /* Classify the builtin kind.  */
> +  if (single_ld == NULL)
> +classify_builtin_1 (loop, partition, single_st);
> +  else
> +classify_builtin_2 (loop, rdg, partition, single_st, single_ld);
>
> maybe name those helpers classify_builtin_st and classify_builtin_ldst?
Done.  Patch updated in attachment; will apply it later.
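
For reference, a hedged sketch of the intended end result for the new
ldist-28.c test below (my illustration, not the pass's actual output):

#include <string.h>

#define M (256)
#define N (1024)
int arr[M][N];

/* arr is contiguous, so the whole M x N zeroing nest collapses into a
   single flat call.  */
void
foo_transformed (void)
{
  memset (arr, 0, sizeof (arr));
}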

Thanks,
bin
2017-10-12  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c (struct builtin_info): New struct.
(struct partition): Refactor fields into struct builtin_info.
(partition_free): Free struct builtin_info.
(build_size_arg_loc, build_addr_arg_loc): Delete.
(generate_memset_builtin, generate_memcpy_builtin): Get memory range
information from struct builtin_info.
(find_single_drs): New function refactored from classify_partition.
Also moved builtin validity checks to this function.
(compute_access_range, alloc_builtin): New functions.
(classify_builtin_st, classify_builtin_ldst): New functions.
(classify_partition): Refactor code into functions find_single_drs,
classify_builtin_st and classify_builtin_ldst.
(distribute_loop): Don't do runtime alias check when distributing
loop nest.
(find_seed_stmts_for_distribution): New function.
(pass_loop_distribution::execute): Refactor code finding seed
stmts into above function.  Support distribution for the innermost
two-level loop nest.  Adjust dump information.

gcc/testsuite/ChangeLog
2017-10-12  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-28.c: New test.
* gcc.dg/tree-ssa/ldist-29.c: New test.
* gcc.dg/tree-ssa/ldist-30.c: New test.
* gcc.dg/tree-ssa/ldist-31.c: New test.

>
> Ok with those changes.
>
> Thanks,
> Richard.
>
From 8271ce0851a60b38226e92558bca234774e5503e Mon Sep 17 00:00:00 2001
From: Bin Cheng <bin.ch...@arm.com>
Date: Wed, 27 Sep 2017 13:00:59 +0100
Subject: [PATCH 6/7] loop_nest-builtin-pattern-20171012.txt

---
 gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c |  16 +
 gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c |  17 ++
 gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c |  16 +
 gcc/testsuite/gcc.dg/tree-ssa/ldist-31.c |  19 ++
 gcc/tree-loop-distribution.c | 507 +++
 5 files changed, 377 insertions(+), 198 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ldist-31.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c
new file mode 100644
index 0000000..4420139
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-28.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (1024)
+int arr[M][N];
+
+void
+foo (void)
+{
+  for (unsigned i = 0; i < M; ++i)
+    for (unsigned j = 0; j < N; ++j)
+      arr[i][j] = 0;
+}
+
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 1 library" "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c
new file mode 100644
index 0000000..9ce93e8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-29.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+int arr[M][N];
+
+void
+foo (void)
+{
+  for (unsigned i = 0; i < M; ++i)
+    for (unsigned j = 0; j < N - 1; ++j)
+      arr[i][j] = 0;
+}
+
+/* { dg-final { scan-tree-dump-not "Loop nest . distributed: split to" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "Loop . distributed: split to 0 loops and 1 library" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c
new file mode 100644
index 0000000..f31860a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-30.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */

Re: [PATCH][GRAPHITE] Fix PR82451 (and PR82355 in a different way)

2017-10-12 Thread Bin.Cheng
On Thu, Oct 12, 2017 at 12:13 PM, Richard Biener <rguent...@suse.de> wrote:
> On Thu, 12 Oct 2017, Bin.Cheng wrote:
>
>> On Wed, Oct 11, 2017 at 3:43 PM, Richard Biener <rguent...@suse.de> wrote:
>> >
>> > For PR82355 I introduced a fake dimension to ISL to allow CHRECs
>> > having an evolution in a loop that isn't fully part of the SESE
>> > region we are processing.  That was easier than fending off those
>> > CHRECs (without simply giving up on SESE regions with those).
>> >
>> > But it didn't fully solve the issue as PR82451 shows where we run
>> > into the issue that we eventually have to code-gen those
>> > evolutions and thus in theory need a canonical IV of that containing loop.
>> >
>> > So I decided (after Micha pressuring me a bit...) to revisit the
>> > original issue and make SCEV analysis "properly" handle SE regions.
>> > It turns out that it is mostly instantiate_scev lacking proper support
>> > plus the necessary interfacing change (really just cosmetic in some sense)
>> > from a instantiate_before basic-block to a instantiate_before edge.
>> >
>> > data-ref interfaces have been similarly adjusted, here changing
>> > the "loop nest" loop parameter to the entry edge for the SE region
>> > and passing that down accordingly.
>> >
>> > I've for now tried to keep other high-level loop-based interfaces the
>> > same by simply using the loop preheader edge as entry where appropriate
>> > (needing loop_preheader_edge to cope with the loop root tree for simplicity).
>> >
>> > In the process I ran into issues with us too aggressively
>> > instantiating random trees and thus I cut those down.  That part
>> > doesn't successfully test separately (when I remove the strange
>> > ARRAY_REF instantiation), so it's part of this patch.  I've also
>> > run into an SSA verification fail (the id-27.f90 testcase) which
>> > shows we _do_ need to clear the SCEV cache after introducing
>> > the versioned CFG (and added a comment before it).
>> >
>> > On the previously failing testcases I've verified we produce
>> > sensible instantiations for those pesky refs residing in "no" loop
>> > in the SCOP and that we get away with the result in terms of
>> > optimizing.
>> >
>> > SPEC 2k6 testing shows
>> >
>> > loop nest optimized: 311
>> > loop nest not optimized, code generation error: 0
>> > loop nest not optimized, optimized schedule is identical to original
>> > schedule: 173
>> > loop nest not optimized, optimization timed out: 59
>> > loop nest not optimized, ISL signalled an error: 9
>> > loop nest: 552
>> >
>> > for SPEC 2k6 and -floop-nest-optimize while adding -fgraphite-identity
>> > still reveals some codegen errors:
>> >
>> > loop nest optimized: 437
>> > loop nest not optimized, code generation error: 25
>> > loop nest not optimized, optimized schedule is identical to original
>> > schedule: 169
>> > loop nest not optimized, optimization timed out: 60
>> > loop nest not optimized, ISL signalled an error: 9
>> > loop nest: 700
>> >
>> > Bootstrap and regtest in progress on x86_64-unknown-linux-gnu
>> > (with and without -fgraphite-identity -floop-nest-optimize).
>> >
>> > Ok?
>> >
>> > Thanks,
>> > Richard.
>> >
>>
>> > Index: gcc/tree-scalar-evolution.c
>> > ===
>> > --- gcc/tree-scalar-evolution.c (revision 253645)
>> > +++ gcc/tree-scalar-evolution.c (working copy)
>> > @@ -2344,7 +2348,7 @@ static tree instantiate_scev_r (basic_bl
>> > instantiated, and to stop if it exceeds some limit.  */
>> >
>> >  static tree
>> > -instantiate_scev_name (basic_block instantiate_below,
>> > +instantiate_scev_name (edge instantiate_below,
>> >struct loop *evolution_loop, struct loop 
>> > *inner_loop,
>> >tree chrec,
>> >bool *fold_conversions,
>> > @@ -2358,7 +2362,7 @@ instantiate_scev_name (basic_block insta
>> >   evolutions in outer loops), nothing to do.  */
>> >if (!def_bb
>> >|| loop_depth (def_bb->loop_father) == 0
>> > -  || dominated_by_p (CDI_DOMINATORS, instantiate_below, def_bb))

Re: [PATCH][GRAPHITE] Fix PR82451 (and PR82355 in a different way)

2017-10-12 Thread Bin.Cheng
On Wed, Oct 11, 2017 at 3:43 PM, Richard Biener <rguent...@suse.de> wrote:
>
> For PR82355 I introduced a fake dimension to ISL to allow CHRECs
> having an evolution in a loop that isn't fully part of the SESE
> region we are processing.  That was easier than fending off those
> CHRECs (without simply giving up on SESE regions with those).
>
> But it didn't fully solve the issue as PR82451 shows where we run
> into the issue that we eventually have to code-gen those
> evolutions and thus in theory need a canonical IV of that containing loop.
>
> So I decided (after Micha pressuring me a bit...) to revisit the
> original issue and make SCEV analysis "properly" handle SE regions.
> It turns out that it is mostly instantiate_scev lacking proper support
> plus the necessary interfacing change (really just cosmetic in some sense)
> from a instantiate_before basic-block to a instantiate_before edge.
>
> data-ref interfaces have been similarly adjusted, here changing
> the "loop nest" loop parameter to the entry edge for the SE region
> and passing that down accordingly.
>
> I've for now tried to keep other high-level loop-based interfaces the
> same by simply using the loop preheader edge as entry where appropriate
> (needing loop_preheader_edge to cope with the loop root tree for simplicity).
>
> In the process I ran into issues with us too aggressively
> instantiating random trees and thus I cut those down.  That part
> doesn't successfully test separately (when I remove the strange
> ARRAY_REF instantiation), so it's part of this patch.  I've also
> run into an SSA verification fail (the id-27.f90 testcase) which
> shows we _do_ need to clear the SCEV cache after introducing
> the versioned CFG (and added a comment before it).
>
> On the previously failing testcases I've verified we produce
> sensible instantiations for those pesky refs residing in "no" loop
> in the SCOP and that we get away with the result in terms of
> optimizing.
>
> SPEC 2k6 testing shows
>
> loop nest optimized: 311
> loop nest not optimized, code generation error: 0
> loop nest not optimized, optimized schedule is identical to original
> schedule: 173
> loop nest not optimized, optimization timed out: 59
> loop nest not optimized, ISL signalled an error: 9
> loop nest: 552
>
> for SPEC 2k6 and -floop-nest-optimize while adding -fgraphite-identity
> still reveals some codegen errors:
>
> loop nest optimized: 437
> loop nest not optimized, code generation error: 25
> loop nest not optimized, optimized schedule is identical to original
> schedule: 169
> loop nest not optimized, optimization timed out: 60
> loop nest not optimized, ISL signalled an error: 9
> loop nest: 700
>
> Bootstrap and regtest in progress on x86_64-unknown-linux-gnu
> (with and without -fgraphite-identity -floop-nest-optimize).
>
> Ok?
>
> Thanks,
> Richard.
>

> Index: gcc/tree-scalar-evolution.c
> ===
> --- gcc/tree-scalar-evolution.c (revision 253645)
> +++ gcc/tree-scalar-evolution.c (working copy)
> @@ -2344,7 +2348,7 @@ static tree instantiate_scev_r (basic_bl
> instantiated, and to stop if it exceeds some limit.  */
>
>  static tree
> -instantiate_scev_name (basic_block instantiate_below,
> +instantiate_scev_name (edge instantiate_below,
>struct loop *evolution_loop, struct loop *inner_loop,
>tree chrec,
>bool *fold_conversions,
> @@ -2358,7 +2362,7 @@ instantiate_scev_name (basic_block insta
>   evolutions in outer loops), nothing to do.  */
>if (!def_bb
>|| loop_depth (def_bb->loop_father) == 0
> -  || dominated_by_p (CDI_DOMINATORS, instantiate_below, def_bb))
> +  || ! dominated_by_p (CDI_DOMINATORS, def_bb, instantiate_below->dest))
>  return chrec;
>
>/* We cache the value of instantiated variable to avoid exponential
> @@ -2380,6 +2384,51 @@ instantiate_scev_name (basic_block insta
>
>def_loop = find_common_loop (evolution_loop, def_bb->loop_father);
>
> +  if (! dominated_by_p (CDI_DOMINATORS,
> +   def_loop->header, instantiate_below->dest))
> +{
> +  gimple *def = SSA_NAME_DEF_STMT (chrec);
> +  if (gassign *ass = dyn_cast <gassign *> (def))
> +   {
> + switch (gimple_assign_rhs_class (ass))
> +   {
> +   case GIMPLE_UNARY_RHS:
> + {
> +   tree op0 = instantiate_scev_r (instantiate_below, evolution_loop,
> +  inner_loop, gimple_assign_rhs1 (ass),
> +  fold_conversions, size_expr);
> +   if (op0 == chrec_dont_know)
> + return chrec_dont_know;
> +   res = fold_build1 (gimple_assign_rhs_code (ass),
> +  TREE_TYPE (chrec), op0);
> +   break;
> + }
> +   case GIMPLE_BINARY_RHS:
> + {
> +   

Re: [PATCH GCC][5/7]Extend loop distribution for two-level innermost loop nest

2017-10-11 Thread Bin.Cheng
On Mon, Oct 9, 2017 at 2:48 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> For now distribution pass only handles the innermost loop.  This patch 
>> extends the pass
>> to cover two-level innermost loop nest.  It also refactors code in 
>> pass_loop_distribution::execute
>> for better reading.  Note I restrict it to 2-level loop nest on purpose 
>> because of high
>> cost in data dependence computation.  Some compilation time optimizations 
>> like reusing
>> the data reference finding, data dependence computing, would require a 
>> rewrite of this
>> pass like the proposed loop interchange implementation.  But that's another 
>> task.
>>
>> This patch introduces a temporary TODO for loop nest builtin partition which 
>> is covered
>> by next two patches.
>>
>> With this patch, kernel loop in bwaves now can be distributed, thus exposed 
>> for further
>> interchange.  This patch adds new test for matrix multiplication, as well as 
>> adjusts
>> test strings of existing tests.
>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>
> @ -714,9 +719,11 @@ ssa_name_has_uses_outside_loop_p (tree def, loop_p loop)
>
>FOR_EACH_IMM_USE_FAST (use_p, imm_iter, def)
>  {
> -  gimple *use_stmt = USE_STMT (use_p);
> -  if (!is_gimple_debug (use_stmt)
> - && loop != loop_containing_stmt (use_stmt))
> +  if (is_gimple_debug (USE_STMT (use_p)))
> +   continue;
> +
> +  basic_block use_bb = gimple_bb (USE_STMT (use_p));
> +  if (use_bb == NULL || !flow_bb_inside_loop_p (loop, use_bb))
> return true;
>
> use_bb should never be NULL.
Done.
>
> +  /* Don't support loop nest distribution under runtime alias check
> +     since it's not likely to enable many vectorization opportunities.  */
> +  if (loop->inner)
> +   {
> + merge_dep_scc_partitions (rdg, &partitions, false);
> +   }
>
> extra {}
Done.
>
> +  /* Support loop nest distribution enclosing current innermost loop.
> +     For the moment, we only support the innermost two-level loop nest.  */
> +  if (flag_tree_loop_distribution
> + && outer->num > 0 && outer->inner == loop && loop->next == NULL
>
> The canonical check for is-this-non-root is loop_outer (outer) instead
> of outer->num > 0.
Done.
>
> + && single_exit (outer)
>
> not sure how exits are counted but if the inner loop exits also the
> outer loop do
> we correctly handle/reject this case?
I tend to believe this case can be handled if it's not rejected by the
niters/exit-condition analysis, but I am not very sure about this.
>
> -  if (nb_generated_loops + nb_generated_calls > 0)
> -   {
> - changed = true;
> - dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> -  loc, "Loop %d distributed: split to %d loops "
> -  "and %d library calls.\n",
> -  num, nb_generated_loops, nb_generated_calls);
> + if (nb_generated_loops + nb_generated_calls > 0)
> +   {
> + changed = true;
> + dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> +  loc, "Loop%s %d distributed: split to %d loops "
> +  "and %d library calls.\n",
> +  loop_nest_p ? " nest" : "", loop->num,
> +  nb_generated_loops, nb_generated_calls);
> ...
>
> can you adjust the printfs to say "loop nest distributed" in case we 
> distributed
> a nest?
Done.
>
> Can you rewrite the iteration over the nest so it would theoretically support
> arbitrarily deep perfect nests?  Thus simply initialize loop_nest_p less
> cleverly...
Done.  I factored it out as a function "prepare_perfect_loop_nest".  I
also tested the updated patch by enabling full loop nest distribution;
there were no failures in bootstrap, the regression tests, or the spec
benchmarks.  Of course, the final patch still only supports the
innermost two-level loop nest.
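
For illustration, a perfect two-level nest of the shape
prepare_perfect_loop_nest is meant to accept (my example, not from the
patch): no statements between the two loop headers and a single exit, so
the whole nest is a distribution candidate:

#define M (256)
#define N (512)
double a[M][N], b[M][N];

void
foo (void)
{
  for (unsigned i = 0; i < M; ++i)
    for (unsigned j = 0; j < N; ++j)
      {
        a[i][j] = 0.0;            /* may become one nest-wide memset */
        b[i][j] = b[i][j] * 2.0;  /* stays as a distributed loop nest */
      }
}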

Is this OK?

Thanks,
bin
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

* tree-loop-distribution.c: Adjust the general comment.
(NUM_PARTITION_THRESHOLD): New macro.
(ssa_name_has_uses_outside_loop_p): Support loop nest distribution.
(classify_partition): Skip builtin pattern of loop nest's inner loop.
(merge_dep_scc_partitions): New parameter ignore_alias_p and use it
in call to build_partition_graph.
(finalize_partitions): New parameter.  Make loop distribution more
conservative by fusing more partitions.
(distribute_loop): Don't do runtime alias check in case of loop nest
distribution.
(find_seed_stmts_for_distribution): New function.
(prepare_perfect_loop_nest): New function.
(pass_loop_distribution::execute): Refactor code finding seed stmts
and loop nest into above functions.  Support loop nest distribution.
Adjust dump information accordingly.

gcc/testsuite/ChangeLog
2017-10-04  Bin Cheng  <bin.ch...@arm.com>

Re: [PATCH GCC][3/7]Don't skip renaming PHIs in loop nest with only one inner loop

2017-10-10 Thread Bin.Cheng
On Mon, Oct 9, 2017 at 2:33 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Oct 5, 2017 at 3:16 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> Function rename_variables_in_bb skips renaming PHI nodes in a loop nest if
>> the outer loop has only one inner loop.  This breaks loop nest distribution
>> when the inner loop has a PHI node initialized from an outer-loop variable.
>> Unfortunately, I lost the original C code illustrating the issue.  Now it is
>> only triggered when building spec2006/416.gamess with loop nest
>> distribution, but I failed to reduce a test from it.
>
> Bah, can you re-try isolating a testcase?
Hi Richard,
Right, I managed a simple test with the help of creduce.  Given the
simplicity of the test, I assume the previous approval still holds for
this updated patch and will apply it later.

Thanks,
bin

2017-10-10  Bin Cheng  <bin.ch...@arm.com>

* tree-vect-loop-manip.c (rename_variables_in_bb): Rename PHI nodes
when copying loop nest with only one inner loop.

2017-10-10  Bin Cheng  <bin.ch...@arm.com>

* gcc.dg/tree-ssa/ldist-34.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-34.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-34.c
new file mode 100644
index 0000000..3d68a85
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-34.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution" } */
+
+#define X (3.0)
+int b, c;
+double a[3];
+int foo () {
+  for (int i = 0; i < 100; ++i) {
+    for (int j = 0; j < c; ++j)
+      if (b)
+        a[0] = b;
+    a[i * 100] = a[1] = X;
+  }
+  return 0;
+}
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 2c724a2..9fd65a7 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -117,8 +117,6 @@ rename_variables_in_bb (basic_block bb, bool rename_from_outer_loop)
  || single_pred (e->src) != outer_loop->header)
continue;
}
- else
-   continue;
}
}
   for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);


Re: [PATCH PR82163/V2]New interface checking LCSSA for single loop

2017-09-25 Thread Bin.Cheng
On Sat, Sep 23, 2017 at 6:31 PM, Bernhard Reutner-Fischer
 wrote:
> On Fri, Sep 22, 2017 at 11:37:53AM +0000, Bin Cheng wrote:
>
>> diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
>> index d6ba305..6ad0b75 100644
>> --- a/gcc/tree-ssa-loop-manip.c
>> +++ b/gcc/tree-ssa-loop-manip.c
>> @@ -690,48 +690,62 @@ rewrite_virtuals_into_loop_closed_ssa (struct loop 
>> *loop)
>>rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_VIRTUAL_USES, loop);
>>  }
>
>> -/* Checks invariants of loop closed ssa form in statement STMT in BB.  */
>> +/* Checks invariants of loop closed ssa form in BB.  */
>>
>>  static void
>> -check_loop_closed_ssa_stmt (basic_block bb, gimple *stmt)
>> +check_loop_closed_ssa_bb (basic_block bb)
>>  {
>> -  ssa_op_iter iter;
>> -  tree var;
>> +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
>> +   gsi_next (&bsi))
>> +{
>> +  gphi *phi = bsi.phi ();
>>
>> -  if (is_gimple_debug (stmt))
>> -return;
>> +  if (!virtual_operand_p (PHI_RESULT (phi)))
>> + check_loop_closed_ssa_def (bb, PHI_RESULT (phi));
>> +}
>> +
>> +  for (gimple_stmt_iterator bsi = gsi_start_bb (bb); !gsi_end_p (bsi);
>> +   gsi_next (&bsi))
>> +{
>> +  ssa_op_iter iter;
>> +  tree var;
>> +  gimple *stmt = gsi_stmt (bsi);
>> +
>> +  if (is_gimple_debug (stmt))
>> + continue;
>
> for (gimple_stmt_iterator bsi = gsi_start_nondebug_after_labels_bb (bb);
>  !gsi_end_p (bsi);
>  gsi_next_nondebug (&bsi))
>
> ?
Thanks for the suggestion, patch updated.  I will commit it later
since it's an obvious update.

Thanks,
bin
>>
>> -  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
>> -check_loop_closed_ssa_use (bb, var);
>> +  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
>> + check_loop_closed_ssa_def (bb, var);
>> +}
>>  }
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
new file mode 100644
index 0000000..389d5c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr82163.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a, b, c[4], d, e, f, g;
+
+void h ()
+{
+  for (; a; a++)
+    {
+      c[a + 3] = g;
+      if (b)
+        c[a] = f;
+      else
+        {
+          for (; d; d++)
+            c[d + 3] = c[d];
+          for (e = 1; e == 2; e++)
+            ;
+          if (e)
+            break;
+        }
+    }
+}
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index d6ba305..b08b8b9 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -690,48 +690,59 @@ rewrite_virtuals_into_loop_closed_ssa (struct loop *loop)
   rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_VIRTUAL_USES, loop);
 }
 
-/* Check invariants of the loop closed ssa form for the USE in BB.  */
+/* Check invariants of the loop closed ssa form for the def in DEF_BB.  */
 
 static void
-check_loop_closed_ssa_use (basic_block bb, tree use)
+check_loop_closed_ssa_def (basic_block def_bb, tree def)
 {
-  gimple *def;
-  basic_block def_bb;
+  use_operand_p use_p;
+  imm_use_iterator iterator;
+  FOR_EACH_IMM_USE_FAST (use_p, iterator, def)
+{
+  if (is_gimple_debug (USE_STMT (use_p)))
+   continue;
 
-  if (TREE_CODE (use) != SSA_NAME || virtual_operand_p (use))
-return;
+  basic_block use_bb = gimple_bb (USE_STMT (use_p));
+  if (is_a <gphi *> (USE_STMT (use_p)))
+   use_bb = EDGE_PRED (use_bb, PHI_ARG_INDEX_FROM_USE (use_p))->src;
 
-  def = SSA_NAME_DEF_STMT (use);
-  def_bb = gimple_bb (def);
-  gcc_assert (!def_bb
- || flow_bb_inside_loop_p (def_bb->loop_father, bb));
+  gcc_assert (flow_bb_inside_loop_p (def_bb->loop_father, use_bb));
+}
 }
 
-/* Checks invariants of loop closed ssa form in statement STMT in BB.  */
+/* Checks invariants of loop closed ssa form in BB.  */
 
 static void
-check_loop_closed_ssa_stmt (basic_block bb, gimple *stmt)
+check_loop_closed_ssa_bb (basic_block bb)
 {
-  ssa_op_iter iter;
-  tree var;
+  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+   gsi_next (&bsi))
+{
+  gphi *phi = bsi.phi ();
 
-  if (is_gimple_debug (stmt))
-return;
+  if (!virtual_operand_p (PHI_RESULT (phi)))
+   check_loop_closed_ssa_def (bb, PHI_RESULT (phi));
+}
 
-  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_USE)
-check_loop_closed_ssa_use (bb, var);
+  for (gimple_stmt_iterator bsi = gsi_start_nondebug_bb (bb); !gsi_end_p (bsi);
+   gsi_next_nondebug (&bsi))
+{
+  ssa_op_iter iter;
+  tree var;
+  gimple *stmt = gsi_stmt (bsi);
+
+  FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+   check_loop_closed_ssa_def (bb, var);
+}
 }
 
 /* Checks that invariants of the loop closed ssa form are preserved.
-   Call verify_ssa when VERIFY_SSA_P is true.  */
+   Call verify_ssa when VERIFY_SSA_P is true.  Note all loops are checked
+   if LOOP is NULL, otherwise, only LOOP is checked.  */

Re: [PATCH][GRAPHITE] More TLC

2017-09-25 Thread Bin.Cheng
On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener <rguent...@suse.de> wrote:
> On Mon, 25 Sep 2017, Richard Biener wrote:
>
>> On Fri, 22 Sep 2017, Richard Biener wrote:
>>
>> >
>> > This simplifies canonicalize_loop_closed_ssa and does other minimal
>> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
>> > when reworking canonicalize_loop_closed_ssa.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>> >
>> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
>> > -Ofast -march=haswell -floop-nest-optimize are
>> >
>> >  61 loop nests "optimized"
>> >  45 loop nest transforms cancelled because of code generation issues
>> >  21 loop nest optimizations timed out the 35 ISL "operations" we allow
>>
>> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
>> and 709 sec. with (this was with release checking).
>>
>> A single-run has 416.gamess (580s -> 618s),
>> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
>> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s ->
>> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
>> do a 3-run for those to confirm (it would be only a single regression
>> for 416.gamess).
>
> 416.gamess regression confirmed, 450.soplex improvement as well,
436/437 improvements?  450.soplex (229s -> 226s) looks like noise.

Thanks,
bin
> in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> that's noise.
>
> Richard.


Re: [PATCH GCC]A simple implementation of loop interchange

2017-09-22 Thread Bin.Cheng
On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch implements a simple loop interchange pass in GCC, as described 
>>>> by its comments:
>>>> +/* This pass performs loop interchange: for example, the loop nest
>>>> +
>>>> +   for (int j = 0; j < N; j++)
>>>> +     for (int k = 0; k < N; k++)
>>>> +       for (int i = 0; i < N; i++)
>>>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   is transformed to
>>>> +
>>>> +   for (int i = 0; i < N; i++)
>>>> +     for (int j = 0; j < N; j++)
>>>> +       for (int k = 0; k < N; k++)
>>>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   This pass implements loop interchange in the following steps:
>>>> +
>>>> + 1) Find perfect loop nest for each innermost loop and compute data
>>>> +   dependence relations for it.  For above example, loop nest is
>>>> +   <loop_j, loop_k, loop_i>.
>>>> + 2) From innermost to outermost loop, this pass tries to interchange
>>>> +   each loop pair.  For above case, it firstly tries to interchange
>>>> +   <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>>> +   Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>>> +   <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>>> +   loop to the outermost position.  For loop pair <loop_i, loop_j>
>>>> +   to be interchanged, we:
>>>> + 3) Check if data dependence relations are valid for loop interchange.
>>>> + 4) Check if both loops can be interchanged in terms of transformation.
>>>> + 5) Check if interchanging the two loops is profitable.
>>>> + 6) Interchange the two loops by mapping induction variables.
>>>> +
>>>> +   This pass also handles reductions in loop nest.  So far we only support
>>>> +   simple reduction of inner loop and double reduction of the loop nest.  
>>>> */
>>>>
>>>> Actually, this pass only does loop shifting, which moves the innermost
>>>> loop to the outermost position, rather than general permutation.  Also,
>>>> as a traditional loop optimizer, it only works for perfect loop nests.
>>>> I put it just after loop distribution so that ideally loop
>>>> split/distribution could create perfect nests for it.  Unfortunately,
>>>> we don't get any perfect nest from distribution for now because it only
>>>> works on the innermost loop.  For example, the motivating case in
>>>> spec2k6/bwaves is not handled by this pass alone.  I have a patch
>>>> extending distribution to (innermost) loop nests, and with that patch
>>>> the bwaves case can be handled.
>>>> Another point: I deliberately made both the cost model and the code
>>>> transformation (very) conservative.  We can support more cases, or more
>>>> transformations, with great care once they are known for sure to be
>>>> beneficial.  IMHO, we already hit over-baked issues quite often and
>>>> don't want to introduce more.
>>>> As for code generation, this patch has an issue that invariant code in
>>>> the outer loop could be moved into the inner loop.  For the moment, we
>>>> rely on the last lim pass to handle all invariants generated during
>>>> interchange.  In the future, we may need to avoid that in interchange
>>>> itself, or add another lim pass just like the one after the graphite
>>>> optimizations.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Various benchmarks built and
>>>> ran successfully.  Note this pass is disabled in the patch, while the
>>>> code is exercised by bootstrapping/building programs with it enabled by
>>>> default.  Any comments?
>>>
>> Thanks for the quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */

Re: [PATCH PR82163]Rewrite loop into lcssa form instantly

2017-09-15 Thread Bin.Cheng
On Fri, Sep 15, 2017 at 12:49 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 5:02 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> The current pcom implementation rewrites into lcssa form after all loops
>> are transformed; this is not enough because unrolling of a later loop
>> checks lcssa form in function tree_transform_and_unroll_loop.
>> This simple patch rewrites the loop into lcssa form instantly if a
>> store-store chain is handled.  I think it doesn't affect compilation time
>> since rewrite_into_loop_closed_ssa_1 is only called for store-store chain
>> transformation and only the transformed loop is rewritten.
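
For readers unfamiliar with the form, a hedged sketch of the invariant
being restored (my illustration; the GIMPLE in the comments is
paraphrased, not actual compiler output):

/* In loop-closed SSA, every name defined in a loop and used after it is
   carried out through a single-argument PHI on the exit block.  For
   this C function:  */

int
last_index (int n)
{
  int i = 0;
  while (i < n)
    i = i + 1;
  return i;  /* use of i outside the loop */
}

/* ...the loop body defines something like i_2, and LCSSA inserts
       i_3 = PHI <i_2(loop)>
   on the exit block so the return uses i_3 instead of reaching into the
   loop.  tree_transform_and_unroll_loop asserts this form, which is why
   the store-store chain transformation must restore it before any later
   unrolling.  */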
>
> Well, it may look like only the transformed loop is rewritten -- yes,
> it is, but rewrite_into_loop_closed_ssa
> calls update_ssa () which operates on the whole function.
I see.
>
> So I'd rather _not_ do this.
>
> Is there a real problem or is it just the overly aggressive checking
> done?  IMHO we should remove
In this case, it's the check itself.
> the checking or pass in a param whether to skip the checking.  Or even
> better, restrict the
> checking to those loops trans_form_and_unroll actually touches.
Yes, will see if we can restrict the check to the loops
tree_transform_and_unroll_loop actually touches.

Thanks,
bin
>
> Richard.
>
>> Bootstrap and test ongoing on x86_64.  is it OK if no failures?
>>
>> Thanks,
>> bin
2017-09-14  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/82163
>> * tree-predcom.c (tree_predictive_commoning_loop): Rewrite into
>> loop closed ssa instantly.  Return boolean true if loop is unrolled.
>> (tree_predictive_commoning): Return TODO_cleanup_cfg if loop is
>> unrolled.
>>
>> gcc/testsuite
2017-09-14  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/82163
>> * gcc.dg/tree-ssa/pr82163.c: New test.


Re: [PATCH GCC]A simple implementation of loop interchange

2017-09-04 Thread Bin.Cheng
On Mon, Sep 4, 2017 at 2:54 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 6:32 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch implements a simple loop interchange pass in GCC, as described 
>>>> by its comments:
>>>> +/* This pass performs loop interchange: for example, the loop nest
>>>> +
>>>> +   for (int j = 0; j < N; j++)
>>>> +     for (int k = 0; k < N; k++)
>>>> +       for (int i = 0; i < N; i++)
>>>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   is transformed to
>>>> +
>>>> +   for (int i = 0; i < N; i++)
>>>> +     for (int j = 0; j < N; j++)
>>>> +       for (int k = 0; k < N; k++)
>>>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>>>> +
>>>> +   This pass implements loop interchange in the following steps:
>>>> +
>>>> + 1) Find perfect loop nest for each innermost loop and compute data
>>>> +   dependence relations for it.  For above example, loop nest is
>>>> +   <loop_j, loop_k, loop_i>.
>>>> + 2) From innermost to outermost loop, this pass tries to interchange
>>>> +   each loop pair.  For above case, it firstly tries to interchange
>>>> +   <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>>>> +   Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>>>> +   <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>>>> +   loop to the outermost position.  For loop pair <loop_i, loop_j>
>>>> +   to be interchanged, we:
>>>> + 3) Check if data dependence relations are valid for loop interchange.
>>>> + 4) Check if both loops can be interchanged in terms of transformation.
>>>> + 5) Check if interchanging the two loops is profitable.
>>>> + 6) Interchange the two loops by mapping induction variables.
>>>> +
>>>> +   This pass also handles reductions in loop nest.  So far we only support
>>>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>>>
>>>> Actually, this pass only does loop shifting, which moves the innermost
>>>> loop to the outermost position, rather than general permutation.  Also,
>>>> as a traditional loop optimizer, it only works for perfect loop nests.
>>>> I put it just after loop distribution so that ideally loop
>>>> split/distribution could create perfect nests for it.  Unfortunately,
>>>> we don't get any perfect nest from distribution for now because it only
>>>> works on the innermost loop.  For example, the motivating case in
>>>> spec2k6/bwaves is not handled by this pass alone.  I have a patch
>>>> extending distribution to (innermost) loop nests, and with that patch
>>>> the bwaves case can be handled.
>>>> Another point: I deliberately made both the cost model and the code
>>>> transformation (very) conservative.  We can support more cases, or more
>>>> transformations, with great care once they are known for sure to be
>>>> beneficial.  IMHO, we already hit over-baked issues quite often and
>>>> don't want to introduce more.
>>>> As for code generation, this patch has an issue that invariant code in
>>>> the outer loop could be moved into the inner loop.  For the moment, we
>>>> rely on the last lim pass to handle all invariants generated during
>>>> interchange.  In the future, we may need to avoid that in interchange
>>>> itself, or add another lim pass just like the one after the graphite
>>>> optimizations.
>>>>
>>>> Bootstrap and test on x86_64 and AArch64.  Various benchmarks built and
>>>> ran successfully.  Note this pass is disabled in the patch, while the
>>>> code is exercised by bootstrapping/building programs with it enabled by
>>>> default.  Any comments?
>>>
>> Thanks for the quick review.
>>> +/* The same as above, but this one is only used for interchanging not
>>> +   innermost loops.  */

Re: [PATCH GCC]A simple implementation of loop interchange

2017-08-30 Thread Bin.Cheng
On Wed, Aug 30, 2017 at 3:19 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Wed, Aug 30, 2017 at 3:18 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>> Hi,
>> This patch implements a simple loop interchange pass in GCC, as described by 
>> its comments:
>> +/* This pass performs loop interchange: for example, the loop nest
>> +
>> +   for (int j = 0; j < N; j++)
>> +     for (int k = 0; k < N; k++)
>> +       for (int i = 0; i < N; i++)
>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   is transformed to
>> +
>> +   for (int i = 0; i < N; i++)
>> +     for (int j = 0; j < N; j++)
>> +       for (int k = 0; k < N; k++)
>> +         c[i][j] = c[i][j] + a[i][k]*b[k][j];
>> +
>> +   This pass implements loop interchange in the following steps:
>> +
>> + 1) Find perfect loop nest for each innermost loop and compute data
>> +   dependence relations for it.  For above example, loop nest is
>> +   <loop_j, loop_k, loop_i>.
>> + 2) From innermost to outermost loop, this pass tries to interchange
>> +   each loop pair.  For above case, it firstly tries to interchange
>> +   <loop_k, loop_i> and loop nest becomes <loop_j, loop_i, loop_k>.
>> +   Then it tries to interchange <loop_j, loop_i> and loop nest becomes
>> +   <loop_i, loop_j, loop_k>.  The overall effect is to move innermost
>> +   loop to the outermost position.  For loop pair <loop_i, loop_j>
>> +   to be interchanged, we:
>> + 3) Check if data dependence relations are valid for loop interchange.
>> + 4) Check if both loops can be interchanged in terms of transformation.
>> + 5) Check if interchanging the two loops is profitable.
>> + 6) Interchange the two loops by mapping induction variables.
>> +
>> +   This pass also handles reductions in loop nest.  So far we only support
>> +   simple reduction of inner loop and double reduction of the loop nest.  */
>>
>> Actually, this pass only does loop shifting, which moves the innermost
>> loop to the outermost position, rather than general permutation.  Also,
>> as a traditional loop optimizer, it only works for perfect loop nests.
>> I put it just after loop distribution so that ideally loop
>> split/distribution could create perfect nests for it.  Unfortunately,
>> we don't get any perfect nest from distribution for now because it only
>> works on the innermost loop.  For example, the motivating case in
>> spec2k6/bwaves is not handled by this pass alone.  I have a patch
>> extending distribution to (innermost) loop nests, and with that patch
>> the bwaves case can be handled.
>> Another point: I deliberately made both the cost model and the code
>> transformation (very) conservative.  We can support more cases, or more
>> transformations, with great care once they are known for sure to be
>> beneficial.  IMHO, we already hit over-baked issues quite often and
>> don't want to introduce more.
>> As for code generation, this patch has an issue that invariant code in
>> the outer loop could be moved into the inner loop.  For the moment, we
>> rely on the last lim pass to handle all invariants generated during
>> interchange.  In the future, we may need to avoid that in interchange
>> itself, or add another lim pass just like the one after the graphite
>> optimizations.
>>
>> Bootstrap and test on x86_64 and AArch64.  Various benchmarks built and
>> ran successfully.  Note this pass is disabled in the patch, while the
>> code is exercised by bootstrapping/building programs with it enabled by
>> default.  Any comments?
>
Thanks for the quick review.
> +/* The same as above, but this one is only used for interchanging not
> +   innermost loops.  */
> +#define OUTER_STRIDE_RATIO (2)
>
> please make all these knobs --params.
>
> +/* Enum type for loop reduction variable.  */
> +enum reduction_type
> +{
> +  UNKNOWN_RTYPE = 0,
> +  SIMPLE_RTYPE,
> +  DOUBLE_RTYPE
> +};
>
> seeing this we should have some generic data structure / analysis for
> reduction detection.  This adds a third user (autopar and vectorizer
> are the others).  Just an idea.
>
> +/* Return true if E is abnormal edge.  */
> +
> +static inline bool
> +abnormal_edge (edge e)
> +{
> +  return (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP));
> +}
>
> bad name/comment for what it does.
>
> ... jumping to end of file / start of pass
>
> +  /* Get the outer loop.  */
> +  loop = superloop_at_depth (loop, loop_depth (loop) - 1);
>
> loop_outer (loop)?
>
> +  /* Only support rectangle loop nest, i.e., inner loop's niters doesn't
> +     depend on outer loop's IV.  */
> +  if (chrec_contains_symbols_defined_in_loop (niters, loop->num))
> +   break;
>
> but you don't check for a three-nest whether niters depends on outer outer
> loop's IV that way.  Either the check is superfluous here or incomplete.
It is checked for the multi-nest case in can_interchange_loops.  I will
move the check to this function so that we can save compilation time.
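
A hedged example of what that check rejects (mine, not from the patch):
a triangular nest, where the inner loop's iteration count varies with
the outer IV:

double a[100][100];

void
foo (int n)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < i; ++j)  /* niters of the j-loop depends on i */
      a[i][j] = 0.0;
}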
>
> +  /* Check if start_loop forms a perfect loop nest.  */
> +  

Re: PR81635: Use chrecs to help find related data refs

2017-08-17 Thread Bin.Cheng
On Thu, Aug 17, 2017 at 12:35 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
>>>> <richard.sandif...@linaro.org> wrote:
>>>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>>>>>> <richard.sandif...@linaro.org> wrote:
>>>>>>> The first loop in the testcase regressed after my recent changes to
>>>>>>> dr_analyze_innermost.  Previously we would treat "i" as an iv even
>>>>>>> for bb analysis and end up with:
>>>>>>>
>>>>>>>DR_BASE_ADDRESS: p or q
>>>>>>>DR_OFFSET: 0
>>>>>>>DR_INIT: 0 or 4
>>>>>>>DR_STEP: 16
>>>>>>>
>>>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>>>>>
>>>>>>>DR_BASE_ADDRESS: p or q
>>>>>>>DR_OFFSET: (intptr_t) i
>>>>>>>DR_INIT: 0 or 4
>>>>>>>DR_STEP: 0
>>>>>>>
>>>>>>> This is also what we'd like to have for the unsigned "i", but the
>>>>>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>>>>>> longer seem to be related to the [i] ones.
>>>>>>
>>>>>> Didn't read the change in detail, so sorry if I mis-understood the issue.
>>>>>> I made changes in scev to better fold type conversion by various sources
>>>>>> of information, for example, vrp, niters, undefined overflow behavior 
>>>>>> etc.
>>>>>> In theory this information should be available for other optimizers
>>>>>> without querying scev.  For the mentioned test, vrp should compute
>>>>>> accurate range information for "i" so that we can fold
>>>>>> (intptr_t) (i + 1) without worrying about overflow.  Note we don't
>>>>>> do it in generic folding because
>>>>>> (intptr_t) (i) + 1
>>>>>> could be more expensive (especially in case of (T)(i + j)), or because 
>>>>>> the
>>>>>> CST part is in bigger precision after conversion.
>>>>>> But such folding is wanted in several places, e.g., IVOPTs.  To
>>>>>> provide such an interface, we changed tree-affine and made it do
>>>>>> aggressive folding.  I am curious if it's possible to use aff_tree to
>>>>>> implement strip_constant_offset here since aggressive folding is
>>>>>> wanted.  After all, using an additional chrec looks a little heavy
>>>>>> wrt the simple test.
>>>>>
>>>>> Yeah, using aff_tree does work here when the bounds are constant.
>>>>> It doesn't look like it works for things like:
>>>>>
>>>>> double p[1000];
>>>>> double q[1000];
>>>>>
>>>>> void
>>>>> f4 (unsigned int n)
>>>>> {
>>>>>   for (unsigned int i = 0; i < n; i += 4)
>>>>>     {
>>>>>       double a = q[i] + p[i];
>>>>>       double b = q[i + 1] + p[i + 1];
>>>>>       q[i] = a;
>>>>>       q[i + 1] = b;
>>>>>     }
>>>>> }
>>>>>
>>>>> though, where the bounds on the global arrays guarantee that [i + 1] can't
>>>>> overflow, even though "n" is unconstrained.  The patch as posted handles
>>>>> this case too.
>>>> BTW is this a missed optimization in value range analysis?  The range
>>>> information for i should flow in a way like: array boundary -> niters
>>>> -> scev/vrp.  I think that's what niters/scev do in analysis.

Re: PR81635: Use chrecs to help find related data refs

2017-08-17 Thread Bin.Cheng
On Wed, Aug 16, 2017 at 6:50 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>>>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>>>> <richard.sandif...@linaro.org> wrote:
>>>>> The first loop in the testcase regressed after my recent changes to
>>>>> dr_analyze_innermost.  Previously we would treat "i" as an iv even
>>>>> for bb analysis and end up with:
>>>>>
>>>>>DR_BASE_ADDRESS: p or q
>>>>>DR_OFFSET: 0
>>>>>DR_INIT: 0 or 4
>>>>>DR_STEP: 16
>>>>>
>>>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>>>
>>>>>DR_BASE_ADDRESS: p or q
>>>>>DR_OFFSET: (intptr_t) i
>>>>>DR_INIT: 0 or 4
>>>>>DR_STEP: 0
>>>>>
>>>>> This is also what we'd like to have for the unsigned "i", but the
>>>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>>>> longer seem to be related to the [i] ones.
>>>>
>>>> Didn't read the change in detail, so sorry if I misunderstood the issue.
>>>> I made changes in scev to better fold type conversions using various
>>>> sources of information, for example, vrp, niters, undefined overflow
>>>> behavior etc.  In theory this information should be available for other
>>>> optimizers without querying scev.  For the mentioned test, vrp should
>>>> compute an accurate range for "i" so that we can fold (intptr_t) (i + 1)
>>>> without worrying about overflow.  Note we don't do it in generic folding
>>>> because (intptr_t) (i) + 1 could be more expensive (especially in case of
>>>> (T)(i + j)), or because the CST part is in bigger precision after
>>>> conversion.
>>>> But such folding is wanted in several places, e.g., IVOPTs.  To provide
>>>> such an interface, we changed tree-affine and made it do aggressive
>>>> folding.  I am curious if it's possible to use aff_tree to implement
>>>> strip_constant_offset here, since aggressive folding is wanted.  After
>>>> all, using an additional chrec looks a little heavyweight for the simple
>>>> test.
>>>
>>> Yeah, using aff_tree does work here when the bounds are constant.
>>> It doesn't look like it works for things like:
>>>
>>> double p[1000];
>>> double q[1000];
>>>
>>> void
>>> f4 (unsigned int n)
>>> {
>>>   for (unsigned int i = 0; i < n; i += 4)
>>> {
>>>   double a = q[i] + p[i];
>>>   double b = q[i + 1] + p[i + 1];
>>>   q[i] = a;
>>>   q[i + 1] = b;
>>> }
>>> }
>>>
>>> though, where the bounds on the global arrays guarantee that [i + 1] can't
>>> overflow, even though "n" is unconstrained.  The patch as posted handles
>>> this case too.
>> BTW, is this a missed optimization in value range analysis?  The range
>> information for i should flow in a way like: array boundary -> niters
>> -> scev/vrp.  I think that's what niters/scev do in the analysis.
>
> Yeah, maybe :-)  It looks like the problem is that when SLP runs,
> the previous VRP pass came before loop header copying, so the (single)
> header has to cope with n == 0 case.  Thus we get:
Ah, there are several passes in between vrp and pass_ch; I am not sure if
any such pass depends on vrp intensively.  I would suggest reordering the
two passes, or adding a standalone VRP interface that updates information
for the loop region after the header is copied.  This is a non-trivial
issue that needs to be fixed.  The niters analyzer relies heavily on
simplify_using_initial_conditions to get the same information, which in
my opinion should be provided by VRP.  Though this won't make
simplify_using_initial_conditions obsolete, because VRP is weak in
symbolic ranges...
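
For illustration, a sketch of what header copying changes for the f4 loop
above (a simplified C-level view using f4's q, p and n; illustrative only,
not from the patch):

    /* Before pass_ch: a single header handles both the entry and the
       latch edge, so it must also cope with the n == 0 case, and the
       ranges VRP derives there are correspondingly weak.  */
    unsigned int i = 0;
    while (i < n)
      {
        q[i] = q[i] + p[i];
        i += 4;
      }

    /* After pass_ch: the first test is peeled, the body is only
       reached when n > 0, and VRP sees i < n on entry to every
       iteration.  */
    if (0 < n)
      {
        unsigned int j = 0;
        do
          {
            q[j] = q[j] + p[j];
            j += 4;
          }
        while (j < n);
      }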

>
>   Visiting statement:
>   i

Re: PR81635: Use chrecs to help find related data refs

2017-08-16 Thread Bin.Cheng
On Wed, Aug 16, 2017 at 5:00 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> The first loop in the testcase regressed after my recent changes to
>>> dr_analyze_innermost.  Previously we would treat "i" as an iv even
>>> for bb analysis and end up with:
>>>
>>>DR_BASE_ADDRESS: p or q
>>>DR_OFFSET: 0
>>>DR_INIT: 0 or 4
>>>DR_STEP: 16
>>>
>>> We now always keep the step as 0 instead, so for an int "i" we'd have:
>>>
>>>DR_BASE_ADDRESS: p or q
>>>DR_OFFSET: (intptr_t) i
>>>DR_INIT: 0 or 4
>>>DR_STEP: 0
>>>
>>> This is also what we'd like to have for the unsigned "i", but the
>>> problem is that strip_constant_offset thinks that the "i + 1" in
>>> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
>>> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
>>> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
>>> longer seem to be related to the [i] ones.
>>
>> Didn't read the change in detail, so sorry if I misunderstood the issue.
>> I made changes in scev to better fold type conversions using various
>> sources of information, for example, vrp, niters, undefined overflow
>> behavior etc.  In theory this information should be available for other
>> optimizers without querying scev.  For the mentioned test, vrp should
>> compute an accurate range for "i" so that we can fold (intptr_t) (i + 1)
>> without worrying about overflow.  Note we don't do it in generic folding
>> because (intptr_t) (i) + 1 could be more expensive (especially in case of
>> (T)(i + j)), or because the CST part is in bigger precision after
>> conversion.
>> But such folding is wanted in several places, e.g., IVOPTs.  To provide
>> such an interface, we changed tree-affine and made it do aggressive
>> folding.  I am curious if it's possible to use aff_tree to implement
>> strip_constant_offset here, since aggressive folding is wanted.  After
>> all, using an additional chrec looks a little heavyweight for the simple
>> test.
>
> Yeah, using aff_tree does work here when the bounds are constant.
> It doesn't look like it works for things like:
>
> double p[1000];
> double q[1000];
>
> void
> f4 (unsigned int n)
> {
>   for (unsigned int i = 0; i < n; i += 4)
> {
>   double a = q[i] + p[i];
>   double b = q[i + 1] + p[i + 1];
>   q[i] = a;
>   q[i + 1] = b;
> }
> }
>
> though, where the bounds on the global arrays guarantee that [i + 1] can't
> overflow, even though "n" is unconstrained.  The patch as posted handles
> this case too.
BTW, is this a missed optimization in value range analysis?  The range
information for i should flow in a way like: array boundary -> niters
-> scev/vrp.  I think that's what niters/scev do in the analysis.

Thanks,
bin
>
> Thanks,
> Richard
>
>>
>> Thanks,
>> bin
>>>
>>> There seem to be two main uses of DR_OFFSET and DR_INIT.  One type
>>> expects DR_OFFSET and DR_INIT to be generic expressions whose sum
>>> gives the initial offset from DR_BASE_ADDRESS.  The other type uses
>>> the pair (DR_BASE_ADDRESS, DR_OFFSET) to identify related data
>>> references, with the distance between their start addresses being
>>> the difference between their DR_INITs.  For this second type, the
>>> exact form of DR_OFFSET doesn't really matter, and the more it is
>>> able to canonicalise the non-constant offset, the better.
>>>
>>> The patch fixes the PR by adding another offset/init pair to the
>>> innermost loop behaviour for this second group.  The new pair use chrecs
>>> rather than generic exprs for the offset part, with the chrec being
>>> analysed in the innermost loop for which the offset isn't invariant.
>>> This allows us to vectorise the second function in the testcase
>>> as well, which wasn't possible before the earlier patch.
>>>
>>> Tested on x86_64-linux-gnu and aarch64-linux-gnu.  OK to install?
>>>
>>> Richard
>>>
>>>
>>> gcc/
>>> PR tree-optimization/81635
>>> * tree-data-ref.h (innermost_loop_behavior): Add chrec_offset and
>>> chrec_init.

Re: PR81635: Use chrecs to help find related data refs

2017-08-16 Thread Bin.Cheng
On Wed, Aug 16, 2017 at 2:38 PM, Richard Sandiford
 wrote:
> The first loop in the testcase regressed after my recent changes to
> dr_analyze_innermost.  Previously we would treat "i" as an iv even
> for bb analysis and end up with:
>
>DR_BASE_ADDRESS: p or q
>DR_OFFSET: 0
>DR_INIT: 0 or 4
>DR_STEP: 16
>
> We now always keep the step as 0 instead, so for an int "i" we'd have:
>
>DR_BASE_ADDRESS: p or q
>DR_OFFSET: (intptr_t) i
>DR_INIT: 0 or 4
>DR_STEP: 0
>
> This is also what we'd like to have for the unsigned "i", but the
> problem is that strip_constant_offset thinks that the "i + 1" in
> "(intptr_t) (i + 1)" could wrap and so doesn't peel off the "+ 1".
> The [i + 1] accesses therefore have a DR_OFFSET equal to the SSA
> name that holds "(intptr_t) (i + 1)", meaning that the accesses no
> longer seem to be related to the [i] ones.

Didn't read the change in detail, so sorry if I misunderstood the issue.
I made changes in scev to better fold type conversions using various
sources of information, for example, vrp, niters, undefined overflow
behavior etc.  In theory this information should be available for other
optimizers without querying scev.  For the mentioned test, vrp should
compute an accurate range for "i" so that we can fold (intptr_t) (i + 1)
without worrying about overflow.  Note we don't do it in generic folding
because (intptr_t) (i) + 1 could be more expensive (especially in case of
(T)(i + j)), or because the CST part is in bigger precision after
conversion.
But such folding is wanted in several places, e.g., IVOPTs.  To provide
such an interface, we changed tree-affine and made it do aggressive
folding.  I am curious if it's possible to use aff_tree to implement
strip_constant_offset here, since aggressive folding is wanted.  After
all, using an additional chrec looks a little heavyweight for the simple
test.
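
As a concrete illustration of the wraparound concern (a minimal example,
assuming 32-bit unsigned int and 64-bit intptr_t; not part of the patch):

    #include <stdint.h>

    /* Unsigned arithmetic wraps, so for i == UINT32_MAX the two
       functions differ: f1 returns 0 while f2 returns 0x100000000.
       Proving that i + 1 cannot wrap (via range information) is what
       licenses the fold (intptr_t)(i + 1) -> (intptr_t)i + 1.  */
    intptr_t f1 (unsigned int i) { return (intptr_t) (i + 1); }
    intptr_t f2 (unsigned int i) { return (intptr_t) i + 1; }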

Thanks,
bin
>
> There seem to be two main uses of DR_OFFSET and DR_INIT.  One type
> expects DR_OFFSET and DR_INIT to be generic expressions whose sum
> gives the initial offset from DR_BASE_ADDRESS.  The other type uses
> the pair (DR_BASE_ADDRESS, DR_OFFSET) to identify related data
> references, with the distance between their start addresses being
> the difference between their DR_INITs.  For this second type, the
> exact form of DR_OFFSET doesn't really matter, and the more it is
> able to canonicalise the non-constant offset, the better.
>
> The patch fixes the PR by adding another offset/init pair to the
> innermost loop behaviour for this second group.  The new pair use chrecs
> rather than generic exprs for the offset part, with the chrec being
> analysed in the innermost loop for which the offset isn't invariant.
> This allows us to vectorise the second function in the testcase
> as well, which wasn't possible before the earlier patch.
>
> Tested on x86_64-linux-gnu and aarch64-linux-gnu.  OK to install?
>
> Richard
>
>
> gcc/
> PR tree-optimization/81635
> * tree-data-ref.h (innermost_loop_behavior): Add chrec_offset and
> chrec_init.
> (DR_CHREC_OFFSET, DR_CHREC_INIT): New macros.
> (dr_equal_offsets_p): Delete.
> (dr_chrec_offsets_equal_p, dr_chrec_offsets_compare): Declare.
> * tree-data-ref.c: Include tree-ssa-loop-ivopts.h.
> (split_constant_offset): Handle POLYNOMIAL_CHREC.
> (dr_analyze_innermost): Initialize dr_chrec_offset and dr_chrec_init.
> (operator ==): Use dr_chrec_offsets_equal_p and compare the
> DR_CHREC_INITs.
> (prune_runtime_alias_test_list): Likewise.
> (comp_dr_with_seg_len_pair): Use dr_chrec_offsets_compare and compare
> the DR_CHREC_INITs.
> (dr_equal_offsets_p1, dr_equal_offsets_p): Delete.
> (analyze_offset_scev): New function.
> (compute_offset_chrecs): Likewise.
> (dr_chrec_offsets_equal_p): Likewise.
> (dr_chrec_offsets_compare): Likewise.
> * tree-loop-distribution.c (compute_alias_check_pairs): Use
> dr_chrec_offsets_compare.
> * tree-vect-data-refs.c (vect_find_same_alignment_drs): Use
> dr_chrec_offsets_compare and compare the DR_CHREC_INITs.
> (dr_group_sort_cmp): Likewise.
> (vect_analyze_group_access_1): Use DR_CHREC_INIT instead of DR_INIT.
> (vect_no_alias_p): Likewise.
> (vect_analyze_data_ref_accesses): Use dr_chrec_offsets_equal_p and
> compare the DR_CHREC_INITs.
> (vect_prune_runtime_alias_test_list): Use dr_chrec_offsets_compare.
> * tree-vect-stmts.c (vectorizable_load): Use DR_CHREC_INIT instead
> of DR_INIT.
>
> gcc/testsuite/
> PR tree-optimization/81635
> * gcc.dg/vect/bb-slp-pr81635.c: New test.
>


Re: [PATCH PR81832]Skip copying loop header if inner loop is distributed

2017-08-16 Thread Bin.Cheng
On Wed, Aug 16, 2017 at 10:31 AM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Tue, Aug 15, 2017 at 6:33 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> Richard Biener <richard.guent...@gmail.com> writes:
>>>> On Tue, Aug 15, 2017 at 11:28 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This patch fixes PR81832.  Root cause for the ICE is:
>>>>>   1) Loop has distributed inner loop.
>>>>>   2) The guarding function call IFN_LOOP_DIST_CALL happens to be in 
>>>>> loop's header.
>>>>>   3) IFN_LOOP_DIST_CALL (int loop's header) is duplicated by pass_ch_vect 
>>>>> thus
>>>>>  not eliminated.
>>>>>
>>>>> Given pass_ch_vect copies loop header to enable more vectorization, we 
>>>>> should
>>>>> skip loop in this case because distributed inner loop means this loop can 
>>>>> not
>>>>> be vectorized anyway.  One point to mention is name 
>>>>> inner_loop_distributed_p
>>>>> is a little misleading.  The name indicates that each basic block is 
>>>>> checked,
>>>>> but the patch only checks loop's header for simplicity/efficiency's 
>>>>> purpose.
>>>>> Any comment?
>>>>
>>>> My comment would be to question pass_ch_vect placement -- what was the
>>>> reason to place it so late?
>>>>
>>>> I also see GRAPHITE runs inbetween loop distribution and vectorization --
>>>> what prevents GRAPHITE from messing up things here?  Or autopar?
>>>>
>>>> The patch itself shows should_duplicate_loop_header_p should
>>>> handle this IFN specially (somehow all IFNs are considered "inexpensive").
>>>>
>>>> So can you please adjust should_duplicate_loop_header_p instead and/or
>>>> gimple_inexpensive_call_p?  Since we have IFNs for stuff like EXP10 I'm not
>>>> sure if that default is so good.
>>>
>>> I think the default itself is OK: we only use IFNs for libm functions
>>> like exp10 if an underlying optab exists (unlike __builtin_exp10, which
>>> is useful as the non-errno setting version of exp10), so the target must
>>> have something that it thinks is worth open-coding.  Also, we currently
>>> treat all MD built-ins as "simple" and thus "inexpensive", so the IFN
>>> handling is consistent with that.
>>>
>>> Maybe there are some IFNs that are worth special-casing as expensive,
>>> but IMO doing that to solve this problem would be a bit hacky.  It seems
>>> like "inexpensive" should be more of a cost thing than a correctness thing.
>> Hi all,
>
>> This is updated patch.  It only adds a single check on IFN_LOOP_DIST_ALIAS
>> in function should_duplicate_loop_header_p.
>
> Thanks.
>
>> As for gimple_inexpensive_call_p, I think it's natural to consider
>> functions like IFN_LOOP_VECTORIZED and IFN_LOOP_DIST_ALIAS as
>> expensive because they are only meant to indicate temporary
>> arrangement of optimizations and are never used in code generation.
>> I will send a standalone patch for that.
>
> Is that enough to consider them expensive though?  To me, "expensive"
Not sure.  On the other hand, "expensive" is a measurement of the cost of
the generated code.  For the internal function calls in discussion, maybe
we should not ask the question in the first place.  Even though these
calls are expanded to constants, IMHO we can't simply consider them cheap
either, because there are high-level side effects along with the
expansion, i.e., undoing loop versioning.  Such a high-level
transformation is not (and should not be) covered by
gimple_inexpensive_call_p.
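
A rough sketch of the versioned form loop distribution leaves behind, to
make that side effect concrete (pseudo-GIMPLE in C syntax; the names and
the call shape are illustrative):

    _1 = IFN_LOOP_DIST_ALIAS (<runtime alias condition>);
    if (_1 != 0)
      goto distributed_version;   /* loops produced by distribution */
    else
      goto original_version;      /* fallback copy of the loop */

Folding the call to a constant is trivially cheap as generated code, but
it commits later passes to deleting one version, i.e. undoing the loop
versioning; that transformation is the cost gimple_inexpensive_call_p
cannot express.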

Thanks,
bin
> should mean that they cost a lot in terms of size or speed (whichever
> is most important in context).  Both functions are really cheap in
> that sense, since they eventually expand to constants.
>
> Thanks,
> Richard
>
>> Another thing is this patch doesn't check IFN_LOOP_VECTORIZED because
>> it's impossible to have it with current order of passes.  Bootstrap
>> and test ongoing.  Further comments?
>>
>> Thanks,
>> bin
>> 2017-08-15  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/81832
>> * tree-ssa-loop-ch.c (should_duplicate_loop_header_p): Don't
>> copy loop header which has IFN_LOOP_DIST_ALIAS call.
>>
>> gcc/testsuite

Re: [PATCH PR81799]Fix ICE by forcing to is_gimple_val

2017-08-14 Thread Bin.Cheng
On Mon, Aug 14, 2017 at 12:21 PM, Richard Biener
 wrote:
> On Mon, Aug 14, 2017 at 1:05 PM, Bin Cheng  wrote:
>> Hi,
>> This patch fixes ICE reported in PR81799.  It simply uses is_gimple_val 
>> rather than is_gimple_condexpr.
>> Bootstap and test on x86_64.  Is it OK?
>
> I guess this eventually pessimizes code-gen for the case we do not
> have the IFN call (eventually cleaned up
> by folding / forwprop later).
Note we are not optimal with the current code even if folding/forwprop
can fold the condition to true/false, because we undo the versioning in
the vectorizer and the folding does not happen between distribution and
the vectorizer.

Thanks,
bin
>
> I guess we don't care too much and thus OK.
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> 2017-08-11  Bin Cheng  
>>
>> PR tree-optimization/81799
>> * tree-loop-distribution.c (version_loop_by_alias_check): Force
>> cond_expr to simple gimple operand.
>>
>> gcc/testsuite
>> 2017-08-11  Bin Cheng  
>>
>> PR tree-optimization/81799
>> * gcc.dg/tree-ssa/pr81799.c: New.


Re: [PATCH PR81228]Fixes ICE by adding LTGT in vec_cmp.

2017-08-14 Thread Bin.Cheng
Ping.

On Fri, Jul 28, 2017 at 12:37 PM, Bin Cheng  wrote:
> Hi,
> This simple patch fixes the ICE by adding LTGT in vec_cmp 
> pattern.
> I also modified the original test case into a compilation one since 
> -fno-trapping-math
> should not be used in general.
> Bootstrap and test on AArch64, test result check for x86_64.  Is it OK?  I 
> would also need to
> backport it to gcc-7-branch.
>
> Thanks,
> bin
> 2017-07-27  Bin Cheng  
>
> PR target/81228
> * config/aarch64/aarch64-simd.md (vec_cmp): Add
> LTGT.
>
> gcc/testsuite/ChangeLog
> 2017-07-27  Bin Cheng  
>
> PR target/81228
> * gcc.dg/pr81228.c: New.


Re: [PATCH] Fix PR81719

2017-08-08 Thread Bin.Cheng
On Tue, Aug 8, 2017 at 1:20 PM, Richard Biener  wrote:
>
> The following improves niter analysis for range-based for loops
> by handling ADDR_EXPR in expand_simple_operations.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Richard.
>
> 2017-08-08  Richard Biener  
>
> PR middle-end/81719
> * tree-ssa-loop-niter.c: Include tree-dfa.h.
> (expand_simple_operations): Also look through ADDR_EXPRs with
> MEM_REF bases treating them as POINTER_PLUS_EXPR.
>
> * g++.dg/tree-ssa/pr81719.C: New testcase.
>
> Index: gcc/tree-ssa-loop-niter.c
> ===
> *** gcc/tree-ssa-loop-niter.c   (revision 250813)
> --- gcc/tree-ssa-loop-niter.c   (working copy)
> *** along with GCC; see the file COPYING3.
> *** 42,47 
> --- 42,48 
>   #include "tree-chrec.h"
>   #include "tree-scalar-evolution.h"
>   #include "params.h"
> + #include "tree-dfa.h"
>
>
>   /* The maximum number of dominator BBs we search for conditions
> *** expand_simple_operations (tree expr, tre
> *** 1980,1985 
> --- 1981,2001 
>
> if (code == SSA_NAME)
> return expand_simple_operations (e, stop);
> +   else if (code == ADDR_EXPR)
> +   {
> + HOST_WIDE_INT offset;
> + tree base = get_addr_base_and_unit_offset (TREE_OPERAND (e, 0),
> +   &offset);
> + if (base
> + && TREE_CODE (base) == MEM_REF)
> +   {
> + ee = expand_simple_operations (TREE_OPERAND (base, 0), stop);
> + return fold_build2 (POINTER_PLUS_EXPR, TREE_TYPE (expr), ee,
> + wide_int_to_tree (sizetype,
> +   mem_ref_offset (base)
> +   + offset));
> +   }
> +   }
>
> return expr;
>   }
There are various places where we need to look into ADDR_EXPR (MEM_REF);
is it possible/beneficial to generally simplify *_REF[base + offset]?
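
(For the record, the transformation in question, sketched at the GIMPLE
level: expand_simple_operations now rewrites

    &MEM[p_1 + 8B]   ==>   p_1 + 8

i.e. an ADDR_EXPR over a MEM_REF base is treated as a POINTER_PLUS_EXPR,
which lets niter analysis see through the end pointer of a range-based
for loop.)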

Thanks,
bin
> Index: gcc/testsuite/g++.dg/tree-ssa/pr81719.C
> ===
> *** gcc/testsuite/g++.dg/tree-ssa/pr81719.C (nonexistent)
> --- gcc/testsuite/g++.dg/tree-ssa/pr81719.C (working copy)
> ***
> *** 0 
> --- 1,24 
> + /* { dg-do compile { target c++11 } } */
> + /* { dg-options "-O3 -fdump-tree-optimized" } */
> +
> + typedef int Items[2];
> +
> + struct ItemArray
> + {
> +   Items items;
> +   int sum_x2() const;
> + };
> +
> + int ItemArray::sum_x2() const
> + {
> +   int total = 0;
> +   for (int item : items)
> + {
> +   total += item;
> + }
> +   return total;
> + }
> +
> + /* We should be able to compute the number of iterations to two, unroll
> +the loop and end up with a single basic-block in sum_x2.  */
> + /* { dg-final { scan-tree-dump-times "bb" 1 "optimized" } } */


Re: [PATCH GCC][5/5]Enable tree loop distribution at -O3 and above optimization levels.

2017-08-07 Thread Bin.Cheng
On Fri, Jun 23, 2017 at 12:04 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 10:47 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Jun 23, 2017 at 6:04 AM, Jeff Law <l...@redhat.com> wrote:
>>> On 06/07/2017 02:07 AM, Bin.Cheng wrote:
>>>> On Tue, Jun 6, 2017 at 6:47 PM, Jeff Law <l...@redhat.com> wrote:
>>>>> On 06/02/2017 05:52 AM, Bin Cheng wrote:
>>>>>> Hi,
>>>>>> This patch enables -ftree-loop-distribution by default at -O3 and above 
>>>>>> optimization levels.
>>>>>> Bootstrap and test at O2/O3 on x86_64 and AArch64.  is it OK?
>>>>>>
>>>>>> Note I don't have strong opinion here and am fine with either it's 
>>>>>> accepted or rejected.
>>>>>>
>>>>>> Thanks,
>>>>>> bin
>>>>>> 2017-05-31  Bin Cheng  <bin.ch...@arm.com>
>>>>>>
>>>>>>   * opts.c (default_options_table): Enable 
>>>>>> OPT_ftree_loop_distribution
>>>>>>   for -O3 and above levels.
>>>>> I think the question is how does this generally impact the performance
>>>>> of the generated code and to a lesser degree compile-time.
>>>>>
>>>>> Do you have any performance data?
>>>> Hi Jeff,
>>>> At this stage of the patch, only hmmer is impacted and improved
>>>> obviously in my local run of spec2006 for x86_64 and AArch64.  In long
>>>> term, loop distribution is also one prerequisite transformation to
>>>> handle bwaves (at least).  For these two impacted cases, it helps to
>>>> resolve the gap against ICC.  I didn't check compilation time slow
>>>> down, we can restrict it to problem with small partition number if
>>>> that's a problem.
>>> Just a note. I know you've iterated further with Richi -- I'm not
>>> objecting to the patch, nor was I ready to approve.
>>>
>>> Are you and Richi happy with this as-is or are you looking to submit
>>> something newer based on the conversation the two of you have had?
>> Hi Jeff,
>> The patch series is updated in various ways according to review
>> comments, for example, it restricts compilation time by checking
>> number of data references against MAX_DATAREFS_FOR_DATADEPS as well as
>> restores data dependence cache.  There are still two missing parts I'd
>> like to do as followup patches: one is loop nest distribution and the
>> other is a data-locality cost model (at least) for small cases.  Now
>> Richi approved most patches except the last major one, but I still
>> need another iterate for some (approved) patches in order to fix
>> mistake/typo introduced when I separating the patch.
>
> The patch is ok after the approved parts of the ldist series has been 
> committed.
> Note your patch lacks updates to invoke.texi (what options are enabled at 
> -O3).
> Please adjust that before committing.
Hi All,
Given that the loop distribution patches have been merged for a while and
a couple of issues have been fixed, I am submitting an updated patch to
enable the pass by default at O3 and above levels.
Bootstrap and test on x86_64 and AArch64 ongoing.  Hmmer is still
improved.  Is it OK if there is no failure?

Thanks,
bin
2017-08-07  Bin Cheng  <bin.ch...@arm.com>

* doc/invoke.texi: Document -ftree-loop-distribution for O3.
* opts.c (default_options_table): Add OPT_ftree_loop_distribution.
From 2bda01a939ac8c0bf54f04f7e29cc0d3155c7626 Mon Sep 17 00:00:00 2001
From: Bin Cheng <binch...@e108451-lin.cambridge.arm.com>
Date: Wed, 28 Jun 2017 10:54:17 +0100
Subject: [PATCH] enable-loop-distribution-O3-20170802.txt

---
 gcc/doc/invoke.texi | 21 ++++++++++++++-------
 gcc/opts.c          |  1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5ae9dc4..f48a71a 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -7248,13 +7248,20 @@ invoking @option{-O2} on programs that use computed gotos.
 @item -O3
 @opindex O3
 Optimize yet more.  @option{-O3} turns on all optimizations specified
-by @option{-O2} and also turns on the @option{-finline-functions},
-@option{-funswitch-loops}, @option{-fpredictive-commoning},
-@option{-fgcse-after-reload}, @option{-ftree-loop-vectorize},
-@option{-ftree-loop-distribute-patterns}, @option{-fsplit-paths}
-@option{-ftree-slp-vectorize}, @option{-fvect-cost-model},
-@option{-ftree-partial-pre}, @option{-fpeel-loops}
-and @option{-fipa-cp-clone} options.
+by @option{-O2} and also turns on the following optimization flags:
+@gccoptlist{-finli

Re: [PATCH] Make mempcpy more optimal (PR middle-end/70140).

2017-08-02 Thread Bin.Cheng
On Wed, Aug 2, 2017 at 10:54 AM, Martin Liška <mli...@suse.cz> wrote:
> On 08/02/2017 11:45 AM, Bin.Cheng wrote:
>> Hi Martin,
>> With r250771, GCC failed to build glibc for arm/aarch64 linux cross 
>> toolchain:
>
> Hi.
>
> Sorry for the breakage, I accidentally installed wrong version of patch.
> Should be fixed in r250789.
Thanks!

Thanks,
bin
>
> M.


Re: [PATCH] Make mempcpy more optimal (PR middle-end/70140).

2017-08-02 Thread Bin.Cheng
On Wed, Aug 2, 2017 at 8:26 AM, Martin Liška  wrote:
> On 08/02/2017 09:16 AM, Jakub Jelinek wrote:
>> On Wed, Aug 02, 2017 at 09:13:40AM +0200, Martin Liška wrote:
>>> On 08/01/2017 09:50 PM, Jakub Jelinek wrote:
 On Thu, Jul 20, 2017 at 08:59:29AM +0200, Martin Liška wrote:
> Hello.
>
> Following patch does sharing of expansion for mem{p,}cpy and also strpcy 
> (with a known constant as source)
> so that we use same type of expansion (direct insns emission, direct 
> emission with a loop instruction and
> library call). As mentioned in the PR, glibc does not provide an 
> optimized version for majority of targets.
>
> Patch can bootstrap on ppc64le-redhat-linux and survives regression tests.

 This broke e.g.
 FAIL: gcc.dg/20050503-1.c scan-assembler-not call
 on i686-linux, the result is significantly worse.
 Also, while perhaps majority of targets don't provide optimized version,
 some targets do, including i?86/x86_64, and if the memcpy would be expanded
 as a call, it is much better to just emit mempcpy call instead.
 Just look at the testcase, because of this misoptimization we suddenly 
 can't
 use a tail call.

 Jakub

>>>
>>> I see. That said, should I introduce some target hook that will tell 
>>> whether to expand to
>>> 'return memcpy(dst, src,l) + dst;' or call library mempcpy routine?
>>
>> If some targets aren't willing to provide fast mempcpy in libc, then yes I
>> guess.  And, for -Os you should never do the former, that isn't going to be
>> shorter (at least unless the memcpy is expanded inline and is shorter than
>> the call + addition).
>
> Good, I will work on that.
>
>>
>> BTW, do we have folding of mempcpy to memcpy if the result is ignored (no
>> lhs)?
>
> Yes, we do it, I've just verified that.
>
> Martin

Hi Martin,
With r250771, GCC failed to build glibc for the arm/aarch64 Linux cross toolchains:

during RTL pass: expand
loadlocale.c: In function ‘_nl_load_locale’:
loadlocale.c:199:7: internal compiler error: in emit_move_insn, at expr.c:3704
   __mempcpy (__mempcpy (__mempcpy (newp, file->filename, filenamelen),
   ^~~~
0x80902b emit_move_insn(rtx_def*, rtx_def*)
/test/source/gcc/gcc/expr.c:3703
0x6d2271 expand_builtin_memory_copy_args
/test/source/gcc/gcc/builtins.c:3514
0x6d48d7 expand_builtin(tree_node*, rtx_def*, rtx_def*, machine_mode, int)
/test/source/gcc/gcc/builtins.c:6847
0x80454c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
/test/source/gcc/gcc/expr.c:10848
0x6f8a9c expand_expr
/test/source/gcc/gcc/expr.h:276
0x6f8a9c expand_call_stmt
/test/source/gcc/gcc/cfgexpand.c:2664
0x6f8a9c expand_gimple_stmt_1
/test/source/gcc/gcc/cfgexpand.c:3583
0x6f8a9c expand_gimple_stmt
/test/source/gcc/gcc/cfgexpand.c:3749
0x6f9c1a expand_gimple_basic_block
/test/source/gcc/gcc/cfgexpand.c:5751
0x6ff986 execute
/test/source/gcc/gcc/cfgexpand.c:6358
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

I filed PR81666 for tracking.

Thanks,
bin


Re: [PATCH PR81228]Fixes ICE by adding LTGT in vec_cmp.

2017-08-01 Thread Bin.Cheng
On Fri, Jul 28, 2017 at 3:15 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Fri, Jul 28, 2017 at 12:55 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> Bin Cheng <bin.ch...@arm.com> writes:
>>>> Hi,
>>>> This simple patch fixes the ICE by adding LTGT in
>>>> vec_cmp pattern.
>>>> I also modified the original test case into a compilation one since
>>>> -fno-trapping-math
>>>> should not be used in general.
>>>> Bootstrap and test on AArch64, test result check for x86_64.  Is it OK?
>>>> I would also need to
>>>> backport it to gcc-7-branch.
>>>>
>>>> Thanks,
>>>> bin
>>>> 2017-07-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>   PR target/81228
>>>>   * config/aarch64/aarch64-simd.md (vec_cmp): Add
>>>>   LTGT.
>>>>
>>>> gcc/testsuite/ChangeLog
>>>> 2017-07-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>   PR target/81228
>>>>   * gcc.dg/pr81228.c: New.
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>>>> b/gcc/config/aarch64/aarch64-simd.md
>>>> index 011fcec0..9cd67a2 100644
>>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>>> @@ -2524,6 +2524,7 @@
>>>>  case EQ:
>>>>comparison = gen_aarch64_cmeq;
>>>>break;
>>>> +case LTGT:
>>>>  case UNEQ:
>>>>  case ORDERED:
>>>>  case UNORDERED:
>>>> @@ -2571,6 +2572,7 @@
>>>>emit_insn (comparison (operands[0], operands[2], operands[3]));
>>>>break;
>>>>
>>>> +case LTGT:
>>>>  case UNEQ:
>>>>/* We first check (a > b ||  b > a) which is !UNEQ, inverting
>>>>this result will then give us (a == b || a UNORDERED b).  */
>>>> @@ -2578,7 +2580,8 @@
>>>>operands[2], operands[3]));
>>>>emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
>>>>emit_insn (gen_ior3 (operands[0], operands[0], tmp));
>>>> -  emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>>> +  if (code == UNEQ)
>>>> + emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>>>break;
>>>
>>> AFAIK this is still a grey area, but I think (ltgt x y) is supposed to
>>> be a trapping operation, i.e. it's closer to (ior (lt x y) (gt x y))
>>> than (not (uneq x y)).  See e.g. the handling in may_trap_p_1, where
>>> LTGT is handled like LT and GT rather than like UNEQ.
>>>
>>> See also: https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00583.html
>> Thanks for pointing me to this, I don't know anything about floating point 
>> here.
>> As for the change, the code now looks like:
>>
>> case LTGT:
>> case UNEQ:
>>   /* We first check (a > b ||  b > a) which is !UNEQ, inverting
>>  this result will then give us (a == b || a UNORDERED b).  */
>>   emit_insn (gen_aarch64_cmgt (operands[0],
>>  operands[2], operands[3]));
>>   emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
>>   emit_insn (gen_ior3 (operands[0], operands[0], tmp));
>>   if (code == UNEQ)
>> emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>   break;
>>
>> So (a > b || b > a) is generated for LTGT which you suggested?
>
> Ah, yeah, I was just going off LTGT being treated as !UNEQ, but...
>
>> Here we invert the result for UNEQ though.
>
> ...it looks like it might be the UNEQ code that's wrong.  E.g. this
> test fails at -O3 and passes at -O for me:
>
> #define _GNU_SOURCE
> #include <fenv.h>
>
> double x[16], y[16];
> int res[16];
>
> int
> main (void)
> {
>   for (int i = 0; i < 16; ++i)
> {
>   x[i] = __builtin_nan ("");
>   y[i] = i;
> }
>   asm volatile ("" ::: "memory");
>   feclearexcept (FE_ALL_EXCEPT);
>   for (int i = 0; i < 16; ++i)
> res[i] = __builtin_islessgreater (x[i], y[i]);
>   asm volatile ("" ::: "memory");
>   return fetestexcept (FE_ALL_EXCEPT) != 0;
> }
>
> (asm volatiles just added for paranoia, in case stuff gets optimised
> away otherwise.)
Thanks for the test; I filed PR81647 for tracking.  This is actually an
issue of inconsistent LTGT behavior: it is translated differently with
and without vectorization, I think.
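
For reference, the semantic distinction in standard C terms (a minimal
example; behavior per C99 Annex F, function names illustrative):

    #include <math.h>

    int quiet_ltgt (double x, double y)
    {
      /* islessgreater is the quiet form: it raises no FE_INVALID
         for quiet NaN operands.  */
      return islessgreater (x, y);
    }

    int trapping_ltgt (double x, double y)
    {
      /* Raw < and > raise FE_INVALID when either operand is NaN,
         which is why LTGT should act like (lt || gt), not !uneq.  */
      return x < y || x > y;
    }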

Thanks,
bin
>
> But I suppose that's no reason to hold up your patch. :-)  Maybe it'd
> be worth having a comment though?
>
> Thanks,
> Richard


Re: [PATCH][GCC][AArch64] optimize float immediate moves (2 /4) - HF/DF/SF mode.

2017-08-01 Thread Bin.Cheng
On Tue, Aug 1, 2017 at 12:51 PM, Tamar Christina
 wrote:
>>
>> Given review comment already pointed out big-endian issue and patch was
>> updated to address it, I would expect reg-test on a big-endian target before
>> applying patch, right?
>
> The patch spent 6 months in external review.
> Given that, I simply forgot to rerun big endian before the commit as I did 
> the rest.
>
> The failing tests were all added after the submission of this patch. I'll 
> have a look.
Maybe I bisected to the wrong commit?  The patch was committed three days
ago; how could all the failing tests have been added after it?

Thanks,
bin


Re: [PATCH][GCC][AArch64] optimize float immediate moves (2 /4) - HF/DF/SF mode.

2017-08-01 Thread Bin.Cheng
On Mon, Jun 26, 2017 at 11:50 AM, Tamar Christina
 wrote:
> Hi all,
>
> Here's the re-spun patch.
> Aside from the grouping of the split patterns it now also uses h register for 
> the fmov for HF when available,
> otherwise it forces a literal load.
>
> Regression tested on aarch64-none-linux-gnu and no regressions.
Hi,
There are lots of test failures on aarch64_be-none-elf; I verified two:
gcc.dg/vect/pr61680.c execution test
gcc.dg/vect/pr63148.c execution test

are caused by svn+ssh://gcc.gnu.org/svn/gcc/trunk@250673

Given that the review comments already pointed out the big-endian issue
and the patch was updated to address it, I would have expected a
regression test on a big-endian target before the patch was applied,
right?

Thanks,
bin
>
> OK for trunk?
>
> Thanks,
> Tamar
>
>
> gcc/
> 2017-06-26  Tamar Christina  
> Richard Sandiford 
>
> * config/aarch64/aarch64.md (mov): Generalize.
> (*movhf_aarch64, *movsf_aarch64, *movdf_aarch64):
> Add integer and movi cases.
> (movi-split-hf-df-sf split, fp16): New.
> (enabled): Added TARGET_FP_F16INST.
> * config/aarch64/iterators.md (GPF_HF): New.
> 
> From: Tamar Christina
> Sent: Wednesday, June 21, 2017 11:48:33 AM
> To: James Greenhalgh
> Cc: GCC Patches; nd; Marcus Shawcroft; Richard Earnshaw
> Subject: RE: [PATCH][GCC][AArch64] optimize float immediate moves (2 /4) - 
> HF/DF/SF mode.
>
>> > movi\\t%0.4h, #0
>> > -   mov\\t%0.h[0], %w1
>> > +   fmov\\t%s0, %w1
>>
>> Should this not be %h0?
>
> The problem is that H registers are only available in ARMv8.2+,
> I'm not sure what to do about ARMv8.1 given your other feedback
> pointing out that the bit patterns of how it's stored in s vs h registers
> differ.
>
>>
>> > umov\\t%w0, %1.h[0]
>> > mov\\t%0.h[0], %1.h[0]
>> > +   fmov\\t%s0, %1
>>
>> Likewise, and much more important for correctness as it changes the way the
>> bit pattern ends up in the register (see table C2-1 in release B.a of the ARM
>> Architecture Reference Manual for ARMv8-A), here.
>>
>> > +   * return aarch64_output_scalar_simd_mov_immediate (operands[1],
>> > + SImode);
>> > ldr\\t%h0, %1
>> > str\\t%h1, %0
>> > ldrh\\t%w0, %1
>> > strh\\t%w1, %0
>> > mov\\t%w0, %w1"
>> > -  [(set_attr "type"
>> "neon_move,neon_from_gp,neon_to_gp,neon_move,\
>> > - f_loads,f_stores,load1,store1,mov_reg")
>> > -   (set_attr "simd" "yes,yes,yes,yes,*,*,*,*,*")]
>> > +  "&& can_create_pseudo_p ()
>> > +   && !aarch64_can_const_movi_rtx_p (operands[1], HFmode)
>> > +   && !aarch64_float_const_representable_p (operands[1])
>> > +   &&  aarch64_float_const_rtx_p (operands[1])"
>> > +  [(const_int 0)]
>> > +  "{
>> > +unsigned HOST_WIDE_INT ival;
>> > +if (!aarch64_reinterpret_float_as_int (operands[1], ))
>> > +  FAIL;
>> > +
>> > +rtx tmp = gen_reg_rtx (SImode);
>> > +aarch64_expand_mov_immediate (tmp, GEN_INT (ival));
>> > +tmp = simplify_gen_subreg (HImode, tmp, SImode, 0);
>> > +emit_move_insn (operands[0], gen_lowpart (HFmode, tmp));
>> > +DONE;
>> > +  }"
>> > +  [(set_attr "type" "neon_move,f_mcr,neon_to_gp,neon_move,fconsts,
>> \
>> > +neon_move,f_loads,f_stores,load1,store1,mov_reg")
>> > +   (set_attr "simd" "yes,*,yes,yes,*,yes,*,*,*,*,*")]
>> >  )
>>
>> Thanks,
>> James
>


Re: [PATCH][GCC][AArch64] optimize float immediate moves (1 /4) - infrastructure.

2017-08-01 Thread Bin.Cheng
On Wed, Jun 7, 2017 at 12:38 PM, Tamar Christina
 wrote:
> Hi All,
>
>
> This patch lays the ground work to fix the immediate moves for floats
> to use a combination of mov, movi, fmov instead of ldr and adrp to load
> float constants that fit within the 16-bit limit of movz.
>
> The idea behind it is that these are used quite often in masks etc and we can
> get a gain by doing integer moves instead of memory loads.
>
> This patch also adds the patterns for SImode and DImode to use SIMD mov
> instructions when it's able to.
>
> It's particularly handy when masks are used such as the
> 0x8000 mask in copysignf.
>
> This now generates
>
> moviv2.2s, 0x80, lsl 24
>
> instead of a literal load.
>
>
> Regression tested on aarch64-none-linux-gnu and no regressions.
Hi,
I saw the failure below after svn+ssh://gcc.gnu.org/svn/gcc/trunk@250672:

FAIL: gcc.target/aarch64/advsimd-intrinsics/vcvt_high_1.c   -O1
(internal compiler error)

A regression from the patch updates?

Thanks,
bin
>
> OK for trunk?
>
> Thanks,
> Tamar
>
>
> gcc/
> 2017-06-07  Tamar Christina  
>
> * config/aarch64/aarch64.c
> (aarch64_simd_container_mode): Add prototype.
> (aarch64_expand_mov_immediate): Add HI support.
> (aarch64_reinterpret_float_as_int, aarch64_float_const_rtx_p: New.
> (aarch64_can_const_movi_rtx_p): New.
> (aarch64_preferred_reload_class):
> Remove restrictions of using FP registers for certain SIMD operations.
> (aarch64_rtx_costs): Added new cost for CONST_DOUBLE moves.
> (aarch64_valid_floating_const): Add integer move validation.
> (aarch64_simd_imm_scalar_p): Remove.
> (aarch64_output_scalar_simd_mov_immediate): Generalize function.
> (aarch64_legitimate_constant_p): Expand list of supported cases.
> * config/aarch64/aarch64-protos.h
> (aarch64_float_const_rtx_p, aarch64_can_const_movi_rtx_p): New.
> (aarch64_reinterpret_float_as_int): New.
> (aarch64_simd_imm_scalar_p): Remove.
> * config/aarch64/predicates.md (aarch64_reg_or_fp_float): New.
> * config/aarch64/constraints.md (Uvi): New.
> (Dd): Split into Ds and new Dd.
> * config/aarch64/aarch64.md (*movsi_aarch64):
> Add SIMD mov case.
> (*movdi_aarch64): Add SIMD mov case.


Re: [PATCH][GCC][AArch64] optimize float immediate moves (3 /4) - testsuite.

2017-08-01 Thread Bin.Cheng
On Mon, Jun 26, 2017 at 11:49 AM, Tamar Christina
 wrote:
> Hi,
>
> With the changes in the patches the testsuite had a minor update in the 
> assembler scan.
> I've posted the patch but will assume it's OK based on the previous OK for 
> trunk and
> the fact that this can fall in the obvious rule.
Hi,
With the commit below:
commit b78acb5046f8b0e517f39edf17751b275d026b6c
Author: tnfchris 
Date:   Fri Jul 28 15:14:25 2017 +

2017-07-28  Tamar Christina  
Bilyan Borisov  

* gcc.target/aarch64/dbl_mov_immediate_1.c: New.
* gcc.target/aarch64/flt_mov_immediate_1.c: New.
* gcc.target/aarch64/f16_mov_immediate_1.c: New.
* gcc.target/aarch64/f16_mov_immediate_2.c: New.
* gcc.target/aarch64/pr63304_1.c: Changed to double.

I saw test failure on aarch64_be.
FAIL: gcc.target/aarch64/dbl_mov_immediate_1.c scan-assembler-times
mov\tx[0-9]+, 25838523252736 1
FAIL: gcc.target/aarch64/dbl_mov_immediate_1.c scan-assembler-times
movk\tx[0-9]+, 0x40fe, lsl 48 1
FAIL: gcc.target/aarch64/dbl_mov_immediate_1.c scan-assembler-times
mov\tx[0-9]+, -9223372036854775808 1

Thanks,
bin
>
> Thanks,
> Tamar
> 
> From: James Greenhalgh 
> Sent: Wednesday, June 14, 2017 10:11:19 AM
> To: Tamar Christina
> Cc: GCC Patches; nd; Richard Earnshaw; Marcus Shawcroft
> Subject: Re: [PATCH][GCC][AArch64] optimize float immediate moves (3 /4) - 
> testsuite.
>
> On Wed, Jun 07, 2017 at 12:38:41PM +0100, Tamar Christina wrote:
>> Hi All,
>>
>>
>> This patch adds new tests to cover the newly generated code from this patch 
>> series.
>>
>>
>> Regression tested on aarch64-none-linux-gnu and no regressions.
>>
>> OK for trunk?
>
> OK.
>
> Thanks,
> James
>
>>
>> gcc/testsuite/
>> 2017-06-07  Tamar Christina  
>>   Bilyan Borisov  
>>
>>   * gcc.target/aarch64/dbl_mov_immediate_1.c: New.
>>   * gcc.target/aarch64/flt_mov_immediate_1.c: New.
>>   * gcc.target/aarch64/f16_mov_immediate_1.c: New.
>>   * gcc.target/aarch64/f16_mov_immediate_2.c: New.
>
>


Re: [PATCH GCC]Make pointer overflow always undefined and remove the macro

2017-08-01 Thread Bin.Cheng
On Tue, Jul 25, 2017 at 8:26 AM, Richard Biener
 wrote:
> On Mon, Jul 24, 2017 at 10:43 AM, Bin Cheng  wrote:
>> Hi,
>> This is a followup patch to PR81388's fix.  According to Richi,
>> POINTER_TYPE_OVERFLOW_UNDEFINED was added in -fstrict-overflow
>> warning work.  Given:
>>   A) strict-overflow was removed;
>>   B) memory object can not wrap in address space;
>>   C) existing code doesn't take it in consideration, as in nowrap_type_p.
>> This patch makes it always true thus removes definition/usage of the macro.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> Ok.
>
> Please give others 24h to comment.
Hi,
I committed the patch as:

r250765 | amker | 2017-08-01 10:28:18 +0100 (Tue, 01 Aug 2017) | 9 lines

* tree.h (POINTER_TYPE_OVERFLOW_UNDEFINED): Delete.
* fold-const.c (fold_comparison, fold_binary_loc): Delete use of
above macro.
* match.pd: Ditto in address comparison pattern.

gcc/testsuite
* gcc.dg/no-strict-overflow-7.c: Revise comment and test string.
* gcc.dg/tree-ssa/pr81388-1.c: Ditto.

We can always revert it if there are any different opinions.
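
For the record, the kind of fold this macro used to guard (a minimal
sketch of the address comparison case, not taken from the patch):

    int f (int *p, unsigned long i, unsigned long j)
    {
      /* With pointer overflow undefined, p + i < p + j can be folded
         to i < j; if pointer arithmetic were allowed to wrap, this
         fold would be invalid.  */
      return p + i < p + j;
    }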

Thanks,
bin


Re: [PATCH GCC][4/4]Better handle store-stores chain if eliminated stores only store loop invariant

2017-07-28 Thread Bin.Cheng
On Tue, Jul 25, 2017 at 12:48 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Mon, Jul 10, 2017 at 10:24 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Jun 27, 2017 at 11:49 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> This is a followup patch better handling below case:
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>  a[i+2] = 2;
>>>}
>>> Instead of generating root variables by loading from memory and propagating 
>>> with PHI
>>> nodes, like:
>>>  t0 = a[0];
>>>  t1 = a[1];
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>  t2 = 2;
>>>  t0 = t1;
>>>  t1 = t2;
>>>}
>>>  a[n] = t0;
>>>  a[n+1] = t1;
>>> We can simply store loop invariant values after loop body if we know loop 
>>> iterates more
>>> than chain->length times, like:
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>}
>>>  a[n] = 2;
>>>  a[n+1] = 2;
>>>
>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>> Update patch wrto changes in previous patch.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> +  if (TREE_CODE (val) == INTEGER_CST || TREE_CODE (val) == REAL_CST)
> +   continue;
>
> Please use CONSTANT_CLASS_P (val) instead.  I suppose VECTOR_CST or
> FIXED_CST would be ok as well for example.
>
> Ok with that change.  Did we eventually optimize this in followup
Patch revised as suggested.  I retested the whole patch set and applied
it as r250665 through r250670.  Will keep an eye on possible breakage
over the next couple of weeks.

Thanks,
bin
> passes previously?
>
> Richard.
>
>> Thanks,
>> bin
>>>
>>> Thanks,
>>> bin
>>> 2017-06-21  Bin Cheng  <bin.ch...@arm.com>
>>>
>>> * tree-predcom.c: (struct chain): Handle store-store chain in which
>>> stores for elimination only store loop invariant values.
>>> (execute_pred_commoning_chain): Ditto.
>>> (prepare_initializers_chain_store_elim): Ditto.
>>> (prepare_finalizers): Ditto.
>>> (is_inv_store_elimination_chain): New function.
>>> (initialize_root_vars_store_elim_1): New function.


Re: [PATCH GCC][5/6]Record initialization statements and only insert it for valid chains

2017-07-28 Thread Bin.Cheng
On Mon, Jun 26, 2017 at 10:57 AM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Mon, Jun 26, 2017 at 11:47 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, May 12, 2017 at 12:28 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> This patch caches initialization statements and only inserts it for valid 
>>> chains.
>>> Looks like current code even inserts such stmts for invalid chains which 
>>> will be
>>> deleted as dead code afterwards.
>>>
>>> Bootstrap and test on x86_64 and AArch64, is it OK?
>>
>> Ping this one because it's a prerequisite patch for following predcom
>> enhancement.
>> Also I updated the patch a little bit.
>>
>> Bootstrap and test along with following patches of predcom on x86_64
>> and AArch64.  Is it OK?
>
>if (stmts)
> -   gsi_insert_seq_on_edge_immediate (entry, stmts);
> +   gimple_seq_add_seq (&chain->init_seq, stmts);
>
> use gimple_seq_add_seq_without_update.
Patch revised as suggested and retested.  Applied as r250666.

Thanks,
bin
>
> Ok with that change.
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> 2017-06-26  Bin Cheng  <bin.ch...@arm.com>
>>
>> * tree-predcom.c (struct chain): New field init_seq.
>> (release_chain): Release init_seq.
>> (prepare_initializers_chain): Record intialization stmts in above
>> field.  Discard it if chain is invalid.
>> (insert_init_seqs): New function.
>> (tree_predictive_commoning_loop): Call insert_init_seqs.


Re: [PATCH GCC][4/6]Simple patch skips single element component

2017-07-28 Thread Bin.Cheng
On Fri, Jun 23, 2017 at 5:56 PM, Jeff Law  wrote:
> On 05/12/2017 05:28 AM, Bin Cheng wrote:
>> Hi,
>> This is a simple patch discarding simple element components earlier in 
>> predcom.
>> Bootstrap and test on x86_64 and AArch64, is it OK?
>>
>> Thanks,
>> bin
>> 2017-05-10  Bin Cheng  
>>
>>   * tree-predcom.c (determine_roots_comp): Skip single-elem chain.
>>
> OK.
Sorry for the delay; I retested and applied it as r250665.

Thanks,
bin
> jeff


Re: [PATCH PR81228]Fixes ICE by adding LTGT in vec_cmp.

2017-07-28 Thread Bin.Cheng
On Fri, Jul 28, 2017 at 3:15 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> "Bin.Cheng" <amker.ch...@gmail.com> writes:
>> On Fri, Jul 28, 2017 at 12:55 PM, Richard Sandiford
>> <richard.sandif...@linaro.org> wrote:
>>> Bin Cheng <bin.ch...@arm.com> writes:
>>>> Hi,
>>>> This simple patch fixes the ICE by adding LTGT in
>>>> vec_cmp pattern.
>>>> I also modified the original test case into a compilation one since
>>>> -fno-trapping-math
>>>> should not be used in general.
>>>> Bootstrap and test on AArch64, test result check for x86_64.  Is it OK?
>>>> I would also need to
>>>> backport it to gcc-7-branch.
>>>>
>>>> Thanks,
>>>> bin
>>>> 2017-07-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>   PR target/81228
>>>>   * config/aarch64/aarch64-simd.md (vec_cmp): Add
>>>>   LTGT.
>>>>
>>>> gcc/testsuite/ChangeLog
>>>> 2017-07-27  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>>   PR target/81228
>>>>   * gcc.dg/pr81228.c: New.
>>>>
>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>>>> b/gcc/config/aarch64/aarch64-simd.md
>>>> index 011fcec0..9cd67a2 100644
>>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>>> @@ -2524,6 +2524,7 @@
>>>>  case EQ:
>>>>comparison = gen_aarch64_cmeq;
>>>>break;
>>>> +case LTGT:
>>>>  case UNEQ:
>>>>  case ORDERED:
>>>>  case UNORDERED:
>>>> @@ -2571,6 +2572,7 @@
>>>>emit_insn (comparison (operands[0], operands[2], operands[3]));
>>>>break;
>>>>
>>>> +case LTGT:
>>>>  case UNEQ:
>>>>/* We first check (a > b ||  b > a) which is !UNEQ, inverting
>>>>this result will then give us (a == b || a UNORDERED b).  */
>>>> @@ -2578,7 +2580,8 @@
>>>>operands[2], operands[3]));
>>>>emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
>>>>emit_insn (gen_ior3 (operands[0], operands[0], tmp));
>>>> -  emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>>> +  if (code == UNEQ)
>>>> + emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>>>break;
>>>
>>> AFAIK this is still a grey area, but I think (ltgt x y) is supposed to
>>> be a trapping operation, i.e. it's closer to (ior (lt x y) (gt x y))
>>> than (not (uneq x y)).  See e.g. the handling in may_trap_p_1, where
>>> LTGT is handled like LT and GT rather than like UNEQ.
>>>
>>> See also: https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00583.html
>> Thanks for pointing me to this, I don't know anything about floating point 
>> here.
>> As for the change, the code now looks like:
>>
>> case LTGT:
>> case UNEQ:
>>   /* We first check (a > b ||  b > a) which is !UNEQ, inverting
>>  this result will then give us (a == b || a UNORDERED b).  */
>>   emit_insn (gen_aarch64_cmgt (operands[0],
>>  operands[2], operands[3]));
>>   emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
>>   emit_insn (gen_ior3 (operands[0], operands[0], tmp));
>>   if (code == UNEQ)
>> emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>   break;
>>
>> So (a > b || b > a) is generated for LTGT which you suggested?
>
> Ah, yeah, I was just going off LTGT being treated as !UNEQ, but...
>
>> Here we invert the result for UNEQ though.
>
> ...it looks like it might be the UNEQ code that's wrong.  E.g. this
> test fails at -O3 and passes at -O for me:
Thanks very much for showing the issue with the example below.  That part
of the code was refactored from the old implementation when the pattern
was added.
>
> #define _GNU_SOURCE
> #include <fenv.h>
>
> double x[16], y[16];
> int res[16];
>
> int
> main (void)
> {
>   for (int i = 0; i < 16; ++i)
> {
>   x[i] = __builtin_nan ("");
>   y[i] = i;
> }
>   asm volatile ("" ::: "memory");
>   feclearexcept (FE_ALL_EXCEPT);
>   for (int i = 0; i < 16; ++i)
> res[i] = __builtin_islessgreater (x[i], y[i]);
>   asm volatile ("" ::: "memory");
>   return fetestexcept (FE_ALL_EXCEPT) != 0;
> }
>
> (asm volatiles just added for paranoia, in case stuff gets optimised
> away otherwise.)
>
> But I suppose that's no reason to hold up your patch. :-)  Maybe it'd
> be worth having a comment though?
Given the code is wrong, I will file a bug for tracking and see if I can
fix it while I am at it.  Hopefully it won't take long, so a comment can
be saved for now.

Thanks,
bin
>
> Thanks,
> Richard


Re: [PATCH PR81228]Fixes ICE by adding LTGT in vec_cmp.

2017-07-28 Thread Bin.Cheng
On Fri, Jul 28, 2017 at 12:55 PM, Richard Sandiford
 wrote:
> Bin Cheng  writes:
>> Hi,
>> This simple patch fixes the ICE by adding LTGT in
>> vec_cmp pattern.
>> I also modified the original test case into a compilation one since
>> -fno-trapping-math
>> should not be used in general.
>> Bootstrap and test on AArch64, test result check for x86_64.  Is it OK?
>> I would also need to
>> backport it to gcc-7-branch.
>>
>> Thanks,
>> bin
>> 2017-07-27  Bin Cheng  
>>
>>   PR target/81228
>>   * config/aarch64/aarch64-simd.md (vec_cmp): Add
>>   LTGT.
>>
>> gcc/testsuite/ChangeLog
>> 2017-07-27  Bin Cheng  
>>
>>   PR target/81228
>>   * gcc.dg/pr81228.c: New.
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 011fcec0..9cd67a2 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -2524,6 +2524,7 @@
>>  case EQ:
>>comparison = gen_aarch64_cmeq;
>>break;
>> +case LTGT:
>>  case UNEQ:
>>  case ORDERED:
>>  case UNORDERED:
>> @@ -2571,6 +2572,7 @@
>>emit_insn (comparison (operands[0], operands[2], operands[3]));
>>break;
>>
>> +case LTGT:
>>  case UNEQ:
>>/* We first check (a > b ||  b > a) which is !UNEQ, inverting
>>this result will then give us (a == b || a UNORDERED b).  */
>> @@ -2578,7 +2580,8 @@
>>operands[2], operands[3]));
>>emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
>>emit_insn (gen_ior3 (operands[0], operands[0], tmp));
>> -  emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>> +  if (code == UNEQ)
>> + emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
>>break;
>
> AFAIK this is still a grey area, but I think (ltgt x y) is supposed to
> be a trapping operation, i.e. it's closer to (ior (lt x y) (gt x y))
> than (not (uneq x y)).  See e.g. the handling in may_trap_p_1, where
> LTGT is handled like LT and GT rather than like UNEQ.
>
> See also: https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00583.html
Thanks for pointing me to this; I don't know anything about floating point here.
As for the change, the code now looks like:

case LTGT:
case UNEQ:
  /* We first check (a > b ||  b > a) which is !UNEQ, inverting
 this result will then give us (a == b || a UNORDERED b).  */
  emit_insn (gen_aarch64_cmgt (operands[0],
 operands[2], operands[3]));
  emit_insn (gen_aarch64_cmgt (tmp, operands[3], operands[2]));
  emit_insn (gen_ior3 (operands[0], operands[0], tmp));
  if (code == UNEQ)
emit_insn (gen_one_cmpl2 (operands[0], operands[0]));
  break;

So (a > b || b > a) is generated for LTGT, which is what you suggested?
We invert the result for UNEQ, though.

Thanks,
bin
>
> Thanks,
> Richard


Re: [PATCH] PR libstdc++/53984 handle exceptions in basic_istream::sentry

2017-07-27 Thread Bin.Cheng
On Wed, Jul 26, 2017 at 11:06 PM, Jonathan Wakely  wrote:
> On 26/07/17 20:14 +0200, Paolo Carlini wrote:
>>
>> Hi again,
>>
>> On 26/07/2017 16:27, Paolo Carlini wrote:
>>>
>>> Hi,
>>>
>>> On 26/07/2017 16:21, Andreas Schwab wrote:

 ERROR: 27_io/basic_fstream/53984.cc: unknown dg option:
 dg-require-file-io 18 {} for " dg-require-file-io 18 "" "
>>>
>>> Should be already fixed, a trivial typo.
>>
>> ... but now the new test simply fails for me. If I don't spot something
>> else trivial over the next few hours I guess better waiting for Jon to look
>> into that.
>
>
> Sorry about that, I must have only checked for FAILs and missed the
> ERRORs.
>
> It should have been an ifstream not fstream, otherwise the filebuf
> can't even open the file. Fixed like so, committed to trunk.
Hi, I have seen the failure below on aarch64/arm linux/elf:
spawn [open ...]^M
/tmp/.../src/gcc/libstdc++-v3/testsuite/27_io/basic_fstream/53984.cc:29:
void test01(): Assertion 'in.bad()' failed.
FAIL: 27_io/basic_fstream/53984.cc execution test
extra_tool_flags are:
  -include bits/stdc++.h

Thanks,
bin
>
>


Re: [PATCH GCC][4/4]Better handle store-stores chain if eliminated stores only store loop invariant

2017-07-25 Thread Bin.Cheng
On Tue, Jul 25, 2017 at 1:57 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Jul 25, 2017 at 2:38 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Jul 25, 2017 at 12:48 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Mon, Jul 10, 2017 at 10:24 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Tue, Jun 27, 2017 at 11:49 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This is a followup patch better handling below case:
>>>>>  for (i = 0; i < n; i++)
>>>>>{
>>>>>  a[i] = 1;
>>>>>  a[i+2] = 2;
>>>>>}
>>>>> Instead of generating root variables by loading from memory and 
>>>>> propagating with PHI
>>>>> nodes, like:
>>>>>  t0 = a[0];
>>>>>  t1 = a[1];
>>>>>  for (i = 0; i < n; i++)
>>>>>{
>>>>>  a[i] = 1;
>>>>>  t2 = 2;
>>>>>  t0 = t1;
>>>>>  t1 = t2;
>>>>>}
>>>>>  a[n] = t0;
>>>>>  a[n+1] = t1;
>>>>> We can simply store loop invariant values after loop body if we know loop 
>>>>> iterates more
>>>>> than chain->length times, like:
>>>>>  for (i = 0; i < n; i++)
>>>>>{
>>>>>  a[i] = 1;
>>>>>}
>>>>>  a[n] = 2;
>>>>>  a[n+1] = 2;
>>>>>
>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>>>> Update patch wrto changes in previous patch.
>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>
>>> +  if (TREE_CODE (val) == INTEGER_CST || TREE_CODE (val) == REAL_CST)
>>> +   continue;
>>>
>>> Please use CONSTANT_CLASS_P (val) instead.  I suppose VECTOR_CST or
>>> FIXED_CST would be ok as well for example.
>>>
>>> Ok with that change.  Did we eventually optimize this in followup
>>> passes previously?
>> Probably not?  Given the test below:
>>
>> int a[1], b[1], c[1];
>> int f(void)
>> {
>>   int i, n = 100;
>>   int t0 = a[0];
>>   int t1 = a[1];
>>  for (i = 0; i < n; i++)
>>{
>>  a[i] = 1;
>>  int t2 = 2;
>>  t0 = t1;
>>  t1 = t2;
>>}
>>  a[n] = t0;
>>  a[n+1] = t1;
>>   return 0;
>> }
>> The optimized dump is as:
>>
>>    <bb 2> [1.00%] [count: INV]:
>>   t1_8 = a[1];
>>   ivtmp.9_17 = (unsigned long) &a;
>>   _16 = ivtmp.9_17 + 400;
>>
>>    <bb 3> [99.00%] [count: INV]:
>>   # t1_20 = PHI <2(3), t1_8(2)>
>>   # ivtmp.9_2 = PHI <ivtmp.9_1(3), ivtmp.9_17(2)>
>>   _15 = (void *) ivtmp.9_2;
>>   MEM[base: _15, offset: 0B] = 1;
>>   ivtmp.9_1 = ivtmp.9_2 + 4;
>>   if (ivtmp.9_1 != _16)
>>     goto <bb 3>; [98.99%] [count: INV]
>>   else
>>     goto <bb 4>; [1.01%] [count: INV]
>>
>>    <bb 4> [1.00%] [count: INV]:
>>   a[100] = t1_20;
>>   a[101] = 2;
>>   return 0;
>>
>> We now eliminate one phi and leave another behind.  The remaining phi is
>> only eliminated later, by vrp1/dce2.
>
> Ok, I see.  Maybe worth filing a missed optimization PR.
Right, PR81549 filed.

Thanks,
bin
>
> Richard.
>
>> Thanks,
>> bin


Re: [PATCH GCC][4/4]Better handle store-stores chain if eliminated stores only store loop invariant

2017-07-25 Thread Bin.Cheng
On Tue, Jul 25, 2017 at 12:48 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Mon, Jul 10, 2017 at 10:24 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Jun 27, 2017 at 11:49 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> This is a followup patch better handling below case:
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>  a[i+2] = 2;
>>>}
>>> Instead of generating root variables by loading from memory and propagating 
>>> with PHI
>>> nodes, like:
>>>  t0 = a[0];
>>>  t1 = a[1];
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>  t2 = 2;
>>>  t0 = t1;
>>>  t1 = t2;
>>>}
>>>  a[n] = t0;
>>>  a[n+1] = t1;
>>> We can simply store loop invariant values after loop body if we know loop 
>>> iterates more
>>> than chain->length times, like:
>>>  for (i = 0; i < n; i++)
>>>{
>>>  a[i] = 1;
>>>}
>>>  a[n] = 2;
>>>  a[n+1] = 2;
>>>
>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>> Updated patch wrt changes in the previous patch.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> +  if (TREE_CODE (val) == INTEGER_CST || TREE_CODE (val) == REAL_CST)
> +   continue;
>
> Please use CONSTANT_CLASS_P (val) instead.  I suppose VECTOR_CST or
> FIXED_CST would be ok as well for example.
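
For reference, a sketch of the suggested change (editorial; it assumes GCC's
tree.h, where CONSTANT_CLASS_P tests for the tcc_constant class and therefore
also accepts VECTOR_CST, FIXED_CST and friends):

      /* The two explicit code checks collapse into a single class check.  */
      if (CONSTANT_CLASS_P (val))
        continue;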
>
> Ok with that change.  Did we eventually optimize this in followup
> passes previously?
Probably not?  Given the test below:

int a[1], b[1], c[1];
int f(void)
{
  int i, n = 100;
  int t0 = a[0];
  int t1 = a[1];
 for (i = 0; i < n; i++)
   {
 a[i] = 1;
 int t2 = 2;
 t0 = t1;
 t1 = t2;
   }
 a[n] = t0;
 a[n+1] = t1;
  return 0;
}
The optimized dump is as:

   <bb 2> [1.00%] [count: INV]:
  t1_8 = a[1];
  ivtmp.9_17 = (unsigned long) &a;
  _16 = ivtmp.9_17 + 400;

   <bb 3> [99.00%] [count: INV]:
  # t1_20 = PHI <2(3), t1_8(2)>
  # ivtmp.9_2 = PHI <ivtmp.9_1(3), ivtmp.9_17(2)>
  _15 = (void *) ivtmp.9_2;
  MEM[base: _15, offset: 0B] = 1;
  ivtmp.9_1 = ivtmp.9_2 + 4;
  if (ivtmp.9_1 != _16)
    goto <bb 3>; [98.99%] [count: INV]
  else
    goto <bb 4>; [1.01%] [count: INV]

   <bb 4> [1.00%] [count: INV]:
  a[100] = t1_20;
  a[101] = 2;
  return 0;

We now eliminate one phi and leave another behind.  The remaining phi is
only eliminated later, by vrp1/dce2.
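
For comparison, the fully cleaned-up form that later passes eventually reach
once the remaining PHI is also folded (a sketch, using n = 100 from the test
above; on exit the PHI value is always 2 because the loop iterates more than
once):

  for (i = 0; i < 100; i++)
    a[i] = 1;
  a[100] = 2;
  a[101] = 2;
  return 0;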

Thanks,
bin


Re: [PATCH GCC][1/2]Feed bound computation to folder in loop split

2017-07-24 Thread Bin.Cheng
On Mon, Jul 24, 2017 at 3:31 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
> On Mon, 24 Jul 2017, Bin.Cheng wrote:
>
>> On Mon, Jul 24, 2017 at 2:59 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
>>>
>>> On Mon, 24 Jul 2017, Bin.Cheng wrote:
>>>
>>>> But since definition of _197 isn't in current stmt sequence, call "o31
>>>> = do_valueize (valueize, o31)" will return NULL.  As a result, it's
>>>> not matched.
>>>
>>>
>>>
>>> Wait, actually, how was your fold_build* version working? Why was the
>>> first
>>> addition "in the current generic tree" and why isn't it "in current stmt
>>> sequence"?
>>
>> Maybe I didn't express it clearly.  In compute_new_first_bound, we
>> have stmt sequence "_124 = _197 + 1", and we try to simplify "_124 -
>> 1" by calling gimple_build.  The definition of _197 is a PHI and isn't
>> in current stmt sequence.  For fold_build* version, it builds
>> expression "_197 + 1 - 1" and simplifies it.
>
>
> It seems like it shouldn't be relevant whether the definition of _197 is in
> the stmt sequence or not, but indeed we seem to generate a lot of calls to
> do_valueize... I had misunderstood the issue, sorry.
Oh, no need to apologize at all, and thanks very much for all the explanation.
>
> Strangely, for a pattern like
> (simplify (plus @0 @1) @0)
> we generate no call to valueize, while for the following
> (simplify (plus @0 (minus @1 @2)) @0)
> we generate 3 calls to do_valueize.
>
> I think we need Richard to say what the intent is for the valueization
> function. It is used both to stop looking at defining stmt if the return is
> NULL, and to replace/optimize one SSA_NAME with another, but currently it
> seems hard to prevent looking at the defining statement without preventing
> from looking at the SSA_NAME at all.
Looks like we don't really expand into the def stmt for leaf nodes, so
maybe valueization could be skipped in that case?
>
> I guess we'll need a fix in genmatch...
Yeah, then the original patch becomes unnecessary.  Thanks again!

Thanks,
bin
>
> --
> Marc Glisse


Re: [PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

2017-07-24 Thread Bin.Cheng
Ping^1.

Thanks,
bin

On Mon, Jul 10, 2017 at 9:23 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 1:29 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Tue, Jul 4, 2017 at 2:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>> Hi,
>>>>>>> For the moment, tree-predcom.c only supports 
>>>>>>> invariant/load-loads/store-loads chains.
>>>>>>> This patch generalizes dead store elimination (or store motion) across 
>>>>>>> loop iterations in
>>>>>>> predictive commoning pass by supporting store-store chain.  As comment 
>>>>>>> in the patch:
>>>>>>>
>>>>>>>Apart from predictive commoning on Load-Load and Store-Load chains, 
>>>>>>> we
>>>>>>>also support Store-Store chains -- stores killed by other store can 
>>>>>>> be
>>>>>>>eliminated.  Given below example:
>>>>>>>
>>>>>>>  for (i = 0; i < n; i++)
>>>>>>>{
>>>>>>>  a[i] = 1;
>>>>>>>  a[i+2] = 2;
>>>>>>>}
>>>>>>>
>>>>>>>It can be replaced with:
>>>>>>>
>>>>>>>  t0 = a[0];
>>>>>>>  t1 = a[1];
>>>>>>>  for (i = 0; i < n; i++)
>>>>>>>{
>>>>>>>  a[i] = 1;
>>>>>>>  t2 = 2;
>>>>>>>  t0 = t1;
>>>>>>>  t1 = t2;
>>>>>>>}
>>>>>>>  a[n] = t0;
>>>>>>>  a[n+1] = t1;
>>>>>>>
>>>>>>>If the loop runs more than 1 iteration, it can be further
>>>>>>> simplified into:
>>>>>>>
>>>>>>>  for (i = 0; i < n; i++)
>>>>>>>{
>>>>>>>  a[i] = 1;
>>>>>>>}
>>>>>>>  a[n] = 2;
>>>>>>>  a[n+1] = 2;
>>>>>>>
>>>>>>>The interesting part is this can be viewed either as general store 
>>>>>>> motion
>>>>>>>or general dead store elimination in either intra/inter-iterations 
>>>>>>> way.
>>>>>>>
>>>>>>> There are a number of interesting facts about this enhancement:
>>>>>>> a) This patch supports dead store elimination for both across-iteration 
>>>>>>> case and single-iteration
>>>>>>>  case.  For the latter, it is dead store elimination.
>>>>>>> b) There are advantages supporting dead store elimination in predcom, 
>>>>>>> for example, it has
>>>>>>>  complete information about memory address.  On the contrary, DSE 
>>>>>>> pass can only handle
>>>>>>>  memory references with exactly the same memory address expression.
>>>>>>> c) It's cheap to support store-stores chain in predcom based on 
>>>>>>> existing code.
>>>>>>> d) As commented, the enhancement can be viewed as either generalized 
>>>>>>> dead store elimination
>>>>>>>  or generalized store motion.  I prefer DSE here.
>>>>>>>
>>>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>>>>>>
>>>>>> Looks mostly ok.  I have a few questions though.
>>>>>>
>>>>>> +  /* Don't do store elimination if loop has multiple exit edges.  */
>>>>>> +  bool eliminate_store_p = single_exit (loop) != NULL;
>>>>>>
>>>>>> handling this would be an enhancement?  IIRC LIM store-motion handles 
>>>>>> this
>>>>>> just fine by emitting code on

Re: [PATCH GCC][1/2]Feed bound computation to folder in loop split

2017-07-24 Thread Bin.Cheng
On Mon, Jul 24, 2017 at 2:59 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
> On Mon, 24 Jul 2017, Bin.Cheng wrote:
>
>> But since definition of _197 isn't in current stmt sequence, call "o31
>> = do_valueize (valueize, o31)" will return NULL.  As a result, it's
>> not matched.
>
>
> Wait, actually, how was your fold_build* version working? Why was the first
> addition "in the current generic tree" and why isn't it "in current stmt
> sequence"?
Maybe I didn't express it clearly.  In compute_new_first_bound, we
have stmt sequence "_124 = _197 + 1", and we try to simplify "_124 -
1" by calling gimple_build.  The definition of _197 is a PHI and isn't
in current stmt sequence.  For fold_build* version, it builds
expression "_197 + 1 - 1" and simplifies it.

Thanks,
bin
>
> --
> Marc Glisse


Re: [PATCH GCC][1/2]Feed bound computation to folder in loop split

2017-07-24 Thread Bin.Cheng
On Mon, Jul 24, 2017 at 1:16 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
> On Mon, 24 Jul 2017, Bin.Cheng wrote:
>
>>> For _123, we have
>>>
>>>   /* (A +- CST1) +- CST2 -> A + CST3
>>> or
>>> /* Associate (p +p off1) +p off2 as (p +p (off1 + off2)).  */
>>>
>>>
>>> For _115, we have
>>>
>>> /* min (a, a + CST) -> a where CST is positive.  */
>>> /* min (a, a + CST) -> a + CST where CST is negative. */
>>> (simplify
>>>  (min:c @0 (plus@2 @0 INTEGER_CST@1))
>>>   (if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)))
>>>(if (tree_int_cst_sgn (@1) > 0)
>>> @0
>>> @2)))
>>>
>>> What is the type of all those SSA_NAMEs?
>>
>> Hi,
>> From the debugging process, there are two issues preventing "(A +-
>> CST1) +- CST2 -> A + CST3" from being applied:
>> A) before we reach this pattern, there is pattern:
>>
>> /* A - B -> A + (-B) if B is easily negatable.  */
>> (simplify
>> (minus @0 negate_expr_p@1)
>> (if (!FIXED_POINT_TYPE_P (type))
>> (plus @0 (negate @1))))
>>
>> which is matched and returned in gimple_simplify_MINUS_EXPR.  So does
>> pattern order matter here?
>
>
> That shouldn't be a problem, normally we always try to resimplify the result
> of the simplification, and the transformation should handle x+1+-1 just as
> well as x+1-1. Is that not happening?
Yes, it doesn't matter.

>
>> B) When folding "_124 - 1" on the basis of existing stmts sequence
>> like "_124 = _197 + 1;".  The corresponding gimple-match.c code is
>> like:
>
> [...]
>>
>> But since definition of _197 isn't in current stmt sequence, call "o31
>> = do_valueize (valueize, o31)" will return NULL.  As a result, it's
>> not matched.
>
>
> Ah, yes, that problem... Jakub was having a very similar issue a few
> weeks ago, don't know if he found a solution. You could call
> gimple_simplify directly with a different valueization function if
> that's safe. Normally the simplification would wait until the next
> forwprop pass.
Thanks for the elaboration.
It's too late for the next forwprop pass since we are in between loop
optimizations and need the simplified code for niter analysis.
Function compute_new_first_bound calls gimple_build several times;
it's not really feasible to replace every gimple_build with
gimple_simplify, is it?  Also, gimple_simplify could return NULL_TREE,
in which case I would need to call gimple_build_assign again.  Lastly,
we don't have a fold_seq interface similar to fold_stmt either.
CCing Jakub if he found a solution.  Thanks.
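
For the record, a sketch of the direct-call workaround discussed here
(editorial: the helper names are made up, and whether such a permissive
valueization hook is safe at this point in the pipeline is exactly the open
question):

  /* Never refuse to look through an SSA name, never substitute it.  */
  static tree
  valueize_all (tree name)
  {
    return name;
  }

  /* Fold VAR - 1, falling back to emitting the statement unsimplified
     when gimple_simplify finds nothing.  */
  static tree
  build_var_minus_one (gimple_seq *seq, tree var)
  {
    tree type = TREE_TYPE (var);
    tree res = gimple_simplify (MINUS_EXPR, type, var, build_one_cst (type),
                                seq, valueize_all);
    if (!res)
      res = gimple_build (seq, MINUS_EXPR, type, var, build_one_cst (type));
    return res;
  }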

Thanks,
bin
>
> --
> Marc Glisse


Re: [PATCH GCC][1/2]Feed bound computation to folder in loop split

2017-07-24 Thread Bin.Cheng
On Fri, Jun 16, 2017 at 5:48 PM, Marc Glisse <marc.gli...@inria.fr> wrote:
> On Fri, 16 Jun 2017, Bin.Cheng wrote:
>
>> On Fri, Jun 16, 2017 at 5:16 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>>
>>>
>>> That means we miss a pattern in match.PD to handle this case.
>>
>> I see.  I will withdraw this patch and look in that direction.
>
>
> For _123, we have
>
>   /* (A +- CST1) +- CST2 -> A + CST3
> or
> /* Associate (p +p off1) +p off2 as (p +p (off1 + off2)).  */
>
>
> For _115, we have
>
> /* min (a, a + CST) -> a where CST is positive.  */
> /* min (a, a + CST) -> a + CST where CST is negative. */
> (simplify
>  (min:c @0 (plus@2 @0 INTEGER_CST@1))
>   (if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (@0)))
>(if (tree_int_cst_sgn (@1) > 0)
> @0
> @2)))
>
> What is the type of all those SSA_NAMEs?
Hi,
From the debugging process, there are two issues preventing "(A +-
CST1) +- CST2 -> A + CST3" from being applied:
A) before we reach this pattern, there is pattern:

/* A - B -> A + (-B) if B is easily negatable.  */
(simplify
 (minus @0 negate_expr_p@1)
 (if (!FIXED_POINT_TYPE_P (type))
 (plus @0 (negate @1))))

which is matched and returned in gimple_simplify_MINUS_EXPR.  So does
pattern order matter here?

B) When folding "_124 - 1" on the basis of existing stmts sequence
like "_124 = _197 + 1;".  The corresponding gimple-match.c code is
like:
  if (gimple_nop_convert (op0, op0_pops, valueize))
{
  tree o20 = op0_pops[0];
  switch (TREE_CODE (o20))
{
case SSA_NAME:
  if (do_valueize (valueize, o20) != NULL_TREE)
{
  gimple *def_stmt = SSA_NAME_DEF_STMT (o20);
   if (gassign *def = dyn_cast <gassign *> (def_stmt))
switch (gimple_assign_rhs_code (def))
  {
  case PLUS_EXPR:
{
  tree o30 = gimple_assign_rhs1 (def);
  if ((o30 = do_valueize (valueize, o30)))
{
  tree o31 = gimple_assign_rhs2 (def);
  if ((o31 = do_valueize (valueize, o31)))
{
  if (tree_swap_operands_p (o30, o31))
std::swap (o30, o31);
  if (CONSTANT_CLASS_P (o31))
{
  if (CONSTANT_CLASS_P (op1))
{
  {
/* #line 1392 "../../gcc/gcc/match.pd" */
tree captures[3] ATTRIBUTE_UNUSED = { o30, o31, op1 };
if (gimple_simplify_194 (res_code, res_ops, seq,
valueize, type, captures, PLUS_EXPR, MINUS_EXPR, MINUS_EXPR))
  return true;
  }
}
}
}
}
  break;
}

Note we have:
(gdb) call debug_generic_expr(op0)
_124
(gdb) call debug_generic_expr(op1)
1
(gdb) call debug_gimple_stmt(def_stmt)
_124 = _197 + 1;
(gdb) call debug_generic_expr(o30)
_197

But since definition of _197 isn't in current stmt sequence, call "o31
= do_valueize (valueize, o31)" will return NULL.  As a result, it's
not matched.
Any ideas?  Thanks.

Thanks,
bin


Re: [PATCH][1/n] Fix PR81303

2017-07-21 Thread Bin.Cheng
On Fri, Jul 21, 2017 at 8:12 AM, Richard Biener  wrote:
>
> The following is sth I noticed when looking at a way to fix PR81303.
> We happily compute a runtime cost model threshold that executes the
> vectorized variant even though no vector iteration takes place due
> to the number of prologue/epilogue iterations.  The following fixes
> that -- note that if we do not know the prologue/epilogue counts
> statically they are estimated at vf/2 which means there's still the
> chance the vector iteration won't execute.  To fix that we'd have to
> estimate those as vf-1 instead, sth we might consider doing anyway
> given that we regularly completely peel the epilogues vf-1 times
> in that case.  Maybe as followup.
Hi,
Do we consider disabling epilogue peeling if the number of iterations is
unknown at compile time?  When the loop is versioned, could we share the
epilogue with the versioned loop?  Was the epilogue shared before?

Thanks,
bin
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>
> Richard.
>
> 2017-07-21  Richard Biener
>
> PR tree-optimization/81303
> * tree-vect-loop.c (vect_estimate_min_profitable_iters): Take
> into account prologue and epilogue iterations when raising
> min_profitable_iters to sth at least covering one vector iteration.
>
> Index: gcc/tree-vect-loop.c
> ===
> --- gcc/tree-vect-loop.c(revision 250384)
> +++ gcc/tree-vect-loop.c(working copy)
> @@ -3702,8 +3702,9 @@ vect_estimate_min_profitable_iters (loop
>"  Calculated minimum iters for profitability: %d\n",
>min_profitable_iters);
>
> -  min_profitable_iters =
> -   min_profitable_iters < vf ? vf : min_profitable_iters;
> +  /* We want the vectorized loop to execute at least once.  */
> +  if (min_profitable_iters < (vf + peel_iters_prologue + 
> peel_iters_epilogue))
> +min_profitable_iters = vf + peel_iters_prologue + peel_iters_epilogue;
>
>if (dump_enabled_p ())
>  dump_printf_loc (MSG_NOTE, vect_location,
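
A worked example of the adjustment (editorial): with vf = 4 and statically
unknown peel counts, prologue and epilogue are each estimated at vf/2 = 2,
so the old threshold of vf = 4 iterations could be consumed entirely by the
2 + 2 peeled iterations and the vector body would run zero times.  The new
threshold 4 + 2 + 2 = 8 makes a vector iteration likely; as noted above,
only estimating each peel at vf - 1 = 3 (threshold 4 + 3 + 3 = 10) would
guarantee one in the worst case.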


Re: [PATCH AArch64]Fix ICE in cortex-a57 fma steering pass

2017-07-20 Thread Bin.Cheng
On Fri, Jul 14, 2017 at 12:12 PM, James Greenhalgh
 wrote:
> On Wed, Jul 12, 2017 at 03:15:04PM +, Bin Cheng wrote:
>> Hi,
>> After change @236817, AArch64 backend could avoid unnecessary conversion
>> instructions for register between different modes now.  As a result, GCC
>> could initialize register in larger mode and use it later in smaller mode.
>> such def-use chain is not supported by current regrename.c analyzer, as
>> described by its comment:
>>
>> /* Process the insn, determining its effect on the def-use
>>chains and live hard registers.  We perform the following
>>steps with the register references in the insn, simulating
>>its effect:
>>  ...
>>We cannot deal with situations where we track a reg in one mode
>>and see a reference in another mode; these will cause the chain
>>to be marked unrenamable or even cause us to abort the entire
>>basic block.  */
>>
>> In this case, regrename.c analyzer doesn't create chain for the use of the
>> register.  OTOH, cortex-a57-fma-steering.c has below code:
>>
>> @@ -973,10 +973,14 @@ func_fma_steering::analyze ()
>>   break;
>>   }
>>
>> -   /* We didn't find a chain with a def for this instruction.  */
>> -   gcc_assert (i < dest_op_info->n_chains);
>> -
>> -   this->analyze_fma_fmul_insn (forest, chain, head);
>>
>> It assumes by gcc_assert that a chain must be found for dest register of
>> fmul/fmac instructions.  According to above analysis, this is not always true
>> if the dest reg is reused from one of its source registers.
>>
>> This patch fixes the issue by skipping such instructions if no du chain is
>> found.  Bootstrap and test on AArch64/cortex-a57.  Is it OK?  If it's fine, I
>> would also need to backport it to 7/6 branches.
>
> This looks OK, but feels a bit like a workaround. Do you have a PR open
> for the missed optimisation caused by the deficiency in regrename?
>
> If so, it would be good to add that PR number to your comment in this
> function.
>
> For now, and for the backport, this will be fine, but your (Kyrill's) testcase
> has confused me (maybe too reduced from the original form) and doesn't
> match the bug here.
>
>> 2017-07-12  Bin Cheng  
>>
>>   PR target/81414
>>   * config/aarch64/cortex-a57-fma-steering.c (analyze): Skip fmul/fmac
>>   instructions if no du chain is found.
>>
>> gcc/testsuite/ChangeLog
>> 2017-07-12  Kyrylo Tkachov  
>>
>>   PR target/81414
>>   * gcc.target/aarch64/pr81414.C: New.
>
>> From ef2bc842993210a4399205d83fa46435eec5d7cd Mon Sep 17 00:00:00 2001
>> From: Bin Cheng 
>> Date: Wed, 12 Jul 2017 15:16:53 +0100
>> Subject: [PATCH] tmp
>>
>> ---
>>  gcc/config/aarch64/cortex-a57-fma-steering.c | 12 
>>  gcc/testsuite/gcc.target/aarch64/pr81414.C   | 10 ++
>>  2 files changed, 18 insertions(+), 4 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr81414.C
>>
>> diff --git a/gcc/config/aarch64/cortex-a57-fma-steering.c 
>> b/gcc/config/aarch64/cortex-a57-fma-steering.c
>> index 1bf804b..b2ee398 100644
>> --- a/gcc/config/aarch64/cortex-a57-fma-steering.c
>> +++ b/gcc/config/aarch64/cortex-a57-fma-steering.c
>> @@ -973,10 +973,14 @@ func_fma_steering::analyze ()
>>   break;
>>   }
>>
>> -   /* We didn't find a chain with a def for this instruction.  */
>> -   gcc_assert (i < dest_op_info->n_chains);
>> -
>> -   this->analyze_fma_fmul_insn (forest, chain, head);
>> +   /* Due to implementation of regrename, dest register can slip away
>> +  from regrename's analysis.  As a result, there is no chain for
>> +  the destination register of insn.  We simply skip the insn even
>> +  if it is a fmul/fmac instruction.  This case can happen when the
>> +  dest register is also a source register of insn and the source
>> +  reg is setup in larger mode than this insn.  */
>> +   if (i < dest_op_info->n_chains)
>> + this->analyze_fma_fmul_insn (forest, chain, head);
>>   }
>>  }
>>free (bb_dfs_preorder);
>> diff --git a/gcc/testsuite/gcc.target/aarch64/pr81414.C 
>> b/gcc/testsuite/gcc.target/aarch64/pr81414.C
>> new file mode 100644
>> index 000..13666a3
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/pr81414.C
>> @@ -0,0 +1,10 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -mcpu=cortex-a57" } */
>> +
>> +typedef __Float32x2_t float32x2_t;
>> +__inline float32x2_t vdup_n_f32(float) {}
>> +
>> +float32x2_t vfma_lane_f32(float32x2_t __a, float32x2_t __b) {
>> +  int __lane;
>> +  return __builtin_aarch64_fmav2sf(__b, vdup_n_f32(__lane), __a);
>> +}
>
> I don't see a mode-change here. This looks like it would have a bad def/use
> chain because of the uninitialised __lane, rather than the issue

Re: [PATCH PR81408]Turn TREE level unsafe loop optimizations warning to missed optimization message

2017-07-18 Thread Bin.Cheng
On Tue, Jul 18, 2017 at 9:31 AM, Richard Biener
 wrote:
> On Tue, Jul 18, 2017 at 10:00 AM, Bin Cheng  wrote:
>> Hi,
>> I removed unsafe loop optimization on TREE level last year, so GCC doesn't 
>> do unsafe
>> loop optimizations on TREE now.  All "unsafe loop optimizations" warnings 
>> reported by
>> TREE optimizers are simply missed optimizations.  This patch turns such 
>> warning into
>> missed optimization messages.  I didn't change when this will be dumped, for 
>> now it is
>> when called from ivopts.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>
> Ok but can you change the testcase to not scan the ivopts dump but use
> -fopt-info-loop-missed?
> You should be able to match the output with dg-message.
Thanks for reviewing.  New patch with test case updated accordingly.  Is it OK?

Thanks,
bin
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>> 2017-07-13  Bin Cheng  
>>
>> PR target/81408
>> * tree-ssa-loop-niter.c (number_of_iterations_exit): Dump missed
>> optimization for loop niter analysis.
>>
>> gcc/testsuite/ChangeLog
>> 2017-07-13  Bin Cheng  
>>
>> PR target/81408
>> * g++.dg/tree-ssa/pr81408.C: New.
>> * gcc.dg/tree-ssa/pr19210-1.c: Check dump message rather than 
>> warning.
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr81408.C b/gcc/testsuite/g++.dg/tree-ssa/pr81408.C
new file mode 100644
index 000..f94544b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr81408.C
@@ -0,0 +1,92 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -std=gnu++11 -fopt-info-loop-missed -Wunsafe-loop-optimizations" } */
+
+namespace a {
+void b () __attribute__ ((__noreturn__));
+template <typename> struct d;
+template <typename e> struct d<e *>
+{
+  typedef e f;
+};
+struct g
+{
+  template <typename h> using i = h *;
+};
+}
+using a::d;
+template <typename j> class k
+{
+  j l;
+
+public:
+  typename d<j>::f operator* () {}
+  void operator++ () { ++l; }
+  j
+  aa ()
+  {
+    return l;
+  }
+};
+template <typename j>
+bool
+operator!= (k<j> o, k<j> p2)
+{
+  return o.aa () != p2.aa ();
+}
+struct p;
+namespace a {
+struct F
+{
+  struct q
+  {
+    using ai = g::i<p>;
+  };
+  using r = q::ai;
+};
+class H
+{
+public:
+  k<F::r> begin ();
+  k<F::r> end ();
+};
+int s;
+class I
+{
+public:
+  void
+  aq (char)
+  {
+    if (s)
+      b ();
+  }
+};
+class u : public I
+{
+public:
+  void
+  operator<< (u o (u))
+  {
+    o (*this);
+  }
+  u operator<< (void *);
+};
+template <typename at, typename au>
+at
+av (au o)
+{
+  o.aq ('\n');
+}
+u ax;
+}
+struct p
+{
+  char *ay;
+};
+a::H t;
+void
+ShowHelpListCommands ()
+{
+  for (auto c : t) /* { dg-message "note: missed loop optimization: niters analysis .*" } */
+    a::ax << c.ay << a::av;
+}
+
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c b/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
index 3c8ee06..0fa5600 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr19210-1.c
@@ -1,15 +1,15 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -Wunsafe-loop-optimizations" } */
+/* { dg-options "-O2 -fopt-info-loop-missed -Wunsafe-loop-optimizations" } */
 extern void g(void);
 
 void
 f (unsigned n)
 {
   unsigned k;
-  for(k = 0;k <= n;k++) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 0;k <= n;k++) /* { dg-message "note: missed loop optimization: niters analysis .*" } */
 g();
 
-  for(k = 0;k <= n;k += 4) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 0;k <= n;k += 4) /* { dg-message "note: missed loop optimization: niters analysis .*" } */
 g();
 
   /* We used to get warning for this loop.  However, since then # of iterations
@@ -21,9 +21,9 @@ f (unsigned n)
 g();
 
   /* So we need the following loop, instead.  */
-  for(k = 4;k <= n;k += 5) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 4;k <= n;k += 5) /* { dg-message "note: missed loop optimization: niters analysis .*" } */
 g();
   
-  for(k = 15;k >= n;k--) /* { dg-warning "missed loop optimization.*overflow" } */
+  for(k = 15;k >= n;k--) /* { dg-message "note: missed loop optimization: niters analysis .*" } */
 g();
 }
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 5a7cab5..1421002 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2378,9 +2378,9 @@ number_of_iterations_exit (struct loop *loop, edge exit,
 return true;
 
   if (warn)
-    warning_at (gimple_location_safe (stmt),
-                OPT_Wunsafe_loop_optimizations,
-                "missed loop optimization, the loop counter may overflow");
+    dump_printf_loc (MSG_MISSED_OPTIMIZATION, gimple_location_safe (stmt),
+                     "missed loop optimization: niters analysis ends up "
+                     "with assumptions.\n");
 
   return false;
 }
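
As a usage note (an editorial example; the source location shown is made
up), the message is now emitted through the dump machinery, so it is
requested via -fopt-info rather than printed as a warning:

  $ gcc -O2 -Wunsafe-loop-optimizations -fopt-info-loop-missed test.c
  test.c:8:3: note: missed loop optimization: niters analysis ends up with assumptions.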


Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check

2017-07-18 Thread Bin.Cheng
On Mon, Jul 17, 2017 at 1:09 PM, Christophe Lyon
<christophe.l...@linaro.org> wrote:
> On 17 July 2017 at 12:06, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
>> <christophe.l...@linaro.org> wrote:
>>> Hi Bin,
>>>
>>> On 30 June 2017 at 12:43, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstrap and test on
>>>>>>>>>>> x86_64 and aarch64.
>>>>>>>>>>
>>>>>>>>>> why is ldist-12.c no longer distributed?  your comment says it 
>>>>>>>>>> doesn't expose
>>>>>>>>>> more "parallelism" but the point is to reduce memory bandwith 
>>>>>>>>>> requirements
>>>>>>>>>> which it clearly does.
>>>>>>>>>>
>>>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the 
>>>>>>>>>> wording
>>>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>>>
>>>>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>>>> major criteria for distribution.  BTW, I think a good memory stream
>>>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>>>> etc., appropriate for distribution.
>>>>>>>>
>>>>>>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into 
>>>>>>>> two loops
>>>>>>>> can be executed "in parallel"
>>>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>>>> paralleled), we tend to not distribute it into small ones.  But there
>>>>>>> is one exception here, if the distributed small loops are recognized
>>>>>>> as builtin functions, we still distribute it.  I assume it's generally
>>>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates 
>>>>>>>>>> statements
>>>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) 
>>>>>>>>>> with the
>>>>>>>>>> +   same loop header.  The major goal is to separate statements 
>>>>>>>>>> which may
>>>>>>>>>> +   be vectorized from those that can't.  This pass implements 
>>>>>>>>>> distribution
>>>>>>>>>> +   in the following steps:
>>>>>>>>>>
>>>>>>>>>> misse

Re: [PATCH GCC][13/13]Distribute loop with loop versioning under runtime alias check

2017-07-17 Thread Bin.Cheng
On Mon, Jul 10, 2017 at 10:32 AM, Christophe Lyon
<christophe.l...@linaro.org> wrote:
> Hi Bin,
>
> On 30 June 2017 at 12:43, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Jun 28, 2017 at 2:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Wed, Jun 28, 2017 at 1:29 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Wed, Jun 28, 2017 at 1:46 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>> On Wed, Jun 28, 2017 at 11:58 AM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>>> On Tue, Jun 27, 2017 at 4:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>>>> On Tue, Jun 27, 2017 at 1:44 PM, Richard Biener
>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>> On Fri, Jun 23, 2017 at 12:30 PM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>> On Tue, Jun 20, 2017 at 10:22 AM, Bin.Cheng <amker.ch...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>>> On Mon, Jun 12, 2017 at 6:03 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>> Rebased V3 for changes in previous patches.  Bootstrap and test on
>>>>>>>>> x86_64 and aarch64.
>>>>>>>>
>>>>>>>> why is ldist-12.c no longer distributed?  your comment says it doesn't 
>>>>>>>> expose
>>>>>>>> more "parallelism" but the point is to reduce memory bandwith 
>>>>>>>> requirements
>>>>>>>> which it clearly does.
>>>>>>>>
>>>>>>>> Likewise for -13.c, -14.c.  -4.c may be a questionable case but the 
>>>>>>>> wording
>>>>>>>> of "parallelism" still confuses me.
>>>>>>>>
>>>>>>>> Can you elaborate on that.  Now onto the patch:
>>>>>>> Given we don't model data locality or memory bandwidth, whether
>>>>>>> distribution enables loops that can be executed paralleled becomes the
>>>>>>> major criteria for distribution.  BTW, I think a good memory stream
>>>>>>> optimization model shouldn't consider small loops as in ldist-12.c,
>>>>>>> etc., appropriate for distribution.
>>>>>>
>>>>>> True.  But what means "parallel" here?  ldist-13.c if partitioned into 
>>>>>> two loops
>>>>>> can be executed "in parallel"
>>>>> So if a loop by itself can be vectorized (or so called can be executed
>>>>> paralleled), we tend to not distribute it into small ones.  But there
>>>>> is one exception here, if the distributed small loops are recognized
>>>>> as builtin functions, we still distribute it.  I assume it's generally
>>>>> better to call builtin memory functions than vectorize it by GCC?
>>>>
>>>> Yes.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> +   Loop distribution is the dual of loop fusion.  It separates 
>>>>>>>> statements
>>>>>>>> +   of a loop (or loop nest) into multiple loops (or loop nests) with 
>>>>>>>> the
>>>>>>>> +   same loop header.  The major goal is to separate statements which 
>>>>>>>> may
>>>>>>>> +   be vectorized from those that can't.  This pass implements 
>>>>>>>> distribution
>>>>>>>> +   in the following steps:
>>>>>>>>
>>>>>>>> misses the goal of being a memory stream optimization, not only a 
>>>>>>>> vectorization
>>>>>>>> enabler.  distributing a loop can also reduce register pressure.
>>>>>>> I will revise the comment, but as explained, enabling more
>>>>>>> vectorization is the major criteria for distribution to some extend
>>>>>>> now.
>>>>>>
>>>>>> Yes, I agree -- originally it was written to optimize the stream 
>>>>>> benchmark IIRC.
>>>>> Let's see if any performance drop will be reported against this patch.
>>>>> Let's see if we can create a cost model for it.
>>>>
>>


Re: [PATCH GCC][4/4]Better handle store-stores chain if eliminated stores only store loop invariant

2017-07-10 Thread Bin.Cheng
On Tue, Jun 27, 2017 at 11:49 AM, Bin Cheng  wrote:
> Hi,
> This is a followup patch better handling below case:
>  for (i = 0; i < n; i++)
>{
>  a[i] = 1;
>  a[i+2] = 2;
>}
> Instead of generating root variables by loading from memory and propagating 
> with PHI
> nodes, like:
>  t0 = a[0];
>  t1 = a[1];
>  for (i = 0; i < n; i++)
>{
>  a[i] = 1;
>  t2 = 2;
>  t0 = t1;
>  t1 = t2;
>}
>  a[n] = t0;
>  a[n+1] = t1;
> We can simply store loop invariant values after loop body if we know loop 
> iterates more
> than chain->length times, like:
>  for (i = 0; i < n; i++)
>{
>  a[i] = 1;
>}
>  a[n] = 2;
>  a[n+1] = 2;
>
> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
Updated patch wrt changes in the previous patch.
Bootstrap and test on x86_64 and AArch64.  Is it OK?

Thanks,
bin
>
> Thanks,
> bin
> 2017-06-21  Bin Cheng  
>
> * tree-predcom.c: (struct chain): Handle store-store chain in which
> stores for elimination only store loop invariant values.
> (execute_pred_commoning_chain): Ditto.
> (prepare_initializers_chain_store_elim): Ditto.
> (prepare_finalizers): Ditto.
> (is_inv_store_elimination_chain): New function.
> (initialize_root_vars_store_elim_1): New function.
From 7b03b8fe27ce144b557cfd10143e2531323164db Mon Sep 17 00:00:00 2001
From: Bin Cheng 
Date: Wed, 5 Jul 2017 17:30:38 +0100
Subject: [PATCH 6/6] inv-store-elimination-20170627.txt

---
 gcc/tree-predcom.c | 131 ++---
 1 file changed, 125 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-predcom.c b/gcc/tree-predcom.c
index ac0d0d0..d2738a5 100644
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -331,6 +331,10 @@ typedef struct chain
 
   /* True if this chain was combined together with some other chain.  */
   unsigned combined : 1;
+
+  /* True if this is store elimination chain and eliminated stores store
+ loop invariant value into memory.  */
+  unsigned inv_store_elimination : 1;
 } *chain_p;
 
 
@@ -1634,6 +1638,98 @@ initialize_root_vars (struct loop *loop, chain_p chain, bitmap tmp_vars)
 }
 }
 
+/* For inter-iteration store elimination CHAIN in LOOP, returns true if
+   all stores to be eliminated store loop invariant values into memory.
+   In this case, we can use these invariant values directly after LOOP.  */
+
+static bool
+is_inv_store_elimination_chain (struct loop *loop, chain_p chain)
+{
+  if (chain->length == 0 || chain->type != CT_STORE_STORE)
+return false;
+
+  gcc_assert (!chain->has_max_use_after);
+
+  /* If loop iterates for unknown times or fewer times than chain->length,
+ we still need to setup root variable and propagate it with PHI node.  */
+  tree niters = number_of_latch_executions (loop);
+  if (TREE_CODE (niters) != INTEGER_CST || wi::leu_p (niters, chain->length))
+return false;
+
+  /* Check stores in chain for elimination if they only store loop invariant
+ values.  */
+  for (unsigned i = 0; i < chain->length; i++)
+{
+  dref a = get_chain_last_ref_at (chain, i);
+  if (a == NULL)
+	continue;
+
+  gimple *def_stmt, *stmt = a->stmt;
+  if (!gimple_assign_single_p (stmt))
+	return false;
+
+  tree val = gimple_assign_rhs1 (stmt);
+  if (TREE_CLOBBER_P (val))
+	return false;
+
+  if (TREE_CODE (val) == INTEGER_CST || TREE_CODE (val) == REAL_CST)
+	continue;
+
+  if (TREE_CODE (val) != SSA_NAME)
+	return false;
+
+  def_stmt = SSA_NAME_DEF_STMT (val);
+  if (gimple_nop_p (def_stmt))
+	continue;
+
+  if (flow_bb_inside_loop_p (loop, gimple_bb (def_stmt)))
+	return false;
+}
+  return true;
+}
+
+/* Creates root variables for store elimination CHAIN in which stores for
+   elimination only store loop invariant values.  In this case, we neither
+   need to load root variables before loop nor propagate it with PHI nodes.  */
+
+static void
+initialize_root_vars_store_elim_1 (chain_p chain)
+{
+  tree var;
+  unsigned i, n = chain->length;
+
+  chain->vars.create (n);
+  chain->vars.safe_grow_cleared (n);
+
+  /* Initialize root value for eliminated stores at each distance.  */
+  for (i = 0; i < n; i++)
+{
+  dref a = get_chain_last_ref_at (chain, i);
+  if (a == NULL)
+	continue;
+
+  var = gimple_assign_rhs1 (a->stmt);
+  chain->vars[a->distance] = var;
+}
+
+  /* We don't propagate values with PHI nodes, so manually propagate value
+ to bubble positions.  */
+  var = chain->vars[0];
+  for (i = 1; i < n; i++)
+{
+  if (chain->vars[i] != NULL_TREE)
+	{
+	  var = chain->vars[i];
+	  continue;
+	}
+  chain->vars[i] = var;
+}
+
+  /* Reverse the vector.  */
+  for (i = 0; i < n / 2; i++)
+    std::swap (chain->vars[i], chain->vars[n - i - 1]);
+}
+
 /* 
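
A worked illustration of initialize_root_vars_store_elim_1 above (an
editorial example, not part of the patch): for a chain of length n = 3 with
last stores v0 at distance 0 and v2 at distance 2 but none at distance 1,
the vars vector is first filled as [v0, NULL, v2]; the forward pass replaces
the bubble with the most recent value seen, giving [v0, v0, v2]; the final
swap loop then reverses the vector to [v2, v0, v0].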

Re: [PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

2017-07-10 Thread Bin.Cheng
On Tue, Jul 4, 2017 at 1:29 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Tue, Jul 4, 2017 at 2:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>> Hi,
>>>>>> For the moment, tree-predcom.c only supports 
>>>>>> invariant/load-loads/store-loads chains.
>>>>>> This patch generalizes dead store elimination (or store motion) across 
>>>>>> loop iterations in
>>>>>> predictive commoning pass by supporting store-store chain.  As comment 
>>>>>> in the patch:
>>>>>>
>>>>>>Apart from predictive commoning on Load-Load and Store-Load chains, we
>>>>>>also support Store-Store chains -- stores killed by other store can be
>>>>>>eliminated.  Given below example:
>>>>>>
>>>>>>  for (i = 0; i < n; i++)
>>>>>>{
>>>>>>  a[i] = 1;
>>>>>>  a[i+2] = 2;
>>>>>>}
>>>>>>
>>>>>>It can be replaced with:
>>>>>>
>>>>>>  t0 = a[0];
>>>>>>  t1 = a[1];
>>>>>>  for (i = 0; i < n; i++)
>>>>>>{
>>>>>>  a[i] = 1;
>>>>>>  t2 = 2;
>>>>>>  t0 = t1;
>>>>>>  t1 = t2;
>>>>>>}
>>>>>>  a[n] = t0;
>>>>>>  a[n+1] = t1;
>>>>>>
>>>>If the loop runs more than 1 iteration, it can be further simplified
>>>>>> into:
>>>>>>
>>>>>>  for (i = 0; i < n; i++)
>>>>>>{
>>>>>>  a[i] = 1;
>>>>>>}
>>>>>>  a[n] = 2;
>>>>>>  a[n+1] = 2;
>>>>>>
>>>>>>The interesting part is this can be viewed either as general store 
>>>>>> motion
>>>>>>or general dead store elimination in either intra/inter-iterations 
>>>>>> way.
>>>>>>
>>>>>> There are a number of interesting facts about this enhancement:
>>>>>> a) This patch supports dead store elimination for both across-iteration 
>>>>>> case and single-iteration
>>>>>>  case.  For the latter, it is dead store elimination.
>>>>>> b) There are advantages supporting dead store elimination in predcom, 
>>>>>> for example, it has
>>>>>>  complete information about memory address.  On the contrary, DSE 
>>>>>> pass can only handle
>>>>>>  memory references with exactly the same memory address expression.
>>>>>> c) It's cheap to support store-stores chain in predcom based on existing 
>>>>>> code.
>>>>>> d) As commented, the enhancement can be viewed as either generalized 
>>>>>> dead store elimination
>>>>>>  or generalized store motion.  I prefer DSE here.
>>>>>>
>>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>>>>>
>>>>> Looks mostly ok.  I have a few questions though.
>>>>>
>>>>> +  /* Don't do store elimination if loop has multiple exit edges.  */
>>>>> +  bool eliminate_store_p = single_exit (loop) != NULL;
>>>>>
>>>>> handling this would be an enhancement?  IIRC LIM store-motion handles this
>>>>> just fine by emitting code on all exits.
>>>> It is an enhancement with a little bit more complication.  We would
>>>> need to setup/record finalizer memory references for different exit
>>>> edges.  I added TODO description for this (and following one).  Is it
>>>> okay to pick this up in the future?
>>>
>>> Yes.
>>>
>>>>>
>>>>> @@ -1773,6 +2003,9 @@ determine_unroll_factor (vec<chain_p> chains)
>>>>>  {
>>>>>   

Re: [PATCH GCC][2/2]Refine CFG and bound information for split loops

2017-07-04 Thread Bin.Cheng
On Fri, Jun 30, 2017 at 5:09 PM, Jeff Law  wrote:
> On 06/14/2017 07:08 AM, Bin Cheng wrote:
>> Hi,
>> Loop split currently generates below control flow graph for split loops:
>> +
>> +   .-- guard1  --.
>> +   v v
>> + pre1(loop1).-->pre2(loop2)
>> +  | ||
>> +.--->h1 |   h2<.
>> +| | || |
>> +|ex1---.|   .---ex2|
>> +|/ v|   | \|
>> +'---l1 guard2---'   | l2---'
>> +   ||
>> +   ||
>> +   '--->join<---'
>> +
>> In which,
>> +   LOOP2 is the second loop after split, GUARD1 and GUARD2 are the two bbs
>> +   controling if loop2 should be executed.
>>
>> Take added test case as an example, the second loop only iterates for 1 time,
>> as a result, the CFG and loop niter bound information can be refined.  In 
>> general,
>> guard1/guard2 can be folded to true/false if loop2's niters is known at 
>> compilation
>> time.  This patch does such improvement by analyzing and refining niters of
>> loop2; as well as using that information to simplify CFG.  With this patch,
>> the second split loop as in test can be completely unrolled by later passes.
> In your testcase the second loop iterates precisely once, right?  In
> fact, we know it always executes precisely one time regardless of the
> incoming value of LEN.  That's the property you're trying to detect and
> exploit, right?
>
>
>>
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> Thanks,
>> bin
>> 2017-06-12  Bin Cheng  
>>
>>   * tree-ssa-loop-split.c (compute_new_first_bound): Compute and
>>   return bound information for the second split loop.
>>   (adjust_loop_split): New function.
>>   (split_loop): Update calls to above functions.
>>
>> gcc/testsuite/ChangeLog
>> 2017-06-12  Bin Cheng  
>>
>>   * gcc.dg/loop-split-1.c: New test.
>>
>>
>> 0002-lsplit-refine-cfg-niter-bound-20170601.txt
>>
>>
>> From 61855c74c7db6178008f007198aedee9a03f13e6 Mon Sep 17 00:00:00 2001
>> From: amker 
>> Date: Sun, 4 Jun 2017 02:26:34 +0800
>> Subject: [PATCH 2/2] lsplit-refine-cfg-niter-bound-20170601.txt
>>
>> ---
>>  gcc/testsuite/gcc.dg/loop-split-1.c |  16 +++
>>  gcc/tree-ssa-loop-split.c   | 197 
>> 
>>  2 files changed, 194 insertions(+), 19 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/loop-split-1.c
>>
>> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
>> index f8fe8e6..73c7dc2 100644
>> --- a/gcc/tree-ssa-loop-split.c
>> +++ b/gcc/tree-ssa-loop-split.c
>> @@ -387,28 +386,41 @@ connect_loops (struct loop *loop1, struct loop *loop2)
>>
>> Depending on the direction of the IVs and if the exit tests
>> are strict or non-strict we need to use MIN or MAX,
>> -   and add or subtract 1.  This routine computes newend above.  */
>> +   and add or subtract 1.  This routine computes newend above.
>> +
>> +   After computing the new bound (on j), we may be able to compare the
>> +   first loop's iteration space against the original loop's.  If it is
>> +   comparable at compilation time, we may be able to compute the niter
>> +   bound of the second loop.  Record the second loop's iteration bound
>> +   to SECOND_LOOP_NITER_BOUND which has below meaning:
>> +
>> + -3: Don't know anything about the second loop;
>> + -2: The second loop must not be executed;
>> + -1: The second loop must be executed, but niter bound is unknown;
>> +  n: The second loop must be executed, niter bound is n (>= 0);
>> +
>> +   Note we compute niter bound for the second loop's latch.  */
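
To tie the encoding to the questions below (an editorial example, not from
the patch): when the guard proves the second loop is entered but runs
exactly one iteration, its latch executes zero times, so
SECOND_LOOP_NITER_BOUND is 0, i.e. the "n" case with n = 0; -2 corresponds
to a guard that folds to false, and -1 to a guard provably true whose
iteration count does not simplify to a constant.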
> How hard would it be to have a test for each of the 4 cases above?  I
> don't always ask for this, but ISTM this is a good chance to make sure
> most of the new code gets covered by testing.
>
> Does case -2 really occur in practice or are you just being safe?  ISTM
> that if case -2 occurs then the loop shouldn't have been split to begin
> with.  If this happens in practice, do we clean up all the dead code if
> the bounds are set properly?
>
> Similarly for N = 0, presumably meaning the second loop iterates exactly
> once, but never traverses its backedge, is there any value in changing
> the loop latch test so that it's always false?  Or will that get cleaned
> up later?
>
>
>>
>>  static tree
>> -compute_new_first_bound (gimple_seq *stmts, struct tree_niter_desc *niter,
>> -  tree border,
>> -  enum tree_code guard_code, tree guard_init)
>> +compute_new_first_bound (struct tree_niter_desc *niter, tree border,
>> +  enum tree_code guard_code, tree guard_init,
>> +  tree step, HOST_WIDE_INT *second_loop_niter_bound)
>>  {

Re: [PATCH GCC8][33/33]Fix PR69710/PR68030 by reassociate vect base address and a simple CSE pass

2017-07-04 Thread Bin.Cheng
On Mon, Jul 3, 2017 at 5:12 PM, Jeff Law  wrote:
> On 04/18/2017 04:54 AM, Bin Cheng wrote:
>> Hi,
>> This is the same patch posted at 
>> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html,
>> after rebase against this patch series.  This patch was blocked because 
>> without this patch
>> series, it could generate worse code on targets with limited addressing mode 
>> support, like
>> AArch64.
>> There was some discussion about alternative fix for PRs, but after thinking 
>> twice I think
>> this fix is in the correct direction.  A CSE interface is useful to clean up 
>> code generated
>> in vectorizer, and we should improve this CSE interface into a region base 
>> one.  for the
>> moment, optimal code is not generated on targets like x86, I believe it's 
>> because the CSE
>> is weak and doesn't cover all basic blocks generated by vectorizer, the 
>> issue should be
>> fixed if region-based CSE is implemented.
>> Is it OK?
>>
>> Thanks,
>> bin
>> 2017-04-11  Bin Cheng  
>>
>>   PR tree-optimization/68030
>>   PR tree-optimization/69710
>>   * tree-ssa-dom.c (cse_bbs): New function.
>>   * tree-ssa-dom.h (cse_bbs): New declaration.
>>   * tree-vect-data-refs.c (vect_create_addr_base_for_vector_ref):
>>   Re-associate address by splitting constant offset.
>>   (vect_create_data_ref_ptr, vect_setup_realignment): Record changed
>>   basic block.
>>   * tree-vect-loop-manip.c (vect_gen_prolog_loop_niters): Record
>>   changed basic block.
>>   * tree-vectorizer.c (tree-ssa-dom.h): Include header file.
>>   (changed_bbs): New variable.
>>   (vectorize_loops): Allocate and free CHANGED_BBS.  Call cse_bbs.
>>   * tree-vectorizer.h (changed_bbs): New declaration.
>>
> So are you still interested in moving this forward Bin?  I know you did
> a minor update in response to Michael Meissner's problems.  Is there
> another update planned?
Hi,
Yes, I want to move this forward, but preferably with confirmation of
whether the regression is real or just a micro-architectural hazard.
OTOH, given the preceding 32 patches have been applied, I think there
is no need to hold this one.  I guess it is a preceding patch, rather
than this one, that causes the regression.

>
> The only obvious thing I'd suggest changing in the DOM interface is to
> have it continue to walk the dominator tree, but do nothing for blocks that
> are not in changed_bbs.  That way you walk blocks in changed_bbs in
> dominator order rather than in bb->index order.
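
A sketch of this suggestion (editorial; it assumes GCC's dom_walker API as
of 2017, and optimize_block stands in for the hypothetical per-block CSE
worker):

  class cse_dom_walker : public dom_walker
  {
  public:
    cse_dom_walker (bitmap changed)
      : dom_walker (CDI_DOMINATORS), m_changed_bbs (changed) {}

    virtual edge before_dom_children (basic_block bb)
    {
      /* Walk every block, but only act on the ones recorded as changed,
         so those are still visited in dominator order.  */
      if (bitmap_bit_p (m_changed_bbs, bb->index))
        optimize_block (bb);
      return NULL;
    }

  private:
    bitmap m_changed_bbs;
  };

  /* Usage: cse_dom_walker (changed_bbs).walk (ENTRY_BLOCK_PTR_FOR_FN (cfun)); */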
I will update the patch as you suggested, but it may take some time.

Thanks,
bin
>
> Jeff
>


Re: [PATCH GCC][3/4]Generalize dead store elimination (or store motion) across loop iterations in predcom

2017-07-04 Thread Bin.Cheng
On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> For the moment, tree-predcom.c only supports 
>>>> invariant/load-loads/store-loads chains.
>>>> This patch generalizes dead store elimination (or store motion) across 
>>>> loop iterations in
>>>> predictive commoning pass by supporting store-store chain.  As comment in 
>>>> the patch:
>>>>
>>>>Apart from predictive commoning on Load-Load and Store-Load chains, we
>>>>also support Store-Store chains -- stores killed by other store can be
>>>>eliminated.  Given below example:
>>>>
>>>>  for (i = 0; i < n; i++)
>>>>{
>>>>  a[i] = 1;
>>>>  a[i+2] = 2;
>>>>}
>>>>
>>>>It can be replaced with:
>>>>
>>>>  t0 = a[0];
>>>>  t1 = a[1];
>>>>  for (i = 0; i < n; i++)
>>>>{
>>>>  a[i] = 1;
>>>>  t2 = 2;
>>>>  t0 = t1;
>>>>  t1 = t2;
>>>>}
>>>>  a[n] = t0;
>>>>  a[n+1] = t1;
>>>>
>>>>If the loop runs more than 1 iteration, it can be further simplified
>>>> into:
>>>>
>>>>  for (i = 0; i < n; i++)
>>>>{
>>>>  a[i] = 1;
>>>>}
>>>>  a[n] = 2;
>>>>  a[n+1] = 2;
>>>>
>>>>The interesting part is this can be viewed either as general store 
>>>> motion
>>>>or general dead store elimination in either intra/inter-iterations way.
>>>>
>>>> There are a number of interesting facts about this enhancement:
>>>> a) This patch supports dead store elimination for both across-iteration 
>>>> case and single-iteration
>>>>  case.  For the latter, it is dead store elimination.
>>>> b) There are advantages supporting dead store elimination in predcom, for 
>>>> example, it has
>>>>  complete information about memory address.  On the contrary, DSE pass 
>>>> can only handle
>>>>  memory references with exactly the same memory address expression.
>>>> c) It's cheap to support store-stores chain in predcom based on existing 
>>>> code.
>>>> d) As commented, the enhancement can be viewed as either generalized dead 
>>>> store elimination
>>>>  or generalized store motion.  I prefer DSE here.
>>>>
>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64.  Is it OK?
>>>
>>> Looks mostly ok.  I have a few questions though.
>>>
>>> +  /* Don't do store elimination if loop has multiple exit edges.  */
>>> +  bool eliminate_store_p = single_exit (loop) != NULL;
>>>
>>> handling this would be an enhancement?  IIRC LIM store-motion handles this
>>> just fine by emitting code on all exits.
>> It is an enhancement with a little bit more complication.  We would
>> need to setup/record finalizer memory references for different exit
>> edges.  I added TODO description for this (and following one).  Is it
>> okay to pick this up in the future?
>
> Yes.
>
>>>
>>> @@ -1773,6 +2003,9 @@ determine_unroll_factor (vec<chain_p> chains)
>>>  {
>>>if (chain->type == CT_INVARIANT)
>>> continue;
>>> +  /* Don't unroll when eliminating stores.  */
>>> +  else if (chain->type == CT_STORE_STORE)
>>> +   return 1;
>>>
>>> this is a hard exit value so we do not handle the case where another chain
>>> in the loop would want to unroll? (enhancement?)  I'd have expected to do
>>> the same as for CT_INVARIANT here.
>> I didn't check what change is needed in case of unrolling.  I am not
>> very sure if we should prefer unrolling for *load chains or prefer not
>> unrolling for store-store chains, because unrolling in general increases
>> loop-carried register pressure for store-store chains rather than
>> decreasing register pressure for *load chains.
>> I was also thinking if it's possible to restrict un
