Richard,

Thanks a lot for your detailed comments!
A few words about the 436.cactusADM gain: the loop transformed for AVX2 is
very large and is the last innermost loop in routine Bench_StaggeredLeapfrog2
(StaggeredLeapfrog2.F #366). If you don't have the sources, let me know.

Yuri.

2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>> Hi Richard,
>>
>> Here is an updated version of the patch which (1) is in sync with the
>> trunk compiler and (2) contains a simple cost model to estimate the
>> profitability of scalar epilogue elimination. The part related to
>> vectorization of loops with small trip counts is still in development.
>> Note that the implemented cost model was not tuned well for HASWELL and
>> KNL, but we got ~6% speed-up on 436.cactusADM from the spec2006 suite
>> for HASWELL.
>
> Ok, so I don't know where to start with this.
>
> First of all, while I wanted to have the actual stmt processing done as
> post-processing on the vectorized loop body, I didn't want to have this
> completely separated from vectorizing.
>
> So, do combine_vect_loop_remainder () from vect_transform_loop, not by
> iterating over all (vectorized) loops at the end.
>
> Second, all the adjustments of the number of iterations for the vector
> loop should be integrated into the main vectorization scheme, as should
> determining the cost of the predication. So you'll end up adding a
> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during cost
> analysis, and during code generation adjust the vector iteration
> computation accordingly and _not_ generate the epilogue loop (or wire it
> up correctly in the first place).
>
> The actual stmt processing should then still happen in a similar way as
> you do.
>
> So I'm going to comment on that part only as I expect the rest will look
> a lot different.
>
> +/* Generate induction_vector which will be used to mask evaluation.
> +   */
> +
> +static tree
> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
> +{
>
> please make use of create_iv. Add more comments. I reverse-engineered
> that you add a { { 0, ..., vf }, +, { vf, ..., vf } } IV which you use
> in gen_mask_for_remainder by comparing it against { niter, ..., niter }.
>
> + gsi = gsi_after_labels (loop->header);
> + niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +          ? LOOP_VINFO_NITERS (loop_vinfo)
> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>
> that's either wrong or unnecessary. If not peeling for alignment,
> loop-vinfo-niters is equal to loop-vinfo-niters-unchanged.
>
> + ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> + if (!SSA_NAME_PTR_INFO (addr))
> +   copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>
> vect_duplicate_ssa_name_ptr_info.
>
> +
> +static void
> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
> +{
> + gimple *stmt, *new_stmt;
> + tree old, lhs, vectype, var, n_lhs;
>
> no comment? What's this for?
>
> +/* Convert vectorized reductions to VEC_COND statements to preserve
> +   reduction semantics:
> +   s1 = x + s2  -->  t = x + s2; s1 = (mask) ? t : s2.  */
> +
> +static void
> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
> +{
>
> for reductions it looks like preserving the last iteration's x plus the
> mask could avoid predicating it this way and compensate in the reduction
> epilogue by "subtracting" x & mask? With true predication support
> that'll likely be more expensive of course.
>
> + /* Generate new VEC_COND expr. */
> + vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
> + new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>
> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>
> +/* Return true if MEM_REF is incremented by vector size and false otherwise.
> +   */
> +
> +static bool
> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>
> what?! Just look at DR_STEP of the store?
>
>
> +void
> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> + auto_vec<gimple *, 10> loads;
> + auto_vec<gimple *, 5> stores;
>
> so you need to re-structure this in a way that it computes
>
> a) whether it can perform the operation - and you need to do that
>    reliably before the operation has taken place
> b) its cost
>
> instead of looking at def types or gimple_assign_load/store_p predicates
> please look at STMT_VINFO_TYPE instead.
>
> I don't like the new target hook for the costing. We do need some major
> re-structuring in the vectorizer cost model implementation; this doesn't
> go in the right direction.
>
> A simplistic hook following the current scheme would have used
> vect_cost_for_stmt as argument and mirrored builtin_vectorization_cost.
>
> There is not a single testcase in the patch. I would have expected one
> that makes sure we keep the 6% speedup for cactusADM at least.
>
>
> So this was a 45-minute "overall" review not going into all the
> implementation details.
>
> Thanks,
> Richard.
>
>
>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich....@gmail.com> wrote:
>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>>>>> Richard,
>>>>>>
>>>>>> It looks like a misunderstanding - we assumed that for GCC v6 the
>>>>>> simple scheme of remainder would be used through introducing a new IV:
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>
>>>>>> Is it true or did we miss something?
>>>>>
>>>>> <quote>
>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>> > for both 4b and 4c?
>>>>>>
>>>>>> Do 2a ...
>>>>> Okay.
>>>>> </quote>
>>>>
>>>> 2a was 'transform already vectorized loop as a separate
>>>> post-processing'. Isn't that what this prototype patch implements?
>>>> The current version only masks the loop body, which in practice is
>>>> applicable to AVX-512 only in most cases. With AVX-512 it's easier to
>>>> see how profitable masking might be, and it is the main target for the
>>>> first masking version. Extending it to prologues/epilogues and thus
>>>> making it more profitable for other targets is the next step and is
>>>> out of the scope of this patch.
>>>
>>> Ok, technically the prototype transforms the already vectorized loop.
>>> Of course I meant the vectorized loop be copied, masked and that
>>> result used as epilogue...
>>>
>>> I'll queue a more detailed look into the patch for this week.
>>>
>>> Did you perform any measurements with this patch like # of
>>> masked epilogues in SPEC 2006 FP (and any speedup)?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Ilya
>>>>
>>>>>
>>>>> Richard.