on 2023/8/15 20:07, Richard Biener wrote:
> On Tue, Aug 15, 2023 at 1:47 PM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> on 2023/8/15 15:53, Richard Biener wrote:
>>> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>>
>>>> on 2023/8/14 22:16, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>>>> Hi Richard,
>>>>>>
>>>>>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>>>>>> Thanks for the clean-ups.  But...
>>>>>>>
>>>>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Following Richi's suggestion [1], this patch moves the
>>>>>>>> handling of VMAT_GATHER_SCATTER in the final loop nest
>>>>>>>> of function vectorizable_load to its own loop.  Basically
>>>>>>>> it duplicates the final loop nest, cleans up some useless
>>>>>>>> set-up code for the VMAT_GATHER_SCATTER case, and removes
>>>>>>>> some unreachable code.  It also removes the corresponding
>>>>>>>> handling from the final loop nest.
>>>>>>>>
>>>>>>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>>>>>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>>>>>>
>>>>>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>>>>>>
>>>>>>>> Is it ok for trunk?
>>>>>>>>
>>>>>>>> BR,
>>>>>>>> Kewen
>>>>>>>> -----
>>>>>>>>
>>>>>>>> gcc/ChangeLog:
>>>>>>>>
>>>>>>>>    * tree-vect-stmts.cc (vectorizable_load): Move the handling of
>>>>>>>>    VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>>>>>>>    and update the final nest accordingly.
>>>>>>>> ---
>>>>>>>>  gcc/tree-vect-stmts.cc | 361 +++++++++++++++++++++++++----------------
>>>>>>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>>>>>>>
>>>>>>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>>>>>>> avoid the cut-&-paste?
>>>>>>
>>>>>> Thanks for the comments!  I'm not sure I understand your question:
>>>>>> if we want to move out the handling of VMAT_GATHER_SCATTER, the new
>>>>>> +s seem inevitable?  Is your concern mainly about git blame history?
>>>>>
>>>>> No, it was more that 219-142=77, so it seems like a lot of lines
>>>>> are being duplicated rather than simply being moved.  (Unlike
>>>>> VMAT_LOAD_STORE_LANES, which even gave a slight LOC saving, and so
>>>>> was a clear improvement.)
>>>>>
>>>>> So I was just wondering if there was any obvious factoring-out that
>>>>> could be done to reduce the duplication.
>>>>
>>>> ah, thanks for the clarification!
>>>>
>>>> I think the main duplication is at the beginning and end of the loop
>>>> body; let's take a look at them in detail:
>>>>
>>>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>>>> +    {
>>>> +      gcc_assert (alignment_support_scheme == dr_aligned
>>>> +                 || alignment_support_scheme == dr_unaligned_supported);
>>>> +      gcc_assert (!grouped_load && !slp_perm);
>>>> +
>>>> +      unsigned int inside_cost = 0, prologue_cost = 0;
>>>>
>>>> // These above are newly added.
>>>>
>>>> +      for (j = 0; j < ncopies; j++)
>>>> +       {
>>>> +         /* 1. Create the vector or array pointer update chain.  */
>>>> +         if (j == 0 && !costing_p)
>>>> +           {
>>>> +             if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>>>> +               vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>>>> +                                            slp_node, &gs_info, &dataref_ptr,
>>>> +                                            &vec_offsets);
>>>> +             else
>>>> +               dataref_ptr
>>>> +                 = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>>>> +                                             at_loop, offset, &dummy, gsi,
>>>> +                                             &ptr_incr, false, bump);
>>>> +           }
>>>> +         else if (!costing_p)
>>>> +           {
>>>> +             gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>>>> +             if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>>>> +               dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>>>> +                                              gsi, stmt_info, bump);
>>>> +           }
>>>>
>>>> // These are for dataref_ptr; in the final loop nest we deal with more cases
>>>> on simd_lane_access_p and diff_first_stmt_info, but don't handle
>>>> STMT_VINFO_GATHER_SCATTER_P any more.  Very little (one case) can be shared
>>>> between them, so IMHO factoring it out seems like overkill.
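>>>>
>>>> Just to make the question concrete, a minimal sketch of what such a shared
>>>> helper could look like is below.  The name vect_get_load_dataref_ptr and the
>>>> exact parameter list are my own invention, not anything in the patch; the
>>>> body just wraps the calls quoted above.
>>>>
>>>> /* Hypothetical helper, not part of the patch: create (for J == 0) or
>>>>    advance (for J > 0) the data reference pointer for copy J, or compute
>>>>    the offsets once for a gather/scatter access.  */
>>>> static void
>>>> vect_get_load_dataref_ptr (vec_info *vinfo, loop_vec_info loop_vinfo,
>>>>                            class loop *loop, class loop *at_loop,
>>>>                            stmt_vec_info stmt_info,
>>>>                            stmt_vec_info first_stmt_info, slp_tree slp_node,
>>>>                            gather_scatter_info *gs_info, tree aggr_type,
>>>>                            tree offset, tree bump, gimple_stmt_iterator *gsi,
>>>>                            unsigned int j, tree *dataref_ptr,
>>>>                            vec<tree> *vec_offsets, gimple **ptr_incr)
>>>> {
>>>>   tree dummy;
>>>>   if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>>>>     {
>>>>       /* Gather/scatter offsets are created once, for all copies.  */
>>>>       if (j == 0)
>>>>         vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info, slp_node,
>>>>                                      gs_info, dataref_ptr, vec_offsets);
>>>>     }
>>>>   else if (j == 0)
>>>>     *dataref_ptr = vect_create_data_ref_ptr (vinfo, first_stmt_info,
>>>>                                              aggr_type, at_loop, offset,
>>>>                                              &dummy, gsi, ptr_incr,
>>>>                                              false, bump);
>>>>   else
>>>>     *dataref_ptr = bump_vector_ptr (vinfo, *dataref_ptr, *ptr_incr, gsi,
>>>>                                     stmt_info, bump);
>>>> }
>>>>
>>>> The sheer length of the parameter list is arguably the best argument that
>>>> factoring it out wouldn't pay off here.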
>>>>
>>>> +
>>>> +         if (mask && !costing_p)
>>>> +           vec_mask = vec_masks[j];
>>>>
>>>> // This is merged from the j == 0 and j != 0 branches.
>>>>
>>>> +
>>>> +         gimple *new_stmt = NULL;
>>>> +         for (i = 0; i < vec_num; i++)
>>>> +           {
>>>> +             tree final_mask = NULL_TREE;
>>>> +             tree final_len = NULL_TREE;
>>>> +             tree bias = NULL_TREE;
>>>> +             if (!costing_p)
>>>> +               {
>>>> +                 if (loop_masks)
>>>> +                   final_mask
>>>> +                     = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>>>> +                                           vec_num * ncopies, vectype,
>>>> +                                           vec_num * j + i);
>>>> +                 if (vec_mask)
>>>> +                   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>>>> +                                                  final_mask, vec_mask, gsi);
>>>> +
>>>> +                 if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>>>> +                   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>>>> +                                                  gsi, stmt_info, bump);
>>>> +               }
>>>>
>>>> // This part is directly copied from the original; the original gets updated
>>>> by removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't
>>>> consider factoring this out before; would you prefer me to factor this part
>>>> out?
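>>>>
>>>> For what it's worth, factoring it out could be as small as the following
>>>> sketch; the helper name vect_get_final_mask is hypothetical, and the body is
>>>> just the two calls quoted above.
>>>>
>>>> /* Hypothetical helper, not part of the patch: combine the loop mask for
>>>>    vector copy (J, I) with the conditional operation mask VEC_MASK.  */
>>>> static tree
>>>> vect_get_final_mask (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
>>>>                      vec_loop_masks *loop_masks, unsigned int vec_num,
>>>>                      unsigned int ncopies, tree vectype, tree mask_vectype,
>>>>                      tree vec_mask, unsigned int j, unsigned int i)
>>>> {
>>>>   tree final_mask = NULL_TREE;
>>>>   if (loop_masks)
>>>>     final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>>>>                                      vec_num * ncopies, vectype,
>>>>                                      vec_num * j + i);
>>>>   if (vec_mask)
>>>>     final_mask = prepare_vec_mask (loop_vinfo, mask_vectype, final_mask,
>>>>                                    vec_mask, gsi);
>>>>   return final_mask;
>>>> }
>>>>
>>>> (That would leave the i > 0 bump_vector_ptr call in the caller, since it
>>>> isn't mask related.)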
>>>>
>>>> +             if (gs_info.ifn != IFN_LAST)
>>>> +               {
>>>> ...
>>>> +               }
>>>> +             else
>>>> +               {
>>>> +                 /* Emulated gather-scatter.  */
>>>> ...
>>>>
>>>> // This part is just moved from the original.
>>>>
>>>> +             vec_dest = vect_create_destination_var (scalar_dest, vectype);
>>>> +             /* DATA_REF is null if we've already built the statement.  */
>>>> +             if (data_ref)
>>>> +               {
>>>> +                 vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>>>> +                 new_stmt = gimple_build_assign (vec_dest, data_ref);
>>>> +               }
>>>> +             new_temp = make_ssa_name (vec_dest, new_stmt);
>>>> +             gimple_set_lhs (new_stmt, new_temp);
>>>> +             vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>>>> +
>>>> +             /* Store vector loads in the corresponding SLP_NODE.  */
>>>> +             if (slp)
>>>> +               slp_node->push_vec_def (new_stmt);
>>>> +           }
>>>> +
>>>> +         if (!slp && !costing_p)
>>>> +           STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>>>> +       }
>>>> +
>>>> +      if (!slp && !costing_p)
>>>> +       *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>>>
>>>> // This part is some subsequent handling; it's duplicated from the original
>>>> but with some more useless code removed.  I guess this part is not worth
>>>> factoring out?
>>>>
>>>> +      if (costing_p)
>>>> +       {
>>>> +         if (dump_enabled_p ())
>>>> +           dump_printf_loc (MSG_NOTE, vect_location,
>>>> +                            "vect_model_load_cost: inside_cost = %u, "
>>>> +                            "prologue_cost = %u .\n",
>>>> +                            inside_cost, prologue_cost);
>>>> +       }
>>>> +      return true;
>>>> +    }
>>>>
>>>> // This duplicates the dumping; I guess it doesn't need to be factored out.
>>>>
>>>> oh, I just noticed that this should be shortened to
>>>> "if (costing_p && dump_enabled_p ())" instead, just the same as what's
>>>> adopted for the VMAT_LOAD_STORE_LANES dumping.
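>>>>
>>>> That is, the block above would simply become (same statements, just the
>>>> two conditions merged; the unconditional return stays):
>>>>
>>>> if (costing_p && dump_enabled_p ())
>>>>   dump_printf_loc (MSG_NOTE, vect_location,
>>>>                    "vect_model_load_cost: inside_cost = %u, "
>>>>                    "prologue_cost = %u .\n",
>>>>                    inside_cost, prologue_cost);
>>>> return true;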
>>>
>>> Just to mention, the original motivating idea was that even though we
>>> duplicate some code, we make it overall more readable and thus more
>>> maintainable.  In the end we might keep vectorizable_load () for analysis
>>> but have, instead of only load_vec_info_type, one type for each VMAT_*,
>>> which means multiple separate vect_transform_load () functions.  Currently
>>> vectorizable_load is structured very inconsistently; having the transforms
>>> all hang off a single switch (vmat-kind) {} would be an improvement IMHO.
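>>>
>>> To illustrate the direction (every transform function name below is made
>>> up, nothing like them exists today), the transform side could eventually
>>> be a plain dispatch:
>>>
>>> switch (memory_access_type)
>>>   {
>>>   case VMAT_LOAD_STORE_LANES:
>>>     return vect_transform_load_lanes (...);          /* hypothetical */
>>>   case VMAT_GATHER_SCATTER:
>>>     return vect_transform_load_gather_scatter (...); /* hypothetical */
>>>   case VMAT_CONTIGUOUS:
>>>   case VMAT_CONTIGUOUS_REVERSE:
>>>   case VMAT_CONTIGUOUS_PERMUTE:
>>>     return vect_transform_load_contiguous (...);     /* hypothetical */
>>>   /* ... likewise for VMAT_ELEMENTWISE, VMAT_STRIDED_SLP, etc.  */
>>>   default:
>>>     gcc_unreachable ();
>>>   }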
>>
>> Thanks for the comments!  With these two patches, the final loop nest now
>> only handles VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and
>> VMAT_CONTIGUOUS_PERMUTE.  IMHO their handling is highly intertwined;
>> restructuring them could mean more duplicated code and a risk of incomplete
>> bug fixes, as Richard pointed out.  But if I read the above comments right,
>> our final goal seems to be to separate all of them?  I wonder if you both
>> prefer to separate them further?
> 
> I'd leave those together; they share too much code.

Got it, thanks for clarifying; that matches my earlier thought. :)

BR,
Kewen
