Richard,

Thanks a lot for your detailed comments!
A few words about the 436.cactusADM gain: the loop transformed for AVX2 is
very large and is the last innermost loop in routine Bench_StaggeredLeapfrog2
(StaggeredLeapfrog2.F #366). If you don't have the sources, let me know.

Yuri.

2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>> Hi Richard,
>>
>> Here is an updated version of the patch which (1) is in sync with the
>> trunk compiler and (2) contains a simple cost model to estimate the
>> profitability of scalar epilogue elimination. The part related to
>> vectorization of loops with small trip counts is still in development.
>> Note that the implemented cost model was not tuned well for HASWELL and
>> KNL, but we got ~6% speed-up on 436.cactusADM from the spec2006 suite
>> for HASWELL.
>
> Ok, so I don't know where to start with this.
>
> First of all, while I wanted to have the actual stmt processing done as
> post-processing on the vectorized loop body, I didn't want to have this
> completely separated from vectorizing.
>
> So, do combine_vect_loop_remainder () from vect_transform_loop, not by
> iterating over all (vectorized) loops at the end.
>
> Second, all the adjustments of the number of iterations for the vector
> loop should be integrated into the main vectorization scheme, as should
> determining the cost of the predication. So you'll end up adding a
> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during cost
> analysis, and during code generation adjust the vector iteration
> computation accordingly and _not_ generate the epilogue loop (or wire it
> up correctly in the first place).
>
> The actual stmt processing should then still happen in a similar way as
> you do.
>
> So I'm going to comment on that part only as I expect the rest will look
> a lot different.
>
> +/* Generate induction_vector which will be used to mask evaluation.
> +   */
> +
> +static tree
> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
> +{
>
> please make use of create_iv. Add more comments. I reverse-engineered
> that you add a { { 0, ..., vf }, +, { vf, ..., vf } } IV which you use
> in gen_mask_for_remainder by comparing it against { niter, ..., niter }.
>
> + gsi = gsi_after_labels (loop->header);
> + niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +          ? LOOP_VINFO_NITERS (loop_vinfo)
> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>
> that's either wrong or unnecessary. If not peeling for alignment,
> loop-vinfo-niters is equal to loop-vinfo-niters-unchanged.
>
> + ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> + if (!SSA_NAME_PTR_INFO (addr))
> +   copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>
> vect_duplicate_ssa_name_ptr_info.
>
> +
> +static void
> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
> +{
> + gimple *stmt, *new_stmt;
> + tree old, lhs, vectype, var, n_lhs;
>
> no comment? What's this for?
>
> +/* Convert vectorized reductions to VEC_COND statements to preserve
> +   reduction semantics:
> +   s1 = x + s2  -->  t = x + s2; s1 = (mask) ? t : s2.  */
> +
> +static void
> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
> +{
>
> for reductions it looks like preserving the last iteration's x plus the
> mask could avoid predicating it this way and compensate in the reduction
> epilogue by "subtracting" x & mask? With true predication support
> that'll likely be more expensive of course.
>
> + /* Generate new VEC_COND expr. */
> + vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
> + new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>
> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>
> +/* Return true if MEM_REF is incremented by vector size and false otherwise.
> +   */
> +
> +static bool
> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>
> what?! Just look at DR_STEP of the store?
>
>
> +void
> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
> +{
> + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> + auto_vec<gimple *, 10> loads;
> + auto_vec<gimple *, 5> stores;
>
> so you need to re-structure this in a way that it computes
>
> a) whether it can perform the operation - and you need to do that
>    reliably before the operation has taken place
> b) its cost
>
> instead of looking at def types or gimple_assign_load/store_p predicates
> please look at STMT_VINFO_TYPE instead.
>
> I don't like the new target hook for the costing. We do need some major
> re-structuring in the vectorizer cost model implementation; this doesn't
> go in the right direction.
>
> A simplistic hook following the current scheme would have used
> vect_cost_for_stmt as argument and mirrored builtin_vectorization_cost.
>
> There is not a single testcase in the patch. I would have expected one
> that makes sure we keep the 6% speedup for cactusADM at least.
>
>
> So this was a 45-minute "overall" review not going into all the
> implementation details.
>
> Thanks,
> Richard.
>
>
>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich....@gmail.com> wrote:
>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>>>>> Richard,
>>>>>>
>>>>>> It looks like a misunderstanding - we assumed that for GCC v6 the
>>>>>> simple scheme of remainder would be used through introducing a new IV:
>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>
>>>>>> Is it true or did we miss something?
>>>>>
>>>>> <quote>
>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>> > for both 4b and 4c?
>>>>>>
>>>>>> Do 2a ...
>>>>> Okay.
>>>>> </quote>
>>>>
>>>> 2a was 'transform already vectorized loop as a separate
>>>> post-processing'. Isn't that what this prototype patch implements?
>>>> The current version only masks the loop body, which in practice is
>>>> applicable to AVX-512 only in most cases. With AVX-512 it's easier to
>>>> see how profitable masking might be, and it is the main target for the
>>>> first masking version. Extending it to prologues/epilogues and thus
>>>> making it more profitable for other targets is the next step and is
>>>> out of the scope of this patch.
>>>
>>> Ok, technically the prototype transforms the already vectorized loop.
>>> Of course I meant the vectorized loop be copied, masked and that
>>> result used as epilogue...
>>>
>>> I'll queue a more detailed look into the patch for this week.
>>>
>>> Did you perform any measurements with this patch like # of
>>> masked epilogues in SPEC 2006 FP (and any speedup)?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> Ilya
>>>>
>>>>>
>>>>> Richard.