On Thu, May 19, 2016 at 9:35 PM, Ilya Enkovich <enkovich....@gmail.com> wrote:
> Hi,
>
> This series is an extension of previous work on loop epilogue combining [1].
>
> It introduces three ways to handle vectorized loop epilogues: combine it with
> the vectorized loop, vectorize it with masks, or vectorize it using a smaller
> vector size.
>
> It also supports vectorization of loops with a low trip count.
>
> Epilogue combining is used as the basic masking transformation.  Epilogue
> masking and low trip count loop vectorization are treated as epilogue
> combining with a zero trip count vector loop.
>
> Epilogue vectorization is controlled via the new option
> -ftree-vectorize-epilogues=
> which takes a comma-separated list of enabled modes, which include combine,
> mask, nomask.  There is a separate option -ftree-vectorize-short-loops for
> low trip count loops.
>
> To support epilogue vectorization I use a queue of loops to be vectorized in
> vectorize_loops and change vect_transform_loop to return the generated
> epilogue (in case we want to try to vectorize it).  If an epilogue is returned
> then it is queued for processing.  This variant of epilogue processing was
> chosen because it is simple and works for all epilogue processing options.
>
> There are currently some limitations implied by this scheme:
>  - The copied loop misses some required optimization info (e.g. scev info),
>    which may result in an epilogue which cannot be vectorized
>  - The loop epilogue may require if-conversion
>  - Alias/alignment checks are not inherited and therefore will be performed
>    one more time for the epilogue.  For now epilogue vectorization is just
>    disabled in case alias versioning is required, and alignment enhancement
>    is disabled for epilogues.
>
> There is a set of new fields added to _loop_vec_info to support epilogue
> vectorization.
>
> LOOP_VINFO_CAN_BE_MASKED - true if the vectorized loop can be masked.  It is
> computed during vectorization analysis (in various vectorizable_* functions).
>
> LOOP_VINFO_REQUIRED_MASKS - for a loop which can be masked, it holds all
> masks required to mask the loop.
>
> LOOP_VINFO_COMBINE_EPILOGUE - true if we decided the vectorized loop should
> be masked.
>
> LOOP_VINFO_MASK_EPILOGUE - true if we decided an epilogue of this loop
> should be vectorized and masked.
>
> LOOP_VINFO_NEED_MASKING - true if the vectorized loop has to be masked (set
> for epilogues we want to mask and low trip count loops).
>
> LOOP_VINFO_ORIG_LOOP_INFO - for epilogues this holds the loop_vec_info of the
> original vectorized loop.
>
> To decide whether we want to mask or combine a loop epilogue, the cost model
> is extended with masking costs.  This includes vect_masking_prologue and
> vect_masking_body elements added to the vect_cost_model_location enum, and
> finish_cost extended with two additional returned values correspondingly.
> In addition to add_stmt_cost I also add add_stmt_masking_cost to compute
> a cost for masking a statement.
>
> vect_estimate_min_profitable_iters checks if epilogue masking is profitable
> and also computes the number of iterations required to have profitable
> epilogue combining (this number may be used as a threshold in the vectorized
> loop guard).
>
> These patches do not enable any of the new features by default at any
> optimization level.  Masking features are expected to be mostly used for
> AVX-512 targets, and the lack of hardware suitable for wide performance
> testing is the reason the cost model is not tuned and the optimizations are
> not enabled by default.  With small tests using a small number of loop
> iterations and 'heavy' epilogues (e.g. number of iterations is VF*2-1) I see
> the expected ~2x gain on existing KNL hardware.
> Later this year we expect to get access to KNL machines and have an
> opportunity to tune the masking cost model.
>
> On Haswell hardware I don't see performance gains on similar loops, which
> means masked code is not better than scalar code when we have heavy mask usage.
> This still might be useful in case the number of statements requiring masking
> is relatively small (I used the test a[i] += b[i], which needs masking for
> 3 out of 4 vector statements).  We will continue to search for cases where
> masking is profitable on Haswell to tune masking costs appropriately.
So I've gone over the patches and gave mostly high-level comments.  The
vectorizer is already in a somewhat messy (aka not easy to follow) state, and
this series doesn't improve the situation (heh).  Especially the high-level
structure for code generation and its documentation need work (where we do
versioning / peeling, and how we use the copies, in which condition and
where, etc.).

Now - given my question on the profitability code for vectorized body masking,
I wonder if vectorized body masking wouldn't be better done by adding another
version for low trip count loops (not < vf but say < vf * N, with N determined
by a cost model).  Otherwise I can't see how we'd ever mask the vectorized body
for loops with a parametric number of iterations (most loops in real life).

Thanks,
Richard.

> Below are ChangeLogs for the whole series.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2015-10/msg03014.html
>
> Thanks,
> Ilya
> --
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * common.opt (flag_tree_vectorize_epilogues): New.
>         (ftree-vectorize-short-loops): New.
>         (ftree-vectorize-epilogues=): New.
>         (fno-tree-vectorize-epilogues): New.
>         (fvect-epilogue-cost-model=): New.
>         * flag-types.h (enum vect_epilogue_mode): New.
>         * opts.c (parse_vectorizer_options): New.
>         (common_handle_option): Support -ftree-vectorize-epilogues=
>         and -fno-tree-vectorize-epilogues options.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * tree-vectorizer.h (struct _loop_vec_info): Add new fields
>         can_be_masked, required_masks, mask_epilogue, combine_epilogue,
>         need_masking, orig_loop_info.
>         (LOOP_VINFO_CAN_BE_MASKED): New.
>         (LOOP_VINFO_REQUIRED_MASKS): New.
>         (LOOP_VINFO_COMBINE_EPILOGUE): New.
>         (LOOP_VINFO_MASK_EPILOGUE): New.
>         (LOOP_VINFO_NEED_MASKING): New.
>         (LOOP_VINFO_ORIG_LOOP_INFO): New.
>         (LOOP_VINFO_EPILOGUE_P): New.
>         (LOOP_VINFO_ORIG_MASK_EPILOGUE): New.
>         (LOOP_VINFO_ORIG_VECT_FACTOR): New.
>         * tree-vect-loop.c (new_loop_vec_info): Initialize new
>         _loop_vec_info fields.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * tree-if-conv.c (tree_if_conversion): Make public.
>         * tree-if-conv.h: New file.
>         * tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Don't
>         try to enhance alignment for epilogues.
>         * tree-vect-loop-manip.c (vect_do_peeling_for_loop_bound): Return
>         created loop.
>         * tree-vect-loop.c: Include tree-if-conv.h.
>         (destroy_loop_vec_info): Preserve LOOP_VINFO_ORIG_LOOP_INFO in
>         loop->aux.
>         (vect_analyze_loop_form): Init LOOP_VINFO_ORIG_LOOP_INFO and reset
>         loop->aux.
>         (vect_analyze_loop): Reset loop->aux.
>         (vect_transform_loop): Check if created epilogue should be returned
>         for further vectorization.  If-convert epilogue if required.
>         * tree-vectorizer.c (vectorize_loops): Add a queue of loops to
>         process and insert vectorized loop epilogues into this queue.
>         * tree-vectorizer.h (vect_do_peeling_for_loop_bound): Return created
>         loop.
>         (vect_transform_loop): Return created loop.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * config/i386/i386.c (ix86_init_cost): Extend costs array.
>         (ix86_add_stmt_masking_cost): New.
>         (ix86_finish_cost): Add masking_prologue_cost and masking_body_cost
>         args.
>         (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
>         * config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): New.
>         * config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New.
>         * config/rs6000/rs6000.c (_rs6000_cost_data): Extend cost array.
>         (rs6000_init_cost): Initialize new cost elements.
>         (rs6000_finish_cost): Add masking_prologue_cost and masking_body_cost.
>         * config/spu/spu.c (spu_init_cost): Extend costs array.
>         (spu_finish_cost): Add masking_prologue_cost and masking_body_cost
>         args.
>         * doc/tm.texi.in (TARGET_VECTORIZE_ADD_STMT_MASKING_COST): New.
>         * doc/tm.texi: Regenerated.
>         * target.def (add_stmt_masking_cost): New.
>         (finish_cost): Add masking_prologue_cost and masking_body_cost args.
>         * target.h (enum vect_cost_for_stmt): Add vector_mask_load and
>         vector_mask_store.
>         (enum vect_cost_model_location): Add vect_masking_prologue
>         and vect_masking_body.
>         * targhooks.c (default_builtin_vectorization_cost): Support
>         vector_mask_load and vector_mask_store.
>         (default_init_cost): Extend costs array.
>         (default_add_stmt_masking_cost): New.
>         (default_finish_cost): Add masking_prologue_cost and masking_body_cost
>         args.
>         * targhooks.h (default_add_stmt_masking_cost): New.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Adjust
>         finish_cost call.
>         * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Likewise.
>         * tree-vectorizer.h (add_stmt_masking_cost): New.
>         (finish_cost): Add masking_prologue_cost and masking_body_cost args.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * tree-vect-loop.c: Include insn-config.h and recog.h.
>         (vect_check_required_masks_widening): New.
>         (vect_check_required_masks_narrowing): New.
>         (vect_get_masking_iv_elems): New.
>         (vect_get_masking_iv_type): New.
>         (vect_get_extreme_masks): New.
>         (vect_check_required_masks): New.
>         (vect_analyze_loop_operations): Add vect_check_required_masks
>         call to compute LOOP_VINFO_CAN_BE_MASKED.
>         (vect_analyze_loop_2): Initialize LOOP_VINFO_CAN_BE_MASKED and
>         LOOP_VINFO_NEED_MASKING before starting over.
>         (vectorizable_reduction): Compute LOOP_VINFO_CAN_BE_MASKED and
>         masking cost.
>         * tree-vect-stmts.c (can_mask_load_store): New.
>         (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>         (vectorizable_mask_load_store): Compute LOOP_VINFO_CAN_BE_MASKED
>         and masking cost.
>         (vectorizable_simd_clone_call): Likewise.
>         (vectorizable_store): Likewise.
>         (vectorizable_load): Likewise.
>         (vect_stmt_should_be_masked_for_epilogue): New.
>         (vect_add_required_mask_for_stmt): New.
>         (vect_analyze_stmt): Compute LOOP_VINFO_CAN_BE_MASKED.
>         * tree-vectorizer.h (vect_model_load_masking_cost): New.
>         (vect_model_store_masking_cost): New.
>         (vect_model_simple_masking_cost): New.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * tree-vect-stmts.c (vectorizable_mask_load_store): Mark
>         the first copy of generated vector stores.
>         (vectorizable_store): Mark the first copy of generated
>         vector stores and provide it with vectype and the original
>         data reference.
>         * tree-vectorizer.h (struct _stmt_vec_info): Add first_copy_p
>         field.
>         (STMT_VINFO_FIRST_COPY_P): New.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * dbgcnt.def (vect_tail_combine): New.
>         * params.def (PARAM_VECT_COST_INCREASE_COMBINE_THRESHOLD): New.
>         * tree-vect-data-refs.c (vect_get_new_ssa_name): Support
>         vect_mask_var.
>         * tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Support
>         epilogue combined with loop body.
>         (vect_do_peeling_for_loop_bound): Likewise.
>         (vect_do_peeling_for_alignment): ???
>         * tree-vect-loop.c: Include alias.h and dbgcnt.h.
>         (vect_estimate_min_profitable_iters): Add
>         ret_min_profitable_combine_niters arg, compute number of iterations
>         for which loop epilogue combining is profitable.
>         (vect_generate_tmps_on_preheader): Support combined epilogue.
>         (vect_gen_ivs_for_masking): New.
>         (vect_get_mask_index_for_elems): New.
>         (vect_get_mask_index_for_type): New.
>         (vect_gen_loop_masks): New.
>         (vect_mask_reduction_stmt): New.
>         (vect_mask_mask_load_store_stmt): New.
>         (vect_mask_load_store_stmt): New.
>         (vect_combine_loop_epilogue): New.
>         (vect_transform_loop): Support combined epilogue.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * dbgcnt.def (vect_tail_mask): New.
>         * tree-vect-loop.c (vect_analyze_loop_2): Support masked loop
>         epilogues and low trip count loops.
>         (vect_get_known_peeling_cost): Ignore scalar epilogue cost for
>         loops we are going to mask.
>         (vect_estimate_min_profitable_iters): Support masked loop
>         epilogues and low trip count loops.
>         * tree-vectorizer.c (vectorize_loops): Add a message for a case
>         when loop epilogue can't be vectorized.
>
>
> gcc/
>
> 2016-05-19 Ilya Enkovich <ilya.enkov...@intel.com>
>
>         * tree-vect-loop.c (vect_transform_loop): Print more info
>         about vectorized loop and specify used vector size.
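
For readers following the thread, here is a minimal C sketch of the strategies
discussed above, using the a[i] += b[i] test from the quoted mail.  It is not
code from the patch series: vector lanes are simulated with a scalar inner
loop, and VF, LOW_TRIP_N, MAX_N and all function names are hypothetical.
add_combined folds the epilogue into a fully masked loop ("combine"),
add_masked_epilogue keeps an unmasked vector loop and handles the tail with one
masked iteration ("mask"), and add_versioned selects the masked version only
below vf * N, roughly along the lines of the versioning idea in the reply.

```c
#include <assert.h>
#include <string.h>

#define VF 4          /* hypothetical vectorization factor */
#define LOW_TRIP_N 2  /* hypothetical N; the mail leaves it to a cost model */
#define MAX_N 64      /* buffer size used by the self-check below */

/* Reference: plain scalar loop. */
static void add_scalar(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* One simulated unmasked vector iteration over lanes [i, i + VF). */
static void full_vector_step(int *a, const int *b, int i)
{
    for (int lane = 0; lane < VF; lane++)
        a[i + lane] += b[i + lane];
}

/* One simulated masked vector iteration: a lane is active only if its
   index is below n, so inactive lanes never touch memory. */
static void masked_vector_step(int *a, const int *b, int i, int n)
{
    for (int lane = 0; lane < VF; lane++) {
        int active = (i + lane) < n;  /* mask bit for this lane */
        if (active)
            a[i + lane] += b[i + lane];
    }
}

/* "combine": every iteration is masked, so no separate epilogue exists. */
static void add_combined(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i += VF)
        masked_vector_step(a, b, i, n);
}

/* "mask": unmasked vector loop, then one masked iteration for the tail. */
static void add_masked_epilogue(int *a, const int *b, int n)
{
    int i = 0;
    for (; i + VF <= n; i += VF)
        full_vector_step(a, b, i);
    if (i < n)
        masked_vector_step(a, b, i, n);
}

/* Versioning sketch: masked body only for low trip counts (< VF * N),
   otherwise unmasked vector loop plus scalar epilogue. */
static void add_versioned(int *a, const int *b, int n)
{
    if (n < VF * LOW_TRIP_N) {
        add_combined(a, b, n);
    } else {
        int i = 0;
        for (; i + VF <= n; i += VF)
            full_vector_step(a, b, i);
        for (; i < n; i++)
            a[i] += b[i];
    }
}

/* Returns 1 if all variants produce the same result as the scalar loop
   for trip count n, including leaving elements at index >= n untouched. */
static int variants_agree(int n)
{
    int ref[MAX_N], a1[MAX_N], a2[MAX_N], a3[MAX_N], b[MAX_N];
    assert(n >= 0 && n <= MAX_N);
    for (int i = 0; i < MAX_N; i++) {
        ref[i] = a1[i] = a2[i] = a3[i] = i;
        b[i] = 10 * i + 1;
    }
    add_scalar(ref, b, n);
    add_combined(a1, b, n);
    add_masked_epilogue(a2, b, n);
    add_versioned(a3, b, n);
    return memcmp(ref, a1, sizeof ref) == 0
        && memcmp(ref, a2, sizeof ref) == 0
        && memcmp(ref, a3, sizeof ref) == 0;
}
```

Calling variants_agree(n) for trip counts around multiples of VF (for example
VF*2-1, the 'heavy' epilogue case mentioned in the quoted mail) is a quick way
to check that the masked variants match the scalar loop.  The sketch says
nothing about profitability, which is what the cost model changes address.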