On Wed, Aug 31, 2011 at 2:17 PM, Ilya Enkovich <enkovich....@gmail.com> wrote: > Hello Richard, > > Thanks for the review! > >> The fact that you have to adjust gcc.dg/tree-ssa/pr38533.c looks problematic >> to me. Can you investigate the problem report, look at the geneated >> code with the atom default of the param and see whether it's still >> reasonably "fixed" (maybe you'd already done this)? > This test was created as a regression test to the problem in > linearize_expr_tree which moves all statements down to the first > modified one during reassociation increasing registers pressure. Test > has a lot of definitions which are all ORed like this: > def r1 > def r2 > s = r1 or r2 > def r3 > s = s or r3 > def r4 > s = s or r4 > ... > and it checks that register pressure is not increased after > reassociation by simple scan of two sequential defs. If we use > reassociation width higher than 1 then test will fails because we need > to increase register pressure to get parallelism and finally get code > like this: > def r1 > def r2 > def r3 > t1 = r1 or r2 > s = s or r3 > def r4 > def r5 > s = s or t1 > t1 = r4 or r5 > ... > So, I had to fix a test.
Ok. Thanks for explaining. >> + /* Check if we may use less width and still compute sequence for >> + the same time. It will allow us to reduce registers usage. */ >> + while (width > 1 >> + && get_required_cycles (ops_num, width - 1) == cycles_best) >> + width--; >> >> I suppose get_required_cycles () is monotonic in width? Would it >> make sense to binary search the best width then (I realize the >> default is 1 and the only target differing has 2, but ...)? Maybe at >> least add a comment to this effect? Or not decrement by one >> but instead divide by two on each iteration (which is the same for 1 and 2 >> ...)? >> It's also all a mapping that is constant - we should be able to >> pre-compute it for all values, eventually "compressing" it to a >> much simpler formula width = f (cpu_width, ops_num)? > I replaced sequential search with a binary search. I did not pay much > attention to this function because do not think it is really time > consuming compared to other parts of reassociation phase. Am I too > optimistic? If you think it might significantly affect compile time I > can introduce a table of pre-computed values (or make it later as a > separate fix). + /* Get the minimal time required for sequence computation. */ + cycles_best = get_required_cycles (ops_num, width); + + /* Check if we may use less width and still compute sequence for + the same time. It will allow us to reduce registers usage. */ + width_min = 1; + while (width > width_min) + { + int width_mid = (width + width_min) / 2; + + if (get_required_cycles (ops_num, width_mid) == cycles_best) + width = width_mid; + else if (width_min < width_mid) + width_min = width_mid; + else + break; + } this seems to not allow cycles_best to drop with lower width, but that it can't should be an implementation detail of get_required_cycles. To make it not so, can you add a comment before the loop, like /* get_required_cycles is monotonically increasing with lower width so we can perform a binary search for the minimal width that still results in the optimal cycle count. */ > I made all other fixes as you suggested. With the above change the non-x86 specifc parts are ok. Please get approval for them from a x86 maintainer. Thanks, Richard. > Bootstrapped and checked on x86_64-linux. > > Thanks, > Ilya > --- > gcc/ > > 2011-08-31 Enkovich Ilya <ilya.enkov...@intel.com> > > PR middle-end/44382 > * target.def (reassociation_width): New hook. > > * doc/tm.texi.in (reassociation_width): Likewise. > > * doc/tm.texi (reassociation_width): Likewise. > > * doc/invoke.texi (tree-reassoc-width): New param documented. > > * hooks.h (hook_int_uint_mode_1): New default hook. > > * hooks.c (hook_int_uint_mode_1): Likewise. > > * config/i386/i386.h (ix86_tune_indices): Add > X86_TUNE_REASSOC_INT_TO_PARALLEL and > X86_TUNE_REASSOC_FP_TO_PARALLEL. > > (TARGET_REASSOC_INT_TO_PARALLEL): New. > (TARGET_REASSOC_FP_TO_PARALLEL): Likewise. > > * config/i386/i386.c (initial_ix86_tune_features): Add > X86_TUNE_REASSOC_INT_TO_PARALLEL and > X86_TUNE_REASSOC_FP_TO_PARALLEL. > > (ix86_reassociation_width) implementation of > new hook for i386 target. > > * params.def (PARAM_TREE_REASSOC_WIDTH): New param added. > > * tree-ssa-reassoc.c (get_required_cycles): New function. > (get_reassociation_width): Likewise. > (swap_ops_for_binary_stmt): Likewise. > (rewrite_expr_tree_parallel): Likewise. > > (rewrite_expr_tree): Refactored. Part of code moved into > swap_ops_for_binary_stmt. > > (reassociate_bb): Now checks reassociation width to be used > and call rewrite_expr_tree_parallel instead of rewrite_expr_tree > if needed. > > gcc/testsuite/ > > 2011-08-31 Enkovich Ilya <ilya.enkov...@intel.com> > > * gcc.dg/tree-ssa/pr38533.c (dg-options): Added option > --param tree-reassoc-width=1. > > * gcc.dg/tree-ssa/reassoc-24.c: New test. > * gcc.dg/tree-ssa/reassoc-25.c: Likewise. >