On Tue, Dec 12, 2017 at 1:42 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small
>>>>> loops with a constant-initialized simple reduction.  The reason is
>>>>> that undoing such a reduction introduces a new data reference and a
>>>>> cond_expr, which could cost too much in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it OK
>>>>> if the test passes?
>>>>
>>>> Shouldn't we do this even for a non-constant-initialized simple
>>>> reduction?  For any simple reduction we add two DRs that are not
>>>> innermost; for a constant-initialized one we add an additional
>>>> cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases that have only a few
>>>> +     data references and a constant-initialized simple reduction,
>>>> +     since undoing the reduction introduces a new data reference as
>>>> +     well as a ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +    return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply loop
>>>> over m_reductions and classify them in this function accordingly?  I
>>>> think we want to cost non-constant-init reductions as well.  The ?:
>>>> can eventually count as another DR for cost purposes.
>>> The number of non-constant-init reductions can still be carried in
>>> struct loop_cand?  I am not sure what the advantage of an additional
>>> loop over m_reductions getting the same information would be.
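(Illustrative note: the discussion above is about "undoing" a constant-initialized simple reduction during interchange. The sketch below is not from the patch; the function names and 4x4 shape are invented for illustration. It shows why the transform adds a new data reference (the temporary array) and a ?: operation guarded by the first-iteration test.)

```c
#include <assert.h>

#define N 4

/* Before interchange: a constant-initialized simple reduction.  The
   scalar 's' carries the reduction in the inner (k) loop. */
static void before(int a[N][N], int out[N])
{
  for (int j = 0; j < N; j++)
    {
      int s = 0;                      /* constant init */
      for (int k = 0; k < N; k++)
        s += a[k][j];
      out[j] = s;
    }
}

/* After interchange the reduction must be undone: it is carried in a
   temporary array 'tmp' (a new data reference), and the constant init
   is applied via a ?: on the invariant first-iteration condition. */
static void after(int a[N][N], int out[N])
{
  int tmp[N];
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      tmp[j] = (k == 0 ? 0 : tmp[j]) + a[k][j];
  for (int j = 0; j < N; j++)
    out[j] = tmp[j];
}
```

In a small loop, the extra load of tmp[j] and the ?: can outweigh any locality gain from interchange, which is the motivation for the conservative check.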
>>> Perhaps the increase in stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> The question is which number this should be compared against.  (We may
>>> need to shift num_new_inv_drs to the other side because of wrapping.)
>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is
>>>> that why you arrive at the num_const_init_simple_reduc * 2 figure?
>>>> (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number of
>>> invariant references with regard to the inner loop, rather than the
>>> outer loop.  The same goes for num_new_inv_drs, which counts
>>> references that become invariant after loop interchange with regard to
>>> (the new) inner loop.  This invariance information is always known
>>> from the data reference, right?  As for DRs for a reduction, we know
>>> they are invariant because we set their inner-loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition,
>>>> and thus vectorization can hoist the mask generation out of the
>>>> vectorized loop, which means it boils down to cheap operations.  My
>>>> gut feeling is that just looking at the number of memory references
>>>> isn't a good indicator of profitability, as the regular stmt workload
>>>> has a big impact on the profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also costs
>>> too much in small loops without vectorization.  But yes, the number of
>>> mem_refs may be too inaccurate; maybe we should check against
>>> num_stmts.
>>
>> Not specific to vectorization, but the interchange may pay off only
>> when vectorizing a loop.  Would the loop in loop-interchange-5.c still
>> be interchanged if we removed the multiplication and just kept
>>   c[i][j] = c[i][j] + b[k][j];
>> ?
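(Illustrative note: the num_old_inv_drs / num_new_inv_drs distinction discussed above can be shown with a small sketch. The loop shape and names below are invented for illustration, not taken from the testsuite.)

```c
#include <assert.h>

#define N 4

/* Original nest: c[i] is invariant with regard to the inner (j) loop,
   so it would count toward num_old_inv_drs. */
static void original(int a[N][N], int b[N], int c[N])
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      c[i] = c[i] + a[j][i] * b[j];
}

/* Interchanged nest: b[j] is now invariant with regard to the new inner
   (i) loop, so it would count toward num_new_inv_drs; c[i] no longer
   is invariant in the inner loop. */
static void interchanged(int a[N][N], int b[N], int c[N])
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      c[i] = c[i] + a[j][i] * b[j];
}
```

Both nests compute the same result; what changes is which references are invariant in the innermost loop, which is exactly what the two counters track.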
>> That is, why is the constant init so special?  Even for non-constant
>> init we're changing two outer-loop DRs to two non-consecutive
>> inner-loop DRs.
> Hi Richard,
> This is an updated patch taking stmt cost into consideration.
>
> Firstly, the stmt cost (from the number of stmts) of each loop is
> recorded.  Then the stmt cost of the outer loop is adjusted: decreased
> by the number of IVs, increased by the number of constant-initialized
> simple reductions.  Lastly, we check the stmt cost between the inner
> and outer loops and give up on interchange if the outer loop has too
> many stmts.
>
> Test gcc.target/aarch64/pr62178.c is fixed with this patch.
> Bootstrapped and tested on x86_64 and AArch64.  Any comments?
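(Illustrative note: the cost adjustment described above could be sketched roughly as below. The name STMT_COST_RATIO mirrors the ChangeLog, but its value and the exact comparison here are assumptions for illustration, not the patch's actual logic.)

```c
#include <assert.h>

/* Hypothetical ratio between inner and outer loop stmt cost; the real
   constant lives in gimple-loop-interchange.cc. */
#define STMT_COST_RATIO 3

/* Sketch of the described check: adjust the outer loop's stmt cost,
   then give up if the outer loop has too many stmts relative to the
   inner loop. */
static int should_interchange(int inner_cost, int outer_cost,
                              int outer_num_ivs,
                              int num_const_init_simple_reduc)
{
  /* IV updates disappear after interchange; each constant-initialized
     simple reduction adds a load and a ?: operation.  */
  outer_cost = outer_cost - outer_num_ivs + num_const_init_simple_reduc * 2;
  return inner_cost * STMT_COST_RATIO >= outer_cost;
}
```

With this sketch, a heavy inner loop over a light outer loop passes the check, while an outer loop whose adjusted cost dwarfs the inner loop's is rejected.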
This looks good now!  Ok for trunk.

Thanks,
Richard.

> Thanks,
> bin
>
> 2017-12-12  Bin Cheng  <bin.ch...@arm.com>
>
>     * gimple-loop-interchange.cc (STMT_COST_RATIO): New macro.
>     (loop_cand::m_num_stmts, loop_cand::m_const_init_reduc): New members.
>     (loop_cand::loop_cand): Initialize above members.
>     (loop_cand::supported_operations): Delete.
>     (loop_cand::can_interchange_p): Inline above function.
>     (loop_cand::classify_simple_reduction): Record number of constant
>     initialized simple reductions.
>     (should_interchange_loops): New parameters.  Check stmt cost of
>     loops to be interchanged.
>     (tree_loop_interchange::interchange): Prepare stmt cost of outer
>     loop.  Update call to should_interchange_loops.
>     (should_interchange_loop_nest): Update call to
>     should_interchange_loops.