On Fri, Dec 8, 2017 at 3:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small
>>>>> loops with a constant initialized simple reduction.  The reason is
>>>>> that undoing such a reduction introduces a new data reference and a
>>>>> cond_expr, which could cost too much in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it
>>>>> OK if the test passes?
>>>>
>>>> Shouldn't we do this even for non-constant initialized simple
>>>> reductions?  Because for any simple reduction we add two DRs that
>>>> are not innermost; for constant initialized ones we add an
>>>> additional cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases that only have few data
>>>> +     references and a constant initialized simple reduction, since it
>>>> +     introduces a new data reference as well as a ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +    return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply
>>>> loop over m_reductions and classify them in this function
>>>> accordingly?  I think we want to cost non-constant-init reductions
>>>> as well.  The ?: can eventually count for another DR for cost
>>>> purposes.
>>> The number of non-constant-init reductions can still be carried in
>>> struct loop_cand?  I am not very sure what the advantage is of an
>>> additional loop over m_reductions that gets the same information.
>>> Perhaps the increase of stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> The question is which number this should be compared against.  (We
>>> may need to shift num_new_inv_drs to the other side because of
>>> wrapping issues.)
>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is
>>>> that why you arrive at the num_const_init_simple_reduc * 2 figure?
>>>> (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number
>>> of invariant references with regard to the inner loop, rather than
>>> the outer loop.  The same for num_new_inv_drs, which counts the
>>> references that become invariant after loop interchange with regard
>>> to (the new) inner loop.  This invariance information is always known
>>> from the data reference, right?
>>> As for the DRs for a reduction, we know they are invariant because we
>>> set their inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition
>>>> and thus vectorization can hoist the mask generation out of the
>>>> vectorized loop, which means it boils down to cheap operations.  My
>>>> gut feeling is that just looking at the number of memory references
>>>> isn't a good indicator of profitability, as the regular stmt
>>>> workload has a big impact on the profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also
>>> costs too much in small loops without vectorization.  But yes, the
>>> number of mem_refs may be too inaccurate; maybe we should check
>>> against num_stmts instead.
>>
>> Not specific to vectorization, but the interchange may pay off only
>> when vectorizing a loop.  Would the loop in loop-interchange-5.c
>> still be interchanged?
>> If we remove the multiplication and just keep
>>   c[i][j] = c[i][j] + b[k][j];
> Both loop-interchange-5.c and the modified version are interchanged,
> because we check it against the number of all data references
> (including num_old_inv_drs):
>   if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>
>> ?  That is, why is the constant init so special?  Even for
>> non-constant init we're changing two outer loop DRs to two
>> non-consecutive inner loop DRs.
> No, the two outer loop DRs become consecutive with respect to the
> inner loop.
> So for a typical matrix mul case, the interchange moves two outer loop
> DRs into the inner loop and moves one inner loop DR out to the outer
> loop.  Overall it introduces an additional instruction.  In terms of
> cache behavior it's even cheaper, because the two new DRs access the
> same memory object.
> As for pr62178.c, the assembly regressed from:
>     .L3:
>         ldr     w3, [x0], 124
>         ldr     w4, [x2, 4]!
>         cmp     x0, x5
>         madd    w1, w4, w3, w1
>         bne     .L3
>
> to:
>     .L3:
>         ldr     w2, [x3, x0, lsl 2]
>         cmp     w5, 1
>         ldr     w1, [x4, x0, lsl 2]
>         csel    w2, w2, wzr, ne
>         madd    w1, w6, w1, w2
>         str     w1, [x3, x0, lsl 2]
>         add     x0, x0, 1
>         cmp     x0, 31
>         bne     .L3
>
> without vectorization: two additional instructions for the cond_expr,
> one additional memory reference for the interchange, and one
> additional instruction because of ivopts/addressing mode.

For the record, the performance of pr62178.c depends on the size of the
matrix: with a 31*31 matrix the interchanged version above is slower,
but it's faster for a 250*250 matrix.  So the L1 cache size may need to
be considered somehow, but that is still a problem for unknown niters
cases.
Thanks,
bin
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>>
>>>> So no ack nor nack...
>>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> bin
>>>>> 2017-12-08  Bin Cheng  <bin.ch...@arm.com>
>>>>>
>>>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>>>         initialized with constant value.
>>>>>         (should_interchange_loops): New parameter.  Skip interchange
>>>>>         if loop has few data references and constant initialized
>>>>>         simple reduction.
>>>>>         (tree_loop_interchange::interchange): Update call to above
>>>>>         function.
>>>>>         (should_interchange_loop_nest): Ditto.