On Fri, Dec 8, 2017 at 3:18 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 2:40 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Fri, Dec 8, 2017 at 1:43 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Fri, Dec 8, 2017 at 12:17 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Fri, Dec 8, 2017 at 12:46 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This simple patch makes interchange even more conservative for small
>>>>> loops with a constant initialized simple reduction.  The reason is
>>>>> that undoing such a reduction introduces a new data reference and a
>>>>> cond_expr, which could cost too much in a small loop.
>>>>> Test gcc.target/aarch64/pr62178.c is fixed with this patch.  Is it
>>>>> OK if the test passes?
>>>>
>>>> Shouldn't we do this even for non-constant initialized simple
>>>> reductions?  Because for any simple reduction we add two DRs that
>>>> are not innermost; for constant initialized ones we add an
>>>> additional cond-expr.  So ...
>>>>
>>>> +  /* Conservatively skip interchange in cases that only have few data
>>>> +     references and a constant initialized simple reduction, since it
>>>> +     introduces a new data reference as well as a ?: operation.  */
>>>> +  if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>>>> +    return false;
>>>> +
>>>>
>>>> can you, instead of carrying num_const_init_simple_reduc, simply
>>>> loop over m_reductions and classify them in this function
>>>> accordingly?  I think we want to cost non-constant-init reductions
>>>> as well.  The ?: can eventually count for another DR for cost
>>>> purposes.
>>> The number of non-constant-init reductions can still be carried in
>>> struct loop_cand?  I am not very sure what the advantage is of an
>>> additional loop over m_reductions that gets the same information.
>>> Perhaps the increase of stmts should be counted like:
>>>   num_old_inv_drs + num_const_init_simple_reduc * 2 - num_new_inv_drs
>>> The question is which number this should be compared against.  (We
>>> may need to shift num_new_inv_drs to the other side because of
>>> wrapping issues.)
>>>
>>>> It looks like we do count the existing DRs for the reduction?  Is
>>>> that why you arrive at the num_const_init_simple_reduc * 2 figure?
>>>> (one extra load plus one ?:)
>>> Yes.
>>>> But we don't really know whether the DR was invariant in the outer
>>>> loop (well, I suppose
>>> Hmm, I might misunderstand here.  num_old_inv_drs tracks the number
>>> of invariant references with regard to the inner loop, rather than
>>> the outer loop.  The same for num_new_inv_drs, which counts the
>>> references that become invariant after loop interchange with regard
>>> to (the new) inner loop.  This invariance information is always known
>>> from the data reference, right?
>>> As for the DRs for a reduction, we know they are invariant because we
>>> set their inner loop stride to zero.
>>>
>>>> we could remember the DR in m_reductions).
>>>>
>>>> Note that the good thing is that the ?: has an invariant condition
>>>> and thus vectorization can hoist the mask generation out of the
>>>> vectorized loop, which means it boils down to cheap operations.  My
>>>> gut feeling is that just looking at the number of memory references
>>>> isn't a good indicator of profitability, as the regular stmt
>>>> workload has a big impact on the profitability of vectorization.
>>> It's not specific to vectorization.  The generated new code also
>>> costs too much in small loops without vectorization.  But yes, the
>>> number of mem_refs may be too inaccurate; maybe we should check
>>> against num_stmts instead.
>>
>> Not specific to vectorization, but the interchange may pay off only
>> when vectorizing a loop.  Would the loop in loop-interchange-5.c
>> still be interchanged?
>> If we remove the multiplication and just keep
>>   c[i][j] = c[i][j] + b[k][j];
> Both loop-interchange-5.c and the modified version are interchanged,
> because we check it against the number of all data references
> (including num_old_inv_drs):
>   if (num_old_inv_drs + num_const_init_simple_reduc * 2 >= datarefs.length ())
>
>> ?  That is, why is the constant init so special?  Even for
>> non-constant init we're changing two outer loop DRs to two
>> non-consecutive inner loop DRs.
> No, the two outer loop DRs become consecutive with respect to the
> inner loop.
> So for a typical matrix mul case, the interchange moves two outer loop
> DRs into the inner loop and moves one inner loop DR out to the outer
> loop.  Overall it introduces an additional instruction.  In terms of
> cache behavior it's even cheaper, because the two new DRs access the
> same memory object.
> As for pr62178.c, the assembly regressed from:
>     .L3:
>         ldr     w3, [x0], 124
>         ldr     w4, [x2, 4]!
>         cmp     x0, x5
>         madd    w1, w4, w3, w1
>         bne     .L3
>
> to:
>     .L3:
>         ldr     w2, [x3, x0, lsl 2]
>         cmp     w5, 1
>         ldr     w1, [x4, x0, lsl 2]
>         csel    w2, w2, wzr, ne
>         madd    w1, w6, w1, w2
>         str     w1, [x3, x0, lsl 2]
>         add     x0, x0, 1
>         cmp     x0, 31
>         bne     .L3
>
> without vectorization: two additional instructions for the cond_expr,
> one additional memory reference for the interchange, and one
> additional instruction because of ivopts/addressing mode.

For the record, the performance of pr62178.c depends on the size of the
matrix: with a 31*31 matrix the interchanged version above is slower,
but it's faster for a 250*250 matrix.  So the L1 cache size may need to
be considered somehow, but that is still a problem for unknown niters
cases.
Thanks,
bin
>
> Thanks,
> bin
>>
>> Richard.
>>
>>> Thanks,
>>> bin
>>>>
>>>> So no ack nor nack...
>>>>
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> bin
>>>>> 2017-12-08  Bin Cheng  <bin.ch...@arm.com>
>>>>>
>>>>>         * gimple-loop-interchange.cc (struct loop_cand): New field.
>>>>>         (loop_cand::loop_cand): Init new field in constructor.
>>>>>         (loop_cand::classify_simple_reduction): Record simple reduction
>>>>>         initialized with constant value.
>>>>>         (should_interchange_loops): New parameter.  Skip interchange
>>>>>         if loop has few data references and constant initialized
>>>>>         simple reduction.
>>>>>         (tree_loop_interchange::interchange): Update call to above
>>>>>         function.
>>>>>         (should_interchange_loop_nest): Ditto.