Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Richard Biener Mon, 18 Dec 2017 05:33:08 -0800

On Mon, Dec 18, 2017 at 1:37 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 7:39 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Fri, Dec 15, 2017 at 1:19 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Fri, Dec 15, 2017 at 1:35 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Fri, Dec 15, 2017 at 12:09 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>>> On Fri, Dec 15, 2017 at 11:55 AM, Richard Biener
>>>>> <richard.guent...@gmail.com> wrote:
>>>>>> On Fri, Dec 15, 2017 at 12:30 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>>> Hi,
>>>>>>> As explained in the PR, given below test case:
>>>>>>> int a[8][10] = { [2][5] = 4 }, c;
>>>>>>>
>>>>>>> int
>>>>>>> main ()
>>>>>>> {
>>>>>>>   short b;
>>>>>>>   int i, d;
>>>>>>>   for (b = 4; b >= 0; b--)
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][b + 2] = a[c][b + 1];
>>>>>>>   for (i = 0; i < 8; i++)
>>>>>>>     for (d = 0; d < 10; d++)
>>>>>>>       if (a[i][d] != (i == 3 && d == 6) * 4)
>>>>>>>         __builtin_abort ();
>>>>>>>   return 0;
>>>>>>>
>>>>>>> the loop nest is illegal for vectorization without reversing inner 
>>>>>>> loop.  The issue
>>>>>>> is in data dependence checking of vectorizer, I believe the mentioned 
>>>>>>> revision just
>>>>>>> exposed this.  Previously the vectorization is skipped because of 
>>>>>>> unsupported memory
>>>>>>> operation.  The outer loop vectorization unrolls the outer loop into:
>>>>>>>
>>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>>   {
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][6] = a[c][5];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][5] = a[c][4];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>>   }
>>>>>>> Then four inner loops are fused into:
>>>>>>>   for (b = 4; b > 0; b -= 4)
>>>>>>>   {
>>>>>>>     for (c = 0; c <= 6; c++)
>>>>>>>     {
>>>>>>>       a[c + 1][6] = a[c][5];  // S1
>>>>>>>       a[c + 1][5] = a[c][4];  // S2
>>>>>>>       a[c + 1][4] = a[c][3];
>>>>>>>       a[c + 1][3] = a[c][2];
>>>>>>>     }
>>>>>>>   }
>>>>>>
>>>>>> Note that they are not really "fused" but they are interleaved.  With
>>>>>> GIMPLE in mind
>>>>>> that makes a difference, you should get the equivalent of
>>>>>>
>>>>>>    for (c = 0; c <= 6; c++)
>>>>>>      {
>>>>>>        tem1 = a[c][5];
>>>>>>        tem2 = a[c][4];
>>>>>>        tem3 = a[c][3];
>>>>>>        tem4 = a[c][2];
>>>>>>        a[c+1][6] = tem1;
>>>>>>        a[c +1][5] = tem2;
>>>>>>         a[c+1][4] = tem3;
>>>>>>         a[c+1][3] = tem4;
>>>>>>      }
>>>>> Yeah, I will double check if this abstract breaks the patch and how.
>>>> Hmm, I think this doesn't break it, well at least for part of the
>>>> analysis, because it is loop carried (backward) dependence goes wrong,
>>>> interleaving or not with the same iteration doesn't matter here.
>>>
>>> I think the idea is that forward dependences are always fine (negative 
>>> distance)
>>> to vectorize.  But with backward dependences we have to adhere to max_vf.
>>>
>>> It looks like for outer loop vectorization we only look at the distances in 
>>> the
>>> outer loop but never at inner ones.  But here the same applies but isn't 
>>> that
>>> independend on the distances with respect to the outer loop?
>>>
>>> But maybe I'm misunderstanding how "distances" work here.
>> Hmm, I am not sure I understand "distance" correctly.  With
>> description as in book like "Optimizing compilers for Modern
>> Architectures", distance is "# of iteration of sink ref - # of
>> iteration of source ref".  Given below example:
>>   for (i = 0; i < N; ++i)
>>     {
>>       x = arr[idx_1];  // S1
>>       arr[idx_2] = x;  // S2
>>     }
>> if S1 is source ref, distance = idx_2 - idx_1, and distance > 0.  Also
>> this is forward dependence.  For example, idx_1 is i + 1 and idx_2 is
>> i;
>> If S2 is source ref, distance = idx_1 - idx_2, and distance < 0.  Also
>> this is backward dependence.  For example idx_1 is i and idx_2 is i +
>> 1;
>>
>> In GCC, we always try to subtract idx_2 from idx_1 first in computing
>> classic distance, we could result in negative distance in case of
>> backward dependence.  When this happens at dependence carried level,
>> we manually reverse it.  When this happens at inner level loop, we
>> simply keep the negative distance.  And it's meaningless to talk about
>> forward/backward given dependence is carried at outer level loop.
>>
>> Outer loop vectorization is interesting.  The problematic case has
>> backward dependence carried by outer loop.  Because we don't check
>> dist vs. max_vf for such dep, the unrolled references could have outer
>> loop index equal to each other, as in a[c][5] and a[c+1][5].  So it's
>> like we have outer loop index resolved as equal.  Now it has
>> dependence only if c == c' + 1.  I think previous comment on fusion
>> still hold for this and we now need to check backward dependence
>> between the two reference at inner level loop.  I guess it's a bit
>> trick to model dependence between a[c][5] and a[c+1][5] using the
>> original references and dependence.  I think it's okay to do that
>> because distance of a[c][5] and a[c+1][5] is what we computed
>> previously for the original references at inner level loop.
>>
>> Not sure if I have missed something important.
>
> Not sure either.  unroll-and-jam does
>
>       FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>         {
>           /* A distance (a,b) is at worst transformed into (a/N,b) by the
>              unrolling (factor N), so the transformation is valid if
>              a >= N, or b > 0, or b is zero and a > 0.  Otherwise the unroll
>              factor needs to be limited so that the first condition holds.
>              That may limit the factor down to zero in the worst case.  */
>           int dist = dist_v[0];
>           if (dist < 0)
>             gcc_unreachable ();
>           else if ((unsigned)dist >= *unroll)
>             ;
>           else if (lambda_vector_lexico_pos (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
>                    || (lambda_vector_zerop (dist_v + 1, DDR_NB_LOOPS (ddr) - 
> 1)
>                        && dist > 0))
>             ;
>           else
>             *unroll = dist;
>
> where *unroll is similar to *max_vf I think.  dist_v[0] is the innermost loop.
>
> The vectorizer does way more complicated things and only looks at the
> distance with respect to the outer loop as far as I can see which can
> be negative.
>
> Not sure if fusion and vectorizer "interleaving" makes a difference here.
> I think the idea was that when interleaving stmt-by-stmt then forward
> dependences would be preserved and thus we don't need to check the
> inner loop dependences.  speaking with "forward vs. backward" dependences
> again, not distances...
>
> This also means that unroll-and-jam could be enhanced to "interleave"
> stmts and thus cover more cases?
>
> Again, I hope Micha can have a look here...


And _maybe_ PR60276 and its fix (adding STMT_VINFO_MIN_NEG_DIST)
is related.

Richard.

> Richard.
>
>> Thanks,
>> bin
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> bin
>>>>>
>>>>>>
>>>>>>> The loop fusion needs to meet the dependence requirement.  Basically, 
>>>>>>> GCC's data
>>>>>>> dependence analyzer does not model dep between references in sibling 
>>>>>>> loops, but
>>>>>>> in practice, fusion requirement can be checked by analyzing all data 
>>>>>>> references
>>>>>>> after fusion, and there is no backward data dependence.
>>>>>>>
>>>>>>> Apparently, the requirement is violated because we have backward data 
>>>>>>> dependence
>>>>>>> between references (a[c][5], a[c+1][5]) in S1/S2.  Note, if we reverse 
>>>>>>> the inner
>>>>>>> loop, the outer loop would become legal for vectorization.
>>>>>>>
>>>>>>> This patch fixes the issue by enforcing dependence check.  It also adds 
>>>>>>> two tests
>>>>>>> with one shouldn't be vectorized and the other should.  Bootstrap and 
>>>>>>> test on x86_64
>>>>>>> and AArch64.  Is it OK?
>>>>>>
>>>>>> I think you have identified the spot where things go wrong but I'm not
>>>>>> sure you fix the
>>>>>> problem fully.  The spot you pacth is (loop is the outer loop):
>>>>>>
>>>>>>   loop_depth = index_in_loop_nest (loop->num, DDR_LOOP_NEST (ddr));
>>>>>> ...
>>>>>>   FOR_EACH_VEC_ELT (DDR_DIST_VECTS (ddr), i, dist_v)
>>>>>>     {
>>>>>>       int dist = dist_v[loop_depth];
>>>>>> ...
>>>>>>       if (dist > 0 && DDR_REVERSED_P (ddr))
>>>>>>         {
>>>>>>           /* If DDR_REVERSED_P the order of the data-refs in DDR was
>>>>>>              reversed (to make distance vector positive), and the actual
>>>>>>              distance is negative.  */
>>>>>>           if (dump_enabled_p ())
>>>>>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>>>>>                              "dependence distance negative.\n");
>>>>>>
>>>>>> where you add
>>>>>>
>>>>>> +         /* When doing outer loop vectorization, we need to check if 
>>>>>> there is
>>>>>> +            backward dependence at inner loop level if dependence at 
>>>>>> the outer
>>>>>> +            loop is reversed.  See PR81740 for more information.  */
>>>>>> +         if (nested_in_vect_loop_p (loop, DR_STMT (dra))
>>>>>> +             || nested_in_vect_loop_p (loop, DR_STMT (drb)))
>>>>>> +           {
>>>>>> +             unsigned inner_depth = index_in_loop_nest 
>>>>>> (loop->inner->num,
>>>>>> +                                                        DDR_LOOP_NEST 
>>>>>> (ddr));
>>>>>> +             if (dist_v[inner_depth] < 0)
>>>>>> +               return true;
>>>>>> +           }
>>>>>>
>>>>>> but I don't understand how the dependence direction with respect to the
>>>>>> outer loop matters here.
>>>>> If the direction wrto outer loop is positive by itself, i.e,
>>>>> reversed_p equals to false, then dist is checked against max_vf.  In
>>>>> this case, it's not possible to have references refer to the same
>>>>> object?
>>>>> On the other hand, dist is not checked at all for reversed case.
>>>>> Maybe an additional check "dist < max_vf" can relax the patch a bit.
>>>>>>
>>>>>> Given there's DDR_REVERSED on the outer loop distance what does that
>>>>>> mean for the inner loop distance given the quite non-obvious code 
>>>>>> handling
>>>>>> this case in tree-data-ref.c:
>>>>>>
>>>>>>       /* Verify a basic constraint: classic distance vectors should
>>>>>>          always be lexicographically positive.
>>>>>>
>>>>>>          Data references are collected in the order of execution of
>>>>>>          the program, thus for the following loop
>>>>>>
>>>>>>          | for (i = 1; i < 100; i++)
>>>>>>          |   for (j = 1; j < 100; j++)
>>>>>>          |     {
>>>>>>          |       t = T[j+1][i-1];  // A
>>>>>>          |       T[j][i] = t + 2;  // B
>>>>>>          |     }
>>>>>>
>>>>>>          references are collected following the direction of the wind:
>>>>>>          A then B.  The data dependence tests are performed also
>>>>>>          following this order, such that we're looking at the distance
>>>>>>          separating the elements accessed by A from the elements later
>>>>>>          accessed by B.  But in this example, the distance returned by
>>>>>>          test_dep (A, B) is lexicographically negative (-1, 1), that
>>>>>>          means that the access A occurs later than B with respect to
>>>>>>          the outer loop, ie. we're actually looking upwind.  In this
>>>>>>          case we solve test_dep (B, A) looking downwind to the
>>>>>>          lexicographically positive solution, that returns the
>>>>>>          distance vector (1, -1).  */
>>>>>>       if (!lambda_vector_lexico_pos (dist_v, DDR_NB_LOOPS (ddr)))
>>>>>>         {
>>>>>>           lambda_vector save_v = lambda_vector_new (DDR_NB_LOOPS (ddr));
>>>>>>           if (!subscript_dependence_tester_1 (ddr, 1, 0, loop_nest))
>>>>>>             return false;
>>>>>>           compute_subscript_distance (ddr);
>>>>>>           if (!build_classic_dist_vector_1 (ddr, 1, 0, save_v, &init_b,
>>>>>>                                             &index_carry))
>>>>>>             return false;
>>>>>>           save_dist_v (ddr, save_v);
>>>>>>           DDR_REVERSED_P (ddr) = true;
>>>>>>
>>>>>> that is, what does dist_v[inner_depth] < 0 tell us here?  Does it tell us
>>>>>> that the dependence direction is reverse of the outer loop?
>>>>> Hmm, this part of code is a bit confusing for me.  So we always
>>>>> compute classic distance by sorting two data references in lexico
>>>>> order, which could be the opposite given we collect references by dom
>>>>> order.  IIUC, the negative dist at inner loop means a backward
>>>>> dependence for that index/loop.  It's not related to outer loop or the
>>>>> reversed_p flag.
>>>>>
>>>>> Thanks,
>>>>> bin
>>>>>>
>>>>>> CCing Micha who might remember some more details from unroll-and-jam.
>>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>>> Thanks,
>>>>>>> bin
>>>>>>> 2017-12-15  Bin Cheng  <bin.ch...@arm.com>
>>>>>>>
>>>>>>>         PR tree-optimization/81740
>>>>>>>         * tree-vect-data-refs.c (vect_analyze_data_ref_dependence): In 
>>>>>>> case
>>>>>>>         of outer loop vectorization, check backward dependence at inner 
>>>>>>> loop
>>>>>>>         if dependence at outer loop is reversed.
>>>>>>>
>>>>>>> gcc/testsuite
>>>>>>> 2017-12-15  Bin Cheng  <bin.ch...@arm.com>
>>>>>>>
>>>>>>>         PR tree-optimization/81740
>>>>>>>         * gcc.dg/vect/pr81740-1.c: New test.
>>>>>>>         * gcc.dg/vect/pr81740-2.c: Refine test.

Re: [PATCH PR81740]Enforce dependence check for outer loop vectorization

Reply via email to