Hi Both,

Thanks for all the reviews/patches so far 😊

> >
> > Looks good, but I wonder what we can do to at least make the multiple
> > exit case behave reasonably?  The vectorizer keeps track
> 
> > of a "canonical" exit, would it be possible to pass in the main exit
> > edge and use that instead of single_exit (), would other exits then
> > behave somewhat reasonable or would we totally screw things up here?
> > That is, the "canonical" exit would be the counting exit while the
> > other exits are on data driven conditions and thus wouldn't change
> > probability when we reduce the number of iterations(?)
> 
> I can add canonical_exit parameter and make the function to direct flow to it 
> if
> possible.  However overall I think fixup depends on what transformation led to
> the change.
> 
> Assuming that vectorizer did no prologues and apilogues and we vectorized
> with factor N, then I think the update could be done more specifically as
> follows.
> 

If it helps, how this patch series addresses multiple exits by forcing a scalar
epilogue, all non canonical_exits would have been redirected to this scalar
epilogue, so the remaining scalar iteration count will be at most VF.

Regards,
Tamar

> We know that header block count dropped by 4. So we can start from that
> and each time we reach basic block with exit edge, we know the original count
> of the edge.  This count is unchanged, so one can rescale probabilities out of
> that BB accordingly.  If loop has no inner loops, we can just walk the body in
> RPO and propagate scales downwards and we sould arrive to right result
> 
> I originally added the bound parameter to handle prologues/epilogues which
> gets new artificial bound.  In prologue I think you are right that the flow 
> will be
> probably directed to the conditional counting iterations.
> 
> In epilogue we add no artificial iteration cap, so maybe it is more realistic 
> to
> simply scale up probability of all exits?
> 
> To see what is going on I tried following testcase:
> 
> int a[99];
> test()
> {
>   for (int i = 0; i < 99; i++)
>       a[i]++;
> }
> 
> What surprises me is that vectorizer at -O2 does nothing and we end up
> unrolling the loop:
> 
> L2:
>         addl    $1, (%rax)
>         addl    $1, 4(%rax)
>         addl    $1, 8(%rax)
>         addq    $12, %rax
>         cmpq    $a+396, %rax
> 
> Which seems sily thing to do. Vectorized loop with epilogue doing 2 and
> 1 addition would be better.
> 
> With -O3 we vectorize it:
> 
> 
> .L2:
>         movdqa  (%rax), %xmm0
>         addq    $16, %rax
>         paddd   %xmm1, %xmm0
>         movaps  %xmm0, -16(%rax)
>         cmpq    %rax, %rdx
>         jne     .L2
>         movq    a+384(%rip), %xmm0
>         addl    $1, a+392(%rip)
>         movq    .LC1(%rip), %xmm1
>         paddd   %xmm1, %xmm0
>         movq    %xmm0, a+384(%rip)
> 
> 
> and correctly drop vectorized loop body to 24 iterations. However the
> epilogue has loop for vector size 2 predicted to iterate once (it won't)
> 
> ;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;;    prev block 5, next block 8, flags: (NEW, VISITED)
> ;;    pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       8 [always]  count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe
> hot
> ;;    prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       9 [always]  count:10737417 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                7 [always]  count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
>   # i_9 = PHI <i_17(9), 96(7)>
>   # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>
>   # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
>   # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
>   # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>
>   vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];
>   _14 = a[i_9];
>   vect__15.17_44 = vect__14.16_42 + { 1, 1 };
>   _15 = _14 + 1;
>   MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;
>   i_17 = i_9 + 1;
>   ivtmp_18 = ivtmp_13 - 1;
>   vectp_a.14_41 = vectp_a.14_40 + 8;
>   vectp_a.18_47 = vectp_a.18_46 + 8;
>   ivtmp_50 = ivtmp_49 + 1;
>   if (ivtmp_50 < 1)
>     goto <bb 9>; [50.00%]
>   else
>     goto <bb 12>; [50.00%]
> 
> and finally the scalar copy
> 
> ;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;;    prev block 9, next block 13, flags: (NEW, VISITED)
> ;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> 
> ;;   basic block 13, loop depth 1, count 1063004409 (estimated locally),
> maybe hot
> ;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       14 [always]  count:1052266996 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)
>   # i_30 = PHI <i_36(14), 98(12)>
>   # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
>   _33 = a[i_30];
>   _34 = _33 + 1;
>   a[i_30] = _34;
>   i_36 = i_30 + 1;
>   ivtmp_37 = ivtmp_32 - 1;
>   if (ivtmp_37 != 0)
>     goto <bb 14>; [98.99%]
>   else
>     goto <bb 4>; [1.01%]
> 
> With also small but non-zero iteration probability.   This is papered
> over by my yesterday patch. But it seems to me that it would be a lot better 
> if
> vectorizer understood that the epilogue will be loopless and accounted it to
> the cost model that would probably make it easy to enable it at cheap costs
> too.
> 
> Clang 16 at -O2 is much more aggressive by both vectorizing and unroling:
> 
> test:                                   # @test
>         .cfi_startproc
> # %bb.0:
>         movdqa  a(%rip), %xmm1
>         pcmpeqd %xmm0, %xmm0
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a(%rip)
>         movdqa  a+16(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+16(%rip)
>         movdqa  a+32(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+32(%rip)
>         movdqa  a+48(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+48(%rip)
>         movdqa  a+64(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+64(%rip)
>         movdqa  a+80(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+80(%rip)
>         movdqa  a+96(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+96(%rip)
>         movdqa  a+112(%rip), %xmm1
>         psubd   %xmm0, %xmm1
> ....
>         movdqa  %xmm1, a+240(%rip)
>         movdqa  a+256(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+256(%rip)
>         movdqa  a+272(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+272(%rip)
>         movdqa  a+288(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+288(%rip)
>         movdqa  a+304(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+304(%rip)
>         movdqa  a+320(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+320(%rip)
>         movdqa  a+336(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+336(%rip)
>         movdqa  a+352(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+352(%rip)
>         movdqa  a+368(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+368(%rip)
>         addl    $1, a+384(%rip)
>         addl    $1, a+388(%rip)
>         addl    $1, a+392(%rip)
>         retq
> 
> Honza

Reply via email to