https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119351
--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Looking at it some more, I think the loop is valid to vectorize. But we don't
seem to vectorize the reduction jumping back to the outer loop:
;; basic block 384, loop depth 3, count 8598980 (estimated locally, freq
26.9637), maybe hot
;; prev block 383, next block 458, flags: (NEW, REACHABLE, VISITED)
;; pred: 387 [94.5% (guessed)] count:8598980 (estimated locally, freq
26.9637) (TRUE_VALUE,EXECUTABLE)
# RANGE [irange] int [1, +INF]
_1643 = ci_y_1924 + 1;
_3519 = vect_vec_iv_.4957_3520 + { 4, 4, 4, 4 };
# PT = nonlocal escaped null
vectp.4959_3503 = vectp.4959_3504 + 16;
ivtmp_3495 = ivtmp_3496 + 4;
_3493 = (unsigned int) ivtmp_3495;
next_mask_3470 = .WHILE_ULT (_3493, _3494, { 0, 0, 0, 0 });
if (next_mask_3470 == { 0, 0, 0, 0 })
goto <bb 844>; [5.50%]
else
goto <bb 458>; [94.50%]
;; succ: 844 [5.5% (guessed)] count:472944 (estimated locally, freq
1.4830) (TRUE_VALUE,EXECUTABLE)
;; 458 [94.5% (guessed)] count:8126036 (estimated locally, freq
25.4807) (FALSE_VALUE,EXECUTABLE)
;; basic block 458, loop depth 3, count 8126036 (estimated locally, freq
25.4807), maybe hot
;; prev block 384, next block 844, flags: (NEW, REACHABLE, VISITED)
;; pred: 384 [94.5% (guessed)] count:8126036 (estimated locally, freq
25.4807) (FALSE_VALUE,EXECUTABLE)
goto <bb 387>; [100.00%]
;; succ: 387 [always] count:8126036 (estimated locally, freq 25.4807)
(FALLTHRU,DFS_BACK,EXECUTABLE)
;; basic block 844, loop depth 2, count 472944 (estimated locally, freq
1.4830), maybe hot
;; prev block 458, next block 840, flags: (NEW, VISITED)
;; pred: 384 [5.5% (guessed)] count:472944 (estimated locally, freq
1.4830) (TRUE_VALUE,EXECUTABLE)
# RANGE [irange] int [1, +INF]
# _3469 = PHI <_1643(384)>
_3538 = niters.4954_3792;
_3536 = (intD.10) _3538;
tmp.4955_3537 = ci_y_1923 + _3536;
if (_3538 == niters.4954_3792)
goto <bb 385>; [25.00%]
else
goto <bb 840>; [75.00%]
where
_1643 = ci_y_1924 + 1;
has stayed scalar and so LCSSA code inserts a PHI here:
# _3469 = PHI <_1643(384)>
which is unused, as BB 384 is considered the main exit. So the vectorizer
assumes that if you exit via 384 -> 844 you've done all iterations, and so it
just uses niters + ci_y_1923, i.e. it just adds the number of iterations to
ci_y. So I don't think that's wrong.
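To make the niters + ci_y_1923 reasoning concrete, here is a minimal scalar sketch. It only models the counted exit and is not taken from the GCC sources; walk_iv, final_iv, start and bound are hypothetical names, and it assumes start < bound so the walk terminates.

```c
/* Hypothetical scalar model of the IV update in bb 384.  Names are
   illustrative only.  Assumes start < bound.  */
static int walk_iv(int start, int bound, int *niters)
{
    int ci_y = start;
    *niters = 0;
    for (;;) {
        int next = ci_y + 1;   /* models _1643 = ci_y_1924 + 1 */
        ++*niters;             /* one more completed increment */
        if (next == bound)     /* the counted (main) exit */
            return next;
        ci_y = next;
    }
}

/* Closed form: start value plus the number of increments performed.  */
static int final_iv(int start, int niters)
{
    return start + niters;
}
```

If only the counted exit is taken, walking the IV step by step and computing start + niters give the same value, which is why the unused LCSSA PHI on its own does not look wrong.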
Richi, I could use some help here on whether this loop *is* valid to vectorize
or not. I've not yet been able to create a small reproducer, but the loop looks
like:
(gdb) p debug_loop (loop, 3)
loop_48 (header = 387, latch = 458, finite_p
upper_bound 2147483647
likely_upper_bound 2147483647
iterations by profile: 8.347976 (unreliable, maybe flat) entry count:1081571
(estimated locally, freq 3.3915))
{
bb_384 (preds = {bb_387 }, succs = {bb_385 bb_458 })
{
<bb 384> [local count: 9554422]:
_1643 = ci_y_1924 + 1;
if (_1643 == _1649)
goto <bb 385>; [5.50%]
else
goto <bb 458>; [94.50%]
}
bb_458 (preds = {bb_384 }, succs = {bb_387 })
{
<bb 458> [local count: 9028929]:
goto <bb 387>; [100.00%]
}
bb_387 (preds = {bb_458 bb_386 }, succs = {bb_384 bb_388 })
{
<bb 387> [local count: 10110500]:
# ci_y_1924 = PHI <_1643(458), ci_y_1923(386)>
_1651 = _1650 + ci_y_1924;
_1652 = _1651 + 1;
_1653 = (long unsigned int) _1652;
_1655 = _1653 * 4;
_1656 = _1654 + _1655;
# VUSE <.MEM_1725>
_1657 = *_1656;
if (_1657 <= ci_1918)
goto <bb 384>; [94.50%]
else
goto <bb 388>; [5.50%]
}
}
(gdb) p debug_bb_n_slim (388)
;; basic block 388, loop depth 1
;; pred: 387
# ci_x_1703 = PHI <ci_x_1920(387)>
# _2327 = PHI <_1651(387)>
# ci_y_2179 = PHI <ci_y_1924(387)>
if (_333 != 0)
goto <bb 62>; [50.00%]
else
goto <bb 61>; [50.00%]
;; succ: 61
;; 62
(gdb) p debug_bb_n_slim (385)
;; basic block 385, loop depth 2
;; pred: 384
_1646 = ci_x_1920 + 1;
;; succ: 386
(gdb) p debug_bb_n_slim (386)
;; basic block 386, loop depth 2
;; pred: 382
;; 385
# ci_x_1920 = PHI <ci_x_1919(382), _1646(385)>
# ci_y_1923 = PHI <ci_y_1922(382), 0(385)>
_1650 = _1649 * ci_x_1920;
;; succ: 387
(gdb) p debug (pre_header)
<edge (386 -> 387)>
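Read back into C, the blocks above have roughly the following shape. This is a hand-written sketch of the control flow only, not a reproducer for the bug. All names are hypothetical: stride stands in for _1649, ci for ci_1918, base for _1654. The rows bound is an assumption added purely so the sketch terminates, since the outer loop's own exit is not in the dumps above.

```c
struct scan_result {
    int ci_x;   /* outer index at exit */
    int ci_y;   /* inner index at exit (pre-increment, as in bb 388) */
    int found;  /* 1 if the early exit 387 -> 388 was taken */
};

/* Hypothetical model of loop_48 and its enclosing loop.  Assumes
   stride >= 1, 0 <= ci_y < stride, and that base has at least
   stride * rows + 1 elements.  */
static struct scan_result scan(const int *base, int stride, int rows,
                               int ci_x, int ci_y, int ci)
{
    struct scan_result r = { ci_x, ci_y, 0 };
    while (ci_x < rows) {                     /* assumed outer bound */
        int row = stride * ci_x;              /* bb 386: _1650 = _1649 * ci_x */
        for (;;) {
            if (base[row + ci_y + 1] > ci) {  /* bb 387: early exit to bb 388 */
                r.ci_x = ci_x;
                r.ci_y = ci_y;
                r.found = 1;
                return r;
            }
            ci_y = ci_y + 1;                  /* bb 384: _1643 = ci_y + 1 */
            if (ci_y == stride)               /* counted exit to bb 385 */
                break;
        }
        ci_x = ci_x + 1;                      /* bb 385: _1646 = ci_x + 1 */
        ci_y = 0;                             /* bb 386: ci_y_1923 = 0 */
    }
    r.ci_x = ci_x;
    r.ci_y = ci_y;
    return r;
}
```

The point of the sketch is that the inner loop has two exits: the data-dependent one that leaves both loops with the pre-increment ci_y live, and the counted one that returns to the outer loop, where ci_y is reset to 0.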
So the reduction values look sane and the vector code looks sane. I'll instead
focus first on values that change during the loop.