https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117008
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, GCC 14.2 doesn't vectorize for me anymore, likely because the use of
gather has been nerfed (for Intel).
With AVX2 we see
<bb 3> [local count: 139586405]:
# vect_total_21.15_24 = PHI <vect_total_11.24_41(3), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
# vect_vec_iv_.16_20 = PHI <_18(3), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
# ivtmp.27_27 = PHI <ivtmp.27_23(3), 0(2)>
_18 = vect_vec_iv_.16_20 + { 8, 8, 8, 8, 8, 8, 8, 8 };
vect__17.17_5 = vect_vec_iv_.16_20 >> 5;
vect__19.18_3 = vect_vec_iv_.16_20 & { 31, 31, 31, 31, 31, 31, 31, 31 };
vect_30 = VIEW_CONVERT_EXPR<vector(8) int>(vect__17.17_5);
vect_31 = __builtin_ia32_gathersiv8si ({ 0, 0, 0, 0, 0, 0, 0, 0 }, &values, vect_30, { -1, -1, -1, -1, -1, -1, -1, -1 }, 4);
vect__13.19_32 = VIEW_CONVERT_EXPR<vector(8) long unsigned int>(vect_31);
vect__14.20_34 = { 1, 1, 1, 1, 1, 1, 1, 1 } << vect__19.18_3;
vect__15.21_35 = vect__13.19_32 & vect__14.20_34;
mask__16.22_37 = vect__15.21_35 != { 0, 0, 0, 0, 0, 0, 0, 0 };
_51 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(mask__16.22_37);
vect_total_11.24_41 = vect_total_21.15_24 - _51;
ivtmp.27_23 = ivtmp.27_27 + 1;
if (ivtmp.27_23 != 1600000)
So the first point is that we do not analyze the memory access pattern very
well (the gather index is the vector IV >> 5, so all eight lanes read the same
word), and then of course cost modeling breaks down here as well.
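To make the access pattern concrete: the eight lanes of the vector IV hold consecutive indices, and each is shifted right by 5 before being used as the gather index. A minimal sketch of that index computation follows. The helper names are hypothetical and not taken from GCC; it only mirrors the arithmetic visible in the dump.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: word indices used by one 8-lane vector iteration
// starting at scalar index `base` (mirrors `vect_vec_iv_ >> 5` in the dump).
std::array<std::uint32_t, 8> gather_word_indices(std::uint32_t base) {
    std::array<std::uint32_t, 8> idx{};
    for (std::uint32_t lane = 0; lane < 8; ++lane)
        idx[lane] = (base + lane) >> 5;
    return idx;
}

// True when all eight lanes of the gather would read the same word.
bool gather_is_uniform(std::uint32_t base) {
    const auto idx = gather_word_indices(base);
    for (auto i : idx)
        if (i != idx[0]) return false;
    return true;
}
```

Since the vector loop advances `base` in steps of 8 starting from 0, and 8 divides 32, `gather_is_uniform(base)` holds for every iteration of the loop shown above: each gather fetches a single address eight times.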
The scalar IL is
<bb 3> [local count: 1063004409]:
# total_21 = PHI <total_11(5), 0(2)>
# index_23 = PHI <index_12(5), 0(2)>
# ivtmp_27 = PHI <ivtmp_26(5), 12800000(2)>
_17 = index_23 >> 5;
_19 = index_23 & 31;
_13 = MEM <struct _Base_bitset> [(_WordT *)&values]._M_w[_17];
_14 = 1 << _19;
_15 = _13 & _14;
_16 = _15 != 0;
_1 = (unsigned int) _16;
total_11 = _1 + total_21;
index_12 = index_23 + 1;
ivtmp_26 = ivtmp_27 - 1;
if (ivtmp_26 != 0)
I think for this kind of access pattern it might be nice to unroll the
loop as many times as the same memory location is accessed (1 << 5, aka
32 times).
But as Andrew says - it's just a very bad testcase ;)
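The unrolling idea mentioned above can be sketched at the source level. This is only an illustration under assumptions, not the testcase from the report and not what GCC emits: the function names are hypothetical, the words are taken to be 32 bits wide (matching the `>> 5` and `& 31` in the IL), and the blocked variant assumes the bit count is a multiple of 32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive form, shaped like the scalar IL: one load of words[i >> 5] per bit.
unsigned count_bits_naive(const std::vector<std::uint32_t>& words,
                          std::size_t n_bits) {
    assert(n_bits <= words.size() * 32);
    unsigned total = 0;
    for (std::size_t i = 0; i < n_bits; ++i)
        total += (words[i >> 5] & (std::uint32_t{1} << (i & 31))) != 0;
    return total;
}

// Blocked form: load each word once, then test its 32 bits in an inner loop.
// Assumes n_bits is a multiple of 32 for brevity.
unsigned count_bits_blocked(const std::vector<std::uint32_t>& words,
                            std::size_t n_bits) {
    assert(n_bits % 32 == 0 && n_bits <= words.size() * 32);
    unsigned total = 0;
    for (std::size_t w = 0; w < n_bits / 32; ++w) {
        const std::uint32_t word = words[w];  // single load per 32 bits
        for (unsigned b = 0; b < 32; ++b)
            total += (word >> b) & 1u;
    }
    return total;
}
```

In the blocked form the memory access in the outer loop is a plain unit-stride load, and the inner loop works on a value that is invariant across its 32 iterations, which is the shape the unrolling suggestion is aiming for.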