On Wed, 12 Jun 2024, Richard Biener wrote:
> On Tue, 11 Jun 2024, Richard Sandiford wrote:
>
> > Don't think it makes any difference, but:
> >
> > Richard Biener <[email protected]> writes:
> > > @@ -2151,7 +2151,16 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > > access excess elements.
> > > ??? Enhancements include peeling multiple iterations
> > > or using masked loads with a static mask. */
> > > - || (group_size * cvf) % cnunits + group_size - gap < cnunits))
> > > + || ((group_size * cvf) % cnunits + group_size - gap < cnunits
> > > + /* But peeling a single scalar iteration is enough if
> > > + we can use the next power-of-two sized partial
> > > + access. */
> > > + && ((cremain = (group_size * cvf - gap) % cnunits), true
> >
> > ...this might be less surprising as:
> >
> > && ((cremain = (group_size * cvf - gap) % cnunits, true)
> >
> > in terms of how the &&s line up.
>
> Yeah - I'll fix before pushing.
The aarch64 CI shows that a few testcases no longer use SVE
(gcc.target/aarch64/sve/slp_perm_{4,7,8}.c) because peeling
for gaps is deemed insufficient.  Formerly we had
  if (loop_vinfo
      && *memory_access_type == VMAT_CONTIGUOUS
      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
      && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
                      nunits))
    {
      unsigned HOST_WIDE_INT cnunits, cvf;
      if (!can_overrun_p
          || !nunits.is_constant (&cnunits)
          || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
          /* Peeling for gaps assumes that a single scalar iteration
             is enough to make sure the last vector iteration doesn't
             access excess elements.
             ??? Enhancements include peeling multiple iterations
             or using masked loads with a static mask.  */
          || (group_size * cvf) % cnunits + group_size - gap < cnunits)
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "peeling for gaps insufficient for "
                             "access\n");
and in all cases multiple_p (group_size * LOOP_VINFO_VECT_FACTOR, nunits)
was true, so we never performed the inner check of whether peeling one
iteration is sufficient.  But after the refactoring the outer check
merely indicates there's an overrun (which is already the case here
because gap != 0).  That is, for the "regular" gap case we never
verified whether peeling a single iteration is sufficient.  But now of
course we run into the inner check, which always triggers if the
earlier checks didn't manage to set overrun_p to false.
For slp_perm_8.c we have a group_size of two, nunits is {16, 16},
VF is {8, 8} and gap is one.  Given multiple_p holds we know that
(group_size * cvf) % cnunits is zero, so what remains is
group_size - gap < nunits, and 1 is always less than {16, 16}.
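To double-check that arithmetic, here is a small standalone sketch
(not part of the patch) that replays the pre-existing check with the
minimum runtime value of the poly-ints (cnunits == 16, cvf == 8):

  /* Standalone sketch: replay the pre-existing check with the
     slp_perm_8.c values.  */
  #include <stdio.h>

  int
  main ()
  {
    unsigned cnunits = 16, cvf = 8, group_size = 2, gap = 1;
    /* multiple_p guarantees (group_size * cvf) % cnunits == 0 here,
       so the check degenerates to group_size - gap < cnunits.  */
    unsigned lhs = (group_size * cvf) % cnunits + group_size - gap;
    printf ("%u < %u -> peeling deemed %s\n", lhs, cnunits,
            lhs < cnunits ? "insufficient" : "sufficient");
    /* Prints 1 < 16 -> peeling deemed insufficient.  */
    return 0;
  }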
The new logic I added in the later patch, that peeling a single
iteration is OK when we use a smaller access rounded up to the next
power-of-two size, is

      || ((group_size * cvf) % cnunits + group_size - gap < cnunits
          /* But peeling a single scalar iteration is enough if
             we can use the next power-of-two sized partial
             access.  */
          && (cremain = (group_size * cvf - gap) % cnunits, true)
          && (cpart_size = (1 << ceil_log2 (cremain))) != cnunits
          && vector_vector_composition_type
               (vectype, cnunits / cpart_size,
                &half_vtype) == NULL_TREE)))
Again, knowing the multiple we know cremain is nunits - gap, and with
gap == 1 rounding this size up will yield nunits, so the existing
peeling is OK.  Something is inconsistent here, and the pre-existing

  (group_size * cvf) % cnunits + group_size - gap < cnunits

check looks suspicious as a general check.
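For reference, the rounding step with those numbers, again as a
standalone sketch (ceil_log2 is reimplemented here just for
illustration, it is not the GCC implementation):

  #include <stdio.h>

  /* Minimal reimplementation of ceil_log2 for this sketch.  */
  static unsigned
  my_ceil_log2 (unsigned x)
  {
    unsigned l = 0;
    while ((1u << l) < x)
      l++;
    return l;
  }

  int
  main ()
  {
    unsigned cnunits = 16, cvf = 8, group_size = 2, gap = 1;
    unsigned cremain = (group_size * cvf - gap) % cnunits;  /* 15 */
    unsigned cpart_size = 1u << my_ceil_log2 (cremain);     /* 16 */
    /* cpart_size == cnunits, so the cpart_size != cnunits clause is
       false and the condition no longer flags peeling as
       insufficient.  */
    printf ("cremain = %u, cpart_size = %u\n", cremain, cpart_size);
    return 0;
  }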
Stepping back,

  (group_size * cvf - gap)

should be the number of elements we can access without touching
excess elements.  Peeling a single iteration will make sure

  group_size * cvf + group_size - gap

elements are accessed (that's group_size * (cvf + 1) - gap).  The
excess elements touched in the vector loop are

  cnunits - (group_size * cvf - gap) % cnunits

and I think that number needs to be less than or equal to group_size,
so the correct check should be

  (group_size * cvf - gap) % cnunits + group_size < cnunits
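Evaluating that with the slp_perm_8.c numbers, once more as a
standalone sketch using the minimum runtime value of the poly-ints:

  #include <stdio.h>

  int
  main ()
  {
    unsigned cnunits = 16, cvf = 8, group_size = 2, gap = 1;
    unsigned remain = (group_size * cvf - gap) % cnunits;  /* 15 */
    unsigned excess = cnunits - remain;                    /* 1 */
    /* One peeled scalar iteration provides group_size == 2 elements,
       enough for the single excess element; and indeed
       remain + group_size == 17 is not less than cnunits == 16.  */
    printf ("excess = %u, peeling insufficient: %s\n", excess,
            remain + group_size < cnunits ? "yes" : "no");
    return 0;
  }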
For the SVE case that's (nunits - 1) + 2 < nunits, which should
simplify to false.  Now the question is how to formulate this with
poly-ints in a way that works out; for the case in question doing
the overrun check as
  if (overrun_p
      && can_div_trunc_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo)
                          - gap, nunits, &tem, &remain)
      && maybe_lt (remain + group_size, nunits))
seems to do the trick. I'm going to test this, will post an updated
series for the CI.
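To convince myself the poly-int formulation behaves, here is a toy
model of the semantics as I understand them (this is not the real
poly-int API; a poly value here is a + b*x for some runtime x >= 0,
and can_div_trunc_p is assumed to compute quotient 0 and remainder
{15, 16} for these inputs):

  #include <stdio.h>

  /* Toy model of a poly-int: the value a + b*x for runtime x >= 0.  */
  struct poly { long a, b; };

  /* maybe_lt (p, q): can p be less than q for some x >= 0?  For
     linear functions it suffices to check x == 0 and the slope.  */
  static bool
  my_maybe_lt (poly p, poly q)
  {
    return p.a < q.a || p.b < q.b;
  }

  int
  main ()
  {
    poly nunits = { 16, 16 };  /* SVE: 16 + 16*x elements.  */
    poly vf = { 8, 8 };
    long group_size = 2, gap = 1;
    /* group_size * VF - gap = {15, 16}; truncated division by nunits
       gives quotient 0 for every x, leaving remain = {15, 16}.  */
    poly remain = { group_size * vf.a - gap, group_size * vf.b };
    /* remain + group_size = {17, 16} and 17 + 16*x < 16 + 16*x never
       holds, so maybe_lt is false and peeling is not flagged as
       insufficient.  */
    poly lhs = { remain.a + group_size, remain.b };
    printf ("maybe_lt -> %s\n",
            my_maybe_lt (lhs, nunits) ? "true" : "false");
    return 0;
  }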
Richard.
> Thanks,
> Richard.
>
> > Thanks,
> > Richard
> >
> > > + && ((cpart_size = (1 << ceil_log2 (cremain)))
> > > + != cnunits)
> > > + && vector_vector_composition_type
> > > + (vectype, cnunits / cpart_size,
> > > + &half_vtype) == NULL_TREE))))
> > > {
> > > if (dump_enabled_p ())
> > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > @@ -11599,6 +11608,27 @@ vectorizable_load (vec_info *vinfo,
> > > gcc_assert (new_vtype
> > > || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> > > + /* But still reduce the access size to the next
> > > + required power-of-two so peeling a single
> > > + scalar iteration is sufficient. */
> > > + unsigned HOST_WIDE_INT cremain;
> > > + if (remain.is_constant (&cremain))
> > > + {
> > > + unsigned HOST_WIDE_INT cpart_size
> > > + = 1 << ceil_log2 (cremain);
> > > + if (known_gt (nunits, cpart_size)
> > > + && constant_multiple_p (nunits, cpart_size,
> > > + &num))
> > > + {
> > > + tree ptype;
> > > + new_vtype
> > > + = vector_vector_composition_type (vectype,
> > > + num,
> > > + &ptype);
> > > + if (new_vtype)
> > > + ltype = ptype;
> > > + }
> > > + }
> > > }
> > > }
> > > tree offset
> >
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)