https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441

--- Comment #43 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 4 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
> 
> --- Comment #41 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #40)
> > So I wonder if we can use "local costing" to decide a gather is always OK
> > compared to the alternative with peeling for gaps.  On x86 gather tends
> > to be slow compared to open-coding it.
> Yeah, on SVE gathers are generally "enabling" instructions rather than
> something to use for their own sake.
> 
> I suppose one problem is that we currently only try to use gathers for
> single-element groups.  If we make a local decision to use gathers while
> keeping that restriction, we could end up using gathers "unnecessarily" while
> still needing to peel for gaps for (say) a two-element group.
> 
> That is, it's only better to use gathers than contiguous loads if by doing
> that we avoid all need to peel for gaps (and if the cost of peeling for gaps
> was high enough to justify the cost of using gathers over consecutive loads).

Yep.  I do want to experiment with a way to have vectorizable_* register
multiple variants of vectorization and have ways to stitch them together and
cost the overall vectorization, as a cheaper (and more flexible) alternative
to "iteration".  It will to some extent blow up the combinations to try but
there might be a way to use greedy relaxation techniques to converge to
a lowest cost variant.
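
To make the idea concrete, here is a minimal sketch in C of greedy
selection over per-statement variants.  Everything in it is hypothetical
and for illustration only: the struct and function names, the cost
numbers, and the model where a single shared epilogue cost is paid as soon
as any chosen variant needs peeling for gaps.  It is not how the
vectorizer is structured today.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VARIANTS 4

/* Hypothetical model: one way of vectorizing a statement.  */
struct variant { int cost; bool needs_gap_peel; };

/* Hypothetical model: a statement with its registered variants.  */
struct stmt { size_t nvariants; struct variant v[MAX_VARIANTS]; };

/* Overall cost: sum of the chosen variants, plus one shared epilogue
   cost if any chosen variant requires peeling for gaps.  */
int total_cost (const struct stmt *s, const size_t *choice, size_t n,
                int peel_cost)
{
  int cost = 0;
  bool peel = false;
  for (size_t i = 0; i < n; i++)
    {
      const struct variant *v = &s[i].v[choice[i]];
      cost += v->cost;
      peel |= v->needs_gap_peel;
    }
  return cost + (peel ? peel_cost : 0);
}

/* Start from the locally cheapest variant of each statement, then keep
   flipping single choices while that lowers the overall cost.  */
int greedy_select (const struct stmt *s, size_t *choice, size_t n,
                   int peel_cost)
{
  for (size_t i = 0; i < n; i++)
    {
      choice[i] = 0;
      for (size_t k = 1; k < s[i].nvariants; k++)
        if (s[i].v[k].cost < s[i].v[choice[i]].cost)
          choice[i] = k;
    }

  int best = total_cost (s, choice, n, peel_cost);
  bool improved = true;
  while (improved)
    {
      improved = false;
      for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < s[i].nvariants; k++)
          {
            size_t old = choice[i];
            if (k == old)
              continue;
            choice[i] = k;
            int c = total_cost (s, choice, n, peel_cost);
            if (c < best)
              {
                best = c;
                improved = true;
              }
            else
              choice[i] = old;
          }
    }
  return best;
}
```

With a single load that has a contiguous variant (needs peeling) and a
gather variant, the sketch switches to the gather once the epilogue cost
outweighs the difference.  With a second load that can only be done with
peeling, it keeps the contiguous variant, since the gather would no longer
avoid the epilogue.  Single flips can still get stuck in a local minimum
when several statements would have to change together.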

> One of the things on the list to do (once everything is SLP!) is to support
> loads with gaps directly via predication, so that we never load elements that
> aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
> 32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
> group with a group size of 2 could be done cheaply with just consecutive
> loads, without peeling for gaps.

Yep.  Gap handling leaves much to be desired (also when no predication is
available).  I also plan to address some shortcomings in that area early
in stage1.
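
As a rough illustration of the predicated approach, the following C sketch
emulates a contiguous load under an "even lanes only" predicate for a
single-element group with group size 2.  The names, the lane count of 8
and the restriction that n is a multiple of LANES / 2 are assumptions made
for the example.  This is plain scalar C, not SVE intrinsics.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 8   /* assumed number of 32-bit lanes per vector */

/* Emulate a predicated contiguous load: inactive lanes are never read
   from memory and are zeroed in the result.  */
void masked_load (int *dst, const int *src, const bool *pred)
{
  for (size_t lane = 0; lane < LANES; lane++)
    dst[lane] = pred[lane] ? src[lane] : 0;
}

/* Sum a[2 * i] for i < n, assuming n is a multiple of LANES / 2.
   Only even-indexed elements are ever accessed, so the gap element
   after the last group is not touched.  */
int sum_even_predicated (const int *a, size_t n)
{
  bool even[LANES];
  for (size_t lane = 0; lane < LANES; lane++)
    even[lane] = (lane % 2 == 0);

  int sum = 0;
  for (size_t i = 0; i < n; i += LANES / 2)
    {
      int v[LANES];
      masked_load (v, a + 2 * i, even);
      for (size_t lane = 0; lane < LANES; lane++)
        sum += v[lane];
    }
  return sum;
}
```

For n = 4 the loop reads a[0], a[2], a[4] and a[6] only, so an array of
seven elements is enough and no peeling is needed.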

Note that generally the idea is that gap peeling is very cheap - unless
that is the only reason to have an epilogue at all.  The exception might
be small round-trip loops but those are best handled with predication
where there's no good reason to do peeling for gaps at all.
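
For reference, a small C emulation of what peeling for gaps buys when no
predication is used.  The function name, the vectorization factor of 4 and
the max_index bookkeeping are made up for the example.  The "vector" body
reads 2 * VF contiguous elements and uses only the even ones, so its last
read is one element past the last element that is actually needed.

```c
#include <stddef.h>

#define VF 4   /* assumed vectorization factor */

/* Sum a[2 * i] for i < n.  The emulated vector loop handles VF scalar
   iterations at a time with contiguous loads.  If peel_for_gaps is
   nonzero, at least one scalar iteration is left for the epilogue.
   *max_index receives the highest index read from a.  */
int sum_even_vector (const int *a, size_t n, int peel_for_gaps,
                     size_t *max_index)
{
  size_t vec_iters = 0;   /* scalar iterations done in the vector loop */
  if (n > 0)
    vec_iters = (peel_for_gaps ? n - 1 : n) / VF * VF;

  int sum = 0;
  size_t i = 0;
  *max_index = 0;

  for (; i < vec_iters; i += VF)
    {
      /* Contiguous access, including the unused gap elements.  */
      for (size_t lane = 0; lane < 2 * VF; lane++)
        {
          int elt = a[2 * i + lane];
          if (lane % 2 == 0)
            sum += elt;
        }
      *max_index = 2 * i + 2 * VF - 1;
    }

  /* Scalar epilogue: touches only the elements that are needed.  */
  for (; i < n; i++)
    {
      sum += a[2 * i];
      *max_index = 2 * i;
    }
  return sum;
}
```

With n = 4 and no peeling the emulated loop reads up to a[7], although
a[6] is the last element needed.  With peeling the highest index read is
6, but all four iterations then run in the scalar epilogue, which is the
case where the epilogue exists only because of the gap.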
