https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
--- Comment #43 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 4 Mar 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113441
>
> --- Comment #41 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #40)
> > So I wonder if we can use "local costing" to decide a gather is always OK
> > compared to the alternative with peeling for gaps.  On x86 gather tends
> > to be slow compared to open-coding it.
> Yeah, on SVE gathers are generally "enabling" instructions rather than
> something to use for their own sake.
>
> I suppose one problem is that we currently only try to use gathers for
> single-element groups.  If we make a local decision to use gathers while
> keeping that restriction, we could end up using gathers "unnecessarily" while
> still needing to peel for gaps for (say) a two-element group.
>
> That is, it's only better to use gathers than contiguous loads if by doing
> that we avoid all need to peel for gaps (and if the cost of peeling for gaps
> was high enough to justify the cost of using gathers over consecutive loads).

Yep.  I do want to experiment with a way to have vectorizable_* register
multiple variants of vectorization and have ways to stitch together and
cost the overall vectorization as a cheaper (and more flexible) way to
"iteration".  It will to some extent blow up combinations to try but there
might be a way to use greedy relaxation techniques to converge to a lowest
cost variant.

> One of the things on the list to do (once everything is SLP!) is to support
> loads with gaps directly via predication, so that we never load elements that
> aren't needed.  E.g. on SVE, a 64-bit predicate (PTRUE .D) can be used with a
> 32-bit load (LD1W .S) to load only even-indexed elements.  So a single-element
> group with a group size of 2 could be done cheaply with just consecutive
> loads, without peeling for gaps.

Yep.
Gap handling leaves something to be desired (also when no predication is
available); I also plan to address some shortcomings in that area early in
stage1.  Note that generally the idea is that gap peeling is very cheap -
unless that is the only reason to have an epilogue at all.  The exception
might be small round-trip loops, but those are best handled with
predication, where there's no good reason to do peeling for gaps at all.
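
To make the two strategies discussed above concrete, here is a rough scalar
model in C of a single-element group with a group size of 2 (only a[2*i] is
used).  It is only a sketch, not what the vectorizer emits: VF, the function
names and the max_index_read bookkeeping are made up for illustration, and
the tail masking in the predicated variant assumes a fully predicated loop
on top of the even-lane predicate mentioned in the quoted comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vectorization factor: results produced per vector iteration. */
enum { VF = 4 };

/* Scalar reference: only a[2*i] is used, so a[2*i + 1] is the "gap". */
long sum_scalar(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[2 * i];
    return sum;
}

/* Model of contiguous full-vector loads.  Each vector iteration reads
   2*VF consecutive elements, including the gap after the last used one,
   so at least one scalar iteration is held back for the epilogue
   (peeling for gaps) to avoid reading past a[2*(n-1)].  */
long sum_peel_for_gaps(const int *a, size_t n, size_t *max_index_read)
{
    assert(max_index_read != NULL);
    long sum = 0;
    size_t i = 0, hi = 0;
    size_t vec_iters = n > 0 ? (n - 1) / VF : 0;

    for (size_t v = 0; v < vec_iters; v++, i += VF) {
        for (size_t l = 0; l < 2 * VF; l++) {
            int lane = a[2 * i + l];   /* every lane is really loaded */
            hi = 2 * i + l;
            if (l % 2 == 0)            /* only even lanes are used */
                sum += lane;
        }
    }
    for (; i < n; i++) {               /* scalar epilogue */
        sum += a[2 * i];
        hi = 2 * i;
    }
    *max_index_read = hi;
    return sum;
}

/* Model of a predicated load: inactive lanes are never read.  The
   predicate combines "even lane only" with an assumed loop predicate
   that masks off results beyond n.  No epilogue is needed.  */
long sum_predicated(const int *a, size_t n, size_t *max_index_read)
{
    assert(max_index_read != NULL);
    long sum = 0;
    size_t hi = 0;

    for (size_t i = 0; i < n; i += VF) {
        for (size_t l = 0; l < 2 * VF; l++) {
            int active = (l % 2 == 0) && (i + l / 2 < n);
            if (active) {
                sum += a[2 * i + l];
                hi = 2 * i + l;
            }
        }
    }
    *max_index_read = hi;
    return sum;
}
```

With n = 4 and an array of 2*n - 1 = 7 elements, the peeling variant runs no
vector iteration at all (everything lands in the epilogue), whereas the
predicated variant handles it in one vector iteration without touching a[7].
That is the case where gap peeling is the only reason to have an epilogue.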