On 29/07/2025 11:39, Richard Biener wrote:
On Tue, 29 Jul 2025, Alfie Richards wrote:
On 29/07/2025 10:13, Richard Biener wrote:
On Tue, 29 Jul 2025, Alfie Richards wrote:
(Whoops, s/WIP/RFC)
Hi All,
This patch adds support for capping VF at runtime for VLA loops with a
data dependency.
On 28/07/2025 15:56, Alfie Richards wrote:
Hi All,
This patch adds support for capping VF at runtime for VLA loops with a
data dependency.
Previously, no loop with a data dependency could be vectorized with VLA as
we
made no assumption on the upper bound of vector length. This adds basic
support
for this case (only for partial vectors with a full mask though).
The bump_vector_ptr logic is incorrect right now. I dont understand it
sufficiently yet.
Thoughts on this initial direction?
gcc/ChangeLog:
* tree-vect-data-refs.cc (bump_vector_ptr): Add early return to stop
ICE (incorrect logic though).
* tree-vect-loop-manip.cc (vect_set_loop_invariant_group_controls):
New
function.
(vect_set_loop_condition_normal): Set rgroup controls in capped case.
(vect_set_loop_condition): Add logic to create capped VF ssa node.
(vect_gen_vector_loop_niters): Add logic to peel in case of capped VF.
(vect_do_peeling): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info):
* tree-vect-stmts.cc (vect_get_data_ptr_increment):
* tree-vectorizer.h (GCC_TREE_VECTORIZER_H):
(LOOP_VINFO_RUNTIME_VF_CAP):
There is already LOOP_VINFO_MAX_VECT_FACTOR which should be identical
to your LOOP_VINFO_RUNTIME_VF_CAP.
Ah! Thank you that's really helpful.>
I _think_ given both the IV increment and the loop mask is generated
by vect_set_loop_condition_partial_vectors the only change should
be there? Any other "bump" should be magically OK already? Since
max_vf is loop invariant and even constant non-poly it should be
possible to somehow adjust the .WHILE_ULT by a max() operation?
We were hoping to do this with a loop invariant number of elements per
iteration, creating the predicate and VF in the loop header, and using an
epilogue (possibly a VLS epilogue) afterwards. We expect this to be a better
trade off overall.
Is this what you mean in this?
From what I could see in vect_set_loop_condition_partial_vectors it would be
hard to adapt it to work for this situation. Hence instead adapting
vect_set_loop_condition_normal which has the logic for iteration independent
VF.
Do you have a view on this? Do you still think it is more natural to fit
in vect_set_loop_condition_partial_vectors even with the loop
independent VF? It would involve quite a bit of code churn in that area
from what I can see.
That said, you are constraining the VF, so IMO that's what you should
do - you keep vector types as selected but you should adjust the
actual VF, not insert workarounds elsewhere? Or does that blow up
in your face? Any code checking the VF either does it wrong or
it does for a reason, and then using the non-altered VF would be
wrong?
Apologies, I don't follow what you mean by this. Do you mean we should update
the VF value in loop_vec_info to be a tree with the MIN_EXPR? This is what
Richard S suggested previously.
No, why would it be a MIN_EXPR? Ah, because the VF is poly. Hmm. But
yes, I think this is required for correctness eventually.
I also am aware I need to do a full audit of places using VF to see which need
to be updated with knowledge of this, but am still in the business of getting
to grips with large bits of it.
There's quite some code doing vf.is_constant (), so making VF a tree
will be quite disruptive. Note I've talked about making VF fractional
to support re-rolling, with similar consequences.
If you don't adjust VF then you still have to audit all VF uses to see
if they need adjustments (see your bumping case). Masking allows us
to have a VF that's not consistent with the vector types used, something
we also want for basic-block vectorization, where we want to mask out
inactive lanes we have no data for.
I'd see to keep those related uses in mind when changing what the VF
really means and how to figure whether an operation needs additional
masking (for basic-block vectorization VF is always one, but there the
"group size" is less than the number of vector elements).
Instead of making VF a tree and making it a MIN_EXPR, a simpler
(temporary) approach might be to make it a new class which could
provide the APIs currently used on VF with a sane fallback in
case there's a max_vf we need to apply? That might tell you
places to fix.
This sounds like a good compromise to me. I'll give it a go.
I still assume that at the moment you want to restrict this to
VLA vectorization, so the extra masking should be enforced when
setting up loop masking.
Yeah I'm primarily thinking of VLA, partly because from an aarch64
centric world view VLA ≈ full masking. Though it seems like it should be
relatively painless to enable the same thing on fully masked targets
without VLA.
Alfie
Richard.
Thanks for the feedback!
Alfie>
Thanks,
Richard.
(use_capped_vf):
(vect_vf_for_cost):
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve_cap_1.c: New test.
* gcc.target/aarch64/sve_cap_1_run.c: New test.
---
gcc/testsuite/gcc.target/aarch64/sve_cap_1.c | 54 +++++++++++++
.../gcc.target/aarch64/sve_cap_1_run.c | 38 ++++++++++
gcc/tree-vect-data-refs.cc | 2 +-
gcc/tree-vect-loop-manip.cc | 76 ++++++++++++++++++-
gcc/tree-vect-loop.cc | 11 +++
gcc/tree-vect-stmts.cc | 12 +++
gcc/tree-vectorizer.h | 25 +++++-
7 files changed, 213 insertions(+), 5 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_cap_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve_cap_1_run.c
diff --git a/gcc/testsuite/gcc.target/aarch64/sve_cap_1.c
b/gcc/testsuite/gcc.target/aarch64/sve_cap_1.c
new file mode 100644
index 00000000000..31d97cd7619
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve_cap_1.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-inline -march=armv8-a+sve -fno-vect-cost-model
-mautovec-preference=sve-only" } */
+
+#define LOOP(TYPE) \
+ void \
+ f_##TYPE##_5 (TYPE *a, int n) \
+ { \
+ for (long i = 0; i < n; ++i) \
+ a[i] += a[i - 5]; \
+ } \
+ void \
+ f_##TYPE##_10 (TYPE *a, int n) \
+ { \
+ for (long i = 0; i < n; ++i) \
+ a[i] += a[i - 10]; \
+ }
+
+LOOP (char)
+LOOP (short)
+LOOP (int)
+LOOP (float)
+LOOP (long)
+LOOP (double)