On 12/11/2025 10:36, Richard Sandiford wrote:
As a general comment, we do already support constructors of
variable-length vectors, thanks to Tejas's earlier ACLE work.
For example: https://godbolt.org/z/Eqd8sM4cv .  The quality
of the output is awful, but it does look correct.

Thanks, Richard!

That does look inefficient, although seemingly no worse than the code that is currently generated by my patch series. It shows the same pattern: zeroing an SVE register, storing it to the stack, modifying the stored copy, reloading it, storing it to another stack location, modifying it, reloading it, and so on.

I guess you mean commit 17b520a10cdaabbcbaeaf30fc5228986b368ce0c. The function that Tejas modified was

tree
build_vector_from_ctor (tree type, const vec<constructor_elt, va_gc> *v);

in tree.cc ("the low level primitives for operating on tree nodes"). It declares a local tree_vector_builder instead of requiring one to be passed by its caller (as gimple_build_vector does):

inline tree
gimple_build_vector (gimple_seq *seq, tree_vector_builder *builder)

I'll take a closer look at whether that approach can be adapted for BB SLP, since it might avoid the need for my current constant_lower_bound change.
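
To make the contrast concrete, here is the rough usage shape of the two helpers (illustrative only: ctor_elts, elts, and the builder parameters are placeholders, and the real gimple_build_vector overloads take further arguments):

   /* build_vector_from_ctor owns its tree_vector_builder, so callers
      just supply the CONSTRUCTOR elements.  */
   tree v1 = build_vector_from_ctor (vector_type, ctor_elts);

   /* gimple_build_vector requires the caller to populate a builder
      and pass it in.  */
   tree_vector_builder builder (vector_type, npatterns, nelts_per_pattern);
   for (tree elt : elts)
     builder.quick_push (elt);
   tree v2 = gimple_build_vector (&seq, &builder);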

Sorry if that was already common ground -- just mentioning it
because I didn't see it in this subthread.

Richard Biener <[email protected]> writes:
On Fri, 7 Nov 2025, Christopher Bazley wrote:

On 07/11/2025 13:35, Richard Biener wrote:
On Wed, 5 Nov 2025, Christopher Bazley wrote:

On 28/10/2025 13:51, Richard Biener wrote:
On Tue, 28 Oct 2025, Christopher Bazley wrote:

vect_create_constant_vectors is updated to pad with zeros
between the end of a group and the end of a vector of the type
chosen for the SLP node, when used for BB SLP. This function
calls gimple_build_vector, which also has to be updated for
SVE vector types (by using the lower bound as the number of
elements, e.g., 16 for VNx16QI).
---
 gcc/gimple-fold.cc   |  2 +-
 gcc/tree-vect-slp.cc | 43 +++++++++++++++++++++++++++++++++++--------
 2 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
index edcc04adc08..e5fe0ea12a7 100644
--- a/gcc/gimple-fold.cc
+++ b/gcc/gimple-fold.cc
@@ -11275,7 +11275,7 @@ gimple_build_vector (gimple_stmt_iterator *gsi,
     {
       gimple_seq seq = NULL;
       tree type = builder->type ();
-      unsigned int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+      unsigned int nelts = constant_lower_bound (TYPE_VECTOR_SUBPARTS (type));
I don't think this is desirable in a generic helper?  How does
the 'builder' class deal with the all-constant case?  It seems
handling for constant vs. non-constant will now differ semantically
(instead of ICEing in one case previously).
This was the most minimal change I could make to get the feature working
(whilst debugging many other issues), and it seemed harmless to me, so I
didn't spend much time thinking about it.

I know very little about the builder, but my understanding is that it would
behave as though elements beyond the lower bound do not exist.  E.g., if the
vector type is VNx16QI then TREE_CONSTANT would return true for the
CONSTRUCTOR node created by build_constructor if elements 0..15 are constant.

This is presumably not safe general-purpose behaviour, because it would
leave any other elements uninitialised (which does not matter for my
use-case).  I have no objection to trying to solve this elsewhere (probably
in vect_create_constant_vectors), but I'll first need to revert this change
and remind myself what breaks.
Fixing this upthread would definitely be better.  Not sure exactly how.
Alternatively, the change could be done in a way that asserts that
the tree_vector_builder has no more elements than the
constant_lower_bound of nelts.  I don't exactly
remember what the builder tracks here or what the constraints for
initialization of VLA vectors are.
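
For reference, my understanding of the VECTOR_CST encoding (paraphrasing the comments in vector-builder.h; treat this as a sketch rather than gospel):

   /* A vector constant is encoded as NPATTERNS interleaved patterns,
      each described by up to NELTS_PER_PATTERN (<= 3) encoded values:
        nelts_per_pattern == 1: the single value repeats,
        nelts_per_pattern == 2: base0, then base1 repeating,
        nelts_per_pattern == 3: base0, base1, then a constant step.
      So { x, y, z, 0, 0, ... } for VNx16QI, with zeros continuing
      beyond the minimum length, needs npatterns == 16 and
      nelts_per_pattern == 2 with the second half of the encoded
      elements all zero; it cannot be expressed by encoding only
      elements 0..15.  */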
I've done some further investigation.

One of the tests that failed without my change to gimple_build_vector was
gcc.target/aarch64/sve/slp_6.c. I made that change to enable building of the
following constant (among others):

_70 = {_85, _21, _55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

That constant has only 16 elements, although the type of _70 is
vector([16,16]) unsigned char:

void vec_slp_int8_t (int8_t * restrict a, int8_t * restrict b, int n)
{
...

   vector([16,16]) signed char vect_x0_43.58;
   vector([16,16]) signed char vect__90.57;
   vector([16,16]) unsigned char vect__89.56;
   vector([16,16]) unsigned char vect__87.55;
   vector([16,16]) signed char vect_x0_26.54;
   vector([16,16]) signed char vect_x0_34.47;

...

   vector([16,16]) signed char vect_x1_35.41;

...

   vector([16,16]) signed char vect_x2_36.35;

...

   void * _8;
   vector([16,16]) signed char[3] * _9;

...

   unsigned char _21;
   vector([16,16]) unsigned char _22;
   unsigned char _55;
   vector([16,16]) unsigned char _56;

...

   vector([16,16]) unsigned char _70;
   vector([16,16]) unsigned char _84;
   unsigned char _85;

...

   <bb 5> [local count: 105119324]:
   _84 = (vector([16,16]) unsigned char) vect_x0_34.47_82;
   _85 = .REDUC_PLUS (_84);
   _22 = (vector([16,16]) unsigned char) vect_x1_35.41_38;
   _21 = .REDUC_PLUS (_22);
   _56 = (vector([16,16]) unsigned char) vect_x2_36.35_58;
   _55 = .REDUC_PLUS (_56);
   _70 = {_85, _21, _55, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
   vect__89.56_3 = vect__87.55_2 + _70;
   vect__90.57_4 = VIEW_CONVERT_EXPR<vector([16,16]) signed char>(vect__89.56_3);

   <bb 6> [local count: 118111600]:
   # vect_x0_43.58_5 = PHI <vect__90.57_4(5), vect_x0_26.54_1(8)>
   .MASK_STORE (b_25(D), 8B, { -1, -1, -1, 0, 0, 0, 0, 0, ... }, vect_x0_43.58_5); [tail call]
   return;

The compiled code looks correct, although movi d0, #0 only zeros the first 16
bytes of the variable-length vector constant.

Architecturally, it zeros the whole register, including the “upper” SVE-only
bits.

Yes, you're right:
https://developer.arm.com/documentation/ddi0602/2025-09/Shared-Pseudocode/shared-functions-registers?lang=en#impl-shared.V.write.2

I wasn't previously aware of that behaviour.

But what mode is the move done in?  Does the RTL move insn use
a 16-byte mode or a VL-byte mode?

It seems to use a VL-byte mode:

    movi    d0, #0                          // 41  [c=4 l=4]  *aarch64_sve_movvnx16qi_ldr_str/3
    st1b    z0.b, p6, [sp, #2, mul vl]      //, tmp141        // 46  [c=8 l=4]  aarch64_pred_movvnx16qi/2

The relevant RTL is:

   (insn 41 94 46 (set (reg:VNx16QI 32 v0)
           (const_vector:VNx16QI repeat [
                   (const_int 0 [0])
               ])) 5577 {*aarch64_sve_movvnx16qi_ldr_str}
        (nil))

   (insn:TI 46 41 43 (set (mem/c:VNx16QI (plus:DI (reg/f:DI 31 sp)
                   (const_poly_int:DI [32, 32])) [0  S[16, 16] A128])
           (unspec:VNx16QI [
                   (reg:VNx16BI 74 p6 [141])
                   (reg:VNx16QI 32 v0)
               ] UNSPEC_PRED_X)) 5612 {aarch64_pred_movvnx16qi}
        (expr_list:REG_DEAD (reg:VNx16QI 32 v0)
           (nil)))

My understanding is that the RTL generated by my patchset guarantees zero-initialisation of the whole SVE register, at least for the use-case I intend.

I suspect it exercises this code path in aarch64_output_sve_mov_immediate:

   if (info.u.mov.value == const0_rtx && TARGET_NON_STREAMING)
     snprintf (templ, sizeof (templ), "movi\t%%d0, #0");

which was changed by Tamar last year (to use movi d0, #0). For reference, the compiled code for the tail of the function is:

     addvl    x0, sp, #2
     movi    d0, #0
     st1b    z0.b, p6, [sp, #2, mul vl]
     uaddv    d27, p6, z27.b
     uaddv    d26, p6, z26.b
     uaddv    d25, p6, z25.b
     str    b27, [x0]
     addvl    x0, sp, #1
     add    x0, x0, 1
     ptrue    p7.b, vl3
     ld1b    z0.b, p6/z, [sp, #2, mul vl]
     st1b    z0.b, p6, [sp, #1, mul vl]
     str    b26, [x0]
     ld1b    z0.b, p6/z, [sp, #1, mul vl]
     st1b    z0.b, p6, [sp]
     str    b25, [sp, 2]
     ld1b    z0.b, p6/z, [sp]
     add    z28.b, z0.b, z28.b
     st1b    z28.b, p7, [x1]
     addvl    sp, sp, #3
     .cfi_def_cfa_offset 0
     ret

(This code has already been noted to be inefficient, which I plan to address
separately.)

The decision about how many bytes to zero is made in the calling function,
vect_create_constant_vectors (which also uses constant_lower_bound), rather
than in gimple_build_vector:

   unsigned int elt_count = group_size;
   if (is_a<bb_vec_info> (vinfo))
     {
       /* We don't use duplicate_and_interleave for basic block
          vectorization.  We know that the group size fits within a
          single vector, so all we need to do for VLA is to pad the
          constant to the minimum vector length.  */
       nunits = constant_lower_bound (TYPE_VECTOR_SUBPARTS (vector_type));
       elt_count = MAX (nunits, group_size);
     }
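
Plugging in the numbers for the slp_6.c example above (my arithmetic, using the dump shown earlier):

   /* vector_type = VNx16QI, so TYPE_VECTOR_SUBPARTS = [16,16];
      group_size = 3 (three reduction results):
        nunits    = constant_lower_bound ([16,16]) = 16
        elt_count = MAX (16, 3)                    = 16
      giving the three values followed by thirteen explicit zeros,
      exactly as in _70 above.  */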

My current understanding is that you don't object to this part of my change.
Whatever happens in gimple_build_vector won’t alter the fact that only the
minimum number of bytes are zeroed, and in most cases that’s the desirable
outcome.

I therefore plan to keep my modification to gimple_build_vector, but add an
assertion that builder->encoded_nelts () is no greater than
constant_lower_bound (TYPE_VECTOR_SUBPARTS (builder->type ())), so that the
modified function never builds fewer elements than expected when one of them
is non-constant. Would that be OK?
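
Concretely, the assertion would be something like (sketch only; exact placement inside gimple_build_vector to be decided):

   gcc_assert (builder->encoded_nelts ()
               <= constant_lower_bound (TYPE_VECTOR_SUBPARTS
                                        (builder->type ())));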
I'm not sure builder->encoded_nelts () is the correct thing to check
here.  In particular, any stepped encoding should be rejected as well,
so nelts_per_pattern () must be <= 2.  And even then the interpretation
is to fill with the last value IIRC, and since you get zero-filling
by building a "short" CTOR, that last element should be a zero.  I'm
not sure how to get at the "last" value, but I think that, given
we create a "short" CTOR, we need to check that all remaining
elements of the VLA vector are actually encoded as zeros?

Does that mean the tree_vector_builder initialised by vect_create_constant_vectors could need an extra zero at the end?

Agreed.  The only valid situations seem to be:

(1) a duplicate of a single zero, where:

     npatterns == nelts_per_pattern == encoded_nelts == 1

     and the only encoded value is zero

(2) the combination of:

     - nelts_per_pattern == 2
     - multiple_p (TYPE_VECTOR_SUBPARTS (type), npatterns)
     - the second half of the encoded elements are all zeros
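
A predicate for these two cases might look like this (sketch only; the
helper name and placement are illustrative):

     static bool
     zero_filled_encoding_p (const tree_vector_builder &builder)
     {
       unsigned int npatterns = builder.npatterns ();
       unsigned int nelts_per_pattern = builder.nelts_per_pattern ();

       /* (1) a duplicate of a single zero.  */
       if (npatterns == 1 && nelts_per_pattern == 1)
         return integer_zerop (builder[0]);

       /* (2) paired encoding whose second half is all zeros.  */
       if (nelts_per_pattern == 2
           && multiple_p (TYPE_VECTOR_SUBPARTS (builder.type ()),
                          npatterns))
         {
           for (unsigned int i = 0; i < npatterns; ++i)
             if (!integer_zerop (builder[npatterns + i]))
               return false;
           return true;
         }

       return false;
     }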

But these combinations would not come about by chance.  The caller
would have to take steps to ensure that they're true.  So rather
than check for these relatively complex conditions, it might
be clearer to add a new gimple_build interface that explicitly
fills with zeros, using a normal array (instead of a
tree_vector_builder) for the explicitly-initialised elements.
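
For illustration, such an interface might look roughly like this (the
name, signature, and details are mine rather than an agreed design; it
assumes a VLA TYPE whose subparts are a multiple of the minimum length,
and it reuses the case-(2) encoding above for the all-constant path):

     static tree
     gimple_build_vector_with_zeros (gimple_seq *seq, tree type,
                                     unsigned int nelts, tree *elts)
     {
       unsigned int lb = constant_lower_bound (TYPE_VECTOR_SUBPARTS (type));
       gcc_assert (nelts <= lb);

       bool all_constant = true;
       for (unsigned int i = 0; i < nelts; ++i)
         if (!CONSTANT_CLASS_P (elts[i]))
           all_constant = false;

       tree zero = build_zero_cst (TREE_TYPE (type));
       if (all_constant)
         {
           /* npatterns == lb, nelts_per_pattern == 2: the first LB
              encoded elements are the explicit values padded with
              zeros, the second LB are all zeros, so the vector is
              zero-filled beyond the minimum length.  */
           tree_vector_builder builder (type, lb, 2);
           for (unsigned int i = 0; i < 2 * lb; ++i)
             builder.quick_push (i < nelts ? elts[i] : zero);
           return builder.build ();
         }

       /* Otherwise emit a "short" CONSTRUCTOR and rely on the VLA
          zero-filling semantics discussed in this thread.  */
       vec<constructor_elt, va_gc> *v;
       vec_alloc (v, nelts);
       for (unsigned int i = 0; i < nelts; ++i)
         CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, elts[i]);
       tree res = make_ssa_name (type);
       gimple_seq_add_stmt (seq,
                            gimple_build_assign (res,
                                                 build_constructor (type, v)));
       return res;
     }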

Would a new gimple_build_*_with_zeros function remove the need for vect_create_constant_vectors to pad with zeros at all?

The design of vect_create_constant_vectors seems to be heavily built around the use of a tree_vector_builder, so I'm a bit reluctant to do anything that would require significant refactoring of it, or that would require this seemingly rather ordinary case to be treated specially.

--
Christopher Bazley
Staff Software Engineer, GNU Tools Team.
Arm Ltd, 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.
http://www.arm.com/
