Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

Richard Sandiford via Gcc-patches Fri, 30 Sep 2022 09:00:18 -0700

Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> On Tue, 27 Sept 2022 at 01:59, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> > On Fri, 23 Sept 2022 at 21:33, Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> >> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
>> >> > <richard.sandif...@arm.com> wrote:
>> >> >>
>> >> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> >> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
>> >> >> > <richard.sandif...@arm.com> wrote:
>> >> >> >>
>> >> >> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> >> >> >> >> The VLA encoding encodes the first N patterns explicitly.  The
>> >> >> >> >> npatterns/nelts_per_pattern values then describe how to extend 
>> >> >> >> >> that
>> >> >> >> >> initial sequence to an arbitrary number of elements.  So when 
>> >> >> >> >> performing
>> >> >> >> >> an operation on (potentially) variable-length vectors, the 
>> >> >> >> >> questions is:
>> >> >> >> >>
>> >> >> >> >> * Can we work out an initial sequence and 
>> >> >> >> >> npatterns/nelts_per_pattern
>> >> >> >> >>   pair that will be correct for all elements of the result?
>> >> >> >> >>
>> >> >> >> >> This depends on the operation that we're performing.  E.g. it's
>> >> >> >> >> different for unary operations 
>> >> >> >> >> (vector_builder::new_unary_operation)
>> >> >> >> >> and binary operations (vector_builder::new_binary_operations).  
>> >> >> >> >> It also
>> >> >> >> >> varies between unary operations and between binary operations, 
>> >> >> >> >> hence
>> >> >> >> >> the allow_stepped_p parameters.
>> >> >> >> >>
>> >> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
>> >> >> >> >>
>> >> >> >> >> (R) Each individual selector pattern must always select from the 
>> >> >> >> >> same vector.
>> >> >> >> >>
>> >> >> >> >> Whether this condition is met depends both on the pattern itself 
>> >> >> >> >> and on
>> >> >> >> >> the number of patterns that it's combined with.
>> >> >> >> >>
>> >> >> >> >> E.g. suppose we had the selector pattern:
>> >> >> >> >>
>> >> >> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>> >> >> >> >>
>> >> >> >> >> If the arguments and selector are n elements then this pattern 
>> >> >> >> >> on its
>> >> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
>> >> >> >> >> This is clearly true for large enough n.  So if n is variable 
>> >> >> >> >> then
>> >> >> >> >> we cannot represent this.
>> >> >> >> >>
>> >> >> >> >> If the pattern above is one of two patterns, so interleaved as:
>> >> >> >> >>
>> >> >> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
>> >> >> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
>> >> >> >> >>
>> >> >> >> >> then the pattern would select from more than one argument if
>> >> >> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable 
>> >> >> >> >> n.
>> >> >> >> >>
>> >> >> >> >> But if the pattern above is one of four patterns then it selects
>> >> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is 
>> >> >> >> >> not
>> >> >> >> >> true for any valid n or o, so the pattern is OK.
>> >> >> >> >>
>> >> >> >> >> So let's define some ad hoc terminology:
>> >> >> >> >>
>> >> >> >> >> * Px is the number of patterns in x
>> >> >> >> >> * Ex is the number of elements per pattern in x
>> >> >> >> >>
>> >> >> >> >> where x can be:
>> >> >> >> >>
>> >> >> >> >> * 1: first argument
>> >> >> >> >> * 2: second argument
>> >> >> >> >> * s: selector
>> >> >> >> >> * r: result
>> >> >> >> >>
>> >> >> >> >> Then:
>> >> >> >> >>
>> >> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
>> >> >> >> >>
>> >> >> >> >> (2) The explicit encoding can be used to produce a sequence of 
>> >> >> >> >> N*Ex*Px
>> >> >> >> >>     elements for any integer N.  This extended sequence can be 
>> >> >> >> >> reencoded
>> >> >> >> >>     as having N*Px patterns, with Ex staying the same.
>> >> >> >> >>
>> >> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final 
>> >> >> >> >> Px elements
>> >> >> >> >>     of the explicit encoding.
>> >> >> >> >>
>> >> >> >> >> So let's assume (optimistically) that we can produce the result
>> >> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er 
>> >> >> >> >> encoding
>> >> >> >> >> to imply the rest.  Then:
>> >> >> >> >>
>> >> >> >> >> * (2) means that, when combining multiple input operands with 
>> >> >> >> >> potentially
>> >> >> >> >>   different encodings, we can set the number of patterns in the 
>> >> >> >> >> result
>> >> >> >> >>   to the least common multiple of the number of patterns in the 
>> >> >> >> >> inputs.
>> >> >> >> >>   In this case:
>> >> >> >> >>
>> >> >> >> >>   Pr = least_common_multiple(P1, P2, Ps)
>> >> >> >> >>
>> >> >> >> >>   is a valid number of patterns.
>> >> >> >> >>
>> >> >> >> >> * (3) means that the number of elements per pattern of the 
>> >> >> >> >> result can
>> >> >> >> >>   be the maximum of the number of elements per pattern in the 
>> >> >> >> >> inputs.
>> >> >> >> >>   (Alternatively, we could always use 3.)  In this case:
>> >> >> >> >>
>> >> >> >> >>   Er = max(E1, E2, Es)
>> >> >> >> >>
>> >> >> >> >>   is a valid number of elements per pattern.
>> >> >> >> >>
>> >> >> >> >> So if (R) holds we can compute the result -- for both VLA and 
>> >> >> >> >> VLS -- by
>> >> >> >> >> calculating the first Pr*Er elements of the result and using the
>> >> >> >> >> encoding to derive the rest.  If (R) doesn't hold then we need 
>> >> >> >> >> the
>> >> >> >> >> selector to be constant-length.  We should then fill in the 
>> >> >> >> >> result
>> >> >> >> >> based on:
>> >> >> >> >>
>> >> >> >> >> - Pr == number of elements in the result
>> >> >> >> >> - Er == 1
>> >> >> >> >>
>> >> >> >> >> But this should be the fallback option, even for VLS.
>> >> >> >> >>
>> >> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
>> >> >> >> >> variable-length types.  After doing that, we can treat a 
>> >> >> >> >> CONSTRUCTOR
>> >> >> >> >> for an N-element vector type by setting the number of patterns 
>> >> >> >> >> to N
>> >> >> >> >> and the number of elements per pattern to 1.
>> >> >> >> > Hi Richard,
>> >> >> >> > Thanks for the suggestions, and sorry for late response.
>> >> >> >> > I have a couple of very elementary questions:
>> >> >> >> >
>> >> >> >> > 1: Consider following inputs to VEC_PERM_EXPR:
>> >> >> >> > op1: P_op1 == 4, E_op1 == 1
>> >> >> >> > {1, 2, 3, 4, ...}
>> >> >> >> >
>> >> >> >> > op2: P_op2 == 2, E_op2 == 2
>> >> >> >> > {11, 21, 12, 22, ...}
>> >> >> >> >
>> >> >> >> > sel: P_sel == 3, E_sel == 1
>> >> >> >> > {0, 4, 5, ...}
>> >> >> >> >
>> >> >> >> > What shall be the result in this case ?
>> >> >> >> > P_res = lcm(4, 2, 3) == 12
>> >> >> >> > E_res = max(1, 2, 1) == 2.
>> >> >> >>
>> >> >> >> Yeah, that looks right.  Of course, since sel is just repeating
>> >> >> >> every three elements, it could just be P_res==3, E_sel==1,
>> >> >> >> but the vector_builder would do that optimisation for us.
>> >> >> >>
>> >> >> >> (I'm not sure whether we'd see a P==3 encoding in practice,
>> >> >> >> but perhaps it's possible.)
>> >> >> >>
>> >> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
>> >> >> >> repeating every three elements) then:
>> >> >> >>
>> >> >> >> P_res = lcm(4, 2) == 4
>> >> >> >> E_res = max(1, 2, 3) == 3
>> >> >> >>
>> >> >> >> which also looks like it would give the right encoding.
>> >> >> >>
>> >> >> >> > 2. How should we specify index of element in sel when it is not
>> >> >> >> > explicitly encoded in the operand ?
>> >> >> >> > For eg:
>> >> >> >> > op1: npatterns == 2, nelts_per_pattern == 3
>> >> >> >> > { 1, 0, 2, 0, 3, 0, ... }
>> >> >> >> > op2: npatterns == 6, nelts_per_pattern == 1
>> >> >> >> > { 11, 12, 13, 14, 15, 16, ...}
>> >> >> >> >
>> >> >> >> > In sel, how do we refer to element with value 4, that would be 
>> >> >> >> > 4th element
>> >> >> >> > of first pattern in op1, but not explicitly encoded ?
>> >> >> >> > In op1, 4 will come at index == 6.
>> >> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
>> >> >> >>
>> >> >> >> What index 6 refers to depends on the length of op1.
>> >> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
>> >> >> >> If the length of op1 is 6 then index 6 refers to op2[0].
>> >> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>> >> >> >>
>> >> >> >> This comes back to (R) above.  We need to be able to prove at 
>> >> >> >> compile
>> >> >> >> time that each pattern selects from the same input vectors (for all
>> >> >> >> elements, not just the encoded elements).  If we can't prove that
>> >> >> >> then we can't fold for variable-length vectors.
>> >> >> > Hi Richard,
>> >> >> > Thanks for the clarification!
>> >> >> > I have come up with an approach to verify R:
>> >> >> >
>> >> >> > Consider following pattern:
>> >> >> > a0, a1, a1 + S, ...,
>> >> >> > nelts_per_pattern would be n / Psel, where n == actual length of the 
>> >> >> > vector.
>> >> >> > And last element of pattern will be given by:
>> >> >> > a1 + (n/Psel - 2) * S
>> >> >>
>> >> >> (I think this is just a terminology thing, but in the source,
>> >> >> nelts_per_pattern is a compile-time constant that describes the
>> >> >> encoding.  It always has the value 1, 2 or 3, regardless of the
>> >> >> runtime length.)
>> >> >>
>> >> >> > Rearranging the above term, we can think of pattern
>> >> >> > as a line with following equation:
>> >> >> > y = (S/Psel) * n + (a1 - 2S)
>> >> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
>> >> >> >
>> >> >> > At,
>> >> >> > n = 2*Psel, y = a1
>> >> >> > n = 3*Psel, y = a1 + S,
>> >> >> > n = 4*Psel, y = a1 + 2S ...
>> >> >> >
>> >> >> > To compare with n, we compare the following lines:
>> >> >> > y1 = (S/Psel) * n + (a1 - 2S)
>> >> >> > y2 = n
>> >> >> >
>> >> >> > So to check if elements always come from first vector,
>> >> >> > we want to check y1 < y2 for n > 0.
>> >> >> > Likewise, if elements always come from second vector,
>> >> >> > we want to check if y1 >= y2, for n > 0.
>> >> >>
>> >> >> One difficulty here is that the indices wrap around, so an index value 
>> >> >> of
>> >> >> 2n selects from the first vector rather than the second.  (This is 
>> >> >> pretty
>> >> >> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
>> >> >>
>> >> >> > If both lines are parallel, ie S/PSel == 1,
>> >> >> > then we choose first or second vector depending on the y-intercept 
>> >> >> > a1 - 2S.
>> >> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second 
>> >> >> > vector.
>> >> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first 
>> >> >> > vector.
>> >> >> >
>> >> >> > For eg, if we have following pattern:
>> >> >> > {0, 1, 3, ...}
>> >> >> > where a1 = 1, S = 2, and consider PSel = 2.
>> >> >> >
>> >> >> > y1 = n - 3
>> >> >> > y2 = n
>> >> >> >
>> >> >> > In this case, y1 < y2 for all values of n,  so we select first 
>> >> >> > vector.
>> >> >> >
>> >> >> > Since y2 = n, passes thru origin with slope = 1,
>> >> >> > a line can intersect it either in 1st or 3rd quadrant.
>> >> >> > Calculate point of intersection:
>> >> >> > n_int = Psel * (a1 - 2S) / (Psel - S);
>> >> >> >
>> >> >> > (a) n_int > 0
>> >> >> > n_int > 0 => intersecting in 1st quadrant.
>> >> >> > In this case there will be a cross-over at n_int.
>> >> >> >
>> >> >> > For eg, consider pattern { 0, 1, 4, ...}
>> >> >> > a1 = 1, S = 3, and let's take PSel = 2
>> >> >> >
>> >> >> > y1 = (3/2)n - 5
>> >> >> > y2 = n
>> >> >> >
>> >> >> > Both intersect at (10, 10).
>> >> >> > So for n < 10, y1 < y2
>> >> >> > and for n > 10, y1 > y2.
>> >> >> > so in this case we can't fold since we will select elements from 
>> >> >> > both vectors.
>> >> >> >
>> >> >> > (b) n_int <= 0
>> >> >> > In this case, the lines will intersect in 3rd quadrant,
>> >> >> > so depending upon the slope we can choose either vector.
>> >> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
>> >> >> > then y1 < y2 for n > 0
>> >> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
>> >> >> > then y1 > y2 for n > 0.
>> >> >> >
>> >> >> > For eg, in the above pattern {0, 1, 4, ...}
>> >> >> > a1 = 1, S = 3, and let's take PSel = 4
>> >> >> >
>> >> >> > y1 = (3/4)n - 5
>> >> >> > y2 = n
>> >> >> > Both intersect at (-20, -20).
>> >> >> > y1's slope = (S/Psel) = (3/4) < 1
>> >> >> > So y1 < y2 for n > 0.
>> >> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
>> >> >> > So we pick first vector.
>> >> >> >
>> >> >> > The following pseudo code attempts to capture this:
>> >> >> >
>> >> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
>> >> >> > {
>> >> >> >   if (S == Psel)
>> >> >> >     {
>> >> >> >       /* If y1 intercept >= 0, then y1 >= y2
>> >> >> >           for all values of n.  */
>> >> >> >       if (a1 - 2*S >= 0)
>> >> >> >         return op2;
>> >> >> >       return op1;
>> >> >> >     }
>> >> >> >
>> >> >> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
>> >> >> >    /* If intersecting in 1st quadrant, there will be cross over,
>> >> >> >        bail out.  */
>> >> >> >    if (n_int > 0)
>> >> >> >      return NULL_TREE;
>> >> >> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
>> >> >> >       then y1 < y2 for n > 0.  */
>> >> >> >    if (S < Psel)
>> >> >> >      return op1;
>> >> >> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
>> >> >> >       then y1 > y2 for n > 0.  */
>> >> >> >    return op2;
>> >> >> > }
>> >> >> >
>> >> >> > Does this look reasonable ?
>> >> >>
>> >> >> ...I think we need to be more conservative.  I think we also need to
>> >> >> distinguish n1 (the number of elements in the input vectors) and
>> >> >> nsel (the number of elements in the selector).
>> >> >>
>> >> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
>> >> >> there will be (nsel /exact Psel) - 1 index elements from the stepped
>> >> >> encoding and the final index value will be:
>> >> >>
>> >> >>   ae = a1 + (nsel /exact Psel - 2) * S
>> >> >>
>> >> >> Because of wrap-around, we need to ensure that that doesn't run
>> >> >> into an adjoining vector.  I think the easiest way of doing that
>> >> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
>> >> >> and check that the quotients are equal.
>> >> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
>> >> > vector since ae is last elem,
>> >> > and the quotient can choose the vector because it will be either 0 or
>> >> > 1 (since indices wrap around after 2n).
>> >>
>> >> Right.
>> >>
>> >> > Um, could you please elaborate a bit on how will can_div_trunc_p
>> >> > calculate quotients, when n1 and nsel are unknown
>> >> > at compile time ?
>> >> >
>> >> > To calculate the quotients for a hard coded pattern,
>> >> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
>> >> > I tried the following:
>> >> >
>> >> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>> >> >   poly_uint64 nsel = n1;
>> >> >   poly_uint64 a1 = 1
>> >> >   poly_uint64 Esel = exact_div (nsel / Psel);
>> >>
>> >> We can't use exact_div here.  We need to test that nsel is a multiple
>> >> of Psel (e.g. using multiple_p).
>> >>
>> >> >   poly_uint64 ae = a1 + (Esel - 2) * S;
>> >> >
>> >> >   int q1, qe;
>> >> >   poly_uint64 r1, re;
>> >> >
>> >> >   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
>> >> >   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
>> >> >
>> >> > Which gave strange values for qe and 0 for q1, with first call 
>> >> > succeeding,
>> >> > and second call returning false.
>> >> > Am I calling it incorrectly ?
>> >>
>> >> No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
>> >> and so for min(nsel) the index wraps around to the other input vector.
>> >> IMO it's OK to punt on that.  We don't have interfaces for applying
>> >> ranges to the indeterminates in a poly_int.
>> > Hi Richard,
>> > Thanks for the suggestions.
>> > I tried the following to test for wrap around:
>> >
>> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>> >   poly_uint64 nsel = n1;
>> >   poly_uint64 a1 = 1;
>> >   unsigned Psel = 4;
>> >   int S = 3;
>> >
>> >   if (multiple_p (nsel, Psel))
>> >     {
>> >       poly_uint64 nelems = exact_div (nsel, Psel);
>> >       poly_uint64 ae = a1 + (nelems - 2) * S;
>> >
>> >       if (known_gt (ae, a1))
>> >         {
>> >           int q1, qe;
>> >           poly_uint64 r1, re;
>> >
>> >           bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1);
>> >           bool oke = can_div_trunc_p (ae, n1, &qe, &re);
>> >         }
>> >     }
>> >
>> > However, the second call to can_div_trunc_p still returns false, with
>> > strange values for qe.
>>
>> The known_gt (ae, a1) part isn't required.  It's OK for S to be negative.
>>
>> What I meant above is that Psel is a valid value of nsel.  When nsel
>> *does* equal Psel, ae will select from a different input vector from a1.
>> (Although I said ae < a1, that isn't the important bit, sorry.
>> The important bit is that a1 > 0 and ae < 0.)
>>
>> That's why not having conditions that apply ranges to the indeterminates
>> is a problem.  I guess what we want to ask is "does ae select from the
>> same input as a1 for all nsel >= Psel*2?".  But we don't have a way of
>> asking that.  We can only ask for all nsel, and the answer to "does ae
>> select from the same input as a1 for all nsel?" is "no".
>>
>> Perhaps another way of saying this is: if the selector was a natural
>> stepped vector, the first value in the pattern (a0) would be -2.  The
>> selector { -2, 1, 4, ... } *would* cross input vectors.
>>
>> This means that a selector with the above values of a1 and S could only
>> be valid if a0 is in the range [0, nsel).  It's valid for a0 to be in
>> that range, because it can be arbitrarily different from a1 and above,
>> but it means that the test for a1 and above is no longer a simple linear
>> one, of the type we're trying to use here.
>>
>> So what I meant was, we should accept that the fold will be rejected
>> for this case.  I don't think that matters in practice.
>>
>> On the last bit, the values returned by reference from can_div_trunc_p
>> are only meaningful when the function returns true.  The variables
>> aren't updated otherwise (which is by design).
> Hi Richard,
> Thanks for the explanation!
> IIUC, for can_div_trunc_p (a, b, &q, &r);
> to return true, quotients obtained by division of respective
> coeffs should be equal to q0, if q0 = a.coeffs[0] / b.coeffs[0] != 0.
>
>   C q = NCa (a.coeffs[0]) / NCb (b.coeffs[0]);
>    /* Otherwise just check for the case in which ai / bi == Q.  */
>    if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != q)
>      return false;
>
> Just to iterate, for above case,
> n1 = len(Vnx4SI) = 4 + 4x
> nsel = n1 = 4 + 4x
> S = 3
> Psel = 4
> a1 = 1
> ae = 3x - 2
>
> a1 /trunc n1 == (1 + 0x) / (4 + 4x)
> Since 1/4 == 0/4, can_div_trunc_p returns true, and sets q1 to 0.
>
> ae /trunc n1 == (-2 + 3x) / (4 + 4x)
> Since the coeff type is unsigned long,
> -2/4 != 3/4, so it returns false.
> and we reject to fold for this case.
> Is this correct ?


Yeah.  Like I say, if this were a simple linear pattern, a0 would be -2,
and so a0 and a1 would select from different inputs.  So a simple
linear test cannot handle this situation.

> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> For num_poly_int_coeffs == 2,
> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> If a1/trunc n1 succeeds,
> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> So, a1 has to be < n1.coeffs[0] ?

Remember that a1 is itself a poly_int.  It's not necessarily a constant.

E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:

  { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }

which is an interleaving of the two patterns:

  { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
  { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2

> For eg, if n1 = 4 + 4x, and a1 = 5,
> a1 /trunc n1 will return false since 5/4 != 0 /4.
> At runtime,
> if x == 0, we will select op1[2];
> for x > 0, we will select op0[5];
> But we cannot determine this at compile time ?

Right.  Punting on that case is the right thing to do.

> I have tried to come up with following pseudo code to verify R:
>
> tree get_vector_for_pattern(tree op0, tree op1, vec_perm_indices sel,
> int pattern)
> {
>   if (!multiple_p (nsel, sel_npatterns))
>     return NULL_TREE;
>
>   poly_uint64 nsel = sel.length ();
>   poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
>   poly_uint64 Esel = exact_div (nsel, Psel);

Realise it's only pseudo code, but FWIW, this can be produced
as a side-effect of the multiple_p.

>   poly_uint64 a1 = sel[pattern + sel_npatterns];
>   poly_uint64 ae = a1 + (Esel - 2) * S;
>
>   if (known_lt (ae, 0))
>     return NULL_TREE;

I don't think we need this.

>   uint64_t q1, qe;
>   poly_uint64 r1, re;
>   if (!can_div_trunc_p (a1, n1, &q1, &r1)
>       || !can_div_trunc_p (ae, n1, &qe, &re))
>     return NULL_TREE;
>
>   if (q1 != qe)
>     return NULL_TREE;
>   tree op_vec = (q1 == 0) ? op0 : op1;

This should be testing the low bit of q1.  2 selects from the first
input, 3 from the second, etc.  (Not that we should see that,
since the selector should already have been canonicalised.
But better safe than sorry.)

>  sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>  if (sel_nelts_per_pattern == 3)
>    {

I suppose all of the above is dependent on sel_nelts_per_pattern == 3 too.
S would be zero otherwise.

>      a2 = sel[pattern + 2 * sel_npatterns];
>      S = a2 - a1;
>      if (S < 0)
>        {
>          /* Check for natural stepped pattern.  */
>          if ((a1 - sel[pattern]) != S)
>            return NULL_TREE;
>          if (!known_ge (re, VECTOR_CST_NPATTERNS (op_vec)))
>            return NULL_TREE;
>        }
>    }
>  return op_vec;
> }
>
> Does this look in right direction ?

Yeah, looks like it.

> Um, could you please give an example when nsel is *not* a multiple
> of Psel ?
> I had (incorrectly) assumed, nsel = Psel * Esel,
> so nsel would always be a multiple of Psel.

I was using nsel to mean TYPE_VECTOR_SUBPARTS, i.e. the total number
of elements in the vector at runtime.  Psel and Esel are instead
compile-time constants (number of patterns and explicitly-encoded
elements per pattern respectively).

Although I think in practical use the Psel will be a power of 2, that's
not guaranteed by construction.  Also, there's no reason in principle
why you couldn't have Psel = 4 for nsel = 2 + 2x.  Two of the patterns
will be dropped for x==0, but that's OK.

So all in all, it's better to check.  Like I said in the above comment,
it's a single operation either way: a three-argument multiple_p rather
than an exact_div.

Thanks,
Richard

Re: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

Reply via email to