> On 25 Oct 2024, at 13:46, Richard Sandiford <[email protected]> wrote:
>
> Kyrylo Tkachov <[email protected]> writes:
>> Thank you for the suggestions! I’m trying them out now.
>>
>>>> + if (rotamnt % BITS_PER_UNIT != 0)
>>>> + return NULL_RTX;
>>>> + machine_mode qimode;
>>>> + if (!qimode_for_vec_perm (mode).exists (&qimode))
>>>> + return NULL_RTX;
>>>> +
>>>> + vec_perm_builder builder;
>>>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>>>
>>> simpler as GET_MODE_UNIT_SIZE
>>>
>>>> + unsigned total_units;
>>>> + /* TODO: Handle VLA vector rotates? */
>>>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>>> + return NULL_RTX;
>>>
>>> Yeah. I think we can do that by changing:
>>>
>>>> + builder.new_vector (total_units, 1, total_units);
>>>
>>> to:
>>>
>>> builder.new_vector (total_units, 3, units);
>>
>> I think units here is the size in units of the fixed-width component of the
>> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?
>
> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
> being the same as for your code. So for V4SI and VNx4SI we'd push
> 12 elements total, as 4 (nunits) "patterns" of 3 elements each.
> The first argument (total_units) is just GET_MODE_SIZE (mode)
> in all its poly_int glory.
Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how can
12 indices be enough to describe the permute?
With this scheme we do end up pushing 12 elements, in the order:
2,3,0,1,6,7,4,5,10,11,8,9 .
In the final RTX emitted in the instruction stream this seems to end up as:
(const_vector:V16QI [
(const_int 2 [0x2])
(const_int 3 [0x3])
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 6 [0x6])
(const_int 7 [0x7])
(const_int 4 [0x4])
(const_int 5 [0x5])
(const_int 10 [0xa])
(const_int 11 [0xb])
(const_int 8 [0x8])
(const_int 9 [0x9]) repeated x2
(const_int 14 [0xe])
(const_int 7 [0x7])
(const_int 0 [0])
])
So the first 12 elements are indeed correct, but the last 4 elements are not.
>
> For V4HI and VNx4HI it would be (..., 3, 2), so 6 elements total.
>
>> What is the recommended API for getting that number out of the poly_uint64
>> mode size?. Is it just accessing coeffs[0]?
>>
>>>
>>> unconditionally and making the outer loop below iterate exactly
>>> three times (i.e. to nunits * 3). It's ok if we generate more
>>> indices than needed.
>>>
>>>> + int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>>>> + for (unsigned j = 0; j < total_units; j += nunits)
>>>> + for (unsigned i = 0; i < nunits; i++)
>>>> + {
>>>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>>>> + if (BYTES_BIG_ENDIAN)
>>>> + idx = total_units - idx - 1;
>>>
>>> I think the endian adjustment should be local to the inner loop/vector
>>> element only. Since this would mean undoing the "nunits - " adjustment
>>> above, how about something like:
>>>
>>> unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
>>> unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
>>> ...
>>> builder.quick_push ((rot_to_perm + i) % nunits + j);
>>>
>>> or whatever variation you prefer.
>>>
>>> Hope I've got that right...
>>
>>
>> Hmm, I’m getting some test failures and wrong indices when I try this.
>> I think I can work out the indices and the loops for them but I’d like to
>> work
>> through an example. So say we are rotating a V4SImode vector by 16 (a REV32
>> instruction).
>> The indices pushed into the byte permute vector with my original patch are:
>> [2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
>> What sequence do we want to push for V4SImode now that we have 3 patterns in
>> vector_builder?
>> Is it repeating the above 3 times or is it interleaving each SImode entry
>> somehow?
>
> I think the calculation should be the same as in your original code,
> adding "i" to each element. It's just that we limit the outer loop
> to 3 iterations instead of total_units / nunits.
>
> So the end result should be that the pushed elements are the same
> as in your original patch, just longer or shorter, depending on
> whether we're pushing more elements (e.g. V2DI) or fewer (e.g. V4SI).
>
> In general, the VLA constant scheme works by creating the leading
> elements as normal and then describing how to extend that sequence
> using the (npatterns, nelts_per_pattern) pair. The first
> npatterns * nelts_per_pattern elements are always explicitly pushed,
> but are automatically extended or truncated as necessary when creating
> fixed-length tree and rtl constants.
>
> Thanks,
> Richard