Re: [Qemu-devel] [PATCH v2 28/67] target/arm: Implement SVE Permute - Predicates Group

2018-02-23 Thread Richard Henderson
On 02/23/2018 07:15 AM, Peter Maydell wrote:
>> +static const uint64_t expand_bit_data[5][2] = {
>> +{ 0x1111111111111111ull, 0x2222222222222222ull },
>> +{ 0x0303030303030303ull, 0x0c0c0c0c0c0c0c0cull },
>> +{ 0x000f000f000f000full, 0x00f000f000f000f0ull },
>> +{ 0x000000ff000000ffull, 0x0000ff000000ff00ull },
>> +{ 0x000000000000ffffull, 0x00000000ffff0000ull }
>> +};
>> +
>> +/* Expand units of 2**N bits to units of 2**(N+1) bits,
>> +   with the higher bits zero.  */
> 
> In bitops.h we call this operation "half shuffle" (where
> it is specifically working on units of 1 bit size), and
> the inverse "half unshuffle". Worth mentioning that (or
> using similar terminology) ?

I hadn't noticed this helper.  I'll at least mention.

FWIW, the half_un/shuffle operation is what you get with N=0, which corresponds
to a byte predicate interleave.  We need the intermediate steps for half,
single, and double predicate interleaves.

>> +static uint64_t expand_bits(uint64_t x, int n)
>> +{
>> +int i, sh;
> 
> Worth asserting that n is within the range we expect it to be ?
> (what range is that? 0 to 4?)

N goes from 0-3; I goes from 0-4.  N will have been controlled by decode, so
I'm not sure it's worth an assert.  Even if I did add one, I wouldn't want it
here, at the center of a loop kernel.

>> +d[0] = nn + (mm << (1 << esz));
> 
> Is this actually doing an addition, or is it just an odd
> way of writing a bitwise OR when neither of the two
> inputs have 1 in the same bit position?

It could be an OR.  Here I'm hoping that the compiler will use a shift-add
instruction.  Which it wouldn't necessarily be able to prove by itself if I did
write it with an OR.

>> +d[0] = extract64(l + (h << (4 * oprsz)), 0, 8 * oprsz);
> 
> This looks like it's using addition for logical OR again ?

Yes.  Although this time I admit it'll never produce an LEA.

>> +/* For VL which is not a power of 2, the results from M do not
>> +   align nicely with the uint64_t for D.  Put the aligned results
>> +   from M into TMP_M and then copy it into place afterward.  */
> 
> How much risu testing did you do of funny vector lengths ?

As much as I can with the unlicensed Foundation Platform: all lengths from 1-4.

Which, unfortunately, does leave a few multi-word predicate paths untested, but
many of the routines loop identically at these lengths and beyond.


>> +static const uint64_t even_bit_esz_masks[4] = {
>> +0x5555555555555555ull,
>> +0x3333333333333333ull,
>> +0x0f0f0f0f0f0f0f0full,
>> +0x00ff00ff00ff00ffull
>> +};
> 
> Comment describing the purpose of these numbers would be useful.

Ack.


r~



Re: [Qemu-devel] [PATCH v2 28/67] target/arm: Implement SVE Permute - Predicates Group

2018-02-23 Thread Peter Maydell
On 17 February 2018 at 18:22, Richard Henderson wrote:
> Signed-off-by: Richard Henderson 
> ---
>  target/arm/helper-sve.h    |   6 +
>  target/arm/sve_helper.c    | 280 ++++++++++++++++++++++++++++++++++++++++
>  target/arm/translate-sve.c | 110 ++++++++++++++
>  target/arm/sve.decode      |  18 +++
>  4 files changed, 414 insertions(+)
>
> diff --git a/target/arm/helper-sve.h b/target/arm/helper-sve.h
> index 0c9aad575e..ff958fcebd 100644
> --- a/target/arm/helper-sve.h
> +++ b/target/arm/helper-sve.h
> @@ -439,6 +439,12 @@ DEF_HELPER_FLAGS_3(sve_uunpk_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_3(sve_uunpk_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_3(sve_uunpk_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
>
> +DEF_HELPER_FLAGS_4(sve_zip_p, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(sve_uzp_p, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(sve_trn_p, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_3(sve_rev_p, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_3(sve_punpk_p, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
> +
>  DEF_HELPER_FLAGS_5(sve_and_pppp, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_5(sve_bic_pppp, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_5(sve_eor_pppp, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
> diff --git a/target/arm/sve_helper.c b/target/arm/sve_helper.c
> index 466a209c1e..c3a2706a16 100644
> --- a/target/arm/sve_helper.c
> +++ b/target/arm/sve_helper.c
> @@ -1664,3 +1664,283 @@ DO_UNPK(sve_uunpk_s, uint32_t, uint16_t, H4, H2)
>  DO_UNPK(sve_uunpk_d, uint64_t, uint32_t, , H4)
>
>  #undef DO_UNPK
> +
> +static const uint64_t expand_bit_data[5][2] = {
> +{ 0x1111111111111111ull, 0x2222222222222222ull },
> +{ 0x0303030303030303ull, 0x0c0c0c0c0c0c0c0cull },
> +{ 0x000f000f000f000full, 0x00f000f000f000f0ull },
> +{ 0x000000ff000000ffull, 0x0000ff000000ff00ull },
> +{ 0x000000000000ffffull, 0x00000000ffff0000ull }
> +};
> +
> +/* Expand units of 2**N bits to units of 2**(N+1) bits,
> +   with the higher bits zero.  */

In bitops.h we call this operation "half shuffle" (where
it is specifically working on units of 1 bit size), and
the inverse "half unshuffle". Worth mentioning that (or
using similar terminology) ?

> +static uint64_t expand_bits(uint64_t x, int n)
> +{
> +int i, sh;

Worth asserting that n is within the range we expect it to be ?
(what range is that? 0 to 4?)

> +for (i = 4, sh = 16; i >= n; i--, sh >>= 1) {
> +x = ((x & expand_bit_data[i][1]) << sh) | (x & expand_bit_data[i][0]);
> +}
> +return x;
> +}
> +
> +/* Compress units of 2**(N+1) bits to units of 2**N bits.  */
> +static uint64_t compress_bits(uint64_t x, int n)
> +{
> +int i, sh;

Ditto assert.

> +for (i = n, sh = 1 << n; i <= 4; i++, sh <<= 1) {
> +x = ((x >> sh) & expand_bit_data[i][1]) | (x & expand_bit_data[i][0]);
> +}
> +return x;
> +}
> +
> +void HELPER(sve_zip_p)(void *vd, void *vn, void *vm, uint32_t pred_desc)
> +{
> +intptr_t oprsz = extract32(pred_desc, 0, SIMD_OPRSZ_BITS) + 2;
> +int esz = extract32(pred_desc, SIMD_DATA_SHIFT, 2);
> +intptr_t high = extract32(pred_desc, SIMD_DATA_SHIFT + 2, 1);
> +uint64_t *d = vd;
> +intptr_t i;
> +
> +if (oprsz <= 8) {
> +uint64_t nn = *(uint64_t *)vn;
> +uint64_t mm = *(uint64_t *)vm;
> +int half = 4 * oprsz;
> +
> +nn = extract64(nn, high * half, half);
> +mm = extract64(mm, high * half, half);
> +nn = expand_bits(nn, esz);
> +mm = expand_bits(mm, esz);
> +d[0] = nn + (mm << (1 << esz));

Is this actually doing an addition, or is it just an odd
way of writing a bitwise OR when neither of the two
inputs have 1 in the same bit position?

> +} else {
> +ARMPredicateReg tmp_n, tmp_m;
> +
> +/* We produce output faster than we consume input.
> +   Therefore we must be mindful of possible overlap.  */
> +if ((vn - vd) < (uintptr_t)oprsz) {
> +vn = memcpy(&tmp_n, vn, oprsz);
> +}
> +if ((vm - vd) < (uintptr_t)oprsz) {
> +vm = memcpy(&tmp_m, vm, oprsz);
> +}
> +if (high) {
> +high = oprsz >> 1;
> +}
> +
> +if ((high & 3) == 0) {
> +uint32_t *n = vn, *m = vm;
> +high >>= 2;
> +
> +for (i = 0; i < DIV_ROUND_UP(oprsz, 8); i++) {
> +uint64_t nn = n[H4(high + i)];
> +uint64_t mm = m[H4(high + i)];
> +
> +nn = expand_bits(nn, esz);
> +mm = expand_bits(mm, esz);
> +d[i] = nn + (mm << (1 << esz));
> +}
> +} else {
> +uint8_t *n = vn, *m = vm;
> +uint16_t *d16 = vd;
> +
> +for (i = 0; i < oprsz / 2; i++) {
> +uint16_t nn = n[H1(high + i)];
> +