Thanks for the quick review.
Updated patch attached. I've addressed your comments below.
Tests are still running, OK for trunk assuming tests come out clean?
[aarch64][vect] Support V8QI->V8HI WIDEN_ patterns
In the case where 8 out of every 16 elements are widened using a
widening pattern and the next 8 are skipped the patterns are not
recognized. This is because they are normally used in a pair, such as
VEC_WIDEN_MINUS_HI/LO, to achieve a v16qi->v16hi conversion for example.
This patch adds support for V8QI->V8HI patterns.
gcc/ChangeLog:
PR tree-optimisation/98772
* optabs-tree.c (supportable_half_widening_operation): New function
to check for supportable V8QI->V8HI widening patterns.
* optabs-tree.h (supportable_half_widening_operation): New function.
* tree-vect-stmts.c (vect_create_vectorized_promotion_stmts): New
function to create promotion stmts for V8QI->V8HI widening patterns.
(vectorizable_conversion): Add case for V8QI->V8HI.
gcc/testsuite/ChangeLog:
PR tree-optimisation/98772
* gcc.target/aarch64/pr98772.c: New test.
>> + /* The case where a widening operation is not making use of the full
>> width of
>> + of the input vector, but using the full width of the output vector.
>> + Return the non-wided code, which will be used after the inputs are
>
>non-widened
Done.
>> + converted to the wide type. */
>> + if ((code == WIDEN_MINUS_EXPR
>> + || code == WIDEN_PLUS_EXPR
>> + || code == WIDEN_MULT_EXPR
>> + || code == WIDEN_LSHIFT_EXPR)
>
>Minor formatting nit, but the ||s should be indented one space more.
Done.
>> + && known_eq (TYPE_VECTOR_SUBPARTS (vectype_in),
>> + TYPE_VECTOR_SUBPARTS (vectype_out)))
>> + {
>> + switch (code)
>> + {
>> + case WIDEN_LSHIFT_EXPR:
>> + *code1 = LSHIFT_EXPR;
>> + return true;
>> + break;
>> + case WIDEN_MINUS_EXPR:
>> + *code1 = MINUS_EXPR;
>> + return true;
>> + break;
>> + case WIDEN_PLUS_EXPR:
>> + *code1 = PLUS_EXPR;
>> + return true;
>> + break;
>> + case WIDEN_MULT_EXPR:
>> + *code1 = MULT_EXPR;
>> + return true;
>> + break;
>> + default:
>> + gcc_unreachable ();
>> + }
>> + }
>
>Rather than return true, I think we should do something like:
>
> if (!supportable_convert_operation (NOP_EXPR, vectype_out,
> vectype_in, &dummy_code))
> return false;
>
> optab = optab_for_tree_code (*code1, vectype_out, optab_default);
> return (optab_handler (optab, TYPE_MODE (vectype_out))
> != CODE_FOR_nothing);
>
>to make sure that the target really does support this.
Done. I used 'optab_vector' not 'optab_default', as I thought that was correct
for this case and otherwise 'optab_for_tree_code' fails an assertion when
'LSHIFT_EXPR' is used.
> AFAICT the caller always knows when it wants the “if” statement above
> to be used. What it's doing is a bit different from what
> supportable_convert_operation normally does, so it might be better
> to put it into a separate function that tests whether the target
> supports the non-widening form of a widening operation.
Done.
>> +
>> + vec_tmp.create (vec_oprnds0->length () * 2);
>
>It looks like the * 2 isn't needed.
Done.
>> + if (is_gimple_call (new_stmt3))
>> + {
>> + new_tmp = gimple_call_lhs (new_stmt3);
>> + }
>> + else
>> + {
>> + new_tmp = gimple_assign_lhs (new_stmt3);
>> + }
>
>The lhs is always new_tmp3, so it's not necessary to read it back.
Done.
>> +
>> + /* Store the results for the next step. */
>> + vec_tmp.quick_push (new_tmp);
>
>FWIW, you could just assign to vec_oprnds[i] and not have vec_tmp,
>but I don't know whether that's more or less confusing. Either way's
>fine with me.
I chose to keep vec_tmp, but I don't feel strongly about it.
>> + }
>> +
>> + vec_oprnds0->release ();
>> + *vec_oprnds0 = vec_tmp;
>> +}
>> +
>>
>> /* Check if STMT_INFO performs a conversion operation that can be
>> vectorized.
>> If VEC_STMT is also passed, vectorize STMT_INFO: create a vectorized
>> @@ -4697,7 +4763,13 @@ vectorizable_conversion (vec_info *vinfo,
>> nunits_in = TYPE_VECTOR_SUBPARTS (vectype_in);
>> nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
>> if (known_eq (nunits_out, nunits_in))
>> - modifier = NONE;
>> + if (code == WIDEN_MINUS_EXPR
>> + || code == WIDEN_PLUS_EXPR
>> + || code == WIDEN_LSHIFT_EXPR
>> + || code == WIDEN_MULT_EXPR)
>> + modifier = WIDEN;
>
>Formatting nit: the last line should be indented by 6 spaces rather than 8.
Done.
>> @@ -4743,9 +4815,21 @@ vectorizable_conversion (vec_info *vinfo,
>> return false;
>>
>> case WIDEN:
>> - if (supportable_widening_operation (vinfo, code, stmt_info,
>> vectype_out,
>> - vectype_in, &code1, &code2,
>> - &multi_step_cvt, &interm_types))
>> + if (known_eq (nunits_out, nunits_in)
>> + && (code == WIDEN_MINUS_EXPR
>> + || code == WIDEN_LSHIFT_EXPR
>> + || code == WIDEN_PLUS_EXPR
>> + || code == WIDEN_MULT_EXPR)
>> + && supportable_convert_operation (code, vectype_out, vectype_in,
>> + &code1))
>
>Guess this is personal taste, sorry, since it's clearly right both ways,
>but IMO it'd be better to drop the code test. We can only get here
>with nunits_out==nunits_in if we're converting a widening operation into
>a non-widening operation. If we do end up calling a separate function
>(as per the comment above), then it would abort in a meaningful place
>if something unexpected slips through.
Done.
>> + {
>> + gcc_assert (!(multi_step_cvt && op_type == binary_op));
>> + break;
>> + }
>> + else if (supportable_widening_operation (vinfo, code, stmt_info,
>> + vectype_out, vectype_in, &code1,
>> + &code2, &multi_step_cvt,
>> + &interm_types))
>> {
>> /* Binary widening operation can only be supported directly by the
>> architecture. */
>> @@ -4981,10 +5065,20 @@ vectorizable_conversion (vec_info *vinfo,
>> c1 = codecvt1;
>> c2 = codecvt2;
>> }
>> - vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
>> - &vec_oprnds1, stmt_info,
>> - this_dest, gsi,
>> - c1, c2, op_type);
>> + if ((code == WIDEN_MINUS_EXPR
>> + || code == WIDEN_PLUS_EXPR
>> + || code == WIDEN_LSHIFT_EXPR
>> + || code == WIDEN_MULT_EXPR)
>> + && known_eq (nunits_in, nunits_out))
>
>Same comment here about dropping the code tests.
Done.
From 2a530703c3b7b893c83a13cc8e9e211bddb428b0 Mon Sep 17 00:00:00 2001
From: Joel Hutton <[email protected]>
Date: Fri, 5 Feb 2021 12:52:12 +0000
Subject: [PATCH] [aarch64][vect] Support V8QI->V8HI WIDEN_ patterns
In the case where 8 out of every 16 elements are widened using a
widening pattern and the next 8 are skipped the patterns are not
recognized. This is because they are normally used in a pair, such as
VEC_WIDEN_MINUS_HI/LO, to achieve a v16qi->v16hi conversion for example.
This patch adds support for V8QI->V8HI patterns.
gcc/ChangeLog:
PR tree-optimisation/98772
* optabs-tree.c (supportable_half_widening_operation): New function
to check for supportable V8QI->V8HI widening patterns.
* optabs-tree.h (supportable_half_widening_operation): New function.
* tree-vect-stmts.c (vect_create_vectorized_promotion_stmts): New
function to create promotion stmts for V8QI->V8HI widening patterns.
(vectorizable_conversion): Add case for V8QI->V8HI.
gcc/testsuite/ChangeLog:
PR tree-optimisation/98772
* gcc.target/aarch64/pr98772.c: New test.
---
gcc/optabs-tree.c | 75 ++++++++++
gcc/optabs-tree.h | 3 +
gcc/testsuite/gcc.target/aarch64/pr98772.c | 155 +++++++++++++++++++++
gcc/tree-vect-stmts.c | 93 +++++++++++--
4 files changed, 318 insertions(+), 8 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/pr98772.c
diff --git a/gcc/optabs-tree.c b/gcc/optabs-tree.c
index c94073e3ed9..6cd12a807f9 100644
--- a/gcc/optabs-tree.c
+++ b/gcc/optabs-tree.c
@@ -277,6 +277,81 @@ optab_for_tree_code (enum tree_code code, const_tree type,
}
}
+/* Function supportable_half_widening_operation
+
+ Check whether an operation represented by code is a 'half' widening operation
+ using only half of the full vector width for the inputs. This allows the
+ output to be stored in a single vector, as opposed to using the full width
+ for input, where half the elements must be processed at a time into
+ respective full-width vector outputs i.e. a 'hi'/'lo' pair using codes such
+ as VEC_WIDEN_MINUS_HI/LO. This is handled by widening the inputs and using
+ non-widening instructions. RTL fusing converts these to the widening hardware
+ instructions if supported.
+
+ Supported widening operations:
+ WIDEN_MINUS_EXPR
+ WIDEN_PLUS_EXPR
+ WIDEN_MULT_EXPR
+ WIDEN_LSHIFT_EXPR
+
+ Output:
+ - CODE1 - The non-widened code, which will be used after the inputs are
+ converted to the wide type.
+ */
+bool
+supportable_half_widening_operation (enum tree_code code,
+ tree vectype_out, tree vectype_in,
+ enum tree_code *code1)
+{
+ machine_mode m1,m2;
+ bool truncp;
+ enum tree_code dummy_code;
+ optab op;
+
+ gcc_assert (VECTOR_TYPE_P (vectype_out) && VECTOR_TYPE_P (vectype_in));
+
+ m1 = TYPE_MODE (vectype_out);
+ m2 = TYPE_MODE (vectype_in);
+
+ if (!VECTOR_MODE_P (m1) || !VECTOR_MODE_P (m2))
+ return false;
+
+ if(!known_eq (TYPE_VECTOR_SUBPARTS (vectype_in),
+ TYPE_VECTOR_SUBPARTS (vectype_out)))
+ return false;
+
+ if(!(code == WIDEN_MINUS_EXPR
+ || code == WIDEN_PLUS_EXPR
+ || code == WIDEN_MULT_EXPR
+ || code == WIDEN_LSHIFT_EXPR))
+ return false;
+
+ switch (code)
+ {
+ case WIDEN_LSHIFT_EXPR:
+ *code1 = LSHIFT_EXPR;
+ break;
+ case WIDEN_MINUS_EXPR:
+ *code1 = MINUS_EXPR;
+ break;
+ case WIDEN_PLUS_EXPR:
+ *code1 = PLUS_EXPR;
+ break;
+ case WIDEN_MULT_EXPR:
+ *code1 = MULT_EXPR;
+ break;
+ default:
+ gcc_unreachable ();
+ }
+
+ if (!supportable_convert_operation(NOP_EXPR, vectype_out, vectype_in,
+ &dummy_code))
+ return false;
+
+ op = optab_for_tree_code (*code1, vectype_out, optab_vector);
+ return (optab_handler (op, TYPE_MODE (vectype_out)) != CODE_FOR_nothing);
+}
+
/* Function supportable_convert_operation
Check whether an operation represented by the code CODE is a
diff --git a/gcc/optabs-tree.h b/gcc/optabs-tree.h
index 173da0d3bd2..c3aaa1a4169 100644
--- a/gcc/optabs-tree.h
+++ b/gcc/optabs-tree.h
@@ -36,6 +36,9 @@ enum optab_subtype
the second argument. The third argument distinguishes between the types of
vector shifts and rotates. */
optab optab_for_tree_code (enum tree_code, const_tree, enum optab_subtype);
+bool
+supportable_half_widening_operation (enum tree_code, tree, tree,
+ enum tree_code *);
bool supportable_convert_operation (enum tree_code, tree, tree,
enum tree_code *);
bool expand_vec_cmp_expr_p (tree, tree, enum tree_code);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr98772.c b/gcc/testsuite/gcc.target/aarch64/pr98772.c
new file mode 100644
index 00000000000..35568a9f9d6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr98772.c
@@ -0,0 +1,155 @@
+/* { dg-do run } */
+/* { dg-options "-O3 -save-temps" } */
+#include <stdint.h>
+#include <string.h>
+
+#define DSIZE 16
+#define PIXSIZE 64
+
+extern void
+wplus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] + pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+extern void __attribute__((optimize (0)))
+wplus_no_opt (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] + pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+
+extern void
+wminus (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] - pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+extern void __attribute__((optimize (0)))
+wminus_no_opt (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] - pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+
+extern void
+wmult (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] * pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+extern void __attribute__((optimize (0)))
+wmult_no_opt (uint16_t *d, uint8_t *restrict pix1, uint8_t *restrict pix2 )
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] * pix2[x];
+ pix1 += 16;
+ pix2 += 16;
+ }
+}
+
+extern void
+wlshift (uint16_t *d, uint8_t *restrict pix1)
+
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] << 8;
+ pix1 += 16;
+ }
+}
+extern void __attribute__((optimize (0)))
+wlshift_no_opt (uint16_t *d, uint8_t *restrict pix1)
+
+{
+ for( int y = 0; y < 4; y++ )
+ {
+ for( int x = 0; x < 4; x++ )
+ d[x + y*4] = pix1[x] << 8;
+ pix1 += 16;
+ }
+}
+
+void __attribute__((optimize (0)))
+init_arrays(uint16_t *d_a, uint16_t *d_b, uint8_t *pix1, uint8_t *pix2)
+{
+ for(int i = 0; i < DSIZE; i++)
+ {
+ d_a[i] = (1074 * i)%17;
+ d_b[i] = (1074 * i)%17;
+ }
+ for(int i = 0; i < PIXSIZE; i++)
+ {
+ pix1[i] = (1024 * i)%17;
+ pix2[i] = (1024 * i)%17;
+ }
+}
+
+/* Don't optimize main so we don't get confused over where the vector
+ instructions are generated. */
+__attribute__((optimize (0)))
+int main()
+{
+ uint16_t d_a[DSIZE];
+ uint16_t d_b[DSIZE];
+ uint8_t pix1[PIXSIZE];
+ uint8_t pix2[PIXSIZE];
+
+ init_arrays (d_a, d_b, pix1, pix2);
+ wplus(d_a, pix1, pix2);
+ wplus_no_opt(d_b, pix1, pix2);
+ if (memcmp(d_a,d_b, DSIZE) != 0)
+ return 1;
+
+ init_arrays (d_a, d_b, pix1, pix2);
+ wminus(d_a, pix1, pix2);
+ wminus_no_opt(d_b, pix1, pix2);
+ if (memcmp(d_a,d_b, DSIZE) != 0)
+ return 2;
+
+ init_arrays (d_a, d_b, pix1, pix2);
+ wmult(d_a, pix1, pix2);
+ wmult_no_opt(d_b, pix1, pix2);
+ if (memcmp(d_a,d_b, DSIZE) != 0)
+ return 3;
+
+ init_arrays (d_a, d_b, pix1, pix2);
+ wlshift(d_a, pix1);
+ wlshift_no_opt(d_b, pix1);
+ if (memcmp(d_a,d_b, DSIZE) != 0)
+ return 4;
+
+}
+
+/* { dg-final { scan-assembler-times "uaddl\\tv" 2 } } */
+/* { dg-final { scan-assembler-times "usubl\\tv" 2 } } */
+/* { dg-final { scan-assembler-times "umull\\tv" 2 } } */
+/* { dg-final { scan-assembler-times "shl\\tv" 2 } } */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index f180ced3124..3cfb27f4adc 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -4545,6 +4545,64 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
*vec_oprnds0 = vec_tmp;
}
+/* Create vectorized promotion stmts for widening stmts using only half the
+ potential vector size for input. */
+static void
+vect_create_vectorized_promotion_stmts (vec_info *vinfo,
+ vec<tree> *vec_oprnds0,
+ vec<tree> *vec_oprnds1,
+ stmt_vec_info stmt_info, tree vec_dest,
+ gimple_stmt_iterator *gsi,
+ enum tree_code code1,
+ int op_type)
+{
+ int i;
+ tree vop0, vop1;
+ gimple *new_stmt1;
+ gimple *new_stmt2;
+ gimple *new_stmt3;
+ vec<tree> vec_tmp = vNULL;
+
+ vec_tmp.create (vec_oprnds0->length ());
+ FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
+ {
+ tree new_tmp1, new_tmp2, new_tmp3, out_type;
+
+ gcc_assert (op_type == binary_op);
+ vop1 = (*vec_oprnds1)[i];
+
+ /* Widen the first vector input. */
+ out_type = TREE_TYPE (vec_dest);
+ new_tmp1 = make_ssa_name (out_type);
+ new_stmt1 = gimple_build_assign (new_tmp1, NOP_EXPR, vop0);
+ vect_finish_stmt_generation (vinfo, stmt_info, new_stmt1, gsi);
+ if (VECTOR_TYPE_P (TREE_TYPE (vop1)))
+ {
+ /* Widen the second vector input. */
+ new_tmp2 = make_ssa_name (out_type);
+ new_stmt2 = gimple_build_assign (new_tmp2, NOP_EXPR, vop1);
+ vect_finish_stmt_generation (vinfo, stmt_info, new_stmt2, gsi);
+ /* Perform the operation. With both vector inputs widened. */
+ new_stmt3 = gimple_build_assign (vec_dest, code1, new_tmp1, new_tmp2);
+ }
+ else
+ {
+ /* Perform the operation. With the single vector input widened. */
+ new_stmt3 = gimple_build_assign (vec_dest, code1, new_tmp1, vop1);
+ }
+
+ new_tmp3 = make_ssa_name (vec_dest, new_stmt3);
+ gimple_assign_set_lhs (new_stmt3, new_tmp3);
+ vect_finish_stmt_generation (vinfo, stmt_info, new_stmt3, gsi);
+
+ /* Store the results for the next step. */
+ vec_tmp.quick_push (new_tmp3);
+ }
+
+ vec_oprnds0->release ();
+ *vec_oprnds0 = vec_tmp;
+}
+
/* Check if STMT_INFO performs a conversion operation that can be vectorized.
If VEC_STMT is also passed, vectorize STMT_INFO: create a vectorized
@@ -4697,7 +4755,13 @@ vectorizable_conversion (vec_info *vinfo,
nunits_in = TYPE_VECTOR_SUBPARTS (vectype_in);
nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
if (known_eq (nunits_out, nunits_in))
- modifier = NONE;
+ if (code == WIDEN_MINUS_EXPR
+ || code == WIDEN_PLUS_EXPR
+ || code == WIDEN_LSHIFT_EXPR
+ || code == WIDEN_MULT_EXPR)
+ modifier = WIDEN;
+ else
+ modifier = NONE;
else if (multiple_p (nunits_out, nunits_in))
modifier = NARROW;
else
@@ -4743,9 +4807,16 @@ vectorizable_conversion (vec_info *vinfo,
return false;
case WIDEN:
- if (supportable_widening_operation (vinfo, code, stmt_info, vectype_out,
- vectype_in, &code1, &code2,
- &multi_step_cvt, &interm_types))
+ if (supportable_half_widening_operation (code, vectype_out, vectype_in,
+ &code1))
+ {
+ gcc_assert (!(multi_step_cvt && op_type == binary_op));
+ break;
+ }
+ else if (supportable_widening_operation (vinfo, code, stmt_info,
+ vectype_out, vectype_in, &code1,
+ &code2, &multi_step_cvt,
+ &interm_types))
{
/* Binary widening operation can only be supported directly by the
architecture. */
@@ -4981,10 +5052,16 @@ vectorizable_conversion (vec_info *vinfo,
c1 = codecvt1;
c2 = codecvt2;
}
- vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
- &vec_oprnds1, stmt_info,
- this_dest, gsi,
- c1, c2, op_type);
+ if (known_eq(nunits_out, nunits_in))
+ vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
+ &vec_oprnds1, stmt_info,
+ this_dest, gsi,
+ c1, op_type);
+ else
+ vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
+ &vec_oprnds1, stmt_info,
+ this_dest, gsi,
+ c1, c2, op_type);
}
FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
--
2.17.1