On Wed, Oct 14, 2020 at 4:01 AM Segher Boessenkool
<[email protected]> wrote:
>
> Hi!
>
> On Tue, Oct 13, 2020 at 04:40:53PM +0800, Hongtao Liu wrote:
> > For rtx like
> > (vec_select:V2SI (subreg:V4SI (inner:V2SI) 0)
> > (parallel [(const_int 0) (const_int 1)]))
> > it could be simplified as inner.
>
> You could even simplify any vec_select of a subreg of X to just a
> vec_select of X, by changing the selection vector a bit (well, only do
> this if that is a constant vector, I suppose). Not just for paradoxical
> subregs either, just for *all* subregs.
>
Yes. When SUBREG_BYTE of trueop0 is not 0, the subreg offset (counted in
elements) has to be added to each index in the selection. The
simplification is also only valid when X has the same inner mode as the
result and at least as many elements.
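
To illustrate the index adjustment, here is a minimal stand-alone sketch in
plain C. It is not GCC code: remap_selection and all of its parameters are
hypothetical names made up for this example, and vectors are modelled as
plain arrays of indices. It assumes the selection is a constant vector and
that the subreg and X share the same element size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not GCC code.
   sel[0..n-1]  selection indices applied to the subreg of X.
   subreg_byte  byte offset of the subreg within X (like SUBREG_BYTE).
   elt_size     size in bytes of one vector element.
   x_nunits     number of elements in X.
   On success, write the indices relative to X into out[] and return true.
   Return false when the selection cannot be expressed in terms of X.  */
bool
remap_selection (const int *sel, size_t n, unsigned subreg_byte,
                 unsigned elt_size, unsigned x_nunits, int *out)
{
  assert (elt_size != 0);

  /* The subreg must start on an element boundary of X.  */
  if (subreg_byte % elt_size != 0)
    return false;
  unsigned offset = subreg_byte / elt_size;

  /* Every selected element must lie inside X after adding the offset.  */
  for (size_t i = 0; i < n; i++)
    if (sel[i] < 0 || (unsigned) sel[i] + offset >= x_nunits)
      return false;

  for (size_t i = 0; i < n; i++)
    out[i] = sel[i] + (int) offset;
  return true;
}
```

With a four-element X of 4-byte elements and a subreg at byte 8, selecting
{0, 1} from the subreg becomes selecting {2, 3} from X, while selecting
{2, 3} from the subreg is rejected because it would read past the end of X.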
> > gcc/ChangeLog
> > PR rtl-optimization/97249
> > * simplify-rtx.c (simplify_binary_operation_1): Simplify
> > vec_select of paradoxical subreg.
> >
> > gcc/testsuite/ChangeLog
> >
> > * gcc.target/i386/pr97249-1.c: New test.
>
> > + /* For cases like
> > + (vec_select:V2SI (subreg:V4SI (inner:V2SI) 0)
> > + (parallel [(const_int 0) (const_int 1)])).
> > + return inner directly. */
> > + if (GET_CODE (trueop0) == SUBREG
> > + && paradoxical_subreg_p (trueop0)
> > + && mode == GET_MODE (XEXP (trueop0, 0))
> > + && (GET_MODE_NUNITS (GET_MODE (trueop0))).is_constant (&l0)
> > + && (GET_MODE_NUNITS (mode)).is_constant (&l1)
> > + && l0 % l1 == 0)
>
> Why this? Why does the number of elements of the input have to divide
> that of the output?
>
Removed. I also added the conditions mentioned in my comments above.
> > + {
> > + gcc_assert (known_eq (XVECLEN (trueop1, 0), l1));
> > + unsigned HOST_WIDE_INT expect = (HOST_WIDE_INT_1U << l1) - 1;
> > + unsigned HOST_WIDE_INT sel = 0;
> > + int i = 0;
> > + for (;i != l1; i++)
>
> for (int i = 0; i != l1; i++)
>
> > + {
> > + rtx j = XVECEXP (trueop1, 0, i);
> > + if (!CONST_INT_P (j))
> > + break;
> > + sel |= HOST_WIDE_INT_1U << UINTVAL (j);
> > + }
> > + /* ??? Need to simplify XEXP (trueop0, 0) here. */
> > + if (sel == expect)
> > + return XEXP (trueop0, 0);
> > + }
> > }
>
> If you just handle the much more generic case, all the other vec_select
> simplifications can be done as well, not just this one.
>
Yes, changed. Also, every selected index must lie within the elements of X.
> > +/* PR target/97249 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx2 -O3 -masm=att" } */
> > +/* { dg-final { scan-assembler-times "vpmovzxbw\[ \t\]+\\\(\[^\n\]*%xmm\[0-9\](?:\n|\[ \t\]+#)" 2 } } */
> > +/* { dg-final { scan-assembler-times "vpmovzxwd\[ \t\]+\\\(\[^\n\]*%xmm\[0-9\](?:\n|\[ \t\]+#)" 2 } } */
> > +/* { dg-final { scan-assembler-times "vpmovzxdq\[ \t\]+\\\(\[^\n\]*%xmm\[0-9\](?:\n|\[ \t\]+#)" 2 } } */
>
> I don't know enough about the x86 backend to know if this is exactly
> what you need in the testsuite. I do know a case of backslashitis when
> I see one though -- you might want to use {} instead of "", and perhaps
> \m and \M and \s etc. And to make sure things are on one line, don't do
> all that nastiness with [^\n], just start the RE with (?n) :-)
>
Yes, changed. The patterns are much cleaner with (?n) and {}.
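
As a rough illustration of why newline-sensitive matching helps, here is a
small C sketch using POSIX regcomp. It relies on the assumption that
REG_NEWLINE behaves much like the Tcl (?n) prefix used by the testsuite,
in that `.` then stops at the end of a line. The helper name
line_regex_matches is made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if PATTERN (a POSIX extended regex)
   matches somewhere in TEXT.  When NEWLINE_SENSITIVE is set, compile with
   REG_NEWLINE so that `.` does not match a newline.  */
bool
line_regex_matches (const char *pattern, const char *text,
                    bool newline_sensitive)
{
  regex_t re;
  int flags = REG_EXTENDED | REG_NOSUB;
  if (newline_sensitive)
    flags |= REG_NEWLINE;

  /* Treat a pattern that fails to compile as "no match".  */
  if (regcomp (&re, pattern, flags) != 0)
    return false;

  bool matched = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

With the pattern "vpmovzxbw[ \t]+\\(.*%xmm[0-9]" (written as a C string), a
vpmovzxbw line that loads into %ymm0 followed by a different instruction
using %xmm1 matches only when newline sensitivity is off, because `.*` is
then free to run across the line break.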
>
> Segher
Updated patch below.
--
BR,
Hongtao
From df71eb46e394e5b778c69e9e8f25b301997e365d Mon Sep 17 00:00:00 2001
From: liuhongt <[email protected]>
Date: Tue, 13 Oct 2020 15:35:29 +0800
Subject: [PATCH] Simplify vec_select of a subreg of X to just a vec_select of
X.
gcc/ChangeLog
PR rtl-optimization/97249
* simplify-rtx.c (simplify_binary_operation_1): Simplify
vec_select of a subreg of X to a vec_select of X when
available.
gcc/testsuite/ChangeLog
* gcc.target/i386/pr97249-1.c: New test.
---
gcc/simplify-rtx.c | 44 +++++++++++++++++++++++
gcc/testsuite/gcc.target/i386/pr97249-1.c | 30 ++++++++++++++++
2 files changed, 74 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/i386/pr97249-1.c
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 869f0d11b2e..8a10b6cf4d5 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -4170,6 +4170,50 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
return subop1;
}
}
+
+ /* Simplify vec_select of a subreg of X to just a vec_select of X
+ when available. */
+ int l2;
+ if (GET_CODE (trueop0) == SUBREG
+ && (GET_MODE_INNER (mode)
+ == GET_MODE_INNER (GET_MODE (XEXP (trueop0, 0))))
+ && (GET_MODE_NUNITS (GET_MODE (trueop0))).is_constant (&l0)
+ && (GET_MODE_NUNITS (mode)).is_constant (&l1)
+ && (GET_MODE_NUNITS (GET_MODE (XEXP (trueop0, 0))))
+ .is_constant (&l2)
+ && known_le (l1, l2))
+ {
+ unsigned HOST_WIDE_INT subreg_offset = 0;
+ gcc_assert (known_eq (XVECLEN (trueop1, 0), l1));
+ gcc_assert (can_div_trunc_p (SUBREG_BYTE (trueop0),
+ GET_MODE_SIZE (GET_MODE_INNER (mode)),
+ &subreg_offset));
+ bool success = true;
+ for (int i = 0; i != l1; i++)
+ {
+ rtx j = XVECEXP (trueop1, 0, i);
+ if (!CONST_INT_P (j)
+ || known_ge (UINTVAL (j), l2 - subreg_offset))
+ {
+ success = false;
+ break;
+ }
+ }
+ if (success)
+ {
+ rtx par = trueop1;
+ if (subreg_offset)
+ {
+ rtvec vec = rtvec_alloc (l1);
+ for (int i = 0; i < l1; i++)
+ RTVEC_ELT (vec, i)
+ = GEN_INT (INTVAL (XVECEXP (trueop1, 0, i))
+ + subreg_offset);
+ par = gen_rtx_PARALLEL (VOIDmode, vec);
+ }
+ return gen_rtx_VEC_SELECT (mode, XEXP (trueop0, 0), par);
+ }
+ }
}
if (XVECLEN (trueop1, 0) == 1
diff --git a/gcc/testsuite/gcc.target/i386/pr97249-1.c b/gcc/testsuite/gcc.target/i386/pr97249-1.c
new file mode 100644
index 00000000000..4478a34a9f8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr97249-1.c
@@ -0,0 +1,30 @@
+/* PR target/97249 */
+/* { dg-do compile } */
+/* { dg-options "-mavx2 -O3 -masm=att" } */
+/* { dg-final { scan-assembler-times {(?n)vpmovzxbw[ \t]+\(.*%xmm[0-9]} 2 } } */
+/* { dg-final { scan-assembler-times {(?n)vpmovzxwd[ \t]+\(.*%xmm[0-9]} 2 } } */
+/* { dg-final { scan-assembler-times {(?n)vpmovzxdq[ \t]+\(.*%xmm[0-9]} 2 } } */
+
+void
+foo (unsigned char* p1, unsigned char* p2, short* __restrict p3)
+{
+ for (int i = 0 ; i != 8; i++)
+ p3[i] = p1[i] + p2[i];
+ return;
+}
+
+void
+foo1 (unsigned short* p1, unsigned short* p2, int* __restrict p3)
+{
+ for (int i = 0 ; i != 4; i++)
+ p3[i] = p1[i] + p2[i];
+ return;
+}
+
+void
+foo2 (unsigned int* p1, unsigned int* p2, long long* __restrict p3)
+{
+ for (int i = 0 ; i != 2; i++)
+ p3[i] = (long long)p1[i] + (long long)p2[i];
+ return;
+}
--
2.18.1