Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-17 Thread Dmitrij Pochepko

Thank you!

On 17.07.2020 12:21, Richard Sandiford wrote:

Dmitrij Pochepko  writes:

Hi,

Please take a look at the new version (attached).

Thanks, pushed to master with a slightly tweaked changelog.

Richard


Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-17 Thread Richard Sandiford
Dmitrij Pochepko  writes:
> Hi,
>
> Please take a look at the new version (attached).

Thanks, pushed to master with a slightly tweaked changelog.

Richard


Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-13 Thread Dmitrij Pochepko
Hi,

Please take a look at the new version (attached).

Thanks,
Dmitrij

On Sat, Jul 11, 2020 at 09:39:13AM +0100, Richard Sandiford wrote:
...
> This should push “newelt” instead.
done

...
> This test is endian-agnostic and should work for all targets, but…
...
> 
> …the shuffles in these tests are specific to little-endian targets.
> For big-endian targets, architectural lane 0 is the last in memory
> rather than first, so e.g. the big-endian permute vector for vzip_4.c
> would be: { 0, 1, 5, 6 } instead.
> 
> I guess the options are:
> 
> (1) add __ARM_BIG_ENDIAN preprocessor tests to choose the right mask
> (2) restrict the tests to aarch64_little_endian.
> (3) have two scan-assembler lines with opposite zip1/zip2 choices, e.g:
> 
> /* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2d} 1 { target aarch64_big_endian } } } */
> /* { dg-final { scan-assembler-times {[ \t]*zip2[ \t]+v[0-9]+\.2d} 1 { target aarch64_little_endian } } } */
> 
> (untested).
> 
> I guess (3) is probably best, but (2) is fine too if you're not set
> up to test big-endian.
> 
> Sorry for not noticing last time.  I only realised when testing the
> patch locally.
> 
> Thanks,
> Richard

I restricted the tests to aarch64_little_endian, because I don't have a
big-endian setup to check them.
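
For reference, one way to express that restriction is a target selector on the
final scan, reusing the little-endian line from Richard's option (3); the exact
form used in the committed tests may differ:

/* { dg-final { scan-assembler-times {[ \t]*zip2[ \t]+v[0-9]+\.2d} 1 { target aarch64_little_endian } } } */
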
From 197a9bc05f96c3f100b3f4748c9dd12a60de86d1 Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Mon, 13 Jul 2020 17:44:08 +0300
Subject: [PATCH]  __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables optimization of vector permutations by using a wider vector element size when applicable.
This allows simpler instructions to be used in such cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-13 Andrew Pinski   

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-07-13  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64.c| 57 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 13 +++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 13 +++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 13 +++
 6 files changed, 124 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 17dbe67..e259d05 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19991,6 +19991,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20176,6 +20178,59 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it combines odd and even elements.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+    return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+     and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+    return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+     vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned int i = 0; i < nelt; i += 2)
+    {
+      poly_int64 elt0 = d->perm[i];
+      poly_int64 elt1 = d->perm[i + 1];
+      p

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-11 Thread Richard Sandiford
Dmitrij Pochepko  writes:
> Hi,
>
> please take a look at the updated version (attached).

Looks good, but a couple more points, sorry:

> +  /* Convert the perm constant if we can.  Require even, odd as the pairs.  
> */
> +  for (unsigned int i = 0; i < nelt; i += 2)
> +{
> +  poly_int64 elt0 = d->perm[i];
> +  poly_int64 elt1 = d->perm[i + 1];
> +  poly_int64 newelt;
> +  if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
> + return false;
> +  newpermconst.quick_push (elt0.to_constant () / 2);

This should push “newelt” instead.

> diff --git a/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c 
> b/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
> new file mode 100644
> index 000..5234f5e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +/* These are both dups. */
> +vector float f(vector float a, vector float b)
> +{
> +  return __builtin_shuffle (a, a, (vector int){0, 1, 0, 1});
> +}
> +vector float f1(vector float a, vector float b)
> +{
> +  return __builtin_shuffle (a, a, (vector int){2, 3, 2, 3});
> +}
> +
> +/* { dg-final { scan-assembler-times {[ \t]*dup[ \t]+v[0-9]+\.2d} 2 } } */

This test is endian-agnostic and should work for all targets, but…

> diff --git a/gcc/testsuite/gcc.target/aarch64/vzip_1.c 
> b/gcc/testsuite/gcc.target/aarch64/vzip_1.c
> new file mode 100644
> index 000..5998211
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vzip_1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +vector float f(vector float a, vector float b)
> +{
> +  return __builtin_shuffle (a, b, (vector int){0, 2});
> +}
> +
> +/* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2s} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vzip_2.c 
> b/gcc/testsuite/gcc.target/aarch64/vzip_2.c
> new file mode 100644
> index 000..2acafde
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vzip_2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +vector float f(vector float a, vector float b)
> +{
> +  /* This is the same as zip1 v.2d as {0, 1, 4, 5} can be converted to {0, 
> 2}. */
> +  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
> +}
> +
> +/* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2d} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vzip_3.c 
> b/gcc/testsuite/gcc.target/aarch64/vzip_3.c
> new file mode 100644
> index 000..8fa80f7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vzip_3.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +vector float f(vector float a, vector float b)
> +{
> +  /* This is the same as zip1 v.2d as {4, 5, 0, 1} can be converted to {2, 
> 0}. */
> +  return __builtin_shuffle (a, b, (vector int){4, 5, 0, 1});
> +}
> +
> +/* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2d} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/vzip_4.c 
> b/gcc/testsuite/gcc.target/aarch64/vzip_4.c
> new file mode 100644
> index 000..aefc55c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vzip_4.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +vector float f(vector float a, vector float b)
> +{
> +  /* This is the same as zip2 v.2d as {2, 3, 6, 7} can be converted to {1, 
> 3}. */
> +  return __builtin_shuffle (a, b, (vector int){2, 3, 6, 7});
> +}
> +
> +/* { dg-final { scan-assembler-times {[ \t]*zip2[ \t]+v[0-9]+\.2d} 1 } } */

…the shuffles in these tests are specific to little-endian targets.
For big-endian targets, architectural lane 0 is the last in memory
rather than first, so e.g. the big-endian permute vector for vzip_4.c
would be: { 0, 1, 5, 6 } instead.

I guess the options are:

(1) add __ARM_BIG_ENDIAN preprocessor tests to choose the right mask
(2) restrict the tests to aarch64_little_endian.
(3) have two scan-assembler lines with opposite zip1/zip2 choices, e.g:

/* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2d} 1 { target aarch64_big_endian } } } */
/* { dg-final { scan-assembler-times {[ \t]*zip2[ \t]+v[0-9]+\.2d} 1 { target aarch64_little_endian } } } */

(untested).

I guess (3) is probably best, but (2) is fine too if you're not set
up to test big-endian.
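
For completeness, a rough, untested sketch of what option (1) could look like
for vzip_4.c, assuming the ACLE __ARM_BIG_ENDIAN macro and taking the
big-endian mask verbatim from the description above:

#define vector __attribute__((vector_size(4*sizeof(float))))

vector float f(vector float a, vector float b)
{
  /* Pick the permute mask matching the target's lane numbering; the
     big-endian values are quoted from the paragraph above and untested.  */
#ifdef __ARM_BIG_ENDIAN
  return __builtin_shuffle (a, b, (vector int){0, 1, 5, 6});
#else
  return __builtin_shuffle (a, b, (vector int){2, 3, 6, 7});
#endif
}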

Sorry for not noticing last time.  I only realised when testing the
patch locally.

Thanks,
Richard


Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-10 Thread Dmitrij Pochepko
Hi,

please take a look at the updated version (attached).

Thanks,
Dmitrij

On Wed, Jul 08, 2020 at 03:48:39PM +0100, Richard Sandiford wrote:
...
> 
> maybe s/use bigger size up/combines odd and even elements/

done

> It should be possible to do this without the to_constants, e.g.:
> 
>   poly_int64 elt0 = d->perm[i];
>   poly_int64 elt1 = d->perm[i + 1];
>   poly_int64 newelt;
>   if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
>     return false;
> 
> (The coding conventions require spaces around “+”, even though I agree
> “[i+1]” looks better.)

done

From 34b6b0803111609ec5a0a615a8f03b78921e8412 Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Fri, 10 Jul 2020 15:42:40 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables optimization of vector permutations by using a wider vector element size when applicable.
This allows simpler instructions to be used in such cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-10 Andrew Pinski   

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-07-10  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64.c| 57 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 ++
 6 files changed, 120 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 17dbe67..9b31743 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19991,6 +19991,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20176,6 +20178,59 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it combines odd and even elements.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+    return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+     and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+    return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+     vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned int i = 0; i < nelt; i += 2)
+    {
+      poly_int64 elt0 = d->perm[i];
+      poly_int64 elt1 = d->perm[i + 1];
+      poly_int64 newelt;
+      if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
+	return false;
+      newpermconst.quick_push (elt0.to_constant () / 2);
+    }
+  newpermconst.finalize ();
+
+  newd.vmode = new_mode;
+  newd.vec_flags = VEC_ADVSIMD;
+  newd.target = d->target ? gen_lowpart (new_mode, d->target) : NULL;
+  newd.op0 = d->op0 ? gen_lowpart (new_mode, d->op0) : NULL;
+  newd.op1 = d->op1 ? gen_lowpart (new_mode, d->op1) : NULL;
+  newd.testing_p = d->testing_p;
+  newd.one_vector_p = d->one_vector_p;
+
+  newd.perm.new_vector (newpermconst, newd.one_vector_p ? 1 : 2, nelt / 2);
+  return aarch64_expand_vec_perm_const_1 (&newd);
+}
+
 /* Recognize patterns suitable for the UZP instru

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-08 Thread Richard Sandiford
Dmitrij Pochepko  writes:
> Hi,
>
> thank you for looking into this.
>
> I prepared a new patch with all your comments addressed.

Thanks, looks good, just a couple of minor things:

> @@ -20090,6 +20092,62 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
>return true;
>  }
>  
> +/* Try to re-encode the PERM constant so it use the bigger size up.

maybe s/use bigger size up/combines odd and even elements/

> +   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
> +   We retry with this new constant with the full suite of patterns.  */
> +static bool
> +aarch64_evpc_reencode (struct expand_vec_perm_d *d)
> +{
> +  expand_vec_perm_d newd;
> +  unsigned HOST_WIDE_INT nelt;
> +
> +  if (d->vec_flags != VEC_ADVSIMD)
> +return false;
> +
> +  /* Get the new mode.  Always twice the size of the inner
> + and half the elements.  */
> +  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
> +  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
> +  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
> +  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, 
> vec_bits);
> +
> +  if (new_mode == word_mode)
> +return false;
> +
> +  /* to_constant is safe since this routine is specific to Advanced SIMD
> + vectors.  */
> +  nelt = d->perm.length ().to_constant ();
> +
> +  vec_perm_builder newpermconst;
> +  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
> +
> +  /* Convert the perm constant if we can.  Require even, odd as the pairs.  
> */
> +  for (unsigned int i = 0; i < nelt; i += 2)
> +{
> +  poly_int64 elt_poly0 = d->perm[i];
> +  poly_int64 elt_poly1 = d->perm[i+1];
> +  if (!elt_poly0.is_constant () || !elt_poly1.is_constant ())
> + return false;
> +  unsigned int elt0 = elt_poly0.to_constant ();
> +  unsigned int elt1 = elt_poly1.to_constant ();
> +  if ((elt0 & 1) != 0 || elt0 + 1 != elt1)
> + return false;
> +  newpermconst.quick_push (elt0 / 2);

It should be possible to do this without the to_constants, e.g.:

  poly_int64 elt0 = d->perm[i];
  poly_int64 elt1 = d->perm[i + 1];
  poly_int64 newelt;
  if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
    return false;

(The coding conventions require spaces around “+”, even though I agree
“[i+1]” looks better.)
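
As a worked example of the suggested check: with elt0 == 4, multiple_p (elt0,
2, &newelt) succeeds and sets newelt to 2, so the halved index falls out of the
check for free; an odd elt0 such as 3 makes it return false, which is exactly
the even/odd pairing requirement.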

Looks good otherwise.

Richard


Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-07 Thread Dmitrij Pochepko
Hi,

thank you for looking into this.

I prepared a new patch with all your comments addressed.

Thanks,
Dmitrij

On Tue, Jun 23, 2020 at 05:53:00PM +0100, Richard Sandiford wrote:
...
> 
> I think it would be simpler to do it in this order:
> 
>   - check for Advanced SIMD, bail out if not
>   - get the new mode, bail out if none
>   - calculate the permutation vector, bail out if not suitable
>   - set up the rest of “newd”
> 
> There would then only be one walk over d->perm rather than two,
> and we'd only create the gen_lowparts when there's something to test.
> 
> The new mode can be calculated with something like:
> 
>   poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
>   unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
>   auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
>   machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, 
> vec_bits);
> 
> “new_mode” will be “word_mode” on failure.
>
... 
> The regexp would be easier to read if quoted using {…}, which requires
> fewer backslashes.  Same for the other tests.
> 
> Thanks,
> Richard
From 71a3f4b05edc462bcceba35ff738c6f1b5ca3f0a Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Tue, 7 Jul 2020 18:45:06 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables optimization of vector permutations by using a wider vector element size when applicable.
This allows simpler instructions to be used in such cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-06-11	Andrew Pinski	

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By:	Dmitrij Pochepko	
---
 gcc/config/aarch64/aarch64.c| 60 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 ++
 6 files changed, 123 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f3551a7..4b02bc7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19905,6 +19905,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20090,6 +20092,62 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it use the bigger size up.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+    return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+     and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+    return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+     vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned int i = 0; i < nelt; i += 2)
+    {
+      poly_int64 elt_poly0 = d->perm[i];
+      poly_int64 elt_poly1 = d->perm[i+1];
+      if (!elt_poly0.is_constant () || !elt_poly1.is_constant ())
+	return false;
+      unsigned int elt0 = elt_poly0.to_constant ();
+      unsigned int elt1 = elt_poly1.to_constant ();
+      if ((elt0 & 1) != 0 || e

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-06-23 Thread Richard Sandiford
Sorry for the slow review.

Dmitrij Pochepko  writes:
> @@ -20074,6 +20076,83 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
>return true;
>  }
>  
> +/* Try to re-encode the PERM constant so it use the bigger size up.
> +   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
> +   We retry with this new constant with the full suite of patterns.  */
> +static bool
> +aarch64_evpc_reencode (struct expand_vec_perm_d *d)
> +{
> +  expand_vec_perm_d newd;
> +  unsigned HOST_WIDE_INT nelt;
> +
> +  if (d->vec_flags != VEC_ADVSIMD)
> +return false;
> +
> +  unsigned int encoded_nelts = d->perm.encoding ().encoded_nelts ();
> +  for (unsigned int i = 0; i < encoded_nelts; ++i)
> +if (!d->perm[i].is_constant ())
> +  return false;
> +
> +  /* to_constant is safe since this routine is specific to Advanced SIMD
> + vectors.  */
> +  nelt = d->perm.length ().to_constant ();
> +
> +  /* Get the new mode.  Always twice the size of the inner
> + and half the elements.  */
> +  machine_mode new_mode;
> +  switch (d->vmode)
> +{
> +/* 128bit vectors.  */
> +case E_V4SFmode:
> +case E_V4SImode:
> +  new_mode = V2DImode;
> +  break;
> +case E_V8BFmode:
> +case E_V8HFmode:
> +case E_V8HImode:
> +  new_mode = V4SImode;
> +  break;
> +case E_V16QImode:
> +  new_mode = V8HImode;
> +  break;
> +/* 64bit vectors.  */
> +case E_V4BFmode:
> +case E_V4HFmode:
> +case E_V4HImode:
> +  new_mode = V2SImode;
> +  break;
> +case E_V8QImode:
> +  new_mode = V4HImode;
> +  break;
> +default:
> +  return false;
> +}
> +
> +  newd.vmode = new_mode;
> +  newd.vec_flags = VEC_ADVSIMD;
> +  newd.target = d->target ? gen_lowpart (new_mode, d->target) : NULL;
> +  newd.op0 = d->op0 ? gen_lowpart (new_mode, d->op0) : NULL;
> +  newd.op1 = d->op1 ? gen_lowpart (new_mode, d->op1) : NULL;
> +  newd.testing_p = d->testing_p;
> +  newd.one_vector_p = d->one_vector_p;
> +  vec_perm_builder newpermconst;
> +  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
> +
> +  /* Convert the perm constant if we can.  Require even, odd as the pairs.  
> */
> +  for (unsigned int i = 0; i < nelt; i += 2)
> +{
> +  unsigned int elt0 = d->perm[i].to_constant ();
> +  unsigned int elt1 = d->perm[i+1].to_constant ();
> +  if ((elt0 & 1) != 0 || elt0 + 1 != elt1)
> + return false;
> +  newpermconst.quick_push (elt0 / 2);
> +}
> +  newpermconst.finalize ();

I think it would be simpler to do it in this order:

  - check for Advanced SIMD, bail out if not
  - get the new mode, bail out if none
  - calculate the permutation vector, bail out if not suitable
  - set up the rest of “newd”

There would then only be one walk over d->perm rather than two,
and we'd only create the gen_lowparts when there's something to test.

The new mode can be calculated with something like:

  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);

“new_mode” will be “word_mode” on failure.
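
In the later revisions this is followed by the corresponding bail-out, quoting
the v2 code:

  if (new_mode == word_mode)
    return false;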

> diff --git a/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c 
> b/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
> new file mode 100644
> index 000..289604d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(float))))
> +
> +/* These are both dups. */
> +vector float f(vector float a, vector float b)
> +{
> +  return __builtin_shuffle (a, a, (vector int){0, 1, 0, 1});
> +}
> +vector float f1(vector float a, vector float b)
> +{
> +  return __builtin_shuffle (a, a, (vector int){2, 3, 2, 3});
> +}
> +
> +/* { dg-final { scan-assembler-times "\[ \t\]*dup\[ \t\]+v\[0-9\]+\.2d" 2 } 
> } */

The regexp would be easier to read if quoted using {…}, which requires
fewer backslashes.  Same for the other tests.
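
For comparison, the same directive written with {…} quoting, as it appears in
the later revisions of vdup_n_3.c:

/* { dg-final { scan-assembler-times {[ \t]*dup[ \t]+v[0-9]+\.2d} 2 } } */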

Thanks,
Richard


RE: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-06-23 Thread Tamar Christina
Adding AArch64 maintainers.

> -Original Message-
> From: Gcc-patches  On Behalf Of Dmitrij
> Pochepko
> Sent: Thursday, June 11, 2020 12:22 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1
> rather than TBL (PR82199)
> 
> The following patch enables optimization of vector permutations by using
> a wider vector element size when applicable.
> This allows simpler instructions to be used in such cases.
> 
> example:
> #define vector __attribute__((vector_size(16) ))
> 
> vector float f(vector float a, vector float b)
> {
>   return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
> }
> 
> was compiled into:
> ...
>   adrp	x0, .LC0
>   ldr	q2, [x0, #:lo12:.LC0]
>   tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
> ...
> 
> and after patch:
> ...
>   zip1	v0.2d, v0.2d, v1.2d
> ...
> 
> bootstrapped and tested on aarch64-linux-gnu with no regressions
> 
> 
> This patch was initially introduced by Andrew Pinski,
> with me being involved later.
> 
> (I have no write access to repo)
> 
> Thanks,
> Dmitrij
> 
> gcc/ChangeLog:
> 
> 2020-06-11  Andrew Pinski
> 
>   PR gcc/82199
> 
>   * gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New
> function
> 
> gcc/testsuite/ChangeLog:
> 
> 2020-06-11  Andrew Pinski   
> 
>   PR gcc/82199
> 
>   * gcc.target/aarch64/vdup_n_3.c: New test
>   * gcc.target/aarch64/vzip_1.c: New test
>   * gcc.target/aarch64/vzip_2.c: New test
>   * gcc.target/aarch64/vzip_3.c: New test
>   * gcc.target/aarch64/vzip_4.c: New test
> 
> Co-Authored-By:   Dmitrij Pochepko
> 
> 
> 
> Thanks,
> Dmitrij


[PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-06-11 Thread Dmitrij Pochepko
The following patch enables optimization of vector permutations by using a
wider vector element size when applicable.
This allows simpler instructions to be used in such cases.

example:
#define vector __attribute__((vector_size(16) ))

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions
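
As a side note (not part of the patch): the index rewrite can be illustrated
with a small standalone C sketch of the even/odd pair check that
aarch64_evpc_reencode performs; the function name and the plain-array
interface here are purely illustrative.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Return true if SEL (of length NELT) selects only adjacent (even, odd)
   pairs, and write the halved indices to OUT.  E.g. {0, 1, 4, 5} becomes
   {0, 2}, which maps V4SF lanes onto V2DI lanes so the shuffle can be done
   with zip1 on the wider mode.  */
static bool
reencode_perm (const unsigned *sel, size_t nelt, unsigned *out)
{
  for (size_t i = 0; i < nelt; i += 2)
    {
      unsigned elt0 = sel[i], elt1 = sel[i + 1];
      if ((elt0 & 1) != 0 || elt0 + 1 != elt1)
        return false;
      out[i / 2] = elt0 / 2;
    }
  return true;
}

int
main (void)
{
  unsigned sel[] = { 0, 1, 4, 5 };
  unsigned out[2];
  if (reencode_perm (sel, 4, out))
    printf ("{%u, %u}\n", out[0], out[1]);  /* prints {0, 2} */
  return 0;
}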


This patch was initially introduced by Andrew Pinski, with
me being involved later.

(I have no write access to repo)

Thanks,
Dmitrij

gcc/ChangeLog:

2020-06-11  Andrew Pinski   

PR gcc/82199

* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

PR gcc/82199

* gcc.target/aarch64/vdup_n_3.c: New test
* gcc.target/aarch64/vzip_1.c: New test
* gcc.target/aarch64/vzip_2.c: New test
* gcc.target/aarch64/vzip_3.c: New test
* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko



Thanks,
Dmitrij
From 3c9f3fe834811386223755fc58e2ab4a612eefcf Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Thu, 11 Jun 2020 14:13:35 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables optimization of vector permutations by using a wider vector element size when applicable.
This allows simpler instructions to be used in such cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 1, 4, 5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-06-11	Andrew Pinski	

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By:	Dmitrij Pochepko	
---
 gcc/config/aarch64/aarch64.c| 81 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 ++
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 +
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 +
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 +
 6 files changed, 144 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65a..ab7b39e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19889,6 +19889,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20074,6 +20076,83 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it use the bigger size up.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+    return false;
+
+  unsigned int encoded_nelts = d->perm.encoding ().encoded_nelts ();
+  for (unsigned int i = 0; i < encoded_nelts; ++i)
+    if (!d->perm[i].is_constant ())
+      return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+     vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  /* Get the new mode.  Always twice the size of the inner
+     and half the elements.  */
+  machine_mode new_mode;
+  switch (d->vmode)
+    {
+    /* 128bit vectors.  */
+    case E_V4SFmode:
+    case E_V4SImode:
+      new_mode = V2DImode;
+      break;
+    case E_V8BFmode:
+    case E_V8HFmode:
+    case E_V8HImode:
+      new_mode = V4SImode;
+      break;
+    case E_V16QImode:
+      new_mode = V8HImode;
+      break;
+    /* 64bit vectors.  */
+    case E_V4BFmode:
+    case E_V4HFmode:
+    case E_V4HImode:
+      new_mode = V2SImode;
+      break;
+    case E_V8QImode:
+  ne