Re: [PATCH][RFC] vector creation from two parts of two vectors produces TBL rather than ins (PR93720)

2020-07-17 Thread Dmitrij Pochepko

Thank you!

On 17.07.2020 12:25, Richard Sandiford wrote:

Dmitrij Pochepko  writes:

Hi,

please take a look at the updated patch with all comments addressed (attached).

Thanks, pushed to master with a slightly tweaked changelog.

Richard


Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-17 Thread Dmitrij Pochepko

Thank you!

On 17.07.2020 12:21, Richard Sandiford wrote:

Dmitrij Pochepko  writes:

Hi,

Please take a look at the new version (attached).

Thanks, pushed to master with a slightly tweaked changelog.

Richard


[PATCH] non-power-of-2 group size can be vectorized for 2-element vectors case (PR96208)

2020-07-15 Thread Dmitrij Pochepko
Hi,

here is an enhancement to gcc which allows load/store groups whose size is not
a power of 2 to be vectorized.
The current implementation uses interleaving permutations to transform
load/store groups; that is where the power-of-2 requirement comes from.
For N-element vectors the simplest approach is to use N single-element
insertions for any required vector permutation, and for 2-element vectors that
is a reasonable number of insertions.
Using this approach allows vectorization of cases which were not supported
before.
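
To illustrate the idea (this sketch is not part of the patch and uses GCC's
generic vector extensions): for 2-element vectors, any permutation of two
input vectors can be expressed as two single-element insertions, so no
interleaving scheme is needed.

typedef double v2df __attribute__ ((vector_size (2 * sizeof (double))));

/* Build a 2-element result from the concatenation {a[0], a[1], b[0], b[1]}
   using two single-element insertions; sel0 and sel1 are in the range 0..3.  */
v2df
permute_v2df (v2df a, v2df b, int sel0, int sel1)
{
  double pool[4] = { a[0], a[1], b[0], b[1] };
  v2df r = { 0.0, 0.0 };
  r[0] = pool[sel0];	/* first single-element insertion */
  r[1] = pool[sel1];	/* second single-element insertion */
  return r;
}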

bootstrapped and tested on x86_64-pc-linux-gnu and aarch64-linux-gnu.

Thanks,
Dmitrij
>From acf12c34f4bebbb5c6000a87bf9aaa58e48418bb Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Wed, 15 Jul 2020 18:07:26 +0300
Subject: [PATCH] non-power-of-2 group size can be vectorized for 2-element
 vectors case (PR96208)

Support for non-power-of-2 group size in vectorizer for 2-element vectors.

gcc/ChangeLog:

2020-07-15      Dmitrij Pochepko 

PR gcc/96208

* gcc/tree-vect-data-refs.c:
	(vect_all_2element_permutations_supported): New function.
	(vect_permute_load_chain): Add a new branch for the new algorithm.
	(vect_permute_store_chain): Likewise.
	(vect_grouped_load_supported): Adjust logic for the new algorithm.
	(vect_grouped_store_supported): Likewise.
	(vect_transform_grouped_load): Likewise.

gcc/testsuite/ChangeLog:

2020-07-15      Dmitrij Pochepko 

	PR gcc/96208

	* gcc.dg/vect/vect-non-pow2-group.c: New test
---
 gcc/testsuite/gcc.dg/vect/vect-non-pow2-group.c |  25 +++
 gcc/tree-vect-data-refs.c   | 212 +---
 2 files changed, 218 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-non-pow2-group.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-non-pow2-group.c b/gcc/testsuite/gcc.dg/vect/vect-non-pow2-group.c
new file mode 100644
index 000..7a22739
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-non-pow2-group.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-require-effective-target vect_perm } */
+/* { dg-additional-options "-fdump-tree-vect-details -fno-vect-cost-model -Ofast" } */
+
+typedef struct {
+double m1, m2, m3, m4, m5;
+} the_struct_t;
+
+double bar1 (the_struct_t*);
+
+double foo (double* k, unsigned int n, the_struct_t* the_struct)
+{
+unsigned int u;
+the_struct_t result;
+for (u=0; u < n; u++, k--) {
+	result.m1 += (*k)*the_struct[u].m1;
+	result.m2 += (*k)*the_struct[u].m2;
+	result.m3 += (*k)*the_struct[u].m3;
+	result.m4 += (*k)*the_struct[u].m4;
+}
+return bar1 (&result);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index e35a215..caf4555 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -5027,6 +5027,37 @@ vect_create_destination_var (tree scalar_dest, tree vectype)
   return vec_dest;
 }
 
+/* Function vect_all_2element_permutations_supported
+
+   Returns TRUE if all possible permutations for 2-element
+   vectors are supported for requested mode.  */
+
+bool
+vect_all_2element_permutations_supported (machine_mode mode)
+{
+  // check all possible permutations for 2-element vectors
+  // for 2 vectors it'll be all low and high combinations:
+  // ll={0, 2}, lh={0, 3}, hl={1,2}, hh={1,3}
+  poly_uint64 nelt = GET_MODE_NUNITS (mode);
+  if (!known_eq (nelt, 2ULL))
+return false;
+  vec_perm_builder sel (nelt, 2, 2);
+  sel.quick_grow (2);
+  sel[0] = 0;
+  sel[1] = 2;
+  vec_perm_indices ll (sel, 2, 2);
+  sel[1] = 3;
+  vec_perm_indices lh (sel, 2, 2);
+  sel[0] = 1;
+  vec_perm_indices hh (sel, 2, 2);
+  sel[1] = 2;
+  vec_perm_indices hl (sel, 2, 2);
+  return can_vec_perm_const_p (mode, ll)
+  && can_vec_perm_const_p (mode, lh)
+  && can_vec_perm_const_p (mode, hl)
+  && can_vec_perm_const_p (mode, hh);
+}
+
 /* Function vect_grouped_store_supported.
 
Returns TRUE if interleave high and interleave low permutations
@@ -5038,13 +5069,15 @@ vect_grouped_store_supported (tree vectype, unsigned HOST_WIDE_INT count)
   machine_mode mode = TYPE_MODE (vectype);
 
   /* vect_permute_store_chain requires the group size to be equal to 3 or
- be a power of two.  */
-  if (count != 3 && exact_log2 (count) == -1)
+ be a power of two or 2-element vectors to be used.  */
+  if (count != 3 && exact_log2 (count) == -1
+   && !known_eq (GET_MODE_NUNITS (mode), 2ULL))
 {
   if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 			 "the size of the group of accesses"
-			 " is not a power of 2 or not eqaul to 3\n");
+			 " is not a power of 2 or not equal to 3"
+			 " and vector element number is not 2\n");
   return false;
 }
 
@@ -5113,9 +5146,14 @@ vect_grouped_store_suppo

Re: [PATCH][RFC] vector creation from two parts of two vectors produces TBL rather than ins (PR93720)

2020-07-14 Thread Dmitrij Pochepko
Hi,

please take a look at the updated patch with all comments addressed (attached).

Thanks,
Dmitrij

On Sat, Jul 11, 2020 at 09:52:40AM +0100, Richard Sandiford wrote:
...
> 
> For this point, I meant that we should remove the first loop too.  I.e.:
>
... 
> 
> is now redundant with the later:
>
... 
> 
> However, a more canonical way to write the condition above is:
> 
>   for (unsigned HOST_WIDE_INT i = 0; i < nelt; i++)
> {
>   HOST_WIDE_INT elt;
>   if (!d->perm[i].is_constant (&elt))
> return false;
>   if (elt == (HOST_WIDE_INT) i)
> continue;
> 

done

> Very minor, but the coding conventions don't put a space before “++”.
> So:
> 
> > +  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i ++)
> 
> …this should be “i++” too.

done
>From 9acc14f4cdd10091daa5311f495daacfebdcfc3d Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Tue, 14 Jul 2020 15:48:46 +0300
Subject: [PATCH] vector creation from two parts of two vectors produces TBL
 rather than ins (PR 93720)

The following patch enables vector permutations optimization by trying to use ins instruction instead of slow and generic tbl.

example:

vector float f0(vector float a, vector float b)
{
  return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr q1, [x0, #:lo12:.LC0]
	tbl v0.16b, {v0.16b}, v1.16b
...

and after patch:
...
	ins v0.s[0], v0.s[3]
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-14 Andrew Pinski   

	PR gcc/93720

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_ins): New function.
	* gcc/config/aarch64/aarch64-simd.md (aarch64_simd_vec_copy_lane):
	Change the name prefix.

gcc/testsuite/ChangeLog:

2020-07-14  Andrew Pinski   

	PR gcc/93720

	* gcc/testsuite/gcc.target/aarch64/vins-1.c: New test
	* gcc/testsuite/gcc.target/aarch64/vins-2.c: New test
	* gcc/testsuite/gcc.target/aarch64/vins-3.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64-simd.md|  2 +-
 gcc/config/aarch64/aarch64.c  | 77 +++
 gcc/testsuite/gcc.target/aarch64/vins-1.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-2.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-3.c | 23 +
 5 files changed, 147 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-3.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9f0e2bd..11ebf5b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -958,7 +958,7 @@
   [(set_attr "type" "neon_ins, neon_from_gp, neon_load1_one_lane")]
 )
 
-(define_insn "*aarch64_simd_vec_copy_lane"
+(define_insn "@aarch64_simd_vec_copy_lane"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w")
 	(vec_merge:VALL_F16
 	(vec_duplicate:VALL_F16
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e259d05..f1c5b5a 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -20594,6 +20594,81 @@ aarch64_evpc_sel (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Recognize patterns suitable for the INS instructions.  */
+static bool
+aarch64_evpc_ins (struct expand_vec_perm_d *d)
+{
+  machine_mode mode = d->vmode;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+  rtx insv = d->op0;
+
+  HOST_WIDE_INT idx = -1;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i++)
+{
+  HOST_WIDE_INT elt;
+  if (!d->perm[i].is_constant (&elt))
+	return false;
+  if (elt == (HOST_WIDE_INT) i)
+	continue;
+  if (idx != -1)
+	{
+	  idx = -1;
+	  break;
+	}
+  idx = i;
+}
+
+  if (idx == -1)
+{
+  insv = d->op1;
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i++)
+	{
+	  if (d->perm[i].to_constant () == (HOST_WIDE_INT) (i + nelt))
+	continue;
+	  if (idx != -1)
+	return false;
+	  idx = i;
+	}
+
+  if (idx == -1)
+	return false;
+}
+
+  if (d->testing_p)
+return true;
+
+  gcc_assert (idx != -1);
+
+  unsigned extractindex = d->perm[idx].to_constant ();
+  rtx extractv = d->op0;
+  if (extractindex >= nelt)
+{
+  extractv = d->op1;
+  extractindex -= nelt;
+}
+  gcc_assert (extractindex < nelt);
+
+  emit_move_insn (d->target, insv);
+  insn_code icode = code_for_aarch64_simd_vec_copy_lane (mode);
+  expand_operand ops[5];
+  create_output_operand (&ops[0], d->target, mode);
+  create_input_

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-13 Thread Dmitrij Pochepko
Hi,

Please take a look at the new version (attached).

Thanks,
Dmitrij

On Sat, Jul 11, 2020 at 09:39:13AM +0100, Richard Sandiford wrote:
...
> This should push “newelt” instead.
done

...
> This test is endian-agnostic and should work for all targets, but…
...
> 
> …the shuffles in these tests are specific to little-endian targets.
> For big-endian targets, architectural lane 0 is the last in memory
> rather than first, so e.g. the big-endian permute vector for vzip_4.c
> would be: { 0, 1, 5, 6 } instead.
> 
> I guess the options are:
> 
> (1) add __ARM_BIG_ENDIAN preprocessor tests to choose the right mask
> (2) restrict the tests to aarch64_little_endian.
> (3) have two scan-assembler lines with opposite zip1/zip2 choices, e.g:
> 
> /* { dg-final { scan-assembler-times {[ \t]*zip1[ \t]+v[0-9]+\.2d} 1 { target 
> aarch64_big_endian } } } */
> /* { dg-final { scan-assembler-times {[ \t]*zip2[ \t]+v[0-9]+\.2d} 1 { target 
> aarch64_little_endian } } } */
> 
> (untested).
> 
> I guess (3) is probably best, but (2) is fine too if you're not set
> up to test big-endian.
> 
> Sorry for not noticing last time.  I only realised when testing the
> patch locally.
> 
> Thanks,
> Richard

I restricted the tests to aarch64_little_endian, because I don't have a
big-endian setup to check it.
>From 197a9bc05f96c3f100b3f4748c9dd12a60de86d1 Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Mon, 13 Jul 2020 17:44:08 +0300
Subject: [PATCH]  __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables vector permutations optimization by using another vector element size when applicable.
It allows usage of simpler instructions in applicable cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 1, 4,5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr q2, [x0, #:lo12:.LC0]
	tbl v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-13 Andrew Pinski   

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-07-13  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64.c| 57 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 13 +++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 13 +++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 13 +++
 6 files changed, 124 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 17dbe67..e259d05 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19991,6 +19991,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20176,6 +20178,59 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it combines odd and even elements.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+ and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned

Re: [PATCH][RFC] vector creation from two parts of two vectors produces TBL rather than ins (PR93720)

2020-07-10 Thread Dmitrij Pochepko
Hi,

thank you for reviewing it.

Please check the updated version (attached) with all comments addressed.

Thanks,
Dmitrij

On Tue, Jun 23, 2020 at 06:10:52PM +0100, Richard Sandiford wrote:
...
> 
> I think it would be better to test this as part of the loop below.
> 
done

...
> I think it'd be better to generate the target instruction directly.
> We can do that by replacing:
> 
> (define_insn "*aarch64_simd_vec_copy_lane"
> 
> with:
> 
> (define_insn "@aarch64_simd_vec_copy_lane"
> 
> then using the expand_insn interface to create an instance of
> code_for_aarch64_simd_vec_copy_lane (mode).
done
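
For reference, a minimal sketch of what the expand_insn-based expansion looks
like (this assumes the operand layout of aarch64_simd_vec_copy_lane, i.e.
0 = output, 1 = vector being inserted into, 2 = lane-mask immediate,
3 = source vector, 4 = source lane index; the authoritative code is in the
attached patch):

  emit_move_insn (d->target, insv);
  insn_code icode = code_for_aarch64_simd_vec_copy_lane (mode);
  expand_operand ops[5];
  create_output_operand (&ops[0], d->target, mode);
  create_input_operand (&ops[1], d->target, mode);
  create_integer_operand (&ops[2], 1 << idx);
  create_input_operand (&ops[3], extractv, mode);
  create_integer_operand (&ops[4], extractindex);
  expand_insn (icode, 5, ops);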

...
> 
> > +/* { dg-final { scan-assembler-times "\[ \t\]*ins\[ \t\]+v\[0-9\]+\.s" 4 } 
> > } */
> 
> Same comment as the other patch about using {…} regexp quoting.
> 
done
>From 8e7cfa2da407171a30d1e152f0e0f4be399d571e Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Fri, 10 Jul 2020 20:37:17 +0300
Subject: [PATCH] vector creation from two parts of two vectors produces TBL
 rather than ins (PR 93720)

The following patch enables vector permutations optimization by trying to use ins instruction instead of slow and generic tbl.

example:

vector float f0(vector float a, vector float b)
{
  return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr q1, [x0, #:lo12:.LC0]
	tbl v0.16b, {v0.16b}, v1.16b
...

and after patch:
...
	ins v0.s[0], v0.s[3]
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-10 Andrew Pinski   

	PR gcc/93720

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_ins): New function.
	* gcc/config/aarch64/aarch64-simd.md (aarch64_simd_vec_copy_lane):
	Change the name prefix.

gcc/testsuite/ChangeLog:

2020-07-10  Andrew Pinski   

	PR gcc/93720

	* gcc/testsuite/gcc.target/aarch64/vins-1.c: New test
	* gcc/testsuite/gcc.target/aarch64/vins-2.c: New test
	* gcc/testsuite/gcc.target/aarch64/vins-3.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64-simd.md|  2 +-
 gcc/config/aarch64/aarch64.c  | 82 +++
 gcc/testsuite/gcc.target/aarch64/vins-1.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-2.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-3.c | 23 +
 5 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-3.c

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9f0e2bd..11ebf5b 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -958,7 +958,7 @@
   [(set_attr "type" "neon_ins, neon_from_gp, neon_load1_one_lane")]
 )
 
-(define_insn "*aarch64_simd_vec_copy_lane"
+(define_insn "@aarch64_simd_vec_copy_lane"
   [(set (match_operand:VALL_F16 0 "register_operand" "=w")
 	(vec_merge:VALL_F16
 	(vec_duplicate:VALL_F16
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9b31743..7544fd8 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -20594,6 +20594,86 @@ aarch64_evpc_sel (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Recognize patterns suitable for the INS instructions.  */
+static bool
+aarch64_evpc_ins (struct expand_vec_perm_d *d)
+{
+  machine_mode mode = d->vmode;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  unsigned int encoded_nelts = d->perm.encoding ().encoded_nelts ();
+  for (unsigned int i = 0; i < encoded_nelts; ++i)
+if (!d->perm[i].is_constant ())
+  return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+  rtx insv = d->op0;
+
+  HOST_WIDE_INT idx = -1;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i ++)
+{
+  poly_int64 elt = d->perm[i];
+  if (!elt.is_constant ())
+	return false;
+  if (elt.to_constant () == (HOST_WIDE_INT) i)
+	continue;
+  if (idx != -1)
+	{
+	  idx = -1;
+	  break;
+	}
+  idx = i;
+}
+
+  if (idx == -1)
+{
+  insv = d->op1;
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i ++)
+	{
+	  if (d->perm[i].to_constant () == (HOST_WIDE_INT) (i + nelt))
+	continue;
+	  if (idx != -1)
+	return false;
+	  idx = i;
+	}
+
+  if (idx == -1)
+	return false;
+}
+
+  if (d->testing_p)
+return true;
+
+  gcc_assert (idx != -1);
+
+  unsigned extractindex = d->perm[idx].to_constant ();
+  rtx extractv = d->op0;
+  if (extractindex >= nelt)
+{
+  extractv = d->op1;
+  extractindex -= nelt;
+}
+ 

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-10 Thread Dmitrij Pochepko
Hi,

please take a look at the updated version (attached).

Thanks,
Dmitrij

On Wed, Jul 08, 2020 at 03:48:39PM +0100, Richard Sandiford wrote:
...
> 
> maybe s/use bigger size up/combines odd and even elements/

done

> It should be possible to do this without the to_constants, e.g.:
> 
>   poly_int64 elt0 = d->perm[i];
>   poly_int64 elt1 = d->perm[i + 1];
>   poly_int64 newelt;
>   if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
> return false;
> 
> (The coding conventions require spaces around “+”, even though I agree
> “[i+1]” looks better.)

done

>From 34b6b0803111609ec5a0a615a8f03b78921e8412 Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Fri, 10 Jul 2020 15:42:40 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables vector permutations optimization by using another vector element size when applicable.
It allows usage of simpler instructions in applicable cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 1, 4,5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr q2, [x0, #:lo12:.LC0]
	tbl v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-07-10 Andrew Pinski   

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-07-10  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko
---
 gcc/config/aarch64/aarch64.c| 57 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 ++
 6 files changed, 120 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 17dbe67..9b31743 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19991,6 +19991,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20176,6 +20178,59 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it combines odd and even elements.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+ and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned int i = 0; i < nelt; i += 2)
+{
+  poly_int64 elt0 = d->perm[i];
+  poly_int64 elt1 = d->perm[i + 1];
+  poly_int64 newelt;
+  if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1))
+	return false;
+  newpermconst.quick_push (elt0.to_constant () / 2);
+}
+  newpermconst.finalize ();
+
+  newd.vmode = new_mode;
+  newd.vec_flags = VEC_ADVSIMD;
+  newd.target = d->target ? gen_lowpart (new_mode, d->target) : NULL;
+  newd.op0 = d->op0 ? gen_lowpart (new_mode, d->op0) : NULL;
+  newd.op1 = d->op1 ? gen_lowpart (new_mode, d->op1) : NULL;
+  newd.testing_p = d->testing_p;
+  newd.one_vector_p = d->one_vector_p;
+
+  newd.perm.new_vector (newpermconst, newd.one_vector_p ? 1 : 2, nelt / 2);
+

Re: [PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-07-07 Thread Dmitrij Pochepko
Hi,

thank you for looking into this.

I prepared a new patch with all your comments addressed.

Thanks,
Dmitrij

On Tue, Jun 23, 2020 at 05:53:00PM +0100, Richard Sandiford wrote:
...
> 
> I think it would be simpler to do it in this order:
> 
>   - check for Advanced SIMD, bail out if not
>   - get the new mode, bail out if none
>   - calculate the permutation vector, bail out if not suitable
>   - set up the rest of “newd”
> 
> There would then only be one walk over d->perm rather than two,
> and we'd only create the gen_lowparts when there's something to test.
> 
> The new mode can be calculated with something like:
> 
>   poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
>   unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
>   auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
>   machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, 
> vec_bits);
> 
> “new_mode” will be “word_mode” on failure.
>
... 
> The regexp would be easier to read if quoted using {…}, which requires
> fewer backslashes.  Same for the other tests.
> 
> Thanks,
> Richard
>From 71a3f4b05edc462bcceba35ff738c6f1b5ca3f0a Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Tue, 7 Jul 2020 18:45:06 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables vector permutations optimization by using another vector element size when applicable.
It allows usage of simpler instructions in applicable cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 1, 4,5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-06-11	Andrew Pinski	

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By:	Dmitrij Pochepko	
---
 gcc/config/aarch64/aarch64.c| 60 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 ++
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 ++
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 ++
 6 files changed, 123 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f3551a7..4b02bc7 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19905,6 +19905,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20090,6 +20092,62 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it use the bigger size up.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  /* Get the new mode.  Always twice the size of the inner
+ and half the elements.  */
+  poly_uint64 vec_bits = GET_MODE_BITSIZE (d->vmode);
+  unsigned int new_elt_bits = GET_MODE_UNIT_BITSIZE (d->vmode) * 2;
+  auto new_elt_mode = int_mode_for_size (new_elt_bits, false).require ();
+  machine_mode new_mode = aarch64_simd_container_mode (new_elt_mode, vec_bits);
+
+  if (new_mode == word_mode)
+return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  vec_perm_builder newpermconst;
+  newpermconst.new_vector (nelt / 2, nelt / 2, 1);
+
+  /* Convert the perm constant if we can.  Require even, odd as the pairs.  */
+  for (unsigned int i = 0; i < nelt; i += 2)
+{
+  poly_int64 elt_poly0 = d->perm[i];
+  poly_int64 elt_poly1 = d->perm[i+1];
+  if (!elt_poly0.is_constant () || !elt_poly1.is_constant ())
+	return false;
+  unsi

[PATCH][RFC] vector creation from two parts of two vectors produces TBL rather than ins (PR93720)

2020-06-17 Thread Dmitrij Pochepko
The following patch enables vector permutations optimization by trying to use 
ins instruction instead of slow and generic tbl.

example:
#define vector __attribute__((vector_size(4*sizeof(float))))

vector float f0(vector float a, vector float b)
{
  return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
}


was compiled into:
...
adrp	x0, .LC0
ldr q1, [x0, #:lo12:.LC0]
tbl v0.16b, {v0.16b}, v1.16b
...

and after patch:
...
ins v0.s[0], v0.s[3]
...

bootstrapped and tested on aarch64-linux-gnu with no regressions


This patch was initially introduced by me with Andrew Pinski 
 being involved later.

Please note that test in this patch depends on another commit (PR82199), which 
I sent not long ago.

(I have no write access to repo)

Thanks,
Dmitrij
>From d4ccbcdf67648a095706213a0fe0ac856bb077bb Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Wed, 17 Jun 2020 12:04:10 +0300
Subject: [PATCH] vector creation from two parts of two vectors produces TBL
 rather than ins (PR93720)

The following patch enables vector permutations optimization by trying to use ins instruction instead of slow and generic tbl.

example:

vector float f0(vector float a, vector float b)
{
  return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
}

was compiled into:
...
adrp	x0, .LC0
ldr q1, [x0, #:lo12:.LC0]
tbl v0.16b, {v0.16b}, v1.16b
...

and after patch:
...
ins v0.s[0], v0.s[3]
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

This patch was initially introduced by me with Andrew Pinski  being involved later.
---
 gcc/config/aarch64/aarch64.c  | 85 +++
 gcc/testsuite/gcc.target/aarch64/vins-1.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-2.c | 23 +
 gcc/testsuite/gcc.target/aarch64/vins-3.c | 23 +
 4 files changed, 154 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vins-3.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ab7b39e..e0bde6d 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -20516,6 +20516,89 @@ aarch64_evpc_sel (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Recognize patterns suitable for the INS instructions.  */
+static bool
+aarch64_evpc_ins (struct expand_vec_perm_d *d)
+{
+  machine_mode mode = d->vmode;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  unsigned int encoded_nelts = d->perm.encoding ().encoded_nelts ();
+  for (unsigned int i = 0; i < encoded_nelts; ++i)
+if (!d->perm[i].is_constant ())
+  return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+  rtx insv = d->op0;
+
+  HOST_WIDE_INT idx = -1;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i ++)
+{
+  if (d->perm[i].to_constant () == (HOST_WIDE_INT) i)
+	continue;
+  if (idx != -1)
+	{
+	  idx = -1;
+	  break;
+	}
+  idx = i;
+}
+
+  if (idx == -1)
+{
+  insv = d->op1;
+  for (unsigned HOST_WIDE_INT i = 0; i < nelt; i ++)
+	{
+	  if (d->perm[i].to_constant () == (HOST_WIDE_INT) (i + nelt))
+	continue;
+	  if (idx != -1)
+	return false;
+	  idx = i;
+	}
+
+  if (idx == -1)
+	return false;
+}
+
+  if (d->testing_p)
+return true;
+
+  gcc_assert (idx != -1);
+
+  unsigned extractindex = d->perm[idx].to_constant ();
+  rtx extractv = d->op0;
+  if (extractindex >= nelt)
+{
+  extractv = d->op1;
+  extractindex -= nelt;
+}
+  gcc_assert (extractindex < nelt);
+
+  machine_mode inner_mode = GET_MODE_INNER (mode);
+
+  enum insn_code inscode = optab_handler (vec_set_optab, mode);
+  gcc_assert (inscode != CODE_FOR_nothing);
+  enum insn_code iextcode = convert_optab_handler (vec_extract_optab, mode,
+		   inner_mode);
+  gcc_assert (iextcode != CODE_FOR_nothing);
+  rtx tempinner = gen_reg_rtx (inner_mode);
+  emit_insn (GEN_FCN (iextcode) (tempinner, extractv, GEN_INT (extractindex)));
+
+  rtx temp = gen_reg_rtx (mode);
+  emit_move_insn (temp, insv);
+  emit_insn (GEN_FCN (inscode) (temp, tempinner, GEN_INT (idx)));
+
+  emit_move_insn (d->target, temp);
+
+  return true;
+}
+
 static bool
 aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
 {
@@ -20550,6 +20633,8 @@ aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
 	return true;
   else if (aarch64_evpc_sel (d))
 	return true;
+  else if (aarch64_evpc_ins (d))
+	return true;
   else if (aarch64_evpc_reencode (d))
 	return true;
   if (d->vec_flags == VEC_SVE_DATA)
diff --git a/gcc/testsuite/gcc.target/aarch64/vins-1.c b/gcc/testsuite/gcc.target/aarch64/vins-1.c
new 

[PATCH][RFC] __builtin_shuffle sometimes should produce zip1 rather than TBL (PR82199)

2020-06-11 Thread Dmitrij Pochepko
The following patch enables vector permutations optimization by using another 
vector element size when applicable.
It allows usage of simpler instructions in applicable cases.

example:
#define vector __attribute__((vector_size(16) ))

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 1, 4,5});
}

was compiled into:
...
adrp	x0, .LC0
ldr q2, [x0, #:lo12:.LC0]
tbl v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions


This patch was initially introduced by Andrew Pinski  with 
me being involved later.

(I have no write access to repo)

Thanks,
Dmitrij

gcc/ChangeLog:

2020-06-11  Andrew Pinski   

PR gcc/82199

* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

PR gcc/82199

* gcc.target/aarch64/vdup_n_3.c: New test
* gcc.target/aarch64/vzip_1.c: New test
* gcc.target/aarch64/vzip_2.c: New test
* gcc.target/aarch64/vzip_3.c: New test
* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By: Dmitrij Pochepko



Thanks,
Dmitrij
>From 3c9f3fe834811386223755fc58e2ab4a612eefcf Mon Sep 17 00:00:00 2001
From: Dmitrij Pochepko 
Date: Thu, 11 Jun 2020 14:13:35 +0300
Subject: [PATCH] __builtin_shuffle sometimes should produce zip1 rather than
 TBL (PR82199)

The following patch enables vector permutations optimization by using another vector element size when applicable.
It allows usage of simpler instructions in applicable cases.

example:

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 1, 4,5});
}

was compiled into:
...
	adrp	x0, .LC0
	ldr	q2, [x0, #:lo12:.LC0]
	tbl	v0.16b, {v0.16b - v1.16b}, v2.16b
...

and after patch:
...
	zip1	v0.2d, v0.2d, v1.2d
...

bootstrapped and tested on aarch64-linux-gnu with no regressions

gcc/ChangeLog:

2020-06-11	Andrew Pinski	

	PR gcc/82199

	* gcc/config/aarch64/aarch64.c (aarch64_evpc_reencode): New function

gcc/testsuite/ChangeLog:

2020-06-11  Andrew Pinski   

	PR gcc/82199

	* gcc.target/aarch64/vdup_n_3.c: New test
	* gcc.target/aarch64/vzip_1.c: New test
	* gcc.target/aarch64/vzip_2.c: New test
	* gcc.target/aarch64/vzip_3.c: New test
	* gcc.target/aarch64/vzip_4.c: New test

Co-Authored-By:	Dmitrij Pochepko	
---
 gcc/config/aarch64/aarch64.c| 81 +
 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c | 16 ++
 gcc/testsuite/gcc.target/aarch64/vzip_1.c   | 11 
 gcc/testsuite/gcc.target/aarch64/vzip_2.c   | 12 +
 gcc/testsuite/gcc.target/aarch64/vzip_3.c   | 12 +
 gcc/testsuite/gcc.target/aarch64/vzip_4.c   | 12 +
 6 files changed, 144 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vdup_n_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vzip_4.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65a..ab7b39e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -19889,6 +19889,8 @@ struct expand_vec_perm_d
   bool testing_p;
 };
 
+static bool aarch64_expand_vec_perm_const_1 (struct expand_vec_perm_d *d);
+
 /* Generate a variable permutation.  */
 
 static void
@@ -20074,6 +20076,83 @@ aarch64_evpc_trn (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* Try to re-encode the PERM constant so it use the bigger size up.
+   This rewrites constants such as {0, 1, 4, 5}/V4SF to {0, 2}/V2DI.
+   We retry with this new constant with the full suite of patterns.  */
+static bool
+aarch64_evpc_reencode (struct expand_vec_perm_d *d)
+{
+  expand_vec_perm_d newd;
+  unsigned HOST_WIDE_INT nelt;
+
+  if (d->vec_flags != VEC_ADVSIMD)
+return false;
+
+  unsigned int encoded_nelts = d->perm.encoding ().encoded_nelts ();
+  for (unsigned int i = 0; i < encoded_nelts; ++i)
+if (!d->perm[i].is_constant ())
+  return false;
+
+  /* to_constant is safe since this routine is specific to Advanced SIMD
+ vectors.  */
+  nelt = d->perm.length ().to_constant ();
+
+  /* Get the new mode.  Always twice the size of the inner
+ and half the elements.  */
+  machine_mode new_mode;
+  switch (d->vmode)
+{
+/* 128bit vectors.  */
+case E_V4SFmode:
+case E_V4SImode:
+  new_mode = V2DImode;
+  break;
+case E_V8BFmode:
+case E_V8HFmode:
+case E_V8HImode:
+  new_mode = V4SImode;
+  break;
+case E_V16QImode:
+  new_mode = V8HImode;
+  break;
+/* 64bit vectors.  */
+case E_V4BFmode:
+case E_V4HFmode:
+case E_V4HImode:
+  new_mode = V2SImode;
+  break;

Re: [PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-10-01 Thread Dmitrij Pochepko
Hi Richard,

I updated the patch according to all your comments.
Also bootstrapped and tested again on x86_64-pc-linux-gnu and 
aarch64-linux-gnu, which took some time.

attached v3.

Thanks,
Dmitrij

On Thu, Sep 26, 2019 at 09:47:04AM +0200, Richard Biener wrote:
> On Tue, Sep 24, 2019 at 5:29 PM Dmitrij Pochepko
>  wrote:
> >
> > Hi,
> >
> > can anybody take a look at v2?
> 
> +(if (tree_to_uhwi (@4) == 1
> + && tree_to_uhwi (@10) == 2 && tree_to_uhwi (@5) == 4
> 
> those will still ICE for large __int128_t constants.  Since you do not match
> any conversions you should probably restrict the precision of 'type' like
> with
>(if (TYPE_PRECISION (type) <= 64
> && tree_to_uhwi (@4) ...
> 
> likewise tree_to_uhwi will fail for negative constants thus if the
> pattern assumes
> unsigned you should verify that as well with && TYPE_UNSIGNED  (type).
> 
> Your 'argtype' is simply 'type' so you can elide it.
> 
> +   (switch
> +   (if (types_match (argtype, long_long_unsigned_type_node))
> + (convert (BUILT_IN_POPCOUNTLL:integer_type_node @0)))
> +   (if (types_match (argtype, long_unsigned_type_node))
> + (convert (BUILT_IN_POPCOUNTL:integer_type_node @0)))
> +   (if (types_match (argtype, unsigned_type_node))
> + (convert (BUILT_IN_POPCOUNT:integer_type_node @0)))
> 
> Please test small types first so we can avoid popcountll when long == long 
> long
> or long == int.  I also wonder if we really want to use the builtins and
> check optab availability or if we nowadays should use
> direct_internal_fn_supported_p (IFN_POPCOUNT, integer_type_node, type,
> OPTIMIZE_FOR_BOTH) and
> 
> (convert (IFN_POPCOUNT:type @0))
> 
> without the switch?
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Dmitrij
> >
> > On Mon, Sep 09, 2019 at 10:03:40PM +0300, Dmitrij Pochepko wrote:
> > > Hi all.
> > >
> > > Please take a look at v2 (attached).
> > > I changed patch according to review comments. The same testing was 
> > > performed again.
> > >
> > > Thanks,
> > > Dmitrij
> > >
> > > On Thu, Sep 05, 2019 at 06:34:49PM +0300, Dmitrij Pochepko wrote:
> > > > This patch adds matching for Hamming weight (popcount) implementation. 
> > > > The following sources:
> > > >
> > > > int
> > > > foo64 (unsigned long long a)
> > > > {
> > > > unsigned long long b = a;
> > > > b -= ((b>>1) & 0x5555555555555555ULL);
> > > > b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
> > > > b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
> > > > b *= 0x0101010101010101ULL;
> > > > return (int)(b >> 56);
> > > > }
> > > >
> > > > and
> > > >
> > > > int
> > > > foo32 (unsigned int a)
> > > > {
> > > > unsigned long b = a;
> > > > b -= ((b>>1) & 0x55555555UL);
> > > > b = ((b>>2) & 0x33333333UL) + (b & 0x33333333UL);
> > > > b = ((b>>4) + b) & 0x0F0F0F0FUL;
> > > > b *= 0x01010101UL;
> > > > return (int)(b >> 24);
> > > > }
> > > >
> > > > and equivalents are now recognized as popcount for platforms with hw 
> > > > popcount support. Bootstrapped and tested on x86_64-pc-linux-gnu and 
> > > > aarch64-linux-gnu systems with no regressions.
> > > >
> > > > (I have no write access to repo)
> > > >
> > > > Thanks,
> > > > Dmitrij
> > > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > PR tree-optimization/90836
> > > >
> > > > * gcc/match.pd (popcount): New pattern.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > PR tree-optimization/90836
> > > >
> > > > * lib/target-supports.exp (check_effective_target_popcount)
> > > > (check_effective_target_popcountll): New effective targets.
> > > > * gcc.dg/tree-ssa/popcount4.c: New test.
> > > > * gcc.dg/tree-ssa/popcount4l.c: New test.
> > > > * gcc.dg/tree-ssa/popcount4ll.c: New test.
> > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 0317bc7..b1867bf 100644
> &g

Re: [PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-09-24 Thread Dmitrij Pochepko
Hi,

can anybody take a look at v2?

Thanks,
Dmitrij

On Mon, Sep 09, 2019 at 10:03:40PM +0300, Dmitrij Pochepko wrote:
> Hi all.
> 
> Please take a look at v2 (attached).
> I changed patch according to review comments. The same testing was performed 
> again.
> 
> Thanks,
> Dmitrij
> 
> On Thu, Sep 05, 2019 at 06:34:49PM +0300, Dmitrij Pochepko wrote:
> > This patch adds matching for Hamming weight (popcount) implementation. The 
> > following sources:
> > 
> > int
> > foo64 (unsigned long long a)
> > {
> > unsigned long long b = a;
> > b -= ((b>>1) & 0x5555555555555555ULL);
> > b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
> > b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
> > b *= 0x0101010101010101ULL;
> > return (int)(b >> 56);
> > }
> > 
> > and
> > 
> > int
> > foo32 (unsigned int a)
> > {
> > unsigned long b = a;
> > b -= ((b>>1) & 0x55555555UL);
> > b = ((b>>2) & 0x33333333UL) + (b & 0x33333333UL);
> > b = ((b>>4) + b) & 0x0F0F0F0FUL;
> > b *= 0x01010101UL;
> > return (int)(b >> 24);
> > }
> > 
> > and equivalents are now recognized as popcount for platforms with hw 
> > popcount support. Bootstrapped and tested on x86_64-pc-linux-gnu and 
> > aarch64-linux-gnu systems with no regressions. 
> > 
> > (I have no write access to repo)
> > 
> > Thanks,
> > Dmitrij
> > 
> > 
> > gcc/ChangeLog:
> > 
> > PR tree-optimization/90836
> > 
> > * gcc/match.pd (popcount): New pattern.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > PR tree-optimization/90836
> > 
> > * lib/target-supports.exp (check_effective_target_popcount)
> > (check_effective_target_popcountll): New effective targets.
> > * gcc.dg/tree-ssa/popcount4.c: New test.
> > * gcc.dg/tree-ssa/popcount4l.c: New test.
> > * gcc.dg/tree-ssa/popcount4ll.c: New test.
> 
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 0317bc7..b1867bf 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -5358,6 +5358,70 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >(cmp (popcount @0) integer_zerop)
> >(rep @0 { build_zero_cst (TREE_TYPE (@0)); }
> >  
> > +/* 64- and 32-bits branchless implementations of popcount are detected:
> > +
> > +   int popcount64c (uint64_t x)
> > +   {
> > + x -= (x >> 1) & 0x5555555555555555ULL;
> > + x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
> > + x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
> > + return (x * 0x0101010101010101ULL) >> 56;
> > +   }
> > +
> > +   int popcount32c (uint32_t x)
> > +   {
> > + x -= (x >> 1) & 0x55555555;
> > + x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
> > + x = (x + (x >> 4)) & 0x0f0f0f0f;
> > + return (x * 0x01010101) >> 24;
> > +   }  */
> > +(simplify
> > +  (convert
> > +(rshift
> > +  (mult
> > +   (bit_and:c
> > + (plus:c
> > +   (rshift @8 INTEGER_CST@5)
> > +   (plus:c@8
> > + (bit_and @6 INTEGER_CST@7)
> > + (bit_and
> > +   (rshift
> > + (minus@6
> > +   @0
> > +   (bit_and
> > + (rshift @0 INTEGER_CST@4)
> > + INTEGER_CST@11))
> > + INTEGER_CST@10)
> > +   INTEGER_CST@9)))
> > + INTEGER_CST@3)
> > +   INTEGER_CST@2)
> > +  INTEGER_CST@1))
> > +  /* Check constants and optab.  */
> > +  (with
> > + {
> > +   tree argtype = TREE_TYPE (@0);
> > +   unsigned prec = TYPE_PRECISION (argtype);
> > +   int shift = TYPE_PRECISION (long_long_unsigned_type_node) - prec;
> > +   const unsigned long long c1 = 0x0101010101010101ULL >> shift,
> > +   c2 = 0x0F0F0F0F0F0F0F0FULL >> shift,
> > +   c3 = 0xULL >> shift,
> > +   c4 = 0xULL >> shift;
> > + }
> > +(if (types_match (type, integer_type_node) && tree_to_uhwi (@4) == 1
> > + && tree_to_uhwi (@10) == 2 && tree_to_uhwi (@5) == 4
> > + && tree_to_uhwi (@1) == prec - 8 && tree_to_uhwi (@2)

Re: [PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-09-09 Thread Dmitrij Pochepko
Hi all.

Please take a look at v2 (attached).
I changed patch according to review comments. The same testing was performed 
again.

Thanks,
Dmitrij

On Thu, Sep 05, 2019 at 06:34:49PM +0300, Dmitrij Pochepko wrote:
> This patch adds matching for Hamming weight (popcount) implementation. The 
> following sources:
> 
> int
> foo64 (unsigned long long a)
> {
> unsigned long long b = a;
> b -= ((b>>1) & 0x5555555555555555ULL);
> b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
> b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
> b *= 0x0101010101010101ULL;
> return (int)(b >> 56);
> }
> 
> and
> 
> int
> foo32 (unsigned int a)
> {
> unsigned long b = a;
> b -= ((b>>1) & 0x55555555UL);
> b = ((b>>2) & 0x33333333UL) + (b & 0x33333333UL);
> b = ((b>>4) + b) & 0x0F0F0F0FUL;
> b *= 0x01010101UL;
> return (int)(b >> 24);
> }
> 
> and equivalents are now recognized as popcount for platforms with hw popcount 
> support. Bootstrapped and tested on x86_64-pc-linux-gnu and aarch64-linux-gnu 
> systems with no regressions. 
> 
> (I have no write access to repo)
> 
> Thanks,
> Dmitrij
> 
> 
> gcc/ChangeLog:
> 
>   PR tree-optimization/90836
> 
>   * gcc/match.pd (popcount): New pattern.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/90836
> 
>   * lib/target-supports.exp (check_effective_target_popcount)
>   (check_effective_target_popcountll): New effective targets.
>   * gcc.dg/tree-ssa/popcount4.c: New test.
>   * gcc.dg/tree-ssa/popcount4l.c: New test.
>   * gcc.dg/tree-ssa/popcount4ll.c: New test.

> diff --git a/gcc/match.pd b/gcc/match.pd
> index 0317bc7..b1867bf 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5358,6 +5358,70 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>(cmp (popcount @0) integer_zerop)
>(rep @0 { build_zero_cst (TREE_TYPE (@0)); }
>  
> +/* 64- and 32-bits branchless implementations of popcount are detected:
> +
> +   int popcount64c (uint64_t x)
> +   {
> > + x -= (x >> 1) & 0x5555555555555555ULL;
> > + x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
> + x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
> + return (x * 0x0101010101010101ULL) >> 56;
> +   }
> +
> +   int popcount32c (uint32_t x)
> +   {
> > + x -= (x >> 1) & 0x55555555;
> > + x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
> + x = (x + (x >> 4)) & 0x0f0f0f0f;
> + return (x * 0x01010101) >> 24;
> +   }  */
> +(simplify
> +  (convert
> +(rshift
> +  (mult
> + (bit_and:c
> +   (plus:c
> + (rshift @8 INTEGER_CST@5)
> + (plus:c@8
> +   (bit_and @6 INTEGER_CST@7)
> +   (bit_and
> + (rshift
> +   (minus@6
> + @0
> + (bit_and
> +   (rshift @0 INTEGER_CST@4)
> +   INTEGER_CST@11))
> +   INTEGER_CST@10)
> + INTEGER_CST@9)))
> +   INTEGER_CST@3)
> + INTEGER_CST@2)
> +  INTEGER_CST@1))
> +  /* Check constants and optab.  */
> +  (with
> + {
> +   tree argtype = TREE_TYPE (@0);
> +   unsigned prec = TYPE_PRECISION (argtype);
> +   int shift = TYPE_PRECISION (long_long_unsigned_type_node) - prec;
> +   const unsigned long long c1 = 0x0101010101010101ULL >> shift,
> + c2 = 0x0F0F0F0F0F0F0F0FULL >> shift,
> > + c3 = 0x3333333333333333ULL >> shift,
> > + c4 = 0x5555555555555555ULL >> shift;
> + }
> +(if (types_match (type, integer_type_node) && tree_to_uhwi (@4) == 1
> +   && tree_to_uhwi (@10) == 2 && tree_to_uhwi (@5) == 4
> +   && tree_to_uhwi (@1) == prec - 8 && tree_to_uhwi (@2) == c1
> +   && tree_to_uhwi (@3) == c2 && tree_to_uhwi (@9) == c3
> +   && tree_to_uhwi (@7) == c3 && tree_to_uhwi (@11) == c4
> +   && optab_handler (popcount_optab, TYPE_MODE (argtype))
> + != CODE_FOR_nothing)
> + (switch
> + (if (types_match (argtype, long_long_unsigned_type_node))
> +   (BUILT_IN_POPCOUNTLL @0))
> + (if (types_match (argtype, long_unsigned_type_node))
> +   (BUILT_IN_POPCOUNTL @0))
> + (if (types_match (argtype, unsigned_type_node))
> +   (BUILT_IN_POPCOUNT @0))
> +
>  /* Simplify:
>  
>  

Re: [PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-09-09 Thread Dmitrij Pochepko
Hi,

thank you for looking into it.

On Fri, Sep 06, 2019 at 12:13:34PM +0000, Wilco Dijkstra wrote:
> Hi,
> 
> +(simplify
> +  (convert
> +(rshift
> +  (mult
> 
> > is the outer convert really necessary?  That is, if we change
> > the simplification result to
> 
> Indeed that should be "convert?" to make it optional.
> 

I removed this one as Richard suggested in the new patch version.

> > Is the Hamming weight popcount
> > faster than the libgcc table-based approach?  I wonder if we really
> > need to restrict this conversion to the case where the target
> > has an expander.
> 
> Well libgcc uses the exact same sequence (not a table):
> 
> objdump -d ./aarch64-unknown-linux-gnu/libgcc/_popcountsi2.o
> 
>  <__popcountdi2>:
>    0:	d341fc01	lsr	x1, x0, #1
>    4:	b200c3e3	mov	x3, #0x101010101010101	// #72340172838076673
>    8:	9200f021	and	x1, x1, #0x5555555555555555
>    c:	cb010001	sub	x1, x0, x1
>   10:	9200e422	and	x2, x1, #0x3333333333333333
>   14:	d342fc21	lsr	x1, x1, #2
>   18:	9200e421	and	x1, x1, #0x3333333333333333
>   1c:	8b010041	add	x1, x2, x1
>   20:	8b411021	add	x1, x1, x1, lsr #4
>   24:	9200cc20	and	x0, x1, #0xf0f0f0f0f0f0f0f
>   28:	9b037c00	mul	x0, x0, x3
>   2c:	d378fc00	lsr	x0, x0, #56
>   30:	d65f03c0	ret
> 
> So if you don't check for an expander you get an endless loop in libgcc since
> the makefile doesn't appear to use -fno-builtin anywhere...

The patch is designed to avoid such an endless loop: on supported platforms the
libgcc popcount routine is compiled into the CPU popcount instruction(s), and the
patch only allows the simplification on those platforms. This is implemented via the
"optab_handler (popcount_optab, TYPE_MODE (argtype)) != CODE_FOR_nothing" check.

Thanks,
Dmitrij

> 
> Wilco
> 


Re: [PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-09-09 Thread Dmitrij Pochepko
Hi,

thank you for looking into it.

On Fri, Sep 06, 2019 at 12:23:40PM +0200, Richard Biener wrote:
> On Thu, Sep 5, 2019 at 5:35 PM Dmitrij Pochepko
>  wrote:
> >
> > This patch adds matching for Hamming weight (popcount) implementation. The 
> > following sources:
> >
> > int
> > foo64 (unsigned long long a)
> > {
> > unsigned long long b = a;
> > b -= ((b>>1) & 0x5555555555555555ULL);
> > b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
> > b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
> > b *= 0x0101010101010101ULL;
> > return (int)(b >> 56);
> > }
> >
> > and
> >
> > int
> > foo32 (unsigned int a)
> > {
> > unsigned long b = a;
> > b -= ((b>>1) & 0x55555555UL);
> > b = ((b>>2) & 0x33333333UL) + (b & 0x33333333UL);
> > b = ((b>>4) + b) & 0x0F0F0F0FUL;
> > b *= 0x01010101UL;
> > return (int)(b >> 24);
> > }
> >
> > and equivalents are now recognized as popcount for platforms with hw 
> > popcount support. Bootstrapped and tested on x86_64-pc-linux-gnu and 
> > aarch64-linux-gnu systems with no regressions.
> >
> > (I have no write access to repo)
> 
> +(simplify
> +  (convert
> +(rshift
> +  (mult
> 
> is the outer convert really necessary?  That is, if we change
> the simplification result to
> 
>  (convert (BUILT_IN_POPCOUNT @0))
> 
> wouldn't that be correct as well?

Yes, this is better. I fixed it in the new version.

> 
> Is the Hamming weight popcount
> faster than the libgcc table-based approach?  I wonder if we really
> need to restrict this conversion to the case where the target
> has an expander.
> 
> +  (mult
> +   (bit_and:c
> 
> this doesn't need :c (second operand is a constant).

Yes. Agree, this is redundant.

> 
> +   int shift = TYPE_PRECISION (long_long_unsigned_type_node) - prec;
> +   const unsigned long long c1 = 0x0101010101010101ULL >> shift,
> 
> I think this mixes host and target properties.  I guess intead of
> 'const unsigned long long' you want to use 'const uint64_t' and
> instead of TYPE_PRECISION (long_long_unsigned_type_node) 64?
> Since you are later comparing with unsigned HOST_WIDE_INT
> eventually unsigned HOST_WIDE_INT is better (that's always 64bit as well).

Agree. It is better to use HOST_WIDE_INT.

> 
> You are using tree_to_uhwi but nowhere verifying if @0 is unsigned.
> What happens if 'prec' is > 64?  (__int128 ...).  Ah, I guess the
> final selection will simply select nothing...
> 
> Otherwise the patch looks reasonable, even if the pattern
> is a bit unwieldly... ;)
> 
> Does it work for targets where 'unsigned int' is smaller than 32bit?

Yes. The only 16-bit-int architecture with hardware popcount support is avr.
I built gcc for avr and checked that the 16-bit popcount algorithm is recognized
successfully.
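
For reference, a 16-bit version of the same sequence (illustrative sketch only,
assuming a 16-bit unsigned int as on avr, so the final shift is prec - 8 == 8):

int
popcount16 (unsigned int x)
{
  x -= (x >> 1) & 0x5555U;
  x = (x & 0x3333U) + ((x >> 2) & 0x3333U);
  x = (x + (x >> 4)) & 0x0f0fU;
  return (x * 0x0101U) >> 8;
}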

Thanks,
Dmitrij

> 
> Thanks,
> Richard.
> >
> > Thanks,
> > Dmitrij
> >
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/90836
> >
> > * gcc/match.pd (popcount): New pattern.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/90836
> >
> > * lib/target-supports.exp (check_effective_target_popcount)
> > (check_effective_target_popcountll): New effective targets.
> > * gcc.dg/tree-ssa/popcount4.c: New test.
> > * gcc.dg/tree-ssa/popcount4l.c: New test.
> > * gcc.dg/tree-ssa/popcount4ll.c: New test.


[PATCH] PR tree-optimization/90836 Missing popcount pattern matching

2019-09-05 Thread Dmitrij Pochepko
This patch adds matching for Hamming weight (popcount) implementation. The 
following sources:

int
foo64 (unsigned long long a)
{
unsigned long long b = a;
b -= ((b>>1) & 0x5555555555555555ULL);
b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
b *= 0x0101010101010101ULL;
return (int)(b >> 56);
}

and

int
foo32 (unsigned int a)
{
unsigned long b = a;
b -= ((b>>1) & 0x55555555UL);
b = ((b>>2) & 0x33333333UL) + (b & 0x33333333UL);
b = ((b>>4) + b) & 0x0F0F0F0FUL;
b *= 0x01010101UL;
return (int)(b >> 24);
}

and equivalents are now recognized as popcount for platforms with hw popcount 
support. Bootstrapped and tested on x86_64-pc-linux-gnu and aarch64-linux-gnu 
systems with no regressions. 

(I have no write access to repo)

Thanks,
Dmitrij


gcc/ChangeLog:

PR tree-optimization/90836

* gcc/match.pd (popcount): New pattern.

gcc/testsuite/ChangeLog:

PR tree-optimization/90836

* lib/target-supports.exp (check_effective_target_popcount)
(check_effective_target_popcountll): New effective targets.
* gcc.dg/tree-ssa/popcount4.c: New test.
* gcc.dg/tree-ssa/popcount4l.c: New test.
* gcc.dg/tree-ssa/popcount4ll.c: New test.
diff --git a/gcc/match.pd b/gcc/match.pd
index 0317bc7..b1867bf 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5358,6 +5358,70 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (cmp (popcount @0) integer_zerop)
   (rep @0 { build_zero_cst (TREE_TYPE (@0)); }
 
+/* 64- and 32-bits branchless implementations of popcount are detected:
+
+   int popcount64c (uint64_t x)
+   {
+ x -= (x >> 1) & 0x5555555555555555ULL;
+ x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
+ x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
+ return (x * 0x0101010101010101ULL) >> 56;
+   }
+
+   int popcount32c (uint32_t x)
+   {
+ x -= (x >> 1) & 0x55555555;
+ x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
+ x = (x + (x >> 4)) & 0x0f0f0f0f;
+ return (x * 0x01010101) >> 24;
+   }  */
+(simplify
+  (convert
+(rshift
+  (mult
+	(bit_and:c
+	  (plus:c
+	(rshift @8 INTEGER_CST@5)
+	(plus:c@8
+	  (bit_and @6 INTEGER_CST@7)
+	  (bit_and
+		(rshift
+		  (minus@6
+		@0
+		(bit_and
+		  (rshift @0 INTEGER_CST@4)
+		  INTEGER_CST@11))
+		  INTEGER_CST@10)
+		INTEGER_CST@9)))
+	  INTEGER_CST@3)
+	INTEGER_CST@2)
+  INTEGER_CST@1))
+  /* Check constants and optab.  */
+  (with
+ {
+   tree argtype = TREE_TYPE (@0);
+   unsigned prec = TYPE_PRECISION (argtype);
+   int shift = TYPE_PRECISION (long_long_unsigned_type_node) - prec;
+   const unsigned long long c1 = 0x0101010101010101ULL >> shift,
+c2 = 0x0F0F0F0F0F0F0F0FULL >> shift,
+c3 = 0x3333333333333333ULL >> shift,
+c4 = 0x5555555555555555ULL >> shift;
+ }
+(if (types_match (type, integer_type_node) && tree_to_uhwi (@4) == 1
+	  && tree_to_uhwi (@10) == 2 && tree_to_uhwi (@5) == 4
+	  && tree_to_uhwi (@1) == prec - 8 && tree_to_uhwi (@2) == c1
+	  && tree_to_uhwi (@3) == c2 && tree_to_uhwi (@9) == c3
+	  && tree_to_uhwi (@7) == c3 && tree_to_uhwi (@11) == c4
+	  && optab_handler (popcount_optab, TYPE_MODE (argtype))
+	!= CODE_FOR_nothing)
+	(switch
+	(if (types_match (argtype, long_long_unsigned_type_node))
+	  (BUILT_IN_POPCOUNTLL @0))
+	(if (types_match (argtype, long_unsigned_type_node))
+	  (BUILT_IN_POPCOUNTL @0))
+	(if (types_match (argtype, unsigned_type_node))
+	  (BUILT_IN_POPCOUNT @0))
+
 /* Simplify:
 
  a = a1 op a2
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount4.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount4.c
new file mode 100644
index 000..9f759f8
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount4.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target popcount } */
+/* { dg-require-effective-target int32plus } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+const unsigned m1  = 0x55555555UL;
+const unsigned m2  = 0x33333333UL;
+const unsigned m4  = 0x0F0F0F0FUL;
+const unsigned h01 = 0x01010101UL;
+const int shift = 24;
+
+int popcount64c(unsigned x)
+{
+x -= (x >> 1) & m1;
+x = (x & m2) + ((x >> 2) & m2);
+x = (x + (x >> 4)) & m4;
+return (x * h01) >> shift;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 1 "optimized" } } */
+
+
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount4l.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount4l.c
new file mode 100644
index 000..ab33f79
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount4l.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target popcountl } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+#if __SIZEOF_LONG__ == 4
+const unsigned long m1  = 0x55555555UL;
+const unsigned long m2  = 0x33333333UL;
+const unsigned long m4  = 0x0F0F0F0FUL;
+const unsigned long h01