Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-05-04 Thread Alexandre Oliva
On May  4, 2021, Prathamesh Kulkarni  wrote:

> It looks like constfun's prototype had a typo, with a missing 2nd param of
> type void *.

Ugh, the patch for https://gcc.gnu.org/PR90773 added the param the day
after I retested the patch, and I did not give it yet another spin
before checking it in :-(

> I committed the fix as obvious in the following commit:
> https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5fbe6a8e73b52c6ebc28b9111456226c1cda6472

Thank you!

-- 
Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>


Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-05-03 Thread Prathamesh Kulkarni via Gcc-patches
On Tue, 4 May 2021 at 07:30, Alexandre Oliva  wrote:
>
> On May  3, 2021, Richard Biener  wrote:
>
> > On Fri, Apr 30, 2021 at 4:42 PM Jeff Law  wrote:
> >>
> >>
> >> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
> >> > On Feb 22, 2021, Richard Biener  wrote:
> >> >
> >> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  
> >> >> wrote:
> >> >>> Here's an improved version of the patch.  Regstrapped on
> >> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> >> >>> ahead of setmem, and also tested with riscv32-elf.
> >> >>>
> >> >>> Is it ok to install?  Or should it wait for stage1?
> >> >> It generally looks OK but I'd wait for stage1.
> >> > 'k, I've retested the patch now, and it's still working as expected.
> >>
> >> Then I'd go forward with it at this point.
>
> > Yes, it looks good to me as well.
>
> Thanks, I've put it in.
Hi,
This caused the following build error:
../../gcc/gcc/builtins.c:6750:18: error: invalid conversion from
‘rtx_def* (*)(void*, void*, long int, scalar_int_mode)’ to
‘rtx_def* (*)(void*, long int, scalar_int_mode)’ [-fpermissive]
 6750 |   constfun = builtin_memset_gen_str;

It looks like constfun's prototype had a typo, with a missing 2nd param of type void *.
I committed the fix as obvious in the following commit:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=5fbe6a8e73b52c6ebc28b9111456226c1cda6472
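
(For reference, the underlying problem and the fix are visible in the error
message above: the by-pieces callbacks gained a second void * parameter with
the PR90773 patch, so the local callback pointer needs the matching type.  A
sketch of the corrected declaration, inferred from the error message rather
than copied from the commit:

  rtx (*constfun) (void *, void *, HOST_WIDE_INT, scalar_int_mode);

with which the assignment of builtin_memset_gen_str becomes type-correct.)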

Thanks,
Prathamesh
>
> --
> Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>


Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-05-03 Thread Alexandre Oliva
On May  3, 2021, Richard Biener  wrote:

> On Fri, Apr 30, 2021 at 4:42 PM Jeff Law  wrote:
>> 
>> 
>> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
>> > On Feb 22, 2021, Richard Biener  wrote:
>> >
>> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  wrote:
>> >>> Here's an improved version of the patch.  Regstrapped on
>> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
>> >>> ahead of setmem, and also tested with riscv32-elf.
>> >>>
>> >>> Is it ok to install?  Or should it wait for stage1?
>> >> It generally looks OK but I'd wait for stage1.
>> > 'k, I've retested the patch now, and it's still working as expected.
>> 
>> Then I'd go forward with it at this point.

> Yes, it looks good to me as well.

Thanks, I've put it in.

-- 
Alexandre Oliva, happy hacker   https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>


Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-05-03 Thread Richard Biener via Gcc-patches
On Fri, Apr 30, 2021 at 4:42 PM Jeff Law  wrote:
>
>
> On 4/28/2021 10:26 PM, Alexandre Oliva wrote:
> > On Feb 22, 2021, Richard Biener  wrote:
> >
> >> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  wrote:
> >>> Here's an improved version of the patch.  Regstrapped on
> >>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> >>> ahead of setmem, and also tested with riscv32-elf.
> >>>
> >>> Is it ok to install?  Or should it wait for stage1?
> >> It generally looks OK but I'd wait for stage1.
> > 'k, I've retested the patch now, and it's still working as expected.
>
> Then I'd go forward with it at this point.

Yes, it looks good to me as well.

Thanks,
Richard.

>
>
> >
> > I've dropped references to PR94092, since it's a different issue.
>
> ACK.  Note I just put a patch into pr94092 that may be of interest.  In
> simplest terms it tries to use REGNO_POINTER_ALIGN to increase the
> alignment of MEMs.  We'd been using it internally for a bit before we
> went a slightly different direction.  Feel free to use it if it's helpful.
>
>
> jeff
>


Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-04-30 Thread Jeff Law via Gcc-patches



On 4/28/2021 10:26 PM, Alexandre Oliva wrote:

On Feb 22, 2021, Richard Biener  wrote:


On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  wrote:

Here's an improved version of the patch.  Regstrapped on
x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
ahead of setmem, and also tested with riscv32-elf.

Is it ok to install?  Or should it wait for stage1?

It generally looks OK but I'd wait for stage1.

'k, I've retested the patch now, and it's still working as expected.


Then I'd go forward with it at this point.





I've dropped references to PR94092, since it's a different issue.


ACK.  Note I just put a patch into pr94092 that may be of interest.  In 
simplest terms it tries to use REGNO_POINTER_ALIGN to increase the 
alignment of MEMs.  We'd been using it internally for a bit before we 
went a slightly different direction.  Feel free to use it if it's helpful.



jeff



Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-04-28 Thread Alexandre Oliva
On Feb 22, 2021, Richard Biener  wrote:

> On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  wrote:
>> 
>> Here's an improved version of the patch.  Regstrapped on
>> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
>> ahead of setmem, and also tested with riscv32-elf.
>> 
>> Is it ok to install?  Or should it wait for stage1?

> It generally looks OK but I'd wait for stage1.

'k, I've retested the patch now, and it's still working as expected.

> I'd also like to see
> comments from others.  Note that I think we need to guard
> the loop emitting on optimize_*_for_speed/!size.

There's no loop, it's straightline codegen with interspersed
conditionals if we'd output straightline code for a slightly larger
block.

> It's also not entirely
> clear to me how the code avoids expanding all > 1 byte block size
> memsets to a loop (thus bypassing more optimal libc implementations
> for larger sizes).

It doesn't preempt preexisting expansions, which may cover even
variable-length cases (e.g. with an insn expander), but when we get to
it, it's limited by the same tuning defines that limit move by pieces.

> I also remember that we sometimes do a dynamic
> dispatch between inlined and non-inlined code paths, though that might
> be target specific handling in the x86 backend.

Yup, no effect on those.


I've dropped references to PR94092, since it's a different issue.


Re-regstrapped on x86_64-linux-gnu, with and without a patchlet that
exercises it far more thoroughly.  Ok to install?



introduce try store by multiple pieces

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length clearing memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.
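
(As an illustration of the strategy: for a destination known to be
word-aligned, with a length that's a multiple of 4 and at most 31, the
expansion is roughly equivalent to the C below, where each memset stands
for a fixed-size store_by_pieces sequence rather than a library call.
This is a sketch of the idea, not the exact generated code:

  #include <string.h>

  void
  zero_tail (char *p, unsigned len)  /* len % 4 == 0 && len <= 31 */
  {
    if (len >= 16) { memset (p, 0, 16); p += 16; len -= 16; }
    if (len >= 8)  { memset (p, 0, 8);  p += 8;  len -= 8;  }
    if (len >= 4)  { memset (p, 0, 4);  p += 4;  len -= 4;  }
  }

One guarded block per power of two between the length's ctz and its
highest possible set bit, so O(log len) compares and no loop.)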

It also changes the last copy-prop pass into ccp, so that pointer
alignment and length's nonzero bits are detected and made available
for the expander, even for ldist-introduced SSA_NAMEs.
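
(In passes.def terms, the intended change is along these lines; a sketch
of the intent, the actual hunk may differ in context:

  -      NEXT_PASS (pass_copy_prop);
  +      /* Run CCP to compute alignment and nonzero bits.  */
  +      NEXT_PASS (pass_ccp, true /* nonzero_p */);
)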


for  gcc/ChangeLog

* builtins.c (try_store_by_multiple_pieces): New.
(expand_builtin_memset_args): Use it.  If target_char_cast
fails, proceed as for non-constant val.  Pass len's ctz to...
* expr.c (clear_storage_hints): ... this.  Try store by
multiple pieces after setmem.
(clear_storage): Adjust.
* expr.h (clear_storage_hints): Likewise.
(try_store_by_multiple_pieces): Declare.
* passes.def: Replace the last copy_prop with ccp.
---
 gcc/builtins.c |  182 ++--
 gcc/expr.c |9 ++-
 gcc/expr.h |   13 
 gcc/passes.def |5 +-
 4 files changed, 197 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 8c5324bf7de9a..e3ccb3c17dbb3 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6645,6 +6645,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
+   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   from most to least significant, guarded by a test on whether there
+   are at least that many bytes left to copy in LEN.
+
+   ??? Should we skip some powers of two in favor of loops?  Maybe start
+   at the max of TO/LEN/word alignment, at least when optimizing for
+   size, instead of ensuring O(log len) dynamic compares?  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
+                              unsigned HOST_WIDE_INT min_len,
+                              unsigned HOST_WIDE_INT max_len,
+                              rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+  int sctz_len = ctz_len;
+
+  gcc_checking_assert (sctz_len >= 0);
+
+  if (val)
+    valc = 1;
+
+  /* Bits more significant than TST_BITS are part of the shared prefix
+     in the binary representation of both min_len and max_len.  Since
+     they're identical, we don't need to test them in the loop.  */
+  int tst_bits = (max_bits != min_bits ? max_bits
+                  : floor_log2 (max_len ^ min_len));
+
+  /* Check whether it's profitable to start by storing a fixed BLKSIZE
+     bytes, to lower max_bits.  In the unlikely case of a constant LEN
+     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
+     single store_by_pieces, but otherwise, select the minimum multiple
+     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
+     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
+  unsigned HOST_WIDE_INT blksize;
+  if (max_len > min_len)
+    {
+      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
+                                          align / BITS_PER_UNIT);
+      blksize = max_len - (HOST_WIDE_INT_1U << tst_bits) + alrng;
+      blksize &= ~(alrng - 1);
+    }
+  else if (max_len == min_len)
+    blksize = 

Re: [PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-02-22 Thread Richard Biener via Gcc-patches
On Fri, Feb 19, 2021 at 9:08 AM Alexandre Oliva  wrote:
>
> Here's an improved version of the patch.  Regstrapped on
> x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
> ahead of setmem, and also tested with riscv32-elf.
>
> Is it ok to install?  Or should it wait for stage1?

It generally looks OK but I'd wait for stage1.  I'd also like to see
comments from others.  Note that I think we need to guard
the loop emitting on optimize_*_for_speed/!size.  It's also not entirely
clear to me how the code avoids expanding all > 1 byte block size
memsets to a loop (thus bypassing more optimal libc implementations
for larger sizes).  I also remember that we sometimes do a dynamic
dispatch between inlined and non-inlined code paths, though that might
be target specific handling in the x86 backend.

Thanks,
Richard.

> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva 
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length clearing memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also changes the last copy-prop pass into ccp, so that pointer
> alignment and length's nonzero bits are detected and made available
> for the expander, even for ldist-introduced SSA_NAMEs.
>
>
> for  gcc/ChangeLog
>
> PR tree-optimization/94092
> * builtins.c (try_store_by_multiple_pieces): New.
> (expand_builtin_memset_args): Use it.  If target_char_cast
> fails, proceed as for non-constant val.  Pass len's ctz to...
> * expr.c (clear_storage_hints): ... this.  Try store by
> multiple pieces after setmem.
> (clear_storage): Adjust.
> * expr.h (clear_storage_hints): Likewise.
> (try_store_by_multiple_pieces): Declare.
> * passes.def: Replace the last copy_prop with ccp.
> ---
>  gcc/builtins.c |  182 ++--
>  gcc/expr.c |9 ++-
>  gcc/expr.h |   13 
>  gcc/passes.def |5 +-
>  4 files changed, 197 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..05b14f2a3f6d3 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
> +   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
> +   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
> +
> +   The strategy is to issue one store_by_pieces for each power of two,
> +   from most to least significant, guarded by a test on whether there
> +   are at least that many bytes left to copy in LEN.
> +
> +   ??? Should we skip some powers of two in favor of loops?  Maybe start
> +   at the max of TO/LEN/word alignment, at least when optimizing for
> +   size, instead of ensuring O(log len) dynamic compares?  */
> +
> +bool
> +try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
> +                              unsigned HOST_WIDE_INT min_len,
> +                              unsigned HOST_WIDE_INT max_len,
> +                              rtx val, char valc, unsigned int align)
> +{
> +  int max_bits = floor_log2 (max_len);
> +  int min_bits = floor_log2 (min_len);
> +  int sctz_len = ctz_len;
> +
> +  gcc_checking_assert (sctz_len >= 0);
> +
> +  if (val)
> +    valc = 1;
> +
> +  /* Bits more significant than TST_BITS are part of the shared prefix
> +     in the binary representation of both min_len and max_len.  Since
> +     they're identical, we don't need to test them in the loop.  */
> +  int tst_bits = (max_bits != min_bits ? max_bits
> +                  : floor_log2 (max_len ^ min_len));
> +
> +  /* Check whether it's profitable to start by storing a fixed BLKSIZE
> +     bytes, to lower max_bits.  In the unlikely case of a constant LEN
> +     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
> +     single store_by_pieces, but otherwise, select the minimum multiple
> +     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
> +     brings MAX_LEN below TST_BITS, if 

[PR94092] Re: [RFC] test builtin ratio for loop distribution

2021-02-19 Thread Alexandre Oliva
Here's an improved version of the patch.  Regstrapped on
x86_64-linux-gnu, with and without a patchlet that moved multi-pieces
ahead of setmem, and also tested with riscv32-elf.

Is it ok to install?  Or should it wait for stage1?


[PR94092] introduce try store by multiple pieces

From: Alexandre Oliva 

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length clearing memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.

It also changes the last copy-prop pass into ccp, so that pointer
alignment and length's nonzero bits are detected and made available
for the expander, even for ldist-introduced SSA_NAMEs.


for  gcc/ChangeLog

PR tree-optimization/94092
* builtins.c (try_store_by_multiple_pieces): New.
(expand_builtin_memset_args): Use it.  If target_char_cast
fails, proceed as for non-constant val.  Pass len's ctz to...
* expr.c (clear_storage_hints): ... this.  Try store by
multiple pieces after setmem.
(clear_storage): Adjust.
* expr.h (clear_storage_hints): Likewise.
(try_store_by_multiple_pieces): Declare.
* passes.def: Replace the last copy_prop with ccp.
---
 gcc/builtins.c |  182 ++--
 gcc/expr.c |9 ++-
 gcc/expr.h |   13 
 gcc/passes.def |5 +-
 4 files changed, 197 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 0aed008687ccc..05b14f2a3f6d3 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6600,6 +6600,166 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN-bits boundary.  LEN must be a multiple of
+   1<<CTZ_LEN between MIN_LEN and MAX_LEN.
+
+   The strategy is to issue one store_by_pieces for each power of two,
+   from most to least significant, guarded by a test on whether there
+   are at least that many bytes left to copy in LEN.
+
+   ??? Should we skip some powers of two in favor of loops?  Maybe start
+   at the max of TO/LEN/word alignment, at least when optimizing for
+   size, instead of ensuring O(log len) dynamic compares?  */
+
+bool
+try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
+                              unsigned HOST_WIDE_INT min_len,
+                              unsigned HOST_WIDE_INT max_len,
+                              rtx val, char valc, unsigned int align)
+{
+  int max_bits = floor_log2 (max_len);
+  int min_bits = floor_log2 (min_len);
+  int sctz_len = ctz_len;
+
+  gcc_checking_assert (sctz_len >= 0);
+
+  if (val)
+    valc = 1;
+
+  /* Bits more significant than TST_BITS are part of the shared prefix
+     in the binary representation of both min_len and max_len.  Since
+     they're identical, we don't need to test them in the loop.  */
+  int tst_bits = (max_bits != min_bits ? max_bits
+                  : floor_log2 (max_len ^ min_len));
+
+  /* Check whether it's profitable to start by storing a fixed BLKSIZE
+     bytes, to lower max_bits.  In the unlikely case of a constant LEN
+     (implied by identical MAX_LEN and MIN_LEN), we want to issue a
+     single store_by_pieces, but otherwise, select the minimum multiple
+     of the ALIGN (in bytes) and of the MCD of the possible LENs, that
+     brings MAX_LEN below TST_BITS, if that's lower than min_len.  */
+  unsigned HOST_WIDE_INT blksize;
+  if (max_len > min_len)
+    {
+      unsigned HOST_WIDE_INT alrng = MAX (HOST_WIDE_INT_1U << ctz_len,
+                                          align / BITS_PER_UNIT);
+      blksize = max_len - (HOST_WIDE_INT_1U << tst_bits) + alrng;
+      blksize &= ~(alrng - 1);
+    }
+  else if (max_len == min_len)
+    blksize = max_len;
+  else
+    gcc_unreachable ();
+  if (min_len >= blksize)
+    {
+      min_len -= blksize;
+      min_bits = floor_log2 (min_len);
+      max_len -= blksize;
+      max_bits = floor_log2 (max_len);
+
+      tst_bits = (max_bits != min_bits ? max_bits
+                  : floor_log2 (max_len ^ min_len));
+    }
+  else
+    blksize = 0;
+
+  /* Check that we can use store by pieces for the maximum store count
+     we may issue (initial fixed-size block, plus conditional
+     power-of-two-sized from max_bits to ctz_len.  */
+  unsigned HOST_WIDE_INT xlenest = blksize;
+  if (max_bits >= 0)
+    xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
+                - (HOST_WIDE_INT_1U << ctz_len));
+  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
+                            &valc, align, true))
+    return false;
+
+  rtx (*constfun) (void *, HOST_WIDE_INT, scalar_int_mode);
+  void *constfundata;
+  if (val)
+    {
+      constfun = builtin_memset_gen_str;
+      constfundata = val = force_reg (TYPE_MODE (unsigned_char_type_node),
+                                      val);
+    }
+  else
+    {
+      constfun = builtin_memset_read_str;
+      constfundata = &valc;
+    }
+
+  rtx ptr = 

Re: [RFC] test builtin ratio for loop distribution

2021-02-16 Thread Richard Biener via Gcc-patches
On Tue, Feb 16, 2021 at 11:48 AM Alexandre Oliva  wrote:
>
> On Feb 16, 2021, Alexandre Oliva  wrote:
>
> >> So I wonder whether we should instead re-run CCP after loop opts which
> >> computes nonzero bits as well instead of the above "hack".
>
> That works.  It takes care of both the dest alignment and the len ctz.
>
> Explicitly masking out the len tz from nonzero bits also works, but I
> suppose the ccp pass does a better job, and it takes care of memcpy and
> memmove as well.

It's certainly more general, yes.

> I noticed that ccp releases CDI_DOMINATORS information.  It looks like
> the last copy_prop pass, now replaced with ccp in my tree, was late
> enough that it doesn't seem to matter, but at first I thought you meant
> an earlier copy_prop, that runs before graphite, and dropping dominator
> info at that point was a problem for e.g. cunroll.

We indeed shouldn't release CDI_DOMINATORS info, but if it is a problem
elsewhere then that pass fails to compute dominator info.  But yes,
in the loop pipeline we expect dominator info to be present (but fast
query can disable itself leading to issues if doms are not recomputed).
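
(I.e., a pass that needs dominators should ensure they're available
itself, e.g. with the usual idiom:

  if (!dom_info_available_p (CDI_DOMINATORS))
    calculate_dominance_info (CDI_DOMINATORS);

rather than assume an earlier pass left them around.)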

Richard.

> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>Free Software Activist GNU Toolchain Engineer
> Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-16 Thread Alexandre Oliva
On Feb 16, 2021, Alexandre Oliva  wrote:

>> So I wonder whether we should instead re-run CCP after loop opts which
>> computes nonzero bits as well instead of the above "hack".

That works.  It takes care of both the dest alignment and the len ctz.

Explicitly masking out the len tz from nonzero bits also works, but I
suppose the ccp pass does a better job, and it takes care of memcpy and
memmove as well.
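
(For reference, the masking variant would have amounted to something like
the following in ldist, for a length LEN known to be a multiple of
1<<CTZ; a sketch under that assumption, not tested code:

  int prec = TYPE_PRECISION (TREE_TYPE (len));
  wide_int nz = get_nonzero_bits (len);
  /* The CTZ least-significant bits are known to be zero.  */
  set_nonzero_bits (len, wi::bit_and_not (nz, wi::mask (ctz, false, prec)));
)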


I noticed that ccp releases CDI_DOMINATORS information.  It looks like
the last copy_prop pass, now replaced with ccp in my tree, was late
enough that it doesn't seem to matter, but at first I thought you meant
an earlier copy_prop, that runs before graphite, and dropping dominator
info at that point was a problem for e.g. cunroll.

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-15 Thread Alexandre Oliva
On Feb 12, 2021, Richard Biener  wrote:

>> +  if (TREE_CODE (mem) == SSA_NAME)
>> +    if (ptr_info_def *pi = get_ptr_info (mem))
>> +      {
>> +        unsigned al = get_pointer_alignment (builtin->dst_base);
>> +        if (al > pi->align || pi->misalign)

> We still might prefer pi->align == 64 and pi->misalign == 32 over al == 16
> so maybe factor that in, too.

Ugh, apologies, I somehow posted an incorrect and outdated version of
the patch.  The improved (propagates both alignments) and fixed (divides
by BITS_PER_UNIT, fixing a regression in gfortran.dg/sms-2.f90) had
this alternate hunk as the only difference:

@@ -1155,6 +1156,16 @@ generate_memset_builtin (class loop *loop, partition *partition)
   mem = force_gimple_operand_gsi (&gsi, mem, true, NULL_TREE,
                                   false, GSI_CONTINUE_LINKING);
 
+  if (TREE_CODE (mem) == SSA_NAME)
+    if (ptr_info_def *pi = get_ptr_info (mem))
+      {
+        unsigned al;
+        unsigned HOST_WIDE_INT misal;
+        if (get_pointer_alignment_1 (builtin->dst_base, &al, &misal))
+          set_ptr_info_alignment (pi, al / BITS_PER_UNIT,
+                                  misal / BITS_PER_UNIT);
+      }
+
   /* This exactly matches the pattern recognition in classify_partition.  */
   val = gimple_assign_rhs1 (DR_STMT (builtin->dst_dr));
   /* Handle constants like 0x15151515 and similarly



> So I wonder whether we should instead re-run CCP after loop opts which
> computes nonzero bits as well instead of the above "hack".  Would
> nonzero bits be ready to compute in the above way from loop distribution?
> That is, you can do set_nonzero_bits just like you did
> set_ptr_info_alignment ...

> Since CCP also performs copy propagation an obvious candidate would be
> to replace the last pass_copy_prop with pass_ccp (with a comment noting
> to compute up-to-date nonzero bits info).

I'll look into these possibilities.


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-12 Thread Richard Biener via Gcc-patches
On Thu, Feb 11, 2021 at 11:19 AM Alexandre Oliva  wrote:
>
> On Feb  4, 2021, Alexandre Oliva  wrote:
>
> > On Feb  4, 2021, Richard Biener  wrote:
> >>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>>
> >>> Why would that be better than keeping the constant-length memset call,
> >>> that would be turned into an unrolled loop during expand?
>
> >> Well, because of the possibly lost ctz and alignment info.
>
> > Funny you should mention that.  I got started with the expand-time
> > expansion yesterday, and found out that we're not using the alignment
> > information that is available.  Though the pointer is known to point to
> > an aligned object, we are going for 8-bit alignment for some reason.
>
> > The strategy I used there was to first check whether by_pieces would
> > expand inline a constant length near the max known length, then loop
> > over the bits in the variable length, expand in each iteration a
> > constant-length store-by-pieces for the fixed length corresponding to
> > that bit, and a test comparing the variable length with the fixed length
> > guarding the expansion of the store-by-pieces.  We may get larger code
> > this way (no loops), but only O(log(len)) compares.
>
> > I've also fixed some bugs in the ldist expander, so now it bootstraps,
> > but with a few regressions in the testsuite, that I'm yet to look into.
>
> A few more fixes later, this seems to work.
>
> It encompasses the patch to recover tree_ctz from a mult_expr def, it
> adds code to set up the alignment data for the ldist-generated memset
> dest, and then it expands memset as described above if length is not
> constant, setmem is not available, but the by-pieces machinery would
> still be used for nearby constants.
>
> How does this look?

Overall it looks good - I can't really review the RTL code generation
parts because I'm not too familiar with the usual pitfalls there.

Some comments below.

>
> [PR94092] introduce try store by multiple pieces
>
> From: Alexandre Oliva 
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch handles variable lengths with multiple conditional
> power-of-2-constant-sized stores-by-pieces, so as to reduce the
> overhead of length compares.
>
> It also propagates the alignment of the memset dest, introduced by
> loop-distribution, to the SSA pointer used to hold it, so that it is
> not discarded right away, and may be recovered by the memset builtin
> expander.
>
> Finally, it computes the ctz of the length, and uses it to omit blocks
> for stores of small byte counts when we're storing whole words.
> tree_ctz is improved so as to look at the length DEF stmt, and zero
> out the least-significant bits if it's a multiply by a power of two.
>
>
> for  gcc/ChangeLog
>
> PR tree-optimization/94092
> * builtins.c (try_store_by_multiple_pieces): New.
> (expand_builtin_memset_args): Use it.  If target_char_cast
> fails, proceed as for non-constant val.  Pass len's ctz to...
> * expr.c (clear_storage_hints): ... this.  Try store by
> multiple pieces after setmem.
> (clear_storage): Adjust.
> * expr.h (clear_storage_hints): Likewise.
> (try_store_by_multiple_pieces): Declare.
> * tree-loop-distribution.c: Include builtins.h.
> (generate_memset_builtin): Propagate dst_base alignment to mem.
> * tree-ssanames.c (get_nonzero_bits): Zero out low bits of
> integral types, when a MULT_EXPR INTEGER_CST operand ensures
> the result will be a multiple of a power of two.
> ---
>  gcc/builtins.c   |  113 +++---
>  gcc/expr.c   |9 +++
>  gcc/expr.h   |   12 
>  gcc/tree-loop-distribution.c |9 +++
>  gcc/tree-ssanames.c  |   23 -
>  5 files changed, 154 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/builtins.c b/gcc/builtins.c
> index 0aed008687ccc..44f3af92536a5 100644
> --- a/gcc/builtins.c
> +++ b/gcc/builtins.c
> @@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
>return expand_builtin_memset_args (dest, val, len, target, mode, exp);
>  }
>
> +/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
> +   Return TRUE if 

Re: [RFC] test builtin ratio for loop distribution

2021-02-11 Thread Alexandre Oliva
On Feb 11, 2021, Alexandre Oliva  wrote:

> How does this look?


> for  gcc/ChangeLog

>   PR tree-optimization/94092
>   * builtins.c (try_store_by_multiple_pieces): New.
>   (expand_builtin_memset_args): Use it.  If target_char_cast
>   fails, proceed as for non-constant val.  Pass len's ctz to...
>   * expr.c (clear_storage_hints): ... this.  Try store by
>   multiple pieces after setmem.
>   (clear_storage): Adjust.
>   * expr.h (clear_storage_hints): Likewise.
>   (try_store_by_multiple_pieces): Declare.
>   * tree-loop-distribution.c: Include builtins.h.
>   (generate_memset_builtin): Propagate dst_base alignment to mem.
>   * tree-ssanames.c (get_nonzero_bits): Zero out low bits of
>   integral types, when a MULT_EXPR INTEGER_CST operand ensures
>   the result will be a multiple of a power of two.

I forgot to mention this passed regstrap on x86_64-linux-gnu, as well as
some cross testing of riscv32-elf.

I've also regstrapped it on x86_64-linux-gnu along with a patch for
testing purposes, that tried try_store_by_multiple_pieces before setmem
in all 3 locations where they are called, which gives me some confidence
that the implementation is reasonably robust.


Is this ok to install?  (if not right now, perhaps in stage1)

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-11 Thread Alexandre Oliva
On Feb  4, 2021, Alexandre Oliva  wrote:

> On Feb  4, 2021, Richard Biener  wrote:
>>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>>> 
>>> Why would that be better than keeping the constant-length memset call,
>>> that would be turned into an unrolled loop during expand?

>> Well, because of the possibly lost ctz and alignment info.

> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.

> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.

> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.

A few more fixes later, this seems to work.

It encompasses the patch to recover tree_ctz from a mult_expr def, it
adds code to set up the alignment data for the ldist-generated memset
dest, and then it expands memset as described above if length is not
constant, setmem is not available, but the by-pieces machinery would
still be used for nearby constants.

How does this look?


[PR94092] introduce try store by multiple pieces

From: Alexandre Oliva 

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM, or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch handles variable lengths with multiple conditional
power-of-2-constant-sized stores-by-pieces, so as to reduce the
overhead of length compares.

It also propagates the alignment of the memset dest, introduced by
loop-distribution, to the SSA pointer used to hold it, so that it is
not discarded right away, and may be recovered by the memset builtin
expander.

Finally, it computes the ctz of the length, and uses it to omit blocks
for stores of small byte counts when we're storing whole words.
tree_ctz is improved so as to look at the length DEF stmt, and zero
out the least-significant bits if it's a multiply by a power of two.
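
(For example, given C source like

  void f (char *p, unsigned n)
  {
    unsigned len = n * 8;          /* MULT_EXPR by a power of two.  */
    __builtin_memset (p, 0, len);
  }

the improved tree_ctz can report that len has at least 3 trailing zero
bits by looking through its SSA DEF, even though nothing is known about
n itself.)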


for  gcc/ChangeLog

PR tree-optimization/94092
* builtins.c (try_store_by_multiple_pieces): New.
(expand_builtin_memset_args): Use it.  If target_char_cast
fails, proceed as for non-constant val.  Pass len's ctz to...
* expr.c (clear_storage_hints): ... this.  Try store by
multiple pieces after setmem.
(clear_storage): Adjust.
* expr.h (clear_storage_hints): Likewise.
(try_store_by_multiple_pieces): Declare.
* tree-loop-distribution.c: Include builtins.h.
(generate_memset_builtin): Propagate dst_base alignment to mem.
* tree-ssanames.c (get_nonzero_bits): Zero out low bits of
integral types, when a MULT_EXPR INTEGER_CST operand ensures
the result will be a multiple of a power of two.
---
 gcc/builtins.c   |  113 +++---
 gcc/expr.c   |9 +++
 gcc/expr.h   |   12 
 gcc/tree-loop-distribution.c |9 +++
 gcc/tree-ssanames.c  |   23 -
 5 files changed, 154 insertions(+), 12 deletions(-)

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 0aed008687ccc..44f3af92536a5 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -6600,6 +6600,97 @@ expand_builtin_memset (tree exp, rtx target, machine_mode mode)
   return expand_builtin_memset_args (dest, val, len, target, mode, exp);
 }
 
+/* Try to store VAL (or, if NULL_RTX, VALC) in LEN bytes starting at TO.
+   Return TRUE if successful, FALSE otherwise.  TO is assumed to be
+   aligned at an ALIGN boundary.  LEN is assumed to be a multiple of
+   1<<CTZ_LEN.  */
[...]
+  for (int i = max_bits; i >= ctz_len; i--)
+    {
+      unsigned HOST_WIDE_INT blksize = HOST_WIDE_INT_1U << i;
+      rtx_code_label *label = NULL;
+
+      if (i <= shr_bits)
+        {
+          label = gen_label_rtx ();
+          emit_cmp_and_jump_insns (rem, GEN_INT (blksize), LT, NULL,
+  

Re: [RFC] test builtin ratio for loop distribution

2021-02-11 Thread Alexandre Oliva
On Feb  4, 2021, Jim Wilson  wrote:

> FYI we have a bug report for this for a coremark regression which sounds
> like the same problem.
>  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Indeed, thanks!

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-05 Thread Richard Biener via Gcc-patches
On Thu, Feb 4, 2021 at 11:18 PM Alexandre Oliva  wrote:
>
> On Feb  4, 2021, Richard Biener  wrote:
>
> >> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
> >>
> >> Why would that be better than keeping the constant-length memset call,
> >> that would be turned into an unrolled loop during expand?
>
> > Well, because of the possibly lost ctz and alignment info.
>
> Funny you should mention that.  I got started with the expand-time
> expansion yesterday, and found out that we're not using the alignment
> information that is available.  Though the pointer is known to point to
> an aligned object, we are going for 8-bit alignment for some reason.
>
> The strategy I used there was to first check whether by_pieces would
> expand inline a constant length near the max known length, then loop
> over the bits in the variable length, expand in each iteration a
> constant-length store-by-pieces for the fixed length corresponding to
> that bit, and a test comparing the variable length with the fixed length
> guarding the expansion of the store-by-pieces.  We may get larger code
> this way (no loops), but only O(log(len)) compares.
>
> I've also fixed some bugs in the ldist expander, so now it bootstraps,
> but with a few regressions in the testsuite, that I'm yet to look into.
>
> >> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> >> is one and the same for the loop and for trailing count misalignment?
>
> > Sorry, the code lacked comments and so I didn't actually try decipering
> > the code you generate ;)
>
> Oh, come on, it was plainly obscure ;-D
>
> Sorry for posting an early-draft before polishing it up.
>
> > The original motivation was really that esp. for small trip count loops
> > the target knows best how to implement them.  Now, that completely
> > fails of course in case the target doesn't implement any of this or
> > the generic code fails because we lost ctz and alignment info.
>
> In our case, generic code fails because it won't handle variable-sized
> clear-by-pieces.  But then, I found out, when it's fixed-size, it also
> makes the code worse, because it seems to expand to byte stores even
> when the store-to object is known to have wider alignment:
>
> union u {
>   long long i;
>   char c[8];
> } x[8];
> int s(union u *p, int k) {
>   for (int i = k ? 0 : 3; i < 8; i++) {
> for (int j = 0; j < 8; j++) {
>   p[i].c[j] = 0;
> } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
>   }
> }

On x86_64 I see two DImode stores generated, but that appears to be
done via setmem.

> >> > I think the builtins with alignment and calloc-style element count
> >> > will be useful on its own.
> >>
> >> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> >> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> >> length, and use it in existing code.
>
> > Yeah, but when we generate memcpy there might not be a way to
> > store the ctz info until RTL expansion where the magic should really happen 
> > ...
>
> True.  It can be recovered without much difficulty in the cases I've
> looked at, but it could be lost in others.
>
> > So I'd say go for improving RTL expansion.
>
> 'k, thanks
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>Free Software Activist GNU Toolchain Engineer
> Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-04 Thread Jim Wilson
On Wed, Jan 27, 2021 at 4:40 AM Alexandre Oliva  wrote:

> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
FYI we have a bug report for this for a coremark regression which sounds
like the same problem.
 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092

Jim


Re: [RFC] test builtin ratio for loop distribution

2021-02-04 Thread Alexandre Oliva
On Feb  4, 2021, Richard Biener  wrote:

>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>> 
>> Why would that be better than keeping the constant-length memset call,
>> that would be turned into an unrolled loop during expand?

> Well, because of the possibly lost ctz and alignment info.

Funny you should mention that.  I got started with the expand-time
expansion yesterday, and found out that we're not using the alignment
information that is available.  Though the pointer is known to point to
an aligned object, we are going for 8-bit alignment for some reason.

The strategy I used there was to first check whether by_pieces would
expand inline a constant length near the max known length, then loop
over the bits in the variable length, expand in each iteration a
constant-length store-by-pieces for the fixed length corresponding to
that bit, and a test comparing the variable length with the fixed length
guarding the expansion of the store-by-pieces.  We may get larger code
this way (no loops), but only O(log(len)) compares.

I've also fixed some bugs in the ldist expander, so now it bootstraps,
but with a few regressions in the testsuite, that I'm yet to look into.

>> Uhh, thanks, but...  you realize nearly all of the gimple-building code
>> is one and the same for the loop and for trailing count misalignment?

> Sorry, the code lacked comments and so I didn't actually try decipering
> the code you generate ;)

Oh, come on, it was plainly obscure ;-D

Sorry for posting an early-draft before polishing it up.

> The original motivation was really that esp. for small trip count loops
> the target knows best how to implement them.  Now, that completely
> fails of course in case the target doesn't implement any of this or
> the generic code fails because we lost ctz and alignment info.

In our case, generic code fails because it won't handle variable-sized
clear-by-pieces.  But then, I found out, when it's fixed-size, it also
makes the code worse, because it seems to expand to byte stores even
when the store-to object is known to have wider alignment:

union u {
  long long i;
  char c[8];
} x[8];
int s(union u *p, int k) {
  for (int i = k ? 0 : 3; i < 8; i++) {
for (int j = 0; j < 8; j++) {
  p[i].c[j] = 0;
} // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
  }
}

>> > I think the builtins with alignment and calloc-style element count
>> > will be useful on its own.
>> 
>> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
>> I'm not sure I want to go there.  I'd much rather recover the ctz of the
>> length, and use it in existing code.

> Yeah, but when we generate memcpy there might not be a way to
> store the ctz info until RTL expansion where the magic should really happen 
> ...

True.  It can be recovered without much difficulty in the cases I've
looked at, but it could be lost in others.

> So I'd say go for improving RTL expansion.

'k, thanks

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-04 Thread Richard Biener via Gcc-patches
On Wed, Feb 3, 2021 at 4:11 PM Alexandre Oliva  wrote:
>
> On Feb  3, 2021, Richard Biener  wrote:
>
> > So I think we should try to match what __builtin_memcpy/memset
> > expansion would do here, taking advantage of extra alignment
> > and size knowledge.  In particular,
>
> >  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
> >  see if we can have variants of memcpy/memset transferring alignment
> >  and size knowledge
>
> We could add more optional parameters to them to convey the length's
> known ctz.  However, the ctz can't be recovered reliably.  We can't even
> recover it after gimplifying the length within ldist!
>
> That said, my other patch already enables ctz calls to recover it, at
> least in libgcc risc-v tfmode cases, and it's possible it's readily
> available in other cases.  I'd rather leave that for someone dealing
> with the machine-specific patterns to figure out whether a separate
> argument would be useful.  RISC-V, which is what I'm dealing with,
> doesn't have much to offer as far as these patterns are concerned.
>
> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>
> Why would that be better than keeping the constant-length memset call,
> that would be turned into an unrolled loop during expand?

Well, because of the possibly lost ctz and alignment info.  There's also
the eventual benefit of eliding either the source (for memcpy) if the
variable accesses in the original loop or the address-taken for the
memcpy/memset are the only reasons it is not.  Of course that's then
again a general GIMPLE memcpy/memset inlining deficit (we only
"inline" it if it's possible to express the operation with one read and
one write).

> >  c) if expansion would emit a memset/memcpy call but we know
> >  alignment and have a low bound on niters emit a loop (like your patch 
> > does)
>
> > a) might be difficult but adding the builtin variants may pay off in any 
> > case.
>
> *nod*
>
> > The patch itself could benefit from one or two helpers we already
> > have, first of all there's create_empty_loop_on_edge (so you don't
> > need the loop fixup)
>
> Uhh, thanks, but...  you realize nearly all of the gimple-building code
> is one and the same for the loop and for trailing count misalignment?

Sorry, the code lacked comments and so I didn't actually try decipering
the code you generate ;)

> There doesn't seem to be a lot of benefit to using this primitive, aside
> from its dealing with creating the loop data structure which, again, I'd
> only want to do in the loop case.
>
> (I guess I should add more comments as to the inline expansion
>  strategy.  it's equivalent to:
>
>  i = len, ptr = base, blksz = 1 << alctz;
>  while (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
>  blksz >>= 1; if (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
>  ... until blksz gets down to zero or to 1<<szctz)

> > Note that for memmove if we know the dependence direction, we
> > can also emit a loop / unrolled code.
>
> *nod*, but the logic would have to be quite different, using bit tests,
> and odds are we won't know the direction and have to output a test and
> code for both possibilities, which would be quite unlikely to be
> beneficial.  Though the original code would quite likely make the
> direction visible; perhaps if the size is small enough that we would
> expand a memcpy inline, we should refrain from transforming the loop
> into a memmove call.
>
> In the case at hand, there's no benefit at all to these transformations:
> we start from a loop with the known alignment and a small loop count (0
> to 2 words copied), and if everything goes all right with the
> transformation, we may be lucky to get back to that.  It's not like the
> transformation could even increase the known alignment, so why bother,
> and throw away debug info by rewriting the loop into the same code minus
> debug?

The original motivation was really that esp. for small trip count loops
the target knows best how to implement them.  Now, that completely
fails of course in case the target doesn't implement any of this or
the generic code fails because we lost ctz and alignment info.

> > I think the builtins with alignment and calloc-style element count
> > will be useful on its own.
>
> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
> I'm not sure I want to go there.  I'd much rather recover the ctz of the
> length, and use it in existing code.

Yeah, but when we generate memcpy there might not be a way to
store the ctz info until RTL expansion where the magic should really happen ...

> I'd also prefer if the generic memset (and memcpy and memmove?) builtin
> expanders dealt with non-constant lengths in the way I implemented.

*nod*

> That feels like the right spot for it.  That deprives us of gimple loop
> optimizations in the inlined loop generated by the current patch, 

Re: [RFC] test builtin ratio for loop distribution

2021-02-03 Thread Alexandre Oliva
On Feb  3, 2021, Richard Biener  wrote:

> So I think we should try to match what __builtin_memcpy/memset
> expansion would do here, taking advantage of extra alignment
> and size knowledge.  In particular,

>  a) if __builtin_memcpy/memset would use setmem/cpymem optabs
>  see if we can have variants of memcpy/memset transferring alignment
>  and size knowledge

We could add more optional parameters to them to convey the length's
known ctz.  However, the ctz can't be recovered reliably.  We can't even
recover it after gimplifying the length within ldist!

That said, my other patch already enables ctz calls to recover it, at
least in libgcc risc-v tfmode cases, and it's possible it's readily
available in other cases.  I'd rather leave that for someone dealing
with the machine-specific patterns to figure out whether a separate
argument would be useful.  RISC-V, which is what I'm dealing with,
doesn't have much to offer as far as these patterns are concerned.

>  b) if expansion would use BY_PIECES then expand to an unrolled loop

Why would that be better than keeping the constant-length memset call,
that would be turned into an unrolled loop during expand?

>  c) if expansion would emit a memset/memcpy call but we know
>  alignment and have a low bound on niters emit a loop (like your patch 
> does)

> a) might be difficult but adding the builtin variants may pay off in any case.

*nod*

> The patch itself could benefit from one or two helpers we already
> have, first of all there's create_empty_loop_on_edge (so you don't
> need the loop fixup)

Uhh, thanks, but...  you realize nearly all of the gimple-building code
is one and the same for the loop and for trailing count misalignment?
There doesn't seem to be a lot of benefit to using this primitive, aside
from its dealing with creating the loop data structure which, again, I'd
only want to do in the loop case.

(I guess I should add more comments as to the inline expansion
 strategy.  it's equivalent to:

 i = len, ptr = base, blksz = 1 << alctz;
 while (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
 blksz >>= 1; if (i >= blksz) { *(ub*)ptr = val; i -= blksz; ptr += blksz; }
 ... until blksz gets down to zero or to 1<<szctz)

> Note that for memmove if we know the dependence direction, we
> can also emit a loop / unrolled code.

*nod*, but the logic would have to be quite different, using bit tests,
and odds are we won't know the direction and have to output a test and
code for both possibilities, which would be quite unlikely to be
beneficial.  Though the original code would quite likely make the
direction visible; perhaps if the size is small enough that we would
expand a memcpy inline, we should refrain from transforming the loop
into a memmove call.

In the case at hand, there's no benefit at all to these transformations:
we start from a loop with the known alignment and a small loop count (0
to 2 words copied), and if everything goes all right with the
transformation, we may be lucky to get back to that.  It's not like the
transformation could even increase the known alignment, so why bother,
and throw away debug info by rewriting the loop into the same code minus
debug?

> I think the builtins with alignment and calloc-style element count
> will be useful on its own.

Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
I'm not sure I want to go there.  I'd much rather recover the ctz of the
length, and use it in existing code.


I'd also prefer if the generic memset (and memcpy and memmove?) builtin
expanders dealt with non-constant lengths in the way I implemented.
That feels like the right spot for it.  That deprives us of gimple loop
optimizations in the inlined loop generated by the current patch, but if
we expand an unrolled loop with compares and offsets with small
constants, loop optimizations might not even be relevant.


FWIW, the patch I posted yesterday is broken, the regstrap test did not
even build libstdc++-v3 successfully.  I'm not sure whether to pursue it
further, or to reimplement it in the expander.  Suggestions are welcome.

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-02-03 Thread Richard Biener via Gcc-patches
On Tue, Feb 2, 2021 at 6:14 PM Alexandre Oliva  wrote:
>
> On Jan 28, 2021, Richard Biener  wrote:
>
> > That would allow turning back the memset into the original loop (but
> > with optimal IVs, etc.).
>
> Is this sort of what you had in mind?
>
> I haven't tested the inline expansion of memset much yet; and that of
> memcpy, not at all; this really is mainly to check that I understood
> what you had in mind before I spend further time polishing it.

So I think we should try to match what __builtin_memcpy/memset
expansion would do here, taking advantage of extra alignment
and size knowledge.  In particular,

 a) if __builtin_memcpy/memset would use setmem/cpymem optabs
 see if we can have variants of memcpy/memset transferring alignment
 and size knowledge

 b) if expansion would use BY_PIECES then expand to an unrolled loop

 c) if expansion would emit a memset/memcpy call but we know
 alignment and have a low bound on niters emit a loop (like your patch does)

a) might be difficult but adding the builtin variants may pay off in any case.
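
(A hypothetical shape for such variants, with a calloc-style element
count so that both the alignment and a factor of the size survive to
expansion; purely illustrative, not an existing API:

  void *__builtin_memset_elts (void *dst, int val,
                               size_t nelts, size_t eltsz);
  /* Sets nelts * eltsz bytes at dst, which is assumed to be
     eltsz-aligned.  */
)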

The patch itself could benefit from one or two helpers we already
have, first of all there's create_empty_loop_on_edge (so you don't
need the loop fixup) which conveniently adds the control IV for you.
If you want to use pointer IVs for simplicity there's create_iv.  In the
end you shouldn't need more than creating the actual memory GIMPLE
stmts.  If expansion would use BY_PIECES you can implement the
unrolled code by setting loop->unroll to the number of iterations
to (maximally) peel and cunroll will take care of that.

Note that for memmove if we know the dependence direction, we
can also emit a loop / unrolled code.

I think the builtins with alignment and calloc-style element count
will be useful on its own.

Richard.

>
> test builtin ratio for loop distribution
>
> From: Alexandre Oliva 
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM, or
> a function call.
>
> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.
>
> This patch adds, to the loop distribution pass, the ability to perform
> inline expansion of bounded variable-length memset and memcpy in ways
> that take advantage of known alignments and size's factors, when
> preexisting *_RATIO macros suggest the inline expansion is
> advantageous.
>
>
> for  gcc/ChangeLog
>
> * tree-loop-distribution.c: Include builtins.h.
> (generate_memset_builtin): Introduce optional inline expansion
> of bounded variable-sized memset calls.
> (generate_memcpy_builtin): Likewise for memcpy only.
> (loop_distribution::execute): Fix loop structure afterwards.
> ---
>  gcc/tree-loop-distribution.c |  280 ++
>  1 file changed, 279 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..3be7a4c1ac281 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-vectorizer.h"
>  #include "tree-eh.h"
>  #include "gimple-fold.h"
> +#include "builtins.h"
>
>
>  #define MAX_DATAREFS_NUM \
> @@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
>/* The new statements will be placed before LOOP.  */
>gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
>
> +  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
> +     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
> +     which we don't have VR info.  */
> +  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
> +  unsigned alctz, szctz, xbits;
> +  wide_int szmin, szmax;
> +  value_range_kind vrk;
> +  if (align)
> +    {
> +      alctz = wi::ctz (align);
> +      szctz = MIN (tree_ctz (builtin->size), alctz);
> +      xbits = alctz - szctz;
> +      vrk = determine_value_range (builtin->size, &szmin, &szmax);
> +      if (szmin == szmax)
> +	align = 0;
> +    }
> +
>nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
> +  nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
> +				       false, GSI_CONTINUE_LINKING);
> @@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
>val = tem;
>  }
>
> +  unsigned 

Re: [RFC] test builtin ratio for loop distribution

2021-02-02 Thread Alexandre Oliva
On Jan 28, 2021, Richard Biener  wrote:

> That would allow turning back the memset into the original loop (but
> with optimal IVs, etc.).

Is this sort of what you had in mind?

I haven't tested the inline expansion of memset much yet; and that of
memcpy, not at all; this really is mainly to check that I understood
what you had in mind before I spend further time polishing it.


test builtin ratio for loop distribution

From: Alexandre Oliva 

The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.
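For concreteness, the loops in question reduce to roughly this shape
(an illustration, not the actual soft-fp source):

/* Tail-zeroing loop of the kind ldist turns into memset: dst is
   word-aligned and n is a small variable word count (at most 3 for
   RV32 TFmode).  */
static inline void
clear_tail (unsigned long *dst, unsigned int n)
{
  for (unsigned int i = 0; i < n; i++)
    dst[i] = 0;
}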

This patch adds, to the loop distribution pass, the ability to perform
inline expansion of bounded variable-length memset and memcpy in ways
that take advantage of known alignments and size factors, when
preexisting *_RATIO macros suggest the inline expansion is
advantageous.


for  gcc/ChangeLog

* tree-loop-distribution.c: Include builtins.h.
(generate_memset_builtin): Introduce optional inline expansion
of bounded variable-sized memset calls.
(generate_memcpy_builtin): Likewise for memcpy only.
(loop_distribution::execute): Fix loop structure afterwards.
---
 gcc/tree-loop-distribution.c |  280 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 279 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..3be7a4c1ac281 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -115,6 +115,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "tree-eh.h"
 #include "gimple-fold.h"
+#include "builtins.h"
 
 
 #define MAX_DATAREFS_NUM \
@@ -1148,6 +1149,23 @@ generate_memset_builtin (class loop *loop, partition *partition)
   /* The new statements will be placed before LOOP.  */
   gsi = gsi_last_bb (loop_preheader_edge (loop)->src);
 
+  /* Compute builtin->size range and ctz before it's gimplified; sub-expressions
+     thereof are rewritten in place, so they end up referencing SSA_NAMEs for
+     which we don't have VR info.  */
+  unsigned align = get_pointer_alignment (builtin->dst_base) / BITS_PER_UNIT;
+  unsigned alctz, szctz, xbits;
+  wide_int szmin, szmax;
+  value_range_kind vrk;
+  if (align)
+    {
+      alctz = wi::ctz (align);
+      szctz = MIN (tree_ctz (builtin->size), alctz);
+      xbits = alctz - szctz;
+      vrk = determine_value_range (builtin->size, &szmin, &szmax);
+      if (szmin == szmax)
+	align = 0;
+    }
+
   nb_bytes = rewrite_to_non_trapping_overflow (builtin->size);
   nb_bytes = force_gimple_operand_gsi (&gsi, nb_bytes, true, NULL_TREE,
					false, GSI_CONTINUE_LINKING);
@@ -1172,6 +1190,127 @@ generate_memset_builtin (class loop *loop, partition *partition)
   val = tem;
 }
 
+  unsigned HOST_WIDE_INT ratio;
+  if (integer_zerop (val))
+    ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+  else
+    ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+
+  /* Compare the number of (most-aligned) stores needed for szmax bytes
+     with the ratio: a loop that iterates up to szmax >> alctz, and then
+     one conditional store for each extra bit of alignment that the
+     destination has over the size.  */
+  if (align && vrk == VR_RANGE
+      && wi::ltu_p (wi::rshift (szmax, alctz, UNSIGNED) + xbits, ratio))
+    {
+      gimple *stmt;
+      tree bytes = create_tmp_var_raw (size_type_node, "ldistbytes");
+      tree ptr = create_tmp_var_raw (build_pointer_type (char_type_node),
+				     "ldistptr");
+      tree stval = force_gimple_operand_gsi (&gsi,
+					     fold_convert
+					     (unsigned_char_type_node, val),
+					     true, NULL_TREE, false,
+					     GSI_CONTINUE_LINKING);
+
+      for (unsigned i = 1; i != alctz; i *= 2)
+	{
+	  unsigned bits = i * 2 * BITS_PER_UNIT;
+	  tree type = build_nonstandard_integer_type (bits, true);
+	  stval = fold_convert (type, stval);
+	  tree op2 = fold_build2_loc (partition->loc, LSHIFT_EXPR,
+				      TREE_TYPE (stval), stval,
+				      build_int_cst (type, i * BITS_PER_UNIT));
+	  stval = fold_build2_loc 
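To see how the profitability test above plays out on the motivating
case, here is a worked instance (the ratio value is target-defined;
the numbers are illustrative):

/* dst 8-byte aligned, size a multiple of 4 and at most 12 bytes:
     align = 8			-> alctz = 3
     tree_ctz (size) = 2	-> szctz = MIN (2, 3) = 2
     xbits = alctz - szctz = 1
     szmax = 12			-> (szmax >> alctz) + xbits = 1 + 1 = 2
   Two stores -- a doubleword store loop plus one conditional word
   store for the residue bit -- compare favorably against typical
   CLEAR_RATIO values, so the memset is expanded inline.  */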

Re: [RFC] test builtin ratio for loop distribution

2021-01-28 Thread Richard Biener via Gcc-patches
On Thu, Jan 28, 2021 at 6:28 AM Alexandre Oliva  wrote:
>
> On Jan 27, 2021, Richard Biener  wrote:
>
> > That said, rather than not transforming the loop as you do I'd
> > say we want to re-inline small copies more forcefully during
> > loop distribution code-gen so we turn a loop that sets
> > 3 'short int' to zero into an 'int' store and a 'short' store for example
> > (we have more code-size leeway here because we formerly had
> > a loop).
>
> Ok, that makes sense to me, if there's a chance of growing the access
> stride.
>
> > Since you don't add a testcase
>
> Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
> meant the soft-fp ones from libgcc.
>
> ./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
>   -I$srcdir/libgcc/config/riscv -I$srcdir/include \
>   -S -o - | grep memset
>
> > I can't see whether the actual case would be fixed by setting SSA
> > pointer alignment on the memset arguments
>
> The dest pointer alignment is known at the builtin expansion time,
> that's not a problem.  What isn't known (*) is that the length is a
> multiple of the word size: what gets to the expander is that it's
> between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
> either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
> a bit of a challenge without that extra knowledge.

Ah, yes - that's also useful information.  I guess for memset
an enhanced builtin would then have calloc style
size and nmemb arguments where the destination is expected
to have at least 'size' alignment.  That would allow turning
back the memset into the original loop (but with optimal IVs, etc.).
So kind of the x86 rep mov{bwlq}.  Now think of sth similarly
useful for memcpy/move.
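A minimal sketch of what such an interface could look like follows; the
name and signature are invented here for illustration, not an existing
builtin:

/* Hypothetical calloc-style variant: the length is nmemb * size, and
   dst is guaranteed to be at least 'size'-byte aligned, so the
   middle-end may expand it as a loop of 'size'-wide stores.  */
void *__builtin_memset_elts (void *dst, int val,
			     size_t nmemb, size_t size);

/* The RV32 TFmode tail would then be emitted as something like
   __builtin_memset_elts (dst, 0, nwords, sizeof (UWtype)).  */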

Richard.

>
>
> (*) at least before my related patch for get_nonzero_bits
> https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html
>
> --
> Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
>Free Software Activist GNU Toolchain Engineer
> Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-01-27 Thread Alexandre Oliva
On Jan 27, 2021, Richard Biener  wrote:

> That said, rather than not transforming the loop as you do I'd
> say we want to re-inline small copies more forcefully during
> loop distribution code-gen so we turn a loop that sets
> 3 'short int' to zero into an 'int' store and a 'short' store for example
> (we have more code-size leeway here because we formerly had
> a loop).

Ok, that makes sense to me, if there's a chance of growing the access
stride.

> Since you don't add a testcase

Uhh, sorry, I mentioned TFmode emulation calls, but I wasn't explicit I
meant the soft-fp ones from libgcc.

./xgcc -B./ -O2 $srcdir/libgcc/soft-fp/fixtfdi.c \
  -I$srcdir/libgcc/config/riscv -I$srcdir/include \
  -S -o - | grep memset

> I can't see whether the actual case would be fixed by setting SSA
> pointer alignment on the memset arguments

The dest pointer alignment is known at the builtin expansion time,
that's not a problem.  What isn't known (*) is that the length is a
multiple of the word size: what gets to the expander is that it's
between 4 and 12 bytes (targeting 32-bit risc-v), but not that it's
either 4, 8 or 12.  Coming up with an efficient inline expansion becomes
a bit of a challenge without that extra knowledge.
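With that extra knowledge the expansion could be as simple as the
following sketch (assuming n % 4 == 0 and 4 <= n <= 12, so no residue
handling is needed):

void
clear_words (unsigned int *p, unsigned int n)
{
  unsigned int *end = p + n / 4;	/* n is a nonzero word multiple */
  do
    *p++ = 0;
  while (p < end);
}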


(*) at least before my related patch for get_nonzero_bits
https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564344.html

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: [RFC] test builtin ratio for loop distribution

2021-01-27 Thread Richard Biener via Gcc-patches
On Wed, Jan 27, 2021 at 2:18 PM Alexandre Oliva  wrote:
>
>
> This patch attempts to fix a libgcc codegen regression introduced in
> gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.
>
>
> The ldist pass turns even very short loops into memset calls.  E.g.,
> the TFmode emulation calls end with a loop of up to 3 iterations, to
> zero out trailing words, and the loop distribution pass turns them
> into calls of the memset builtin.
>
> Though short constant-length memsets are usually dealt with
> efficiently, for non-constant-length ones, the options are setmemM or
> a function call.

There's also folding which is expected to turn small memcpy/memset
into simple stmts again.

> RISC-V doesn't have any setmemM pattern, so the loops above end up
> "optimized" into memset calls, incurring not only the overhead of an
> explicit call, but also discarding the information the compiler has
> about the alignment of the destination, and that the length is a
> multiple of the word alignment.

But yes, this particular issue has come up before.  We do have some
on-the-side info for alignment and size estimates from profiling, notably
memset expansion will call

  if (currently_expanding_gimple_stmt)
    stringop_block_profile (currently_expanding_gimple_stmt,
			    &expected_align, &expected_size);

I'm not sure whether those would be a suitable vehicle to record
known alignments.  The alternative is to create builtin variants
that encode this info and are usable for the middle-end here.

That said, rather than not transforming the loop as you do I'd
say we want to re-inline small copies more forcefully during
loop distribution code-gen so we turn a loop that sets
3 'short int' to zero into an 'int' store and a 'short' store for example
(we have more code-size leeway here because we formerly had
a loop).

Since you don't add a testcase I can't see whether the actual
case would be fixed by setting SSA pointer alignment on the
memset arguments (that is, whether they are invariant and thus
no SSA name is involved or not).

> This patch adds to the loop distribution pass some cost analysis based
> on preexisting *_RATIO macros, so that we won't transform loops with
> trip counts as low as the ratios we'd rather expand inline.
>
>
> This patch is not finished; it needs adjustments to the testsuite, to
> make up for the behavior changes it brings about.  Specifically, on a
> x86_64-linux-gnu regstrap, it regresses:
>
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> > FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> > FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> > FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"
>
> I suppose just lengthening the loops will take care of ldist-38 and
> pr78847, but the loss of the warnings in pr53265 is more concerning, and
> will require investigation.
>
> Nevertheless, I seek feedback on whether this is an acceptable approach,
> or whether we should use alternate tuning parameters for ldist, or
> something entirely different.  Thanks in advance,
>
>
> for  gcc/ChangeLog
>
> * tree-loop-distribution.c (maybe_normalize_partition): New.
> (loop_distribution::distribute_loop): Call it.
>
> [requires testsuite adjustments and investigation of a warning regression]
> ---
>  gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
> index bb15fd3723fb6..b5198652817ee 100644
> --- a/gcc/tree-loop-distribution.c
> +++ b/gcc/tree-loop-distribution.c
> @@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
>  }
>  }
>
> +/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
> +   call, and true if it isn't, changing the PARTITION to PKIND_NORMAL.  */
> +
> +static bool
> +maybe_normalize_partition (class loop *loop, struct partition *partition)
> +{
> +  unsigned HOST_WIDE_INT ratio;
> +
> +  switch (partition->kind)
> +    {
> +    case PKIND_NORMAL:
> +    case PKIND_PARTIAL_MEMSET:
> +      return false;
> +
> +    case PKIND_MEMSET:
> +      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
> +					     (partition->builtin->dst_dr))))
> +	ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
> +      else
> +	ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    case PKIND_MEMCPY:
> +    case PKIND_MEMMOVE:
> +      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  tree niters = 

[RFC] test builtin ratio for loop distribution

2021-01-27 Thread Alexandre Oliva


This patch attempts to fix a libgcc codegen regression introduced in
gcc-10, as -ftree-loop-distribute-patterns was enabled at -O2.


The ldist pass turns even very short loops into memset calls.  E.g.,
the TFmode emulation calls end with a loop of up to 3 iterations, to
zero out trailing words, and the loop distribution pass turns them
into calls of the memset builtin.

Though short constant-length memsets are usually dealt with
efficiently, for non-constant-length ones, the options are setmemM or
a function call.

RISC-V doesn't have any setmemM pattern, so the loops above end up
"optimized" into memset calls, incurring not only the overhead of an
explicit call, but also discarding the information the compiler has
about the alignment of the destination, and that the length is a
multiple of the word alignment.

This patch adds to the loop distribution pass some cost analysis based
on preexisting *_RATIO macros, so that we won't transform loops with
trip counts as low as the ratios we'd rather expand inline.


This patch is not finished; it needs adjustments to the testsuite, to
make up for the behavior changes it brings about.  Specifically, on a
x86_64-linux-gnu regstrap, it regresses:

> FAIL: gcc.dg/pr53265.c  (test for warnings, line 40)
> FAIL: gcc.dg/pr53265.c  (test for warnings, line 42)
> FAIL: gcc.dg/tree-ssa/ldist-38.c scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++14  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++17  scan-tree-dump ldist "split to 0 loops and 1 library calls"
> FAIL: g++.dg/tree-ssa/pr78847.C  -std=gnu++2a  scan-tree-dump ldist "split to 0 loops and 1 library calls"

I suppose just lengthening the loops will take care of ldist-38 and
pr78847, but the loss of the warnings in pr53265 is more concerning, and
will require investigation.

Nevertheless, I seek feedback on whether this is an acceptable approach,
or whether we should use alternate tuning parameters for ldist, or
something entirely different.  Thanks in advance,


for  gcc/ChangeLog

* tree-loop-distribution.c (maybe_normalize_partition): New.
(loop_distribution::distribute_loop): Call it.

[requires testsuite adjustments and investigation of a warning regression]
---
 gcc/tree-loop-distribution.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index bb15fd3723fb6..b5198652817ee 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -2848,6 +2848,52 @@ fuse_memset_builtins (vec<struct partition *> *partitions)
 }
 }
 
+/* Return false if it's profitable to turn the LOOP PARTITION into a builtin
+   call, and true if it isn't, changing the PARTITION to PKIND_NORMAL.  */
+
+static bool
+maybe_normalize_partition (class loop *loop, struct partition *partition)
+{
+  unsigned HOST_WIDE_INT ratio;
+
+  switch (partition->kind)
+    {
+    case PKIND_NORMAL:
+    case PKIND_PARTIAL_MEMSET:
+      return false;
+
+    case PKIND_MEMSET:
+      if (integer_zerop (gimple_assign_rhs1 (DR_STMT
+					     (partition->builtin->dst_dr))))
+	ratio = CLEAR_RATIO (optimize_loop_for_speed_p (loop));
+      else
+	ratio = SET_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    case PKIND_MEMCPY:
+    case PKIND_MEMMOVE:
+      ratio = MOVE_RATIO (optimize_loop_for_speed_p (loop));
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree niters = number_of_latch_executions (loop);
+  if (niters == NULL_TREE || niters == chrec_dont_know)
+    return false;
+
+  wide_int minit, maxit;
+  value_range_kind vrk = determine_value_range (niters, &minit, &maxit);
+  if (vrk == VR_RANGE && wi::ltu_p (maxit, ratio))
+    {
+      partition->kind = PKIND_NORMAL;
+      return true;
+    }
+
+  return false;
+}
+
 void
 loop_distribution::finalize_partitions (class loop *loop,
					 vec<struct partition *> *partitions,
@@ -3087,6 +3133,14 @@ loop_distribution::distribute_loop (class loop *loop, vec<gimple *> stmts,
 }
 
   finalize_partitions (loop, &partitions, &alias_ddrs);
+  {
+    bool any_changes_p = false;
+    for (i = 0; partitions.iterate (i, &partition); ++i)
+      if (maybe_normalize_partition (loop, partition))
+	any_changes_p = true;
+    if (any_changes_p)
+      finalize_partitions (loop, &partitions, &alias_ddrs);
+  }
 
   /* If there is a reduction in all partitions make sure the last one
  is not classified for builtin code generation.  */
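
The net effect on the motivating case, with illustrative numbers (the
*_RATIO values are target-defined):

/* For the libgcc tail loop the latch-execution count is bounded by a
   small constant (at most 2 for RV32 TFmode), so for any typical
   CLEAR_RATIO the wi::ltu_p (maxit, ratio) test holds, the partition
   is demoted to PKIND_NORMAL, and the loop stays a loop instead of
   becoming a memset call.  */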

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar