On Jan 16, 2023, Richard Biener <richard.guent...@gmail.com> wrote:

> On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva <ol...@adacore.com> wrote:
>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment), except perhaps for some microbenchmarks.
>> It's rather a means to avoid depending on the C runtime, particularly
>> due to compiler-introduced memset calls.

> OK, that's what I guessed but you didn't spell out.  So does it make sense
> to mention -ffreestanding in the docs at least?  My fear is that we'd get
> complaints that -O3 -finline-memset-loops turns nicely optimized memset
> loops into dumb ones (via loop distribution and then stupid re-expansion).
> So does it also make sense to turn off -floop-distribute-patterns[-memset]
> with -finline-memset-loops?

I don't think they should be tied together.  Verbose as it is, the
expansion of memset is a sort of local optimum given what the compiler
knows about length and alignment: minimizing what can be minimized,
namely the compare count, by grouping stores in large straight-line
blocks.
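
Just to illustrate the shape I mean, here's a made-up C rendition (not
the compiler's actual output) for a length known to be a multiple of 8
and below 64, with an 8-byte-aligned destination:

  /* Hypothetical sketch of the store-by-multiple-pieces strategy: test
     power-of-two bits of the length from the largest down, so each
     possible length runs a few straight-line blocks of stores, and the
     only per-call overhead is one compare-and-branch per power of two.  */
  static void *
  sketch_memset (void *p, int c, unsigned long n)
  {
    unsigned long long v = (unsigned char) c * 0x0101010101010101ULL;
    unsigned long long *q = p;   /* assumes an 8-byte-aligned destination */
    if (n & 32) { q[0] = v; q[1] = v; q[2] = v; q[3] = v; q += 4; }
    if (n & 16) { q[0] = v; q[1] = v; q += 2; }
    if (n & 8)  { q[0] = v; q += 1; }
    return p;
  }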

Though an optimized memset could in theory perform better, whether
through ifuncs or by bumping alignment, if you're tuning the generated
code for a specific target and you know the destination is aligned, the
inline expansion can likely beat a general-purpose optimized memset,
even if only by a thin margin: it skips the code that the
general-purpose memset would have to run to detect the alignment and
realize it doesn't need to be bumped, and to extend the byte value to
be stored to wider modes.

So I can envision cases in which -floop-distribute-patterns could turn
highly inefficient stores into a memset call with known strict alignment
and a known length multiplier, which could then be profitably expanded
inline so as to take advantage of both, purely for performance reasons.
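
As a concrete (hypothetical) example of the kind of input I mean, a
plain clearing loop like the one below can be recognized by loop
distribution and turned into a memset whose destination alignment and
length multiple are known, which the inline expansion could then
exploit:

  /* Loop distribution can turn this into memset (p, 0, n * sizeof (int)),
     with the alignment of p and the sizeof (int) length multiple known
     to the expander.  */
  void
  clear_ints (int *p, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i++)
      p[i] = 0;
  }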

Indeed, when I started working on this, I thought the issue was
performance, and this led me to pursue the store-by-multiple-pieces
logic.  It can indeed bring about performance improvements, both over
generic-target and highly-optimized memset implementations.  But it can
also be put to use to avoid C runtime calls.  So while I wouldn't
suggest enabling it by default at any optimization level, I wouldn't tie
it to the single purpose of freestanding environments either.


>> My initial goal was to be able to show that inline expansion would NOT
>> bring about performance improvements, but performance was not the
>> concern that led to the request.
>> 
>> If the approach seems generally acceptable, I may even end up extending
>> it to other such builtins.  I have a vague recollection that memcmp is
>> also an issue for us.

> The C/C++ runtime produce at least memmove, memcpy and memcmp as well.

*nod*.  The others are far more challenging to expand inline in a way
that could potentially be more performant:

- memcmp can only use the by_pieces machinery when testing for
equality, presumably because grouping multiple bytes into larger words
to speed things up won't always give the expected result if you just
subtract the larger words: endianness reversal prior to subtracting
might be required, and that would harm performance.  I don't see that
using similar power-of-two-size grouping strategies to minimize looping
overheads would be very advantageous, if at all, given the need for
testing and branching at every word.
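
Here's a small, made-up illustration of the ordering pitfall (not part
of the patch): on a little-endian target, comparing words loaded from
the buffers does not match memcmp's byte-wise ordering unless the bytes
are reversed first:

  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    unsigned char a[4] = { 0x01, 0x02, 0x03, 0x04 };
    unsigned char b[4] = { 0x02, 0x01, 0x03, 0x04 };
    unsigned int wa, wb;
    memcpy (&wa, a, 4);
    memcpy (&wb, b, 4);
    /* memcmp says a < b (first differing byte is 0x01 < 0x02), but on
       a little-endian machine wa > wb, because that byte lands in the
       least significant position; a byte swap before comparing would
       be needed to recover the memcmp ordering.  */
    printf ("memcmp: %d, word compare: %d\n",
            memcmp (a, b, 4) < 0 ? -1 : 1,
            wa < wb ? -1 : 1);
    return 0;
  }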

- memcpy seems doable, but all of the block moves other than cpymem
assume the operands don't overlap.  Even if we were to emit a test for
the kind of overlap that a naïve expansion would break on, along with an
alternate block that goes backwards, all of the block-copying logic
would have to be "oriented" to proceed explicitly forward, explicitly
backward, or don't-care, whereas currently we only have don't-care.
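
For illustration only (made-up code, not what the expander would emit),
the extra orientation logic amounts to choosing the copy direction at
run time, with an explicitly-forward and an explicitly-backward loop,
where the current block-move expansion only has the don't-care variant:

  /* Hypothetical sketch: a byte-wise move that tolerates overlap by
     copying forward when the destination starts at or before the
     source (or past its end), and backward otherwise.  */
  static void *
  sketch_move (void *dst, const void *src, unsigned long n)
  {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d <= s || d >= s + n)
      for (unsigned long i = 0; i < n; i++)   /* forward is safe */
        d[i] = s[i];
    else
      for (unsigned long i = n; i-- > 0; )    /* overlap: go backward */
        d[i] = s[i];
    return dst;
  }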

Though my initial plan, when posting this patch, was to gauge how well
the general approach would be received before thinking much about how
to apply it to the other builtins, I am now concerned that extending it
to them is more than I can tackle.

Would it make more sense to extend it, even constrained by the
limitations mentioned above, or to handle memset only?  In the latter
case, would it still make sense to adopt a command-line option that
suggests a broader effect than it currently has, even if only as a
hopeful future extension?  -finline-all-stringops[={memset,memcpy,...}],
which you suggested, seems to be a reasonable and extensible name to
adopt.

>> Is (optionally) tending to this (uncommon, I suppose) need (or
>> preference?) not something GCC would like to do?

> Sure, I think for the specific intended purpose that would be fine.

Cool!

> It should also only apply to __builtin_memset calls, not to memset
> calls from user code?

I suppose it could be argued both ways.  The situations that I had in
mind either already are, or could be made, __builtin_memset calls, but I
can't think of a reason to prevent explicit memset calls from being
expanded this way.  Do you have any specific purpose in mind?

In favor of allowing non-__builtin_ memset calls to be expanded this
way, I'll mention that I caught a number of corner cases, while
developing the following improved version of the earlier patch (which
now handles even large constant lengths), by bootstrapping the compiler
with the option enabled by default.  Without that, the testcases would
have had to be a *lot* more thorough.


Introduce -finline-memset-loops

From: Alexandre Oliva <ol...@adacore.com>

try_store_by_multiple_pieces was added not long ago, enabling
variable-sized memset to be expanded inline whenever the worst-case
in-range constant length could be, using conditional blocks with powers
of two to cover all possibilities of length and alignment.

This patch extends the memset expansion to start with a loop, so as to
still take advantage of known alignment even with long lengths, but
without necessarily adding store blocks for every power of two.

This makes it possible for any memset call to be expanded, even if
storing a single byte per iteration.  Surely efficient implementations
of memset can do better, with a pre-loop to increase alignment, but
that would likely be excessive for inline expansions of memset.
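
For illustration, the expansion now takes roughly the following shape
(expressed as made-up C rather than the emitted RTL), for a length that
is a multiple of 8 and an 8-byte-aligned destination:

  /* Hypothetical sketch: a loop over the largest block size handles
     arbitrarily long lengths, and the remaining power-of-two blocks
     finish the tail, as before.  */
  static void *
  sketch_memset_loop (void *p, unsigned long n)
  {
    unsigned long long *q = p;   /* assumes 8-byte alignment */
    while (n >= 32)              /* looping block */
      {
        q[0] = 0; q[1] = 0; q[2] = 0; q[3] = 0;
        q += 4; n -= 32;
      }
    if (n & 16) { q[0] = 0; q[1] = 0; q += 2; }
    if (n & 8)  { q[0] = 0; q += 1; }
    return p;
  }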

Still, in some cases, users prefer to inline memset, even if it's not
as performant, or when it's known to be performant in ways the
compiler can't tell, to avoid depending on a C runtime library.

With this flag, inline expansion of memset can be selectively requested
through global or per-function optimization settings, while the
infrastructure it introduces may later allow per-target tuning to use
such looping whenever it is deemed advantageous, even if not explicitly
requested.


for  gcc/ChangeLog

        * builtins.cc (try_store_by_multiple_pieces): Support starting
        with a loop.
        * common.opt (finline-memset-loops): New.
        * doc/invoke.texi (-finline-memset-loops): Add.

for  gcc/testsuite/ChangeLog

        * gcc.dg/torture/inline-mem-set-1.c: New.
---
 gcc/builtins.cc                                 |   99 +++++++++++++++++++++--
 gcc/common.opt                                  |    4 +
 gcc/doc/invoke.texi                             |   13 +++
 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c |   84 ++++++++++++++++++++
 4 files changed, 192 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index af45829875e6c..733fe17ede622 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -4322,6 +4322,10 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
   int tst_bits = (max_bits != min_bits ? max_bits
                  : floor_log2 (max_len ^ min_len));
 
+  /* Save the pre-blksize values.  */
+  int orig_max_bits = max_bits;
+  int orig_tst_bits = tst_bits;
+
   /* Check whether it's profitable to start by storing a fixed BLKSIZE
      bytes, to lower max_bits.  In the unlikely case of a constant LEN
      (implied by identical MAX_LEN and MIN_LEN), we want to issue a
@@ -4361,9 +4365,70 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
   if (max_bits >= 0)
     xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
                - (HOST_WIDE_INT_1U << ctz_len));
-  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
-                           &valc, align, true))
-    return false;
+  bool max_loop = false;
+  /* Skip the test in case of overflow in xlenest.  It shouldn't
+     happen because of the way max_bits and blksize are related, but
+     it doesn't hurt to test.  */
+  if (blksize > xlenest
+      || !can_store_by_pieces (xlenest, builtin_memset_read_str,
+                              &valc, align, true))
+    {
+      if (!flag_inline_memset_loops)
+       return false;
+
+      for (max_bits = orig_max_bits;
+          max_bits >= sctz_len;
+          --max_bits)
+       {
+         xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
+                    - (HOST_WIDE_INT_1U << ctz_len));
+         /* Check that blksize plus the bits to be stored as blocks
+            sized at powers of two can be stored by pieces.  This is
+            like the test above, but with smaller max_bits.  Skip
+            orig_max_bits (it would be redundant).  Also skip in case
+            of overflow.  */
+         if (max_bits < orig_max_bits
+             && xlenest + blksize >= xlenest
+             && can_store_by_pieces (xlenest + blksize,
+                                     builtin_memset_read_str,
+                                     &valc, align, true))
+           {
+             max_loop = true;
+             break;
+           }
+         if (blksize
+             && can_store_by_pieces (xlenest,
+                                     builtin_memset_read_str,
+                                     &valc, align, true))
+           {
+             max_len += blksize;
+             min_len += blksize;
+             tst_bits = orig_tst_bits;
+             blksize = 0;
+             max_loop = true;
+             break;
+           }
+         if (max_bits == sctz_len)
+           {
+             --sctz_len;
+             --ctz_len;
+           }
+       }
+      if (!max_loop)
+       return false;
+      /* If the boundaries are such that min and max may run a
+        different number of trips in the initial loop, the remainder
+        needs not be between the moduli, so set tst_bits to cover all
+        bits.  Otherwise, if the trip counts are the same, max_len
+        has the common prefix, and the previously-computed tst_bits
+        is usable.  */
+      if (max_len >> max_bits > min_len >> max_bits)
+       tst_bits = max_bits;
+    }
+  /* ??? Do we have to check that all powers of two lengths from
+     max_bits down to ctz_len pass can_store_by_pieces?  As in, could
+     it possibly be that xlenest passes while smaller power-of-two
+     sizes don't?  */
 
   by_pieces_constfn constfun;
   void *constfundata;
@@ -4405,7 +4470,9 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
      the least significant bit possibly set in the length.  */
   for (int i = max_bits; i >= sctz_len; i--)
     {
+      rtx_code_label *loop_label = NULL;
       rtx_code_label *label = NULL;
+
       blksize = HOST_WIDE_INT_1U << i;
 
       /* If we're past the bits shared between min_ and max_len, expand
@@ -4419,18 +4486,31 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
                                   profile_probability::even ());
        }
       /* If we are at a bit that is in the prefix shared by min_ and
-        max_len, skip this BLKSIZE if the bit is clear.  */
-      else if ((max_len & blksize) == 0)
+        max_len, skip the current BLKSIZE if the bit is clear, but do
+        not skip the loop, even if it doesn't require
+        prechecking.  */
+      else if ((max_len & blksize) == 0
+              && !(max_loop && i == max_bits))
        continue;
 
+      if (max_loop && i == max_bits)
+       {
+         loop_label = gen_label_rtx ();
+         emit_label (loop_label);
+         /* Since we may run this multiple times, don't assume we
+            know anything about the offset.  */
+         clear_mem_offset (to);
+       }
+
       /* Issue a store of BLKSIZE bytes.  */
+      bool update_needed = i != sctz_len || loop_label;
       to = store_by_pieces (to, blksize,
                            constfun, constfundata,
                            align, true,
-                           i != sctz_len ? RETURN_END : RETURN_BEGIN);
+                           update_needed ? RETURN_END : RETURN_BEGIN);
 
       /* Adjust REM and PTR, unless this is the last iteration.  */
-      if (i != sctz_len)
+      if (update_needed)
        {
          emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX));
          to = replace_equiv_address (to, ptr);
@@ -4438,6 +4518,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
          emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX));
        }
 
+      if (loop_label)
+       emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL,
+                                ptr_mode, 1, loop_label,
+                                profile_probability::likely ());
+
       if (label)
        {
          emit_label (label);
diff --git a/gcc/common.opt b/gcc/common.opt
index d0371aec8db5f..5d28f054241ad 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1874,6 +1874,10 @@ finline-atomics
 Common Var(flag_inline_atomics) Init(1) Optimization
 Inline __atomic operations when a lock free instruction sequence is available.
 
+finline-memset-loops
+Common Var(flag_inline_memset_loops) Init(0) Optimization
+Inline memset even if it requires loops.
+
 fcf-protection
 Common RejectNegative Alias(fcf-protection=,full)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 631c00582bf85..8f8d52bbeef68 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}.
 -fgcse-sm  -fhoist-adjacent-loads  -fif-conversion @gol
 -fif-conversion2  -findirect-inlining @gol
 -finline-functions  -finline-functions-called-once  -finline-limit=@var{n} @gol
--finline-small-functions -fipa-modref -fipa-cp  -fipa-cp-clone @gol
+-finline-memset-loops -finline-small-functions @gol
+-fipa-modref -fipa-cp  -fipa-cp-clone @gol
 -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const @gol
 -fipa-reference  -fipa-reference-addressable @gol
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm} @gol
@@ -11961,6 +11962,16 @@ in its own right.
 Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os},
 but not @option{-Og}.
 
+@item -finline-memset-loops
+@opindex finline-memset-loops
+Expand @code{memset} calls inline, even when the length is variable or
+large enough to require looping.  This may enable the compiler to take
+advantage of known alignment and length multipliers, but it will often
+generate code that is less efficient than performant implementations of
+@code{memset}, and grow code size so much that even a less performant
+@code{memset} may run faster due to better use of the code cache.  This
+option is disabled by default.
+
 @item -fearly-inlining
 @opindex fearly-inlining
 Inline functions marked by @code{always_inline} and functions whose body seems
diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
new file mode 100644
index 0000000000000..4de51df006ede
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
@@ -0,0 +1,84 @@
+/* { dg-do compile } */
+/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */
+
+void *zero (unsigned long long (*p)[32], int n)
+{
+  return __builtin_memset (p, 0, n * sizeof (*p));
+}
+
+void *ones (char (*p)[128], int n)
+{
+  return __builtin_memset (p, -1, n * sizeof (*p));
+}
+
+void *opt2 (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 1024 : 2) * sizeof (*p));
+}
+
+void *opt8 (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 1024 : 8) * sizeof (*p));
+}
+
+void *opt32 (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 1024 : 32) * sizeof (*p));
+}
+
+void *opt128 (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 1024 : 128) * sizeof (*p));
+}
+
+void *opt512 (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 1024 : 512) * sizeof (*p));
+}
+
+void *opt_primes (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 509 : 7) * sizeof (*p));
+}
+
+void *opt_primes_blk (int *p, int i)
+{
+  return __builtin_memset (p, 0, (i ? 521 : 9) * sizeof (*p));
+}
+
+void *huge (long (*p)[16384])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep1 (long (*p)[16384+1])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep4 (long (*p)[16384+4])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep16 (long (*p)[16384+16])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep64 (long (*p)[16384+64])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep256 (long (*p)[16384+256])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+void *hugep1024p256p64p16p4p1 (long (*p)[16384+1024+64+16+4+1])
+{
+  return __builtin_memset (p, 0, sizeof (*p));
+}
+
+/* { dg-final { scan-assembler-not "memset" } } */


-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
