Hello, Paul,

On Jan 13, 2023, Paul Koning <paulkon...@comcast.net> wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:

>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment), 

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Library memcpy implementations do normally do that, indeed, but
dynamically increasing the dest alignment has such an impact on code
size that an *inline* memcpy expansion doesn't normally do it.
try_store_by_multiple_pieces, specifically, is potentially branch-heavy
to begin with, and bumping the alignment up could double the size of
the inline expansion.  So what it does is take the conservative dest
alignment estimate from the compiler and use that.
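
For contrast, here's a minimal hand-written sketch (my own illustration,
not anything GCC emits and not a real library implementation) of the
kind of runtime dest-alignment bump an out-of-line memset typically
performs; the inline expansion skips this and relies on the compiler's
static alignment estimate instead:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: peel leading bytes until dest reaches word
   alignment, then use word-sized stores for the bulk.  */
void *
naive_memset (void *dest, int c, size_t len)
{
  unsigned char *p = dest;
  uint64_t word = (uint8_t) c * UINT64_C (0x0101010101010101);

  /* Runtime alignment bump: extra code and branches.  */
  while (len > 0 && ((uintptr_t) p & (sizeof word - 1)) != 0)
    {
      *p++ = (unsigned char) c;
      len--;
    }

  /* Bulk: aligned word-sized stores of the replicated byte value.  */
  for (; len >= sizeof word; len -= sizeof word, p += sizeof word)
    memcpy (p, &word, sizeof word);

  /* Tail: remaining bytes.  */
  for (; len > 0; len--)
    *p++ = (unsigned char) c;

  return dest;
}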

By adding leading loops to try_store_by_multiple_pieces (as the
proposed patch does, when its option is enabled), we may expand an
unknown-length, unknown-alignment memset to something conceptually like
this (cims is short for constant-sized inlined memset):

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }
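
A quick trace with example numbers (mine, not from the patch): for
len == 100, this goes through

  while (len >= 64) -> one cims of 64 bytes   len 100 -> 36
  if (len >= 32)    -> one cims of 32 bytes   len  36 ->  4
  if (len >= 4)     -> one cims of  4 bytes   len   4 ->  0

so only the leading loop can iterate; every subsequent test fires at
most once.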

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }
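
With the same example length plus a misaligned dest (say dest % 64 ==
35 and len == 100, again my own made-up numbers), the alignment
prologue would fire like this:

  dest & 1  -> cims    1 byte    len 100 -> 99, dest%64 35 -> 36
  dest & 2  -> skipped
  dest & 4  -> cims4   4 bytes   len  99 -> 95, dest%64 36 -> 40
  dest & 8  -> cims8   8 bytes   len  95 -> 87, dest%64 40 -> 48
  dest & 16 -> cims16 16 bytes   len  87 -> 71, dest%64 48 ->  0
  dest & 32 -> skipped

and the remaining 71 bytes go through the aligned chain below it
(64 + 4 + 2 + 1).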


Now, by using more loops instead of going through every power of two,
we could shorten the former (for -Os) to e.g.:

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

if (len >= 8) {
  while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 64)
    while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
}
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }
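
A similar trace for this compact form (again with made-up numbers,
dest % 64 == 35 and len == 200), where both the len >= 8 and the
len >= 64 conditions hold:

  while (dest & 7)  -> 5 byte stores        len 200 -> 195, dest%64 35 -> 40
  while (dest & 56) -> 3 x cims8 (24 bytes) len 195 -> 171, dest%64 40 ->  0
  while (len >= 64) -> 2 x cims64           len 171 ->  43
  while (len >= 8)  -> 5 x cims8            len  43 ->   3
  while (len >= 1)  -> 3 byte stores        len   3 ->   0

for a total of 5 + 24 + 128 + 40 + 3 == 200 bytes stored.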


Now, given that improving performance was never a goal of this change,
and that the expansion it optionally offers is desirable even when it
slows things down, just a simple loop at the known alignment would do.
The remainder flowed out of the way try_store_by_multiple_pieces was
structured: it made sense to start with the largest-reasonable block
loop, and to end with whatever try_store_by_multiple_pieces would have
expanded a known-shorter but variable-length memset to.  That's how I
got here.  I'm not sure it makes sense to try to change things further
to satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist                       GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about <https://stallmansupport.org>
