Re: [RFC] test builtin ratio for loop distribution

Alexandre Oliva Thu, 04 Feb 2021 14:18:10 -0800

On Feb  4, 2021, Richard Biener <[email protected]> wrote:

>> >  b) if expansion would use BY_PIECES then expand to an unrolled loop
>> 
>> Why would that be better than keeping the constant-length memset call,
>> that would be turned into an unrolled loop during expand?


> Well, because of the possibly lost ctz and alignment info.

Funny you should mention that.  I got started with the expand-time
expansion yesterday, and found out that we're not using the alignment
information that is available.  Though the pointer is known to point to
an aligned object, we are going for 8-bit alignment for some reason.

The strategy I used there was to first check whether by_pieces would
expand inline a constant length near the max known length, then loop
over the bits in the variable length, expand in each iteration a
constant-length store-by-pieces for the fixed length corresponding to
that bit, and a test comparing the variable length with the fixed length
guarding the expansion of the store-by-pieces.  We may get larger code
this way (no loops), but only O(log(len)) compares.

I've also fixed some bugs in the ldist expander, so now it bootstraps,
but with a few regressions in the testsuite, that I'm yet to look into.

>> Uhh, thanks, but...  you realize nearly all of the gimple-building code
>> is one and the same for the loop and for trailing count misalignment?

> Sorry, the code lacked comments and so I didn't actually try decipering
> the code you generate ;)

Oh, come on, it was planly obscure ;-D

Sorry for posting an early-draft before polishing it up.

> The original motivation was really that esp. for small trip count loops
> the target knows best how to implement them.  Now, that completely
> fails of course in case the target doesn't implement any of this or
> the generic code fails because we lost ctz and alignment info.

In our case, generic code fails because it won't handle variable-sized
clear-by-pieces.  But then, I found out, when it's fixed-size, it also
makes the code worse, because it seems to expand to byte stores even
when the store-to object is known to have wider alignment:

union u {
  long long i;
  char c[8];
} x[8];
int s(union u *p, int k) {
  for (int i = k ? 0 : 3; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
      p[i].c[j] = 0;
    } // becomes a memset to an 8-byte-aligned 8-byte object, then 8 byte stores
  }
}

>> > I think the builtins with alignment and calloc-style element count
>> > will be useful on its own.
>> 
>> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
>> I'm not sure I want to go there.  I'd much rather recover the ctz of the
>> length, and use it in existing code.

> Yeah, but when we generate memcpy there might not be a way to
> store the ctz info until RTL expansion where the magic should really happen 
> ...

True.  It can be recovered without much difficulty in the cases I've
looked at, but it could be lost in others.

> So I'd say go for improving RTL expansion.

'k, thanks

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar

Re: [RFC] test builtin ratio for loop distribution

Reply via email to