On Feb 4, 2021, Richard Biener <richard.guent...@gmail.com> wrote:

>> > b) if expansion would use BY_PIECES then expand to an unrolled loop
>>
>> Why would that be better than keeping the constant-length memset call,
>> that would be turned into an unrolled loop during expand?

> Well, because of the possibly lost ctz and alignment info.

Funny you should mention that.  I got started with the expand-time
expansion yesterday, and found out that we're not using the alignment
information that is available.  Though the pointer is known to point to
an aligned object, we are going for 8-bit alignment for some reason.

The strategy I used there was to first check whether by_pieces would
expand inline a constant length near the max known length, then loop
over the bits in the variable length, expanding in each iteration a
constant-length store-by-pieces for the fixed length corresponding to
that bit, guarded by a test comparing the variable length with the
fixed length.  We may get larger code this way (no loops), but only
O(log(len)) compares.

I've also fixed some bugs in the ldist expander, so now it bootstraps,
but with a few regressions in the testsuite that I have yet to look
into.

>> Uhh, thanks, but... you realize nearly all of the gimple-building code
>> is one and the same for the loop and for trailing count misalignment?

> Sorry, the code lacked comments and so I didn't actually try deciphering
> the code you generate ;)

Oh, come on, it was plainly obscure ;-D

Sorry for posting an early draft before polishing it up.

> The original motivation was really that esp. for small trip count loops
> the target knows best how to implement them.  Now, that completely
> fails of course in case the target doesn't implement any of this or
> the generic code fails because we lost ctz and alignment info.

In our case, generic code fails because it won't handle variable-sized
clear-by-pieces.  But then, I found out, when it's fixed-size, it also
makes the code worse, because it seems to expand to byte stores even
when the store-to object is known to have wider alignment:

union u {
  long long i;
  char c[8];
} x[8];
void s(union u *p, int k) {
  for (int i = k ? 0 : 3; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
      p[i].c[j] = 0;
    } // becomes a memset to an 8-byte-aligned 8-byte object,
      // then 8 single-byte stores
  }
}

>> > I think the builtins with alignment and calloc-style element count
>> > will be useful on its own.
>>
>> Oh, I see, you're suggesting actual separate builtin functions.  Uhh...
>> I'm not sure I want to go there.  I'd much rather recover the ctz of the
>> length, and use it in existing code.

> Yeah, but when we generate memcpy there might not be a way to
> store the ctz info until RTL expansion where the magic should really happen
> ...

True.  It can be recovered without much difficulty in the cases I've
looked at, but it could be lost in others.

> So I'd say go for improving RTL expansion.

'k, thanks

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist         GNU Toolchain Engineer
        Vim, Vi, Voltei pro Emacs -- GNUlius Caesar
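
For illustration only, the bit-by-bit expansion strategy described earlier in this message can be approximated in plain C.  This is a rough sketch, not the actual expander code: `clear_by_bits` and `max_len` are hypothetical names, each `memset` stands in for one constant-length store-by-pieces, and the runtime loop stands in for what the expander would emit fully unrolled (one guarded store per bit of the maximum length).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: clear LEN bytes at P, given that LEN is known
   not to exceed MAX_LEN.  At expand time the loop below would be
   unrolled, so each iteration corresponds to one compare plus one
   fixed-size store sequence in the generated code.  */
static void
clear_by_bits (unsigned char *p, size_t len, size_t max_len)
{
  assert (len <= max_len);

  /* Largest power of two that does not exceed MAX_LEN (1 if MAX_LEN
     is 0, in which case LEN is 0 and nothing gets stored).  */
  size_t bit = 1;
  while (bit <= max_len / 2)
    bit <<= 1;

  /* One guarded fixed-size clear per bit: O(log (max_len)) compares.  */
  for (; bit != 0; bit >>= 1)
    if (len >= bit)
      {
        memset (p, 0, bit);   /* stand-in for store-by-pieces of BIT bytes */
        p += bit;
        len -= bit;
      }
}
```

With `max_len` of 31, for instance, this performs five compares (against 16, 8, 4, 2 and 1) and, for a length of 13, three fixed-size clears of 8, 4 and 1 bytes.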