On Mon, 13 Aug 2012, Ramana Radhakrishnan wrote:
On 13 August 2012 14:21, Marc Glisse <marc.gli...@inria.fr> wrote:
On Mon, 13 Aug 2012, Richard Guenther wrote:
On Mon, Aug 13, 2012 at 3:12 PM, Ramana Radhakrishnan
<ramana.radhakrish...@linaro.org> wrote:
I guess people will complain soon enough if this causes horrible
performance
regressions in vectorized code.
Not having looked at your patch in great detail,. surely what we don't
want is a situation where 2 constant permutations are converted into
one generic permute. Based on a quick read of your patch I couldn't
work that out. It might be that 2 constant permutes are cheaper than
a generic permute. Have you looked at any examples in that space . I
surely wouldn't like to see a sequence of interleave / transpose
change into a generic permute operation on Neon as that would be far
more expensive than this. It surely needs more testting than just
this bit before going in. The reason being that this would likely take
more registers and indeed produce loads of a constant pool for the new
mask.
What do you call constant / non-constant? The combined permutation is still
constant, although the expansion (in the back-end) might fail to expand it
efficiently and fall back to the generic permutation expansion...
That is exactly what I was worried about. By constant I would expect
something that is expanded as a constant permute by the backend - an
interleave operation or a transpose operation or indeed some of the
funky operations as below in ARM / Neon which is the SIMD architecture
I'm most familiar with .
If you had something that did the following :
uint8x8_t
tst_vrev642_u8 (uint8x8_t __a)
{
uint8x8_t __rv;
uint8x8_t __mask1 = { 7, 6, 5, 4, 3, 2, 1, 0};
uint8x8_t __mask2 = { 0, 8, 2, 10, 4, 12, 6, 14 };
return __builtin_shuffle ( __builtin_shuffle ( __a, __mask1), __mask2) ;
}
I would expect these instructions
vrev64.8 d0, d0
vmov d16, d0 @ v8qi
vtrn.8 d0, d16
bx lr
With your patch I see
tst_vrev642_u8:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
fldd d16, .L2
vtbl.8 d0, {d0}, d16
bx lr
.L3:
.align 3
.L2:
.byte 7
.byte 7
.byte 5
.byte 5
.byte 3
.byte 3
.byte 1
.byte 1
Seems to be one instruction shorter at least ;-) Yes, there can be much
worse regressions than that because of the patch (like 40 instructions
instead of 4, in the x86 backend).
It might be that the backend predicates need tightening in this case
but why not try to get cost info about such combinations rather than
just doing this gratuitously ?
I don't think the word gratuitous is appropriate. middle-end replaces a+-b
with a-b without first asking the backend whether it might be more
efficient. One permutation is better than 2. It just happens that the
range of possible permutations is too large (and the basic instructions
are too strange) for backends to do a good job on them, and thus keeping
toplevel input as a hint is important.
While in this case the ARM port might
be wrong , but in general when the vector permute rewrites were done
we chose to go ahead and keep the generic constant permutes in the
backend as the last resort to fall back to. However if fwprop starts
making such transformations really one ought to get relative costs for
each of these operations rather than allowing gratuitous replacements.
Indeed, having costs would help, but that's a lot of work.
As you can see from the original email, I am willing to limit the
optimization to the case where the combined permutation is the identity
(yes, that's quite drastic...). I should probably implement that special
case anyway, because it doesn't require its argument to have a single use
for the optimization to make sense.
--
Marc Glisse