Re: [RFC PR48941 / 51980] Rewrite arm_neon.h to use __builtin_shuffle

Richard Guenther Tue, 12 Jun 2012 01:53:53 -0700

On Mon, Jun 11, 2012 at 5:46 PM, Ramana Radhakrishnan
<ramana.radhakrish...@linaro.org> wrote:
> Hi,
>
>        A number of the vector permute intrinsics in arm_neon.h end up
> generating quite wasteful code because we end up packing these into
> larger types. While looking at a particularly bad example and the
> infamous PR48941 and cases that a lot of these large intrinsic forms
> could be represented as only affecting their constituent parts with lo
> and high style operations. The one thing I'm missing handling here is
> the case with vext which we could do once vector permute support
> handles the right thing.
>
> I don't like the ML bits of the patch as it stands today and before
> committing I would like to clean up the ML bits quite a bit further
> especially in areas where I've put FIXMEs and before you ask - yes I
> am trying to get some execute testcases in for all these that might be
> useful. I will also point out that this implementation actually makes
> things worse at -O0 given we don't constant propagate the mask into
> the VEC_PERM_EXPR which is actually a regression compared to the
> current state of the art (yes at O0 but I'm sure someone somewhere
> will care about that.) I did think about big-endian but surely that
> should not be a problem in this case as the operations in this case
> (i.e. zip , unzip , rev64, rev32, rev16, transpose) really should be
> the same on both endian-ness.  I am not setup with a big-endian system
> to do some testing on but looking at the code coming out it's
> identical to what's coming out on little endian systems.It's been
> through a full round of testing with a cross-compiler and there are
> some fallouts with the neon intrinsics tests failing but that's a
> result of these instructions not getting generated at O0.
>
> There are a few ways I can think of for dealing with this -
>
> 1.  We check at lowering time of vec_perm_expr if the mask is actually
> associated with a constant - should be an extra constant time check
> I'd think and if so, do a simple constant propagate type operation at
> that point. Is that reasonable ?


ISTR that at -O0 what you might end up seeing is the constant hidden
via a store/load pair, so it's not that easy.  If you have a testcase I'll have
a quick look.

> 2.  We annotate arm_neon.h so that the relevant functions are all
> compiled at O1 so that such constant propagation would occur within
> just these functions. However we need to fix the backend so that
> target_pragma_parse and friends work fine which is a nice side-effect
> of doing that.

I would definitely not do that - you won't get inlining of the
intrinsics then ;)

> 3.  Allow __builtin_shuffle to take constant vectors as parameters (
> unfortunately that means a change and I'm not sure if that's good in
> terms of compatibility with OpenCL )

__builtin_shuffle already allows constant vectors as parameters (but you
cannot express a vector constant in C or GNU C, so it requires optimization
to put it back in ... ).  But maybe I am missing something - what case
are you thinking about?

> 4.  Define a "new" md builtin which is lowered into a vec_perm_expr
> with a constant mask using targetm.fold_builtin.

I don't see how that helps, given a function call already allows constants
as arguments.

> What would be considered the least worse option out of these or is
> there another way that could be used .

Eventually just deal with the regression at -O0?

Thanks,
Richard.

> Thus I thought I'd put this out there for some comments on the ML bits
> and in case anyone else also wanted to play with this. With the simple
> testcases I've played with
>
>  * Test from PR48941
>  * Test from PR51980
>  * A couple of routines that I use as testcases for some more complex
> use of some of the intrinsics.
>
> I see a significant improvement in code generated with the diffs being
> attached for the testcases from PR48941 and PR51980. Thoughts,
> opinions , brickbats ?
>
>
> regards,
> Ramana
>
>
>
>        * config/arm/neon-gen.ml (gcc_builtin_shuffle): New.
>        (return_by_ptr): Delete.
>        (base_type): New helper function.
>        (masktype): Likewise.
>        (num_vec_elt): Likewise.
>        (range): Likewise.
>        (gen_revmask): Likewise.
>        (int_rev_mask): New function and use some of the reverse helper
>        functions.
>        (permute_range): Likewise.
>        (zip_range): Likewise.
>        (uzip_range): Likewise.
>        (trn_range): Likewise.
>        (init_zip_mask): Likewise and use the permutation helper functions.
>        (perm_locode): New function.
>        (perm_hicode): Likewise.
>        (return): Delete handling of return_by_ptr. Handle the 
> gcc_builtin_shuffle case
>        for the vector permutes.
>        (params): Delete handling of return_by_ptr.
>        * config/arm/neon.ml: Update copyright years.
>        (shuffletype): New type.
>        (features): New feature GCCBuiltinShuffle. Delete ReturnPtr.
>        (ops): Use for Vrev64, Vrev32, Vrev16, Vtrn, Vzip and Vunzip.
>        * config/arm/arm_neon.h: Regenerate.

Re: [RFC PR48941 / 51980] Rewrite arm_neon.h to use __builtin_shuffle

Reply via email to