On Mon, Jun 11, 2012 at 5:46 PM, Ramana Radhakrishnan <ramana.radhakrish...@linaro.org> wrote: > Hi, > > A number of the vector permute intrinsics in arm_neon.h end up > generating quite wasteful code because we end up packing these into > larger types. While looking at a particularly bad example and the > infamous PR48941 and cases that a lot of these large intrinsic forms > could be represented as only affecting their constituent parts with lo > and high style operations. The one thing I'm missing handling here is > the case with vext which we could do once vector permute support > handles the right thing. > > I don't like the ML bits of the patch as it stands today and before > committing I would like to clean up the ML bits quite a bit further > especially in areas where I've put FIXMEs and before you ask - yes I > am trying to get some execute testcases in for all these that might be > useful. I will also point out that this implementation actually makes > things worse at -O0 given we don't constant propagate the mask into > the VEC_PERM_EXPR which is actually a regression compared to the > current state of the art (yes at O0 but I'm sure someone somewhere > will care about that.) I did think about big-endian but surely that > should not be a problem in this case as the operations in this case > (i.e. zip , unzip , rev64, rev32, rev16, transpose) really should be > the same on both endian-ness. I am not setup with a big-endian system > to do some testing on but looking at the code coming out it's > identical to what's coming out on little endian systems.It's been > through a full round of testing with a cross-compiler and there are > some fallouts with the neon intrinsics tests failing but that's a > result of these instructions not getting generated at O0. > > There are a few ways I can think of for dealing with this - > > 1. We check at lowering time of vec_perm_expr if the mask is actually > associated with a constant - should be an extra constant time check > I'd think and if so, do a simple constant propagate type operation at > that point. Is that reasonable ?
ISTR that at -O0 what you might end up seeing is the constant hidden via a store/load pair, so it's not that easy. If you have a testcase I'll have a quick look. > 2. We annotate arm_neon.h so that the relevant functions are all > compiled at O1 so that such constant propagation would occur within > just these functions. However we need to fix the backend so that > target_pragma_parse and friends work fine which is a nice side-effect > of doing that. I would definitely not do that - you won't get inlining of the intrinsics then ;) > 3. Allow __builtin_shuffle to take constant vectors as parameters ( > unfortunately that means a change and I'm not sure if that's good in > terms of compatibility with OpenCL ) __builtin_shuffle already allows constant vectors as parameters (but you cannot express a vector constant in C or GNU C, so it requires optimization to put it back in ... ). But maybe I am missing something - what case are you thinking about? > 4. Define a "new" md builtin which is lowered into a vec_perm_expr > with a constant mask using targetm.fold_builtin. I don't see how that helps, given a function call already allows constants as arguments. > What would be considered the least worse option out of these or is > there another way that could be used . Eventually just deal with the regression at -O0? Thanks, Richard. > Thus I thought I'd put this out there for some comments on the ML bits > and in case anyone else also wanted to play with this. With the simple > testcases I've played with > > * Test from PR48941 > * Test from PR51980 > * A couple of routines that I use as testcases for some more complex > use of some of the intrinsics. > > I see a significant improvement in code generated with the diffs being > attached for the testcases from PR48941 and PR51980. Thoughts, > opinions , brickbats ? > > > regards, > Ramana > > > > * config/arm/neon-gen.ml (gcc_builtin_shuffle): New. > (return_by_ptr): Delete. > (base_type): New helper function. > (masktype): Likewise. > (num_vec_elt): Likewise. > (range): Likewise. > (gen_revmask): Likewise. > (int_rev_mask): New function and use some of the reverse helper > functions. > (permute_range): Likewise. > (zip_range): Likewise. > (uzip_range): Likewise. > (trn_range): Likewise. > (init_zip_mask): Likewise and use the permutation helper functions. > (perm_locode): New function. > (perm_hicode): Likewise. > (return): Delete handling of return_by_ptr. Handle the > gcc_builtin_shuffle case > for the vector permutes. > (params): Delete handling of return_by_ptr. > * config/arm/neon.ml: Update copyright years. > (shuffletype): New type. > (features): New feature GCCBuiltinShuffle. Delete ReturnPtr. > (ops): Use for Vrev64, Vrev32, Vrev16, Vtrn, Vzip and Vunzip. > * config/arm/arm_neon.h: Regenerate.