On Thu, Sep 15, 2011 at 8:05 PM, Richard Henderson <r...@redhat.com> wrote:
>> +The elements of the input vectors are numbered from left to right across
>> +one or both of the vectors. Each element in the mask specifies a number
>> +of element from the input vector(s). Consider the following example.
>
> It would be more preferable to talk about the memory ordering of the
> elements rather than "left" and "right" which are ambiguous at best.
>
>> +  if (TREE_CODE (mask) == VECTOR_CST)
>> +    {
>> +      tree m_type, call;
>> +      tree fn = targetm.vectorize.builtin_vec_perm (TREE_TYPE (v0), 
>> &m_type);
>> +      /*rtx t;*/
>
> Leftover crap.

Fixed.

>> +
>> +      if (!fn)
>> +     goto vshuffle;
>> +
>> +      if (m_type != TREE_TYPE (TREE_TYPE (mask)))
>> +     {
>> +       int units = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask));
>> +       tree cvt = build_vector_type (m_type, units);
>> +       mask = fold_convert (cvt, mask);
>> +     }
>> +
>> +      call = fold_build1 (ADDR_EXPR, build_pointer_type (TREE_TYPE (fn)), 
>> fn);
>> +      call = build_call_nary (type /* ? */, call, 3, v0, v1, mask);
>> +
>> +      return expand_expr_real_1 (call, target, VOIDmode, EXPAND_NORMAL, 
>> NULL);
>> +    }
>> +
>> +vshuffle:
>> +  gcc_assert (operand_equal_p (v0, v1, 0));
>
> Why can't a non-constant shuffle have different V0 and V1?  That seems
> like a direct violation of the documentation, and any sort of usefulness.

Ok, I agree. The reason why this assert is here is that noone in the
middle-end generates the code that does not meet this assert. In
principle we definitely want to support it in the upcoming patches,
but it would be nice to start with a simple thing.

> Also, while I'm ok with the use of builtin_vec_perm here in the short
> term, I think that in the long term we should simply force the named
> pattern to handle constants.  Then the vectorizer can simply use the
> predicate and the tree code and we can drop the large redundancy of
> builtins with different argument types.
>
> Indeed, once this patch is applied, I think that ought to be the very
> next task in this domain.
>
>> +/* Vector shuffle expression.  A = VEC_SHUFFLE_EXPR<v0, v1, maks>
>
> Typo in "mask".
>
>> +   foreach i in length (mask):
>> +     A = mask[i] < length (v0) ? v0[mask[i]] : v1[mask[i]]
>
> Surely it's v1[mask[i] - length].
>
>> +      if (TREE_CODE (vect) == VECTOR_CST)
>> +        {
>> +            unsigned i;
>
> Indentation is off all through this function.

Fixed.

>> +  mask = gen_rtx_AND (maskmode, mask, mm);
>> +
>> +  /* Convert mask to vector of chars.  */
>> +  mask = simplify_gen_subreg (V16QImode, mask, maskmode, 0);
>> +  mask = force_reg (V16QImode, mask);
>
> Why are you using force_reg to do all the dirty work?  Seems to
> me this should be using expand_normal.  All throughout this
> function.  That would also avoid the need for all of the extra
> force_reg stuff that ought not be there for -O0.

I don't really understand this. As far as I know, expand_normal
"converts" tree to rtx. All my computations are happening at the level
of rtx and force_reg is needed just to bring an rtx expression to the
register of the correct mode. If I am missing something, could you
give an example how can I use expand_normal instead of force_reg in
this particular code.

> I also see that you're not even attempting to use xop_pperm.

As I said, I am happy to experiment with the cases v0 != v1 in the
upcoming patches. Let's just start with a simple thing and see what
kind of issues/problems it would bring.

> Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?

My personal feeling is that it may be the case with v0 != v1, that it
would be more efficient to perform piecewise shuffling rather than
bitwise dances around the masks.

> And give the vshuffle named pattern the wrong number of arguments?

Ok, If I'll make vshuffle to accept only two arguments -- vector and
mask, would it be ok?

> It's certainly possible to handle it, though it takes a few more steps,
> and might well be more efficient as a libgcc function rather than inline.

I don't really understand why it could be more efficient. I thought
that inline gives more chances to the final RTL optimisation.
>
>
> r~
>
>
>

Thanks,
Artem.

Reply via email to