https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110660

--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 14 Jul 2023, juzhe.zhong at rivai dot ai wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110660
> 
> --- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> (In reply to Richard Biener from comment #1)
> > The vectorizer itself could do the merging which means it could also more
> > accurately cost things.
> > 
> 
> It's similar to ARM SVE:
> 
> https://godbolt.org/z/8cn5j1zTr
> 
> vect.dump:
> 
>   vect__ifc__33.15_53 = VEC_COND_EXPR <vec_mask_and_49, vect__7.14_50, { 0, ... }>;
>   vect__34.16_54 = .COND_ADD (loop_mask_43, vect_res_19.7_40, vect__ifc__33.15_53, vect_res_19.7_40);
> 
> 
> optimized.dump:
> 
>   vect__34.16_54 = .COND_ADD (vec_mask_and_49, vect_res_19.7_40, vect__7.14_50, vect_res_19.7_40);
> 
> No separate vcond_mask is left in the optimized dump.
> 
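> For reference, a scalar conditional-reduction loop of roughly this shape (a
> hypothetical reconstruction of the godbolt testcase, not copied from the PR;
> names are made up) is what the vectorizer turns into the VEC_COND_EXPR
> feeding the .COND_ADD above:
> 
> int foo (int *a, int *cond, int n)
> {
>   int res = 0;
>   for (int i = 0; i < n; i++)
>     if (cond[i])      /* becomes the mask (vec_mask_and_49) */
>       res += a[i];    /* a[i] -> vect__7.14_50, res -> vect_res_19.7_40 */
>   return res;
> }
> 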
> GCC can fuse the VEC_COND_EXPR with the COND_ADD; I think this pattern in
> match.pd is what makes that happen:
> 
> /* Detect cases in which a VEC_COND_EXPR effectively replaces the
>    "else" value of an IFN_COND_*.  */
> (for cond_op (COND_BINARY)
>  (simplify
>   (vec_cond @0 (view_convert? (cond_op @0 @1 @2 @3)) @4)
>   (with { tree op_type = TREE_TYPE (@3); }
>    (if (element_precision (type) == element_precision (op_type))
>     (view_convert (cond_op @0 @1 @2 (view_convert:op_type @4))))))
>  (simplify
>   (vec_cond @0 @1 (view_convert? (cond_op @2 @3 @4 @5)))
>   (with { tree op_type = TREE_TYPE (@5); }
>    (if (inverse_conditions_p (@0, @2)
>         && element_precision (type) == element_precision (op_type))
>     (view_convert (cond_op @2 @3 @4 (view_convert:op_type @1)))))))
> 
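> Read concretely, the first simplify rule above rewrites (SSA names here are
> illustrative, not taken from any dump):
> 
>   _1 = .COND_ADD (mask_2, a_3, b_4, c_5);
>   _6 = VEC_COND_EXPR <mask_2, _1, d_7>;
> 
> into
> 
>   _6 = .COND_ADD (mask_2, a_3, b_4, d_7);
> 
> i.e. the VEC_COND_EXPR's "else" value simply replaces the "else" value of the
> conditional operation.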
> 
> 
> > Otherwise think about whether/how such a situation might arise from people
> > using RVV intrinsics - how are those exposed to GIMPLE / RTL and at which
> > level could that be optimized?  Is it possible to write an intrinsic
> > testcase with such opportunity?
> 
> For this piece of code, users can easily write intrinsics that produce the
> same thing, but I don't think the compiler should optimize it for them:
> 
> A user can write this code:
> 
> size_t vl = vsetvl;
> vbool32_t mask = comparison;
> vbool32_t dummy_mask = vmset;
> vint32m1_t dummy_else_value = {0};
> vint32m1_t op1_1 = vload;
> vint32m1_t op1_2 = vmerge (op1_1, dummy_else_value, mask);
> vint32m1_t result = vadd (dummy_mask, op0, op1_2, op0, vl);
> 
> Writing the intrinsics as above generates the same codegen as the
> auto-vectorized example.
> 
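> Spelled out with real RVV intrinsics, that pseudocode would look roughly like
> this (a sketch only; intrinsic names and argument order follow my reading of
> the rvv-intrinsic spec, not anything in the PR):
> 
> #include <riscv_vector.h>
> 
> vint32m1_t foo (vint32m1_t op0, int32_t *p, vint32m1_t a, vint32m1_t b, size_t n)
> {
>   size_t vl = __riscv_vsetvl_e32m1 (n);
>   vbool32_t mask = __riscv_vmslt_vv_i32m1_b32 (a, b, vl);   /* comparison */
>   vbool32_t dummy_mask = __riscv_vmset_m_b32 (vl);          /* all-ones mask */
>   vint32m1_t dummy_else = __riscv_vmv_v_x_i32m1 (0, vl);    /* {0, ...} */
>   vint32m1_t op1_1 = __riscv_vle32_v_i32m1 (p, vl);         /* vload */
>   vint32m1_t op1_2 = __riscv_vmerge_vvm_i32m1 (dummy_else, op1_1, mask, vl);
>   /* mask-undisturbed add: inactive lanes keep op0 */
>   return __riscv_vadd_vv_i32m1_mu (dummy_mask, op0, op0, op1_2, vl);
> }
> 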
> However, I don't think the compiler should optimize this intrinsic code,
> since it is intentionally written that way by the user. If the user wants
> better codegen, they can easily rewrite the intrinsics as follows:
> 
> size_t vl = vsetvl;
> vbool32_t mask = comparison;
> vint32m1_t op1_1 = vload;
> vint32m1_t result = vadd (mask, op0, op1_1, op0, vl);
> 
> Then the user gets the optimal codegen.
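> 
> In the same (hypothetical) intrinsic spelling, that collapses to a single
> masked add:
> 
>   size_t vl = __riscv_vsetvl_e32m1 (n);
>   vbool32_t mask = __riscv_vmslt_vv_i32m1_b32 (a, b, vl);
>   vint32m1_t op1_1 = __riscv_vle32_v_i32m1 (p, vl);
>   /* inactive lanes keep op0; no vmset / vmerge needed */
>   vint32m1_t result = __riscv_vadd_vv_i32m1_mu (mask, op0, op0, op1_1, vl);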

Sure.  For that to work reliably the intrinsics need to stay target
builtins and UNSPECs, but I'm not entirely convinced this is always
what users want.

> So I am not sure whether such an optimization for auto-vectorization should be
> done in the middle end (match.pd) or in the backend (combine pass).
> 
> Are you suggesting I do this in the backend?

If there's a match.pd pattern doing this for SVE, try to extend that.
