https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110660

--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #1)
> The vectorizer itself could do the merging which means it could also more
> accurately cost things.
> 

It's similar to ARM SVE:

https://godbolt.org/z/8cn5j1zTr
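
The dumps below come from a conditional-sum reduction; a loop of roughly this shape reproduces them when vectorized for SVE (my reconstruction for illustration, the godbolt link above has the actual testcase):

int
foo (int *a, int *b, int n)
{
  int res = 0;
  for (int i = 0; i < n; i++)
    if (a[i] < b[i])
      res += a[i];
  return res;
}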

vect.dump:

vect__ifc__33.15_53 = VEC_COND_EXPR <vec_mask_and_49, vect__7.14_50, { 0, ... }>;
vect__34.16_54 = .COND_ADD (loop_mask_43, vect_res_19.7_40, vect__ifc__33.15_53, vect_res_19.7_40);


optimized.dump:

vect__34.16_54 = .COND_ADD (vec_mask_and_49, vect_res_19.7_40, vect__7.14_50, vect_res_19.7_40);

No vcond_mask remains; the select has been fused into the .COND_ADD.

GCC can fuse the VEC_COND_EXPR into the COND_ADD; I think this pattern in
match.pd is what helps:

/* Detect cases in which a VEC_COND_EXPR effectively replaces the
   "else" value of an IFN_COND_*.  */
(for cond_op (COND_BINARY)
 (simplify
  (vec_cond @0 (view_convert? (cond_op @0 @1 @2 @3)) @4)
  (with { tree op_type = TREE_TYPE (@3); }
   (if (element_precision (type) == element_precision (op_type))
    (view_convert (cond_op @0 @1 @2 (view_convert:op_type @4))))))
 (simplify
  (vec_cond @0 @1 (view_convert? (cond_op @2 @3 @4 @5)))
  (with { tree op_type = TREE_TYPE (@5); }
   (if (inverse_conditions_p (@0, @2)
        && element_precision (type) == element_precision (op_type))
    (view_convert (cond_op @2 @3 @4 (view_convert:op_type @1)))))))
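
Concretely, the first simplify matches a sequence like this (a schematic GIMPLE sketch of my own, not taken from the dumps above), where the VEC_COND_EXPR selects the .COND_ADD result under the same mask, so the .COND_ADD's "else" value c_8 is dead and can be replaced by d_9:

_1 = .COND_ADD (mask_5, a_6, b_7, c_8);
_2 = VEC_COND_EXPR <mask_5, _1, d_9>;

  -->

_2 = .COND_ADD (mask_5, a_6, b_7, d_9);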



> Otherwise think about whether/how such a situation might arise from people
> using RVV intrinsics - how are those exposed to GIMPLE / RTL and at which
> level could that be optimized?  Is it possible to write an intrinsic
> testcase with such opportunity?

For this piece of code, users can easily write intrinsics that produce the
same pattern, but I don't think the compiler should optimize that for them:

A user can write this code:

size_t vl = vsetvl;                        /* set the vector length */
vbool32_t mask = comparison;               /* mask from some compare */
vbool32_t dummy_mask = vmset;              /* all-ones mask */
vint32m1_t dummy_else_value = {0};
vint32m1_t op1_1 = vload;
vint32m1_t op1_2 = vmerge (op1_1, dummy_else_value, mask);  /* VEC_COND_EXPR: op1_1 where mask is set, else 0 */
vint32m1_t result = vadd (dummy_mask, op0, op1_2, op0, vl); /* .COND_ADD under the all-ones mask */

Writing the intrinsics as above will generate the same codegen as the
auto-vectorization example.
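
For concreteness, here is a minimal sketch of that first variant using the actual __riscv_* intrinsics (the function name, the vmslt compare standing in for "comparison", and the _mu mask-undisturbed policy suffix are my own assumptions for illustration):

#include <stdint.h>
#include <riscv_vector.h>

vint32m1_t
redundant_cond_add (const int32_t *a, const int32_t *b, vint32m1_t op0,
		    size_t n)
{
  size_t vl = __riscv_vsetvl_e32m1 (n);
  vint32m1_t va = __riscv_vle32_v_i32m1 (a, vl);
  vint32m1_t vb = __riscv_vle32_v_i32m1 (b, vl);
  /* Stand-in for "comparison".  */
  vbool32_t mask = __riscv_vmslt_vv_i32m1_b32 (va, vb, vl);
  vbool32_t dummy_mask = __riscv_vmset_m_b32 (vl);
  vint32m1_t dummy_else_value = __riscv_vmv_v_x_i32m1 (0, vl);
  /* VEC_COND_EXPR: vb where mask is set, else 0.  */
  vint32m1_t op1_2 = __riscv_vmerge_vvm_i32m1 (dummy_else_value, vb, mask, vl);
  /* .COND_ADD: op0 + op1_2 where dummy_mask is set, else op0.  */
  return __riscv_vadd_vv_i32m1_mu (dummy_mask, op0, op0, op1_2, vl);
}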

However, I don't think the compiler should optimize this intrinsic code,
since it is intentional code written by the user. If users want to optimize
this codegen, they can easily modify the intrinsics as follows:

size_t vl = vsetvl;
vbool32_t mask = comparison;
vint32m1_t op1_1 = vload;
vint32m1_t result = vadd (mask, op0, op1_1, op0, vl);

Then the user gets optimal codegen.
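
In the concrete sketch above, that amounts to dropping the vmerge and masking the vadd directly (same naming assumptions as before):

vint32m1_t
optimal_cond_add (const int32_t *a, const int32_t *b, vint32m1_t op0,
		  size_t n)
{
  size_t vl = __riscv_vsetvl_e32m1 (n);
  vint32m1_t va = __riscv_vle32_v_i32m1 (a, vl);
  vint32m1_t vb = __riscv_vle32_v_i32m1 (b, vl);
  vbool32_t mask = __riscv_vmslt_vv_i32m1_b32 (va, vb, vl);
  /* Single masked add: op0 + vb where mask is set, else op0.  */
  return __riscv_vadd_vv_i32m1_mu (mask, op0, op0, vb, vl);
}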

So, I am not sure whether such an optimization for auto-vectorization should
be done in the middle-end (match.pd) or in the backend (combine pass).

Are you suggesting that I do this in the backend?

Thanks.
