https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110660
--- Comment #2 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Richard Biener from comment #1)
> The vectorizer itself could do the merging which means it could also more
> accurately cost things.

It's similar with ARM SVE: https://godbolt.org/z/8cn5j1zTr

vect.dump:
  vect__ifc__33.15_53 = VEC_COND_EXPR <vec_mask_and_49, vect__7.14_50, { 0, ... }>;
  vect__34.16_54 = .COND_ADD (loop_mask_43, vect_res_19.7_40, vect__ifc__33.15_53, vect_res_19.7_40);

optimized.dump:
  vect__34.16_54 = .COND_ADD (vec_mask_and_49, vect_res_19.7_40, vect__7.14_50, vect_res_19.7_40);

There is no vcond_mask left: GCC can fuse the VEC_COND_EXPR with the COND_ADD. I think this pattern in match.pd is what helps:

/* Detect cases in which a VEC_COND_EXPR effectively replaces the
   "else" value of an IFN_COND_*.  */
(for cond_op (COND_BINARY)
 (simplify
  (vec_cond @0 (view_convert? (cond_op @0 @1 @2 @3)) @4)
  (with { tree op_type = TREE_TYPE (@3); }
   (if (element_precision (type) == element_precision (op_type))
    (view_convert (cond_op @0 @1 @2 (view_convert:op_type @4))))))
 (simplify
  (vec_cond @0 @1 (view_convert? (cond_op @2 @3 @4 @5)))
  (with { tree op_type = TREE_TYPE (@5); }
   (if (inverse_conditions_p (@0, @2)
        && element_precision (type) == element_precision (op_type))
    (view_convert (cond_op @2 @3 @4 (view_convert:op_type @1)))))))

> Otherwise think about whether/how such a situation might arise from people
> using RVV intrinsics - how are those exposed to GIMPLE / RTL and at which
> level could that be optimized?  Is it possible to write an intrinsic
> testcase with such opportunity?
For this piece of code, users can easily write intrinsics that produce the same pattern, but I don't think the compiler should optimize that for them. A user can write:

size_t vl = vsetvl;
vbool32_t mask = comparison;
vbool32_t dummy_mask = vmset;
vint32m1_t dummy_else_value = {0};
vint32m1_t op1_1 = vload;
vint32m1_t op1_2 = vmerge (op1_1, dummy_else_value, mask);
vint32m1_t result = vadd (dummy_mask, op0, op1_2, op0, vl);

Writing the intrinsics as above generates the same codegen as this auto-vectorization example. However, I don't think the compiler should optimize such intrinsic code, since it is intentionally written this way by the user. If users want the optimized codegen, they can easily rewrite the intrinsics as follows:

size_t vl = vsetvl;
vbool32_t mask = comparison;
vint32m1_t op1_1 = vload;
vint32m1_t result = vadd (mask, op0, op1_1, op0, vl);

Then they get the optimal codegen directly.

So I am not sure whether such an optimization for auto-vectorization should be done in the middle end (match.pd) or in the backend (combine pass). Are you suggesting I do this in the backend?

Thanks.