https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113091
Bug ID: 113091
Summary: Over-estimate SLP vector-to-scalar cost for non-live pattern statement
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---

Gcc fails to vectorize the below testcase on aarch64.

  int test(unsigned array[8]);

  int foo(char *a, char *b)
  {
    unsigned array[8];

    array[0] = (a[0] - b[0]);
    array[1] = (a[1] - b[1]);
    array[2] = (a[2] - b[2]);
    array[3] = (a[3] - b[3]);
    array[4] = (a[4] - b[4]);
    array[5] = (a[5] - b[5]);
    array[6] = (a[6] - b[6]);
    array[7] = (a[7] - b[7]);

    return test(array);
  }

The dump shows that loads to a[i] and b[i] are considered to be live as scalar references, which results in over-estimated vector-to-scalar cost.

  *a_50(D) 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)a_50(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue
  *b_51(D) 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 1B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 3B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 5B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
  MEM[(char *)b_51(D) + 7B] 1 times vec_to_scalar costs 2 in epilogue

Subtraction on char type is recognized as widen-sub, and involves two kinds of pattern replacement.
* Original

  _1 = *a_50(D);
  _2 = (int) _1;
  _3 = *b_51(D);
  _4 = (int) _3;
  _5 = _2 - _4;

* After pattern replacement

  patt_63 = (unsigned short) _1;        // _2 = (int) _1;
  patt_64 = (int) patt_63;              // _2 = (int) _1;

  patt_65 = (unsigned short) _3;        // _4 = (int) _3;
  patt_66 = (int) patt_65;              // _4 = (int) _3;

  patt_67 = .VEC_WIDEN_MINUS (_1, _3);  // _5 = _2 - _4;
  patt_68 = (signed short) patt_67;     // _5 = _2 - _4;
  patt_69 = (int) patt_68;              // _5 = _2 - _4;

For the statement "_2 = (int) _1", its vectorization representative "patt_64 = (int) patt_63" is not marked as PURE_SLP, so it is conservatively considered to have a scalar use and to be live outside of the SLP bb (in the function vect_bb_slp_mark_live_stmts). However, the pattern definition is actually dead and should not contribute to the vector-to-scalar cost.

Defs from pattern statements are not part of the function body, so we cannot track their def/use chains as we do for ordinary SSA names. We may have a quick fix for one situation: if the original SSA name "_2" has a single use, its existence should be covered only by the vectorized operation, regardless of how it would be handled without pattern replacement.
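
The cost effect of the suggested quick fix can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not GCC code: `toy_use`, `toy_use_is_scalar`, `toy_epilogue_cost` and `toy_demo_cost` are hypothetical names, and treating "the user's original SSA result has a single use" as sufficient is one reading of the proposal. The per-extract cost of 2 and the count of 16 loads come from the dump above.

```c
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in GCC.
   Each entry describes the only user of one load (e.g. "_2 = (int) _1").  */
struct toy_use {
    bool pure_slp;        /* is the user's representative marked PURE_SLP?  */
    int result_num_uses;  /* number of uses of the user's original SSA result  */
};

/* Per-lane extraction cost, as reported in the dump.  */
enum { TOY_VEC_TO_SCALAR_COST = 2 };

/* Return true if this use forces the load to stay live as a scalar.  */
static bool toy_use_is_scalar(const struct toy_use *u, bool quick_fix)
{
    if (u->pure_slp)
        return false;   /* fully covered by the SLP graph */
    if (quick_fix && u->result_num_uses == 1)
        return false;   /* assumed: the single use is the vectorized operation */
    return true;        /* conservative default */
}

/* Sum the vec_to_scalar cost over n loads, one user each.  */
static int toy_epilogue_cost(const struct toy_use *uses, int n, bool quick_fix)
{
    int cost = 0;
    for (int i = 0; i < n; i++)
        if (toy_use_is_scalar(&uses[i], quick_fix))
            cost += TOY_VEC_TO_SCALAR_COST;
    return cost;
}

/* Model the testcase: 8 loads from a and 8 from b, each used only by a
   conversion whose pattern representative is not PURE_SLP and whose
   original result has a single use.  */
static int toy_demo_cost(bool quick_fix)
{
    struct toy_use uses[16];
    for (int i = 0; i < 16; i++) {
        uses[i].pure_slp = false;
        uses[i].result_num_uses = 1;
    }
    return toy_epilogue_cost(uses, 16, quick_fix);
}
```

In this model, `toy_demo_cost(false)` charges the 16 extracts seen in the dump, while `toy_demo_cost(true)` charges none. A real change would have to make the equivalent decision on stmt_vec_info data inside vect_bb_slp_mark_live_stmts.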