https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103571
--- Comment #19 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #17)
> (In reply to Hongtao.liu from comment #16)
> > There are already testcases for vec_extract/vec_set/vec_duplicate, but those
> > testcases are written under TARGET_AVX512FP16; I'll make a copy of them and
> > test them w/o avx512fp16.
>
> Also we can relax the condition of extendv*hfv*sf and truncv*sfv*hf to
> avx512vl/f16c so that vect-float16-1.c could be vectorized.
>
> vect-float16-1.c
>
> void
> foo (_Float16 *__restrict__ a, _Float16 *__restrict__ b,
>      _Float16 *__restrict__ c)
> {
>   for (int i = 0; i < 256; i++)
>     a[i] = b[i] + c[i];
> }

Even with support for extend_optab/trunc_optab, veclower still lowers the
v8hf addition to a scalar version. The mismatch is that the vectorizer
assumes '+/-' is supported by default (it doesn't check the optab, it only
checks whether the v8hf mode is supported via vector_mode_supported_p) and
vectorizes the loop, but veclower then lowers the vector operation back to
scalars, which creates much worse code than the non-vectorized version.
After the loop vectorizer, the dump is quite optimized:

  vect__4.6_27 = MEM <vector(4) _Float16> [(_Float16 *)vectp_b.4_29];
  vect__6.9_24 = MEM <vector(4) _Float16> [(_Float16 *)vectp_c.7_26];
  vect__8.10_23 = vect__4.6_27 + vect__6.9_24;
  MEM <vector(4) _Float16> [(_Float16 *)vectp_a.11_22] = vect__8.10_23;
  vectp_b.4_28 = vectp_b.4_29 + 8;
  vectp_c.7_25 = vectp_c.7_26 + 8;
  vectp_a.11_21 = vectp_a.11_22 + 8;

But after veclower:

  vect__4.6_4 = MEM <vector(4) _Float16> [(_Float16 *)b_12(D)];
  vect__6.9_5 = MEM <vector(4) _Float16> [(_Float16 *)c_13(D)];
  _28 = BIT_FIELD_REF <vect__4.6_4, 16, 0>;
  _25 = BIT_FIELD_REF <vect__6.9_5, 16, 0>;
  _21 = _28 + _25;
  _15 = BIT_FIELD_REF <vect__4.6_4, 16, 16>;
  _10 = BIT_FIELD_REF <vect__6.9_5, 16, 16>;
  _17 = _15 + _10;
  _22 = BIT_FIELD_REF <vect__4.6_4, 16, 32>;
  _26 = BIT_FIELD_REF <vect__6.9_5, 16, 32>;
  _29 = _22 + _26;
  _20 = BIT_FIELD_REF <vect__4.6_4, 16, 48>;
  _3 = BIT_FIELD_REF <vect__6.9_5, 16, 48>;
  _2 = _20 + _3;
  vect__8.10_6 = {_21, _17, _29, _2};
  MEM <vector(4) _Float16> [(_Float16 *)a_14(D)] = vect__8.10_6;
  vectp_b.4_8 = b_12(D) + 8;
  vectp_c.7_16 = c_13(D) + 8;
  vectp_a.11_30 = a_14(D) + 8;
  vect__4.6_27 = MEM <vector(4) _Float16> [(_Float16 *)vectp_b.4_8];
  vect__6.9_24 = MEM <vector(4) _Float16> [(_Float16 *)vectp_c.7_16];
  _1 = BIT_FIELD_REF <vect__4.6_27, 16, 0>;
  _19 = BIT_FIELD_REF <vect__6.9_24, 16, 0>;
  _31 = _1 + _19;
  _9 = BIT_FIELD_REF <vect__4.6_27, 16, 16>;
  _32 = BIT_FIELD_REF <vect__6.9_24, 16, 16>;
  _33 = _9 + _32;
  _34 = BIT_FIELD_REF <vect__4.6_27, 16, 32>;
  _35 = BIT_FIELD_REF <vect__6.9_24, 16, 32>;
  _36 = _34 + _35;
  _37 = BIT_FIELD_REF <vect__4.6_27, 16, 48>;
  _38 = BIT_FIELD_REF <vect__6.9_24, 16, 48>;
  _39 = _37 + _38;
  vect__8.10_23 = {_31, _33, _36, _39};
  MEM <vector(4) _Float16> [(_Float16 *)vectp_a.11_30] = vect__8.10_23;

Could veclower try a wider mode for the addition? Even if it can, vNhf
modes had better only be supported under avx512vl or f16c; otherwise the
vectorized code is really bad, and then why should we support those vector
modes on a generic target?