https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103571

--- Comment #19 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #17)
> (In reply to Hongtao.liu from comment #16)
> > There're already testcases for vec_extract/vec_set/vec_duplicate, but those
> > testcases are written under TARGET_AVX512FP16, i'll make a copy of them and
> > test them w/o avx512fp16.
> 
> Also we can relax condition of extendv*hfv*sf and truncv*sfv*hf to
> avx512vl/f16c so that vect-float16-1.c could be vectorized.
> 
> vect-float16-1.c
> 
> void
> foo (_Float16 *__restrict__ a, _Float16 *__restrict__ b,
>      _Float16 *__restrict__ c)
> {
>   for (int i = 0; i < 256; i++)
>     a[i] = b[i] + c[i];
> }

Even w/ support of extend_optab/trunc_optab, veclower still lowers the v8hf
addition to a scalar version. The mismatch is that the vectorizer assumes '+/-'
is supported by default (it doesn't check the optab, it only checks whether
v8hf is supported via vector_mode_supported_p) and then vectorizes the loop,
but veclower lowers the vector operation back to scalars, which creates much
worse code than the non-vectorized version.

After the loop vectorizer, the dump is quite optimized:
  vect__4.6_27 = MEM <vector(4) _Float16> [(_Float16 *)vectp_b.4_29];
  vect__6.9_24 = MEM <vector(4) _Float16> [(_Float16 *)vectp_c.7_26];
  vect__8.10_23 = vect__4.6_27 + vect__6.9_24;
  MEM <vector(4) _Float16> [(_Float16 *)vectp_a.11_22] = vect__8.10_23;
  vectp_b.4_28 = vectp_b.4_29 + 8;
  vectp_c.7_25 = vectp_c.7_26 + 8;
  vectp_a.11_21 = vectp_a.11_22 + 8;

But after veclower:

  vect__4.6_4 = MEM <vector(4) _Float16> [(_Float16 *)b_12(D)];
  vect__6.9_5 = MEM <vector(4) _Float16> [(_Float16 *)c_13(D)];
  _28 = BIT_FIELD_REF <vect__4.6_4, 16, 0>;
  _25 = BIT_FIELD_REF <vect__6.9_5, 16, 0>;
  _21 = _28 + _25;
  _15 = BIT_FIELD_REF <vect__4.6_4, 16, 16>;
  _10 = BIT_FIELD_REF <vect__6.9_5, 16, 16>;
  _17 = _15 + _10;
  _22 = BIT_FIELD_REF <vect__4.6_4, 16, 32>;
  _26 = BIT_FIELD_REF <vect__6.9_5, 16, 32>;
  _29 = _22 + _26;
  _20 = BIT_FIELD_REF <vect__4.6_4, 16, 48>;
  _3 = BIT_FIELD_REF <vect__6.9_5, 16, 48>;
  _2 = _20 + _3;
  vect__8.10_6 = {_21, _17, _29, _2};
  MEM <vector(4) _Float16> [(_Float16 *)a_14(D)] = vect__8.10_6;
  vectp_b.4_8 = b_12(D) + 8;
  vectp_c.7_16 = c_13(D) + 8;
  vectp_a.11_30 = a_14(D) + 8;
  vect__4.6_27 = MEM <vector(4) _Float16> [(_Float16 *)vectp_b.4_8];
  vect__6.9_24 = MEM <vector(4) _Float16> [(_Float16 *)vectp_c.7_16];
  _1 = BIT_FIELD_REF <vect__4.6_27, 16, 0>;
  _19 = BIT_FIELD_REF <vect__6.9_24, 16, 0>;
  _31 = _1 + _19;
  _9 = BIT_FIELD_REF <vect__4.6_27, 16, 16>;
  _32 = BIT_FIELD_REF <vect__6.9_24, 16, 16>;
  _33 = _9 + _32;
  _34 = BIT_FIELD_REF <vect__4.6_27, 16, 32>;
  _35 = BIT_FIELD_REF <vect__6.9_24, 16, 32>;
  _36 = _34 + _35;
  _37 = BIT_FIELD_REF <vect__4.6_27, 16, 48>;
  _38 = BIT_FIELD_REF <vect__6.9_24, 16, 48>;
  _39 = _37 + _38;
  vect__8.10_23 = {_31, _33, _36, _39};
  MEM <vector(4) _Float16> [(_Float16 *)vectp_a.11_30] = vect__8.10_23;


Could veclower try a widened mode for the addition? Even if it can, vNhf modes
had better only be supported under avx512vl or f16c; otherwise the vectorized
code is really bad, and then why should we support the vector modes under the
generic target at all?
