https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119626
--- Comment #6 from mcccs at gmx dot com ---
Lastly, I would like to mention why this is such an important issue for the use
of __bf16, and why __bf16 is otherwise very inefficient: bfcvt is not only used
for explicit casts. Consider the following code:
__bf16 a[4];
void multiply() {
    for (int i = 0; i < 4; i++)
        a[i] *= 16;
}
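Although no cast is written, it does involve the bfcvt instruction, because the
product has to be narrowed back to __bf16 before it is stored.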
The function compiles to:
Clang O3 -bf16: 13 instructions
Clang O3 +bf16: 8 instructions
GCC O3 +bf16: 43 instructions
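
To illustrate where the conversion hides in multiply(), here is a minimal
portable sketch that emulates the arithmetic in software. It assumes the
multiplication is carried out in float and the result is narrowed back with
round-to-nearest-even, which is roughly the job bfcvt does in hardware under
the default rounding mode. The names bf16_to_float, float_to_bf16, a_bits and
multiply_emulated are hypothetical, the NaN handling is a simplification, and
none of this is what either compiler actually emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a bf16 bit pattern to float. This is exact,
   since bf16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow float to bf16 with round-to-nearest-even.
   This is the step that a single bfcvt would replace (assumption: default
   rounding mode; NaNs are simply quieted here). */
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7fffffffu) > 0x7f800000u)      /* NaN: keep it a NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7fffu + ((bits >> 16) & 1u);       /* round to nearest, ties to even */
    return (uint16_t)(bits >> 16);
}

/* Stand-in for the __bf16 a[4] array, stored as raw bit patterns. */
static uint16_t a_bits[4];

/* Emulated equivalent of: a[i] *= 16; */
void multiply_emulated(void) {
    for (int i = 0; i < 4; i++) {
        float wide = bf16_to_float(a_bits[i]);   /* implicit widening */
        wide *= 16.0f;                           /* arithmetic in float */
        a_bits[i] = float_to_bf16(wide);         /* implicit narrowing */
    }
}
```

Without a native narrowing instruction, the rounding logic in float_to_bf16
has to be expanded inline for every element, which is one plausible reason
the instruction counts above differ so much.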
Judging by a comparison with Clang, there seem to be two parts to solving the
problem. The first is to ensure that
__bf16 convert(float x) {
    return (__bf16) x;
}
uses bfcvt. The second is to ensure that
void convert2(float * __restrict a, __bf16 * __restrict x) {
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}
can be vectorized even with march=...-bf16