https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141
--- Comment #1 from Richard Yao <richard.yao at alumni dot stonybrook.edu> ---
As an additional comment, while Clang does a good job on this function, it
could do better. Specifically, this uses one fewer instruction:
convert_fp32_to_bfloat16:
        vmovups (%rdi), %ymm0
        vpsrld  $16, %ymm0, %ymm0
        vphaddw %ymm0, %ymm0, %ymm0
        vmovdqu %xmm0, (%rsi)
        vzeroupper
        ret
Using vphaddw to do the narrowing of __builtin_convertvector() works here
because we know the upper 16 bits of every 32-bit lane are 0 after the shift,
so adding each pair of adjacent 16-bit words just yields the lower word. That
said, I am not sure this would be a worthwhile optimization to implement
once the original optimization bug is fixed.
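
To illustrate the reasoning about the zeroed upper halves, here is a minimal
scalar sketch in portable C. It is not the code from this bug report: the
function names fp32_to_bf16_trunc and lane_shift_then_hadd are hypothetical,
and the model covers only a single 128-bit lane (four floats), since the
256-bit form of vphaddw adds pairs within each 128-bit lane separately.

```c
#include <stdint.h>
#include <string.h>

/* Truncating fp32 -> bfloat16: keep the upper 16 bits of the bit pattern. */
static uint16_t fp32_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical scalar model of one 128-bit lane: a 32-bit logical right
 * shift by 16 (vpsrld $16), then a sum of each adjacent pair of 16-bit
 * words (the first-source half of vphaddw). */
static void lane_shift_then_hadd(const float in[4], uint16_t out[4])
{
    uint16_t words[8];

    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        bits >>= 16;                                   /* vpsrld $16 */
        words[2 * i]     = (uint16_t)(bits & 0xFFFFu); /* low word */
        words[2 * i + 1] = (uint16_t)(bits >> 16);     /* high word, always 0 */
    }

    /* Pairwise add: low + 0 == low, so the sum is the narrowed value. */
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(words[2 * i] + words[2 * i + 1]);
}
```

For any four inputs, out[i] equals fp32_to_bf16_trunc(in[i]), because the
second addend of every pair is zero and the 16-bit sum cannot overflow.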