https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #1)
> It's done by r12-1958, it's better for dcache, but worse for icache, small
> benchmark in the commit show broadcast from integer is slightly better than
> constant pool, maybe we should make it as a u-arch specific tuning.

I see it was benchmarked on Intel CPUs, which have a shared register
file.  I was specifically wondering about the AMD case, where any
integer <-> FP/vector boundary crossing incurs a latency penalty.

If there's already code generation using broadcast from scalar memory,
a tunable would be nice to have, given that it makes benchmarking such
a change easy.

In reality what is faster always depends on the surrounding code, but
IMO code size (and uop cache space) wins easily.  Quite likely
doing full vector loads from the constant pool for XMM initialization
is better than broadcast from scalar, possibly even YMM, for the same
reason.

I'm not sure how to best do a micro-benchmark measuring the actual
latency of the variants in question.
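
One possible shape for such a micro-benchmark, as a rough sketch only:
build a loop-carried dependency chain through each variant, so that
the time per iteration approximates the latency of the round trip
rather than the throughput.  Every chain below also pays the same
vector -> GPR move to close the loop, so only the differences between
the three numbers are meaningful.  All names here (chain_gpr_broadcast,
chain_mem_broadcast, chain_vec_load, report, VEC_BARRIER) are
hypothetical, the tables merely stand in for constant pool entries, and
the disassembly needs checking to confirm the compiler emitted the
instructions each variant is meant to exercise.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* GCC/Clang generic vector type: four 32-bit lanes (XMM-sized on x86). */
typedef uint32_t v4su __attribute__((vector_size(16)));

/* Opaque barrier so the compiler cannot fold the vector round trip back
   into scalar code.  "+x" keeps the value in an SSE register on x86;
   the "+m" fallback only exists so the sketch builds elsewhere. */
#if defined(__SSE2__)
#define VEC_BARRIER(v) __asm__ volatile("" : "+x"(v))
#else
#define VEC_BARRIER(v) __asm__ volatile("" : "+m"(v))
#endif

/* Variant 1: broadcast from an integer register (GPR -> vector). */
__attribute__((noinline))
uint32_t chain_gpr_broadcast(uint32_t x, long iters)
{
    for (long i = 0; i < iters; i++) {
        v4su v = { x, x, x, x };
        VEC_BARRIER(v);
        x = v[0];               /* vector -> GPR, closes the chain */
    }
    return x;
}

/* Variant 2: broadcast from scalar memory; the load address depends on
   the previous result, pointer-chasing style. */
__attribute__((noinline))
uint32_t chain_mem_broadcast(const uint32_t *table, long iters)
{
    uint32_t x = 0;
    for (long i = 0; i < iters; i++) {
        uint32_t s = table[x];
        v4su v = { s, s, s, s };
        VEC_BARRIER(v);
        x = v[0];
    }
    return x;
}

/* Variant 3: full vector load, again with a dependent address. */
__attribute__((noinline))
uint32_t chain_vec_load(const v4su *table, long iters)
{
    uint32_t x = 0;
    for (long i = 0; i < iters; i++) {
        v4su v = table[x];
        VEC_BARRIER(v);
        x = v[0];
    }
    return x;
}

/* Time all three chains; all-zero tables keep the index at 0. */
void report(long iters)
{
    static const uint32_t stab[4] = { 0, 0, 0, 0 };
    static const v4su vtab[1] = { { 0, 0, 0, 0 } };

    clock_t t0 = clock();
    uint32_t a = chain_gpr_broadcast(0, iters);
    clock_t t1 = clock();
    uint32_t b = chain_mem_broadcast(stab, iters);
    clock_t t2 = clock();
    uint32_t c = chain_vec_load(vtab, iters);
    clock_t t3 = clock();

    assert(a == 0 && b == 0 && c == 0);
    double k = 1e9 / CLOCKS_PER_SEC / (double)iters;
    printf("gpr broadcast: %.2f ns/iter\n", (double)(t1 - t0) * k);
    printf("mem broadcast: %.2f ns/iter\n", (double)(t2 - t1) * k);
    printf("vector load:   %.2f ns/iter\n", (double)(t3 - t2) * k);
}
```

Calling report() with a large iteration count from main() and
comparing the three figures across -O2 builds with different -march
settings would be one way to use it.  The chains only ever hit L1, so
this says nothing about dcache or icache pressure in real code.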
