https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #1)
> It's done by r12-1958; it's better for dcache but worse for icache.  A small
> benchmark in the commit shows that broadcast from integer is slightly better
> than a constant-pool load, so maybe we should make it a u-arch specific
> tuning.

I see it was benchmarked on Intel CPUs, which have a shared register file; I
was specifically wondering about the AMD case, where any integer <-> FP/vector
boundary crossing incurs a latency penalty.

If there's already code generation using broadcast from scalar memory, a
tunable would be nice to have, since that makes benchmarking such a change
easy.  In reality what is faster always depends on the surrounding code, but
IMO code size (and uop cache space) wins easily.  Quite likely doing full
vector loads from the constant pool for XMM initialization is better than
broadcast from scalar, possibly even for YMM, for the same reason.

I'm not sure how to best do a micro-benchmark measuring the actual latency of
the variants in question.
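
A minimal sketch of one possible micro-benchmark, assuming an AVX2 target and
GCC; nothing below is taken from the commit, and the function names, the
constant and the iteration count are made up for illustration.  It times a
throughput-style loop around each variant rather than isolated instruction
latency, so it is only a starting point; build with e.g. gcc -O2 -mavx2 and
cross-check the cycle counts with perf stat.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc */

#define ITERS 100000000u

static const uint32_t pool[4] __attribute__((aligned(16))) =
  { 0x01010101, 0x01010101, 0x01010101, 0x01010101 };

/* Variant 1: full 128-bit load of the constant, mimicking a constant-pool
   load (vmovdqa from .rodata).  */
__attribute__((noinline)) static uint64_t
bench_pool_load (void)
{
  __m128i acc = _mm_setzero_si128 ();
  const __m128i *p = (const __m128i *) pool;
  uint64_t start = __rdtsc ();
  for (unsigned i = 0; i < ITERS; i++)
    {
      asm volatile ("" : "+r" (p));    /* hide p so the load stays in the loop */
      acc = _mm_add_epi32 (acc, _mm_load_si128 (p));
    }
  uint64_t stop = __rdtsc ();
  asm volatile ("" :: "x" (acc));      /* keep the accumulator live */
  return stop - start;
}

/* Variant 2: materialize the constant in a GPR and broadcast it, i.e. the
   integer -> vector boundary crossing under discussion.  */
__attribute__((noinline)) static uint64_t
bench_gpr_broadcast (void)
{
  __m128i acc = _mm_setzero_si128 ();
  uint64_t start = __rdtsc ();
  for (unsigned i = 0; i < ITERS; i++)
    {
      uint32_t s = 0x01010101;
      asm volatile ("" : "+r" (s));    /* force a fresh GPR -> vector move */
      acc = _mm_add_epi32 (acc, _mm_set1_epi32 (s));
    }
  uint64_t stop = __rdtsc ();
  asm volatile ("" :: "x" (acc));
  return stop - start;
}

int
main (void)
{
  printf ("constant-pool load: %llu cycles\n",
          (unsigned long long) bench_pool_load ());
  printf ("GPR broadcast:      %llu cycles\n",
          (unsigned long long) bench_gpr_broadcast ());
  return 0;
}

Measuring true latency would additionally require putting the constant
materialization itself on a loop-carried dependency chain, which this sketch
does not attempt.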

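For reference, the code-generation difference the two benchmark variants try
to mimic looks roughly as follows for a toy function; the assembly in the
comment is my assumption of typical -O2 -mavx2 output, not taken from the
commit, and with AVX-512 the broadcast can take the GPR operand directly.

#include <immintrin.h>

/* gcc -O2 -mavx2 */
__m128i
add_ones (__m128i x)
{
  /* Broadcast from integer (post r12-1958), something like
         movl         $0x01010101, %eax
         vmovd        %eax, %xmm1
         vpbroadcastd %xmm1, %xmm1
     versus the constant-pool load
         vmovdqa      .LC0(%rip), %xmm1  */
  return _mm_add_epi32 (x, _mm_set1_epi32 (0x01010101));
}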