https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123631

--- Comment #3 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> (In reply to Hongtao Liu from comment #1)
> > It's done by r12-1958. It's better for dcache but worse for icache; the
> > small benchmark in the commit shows broadcast from integer is slightly
> > better than the constant pool. Maybe we should make it a u-arch-specific
> > tuning.
> 
> I see it was benchmarked on Intel CPUs, which have a shared register file. I
> was specifically wondering about the AMD case, where any integer <-> FP/vector
> boundary crossing incurs a latency penalty.
> 
> If there's already code generation using broadcast from scalar memory,
> a tunable would be nice to have, given that it makes benchmarking such
> a change easy.

I'll write a patch for this. Should be able to upstream it next week.
