On Sun, 13 Mar 2022 04:27:44 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4178:
>> 
>>> 4176:   movl(scratch, 1056964608);
>>> 4177:   movq(xtmp1, scratch);
>>> 4178:   vbroadcastss(xtmp1, xtmp1, vec_enc);
>> 
>> You could put the constant in the constant table and use `vbroadcastss` here 
>> also.
>> 
>> Thank you very much.
>
> constant and register to register moves are never issued to execution ports,  
> rematerializing value rather than reading from memory will give better 
> performance.

I have looked into this a little. While `movl r, i` may not consume an
execution port, `movq x, r` and `vbroadcastss x, x` certainly do, so the
sequence costs 3 retired and 2 executed uops. Furthermore, both `movq x, r` and
`vbroadcastss x, x` can only run on port 5, which limits the throughput of the
operation. In contrast, a `vbroadcastss x, m` results in only 1 retired and 1
executed uop, reducing pressure on both the decoder and the backend, and it can
run on either port 2 or port 3, which offers much better throughput. Latency is
not much of a concern here since the operation has no input dependency.
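
To make the accounting concrete, here is a minimal sketch that tallies the two sequences using only the figures quoted above. It is a toy model, not a measurement: the `Insn` and `Summary` types and the helper names are hypothetical, the uop counts and port assignments are taken from this reply as-is, and the port-pressure estimate assumes a uop is spread evenly over the ports that can execute it.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One instruction in the simplified model. The numbers below are the ones
// stated in the reply (assumptions of this sketch), not measured values.
struct Insn {
  std::string text;
  int retired_uops;        // uops that go through decode and retirement
  int executed_uops;       // uops that occupy an execution port
  std::vector<int> ports;  // ports able to execute the uop (empty if none)
};

struct Summary {
  int retired = 0;
  int executed = 0;
  double busiest_port_load = 0.0;  // uops per iteration on the busiest port
};

// Sum the uop counts and estimate the load on the busiest port, assuming
// each uop is distributed evenly across the ports that can run it.
Summary summarize(const std::vector<Insn>& seq) {
  Summary s;
  std::map<int, double> load;
  for (const Insn& insn : seq) {
    s.retired += insn.retired_uops;
    s.executed += insn.executed_uops;
    for (int port : insn.ports) {
      load[port] += static_cast<double>(insn.executed_uops) / insn.ports.size();
    }
  }
  for (const auto& entry : load) {
    if (entry.second > s.busiest_port_load) s.busiest_port_load = entry.second;
  }
  return s;
}

// Rematerialize the constant in a GPR, move it to a vector register, broadcast.
std::vector<Insn> rematerialize_sequence() {
  return {
      {"movl r, i", 1, 0, {}},           // assumed not to use an execution port
      {"movq x, r", 1, 1, {5}},          // port 5 only
      {"vbroadcastss x, x", 1, 1, {5}},  // port 5 only
  };
}

// Broadcast directly from a constant-table entry in memory.
std::vector<Insn> constant_table_sequence() {
  return {
      {"vbroadcastss x, m", 1, 1, {2, 3}},  // load ports 2 and 3
  };
}
```

Under these assumptions the register sequence puts 2 uops per iteration on port 5, while the memory form puts half a uop on each of ports 2 and 3, which is the throughput difference argued above.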

> register to register moves are never issued to execution ports

I believe you misremembered this part: a register-to-register move is only
elided when both registers are of the same kind. `vmovq x, r` moves from a
general-purpose register to a vector register, so it results in 1 uop being
executed on port 5.

What do you think? Thank you very much.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094
