https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068

--- Comment #7 from Brian Grayson <brian.grayson at sifive dot com> ---
A single lhau instruction is better than two instructions (lha + addi) for many
reasons. Are there reasons that you feel a two-instruction sequence of lha+addi
is *superior* to just an lhau?

On all PowerPC implementations, reducing the loop size by one instruction
improves the odds that the entire loop fits within a cache line, reducing the
fetch bubbles that might otherwise occur.

On all PowerPC implementations, saving one instruction reduces overall
instruction-cache pressure, especially if this construct or other
lhau-compatible situations occur multiple times within a function for looping
over different arrays etc.

On all PowerPC implementations, decoding a single instruction should consume
less power than decoding two, even in the presence of cracking.

On some PowerPC implementations, the lhau will crack into two micro-ops (either
at decode, or possibly later in the pipe -- I've worked on implementations that
have done both approaches), and still use two execution units (LSU + ALU), but
on some others, it will go to only one unit and merely provide two results from
the LSU. Either way, lhau is no worse than lha+addi, and sometimes better.

There is a potential issue on some implementations, if they cannot crack an
update-form instruction and decode further instructions in the same cycle, but
that should be controlled by an -march/-mcpu/-mtune flag, not applied as a
blanket policy across all past and future implementations, as some existing
implementations do *not* pay a penalty when decoding an update-form
instruction.

If the concern is over lha instead of lhz, as some implementations may crack
lha into lhz+extsh, that is a distinct reason, but the same "missed
optimization" occurs if I change the types to uint16_t, where the lha crack
concern disappears: gcc still emits lhz/addi instead of lhzu.
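
As a rough illustration of the kind of loop in question, here is a minimal
sketch. It is a hypothetical reduction, not the test case attached to this
bug, and the function names sum_u16 and sum_s16 are made up for the example.
Each iteration loads a halfword and advances the pointer by 2, which is the
pattern where lhzu (unsigned) or lhau (signed) could plausibly replace the
load + addi pair, presumably after biasing the pointer before the loop, since
the update form writes the effective address back to the base register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: unsigned halfword walk, a candidate for lhzu. */
uint32_t sum_u16(const uint16_t *p, size_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *p++;    /* zero-extending halfword load, then p += 2 bytes */
    return sum;
}

/* Hypothetical example: signed halfword walk, a candidate for lhau. */
int32_t sum_s16(const int16_t *p, size_t n)
{
    int32_t sum = 0;
    while (n--)
        sum += *p++;    /* sign-extending halfword load, then p += 2 bytes */
    return sum;
}
```

Compiling something like this for a PowerPC target at -O2 and inspecting the
assembly for the loop body is one way to check whether an update-form load
was selected.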
