https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99068
--- Comment #7 from Brian Grayson <brian.grayson at sifive dot com> ---
A single lhau instruction is better than two instructions (lha + addi) for many reasons. Are there reasons that you feel a two-instruction sequence of lha+addi is *superior* to just an lhau?

On all PowerPC implementations, reducing the loop size by one instruction improves the odds that the entire loop fits within a cache line, reducing the fetch bubbles that might otherwise occur.

On all PowerPC implementations, saving one instruction reduces overall instruction-cache pressure, especially if this construct or other lhau-compatible situations occur multiple times within a function for looping over different arrays etc.

On all PowerPC implementations, decoding a single instruction should take less power than decoding two, even in the presence of cracking.

On some PowerPC implementations, the lhau will crack into two micro-ops (either at decode, or possibly later in the pipe -- I've worked on implementations that have taken both approaches), and still use two execution units (LSU + ALU), but on some others, it will go to only one unit and merely provide two results from the LSU. Either way, lhau is no worse than lha+addi, and sometimes better.

There is a potential issue on some implementations, if they cannot handle cracking and decoding further instructions in a single cycle, but that should be controlled by an -march/-mcpu/-mtune flag, not applied as a blanket policy across all past and future implementations, as some existing implementations do *not* pay a penalty when decoding an update-form instruction.

If the concern is over lha instead of lhz, as some implementations may crack lha into lhz+extsh, that is a distinct reason, but the same "missed optimization" occurs if I change the types to uint16_t, where the lha crack concern disappears: gcc still emits lhz/addi instead of lhzu.
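For readers without the testcase at hand, here is a minimal sketch of the kind of loop being discussed. It is not the testcase attached to this bug; the function names sum_i16 and sum_u16 are hypothetical and chosen only for illustration. The pointer pre-increment `*++p` has the same shape as an update-form load: compute the new address, load from it, and keep the new address in the base register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not the testcase from the bug report.
   Signed halfword loads: the candidates are lha + addi versus lhau. */
int64_t sum_i16(const int16_t *p, size_t n)
{
    int64_t sum = 0;
    if (n == 0)
        return 0;
    sum = *p;                      /* first element: plain load */
    for (size_t i = 1; i < n; i++)
        sum += *++p;               /* advance, then load (update-form shape) */
    return sum;
}

/* Same loop with unsigned halfwords: the candidates are lhz + addi
   versus lhzu, so any concern about cracking lha does not apply. */
uint64_t sum_u16(const uint16_t *p, size_t n)
{
    uint64_t sum = 0;
    if (n == 0)
        return 0;
    sum = *p;
    for (size_t i = 1; i < n; i++)
        sum += *++p;
    return sum;
}
```

Compiling this for a PowerPC target with something like `-O2 -S` and reading the loop body shows which form was chosen. The result may differ with the compiler version and the -mcpu/-mtune setting, so treat it as a starting point for comparison rather than a reproduction of the reported behavior.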