https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113533
--- Comment #10 from Roger Sayle <roger at nextmovesoftware dot com> ---

Hi Oleg. Great question. The "speed" parameter passed to rtx_costs and address_cost indicates whether the middle-end is optimizing for performance, and is interested in the number of cycles taken by each instruction, or optimizing for size, and is interested in the number of bytes used to encode the instruction. Previously, this speed parameter was ignored by the SH backend, so the costs were the same regardless of the objective function.

In my proposed patch, the address cost

(1) when optimizing for size, attempts to return the additional size of an instruction based on the addressing mode. For register and reg+reg addressing modes there is no size increase (overhead), and for addressing modes with displacements, and displacements to address pointers, there is a cost.

(2) when optimizing for speed, remains between 0 and 3, and is used to prioritize between (equivalent numbers of) instructions.

Normally, rtx_costs are defined in terms of COSTS_N_INSNS, which multiplies by 4. Hence on many platforms a single instruction that references memory may be costed as COSTS_N_INSNS(1)+1 (or, with a more complex addressing mode, as COSTS_N_INSNS(1)+2) to show that it is disfavored relative to a single instruction that doesn't reference memory, COSTS_N_INSNS(1)+0. This is the fix for this particular regression: SIGN_EXTEND of a register now costs COSTS_N_INSNS(1), and SIGN_EXTEND of a MEM now costs COSTS_N_INSNS(1)+1.

A useful way to debug rtx_costs is to use the -dP command line option, and then look at the [c=X, l=Y] annotations in the assembly language file. One way to check that these are sensible is that, ideally, the two values should be correlated when optimizing for size (with -Os or -Oz).

I've found an interesting table of SH cycle counts (for different CPUs) at http://www.shared-ptr.com/sh_insns.html and these could be used to improve sh_rtx_costs further.
For example, SH currently reports multiplications as a single-cycle operation, which doesn't match the hardware specs, and prevents GCC from using synth_mult to produce faster (or shorter) sequences using shifts and additions. Likewise, sh_rtx_costs doesn't distinguish the machine mode, so the costs of SImode multiplications are the same as those of DImode multiplications.

In comment #5 you mention GCC's defaults; it turns out that for rtx_costs the default values that would be provided by the middle-end may be more accurate than the values (currently) specified by the backend.

I hope this answers your question.
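
As a rough illustration of the cost arithmetic described above, here is a minimal, standalone C sketch. It is not the SH backend code or the proposed patch. Only COSTS_N_INSNS (multiply by 4) and the two SIGN_EXTEND costs come from the discussion; the enum, the helper names, the multiply-by-10 example and the multiplication costs passed in are hypothetical and chosen only to show why a single-instruction multiply cost leaves no room for a shift-and-add sequence to win.

```c
/* Same scaling as GCC's COSTS_N_INSNS: one instruction is 4 cost units,
   leaving room for small +1/+2 adjustments within a single instruction. */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical operand kinds, for this sketch only. */
enum operand_kind { OPERAND_REG, OPERAND_MEM };

/* Speed cost of a SIGN_EXTEND as described above:
   COSTS_N_INSNS(1) for a register, COSTS_N_INSNS(1)+1 for a MEM. */
static int sign_extend_cost(enum operand_kind kind)
{
    return COSTS_N_INSNS(1) + (kind == OPERAND_MEM ? 1 : 0);
}

/* Cost of x*10 synthesized as (x << 3) + (x << 1): two shifts and one add,
   assuming (for illustration) that each costs one instruction. */
static int shift_add_mult10_cost(void)
{
    return COSTS_N_INSNS(3);
}

/* Hypothetical decision helper: would the shift-and-add sequence be
   preferred if a multiply is reported as costing mult_insns instructions? */
static int prefers_shift_add(int mult_insns)
{
    return shift_add_mult10_cost() < COSTS_N_INSNS(mult_insns);
}
```

With the multiply reported as a single instruction, prefers_shift_add(1) is false, so the synthesized sequence can never look cheaper; only when the reported cost exceeds three instructions (for example a hypothetical prefers_shift_add(4)) does the shift-and-add form become attractive under this simplified model.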