The sh4 port aligns to a cache line boundary every block that has no fallthru and that is either frequently executed (JUMP_ALIGN) or preceded by a barrier (LABEL_ALIGN_AFTER_BARRIER).
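
To make the cost of this padding concrete, here is a minimal sketch of the arithmetic involved. It is not code from the sh4 port: the helper name pad_to is hypothetical, and the 32-byte cache line is an assumption inferred from this report's description of 16 bytes as half of the cache line size.

```c
/* Assumed cache line size in bytes; inferred from the report, not taken from the port. */
#define CACHE_LINE 32u

/* Hypothetical helper: number of padding bytes needed to move `addr` up to the
   next multiple of `align`. `align` must be a power of two. Returns 0 when
   `addr` is already aligned. */
unsigned pad_to(unsigned addr, unsigned align)
{
    return (align - (addr & (align - 1u))) & (align - 1u);
}
```

Under these assumptions, a block that would otherwise start at offset 0x1006 receives 26 bytes of padding when aligned to CACHE_LINE, but only 2 bytes when aligned to 4 bytes.
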
While in theory this helps to avoid cache misses when a block would otherwise split over 2 cache lines, in practice it reduces cache locality and lengthens the distance between blocks. The number of issued instructions is also affected. For example, the relative indirect address in a jump table needs a byte zero-extend instruction if the distance occupies 8 bits instead of 7 bits.

I ran some experiments and benchmarked (EEMBC) 2 strategies:

1) -falign-jumps=1
2) Align the block only if its size is bigger than a given threshold (empirically set to 16 bytes, half of the cache line size). See the attached illustrating patch.

My conclusion is that at -O3 the performance never degrades when this padding is removed (option 2 is a little bit better, even improving Dhrystone by 3%), and the text size improves by ~15%.

So I was not able to measure any benefit from the cache line padding, although its code size impact is big (even at -O2/-O3, code size bloat should be motivated by some performance improvement). Is there a motivating test that justifies this micro-optimisation?

In the illustrating patch I still align the basic blocks on 4 bytes to account for better instruction fetch accesses.

--
Summary: cache align alignment is too aggressive on sh-elf
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: chrbr at gcc dot gnu dot org
GCC target triplet: sh-superh-elf

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31640