The sh4 port aligns blocks that have no fallthrus and that are either
frequently executed (JUMP_ALIGN) or preceded by a barrier
(LABEL_ALIGN_AFTER_BARRIER) to a cache line boundary.

While in theory this helps to avoid cache misses when a block would otherwise
split across 2 cache lines, in practice it reduces cache locality and lengthens
the distance between blocks.
The number of issued instructions is also affected. For example, the relative
indirect address in jump tables needs a byte zero-extend instruction if the
distance occupies 8 bits instead of 7 bits.
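
To illustrate the jump table point, here is a small sketch in plain C (not GCC
code). It assumes that the byte load used to read a table entry sign-extends,
as MOV.B does on SH, so an offset that fits in 7 bits can be used directly
while one that needs 8 bits must be zero-extended first. All function names
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models a sign-extending byte load of a jump table entry
   (assumption: this is how the entry is fetched). */
int load_offset_signed(const uint8_t *entry)
{
    int v = *entry;
    return v >= 128 ? v - 256 : v;
}

/* Models the same load followed by an explicit zero extension,
   i.e. the extra instruction the report refers to. */
int load_offset_zero_extended(const uint8_t *entry)
{
    return *entry & 0xff;
}

/* Returns 1 when a byte-sized distance no longer fits in 7 bits,
   so the zero-extend step is required. */
int needs_zero_extend(unsigned distance)
{
    assert(distance <= 0xff);   /* only byte-sized entries are modelled */
    return distance > 0x7f;
}
```

With a distance of 0x40 both loads agree, so no extra instruction is needed;
with 0x90 the sign-extending load yields a negative value and the zero
extension becomes necessary. Padding that pushes distances past 0x7f therefore
costs an instruction per table dispatch.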

I ran some experiments and benchmarked (EEMBC) 2 strategies:
1) -falign-jumps=1
2) Align the block only if its size is bigger than a given threshold
(empirically set to 16 bytes, half of the cache line size). See the attached
illustrating patch.
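
As a rough sketch of strategy 2 (this is not the attached patch), the decision
can be written as below. The constant and function names are hypothetical;
the 32-byte cache line is inferred from "16 bytes, half of the cache line
size", and the 4-byte fallback is the minimum alignment the illustrating patch
is described as keeping.

```c
#include <assert.h>

/* Assumed values: 32-byte cache line, 16-byte size threshold,
   4-byte minimum alignment. */
enum {
    CACHE_LINE_BYTES = 32,
    SIZE_THRESHOLD_BYTES = 16,
    MIN_ALIGN_BYTES = 4
};

/* Alignment (in bytes) chosen for a block of the given size:
   cache-line align only blocks strictly bigger than the threshold. */
unsigned block_alignment(unsigned block_size_bytes)
{
    return block_size_bytes > SIZE_THRESHOLD_BYTES ? CACHE_LINE_BYTES
                                                   : MIN_ALIGN_BYTES;
}

/* Padding bytes needed to bring addr up to the next multiple of align.
   align must be a power of two. */
unsigned padding_for(unsigned addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (addr & (align - 1))) & (align - 1);
}
```

For a block starting at 0x1004, cache-line alignment costs 28 bytes of padding,
while a block of 16 bytes or less at the same address needs none under the
4-byte rule. That difference is where the text size saving would come from.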

My conclusion is that at -O3 the performance never degrades when this padding
is removed (option 2 is a little bit better, even improving dhrystone by 3%),
and the text size improves by ~15%.

So I was not able to measure any benefit from the cache line padding, although
the code size impact is big (even at -O2/-O3, code size bloat should be
justified by some performance improvement).

Is there a motivating test that justifies this micro-optimisation?

In the illustrating patch I still align the basic blocks on 4 bytes to allow
for better instruction fetch accesses.


-- 
           Summary: cache align alignment is too aggressive on sh-elf
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: chrbr at gcc dot gnu dot org
GCC target triplet: sh-superh-elf


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31640
