------- Comment #21 from vvv at ru dot ru  2009-05-13 17:13 -------
I guess! Your patch is absolutely correct for AMD AthlonTM 64 and AMD OpteronTM
processors, but it is nonoptimal for Intel processors. Because:

1. AMD limitation for 16-bytes page (memory range XXXXXXX0 - XXXXXXXF), but
Intel limitation for 16-bytes chunk  (memory range XXXXXXXX - XXXXXXXX+10h)
2. AMD - maximum of _THREE_ near branches (CALL, JMP, conditional branches, or
returns),
Intel -  maximum of _FOUR_ branches!

Quotation from Software Optimization Guide for AMD64 Processors

6.1          Density of Branches
When possible, align branches such that they do not cross a 16-byte boundary.

The AMD AthlonTM 64 and AMD OpteronTM processors have the capability to cache
branch-prediction history for a maximum of three near branches (CALL, JMP,
conditional branches, or returns) per 16-byte fetch window. A branch
instruction that crosses a 16-byte boundary is counted in the second 16-byte
window. Due to architectural restrictions, a branch that is split across a
16-byte
boundary cannot dispatch with any other instructions when it is predicted
taken. Perform this alignment by rearranging code; it is not beneficial to
align branches using padding sequences.

The following branches are limited to three per 16-byte window:

jcc   rel8
jcc   rel32
jmp   rel8
jmp   rel32
jmp   reg
jmp   WORD PTR
jmp   DWORD PTR
call  rel16
call  r/m16
call  rel32
call  r/m32

Coding more than three branches in the same 16-byte code window may lead to
conflicts in the branch target buffer. To avoid conflicts in the branch target
buffer, space out branches such that three or fewer exist in a given 16-byte
code window. For absolute optimal performance, try to limit branches to one per
16-byte code window. Avoid code sequences like the following:
ALIGN 16
label3:
     call   label1     ;  1st branch in 16-byte     code   window
     jc     label3     ;  2nd branch in 16-byte     code   window
     call   label2     ;  3rd branch in 16-byte     code   window
     jnz    label4     ;  4th branch in 16-byte     code   window
                       ;  Cannot be predicted.
If there is a jump table that contains many frequently executed branches, pad
the table entries to 8 bytes each to assure that there are never more than
three branches per 16-byte block of code.
Only branches that have been taken at least once are entered into the dynamic
branch prediction, and therefore only those branches count toward the
three-branch limit.




-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Reply via email to