------- Comment #21 from vvv at ru dot ru 2009-05-13 17:13 -------

I guess! Your patch is absolutely correct for AMD Athlon(TM) 64 and AMD Opteron(TM) processors, but it is not optimal for Intel processors, because:
1. AMD: the limit applies to an aligned 16-byte page (memory range XXXXXXX0 - XXXXXXXF); Intel: the limit applies to any 16-byte chunk (memory range XXXXXXXX - XXXXXXXX+10h).
2. AMD: maximum of _THREE_ near branches (CALL, JMP, conditional branches, or returns); Intel: maximum of _FOUR_ branches!

Quotation from Software Optimization Guide for AMD64 Processors:

6.1 Density of Branches

When possible, align branches such that they do not cross a 16-byte boundary.

The AMD Athlon(TM) 64 and AMD Opteron(TM) processors have the capability to cache branch-prediction history for a maximum of three near branches (CALL, JMP, conditional branches, or returns) per 16-byte fetch window. A branch instruction that crosses a 16-byte boundary is counted in the second 16-byte window. Due to architectural restrictions, a branch that is split across a 16-byte boundary cannot dispatch with any other instructions when it is predicted taken. Perform this alignment by rearranging code; it is not beneficial to align branches using padding sequences.

The following branches are limited to three per 16-byte window:

    jcc rel8
    jcc rel32
    jmp rel8
    jmp rel32
    jmp reg
    jmp WORD PTR
    jmp DWORD PTR
    call rel16
    call r/m16
    call rel32
    call r/m32

Coding more than three branches in the same 16-byte code window may lead to conflicts in the branch target buffer. To avoid conflicts in the branch target buffer, space out branches such that three or fewer exist in a given 16-byte code window. For absolute optimal performance, try to limit branches to one per 16-byte code window. Avoid code sequences like the following:

    ALIGN 16
    label3:
        call label1   ; 1st branch in 16-byte code window
        jc   label3   ; 2nd branch in 16-byte code window
        call label2   ; 3rd branch in 16-byte code window
        jnz  label4   ; 4th branch in 16-byte code window
                      ; Cannot be predicted.

If there is a jump table that contains many frequently executed branches, pad the table entries to 8 bytes each to assure that there are never more than three branches per 16-byte block of code.

Only branches that have been taken at least once are entered into the dynamic branch prediction, and therefore only those branches count toward the three-branch limit.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
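
To make the difference between points 1 and 2 concrete, here is a minimal Python sketch. It is not taken from the AMD or Intel manuals. It takes hypothetical branch instructions as (address, length) pairs and counts them under two models: the aligned-window rule from the quotation (a branch that crosses a 16-byte boundary is counted in the second window), and the sliding-chunk rule described in this comment for Intel (any 16 consecutive bytes). The function names, the example addresses, and the simplification that a branch belongs to a chunk only when it lies entirely inside it are assumptions made for illustration.

```python
from collections import Counter

WINDOW = 16  # size of the fetch window / chunk in bytes


def max_per_aligned_window(branches):
    """Largest number of branches in any aligned 16-byte window.

    A branch is assigned to the window that holds its last byte, so a
    branch crossing a 16-byte boundary is counted in the second window.
    """
    counts = Counter((addr + length - 1) // WINDOW for addr, length in branches)
    return max(counts.values(), default=0)


def max_per_sliding_chunk(branches):
    """Largest number of branches lying entirely inside any 16 consecutive bytes.

    It is enough to try chunks that start at the first byte of a branch.
    """
    best = 0
    for start, _ in branches:
        inside = sum(
            1
            for addr, length in branches
            if addr >= start and addr + length <= start + WINDOW
        )
        best = max(best, inside)
    return best


# Hypothetical layout: four 2-byte short jumps at 0x0C, 0x0E, 0x10 and 0x12.
example = [(0x0C, 2), (0x0E, 2), (0x10, 2), (0x12, 2)]
print(max_per_aligned_window(example))  # 2
print(max_per_sliding_chunk(example))   # 4
```

For this layout the aligned-window count is 2, well inside the three-branch limit quoted from the AMD guide, while a sliding 16-byte chunk sees all four branches. Which count matters, and whether the limit is three or four, depends on the processor as described in points 1 and 2.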