On 04/03/2015 04:03 PM, Ingo Molnar wrote: > > * Denys Vlasenko <dvlas...@redhat.com> wrote: > >> Interrupt entry points are handled with the following code: >> Each 32-byte code block contains seven entry points >> >> ... >> [push][jump 22] // 4 bytes >> [push][jump 18] // 4 bytes >> [push][jump 14] // 4 bytes >> [push][jump 10] // 4 bytes >> [push][jump 6] // 4 bytes >> [push][jump 2] // 4 bytes >> [push][jump common_interrupt][padding] // 8 bytes >> >> [push][jump] >> [push][jump] >> [push][jump] >> [push][jump] >> [push][jump] >> [push][jump] >> [push][jump common_interrupt][padding] >> >> [padding_2] >> common_interrupt: >> >> The first six jumps are short (2-byte insns) jumps to the last >> jump in the block. The last jump usually has to be a longer, 5-byte form. >> >> There are about 30 such blocks. >> >> This change uses the fact that last few of these blocks are close to >> "common_interrupt" label and the last jump there can also be a short one. >> >> This allows several last 32-byte code blocks to contain eight, not seven >> entry points, >> and to have all these entry points jump directly to common_interrupt, >> eliminating one branch. They look like this: >> >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> [push][jump common_interrupt] // 4 bytes >> >> This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before dispatch >> table >> and before common_interrupt with 32-byte alignment. >> >> The first alignment was completely unnecessary: >> the dispatch table will not benefit from any alignment bigger than 32 bytes >> (any entry point needs at most only 32 bytes of table to be read by the CPU). >> >> The alignment before common_interrupt label affects the size of padding >> (the "[padding_2]" above) and excessive padding reduces the number of blocks >> which can be optimized. In the .config I tested with, this alignment was 64 >> bytes, >> this means two fewer blocks could be optimized. I believe 32-byte alignment >> should do just fine. >> >> The condition which controls the choice of a direct jump versus short one, >> >> /* Can we reach common_interrupt with a short jump? */ >> .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4) >> >> was tested to be a passimistic one: the generated code will not erroneously >> generate a longer than necessary jump even with maximum possible padding >> and with -1 subtracted from the right hand side above. >> This was verified by checking disassembly. >> >> However, with current value of FIRST_SYSTEM_VECTOR it so happens >> that the padding before "common_interrupt" is zero bytes. >> >> Disassembly before After >> ========================================== >> ================================== >> ... ... >> 840 6a9b push $0xffffffffffffff9b 82c 6a9d push >> $0xffffffffffffff9d >> 842 eb16 jmp 85a 82e eb30 jmp 860 >> 844 6a9a push $0xffffffffffffff9a 830 6a9c push >> $0xffffffffffffff9c >> 846 eb12 jmp 85a 832 eb2c jmp 860 >> 848 6a99 push $0xffffffffffffff99 834 6a9b push >> $0xffffffffffffff9b >> 84a eb0e jmp 85a 836 eb28 jmp 860 >> 84c 6a98 push $0xffffffffffffff98 838 6a9a push >> $0xffffffffffffff9a >> 84e eb0a jmp 85a 83a eb24 jmp 860 >> 850 6a97 push $0xffffffffffffff97 83c 6a99 push >> $0xffffffffffffff99 >> 852 eb06 jmp 85a 83e eb20 jmp 860 >> 854 6a96 push $0xffffffffffffff96 840 6a98 push >> $0xffffffffffffff98 >> 856 eb02 jmp 85a 842 eb1c jmp 860 >> 858 6a95 push $0xffffffffffffff95 844 6a97 push >> $0xffffffffffffff97 >> 85a eb24 jmp 880 846 eb18 jmp 860 >> 85c 0f1f4000 nopl 0x0(%rax) 848 6a96 push >> $0xffffffffffffff96 >> 860 6a94 push $0xffffffffffffff94 84a eb14 jmp 860 >> 862 eb0c jmp 870 84c 6a95 push >> $0xffffffffffffff95 >> 864 6a93 push $0xffffffffffffff93 84e eb10 jmp 860 >> 866 eb08 jmp 870 850 6a94 push >> $0xffffffffffffff94 >> 868 6a92 push $0xffffffffffffff92 852 eb0c jmp 860 >> 86a eb04 jmp 870 854 6a93 push >> $0xffffffffffffff93 >> 86c 6a91 push $0xffffffffffffff91 856 eb08 jmp 860 >> 86e eb00 jmp 870 858 6a92 push >> $0xffffffffffffff92 >> 870 eb0e jmp 880 85a eb04 jmp 860 >> 872 66666666662e0f data32 data32 data32 data3285c 6a91 push >> $0xffffffffffffff91 >> 879 1f840000000000 nopw %cs:0x0(%rax,%rax,1) 85e eb00 jmp 860 >> 880 <common_interrupt>: 860 <common_interrupt>: >> >> Sizes: >> text data bss dec hex filename >> 12578 0 0 12578 3122 entry_64.o.before >> 12546 0 0 12546 3102 entry_64.o >> >> Signed-off-by: Denys Vlasenko <dvlas...@redhat.com> >> CC: Linus Torvalds <torva...@linux-foundation.org> >> CC: Steven Rostedt <rost...@goodmis.org> >> CC: Ingo Molnar <mi...@kernel.org> >> CC: Borislav Petkov <b...@alien8.de> >> CC: "H. Peter Anvin" <h...@zytor.com> >> CC: Andy Lutomirski <l...@amacapital.net> >> CC: Oleg Nesterov <o...@redhat.com> >> CC: Frederic Weisbecker <fweis...@gmail.com> >> CC: Alexei Starovoitov <a...@plumgrid.com> >> CC: Will Drewry <w...@chromium.org> >> CC: Kees Cook <keesc...@chromium.org> >> CC: x...@kernel.org >> CC: linux-kernel@vger.kernel.org >> --- >> arch/x86/kernel/entry_64.S | 61 >> +++++++++++++++++++++++++++++++++++++--------- >> 1 file changed, 49 insertions(+), 12 deletions(-) >> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S >> index da137f9..2003417 100644 >> --- a/arch/x86/kernel/entry_64.S >> +++ b/arch/x86/kernel/entry_64.S >> @@ -537,33 +537,70 @@ END(ret_from_fork) >> * Build the entry stubs and pointer table with some assembler magic. >> * We pack 7 stubs into a single 32-byte chunk, which will fit in a >> * single cache line on all modern x86 implementations. >> + * The last few cachelines pack 8 stubs each. >> */ >> +ALIGN_common_interrupt=32 >> .section .init.rodata,"a" >> ENTRY(interrupt) >> .section .entry.text >> - .p2align 5 >> - .p2align CONFIG_X86_L1_CACHE_SHIFT >> + .align 32 >> ENTRY(irq_entries_start) >> INTR_FRAME >> vector=FIRST_EXTERNAL_VECTOR >> -.rept (FIRST_SYSTEM_VECTOR-FIRST_EXTERNAL_VECTOR+6)/7 >> - .balign 32 >> +.rept 256 /* this number does not need to be exact, just big enough */ >> + >> + /* >> + * Block of six "push + short_jump" pairs, 4 bytes each, >> + * and a 2-byte seventh "push", without its jump yet. >> + */ >> +need_near_jump=0 >> +push_count=0 >> .rept 7 >> .if vector < FIRST_SYSTEM_VECTOR >> - .if vector <> FIRST_EXTERNAL_VECTOR >> +1: pushq_cfi $(~vector+0x80) /* Note: always in signed byte range */ >> + .previous >> + .quad 1b >> + .section .entry.text >> +vector=vector+1 >> +push_count=push_count+1 >> + .if push_count < 7 >> + /* Can we reach common_interrupt with a short jump? */ >> + .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4) >> + jmp common_interrupt /* yes */ >> + .else >> + jmp 2f >> +need_near_jump=1 >> + .endif >> CFI_ADJUST_CFA_OFFSET -8 >> .endif >> + .endif >> + .endr >> + >> + /* If we are close to the end, we can pack in yet another pair */ >> + .if need_near_jump == 0 >> + .if push_count == 7 >> + /* The "short jump" for the seventh "push" */ >> + jmp common_interrupt >> + CFI_ADJUST_CFA_OFFSET -8 >> + .endif >> + .if vector < FIRST_SYSTEM_VECTOR >> + /* "push + short_jump" pair #8 */ >> 1: pushq_cfi $(~vector+0x80) /* Note: always in signed byte range */ >> - .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6 >> - jmp 2f >> - .endif >> - .previous >> + .previous >> .quad 1b >> - .section .entry.text >> + .section .entry.text >> vector=vector+1 >> +push_count=push_count+1 >> + jmp common_interrupt >> + CFI_ADJUST_CFA_OFFSET -8 >> .endif >> - .endr >> + .else >> + /* Use remaining space for "near jump" for the seventh "push" */ >> 2: jmp common_interrupt >> + CFI_ADJUST_CFA_OFFSET -8 >> + .align 2 >> + .endif >> + >> .endr >> CFI_ENDPROC >> END(irq_entries_start) > > So I think this might be easier to read if we went low tech and used > an open-coded 240+ lines assembly file for this, with a few > obvious-purpose helper macros? It's not like it will change all that > often so it only has to be generated once.
How about this version? It's still isn't a star of readability, but the structure of the 32-byte code block is more visible now... /* * Build the entry stubs and pointer table with some assembler magic. * We pack 7 stubs into a single 32-byte chunk, which will fit in a * single cache line on all modern x86 implementations. * The last few cachelines pack 8 stubs each. */ ALIGN_common_interrupt=32 .section .init.rodata,"a" ENTRY(interrupt) .section .entry.text .align 32 ENTRY(irq_entries_start) INTR_FRAME .macro push_vector .if vector < FIRST_SYSTEM_VECTOR 1: pushq_cfi $(~vector+0x80) /* Note: always in signed byte range */ .previous .quad 1b .section .entry.text need_jump=1 vector=vector+1 .endif .endm .macro short_jump label .if need_jump == 1 /* Can we reach common_interrupt with a short jump? */ .if vector >= (FIRST_SYSTEM_VECTOR - (128-ALIGN_common_interrupt)/4) jmp common_interrupt /* yes */ .else jmp \label need_near_jump=1 .endif CFI_ADJUST_CFA_OFFSET -8 need_jump=0 .endif .endm .macro near_jump label jmp \label CFI_ADJUST_CFA_OFFSET -8 .endm vector=FIRST_EXTERNAL_VECTOR .rept 256 /* this number does not need to be exact, just big enough */ need_near_jump=0 push_vector; short_jump 2f push_vector; short_jump 2f push_vector; short_jump 2f push_vector; short_jump 2f push_vector; short_jump 2f push_vector; short_jump 2f .if need_near_jump == 1 push_vector; 2: near_jump common_interrupt; .align 2 .else push_vector; short_jump common_interrupt push_vector; short_jump common_interrupt .endif .endr CFI_ENDPROC END(irq_entries_start) .previous END(interrupt) .previous -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/