On Thu, May 21, 2015 at 2:12 PM, Sriraman Tallam <tmsri...@google.com> wrote: > On Sun, May 10, 2015 at 10:01 AM, Sriraman Tallam <tmsri...@google.com> wrote: >> >> On Sun, May 10, 2015, 8:19 AM H.J. Lu <hjl.to...@gmail.com> wrote: >> >> On Sat, May 9, 2015 at 9:34 AM, H.J. Lu <hjl.to...@gmail.com> wrote: >>> On Mon, May 4, 2015 at 7:45 AM, Michael Matz <m...@suse.de> wrote: >>>> Hi, >>>> >>>> On Thu, 30 Apr 2015, Sriraman Tallam wrote: >>>> >>>>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated >>>>> PLT stubs for some of the hot external library functions like memcmp, >>>>> pow. The win was from better icache and itlb performance. The main >>>>> reason was that the PLT stubs had no spatial locality with the >>>>> call-sites. I have started looking at ways to tell the compiler to >>>>> eliminate PLT stubs (in-effect inline them) for specified external >>>>> functions, for x86_64. I have a proposal and a patch and I would like to >>>>> hear what you think. >>>>> >>>>> This comes with caveats. This cannot be generally done for all >>>>> functions marked extern as it is impossible for the compiler to say if a >>>>> function is "truly extern" (defined in a shared library). If a function >>>>> is not truly extern(ends up defined in the final executable), then >>>>> calling it indirectly is a performance penalty as it could have been a >>>>> direct call. >>>> >>>> This can be fixed by Alans idea. >>>> >>>>> Further, the newly created GOT entries are fixed up at >>>>> start-up and do not get lazily bound. >>>> >>>> And this can be fixed by some enhancements in the linker and dynamic >>>> linker. The idea is to still generate a PLT stub and make its GOT entry >>>> point to it initially (like a normal got.plt slot). Then the first >>>> indirect call will use the address of PLT entry (starting lazy >>>> resolution) >>>> and update the GOT slot with the real address, so further indirect calls >>>> will directly go to the function. >>>> >>>> This requires a new asm marker (and hence new reloc) as normally if >>>> there's a GOT slot it's filled by the real symbols address, unlike if >>>> there's only a got.plt slot. E.g. a >>>> >>>> call *foo@GOTPLT(%rip) >>>> >>>> would generate a GOT slot (and fill its address into above call insn), >>>> but >>>> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one. >>>> >>> >>> I added the "relax" prefix support to x86 assembler on users/hjl/relax >>> branch >>> >>> at >>> >>> https://sourceware.org/git/?p=binutils-gdb.git;a=summary >>> >>> [hjl@gnu-tools-1 relax-3]$ cat r.S >>> .text >>> relax jmp foo >>> relax call foo >>> relax jmp foo@plt >>> relax call foo@plt >>> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S >>> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o >>> >>> r.o: file format elf64-x86-64 >>> >>> >>> Disassembly of section .text: >>> >>> 0000000000000000 <.text>: >>> 0: 66 e9 00 00 00 00 data16 jmpq 0x6 2: R_X86_64_RELAX_PC32 foo-0x4 >>> 6: 66 e8 00 00 00 00 data16 callq 0xc 8: R_X86_64_RELAX_PC32 >>> foo-0x4 >>> c: 66 e9 00 00 00 00 data16 jmpq 0x12 e: >>> R_X86_64_RELAX_PLT32foo-0x4 >>> 12: 66 e8 00 00 00 00 data16 callq 0x18 14: >>> R_X86_64_RELAX_PLT32foo-0x4 >>> [hjl@gnu-tools-1 relax-3]$ >>> >>> Right now, the relax relocations are treated as PC32/PLT32 relocations. >>> I am working on linker support. >>> >> >> I implemented the linker support for x86-64: >> >> 00000000 <main>: >> 0: 48 83 ec 08 sub $0x8,%rsp >> 4: e8 00 00 00 00 callq 9 <main+0x9> 5: R_X86_64_PC32 plt-0x4 >> 9: e8 00 00 00 00 callq e <main+0xe> a: R_X86_64_PLT32 plt-0x4 >> e: e8 00 00 00 00 callq 13 <main+0x13> f: R_X86_64_PC32 bar-0x4 >> 13: 66 e8 00 00 00 00 data16 callq 19 <main+0x19> 15: >> R_X86_64_RELAX_PC32 bar-0x4 >> 19: 66 e8 00 00 00 00 data16 callq 1f <main+0x1f> 1b: >> R_X86_64_RELAX_PLT32 bar-0x4 >> 1f: 66 e8 00 00 00 00 data16 callq 25 <main+0x25> 21: >> R_X86_64_RELAX_PC32 foo-0x4 >> 25: 66 e8 00 00 00 00 data16 callq 2b <main+0x2b> 27: >> R_X86_64_RELAX_PLT32 foo-0x4 >> 2b: 31 c0 xor %eax,%eax >> 2d: 48 83 c4 08 add $0x8,%rsp >> 31: c3 retq >> >> 00400460 <main>: >> 400460: 48 83 ec 08 sub $0x8,%rsp >> 400464: e8 d7 ff ff ff callq 400440 <plt@plt> >> 400469: e8 d2 ff ff ff callq 400440 <plt@plt> >> 40046e: e8 ad ff ff ff callq 400420 <bar@plt> >> 400473: ff 15 ff 03 20 00 callq *0x2003ff(%rip) # 600878 >> <_DYNAMIC+0xf8> >> 400479: ff 15 f9 03 20 00 callq *0x2003f9(%rip) # 600878 >> <_DYNAMIC+0xf8> >> 40047f: 66 e8 f3 00 00 00 data16 callq 400578 <foo> >> 400485: 66 e8 ed 00 00 00 data16 callq 400578 <foo> >> 40048b: 31 c0 xor %eax,%eax >> 40048d: 48 83 c4 08 add $0x8,%rsp >> 400491: c3 retq >> >> Sriraman, can you give it a try? > > > I like HJ's proposal here and it is important that the linker fixes > unnecessary indirect calls to direct ones. > > However, independently I think my original proposal is still useful > and I want to pitch it again for the following reasons. > > AFAIU, Alexander Monakov's -fno-plt does not solve the following: > > * Does not do anything for non-PIC code. The compiler does not > generate a @PLT call but the linker will route all external calls via > PLT. We noticed a problem with non-PIC executables where the PLT > stubs were causing too many icache misses and are interested in a > solution for this. > * Aggressively uses indirect calls even if the final symbol is not > truly external. This needs HJ's linker patch to fix unnecessary > indirect calls to direct calls. > > My original proposal, for x86_64 only, was to add > -fno-plt=<function-name>. This lets the user decide for which > functions PLT must be avoided. Let the compiler always generate an > indirect call using call *func@GOTPCREL(%rip). We could do this for > non-PIC code too. No need for linker fixups since this relies on the > user to know that func is from a shared object. > > I am reattaching the patch.
I also want to add that my proposal with -fno-plt=<fn_name> does not take away anything from the newly added -fno-plt option or the linker patch HJ proposed. It is orthogonal and lets the user decide the subset of external functions for which PLT must be avoided. Thanks Sri > > Thanks > Sri > > >> >> Thanks! Will do! >> >> Sri >> >> -- >> H.J. >> >>
* common.opt (-fno-plt=): New option. * config/i386/i386.c (avoid_plt_to_call): New function. (ix86_output_call_insn): Check if PLT needs to be avoided and call or jump indirectly if true. * opts-global.c (htab_str_eq): New function. (avoid_plt_fnsymbol_names_tab): New htab. (handle_common_deferred_options): Handle -fno-plt= Index: common.opt =================================================================== --- common.opt (revision 222892) +++ common.opt (working copy) @@ -1087,6 +1087,11 @@ fdbg-cnt= Common RejectNegative Joined Var(common_deferred_options) Defer -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...] Set the debug counter limit. +fno-plt= +Common RejectNegative Joined Var(common_deferred_options) Defer +-fno-plt=<symbol1> Avoid going through the PLT when calling the specified function. +Allow multiple instances of this option with different function names. + fdebug-prefix-map= Common Joined RejectNegative Var(common_deferred_options) Defer Map one directory name to another in debug information Index: config/i386/i386.c =================================================================== --- config/i386/i386.c (revision 222892) +++ config/i386/i386.c (working copy) @@ -25282,6 +25282,25 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call return call; } +extern htab_t avoid_plt_fnsymbol_names_tab; +/* If the function referenced by call_op is to a external function + and calls via PLT must be avoided as specified by -fno-plt=, then + return true. */ + +static int +avoid_plt_to_call(rtx call_op) +{ + const char *name; + if (GET_CODE (call_op) != SYMBOL_REF + || SYMBOL_REF_LOCAL_P (call_op) + || avoid_plt_fnsymbol_names_tab == NULL) + return 0; + name = XSTR (call_op, 0); + if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL) + return 1; + return 0; +} + /* Output the assembly for a call instruction. */ const char * @@ -25294,7 +25313,12 @@ ix86_output_call_insn (rtx insn, rtx call_op) if (SIBLING_CALL_P (insn)) { if (direct_p) - xasm = "jmp\t%P0"; + { + if (avoid_plt_to_call (call_op)) + xasm = "jmp\t*%p0@GOTPCREL(%%rip)"; + else + xasm = "jmp\t%P0"; + } /* SEH epilogue detection requires the indirect branch case to include REX.W. */ else if (TARGET_SEH) @@ -25346,9 +25370,15 @@ ix86_output_call_insn (rtx insn, rtx call_op) } if (direct_p) - xasm = "call\t%P0"; + { + if (avoid_plt_to_call (call_op)) + xasm = "call\t*%p0@GOTPCREL(%%rip)"; + else + xasm = "call\t%P0"; + } else xasm = "call\t%A0"; + output_asm_insn (xasm, &call_op); Index: opts-global.c =================================================================== --- opts-global.c (revision 222892) +++ opts-global.c (working copy) @@ -47,6 +47,7 @@ along with GCC; see the file COPYING3. If not see #include "xregex.h" #include "attribs.h" #include "stringpool.h" +#include "hash-table.h" typedef const char *const_char_p; /* For DEF_VEC_P. */ @@ -420,6 +421,17 @@ decode_options (struct gcc_options *opts, struct g finish_options (opts, opts_set, loc); } +/* Helper function for the hash table that compares the + existing entry (S1) with the given string (S2). */ + +static int +htab_str_eq (const void *s1, const void *s2) +{ + return !strcmp ((const char *)s1, (const char *) s2); +} + +htab_t avoid_plt_fnsymbol_names_tab = NULL; + /* Process common options that have been deferred until after the handlers have been called for all options. */ @@ -539,6 +551,15 @@ handle_common_deferred_options (void) stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg)); break; + case OPT_fno_plt_: + void **slot; + if (avoid_plt_fnsymbol_names_tab == NULL) + avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string, + htab_str_eq, NULL); + slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT); + *slot = (void *)opt->arg; + break; + default: gcc_unreachable (); }