On Wed, May 06, 2015 at 06:24:58PM +0300, Alexander Monakov wrote:
> If the same PLT stubs as today are to be used, it constrains the compiler on
> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
> specific register.  It's possible to imagine more complex PLT stubs that
> obtain GOT pointer on their own, but in that case you can't let optimizations
> such as loop invariant motion move the GOT load away from the call in a
> fashion that could result in PLT stub pointer be reused many times.

Why?  The 32-bit x86 PLT (shouldn't we care much more about x86-64, where
this is a non-issue?) looks like:

4c2b7310 <_Unwind_Find_FDE@plt-0x10>:
4c2b7310:       ff b3 04 00 00 00       pushl  0x4(%ebx)
4c2b7316:       ff a3 08 00 00 00       jmp    *0x8(%ebx)
4c2b731c:       00 00                   add    %al,(%eax)
        ...

4c2b7320 <_Unwind_Find_FDE@plt>:
4c2b7320:       ff a3 0c 00 00 00       jmp    *0xc(%ebx)
4c2b7326:       68 00 00 00 00          push   $0x0
4c2b732b:       e9 e0 ff ff ff          jmp    4c2b7310

4c2b7330 <realloc@plt>:
4c2b7330:       ff a3 10 00 00 00       jmp    *0x10(%ebx)
4c2b7336:       68 08 00 00 00          push   $0x8
4c2b733b:       e9 d0 ff ff ff          jmp    4c2b7310

The linker would know very well what kind of relocations are used for a
particular PLT slot.  For the new relocations, which would resolve to the
address of the .got.plt slot, it could just tweak the corresponding 3rd insn
in the slot so that it jumps not to first plt slot - 16, but to a few bytes
before that, to code that would just load the address of _G_O_T_ into %ebx
and then fall through into the 0x4c2b7310 snippet above.  The lazy binding
would be a few ticks slower in that case, but there would be no requirement
on %ebx to contain _G_O_T_.

As for hoisting the load of the call address before the loop, with lazy
binding that has the obvious disadvantage that you'd resolve the slot again
and again, if you are unlucky enough that the function hasn't been resolved
yet.  Unless the shared PLT stub, after computing _G_O_T_ (for x86), also
rechecks the .got.plt address.

	Jakub