On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote:
> Hi!
>
> On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote:
> > With lazy PLT resolution the first load of a PLT entry may be a value
> > pointing at a resolver stub.  gcc's loop processing can result in the
> > PLT load in inline PLT calls being hoisted out of a loop in the
> > mistaken idea that this is an optimisation.  It isn't.  If the value
> > hoisted was that for a resolver stub then every call to that function
> > in the loop will go via the resolver, slowing things down quite
> > dramatically.
> >
> > The PLT really is volatile, so teach gcc about that.
>
> It would be nice if we could keep it cached after it has been resolved
> once, this has potential for regressing performance if we don't?  And
> LD_BIND_NOW should keep working just as fast as it is now, too?

Using a call-saved register to cache a load out of the PLT looks
really silly when the inline PLT call is turned back into a direct
call by the linker.  You end up with an unnecessary save and restore
of the register, plus copies from the register to r12.  What's the
chance of someone reporting that as a gcc "bug"?  :-)  Then there's
the possibility that reducing the number of instructions between two
calls of a small function runs into stalls.

How can we teach gcc about these unknowns?  i.e. how do we weigh use
of a call-saved register to cache PLT loads against other possible
uses of that register in a loop?  It's quite likely not a good use,
even when gcc knows the PLT entry has been resolved.

That means some gcc infrastructure would be needed to do this
sensibly.  Without that infrastructure, I think gcc should never
hoist a PLT load out of a loop.

-- 
Alan Modra
Australia Development Lab, IBM