On Wed, 15 May 2024 at 07:49, Wolfgang Hospital <wolfgang.hospi...@arcor.de> wrote:
>
> Dear all,
>
> G-J Lay has been kind enough to turn my whine about __udivmodqi4 into a bug
> report and handle that; I tried to follow suit reporting further strict
> improvements (NO resource used more, at least one used less).
> While I think bug keyword "missed-optimization" is for missed opportunities
> during compilation, I have no problem regarding strictly sub-optimal library
> code as a missed optimization.
>
> But what about speed improvements that take more instructions and/or stack,
> or are slower for some argument values? Starting with a same-size __mulqi3
> faster for all multipliers but zero, for which it is slower, or a __mulhi3
> with worst case about twice as fast, but 3 instructions longer than the
> current code (both pointless for cores with mul, obviously). Or division
> routines: a faster one that is no larger "without movw", but uses one more
> return address on stack; one that is 2 instructions smaller, a wee bit faster
> on average, but slower worst case; one that's about 14 cycles faster, but 1
> instruction longer?
The standard answer is to benchmark it, but which (mix of) benchmark(s) to choose? It's horses for courses, so if you really care, you can write multiple implementations optimized for specific use profiles. For a target that has no shared libraries, there is no run-time cost in having more library functions that can be used as alternatives depending on the use case. The build/install cost of additional functions does increase with the number of multilibs, though, so that is something to consider too.

You can have a compile-time option to change which function is called (and there you may change the ABI too, although that restricts the freedom to interchange functions in the linker), and/or a linker option to look for function resolution in specific sub-libraries or to translate function names.

If you only care about speed, it may make sense to use different implementations in different translation units. With value profiling, you might even make a per-call-site decision. Although in the presence of caches, there is also something to be said for keeping temporally close invocations on the same code, to increase locality.

OTOH, for all-out size optimization, you'd want to use only one implementation per executable, although the optimal implementation might depend on the number of call sites: if you have lots of call sites, it might be useful to have more saves in the callee so that the callers can use more registers that are live across the call.

Finally, when you have implemented compile- and/or link-time options for library function selection, for a good out-of-the-box experience you should set sensible defaults in the specs to select appropriate options depending on -O2 / -O3 / -Os etc.
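For the "sensible defaults in the specs" point, the selection could be sketched as a spec-file fragment along these lines (the sub-library names -lgcc_size and -lgcc_speed are invented here, and this is only a sketch of the idea, not a tested spec; GCC's spec language does provide the %{S:X; T:Y; :D} alternatives form):

```
*lib_select:
%{Os:-lgcc_size; O3:-lgcc_speed; :-lgcc}
```

The driver would then substitute the size-tuned sub-library under -Os, the speed-tuned one under -O3, and the default libgcc otherwise, giving a reasonable out-of-the-box choice without the user passing any extra option.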