On Wed, 15 May 2024 at 07:49, Wolfgang Hospital <wolfgang.hospi...@arcor.de> wrote:
>
> Dear all,
>
> G-J Lay has been kind enough to turn my whine about __udivmodqi4 into a bug
> report and handle that; I tried to follow suit reporting further strict
> improvements (NO resource used more, at least one used less).
> While I think bug keyword "missed-optimization" is for missed opportunities
> during compilation, I have no problem regarding strictly sub-optimal library
> code as a missed optimization.
>
> But what about speed improvements that take more instructions and/or stack,
> or are slower for some argument values? Starting with a same-size __mulqi3
> faster for all multipliers but zero, for which it is slower, or a __mulhi3
> with worst case about twice as fast, but 3 instructions longer than the
> current code (both pointless for cores with mul, obviously). Or division
> routines: a faster one that is no larger "without movw", but uses one more
> return address on stack; one that is 2 instructions smaller, a wee bit faster
> on average, but slower worst case; one that's about 14 cycles faster, but 1
> instruction longer?
The standard answer is to benchmark it, but which (mix of) benchmark(s) to choose? It's horses for courses, so if you really care, you can write multiple implementations optimized for specific use profiles. For a target that has no shared libraries, there is no run-time cost in having more library functions that can be used as alternatives depending on the use case. The build/install cost of additional functions does increase with the number of multilibs, though, so that is something to consider too.

You can have a compile-time option to change which function is called (and there you may change the ABI too, although that restricts the freedom to interchange functions in the linker), and/or a linker option to look for function resolution in specific sub-libraries or to translate function names.

If you only care about speed, it may make sense to use different implementations in different translation units. With value profiling, you might even make a per-call-site decision. Although in the presence of caches, there is also something to be said for keeping temporally close invocations on the same code, to increase locality.

OTOH, for all-out size optimization, you'd want to use only one implementation per executable, although the optimal implementation might depend on the number of call sites: if you have lots of call sites, it might be useful to have more saves in the callee so that the callers can use more registers that are live across the call.

Finally, when you have implemented compile- and/or link-time options for library function selection, for a good out-of-the-box experience you should set sensible defaults in the specs to select appropriate options depending on -O2 / -O3 / -Os etc.
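For the "sensible defaults in the specs" point, the selection could be sketched as a spec-file fragment along these lines (the sub-library names -lgcc_size and -lgcc_speed are invented here, and this is only a sketch of the idea, not a tested spec; GCC's spec language does provide the %{S:X; T:Y; :D} alternatives form):

```
*lib_select:
%{Os:-lgcc_size; O3:-lgcc_speed; :-lgcc}
```

The driver would then substitute the size-tuned sub-library under -Os, the speed-tuned one under -O3, and the default libgcc otherwise, giving a reasonable out-of-the-box choice without the user passing any extra option.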