Hi, On 2020-03-30 21:33:14 +0800, John Naylor wrote: > Then I used the attached program to measure various combinations of > compiled instructions using two constant multipliers iterating over > bytes similar to a generated hash function.
It looks like you didn't attach the program? > <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c > ./a.out > Median of 3 with clang 10: > > lea, lea 0.181s > > lea, lea+add 0.248s > lea, shift+add 0.251s > > lea+add, shift+add 0.273s > shift+add, shift+add 0.276s > > 2 leas, 2 leas 0.290s > shift+add, imul 0.329s > > Taking this with a grain of salt, it nonetheless seems plausible that > a single lea could be faster than any two instructions here. It's a bit complicated by the fact that there's more execution ports to execute shift/add than there ports to compute some form of leas. And some of that won't easily be measurable in a micro-benchmark, because there'll be dependencies between the instruction preventing any instruction level parallelism. I think the form of lea generated here is among the ones that can only be executed on port 1. Whereas e.g. an register+register/immediate add can be executed on four different ports. There's also a significant difference in latency that you might not see in your benchmark. E.g. on coffee lake the relevant form of lea has a latency of three cycles, but one independent lea can be "started" per cycle (agner calls this "reciprocal throughput). Whereas a shift has a latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better), add has a latency o 1 and a reciprocal throughput of 0.25. See the tables in https://www.agner.org/optimize/instruction_tables.pdf I'm not really sure my musings above matter terribly much, but I just wanted to point out why I'd not take too much stock in the above timings in isolation. Even a very high latency wouldn't necessarily be penalized in a benchmark with one loop iteration independent from each other, but would matter in the real world. Cool work! Greetings, Andres Freund