On Thu, Mar 7, 2024 at 9:35 PM Richard Henderson
<richard.hender...@linaro.org> wrote:
>
> >>> -    for (size_t i = 0; i < ARRAY_SIZE(decoders); ++i) {
> >>> -        if (decoders[i].guard_func(ctx->cfg_ptr) &&
> >>> -            decoders[i].decode_func(ctx, opcode32)) {
> >>> +    for (size_t i = 0; i < decoder_table_size; ++i) {
> >>> +        if (ctx->decoder[i](ctx, opcode32)) {
> >>>              return;
>
> By the way, you're adding layers of pointer chasing, so I suspect you'll
> find all of this is a wash or worse, performance-wise.
>
> Indeed, since some of the predicates are trivial, going the other way
> might help: allow everything to be inlined:
>
>     if (decode_insn32(...)) {
>         return;
>     }
>     if (has_xthead_p(...) && decode_xthead(...)) {
>         return;
>     }
>     ...
>
> Even if there are 10 entries here, so what?  All of the code has to be
> compiled into QEMU.  You're not going to somehow add truly dynamic code
> that you've loaded from a module.
I just tested this with GCC at -O2/-O3. The code generated from the existing decoder loop ends up exactly as you have listed here: the loop is unrolled, the indirect calls become direct calls, the decoders are inlined, and the statically known guard conditions are evaluated away. has_xthead_p() can be inlined as well, if the inlining cost is considered low enough (thank you, Richard, for the hints about that below).

What the commit message does not mention (but what this patch was actually addressing, so it should have): dynamic control of the decoder order was meant to allow inserting a vendor decoder before decode_insn32() with minimal overhead (no guard_func) for users that don't need that particular vendor decoder.

To be more explicit: there is interest in supporting the non-conforming (conflicting) instruction extension XTheadVector:
  https://github.com/T-head-Semi/thead-extension-spec/blob/master/xtheadvector.adoc
XTheadVector uses the RVV 0.7.1 draft encoding, which conflicts with the ratified RVV spec. The idea is to avoid these conflicts with a call to decode_xtheadvector() before decode_insn32(). That implies everyone has to call has_xtheadvector_p() before calling decode_insn32(), and the intent of this patch was to provide a mechanism to reduce this overhead.

When suggesting the dynamic decoder list, I was not aware that always_true_p() would be eliminated by the compiler. Since that is the case, I agree with the "wash or worse" assessment for decode_insn32(). Eliminating the guard functions of the vendor decoders that follow it is likely less performance-relevant.

I don't think we should discuss efficiency any further unless we have data to justify any changes. E.g.
emulating a RISC-V SPEC CPU 2017 run on x86_64 and comparing the runtimes could give the relevant insights for the following scenarios:

* existing code on upstream/master
* existing code + a new extension that comes before decode_insn32() in decoders[]
* existing code + this patch (dynamic decoders)

A related overhead that could be measured to put the costs into perspective: adding 20 new instructions to decode_insn32() that are never executed.

> You could perhaps streamline predicates such as has_xthead_p to not test
> 11 variables by adding an artificial "ext_xthead_any" configuration
> entry that is the sum of all of those.
>
> r~