https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91898
Bug ID: 91898 Summary: [optimization] switch statements for state machines could be better Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: nathan at acm dot org Target Milestone: --- Interesting talk at cppcon'19 by Miro Knejp 'Non-conforming C++: the Secrets the Committee Is Hiding From You' It'll be on youtube at some point. This PR concerns his observations about a common VM implementation technique. We have an array of instructions, and we execute these one after eachother. The natural way to write this would be: for (int index = 0; ; index++) switch (insns[index]) { case ADD: ... break; case SUB: ... break; ... case END: goto end; } end:; this leads to code gen that looks like: loop: r = insn[index]; index = index + 1; [range check on r] addr = jmptable[r]; jmp [addr]; ADD: ... jmp loop etc. This performs poorly because all the dispatches go through the single jmp[addr] instruction, and the CPU's indirect branch predictor has no idea where it'll go. But there are patterns in the sequence of virtual instructions, just as there are in real code! Hence, people use the GCC label-address extension and write: static void *const dispatch[] = {&&ADD, &&SUB, ... }; index = 0; goto *dispatch[insn[index++]]; ADD: ... goto *dispatch[insn[index++]]; SUB: ... goto *dispatch[insn[index++]]; code generation looks like: ADD: ... r = insn[index]; index++; addr = dispatch[r]; jmp [addr]; This is faster because there are per-virtual-insn indirect jumps, and the branch predictor does much better. (We're also eliding the range check, I do not know how much that would cost if it was added). Hence, it might be worthwhile duplicating the switch's dispatch logic at the end of every case block.