On Mon, Jul 8, 2019 at 3:38 PM Josh Poimboeuf <jpoim...@redhat.com> wrote: > > On Mon, Jul 08, 2019 at 03:15:37PM -0700, Alexei Starovoitov wrote: > > > 2) > > > > > > After doing the first optimization, GCC then does another one which is > > > a little trickier. It replaces: > > > > > > select_insn: > > > jmp *jumptable(, %rax, 8) > > > ... > > > ALU64_ADD_X: > > > ... > > > jmp *jumptable(, %rax, 8) > > > ALU_ADD_X: > > > ... > > > jmp *jumptable(, %rax, 8) > > > > > > with > > > > > > select_insn: > > > mov jumptable, %r12 > > > jmp *(%r12, %rax, 8) > > > ... > > > ALU64_ADD_X: > > > ... > > > jmp *(%r12, %rax, 8) > > > ALU_ADD_X: > > > ... > > > jmp *(%r12, %rax, 8) > > > > > > The problem is that it only moves the jumptable address into %r12 > > > once, for the entire function, then it goes through multiple recursive > > > indirect jumps which rely on that %r12 value. But objtool isn't yet > > > smart enough to be able to track the value across multiple recursive > > > indirect jumps through the jump table. > > > > > > After some digging I found that the quick and easy fix is to disable > > > -fgcse. In fact, this seems to be recommended by the GCC manual, for > > > code like this: > > > > > > -fgcse > > > Perform a global common subexpression elimination pass. This > > > pass also performs global constant and copy propagation. > > > > > > Note: When compiling a program using computed gotos, a GCC > > > extension, you may get better run-time performance if you > > > disable the global common subexpression elimination pass by > > > adding -fno-gcse to the command line. > > > > > > Enabled at levels -O2, -O3, -Os. > > > > > > This code indeed relies extensively on computed gotos. I don't know > > > *why* disabling this optimization would improve performance. In fact > > > I really don't see how it could make much of a difference either way. > > > > > > Anyway, using -fno-gcse makes optimization #2 go away and makes > > > objtool happy, with only a fix for #1 needed. > > > > > > If -fno-gcse isn't an option, we might be able to fix objtool by using > > > the "first_jump_src" thing which Peter added, improving it such that > > > it also takes table jumps into account. > > > > Sorry for delay. I'm mostly offgrid until next week. > > As far as -fno-gcse.. I don't mind as long as it doesn't hurt performance. > > Which I suspect it will :( > > All these indirect gotos are there for performance. > > Single indirect goto and a bunch of jmp select_insn > > are way slower, since there is only one instruction > > for cpu branch predictor to work with. > > When every insn is followed by "jmp *jumptable" > > there is more room for cpu to speculate. > > It's been long time, but when I wrote it the difference > > between all indirect goto vs single indirect goto was almost 2x. > > Just to clarify, -fno-gcse doesn't get rid of any of the indirect jumps. > It still has 166 indirect jumps. It just gets rid of the second > optimization, where the jumptable address is placed in a register.
what about other functions in core.c ? May be it's easier to teach objtool to recognize that pattern? > If you have a benchmark which is relatively easy to use, I could try to > run some tests. modprobe test_bpf selftests/bpf/test_progs both print runtime. Some of test_progs have high run-to-run variations though.