On 9/29/20 2:32 PM, Owen Anderson wrote:
> Hello,
>
> I would like to request feedback on the following patch, which I do
> not believe should be applied to master as-is. The idea here is to
> avoid gathering the full CPU state in the fast path of an indirect
> branch lookup when running in user mode on a platform where the flags
> can only be changed in privileged mode. I believe this is true on the
> AArch64 scenario that I care about, but clearly not true in general.
> I'm particularly seeking feedback on how to clean this up into a
> version that checks the correct necessary and sufficient conditions to
> allow all users that can benefit from it to do so.
>
> On the workload that I am targeting (aarch64 on x86), this patch
> reduces execution wall time by approximately 20%, and eliminates
> indirect branch lookups from the hot stack traces entirely.
(1) What qemu version are you looking at, and (2) do you have
--enable-debug-tcg enabled?  Because you should not be seeing anything
even close to 20% overhead.

In e979972a6a1 (included in qemu 4.2), the AArch64 path is

    uint32_t flags = env->hflags;

    *cs_base = 0;
    if (FIELD_EX32(flags, TBFLAG_ANY, AARCH64_STATE)) {
        *pc = env->pc;
        if (cpu_isar_feature(aa64_bti, env_archcpu(env))) {
            flags = FIELD_DP32(flags, TBFLAG_A64, BTYPE, env->btype);
        }
        pstate_for_ss = env->pstate;
    }

    if (FIELD_EX32(flags, TBFLAG_ANY, SS_ACTIVE)
        && (pstate_for_ss & PSTATE_SS)) {
        flags = FIELD_DP32(flags, TBFLAG_ANY, PSTATE_SS, 1);
    }

    *pflags = flags;

With --enable-debug-tcg, there is an additional step wherein we
validate that env->hflags has the correct value, which has caught a
number of bugs.

With a silly testcase like so:

    for (int x = 0; x < 10000000; ++x) {
        void *tmp;
        asm volatile("adr %0,1f; br %0; 1:" : "=r"(tmp));
    }

I see cpu_get_tb_cpu_state no higher than 10% of the total runtime.
Which, I admit, is higher than I expected, but still nothing like what
you're reporting.  And a "reasonable" test case should surely have a
lower proportion of indirect branches per dynamic instruction.

> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
>      cpu_get_tb_cpu_state(env, pc, cs_base, flags);
> +#else
> +    if (is_a64(env)) {
> +        *pc = env->pc;
> +    } else {
> +        *pc = env->regs[15];
> +    }
> +#endif
...
> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
>          tb->cs_base == *cs_base &&
>          tb->flags == *flags &&
> +#endif

This assumes that all TBs have the same flags, and thus that the flags
don't need comparing.  That is false, even for CONFIG_USER_ONLY.  I
would guess that testing -cpu cortex-a57 does not exercise any of the
bits that might change, but anything post v8.2 will: SVE, BTI, MTE.

So, this change breaks stuff.

r~