Since both guest-kernel and guest-user mode of 64-bit PV guests run in ring 3, they run in the same "predictor mode". While the guest kernel could take care of this itself, doing so would be yet another item distinguishing PV from native. Additionally, Xen is in a much better position to issue the barrier command, and we can save a #GP (for privileged instruction emulation) this way.
To allow performance to be recovered, introduce a new VM assist allowing
the guest kernel to suppress this barrier.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Leverage entry-IBPB. Add VM assist. Re-base.
---
I'm not entirely happy with re-using opt_ibpb_ctxt_switch here (it's a
mode switch after all, but v1 used opt_ibpb here), but it also didn't
seem very reasonable to introduce yet another command line option. The
only feasible alternative I would see is to check the CPUID bits
directly.

--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -757,7 +757,8 @@ static inline void pv_inject_sw_interrup
  * but we can't make such requests fail all of the sudden.
  */
 #define PV64_VM_ASSIST_MASK (PV32_VM_ASSIST_MASK | \
-                             (1UL << VMASST_TYPE_m2p_strict))
+                             (1UL << VMASST_TYPE_m2p_strict) | \
+                             (1UL << VMASST_TYPE_mode_switch_no_ibpb))
 #define HVM_VM_ASSIST_MASK (1UL << VMASST_TYPE_runstate_update_flag)
 
 #define arch_vm_assist_valid_mask(d) \
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -467,7 +467,15 @@ void toggle_guest_mode(struct vcpu *v)
     if ( v->arch.flags & TF_kernel_mode )
         v->arch.pv.gs_base_kernel = gs_base;
     else
+    {
         v->arch.pv.gs_base_user = gs_base;
+
+        if ( opt_ibpb_ctxt_switch &&
+             !(d->arch.spec_ctrl_flags & SCF_entry_ibpb) &&
+             !VM_ASSIST(d, mode_switch_no_ibpb) )
+            wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
+    }
+
     asm volatile ( "swapgs" );
 
     _toggle_guest_pt(v);
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -571,6 +571,16 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
  */
 #define VMASST_TYPE_m2p_strict 32
 
+/*
+ * x86-64 guests: Suppress IBPB on guest-user to guest-kernel mode switch.
+ *
+ * By default (on affected and capable hardware) as a safety measure Xen,
+ * to cover for the fact that guest-kernel and guest-user modes are both
+ * running in ring 3 (and hence share prediction context), would issue a
+ * barrier for user->kernel mode switches of PV guests.
+ */
+#define VMASST_TYPE_mode_switch_no_ibpb 33
+
 #if __XEN_INTERFACE_VERSION__ < 0x00040600
 #define MAX_VMASST_TYPE 3
 #endif
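
For review purposes, the condition added to toggle_guest_mode() can be modelled in isolation. The following is only a sketch and not Xen code: struct ibpb_policy, its fields, and mode_switch_needs_ibpb() are hypothetical stand-ins for opt_ibpb_ctxt_switch, the SCF_entry_ibpb flag, and VM_ASSIST(d, mode_switch_no_ibpb), and the to_kernel parameter models the "else" branch, which per the xen.h comment is the user->kernel direction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state consulted by the new hunk; not Xen types. */
struct ibpb_policy {
    bool ibpb_ctxt_switch; /* models opt_ibpb_ctxt_switch */
    bool entry_ibpb;       /* models SCF_entry_ibpb set in spec_ctrl_flags */
    bool assist_no_ibpb;   /* models VM_ASSIST(d, mode_switch_no_ibpb) */
};

/*
 * Return true when this model would issue a barrier on a mode switch.
 * Only the guest-user -> guest-kernel direction is ever a candidate.
 */
bool mode_switch_needs_ibpb(const struct ibpb_policy *p, bool to_kernel)
{
    if ( !to_kernel )
        return false;

    return p->ibpb_ctxt_switch && !p->entry_ibpb && !p->assist_no_ibpb;
}

/* Walk through the cases the patch description distinguishes. */
void mode_switch_ibpb_examples(void)
{
    struct ibpb_policy p = { .ibpb_ctxt_switch = true };

    /* Default: barrier on user->kernel, none on kernel->user. */
    assert(mode_switch_needs_ibpb(&p, true));
    assert(!mode_switch_needs_ibpb(&p, false));

    /* Guest kernel opted out via the new VM assist. */
    p.assist_no_ibpb = true;
    assert(!mode_switch_needs_ibpb(&p, true));

    /* Entry-IBPB already in effect, so no extra barrier here. */
    p.assist_no_ibpb = false;
    p.entry_ibpb = true;
    assert(!mode_switch_needs_ibpb(&p, true));
}
```

The model makes the three independent ways of skipping the barrier explicit: the context-switch option being off, entry-IBPB already being in effect for the domain, and the guest kernel having enabled the assist.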