Hi Sean,
Thanks for reviewing my patches.
On 10/15/2024 11:19 PM, Sean Christopherson wrote:
> On Fri, Oct 04, 2024, Manali Shukla wrote:
>> When a VMRUN instruction is executed, the bus lock threshold count is
>> loaded into an internal count register. Before the processor executes
>> a bus lock in the guest, it checks the value of this register:
>>
>> - If the value is greater than '0', the processor successfully
>> executes the bus lock and decrements the count.
>>
>> - If the value is '0', the bus lock is not executed, and a #VMEXIT to
>> the VMM is taken.
>>
>> The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT
>> code A5h, SVM_EXIT_BUS_LOCK.
>
> I vote to split this into two patches: one to add the architectural
> collateral, with the above as the changelog, and a second to actually
> implement support in KVM. Having the above background is useful, but it
> makes it quite hard to find information on the KVM design and
> implementation.
>
Sure. I will split it into 2 patches.
>> This implementation is set up to intercept every guest bus lock.
>
> "This implementation" is a variation on "This patch". Drop it, and
> simply state what the patch is doing.
>
>> The initial value of the Bus Lock Threshold counter is '0' in the VMCB, so
>> the very first bus lock will exit, set the Bus Lock Threshold counter
>> to '1' (so that the offending instruction can make forward progress)
>> but then the value is at '0' again so the next bus lock will exit.
>>
>> The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by
>
> s/kvm/KVM
>
> And again, the tone is wrong.
>
> Something like this is what I would aim for:
>
> Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
> Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
> allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
>
> The biggest difference between the two features is that Threshold is
> fault-like, whereas Detection is trap-like. To allow the guest to make
> forward progress, Threshold provides a per-VMCB counter which is
> decremented every time a bus lock occurs, and a VM-Exit is triggered if
> and only if the counter is '0'.
>
> To provide Detection-like semantics, initialize the counter to '0', i.e.
> exit on every bus lock, and when re-executing the guilty instruction, set
> the counter to '1' to effectively step past the instruction.
>
> Note, in the unlikely scenario that re-executing the instruction doesn't
> trigger a bus lock, e.g. because the guest has changed memory types or
> patched the guilty instruction, the bus lock counter will be left at '1',
> i.e. the guest will be able to do a bus lock on a different instruction.
> In a perfect world, KVM would ensure the counter is '0' if the guest has
> made forward progress, e.g. if RIP has changed. But trying to close that
> hole would incur non-trivial complexity, for marginal benefit; the intent
> of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks,
> not to allow for precise detection of problematic guest code. And, it's
> simply not feasible to fully close the hole, e.g. if an interrupt arrives
> before the original instruction can re-execute, the guest could step past
> a different bus lock.
>
>> setting the KVM_RUN_BUS_LOCK flag in vcpu->run->flags, indicating to
>> the user space that a bus lock has been detected in the guest.
>>
>> Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to
>> enable the feature. This feature is disabled by default, it can be
>> enabled explicitly from user space.
>>
>> More details about the Bus Lock Threshold feature can be found in AMD
>> APM [1].
>
> ...
>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d5314cb7dff4..ca1c42201894 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>> 	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
>> 	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
>
> No, it should be reset to '0'. The bus lock can't have happened on VMRUN,
> because KVM is emulating the VMRUN. That leaves two possibilities: the bus
> lock happened in L1 on an instruction before VMRUN, or the bus lock happened
> in _an_ L2, before a nested VM-Exit to L1 occurred.
>
> In the first case, the bus lock firmly happened on an instruction in the
> past. Even if vmcb01's counter is still '1', e.g. because the guilty
> instruction got patched, the vCPU has clearly made forward progress and so
> KVM should reset vmcb02's counter to '0'.
>
> In the second case, KVM has no idea if L2 has made forward progress. The
> only way to _try to_ detect if L2 has made forward progress would be to
> track the counter on a per-vmcb12 basis, but even that is flawed because
> KVM doesn't have visibility into L1's management of L2.
>
> I do think we may need to stash vmcb02's counter though. E.g. if userspace
> rate-limits the vCPU, then it's entirely possible that L1's tick interrupt
> is pending by the time userspace re-runs the vCPU. If KVM unconditionally
> clears the counter on VMRUN, then when L1 re-enters L2, the same instruction
> will trigger a VM-Exit and the entire cycle starts over.
>
> Anything we do is going to be imperfect, but the best idea I can come up
> with is also relatively simple, especially in conjunction with my suggestion
> below. If KVM tracks the RIP that triggered the bus lock, then on nested
> VM-Exit KVM can propagate that RIP into svm_nested_state as appropriate. E.g.
>
> 	if (vmcb02->control.bus_lock_counter &&
> 	    svm->bus_lock_rip == vmcb02->save.rip)
> 		svm->nested.bus_lock_rip = svm->bus_lock_rip;
> 	else
> 		svm->nested.bus_lock_rip = INVALID_GVA; /* or '0', as much as that bugs me */
>
> and then on nested VMRUN
>
> 	if (svm->nested.bus_lock_rip == vmcb02->save.rip) {
> 		vmcb02->control.bus_lock_counter = 1;
> 		svm->bus_lock_rip = svm->nested.bus_lock_rip;
> 	} else {
> 		vmcb02->control.bus_lock_counter = 0;
> 	}
>
> 	svm->nested.bus_lock_rip = INVALID_GVA;
>
> Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the
> same RIP, then KVM will allow a bus lock for the wrong vCPU. But the odds
> of that happening are absurdly low unless L1 is deliberately doing weird
> things, and so I don't think it's worth trying to close that hole.
>
>> +	 */
>> +	vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
>> 	/* Done at vmrun: asid. */
>>
>> 	/* Also overwritten later if necessary. */
>> @@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>
>> 	}
>>
>> +	/*
>> +	 * The bus lock threshold count should keep running across nested
>> +	 * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
>> +	 */
>> +	vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
>
> KVM should definitely reset the counter to '0' on a nested VM-Exit, because
> L1 can't be interrupted by L2, i.e. there is zero chance that KVM needs to
> allow a bus lock in L1 to ensure L1 makes forward progress.
>
>> 	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
>>
>> 	svm_switch_vmcb(svm, &svm->vmcb01);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 9df3e1e5ae81..0d0407f1ee6a 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>> 		svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
>> 	}
>>
>> +	if (vcpu->kvm->arch.bus_lock_detection_enabled)
>> +		svm_set_intercept(svm, INTERCEPT_BUSLOCK);
>> +
>> 	if (sev_guest(vcpu->kvm))
>> 		sev_init_vmcb(svm);
>>
>> @@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
>> 	return kvm_handle_invpcid(vcpu, type, gva);
>> }
>>
>> +static int bus_lock_exit(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
>> +	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
>> +
>> +	/*
>> +	 * Reload the counter with value greater than '0'.
>
> The value quite obviously must be exactly '1', not simply greater than '0'.
> I also think this is the wrong place to set the counter. Rather than set
> the counter at the time of exit, KVM should implement a
> vcpu->arch.complete_userspace_io callback and set the counter to '1' if and
> only if RIP (or LIP, but I have no objection to keeping things simple) is
> unchanged. It's a bit of extra complexity, but it will make it super obvious
> why KVM is setting the counter to '1'. And, if userspace wants to stuff
> state and move past the instruction, e.g. by emulating the guilty
> instruction, then KVM won't unnecessarily allow a bus lock in the guest.
>
> And then the comment can be:
>
> 	/*
> 	 * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
> 	 * execute the bus-locking instruction. Set the bus lock counter to '1'
> 	 * to effectively step past the bus lock.
> 	 */
>
Thank you for highlighting these scenarios (for both nested and normal
guests); I had not thought about them. I'm currently going through the
comments and trying to fully understand them. I'll try out the suggested
changes and get back to you.
- Manali
>> +	 * The bus lock exit on SVM happens with RIP pointing to the guilty
>> +	 * instruction. So, reloading the value of bus_lock_counter to '0'
>> +	 * results into generating continuous bus lock exits.
>> +	 */
>> +	svm->vmcb->control.bus_lock_counter = 1;
>> +
>> +	return 0;
>> +}
>> +
>> static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>> [SVM_EXIT_READ_CR0] = cr_interception,
>> [SVM_EXIT_READ_CR3] = cr_interception,
>> @@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>> [SVM_EXIT_CR4_WRITE_TRAP] = cr_trap,
>> [SVM_EXIT_CR8_WRITE_TRAP] = cr_trap,
>> [SVM_EXIT_INVPCID] = invpcid_interception,
>> + [SVM_EXIT_BUS_LOCK] = bus_lock_exit,
>> [SVM_EXIT_NPF] = npf_interception,
>> [SVM_EXIT_RSM] = rsm_interception,
>> [SVM_EXIT_AVIC_INCOMPLETE_IPI] = avic_incomplete_ipi_interception,
>> @@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
>> 		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
>> 	}
>>
>> +	if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
>> +		pr_info("Bus Lock Threshold supported\n");
>> +		kvm_caps.has_bus_lock_exit = true;
>> +	}
>> +
>> 	/* CPUID 0x80000008 */
>> 	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
>> 	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> --
>> 2.34.1
>>