On 12/10/2018 18:30, Andi Kleen wrote:
>> 4. Results
>>     - Without this optimization, the guest pmi handling time is
>>       ~4500000 ns, and the max sampling rate is reduced to 250.
>>     - With this optimization, the guest pmi handling time is ~9000 ns
>>       (i.e. 1 / 500 of the non-optimization case), and the max sampling
>>       rate remains at the original 100000.
> 
> Impressive performance improvement!

Agreed!

> It's not clear to me why you're special casing PMIs here. The optimization
> should work generically, right?

Yeah, you can even just check if the counter is in the struct
cpu_hw_events guest mask, and if so always write the counter MSR directly.
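
Roughly, as a user-space sketch rather than kernel code. Everything here is an assumption for illustration: struct cpu_state_model and its guest_mask field stand in for the mask kept in struct cpu_hw_events, and fake_hw_counter[] stands in for the counter MSR that a real implementation would write with wrmsrl().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_COUNTERS 8

/* Stand-in for the hardware counter MSRs (hypothetical, no real MSR access). */
static uint64_t fake_hw_counter[NUM_COUNTERS];

/* Software-emulated counter values, i.e. the existing slow path. */
static uint64_t emulated_counter[NUM_COUNTERS];

/* Hypothetical model of the per-CPU state; only the guest mask matters. */
struct cpu_state_model {
	uint64_t guest_mask;	/* bit i set: counter i is owned by the guest */
};

/* Returns true if the write went straight to the (fake) hardware counter. */
static bool write_guest_counter(const struct cpu_state_model *cpu,
				unsigned int idx, uint64_t data)
{
	assert(idx < NUM_COUNTERS);

	if (cpu->guest_mask & (UINT64_C(1) << idx)) {
		/* Counter is in the guest mask: write it directly. */
		fake_hw_counter[idx] = data;
		return true;
	}

	/* Otherwise fall back to the emulated value. */
	emulated_counter[idx] = data;
	return false;
}
```

The point is only that the decision depends on the mask, not on whether the guest happens to be inside a PMI handler.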

>> @@ -237,9 +267,23 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>      default:
>>              if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
>>                  (pmc = get_fixed_pmc(pmu, msr))) {
>> -                    if (!msr_info->host_initiated)
>> -                            data = (s64)(s32)data;
>> -                    pmc->counter += data - pmc_read_counter(pmc);
>> +                    if (pmu->in_pmi) {
>> +                            /*
>> +                             * Since we are not re-allocating a perf event
>> +                             * to reconfigure the sampling time when the
>> +                             * guest pmu is in PMI, just set the value to
>> +                             * the hardware perf counter. Counting will
>> +                             * continue after the guest enables the
>> +                             * counter bit in MSR_CORE_PERF_GLOBAL_CTRL.
>> +                             */
>> +                            struct hw_perf_event *hwc =
>> +                                            &pmc->perf_event->hw;
>> +                            wrmsrl(hwc->event_base, data);
> 
> Is that guaranteed to be always called on the right CPU that will run the 
> vcpu?
> 
> AFAIK there's an ioctl to set MSRs in the guest from qemu, I'm pretty sure
> it won't handle that.

How much of the performance improvement comes from here?  In theory
pmc_read_counter() should always hit a relatively fast path, because the
smp_call_function_single() in perf_event_read() doesn't need an IPI.
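
For reference, the arithmetic of the three lines being removed can be modelled in user space roughly as below. This is only a sketch: struct pmc_model, its fields and both helpers are made up for illustration, and event_count stands in for whatever the host perf event has counted so far.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a guest counter backed by a host perf event. */
struct pmc_model {
	uint64_t counter;	/* software offset kept by the emulation */
	uint64_t event_count;	/* stand-in for the host perf event's count */
	uint64_t bitmask;	/* counter width mask, e.g. 48 bits */
};

/* Guest-visible value: software offset plus what the event has counted. */
static uint64_t pmc_model_read(const struct pmc_model *pmc)
{
	return (pmc->counter + pmc->event_count) & pmc->bitmask;
}

/* Adjust the offset so that the next read returns the written value. */
static void pmc_model_write(struct pmc_model *pmc, uint64_t data,
			    int host_initiated)
{
	assert(pmc != NULL);

	/* Guest writes are sign-extended from 32 bits, as in the old code. */
	if (!host_initiated)
		data = (uint64_t)(int64_t)(int32_t)(uint32_t)data;

	pmc->counter += data - pmc_model_read(pmc);
}
```

The write path therefore costs one counter read plus an addition, which is why the read path is the part worth measuring.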

In any case, this should be a separate patch.

Paolo

> May need to be delayed to entry time.
> 
> -Andi
> 
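
One possible shape for delaying the write to entry time, again as a user-space sketch under loud assumptions: struct vcpu_model, set_counter_deferred() and vcpu_enter_model() are hypothetical names, and fake_msr[] stands in for one counter MSR per physical CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS 4

/* Stand-in for one counter MSR per physical CPU (no real MSR access). */
static uint64_t fake_msr[NUM_CPUS];

/* Hypothetical per-vCPU state: a counter value waiting to be written. */
struct vcpu_model {
	bool pending;
	uint64_t pending_value;
};

/* May run on any CPU, e.g. from a userspace ioctl: only record the value. */
static void set_counter_deferred(struct vcpu_model *vcpu, uint64_t data)
{
	vcpu->pending_value = data;
	vcpu->pending = true;
}

/* Runs on the CPU that is about to enter the guest: apply the write there. */
static void vcpu_enter_model(struct vcpu_model *vcpu, unsigned int this_cpu)
{
	assert(this_cpu < NUM_CPUS);

	if (vcpu->pending) {
		fake_msr[this_cpu] = vcpu->pending_value;
		vcpu->pending = false;
	}
}
```

With this split, a write issued from the wrong CPU touches no hardware state until the vcpu is actually about to run.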
