RE: [PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack II

2018-12-28 Thread Wang, Wei W
On Friday, December 28, 2018 4:53 AM, Andi Kleen wrote:
> Actually forgot one case.
> 
> In Arch Perfmon v4 the LBR freezing is also controlled through a
> GLOBAL_CTRL bit.
> I didn't see any code handling that bit?

That GLOBAL_STATUS.LBR_FRZ bit isn't supported yet. I'll add that, thanks.
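
(For illustration only, a minimal sketch of how that bit might be
reflected to the guest. The helper and its call site are assumptions;
GLOBAL_STATUS_LBRS_FROZEN and DEBUGCTLMSR_FREEZE_LBRS_ON_PMI are the
existing kernel definitions.)

	/* IA32_PERF_GLOBAL_STATUS bit 58, see asm/perf_event.h */
	#define GLOBAL_STATUS_LBRS_FROZEN	BIT_ULL(58)

	/* Hypothetical helper: when injecting a guest PMI, if the guest
	 * enabled freeze-on-PMI in DEBUGCTL, set LBR_FRZ in the emulated
	 * global status so a v4 guest PMI handler sees the freeze. */
	static void intel_pmu_freeze_lbrs_on_pmi(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

		if (vmcs_read64(GUEST_IA32_DEBUGCTL) &
		    DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
			pmu->global_status |= GLOBAL_STATUS_LBRS_FROZEN;
	}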

Best,
Wei




Re: [PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack

2018-12-28 Thread Andi Kleen
On Fri, Dec 28, 2018 at 11:47:06AM +0800, Wei Wang wrote:
> On 12/28/2018 04:51 AM, Andi Kleen wrote:
> > Thanks. This looks a lot better than the earlier versions.
> > 
> > Some more comments.
> > 
> > On Wed, Dec 26, 2018 at 05:25:38PM +0800, Wei Wang wrote:
> > > When the vCPU is scheduled in:
> > > - if the lbr feature was used in the last vCPU time slice, set the lbr
> > >stack to be interceptible, so that the host can capture whether the
> > >lbr feature will be used in this time slice;
> > > - if the lbr feature wasn't used in the last vCPU time slice, disable
> > >the vCPU support of the guest lbr switching.
> > time slice is the time from exit to exit?
> 
> It's the vCPU thread time slice (e.g. 100ms).

I don't think the time slices are that long, but ok.

> 
> > 
> > This might be rather short in some cases if the workload does a lot of exits
> > (which I would expect PMU workloads to do). Would be better to use some
> > explicit time check, or at least N exits.
> 
> Did you mean further increasing the lazy time to multiple host thread
> scheduling time slices?
> What would be a good value for "N"?

I'm not sure -- I think the goal would be to find a value that optimizes
performance (or rather minimizes overhead). But perhaps, if it's the
scheduler time slice as you say, it might be good enough as it is.

I guess it could be tuned later based on more experience.

> > or partially cleared. This would be user visible.
> > 
> > In theory could try to detect if the guest is inside a PMI and
> > save/restore then, but that would likely be complicated. I would
> > save/restore for all cases.
> 
> Yes, it is easier to save for all the cases. But curious for the
> non-callstack mode, it's just point sampling of functions (speculative
> to some degree).
> Would rarely losing a few recordings be important in that case?

In principle no for statistical samples, but I know some tools complain
about bogus samples (e.g. autofdo will). Also with perf report
--branch-history it will definitely be visible. I think it's easier to
always save now than to handle the user complaints about this later.


-Andi


Re: [PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack

2018-12-27 Thread Wei Wang

On 12/28/2018 04:51 AM, Andi Kleen wrote:

> Thanks. This looks a lot better than the earlier versions.
>
> Some more comments.
>
> On Wed, Dec 26, 2018 at 05:25:38PM +0800, Wei Wang wrote:
> > When the vCPU is scheduled in:
> > - if the lbr feature was used in the last vCPU time slice, set the lbr
> >   stack to be interceptible, so that the host can capture whether the
> >   lbr feature will be used in this time slice;
> > - if the lbr feature wasn't used in the last vCPU time slice, disable
> >   the vCPU support of the guest lbr switching.
>
> time slice is the time from exit to exit?

It's the vCPU thread time slice (e.g. 100ms).

> This might be rather short in some cases if the workload does a lot of exits
> (which I would expect PMU workloads to do). Would be better to use some
> explicit time check, or at least N exits.

Did you mean further increasing the lazy time to multiple host thread
scheduling time slices?
What would be a good value for "N"?

> > Upon the first access to one of the lbr related MSRs (since the vCPU was
> > scheduled in):
> > - record that the guest has used the lbr;
> > - create a host perf event to help save/restore the guest lbr stack if
> >   the guest uses the user callstack mode lbr stack;
>
> This is a bit risky. It would be safer (but also more expensive)
> to always save even for any guest LBR use independent of callstack.
>
> Otherwise we might get into a situation where
> a vCPU context switch inside the guest PMI will clear the LBRs
> before they can be read in the PMI, so some LBR samples will be fully
> or partially cleared. This would be user visible.
>
> In theory could try to detect if the guest is inside a PMI and
> save/restore then, but that would likely be complicated. I would
> save/restore for all cases.

Yes, it is easier to save for all the cases. But curious for the
non-callstack mode, it's just point sampling of functions (speculative
to some degree).
Would rarely losing a few recordings be important in that case?

> > +static void
> > +__always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr,
> > +					  int type, bool value);
>
> __always_inline should only be used if it's needed for functionality,
> or in a header.

Thanks, will fix it.

Best,
Wei


Re: [PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack II

2018-12-27 Thread Andi Kleen


Actually forgot one case.

In Arch Perfmon v4 the LBR freezing is also controlled through a GLOBAL_CTRL bit.
I didn't see any code handling that bit?


-Andi


Re: [PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack

2018-12-27 Thread Andi Kleen


Thanks. This looks a lot better than the earlier versions.

Some more comments.

On Wed, Dec 26, 2018 at 05:25:38PM +0800, Wei Wang wrote:
> When the vCPU is scheduled in:
> - if the lbr feature was used in the last vCPU time slice, set the lbr
>   stack to be interceptible, so that the host can capture whether the
>   lbr feature will be used in this time slice;
> - if the lbr feature wasn't used in the last vCPU time slice, disable
>   the vCPU support of the guest lbr switching.

time slice is the time from exit to exit?

This might be rather short in some cases if the workload does a lot of exits
(which I would expect PMU workloads to do). Would be better to use some
explicit time check, or at least N exits.
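
(A hypothetical sketch of the "N exits" variant; none of these names
exist in the patch except vmx_set_intercept_for_lbr_msrs(), which is
the patch's helper, and the counter field is made up:)

	#define LBR_LAZY_IDLE_EXITS	10000	/* illustrative value */

	/* Hypothetical: count VM-exits since guest lbr use was last
	 * observed; only re-arm interception after N quiet exits,
	 * instead of re-arming on every sched-in. */
	static void lbr_lazy_on_exit(struct kvm_vcpu *vcpu)
	{
		if (++vcpu->arch.lbr_idle_exits < LBR_LAZY_IDLE_EXITS)
			return;
		vcpu->arch.lbr_idle_exits = 0;
		vmx_set_intercept_for_lbr_msrs(vcpu, true);
	}

(The counter would be zeroed wherever the patch currently sets
lbr_used, i.e. on an intercepted access to an lbr MSR.)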

> 
> Upon the first access to one of the lbr related MSRs (since the vCPU was
> scheduled in):
> - record that the guest has used the lbr;
> - create a host perf event to help save/restore the guest lbr stack if
>   the guest uses the user callstack mode lbr stack;

This is a bit risky. It would be safer (but also more expensive)
to always save even for any guest LBR use independent of callstack.

Otherwise we might get into a situation where
a vCPU context switch inside the guest PMI will clear the LBRs
before they can be read in the PMI, so some LBR samples will be fully
or partially cleared. This would be user visible.

In theory could try to detect if the guest is inside a PMI and
save/restore then, but that would likely be complicated. I would
save/restore for all cases.

> +static void
> +__always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr,
> +   int type, bool value);

__always_inline should only be used if it's needed for functionality,
or in a header.

The rest looks good.

-Andi



[PATCH v4 10/10] KVM/x86/lbr: lazy save the guest lbr stack

2018-12-26 Thread Wei Wang
When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr
  stack to be interceptible, so that the host can capture whether the
  lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable
  the vCPU support of the guest lbr switching.

Upon the first access to one of the lbr related MSRs (since the vCPU was
scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack if
  the guest uses the user callstack mode lbr stack;
- pass the stack through to the guest.

Suggested-by: Andi Kleen 
Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/include/asm/kvm_host.h |   4 ++
 arch/x86/kvm/pmu.h  |   5 ++
 arch/x86/kvm/vmx.c  | 138 
 3 files changed, 147 insertions(+)
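
(The diff below ends before the vmx.c sched-in hook; as a rough sketch
of the policy from the commit message, reusing the patch's names --
intel_pmu_disable_save_guest_lbr() is assumed from an earlier patch in
the series:)

	static void vmx_lbr_sched_in(struct kvm_vcpu *vcpu)
	{
		if (vcpu->arch.lbr_used) {
			/* Used last slice: intercept again so the first
			 * access in this slice can be observed. */
			vcpu->arch.lbr_used = false;
			vmx_set_intercept_for_lbr_msrs(vcpu, true);
		} else if (intel_pmu_save_vcpu_lbr_enabled(vcpu)) {
			/* Idle last slice: drop the lbr switching support. */
			intel_pmu_disable_save_guest_lbr(vcpu);
		}
	}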

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fac209b..7f91eac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -775,6 +775,10 @@ struct kvm_vcpu_arch {
 
/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
bool l1tf_flush_l1d;
+   /* Indicate if the guest is using lbr with the user callstack mode */
+   bool lbr_user_callstack;
+   /* Indicate if the lbr msrs were accessed in this vCPU time slice */
+   bool lbr_used;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index efd8f16..c1fed24 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -103,6 +103,11 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
return NULL;
 }
 
+static inline bool intel_pmu_save_vcpu_lbr_enabled(struct kvm_vcpu *vcpu)
+{
+   return !!vcpu_to_pmu(vcpu)->vcpu_lbr_event;
+}
+
 void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ee02967..80ec3f4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1310,6 +1310,9 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
 static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
static __always_inline void vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
							  u32 msr, int type);
+static void
+__always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr,
+					  int type, bool value);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -4088,6 +4091,121 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
return 0;
 }
 
+static void vmx_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
+{
+   unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+   struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+   int nr = stack->nr;
+   int i;
+
+   vmx_set_intercept_for_msr(msr_bitmap, stack->tos, MSR_TYPE_RW, set);
+   for (i = 0; i < nr; i++) {
+   vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
+ MSR_TYPE_RW, set);
+   vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
+ MSR_TYPE_RW, set);
+   if (stack->info)
+   vmx_set_intercept_for_msr(msr_bitmap, stack->info + i,
+ MSR_TYPE_RW, set);
+   }
+}
+
+static inline bool msr_is_lbr_stack(struct kvm_vcpu *vcpu, u32 index)
+{
+   struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+   int nr = stack->nr;
+
+   return !!(index == stack->tos ||
+	     (index >= stack->from && index < stack->from + nr) ||
+	     (index >= stack->to && index < stack->to + nr) ||
+	     (index >= stack->info && index < stack->info + nr));
+}
+
+static bool guest_get_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+   u32 index = msr_info->index;
+   bool ret = false;
+
+   switch (index) {
+   case MSR_IA32_DEBUGCTLMSR:
+   msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+   ret = true;
+   break;
+   case MSR_LBR_SELECT:
+   ret = true;
+   rdmsrl(index, msr_info->data);
+   break;
+   default:
+   if (msr_is_lbr_stack(vcpu, index)) {
+   ret = true;
+   rdmsrl(index, msr_info->data);
+   }
+   }
+
+   return ret;
+}
+
+static bool guest_set_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+   u32 index = msr_info->index;
+   u64 data = msr_info->data;
+   bool