Re: [PATCH V4 04/23] perf/x86/intel: Support adaptive PEBSv4

2019-03-27 Thread Andi Kleen
> We need to call perf_event_overflow() for the last record of each event.
> It's hard to detect which record is the last record of the event with one
> pass walking.
> 
> Also, I'm not sure how much we can save with one pass walking. The
> optimization should only benefit large PEBS. The total number of records for
> large PEBS should not be huge.
> I will evaluate the performance impact of one pass walking. If there is
> observed performance improvement, I will submit a separate patch later.
> 
> For now, I think we can still use the mature two pass walking method.

Okay sounds reasonable to keep it then.

Thanks,

-Andi


Re: [PATCH V4 04/23] perf/x86/intel: Support adaptive PEBSv4

2019-03-27 Thread Liang, Kan




On 3/26/2019 6:24 PM, Andi Kleen wrote:

> > +   for (at = base; at < top; at += cpuc->pebs_record_size) {
> > +   u64 pebs_status;
> > +
> > +   pebs_status = get_pebs_status(at) & cpuc->pebs_enabled;
> > +   pebs_status &= mask;
> > +
> > +   for_each_set_bit(bit, (unsigned long *)&pebs_status, size)
> > +   counts[bit]++;
> > +   }
> 
> On Icelake pebs_status is always reliable, so I don't think we need
> the two pass walking.



We need to call perf_event_overflow() for the last record of each event. 
It's hard to detect which record is the last record of the event with 
one pass walking.


Also, I'm not sure how much we can save with one pass walking. The 
optimization should only benefit large PEBS. The total number of records 
for large PEBS should not be huge.
I will evaluate the performance impact of one pass walking. If there is 
observed performance improvement, I will submit a separate patch later.


For now, I think we can still use the mature two pass walking method.

Thanks,
Kan


> -Andi
> 
> > +
> > +   for (bit = 0; bit < size; bit++) {
> > +   if (counts[bit] == 0)
> > +   continue;
> > +
> > +   event = cpuc->events[bit];
> > +   if (WARN_ON_ONCE(!event))
> > +   continue;
> > +
> > +   if (WARN_ON_ONCE(!event->attr.precise_ip))
> > +   continue;
> > +
> > +   __intel_pmu_pebs_event(event, iregs, base,
> > +  top, bit, counts[bit],
> > +  setup_pebs_adaptive_sample_data);
> > +   }
> > +}
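
To see why the counting pass matters, here is a minimal, self-contained
user-space sketch of the same idea (the pebs_rec layout, emit_sample(),
emit_overflow() and NUM_COUNTERS are invented for illustration and are not
the kernel implementation): each counter's final record in the buffer is
the one that has to go through the overflow path, and without the first
pass the walker cannot know, while looking at a record, whether another
record for the same counter still follows.

/*
 * Minimal user-space sketch of the two-pass PEBS drain idea.
 * Pass 1 counts how many records belong to each counter; pass 2 walks
 * the buffer per counter and treats the count'th record as the last one,
 * which is the only one that drives the overflow path.
 * All names here are hypothetical stand-ins, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 8

struct pebs_rec {
        uint64_t applicable_counters;   /* bit i set => counter i recorded here */
        uint64_t ip;
};

static void emit_sample(int bit, const struct pebs_rec *r)
{
        printf("counter %d: sample at ip=0x%llx\n",
               bit, (unsigned long long)r->ip);
}

static void emit_overflow(int bit, const struct pebs_rec *r)
{
        printf("counter %d: LAST record, overflow at ip=0x%llx\n",
               bit, (unsigned long long)r->ip);
}

static void drain_two_pass(const struct pebs_rec *base, int nr)
{
        int counts[NUM_COUNTERS] = { 0 };

        /* Pass 1: count records per counter. */
        for (int i = 0; i < nr; i++)
                for (int bit = 0; bit < NUM_COUNTERS; bit++)
                        if (base[i].applicable_counters & (1ULL << bit))
                                counts[bit]++;

        /* Pass 2: per counter, emit all records; only the last one overflows. */
        for (int bit = 0; bit < NUM_COUNTERS; bit++) {
                int seen = 0;

                if (!counts[bit])
                        continue;

                for (int i = 0; i < nr; i++) {
                        if (!(base[i].applicable_counters & (1ULL << bit)))
                                continue;
                        if (++seen == counts[bit])
                                emit_overflow(bit, &base[i]);
                        else
                                emit_sample(bit, &base[i]);
                }
        }
}

int main(void)
{
        struct pebs_rec buf[] = {
                { .applicable_counters = 0x1, .ip = 0x1000 },
                { .applicable_counters = 0x3, .ip = 0x2000 },
                { .applicable_counters = 0x1, .ip = 0x3000 },
        };

        drain_two_pass(buf, 3);
        return 0;
}

A one-pass walk would need to read ahead or keep per-counter state to
recover the same information, which is the difficulty Kan describes above.
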


Re: [PATCH V4 04/23] perf/x86/intel: Support adaptive PEBSv4

2019-03-26 Thread Andi Kleen
> + for (at = base; at < top; at += cpuc->pebs_record_size) {
> + u64 pebs_status;
> +
> + pebs_status = get_pebs_status(at) & cpuc->pebs_enabled;
> + pebs_status &= mask;
> +
> + for_each_set_bit(bit, (unsigned long *)&pebs_status, size)
> + counts[bit]++;
> + }

On Icelake pebs_status is always reliable, so I don't think we need
the two pass walking.

-Andi

> +
> + for (bit = 0; bit < size; bit++) {
> + if (counts[bit] == 0)
> + continue;
> +
> + event = cpuc->events[bit];
> + if (WARN_ON_ONCE(!event))
> + continue;
> +
> + if (WARN_ON_ONCE(!event->attr.precise_ip))
> + continue;
> +
> + __intel_pmu_pebs_event(event, iregs, base,
> +top, bit, counts[bit],
> +setup_pebs_adaptive_sample_data);
> + }
> +}


[PATCH V4 04/23] perf/x86/intel: Support adaptive PEBSv4

2019-03-26 Thread kan . liang
From: Kan Liang 

Adaptive PEBS is a new way to report PEBS sampling information. Instead
of a fixed-size record for all PEBS events, it allows the PEBS record to
be configured to include only the information that is needed. Events can
then opt in to such an extended record, or stay with a basic record that
only contains the IP.

The major new feature is support for LBRs in the PEBS record.
Besides normal LBR, this allows (much faster) large PEBS, while still
supporting call stacks through callstack LBR. So essentially a lot of
profiling can now be done without frequent interrupts, dropping the
overhead significantly.

The main requirement is still to use a fixed period rather than
frequency mode, because frequency mode requires re-evaluating the
frequency on each overflow.

The floating point state (XMM) is also supported, which allows efficient
profiling of FP function arguments.
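
For context on how these features are requested, a hypothetical
user-space snippet is sketched below (it is not part of this patch; the
event choice, period and register mask are arbitrary examples): a precise
event with a fixed period, LBR call stacks and interrupt registers. With
adaptive PEBS, the data these sample_type bits need can be logged by the
hardware itself, so multi-record (large) PEBS remains possible.

/*
 * Hypothetical user-space example (untested, not part of this patch):
 * a precise event with a fixed period, LBR call stacks and interrupt
 * GPRs, which an adaptive-PEBS machine can log in the PEBS record.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

static int open_precise_cycles(pid_t pid)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.precise_ip = 3;                    /* request PEBS */
        attr.sample_period = 100003;            /* fixed period, not freq mode */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK |
                           PERF_SAMPLE_REGS_INTR;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                  PERF_SAMPLE_BRANCH_CALL_STACK;
        attr.sample_regs_intr = 0xff;           /* example GPR mask */
        attr.exclude_kernel = 1;

        return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void)
{
        int fd = open_precise_cycles(0);        /* 0 = profile this process */

        return fd < 0 ? 1 : 0;
}

Switching the same event to frequency mode (attr.freq = 1) would defeat
large PEBS, since the period has to be recomputed on every overflow, as
noted above.
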

Introduce a specific drain function to handle the variable-length
records. Use a new callback to parse the new record format, and also
handle the STATUS field, which is now at a different offset.
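
As a rough sketch of what the different STATUS offset means for the
drain code (the struct and field names below are assumptions that mirror
the get_pebs_status() call in the quoted hunks; the authoritative layout
is in the perf_event.h part of this patch, not shown here): the
counter-status bits now live inside the basic record, so the walker
reads them through a small helper instead of the old fixed offset.

#include <stdint.h>
#include <stdio.h>

/*
 * Stand-in for struct pebs_basic; treat the field names as assumptions
 * for illustration only.
 */
struct pebs_basic_example {
        uint64_t format_size;           /* record format/size information */
        uint64_t ip;
        uint64_t applicable_counters;   /* the relocated STATUS bits */
        uint64_t tsc;
};

/* Mirrors the role of get_pebs_status() in the quoted drain code. */
static uint64_t example_get_pebs_status(const void *rec)
{
        return ((const struct pebs_basic_example *)rec)->applicable_counters;
}

int main(void)
{
        struct pebs_basic_example rec = { .applicable_counters = 0x5 };

        /* Counters 0 and 2 produced this record. */
        printf("status = 0x%llx\n",
               (unsigned long long)example_get_pebs_status(&rec));
        return 0;
}
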

Add code to set up the configuration register. Since there is only a
single register, all events either get the full superset of everything
any event requested, or only the basic record.
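
A tiny illustration of that single-register point (the flag names mirror
the PEBS_DATACFG_* bits used in the ds.c hunk below, but the bit
positions and the helper are assumptions made for this example): the
effective configuration is the bitwise OR of what every active PEBS
event asks for, so one event requesting an extended group makes every
adaptive-PEBS record on that CPU carry it.

#include <stdio.h>

/* Assumed bit positions, for illustration only. */
#define PEBS_DATACFG_MEMINFO    (1ULL << 0)
#define PEBS_DATACFG_GPRS       (1ULL << 1)
#define PEBS_DATACFG_XMMS       (1ULL << 2)
#define PEBS_DATACFG_LBRS       (1ULL << 3)

/*
 * One configuration register per CPU: the effective configuration is
 * the union of what every active PEBS event requests.
 */
static unsigned long long effective_pebs_data_cfg(const unsigned long long *cfgs, int n)
{
        unsigned long long cfg = 0;

        for (int i = 0; i < n; i++)
                cfg |= cfgs[i];

        return cfg;
}

int main(void)
{
        /*
         * One event wants meminfo, another wants GPRs + LBRs: both end
         * up with records that carry all three groups.
         */
        unsigned long long per_event[] = {
                PEBS_DATACFG_MEMINFO,
                PEBS_DATACFG_GPRS | PEBS_DATACFG_LBRS,
        };

        printf("pebs_data_cfg = 0x%llx\n",
               effective_pebs_data_cfg(per_event, 2));
        return 0;
}
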

Originally-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/core.c      |   2 +
 arch/x86/events/intel/ds.c        | 373 --
 arch/x86/events/intel/lbr.c       |  22 ++
 arch/x86/events/perf_event.h      |   9 +
 arch/x86/include/asm/msr-index.h  |   1 +
 arch/x86/include/asm/perf_event.h |  42
 6 files changed, 429 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8baa441d8000..620beae035a0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3507,6 +3507,8 @@ static struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
 
 int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 {
+   cpuc->pebs_record_size = x86_pmu.pebs_record_size;
+
if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
cpuc->shared_regs = allocate_shared_regs(cpu);
if (!cpuc->shared_regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index efc054aee3c1..1a076beb5fb1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -906,17 +906,85 @@ static inline void pebs_update_threshold(struct cpu_hw_events *cpuc)
 
if (cpuc->n_pebs == cpuc->n_large_pebs) {
threshold = ds->pebs_absolute_maximum -
-   reserved * x86_pmu.pebs_record_size;
+   reserved * cpuc->pebs_record_size;
} else {
-   threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+   threshold = ds->pebs_buffer_base + cpuc->pebs_record_size;
}
 
ds->pebs_interrupt_threshold = threshold;
 }
 
+static void adaptive_pebs_record_size_update(void)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   u64 pebs_data_cfg = cpuc->pebs_data_cfg;
+   int sz = sizeof(struct pebs_basic);
+
+   if (pebs_data_cfg & PEBS_DATACFG_MEMINFO)
+   sz += sizeof(struct pebs_meminfo);
+   if (pebs_data_cfg & PEBS_DATACFG_GPRS)
+   sz += sizeof(struct pebs_gprs);
+   if (pebs_data_cfg & PEBS_DATACFG_XMMS)
+   sz += sizeof(struct pebs_xmm);
+   if (pebs_data_cfg & PEBS_DATACFG_LBRS)
+   sz += x86_pmu.lbr_nr * sizeof(struct pebs_lbr_entry);
+
+   cpuc->pebs_record_size = sz;
+}
+
+#define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |   \
+   PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \
+   PERF_SAMPLE_TRANSACTION)
+
+static u64 pebs_update_adaptive_cfg(struct perf_event *event)
+{
+   struct perf_event_attr *attr = &event->attr;
+   u64 sample_type = attr->sample_type;
+   u64 pebs_data_cfg = 0;
+   bool gprs, tsx_weight;
+
+   if ((sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) ||
+   attr->precise_ip < 2) {
+
+   if (sample_type & PERF_PEBS_MEMINFO_TYPE)
+   pebs_data_cfg |= PEBS_DATACFG_MEMINFO;
+
+   /*
+* Cases we need the registers:
+* + user requested registers
+* + precise_ip < 2 for the non event IP
+* + For RTM TSX weight we need GPRs too for the abort
+* code. But we don't want to force GPRs for all other
+* weights.  So only collect it for the RTM abort event.
+*/
+   gprs = (sample_type & PERF_SAMPLE_REGS_INTR) &&
+ (attr->sample_regs_intr & 0x);
+   tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT) &&
+((attr->config & 0x) == x86_pmu.force_gpr_event);
+   if (gprs