From: Kan Liang <kan.li...@linux.intel.com>

Starting from Haswell, Linux perf can utilize the existing Last Branch
Record (LBR) facility to record call stacks. However, the depth of the
reconstructed LBR call stack is limited to the number of LBR registers.
E.g. on Skylake, the depth of the reconstructed LBR call stack is <= 32.
That's because HW overwrites the oldest LBR registers when the LBR stack
is full.
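The overwrite behaviour described above can be pictured with a small
model. This is only a sketch under stated assumptions: the names
lbr_ring, lbr_push, lbr_get and lbr_oldest_after_calls are hypothetical
and do not come from the kernel or from this series, the ring size of 32
matches the Skylake example, and the model only covers nested calls with
no returns.

```c
#include <assert.h>
#include <stddef.h>

#define LBR_NR 32	/* assumption: 32 LBR registers, as on Skylake */

/* Hypothetical model of the LBR stack, not the kernel's data structure. */
struct lbr_ring {
	int entries[LBR_NR];	/* function number of each recorded call */
	size_t tos;		/* physical index of the newest entry */
	size_t nr;		/* number of valid entries, at most LBR_NR */
};

void lbr_init(struct lbr_ring *r)
{
	r->tos = LBR_NR - 1;	/* so that the first push lands on index 0 */
	r->nr = 0;
}

/* Record one call; once the ring is full the oldest entry is overwritten. */
void lbr_push(struct lbr_ring *r, int func)
{
	r->tos = (r->tos + 1) % LBR_NR;
	r->entries[r->tos] = func;
	if (r->nr < LBR_NR)
		r->nr++;
}

/* Return the i-th entry counting back from the newest (i == 0 is newest). */
int lbr_get(const struct lbr_ring *r, size_t i)
{
	assert(i < r->nr);
	return r->entries[(r->tos + LBR_NR - i) % LBR_NR];
}

/*
 * Push the nested calls f1..f<depth> (depth must be at least 1) and
 * return the number of the oldest function still present in the ring.
 */
int lbr_oldest_after_calls(int depth)
{
	struct lbr_ring r;
	int f;

	lbr_init(&r);
	for (f = 1; f <= depth; f++)
		lbr_push(&r, f);
	return lbr_get(&r, r.nr - 1);
}
```

For a chain of 43 nested calls f1 ... f43, lbr_oldest_after_calls(43)
returns 12: f1 to f11 have been overwritten. A snapshot taken when the
chain was only 32 deep (lbr_oldest_after_calls(32)) still reaches back
to f1.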

However, the overwritten LBRs may still be retrievable from the previous
sample. At that moment, HW hadn't overwritten the LBR registers yet. Perf
tools can stitch those overwritten LBRs onto the current call stack to
get a more complete call stack.

To determine whether LBRs can be stitched, the physical index of the LBR
registers is required. A new sample type is introduced in patch 1 & 2 to
dump the LBR Top-of-Stack (TOS) information for perf tools.
Besides, the maximum number of LBRs is required as well. Patch 3 & 4
retrieve the capabilities information from sysfs and save it in the perf
header.

Patch 5 & 6 implement the LBR stitching approach.

Users can use the options introduced in patch 7-10 to enable the LBR
stitching approach for perf report, script, top and c2c.

The stitching approach is based on the LBR call stack technology. The
known limitations of the LBR call stack technology still apply to the
approach, e.g. exception handling such as setjmp/longjmp will have
calls/returns that do not match.
This approach is not foolproof. There can be cases where it creates
incorrect call stacks from incorrect matches. There is no attempt to
validate any matches in another way, so it is not enabled by default.
However, in many common cases with call stack overflows it can recreate
better call stacks than the default LBR call stack output. So if there
are problems with LBR overflows, this is a possible workaround.

Performance impact:

The processing time may increase with the LBR stitching approach
enabled. The impact depends on the number of samples with stitched LBRs.

For sqlite's tcltest,
  perf record --call-graph lbr -- make tcltest
  perf report --stitch-lbr

4.11% of the samples have stitched LBRs.
  Total number of samples:               2833728
  Number of samples with stitched LBRs:  116478
The processing time of perf report increases by 6.8%.
  Without --stitch-lbr:  55906106 usec
  With --stitch-lbr:     59728701 usec

For a simple test case, tchain_edit, with a call stack depth of 43,
  perf record --call-graph lbr -- ./tchain_edit
  perf report --stitch-lbr

99.9% of the samples have stitched LBRs.
  Total number of samples:               10915
  Number of samples with stitched LBRs:  10905
The processing time of perf report increases by 67.4%.
  Without --stitch-lbr:  11970508 usec
  With --stitch-lbr:     20036055 usec

The source code of tchain_edit.c is something like the following.

noinline void f43(void)
{
	int i;
	for (i = 0; i < 10000;) {
		if(i%2)
			i++;
		else
			i++;
	}
}

noinline void f42(void)
{
	int i;
	for (i = 0; i < 100; i++) {
		f43();
		f43();
		f43();
	}
}

noinline void f41(void)
{
	int i;
	for (i = 0; i < 100; i++) {
		f42();
		f42();
		f42();
	}
}

noinline void f40(void)
{
	f41();
}

... ...

noinline void f32(void)
{
	f33();
}

noinline void f31(void)
{
	int i;

	for (i = 0; i < 10000; i++) {
		if(i%2)
			i++;
		else
			i++;
	}

	f32();
}

noinline void f30(void)
{
	f31();
}

... ...

noinline void f1(void)
{
	f2();
}

int main()
{
	f1();
}

Kan Liang (10):
  perf/core, x86: Add PERF_SAMPLE_LBR_TOS
  perf tools: Support PERF_SAMPLE_LBR_TOS
  perf pmu: Add support for PMU capabilities
  perf header: Support CPU PMU capabilities
  perf machine: Refine the function for LBR call stack reconstruction
  perf tools: Stitch LBR call stack
  perf report: Add option to enable the LBR stitching approach
  perf script: Add option to enable the LBR stitching approach
  perf top: Add option to enable the LBR stitching approach
  perf c2c: Add option to enable the LBR stitching approach

 arch/x86/events/intel/lbr.c                 |   9 +
 include/linux/perf_event.h                  |   1 +
 include/uapi/linux/perf_event.h             |   4 +-
 kernel/events/core.c                        |  12 +
 tools/include/uapi/linux/perf_event.h       |   4 +-
 tools/perf/Documentation/perf-c2c.txt       |  11 +
 tools/perf/Documentation/perf-report.txt    |  11 +
 tools/perf/Documentation/perf-script.txt    |  11 +
 tools/perf/Documentation/perf-top.txt       |   9 +
 .../Documentation/perf.data-file-format.txt |  16 +
 tools/perf/builtin-c2c.c                    |   6 +
 tools/perf/builtin-record.c                 |   3 +
 tools/perf/builtin-report.c                 |   6 +
 tools/perf/builtin-script.c                 |   6 +
 tools/perf/builtin-stat.c                   |   1 +
 tools/perf/builtin-top.c                    |  11 +
 tools/perf/util/branch.h                    |  10 +-
 tools/perf/util/env.h                       |   3 +
 tools/perf/util/event.h                     |   1 +
 tools/perf/util/evsel.c                     |  16 +-
 tools/perf/util/evsel.h                     |   1 +
 tools/perf/util/header.c                    | 110 +++++++
 tools/perf/util/header.h                    |   1 +
 tools/perf/util/machine.c                   | 303 ++++++++++++++----
 tools/perf/util/perf_event_attr_fprintf.c   |   2 +-
 tools/perf/util/pmu.c                       |  87 +++++
 tools/perf/util/pmu.h                       |  12 +
 tools/perf/util/synthetic-events.c          |   8 +
 tools/perf/util/thread.c                    |   3 +
 tools/perf/util/thread.h                    |  18 ++
 tools/perf/util/top.h                       |   1 +
 31 files changed, 626 insertions(+), 71 deletions(-)

-- 
2.17.1