Documentation for IMC(In-Memory Collection Counters) infrastructure and trace-mode of IMC.
Signed-off-by: Anju T Sudhakar <a...@linux.vnet.ibm.com> --- Documentation/powerpc/imc.txt | 195 ++++++++++++++++++++++++++++++++++ 1 file changed, 195 insertions(+) create mode 100644 Documentation/powerpc/imc.txt diff --git a/Documentation/powerpc/imc.txt b/Documentation/powerpc/imc.txt new file mode 100644 index 000000000000..9c32e059f3be --- /dev/null +++ b/Documentation/powerpc/imc.txt @@ -0,0 +1,195 @@ + =================================== + IMC (In-Memory Collection Counters) + =================================== + Date created: 10 May 2019 + +Table of Contents: +------------------ + - Basic overview + - IMC example Usage + - IMC Trace Mode + - LDBAR Register Layout + - TRACE_IMC_SCOM bit representation + - Trace IMC example usage + - Benefits of using IMC trace-mode + + +Basic overview +============== + +IMC (In-Memory collection counters) is a hardware monitoring facility +that collects large number of hardware performance events at Nest level +(these are on-chip but off-core), Core level and Thread level. + +The Nest PMU counters are handled by a Nest IMC microcode which runs +in the OCC (On-Chip Controller) complex. The microcode collects the +counter data and moves the nest IMC counter data to memory. + +The Core and Thread IMC PMU counters are handled in the core. Core +level PMU counters give us the IMC counters' data per core and thread +level PMU counters give us the IMC counters' data per CPU thread. + +OPAL obtains the IMC PMU and supported events information from the +IMC Catalog and passes on to the kernel via the device tree. The event's +information contains : + - Event name + - Event Offset + - Event description +and, maybe : + - Event scale + - Event unit + +Some PMUs may have a common scale and unit values for all their +supported events. For those cases, the scale and unit properties for +those events must be inherited from the PMU. + +The event offset in the memory is where the counter data gets +accumulated. + +IMC catalog is available at: + https://github.com/open-power/ima-catalog + +The kernel discovers the IMC counters information in the device tree +at the "imc-counters" device node which has a compatible field +"ibm,opal-in-memory-counters". From the device tree, the kernel parses +the PMUs and their event's information and register the PMU and it +attributes in the kernel. + +IMC example usage +================= + +# perf list + + [...] + nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/ [Kernel PMU event] + nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/ [Kernel PMU event] + + [...] + core_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event] + core_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event] + + [...] + thread_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event] + thread_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event] + +To see per chip data for nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/ : + # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket + +To see non-idle instructions for core 0 : + # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000 + +To see non-idle instructions for a "make" : + # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make + + +IMC Trace-mode +=============== + +POWER9 support two modes for IMC which are the Accumulation mode and +Trace mode. In Accumulation mode, event counts are accumulated in system +Memory. Hypervisor then reads the posted counts periodically or when +requested. In IMC Trace mode, the 64 bit trace scom value is initialized +with the event information. The CPMC*SEL and CPMC_LOAD in the trace scom, +specifies the event to be monitored and the sampling duration. On each +overflow in the CPMC*SEL, hardware snapshots the program counter along +with event counts and writes into memory pointed by LDBAR. + +LDBAR is a 64 bit special purpose per thread register, it has bits to +indicate whether hardware is configured for accumulation or trace mode. + +* LDBAR Register Layout: + 0 : Enable/Disable + 1 : 0 -> Accumulation Mode + 1 -> Trace Mode + 2:3 : Reserved + 4-6 : PB scope + 7 : Reserved + 8:50 : Counter Address + 51:63 : Reserved + +* TRACE_IMC_SCOM bit representation: + + 0:1 : SAMPSEL + 2:33 : CPMC_LOAD + 34:40 : CPMC1SEL + 41:47 : CPMC2SEL + 48:50 : BUFFERSIZE + 51:63 : RESERVED + +CPMC_LOAD contains the sampling duration. SAMPSEL and CPMC*SEL determines +the event to count. BUFFRSIZE indicates the memory range. On each overflow, +hardware snapshots program counter along with event counts and update the +memory and reloads the CMPC_LOAD value for the next sampling duration. +IMC hardware does not support exceptions, so it quietly wraps around if +memory buffer reaches the end. + +*Currently the event monitored for trace-mode is fixed as cycle. + +Trace IMC example usage +======================= + +# perf list + + [....] + trace_imc/trace_cycles/ [Kernel PMU event] + +To record an application/process with trace-imc event +# perf record -e trace_imc/trace_cycles/ yes > /dev/nul +[ perf record: Woken up 1 times to write data ] +[ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ] + +The perf.data generated, can be read using perf report. + +Benefits of using IMC trace-mode +================================ + +PMI interrupt handling is avoided, since IMC trace mode snapshots the +program counter and update to the memory. And this also provide a way for +the operating system to do instruction sampling in real time without +PMI(Performance Monitoring Interrupts) processing overhead. +Example:- + +Performance data using 'perf top' with and without trace-imc event: + +PMI interrupts count when `perf top` command is executed without trace-imc event. + +# cat /proc/interrupts (a snippet from the output) +9944 1072 804 804 1644 804 1306 +804 804 804 804 804 804 804 +804 804 1961 1602 804 804 1258 +[-----------------------------------------------------------------] +803 803 803 803 803 803 803 +803 803 803 803 804 804 804 +804 804 804 804 804 804 803 +803 803 803 803 803 1306 803 +803 Performance monitoring interrupts + + +`perf top` with trace-imc (executed right after 'perf top' without trace-imc event): + +# perf top -e trace_imc/trace_cycles/ +12.50% [kernel] [k] arch_cpu_idle +11.81% [kernel] [k] __next_timer_interrupt +11.22% [kernel] [k] rcu_idle_enter +10.25% [kernel] [k] find_next_bit + 7.91% [kernel] [k] do_idle + 7.69% [kernel] [k] rcu_dynticks_eqs_exit + 5.20% [kernel] [k] tick_nohz_idle_stop_tick + [-----------------------] + +# cat /proc/interrupts (a snippet from the output) + +9944 1072 804 804 1644 804 1306 +804 804 804 804 804 804 804 +804 804 1961 1602 804 804 1258 +[-----------------------------------------------------------------] +803 803 803 803 803 803 803 +803 803 803 804 804 804 804 +804 804 804 804 804 804 803 +803 803 803 803 803 1306 803 +803 Performance monitoring interrupts + +The PMI interrupts count remains the same. + +---------------------------------------------------------- +Author(s) : Anju T Sudhakar <a...@linux.vnet.ibm.com> -- 2.17.2