I'm submitting the following closed approved automatic fasttrack on behalf
of Jon Haslam and the DTrace community. It has been approved by the community
after discussion on dtrace-discuss at opensolaris.org. The stability is
Committed and the binding is Patch.

Adam

---8<---

A. INTRODUCTION

This case adds the 'cpc' provider, which enables consumers to access the
performance counters of a CPU. This allows users to easily connect CPU
events (e.g. TLB misses, L2 cache misses) to their cause on a
system-wide basis.

The Solaris CPU Performance Counter (CPC) subsystem (PSARC 2002/180) gives
general purpose access to the hardware performance counters of a
microprocessor. The cpc provider leverages the infrastructure provided by
the CPC subsystem to access the CPU performance counter resources of a system.
The provider utilises the hardware overflow interrupt mechanism to allow
profiling based upon CPU performance counter events (in the same way that
the profile provider allows us to profile by time).


B. DESCRIPTION

1. Probe Format

The format of probes made available by the cpc provider:

        cpc:::event_name-mode-{optional mask}-count

where:

event_name:     The event name of interest. A full list of events available
                on each platform is given in the output of `cpustat -h`.

mode:           The operating mode of the processor in which the event is
                counted. Valid settings are "user" (user mode), "kernel"
                (kernel mode) and "all" (user and kernel mode).

optional mask:  Some platform-specific events can be qualified further with
                a mask (sometimes known as a 'umask' or an 'emask'), specified
                as a hex value. This field is optional and applies only to
                platform-specific events; it cannot be used with generic
                performance counter events (PSARC 2008/334).

count:          Specifies the number of events to be counted on a CPU for a
                probe to fire on that CPU.

As an example, the specification for a probe which fires every 10000 user mode
DTLB misses on an UltraSPARC IV processor would look like:

        cpc:::DTLB_miss-user-10000
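
A probe that uses the optional mask field places the hex mask between the mode
and the count. For example, the masked AMD event used in section C below yields
a probe that fires every 10000 occurrences of the BU_fill_req_missed_L2 event,
qualified with a mask of 0x7 and counted in both user and kernel mode:

        cpc:::BU_fill_req_missed_L2-all-0x7-10000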

The probes exported by the cpc provider are unanchored: they are not associated
with a particular point of execution but with an asynchronous performance
counter overflow interrupt. When a probe fires we can sample aspects of system
state and make inferences about system behaviour. The following example fires
every 10000 user mode L1 instruction cache misses and records the user-land
stack trace if the "foo" executable was executing when the probe fired (note
that "foo" may have generated anywhere between 1 and 10000 of those events).

cpc:::IC_miss-user-10000
/execname == "foo"/
{
        @[ustack()] = count();
} 


2. Probe arguments

All probes provide two arguments:

arg0            The program counter (PC) in the kernel at the time the probe
                fired, or 0 if the current process was not executing in the
                kernel at the time the probe fired.

arg1            The PC in the user-level process at the time the probe fired,
                or 0 if the current process was executing in the kernel at the
                time the probe fired.


3. Probe Availability

Probes are made available dynamically when requested by a user. The probes
available will differ according to the events exported by the CPC subsystem
on a platform. The names of available events can be discovered, as mentioned
in section 'B1 - Probe Format', using the output of `cpustat -h`. 

CPU performance counters are a finite resource and the number of probes
that can be enabled depends upon hardware capabilities. Processors
that cannot determine which counter has overflowed when multiple counters
are programmed (e.g. AMD, UltraSPARC) are only allowed to have a single
enabling at any one time. On such platforms, consumers attempting to enable
more than one probe will fail, as will consumers attempting to enable a probe
when a disparate enabling already exists. Processors that can detect which
counter has overflowed (e.g. Niagara2, Intel P4) may have multiple probes
enabled at any one time, up to the number of counters available on the
processor.

Probes are enabled by consumers on a first-come, first-served basis. When
hardware resources are fully utilised, subsequent enablings will fail until
resources become available.
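
As an illustrative sketch only (whether such a request succeeds depends
entirely on the underlying counter hardware), a script that asks for two
simultaneous enablings, here using the PAPI_l2_dcm generic event from
section C plus an assumed PAPI_tot_ins generic event, will fail to start on
single-enabling platforms such as AMD or UltraSPARC but may run on hardware
such as Niagara2 or Intel P4:

cpc:::PAPI_l2_dcm-all-10000
{
        @dcm[execname] = count();
}

cpc:::PAPI_tot_ins-all-10000
{
        @ins[execname] = count();
}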

4. Co-existence with existing tools

The provider has priority over per-LWP libcpc usage (i.e. cputrack)
for access to counters. In the same manner as cpustat, enabling probes
causes all existing per-LWP counter contexts to be invalidated.  As long as
these enablings remain active, the counters will remain unavailable to
cputrack-type consumers.

Only one of cpustat and DTrace may use the counter hardware at any one time.
Ownership of the counters is given on a first-come, first-served basis.

5. Limiting Overflow Rate

So as not to saturate the system with overflow interrupts, a default minimum
of 5000 is imposed on the value that can be specified for the 'count'
part of the probe name (refer to section 'B1 - Probe Format'). This can be
reduced explicitly by altering the 'dcpc_min_overflow' kernel variable with
mdb(1), or by modifying the dcpc.conf driver configuration file and unloading
and reloading the dcpc driver module.
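
For example, a hedged sketch of the mdb(1) approach (the value 1000 is purely
illustrative, and this assumes 'dcpc_min_overflow' is a 32-bit integer):

# echo "dcpc_min_overflow/W 0t1000" | mdb -kw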

C. EXAMPLES


1. Instructions executed by applications on an AMD platform:

cpc:::FR_retired_x86_instr_w_excp_intr-user-10000
{
        @[execname] = count();
}
# ./user-insts.d
dtrace: script './user-insts.d' matched 2 probes
^C
[chop]
  init                                                            138
  dtrace                                                          175
  nis_cachemgr                                                    179
  automountd                                                      183
  intrd                                                           235
  run-mozilla.sh                                                  306
  thunderbird                                                     316
  Xorg                                                            453
  thunderbird-bin                                                2370
  sshd                                                           8114


2. A kernel profiled by cycle usage on an AMD platform.

cpc:::BU_cpu_clk_unhalted-kernel-10000
{
        @[func(arg0)] = count();
}
 
# ./kerncycprof.d                                
dtrace: script './kerncycprof.d' matched 1 probe
^C

[chop]
  genunix`vpm_sync_pages                                       478948
  genunix`vpm_unmap_pages                                      496626
  genunix`vpm_map_pages                                        640785
  unix`mutex_delay_default                                     916703
  unix`hat_kpm_page2va                                         988880
  tmpfs`rdtmp                                                  991252
  unix`hat_page_setattr                                       1077717
  unix`page_try_reclaim_lock                                  1213379
  genunix`free_vpmap                                          1914810
  genunix`get_vpmap                                           2417896
  unix`page_lookup_create                                     3992197
  unix`mutex_enter                                            5595647
  unix`do_copy_fault_nta                                     27803554


3. L2 cache misses, by function, generated by any process running an
executable called 'brendan' on an AMD platform.

cpc:::BU_fill_req_missed_L2-all-0x7-10000
/execname == "brendan"/
{
        @[ufunc(arg1)] = count();
}

# ./brendan-l2miss.d
dtrace: script './brendan-l2miss.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
^C

  brendan`func_gamma                                               930
  brendan`func_beta                                               1578
  brendan`func_alpha                                              2945

4. The same as example (3) above, but using a generic event to specify L2
data cache misses:

cpc:::PAPI_l2_dcm-all-10000
/execname == "brendan"/
{
        @[ufunc(arg1)] = count();
}

# ./papi-l2miss.d  
dtrace: script './papi-l2miss.d' matched 1 probe
^C

  brendan`func_gamma                                              1681
  brendan`func_beta                                               2521
  brendan`func_alpha                                              5068


D. REFERENCES

http://bugs.opensolaris.org/view_bug.do?bug_id=6486156
PSARC/2002/180 CPU Performance Counters (CPC) Version 2
PSARC/2008/334 CPU Performance Counter Generic Event Names

E. DOCUMENTATION

A new chapter has been added to the Solaris Dynamic Tracing Guide for this
proposed provider:

http://wikis.sun.com/display/DTrace/Documentation       # DTrace Guide
http://wikis.sun.com/display/DTrace/cpc+Provider        # CPC Provider Chapter


F. STABILITY

The DTrace internal stability table is described below:

Element         Name stability  Data stability  Dependency class
Provider        Evolving        Evolving        Common
Module          Private         Private         Unknown
Function        Private         Private         Unknown
Name            Evolving        Evolving        CPU
Arguments       Evolving        Evolving        Common
