Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
On Fri, Feb 12, 2021 at 5:01 AM Tamminen, Eero T wrote:

>
> Unlike some other Linux tracing solutions, Perfetto appears to be for
> Android / Chrome(OS?), and not available in common Linux distro
> repos.
>
> So, why Perfetto instead of one of the other solutions, e.g. from ones
> mentioned here:
> https://tracingsummit.org/ts/2018/
> ?
>
>
Good question. Perfetto supports Linux, Android, and Chrome OS. I'm not sure
which Linux distros package it besides Android and Chrome OS. It provides a
comprehensive tracing solution, from data collection and tools to a
convenient web-based UI and analysis, as well as interoperation with
other trace data providers. Looking at the tracing summit presentations,
for example, there appear to be some good additional tracing data sources
that could potentially feed into the Perfetto trace daemon and UI. But none
of those particular projects provides a comprehensive solution like
Perfetto does. Lots more detail at perfetto.dev.


> And, if tracing API is added to Mesa, shouldn't it support also
> tracepoints for other tracing solutions?
>
> I mean, code added to drivers themselves preferably should not have
> anything perfetto/percetto specific.  Tracing system specific code
> should be only in one place (even if it's just macros in common header).


I agree it makes sense to keep the macro API implementation in a common
mesa header so that we have the option of changing out the backend. On the
other hand, it can get difficult to maintain more than one tracing backend,
especially when tracing usage goes beyond the simple TRACE_SCOPE(__func__)
macros, for example with GPU timeline tracks, counters, etc. I would not
expect mesa devs to test their tracing code on more than one tracing
backend, so other backends would be likely to regress. So ideally we
could pick one.
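
To make the "common header" idea concrete, here is a minimal sketch of a
backend-neutral TRACE_SCOPE macro. Everything in it is an assumption made for
illustration: the trace_backend struct, trace_set_backend() and the logging
backend are hypothetical names, not existing Mesa, Perfetto or Percetto API.
It relies on the GCC/Clang cleanup attribute to end the scope.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical backend hooks; a real build would pick one implementation. */
struct trace_backend {
   void (*begin)(const char *name);
   void (*end)(const char *name);
};

/* NULL means no backend is installed, so scopes do nothing. */
static const struct trace_backend *active_backend;

static inline void trace_set_backend(const struct trace_backend *b)
{
   active_backend = b;
}

static inline const char *trace_scope_begin(const char *name)
{
   assert(name != NULL);
   if (active_backend)
      active_backend->begin(name);
   return name;
}

/* Called automatically when the scope variable goes out of scope. */
static inline void trace_scope_end(const char **name)
{
   if (active_backend)
      active_backend->end(*name);
}

#define TRACE_CONCAT_(a, b) a##b
#define TRACE_CONCAT(a, b) TRACE_CONCAT_(a, b)
#define TRACE_SCOPE(name)                                      \
   const char *TRACE_CONCAT(trace_scope_, __LINE__)            \
      __attribute__((cleanup(trace_scope_end))) =              \
      trace_scope_begin(name)

/* Example backend: indented begin/end lines on stderr. */
static int log_depth;
static int log_events;

static void log_begin(const char *name)
{
   log_events++;
   fprintf(stderr, "%*sB %s\n", 2 * log_depth++, "", name);
}

static void log_end(const char *name)
{
   fprintf(stderr, "%*sE %s\n", 2 * --log_depth, "", name);
}

static const struct trace_backend log_backend = { log_begin, log_end };

/* Stands in for a potentially slow driver function. */
static int compile_shader(void)
{
   TRACE_SCOPE(__func__);
   return 42;
}
```

Driver code only ever sees TRACE_SCOPE; swapping the backend touches this one
header. The sketch covers only scoped CPU events, which is the easy case. GPU
timeline tracks and counters are where keeping several backends in sync gets
harder.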


>
>
> > This helps developers
> > quickly answer questions like:
> >
> >- How long are frames taking?
>
> That doesn't require any changes to Mesa.  Just set uprobe for suitable
> buffer swap function [1], and parse kernel ftrace events.  This way
> starting tracing doesn't require even restarting the tracked processes.
>
>
> [1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT,
> anv_QueuePresentKHR[2]..
>
> [2] Many apps resolve "vkQueuePresentKHR" Vulkan API loader wrapper
> function and call the backend function like "anv_QueuePresentKHR"
> directly, so it's  better to track latter instead.
>
>
> >- What caused a particular frame drop?
> >- Is it CPU bound or GPU bound?
>
> That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
> utilization (which is lower level thing).
>
>
> >- Did a CPU core frequency drop cause something to go slower than
> > usual?
>
> Note that nowadays actual CPU frequencies are often controlled by HW /
> firmware, so you don't necessarily get any ftrace event from freq
> change, you would need to poll MSR registers instead (which is
> privileged operation, and polling can easily miss changes).
>
>
> >- Is something else running that is stealing CPU or GPU time? Could
> > I
> >fix that with better thread/context priorities?
> >- Are all CPU cores being used effectively? Do I need
> > sched_setaffinity
> >to keep my thread on a big or little core?
>
> I don't think these require adding tracepoints to Mesa either...
>
>
> >- What’s the latency between CPU frame submit and GPU start?
>
> I think this would require tracepoints in kernel GPU code more than in
> Mesa?
>
>
> - Eero
>
>
> > *What Does Mesa + Perfetto Provide?*
> >
> > Mesa is in a unique position to produce GPU trace data for several GPU
> > vendors without requiring the developer to build and install
> > additional
> > tools like gfx-pps <https://gitlab.freedesktop.org/Fahien/gfx-pps>.
> >
> > The key is making it easy for developers to use. Ideally, perfetto is
> > eventually available by default in mesa so that if your system has
> > perfetto
> > traced running, you just need to run perfetto (perhaps along with
> > setting
> > an environment variable) with the mesa categories to see:
> >
> >- GPU processing timeline events.
> >- GPU counters.
> >- CPU events for potentially slow functions in mesa like shader
> > compiles.
> >
> > Example of what this data might look like (with fake GPU events):
> > [image: percetto-gpu-example.png]
> >
> > *Runtime Characteristics*
> >
> >- ~500KB additional binary size. Even with using only the basic
> > features
> >of perfetto, it will increase the binary size of mesa by about
> > 500KB.
> >- Background thread. Perfetto uses a background thread for
> > communication
> >with the system tracing daemon (traced) to advertise trace data and
> > get
> >notification of trace start/stop.
> >- Runtime overhead when disabled is designed to be optimal with one
> >predicted branch, typically a few CPU cycles
> >
> > 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:

> I've recently been using GPUVis to look at trace events.  On Intel
> platforms, GPUVis incorporates ftrace events from the i915 driver,
> performance metrics from igt-gpu-tools, and userspace ftrace markers
> that I locally hack up in Mesa.
>

GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead than ftrace, so, for
example, it is acceptable to have production trace instrumentation that can
be captured without dev builds. Doing that with ftrace may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace, for
example: it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.
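
As a rough illustration of gating the ftrace file writes, the sketch below
skips the write() syscall entirely unless a session has handed the process a
file descriptor. The function names are hypothetical, and how the process
learns that a session started (Binder on Android, something else elsewhere)
is deliberately left out. In real use the descriptor would come from opening
/sys/kernel/tracing/trace_marker for writing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

/* Negative means "no trace session": marker writes are skipped. */
static atomic_int marker_fd = -1;

/* Called by whatever mechanism announces trace start (fd >= 0) or stop (-1). */
static void marker_set_fd(int fd)
{
   atomic_store_explicit(&marker_fd, fd, memory_order_release);
}

/* Returns bytes written, 0 when tracing is off, or -1 on write error. */
static ssize_t marker_write(const char *msg)
{
   assert(msg != NULL);
   int fd = atomic_load_explicit(&marker_fd, memory_order_acquire);
   if (fd < 0)
      return 0; /* disabled: one load and one branch, no syscall */
   return write(fd, msg, strlen(msg));
}
```

With tracing off the cost is a single atomic load, which is the property that
makes always-on instrumentation in production builds tolerable.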


>
> It is very easy to compile the GPUVis UI.  Userspace instrumentation
> requires a single C/C++ header.  You don't have to access an external
> web service to analyze trace data (a big no-no for devs working on
> preproduction hardware).
>
> Is it possible to build and run the Perfetto UI locally?


Yes, local UI builds are possible
<https://github.com/google/perfetto/blob/5ff758df67da94d17734c2e70eb6738c4902953e/ui/README.md>.
Also confirmed with the perfetto team <https://discord.gg/35ShE3A> that
trace data is not uploaded unless you use the 'share' feature.


>   Can it display
> arbitrary trace events that are written to
> /sys/kernel/tracing/trace_marker ?


Yes, I believe it supports that via the linux.ftrace data source
<https://perfetto.dev/docs/quickstart/linux-tracing>. We use that, for
example, to overlay CPU sched data to show what process is on each core
throughout the timeline. There are many ftrace event types
<https://github.com/google/perfetto/tree/5ff758df67da94d17734c2e70eb6738c4902953e/protos/perfetto/trace/ftrace>
in the perfetto protos.


> Can it be extended to show i915 and
> i915-perf-recorder events?
>

It can be extended to consume custom data sources. One way this is done is
via a bridge daemon such as traced_probes, which is responsible for
capturing data from ftrace and /proc during a trace session and sending it
to traced. traced is the main perfetto tracing daemon: it notifies all
trace data sources to start/stop tracing and handles user tracing
requests made via the 'perfetto' command.



>
> John Bates  writes:
>
> > I recently opened issue 4262
> > <https://gitlab.freedesktop.org/mesa/mesa/-/issues/4262> to begin the
> > discussion on integrating perfetto into mesa.
> >
> > *Background*
> >
> > System-wide tracing is an invaluable tool for developers to find and fix
> > performance problems. The perfetto project enables a combined view of
> trace
> > data from kernel ftrace, GPU driver and various manually-instrumented
> > tracepoints throughout the application and system. This helps developers
> > quickly answer questions like:
> >
> >- How long are frames taking?
> >- What caused a particular frame drop?
> >- Is it CPU bound or GPU bound?
> >- Did a CPU core frequency drop cause something to go slower than
> usual?
> >- Is something else running that is stealing CPU or GPU time? Could I
> >fix that with better thread/context priorities?
> >- Are all CPU cores being used effectively? Do I need
> sched_setaffinity
> >to keep my thread on a big or little core?
> >- What’s the latency between CPU frame submit and GPU start?
> >
> > *What Does Mesa + Perfetto Provide?*
> >
> > Mesa is in a unique position to produce GPU trace data for several GPU
> > vendors without requiring the developer to build and install additional
> > tools like gfx-pps <https://gitlab.freedesktop.org/Fahien/gfx-pps>.
> >
> > The key is making it easy for developers to use. Ideally, perfetto is
> > eventually available by default in mesa so that if your system has
> perfetto
> > traced running, you just need to run perfetto (perhaps along with setting
> > an environment variable) with the mesa categories to see:
> >
> >- GPU processing timeline events.
> >- GPU counters.
> >- CPU events for potentially slow functions in mesa like shader
> compiles.
> >
> > Example of what this data might look like (with fake GPU events):
> > [image: percetto-gpu-example.png]
> >
> > *Runtime Characteristics*
> >
> >- ~500KB additional binary size. Even with using only the basic
> features
> >of perfetto, it will increase the binary size of mesa

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
On Fri, Feb 12, 2021 at 4:34 PM Rob Clark  wrote:

> On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> >
>
> 
>
> > Runtime Characteristics
> >
> > ~500KB additional binary size. Even with using only the basic features
> of perfetto, it will increase the binary size of mesa by about 500KB.
>
> IMHO, that size is negligible.. looking at freedreno, a mesa build
> *only* enabling freedreno is already ~6MB.. distros typically use
> "megadriver" (ie. all the drivers linked into a single .so with hard
> links for the different  ${driver}_dri.so), which on my fedora laptop
> is ~21M.  Maybe if anything is relevant it is how much of that
> actually gets paged into RAM from disk, but I think 500K isn't a thing
> to worry about too much.
>
> > Background thread. Perfetto uses a background thread for communication
> with the system tracing daemon (traced) to advertise trace data and get
> notification of trace start/stop.
>
> Mesa already tends to have plenty of threads.. some of that depends on
> the driver, I think currently radeonsi is the threading king, but
> there are several other drivers working on threaded_context and async
> compile thread pool.
>
> It is worth mentioning that, AFAIU, perfetto can operate in
> self-server mode, which seems like it would be useful for distros
> which do not have the system daemon.  I'm not sure if we lose that
> with percetto?
>

Easy to add, but want to avoid a runtime arg because it would add ~300KB to
binary size. Okay if we have an alternate init function though.


>
> > Runtime overhead when disabled is designed to be optimal with one
> predicted branch, typically a few CPU cycles per event. While enabled, the
> overhead can be around 1 us per event.
> >
> > Integration Challenges
> >
> > The perfetto SDK is C++ and designed around macros, lambdas, inline
> templates, etc. There are ongoing discussions on providing an official
> perfetto C API, but it is not yet clear when this will land on the perfetto
> roadmap.
> > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines
> of code.
> > Anything that includes perfetto.h takes a long time to compile.
> > The current Perfetto SDK design is incompatible with being a shared
> library behind a C API.
>
> So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> But maybe we should verify that MSVC is happy with it, otherwise we
> need to take a bit more care in some parts of the codebase.
>
> As far as compile time, I wonder if we can regenerate the .cc/.h with
> only the gpu trace parts?  But I wouldn't expect the .h to be
> something widely included.  For example, for gpu timeline traces in
> freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> extern "C" {} around the callbacks that would hook into the
> u_tracepoint tracepoints.  That one file would pull in the perfetto
> .h, and we'd just not build that file if perfetto was disabled.
>

That works for GPU, but I'd like to see some slow CPU functions in traces
as well to help reason about performance problems. This ends up peppering
the trace header in lots of places.
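
For the GPU-only case described in the quoted text, the C-facing side might
look roughly like the sketch below. The HAVE_PERFETTO define and the
drv_perfetto_* names are hypothetical. The point is only that C driver code
sees plain C declarations, which collapse to empty inline stubs when the C++
file is not built.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

#ifdef HAVE_PERFETTO /* hypothetical build define */
/* Implemented in the single C++ file that includes the perfetto header. */
void drv_perfetto_init(void);
void drv_perfetto_submit(uint32_t submit_id, uint64_t ts_ns);
#else
/* Tracing not built: the calls compile away. */
static inline void drv_perfetto_init(void) {}
static inline void drv_perfetto_submit(uint32_t submit_id, uint64_t ts_ns)
{
   (void)submit_id;
   (void)ts_ns;
}
#endif

#ifdef __cplusplus
}
#endif

/* Plain C driver code; it never sees any C++ or perfetto types. */
static uint32_t flush_batch(uint32_t submit_id, uint64_t now_ns)
{
   assert(submit_id != 0);
   drv_perfetto_submit(submit_id, now_ns);
   return submit_id + 1;
}
```

This stays cheap while only a handful of call sites exist. Once CPU scopes
are sprinkled through many files, each call crossing a function boundary into
the C++ file is the overhead that a macro-based header tries to avoid.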

> Overall having to add our own extern C wrappers in some places doesn't
> seem like the *end* of the world.. a bit annoying, but we might end up
> doing that regardless if other folks want the ability to hook in
> something other than perfetto?
>

It's more than extern C wrappers if we want to minimize overhead while
tracing is enabled at compile time. Have a look at percetto.h
<https://github.com/olvaffe/percetto/blob/main/src/percetto.h>/cc
<https://github.com/olvaffe/percetto/blob/main/src/percetto.cc>.


>
> 
>
> > Mesa Integration Alternatives
>
> I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> that is mostly because I expect to initially just add some basic gpu
> timeline tracepoints, but over time iterate on adding more.. it would
> be nice to not have to depend on a newer version of an external
> library at each step.  That is ofc only my $0.02..
>

It's a small initial setup tax, true, but I still think it depends on what
perfetto features we plan to use -- for only a couple files doing GPU
tracing I agree percetto is unnecessary, but for CPU tracing it gets more
complicated.


>
> BR,
> -R
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-15 Thread John Bates
I can vouch for the usefulness of the combined trace timeline showing CPU
core usage, filtered application events and GPU usage. Android systrace
shows this data -- I've used it from both an app developer perspective to
fix countless performance bugs and from a whole-system perspective to tune
issues such as motion-to-photon (motopho) latency for VR. The latter is only possible when the
CPU timeline is combined with vendor-specific GPU data showing binning,
resolves/unresolves and context preemptions.

With virtualization, we have a new level of complexity and corresponding
performance bugs to track down. One example is unexpected shader compiles,
but there are other slowpaths in mesa that are important to be able to see
without difficulty. There is work being done to support perfetto trace data
from both host and guest VM -- mesa is in both.

Perfetto/systrace makes this performance analysis work easier in many cases
-- run an app, record trace, reproduce a glitch, and then view the trace to
find out what happened.

On Mon, Feb 15, 2021 at 9:27 AM Rob Clark  wrote:

> On Mon, Feb 15, 2021 at 3:13 AM Tamminen, Eero T
>  wrote:
> >
> > Hi,
> >
> > On Fri, 2021-02-12 at 18:20 -0800, Rob Clark wrote:
> > > On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
> > >  wrote:
> > ...
> > > > In our implementation that precision (in particular when a drawcall
> > > > ends) comes at a stalling cost unfortunately.
> > >
> > > yeah, stalling on our end too for per-draw counter snapshots.. but if
> > > you are looking for which shaders to optimize that doesn't matter
> > > *that* much.. there'll be some overhead, but it's not really going to
> > > change which draws/shaders are expensive.. it just means that you lose out
> > > on pipelining of the state changes
> >
> > I don't think it makes sense to try doing this all in one step.
> >
> > Unless one has resources of Google + commitment for maintaining it, I
> > think doing those steps with separate, dedicated tools can be better fit
> > for Open Source than trying to maintain a monster that tries to do
> > everything of analyzing:
> > - whether performance issue is on GPU side, CPU side, or code being too
> > synchronous
> > - where the bottlenecks are on GPU side
> > - where the bottlenecks are on CPU side
> > - what are the sync points
>
> I mean, google has a team working on perfetto, so we kinda are getting
> the tool here for free, all we need to do here is instrumentation for
> the mesa part of the system..
>
> Currently, if you look at
> https://chromeos.dev/en/games/optimizing-games-profiling the
> recommendation basically amounts to "optimize on android with
> snapdragon profiler/etc".. which is really not a great look for mesa.
> (And doesn't do anything for intel at all.)  Mesa is a great project,
> but profiling tooling, especially something for people other than mesa
> developers, is a glaring weakness.  Perfetto looks like a great
> opportunity to fix that, not only for ourselves but also game
> developers and others.
>
> BR,
> -R
>
> > IMHO:
> > - Overall picture should not have too many details, because otherwise
> > one can start chasing irrelevancies [1]
> > - Rest of analysis works better when one concentrates on one performance
> > aspect (shown by the overall picture) at a time.  So that activity
> > could have a tool dedicated for that purpose
> >
> >
> > - Eero
> >
> > [1] Unless one has HW assisted tool that really can tell *everything*
> > like ARM ETM and Intel PT with *really good* post-processing &
> > visualization tooling.  I don't think these are usable outside of large
> > companies though because of HW requirements and using them taking a lot
> > of time / expertise (1 sec trace is gigs of data).
> >
> > PS. For checking on shader compiles, I've used two steps:
> > * script to trace frame updates & shader compiles (with ftrace uprobe on
> > appropriate function entry points) + monitor CPU usage & GPU usage (for
> > GPU, freq or power usage is enough)
> >   -> shows whether FPS & GPU utilization dip with compiles.  Frame
> > updates & compiles are rare enough that ftrace overhead doesn't matter
> >
> > * enable Mesa shader debugging, because in next step one wants to know
> > what shaders they are and how they're compiled
> >


[Mesa-dev] Perfetto CPU/GPU tracing

2021-02-11 Thread John Bates
I recently opened issue 4262
<https://gitlab.freedesktop.org/mesa/mesa/-/issues/4262> to begin the
discussion on integrating perfetto into mesa.

*Background*

System-wide tracing is an invaluable tool for developers to find and fix
performance problems. The perfetto project enables a combined view of trace
data from kernel ftrace, GPU driver and various manually-instrumented
tracepoints throughout the application and system. This helps developers
quickly answer questions like:

   - How long are frames taking?
   - What caused a particular frame drop?
   - Is it CPU bound or GPU bound?
   - Did a CPU core frequency drop cause something to go slower than usual?
   - Is something else running that is stealing CPU or GPU time? Could I
   fix that with better thread/context priorities?
   - Are all CPU cores being used effectively? Do I need sched_setaffinity
   to keep my thread on a big or little core?
   - What’s the latency between CPU frame submit and GPU start?

*What Does Mesa + Perfetto Provide?*

Mesa is in a unique position to produce GPU trace data for several GPU
vendors without requiring the developer to build and install additional
tools like gfx-pps <https://gitlab.freedesktop.org/Fahien/gfx-pps>.

The key is making it easy for developers to use. Ideally, perfetto is
eventually available by default in mesa so that if your system has perfetto
traced running, you just need to run perfetto (perhaps along with setting
an environment variable) with the mesa categories to see:

   - GPU processing timeline events.
   - GPU counters.
   - CPU events for potentially slow functions in mesa like shader compiles.

Example of what this data might look like (with fake GPU events):
[image: percetto-gpu-example.png]

*Runtime Characteristics*

   - ~500KB additional binary size. Even with using only the basic features
   of perfetto, it will increase the binary size of mesa by about 500KB.
   - Background thread. Perfetto uses a background thread for communication
   with the system tracing daemon (traced) to advertise trace data and get
   notification of trace start/stop.
   - Runtime overhead when disabled is designed to be optimal with one
   predicted branch, typically a few CPU cycles per event. While enabled,
   the overhead can be around 1 us per event.

*Integration Challenges*

   - The perfetto SDK is C++ and designed around macros, lambdas, inline
   templates, etc. There are ongoing discussions on providing an official
   perfetto C API, but it is not yet clear when this will land on the perfetto
   roadmap.
   - The perfetto SDK is an amalgamated .h and .cc that adds up to 100K
   lines of code.
   - Anything that includes perfetto.h takes a long time to compile.
   - The current Perfetto SDK design is incompatible with being a shared
   library behind a C API.

*Percetto*

The percetto library <https://github.com/olvaffe/percetto> was recently
implemented to provide an interim C API for perfetto. It provides efficient
support for scoped trace events, multiple categories, counters, custom
timestamps, and debug data annotations. Percetto also provides some
features that are important to mesa, but not available yet with perfetto
SDK:

   - Trace events from multiple perfetto instances in separate shared
   libraries (like mesa and virglrenderer) show correctly in a single process
   and thread view.
   - Counter tracks and macro API.

Percetto is missing an API for perfetto's GPU DataSource and counter support,
but that feature could be implemented next if it is important for mesa.
With the existing percetto API, mesa could present GPU trace data as named
'slice' events and int64_t counters with custom timestamps, as shown in the
image above (based on this sample).

*Mesa Integration Alternatives*

Note: we have some pressing needs for performance analysis in Chrome OS, so
I'm intentionally leaving out the alternative of waiting for an official
perfetto C API. Of course, once that C API is available it would become an
option to migrate to it from any of the alternatives below.

Ordered by difficulty with easiest first:

   1. Statically link with percetto as an optional external dependency
   (virglrenderer now has this approach).
      - Pros: API already supports most common tracing needs. Tested and used
      by an increasing number of CrOS components.
      - Cons: External dependency for optional mesa build option.
   2. Embed Perfetto SDK + a Percetto fork/copy.
      - Pros: API already supports most common tracing needs. No added
      external dependency for mesa.
      - Cons: Percetto code divergence, bug fixes need to land in two trees.
   3. Embed Perfetto SDK + custom C wrapper.
      - Pros: Tailored API for mesa's needs.
      - Cons: Nontrivial development efforts and