Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Alyssa Rosenzweig
> But on many other embedded OSes - at least Google ones like CrOS and
> Android - the security model is way stricter.  We could argue that is
> bad / undesirable / too draconian but that is not something that any of us
> has the power to change. At some point each platform decides where it
> wants to be in the spectrum of "easy to hack" and "secure for the
> user". The CrOS model is: you can hack as much as you want, but you need
> first to re-flash it in dev-mode.

Off-topic, but speaking as someone who grew up in the libre software
"purist" circles, I'm a big fan of the CrOS model here.
Draconian is when the user _can't_ put it in dev mode. If you can,
there's nothing wrong with sane defaults.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Primiano Tucci



On 18/02/2021 20:26, Tamminen, Eero T wrote:

Hi,

(This isn't really related to Mesa anymore, but maybe it's still of
interest.)

On Thu, 2021-02-18 at 16:40 +0100, Primiano Tucci wrote:


On 18/02/2021 14:35, Tamminen, Eero T wrote:

[...]

It doesn't require executable code to be writable from user-space,
library code can remain read-only because kernel can toggle relevant
page writable for uprobe breakpoint setup and back.


The problem is not who rewrites the .text pages (although, yes, I agree
that the kernel doing this is better than userspace doing it). The
problem is:

1. Losing the ability to verify the integrity of system executables. You
can no longer tell if some malware/rootkit altered them or uprobes did.
Effectively you lose the ability to verify the full chain of bootloader ->
system image -> file integrity.


Why would you lose it?

Integrity checks will succeed when there are no trace points enabled,
and trace points should be enabled only when you start tracing, so you
know what is causing integrity check failures (especially when they
start passing again once you disable tracepoints)...


If you do this (ignore integrity failures while tracing) the message out
there becomes: "if you write malware, the very first thing you should do is
enable tracing, so any system integrity check will be suppressed" :)


Things like uprobes (i.e. anything that can dynamically alter the
execution flow of system processes) are typically available only on
engineering setups, where you have control of the device / kernel /
security settings (Yama, selinux or any other security module), not on
production devices.
I understand that the situation for most (all?) Linux-based distros is
different, as you can just sudo. But on many other embedded OSes - at
least Google ones like CrOS and Android - the security model is way
stricter.
We could argue that is bad / undesirable / too draconian but that is not
something that any of us has the power to change. At some point each
platform decides where it wants to be in the spectrum of "easy to hack"
and "secure for the user". The CrOS model is: you can hack as much as you
want, but you need first to re-flash it in dev-mode.




2. In general, a mechanism that allows dynamic rewriting of code is a
wide attack surface, not welcome on production devices (for the same
reasons), so it's very unlikely to fly for non-dev images IMHO. Many system
processes contain too sensitive information like the cookie jar, oauth2
tokens etc.


Isn't there any kind of dev-mode which would be required to enable
things that are normally disallowed?


That requires following steps that are non-trivial for non-tech-savvy
users and, more importantly, wiping the device (CrOS calls this
"power-washing") [1].
We can't ask users to reflash their device just to give us a trace when
they are experiencing problems. Many of those problems can't be
reproduced by engineers because they depend on some peculiar state the
user is in. A recent example (not related to Mesa): some users were
experiencing an extremely unresponsive (Chrome) UI. After looking at
traces, engineers figured out that the root cause (and hence the repro)
was: "you need to have a (Chrome) tab whose title is long enough to
cause ellipsis and that also has an emoji in the left-most visible part".
The emoji causes invalidation of the cached font measurement (this is
the bug), which causes every UI draw to be awfully slow.
For problems like this (which are very frequent) we really need to ask
users to give us traces. And that needs to be really a one-click thing
for them or they will not be able to help us.


[1] 
https://www.chromium.org/chromium-os/chromiumos-design-docs/developer-mode


(like kernel modifying RO mapped user-space process memory pages)





[...]

Yes, if you need more context, or handle really frequent events,
static breakpoints are a better choice.


In case of more frequent events, on Linux one might consider using some
BPF program to process dynamic tracepoint data so that a much smaller
amount needs to be transferred to user-space.  But I'm not sure whether
support for attaching BPF to tracepoints is in the upstream Linux kernel
yet.


eBPF, which you can use in recent kernels with tracepoints, solves a
different problem. It solves e.g. (1) dynamic filtering or (2)
computing aggregations from hi-freq events. It doesn't solve problems
like "I want to see all scheduling events and all frame-related
userspace instrumentation points. But given that sched events are so
hi-traffic I want to put them in a separate buffer, so they don't
clobber all the rest". Turning scheduling events into a histogram
(something you can do with eBPF+tracepoints) doesn't really solve cases
where you want to follow the full scheduling block/wake chain while some
userspace events are taking unexpectedly long.


You could e.g. filter out all sched events except the ones for the process
you're interested in.  That should already provide a huge reduction in the
amount of data, for use-cases where scheduling of the rest of the processes
is of less interest.

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Tamminen, Eero T
Hi,

(This isn't really related to Mesa anymore, but maybe it's still of
interest.)

On Thu, 2021-02-18 at 16:40 +0100, Primiano Tucci wrote:

> On 18/02/2021 14:35, Tamminen, Eero T wrote:
[...]
> > It doesn't require executable code to be writable from user-space,
> > library code can remain read-only because kernel can toggle relevant
> > page writable for uprobe breakpoint setup and back.
> 
> The problem is not who rewrites the .text pages (although, yes, I agree 
> that the kernel doing this is better than userspace doing it). The 
> problem is:
> 
> 1. Losing the ability to verify the integrity of system executables. You
> can no longer tell if some malware/rootkit altered them or uprobes did.
> Effectively you lose the ability to verify the full chain of bootloader ->
> system image -> file integrity.

Why would you lose it?

Integrity checks will succeed when there are no trace points enabled,
and trace points should be enabled only when you start tracing, so you
know what is causing integrity check failures (especially when they
start passing again once you disable tracepoints)...


> 2. In general, a mechanism that allows dynamic rewriting of code is a
> wide attack surface, not welcome on production devices (for the same
> reasons), so it's very unlikely to fly for non-dev images IMHO. Many system
> processes contain too sensitive information like the cookie jar, oauth2
> tokens etc.

Isn't there any kind of dev-mode which would be required to enable
things that are normally disallowed?

(like kernel modifying RO mapped user-space process memory pages)


> 
[...]
> > Yes, if you need more context, or handle really frequent events,
> > static
> > breakpoints are a better choice.
> > 
> > 
> > In case of more frequent events, on Linux one might consider using
> > some
> > BPF program to process dynamic tracepoint data so that much smaller
> > amount needs to be transferred to user-space.  But I'm not sure
> > whether
> > support for attaching BPF to tracepoints is in upstream Linux kernel
> > yet.
> 
> eBPF, which you can use in recent kernels with tracepoints, solves 
> different problem. It solves e.g., (1) dynamic filtering or (2) 
> computing aggregations from hi-freq events. It doesn't solve problems 
> like "I want to see all scheduling events and all frame-related 
> userspace instrumentation points. But given that sched events are so 
> hi-traffic I want to put them in a separate buffer, so they don't 
> clobber all the rest". Turning scheduling events into a histogram 
> (something you can do with eBPF+tracepoints) doesn't really solve cases 
> where you want to follow the full scheduling block/wake chain while some
> userspace events taking unexpectedly long.

You could e.g. filter out all sched events except the ones for the process
you're interested in.  That should already provide a huge reduction in the
amount of data, for use-cases where scheduling of the rest of the processes
is of less interest.

However, I think high-frequency kernel tracing is a different use-case
from user-space tracing, and requires its own tooling [1] (and just a
few user-space trace points to provide context for the traced kernel
activity).


- Eero

[1] In a corporate setting I would expect this kind of latency
investigation to actually be HW-assisted, otherwise tracing itself
disturbs the system too much.  Ultimately it could use instruction
branch tracing to catch *everything*, as both ARM and x86 have HW
support for that.

(Instruction branch tracing doesn't include context, but that can be
injected separately into the data stream.  Because it catches everything,
one can infer some of the context from the trace itself too.  I don't
think there are any good Open Source post-processing / visualization tools
for such data though.)

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Rob Clark
On Thu, Feb 18, 2021 at 8:00 AM Rob Clark  wrote:
>
> On Thu, Feb 18, 2021 at 5:35 AM Tamminen, Eero T
>  wrote:
> >
> > Hi,
> >
> > On Thu, 2021-02-18 at 12:17 +0100, Primiano Tucci wrote:
> > [discussion about Perfetto itself]
> > ...
> > > eero.t.tamminen@
> > > from in common Linux distro repos.
> > >
> > > That's right. I am aware of the problem. The plan is to address it
> > > with
> > > bit.ly/perfetto-debian as a starter.
> >
> > Glad to hear something is planned for making things easier for distros!
> >
> >
> > > > eero.t.tamminen@
> > > > Just set uprobe for suitable buffer swap function [1], and parse
> > > kernel ftrace events. (paraphrasing for context: "why do we need
> > > instrumentation points? we can't we just use uprobes instead?")
> > >
> > > The problem with uprobes is:
> > > 1. It's linux specific. Perhaps not a big problem for Mesa here, but the
> > > reason why we didn't go there with Perfetto, at least until now, is that
> > > we need to support all major OSes (Linux, CrOS, Android, Windows,
> > > macOS).
> >
> > The main point here is that uprobe works already without any
> > modifications to any libraries (I have script that has been used for FPS
> > tracking of daily 3D stack builds for many years).
> >
> > And other OSes already offer similar functionality.  E.g. DTrace should
> > be available both for Mac & Windows.
> >
>
> So we are talking about a couple different tracing use-cases which
> perfetto provides.. *Maybe* uprobe can work for the instrument the
> code use case that you are talking about, just assuming for the sake
> of argument that the security folks buy into it, etc.. I'm not sure if
> there isn't a race condition if the kernel has to temporarily remap
> text pages r/w or other potential issues I've not thought of?
>
> But that is ignoring important things like gpu traces and perf
> counters.  I kind of think it is informative to look at some of the
> related proto definitions, because they give a sense of what
> information the visualization UI tools can make use of, for example:
>
> https://cs.android.com/android/platform/superproject/+/master:external/perfetto/protos/perfetto/trace/gpu/gpu_render_stage_event.proto
>
> For that, we would need to, from a background thread in the driver
> (well aux/util/u_trace) collect up the logged gpu timestamps after the
> fact and fill in the relevant details for the trace event.  We are
> going to anyways need the perfetto SDK (in the short term, until we
> can switch to C shared lib when it is avail) for that.
>

jfyi, I captured an example perfetto trace from an android phone,
since we don't have all this wired up in mesa.. but it should be
enough to give an idea of what is possible with the gpu counters and
render stage traces:

https://people.freedesktop.org/~robclark/example.perfetto

It looks like the GPU render stages don't show up in ui.perfetto.dev
(yet?), but you can also open it directly in AGI [1][2], which also
shows the render stages.  The GPU counters show up in both.

[1] https://gpuinspector.dev/
[2] https://github.com/google/agi

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Rob Clark
On Thu, Feb 18, 2021 at 7:40 AM Primiano Tucci  wrote:
>
>
>
> On 18/02/2021 14:35, Tamminen, Eero T wrote:
> > Hi,
> >
> > On Thu, 2021-02-18 at 12:17 +0100, Primiano Tucci wrote:
> > [discussion about Perfetto itself]
> > ...
> >> eero.t.tamminen@
> >> from in common Linux distro repos.
> >>



> >
> > At least both Ubuntu and Fedora default kernels have had uprobes built
> > in for *many* years.
> >
> > Up to date Fedora 33 kernel:
> > $ grep UPROBE /boot/config-5.10.15-200.fc33.x86_64
> > CONFIG_ARCH_SUPPORTS_UPROBES=y
> > CONFIG_UPROBES=y
> > CONFIG_UPROBE_EVENTS=y
> >
> > Same on up to date Ubuntu 20.04:
> > $ grep UPROBE /boot/config-5.4.0-65-generic
> > CONFIG_ARCH_SUPPORTS_UPROBES=y
> > CONFIG_UPROBES=y
> > CONFIG_UPROBE_EVENTS=y
> >
>
> Somebody more knowledgeable about CrOS should chime in, but from a
> codesearch, I don't think they are enabled on CrOS:
>
> https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.4/arch/x86/configs/chromiumos-jail-vm-x86_64_defconfig;l=254?q=CONFIG_UPROBES%20-%22%23if%22%20-%22obj-%22==chromiumos%2Fchromiumos%2Fcodesearch:src%2Fthird_party%2Fkernel%2Fv5.4%2F

These are *not* enabled in CrOS.. it is possible to build your own
kernel with them enabled (if your device is in dev mode, and rootfs
verification is disabled).. but that completely defeats the purpose of
having something where we can trace production builds and have tools
available for our users

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Rob Clark
On Thu, Feb 18, 2021 at 5:35 AM Tamminen, Eero T
 wrote:
>
> Hi,
>
> On Thu, 2021-02-18 at 12:17 +0100, Primiano Tucci wrote:
> [discussion about Perfetto itself]
> ...
> > eero.t.tamminen@
> > from in common Linux distro repos.
> >
> > That's right. I am aware of the problem. The plan is to address it
> > with
> > bit.ly/perfetto-debian as a starter.
>
> Glad to hear something is planned for making things easier for distros!
>
>
> > > eero.t.tamminen@
> > > Just set uprobe for suitable buffer swap function [1], and parse
> > kernel ftrace events. (paraphrasing for context: "why do we need
> > instrumentation points? we can't we just use uprobes instead?")
> >
> > The problem with uprobes is:
> > 1. It's linux specific. Perhaps not a big problem for Mesa here, but the
> > reason why we didn't go there with Perfetto, at least until now, is that
> > we need to support all major OSes (Linux, CrOS, Android, Windows,
> > macOS).
>
> The main point here is that uprobe works already without any
> modifications to any libraries (I have script that has been used for FPS
> tracking of daily 3D stack builds for many years).
>
> And other OSes already offer similar functionality.  E.g. DTrace should
> be available both for Mac & Windows.
>

So we are talking about a couple of different tracing use-cases which
perfetto provides.. *Maybe* uprobes can work for the "instrument the
code" use case that you are talking about, just assuming for the sake
of argument that the security folks buy into it, etc.. I'm not sure if
there isn't a race condition if the kernel has to temporarily remap
text pages r/w, or other potential issues I've not thought of?

But that is ignoring important things like gpu traces and perf
counters.  I kind of think it is informative to look at some of the
related proto definitions, because they give a sense of what
information the visualization UI tools can make use of, for example:

https://cs.android.com/android/platform/superproject/+/master:external/perfetto/protos/perfetto/trace/gpu/gpu_render_stage_event.proto

For that, we would need to collect up the logged gpu timestamps after
the fact, from a background thread in the driver (well, aux/util/u_trace),
and fill in the relevant details for the trace event.  We are going to
need the perfetto SDK for that anyway (in the short term, until we can
switch to a C shared lib when it is available).
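
Purely to illustrate that flow (this is not actual freedreno/u_trace code;
the "mesa.gpu" category, the track id and gpu_ts_to_trace_ns() are
assumptions made up for the sketch, a recent SDK is assumed for
DynamicString, and real code would also have to deal with GPU/CPU clock
synchronization), the perfetto C++ SDK does let you emit events after the
fact, with explicit timestamps, on a dedicated track:

// hypothetical sketch: replaying deferred GPU timestamps as track events
#include <perfetto.h>
#include <cstddef>
#include <cstdint>

// assumed helper: convert a raw GPU timestamp to the trace clock (ns)
uint64_t gpu_ts_to_trace_ns(uint64_t gpu_ts);

struct gpu_span {
  const char *name;      // e.g. "binning", "render", "resolve"
  uint64_t start_gpu_ts;
  uint64_t end_gpu_ts;
};

// assumes a "mesa.gpu" track-event category is defined/registered elsewhere
void emit_gpu_spans(const gpu_span *spans, size_t count) {
  // a fixed custom track, so GPU work gets its own row in the UI
  static perfetto::Track gpu_track(0x6d657361 /* arbitrary id */);
  for (size_t i = 0; i < count; i++) {
    TRACE_EVENT_BEGIN("mesa.gpu", perfetto::DynamicString{spans[i].name},
                      gpu_track, gpu_ts_to_trace_ns(spans[i].start_gpu_ts));
    TRACE_EVENT_END("mesa.gpu", gpu_track,
                    gpu_ts_to_trace_ns(spans[i].end_gpu_ts));
  }
}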

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Primiano Tucci



On 18/02/2021 14:35, Tamminen, Eero T wrote:

Hi,

On Thu, 2021-02-18 at 12:17 +0100, Primiano Tucci wrote:
[discussion about Perfetto itself]
...

eero.t.tamminen@
from in common Linux distro repos.

That's right. I am aware of the problem. The plan is to address it
with
bit.ly/perfetto-debian as a starter.


Glad to hear something is planned for making things easier for distros!



eero.t.tamminen@
Just set uprobe for suitable buffer swap function [1], and parse
kernel ftrace events. (paraphrasing for context: "why do we need
instrumentation points? we can't we just use uprobes instead?")

The problem with uprobes is:
1. It's linux specific. Perhaps not a big problem for Mesa here, but the
reason why we didn't go there with Perfetto, at least until now, is that
we need to support all major OSes (Linux, CrOS, Android, Windows,
macOS).


The main point here is that uprobe works already without any
modifications to any libraries (I have script that has been used for FPS
tracking of daily 3D stack builds for many years).



Marking the begin/end of a function with a timestamp is the easy part, and
you can do that with an arbitrarily large set of tracing/debugging tools. In
my experience things get more interesting when you want to start dumping
the state of entire subsystems in the trace, so you can reason about
them after the fact when looking at the trace timeline.


E.g. instrumentation points like this, which go beyond the notion of
dynamic breakpoints:


https://chromium.googlesource.com/chromium/src/+/4f331b42066d4e729c28facaa9bd9d4c33c6bbfd/components/viz/common/quads/compositor_render_pass.cc#116

which eventually allow you to get tracing features like:
https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/using-frameviewer





And other OSes already offer similar functionality.  E.g. DTrace should
be available both for Mac & Windows.


Btw. If really needed, you could even implement similar functionality in
user space on other OSes. We did something similar at Nokia using
ptrace before uprobes was a thing:
https://github.com/maemo-tools-old/functracer

(While possible, because of the need for some tricky architecture
specific assembly, and certain instructions needing some extra assembly
fixups when replaced by a breakpoint, it's unlikely to be feasible for
general tracing though.)



2. Even on Linux-based systems, it's really hard to have uprobes enabled
in production (I am not sure what the situation is for CrOS). In Google,
we care a lot about being able to trace from production devices without
reflashing them with dev images, because then we can just tell people
that are experiencing problems "can you just open chrome://tracing [...]".
That improves by orders of magnitude the actionable feedback we'd be able
to get from users / testers.


I would think having tracepoint code in the libraries themselves and
generally enabled for some specific tracing solution like Perfetto, is
*much* more unlikely.


IMO this is one of the key points that you (Mesa) folks need to discuss
here: whether (i) the trace points are directly tied to the Perfetto (or any
other tracing system) API (I buy the skepticism given the current state
of things) or (ii) you have some mesa-specific tracing abstraction layer
and you wire up Perfetto (or whatever else) in some "Mesa tracing
backend impl", so the dependency surface is minimized.
In my experience, (ii) tends to be a bit more appealing, but its
feasibility depends on "what" you want to trace, i.e. how much your
instrumentation points look like begin/end markers and counters (easy)
or full object state like the link above, in which case the risk is that
you'll end up writing a lot of boilerplate code and double-copies for state
objects to avoid the direct deps.


Perhaps the best way is to have snippets of code to see how that would
look.
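
To make option (ii) a bit more concrete, here is a rough, non-authoritative
sketch (this is not existing Mesa code; the mesa_trace_* names, the
HAVE_PERFETTO define and the "mesa" category are all invented for the
example, and it assumes a recent perfetto C++ SDK): a tiny Mesa-owned,
C-callable marker API, with the perfetto TrackEvent machinery hidden in one
backend file so the dependency surface stays small:

/* mesa_trace.h -- hypothetical Mesa-side abstraction, no perfetto types */
#ifdef __cplusplus
extern "C" {
#endif
void mesa_trace_init(void);
void mesa_trace_begin(const char *name);
void mesa_trace_end(void);
#ifdef __cplusplus
}
#endif

/* mesa_trace_perfetto.cc -- hypothetical backend, built only with HAVE_PERFETTO */
#ifdef HAVE_PERFETTO
#include <perfetto.h>

PERFETTO_DEFINE_CATEGORIES(
    perfetto::Category("mesa").SetDescription("Mesa driver instrumentation"));
PERFETTO_TRACK_EVENT_STATIC_STORAGE();

extern "C" void mesa_trace_init(void) {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kSystemBackend;  // connect to the system traced daemon
  perfetto::Tracing::Initialize(args);
  perfetto::TrackEvent::Register();
}

extern "C" void mesa_trace_begin(const char *name) {
  // DynamicString because marker names arrive from C callers at runtime
  TRACE_EVENT_BEGIN("mesa", perfetto::DynamicString{name});
}

extern "C" void mesa_trace_end(void) {
  TRACE_EVENT_END("mesa");
}

#else  /* !HAVE_PERFETTO: no-op stubs, callers never see a perfetto dependency */
extern "C" void mesa_trace_init(void) {}
extern "C" void mesa_trace_begin(const char *name) { (void)name; }
extern "C" void mesa_trace_end(void) {}
#endif

Plain begin/end markers and counters are easy to hide behind a boundary like
this; the open question is how much boilerplate it takes once full object
state needs to cross it.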




At least both Ubuntu and Fedora default kernels have had uprobes built
in for *many* years.

Up to date Fedora 33 kernel:
$ grep UPROBE /boot/config-5.10.15-200.fc33.x86_64
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y

Same on up to date Ubuntu 20.04:
$ grep UPROBE /boot/config-5.4.0-65-generic
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y



Somebody more knowledgeable about CrOS should chime in, but from a 
codesearch, I don't think they are enabled on CrOS:


https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.4/arch/x86/configs/chromiumos-jail-vm-x86_64_defconfig;l=254?q=CONFIG_UPROBES%20-%22%23if%22%20-%22obj-%22==chromiumos%2Fchromiumos%2Fcodesearch:src%2Fthird_party%2Fkernel%2Fv5.4%2F




The challenge of uprobes is that it relies on dynamic rewriting of
.text pages. Whenever I mention that, a platform security team reacts
like the Frau Blucher horses (https://youtu.be/bps5hJ5DQDw?t=10), with
understandable reasons.


I'm not sure you've given them an accurate picture of it.

It doesn't require executable code to be writable from user-space;
library code can remain read-only because the kernel can toggle the
relevant page writable for uprobe breakpoint setup and back.

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Tamminen, Eero T
Hi,

On Thu, 2021-02-18 at 12:17 +0100, Primiano Tucci wrote:
[discussion about Perfetto itself]
...
> eero.t.tamminen@
> from in common Linux distro repos.
> 
> That's right. I am aware of the problem. The plan is to address it
> with 
> bit.ly/perfetto-debian as a starter.

Glad to hear something is planned for making things easier for distros!


> > eero.t.tamminen@
> > Just set uprobe for suitable buffer swap function [1], and parse 
> kernel ftrace events. (paraphrasing for context: "why do we need 
> instrumentation points? we can't we just use uprobes instead?")
> 
> The problem with uprobes is:
> 1. It's linux specific. Perhaps not a big problem for Mesa here, but the
> reason why we didn't go there with Perfetto, at least until now, is that
> we need to support all major OSes (Linux, CrOS, Android, Windows,
> macOS).

The main point here is that uprobes already work without any
modifications to any libraries (I have a script that has been used for FPS
tracking of daily 3D stack builds for many years).

And other OSes already offer similar functionality.  E.g. DTrace should
be available both for Mac & Windows.


Btw. If really needed, you could even implement similar functionality in
user space on other OSes. We did something similar at Nokia using
ptrace before uprobes was a thing:
https://github.com/maemo-tools-old/functracer

(While possible, because of the need for some tricky architecture
specific assembly, and certain instructions needing some extra assembly
fixups when replaced by a breakpoint, it's unlikely to be feasible for
general tracing though.)


> 2. Even on Linux-based systems, it's really hard to have uprobes enabled
> in production (I am not sure what the situation is for CrOS). In Google,
> we care a lot about being able to trace from production devices without
> reflashing them with dev images, because then we can just tell people
> that are experiencing problems "can you just open chrome://tracing [...]".
> That improves by orders of magnitude the actionable feedback we'd be able
> to get from users / testers.

I would think that having tracepoint code in the libraries themselves,
generally enabled for some specific tracing solution like Perfetto, is
*much* more unlikely.

At least both Ubuntu and Fedora default kernels have had uprobes built
in for *many* years.

Up to date Fedora 33 kernel:
$ grep UPROBE /boot/config-5.10.15-200.fc33.x86_64 
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y

Same on up to date Ubuntu 20.04:
$ grep UPROBE /boot/config-5.4.0-65-generic 
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y


> The challenge of uprobes is that it relies on dynamic rewriting of 
> .text pages. Whenever I mention that, a platform security team reacts 
> like the Frau Blucher horses (https://youtu.be/bps5hJ5DQDw?t=10), with
> understandable reasons.

I'm not sure you've given them an accurate picture of it.

It doesn't require executable code to be writable from user-space;
library code can remain read-only because the kernel can toggle the
relevant page writable for uprobe breakpoint setup and back.

# cat /sys/kernel/tracing/uprobe_events 
p:uprobes/glXSwapBuffers /opt/lib/libGL.so.1.2.0:0x0003bab0

# grep -h /opt/lib/libGL.so.1.2.0 /proc/*/maps | sort
7f486ab51000-7f486ab6a000 r--p  08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f486ab6a000-7f486abaf000 r-xp 00019000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f486abaf000-7f486abc6000 r--p 0005e000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f486abc6000-7f486abc9000 r--p 00074000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f486abc9000-7f486abca000 rw-p 00077000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f491438d000-7f49143a6000 r--p  08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f49143a6000-7f49143eb000 r-xp 00019000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f49143eb000-7f4914402000 r--p 0005e000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f4914402000-7f4914405000 r--p 00074000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f4914405000-7f4914406000 rw-p 00077000 08:03 7865435  
/opt/lib/libGL.so.1.2.0
7f6296d62000-7f6296d7b000 r--p  08:03 7865435  
/opt/lib/libGL.so.1.2.0
...
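# (note: every executable mapping of the library above is still r-xp, i.e. read-only)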


>  > mark.a.janes@
> events.  Why not follow the example of GPUVis, and write generic 
> trace_markers to ftrace?
> 
> In my experience ftrace's trace_marker:
> 1. Works for very simple types of events (e.g. 
> begin-slice/end-slice/counters) but doesn't work with richer / structured 
> event types like the ones linked above, as that gets into 
> stringification formats, marshaling costs and interop.
> 2. Writing into the marker has some non-trivial cost (IIRC 1-10 us on 
> Android); it involves a trip into and back from the kernel;
> 3. It leads to one global ring buffer, where fast events push out slower
> ones, which is particularly 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-18 Thread Primiano Tucci



Hey folks,
I'm one of the authors and maintainers of Perfetto, also +skyostil@.
I am really sorry for the giant bulk reply. I'll try to do my best to
answer the various open questions about Perfetto, but I don't know a
better way than some heavy quoting given I'm joining the party late.


In short:

- Yep, so far the only distribution we have for the SDK is a C++
amalgamation. I am aware that it isn't great for Linux OSS projects; it
was heavily optimized for Google projects, which have the habit of
statically linking everything.


- There are plans to move beyond that and have a stable C API (docs 
linked below). But that will take us quite some time. We should probably 
figure out some intermediate solution meanwhile.


- I'd be really keen to learn how Mesa is intending to do tracing. That
can strongly influence our upcoming design. Begin/end markers are IMO the
least interesting thing, as they tend to work in whatever form and are
easy to abstract. Richer/structured trace points like
https://github.com/google/perfetto/tree/master/protos/perfetto/trace/gpu
(currently used by Android GPU drivers [1]) are more interesting and
where most of the challenges lie (a rough sketch of the difference
follows after this list).


- Maybe the discussion here needs to be split into: (1) a shorter-term
plan to iterate, figure out what works, what doesn't, and see what the end
result looks like; (2) a longer-term plan for how the API surface
should look.
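
For what it's worth, a generic, hedged sketch of that difference (the "mesa"
category and the field names are invented, and this uses plain debug
annotations rather than the GPU-specific protos linked above, which need
their own descriptors):

// hedged sketch: a structured track event vs. a bare begin/end marker
#include <perfetto.h>

// assumes a "mesa" track-event category is defined/registered elsewhere
void draw_batch(int num_draws, int num_prims) {
  TRACE_EVENT("mesa", "draw_batch",
              [&](perfetto::EventContext ctx) {
                // dump arbitrary state into the event payload
                auto *a = ctx.event()->add_debug_annotations();
                a->set_name("num_draws");
                a->set_int_value(num_draws);
                a = ctx.event()->add_debug_annotations();
                a->set_name("num_prims");
                a->set_int_value(num_prims);
              });
  // ... the actual draw work runs inside the scoped event ...
}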


I don't have strong opinions on how Mesa should proceed here, and you
don't need an extra cook in the kitchen. If I really had to express a
handwavy opinion, my best advice would be to start with something you
can iterate on right now, maybe behind some compile-time flag, and come
up with a plan on how to turn it into a production thing later on. We are
interested to hear your feedback and adjust the design of our stable C API.


[1] 
https://android-developers.googleblog.com/2020/09/android-gpu-inspector-open-beta.html?m=1


On the tracing library / C++ vendoring / stable C API:

The way the Perfetto SDK is organized today is mainly influenced by and
designed for the way Google handles its projects, which boils down to:
(i) statically link everything, to minimize the testing matrix; (ii)
move fast and refactor all dependencies as needed.
It's all about "who pays the maintenance cost and when?". This tends to 
work well in a large company which: (i) has a giant repo which allows 
~atomic cross-project changes; (ii) has the resources to keep everything 
up to date.
I am perfectly aware this is not appealing nor ideal for open source
projects and, more in general, for the way libraries in Linux
distributions work. I hear you when you say "vendoring [...] seems a bad
idea". Yes, it implicitly pushes the burden of up-revving onto the
"depender". [that's bad]
We are committed to maintaining ABI stability of our tracing protocol and
socket (see https://perfetto.dev/docs/design-docs/api-and-abi). This is
because Chrome, Android, and now CrOS, and tools like gpuinspector.dev,
which all statically link perfetto in some form, have very different
release cycles. [that's good]
We also try not to break the C++ API too much, as robdclark@ found while
trying to update through our v3..v12 monthly releases [that's good]. But
that C++ API has too wide a surface and we can't commit to making it a
fully stable API. Nor can we make the current C++ ABI stable across a
.so boundary (the C++ SDK today heavily depends on inlines to allow the
compiler to see through the layers). [that's bad]


For this reason, we recently started making plans to change the way we 
do things to meet the needs of open source projects and not just 
Google-internal ones. [that's good]
Specifically (Note: to open the docs below you need to join 
https://groups.google.com/forum/#!forum/perfetto-dev to inherit the ACLs):


1. https://bit.ly/perfetto-debian has a plan to distribute tracing 
services and SDK as standard pkg-config based packages (at least for 
Debian. We'll rely on volunteers for other distros)


2. https://bit.ly/perfetto-c has a plan + ongoing discussion for having 
a long-term stable C API in a libperfetto.so . The key here for us 
(Perfetto) is identifying a subset of the wider C++ API that fits the 
bill for projects out there and that we are comfortable maintaining 
longer term.


The one thing I also need to be very clear on, though, is that both the
perfetto-debian and perfetto-c discussions are very recent, and it will
take a while for us to get there. We can't commit to a specific timeline
right now, but if I had to make an educated estimate I'd say more
towards end-of-2021. [that's bad]
I'd also be more keen to commit once there are concrete use-cases,
ideally with iterations/feedback from a project like Mesa.


[obligatory reference at this point: https://youtu.be/Krbl911ZPBA?t=22]


> eero.t.tamminen@
> Just set uprobe for suitable buffer swap function [1], and parse 
kernel ftrace events. (paraphrasing for context: "why do we need
instrumentation points? can't we just use uprobes instead?")

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-16 Thread Alyssa Rosenzweig
> That said, I'm ok with making perfetto support a build-time option
> that is default disabled.  And I think it would be even ok to limit
> use of perfetto to individual drivers (ie. no core mesa/gallium
> perfetto dependency) to start.  And, well, CrOS has plenty of mesa
> contributors so I think you can consider us signed up to maintain
> this.

As a general rule, I have no objection to build-time default disabled,
driver-specific stuff. Not my job to make technical decisions for other
driver teams.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-16 Thread Rob Clark
So, I did an experiment to get a feel for how hard/easy updating
perfetto sdk would be.. I took gfx-pps (which already "vendors" the
perfetto sdk) and rebuilt it with perfetto.cc/h from each existing
perfetto release tag (v3.0 is the earliest I see, v12.1 is the
latest).  I did hack out the intel backend due to missing i915-perf
dependency (which I guess is a problem that would go away if this was
all just pulled into mesa).. but kept the panfrost backend.  I ran
into no issues rebuilding with each different perfetto SDK version.

So that gives some sense that at least the API is stable, and we
aren't signing up for huge pain when we need to update perfetto SDK.

That said, I'm ok with making perfetto support a build-time option
that is default disabled.  And I think it would be even ok to limit
use of perfetto to individual drivers (ie. no core mesa/gallium
perfetto dependency) to start.  And, well, CrOS has plenty of mesa
contributors so I think you can consider us signed up to maintain
this.

(I do think perfetto has something to offer linux distro users as
well, but we can take this one step at a time.)

BR,
-R

On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
>
> So, we're vendoring something that we know getting bug fixes for will be an 
> enormous pile of work? That sounds like a really bad idea.
>
> On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> > >
> > >
> > >
> > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> > > > >
> > > >
> > > > 
> > > >
> > > > > Runtime Characteristics
> > > > >
> > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > features of perfetto, it will increase the binary size of mesa by 
> > > > > about 500KB.
> > > >
> > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > > links for the different  ${driver}_dri.so), which on my fedora laptop
> > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > > > to worry about too much.
> > > >
> > > > > Background thread. Perfetto uses a background thread for 
> > > > > communication with the system tracing daemon (traced) to advertise 
> > > > > trace data and get notification of trace start/stop.
> > > >
> > > > Mesa already tends to have plenty of threads.. some of that depends on
> > > > the driver, I think currently radeonsi is the threading king, but
> > > > there are several other drivers working on threaded_context and async
> > > > compile thread pool.
> > > >
> > > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > > self-server mode, which seems like it would be useful for distros
> > > > which do not have the system daemon.  I'm not sure if we lose that
> > > > with percetto?
> > > >
> > > > > Runtime overhead when disabled is designed to be optimal with one 
> > > > > predicted branch, typically a few CPU cycles per event. While 
> > > > > enabled, the overhead can be around 1 us per event.
> > > > >
> > > > > Integration Challenges
> > > > >
> > > > > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > > > > templates, etc. There are ongoing discussions on providing an 
> > > > > official perfetto C API, but it is not yet clear when this will land 
> > > > > on the perfetto roadmap.
> > > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K 
> > > > > lines of code.
> > > > > Anything that includes perfetto.h takes a long time to compile.
> > > > > The current Perfetto SDK design is incompatible with being a shared 
> > > > > library behind a C API.
> > > >
> > > > So, C++ on it's own isn't a showstopper, mesa has plenty of C++ code.
> > > > But maybe we should verify that MSVC is happy with it, otherwise we
> > > > need to take a bit more care in some parts of the codebase.
> > > >
> > > > As far as compile time, I wonder if we can regenerate the .cc/.h with
> > > > only the gpu trace parts?  But I wouldn't expect the .h to be
> > > > something widely included.  For example, for gpu timeline traces in
> > > > freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> > > > extern "C" {} around the callbacks that would hook into the
> > > > u_tracepoint tracepoints.  That one file would pull in the perfetto
> > > > .h, and we'd just not build that file if perfetto was disabled.
> > > >
> > > > Overall having to add our own extern C wrappers in some places doesn't
> > > > seem like the *end* of the world.. a bit annoying, but we might end up
> > > > doing that regardless if other folks want the ability to hook in
> > > > something other than perfetto?
> > > >
> > > > 
> > > >
> > > > > Mesa Integration Alternatives
> > > >
> > > > I'm 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-16 Thread Jose Fonseca
I've seen other projects successfully leveraging git submodules for including 
3rd party code without vendoring in the stricter sense.  I don't have direct 
experience doing so myself yet, but I hope one day to move apitrace towards 
this.  (Apitrace bundles lots of 3rd party code, partly for convenience on 
Windows, but also because it's important that everything is statically linked 
when doing LD_PRELOAD / DLL injection, to avoid interfering with the 
applications' own .so/.dlls.)


Regarding the perf event tracing in Mesa, it seems a good idea to me in 
principle FWIW.  Even if for many developers tracing a single source might 
suffice, there are scenarios where tracing the whole system is useful (be it 
multiple processes like browsers & desktop compositors, virtualization 
host/guest, etc.)

For reference, Windows has an event tracing framework (ETW), used by many parts 
of the system, including D3D runtimes, which allows anybody to get a 
system-wide view of performance [1].  And in fact, WDDM 1.2 drivers are 
required to hook into ETW [2].

In short, I think it would be nice if Mesa had support for tracing events, 
preferably in a way that allows plugging in different tracing frameworks, 
while at the same time allowing those who don't need it to opt out.

Jose

[1] https://graphics.stanford.edu/~mdfisher/GPUView.html
[2] 
https://docs.microsoft.com/en-us/windows-hardware/drivers/display/user-mode-driver-logging


From: mesa-dev  on behalf of Dylan 
Baker 
Sent: Saturday, February 13, 2021 02:15
To: Rob Clark 
Cc: ML mesa-dev 
Subject: Re: [Mesa-dev] Perfetto CPU/GPU tracing

I can't speak for anyone else, but a giant pile of vendored code that you're 
expected to not update seems like a really bad idea to me.

On Fri, Feb 12, 2021, at 18:09, Rob Clark wrote:
> I'm not really sure that is a fair statement.. the work scales
> according to the API change (which I'm not really sure if it changes
> much other than adding things).. if the API doesn't change, it is not
> really any effort to update two files in mesa git.
>
> As far as bug fixes.. it is a lot of code, but seems like the largest
> part of it is just generated protobuf serialization/deserialization
> code, rather than anything interesting.
>
> And again, I'm not a fan of their approach of "just vendor it".. but
> it is how perfetto is intended to be used, and in this case it seems
> like the best approach, since it is a situation where the protocol is
> the point of abi stability.
>
> BR,
> -R
>
> On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
> >
> > So, we're vendoring something that we know getting bug fixes for will be an 
> > enormous pile of work? That sounds like a really bad idea.
> >
> > On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> > > >
> > > >
> > > >
> > > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > 
> > > > >
> > > > > > Runtime Characteristics
> > > > > >
> > > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > > features of perfetto, it will increase the binary size of mesa by 
> > > > > > about 500KB.
> > > > >
> > > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > > > links for the different  ${driver}_dri.so), which on my fedora laptop
> > > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > > > > to worry about too much.
> > > > >
> > > > > > Background thread. Perfetto uses a background thread for 
> > > > > > communication with the system tracing daemon (traced) to advertise 
> > > > > > trace data and get notification of trace start/stop.
> > > > >
> > > > > Mesa already tends to have plenty of threads.. some of that depends on
> > > > > the driver, I think currently radeonsi is the threading king, but
> > > > > there are several other drivers working on threaded_context and async
> > > > > compile thread pool.
> > > > >
> > > > > It is worth mentioning that, AFAIU

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-15 Thread John Bates
I can vouch for the usefulness of the combined trace timeline showing CPU
core usage, filtered application events and GPU usage. Android systrace
shows this data -- I've used it from both an app developer perspective to
fix countless performance bugs and from a whole-system perspective to tune
issues such as motopho latency for VR. The latter is only possible when the
CPU timeline is combined with vendor-specific GPU data showing binning,
resolves/unresolves and context preemptions.

With virtualization, we have a new level of complexity and corresponding
performance bugs to track down. One example is unexpected shader compiles,
but there are other slowpaths in mesa that are important to be able to see
without difficulty. There is work being done to support perfetto trace data
from both host and guest VM -- mesa is in both.

Perfetto/systrace makes this performance analysis work easier in many cases
-- run an app, record trace, reproduce a glitch, and then view the trace to
find out what happened.

On Mon, Feb 15, 2021 at 9:27 AM Rob Clark  wrote:

> On Mon, Feb 15, 2021 at 3:13 AM Tamminen, Eero T
>  wrote:
> >
> > Hi,
> >
> > On Fri, 2021-02-12 at 18:20 -0800, Rob Clark wrote:
> > > On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
> > >  wrote:
> > ...
> > > > In our implementation that precision (in particular when a drawcall
> > > > ends) comes at a stalling cost unfortunately.
> > >
> > > yeah, stalling on our end too for per-draw counter snapshots.. but if
> > > you are looking for which shaders to optimize that doesn't matter
> > > *that* much.. they'll be some overhead, but it's not really going to
> > > change which draws/shaders are expensive.. just mean that you lose out
> > > on pipelining of the state changes
> >
> > I don't think it makes sense to try doing this all in one step.
> >
> > Unless one has resources of Google + commitment for maintaining it, I
> > think doing those steps with separate, dedicated tools can be better fit
> > for Open Source than trying to maintain a monster that tries to do
> > everything of analyzing:
> > - whether performance issue is on GPU side, CPU side, or code being too
> > synchronous
> > - where the bottlenecks are on GPU side
> > - where the bottlenecks are on CPU side
> > - what are the sync points
>
> I mean, google has a team working on perfetto, so we kinda are getting
> the tool here for free, all we need to do here is instrumentation for
> the mesa part of the system..
>
> Currently, if you look at
> https://chromeos.dev/en/games/optimizing-games-profiling the
> recommendation basically amounts to "optimize on android with
> snapdragon profiler/etc".. which is really not a great look for mesa.
> (And doesn't do anything for intel at all.)  Mesa is a great project,
> but profiling tooling, especially something for people other than mesa
> developers, is a glaring weakness.  Perfetto looks like a great
> opportunity to fix that, not only for ourselves but also game
> developers and others.
>
> BR,
> -R
>
> > IMHO:
> > - Overall picture should not have too many details, because otherwise
> > one can start chasing irrelevancies [1]
> > - Rest of analysis works better when one concentrate on one performance
> > aspect (shown by the overall picture) at the time.  So that activity
> > could have tool dedicated for that purpose
> >
> >
> > - Eero
> >
> > [1] Unless one has HW assisted tool that really can tell *everything*
> > like ARM ETM and Intel PT with *really good* post-processing &
> > visualization tooling.  I don't think are usable outside of large
> > companies though because of HW requirements and using them taking a lot
> > of time / expertise (1 sec trace is gigs of data).
> >
> > PS. For checking on shader compiles, I've used two steps:
> > * script to trace frame updates & shader compiles (with ftrace uprobe on
> > appropriate function entry points) + monitor CPU usage & GPU usage (for
> > GPU, freq or power usage is enough)
> >   -> shows whether FPS & GPU utilization dip with compiles.  Frame
> > updates & compiles are rare enough that ftrace overhead doesn't matter
> >
> > * enable Mesa shader debugging, because in next step one wants to know
> > what shaders they are and how they're compiled
> >
> > ___
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-15 Thread Rob Clark
On Mon, Feb 15, 2021 at 3:13 AM Tamminen, Eero T
 wrote:
>
> Hi,
>
> On Fri, 2021-02-12 at 18:20 -0800, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
> >  wrote:
> ...
> > > In our implementation that precision (in particular when a drawcall
> > > ends) comes at a stalling cost unfortunately.
> >
> > yeah, stalling on our end too for per-draw counter snapshots.. but if
> > you are looking for which shaders to optimize that doesn't matter
> > *that* much.. they'll be some overhead, but it's not really going to
> > change which draws/shaders are expensive.. just mean that you lose out
> > on pipelining of the state changes
>
> I don't think it makes sense to try doing this all in one step.
>
> Unless one has resources of Google + commitment for maintaining it, I
> think doing those steps with separate, dedicated tools can be better fit
> for Open Source than trying to maintain a monster that tries to do
> everything of analyzing:
> - whether performance issue is on GPU side, CPU side, or code being too
> synchronous
> - where the bottlenecks are on GPU side
> - where the bottlenecks are on CPU side
> - what are the sync points

I mean, google has a team working on perfetto, so we kinda are getting
the tool here for free, all we need to do here is instrumentation for
the mesa part of the system..

Currently, if you look at
https://chromeos.dev/en/games/optimizing-games-profiling the
recommendation basically amounts to "optimize on android with
snapdragon profiler/etc".. which is really not a great look for mesa.
(And doesn't do anything for intel at all.)  Mesa is a great project,
but profiling tooling, especially something for people other than mesa
developers, is a glaring weakness.  Perfetto looks like a great
opportunity to fix that, not only for ourselves but also game
developers and others.

BR,
-R

> IMHO:
> - Overall picture should not have too many details, because otherwise
> one can start chasing irrelevancies [1]
> - Rest of analysis works better when one concentrate on one performance
> aspect (shown by the overall picture) at the time.  So that activity
> could have tool dedicated for that purpose
>
>
> - Eero
>
> [1] Unless one has HW assisted tool that really can tell *everything*
> like ARM ETM and Intel PT with *really good* post-processing &
> visualization tooling.  I don't think are usable outside of large
> companies though because of HW requirements and using them taking a lot
> of time / expertise (1 sec trace is gigs of data).
>
> PS. For checking on shader compiles, I've used two steps:
> * script to trace frame updates & shader compiles (with ftrace uprobe on
> appropriate function entry points) + monitor CPU usage & GPU usage (for
> GPU, freq or power usage is enough)
>   -> shows whether FPS & GPU utilization dip with compiles.  Frame
> updates & compiles are rare enough that ftrace overhead doesn't matter
>
> * enable Mesa shader debugging, because in next step one wants to know
> what shaders they are and how they're compiled
>
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-15 Thread Tamminen, Eero T
Hi,

On Fri, 2021-02-12 at 18:20 -0800, Rob Clark wrote:
> On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
>  wrote:
...
> > In our implementation that precision (in particular when a drawcall
> > ends) comes at a stalling cost unfortunately.
> 
> yeah, stalling on our end too for per-draw counter snapshots.. but if
> you are looking for which shaders to optimize that doesn't matter
> *that* much.. they'll be some overhead, but it's not really going to
> change which draws/shaders are expensive.. just mean that you lose out
> on pipelining of the state changes

I don't think it makes sense to try doing this all in one step.

Unless one has the resources of Google + commitment for maintaining it, I
think doing those steps with separate, dedicated tools can be a better fit
for Open Source than trying to maintain a monster that tries to do
everything in terms of analyzing:
- whether the performance issue is on the GPU side, CPU side, or the code
being too synchronous
- where the bottlenecks are on the GPU side
- where the bottlenecks are on the CPU side
- what the sync points are

IMHO:
- The overall picture should not have too many details, because otherwise
one can start chasing irrelevancies [1]
- The rest of the analysis works better when one concentrates on one
performance aspect (shown by the overall picture) at a time.  That activity
could have a tool dedicated to the purpose


- Eero

[1] Unless one has a HW-assisted tool that really can tell *everything*,
like ARM ETM and Intel PT, with *really good* post-processing &
visualization tooling.  I don't think those are usable outside of large
companies though, because of the HW requirements and because using them
takes a lot of time / expertise (1 sec of trace is gigs of data).

PS. For checking on shader compiles, I've used two steps:
* script to trace frame updates & shader compiles (with ftrace uprobe on
appropriate function entry points) + monitor CPU usage & GPU usage (for
GPU, freq or power usage is enough)
  -> shows whether FPS & GPU utilization dip with compiles.  Frame
updates & compiles are rare enough that ftrace overhead doesn't matter

* enable Mesa shader debugging, because in next step one wants to know
what shaders they are and how they're compiled

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-15 Thread Michel Dänzer

On 2021-02-13 3:15 a.m., Dylan Baker wrote:

I can't speak for anyone else, but a giant pile of vendored code that you're 
expected to not update seems like a really bad idea to me.


I agree.


--
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-14 Thread Lionel Landwerlin

On 14/02/2021 00:47, Rob Clark wrote:

On Sat, Feb 13, 2021 at 12:52 PM Lionel Landwerlin
 wrote:

On 13/02/2021 18:52, Rob Clark wrote:

On Sat, Feb 13, 2021 at 12:04 AM Lionel Landwerlin
 wrote:

On 13/02/2021 04:20, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
 wrote:

On 13/02/2021 03:38, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
 wrote:

We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want
to build a timeline of the work (because preemption etc...).

ugg, does that mean extensions like AMD_performance_monitor doesn't
actually work on intel?

It works, but only a single app can use it at a time.


I see.. on the freedreno side we haven't really gone down the
preemption route yet, but we have a way to hook in some save/restore
cmdstream

That's why I think, for Intel HW, something like gfx-pps is probably
best to pull out all the data on a timeline for the entire system.

Then the drivers could just provide timestamp on the timeline to
annotate it.


Looking at gfx-pps, my question is why is this not part of the mesa
tree?  That way I could use it for freedreno (either as stand-alone
process or part of driver) without duplicating all the perfcntr
tables, and information about different variants of a given generation
needed to interpret the raw counters into something useful for a
human.

Pulling gfx-pps into mesa seems like a sensible way forward.

BR,
-R

Yeah, I guess it depends on how your stack looks.

If the counters cover more than 3d and you have video drivers out of the
mesa tree, it might make sense to have it somewhere else.

Pulling intel video drivers into mesa and bringing some sanity to that
side of things would make more than a few people happy (but I digress
;-))

Until then, exporting a library from the mesa build might be an
option?  And possibly something like that would work for virgl host
side stuff as well?



For Intel, we have a small library in IGT [1]; it's not big, and most of 
that logic also exists in src/intel/perf [2].


So definitely possible.

Might also be a good occasion to try to make a common sublibrary for 
Intel drivers so we don't carry the statically linked code from 
src/intel/perf in each driver.



-Lionel





Anyway I didn't want to sound negative, having a daemon like gfx-pps
thing in mesa to report system wide counters works for me :)

yeah, I think having it in the mesa tree is good for code-reuse (from
my experience, pulling external freedreno related tools into mesa
turned out to be a good thing)

At some point, whether the collection is in-process or not could just
be an implementation detail

BR,
-R


Then we can look into how to have each intel driver add annotation on
the timeline.


-Lionel



[1] : 
https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/lib/i915/perf.c


[2] : https://gitlab.freedesktop.org/mesa/mesa/-/tree/master/src/intel/perf

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-13 Thread Rob Clark
On Sat, Feb 13, 2021 at 12:52 PM Lionel Landwerlin
 wrote:
>
> On 13/02/2021 18:52, Rob Clark wrote:
> > On Sat, Feb 13, 2021 at 12:04 AM Lionel Landwerlin
> >  wrote:
> >> On 13/02/2021 04:20, Rob Clark wrote:
> >>> On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
> >>>  wrote:
>  On 13/02/2021 03:38, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
> >  wrote:
> >> We're kind of in the same boat for Intel.
> >>
> >> Access to GPU perf counters is exclusive to a single process if you 
> >> want
> >> to build a timeline of the work (because preemption etc...).
> > ugg, does that mean extensions like AMD_performance_monitor don't
> > actually work on intel?
>  It works, but only a single app can use it at a time.
> 
> >>> I see.. on the freedreno side we haven't really gone down the
> >>> preemption route yet, but we have a way to hook in some save/restore
> >>> cmdstream
> >>
> >> That's why I think, for Intel HW, something like gfx-pps is probably
> >> best to pull out all the data on a timeline for the entire system.
> >>
> >> Then the drivers could just provide timestamp on the timeline to
> >> annotate it.
> >>
> > Looking at gfx-pps, my question is why is this not part of the mesa
> > tree?  That way I could use it for freedreno (either as stand-alone
> > process or part of driver) without duplicating all the perfcntr
> > tables, and information about different variants of a given generation
> > needed to interpret the raw counters into something useful for a
> > human.
> >
> > Pulling gfx-pps into mesa seems like a sensible way forward.
> >
> > BR,
> > -R
>
> Yeah, I guess it depends on how your stack looks.
>
> If the counters cover more than 3d and you have video drivers out of the
> mesa tree, it might make sense to have it somewhere else.

Pulling intel video drivers into mesa and bringing some sanity to that
side of things would make more than a few people happy (but I digress
;-))

Until then, exporting a library from the mesa build might be an
option?  And possibly something like that would work for virgl host
side stuff as well?

> Anyway I didn't want to sound negative, having a daemon like gfx-pps
> thing in mesa to report system wide counters works for me :)

yeah, I think having it in the mesa tree is good for code-reuse (from
my experience, pulling external freedreno related tools into mesa
turned out to be a good thing)

At some point, whether the collection is in-process or not could just
be an implementation detail

BR,
-R

> Then we can look into how to have each intel driver add annotation on
> the timeline.
>
>
> -Lionel
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-13 Thread Lionel Landwerlin

On 13/02/2021 18:52, Rob Clark wrote:

On Sat, Feb 13, 2021 at 12:04 AM Lionel Landwerlin
 wrote:

On 13/02/2021 04:20, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
 wrote:

On 13/02/2021 03:38, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
 wrote:

We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want
to build a timeline of the work (because preemption etc...).

ugg, does that mean extensions like AMD_performance_monitor don't
actually work on intel?

It works, but only a single app can use it at a time.


I see.. on the freedreno side we haven't really gone down the
preemption route yet, but we have a way to hook in some save/restore
cmdstream


That's why I think, for Intel HW, something like gfx-pps is probably
best to pull out all the data on a timeline for the entire system.

Then the drivers could just provide timestamp on the timeline to
annotate it.


Looking at gfx-pps, my question is why is this not part of the mesa
tree?  That way I could use it for freedreno (either as stand-alone
process or part of driver) without duplicating all the perfcntr
tables, and information about different variants of a given generation
needed to interpret the raw counters into something useful for a
human.

Pulling gfx-pps into mesa seems like a sensible way forward.

BR,
-R


Yeah, I guess it depends on how your stack looks.

If the counters cover more than 3d and you have video drivers out of the 
mesa tree, it might make sense to have it somewhere else.



Anyway I didn't want to sound negative, having a daemon like gfx-pps 
thing in mesa to report system wide counters works for me :)


Then we can look into how to have each intel driver add annotation on 
the timeline.



-Lionel

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-13 Thread Rob Clark
On Sat, Feb 13, 2021 at 12:04 AM Lionel Landwerlin
 wrote:
>
> On 13/02/2021 04:20, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
> >  wrote:
> >> On 13/02/2021 03:38, Rob Clark wrote:
> >>> On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
> >>>  wrote:
>  We're kind of in the same boat for Intel.
> 
>  Access to GPU perf counters is exclusive to a single process if you want
>  to build a timeline of the work (because preemption etc...).
> >>> ugg, does that mean extensions like AMD_performance_monitor don't
> >>> actually work on intel?
> >>
> >> It works, but only a single app can use it at a time.
> >>
> > I see.. on the freedreno side we haven't really gone down the
> > preemption route yet, but we have a way to hook in some save/restore
> > cmdstream
>
>
> That's why I think, for Intel HW, something like gfx-pps is probably
> best to pull out all the data on a timeline for the entire system.
>
> Then the drivers could just provide timestamp on the timeline to
> annotate it.
>

Looking at gfx-pps, my question is why is this not part of the mesa
tree?  That way I could use it for freedreno (either as stand-alone
process or part of driver) without duplicating all the perfcntr
tables, and information about different variants of a given generation
needed to interpret the raw counters into something useful for a
human.

Pulling gfx-pps into mesa seems like a sensible way forward.

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-13 Thread Lionel Landwerlin

On 13/02/2021 04:20, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
 wrote:

On 13/02/2021 03:38, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
 wrote:

We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want
to build a timeline of the work (because preemption etc...).

ugg, does that mean extensions like AMD_performance_monitor don't
actually work on intel?


It works, but only a single app can use it at a time.


I see.. on the freedreno side we haven't really gone down the
preemption route yet, but we have a way to hook in some save/restore
cmdstream



That's why I think, for Intel HW, something like gfx-pps is probably 
best to pull out all the data on a timeline for the entire system.


Then the drivers could just provide timestamp on the timeline to 
annotate it.



-Lionel






The best information we could add from mesa would be a timestamp of when a
particular drawcall started.
But that's pretty much what timestamp queries are.
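
Roughly, a timestamp query looks like this at the GL level -- a sketch only,
assuming a current desktop GL context that exposes ARB_timer_query:

  #define GL_GLEXT_PROTOTYPES 1
  #include <GL/gl.h>
  #include <GL/glext.h>

  /* Ask the GPU to record its clock before and after a draw call; reading the
   * result back stalls until the GPU gets there. */
  GLuint64 time_drawcall(void (*emit_draw)(void))
  {
     GLuint q[2];
     GLuint64 start = 0, end = 0;

     glGenQueries(2, q);
     glQueryCounter(q[0], GL_TIMESTAMP);   /* GPU writes its clock here */
     emit_draw();
     glQueryCounter(q[1], GL_TIMESTAMP);   /* and again after the draw */

     glGetQueryObjectui64v(q[0], GL_QUERY_RESULT, &start);
     glGetQueryObjectui64v(q[1], GL_QUERY_RESULT, &end);
     glDeleteQueries(2, q);
     return end - start;                   /* nanoseconds on the GPU clock */
  }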

Were you thinking of particular GPU generated data you don't get from
gfx-pps?

From the looks of it, currently I don't get *any* GPU generated data
from gfx-pps ;-)


Maybe file a bug? :
https://gitlab.freedesktop.org/Fahien/gfx-pps/-/blob/master/src/gpu/intel/intel_driver.cc



We can ofc sample counters from a separate process as well... I have a
curses tool (fdperf) which does this.. but running outside of gpu
cmdstream plus counters losing context across suspend/resume makes it
less than perfect.


Our counters are global so to give per application values, we need to
post process a stream of HW counter snapshots.



And something that works the same way as
AMD_performance_monitor under the hood gives a more precise look at
which shaders (for ex) are consuming the most cycles.


In our implementation that precision (in particular when a drawcall
ends) comes at a stalling cost unfortunately.

yeah, stalling on our end too for per-draw counter snapshots.. but if
you are looking for which shaders to optimize that doesn't matter
*that* much.. there'll be some overhead, but it's not really going to
change which draws/shaders are expensive.. just mean that you lose out
on pipelining of the state changes

BR,
-R


For cases where
we can profile a trace, frameretrace and related tools is pretty
great.. but it would be nice to have similar visibility for actual
games (which for me, mostly means android games, since so far no
aarch64 steam store), but also give game developers good tools (or at
least the same tools that they get with other closed src drivers on
android).


Sure, but frame analysis is different than live monitoring of the system.

On Intel's HW you don't get the same level of details in both cases, and
apart from a few timestamps, I think gfx-pps is as good as you're gonna get
for live stuff.


-Lionel



BR,
-R


Thanks,

-Lionel


On 13/02/2021 00:12, Alyssa Rosenzweig wrote:

My 2c for Mali/Panfrost --

For us, capturing GPU perf counters is orthogonal to rendering. It's
expected (e.g. with Arm's tools) to do this from a separate process.
Neither Mesa nor the DDK should require custom instrumentation for the
low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
Perfetto as it is. So for us I don't see the value in modifying Mesa for
tracing.

On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:

(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:


I've recently been using GPUVis to look at trace events.  On Intel
platforms, GPUVis incorporates ftrace events from the i915 driver,
performance metrics from igt-gpu-tools, and userspace ftrace markers
that I locally hack up in Mesa.


GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead compared to ftrace, so for
example it is acceptable to have production trace instrumentation that can
be captured without dev builds. To do that with ftrace it may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace, for
example, it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.



It is very easy to compile the GPUVis UI.  Userspace instrumentation
requires a single C/C++ header.  You don't have to access an external
web service to analyze trace data (a big no-no for devs working on
preproduction hardware).

Is it possible to build and run the Perfetto UI locally?

Yes, local UI builds are possible.
Also confirmed with the perfetto team  that
trace data is not uploaded unless you use the 'share' feature.



 Can it display
arbitrary trace events that are 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Dylan Baker
We're open source, tilting at windmills is what we do :D

On Fri, Feb 12, 2021, at 18:49, Rob Clark wrote:
> A lot of code, which like I said is mostly just generated ser/deser
> and not very interesting.. and 90% of it we won't use (unless mesa
> becomes a wifi driver, and more or less the rest of an OS).  And if
> there is a need to update it, we update it.. it's two files.  But *if*
> there are any API changes, we can deal with that in the same commit.
> I'm not saying it is great.. just that it is least bad.
> 
> I completely understand the argument against vendoring.. and I also
> completely understand the argument against tilting at windmills ;-)
> 
> BR,
> -R
> 
> On Fri, Feb 12, 2021 at 6:15 PM Dylan Baker  wrote:
> >
> > I can't speak for anyone else, but a giant pile of vendored code that 
> > you're expected to not update seems like a really bad idea to me.
> >
> > On Fri, Feb 12, 2021, at 18:09, Rob Clark wrote:
> > > I'm not really sure that is a fair statement.. the work scales
> > > according to the API change (which I'm not really sure if it changes
> > > much other than adding things).. if the API doesn't change, it is not
> > > really any effort to update two files in mesa git.
> > >
> > > As far as bug fixes.. it is a lot of code, but seems like the largest
> > > part of it is just generated protobuf serialization/deserialization
> > > code, rather than anything interesting.
> > >
> > > And again, I'm not a fan of their approach of "just vendor it".. but
> > > it is how perfetto is intended to be used, and in this case it seems
> > > like the best approach, since it is a situation where the protocol is
> > > the point of abi stability.
> > >
> > > BR,
> > > -R
> > >
> > > On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
> > > >
> > > > So, we're vendoring something that we know getting bug fixes for will 
> > > > be an enormous pile of work? That sounds like a really bad idea.
> > > >
> > > > On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > > > > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  
> > > > > > > wrote:
> > > > > > > >
> > > > > > >
> > > > > > > 
> > > > > > >
> > > > > > > > Runtime Characteristics
> > > > > > > >
> > > > > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > > > > features of perfetto, it will increase the binary size of mesa 
> > > > > > > > by about 500KB.
> > > > > > >
> > > > > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > > > > "megadriver" (ie. all the drivers linked into a single .so with 
> > > > > > > hard
> > > > > > > links for the different  ${driver}_dri.so), which on my fedora 
> > > > > > > laptop
> > > > > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > > > > actually gets paged into RAM from disk, but I think 500K isn't a 
> > > > > > > thing
> > > > > > > to worry about too much.
> > > > > > >
> > > > > > > > Background thread. Perfetto uses a background thread for 
> > > > > > > > communication with the system tracing daemon (traced) to 
> > > > > > > > advertise trace data and get notification of trace start/stop.
> > > > > > >
> > > > > > > Mesa already tends to have plenty of threads.. some of that 
> > > > > > > depends on
> > > > > > > the driver, I think currently radeonsi is the threading king, but
> > > > > > > there are several other drivers working on threaded_context and 
> > > > > > > async
> > > > > > > compile thread pool.
> > > > > > >
> > > > > > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > > > > > self-server mode, which seems like it would be useful for distros
> > > > > > > which do not have the system daemon.  I'm not sure if we lose that
> > > > > > > with percetto?
> > > > > > >
> > > > > > > > Runtime overhead when disabled is designed to be optimal with 
> > > > > > > > one predicted branch, typically a few CPU cycles per event. 
> > > > > > > > While enabled, the overhead can be around 1 us per event.
> > > > > > > >
> > > > > > > > Integration Challenges
> > > > > > > >
> > > > > > > > The perfetto SDK is C++ and designed around macros, lambdas, 
> > > > > > > > inline templates, etc. There are ongoing discussions on 
> > > > > > > > providing an official perfetto C API, but it is not yet clear 
> > > > > > > > when this will land on the perfetto roadmap.
> > > > > > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 
> > > > > > > > 100K lines of code.
> > > > > > > > Anything that includes perfetto.h takes a long time to compile.
> > > > > > > > The current Perfetto SDK design is incompatible with being a 
> > > > > > > > shared library behind a C API.
> > > > > > >
> > > > > > > So, C++ on its own isn't a showstopper, mesa has plenty 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
A lot of code, which like I said is mostly just generated ser/deser
and not very interesting.. and 90% of it we won't use (unless mesa
becomes a wifi driver, and more or less the rest of an OS).  And if
there is a need to update it, we update it.. it's two files.  But *if*
there are any API changes, we can deal with that in the same commit.
I'm not saying it is great.. just that it is least bad.

I completely understand the argument against vendoring.. and I also
completely understand the argument against tilting at windmills ;-)

BR,
-R

On Fri, Feb 12, 2021 at 6:15 PM Dylan Baker  wrote:
>
> I can't speak for anyone else, but a giant pile of vendored code that you're 
> expected to not update seems like a really bad idea to me.
>
> On Fri, Feb 12, 2021, at 18:09, Rob Clark wrote:
> > I'm not really sure that is a fair statement.. the work scales
> > according to the API change (which I'm not really sure if it changes
> > much other than adding things).. if the API doesn't change, it is not
> > really any effort to update two files in mesa git.
> >
> > As far as bug fixes.. it is a lot of code, but seems like the largest
> > part of it is just generated protobuf serialization/deserialization
> > code, rather than anything interesting.
> >
> > And again, I'm not a fan of their approach of "just vendor it".. but
> > it is how perfetto is intended to be used, and in this case it seems
> > like the best approach, since it is a situation where the protocol is
> > the point of abi stability.
> >
> > BR,
> > -R
> >
> > On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
> > >
> > > So, we're vendoring something that we know getting bug fixes for will be 
> > > an enormous pile of work? That sounds like a really bad idea.
> > >
> > > On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > > > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  
> > > > > > wrote:
> > > > > > >
> > > > > >
> > > > > > 
> > > > > >
> > > > > > > Runtime Characteristics
> > > > > > >
> > > > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > > > features of perfetto, it will increase the binary size of mesa by 
> > > > > > > about 500KB.
> > > > > >
> > > > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > > > > links for the different  ${driver}_dri.so), which on my fedora 
> > > > > > laptop
> > > > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > > > actually gets paged into RAM from disk, but I think 500K isn't a 
> > > > > > thing
> > > > > > to worry about too much.
> > > > > >
> > > > > > > Background thread. Perfetto uses a background thread for 
> > > > > > > communication with the system tracing daemon (traced) to 
> > > > > > > advertise trace data and get notification of trace start/stop.
> > > > > >
> > > > > > Mesa already tends to have plenty of threads.. some of that depends 
> > > > > > on
> > > > > > the driver, I think currently radeonsi is the threading king, but
> > > > > > there are several other drivers working on threaded_context and 
> > > > > > async
> > > > > > compile thread pool.
> > > > > >
> > > > > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > > > > self-server mode, which seems like it would be useful for distros
> > > > > > which do not have the system daemon.  I'm not sure if we lose that
> > > > > > with percetto?
> > > > > >
> > > > > > > Runtime overhead when disabled is designed to be optimal with one 
> > > > > > > predicted branch, typically a few CPU cycles per event. While 
> > > > > > > enabled, the overhead can be around 1 us per event.
> > > > > > >
> > > > > > > Integration Challenges
> > > > > > >
> > > > > > > The perfetto SDK is C++ and designed around macros, lambdas, 
> > > > > > > inline templates, etc. There are ongoing discussions on providing 
> > > > > > > an official perfetto C API, but it is not yet clear when this 
> > > > > > > will land on the perfetto roadmap.
> > > > > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 
> > > > > > > 100K lines of code.
> > > > > > > Anything that includes perfetto.h takes a long time to compile.
> > > > > > > The current Perfetto SDK design is incompatible with being a 
> > > > > > > shared library behind a C API.
> > > > > >
> > > > > > So, C++ on its own isn't a showstopper, mesa has plenty of C++ 
> > > > > > code.
> > > > > > But maybe we should verify that MSVC is happy with it, otherwise we
> > > > > > need to take a bit more care in some parts of the codebase.
> > > > > >
> > > > > > As far as compile time, I wonder if we can regenerate the .cc/.h 
> > > > > > with
> > > > > > only the gpu trace parts?  But I 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 5:56 PM Lionel Landwerlin
 wrote:
>
> On 13/02/2021 03:38, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
> >  wrote:
> >> We're kind of in the same boat for Intel.
> >>
> >> Access to GPU perf counters is exclusive to a single process if you want
> >> to build a timeline of the work (because preemption etc...).
> > ugg, does that mean extensions like AMD_performance_monitor don't
> > actually work on intel?
>
>
> It works, but only a single app can use it at a time.
>

I see.. on the freedreno side we haven't really gone down the
preemption route yet, but we have a way to hook in some save/restore
cmdstream

>
> >
> >> The best information we could add from mesa would be a timestamp of when a
> >> particular drawcall started.
> >> But that's pretty much what timestamp queries are.
> >>
> >> Were you thinking of particular GPU generated data you don't get from
> >> gfx-pps?
> > From the looks of it, currently I don't get *any* GPU generated data
> > from gfx-pps ;-)
>
>
> Maybe file a bug? :
> https://gitlab.freedesktop.org/Fahien/gfx-pps/-/blob/master/src/gpu/intel/intel_driver.cc
>
>
> >
> > We can ofc sample counters from a separate process as well... I have a
> > curses tool (fdperf) which does this.. but running outside of gpu
> > cmdstream plus counters losing context across suspend/resume makes it
> > less than perfect.
>
>
> Our counters are global so to give per application values, we need to
> post process a stream of HW counter snapshots.
>
>
> >And something that works the same way as
> > AMD_performance_monitor under the hood gives a more precise look at
> > which shaders (for ex) are consuming the most cycles.
>
>
> In our implementation that precision (in particular when a drawcall
> ends) comes at a stalling cost unfortunately.

yeah, stalling on our end too for per-draw counter snapshots.. but if
you are looking for which shaders to optimize that doesn't matter
*that* much.. there'll be some overhead, but it's not really going to
change which draws/shaders are expensive.. just mean that you lose out
on pipelining of the state changes

BR,
-R

>
> >For cases where
> > we can profile a trace, frameretrace and related tools is pretty
> > great.. but it would be nice to have similar visibility for actual
> > games (which for me, mostly means android games, since so far no
> > aarch64 steam store), but also give game developers good tools (or at
> > least the same tools that they get with other closed src drivers on
> > android).
>
>
> Sure, but frame analysis is different than live monitoring of the system.
>
> On Intel's HW you don't get the same level of details in both cases, and
> apart from a few timestamps, I think gfx-pps is as good as you're gonna get
> for live stuff.
>
>
> -Lionel
>
>
> >
> > BR,
> > -R
> >
> >> Thanks,
> >>
> >> -Lionel
> >>
> >>
> >> On 13/02/2021 00:12, Alyssa Rosenzweig wrote:
> >>> My 2c for Mali/Panfrost --
> >>>
> >>> For us, capturing GPU perf counters is orthogonal to rendering. It's
> >>> expected (e.g. with Arm's tools) to do this from a separate process.
> >>> Neither Mesa nor the DDK should require custom instrumentation for the
> >>> low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
> >>> Perfetto as it is. So for us I don't see the value in modifying Mesa for
> >>> tracing.
> >>>
> >>> On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
>  (responding from correct address this time)
> 
>  On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  
>  wrote:
> 
> > I've recently been using GPUVis to look at trace events.  On Intel
> > platforms, GPUVis incorporates ftrace events from the i915 driver,
> > performance metrics from igt-gpu-tools, and userspace ftrace markers
> > that I locally hack up in Mesa.
> >
>  GPUVis is great. I would love to see that data combined with
>  userspace events without any need for local hacks. Perfetto provides
>  on-demand trace events with lower overhead compared to ftrace, so for
>  example it is acceptable to have production trace instrumentation that 
>  can
>  be captured without dev builds. To do that with ftrace it may require a 
>  way
>  to enable and disable the ftrace file writes to avoid the overhead when
>  tracing is not in use. This is what Android does with systrace/atrace, 
>  for
>  example, it uses Binder to notify processes about trace sessions. 
>  Perfetto
>  does that in a more portable way.
> 
> 
> > It is very easy to compile the GPUVis UI.  Userspace instrumentation
> > requires a single C/C++ header.  You don't have to access an external
> > web service to analyze trace data (a big no-no for devs working on
> > preproduction hardware).
> >
> > Is it possible to build and run the Perfetto UI locally?
>  Yes, local UI builds are possible
>  

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Dylan Baker
I can't speak for anyone else, but a giant pile of vendored code that you're 
expected to not update seems like a really bad idea to me. 

On Fri, Feb 12, 2021, at 18:09, Rob Clark wrote:
> I'm not really sure that is a fair statement.. the work scales
> according to the API change (which I'm not really sure if it changes
> much other than adding things).. if the API doesn't change, it is not
> really any effort to update two files in mesa git.
> 
> As far as bug fixes.. it is a lot of code, but seems like the largest
> part of it is just generated protobuf serialization/deserialization
> code, rather than anything interesting.
> 
> And again, I'm not a fan of their approach of "just vendor it".. but
> it is how perfetto is intended to be used, and in this case it seems
> like the best approach, since it is a situation where the protocol is
> the point of abi stability.
> 
> BR,
> -R
> 
> On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
> >
> > So, we're vendoring something that we know getting bug fixes for will be an 
> > enormous pile of work? That sounds like a really bad idea.
> >
> > On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> > > >
> > > >
> > > >
> > > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > 
> > > > >
> > > > > > Runtime Characteristics
> > > > > >
> > > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > > features of perfetto, it will increase the binary size of mesa by 
> > > > > > about 500KB.
> > > > >
> > > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > > > links for the different  ${driver}_dri.so), which on my fedora laptop
> > > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > > > > to worry about too much.
> > > > >
> > > > > > Background thread. Perfetto uses a background thread for 
> > > > > > communication with the system tracing daemon (traced) to advertise 
> > > > > > trace data and get notification of trace start/stop.
> > > > >
> > > > > Mesa already tends to have plenty of threads.. some of that depends on
> > > > > the driver, I think currently radeonsi is the threading king, but
> > > > > there are several other drivers working on threaded_context and async
> > > > > compile thread pool.
> > > > >
> > > > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > > > self-server mode, which seems like it would be useful for distros
> > > > > which do not have the system daemon.  I'm not sure if we lose that
> > > > > with percetto?
> > > > >
> > > > > > Runtime overhead when disabled is designed to be optimal with one 
> > > > > > predicted branch, typically a few CPU cycles per event. While 
> > > > > > enabled, the overhead can be around 1 us per event.
> > > > > >
> > > > > > Integration Challenges
> > > > > >
> > > > > > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > > > > > templates, etc. There are ongoing discussions on providing an 
> > > > > > official perfetto C API, but it is not yet clear when this will 
> > > > > > land on the perfetto roadmap.
> > > > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K 
> > > > > > lines of code.
> > > > > > Anything that includes perfetto.h takes a long time to compile.
> > > > > > The current Perfetto SDK design is incompatible with being a shared 
> > > > > > library behind a C API.
> > > > >
> > > > > So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> > > > > But maybe we should verify that MSVC is happy with it, otherwise we
> > > > > need to take a bit more care in some parts of the codebase.
> > > > >
> > > > > As far as compile time, I wonder if we can regenerate the .cc/.h with
> > > > > only the gpu trace parts?  But I wouldn't expect the .h to be
> > > > > something widely included.  For example, for gpu timeline traces in
> > > > > freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> > > > > extern "C" {} around the callbacks that would hook into the
> > > > > u_tracepoint tracepoints.  That one file would pull in the perfetto
> > > > > .h, and we'd just not build that file if perfetto was disabled.
> > > > >
> > > > > Overall having to add our own extern C wrappers in some places doesn't
> > > > > seem like the *end* of the world.. a bit annoying, but we might end up
> > > > > doing that regardless if other folks want the ability to hook in
> > > > > something other than perfetto?
> > > > >
> > > > > 
> > > > >
> > > > > > Mesa Integration Alternatives
> > > > >
> > > > > I'm kind of leaning towards the "just slurp in the .cc/.h" 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
I'm not really sure that is a fair statement.. the work scales
according to the API change (which I'm not really sure if it changes
much other than adding things).. if the API doesn't change, it is not
really any effort to update two files in mesa git.

As far as bug fixes.. it is a lot of code, but seems like the largest
part of it is just generated protobuf serialization/deserialization
code, rather than anything interesting.

And again, I'm not a fan of their approach of "just vendor it".. but
it is how perfetto is intended to be used, and in this case it seems
like the best approach, since it is a situation where the protocol is
the point of abi stability.

BR,
-R

On Fri, Feb 12, 2021 at 5:51 PM Dylan Baker  wrote:
>
> So, we're vendoring something that we know getting bug fixes for will be an 
> enormous pile of work? That sounds like a really bad idea.
>
> On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> > On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> > >
> > >
> > >
> > > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> > > > >
> > > >
> > > > 
> > > >
> > > > > Runtime Characteristics
> > > > >
> > > > > ~500KB additional binary size. Even with using only the basic 
> > > > > features of perfetto, it will increase the binary size of mesa by 
> > > > > about 500KB.
> > > >
> > > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > > links for the different  ${driver}_dri.so), which on my fedora laptop
> > > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > > > to worry about too much.
> > > >
> > > > > Background thread. Perfetto uses a background thread for 
> > > > > communication with the system tracing daemon (traced) to advertise 
> > > > > trace data and get notification of trace start/stop.
> > > >
> > > > Mesa already tends to have plenty of threads.. some of that depends on
> > > > the driver, I think currently radeonsi is the threading king, but
> > > > there are several other drivers working on threaded_context and async
> > > > compile thread pool.
> > > >
> > > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > > self-server mode, which seems like it would be useful for distros
> > > > which do not have the system daemon.  I'm not sure if we lose that
> > > > with percetto?
> > > >
> > > > > Runtime overhead when disabled is designed to be optimal with one 
> > > > > predicted branch, typically a few CPU cycles per event. While 
> > > > > enabled, the overhead can be around 1 us per event.
> > > > >
> > > > > Integration Challenges
> > > > >
> > > > > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > > > > templates, etc. There are ongoing discussions on providing an 
> > > > > official perfetto C API, but it is not yet clear when this will land 
> > > > > on the perfetto roadmap.
> > > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K 
> > > > > lines of code.
> > > > > Anything that includes perfetto.h takes a long time to compile.
> > > > > The current Perfetto SDK design is incompatible with being a shared 
> > > > > library behind a C API.
> > > >
> > > > So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> > > > But maybe we should verify that MSVC is happy with it, otherwise we
> > > > need to take a bit more care in some parts of the codebase.
> > > >
> > > > As far as compile time, I wonder if we can regenerate the .cc/.h with
> > > > only the gpu trace parts?  But I wouldn't expect the .h to be
> > > > something widely included.  For example, for gpu timeline traces in
> > > > freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> > > > extern "C" {} around the callbacks that would hook into the
> > > > u_tracepoint tracepoints.  That one file would pull in the perfetto
> > > > .h, and we'd just not build that file if perfetto was disabled.
> > > >
> > > > Overall having to add our own extern C wrappers in some places doesn't
> > > > seem like the *end* of the world.. a bit annoying, but we might end up
> > > > doing that regardless if other folks want the ability to hook in
> > > > something other than perfetto?
> > > >
> > > > 
> > > >
> > > > > Mesa Integration Alternatives
> > > >
> > > > I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> > > > that is mostly because I expect to initially just add some basic gpu
> > > > timeline tracepoints, but over time iterate on adding more.. it would
> > > > be nice to not have to depend on a newer version of an external
> > > > library at each step.  That is ofc only my $0.02..
> > > >
> > > > BR,
> > > > -R
> > > > ___
> > > > mesa-dev 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 5:40 PM John Bates  wrote:
>
>
>
> On Fri, Feb 12, 2021 at 4:34 PM Rob Clark  wrote:
>>
>> On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
>> >
>>
>> 
>>
>> > Runtime Characteristics
>> >
>> > ~500KB additional binary size. Even with using only the basic features of 
>> > perfetto, it will increase the binary size of mesa by about 500KB.
>>
>> IMHO, that size is negligible.. looking at freedreno, a mesa build
>> *only* enabling freedreno is already ~6MB.. distros typically use
>> "megadriver" (ie. all the drivers linked into a single .so with hard
>> links for the different  ${driver}_dri.so), which on my fedora laptop
>> is ~21M.  Maybe if anything is relevant it is how much of that
>> actually gets paged into RAM from disk, but I think 500K isn't a thing
>> to worry about too much.
>>
>> > Background thread. Perfetto uses a background thread for communication 
>> > with the system tracing daemon (traced) to advertise trace data and get 
>> > notification of trace start/stop.
>>
>> Mesa already tends to have plenty of threads.. some of that depends on
>> the driver, I think currently radeonsi is the threading king, but
>> there are several other drivers working on threaded_context and async
>> compile thread pool.
>>
>> It is worth mentioning that, AFAIU, perfetto can operate in
>> self-server mode, which seems like it would be useful for distros
>> which do not have the system daemon.  I'm not sure if we lose that
>> with percetto?
>
>
> Easy to add, but want to avoid a runtime arg because it would add ~300KB to 
> binary size. Okay if we have an alternate init function though.

I think I could imagine wanting mesa build params to control whether
we want self-server or system-server mode.. ie. if some distros add
system-server support they wouldn't need self-server mode and vice
versa

>
>>
>>
>> > Runtime overhead when disabled is designed to be optimal with one 
>> > predicted branch, typically a few CPU cycles per event. While enabled, the 
>> > overhead can be around 1 us per event.
>> >
>> > Integration Challenges
>> >
>> > The perfetto SDK is C++ and designed around macros, lambdas, inline 
>> > templates, etc. There are ongoing discussions on providing an official 
>> > perfetto C API, but it is not yet clear when this will land on the 
>> > perfetto roadmap.
>> > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines 
>> > of code.
>> > Anything that includes perfetto.h takes a long time to compile.
>> > The current Perfetto SDK design is incompatible with being a shared 
>> > library behind a C API.
>>
>> So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
>> But maybe we should verify that MSVC is happy with it, otherwise we
>> need to take a bit more care in some parts of the codebase.
>>
>> As far as compile time, I wonder if we can regenerate the .cc/.h with
>> only the gpu trace parts?  But I wouldn't expect the .h to be
>> something widely included.  For example, for gpu timeline traces in
>> freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
>> extern "C" {} around the callbacks that would hook into the
>> u_tracepoint tracepoints.  That one file would pull in the perfetto
>> .h, and we'd just not build that file if perfetto was disabled.
>
>
> That works for GPU, but I'd like to see some slow CPU functions in traces as 
> well to help reason about performance problems. This ends up peppering the 
> trace header in lots of places.

My point was that we could strip out a whole lot of stuff that is
completely unrelated to mesa.. not sure if it is worth bothering with,
I doubt we'd #include perfetto.h very widely

>> Overall having to add our own extern C wrappers in some places doesn't
>> seem like the *end* of the world.. a bit annoying, but we might end up
>> doing that regardless if other folks want the ability to hook in
>> something other than perfetto?
>
>
> It's more than extern C wrappers if we want to minimize overhead while 
> tracing enabled at compile time. Have a look at percetto.h/cc.

I'm not sure how many distros are not using LTO these days.. I assume
once you have LTO it doesn't really matter anymore?

>>
>>
>> 
>>
>> > Mesa Integration Alternatives
>>
>> I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
>> that is mostly because I expect to initially just add some basic gpu
>> timeline tracepoints, but over time iterate on adding more.. it would
>> be nice to not have to depend on a newer version of an external
>> library at each step.  That is ofc only my $0.02..
>
>
> It's a small initial setup tax, true, but I still think it depends on what 
> perfetto features we plan to use -- for only a couple files doing GPU tracing 
> I agree percetto is unnecessary, but for CPU tracing it gets more complicated.

Definitely the first thing I plan to use is getting render stages onto
a timeline, so I can better see where the GPU time is going.. second
step is probably adding more 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Lionel Landwerlin

On 13/02/2021 03:38, Rob Clark wrote:

On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
 wrote:

We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want
to build a timeline of the work (because preemption etc...).

ugg, does that mean extensions like AMD_performance_monitor don't
actually work on intel?



It works, but only a single app can use it at a time.
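
For context, the per-application usage model of GL_AMD_performance_monitor
looks roughly like this -- a sketch assuming a current GL context that exposes
the extension (in practice the entry points would be resolved through
glXGetProcAddress/eglGetProcAddress):

  #define GL_GLEXT_PROTOTYPES 1
  #include <GL/gl.h>
  #include <GL/glext.h>

  void sample_one_counter(void)
  {
     /* Enumerate the first counter group and its first counter. */
     GLint num_groups = 0;
     GLuint group = 0;
     glGetPerfMonitorGroupsAMD(&num_groups, 1, &group);
     if (!num_groups)
        return;

     GLint num_counters = 0, max_active = 0;
     GLuint counter = 0;
     glGetPerfMonitorCountersAMD(group, &num_counters, &max_active, 1, &counter);

     GLuint monitor;
     glGenPerfMonitorsAMD(1, &monitor);
     glSelectPerfMonitorCountersAMD(monitor, GL_TRUE, group, 1, &counter);

     glBeginPerfMonitorAMD(monitor);
     /* ... submit the workload to be measured ... */
     glEndPerfMonitorAMD(monitor);

     GLuint data[16] = {0};
     GLint written = 0;
     glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AMD,
                                    sizeof(data), data, &written);
     glDeletePerfMonitorsAMD(1, &monitor);
  }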





The best information we could add from mesa would be a timestamp of when a
particular drawcall started.
But that's pretty much what timestamp queries are.

Were you thinking of particular GPU generated data you don't get from
gfx-pps?

From the looks of it, currently I don't get *any* GPU generated data
from gfx-pps ;-)



Maybe file a bug? : 
https://gitlab.freedesktop.org/Fahien/gfx-pps/-/blob/master/src/gpu/intel/intel_driver.cc





We can ofc sample counters from a separate process as well... I have a
curses tool (fdperf) which does this.. but running outside of gpu
cmdstream plus counters losing context across suspend/resume makes it
less than perfect.



Our counters are global so to give per application values, we need to 
post process a stream of HW counter snapshots.




   And something that works the same way as
AMD_performance_monitor under the hood gives a more precise look at
which shaders (for ex) are consuming the most cycles.



In our implementation that precision (in particular when a drawcall 
ends) comes at a stalling cost unfortunately.




   For cases where
we can profile a trace, frameretrace and related tools is pretty
great.. but it would be nice to have similar visibility for actual
games (which for me, mostly means android games, since so far no
aarch64 steam store), but also give game developers good tools (or at
least the same tools that they get with other closed src drivers on
android).



Sure, but frame analysis is different than live monitoring of the system.

On Intel's HW you don't get the same level of details in both cases, and 
apart from a few timestamps, I think gfx-pps is as good as you're gonna get
for live stuff.



-Lionel




BR,
-R


Thanks,

-Lionel


On 13/02/2021 00:12, Alyssa Rosenzweig wrote:

My 2c for Mali/Panfrost --

For us, capturing GPU perf counters is orthogonal to rendering. It's
expected (e.g. with Arm's tools) to do this from a separate process.
Neither Mesa nor the DDK should require custom instrumentation for the
low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
Perfetto as it is. So for us I don't see the value in modifying Mesa for
tracing.

On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:

(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:


I've recently been using GPUVis to look at trace events.  On Intel
platforms, GPUVis incorporates ftrace events from the i915 driver,
performance metrics from igt-gpu-tools, and userspace ftrace markers
that I locally hack up in Mesa.


GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead compared to ftrace, so for
example it is acceptable to have production trace instrumentation that can
be captured without dev builds. To do that with ftrace it may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace, for
example, it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.



It is very easy to compile the GPUVis UI.  Userspace instrumentation
requires a single C/C++ header.  You don't have to access an external
web service to analyze trace data (a big no-no for devs working on
preproduction hardware).

Is it possible to build and run the Perfetto UI locally?

Yes, local UI builds are possible.
Also confirmed with the perfetto team  that
trace data is not uploaded unless you use the 'share' feature.



Can it display
arbitrary trace events that are written to
/sys/kernel/tracing/trace_marker ?

Yes, I believe it does support that via linux.ftrace data source. We use that for
example to overlay CPU sched data to show what process is on each core
throughout the timeline. There are many ftrace event types in
the perfetto protos.
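
For reference, the userspace markers mentioned above are just writes to the
tracefs marker file; a minimal sketch (the path may be
/sys/kernel/debug/tracing/trace_marker on older kernels):

  #include <fcntl.h>
  #include <unistd.h>
  #include <cstring>
  #include <cstdio>

  int main()
  {
     int fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
     if (fd < 0) {
        perror("open trace_marker");
        return 1;
     }
     /* Shows up as an ftrace "print" event on the timeline. */
     const char msg[] = "mesa: frame 1234 begin";
     ssize_t ret = write(fd, msg, sizeof(msg) - 1);
     (void)ret;
     close(fd);
     return 0;
  }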



Can it be extended to show i915 and
i915-perf-recorder events?


It can be extended to consume custom data sources. One way this is done is
via a bridge daemon, such as traced_probes which is responsible for
capturing data from ftrace and /proc during a trace session and sending it
to traced. traced is the main perfetto tracing 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Dylan Baker
So, we're vendoring something that we know getting bug fixes for will be an 
enormous pile of work? That sounds like a really bad idea.

On Fri, Feb 12, 2021, at 17:51, Rob Clark wrote:
> On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
> >
> >
> >
> > On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > > On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> > > >
> > >
> > > 
> > >
> > > > Runtime Characteristics
> > > >
> > > > ~500KB additional binary size. Even with using only the basic features 
> > > > of perfetto, it will increase the binary size of mesa by about 500KB.
> > >
> > > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > > *only* enabling freedreno is already ~6MB.. distros typically use
> > > "megadriver" (ie. all the drivers linked into a single .so with hard
> > > links for the different  ${driver}_dri.so), which on my fedora laptop
> > > is ~21M.  Maybe if anything is relevant it is how much of that
> > > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > > to worry about too much.
> > >
> > > > Background thread. Perfetto uses a background thread for communication 
> > > > with the system tracing daemon (traced) to advertise trace data and get 
> > > > notification of trace start/stop.
> > >
> > > Mesa already tends to have plenty of threads.. some of that depends on
> > > the driver, I think currently radeonsi is the threading king, but
> > > there are several other drivers working on threaded_context and async
> > > compile thread pool.
> > >
> > > It is worth mentioning that, AFAIU, perfetto can operate in
> > > self-server mode, which seems like it would be useful for distros
> > > which do not have the system daemon.  I'm not sure if we lose that
> > > with percetto?
> > >
> > > > Runtime overhead when disabled is designed to be optimal with one 
> > > > predicted branch, typically a few CPU cycles per event. While enabled, 
> > > > the overhead can be around 1 us per event.
> > > >
> > > > Integration Challenges
> > > >
> > > > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > > > templates, etc. There are ongoing discussions on providing an official 
> > > > perfetto C API, but it is not yet clear when this will land on the 
> > > > perfetto roadmap.
> > > > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K 
> > > > lines of code.
> > > > Anything that includes perfetto.h takes a long time to compile.
> > > > The current Perfetto SDK design is incompatible with being a shared 
> > > > library behind a C API.
> > >
> > > So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> > > But maybe we should verify that MSVC is happy with it, otherwise we
> > > need to take a bit more care in some parts of the codebase.
> > >
> > > As far as compile time, I wonder if we can regenerate the .cc/.h with
> > > only the gpu trace parts?  But I wouldn't expect the .h to be
> > > something widely included.  For example, for gpu timeline traces in
> > > freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> > > extern "C" {} around the callbacks that would hook into the
> > > u_tracepoint tracepoints.  That one file would pull in the perfetto
> > > .h, and we'd just not build that file if perfetto was disabled.
> > >
> > > Overall having to add our own extern C wrappers in some places doesn't
> > > seem like the *end* of the world.. a bit annoying, but we might end up
> > > doing that regardless if other folks want the ability to hook in
> > > something other than perfetto?
> > >
> > > 
> > >
> > > > Mesa Integration Alternatives
> > >
> > > I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> > > that is mostly because I expect to initially just add some basic gpu
> > > timeline tracepoints, but over time iterate on adding more.. it would
> > > be nice to not have to depend on a newer version of an external
> > > library at each step.  That is ofc only my $0.02..
> > >
> > > BR,
> > > -R
> > > ___
> > > mesa-dev mailing list
> > > mesa-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >
> >
> >
> > My experience is that vendoring just ends up being a huge pain for 
> > everyone, especially if the ui code stops working with our forked version, 
> > and we have to rebase all of our changes on upstream.
> >
> > Could we add meson build files and use a wrap for this if the distro 
> > doesn't ship the library? Id be willing to do/help with an initial port if 
> > that's what we wanted to do. But since this really a dev dependency i don't 
> > see why using a wrap would be a big deal.
> 
> I'm not a super huge fan of the perfetto approach of "just import the
> library", but at the end of the day the point of ABI compatibility is
> the protocol, not the API.. so it is actually safer to import the
> library into mesa's git tree
> 
> BR,
> -R
>

-- 
  Dylan Baker
  

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 5:35 PM Dylan Baker  wrote:
>
>
>
> On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> > On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> > >
> >
> > 
> >
> > > Runtime Characteristics
> > >
> > > ~500KB additional binary size. Even with using only the basic features of 
> > > perfetto, it will increase the binary size of mesa by about 500KB.
> >
> > IMHO, that size is negligible.. looking at freedreno, a mesa build
> > *only* enabling freedreno is already ~6MB.. distros typically use
> > "megadriver" (ie. all the drivers linked into a single .so with hard
> > links for the different  ${driver}_dri.so), which on my fedora laptop
> > is ~21M.  Maybe if anything is relevant it is how much of that
> > actually gets paged into RAM from disk, but I think 500K isn't a thing
> > to worry about too much.
> >
> > > Background thread. Perfetto uses a background thread for communication 
> > > with the system tracing daemon (traced) to advertise trace data and get 
> > > notification of trace start/stop.
> >
> > Mesa already tends to have plenty of threads.. some of that depends on
> > the driver, I think currently radeonsi is the threading king, but
> > there are several other drivers working on threaded_context and async
> > compile thread pool.
> >
> > It is worth mentioning that, AFAIU, perfetto can operate in
> > self-server mode, which seems like it would be useful for distros
> > which do not have the system daemon.  I'm not sure if we lose that
> > with percetto?
> >
> > > Runtime overhead when disabled is designed to be optimal with one 
> > > predicted branch, typically a few CPU cycles per event. While enabled, 
> > > the overhead can be around 1 us per event.
> > >
> > > Integration Challenges
> > >
> > > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > > templates, etc. There are ongoing discussions on providing an official 
> > > perfetto C API, but it is not yet clear when this will land on the 
> > > perfetto roadmap.
> > > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines 
> > > of code.
> > > Anything that includes perfetto.h takes a long time to compile.
> > > The current Perfetto SDK design is incompatible with being a shared 
> > > library behind a C API.
> >
> > So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> > But maybe we should verify that MSVC is happy with it, otherwise we
> > need to take a bit more care in some parts of the codebase.
> >
> > As far as compile time, I wonder if we can regenerate the .cc/.h with
> > only the gpu trace parts?  But I wouldn't expect the .h to be
> > something widely included.  For example, for gpu timeline traces in
> > freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> > extern "C" {} around the callbacks that would hook into the
> > u_tracepoint tracepoints.  That one file would pull in the perfetto
> > .h, and we'd just not build that file if perfetto was disabled.
> >
> > Overall having to add our own extern C wrappers in some places doesn't
> > seem like the *end* of the world.. a bit annoying, but we might end up
> > doing that regardless if other folks want the ability to hook in
> > something other than perfetto?
> >
> > 
> >
> > > Mesa Integration Alternatives
> >
> > I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> > that is mostly because I expect to initially just add some basic gpu
> > timeline tracepoints, but over time iterate on adding more.. it would
> > be nice to not have to depend on a newer version of an external
> > library at each step.  That is ofc only my $0.02..
> >
> > BR,
> > -R
> > ___
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >
>
>
> My experience is that vendoring just ends up being a huge pain for everyone, 
> especially if the ui code stops working with our forked version, and we have 
> to rebase all of our changes on upstream.
>
> Could we add meson build files and use a wrap for this if the distro doesn't 
> ship the library? I'd be willing to do/help with an initial port if that's 
> what we wanted to do. But since this is really a dev dependency I don't see why 
> using a wrap would be a big deal.

I'm not a super huge fan of the perfetto approach of "just import the
library", but at the end of the day the point of ABI compatibility is
the protocol, not the API.. so it is actually safer to import the
library into mesa's git tree

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
On Fri, Feb 12, 2021 at 4:34 PM Rob Clark  wrote:

> On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> >
>
> 
>
> > Runtime Characteristics
> >
> > ~500KB additional binary size. Even with using only the basic features
> of perfetto, it will increase the binary size of mesa by about 500KB.
>
> IMHO, that size is negligible.. looking at freedreno, a mesa build
> *only* enabling freedreno is already ~6MB.. distros typically use
> "megadriver" (ie. all the drivers linked into a single .so with hard
> links for the different  ${driver}_dri.so), which on my fedora laptop
> is ~21M.  Maybe if anything is relevant it is how much of that
> actually gets paged into RAM from disk, but I think 500K isn't a thing
> to worry about too much.
>
> > Background thread. Perfetto uses a background thread for communication
> with the system tracing daemon (traced) to advertise trace data and get
> notification of trace start/stop.
>
> Mesa already tends to have plenty of threads.. some of that depends on
> the driver, I think currently radeonsi is the threading king, but
> there are several other drivers working on threaded_context and async
> compile thread pool.
>
> It is worth mentioning that, AFAIU, perfetto can operate in
> self-server mode, which seems like it would be useful for distros
> which do not have the system daemon.  I'm not sure if we lose that
> with percetto?
>

Easy to add, but we want to avoid a runtime arg because it would add ~300KB to
the binary size. It's okay if we have an alternate init function, though.


>
> > Runtime overhead when disabled is designed to be optimal with one
> predicted branch, typically a few CPU cycles per event. While enabled, the
> overhead can be around 1 us per event.
> >
> > Integration Challenges
> >
> > The perfetto SDK is C++ and designed around macros, lambdas, inline
> templates, etc. There are ongoing discussions on providing an official
> perfetto C API, but it is not yet clear when this will land on the perfetto
> roadmap.
> > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines
> of code.
> > Anything that includes perfetto.h takes a long time to compile.
> > The current Perfetto SDK design is incompatible with being a shared
> library behind a C API.
>
> So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> But maybe we should verify that MSVC is happy with it, otherwise we
> need to take a bit more care in some parts of the codebase.
>
> As far as compile time, I wonder if we can regenerate the .cc/.h with
> only the gpu trace parts?  But I wouldn't expect the .h to be
> something widely included.  For example, for gpu timeline traces in
> freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> extern "C" {} around the callbacks that would hook into the
> u_tracepoint tracepoints.  That one file would pull in the perfetto
> .h, and we'd just not build that file if perfetto was disabled.
>

That works for GPU, but I'd like to see some slow CPU functions in traces
as well to help reason about performance problems. That ends up requiring
the trace header to be included in lots of places.
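
For example, for something like shader compiles I'd want roughly the
following (a sketch using the perfetto SDK macros from perfetto.h; the
header name and the compile function here are made up):

  // mesa_trace.h (hypothetical) -- the one header that wraps perfetto.h
  #include <perfetto.h>

  PERFETTO_DEFINE_CATEGORIES(
     perfetto::Category("mesa").SetDescription("Mesa CPU events"));

  // Plus, once in some .cc: PERFETTO_TRACK_EVENT_STATIC_STORAGE();
  // and at startup: perfetto::Tracing::Initialize(...) followed by
  // perfetto::TrackEvent::Register().

  // Hypothetical call site somewhere in a shader-compile path:
  static void compile_one_shader(void)
  {
     TRACE_EVENT("mesa", "compile_shader");  // scoped slice, ends at return
     /* ... the actual compile ... */
  }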

> Overall having to add our own extern C wrappers in some places doesn't
> seem like the *end* of the world.. a bit annoying, but we might end up
> doing that regardless if other folks want the ability to hook in
> something other than perfetto?
>

It's more than extern C wrappers if we want to minimize overhead while
tracing is enabled at compile time. Have a look at percetto.h.
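
The basic pattern (a hand-rolled sketch of the idea, not percetto's
actual API) is a per-category enabled flag so that the disabled case
stays a single predicted branch:

  #include <atomic>

  struct trace_category {
     std::atomic<bool> enabled{false};  // flipped by the tracing backend
  };

  extern trace_category mesa_cat;

  // Slow paths live in one .cc that talks to the tracing backend; they
  // are only reached while a trace session is active.
  void trace_begin_slow(trace_category *cat, const char *name);
  void trace_end_slow(trace_category *cat);

  #define MESA_TRACE_BEGIN(name)                                          \
     do {                                                                 \
        if (__builtin_expect(                                             \
               mesa_cat.enabled.load(std::memory_order_relaxed), 0))      \
           trace_begin_slow(&mesa_cat, name);                             \
     } while (0)

  #define MESA_TRACE_END()                                                \
     do {                                                                 \
        if (__builtin_expect(                                             \
               mesa_cat.enabled.load(std::memory_order_relaxed), 0))      \
           trace_end_slow(&mesa_cat);                                     \
     } while (0)

Instrumented functions then just do MESA_TRACE_BEGIN("compile_shader");
... MESA_TRACE_END(); and pay (nearly) nothing when tracing is off.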


>
> 
>
> > Mesa Integration Alternatives
>
> I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> that is mostly because I expect to initially just add some basic gpu
> timeline tracepoints, but over time iterate on adding more.. it would
> be nice to not have to depend on a newer version of an external
> library at each step.  That is ofc only my $0.02..
>

It's a small initial setup tax, true, but I still think it depends on what
perfetto features we plan to use -- for only a couple files doing GPU
tracing I agree percetto is unnecessary, but for CPU tracing it gets more
complicated.


>
> BR,
> -R
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Dylan Baker



On Fri, Feb 12, 2021, at 16:36, Rob Clark wrote:
> On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
> >
> 
> 
> 
> > Runtime Characteristics
> >
> > ~500KB additional binary size. Even with using only the basic features of 
> > perfetto, it will increase the binary size of mesa by about 500KB.
> 
> IMHO, that size is negligible.. looking at freedreno, a mesa build
> *only* enabling freedreno is already ~6MB.. distros typically use
> "megadriver" (ie. all the drivers linked into a single .so with hard
> links for the different  ${driver}_dri.so), which on my fedora laptop
> is ~21M.  Maybe if anything is relevant it is how much of that
> actually gets paged into RAM from disk, but I think 500K isn't a thing
> to worry about too much.
> 
> > Background thread. Perfetto uses a background thread for communication with 
> > the system tracing daemon (traced) to advertise trace data and get 
> > notification of trace start/stop.
> 
> Mesa already tends to have plenty of threads.. some of that depends on
> the driver, I think currently radeonsi is the threading king, but
> there are several other drivers working on threaded_context and async
> compile thread pool.
> 
> It is worth mentioning that, AFAIU, perfetto can operate in
> self-server mode, which seems like it would be useful for distros
> which do not have the system daemon.  I'm not sure if we lose that
> with percetto?
> 
> > Runtime overhead when disabled is designed to be optimal with one predicted 
> > branch, typically a few CPU cycles per event. While enabled, the overhead 
> > can be around 1 us per event.
> >
> > Integration Challenges
> >
> > The perfetto SDK is C++ and designed around macros, lambdas, inline 
> > templates, etc. There are ongoing discussions on providing an official 
> > perfetto C API, but it is not yet clear when this will land on the perfetto 
> > roadmap.
> > The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines of 
> > code.
> > Anything that includes perfetto.h takes a long time to compile.
> > The current Perfetto SDK design is incompatible with being a shared library 
> > behind a C API.
> 
> So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
> But maybe we should verify that MSVC is happy with it, otherwise we
> need to take a bit more care in some parts of the codebase.
> 
> As far as compile time, I wonder if we can regenerate the .cc/.h with
> only the gpu trace parts?  But I wouldn't expect the .h to be
> something widely included.  For example, for gpu timeline traces in
> freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
> extern "C" {} around the callbacks that would hook into the
> u_tracepoint tracepoints.  That one file would pull in the perfetto
> .h, and we'd just not build that file if perfetto was disabled.
> 
> Overall having to add our own extern C wrappers in some places doesn't
> seem like the *end* of the world.. a bit annoying, but we might end up
> doing that regardless if other folks want the ability to hook in
> something other than perfetto?
> 
> 
> 
> > Mesa Integration Alternatives
> 
> I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
> that is mostly because I expect to initially just add some basic gpu
> timeline tracepoints, but over time iterate on adding more.. it would
> be nice to not have to depend on a newer version of an external
> library at each step.  That is ofc only my $0.02..
> 
> BR,
> -R
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>


My experience is that vendoring just ends up being a huge pain for everyone, 
especially if the ui code stops working with our forked version, and we have to 
rebase all of our changes on upstream.

Could we add meson build files and use a wrap for this if the distro doesn't
ship the library? I'd be willing to do/help with an initial port if that's what
we wanted to do. But since this is really a dev dependency I don't see why using
a wrap would be a big deal.
-- 
  Dylan Baker
  dy...@pnwbakers.com
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
 wrote:
>
> We're kind of in the same boat for Intel.
>
> Access to GPU perf counters is exclusive to a single process if you want
> to build a timeline of the work (because preemption etc...).

ugg, does that mean extensions like AMD_performance_monitor don't
actually work on intel?

> The best information we could add from mesa would be a timestamp of when a
> particular drawcall started.
> But that's pretty much what timestamp queries are.
>
> Were you thinking of particular GPU generated data you don't get from
> gfx-pps?

From the looks of it, currently I don't get *any* GPU generated data
from gfx-pps ;-)

We can ofc sample counters from a separate process as well... I have a
curses tool (fdperf) which does this.. but running outside of the gpu
cmdstream, plus counters losing context across suspend/resume, makes it
less than perfect.  And something that works the same way as
AMD_performance_monitor under the hood gives a more precise look at
which shaders (for ex) are consuming the most cycles.  For cases where
we can profile a trace, frameretrace and related tools are pretty
great.. but it would be nice to have similar visibility for actual
games (which for me, mostly means android games, since so far no
aarch64 steam store), and also to give game developers good tools (or
at least the same tools that they get with other closed src drivers on
android).

BR,
-R

> Thanks,
>
> -Lionel
>
>
> On 13/02/2021 00:12, Alyssa Rosenzweig wrote:
> > My 2c for Mali/Panfrost --
> >
> > For us, capturing GPU perf counters is orthogonal to rendering. It's
> > expected (e.g. with Arm's tools) to do this from a separate process.
> > Neither Mesa nor the DDK should require custom instrumentation for the
> > low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
> > Perfetto as it is. So for us I don't see the value in modifying Mesa for
> > tracing.
> >
> > On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
> >> (responding from correct address this time)
> >>
> >> On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:
> >>
> >>> I've recently been using GPUVis to look at trace events.  On Intel
> >>> platforms, GPUVis incorporates ftrace events from the i915 driver,
> >>> performance metrics from igt-gpu-tools, and userspace ftrace markers
> >>> that I locally hack up in Mesa.
> >>>
> >> GPUVis is great. I would love to see that data combined with
> >> userspace events without any need for local hacks. Perfetto provides
> >> on-demand trace events with lower overhead compared to ftrace, so for
> >> example it is acceptable to have production trace instrumentation that can
> >> be captured without dev builds. To do that with ftrace it may require a way
> >> to enable and disable the ftrace file writes to avoid the overhead when
> >> tracing is not in use. This is what Android does with systrace/atrace, for
> >> example, it uses Binder to notify processes about trace sessions. Perfetto
> >> does that in a more portable way.
> >>
> >>
> >>> It is very easy to compile the GPUVis UI.  Userspace instrumentation
> >>> requires a single C/C++ header.  You don't have to access an external
> >>> web service to analyze trace data (a big no-no for devs working on
> >>> preproduction hardware).
> >>>
> >>> Is it possible to build and run the Perfetto UI locally?
> >>
> >> Yes, local UI builds are possible
> >> .
> >> Also confirmed with the perfetto team  that
> >> trace data is not uploaded unless you use the 'share' feature.
> >>
> >>
> >>>Can it display
> >>> arbitrary trace events that are written to
> >>> /sys/kernel/tracing/trace_marker ?
> >>
> >> Yes, I believe it does support that via linux.ftrace data source
> >> . We use that for
> >> example to overlay CPU sched data to show what process is on each core
> >> throughout the timeline. There are many ftrace event types
> >> 
> >> in
> >> the perfetto protos.
> >>
> >>
> >>> Can it be extended to show i915 and
> >>> i915-perf-recorder events?
> >>>
> >> It can be extended to consume custom data sources. One way this is done is
> >> via a bridge daemon, such as traced_probes which is responsible for
> >> capturing data from ftrace and /proc during a trace session and sending it
> >> to traced. traced is the main perfetto tracing daemon that notifies all
> >> trace data sources to start/stop tracing and communicates with user tracing
> >> requests via the 'perfetto' command.
> >>
> >>
> >>
> >>> John Bates  writes:
> >>>
>  I recently opened issue 4262
>   to begin the
>  discussion on integrating perfetto into mesa.
> 
>  *Background*
> 
>  

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Lionel Landwerlin

We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want 
to build a timeline of the work (because preemption etc...).


The best information we could add from mesa would be a timestamp of when a
particular drawcall started.

But that's pretty much what timestamp queries are.
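
i.e. something an app or layer can already do on its own, along these
lines (just a sketch, error handling omitted; the pool is assumed to be
created with VK_QUERY_TYPE_TIMESTAMP, queryCount = 2, and already reset):

  #include <vulkan/vulkan.h>
  #include <cstdint>

  // Bracket one draw with GPU timestamps.
  void record_timed_draw(VkCommandBuffer cmd, VkQueryPool pool,
                         uint32_t vertex_count)
  {
     vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
     vkCmdDraw(cmd, vertex_count, 1, 0, 0);
     vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
  }

  // After the submit has completed:
  void read_draw_time(VkDevice device, VkQueryPool pool, uint64_t ts[2])
  {
     vkGetQueryPoolResults(device, pool, 0, 2, 2 * sizeof(uint64_t), ts,
                           sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
     // (ts[1] - ts[0]) * timestampPeriod = GPU time spent in the draw
  }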

Were you thinking of particular GPU generated data you don't get from 
gfx-pps?


Thanks,

-Lionel


On 13/02/2021 00:12, Alyssa Rosenzweig wrote:

My 2c for Mali/Panfrost --

For us, capturing GPU perf counters is orthogonal to rendering. It's
expected (e.g. with Arm's tools) to do this from a separate process.
Neither Mesa nor the DDK should require custom instrumentation for the
low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
Perfetto as it is. So for us I don't see the value in modifying Mesa for
tracing.

On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:

(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:


I've recently been using GPUVis to look at trace events.  On Intel
platforms, GPUVis incorporates ftrace events from the i915 driver,
performance metrics from igt-gpu-tools, and userspace ftrace markers
that I locally hack up in Mesa.


GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead compared to ftrace, so for
example it is acceptable to have production trace instrumentation that can
be captured without dev builds. To do that with ftrace it may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace, for
example, it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.



It is very easy to compile the GPUVis UI.  Userspace instrumentation
requires a single C/C++ header.  You don't have to access an external
web service to analyze trace data (a big no-no for devs working on
preproduction hardware).

Is it possible to build and run the Perfetto UI locally?


Yes, local UI builds are possible
.
Also confirmed with the perfetto team  that
trace data is not uploaded unless you use the 'share' feature.



   Can it display
arbitrary trace events that are written to
/sys/kernel/tracing/trace_marker ?


Yes, I believe it does support that via linux.ftrace data source
. We use that for
example to overlay CPU sched data to show what process is on each core
throughout the timeline. There are many ftrace event types

in
the perfetto protos.



Can it be extended to show i915 and
i915-perf-recorder events?


It can be extended to consume custom data sources. One way this is done is
via a bridge daemon, such as traced_probes which is responsible for
capturing data from ftrace and /proc during a trace session and sending it
to traced. traced is the main perfetto tracing daemon that notifies all
trace data sources to start/stop tracing and communicates with user tracing
requests via the 'perfetto' command.




John Bates  writes:


I recently opened issue 4262
 to begin the
discussion on integrating perfetto into mesa.

*Background*

System-wide tracing is an invaluable tool for developers to find and fix
performance problems. The perfetto project enables a combined view of

trace

data from kernel ftrace, GPU driver and various manually-instrumented
tracepoints throughout the application and system. This helps developers
quickly answer questions like:

- How long are frames taking?
- What caused a particular frame drop?
- Is it CPU bound or GPU bound?
- Did a CPU core frequency drop cause something to go slower than

usual?

- Is something else running that is stealing CPU or GPU time? Could I
fix that with better thread/context priorities?
- Are all CPU cores being used effectively? Do I need

sched_setaffinity

to keep my thread on a big or little core?
- What’s the latency between CPU frame submit and GPU start?

*What Does Mesa + Perfetto Provide?*

Mesa is in a unique position to produce GPU trace data for several GPU
vendors without requiring the developer to build and install additional
tools like gfx-pps .

The key is making it easy for developers to use. Ideally, perfetto is
eventually available by default in mesa so that if your system has

perfetto

traced running, you just need to run perfetto (perhaps along with setting
an environment variable) with the mesa categories to see:

- GPU processing timeline 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 4:51 PM Mark Janes  wrote:
>
> Rob Clark  writes:
>
> > On Fri, Feb 12, 2021 at 5:01 AM Tamminen, Eero T
> >  wrote:
> >>
> >> Hi,
> >>
> >> On Thu, 2021-02-11 at 17:39 -0800, John Bates wrote:
> >> > I recently opened issue 4262
> >> >  to begin the
> >> > discussion on integrating perfetto into mesa.
> >> >
> >> > *Background*
> >> >
> >> > System-wide tracing is an invaluable tool for developers to find and
> >> > fix
> >> > performance problems. The perfetto project enables a combined view of
> >> > trace
> >> > data from kernel ftrace, GPU driver and various manually-instrumented
> >> > tracepoints throughout the application and system.
> >>
> >> Unlike some other Linux tracing solutions, Perfetto appears to be for
> >> Android / Chrome(OS?), and not available in common Linux distro
> >> repos.
> >
> > I don't think there is anything about perfetto that would not be
> > usable in a generic linux distro.. and mesa support for perfetto would
> > perhaps be a compelling reason for distros to add support
> >
> >> So, why Perfetto instead of one of the other solutions, e.g. from ones
> >> mentioned here:
> >> https://tracingsummit.org/ts/2018/
> >> ?
> >>
> >> And, if tracing API is added to Mesa, shouldn't it support also
> >> tracepoints for other tracing solutions?
> >
> > perfetto does have systrace collectors
> >
> > And a general comment on perfetto vs other things.. we end up needing
> > to support perfetto regardless (for android and CrOS).. we don't
> > *need* to enable it on generic linux, but I think we should (but maybe
> > using the mode that does not require a system server.. at least
> > initially.. that may limit its ability to collect systrace and traces
> > from other parts of the system, but that wouldn't depend on distros
> > enabling perfetto system server).
>
> Perfetto seems like an awful lot of infrastructure to capture trace
> events.  Why not follow the example of GPUVis, and write generic
> trace_markers to ftrace?  It limits impact to Mesa, while allowing any
> trace visualizer to use the trace points.

I'm not really seeing how that would cover anything more than CPU
based events.. which is kind of the smallest part of what I'm
interested in..

> >> I mean, code added to drivers themselves preferably should not have
> >> anything perfetto/percetto specific.  Tracing system specific code
> >> should be only in one place (even if it's just macros in common header).
> >>
> >>
> >> > This helps developers
> >> > quickly answer questions like:
> >> >
> >> >- How long are frames taking?
> >>
> >> That doesn't require any changes to Mesa.  Just set uprobe for suitable
> >> buffer swap function [1], and parse kernel ftrace events.  This way
> >> starting tracing doesn't require even restarting the tracked processes.
> >>
> >
> > But this doesn't tell you how long the GPU is spending doing what.  My
> > rough idea is to hook up an optional callback to u_tracepoint so we
> > can generate perfetto traces on the GPU timeline (ie. with
> > timestamps captured from GPU), fwiw
>
> I implemented a feature called INTEL_MEASURE based off of a tool that
> Ken wrote.  It captures render/batch/frame timestamps in a BO, providing
> durations on the GPU timeline.  It works for Iris and Anv.
>
> The approach provides accurate gpu timing, with minimal stalling.  This
> data could be presented in Perfetto or GPUVis.

have a look at u_trace.. it is basically this but implemented in a way
that is (hopefully) useful to other drivers..

(there is some small gallium dependency currently, although at some
point when I'm spending more time on vk optimization I'll hoist it out
of gallium/aux/util, unless someone else gets there first)

BR,
-R

> > BR,
> > -R
> >
> >> [1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT,
> >> anv_QueuePresentKHR[2]..
> >>
> >> [2] Many apps resolve "vkQueuePresentKHR" Vulkan API loader wrapper
> >> function and call the backend function like "anv_QueuePresentKHR"
> >> directly, so it's better to track the latter instead.
> >>
> >>
> >> >- What caused a particular frame drop?
> >> >- Is it CPU bound or GPU bound?
> >>
> >> That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
> >> utilization (which is a lower level thing).
> >>
> >>
> >> >- Did a CPU core frequency drop cause something to go slower than
> >> > usual?
> >>
> >> Note that nowadays actual CPU frequencies are often controlled by HW /
> >> firmware, so you don't necessarily get any ftrace event from freq
> >> change, you would need to poll MSR registers instead (which is
> >> privileged operation, and polling can easily miss changes).
> >>
> >>
> >> >- Is something else running that is stealing CPU or GPU time? Could
> >> > I
> >> >fix that with better thread/context priorities?
> >> >- Are all CPU cores being used effectively? Do I need
> >> > sched_setaffinity
> 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Mark Janes
Rob Clark  writes:

> On Fri, Feb 12, 2021 at 5:01 AM Tamminen, Eero T
>  wrote:
>>
>> Hi,
>>
>> On Thu, 2021-02-11 at 17:39 -0800, John Bates wrote:
>> > I recently opened issue 4262
>> >  to begin the
>> > discussion on integrating perfetto into mesa.
>> >
>> > *Background*
>> >
>> > System-wide tracing is an invaluable tool for developers to find and
>> > fix
>> > performance problems. The perfetto project enables a combined view of
>> > trace
>> > data from kernel ftrace, GPU driver and various manually-instrumented
>> > tracepoints throughout the application and system.
>>
>> Unlike some other Linux tracing solutions, Perfetto appears to be for
>> Android / Chrome(OS?), and not available in common Linux distro
>> repos.
>
> I don't think there is anything about perfetto that would not be
> usable in a generic linux distro.. and mesa support for perfetto would
> perhaps be a compelling reason for distros to add support
>
>> So, why Perfetto instead of one of the other solutions, e.g. from ones
>> mentioned here:
>> https://tracingsummit.org/ts/2018/
>> ?
>>
>> And, if tracing API is added to Mesa, shouldn't it support also
>> tracepoints for other tracing solutions?
>
> perfetto does have systrace collectors
>
> And a general comment on perfetto vs other things.. we end up needing
> to support perfetto regardless (for android and CrOS).. we don't
> *need* to enable it on generic linux, but I think we should (but maybe
> using the mode that does not require a system server.. at least
> initially.. that may limit its ability to collect systrace and traces
> from other parts of the system, but that wouldn't depend on distros
> enabling perfetto system server).

Perfetto seems like an awful lot of infrastructure to capture trace
events.  Why not follow the example of GPUVis, and write generic
trace_markers to ftrace?  It limits impact to Mesa, while allowing any
trace visualizer to use the trace points.
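
For reference, the userspace side of that is tiny; roughly (a sketch,
error handling and the older /sys/kernel/debug/tracing path omitted):

  #include <fcntl.h>
  #include <unistd.h>
  #include <cstring>

  static int trace_fd = -1;

  // Emit an arbitrary marker into the kernel ftrace buffer; GPUVis (or
  // any other ftrace-based viewer) can then show it on the timeline.
  static void trace_marker_write(const char *msg)
  {
     if (trace_fd < 0)
        trace_fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
     if (trace_fd >= 0)
        (void)write(trace_fd, msg, strlen(msg));
  }

  // e.g. trace_marker_write("mesa: compile_shader begin\n");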

>> I mean, code added to drivers themselves preferably should not have
>> anything perfetto/percetto specific.  Tracing system specific code
>> should be only in one place (even if it's just macros in common header).
>>
>>
>> > This helps developers
>> > quickly answer questions like:
>> >
>> >- How long are frames taking?
>>
>> That doesn't require any changes to Mesa.  Just set uprobe for suitable
>> buffer swap function [1], and parse kernel ftrace events.  This way
>> starting tracing doesn't require even restarting the tracked processes.
>>
>
> But this doesn't tell you how long the GPU is spending doing what.  My
> rough idea is to hook up an optional callback to u_tracepoint so we
> can generate perfetto traces on the GPU timeline (ie. with
> timestamps captured from GPU), fwiw

I implemented a feature called INTEL_MEASURE based off of a tool that
Ken wrote.  It captures render/batch/frame timestamps in a BO, providing
durations on the GPU timeline.  It works for Iris and Anv.

The approach provides accurate gpu timing, with minimal stalling.  This
data could be presented in Perfetto or GPUVis.

> BR,
> -R
>
>> [1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT,
>> anv_QueuePresentKHR[2]..
>>
>> [2] Many apps resolve "vkQueuePresentKHR" Vulkan API loader wrapper
>> function and call the backend function like "anv_QueuePresentKHR"
>> directly, so it's better to track the latter instead.
>>
>>
>> >- What caused a particular frame drop?
>> >- Is it CPU bound or GPU bound?
>>
>> That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
>> utilization (which is a lower level thing).
>>
>>
>> >- Did a CPU core frequency drop cause something to go slower than
>> > usual?
>>
>> Note that nowadays actual CPU frequencies are often controlled by HW /
>> firmware, so you don't necessarily get any ftrace event from freq
>> change, you would need to poll MSR registers instead (which is
>> privileged operation, and polling can easily miss changes).
>>
>>
>> >- Is something else running that is stealing CPU or GPU time? Could
>> > I
>> >fix that with better thread/context priorities?
>> >- Are all CPU cores being used effectively? Do I need
>> > sched_setaffinity
>> >to keep my thread on a big or little core?
>>
>> I don't think these require adding tracepoints to Mesa either...
>>
>>
>> >- What’s the latency between CPU frame submit and GPU start?
>>
>> I think this would require tracepoints in kernel GPU code more than in
>> Mesa?
>>
>>
>> - Eero
>>
>>
>> > *What Does Mesa + Perfetto Provide?*
>> >
>> > Mesa is in a unique position to produce GPU trace data for several GPU
>> > vendors without requiring the developer to build and install
>> > additional
>> > tools like gfx-pps .
>> >
>> > The key is making it easy for developers to use. Ideally, perfetto is
>> > eventually available by default in mesa so that if your 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Thu, Feb 11, 2021 at 5:40 PM John Bates  wrote:
>



> Runtime Characteristics
>
> ~500KB additional binary size. Even with using only the basic features of 
> perfetto, it will increase the binary size of mesa by about 500KB.

IMHO, that size is negligible.. looking at freedreno, a mesa build
*only* enabling freedreno is already ~6MB.. distros typically use
"megadriver" (ie. all the drivers linked into a single .so with hard
links for the different  ${driver}_dri.so), which on my fedora laptop
is ~21M.  Maybe if anything is relevant it is how much of that
actually gets paged into RAM from disk, but I think 500K isn't a thing
to worry about too much.

> Background thread. Perfetto uses a background thread for communication with 
> the system tracing daemon (traced) to advertise trace data and get 
> notification of trace start/stop.

Mesa already tends to have plenty of threads.. some of that depends on
the driver, I think currently radeonsi is the threading king, but
there are several other drivers working on threaded_context and async
compile thread pool.

It is worth mentioning that, AFAIU, perfetto can operate in
self-server mode, which seems like it would be useful for distros
which do not have the system daemon.  I'm not sure if we lose that
with percetto?

> Runtime overhead when disabled is designed to be optimal with one predicted 
> branch, typically a few CPU cycles per event. While enabled, the overhead can 
> be around 1 us per event.
>
> Integration Challenges
>
> The perfetto SDK is C++ and designed around macros, lambdas, inline 
> templates, etc. There are ongoing discussions on providing an official 
> perfetto C API, but it is not yet clear when this will land on the perfetto 
> roadmap.
> The perfetto SDK is an amalgamated .h and .cc that adds up to 100K lines of 
> code.
> Anything that includes perfetto.h takes a long time to compile.
> The current Perfetto SDK design is incompatible with being a shared library 
> behind a C API.

So, C++ on its own isn't a showstopper, mesa has plenty of C++ code.
But maybe we should verify that MSVC is happy with it, otherwise we
need to take a bit more care in some parts of the codebase.

As far as compile time, I wonder if we can regenerate the .cc/.h with
only the gpu trace parts?  But I wouldn't expect the .h to be
something widely included.  For example, for gpu timeline traces in
freedreno, I'm expecting it to look like a freedreno_perfetto.cc with
extern "C" {} around the callbacks that would hook into the
u_tracepoint tracepoints.  That one file would pull in the perfetto
.h, and we'd just not build that file if perfetto was disabled.
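
ie. something vaguely like this (very hand-wavy sketch, all the names
are made up, and I'm glossing over how the GPU timestamps from u_trace
get attached to the events):

  // freedreno_perfetto.cc -- only built when perfetto is enabled
  #include <perfetto.h>

  PERFETTO_DEFINE_CATEGORIES(perfetto::Category("gpu.freedreno"));
  PERFETTO_TRACK_EVENT_STATIC_STORAGE();

  extern "C" {

  /* Hypothetical hooks called from the C side via u_tracepoint
   * callbacks; the perfetto types never leak out of this file. */
  void fd_perfetto_binning_begin(void)
  {
     TRACE_EVENT_BEGIN("gpu.freedreno", "binning_pass");
  }

  void fd_perfetto_binning_end(void)
  {
     TRACE_EVENT_END("gpu.freedreno");
  }

  }  /* extern "C" */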

Overall having to add our own extern C wrappers in some places doesn't
seem like the *end* of the world.. a bit annoying, but we might end up
doing that regardless if other folks want the ability to hook in
something other than perfetto?



> Mesa Integration Alternatives

I'm kind of leaning towards the "just slurp in the .cc/.h" approach..
that is mostly because I expect to initially just add some basic gpu
timeline tracepoints, but over time iterate on adding more.. it would
be nice to not have to depend on a newer version of an external
library at each step.  That is ofc only my $0.02..

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
On Fri, Feb 12, 2021 at 5:01 AM Tamminen, Eero T
 wrote:
>
> Hi,
>
> On Thu, 2021-02-11 at 17:39 -0800, John Bates wrote:
> > I recently opened issue 4262
> >  to begin the
> > discussion on integrating perfetto into mesa.
> >
> > *Background*
> >
> > System-wide tracing is an invaluable tool for developers to find and
> > fix
> > performance problems. The perfetto project enables a combined view of
> > trace
> > data from kernel ftrace, GPU driver and various manually-instrumented
> > tracepoints throughout the application and system.
>
> Unlike some other Linux tracing solutions, Perfetto appears to be for
> Android / Chrome(OS?), and not available in common Linux distro
> repos.

I don't think there is anything about perfetto that would not be
usable in a generic linux distro.. and mesa support for perfetto would
perhaps be a compelling reason for distros to add support

> So, why Perfetto instead of one of the other solutions, e.g. from ones
> mentioned here:
> https://tracingsummit.org/ts/2018/
> ?
>
> And, if tracing API is added to Mesa, shouldn't it support also
> tracepoints for other tracing solutions?

perfetto does have systrace collectors

And a general comment on perfetto vs other things.. we end up needing
to support perfetto regardless (for android and CrOS).. we don't
*need* to enable it on generic linux, but I think we should (but maybe
using the mode that does not require a system server.. at least
initially.. that may limit its ability to collect systrace and traces
from other parts of the system, but that wouldn't depend on distros
enabling perfetto system server).

> I mean, code added to drivers themselves preferably should not have
> anything perfetto/percetto specific.  Tracing system specific code
> should be only in one place (even if it's just macros in common header).
>
>
> > This helps developers
> > quickly answer questions like:
> >
> >- How long are frames taking?
>
> That doesn't require any changes to Mesa.  Just set uprobe for suitable
> buffer swap function [1], and parse kernel ftrace events.  This way
> starting tracing doesn't require even restarting the tracked processes.
>

But this doesn't tell you how long the GPU is spending doing what.  My
rough idea is to hook up an optional callback to u_tracepoint so we
can generate perfetto traces on the GPU timeline (ie. with
timestamps captured from GPU), fwiw

BR,
-R

> [1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT,
> anv_QueuePresentKHR[2]..
>
> [2] Many apps resolve "vkQueuePresentKHR" Vulkan API loader wrapper
> function and call the backend function like "anv_QueuePresentKHR"
> directly, so it's better to track the latter instead.
>
>
> >- What caused a particular frame drop?
> >- Is it CPU bound or GPU bound?
>
> That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
> utilization (which is a lower level thing).
>
>
> >- Did a CPU core frequency drop cause something to go slower than
> > usual?
>
> Note that nowadays actual CPU frequencies are often controlled by HW /
> firmware, so you don't necessarily get any ftrace event from freq
> change, you would need to poll MSR registers instead (which is
> privileged operation, and polling can easily miss changes).
>
>
> >- Is something else running that is stealing CPU or GPU time? Could
> > I
> >fix that with better thread/context priorities?
> >- Are all CPU cores being used effectively? Do I need
> > sched_setaffinity
> >to keep my thread on a big or little core?
>
> I don't think these require adding tracepoints to Mesa either...
>
>
> >- What’s the latency between CPU frame submit and GPU start?
>
> I think this would require tracepoints in kernel GPU code more than in
> Mesa?
>
>
> - Eero
>
>
> > *What Does Mesa + Perfetto Provide?*
> >
> > Mesa is in a unique position to produce GPU trace data for several GPU
> > vendors without requiring the developer to build and install
> > additional
> > tools like gfx-pps .
> >
> > The key is making it easy for developers to use. Ideally, perfetto is
> > eventually available by default in mesa so that if your system has
> > perfetto
> > traced running, you just need to run perfetto (perhaps along with
> > setting
> > an environment variable) with the mesa categories to see:
> >
> >- GPU processing timeline events.
> >- GPU counters.
> >- CPU events for potentially slow functions in mesa like shader
> > compiles.
> >
> > Example of what this data might look like (with fake GPU events):
> > [image: percetto-gpu-example.png]
> >
> > *Runtime Characteristics*
> >
> >- ~500KB additional binary size. Even with using only the basic
> > features
> >of perfetto, it will increase the binary size of mesa by about
> > 500KB.
> >- Background thread. Perfetto uses a background thread for
> > communication
> >with 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Rob Clark
yes, but that is a limitation of mali which does not apply to a lot of
other drivers ;-)

But AFAIU typically you'd use perfetto with a sort of system server
collecting trace data from various different processes, so the fact
that the mali perf counters come from somewhere else doesn't
really matter

And this is about more than just perf cntrs, I plan to wire up the
u_tracepoint stuff to perfetto events (or rather provide a way to hook
up individual tracepoints) so that we can see on a timeline things
like how long the binning pass took, how long tile passes take (broken
down into restore/draw/resolve).  I think we most definitely want
perfetto support in mesa.  It can be optional, but I'm hoping linux
distros start enabling perfetto when they have a compelling reason to
(ie. mesa gpu perf analysis)

BR,
-R

On Fri, Feb 12, 2021 at 2:13 PM Alyssa Rosenzweig
 wrote:
>
> My 2c for Mali/Panfrost --
>
> For us, capturing GPU perf counters is orthogonal to rendering. It's
> expected (e.g. with Arm's tools) to do this from a separate process.
> Neither Mesa nor the DDK should require custom instrumentation for the
> low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
> Perfetto as it is. So for us I don't see the value in modifying Mesa for
> tracing.
>
> On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
> > (responding from correct address this time)
> >
> > On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:
> >
> > > I've recently been using GPUVis to look at trace events.  On Intel
> > > platforms, GPUVis incorporates ftrace events from the i915 driver,
> > > performance metrics from igt-gpu-tools, and userspace ftrace markers
> > > that I locally hack up in Mesa.
> > >
> >
> > GPUVis is great. I would love to see that data combined with
> > userspace events without any need for local hacks. Perfetto provides
> > on-demand trace events with lower overhead compared to ftrace, so for
> > example it is acceptable to have production trace instrumentation that can
> > be captured without dev builds. To do that with ftrace it may require a way
> > to enable and disable the ftrace file writes to avoid the overhead when
> > tracing is not in use. This is what Android does with systrace/atrace, for
> > example, it uses Binder to notify processes about trace sessions. Perfetto
> > does that in a more portable way.
> >
> >
> > >
> > > It is very easy to compile the GPUVis UI.  Userspace instrumentation
> > > requires a single C/C++ header.  You don't have to access an external
> > > web service to analyze trace data (a big no-no for devs working on
> > > preproduction hardware).
> > >
> > > Is it possible to build and run the Perfetto UI locally?
> >
> >
> > Yes, local UI builds are possible
> > .
> > Also confirmed with the perfetto team  that
> > trace data is not uploaded unless you use the 'share' feature.
> >
> >
> > >   Can it display
> > > arbitrary trace events that are written to
> > > /sys/kernel/tracing/trace_marker ?
> >
> >
> > Yes, I believe it does support that via linux.ftrace data source
> > . We use that for
> > example to overlay CPU sched data to show what process is on each core
> > throughout the timeline. There are many ftrace event types
> > 
> > in
> > the perfetto protos.
> >
> >
> > > Can it be extended to show i915 and
> > > i915-perf-recorder events?
> > >
> >
> > It can be extended to consume custom data sources. One way this is done is
> > via a bridge daemon, such as traced_probes which is responsible for
> > capturing data from ftrace and /proc during a trace session and sending it
> > to traced. traced is the main perfetto tracing daemon that notifies all
> > trace data sources to start/stop tracing and communicates with user tracing
> > requests via the 'perfetto' command.
> >
> >
> >
> > >
> > > John Bates  writes:
> > >
> > > > I recently opened issue 4262
> > > >  to begin the
> > > > discussion on integrating perfetto into mesa.
> > > >
> > > > *Background*
> > > >
> > > > System-wide tracing is an invaluable tool for developers to find and fix
> > > > performance problems. The perfetto project enables a combined view of
> > > trace
> > > > data from kernel ftrace, GPU driver and various manually-instrumented
> > > > tracepoints throughout the application and system. This helps developers
> > > > quickly answer questions like:
> > > >
> > > >- How long are frames taking?
> > > >- What caused a particular frame drop?
> > > >- Is it CPU bound or GPU bound?
> > > >- Did a CPU core frequency drop cause something to go slower than
> > > usual?
> > > >- Is something else running that is 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Alyssa Rosenzweig
Sure, I definitely see the use case for virgl :)

On Fri, Feb 12, 2021 at 02:43:25PM -0800, Chia-I Wu wrote:
> For virgl, where the biggest perf gaps often come from unnecessary CPU
> waits or high latencies of fence signaling, being able to insert
> userspace driver trace events and combine them with kernel ftrace
> events are a big plus.  Admittedly, there is no HW counters and my
> needs are simpler (inserting function begin/end and wait begin/end and
> combining them with virtio-gpu and dma-fence ftrace events).
> 
> On Fri, Feb 12, 2021 at 2:13 PM Alyssa Rosenzweig
>  wrote:
> >
> > My 2c for Mali/Panfrost --
> >
> > For us, capturing GPU perf counters is orthogonal to rendering. It's
> > expected (e.g. with Arm's tools) to do this from a separate process.
> > Neither Mesa nor the DDK should require custom instrumentation for the
> > low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
> > Perfetto as it is. So for us I don't see the value in modifying Mesa for
> > tracing.
> >
> > On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
> > > (responding from correct address this time)
> > >
> > > On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  
> > > wrote:
> > >
> > > > I've recently been using GPUVis to look at trace events.  On Intel
> > > > platforms, GPUVis incorporates ftrace events from the i915 driver,
> > > > performance metrics from igt-gpu-tools, and userspace ftrace markers
> > > > that I locally hack up in Mesa.
> > > >
> > >
> > > GPUVis is great. I would love to see that data combined with
> > > userspace events without any need for local hacks. Perfetto provides
> > > on-demand trace events with lower overhead compared to ftrace, so for
> > > example it is acceptable to have production trace instrumentation that can
> > > be captured without dev builds. To do that with ftrace it may require a 
> > > way
> > > to enable and disable the ftrace file writes to avoid the overhead when
> > > tracing is not in use. This is what Android does with systrace/atrace, for
> > > example, it uses Binder to notify processes about trace sessions. Perfetto
> > > does that in a more portable way.
> > >
> > >
> > > >
> > > > It is very easy to compile the GPUVis UI.  Userspace instrumentation
> > > > requires a single C/C++ header.  You don't have to access an external
> > > > web service to analyze trace data (a big no-no for devs working on
> > > > preproduction hardware).
> > > >
> > > > Is it possible to build and run the Perfetto UI locally?
> > >
> > >
> > > Yes, local UI builds are possible
> > > .
> > > Also confirmed with the perfetto team  that
> > > trace data is not uploaded unless you use the 'share' feature.
> > >
> > >
> > > >   Can it display
> > > > arbitrary trace events that are written to
> > > > /sys/kernel/tracing/trace_marker ?
> > >
> > >
> > > Yes, I believe it does support that via linux.ftrace data source
> > > . We use that for
> > > example to overlay CPU sched data to show what process is on each core
> > > throughout the timeline. There are many ftrace event types
> > > 
> > > in
> > > the perfetto protos.
> > >
> > >
> > > > Can it be extended to show i915 and
> > > > i915-perf-recorder events?
> > > >
> > >
> > > It can be extended to consume custom data sources. One way this is done is
> > > via a bridge daemon, such as traced_probes which is responsible for
> > > capturing data from ftrace and /proc during a trace session and sending it
> > > to traced. traced is the main perfetto tracing daemon that notifies all
> > > trace data sources to start/stop tracing and communicates with user 
> > > tracing
> > > requests via the 'perfetto' command.
> > >
> > >
> > >
> > > >
> > > > John Bates  writes:
> > > >
> > > > > I recently opened issue 4262
> > > > >  to begin the
> > > > > discussion on integrating perfetto into mesa.
> > > > >
> > > > > *Background*
> > > > >
> > > > > System-wide tracing is an invaluable tool for developers to find and 
> > > > > fix
> > > > > performance problems. The perfetto project enables a combined view of
> > > > trace
> > > > > data from kernel ftrace, GPU driver and various manually-instrumented
> > > > > tracepoints throughout the application and system. This helps 
> > > > > developers
> > > > > quickly answer questions like:
> > > > >
> > > > >- How long are frames taking?
> > > > >- What caused a particular frame drop?
> > > > >- Is it CPU bound or GPU bound?
> > > > >- Did a CPU core frequency drop cause something to go slower than
> > > > usual?
> > > > >- Is something else running that is stealing CPU or GPU time? 
> > > > > Could I
> > > > >fix that with 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Chia-I Wu
For virgl, where the biggest perf gaps often come from unnecessary CPU
waits or high latencies of fence signaling, being able to insert
userspace driver trace events and combine them with kernel ftrace
events are a big plus.  Admittedly, there is no HW counters and my
needs are simpler (inserting function begin/end and wait begin/end and
combining them with virtio-gpu and dma-fence ftrace events).
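
E.g. roughly this around the waits (a sketch using the perfetto SDK
macros; the virgl function and helper names here are made up):

  #include <perfetto.h>
  #include <cstdint>

  PERFETTO_DEFINE_CATEGORIES(perfetto::Category("virgl"));

  struct virgl_fence;                                              // opaque
  bool virgl_do_wait(struct virgl_fence *f, uint64_t timeout_ns);  // placeholder

  bool virgl_fence_wait(struct virgl_fence *f, uint64_t timeout_ns)
  {
     // Emits a slice on this thread's track; in the perfetto UI it lines
     // up with the virtio-gpu / dma-fence ftrace events from the kernel.
     TRACE_EVENT("virgl", "fence_wait");
     return virgl_do_wait(f, timeout_ns);
  }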

On Fri, Feb 12, 2021 at 2:13 PM Alyssa Rosenzweig
 wrote:
>
> My 2c for Mali/Panfrost --
>
> For us, capturing GPU perf counters is orthogonal to rendering. It's
> expected (e.g. with Arm's tools) to do this from a separate process.
> Neither Mesa nor the DDK should require custom instrumentation for the
> low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
> Perfetto as it is. So for us I don't see the value in modifying Mesa for
> tracing.
>
> On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
> > (responding from correct address this time)
> >
> > On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:
> >
> > > I've recently been using GPUVis to look at trace events.  On Intel
> > > platforms, GPUVis incorporates ftrace events from the i915 driver,
> > > performance metrics from igt-gpu-tools, and userspace ftrace markers
> > > that I locally hack up in Mesa.
> > >
> >
> > GPUVis is great. I would love to see that data combined with
> > userspace events without any need for local hacks. Perfetto provides
> > on-demand trace events with lower overhead compared to ftrace, so for
> > example it is acceptable to have production trace instrumentation that can
> > be captured without dev builds. To do that with ftrace it may require a way
> > to enable and disable the ftrace file writes to avoid the overhead when
> > tracing is not in use. This is what Android does with systrace/atrace, for
> > example, it uses Binder to notify processes about trace sessions. Perfetto
> > does that in a more portable way.
> >
> >
> > >
> > > It is very easy to compile the GPUVis UI.  Userspace instrumentation
> > > requires a single C/C++ header.  You don't have to access an external
> > > web service to analyze trace data (a big no-no for devs working on
> > > preproduction hardware).
> > >
> > > Is it possible to build and run the Perfetto UI locally?
> >
> >
> > Yes, local UI builds are possible
> > .
> > Also confirmed with the perfetto team  that
> > trace data is not uploaded unless you use the 'share' feature.
> >
> >
> > >   Can it display
> > > arbitrary trace events that are written to
> > > /sys/kernel/tracing/trace_marker ?
> >
> >
> > Yes, I believe it does support that via linux.ftrace data source
> > . We use that for
> > example to overlay CPU sched data to show what process is on each core
> > throughout the timeline. There are many ftrace event types
> > 
> > in
> > the perfetto protos.
> >
> >
> > > Can it be extended to show i915 and
> > > i915-perf-recorder events?
> > >
> >
> > It can be extended to consume custom data sources. One way this is done is
> > via a bridge daemon, such as traced_probes which is responsible for
> > capturing data from ftrace and /proc during a trace session and sending it
> > to traced. traced is the main perfetto tracing daemon that notifies all
> > trace data sources to start/stop tracing and communicates with user tracing
> > requests via the 'perfetto' command.
> >
> >
> >
> > >
> > > John Bates  writes:
> > >
> > > > I recently opened issue 4262
> > > >  to begin the
> > > > discussion on integrating perfetto into mesa.
> > > >
> > > > *Background*
> > > >
> > > > System-wide tracing is an invaluable tool for developers to find and fix
> > > > performance problems. The perfetto project enables a combined view of
> > > trace
> > > > data from kernel ftrace, GPU driver and various manually-instrumented
> > > > tracepoints throughout the application and system. This helps developers
> > > > quickly answer questions like:
> > > >
> > > >- How long are frames taking?
> > > >- What caused a particular frame drop?
> > > >- Is it CPU bound or GPU bound?
> > > >- Did a CPU core frequency drop cause something to go slower than
> > > usual?
> > > >- Is something else running that is stealing CPU or GPU time? Could I
> > > >fix that with better thread/context priorities?
> > > >- Are all CPU cores being used effectively? Do I need
> > > sched_setaffinity
> > > >to keep my thread on a big or little core?
> > > >- What’s the latency between CPU frame submit and GPU start?
> > > >
> > > > *What Does Mesa + Perfetto Provide?*
> > > >
> > > > Mesa is in a unique position to produce GPU trace data 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Alyssa Rosenzweig
My 2c for Mali/Panfrost --

For us, capturing GPU perf counters is orthogonal to rendering. It's
expected (e.g. with Arm's tools) to do this from a separate process.
Neither Mesa nor the DDK should require custom instrumentation for the
low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
Perfetto as it is. So for us I don't see the value in modifying Mesa for
tracing.

On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
> (responding from correct address this time)
> 
> On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:
> 
> > I've recently been using GPUVis to look at trace events.  On Intel
> > platforms, GPUVis incorporates ftrace events from the i915 driver,
> > performance metrics from igt-gpu-tools, and userspace ftrace markers
> > that I locally hack up in Mesa.
> >
> 
> GPUVis is great. I would love to see that data combined with
> userspace events without any need for local hacks. Perfetto provides
> on-demand trace events with lower overhead compared to ftrace, so for
> example it is acceptable to have production trace instrumentation that can
> be captured without dev builds. To do that with ftrace it may require a way
> to enable and disable the ftrace file writes to avoid the overhead when
> tracing is not in use. This is what Android does with systrace/atrace, for
> example, it uses Binder to notify processes about trace sessions. Perfetto
> does that in a more portable way.
> 
> 
> >
> > It is very easy to compile the GPUVis UI.  Userspace instrumentation
> > requires a single C/C++ header.  You don't have to access an external
> > web service to analyze trace data (a big no-no for devs working on
> > preproduction hardware).
> >
> > Is it possible to build and run the Perfetto UI locally?
> 
> 
> Yes, local UI builds are possible
> .
> Also confirmed with the perfetto team  that
> trace data is not uploaded unless you use the 'share' feature.
> 
> 
> >   Can it display
> > arbitrary trace events that are written to
> > /sys/kernel/tracing/trace_marker ?
> 
> 
> Yes, I believe it does support that via linux.ftrace data source
> . We use that for
> example to overlay CPU sched data to show what process is on each core
> throughout the timeline. There are many ftrace event types
> 
> in
> the perfetto protos.
> 
> 
> > Can it be extended to show i915 and
> > i915-perf-recorder events?
> >
> 
> It can be extended to consume custom data sources. One way this is done is
> via a bridge daemon, such as traced_probes which is responsible for
> capturing data from ftrace and /proc during a trace session and sending it
> to traced. traced is the main perfetto tracing daemon that notifies all
> trace data sources to start/stop tracing and communicates with user tracing
> requests via the 'perfetto' command.
> 
> 
> 
> >
> > John Bates  writes:
> >
> > > I recently opened issue 4262
> > >  to begin the
> > > discussion on integrating perfetto into mesa.
> > >
> > > *Background*
> > >
> > > System-wide tracing is an invaluable tool for developers to find and fix
> > > performance problems. The perfetto project enables a combined view of
> > trace
> > > data from kernel ftrace, GPU driver and various manually-instrumented
> > > tracepoints throughout the application and system. This helps developers
> > > quickly answer questions like:
> > >
> > >- How long are frames taking?
> > >- What caused a particular frame drop?
> > >- Is it CPU bound or GPU bound?
> > >- Did a CPU core frequency drop cause something to go slower than
> > usual?
> > >- Is something else running that is stealing CPU or GPU time? Could I
> > >fix that with better thread/context priorities?
> > >- Are all CPU cores being used effectively? Do I need
> > sched_setaffinity
> > >to keep my thread on a big or little core?
> > >- What’s the latency between CPU frame submit and GPU start?
> > >
> > > *What Does Mesa + Perfetto Provide?*
> > >
> > > Mesa is in a unique position to produce GPU trace data for several GPU
> > > vendors without requiring the developer to build and install additional
> > > tools like gfx-pps .
> > >
> > > The key is making it easy for developers to use. Ideally, perfetto is
> > > eventually available by default in mesa so that if your system has
> > perfetto
> > > traced running, you just need to run perfetto (perhaps along with setting
> > > an environment variable) with the mesa categories to see:
> > >
> > >- GPU processing timeline events.
> > >- GPU counters.
> > >- CPU events for potentially slow functions in mesa like shader
> > compiles.
> > >
> > > 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes  wrote:

> I've recently been using GPUVis to look at trace events.  On Intel
> platforms, GPUVis incorporates ftrace events from the i915 driver,
> performance metrics from igt-gpu-tools, and userspace ftrace markers
> that I locally hack up in Mesa.
>

GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead compared to ftrace, so for
example it is acceptable to have production trace instrumentation that can
be captured without dev builds. To do that with ftrace it may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace, for
example, it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.


>
> It is very easy to compile the GPUVis UI.  Userspace instrumentation
> requires a single C/C++ header.  You don't have to access an external
> web service to analyze trace data (a big no-no for devs working on
> preproduction hardware).
>
> Is it possible to build and run the Perfetto UI locally?


Yes, local UI builds are possible
.
Also confirmed with the perfetto team  that
trace data is not uploaded unless you use the 'share' feature.


>   Can it display
> arbitrary trace events that are written to
> /sys/kernel/tracing/trace_marker ?


Yes, I believe it does support that via linux.ftrace data source
. We use that for
example to overlay CPU sched data to show what process is on each core
throughout the timeline. There are many ftrace event types

in
the perfetto protos.


> Can it be extended to show i915 and
> i915-perf-recorder events?
>

It can be extended to consume custom data sources. One way this is done is
via a bridge daemon, such as traced_probes which is responsible for
capturing data from ftrace and /proc during a trace session and sending it
to traced. traced is the main perfetto tracing daemon that notifies all
trace data sources to start/stop tracing and communicates with user tracing
requests via the 'perfetto' command.



>
> John Bates  writes:
>
> > I recently opened issue 4262
> >  to begin the
> > discussion on integrating perfetto into mesa.
> >
> > *Background*
> >
> > System-wide tracing is an invaluable tool for developers to find and fix
> > performance problems. The perfetto project enables a combined view of
> trace
> > data from kernel ftrace, GPU driver and various manually-instrumented
> > tracepoints throughout the application and system. This helps developers
> > quickly answer questions like:
> >
> >- How long are frames taking?
> >- What caused a particular frame drop?
> >- Is it CPU bound or GPU bound?
> >- Did a CPU core frequency drop cause something to go slower than
> usual?
> >- Is something else running that is stealing CPU or GPU time? Could I
> >fix that with better thread/context priorities?
> >- Are all CPU cores being used effectively? Do I need
> sched_setaffinity
> >to keep my thread on a big or little core?
> >- What’s the latency between CPU frame submit and GPU start?
> >
> > *What Does Mesa + Perfetto Provide?*
> >
> > Mesa is in a unique position to produce GPU trace data for several GPU
> > vendors without requiring the developer to build and install additional
> > tools like gfx-pps .
> >
> > The key is making it easy for developers to use. Ideally, perfetto is
> > eventually available by default in mesa so that if your system has
> perfetto
> > traced running, you just need to run perfetto (perhaps along with setting
> > an environment variable) with the mesa categories to see:
> >
> >- GPU processing timeline events.
> >- GPU counters.
> >- CPU events for potentially slow functions in mesa like shader
> compiles.
> >
> > Example of what this data might look like (with fake GPU events):
> > [image: percetto-gpu-example.png]
> >
> > *Runtime Characteristics*
> >
> >- ~500KB additional binary size. Even with using only the basic
> features
> >of perfetto, it will increase the binary size of mesa by about 500KB.
> >- Background thread. Perfetto uses a background thread for
> communication
> >with the system tracing daemon (traced) to advertise trace data and
> get
> >notification of trace start/stop.
> >- Runtime overhead when disabled is designed to be optimal with one
> >predicted branch, typically a few CPU cycles
> >

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Mark Janes
I've recently been using GPUVis to look at trace events.  On Intel
platforms, GPUVis incorporates ftrace events from the i915 driver,
performance metrics from igt-gpu-tools, and userspace ftrace markers
that I locally hack up in Mesa.

It is very easy to compile the GPUVis UI.  Userspace instrumentation
requires a single C/C++ header.  You don't have to access an external
web service to analyze trace data (a big no-no for devs working on
preproduction hardware).

Is it possible to build and run the Perfetto UI locally?  Can it display
arbitrary trace events that are written to
/sys/kernel/tracing/trace_marker ?  Can it be extended to show i915 and
i915-perf-recorder events?

John Bates  writes:

> I recently opened issue 4262
>  to begin the
> discussion on integrating perfetto into mesa.
>
> *Background*
>
> System-wide tracing is an invaluable tool for developers to find and fix
> performance problems. The perfetto project enables a combined view of trace
> data from kernel ftrace, GPU driver and various manually-instrumented
> tracepoints throughout the application and system. This helps developers
> quickly answer questions like:
>
>- How long are frames taking?
>- What caused a particular frame drop?
>- Is it CPU bound or GPU bound?
>- Did a CPU core frequency drop cause something to go slower than usual?
>- Is something else running that is stealing CPU or GPU time? Could I
>fix that with better thread/context priorities?
>- Are all CPU cores being used effectively? Do I need sched_setaffinity
>to keep my thread on a big or little core?
>- What’s the latency between CPU frame submit and GPU start?
>
> *What Does Mesa + Perfetto Provide?*
>
> Mesa is in a unique position to produce GPU trace data for several GPU
> vendors without requiring the developer to build and install additional
> tools like gfx-pps .
>
> The key is making it easy for developers to use. Ideally, perfetto is
> eventually available by default in mesa so that if your system has perfetto
> traced running, you just need to run perfetto (perhaps along with setting
> an environment variable) with the mesa categories to see:
>
>- GPU processing timeline events.
>- GPU counters.
>- CPU events for potentially slow functions in mesa like shader compiles.
>
> Example of what this data might look like (with fake GPU events):
> [image: percetto-gpu-example.png]
>
> *Runtime Characteristics*
>
>- ~500KB additional binary size. Even with using only the basic features
>of perfetto, it will increase the binary size of mesa by about 500KB.
>- Background thread. Perfetto uses a background thread for communication
>with the system tracing daemon (traced) to advertise trace data and get
>notification of trace start/stop.
>- Runtime overhead when disabled is designed to be optimal with one
>predicted branch, typically a few CPU cycles per event. While enabled,
>the overhead can be around 1 us per event.
>
> *Integration Challenges*
>
>- The perfetto SDK is C++ and designed around macros, lambdas, inline
>templates, etc. There are ongoing discussions on providing an official
>perfetto C API, but it is not yet clear when this will land on the perfetto
>roadmap.
>- The perfetto SDK is an amalgamated .h and .cc that adds up to 100K
>lines of code.
>- Anything that includes perfetto.h takes a long time to compile.
>- The current Perfetto SDK design is incompatible with being a shared
>library behind a C API.
>
> *Percetto*
>
> The percetto library  was recently
> implemented to provide an interim C API for perfetto. It provides efficient
> support for scoped trace events, multiple categories, counters, custom
> timestamps, and debug data annotations. Percetto also provides some
> features that are important to mesa, but not available yet with perfetto
> SDK:
>
>- Trace events from multiple perfetto instances in separate shared
>libraries (like mesa and virglrenderer) show correctly in a single process
>and thread view.
>- Counter tracks and macro API.
>
> Percetto is missing API for perfetto's GPU DataSource and counter support,
> but that feature could be implemented next if it is important for mesa.
> With the existing percetto API mesa could present GPU trace data as named
> 'slice' events and int64_t counters with custom timestamps as shown in the
> image above (based on this sample).
>
> *Mesa Integration Alternatives*
>
> Note: we have some pressing needs for performance analysis in Chrome OS, so
> I'm intentionally leaving out the alternative of waiting for an official
> perfetto C API. Of course, once that C API is available it would 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread John Bates
On Fri, Feb 12, 2021 at 5:01 AM Tamminen, Eero T wrote:

>
> Unlike some other Linux tracing solutions, Perfetto appears to be for
> Android / Chrome(OS?), and not available in common Linux distro
> repos.
>
> So, why Perfetto instead of one of the other solutions, e.g. from ones
> mentioned here:
> https://tracingsummit.org/ts/2018/
> ?
>
>
Good question. Perfetto targets Linux, Android, and Chrome OS; I'm not sure
which desktop Linux distros package it yet. It provides a comprehensive
tracing solution, from data collection and tools to a convenient web-based
UI and analysis, as well as interoperation with other trace data providers.
Looking at the tracing summit presentations, for example, there appear to be
some good additional tracing data sources that could potentially feed into
the Perfetto trace daemon and UI. But none of those particular projects
provide a comprehensive solution the way Perfetto does. There is a lot more
detail at perfetto.dev.


> And, if tracing API is added to Mesa, shouldn't it support also
> tracepoints for other tracing solutions?
>
> I mean, code added to drivers themselves preferably should not have
> anything perfetto/percetto specific.  Tracing system specific code
> should be only in one place (even if it's just macros in common header).


I agree it makes sense to keep the macro API implementation in a common
mesa header so that we have the option of changing out the backend. On the
other hand, it can get difficult to maintain more than one tracing backend,
especially once tracing usage goes beyond simple TRACE_SCOPE(__func__)
macros to things like GPU timeline tracks, counters, etc. I would not
expect mesa devs to test their tracing code against more than one tracing
backend, so other backends would be likely to regress. Ideally we could
pick one.
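
As a rough illustration of what such a common header could look like (every
name below is made up for the example and is not an existing mesa or
percetto API):

/* Hypothetical backend-agnostic tracing header; all names are illustrative. */
#ifndef MESA_TRACE_H
#define MESA_TRACE_H

#ifdef HAVE_TRACING_BACKEND              /* hypothetical build option */
void mesa_trace_begin(const char *name); /* implemented once per backend */
void mesa_trace_end(void);
#else
static inline void mesa_trace_begin(const char *name) { (void)name; }
static inline void mesa_trace_end(void) { }
#endif

/* Scope helper: mesa_trace_end() runs when the variable declared by
 * MESA_TRACE_SCOPE() goes out of scope (GCC/Clang cleanup extension). */
static inline void mesa_trace_scope_cleanup(int *scope)
{
   (void)scope;
   mesa_trace_end();
}

#define MESA_TRACE_CONCAT_(a, b) a##b
#define MESA_TRACE_CONCAT(a, b) MESA_TRACE_CONCAT_(a, b)

#define MESA_TRACE_SCOPE(name)                                   \
   __attribute__((cleanup(mesa_trace_scope_cleanup)))            \
   int MESA_TRACE_CONCAT(_mesa_trace_scope_, __LINE__) =         \
      (mesa_trace_begin(name), 0)

#define MESA_TRACE_FUNC() MESA_TRACE_SCOPE(__func__)

#endif /* MESA_TRACE_H */

Drivers would then only ever use MESA_TRACE_SCOPE()/MESA_TRACE_FUNC(), and
swapping percetto for another backend would only touch whatever file
implements mesa_trace_begin()/mesa_trace_end().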


>
>
> > This helps developers
> > quickly answer questions like:
> >
> >- How long are frames taking?
>
> That doesn't require any changes to Mesa.  Just set uprobe for suitable
> buffer swap function [1], and parse kernel ftrace events.  This way
> starting tracing doesn't require even restarting the tracked processes.
>
>
> [1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT,
> anv_QueuePresentKHR[2]..
>
> [2] Many apps resolve "vkQueuePresentKHR" Vulkan API loader wrapper
> function and call the backend function like "anv_QueuePresentKHR"
> directly, so it's better to track the latter instead.
>
>
> >- What caused a particular frame drop?
> >- Is it CPU bound or GPU bound?
>
> That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
> utilization (which is a lower-level thing).
>
>
> >- Did a CPU core frequency drop cause something to go slower than
> > usual?
>
> Note that nowadays actual CPU frequencies are often controlled by HW /
> firmware, so you don't necessarily get any ftrace event from freq
> change, you would need to poll MSR registers instead (which is
> a privileged operation, and polling can easily miss changes).
>
>
> >- Is something else running that is stealing CPU or GPU time? Could
> > I
> >fix that with better thread/context priorities?
> >- Are all CPU cores being used effectively? Do I need
> > sched_setaffinity
> >to keep my thread on a big or little core?
>
> I don't think these require adding tracepoints to Mesa either...
>
>
> >- What’s the latency between CPU frame submit and GPU start?
>
> I think this would require tracepoints in kernel GPU code more than in
> Mesa?
>
>
> - Eero
>
>
> > *What Does Mesa + Perfetto Provide?*
> >
> > Mesa is in a unique position to produce GPU trace data for several GPU
> > vendors without requiring the developer to build and install
> > additional
> > tools like gfx-pps .
> >
> > The key is making it easy for developers to use. Ideally, perfetto is
> > eventually available by default in mesa so that if your system has
> > perfetto
> > traced running, you just need to run perfetto (perhaps along with
> > setting
> > an environment variable) with the mesa categories to see:
> >
> >- GPU processing timeline events.
> >- GPU counters.
> >- CPU events for potentially slow functions in mesa like shader
> > compiles.
> >
> > Example of what this data might look like (with fake GPU events):
> > [image: percetto-gpu-example.png]
> >
> > *Runtime Characteristics*
> >
> >- ~500KB additional binary size. Even with using only the basic
> > features
> >of perfetto, it will increase the binary size of mesa by about
> > 500KB.
> >- Background thread. Perfetto uses a background thread for
> > communication
> >with the system tracing daemon (traced) to advertise trace data and
> > get
> >notification of trace start/stop.
> >- Runtime overhead when disabled is designed to be optimal with one
> >predicted branch, typically a few CPU cycles
> >
> > 

Re: [Mesa-dev] Perfetto CPU/GPU tracing

2021-02-12 Thread Tamminen, Eero T
Hi,

On Thu, 2021-02-11 at 17:39 -0800, John Bates wrote:
> I recently opened issue 4262
>  to begin the
> discussion on integrating perfetto into mesa.
> 
> *Background*
> 
> System-wide tracing is an invaluable tool for developers to find and
> fix
> performance problems. The perfetto project enables a combined view of
> trace
> data from kernel ftrace, GPU driver and various manually-instrumented
> tracepoints throughout the application and system.

Unlike some other Linux tracing solutions, Perfetto appears to be for
Android / Chrome(OS?), and not available in common Linux distro repos.

So, why Perfetto instead of one of the other solutions, e.g. from ones
mentioned here:
https://tracingsummit.org/ts/2018/
?

And, if a tracing API is added to Mesa, shouldn't it also support
tracepoints for other tracing solutions?

I mean, code added to drivers themselves preferably should not have
anything perfetto/percetto specific.  Tracing system specific code
should be only in one place (even if it's just macros in common header).


> This helps developers
> quickly answer questions like:
> 
>    - How long are frames taking?

That doesn't require any changes to Mesa.  Just set a uprobe on a suitable
buffer swap function [1], and parse kernel ftrace events.  This way,
starting tracing doesn't even require restarting the tracked processes.


[1] glXSwapBuffers, eglSwapBuffers, eglSwapBuffersWithDamageEXT, 
anv_QueuePresentKHR[2]..

[2] Many apps resolve the "vkQueuePresentKHR" Vulkan API loader wrapper
function and call the backend function like "anv_QueuePresentKHR"
directly, so it's better to track the latter instead.


>    - What caused a particular frame drop?
>    - Is it CPU bound or GPU bound?

That doesn't require adding tracepoints to Mesa, just checking CPU & GPU
utilization (which is a lower-level thing).


>    - Did a CPU core frequency drop cause something to go slower than
> usual?

Note that nowadays actual CPU frequencies are often controlled by HW /
firmware, so you don't necessarily get any ftrace event on a freq
change; you would need to poll MSR registers instead (which is a
privileged operation, and polling can easily miss changes).
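
For reference, a rough sketch of that kind of polling (this assumes x86,
root, the msr kernel module providing /dev/cpu/N/msr, and the usual
IA32_MPERF/IA32_APERF registers; the APERF/MPERF delta ratio gives the
average non-idle frequency relative to the base clock over the interval,
which is exactly why short excursions can be missed):

/* Sketch: poll APERF/MPERF on CPU0 to estimate its effective frequency.
 * Needs the 'msr' kernel module and root; register addresses are x86. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MPERF 0xE7   /* counts at the base (TSC) frequency */
#define MSR_IA32_APERF 0xE8   /* counts at the actual delivered frequency */

static uint64_t rdmsr(int fd, uint32_t reg)
{
   uint64_t val = 0;
   if (pread(fd, &val, sizeof(val), reg) != sizeof(val))  /* offset = MSR */
      perror("rdmsr");
   return val;
}

int main(void)
{
   int fd = open("/dev/cpu/0/msr", O_RDONLY);
   if (fd < 0) {
      perror("open /dev/cpu/0/msr (is the msr module loaded?)");
      return 1;
   }
   uint64_t m0 = rdmsr(fd, MSR_IA32_MPERF), a0 = rdmsr(fd, MSR_IA32_APERF);
   sleep(1);
   uint64_t m1 = rdmsr(fd, MSR_IA32_MPERF), a1 = rdmsr(fd, MSR_IA32_APERF);
   close(fd);

   /* Both counters stop in deep C-states, so this reflects non-idle time;
    * ratio > 1.0 means turbo, < 1.0 means the core ran below base clock. */
   double ratio = (double)(a1 - a0) / (double)(m1 - m0);
   printf("cpu0 average frequency over 1s: %.2f x base clock\n", ratio);
   return 0;
}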


>    - Is something else running that is stealing CPU or GPU time? Could
> I
>    fix that with better thread/context priorities?
>    - Are all CPU cores being used effectively? Do I need
> sched_setaffinity
>    to keep my thread on a big or little core?

I don't think these require adding tracepoints to Mesa either...


>    - What’s the latency between CPU frame submit and GPU start?

I think this would require tracepoints in kernel GPU code more than in
Mesa?


- Eero


> *What Does Mesa + Perfetto Provide?*
> 
> Mesa is in a unique position to produce GPU trace data for several GPU
> vendors without requiring the developer to build and install
> additional
> tools like gfx-pps .
> 
> The key is making it easy for developers to use. Ideally, perfetto is
> eventually available by default in mesa so that if your system has
> perfetto
> traced running, you just need to run perfetto (perhaps along with
> setting
> an environment variable) with the mesa categories to see:
> 
>    - GPU processing timeline events.
>    - GPU counters.
>    - CPU events for potentially slow functions in mesa like shader
> compiles.
> 
> Example of what this data might look like (with fake GPU events):
> [image: percetto-gpu-example.png]
> 
> *Runtime Characteristics*
> 
>    - ~500KB additional binary size. Even with using only the basic
> features
>    of perfetto, it will increase the binary size of mesa by about
> 500KB.
>    - Background thread. Perfetto uses a background thread for
> communication
>    with the system tracing daemon (traced) to advertise trace data and
> get
>    notification of trace start/stop.
>    - Runtime overhead when disabled is designed to be optimal with one
>    predicted branch, typically a few CPU cycles per event. While enabled,
>    the overhead can be around 1 us per event.
> 
> *Integration Challenges*
> 
>    - The perfetto SDK is C++ and designed around macros, lambdas,
> inline
>    templates, etc. There are ongoing discussions on providing an
> official
>    perfetto C API, but it is not yet clear when this will land on the
> perfetto
>    roadmap.
>    - The perfetto SDK is an amalgamated .h and .cc that adds up to
> 100K
>    lines of code.
>    - Anything that includes perfetto.h takes a long time to compile.
>    - The current Perfetto SDK design is incompatible with being a
> shared
>    library behind a C API.
> 
> *Percetto*
> 
> The percetto library  was
> recently
> implemented to provide an interim C API for perfetto. It provides
> efficient
> support for scoped trace events, multiple categories, counters, 

[Mesa-dev] Perfetto CPU/GPU tracing

2021-02-11 Thread John Bates
I recently opened issue 4262 to begin the
discussion on integrating perfetto into mesa.

*Background*

System-wide tracing is an invaluable tool for developers to find and fix
performance problems. The perfetto project enables a combined view of trace
data from kernel ftrace, GPU driver and various manually-instrumented
tracepoints throughout the application and system. This helps developers
quickly answer questions like:

   - How long are frames taking?
   - What caused a particular frame drop?
   - Is it CPU bound or GPU bound?
   - Did a CPU core frequency drop cause something to go slower than usual?
   - Is something else running that is stealing CPU or GPU time? Could I
   fix that with better thread/context priorities?
   - Are all CPU cores being used effectively? Do I need sched_setaffinity
   to keep my thread on a big or little core?
   - What’s the latency between CPU frame submit and GPU start?

*What Does Mesa + Perfetto Provide?*

Mesa is in a unique position to produce GPU trace data for several GPU
vendors without requiring the developer to build and install additional
tools like gfx-pps .

The key is making it easy for developers to use. Ideally, perfetto is
eventually available by default in mesa so that if your system has perfetto
traced running, you just need to run perfetto (perhaps along with setting
an environment variable) with the mesa categories to see:

   - GPU processing timeline events.
   - GPU counters.
   - CPU events for potentially slow functions in mesa like shader compiles.

Example of what this data might look like (with fake GPU events):
[image: percetto-gpu-example.png]

*Runtime Characteristics*

   - ~500KB additional binary size. Even with using only the basic features
   of perfetto, it will increase the binary size of mesa by about 500KB.
   - Background thread. Perfetto uses a background thread for communication
   with the system tracing daemon (traced) to advertise trace data and get
   notification of trace start/stop.
   - Runtime overhead when disabled is designed to be optimal with one
   predicted branch, typically a few CPU cycles per event. While enabled,
   the overhead can be around 1 us per event.
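
For reference, the disabled fast path typically boils down to something like
the following generic C sketch (made-up names, not the actual
perfetto/percetto macro expansion):

/* Generic sketch of the disabled fast path, with hypothetical names. */
#include <stdatomic.h>

/* Set/cleared asynchronously when a trace session that enables this
 * category starts or stops (in perfetto, via IPC from the traced daemon). */
static atomic_bool gpu_category_enabled;

void emit_trace_event_slow_path(const char *name);  /* hypothetical */

#define TRACE_EVENT_FAST(name)                                              \
   do {                                                                     \
      /* disabled cost: one relaxed load + one predicted-not-taken branch */\
      if (__builtin_expect(                                                 \
             atomic_load_explicit(&gpu_category_enabled,                    \
                                  memory_order_relaxed), 0))                \
         emit_trace_event_slow_path(name);                                  \
   } while (0)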

*Integration Challenges*

   - The perfetto SDK is C++ and designed around macros, lambdas, inline
   templates, etc. There are ongoing discussions on providing an official
   perfetto C API, but it is not yet clear when this will land on the perfetto
   roadmap.
   - The perfetto SDK is an amalgamated .h and .cc that adds up to 100K
   lines of code.
   - Anything that includes perfetto.h takes a long time to compile.
   - The current Perfetto SDK design is incompatible with being a shared
   library behind a C API.

*Percetto*

The percetto library  was recently
implemented to provide an interim C API for perfetto. It provides efficient
support for scoped trace events, multiple categories, counters, custom
timestamps, and debug data annotations. Percetto also provides some
features that are important to mesa, but not available yet with perfetto
SDK:

   - Trace events from multiple perfetto instances in separate shared
   libraries (like mesa and virglrenderer) show correctly in a single process
   and thread view.
   - Counter tracks and macro API.

Percetto is missing API for perfetto's GPU DataSource and counter support,
but that feature could be implemented next if it is important for mesa.
With the existing percetto API mesa could present GPU trace data as named
'slice' events and int64_t counters with custom timestamps as shown in the
image above (based on this sample).
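
To make that concrete, here is a small sketch of the driver-side data this
would involve (the emit functions below are placeholders standing in for
whatever API ends up being used, not percetto's actual one):

/* Sketch only: the emit functions are placeholders, not a real API. */
#include <stdint.h>

/* One completed GPU span, timestamped by the GPU in its own clock domain. */
struct gpu_slice {
   const char *name;      /* e.g. a renderpass label or "blit" */
   uint64_t start_ticks;  /* raw GPU timestamps read back from a query */
   uint64_t end_ticks;
};

/* Placeholder hooks a backend would provide; the important part is that
 * they take explicit timestamps rather than sampling the CPU clock. */
void trace_emit_slice(const char *name, uint64_t start_ns, uint64_t end_ns);
void trace_emit_counter_i64(const char *name, uint64_t ts_ns, int64_t value);

/* Convert GPU ticks to nanoseconds and forward; ns_per_tick would come from
 * the device (e.g. VkPhysicalDeviceLimits::timestampPeriod for Vulkan). */
void report_gpu_slice(const struct gpu_slice *s, double ns_per_tick)
{
   trace_emit_slice(s->name,
                    (uint64_t)(s->start_ticks * ns_per_tick),
                    (uint64_t)(s->end_ticks * ns_per_tick));
}

void report_gpu_busy(uint64_t ts_ns, int64_t busy_percent)
{
   trace_emit_counter_i64("gpu_busy", ts_ns, busy_percent);
}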

*Mesa Integration Alternatives*

Note: we have some pressing needs for performance analysis in Chrome OS, so
I'm intentionally leaving out the alternative of waiting for an official
perfetto C API. Of course, once that C API is available it would become an
option to migrate to it from any of the alternatives below.

Ordered by difficulty with easiest first:

   1. Statically link with percetto as an optional external dependency
      (virglrenderer now has this approach).
      - Pros: API already supports most common tracing needs. Tested and
        used by an increasing number of CrOS components.
      - Cons: External dependency for optional mesa build option.
   2. Embed Perfetto SDK + a Percetto fork/copy.
      - Pros: API already supports most common tracing needs. No added
        external dependency for mesa.
      - Cons: Percetto code divergence, bug fixes need to land in two trees.
   3. Embed Perfetto SDK + custom C wrapper.
      - Pros: Tailored API for mesa's needs.
      - Cons: Nontrivial development efforts and