Re: [Intel-gfx] [PATCH i-g-t 6/6] intel_gpu_top: Add a sanity check discovered busy metric is per engine

2021-11-30 Thread Rogozhkin, Dmitry V
On Fri, 2021-11-19 at 12:59 +, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> Adding a cross-check with ABI config name space and not just relying
> on
> sysfs names.
> 
> Signed-off-by: Tvrtko Ursulin 
> Cc: Dmitry Rogozhkin 
> ---
>  tools/intel_gpu_top.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
> index 41c59a72c09d..81c724d1fe1c 100644
> --- a/tools/intel_gpu_top.c
> +++ b/tools/intel_gpu_top.c
> @@ -376,6 +376,12 @@ static struct engines *discover_engines(char *device)
>   break;
>   }
>  
> + /* Double check config is an engine config. */
> + if (engine->busy.config >= __I915_PMU_OTHER(0)) {
> + free((void *)engine->name);
> + continue;
> + }
> +
>   engine->class = (engine->busy.config &
>(__I915_PMU_OTHER(0) - 1)) >>
>   I915_PMU_CLASS_SHIFT;

This works for me.
Acked-by: Dmitry Rogozhkin 

However, looking at the existing code below the place where you've added
the fix, it seems to me that 'free((void *)engine->name)' might be
missing on a number of other error paths, and on the non-error exit path
as well.
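
To illustrate, a minimal sketch (invented names and structure, not the
actual intel_gpu_top code) of a discovery loop with a single reject path
that frees the duplicated name exactly once:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct engine { char *name; unsigned long long config; };

/* Hypothetical stand-in for __I915_PMU_OTHER(0): configs at or above
 * this boundary belong to the non-engine PMU namespace. */
#define PMU_OTHER_BASE (1ULL << 10)

static size_t discover(struct engine *out, size_t max,
                       const char *const names[],
                       const unsigned long long configs[], size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n && count < max; i++) {
		char *name = strdup(names[i]);
		if (!name)
			break;

		/* Reject anything outside the engine config namespace;
		 * this single reject path frees the name exactly once. */
		if (configs[i] >= PMU_OTHER_BASE) {
			free(name);
			continue;
		}

		out[count].name = name;
		out[count].config = configs[i];
		count++;
	}
	return count;
}

int main(void)
{
	const char *names[] = { "rcs0", "actual-frequency" };
	const unsigned long long configs[] = { 0, PMU_OTHER_BASE + 1 };
	struct engine engines[4];
	size_t n = discover(engines, 4, names, configs, 2);

	for (size_t i = 0; i < n; i++) {
		printf("engine: %s\n", engines[i].name);
		free(engines[i].name);
	}
	return 0;
}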



Re: [Intel-gfx] [PATCH 4/5] drm/i915: Increase initial busyspin limit

2018-08-02 Thread Rogozhkin, Dmitry V
On Thu, 2018-08-02 at 15:54 +0100, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-08-02 15:47:37)
> > 
> > On 28/07/2018 17:46, Chris Wilson wrote:
> > > Here we bump the busyspin for the current request to avoid
> > > sleeping and
> > > the cost of both idling and downclocking the CPU.
> > > 
> > > Signed-off-by: Chris Wilson 
> > > Cc: Sagar Kamble 
> > > Cc: Eero Tamminen 
> > > Cc: Francisco Jerez 
> > > Cc: Tvrtko Ursulin 
> > > Cc: Ben Widawsky 
> > > Cc: Joonas Lahtinen 
> > > Cc: Michał Winiarski 
> > > ---
> > >   drivers/gpu/drm/i915/Kconfig.profile | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
> > > index de394dea4a14..54e0ba4051e7 100644
> > > --- a/drivers/gpu/drm/i915/Kconfig.profile
> > > +++ b/drivers/gpu/drm/i915/Kconfig.profile
> > > @@ -1,6 +1,6 @@
> > >   config DRM_I915_SPIN_REQUEST_IRQ
> > >   int
> > > - default 5 # microseconds
> > > + default 250 # microseconds
> > >   help
> > > Before sleeping waiting for a request (GPU operation) to
> > > complete,
> > > we may spend some time polling for its completion. As the
> > > IRQ may
> > > 
> > 
> > But the histograms from previous patch do not suggest 250us busy
> > spin 
> > gets us into interesting territory. And we have to know the
> > distribution 
> > between IRQ and CS spins.
> 
> This is for a different problem than presented in those histograms.
> This
> is to hide the reclocking that occurred if we sleep.
> 
> Previous justification for the initial spin was to try and hide the
> irq
> setup costs, this is to try and hide the sleep costs.
> 
> The purpose of this patch was just to dip the toe into the water of
> what
> is possible and solicit feedback since we are still waiting for the
> panacea patches to improve cpufreq. (Why it had such a lackluster
> justification.)

This patch somewhat resembles the issue we just met in userspace with
opencl and media. Long story short, there was a CPU% regression between
2 opencl releases, root caused to the following. OpenCL rewrote the code
for how they wait for task completion. Essentially they have an
optimistic spin implemented on the userspace side, and the logic there
is quite complex: they commit to different spin durations depending on
the scenario. And they had some pretty long spins, like 10ms. This is a
killer if opencl is coupled with media workloads, especially when media
is dominant. That's because the usual execution time of media batches is
pretty long; a few or a dozen milliseconds is quite typical. As a result
the opencl workload is really submitted and executed with significant
delay, thus the optimistic spin gives up, and it comes with the CPU%
cost. This story was discussed here:
https://github.com/intel/compute-runtime/commit/d53e1c3979f6d822f55645bfcd753be1658f53ca#comments.

I was actually surprised that they don't rely on the i915 wait function.
They responded: "I would gladly always use bo_wait ioctl, but it has big
overhead for short workloads." Thus the question: do we have an
opportunity to improve? Can we have some dynamic way to configure
spinning for particular requests/contexts? (If I understand the current
patch correctly, it introduces a kernel config-level setting rather than
a dynamic one.)
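
For reference, a minimal sketch of the bounded-spin-then-wait pattern
under discussion; wait_ioctl() below is a hypothetical stand-in for a
blocking kernel wait such as the bo_wait ioctl, and the spin budget is
the knob whose fixed value causes the trouble:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for a blocking kernel wait (e.g. a bo_wait-style ioctl),
 * simulated here with a short sleep. */
static void wait_ioctl(void)
{
	struct timespec ts = { 0, 1000000 }; /* 1ms */
	nanosleep(&ts, NULL);
}

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Spin for at most spin_ns before handing off to the kernel. A short
 * budget (a few us) hides wakeup latency for short workloads; a long
 * one (the ~10ms above) burns CPU whenever the work sits queued behind
 * long-running media batches. The real blocking wait would not return
 * until completion; the stub above only simulates that. */
void wait_for_completion(_Atomic bool *done, uint64_t spin_ns)
{
	uint64_t deadline = now_ns() + spin_ns;

	while (!atomic_load_explicit(done, memory_order_acquire)) {
		if (now_ns() >= deadline) {
			wait_ioctl();
			return;
		}
	}
}

int main(void)
{
	_Atomic bool done = false;
	wait_for_completion(&done, 5000); /* 5us spin, then sleep */
	return 0;
}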

> -Chris


Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs entry to let users set sseu configs

2018-08-02 Thread Rogozhkin, Dmitry V
On Thu, 2018-08-02 at 11:00 +0100, Tvrtko Ursulin wrote:
> [Picking this point in the thread to reply on some points mentioned
> by 
> folks in the whole thread.]
> 
> I don't remember if any patches from Lionel's series actually had r-
> b's, 
> but a few people including myself have certainly been reviewing them.
> If 
> I had applied the final r-b I wouldn't have made much difference due to 
> lack of userspace and disagreement on the DoS mitigation story. So
> to 
> say effectively put your money where your mouth is and review is not 
> entirely fair.
> 
> Suggestion to add a master sysfs switch was to alleviate the DoS 
> concerns because dynamic switching has a cost towards all GPU
> clients, 
> not that it just has potential to slow down media.
> 
> Suggestion was also that this switch becomes immutable and defaults
> to 
> "allow" on Gen11 onwards.
> 
> I preferred sysfs versus a modparam since it makes testing easier (both 
> for developers and for users asking what config works better for their 
> use case).
> 
> I am fine with the suggestion to drive the Gen11 part first, which 
> removes the need for any of this.
> 
> Patch series is already (AFAIR) nicely split so only the last patch 
> needs to be dropped.
> 
> If at some point we decide to revisit the Gen8/9 story, we can 
> evaluate/measure whether any master switch is actually warranted. I 
> think Lionel did some micro-benchmarking which showed impact is not
> so 
> severe, so perhaps for real-world use cases it would be even less.
> 
> I can re-read the public patches and review them, or perhaps even
> adopt 
> them if they have been orphaned. Have to check with Francesco first 
> before I commit to the latter.

Thank you for volunteering :).
I talked with Lionel about that and he thinks he may be able to do a
respin next week. So, please, check with him as well to avoid duplicated work.

> 
> On the uAPI front interface looked fine to me.
> 
> One recent interesting development is the Mesa patches posted to mesa-dev
> (https://lists.freedesktop.org/archives/mesa-dev/2018-July/200607.html)
> which add EGL_CONTEXT_load_type extension (low/medium/high). This
> would 
> need some sort of mapping between low/medium/high to more specific 
> settings but still seems okay to me.
> 
> This may bring another (open source?) user for the patches. Which
> Gen's 
> they are interested in is also a question.
> 
> Regards,
> 
> Tvrtko
> 
> On 24/07/2018 22:50, Bloomfield, Jon wrote:
> > Gratuitous top posting to re-kick the thread.
> > 
> > For Gen11 we can't have an on/off switch anyway (media simply won't
> > run
> > with an incompatible slice config), so let's agree on an api to
> > allow userland
> > to select the slice configuration for its contexts, for Gen11 only.
> > I'd prefer a
> > simple {MediaConfig/GeneralConfig} type context param but I really
> > don't
> > care that much.
> > 
> > For gen9/10 it's arguable whether we need this at all. The effect
> > on media
> > workloads varies, but I'm guessing normal users (outside a
> > transcode
> > type environment) will never know they're missing anything. Either
> > way,
> > we can continue discussing while we progress the gen11 solution.
> > 
> > Jon
> > 
> > > -Original Message-
> > > From: Intel-gfx  On
> > > Behalf Of
> > > Bloomfield, Jon
> > > Sent: Wednesday, July 18, 2018 9:44 AM
> > > To: Tvrtko Ursulin ; Joonas
> > > Lahtinen
> > > ; Chris Wilson  > > on.co.uk>;
> > > Landwerlin, Lionel G ; intel-
> > > g...@lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs
> > > entry to let users
> > > set sseu configs
> > > 
> > > > -Original Message-
> > > > From: Intel-gfx  On
> > > > Behalf Of
> > > > Tvrtko Ursulin
> > > > Sent: Thursday, June 14, 2018 1:29 AM
> > > > To: Joonas Lahtinen ; Chris
> > > > Wilson
> > > > ; Landwerlin, Lionel G
> > > > ; intel-...@lists.freedesktop.org
> > > > Subject: Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs
> > > > entry to let
> > > 
> > > users
> > > > set sseu configs
> > > > 
> > > > 
> > > > On 13/06/2018 13:49, Joonas Lahtinen wrote:
> > > > > Quoting Tvrtko Ursulin (2018-06-12 15:02:07)
> > > > > > 
> > > > > > On 12/06/2018 11:52, Lionel Landwerlin wrote:
> > > > > > > On 12/06/18 11:37, Chris Wilson wrote:
> > > > > > > > Quoting Lionel Landwerlin (2018-06-12 11:33:34)
> > > > > > > > > On 12/06/18 10:20, Joonas Lahtinen wrote:
> > > > > > > > > > Quoting Chris Wilson (2018-06-11 18:02:37)
> > > > > > > > > > > Quoting Lionel Landwerlin (2018-06-11 14:46:07)
> > > > > > > > > > > > On 11/06/18 13:10, Tvrtko Ursulin wrote:
> > > > > > > > > > > > > On 30/05/2018 15:33, Lionel Landwerlin wrote:
> > > > > > > > > > > > > > There are concerns about denial of service
> > > > > > > > > > > > > > around the per
> > > > > > > > > > > > > > context sseu
> > > > > > > > > > > > > > configuration capability. In a previous
> > > > > > > > > > > > > > commit introducing the
> > > > > > > > > > > > > > 

Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs entry to let users set sseu configs

2018-07-30 Thread Rogozhkin, Dmitry V
On Tue, 2018-07-24 at 21:50 +, Bloomfield, Jon wrote:
> Gratuitous top posting to re-kick the thread.
> 
> For Gen11 we can't have an on/off switch anyway (media simply won't
> run
> with an incompatible slice config), so let's agree on an api to allow
> userland
> to select the slice configuration for its contexts, for Gen11 only.
> I'd prefer a
> simple {MediaConfig/GeneralConfig} type context param but I really
> don't
> care that much.

Is anyone working on splitting the patch series into Gen11 and Gen8/9/10
parts, with Gen11 going first, since that is required for the functional
side? We hope to have the userspace media driver ICL patches available
pretty soon, and it would be really nice to have the kernel side prepared
as much as possible... If no one is working on this, can you, Lionel,
help, please?

> 
> For gen9/10 it's arguable whether we need this at all. The effect on
> media
> workloads varies, but I'm guessing normal users (outside a transcode
> type environment) will never know they're missing anything. Either
> way,
> we can continue discussing while we progress the gen11 solution.
> 
> Jon
> 
> > -Original Message-
> > From: Intel-gfx  On Behalf
> > Of
> > Bloomfield, Jon
> > Sent: Wednesday, July 18, 2018 9:44 AM
> > To: Tvrtko Ursulin ; Joonas
> > Lahtinen
> > ; Chris Wilson  > .co.uk>;
> > Landwerlin, Lionel G ; intel-
> > g...@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs entry
> > to let users
> > set sseu configs
> > 
> > > -Original Message-
> > > From: Intel-gfx  On
> > > Behalf Of
> > > Tvrtko Ursulin
> > > Sent: Thursday, June 14, 2018 1:29 AM
> > > To: Joonas Lahtinen ; Chris
> > > Wilson
> > > ; Landwerlin, Lionel G
> > > ; intel-gfx@lists.freedesktop.org
> > > Subject: Re: [Intel-gfx] [PATCH v9 7/7] drm/i915: add a sysfs
> > > entry to let
> > 
> > users
> > > set sseu configs
> > > 
> > > 
> > > On 13/06/2018 13:49, Joonas Lahtinen wrote:
> > > > Quoting Tvrtko Ursulin (2018-06-12 15:02:07)
> > > > > 
> > > > > On 12/06/2018 11:52, Lionel Landwerlin wrote:
> > > > > > On 12/06/18 11:37, Chris Wilson wrote:
> > > > > > > Quoting Lionel Landwerlin (2018-06-12 11:33:34)
> > > > > > > > On 12/06/18 10:20, Joonas Lahtinen wrote:
> > > > > > > > > Quoting Chris Wilson (2018-06-11 18:02:37)
> > > > > > > > > > Quoting Lionel Landwerlin (2018-06-11 14:46:07)
> > > > > > > > > > > On 11/06/18 13:10, Tvrtko Ursulin wrote:
> > > > > > > > > > > > On 30/05/2018 15:33, Lionel Landwerlin wrote:
> > > > > > > > > > > > > There are concerns about denial of service
> > > > > > > > > > > > > around the per
> > > > > > > > > > > > > context sseu
> > > > > > > > > > > > > configuration capability. In a previous
> > > > > > > > > > > > > commit introducing the
> > > > > > > > > > > > > capability we allowed it only for capable
> > > > > > > > > > > > > users. This changes
> > > > > > > > > > > > > adds a
> > > > > > > > > > > > > new debugfs entry to let any user configure
> > > > > > > > > > > > > its own context
> > > > > > > > > > > > > powergating setup.
> > > > > > > > > > > > 
> > > > > > > > > > > > As far as I understood it, Joonas' concerns
> > > > > > > > > > > > here are:
> > > > > > > > > > > > 
> > > > > > > > > > > > 1) That in the containers use case individual
> > > > > > > > > > > > containers
> > 
> > wouldn't
> > > be
> > > > > > > > > > > > able to turn on the sysfs toggle for them.
> > > > > > > > > > > > 
> > > > > > > > > > > > 2) That also in the containers use case if box
> > > > > > > > > > > > admin turns on
> > 
> > the
> > > > > > > > > > > > feature, some containers would potentially
> > > > > > > > > > > > start negatively
> > > > > > > > > > > > affecting
> > > > > > > > > > > > the others (via the accumulated cost of slice
> > > > > > > > > > > > re-configuration
> > 
> > on
> > > > > > > > > > > > context switching).
> > > > > > > > > > > > 
> > > > > > > > > > > > I am not familiar with typical container setups
> > > > > > > > > > > > to be
> > 
> > authoritative
> > > > > > > > > > > > here, but intuitively I find it reasonable that
> > > > > > > > > > > > a low-level
> > 
> > hardware
> > > > > > > > > > > > switch like this would be under the control of
> > > > > > > > > > > > a master domain
> > > > > > > > > > > > administrator. ("If you are installing our
> > > > > > > > > > > > product in the
> > 
> > container
> > > > > > > > > > > > environment, make sure your system
> > > > > > > > > > > > administrator enables
> > 
> > this
> > > > > > > > > > > > hardware
> > > > > > > > > > > > feature.", "Note to system administrators:
> > > > > > > > > > > > Enabling this
> > 
> > features
> > > > > > > > > > > > may
> > > > > > > > > > > > negatively affect the performance of other
> > > > > > > > > > > > containers.")
> > > > > > > > > > > > 
> > > > > > > > > > > > Alternative proposal is for the i915 to apply
> > > > > > > > > > > > an "or" filter on all
> > > > > > > > > > > > requested masks and in that way ensure 

Re: [Intel-gfx] [PATCH 8/8] drm/i915: Expose RPCS (SSEU) configuration to userspace

2018-05-08 Thread Rogozhkin, Dmitry V
>> Context of the question was whether you need to change slice configuration 
>> for a single GEM context between submitting batch buffers?

When we create a context we know the optimal slice count for it, and this 
optimal point does not change for this context in unperturbed conditions, i.e. 
when the context runs alone. However, there are critical use cases when we 
never run a single context; instead we run a few contexts in parallel, each 
with its own optimal slice count operating point. As a result, the major 
question is: what is the cost of a context switch with an associated slice 
configuration change, powering the slice(s) on or off? In the ideal situation, 
when the cost is 0, we don't need any change of slice configuration for a 
single GEM context between submitting batch buffers. The problem is that the 
cost is far from 0, and it is intolerable in the worst case, when we would 
switch at every next batch buffer. As a result, we are forced to have some 
negotiation channel between different contexts and make them agree on a single 
slice configuration which will persist for a reasonably long period of time, 
keeping the associated cost negligible in general. During this period we 
submit a number of batch buffers before the next reconfiguration attempt. So, 
that's the situation when we need to reconfigure the slice configuration for a 
single GEM context between submitting batches.
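
To make the trade-off concrete, a back-of-envelope model with purely
illustrative numbers (the real switch cost was not measured here): if a
reconfiguration costs C microseconds and the average batch takes B
microseconds, reconfiguring every N batches adds an overhead fraction of
C / (C + N * B).

#include <stdio.h>

int main(void)
{
	const double C = 500.0;   /* assumed switch cost, us */
	const double B = 2000.0;  /* assumed batch duration, us */

	for (int N = 1; N <= 256; N *= 4)
		printf("reconfig every %3d batches -> %5.2f%% overhead\n",
		       N, 100.0 * C / (C + N * B));
	return 0;
}

With these assumed numbers the overhead drops from 20% at N=1 to below
0.1% at N=256, which is the sense in which the negotiated configuration
must survive many batches.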

Dmitry.

-Original Message-
From: Tvrtko Ursulin [mailto:tvrtko.ursu...@linux.intel.com] 
Sent: Tuesday, May 8, 2018 1:25 AM
To: Rogozhkin, Dmitry V <dmitry.v.rogozh...@intel.com>; Landwerlin, Lionel G 
<lionel.g.landwer...@intel.com>; intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Expose RPCS (SSEU) configuration 
to userspace


On 08/05/2018 05:04, Rogozhkin, Dmitry V wrote:
>>>  I'm pretty sure Dmitry wants dynamic configurations.
> 
> Yes, I'm afraid we really need dynamic slice configurations for media.

Context of the question was whether you need to change slice configuration for 
a single GEM context between submitting batch buffers?

Regards,

Tvrtko


> *From:*Landwerlin, Lionel G
> *Sent:* Friday, May 4, 2018 9:25 AM
> *To:* Tvrtko Ursulin <tvrtko.ursu...@linux.intel.com>; 
> intel-gfx@lists.freedesktop.org; Rogozhkin, Dmitry V 
> <dmitry.v.rogozh...@intel.com>
> *Subject:* Re: [Intel-gfx] [PATCH 8/8] drm/i915: Expose RPCS (SSEU) 
> configuration to userspace
> 
> On 03/05/18 18:18, Tvrtko Ursulin wrote:
> 
>    +int intel_lr_context_set_sseu(struct i915_gem_context *ctx,
> +  struct intel_engine_cs *engine,
> +  struct i915_gem_context_sseu *sseu)
> +{
> +    struct drm_i915_private *dev_priv = ctx->i915;
> +    struct intel_context *ce;
> +    enum intel_engine_id id;
> +    int ret;
> +
> +    lockdep_assert_held(&dev_priv->drm.struct_mutex);
> +
> +    if (memcmp(sseu, &ctx->engine[engine->id].sseu,
> sizeof(*sseu)) == 0)
> +    return 0;
> +
> +    /*
> + * We can only program this on render ring.
> + */
> +    ce = &ctx->engine[RCS];
> +
> +    if (ce->pin_count) { /* Assume that the context is active! */
> +    ret = i915_gem_switch_to_kernel_context(dev_priv);
> +    if (ret)
> +    return ret;
> +
> +    ret = i915_gem_wait_for_idle(dev_priv,
> + I915_WAIT_INTERRUPTIBLE |
> + I915_WAIT_LOCKED);
> 
> 
> Could we consider the alternative of only allowing this to be
> configured on context create? That way we would not need to idle the
> GPU every time userspace decides to fiddle with it. It is
> unprivileged so quite an easy way for random app to ruin GPU
> performance for everyone.
> 
> Regards,
> 
> Tvrtko
> 
> I'm pretty sure Dmitry wants dynamic configurations.
> 
> Dmitry?
> 


Re: [Intel-gfx] [PATCH 8/8] drm/i915: Expose RPCS (SSEU) configuration to userspace

2018-05-07 Thread Rogozhkin, Dmitry V
>> I'm pretty sure Dmitry wants dynamic configurations.

Yes, I'm afraid we really need dynamic slice configurations for media.

From: Landwerlin, Lionel G
Sent: Friday, May 4, 2018 9:25 AM
To: Tvrtko Ursulin <tvrtko.ursu...@linux.intel.com>; 
intel-gfx@lists.freedesktop.org; Rogozhkin, Dmitry V 
<dmitry.v.rogozh...@intel.com>
Subject: Re: [Intel-gfx] [PATCH 8/8] drm/i915: Expose RPCS (SSEU) configuration 
to userspace

On 03/05/18 18:18, Tvrtko Ursulin wrote:
  +int intel_lr_context_set_sseu(struct i915_gem_context *ctx,
+  struct intel_engine_cs *engine,
+  struct i915_gem_context_sseu *sseu)
+{
+struct drm_i915_private *dev_priv = ctx->i915;
+struct intel_context *ce;
+enum intel_engine_id id;
+int ret;
+
+lockdep_assert_held(&dev_priv->drm.struct_mutex);
+
+if (memcmp(sseu, &ctx->engine[engine->id].sseu, sizeof(*sseu)) == 0)
+return 0;
+
+/*
+ * We can only program this on render ring.
+ */
+ce = &ctx->engine[RCS];
+
+if (ce->pin_count) { /* Assume that the context is active! */
+ret = i915_gem_switch_to_kernel_context(dev_priv);
+if (ret)
+return ret;
+
+ret = i915_gem_wait_for_idle(dev_priv,
+ I915_WAIT_INTERRUPTIBLE |
+ I915_WAIT_LOCKED);

Could we consider the alternative of only allowing this to be configured on 
context create? That way we would not need to idle the GPU every time userspace 
decides to fiddle with it. It is unprivileged so quite an easy way for random 
app to ruin GPU performance for everyone.

Regards,

Tvrtko


I'm pretty sure Dmitry wants dynamic configurations.

Dmitry?


Re: [Intel-gfx] vlv punit and sideband tidy

2018-03-07 Thread Rogozhkin, Dmitry V
On Wed, 2018-03-07 at 19:41 +, Chris Wilson wrote:
> Mika
> believes that if we keep the cpu in C0 whilst the gpu is busy, then
> it
> behaves much better -- but that is a very tough sell

Chris, Mika, I wonder: does the i915 driver already try to keep the CPU
in C0 at the moment, or are you just considering this? If it is doing
anything to the CPU while the GPU is busy, could you, please, clarify
what exactly and point me to the code/key patches? Please answer from
the BDW/SKL gen perspective rather than VLV.

Dmitry.


Re: [Intel-gfx] [RFC] drm/i915: Add a new modparam for customized ring multiplier

2017-12-27 Thread Rogozhkin, Dmitry V
>> I definitely asked what happens if the GT request is bigger than the IA 
>> request. But that was a year ago and I don't remember the answer. Let me ask 
>> again. I will mail back in a few days.

Hi Chris, here is the response.

Question was: "Whether we can meet with the RING transition penalty (at least 
theoretically) if we will have GT request higher than IA request with the 
dominant IA load and tiny GT load, i.e. reverted situation of what we have 
actually faced? For example, if we will try to pin IA frequency to 800MHz (x1 
multiplier) and GT frequency to 700MHz (x2 multiplier): in that case we will 
have requests for ring 800 vs. 1400."

Answer is: "In this case, if the GT will toggle between RC0 and RC6, it will 
force ring frequency to toggle between 800 and 1400, which in the toggling time 
will stall IA execution. This will lead to performance loss." However, this is 
a case if we have really few toggle events within few milliseconds. It is quite 
probable that GT driver will not allow such behavior to happen if it simply 
doesn't often toggle between RC0 and RC6. Considering that GT driver probably 
handles much less interrupts than IA, this can be the case. So, I think Chris, 
that's now question to you: how often you toggle between RC0 and RC6 to see the 
reverted issue to happen? If you don't toggle much, then RING will simply 
remain on 1400 almost all the time and you will see no issue.

Again, I remind that's the talk about Gen9 only.

Dmitry.

-Original Message-
From: Intel-gfx [mailto:intel-gfx-boun...@lists.freedesktop.org] On Behalf Of 
Rogozhkin, Dmitry V
Sent: Tuesday, December 26, 2017 9:39 AM
To: Chris Wilson <ch...@chris-wilson.co.uk>; Li, Yaodong 
<yaodong...@intel.com>; intel-gfx@lists.freedesktop.org
Cc: Widawsky, Benjamin <benjamin.widaw...@intel.com>
Subject: Re: [Intel-gfx] [RFC] drm/i915: Add a new modparam for customized ring 
multiplier

>> To clarify, the HW will flip between the two GT/IA requests rather than 
>> stick to the highest?

Yes, it will flip on Gen9. On Gen8 there was a HW mechanism which flattened 
that, but it was removed/substituted in Gen9. In Gen10 it was tuned to close 
the mentioned issue.

>> Do you know anything about the opposite position. I heard a suggestion that 
>> simply increasing the ringfreq universally caused thermal throttling in some 
>> other workloads. Do you have any knowledge of those?

Initially we tried to just increase the GT multiplier to x3 and stepped into 
throttling. Thus, we introduced the parameter to be able to mitigate all that 
depending on the SKU and user needs. I definitely asked what happens if the GT 
request is bigger than the IA request. But that was a year ago and I don't 
remember the answer. Let me ask again. I will mail back in a few days.

>> You are thinking of plugging into intel_pstate to make it smarter for ia 
>> freq transitions?

Yep. This seems like a correct step, giving some automatic support instead of 
a parameter/hardcoded multiplier.

Dmitry.

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Tuesday, December 26, 2017 8:59 AM
To: Rogozhkin, Dmitry V <dmitry.v.rogozh...@intel.com>; Li, Yaodong 
<yaodong...@intel.com>; intel-gfx@lists.freedesktop.org
Cc: Gong, Zhipeng <zhipeng.g...@intel.com>; Widawsky, Benjamin 
<benjamin.widaw...@intel.com>; Mateo Lozano, Oscar <oscar.ma...@intel.com>; 
Kamble, Sagar A <sagar.a.kam...@intel.com>; Li, Yaodong <yaodong...@intel.com>
Subject: RE: [RFC] drm/i915: Add a new modparam for customized ring multiplier

Quoting Rogozhkin, Dmitry V (2017-12-26 16:39:23)
> Clarification on the issue. Consider that you have a massive load on GT and 
> just tiny one on IA. If GT will program the RING frequency to be lower than 
> IA frequency, then you will fall into the situation when RING frequency 
> constantly transits from GT to IA level and back. Each transition of a RING 
> frequency is a full system stall. If you will have "good" transition rate 
> with few transitions per few milliseconds you will lose ~10% of performance. 
> That's the case for media workloads when you easily can step into this since 
> 1) media utilizes few GPU engines and with few parallel workloads you can 
> make sure that at least 1 engine is _always_ doing something, 2) media BB are 
> relatively small, so you have regular wakeups of the IA to manage requests. 
> This will affect Gen9 platforms due to HW design change (we've spot this in 
> SKL). This will not happen in Gen8 (old HW design). This will be fixed in 
> Gen10+ (CNL+).

To clarify, the HW will flip between the two GT/IA requests rather than stick 
to the highest? Iirc, the expectation was that we were setting a requested 
minimum frequency for the ring/ia based off the gpu freq.

> On SKL we ran into th

Re: [Intel-gfx] [RFC] drm/i915: Add a new modparam for customized ring multiplier

2017-12-26 Thread Rogozhkin, Dmitry V
>> To clarify, the HW will flip between the two GT/IA requests rather than 
>> stick to the highest?

Yes, it will flip on Gen9. On Gen8 there was a HW mechanism which flattened 
that, but it was removed/substituted in Gen9. In Gen10 it was tuned to close 
the mentioned issue.

>> Do you know anything about the opposite position. I heard a suggestion that 
>> simply increasing the ringfreq universally caused thermal throttling in some 
>> other workloads. Do you have any knowledge of those?

Initially we tried to just increase the GT multiplier to x3 and stepped into 
throttling. Thus, we introduced the parameter to be able to mitigate all that 
depending on the SKU and user needs. I definitely asked what happens if the GT 
request is bigger than the IA request. But that was a year ago and I don't 
remember the answer. Let me ask again. I will mail back in a few days.

>> You are thinking of plugging into intel_pstate to make it smarter for ia 
>> freq transitions?

Yep. This seems like a correct step, giving some automatic support instead of 
a parameter/hardcoded multiplier.

Dmitry.

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Tuesday, December 26, 2017 8:59 AM
To: Rogozhkin, Dmitry V <dmitry.v.rogozh...@intel.com>; Li, Yaodong 
<yaodong...@intel.com>; intel-gfx@lists.freedesktop.org
Cc: Gong, Zhipeng <zhipeng.g...@intel.com>; Widawsky, Benjamin 
<benjamin.widaw...@intel.com>; Mateo Lozano, Oscar <oscar.ma...@intel.com>; 
Kamble, Sagar A <sagar.a.kam...@intel.com>; Li, Yaodong <yaodong...@intel.com>
Subject: RE: [RFC] drm/i915: Add a new modparam for customized ring multiplier

Quoting Rogozhkin, Dmitry V (2017-12-26 16:39:23)
> Clarification on the issue. Consider that you have a massive load on GT and 
> just tiny one on IA. If GT will program the RING frequency to be lower than 
> IA frequency, then you will fall into the situation when RING frequency 
> constantly transits from GT to IA level and back. Each transition of a RING 
> frequency is a full system stall. If you will have "good" transition rate 
> with few transitions per few milliseconds you will lose ~10% of performance. 
> That's the case for media workloads when you easily can step into this since 
> 1) media utilizes few GPU engines and with few parallel workloads you can 
> make sure that at least 1 engine is _always_ doing something, 2) media BB are 
> relatively small, so you have regular wakeups of the IA to manage requests. 
> This will affect Gen9 platforms due to HW design change (we've spot this in 
> SKL). This will not happen in Gen8 (old HW design). This will be fixed in 
> Gen10+ (CNL+).

To clarify, the HW will flip between the two GT/IA requests rather than stick 
to the highest? Iirc, the expectation was that we were setting a requested 
minimum frequency for the ring/ia based off the gpu freq.

> On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. 
> Multipliers were x2 for GT, x1 for IA.

Basically, with the GPU clocked to mid frequency, memory throughput is 
insufficient to keep the fixed functions occupied, and you need to increase the 
ring frequency. Is there ever a case where we don't need max ring frequency? 
(Perhaps we still need to set low frequency for GT
idle?) I guess media is more susceptible to this as that workload should be 
sustainable at reduced clocks, GL et al are much more likely to keep the clocks 
ramped all the way up.

Do you know anything about the opposite position. I heard a suggestion that 
simply increasing the ringfreq universally caused thermal throttling in some 
other workloads. Do you have any knowledge of those?
 
> So, effectively, what we need to do is to make sure that RING frequency 
> request from GT is _not_ below the request from IA. If IA requests 2GHz, we 
> can't request 1.4GHz, we need request at least 2GHz. Multiplier patch was 
> intended to do exactly that, but manually. Can  we somehow automate that 
> managing IA frequency requests to the RING?

You are thinking of plugging into intel_pstate to make it smarter for ia freq 
transitions? That seems possible, certainly. I'm not sure if the ring frequency 
is actually poked from anywhere else in the kernel, would be interesting to 
find out.
-Chris


Re: [Intel-gfx] [RFC] drm/i915: Add a new modparam for customized ring multiplier

2017-12-26 Thread Rogozhkin, Dmitry V
Clarification on the issue. Consider that you have a massive load on the GT 
and just a tiny one on the IA. If the GT programs the RING frequency to be 
lower than the IA frequency, then you fall into a situation where the RING 
frequency constantly transitions between the GT and IA levels. Each transition 
of the RING frequency is a full system stall. With a "good" transition rate of 
a few transitions per few milliseconds you lose ~10% of performance. That's 
the case for media workloads, where you can easily step into this since 1) 
media utilizes few GPU engines, and with a few parallel workloads you can make 
sure that at least 1 engine is _always_ doing something, and 2) media batch 
buffers are relatively small, so you have regular wakeups of the IA to manage 
requests. This affects Gen9 platforms due to a HW design change (we've spotted 
this on SKL). It does not happen on Gen8 (old HW design) and will be fixed in 
Gen10+ (CNL+).

On SKL we ran into this with the GPU frequency pinned to 700MHz, CPU to 2GHz. 
Multipliers were x2 for GT, x1 for IA.

So, effectively, what we need to do is make sure that the RING frequency 
request from the GT is _not_ below the request from the IA. If the IA requests 
2GHz, we can't request 1.4GHz; we need to request at least 2GHz. The 
multiplier patch was intended to do exactly that, but manually. Can we somehow 
automate that, managing IA frequency requests to the RING?
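
A sketch of that policy (invented names, plain userspace C rather than
kernel code): whatever ring frequency the GT side derives from its
multiplier, never request below what the IA is asking for.

#include <stdio.h>

static unsigned int ring_freq_request(unsigned int gt_mhz,
				      unsigned int gt_mult,
				      unsigned int ia_mhz)
{
	unsigned int from_gt = gt_mhz * gt_mult;

	/* Clamp the GT-derived request up to the IA request. */
	return from_gt > ia_mhz ? from_gt : ia_mhz;
}

int main(void)
{
	/* The SKL case above: GT 700MHz x2 vs. IA at 2GHz. */
	printf("%u MHz\n", ring_freq_request(700, 2, 2000)); /* 2000, not 1400 */
	return 0;
}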

Dmitry.

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Tuesday, December 26, 2017 6:36 AM
To: Li, Yaodong <yaodong...@intel.com>; intel-gfx@lists.freedesktop.org
Cc: Gong, Zhipeng <zhipeng.g...@intel.com>; Widawsky, Benjamin 
<benjamin.widaw...@intel.com>; Mateo Lozano, Oscar <oscar.ma...@intel.com>; 
Kamble, Sagar A <sagar.a.kam...@intel.com>; Rogozhkin, Dmitry V 
<dmitry.v.rogozh...@intel.com>; Li, Yaodong <yaodong...@intel.com>
Subject: Re: [RFC] drm/i915: Add a new modparam for customized ring multiplier

Quoting Chris Wilson (2017-12-18 21:47:25)
> Quoting Jackie Li (2017-12-18 21:22:08)
> > From: Zhipeng Gong <zhipeng.g...@intel.com>
> > 
> > SKL platforms requires a higher ring multiplier when there's massive 
> > GPU load. Current driver doesn't provide a way to override the ring 
> > multiplier.
> > 
> > This patch adds a new module parameter to allow the overriding of 
> > ring multiplier for Gen9 platforms.
> 
> So the default ring-scaling is not good enough, the first thing we do 
> is to try and ensure the defaults work for nearly all use cases. My 
> impression is that you want a nonlinear scalefactor, low power 
> workloads don't try and ramp up the ring frequencies as aggressively, 
> high power workloads try hard for higher frequencies, and then get 
> throttled back harder as well. How well can we autotune it? What 
> events tells us if the ratio is too high or too low?

One thing that came to mind is that we don't know the min/max ring frequencies 
and just program them blindly. Is it the case that at max gpu freq, there is 
still headroom on the ring freq, or do you require a steeper ramp so that you 
hit the max ringfreq earlier for your workload (which then presumably can run 
at less than max gpufreq, so pushing power elsewhere).
-Chris


Re: [Intel-gfx] [PATCH v5] drm/i915: Restore GT performance in headless mode with DMC loaded

2017-12-05 Thread Rogozhkin, Dmitry V
On Tue, 2017-12-05 at 15:09 +0200, Imre Deak wrote:
> On Sat, Dec 02, 2017 at 02:05:42AM +0200, Rogozhkin, Dmitry V wrote:
> > On Thu, 2017-11-30 at 13:19 +0200, Imre Deak wrote:
> > > > > +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> > > > > +   (HAS_CSR(dev_priv) && IS_GEN9(dev_priv) && !
> > > IS_SKYLAKE(dev_priv))
> > > 
> > > Nitpick: could be just !IS_SKYLAKE(), but works in the above way too.
> > > For all other platforms the GT_IRQ domain won't be mapped making
> > > display_power_get/put on those just a domain ref inc/dec, otherwise a
> > > nop.
> > 
> > Why not +&& !IS_BROXTON(dev_priv) by the way?
> 
> We have the same slow-down problem on APL/BXT (and we don't have the DC6
> corruption problem there).

Ok, this makes sense now. Thank you for the clarification.

> 
> --Imre



Re: [Intel-gfx] [PATCH v5] drm/i915: Restore GT performance in headless mode with DMC loaded

2017-12-01 Thread Rogozhkin, Dmitry V
On Thu, 2017-11-30 at 13:19 +0200, Imre Deak wrote:
> > > +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> > > +   (HAS_CSR(dev_priv) && IS_GEN9(dev_priv) && !
> IS_SKYLAKE(dev_priv))
> 
> Nitpick: could be just !IS_SKYLAKE(), but works in the above way too.
> For all other platforms the GT_IRQ domain won't be mapped making
> display_power_get/put on those just a domain ref inc/dec, otherwise a
> nop.

Why not +&& !IS_BROXTON(dev_priv) by the way?


Re: [Intel-gfx] [PATCH v5] drm/i915: Restore GT performance in headless mode with DMC loaded

2017-12-01 Thread Rogozhkin, Dmitry V
On Thu, 2017-11-30 at 13:19 +0200, Imre Deak wrote:
> On Thu, Nov 30, 2017 at 09:45:25AM +, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-11-30 09:18:20)
> > > From: Tvrtko Ursulin 
> > > 
> > > It seems that the DMC likes to transition between the DC states a lot when
> > > there are no connected displays (no active power domains) during command
> > > submission.
> > > 
> > > This activity on DC states has a negative impact on the performance of the
> > > chip with huge latencies observed in the interrupt handlers and elsewhere.
> > > Simple tests like igt/gem_latency -n 0 are slowed down by a factor of
> > > eight.
> > > 
> > > Work around it by introducing a new power domain named,
> > > POWER_DOMAIN_GT_IRQ, associtated with the "DC off" power well, which is
> > > held for the duration of command submission activity.
> > > 
> > > v2:
> > >  * Add commit text as comment in i915_gem_mark_busy. (Chris Wilson)
> > >  * Protect macro body with braces. (Jani Nikula)
> > > 
> > > v3:
> > >  * Add dedicated power domain for clarity. (Chris, Imre)
> > >  * Commit message and comment text updates.
> > >  * Apply to all big-core GEN9 parts apart for Skylake which is pending DMC
> > >firmware release.
> > > 
> > > v4:
> > >  * Power domain should be inner to device runtime pm. (Chris)
> > >  * Simplify NEEDS_CSR_GT_PERF_WA macro. (Chris)
> > >  * Handle async DMC loading by moving the GT_IRQ power domain logic into
> > >intel_runtime_pm. (Daniel, Chris)
> > >  * Include small core GEN9 as well. (Imre)
> > > 
> > > v5
> > >  * Special handling for async DMC load is not needed since on failure the
> > >power domain reference is kept permanently taken. (Imre)
> > > 
> > > Signed-off-by: Tvrtko Ursulin 
> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
> > > Testcase: igt/gem_exec_nop/headless
> > > Cc: Imre Deak 
> > > Acked-by: Chris Wilson  (v2)
> > > Cc: Chris Wilson 
> > > Cc: Dmitry Rogozhkin 
> > > ---
> > >  drivers/gpu/drm/i915/i915_drv.h |  4 
> > >  drivers/gpu/drm/i915/i915_gem.c |  3 +++
> > >  drivers/gpu/drm/i915/i915_gem_request.c | 14 ++
> > >  drivers/gpu/drm/i915/intel_runtime_pm.c | 14 ++
> > >  4 files changed, 35 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > index bddd65839f60..59cf11dd5d3b 100644
> > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > @@ -398,6 +398,7 @@ enum intel_display_power_domain {
> > > POWER_DOMAIN_AUX_D,
> > > POWER_DOMAIN_GMBUS,
> > > POWER_DOMAIN_MODESET,
> > > +   POWER_DOMAIN_GT_IRQ,
> > > POWER_DOMAIN_INIT,
> > >  
> > > POWER_DOMAIN_NUM,
> > > @@ -3285,6 +3286,9 @@ intel_info(const struct drm_i915_private *dev_priv)
> > >  #define GT_FREQUENCY_MULTIPLIER 50
> > >  #define GEN9_FREQ_SCALER 3
> > >  
> > > +#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
> > > +   (HAS_CSR(dev_priv) && IS_GEN9(dev_priv) && !IS_SKYLAKE(dev_priv))
> 
> Nitpick: could be just !IS_SKYLAKE(), but works in the above way too.
> For all other platforms the GT_IRQ domain won't be mapped making
> display_power_get/put on those just a domain ref inc/dec, otherwise a
> nop.

Folks, is my understanding correct that once the SKL DMC is merged we
just remove !IS_SKYLAKE from the above check to get the performance fix
for SKL, and nothing else in the patch changes? I am asking because we
already need the SKL patch on the plate to verify the fix sooner rather
than later. From this perspective, can we have one more patch in the
series dedicated to fixing SKL right now? After all, the SKL DMC is
already available in drm-firmware; those who need it can try it.
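
If that understanding is right, the eventual SKL enabling would
presumably reduce to dropping the platform exclusion from the macro
quoted above (a sketch, not a tested patch):

#define NEEDS_CSR_GT_PERF_WA(dev_priv) \
	(HAS_CSR(dev_priv) && IS_GEN9(dev_priv))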

> 
> > > +
> > >  #include "i915_trace.h"
> > >  
> > >  static inline bool intel_vtd_active(void)
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index 354b0546a191..c6067cba1dca 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -3381,6 +3381,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> > >  
> > > if (INTEL_GEN(dev_priv) >= 6)
> > > gen6_rps_idle(dev_priv);
> > > +
> > > +   intel_display_power_put(dev_priv, POWER_DOMAIN_GT_IRQ);
> > > +
> > > intel_runtime_pm_put(dev_priv);
> > >  out_unlock:
> > > mutex_unlock(&dev_priv->drm.struct_mutex);
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > > index a90bdd26571f..c28a4ceb016d 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > > @@ -252,6 +252,20 @@ static void mark_busy(struct drm_i915_private *i915)
> > > 

Re: [Intel-gfx] [PATCH 2/2] drm/i915/pmu: Add queued counter

2017-11-22 Thread Rogozhkin, Dmitry V
On Wed, 2017-11-22 at 12:46 +, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> We add a PMU counter to expose the number of requests currently submitted
> to the GPU, plus the number of runnable requests waiting on GPU time.
> 
> This is useful to analyze the overall load of the system.
> 
> Signed-off-by: Tvrtko Ursulin 
> ---
>  drivers/gpu/drm/i915/i915_pmu.c | 30 +-
>  include/uapi/drm/i915_drm.h |  6 ++
>  2 files changed, 31 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index 112243720ff3..b2b4b32af35f 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -36,7 +36,8 @@
>  #define ENGINE_SAMPLE_MASK \
>   (BIT(I915_SAMPLE_BUSY) | \
>BIT(I915_SAMPLE_WAIT) | \
> -  BIT(I915_SAMPLE_SEMA))
> +  BIT(I915_SAMPLE_SEMA) | \
> +  BIT(I915_SAMPLE_QUEUED))
>  
>  #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
>  
> @@ -223,6 +224,12 @@ static void engines_sample(struct drm_i915_private *dev_priv)
>  
> 		update_sample(&engine->pmu.sample[I915_SAMPLE_SEMA],
> 			      PERIOD, !!(val & RING_WAIT_SEMAPHORE));
> +
> +		if (engine->pmu.enable & BIT(I915_SAMPLE_QUEUED))
> +			update_sample(&engine->pmu.sample[I915_SAMPLE_QUEUED],
> +				      1 / I915_SAMPLE_QUEUED_SCALE,
> +				      engine->queued +
> +				      (last_seqno - current_seqno));
>   }
>  
>   if (fw)
> @@ -310,6 +317,10 @@ static int engine_event_init(struct perf_event *event)
>   if (INTEL_GEN(i915) < 6)
>   return -ENODEV;
>   break;
> + case I915_SAMPLE_QUEUED:
> + if (INTEL_GEN(i915) < 8)
> + return -ENODEV;
> + break;
>   default:
>   return -ENOENT;
>   }
> @@ -399,6 +410,10 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>   } else if (sample == I915_SAMPLE_BUSY &&
>  engine->pmu.busy_stats) {
>   val = ktime_to_ns(intel_engine_get_busy_time(engine));
> +	} else if (sample == I915_SAMPLE_QUEUED) {
> +		val = div_u64(engine->pmu.sample[I915_SAMPLE_QUEUED].cur,
> +			      FREQUENCY);
>   } else {
>   val = engine->pmu.sample[sample].cur;
>   }
> @@ -679,13 +694,18 @@ static ssize_t i915_pmu_event_show(struct device *dev,
>   I915_EVENT_STR(_name.unit, _unit)
>  
>  #define I915_ENGINE_EVENT(_name, _class, _instance, _sample) \
> - I915_EVENT_ATTR(_name, __I915_PMU_ENGINE(_class, _instance, _sample)), \
> + I915_EVENT_ATTR(_name, __I915_PMU_ENGINE(_class, _instance, _sample))
> +
> +#define I915_ENGINE_EVENT_NS(_name, _class, _instance, _sample) \
> + I915_ENGINE_EVENT(_name, _class, _instance, _sample), \
>   I915_EVENT_STR(_name.unit, "ns")
>  
>  #define I915_ENGINE_EVENTS(_name, _class, _instance) \
> -	I915_ENGINE_EVENT(_name##_instance-busy, _class, _instance, I915_SAMPLE_BUSY), \
> -	I915_ENGINE_EVENT(_name##_instance-sema, _class, _instance, I915_SAMPLE_SEMA), \
> -	I915_ENGINE_EVENT(_name##_instance-wait, _class, _instance, I915_SAMPLE_WAIT)
> +	I915_ENGINE_EVENT_NS(_name##_instance-busy, _class, _instance, I915_SAMPLE_BUSY), \
> +	I915_ENGINE_EVENT_NS(_name##_instance-sema, _class, _instance, I915_SAMPLE_SEMA), \
> +	I915_ENGINE_EVENT_NS(_name##_instance-wait, _class, _instance, I915_SAMPLE_WAIT), \
> +	I915_ENGINE_EVENT(_name##_instance-queued, _class, _instance, I915_SAMPLE_QUEUED), \
> +	I915_EVENT_STR(_name##_instance-queued.scale, __stringify(I915_SAMPLE_QUEUED_SCALE))

We expose queued as an "instant" metric, i.e. that's a number of
requests on the very moment when we query the metric, i.e. that's not an
ever growing counter - is that right? I doubt such a metric will make
sense for perf-stat. Can we somehow restrict it to be queried by uAPI
only and avoid perf-stat for it?
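
For context, a minimal sketch of reading such a counter directly through
perf_event_open(2) instead of perf-stat; the PMU type must come from
/sys/bus/event_source/devices/i915/type and the config from the
i915_drm.h encoding, so both constants below are placeholders:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PMU_TYPE 10   /* placeholder: read from sysfs */
#define CONFIG   0x03 /* placeholder: engine sample encoding */

int main(void)
{
	struct perf_event_attr attr;
	uint64_t value;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PMU_TYPE;
	attr.size = sizeof(attr);
	attr.config = CONFIG;

	/* i915 events are system-wide: pid == -1, a single CPU. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	if (read(fd, &value, sizeof(value)) == sizeof(value))
		/* Apply the advertised .scale (1e-2 above) by hand. */
		printf("raw %llu -> %.2f\n",
		       (unsigned long long)value, value * 1e-2);
	close(fd);
	return 0;
}

A one-shot read like this (or two reads averaged by hand) matches the
"instant" semantics better than perf-stat's rate-style output.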

>  
>  static struct attribute *i915_pmu_events_attrs[] = {
>   I915_ENGINE_EVENTS(rcs, I915_ENGINE_CLASS_RENDER, 0),
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 915a6e85a855..20ee668d1428 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -111,9 +111,12 @@ enum drm_i915_pmu_engine_sample {
>   I915_SAMPLE_BUSY = 0,
>   I915_SAMPLE_WAIT = 1,
>   I915_SAMPLE_SEMA = 2,
> + I915_SAMPLE_QUEUED = 3,
>   I915_ENGINE_SAMPLE_MAX /* non-ABI */
>  };
>  
> +#define I915_SAMPLE_QUEUED_SCALE 1e-2 /* No braces please. */
> +
>  #define I915_PMU_SAMPLE_BITS (4)
>  #define I915_PMU_SAMPLE_MASK (0xf)
>  #define I915_PMU_SAMPLE_INSTANCE_BITS (8)
> 

Re: [Intel-gfx] [PATCH 06/10] drm/i915/pmu: Wire up engine busy stats to PMU

2017-10-04 Thread Rogozhkin, Dmitry V
Nice data, thank you!

-Original Message-
From: Tvrtko Ursulin [mailto:tvrtko.ursu...@linux.intel.com] 
Sent: Wednesday, October 4, 2017 10:35 AM
To: Tvrtko Ursulin <tursu...@ursulin.net>; Intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <ch...@chris-wilson.co.uk>; Rogozhkin, Dmitry V 
<dmitry.v.rogozh...@intel.com>
Subject: Re: [Intel-gfx] [PATCH 06/10] drm/i915/pmu: Wire up engine busy stats 
to PMU


On 29/09/2017 13:34, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
> 
> We can use engine busy stats instead of the sampling timer for better 
> accuracy.
> 
> By doing this we replace the stochastic sampling with a busyness metric 
> derived directly from engine activity. This is context switch 
> interrupt driven, so as accurate as we can get from software tracking.

To quantify the precision benefit with a graph:

https://people.freedesktop.org/~tursulin/sampling-vs-busy-stats.png

This represents the render engine busyness on neverball menu screen looping a 
few times. PMU sampling interval is 100ms. It was generated from a special 
build with both PMU counters running in parallel so the data is absolutely 
aligned.

Anomaly of busyness going over 100% is caused by me not bothering to use the 
actual timestamps got from PMU, but using fixed interval (perf stat -I 100). 
But I don't think it matters for the rough comparison and looking at the noise 
and error in the sampling version.

> As a secondary benefit, we can also not run the sampling timer in 
> cases only busyness metric is enabled.

For this one data is not showing anything significant. I've seen the 
i915_sample function use <0.1% of CPU time but that's it. Probably with the 
original version which used MMIO it was much worse.

Regards,

Tvrtko

> v2: Rebase.
> v3:
>   * Rebase, comments.
>   * Leave engine busyness controls out of workers.
> v4: Checkpatch cleanup.
> v5: Added comment to pmu_needs_timer change.
> v6:
>   * Rebase.
>   * Fix style of some comments. (Chris Wilson)
> v7: Rebase and commit message update. (Chris Wilson)
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
> Reviewed-by: Chris Wilson <ch...@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/i915_pmu.c | 37 
> +++--
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  5 +
>   2 files changed, 40 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index f341c904c159..93c0e7ec7d75 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -90,6 +90,11 @@ static unsigned int event_enabled_bit(struct perf_event 
> *event)
>   return config_enabled_bit(event->attr.config);
>   }
>   
> +static bool supports_busy_stats(void)
> +{
> +	return i915_modparams.enable_execlists;
> +}
> +
>   static bool pmu_needs_timer(struct drm_i915_private *i915, bool gpu_active)
>   {
>   u64 enable;
> @@ -115,6 +120,12 @@ static bool pmu_needs_timer(struct drm_i915_private 
> *i915, bool gpu_active)
>*/
>   if (!gpu_active)
>   enable &= ~ENGINE_SAMPLE_MASK;
> +	/*
> +	 * Also, when software busyness tracking is available, we do not
> +	 * need the timer for the I915_SAMPLE_BUSY counter.
> +	 */
> + else if (supports_busy_stats())
> + enable &= ~BIT(I915_SAMPLE_BUSY);
>   
>   /*
>* If some bits remain it means we need the sampling timer running.
> @@ -362,6 +373,9 @@ static u64 __i915_pmu_event_read(struct perf_event 
> *event)
>   
>   if (WARN_ON_ONCE(!engine)) {
>   /* Do nothing */
> + } else if (sample == I915_SAMPLE_BUSY &&
> +engine->pmu.busy_stats) {
> + val = ktime_to_ns(intel_engine_get_busy_time(engine));
>   } else {
>   val = engine->pmu.sample[sample].cur;
>   }
> @@ -398,6 +412,12 @@ static void i915_pmu_event_read(struct perf_event *event)
> 	local64_add(new - prev, &event->count);
>   }
>   
> +static bool engine_needs_busy_stats(struct intel_engine_cs *engine)
> +{
> +	return supports_busy_stats() &&
> +	       (engine->pmu.enable & BIT(I915_SAMPLE_BUSY));
> +}
> +
>   static void i915_pmu_enable(struct perf_event *event)
>   {
>   struct drm_i915_private *i915 =
> @@ -437,7 +457,14 @@ static void i915_pmu_enable(struct perf_event 
> *event)
>   
>   GEM_BUG_ON(sample >= I915_PMU_SAMPLE_BITS);
>   GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
> - engine->pmu.enable_count[sample]++;
> +

Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-10-02 Thread Rogozhkin, Dmitry V
Hi Peter, friendly reminder. Could you, please, respond?

-Original Message-
From: Rogozhkin, Dmitry V 
Sent: Wednesday, September 20, 2017 1:15 PM
To: Rogozhkin, Dmitry V <dmitry.v.rogozh...@intel.com>; pet...@infradead.org
Cc: Intel-gfx@lists.freedesktop.org
Subject: RE: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf 
queries

Hi Peter, could you, please, comment on the below?

-Original Message-
From: Intel-gfx [mailto:intel-gfx-boun...@lists.freedesktop.org] On Behalf Of 
Rogozhkin, Dmitry V
Sent: Wednesday, September 13, 2017 4:06 PM
To: pet...@infradead.org
Cc: Intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf 
queries

On Tue, 2017-08-29 at 21:21 +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:16:31PM +, Rogozhkin, Dmitry V wrote:
> > > Pretty strict, people tend to get fairly upset every time we leak stuff.
> > > In fact Debian and Android carry a perf_event_paranoid patch that 
> > > default disables _everything_ :-(
> > 
> > Can you say more on that for Debian and Android? What exactly they do?
> > What is the value of perf_event_paranoid there? They disable 
> > everything even for root and CAP_SYS_ADMIN? But still they don't 
> > remove this from kernel on compilation stage, right? So users can 
> > explicitly change perf_event_paranoid to the desired value?
> 
> They introduce (and default to) perf_event_paranoid = 3. Which 
> disallows everything for unpriv user, root can still do things IIRC, 
> I'd have to dig out the patch.
> 
> This way apps have no access to the syscall, but you can enable it 
> using ADB by lowering the setting. So developers still have access, 
> but regular apps do not.
> 

Hi, Peter.

How would you feel about the following idea (or something close to it)?
1. We introduce one more level, perf_event_paranoid=4 (or =3, I am not sure 
whether the Debian/Android =3 is considered uAPI), which would mean:
"disallow kernel profiling for unpriv users, but let individual kernel modules 
have their own settings".
2. We add an i915 PMU custom setting,
"/sys/module/i915/parameters/perf_event_paranoid", which will be in effect only 
if the global perf_event_paranoid=4 (or =3) and will prevail over the global 
setting.

Would anything like that be acceptable upstream? This would permit customers to 
configure i915 PMU support for unpriv users separately from the rest of the PMU 
subsystem.
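
To illustrate the proposed precedence (invented names; the finer
meanings of the existing paranoid levels are collapsed to a single
check for brevity):

#include <stdbool.h>
#include <stdio.h>

#define PARANOID_DEFER_TO_MODULES 4 /* the proposed new level */

static bool unpriv_allowed(int global_paranoid, int i915_paranoid)
{
	if (global_paranoid == PARANOID_DEFER_TO_MODULES)
		return i915_paranoid <= 0; /* module setting prevails */
	return global_paranoid <= 0;       /* usual global semantics */
}

int main(void)
{
	printf("%d\n", unpriv_allowed(4, 0)); /* 1: i915 opted in */
	printf("%d\n", unpriv_allowed(4, 3)); /* 0: i915 locked down */
	printf("%d\n", unpriv_allowed(2, 0)); /* 0: global wins */
	return 0;
}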

Regards,
Dmitry.



Re: [Intel-gfx] [PATCH 7/8] drm/i915/pmu: Wire up engine busy stats to PMU

2017-09-26 Thread Rogozhkin, Dmitry V
On Tue, 2017-09-26 at 13:32 +0100, Tvrtko Ursulin wrote:
> On 25/09/2017 18:48, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-09-25 16:15:42)
> >> From: Tvrtko Ursulin 
> >>
> >> We can use engine busy stats instead of the MMIO sampling timer
> >> for better efficiency.
> >>
> >> As minimum this saves period * num_engines / sec mmio reads,
> >> and in a better case, when only engine busy samplers are active,
> >> it enables us to not kick off the sampling timer at all.
> > 
> > Or you could inspect port_isset(execlists.port).
> > You can avoid the mmio for this case also by just using HWSP. It's just
> > that I never enabled busy tracking in isolation, so I always ended up
> > using the mmio.
> 
> This would be for execlists only. I could change the main patch to do 
> this, you think it is worth it?

You know, I wonder why we limit this to execlists? Is that because the
scheduler works only with execlists and not with ringbuffers on older
HW? But consider the following. If we don't have a scheduler, then we
have a FIFO queue and either a hw semaphore or sw sync. For a userspace
application, real execution and waiting are not actually
distinguishable: to it the engine is busy; it either executes or is
stalled and can't execute anything else, thus it is busy.

From this perspective we can consider extending what we currently do for
execlists to cover FIFO ringbuffers. What do you think? Other metrics
like SEMA or WAIT would be a second level of detail, describing the
distribution of BUSY time: we spent SEMA time on a semaphore or WAIT
time on a wait, and actively ran something for BUSY-SEMA(WAIT) time.

By the way, with BDW+ and execlists, is my understanding right that we
report the WAIT metric and SEMA is always 0?
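
In numbers, the proposed decomposition would read like this
(illustrative values in nanoseconds; SEMA and WAIT carve up BUSY rather
than adding to it):

#include <stdio.h>

int main(void)
{
	unsigned long long busy = 1000000, sema = 150000, wait = 50000;
	unsigned long long active = busy - sema - wait;

	printf("active: %llu ns (%.1f%% of busy)\n",
	       active, 100.0 * active / busy);
	return 0;
}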

> 
> > Stronger argument is that this method give more precise timing than
> > stochastic sampling.
> > 
> >>
> >> v2: Rebase.
> >> v3:
> >>   * Rebase, comments.
> >>   * Leave engine busyness controls out of workers.
> >> v4: Checkpatch cleanup.
> >> v5: Added comment to pmu_needs_timer change.
> >> v6:
> >>   * Rebase.
> >>   * Fix style of some comments. (Chris Wilson)
> >>
> >> Signed-off-by: Tvrtko Ursulin 
> > 
> > With a better explanation of why this is preferred,
> > Reviewed-by: Chris Wilson 
> 
> Thanks,
> 
> Tvrtko
> 


Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-09-20 Thread Rogozhkin, Dmitry V
Hi Peter, could you, please, comment on the below?

-Original Message-
From: Intel-gfx [mailto:intel-gfx-boun...@lists.freedesktop.org] On Behalf Of 
Rogozhkin, Dmitry V
Sent: Wednesday, September 13, 2017 4:06 PM
To: pet...@infradead.org
Cc: Intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf 
queries

On Tue, 2017-08-29 at 21:21 +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:16:31PM +0000, Rogozhkin, Dmitry V wrote:
> > > Pretty strict, people tend to get fairly upset every time we leak stuff.
> > > In fact Debian and Android carry a perf_event_paranoid patch that 
> > > default disables _everything_ :-(
> > 
> > Can you say more on that for Debian and Android? What exactly they do?
> > What is the value of perf_event_paranoid there? They disable 
> > everything even for root and CAP_SYS_ADMIN? But still they don't 
> > remove this from kernel on compilation stage, right? So users can 
> > explicitly change perf_event_paranoid to the desired value?
> 
> They introduce (and default to) perf_event_paranoid = 3. Which 
> disallows everything for unpriv user, root can still do things IIRC, 
> I'd have to dig out the patch.
> 
> This way apps have no access to the syscall, but you can enable it 
> using ADB by lowering the setting. So developers still have access, 
> but regular apps do not.
> 

Hi, Peter.

How would you feel about the following idea (or something close to it):
1. We introduce one more level, perf_event_paranoid=4 (or =3, I am not sure
whether the Debian/Android =3 is considered uAPI), which would mean:
"disallow kernel profiling for unprivileged users, but let individual kernel
modules have their own settings".
2. We add an i915 PMU custom setting,
"/sys/module/i915/parameters/perf_event_paranoid", which will take effect
only if the global perf_event_paranoid=4 (or =3) and will prevail over the
global setting.

Would anything like that be acceptable upstream? This would permit customers
to configure i915 PMU support for unprivileged users separately from the rest
of the PMU subsystem.
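For illustration, (2) could look roughly like the sketch below on the i915
side; the module parameter and the level-4 semantics are purely hypothetical,
nothing like this exists yet:

#include <linux/module.h>
#include <linux/perf_event.h>

/* Hypothetical knob: /sys/module/i915/parameters/perf_event_paranoid */
static int i915_perf_event_paranoid __read_mostly = 1;
module_param_named(perf_event_paranoid, i915_perf_event_paranoid, int, 0644);

static bool i915_pmu_unpriv_allowed(void)
{
	/* Assumed level 4: delegate the decision to the module knob. */
	if (sysctl_perf_event_paranoid >= 4)
		return i915_perf_event_paranoid < 1;

	/* Otherwise the usual global policy applies. */
	return sysctl_perf_event_paranoid < 1;
}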

Regards,
Dmitry.



Re: [Intel-gfx] [RFC 10/11] drm/i915: Export engine stats API to other users

2017-09-19 Thread Rogozhkin, Dmitry V
On Tue, 2017-09-19 at 12:50 -0700, Ben Widawsky wrote:
> On 17-09-15 10:49:56, Tvrtko Ursulin wrote:
> >
> >On 14/09/2017 21:26, Chris Wilson wrote:
> >>Quoting Tvrtko Ursulin (2017-09-11 16:25:58)
> >>>From: Tvrtko Ursulin 
> >>>
> >>>Other kernel users might want to look at total GPU busyness
> >>>in order to implement things like package power distribution
> >>>algorithms more efficiently.
> >>
> >>Who are we exporting these symbols to? Will you not need all the module
> >>ref handling and load ordering like around ips and audio?
> >
> >Hm yes indeed, I forgot about that.
> >
> >Perhaps Ben could comment on who is the user. If it is purely for 
> >internal explorations, I'll stick the patch at the end of the series 
> >as it is. If it has a more serious user I would need to implement a 
> >proper solution.
> >
> >Regards,
> >
> >Tvrtko
> 
> P-state driver was looking to use this as a way to make determinations about 
> how
> much to limit CPU frequency. Srinivas was privy to the original discussion
> 

I personally was surprised to see a private API exposed rather than reuse of
the PMU API. Do they really need a private path?



Re: [Intel-gfx] [PATCH 8/8] drm/i915: Gate engine stats collection with a static key

2017-09-18 Thread Rogozhkin, Dmitry V
On Mon, 2017-09-18 at 12:38 +0100, Tvrtko Ursulin wrote:
>  static inline void intel_engine_context_out(struct intel_engine_cs *engine)
>  {
> 	unsigned long flags;
> 
> -	if (READ_ONCE(engine->stats.enabled) == 0)
> -		return;
> -
> -	spin_lock_irqsave(&engine->stats.lock, flags);
> -
> -	if (engine->stats.enabled > 0) {
> -		ktime_t last, now = ktime_get();
> -
> -		if (engine->stats.active && --engine->stats.active == 0) {
> -			/*
> -			 * Decrement the active context count and in case GPU
> -			 * is now idle add up to the running total.
> -			 */
> -			last = ktime_sub(now, engine->stats.start);
> -
> -			engine->stats.total = ktime_add(engine->stats.total,
> -							last);
> -		} else if (engine->stats.active == 0) {
> -			/*
> -			 * After turning on engine stats, context out might be
> -			 * the first event in which case we account from the
> -			 * time stats gathering was turned on.
> -			 */
> -			last = ktime_sub(now, engine->stats.enabled_at);
> -
> -			engine->stats.total = ktime_add(engine->stats.total,
> -							last);
> +	if (static_branch_unlikely(&i915_engine_stats_key)) {
> +		if (READ_ONCE(engine->stats.enabled) == 0)
> +			return;
> +
> +		spin_lock_irqsave(&engine->stats.lock, flags);
> +
> +		if (engine->stats.enabled > 0) {
> +			ktime_t last, now = ktime_get();
> +
> +			if (engine->stats.active &&
> +			    --engine->stats.active == 0) {
> +				/*
> +				 * Decrement the active context count and in
> +				 * case GPU is now idle add up to the running
> +				 * total.
> +				 */
> +				last = ktime_sub(now, engine->stats.start);
> +
> +				engine->stats.total =
> +					ktime_add(engine->stats.total, last);
> +			} else if (engine->stats.active == 0) {
> +				/*
> +				 * After turning on engine stats, context out
> +				 * might be the first event in which case we
> +				 * account from the time stats gathering was
> +				 * turned on.
> +				 */
> +				last = ktime_sub(now,
> +						 engine->stats.enabled_at);
> +
> +				engine->stats.total =
> +					ktime_add(engine->stats.total, last);
> +			}
> +
> +			spin_unlock_irqrestore(&engine->stats.lock, flags);

This irqrestore looks asymmetrically placed with respect to the irqsave: the
lock is taken whenever the static branch is enabled, but it is only released
when engine->stats.enabled > 0.

> 		}
> 	}
> -
> -	spin_unlock_irqrestore(&engine->stats.lock, flags);
>  }
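For illustration, one way to rebalance it so that every irqsave has a
matching irqrestore at the same nesting level (just a sketch, not a tested
patch):

	if (static_branch_unlikely(&i915_engine_stats_key)) {
		if (READ_ONCE(engine->stats.enabled) == 0)
			return;

		spin_lock_irqsave(&engine->stats.lock, flags);

		if (engine->stats.enabled > 0) {
			/* ... context-out accounting as above ... */
		}

		spin_unlock_irqrestore(&engine->stats.lock, flags);
	}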



Re: [Intel-gfx] [PATCH i-g-t 5/5] tests/perf_pmu: Tests for i915 PMU API

2017-09-18 Thread Rogozhkin, Dmitry V
Did you try the tests on a system with 2 VDBOX engines? On my side 2 tests
are failing on a SKL GT4e NUC:

(perf_pmu:5414) CRITICAL: Test assertion failure function
busy_check_all, file perf_pmu.c:164:
(perf_pmu:5414) CRITICAL: Failed assertion: (double)(val[i]) <= (1.0 +
tolerance) * (double)0.0f && (double)(val[i]) >= (1.0 - tolerance) *
(double)0.0f
(perf_pmu:5414) CRITICAL: 'val[i]' != '0.0f' (499984960.00 not
within 2.00% tolerance of 0.00)
Subtest two-busy-check-all-bsd: FAIL (0.501s)

(perf_pmu:5414) CRITICAL: Test assertion failure function
two_busy_check_all, file perf_pmu.c:221:
(perf_pmu:5414) CRITICAL: Failed assertion: (double)(val[i]) <= (1.0 +
tolerance) * (double)0.0f && (double)(val[i]) >= (1.0 - tolerance) *
(double)0.0f
(perf_pmu:5414) CRITICAL: 'val[i]' != '0.0f' (499940146.00 not
within 2.00% tolerance of 0.00)
Subtest two-busy-check-all-bsd1: FAIL (0.501s)

I am trying to speculate on the reasons below.


On Mon, 2017-09-18 at 12:38 +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> A bunch of tests for the new i915 PMU feature.
> 
> Parts of the code were initialy sketched by Dmitry Rogozhkin.
> 
> Signed-off-by: Tvrtko Ursulin 
> Cc: Chris Wilson 
> Cc: Dmitry Rogozhkin 
> ---
>  lib/igt_gt.c   |  23 +-
>  lib/igt_gt.h   |   8 +
>  tests/Makefile.sources |   1 +
>  tests/perf_pmu.c   | 713 
> +
>  4 files changed, 738 insertions(+), 7 deletions(-)
>  create mode 100644 tests/perf_pmu.c
> 
> diff --git a/lib/igt_gt.c b/lib/igt_gt.c
> index b3f3b3809eee..102cc2841feb 100644
> --- a/lib/igt_gt.c
> +++ b/lib/igt_gt.c
> @@ -537,14 +537,23 @@ unsigned intel_detect_and_clear_missed_interrupts(int fd)
>   return missed;
>  }
>  
> +enum drm_i915_gem_engine_class {
> + I915_ENGINE_CLASS_OTHER = 0,
> + I915_ENGINE_CLASS_RENDER = 1,
> + I915_ENGINE_CLASS_COPY = 2,
> + I915_ENGINE_CLASS_VIDEO = 3,
> + I915_ENGINE_CLASS_VIDEO_ENHANCE = 4,
> + I915_ENGINE_CLASS_MAX /* non-ABI */
> +};
> +
>  const struct intel_execution_engine intel_execution_engines[] = {
> - { "default", NULL, 0, 0 },
> - { "render", "rcs0", I915_EXEC_RENDER, 0 },
> - { "bsd", "vcs0", I915_EXEC_BSD, 0 },
> - { "bsd1", "vcs0", I915_EXEC_BSD, 1<<13 /*I915_EXEC_BSD_RING1*/ },
> - { "bsd2", "vcs1", I915_EXEC_BSD, 2<<13 /*I915_EXEC_BSD_RING2*/ },
> - { "blt", "bcs0", I915_EXEC_BLT, 0 },
> - { "vebox", "vecs0", I915_EXEC_VEBOX, 0 },
> + { "default", NULL, -1, -1, 0, 0 },
> + { "render", "rcs0", I915_ENGINE_CLASS_RENDER, 0, I915_EXEC_RENDER, 0 },
> + { "bsd", "vcs0", I915_ENGINE_CLASS_VIDEO, 0, I915_EXEC_BSD, 0 },
With such a definition we will probably detect "bsd" as an engine (as well
as "bsd1" and "bsd2"), right? As a result, we will run two-busy-check-all-bsd
for it, and according to the defined flags we will submit workloads to _both_
vcs0 and vcs1 evenly, following i915 KMD dispatching. Thus
two-busy-check-all-bsd will fail, since it will detect load on 3 engines
(rcs0, vcs0, vcs1) instead of 2.

I am not quite sure why two-busy-check-all-bsd1 fails on my side as well, or
rather why it did not fail on your side too. The only explanation I see is
that the test thinks "bsd" and "bsd1" are separate engines and thus counts
them as 2. But that should fail on a single-VDBOX system as well... hm...
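For reference, the three submission paths differ only in the execbuf flags
from the quoted table; a sketch, with "execbuf" standing for a struct
drm_i915_gem_execbuffer2 being prepared for submission:

/* "bsd": no ring selector, the i915 KMD picks vcs0 or vcs1 */
execbuf.flags = I915_EXEC_BSD;

/* "bsd1": I915_EXEC_BSD_RING1, pinned to vcs0 */
execbuf.flags = I915_EXEC_BSD | (1 << 13);

/* "bsd2": I915_EXEC_BSD_RING2, pinned to vcs1 */
execbuf.flags = I915_EXEC_BSD | (2 << 13);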

> + { "bsd1", "vcs0", I915_ENGINE_CLASS_VIDEO, 0, I915_EXEC_BSD, 1<<13 
> /*I915_EXEC_BSD_RING1*/ },
> + { "bsd2", "vcs1", I915_ENGINE_CLASS_VIDEO, 1, I915_EXEC_BSD, 2<<13 
> /*I915_EXEC_BSD_RING2*/ },
> + { "blt", "bcs0", I915_ENGINE_CLASS_COPY, 0, I915_EXEC_BLT, 0 },
> + { "vebox", "vecs0", I915_ENGINE_CLASS_VIDEO_ENHANCE, 0, 
> I915_EXEC_VEBOX, 0 },
>   { NULL, 0, 0 }
>  };
>  
> diff --git a/lib/igt_gt.h b/lib/igt_gt.h
> index 2579cbd37be7..436041ce9cc0 100644
> --- a/lib/igt_gt.h
> +++ b/lib/igt_gt.h
> @@ -66,6 +66,8 @@ unsigned intel_detect_and_clear_missed_interrupts(int fd);
>  extern const struct intel_execution_engine {
>   const char *name;
>   const char *full_name;
> + int class;
> + int instance;
>   unsigned exec_id;
>   unsigned flags;
>  } intel_execution_engines[];
> @@ -78,6 +80,12 @@ extern const struct intel_execution_engine {
>e__++) \
>   for_if (gem_has_ring(fd__, flags__ = e__->exec_id | e__->flags))
>  
> +#define for_each_engine_class_instance(fd__, e__) \
> + for ((e__) = intel_execution_engines;\
> +  (e__)->name; \
> +  (e__)++) \
> + for_if ((e__)->class > 0)
> +
>  bool gem_can_store_dword(int fd, unsigned int engine);
>  
>  #endif /* IGT_GT_H */
> diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> index cf542df181a8..4bab6247151c 100644
> --- a/tests/Makefile.sources
> +++ b/tests/Makefile.sources
> @@ -217,6 +217,7 @@ 

Re: [Intel-gfx] [RFC v7 04/11] drm/i915/pmu: Expose a PMU interface for perf queries

2017-09-14 Thread Rogozhkin, Dmitry V
On Wed, 2017-09-13 at 11:34 +0100, Tvrtko Ursulin wrote:
> +static int i915_pmu_event_init(struct perf_event *event)
> +{
> +   struct drm_i915_private *i915 =
> +   container_of(event->pmu, typeof(*i915), pmu.base);
> +   int cpu, ret;
> +
> +   if (event->attr.type != event->pmu->type)
> +   return -ENOENT;
> +
> +   /* unsupported modes and filters */
> +   if (event->attr.sample_period) /* no sampling */
> +   return -EINVAL;
> +
> +   if (has_branch_stack(event))
> +   return -EOPNOTSUPP;
> +
> +   if (event->cpu < 0)
> +   return -EINVAL;
> +
> +   cpu = cpumask_any_and(&i915_pmu_cpumask,
> + topology_sibling_cpumask(event->cpu));
> +   if (cpu >= nr_cpu_ids)
> +   return -ENODEV;
> +
> +   ret = 0;
> +   if (is_engine_event(event)) {
> +   ret = engine_event_init(event);
> +   } else switch (event->attr.config) {
> +   case I915_PMU_ACTUAL_FREQUENCY:
> +   if (IS_VALLEYVIEW(i915) || IS_CHERRYVIEW(i915))
> +   ret = -ENODEV; /* requires a mutex for
> sampling! */
> +   case I915_PMU_REQUESTED_FREQUENCY:
> +   case I915_PMU_ENERGY:
> +   case I915_PMU_RC6_RESIDENCY:
> +   case I915_PMU_RC6p_RESIDENCY:
> +   case I915_PMU_RC6pp_RESIDENCY:
> +   if (INTEL_GEN(i915) < 6)
> +   ret = -ENODEV;
> +   break;
> +   }
> +   if (ret)
> +   return ret;

The switch for non-engine events should error out by default:

diff --git a/drivers/gpu/drm/i915/i915_pmu.c
b/drivers/gpu/drm/i915/i915_pmu.c
index d734879..3145e9a 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -329,6 +329,9 @@ static int i915_pmu_event_init(struct perf_event *event)
if (INTEL_GEN(i915) < 6)
ret = -ENODEV;
break;
+   default:
+   ret = -ENOENT;
+   break;
}
if (ret)
return ret;


Otherwise a user may try to enable a non-existing metric (> I915_PMU_LAST)
and will eventually hit a kernel panic in i915_pmu_enable/disable during
refcount operations. And we need to have an IGT test to check that.
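A minimal userspace check along those lines could look like this (a sketch,
run as root; it assumes the PMU is registered under the name "i915", whose
dynamic type is read from sysfs):

#include <errno.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	unsigned int type = 0;
	FILE *f = fopen("/sys/bus/event_source/devices/i915/type", "r");

	if (!f || fscanf(f, "%u", &type) != 1)
		return 77; /* skip: no i915 PMU on this kernel */
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = ~0ULL; /* far above I915_PMU_LAST */

	/* Must fail cleanly with ENOENT instead of panicking later. */
	assert(syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0) == -1);
	assert(errno == ENOENT);
	return 0;
}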



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-09-13 Thread Rogozhkin, Dmitry V
On Tue, 2017-08-29 at 21:21 +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 07:16:31PM +0000, Rogozhkin, Dmitry V wrote:
> > > Pretty strict, people tend to get fairly upset every time we leak stuff.
> > > In fact Debian and Android carry a perf_event_paranoid patch that
> > > default disables _everything_ :-(
> > 
> > Can you say more on that for Debian and Android? What exactly do they do?
> > What is the value of perf_event_paranoid there? Do they disable everything
> > even for root and CAP_SYS_ADMIN? But they still don't remove this from the
> > kernel at compilation stage, right? So users can explicitly change
> > perf_event_paranoid to the desired value?
> 
> They introduce (and default to) perf_event_paranoid = 3. Which
> disallows everything for unpriv user, root can still do things IIRC, I'd
> have to dig out the patch.
> 
> This way apps have no access to the syscall, but you can enable it using
> ADB by lowering the setting. So developers still have access, but
> regular apps do not.
> 

Hi, Peter.

How would you feel about the following idea (or something close to it):
1. We introduce one more level, perf_event_paranoid=4 (or =3, I am not sure
whether the Debian/Android =3 is considered uAPI), which would mean:
"disallow kernel profiling for unprivileged users, but let individual kernel
modules have their own settings".
2. We add an i915 PMU custom setting,
"/sys/module/i915/parameters/perf_event_paranoid", which will take effect
only if the global perf_event_paranoid=4 (or =3) and will prevail over the
global setting.

Would anything like that be acceptable upstream? This would permit customers
to configure i915 PMU support for unprivileged users separately from the rest
of the PMU subsystem.

Regards,
Dmitry.



Re: [Intel-gfx] [RFC v3 00/11] i915 PMU and engine busy stats

2017-09-12 Thread Rogozhkin, Dmitry V
On Tue, 2017-09-12 at 15:54 +0100, Tvrtko Ursulin wrote:
> On 12/09/2017 03:03, Rogozhkin, Dmitry V wrote:
> > Hi,
> > 
> > Just tried v3 series. perf-stat works fine. From the IGT tests which I 
> > wrote for i915 PMU
> > (https://patchwork.freedesktop.org/series/29313/) all pass (assuming 
> > pmu.enabled will be exposed
> > in debugfs) except cpu_online subtest. And this is pretty interesting - see 
> > details below.
> > 
> > Ok, be prepared for the long mail:)...
> > 
> > So, cpu_online subtest loads RCS0 engine 100% and starts to put CPUs 
> > offline one by one starting
> > from CPU0 (don't forget to have CONFIG_BOOTPARAM_HOTPLUG_CPU0=y in 
> > .config). Test expectation is
> > that i915 PMU will report almost 100% (with 2% tolerance) busyness of RCS0. 
> > Test requests metric
> > just twice: before running on RCS0 and right after. This fails as follows:
> > 
> > Executed on rcs0 for 32004262184us
> >i915/rcs0-busy/: 2225723999us
> > (perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, 
> > file perf_pmu.c:719:
> > (perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed([0]) > 
> > (1-USAGE_TOLERANCE) * elapsed_ns(, )
> > Stack trace:
> >#0 [__igt_fail_assert+0xf1]
> >#1 [__real_main773+0xff1]
> >#2 [main+0x35]
> >#3 [__libc_start_main+0xf5]
> >#4 [_start+0x29]
> >#5 [+0x29]
> > Subtest cpu_online failed.
> >  DEBUG 
> > (perf_pmu:6325) DEBUG: Test requirement passed: is_hotplug_cpu0()
> > (perf_pmu:6325) INFO: perf_init: enabled 1 metrics from 1 requested
> > (perf_pmu:6325) ioctl-wrappers-DEBUG: Test requirement passed: 
> > gem_has_ring(fd, ring)
> > (perf_pmu:6325) INFO: Executed on rcs0 for 32004262184us
> > (perf_pmu:6325) INFO:   i915/rcs0-busy/: 2225723999us
> > (perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, 
> > file perf_pmu.c:719:
> > (perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed([0]) > 
> > (1-USAGE_TOLERANCE) * elapsed_ns(, )
> > 
> > Now. Looks like that by itself PMU context migration works. For example, if 
> > you will comment out
> > "perf_pmu_migrate_context(>base, cpu, target)" you will get:
> > 
> >  Executed on rcs0 for 32004434918us
> >i915/rcs0-busy/: 76623707us
> > 
> > Compare with previous:
> >  Executed on rcs0 for 32004262184us
> >i915/rcs0-busy/:2225723999us
> > 
> > This test passed on the previous set of patches, I mean Tvrtko's v2 series 
> > + my patches.
> > 
> > So, it seems we are losing counter values somehow. I saw in the patches
> > that this place really was modified - you have added a subtraction of
> > the initial counter value:
> > static void i915_pmu_event_read(struct perf_event *event)
> > {
> > 
> > local64_set(&event->count,
> > __i915_pmu_event_read(event) -
> > local64_read(&event->hw.prev_count));
> > }
> > 
> > But looks like the problem is that with the PMU context migration we get 
> > sequence of events start/stop (or maybe
> > add/del) which eventually call our i915_pmu_enable/disable. Here is the 
> > dmesg log with the obvious printk:
> > 
> > [  153.971096] [IGT] perf_pmu: starting subtest cpu_online
> > [  153.971151] i915_pmu_enable: event->hw.prev_count=0
> > [  154.036015] i915_pmu_disable: event->hw.prev_count=0
> > [  154.048027] i915_pmu_enable: event->hw.prev_count=0
> > [  154.049343] smpboot: CPU 0 is now offline
> > [  155.059028] smpboot: Booting Node 0 Processor 0 APIC 0x0
> > [  155.155078] smpboot: CPU 1 is now offline
> > [  156.161026] x86: Booting SMP configuration:
> > [  156.161027] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > [  156.197065] IRQ 122: no longer affine to CPU2
> > [  156.198087] smpboot: CPU 2 is now offline
> > [  157.208028] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > [  157.263093] smpboot: CPU 3 is now offline
> > [  158.273026] smpboot: Booting Node 0 Processor 3 APIC 0x6
> > [  158.310026] i915_pmu_disable: event->hw.prev_count=76648307
> > [  158.319020] i915_pmu_enable: event->hw.prev_count=76648307
> > [  158.319098] IRQ 124: no longer affine to CPU4
> > [  158.320368] smpboot: CPU 4 is now offline
> > [  159.326030] smpboot: Booting Node 0 Processor 4 APIC 0x1
> > [  159.365306] smpboot: CPU 5 is now offline
> > [  160.371030] smpboot: Booting Node 0 Processor 5 APIC 0x3
> > [  160.421077] IRQ 125: no longer affine t

Re: [Intel-gfx] [RFC 04/11] drm/i915/pmu: Expose a PMU interface for perf queries

2017-09-11 Thread Rogozhkin, Dmitry V
On Mon, 2017-09-11 at 16:25 +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> From: Chris Wilson 
> From: Tvrtko Ursulin 
> From: Dmitry Rogozhkin 
> 
> The first goal is to be able to measure GPU (and invidual ring) busyness
> without having to poll registers from userspace. (Which not only incurs
> holding the forcewake lock indefinitely, perturbing the system, but also
> runs the risk of hanging the machine.) As an alternative we can use the
> perf event counter interface to sample the ring registers periodically
> and send those results to userspace.
> 
> To be able to do so, we need to export the two symbols from
> kernel/events/core.c to register and unregister a PMU device.
> 
> v1-v2 (Chris Wilson):
> 
> v2: Use a common timer for the ring sampling.
> 
> v3: (Tvrtko Ursulin)
>  * Decouple uAPI from i915 engine ids.
>  * Complete uAPI defines.
>  * Refactor some code to helpers for clarity.
>  * Skip sampling disabled engines.
>  * Expose counters in sysfs.
>  * Pass in fake regs to avoid null ptr deref in perf core.
>  * Convert to class/instance uAPI.
>  * Use shared driver code for rc6 residency, power and frequency.
> 
> v4: (Dmitry Rogozhkin)
>  * Register PMU with .task_ctx_nr=perf_invalid_context
>  * Expose cpumask for the PMU with the single CPU in the mask
>  * Properly support pmu->stop(): it should call pmu->read()
>  * Properly support pmu->del(): it should call stop(event, PERF_EF_UPDATE)
>  * Make pmu.busy_stats a refcounter to avoid busy stats going away
>with some deleted event.

busy_stats appear later in the patch series. And in your final version
busy_stats remains a bool while we rely on event refcounting, so this item
is misleading. Could you please change it, giving credit to the general
ref-counting of events which you rewrote for v5 of this patch?

>  * Expose cpumask for i915 PMU to avoid multiple events creation of
>the same type followed by counter aggregation by perf-stat.
>  * Track CPUs getting online/offline to migrate perf context. If (likely)
>cpumask will initially set CPU0, CONFIG_BOOTPARAM_HOTPLUG_CPU0 will be
>needed to see effect of CPU status tracking.
>  * End result is that only global events are supported and perf stat
>works correctly.
>  * Deny perf driver level sampling - it is prohibited for uncore PMU.
> 
> v5: (Tvrtko Ursulin)
> 
>  * Don't hardcode number of engine samplers.
>  * Rewrite event ref-counting for correctness and simplicity.
>  * Store initial counter value when starting already enabled events
>to correctly report values to all listeners.
>  * Fix RC6 residency readout.
>  * Comments, GPL header.
> 
> Signed-off-by: Chris Wilson 
> Signed-off-by: Tvrtko Ursulin 
> Signed-off-by: Dmitry Rogozhkin 
> Cc: Tvrtko Ursulin 
> Cc: Chris Wilson 
> Cc: Dmitry Rogozhkin 
> Cc: Peter Zijlstra 
> ---
>  drivers/gpu/drm/i915/Makefile   |   1 +
>  drivers/gpu/drm/i915/i915_drv.c |   2 +
>  drivers/gpu/drm/i915/i915_drv.h |  76 
>  drivers/gpu/drm/i915/i915_pmu.c | 686 
> 
>  drivers/gpu/drm/i915/i915_reg.h |   3 +
>  drivers/gpu/drm/i915/intel_engine_cs.c  |  10 +
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  25 ++
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  25 ++
>  include/uapi/drm/i915_drm.h |  58 +++
>  9 files changed, 886 insertions(+)
>  create mode 100644 drivers/gpu/drm/i915/i915_pmu.c
> 
> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> index 1cb8059a3a16..7b3a0eca62b6 100644
> --- a/drivers/gpu/drm/i915/Makefile
> +++ b/drivers/gpu/drm/i915/Makefile
> @@ -26,6 +26,7 @@ i915-y := i915_drv.o \
>  
>  i915-$(CONFIG_COMPAT)   += i915_ioc32.o
>  i915-$(CONFIG_DEBUG_FS) += i915_debugfs.o intel_pipe_crc.o
> +i915-$(CONFIG_PERF_EVENTS) += i915_pmu.o
>  
>  # GEM code
>  i915-y += i915_cmd_parser.o \
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 5c111ea96e80..b1f96eb1be16 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1196,6 +1196,7 @@ static void i915_driver_register(struct 
> drm_i915_private *dev_priv)
>   struct drm_device *dev = &dev_priv->drm;
>  
>   i915_gem_shrinker_init(dev_priv);
> + i915_pmu_register(dev_priv);
>  
>   /*
>* Notify a valid surface after modesetting,
> @@ -1250,6 +1251,7 @@ static void i915_driver_unregister(struct 
> drm_i915_private *dev_priv)
>   intel_opregion_unregister(dev_priv);
>  
>   i915_perf_unregister(dev_priv);
> + i915_pmu_unregister(dev_priv);
>  
>   i915_teardown_sysfs(dev_priv);
>   i915_guc_log_unregister(dev_priv);
> diff --git 

Re: [Intel-gfx] [RFC v3 00/11] i915 PMU and engine busy stats

2017-09-11 Thread Rogozhkin, Dmitry V
Hi,

Just tried the v3 series. perf-stat works fine. Of the IGT tests which I
wrote for the i915 PMU (https://patchwork.freedesktop.org/series/29313/) all
pass (assuming pmu.enabled will be exposed in debugfs) except the cpu_online
subtest. And this is pretty interesting - see details below.

Ok, be prepared for the long mail:)...

So, the cpu_online subtest loads the RCS0 engine to 100% and starts to put
CPUs offline one by one, starting from CPU0 (don't forget to have
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y in .config). The test expectation is that the
i915 PMU will report almost 100% (with 2% tolerance) busyness of RCS0. The
test requests the metric just twice: before running on RCS0 and right after.
This fails as follows:

Executed on rcs0 for 32004262184us
  i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file 
perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed([0]) > 
(1-USAGE_TOLERANCE) * elapsed_ns(, )
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [__real_main773+0xff1]
  #2 [main+0x35]
  #3 [__libc_start_main+0xf5]
  #4 [_start+0x29]
  #5 [+0x29]
Subtest cpu_online failed.
 DEBUG 
(perf_pmu:6325) DEBUG: Test requirement passed: is_hotplug_cpu0()
(perf_pmu:6325) INFO: perf_init: enabled 1 metrics from 1 requested
(perf_pmu:6325) ioctl-wrappers-DEBUG: Test requirement passed: gem_has_ring(fd, 
ring)
(perf_pmu:6325) INFO: Executed on rcs0 for 32004262184us
(perf_pmu:6325) INFO:   i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file 
perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed([0]) > 
(1-USAGE_TOLERANCE) * elapsed_ns(, )

Now, it looks like PMU context migration by itself works. For example, if you
comment out "perf_pmu_migrate_context(&pmu->base, cpu, target)" you will get:

Executed on rcs0 for 32004434918us
  i915/rcs0-busy/: 76623707us

Compare with previous:
Executed on rcs0 for 32004262184us
  i915/rcs0-busy/:2225723999us

This test passed on the previous set of patches, I mean Tvrtko's v2 series + my 
patches.

So, it seems we are losing counter values somehow. I saw in the patches that
this place really was modified - you have added a subtraction of the initial
counter value:
static void i915_pmu_event_read(struct perf_event *event)
{

local64_set(&event->count,
	    __i915_pmu_event_read(event) -
	    local64_read(&event->hw.prev_count));
}

But it looks like the problem is that with PMU context migration we get a
sequence of start/stop events (or maybe add/del) which eventually call our
i915_pmu_enable/disable. Here is the dmesg log with the obvious printk:

[  153.971096] [IGT] perf_pmu: starting subtest cpu_online
[  153.971151] i915_pmu_enable: event->hw.prev_count=0
[  154.036015] i915_pmu_disable: event->hw.prev_count=0
[  154.048027] i915_pmu_enable: event->hw.prev_count=0
[  154.049343] smpboot: CPU 0 is now offline
[  155.059028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  155.155078] smpboot: CPU 1 is now offline
[  156.161026] x86: Booting SMP configuration:
[  156.161027] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  156.197065] IRQ 122: no longer affine to CPU2
[  156.198087] smpboot: CPU 2 is now offline
[  157.208028] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  157.263093] smpboot: CPU 3 is now offline
[  158.273026] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  158.310026] i915_pmu_disable: event->hw.prev_count=76648307
[  158.319020] i915_pmu_enable: event->hw.prev_count=76648307
[  158.319098] IRQ 124: no longer affine to CPU4
[  158.320368] smpboot: CPU 4 is now offline
[  159.326030] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  159.365306] smpboot: CPU 5 is now offline
[  160.371030] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  160.421077] IRQ 125: no longer affine to CPU6
[  160.422093] smpboot: CPU 6 is now offline
[  161.429030] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  161.467091] smpboot: CPU 7 is now offline
[  162.473027] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  162.527019] i915_pmu_disable: event->hw.prev_count=4347548222
[  162.546017] i915_pmu_enable: event->hw.prev_count=4347548222
[  162.547317] smpboot: CPU 0 is now offline
[  163.553028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  163.621089] smpboot: CPU 1 is now offline
[  164.627028] x86: Booting SMP configuration:
[  164.627029] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  164.669308] smpboot: CPU 2 is now offline
[  165.679025] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  165.717089] smpboot: CPU 3 is now offline
[  166.723025] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  166.775016] i915_pmu_disable: event->hw.prev_count=8574197312
[  166.787016] i915_pmu_enable: event->hw.prev_count=8574197312
[  166.788309] smpboot: CPU 4 is now offline
[  167.794025] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  167.837114] smpboot: CPU 5 is now offline
[  168.847025] smpboot: Booting Node 

Re: [Intel-gfx] [PATCH 1/2] drm/i915/tracepoints: Don't compile-out low-level tracepoints

2017-09-11 Thread Rogozhkin, Dmitry V
On Mon, 2017-09-11 at 20:52 +0100, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2017-09-11 16:34:08)
> > 
> > On 11/09/2017 14:09, Michał Winiarski wrote:
> > > There's no reason to hide those tracepoints.
> > > Let's also remove the DRM_I915_LOW_LEVEL_TRACEPOINTS Kconfig option.
> > 
> > No numbers from (micro-)bechmarks showing how small the impact of doing 
> > this is? I thought John was compiling this data. It will be just a no-op 
> > on the fast path, but a bit more generated code.
> > 
> > Assuming that will be fine, the only potentially problematic aspect that 
> > comes to mind is the fact meaning of these tracepoints is a bit 
> > different between execlists and guc. But maybe that is thinking to low 
> > level (!) - in fact they are in both cases at points where i915 is 
> > passing/receiving requests to/from hardware so not an issue?
> 
> Along the same lines is that this implies that these are important
> enough to be ABI, and that means we need to make a long term decision on
> the viability and meaning of such tracepoints.
> -Chris
There are a number of applications which use these tracepoints for task
profiling, placing them on a timescale for visualization. For example, VTune
and GPA - these 2 are closed source. I am not sure whether there are open
source profiling tools with such support (at least for the GPU). Right now
VTune and GPA are, in the better case, stuck with custom patches for i915,
thus limiting themselves to the customers who are OK with custom patching.
Thus, making this ABI (if possible), or at least making sure it does not get
broken or disappear (with, for example, GuC enabling), would be highly
beneficial.

Dmitry.



Re: [Intel-gfx] [RFC v2 2/3] drm/i915/pmu: serve global events and support perf stat

2017-08-30 Thread Rogozhkin, Dmitry V
On Tue, 2017-08-29 at 11:28 +0200, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:38:43PM +0000, Rogozhkin, Dmitry V wrote:
> > On Wed, 2017-08-23 at 08:26 -0700, Dmitry Rogozhkin wrote:
> > > +static cpumask_t i915_pmu_cpumask = CPU_MASK_CPU0;
> > 
> > Peter, this hardcoding of cpumask to use CPU0 works, but should I
> > implement something smarter or this will be sufficient? I see that
> > cstate.c you have pointed me to tries to track CPUs going online/offline
> > and migrates PMU context to another CPU if selected one went offline.
> > Should I follow this way?
> 
> Yes.. x86 used to not allow hotplug of CPU0, but they 'fixed' that :/
> 
> And the perf core needs _a_ valid CPU to run things from, which leaves
> you having to track online/offline things.
> 
> Now, I suppose its all fairly similar for a lot of uncore PMUs, so maybe
> you can pull some of this into a library and avoid the endless
> duplication between all (most?) uncore driveres.
> 
> > If I should track CPUs going online/offline, then I have questions:
> > 1. How I should register tracking callbacks? I see that cstate.c
> > registers CPUHP_AP_PERF_X86_CSTATE_STARTING and
> > CPUHP_AP_PERF_X86_CSTATE_ONLINE, uncore.c registers
> > CPUHP_AP_PERF_X86_UNCORE_ONLINE. What I should use? I incline to UNCORE.
> 
> Egads, what a mess :/ Clearly I didn't pay too much attention there.
> 
> So ideally we'd not hate a state per PMU, and
> __cpuhp_setup_state_cpuslocked() has a .multi_instance argument that
> allows reuse of a state.
> 
> So yes, please use the PERF_X86_UNCORE ones if possible.
> 
> > 2. If I will register for, say UNCORE, then how double registrations
> > will be handled if both uncore.c and i915.c will register callbacks? Any
> > conflict here?
> 
> Should work with .multi_instance I _think_, I've not had the pleasure of
> using the new and improved CPU hotplug infrastructure much.
> 
> > 3. What I should pass as 2nd argument? Will "perf/x86/intel/i915:online"
> > be ok?
> 
> Yeah, whatever I think.. something unique. Someone or something will
> eventually yell if its no good I suppose ;-)


I figured out how to track CPU online/offline status in the PMU. Here is a
question, however. What is the reason for uncore PMUs (cstate.c for example)
to register for CPUs other than CPU0? I see it registers for the first
thread of each core; on my 8 logical-core system it seems to register for
cpu0-3. If they register for several CPUs then perf-stat will aggregate the
counters, which can be disabled with the '-A, --no-aggr' option. Ok... but
they could register for just CPU0. Besides, it looks like on Linux CPU0
can't go offline at all, at least on the x86 architecture. Peter, could you
please clarify the reasoning for registering designated readers of an uncore
PMU on several CPUs?
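For the record, the registration I ended up experimenting with looks roughly
like this (a sketch; the callback names and bodies are mine):

#include <linux/cpuhotplug.h>

static int i915_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
{
	/* Pick a designated reader CPU if we do not have one yet. */
	return 0;
}

static int i915_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
{
	/* If the designated reader goes away, choose a new target and
	 * call perf_pmu_migrate_context(&pmu->base, cpu, target). */
	return 0;
}

static int i915_pmu_init_cpuhp(void)
{
	/* Reuse the uncore state; multi-instance allows several PMUs. */
	return cpuhp_setup_state_multi(CPUHP_AP_PERF_X86_UNCORE_ONLINE,
				       "perf/x86/intel/i915:online",
				       i915_pmu_cpu_online,
				       i915_pmu_cpu_offline);
}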



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-29 Thread Rogozhkin, Dmitry V
On Tue, 2017-08-29 at 11:30 +0200, Peter Zijlstra wrote:
> On Mon, Aug 28, 2017 at 10:43:17PM +0000, Rogozhkin, Dmitry V wrote:
> 
> > Hi Peter,
> >
> > I have updated my fixes to Tvrtko's PMU, they are here:
> > https://patchwork.freedesktop.org/series/28842/, and I started to check
> > whether we will be able to cover all the use cases for this PMU which we
> > had in mind. Here I have some concerns and further questions.
> >
> > So, as soon as I registered PMU with the perf_invalid_context, i.e. as
> > an uncore PMU, I got the effect that metrics from our PMU are available
> > under root only. This happens since we fall to the following case
> > described in 'man perf_event_open': "A pid == -1 and cpu >= 0 setting is
> > per-CPU and measures all processes on the specified CPU.  Per-CPU events
> > need  the  CAP_SYS_ADMIN  capability  or
> > a /proc/sys/kernel/perf_event_paranoid value of less than 1."
> >
> > This is a trouble point for us... So could you please clarify:
> > 1. How is the PMU API positioned? Is it for debug purposes only, or can it
> > be used in end-user release applications to monitor system activity and
> > make decisions based on that?
> 
> Perf is meant to also be usable for end-users, _however_ by default it
> will only allow users to profile their own userspace (tasks).
> 
> Allowing unpriv users access to kernel data is a clear data leak (you
> instantly void KASLR for instance).
> 
> And since uncore PMUs are not uniquely tied to individual user context,
> unpriv users have no access.
> 

If someone knew how much I don't like to step into such situations:).
Ok, thank you for the clarification. This affects how we can use this API,
but it does not make it unusable. Good that it is intended for end-users
as well.

> > 2. How applications can access uncore PMU metrics from non-privileged
> > applications?
> 
> Only by lowering sysctl.kernel.perf_event_paranoid.
> Since uuncore is shared among many users, its events can be used to
> construct side-channel attacks. So from a security pov this is not
> something that can otherwise be done.
> 
> Imagine user A using the GPU to do crypto and user B reading the GPU
> events to infer state or similar things.
> 
> Without privilege separation we cannot allow unpriv access to system
> wide resources.
> 
> > 3. Is that a strong requirement to restrict uncore PMU metrics reporting
> > to privileged applications or this can be relaxed?
> 
> Pretty strict, people tend to get fairly upset every time we leak stuff.
> In fact Debian and Android carry a perf_event_paranoid patch that
> default disables _everything_ :-(

Can you say more on that for Debian and Android? What exactly do they do?
What is the value of perf_event_paranoid there? Do they disable everything
even for root and CAP_SYS_ADMIN? But they still don't remove this from the
kernel at compilation stage, right? So users can explicitly change
perf_event_paranoid to the desired value?



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-28 Thread Rogozhkin, Dmitry V
On Wed, 2017-08-23 at 20:22 +0200, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 05:51:38PM +0000, Rogozhkin, Dmitry V wrote:
> 
> > Anyhow, returning to the metrics i915 exposes. Some metrics are just
> > exposure of some counters supported already inside i915 PMU which do not
> > require any special sampling: at any given moment you can request the
> > counter value (these are interrupts counts, i915 power consumption).
> 
> > Other metrics are similar to the ever-existing which I just described,
> > but they require activation for i915 to start to count them - this is
> > done on the event initialization (these are engine busy stats).
> 
> Right, so depending on how expensive this activation is and if it can be
> done without scheduling, there are two options:
> 
>  1) activate/deactivate from pmu::start()/pmu::stop()
>  2) activate/deactivate from pmu::event_init()/event->destroy() and
> disregard all counting between pmu::stop() and pmu::start().
> 
> > Finally, there is a third group which require sampling counting: they
> > are needed to be initialized and i915 pmu starts an internal timer to
> > count these values (these are some engines characteristics referenced
> > in the code as QUEUED, SEMA, WAIT).
> 
> So uncore PMUs can't really do sampling. That is, perf defines sampling
> as interrupting the relevant task and then providing things like the
> %RIP value at interrupt time. Since uncore activity cannot be associated
> with any one task, no sampling allowed.
> 
> Now, I'm thinking that what i915 does is slightly different, it doesn't
> provide registers to read out the counter state, but instead
> periodically writes state snapshots into some memory buffer, right?
> 
> That's a bit tricky, maybe the best fit would be what PPC HV 24x7 does.
> They create an event-group, that is a set of counters that are
> co-scheduled, matching the set of counters they get from the HV
> interface (or a subset) and then sys_read() will use a TXN_READ to
> group-read the entire thing at once. In your case it could consume the
> last state snapshot instead of request one (or wait for the next,
> whatever works best).
> 
> Would that work?

Hi Peter,

I have updated my fixes to Tvrtko's PMU, they are here:
https://patchwork.freedesktop.org/series/28842/, and I started to check
whether we will be able to cover all the use cases for this PMU which we
had in mind. Here I have some concerns and further questions.

So, as soon as I registered PMU with the perf_invalid_context, i.e. as
an uncore PMU, I got the effect that metrics from our PMU are available
under root only. This happens since we fall to the following case
described in 'man perf_event_open': "A pid == -1 and cpu >= 0 setting is
per-CPU and measures all processes on the specified CPU.  Per-CPU events
need  the  CAP_SYS_ADMIN  capability  or
a /proc/sys/kernel/perf_event_paranoid value of less than 1."

This is a trouble point for us... So could you please clarify:
1. How is the PMU API positioned? Is it for debug purposes only, or can it be
used in end-user release applications to monitor system activity and make
decisions based on that?
2. How can non-privileged applications access uncore PMU metrics?
3. Is restricting uncore PMU metrics reporting to privileged applications a
strong requirement, or can this be relaxed?


I understand why the restriction was relevant at the time when only CPU-level
counters were available: system-wide events were expensive. But I don't quite
understand why these restrictions are still in place for uncore PMUs when
they actually report metrics right away. Is that just a remnant of CPU-only
times that no one needed to change? Can this be changed, so that uncore
metrics are allowed to be accessed from general applications?


Regards,
Dmitry.



Re: [Intel-gfx] [RFC v2 2/3] drm/i915/pmu: serve global events and support perf stat

2017-08-28 Thread Rogozhkin, Dmitry V
Peter, any comments?

On Wed, 2017-08-23 at 23:38 +, Rogozhkin, Dmitry V wrote:
> On Wed, 2017-08-23 at 08:26 -0700, Dmitry Rogozhkin wrote:
> > +static cpumask_t i915_pmu_cpumask = CPU_MASK_CPU0;
> 
> 
> Peter, this hardcoding of cpumask to use CPU0 works, but should I
> implement something smarter or this will be sufficient? I see that
> cstate.c you have pointed me to tries to track CPUs going online/offline
> and migrates PMU context to another CPU if selected one went offline.
> Should I follow this way?
> 
> If I should track CPUs going online/offline, then I have questions:
> 1. How I should register tracking callbacks? I see that cstate.c
> registers CPUHP_AP_PERF_X86_CSTATE_STARTING and
> CPUHP_AP_PERF_X86_CSTATE_ONLINE, uncore.c registers
> CPUHP_AP_PERF_X86_UNCORE_ONLINE. What I should use? I incline to UNCORE.
> 2. If I will register for, say UNCORE, then how double registrations
> will be handled if both uncore.c and i915.c will register callbacks? Any
> conflict here?
> 3. What I should pass as 2nd argument? Will "perf/x86/intel/i915:online"
> be ok?
> 


Re: [Intel-gfx] [RFC v2 3/3] drm/i915/pmu: deny perf driver level sampling of i915 PMU

2017-08-25 Thread Rogozhkin, Dmitry V
Adding the mailing list back.

Just tried ./intel-gpu-overlay. From what I see it works wonderfully even
when I wiped out event->sampling_period from the i915 RFC PMU and prohibited
it. Mind that I did not remove the sampling inside the i915 RFC PMU: I
removed only the timer stuff which is exposed to the perf core. The sampling
itself is still there and works as far as I can see.

Thus the question: why was the event->hw.hrtimer stuff introduced in the i915
RFC PMU? Right now it does not make sense to me.

Dmitry.

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Friday, August 25, 2017 11:19 AM
To: Rogozhkin, Dmitry V <dmitry.v.rogozh...@intel.com>; Wilson, Chris 
<chris.wil...@intel.com>
Subject: RE: [RFC v2 3/3] drm/i915/pmu: deny perf driver level sampling of i915 
PMU

Quoting Rogozhkin, Dmitry V (2017-08-25 19:06:13)
> Hi Chris, not sure you saw my mail. So, I am trying to remind you.

It's integral to all the sampling counters we use; which are the majority of 
the events exposed and used by intel-gpu-overlay.
-Chris


Re: [Intel-gfx] [RFC v2 3/3] drm/i915/pmu: deny perf driver level sampling of i915 PMU

2017-08-23 Thread Rogozhkin, Dmitry V
Hi Chris,

Why did we have event->hw.hrtimer in the i915 PMU? Was there any particular
reason? Did you have some use case which did not work?

According to Peter we should not expose the timer outside of our PMU, and I
do not see why we need it in the first place. So I went ahead, wiped it out,
and prohibited events from being initialized with a sampling_period. I don't
see what will break; from my perspective nothing, because the internal
sampling timer still remains.

Could you, please, comment?

Dmitry.

On Wed, 2017-08-23 at 08:26 -0700, Dmitry Rogozhkin wrote:
> This patch should probably be squashed with Tvrtko's PMU enabling patch...
> 
> As per discussion with Peter, i915 PMU is an example of uncore PMU which
> are prohibited to support perf driver level sampling. This patch removes
> hrtimer which we expose to perf core and denies events creation with
> non-zero event->attr.sampling_period.
> 
> Mind that this patch does _not_ remove i915 PMU _internal_ sampling timer.
> So, sampling metrics are still gathered, but can be accessed only by
> explicit request to get metric counter, i.e. by sys_read().
> 
> Change-Id: I33f345f679f0a5a8ecc9867f9e7c1bfb357e708d
> Signed-off-by: Dmitry Rogozhkin 
> Cc: Tvrtko Ursulin 
> Cc: Chris Wilson 
> Cc: Peter Zijlstra 
> ---
>  drivers/gpu/drm/i915/i915_pmu.c | 89 
> ++---
>  1 file changed, 4 insertions(+), 85 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index c551d64..311aeeb 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -239,50 +239,6 @@ static int engine_event_init(struct perf_event *event)
>   return 0;
>  }
>  
> -static DEFINE_PER_CPU(struct pt_regs, i915_pmu_pt_regs);
> -
> -static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
> -{
> - struct pt_regs *regs = this_cpu_ptr(&i915_pmu_pt_regs);
> - struct perf_sample_data data;
> - struct perf_event *event;
> - u64 period;
> -
> - event = container_of(hrtimer, struct perf_event, hw.hrtimer);
> - if (event->state != PERF_EVENT_STATE_ACTIVE)
> - return HRTIMER_NORESTART;
> -
> - event->pmu->read(event);
> -
> - perf_sample_data_init(&data, 0, event->hw.last_period);
> - perf_event_overflow(event, &data, regs);
> -
> - period = max_t(u64, 1, event->hw.sample_period);
> - hrtimer_forward_now(hrtimer, ns_to_ktime(period));
> - return HRTIMER_RESTART;
> -}
> -
> -static void init_hrtimer(struct perf_event *event)
> -{
> - struct hw_perf_event *hwc = &event->hw;
> -
> - if (!is_sampling_event(event))
> - return;
> -
> - hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> - hwc->hrtimer.function = hrtimer_sample;
> -
> - if (event->attr.freq) {
> - long freq = event->attr.sample_freq;
> -
> - event->attr.sample_period = NSEC_PER_SEC / freq;
> - hwc->sample_period = event->attr.sample_period;
> - local64_set(&hwc->period_left, hwc->sample_period);
> - hwc->last_period = hwc->sample_period;
> - event->attr.freq = 0;
> - }
> -}
> -
>  static int i915_pmu_event_init(struct perf_event *event)
>  {
>   struct drm_i915_private *i915 =
> @@ -293,6 +249,10 @@ static int i915_pmu_event_init(struct perf_event *event)
>   if (event->attr.type != event->pmu->type)
>   return -ENOENT;
>  
> + /* unsupported modes and filters */
> + if (event->attr.sample_period) /* no sampling */
> + return -EINVAL;
> +
>   if (has_branch_stack(event))
>   return -EOPNOTSUPP;
>  
> @@ -328,46 +288,9 @@ static int i915_pmu_event_init(struct perf_event *event)
>   if (!event->parent)
>   event->destroy = i915_pmu_event_destroy;
>  
> - init_hrtimer(event);
> -
>   return 0;
>  }
>  
> -static void i915_pmu_timer_start(struct perf_event *event)
> -{
> - struct hw_perf_event *hwc = &event->hw;
> - s64 period;
> -
> - if (!is_sampling_event(event))
> - return;
> -
> - period = local64_read(&hwc->period_left);
> - if (period) {
> - if (period < 0)
> - period = 1;
> -
> - local64_set(&hwc->period_left, 0);
> - } else {
> - period = max_t(u64, 1, hwc->sample_period);
> - }
> -
> - hrtimer_start_range_ns(&hwc->hrtimer,
> -ns_to_ktime(period), 0,
> -HRTIMER_MODE_REL_PINNED);
> -}
> -
> -static void i915_pmu_timer_cancel(struct perf_event *event)
> -{
> - struct hw_perf_event *hwc = &event->hw;
> -
> - if (!is_sampling_event(event))
> - return;
> -
> - local64_set(&hwc->period_left,
> - ktime_to_ns(hrtimer_get_remaining(&hwc->hrtimer)));
> - hrtimer_cancel(&hwc->hrtimer);
> -}
> -
>  static bool 

Re: [Intel-gfx] [RFC v2 2/3] drm/i915/pmu: serve global events and support perf stat

2017-08-23 Thread Rogozhkin, Dmitry V
On Wed, 2017-08-23 at 08:26 -0700, Dmitry Rogozhkin wrote:
> +static cpumask_t i915_pmu_cpumask = CPU_MASK_CPU0;

Peter, this hardcoding of cpumask to use CPU0 works, but should I
implement something smarter or this will be sufficient? I see that
cstate.c you have pointed me to tries to track CPUs going online/offline
and migrates PMU context to another CPU if selected one went offline.
Should I follow this way?

If I should track CPUs going online/offline, then I have questions:
1. How I should register tracking callbacks? I see that cstate.c
registers CPUHP_AP_PERF_X86_CSTATE_STARTING and
CPUHP_AP_PERF_X86_CSTATE_ONLINE, uncore.c registers
CPUHP_AP_PERF_X86_UNCORE_ONLINE. What I should use? I incline to UNCORE.
2. If I will register for, say UNCORE, then how double registrations
will be handled if both uncore.c and i915.c will register callbacks? Any
conflict here?
3. What I should pass as 2nd argument? Will "perf/x86/intel/i915:online"
be ok?



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-23 Thread Rogozhkin, Dmitry V
On Wed, 2017-08-23 at 20:22 +0200, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 05:51:38PM +0000, Rogozhkin, Dmitry V wrote:
> 
> > Anyhow, returning to the metrics i915 exposes. Some metrics are just
> > exposure of some counters supported already inside i915 PMU which do not
> > require any special sampling: at any given moment you can request the
> > counter value (these are interrupts counts, i915 power consumption).
> 
> > Other metrics are similar to the ever-existing which I just described,
> > but they require activation for i915 to start to count them - this is
> > done on the event initialization (these are engine busy stats).
> 
> Right, so depending on how expensive this activation is and if it can be
> done without scheduling, there are two options:
> 
>  1) activate/deactivate from pmu::start()/pmu::stop()
>  2) activate/deactivate from pmu::event_init()/event->destroy() and
> disregard all counting between pmu::stop() and pmu::start().
> 
> > Finally, there is a third group which require sampling counting: they
> > are needed to be initialized and i915 pmu starts an internal timer to
> > count these values (these are some engines characteristics referenced
> > in the code as QUEUED, SEMA, WAIT).
> 
> So uncore PMUs can't really do sampling. That is, perf defines sampling
> as interrupting the relevant task and then providing things like the
> %RIP value at interrupt time. Since uncore activity cannot be associated
> with any one task, no sampling allowed.

I read this as we need to add:
static int i915_pmu_event_init(struct perf_event *event)
{
...
/* Sampling not supported yet */
if (hwc->sample_period)
return -EINVAL;
...
}
And deny a sampling period for our events, right? And what should happen is
that the following command will start to fail: "perf stat -e ev -a -I 100"?
Essentially you are saying that uncore PMUs should work only in "get value
when asked" mode; for perf-stat that means results are added up at the end of
the run. However, there is no strict perf-subsystem denial of uncore event
sampling; it is delegated to each PMU implementation. Is that correct? I
wonder how this is supposed to work in case a PMU is capable of producing
both global and per-task metrics.


> 
> Now, I'm thinking that what i915 does is slightly different, it doesn't
> provide registers to read out the counter state, but instead
> periodically writes state snapshots into some memory buffer, right?
> 
> That's a bit tricky, maybe the best fit would be what PPC HV 24x7 does.
> They create an event-group, that is a set of counters that are
> co-scheduled, matching the set of counters they get from the HV
> interface (or a subset) and then sys_read() will use a TXN_READ to
> group-read the entire thing at once. In your case it could consume the
> last state snapshot instead of request one (or wait for the next,
> whatever works best).
> 
> Would that work?

Hm. Not sure I understand this with my userspace brain:). Why a group read?
Just to eliminate the overhead of doing the same/similar things for all the
metrics? Anyway, what completely slipped away from me is how this relates to
sampling if it is not allowed for uncore PMUs. So, will there or will there
not be sampling?
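For reference, my understanding of what a group read would look like from
userspace, with the leader opened using read_format = PERF_FORMAT_GROUP
(just a sketch):

#include <stdint.h>
#include <unistd.h>

struct group_read {
	uint64_t nr;		/* number of events in the group */
	uint64_t values[16];	/* one counter value per member */
};

/* One sys_read() on the group leader returns all member counters,
 * taken together rather than one syscall per event. */
static int read_group(int group_fd, struct group_read *gr)
{
	return read(group_fd, gr, sizeof(*gr)) > 0 ? 0 : -1;
}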

Essentially, the i915 PMU implements such "sampling" metrics in the
following way:
1. If such a metric is requested, the i915 PMU starts to do some internal
sampling which is not exposed to the perf subsystem. I think there is some
timer stuff currently associated with this.
2. On each received i915_pmu_event_read() we just immediately return the
values we have gathered so far internally.
3. Since we were not aware of the no-uncore-sampling policy, we have some
code for sampling exposed to the perf subsystem, but it is unrelated to #1.
Essentially it is used just to call i915_pmu_event_read(). Here is the
code:

static DEFINE_PER_CPU(struct pt_regs, i915_pmu_pt_regs);

static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
{
struct pt_regs *regs = this_cpu_ptr(&i915_pmu_pt_regs);
struct perf_sample_data data;
struct perf_event *event;
u64 period;

event = container_of(hrtimer, struct perf_event, hw.hrtimer);
if (event->state != PERF_EVENT_STATE_ACTIVE)
return HRTIMER_NORESTART;

event->pmu->read(event);

perf_sample_data_init(&data, 0, event->hw.last_period);
perf_event_overflow(event, &data, regs);

period = max_t(u64, 1, event->hw.sample_period);
hrtimer_forward_now(hrtimer, ns_to_ktime(period));
return HRTIMER_RESTART;
}


By the way, I think Tvrtko had a question on why he needs this
"DEFINE_PER_CPU(struct pt_regs, i915_pmu_pt_regs)" stuff, which looked
somewhat strange? Here is a quot

Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-23 Thread Rogozhkin, Dmitry V
On Wed, 2017-08-23 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 05:51:38PM +0000, Rogozhkin, Dmitry V wrote:
> 
> > https://patchwork.freedesktop.org/patch/171953/. This patch makes 'perf
> > stat -e i915/rcs0-busy/' to error out and supports 'perf stat -e
> > i915/rcs0-busy/ -a -C0'. I still think I miss something since 'perf stat
> > -e i915/rcs0-busy/ -a' will give metrics multiplied by number of active
> > CPUs, but that's look closer to what is needed.
> 
> IIRC an uncore PMU should expose a cpumask through sysfs, and then perf
> tools will read that mask and auto-magically limit the number of CPUs it
> instantiates the counter on.
> 
> See for instance arch/x86/events/intel/cstate.c, that is a fairly
> trivial uncore PMU, look for "DEVICE_ATTR(cpumask".

Thank you for the reference! I will go forward and try to implement this
for i915 PMU.

> 
> For example, on my 2 socket 10 core ivb-ep:
> 
> root@ivb-ep:~# cat /sys/bus/event_source/devices/cstate_pkg/cpumask
> 0,10
> 
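For the record, the cstate.c pattern boils down to something like this (a
sketch adapted from there; i915_pmu_cpumask would be the mask the i915 PMU
maintains):

#include <linux/cpumask.h>
#include <linux/device.h>

static cpumask_t i915_pmu_cpumask;

static ssize_t cpumask_show(struct device *dev,
			    struct device_attribute *attr, char *buf)
{
	return cpumap_print_to_pagebuf(true, buf, &i915_pmu_cpumask);
}

static DEVICE_ATTR_RO(cpumask);

static struct attribute *i915_pmu_cpumask_attrs[] = {
	&dev_attr_cpumask.attr,
	NULL,
};

static const struct attribute_group i915_pmu_cpumask_attr_group = {
	.attrs = i915_pmu_cpumask_attrs,
};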



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-23 Thread Rogozhkin, Dmitry V
On Wed, 2017-08-23 at 20:04 +0200, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 05:51:38PM +0000, Rogozhkin, Dmitry V wrote:
> 
> > > The above command tries to add an event 'i915/rcs0-busy/' to a task. How
> > > are i915 resource associated to any one particular task?
> > 
> > Currently in no way, they are global.
> 
> Right. So no per DRM context things. Can you have multiple DRM contexts
> per task? If so that would make things slightly tricky when you get away
> from !global counters.

Just to be sure we are on the same page: by "task" we mean the process being
monitored with perf, as in "perf stat -e <event> task.sh"? Yes, each process
may have a few DRM contexts, so we are in a tricky space:).

If we decide to support per-task counters in the future, would it be possible
to preserve the global mode, giving the end-user a choice: monitor
per-process or globally?

> 
> > > Is there a unique i915 resource for each task? If not, I don't see how
> > > per-task event can ever work as expected.
> > 
> > This depends on what you mean under "expected"? I see 2 variants:
> > 1. Either this command line should return correct metric values
> > 2. This command line should error out
> 
> > Right now i915 PMU produces global metric, thus, I think 2nd variant is
> > true. Which I think is achievable if PMU is registered with
> > perf_invalid_context.
> 
> Agreed, and yes, perf_invalid_context is right for uncore stuff and will
> result in refusal to create per-task counters.



Re: [Intel-gfx] [RFC 04/10] drm/i915: Expose a PMU interface for perf queries

2017-08-23 Thread Rogozhkin, Dmitry V
On Tue, 2017-08-22 at 20:17 +0200, Peter Zijlstra wrote:
> On Sat, Aug 12, 2017 at 02:15:13AM +0000, Rogozhkin, Dmitry V wrote:
> > $ perf stat -e instructions,i915/rcs0-busy/ workload.sh
> > <... workload.sh output ...>
> > 
> > Performance counter stats for 'workload.sh':
> >  1,204,616,268  instructions
> >  0  i915/rcs0-busy/
> > 
> >1.869169153 seconds time elapsed
> > 
> > As you can see instructions event works pretty well, i915/rcs0-busy/
> > doesn't.
> > 
> > I'm afraid that our current understanding of how the PMU should work is
> > not fully correct.
> 
> Can we start off by explaining to me how this i915 stuff works. Because
> all I have is ~750 lines of patch without comments. Which sort of leaves
> me confused.

i915 PMU tries to expose a number of SW metrics to characterize i915
performance: i915 interrupt counts, time (counted in microseconds) the GPU
spent in specific states like RC6, time (counted in microseconds) GPU
engines were busy executing some tasks or were idle, etc. As of now
these metrics are i915-global and can't be attached to some particular
task; however, we may consider such metrics in the future. I think that
a global PMU is a nice first step. I had a talk with Andi Kleen who
pointed out that this is similar to how an uncore PMU behaves and
suggested I should use 'perf stat -e <event> -a -C0' and probably adjust
the code to register i915 PMU with perf_invalid_context. Which I did in
the patch that should probably be squashed into the patch from Tvrtko (I
did not want to intersect with his work). This change can be found here:
https://patchwork.freedesktop.org/patch/171953/. This patch makes 'perf
stat -e i915/rcs0-busy/' error out and supports 'perf stat -e
i915/rcs0-busy/ -a -C0'. I still think I am missing something, since
'perf stat -e i915/rcs0-busy/ -a' will give metrics multiplied by the
number of active CPUs, but that looks closer to what is needed.
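
In short, that change boils down to registering the PMU as an
uncore-style one, roughly like this (a sketch; the handler names are
illustrative):

  i915->pmu.base.task_ctx_nr = perf_invalid_context; /* no per-task context */
  i915->pmu.base.event_init  = i915_pmu_event_init;
  i915->pmu.base.add         = i915_pmu_event_add;
  i915->pmu.base.del         = i915_pmu_event_del;
  i915->pmu.base.start       = i915_pmu_event_start;
  i915->pmu.base.stop        = i915_pmu_event_stop;
  i915->pmu.base.read        = i915_pmu_event_read;

  ret = perf_pmu_register(&i915->pmu.base, "i915", -1);

With perf_invalid_context the core perf code refuses to attach the event
to a task, which gives the error-out behaviour I mention below.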

Anyhow, returning to the metrics i915 exposes. Some metrics are just an
exposure of counters already maintained inside i915 which do not require
any special sampling: at any given moment you can request the counter
value (these are interrupt counts and i915 power consumption). Other
metrics are similar to the ones I just described, but they require
activation for i915 to start counting them - this is done on event
initialization (these are the engine busy stats). Finally, there is a
third group which requires sampled counting: they need to be
initialized, and i915 PMU then starts an internal timer to sample these
values (these are some engine characteristics referenced in the code as
QUEUED, SEMA, WAIT).
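
To illustrate the third group, the sampling timer is essentially this
kind of callback (a simplified sketch of the idea, with engines_sample()
as an assumed helper; not the exact driver code):

  static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
  {
          struct drm_i915_private *i915 =
                  container_of(hrtimer, struct drm_i915_private, pmu.timer);

          if (!READ_ONCE(i915->pmu.timer_enabled))
                  return HRTIMER_NORESTART;

          /* accumulate QUEUED/SEMA/WAIT state for each engine */
          engines_sample(i915);

          hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
          return HRTIMER_RESTART;
  }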

I hope this clarifies things, and that I correctly covered the
high-level description of i915 PMU. Chris will hopefully correct me if I
was wrong somewhere. Mind that the other author of i915 PMU, Tvrtko, is
on vacation for one more week or so.


> 
> The above command tries to add an event 'i915/rcs0-busy/' to a task. How
> are i915 resource associated to any one particular task?

Currently in no way, they are global.

> 
> Is there a unique i915 resource for each task? If not, I don't see how
> per-task event can ever work as expected.

This depends on what you mean by "expected"? I see 2 variants:
1. Either this command line should return correct metric values
2. This command line should error out
Right now i915 PMU produces a global metric, thus I think the 2nd
variant is true. Which I think is achievable if the PMU is registered
with perf_invalid_context.

> 
> > I think so, because the way the PMU entry points init(),
> > add(), del(), start(), stop(), read() are implemented does not
> > account for how many times they are called. I have counted them and
> > here is the result:
> > init()=19, add()=44310, del()=43900, start()=44534, stop()=0, read()=0
> > 
> > Which means that we regularly attempt to start/stop the timer and/or
> > busy stats calculations. Another thing that catches attention is that
> > read() was not called at all. How is perf supposed to get the counter value?
> 
> Both stop() and del() are supposed to update event->count. Only if we do
> sys_read() while the event is active (something perf-stat never does
> IIRC) will it issue pmu::read() to get an up-to-date number.
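
Ok, so if I understand correctly, read(), stop() and del() should all
fold the latest value into event->count, along these lines (a sketch;
__i915_pmu_event_read() is an assumed helper returning the raw,
free-running counter value):

  static void i915_pmu_event_read(struct perf_event *event)
  {
          struct hw_perf_event *hwc = &event->hw;
          u64 prev, new;

          new = __i915_pmu_event_read(event);
          prev = local64_xchg(&hwc->prev_count, new);
          local64_add(new - prev, &event->count); /* accumulate the delta */
  }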
> 
> > Yet another thing: where are we supposed to initialize our internal
> > stuff? The numbers above are from a single run and even init() is called
> > multiple times. And where are we supposed to de-init our stuff: each time
> > on del()? This hardly makes sense.
> 
> init happens in pmu::event_init(), which can set an optional
> event->destroy() function for de-init.
> 
> init() is called once for each event created, the above creates an
> inherited per-task event (I think, I lost tra
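
So the init/de-init pairing would look roughly like this (a sketch;
i915_pmu_event_destroy() is the assumed de-init counterpart):

  static int i915_pmu_event_init(struct perf_event *event)
  {
          if (event->attr.type != event->pmu->type)
                  return -ENOENT;

          /* only counting mode is supported */
          if (is_sampling_event(event))
                  return -EINVAL;

          event->destroy = i915_pmu_event_destroy;
          return 0;
  }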

Re: [Intel-gfx] [RFC 2/2] drm/i915/pmu: serve global events and support perf stat

2017-08-17 Thread Rogozhkin, Dmitry V
On Tue, 2017-08-15 at 10:56 -0700, Dmitry Rogozhkin wrote:
> This patch should probably be squashed with Tvrtko's PMU enabling
> patch...
> 
> i915 events don't have a CPU context to work with, so i915
> PMU should work in global mode, i.e. expose perf_invalid_context.
> This will make the following call to perf:
>perf stat workload.sh
> to error out with "failed to read counter". Correct usage would be:
>perf stat -a -C0 workload.sh
> And we will get the expected output:
> 
> Another change required to make perf stat happy is to properly support
> pmu->del(): the comment on the del() declaration says that del() is
> required to call stop(event, PERF_EF_UPDATE), and perf stat implicitly
> gets the event count through this.
> 
> Finally, if perf stat is run as follows:
>perf stat -a workload.sh
> i.e. without the '-C0' option, then the event will be initialized as many
> times as there are CPUs. Thus, the patch adds PMU refcounting support to
> avoid multiple init/close of internal things. The issue which I did not
> overcome is that in this case counters will be multiplied by the number
> of CPUs. Not sure whether it is possible to have some event-enabling
> CPU mask or whether I rather need to simply error out.
> 
> Change-Id: I7d1abe747a4399196e72253f7b66441a6528dbee
> Signed-off-by: Dmitry Rogozhkin 
> Cc: Tvrtko Ursulin 
> Cc: Peter Zijlstra 
> ---
>  drivers/gpu/drm/i915/i915_drv.h |   1 +
>  drivers/gpu/drm/i915/i915_pmu.c | 106 ++--
>  2 files changed, 60 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index b59da2c..215d47b 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2663,6 +2663,7 @@ struct drm_i915_private {
>   struct {
>   struct pmu base;
>   spinlock_t lock;
> + unsigned int ref;
>   struct hrtimer timer;
>   bool timer_enabled;
>   u64 enable;
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index bcdf2bc..0972203 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -451,34 +451,39 @@ static void i915_pmu_enable(struct perf_event *event)
>   container_of(event->pmu, typeof(*i915), pmu.base);
>   struct intel_engine_cs *engine = NULL;
>   unsigned long flags;
> + bool start_timer = false;
>  
> spin_lock_irqsave(&i915->pmu.lock, flags);
>  
> - if (is_engine_event(event)) {
> - engine = intel_engine_lookup_user(i915,
> -   engine_event_class(event),
> -   engine_event_instance(event));
> - GEM_BUG_ON(!engine);
> - engine->pmu.enable |= BIT(engine_event_sample(event));
> - }
> -
> - i915->pmu.enable |= event_enabled_mask(event);
> + if (i915->pmu.ref++ == 0) {
This was careless of me: with this I support only a single busy_stats
event. The correct way to do that is to make busy_stats a refcount. I did
that in the updated patch which I just sent. I wonder whether the timer
stuff should be updated in a similar way, but I have not yet tried this
part of the code to judge.
> + if (is_engine_event(event)) {
> + engine = intel_engine_lookup_user(i915,
> +   engine_event_class(event),
> +   engine_event_instance(event));
> + GEM_BUG_ON(!engine);
> + engine->pmu.enable |= BIT(engine_event_sample(event));
> + }
>  
> - if (pmu_needs_timer(i915, true) && !i915->pmu.timer_enabled) {
> - hrtimer_start_range_ns(&i915->pmu.timer,
> -ns_to_ktime(PERIOD), 0,
> -HRTIMER_MODE_REL_PINNED);
> - i915->pmu.timer_enabled = true;
> - } else if (is_engine_event(event) && engine_needs_busy_stats(engine) &&
> -!engine->pmu.busy_stats) {
> - engine->pmu.busy_stats = true;
> - if (!cancel_delayed_work(&engine->pmu.disable_busy_stats))
> - queue_work(i915->wq, &engine->pmu.enable_busy_stats);
> + i915->pmu.enable |= event_enabled_mask(event);
> +
> + if (pmu_needs_timer(i915, true) && !i915->pmu.timer_enabled) {
> + hrtimer_start_range_ns(&i915->pmu.timer,
> + ns_to_ktime(PERIOD), 0,
> + HRTIMER_MODE_REL_PINNED);
> + i915->pmu.timer_enabled = true;
> + } else if (is_engine_event(event) && engine_needs_busy_stats(engine) &&
> +!engine->pmu.busy_stats) {
> + engine->pmu.busy_stats = true;
> + 
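
For completeness, the del()/stop() contract from the commit message
above boils down to roughly this (a sketch, not the exact patch code):

  static void i915_pmu_event_stop(struct perf_event *event, int flags)
  {
          if (flags & PERF_EF_UPDATE)
                  i915_pmu_event_read(event); /* fold final value into count */
          event->hw.state = PERF_HES_STOPPED;
  }

  static void i915_pmu_event_del(struct perf_event *event, int flags)
  {
          i915_pmu_event_stop(event, PERF_EF_UPDATE);
  }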

Re: [Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

2015-11-19 Thread Rogozhkin, Dmitry V
Hi Chris,

Are you still tuning the patch or is it ready to go? For us this is
critical and one of the most important patches of the past few years,
as it benefits the whole media stack in the server segment. We would
definitely like to see it upstreamed ASAP. Is there anything outstanding
you want us to experiment with?

Dmitry.

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Wednesday, November 4, 2015 5:48 PM
To: Gong, Zhipeng
Cc: intel-gfx@lists.freedesktop.org; Rogozhkin, Dmitry V
Subject: Re: [Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering 
i915_wait_request herd

On Wed, Nov 04, 2015 at 01:20:33PM +, Gong, Zhipeng wrote:
> 
> 
> > -Original Message-
> > From: Chris Wilson [mailto:ch...@chris-wilson.co.uk]
> > Sent: Wednesday, November 04, 2015 5:54 PM On Wed, Nov 04, 2015 at 
> > 06:19:33AM +, Gong, Zhipeng wrote:
> > > > From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] On Tue, Nov 
> > > > 03,
> > > > 2015 at 01:31:22PM +, Gong, Zhipeng wrote:
> > > > >
> > > > > > From: Chris Wilson [mailto:ch...@chris-wilson.co.uk]
> > > > > >
> > > > > > Do you also have a relative perf statistics like op/s we can 
> > > > > > compare to make sure we aren't just stalling the whole system?
> > > > > >
> > > > > Could you please provide the commands about how to check it?
> > > >
> > > > I was presuming your workload has some measure of
> > efficiency/throughput?
> > > > It is one thing to say we are using 10% less CPU (per second), 
> > > > but the task is running 2x as long!
> > > We use execute time as a measurement, the patch affects the 
> > > execution time for our cases slightly.
> > >
> > > Exec time(s)|   w/o patch   |   w/patch
> > > ---
> > > BDW async 1 |65.00  |65.25
> > > BDW async 5 |68.50  |66.42
> > 
> > That's reassuring.
> > 
> > > >
> > > > > > How much cpu time is left in the i915_wait_request branch? i.e.
> > > > > > how close to the limit are we with chasing this path?
> > > > > Could you please provide the commands here either? :)
> > > >
> > > > Check the perf callgraph.
> > >
> > > Now the most of time is in io_schedule_timeout __i915_wait_request
> > > |--64.04%-- io_schedule_timeout
> > > |--22.04%-- intel_engine_add_wakeup
> > > |--3.13%-- prepare_to_wait
> > > |--2.99%-- gen6_rps_boost
> > > |-...
> > 
> > No more busywaits, and most of the time is spent kicking the next 
> > process or doing the insertion sort into the waiting rbtree.
> > 
> > What's the ratio now of __i915_wait_request to the next hot function?
> > And who are the chief callers of __i915_wait_request?
> > -Chris
> Please check the attachments for the details; I'll paste a piece of it here:
> |--17.89%-- i915_gem_object_sync
>  |--73.19%-- __i915_wait_request
>  |--12.60%-- i915_gem_object_retire_request

Interesting. Most of the time is spent shuffling requests around in the
execbuffer rather than doing useful work. I've been working on moving that
work around, but even then we are likely to spend our time instantiating
all those new objects. As far as trimming the CPU time from
__i915_wait_request() goes, that looks about as far as we can go.

If you have some free cycles on those machines, I would very much appreciate 
seeing the same callgraphs from a
http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=nightly&id=134211e33719ef698f9bd51b72ad2fc434cb51f9
kernel

Thanks,
-Chris

--
Chris Wilson, Intel Open Source Technology Centre




Re: [Intel-gfx] [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request herd

2015-11-02 Thread Rogozhkin, Dmitry V
Yeah, very likely. I wonder how easy it would be to deal with the
inter-ring synchronization issue on BDW while we wait for the KMD
scheduler from John Harrison?

-Original Message-
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Monday, November 2, 2015 12:53 PM
To: Gong, Zhipeng
Cc: intel-gfx@lists.freedesktop.org; Rogozhkin, Dmitry V
Subject: Re: [PATCH] RFC drm/i915: Slaughter the thundering i915_wait_request 
herd

On Mon, Nov 02, 2015 at 05:39:54AM +, Gong, Zhipeng wrote:
> Chris-
> 
> The patch cannot be applied to the latest drm-intel-nightly directly.
> I modified it a little bit to make it apply.
> The patch can help much in HSW, but a little bit in BDW.
> The test is to transcode 26 streams, which creates 244 threads.
> 
> CPU util  |   w/o patch  |   w/ patch
> --
> HSW async 1   |   102% | 61%
> HSW async 5   |   114% | 46%
> BDW async 1   |   116% | 116%
> BDW async 5   |   111% | 107%

The problem on bdw is likely to be frequent inter-ring synchronisation keeping 
the number of waiters at 1 (i.e. lack of semaphores). Note that the first 
waiter gets the busywait before waiting on the interrupt.
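
Roughly, the wait path does this (a sketch, eliding the wait loop and
signal handling):

  /* only the first waiter busywaits; everyone else sleeps immediately */
  if (__i915_spin_request(req, state))
          goto complete;

  intel_engine_add_wakeup(req);   /* arm the user interrupt */
  timeout = io_schedule_timeout(timeout);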
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

