Re: [PATCH v4 6/8] drm/xe: Cache data about user-visible engines

2024-05-16 Thread Umesh Nerlige Ramappa

On Thu, May 16, 2024 at 02:52:01PM -0500, Lucas De Marchi wrote:

On Thu, May 16, 2024 at 11:33:54AM GMT, Umesh Nerlige Ramappa wrote:

On Wed, May 15, 2024 at 02:42:56PM -0700, Lucas De Marchi wrote:

gt->info.engine_mask used to indicate the available engines, but that
is not always true anymore: some engines are reserved to kernel and some
may be exposed as a single engine (e.g. with ccs_mode).

Runtime changes only happen when no clients exist, so it's safe to cache
the list of engines in the gt and update that when it's needed. This
will help implementing per client engine utilization so this (mostly
constant) information doesn't need to be re-calculated on every query.

Signed-off-by: Lucas De Marchi 


Just a few questions below, otherwise this looks good as is:

Reviewed-by: Umesh Nerlige Ramappa 


---
drivers/gpu/drm/xe/xe_gt.c  | 23 +++
drivers/gpu/drm/xe/xe_gt.h  | 13 +
drivers/gpu/drm/xe/xe_gt_ccs_mode.c |  1 +
drivers/gpu/drm/xe/xe_gt_types.h| 21 -
4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index e69a03ddd255..5194a3d38e76 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -560,9 +560,32 @@ int xe_gt_init(struct xe_gt *gt)
if (err)
return err;

+   xe_gt_record_user_engines(gt);
+
return drmm_add_action_or_reset(&gt_to_xe(gt)->drm, gt_fini, gt);
}

+void xe_gt_record_user_engines(struct xe_gt *gt)
+{
+   struct xe_hw_engine *hwe;
+   enum xe_hw_engine_id id;
+
+   gt->user_engines.mask = 0;
+   memset(gt->user_engines.instances_per_class, 0,
+  sizeof(gt->user_engines.instances_per_class));
+
+   for_each_hw_engine(hwe, gt, id) {
+   if (xe_hw_engine_is_reserved(hwe))
+   continue;
+
+   gt->user_engines.mask |= BIT_ULL(id);
+   gt->user_engines.instances_per_class[hwe->class]++;
+   }
+
+   xe_gt_assert(gt, (gt->user_engines.mask | gt->info.engine_mask)
+== gt->info.engine_mask);


I am not seeing a place where user_engines.mask is not a subset of 
info.engine_mask in the driver, so the above check will always be 
true.


that's why it's an assert. user_engines.mask should always be a
subset of info.engine_mask, otherwise something went terribly wrong.



Did you mean to do an & instead of a | above? That might make sense 
since then you are making sure that the user_engines are a subset of 
engine_mask.


no, what I'm trying to assert is that user_engines.mask never has an
engine that is not present in info.engine_mask. Example:

engine_mask   == 0b01
user_engines.mask == 0b11

That should never happen and it should fail the assert.


oh, my bad, the assert looks correct.


I decided to add the assert because I'm not deriving the
user_engines.mask directly from the mask, but indirectly. Early on probe
we setup the mask and create the hw_engine instances and we are
calculating the user_engines.mask from there. I just wanted to make sure
we don't screw up something in the middle that causes issues.
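
To spell out the equivalence, here is a standalone sketch (not driver
code) using the masks from the example above: the (user | all) == all
form is exactly the (user & ~all) == 0 subset test.

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t engine_mask = 0x1; /* 0b01: only engine 0 present */
	uint64_t user_mask   = 0x3; /* 0b11: claims engine 1 too - bogus */

	/* The two formulations of "user is a subset of all" agree. */
	assert(((user_mask | engine_mask) == engine_mask) ==
	       ((user_mask & ~engine_mask) == 0));

	/* This bogus mask is what the xe_gt_assert() would catch. */
	assert((user_mask | engine_mask) != engine_mask);
	return 0;
}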




+}
+
static int do_gt_reset(struct xe_gt *gt)
{
int err;
diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 8474c50b1b30..ad3fd31e0a41 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -38,6 +38,19 @@ int xe_gt_init_hwconfig(struct xe_gt *gt);
int xe_gt_init_early(struct xe_gt *gt);
int xe_gt_init(struct xe_gt *gt);
int xe_gt_record_default_lrcs(struct xe_gt *gt);
+
+/**
+ * xe_gt_record_user_engines - save data related to engines available to
+ * userspace
+ * @gt: GT structure
+ *
+ * Walk the available HW engines from gt->info.engine_mask and calculate data
+ * related to those engines that may be used by userspace. To be used whenever
+ * available engines change in runtime (e.g. with ccs_mode) or during


After the driver loads, do we expect ccs_mode to change dynamically 
based on some criteria OR is it a one time configuration at driver 
load?


If former, can you provide an example where ccs_mode would change 
dynamically, just curious.


it can be set via sysfs, but it blocks changing it if there are clients.
Since changing it is blocked while clients exist, and with display there is
usually a client holding the device open, it's easier to check by loading the
driver with enable_display=0. Trying that on a DG2:

# modprobe xe enable_display=0
# exec 3<> /dev/dri/card1
# tail -n4 /proc/self/fdinfo/3
drm-cycles-bcs: 0
drm-total-cycles-bcs:   37728138157
drm-cycles-ccs: 0
drm-total-cycles-ccs:   37728138157
#
# exec 3<&-
# echo 2 > 
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/tile0/gt0/ccs_mode
# exec 3<> /dev/dri/card1
# tail -n4 /proc/self/fdinfo/3
drm-total-cycles-bcs:   38260910526

Re: [PATCH v4 8/8] drm/xe/client: Print runtime to fdinfo

2024-05-16 Thread Umesh Nerlige Ramappa

On Wed, May 15, 2024 at 02:42:58PM -0700, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
  accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
  that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
  time has passed" and that has the same unit/refclock as the one
  recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine available.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-<class>: <CYCLES>
drm-total-cycles-<class>: <TOTAL CYCLES>

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

           RUNTIME1 - RUNTIME0
busyness = -------------------
                 T1 - T0

Since drm-cycles-<class> always starts at 0, it's also possible to know
if an engine was ever used by a client.
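
For example, a userspace consumer could do the following (a minimal
sketch; parsing the fdinfo fields into integers is omitted):

#include <stdint.h>
#include <stdio.h>

/*
 * The four values come from drm-cycles-<class> and
 * drm-total-cycles-<class>, read twice from the same fdinfo file a
 * few seconds apart.
 */
static double busyness_pct(uint64_t cycles0, uint64_t total0,
			   uint64_t cycles1, uint64_t total1)
{
	uint64_t elapsed = total1 - total0;	/* T1 - T0 */

	if (!elapsed)
		return 0.0;

	return 100.0 * (double)(cycles1 - cycles0) / (double)elapsed;
}

int main(void)
{
	/* Hypothetical samples: 500 busy cycles out of 1000 elapsed. */
	printf("%.1f%%\n", busyness_pct(0, 0, 500, 1000));	/* 50.0% */
	return 0;
}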

It's expected that userspace will read any 2 samples every few seconds.
Given the update frequency of the counters involved and that
CTX_TIMESTAMP is 32-bits, the counter for each exec_queue can wrap
around (assuming 100% utilization) after ~200s. The wraparound is not
perceived by userspace since it's just accumulated for all the
exec_queues in a 64-bit counter, but the measurement will not be
accurate if the samples are too far apart.

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc, with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi 


LGTM,

Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh

---
Documentation/gpu/drm-usage-stats.rst   |  21 +++-
Documentation/gpu/xe/index.rst  |   1 +
Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
drivers/gpu/drm/xe/xe_drm_client.c  | 121 +++-
4 files changed, 150 insertions(+), 3 deletions(-)
create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..a80f95ca1b2f 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,19 @@ larger value within a reasonable period. Upon observing a 
value lower than what
was previously read, userspace is expected to stay with that larger previous
value until a monotonic update is seen.

+- drm-total-cycles-<keystr>: <uint>
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles-<keystr> tag and shall contain the total number of cycles for the
+given engine.
+
+This is a timestamp in an unspecified GPU time unit that matches the update
+rate of drm-cycles-<keystr>. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.
+
+A driver may implement either this key or drm-maxfreq-<keystr>, but not both.
+
- drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz]

Engine identifier string must be the same as the one specified in the
@@ -121,6 +134,9 @@ percentage utilization of the engine, whereas 
drm-engine-<keystr> only reflects
time active without considering what frequency the engine is operating as a
percentage of its maximum frequency.

+A driver may implement either this key or drm-total-cycles-<keystr>, but not
+both.
+
Memory
^^

@@ -168,5 +184,6 @@ be documented above and where possible, aligned with other 
drivers.
Driver specific implementations
---

-:ref:`i915-usage-stats`
-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
   xe_firmware
   xe_tile
   xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst 
b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index 000000000000..482d503ae68a
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+========================================
+Xe DRM client usage stats implementation
+========================================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c 
b/drivers/gpu/drm/xe/xe_drm_client.c
index 08f0b7c95901..952b0cc87708 100644
--- a/drivers/gpu/drm/xe/x

Re: [PATCH v4 7/8] drm/xe: Add helper to return any available hw engine

2024-05-16 Thread Umesh Nerlige Ramappa

On Wed, May 15, 2024 at 02:42:57PM -0700, Lucas De Marchi wrote:

Get the first available engine from a gt, which helps in the case any
engine serves as a context, like when reading RING_TIMESTAMP.
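
For context, this is roughly how the helper gets consumed by the fdinfo
query later in this series; xe_hw_engine_read_timestamp() is the
RING_TIMESTAMP read from the follow-up patch, so treat this as a sketch
of the intended use rather than the final code:

static u64 gpu_timestamp_from_any_engine(struct xe_gt *gt)
{
	struct xe_hw_engine *hwe = xe_gt_any_hw_engine(gt);

	/* Any engine works: on current platforms they all follow the
	 * same refclock, so one MMIO read serves all classes. */
	if (!hwe)
		return 0;

	return xe_hw_engine_read_timestamp(hwe);
}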

Signed-off-by: Lucas De Marchi 


Reviewed-by: Umesh Nerlige Ramappa 


---
drivers/gpu/drm/xe/xe_gt.c | 11 +++
drivers/gpu/drm/xe/xe_gt.h |  7 +++
2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 5194a3d38e76..3432fef56486 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -833,3 +833,14 @@ struct xe_hw_engine 
*xe_gt_any_hw_engine_by_reset_domain(struct xe_gt *gt,

return NULL;
}
+
+struct xe_hw_engine *xe_gt_any_hw_engine(struct xe_gt *gt)
+{
+   struct xe_hw_engine *hwe;
+   enum xe_hw_engine_id id;
+
+   for_each_hw_engine(hwe, gt, id)
+   return hwe;
+
+   return NULL;
+}
diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index ad3fd31e0a41..a53f01362d94 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -67,6 +67,13 @@ void xe_gt_remove(struct xe_gt *gt);
struct xe_hw_engine *
xe_gt_any_hw_engine_by_reset_domain(struct xe_gt *gt, enum xe_engine_class 
class);

+/**
+ * xe_gt_any_hw_engine - scan the list of engines and return the
+ * first available
+ * @gt: GT structure
+ */
+struct xe_hw_engine *xe_gt_any_hw_engine(struct xe_gt *gt);
+
struct xe_hw_engine *xe_gt_hw_engine(struct xe_gt *gt,
 enum xe_engine_class class,
 u16 instance,
--
2.43.0



Re: [PATCH v4 6/8] drm/xe: Cache data about user-visible engines

2024-05-16 Thread Umesh Nerlige Ramappa

On Wed, May 15, 2024 at 02:42:56PM -0700, Lucas De Marchi wrote:

gt->info.engine_mask used to indicate the available engines, but that
is not always true anymore: some engines are reserved to kernel and some
may be exposed as a single engine (e.g. with ccs_mode).

Runtime changes only happen when no clients exist, so it's safe to cache
the list of engines in the gt and update that when it's needed. This
will help implementing per client engine utilization so this (mostly
constant) information doesn't need to be re-calculated on every query.

Signed-off-by: Lucas De Marchi 


Just a few questions below, otherwise this looks good as is:

Reviewed-by: Umesh Nerlige Ramappa 


---
drivers/gpu/drm/xe/xe_gt.c  | 23 +++
drivers/gpu/drm/xe/xe_gt.h  | 13 +
drivers/gpu/drm/xe/xe_gt_ccs_mode.c |  1 +
drivers/gpu/drm/xe/xe_gt_types.h| 21 -
4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index e69a03ddd255..5194a3d38e76 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -560,9 +560,32 @@ int xe_gt_init(struct xe_gt *gt)
if (err)
return err;

+   xe_gt_record_user_engines(gt);
+
return drmm_add_action_or_reset(&gt_to_xe(gt)->drm, gt_fini, gt);
}

+void xe_gt_record_user_engines(struct xe_gt *gt)
+{
+   struct xe_hw_engine *hwe;
+   enum xe_hw_engine_id id;
+
+   gt->user_engines.mask = 0;
+   memset(gt->user_engines.instances_per_class, 0,
+  sizeof(gt->user_engines.instances_per_class));
+
+   for_each_hw_engine(hwe, gt, id) {
+   if (xe_hw_engine_is_reserved(hwe))
+   continue;
+
+   gt->user_engines.mask |= BIT_ULL(id);
+   gt->user_engines.instances_per_class[hwe->class]++;
+   }
+
+   xe_gt_assert(gt, (gt->user_engines.mask | gt->info.engine_mask)
+== gt->info.engine_mask);


I am not seeing a place where user_engines.mask is not a subset of 
info.engine_mask in the driver, so the above check will always be true.


Did you mean to do an & instead of a | above? That might make sense since 
then you are making sure that the user_engines are a subset of 
engine_mask.



+}
+
static int do_gt_reset(struct xe_gt *gt)
{
int err;
diff --git a/drivers/gpu/drm/xe/xe_gt.h b/drivers/gpu/drm/xe/xe_gt.h
index 8474c50b1b30..ad3fd31e0a41 100644
--- a/drivers/gpu/drm/xe/xe_gt.h
+++ b/drivers/gpu/drm/xe/xe_gt.h
@@ -38,6 +38,19 @@ int xe_gt_init_hwconfig(struct xe_gt *gt);
int xe_gt_init_early(struct xe_gt *gt);
int xe_gt_init(struct xe_gt *gt);
int xe_gt_record_default_lrcs(struct xe_gt *gt);
+
+/**
+ * xe_gt_record_user_engines - save data related to engines available to
+ * userspace
+ * @gt: GT structure
+ *
+ * Walk the available HW engines from gt->info.engine_mask and calculate data
+ * related to those engines that may be used by userspace. To be used whenever
+ * available engines change in runtime (e.g. with ccs_mode) or during


After the driver loads, do we expect ccs_mode to change dynamically 
based on some criteria OR is it a one time configuration at driver load?


If former, can you provide an example where ccs_mode would change 
dynamically, just curious.


Regards,
Umesh


+ * initialization
+ */
+void xe_gt_record_user_engines(struct xe_gt *gt);
+
void xe_gt_suspend_prepare(struct xe_gt *gt);
int xe_gt_suspend(struct xe_gt *gt);
int xe_gt_resume(struct xe_gt *gt);
diff --git a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c 
b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
index a34c9a24dafc..c36218f4f6c8 100644
--- a/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
+++ b/drivers/gpu/drm/xe/xe_gt_ccs_mode.c
@@ -134,6 +134,7 @@ ccs_mode_store(struct device *kdev, struct device_attribute 
*attr,
if (gt->ccs_mode != num_engines) {
xe_gt_info(gt, "Setting compute mode to %d\n", num_engines);
gt->ccs_mode = num_engines;
+   xe_gt_record_user_engines(gt);
xe_gt_reset_async(gt);
}

diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index 5a114fc9dde7..aaf2951749a6 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -112,7 +112,11 @@ struct xe_gt {
enum xe_gt_type type;
/** @info.reference_clock: clock frequency */
u32 reference_clock;
-   /** @info.engine_mask: mask of engines present on GT */
+   /**
+* @info.engine_mask: mask of engines present on GT. Some of
+* them may be reserved in runtime and not available for user.
+* See @user_engines.mask
+*/
u64 engine_mask;
/** @info.gmdid: raw GMD_ID value from hardware */
u32 gmdid;
@@ -365,6 +369,21 @@ s

Re: [PATCH v3 6/6] drm/xe/client: Print runtime to fdinfo

2024-05-07 Thread Umesh Nerlige Ramappa

On Tue, May 07, 2024 at 03:45:10PM -0700, Lucas De Marchi wrote:

Print the accumulated runtime for client when printing fdinfo.
Each time a query is done it first does 2 things:

1) loop through all the exec queues for the current client and
  accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
  that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
  time has passed" and that has the same unit/refclock as the one
  recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

drm-cycles-<class>: <CYCLES>
drm-total-cycles-<class>: <TOTAL CYCLES>

Userspace is expected to collect at least 2 samples, which allows to
know the client engine busyness as per:

           RUNTIME1 - RUNTIME0
busyness = -------------------
                 T1 - T0

Another thing to point out is that it's expected that userspace will
read any 2 samples every few seconds.  Given the update frequency of the
counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
each exec_queue can wrap around (assuming 100% utilization) after ~200s.
The wraparound is not perceived by userspace since it's just accumulated
for all the exec_queues in a 64-bit counter, but the measurement will
not be accurate if the samples are too far apart.
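
The wrap-safety relies on plain unsigned 32-bit arithmetic; a standalone
sketch (not driver code) of why a single wrap between samples still
yields the right delta:

#include <assert.h>
#include <stdint.h>

/*
 * With unsigned 32-bit arithmetic, new - old gives the correct delta
 * across a single CTX_TIMESTAMP wrap, so the 64-bit accumulation stays
 * correct as long as samples are less than one wrap (~200s at 100%
 * load) apart.
 */
int main(void)
{
	uint64_t runtime = 0;
	uint32_t old_ts = 0xfffffff0u;	/* 16 ticks before the wrap */
	uint32_t new_ts = 0x00000010u;	/* 16 ticks after the wrap */

	runtime += (uint32_t)(new_ts - old_ts);
	assert(runtime == 32);
	return 0;
}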

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi 
---
Documentation/gpu/drm-usage-stats.rst   |  16 ++-
Documentation/gpu/xe/index.rst  |   1 +
Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
drivers/gpu/drm/xe/xe_drm_client.c  | 136 +++-
4 files changed, 160 insertions(+), 3 deletions(-)
create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a 
value lower than what
was previously read, userspace is expected to stay with that larger previous
value until a monotonic update is seen.

+- drm-total-cycles-<keystr>: <uint>
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles-<keystr> tag and shall contain the total number of cycles for the
+given engine.
+
+This is a timestamp in an unspecified GPU time unit that matches the update
+rate of drm-cycles-<keystr>. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.
+
- drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz]

Engine identifier string must be the same as the one specified in the
@@ -168,5 +179,6 @@ be documented above and where possible, aligned with other 
drivers.
Driver specific implementations
---

-:ref:`i915-usage-stats`
-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
   xe_firmware
   xe_tile
   xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst 
b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index 000000000000..482d503ae68a
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+========================================
+Xe DRM client usage stats implementation
+========================================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c 
b/drivers/gpu/drm/xe/xe_drm_client.c
index 08f0b7c95901..27494839d586 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -2,6 +2,7 @@
/*
 * Copyright © 2023 Intel Corporation
 */
+#include "xe_drm_client.h"

#include 
#include 
@@ -12,9 +13,65 @@
#include "xe_bo.h"
#include "xe_bo_types.h"
#include "xe_device_types.h"
-#include "xe_drm_client.h"
+#include "xe_exec_queue.h"
+#include "xe_gt.h"
+#include "xe_hw_engine.h"
+#include "xe_pm.h"
#include "xe_trace.h"

+/**
+ * DOC: DRM Client usage stats
+ *
+ * The drm/xe driver implements the DRM client usage stats specification as
+ * documented in 

Re: [PATCH v3 5/6] drm/xe: Add helper to accumulate exec queue runtime

2024-05-07 Thread Umesh Nerlige Ramappa

On Tue, May 07, 2024 at 03:45:09PM -0700, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.

v2:
 - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
   runtime when job is finished since xe_sched_job_completed() is not a
   notification that job finished.
 - Stop trying to update runtime from xe_exec_queue_fini() - that is
   redundant and may happen after xef is closed, leading to a
   use-after-free
 - Do not special case the first timestamp read: the default LRC sets
   CTX_TIMESTAMP to zero, so even the first sample should be a valid
   one.
 - Handle the parallel submission case by multiplying the runtime by
   width.

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
drivers/gpu/drm/xe/xe_device_types.h |  9 +++
drivers/gpu/drm/xe/xe_exec_queue.c   | 35 
drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
drivers/gpu/drm/xe/xe_execlist.c |  1 +
drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
5 files changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 906b98fb973b..de078bdf0ab9 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -560,6 +560,15 @@ struct xe_file {
struct mutex lock;
} exec_queue;

+   /**
+* @runtime: hw engine class runtime in ticks for this drm client
+*
+* Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+* case, since all jobs run in parallel on the engines, only the stats
+* from lrc[0] are sufficient.


Maybe we can drop the above comment altogether after the multi-lrc 
update.


Umesh


Re: [PATCH v2 3/6] drm/xe: Add helper to accumulate exec queue runtime

2024-04-26 Thread Umesh Nerlige Ramappa

On Fri, Apr 26, 2024 at 11:49:32AM +0100, Tvrtko Ursulin wrote:


On 24/04/2024 00:56, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. Currently that is done in 2 places:

1. when the exec_queue is destroyed
2. when the sched job is completed

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
 drivers/gpu/drm/xe/xe_device_types.h |  9 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
 drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
 drivers/gpu/drm/xe/xe_sched_job.c|  2 ++
 4 files changed, 49 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 2e62450d86e1..33d3bf93a2f1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -547,6 +547,15 @@ struct xe_file {
struct mutex lock;
} exec_queue;
+   /**
+* @runtime: hw engine class runtime in ticks for this drm client
+*
+* Only stats from xe_exec_queue->lrc[0] are accumulated. For multi-lrc
+* case, since all jobs run in parallel on the engines, only the stats
+* from lrc[0] are sufficient.


Out of curiousity doesn't this mean multi-lrc jobs will be incorrectly 
accounted for? (When capacity is considered.)


TBH, I am not sure what the user would like to see here for multi-lrc.  
If reporting the capacity, then we may need to use width as a 
multiplication factor for multi-lrc. How was this done in i915?
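
For multi-lrc, the width-scaling option would look something like this
sketch (illustrative only; uint types stand in for the kernel's u64/u32,
and the v3 changelog earlier in this digest notes this is what was
eventually done):

#include <stdint.h>

/*
 * Scale the lrc[0] timestamp delta by the queue width so a parallel
 * (multi-lrc) job accounts for every engine instance it occupied.
 */
static void accumulate_runtime(uint64_t *class_runtime, uint32_t old_ts,
			       uint32_t new_ts, uint16_t width)
{
	*class_runtime += (uint64_t)(uint32_t)(new_ts - old_ts) * width;
}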


Regards,
Umesh




Regards,

Tvrtko


+*/
+   u64 runtime[XE_ENGINE_CLASS_MAX];
+
/** @client: drm client */
struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..b7b6256cb96a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -214,6 +214,8 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 {
int i;
+   xe_exec_queue_update_runtime(q);
+
for (i = 0; i < q->width; ++i)
xe_lrc_finish(q->lrc + i);
if (!(q->flags & EXEC_QUEUE_FLAG_PERMANENT) && (q->flags & 
EXEC_QUEUE_FLAG_VM || !q->vm))
@@ -769,6 +771,41 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
q->lrc[0].fence_ctx.next_seqno - 1;
 }
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+   struct xe_file *xef;
+   struct xe_lrc *lrc;
+   u32 old_ts, new_ts;
+
+   /*
+* Jobs that are run during driver load may use an exec_queue, but are
+* not associated with a user xe file, so avoid accumulating busyness
+* for kernel specific work.
+*/
+   if (!q->vm || !q->vm->xef)
+   return;
+
+   xef = q->vm->xef;
+   lrc = &q->lrc[0];
+
+   new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+
+   /*
+* Special case the very first timestamp: we don't want the
+* initial delta to be a huge value
+*/
+   if (old_ts)
+   xef->runtime[q->class] += new_ts - old_ts;
+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h 
b/drivers/gpu/drm/xe/xe_exec_queue.h
index 02ce8d204622..45b72daa2db3 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -66,5 +66,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct 
xe_exec_queue *e,
   struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 #endif
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c 
b/drivers/gpu/drm/xe/xe_sched_job.c
index cd8a2fba5438..6a081a4fa190 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -242,6 +242,8 @@ bool xe_sched_job_completed(struct xe_sched_job *job)
 {
struct xe_lrc *lrc = job->q->lrc;
+   xe_exec_queue_update_runtime(job->q);
+
/*
 * Can safely check just LRC[0] seqno as that is last seqno written when
 * parallel handshake is done.


Re: [PATCH 1/3] drm/i915/guc: Support new and improved engine busyness

2023-10-03 Thread Umesh Nerlige Ramappa

On Fri, Sep 22, 2023 at 03:25:08PM -0700, john.c.harri...@intel.com wrote:

From: John Harrison 

The GuC has been extended to support a much more friendly engine
busyness interface. So partition the old interface into a 'busy_v1'
space and add 'busy_v2' support alongside. And if v2 is available, use
that in preference to v1. Note that v2 provides extra features over
and above v1 which will be exposed via PMU in subsequent patches.


Since we are thinking of using the existing busyness counter to expose 
the v2 values, we can drop the last sentence from above.




Signed-off-by: John Harrison 
---
drivers/gpu/drm/i915/gt/intel_engine_types.h  |   4 +-
.../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   4 +-
drivers/gpu/drm/i915/gt/uc/intel_guc.h|  82 ++--
drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  55 ++-
drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   9 +-
drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  23 +-
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 381 ++
7 files changed, 427 insertions(+), 131 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index a7e6775980043..40fd8f984d64b 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -323,7 +323,7 @@ struct intel_engine_execlists_stats {
ktime_t start;
};

-struct intel_engine_guc_stats {
+struct intel_engine_guc_stats_v1 {
/**
 * @running: Active state of the engine when busyness was last sampled.
 */
@@ -603,7 +603,7 @@ struct intel_engine_cs {
struct {
union {
struct intel_engine_execlists_stats execlists;
-   struct intel_engine_guc_stats guc;
+   struct intel_engine_guc_stats_v1 guc_v1;
};


Overall, I would suggest having the renames as a separate patch. Would 
make the review easier.




/**
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h 
b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index f359bef046e0b..c190a99a36c38 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -137,7 +137,9 @@ enum intel_guc_action {
INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE = 0x4600,
INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
INTEL_GUC_ACTION_CLIENT_SOFT_RESET = 0x5507,
-   INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
+   INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF_V1 = 0x550A,
+   INTEL_GUC_ACTION_SET_DEVICE_ENGINE_UTILIZATION_V2 = 0x550C,
+   INTEL_GUC_ACTION_SET_FUNCTION_ENGINE_UTILIZATION_V2 = 0x550D,
INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION = 0x8002,
INTEL_GUC_ACTION_NOTIFY_FLUSH_LOG_BUFFER_TO_FILE = 0x8003,
INTEL_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED = 0x8004,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h 
b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
index 6c392bad29c19..e6502ab5f049f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
@@ -226,45 +226,61 @@ struct intel_guc {
struct mutex send_mutex;

/**
-* @timestamp: GT timestamp object that stores a copy of the timestamp
-* and adjusts it for overflow using a worker.
+* @busy: Data used by the different versions of engine busyness 
implementations.
 */
-   struct {
-   /**
-* @lock: Lock protecting the below fields and the engine stats.
-*/
-   spinlock_t lock;
-
-   /**
-* @gt_stamp: 64 bit extended value of the GT timestamp.
-*/
-   u64 gt_stamp;
-
-   /**
-* @ping_delay: Period for polling the GT timestamp for
-* overflow.
-*/
-   unsigned long ping_delay;
-
-   /**
-* @work: Periodic work to adjust GT timestamp, engine and
-* context usage for overflows.
-*/
-   struct delayed_work work;
-
+   union {
/**
-* @shift: Right shift value for the gpm timestamp
+* @v1: Data used by v1 engine busyness implementation. Mostly 
a copy
+* of the GT timestamp extended to 64 bits and the worker for 
maintaining it.
 */
-   u32 shift;
+   struct {
+   /**
+* @lock: Lock protecting the below fields and the 
engine stats.
+*/
+   spinlock_t lock;
+
+   /**
+* @gt_stamp: 64 bit extended value of the GT timestamp.
+*/
+   u64 gt_stamp;
+
+   /**
+* @ping_delay: Period for polling the GT 

Re: [Intel-gfx] [PATCH 2/3] drm/i915/mtl: Add a PMU counter for total active ticks

2023-09-27 Thread Umesh Nerlige Ramappa

On Mon, Sep 25, 2023 at 09:40:46AM +0100, Tvrtko Ursulin wrote:


On 22/09/2023 23:25, john.c.harri...@intel.com wrote:

From: Umesh Nerlige Ramappa 

Current engine busyness interface exposed by GuC has a few issues:

- The busyness of active engine is calculated using 2 values provided by
  GuC and is prone to race between CPU reading those values and GuC
  updating them. Any sort of HW synchronization would be at the cost of
  scheduling latencies.

- GuC provides only 32 bit values for busyness and KMD has to run a
  worker to extend the values to 64 bit. In addition KMD also needs to
  extend the GT timestamp to 64 bits so that it can be used to calculate
  active busyness for an engine.

To address these issues, GuC provides a new interface to calculate
engine busyness. GuC accumulates the busyness ticks in a 64 bit value
and also internally updates the busyness for an active context using a
periodic timer. This simplifies the KMD implementation such that KMD
only needs to relay the busyness value to the user.

In addition to fixing the interface, GuC also provides a periodically
total active ticks that the GT has been running for. This counter is
exposed to the user so that the % busyness can be calculated as follows:

busyness % = (engine active ticks/total active ticks) * 100.
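
Concretely, userspace would sample both counters twice and take the
ratio of the deltas; a standalone sketch (not driver code), where the
tick frequency cancels out and never needs to be exposed:

#include <stdint.h>

static double busyness_pct(uint64_t engine0, uint64_t total0,
			   uint64_t engine1, uint64_t total1)
{
	uint64_t total = total1 - total0;

	/* Ratio of the two deltas over the sampling interval. */
	return total ? 100.0 * (double)(engine1 - engine0) / (double)total
		     : 0.0;
}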


AFAIU I915_PMU_TOTAL_ACTIVE_TICKS only runs when GT is awake, right?

So if GT is awake 10% of the time, and engine is busy that 100% of 
that time, which is 10% of the real/wall time, the busyness by this 
formula comes up as 100%. Which wouldn't be useful for intel_gpu_top 
and alike.


How to scale it back to wall time? Again AFAIU there is no info about 
tick frequency, so how does one know what a delta in total active 
ticks means?


Looks like I got this wrong. The implementation is actually updating the 
total active ticks even when idle and that addresses the concern above.




Going back on the higher level, I am not convinced we need to add a 
new uapi just for MTL. If the tick period is known internally we could 
just use v2 internally and expose the current uapi using it.


We did plan to support the total active ticks in future platforms for 
other use cases and thought this would be a good place to initiate the 
support. At the same time, I agree that the existing interface can still 
work with the v2 GuC interface. I will post that.




Any timebase conversion error is unlikely to be relevant because 
userspace only looks at deltas over relatively short periods 
(seconds). Ie. I don't think that the clock drift error would 
accumulate so it would need to be really huge to be relevant over 
short sampling periods.


At some point we may need to think about long running workloads, but 
that may require a different counter anyways, so I would not address it 
here.


Thanks,
Umesh



Regards,

Tvrtko



Implement the new interface and start by adding a new counter for total
active ticks.

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: John Harrison 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  1 +
 drivers/gpu/drm/i915/i915_pmu.c   |  6 +
 include/uapi/drm/i915_drm.h   |  2 ++
 4 files changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 88465d701c278..0c1fee5360777 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1607,6 +1607,30 @@ static ktime_t busy_v2_guc_engine_busyness(struct 
intel_engine_cs *engine, ktime
return ns_to_ktime(total);
 }
+static u64 busy_v1_intel_guc_total_active_ticks(struct intel_guc *guc)
+{
+   return guc->busy.v1.gt_stamp;
+}
+
+static u64 busy_v2_intel_guc_total_active_ticks(struct intel_guc *guc)
+{
+   u64 ticks_gt;
+
+   __busy_v2_get_engine_usage_record(guc, NULL, NULL, NULL, &ticks_gt);
+
+   return ticks_gt;
+}
+
+u64 intel_guc_total_active_ticks(struct intel_gt *gt)
+{
+   struct intel_guc *guc = &gt->uc.guc;
+
+   if (GUC_SUBMIT_VER(guc) < MAKE_GUC_VER(1, 3, 1))
+   return busy_v1_intel_guc_total_active_ticks(guc);
+   else
+   return busy_v2_intel_guc_total_active_ticks(guc);
+}
+
 static int busy_v2_guc_action_enable_usage_stats_device(struct intel_guc *guc)
 {
u32 offset = guc_engine_usage_offset_v2_device(guc);
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
index c57b29cdb1a64..f6d42838825f2 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.h
@@ -30,6 +30,7 @@ void intel_guc_dump_active_requests(struct intel_engine_cs 
*engine,
struct drm_printer *m);
 void intel_guc_busyness_park(struct intel_gt *gt);
 void intel_guc_busyness_unpark(struct intel_gt *

Re: [PATCH] drm/i915/perf: Clear out entire reports after reading if not power of 2 size

2023-05-23 Thread Umesh Nerlige Ramappa

On Mon, May 22, 2023 at 02:50:51PM -0700, Dixit, Ashutosh wrote:

On Mon, 22 May 2023 14:34:18 -0700, Umesh Nerlige Ramappa wrote:


On Mon, May 22, 2023 at 01:17:49PM -0700, Ashutosh Dixit wrote:
> Clearing out report id and timestamp as means to detect unlanded reports
> only works if report size is power of 2. That is, only when report size is
> a sub-multiple of the OA buffer size can we be certain that reports will
> land at the same place each time in the OA buffer (after rewind). If report
> size is not a power of 2, we need to zero out the entire report to be able
> to detect unlanded reports reliably.
>
> Cc: Umesh Nerlige Ramappa 
> Signed-off-by: Ashutosh Dixit 
> ---
> drivers/gpu/drm/i915/i915_perf.c | 17 +++--
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c
> index 19d5652300eeb..58284156428dc 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -877,12 +877,17 @@ static int gen8_append_oa_reports(struct 
i915_perf_stream *stream,
>stream->oa_buffer.last_ctx_id = ctx_id;
>}
>
> -  /*
> -   * Clear out the report id and timestamp as a means to detect 
unlanded
> -   * reports.
> -   */
> -  oa_report_id_clear(stream, report32);
> -  oa_timestamp_clear(stream, report32);
> +  if (is_power_of_2(report_size)) {
> +  /*
> +   * Clear out the report id and timestamp as a means
> +   * to detect unlanded reports.
> +   */
> +  oa_report_id_clear(stream, report32);
> +  oa_timestamp_clear(stream, report32);
> +  } else {
> +  /* Zero out the entire report */
> +  memset(report32, 0, report_size);

Indeed, this was a bug. For a minute, I started wondering if this is the
issue I am running into with the other patch posted for DG2, but then I see
the issue within the first fill of the OA buffer where chunks of the
reports are zeroed out, so this is a new issue.


Yes, I saw this while reviewing your patch. I also thought your issue was
happening on DG2 with a power-of-2 report size; non-power-of-2 report
sizes are only introduced with MTL OAM.


lgtm,

Reviewed-by: Umesh Nerlige Ramappa 


Maybe this should include a Fixes: tag pointing to the patch that 
introduced the OAM non-power-of-2 format.


Umesh



Thanks.
--
Ashutosh



> +  }
>}
>
>if (start_offset != *offset) {
> --
> 2.38.0
>


Re: [PATCH] drm/i915/perf: Clear out entire reports after reading if not power of 2 size

2023-05-22 Thread Umesh Nerlige Ramappa

On Mon, May 22, 2023 at 01:17:49PM -0700, Ashutosh Dixit wrote:

Clearing out report id and timestamp as means to detect unlanded reports
only works if report size is power of 2. That is, only when report size is
a sub-multiple of the OA buffer size can we be certain that reports will
land at the same place each time in the OA buffer (after rewind). If report
size is not a power of 2, we need to zero out the entire report to be able
to detect unlanded reports reliably.
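
A standalone illustration (not driver code) of the divisibility
argument: report N starts at (N * report_size) % buf_size, so reports
land at the same offsets across buffer rewinds only when report_size
divides the power-of-2 OA buffer size; 192 is used here as an
illustrative non-power-of-2 size like the OAM formats discussed below.

#include <assert.h>
#include <stdint.h>

/* Reports land back at the same offsets after a rewind only if the
 * report size divides the buffer size exactly. */
static int lands_at_same_offsets(uint32_t report_size, uint32_t buf_size)
{
	return buf_size % report_size == 0;
}

int main(void)
{
	uint32_t buf_size = 16 * 1024 * 1024;	/* power-of-2 OA buffer */

	assert(lands_at_same_offsets(256, buf_size));	/* power of 2 */
	assert(!lands_at_same_offsets(192, buf_size));	/* not a divisor */
	return 0;
}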

Cc: Umesh Nerlige Ramappa 
Signed-off-by: Ashutosh Dixit 
---
drivers/gpu/drm/i915/i915_perf.c | 17 +++--
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 19d5652300eeb..58284156428dc 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -877,12 +877,17 @@ static int gen8_append_oa_reports(struct i915_perf_stream 
*stream,
stream->oa_buffer.last_ctx_id = ctx_id;
}

-   /*
-* Clear out the report id and timestamp as a means to detect 
unlanded
-* reports.
-*/
-   oa_report_id_clear(stream, report32);
-   oa_timestamp_clear(stream, report32);
+   if (is_power_of_2(report_size)) {
+   /*
+* Clear out the report id and timestamp as a means
+* to detect unlanded reports.
+*/
+   oa_report_id_clear(stream, report32);
+   oa_timestamp_clear(stream, report32);
+   } else {
+   /* Zero out the entire report */
+   memset(report32, 0, report_size);


Indeed, this was a bug. For a minute, I started wondering if this is the 
issue I am running into with the other patch posted for DG2, but then I 
see the issue within the first fill of the OA buffer where chunks of the 
reports are zeroed out, so this is a new issue.


lgtm,

Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh



+   }
}

if (start_offset != *offset) {
--
2.38.0



Re: [PATCH] drm/i915/pmu: Change bitmask of enabled events to u32

2023-05-16 Thread Umesh Nerlige Ramappa

On Tue, May 16, 2023 at 03:13:01PM -0700, Umesh Nerlige Ramappa wrote:

On Tue, May 16, 2023 at 10:24:45AM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Having it as u64 was a confusing (but harmless) mistake.

Also add some asserts to make sure the internal field does not overflow
in the future.

Signed-off-by: Tvrtko Ursulin 
Cc: Ashutosh Dixit 
Cc: Umesh Nerlige Ramappa 
---
I am not entirely sure the __builtin_constant_p->BUILD_BUG_ON branch will
work with all compilers. Let's see...

Compile tested only.
---
drivers/gpu/drm/i915/i915_pmu.c | 32 ++--
1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 7ece883a7d95..8736b3418f88 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -50,7 +50,7 @@ static u8 engine_event_instance(struct perf_event *event)
return (event->attr.config >> I915_PMU_SAMPLE_BITS) & 0xff;
}

-static bool is_engine_config(u64 config)
+static bool is_engine_config(const u64 config)
{
return config < __I915_PMU_OTHER(0);
}
@@ -82,15 +82,28 @@ static unsigned int other_bit(const u64 config)

static unsigned int config_bit(const u64 config)
{
+   unsigned int bit;
+
if (is_engine_config(config))
-   return engine_config_sample(config);
+   bit = engine_config_sample(config);
else
-   return other_bit(config);
+   bit = other_bit(config);
+
+   if (__builtin_constant_p(config))
+   BUILD_BUG_ON(bit >
+BITS_PER_TYPE(typeof_member(struct i915_pmu,
+enable)) - 1);
+   else
+   WARN_ON_ONCE(bit >
+BITS_PER_TYPE(typeof_member(struct i915_pmu,
+enable)) - 1);


The else is firing for the INTERRUPT event because event_bit() also 
calls config_bit(). It would be best to move this check to 
config_mask() and leave this function as is.


I posted the modified version here - 
https://patchwork.freedesktop.org/patch/537361/?series=117843&rev=1 as 
part of the MTL PMU series so that it Tests out with IGT patches.


Thanks,
Umesh



Thanks,
Umesh


+
+   return bit;
}

-static u64 config_mask(u64 config)
+static u32 config_mask(const u64 config)
{
-   return BIT_ULL(config_bit(config));
+   return BIT(config_bit(config));
}

static bool is_engine_event(struct perf_event *event)
@@ -633,11 +646,10 @@ static void i915_pmu_enable(struct perf_event *event)
{
struct drm_i915_private *i915 =
container_of(event->pmu, typeof(*i915), pmu.base);
+   const unsigned int bit = event_bit(event);
struct i915_pmu *pmu = &i915->pmu;
unsigned long flags;
-   unsigned int bit;

-   bit = event_bit(event);
if (bit == -1)
goto update;

@@ -651,7 +663,7 @@ static void i915_pmu_enable(struct perf_event *event)
GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
GEM_BUG_ON(pmu->enable_count[bit] == ~0);

-   pmu->enable |= BIT_ULL(bit);
+   pmu->enable |= BIT(bit);
pmu->enable_count[bit]++;

/*
@@ -698,7 +710,7 @@ static void i915_pmu_disable(struct perf_event *event)
{
struct drm_i915_private *i915 =
container_of(event->pmu, typeof(*i915), pmu.base);
-   unsigned int bit = event_bit(event);
+   const unsigned int bit = event_bit(event);
struct i915_pmu *pmu = &i915->pmu;
unsigned long flags;

@@ -734,7 +746,7 @@ static void i915_pmu_disable(struct perf_event *event)
 * bitmask when the last listener on an event goes away.
 */
if (--pmu->enable_count[bit] == 0) {
-   pmu->enable &= ~BIT_ULL(bit);
+   pmu->enable &= ~BIT(bit);
pmu->timer_enabled &= pmu_needs_timer(pmu, true);
}

--
2.39.2



Re: [PATCH] drm/i915/pmu: Change bitmask of enabled events to u32

2023-05-16 Thread Umesh Nerlige Ramappa

On Tue, May 16, 2023 at 10:24:45AM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Having it as u64 was a confusing (but harmless) mistake.

Also add some asserts to make sure the internal field does not overflow
in the future.

Signed-off-by: Tvrtko Ursulin 
Cc: Ashutosh Dixit 
Cc: Umesh Nerlige Ramappa 
---
I am not entirely sure the __builtin_constant_p->BUILD_BUG_ON branch will
work with all compilers. Let's see...

Compile tested only.
---
drivers/gpu/drm/i915/i915_pmu.c | 32 ++--
1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 7ece883a7d95..8736b3418f88 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -50,7 +50,7 @@ static u8 engine_event_instance(struct perf_event *event)
return (event->attr.config >> I915_PMU_SAMPLE_BITS) & 0xff;
}

-static bool is_engine_config(u64 config)
+static bool is_engine_config(const u64 config)
{
return config < __I915_PMU_OTHER(0);
}
@@ -82,15 +82,28 @@ static unsigned int other_bit(const u64 config)

static unsigned int config_bit(const u64 config)
{
+   unsigned int bit;
+
if (is_engine_config(config))
-   return engine_config_sample(config);
+   bit = engine_config_sample(config);
else
-   return other_bit(config);
+   bit = other_bit(config);
+
+   if (__builtin_constant_p(config))
+   BUILD_BUG_ON(bit >
+BITS_PER_TYPE(typeof_member(struct i915_pmu,
+enable)) - 1);
+   else
+   WARN_ON_ONCE(bit >
+BITS_PER_TYPE(typeof_member(struct i915_pmu,
+enable)) - 1);


The else is firing for the INTERRUPT event because event_bit() also 
calls config_bit(). It would be best to move this check to config_mask() 
and leave this function as is.
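
i.e. something like this sketch, reusing the BITS_PER_TYPE/typeof_member
expression from the quoted diff (illustrative, not the final patch):

static u32 config_mask(const u64 config)
{
	unsigned int bit = config_bit(config);

	/* Range-check once where the mask is built, so event_bit()
	 * callers like the INTERRUPT event don't trip the WARN. */
	if (__builtin_constant_p(config))
		BUILD_BUG_ON(bit >
			     BITS_PER_TYPE(typeof_member(struct i915_pmu,
							 enable)) - 1);
	else
		WARN_ON_ONCE(bit >
			     BITS_PER_TYPE(typeof_member(struct i915_pmu,
							 enable)) - 1);

	return BIT(bit);
}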


Thanks,
Umesh 


+
+   return bit;
}

-static u64 config_mask(u64 config)
+static u32 config_mask(const u64 config)
{
-   return BIT_ULL(config_bit(config));
+   return BIT(config_bit(config));
}

static bool is_engine_event(struct perf_event *event)
@@ -633,11 +646,10 @@ static void i915_pmu_enable(struct perf_event *event)
{
struct drm_i915_private *i915 =
container_of(event->pmu, typeof(*i915), pmu.base);
+   const unsigned int bit = event_bit(event);
struct i915_pmu *pmu = &i915->pmu;
unsigned long flags;
-   unsigned int bit;

-   bit = event_bit(event);
if (bit == -1)
goto update;

@@ -651,7 +663,7 @@ static void i915_pmu_enable(struct perf_event *event)
GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
GEM_BUG_ON(pmu->enable_count[bit] == ~0);

-   pmu->enable |= BIT_ULL(bit);
+   pmu->enable |= BIT(bit);
pmu->enable_count[bit]++;

/*
@@ -698,7 +710,7 @@ static void i915_pmu_disable(struct perf_event *event)
{
struct drm_i915_private *i915 =
container_of(event->pmu, typeof(*i915), pmu.base);
-   unsigned int bit = event_bit(event);
+   const unsigned int bit = event_bit(event);
struct i915_pmu *pmu = &i915->pmu;
unsigned long flags;

@@ -734,7 +746,7 @@ static void i915_pmu_disable(struct perf_event *event)
 * bitmask when the last listener on an event goes away.
 */
if (--pmu->enable_count[bit] == 0) {
-   pmu->enable &= ~BIT_ULL(bit);
+   pmu->enable &= ~BIT(bit);
pmu->timer_enabled &= pmu_needs_timer(pmu, true);
}

--
2.39.2



Re: [PATCH 1/1] drm/i915: fix race condition UAF in i915_perf_add_config_ioctl

2023-03-28 Thread Umesh Nerlige Ramappa

On Tue, Mar 28, 2023 at 02:08:47PM +0100, Tvrtko Ursulin wrote:


On 28/03/2023 10:36, Min Li wrote:

Userspace can guess the id value and try to race oa_config object creation
with config remove, resulting in a use-after-free if we dereference the
object after unlocking the metrics_lock.  For that reason, unlocking the
metrics_lock must be done after we are done dereferencing the object.

Signed-off-by: Min Li 


Fixes: f89823c21224 ("drm/i915/perf: Implement I915_PERF_ADD/REMOVE_CONFIG 
interface")
Cc: Lionel Landwerlin 
Cc: Umesh Nerlige Ramappa 
Cc:  # v4.14+


---
 drivers/gpu/drm/i915/i915_perf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 824a34ec0b83..93748ca2c5da 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -4634,13 +4634,13 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, 
void *data,
err = oa_config->id;
goto sysfs_err;
}
-
-   mutex_unlock(&perf->metrics_lock);
+   id = oa_config->id;
drm_dbg(&perf->i915->drm,
"Added config %s id=%i\n", oa_config->uuid, oa_config->id);
+   mutex_unlock(&perf->metrics_lock);
-   return oa_config->id;
+   return id;
 sysfs_err:
mutex_unlock(&perf->metrics_lock);


LGTM.

Reviewed-by: Tvrtko Ursulin 

Umesh or Lionel could you please double check? I can merge if confirmed okay.


LGTM,

Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh



Regards,

Tvrtko


Re: [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails

2022-12-12 Thread Umesh Nerlige Ramappa

On Tue, Nov 29, 2022 at 01:12:53PM -0800, john.c.harri...@intel.com wrote:

From: John Harrison 

Engine resets are supposed to never happen. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies on such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

Signed-off-by: John Harrison 
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct 
*w)
guc->submission_state.reset_fail_mask = 0;
spin_unlock_irqrestore(&guc->submission_state.lock, flags);

-   if (likely(reset_fail_mask))
+   if (likely(reset_fail_mask)) {
+   struct intel_engine_cs *engine;
+   enum intel_engine_id id;
+
+   /*
+* GuC is toast at this point - it dead loops after sending the 
failed
+* reset notification. So need to manually determine the guilty 
context.
+* Note that it should be safe/reliable to do this here because 
the GuC
+* is toast and will not be scheduling behind the KMD's back.
+*/


Is it defined by the KMD-GuC interface that, following a failed reset
notification, GuC will always dead-loop OR not schedule anything (even on
other engines) until the KMD takes some action? What action should the KMD
take?


Regards,
Umesh


+   for_each_engine_masked(engine, gt, reset_fail_mask, id)
+   intel_guc_find_hung_context(engine);
+
intel_gt_handle_error(gt, reset_fail_mask,
  I915_ERROR_CAPTURE,
  "GuC failed to reset engine mask=0x%x\n",
  reset_fail_mask);
+   }
}

int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
--
2.37.3



Re: [Intel-gfx] [PATCH 1/2] drm/i915: Allow error capture without a request

2022-12-12 Thread Umesh Nerlige Ramappa
  engine->name);
+   rq = NULL;
+   }
} else {
/*
 * Getting here with GuC enabled means it is a forced error 
capture
@@ -1625,12 +1648,14 @@ capture_engine(struct intel_engine_cs *engine,
if (rq)
rq = i915_request_get_rcu(rq);

-   if (!rq)
-   goto no_request_capture;
+   if (rq)
+   capture = intel_engine_coredump_add_request(ee, rq, 
ATOMIC_MAYFAIL);


The 2 back-to-back if (rq) checks could be merged together,


otherwise, lgtm

Reviewed-by: Umesh Nerlige Ramappa 

Umesh

+   else if (ce)
+   capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);

-   capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
if (!capture) {
-   i915_request_put(rq);
+   if (rq)
+   i915_request_put(rq);
goto no_request_capture;
}
if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
--
2.37.3



Re: [PATCH] drm/i915/guc: Remove excessive line feeds in state dumps

2022-11-03 Thread Umesh Nerlige Ramappa

On Mon, Oct 31, 2022 at 03:00:07PM -0700, john.c.harri...@intel.com wrote:

From: John Harrison 

Some of the GuC state dump messages were adding extra line feeds. When
printing via a DRM printer to dmesg, for example, that messes up the
log formatting as it loses any prefixing from the printer. Given that
the extra line feeds are just in the middle of random bits of GuC
state, there isn't any real need for them. So just remove them
completely.

Signed-off-by: John Harrison 


lgtm,

Reviewed-by: Umesh Nerlige Ramappa 


---
drivers/gpu/drm/i915/gt/uc/intel_guc.c| 4 ++--
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 8 
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
index 27b09ba1d295f..1bcd61bb50f89 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c
@@ -871,14 +871,14 @@ void intel_guc_load_status(struct intel_guc *guc, struct 
drm_printer *p)
u32 status = intel_uncore_read(uncore, GUC_STATUS);
u32 i;

-   drm_printf(p, "\nGuC status 0x%08x:\n", status);
+   drm_printf(p, "GuC status 0x%08x:\n", status);
drm_printf(p, "\tBootrom status = 0x%x\n",
   (status & GS_BOOTROM_MASK) >> GS_BOOTROM_SHIFT);
drm_printf(p, "\tuKernel status = 0x%x\n",
   (status & GS_UKERNEL_MASK) >> GS_UKERNEL_SHIFT);
drm_printf(p, "\tMIA Core status = 0x%x\n",
   (status & GS_MIA_MASK) >> GS_MIA_SHIFT);
-   drm_puts(p, "\nScratch registers:\n");
+   drm_puts(p, "Scratch registers:\n");
for (i = 0; i < 16; i++) {
drm_printf(p, "\t%2d: \t0x%x\n",
   i, intel_uncore_read(uncore, 
SOFT_SCRATCH(i)));
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4ccb29f9ac55c..4dbdac8002e32 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4901,7 +4901,7 @@ void intel_guc_submission_print_info(struct intel_guc 
*guc,

drm_printf(p, "GuC Number Outstanding Submission G2H: %u\n",
   atomic_read(&guc->outstanding_submission_g2h));
-   drm_printf(p, "GuC tasklet count: %u\n\n",
+   drm_printf(p, "GuC tasklet count: %u\n",
   atomic_read(&sched_engine->tasklet.count));

spin_lock_irqsave(&sched_engine->lock, flags);
@@ -4949,7 +4949,7 @@ static inline void guc_log_context(struct drm_printer *p,
   atomic_read(&ce->pin_count));
drm_printf(p, "\t\tGuC ID Ref Count: %u\n",
   atomic_read(&ce->guc_id.ref));
-   drm_printf(p, "\t\tSchedule State: 0x%x\n\n",
+   drm_printf(p, "\t\tSchedule State: 0x%x\n",
   ce->guc_state.sched_state);
}

@@ -4978,7 +4978,7 @@ void intel_guc_submission_print_context_info(struct 
intel_guc *guc,
   
READ_ONCE(*ce->parallel.guc.wq_head));
drm_printf(p, "\t\tWQI Tail: %u\n",
   
READ_ONCE(*ce->parallel.guc.wq_tail));
-   drm_printf(p, "\t\tWQI Status: %u\n\n",
+   drm_printf(p, "\t\tWQI Status: %u\n",
   
READ_ONCE(*ce->parallel.guc.wq_status));
}

@@ -4986,7 +4986,7 @@ void intel_guc_submission_print_context_info(struct 
intel_guc *guc,
emit_bb_start_parent_no_preempt_mid_batch) {
u8 i;

-   drm_printf(p, "\t\tChildren Go: %u\n\n",
+   drm_printf(p, "\t\tChildren Go: %u\n",
   get_children_go_value(ce));
for (i = 0; i < ce->parallel.number_children; 
++i)
drm_printf(p, "\t\tChildren Join: %u\n",
--
2.37.3



Re: [Intel-gfx] [PATCH v2 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset

2022-11-03 Thread Umesh Nerlige Ramappa
et makes that 
pointless, I don't remember.
The reset is cancelling the worker anyway. And it will then be 
rescheduled once the reset is done. And the ping time is defined as 
1/8th the wrap time (being approx 223 seconds on current platforms). 
So as long as the reset doesn't take longer than about 200s, there is 
no issue. And if the reset did take longer than that then we have 
bigger issues than the busyness stats (which can't actually be 
counting anyway because nothing is running if the GT is in reset) 
being slightly off.


In addition to canceling the ping worker, __reset_guc_busyness_stats is 
performing the same activities that the ping-worker would do if it were 
to run, so we should be safe to skip the worker when a reset is in 
progress, so lgtm,


Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh



John.



Regards,

Tvrtko




Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2022-10-21 Thread Umesh Nerlige Ramappa

On Fri, Oct 21, 2022 at 09:42:53AM +0100, Tvrtko Ursulin wrote:


On 27/10/2021 01:48, Umesh Nerlige Ramappa wrote:

[snip]


+static void guc_timestamp_ping(struct work_struct *wrk)
+{
+   struct intel_guc *guc = container_of(wrk, typeof(*guc),
+timestamp.work.work);
+   struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
+   struct intel_gt *gt = guc_to_gt(guc);
+   intel_wakeref_t wakeref;
+   unsigned long flags;
+   int srcu, ret;
+
+   /*
+* Synchronize with gt reset to make sure the worker does not
+* corrupt the engine/guc stats.
+*/
+   ret = intel_gt_reset_trylock(gt, &srcu);
+   if (ret)
+   return;
+
+   spin_lock_irqsave(&guc->timestamp.lock, flags);
+
+   with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
+   __update_guc_busyness_stats(guc);


Spotted one splat today: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12268/bat-adlp-4/igt@i915_pm_...@basic-pci-d3-state.html

Could be that the reset lock needs to be inside the rpm get. Haven't really 
thought about it much, could you please check?
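
Something like this ordering in guc_timestamp_ping(), perhaps (untested 
sketch, reusing the names from the patch above):

	wakeref = intel_runtime_pm_get(&gt->i915->runtime_pm);

	ret = intel_gt_reset_trylock(gt, &srcu);
	if (ret)
		goto out_rpm;

	/* ... update the busyness stats under guc->timestamp.lock ... */

	intel_gt_reset_unlock(gt, srcu);
out_rpm:
	intel_runtime_pm_put(&gt->i915->runtime_pm, wakeref);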

<4> [300.214744]
<4> [300.214753] ==
<4> [300.214755] WARNING: possible circular locking dependency detected
<4> [300.214758] 6.1.0-rc1-CI_DRM_12268-g86e8558e3283+ #1 Not tainted
<4> [300.214761] --
<4> [300.214762] kworker/10:1H/265 is trying to acquire lock:
<4> [300.214765] 8275e560 (fs_reclaim){+.+.}-{0:0}, at: 
__kmem_cache_alloc_node+0x27/0x170
<4> [300.214780]
but task is already holding lock:
<4> [300.214782] c900013e7e78 
((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: 
process_one_work+0x1eb/0x5b0
<4> [300.214793]
which lock already depends on the new lock.
<4> [300.214794]
the existing dependency chain (in reverse order) is:
<4> [300.214796]
-> #2 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
<4> [300.214801]lock_acquire+0xd3/0x310
<4> [300.214806]__flush_work+0x77/0x4e0
<4> [300.214811]__cancel_work_timer+0x14e/0x1f0
<4> [300.214815]intel_guc_submission_reset_prepare+0x7a/0x420 [i915]
<4> [300.215119]intel_uc_reset_prepare+0x44/0x50 [i915]
<4> [300.215360]reset_prepare+0x21/0x80 [i915]
<4> [300.215561]intel_gt_reset+0x143/0x340 [i915]
<4> [300.215757]intel_gt_reset_global+0xeb/0x160 [i915]
<4> [300.215946]intel_gt_handle_error+0x2c2/0x410 [i915]
<4> [300.216137]intel_gt_debugfs_reset_store+0x59/0xc0 [i915]
<4> [300.216333]i915_wedged_set+0xc/0x20 [i915]
<4> [300.216513]simple_attr_write+0xda/0x100
<4> [300.216520]full_proxy_write+0x4e/0x80
<4> [300.216525]vfs_write+0xe3/0x4e0
<4> [300.216531]ksys_write+0x57/0xd0
<4> [300.216535]do_syscall_64+0x37/0x90
<4> [300.216542]entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4> [300.216549]
-> #1 (&gt->reset.mutex){+.+.}-{3:3}:
<4> [300.216556]lock_acquire+0xd3/0x310
<4> [300.216559]i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]


i915_gem_shrinker_taints_mutex seems to have something to do with 
fs_reclaim and so does the stack #0. Any idea what this early init is 
doing? Can this code also result in a gt_wedged case? That might explain 
stack #2, which is a reset.



<4> [300.216799]intel_gt_init_reset+0x61/0x80 [i915]
<4> [300.217018]intel_gt_common_init_early+0x10c/0x190 [i915]
<4> [300.217227]intel_root_gt_init_early+0x44/0x60 [i915]
<4> [300.217434]i915_driver_probe+0x9ab/0xf30 [i915]
<4> [300.217615]i915_pci_probe+0xa5/0x240 [i915]
<4> [300.217796]pci_device_probe+0x95/0x110
<4> [300.217803]really_probe+0xd6/0x350
<4> [300.217811]__driver_probe_device+0x73/0x170
<4> [300.217816]driver_probe_device+0x1a/0x90
<4> [300.217821]__driver_attach+0xbc/0x190
<4> [300.217826]bus_for_each_dev+0x72/0xc0
<4> [300.217831]bus_add_driver+0x1bb/0x210
<4> [300.217835]driver_register+0x66/0xc0
<4> [300.217841]0xa093001f
<4> [300.217844]do_one_initcall+0x53/0x2f0
<4> [300.217849]do_init_module+0x45/0x1c0
<4> [300.217855]load_module+0x1d5e/0x1e90
<4> [300.217859]__do_sys_finit_module+0xaf/0x120
<4> [300.217864]do_syscall_64+0x37/0x90
<4> [300.217869]entry_SYSCALL_64_after_hwframe+0x63/0xcd
<4> [300.217875]
-> #0 (fs_reclaim){+.+.}-{0:0}:
<4> [300.217880]validate_chain+0xb3d/0x2000
<4> [300.217884]   

Re: [Intel-gfx] [PATCH 1/1] drm/i915/guc: Enable compute scheduling on DG2

2022-09-22 Thread Umesh Nerlige Ramappa

On Thu, Sep 22, 2022 at 01:12:09PM -0700, john.c.harri...@intel.com wrote:

From: John Harrison 

DG2 has issues. To work around one of these the GuC must schedule
apps in an exclusive manner across both RCS and CCS. That is, if a
context from app X is running on RCS then all CCS engines must sit
idle even if there are contexts from apps Y, Z, ... waiting to run. A
certain OS favours RCS to the total starvation of CCS. Linux does not.
Hence the GuC now has a scheduling policy setting to control this
arbitration.

Signed-off-by: John Harrison 


lgtm,

Reviewed-by: Umesh Nerlige Ramappa 

Regards,
Umesh

---
.../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |  1 +
drivers/gpu/drm/i915/gt/uc/abi/guc_klvs_abi.h |  9 +-
drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   | 22 +
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 93 +++
4 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h 
b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
index 29ef8afc8c2e4..f359bef046e0b 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
@@ -117,6 +117,7 @@ enum intel_guc_action {
INTEL_GUC_ACTION_ENTER_S_STATE = 0x501,
INTEL_GUC_ACTION_EXIT_S_STATE = 0x502,
INTEL_GUC_ACTION_GLOBAL_SCHED_POLICY_CHANGE = 0x506,
+   INTEL_GUC_ACTION_UPDATE_SCHEDULING_POLICIES_KLV = 0x509,
INTEL_GUC_ACTION_SCHED_CONTEXT = 0x1000,
INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET = 0x1001,
INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_DONE = 0x1002,
diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_klvs_abi.h 
b/drivers/gpu/drm/i915/gt/uc/abi/guc_klvs_abi.h
index 4a59478c3b5c4..58012edd4eb0e 100644
--- a/drivers/gpu/drm/i915/gt/uc/abi/guc_klvs_abi.h
+++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_klvs_abi.h
@@ -81,10 +81,17 @@
#define GUC_KLV_SELF_CFG_G2H_CTB_SIZE_KEY   0x0907
#define GUC_KLV_SELF_CFG_G2H_CTB_SIZE_LEN   1u

+/*
+ * Global scheduling policy update keys.
+ */
+enum {
+   GUC_SCHEDULING_POLICIES_KLV_ID_RENDER_COMPUTE_YIELD = 0x1001,
+};
+
/*
 * Per context scheduling policy update keys.
 */
-enum  {
+enum {
GUC_CONTEXT_POLICIES_KLV_ID_EXECUTION_QUANTUM   = 
0x2001,
GUC_CONTEXT_POLICIES_KLV_ID_PREEMPTION_TIMEOUT  = 
0x2002,
GUC_CONTEXT_POLICIES_KLV_ID_SCHEDULING_PRIORITY = 
0x2003,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 323b055e5db97..e7a7fb450f442 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -290,6 +290,25 @@ struct guc_update_context_policy {
struct guc_klv_generic_dw_t klv[GUC_CONTEXT_POLICIES_KLV_NUM_IDS];
} __packed;

+/* Format of the UPDATE_SCHEDULING_POLICIES H2G data packet */
+struct guc_update_scheduling_policy_header {
+   u32 action;
+} __packed;
+
+/*
+ * Can't dynamically allocate memory for the scheduling policy KLV because
+ * it will be sent from within the reset path. Need a fixed size lump on
+ * the stack instead :(.
+ *
+ * Currently, there is only one KLV defined, which has 1 word of KL + 2 words 
of V.
+ */
+#define MAX_SCHEDULING_POLICY_SIZE 3
+
+struct guc_update_scheduling_policy {
+   struct guc_update_scheduling_policy_header header;
+   u32 data[MAX_SCHEDULING_POLICY_SIZE];
+} __packed;
+
#define GUC_POWER_UNSPECIFIED   0
#define GUC_POWER_D01
#define GUC_POWER_D12
@@ -298,6 +317,9 @@ struct guc_update_context_policy {

/* Scheduling policy settings */

+#define GLOBAL_SCHEDULE_POLICY_RC_YIELD_DURATION   100 /* in ms */
+#define GLOBAL_SCHEDULE_POLICY_RC_YIELD_RATIO  50  /* in percent */
+
#define GLOBAL_POLICY_MAX_NUM_WI 15

/* Don't reset an engine upon preemption failure */
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b0..f09f530198f4d 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4177,6 +4177,98 @@ int intel_guc_submission_setup(struct intel_engine_cs 
*engine)
return 0;
}

+struct scheduling_policy {
+   /* internal data */
+   u32 max_words, num_words;
+   u32 count;
+   /* API data */
+   struct guc_update_scheduling_policy h2g;
+};
+
+static u32 __guc_scheduling_policy_action_size(struct scheduling_policy 
*policy)
+{
+   u32 *start = (void *)&policy->h2g;
+   u32 *end = policy->h2g.data + policy->num_words;
+   size_t delta = end - start;
+
+   return delta;
+}
+
+static struct scheduling_policy *__guc_scheduling_policy_start_klv(struct 
scheduling_policy *policy)
+{
+   policy->h2g.header.action = 
INTEL_GUC_ACTION_UPDATE_SCHEDULING_POLICIES_KLV;
+   policy->max_words = ARRAY_SIZE(policy->h2g.data);
+   pol

Re: [Intel-gfx] [PATCH 2/6] drm/i915/gt: Invalidate TLB of the OA unit at TLB invalidations

2022-06-15 Thread Umesh Nerlige Ramappa

On Wed, Jun 15, 2022 at 04:27:36PM +0100, Mauro Carvalho Chehab wrote:

From: Chris Wilson 

On gen12 HW, ensure that the TLB of the OA unit is also invalidated
as just invalidating the TLB of an engine is not enough.

Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")

Signed-off-by: Chris Wilson 
Cc: Fei Yang 
Cc: Andi Shyti 
Cc: sta...@vger.kernel.org
Acked-by: Thomas Hellström 
Signed-off-by: Mauro Carvalho Chehab 
---

See [PATCH 0/6] at: 
https://lore.kernel.org/all/cover.1655306128.git.mche...@kernel.org/

drivers/gpu/drm/i915/gt/intel_gt.c | 10 ++
1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c 
b/drivers/gpu/drm/i915/gt/intel_gt.c
index d5ed6a6ac67c..61b7ec5118f9 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -10,6 +10,7 @@
#include "pxp/intel_pxp.h"

#include "i915_drv.h"
+#include "i915_perf_oa_regs.h"
#include "intel_context.h"
#include "intel_engine_pm.h"
#include "intel_engine_regs.h"
@@ -1259,6 +1260,15 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
awake |= engine->mask;
}

+   /* Wa_2207587034:tgl,dg1,rkl,adl-s,adl-p */
+   if (awake &&
+   (IS_TIGERLAKE(i915) ||
+IS_DG1(i915) ||
+IS_ROCKETLAKE(i915) ||
+IS_ALDERLAKE_S(i915) ||
+IS_ALDERLAKE_P(i915)))
+   intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1);
+


This patch can be dropped since this is being done in i915/i915_perf.c 
-> gen12_oa_disable and is synchronized with OA use cases.
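
(For reference, the perf side does roughly the following when disabling 
OA -- a sketch only, see gen12_oa_disable() for the exact code:

	intel_uncore_write(uncore, GEN12_OA_TLB_INV_CR, 1);
	/* HW clears the bit once the invalidation completes */
	intel_wait_for_register(uncore, GEN12_OA_TLB_INV_CR, 1, 0, 50);
)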


Regards,
Umesh



for_each_engine_masked(engine, gt, awake, tmp) {
struct reg_and_bit rb;

--
2.36.1



Re: [Intel-gfx] [PATCH i-g-t 1/3] lib: Helper library for parsing i915 fdinfo output

2022-04-01 Thread Umesh Nerlige Ramappa

lgtm, thanks for clarifications on the other patch.

Reviewed-by: Umesh Nerlige Ramappa 

Umesh

On Fri, Apr 01, 2022 at 03:11:53PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Tests and intel_gpu_top will share common code for parsing this file.

v2:
* Fix key-value parsing if valid key line ends with ':'.
* Return number of drm keys found.
* Add DRM_CLIENT_FDINFO_MAX_ENGINES. (Umesh)
* Always zero terminate read buffer. (Umesh)

Signed-off-by: Tvrtko Ursulin 
---
lib/igt_drm_fdinfo.c | 188 +++
lib/igt_drm_fdinfo.h |  69 
lib/meson.build  |   7 ++
3 files changed, 264 insertions(+)
create mode 100644 lib/igt_drm_fdinfo.c
create mode 100644 lib/igt_drm_fdinfo.h

diff --git a/lib/igt_drm_fdinfo.c b/lib/igt_drm_fdinfo.c
new file mode 100644
index ..b422f67a4ace
--- /dev/null
+++ b/lib/igt_drm_fdinfo.c
@@ -0,0 +1,188 @@
+/*
+ * Copyright © 2022 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "drmtest.h"
+
+#include "igt_drm_fdinfo.h"
+
+static size_t read_fdinfo(char *buf, const size_t sz, int at, const char *name)
+{
+   size_t count;
+   int fd;
+
+   fd = openat(at, name, O_RDONLY);
+   if (fd < 0)
+   return 0;
+
+   buf[sz - 1] = 0;
+   count = read(fd, buf, sz);
+   buf[sz - 1] = 0;
+   close(fd);
+
+   return count;
+}
+
+static int parse_engine(char *line, struct drm_client_fdinfo *info,
+   size_t prefix_len, uint64_t *val)
+{
+   static const char *e2class[] = {
+   "render",
+   "copy",
+   "video",
+   "video-enhance",
+   };
+   ssize_t name_len;
+   char *name, *p;
+   int found = -1;
+   unsigned int i;
+
+   p = index(line, ':');
+   if (!p || p == line)
+   return -1;
+
+   name_len = p - line - prefix_len;
+   if (name_len < 1)
+   return -1;
+
+   name = line + prefix_len;
+
+   for (i = 0; i < ARRAY_SIZE(e2class); i++) {
+   if (!strncmp(name, e2class[i], name_len)) {
+   found = i;
+   break;
+   }
+   }
+
+   if (found >= 0) {
+   while (*++p && isspace(*p));
+   *val = strtoull(p, NULL, 10);
+   }
+
+   return found;
+}
+
+static const char *find_kv(const char *buf, const char *key, size_t keylen)
+{
+   const char *p = buf;
+
+   if (strncmp(buf, key, keylen))
+   return NULL;
+
+   p = index(buf, ':');
+   if (!p || p == buf)
+   return NULL;
+   if ((p - buf) != keylen)
+   return NULL;
+
+   p++;
+   while (*p && isspace(*p))
+   p++;
+
+   return *p ? p : NULL;
+}
+
+unsigned int
+__igt_parse_drm_fdinfo(int dir, const char *fd, struct drm_client_fdinfo *info)
+{
+   char buf[4096], *_buf = buf;
+   char *l, *ctx = NULL;
+   unsigned int good = 0, num_capacity = 0;
+   size_t count;
+
+   count = read_fdinfo(buf, sizeof(buf), dir, fd);
+   if (!count)
+   return 0;
+
+   while ((l = strtok_r(_buf, "\n", ))) {
+   uint64_t val = 0;
+   const char *v;
+   int idx;
+
+   _buf = NULL;
+
+   if ((v = find_kv(l, "drm-driver", strlen("drm-driver" {
+   strncpy(info->driver, v, sizeof(info->driver) - 1);
+   good++;
+   } else if ((v = find_kv(l, "drm-pdev", strlen("drm-pdev" {
+   strncpy

Re: [igt-dev] [PATCH i-g-t 03/11] intel-gpu-top: Add support for per client stats

2022-03-31 Thread Umesh Nerlige Ramappa

lgtm, I just have a few nits and questions below:

Regardless, this is

Reviewed-by: Umesh Nerlige Ramappa 

Umesh

On Tue, Feb 22, 2022 at 01:55:57PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Use the i915 exported data in /proc/<pid>/fdinfo to show GPU utilization
per DRM client.

Example of the output:

intel-gpu-top: Intel Tigerlake (Gen12) @ /dev/dri/card0 -  220/ 221 MHz
   70% RC6;  0.62/ 7.08 W;  760 irqs/s

ENGINES BUSY MI_SEMA MI_WAIT
  Render/3D   23.06% |██▊  |  0%  0%
Blitter0.00% | |  0%  0%
  Video5.40% |█▋   |  0%  0%
   VideoEnhance   20.67% |██   |  0%  0%

  PID  NAME  Render/3DBlitter  VideoVideoEnhance
 3082   mpv |  ||  ||▌ ||██|
 3117 neverball |█▉||  ||  ||  |
1   systemd |▍ ||  ||  ||  |
 2338   gnome-shell |  ||  ||  ||  |

Signed-off-by: Tvrtko Ursulin 
---
man/intel_gpu_top.rst |   4 +
tools/intel_gpu_top.c | 801 +-
tools/meson.build |   2 +-
3 files changed, 804 insertions(+), 3 deletions(-)

diff --git a/man/intel_gpu_top.rst b/man/intel_gpu_top.rst
index b3b765b05feb..f4dbfc5b44d9 100644
--- a/man/intel_gpu_top.rst
+++ b/man/intel_gpu_top.rst
@@ -56,6 +56,10 @@ Supported keys:
'q'Exit from the tool.
'h'Show interactive help.
'1'Toggle between aggregated engine class and physical engine mode.
+'n'Toggle display of numeric client busyness overlay.
+'s'Toggle between sort modes (runtime, total runtime, pid, client id).
+'i'Toggle display of clients which used no GPU time.
+'H'Toggle between per PID aggregation and individual clients.

DEVICE SELECTION

diff --git a/tools/intel_gpu_top.c b/tools/intel_gpu_top.c
index bc11fce2bb1e..73815cdea8aa 100644
--- a/tools/intel_gpu_top.c
+++ b/tools/intel_gpu_top.c
@@ -43,8 +43,10 @@
#include 
#include 
#include 
+#include 

#include "igt_perf.h"
+#include "igt_drm_fdinfo.h"

#define ARRAY_SIZE(arr) (sizeof(arr)/sizeof(arr[0]))

@@ -311,7 +313,8 @@ static int engine_cmp(const void *__a, const void *__b)
return a->instance - b->instance;
}

-#define is_igpu_pci(x) (strcmp(x, "0000:00:02.0") == 0)
+#define IGPU_PCI "0000:00:02.0"
+#define is_igpu_pci(x) (strcmp(x, IGPU_PCI) == 0)
#define is_igpu(x) (strcmp(x, "i915") == 0)

static struct engines *discover_engines(char *device)
@@ -635,6 +638,547 @@ static void pmu_sample(struct engines *engines)
}
}

+enum client_status {
+   FREE = 0, /* mbz */
+   ALIVE,
+   PROBE
+};
+
+struct clients;
+
+struct client {
+   struct clients *clients;
+
+   enum client_status status;
+   unsigned int id;
+   unsigned int pid;
+   char name[24];
+   char print_name[24];
+   unsigned int samples;
+   unsigned long total_runtime;
+   unsigned long last_runtime;
+   unsigned long *val;
+   uint64_t *last;
+};
+
+struct clients {
+   unsigned int num_clients;
+   unsigned int active_clients;
+
+   unsigned int num_classes;
+   struct engine_class *class;
+
+   char pci_slot[64];
+
+   struct client *client;
+};
+
+#define for_each_client(clients, c, tmp) \
+   for ((tmp) = (clients)->num_clients, c = (clients)->client; \
+(tmp > 0); (tmp)--, (c)++)
+
+static struct clients *init_clients(const char *pci_slot)
+{
+   struct clients *clients;
+
+   clients = malloc(sizeof(*clients));
+   if (!clients)
+   return NULL;
+
+   memset(clients, 0, sizeof(*clients));
+
+   strncpy(clients->pci_slot, pci_slot, sizeof(clients->pci_slot));
+
+   return clients;
+}
+
+static struct client *
+find_client(struct clients *clients, enum client_status status, unsigned int 
id)
+{
+   unsigned int start, num;
+   struct client *c;
+
+   start = status == FREE ? clients->active_clients : 0; /* Free block at 
the end. */
+   num = clients->num_clients - start;
+
+   for (c = &clients->client[start]; num; c++, num--) {
+   if (status != c->status)
+   continue;
+
+   if (status == FREE || c->id == id)
+   return c;
+   }
+
+   return NULL;
+}
+
+static void
+update_client(struct client *c, unsigned int pid, char *name, uint64_t val[16])
+{
+   unsigned int i;
+
+   if (c->pid != pid)
+   c->pid = pid;
+
+   if (strcmp(c->name, name)) {
+   char *p;
+
+   strncpy(c->name, name, sizeof(c->name) - 1);
+   strncpy(c->prin

Re: [igt-dev] [PATCH i-g-t 02/11] tests/i915/drm_fdinfo: Basic and functional tests for GPU busyness exported via fdinfo

2022-03-30 Thread Umesh Nerlige Ramappa
This looks very similar to existing perf_pmu tests with the slight 
change that the busyness is now captured from the fdinfo.


lgtm,
Reviewed-by: Umesh Nerlige Ramappa 

Umesh

On Tue, Feb 22, 2022 at 01:55:56PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Mostly inherited from the perf_pmu, some basic tests, and some tests to
verify exported GPU busyness is as expected.

Signed-off-by: Tvrtko Ursulin 
---
tests/i915/drm_fdinfo.c | 555 
tests/meson.build   |   8 +
2 files changed, 563 insertions(+)
create mode 100644 tests/i915/drm_fdinfo.c

diff --git a/tests/i915/drm_fdinfo.c b/tests/i915/drm_fdinfo.c
new file mode 100644
index ..e3b1ebb0f454
--- /dev/null
+++ b/tests/i915/drm_fdinfo.c
@@ -0,0 +1,555 @@
+/*
+ * Copyright © 2022 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include "igt.h"
+#include "igt_core.h"
+#include "igt_device.h"
+#include "igt_drm_fdinfo.h"
+#include "i915/gem.h"
+#include "intel_ctx.h"
+
+IGT_TEST_DESCRIPTION("Test the i915 drm fdinfo data");
+
+const double tolerance = 0.05f;
+const unsigned long batch_duration_ns = 500e6;
+
+#define __assert_within_epsilon(x, ref, tol_up, tol_down) \
+   igt_assert_f((double)(x) <= (1.0 + (tol_up)) * (double)(ref) && \
+(double)(x) >= (1.0 - (tol_down)) * (double)(ref), \
+"'%s' != '%s' (%f not within +%.1f%%/-%.1f%% tolerance of 
%f)\n",\
+#x, #ref, (double)(x), \
+(tol_up) * 100.0, (tol_down) * 100.0, \
+(double)(ref))
+
+#define assert_within_epsilon(x, ref, tolerance) \
+   __assert_within_epsilon(x, ref, tolerance, tolerance)
+
+static void basics(int i915, unsigned int num_classes)
+{
+   struct drm_client_fdinfo info = { };
+   bool ret;
+
+   ret = igt_parse_drm_fdinfo(i915, &info);
+   igt_assert(ret);
+
+   igt_assert(!strcmp(info.driver, "i915"));
+
+   igt_assert_eq(info.num_engines, num_classes);
+}
+
+/*
+ * Helper for cases where we assert on time spent sleeping (directly or
+ * indirectly), so make it more robust by ensuring the system sleep time
+ * is within test tolerance to start with.
+ */
+static unsigned int measured_usleep(unsigned int usec)
+{
+   struct timespec ts = { };
+   unsigned int slept;
+
+   slept = igt_nsec_elapsed(&ts);
+   igt_assert(slept == 0);
+   do {
+   usleep(usec - slept);
+   slept = igt_nsec_elapsed(&ts) / 1000;
+   } while (slept < usec);
+
+   return igt_nsec_elapsed(&ts);
+}
+
+#define TEST_BUSY (1)
+#define FLAG_SYNC (2)
+#define TEST_TRAILING_IDLE (4)
+#define FLAG_HANG (8)
+#define TEST_ISOLATION (16)
+
+static igt_spin_t *__spin_poll(int fd, uint64_t ahnd, const intel_ctx_t *ctx,
+  const struct intel_execution_engine2 *e)
+{
+   struct igt_spin_factory opts = {
+   .ahnd = ahnd,
+   .ctx = ctx,
+   .engine = e->flags,
+   };
+
+   if (gem_class_can_store_dword(fd, e->class))
+   opts.flags |= IGT_SPIN_POLL_RUN;
+
+   return __igt_spin_factory(fd, &opts);
+}
+
+static unsigned long __spin_wait(int fd, igt_spin_t *spin)
+{
+   struct timespec start = { };
+
+   igt_nsec_elapsed(&start);
+
+   if (igt_spin_has_poll(spin)) {
+   unsigned long timeout = 0;
+
+   while (!igt_spin_has_started(spin)) {
+   unsigned long t = igt_nsec_elapsed();
+
+   igt_assert(gem_bo_busy(fd, spin->handle));
+   if ((t - timeout) > 250e6) {
+   timeout = t;
+   igt_warn("Sp

Re: [Intel-gfx] [PATCH i-g-t 01/11] lib: Helper library for parsing i915 fdinfo output

2022-03-30 Thread Umesh Nerlige Ramappa

On Tue, Feb 22, 2022 at 01:55:55PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Tests and intel_gpu_top will share common code for parsing this file.

Signed-off-by: Tvrtko Ursulin 
---
lib/igt_drm_fdinfo.c | 183 +++
lib/igt_drm_fdinfo.h |  48 
lib/meson.build  |   7 ++
3 files changed, 238 insertions(+)
create mode 100644 lib/igt_drm_fdinfo.c
create mode 100644 lib/igt_drm_fdinfo.h

diff --git a/lib/igt_drm_fdinfo.c b/lib/igt_drm_fdinfo.c
new file mode 100644
index ..28c1bdbda08e
--- /dev/null
+++ b/lib/igt_drm_fdinfo.c
@@ -0,0 +1,183 @@
+/*
+ * Copyright © 2022 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "drmtest.h"
+
+#include "igt_drm_fdinfo.h"
+
+static size_t read_fdinfo(char *buf, const size_t sz, int at, const char *name)
+{
+   size_t count;
+   int fd;
+
+   fd = openat(at, name, O_RDONLY);
+   if (fd < 0)
+   return 0;
+
+   buf[sz - 1] = 0;


Wondering if this ^ should be after the read() in case 4096 bytes are read.
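
Something like this would make the termination explicit (sketch):

	ssize_t n = read(fd, buf, sz - 1);

	if (n <= 0)
		return 0;
	buf[n] = 0;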


+   count = read(fd, buf, sz);
+   close(fd);
+
+   return count;
+}
+
+static int parse_engine(char *line, struct drm_client_fdinfo *info,
+   size_t prefix_len, uint64_t *val)
+{
+   static const char *e2class[] = {
+   "render",
+   "copy",
+   "video",
+   "video-enhance",
+   };
+   ssize_t name_len;
+   char *name, *p;
+   int found = -1;
+   unsigned int i;
+
+   p = index(line, ':');
+   if (!p || p == line)
+   return -1;
+
+   name_len = p - line - prefix_len;
+   if (name_len < 1)
+   return -1;
+
+   name = line + prefix_len;
+
+   for (i = 0; i < ARRAY_SIZE(e2class); i++) {
+   if (!strncmp(name, e2class[i], name_len)) {
+   found = i;
+   break;
+   }
+   }
+
+   if (found >= 0) {
+   while (*++p && isspace(*p));
+   *val = strtoull(p, NULL, 10);
+   }
+
+   return found;
+}
+
+static const char *find_kv(const char *buf, const char *key, size_t keylen)
+{
+   const char *p = buf;
+
+   p = index(buf, ':');
+   if (!p || p == buf)
+   return NULL;
+
+   if ((p - buf) != keylen)
+   return NULL;
+
+   while (*++p && isspace(*p));
+   if (*p && !strncmp(buf, key, keylen))


nit: why not just do the strncmp early in this function since buf, key, 
keylen have not changed?
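
I.e. a sketch of what that would look like:

	if (strncmp(buf, key, keylen))
		return NULL;

	p = index(buf, ':');
	if (!p || p == buf)
		return NULL;
	...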



+   return p;
+
+   return NULL;
+}
+
+bool
+__igt_parse_drm_fdinfo(int dir, const char *fd, struct drm_client_fdinfo *info)
+{
+   char buf[4096], *_buf = buf;
+   char *l, *ctx = NULL;
+   unsigned int good = 0;
+   size_t count;
+


Should buf be zeroed out here?
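
E.g. a

	memset(buf, 0, sizeof(buf));

before the read would guarantee the parser never sees stale stack bytes 
after a short read.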


+   count = read_fdinfo(buf, sizeof(buf), dir, fd);
+   if (!count)
+   return false;
+
+   while ((l = strtok_r(_buf, "\n", ))) {
+   uint64_t val = 0;
+   const char *v;
+   int idx;
+
+   _buf = NULL;
+
+   if ((v = find_kv(l, "drm-driver", strlen("drm-driver" {
+   strncpy(info->driver, v, sizeof(info->driver) - 1);
+   good++;
+   } else if ((v = find_kv(l, "drm-pdev", strlen("drm-pdev" {
+   strncpy(info->pdev, v, sizeof(info->pdev) - 1);
+   }  else if ((v = find_kv(l, "drm-client-id",
+strlen("drm-client-id" {
+   info->id = atol(v);
+   good++;
+   } else if (!strncmp(l, "drm-engine-", 11) &&
+  strncmp(l, 

Re: [Intel-gfx] [PATCH v3 08/13] drm/i915/xehp: Enable ccs/dual-ctx in RCU_MODE

2022-03-01 Thread Umesh Nerlige Ramappa
_init(struct temp_regset *regset,
ret |= GUC_MMIO_REG_ADD(regset, RING_HWS_PGA(base), false);
ret |= GUC_MMIO_REG_ADD(regset, RING_IMR(base), false);

+   if (engine->class == RENDER_CLASS &&
+   CCS_MASK(engine->gt))
+   ret |= GUC_MMIO_REG_ADD(regset, GEN12_RCU_MODE, true);
+
for (i = 0, wa = wal->list; i < wal->count; i++, wa++)
ret |= GUC_MMIO_REG_ADD(regset, wa->reg, wa->masked_reg);

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 891b98236155..7e248e2001de 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3603,6 +3603,19 @@ static bool guc_sched_engine_disabled(struct 
i915_sched_engine *sched_engine)
return !sched_engine->tasklet.callback;
}

+static int gen12_rcs_resume(struct intel_engine_cs *engine)
+{
+   int ret;
+
+   ret = guc_resume(engine);
+   if (ret)
+   return ret;
+
+   xehp_enable_ccs_engines(engine);
+
+   return 0;
+}
+
static void guc_set_default_submission(struct intel_engine_cs *engine)
{
engine->submit_request = guc_submit_request;
@@ -3723,6 +3736,9 @@ static void rcs_submission_override(struct 
intel_engine_cs *engine)
engine->emit_fini_breadcrumb = gen8_emit_fini_breadcrumb_rcs;
break;
}
+
+   if (engine->class == RENDER_CLASS)
+   engine->resume = gen12_rcs_resume;


Why not just have guc_resume and execlist_resume call 
xehp_enable_ccs_engines(engine) for render case?


Also what happens if render itself is not present/fused-off (if there is 
such a thing)?
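
I.e. a sketch of the alternative being suggested, guarded the same way as 
the regset change above:

	static int guc_resume(struct intel_engine_cs *engine)
	{
		...

		if (engine->class == RENDER_CLASS && CCS_MASK(engine->gt))
			xehp_enable_ccs_engines(engine);

		return 0;
	}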


Just those questions, overall the patch looks fine as is:

Reviewed-by: Umesh Nerlige Ramappa 

Umesh



}

static inline void guc_default_irqs(struct intel_engine_cs *engine)
--
2.34.1



Re: [PATCH 7/7] drm/i915: Expose client engine utilisation via fdinfo

2022-02-18 Thread Umesh Nerlige Ramappa

On Thu, Jan 06, 2022 at 04:55:36PM +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Similar to AMD commit
874442541133 ("drm/amdgpu: Add show_fdinfo() interface"), using the
infrastructure added in previous patches, we add basic client info
and GPU engine utilisation for i915.

Example of the output:

 pos:0
 flags:  012
 mnt_id: 21
 drm-driver: i915
 drm-pdev:   0000:00:02.0
 drm-client-id:  7
 drm-engine-render:  9288864723 ns
 drm-engine-copy:2035071108 ns
 drm-engine-video:   0 ns
 drm-engine-video-enhance:   0 ns

v2:
* Update for removal of name and pid.

v3:
* Use drm_driver.name.

Signed-off-by: Tvrtko Ursulin 
Cc: David M Nieto 
Cc: Christian König 
Cc: Daniel Vetter 
Cc: Chris Healy 
Acked-by: Christian König 
---
Documentation/gpu/drm-usage-stats.rst  |  6 +++
Documentation/gpu/i915.rst | 27 ++
drivers/gpu/drm/i915/i915_driver.c |  3 ++
drivers/gpu/drm/i915/i915_drm_client.c | 73 ++
drivers/gpu/drm/i915/i915_drm_client.h |  4 ++
5 files changed, 113 insertions(+)

diff --git a/Documentation/gpu/drm-usage-stats.rst 
b/Documentation/gpu/drm-usage-stats.rst
index c669026be244..6952f8389d07 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -95,3 +95,9 @@ object belong to this client, in the respective memory region.

Default unit shall be bytes with optional unit specifiers of 'KiB' or 'MiB'
indicating kibi- or mebi-bytes.
+
+===
+Driver specific implementations
+===
+
+:ref:`i915-usage-stats`
diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst
index b7d801993bfa..29f412a0c3dc 100644
--- a/Documentation/gpu/i915.rst
+++ b/Documentation/gpu/i915.rst
@@ -708,3 +708,30 @@ The style guide for ``i915_reg.h``.

.. kernel-doc:: drivers/gpu/drm/i915/i915_reg.h
   :doc: The i915 register macro definition style guide
+
+.. _i915-usage-stats:
+
+i915 DRM client usage stats implementation
+==
+
+The drm/i915 driver implements the DRM client usage stats specification as
+documented in :ref:`drm-client-usage-stats`.
+
+Example of the output showing the implemented key value pairs and entirety of
+the currenly possible format options:


s/currenly/currently/

lgtm, for the series 


Reviewed-by: Umesh Nerlige Ramappa 

Regards,
Umesh




Re: [PATCH] drm/i915/perf: Skip the i915_perf_init for dg2

2022-02-17 Thread Umesh Nerlige Ramappa

On Tue, Feb 15, 2022 at 11:01:15AM +0530, Ramalingam C wrote:

i915_perf is not enabled for dg2 yet, hence skip the feature
initialization.

Signed-off-by: Ramalingam C 
cc: Umesh Nerlige Ramappa 
---
drivers/gpu/drm/i915/i915_perf.c | 4 
1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 36f1325baa7d..5ac9604d07b3 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -4373,6 +4373,10 @@ void i915_perf_init(struct drm_i915_private *i915)

/* XXX const struct i915_perf_ops! */

+   /* i915_perf is not enabled for DG2 yet */
+   if (IS_DG2(i915))
+   return;
+


lgtm

Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh

perf->oa_formats = oa_formats;
if (IS_HASWELL(i915)) {
perf->ops.is_valid_b_counter_reg = gen7_is_valid_b_counter_addr;
--
2.20.1



Re: [PATCH] drm/i915/guc/slpc: Correct the param count for unset param

2022-02-17 Thread Umesh Nerlige Ramappa

On Wed, Feb 16, 2022 at 10:15:04AM -0800, Vinay Belgaumkar wrote:

SLPC unset param H2G only needs one parameter - the id of the
param.

Fixes: 025cb07bebfa ("drm/i915/guc/slpc: Cache platform frequency limits")

Suggested-by: Umesh Nerlige Ramappa 
Signed-off-by: Vinay Belgaumkar 
---
drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
index 13b27b8ff74e..ba21ace973da 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
@@ -110,7 +110,7 @@ static int guc_action_slpc_unset_param(struct intel_guc 
*guc, u8 id)
{
u32 request[] = {
GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
-   SLPC_EVENT(SLPC_EVENT_PARAMETER_UNSET, 2),
+   SLPC_EVENT(SLPC_EVENT_PARAMETER_UNSET, 1),


lgtm,

Reviewed-by: Umesh Nerlige Ramappa 

Thanks,
Umesh

id,
};

--
2.34.0



Re: [PATCH v9 2/6] drm/i915: Use to_gt() helper for GGTT accesses

2022-01-04 Thread Umesh Nerlige Ramappa

On Mon, Jan 03, 2022 at 01:17:10PM -0800, Matt Roper wrote:

On Tue, Dec 21, 2021 at 09:46:29PM +0200, Andi Shyti wrote:

Hi Matt,

> > diff --git a/drivers/gpu/drm/i915/i915_perf.c 
b/drivers/gpu/drm/i915/i915_perf.c
> > index 170bba913c30..128315aec517 100644
> > --- a/drivers/gpu/drm/i915/i915_perf.c
> > +++ b/drivers/gpu/drm/i915/i915_perf.c
> > @@ -1630,7 +1630,7 @@ static int alloc_noa_wait(struct i915_perf_stream 
*stream)
> >   struct drm_i915_gem_object *bo;
> >   struct i915_vma *vma;
> >   const u64 delay_ticks = 0x -
> > - intel_gt_ns_to_clock_interval(stream->perf->i915->ggtt.vm.gt,
> > + 
intel_gt_ns_to_clock_interval(to_gt(stream->perf->i915)->ggtt->vm.gt,
>
> I'm not too familiar with the perf code, but this looks a bit roundabout
> since we're ultimately trying to get to a GT...do we even need to go
> through the ggtt structure here or can we just pass
> "to_gt(stream->perf->i915)" as the first parameter?
>
> > 
atomic64_read(&stream->perf->noa_programming_delay));
> >   const u32 base = stream->engine->mmio_base;
> >  #define CS_GPR(x) GEN8_RING_CS_GPR(base, x)
> > @@ -3542,7 +3542,7 @@ i915_perf_open_ioctl_locked(struct i915_perf *perf,
> >
> >  static u64 oa_exponent_to_ns(struct i915_perf *perf, int exponent)
> >  {
> > - return intel_gt_clock_interval_to_ns(perf->i915->ggtt.vm.gt,
> > + return intel_gt_clock_interval_to_ns(to_gt(perf->i915)->ggtt->vm.gt,
>
> Ditto; this looks like "to_gt(perf->i915)" might be all we need?

I think this function is looking for the GT coming from the VM,
otherwise originally it could have taken it from &i915->gt. In my
first version I proposed a wrapper around this but it was
rejected by Lucas.

Besides, as we discussed earlier when I was proposed the static
allocation, the ggtt might not always be linked to the same gt,
so that I assumed that sometimes:

   to_gt(perf->i915)->ggtt->vm.gt != to_gt(perf->i915)

if two GTs are sharing the same ggtt, what would the ggtt->vm.gt
link be?


From the git history, it doesn't look like this really needs to care
about the GGTT at all; I think it was just unintentionally written in a
roundabout manner when intel_gt was first being introduced in the code.
The reference here first showed up in commit f170523a7b8e ("drm/i915/gt:
Consolidate the CS timestamp clocks").

Actually the most correct thing to do is probably to use
'stream->engine->gt' to ensure we grab the GT actually associated with
the stream's engine.



stream is not yet created at this point, so I would do this:

pass intel_gt to the helper instead of perf:
static u64 oa_exponent_to_ns(struct intel_gt *gt, int exponent)
{
return intel_gt_clock_interval_to_ns(gt, 2ULL << exponent);
}

caller would then be:
oa_period = oa_exponent_to_ns(props->engine->gt, value);

Thanks,
Umesh



Matt




Thanks,
Andi


--
Matt Roper
Graphics Software Engineer
VTT-OSGC Platform Enablement
Intel Corporation
(916) 356-2795


[PATCH] drm/i915/pmu: Increase the live_engine_busy_stats sample period

2021-11-11 Thread Umesh Nerlige Ramappa
Irrespective of the backend for request submissions, busyness for an
engine with an active context is calculated using:

busyness = total + (current_time - context_switch_in_time)

In execlists mode of operation, the context switch events are handled
by the CPU. Context switch in/out time and current_time are captured
in CPU time domain using ktime_get().

In GuC mode of submission, context switch events are handled by GuC and
the times in the above formula are captured in GT clock domain. This
information is shared with the CPU through shared memory. This results
in 2 caveats:

1) The time taken between start of a batch and the time that CPU is able
to see the context_switch_in_time in shared memory is dependent on GuC
and memory bandwidth constraints.

2) Determining current_time requires an MMIO read that can take anywhere
between a few us to a couple ms. A reference CPU time is captured soon
after reading the MMIO so that the caller can compare the cpu delta
between 2 busyness samples. The issue here is that the CPU delta and the
busyness delta can be skewed because of the time taken to read the
register.

These 2 factors affect the accuracy of the selftest -
live_engine_busy_stats. For (1) the selftest waits until busyness stats
are visible to the CPU. The effects of (2) are more prominent for the
current busyness sample period of 100 us. Increase the busyness sample
period from 100 us to 10 ms to overcome (2).

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/selftest_engine_pm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
index 0bfd738dbf3a..96cc565afa78 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
@@ -316,7 +316,7 @@ static int live_engine_busy_stats(void *arg)
ENGINE_TRACE(engine, "measuring busy time\n");
preempt_disable();
de = intel_engine_get_busy_time(engine, &t[0]);
-   udelay(100);
+   udelay(10 * 1000);
de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
preempt_enable();
dt = ktime_sub(t[1], t[0]);
-- 
2.20.1



[PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset

2021-11-08 Thread Umesh Nerlige Ramappa
Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. We could run into a case where the PMU
callback could read the reset count before it is updated. This has a
potential of corrupting the busyness stats.

In addition to the reset count, check if the reset bit is set before
capturing busyness.

In addition save the previous stats only if you intend to update them.

v2:
- The 2 reset counts captured in the PMU callback can end up being the
  same if they were captured right after the count is incremented in the
  reset flow. This can lead to a bad busyness state. Ensure that reset
  is not in progress when the initial reset count is captured.

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Matthew Brost 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5cc49c0b3889..0dfc6032cd6b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1183,15 +1183,20 @@ static ktime_t guc_engine_busyness(struct 
intel_engine_cs *engine, ktime_t *now)
u64 total, gt_stamp_saved;
unsigned long flags;
u32 reset_count;
+   bool in_reset;
 
spin_lock_irqsave(&guc->timestamp.lock, flags);
 
/*
-* If a reset happened, we risk reading partially updated
-* engine busyness from GuC, so we just use the driver stored
-* copy of busyness. Synchronize with gt reset using reset_count.
+* If a reset happened, we risk reading partially updated engine
+* busyness from GuC, so we just use the driver stored copy of busyness.
+* Synchronize with gt reset using reset_count and the
+* I915_RESET_BACKOFF flag. Note that reset flow updates the reset_count
+* after I915_RESET_BACKOFF flag, so ensure that the reset_count is
+* usable by checking the flag afterwards.
 */
reset_count = i915_reset_count(gpu_error);
+   in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
 
*now = ktime_get();
 
@@ -1201,9 +1206,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs 
*engine, ktime_t *now)
 * start_gt_clk is derived from GuC state. To get a consistent
 * view of activity, we query the GuC state only if gt is awake.
 */
-   stats_saved = *stats;
-   gt_stamp_saved = guc->timestamp.gt_stamp;
-   if (intel_gt_pm_get_if_awake(gt)) {
+   if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
+   stats_saved = *stats;
+   gt_stamp_saved = guc->timestamp.gt_stamp;
guc_update_engine_gt_clks(engine);
guc_update_pm_timestamp(guc, engine, now);
intel_gt_pm_put_async(gt);
-- 
2.20.1



Re: [Intel-gfx] [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-29 Thread Umesh Nerlige Ramappa

On Tue, Oct 26, 2021 at 05:48:21PM -0700, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an over-accounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at GuC level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate GuC and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of GuC state
- Add hooks to gt park/unpark for GuC busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to GuC initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and GuC stats objects
- Since disable_submission is called from many places, move resetting
 stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
 callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
 called during i915 load. This ends up calling the GuC busyness unpark
 hook and results in kick-starting an uninitialized worker. Let
 park/unpark hooks check if GuC submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
 of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For GuC mode of submission the engine busyness is derived from gt time
 domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
 longer and falls within the busyness tolerances in selftest.

v8:
- Use ktime_get in selftest as before
- intel_reset_trylock_no_wait results in a lockdep splat that is not
 trivial to fix since the PMU callback runs in irq context and the
 reset paths are tightly knit into the driver. The test that uncovers
 this is igt@perf_pmu@faulting-read. Drop intel_reset_trylock_no_wait,
 instead use the reset_count to synchronize with gt reset during pmu
 callback. For the ping, continue to use intel_reset_trylock since ping
 is not run in irq context.

- GuC PM timestamp does not tick when GuC is idle. This can potentially
 result in wrong busyness values when a context is active on the
 engine, but GuC is idle. Use the RING TIMESTAMP as GPU timestamp to
 process the GuC busyness stats. This works since both GuC timestamp and
 RING timestamp are synced with the same clock.

- The busyness stats may get updated after the batch starts running.
 This delay causes the busyness reported for 100us duration to fall
 below 95% in the selftest. The only option at this time is to wait for
 GuC busyness to change from idle to active before we sample busyness
 over a 100us period.

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
Acked-by: Tvrtko Ursulin 
---
drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
drivers/gpu/drm/i915

[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-26 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an over-accounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at GuC level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate GuC and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of GuC state
- Add hooks to gt park/unpark for GuC busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to GuC initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and GuC stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the GuC busyness unpark
  hook and results in kick-starting an uninitialized worker. Let
  park/unpark hooks check if GuC submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For GuC mode of submission the engine busyness is derived from gt time
  domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
  longer and falls within the busyness tolerances in selftest.

v8:
- Use ktime_get in selftest as before
- intel_reset_trylock_no_wait results in a lockdep splat that is not
  trivial to fix since the PMU callback runs in irq context and the
  reset paths are tightly knit into the driver. The test that uncovers
  this is igt@perf_pmu@faulting-read. Drop intel_reset_trylock_no_wait,
  instead use the reset_count to synchronize with gt reset during pmu
  callback. For the ping, continue to use intel_reset_trylock since ping
  is not run in irq context.

- GuC PM timestamp does not tick when GuC is idle. This can potentially
  result in wrong busyness values when a context is active on the
  engine, but GuC is idle. Use the RING TIMESTAMP as GPU timestamp to
  process the GuC busyness stats. This works since both GuC timestamp and
  RING timestamp are synced with the same clock.

- The busyness stats may get updated after the batch starts running.
  This delay causes the busyness reported for 100us duration to fall
  below 95% in the selftest. The only option at this time is to wait for
  GuC busyness to change from idle to active before we sample busyness
  over a 100us period.

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
Acked-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915

[PATCH 1/2] drm/i915/pmu: Add a name to the execlists stats

2021-10-26 Thread Umesh Nerlige Ramappa
In preparation for GuC pmu stats, add a name to the execlists stats
structure so that it can be differentiated from the GuC stats.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 14 +++---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 33 +++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 52 +++-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index ff6753ccb129..2de396e34d83 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -363,7 +363,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id,
DRIVER_CAPS(i915)->has_logical_contexts = true;
 
ewma__engine_latency_init(&engine->latency);
-   seqcount_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.execlists.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1918,15 +1918,16 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
ktime_t *now)
 {
-   ktime_t total = engine->stats.total;
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
+   ktime_t total = stats->total;
 
/*
 * If the engine is executing something at the moment
 * add it to the total.
 */
*now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+   if (READ_ONCE(stats->active))
+   total = ktime_add(total, ktime_sub(*now, stats->start));
 
return total;
 }
@@ -1940,13 +1941,14 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned int seq;
ktime_t total;
 
do {
-   seq = read_seqcount_begin(&engine->stats.lock);
+   seq = read_seqcount_begin(&stats->lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&stats->lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 24fbdd94351a..8e762d683e50 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -15,45 +15,46 @@
 
 static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   if (engine->stats.active) {
-   engine->stats.active++;
+   if (stats->active) {
+   stats->active++;
return;
}
 
/* The writer is serialised; but the pmu reader may be from hardirq */
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.start = ktime_get();
-   engine->stats.active++;
+   stats->start = ktime_get();
+   stats->active++;
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 
-   GEM_BUG_ON(!engine->stats.active);
+   GEM_BUG_ON(!stats->active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   GEM_BUG_ON(!engine->stats.active);
-   if (engine->stats.active > 1) {
-   engine->stats.active--;
+   GEM_BUG_ON(!stats->active);
+   if (stats->active > 1) {
+   stats->active--;
return;
}
 
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.active--;
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
+   stats->active--;
+   stats->total = ktime_add(stats->total,
+ktime_sub(ktime_get(), stats->start));
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index e0f773585c29..24fa7fb0e7de 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -257,6 +

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-19 Thread Umesh Nerlige Ramappa

On Tue, Oct 19, 2021 at 09:32:07AM +0100, Tvrtko Ursulin wrote:


On 18/10/2021 19:35, Umesh Nerlige Ramappa wrote:

On Mon, Oct 18, 2021 at 08:58:01AM +0100, Tvrtko Ursulin wrote:



On 16/10/2021 00:47, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.
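
As a rough sketch of that sampling rule (the record layout and names below
are illustrative assumptions, not the exact GuC interface):

struct guc_engine_usage_record {
	u32 context_id;		/* ~0 when no context is running */
	u32 start;		/* last context switch-in, gt clock ticks */
	u32 total;		/* accumulated busyness, gt clock ticks */
};

/* engine busyness = total + (now - start), all in gt clock ticks */
static u32 sample_engine_busyness(const struct guc_engine_usage_record *rec,
				  u32 now)
{
	u32 busy = rec->total;

	if (rec->context_id != ~0u && rec->start)
		busy += now - rec->start;

	return busy;
}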

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.
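
A minimal sketch of that overflow handling, with illustrative names (the
arithmetic matches the example: 2^32 / 19.2e6 is roughly 224 s per wrap,
and one eighth of that is in the ballpark of the 27 s quoted above):

/* Extend a wrapping 32-bit gt timestamp to 64 bits. Correct as long as
 * samples arrive at least once per wrap, which the worker guarantees.
 */
static u64 gt_stamp_hi;		/* accumulated wrap bits */
static u32 gt_stamp_last;	/* last raw 32-bit sample */

static u64 extend_gt_stamp(u32 now)
{
	if (now < gt_stamp_last)	/* the 32-bit counter wrapped */
		gt_stamp_hi += 1ULL << 32;
	gt_stamp_last = now;

	return gt_stamp_hi | now;
}

/* Worker period: 1/8th of the wrap time for the given gt clock. */
static unsigned long ping_delay_jiffies(u32 gt_clock_hz)
{
	u64 wrap_ns = div_u64(mul_u32_u32(U32_MAX, NSEC_PER_SEC), gt_clock_hz);

	return nsecs_to_jiffies(wrap_ns) / 8;
}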

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the guc busyness unpark
  hook and results in kickstarting an uninitialized worker. Let
  park/unpark hooks check if guc submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For guc mode of submission the engine busyness is derived from gt time
  domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
  longer and falls within the busyness tolerances in selftest.
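
Concretely, the v7 selftest reference would look something like this (a
sketch; intel_gt_clock_interval_to_ns and the exact call sites are assumed
here rather than quoted from the patch):

u32 before, after;

before = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
/* ... sample busyness and let the batch run ... */
after = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);

/* gt clock delta converted to ns for GuC, wall clock otherwise */
dt = intel_engine_uses_guc(engine) ?
     intel_gt_clock_interval_to_ns(gt, after - before) :
     ktime_sub(t[1], t[0]);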


[snip]

diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c

index 75569666105d..24358bef6691 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
@@ -234,6 +234,7 @@ static int live_engine_busy_stats(void *arg)
 struct i915_request *rq;
 ktime_t de, dt;
 ktime_t t[2];
+    u32 gt_stamp;
 if (!intel_engine_supports_stats(engine))
 continue;
@@ -251,10 +252,16 @@ static int live_engine_busy_stats(void *arg)
 ENGINE_TRACE(engine, "measuring idle time\n");
 preempt_disable();
de = intel_engine_get_busy_time(engine, &t[0]);
-    udelay(100);
+    gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+    udelay(1);
de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
+    gt_stamp = intel_uncore_read(gt->uncore, 
GUCPMTIMESTAMP) - gt_stamp;

 preempt_enable();
-    dt = ktime_sub(t[1], t[0]);
+
+    dt = intel_engine_uses_guc(engine) ?
+ intel_gt_c

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-18 Thread Umesh Nerlige Ramappa

On Mon, Oct 18, 2021 at 11:35:44AM -0700, Umesh Nerlige Ramappa wrote:

On Mon, Oct 18, 2021 at 08:58:01AM +0100, Tvrtko Ursulin wrote:



On 16/10/2021 00:47, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
 stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
 callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
 called during i915 load. This ends up calling the guc busyness unpark
 hook and results in kickstarting an uninitialized worker. Let
 park/unpark hooks check if guc submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
 of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For guc mode of submission the engine busyness is derived from gt time
 domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
 longer and falls within the busyness tolerances in selftest.


[snip]


diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
index 75569666105d..24358bef6691 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
@@ -234,6 +234,7 @@ static int live_engine_busy_stats(void *arg)
struct i915_request *rq;
ktime_t de, dt;
ktime_t t[2];
+   u32 gt_stamp;
if (!intel_engine_supports_stats(engine))
continue;
@@ -251,10 +252,16 @@ static int live_engine_busy_stats(void *arg)
ENGINE_TRACE(engine, "measuring idle time\n");
preempt_disable();
de = intel_engine_get_busy_time(engine, &t[0]);
-   udelay(100);
+   gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+   udelay(1);
de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
+   gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP) - 
gt_stamp;
preempt_enable();
-   dt = ktime_sub(t[1], t[0]);
+

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-18 Thread Umesh Nerlige Ramappa

On Mon, Oct 18, 2021 at 08:58:01AM +0100, Tvrtko Ursulin wrote:



On 16/10/2021 00:47, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the guc busyness unpark
  hook and results in kickstarting an uninitialized worker. Let
  park/unpark hooks check if guc submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For guc mode of submission the engine busyness is derived from gt time
  domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
  longer and falls within the busyness tolerances in selftest.


[snip]


diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c 
b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
index 75569666105d..24358bef6691 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_pm.c
@@ -234,6 +234,7 @@ static int live_engine_busy_stats(void *arg)
struct i915_request *rq;
ktime_t de, dt;
ktime_t t[2];
+   u32 gt_stamp;
if (!intel_engine_supports_stats(engine))
continue;
@@ -251,10 +252,16 @@ static int live_engine_busy_stats(void *arg)
ENGINE_TRACE(engine, "measuring idle time\n");
preempt_disable();
de = intel_engine_get_busy_time(engine, &t[0]);
-   udelay(100);
+   gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP);
+   udelay(1);
de = ktime_sub(intel_engine_get_busy_time(engine, &t[1]), de);
+   gt_stamp = intel_uncore_read(gt->uncore, GUCPMTIMESTAMP) - 
gt_stamp;
preempt_enable();
-   dt = ktime_sub(t[1], t[0]);
+
+   dt = intel_eng

[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-15 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the guc busyness unpark
  hook and results in kickstarting an uninitialized worker. Let
  park/unpark hooks check if guc submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For guc mode of submission the engine busyness is derived from gt time
  domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
  longer and falls within the busyness tolerances in selftest.

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
Acked-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 drivers/gpu/drm/i915/gt/intel_reset.c |  15 +
 drivers/gpu/drm/i915/gt/intel_reset.h |   1 +
 drivers/gpu/drm/i915/gt/selftest_engine_pm.c  |  21 +-
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  30 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 273 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 15 files changed, 449 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(eng

[PATCH 1/2] drm/i915/pmu: Add a name to the execlists stats

2021-10-15 Thread Umesh Nerlige Ramappa
In preparation for GuC pmu stats, add a name to the execlists stats
structure so that it can be differentiated from the GuC stats.
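
For reference, the layout this moves towards looks roughly as below
(execlists fields as in the diff that follows; the guc member only lands
with patch 2/2 and is shown schematically):

struct intel_engine_execlists_stats {
	unsigned int active;	/* nesting count of in-flight contexts */
	seqcount_t lock;	/* serialises the writer vs. the pmu reader */
	ktime_t total;		/* busy time accumulated at context-out */
	ktime_t start;		/* timestamp of the last idle->active edge */
};

struct intel_engine_cs {
	/* ... */
	union {
		struct intel_engine_execlists_stats execlists;
		/* struct intel_engine_guc_stats guc;  (patch 2/2) */
	} stats;
	/* ... */;
};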

Signed-off-by: Umesh Nerlige Ramappa 
Acked-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 14 +++---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 33 +++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 52 +++-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..38436f4b5706 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -361,7 +361,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id)
DRIVER_CAPS(i915)->has_logical_contexts = true;
 
ewma__engine_latency_init(&engine->latency);
-   seqcount_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.execlists.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1876,15 +1876,16 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
ktime_t *now)
 {
-   ktime_t total = engine->stats.total;
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
+   ktime_t total = stats->total;
 
/*
 * If the engine is executing something at the moment
 * add it to the total.
 */
*now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+   if (READ_ONCE(stats->active))
+   total = ktime_add(total, ktime_sub(*now, stats->start));
 
return total;
 }
@@ -1898,13 +1899,14 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned int seq;
ktime_t total;
 
do {
-   seq = read_seqcount_begin(&engine->stats.lock);
+   seq = read_seqcount_begin(&stats->lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&stats->lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 24fbdd94351a..8e762d683e50 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -15,45 +15,46 @@
 
 static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   if (engine->stats.active) {
-   engine->stats.active++;
+   if (stats->active) {
+   stats->active++;
return;
}
 
/* The writer is serialised; but the pmu reader may be from hardirq */
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.start = ktime_get();
-   engine->stats.active++;
+   stats->start = ktime_get();
+   stats->active++;
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 
-   GEM_BUG_ON(!engine->stats.active);
+   GEM_BUG_ON(!stats->active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   GEM_BUG_ON(!engine->stats.active);
-   if (engine->stats.active > 1) {
-   engine->stats.active--;
+   GEM_BUG_ON(!stats->active);
+   if (stats->active > 1) {
+   stats->active--;
return;
}
 
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.active--;
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
+   stats->active--;
+   stats->total = ktime_add(stats->total,
+ktime_sub(ktime_get(), stats->start));
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 9167ce52487c..b820a2c1124e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm

[PATCH 1/2] drm/i915/pmu: Add a name to the execlists stats

2021-10-14 Thread Umesh Nerlige Ramappa
In preparation for GuC pmu stats, add a name to the execlists stats
structure so that it can be differentiated from the GuC stats.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 14 +++---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 33 +++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 52 +++-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..38436f4b5706 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -361,7 +361,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id)
DRIVER_CAPS(i915)->has_logical_contexts = true;
 
ewma__engine_latency_init(&engine->latency);
-   seqcount_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.execlists.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1876,15 +1876,16 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
ktime_t *now)
 {
-   ktime_t total = engine->stats.total;
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
+   ktime_t total = stats->total;
 
/*
 * If the engine is executing something at the moment
 * add it to the total.
 */
*now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+   if (READ_ONCE(stats->active))
+   total = ktime_add(total, ktime_sub(*now, stats->start));
 
return total;
 }
@@ -1898,13 +1899,14 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned int seq;
ktime_t total;
 
do {
-   seq = read_seqcount_begin(&engine->stats.lock);
+   seq = read_seqcount_begin(&stats->lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&stats->lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 24fbdd94351a..8e762d683e50 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -15,45 +15,46 @@
 
 static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   if (engine->stats.active) {
-   engine->stats.active++;
+   if (stats->active) {
+   stats->active++;
return;
}
 
/* The writer is serialised; but the pmu reader may be from hardirq */
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.start = ktime_get();
-   engine->stats.active++;
+   stats->start = ktime_get();
+   stats->active++;
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 
-   GEM_BUG_ON(!engine->stats.active);
+   GEM_BUG_ON(!stats->active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   GEM_BUG_ON(!engine->stats.active);
-   if (engine->stats.active > 1) {
-   engine->stats.active--;
+   GEM_BUG_ON(!stats->active);
+   if (stats->active > 1) {
+   stats->active--;
return;
}
 
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.active--;
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
+   stats->active--;
+   stats->total = ktime_add(stats->total,
+ktime_sub(ktime_get(), stats->start));
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 9167ce52487c..b820a2c1124e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -257,6 +

[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-14 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
  called during i915 load. This ends up calling the guc busyness unpark
  hook and results in kickstarting an uninitialized worker. Let
  park/unpark hooks check if guc submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
  of that.

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
Acked-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 drivers/gpu/drm/i915/gt/intel_reset.c |  15 +
 drivers/gpu/drm/i915/gt/intel_reset.h |   1 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  30 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 273 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 14 files changed, 432 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to 

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-14 Thread Umesh Nerlige Ramappa

On Thu, Oct 14, 2021 at 09:21:28AM +0100, Tvrtko Ursulin wrote:


On 13/10/2021 01:56, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 drivers/gpu/drm/i915/gt/intel_reset.c |  16 ++
 drivers/gpu/drm/i915/gt/intel_reset.h |   1 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  30 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 267 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 14 files changed, 427 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(stats->active))
-   total = ktime_add(total, ktime_sub(*now, stats->start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * 

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-13 Thread Umesh Nerlige Ramappa

On Wed, Oct 13, 2021 at 05:06:26PM +0100, Tvrtko Ursulin wrote:


On 13/10/2021 01:56, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset


Looks good to me now, for some combination of high-level and 
incomplete low-level review (I did not check the overflow handling or 
the GuC page layout and flow). Both patches:


Acked-by: Tvrtko Ursulin 


Thanks



Do you have someone available to check the parts I did not and r-b?


I will check with Matt/John.

Regards,
Umesh


Regards,

Tvrtko



Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 drivers/gpu/drm/i915/gt/intel_reset.c |  16 ++
 drivers/gpu/drm/i915/gt/intel_reset.h |   1 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  30 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 267 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 14 files changed, 427 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If t

[PATCH 1/2] drm/i915/pmu: Add a name to the execlists stats

2021-10-12 Thread Umesh Nerlige Ramappa
In preparation for GuC pmu stats, add a name to the execlists stats
structure so that it can be differentiated from the GuC stats.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 14 +++---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 33 +++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 52 +++-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..38436f4b5706 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -361,7 +361,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id)
DRIVER_CAPS(i915)->has_logical_contexts = true;
 
ewma__engine_latency_init(&engine->latency);
-   seqcount_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.execlists.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1876,15 +1876,16 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
ktime_t *now)
 {
-   ktime_t total = engine->stats.total;
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
+   ktime_t total = stats->total;
 
/*
 * If the engine is executing something at the moment
 * add it to the total.
 */
*now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+   if (READ_ONCE(stats->active))
+   total = ktime_add(total, ktime_sub(*now, stats->start));
 
return total;
 }
@@ -1898,13 +1899,14 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned int seq;
ktime_t total;
 
do {
-   seq = read_seqcount_begin(&engine->stats.lock);
+   seq = read_seqcount_begin(&stats->lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&stats->lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 24fbdd94351a..8e762d683e50 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -15,45 +15,46 @@
 
 static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   if (engine->stats.active) {
-   engine->stats.active++;
+   if (stats->active) {
+   stats->active++;
return;
}
 
/* The writer is serialised; but the pmu reader may be from hardirq */
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.start = ktime_get();
-   engine->stats.active++;
+   stats->start = ktime_get();
+   stats->active++;
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 
-   GEM_BUG_ON(!engine->stats.active);
+   GEM_BUG_ON(!stats->active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   GEM_BUG_ON(!engine->stats.active);
-   if (engine->stats.active > 1) {
-   engine->stats.active--;
+   GEM_BUG_ON(!stats->active);
+   if (stats->active > 1) {
+   stats->active--;
return;
}
 
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.active--;
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
+   stats->active--;
+   stats->total = ktime_add(stats->total,
+ktime_sub(ktime_get(), stats->start));
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 9167ce52487c..b820a2c1124e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -257,6 +

[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-12 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
  callbacks and worker with gt reset

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 drivers/gpu/drm/i915/gt/intel_reset.c |  16 ++
 drivers/gpu/drm/i915/gt/intel_reset.h |   1 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  30 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 267 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 14 files changed, 427 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(stats->active))
-   total = ktime_add(total, ktime_sub(*now, stats->start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1899,16 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_en

Re: [PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-11 Thread Umesh Nerlige Ramappa

On Mon, Oct 11, 2021 at 12:41:19PM +0100, Tvrtko Ursulin wrote:


On 07/10/2021 23:55, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 238 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 12 files changed, 377 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(stats->active))
-   total = ktime_add(total, ktime_sub(*now, stats->start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1899,16 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {

Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-07 Thread Umesh Nerlige Ramappa

On Tue, Oct 05, 2021 at 04:14:23PM -0700, Matthew Brost wrote:

On Tue, Oct 05, 2021 at 10:47:11AM -0700, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.
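
For illustration, a minimal sketch of the sampling rule above (the record
layout and names are assumptions, not the exact shared-memory format GuC
uses):

struct engine_record {
	u32 total;	/* gt clock ticks accumulated at context switch out */
	u32 id;		/* ~0 when no context is running */
	u32 start;	/* gt clock timestamp of the running context */
};

/* Returns raw 32-bit busyness in gt clock ticks; extending this to the
 * 64-bit value reported by pmu is the job of the worker described below.
 */
static u32 sample_busy_ticks(const struct engine_record *rec, u32 now)
{
	u32 busy = rec->total;

	if (rec->id != ~0u && rec->start)	/* context is active */
		busy += now - rec->start;	/* wrap-safe u32 subtraction */

	return busy;
}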

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.
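
Worked out, a full 32-bit wrap takes 2^32 / 19.2e6 Hz ~= 223.7 s, and an
eighth of that is ~27.9 s, matching the example above. A sketch of the delay
computation (div_u64/msecs_to_jiffies are existing kernel helpers; the
function itself is illustrative):

static unsigned long ping_delay_jiffies(u32 gt_clock_hz)
{
	/* time for a 32-bit gt timestamp to wrap, in milliseconds */
	u64 wrap_ms = div_u64((u64)U32_MAX * 1000, gt_clock_hz);

	/* ping at 1/8th of the wrap period: ~27.9 s at 19.2 MHz */
	return msecs_to_jiffies(div_u64(wrap_ms, 8));
}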

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.
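
The race in the note can be pictured as follows (the update ordering shown
is an assumption for illustration):

/*
 *   GuC:  total += now - start;     start = start';   (new context)
 *   KMD:                ^ samples here: new total, stale start
 *
 * busyness = total' + (now - stale start) counts the last interval
 * twice, so that one sample reads high; subsequent consistent samples
 * report the smaller, correct accumulation until it catches up.
 */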

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  26 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +--
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 12 files changed, 398 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }

-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   ktime_t total = engine->stats.total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
-   unsigned int seq;
-   ktime_t total;

[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-07 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and guc stats objects
- Since disable_submission is called from many places, move resetting
  stats to intel_guc_submission_reset_prepare

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  28 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  33 ++-
 .../drm/i915/gt/intel_execlists_submission.c  |  34 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 238 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 12 files changed, 377 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 38436f4b5706..6b783fdcba2a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,23 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   ktime_t total = stats->total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(stats->active))
-   total = ktime_add(total, ktime_sub(*now, stats->start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1899,16 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
-   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
-   unsigned int seq;
-   ktime_t t

[PATCH 1/2] drm/i915/pmu: Add a name to the execlists stats

2021-10-07 Thread Umesh Nerlige Ramappa
In preparation for GuC pmu stats, add a name to the execlists stats
structure so that it can be differentiated from the GuC stats.

Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 14 +++---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 33 +++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 52 +++-
 3 files changed, 53 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..38436f4b5706 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -361,7 +361,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id)
DRIVER_CAPS(i915)->has_logical_contexts = true;
 
ewma__engine_latency_init(&engine->latency);
-   seqcount_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.execlists.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1876,15 +1876,16 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
ktime_t *now)
 {
-   ktime_t total = engine->stats.total;
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
+   ktime_t total = stats->total;
 
/*
 * If the engine is executing something at the moment
 * add it to the total.
 */
*now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
+   if (READ_ONCE(stats->active))
+   total = ktime_add(total, ktime_sub(*now, stats->start));
 
return total;
 }
@@ -1898,13 +1899,14 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned int seq;
ktime_t total;
 
do {
-   seq = read_seqcount_begin(&engine->stats.lock);
+   seq = read_seqcount_begin(&stats->lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&stats->lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 24fbdd94351a..8e762d683e50 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -15,45 +15,46 @@
 
 static inline void intel_engine_context_in(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   if (engine->stats.active) {
-   engine->stats.active++;
+   if (stats->active) {
+   stats->active++;
return;
}
 
/* The writer is serialised; but the pmu reader may be from hardirq */
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.start = ktime_get();
-   engine->stats.active++;
+   stats->start = ktime_get();
+   stats->active++;
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 
-   GEM_BUG_ON(!engine->stats.active);
+   GEM_BUG_ON(!stats->active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
+   struct intel_engine_execlists_stats *stats = &engine->stats.execlists;
unsigned long flags;
 
-   GEM_BUG_ON(!engine->stats.active);
-   if (engine->stats.active > 1) {
-   engine->stats.active--;
+   GEM_BUG_ON(!stats->active);
+   if (stats->active > 1) {
+   stats->active--;
return;
}
 
local_irq_save(flags);
-   write_seqcount_begin(&engine->stats.lock);
+   write_seqcount_begin(&stats->lock);
 
-   engine->stats.active--;
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
+   stats->active--;
+   stats->total = ktime_add(stats->total,
+ktime_sub(ktime_get(), stats->start));
 
-   write_seqcount_end(&engine->stats.lock);
+   write_seqcount_end(&stats->lock);
local_irq_restore(flags);
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 5ae1207c363b..316d8551d22f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -257,6 +

Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-07 Thread Umesh Nerlige Ramappa

On Thu, Oct 07, 2021 at 09:17:34AM +0100, Tvrtko Ursulin wrote:


On 06/10/2021 21:45, Umesh Nerlige Ramappa wrote:

On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:


[snip]


@@ -762,12 +764,25 @@ submission_disabled(struct intel_guc *guc)
 static void disable_submission(struct intel_guc *guc)
 {
 struct i915_sched_engine * const sched_engine = guc->sched_engine;
+    struct intel_gt *gt = guc_to_gt(guc);
+    struct intel_engine_cs *engine;
+    enum intel_engine_id id;
+    unsigned long flags;
 if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 GEM_BUG_ON(!guc->ct.enabled);
 __tasklet_disable_sync_once(&sched_engine->tasklet);
 sched_engine->tasklet.callback = NULL;
 }
+
+    cancel_delayed_work(&guc->timestamp.work);


I am not sure when disable_submission gets called, so a question - 
could it be important to call cancel_delayed_work_sync here to 
ensure that, if the worker was running, it has exited before proceeding?


disable_submission is called in the reset_prepare path for uc 
resets. I see this happening only with busy-hang test which does a 
global gt reset. The counterpart for this is the 
guc_init_engine_stats which is called post reset in the path to 
initialize GuC.


I tried cancel_delayed_work_sync both here and in park. Seems to 
work fine, so will change the calls to _sync versions.


From park we are not allowed to sleep, so we can't do sync from there. It 
might have been my question which put you on the wrong path, sorry. Now 
I think the question remains: what happens if the ping worker happens to be 
sampling GuC state as GuC is being reset? Do you need some sort of a 
lock to protect that, or make sure the worker skips if a reset is in progress?




If ping ran after the actual gt reset, we should be okay. If it ran 
after we reset prev_total and before gt reset, then we have bad 
busyness. At the same time, skipping the ping risks timestamp overflow. I am 
thinking we skip the ping but update all stats in the reset_prepare path; 
reset_prepare runs with runtime pm held.


On a different note, during reset, we need to store now-start into the 
total_gt_clks also because we may lose that information in the next pmu 
query or ping (post reset). Maybe I will store active_clks instead of 
running in the stats to do that.
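
A sketch of that direction (helper and field names are assumptions; the
real change ends up in intel_guc_submission_reset_prepare):

static void guc_stats_reset_prepare(struct intel_guc *guc)
{
	struct intel_gt *gt = guc_to_gt(guc);
	struct intel_engine_cs *engine;
	enum intel_engine_id id;
	unsigned long flags;

	spin_lock_irqsave(&guc->timestamp.lock, flags);

	/* Fold the in-flight interval (now - start) into the saved
	 * totals before the reset wipes GuC's counters.
	 */
	for_each_engine(engine, gt, id)
		update_engine_stats(engine);	/* hypothetical helper */

	spin_unlock_irqrestore(&guc->timestamp.lock, flags);

	/* The ping is skipped while the reset is in flight. */
	cancel_delayed_work(&guc->timestamp.work);
}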


Thanks,
Umesh



Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-06 Thread Umesh Nerlige Ramappa

On Wed, Oct 06, 2021 at 10:11:58AM +0100, Tvrtko Ursulin wrote:


On 05/10/2021 18:47, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.


Second point has now been addressed, right?


Both were addressed actually. For reset, I was mainly running busy-hang 
and after adding your suggestion of maintaining a consistent view, the 
busy-hang is fixed too.


I will remove them from the commit msg.





Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  26 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +--
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 12 files changed, 398 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   ktime_t total = engine->stats.total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882

Re: [PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-05 Thread Umesh Nerlige Ramappa

On Mon, Oct 04, 2021 at 04:21:44PM +0100, Tvrtko Ursulin wrote:


On 24/09/2021 23:34, Umesh Nerlige Ramappa wrote:

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequqncy = 1/8th the time it takes for the timestamp to wrap. As an


frequency


example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.


2) sounds easy since there are park/unpark hooks for pmu already. Will 
see if I can figure out why you did not just immediately do it.


I posted a new revision now with all these comments for your review.

For (2), something was throwing a warning when I tried this earlier. I 
figured I need to move the initialization of the work and spinlock 
elsewhere.
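
For reference, the wiring ends up looking roughly like this (a sketch;
intel_guc_busyness_park/unpark are assumed names based on the changelog):

static int __gt_park(struct intel_wakeref *wf)
{
	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);

	intel_guc_busyness_park(gt);	/* sample stats, stop the ping worker */

	/* ...existing park steps... */
	return 0;
}

static int __gt_unpark(struct intel_wakeref *wf)
{
	struct intel_gt *gt = container_of(wf, typeof(*gt), wakeref);

	intel_guc_busyness_unpark(gt);	/* restart the ping worker */

	/* ...existing unpark steps... */
	return 0;
}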




I would also document in the commit message the known problem of 
possible over-accounting, just for historical reference.


I added a note. If that's not the issue you are mentioning w.r.t.  
engine busyness, let me know.  





v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  26 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  82 ---
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 204 ++
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 10 files changed, 363 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   ktime_t total = engine->stats.total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
-   unsigned int seq;
-   ktime_t total;
-
-   do {
-   seq = read_seqcount_begin(&engine->stats.lock);
-   total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
-
-   return total;
+   retur

[PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-10-05 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.

Note:
There might be an overaccounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate guc and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of guc state
- Add hooks to gt park/unpark for guc busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to guc initialization
- Drop helpers that are called only once

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  26 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  90 +--
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 drivers/gpu/drm/i915/gt/intel_gt_pm.c |   2 +
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 227 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |   2 +
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 12 files changed, 398 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   ktime_t total = engine->stats.total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
-   unsigned int seq;
-   ktime_t total;
-
-   do {
-   seq = read_seqcount_begin(&engine->stats.lock);
-   total = __intel_engine_get_busy_time(engine, now);

[PATCH] drm/i915/pmu: Connect engine busyness stats from GuC to pmu

2021-09-24 Thread Umesh Nerlige Ramappa
With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Opens and wip that are targeted for later patches:

1) On global gt reset the total busyness of engines resets and i915
   needs to fix that so that user sees monotonically increasing
   busyness.
2) In runtime suspend mode, the worker may not need to be run. We could
   stop the worker on suspend and rerun it on resume provided that the
   guc pm timestamp does not tick during suspend.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at guc level to update engine stats
- Document worker specific details

Signed-off-by: John Harrison 
Signed-off-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  26 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  82 ---
 .../drm/i915/gt/intel_execlists_submission.c  |  32 +++
 .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |   1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc.h|  26 +++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c|  21 ++
 drivers/gpu/drm/i915/gt/uc/intel_guc_ads.h|   5 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  13 ++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 204 ++
 drivers/gpu/drm/i915/i915_reg.h   |   2 +
 10 files changed, 363 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2ae57e4656a3..6fcc70a313d9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1873,22 +1873,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
intel_engine_print_breadcrumbs(engine, m);
 }
 
-static ktime_t __intel_engine_get_busy_time(struct intel_engine_cs *engine,
-   ktime_t *now)
-{
-   ktime_t total = engine->stats.total;
-
-   /*
-* If the engine is executing something at the moment
-* add it to the total.
-*/
-   *now = ktime_get();
-   if (READ_ONCE(engine->stats.active))
-   total = ktime_add(total, ktime_sub(*now, engine->stats.start));
-
-   return total;
-}
-
 /**
  * intel_engine_get_busy_time() - Return current accumulated engine busyness
  * @engine: engine to report on
@@ -1898,15 +1882,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
  */
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t 
*now)
 {
-   unsigned int seq;
-   ktime_t total;
-
-   do {
-   seq = read_seqcount_begin(&engine->stats.lock);
-   total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqcount_retry(&engine->stats.lock, seq));
-
-   return total;
+   return engine->busyness(engine, now);
 }
 
 struct intel_context *
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 5ae1207c363b..490166b54ed6 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -432,6 +432,12 @@ struct intel_engine_cs {
void(*add_active_request)(struct i915_request *rq);
void(*remove_active_request)(struct i915_request *rq);
 
+   /*
+* Get engine busyness and the time at which the busyness was sampled.
+*/
+   ktime_t (*busyness)(struct intel_engine_cs *engine,
+   ktime_t *n

Re: [PATCH 8/8] drm/i915/perf: Map OA buffer to user space for gen12 performance query

2021-08-31 Thread Umesh Nerlige Ramappa

On Tue, Aug 31, 2021 at 02:55:37PM +0200, Daniel Vetter wrote:

On Mon, Aug 30, 2021 at 12:38:51PM -0700, Umesh Nerlige Ramappa wrote:

i915 used to support time based sampling mode which is good for overall
system monitoring, but is not enough for query mode used to measure a
single draw call or dispatch. Gen9-Gen11 are using current i915 perf
implementation for query, but Gen12+ requires a new approach for query
based on triggered reports within oa buffer.

Triggering reports into the OA buffer is achieved by writing into a
trigger register. Optionally an unused counter/register is set with a
marker value such that a triggered report can be identified in the OA
buffer. Reports are usually triggered at the start and end of work that
is measured.

Since OA buffer is large and queries can be frequent, an efficient way
to look for triggered reports is required. By knowing the current head
and tail offsets into the OA buffer, it is easier to determine the
locality of the reports of interest.

Current perf OA interface does not expose head/tail information to the
user and it filters out invalid reports before sending data to user.
Also, considering the limited size of the user buffer used during a query,
creating a 1:1 copy of the OA buffer in user space added undesired
complexity.

The solution was to map the OA buffer to user space provided

(1) that it is accessed by a privileged user.
(2) OA report filtering is not used.

These 2 conditions would satisfy the safety criteria that the current
perf interface addresses.


This is a perf improvement. Please include benchmark numbers to justify
it.


OA supports 2 mechanisms of perf measurements.

1) query interface where perf counters can be queried.
2) OA buffer use case where counter-snapshots are captured periodically 
and analyzed for performance.


This patch series is actually just (1) query interface implementation 
for discrete and not a perf improvement.


The old mechanism to query OA report (MI_REPORT_PERF_COUNT) is not 
available for all engines. In the new mechanism, a query is triggered 
from a batch by writing to a whitelisted OA trigger register. Once a 
query is triggered, the resulting report is captured in the OA buffer.  
To locate the report quickly, the batch also captures the OA HW tail 
pointer before/after writing to the trigger register. This gives the 
user a window/locality in the OA buffer where the report of interest 
lies.  
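
Schematically, the batch for one query looks like this (command mnemonics
and register names are illustrative, not verified encodings):

/*
 *   SRM  OA_TAIL_PTR -> bo + PRE_TAIL     ; sample HW tail before trigger
 *   LRI  OAREPORTTRIG2, <trigger value>   ; dump a report into OA buffer
 *   SRM  OA_TAIL_PTR -> bo + POST_TAIL    ; sample HW tail after trigger
 *
 * The triggered report then lies in the [PRE_TAIL, POST_TAIL] window of
 * the OA buffer, so only that window needs to be scanned.
 */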

For this new mechanism, the current interface to read reports from the 
OA buffer is inefficient since the reads are sequential and reports are 
copied to user buffer.


mmap provides an accurate and faster way to fetch the queried reports 
based on locality.


Note that mmap does not replace the OA buffer use case from (2) which 
still reads reports sequentially to analyze performance.






To enable the query:
- Add an ioctl to expose head and tail to the user
- Add an ioctl to return size and offset of the OA buffer
- Map the OA buffer to the user space
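
From a privileged client the flow would be roughly as below (struct and
ioctl names here are placeholders; the actual uapi additions are in the
include/uapi/drm/i915_drm.h hunk of this patch):

struct perf_oa_buffer_info info = { 0 };	/* placeholder name */

/* query size and mmap offset of the OA buffer */
if (ioctl(perf_fd, PERF_IOCTL_GET_OA_BUFFER_INFO, &info))	/* placeholder */
	err(1, "oa buffer info");

/* MAP_PRIVATE is enforced (VM_MAYSHARE check); the mapping is not
 * inherited across fork (VM_DONTCOPY), per v5 below.
 */
void *oa_buf = mmap(NULL, info.size, PROT_READ, MAP_PRIVATE,
		    perf_fd, info.offset);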

v2:
- Improve commit message (Chris)
- Do not mmap based on gem object filp. Instead, use perf_fd and support
  mmap syscall (Chris)
- Pass non-zero offset in mmap to enforce the right object is
  mapped (Chris)
- Do not expose gpu_address (Chris)
- Verify start and length of vma for page alignment (Lionel)
- Move SQNTL config out (Lionel)

v3: (Chris)
- Omit redundant checks
- Return VM_FAULT_SIGBUS if old stream is closed
- Maintain reference counts to stream in vm_open and vm_close
- Use switch to identify object to be mapped

v4: Call kref_put on closing perf fd (Chris)
v5:
- Strip access to OA buffer from unprivileged child of a privileged
  parent. Use VM_DONTCOPY
- Enforce MAP_PRIVATE by checking for VM_MAYSHARE

v6:
(Chris)
- Use len of -1 in unmap_mapping_range
- Don't use stream->oa_buffer.vma->obj in vm_fault_oa
- Use kernel block comment style
- do_mmap gets a reference to the file and puts it in do_munmap, so
  no need to maintain a reference to i915_perf_stream. Hence, remove
  vm_open/vm_close and stream->closed hooks/checks.
(Umesh)
- Do not allow mmap if SAMPLE_OA_REPORT is not set during
  i915_perf_open_ioctl.
- Drop ioctl returning head/tail since this information is already
  whitelisted. Remove hooks to read head register.

v7: (Chris)
- unmap before destroy
- change ioctl argument struct

v8: Documentation and more checks (Chris)
v9: Fix comment style (Umesh)
v10: Update uapi comment (Ashutosh)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.h |   2 +
 drivers/gpu/drm/i915/i915_perf.c | 126 ++-
 include/uapi/drm/i915_drm.h  |  33 ++
 4 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 5130e8ed9564..84cdce2ee447 100644
--- a/driver

[PATCH 8/8] drm/i915/perf: Map OA buffer to user space for gen12 performance query

2021-08-30 Thread Umesh Nerlige Ramappa
i915 used to support time based sampling mode which is good for overall
system monitoring, but is not enough for query mode used to measure a
single draw call or dispatch. Gen9-Gen11 are using current i915 perf
implementation for query, but Gen12+ requires a new approach for query
based on triggered reports within oa buffer.

Triggering reports into the OA buffer is achieved by writing into a
trigger register. Optionally an unused counter/register is set with a
marker value such that a triggered report can be identified in the OA
buffer. Reports are usually triggered at the start and end of work that
is measured.

Since OA buffer is large and queries can be frequent, an efficient way
to look for triggered reports is required. By knowing the current head
and tail offsets into the OA buffer, it is easier to determine the
locality of the reports of interest.

Current perf OA interface does not expose head/tail information to the
user and it filters out invalid reports before sending data to user.
Also, considering the limited size of the user buffer used during a query,
creating a 1:1 copy of the OA buffer in user space added undesired
complexity.

The solution was to map the OA buffer to user space provided

(1) that it is accessed by a privileged user.
(2) OA report filtering is not used.

These 2 conditions would satisfy the safety criteria that the current
perf interface addresses.

To enable the query:
- Add an ioctl to expose head and tail to the user
- Add an ioctl to return size and offset of the OA buffer
- Map the OA buffer to the user space

v2:
- Improve commit message (Chris)
- Do not mmap based on gem object filp. Instead, use perf_fd and support
  mmap syscall (Chris)
- Pass non-zero offset in mmap to enforce the right object is
  mapped (Chris)
- Do not expose gpu_address (Chris)
- Verify start and length of vma for page alignment (Lionel)
- Move SQNTL config out (Lionel)

v3: (Chris)
- Omit redundant checks
- Return VM_FAULT_SIGBUS if old stream is closed
- Maintain reference counts to stream in vm_open and vm_close
- Use switch to identify object to be mapped

v4: Call kref_put on closing perf fd (Chris)
v5:
- Strip access to OA buffer from unprivileged child of a privileged
  parent. Use VM_DONTCOPY
- Enforce MAP_PRIVATE by checking for VM_MAYSHARE

v6:
(Chris)
- Use len of -1 in unmap_mapping_range
- Don't use stream->oa_buffer.vma->obj in vm_fault_oa
- Use kernel block comment style
- do_mmap gets a reference to the file and puts it in do_munmap, so
  no need to maintain a reference to i915_perf_stream. Hence, remove
  vm_open/vm_close and stream->closed hooks/checks.
(Umesh)
- Do not allow mmap if SAMPLE_OA_REPORT is not set during
  i915_perf_open_ioctl.
- Drop ioctl returning head/tail since this information is already
  whitelisted. Remove hooks to read head register.

v7: (Chris)
- unmap before destroy
- change ioctl argument struct

v8: Documentation and more checks (Chris)
v9: Fix comment style (Umesh)
v10: Update uapi comment (Ashutosh)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.h |   2 +
 drivers/gpu/drm/i915/i915_perf.c | 126 ++-
 include/uapi/drm/i915_drm.h  |  33 ++
 4 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 5130e8ed9564..84cdce2ee447 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -213,7 +213,7 @@ compute_partial_view(const struct drm_i915_gem_object *obj,
return view;
 }
 
-static vm_fault_t i915_error_to_vmf_fault(int err)
+vm_fault_t i915_error_to_vmf_fault(int err)
 {
switch (err) {
default:
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.h 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.h
index efee9e0d2508..1190a3a228ea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.h
@@ -29,4 +29,6 @@ void i915_gem_object_release_mmap_gtt(struct 
drm_i915_gem_object *obj);
 
 void i915_gem_object_release_mmap_offset(struct drm_i915_gem_object *obj);
 
+vm_fault_t i915_error_to_vmf_fault(int err);
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index de3d1738aabe..1f8d4f3a2148 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -192,10 +192,12 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 
 #include "gem/i915_gem_context.h"
+#include "gem/i915_gem_mman.h"
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_user.h"
 #include "gt/intel_execlists_submission.h"
@@ -3322,6 +3324,44 @@ static long i915_perf_config_locked(struct 
i915_perf_stream *stream,
return ret;
 }
 
+#define I915_PERF_OA_BUFF

[PATCH 5/8] drm/i915/perf: Ensure observation logic is not clock gated

2021-08-30 Thread Umesh Nerlige Ramappa
From: Piotr Maciejewski 

A clock gating switch can control whether the performance monitoring and
observation logic is enabled or not. Ensure that we enable the clocks.

v2: Separate code from other patches (Lionel)
v3: Reset PMON enable when disabling perf to save power (Lionel)
v4: Use intel_uncore_rmw and REG_BIT (Chris)

Fixes: 00a7f0d7155c ("drm/i915/tgl: Add perf support on TGL")
Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c | 9 +
 drivers/gpu/drm/i915/i915_reg.h  | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 2f01b8c0284c..3ded6e7d8526 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2553,6 +2553,12 @@ gen12_enable_metric_set(struct i915_perf_stream *stream,
(period_exponent << 
GEN12_OAG_OAGLBCTXCTRL_TIMER_PERIOD_SHIFT))
: 0);
 
+   /*
+* Initialize Super Queue Internal Cnt Register
+* Set PMON Enable in order to collect valid metrics.
+*/
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, 0, GEN12_SQCNT1_PMON_ENABLE);
+
/*
 * Update all contexts prior writing the mux configurations as we need
 * to make sure all slices/subslices are ON before writing to NOA
@@ -2612,6 +2618,9 @@ static void gen12_disable_metric_set(struct 
i915_perf_stream *stream)
 
/* Make sure we disable noa to save power. */
intel_uncore_rmw(uncore, RPM_CONFIG1, GEN10_GT_NOA_ENABLE, 0);
+
+   /* Reset PMON Enable to save power. */
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, GEN12_SQCNT1_PMON_ENABLE, 0);
 }
 
 static void gen7_oa_enable(struct i915_perf_stream *stream)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 40943dd5e0db..77ece19bda7e 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -718,6 +718,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define OABUFFER_SIZE_16M   (7 << 3)
 
 #define GEN12_OA_TLB_INV_CR _MMIO(0xceec)
+#define GEN12_SQCNT1 _MMIO(0x8718)
+#define  GEN12_SQCNT1_PMON_ENABLE REG_BIT(30)
 
 /* Gen12 OAR unit */
 #define GEN12_OAR_OACONTROL _MMIO(0x2960)
-- 
2.20.1



[PATCH 6/8] drm/i915/perf: Whitelist OA report trigger registers

2021-08-30 Thread Umesh Nerlige Ramappa
OA reports can be triggered into the OA buffer by writing into the
OAREPORTTRIG registers. Whitelist the registers to allow a non-privileged
user to trigger reports.

Whitelist registers only if perf_stream_paranoid is set to 0. In
i915_perf_open_ioctl, this setting is checked and the whitelist is
enabled accordingly. On closing the perf fd, the whitelist is removed.

This ensures that the access to the whitelist is gated by
perf_stream_paranoid.

v2:
- Move related change to this patch (Lionel)
- Bump up perf revision (Lionel)

v3: Pardon whitelisted registers for selftest (Umesh)
v4: Document supported gens for the feature (Lionel)
v5: Whitelist registers only if perf_stream_paranoid is set to 0 (Jon)
v6: Move oa whitelist array to i915_perf (Chris)
v7: Fix OA writing beyond the wal->list memory (CI)
v8: Protect write to engine whitelist registers

v9: (Umesh)
- Use uncore->lock to protect write to forcepriv regs
- In case wal->count falls to zero on _wa_remove, make sure you still
  clear the registers. Remove wal->count check when applying whitelist.

v10: (Umesh)
- Split patches modifying intel_workarounds
- On some platforms there are no whitelisted regs. intel_engine_resume
  applies whitelist on these platforms too and the wal->count gates such
  platforms. Bring back the wal->count check.
- intel_engine_allow/deny_user_register_access modifies the engine
  whitelist and the wal->count. Use uncore->lock to serialize it with
  intel_engine_apply_whitelist.
- Grow the wal->list when adding whitelist registers after driver load.

v11:
- Fix memory leak in _wa_list_grow (Chris)
- Serialize kfree with engine resume using uncore->lock (Umesh)
- Grow the list only if wal->count is not aligned (Umesh)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c   | 79 +-
 drivers/gpu/drm/i915/i915_perf_types.h |  8 +++
 drivers/gpu/drm/i915/i915_reg.h| 12 ++--
 3 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3ded6e7d8526..30f5025b2ff6 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1364,12 +1364,56 @@ free_noa_wait(struct i915_perf_stream *stream)
i915_vma_unpin_and_release(&stream->noa_wait, 0);
 }
 
+static const i915_reg_t gen9_oa_wl_regs[] = {
+   { __OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+};
+
+static const i915_reg_t gen12_oa_wl_regs[] = {
+   { __GEN12_OAG_OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __GEN12_OAG_OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+};
+
+static int intel_engine_apply_oa_whitelist(struct i915_perf_stream *stream)
+{
+   struct intel_engine_cs *engine = stream->engine;
+   int ret;
+
+   if (i915_perf_stream_paranoid ||
+   !(stream->sample_flags & SAMPLE_OA_REPORT) ||
+   !stream->perf->oa_wl)
+   return 0;
+
+   ret = intel_engine_allow_user_register_access(engine,
+ stream->perf->oa_wl,
+ stream->perf->num_oa_wl);
+   if (ret < 0)
+   return ret;
+
+   stream->oa_whitelisted = true;
+   return 0;
+}
+
+static void intel_engine_remove_oa_whitelist(struct i915_perf_stream *stream)
+{
+   struct intel_engine_cs *engine = stream->engine;
+
+   if (!stream->oa_whitelisted)
+   return;
+
+   intel_engine_deny_user_register_access(engine,
+  stream->perf->oa_wl,
+  stream->perf->num_oa_wl);
+}
+
 static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 {
struct i915_perf *perf = stream->perf;
 
BUG_ON(stream != perf->exclusive_stream);
 
+   intel_engine_remove_oa_whitelist(stream);
+
/*
 * Unset exclusive_stream first, it will be checked while disabling
 * the metric set on gen8+.
@@ -1465,7 +1509,8 @@ static void gen8_init_oa_buffer(struct i915_perf_stream 
*stream)
 *  bit."
 */
intel_uncore_write(uncore, GEN8_OABUFFER, gtt_offset |
-  OABUFFER_SIZE_16M | GEN8_OABUFFER_MEM_SELECT_GGTT);
+  OABUFFER_SIZE_16M | GEN8_OABUFFER_MEM_SELECT_GGTT |
+  GEN7_OABUFFER_EDGE_TRIGGER);
intel_uncore_write(uncore, GEN8_OATAILPTR, gtt_offset & 
GEN8_OATAILPTR_MASK);
 
/* Mark that we need updated tail pointers to read from... */
@@ -1518,7 +1563,8 @@ static void gen12_init_oa_buffer(struct i915_perf_stream 
*stream)
 *  bit."
 */
intel_uncore_write(uncore, G

[PATCH 4/8] drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

2021-08-30 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

The OA subsystem would like to give its privileged clients access to
the OA registers from execbuf. This requires temporarily removing the
HW validation from those registers for the duration of the OA client,
for which we need to allow OA to dynamically adjust the set of
RING_NONPRIV.

Care must still be taken since the RING_NONPRIV are global, so any and
all contexts that run at the same time as the OA client will also be
able to adjust the registers from their execbuf.
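
Usage, as a sketch with the helpers this patch adds (the oa_regs array is
hypothetical):

/* on OA client open: drop HW validation for the OA registers */
err = intel_engine_allow_user_register_access(engine, oa_regs,
					      ARRAY_SIZE(oa_regs));
if (err)
	return err;

/* ... OA client runs; all concurrently running contexts can also
 * touch these registers while the whitelist is in place ... */

/* on OA client close: restore HW validation */
intel_engine_deny_user_register_access(engine, oa_regs,
				       ARRAY_SIZE(oa_regs));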

v2: Fix memmove size (Umesh)
v3: Update selftest (Umesh)
- Use ppgtt for results
- Use ww locking
- Prevent rc6. Whitelist configuration is saved/restored on rc6, so
  applying whitelist configuration with rc6 enabled leads to a race
  where the pwr ctx restored configuration conflicts with the most
  recently applied config in the selftest.

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c   |  59 
 drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
 .../gpu/drm/i915/gt/selftest_workarounds.c| 267 ++
 3 files changed, 333 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index df452a718200..c1ec09162e66 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -200,6 +200,18 @@ static void _wa_add(struct i915_wa_list *wal, const struct 
i915_wa *wa)
__wa_add(wal, wa);
 }
 
+static void _wa_del(struct i915_wa_list *wal, i915_reg_t reg)
+{
+   struct i915_wa *wa = wal->list;
+   int index;
+
+   index = wa_index(wal, reg);
+   if (GEM_DEBUG_WARN_ON(index < 0))
+   return;
+
+   memmove(wa + index, wa + index + 1, (--wal->count - index) * 
sizeof(*wa));
+}
+
 static void wa_add(struct i915_wa_list *wal, i915_reg_t reg,
   u32 clear, u32 set, u32 read_mask, bool masked_reg)
 {
@@ -2152,6 +2164,53 @@ void intel_engine_init_workarounds(struct 
intel_engine_cs *engine)
wa_init_finish(wal);
 }
 
+int intel_engine_allow_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count)
+{
+   struct i915_wa_list *wal = &engine->whitelist;
+   unsigned long flags;
+   int err;
+
+   if (GEM_DEBUG_WARN_ON(wal->count + count >= RING_MAX_NONPRIV_SLOTS))
+   return -ENOSPC;
+
+   spin_lock_irqsave(&engine->uncore->lock, flags);
+
+   err = wa_list_grow(wal, wal->count + count, GFP_ATOMIC | __GFP_NOWARN);
+   if (err)
+   goto out;
+
+   while (count--) {
+   struct i915_wa wa = { .reg = *reg++ };
+
+   __wa_add(wal, &wa);
+   }
+
+   __engine_apply_whitelist(engine);
+
+out:
+   spin_unlock_irqrestore(&engine->uncore->lock, flags);
+   return err;
+}
+
+void intel_engine_deny_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count)
+{
+   struct i915_wa_list *wal = &engine->whitelist;
+   unsigned long flags;
+
+   spin_lock_irqsave(&engine->uncore->lock, flags);
+
+   while (count--)
+   _wa_del(wal, *reg++);
+
+   __engine_apply_whitelist(engine);
+
+   spin_unlock_irqrestore(&engine->uncore->lock, flags);
+}
+
 void intel_engine_apply_workarounds(struct intel_engine_cs *engine)
 {
wa_list_apply(engine->gt, &engine->wa_list);
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.h 
b/drivers/gpu/drm/i915/gt/intel_workarounds.h
index 15abb68b6c00..3c50390e3a7f 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.h
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.h
@@ -36,4 +36,11 @@ void intel_engine_apply_workarounds(struct intel_engine_cs 
*engine);
 int intel_engine_verify_workarounds(struct intel_engine_cs *engine,
const char *from);
 
+int intel_engine_allow_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count);
+void intel_engine_deny_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/selftest_workarounds.c 
b/drivers/gpu/drm/i915/gt/selftest_workarounds.c
index e623ac45f4aa..ce91fad9075f 100644
--- a/drivers/gpu/drm/i915/gt/selftest_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/selftest_workarounds.c
@@ -1177,6 +1177,272 @@ static int live_isolated_whitelist(void *arg)
return err;
 }
 
+static int rmw_reg(struct intel_engine_cs *engine, const i915_reg_t reg)
+{
+   const u32 values[] = {
+   0x,

[PATCH 7/8] drm/i915/perf: Whitelist OA counter and buffer registers

2021-08-30 Thread Umesh Nerlige Ramappa
It is useful to have markers in the OA reports to identify triggered
reports. Whitelist some OA counters that can be used as markers.

A triggered report can be found faster if we can sample the HW tail and
head registers when the report was triggered. Whitelist OA buffer
specific registers.
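
As an illustration only, a batch that wants a findable report might write a
marker into one of these counters and then fire a trigger, roughly as in the
sketch below. The MI_LOAD_REGISTER_IMM encoding is the usual one and the A19
offset follows the __GEN12_OAG_PERF_COUNTER_A() macro added here, but the
trigger register offset and trigger bit are assumptions, not values taken
from this patch:

#include <stdint.h>

#define MI_LOAD_REGISTER_IMM(n) ((0x22 << 23) | (2 * (n) - 1))
#define OAG_COUNTER_A19         (0xd980 + 8 * 19) /* __GEN12_OAG_PERF_COUNTER_A(19) */
#define OAG_OAREPORTTRIG2       0xd924            /* assumed register offset */
#define OAG_TRIGGER             (1u << 21)        /* assumed trigger bit */

/* Emit commands that tag and capture an OA report from a user batch. */
static uint32_t *emit_marked_oa_trigger(uint32_t *cs, uint32_t marker)
{
        /* write the marker into whitelisted counter A19 */
        *cs++ = MI_LOAD_REGISTER_IMM(1);
        *cs++ = OAG_COUNTER_A19;
        *cs++ = marker;

        /* poke the (also whitelisted) trigger register to capture a report */
        *cs++ = MI_LOAD_REGISTER_IMM(1);
        *cs++ = OAG_OAREPORTTRIG2;
        *cs++ = OAG_TRIGGER;

        return cs;
}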

v2:
- Bump up the perf revision (Lionel)
- Use indexing for counters (Lionel)
- Fix selftest for oa ticking register (Umesh)

v3: Pardon whitelisted registers for selftest (Umesh)

v4:
- Document whitelisted registers (Lionel)
- Fix live isolated whitelist for OA regs (Umesh)

v5:
- Free up whitelist slots. Remove GPU_TICKS and A20 counter (Piotr)
- Whitelist registers only if perf_stream_paranoid is set to 0 (Jon)

v6: Move oa whitelist array to i915_perf (Chris)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c | 18 +-
 drivers/gpu/drm/i915/i915_reg.h  | 16 ++--
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 30f5025b2ff6..de3d1738aabe 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1367,11 +1367,19 @@ free_noa_wait(struct i915_perf_stream *stream)
 static const i915_reg_t gen9_oa_wl_regs[] = {
{ __OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
{ __OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __OA_PERF_COUNTER_A(18) | (RING_FORCE_TO_NONPRIV_ACCESS_RW |
+RING_FORCE_TO_NONPRIV_RANGE_4) },
+   { __GEN8_OASTATUS | (RING_FORCE_TO_NONPRIV_ACCESS_RD |
+RING_FORCE_TO_NONPRIV_RANGE_4) },
 };
 
 static const i915_reg_t gen12_oa_wl_regs[] = {
{ __GEN12_OAG_OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
{ __GEN12_OAG_OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __GEN12_OAG_PERF_COUNTER_A(18) | (RING_FORCE_TO_NONPRIV_ACCESS_RW |
+   RING_FORCE_TO_NONPRIV_RANGE_4) },
+   { __GEN12_OAG_OASTATUS | (RING_FORCE_TO_NONPRIV_ACCESS_RD |
+ RING_FORCE_TO_NONPRIV_RANGE_4) },
 };
 
 static int intel_engine_apply_oa_whitelist(struct i915_perf_stream *stream)
@@ -4623,8 +4631,16 @@ int i915_perf_ioctl_version(void)
 *into the OA buffer. This applies only to gen8+. The feature can
 *only be accessed if perf_stream_paranoid is set to 0 by privileged
 *user.
+*
+* 7: Whitelist below OA registers for user to identify the location of
+*triggered reports in the OA buffer. This applies only to gen8+.
+*The feature can only be accessed if perf_stream_paranoid is set to
+*0 by privileged user.
+*
+*- OA buffer head/tail/status/buffer registers for read only
+*- OA counters A18, A19, A20 for read/write
 */
-   return 6;
+   return 7;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index c5c6adbe5b6f..b4c6bfc33a18 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -695,7 +695,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define  GEN7_OASTATUS2_HEAD_MASK   0xffc0
 #define  GEN7_OASTATUS2_MEM_SELECT_GGTT (1 << 0) /* 0: PPGTT, 1: GGTT */
 
-#define GEN8_OASTATUS _MMIO(0x2b08)
+#define __GEN8_OASTATUS 0x2b08
+#define GEN8_OASTATUS _MMIO(__GEN8_OASTATUS)
 #define  GEN8_OASTATUS_TAIL_POINTER_WRAP(1 << 17)
 #define  GEN8_OASTATUS_HEAD_POINTER_WRAP(1 << 16)
 #define  GEN8_OASTATUS_OVERRUN_STATUS  (1 << 3)
@@ -755,7 +756,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define  GEN12_OAG_OA_DEBUG_DISABLE_GO_1_0_REPORTS (1 << 2)
 #define  GEN12_OAG_OA_DEBUG_DISABLE_CTX_SWITCH_REPORTS (1 << 1)
 
-#define GEN12_OAG_OASTATUS _MMIO(0xdafc)
+#define __GEN12_OAG_OASTATUS 0xdafc
+#define GEN12_OAG_OASTATUS _MMIO(__GEN12_OAG_OASTATUS)
 #define  GEN12_OAG_OASTATUS_COUNTER_OVERFLOW (1 << 2)
 #define  GEN12_OAG_OASTATUS_BUFFER_OVERFLOW  (1 << 1)
 #define  GEN12_OAG_OASTATUS_REPORT_LOST  (1 << 0)
@@ -998,6 +1000,16 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define OAREPORTTRIG8_NOA_SELECT_6_SHIFT24
 #define OAREPORTTRIG8_NOA_SELECT_7_SHIFT28
 
+/* Performance counters registers */
+#define __OA_PERF_COUNTER_A(idx)   (0x2800 + 8 * (idx))
+#define OA_PERF_COUNTER_A(idx) _MMIO(__OA_PERF_COUNTER_A(idx))
+#define OA_PERF_COUNTER_A_UDW(idx) _MMIO(__OA_PERF_COUNTER_A(idx) + 4)
+
+/* Gen12 Performance counters registers */
+#define __GEN12_OAG_PERF_COUNTER_A(idx)(0xD980 + 8 * (idx))
+#define GEN12_OAG_PERF_COUNTER_A(idx)  _MMIO(__GEN12_OAG_PERF_COUNTER_A(idx))
+#define GEN12_OAG_PERF_COUNTER_A_UDW(idx) 

[PATCH 3/8] drm/i915/gt: Check for conflicting RING_NONPRIV

2021-08-30 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

Strip the encoded bits from the register offset so that we only use the
address for looking up the RING_NONPRIV entry.
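
In other words (a sketch using the helpers this patch introduces), two
entries for the same register must compare equal even when their access
flags differ, so duplicate detection keys on the address alone:

/* Sketch: compare whitelist entries by address only, ignoring flags such
 * as RING_FORCE_TO_NONPRIV_ACCESS_RW that are encoded into the offset.
 */
static bool wa_same_register(const struct i915_wa *a, const struct i915_wa *b)
{
        return reg_offset(a->reg) == reg_offset(b->reg);
}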

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 66 +
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 6928f250cafe..df452a718200 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -80,18 +80,44 @@ static void wa_init_finish(struct i915_wa_list *wal)
 wal->wa_count, wal->name, wal->engine_name);
 }
 
+static u32 reg_offset(i915_reg_t reg)
+{
+   return i915_mmio_reg_offset(reg) & RING_FORCE_TO_NONPRIV_ADDRESS_MASK;
+}
+
+static u32 reg_flags(i915_reg_t reg)
+{
+   return i915_mmio_reg_offset(reg) & ~RING_FORCE_TO_NONPRIV_ADDRESS_MASK;
+}
+
+__maybe_unused
+static bool is_nonpriv_flags_valid(u32 flags)
+{
+   /* Check only valid flag bits are set */
+   if (flags & ~RING_FORCE_TO_NONPRIV_MASK_VALID)
+   return false;
+
+   /* NB: Only 3 out of 4 enum values are valid for access field */
+   if ((flags & RING_FORCE_TO_NONPRIV_ACCESS_MASK) ==
+   RING_FORCE_TO_NONPRIV_ACCESS_INVALID)
+   return false;
+
+   return true;
+}
+
 static int wa_index(struct i915_wa_list *wal, i915_reg_t reg)
 {
-   unsigned int addr = i915_mmio_reg_offset(reg);
int start = 0, end = wal->count;
+   u32 addr = reg_offset(reg);
 
/* addr and wal->list[].reg, both include the R/W flags */
while (start < end) {
unsigned int mid = start + (end - start) / 2;
+   u32 pos = reg_offset(wal->list[mid].reg);
 
-   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr)
+   if (pos < addr)
start = mid + 1;
-   else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr)
+   else if (pos > addr)
end = mid;
else
return mid;
@@ -117,13 +143,22 @@ static void __wa_add(struct i915_wa_list *wal, const 
struct i915_wa *wa)
struct i915_wa *wa_;
int index;
 
+   GEM_BUG_ON(!is_nonpriv_flags_valid(reg_flags(wa->reg)));
+
index = wa_index(wal, wa->reg);
if (index >= 0) {
wa_ = >list[index];
 
+   if (i915_mmio_reg_offset(wa->reg) !=
+   i915_mmio_reg_offset(wa_->reg)) {
+   DRM_ERROR("Discarding incompatible w/a for reg %04x\n",
+ reg_offset(wa->reg));
+   return;
+   }
+
if ((wa->clr | wa_->clr) && !(wa->clr & ~wa_->clr)) {
DRM_ERROR("Discarding overwritten w/a for reg %04x 
(clear: %08x, set: %08x)\n",
- i915_mmio_reg_offset(wa_->reg),
+ reg_offset(wa_->reg),
  wa_->clr, wa_->set);
 
wa_->set &= ~wa->clr;
@@ -141,10 +176,8 @@ static void __wa_add(struct i915_wa_list *wal, const 
struct i915_wa *wa)
*wa_ = *wa;
 
while (wa_-- > wal->list) {
-   GEM_BUG_ON(i915_mmio_reg_offset(wa_[0].reg) ==
-  i915_mmio_reg_offset(wa_[1].reg));
-   if (i915_mmio_reg_offset(wa_[1].reg) >
-   i915_mmio_reg_offset(wa_[0].reg))
+   GEM_BUG_ON(reg_offset(wa_[0].reg) == reg_offset(wa_[1].reg));
+   if (reg_offset(wa_[1].reg) > reg_offset(wa_[0].reg))
break;
 
swap(wa_[1], wa_[0]);
@@ -160,7 +193,7 @@ static void _wa_add(struct i915_wa_list *wal, const struct 
i915_wa *wa)
if (IS_ALIGNED(wal->count, grow) && /* Either uninitialized or full. */
wa_list_grow(wal, ALIGN(wal->count + 1, grow), GFP_KERNEL)) {
DRM_ERROR("Unable to store w/a for reg %04x\n",
- i915_mmio_reg_offset(wa->reg));
+ reg_offset(wa->reg));
return;
}
 
@@ -1367,21 +1400,6 @@ bool intel_gt_verify_workarounds(struct intel_gt *gt, 
const char *from)
return wa_list_verify(gt, >i915->gt_wa_list, from);
 }
 
-__maybe_unused
-static bool is_nonpriv_flags_valid(u32 flags)
-{
-   /* Check only valid flag bits are set */
-   if (flags & ~RING_FORCE_TO_NONPRIV_MASK_VALID)
-   return false;
-
-   /* NB: Only 3 out of 4 enum values are valid for access field */
-   if ((flags & RING_FORCE_TO_NONPRIV_ACCESS_MASK) ==
-   RING_FORCE_TO_NONPRIV_ACCESS_INVALID)
-   return false;
-
-   return true;
-}
-
 static void
 whitelist_reg_ext(struct i915_wa_list *wal, i915_reg_t reg, u32 flags)
 {
-- 
2.20.1



[PATCH 2/8] drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow

2021-08-30 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

Switch the search and grow code of the _wa_add to use _wa_index and
_wa_list_grow.

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 124 +++-
 1 file changed, 71 insertions(+), 53 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 2a8cc0e2d1b1..6928f250cafe 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -60,20 +60,19 @@ static void wa_init_start(struct i915_wa_list *wal, const 
char *name, const char
 
 #define WA_LIST_CHUNK (1 << 4)
 
-static void wa_init_finish(struct i915_wa_list *wal)
+static void wa_trim(struct i915_wa_list *wal, gfp_t gfp)
 {
+   struct i915_wa *list;
+
/* Trim unused entries. */
-   if (!IS_ALIGNED(wal->count, WA_LIST_CHUNK)) {
-   struct i915_wa *list = kmemdup(wal->list,
-  wal->count * sizeof(*list),
-  GFP_KERNEL);
-
-   if (list) {
-   kfree(wal->list);
-   wal->list = list;
-   }
-   }
+   list = krealloc(wal->list, wal->count * sizeof(*list), gfp);
+   if (list)
+   wal->list = list;
+}
 
+static void wa_init_finish(struct i915_wa_list *wal)
+{
+   wa_trim(wal, GFP_KERNEL);
if (!wal->count)
return;
 
@@ -81,57 +80,60 @@ static void wa_init_finish(struct i915_wa_list *wal)
 wal->wa_count, wal->name, wal->engine_name);
 }
 
-static void _wa_add(struct i915_wa_list *wal, const struct i915_wa *wa)
+static int wa_index(struct i915_wa_list *wal, i915_reg_t reg)
 {
-   unsigned int addr = i915_mmio_reg_offset(wa->reg);
-   unsigned int start = 0, end = wal->count;
-   const unsigned int grow = WA_LIST_CHUNK;
-   struct i915_wa *wa_;
+   unsigned int addr = i915_mmio_reg_offset(reg);
+   int start = 0, end = wal->count;
 
-   GEM_BUG_ON(!is_power_of_2(grow));
+   /* addr and wal->list[].reg, both include the R/W flags */
+   while (start < end) {
+   unsigned int mid = start + (end - start) / 2;
 
-   if (IS_ALIGNED(wal->count, grow)) { /* Either uninitialized or full. */
-   struct i915_wa *list;
+   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr)
+   start = mid + 1;
+   else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr)
+   end = mid;
+   else
+   return mid;
+   }
 
-   list = kmalloc_array(ALIGN(wal->count + 1, grow), sizeof(*wa),
-GFP_KERNEL);
-   if (!list) {
-   DRM_ERROR("No space for workaround init!\n");
-   return;
-   }
+   return -ENOENT;
+}
 
-   if (wal->list) {
-   memcpy(list, wal->list, sizeof(*wa) * wal->count);
-   kfree(wal->list);
-   }
+static int wa_list_grow(struct i915_wa_list *wal, size_t count, gfp_t gfp)
+{
+   struct i915_wa *list;
 
-   wal->list = list;
-   }
+   list = krealloc(wal->list, count * sizeof(*list), gfp);
+   if (!list)
+   return -ENOMEM;
 
-   while (start < end) {
-   unsigned int mid = start + (end - start) / 2;
+   wal->list = list;
+   return 0;
+}
 
-   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr) {
-   start = mid + 1;
-   } else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr) {
-   end = mid;
-   } else {
-   wa_ = >list[mid];
-
-   if ((wa->clr | wa_->clr) && !(wa->clr & ~wa_->clr)) {
-   DRM_ERROR("Discarding overwritten w/a for reg 
%04x (clear: %08x, set: %08x)\n",
- i915_mmio_reg_offset(wa_->reg),
- wa_->clr, wa_->set);
-
-   wa_->set &= ~wa->clr;
-   }
-
-   wal->wa_count++;
-   wa_->set |= wa->set;
-   wa_->clr |= wa->clr;
-   wa_->read |= wa->read;
-   return;
+static void __wa_add(struct i915_wa_list *wal, const struct i915_wa *wa)
+{
+   struct i915_wa *wa_;
+   int index;
+
+   index = wa_index(wal, wa->reg);
+   if (index >= 0) {
+   wa_ = >list[index];
+
+   if ((wa->clr

[PATCH 0/8] Enable triggered perf query for Xe_HP

2021-08-30 Thread Umesh Nerlige Ramappa
This is a revival of the patch series to support triggered perf query reports
from here - https://patchwork.freedesktop.org/series/83831/

The patches were not pushed earlier because corresponding UMD changes were
missing. This revival addresses UMD changes in GPUvis for this series. GPUvis
uses the perf library in igt-gpu-tools. Changes to the library are here -
https://patchwork.freedesktop.org/series/93355/

GPUvis changes will be posted as a PR once the above library and kernel changes
are pushed.

Summary of the feature:

Current platforms provide MI_REPORT_PERF_COUNT to query a snapshot of perf
counters from a batch. This mechanism does not have consistent support on all
engines for newer platforms. To support perf query, all new platforms use a
mechanism to trigger OA report snapshots into the OA buffer by writing to a HW
register from a batch. To look up this report in the OA buffer quickly, the OA
buffer is mmapped into user space.

This series implements the new query mechanism.

v2: Fix BAT failure (Umesh)
v3: Fix selftest (Umesh)
v4: Update uapi comment (Umesh)

Test-with: 20210830193337.15260-1-umesh.nerlige.rama...@intel.com
Signed-off-by: Umesh Nerlige Ramappa 

Chris Wilson (3):
  drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow
  drm/i915/gt: Check for conflicting RING_NONPRIV
  drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

Piotr Maciejewski (1):
  drm/i915/perf: Ensure observation logic is not clock gated

Umesh Nerlige Ramappa (4):
  drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock
  drm/i915/perf: Whitelist OA report trigger registers
  drm/i915/perf: Whitelist OA counter and buffer registers
  drm/i915/perf: Map OA buffer to user space for gen12 performance query

 drivers/gpu/drm/i915/gem/i915_gem_mman.c  |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c   | 269 +-
 drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
 .../gpu/drm/i915/gt/selftest_workarounds.c| 267 +
 drivers/gpu/drm/i915/i915_perf.c  | 228 ++-
 drivers/gpu/drm/i915/i915_perf_types.h|   8 +
 drivers/gpu/drm/i915/i915_reg.h   |  30 +-
 include/uapi/drm/i915_drm.h   |  33 +++
 9 files changed, 758 insertions(+), 88 deletions(-)

-- 
2.20.1



[PATCH 1/8] drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock

2021-08-30 Thread Umesh Nerlige Ramappa
Refactor intel_engine_apply_whitelist into locked and unlocked versions
so that a caller who already holds the lock can apply the whitelist.

v2: Fix sparse warning
v3: (Chris)
- Drop prefix and suffix for static function
- Use longest to shortest line ordering for variable declaration

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 46 ++---
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 94e1937f8d29..2a8cc0e2d1b1 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -1248,7 +1248,8 @@ void intel_gt_init_workarounds(struct drm_i915_private 
*i915)
 }
 
 static enum forcewake_domains
-wal_get_fw_for_rmw(struct intel_uncore *uncore, const struct i915_wa_list *wal)
+wal_get_fw(struct intel_uncore *uncore, const struct i915_wa_list *wal,
+  unsigned int op)
 {
enum forcewake_domains fw = 0;
struct i915_wa *wa;
@@ -1257,8 +1258,7 @@ wal_get_fw_for_rmw(struct intel_uncore *uncore, const 
struct i915_wa_list *wal)
for (i = 0, wa = wal->list; i < wal->count; i++, wa++)
fw |= intel_uncore_forcewake_for_reg(uncore,
 wa->reg,
-FW_REG_READ |
-FW_REG_WRITE);
+op);
 
return fw;
 }
@@ -1289,7 +1289,7 @@ wa_list_apply(struct intel_gt *gt, const struct 
i915_wa_list *wal)
if (!wal->count)
return;
 
-   fw = wal_get_fw_for_rmw(uncore, wal);
+   fw = wal_get_fw(uncore, wal, FW_REG_READ | FW_REG_WRITE);
 
spin_lock_irqsave(>lock, flags);
intel_uncore_forcewake_get__locked(uncore, fw);
@@ -1328,7 +1328,7 @@ static bool wa_list_verify(struct intel_gt *gt,
unsigned int i;
bool ok = true;
 
-   fw = wal_get_fw_for_rmw(uncore, wal);
+   fw = wal_get_fw(uncore, wal, FW_REG_READ | FW_REG_WRITE);
 
spin_lock_irqsave(>lock, flags);
intel_uncore_forcewake_get__locked(uncore, fw);
@@ -1617,27 +1617,45 @@ void intel_engine_init_whitelist(struct intel_engine_cs 
*engine)
wa_init_finish(w);
 }
 
-void intel_engine_apply_whitelist(struct intel_engine_cs *engine)
+static void __engine_apply_whitelist(struct intel_engine_cs *engine)
 {
const struct i915_wa_list *wal = >whitelist;
struct intel_uncore *uncore = engine->uncore;
const u32 base = engine->mmio_base;
+   enum forcewake_domains fw;
struct i915_wa *wa;
unsigned int i;
 
-   if (!wal->count)
-   return;
+   lockdep_assert_held(>lock);
+
+   fw = wal_get_fw(uncore, wal, FW_REG_WRITE);
+   intel_uncore_forcewake_get__locked(uncore, fw);
 
for (i = 0, wa = wal->list; i < wal->count; i++, wa++)
-   intel_uncore_write(uncore,
-  RING_FORCE_TO_NONPRIV(base, i),
-  i915_mmio_reg_offset(wa->reg));
+   intel_uncore_write_fw(uncore,
+ RING_FORCE_TO_NONPRIV(base, i),
+ i915_mmio_reg_offset(wa->reg));
 
/* And clear the rest just in case of garbage */
for (; i < RING_MAX_NONPRIV_SLOTS; i++)
-   intel_uncore_write(uncore,
-  RING_FORCE_TO_NONPRIV(base, i),
-  i915_mmio_reg_offset(RING_NOPID(base)));
+   intel_uncore_write_fw(uncore,
+ RING_FORCE_TO_NONPRIV(base, i),
+ i915_mmio_reg_offset(RING_NOPID(base)));
+
+   intel_uncore_forcewake_put__locked(uncore, fw);
+}
+
+void intel_engine_apply_whitelist(struct intel_engine_cs *engine)
+{
+   const struct i915_wa_list *wal = >whitelist;
+   unsigned long flags;
+
+   if (!wal->count)
+   return;
+
+   spin_lock_irqsave(>uncore->lock, flags);
+   __engine_apply_whitelist(engine);
+   spin_unlock_irqrestore(>uncore->lock, flags);
 }
 
 static void
-- 
2.20.1



Re: [PATCH 0/8] Enable triggered perf query for Xe_HP

2021-08-03 Thread Umesh Nerlige Ramappa

On Tue, Aug 03, 2021 at 01:18:38PM -0700, Umesh Nerlige Ramappa wrote:

+ Joonas

On Tue, Aug 03, 2021 at 01:13:41PM -0700, Umesh Nerlige Ramappa wrote:

This is a revival of the patch series to support triggered perf query reports
from here - https://patchwork.freedesktop.org/series/83831/

The patches were not pushed earlier because corresponding UMD changes were
missing. This revival addresses UMD changes in GPUvis for this series. GPUvis
uses the perf library in igt-gpu-tools. Changes to the library are here -
https://patchwork.freedesktop.org/series/93355/

GPUvis changes will be posted as a PR once the above library and kernel changes
are pushed.


GPUvis changes:
https://github.com/unerlige/gpuvis/commit/1c19c134a64564f7b8d7ca3b46449324040a40be



Summary of the feature:

Current platforms provide MI_REPORT_PERF_COUNT to query a snapshot of perf
counters from a batch. This mechanism does not have consistent support on all
engines for newer platforms. To support perf query, all new platforms use a
mechanism to trigger OA report snapshots into the OA buffer by writing to a HW
register from a batch. To look up this report in the OA buffer quickly, the OA
buffer is mmapped into user space.

This series implements the new query mechanism.

Signed-off-by: Umesh Nerlige Ramappa 

Chris Wilson (3):
drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow
drm/i915/gt: Check for conflicting RING_NONPRIV
drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

Piotr Maciejewski (1):
drm/i915/perf: Ensure observation logic is not clock gated

Umesh Nerlige Ramappa (4):
drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock
drm/i915/perf: Whitelist OA report trigger registers
drm/i915/perf: Whitelist OA counter and buffer registers
drm/i915/perf: Map OA buffer to user space for gen12 performance query

drivers/gpu/drm/i915/gem/i915_gem_mman.c  |   2 +-
drivers/gpu/drm/i915/gem/i915_gem_mman.h  |   2 +
drivers/gpu/drm/i915/gt/intel_workarounds.c   | 269 +-
drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
.../gpu/drm/i915/gt/selftest_workarounds.c| 237 +++
drivers/gpu/drm/i915/i915_perf.c  | 228 ++-
drivers/gpu/drm/i915/i915_perf_types.h|   8 +
drivers/gpu/drm/i915/i915_reg.h   |  30 +-
include/uapi/drm/i915_drm.h   |  33 +++
9 files changed, 728 insertions(+), 88 deletions(-)

--
2.20.1



Re: [PATCH 0/8] Enable triggered perf query for Xe_HP

2021-08-03 Thread Umesh Nerlige Ramappa

+ Joonas

On Tue, Aug 03, 2021 at 01:13:41PM -0700, Umesh Nerlige Ramappa wrote:

This is a revival of the patch series to support triggered perf query reports
from here - https://patchwork.freedesktop.org/series/83831/

The patches were not pushed earlier because corresponding UMD changes were
missing. This revival addresses UMD changes in GPUvis for this series. GPUvis
uses the perf library in igt-gpu-tools. Changes to the library are here -
https://patchwork.freedesktop.org/series/93355/

GPUvis changes will be posted as a PR once the above library and kernel changes
are pushed.

Summary of the feature:

Current platforms provide MI_REPORT_PERF_COUNT to query a snapshot of perf
counters from a batch. This mechanism does not have consistent support on all
engines for newer platforms. To support perf query, all new platforms use a
mechanism to trigger OA report snapshots into the OA buffer by writing to a HW
register from a batch. To look up this report in the OA buffer quickly, the OA
buffer is mmapped into user space.

This series implements the new query mechanism.

Signed-off-by: Umesh Nerlige Ramappa 

Chris Wilson (3):
 drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow
 drm/i915/gt: Check for conflicting RING_NONPRIV
 drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

Piotr Maciejewski (1):
 drm/i915/perf: Ensure observation logic is not clock gated

Umesh Nerlige Ramappa (4):
 drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock
 drm/i915/perf: Whitelist OA report trigger registers
 drm/i915/perf: Whitelist OA counter and buffer registers
 drm/i915/perf: Map OA buffer to user space for gen12 performance query

drivers/gpu/drm/i915/gem/i915_gem_mman.c  |   2 +-
drivers/gpu/drm/i915/gem/i915_gem_mman.h  |   2 +
drivers/gpu/drm/i915/gt/intel_workarounds.c   | 269 +-
drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
.../gpu/drm/i915/gt/selftest_workarounds.c| 237 +++
drivers/gpu/drm/i915/i915_perf.c  | 228 ++-
drivers/gpu/drm/i915/i915_perf_types.h|   8 +
drivers/gpu/drm/i915/i915_reg.h   |  30 +-
include/uapi/drm/i915_drm.h   |  33 +++
9 files changed, 728 insertions(+), 88 deletions(-)

--
2.20.1



[PATCH 5/8] drm/i915/perf: Ensure observation logic is not clock gated

2021-08-03 Thread Umesh Nerlige Ramappa
From: Piotr Maciejewski 

A clock gating switch controls whether the performance monitoring and
observation logic is enabled. Ensure that we enable the clocks.

v2: Separate code from other patches (Lionel)
v3: Reset PMON enable when disabling perf to save power (Lionel)
v4: Use intel_uncore_rmw and REG_BIT (Chris)
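
For readers unfamiliar with the helper, intel_uncore_rmw(uncore, reg, clear,
set) is roughly the open-coded read-modify-write below (a sketch only):

/* Approximate expansion of intel_uncore_rmw() as used in this patch:
 * enable passes clear=0, set=GEN12_SQCNT1_PMON_ENABLE; disable swaps them.
 */
static void sqcnt1_rmw(struct intel_uncore *uncore, u32 clear, u32 set)
{
        u32 val = intel_uncore_read(uncore, GEN12_SQCNT1);

        val &= ~clear;
        val |= set;

        intel_uncore_write(uncore, GEN12_SQCNT1, val);
}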

Fixes: 00a7f0d7155c ("drm/i915/tgl: Add perf support on TGL")
Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c | 9 +
 drivers/gpu/drm/i915/i915_reg.h  | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 2f01b8c0284c..3ded6e7d8526 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2553,6 +2553,12 @@ gen12_enable_metric_set(struct i915_perf_stream *stream,
(period_exponent << 
GEN12_OAG_OAGLBCTXCTRL_TIMER_PERIOD_SHIFT))
: 0);
 
+   /*
+* Initialize Super Queue Internal Cnt Register
+* Set PMON Enable in order to collect valid metrics.
+*/
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, 0, GEN12_SQCNT1_PMON_ENABLE);
+
/*
 * Update all contexts prior writing the mux configurations as we need
 * to make sure all slices/subslices are ON before writing to NOA
@@ -2612,6 +2618,9 @@ static void gen12_disable_metric_set(struct 
i915_perf_stream *stream)
 
/* Make sure we disable noa to save power. */
intel_uncore_rmw(uncore, RPM_CONFIG1, GEN10_GT_NOA_ENABLE, 0);
+
+   /* Reset PMON Enable to save power. */
+   intel_uncore_rmw(uncore, GEN12_SQCNT1, GEN12_SQCNT1_PMON_ENABLE, 0);
 }
 
 static void gen7_oa_enable(struct i915_perf_stream *stream)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index bd395d967634..138b35da3057 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -718,6 +718,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define OABUFFER_SIZE_16M   (7 << 3)
 
 #define GEN12_OA_TLB_INV_CR _MMIO(0xceec)
+#define GEN12_SQCNT1 _MMIO(0x8718)
+#define  GEN12_SQCNT1_PMON_ENABLE REG_BIT(30)
 
 /* Gen12 OAR unit */
 #define GEN12_OAR_OACONTROL _MMIO(0x2960)
-- 
2.20.1



[PATCH 4/8] drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

2021-08-03 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

The OA subsystem would like to give its privileged clients access to
the OA registers from execbuf. This requires temporarily removing the
HW validation from those registers for the duration of the OA client,
for which we need to allow OA to dynamically adjust the set of
RING_NONPRIV.

Care must still be taken since the RING_NONPRIV are global: any and
all contexts that run at the same time as the OA client will also be
able to adjust the registers from their execbuf.
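
The intended calling pattern for an OA client looks roughly like the sketch
below (register list borrowed from the OA whitelist later in this series;
error handling trimmed):

static int oa_whitelist_example(struct intel_engine_cs *engine)
{
        static const i915_reg_t oa_regs[] = {
                { __OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
                { __OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
        };
        int err;

        /* on opening the OA stream: expose the registers to user batches */
        err = intel_engine_allow_user_register_access(engine, oa_regs,
                                                      ARRAY_SIZE(oa_regs));
        if (err)
                return err;

        /* ... OA client runs; any concurrent context may touch oa_regs ... */

        /* on closing the stream: restore HW validation */
        intel_engine_deny_user_register_access(engine, oa_regs,
                                               ARRAY_SIZE(oa_regs));
        return 0;
}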

v2: Fix memmove size (Umesh)

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c   |  59 +
 drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
 .../gpu/drm/i915/gt/selftest_workarounds.c| 237 ++
 3 files changed, 303 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 7dda5a0a8e75..3da7b5486251 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -200,6 +200,18 @@ static void _wa_add(struct i915_wa_list *wal, const struct 
i915_wa *wa)
__wa_add(wal, wa);
 }
 
+static void _wa_del(struct i915_wa_list *wal, i915_reg_t reg)
+{
+   struct i915_wa *wa = wal->list;
+   int index;
+
+   index = wa_index(wal, reg);
+   if (GEM_DEBUG_WARN_ON(index < 0))
+   return;
+
+   memmove(wa + index, wa + index + 1, (--wal->count - index) * sizeof(*wa));
+}
+
 static void wa_add(struct i915_wa_list *wal, i915_reg_t reg,
   u32 clear, u32 set, u32 read_mask, bool masked_reg)
 {
@@ -2012,6 +2024,53 @@ void intel_engine_init_workarounds(struct 
intel_engine_cs *engine)
wa_init_finish(wal);
 }
 
+int intel_engine_allow_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count)
+{
+   struct i915_wa_list *wal = >whitelist;
+   unsigned long flags;
+   int err;
+
+   if (GEM_DEBUG_WARN_ON(wal->count + count >= RING_MAX_NONPRIV_SLOTS))
+   return -ENOSPC;
+
+   spin_lock_irqsave(>uncore->lock, flags);
+
+   err = wa_list_grow(wal, wal->count + count, GFP_ATOMIC | __GFP_NOWARN);
+   if (err)
+   goto out;
+
+   while (count--) {
+   struct i915_wa wa = { .reg = *reg++ };
+
+   __wa_add(wal, );
+   }
+
+   __engine_apply_whitelist(engine);
+
+out:
+   spin_unlock_irqrestore(>uncore->lock, flags);
+   return err;
+}
+
+void intel_engine_deny_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count)
+{
+   struct i915_wa_list *wal = >whitelist;
+   unsigned long flags;
+
+   spin_lock_irqsave(>uncore->lock, flags);
+
+   while (count--)
+   _wa_del(wal, *reg++);
+
+   __engine_apply_whitelist(engine);
+
+   spin_unlock_irqrestore(>uncore->lock, flags);
+}
+
 void intel_engine_apply_workarounds(struct intel_engine_cs *engine)
 {
wa_list_apply(engine->gt, >wa_list);
diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.h 
b/drivers/gpu/drm/i915/gt/intel_workarounds.h
index 15abb68b6c00..3c50390e3a7f 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.h
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.h
@@ -36,4 +36,11 @@ void intel_engine_apply_workarounds(struct intel_engine_cs 
*engine);
 int intel_engine_verify_workarounds(struct intel_engine_cs *engine,
const char *from);
 
+int intel_engine_allow_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count);
+void intel_engine_deny_user_register_access(struct intel_engine_cs *engine,
+   const i915_reg_t *reg,
+   unsigned int count);
+
 #endif
diff --git a/drivers/gpu/drm/i915/gt/selftest_workarounds.c 
b/drivers/gpu/drm/i915/gt/selftest_workarounds.c
index e623ac45f4aa..8290e02d4663 100644
--- a/drivers/gpu/drm/i915/gt/selftest_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/selftest_workarounds.c
@@ -1177,6 +1177,242 @@ static int live_isolated_whitelist(void *arg)
return err;
 }
 
+static int rmw_reg(struct intel_engine_cs *engine, const i915_reg_t reg)
+{
+   const u32 values[] = {
+   0x00000000,
+   0x01010101,
+   0x10100101,
+   0x03030303,
+   0x30300303,
+   0x05050505,
+   0x50500505,
+   0x0f0f0f0f,
+   0xf00ff00f,
+   0x10101010,
+   0xf0f01010,
+   0x30303030,

[PATCH 8/8] drm/i915/perf: Map OA buffer to user space for gen12 performance query

2021-08-03 Thread Umesh Nerlige Ramappa
i915 has supported a time-based sampling mode, which is good for overall
system monitoring but not enough for a query mode used to measure a
single draw call or dispatch. Gen9-Gen11 use the current i915 perf
implementation for query, but Gen12+ requires a new approach for query
based on triggered reports within the OA buffer.

Triggering reports into the OA buffer is achieved by writing into a
trigger register. Optionally, an unused counter/register is set with a
marker value such that a triggered report can be identified in the OA
buffer. Reports are usually triggered at the start and end of work that
is measured.

Since the OA buffer is large and queries can be frequent, an efficient way
to look for triggered reports is required. By knowing the current head
and tail offsets into the OA buffer, it is easier to determine the
locality of the reports of interest.

The current perf OA interface does not expose head/tail information to
the user, and it filters out invalid reports before sending data to the
user. Also, considering the limited size of the user buffer used during
a query, creating a 1:1 copy of the OA buffer in user space added
undesired complexity.

The solution was to map the OA buffer to user space, provided that:

(1) it is accessed by a privileged user, and
(2) OA report filtering is not used.

These 2 conditions would satisfy the safety criteria that the current
perf interface addresses.

To enable the query:
- Add an ioctl to expose head and tail to the user
- Add an ioctl to return size and offset of the OA buffer
- Map the OA buffer to the user space (see the sketch below)
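
From the user side, the mapping then looks roughly like this sketch; the
offset constant matches the #define added in this patch, while the
read-only protection is an assumption here:

#include <sys/mman.h>

#define I915_PERF_OA_BUFFER_MMAP_OFFSET 1

/* Map the OA buffer through the perf stream fd. MAP_PRIVATE is required
 * (shared mappings are rejected), and the size would come from the new
 * buffer-info ioctl added in this patch.
 */
static void *map_oa_buffer(int perf_fd, size_t oa_size, long page_size)
{
        return mmap(NULL, oa_size, PROT_READ, MAP_PRIVATE, perf_fd,
                    I915_PERF_OA_BUFFER_MMAP_OFFSET * page_size);
}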

v2:
- Improve commit message (Chris)
- Do not mmap based on gem object filp. Instead, use perf_fd and support
  mmap syscall (Chris)
- Pass non-zero offset in mmap to enforce the right object is
  mapped (Chris)
- Do not expose gpu_address (Chris)
- Verify start and length of vma for page alignment (Lionel)
- Move SQNTL config out (Lionel)

v3: (Chris)
- Omit redundant checks
- Return VM_FAULT_SIGBUS is old stream is closed
- Maintain reference counts to stream in vm_open and vm_close
- Use switch to identify object to be mapped

v4: Call kref_put on closing perf fd (Chris)
v5:
- Strip access to OA buffer from unprivileged child of a privileged
  parent. Use VM_DONTCOPY
- Enforce MAP_PRIVATE by checking for VM_MAYSHARE

v6:
(Chris)
- Use len of -1 in unmap_mapping_range
- Don't use stream->oa_buffer.vma->obj in vm_fault_oa
- Use kernel block comment style
- do_mmap gets a reference to the file and puts it in do_munmap, so
  no need to maintain a reference to i915_perf_stream. Hence, remove
  vm_open/vm_close and stream->closed hooks/checks.
(Umesh)
- Do not allow mmap if SAMPLE_OA_REPORT is not set during
  i915_perf_open_ioctl.
- Drop ioctl returning head/tail since this information is already
  whitelisted. Remove hooks to read head register.

v7: (Chris)
- unmap before destroy
- change ioctl argument struct

v8: Documentation and more checks (Chris)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/i915_gem_mman.c |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.h |   2 +
 drivers/gpu/drm/i915/i915_perf.c | 126 ++-
 include/uapi/drm/i915_drm.h  |  33 ++
 4 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index 5130e8ed9564..84cdce2ee447 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -213,7 +213,7 @@ compute_partial_view(const struct drm_i915_gem_object *obj,
return view;
 }
 
-static vm_fault_t i915_error_to_vmf_fault(int err)
+vm_fault_t i915_error_to_vmf_fault(int err)
 {
switch (err) {
default:
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.h 
b/drivers/gpu/drm/i915/gem/i915_gem_mman.h
index efee9e0d2508..1190a3a228ea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.h
@@ -29,4 +29,6 @@ void i915_gem_object_release_mmap_gtt(struct 
drm_i915_gem_object *obj);
 
 void i915_gem_object_release_mmap_offset(struct drm_i915_gem_object *obj);
 
+vm_fault_t i915_error_to_vmf_fault(int err);
+
 #endif
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index de3d1738aabe..1f8d4f3a2148 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -192,10 +192,12 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 
 #include "gem/i915_gem_context.h"
+#include "gem/i915_gem_mman.h"
 #include "gt/intel_engine_pm.h"
 #include "gt/intel_engine_user.h"
 #include "gt/intel_execlists_submission.h"
@@ -3322,6 +3324,44 @@ static long i915_perf_config_locked(struct 
i915_perf_stream *stream,
return ret;
 }
 
+#define I915_PERF_OA_BUFFER_MMAP_OFFSET 1
+
+/**
+ * i915_perf_oa_buffer_info_locked - siz

[PATCH 1/8] drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock

2021-08-03 Thread Umesh Nerlige Ramappa
Refactor intel_engine_apply_whitelist into locked and unlocked versions
so that a caller who already holds the lock can apply the whitelist.

v2: Fix sparse warning
v3: (Chris)
- Drop prefix and suffix for static function
- Use longest to shortest line ordering for variable declaration

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 46 ++---
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 053fa7251cd0..011cde0bc38d 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -1108,7 +1108,8 @@ void intel_gt_init_workarounds(struct drm_i915_private 
*i915)
 }
 
 static enum forcewake_domains
-wal_get_fw_for_rmw(struct intel_uncore *uncore, const struct i915_wa_list *wal)
+wal_get_fw(struct intel_uncore *uncore, const struct i915_wa_list *wal,
+  unsigned int op)
 {
enum forcewake_domains fw = 0;
struct i915_wa *wa;
@@ -1117,8 +1118,7 @@ wal_get_fw_for_rmw(struct intel_uncore *uncore, const 
struct i915_wa_list *wal)
for (i = 0, wa = wal->list; i < wal->count; i++, wa++)
fw |= intel_uncore_forcewake_for_reg(uncore,
 wa->reg,
-FW_REG_READ |
-FW_REG_WRITE);
+op);
 
return fw;
 }
@@ -1149,7 +1149,7 @@ wa_list_apply(struct intel_gt *gt, const struct 
i915_wa_list *wal)
if (!wal->count)
return;
 
-   fw = wal_get_fw_for_rmw(uncore, wal);
+   fw = wal_get_fw(uncore, wal, FW_REG_READ | FW_REG_WRITE);
 
spin_lock_irqsave(>lock, flags);
intel_uncore_forcewake_get__locked(uncore, fw);
@@ -1188,7 +1188,7 @@ static bool wa_list_verify(struct intel_gt *gt,
unsigned int i;
bool ok = true;
 
-   fw = wal_get_fw_for_rmw(uncore, wal);
+   fw = wal_get_fw(uncore, wal, FW_REG_READ | FW_REG_WRITE);
 
spin_lock_irqsave(>lock, flags);
intel_uncore_forcewake_get__locked(uncore, fw);
@@ -1477,27 +1477,45 @@ void intel_engine_init_whitelist(struct intel_engine_cs 
*engine)
wa_init_finish(w);
 }
 
-void intel_engine_apply_whitelist(struct intel_engine_cs *engine)
+static void __engine_apply_whitelist(struct intel_engine_cs *engine)
 {
const struct i915_wa_list *wal = >whitelist;
struct intel_uncore *uncore = engine->uncore;
const u32 base = engine->mmio_base;
+   enum forcewake_domains fw;
struct i915_wa *wa;
unsigned int i;
 
-   if (!wal->count)
-   return;
+   lockdep_assert_held(>lock);
+
+   fw = wal_get_fw(uncore, wal, FW_REG_WRITE);
+   intel_uncore_forcewake_get__locked(uncore, fw);
 
for (i = 0, wa = wal->list; i < wal->count; i++, wa++)
-   intel_uncore_write(uncore,
-  RING_FORCE_TO_NONPRIV(base, i),
-  i915_mmio_reg_offset(wa->reg));
+   intel_uncore_write_fw(uncore,
+ RING_FORCE_TO_NONPRIV(base, i),
+ i915_mmio_reg_offset(wa->reg));
 
/* And clear the rest just in case of garbage */
for (; i < RING_MAX_NONPRIV_SLOTS; i++)
-   intel_uncore_write(uncore,
-  RING_FORCE_TO_NONPRIV(base, i),
-  i915_mmio_reg_offset(RING_NOPID(base)));
+   intel_uncore_write_fw(uncore,
+ RING_FORCE_TO_NONPRIV(base, i),
+ i915_mmio_reg_offset(RING_NOPID(base)));
+
+   intel_uncore_forcewake_put__locked(uncore, fw);
+}
+
+void intel_engine_apply_whitelist(struct intel_engine_cs *engine)
+{
+   const struct i915_wa_list *wal = >whitelist;
+   unsigned long flags;
+
+   if (!wal->count)
+   return;
+
+   spin_lock_irqsave(>uncore->lock, flags);
+   __engine_apply_whitelist(engine);
+   spin_unlock_irqrestore(>uncore->lock, flags);
 }
 
 static void
-- 
2.20.1



[PATCH 2/8] drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow

2021-08-03 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

Switch the search and grow code of the _wa_add to use _wa_index and
_wa_list_grow.

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 124 +++-
 1 file changed, 71 insertions(+), 53 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 011cde0bc38d..94540cdb90c4 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -60,20 +60,19 @@ static void wa_init_start(struct i915_wa_list *wal, const 
char *name, const char
 
 #define WA_LIST_CHUNK (1 << 4)
 
-static void wa_init_finish(struct i915_wa_list *wal)
+static void wa_trim(struct i915_wa_list *wal, gfp_t gfp)
 {
+   struct i915_wa *list;
+
/* Trim unused entries. */
-   if (!IS_ALIGNED(wal->count, WA_LIST_CHUNK)) {
-   struct i915_wa *list = kmemdup(wal->list,
-  wal->count * sizeof(*list),
-  GFP_KERNEL);
-
-   if (list) {
-   kfree(wal->list);
-   wal->list = list;
-   }
-   }
+   list = krealloc(wal->list, wal->count * sizeof(*list), gfp);
+   if (list)
+   wal->list = list;
+}
 
+static void wa_init_finish(struct i915_wa_list *wal)
+{
+   wa_trim(wal, GFP_KERNEL);
if (!wal->count)
return;
 
@@ -81,57 +80,60 @@ static void wa_init_finish(struct i915_wa_list *wal)
 wal->wa_count, wal->name, wal->engine_name);
 }
 
-static void _wa_add(struct i915_wa_list *wal, const struct i915_wa *wa)
+static int wa_index(struct i915_wa_list *wal, i915_reg_t reg)
 {
-   unsigned int addr = i915_mmio_reg_offset(wa->reg);
-   unsigned int start = 0, end = wal->count;
-   const unsigned int grow = WA_LIST_CHUNK;
-   struct i915_wa *wa_;
+   unsigned int addr = i915_mmio_reg_offset(reg);
+   int start = 0, end = wal->count;
 
-   GEM_BUG_ON(!is_power_of_2(grow));
+   /* addr and wal->list[].reg, both include the R/W flags */
+   while (start < end) {
+   unsigned int mid = start + (end - start) / 2;
 
-   if (IS_ALIGNED(wal->count, grow)) { /* Either uninitialized or full. */
-   struct i915_wa *list;
+   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr)
+   start = mid + 1;
+   else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr)
+   end = mid;
+   else
+   return mid;
+   }
 
-   list = kmalloc_array(ALIGN(wal->count + 1, grow), sizeof(*wa),
-GFP_KERNEL);
-   if (!list) {
-   DRM_ERROR("No space for workaround init!\n");
-   return;
-   }
+   return -ENOENT;
+}
 
-   if (wal->list) {
-   memcpy(list, wal->list, sizeof(*wa) * wal->count);
-   kfree(wal->list);
-   }
+static int wa_list_grow(struct i915_wa_list *wal, size_t count, gfp_t gfp)
+{
+   struct i915_wa *list;
 
-   wal->list = list;
-   }
+   list = krealloc(wal->list, count * sizeof(*list), gfp);
+   if (!list)
+   return -ENOMEM;
 
-   while (start < end) {
-   unsigned int mid = start + (end - start) / 2;
+   wal->list = list;
+   return 0;
+}
 
-   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr) {
-   start = mid + 1;
-   } else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr) {
-   end = mid;
-   } else {
-   wa_ = >list[mid];
-
-   if ((wa->clr | wa_->clr) && !(wa->clr & ~wa_->clr)) {
-   DRM_ERROR("Discarding overwritten w/a for reg 
%04x (clear: %08x, set: %08x)\n",
- i915_mmio_reg_offset(wa_->reg),
- wa_->clr, wa_->set);
-
-   wa_->set &= ~wa->clr;
-   }
-
-   wal->wa_count++;
-   wa_->set |= wa->set;
-   wa_->clr |= wa->clr;
-   wa_->read |= wa->read;
-   return;
+static void __wa_add(struct i915_wa_list *wal, const struct i915_wa *wa)
+{
+   struct i915_wa *wa_;
+   int index;
+
+   index = wa_index(wal, wa->reg);
+   if (index >= 0) {
+   wa_ = >list[index];
+
+   if ((wa->clr

[PATCH 7/8] drm/i915/perf: Whitelist OA counter and buffer registers

2021-08-03 Thread Umesh Nerlige Ramappa
It is useful to have markers in the OA reports to identify triggered
reports. Whitelist some OA counters that can be used as markers.

A triggered report can be found faster if we can sample the HW tail and
head registers when the report was triggered. Whitelist OA buffer
specific registers.

v2:
- Bump up the perf revision (Lionel)
- Use indexing for counters (Lionel)
- Fix selftest for oa ticking register (Umesh)

v3: Pardon whitelisted registers for selftest (Umesh)

v4:
- Document whitelisted registers (Lionel)
- Fix live isolated whitelist for OA regs (Umesh)

v5:
- Free up whitelist slots. Remove GPU_TICKS and A20 counter (Piotr)
- Whitelist registers only if perf_stream_paranoid is set to 0 (Jon)

v6: Move oa whitelist array to i915_perf (Chris)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c | 18 +-
 drivers/gpu/drm/i915/i915_reg.h  | 16 ++--
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 30f5025b2ff6..de3d1738aabe 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1367,11 +1367,19 @@ free_noa_wait(struct i915_perf_stream *stream)
 static const i915_reg_t gen9_oa_wl_regs[] = {
{ __OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
{ __OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __OA_PERF_COUNTER_A(18) | (RING_FORCE_TO_NONPRIV_ACCESS_RW |
+RING_FORCE_TO_NONPRIV_RANGE_4) },
+   { __GEN8_OASTATUS | (RING_FORCE_TO_NONPRIV_ACCESS_RD |
+RING_FORCE_TO_NONPRIV_RANGE_4) },
 };
 
 static const i915_reg_t gen12_oa_wl_regs[] = {
{ __GEN12_OAG_OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
{ __GEN12_OAG_OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __GEN12_OAG_PERF_COUNTER_A(18) | (RING_FORCE_TO_NONPRIV_ACCESS_RW |
+   RING_FORCE_TO_NONPRIV_RANGE_4) },
+   { __GEN12_OAG_OASTATUS | (RING_FORCE_TO_NONPRIV_ACCESS_RD |
+ RING_FORCE_TO_NONPRIV_RANGE_4) },
 };
 
 static int intel_engine_apply_oa_whitelist(struct i915_perf_stream *stream)
@@ -4623,8 +4631,16 @@ int i915_perf_ioctl_version(void)
 *into the OA buffer. This applies only to gen8+. The feature can
 *only be accessed if perf_stream_paranoid is set to 0 by privileged
 *user.
+*
+* 7: Whitelist below OA registers for user to identify the location of
+*triggered reports in the OA buffer. This applies only to gen8+.
+*The feature can only be accessed if perf_stream_paranoid is set to
+*0 by privileged user.
+*
+*- OA buffer head/tail/status/buffer registers for read only
+*- OA counters A18, A19, A20 for read/write
 */
-   return 6;
+   return 7;
 }
 
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 114958854d67..09940d7d00cf 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -695,7 +695,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define  GEN7_OASTATUS2_HEAD_MASK   0xffc0
 #define  GEN7_OASTATUS2_MEM_SELECT_GGTT (1 << 0) /* 0: PPGTT, 1: GGTT */
 
-#define GEN8_OASTATUS _MMIO(0x2b08)
+#define __GEN8_OASTATUS 0x2b08
+#define GEN8_OASTATUS _MMIO(__GEN8_OASTATUS)
 #define  GEN8_OASTATUS_TAIL_POINTER_WRAP(1 << 17)
 #define  GEN8_OASTATUS_HEAD_POINTER_WRAP(1 << 16)
 #define  GEN8_OASTATUS_OVERRUN_STATUS  (1 << 3)
@@ -755,7 +756,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define  GEN12_OAG_OA_DEBUG_DISABLE_GO_1_0_REPORTS (1 << 2)
 #define  GEN12_OAG_OA_DEBUG_DISABLE_CTX_SWITCH_REPORTS (1 << 1)
 
-#define GEN12_OAG_OASTATUS _MMIO(0xdafc)
+#define __GEN12_OAG_OASTATUS 0xdafc
+#define GEN12_OAG_OASTATUS _MMIO(__GEN12_OAG_OASTATUS)
 #define  GEN12_OAG_OASTATUS_COUNTER_OVERFLOW (1 << 2)
 #define  GEN12_OAG_OASTATUS_BUFFER_OVERFLOW  (1 << 1)
 #define  GEN12_OAG_OASTATUS_REPORT_LOST  (1 << 0)
@@ -998,6 +1000,16 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define OAREPORTTRIG8_NOA_SELECT_6_SHIFT24
 #define OAREPORTTRIG8_NOA_SELECT_7_SHIFT28
 
+/* Performance counters registers */
+#define __OA_PERF_COUNTER_A(idx)   (0x2800 + 8 * (idx))
+#define OA_PERF_COUNTER_A(idx) _MMIO(__OA_PERF_COUNTER_A(idx))
+#define OA_PERF_COUNTER_A_UDW(idx) _MMIO(__OA_PERF_COUNTER_A(idx) + 4)
+
+/* Gen12 Performance counters registers */
+#define __GEN12_OAG_PERF_COUNTER_A(idx)(0xD980 + 8 * (idx))
+#define GEN12_OAG_PERF_COUNTER_A(idx)  _MMIO(__GEN12_OAG_PERF_COUNTER_A(idx))
+#define GEN12_OAG_PERF_COUNTER_A_UDW(idx) 

[PATCH 3/8] drm/i915/gt: Check for conflicting RING_NONPRIV

2021-08-03 Thread Umesh Nerlige Ramappa
From: Chris Wilson 

Strip the encoded bits from the register offset so that we only use the
address for looking up the RING_NONPRIV entry.

Signed-off-by: Chris Wilson 
Reviewed-by: Umesh Nerlige Ramappa 
---
 drivers/gpu/drm/i915/gt/intel_workarounds.c | 66 +
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_workarounds.c 
b/drivers/gpu/drm/i915/gt/intel_workarounds.c
index 94540cdb90c4..7dda5a0a8e75 100644
--- a/drivers/gpu/drm/i915/gt/intel_workarounds.c
+++ b/drivers/gpu/drm/i915/gt/intel_workarounds.c
@@ -80,18 +80,44 @@ static void wa_init_finish(struct i915_wa_list *wal)
 wal->wa_count, wal->name, wal->engine_name);
 }
 
+static u32 reg_offset(i915_reg_t reg)
+{
+   return i915_mmio_reg_offset(reg) & RING_FORCE_TO_NONPRIV_ADDRESS_MASK;
+}
+
+static u32 reg_flags(i915_reg_t reg)
+{
+   return i915_mmio_reg_offset(reg) & ~RING_FORCE_TO_NONPRIV_ADDRESS_MASK;
+}
+
+__maybe_unused
+static bool is_nonpriv_flags_valid(u32 flags)
+{
+   /* Check only valid flag bits are set */
+   if (flags & ~RING_FORCE_TO_NONPRIV_MASK_VALID)
+   return false;
+
+   /* NB: Only 3 out of 4 enum values are valid for access field */
+   if ((flags & RING_FORCE_TO_NONPRIV_ACCESS_MASK) ==
+   RING_FORCE_TO_NONPRIV_ACCESS_INVALID)
+   return false;
+
+   return true;
+}
+
 static int wa_index(struct i915_wa_list *wal, i915_reg_t reg)
 {
-   unsigned int addr = i915_mmio_reg_offset(reg);
int start = 0, end = wal->count;
+   u32 addr = reg_offset(reg);
 
/* addr and wal->list[].reg, both include the R/W flags */
while (start < end) {
unsigned int mid = start + (end - start) / 2;
+   u32 pos = reg_offset(wal->list[mid].reg);
 
-   if (i915_mmio_reg_offset(wal->list[mid].reg) < addr)
+   if (pos < addr)
start = mid + 1;
-   else if (i915_mmio_reg_offset(wal->list[mid].reg) > addr)
+   else if (pos > addr)
end = mid;
else
return mid;
@@ -117,13 +143,22 @@ static void __wa_add(struct i915_wa_list *wal, const 
struct i915_wa *wa)
struct i915_wa *wa_;
int index;
 
+   GEM_BUG_ON(!is_nonpriv_flags_valid(reg_flags(wa->reg)));
+
index = wa_index(wal, wa->reg);
if (index >= 0) {
wa_ = >list[index];
 
+   if (i915_mmio_reg_offset(wa->reg) !=
+   i915_mmio_reg_offset(wa_->reg)) {
+   DRM_ERROR("Discarding incompatible w/a for reg %04x\n",
+ reg_offset(wa->reg));
+   return;
+   }
+
if ((wa->clr | wa_->clr) && !(wa->clr & ~wa_->clr)) {
DRM_ERROR("Discarding overwritten w/a for reg %04x 
(clear: %08x, set: %08x)\n",
- i915_mmio_reg_offset(wa_->reg),
+ reg_offset(wa_->reg),
  wa_->clr, wa_->set);
 
wa_->set &= ~wa->clr;
@@ -141,10 +176,8 @@ static void __wa_add(struct i915_wa_list *wal, const 
struct i915_wa *wa)
*wa_ = *wa;
 
while (wa_-- > wal->list) {
-   GEM_BUG_ON(i915_mmio_reg_offset(wa_[0].reg) ==
-  i915_mmio_reg_offset(wa_[1].reg));
-   if (i915_mmio_reg_offset(wa_[1].reg) >
-   i915_mmio_reg_offset(wa_[0].reg))
+   GEM_BUG_ON(reg_offset(wa_[0].reg) == reg_offset(wa_[1].reg));
+   if (reg_offset(wa_[1].reg) > reg_offset(wa_[0].reg))
break;
 
swap(wa_[1], wa_[0]);
@@ -160,7 +193,7 @@ static void _wa_add(struct i915_wa_list *wal, const struct 
i915_wa *wa)
if (IS_ALIGNED(wal->count, grow) && /* Either uninitialized or full. */
wa_list_grow(wal, ALIGN(wal->count + 1, grow), GFP_KERNEL)) {
DRM_ERROR("Unable to store w/a for reg %04x\n",
- i915_mmio_reg_offset(wa->reg));
+ reg_offset(wa->reg));
return;
}
 
@@ -1227,21 +1260,6 @@ bool intel_gt_verify_workarounds(struct intel_gt *gt, 
const char *from)
return wa_list_verify(gt, >i915->gt_wa_list, from);
 }
 
-__maybe_unused
-static bool is_nonpriv_flags_valid(u32 flags)
-{
-   /* Check only valid flag bits are set */
-   if (flags & ~RING_FORCE_TO_NONPRIV_MASK_VALID)
-   return false;
-
-   /* NB: Only 3 out of 4 enum values are valid for access field */
-   if ((flags & RING_FORCE_TO_NONPRIV_ACCESS_MASK) ==
-   RING_FORCE_TO_NONPRIV_ACCESS_INVALID)
-   return false;
-
-   return true;
-}
-
 static void
 whitelist_reg_ext(struct i915_wa_list *wal, i915_reg_t reg, u32 flags)
 {
-- 
2.20.1



[PATCH 0/8] Enable triggered perf query for Xe_HP

2021-08-03 Thread Umesh Nerlige Ramappa
This is a revival of the patch series to support triggered perf query reports
from here - https://patchwork.freedesktop.org/series/83831/

The patches were not pushed earlier because corresponding UMD changes were
missing. This revival addresses UMD changes in GPUvis for this series. GPUvis
uses the perf library in igt-gpu-tools. Changes to the library are here -
https://patchwork.freedesktop.org/series/93355/

GPUvis changes will be posted as a PR once the above library and kernel changes
are pushed.

Summary of the feature:

Current platforms provide MI_REPORT_PERF_COUNT to query a snapshot of perf
counters from a batch. This mechanism does not have consistent support on all
engines for newer platforms. To support perf query, all new platforms use a
mechanism to trigger OA report snapshots into the OA buffer by writing to a HW
register from a batch. To look up this report in the OA buffer quickly, the OA
buffer is mmapped into user space.

This series implements the new query mechanism.

Signed-off-by: Umesh Nerlige Ramappa 

Chris Wilson (3):
  drm/i915/gt: Refactor _wa_add to reuse wa_index and wa_list_grow
  drm/i915/gt: Check for conflicting RING_NONPRIV
  drm/i915/gt: Enable dynamic adjustment of RING_NONPRIV

Piotr Maciejewski (1):
  drm/i915/perf: Ensure observation logic is not clock gated

Umesh Nerlige Ramappa (4):
  drm/i915/gt: Lock intel_engine_apply_whitelist with uncore->lock
  drm/i915/perf: Whitelist OA report trigger registers
  drm/i915/perf: Whitelist OA counter and buffer registers
  drm/i915/perf: Map OA buffer to user space for gen12 performance query

 drivers/gpu/drm/i915/gem/i915_gem_mman.c  |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.h  |   2 +
 drivers/gpu/drm/i915/gt/intel_workarounds.c   | 269 +-
 drivers/gpu/drm/i915/gt/intel_workarounds.h   |   7 +
 .../gpu/drm/i915/gt/selftest_workarounds.c| 237 +++
 drivers/gpu/drm/i915/i915_perf.c  | 228 ++-
 drivers/gpu/drm/i915/i915_perf_types.h|   8 +
 drivers/gpu/drm/i915/i915_reg.h   |  30 +-
 include/uapi/drm/i915_drm.h   |  33 +++
 9 files changed, 728 insertions(+), 88 deletions(-)

-- 
2.20.1



[PATCH 6/8] drm/i915/perf: Whitelist OA report trigger registers

2021-08-03 Thread Umesh Nerlige Ramappa
OA reports can be triggered into the OA buffer by writing into the
OAREPORTTRIG registers. Whitelist the registers to allow a non-privileged
user to trigger reports.

Whitelist registers only if perf_stream_paranoid is set to 0. In
i915_perf_open_ioctl, this setting is checked and the whitelist is
enabled accordingly. On closing the perf fd, the whitelist is removed.

This ensures that the access to the whitelist is gated by
perf_stream_paranoid.
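
From userspace, the flow is roughly the sketch below; the properties and
ioctl are existing i915 perf uapi, shown only to illustrate where the
whitelist is installed and removed:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Precondition (privileged user, once):
 *   echo 0 > /proc/sys/dev/i915/perf_stream_paranoid
 */
static int open_oa_stream(int drm_fd, uint64_t metrics_set, uint64_t oa_exponent)
{
        uint64_t props[] = {
                DRM_I915_PERF_PROP_SAMPLE_OA, 1,
                DRM_I915_PERF_PROP_OA_METRICS_SET, metrics_set,
                DRM_I915_PERF_PROP_OA_FORMAT, I915_OA_FORMAT_A32u40_A4u32_B8_C8,
                DRM_I915_PERF_PROP_OA_EXPONENT, oa_exponent,
        };
        struct drm_i915_perf_open_param param = {
                .flags = I915_PERF_FLAG_FD_CLOEXEC,
                .num_properties = sizeof(props) / (2 * sizeof(uint64_t)),
                .properties_ptr = (uintptr_t)props,
        };

        /* the OAREPORTTRIG whitelist is installed here and removed again
         * when the returned fd is closed
         */
        return ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);
}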

v2:
- Move related change to this patch (Lionel)
- Bump up perf revision (Lionel)

v3: Pardon whitelisted registers for selftest (Umesh)
v4: Document supported gens for the feature (Lionel)
v5: Whitelist registers only if perf_stream_paranoid is set to 0 (Jon)
v6: Move oa whitelist array to i915_perf (Chris)
v7: Fix OA writing beyond the wal->list memory (CI)
v8: Protect write to engine whitelist registers

v9: (Umesh)
- Use uncore->lock to protect write to forcepriv regs
- In case wal->count falls to zero on _wa_remove, make sure you still
  clear the registers. Remove wal->count check when applying whitelist.

v10: (Umesh)
- Split patches modifying intel_workarounds
- On some platforms there are no whitelisted regs. intel_engine_resume
  applies whitelist on these platforms too and the wal->count gates such
  platforms. Bring back the wal->count check.
- intel_engine_allow/deny_user_register_access modifies the engine
  whitelist and the wal->count. Use uncore->lock to serialize it with
  intel_engine_apply_whitelist.
- Grow the wal->list when adding whitelist registers after driver load.

v11:
- Fix memory leak in _wa_list_grow (Chris)
- Serialize kfree with engine resume using uncore->lock (Umesh)
- Grow the list only if wal->count is not aligned (Umesh)

Signed-off-by: Piotr Maciejewski 
Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_perf.c   | 79 +-
 drivers/gpu/drm/i915/i915_perf_types.h |  8 +++
 drivers/gpu/drm/i915/i915_reg.h| 12 ++--
 3 files changed, 92 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3ded6e7d8526..30f5025b2ff6 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -1364,12 +1364,56 @@ free_noa_wait(struct i915_perf_stream *stream)
i915_vma_unpin_and_release(>noa_wait, 0);
 }
 
+static const i915_reg_t gen9_oa_wl_regs[] = {
+   { __OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+};
+
+static const i915_reg_t gen12_oa_wl_regs[] = {
+   { __GEN12_OAG_OAREPORTTRIG2 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+   { __GEN12_OAG_OAREPORTTRIG6 | RING_FORCE_TO_NONPRIV_ACCESS_RW },
+};
+
+static int intel_engine_apply_oa_whitelist(struct i915_perf_stream *stream)
+{
+   struct intel_engine_cs *engine = stream->engine;
+   int ret;
+
+   if (i915_perf_stream_paranoid ||
+   !(stream->sample_flags & SAMPLE_OA_REPORT) ||
+   !stream->perf->oa_wl)
+   return 0;
+
+   ret = intel_engine_allow_user_register_access(engine,
+ stream->perf->oa_wl,
+ stream->perf->num_oa_wl);
+   if (ret < 0)
+   return ret;
+
+   stream->oa_whitelisted = true;
+   return 0;
+}
+
+static void intel_engine_remove_oa_whitelist(struct i915_perf_stream *stream)
+{
+   struct intel_engine_cs *engine = stream->engine;
+
+   if (!stream->oa_whitelisted)
+   return;
+
+   intel_engine_deny_user_register_access(engine,
+  stream->perf->oa_wl,
+  stream->perf->num_oa_wl);
+}
+
 static void i915_oa_stream_destroy(struct i915_perf_stream *stream)
 {
struct i915_perf *perf = stream->perf;
 
BUG_ON(stream != perf->exclusive_stream);
 
+   intel_engine_remove_oa_whitelist(stream);
+
/*
 * Unset exclusive_stream first, it will be checked while disabling
 * the metric set on gen8+.
@@ -1465,7 +1509,8 @@ static void gen8_init_oa_buffer(struct i915_perf_stream *stream)
 *  bit."
 */
intel_uncore_write(uncore, GEN8_OABUFFER, gtt_offset |
-  OABUFFER_SIZE_16M | GEN8_OABUFFER_MEM_SELECT_GGTT);
+  OABUFFER_SIZE_16M | GEN8_OABUFFER_MEM_SELECT_GGTT |
+  GEN7_OABUFFER_EDGE_TRIGGER);
intel_uncore_write(uncore, GEN8_OATAILPTR, gtt_offset & GEN8_OATAILPTR_MASK);
 
/* Mark that we need updated tail pointers to read from... */
@@ -1518,7 +1563,8 @@ static void gen12_init_oa_buffer(struct i915_perf_stream *stream)
 *  bit."
 */
intel_uncore_write(uncore, G

[PATCH 0/1] Add support for querying engine cycles

2021-05-03 Thread Umesh Nerlige Ramappa
This is just a refresh of the earlier patch, along with a cover letter for the
IGT testing. The query provides the engine cs cycles counter.

v2: Use GRAPHICS_VER() instead of IS_GEN()
v3: Add R-b to the patch
v4: Split cpu timestamp array into timestamp and delta for cleaner API
v5: Add width of the cs cycles to the uapi

Signed-off-by: Umesh Nerlige Ramappa 
Test-with: 20210504001003.69445-1-umesh.nerlige.rama...@intel.com

Umesh Nerlige Ramappa (1):
  i915/query: Correlate engine and cpu timestamps with better accuracy

 drivers/gpu/drm/i915/i915_query.c | 157 ++
 include/uapi/drm/i915_drm.h   |  56 +++
 2 files changed, 213 insertions(+)

-- 
2.20.1



[PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-05-03 Thread Umesh Nerlige Ramappa
Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lacks enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.
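
For illustration, userspace consumption would look roughly like the sketch
below. It assumes the uapi additions from this series (the
DRM_I915_QUERY_CS_CYCLES item and struct drm_i915_query_cs_cycles from the
patched i915_drm.h); the field names are taken from this commit message and
are assumptions, not the final uapi:

/* Sketch: correlate an engine timestamp with CLOCK_MONOTONIC via the
 * query added by this series. Requires the patched i915_drm.h; the
 * query item and struct field names below are assumed from the series.
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <drm/i915_drm.h>

static int query_cs_cycles(int drm_fd)
{
	struct drm_i915_query_cs_cycles cc = {};
	struct drm_i915_query_item item = {};
	struct drm_i915_query query = {};

	cc.engine.engine_class = I915_ENGINE_CLASS_RENDER; /* one clock; RCS suffices */
	cc.engine.engine_instance = 0;
	cc.clockid = CLOCK_MONOTONIC;	/* same selection scheme as perf_event_open */

	item.query_id = DRM_I915_QUERY_CS_CYCLES;	/* assumed item name */
	item.length = sizeof(cc);
	item.data_ptr = (uintptr_t)&cc;

	query.num_items = 1;
	query.items_ptr = (uintptr_t)&item;

	if (ioctl(drm_fd, DRM_IOCTL_I915_QUERY, &query))
		return -1;

	printf("cs_cycles=%llu cpu_ts=%llu ns (lower-dword read cost %llu ns)\n",
	       (unsigned long long)cc.cs_cycles,
	       (unsigned long long)cc.cpu_timestamp,
	       (unsigned long long)cc.cpu_delta);
	return 0;
}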

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- The kernel perf subsystem allows users to specify the clock id to be used
  in perf_event_open. This clock id is used by the perf subsystem to
  return the appropriate cpu timestamp in perf events. Similarly, let
  the user pass the clockid to this query so that cpu timestamp
  corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
  register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
  register and return it to user.

v11: (Jani)
- IS_GEN is deprecated. Use GRAPHICS_VER instead.

v12: (Jason)
- Split cpu timestamp array into timestamp and delta for cleaner API

v13:
- Return the width of cs cycles
- Conform to kernel doc format

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
Reviewed-by: Jason Ekstrand 
---
 drivers/gpu/drm/i915/i915_query.c | 157 ++
 include/uapi/drm/i915_drm.h   |  56 +++
 2 files changed, 213 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..2e7039c71866 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@
 
 #include <linux/nospec.h>
 
+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
 #include "i915_drv.h"
 #include "i915_perf.h"
 #include "i915_query.h"
@@ -90,6 +92,160 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
return total_length;
 }
 
+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+   /*
+* Use logic same as the perf subsystem to allow user to select the
+* reference clock id to be used for timestamps.
+*/
+   switch (clk_id) {
+   case CLOCK_MONOTONIC:
+   return &ktime_get_ns;
+   case CLOCK_MONOTONIC_RAW:
+   return &ktime_get_raw_ns;
+   case CLOCK_REALTIME:
+   return &ktime_get_real_ns;
+   case CLOCK_BOOTTIME:
+   return &ktime_get_boottime_ns;
+   case CLOCK_TAI:
+   return &ktime_get_clocktai_ns;
+   default:
+   return NULL;
+   }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+ i915_reg_t lower_reg,
+ i915_reg_t upper_reg,
+ u64 *cs_ts,
+ u64 *cpu_ts,
+ u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   u32 upper, lower, old_upper, loop = 0;
+
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   do {
+   *cpu_delta = local_clock();
+   *cpu_ts = cpu_clock();
+   lower = intel_uncore_read_fw(uncore, lower_reg);
+   *cpu_delta = local_clock() - *cpu_delta;
+   old_upper = upper;
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   } while (upper != old_upper && loop++ < 2);
+
+   *cs_ts = (u64)upper << 32 | lower;
+
+   return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+ u64 *cs_ts, u64 *cpu_ts, u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   struct intel_uncore *uncore = engine->uncore;
+   enum forcewake_domains fw_domains;
+   u32 base = engine->mmio_base;
+   intel_wakeref_t wakeref;
+   int ret;
+
+   fw_domains = intel_uncore_forcewake_for_reg(uncore,
+   RING_TIMESTAMP(base),
+   FW_REG_READ);
+
+   with_intel_runtime_pm(uncore->rpm, wakeref) {
+   spin_lock_irq(&uncore->lock);
+   intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+   ret = __read_timestamps(uncore,
+   RING_TIMESTAMP(base),
+   RING_TIMESTAMP_UDW(base),
+   cs_ts,
+ 

Re: [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-05-03 Thread Umesh Nerlige Ramappa

On Sat, May 01, 2021 at 10:27:03AM -0500, Jason Ekstrand wrote:

  On April 30, 2021 23:01:44 "Dixit, Ashutosh" wrote:

On Fri, 30 Apr 2021 19:19:59 -0700, Umesh Nerlige Ramappa wrote:

  On Fri, Apr 30, 2021 at 07:35:41PM -0500, Jason Ekstrand wrote:

On April 30, 2021 18:00:58 "Dixit, Ashutosh" wrote:
On Fri, 30 Apr 2021 15:26:09 -0700, Umesh Nerlige Ramappa wrote:
Looks like the engine can be dropped since all timestamps are in sync. I
just have one more question here. The timestamp itself is 36 bits. Should
the uapi also report the timestamp width to the user OR should I just
return the lower 32 bits of the timestamp?
Yeah, I think reporting the timestamp width is a good idea since we're
reporting the period/frequency here.

  Actually, I forgot that we are handling the overflow before returning
  the cs_cycles to the user, and overflow handling was the only reason I
  thought the user should know the width. Would you still recommend
  returning the width in the uapi?

The width is needed for userspace to figure out if overflow has occurred
between two successive query calls. I don't think I see this happening in
the code.
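
(A sketch of that userspace check: with the raw counter and its width from
the query, a wrap-safe delta between two samples is modular arithmetic; the
helper name is illustrative.)

/* Wrap-safe delta between two raw CS timestamp samples, given the
 * counter width the query reports (36 bits on current platforms).
 * Correct as long as at most one wrap occurs between samples.
 */
#include <stdint.h>

static inline uint64_t cs_cycles_delta(uint64_t t0, uint64_t t1,
				       unsigned int width)
{
	uint64_t mask = width < 64 ? (1ull << width) - 1 : ~0ull;

	return (t1 - t0) & mask;	/* modular subtraction absorbs the wrap */
}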

  Right... We (UMDs) currently just hard-code it to 36 bits because that's
  what we've had on all platforms since close enough to forever. We bake in
  the frequency based on PCI ID. Returning the number of bits, like I said,
  goes nicely with the frequency. It's not necessary, assuming sufficiently
  smart userspace (neither is frequency), but it seems to go with it. I
  guess I don't care much either way.
  Coming back to the multi-tile issue we discussed internally, I think that
  is something we should care about. Since this works by reading the
  timestamp register on an engine, I think leaving the engine specifier in
  there is fine. Userspace should know that there's actually only one clock
  and just query one of them (probably RCS). For crazy multi-device cases,
  we'll either query per logical device (read tile) or we'll have to make
  them look like a single device and sync the timestamps somehow in the UMD
  by carrying around an offset factor.
  As is, this patch is
  Reviewed-by: Jason Ekstrand 


Thanks, I will add the width here and post the final version.

Regards,
Umesh



  I still need to review the ANV patch before we can land this though.
  --Jason



Re: [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-30 Thread Umesh Nerlige Ramappa

On Fri, Apr 30, 2021 at 07:35:41PM -0500, Jason Ekstrand wrote:

  On April 30, 2021 18:00:58 "Dixit, Ashutosh" wrote:

On Fri, 30 Apr 2021 15:26:09 -0700, Umesh Nerlige Ramappa wrote:

  Looks like the engine can be dropped since all timestamps are in sync. I
  just have one more question here. The timestamp itself is 36 bits. Should
  the uapi also report the timestamp width to the user OR should I just
  return the lower 32 bits of the timestamp?

  Yeah, I think reporting the timestamp width is a good idea since we're
  reporting the period/frequency here.


Actually, I forgot that we are handling the overflow before returning 
the cs_cycles to the user, and overflow handling was the only reason I 
thought the user should know the width. Would you still recommend 
returning the width in the uapi?


Thanks,
Umesh




How would exposing only the lower 32 bits of the timestamp work?
The way to avoid exposing the width would be to expose the timestamp as a
regular 64 bit value. In the kernel engine state, have a variable for the
counter and keep on accumulating that (on each query) to full 64 bits in
spite of the 36 bit HW counter overflow.
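
(A sketch of that accumulation scheme, for illustration only; the names are
made up, and the reply below explains why it does not fit this uapi:)

/* Illustrative only: fold 36-bit HW samples into a monotonic 64-bit
 * software counter kept in engine state. Assumes the HW counter wraps
 * at most once between consecutive reads. Not what the patch does.
 */
#include <linux/types.h>

#define CS_TS_WIDTH	36
#define CS_TS_MASK	((1ull << CS_TS_WIDTH) - 1)

struct cs_ts_accum {
	u64 last;	/* previous raw sample (36 valid bits) */
	u64 total;	/* accumulated 64-bit timestamp */
};

static u64 cs_ts_read(struct cs_ts_accum *s, u64 raw)
{
	s->total += (raw - s->last) & CS_TS_MASK;
	s->last = raw & CS_TS_MASK;
	return s->total;
}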

  That doesn't actually work since you can query the 64-bit timestamp
  value from the GPU. The way this is handled in Vulkan is that the number
  of timestamp bits is reported to the application as a queue property.
  --Jason
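
(For reference, the Vulkan queue property mentioned here is
timestampValidBits in VkQueueFamilyProperties; a minimal sketch of reading
it:)

/* Sketch: read the timestamp width Vulkan reports for a queue family.
 * i915-based implementations report 36 valid bits here.
 */
#include <stdlib.h>
#include <vulkan/vulkan.h>

static uint32_t timestamp_valid_bits(VkPhysicalDevice pdev, uint32_t family)
{
	uint32_t count = 0, bits = 0;
	VkQueueFamilyProperties *props;

	vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, NULL);
	props = calloc(count, sizeof(*props));
	if (!props)
		return 0;
	vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, props);
	if (family < count)
		bits = props[family].timestampValidBits;
	free(props);
	return bits;
}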



Re: [PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-30 Thread Umesh Nerlige Ramappa

On Thu, Apr 29, 2021 at 02:07:58PM -0500, Jason Ekstrand wrote:

On Wed, Apr 28, 2021 at 7:34 PM Umesh Nerlige Ramappa wrote:


Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lacks enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- The kernel perf subsystem allows users to specify the clock id to be used
  in perf_event_open. This clock id is used by the perf subsystem to
  return the appropriate cpu timestamp in perf events. Similarly, let
  the user pass the clockid to this query so that cpu timestamp
  corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
  register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
  register and return it to user.

v11: (Jani)
- IS_GEN is deprecated. Use GRAPHICS_VER instead.

v12: (Jason)
- Split cpu timestamp array into timestamp and delta for cleaner API

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_query.c | 148 ++
 include/uapi/drm/i915_drm.h   |  52 +++
 2 files changed, 200 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..357c44e8177c 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@

 #include <linux/nospec.h>

+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
 #include "i915_drv.h"
 #include "i915_perf.h"
 #include "i915_query.h"
@@ -90,6 +92,151 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
return total_length;
 }

+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+   /*
+* Use logic same as the perf subsystem to allow user to select the
+* reference clock id to be used for timestamps.
+*/
+   switch (clk_id) {
+   case CLOCK_MONOTONIC:
+   return &ktime_get_ns;
+   case CLOCK_MONOTONIC_RAW:
+   return &ktime_get_raw_ns;
+   case CLOCK_REALTIME:
+   return &ktime_get_real_ns;
+   case CLOCK_BOOTTIME:
+   return &ktime_get_boottime_ns;
+   case CLOCK_TAI:
+   return &ktime_get_clocktai_ns;
+   default:
+   return NULL;
+   }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+ i915_reg_t lower_reg,
+ i915_reg_t upper_reg,
+ u64 *cs_ts,
+ u64 *cpu_ts,
+ u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   u32 upper, lower, old_upper, loop = 0;
+
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   do {
+   *cpu_delta = local_clock();
+   *cpu_ts = cpu_clock();
+   lower = intel_uncore_read_fw(uncore, lower_reg);
+   *cpu_delta = local_clock() - *cpu_delta;
+   old_upper = upper;
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   } while (upper != old_upper && loop++ < 2);
+
+   *cs_ts = (u64)upper << 32 | lower;
+
+   return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+ u64 *cs_ts, u64 *cpu_ts, u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   struct intel_uncore *uncore = engine->uncore;
+   enum forcewake_domains fw_domains;
+   u32 base = engine->mmio_base;
+   intel_wakeref_t wakeref;
+   int ret;
+
+   fw_domains = intel_uncore_forcewake_for_reg(uncore,
+   RING_TIMESTAMP(base),
+   FW_REG_READ);
+
+   with_intel_runtime_pm(uncore->rpm, wakeref) {
+   spin_lock_irq(&uncore->lock);
+   intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+   ret = __read_timestamps(uncore,
+   RING_TIMESTAMP(base),
+   RING_TIMESTAMP_UDW(base),
+ 

[PATCH 1/1] i915/query: Correlate engine and cpu timestamps with better accuracy

2021-04-28 Thread Umesh Nerlige Ramappa
Perf measurements rely on CPU and engine timestamps to correlate
events of interest across these time domains. Current mechanisms get
these timestamps separately and the calculated delta between these
timestamps lacks enough accuracy.

To improve the accuracy of these time measurements to within a few us,
add a query that returns the engine and cpu timestamps captured as
close to each other as possible.

v2: (Tvrtko)
- document clock reference used
- return cpu timestamp always
- capture cpu time just before lower dword of cs timestamp

v3: (Chris)
- use uncore-rpm
- use __query_cs_timestamp helper

v4: (Lionel)
- The kernel perf subsystem allows users to specify the clock id to be used
  in perf_event_open. This clock id is used by the perf subsystem to
  return the appropriate cpu timestamp in perf events. Similarly, let
  the user pass the clockid to this query so that cpu timestamp
  corresponds to the clock id requested.

v5: (Tvrtko)
- Use normal ktime accessors instead of fast versions
- Add more uApi documentation

v6: (Lionel)
- Move switch out of spinlock

v7: (Chris)
- cs_timestamp is a misnomer, use cs_cycles instead
- return the cs cycle frequency as well in the query

v8:
- Add platform and engine specific checks

v9: (Lionel)
- Return 2 cpu timestamps in the query - captured before and after the
  register read

v10: (Chris)
- Use local_clock() to measure time taken to read lower dword of
  register and return it to user.

v11: (Jani)
- IS_GEN is deprecated. Use GRAPHICS_VER instead.

v12: (Jason)
- Split cpu timestamp array into timestamp and delta for cleaner API

Signed-off-by: Umesh Nerlige Ramappa 
Reviewed-by: Lionel Landwerlin 
---
 drivers/gpu/drm/i915/i915_query.c | 148 ++
 include/uapi/drm/i915_drm.h   |  52 +++
 2 files changed, 200 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
index fed337ad7b68..357c44e8177c 100644
--- a/drivers/gpu/drm/i915/i915_query.c
+++ b/drivers/gpu/drm/i915/i915_query.c
@@ -6,6 +6,8 @@
 
 #include <linux/nospec.h>
 
+#include "gt/intel_engine_pm.h"
+#include "gt/intel_engine_user.h"
 #include "i915_drv.h"
 #include "i915_perf.h"
 #include "i915_query.h"
@@ -90,6 +92,151 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
return total_length;
 }
 
+typedef u64 (*__ktime_func_t)(void);
+static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
+{
+   /*
+* Use logic same as the perf subsystem to allow user to select the
+* reference clock id to be used for timestamps.
+*/
+   switch (clk_id) {
+   case CLOCK_MONOTONIC:
+   return &ktime_get_ns;
+   case CLOCK_MONOTONIC_RAW:
+   return &ktime_get_raw_ns;
+   case CLOCK_REALTIME:
+   return &ktime_get_real_ns;
+   case CLOCK_BOOTTIME:
+   return &ktime_get_boottime_ns;
+   case CLOCK_TAI:
+   return &ktime_get_clocktai_ns;
+   default:
+   return NULL;
+   }
+}
+
+static inline int
+__read_timestamps(struct intel_uncore *uncore,
+ i915_reg_t lower_reg,
+ i915_reg_t upper_reg,
+ u64 *cs_ts,
+ u64 *cpu_ts,
+ u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   u32 upper, lower, old_upper, loop = 0;
+
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   do {
+   *cpu_delta = local_clock();
+   *cpu_ts = cpu_clock();
+   lower = intel_uncore_read_fw(uncore, lower_reg);
+   *cpu_delta = local_clock() - *cpu_delta;
+   old_upper = upper;
+   upper = intel_uncore_read_fw(uncore, upper_reg);
+   } while (upper != old_upper && loop++ < 2);
+
+   *cs_ts = (u64)upper << 32 | lower;
+
+   return 0;
+}
+
+static int
+__query_cs_cycles(struct intel_engine_cs *engine,
+ u64 *cs_ts, u64 *cpu_ts, u64 *cpu_delta,
+ __ktime_func_t cpu_clock)
+{
+   struct intel_uncore *uncore = engine->uncore;
+   enum forcewake_domains fw_domains;
+   u32 base = engine->mmio_base;
+   intel_wakeref_t wakeref;
+   int ret;
+
+   fw_domains = intel_uncore_forcewake_for_reg(uncore,
+   RING_TIMESTAMP(base),
+   FW_REG_READ);
+
+   with_intel_runtime_pm(uncore->rpm, wakeref) {
+   spin_lock_irq(&uncore->lock);
+   intel_uncore_forcewake_get__locked(uncore, fw_domains);
+
+   ret = __read_timestamps(uncore,
+   RING_TIMESTAMP(base),
+   RING_TIMESTAMP_UDW(base),
+   cs_ts,
+   cpu_ts,
+   cpu_delta,
+

[PATCH 0/1] Add support for querying engine cycles

2021-04-28 Thread Umesh Nerlige Ramappa
This is just a refresh of the earlier patch, along with a cover letter for the
IGT testing. The query provides the engine cs cycles counter.

v2: Use GRAPHICS_VER() instead of IS_GEN()
v3: Add R-b to the patch
v4: Split cpu timestamp array into timestamp and delta for cleaner API

Signed-off-by: Umesh Nerlige Ramappa 
Test-with: 20210429002959.69473-1-umesh.nerlige.rama...@intel.com

Umesh Nerlige Ramappa (1):
  i915/query: Correlate engine and cpu timestamps with better accuracy

 drivers/gpu/drm/i915/i915_query.c | 148 ++
 include/uapi/drm/i915_drm.h   |  52 +++
 2 files changed, 200 insertions(+)

-- 
2.20.1
