Re: [Freedreno] [PATCH 3/3] drm/msm: Use Hardware counters for perf profiling

2018-11-28 Thread Rob Clark
On Fri, Oct 26, 2018 at 9:46 AM Sharat Masetty  wrote:
>
> Added Rob to this thread.
>
> On 10/17/2018 8:05 PM, Jordan Crouse wrote:
> > On Wed, Oct 17, 2018 at 06:34:01PM +0530, Sharat Masetty wrote:
> >> This patch attempts to make use of the hardware counters for GPU busy %
> >> estimation when possible and skip using the software counters as it also
> >> accounts for software side delays. This should help give more accurate
> >> representation of the GPU workload.
> >
> > I would leave this more to Rob as this particular file makes more sense for
> > freedreno than it does for us but I think in general if you want to do this
> > then we should do use the hardware for all targets and get rid of the
> > mix of the old and the new.
> Thanks for the comments Jordan. Yes, I prefer the same too, but the only
> limiting factor for me is that I don't have a way to test changes for
> targets such as a3xx and a4xx, and I also do not know if there is anyone
> actually using this currently.
>
> Hi Rob,
> Can you please share your comments on this? Is it okay to remove
> software counters functionality?

In principle yes..  although one side-comment is that there are
patches floating around for a2xx, which is from long enough ago that I
don't remember what the perf-ctr situation is there.

I guess if we can be reasonably confident to implement it w/ hw
counters for all the generations, then I don't see a big need to keep
the sw counter functionality.

I should, in theory, be able to test a3xx/a4xx/a5xx.. a4xx might be a
bit harder to get a recent upstream kernel running on

BR,
-R

>
> Sharat
> >
> >> Signed-off-by: Sharat Masetty 
> >> ---
> >>   drivers/gpu/drm/msm/msm_gpu.c  | 30 ++
> >>   drivers/gpu/drm/msm/msm_gpu.h  |  5 +++--
> >>   drivers/gpu/drm/msm/msm_perf.c | 10 +-
> >>   3 files changed, 34 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> >> index e9b5426..a896541 100644
> >> --- a/drivers/gpu/drm/msm/msm_gpu.c
> >> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> >> @@ -592,6 +592,9 @@ static void update_sw_cntrs(struct msm_gpu *gpu)
> >>  uint32_t elapsed;
> >>  unsigned long flags;
> >>
> >> +if (gpu->funcs->gpu_busy)
> >> +return;
> >
> > Like here - there isn't any reason to not have a gpu_busy for each target
> > so then this code could mostly go away.
> >
> >>  spin_lock_irqsave(&gpu->perf_lock, flags);
> >>  if (!gpu->perfcntr_active)
> >>  goto out;
> >> @@ -620,6 +623,7 @@ void msm_gpu_perfcntr_start(struct msm_gpu *gpu)
> >>  /* we could dynamically enable/disable perfcntr registers too.. */
> >>  gpu->last_sample.active = msm_gpu_active(gpu);
> >>  gpu->last_sample.time = ktime_get();
> >> +gpu->last_sample.busy_cycles = 0;
> >>  gpu->activetime = gpu->totaltime = 0;
> >>  gpu->perfcntr_active = true;
> >>  update_hw_cntrs(gpu, 0, NULL);
> >> @@ -632,9 +636,22 @@ void msm_gpu_perfcntr_stop(struct msm_gpu *gpu)
> >>  pm_runtime_put_sync(&gpu->pdev->dev);
> >>   }
> >>
> >> +static void msm_gpu_hw_sample(struct msm_gpu *gpu, uint64_t *activetime,
> >> +uint64_t *totaltime)
> >> +{
> >> +ktime_t time;
> >> +
> >> +*activetime = gpu->funcs->gpu_busy(gpu,
> >> +&gpu->last_sample.busy_cycles);
> >> +
> >> +time = ktime_get();
> >> +*totaltime = ktime_us_delta(time, gpu->last_sample.time);
> >> +gpu->last_sample.time = time;
> >> +}
> >
> > This can all be done inline in the sample function below.
> >
> >>   /* returns -errno or # of cntrs sampled */
> >> -int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime,
> >> -uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs)
> >> +int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime,
> >> +uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs)
> >
> > This might be an good change (though if our activetime or totaltime ever
> > goes over 32 bits we've got ourselves a problem) - but it should be in a
> > separate patch.
> >
> >>   {
> >>  unsigned long flags;
> >>  int ret;
> >> @@ -646,13 +663,18 @@ int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, 
> >> uint32_t *activetime,
> >>  goto out;
> >>  }
> >>
> >> +ret = update_hw_cntrs(gpu, ncntrs, cntrs);
> >> +
> >> +if (gpu->funcs->gpu_busy) {
> >> +msm_gpu_hw_sample(gpu, activetime, totaltime);
> >> +goto out;
> >> +}
> >> +
> >>  *activetime = gpu->activetime;
> >>  *totaltime = gpu->totaltime;
> >>
> >>  gpu->activetime = gpu->totaltime = 0;
> >>
> >> -ret = update_hw_cntrs(gpu, ncntrs, cntrs);
> >> -
> >>   out:
> >>  spin_unlock_irqrestore(&gpu->perf_lock, flags);
> >>
> >> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> >> index 0ff23ca..7dc775f 100644
> >> --- a/drivers/gpu/drm/msm/msm_gpu.h
> >> +++ b/drivers/gpu/d

Re: [Freedreno] [PATCH 3/3] drm/msm: Use Hardware counters for perf profiling

2018-10-26 Thread Sharat Masetty

Added Rob to this thread.

On 10/17/2018 8:05 PM, Jordan Crouse wrote:

On Wed, Oct 17, 2018 at 06:34:01PM +0530, Sharat Masetty wrote:

This patch attempts to make use of the hardware counters for GPU busy %
estimation when possible and skip using the software counters as it also
accounts for software side delays. This should help give more accurate
representation of the GPU workload.


I would leave this more to Rob as this particular file makes more sense for
freedreno than it does for us but I think in general if you want to do this
then we should do use the hardware for all targets and get rid of the
mix of the old and the new.
Thanks for the comments Jordan. Yes, I prefer the same too, but the only 
limiting factor for me is that I don't have a way to test changes for 
targets such as a3xx and a4xx, and I also do not know if there is anyone 
actually using this currently.


Hi Rob,
Can you please share your comments on this? Is it okay to remove 
software counters functionality?


Sharat



Signed-off-by: Sharat Masetty 
---
  drivers/gpu/drm/msm/msm_gpu.c  | 30 ++
  drivers/gpu/drm/msm/msm_gpu.h  |  5 +++--
  drivers/gpu/drm/msm/msm_perf.c | 10 +-
  3 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index e9b5426..a896541 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -592,6 +592,9 @@ static void update_sw_cntrs(struct msm_gpu *gpu)
uint32_t elapsed;
unsigned long flags;
  
+	if (gpu->funcs->gpu_busy)

+   return;


Like here - there isn't any reason to not have a gpu_busy for each target
so then this code could mostly go away.


spin_lock_irqsave(&gpu->perf_lock, flags);
if (!gpu->perfcntr_active)
goto out;
@@ -620,6 +623,7 @@ void msm_gpu_perfcntr_start(struct msm_gpu *gpu)
/* we could dynamically enable/disable perfcntr registers too.. */
gpu->last_sample.active = msm_gpu_active(gpu);
gpu->last_sample.time = ktime_get();
+   gpu->last_sample.busy_cycles = 0;
gpu->activetime = gpu->totaltime = 0;
gpu->perfcntr_active = true;
update_hw_cntrs(gpu, 0, NULL);
@@ -632,9 +636,22 @@ void msm_gpu_perfcntr_stop(struct msm_gpu *gpu)
pm_runtime_put_sync(&gpu->pdev->dev);
  }
  
+static void msm_gpu_hw_sample(struct msm_gpu *gpu, uint64_t *activetime,

+   uint64_t *totaltime)
+{
+   ktime_t time;
+
+   *activetime = gpu->funcs->gpu_busy(gpu,
+   &gpu->last_sample.busy_cycles);
+
+   time = ktime_get();
+   *totaltime = ktime_us_delta(time, gpu->last_sample.time);
+   gpu->last_sample.time = time;
+}


This can all be done inline in the sample function below.


  /* returns -errno or # of cntrs sampled */
-int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime,
-   uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs)
+int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime,
+   uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs)


This might be an good change (though if our activetime or totaltime ever
goes over 32 bits we've got ourselves a problem) - but it should be in a
separate patch.


  {
unsigned long flags;
int ret;
@@ -646,13 +663,18 @@ int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t 
*activetime,
goto out;
}
  
+	ret = update_hw_cntrs(gpu, ncntrs, cntrs);

+
+   if (gpu->funcs->gpu_busy) {
+   msm_gpu_hw_sample(gpu, activetime, totaltime);
+   goto out;
+   }
+
*activetime = gpu->activetime;
*totaltime = gpu->totaltime;
  
  	gpu->activetime = gpu->totaltime = 0;
  
-	ret = update_hw_cntrs(gpu, ncntrs, cntrs);

-
  out:
spin_unlock_irqrestore(&gpu->perf_lock, flags);
  
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h

index 0ff23ca..7dc775f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -90,6 +90,7 @@ struct msm_gpu {
struct {
bool active;
ktime_t time;
+   u64 busy_cycles;
} last_sample;
uint32_t totaltime, activetime;/* sw counters */
uint32_t last_cntrs[5];/* hw counters */
@@ -275,8 +276,8 @@ static inline void gpu_write64(struct msm_gpu *gpu, u32 lo, 
u32 hi, u64 val)
  
  void msm_gpu_perfcntr_start(struct msm_gpu *gpu);

  void msm_gpu_perfcntr_stop(struct msm_gpu *gpu);
-int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime,
-   uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs);
+int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime,
+   uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs);
  
  void msm_gpu_retire(struct msm_gpu *gpu);

  void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *