Re: [Freedreno] [PATCH 3/3] drm/msm: Use Hardware counters for perf profiling
On Fri, Oct 26, 2018 at 9:46 AM Sharat Masetty wrote: > > Added Rob to this thread. > > On 10/17/2018 8:05 PM, Jordan Crouse wrote: > > On Wed, Oct 17, 2018 at 06:34:01PM +0530, Sharat Masetty wrote: > >> This patch attempts to make use of the hardware counters for GPU busy % > >> estimation when possible and skip using the software counters as it also > >> accounts for software side delays. This should help give more accurate > >> representation of the GPU workload. > > > > I would leave this more to Rob as this particular file makes more sense for > > freedreno than it does for us but I think in general if you want to do this > > then we should do use the hardware for all targets and get rid of the > > mix of the old and the new. > Thanks for the comments Jordan. Yes, I prefer the same too, but the only > limiting factor for me is that I don't have a way to test changes for > targets such as a3xx and a4xx, and I also do not know if there is anyone > actually using this currently. > > Hi Rob, > Can you please share your comments on this? Is it okay to remove > software counters functionality? In principle yes.. although one side-comment is that there are patches floating around for a2xx, which is from long enough ago that I don't remember what the perf-ctr situation is there. I guess if we can be reasonably confident to implement it w/ hw counters for all the generations, then I don't see a big need to keep the sw counter functionality. I should, in theory, be able to test a3xx/a4xx/a5xx.. a4xx might be a bit harder to get a recent upstream kernel running on BR, -R > > Sharat > > > >> Signed-off-by: Sharat Masetty > >> --- > >> drivers/gpu/drm/msm/msm_gpu.c | 30 ++ > >> drivers/gpu/drm/msm/msm_gpu.h | 5 +++-- > >> drivers/gpu/drm/msm/msm_perf.c | 10 +- > >> 3 files changed, 34 insertions(+), 11 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c > >> index e9b5426..a896541 100644 > >> --- a/drivers/gpu/drm/msm/msm_gpu.c > >> +++ b/drivers/gpu/drm/msm/msm_gpu.c > >> @@ -592,6 +592,9 @@ static void update_sw_cntrs(struct msm_gpu *gpu) > >> uint32_t elapsed; > >> unsigned long flags; > >> > >> +if (gpu->funcs->gpu_busy) > >> +return; > > > > Like here - there isn't any reason to not have a gpu_busy for each target > > so then this code could mostly go away. > > > >> spin_lock_irqsave(&gpu->perf_lock, flags); > >> if (!gpu->perfcntr_active) > >> goto out; > >> @@ -620,6 +623,7 @@ void msm_gpu_perfcntr_start(struct msm_gpu *gpu) > >> /* we could dynamically enable/disable perfcntr registers too.. */ > >> gpu->last_sample.active = msm_gpu_active(gpu); > >> gpu->last_sample.time = ktime_get(); > >> +gpu->last_sample.busy_cycles = 0; > >> gpu->activetime = gpu->totaltime = 0; > >> gpu->perfcntr_active = true; > >> update_hw_cntrs(gpu, 0, NULL); > >> @@ -632,9 +636,22 @@ void msm_gpu_perfcntr_stop(struct msm_gpu *gpu) > >> pm_runtime_put_sync(&gpu->pdev->dev); > >> } > >> > >> +static void msm_gpu_hw_sample(struct msm_gpu *gpu, uint64_t *activetime, > >> +uint64_t *totaltime) > >> +{ > >> +ktime_t time; > >> + > >> +*activetime = gpu->funcs->gpu_busy(gpu, > >> +&gpu->last_sample.busy_cycles); > >> + > >> +time = ktime_get(); > >> +*totaltime = ktime_us_delta(time, gpu->last_sample.time); > >> +gpu->last_sample.time = time; > >> +} > > > > This can all be done inline in the sample function below. > > > >> /* returns -errno or # of cntrs sampled */ > >> -int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime, > >> -uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs) > >> +int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime, > >> +uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs) > > > > This might be an good change (though if our activetime or totaltime ever > > goes over 32 bits we've got ourselves a problem) - but it should be in a > > separate patch. > > > >> { > >> unsigned long flags; > >> int ret; > >> @@ -646,13 +663,18 @@ int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, > >> uint32_t *activetime, > >> goto out; > >> } > >> > >> +ret = update_hw_cntrs(gpu, ncntrs, cntrs); > >> + > >> +if (gpu->funcs->gpu_busy) { > >> +msm_gpu_hw_sample(gpu, activetime, totaltime); > >> +goto out; > >> +} > >> + > >> *activetime = gpu->activetime; > >> *totaltime = gpu->totaltime; > >> > >> gpu->activetime = gpu->totaltime = 0; > >> > >> -ret = update_hw_cntrs(gpu, ncntrs, cntrs); > >> - > >> out: > >> spin_unlock_irqrestore(&gpu->perf_lock, flags); > >> > >> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h > >> index 0ff23ca..7dc775f 100644 > >> --- a/drivers/gpu/drm/msm/msm_gpu.h > >> +++ b/drivers/gpu/d
Re: [Freedreno] [PATCH 3/3] drm/msm: Use Hardware counters for perf profiling
Added Rob to this thread. On 10/17/2018 8:05 PM, Jordan Crouse wrote: On Wed, Oct 17, 2018 at 06:34:01PM +0530, Sharat Masetty wrote: This patch attempts to make use of the hardware counters for GPU busy % estimation when possible and skip using the software counters as it also accounts for software side delays. This should help give more accurate representation of the GPU workload. I would leave this more to Rob as this particular file makes more sense for freedreno than it does for us but I think in general if you want to do this then we should do use the hardware for all targets and get rid of the mix of the old and the new. Thanks for the comments Jordan. Yes, I prefer the same too, but the only limiting factor for me is that I don't have a way to test changes for targets such as a3xx and a4xx, and I also do not know if there is anyone actually using this currently. Hi Rob, Can you please share your comments on this? Is it okay to remove software counters functionality? Sharat Signed-off-by: Sharat Masetty --- drivers/gpu/drm/msm/msm_gpu.c | 30 ++ drivers/gpu/drm/msm/msm_gpu.h | 5 +++-- drivers/gpu/drm/msm/msm_perf.c | 10 +- 3 files changed, 34 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c index e9b5426..a896541 100644 --- a/drivers/gpu/drm/msm/msm_gpu.c +++ b/drivers/gpu/drm/msm/msm_gpu.c @@ -592,6 +592,9 @@ static void update_sw_cntrs(struct msm_gpu *gpu) uint32_t elapsed; unsigned long flags; + if (gpu->funcs->gpu_busy) + return; Like here - there isn't any reason to not have a gpu_busy for each target so then this code could mostly go away. spin_lock_irqsave(&gpu->perf_lock, flags); if (!gpu->perfcntr_active) goto out; @@ -620,6 +623,7 @@ void msm_gpu_perfcntr_start(struct msm_gpu *gpu) /* we could dynamically enable/disable perfcntr registers too.. */ gpu->last_sample.active = msm_gpu_active(gpu); gpu->last_sample.time = ktime_get(); + gpu->last_sample.busy_cycles = 0; gpu->activetime = gpu->totaltime = 0; gpu->perfcntr_active = true; update_hw_cntrs(gpu, 0, NULL); @@ -632,9 +636,22 @@ void msm_gpu_perfcntr_stop(struct msm_gpu *gpu) pm_runtime_put_sync(&gpu->pdev->dev); } +static void msm_gpu_hw_sample(struct msm_gpu *gpu, uint64_t *activetime, + uint64_t *totaltime) +{ + ktime_t time; + + *activetime = gpu->funcs->gpu_busy(gpu, + &gpu->last_sample.busy_cycles); + + time = ktime_get(); + *totaltime = ktime_us_delta(time, gpu->last_sample.time); + gpu->last_sample.time = time; +} This can all be done inline in the sample function below. /* returns -errno or # of cntrs sampled */ -int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime, - uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs) +int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime, + uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs) This might be an good change (though if our activetime or totaltime ever goes over 32 bits we've got ourselves a problem) - but it should be in a separate patch. { unsigned long flags; int ret; @@ -646,13 +663,18 @@ int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime, goto out; } + ret = update_hw_cntrs(gpu, ncntrs, cntrs); + + if (gpu->funcs->gpu_busy) { + msm_gpu_hw_sample(gpu, activetime, totaltime); + goto out; + } + *activetime = gpu->activetime; *totaltime = gpu->totaltime; gpu->activetime = gpu->totaltime = 0; - ret = update_hw_cntrs(gpu, ncntrs, cntrs); - out: spin_unlock_irqrestore(&gpu->perf_lock, flags); diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index 0ff23ca..7dc775f 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -90,6 +90,7 @@ struct msm_gpu { struct { bool active; ktime_t time; + u64 busy_cycles; } last_sample; uint32_t totaltime, activetime;/* sw counters */ uint32_t last_cntrs[5];/* hw counters */ @@ -275,8 +276,8 @@ static inline void gpu_write64(struct msm_gpu *gpu, u32 lo, u32 hi, u64 val) void msm_gpu_perfcntr_start(struct msm_gpu *gpu); void msm_gpu_perfcntr_stop(struct msm_gpu *gpu); -int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint32_t *activetime, - uint32_t *totaltime, uint32_t ncntrs, uint32_t *cntrs); +int msm_gpu_perfcntr_sample(struct msm_gpu *gpu, uint64_t *activetime, + uint64_t *totaltime, uint32_t ncntrs, uint32_t *cntrs); void msm_gpu_retire(struct msm_gpu *gpu); void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *