On Tue, May 5, 2026 at 12:45 PM Tomeu Vizoso <[email protected]> wrote:
>
> Hi Rob,
>
> On Fri, Mar 6, 2026 at 5:38 PM Rob Herring (Arm) <[email protected]> wrote:
> >
> > The Arm Ethos-U NPUs have a PMU with performance counters. The PMU h/w
> > supports up to 4 (U65) or 8 (U85) counters which can be programmed for
> > different events. There is also a dedicated cycle counter.
> >
> > The ABI and implementation are copied from the V3D driver. The main
> > difference in the ABI is there is no query API for the the event list.
> > The events differ between the U65 and U85, so the events lists are
> > maintained in userspace along with other differences between the U65 and
> > U85.
> >
> > The cycle counter is always enabled when the PMU is enabled. When the
> > user requests N events, reading the counters will return the N events
> > plus the cycle counter.
> >
> > Signed-off-by: Rob Herring (Arm) <[email protected]>
> > ---
>
> <snip>
>
> >
> >  struct ethosu_file_priv {
> >         struct ethosu_device *edev;
> >         struct drm_sched_entity sched_entity;
> > +
> > +       struct {
> > +               struct idr idr;
> > +               struct mutex lock;
> > +       } perfmon;
> > +};
> > +
> > +/* Performance monitor object. The perform lifetime is controlled by 
> > userspace
> > + * using perfmon related ioctls. A perfmon can be attached to a submit_cl
>
> I guess submit_cl is more of a v3d thing?
>
> > + * request, and when this is the case, HW perf counters will be activated 
> > just
> > + * before the submit_cl is submitted to the GPU and disabled when the job 
> > is
> > + * done. This way, only events related to a specific job will be counted.
> > + */
> > +struct ethosu_perfmon {
> > +       /* Tracks the number of users of the perfmon, when this counter 
> > reaches
> > +        * zero the perfmon is destroyed.
> > +        */
> > +       refcount_t refcnt;
> > +
> > +       /* Protects perfmon stop, as it can be invoked from multiple 
> > places. */
> > +       struct mutex lock;
> > +
> > +       /* Number of counters activated in this perfmon instance
> > +        * (should be less than or equal to DRM_ETHOSU_MAX_PERF_COUNTERS).
> > +        */
> > +       u8 ncounters;
> > +
> > +       /* Events counted by the HW perf counters. */
> > +       u16 counters[DRM_ETHOSU_MAX_PERF_EVENT_COUNTERS];
> > +
> > +       /* Storage for counter values. Counters are incremented by the
> > +        * HW perf counter values every time the perfmon is attached
> > +        * to a GPU job.  This way, perfmon users don't have to
> > +        * retrieve the results after each job if they want to track
> > +        * events covering several submissions.  Note that counter
> > +        * values can't be reset, but you can fake a reset by
> > +        * destroying the perfmon and creating a new one.
> > +        */
> > +       u64 values[] __counted_by(ncounters);
> >  };
>
> <snip>
>
> > +void ethosu_perfmon_start(struct ethosu_device *ethosu, struct 
> > ethosu_perfmon *perfmon)
> > +{
> > +       unsigned int i;
> > +       u8 ncounters;
> > +       u32 mask;
> > +
> > +       if (WARN_ON_ONCE(!perfmon || ethosu->active_perfmon))
> > +               return;
> > +
> > +       writel_relaxed(PMCR_CNT_EN, ethosu->pmu_regs + NPU_REG_PMCR);
> > +       writel_relaxed(0x000011, ethosu->pmu_regs + NPU_REG_PMCCNTR_CFG);
>
> Can we use macros for this value so we have it documented in the source code?
>
> <snip>
>
> > +void ethosu_perfmon_stop(struct ethosu_device *ethosu, struct 
> > ethosu_perfmon *perfmon,
> > +                     bool capture)
> > +{
> > +       unsigned int i;
> > +       u8 ncounters;
> > +       u32 mask;
> > +
> > +       if (!perfmon || !ethosu->active_perfmon)
> > +               return;
> > +
> > +       mutex_lock(&perfmon->lock);
> > +       if (perfmon != ethosu->active_perfmon) {
> > +               mutex_unlock(&perfmon->lock);
> > +               return;
> > +       }
> > +
> > +       ncounters = perfmon->ncounters - 1;
> > +
> > +       if (capture) {
> > +               for (i = 0; i < ncounters; i++)
> > +                       perfmon->values[i] += 
> > readl_relaxed(ethosu->pmu_regs + NPU_REG_PMU_EVCNTR(i));
> > +               perfmon->values[ncounters] +=
> > +                       readl_relaxed(ethosu->pmu_regs + 
> > NPU_REG_PMCCNTR_LO) |
> > +                       (u64)readl_relaxed(ethosu->pmu_regs + 
> > NPU_REG_PMCCNTR_HI) << 32;
> > +       }
> > +
> > +       mask = 0x80000000;
> > +       if (ncounters)
> > +               mask |= GENMASK(ncounters - 1, 0);
> > +       writel_relaxed(mask, ethosu->pmu_regs + NPU_REG_PMCNTENCLR);
> > +
> > +       writel_relaxed(0, ethosu->pmu_regs + NPU_REG_PMCR);
> > +       ethosu->active_perfmon = NULL;
> > +       mutex_unlock(&perfmon->lock);
>
> Wonder if the npu-idle counter won't continue to accumulate until we
> call stop(), and in that case, if we shouldn't stop the counters on
> the job IRQ. So we can know for how long the NPU was idle during the
> job, without extra idle time.

>From what the h/w folks have told me, that is the only idle time. When
the job starts, the idle time is 0 and active counts every clock cycle
until we hit the STOP cmd.

>
> I think we could freeze the counters in the job_done IRQ, and harvest
> the values on job cleanup. At freeze time it would be good to also
> capture a timestamp and return to userspace in get_values, so that
> userspace doesn't have to take it with considerable delay (for tools
> such as Perfetto).

Note that perfetto uses the global perfmon which I think we want free
running and sampled.

Essentially, all the events other than idle and cycle cnt are stopped.

>
> <snip>
>
> > +int ethosu_ioctl_perfmon_create(struct drm_device *dev, void *data,
> > +                            struct drm_file *file_priv)
> > +{
> > +       struct ethosu_file_priv *ethosu_priv = file_priv->driver_priv;
> > +       struct drm_ethosu_perfmon_create *req = data;
> > +       struct ethosu_device *ethosu = to_ethosu_device(dev);
> > +       struct ethosu_perfmon *perfmon;
> > +       unsigned int i, event_max;
> > +       int ret;
> > +
> > +       /* Number of monitored counters cannot exceed HW limits. */
> > +       if (req->ncounters > ethosu->npu_info.pmu_counters)
> > +               return -EINVAL;
> > +
> > +       /* Make sure all counters are valid. */
> > +       event_max = ethosu_is_u65(ethosu) ? 433 : 671;
>
> Can we have macros for these constants?
>
> <snip>
>
> > +int ethosu_ioctl_perfmon_get_values(struct drm_device *dev, void *data,
> > +                                struct drm_file *file_priv)
> > +{
> > +       struct ethosu_device *ethosu = to_ethosu_device(dev);
> > +       struct ethosu_file_priv *ethosu_priv = file_priv->driver_priv;
> > +       struct drm_ethosu_perfmon_get_values *req = data;
> > +       struct ethosu_perfmon *perfmon;
> > +       int ret = 0;
> > +
> > +       if (req->pad != 0)
> > +               return -EINVAL;
> > +
> > +       perfmon = ethosu_perfmon_find(ethosu_priv, req->id);
> > +       if (!perfmon)
> > +               return -EINVAL;
> > +
> > +       ret = pm_runtime_resume_and_get(dev->dev);
> > +       if (ret)
>
> Should we cal put() here?

Err, yes.

>
> > +               return ret;
> > +
> > +       ethosu_perfmon_stop(ethosu, perfmon, true);
> > +
> > +       pm_runtime_put_autosuspend(dev->dev);
> > +
> > +       if (copy_to_user(u64_to_user_ptr(req->values_ptr), perfmon->values,
> > +                        perfmon->ncounters * sizeof(u64)))
>
> And here?
>
> > +               ret = -EFAULT;
> > +
> > +       ethosu_perfmon_put(perfmon);
> > +
> > +       return ret;
> > +}
> > +
> > +int ethosu_ioctl_perfmon_set_global(struct drm_device *dev, void *data,
> > +                                struct drm_file *file_priv)
> > +{
> > +       struct ethosu_file_priv *ethosu_priv = file_priv->driver_priv;
> > +       struct drm_ethosu_perfmon_set_global *req = data;
> > +       struct ethosu_device *ethosu = to_ethosu_device(dev);
> > +       struct ethosu_perfmon *perfmon;
> > +
> > +       if (req->flags & ~DRM_ETHOSU_PERFMON_CLEAR_GLOBAL)
> > +               return -EINVAL;
> > +
> > +       perfmon = ethosu_perfmon_find(ethosu_priv, req->id);
>
> Aren't we leaking this second reference?

No, this function does the get and then userspace calls
ethosu_ioctl_perfmon_destroy() to do the put. However, I think if
userspace forgets or crashes, then we need to do a put.

Rob

Reply via email to