Hi,
[...]
> > > > > @@ -1216,6 +1217,13 @@ int live_rps_power(void *arg)
> > > > > if (11 * min.power > 10 * max.power) {
> > > > > pr_err("%s: did not conserve power when setting
> > > > > lower frequency!\n",
> > > > > engine->name);
> > > > > +
> > > > > + throttle = intel_uncore_read(gt->uncore,
> > > > > +
> > > > > intel_gt_perf_limit_reasons_reg(gt));
> > > > > +
> > > > > + pr_warn("%s: GPU throttled with reasons
> > > > > 0x%08x\n",
> > > > > + engine->name, throttle &
> > > > > GT0_PERF_LIMIT_REASONS_MASK);
> > > > > +
> > > > This feels like spamming the system messages. We are on error
> > > > path already and exiting with -EINVAL, so printing an
> > > > unconditional warning here is excessive, I think.
> > > >
> > > > [...]
> > > Got it. Based on past experience, most failures occur due to throttling.
> > > However, I can switch it to pr_info since we already print pr_err
> > > beforehand.
> > No, that would bunch up two reasons for potential failure.
> > If you experienced problems in condition check:
> >
> > if (11 * min.power > 10 * max.power)
> >
> > due to throttling, then throttling detection could use its own
> > error path, something like this could work:
> >
> > if (11 * min.power > 10 * max.power) {
> > - pr_err("%s: did not conserve power when setting
> > lower frequency!\n",
> > - engine->name);
> > + if (read_cagf(rps) != rps->max_freq) {
> > + throttle = intel_uncore_read(gt->uncore,
> > +
> > intel_gt_perf_limit_reasons_reg(gt));
> > + pr_err("%s: GPU throttled with reasons
> > 0x%08x\n",
> > + engine->name, throttle &
> > GT0_PERF_LIMIT_REASONS_MASK);
> > + } else {
> > + pr_err("%s: did not conserve power when setting
> > lower frequency!\n",
> > + engine->name);
> > + }
> > +
> > err = -EINVAL;
> > break;
> > }
> >
> > The main goal would be to differentiate and print only one
> > reason for failure, instead of emitting two of them and leaving
> > people guessing which one of the two was the real reason the
> > function returned with -EINVAL.
> >
> > I did not test the code above, so it may require some changes.
>
> Throttle is not considered a failure in this context; it serves only as a
> debug indicator or a potential reason for failure.
This is a good reason not to include it straight away in the logs.
>
> The primary objective of the test is to measure power, and any failure will
> be related to power metrics.
>
> Throttling is viewed as a defensive mechanism rather than a failure
> condition.
Then print the log when throttling happened
(i.e., when if (read_cagf(rps) != rps->max_freq) is true):
if (11 * min.power > 10 * max.power) {
pr_err("%s: did not conserve power when setting lower
frequency!\n",
engine->name);
+ if (read_cagf(rps) != rps->max_freq) {
+ throttle = intel_uncore_read(gt->uncore,
+
intel_gt_perf_limit_reasons_reg(gt));
+ pr_warn("%s: GPU throttled with reasons 0x%08x\n",
+ engine->name, throttle &
GT0_PERF_LIMIT_REASONS_MASK);
+ }
+
err = -EINVAL;
break;
}
Unless, you are certain that a throttle will always happen in
this error path, that additional "if" is necessary.
If this is a debug log anyway, then choosing pr_info or pr_warn
is not that important, so you may leave it as you see fit.
--
Best Regards,
Krzysztof