Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-11 Thread Jan Beulich
On 12.05.2023 03:02, Marek Marczykowski-Górecki wrote:
> On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
>> On 10.05.2023 15:54, Jason Andryuk wrote:
>>> On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
 On 05.05.2023 17:35, Jason Andryuk wrote:
> On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
> The other issue is that if you select "hwp" as the governor, but HWP
> hardware support is not available, then hwp_available() needs to reset
> the governor back to the default.  This feels like a layering
> violation.

 Layering violation - yes. But why would the governor need resetting in
 this case? If HWP was asked for but isn't available, I don't think any
 other cpufreq handling (and hence governor) should be put in place.
 And turning off cpufreq altogether (if necessary in the first place)
 wouldn't, to me, feel as much like a layering violation.
>>>
>>> My goal was for Xen to use HWP if available and fallback to the acpi
>>> cpufreq driver if not.  That to me seems more user-friendly than
>>> disabling cpufreq.
>>>
>>> if ( hwp_available() )
>>> ret = hwp_register_driver();
>>> else
>>> ret = cpufreq_register_driver(&acpi_cpufreq_driver);
>>
>> That's fine as a (future) default, but for now using hwp requires a
>> command line option, and if that option says "hwp" then it ought to
>> be hwp imo.
> 
> As a downstrem distribution, I'd strongly prefer to have an option that
> would enable HWP when present and fallback to other driver otherwise,
> even if that isn't the default upstream. I can't possibly require large
> group of users (either HWP-having or HWP-not-having) to edit the Xen
> cmdline to get power management working well.
> 
> If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
> is not available", then maybe it should be named cpufreq=try-hwp
> instead, or cpufreq=prefer-hwp or something else like this?

Any new sub-option needs to fit the existing ones in its meaning. I
could see e.g. "cpufreq=xen" alone to effect what you want (once hwp
becomes available for use by default). But (for now at least) I
continue to think that a request for "hwp" ought to mean HWP.

Jan



Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-11 Thread Marek Marczykowski-Górecki
On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote:
> On 10.05.2023 15:54, Jason Andryuk wrote:
> > On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
> >> On 05.05.2023 17:35, Jason Andryuk wrote:
> >>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
> >>> The other issue is that if you select "hwp" as the governor, but HWP
> >>> hardware support is not available, then hwp_available() needs to reset
> >>> the governor back to the default.  This feels like a layering
> >>> violation.
> >>
> >> Layering violation - yes. But why would the governor need resetting in
> >> this case? If HWP was asked for but isn't available, I don't think any
> >> other cpufreq handling (and hence governor) should be put in place.
> >> And turning off cpufreq altogether (if necessary in the first place)
> >> wouldn't, to me, feel as much like a layering violation.
> > 
> > My goal was for Xen to use HWP if available and fallback to the acpi
> > cpufreq driver if not.  That to me seems more user-friendly than
> > disabling cpufreq.
> > 
> > if ( hwp_available() )
> > ret = hwp_register_driver();
> > else
> > ret = cpufreq_register_driver(&acpi_cpufreq_driver);
> 
> That's fine as a (future) default, but for now using hwp requires a
> command line option, and if that option says "hwp" then it ought to
> be hwp imo.

As a downstrem distribution, I'd strongly prefer to have an option that
would enable HWP when present and fallback to other driver otherwise,
even if that isn't the default upstream. I can't possibly require large
group of users (either HWP-having or HWP-not-having) to edit the Xen
cmdline to get power management working well.

If the meaning for cpufreq=hwp absolutely must include "nothing if HWP
is not available", then maybe it should be named cpufreq=try-hwp
instead, or cpufreq=prefer-hwp or something else like this?

> > If we are setting cpufreq_opt_governor to enter hwp_available(), but
> > then HWP isn't available, it seems to me that we need to reset
> > cpufreq_opt_governor when exiting hwp_available() false.
> 
> This may be necessary in the future, but shouldn't be necessary right
> now. It's not entirely clear to me how that future is going to look
> like, command line option wise.
> 
> Jan
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-10 Thread Jan Beulich
On 10.05.2023 15:54, Jason Andryuk wrote:
> On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
>> On 05.05.2023 17:35, Jason Andryuk wrote:
>>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
>>> The other issue is that if you select "hwp" as the governor, but HWP
>>> hardware support is not available, then hwp_available() needs to reset
>>> the governor back to the default.  This feels like a layering
>>> violation.
>>
>> Layering violation - yes. But why would the governor need resetting in
>> this case? If HWP was asked for but isn't available, I don't think any
>> other cpufreq handling (and hence governor) should be put in place.
>> And turning off cpufreq altogether (if necessary in the first place)
>> wouldn't, to me, feel as much like a layering violation.
> 
> My goal was for Xen to use HWP if available and fallback to the acpi
> cpufreq driver if not.  That to me seems more user-friendly than
> disabling cpufreq.
> 
> if ( hwp_available() )
> ret = hwp_register_driver();
> else
> ret = cpufreq_register_driver(&acpi_cpufreq_driver);

That's fine as a (future) default, but for now using hwp requires a
command line option, and if that option says "hwp" then it ought to
be hwp imo.

> If we are setting cpufreq_opt_governor to enter hwp_available(), but
> then HWP isn't available, it seems to me that we need to reset
> cpufreq_opt_governor when exiting hwp_available() false.

This may be necessary in the future, but shouldn't be necessary right
now. It's not entirely clear to me how that future is going to look
like, command line option wise.

Jan



Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-10 Thread Jason Andryuk
On Mon, May 8, 2023 at 2:33 AM Jan Beulich  wrote:
>
> On 05.05.2023 17:35, Jason Andryuk wrote:
> > On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
> >>
> >> On 04.05.2023 18:56, Jason Andryuk wrote:
> >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
>  On 01.05.2023 21:30, Jason Andryuk wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> > oprofile, rather than detectin
> >  available support.
> >
> >  ### cpufreq
> > -> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
> >  | dom0-kernel`
> > +> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
> >  | dom0-kernel`
> 
>  Considering you use a special internal governor, the 4 governor 
>  alternatives are
>  meaningless for hwp. Hence at the command line level recognizing "hwp" 
>  as if it
>  was another governor name would seem better to me. This would then also 
>  get rid
>  of one of the two special "no-" prefix parsing cases (which I'm not 
>  overly
>  happy about).
> 
>  Even if not done that way I'm puzzled by the way you spell out the 
>  interaction
>  of "hwp" and "hdc": As you say in the description, "hdc" is meaningful 
>  only when
>  "hwp" was specified, so even if not merged with the governors group 
>  "hwp" should
>  come first, and "hdc" ought to be rejected if "hwp" wasn't first 
>  specified. (The
>  way you've spelled it out it actually looks to be kind of the other way 
>  around.)
> >>>
> >>> I placed them in alphabetical order, but, yes, it doesn't make sense.
> >>>
>  Strictly speaking "maxfreq" and "minfreq" also should be objected to 
>  when "hwp"
>  was specified.
> 
>  Overall I'm getting the impression that beyond your "verbose" related 
>  adjustment
>  more is needed, if you're meaning to get things closer to how we parse 
>  the
>  option (splitting across multiple lines to help see what I mean):
> 
>  `= none
>   | {{  | xen } [:{powersave|performance|ondemand|userspace}
>    
>  [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
>    [,verbose]]}
>   | dom0-kernel`
> 
>  (We're still parsing in a more relaxed way, e.g. minfreq may come ahead 
>  of
>  maxfreq, but better be more tight in the doc than too relaxed.)
> 
>  Furthermore while max/min freq don't apply directly, there are still two 
>  MSRs
>  controlling bounds at the package and logical processor levels.
> >>>
> >>> Well, we only program the logical processor level MSRs because we
> >>> don't have a good idea of the packages to know when we can skip
> >>> writing an MSR.
> >>>
> >>> How about this:
> >>> `= none
> >>>  | {{  | xen } {
> >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
> >>> | [:hwp[,hdc]] }
> >>>   [,verbose]]}
> >>>  | dom0-kernel`
> >>
> >> Looks right, yes.
> >
> > There is a wrinkle to using the hwp governor.  The hwp governor was
> > named "hwp-internal", so it needs to be renamed to "hwp" for use with
> > command line parsing.  That means the checking for "-internal" needs
> > to change to just "hwp" which removes the generality of the original
> > implementation.
>
> I'm afraid I don't see why this would strictly be necessary or a
> consequence.

Maybe I took your comment too far when you mentioned using hwp as a
fake governor.  I used the actual HWP struct cpufreq_governor to hook
into cpufreq_cmdline_parse().  cpufreq_cmdline_parse() uses the that
name for comparison.  But the naming stops being an issue if struct
cpufreq_governor gains a bool .internal flag.  That flag also makes
the registration checks clearer.

> > The other issue is that if you select "hwp" as the governor, but HWP
> > hardware support is not available, then hwp_available() needs to reset
> > the governor back to the default.  This feels like a layering
> > violation.
>
> Layering violation - yes. But why would the governor need resetting in
> this case? If HWP was asked for but isn't available, I don't think any
> other cpufreq handling (and hence governor) should be put in place.
> And turning off cpufreq altogether (if necessary in the first place)
> wouldn't, to me, feel as much like a layering violation.

My goal was for Xen to use HWP if available and fallback to the acpi
cpufreq driver if not.  That to me seems more user-friendly than
disabling cpufreq.

if ( hwp_available() )
ret = hwp_register_driver();
else
ret = cpufreq_register_driver(&acpi_cpufreq_driver);

If we are setting cpufreq_opt_governor to enter hwp_available(), but
then HWP is

Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-07 Thread Jan Beulich
On 05.05.2023 17:35, Jason Andryuk wrote:
> On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
>>
>> On 04.05.2023 18:56, Jason Andryuk wrote:
>>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
 On 01.05.2023 21:30, Jason Andryuk wrote:
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> oprofile, rather than detectin
>  available support.
>
>  ### cpufreq
> -> `= none | {{  | xen } 
> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
>  | dom0-kernel`
> +> `= none | {{  | xen } 
> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
>  | dom0-kernel`

 Considering you use a special internal governor, the 4 governor 
 alternatives are
 meaningless for hwp. Hence at the command line level recognizing "hwp" as 
 if it
 was another governor name would seem better to me. This would then also 
 get rid
 of one of the two special "no-" prefix parsing cases (which I'm not overly
 happy about).

 Even if not done that way I'm puzzled by the way you spell out the 
 interaction
 of "hwp" and "hdc": As you say in the description, "hdc" is meaningful 
 only when
 "hwp" was specified, so even if not merged with the governors group "hwp" 
 should
 come first, and "hdc" ought to be rejected if "hwp" wasn't first 
 specified. (The
 way you've spelled it out it actually looks to be kind of the other way 
 around.)
>>>
>>> I placed them in alphabetical order, but, yes, it doesn't make sense.
>>>
 Strictly speaking "maxfreq" and "minfreq" also should be objected to when 
 "hwp"
 was specified.

 Overall I'm getting the impression that beyond your "verbose" related 
 adjustment
 more is needed, if you're meaning to get things closer to how we parse the
 option (splitting across multiple lines to help see what I mean):

 `= none
  | {{  | xen } [:{powersave|performance|ondemand|userspace}
   
 [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
   [,verbose]]}
  | dom0-kernel`

 (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
 maxfreq, but better be more tight in the doc than too relaxed.)

 Furthermore while max/min freq don't apply directly, there are still two 
 MSRs
 controlling bounds at the package and logical processor levels.
>>>
>>> Well, we only program the logical processor level MSRs because we
>>> don't have a good idea of the packages to know when we can skip
>>> writing an MSR.
>>>
>>> How about this:
>>> `= none
>>>  | {{  | xen } {
>>> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
>>> | [:hwp[,hdc]] }
>>>   [,verbose]]}
>>>  | dom0-kernel`
>>
>> Looks right, yes.
> 
> There is a wrinkle to using the hwp governor.  The hwp governor was
> named "hwp-internal", so it needs to be renamed to "hwp" for use with
> command line parsing.  That means the checking for "-internal" needs
> to change to just "hwp" which removes the generality of the original
> implementation.

I'm afraid I don't see why this would strictly be necessary or a
consequence.

> The other issue is that if you select "hwp" as the governor, but HWP
> hardware support is not available, then hwp_available() needs to reset
> the governor back to the default.  This feels like a layering
> violation.

Layering violation - yes. But why would the governor need resetting in
this case? If HWP was asked for but isn't available, I don't think any
other cpufreq handling (and hence governor) should be put in place.
And turning off cpufreq altogether (if necessary in the first place)
wouldn't, to me, feel as much like a layering violation.

> I'm still investigating, but promoting hwp to a top level option -
> cpufreq=hwp - might be a better arrangement.

Might be an alternative, yes.

Jan



Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-05 Thread Jason Andryuk
On Fri, May 5, 2023 at 3:01 AM Jan Beulich  wrote:
>
> On 04.05.2023 18:56, Jason Andryuk wrote:
> > On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> --- a/docs/misc/xen-command-line.pandoc
> >>> +++ b/docs/misc/xen-command-line.pandoc
> >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> >>> oprofile, rather than detectin
> >>>  available support.
> >>>
> >>>  ### cpufreq
> >>> -> `= none | {{  | xen } 
> >>> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
> >>>  | dom0-kernel`
> >>> +> `= none | {{  | xen } 
> >>> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
> >>>  | dom0-kernel`
> >>
> >> Considering you use a special internal governor, the 4 governor 
> >> alternatives are
> >> meaningless for hwp. Hence at the command line level recognizing "hwp" as 
> >> if it
> >> was another governor name would seem better to me. This would then also 
> >> get rid
> >> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >> happy about).
> >>
> >> Even if not done that way I'm puzzled by the way you spell out the 
> >> interaction
> >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful 
> >> only when
> >> "hwp" was specified, so even if not merged with the governors group "hwp" 
> >> should
> >> come first, and "hdc" ought to be rejected if "hwp" wasn't first 
> >> specified. (The
> >> way you've spelled it out it actually looks to be kind of the other way 
> >> around.)
> >
> > I placed them in alphabetical order, but, yes, it doesn't make sense.
> >
> >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when 
> >> "hwp"
> >> was specified.
> >>
> >> Overall I'm getting the impression that beyond your "verbose" related 
> >> adjustment
> >> more is needed, if you're meaning to get things closer to how we parse the
> >> option (splitting across multiple lines to help see what I mean):
> >>
> >> `= none
> >>  | {{  | xen } [:{powersave|performance|ondemand|userspace}
> >>   
> >> [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
> >>   [,verbose]]}
> >>  | dom0-kernel`
> >>
> >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >> maxfreq, but better be more tight in the doc than too relaxed.)
> >>
> >> Furthermore while max/min freq don't apply directly, there are still two 
> >> MSRs
> >> controlling bounds at the package and logical processor levels.
> >
> > Well, we only program the logical processor level MSRs because we
> > don't have a good idea of the packages to know when we can skip
> > writing an MSR.
> >
> > How about this:
> > `= none
> >  | {{  | xen } {
> > [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
> > | [:hwp[,hdc]] }
> >   [,verbose]]}
> >  | dom0-kernel`
>
> Looks right, yes.

There is a wrinkle to using the hwp governor.  The hwp governor was
named "hwp-internal", so it needs to be renamed to "hwp" for use with
command line parsing.  That means the checking for "-internal" needs
to change to just "hwp" which removes the generality of the original
implementation.

The other issue is that if you select "hwp" as the governor, but HWP
hardware support is not available, then hwp_available() needs to reset
the governor back to the default.  This feels like a layering
violation.

I'm still investigating, but promoting hwp to a top level option -
cpufreq=hwp - might be a better arrangement.

> >>> +union hwp_request
> >>> +{
> >>> +struct
> >>> +{
> >>> +uint64_t min_perf:8;
> >>> +uint64_t max_perf:8;
> >>> +uint64_t desired:8;
> >>> +uint64_t energy_perf:8;
> >>> +uint64_t activity_window:10;
> >>> +uint64_t package_control:1;
> >>> +uint64_t reserved:16;
> >>> +uint64_t activity_window_valid:1;
> >>> +uint64_t energy_perf_valid:1;
> >>> +uint64_t desired_valid:1;
> >>> +uint64_t max_perf_valid:1;
> >>> +uint64_t min_perf_valid:1;
> >>
> >> The boolean fields here would probably better be of type "bool". I also
> >> don't see the need for using uint64_t for any of the other fields -
> >> unsigned int will be quite fine, I think. Only ...
> >
> > This is the hardware MSR format, so it seemed natural to use uint64_t
> > and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> > are dividing up a single hardware register using bit fields.
> > Honestly, I'm unfamiliar with the finer points of laying out bitfields
> > with bool.  And the 10 bits of activity window throws off aligning to
> > standard types.
> >
> > This seems to have the correct layout:
> > struct
> > {
> > unsigned char min_perf;
> > unsigned char max_perf;
> > unsigned char desired;
> > unsigned char energy_perf;
> > unsigned int activity_window:10;
> > 

Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-05 Thread Jan Beulich
On 04.05.2023 18:56, Jason Andryuk wrote:
> On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
>> On 01.05.2023 21:30, Jason Andryuk wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
>>> oprofile, rather than detectin
>>>  available support.
>>>
>>>  ### cpufreq
>>> -> `= none | {{  | xen } 
>>> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
>>>  | dom0-kernel`
>>> +> `= none | {{  | xen } 
>>> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
>>>  | dom0-kernel`
>>
>> Considering you use a special internal governor, the 4 governor alternatives 
>> are
>> meaningless for hwp. Hence at the command line level recognizing "hwp" as if 
>> it
>> was another governor name would seem better to me. This would then also get 
>> rid
>> of one of the two special "no-" prefix parsing cases (which I'm not overly
>> happy about).
>>
>> Even if not done that way I'm puzzled by the way you spell out the 
>> interaction
>> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only 
>> when
>> "hwp" was specified, so even if not merged with the governors group "hwp" 
>> should
>> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. 
>> (The
>> way you've spelled it out it actually looks to be kind of the other way 
>> around.)
> 
> I placed them in alphabetical order, but, yes, it doesn't make sense.
> 
>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when 
>> "hwp"
>> was specified.
>>
>> Overall I'm getting the impression that beyond your "verbose" related 
>> adjustment
>> more is needed, if you're meaning to get things closer to how we parse the
>> option (splitting across multiple lines to help see what I mean):
>>
>> `= none
>>  | {{  | xen } [:{powersave|performance|ondemand|userspace}
>>   
>> [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
>>   [,verbose]]}
>>  | dom0-kernel`
>>
>> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
>> maxfreq, but better be more tight in the doc than too relaxed.)
>>
>> Furthermore while max/min freq don't apply directly, there are still two MSRs
>> controlling bounds at the package and logical processor levels.
> 
> Well, we only program the logical processor level MSRs because we
> don't have a good idea of the packages to know when we can skip
> writing an MSR.
> 
> How about this:
> `= none
>  | {{  | xen } {
> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
> | [:hwp[,hdc]] }
>   [,verbose]]}
>  | dom0-kernel`

Looks right, yes.

>>> --- /dev/null
>>> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
>>> @@ -0,0 +1,506 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
>>> + *
>>> + * Copyright (C) 2021 Jason Andryuk 
>>> + */
>>> +
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +static bool feature_hwp;
>>> +static bool feature_hwp_notification;
>>> +static bool feature_hwp_activity_window;
>>> +static bool feature_hwp_energy_perf;
>>> +static bool feature_hwp_pkg_level_ctl;
>>> +static bool feature_hwp_peci;
>>> +
>>> +static bool feature_hdc;
>>
>> Most (all?) of these want to be __ro_after_init, I expect.
> 
> I think you are correct.  (This pre-dates __ro_after_init and I didn't
> update it.)

Yet even then they should have used __read_mostly.

>>> +union hwp_request
>>> +{
>>> +struct
>>> +{
>>> +uint64_t min_perf:8;
>>> +uint64_t max_perf:8;
>>> +uint64_t desired:8;
>>> +uint64_t energy_perf:8;
>>> +uint64_t activity_window:10;
>>> +uint64_t package_control:1;
>>> +uint64_t reserved:16;
>>> +uint64_t activity_window_valid:1;
>>> +uint64_t energy_perf_valid:1;
>>> +uint64_t desired_valid:1;
>>> +uint64_t max_perf_valid:1;
>>> +uint64_t min_perf_valid:1;
>>
>> The boolean fields here would probably better be of type "bool". I also
>> don't see the need for using uint64_t for any of the other fields -
>> unsigned int will be quite fine, I think. Only ...
> 
> This is the hardware MSR format, so it seemed natural to use uint64_t
> and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> are dividing up a single hardware register using bit fields.
> Honestly, I'm unfamiliar with the finer points of laying out bitfields
> with bool.  And the 10 bits of activity window throws off aligning to
> standard types.
> 
> This seems to have the correct layout:
> struct
> {
> unsigned char min_perf;
> unsigned char max_perf;
> unsigned char desired;
> unsigned char energy_perf;
> unsigned int activity_window:10;
> bool package_control:1;
> 

Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-04 Thread Jason Andryuk
On Thu, May 4, 2023 at 9:11 AM Jan Beulich  wrote:
>
> On 01.05.2023 21:30, Jason Andryuk wrote:
> > For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> > work.  Users would have to select the hwp-internal governor to turn off
> > hwp support.
>
> I'm afraid I don't understand this, and you'll find a comment towards
> this further down. Even when ...
>
> >  hwp-internal isn't usable without hwp, and users wouldn't
> > be able to select a different governor.  That doesn't matter while hwp
> > defaults off, but it would if or when hwp defaults to enabled.
>
> ... it starts defaulting to enabled, selecting another governor can
> simply have the side effect of turning off hwp.

I didn't think of that - makes sense.

> > Write to disable the interrupt - the linux pstate driver does this.  We
> > don't use the interrupts, so we can just turn them off.  We aren't ready
> > to handle them, so we don't want any.  Unclear if this is necessary.
> > SDM says it's default disabled.
>
> Definitely better to be on the safe side.
>
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> > oprofile, rather than detectin
> >  available support.
> >
> >  ### cpufreq
> > -> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
> >  | dom0-kernel`
> > +> `= none | {{  | xen } 
> > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
> >  | dom0-kernel`
>
> Considering you use a special internal governor, the 4 governor alternatives 
> are
> meaningless for hwp. Hence at the command line level recognizing "hwp" as if 
> it
> was another governor name would seem better to me. This would then also get 
> rid
> of one of the two special "no-" prefix parsing cases (which I'm not overly
> happy about).
>
> Even if not done that way I'm puzzled by the way you spell out the interaction
> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only 
> when
> "hwp" was specified, so even if not merged with the governors group "hwp" 
> should
> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. 
> (The
> way you've spelled it out it actually looks to be kind of the other way 
> around.)

I placed them in alphabetical order, but, yes, it doesn't make sense.

> Strictly speaking "maxfreq" and "minfreq" also should be objected to when 
> "hwp"
> was specified.
>
> Overall I'm getting the impression that beyond your "verbose" related 
> adjustment
> more is needed, if you're meaning to get things closer to how we parse the
> option (splitting across multiple lines to help see what I mean):
>
> `= none
>  | {{  | xen } [:{powersave|performance|ondemand|userspace}
>   
> [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
>   [,verbose]]}
>  | dom0-kernel`
>
> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> maxfreq, but better be more tight in the doc than too relaxed.)
>
> Furthermore while max/min freq don't apply directly, there are still two MSRs
> controlling bounds at the package and logical processor levels.

Well, we only program the logical processor level MSRs because we
don't have a good idea of the packages to know when we can skip
writing an MSR.

How about this:
`= none
 | {{  | xen } {
[:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]]
| [:hwp[,hdc]] }
  [,verbose]]}
 | dom0-kernel`

i.e:
xen:hwp,hdc

> > --- /dev/null
> > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> > @@ -0,0 +1,506 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> > + *
> > + * Copyright (C) 2021 Jason Andryuk 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +static bool feature_hwp;
> > +static bool feature_hwp_notification;
> > +static bool feature_hwp_activity_window;
> > +static bool feature_hwp_energy_perf;
> > +static bool feature_hwp_pkg_level_ctl;
> > +static bool feature_hwp_peci;
> > +
> > +static bool feature_hdc;
>
> Most (all?) of these want to be __ro_after_init, I expect.

I think you are correct.  (This pre-dates __ro_after_init and I didn't
update it.)

> > +__initdata bool opt_cpufreq_hwp = false;
> > +__initdata bool opt_cpufreq_hdc = true;
>
> Nit (style): Please put annotations after the type.
>
> > +#define HWP_ENERGY_PERF_BALANCE 0x80
> > +#define IA32_ENERGY_BIAS_BALANCE0x7
> > +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> > +#define IA32_ENERGY_BIAS_MASK   0xf
> > +
> > +union hwp_request
> > +{
> > +struct
> > +{
> > +uint64_t min_perf:8;
> > +uint64_t max_perf:8;
> > +uint64_t desired:8;
> > +uint64_t energy_perf:8;
> > +uint64_t activity_window:10;
> > +u

Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-04 Thread Jan Beulich
On 01.05.2023 21:30, Jason Andryuk wrote:
> For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> work.  Users would have to select the hwp-internal governor to turn off
> hwp support.

I'm afraid I don't understand this, and you'll find a comment towards
this further down. Even when ...

>  hwp-internal isn't usable without hwp, and users wouldn't
> be able to select a different governor.  That doesn't matter while hwp
> defaults off, but it would if or when hwp defaults to enabled.

... it starts defaulting to enabled, selecting another governor can
simply have the side effect of turning off hwp.

> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Definitely better to be on the safe side.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> oprofile, rather than detectin
>  available support.
>  
>  ### cpufreq
> -> `= none | {{  | xen } 
> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose}
>  | dom0-kernel`
> +> `= none | {{  | xen } 
> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]}
>  | dom0-kernel`

Considering you use a special internal governor, the 4 governor alternatives are
meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
was another governor name would seem better to me. This would then also get rid
of one of the two special "no-" prefix parsing cases (which I'm not overly
happy about).

Even if not done that way I'm puzzled by the way you spell out the interaction
of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
"hwp" was specified, so even if not merged with the governors group "hwp" should
come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
way you've spelled it out it actually looks to be kind of the other way around.)

Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
was specified.

Overall I'm getting the impression that beyond your "verbose" related adjustment
more is needed, if you're meaning to get things closer to how we parse the
option (splitting across multiple lines to help see what I mean):

`= none
 | {{  | xen } [:{powersave|performance|ondemand|userspace}
  [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}]
  [,verbose]]}
 | dom0-kernel`

(We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
maxfreq, but better be more tight in the doc than too relaxed.)

Furthermore while max/min freq don't apply directly, there are still two MSRs
controlling bounds at the package and logical processor levels.

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,506 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static bool feature_hwp;
> +static bool feature_hwp_notification;
> +static bool feature_hwp_activity_window;
> +static bool feature_hwp_energy_perf;
> +static bool feature_hwp_pkg_level_ctl;
> +static bool feature_hwp_peci;
> +
> +static bool feature_hdc;

Most (all?) of these want to be __ro_after_init, I expect.

> +__initdata bool opt_cpufreq_hwp = false;
> +__initdata bool opt_cpufreq_hdc = true;

Nit (style): Please put annotations after the type.

> +#define HWP_ENERGY_PERF_BALANCE 0x80
> +#define IA32_ENERGY_BIAS_BALANCE0x7
> +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> +#define IA32_ENERGY_BIAS_MASK   0xf
> +
> +union hwp_request
> +{
> +struct
> +{
> +uint64_t min_perf:8;
> +uint64_t max_perf:8;
> +uint64_t desired:8;
> +uint64_t energy_perf:8;
> +uint64_t activity_window:10;
> +uint64_t package_control:1;
> +uint64_t reserved:16;
> +uint64_t activity_window_valid:1;
> +uint64_t energy_perf_valid:1;
> +uint64_t desired_valid:1;
> +uint64_t max_perf_valid:1;
> +uint64_t min_perf_valid:1;

The boolean fields here would probably better be of type "bool". I also
don't see the need for using uint64_t for any of the other fields -
unsigned int will be quite fine, I think. Only ...

> +};
> +uint64_t raw;

... this wants to keep this type. (Same again below then.)

> +bool __init hwp_available(void)
> +{
> +unsigned int eax, ecx, unused;
> +bool use_hwp;
> +
> +if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +{
> +hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +boot_cpu_data.cpuid_level);
> +return false;
> +

[PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver

2023-05-01 Thread Jason Andryuk
>From the Intel SDM: "Hardware-Controlled Performance States (HWP), which
autonomously selects performance states while utilizing OS supplied
performance guidance hints."

Enable HWP to run in autonomous mode by poking the correct MSRs.
cpufreq=xen:hwp enables and cpufreq=xen:hwp=0 disables.  The same for
hdc.

There is no interface to configure - xen_sysctl_pm_op/xenpm will
be to be extended to configure in subsequent patches.  It will run with
the default values, which should be the default 0x80 (out
of 0x0-0xff) energy/performance preference.

Unscientific powertop measurement of an mostly idle, customized OpenXT
install:
A 10th gen 6-core laptop showed battery discharge drop from ~9.x to
~7.x watts.
A 8th gen 4-core laptop dropped from ~10 to ~9

Power usage depends on many factors, especially display brightness, but
this does show an power saving in balanced mode when CPU utilization is
low.

HWP isn't compatible with an external governor - it doesn't take
explicit frequency requests.  Therefore a minimal internal governor,
hwp-internal, is also added as a placeholder.

While adding to the xen-command-line.pandoc entry, un-nest verbose from
minfreq.  They are independent.

Signed-off-by: Jason Andryuk 

---

We disable on cpuid_level < 0x16.  cpuid(0x16) is used to get the cpu
frequencies for calculating the APERF/MPERF.  Without it, things would
still work, but the averge cpufrequency output would be wrong.

My 8th & 10th gen test systems both report:
(XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0
(XEN) HWP: Hardware Duty Cycling (HDC) supported
(XEN) HWP: HW_FEEDBACK not supported

IA32_ENERGY_PERF_BIAS has not been tested.

For cpufreq=xen:hwp, placing the option inside the governor wouldn't
work.  Users would have to select the hwp-internal governor to turn off
hwp support.  hwp-internal isn't usable without hwp, and users wouldn't
be able to select a different governor.  That doesn't matter while hwp
defaults off, but it would if or when hwp defaults to enabled.

We can't use parse_boolean() since it requires a single name=val string
and cpufreq_handle_common_option is provided two strings.  Use
parse_bool() and manual handle no-hwp.

Write to disable the interrupt - the linux pstate driver does this.  We
don't use the interrupts, so we can just turn them off.  We aren't ready
to handle them, so we don't want any.  Unclear if this is necessary.
SDM says it's default disabled.

FAST_IA32_HWP_REQUEST was removed in v2.  The check in v1 was wrong,
it's a model specific feature and the CPUID bit is only available
after enabling via the MSR.  Support was untested since I don't have
hardware with the feature.  Writes are expected to be infrequent, so
just leave it out.

---
v2:
Alphabetize headers
Re-work driver registration
name hwp_drv_data anonymous union "hw"
Drop hwp_verbose_cont
style cleanups
Condense hwp_governor switch
hwp_cpufreq_target remove .raw from hwp_req assignment
Use typed-pointer in a few functions
Pass type to xzalloc
Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines
Add XEN_HWP_GOVERNOR define for "hwp-internal"
Capitalize CPUID and MSR defines
Change '_' to '-' for energy-perf & act-window
Read-modify-write MSRs updates
Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define
constify pointer in hwp_set_misc_turbo
Add space after non-fallthrough break in governor switch
Add IA32_ENERGY_BIAS_MASK define
Check CPUID_PM_LEAK for energy bias when needed
Fail initialization with curr_req = -1
Fold hwp_read_capabilities into hwp_init_msrs
Add command line cpufreq=xen:hwp
Add command line cpufreq=xen:hdc
Use per_cpu for hwp_drv_data pointers
Move hwp_energy_perf_bias call into hwp_write_request
energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped
Ensure we don't generate interrupts
Remove Fast Write of Uncore MSR
Initialize hwp_drv_data from curr_req
Use SPDX line instead of license text in hwp.c

v3:
Add cf_check to cpufreq_gov_hwp_init() - Marek
Print cpuid_level with %#x - Marek
---
 docs/misc/xen-command-line.pandoc |   8 +-
 xen/arch/x86/acpi/cpufreq/Makefile|   1 +
 xen/arch/x86/acpi/cpufreq/cpufreq.c   |   5 +-
 xen/arch/x86/acpi/cpufreq/hwp.c   | 506 ++
 xen/arch/x86/include/asm/cpufeature.h |  13 +-
 xen/arch/x86/include/asm/msr-index.h  |  13 +
 xen/drivers/cpufreq/cpufreq.c |  32 ++
 xen/include/acpi/cpufreq/cpufreq.h|   3 +
 xen/include/acpi/cpufreq/processor_perf.h |   3 +
 xen/include/public/sysctl.h   |   1 +
 10 files changed, 581 insertions(+), 4 deletions(-)
 create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c

diff --git a/docs/misc/xen-command-line.pandoc 
b/docs/misc/xen-command-line.pandoc
index e0b89b7d33..aaa31f444b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, 
rather than detectin
 available support.
 
 ### cpufreq
-> `=