Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On 12.05.2023 03:02, Marek Marczykowski-Górecki wrote: > On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote: >> On 10.05.2023 15:54, Jason Andryuk wrote: >>> On Mon, May 8, 2023 at 2:33 AM Jan Beulich wrote: On 05.05.2023 17:35, Jason Andryuk wrote: > On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: > The other issue is that if you select "hwp" as the governor, but HWP > hardware support is not available, then hwp_available() needs to reset > the governor back to the default. This feels like a layering > violation. Layering violation - yes. But why would the governor need resetting in this case? If HWP was asked for but isn't available, I don't think any other cpufreq handling (and hence governor) should be put in place. And turning off cpufreq altogether (if necessary in the first place) wouldn't, to me, feel as much like a layering violation. >>> >>> My goal was for Xen to use HWP if available and fallback to the acpi >>> cpufreq driver if not. That to me seems more user-friendly than >>> disabling cpufreq. >>> >>> if ( hwp_available() ) >>> ret = hwp_register_driver(); >>> else >>> ret = cpufreq_register_driver(&acpi_cpufreq_driver); >> >> That's fine as a (future) default, but for now using hwp requires a >> command line option, and if that option says "hwp" then it ought to >> be hwp imo. > > As a downstrem distribution, I'd strongly prefer to have an option that > would enable HWP when present and fallback to other driver otherwise, > even if that isn't the default upstream. I can't possibly require large > group of users (either HWP-having or HWP-not-having) to edit the Xen > cmdline to get power management working well. > > If the meaning for cpufreq=hwp absolutely must include "nothing if HWP > is not available", then maybe it should be named cpufreq=try-hwp > instead, or cpufreq=prefer-hwp or something else like this? Any new sub-option needs to fit the existing ones in its meaning. I could see e.g. "cpufreq=xen" alone to effect what you want (once hwp becomes available for use by default). But (for now at least) I continue to think that a request for "hwp" ought to mean HWP. Jan
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On Wed, May 10, 2023 at 04:19:57PM +0200, Jan Beulich wrote: > On 10.05.2023 15:54, Jason Andryuk wrote: > > On Mon, May 8, 2023 at 2:33 AM Jan Beulich wrote: > >> On 05.05.2023 17:35, Jason Andryuk wrote: > >>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: > >>> The other issue is that if you select "hwp" as the governor, but HWP > >>> hardware support is not available, then hwp_available() needs to reset > >>> the governor back to the default. This feels like a layering > >>> violation. > >> > >> Layering violation - yes. But why would the governor need resetting in > >> this case? If HWP was asked for but isn't available, I don't think any > >> other cpufreq handling (and hence governor) should be put in place. > >> And turning off cpufreq altogether (if necessary in the first place) > >> wouldn't, to me, feel as much like a layering violation. > > > > My goal was for Xen to use HWP if available and fallback to the acpi > > cpufreq driver if not. That to me seems more user-friendly than > > disabling cpufreq. > > > > if ( hwp_available() ) > > ret = hwp_register_driver(); > > else > > ret = cpufreq_register_driver(&acpi_cpufreq_driver); > > That's fine as a (future) default, but for now using hwp requires a > command line option, and if that option says "hwp" then it ought to > be hwp imo. As a downstrem distribution, I'd strongly prefer to have an option that would enable HWP when present and fallback to other driver otherwise, even if that isn't the default upstream. I can't possibly require large group of users (either HWP-having or HWP-not-having) to edit the Xen cmdline to get power management working well. If the meaning for cpufreq=hwp absolutely must include "nothing if HWP is not available", then maybe it should be named cpufreq=try-hwp instead, or cpufreq=prefer-hwp or something else like this? > > If we are setting cpufreq_opt_governor to enter hwp_available(), but > > then HWP isn't available, it seems to me that we need to reset > > cpufreq_opt_governor when exiting hwp_available() false. > > This may be necessary in the future, but shouldn't be necessary right > now. It's not entirely clear to me how that future is going to look > like, command line option wise. > > Jan > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On 10.05.2023 15:54, Jason Andryuk wrote: > On Mon, May 8, 2023 at 2:33 AM Jan Beulich wrote: >> On 05.05.2023 17:35, Jason Andryuk wrote: >>> On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: >>> The other issue is that if you select "hwp" as the governor, but HWP >>> hardware support is not available, then hwp_available() needs to reset >>> the governor back to the default. This feels like a layering >>> violation. >> >> Layering violation - yes. But why would the governor need resetting in >> this case? If HWP was asked for but isn't available, I don't think any >> other cpufreq handling (and hence governor) should be put in place. >> And turning off cpufreq altogether (if necessary in the first place) >> wouldn't, to me, feel as much like a layering violation. > > My goal was for Xen to use HWP if available and fallback to the acpi > cpufreq driver if not. That to me seems more user-friendly than > disabling cpufreq. > > if ( hwp_available() ) > ret = hwp_register_driver(); > else > ret = cpufreq_register_driver(&acpi_cpufreq_driver); That's fine as a (future) default, but for now using hwp requires a command line option, and if that option says "hwp" then it ought to be hwp imo. > If we are setting cpufreq_opt_governor to enter hwp_available(), but > then HWP isn't available, it seems to me that we need to reset > cpufreq_opt_governor when exiting hwp_available() false. This may be necessary in the future, but shouldn't be necessary right now. It's not entirely clear to me how that future is going to look like, command line option wise. Jan
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On Mon, May 8, 2023 at 2:33 AM Jan Beulich wrote: > > On 05.05.2023 17:35, Jason Andryuk wrote: > > On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: > >> > >> On 04.05.2023 18:56, Jason Andryuk wrote: > >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich wrote: > On 01.05.2023 21:30, Jason Andryuk wrote: > > --- a/docs/misc/xen-command-line.pandoc > > +++ b/docs/misc/xen-command-line.pandoc > > @@ -499,7 +499,7 @@ If set, force use of the performance counters for > > oprofile, rather than detectin > > available support. > > > > ### cpufreq > > -> `= none | {{ | xen } > > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} > > | dom0-kernel` > > +> `= none | {{ | xen } > > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} > > | dom0-kernel` > > Considering you use a special internal governor, the 4 governor > alternatives are > meaningless for hwp. Hence at the command line level recognizing "hwp" > as if it > was another governor name would seem better to me. This would then also > get rid > of one of the two special "no-" prefix parsing cases (which I'm not > overly > happy about). > > Even if not done that way I'm puzzled by the way you spell out the > interaction > of "hwp" and "hdc": As you say in the description, "hdc" is meaningful > only when > "hwp" was specified, so even if not merged with the governors group > "hwp" should > come first, and "hdc" ought to be rejected if "hwp" wasn't first > specified. (The > way you've spelled it out it actually looks to be kind of the other way > around.) > >>> > >>> I placed them in alphabetical order, but, yes, it doesn't make sense. > >>> > Strictly speaking "maxfreq" and "minfreq" also should be objected to > when "hwp" > was specified. > > Overall I'm getting the impression that beyond your "verbose" related > adjustment > more is needed, if you're meaning to get things closer to how we parse > the > option (splitting across multiple lines to help see what I mean): > > `= none > | {{ | xen } [:{powersave|performance|ondemand|userspace} > > [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] > [,verbose]]} > | dom0-kernel` > > (We're still parsing in a more relaxed way, e.g. minfreq may come ahead > of > maxfreq, but better be more tight in the doc than too relaxed.) > > Furthermore while max/min freq don't apply directly, there are still two > MSRs > controlling bounds at the package and logical processor levels. > >>> > >>> Well, we only program the logical processor level MSRs because we > >>> don't have a good idea of the packages to know when we can skip > >>> writing an MSR. > >>> > >>> How about this: > >>> `= none > >>> | {{ | xen } { > >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]] > >>> | [:hwp[,hdc]] } > >>> [,verbose]]} > >>> | dom0-kernel` > >> > >> Looks right, yes. > > > > There is a wrinkle to using the hwp governor. The hwp governor was > > named "hwp-internal", so it needs to be renamed to "hwp" for use with > > command line parsing. That means the checking for "-internal" needs > > to change to just "hwp" which removes the generality of the original > > implementation. > > I'm afraid I don't see why this would strictly be necessary or a > consequence. Maybe I took your comment too far when you mentioned using hwp as a fake governor. I used the actual HWP struct cpufreq_governor to hook into cpufreq_cmdline_parse(). cpufreq_cmdline_parse() uses the that name for comparison. But the naming stops being an issue if struct cpufreq_governor gains a bool .internal flag. That flag also makes the registration checks clearer. > > The other issue is that if you select "hwp" as the governor, but HWP > > hardware support is not available, then hwp_available() needs to reset > > the governor back to the default. This feels like a layering > > violation. > > Layering violation - yes. But why would the governor need resetting in > this case? If HWP was asked for but isn't available, I don't think any > other cpufreq handling (and hence governor) should be put in place. > And turning off cpufreq altogether (if necessary in the first place) > wouldn't, to me, feel as much like a layering violation. My goal was for Xen to use HWP if available and fallback to the acpi cpufreq driver if not. That to me seems more user-friendly than disabling cpufreq. if ( hwp_available() ) ret = hwp_register_driver(); else ret = cpufreq_register_driver(&acpi_cpufreq_driver); If we are setting cpufreq_opt_governor to enter hwp_available(), but then HWP is
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On 05.05.2023 17:35, Jason Andryuk wrote: > On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: >> >> On 04.05.2023 18:56, Jason Andryuk wrote: >>> On Thu, May 4, 2023 at 9:11 AM Jan Beulich wrote: On 01.05.2023 21:30, Jason Andryuk wrote: > --- a/docs/misc/xen-command-line.pandoc > +++ b/docs/misc/xen-command-line.pandoc > @@ -499,7 +499,7 @@ If set, force use of the performance counters for > oprofile, rather than detectin > available support. > > ### cpufreq > -> `= none | {{ | xen } > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} > | dom0-kernel` > +> `= none | {{ | xen } > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} > | dom0-kernel` Considering you use a special internal governor, the 4 governor alternatives are meaningless for hwp. Hence at the command line level recognizing "hwp" as if it was another governor name would seem better to me. This would then also get rid of one of the two special "no-" prefix parsing cases (which I'm not overly happy about). Even if not done that way I'm puzzled by the way you spell out the interaction of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when "hwp" was specified, so even if not merged with the governors group "hwp" should come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The way you've spelled it out it actually looks to be kind of the other way around.) >>> >>> I placed them in alphabetical order, but, yes, it doesn't make sense. >>> Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp" was specified. Overall I'm getting the impression that beyond your "verbose" related adjustment more is needed, if you're meaning to get things closer to how we parse the option (splitting across multiple lines to help see what I mean): `= none | {{ | xen } [:{powersave|performance|ondemand|userspace} [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] [,verbose]]} | dom0-kernel` (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of maxfreq, but better be more tight in the doc than too relaxed.) Furthermore while max/min freq don't apply directly, there are still two MSRs controlling bounds at the package and logical processor levels. >>> >>> Well, we only program the logical processor level MSRs because we >>> don't have a good idea of the packages to know when we can skip >>> writing an MSR. >>> >>> How about this: >>> `= none >>> | {{ | xen } { >>> [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]] >>> | [:hwp[,hdc]] } >>> [,verbose]]} >>> | dom0-kernel` >> >> Looks right, yes. > > There is a wrinkle to using the hwp governor. The hwp governor was > named "hwp-internal", so it needs to be renamed to "hwp" for use with > command line parsing. That means the checking for "-internal" needs > to change to just "hwp" which removes the generality of the original > implementation. I'm afraid I don't see why this would strictly be necessary or a consequence. > The other issue is that if you select "hwp" as the governor, but HWP > hardware support is not available, then hwp_available() needs to reset > the governor back to the default. This feels like a layering > violation. Layering violation - yes. But why would the governor need resetting in this case? If HWP was asked for but isn't available, I don't think any other cpufreq handling (and hence governor) should be put in place. And turning off cpufreq altogether (if necessary in the first place) wouldn't, to me, feel as much like a layering violation. > I'm still investigating, but promoting hwp to a top level option - > cpufreq=hwp - might be a better arrangement. Might be an alternative, yes. Jan
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On Fri, May 5, 2023 at 3:01 AM Jan Beulich wrote: > > On 04.05.2023 18:56, Jason Andryuk wrote: > > On Thu, May 4, 2023 at 9:11 AM Jan Beulich wrote: > >> On 01.05.2023 21:30, Jason Andryuk wrote: > >>> --- a/docs/misc/xen-command-line.pandoc > >>> +++ b/docs/misc/xen-command-line.pandoc > >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for > >>> oprofile, rather than detectin > >>> available support. > >>> > >>> ### cpufreq > >>> -> `= none | {{ | xen } > >>> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} > >>> | dom0-kernel` > >>> +> `= none | {{ | xen } > >>> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} > >>> | dom0-kernel` > >> > >> Considering you use a special internal governor, the 4 governor > >> alternatives are > >> meaningless for hwp. Hence at the command line level recognizing "hwp" as > >> if it > >> was another governor name would seem better to me. This would then also > >> get rid > >> of one of the two special "no-" prefix parsing cases (which I'm not overly > >> happy about). > >> > >> Even if not done that way I'm puzzled by the way you spell out the > >> interaction > >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful > >> only when > >> "hwp" was specified, so even if not merged with the governors group "hwp" > >> should > >> come first, and "hdc" ought to be rejected if "hwp" wasn't first > >> specified. (The > >> way you've spelled it out it actually looks to be kind of the other way > >> around.) > > > > I placed them in alphabetical order, but, yes, it doesn't make sense. > > > >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when > >> "hwp" > >> was specified. > >> > >> Overall I'm getting the impression that beyond your "verbose" related > >> adjustment > >> more is needed, if you're meaning to get things closer to how we parse the > >> option (splitting across multiple lines to help see what I mean): > >> > >> `= none > >> | {{ | xen } [:{powersave|performance|ondemand|userspace} > >> > >> [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] > >> [,verbose]]} > >> | dom0-kernel` > >> > >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of > >> maxfreq, but better be more tight in the doc than too relaxed.) > >> > >> Furthermore while max/min freq don't apply directly, there are still two > >> MSRs > >> controlling bounds at the package and logical processor levels. > > > > Well, we only program the logical processor level MSRs because we > > don't have a good idea of the packages to know when we can skip > > writing an MSR. > > > > How about this: > > `= none > > | {{ | xen } { > > [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]] > > | [:hwp[,hdc]] } > > [,verbose]]} > > | dom0-kernel` > > Looks right, yes. There is a wrinkle to using the hwp governor. The hwp governor was named "hwp-internal", so it needs to be renamed to "hwp" for use with command line parsing. That means the checking for "-internal" needs to change to just "hwp" which removes the generality of the original implementation. The other issue is that if you select "hwp" as the governor, but HWP hardware support is not available, then hwp_available() needs to reset the governor back to the default. This feels like a layering violation. I'm still investigating, but promoting hwp to a top level option - cpufreq=hwp - might be a better arrangement. > >>> +union hwp_request > >>> +{ > >>> +struct > >>> +{ > >>> +uint64_t min_perf:8; > >>> +uint64_t max_perf:8; > >>> +uint64_t desired:8; > >>> +uint64_t energy_perf:8; > >>> +uint64_t activity_window:10; > >>> +uint64_t package_control:1; > >>> +uint64_t reserved:16; > >>> +uint64_t activity_window_valid:1; > >>> +uint64_t energy_perf_valid:1; > >>> +uint64_t desired_valid:1; > >>> +uint64_t max_perf_valid:1; > >>> +uint64_t min_perf_valid:1; > >> > >> The boolean fields here would probably better be of type "bool". I also > >> don't see the need for using uint64_t for any of the other fields - > >> unsigned int will be quite fine, I think. Only ... > > > > This is the hardware MSR format, so it seemed natural to use uint64_t > > and the bit fields. To me, uint64_t foo:$bits; better shows that we > > are dividing up a single hardware register using bit fields. > > Honestly, I'm unfamiliar with the finer points of laying out bitfields > > with bool. And the 10 bits of activity window throws off aligning to > > standard types. > > > > This seems to have the correct layout: > > struct > > { > > unsigned char min_perf; > > unsigned char max_perf; > > unsigned char desired; > > unsigned char energy_perf; > > unsigned int activity_window:10; > >
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On 04.05.2023 18:56, Jason Andryuk wrote: > On Thu, May 4, 2023 at 9:11 AM Jan Beulich wrote: >> On 01.05.2023 21:30, Jason Andryuk wrote: >>> --- a/docs/misc/xen-command-line.pandoc >>> +++ b/docs/misc/xen-command-line.pandoc >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for >>> oprofile, rather than detectin >>> available support. >>> >>> ### cpufreq >>> -> `= none | {{ | xen } >>> [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} >>> | dom0-kernel` >>> +> `= none | {{ | xen } >>> [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} >>> | dom0-kernel` >> >> Considering you use a special internal governor, the 4 governor alternatives >> are >> meaningless for hwp. Hence at the command line level recognizing "hwp" as if >> it >> was another governor name would seem better to me. This would then also get >> rid >> of one of the two special "no-" prefix parsing cases (which I'm not overly >> happy about). >> >> Even if not done that way I'm puzzled by the way you spell out the >> interaction >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only >> when >> "hwp" was specified, so even if not merged with the governors group "hwp" >> should >> come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. >> (The >> way you've spelled it out it actually looks to be kind of the other way >> around.) > > I placed them in alphabetical order, but, yes, it doesn't make sense. > >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when >> "hwp" >> was specified. >> >> Overall I'm getting the impression that beyond your "verbose" related >> adjustment >> more is needed, if you're meaning to get things closer to how we parse the >> option (splitting across multiple lines to help see what I mean): >> >> `= none >> | {{ | xen } [:{powersave|performance|ondemand|userspace} >> >> [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] >> [,verbose]]} >> | dom0-kernel` >> >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of >> maxfreq, but better be more tight in the doc than too relaxed.) >> >> Furthermore while max/min freq don't apply directly, there are still two MSRs >> controlling bounds at the package and logical processor levels. > > Well, we only program the logical processor level MSRs because we > don't have a good idea of the packages to know when we can skip > writing an MSR. > > How about this: > `= none > | {{ | xen } { > [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]] > | [:hwp[,hdc]] } > [,verbose]]} > | dom0-kernel` Looks right, yes. >>> --- /dev/null >>> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c >>> @@ -0,0 +1,506 @@ >>> +/* SPDX-License-Identifier: GPL-2.0 */ >>> +/* >>> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP) >>> + * >>> + * Copyright (C) 2021 Jason Andryuk >>> + */ >>> + >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> + >>> +static bool feature_hwp; >>> +static bool feature_hwp_notification; >>> +static bool feature_hwp_activity_window; >>> +static bool feature_hwp_energy_perf; >>> +static bool feature_hwp_pkg_level_ctl; >>> +static bool feature_hwp_peci; >>> + >>> +static bool feature_hdc; >> >> Most (all?) of these want to be __ro_after_init, I expect. > > I think you are correct. (This pre-dates __ro_after_init and I didn't > update it.) Yet even then they should have used __read_mostly. >>> +union hwp_request >>> +{ >>> +struct >>> +{ >>> +uint64_t min_perf:8; >>> +uint64_t max_perf:8; >>> +uint64_t desired:8; >>> +uint64_t energy_perf:8; >>> +uint64_t activity_window:10; >>> +uint64_t package_control:1; >>> +uint64_t reserved:16; >>> +uint64_t activity_window_valid:1; >>> +uint64_t energy_perf_valid:1; >>> +uint64_t desired_valid:1; >>> +uint64_t max_perf_valid:1; >>> +uint64_t min_perf_valid:1; >> >> The boolean fields here would probably better be of type "bool". I also >> don't see the need for using uint64_t for any of the other fields - >> unsigned int will be quite fine, I think. Only ... > > This is the hardware MSR format, so it seemed natural to use uint64_t > and the bit fields. To me, uint64_t foo:$bits; better shows that we > are dividing up a single hardware register using bit fields. > Honestly, I'm unfamiliar with the finer points of laying out bitfields > with bool. And the 10 bits of activity window throws off aligning to > standard types. > > This seems to have the correct layout: > struct > { > unsigned char min_perf; > unsigned char max_perf; > unsigned char desired; > unsigned char energy_perf; > unsigned int activity_window:10; > bool package_control:1; >
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On Thu, May 4, 2023 at 9:11 AM Jan Beulich wrote: > > On 01.05.2023 21:30, Jason Andryuk wrote: > > For cpufreq=xen:hwp, placing the option inside the governor wouldn't > > work. Users would have to select the hwp-internal governor to turn off > > hwp support. > > I'm afraid I don't understand this, and you'll find a comment towards > this further down. Even when ... > > > hwp-internal isn't usable without hwp, and users wouldn't > > be able to select a different governor. That doesn't matter while hwp > > defaults off, but it would if or when hwp defaults to enabled. > > ... it starts defaulting to enabled, selecting another governor can > simply have the side effect of turning off hwp. I didn't think of that - makes sense. > > Write to disable the interrupt - the linux pstate driver does this. We > > don't use the interrupts, so we can just turn them off. We aren't ready > > to handle them, so we don't want any. Unclear if this is necessary. > > SDM says it's default disabled. > > Definitely better to be on the safe side. > > > --- a/docs/misc/xen-command-line.pandoc > > +++ b/docs/misc/xen-command-line.pandoc > > @@ -499,7 +499,7 @@ If set, force use of the performance counters for > > oprofile, rather than detectin > > available support. > > > > ### cpufreq > > -> `= none | {{ | xen } > > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} > > | dom0-kernel` > > +> `= none | {{ | xen } > > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} > > | dom0-kernel` > > Considering you use a special internal governor, the 4 governor alternatives > are > meaningless for hwp. Hence at the command line level recognizing "hwp" as if > it > was another governor name would seem better to me. This would then also get > rid > of one of the two special "no-" prefix parsing cases (which I'm not overly > happy about). > > Even if not done that way I'm puzzled by the way you spell out the interaction > of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only > when > "hwp" was specified, so even if not merged with the governors group "hwp" > should > come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. > (The > way you've spelled it out it actually looks to be kind of the other way > around.) I placed them in alphabetical order, but, yes, it doesn't make sense. > Strictly speaking "maxfreq" and "minfreq" also should be objected to when > "hwp" > was specified. > > Overall I'm getting the impression that beyond your "verbose" related > adjustment > more is needed, if you're meaning to get things closer to how we parse the > option (splitting across multiple lines to help see what I mean): > > `= none > | {{ | xen } [:{powersave|performance|ondemand|userspace} > > [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] > [,verbose]]} > | dom0-kernel` > > (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of > maxfreq, but better be more tight in the doc than too relaxed.) > > Furthermore while max/min freq don't apply directly, there are still two MSRs > controlling bounds at the package and logical processor levels. Well, we only program the logical processor level MSRs because we don't have a good idea of the packages to know when we can skip writing an MSR. How about this: `= none | {{ | xen } { [:{powersave|performance|ondemand|userspace}[,maxfreq=[,minfreq=]] | [:hwp[,hdc]] } [,verbose]]} | dom0-kernel` i.e: xen:hwp,hdc > > --- /dev/null > > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c > > @@ -0,0 +1,506 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP) > > + * > > + * Copyright (C) 2021 Jason Andryuk > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +static bool feature_hwp; > > +static bool feature_hwp_notification; > > +static bool feature_hwp_activity_window; > > +static bool feature_hwp_energy_perf; > > +static bool feature_hwp_pkg_level_ctl; > > +static bool feature_hwp_peci; > > + > > +static bool feature_hdc; > > Most (all?) of these want to be __ro_after_init, I expect. I think you are correct. (This pre-dates __ro_after_init and I didn't update it.) > > +__initdata bool opt_cpufreq_hwp = false; > > +__initdata bool opt_cpufreq_hdc = true; > > Nit (style): Please put annotations after the type. > > > +#define HWP_ENERGY_PERF_BALANCE 0x80 > > +#define IA32_ENERGY_BIAS_BALANCE0x7 > > +#define IA32_ENERGY_BIAS_MAX_POWERSAVE 0xf > > +#define IA32_ENERGY_BIAS_MASK 0xf > > + > > +union hwp_request > > +{ > > +struct > > +{ > > +uint64_t min_perf:8; > > +uint64_t max_perf:8; > > +uint64_t desired:8; > > +uint64_t energy_perf:8; > > +uint64_t activity_window:10; > > +u
Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
On 01.05.2023 21:30, Jason Andryuk wrote: > For cpufreq=xen:hwp, placing the option inside the governor wouldn't > work. Users would have to select the hwp-internal governor to turn off > hwp support. I'm afraid I don't understand this, and you'll find a comment towards this further down. Even when ... > hwp-internal isn't usable without hwp, and users wouldn't > be able to select a different governor. That doesn't matter while hwp > defaults off, but it would if or when hwp defaults to enabled. ... it starts defaulting to enabled, selecting another governor can simply have the side effect of turning off hwp. > Write to disable the interrupt - the linux pstate driver does this. We > don't use the interrupts, so we can just turn them off. We aren't ready > to handle them, so we don't want any. Unclear if this is necessary. > SDM says it's default disabled. Definitely better to be on the safe side. > --- a/docs/misc/xen-command-line.pandoc > +++ b/docs/misc/xen-command-line.pandoc > @@ -499,7 +499,7 @@ If set, force use of the performance counters for > oprofile, rather than detectin > available support. > > ### cpufreq > -> `= none | {{ | xen } > [:[powersave|performance|ondemand|userspace][,][,[][,[verbose} > | dom0-kernel` > +> `= none | {{ | xen } > [:[powersave|performance|ondemand|userspace][,][,[]][,[]][,[]][,[verbose]]]} > | dom0-kernel` Considering you use a special internal governor, the 4 governor alternatives are meaningless for hwp. Hence at the command line level recognizing "hwp" as if it was another governor name would seem better to me. This would then also get rid of one of the two special "no-" prefix parsing cases (which I'm not overly happy about). Even if not done that way I'm puzzled by the way you spell out the interaction of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when "hwp" was specified, so even if not merged with the governors group "hwp" should come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The way you've spelled it out it actually looks to be kind of the other way around.) Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp" was specified. Overall I'm getting the impression that beyond your "verbose" related adjustment more is needed, if you're meaning to get things closer to how we parse the option (splitting across multiple lines to help see what I mean): `= none | {{ | xen } [:{powersave|performance|ondemand|userspace} [{,hwp[,hdc]|[,maxfreq=[,minfreq=]}] [,verbose]]} | dom0-kernel` (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of maxfreq, but better be more tight in the doc than too relaxed.) Furthermore while max/min freq don't apply directly, there are still two MSRs controlling bounds at the package and logical processor levels. > --- /dev/null > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c > @@ -0,0 +1,506 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP) > + * > + * Copyright (C) 2021 Jason Andryuk > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static bool feature_hwp; > +static bool feature_hwp_notification; > +static bool feature_hwp_activity_window; > +static bool feature_hwp_energy_perf; > +static bool feature_hwp_pkg_level_ctl; > +static bool feature_hwp_peci; > + > +static bool feature_hdc; Most (all?) of these want to be __ro_after_init, I expect. > +__initdata bool opt_cpufreq_hwp = false; > +__initdata bool opt_cpufreq_hdc = true; Nit (style): Please put annotations after the type. > +#define HWP_ENERGY_PERF_BALANCE 0x80 > +#define IA32_ENERGY_BIAS_BALANCE0x7 > +#define IA32_ENERGY_BIAS_MAX_POWERSAVE 0xf > +#define IA32_ENERGY_BIAS_MASK 0xf > + > +union hwp_request > +{ > +struct > +{ > +uint64_t min_perf:8; > +uint64_t max_perf:8; > +uint64_t desired:8; > +uint64_t energy_perf:8; > +uint64_t activity_window:10; > +uint64_t package_control:1; > +uint64_t reserved:16; > +uint64_t activity_window_valid:1; > +uint64_t energy_perf_valid:1; > +uint64_t desired_valid:1; > +uint64_t max_perf_valid:1; > +uint64_t min_perf_valid:1; The boolean fields here would probably better be of type "bool". I also don't see the need for using uint64_t for any of the other fields - unsigned int will be quite fine, I think. Only ... > +}; > +uint64_t raw; ... this wants to keep this type. (Same again below then.) > +bool __init hwp_available(void) > +{ > +unsigned int eax, ecx, unused; > +bool use_hwp; > + > +if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF ) > +{ > +hwp_verbose("cpuid_level (%#x) lacks HWP support\n", > +boot_cpu_data.cpuid_level); > +return false; > +
[PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver
>From the Intel SDM: "Hardware-Controlled Performance States (HWP), which autonomously selects performance states while utilizing OS supplied performance guidance hints." Enable HWP to run in autonomous mode by poking the correct MSRs. cpufreq=xen:hwp enables and cpufreq=xen:hwp=0 disables. The same for hdc. There is no interface to configure - xen_sysctl_pm_op/xenpm will be to be extended to configure in subsequent patches. It will run with the default values, which should be the default 0x80 (out of 0x0-0xff) energy/performance preference. Unscientific powertop measurement of an mostly idle, customized OpenXT install: A 10th gen 6-core laptop showed battery discharge drop from ~9.x to ~7.x watts. A 8th gen 4-core laptop dropped from ~10 to ~9 Power usage depends on many factors, especially display brightness, but this does show an power saving in balanced mode when CPU utilization is low. HWP isn't compatible with an external governor - it doesn't take explicit frequency requests. Therefore a minimal internal governor, hwp-internal, is also added as a placeholder. While adding to the xen-command-line.pandoc entry, un-nest verbose from minfreq. They are independent. Signed-off-by: Jason Andryuk --- We disable on cpuid_level < 0x16. cpuid(0x16) is used to get the cpu frequencies for calculating the APERF/MPERF. Without it, things would still work, but the averge cpufrequency output would be wrong. My 8th & 10th gen test systems both report: (XEN) HWP: 1 notify: 1 act_window: 1 energy_perf: 1 pkg_level: 0 peci: 0 (XEN) HWP: Hardware Duty Cycling (HDC) supported (XEN) HWP: HW_FEEDBACK not supported IA32_ENERGY_PERF_BIAS has not been tested. For cpufreq=xen:hwp, placing the option inside the governor wouldn't work. Users would have to select the hwp-internal governor to turn off hwp support. hwp-internal isn't usable without hwp, and users wouldn't be able to select a different governor. That doesn't matter while hwp defaults off, but it would if or when hwp defaults to enabled. We can't use parse_boolean() since it requires a single name=val string and cpufreq_handle_common_option is provided two strings. Use parse_bool() and manual handle no-hwp. Write to disable the interrupt - the linux pstate driver does this. We don't use the interrupts, so we can just turn them off. We aren't ready to handle them, so we don't want any. Unclear if this is necessary. SDM says it's default disabled. FAST_IA32_HWP_REQUEST was removed in v2. The check in v1 was wrong, it's a model specific feature and the CPUID bit is only available after enabling via the MSR. Support was untested since I don't have hardware with the feature. Writes are expected to be infrequent, so just leave it out. --- v2: Alphabetize headers Re-work driver registration name hwp_drv_data anonymous union "hw" Drop hwp_verbose_cont style cleanups Condense hwp_governor switch hwp_cpufreq_target remove .raw from hwp_req assignment Use typed-pointer in a few functions Pass type to xzalloc Add HWP_ENERGY_PERF_BALANCE/IA32_ENERGY_BIAS_BALANCE defines Add XEN_HWP_GOVERNOR define for "hwp-internal" Capitalize CPUID and MSR defines Change '_' to '-' for energy-perf & act-window Read-modify-write MSRs updates Use FAST_IA32_HWP_REQUEST_MSR_ENABLE define constify pointer in hwp_set_misc_turbo Add space after non-fallthrough break in governor switch Add IA32_ENERGY_BIAS_MASK define Check CPUID_PM_LEAK for energy bias when needed Fail initialization with curr_req = -1 Fold hwp_read_capabilities into hwp_init_msrs Add command line cpufreq=xen:hwp Add command line cpufreq=xen:hdc Use per_cpu for hwp_drv_data pointers Move hwp_energy_perf_bias call into hwp_write_request energy_perf 0 is valid, so hwp_energy_perf_bias cannot be skipped Ensure we don't generate interrupts Remove Fast Write of Uncore MSR Initialize hwp_drv_data from curr_req Use SPDX line instead of license text in hwp.c v3: Add cf_check to cpufreq_gov_hwp_init() - Marek Print cpuid_level with %#x - Marek --- docs/misc/xen-command-line.pandoc | 8 +- xen/arch/x86/acpi/cpufreq/Makefile| 1 + xen/arch/x86/acpi/cpufreq/cpufreq.c | 5 +- xen/arch/x86/acpi/cpufreq/hwp.c | 506 ++ xen/arch/x86/include/asm/cpufeature.h | 13 +- xen/arch/x86/include/asm/msr-index.h | 13 + xen/drivers/cpufreq/cpufreq.c | 32 ++ xen/include/acpi/cpufreq/cpufreq.h| 3 + xen/include/acpi/cpufreq/processor_perf.h | 3 + xen/include/public/sysctl.h | 1 + 10 files changed, 581 insertions(+), 4 deletions(-) create mode 100644 xen/arch/x86/acpi/cpufreq/hwp.c diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index e0b89b7d33..aaa31f444b 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -499,7 +499,7 @@ If set, force use of the performance counters for oprofile, rather than detectin available support. ### cpufreq -> `=