On 01.05.2023 21:30, Jason Andryuk wrote:
> For cpufreq=xen:hwp, placing the option inside the governor wouldn't
> work.  Users would have to select the hwp-internal governor to turn off
> hwp support.

I'm afraid I don't understand this, and you'll find a comment towards
this further down. Even when ...

>  hwp-internal isn't usable without hwp, and users wouldn't
> be able to select a different governor.  That doesn't matter while hwp
> defaults off, but it would if or when hwp defaults to enabled.

... it starts defaulting to enabled, selecting another governor can
simply have the side effect of turning off hwp.

> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Definitely better to be on the safe side.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> oprofile, rather than detectin
>  available support.
>  
>  ### cpufreq
> -> `= none | {{ <boolean> | xen } 
> [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]}
>  | dom0-kernel`
> +> `= none | {{ <boolean> | xen } 
> [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]}
>  | dom0-kernel`

Considering you use a special internal governor, the 4 governor alternatives are
meaningless for hwp. Hence at the command line level recognizing "hwp" as if it
was another governor name would seem better to me. This would then also get rid
of one of the two special "no-" prefix parsing cases (which I'm not overly
happy about).

Even if not done that way I'm puzzled by the way you spell out the interaction
of "hwp" and "hdc": As you say in the description, "hdc" is meaningful only when
"hwp" was specified, so even if not merged with the governors group "hwp" should
come first, and "hdc" ought to be rejected if "hwp" wasn't first specified. (The
way you've spelled it out it actually looks to be kind of the other way around.)

Strictly speaking "maxfreq" and "minfreq" also should be objected to when "hwp"
was specified.

Overall I'm getting the impression that beyond your "verbose" related adjustment
more is needed, if you're meaning to get things closer to how we parse the
option (splitting across multiple lines to help see what I mean):

`= none
 | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
                          [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
                          [,verbose]]}
 | dom0-kernel`

(We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
maxfreq, but better be more tight in the doc than too relaxed.)

Furthermore while max/min freq don't apply directly, there are still two MSRs
controlling bounds at the package and logical processor levels.

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,506 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandr...@gmail.com>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>
> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool feature_hwp;
> +static bool feature_hwp_notification;
> +static bool feature_hwp_activity_window;
> +static bool feature_hwp_energy_perf;
> +static bool feature_hwp_pkg_level_ctl;
> +static bool feature_hwp_peci;
> +
> +static bool feature_hdc;

Most (all?) of these want to be __ro_after_init, I expect.

> +__initdata bool opt_cpufreq_hwp = false;
> +__initdata bool opt_cpufreq_hdc = true;

Nit (style): Please put annotations after the type.

> +#define HWP_ENERGY_PERF_BALANCE         0x80
> +#define IA32_ENERGY_BIAS_BALANCE        0x7
> +#define IA32_ENERGY_BIAS_MAX_POWERSAVE  0xf
> +#define IA32_ENERGY_BIAS_MASK           0xf
> +
> +union hwp_request
> +{
> +    struct
> +    {
> +        uint64_t min_perf:8;
> +        uint64_t max_perf:8;
> +        uint64_t desired:8;
> +        uint64_t energy_perf:8;
> +        uint64_t activity_window:10;
> +        uint64_t package_control:1;
> +        uint64_t reserved:16;
> +        uint64_t activity_window_valid:1;
> +        uint64_t energy_perf_valid:1;
> +        uint64_t desired_valid:1;
> +        uint64_t max_perf_valid:1;
> +        uint64_t min_perf_valid:1;

The boolean fields here would probably better be of type "bool". I also
don't see the need for using uint64_t for any of the other fields -
unsigned int will be quite fine, I think. Only ...

> +    };
> +    uint64_t raw;

... this wants to keep this type. (Same again below then.)

> +bool __init hwp_available(void)
> +{
> +    unsigned int eax, ecx, unused;
> +    bool use_hwp;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +                    boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq 
> info\n",
> +                 boot_cpu_data.cpuid_level);
> +        return false;
> +    }
> +
> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> +
> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> +    {
> +        hwp_verbose("HWP disabled: No energy/performance preference 
> available");
> +        return false;
> +    }
> +
> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> +    feature_hwp_energy_perf     =
> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> +    feature_hwp_pkg_level_ctl   = eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> +
> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d 
> pkg-level: %d peci: %d\n",
> +                feature_hwp, feature_hwp_notification,
> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> +
> +    if ( !feature_hwp )
> +        return false;
> +
> +    feature_hdc = eax & CPUID6_EAX_HDC;
> +
> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> +                feature_hdc ? "" : "not ",
> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> +                            : "");
> +
> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> +
> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");

You report this, but you don't really use it?

> +    use_hwp = feature_hwp && opt_cpufreq_hwp;

There's a lot of output you may produce until you make it here, which is
largely meaningless when opt_cpufreq_hwp == false. Is there a reason you
don't check that flag first thing in the function?

> +static void hdc_set_pkg_hdc_ctl(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~IA32_PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PKG_HDC_CTL): %016lx\n", msr);
> +}
> +
> +static void hdc_set_pm_ctl1(bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n");
> +
> +        return;
> +    }
> +
> +    if ( val )
> +        msr |= IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +    else
> +        msr &= ~IA32_PM_CTL1_HDC_ALLOW_BLOCK;
> +
> +    if ( wrmsr_safe(MSR_IA32_PM_CTL1, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_PM_CTL1): %016lx\n", msr);
> +}

For both functions: Elsewhere you also log the affected CPU in hwp_err().
Without this I'm not convinced the logging here is very useful. In fact I
wonder whether hwp_err() shouldn't take care of this and/or the "error"
part of the string literal. A HWP: prefix might also not be bad ...

> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> +    uint32_t base_khz, max_khz, bus_khz, edx;
> +
> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> +    /* aperf/mperf scales base. */
> +    policy->cpuinfo.perf_freq = base_khz * 1000;
> +    policy->cpuinfo.min_freq = base_khz * 1000;
> +    policy->cpuinfo.max_freq = max_khz * 1000;
> +    policy->min = base_khz * 1000;
> +    policy->max = max_khz * 1000;
> +    policy->cur = 0;

What is the comment intended to be telling me here?

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", 
> policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_IA32_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & IA32_PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= IA32_PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_IA32_PM_ENABLE, val) )
> +        {
> +            hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_PM_ENABLE, %lx)\n",
> +                    policy->cpu, val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n",
> +                policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", 
> policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    if ( !feature_hwp_energy_perf ) {

Nit: Brace placement.

> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> +        {
> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +            data->curr_req.raw = -1;
> +
> +            return;
> +        }
> +
> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> +    }

In order to not need to undo the "enable" you've already done, maybe that
should move down here? With all the sanity checking you do here, maybe
you should also check that the write of the enable bit actually took
effect?

> +/* val 0 - highest performance, 15 - maximum energy savings */
> +static void hwp_energy_perf_bias(const struct hwp_drv_data *data)
> +{
> +    uint64_t msr;
> +    uint8_t val = data->energy_perf;
> +
> +    ASSERT(val <= IA32_ENERGY_BIAS_MAX_POWERSAVE);
> +
> +    if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +    {
> +        hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> +
> +        return;
> +    }
> +
> +    msr &= ~IA32_ENERGY_BIAS_MASK;
> +    msr |= val;
> +
> +    if ( wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, msr) )
> +        hwp_err("error wrmsr_safe(MSR_IA32_ENERGY_PERF_BIAS): %016lx\n", 
> msr);
> +}
> +
> +static void cf_check hwp_write_request(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
> +    if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) )
> +    {
> +        hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n",
> +                policy->cpu, hwp_req.raw);
> +        rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw);
> +    }
> +
> +    if ( !feature_hwp_energy_perf )
> +        hwp_energy_perf_bias(data);
> +
> +}
> +
> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                                       unsigned int target_freq,
> +                                       unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    /* Zero everything to ensure reserved bits are zero... */
> +    union hwp_request hwp_req = { .raw = 0 };
> +
> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;
> +    if ( feature_hwp_energy_perf )
> +        hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req = hwp_req;
> +
> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
> +
> +    return 0;
> +}

If I'm not mistaken these 3 functions can only be reached from the user
space tool (via set_cpufreq_para()). On that path I don't think there
should be any hwp_err(); definitely not in non-verbose mode. Instead it
would be good if a sensible error code could be reported back. (Same
then for hwp_cpufreq_update() and its helper.)

> --- a/xen/arch/x86/include/asm/cpufeature.h
> +++ b/xen/arch/x86/include/asm/cpufeature.h
> @@ -46,8 +46,17 @@ extern struct cpuinfo_x86 boot_cpu_data;
>  #define cpu_has(c, bit)              test_bit(bit, (c)->x86_capability)
>  #define boot_cpu_has(bit)    test_bit(bit, boot_cpu_data.x86_capability)
>  
> -#define CPUID_PM_LEAF                    6
> -#define CPUID6_ECX_APERFMPERF_CAPABILITY 0x1
> +#define CPUID_PM_LEAF                                6
> +#define CPUID6_EAX_HWP                               (_AC(1, U) <<  7)
> +#define CPUID6_EAX_HWP_NOTIFICATION                  (_AC(1, U) <<  8)
> +#define CPUID6_EAX_HWP_ACTIVITY_WINDOW               (_AC(1, U) <<  9)
> +#define CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE (_AC(1, U) << 10)
> +#define CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST         (_AC(1, U) << 11)
> +#define CPUID6_EAX_HDC                               (_AC(1, U) << 13)
> +#define CPUID6_EAX_HWP_PECI                          (_AC(1, U) << 16)
> +#define CPUID6_EAX_HW_FEEDBACK                       (_AC(1, U) << 19)

Perhaps better without open-coding BIT()?

I also find it a little odd that e.g. bit 17 is left out here despite you
declaring the 5 "valid" bits in union hwp_request (which are qualified by
this CPUID bit afaict).

> +#define CPUID6_ECX_APERFMPERF_CAPABILITY             0x1
> +#define CPUID6_ECX_IA32_ENERGY_PERF_BIAS             0x8

Why not the same form here?

> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -151,6 +151,13 @@
>  
>  #define MSR_PKRS                            0x000006e1
>  
> +#define MSR_IA32_PM_ENABLE                  0x00000770
> +#define  IA32_PM_ENABLE_HWP_ENABLE          (_AC(1, ULL) <<  0)
> +
> +#define MSR_IA32_HWP_CAPABILITIES           0x00000771
> +#define MSR_IA32_HWP_INTERRUPT              0x00000773
> +#define MSR_IA32_HWP_REQUEST                0x00000774

I think for new MSRs being added here in particular Andrew would like to
see the IA32 infixes omitted. (I'd extend this then to
CPUID6_ECX_IA32_ENERGY_PERF_BIAS as well.)

> @@ -165,6 +172,11 @@
>  #define  PASID_PASID_MASK                   0x000fffff
>  #define  PASID_VALID                        (_AC(1, ULL) << 31)
>  
> +#define MSR_IA32_PKG_HDC_CTL                0x00000db0
> +#define  IA32_PKG_HDC_CTL_HDC_PKG_ENABLE    (_AC(1, ULL) <<  0)

The name has two redundant infixes, which looks odd, but then I can't
suggest any better without going too much out of sync with the SDM.

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -565,6 +565,38 @@ static void cpufreq_cmdline_common_para(struct 
> cpufreq_policy *new_policy)
>  
>  static int __init cpufreq_handle_common_option(const char *name, const char 
> *val)
>  {
> +    if (!strcmp(name, "hdc")) {
> +        if (val) {
> +            int ret = parse_bool(val, NULL);
> +            if (ret != -1) {
> +                opt_cpufreq_hdc = ret;
> +                return 1;
> +            }
> +        } else {
> +            opt_cpufreq_hdc = true;
> +            return 1;
> +        }
> +    } else if (!strcmp(name, "no-hdc")) {
> +        opt_cpufreq_hdc = false;
> +        return 1;
> +    }

I think recognizing a "no-" prefix would want to be separated out, and be
restricted to val being NULL. It would result in val being pointed at the
string "no" (or "off" or anything else parse_bool() recognizes as negative
indicator).

Yet if, as suggested above, "hwp" became a "fake" governor also when
parsing the command line, "hdc" could actually be handled in its
handle_option() hook.

Jan

Reply via email to