Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
Hi Doug,

Clearly you are a "discerning user", who understands the limitations
of the kernel sysfs interface, both new and old, for communicating frequency.

With the limitations of the (old and new) sysfs interfaces,
why are you using them, rather than turbostat?

>> As with previous methods of calculating MHz,
>> idle time is excluded.
>
> Which makes the response time to a correct answer
> asymmetric. i.e. removal of a load on a CPU will
> linger much much longer than adding a load on a CPU.

If the measurement interval is not defined,
then a "correct answer" is also not defined.

Users now have the capability to define the measurement interval.
Before, they didn't; they could only observe that it was
"short enough that it looks current".
Others may want to measure frequency over a longer (or even known) interval,
and the previous code made that impossible.

Also, you may be interested to know that in HWP mode,
intel_pstate used to have a periodic timer whose _only_ job
was to wake up the CPU so that the driver could update
the frequency statistic for sysfs. (now it is a scheduler callback)

Sure, a "discerning user" may have noticed that they have "fresh"
data in sysfs, but most users were better served by not having
a timer fire to refresh data that they'll never consume...

The new code never runs at all, unless the user asks it to.

> Somehow, somewhere along the way, turbostat no longer seems
> to use base_MHz based on the actual TSC. It used to.

True, though not directly related to this thread...

On current and future Intel hardware, base_mhz and TSC rate
are not in the same clock domain. Only on very specific configurations
are those clock rates now equal.

>> +	/* Don't bother re-computing within 10 ms */
>> +	if (time_before(jiffies, s->jiffies + HZ/100))
>> +		return;
>
> The above condition would be 8 mSec on a 250 Hertz kernel,
> wouldn't it?
> (I don't care, I'm just saying.)

True. We could replace the "10ms" comment with "typically 10ms",
or "recently".

Note that the value here isn't precise; it is there just to prevent
wasted overhead. The previous version of this patch was equally
valid with a value 10x larger.

> Summary:
>
> . There no longer seems to be a way to check the CPU frequency without
> affecting the processor (i.e. forcing a wakeup),
> thereby potentially influencing the system under test.

This has always been true; it is just that the wakeups used to happen
inside the kernel -- whether you consumed the answer or not.

> . Yes, the old way might have been a "lie", but in some situations it was
> much much less of a "lie", and took data that
> was already available (and at the very maximum 4 seconds old), and didn't
> force a wakeup, thus monitoring CPU frequency
> was a negligible perturbation to the system.

Frequency data isn't "already available", it has to be measured.
A measurement is not valid unless it is made over a known
measurement interval.

> . Now the data is as old as the time the command was run, which might be
> hours.

True, under controlled conditions, the sysfs measurement interval
could be days or months long.

If a known interval is desired, then something needs to provoke
a read of the attribute at the start of the interval of interest.

Yes, we could do this inside the kernel, but then that would
add overhead to the system for the vast majority of users who
never even read this attribute, and it would also take control
of the interval away from the user.

Making this interface more complex inside the kernel doesn't
seem like a prudent path to go down when turbostat already
exists and can already measure concurrent/overlapping intervals
of arbitrary length in user-space.

While I still haven't gleaned exactly what you are trying to measure,
I'm very much interested to know if/why you can't measure it using
the new sysfs attribute semantics, or better yet, using turbostat.

thanks,
Len Brown, Intel Open Source Technology Center
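The remark that "something needs to provoke a read of the attribute at the start of the interval of interest" can be illustrated from user space. What follows is a minimal sketch, not anything from the patch: the helper names read_khz() and khz_over_interval() are hypothetical, and the caller is assumed to pass a path such as /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq. The first read only resets the per-CPU snapshot and its value is thrown away; the second read then reports the busy-time average over the chosen interval.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Read one integer kHz value from a cpufreq sysfs attribute.
 * Returns -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, for illustration only.)
 */
long read_khz(const char *path)
{
	FILE *f = fopen(path, "r");
	long khz = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &khz) != 1)
		khz = -1;
	fclose(f);
	return khz;
}

/*
 * Report the average busy frequency over a caller-chosen interval.
 * The first read marks the start of the interval and is discarded;
 * the second read returns the average since that first read.
 */
long khz_over_interval(const char *path, unsigned int seconds)
{
	if (read_khz(path) < 0)
		return -1;
	sleep(seconds);
	return read_khz(path);
}
```

Two caveats apply under the semantics described in this thread. Any other reader of the same CPU's attribute during the sleep also takes a snapshot and so shortens the interval, and a second read issued within the HZ/100 re-computation window simply returns the previously computed value. turbostat avoids both because it keeps its own counter snapshots.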
RE: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
Sorry to be late to the party on this one:

On 2017.06.23 10:12 Len Brown wrote:

> The goal of this change is to give users a uniform and meaningful
> result when they read /sys/...cpufreq/scaling_cur_freq
> on modern x86 hardware, as compared to what they get today.

Myself, I like what I got then, and not what I get now.

> Modern x86 processors include the hardware needed
> to accurately calculate frequency over an interval --
> APERF, MPERF, and the TSC.
>
> Here we provide an x86 routine to make this calculation
> on supported hardware, and use it in preference to any
> driver-specific cpufreq_driver.get() routine.
>
> MHz is computed like so:
>
> MHz = base_MHz * delta_APERF / delta_MPERF

Yes, thanks very much.

> MHz is the average frequency of the busy processor
> over a measurement interval. The interval is
> defined to be the time between successive invocations
> of aperfmperf_khz_on_cpu(), which are expected to
> happen on-demand when users read sysfs attribute
> cpufreq/scaling_cur_freq.

Yes, but that can be hours apart, resulting in useless information.
This threw me for a loop for several days.

> As with previous methods of calculating MHz,
> idle time is excluded.

Which makes the response time to a correct answer
asymmetric. i.e. removal of a load on a CPU will
linger much much longer than adding a load on a CPU.

> base_MHz above is from TSC calibration global "cpu_khz".

Yes, thank you very much.

> This x86 native method to calculate MHz returns a meaningful result
> no matter if P-states are controlled by hardware or firmware
> and/or if the Linux cpufreq sub-system is or is-not installed.
>
> When this routine is invoked more frequently, the measurement
> interval becomes shorter. However, the code limits re-computation
> to 10ms intervals so that average frequency remains meaningful.
>
> Discerning users are encouraged to take advantage of
> the turbostat(8) utility, which can gracefully handle
> concurrent measurement intervals of arbitrary length.

Somehow, somewhere along the way, turbostat no longer seems
to use base_MHz based on the actual TSC. It used to.

> Signed-off-by: Len Brown
> ---
> arch/x86/kernel/cpu/Makefile     |  1 +
> arch/x86/kernel/cpu/aperfmperf.c | 79
> drivers/cpufreq/cpufreq.c        | 12 +-
> include/linux/cpufreq.h          |  2 +
> 4 files changed, 93 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

... [deleted some] ...

> + * aperfmperf_snapshot_khz()
> + * On the current CPU, snapshot APERF, MPERF, and jiffies
> + * unless we already did it within 10ms

Well, it'll be 8 mSec on a 250 Hz kernel.
There is no maximum time defined, so the interval can be anything,
and therefore the result can be dominated by stale information.

> + * calculate kHz, save snapshot
> + */
> +static void aperfmperf_snapshot_khz(void *dummy)
> +{
> +	u64 aperf, aperf_delta;
> +	u64 mperf, mperf_delta;
> +	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> +
> +	/* Don't bother re-computing within 10 ms */
> +	if (time_before(jiffies, s->jiffies + HZ/100))
> +		return;

The above condition would be 8 mSec on a 250 Hertz kernel,
wouldn't it?
(I don't care, I'm just saying.)

__

A long boring story is copied below, but it also includes my test data.

Summary:

. There no longer seems to be a way to check the CPU frequency without
  affecting the processor (i.e. forcing a wakeup),
  thereby potentially influencing the system under test.

. Yes, the old way might have been a "lie", but in some situations it was
  much much less of a "lie", and took data that
  was already available (and at the very maximum 4 seconds old), and didn't
  force a wakeup, thus monitoring CPU frequency
  was a negligible perturbation to the system.

. Now the data is as old as the time the command was run, which might be
  hours.
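The "dominated by stale information" concern can be made concrete with the formula quoted above, MHz = base_MHz * delta_APERF / delta_MPERF. The sketch below is an illustration under stated assumptions, not code from the patch: it assumes MPERF advances at the base rate and APERF at the actual rate while the CPU is busy, and it uses a rounded, hypothetical base of 3400 MHz. The names busy_segment and reported_mhz() are made up for this example.

```c
#include <stddef.h>

/* One stretch of busy (non-idle) time at a constant actual frequency. */
struct busy_segment {
	double busy_seconds;
	double mhz;
};

/*
 * Model what a single read would report for an interval made of the
 * given busy segments: base_MHz * delta_APERF / delta_MPERF.
 * Assumption: MPERF counts at base_mhz and APERF at the actual
 * frequency, both only while the CPU is busy.
 */
double reported_mhz(double base_mhz, const struct busy_segment *seg, size_t n)
{
	double aperf_delta = 0.0;
	double mperf_delta = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		aperf_delta += seg[i].mhz * 1e6 * seg[i].busy_seconds;
		mperf_delta += base_mhz * 1e6 * seg[i].busy_seconds;
	}
	if (mperf_delta == 0.0)
		return 0.0;	/* no busy time in the interval */
	return base_mhz * aperf_delta / mperf_delta;
}
```

With 100 busy seconds at 1600 MHz followed by 1 busy second at 3400 MHz, a read covering the whole span reports roughly 1618 MHz, even though the most recent second ran at 3400 MHz. A read covering only that last second reports 3400 MHz, which is why the choice of interval matters.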
For reference my test computer contains an i7-2600K processor,
and TSC is 3411.1043 MHz. Minimum pstate 16.

I did follow the e-mail thread [1] about changes to the
"cpu MHz" line from /proc/cpuinfo, and expected it to have changed,
and indeed, it only ever prints TSC now and never changes.
Whereas with kernel 4.12 it printed the actual CPU frequency,
albeit with the limitations stated in the e-mail thread, which I have
always understood and accepted.

O.K. so now it is useless as an actual CPU frequency inquiry tool.

Now, there are two other methods (well three if one includes turbostat)
for observing CPU frequency:

The "sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq" method,
works the same as it did in the past (well, there is another active thread
about issues with it), but requires root access.

And the "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq" method,
which works fine with kernel 4.12, but seems to give incorrect information
with kernel 4.13-rc1, unless one inquires two or more times
Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
On Sat, Jun 24, 2017 at 10:56 AM, Thomas Gleixner wrote:
> On Fri, 23 Jun 2017, Len Brown wrote:
>> This x86 native method to calculate MHz returns a meaningful result
>> no matter if P-states are controlled by hardware or firmware
>> and/or if the Linux cpufreq sub-system is or is-not installed.
>>
>> When this routine is invoked more frequently, the measurement
>> interval becomes shorter. However, the code limits re-computation
>> to 10ms intervals so that average frequency remains meaningful.
>>
>> Discerning users are encouraged to take advantage of
>> the turbostat(8) utility, which can gracefully handle
>> concurrent measurement intervals of arbitrary length.
>>
>> Signed-off-by: Len Brown
>
> Reviewed-by: Thomas Gleixner
>
> Raphael, please take the whole lot through the cpufreq tree.

I will, thanks!

Rafael
Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
On Fri, 23 Jun 2017, Len Brown wrote:
> This x86 native method to calculate MHz returns a meaningful result
> no matter if P-states are controlled by hardware or firmware
> and/or if the Linux cpufreq sub-system is or is-not installed.
>
> When this routine is invoked more frequently, the measurement
> interval becomes shorter. However, the code limits re-computation
> to 10ms intervals so that average frequency remains meaningful.
>
> Discerning users are encouraged to take advantage of
> the turbostat(8) utility, which can gracefully handle
> concurrent measurement intervals of arbitrary length.
>
> Signed-off-by: Len Brown

Reviewed-by: Thomas Gleixner

Raphael, please take the whole lot through the cpufreq tree.

Thanks,

	tglx
[PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF
From: Len Brown

The goal of this change is to give users a uniform and meaningful
result when they read /sys/...cpufreq/scaling_cur_freq
on modern x86 hardware, as compared to what they get today.

Modern x86 processors include the hardware needed
to accurately calculate frequency over an interval --
APERF, MPERF, and the TSC.

Here we provide an x86 routine to make this calculation
on supported hardware, and use it in preference to any
driver-specific cpufreq_driver.get() routine.

MHz is computed like so:

MHz = base_MHz * delta_APERF / delta_MPERF

MHz is the average frequency of the busy processor
over a measurement interval. The interval is
defined to be the time between successive invocations
of aperfmperf_khz_on_cpu(), which are expected to
happen on-demand when users read sysfs attribute
cpufreq/scaling_cur_freq.

As with previous methods of calculating MHz,
idle time is excluded.

base_MHz above is from TSC calibration global "cpu_khz".

This x86 native method to calculate MHz returns a meaningful result
no matter if P-states are controlled by hardware or firmware
and/or if the Linux cpufreq sub-system is or is-not installed.

When this routine is invoked more frequently, the measurement
interval becomes shorter. However, the code limits re-computation
to 10ms intervals so that average frequency remains meaningful.

Discerning users are encouraged to take advantage of
the turbostat(8) utility, which can gracefully handle
concurrent measurement intervals of arbitrary length.

Signed-off-by: Len Brown
---
 arch/x86/kernel/cpu/Makefile     |  1 +
 arch/x86/kernel/cpu/aperfmperf.c | 79
 drivers/cpufreq/cpufreq.c        | 12 +-
 include/linux/cpufreq.h          |  2 +
 4 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 521..cdf8249 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -21,6 +21,7 @@ obj-y			+= common.o
 obj-y			+= rdrand.o
 obj-y			+= match.o
 obj-y			+= bugs.o
+obj-$(CONFIG_CPU_FREQ)	+= aperfmperf.o
 
 obj-$(CONFIG_PROC_FS)	+= proc.o
 obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 000..d869c86
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,79 @@
+/*
+ * x86 APERF/MPERF KHz calculation for
+ * /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2017 Intel Corp.
+ * Author: Len Brown
+ *
+ * This file is licensed under GPLv2.
+ */
+
+#include <linux/jiffies.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+
+struct aperfmperf_sample {
+	unsigned int	khz;
+	unsigned long	jiffies;
+	u64	aperf;
+	u64	mperf;
+};
+
+static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
+
+/*
+ * aperfmperf_snapshot_khz()
+ * On the current CPU, snapshot APERF, MPERF, and jiffies
+ * unless we already did it within 10ms
+ * calculate kHz, save snapshot
+ */
+static void aperfmperf_snapshot_khz(void *dummy)
+{
+	u64 aperf, aperf_delta;
+	u64 mperf, mperf_delta;
+	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+
+	/* Don't bother re-computing within 10 ms */
+	if (time_before(jiffies, s->jiffies + HZ/100))
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	aperf_delta = aperf - s->aperf;
+	mperf_delta = mperf - s->mperf;
+
+	/*
+	 * There is no architectural guarantee that MPERF
+	 * increments faster than we can read it.
+	 */
+	if (mperf_delta == 0)
+		return;
+
+	/*
+	 * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then
+	 *	khz = (cpu_khz * aperf_delta) / mperf_delta
+	 */
+	if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta)
+		s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
+	else	/* khz = aperf_delta / (mperf_delta / cpu_khz) */
+		s->khz = div64_u64(aperf_delta,
+			div64_u64(mperf_delta, cpu_khz));
+	s->jiffies = jiffies;
+	s->aperf = aperf;
+	s->mperf = mperf;
+}
+
+unsigned int arch_freq_get_on_cpu(int cpu)
+{
+	if (!cpu_khz)
+		return 0;
+
+	if (!static_cpu_has(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+
+	return per_cpu(samples.khz, cpu);
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 26b643d..6e7424d 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -632,11 +632,21 @@ show_one(cpuinfo_transition_latency, cpuinfo.transition_latency);
 show_one(scaling_min_freq, min);
 show_one(scaling_max_freq, max);
 
+__weak unsigned int
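
The arithmetic in aperfmperf_snapshot_khz() can be tried outside the kernel. The following is a user-space model only, with hypothetical names throttle_window_ms() and khz_from_deltas(); it uses plain 64-bit division where the patch uses div64_u64(), returns 0 where the patch returns early and keeps the previous sample, and adds one guard that is not in the patch. It shows why HZ/100 is nominally 8 ms at HZ=250, and how the fallback branch avoids overflowing the 64-bit product.

```c
#include <stdint.h>

/*
 * Nominal length, in milliseconds, of the "HZ/100 jiffies" window.
 * HZ/100 is an integer division, so HZ=250 gives 2 ticks of 4 ms each.
 */
unsigned int throttle_window_ms(unsigned int hz)
{
	unsigned int ticks;

	if (hz == 0)
		return 0;
	ticks = hz / 100;
	return ticks * 1000 / hz;
}

/*
 * Model of the kHz computation: multiply first when the product fits
 * in 64 bits, otherwise divide mperf_delta by cpu_khz first.
 */
uint64_t khz_from_deltas(uint64_t cpu_khz, uint64_t aperf_delta,
			 uint64_t mperf_delta)
{
	uint64_t scaled;

	if (cpu_khz == 0 || mperf_delta == 0)
		return 0;

	/* cpu_khz * aperf_delta fits into 64 bits */
	if (UINT64_MAX / cpu_khz > aperf_delta)
		return cpu_khz * aperf_delta / mperf_delta;

	/* khz = aperf_delta / (mperf_delta / cpu_khz) */
	scaled = mperf_delta / cpu_khz;
	if (scaled == 0)	/* extra guard, not present in the patch */
		return 0;
	return aperf_delta / scaled;
}
```

For a cpu_khz of about 3.4 million, the multiply-first branch stops being usable once aperf_delta exceeds roughly 5.4e12 cycles, which is on the order of half an hour of busy time at that clock rate. Long sysfs intervals therefore do reach the fallback branch, where the inner integer division loses some precision.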