On 2018-03-18 17:15, Rafael J. Wysocki wrote:
Doug, Thomas,

Thank you both for the reports, much appreciated!

Below is a drop-in v6 replacement for patch [4/7].

With this new patch applied instead of the [4/7] the behavior should be much more in line with the v4 behavior, so please try it if you can and let me know if that really is the case on your systems.

Patches [5-7/7] from the original v5 apply on top of it right away for me, but I've also created a git branch you can use to pull all of the series with the below included:

  git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
    idle-loop
Thanks for the git repo, that helps a lot. I have tested v6 on a Skylake desktop and server system as well as a Haswell server system. The odd idle behavior of v5 is gone. Some of the other findings may be made obsolete by the upcoming respin; I will retest.

Our originally observed Powernightmare pattern is effectively prevented, both when idle and with a synthetic trigger. However, I can reproduce simple workloads under which the revised menu governor wastes energy by going into *deeper* C-states than advisable.

Consider the Skylake server system, which has residencies in C1E of 20 us and C6 of 800 us. I use a small unsynchronized while(1) {usleep(300);} loop pinned to each core. While this is an artificial case, it is a very innocent one - easy to predict and regular. Between vanilla 4.16.0-rc5 and idle-loop/v6, the power consumption increases from 149.7 W to 158.1 W. On 4.16.0-rc5, the cores sleep almost entirely in C1E. With the patches applied, the cores spend ~75% of their sleep time in C6 and ~25% in C1E. The average time/usage for C1E is also lower with v6 at ~350 us rather than the ~550 us in C6 (and in C1E with the baseline). Generally the new menu governor seems to choose C1E if the next timer is an enabled sched timer - which occasionally splits the sleep interval into two C1E sleeps rather than one C6.

Manually disabling C6 reduces power consumption back to 149.5 W.

This is far from what I expected; I have not yet figured out why the patched menu governor decides to go to C6 under that workload. I have tested this previously with v4 and saw this behavior even without patch "7/7".

The results from Haswell-EP and Skylake desktop are similar.

The tests are with a 1000 Hz kernel because I wanted to amplify effects that happen when C-state residencies and tick timers are closer together. But I suspect the results will be similar with 300 Hz, as the impact from the sched tick interruption seems to be minor compared to the odd C-state selection.
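For anyone who wants to reproduce the workload, a minimal sketch of the sleep loop could look like the following. The function name poll_loop and its iteration limit are hypothetical additions so that the loop can terminate; nanosleep() is used here instead of usleep(), and CPU pinning is assumed to be done outside the program (one instance per core). This is not the poller-omp-300 program used for the measurements above.

```c
#include <assert.h>
#include <time.h>

/*
 * Sleep for interval_us microseconds, `iterations` times.
 * A negative iteration count loops forever, like while(1) {usleep(300);}.
 * Returns the number of completed sleeps.
 * (poll_loop is a hypothetical name, not taken from the original test.)
 */
static long poll_loop(long iterations, long interval_us)
{
    struct timespec req;
    long done = 0;

    assert(interval_us >= 0);
    req.tv_sec = interval_us / 1000000L;
    req.tv_nsec = (interval_us % 1000000L) * 1000L;

    while (iterations < 0 || done < iterations) {
        /* An interrupted sleep is not resumed; it still counts. */
        nanosleep(&req, NULL);
        done++;
    }
    return done;
}
```

Calling poll_loop(-1, 300) from main() in one pinned process per core should approximate the pattern described above.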
Some very raw illustrations, all from Skylake SP (2 == C1E, 3 == C6):

Power consumption (trigger-10-10 is the synthetic Powernightmare, poller-omp-300 is the parallel usleep(300) loop):
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_power.png

C-state utilization with the usleep(300) loop (as per /sys/.../stateN/time / wall time):
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_utilization.png

Average time spent in C-states (as /sys/.../stateN/time / /sys/.../stateN/usage):
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_skl_sp_poll_300_avg_time.png

Detailed look:
https://wwwpub.zih.tu-dresden.de/~tilsche/powernightmares/v6_poll_300_skl.png
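The two derived metrics used in the plots can be sketched as below. The struct and function names are hypothetical; the sketch assumes the caller has already read the cumulative stateN/time (microseconds) and stateN/usage counters at the start and end of a measurement window, since both counters only ever grow.

```c
#include <assert.h>
#include <stdint.h>

/* One reading of the stateN/time and stateN/usage counters (hypothetical type). */
struct cstate_sample {
    uint64_t time_us; /* cumulative time spent in the state, in microseconds */
    uint64_t usage;   /* cumulative number of times the state was entered */
};

/* Utilization: time spent in the state during the window / wall time. */
static double cstate_utilization(struct cstate_sample start,
                                 struct cstate_sample end,
                                 uint64_t wall_us)
{
    assert(wall_us > 0 && end.time_us >= start.time_us);
    return (double)(end.time_us - start.time_us) / (double)wall_us;
}

/* Average residency: time spent in the state / number of entries. */
static double cstate_avg_time_us(struct cstate_sample start,
                                 struct cstate_sample end)
{
    uint64_t entries = end.usage - start.usage;

    assert(end.time_us >= start.time_us && end.usage >= start.usage);
    /* No entries during the window: report 0 rather than divide by zero. */
    return entries ? (double)(end.time_us - start.time_us) / (double)entries
                   : 0.0;
}
```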
Thanks!

---
From: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Subject: [PATCH v6] cpuidle: Return nohz hint from cpuidle_select()

Add a new pointer argument to cpuidle_select() and to the ->select cpuidle governor callback to allow a boolean value indicating whether or not the tick should be stopped before entering the selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its current behavior) and make the menu governor return 'false' through it if:
 (1) the idle exit latency is constrained at 0,
 (2) the selected state is a polling one, or
 (3) the selected state is not deep enough.

Since the value returned through the new argument pointer is not used yet, this change is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
---

[cut]

@@ -354,6 +360,7 @@ static int menu_select(struct cpuidle_dr
 	if (latency_req > interactivity_req)
 		latency_req = interactivity_req;

+	expected_interval = TICK_USEC_HZ;
 	/*
 	 * Find the idle state with the lowest power while satisfying
 	 * our constraints.
@@ -367,17 +374,44 @@ static int menu_select(struct cpuidle_dr
 			continue;
 		if (idx == -1)
 			idx = i; /* first enabled state */
-		if (s->target_residency > data->predicted_us)
+		if (s->target_residency > data->predicted_us) {
+			/*
+			 * Retain the tick if the selected state is shallower
+			 * than the deepest available one with target residency
+			 * within the tick period range.
+			 *
+			 * This allows the tick to be stopped even if the
+			 * predicted idle duration is within the tick period
+			 * range to counter the effect by which the prediction
+			 * may be skewed towards lower values due to the tick
+			 * bias.
+			 */
+			expected_interval = s->target_residency;
 			break;

BTW, I guess I need to explain the motivation here more thoroughly, so here it goes.
The governor predicts idle duration under the assumption that the tick will be stopped, so if the result of the prediction is within the tick period range and it is not accurate, that needs to be taken into account in the governor's statistics.

However, if the tick is allowed to run every time the governor predicts idle duration within the tick period range, the governor will always see that it was "almost right" and the correction factor applied by it to improve the prediction next time will not be sufficient. For this reason, it is better to stop the tick at least sometimes when the governor predicts idle duration within the tick period range and the idea here is to do that when the selected state is the deepest available one with the target residency within the tick period range. This allows the opportunity to save more energy to be seized which balances the extra overhead of stopping the tick.

HTH
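To make the rule above easier to follow, here is a simplified user-space sketch of the selection loop and the resulting tick hint. It is not the kernel code: struct idle_state, select_state() and tick_us are hypothetical stand-ins, latency constraints and disabled states are left out, and the final stop_tick condition is an assumption, because the quoted hunk is cut before the point where expected_interval is consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct cpuidle_state. */
struct idle_state {
    const char *name;
    unsigned int target_residency_us;
    bool polling;
};

/*
 * Pick the deepest state whose target residency fits the predicted idle
 * duration, and report through *stop_tick whether the tick should be
 * stopped. States must be ordered from shallowest to deepest.
 */
static int select_state(const struct idle_state *states, size_t n,
                        unsigned int predicted_us, unsigned int tick_us,
                        bool *stop_tick)
{
    unsigned int expected_interval = tick_us;
    int idx = 0;
    size_t i;

    assert(n > 0 && stop_tick != NULL);

    for (i = 0; i < n; i++) {
        if (states[i].target_residency_us > predicted_us) {
            /* A deeper state was rejected: remember its residency. */
            expected_interval = states[i].target_residency_us;
            break;
        }
        idx = (int)i;
    }

    /*
     * Assumed rule: retain the tick for polling states and whenever a
     * rejected deeper state has a target residency below the tick period.
     */
    *stop_tick = !states[idx].polling && expected_interval >= tick_us;
    return idx;
}
```

With states POLL (0 us), C1 (2 us, an assumed value), C1E (20 us) and C6 (800 us) and a 1000 us tick period, a 300 us prediction selects C1E and retains the tick in this sketch, while a 900 us prediction selects C6 and stops the tick.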
--
Dipl. Inf. Thomas Ilsche
Computer Scientist
Highly Adaptive Energy-Efficient Computing
CRC 912 HAEC: http://tu-dresden.de/sfb912
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: +49 351 463-42168
Fax: +49 351 463-37773
E-Mail: thomas.ils...@tu-dresden.de