From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com> This patch exposes those extended CEDE states to the cpuidle framework which are responsive to external interrupts and do not need an H_PROD.
Since as per the PAPR, all the extended CEDE states are non-responsive to timers, we indicate this to the cpuidle subsystem via the CPUIDLE_FLAG_TIMER_STOP flag for all those extende CEDE states which can wake up on external interrupts. With the patch, we are able to see the extended CEDE state with latency hint = 1 exposed via the cpuidle framework. $ cpupower idle-info CPUidle driver: pseries_idle CPUidle governor: menu analyzing CPU 0: Number of idle states: 3 Available idle states: snooze CEDE XCEDE1 snooze: Flags/Description: snooze Latency: 0 Usage: 33429446 Duration: 27006062 CEDE: Flags/Description: CEDE Latency: 1 Usage: 10272 Duration: 110786770 XCEDE1: Flags/Description: XCEDE1 Latency: 12 Usage: 26445 Duration: 1436433815 Benchmark results: TLDR: Over all we do not see any additional benefit from having XCEDE1 over CEDE. ebizzy : 2 threads bound to a big-core. With this patch, we see a 3.39% regression compared to with only CEDE0 latency fixup. x With only CEDE0 latency fixup * With CEDE0 latency fixup + CEDE1 N Min Max Median Avg Stddev x 10 2893813 5834474 5832448 5327281.3 1055941.4 * 10 2907329 5834923 5831398 5146614.6 1193874.8 context_switch2: With the context_switch2 there are no observable regressions in the results. context_switch2 CPU0 CPU1 (Same Big-core, different small-cores). No difference with and without patch. x without_patch * with_patch N Min Max Median Avg Stddev x 500 343644 348778 345444 345584.02 1035.1658 * 500 344310 347646 345776 345877.22 802.19501 context_switch2 CPU0 CPU8 (different big-cores). Minor 0.05% improvement with patch x without_patch * with_patch N Min Max Median Avg Stddev x 500 287562 288756 288162 288134.76 262.24328 * 500 287874 288960 288306 288274.66 187.57034 schbench: No regressions observed with schbench Without Patch: Latency percentiles (usec) 50.0th: 29 75.0th: 40 90.0th: 50 95.0th: 61 *99.0th: 13648 99.5th: 14768 99.9th: 15664 min=0, max=29812 With Patch: Latency percentiles (usec) 50.0th: 30 75.0th: 40 90.0th: 51 95.0th: 59 *99.0th: 13616 99.5th: 14512 99.9th: 15696 min=0, max=15996 Signed-off-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com> --- drivers/cpuidle/cpuidle-pseries.c | 50 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 502f906..6f893cd 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -362,9 +362,59 @@ static int add_pseries_idle_states(void) for (i = 0; i < nr_xcede_records; i++) { u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks; u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + char name[CPUIDLE_NAME_LEN]; + unsigned int latency_hint = xcede_records[i].latency_hint; + u64 residency_us; + + if (!xcede_records[i].responsive_to_irqs) { + pr_info("cpuidle : Skipping XCEDE%d. Not responsive to IRQs\n", + latency_hint); + continue; + } if (latency_us < min_latency_us) min_latency_us = latency_us; + snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint); + + /* + * As per the section 14.14.1 of PAPR version 2.8.1 + * says that alling H_CEDE with the value of the cede + * latency specifier set greater than zero allows the + * processor timer facility to be disabled (so as not + * to cause gratuitous wake-ups - the use of H_PROD, + * or other external interrupt is required to wake the + * processor in this case). + * + * So, inform the cpuidle-subsystem that the timer + * will be stopped for these states. + * + * Also, bump up the latency by 10us, since cpuidle + * would use timer-offload framework which will need + * to send an IPI to wakeup a CPU whose timer has + * expired. + */ + if (latency_hint > 0) { + dedicated_states[nr_states].flags = CPUIDLE_FLAG_TIMER_STOP; + latency_us += 10; + } + + /* + * Thumb rule : Reside in the XCEDE state for at least + * 10x the time required to enter and exit that state. + */ + residency_us = latency_us * 10; + + strlcpy(dedicated_states[nr_states].name, (const char *)name, + CPUIDLE_NAME_LEN); + strlcpy(dedicated_states[nr_states].desc, (const char *)name, + CPUIDLE_NAME_LEN); + dedicated_states[nr_states].exit_latency = latency_us; + dedicated_states[nr_states].target_residency = residency_us; + dedicated_states[nr_states].enter = &dedicated_cede_loop; + cede_latency_hint[nr_states] = latency_hint; + pr_info("cpuidle : Added %s. latency-hint = %d\n", + name, latency_hint); + nr_states++; } /* -- 1.9.4