Re: [PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons
On Mon, Jan 24, 2022 at 03:44:53PM +0100, Jan Beulich wrote: > On 20.01.2022 16:52, Roger Pau Monné wrote: > > On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: > >> From: Artem Bityutskiy > >> Unlike Linux we want to disable IRQs again after MWAITing, as > >> subsequently invoked functions assume so. > > > > I'm also wondering whether there could be interrupts that rely on some > > of the housekeeping that's done when returning from mwait. I guess > > it's unlikely for an interrupt handler to have anything to do with the > > TSC not having been restored. > > Actually this is a good point you make: We don't want to enable > IRQs when cstate_restore_tsc() is not a no-op, or else we might > confuse the time rendezvous. (I thought that I would remember > TSC to stop only in deeper C-states, but maybe I'm mixing this up > with the APIC timer.) There's a comment in time.c that mentions the TSC only stopping in 'deep C states'. Also note that in that case the rendezvous function already updates the TSC, so I'm not sure whether calling it with an out of date TSC would be harmful - it will be updated anyway to match the master TSC. Might be safer to disable interrupts unconditionally on CPUs that don't have a non-stop TSC just to be on the safe side. Thanks, Roger.
Re: [PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons
On 20.01.2022 16:52, Roger Pau Monné wrote: > On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: >> From: Artem Bityutskiy >> >> Enable local interrupts before requesting C1 on the last two generations >> of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. >> This decreases average C1 interrupt latency by about 5-10%, as measured >> with the 'wult' tool. >> >> The '->enter()' function of the driver enters C-states with local >> interrupts disabled by executing the 'monitor' and 'mwait' pair of >> instructions. If an interrupt happens, the CPU exits the C-state and >> continues executing instructions after 'mwait'. It does not jump to >> the interrupt handler, because local interrupts are disabled. The >> cpuidle subsystem enables interrupts a bit later, after doing some >> housekeeping. >> >> With this patch, we enable local interrupts before requesting C1. In >> this case, if the CPU wakes up because of an interrupt, it will jump >> to the interrupt handler right away. The cpuidle housekeeping will be >> done after the pending interrupt(s) are handled. >> >> Enabling interrupts before entering a C-state has measurable impact >> for faster C-states, like C1. Deeper, but slower C-states like C6 do >> not really benefit from this sort of change, because their latency is >> a lot higher comparing to the delay added by cpuidle housekeeping. >> >> This change was also tested with cyclictest and dbench. In case of Ice >> Lake, the average cyclictest latency decreased by 5.1%, and the average >> 'dbench' throughput increased by about 0.8%. Both tests were run for 4 >> hours with only C1 enabled (all other idle states, including 'POLL', >> were disabled). CPU frequency was pinned to HFM, and uncore frequency >> was pinned to the maximum value. The other platforms had similar >> single-digit percentage improvements. >> >> It is worth noting that this patch affects 'cpuidle' statistics a tiny >> bit. Before this patch, C1 residency did not include the interrupt >> handling time, but with this patch, it will include it. This is similar >> to what happens in case of the 'POLL' state, which also runs with >> interrupts enabled. >> >> Suggested-by: Len Brown >> Signed-off-by: Artem Bityutskiy >> [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3] >> >> We don't have a pointer into cpuidle_state_table[] readily available. >> To compensate, make use of the new flag only appearing for C1 and hence >> only in the first table entry. > > We could likely use the 8 padding bits in acpi_processor_cx between > entry_method and method to store a flags field? When looking there initially, I thought it wouldn't be good to add field there. But looking again, I now don't see why I thought what I thought. (We actually have an 8-bit padding slot there and a 32-bit one.) > It would seems more resilient, as I guess at some point we could > enable interrupts for further states? C1E maybe; I don't think anything beyond, for having higher wakeup latency anyway. >> Unlike Linux we want to disable IRQs again after MWAITing, as >> subsequently invoked functions assume so. > > I'm also wondering whether there could be interrupts that rely on some > of the housekeeping that's done when returning from mwait. I guess > it's unlikely for an interrupt handler to have anything to do with the > TSC not having been restored. Actually this is a good point you make: We don't want to enable IRQs when cstate_restore_tsc() is not a no-op, or else we might confuse the time rendezvous. (I thought that I would remember TSC to stop only in deeper C-states, but maybe I'm mixing this up with the APIC timer.) >> Signed-off-by: Jan Beulich > > Acked-by: Roger Pau Monné Thanks, but as per above - not just yet. Jan
Re: [PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons
On Thu, Jan 20, 2022 at 03:04:39PM +0100, Jan Beulich wrote: > From: Artem Bityutskiy > > Enable local interrupts before requesting C1 on the last two generations > of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. > This decreases average C1 interrupt latency by about 5-10%, as measured > with the 'wult' tool. > > The '->enter()' function of the driver enters C-states with local > interrupts disabled by executing the 'monitor' and 'mwait' pair of > instructions. If an interrupt happens, the CPU exits the C-state and > continues executing instructions after 'mwait'. It does not jump to > the interrupt handler, because local interrupts are disabled. The > cpuidle subsystem enables interrupts a bit later, after doing some > housekeeping. > > With this patch, we enable local interrupts before requesting C1. In > this case, if the CPU wakes up because of an interrupt, it will jump > to the interrupt handler right away. The cpuidle housekeeping will be > done after the pending interrupt(s) are handled. > > Enabling interrupts before entering a C-state has measurable impact > for faster C-states, like C1. Deeper, but slower C-states like C6 do > not really benefit from this sort of change, because their latency is > a lot higher comparing to the delay added by cpuidle housekeeping. > > This change was also tested with cyclictest and dbench. In case of Ice > Lake, the average cyclictest latency decreased by 5.1%, and the average > 'dbench' throughput increased by about 0.8%. Both tests were run for 4 > hours with only C1 enabled (all other idle states, including 'POLL', > were disabled). CPU frequency was pinned to HFM, and uncore frequency > was pinned to the maximum value. The other platforms had similar > single-digit percentage improvements. > > It is worth noting that this patch affects 'cpuidle' statistics a tiny > bit. Before this patch, C1 residency did not include the interrupt > handling time, but with this patch, it will include it. This is similar > to what happens in case of the 'POLL' state, which also runs with > interrupts enabled. > > Suggested-by: Len Brown > Signed-off-by: Artem Bityutskiy > [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3] > > We don't have a pointer into cpuidle_state_table[] readily available. > To compensate, make use of the new flag only appearing for C1 and hence > only in the first table entry. We could likely use the 8 padding bits in acpi_processor_cx between entry_method and method to store a flags field? It would seems more resilient, as I guess at some point we could enable interrupts for further states? > > Unlike Linux we want to disable IRQs again after MWAITing, as > subsequently invoked functions assume so. I'm also wondering whether there could be interrupts that rely on some of the housekeeping that's done when returning from mwait. I guess it's unlikely for an interrupt handler to have anything to do with the TSC not having been restored. > > Signed-off-by: Jan Beulich Acked-by: Roger Pau Monné I think it's important to keep in sync with our upstream that's Linux there. Albeit I would prefer if we could carry the flags into acpi_processor_cx. Thanks, Roger.
[PATCH RFC v2 4/5] x86/mwait-idle: enable interrupts before C1 on Xeons
From: Artem Bityutskiy Enable local interrupts before requesting C1 on the last two generations of Intel Xeon platforms: Sky Lake, Cascade Lake, Cooper Lake, Ice Lake. This decreases average C1 interrupt latency by about 5-10%, as measured with the 'wult' tool. The '->enter()' function of the driver enters C-states with local interrupts disabled by executing the 'monitor' and 'mwait' pair of instructions. If an interrupt happens, the CPU exits the C-state and continues executing instructions after 'mwait'. It does not jump to the interrupt handler, because local interrupts are disabled. The cpuidle subsystem enables interrupts a bit later, after doing some housekeeping. With this patch, we enable local interrupts before requesting C1. In this case, if the CPU wakes up because of an interrupt, it will jump to the interrupt handler right away. The cpuidle housekeeping will be done after the pending interrupt(s) are handled. Enabling interrupts before entering a C-state has measurable impact for faster C-states, like C1. Deeper, but slower C-states like C6 do not really benefit from this sort of change, because their latency is a lot higher comparing to the delay added by cpuidle housekeeping. This change was also tested with cyclictest and dbench. In case of Ice Lake, the average cyclictest latency decreased by 5.1%, and the average 'dbench' throughput increased by about 0.8%. Both tests were run for 4 hours with only C1 enabled (all other idle states, including 'POLL', were disabled). CPU frequency was pinned to HFM, and uncore frequency was pinned to the maximum value. The other platforms had similar single-digit percentage improvements. It is worth noting that this patch affects 'cpuidle' statistics a tiny bit. Before this patch, C1 residency did not include the interrupt handling time, but with this patch, it will include it. This is similar to what happens in case of the 'POLL' state, which also runs with interrupts enabled. Suggested-by: Len Brown Signed-off-by: Artem Bityutskiy [Linux commit: c227233ad64c77e57db738ab0e46439db71822a3] We don't have a pointer into cpuidle_state_table[] readily available. To compensate, make use of the new flag only appearing for C1 and hence only in the first table entry. Unlike Linux we want to disable IRQs again after MWAITing, as subsequently invoked functions assume so. Signed-off-by: Jan Beulich --- RFC: I'm not entirely certain that we want to take this, i.e. whether we're as much worried about interrupt latency. RFC: I was going back and forth between putting the local_irq_enable() ahead of or after cpu_is_haltable(). --- v2: New. --- a/xen/arch/x86/cpu/mwait-idle.c +++ b/xen/arch/x86/cpu/mwait-idle.c @@ -108,6 +108,11 @@ static const struct cpuidle_state { #define CPUIDLE_FLAG_DISABLED 0x1 /* + * Enable interrupts before entering the C-state. On some platforms and for + * some C-states, this may measurably decrease interrupt latency. + */ +#define CPUIDLE_FLAG_IRQ_ENABLE0x8000 +/* * Set this flag for states where the HW flushes the TLB for us * and so we don't need cross-calls to keep it consistent. * If this flag is set, SW flushes the TLB, so even if the @@ -539,7 +544,7 @@ static struct cpuidle_state __read_mostl static struct cpuidle_state __read_mostly skx_cstates[] = { { .name = "C1", - .flags = MWAIT2flg(0x00), + .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE, .exit_latency = 2, .target_residency = 2, }, @@ -561,7 +566,7 @@ static struct cpuidle_state __read_mostl static const struct cpuidle_state icx_cstates[] = { { .name = "C1", - .flags = MWAIT2flg(0x00), + .flags = MWAIT2flg(0x00) | CPUIDLE_FLAG_IRQ_ENABLE, .exit_latency = 1, .target_residency = 1, }, @@ -776,6 +781,7 @@ static void mwait_idle(void) unsigned int next_state; u64 before, after; u32 exp = 0, pred = 0, irq_traced[4] = { 0 }; + bool irq_enable_early = false; if (max_cstate > 0 && power && (next_state = cpuidle_current_governor->select(power)) > 0) { @@ -806,6 +812,12 @@ static void mwait_idle(void) return; } + if (cx->idx == 1 && cx->type == ACPI_STATE_C1 && + (cpuidle_state_table[0].flags & CPUIDLE_FLAG_IRQ_ENABLE)) { + ASSERT(cx->address == flg2MWAIT(cpuidle_state_table[0].flags)); + irq_enable_early = true; + } + cpufreq_dbs_timer_suspend(); rcu_idle_enter(cpu); @@ -842,9 +854,15 @@ static void mwait_idle(void) update_last_cx_stat(power, cx, before); - if (cpu_is_haltable(cpu)) + if (cpu_is_haltable(cpu)) { + if (irq_enable_early) + local_irq_enable(); + mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK); +