While working on Nehalem deeper C-states, I ran a prototype deeper C-state kernel on my Core 2 Duo desktop without issues for a month before we had access to Nehalem hardware. The prototype handled TSC halting in ACPI C2/C3 using a mechanism similar to the one Solaris xpv uses in dtrace_xpv_gethrtime(). dtrace_xpv_gethrtime(), the lowest-level function that reads the TSC, uses a global variable to detect TSC regression. While not pretty, this avoids the need for the as-yet-unsolved "TSC rendezvous". Additionally, the per-CPU TSC stall in a C-state was calculated with the HPET. A similar mechanism may be the easiest acceptable approach to get P-states working on SMP systems whose TSC frequency varies with the P-state.
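To make the regression check concrete, here is a minimal user-space sketch in C11 of clamping readings against a global last-seen value. This is not the code from dtrace_xpv_gethrtime(); the names monotonic_clamp and last_seen_ts are made up for illustration, and the raw reading is passed in as a parameter rather than read from the TSC.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global: the largest timestamp handed out so far by any CPU. */
static _Atomic uint64_t last_seen_ts;

/*
 * Return a timestamp that never goes backwards across callers.
 * 'raw' stands in for a per-CPU TSC-derived reading (an assumption
 * of this sketch; a kernel would read the TSC here).
 */
uint64_t
monotonic_clamp(uint64_t raw)
{
	uint64_t last = atomic_load(&last_seen_ts);

	while (raw > last) {
		/* On failure, 'last' is refreshed with the current global value. */
		if (atomic_compare_exchange_weak(&last_seen_ts, &last, raw))
			return (raw);
	}

	/* raw <= last: this reading regressed, so report the newer global value. */
	return (last);
}
```

Every caller contends on the same cache line, which is why this kind of check is cheap to write but does not scale to large systems.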
I am not sure what Casper Dik's work did, but I assume it takes hrtime+TSC snapshots when changing P-states, and then computes the current hrtime from the TSC delta since the transition, scaled by the current TSC frequency? IIRC the big MP problem was that it is indeterminate exactly when a core's TSC changes frequency during the transition from one P-state/frequency to another, so the TSCs tend to jitter and drift a little with each P-state change.

Here is a possible solution based on the earlier C-state prototype:

1) Use /etc/system to set kernel variables which would allow P-states and/or C-states on non-invariant-TSC systems.
2) Add code to detect CPUs which need this workaround.
3) Point gethrtimef at a new function which treats tsc_read() the way dtrace_xpv_gethrtime() does. The new function also needs access to the timestamp at which the current P-state started and to the current TSC frequency on this CPU. The function would then compute an "invariant" hrtime from these. It is not possible to write a new value to the TSC, so each CPU could keep track of how far off its TSC is from the "global" value. Some thought needs to go into preventing the thread from migrating between CPUs during the call.
4) When making a P-state or C-state change, sync the CPU's TSC with a current HPET timestamp.

Issues:

A) Currently during boot the TSC frequency is calculated based on the PIT timer. See freq_tsc(). Something similar may have to be done with the HPET during initialization, because hrtime will now be driven from HPET snapshots + TSC deltas (instead of just the TSC).
B) hrtime will jitter (forward only) by as much as the HPET period plus the HPET read latency. IIRC the max HPET period is 100 nanoseconds and the avg read latency is around a microsecond.
C) The HPET read latency after a P-state transition may hurt performance? The rate of P-state changes may need to be throttled. This probably is not an issue.
HPET read latency after C-state wakeup should not be an issue, because Solaris is very good at not entering deeper C-states when busy, and the HPET read latency is relatively small compared to the deeper C-state wakeup latency.
D) The atomic operations updating the last global TSC value do not scale. Probably not an issue on these small systems.

This approach was acceptable to allow ACPI C2 & C3 on my desktop, where the TSC stalled in deeper C-states. The prototype never went through ON PIT or Perf testing.... The P-state portion works great "in theory". ;-)

My wife puts me to work around the house every time she catches me working on this, so I don't expect to make progress on it.

2 cents,
Bill

-- This message posted from opensolaris.org
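
For reference, steps 3 and 4 of the proposal above could look roughly like the following user-space C sketch. It is only an illustration under stated assumptions: the tsc_snap_t record and both function names are hypothetical, the HPET timestamp and TSC reading are passed in as plain parameters, and it does not address thread migration or the regression check done by dtrace_xpv_gethrtime().

```c
#include <stdint.h>

#define	NANOSEC	1000000000ULL

/* Hypothetical per-CPU record, refreshed at each P-state or C-state change. */
typedef struct tsc_snap {
	uint64_t base_hrtime;	/* ns, taken from the HPET at the last sync */
	uint64_t base_tsc;	/* TSC value read at the same moment */
	uint64_t tsc_hz;	/* TSC frequency in the current P-state */
} tsc_snap_t;

/* Step 4: re-anchor this CPU's snapshot when its P-state or C-state changes. */
void
tsc_snap_sync(tsc_snap_t *sp, uint64_t hpet_ns, uint64_t tsc, uint64_t new_hz)
{
	sp->base_hrtime = hpet_ns;
	sp->base_tsc = tsc;
	sp->tsc_hz = new_hz;
}

/* Step 3: hrtime = snapshot + TSC delta scaled by the current frequency. */
uint64_t
tsc_snap_hrtime(const tsc_snap_t *sp, uint64_t tsc)
{
	uint64_t delta = tsc - sp->base_tsc;
	/*
	 * Split into whole seconds and a remainder so the multiply by
	 * NANOSEC cannot overflow 64 bits (safe while tsc_hz < ~18 GHz).
	 */
	uint64_t sec = delta / sp->tsc_hz;
	uint64_t rem = delta % sp->tsc_hz;

	return (sp->base_hrtime + sec * NANOSEC + rem * NANOSEC / sp->tsc_hz);
}
```

Because each sync re-anchors on an HPET reading, any error in the assumed frequency only accumulates until the next P-state or C-state change, at the cost of the HPET-sized forward jitter noted in issue B.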
