While working on Nehalem deeper C-states, I ran a prototype deeper C-state
kernel on my Core 2 Duo desktop without issues for a month before we had
access to Nehalem hardware.  The prototype handled TSC halting in ACPI C2/C3
using a mechanism similar to the one Solaris xpv uses in
dtrace_xpv_gethrtime().  dtrace_xpv_gethrtime(), the lowest-level function
that reads the TSC, uses a global variable to detect TSC regression.  While
not pretty, this avoids the need for the as-yet-unsolved "TSC rendezvous".
Additionally, the per-CPU TSC stall in a C-state was calculated with the
HPET.  A similar mechanism may be the easiest acceptable approach to get
P-states working on SMP systems whose TSC rate varies with P-state.
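
To illustrate the "global variable" idea, here is a minimal userland sketch
in C11.  It is not the actual dtrace_xpv_gethrtime() code; the names
last_tsc_seen and monotonic_tsc() are made up for this example, and the raw
TSC value is passed in as a parameter rather than read from hardware.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global: the largest timestamp handed out so far by any CPU. */
static _Atomic uint64_t last_tsc_seen;

/*
 * Return a timestamp that never goes backwards across callers.
 * raw_tsc stands in for the value a real implementation would read
 * from the TSC (plus any per-CPU adjustment).
 */
static uint64_t
monotonic_tsc(uint64_t raw_tsc)
{
	uint64_t last = atomic_load(&last_tsc_seen);

	for (;;) {
		if (raw_tsc <= last)
			return (last);	/* regression: reuse the global value */
		/* On failure, 'last' is reloaded with the current global. */
		if (atomic_compare_exchange_weak(&last_tsc_seen, &last,
		    raw_tsc))
			return (raw_tsc);
	}
}
```

A CPU whose TSC has fallen behind simply gets the last global value back, so
time can stand still briefly but never runs backwards.  Every caller touches
the same cache line, which is why this does not scale to large systems.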

I am not sure what Casper Dik's work did, but I assume it takes hrtime+TSC
snapshots when changing P-states, then computes current hrtime as the
transition hrtime plus the TSC delta since the transition, scaled by the
current TSC frequency?  IIRC the big MP problem was that it is indeterminate
exactly when the core's TSC changes frequency during the transition from one
P-state/frequency to another, so the TSCs tend to jitter and drift a little
with each P-state change.
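
The snapshot arithmetic I am assuming above would look roughly like this.
It is a sketch of my guess, not of the actual code; pstate_snap_t and
snap_to_hrtime() are hypothetical names, and NANOSEC is defined locally so
the fragment stands alone.

```c
#include <stdint.h>

#define	NANOSEC	1000000000ULL

/* Hypothetical snapshot taken at the most recent P-state transition. */
typedef struct pstate_snap {
	uint64_t ps_hrtime;	/* hrtime (ns) at the transition */
	uint64_t ps_tsc;	/* TSC value at the transition */
	uint64_t ps_tsc_hz;	/* TSC frequency in the new P-state */
} pstate_snap_t;

/* hrtime = transition hrtime + (TSC delta scaled to ns at current rate). */
static uint64_t
snap_to_hrtime(const pstate_snap_t *sp, uint64_t tsc_now)
{
	uint64_t delta = tsc_now - sp->ps_tsc;
	/* Split into whole seconds and remainder so delta * NANOSEC
	 * cannot overflow 64 bits for realistic TSC frequencies. */
	uint64_t sec = delta / sp->ps_tsc_hz;
	uint64_t rem = delta % sp->ps_tsc_hz;

	return (sp->ps_hrtime + sec * NANOSEC +
	    rem * NANOSEC / sp->ps_tsc_hz);
}
```

Any error in when the frequency really changed ends up in ps_tsc and
ps_hrtime, which is where the per-transition jitter and drift come from.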


Here is a possible solution based on the earlier C-state prototype:

1) Use /etc/system to set kernel variables which would allow P-states and/or 
C-states on non-invariant-TSC systems.

2) Add code to detect CPUs which need this workaround.

3) Point gethrtimef at a new function which treats tsc_read() similarly to how
dtrace_xpv_gethrtime() does.  The new function also needs access to the
timestamp at which this P-state started and the current TSC frequency on this
CPU.  The function would then compute "invariant" hrtime from these.  It is
not possible to write a new value to the TSC, so each CPU could keep track of
how far off its TSC is from the "global" value.  Some thought needs to go into
preventing the thread from migrating between CPUs during the call.

4) When making a P-state or C-state change, sync the CPU's TSC with a current 
HPET timestamp.
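
Steps 3 and 4 can be sketched in userland C as below.  All names here
(cpu_tsc_base_t, cpu_tsc_resync(), cpu_tsc_hrtime()) are hypothetical, and
the HPET and TSC readings are passed in as parameters instead of being read
from hardware.  A real version would also need the global regression check
on top and would have to keep the thread on one CPU for the duration of the
call, for instance by disabling kernel preemption around it.

```c
#include <stdint.h>

#define	NS_PER_SEC	1000000000ULL

/* Hypothetical per-CPU timekeeping state; names are illustrative only. */
typedef struct cpu_tsc_base {
	uint64_t cb_hrtime;	/* ns, derived from the HPET at last sync */
	uint64_t cb_tsc;	/* this CPU's TSC at last sync */
	uint64_t cb_tsc_hz;	/* TSC frequency since last sync */
} cpu_tsc_base_t;

/*
 * Step 4: run on the CPU that just changed P-state or woke from a
 * C-state.  hpet_ns and tsc_now are assumed to be read back to back.
 */
static void
cpu_tsc_resync(cpu_tsc_base_t *cb, uint64_t hpet_ns, uint64_t tsc_now,
    uint64_t tsc_hz)
{
	cb->cb_hrtime = hpet_ns;
	cb->cb_tsc = tsc_now;
	cb->cb_tsc_hz = tsc_hz;
}

/* Step 3: HPET-based snapshot plus the TSC delta since that snapshot. */
static uint64_t
cpu_tsc_hrtime(const cpu_tsc_base_t *cb, uint64_t tsc_now)
{
	uint64_t delta = tsc_now - cb->cb_tsc;

	return (cb->cb_hrtime + (delta / cb->cb_tsc_hz) * NS_PER_SEC +
	    (delta % cb->cb_tsc_hz) * NS_PER_SEC / cb->cb_tsc_hz);
}
```

Because the base is rewritten from the HPET on every transition, a TSC that
stalled in a C-state or changed rate in a P-state only matters for the
interval since the last sync.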


Issues:
A) Currently during boot the TSC frequency is calculated based on the PIT 
timer.  See freq_tsc().  Something similar may have to be done with the HPET 
during initialization because hrtime will now be driven from HPET snapshots + 
TSC deltas (instead of just TSC).
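
A rough sketch of calibrating the TSC against the HPET instead of the PIT,
assuming the caller has sampled both counters at the start and end of a
short interval.  tsc_hz_from_hpet() is a hypothetical name, not an existing
Solaris function.  The HPET reports its tick period in femtoseconds in its
capabilities register; that value is passed in here as hpet_period_fs.

```c
#include <stdint.h>

#define	FS_PER_SEC	1000000000000000ULL	/* femtoseconds per second */

/*
 * Estimate the TSC frequency in Hz from TSC and HPET counter deltas
 * measured over the same interval.  Returns 0 on invalid input.
 * Assumes a calibration interval of about a second or less, so the
 * intermediate product below stays well inside 64 bits.
 */
static uint64_t
tsc_hz_from_hpet(uint64_t tsc_delta, uint64_t hpet_delta,
    uint32_t hpet_period_fs)
{
	uint64_t hpet_hz, whole, rem;

	if (hpet_delta == 0 || hpet_period_fs == 0)
		return (0);

	/* Integer division; truncation error is well under 1 ppm. */
	hpet_hz = FS_PER_SEC / hpet_period_fs;

	/* tsc_hz = tsc_delta * hpet_hz / hpet_delta, split in two parts. */
	whole = tsc_delta / hpet_delta;
	rem = tsc_delta % hpet_delta;

	return (whole * hpet_hz + rem * hpet_hz / hpet_delta);
}
```

For example, with a 100 ns HPET period (10 MHz), 100000 HPET ticks span
10 ms, and 20000000 TSC ticks over that interval work out to 2 GHz.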

B) hrtime will jitter (forward only) by as much as the HPET period plus HPET 
read latency.  IIRC max HPET period is 100 nanoseconds and avg read latency is 
around a microsecond.

C) The HPET read latency after a P-state transition may hurt performance.  The
rate of P-state changes may need to be throttled.  This probably is not an
issue.
HPET read latency after C-state wakeup should not be an issue because Solaris
is very good at not entering deeper C-states when busy, and the HPET read
latency is relatively small compared to deeper C-state wakeup latency.

D) The atomic operations updating the last global TSC value do not scale.
Probably not an issue on these small systems.


This approach was acceptable to allow ACPI C2 & C3 on my desktop where TSC 
stalled in deeper C-states.   The prototype never went through ON PIT or Perf 
testing....  The P-state portion works great "in theory".  ;-)

My wife puts me to work around the house every time she catches me working on 
this, so I don't expect to make progress on this.

2 cents,
Bill
-- 
This message posted from opensolaris.org
