Hi, all.

I wrote code to improve x86's TSC. x86/tsc.c rev. 1.67 reduced the cache problem, but there is still room for improvement. I measured the effect of lfence, mfence, cpuid and rdtscp. The impact on TSC skew and/or drift is:
    AMD:   mfence > rdtscp > cpuid > lfence-serialize > lfence = nomodify
    Intel: lfence > rdtscp > cpuid > nomodify

So, mfence is the best on AMD and lfence is the best on Intel. If the CPU has no SSE2, we can use cpuid.

The logs of the TSC calibration (from "dmesg -t | grep TSC") are at:

    http://www.netbsd.org/~msaitoh/tsc/tsc-20200605log.tgz

The diff is at:

    http://www.netbsd.org/~msaitoh/tsc/tsc-20200609-0.dif

In this diff, cpu_counter*() (these functions are MI API) use serializing. We could provide both cpu_counter() and cpu_counter_serializing(), but I think almost all use cases require serializing on x86. For the RNG, serializing is not required, but the overhead is trivial. I think users who call cpu_counter() do so to measure time differences precisely. Serializing is required to get a precise value, so I think just adding it to x86's cpu_counter() is enough.

If it's acceptable, I'd like to commit this change. Any comments/advice are welcome.

NOTE:

 - An AMD document says the DE_CFG_LFENCE_SERIALIZE bit can be used to make lfence serializing, but it's not so good.

 - On Intel i386 (not amd64), the improvement seems to be very small.

 - The rdtscp instruction can be used as a serializing instruction + rdtsc, but it's not as good as [lm]fence. Both Intel's and AMD's documents say that the latency of rdtscp is larger than that of rdtsc, so I suspect the difference in the results comes from that.

And thanks to ad@, knakahara@, nonaka@ and some other people who helped with this work.

-- 
-----------------------------------------------
                SAITOH Masanobu (msai...@execsw.org
                                 msai...@netbsd.org)