Hi, all.

I wrote code to improve x86's TSC. x86/tsc.c rev. 1.67 reduced the cache problem, but there is still room for improvement. I measured the effect of lfence, mfence, cpuid and rdtscp. The impact on TSC skew and/or drift is:
    AMD:   mfence > rdtscp > cpuid > lfence-serialize > lfence = nomodify
    Intel: lfence > rdtscp > cpuid > nomodify

So, mfence is the best on AMD and lfence is the best on Intel. If the CPU has no SSE2, we can use cpuid.

The logs of the TSC calibration (from "dmesg -t | grep TSC") are at:

    http://www.netbsd.org/~msaitoh/tsc/tsc-20200605log.tgz

The diff is at:

    http://www.netbsd.org/~msaitoh/tsc/tsc-20200609-0.dif

In this diff, cpu_counter*() (these functions are MI API) use serializing. We could provide both cpu_counter() and cpu_counter_serializing(), but I think almost all use cases require serializing on x86. For the RNG, serializing is not required, but the overhead is trivial. I think users who call cpu_counter() do so to measure time differences precisely. Serializing is required to get a precise value, so I think just adding it to x86's cpu_counter() is enough.

If it's acceptable, I'd like to commit this change. Any comments/advice are welcome.

NOTE:

 - An AMD document says the DE_CFG_LFENCE_SERIALIZE bit can be used to make lfence serializing, but it's not so good.

 - On Intel i386 (not amd64), the improvement seems to be very small.

 - The rdtscp instruction can be used as a serializing instruction + rdtsc, but it's not as good as [lm]fence. Both Intel's and AMD's documents say that the latency of rdtscp is larger than that of rdtsc, so I suspect the difference in the results comes from that.

And thanks to ad@, knakahara@, nonaka@ and some other people who helped with this work.

-- 
-----------------------------------------------
                SAITOH Masanobu (msai...@execsw.org
                                 msai...@netbsd.org)