Thank you for your advice, and for running the benchmark code on many CPUs with
different compiler options and sharing the results.
> Possibly harmful to result interpretation esp. cross-CPUs with things like
> RPi/ARM: The minimum over 101 runs in your bench template is good to reduce
> noise from CPU spin up to higher clock rates, BUT the two `sleep()`s are bad
> since you are probably giving the CPU/OS time to put the CPU back into a
> lower power mode. Essentially, you are doing one thing to make results less
> sensitive to other work happening on the system and another to make it more
> sensitive. Unless you have a very specific workload in mind, it's usually
> best to pick a direction.
When I ran `testgcd.nim` 3 times on a Raspberry Pi 3 without the sleep inside
the loop of the `bench` template, as in the following code, I got unstable timings.
template bench(repeat: int; init, body: untyped): untyped =
  var minTime = initDuration(days = 2)
  for i in 1 .. repeat:
    init
    let start = getMonoTime()
    body
    let finish = getMonoTime()
    minTime = min(minTime, finish - start)
    # Prevent thermal throttling.
    #sleep(20)
  echo minTime.inMicroseconds, " micro second"
  sleep(2000)
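A minimum over many runs hides how much the individual iterations vary. As a rough sketch (not the template used for the numbers in this post), a variant could also record the slowest iteration and take the in-loop pause as a parameter. The name `benchSpread` and the parameters `pauseMs`, `lo` and `hi` are made up for illustration.

```nim
import std/[monotimes, times, os]

# Hypothetical variant of `bench`: stores the fastest and slowest iteration
# in `lo` and `hi`, and makes the in-loop pause a parameter (0 disables it).
template benchSpread(repeat, pauseMs: int; lo, hi: var Duration;
                     init, body: untyped) =
  lo = initDuration(days = 2)
  hi = initDuration()
  for i in 1 .. repeat:
    init
    let start = getMonoTime()
    body
    let elapsed = getMonoTime() - start
    lo = min(lo, elapsed)
    hi = max(hi, elapsed)
    if pauseMs > 0:
      sleep(pauseMs)

when isMainModule:
  var acc = 0
  var lo, hi: Duration
  # First block is `init`, the block after `do:` is the timed `body`.
  benchSpread(101, 0, lo, hi):
    acc = 0
  do:
    for k in 1 .. 100_000:
      acc += k
  echo "min ", lo.inMicroseconds, " us, max ", hi.inMicroseconds, " us"
```

A large gap between the two values within one run would suggest the clock rate is changing during the measurement, rather than only between runs.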
In the first and second results, the timing of `gcdLAR` is 9410 microseconds,
but in the third result it goes down to 3996 microseconds. In the first result,
the timing of `gcdSub` is 9178 microseconds, but in the second and third results
it is about 3900 microseconds.
$ nim c -r -d:release testgcd.nim
gcd in stdlib: 7324 micro second
gcdLAR: 9410 micro second
gcdLAR2: 4055 micro second
gcdLAR3: 5013 micro second
gcdLAR4: 8022 micro second
gcdSub: 9178 micro second
gcdSub2: 7456 micro second
1638163816381638163816381638
[alarm@rasp proj]$ ./testgcd
gcd in stdlib: 7324 micro second
gcdLAR: 9410 micro second
gcdLAR2: 4055 micro second
gcdLAR3: 5013 micro second
gcdLAR4: 8023 micro second
gcdSub: 3892 micro second
gcdSub2: 3163 micro second
1638163816381638163816381638
[alarm@rasp proj]$ ./testgcd
gcd in stdlib: 7324 micro second
gcdLAR: 3996 micro second
gcdLAR2: 4050 micro second
gcdLAR3: 5012 micro second
gcdLAR4: 8024 micro second
gcdSub: 3886 micro second
gcdSub2: 3163 micro second
1638163816381638163816381638
When I ran `testgcd` 3 times with `sleep(20)` inside the loop of the `bench`
template, I got almost the same results each time.
$ nim c -r -d:release testgcd.nim
gcd in stdlib: 3115 micro second
gcdLAR: 3992 micro second
gcdLAR2: 4053 micro second
gcdLAR3: 5020 micro second
gcdLAR4: 3423 micro second
gcdSub: 3894 micro second
gcdSub2: 3171 micro second
1638163816381638163816381638
[alarm@rasp proj]$ ./testgcd
gcd in stdlib: 3115 micro second
gcdLAR: 3995 micro second
gcdLAR2: 4056 micro second
gcdLAR3: 5019 micro second
gcdLAR4: 3425 micro second
gcdSub: 3910 micro second
gcdSub2: 3176 micro second
1638163816381638163816381638
[alarm@rasp proj]$ ./testgcd
gcd in stdlib: 3206 micro second
gcdLAR: 4111 micro second
gcdLAR2: 4181 micro second
gcdLAR3: 5020 micro second
gcdLAR4: 3417 micro second
gcdSub: 3911 micro second
gcdSub2: 3174 micro second
1638163816381638163816381638
Maybe it is not caused by thermal throttling. The CPU temperature didn't rise
much after running `testgcd`. There are many `Undervoltage detected!` warnings
in dmesg, and they might be causing the unstable results. My Raspberry Pi 3 has
neither a heat sink nor a good power supply, so it might be inadequate for
benchmarking.
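One way to check the undervoltage idea is to ask the firmware directly. The Raspberry Pi tool `vcgencmd get_throttled` prints a bitmask such as `throttled=0x50005`, and the sketch below decodes it. This assumes `vcgencmd` is installed and on the PATH (on some distributions it lives under /opt/vc/bin). The proc name `decodeThrottled` is made up for illustration.

```nim
import std/[strutils, osproc]

# Bit meanings as documented for `vcgencmd get_throttled`.
const throttleFlags = [
  (0, "under-voltage detected now"),
  (1, "ARM frequency capped now"),
  (2, "currently throttled"),
  (3, "soft temperature limit active"),
  (16, "under-voltage has occurred since boot"),
  (17, "ARM frequency capping has occurred since boot"),
  (18, "throttling has occurred since boot"),
  (19, "soft temperature limit has occurred since boot")
]

proc decodeThrottled(line: string): seq[string] =
  ## Decodes output such as "throttled=0x50005" into readable flags.
  let parts = line.strip.split('=')
  if parts.len != 2:
    raise newException(ValueError, "unexpected output: " & line)
  let bits = parseHexInt(parts[1])
  for flag in throttleFlags:
    if ((bits shr flag[0]) and 1) == 1:
      result.add flag[1]

when isMainModule:
  # Assumption: `vcgencmd` is available; otherwise only a message is printed.
  try:
    let (output, code) = execCmdEx("vcgencmd get_throttled")
    if code == 0:
      let flags = decodeThrottled(output)
      if flags.len == 0:
        echo "no throttling flags set"
      for f in flags:
        echo f
    else:
      echo "vcgencmd failed: ", output.strip
  except CatchableError:
    echo "could not query vcgencmd: ", getCurrentExceptionMsg()
```

Running this before and after `testgcd` would show whether under-voltage or frequency capping was flagged during the benchmark. If the "has occurred since boot" bits are already set, comparing before and after will not show anything new until the board is rebooted.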