STINNER Victor added the comment:

Stefan: "In my experience it is very hard to get stable benchmark results with 
Python.  Even long running benchmarks on an empty machine vary: (...)"

tl;dr We *can* tune the Linux kernel to avoid most of the system noise when 
running benchmarks.

I modified Stefan's benchmark to remove all I/O from the hot code: the 
benchmark is now really CPU-bound. I also modified it to run the benchmark 
5 times. One run takes around 2.6 seconds.
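
To make the "run it 5 times" part concrete, here is a minimal sketch of such a timing loop. It is not the attached script: cpu_bound_work() is a hypothetical stand-in workload and run_benchmark() is a name made up for this example. The only thing taken from the message is the idea of timing a pure CPU-bound function several times and printing "Elapsed time:" for each run.

```python
import time


def cpu_bound_work(n=200_000):
    # Hypothetical stand-in workload: pure arithmetic, no I/O in the hot loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_benchmark(func, repeat=5):
    """Call func() `repeat` times and return the elapsed times in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for elapsed in run_benchmark(cpu_bound_work):
        print("Elapsed time:", elapsed)
```

time.perf_counter() is used because it is a monotonic, high-resolution clock, so the measurement itself does not add clock adjustments to the noise.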

I also added the following lines to check the CPU affinity and the number of 
context switches:

    os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())
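
The same information can be read without spawning grep. This is only a sketch of an alternative, assuming Linux and its /proc/self/status file; filter_status() and read_own_status() are hypothetical helper names, not part of the attached script.

```python
import re


def filter_status(text):
    """Keep the lines that mention 'cpu' or 'ctx', like grep -E -i 'cpu|ctx'."""
    pattern = re.compile(r"cpu|ctx", re.IGNORECASE)
    return [line for line in text.splitlines() if pattern.search(line)]


def read_own_status(path="/proc/self/status"):
    # Linux-only: /proc/self/status describes the calling process.
    with open(path) as f:
        return filter_status(f.read())


if __name__ == "__main__":
    for line in read_own_status():
        print(line)
```

Reading the file in-process avoids the extra fork/exec of os.system(), which would itself add a few context switches to the counters being inspected.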

Well, see the attached file for the full script.

I used my script to get a system load >= 5.0. Without taskset, 
the benchmark result changes completely: at least 5 seconds per run. Well, 
it's not really surprising, it's known that benchmarks depend on the system 
load.

*BUT* I have a great kernel called Linux which has cool features called "CPU 
isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is 
compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended
0   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
1   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
2   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
3   0    0      3    3:3:3:0       oui    5900,0000 1600,0000
4   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
5   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
6   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
7   0    0      3    3:3:3:0       oui    5900,0000 1600,0000

My CPU is on a single socket and has 4 physical cores, but Linux sees 8 
logical CPUs because of Hyper-Threading.
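
To see which logical CPUs share a physical core, the lscpu rows can be grouped by core id. This sketch assumes the first and fourth columns of the output are the logical CPU number and the core id (the header line is not shown in this message, so that column order is an assumption); siblings_by_core() is a name made up for the example.

```python
from collections import defaultdict

# First four columns of the lscpu rows quoted in this message.
# Assumed column order: CPU, NODE, SOCKET, CORE.
LSCPU_ROWS = """\
0   0    0      0
1   0    0      1
2   0    0      2
3   0    0      3
4   0    0      0
5   0    0      1
6   0    0      2
7   0    0      3
"""


def siblings_by_core(rows):
    """Map each physical core id to the logical CPUs that run on it."""
    cores = defaultdict(list)
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        cpu, core = int(fields[0]), int(fields[3])
        cores[core].append(cpu)
    return dict(cores)


if __name__ == "__main__":
    for core, cpus in sorted(siblings_by_core(LSCPU_ROWS).items()):
        print("core", core, "-> logical CPUs", cpus)
```

Under that assumption, physical cores 2 and 3 map to logical CPUs 2, 6 and 3, 7, which is why all four numbers have to be isolated together.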

I modified the Linux command line during the boot in GRUB to add: 
isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I set the CPU frequency governor to 
performance to avoid hiccups:

# for id in 2 3 6 7; do echo performance > /sys/devices/system/cpu/cpu$id/cpufreq/scaling_governor; done

Check the config with:

$ cat /sys/devices/system/cpu/isolated
$ cat /sys/devices/system/cpu/nohz_full
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
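
The same check can be scripted. This is a rough sketch, assuming the sysfs paths used in the cat commands; read_text() and read_tuning() are hypothetical helpers, and a missing file is reported as None rather than raising.

```python
import os


def read_text(path):
    """Return the stripped content of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def read_tuning(cpus=(2, 3, 6, 7), base="/sys/devices/system/cpu"):
    """Collect the isolation, nohz_full and governor settings from sysfs."""
    return {
        "isolated": read_text(os.path.join(base, "isolated")),
        "nohz_full": read_text(os.path.join(base, "nohz_full")),
        "governors": {
            cpu: read_text(
                os.path.join(base, "cpu%d" % cpu, "cpufreq", "scaling_governor")
            )
            for cpu in cpus
        },
    }


if __name__ == "__main__":
    print(read_tuning())
```

The base directory is a parameter only so that the function can be pointed at a test directory; on a real system the default should be the right one.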

Ok, now with this kernel config but still without taskset on an idle system:
Elapsed time: 2.660088424000037
Elapsed time: 2.5927538629999844
Elapsed time: 2.6135682369999813
Elapsed time: 2.5819260570000324
Elapsed time: 2.5991294099999322

Cpus_allowed:   33
Cpus_allowed_list:      0-1,4-5
voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     21

With system load >= 5.0:
Elapsed time: 5.3484489170000415
Elapsed time: 5.336797472999933
Elapsed time: 5.187413687999992
Elapsed time: 5.24122020599998
Elapsed time: 5.10201246400004

Cpus_allowed_list:      0-1,4-5
voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     1597

And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 
and 7), still on the heavily loaded system:
$ taskset -c 2,3,6,7 python3 full 

Elapsed time: 2.579487486000062
Elapsed time: 2.5827961039999536
Elapsed time: 2.5811954810001225
Elapsed time: 2.5782033600000887
Elapsed time: 2.572370636999949

Cpus_allowed:   cc
Cpus_allowed_list:      2-3,6-7
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     16
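
The Cpus_allowed value is a hexadecimal bitmask where bit N stands for logical CPU N. A small sketch to decode it (cpus_from_mask() is a name made up for this example; the comma handling assumes that wide masks are printed as comma-separated groups, which does not matter for the short masks in this message):

```python
def cpus_from_mask(hex_mask):
    """Decode a Cpus_allowed hex bitmask into a sorted list of CPU numbers."""
    # Assumption: wide masks may contain commas between groups; drop them.
    value = int(hex_mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    print("cc ->", cpus_from_mask("cc"))
    print("33 ->", cpus_from_mask("33"))
```

0xcc is 11001100 in binary, so it selects CPUs 2, 3, 6 and 7, while 0x33 (00110011) selects CPUs 0, 1, 4 and 5. That matches the Cpus_allowed_list lines printed next to each mask.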

Numbers look *more* stable than the numbers of the first test without taskset 
on an idle system! You can see that the number of context switches is very 
low (total: 18).

Example of a second run:
haypo@smithers$ taskset -c 2,3,6,7 python3 full 

Elapsed time: 2.538398498999868
Elapsed time: 2.544711968999991
Elapsed time: 2.5323677339999904
Elapsed time: 2.536252647000083
Elapsed time: 2.525748182999905

Cpus_allowed:   cc
Cpus_allowed_list:      2-3,6-7
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     15

Third run:
haypo@smithers$ taskset -c 2,3,6,7 python3 full 

Elapsed time: 2.5819172930000605
Elapsed time: 2.5783024259999365
Elapsed time: 2.578493587999901
Elapsed time: 2.5774198510000588
Elapsed time: 2.5772148999999445

Cpus_allowed:   cc
Cpus_allowed_list:      2-3,6-7
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     15

Well, it's not perfect, but it looks much more stable than timings without a 
specific kernel config or CPU pinning.
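
CPU pinning can also be done from inside the Python process instead of wrapping it with taskset. This is a sketch of an alternative, not what was used for the timings in this message. os.sched_setaffinity() and os.sched_getaffinity() are real functions but are only available on some platforms (Linux among them); pin_to_cpus() is a hypothetical helper that swallows errors so that it is harmless on machines without these CPUs.

```python
import os


def pin_to_cpus(cpus):
    """Try to pin the current process to `cpus`; return the resulting affinity.

    Returns None when the platform has no sched_setaffinity().
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    except OSError:
        # For example, none of the requested CPUs exist on this machine:
        # keep the current affinity unchanged.
        pass
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("Running on CPUs:", pin_to_cpus({2, 3, 6, 7}))
```

Pinning from inside the process only takes effect after the interpreter has started, so taskset remains the simpler choice when startup should also run on the isolated CPUs.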

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded 
system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 
2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 
2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 
2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
>>> statistics.pvariance(times)
>>> statistics.stdev(times)

Compare it to the timings without tuning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 
2.5991294099999322]
>>> statistics.mean(times)
>>> statistics.pvariance(times)
>>> statistics.stdev(times)

We get (no tuning, idle system => tuning, busy system):

* Population variance: 0.00074 => 0.00043
* Standard deviation: 0.031 => 0.022

It looks *much* better, no? Even though I only used *5* timings on the 
benchmark without tuning, whereas I used 15 timings on the benchmark with 
tuning. I expect larger variance and deviation with more timings.


Just for fun, I ran the benchmark 3 times (so to get 3x5 timings) on an idle 
system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 
2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 
2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 
2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
>>> statistics.pvariance(times)
>>> statistics.stdev(times)

As expected, it's even better (no tuning, idle system => tuning, busy system 
=> tuning, idle system):

* Population variance: 0.00074 => 0.00043 => 0.00016
* Standard deviation: 0.031 => 0.022 => 0.013
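
For anyone who wants to recompute these figures, here is a small script using the three lists of timings quoted in this message. The variable names and the summarize() helper are made up for the example; the calls are the same statistics.pvariance() and statistics.stdev() used in the interactive sessions.

```python
import statistics

# Timings copied from this message.
no_tuning_idle = [
    2.660088424000037, 2.5927538629999844, 2.6135682369999813,
    2.5819260570000324, 2.5991294099999322,
]
tuning_busy = [
    2.579487486000062, 2.5827961039999536, 2.5811954810001225,
    2.5782033600000887, 2.572370636999949, 2.538398498999868,
    2.544711968999991, 2.5323677339999904, 2.536252647000083,
    2.525748182999905, 2.5819172930000605, 2.5783024259999365,
    2.578493587999901, 2.5774198510000588, 2.5772148999999445,
]
tuning_idle = [
    2.542378394000025, 2.5541740109999864, 2.5456488329998592,
    2.54730951800002, 2.5495472409998, 2.56374302800009,
    2.5737907220000125, 2.581463170999996, 2.578222832999927,
    2.574441839999963, 2.569389365999996, 2.5792129209999075,
    2.5689420860001064, 2.5681367900001533, 2.5563378829999692,
]


def summarize(times):
    """Return (population variance, sample standard deviation)."""
    return statistics.pvariance(times), statistics.stdev(times)


if __name__ == "__main__":
    for name, times in [
        ("no tuning, idle system", no_tuning_idle),
        ("tuning, busy system", tuning_busy),
        ("tuning, idle system", tuning_idle),
    ]:
        pvar, sdev = summarize(times)
        print("%-24s pvariance=%.5f stdev=%.3f" % (name, pvar, sdev))
```

Note that statistics.pvariance() is the population variance (divides by n) while statistics.stdev() is the sample standard deviation (divides by n - 1), so the standard deviation is not exactly the square root of the variance shown next to it; statistics.pstdev() would be the matching pair.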
