Kristján Valur Jónsson <krist...@ccpgames.com> added the comment:

I just did some profiling.  I'm using Visual Studio Team Edition, which has some 
fancy built-in profiling.  I decided to compare the performance of the 
iotest.py script with two CPU threads, running for 10 seconds, with processor 
affinity enabled and disabled.  I added this code to the script:
if affinity:
    import ctypes
    # Pin the whole process (pseudo-handle -1) to CPU 0 (affinity mask 0x1)
    ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1)

Regular instruction counter sampling showed no differences.  There were no 
indications of excessive time being spent in the GIL or any strangeness with the 
locking primitives.  So, I decided to sample on CPU performance counters.  
Following up on my conjecture from yesterday, that this was due to 
inefficiencies in switching between CPUs, I settled on sampling the instruction 
fetch stall cycles from the instruction fetch unit, taking one sample every 
1,000,000 stalls.  I got interesting results.

With affinity:
Functions Causing Most Work
Name                            Samples  %
_PyObject_Call                  403      99.02
_PyEval_EvalFrameEx             402      98.77
_PyEval_EvalCodeEx              402      98.77
_PyEval_CallObjectWithKeywords  400      98.28
call_function                   395      97.05

Affinity off:
Functions Causing Most Work
Name                            Samples  %
_PyEval_EvalFrameEx             1,937    99.28
_PyEval_EvalCodeEx              1,937    99.28
_PyEval_CallObjectWithKeywords  1,936    99.23
_PyObject_Call                  1,936    99.23
_threadstartex                  1,934    99.13

When we run on both cores, we get almost five times as many instruction fetch 
stalls (L1 instruction cache misses)!  So, what appears to be happening is that 
each time a switch occurs, the L1 instruction cache on the target core must be 
repopulated with the Python evaluation loop, it having been evicted on that 
core during the hiatus.

Note that for this effect to kick in, we need a large piece of code exercising 
the cache, such as the evaluation loop.  Earlier today, I wrote a simple 
(Python-free) C program to do similar testing, using a GIL, and found no 
performance degradation from multiple cores, but that program only had a very 
simple "work" function.
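For anyone who wants to reproduce the shape of that experiment without the C 
program, here is a rough Python sketch of the same idea: a plain 
threading.Lock standing in for the GIL, guarding a deliberately tiny "work" 
function (the lock name, work function, and 0.5 s duration are my 
illustrative choices, not the actual test program):

```python
import threading
import time

mock_gil = threading.Lock()   # stands in for the GIL
counts = [0, 0]               # iterations completed per thread

def work():
    # Deliberately tiny work function: far too small to stress the L1 i-cache
    return sum(range(10))

def worker(idx, deadline):
    while time.time() < deadline:
        with mock_gil:        # serialize access, GIL-style
            work()
            counts[idx] += 1

deadline = time.time() + 0.5
threads = [threading.Thread(target=worker, args=(i, deadline)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)
```

With a hot loop this small, both cores' instruction caches hold all the code, 
so no multicore penalty shows up; the effect needs something the size of the 
evaluation loop.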

So, this confirms my hypothesis:  the degraded performance of Python 
CPU-bound threads on multicore machines stems from the shuttling of the 
Python evaluation loop between the instruction caches of the individual cores.

How best to combat this?  I'll do some experiments on Windows.  Perhaps we can 
identify CPU-bound threads and group them on a single core.
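One way to start experimenting with that grouping, using the same ctypes 
approach as the snippet above (the helper name is mine, and it would have to 
be called from inside each CPU-bound thread; it is a no-op off Windows, where 
os.sched_setaffinity would be the Linux equivalent):

```python
import sys
import ctypes

def pin_current_thread_to_cpu0():
    """Hypothetical helper: restrict the calling thread to CPU 0 (Windows only)."""
    if sys.platform != "win32":
        return False  # no-op on other platforms
    # GetCurrentThread() returns a pseudo-handle for the calling thread
    handle = ctypes.windll.kernel32.GetCurrentThread()
    # Mask 0x1 = CPU 0; SetThreadAffinityMask returns the previous mask, or 0 on failure
    return ctypes.windll.kernel32.SetThreadAffinityMask(handle, 1) != 0

print(pin_current_thread_to_cpu0())
```

The harder part, of course, is deciding at runtime which threads are 
CPU-bound in the first place.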

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8299>
_______________________________________