Kristján Valur Jónsson <krist...@ccpgames.com> added the comment:

I just did some profiling. I'm using Visual Studio Team Edition, which has some fancy built-in profiling. I decided to compare the performance of the iotest.py script with two CPU threads, running for 10 seconds, with processor affinity enabled and disabled. I added this code to the script:

    if affinity:
        import ctypes
        i = ctypes.c_int()
        i.value = 1
        ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1)
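(For anyone wanting to reproduce this outside Windows: the snippet above can be wrapped in a small helper. The Windows branch mirrors the ctypes call from the script; the Linux branch using os.sched_setaffinity is my assumption for a portable equivalent, not part of the original test.)

```python
import ctypes
import sys

def set_process_affinity(mask):
    """Restrict the whole process to the cores set in `mask` (bit i = core i)."""
    if sys.platform == "win32":
        # -1 is the pseudo-handle for the current process (GetCurrentProcess)
        return bool(ctypes.windll.kernel32.SetProcessAffinityMask(-1, mask))
    else:
        import os
        # os.sched_setaffinity (Linux) takes a set of core indices, not a bitmask
        cores = {i for i in range(mask.bit_length()) if mask & (1 << i)}
        os.sched_setaffinity(0, cores)
        return True

print(set_process_affinity(1))  # pin to core 0 only
```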
Regular instruction counter sampling showed no differences. There were no indications of excessive time being spent in the GIL or any strangeness with the locking primitives. So, I decided to sample on CPU performance counters. Following up on my conjecture from yesterday, that this was due to inefficiencies in switching between CPUs, I settled on sampling the instruction fetch stall cycles from the instruction fetch unit. I sample every 1,000,000 stalls. I get interesting results.

With affinity:

    Functions Causing Most Work
    Name                            Samples    %
    _PyObject_Call                      403    99.02
    _PyEval_EvalFrameEx                 402    98.77
    _PyEval_EvalCodeEx                  402    98.77
    _PyEval_CallObjectWithKeywords      400    98.28
    call_function                       395    97.05

Affinity off:

    Functions Causing Most Work
    Name                            Samples    %
    _PyEval_EvalFrameEx               1,937    99.28
    _PyEval_EvalCodeEx                1,937    99.28
    _PyEval_CallObjectWithKeywords    1,936    99.23
    _PyObject_Call                    1,936    99.23
    _threadstartex                    1,934    99.13

When we run on both cores, we get nearly five times as many instruction fetch stalls, i.e. L1 instruction cache misses! So, what appears to be happening is that each time a switch occurs, the L1 instruction cache of each core must be repopulated with the Python evaluation loop, it having been evicted from that core during the hiatus. Note that for this effect to kick in we need a large piece of code exercising the cache, such as the evaluation loop. Earlier today, I wrote a simple (Python-free) C program to do similar testing, using a GIL, and found no performance degradation due to multiple cores, but that program only had a very simple "work" function.

So, this confirms my hypothesis: the degraded performance of CPU-bound Python threads on multicore machines stems from the shuttling of the Python evaluation loop between the instruction caches of the individual cores. How best to combat this? I'll do some experiments on Windows. Perhaps we can identify CPU-bound threads and group them on a single core.
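(To sketch what "group CPU-bound threads on a single core" could look like, here is a hypothetical per-thread variant, not the eventual fix. On Windows, SetThreadAffinityMask pins only the calling thread; the Linux branch relies on os.sched_setaffinity with pid 0 applying to the calling thread, which is my assumption for testing elsewhere.)

```python
import ctypes
import sys
import threading

def pin_current_thread(core):
    """Pin only the calling thread to `core`, leaving other threads alone."""
    if sys.platform == "win32":
        # -2 is the pseudo-handle for the current thread (GetCurrentThread)
        ctypes.windll.kernel32.SetThreadAffinityMask(-2, 1 << core)
    else:
        import os
        # On Linux, pid 0 means the calling thread, not the whole process
        os.sched_setaffinity(0, {core})

results = []

def cpu_bound_worker():
    pin_current_thread(0)              # group all CPU-bound threads on core 0
    results.append(sum(range(10**5)))  # stand-in for real CPU-bound work

threads = [threading.Thread(target=cpu_bound_worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both workers ran their loop on the same core
```

With both evaluation loops confined to one core, the L1 instruction cache would not need repopulating on every switch.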
----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8299>
_______________________________________