Done: http://bugs.python.org/issue4477
On Sun, Nov 30, 2008 at 8:14 PM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> Can you toss the patch into the issue tracker, Jeffrey, so that any
> patch comments can be done there?
>
> -Brett
>
> On Sun, Nov 30, 2008 at 17:54, Jeffrey Yasskin <[EMAIL PROTECTED]> wrote:
>> Tracing support shows up fairly heavily in a Python profile, even
>> though it's nearly always turned off. The attached patch against the
>> trunk speeds up PyBench by 2% for me. All tests pass. I have 2
>> questions:
>>
>> 1) Can other people corroborate this speedup on their machines? I'm
>> running on a Macbook Pro (Intel Core2 processor, probably Merom) with
>> a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently
>> produces a faster python than gcc-4.3.)
>>
>> 2) Assuming this speeds things up for most people, should I check it
>> in anywhere besides the trunk? I assume it's out for 3.0; is it in for
>> 2.6.1 or 3.0.1?
>>
>>
>>
>> Pybench output:
>>
>> -------------------------------------------------------------------------------
>> PYBENCH 2.0
>> -------------------------------------------------------------------------------
>> * using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1 (Apple Inc. build 5488)]
>> * disabled garbage collection
>> * system check interval set to maximum: 2147483647
>> * using timer: time.time
>>
>> -------------------------------------------------------------------------------
>> Benchmark: pybench.out
>> -------------------------------------------------------------------------------
>>
>>     Rounds: 10
>>     Warp:   10
>>     Timer:  time.time
>>
>>     Machine Details:
>>        Platform ID:    Darwin-9.5.0-i386-32bit
>>        Processor:      i386
>>
>>     Python:
>>        Implementation: CPython
>>        Executable:     /Users/jyasskin/src/python/trunk-fast-tracing/build/python.exe
>>        Version:        2.7.0
>>        Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
>>        Bits:           32bit
>>        Build:          Nov 30 2008 17:14:10 (#trunk:67458M)
>>        Unicode:        UCS2
>>
>>
>> -------------------------------------------------------------------------------
>> Comparing with: ../build_orig/pybench.out
>> -------------------------------------------------------------------------------
>>
>>     Rounds: 10
>>     Warp:   10
>>     Timer:  time.time
>>
>>     Machine Details:
>>        Platform ID:    Darwin-9.5.0-i386-32bit
>>        Processor:      i386
>>
>>     Python:
>>        Implementation: CPython
>>        Executable:     /Users/jyasskin/src/python/trunk-fast-tracing/build_orig/python.exe
>>        Version:        2.7.0
>>        Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
>>        Bits:           32bit
>>        Build:          Nov 30 2008 13:51:09 (#trunk:67458)
>>        Unicode:        UCS2
>>
>>
>> Test                                minimum run-time        average run-time
>>                                 this   other    diff    this   other    diff
>> -------------------------------------------------------------------------------
>> BuiltinFunctionCalls:          127ms   130ms   -2.4%   129ms   132ms   -2.1%
>> BuiltinMethodLookup:            90ms    93ms   -3.2%    91ms    94ms   -3.1%
>> CompareFloats:                  88ms    91ms   -3.3%    89ms    93ms   -4.3%
>> CompareFloatsIntegers:          97ms    99ms   -2.1%    97ms   100ms   -2.4%
>> CompareIntegers:                79ms    82ms   -4.2%    79ms    85ms   -6.1%
>> CompareInternedStrings:         90ms    92ms   -2.4%    94ms    94ms   -0.9%
>> CompareLongs:                   86ms    83ms   +3.6%    87ms    84ms   +3.5%
>> CompareStrings:                 80ms    82ms   -3.1%    81ms    83ms   -2.3%
>> CompareUnicode:                103ms   105ms   -2.3%   106ms   108ms   -1.5%
>> ComplexPythonFunctionCalls:    139ms   137ms   +1.3%   140ms   139ms   +0.1%
>> ConcatStrings:                 142ms   151ms   -6.0%   156ms   154ms   +1.1%
>> ConcatUnicode:                  87ms    92ms   -5.4%    89ms    94ms   -5.7%
>> CreateInstances:               142ms   144ms   -1.4%   144ms   145ms   -1.1%
>> CreateNewInstances:            107ms   109ms   -2.3%   108ms   111ms   -2.1%
>> CreateStringsWithConcat:       114ms   137ms  -17.1%   117ms   139ms  -16.0%
>> CreateUnicodeWithConcat:        92ms   101ms   -9.2%    95ms   102ms   -7.2%
>> DictCreation:                   77ms    81ms   -4.4%    80ms    85ms   -5.9%
>> DictWithFloatKeys:              91ms   107ms  -14.5%    93ms   109ms  -14.6%
>> DictWithIntegerKeys:            95ms    94ms   +1.4%   108ms    96ms  +12.3%
>> DictWithStringKeys:             83ms    88ms   -5.8%    84ms    88ms   -4.7%
>> ForLoops:                       72ms    72ms   -0.1%    79ms    74ms   +5.8%
>> IfThenElse:                     83ms    80ms   +3.9%    85ms    80ms   +5.3%
>> ListSlicing:                   117ms   118ms   -0.7%   118ms   121ms   -1.8%
>> NestedForLoops:                116ms   119ms   -2.4%   121ms   121ms   +0.0%
>> NormalClassAttribute:          106ms   115ms   -7.7%   108ms   117ms   -7.7%
>> NormalInstanceAttribute:        96ms    98ms   -2.3%    97ms   100ms   -3.1%
>> PythonFunctionCalls:            92ms    95ms   -3.7%    94ms    99ms   -5.2%
>> PythonMethodCalls:             147ms   147ms   +0.1%   152ms   149ms   +2.1%
>> Recursion:                     135ms   136ms   -0.3%   140ms   144ms   -2.9%
>> SecondImport:                  101ms    99ms   +2.1%   103ms   101ms   +2.2%
>> SecondPackageImport:           107ms   103ms   +3.5%   108ms   104ms   +3.3%
>> SecondSubmoduleImport:         134ms   134ms   +0.3%   136ms   136ms   -0.0%
>> SimpleComplexArithmetic:       105ms   111ms   -5.0%   110ms   112ms   -1.4%
>> SimpleDictManipulation:         95ms   106ms  -10.6%    96ms   109ms  -12.0%
>> SimpleFloatArithmetic:          90ms    99ms   -9.3%    93ms   102ms   -8.2%
>> SimpleIntFloatArithmetic:       78ms    76ms   +2.3%    79ms    77ms   +2.0%
>> SimpleIntegerArithmetic:        78ms    77ms   +1.8%    79ms    77ms   +2.0%
>> SimpleListManipulation:         80ms    78ms   +2.4%    80ms    79ms   +1.9%
>> SimpleLongArithmetic:          110ms   113ms   -2.0%   111ms   113ms   -2.1%
>> SmallLists:                    128ms   117ms   +9.5%   130ms   124ms   +4.9%
>> SmallTuples:                   115ms   114ms   +1.7%   117ms   114ms   +2.2%
>> SpecialClassAttribute:         101ms   112ms  -10.3%   104ms   114ms   -8.9%
>> SpecialInstanceAttribute:      173ms   177ms   -1.9%   176ms   179ms   -1.6%
>> StringMappings:                165ms   167ms   -1.2%   168ms   169ms   -0.5%
>> StringPredicates:              126ms   134ms   -5.7%   127ms   134ms   -5.6%
>> StringSlicing:                 125ms   123ms   +1.9%   131ms   130ms   +0.7%
>> TryExcept:                      79ms    80ms   -0.6%    80ms    80ms   -0.8%
>> TryFinally:                    110ms   107ms   +3.0%   111ms   112ms   -1.1%
>> TryRaiseExcept:                 99ms   101ms   -1.6%   100ms   102ms   -1.7%
>> TupleSlicing:                  127ms   127ms   +0.6%   137ms   137ms   +0.0%
>> UnicodeMappings:               144ms   144ms   -0.3%   145ms   145ms   -0.4%
>> UnicodePredicates:             116ms   114ms   +1.3%   117ms   115ms   +1.1%
>> UnicodeProperties:             106ms   102ms   +3.6%   107ms   104ms   +3.1%
>> UnicodeSlicing:                 95ms   111ms  -14.0%    99ms   112ms  -11.8%
>> WithFinally:                   157ms   152ms   +3.3%   159ms   154ms   +3.3%
>> WithRaiseExcept:               123ms   125ms   -1.1%   125ms   126ms   -1.2%
>> -------------------------------------------------------------------------------
>> Totals:                       6043ms  6182ms   -2.2%  6185ms  6301ms   -1.9%
>>
>> (this=pybench.out, other=../build_orig/pybench.out)
>>
>>
>> 2to3 times:
>>
>> Before:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real    0m56.685s
>> user    0m55.620s
>> sys     0m0.380s
>>
>> After:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real    0m55.067s
>> user    0m53.843s
>> sys     0m0.376s
>>
>> == 3% faster
>>
>>
>> Gory details:
>>
>> The meat of the patch is:
>> @@ -884,11 +891,12 @@
>>         fast_next_opcode:
>>                 f->f_lasti = INSTR_OFFSET();
>>
>>                 /* line-by-line tracing support */
>>
>> -               if (tstate->c_tracefunc != NULL && !tstate->tracing) {
>> +               if (_Py_TracingPossible &&
>> +                   tstate->c_tracefunc != NULL && !tstate->tracing) {
>>
>>
>> This converts the generated assembly (produced with `gcc -S -dA ...`,
>> then manually annotated a bit) from:
>>
>>         # basic block 17
>>         # ../Python/ceval.c:885
>> LM541:
>>         movl    8(%ebp), %ecx
>> LVL319:
>>         subl    -316(%ebp), %edx
>>         movl    %edx, 60(%ecx)
>>         # ../Python/ceval.c:889
>> LM542:
>>         # %esi = tstate
>>         movl    -336(%ebp), %esi
>> LVL320:
>>         # %eax = tstate->c_tracefunc
>>         movl    28(%esi), %eax
>> LVL321:
>>         # if tstate->c_tracefunc == 0
>>         testl   %eax, %eax
>>         #    goto past-if ()
>>         je      L567
>>         # more if conditions here
>>
>> to:
>>
>>         # basic block 17
>>         # ../Python/ceval.c:889
>> LM542:
>>         movl    8(%ebp), %ecx
>> LVL319:
>>         subl    -316(%ebp), %edx
>>         movl    %edx, 60(%ecx)
>>         # ../Python/ceval.c:893
>> LM543:
>>         # %eax = _Py_TracingPossible
>>         movl    __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax
>> LVL320:
>>         # if _Py_TracingPossible != 0
>>         testl   %eax, %eax
>>         #    goto rest-of-if (nearby)
>>         jne     L2321
>>         # opcode = NEXTOP(); continues here
>>
>>
>> The branch should be predicted accurately either way, so there are 2
>> things that may be contributing to the performance change.
>>
>> First, adding the global caching variable halves the amount of memory
>> that has to be read to check the prediction. The memory that is read
>> is still read one instruction before it's used, but adding a local
>> variable to read the memory earlier doesn't affect the performance.
>>
>> Without the global variable, the compiler puts the tracing code
>> immediately after the if; with the global, it moves it away and puts
>> the non-tracing code immediately after the first test in the if. This
>> may affect branch prediction and may affect the icache. I tried using
>> gcc's __builtin_expect() to ensure that the tracing code is always
>> out-of-line. This moved it much farther away and cost about 1% in
>> performance (i.e. 1% instead of 2% faster than "before"). I don't know
>> why the __builtin_expect() version would be slower. If anyone feels
>> inspired to test this out on another processor or compiler version,
>> let me know how it goes.
>>
>> Jeffrey
>>
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev@python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> http://mail.python.org/mailman/options/python-dev/brett%40python.org
>>
>>
>

-- 
Namasté,
Jeffrey Yasskin
http://jeffrey.yasskin.info/
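
For anyone who wants to try the idea outside of ceval.c, the global-flag check described under "Gory details" can be sketched in isolation roughly as below. This is only an illustration, not the patch itself: thread_state, set_trace, should_trace, run_steps and tracing_possible are hypothetical stand-ins, and keeping the global as a count of installed hooks is an assumption, since the quoted hunk shows only the check and not how the global is updated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not CPython's real definitions. */
typedef int (*trace_fn)(void *arg);

typedef struct {
    trace_fn c_tracefunc;  /* per-thread trace hook, usually NULL */
    int tracing;           /* nonzero while the hook itself is running */
} thread_state;

/* Assumed bookkeeping: number of thread states with a hook installed. */
static int tracing_possible = 0;

/* Install or clear a hook, keeping the global count in sync. */
static void set_trace(thread_state *ts, trace_fn fn)
{
    tracing_possible += (fn != NULL) - (ts->c_tracefunc != NULL);
    assert(tracing_possible >= 0);
    ts->c_tracefunc = fn;
}

/* The hot-path test: one global read decides the common (untraced) case. */
static int should_trace(const thread_state *ts)
{
    return tracing_possible &&
           ts->c_tracefunc != NULL && !ts->tracing;
}

/* Example hook: counts how many times it was invoked. */
static int count_calls(void *arg)
{
    ++*(int *)arg;
    return 0;
}

/* Toy dispatch loop: runs n steps and returns the number of hook calls. */
static int run_steps(thread_state *ts, int n)
{
    int calls = 0;
    for (int i = 0; i < n; i++) {
        if (should_trace(ts)) {
            ts->tracing = 1;          /* guard against re-entering the hook */
            ts->c_tracefunc(&calls);
            ts->tracing = 0;
        }
        /* ... the next opcode would be dispatched here ... */
    }
    return calls;
}
```

With no hook installed, run_steps() never looks at the per-thread fields, which is the case the patch is trying to make cheap. Installing count_calls via set_trace() makes every step call the hook, and clearing it again drops the count back to zero.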