Done: http://bugs.python.org/issue4477
On Sun, Nov 30, 2008 at 8:14 PM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> Can you toss the patch into the issue tracker, Jeffrey, so that any
> patch comments can be done there?
>
> -Brett
>
> On Sun, Nov 30, 2008 at 17:54, Jeffrey Yasskin <[EMAIL PROTECTED]> wrote:
>> Tracing support shows up fairly heavily an a Python profile, even
>> though it's nearly always turned off. The attached patch against the
>> trunk speeds up PyBench by 2% for me. All tests pass. I have 2
>> questions:
>>
>> 1) Can other people corroborate this speedup on their machines? I'm
>> running on a Macbook Pro (Intel Core2 processor, probably Merom) with
>> a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently
>> produces a faster python than gcc-4.3.)
>>
>> 2) Assuming this speeds things up for most people, should I check it
>> in anywhere besides the trunk? I assume it's out for 3.0; is it in for
>> 2.6.1 or 3.0.1?
>>
>>
>>
>> Pybench output:
>>
>> -------------------------------------------------------------------------------
>> PYBENCH 2.0
>> -------------------------------------------------------------------------------
>> * using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1
>> (Apple Inc. build 5488)]
>> * disabled garbage collection
>> * system check interval set to maximum: 2147483647
>> * using timer: time.time
>>
>> -------------------------------------------------------------------------------
>> Benchmark: pybench.out
>> -------------------------------------------------------------------------------
>>
>> Rounds: 10
>> Warp: 10
>> Timer: time.time
>>
>> Machine Details:
>> Platform ID: Darwin-9.5.0-i386-32bit
>> Processor: i386
>>
>> Python:
>> Implementation: CPython
>> Executable:
>> /Users/jyasskin/src/python/trunk-fast-tracing/build/python.exe
>> Version: 2.7.0
>> Compiler: GCC 4.0.1 (Apple Inc. build 5488)
>> Bits: 32bit
>> Build: Nov 30 2008 17:14:10 (#trunk:67458M)
>> Unicode: UCS2
>>
>>
>> -------------------------------------------------------------------------------
>> Comparing with: ../build_orig/pybench.out
>> -------------------------------------------------------------------------------
>>
>> Rounds: 10
>> Warp: 10
>> Timer: time.time
>>
>> Machine Details:
>> Platform ID: Darwin-9.5.0-i386-32bit
>> Processor: i386
>>
>> Python:
>> Implementation: CPython
>> Executable:
>> /Users/jyasskin/src/python/trunk-fast-tracing/build_orig/python.exe
>> Version: 2.7.0
>> Compiler: GCC 4.0.1 (Apple Inc. build 5488)
>> Bits: 32bit
>> Build: Nov 30 2008 13:51:09 (#trunk:67458)
>> Unicode: UCS2
>>
>>
>> Test minimum run-time average run-time
>> this other diff this other diff
>> -------------------------------------------------------------------------------
>> BuiltinFunctionCalls: 127ms 130ms -2.4% 129ms 132ms
>> -2.1%
>> BuiltinMethodLookup: 90ms 93ms -3.2% 91ms 94ms
>> -3.1%
>> CompareFloats: 88ms 91ms -3.3% 89ms 93ms
>> -4.3%
>> CompareFloatsIntegers: 97ms 99ms -2.1% 97ms 100ms
>> -2.4%
>> CompareIntegers: 79ms 82ms -4.2% 79ms 85ms
>> -6.1%
>> CompareInternedStrings: 90ms 92ms -2.4% 94ms 94ms
>> -0.9%
>> CompareLongs: 86ms 83ms +3.6% 87ms 84ms
>> +3.5%
>> CompareStrings: 80ms 82ms -3.1% 81ms 83ms
>> -2.3%
>> CompareUnicode: 103ms 105ms -2.3% 106ms 108ms
>> -1.5%
>> ComplexPythonFunctionCalls: 139ms 137ms +1.3% 140ms 139ms
>> +0.1%
>> ConcatStrings: 142ms 151ms -6.0% 156ms 154ms
>> +1.1%
>> ConcatUnicode: 87ms 92ms -5.4% 89ms 94ms
>> -5.7%
>> CreateInstances: 142ms 144ms -1.4% 144ms 145ms
>> -1.1%
>> CreateNewInstances: 107ms 109ms -2.3% 108ms 111ms
>> -2.1%
>> CreateStringsWithConcat: 114ms 137ms -17.1% 117ms 139ms
>> -16.0%
>> CreateUnicodeWithConcat: 92ms 101ms -9.2% 95ms 102ms
>> -7.2%
>> DictCreation: 77ms 81ms -4.4% 80ms 85ms
>> -5.9%
>> DictWithFloatKeys: 91ms 107ms -14.5% 93ms 109ms
>> -14.6%
>> DictWithIntegerKeys: 95ms 94ms +1.4% 108ms 96ms
>> +12.3%
>> DictWithStringKeys: 83ms 88ms -5.8% 84ms 88ms
>> -4.7%
>> ForLoops: 72ms 72ms -0.1% 79ms 74ms
>> +5.8%
>> IfThenElse: 83ms 80ms +3.9% 85ms 80ms
>> +5.3%
>> ListSlicing: 117ms 118ms -0.7% 118ms 121ms
>> -1.8%
>> NestedForLoops: 116ms 119ms -2.4% 121ms 121ms
>> +0.0%
>> NormalClassAttribute: 106ms 115ms -7.7% 108ms 117ms
>> -7.7%
>> NormalInstanceAttribute: 96ms 98ms -2.3% 97ms 100ms
>> -3.1%
>> PythonFunctionCalls: 92ms 95ms -3.7% 94ms 99ms
>> -5.2%
>> PythonMethodCalls: 147ms 147ms +0.1% 152ms 149ms
>> +2.1%
>> Recursion: 135ms 136ms -0.3% 140ms 144ms
>> -2.9%
>> SecondImport: 101ms 99ms +2.1% 103ms 101ms
>> +2.2%
>> SecondPackageImport: 107ms 103ms +3.5% 108ms 104ms
>> +3.3%
>> SecondSubmoduleImport: 134ms 134ms +0.3% 136ms 136ms
>> -0.0%
>> SimpleComplexArithmetic: 105ms 111ms -5.0% 110ms 112ms
>> -1.4%
>> SimpleDictManipulation: 95ms 106ms -10.6% 96ms 109ms
>> -12.0%
>> SimpleFloatArithmetic: 90ms 99ms -9.3% 93ms 102ms
>> -8.2%
>> SimpleIntFloatArithmetic: 78ms 76ms +2.3% 79ms 77ms
>> +2.0%
>> SimpleIntegerArithmetic: 78ms 77ms +1.8% 79ms 77ms
>> +2.0%
>> SimpleListManipulation: 80ms 78ms +2.4% 80ms 79ms
>> +1.9%
>> SimpleLongArithmetic: 110ms 113ms -2.0% 111ms 113ms
>> -2.1%
>> SmallLists: 128ms 117ms +9.5% 130ms 124ms
>> +4.9%
>> SmallTuples: 115ms 114ms +1.7% 117ms 114ms
>> +2.2%
>> SpecialClassAttribute: 101ms 112ms -10.3% 104ms 114ms
>> -8.9%
>> SpecialInstanceAttribute: 173ms 177ms -1.9% 176ms 179ms
>> -1.6%
>> StringMappings: 165ms 167ms -1.2% 168ms 169ms
>> -0.5%
>> StringPredicates: 126ms 134ms -5.7% 127ms 134ms
>> -5.6%
>> StringSlicing: 125ms 123ms +1.9% 131ms 130ms
>> +0.7%
>> TryExcept: 79ms 80ms -0.6% 80ms 80ms
>> -0.8%
>> TryFinally: 110ms 107ms +3.0% 111ms 112ms
>> -1.1%
>> TryRaiseExcept: 99ms 101ms -1.6% 100ms 102ms
>> -1.7%
>> TupleSlicing: 127ms 127ms +0.6% 137ms 137ms
>> +0.0%
>> UnicodeMappings: 144ms 144ms -0.3% 145ms 145ms
>> -0.4%
>> UnicodePredicates: 116ms 114ms +1.3% 117ms 115ms
>> +1.1%
>> UnicodeProperties: 106ms 102ms +3.6% 107ms 104ms
>> +3.1%
>> UnicodeSlicing: 95ms 111ms -14.0% 99ms 112ms
>> -11.8%
>> WithFinally: 157ms 152ms +3.3% 159ms 154ms
>> +3.3%
>> WithRaiseExcept: 123ms 125ms -1.1% 125ms 126ms
>> -1.2%
>> -------------------------------------------------------------------------------
>> Totals: 6043ms 6182ms -2.2% 6185ms 6301ms
>> -1.9%
>>
>> (this=pybench.out, other=../build_orig/pybench.out)
>>
>>
>> 2to3 times:
>>
>> Before:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real 0m56.685s
>> user 0m55.620s
>> sys 0m0.380s
>>
>> After:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real 0m55.067s
>> user 0m53.843s
>> sys 0m0.376s
>>
>> == 3% faster
>>
>>
>> Gory details:
>>
>> The meat of the patch is:
>> @@ -884,11 +891,12 @@
>> fast_next_opcode:
>> f->f_lasti = INSTR_OFFSET();
>>
>> /* line-by-line tracing support */
>>
>> - if (tstate->c_tracefunc != NULL && !tstate->tracing) {
>> + if (_Py_TracingPossible &&
>> + tstate->c_tracefunc != NULL && !tstate->tracing) {
>>
>>
>> This converts the generated assembly (produced with `gcc -S -dA ...`,
>> then manually annotated a bit) from:
>>
>> # basic block 17
>> # ../Python/ceval.c:885
>> LM541:
>> movl 8(%ebp), %ecx
>> LVL319:
>> subl -316(%ebp), %edx
>> movl %edx, 60(%ecx)
>> # ../Python/ceval.c:889
>> LM542:
>> # %esi = tstate
>> movl -336(%ebp), %esi
>> LVL320:
>> # %eax = tstate->c_tracefunc
>> movl 28(%esi), %eax
>> LVL321:
>> # if tstate->c_tracefunc == 0
>> testl %eax, %eax
>> # goto past-if ()
>> je L567
>> # more if conditions here
>>
>> to:
>>
>> # basic block 17
>> # ../Python/ceval.c:889
>> LM542:
>> movl 8(%ebp), %ecx
>> LVL319:
>> subl -316(%ebp), %edx
>> movl %edx, 60(%ecx)
>> # ../Python/ceval.c:893
>> LM543:
>> # %eax = _Py_TracingPossible
>> movl __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax
>> LVL320:
>> # if _Py_TracingPossible != 0
>> testl %eax, %eax
>> # goto rest-of-if (nearby)
>> jne L2321
>> # opcode = NEXTOP(); continues here
>>
>>
>> The branch should be predicted accurately either way, so there are 2
>> things that may be contributing to the performance change.
>>
>> First, adding the global caching variable halves the amount of memory
>> that has to be read to check the prediction. The memory that is read
>> is still read one instruction before it's used, but adding a local
>> variable to read the memory earlier doesn't affect the performance.
>>
>> Without the global variable, the compiler puts the tracing code
>> immediately after the if; with the global, it moves it away and puts
>> the non-tracing code immediately after the first test in the if. This
>> may affect branch prediction and may affect the icache. I tried using
>> gcc's __builtin_expect() to ensure that the tracing code is always
>> out-of-line. This moved it much farther away and cost about 1% in
>> performance (i.e. 1% instead of 2% faster than "before"). I don't know
>> why the __builtin_expect() version would be slower. If anyone feels
>> inspired to test this out on another processor or compiler version,
>> let me know how it goes.
>>
>> Jeffrey
>>
>> _______________________________________________
>> Python-Dev mailing list
>> [email protected]
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> http://mail.python.org/mailman/options/python-dev/brett%40python.org
>>
>>
>
--
Namasté,
Jeffrey Yasskin
http://jeffrey.yasskin.info/
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com