Re: [Python-Dev] Patch to speed up non-tracing case in PyEval_EvalFrameEx (2% on pybench)

Jeffrey Yasskin Sun, 30 Nov 2008 20:34:36 -0800

Done: http://bugs.python.org/issue4477


On Sun, Nov 30, 2008 at 8:14 PM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> Can you toss the patch into the issue tracker, Jeffrey, so that any
> patch comments can be done there?
>
> -Brett
>
> On Sun, Nov 30, 2008 at 17:54, Jeffrey Yasskin <[EMAIL PROTECTED]> wrote:
>> Tracing support shows up fairly heavily an a Python profile, even
>> though it's nearly always turned off. The attached patch against the
>> trunk speeds up PyBench by 2% for me. All tests pass. I have 2
>> questions:
>>
>> 1) Can other people corroborate this speedup on their machines? I'm
>> running on a Macbook Pro (Intel Core2 processor, probably Merom) with
>> a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently
>> produces a faster python than gcc-4.3.)
>>
>> 2) Assuming this speeds things up for most people, should I check it
>> in anywhere besides the trunk? I assume it's out for 3.0; is it in for
>> 2.6.1 or 3.0.1?
>>
>>
>>
>> Pybench output:
>>
>> -------------------------------------------------------------------------------
>> PYBENCH 2.0
>> -------------------------------------------------------------------------------
>> * using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1
>> (Apple Inc. build 5488)]
>> * disabled garbage collection
>> * system check interval set to maximum: 2147483647
>> * using timer: time.time
>>
>> -------------------------------------------------------------------------------
>> Benchmark: pybench.out
>> -------------------------------------------------------------------------------
>>
>>    Rounds: 10
>>    Warp:   10
>>    Timer:  time.time
>>
>>    Machine Details:
>>       Platform ID:    Darwin-9.5.0-i386-32bit
>>       Processor:      i386
>>
>>    Python:
>>       Implementation: CPython
>>       Executable:
>> /Users/jyasskin/src/python/trunk-fast-tracing/build/python.exe
>>       Version:        2.7.0
>>       Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
>>       Bits:           32bit
>>       Build:          Nov 30 2008 17:14:10 (#trunk:67458M)
>>       Unicode:        UCS2
>>
>>
>> -------------------------------------------------------------------------------
>> Comparing with: ../build_orig/pybench.out
>> -------------------------------------------------------------------------------
>>
>>    Rounds: 10
>>    Warp:   10
>>    Timer:  time.time
>>
>>    Machine Details:
>>       Platform ID:    Darwin-9.5.0-i386-32bit
>>       Processor:      i386
>>
>>    Python:
>>       Implementation: CPython
>>       Executable:
>> /Users/jyasskin/src/python/trunk-fast-tracing/build_orig/python.exe
>>       Version:        2.7.0
>>       Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
>>       Bits:           32bit
>>       Build:          Nov 30 2008 13:51:09 (#trunk:67458)
>>       Unicode:        UCS2
>>
>>
>> Test                             minimum run-time        average  run-time
>>                                 this    other   diff    this    other   diff
>> -------------------------------------------------------------------------------
>>          BuiltinFunctionCalls:   127ms   130ms   -2.4%   129ms   132ms   
>> -2.1%
>>           BuiltinMethodLookup:    90ms    93ms   -3.2%    91ms    94ms   
>> -3.1%
>>                 CompareFloats:    88ms    91ms   -3.3%    89ms    93ms   
>> -4.3%
>>         CompareFloatsIntegers:    97ms    99ms   -2.1%    97ms   100ms   
>> -2.4%
>>               CompareIntegers:    79ms    82ms   -4.2%    79ms    85ms   
>> -6.1%
>>        CompareInternedStrings:    90ms    92ms   -2.4%    94ms    94ms   
>> -0.9%
>>                  CompareLongs:    86ms    83ms   +3.6%    87ms    84ms   
>> +3.5%
>>                CompareStrings:    80ms    82ms   -3.1%    81ms    83ms   
>> -2.3%
>>                CompareUnicode:   103ms   105ms   -2.3%   106ms   108ms   
>> -1.5%
>>    ComplexPythonFunctionCalls:   139ms   137ms   +1.3%   140ms   139ms   
>> +0.1%
>>                 ConcatStrings:   142ms   151ms   -6.0%   156ms   154ms   
>> +1.1%
>>                 ConcatUnicode:    87ms    92ms   -5.4%    89ms    94ms   
>> -5.7%
>>               CreateInstances:   142ms   144ms   -1.4%   144ms   145ms   
>> -1.1%
>>            CreateNewInstances:   107ms   109ms   -2.3%   108ms   111ms   
>> -2.1%
>>       CreateStringsWithConcat:   114ms   137ms  -17.1%   117ms   139ms  
>> -16.0%
>>       CreateUnicodeWithConcat:    92ms   101ms   -9.2%    95ms   102ms   
>> -7.2%
>>                  DictCreation:    77ms    81ms   -4.4%    80ms    85ms   
>> -5.9%
>>             DictWithFloatKeys:    91ms   107ms  -14.5%    93ms   109ms  
>> -14.6%
>>           DictWithIntegerKeys:    95ms    94ms   +1.4%   108ms    96ms  
>> +12.3%
>>            DictWithStringKeys:    83ms    88ms   -5.8%    84ms    88ms   
>> -4.7%
>>                      ForLoops:    72ms    72ms   -0.1%    79ms    74ms   
>> +5.8%
>>                    IfThenElse:    83ms    80ms   +3.9%    85ms    80ms   
>> +5.3%
>>                   ListSlicing:   117ms   118ms   -0.7%   118ms   121ms   
>> -1.8%
>>                NestedForLoops:   116ms   119ms   -2.4%   121ms   121ms   
>> +0.0%
>>          NormalClassAttribute:   106ms   115ms   -7.7%   108ms   117ms   
>> -7.7%
>>       NormalInstanceAttribute:    96ms    98ms   -2.3%    97ms   100ms   
>> -3.1%
>>           PythonFunctionCalls:    92ms    95ms   -3.7%    94ms    99ms   
>> -5.2%
>>             PythonMethodCalls:   147ms   147ms   +0.1%   152ms   149ms   
>> +2.1%
>>                     Recursion:   135ms   136ms   -0.3%   140ms   144ms   
>> -2.9%
>>                  SecondImport:   101ms    99ms   +2.1%   103ms   101ms   
>> +2.2%
>>           SecondPackageImport:   107ms   103ms   +3.5%   108ms   104ms   
>> +3.3%
>>         SecondSubmoduleImport:   134ms   134ms   +0.3%   136ms   136ms   
>> -0.0%
>>       SimpleComplexArithmetic:   105ms   111ms   -5.0%   110ms   112ms   
>> -1.4%
>>        SimpleDictManipulation:    95ms   106ms  -10.6%    96ms   109ms  
>> -12.0%
>>         SimpleFloatArithmetic:    90ms    99ms   -9.3%    93ms   102ms   
>> -8.2%
>>      SimpleIntFloatArithmetic:    78ms    76ms   +2.3%    79ms    77ms   
>> +2.0%
>>       SimpleIntegerArithmetic:    78ms    77ms   +1.8%    79ms    77ms   
>> +2.0%
>>        SimpleListManipulation:    80ms    78ms   +2.4%    80ms    79ms   
>> +1.9%
>>          SimpleLongArithmetic:   110ms   113ms   -2.0%   111ms   113ms   
>> -2.1%
>>                    SmallLists:   128ms   117ms   +9.5%   130ms   124ms   
>> +4.9%
>>                   SmallTuples:   115ms   114ms   +1.7%   117ms   114ms   
>> +2.2%
>>         SpecialClassAttribute:   101ms   112ms  -10.3%   104ms   114ms   
>> -8.9%
>>      SpecialInstanceAttribute:   173ms   177ms   -1.9%   176ms   179ms   
>> -1.6%
>>                StringMappings:   165ms   167ms   -1.2%   168ms   169ms   
>> -0.5%
>>              StringPredicates:   126ms   134ms   -5.7%   127ms   134ms   
>> -5.6%
>>                 StringSlicing:   125ms   123ms   +1.9%   131ms   130ms   
>> +0.7%
>>                     TryExcept:    79ms    80ms   -0.6%    80ms    80ms   
>> -0.8%
>>                    TryFinally:   110ms   107ms   +3.0%   111ms   112ms   
>> -1.1%
>>                TryRaiseExcept:    99ms   101ms   -1.6%   100ms   102ms   
>> -1.7%
>>                  TupleSlicing:   127ms   127ms   +0.6%   137ms   137ms   
>> +0.0%
>>               UnicodeMappings:   144ms   144ms   -0.3%   145ms   145ms   
>> -0.4%
>>             UnicodePredicates:   116ms   114ms   +1.3%   117ms   115ms   
>> +1.1%
>>             UnicodeProperties:   106ms   102ms   +3.6%   107ms   104ms   
>> +3.1%
>>                UnicodeSlicing:    95ms   111ms  -14.0%    99ms   112ms  
>> -11.8%
>>                   WithFinally:   157ms   152ms   +3.3%   159ms   154ms   
>> +3.3%
>>               WithRaiseExcept:   123ms   125ms   -1.1%   125ms   126ms   
>> -1.2%
>> -------------------------------------------------------------------------------
>> Totals:                          6043ms  6182ms   -2.2%  6185ms  6301ms   
>> -1.9%
>>
>> (this=pybench.out, other=../build_orig/pybench.out)
>>
>>
>> 2to3 times:
>>
>> Before:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real    0m56.685s
>> user    0m55.620s
>> sys     0m0.380s
>>
>> After:
>> $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
>> real    0m55.067s
>> user    0m53.843s
>> sys     0m0.376s
>>
>> == 3% faster
>>
>>
>> Gory details:
>>
>> The meat of the patch is:
>> @@ -884,11 +891,12 @@
>>        fast_next_opcode:
>>                f->f_lasti = INSTR_OFFSET();
>>
>>                /* line-by-line tracing support */
>>
>> -               if (tstate->c_tracefunc != NULL && !tstate->tracing) {
>> +               if (_Py_TracingPossible &&
>> +                   tstate->c_tracefunc != NULL && !tstate->tracing) {
>>
>>
>> This converts the generated assembly (produced with `gcc -S -dA ...`,
>> then manually annotated a bit) from:
>>
>>        # basic block 17
>>        # ../Python/ceval.c:885
>> LM541:
>>        movl    8(%ebp), %ecx
>> LVL319:
>>        subl    -316(%ebp), %edx
>>        movl    %edx, 60(%ecx)
>>        # ../Python/ceval.c:889
>> LM542:
>> # %esi = tstate
>>        movl    -336(%ebp), %esi
>> LVL320:
>> # %eax = tstate->c_tracefunc
>>        movl    28(%esi), %eax
>> LVL321:
>> # if tstate->c_tracefunc == 0
>>        testl   %eax, %eax
>> # goto past-if ()
>>        je      L567
>> # more if conditions here
>>
>> to:
>>
>>        # basic block 17
>>        # ../Python/ceval.c:889
>> LM542:
>>        movl    8(%ebp), %ecx
>> LVL319:
>>        subl    -316(%ebp), %edx
>>        movl    %edx, 60(%ecx)
>>        # ../Python/ceval.c:893
>> LM543:
>> # %eax = _Py_TracingPossible
>>        movl    __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax
>> LVL320:
>> # if _Py_TracingPossible != 0
>>        testl   %eax, %eax
>> # goto rest-of-if (nearby)
>>        jne     L2321
>> # opcode = NEXTOP(); continues here
>>
>>
>> The branch should be predicted accurately either way, so there are 2
>> things that may be contributing to the performance change.
>>
>> First, adding the global caching variable halves the amount of memory
>> that has to be read to check the prediction. The memory that is read
>> is still read one instruction before it's used, but adding a local
>> variable to read the memory earlier doesn't affect the performance.
>>
>> Without the global variable, the compiler puts the tracing code
>> immediately after the if; with the global, it moves it away and puts
>> the non-tracing code immediately after the first test in the if. This
>> may affect branch prediction and may affect the icache. I tried using
>> gcc's __builtin_expect() to ensure that the tracing code is always
>> out-of-line. This moved it much farther away and cost about 1% in
>> performance (i.e. 1% instead of 2% faster than "before"). I don't know
>> why the __builtin_expect() version would be slower. If anyone feels
>> inspired to test this out on another processor or compiler version,
>> let me know how it goes.
>>
>> Jeffrey
>>
>> _______________________________________________
>> Python-Dev mailing list
>> [email protected]
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe: 
>> http://mail.python.org/mailman/options/python-dev/brett%40python.org
>>
>>
>



-- 
Namasté,
Jeffrey Yasskin
http://jeffrey.yasskin.info/
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Patch to speed up non-tracing case in PyEval_EvalFrameEx (2% on pybench)

Reply via email to