Yury Selivanov added the comment:

tl;dr   I'm attaching a new patch -- fastint4 -- the fastest of them all.  It 
incorporates Serhiy's suggestion to export long/float functions and use them.  
I think it's reasonably complete -- please review it, and let's get it 
committed.

== Benchmarks ==

spectral_norm (fastint_alt)    -> 1.07x faster
spectral_norm (fastintfloat)   -> 1.08x faster
spectral_norm (fastint3.patch) -> 1.29x faster
spectral_norm (fastint4.patch) -> 1.16x faster

spectral_norm (fastint**.patch)-> 1.31x faster
nbody (fastint**.patch)        -> 1.16x faster

Where:
- fastint3 - is my previous patch that nobody likes (it inlined a lot of logic 
from longobject/floatobject)

- fastint4 - is the patch I'm attaching and ideally want to commit

- fastint** - is a modification of fastint4.  This one is very interesting -- I 
started to profile the different approaches and found two bottlenecks that made 
Serhiy's and my other patches slower than fastint3: PyLong_AsDouble can be 
significantly optimized, and PyLong_FloorDiv is very inefficient.

PyLong_AsDouble can be sped up several times over if we add a fast path for 
single-digit longs:

    // longobject.c: PyLong_AsDouble
    if (PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1) {
        /* fast path: a 0- or 1-digit long always converts to a double exactly */
        return (double)MEDIUM_VALUE((PyLongObject *)v);
    }
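
(The guard is safe because a size of 0 or +/-1 means the value fits in one 
digit of at most 30 bits, and a double's 53-bit mantissa represents any such 
value exactly.)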


PyLong_FloorDiv (which fastint4 adds) can be specialized for single-digit 
operands, which gives it a tremendous boost; a rough sketch of that kind of 
fast path is below.
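
To give an idea of what I mean (this is just a sketch, not the code from the 
patch -- the name fast_floor_div is made up, and it assumes the longintrepr.h 
internals MEDIUM_VALUE/stwodigits are in scope, as they are in longobject.c):

    static PyObject *
    fast_floor_div(PyObject *v, PyObject *w)
    {
        if (PyLong_CheckExact(v) && PyLong_CheckExact(w)
            && Py_ABS(Py_SIZE(v)) <= 1 && Py_ABS(Py_SIZE(w)) == 1)
        {
            /* both operands fit in a single digit; divide in
               machine arithmetic */
            stwodigits a = MEDIUM_VALUE((PyLongObject *)v);
            stwodigits b = MEDIUM_VALUE((PyLongObject *)w);
            stwodigits q = a / b;
            /* C division truncates toward zero; adjust to floor when
               the signs differ and there is a remainder */
            if ((a % b) != 0 && ((a < 0) != (b < 0)))
                q--;
            return PyLong_FromLongLong((long long)q);
        }
        /* multi-digit operands and division by zero go through the
           generic path */
        return PyNumber_FloorDivide(v, w);
    }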

With those two optimizations, fastint4 becomes as fast as fastint3.  I'll 
create separate issues for PyLong_AsDouble and PyLong_FloorDiv.

== Micro-benchmarks ==

Floats + ints:  -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + 
(x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"

2.7:          0.42 (usec)
3.5:          0.619
fastint_alt:  0.619
fastintfloat: 0.52
fastint3:     0.289
fastint4:     0.51
fastint**:    0.314

===

Ints:  -m timeit -s "x=2" "x + 10 + x * 20 - x // 3 + x* 10 + 20 -x"

2.7:          0.151 (usec)
3.5:          0.19
fastint_alt:  0.136
fastintfloat: 0.135
fastint3:     0.135
fastint4:     0.122
fastint**:    0.122


P.S. I have another variant of fastint4 that uses fast_* functions in the 
ceval loop instead of a big macro.  Its performance is slightly worse than 
with the macro; a rough sketch of the macro-style dispatch is below.
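
For reference, the macro-style dispatch looks roughly like this (again just a 
sketch of the idea, not the actual macro from fastint4; fast_long_add stands 
for a hypothetical single-digit int helper):

    /* Dispatch BINARY_ADD: try exact int/float fast paths first,
       fall back to the generic protocol otherwise. */
    #define FAST_BINARY_ADD(left, right, result)                          \
        do {                                                              \
            if (PyLong_CheckExact(left) && PyLong_CheckExact(right)) {    \
                result = fast_long_add(left, right);                      \
            }                                                             \
            else if (PyFloat_CheckExact(left) &&                          \
                     PyFloat_CheckExact(right)) {                         \
                result = PyFloat_FromDouble(PyFloat_AS_DOUBLE(left) +     \
                                            PyFloat_AS_DOUBLE(right));    \
            }                                                             \
            else {                                                        \
                result = PyNumber_Add(left, right);                       \
            }                                                             \
        } while (0)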

----------
Added file: http://bugs.python.org/file41811/fastint4.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21955>
_______________________________________