To benchmark the new iterator when it is driven from C, and to make sure it works properly in a multi-threaded context, I've updated numexpr to use it. In addition to some performance improvements, this also made it easy to add optional out= and order= parameters to the evaluate function. The numexpr repository with this update is available here:
https://github.com/m-paradox/numexpr

To use it, you need the new_iterator branch of NumPy from here:

https://github.com/m-paradox/numpy

In all cases tested, the iterator version of numexpr's evaluate function matches or beats the standard version. The timing results are below, with some explanatory comments placed inline:

-Mark

In [1]: import numexpr as ne

# numexpr front page example
In [2]: a = np.arange(1e6)
In [3]: b = np.arange(1e6)

In [4]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 121 ms per loop

In [5]: ne.set_num_threads(1)

# iterator version performance matches standard version
In [6]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.8 ms per loop
In [7]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 24.3 ms per loop

In [8]: ne.set_num_threads(2)

# iterator version performance matches standard version
In [9]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 21 ms per loop
In [10]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
10 loops, best of 3: 20.5 ms per loop

# numexpr front page example with a 10x bigger array
In [11]: a = np.arange(1e7)
In [12]: b = np.arange(1e7)

In [13]: ne.set_num_threads(2)

# the iterator version performance improvement is due to
# a small task scheduler tweak
In [14]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 282 ms per loop
In [15]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 255 ms per loop

# numexpr front page example with a Fortran contiguous array
In [16]: a = np.arange(1e7).reshape(10,100,100,100).T
In [17]: b = np.arange(1e7).reshape(10,100,100,100).T

In [18]: timeit a**2 + b**2 + 2*a*b
1 loops, best of 3: 3.22 s per loop

In [19]: ne.set_num_threads(1)

# even with a C-ordered output, the iterator version performs better
In [20]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.74 s per loop
In [21]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 379 ms per loop
In [22]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 2.03 s per loop

In [23]: ne.set_num_threads(2)

# the standard version just uses 1 thread here, I believe
# the iterator version performs the same as for the flat 1e7-sized array
In [24]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 3.92 s per loop
In [25]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
1 loops, best of 3: 254 ms per loop
In [26]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
1 loops, best of 3: 1.74 s per loop
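The Fortran-contiguous timings hinge on iteration order: walking the operands in memory order lets the inner loop run over one long contiguous stretch, while forcing C order on transposed data breaks it into many short strided pieces. A minimal sketch of that effect follows. It is not numexpr's code; it assumes a released NumPy that exposes the iterator to Python as np.nditer (with context-manager support, NumPy 1.15 or later), and sum_of_squares is a hypothetical helper name used only for illustration.

```python
import numpy as np


def sum_of_squares(a, b, order='K'):
    """Evaluate a**2 + b**2 + 2*a*b chunk by chunk with np.nditer.

    Returns the iterator-allocated result and the number of inner-loop
    chunks the iterator handed out (hypothetical helper, illustration only).
    """
    it = np.nditer(
        [a, b, None],                      # None: let the iterator allocate the output
        flags=['external_loop'],           # hand out 1-D chunks instead of scalars
        op_flags=[['readonly'], ['readonly'], ['writeonly', 'allocate']],
        order=order,                       # 'K' follows memory layout, 'C' forces C order
    )
    chunks = 0
    with it:
        for x, y, out in it:
            out[...] = x**2 + y**2 + 2*x*y
            chunks += 1
        return it.operands[2], chunks


# A small stand-in for the transposed (Fortran-contiguous) arrays in the session.
a = np.arange(24.0).reshape(2, 3, 4).T
b = np.arange(24.0).reshape(2, 3, 4).T

res_k, chunks_k = sum_of_squares(a, b, order='K')
res_c, chunks_c = sum_of_squares(a, b, order='C')

print("order='K':", chunks_k, "chunk(s), F-contiguous output:", res_k.flags['F_CONTIGUOUS'])
print("order='C':", chunks_c, "chunk(s), C-contiguous output:", res_c.flags['C_CONTIGUOUS'])
```

Both calls compute the same values; only the traversal and the layout of the allocated output differ. That is consistent with order='C' being slower than the default in the timings above, since the inputs are then read with large strides.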
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion