On Wed, Dec 22, 2010 at 12:05 PM, Francesc Alted <fal...@pytables.org> wrote:
> <snip>
>
> Ah, things go well now:
>
> >>> timeit 3*a+b-(a/c)
> 10 loops, best of 3: 67.7 ms per loop
> >>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
> 10 loops, best of 3: 27.8 ms per loop
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 10 loops, best of 3: 42.8 ms per loop
>
> So, yup, I'm seeing the good speedup here too :-)

Great!

<snip>

> Well, see the timings for the non-broadcasting case:
>
> >>> a = np.random.random((50,50,50,10))
> >>> b = np.random.random((50,50,50,10))
> >>> c = np.random.random((50,50,50,10))
>
> >>> timeit 3*a+b-(a/c)
> 10 loops, best of 3: 31.1 ms per loop
> >>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
> 10 loops, best of 3: 24.5 ms per loop
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 100 loops, best of 3: 10.4 ms per loop
>
> However, the above comparison is not fair, as numexpr uses all your
> cores by default (2 for the case above). If we force using only one
> core:
>
> >>> ne.set_num_threads(1)
> >>> timeit ne.evaluate("3*a+b-(a/c)")
> 100 loops, best of 3: 16 ms per loop
>
> which is still faster than luf. In this case numexpr was not using SSE,
> but in case luf does so, this does not imply better speed.

Ok, I get pretty close to the same ratios (and my machine feels a bit
slow...):

In [6]: timeit 3*a+b-(a/c)
10 loops, best of 3: 101 ms per loop

In [7]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 53.4 ms per loop

In [8]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 27.8 ms per loop

In [9]: ne.set_num_threads(1)

In [10]: timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 33.6 ms per loop

I think the closest to a "memcpy" we can do here is just adding, which
suggests the expression evaluation machinery carries roughly 20% overhead
(33.6 ms vs 27.9 ms). While that's small compared to the speedup over
straight NumPy, I think it's still worth considering.

In [11]: timeit ne.evaluate("a+b+c")
10 loops, best of 3: 27.9 ms per loop

Even just switching from add to divide adds more than 10% overhead. With
SSE2, these divides could be done two at a time for doubles or four at a
time for floats to cut that down.

In [12]: timeit ne.evaluate("a/b/c")
10 loops, best of 3: 31.7 ms per loop

This all shows that the Python interpreter overhead in 'luf' is still
pretty big; the new iterator can't beat numexpr by itself. I think
numexpr could get a nice boost from using the new iterator internally,
though - going back to the original motivation, arrays with differing
memory orderings, 'luf' is 10x faster than single-threaded numexpr:

In [15]: a = np.random.random((50,50,50,10)).T

In [16]: b = np.random.random((50,50,50,10)).T

In [17]: c = np.random.random((50,50,50,10)).T

In [18]: timeit ne.evaluate("3*a+b-(a/c)")
1 loops, best of 3: 556 ms per loop

In [19]: timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 52.5 ms per loop

Cheers,
Mark
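P.S. Since 'luf' isn't defined anywhere in this thread: it's a thin
wrapper around the new iterator (np.nditer). Something along the lines
of the sketch below is enough to reproduce the timings - this is a
minimal sketch, not necessarily the exact code that was timed, and the
flag choices are assumptions:

import numpy as np

def luf(expr, *args, **kwargs):
    # Evaluate expr lazily over all operands at once.  'buffered' +
    # 'external_loop' hand expr contiguous, buffer-sized chunks no
    # matter how the inputs are laid out in memory, and order='K'
    # visits elements in whatever order the strides make cheapest -
    # which is where the win on the transposed arrays above comes from.
    op = (kwargs.get('out', None),) + args
    it = np.nditer(op, ['buffered', 'external_loop'],
                   [['writeonly', 'allocate', 'no_broadcast']] +
                   [['readonly']] * len(args),
                   order=kwargs.get('order', 'K'),
                   casting=kwargs.get('casting', 'safe'),
                   buffersize=kwargs.get('buffersize', 0))
    while not it.finished:
        it[0] = expr(*it[1:])   # evaluate one buffer-sized chunk
        it.iternext()
    return it.operands[0]

Called as luf(lambda a,b,c: 3*a+b-(a/c), a, b, c), it should give the
same result as the plain NumPy expression (np.allclose against
3*a+b-(a/c) comes back True), just with buffer-sized temporaries
instead of full-size ones.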
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion