On Wednesday 22 December 2010 20:42:54, Mark Wiebe wrote:
> On Wed, Dec 22, 2010 at 11:16 AM, Francesc Alted <fal...@pytables.org> wrote:
> > On Wednesday 22 December 2010 19:52:45, Mark Wiebe wrote:
> > > On Wed, Dec 22, 2010 at 10:41 AM, Francesc Alted <fal...@pytables.org> wrote:
> > > > NumPy version 2.0.0.dev-147f817
> > >
> > > There's your problem, it looks like the PYTHONPATH isn't seeing
> > > your new build for some reason.  That build is off of this commit
> > > in the NumPy master branch:
> > >
> > > https://github.com/numpy/numpy/commit/147f817eefd5efa56fa26b03953a51d533cc27ec
> >
> > Uh, I think I'm a bit lost here.  I've cloned this repo:
> >
> > $ git clone git://github.com/m-paradox/numpy.git
> >
> > Is that wrong?
>
> That's right, it was my mistake to assume that the page for a branch
> on github would give you that branch.  You need the 'new_iterator'
> branch, so after that clone, you should do this:
>
> $ git checkout origin/new_iterator
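(As a side note, an equivalent way to work on that branch is to create a
local tracking branch; this is just standard git, nothing specific to this
repo:

$ git fetch origin
$ git checkout -b new_iterator origin/new_iterator

)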
Ah, things go well now:

>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 67.7 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 27.8 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
10 loops, best of 3: 42.8 ms per loop

So, yup, I'm seeing the good speedup here too :-)

> > But you need to transport those small chunks from main memory to
> > cache before you can start doing the computation for this piece,
> > right?  This is why I'm saying that the bottleneck for evaluating
> > arbitrary expressions (like "3*a+b-(a/c)", i.e. not including
> > transcendental functions or broadcasting) is memory bandwidth
> > (and in particular RAM bandwidth).
>
> In the example expression, I believe the evaluation would go
> something like this.  Assuming the memory allocator keeps giving
> back the same locations to 'luf', all temporary variables will
> already be in cache after the first chunk.
>
> temp1 = 3 * a           # a is read from main memory
> temp2 = temp1 + b       # b is read from main memory
> temp3 = a / c           # a is already in cache, c is read from main memory
> result = temp2 - temp3  # result is written to main memory
>
> So there are 4 reads and writes to chunks from outside of the cache,
> but 12 total reads and writes to chunks, so speeding up the parts
> already in cache would appear to be beneficial.  The benefit will
> get better with more complicated expressions.  I think as long as
> the operation is slower than a memcpy, the RAM bandwidth isn't the
> main bottleneck to be concerned with, but instead produces an upper
> bound on performance.  I'm not sure how to precisely measure that
> overhead, though.

Well, see the timings for the non-broadcasting case:

>>> a = np.random.random((50,50,50,10))
>>> b = np.random.random((50,50,50,10))
>>> c = np.random.random((50,50,50,10))
>>> timeit 3*a+b-(a/c)
10 loops, best of 3: 31.1 ms per loop
>>> timeit luf(lambda a,b,c:3*a+b-(a/c), a, b, c)
10 loops, best of 3: 24.5 ms per loop
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 10.4 ms per loop

However, the above comparison is not fair, as numexpr uses all your
cores by default (2 for the case above).  If we force using only one
core:

>>> ne.set_num_threads(1)
>>> timeit ne.evaluate("3*a+b-(a/c)")
100 loops, best of 3: 16 ms per loop

which is still faster than luf.  In this case numexpr was not using
SSE, but even if luf does use it, that does not necessarily translate
into better speed.

-- 
Francesc Alted
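PS: for completeness, here is a rough pure-NumPy sketch of the
blocked-evaluation idea discussed above.  The helper name, the chunk
size and the flattening are only placeholders for illustration; this is
not the actual luf or numexpr implementation:

import numpy as np

def chunked_eval(expr, a, b, c, chunk=32*1024):
    # Evaluate expr(a, b, c) over cache-sized 1-D chunks, so that the
    # per-chunk temporaries stay cache-resident and only the input
    # chunks and the output chunk touch main memory.
    af, bf, cf = a.ravel(), b.ravel(), c.ravel()
    out = np.empty_like(af)
    for start in range(0, af.size, chunk):
        stop = start + chunk
        out[start:stop] = expr(af[start:stop], bf[start:stop], cf[start:stop])
    return out.reshape(a.shape)

a = np.random.random((50, 50, 50, 10))
b = np.random.random((50, 50, 50, 10))
c = np.random.random((50, 50, 50, 10))
r = chunked_eval(lambda a, b, c: 3*a + b - (a/c), a, b, c)
np.allclose(r, 3*a + b - (a/c))   # True

The point is only that, per chunk, the temporaries can be reused from
cache, which matches the read/write counting in the quoted analysis.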