Correct me if I'm wrong, but this code still doesn't seem to flatten arrays as much as possible. The array you get from np.zeros((100,100)) can be iterated over as an array of shape (10000,), which should yield very substantial speedups. Since most arrays one operates on are laid out like this, there's potentially a large speedup here. (On the other hand, if this optimization is being done, then these tests are somewhat deceptive.)
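To illustrate the point, here's a small sketch: a C-contiguous array is one flat block in memory, so NumPy can view it 1-D without copying, whereas a transposed array cannot.

```python
import numpy as np

x = np.zeros((100, 100))

# C-contiguous: the elements already form one flat block, so a (10000,)
# view costs nothing -- no data is copied.
flat = x.reshape(-1)
print(flat.shape)                 # (10000,)
print(np.shares_memory(flat, x))  # True -- a view, not a copy

# Non-contiguous (transposed): reshape has to copy the data instead.
flat_t = x.T.reshape(-1)
print(np.shares_memory(flat_t, x))  # False
```

A ufunc that detects the first case can run a single inner loop over all 10000 elements instead of 100 loops of 100.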
On the other hand, it seems to me there's still some question about how to optimize execution order when the ufunc is dealing with two or more arrays with different memory layouts. In such a case, which one do you reorder in favour of? Is it acceptable to return freshly-allocated arrays that are not C-contiguous?

Anne

On 15 June 2010 07:37, Pauli Virtanen <[email protected]> wrote:
> pe, 2010-06-11 kello 10:52 +0200, Hans Meine kirjoitti:
>> On Friday 11 June 2010 10:38:28 Pauli Virtanen wrote:
> [clip]
>> > I think there was some code ready to implement this shuffling. I'll try
>> > to dig it out and implement the shuffling.
>>
>> That would be great!
>>
>> Ullrich Köthe has implemented this for our VIGRA/numpy bindings:
>> http://tinyurl.com/fast-ufunc
>> At the bottom you can see that he basically wraps all numpy.ufuncs he can
>> find in the numpy top-level namespace automatically.
>
> Ok, here's the branch:
>
>   http://github.com/pv/numpy-work/compare/master...feature;ufunc-memory-access-speedup
>
> Some samples (the reported times in parentheses are times without the
> optimization):
>
>   x = np.zeros([100,100])
>   %timeit x + x
>   10000 loops, best of 3: 106 us (99.1 us) per loop
>   %timeit x.T + x.T
>   10000 loops, best of 3: 114 us (164 us) per loop
>   %timeit x.T + x
>   10000 loops, best of 3: 176 us (171 us) per loop
>
>   x = np.zeros([100,5,5])
>   %timeit x.T + x.T
>   10000 loops, best of 3: 47.7 us (61 us) per loop
>
>   x = np.zeros([100,5,100]).transpose(2,0,1)
>   %timeit np.cos(x)
>   100 loops, best of 3: 3.77 ms (9 ms) per loop
>
> As expected, some improvement can be seen. There also appears to be an
> additional 5 us (~700 inner loop operations, it seems) overhead coming
> from somewhere; perhaps this can still be reduced.
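To make the mixed-layout question concrete, here is a small probe (behaviour shown is that of recent NumPy releases, where the iterator matches the output layout to the inputs when they agree; the mixed case is exactly the open question above):

```python
import numpy as np

x = np.zeros((100, 100))

# Matching layouts: the freshly-allocated output matches the operands,
# so adding two Fortran-ordered views yields a Fortran-ordered result.
print((x + x).flags['C_CONTIGUOUS'])      # True
print((x.T + x.T).flags['F_CONTIGUOUS'])  # True in recent NumPy

# Mixed layouts: there is no single "best" order to reorder in favour of;
# the result is still contiguous, but in whichever order NumPy picks.
r = x.T + x
print(r.flags['C_CONTIGUOUS'], r.flags['F_CONTIGUOUS'])
```

Note that `x.T + x.T` returning a non-C-contiguous array is precisely the "is it acceptable?" case raised above: callers who assume C order from arithmetic on fresh arrays would be surprised.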
> --
> Pauli Virtanen
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
