On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote: > Hello, > > > I'm trying to understand the buffering done by the Numpy iterator > interface (the new post 1.6-one) when running ufuncs on arrays that > require broadcasting. Consider this simple case: > > In [35]: m = np.arange(16).reshape(4,4) > In [37]: n = np.arange(4) > In [39]: m + n > Out[39]: > array([[ 0, 2, 4, 6], > [ 4, 6, 8, 10], > [ 8, 10, 12, 14], > [12, 14, 16, 18]]) > >
On first sight this seems true. However, there is one other point to consider here. The inner ufunc loop can only handle a single stride. The contiguous array `n` has to be iterated as if it had the strides `(0, 8)`, which is not the strides of the contiguous array `m` which can be "unrolled" to 1-D. Those effective strides are thus not contiguous for the inner ufunc loop and cannot be unrolled into a single ufunc call! The optimization (which might kick in a bit more broadly maybe), is thus that the number of inner loop calls is minimized, whether that is worth it, I am not sure, it may well be that there is some worthy optimization possible here. Note however, that this does not occur for large inner loop sizes (though I think you can find some "bad" sizes): ``` In [18]: n = np.arange(40000) In [19]: m = np.arange(160000).reshape(4,40000) In [20]: o = m + n Iterator: Checking casting for operand 0 op: dtype('int64'), iter: dtype('int64') Iterator: Checking casting for operand 1 op: dtype('int64'), iter: dtype('int64') Iterator: Checking casting for operand 2 op: <null>, iter: dtype('int64') Iterator: Setting allocated stride 1 for iterator dimension 0 to 8 Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000 Iterator: Copying inputs to buffers Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed Any buffering needed: 0 Iterator: Finished copying inputs to buffers (buffered size is 40000) ``` Anyway, feel free to have a look ;). The code is not the most read one in NumPy, and it would not surprise me a lot if you can find something to tweak. - Sebastian > > If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see > that when the add() ufunc is called, 'n' is copied into a temporary > buffer by the iterator. The ufunc then gets the buffer as its data. > > > My question is: why is this buffering needed? It seems wasteful, since > no casting is required here, no special alignment problems and also > 'n' is contiguously laid out in memory. It seems that it would be more > efficient to just use 'n' in the ufunc instead of passing in the > buffer. What am I missing? > > > Thanks in advance, > > Eli > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion
signature.asc
Description: This is a digitally signed message part
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion