On Di, 2015-11-24 at 16:49 -0800, Eli Bendersky wrote:
> On Mon, Nov 23, 2015 at 2:09 PM, Sebastian Berg
> <sebast...@sipsolutions.net> wrote:
> > On Mo, 2015-11-23 at 13:31 -0800, Eli Bendersky wrote:
> > > Hello,
> > >
> > > I'm trying to understand the buffering done by the Numpy iterator
> > > interface (the new, post-1.6 one) when running ufuncs on arrays
> > > that require broadcasting. Consider this simple case:
> > >
> > > In [35]: m = np.arange(16).reshape(4,4)
> > > In [37]: n = np.arange(4)
> > > In [39]: m + n
> > > Out[39]:
> > > array([[ 0,  2,  4,  6],
> > >        [ 4,  6,  8, 10],
> > >        [ 8, 10, 12, 14],
> > >        [12, 14, 16, 18]])
> >
> > On first sight this seems true. However, there is one other point to
> > consider here. The inner ufunc loop can only handle a single stride.
> > The contiguous array `n` has to be iterated as if it had the strides
> > `(0, 8)`, which are not the strides of the contiguous array `m`,
> > which can be "unrolled" to 1-D. Those effective strides are thus not
> > contiguous for the inner ufunc loop, and the operation cannot be
> > unrolled into a single ufunc call!
> >
> > The optimization (which might kick in a bit more broadly, maybe) is
> > thus that the number of inner loop calls is minimized; whether that
> > is worth it, I am not sure. It may well be that there is some worthy
> > optimization possible here.
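[As an aside, the stride point above is easy to check from Python. A small sketch, assuming a 64-bit build so int64 items are 8 bytes:]

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
n = np.arange(4)

# Broadcasting n against m gives a view whose first stride is 0:
# the same 4 elements are re-read for every row.
nb = np.broadcast_to(n, m.shape)

print(m.strides)   # (32, 8): contiguous, can be "unrolled" to 1-D
print(nb.strides)  # (0, 8): the 0-stride view cannot share one flat loop
```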
> > Note however, that this does not occur for large inner loop sizes
> > (though I think you can find some "bad" sizes):
> >
> > ```
> > In [18]: n = np.arange(40000)
> >
> > In [19]: m = np.arange(160000).reshape(4,40000)
> >
> > In [20]: o = m + n
> > Iterator: Checking casting for operand 0
> > op: dtype('int64'), iter: dtype('int64')
> > Iterator: Checking casting for operand 1
> > op: dtype('int64'), iter: dtype('int64')
> > Iterator: Checking casting for operand 2
> > op: <null>, iter: dtype('int64')
> > Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
> > Iterator: Setting allocated stride 0 for iterator dimension 1 to 320000
> > Iterator: Copying inputs to buffers
> > Iterator: Expanding inner loop size from 8192 to 40000 since buffering wasn't needed
> > Any buffering needed: 0
> > Iterator: Finished copying inputs to buffers (buffered size is 40000)
> > ```
>
> The heuristic in the code says that if we can use a single stride and
> that's larger than the buffer size (which I assume is the default
> buffer size, and can change) then it's "is_onestride" and no buffering
> is done.
>
> So this led me to explore around this threshold (8192 items by default
> on my machine), and indeed we can notice funny behavior:
>
> In [51]: %%timeit n = arange(8192); m = np.arange(8192*40).reshape(40,8192)
>    ....: o = m + n
>    ....:
> 1000 loops, best of 3: 274 µs per loop
>
> In [52]: %%timeit n = arange(8292); m = np.arange(8292*40).reshape(40,8292)
>    ....: o = m + n
>    ....:
> 1000 loops, best of 3: 229 µs per loop
>
> So, given this, it's not very clear why the "optimization" kicks in.
> Buffering for small sizes seems like a mistake.
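[For reference, that 8192-item threshold is the default ufunc buffer size, which is queryable and adjustable at runtime via np.getbufsize()/np.setbufsize(). A quick sketch; the size is counted in elements, not bytes, as far as I can tell:]

```python
import numpy as np

# The iterator's buffering threshold is the ufunc buffer size.
print(np.getbufsize())  # 8192 unless someone changed it

# Enlarging it should move the "is_onestride" cutoff seen in the
# timings above, which is one way to probe the heuristic.
np.setbufsize(16384)
print(np.getbufsize())  # 16384

np.setbufsize(8192)     # restore the default
```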
I am pretty sure it is not generally a mistake. Consider the case of a
10000x3 array (note that shrinking the buffer can have a great advantage
here, though I am not sure if this is usually done). If you have
(10000, 3) + (3,) arrays, then the ufunc outer loop would have 10000x
overhead. Doing the buffering (which I believe has some extra code to be
faster) will lower this to a few ufunc inner loop calls. I have not
timed it, but I would be a bit surprised if it was not faster in this
case at least. Even calling a C function (and looping) for an inner loop
of 3 elements should be quite a bit of overhead, and my guess is more
overhead than the buffering.

- Sebastian

> Eli
>
> > Anyway, feel free to have a look ;). The code is not the most
> > readable one in NumPy, and it would not surprise me a lot if you can
> > find something to tweak.
> >
> > - Sebastian
> >
> > > If I instrument Numpy (setting NPY_IT_DBG_TRACING and such), I see
> > > that when the add() ufunc is called, 'n' is copied into a
> > > temporary buffer by the iterator. The ufunc then gets the buffer
> > > as its data.
> > >
> > > My question is: why is this buffering needed? It seems wasteful,
> > > since no casting is required here, there are no special alignment
> > > problems, and 'n' is contiguously laid out in memory. It seems
> > > that it would be more efficient to just use 'n' in the ufunc
> > > instead of passing in the buffer. What am I missing?
> > >
> > > Thanks in advance,
> > > Eli
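[PS: to put a rough number on the inner-loop overhead argument above, here is a sketch that emulates the unbuffered worst case with an explicit per-row loop. This is only an illustration of the per-call overhead, not NumPy's actual unbuffered code path:]

```python
import timeit
import numpy as np

m = np.arange(30000).reshape(10000, 3)
n = np.arange(3)

def buffered():
    # One ufunc call; NumPy buffers the broadcast operand internally.
    return m + n

def row_loop():
    # Emulates the unbuffered worst case: one 3-element ufunc
    # inner-loop call per row, i.e. 10000 tiny calls.
    out = np.empty_like(m)
    for i in range(m.shape[0]):
        np.add(m[i], n, out=out[i])
    return out

print(timeit.timeit(buffered, number=100))
print(timeit.timeit(row_loop, number=100))  # far slower per call
```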
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion