[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Dom Grigonis Wed, 01 Nov 2023 07:21:44 -0700

Your comparisons do not paint correct picture.
a) Most of time here is spent for array allocation and the actual time of the 
loop gets swallowed in the noise
a) You do not test early stop - your benchmarks simply test full loop as 
condition is almost never hit - 5 standard deviations...


Here is comparison with Cython:

import npi
arr = np.random.random(100_000)
%timeit npi.first_above(arr, 5)     # 66.7 µs
%timeit first(arr, lambda x: x > 5) # 83.8 ms
arr[100] += 10
%timeit npi.first_above(arr, 5)     # 16.2 µs
%timeit first(arr, lambda x: x > 5) # 86.9 ms

It is in the magnitude of 1000 x slower.

Regards,
DG

> On 1 Nov 2023, at 14:07, Juan Nunez-Iglesias <[email protected]> wrote:
> 
> Have you tried timing things? Thankfully this is easy to test because the 
> Python source of numba-jitted functions is available at jitted_func.py_func.
> 
> In [23]: @numba.njit
>     ...: def _first(arr, pred):
>     ...:     for i, elem in enumerate(arr):
>     ...:         if pred(elem):
>     ...:             return i
>     ...:
>     ...: def first(arr, pred):
>     ...:     _pred = numba.njit(pred)
>     ...:     return _first(arr, _pred)
>     ...:
> 
> In [24]: arr = np.random.random(100_000_000)
> 
> In [25]: %timeit first(arr, lambda x: x > 5)
> 72 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> In [26]: %timeit arr + 5
> 90.3 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> 
> In [27]: %timeit _first.py_func(arr, lambda x: x > 5)
> 7.8 s ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> 
> So numba gives a >100x speedup. It's still not as fast as a NumPy function 
> call that doesn't have an allocation overhead:
> 
> In [30]: arr2 = np.empty_like(arr, dtype=bool)
> 
> In [32]: %timeit np.greater(arr, 5, out=arr2)
> 13.9 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> 
> But it's certainly much better than pure Python! And it's not a huge cost for 
> the flexibility.
> 
> Juan.
> 
> On Wed, 1 Nov 2023, at 10:42 AM, Dom Grigonis wrote:
>> This results in a very slow code. The function calls of 
>> 
>> 
>> _pred = numba.njit(pred)
>> 
>> are expensive and this sort of approach will be comparable to pure python 
>> functions.
>> 
>> This is only recommended for sourcing functions that are not called 
>> frequently, but rather have a large computational content within them. In 
>> other words not suitable for predicates.
>> 
>> Regards,
>> DG
>> 
>>> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> If you add a layer of indirection with Numba you can get a *very* nice API:
>>> 
>>> @numba.njit
>>> def _first(arr, pred):
>>>     for i, elem in enumerate(arr):
>>>         if pred(elem):
>>>             return i
>>> 
>>> def first(arr, pred):
>>>     _pred = numba.njit(pred)
>>>     return _first(arr, _pred)
>>> 
>>> This even works with lambdas! (TIL, thanks Numba devs!)
>>> 
>>> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
>>> 215
>>> 
>>> Since Numba has ufunc support I don't suppose it would be hard to make it 
>>> work with an axis= argument, but I've never played with that API myself.
>>> 
>>> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>>>> I've implemented such functions in Cython and packaged them into a library 
>>>> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>>>> 
>>>> It exposes the following functions:
>>>> 
>>>> find(a, v)  # returns the index of the first occurrence of v in a
>>>> first_above(a, v)   # returns the index of the first element in a that is 
>>>> strictly above v
>>>> first_nonzero(a)   # returns the index of the first nonzero element
>>>> 
>>>> They scan the array and bail out immediately once the match is found. Have 
>>>> a significant performance gain if the element to be
>>>> found is closer to the beginning of the array. Have roughly the same speed 
>>>> as alternative methods if the value is missing.
>>>> 
>>>> The complete signatures of the functions look like this:
>>>> 
>>>> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
>>>> first_above(a, v, sorted=False, missing=-1, raises=False)
>>>> first_nonzero(a, missing=-1, raises=False)
>>>> 
>>>> This covers the most common use cases and does not accept Python callbacks 
>>>> because accepting them would nullify any speed gain
>>>> one would expect from such a function. A Python callback can be 
>>>> implemented with Numba, but anyone who can write the callback
>>>> in Numba has no need for a library that wraps it into a dedicated function.
>>>> 
>>>> The library has a 100% test coverage. Code style 'black'. It should be 
>>>> easy to add functions like 'first_below' if necessary.
>>>> 
>>>> A more detailed description of these functions can be found here 
>>>> <https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f>.
>>>> 
>>>> Best regards,
>>>>   Lev Maximov
>>>> 
>>>> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> I juggled a bit and found pretty nice solution using numba. Which is 
>>>> probably not very robust, but proves that such thing can be optimised 
>>>> while retaining flexibility. Check if it works for your use cases and let 
>>>> me know if anything fails or if it is slow compared to what you used.
>>>> 
>>>> 
>>>> 
>>>> first_true_str = """
>>>> def first_true(arr, n):
>>>>     result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
>>>>     for j in range(arr.shape[1]):
>>>>         k = 0
>>>>         for i in range(arr.shape[0]):
>>>>             x = arr[i:i + 1, j]
>>>>             if cond(x):
>>>>                 result[k, j] = i
>>>>                 k += 1
>>>>                 if k >= n:
>>>>                     break
>>>>     return result
>>>> """
>>>> 
>>>> 
>>>> class FirstTrue:
>>>>     CONTEXT = {'np': np}
>>>> 
>>>>     def __init__(self, expr):
>>>>         self.expr = expr
>>>>         self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>>>>         self.func_ast = ast.parse(first_true_str, mode='exec')
>>>>         self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>>>>         self.func_cmp = compile(self.func_ast, filename="<ast>", 
>>>> mode="exec")
>>>>         exec(self.func_cmp, self.CONTEXT)
>>>>         self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
>>>> 
>>>>     def __call__(self, arr, n=1, axis=None):
>>>>         # PREPARE INPUTS
>>>>         in_1d = False
>>>>         if axis is None:
>>>>             arr = np.ravel(arr)[:, None]
>>>>             in_1d = True
>>>>         elif axis == 0:
>>>>             if arr.ndim == 1:
>>>>                 in_1d = True
>>>>                 arr = arr[:, None]
>>>>         else:
>>>>             raise ValueError('axis ~in (None, 0)')
>>>>         res = self.func_nb(arr, n)
>>>>         if in_1d:
>>>>             res = res[:, 0]
>>>>         return res
>>>> 
>>>> 
>>>> if __name__ == '__main__':
>>>>     arr = np.arange(125).reshape((5, 5, 5))
>>>>     ft = FirstTrue('np.sum(x) > 30')
>>>>     print(ft(arr, n=2, axis=0))
>>>> 
>>>> [[1 0 0 0 0]
>>>>  [2 1 1 1 1]]
>>>> 
>>>> 
>>>> In [16]: %timeit ft(arr, 2, axis=0)
>>>> 1.31 µs ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops 
>>>> each)
>>>> 
>>>> Regards,
>>>> DG
>>>> 
>>>>> On 29 Oct 2023, at 23:18, rosko37 <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> An example with a 1-D array (where it is easiest to see what I mean) is 
>>>>> the following. I will follow Dom Grigonis's suggestion that the range not 
>>>>> be provided as a separate argument, as it can be just as easily "folded 
>>>>> into" the array by passing a slice. So it becomes just:
>>>>> idx = first_true(arr, cond)
>>>>> 
>>>>> As Dom also points out, the "cond" would likely need to be a "function 
>>>>> pointer" (i.e., the name of a function defined elsewhere, turning 
>>>>> first_true into a higher-order function), unless there's some way to pass 
>>>>> a parseable expression for simple cases. A few special cases like the 
>>>>> first zero/nonzero element could be handled with dedicated options (sort 
>>>>> of like matplotlib colors), but for anything beyond that it gets unwieldy 
>>>>> fast.
>>>>> 
>>>>> So let's say we have this:
>>>>> ******************
>>>>> def cond(x):
>>>>>     return x>50
>>>>> 
>>>>> search_arr = np.exp(np.arange(0,1000))
>>>>> 
>>>>> print(np.first_true(search_arr, cond))
>>>>> *******************
>>>>> 
>>>>> This should print 4, because the element of search_arr at index 4 (i.e. 
>>>>> the 5th element) is e^4, which is slightly greater than 50 (while e^3 is 
>>>>> less than 50). It should return this without testing the 6th through 
>>>>> 1000th elements of the array at all to see whether they exceed 50 or not. 
>>>>> This example is rather contrived, because simply taking the natural log 
>>>>> of 50 and rounding up is far superior, not even evaluating the array of 
>>>>> exponentials (which my example clearly still does--and in the use cases 
>>>>> I've had for such a function, I can't predict the array elements like 
>>>>> this--they come from loaded data, the output of a simulation, etc., and 
>>>>> are all already in a numpy array). And in this case, since the values are 
>>>>> strictly increasing, search_sorted() would work as well. But it 
>>>>> illustrates the idea.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> Could you please give a concise example? I know you have provided one, 
>>>>> but it is engrained deep in verbose text and has some typos in it, which 
>>>>> makes hard to understand exactly what inputs should result in what output.
>>>>> 
>>>>> Regards,
>>>>> DG
>>>>> 
>>>>> > On 25 Oct 2023, at 22:59, rosko37 <[email protected] 
>>>>> > <mailto:[email protected]>> wrote:
>>>>> > 
>>>>> > I know this question has been asked before, both on this list as well 
>>>>> > as several threads on Stack Overflow, etc. It's a common issue. I'm NOT 
>>>>> > asking for how to do this using existing Numpy functions (as that 
>>>>> > information can be found in any of those sources)--what I'm asking is 
>>>>> > whether Numpy would accept inclusion of a function that does this, or 
>>>>> > whether (possibly more likely) such a proposal has already been 
>>>>> > considered and rejected for some reason.
>>>>> > 
>>>>> > The task is this--there's a large array and you want to find the next 
>>>>> > element after some index that satisfies some condition. Such elements 
>>>>> > are common, and the typical number of elements to be searched through 
>>>>> > is small relative to the size of the array. Therefore, it would greatly 
>>>>> > improve performance to avoid testing ALL elements against the 
>>>>> > conditional once one is found that returns True. However, all built-in 
>>>>> > functions that I know of test the entire array. 
>>>>> > 
>>>>> > One can obviously jury-rig some ways, like for instance create a "for" 
>>>>> > loop over non-overlapping slices of length slice_length and call 
>>>>> > something like np.where(cond) on each--that outer "for" loop is much 
>>>>> > faster than a loop over individual elements, and the inner loop at most 
>>>>> > will go slice_length-1 elements past the first "hit". However, needing 
>>>>> > to use such a convoluted piece of code for such a simple task seems to 
>>>>> > go against the Numpy spirit of having one operation being one function 
>>>>> > of the form func(arr)".
>>>>> > 
>>>>> > A proposed function for this, let's call it "np.first_true(arr, 
>>>>> > start_idx, [stop_idx])" would be best implemented at the C code level, 
>>>>> > possibly in the same code file that defines np.where. I'm wondering if 
>>>>> > I, or someone else, were to write such a function, if the Numpy 
>>>>> > developers would consider merging it as a standard part of the 
>>>>> > codebase. It's possible that the idea of such a function is bad because 
>>>>> > it would violate some existing broadcasting or fancy indexing rules. 
>>>>> > Clearly one could make it possible to pass an "axis" argument to 
>>>>> > np.first_true() that would select an axis to search over in the case of 
>>>>> > multi-dimensional arrays, and then the result would be an array of 
>>>>> > indices of one fewer dimension than the original array. So 
>>>>> > np.first_true(np.array([1,5],[2,7],[9,10],cond) would return [1,1,0] 
>>>>> > for cond(x): x>4. The case where no elements satisfy the condition 
>>>>> > would need to return a "signal value" like -1. But maybe there are some 
>>>>> > weird cases where there isn't a sensible return val
>>>>>  ue, hence why such a function has not been added.
>>>>> > 
>>>>> > -Andrew Rosko
>>>>> > _______________________________________________
>>>>> > NumPy-Discussion mailing list -- [email protected] 
>>>>> > <mailto:[email protected]>
>>>>> > To unsubscribe send an email to [email protected] 
>>>>> > <mailto:[email protected]>
>>>>> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>>>> > <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>>>> > Member address: [email protected] <mailto:[email protected]>
>>>>> 
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list -- [email protected] 
>>>>> <mailto:[email protected]>
>>>>> To unsubscribe send an email to [email protected] 
>>>>> <mailto:[email protected]>
>>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>>>> Member address: [email protected] <mailto:[email protected]>
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list -- [email protected] 
>>>>> <mailto:[email protected]>
>>>>> To unsubscribe send an email to [email protected] 
>>>>> <mailto:[email protected]>
>>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>>>> Member address: [email protected] <mailto:[email protected]>
>>>> 
>>>> 
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list -- [email protected] 
>>>> <mailto:[email protected]>
>>>> To unsubscribe send an email to [email protected] 
>>>> <mailto:[email protected]>
>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>>> Member address: [email protected] <mailto:[email protected]>
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list -- [email protected] 
>>>> <mailto:[email protected]>
>>>> To unsubscribe send an email to [email protected] 
>>>> <mailto:[email protected]>
>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>>> Member address: [email protected] <mailto:[email protected]>
>>>> 
>>> 
>>> _______________________________________________
>>> NumPy-Discussion mailing list -- [email protected] 
>>> <mailto:[email protected]>
>>> To unsubscribe send an email to [email protected] 
>>> <mailto:[email protected]>
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>>> Member address: [email protected] <mailto:[email protected]>
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list -- [email protected] 
>> <mailto:[email protected]>
>> To unsubscribe send an email to [email protected] 
>> <mailto:[email protected]>
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
>> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
>> Member address: [email protected] <mailto:[email protected]>
>> 
> 
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected] 
> <mailto:[email protected]>
> To unsubscribe send an email to [email protected] 
> <mailto:[email protected]>
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ 
> <https://mail.python.org/mailman3/lists/numpy-discussion.python.org/>
> Member address: [email protected] <mailto:[email protected]>

_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: [email protected]

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Reply via email to