[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Juan Nunez-Iglesias Wed, 01 Nov 2023 05:11:09 -0700

Have you tried timing things? Thankfully this is easy to test because the 
Python source of numba-jitted functions is available at jitted_func.py_func.


In [23]: @numba.njit
    ...: def _first(arr, pred):
    ...:     for i, elem in enumerate(arr):
    ...:         if pred(elem):
    ...:             return i
    ...:
    ...: def first(arr, pred):
    ...:     _pred = numba.njit(pred)
    ...:     return _first(arr, _pred)
    ...:

In [24]: arr = np.random.random(100_000_000)

In [25]: %timeit first(arr, lambda x: x > 5)
72 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [26]: %timeit arr + 5
90.3 ms ± 762 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: %timeit _first.py_func(arr, lambda x: x > 5)
7.8 s ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So numba gives a >100x speedup. It's still not as fast as a NumPy function call 
that doesn't have an allocation overhead:

In [30]: arr2 = np.empty_like(arr, dtype=bool)

In [32]: %timeit np.greater(arr, 5, out=arr2)
13.9 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But it's certainly much better than pure Python! And it's not a huge cost for 
the flexibility.

Juan.

On Wed, 1 Nov 2023, at 10:42 AM, Dom Grigonis wrote:
> This results in a very slow code. The function calls of 
> 
> 
> _pred = numba.njit(pred)
> 
> are expensive and this sort of approach will be comparable to pure python 
> functions.
> 
> This is only recommended for sourcing functions that are not called 
> frequently, but rather have a large computational content within them. In 
> other words not suitable for predicates.
> 
> Regards,
> DG
> 
>> On 1 Nov 2023, at 01:05, Juan Nunez-Iglesias <[email protected]> wrote:
>> 
>> If you add a layer of indirection with Numba you can get a *very* nice API:
>> 
>> @numba.njit
>> def _first(arr, pred):
>>     for i, elem in enumerate(arr):
>>         if pred(elem):
>>             return i
>> 
>> def first(arr, pred):
>>     _pred = numba.njit(pred)
>>     return _first(arr, _pred)
>> 
>> This even works with lambdas! (TIL, thanks Numba devs!)
>> 
>> >>> first(np.random.random(10_000_000), lambda x: x > 0.99)
>> 215
>> 
>> Since Numba has ufunc support I don't suppose it would be hard to make it 
>> work with an axis= argument, but I've never played with that API myself.
>> 
>> On Tue, 31 Oct 2023, at 6:49 PM, Lev Maximov wrote:
>>> I've implemented such functions in Cython and packaged them into a library 
>>> called numpy_illustrated <https://pypi.org/project/numpy-illustrated/>
>>> 
>>> It exposes the following functions:
>>> 
>>> find(a, v)  # returns the index of the first occurrence of v in a
>>> first_above(a, v)   # returns the index of the first element in a that is 
>>> strictly above v
>>> first_nonzero(a)   # returns the index of the first nonzero element
>>> 
>>> They scan the array and bail out immediately once the match is found. Have 
>>> a significant performance gain if the element to be
>>> found is closer to the beginning of the array. Have roughly the same speed 
>>> as alternative methods if the value is missing.
>>> 
>>> The complete signatures of the functions look like this:
>>> 
>>> find(a, v, rtol=1e-05, atol=1e-08, sorted=False, default=-1, raises=False)
>>> first_above(a, v, sorted=False, missing=-1, raises=False)
>>> first_nonzero(a, missing=-1, raises=False)
>>> 
>>> This covers the most common use cases and does not accept Python callbacks 
>>> because accepting them would nullify any speed gain
>>> one would expect from such a function. A Python callback can be implemented 
>>> with Numba, but anyone who can write the callback
>>> in Numba has no need for a library that wraps it into a dedicated function.
>>> 
>>> The library has a 100% test coverage. Code style 'black'. It should be easy 
>>> to add functions like 'first_below' if necessary.
>>> 
>>> A more detailed description of these functions can be found here 
>>> <https://betterprogramming.pub/the-numpy-illustrated-library-7531a7c43ffb?sk=8dd60bfafd6d49231ac76cb148a4d16f>.
>>> 
>>> Best regards,
>>>   Lev Maximov
>>> 
>>> On Tue, Oct 31, 2023 at 3:50 AM Dom Grigonis <[email protected]> wrote:
>>>> I juggled a bit and found pretty nice solution using numba. Which is 
>>>> probably not very robust, but proves that such thing can be optimised 
>>>> while retaining flexibility. Check if it works for your use cases and let 
>>>> me know if anything fails or if it is slow compared to what you used.
>>>> 
>>>> 
>>>> 
>>>> first_true_str = """
>>>> def first_true(arr, n):
>>>>     result = np.full((n, arr.shape[1]), -1, dtype=np.int32)
>>>>     for j in range(arr.shape[1]):
>>>>         k = 0
>>>>         for i in range(arr.shape[0]):
>>>>             x = arr[i:i + 1, j]
>>>>             if cond(x):
>>>>                 result[k, j] = i
>>>>                 k += 1
>>>>                 if k >= n:
>>>>                     break
>>>>     return result
>>>> """
>>>> 
>>>> 
>>>> *class* *FirstTrue*:
>>>>     CONTEXT = {'np': np}
>>>> 
>>>>     *def* __init__(self, expr):
>>>>         self.expr = expr
>>>>         self.expr_ast = ast.parse(expr, mode='exec').body[0].value
>>>>         self.func_ast = ast.parse(first_true_str, mode='exec')
>>>>         self.func_ast.body[0].body[1].body[1].body[1].test = self.expr_ast
>>>>         self.func_cmp = compile(self.func_ast, filename="<ast>", 
>>>> mode="exec")
>>>>         *exec*(self.func_cmp, self.CONTEXT)
>>>>         self.func_nb = nb.njit(self.CONTEXT[self.func_ast.body[0].name])
>>>> 
>>>>     *def* __call__(self, arr, n=1, axis=None):
>>>>         *# PREPARE INPUTS*
>>>>         in_1d = False
>>>>         *if* axis *is* None:
>>>>             arr = np.ravel(arr)[:, None]
>>>>             in_1d = True
>>>>         *elif* axis == 0:
>>>>             *if* arr.ndim == 1:
>>>>                 in_1d = True
>>>>                 arr = arr[:, None]
>>>>         *else*:
>>>>             *raise* *ValueError*('axis ~in (None, 0)')
>>>>         res = self.func_nb(arr, n)
>>>>         *if* in_1d:
>>>>             res = res[:, 0]
>>>>         *return* res
>>>> 
>>>> 
>>>> *if* __name__ == '__main__':
>>>>     arr = np.arange(125).reshape((5, 5, 5))
>>>>     ft = FirstTrue('np.sum(x) > 30')
>>>>     *print*(ft(arr, n=2, axis=0))
>>>> 
>>>> [[1 0 0 0 0]
>>>>  [2 1 1 1 1]]
>>>> 
>>>> 
>>>> In [16]: %timeit ft(arr, 2, axis=0)
>>>> 1.31 µs ± 3.94 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops 
>>>> each)
>>>> 
>>>> Regards,
>>>> DG
>>>> 
>>>>> On 29 Oct 2023, at 23:18, rosko37 <[email protected]> wrote:
>>>>> 
>>>>> An example with a 1-D array (where it is easiest to see what I mean) is 
>>>>> the following. I will follow Dom Grigonis's suggestion that the range not 
>>>>> be provided as a separate argument, as it can be just as easily "folded 
>>>>> into" the array by passing a slice. So it becomes just:
>>>>> idx = first_true(arr, cond)
>>>>> 
>>>>> As Dom also points out, the "cond" would likely need to be a "function 
>>>>> pointer" (i.e., the name of a function defined elsewhere, turning 
>>>>> first_true into a higher-order function), unless there's some way to pass 
>>>>> a parseable expression for simple cases. A few special cases like the 
>>>>> first zero/nonzero element could be handled with dedicated options (sort 
>>>>> of like matplotlib colors), but for anything beyond that it gets unwieldy 
>>>>> fast.
>>>>> 
>>>>> So let's say we have this:
>>>>> ******************
>>>>> def cond(x):
>>>>>     return x>50
>>>>> 
>>>>> search_arr = np.exp(np.arange(0,1000))
>>>>> 
>>>>> print(np.first_true(search_arr, cond))
>>>>> *******************
>>>>> 
>>>>> This should print 4, because the element of search_arr at index 4 (i.e. 
>>>>> the 5th element) is e^4, which is slightly greater than 50 (while e^3 is 
>>>>> less than 50). It should return this *without testing the 6th through 
>>>>> 1000th elements of the array at all to see whether they exceed 50 or 
>>>>> not*. This example is rather contrived, because simply taking the natural 
>>>>> log of 50 and rounding up is far superior, not even *evaluating the array 
>>>>> of exponentials *(which my example clearly still does--and in the use 
>>>>> cases I've had for such a function, I can't predict the array elements 
>>>>> like this--they come from loaded data, the output of a simulation, etc., 
>>>>> and are all already in a numpy array). And in this case, since the values 
>>>>> are strictly increasing, search_sorted() would work as well. But it 
>>>>> illustrates the idea.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Oct 26, 2023 at 5:54 AM Dom Grigonis <[email protected]> 
>>>>> wrote:
>>>>>> Could you please give a concise example? I know you have provided one, 
>>>>>> but it is engrained deep in verbose text and has some typos in it, which 
>>>>>> makes hard to understand exactly what inputs should result in what 
>>>>>> output.
>>>>>> 
>>>>>> Regards,
>>>>>> DG
>>>>>> 
>>>>>> > On 25 Oct 2023, at 22:59, rosko37 <[email protected]> wrote:
>>>>>> > 
>>>>>> > I know this question has been asked before, both on this list as well 
>>>>>> > as several threads on Stack Overflow, etc. It's a common issue. I'm 
>>>>>> > NOT asking for how to do this using existing Numpy functions (as that 
>>>>>> > information can be found in any of those sources)--what I'm asking is 
>>>>>> > whether Numpy would accept inclusion of a function that does this, or 
>>>>>> > whether (possibly more likely) such a proposal has already been 
>>>>>> > considered and rejected for some reason.
>>>>>> > 
>>>>>> > The task is this--there's a large array and you want to find the next 
>>>>>> > element after some index that satisfies some condition. Such elements 
>>>>>> > are common, and the typical number of elements to be searched through 
>>>>>> > is small relative to the size of the array. Therefore, it would 
>>>>>> > greatly improve performance to avoid testing ALL elements against the 
>>>>>> > conditional once one is found that returns True. However, all built-in 
>>>>>> > functions that I know of test the entire array. 
>>>>>> > 
>>>>>> > One can obviously jury-rig some ways, like for instance create a "for" 
>>>>>> > loop over non-overlapping slices of length slice_length and call 
>>>>>> > something like np.where(cond) on each--that outer "for" loop is much 
>>>>>> > faster than a loop over individual elements, and the inner loop at 
>>>>>> > most will go slice_length-1 elements past the first "hit". However, 
>>>>>> > needing to use such a convoluted piece of code for such a simple task 
>>>>>> > seems to go against the Numpy spirit of having one operation being one 
>>>>>> > function of the form func(arr)".
>>>>>> > 
>>>>>> > A proposed function for this, let's call it "np.first_true(arr, 
>>>>>> > start_idx, [stop_idx])" would be best implemented at the C code level, 
>>>>>> > possibly in the same code file that defines np.where. I'm wondering if 
>>>>>> > I, or someone else, were to write such a function, if the Numpy 
>>>>>> > developers would consider merging it as a standard part of the 
>>>>>> > codebase. It's possible that the idea of such a function is bad 
>>>>>> > because it would violate some existing broadcasting or fancy indexing 
>>>>>> > rules. Clearly one could make it possible to pass an "axis" argument 
>>>>>> > to np.first_true() that would select an axis to search over in the 
>>>>>> > case of multi-dimensional arrays, and then the result would be an 
>>>>>> > array of indices of one fewer dimension than the original array. So 
>>>>>> > np.first_true(np.array([1,5],[2,7],[9,10],cond) would return [1,1,0] 
>>>>>> > for cond(x): x>4. The case where no elements satisfy the condition 
>>>>>> > would need to return a "signal value" like -1. But maybe there are 
>>>>>> > some weird cases where there isn't a sensible return val
>>>>>>  ue, hence why such a function has not been added.
>>>>>> > 
>>>>>> > -Andrew Rosko
>>>>>> > _______________________________________________
>>>>>> > NumPy-Discussion mailing list -- [email protected]
>>>>>> > To unsubscribe send an email to [email protected]
>>>>>> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>>>>> > Member address: [email protected]
>>>>>> 
>>>>>> _______________________________________________
>>>>>> NumPy-Discussion mailing list -- [email protected]
>>>>>> To unsubscribe send an email to [email protected]
>>>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>>>>> Member address: [email protected]
>>>>> _______________________________________________
>>>>> NumPy-Discussion mailing list -- [email protected]
>>>>> To unsubscribe send an email to [email protected]
>>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>>>> Member address: [email protected]
>>>> 
>>>> _______________________________________________
>>>> NumPy-Discussion mailing list -- [email protected]
>>>> To unsubscribe send an email to [email protected]
>>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>>> Member address: [email protected]
>>> _______________________________________________
>>> NumPy-Discussion mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>>> Member address: [email protected]
>>> 
>> 
>> _______________________________________________
>> NumPy-Discussion mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: [email protected]
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: [email protected]
>

_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: [email protected]

[Numpy-discussion] Re: Function that searches arrays for the first element that satisfies a condition

Reply via email to