[Numpy-discussion] Re: Speeding up isin1d and adding a "method" or similar

Stephan Hoyer Fri, 17 Jun 2022 08:41:48 -0700

I think this is a great idea! I don't see any downsides here.

As for the method name, I would lean towards calling it "kind" and using a
default value of None for automatic selection, for consistency with np.sort.
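
For reference, a small sketch of the `np.sort` convention I mean. It assumes
only the existing `kind` argument of `np.sort`; a matching `kind=None` on
`np.isin` is the suggestion here, not something this snippet relies on.

```python
import numpy as np

a = np.array([3, 1, 2, 1])

# kind=None (the default) leaves the algorithm choice to NumPy.
default_sorted = np.sort(a, kind=None)

# An explicit kind picks the algorithm without changing the sorted values.
stable_sorted = np.sort(a, kind="stable")
```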


On Thu, Jun 16, 2022 at 6:14 AM Sebastian Berg <[email protected]>
wrote:

> Hi all,
>
> there is a PR to add a faster path to `np.isin` that uses a lookup table
> for all the elements included in the haystack (`test_elements`):
>
>     https://github.com/numpy/numpy/pull/12065/files
>
> Such a table can add very significant memory overhead, but the speedup
> can be just as significant, so the idea came up of adding an option to
> pick which version is used.
>
> The current documentation for this new `method` keyword argument is
> included at the end of this message.  The main questions are:
>
> * Is there any concern about adding such a new kwarg?
> * Is `method` the best name?  Sorting uses `kind`, which may also be good.
>
> There is also the smaller question of what heuristic 'auto' would use,
> but that can be tweaked at any time.
>
> ```
>    method : {'auto', 'sort', 'dictionary'}, optional
>          The algorithm to use. This will not affect the final result,
>          but will affect the speed. Default is 'auto'.
>
>          - If 'sort', will use a mergesort-based approach. This will have
>            a memory usage of roughly 6 times the sum of the sizes of
>            `ar1` and `ar2`, not accounting for size of dtypes.
>          - If 'dictionary', will use a key-dictionary approach similar
>            to a counting sort. This is only available for boolean and
>            integer arrays. This will have a memory usage of the
>            size of `ar1` plus the max-min value of `ar2`. This tends
>            to be the faster method if the following formula is true:
>            `log10(len(ar2)) > (log10(max(ar2)-min(ar2)) - 2.27) / 0.927`,
>            but may use greater memory.
>          - If 'auto', will automatically choose the method which is
>            expected to perform the fastest, using the above
>            formula. For larger sizes or smaller range,
>            'dictionary' is chosen. For larger range or smaller
>            sizes, 'sort' is chosen.
> ```
>
> Cheers,
>
> Sebastian
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: [email protected]
>
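
To make the trade-off in the quoted docstring concrete, here is a minimal
sketch of the table-based approach and of the quoted selection formula. It is
not the code from the PR: the names `isin_lookup_table` and
`prefers_lookup_table` are made up for illustration, the handling of an empty
or constant `ar2` is my own assumption, and integer overflow for extreme
ranges is ignored.

```python
import math

import numpy as np


def isin_lookup_table(ar1, ar2):
    """Membership test for integer arrays via a boolean lookup table.

    Illustrative only; hypothetical helper, not NumPy API.
    """
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    if ar2.size == 0:
        # Assumption: nothing can be a member of an empty haystack.
        return np.zeros(ar1.shape, dtype=bool)

    ar2_min = int(ar2.min())
    ar2_max = int(ar2.max())

    # One flag per integer in [ar2_min, ar2_max]; this is the memory cost
    # that grows with the range of ar2 rather than with its length.
    table = np.zeros(ar2_max - ar2_min + 1, dtype=bool)
    table[ar2 - ar2_min] = True

    # Only values inside the table's range can possibly be members.
    in_range = (ar1 >= ar2_min) & (ar1 <= ar2_max)
    result = np.zeros(ar1.shape, dtype=bool)
    result[in_range] = table[ar1[in_range] - ar2_min]
    return result


def prefers_lookup_table(ar2_size, ar2_range):
    """Evaluate the formula from the draft docstring.

    ar2_range stands for max(ar2) - min(ar2).
    """
    if ar2_size == 0:
        return False  # assumption: nothing to build a table from
    if ar2_range <= 0:
        return True  # assumption: log10 is undefined, and the table is tiny
    return math.log10(ar2_size) > (math.log10(ar2_range) - 2.27) / 0.927
```

For example, with a range of 10**6 the formula favors the table for a haystack
of 10**5 elements but not for one of 1000 elements.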

