[Numpy-discussion] Re: Speeding up isin1d and adding a "method" or similar

Sebastian Berg Fri, 17 Jun 2022 09:19:00 -0700

On Fri, 2022-06-17 at 08:40 -0700, Stephan Hoyer wrote:
> I think this is a great idea! I don't see any downsides here.
> 
> As for the method name, I would lean towards calling it "kind" and
> using a
> default value of None for automatic selection, for consistency with
> np.sort.


:+1:, I agree with both points.  I am also not sure about the choice of
"dictionary", since we don't use a dictionary (or even hashtable)?
(Although I may have missed discussion on the name.)

Cheers,

Sebastian


> 
> On Thu, Jun 16, 2022 at 6:14 AM Sebastian Berg
> <[email protected]>
> wrote:
> 
> > Hi all,
> > 
> > there is a PR to add a faster path to `np.isin`, that uses a look-
> > up-
> > table for all the elements that are included in the haystack
> > (`test_elements`):
> > 
> >     https://github.com/numpy/numpy/pull/12065/files
> > 
> > Such a table means that the memory overhead can be very
> > significant,
> > but the speedup as well, so there was the idea of adding an option
> > to
> > pick which version is used.
> > 
> > The current documentation for this new `method` keyword argument
> > would
> > be.  So the main questions are:
> > 
> > * Is there any concern about adding such a new kwarg?
> > * Is `method` the best name?  Sorts uses `kind` which may also be
> > good
> > 
> > There is also the smaller question of what heuristic 'auto' would
> > use,
> > but that can be tweaked at any time.
> > 
> > ```
> >    method : {'auto', 'sort', 'dictionary'}, optional
> >          The algorithm to use. This will not affect the final
> > result,
> >          but will affect the speed. Default is 'auto'.
> > 
> >          - If 'sort', will use a mergesort-based approach. This
> > will have
> >            a memory usage of roughly 6 times the sum of the sizes
> > of
> >            `ar1` and `ar2`, not accounting for size of dtypes.
> >          - If 'dictionary', will use a key-dictionary approach
> > similar
> >            to a counting sort. This is only available for boolean
> > and
> >            integer arrays. This will have a memory usage of the
> >            size of `ar1` plus the max-min value of `ar2`. This
> > tends
> >            to be the faster method if the following formula is
> > true:
> >            `log10(len(ar2)) > (log10(max(ar2)-min(ar2)) - 2.27) /
> > 0.927`,
> >            but may use greater memory.
> >          - If 'auto', will automatically choose the method which is
> >            expected to perform the fastest, using the above
> >            formula. For larger sizes or smaller range,
> >            'dictionary' is chosen. For larger range or smaller
> >            sizes, 'sort' is chosen.`
> > ```
> > 
> > Cheers,
> > 
> > Sebastian
> > _______________________________________________
> > NumPy-Discussion mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
> > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> > Member address: [email protected]
> > 
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: [email protected]

signature.asc
Description: This is a digitally signed message part

_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: [email protected]

[Numpy-discussion] Re: Speeding up isin1d and adding a "method" or similar

Reply via email to