On Wed, 2022-06-29 at 12:56 +0000, Miles Cranmer wrote:
> So, this new method is in fact a hash table as discussed in that blog
> post. However, because it assumes integer arrays, we can go even
> further than that blog, and simply use `np.arange(ar_min, ar_max +
> 1)` as the "hash table". Thus, you don't actually need to use a
> hashing function at all, you can just use the integer itself as the
> hash of an integer, and use fast array operations. This is why it
> happens to return a sorted output, even though it's not technically
> necessary. The scaling is the same, but the overall constant in front
> of O(n) is much lower.
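
[A rough sketch of that table idea, purely as illustration and not the
actual PR code; here a boolean table is used so the integers index
themselves directly, e.g. for a membership test:

    import numpy as np

    def isin_table(ar1, ar2):
        # Lookup table spanning the value range of ar2; each integer
        # indexes itself, so no hash function is needed.
        ar2_min, ar2_max = ar2.min(), ar2.max()
        table = np.zeros(ar2_max - ar2_min + 1, dtype=bool)
        table[ar2 - ar2_min] = True
        out = np.zeros(ar1.shape, dtype=bool)
        in_range = (ar1 >= ar2_min) & (ar1 <= ar2_max)
        out[in_range] = table[ar1[in_range] - ar2_min]
        return out

Memory use is proportional to the value range of ar2, which is exactly
the trade-off being discussed.]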


Yeah, which is why I was happy with adding `kind=`.  There was some
support (and no one against it), and it seemed unlikely that there is a
"generic" solution that can just be a great default in all cases.
(If there is, deprecating `kind` is probably not terrible either.)
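
For reference, this is roughly how the kwarg reads in use (assuming it
lands with the names used in the PR, i.e. `kind='sort'` and
`kind='table'`):

    import numpy as np

    a = np.array([2, 7, 1, 8, 2, 8])
    b = np.array([1, 2, 3])

    np.isin(a, b)                  # let NumPy pick a method
    np.isin(a, b, kind='sort')     # force the sort-based method
    np.isin(a, b, kind='table')    # force the new table-based method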

Whether `kind=` should also cover methods that do not sort the output
here, I am not sure.  Maybe that should be its own kwarg.
(Kinds that do not return a sorted result can still sort it afterwards
and hope the result is much smaller than the input.)

In general, I am happy with the addition though.  The only larger
argument against it that I can think of would be if we promoted
another library that does the job much better.

Cheers,

Sebastian


> 
> In other words, returning a sorted array for integers gets you the
> best performance, since you can use ``hash(integer) = integer``
> (assuming your range of integers is relatively small). But for
> other datatypes like strings, as discussed in that blog post,
> ``hash(string)`` might have a large range over your strings, and it
> would be inefficient to allocate a table over that entire range.
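
[A sketch of that generic-dtype situation, with a plain Python set
standing in for a real hash table and 1-D arrays for simplicity; the
point is that memory scales with the number of values, not with the
range of their hashes:

    import numpy as np

    def isin_hash(ar1, ar2):
        # Generic fallback: hash each value instead of indexing by it.
        lookup = set(ar2.tolist())
        return np.fromiter((x in lookup for x in ar1.tolist()),
                           dtype=bool, count=ar1.size)

]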
> 
> The closest analogy to this comparison would be counting sort vs.
> radix sort. A radix sort (https://en.wikipedia.org/wiki/Radix_sort)
> works on the keys piece by piece (much like the blog's hash-table
> method), whereas a counting sort
> (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array
> whose length equals the range of the integer data. A radix sort
> scales as O(n*k), where k is the key size, and a counting sort
> scales as O(n+k), where k is the range of your data. However, for
> fixed k a counting sort can have a better constant factor, since it
> avoids building a hash table and gets to use super efficient array
> operations. This PR is analogous to a counting sort, but a radix
> sort-like method is described in that blog (and would be useful for
> generic types).
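
[For concreteness, a bare-bones counting sort showing the O(n + k)
shape, where k is the value range; it uses the same arange-as-table
trick:

    import numpy as np

    def counting_sort(ar):
        # One pass to count occurrences over the value range k,
        # one pass (np.repeat) to emit the sorted output: O(n + k).
        lo, hi = ar.min(), ar.max()
        counts = np.bincount(ar - lo)          # length k = hi - lo + 1
        return np.repeat(np.arange(lo, hi + 1), counts)

]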
> 
> Does this help explain things?
> Cheers,
> Miles
