On Wed, 2022-06-29 at 12:56 +0000, Miles Cranmer wrote: > So, this new method is in fact a hash table as discussed in that blog > post. However, because it assumes integer arrays, we can go even > further than that blog, and simply use `np.arange(ar_min, ar_max + > 1)` is the "hash table". Thus, you don't actually need to use a > hashing function at all, you can just use the integer itself as the > hash of an integer, and use fast array operations. This is why it > happens to return a sorted output, even though it's not technically > necessary. The scaling is the same, but the overall constant in front > of O(n) is much lower.
Yeah, which is why I was happy with adding `kind=`. There was some support (and no-one against it) and it seemed unlikely that there is a "generic" solution that is can just be a great default in all cases. (If there is, deprecating `kind` is probably not terrible either.) Whether the `kind=` should cover methods that do not sort the output here, I am not sure. Maybe that should be its own kwarg. (Kinds that do not return a sorted result can still sort the result and hope it is much smaller). In general, I am happy with the addition though. The only larger argument against it that I can think of, would be if we promoted another library that does the job much better. Cheers, Sebastian > > In other words, returning a sorted array for integers gets you the > best performance since you can use ``hash(integer)=integer`` (+ > assume that your range of integers is relatively small). But for > other datatypes like strings, as discussed on that blog post, > ``hash(string)`` might have a large range over your strings, and it > would be inefficient to allocate the entire thing. > > The closest analogy to this comparison would be counting sorts vs > radix sorts. A radix sort (https://en.wikipedia.org/wiki/Radix_sort) > uses a hash table, whereas a counting sort > (https://en.wikipedia.org/wiki/Counting_sort) pre-allocates an array > of length equal to the range of integer data. A radix sort has > scaling O(n*k), for k the key size, and a counting sort has scaling > O(n+k), for k the range of your data. However, a counting sort might > have a better constant in front of the scaling, for fixed k, since it > avoids having to create the hash table, and gets to use super > efficient array operations. This PR is analogous to a counting sort, > but a radix sort-like method is described on that blog (and would be > useful for generic types). > > Does this help explain things? > Cheers, > Miles > _______________________________________________ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: sebast...@sipsolutions.net >
signature.asc
Description: This is a digitally signed message part
_______________________________________________ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com