On Tue, May 16, 2017 at 1:25 AM, Stephan Hoyer <sho...@gmail.com> wrote:
> I like the idea of a strategy keyword argument. strategy='auto' leaves the > door open for future improvements, e.g., if we ever add hash tables to > numpy. > > For the algorithm, I think we actually want to sort the needles array as > well in most (all?) cases. > > If haystack is also sorted, advancing thorough both arrays at once brings > down the cost of the actual search itself down to O(n+k). (Possibly this is > worth exposing as np.searchbothsorted or something similar?) > > If we don't sort haystack, we can still use binary search on the needles > to bring the search cost down to O(n log k). > I have several old branches implementing most of this: - https://github.com/jaimefrio/numpy/tree/sorted-searchsorted This one adds a sorted_keys= kwarg to searchsorted that assumes the keys are also sorted and does the linear scan of both. - https://github.com/jaimefrio/numpy/tree/mergesorted Adds an internal mergesorted function that merges two sorted arrays into a new one, with several handlings of repeated items, originally intended to speed-up intersect1d and friends in arraysetops. - https://github.com/jaimefrio/numpy/tree/hash-unique Implements a hash table for numpy arrays, largely based on the implementation of Python dicts, and uses it for unique. Some of the speed-ups are considerable, but I guess I found the extra complexity didn't justify sending any of those as a PR. But if we really want to make that search thing happen, then maybe it's time to bring them back from the dead. Jaime > On Mon, May 15, 2017 at 5:00 PM Nathaniel Smith <n...@pobox.com> wrote: > >> On May 9, 2017 9:47 AM, "Martin Spacek" <nu...@mspacek.mm.st> wrote: >> >> Hello, >> >> I've opened up a pull request to add a function called np.search(), or >> something like it, to complement np.searchsorted(): >> >> https://github.com/numpy/numpy/pull/9055 >> >> There's also this issue I opened before starting the PR: >> >> https://github.com/numpy/numpy/issues/9052 >> >> Proposed API changes require discussion on the list, so here I am! >> >> This proposed function (and perhaps array method?) does the same as >> np.searchsorted(a, v), but doesn't require `a` to be sorted, and explicitly >> checks if all the values in `v` are a subset of those in `a`. If not, it >> currently raises an error, but that could be controlled via a kwarg. >> >> As I mentioned in the PR, I often find myself abusing np.searchsorted() >> by not explicitly checking these assumptions. The temptation to use it is >> great, because it's such a fast and convenient function, and most of the >> time that I use it, the assumptions are indeed valid. Explicitly checking >> those assumptions each and every time before I use np.searchsorted() is >> tedious, and easy to forget to do. I wouldn't be surprised if many others >> abuse np.searchsorted() in the same way. >> >> >> It's worth noting though that the "sorted" part is a critical part of >> what makes it fast. If we're looking for k needles in an n-item haystack, >> then: >> >> If the haystack is already sorted and we know it, using searchsorted does >> it in k*log2(n) comparisons. (Could be reduced to average case O(k log log >> n) for simple scalars by using interpolation search, but I don't think >> searchsorted is that clever atm.) >> >> If the haystack is not sorted, then sorting it and then using >> searchsorted requires a total of O(n log n) + k*log2(n) comparisons. >> >> And if the haystack is not sorted, then doing linear search to find the >> first item like list.index does requires on average 0.5*k*n comparisons. >> >> This analysis ignores memory effects, which are important -- linear >> memory access is faster than random access, and searchsorted is all about >> making memory access maximally unpredictable. But even so, I think >> sorting-then-searching will be reasonably competitive pretty much from the >> start, and for moderately large k and n values the difference between (n + >> k)*log(n) and n*k is huge. >> >> Another issue is that sorting requires an O(n)-sized temporary buffer >> (assuming you can't mutate the haystack in place). But if your haystack is >> a large enough fraction of memory that you can't afford is buffer, then >> it's likely large enough that you can't afford linear searching either... >> >> >> Looking at my own habits and uses, it seems to me that finding the >> indices of matching values of one array in another is a more common use >> case than finding insertion indices of one array into another sorted array. >> So, I propose that np.search(), or something like it, could be even more >> useful than np.searchsorted(). >> >> >> My main concern here would be creating a trap for the unwary, where >> people use search() naively because it's so nice and convenient, and then >> eventually get surprised by a nasty quadratic slowdown. There's a whole >> blog about these traps :-) https://accidentallyquadratic.tumblr.com/ >> >> Otoh there are also huge number of numpy use cases where it doesn't >> matter if some calculation is 1000x slower than it should be, as long as it >> works and is discoverable... >> >> So it sounds like one obvious thing would be to have a version of >> searchsorted that checks for matches (maybe side="exact"? Though that's not >> easy to find...). That's clearly useful, and orthogonal to the >> linear/binary search issue, so we shouldn't make it a reason people are >> tempted to choose the inferior algorithm. >> >> ...ok, how's this for a suggestion. Give np.search a strategy= kwarg, >> with options "linear", "searchsorted", and "auto". Linear does the obvious >> thing, searchsorted generates a sorter array using argsort (unless the user >> provided one) and then calls searchsorted, and auto picks one of them >> depending on whether a sorter array was provided and how large the arrays >> are. The default is auto. In all cases it looks for exact matches. >> >> I guess by default "not found" should be signaled with an exception, and >> then there should be some option to have it return a sentinel value >> instead? The problem is that since we're returning integers then there's no >> sentinel value that's necessarily an invalid index (e.g. indexing will >> happily accept -1). >> >> -n >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org > https://mail.python.org/mailman/listinfo/numpy-discussion > > -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion