> On Mar 15, 2020, at 06:25, 1njecti...@gmail.com wrote: > > In other languages, there are built-in data structures with binary search > tree semantics, as an example, std::map in C++.
There have been multiple past discussions on this, which you should search for and read, but I can summarize what I remember. Personally, I think it would probably be worth adding the SortedDict and maybe SortedSet types from Grant Jenks’ Sorted Containers library as collections.sorteddict, and matching ABCs to collections.abc, and linking to any popular alternatives in the docs. If not, then the collections docs should at least describe what sorted containers are for, and link to Sorted Containers and any other popular third-party alternatives, and the ABCs should be added anyway. Arguments below. > By binary search tree semantics we mean the following: > > * We can insert elements into the structure in time faster than O(n), i.e. > O(logn) for std::map > * We can iterate over all of the elements of the structure in sorted order in > linear time, i.e. O(n) for std::map Do you really want to require “binary”? Binary trees are really inefficient in space, cache locality, and allocation costs. And memmove is really fast. So in most uses at small sizes it’s better to array-bisect for search, even if it means you have to shift for insert. And for large sizes, the same thing is true at every level, not just the leaves. So you really want something with branching somewhere between 16 and 8192, not 2. Again, this is just “most uses”, but it really is “most”. If you make it wide enough, you also may want to special-case the two lowest levels, because then you’re dealing with chunks on the order of a cache line and on the order of a VM page, and those are special enough to deserve special treatment. At which point it may even be worth simplifying things to, say, three fixed levels (which means huge collections do approach linear time—but if “huge” means “much larger than a billion elements”, and most people don’t have enough RAM for much more than a billion elements of a few dozen bytes each…). And the same applies to building a skiplist or any other logarithmic search structure. If you look at the documentation for blist and SortedContainers, they both put actual numbers to these points. Another useful thing about search semantics is that you can find all the elements between a min and max in logarithmic time. That is, if you want M contiguous elements out of N, you can do it in O(logN + M) time, instead of doing M searches for O(logN * M) time. But what should the interface for that be? And there are a few other API questions—for example, your own example would clearly benefit (both in readability and in performance) from having a way to jump to the Nth element without having to iterate the first N, but what should that look like? And that raises the larger question of whether there should be a SearchMap or SortedMap or similar ABC so that different alternatives can all provide the same thing. Anyway, here’s the history as I remember it: Back in the days when recipes on ActiveState were more important than libraries on PyPI, the docs linked to Raymond Hettinger’s simple sortedlist (which is just a list that sorts at construction, resorts on each insert, and bisects for search—great if you never need to modify it after building it, but otherwise not so much), and that linked to a bunch of other recipes. People built new tree libraries for years, and most of them got abandoned, even if there was some uptake. A lot of major projects versioned in and forked one library or another instead of relying on anything being kept up to date. Proposals to add something to the stdlib always failed quickly based on the argument that if nobody can even maintain a popular third-party lib to work through three versions of Python… One library, blist (a wide hybrid b-tree-like design) got somewhat popular, and its author proposed it to be added to the stdlib, offering to maintain it. But he wanted to also replace list and str with ropes built on the same foundation, which nobody else wanted. Also, nobody stepped up to built a pure-Python implementation of the same design so that Jython could provide the same types with similar performance characteristics. So, it failed. A few years later, in one of the periodic discussions of adding sorted containers to the stdlib, there was some consensus that maybe Python could just add an ABC so that various third-party libraries could converge on a single API (largely based on blist’s). There’d then be more chance of one of them becoming a category killer: even if it’s only ideal for 90% of use cases instead of 100% (it’s not impossible to come up with a case where a red-black tree wins if you really want to, but I think the real sticking point is people who want to wrap up a Java container in Jython, etc.), if the other designs are all available and work as drop-in replacement libraries, that’s fine. When that fizzled out too, Grant Jenks went off and built a new library, Sorted Containers, that implemented the consensus API (largely based on blist’s). After a few years of widespread use in the field, Grant revised the API to be “more Python-3-ish”, and l believe it’s been pretty stable since. become the most popular one. Some alternatives have changed their APIs to match, but even more have just deprecated their libraries in favor of using Sorted Containers. And I don’t think there are any new libraries that don’t either have an incomplete API or copy the Sorted Containers one. So, I think Grant’s Sorted Containers is a good-enough category killer to be worth adding today, even though it wasn’t in 2014. If not, at least the API is stable enough to be worth adding as an ABC. I believe he’s kept the comparisons section of his docs reasonably up to date as well, so if people are concerned that it’s only the right choice 90% of the time, the docs to link to third-party options for the other 10% should be easy to write up. But of course Grant Jenks is the person who really knows whether Sorted Containers, or at least its API, is “done” enough to move into the stdlib to die. If not, then I think the collections docs should at least link to Sorted Containers and any other relevant libraries. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/7SAXQFV4IE5LQ7NMQ36UPS764PW5OOFZ/ Code of Conduct: http://python.org/psf/codeofconduct/