> On Mar 15, 2020, at 06:25, 1njecti...@gmail.com wrote:
> 
> In other languages, there are built-in data structures with binary search 
> tree semantics, as an example, std::map in C++.

There have been multiple past discussions on this, which you should search for 
and read, but I can summarize what I remember.

Personally, I think it would probably be worth adding the SortedDict and maybe 
SortedSet types from Grant Jenks’ Sorted Containers library as 
collections.sorteddict, and matching ABCs to collections.abc, and linking to 
any popular alternatives in the docs. If not, then the collections docs should 
at least describe what sorted containers are for, and link to Sorted Containers 
and any other popular third-party alternatives, and the ABCs should be added 
anyway. Arguments below.

> By binary search tree semantics we mean the following:
> 
> * We can insert elements into the structure in time faster than O(n), i.e. 
> O(logn) for std::map
> * We can iterate over all of the elements of the structure in sorted order in 
> linear time, i.e. O(n) for std::map

Do you really want to require “binary”? Binary trees are really inefficient in 
space, cache locality, and allocation costs. And memmove is really fast. So in 
most uses at small sizes it’s better to array-bisect for search, even if it 
means you have to shift for insert. And for large sizes, the same thing is true 
at every level, not just the leaves. So you really want something with 
branching somewhere between 16 and 8192, not 2. Again, this is just “most 
uses”, but it really is “most”.

If you make it wide enough, you also may want to special-case the two lowest 
levels, because then you’re dealing with chunks on the order of a cache line 
and on the order of a VM page, and those are special enough to deserve special 
treatment. At which point it may even be worth simplifying things to, say, 
three fixed levels (which means huge collections do approach linear time—but if 
“huge” means “much larger than a billion elements”, and most people don’t have 
enough RAM for much more than a billion elements of a few dozen bytes each…).

And the same applies to building a skiplist or any other logarithmic search 
structure.

If you look at the documentation for blist and SortedContainers, they both put 
actual numbers to these points.

Another useful thing about search semantics is that you can find all the 
elements between a min and max in logarithmic time. That is, if you want M 
contiguous elements out of N, you can do it in O(logN + M) time, instead of 
doing M searches for O(logN * M) time. But what should the interface for that 
be? 

And there are a few other API questions—for example, your own example would 
clearly benefit (both in readability and in performance) from having a way to 
jump to the Nth element without having to iterate the first N, but what should 
that look like?

And that raises the larger question of whether there should be a SearchMap or 
SortedMap or similar ABC so that different alternatives can all provide the 
same thing.

Anyway, here’s the history as I remember it:

Back in the days when recipes on ActiveState were more important than libraries 
on PyPI, the docs linked to Raymond Hettinger’s simple sortedlist (which is 
just a list that sorts at construction, resorts on each insert, and bisects for 
search—great if you never need to modify it after building it, but otherwise 
not so much), and that linked to a bunch of other recipes.

People built new tree libraries for years, and most of them got abandoned, even 
if there was some uptake. A lot of major projects versioned in and forked one 
library or another instead of relying on anything being kept up to date. 
Proposals to add something to the stdlib always failed quickly based on the 
argument that if nobody can even maintain a popular third-party lib to work 
through three versions of Python…

One library, blist (a wide hybrid b-tree-like design) got somewhat popular, and 
its author proposed it to be added to the stdlib, offering to maintain it. But 
he wanted to also replace list and str with ropes built on the same foundation, 
which nobody else wanted. Also, nobody stepped up to built a pure-Python 
implementation of the same design so that Jython could provide the same types 
with similar performance characteristics. So, it failed.

A few years later, in one of the periodic discussions of adding sorted 
containers to the stdlib, there was some consensus that maybe Python could just 
add an ABC so that various third-party libraries could converge on a single API 
(largely based on blist’s). There’d then be more chance of one of them becoming 
a category killer: even if it’s only ideal for 90% of use cases instead of 100% 
(it’s not impossible to come up with a case where a red-black tree wins if you 
really want to, but I think the real sticking point is people who want to wrap 
up a Java container in Jython, etc.), if the other designs are all available 
and work as drop-in replacement libraries, that’s fine.

When that fizzled out too, Grant Jenks went off and built a new library, Sorted 
Containers, that implemented the consensus API (largely based on blist’s). 
After a few years of widespread use in the field, Grant revised the API to be 
“more Python-3-ish”, and l believe it’s been pretty stable since. become the 
most popular one. Some alternatives have changed their APIs to match, but even 
more have just deprecated their libraries in favor of using Sorted Containers. 
And I don’t think there are any new libraries that don’t either have an 
incomplete API or copy the Sorted Containers one.

So, I think Grant’s Sorted Containers is a good-enough category killer to be 
worth adding today, even though it wasn’t in 2014. If not, at least the API is 
stable enough to be worth adding as an ABC.

I believe he’s kept the comparisons section of his docs reasonably up to date 
as well, so if people are concerned that it’s only the right choice 90% of the 
time, the docs to link to third-party options for the other 10% should be easy 
to write up.

But of course Grant Jenks is the person who really knows whether Sorted 
Containers, or at least its API, is “done” enough to move into the stdlib to 
die. If not, then I think the collections docs should at least link to Sorted 
Containers and any other relevant libraries.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7SAXQFV4IE5LQ7NMQ36UPS764PW5OOFZ/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to