[Python-ideas] Re: Adding a built-in data structure with binary search tree semantics

Christopher Barker Sun, 15 Mar 2020 12:18:27 -0700

>
> > In other languages, there are built-in data structures with binary
> > search tree semantics, as an example, std::map in C++.
>


not being a C++ guy, I had no idea that std::map was a tree -- I always
figured it was a hashtable, but you learn something new every day.

As far as I know, there is no built in data structure that is ready to go,
but some tools are there.

Before I go further, the obvious think to do is see if there's a third
party package that does what you want -- there are a lot of hits for
"binary tree" on PyPi -- no idea if any of them are robust, mature,
maintained, and meet your needs, but do go look.


> > * We can insert elements into the structure in time faster than O(n),
> > i.e. O(logn) for std::map
>
> > * We can iterate over all of the elements of the structure in sorted
> > order in linear time, i.e. O(n) for std::map
>
> > Is there currently any data structure in Python with binary search
> > tree semantics?
>

To be pedantic, I don't think you are looking for binary search semantics,
but rather, binary tree performance :-)

This version is taking advantage of the fact that dicts now preserve order.
So what it does is keep the keys in order, but inserting items in the right
place.

And it used the stdlib bisect module to work with the keys efficiently.

In theory, it should have:

O(logN) insert
O(logN) retrieval on the nth key, value, item
O(1) retrieval of a particular key

However
 - The constant for insertion are huge: it has to rebuild the dict on every
insertion
 - The constant for retrieval by order is kinda big -- it has to make a
list out of the keys.

And this implementation has all sorts of possible optimizations that I
haven't bother with yet.  A big one is that it might make sense to keep the
sorted list of the keys around, as you can't use bisect() on the
dict.keys() iterator.

Which makes me think that we should dump the dict altogether, and simply
use two lists: on of keys, and one of values. keep the keys list sorted,
and you can use bisect to insert and retrieve and there you go.

I've tried this code on your example, and it appears to work, but it is
painfully slow. However, your code mingles creation and retrieval, and
generating random numbers, etc, so it's a bit hard to tease out where
the performance hit is.

I would say that you want to think about what performance you care about:
if you are going to be adding things much less often that retrieving them,
and you want good "get this specific key" performance, then this approach
(optimized) might be OK.

Note that this code has a few tests in it -- if you run it with pytest:

pytest sorted_dict.py

the tests will run. They are not the least bit comprehensive, but enough to
play around with optimizations and no you haven't broken anything too badly.

Hmm -- maybe I'll go try a version simply with lists now...

Also, if you're talking about bank accounts, ISTM that the persistent,
> secure storage mechanisms would swamp the time for finding the 1000th
> oldest account.  Or are all of your account data only stored in memory?
>

Yes, this example would probably be in a database, but I'm going to assume
it's a motivating example, and the general case of wanting a data structure
that's efficient for this kind of data is reasonable.

-CHB

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

import bisect
from random import randint

N1 = 10000 # N1 > N2 > N3
N2 = N1 // 10
N3 = N1 // 100


class SortedDict(dict):
    """
    dict that keeps all its keys in sorted order
    """
    def __setitem__(self, key, value):
        """
        overload the setitem to add the item in sorted order

        We can assume that the existing keys are already in sorted order,
        so we can use bisect.

        NOTE: values may not be sortable, so we need to work with the keys.
        """
        # get the keys and values:
        # making copies so we won't lose them
        keys = list(self.keys())
        values = list(self.values())
        # find the insertion point
        idx = bisect.bisect(keys, key)
        # empty the dict
        self.clear()
        # refill the first half:
        for k, v in zip(keys[:idx], values[:idx]):
            super().__setitem__(k, v)
        # add the new item:
        super().__setitem__(key, value)
        # fill in the rest:
        for k, v in zip(keys[idx:], values[idx:]):
            super().__setitem__(k, v)

    def nth_smallest_key(self, n):
        """
        return the nth smallest key
        """
        return list(self.keys())[n]

    def nth_smallest_value(self, n):
        return self[list(self.keys())[n]]

    def nth_smallest_item(self, n):
        key = list(self.keys())[n]
        return (key, self[key])



dist = lambda: randint(0, 2147483647)


def test_sorted_dict_preserve_items():
    sd = SortedDict()

    sd[1] = "mary"
    sd[3] = "fred"
    sd[2] = "bob"

    # make sure it acts like a regular dict()
    assert sd[1] == "mary"
    assert sd[3] == "fred"
    assert sd[2] == "bob"


def test_sorted_dict_preserve_sort_order():
    sd = SortedDict()

    sd[1] = "mary"
    sd[3] = "fred"
    sd[2] = "bob"

    # make sure the keys are in order
    assert list(sd.keys()) ==  sorted(sd.keys())


def test_nth_smallest_key():
    sd = SortedDict()

    for i in range(20):
        sd[i] = i + 2

    assert sd.nth_smallest_key(2) == 2
    assert sd.nth_smallest_key(6) == 6


def test_nth_smallest_item():
    sd = SortedDict()

    for i in range(20):
        sd[i] = i + 2

    assert sd.nth_smallest_item(2) == (2, 4)
    assert sd.nth_smallest_item(6) == (6, 8)


def test_nth_smallest_value():
    sd = SortedDict()

    for i in range(20):
        sd[i] = i + 2

    assert sd.nth_smallest_value(2) == 4
    assert sd.nth_smallest_value(6) == 8


def nth_smallest_key(n, m):
    return sorted(m.keys())[n]


def main():
    my_map = SortedDict()

    # fills a map with N1 random mappings of type (int, int)
    for i in range(0, N1):
        my_map[dist()] = dist()

    # prints out the N3th smallest key and its value
    target_key = my_map.nth_smallest_key(N3)
    print("({}: {})".format(target_key, my_map[target_key]))

    # writes a new random mapping to the map
    # then prints out the N3th smallest key and its value if that key
    # has changed
    # 100000 times
    for i in range(N3):
        my_map[dist()] = dist()

        test_key = my_map.nth_smallest_key(N3)
        if target_key != test_key:
            target_key = test_key
            print(i, target_key, test_key)
            print("({}: {})".format(target_key, my_map[target_key]))

        # print an indicator every N3 iterations for comparison
        if i % N3 == 0:
            print("iteration: {}".format(i))


if __name__ == "__main__":
    main()

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/Q7G7QNNHOSCXKZ4OFXC6V2PCXQDH6IU6/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Adding a built-in data structure with binary search tree semantics

Reply via email to