[sqlite] Order Statistic Trees

Scott Robison Sat, 17 May 2014 19:20:07 -0700

My apologies for this lengthy email. I hope someone can wade through it and
provide some insight.


For many database manipulation needs, SQLite (or SQL in general) is a
perfectly adequate solution. I'm having a problem that does not work well
in SQL land (or at least I can't find a solution that works performantly).
I've started writing something of my own, but really don't want to write
(esp maintain in the future) code that already exists and is suitably
licensed (LGPL or recognized open source license; GPL and similar hard
copyleft licenses are not usable).

We have a slow unindexed data source (third party proprietary database)
that we read once and store in a SQLite database for cache purposes. The
user needs to be able to "see" the population of this list sorted by some
field. When the data source is "small" it's not a big deal, the user will
never perceive a problem. When it is large, it takes a very long time
(potentially hours) to enumerate the source data.

It is easy enough to read the source data and insert it into a table. It is
easy enough to query the list of data ordered by any field after the data
has been completely read. The hard part is knowing the correct insertion
point for a newly read record. Should it go into the current window of the
data? Does it go *above* the current window? Or is it *below*?

To attempt to solve this problem, I wrote code to maintain my own sort
order in memory, and worry about optimization later. Unfortunately, we've
now reached the point where this simply can't scale much past 10,000
records, and we have one customer complaining about his 250,000.

The solution I've come up with is called Order Statistic Tree. Maybe you've
heard of it, though it was new to me. Really it's a modification that can
be applied to many / most tree structures. Instead of just maintaining the
sort order of keys in the tree along with whatever invariants the tree
requires, you maintain a count of the number of items in each subtree. In a
normal B-Tree you have O(log N) time to find any given record by key value,
but finding the Nth element from the beginning of the list is an O(N)
operation. An Order Statistic B-Tree allows O(log N) for finding by key and
for selecting the Nth element from.

So my question (finally) is: Does anyone here already have or know of some
type of Order Statistic code? I'm pretty sure SQLite doesn't have that,
though my understanding could be wrong.

I appreciate all responses. Believe me, I really do need to keep track of
an order of rows as a table is being populated to keep it in sync with an
MVC list view. It doesn't have to be 100% optimal, but it needs to be
significantly faster than either of the following (assume appropriate
unique indexes to avoid indeterminate order, which I can control):

SELECT * FROM SomeTable ORDER BY SomeColumn OFFSET 249999 LIMIT 1

SELECT COUNT(*) FROM SomeTable WHERE SomeColumn = value ORDER BY SomeColumn

As I understand it, both of these have to execute the full query and
calculate the result set in order to do return a result. I need something
that gives me log N access (or better) to elements by key and numeric index.

Again, my apologies for the length of this. I can write it myself, but if I
don't have to, that would be awesome.

-- 
Scott Robison
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Order Statistic Trees

Reply via email to