The OpenStack Object Storage system uses sqlite3 databases as the "metadata" storage layer for its "containers" and "accounts" - similar in function to AWS S3 "buckets".

Each "object" in the system is recorded by a row in its "container". Given the API URI for an object:

    https://storage.host/v1/<account>/<container>/<path/to/some/object>

the container database would have a row in its object table for that object, named "path/to/some/object".

These container databases are normally small (<1-4M rows), and you can have many of them (>10M) - but depending on the usage pattern they can also grow big (100+ GB) when many, many object rows are stored in a *single* container. These databases are primarily used to provide object listings as part of the API, but also account usage information - how many objects, how many bytes, etc. (these counts are handled via triggers).

We're investigating strategies to improve the way this container information is stored at scale. One approach we're researching is to "split" the database such that objects before and after "pivot points" are stored in different underlying databases. An analogy might be something akin to an "encyclopedia" or "card catalog" - if you're storing log files from the last year, your db may pivot at "2016/01/db01/var/log/postgres/9.4/data/pg_log/postgresql.log", so objects named like "2015/05/24/web01/var/log/nginx.log" might go in one db, while objects named like "2016/03/worker10/var/log/beanstalkd.log" might be in the other.

For databases that currently have hundreds of millions of rows we would expect many such pivot points - functioning essentially like bookends - but ideally, in the future, as soon as a database grows sufficiently large it would automatically be split in half. Hey, it's an idea. ;)

This brings us to the interesting SQL: what's the best way to ask a table indexed on name for the name that is (roughly) in the middle? (In case it's helpful, we can assume we have a rough idea of the cardinality [1] of rows in the database, through a combination of the triggers which update the object count and, to a lesser extent, the min and max ROW_ID.)

One attempt:

    SELECT name FROM object
     ORDER BY name
     LIMIT 1 OFFSET (SELECT object_count / 2 FROM policy_stat);

Unfortunately this is not a fast query as the OFFSET gets larger - i.e. 350M :)

Is there a more obvious way to find the "middle" of a large ordered query in an indexed table? Just estimating between the min and max doesn't always seem to produce a good split, since the distribution of prefixes is not always uniform.

Could one imagine "estimating" the middle of the index by traversing the B-Tree in a non-standard way!? Would any of that be "exposed" in a "supported" fashion that would be likely to be maintainable across a non-trivial set of sqlite3 versions? If not maintainable, would it even be *possible* to estimate *quickly* - given the current storage format - to parse the db file and traverse the B-Tree by hand, even if just purely for the geek factor!? ;)

I'm interested in your thoughts!

-Clay

1. It might be worth noting these databases are replicated, and the object table is kept in sync via a custom daemon that runs as part of the object storage consistency engine. In addition to the name (and some other *non*-indexed metadata associated with the object) we also store a "deleted" flag on each row - so the index is actually "(deleted, name)".
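P.S. For a bit more context on those trigger-maintained counts, the bookkeeping is roughly shaped like this (a simplified, illustrative sketch - names and details are approximate, and the real schema also tracks byte counts and more):

    -- illustrative only: keep policy_stat.object_count in step with
    -- live (deleted = 0) rows in the object table
    CREATE TRIGGER object_insert AFTER INSERT ON object
    BEGIN
        UPDATE policy_stat
           SET object_count = object_count + (1 - NEW.deleted);
    END;

    CREATE TRIGGER object_delete AFTER DELETE ON object
    BEGIN
        UPDATE policy_stat
           SET object_count = object_count - (1 - OLD.deleted);
    END;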
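P.P.S. One rough idea to help frame the question (an untested sketch, not something we do today): instead of paying for the huge OFFSET, seek to a handful of random ROWIDs - MIN/MAX of the ROWID are cheap to get, and each lookup is a single B-Tree seek - and take the middle of the sampled names as an estimate of the median:

    -- untested sketch: sample ~101 rows at random ROWID positions and
    -- return the median of the sampled names.  Assumes ROWIDs are
    -- reasonably dense; gaps and deleted rows will skew the estimate.
    WITH RECURSIVE
      cnt(i) AS (
        SELECT 1 UNION ALL SELECT i + 1 FROM cnt WHERE i < 101
      ),
      bounds(lo, span) AS (
        SELECT MIN(ROWID), MAX(ROWID) - MIN(ROWID) + 1 FROM object
      ),
      picks(r) AS (
        SELECT lo + abs(random()) % span FROM cnt, bounds
      ),
      sampled(name) AS (
        SELECT (SELECT name FROM object
                 WHERE ROWID >= picks.r AND deleted = 0
                 ORDER BY ROWID LIMIT 1)
          FROM picks
      )
    SELECT name FROM sampled
     WHERE name IS NOT NULL
     ORDER BY name
     LIMIT 1 OFFSET 50;   -- the 51st of 101 samples = the sample median

It should stay fast even at 350M rows since it's only ~100 point lookups, but it's only as good as the assumption that ROWID position is roughly uniform over the live rows - so take it as a sketch, not an answer.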
Each "object" in the system is recorded by a row in it's "container". Given a the API URI for an object: https://storage.host/v1/<account>/<conainer>/<path/to/some/object> The container database would have a row for each object in it's object table. "path/to/some/object" These container databases are normally small (<1-4M rows) - and you can have many of them (>10M) - but depending on the usage pattern - they can also grow big (100+ GB) when there are many many object rows stored in a *single* container. These databases are primarily used to provide object listings as part of the API - but also account usage information - how many objects, how many bytes, etc. (these counts are handled via triggers) We're investigating strategies to improve the way this container information is stored at scale, one approach we're researching is to "split" the database such that objects before and after "pivot points" are stored in different underlying databases. An analogy might be something akin to an "encyclopedia" or "card catalog" - if you're storing log files from the last year your db may pivot at "2016/01/db01/var/log/postgres/9.4/data/pg_log/postgresql.log" and objects named like "2015/05/24/web01/var/log/nginx.log" might go in one db, while object named like "2016/03/worker10/var/log/beanstalkd.log" might be in the other. For databases that currently have 100's of millions of rows we would expect many such pivot points - functioning essentially like bookends - but ideally in the future as soon as a database grows sufficiently large enough they would automatically be split in half. Hey, it's an idea. ;) This brings us to the interesting SQL - what's the best way to ask a table indexed on name, what is the name (roughly) in the middle (since it might be potentially helpful, we can assume we have a rough idea on cardinality [1] of rows in the database through a combination of triggers which update the object count and to a lesser extent min and max ROW_ID) One attempt: SELECT name FROM object ORDER BY name LIMIT 1 OFFSET ( SELECT object_count / 2 FROM policy_stat); Unfortunately - this is not a fast query as OFFSET gets larger i.e. 350M :) Is there a more obvious way to find this "middle" of a large ordered query in an indexed table? Just estimating between the min and max doesn't always seem to represent a good split since the distribution of prefixes is not always uniform? Could one imagine "estimating" the middle of the index by traversing the B-Tree in a non-standard way!? Would any of that be "exposed" in a "supported" fashion that would be likely to be maintainable across a non-trivial set of sqlite3 versions? If not, maintainable - would it even be *possible* to estimate *quickly* - given the current storage format - to parse the db file and traverse the B-Tree by hand - even if just purely for the geek factor!? ;) I'm interested in your thoughts! -Clay 1. It might be worth noting these databases are replicated and the object table is kept in sync via a custom daemon that runs as part of the object storage consistency engine. In addition to the name (and some other *non*-indexed metadata associated with the object) we also store a "deleted" flag on each row - so the index is actually "(deleted, name)". _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users