Sorry for the formatting - it looked better when I sent it from Yahoo's web interface.
----- Original Message ----
From: Igor Conom <igor..co...@yahoo.com>
To: sqlite-users@sqlite.org
Sent: Sat, October 17, 2009 9:03:54 AM
Subject: [sqlite] Creating a spatial index for a large number of points

Hello everybody!

I need to create a spatial index for a large number of points and I would really like to hear your comments. Thank you in advance for your time.

There are some very large files containing scientific data, stored in a very simple format (by very large I mean things in the range of 10 TB). One of these files is made of equal-length records, each record containing a fixed-length header and a vector of floating-point numbers. For each record there are two spatial points associated: x1,y1 and x2,y2. There are a few hundred million such records.

What I need is to create an index that allows me to group all records with the same x1,y1 and do various operations on them (by the same x1,y1 I mean points within some predefined range, i.e. two points are considered the same if the distance between them is less than some specified constant). The operations are:

1. For a given (x,y), quickly find the record group with the closest (x1,y1).
2. Given a rectangle, find the list of (x1,y1) groups within that rectangle.

The index needs to be built in a reasonable amount of time (close to the time it takes to read through the original data). Preferably a single pass through the data should be required – but this is second to building an efficient index. Once created, the index will be used read-only, so one can take some extra time building it.

I expect that the number of distinct (x1,y1) groups will be in the range of a few million, with maybe a few hundred records associated with each group. A record can be uniquely identified by a 32-bit integer (its position in the file(s)), so a group will be defined by (x1,y1) and a list of integers.

My initial plan is to create two tables: one storing big chunks of groups, and one r-tree.
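To make the "same (x1,y1) within a tolerance" grouping concrete, here is a minimal Python sketch of quantizing real coordinates to 32-bit integer grid cells, in the spirit of the integer-coordinate idea below. The cell size and origin are assumed values for illustration, not from the original post; note that two points closer than the cell size usually share a cell, but can straddle a cell boundary.

```python
import math

# Assumed values for illustration only: the merge tolerance (cell size)
# and an origin chosen so all cells fit in a signed 32-bit range.
CELL = 0.001
ORIGIN_X = -180.0
ORIGIN_Y = -90.0

def to_cell(x, y):
    """Quantize real coordinates to integer cell indices (ix, iy)."""
    ix = int(math.floor((x - ORIGIN_X) / CELL))
    iy = int(math.floor((y - ORIGIN_Y) / CELL))
    return ix, iy

# Two nearby points (closer than CELL) typically map to the same cell:
a = to_cell(10.00001, 20.00002)
b = to_cell(10.00005, 20.00007)
```

A caveat with pure grid snapping: two points within the tolerance can still fall into adjacent cells, which is why the query below probes a small window (ix1..ix1+1, iy1..iy1+1) rather than a single cell.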
The “big chunks” will be lists of many groups contained in a larger rectangular area, which I can further process in memory. I start with a big rectangle representing a boundary large enough to contain every point. As I add a point, I use the r-tree to find a suitable rectangle for it, then add the point to that rectangle's list. If the list grows over a given number of points (or size in bytes), I split the rectangle into 4 parts from the median point, so each part contains a balanced number of points.

Another thing: I can use integer coordinates. I know that, relative to a good choice of origin and cell size, the areal extent will not contain more than 2^32 cells on each axis (from INT_MIN to INT_MAX). I presume the operations are faster with integers.

  CREATE TABLE groups (data BLOB);
  CREATE VIRTUAL TABLE groups_rt USING rtree_i32
    (Id INTEGER, minX INTEGER, maxX INTEGER, minY INTEGER, maxY INTEGER);

The groups_rt.Id will be groups.ROWID. The algorithm to create the index is described next:

  for each input point:
    if it is the first point:
      insert the first entry into the groups table, and insert a big
      rectangle into the r-tree (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
    else:
      transform x1,y1 to integers: ix1, iy1

        SELECT groups.ROWID, data, minX, maxX, minY, maxY
          FROM groups, groups_rt
         WHERE groups.ROWID = groups_rt.Id
           AND maxX >= ix1 AND minX <= (ix1+1)
           AND maxY >= iy1 AND minY <= (iy1+1)

      There will always be at least one row and at most 4 rows in the
      result. Pick the rectangle closest to the point (closest in real
      coordinates). If the groups.data blob already contains too many
      points (or is bigger than some number of bytes, e.g. 4 KB), remove
      it from the database and split the rectangle into 4 parts, then add
      them to the database. Add the point to the right blob of data.
    endif
  end for

There is some additional optimization: based on the fact that I process a bunch of points at a time and they are most likely to be close to each other, I keep the current blob/rectangle in memory and keep adding points to it until a different one is needed. This saves some updates to the database.

Right now I'm thinking about what structure to use for the blob: for example, knowing that more points will be added later, is it worth allocating a bigger chunk initially so more points can be added to it in place?

I'd really like to hear some comments or new ideas. It seems a little complicated, so maybe I'm missing something. The rectangles stored in the r-tree are always non-overlapping: can this property be used for some optimization? Is SQLite the right choice for the job? I'm reluctant to build my own disk format – dealing with all the cache management, journaling, and fragmentation would add significant development time ... and bugs.

Thank you for your time again, if you read this far :)

Cheers,
-IC
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
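The median split described in the algorithm above can be sketched in pure Python. The MAX_POINTS threshold and all names here are illustrative assumptions, and plain in-memory lists stand in for the groups blobs and r-tree rows; it only demonstrates the partitioning step, not the database side.

```python
from statistics import median

MAX_POINTS = 4  # assumed capacity before a leaf rectangle is split

def needs_split(points):
    """Split trigger: the list has grown over the given number of points."""
    return len(points) > MAX_POINTS

def split(rect, points):
    """rect = (min_x, max_x, min_y, max_y); points = [(ix, iy, rec_id), ...].
    Split at the median point so each quadrant gets a balanced share."""
    mx = median(p[0] for p in points)
    my = median(p[1] for p in points)
    min_x, max_x, min_y, max_y = rect
    children = [
        ((min_x, mx, min_y, my), []),   # lower-left
        ((mx, max_x, min_y, my), []),   # lower-right
        ((min_x, mx, my, max_y), []),   # upper-left
        ((mx, max_x, my, max_y), []),   # upper-right
    ]
    for ix, iy, rec in points:
        idx = (1 if ix > mx else 0) + (2 if iy > my else 0)
        children[idx][1].append((ix, iy, rec))
    return children
```

In the real scheme each returned child would become one row in groups plus one row in groups_rt; points exactly on the median line go to the lower/left child here, which keeps the four rectangles non-overlapping as the post requires.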