John, thank you for the comments! Maybe I wasn't clear: the 10 TB of data is separate. It contains a lot of other data that I don't dream of storing in a database. But this bulk data is structured as fixed-length records, each record containing a vector of floating-point values and some associated meta-information. Part of this meta-info is a pair of points, (x1,y1) and (x2,y2). The number of records is in the range of a few hundred million (say 500 million), which is the number of coordinate pairs I need to handle. Each coordinate is a 4-byte IEEE floating-point value, so 500 million records take about 500 million * 4 coordinates * 4 bytes ~ 8 GB. So I'll have to process about 8 GB of points.
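Just to make the layout concrete, here is a rough C sketch of what one such fixed-length record might look like; the field names, the header contents and the vector length are made-up assumptions, not the real file format:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed-length record layout (names and sizes are assumptions
 * for illustration only; the real format is not described here). */
#define SAMPLES_PER_RECORD 1000   /* assumed vector length */

#pragma pack(push, 1)
typedef struct {
    float x1, y1;                       /* first spatial point (4-byte IEEE floats) */
    float x2, y2;                       /* second spatial point */
    /* ... other meta-information would go here ... */
    float samples[SAMPLES_PER_RECORD];  /* the floating-point vector */
} Record;
#pragma pack(pop)

int main(void)
{
    /* With ~500 million records, the coordinate pairs alone come to
     * 500e6 * 4 coordinates * 4 bytes ~ 8 GB, matching the estimate above. */
    double n_records = 500e6;
    double coord_bytes = n_records * 4.0 * 4.0;
    printf("record size: %zu bytes, coordinate data: ~%.1f GB\n",
           sizeof(Record), coord_bytes / 1e9);
    return 0;
}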
These points have common coordinates for (x1,y1). The number of groups (distinct (x1,y1)) is in the range of a few hundred thousand to a million (say 1 million). Now, in this "index file" that I'm trying to create, I'll have about 1 million entries, each entry somehow containing the (x1,y1) and a list of integers, which are really record numbers into the original data set. So if nothing else is stored, the entire index file should be about 8 GB in our case. From what I read, SQLite should handle a database of a few GB containing a few million records fairly well. In reality my "groups" table will store big rectangles, each containing many groups (I'm thinking about 1000), so I expect that while the total size of the database stays the same (a few GB, hopefully), the number of records in the r-tree and the associated data table will be in the few-thousands range.

There are a few reasons why SQLite is a candidate for this job:

1. This "index" that I'm trying to build needs to be stored, since it will be used multiple times for further processing, so any in-memory structure I may use needs to be stored on disk at some point, in some form.

2. The process of indexing may take some time (a few hours to one day, as you read through 10 TB of data). If the process gets interrupted somehow, I need to be able to restart it gracefully. Here the journaling feature of SQLite comes in very handy: e.g. I process the input in chunks of a few tens of MB of points, and each chunk begins a transaction; if something happens, all the tables involved remain in a consistent state, and I can just restart with the last chunk processed (see the sketch after John's reply below).

3. The database is a single self-contained file, which makes it easy to integrate into a bigger project.

4. I used it before for something else :)

Points 1 and 2 are important. So, unless I implement my own storage system (think journaling, think page management, etc.), I need to find something else to build my app on top of. The other candidates I considered were HDF5, and ... well, I heard Firebird can be used in serverless mode. Any other ideas are welcome :)

-IC

----- Original Message ----
From: John Crenshaw <johncrens...@priacta.com>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
Sent: Sat, October 17, 2009 2:33:30 PM
Subject: Re: [sqlite] Creating a spatial index for a large number of points- sorry for the text formating

I doubt SQLite is the right tool for this job, for a number of reasons.

First, if the data is as simple as you say, you are probably better off writing your logic in straight C rather than SQL. SQLite is VERY fast, but there is still an incredible amount of overhead in executing a query, in any database. It will take many times more operations to do this with SQL than it would without. The strength of SQL is in abstracting complex access to complex data. Simple lists of data points that only need a few simple transformations should probably be handled differently.

Second, SQLite has in-memory overhead based on the size of the database. 10 TB worth of database would require a LOT of memory. I tried to find the numbers on this, but I couldn't. I think it was something like 1 byte per page, with each page 1 KB by default. That would be 10 GB of RAM required for your data, and that's assuming SQLite stores it as efficiently as it is stored now (unlikely).

Third, SQLite has a 2 TB limit using the default page size. By increasing the page size you can raise that limit, but you are entering wild territory.

I'm a huge fan of SQLite, but I wouldn't use it for the job you described.

John
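As mentioned in point 2 above, here is a rough sketch of the chunked-transaction pattern using the SQLite C API. The progress table, the chunk size and the read_chunk()/index_point() helpers are illustrative placeholders, not existing code:

#include <sqlite3.h>
#include <stdio.h>

#define POINTS_PER_CHUNK 1000000   /* assumed chunk size (a few tens of MB) */

typedef struct { float x1, y1; unsigned rec; } Point;   /* illustrative */

/* Placeholder stubs: in the real program these would read the bulk files
 * and update the groups/groups_rt tables for one point. */
static int read_chunk(long chunk, Point *pts, int *n) { (void)chunk; (void)pts; *n = 0; return 0; }
static void index_point(sqlite3 *db, const Point *p) { (void)db; (void)p; }

/* One transaction per chunk: if the run is interrupted, the database stays
 * consistent and indexing can resume from the last committed chunk. */
int index_in_chunks(sqlite3 *db, long first_chunk)
{
    static Point pts[POINTS_PER_CHUNK];
    int n = 0;

    for (long chunk = first_chunk; read_chunk(chunk, pts, &n); chunk++) {
        char sql[128];

        if (sqlite3_exec(db, "BEGIN", 0, 0, 0) != SQLITE_OK)
            return SQLITE_ERROR;

        for (int i = 0; i < n; i++)
            index_point(db, &pts[i]);

        /* Record progress inside the same transaction, so "last chunk done"
         * always matches the data actually committed (assumed one-row table). */
        snprintf(sql, sizeof(sql),
                 "INSERT OR REPLACE INTO progress(id, last_chunk) VALUES (1, %ld)",
                 chunk);
        sqlite3_exec(db, sql, 0, 0, 0);

        if (sqlite3_exec(db, "COMMIT", 0, 0, 0) != SQLITE_OK)
            return SQLITE_ERROR;
    }
    return SQLITE_OK;
}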
-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Igor Conom
Sent: Saturday, October 17, 2009 10:10 AM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Creating a spatial index for a large number of points- sorry for the text formating

Sorry for the formatting - it looked better when I sent it from Yahoo's web interface.

----- Original Message ----
From: Igor Conom <igor..co...@yahoo.com>
To: sqlite-users@sqlite.org
Sent: Sat, October 17, 2009 9:03:54 AM
Subject: [sqlite] Creating a spatial index for a large number of points

Hello everybody!

I need to create a spatial index for a large number of points and I would really like to hear your comments. Thank you in advance for your time.

There are some very large files containing scientific data, stored in a very simple format (by very large I mean things in the range of 10 TB). One of these files is made of equal-length records, each record containing a fixed-length header and a vector of floating-point numbers. Each record has two spatial points associated with it: (x1,y1) and (x2,y2). There are a few hundred million such records.

What I need is to create an index that allows me to group all records with the same (x1,y1) and do various operations (by "same (x1,y1)" I mean points within some predefined range, i.e. two points can be considered the same if the distance between them is less than some specified constant). The operations are:

1. for a given (x,y), quickly find the record group with the closest (x1,y1);
2. given a rectangle, find the list of (x1,y1) groups within that rectangle.

The index needs to be built in a reasonable amount of time (close to the time it takes to read through the original data). Preferably a single pass through the data should be required, but this is secondary to building an efficient index. Once created, the index will be used read-only, so one can take some extra time building it. I expect that the number of distinct (x1,y1) groups will be in the range of a few million, with maybe a few hundred records associated with each group. A record can be uniquely identified by a 32-bit integer (its position in the file(s)), so a group will be defined by (x1,y1) and a list of integers.

My initial plan is to create two tables: one storing big chunks of groups and one r-tree. A "big chunk" will be a list of many groups contained in a larger rectangular area, which I can further process in memory. I start with a big rectangle representing a boundary large enough to contain every point. As I add a point, I use the r-tree to find a suitable rectangle for it, then add the point to its list. If the list grows over a given number of points (or size in bytes), I split the rectangle into 4 parts at the median point, so each part will contain a balanced number of points.

Another thing: I can use integer coordinates. I know that, relative to a good choice of origin and cell size, the areal extent will not contain more than 2^32 cells on each axis (from INT_MIN to INT_MAX), and I presume the operations are faster with integers.

CREATE TABLE groups (data BLOB);
CREATE VIRTUAL TABLE groups_rt USING rtree_i32 (Id INTEGER, minX INTEGER, maxX INTEGER, minY INTEGER, maxY INTEGER);

groups_rt.Id will be groups.ROWID.
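Just to make the read-only side concrete, here is a rough sketch of how operation 2 (all groups within a rectangle) might look against this schema with the SQLite C API; the origin/cell-size constants, the quantization helper and the printf are illustrative assumptions:

#include <sqlite3.h>
#include <math.h>
#include <stdio.h>

/* Assumed quantization: integer cell = floor((coord - origin) / cell_size).
 * ORIGIN and CELL_SIZE are illustrative values, not from the actual data. */
#define ORIGIN    0.0
#define CELL_SIZE 0.001

static int to_cell(double v) { return (int)floor((v - ORIGIN) / CELL_SIZE); }

/* Operation 2 (sketch): fetch every "big chunk" blob whose rectangle overlaps
 * the query window; the groups inside each blob still have to be filtered in
 * memory, since one chunk holds many groups. */
int query_rect(sqlite3 *db, double x0, double y0, double x1, double y1)
{
    const char *sql =
        "SELECT g.ROWID, g.data FROM groups g JOIN groups_rt r ON g.ROWID = r.Id "
        "WHERE r.maxX >= ?1 AND r.minX <= ?2 AND r.maxY >= ?3 AND r.minY <= ?4";
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db, sql, -1, &stmt, NULL);
    if (rc != SQLITE_OK) return rc;

    sqlite3_bind_int(stmt, 1, to_cell(x0));
    sqlite3_bind_int(stmt, 2, to_cell(x1));
    sqlite3_bind_int(stmt, 3, to_cell(y0));
    sqlite3_bind_int(stmt, 4, to_cell(y1));

    while ((rc = sqlite3_step(stmt)) == SQLITE_ROW) {
        const void *blob = sqlite3_column_blob(stmt, 1);
        int nbytes = sqlite3_column_bytes(stmt, 1);
        /* ... scan the blob here and keep only groups inside (x0,y0)-(x1,y1) ... */
        printf("chunk %lld: %d bytes\n",
               (long long)sqlite3_column_int64(stmt, 0), nbytes);
        (void)blob;
    }
    sqlite3_finalize(stmt);
    return rc == SQLITE_DONE ? SQLITE_OK : rc;
}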
The algorithm to create the index is described next:

for each input point:
    if it is the first point:
        insert the first entry into the groups table, and insert one big
        rectangle into the r-tree (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
    else:
        transform x1,y1 to integers ix1, iy1, then:

        SELECT groups.ROWID, data, minX, maxX, minY, maxY
        FROM groups, groups_rt
        WHERE groups.ROWID = groups_rt.Id
          AND maxX >= ix1 AND minX <= (ix1+1)
          AND maxY >= iy1 AND minY <= (iy1+1)

        There will always be at least one row and at most 4 rows in the
        result. Pick the rectangle closest to the point (closest in real
        coordinates). If the groups.data blob already contains too many
        points (or is bigger than some number of bytes, e.g. 4 KB), remove
        it from the database, split the rectangle into 4 parts, then add
        them to the database. Add the point to the right blob of data.
    endif
end for

There is an additional optimization: since I process a bunch of points at a time and they are most likely close to each other, I keep the current blob/rectangle in memory and keep adding points to it until a different one is needed. This saves some updates to the database.

Right now I'm thinking about what structure to use for the blob: for example, knowing that more points will be added later, is it worth allocating a bigger chunk initially so more points can be appended to it? (A rough sketch of one possible layout and of the split step is appended below.) I'd really like to hear some comments or new ideas. It seems to be a little complicated, so maybe I'm missing something. The rectangles stored in the r-tree are always non-overlapping: can this property be used for some optimization? Is SQLite the right choice for the job? I'm reluctant to build my own disk format - just dealing with all the cache management, journaling and fragmentation will add significant development time ... and bugs.

Thank you for your time again, if you read this far :)

Cheers,
-IC

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
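For what it's worth, a rough sketch of one possible blob layout and of the split-at-the-median step described in the algorithm above; the Entry/Chunk structures, the field names and the quadrant bookkeeping are all illustrative assumptions rather than a definitive design:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative in-memory form of one "big chunk": its integer rectangle and a
 * packed array of entries.  Serialized as-is (count + packed entries), this
 * array is also one possible layout for the groups.data blob. */
typedef struct {
    int32_t  ix, iy;     /* quantized (x1,y1) cell of the record */
    uint32_t rec;        /* 32-bit record number in the bulk files */
} Entry;

typedef struct {
    int32_t minX, maxX, minY, maxY;   /* rectangle stored in groups_rt */
    int     count;
    Entry   entries[];                /* flexible array member */
} Chunk;

static int cmp_ix(const void *a, const void *b)
{
    const Entry *pa = a, *pb = b;
    return (pa->ix > pb->ix) - (pa->ix < pb->ix);
}
static int cmp_iy(const void *a, const void *b)
{
    const Entry *pa = a, *pb = b;
    return (pa->iy > pb->iy) - (pa->iy < pb->iy);
}

/* Split a full chunk into 4 children at the median point, so each child gets a
 * roughly balanced share of the entries.  The caller would then delete the old
 * row from groups/groups_rt and insert the 4 new ones.  Error handling omitted. */
void split_chunk(const Chunk *full, Chunk *out[4])
{
    Entry *tmp = malloc(full->count * sizeof(Entry));
    memcpy(tmp, full->entries, full->count * sizeof(Entry));

    /* Median of the x and y cell coordinates. */
    qsort(tmp, full->count, sizeof(Entry), cmp_ix);
    int32_t medX = tmp[full->count / 2].ix;
    qsort(tmp, full->count, sizeof(Entry), cmp_iy);
    int32_t medY = tmp[full->count / 2].iy;

    int32_t xs[4][2] = {{full->minX, medX}, {medX, full->maxX},
                        {full->minX, medX}, {medX, full->maxX}};
    int32_t ys[4][2] = {{full->minY, medY}, {full->minY, medY},
                        {medY, full->maxY}, {medY, full->maxY}};

    for (int q = 0; q < 4; q++) {
        out[q] = calloc(1, sizeof(Chunk) + full->count * sizeof(Entry));
        out[q]->minX = xs[q][0]; out[q]->maxX = xs[q][1];
        out[q]->minY = ys[q][0]; out[q]->maxY = ys[q][1];
    }
    for (int i = 0; i < full->count; i++) {
        const Entry *e = &full->entries[i];
        int q = (e->ix >= medX ? 1 : 0) + (e->iy >= medY ? 2 : 0);
        out[q]->entries[out[q]->count++] = *e;
    }
    free(tmp);
}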