Sorry for the formatting - it looked better when I sent it from Yahoo's web 
interface.



----- Original Message ----
From: Igor Conom <igor..co...@yahoo.com>
To: sqlite-users@sqlite.org
Sent: Sat, October 17, 2009 9:03:54 AM
Subject: [sqlite] Creating a spatial index for a large number of points

Hello everybody!

I need to create a spatial index for a large number of points and I would 
really like to hear your comments. Thank you in advance for your time.

There are some very large files containing scientific data, stored in a very 
simple format (by very large I mean things in the range of 10 TB). One of 
these files is made of equal-length records, each record containing a 
fixed-length header and a vector of floating point numbers. Each record has 
two spatial points associated with it: x1,y1 and x2,y2. There are a few 
hundred million such records.

What I need is to create an index that allows me to group all records with 
the same x1,y1 and do various operations on the groups (by the same x1,y1 I 
mean points within some predefined range, i.e. two points are considered the 
same if the distance between them is less than some specified constant).

The operations are:
1. for a given (x,y), quickly find the record group with the closest (x1,y1).
2. given a rectangle, find the list of (x1,y1) groups within that rectangle
   (a query sketch follows right after this list).
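
To make operation 2 concrete, here is a rough sketch of the window query 
against the r-tree schema I define further down (groups_in_rect is just an 
illustrative name, and error handling is omitted):

#include <sqlite3.h>

/* Find every leaf rectangle that overlaps the query window;
   the caller still has to open the matching groups.data blobs
   and filter individual (x1,y1) groups against the exact rect. */
static int groups_in_rect(sqlite3 *db, int qminX, int qmaxX,
                          int qminY, int qmaxY)
{
    const char *sql =
        "SELECT Id FROM groups_rt"
        " WHERE maxX >= ?1 AND minX <= ?2"
        "   AND maxY >= ?3 AND minY <= ?4";
    sqlite3_stmt *st;
    int rc = sqlite3_prepare_v2(db, sql, -1, &st, 0);
    if (rc != SQLITE_OK) return rc;
    sqlite3_bind_int(st, 1, qminX);
    sqlite3_bind_int(st, 2, qmaxX);
    sqlite3_bind_int(st, 3, qminY);
    sqlite3_bind_int(st, 4, qmaxY);
    while (sqlite3_step(st) == SQLITE_ROW) {
        sqlite3_int64 id = sqlite3_column_int64(st, 0);
        (void)id;  /* load groups.data for this id here */
    }
    return sqlite3_finalize(st);
}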

The index needs to be built in a reasonable amount of time (close to the 
time it takes to read through the original data). Preferably it should 
require only a single pass through the data, but this is secondary to 
building an efficient index. Once created, the index will only be used 
read-only, so one can take some extra time building it.

I expect that the number of distinct (x1,y1) groups will be in the range of 
a few million, with maybe a few hundred records associated with each group. 
A record can be uniquely identified by a 32-bit integer (its position in the 
file(s)), so a group will be defined by (x1,y1) and a list of integers.
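
In memory that could look something like this (the names are mine, nothing 
standard):

#include <stdint.h>

typedef struct Group {
    double    x1, y1;   /* representative coordinates      */
    uint32_t  nrec;     /* number of records in the group  */
    uint32_t *recs;     /* record positions in the file(s) */
} Group;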

My initial plan is to create two tables: one storing big chunks of groups 
and one r-tree. A “big chunk” is a list of many groups contained in a larger 
rectangular area, which I can then process further in memory. I start with 
one big rectangle, a boundary large enough to contain every point. As I add 
a point, I use the r-tree to find a suitable rectangle for it, then add the 
point to that rectangle's list. If the list grows over a given number of 
points (or size in bytes), I split the rectangle into 4 parts at the median 
point, so that each part contains a balanced number of points.

Another thing: I can use integer coordinates. I know that, relative to a 
good choice of origin and cell size, the areal extent will not span more 
than 2^32 cells on each axis (from INT_MIN to INT_MAX). I presume the 
operations are faster with integers.
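
The quantization itself would be something like this (ORIGIN_X/ORIGIN_Y and 
CELL_SIZE are whatever origin and cell size I settle on):

#include <math.h>
#include <stdint.h>

/* Map a real coordinate to a signed 32-bit cell index. */
static int32_t quantize(double x, double origin, double cell)
{
    return (int32_t)floor((x - origin) / cell);
}

/* ix1 = quantize(x1, ORIGIN_X, CELL_SIZE);
   iy1 = quantize(y1, ORIGIN_Y, CELL_SIZE); */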

CREATE TABLE groups (data BLOB);
CREATE VIRTUAL TABLE groups_rt USING rtree_i32
    (Id INTEGER, minX INTEGER, maxX INTEGER, minY INTEGER, maxY INTEGER);

The groups_rt.Id will be groups.ROWID.
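
Inserting a leaf would then go through sqlite3_last_insert_rowid() to tie 
the two tables together; roughly (error handling omitted):

#include <sqlite3.h>

static int insert_leaf(sqlite3 *db, const void *blob, int nbytes,
                       int minX, int maxX, int minY, int maxY)
{
    sqlite3_stmt *st;

    /* 1. store the chunk of groups ... */
    sqlite3_prepare_v2(db, "INSERT INTO groups(data) VALUES(?1)",
                       -1, &st, 0);
    sqlite3_bind_blob(st, 1, blob, nbytes, SQLITE_TRANSIENT);
    sqlite3_step(st);
    sqlite3_finalize(st);

    /* 2. ... then its rectangle, keyed by the fresh ROWID */
    sqlite3_int64 id = sqlite3_last_insert_rowid(db);
    sqlite3_prepare_v2(db,
        "INSERT INTO groups_rt(Id,minX,maxX,minY,maxY)"
        " VALUES(?1,?2,?3,?4,?5)", -1, &st, 0);
    sqlite3_bind_int64(st, 1, id);
    sqlite3_bind_int(st, 2, minX);
    sqlite3_bind_int(st, 3, maxX);
    sqlite3_bind_int(st, 4, minY);
    sqlite3_bind_int(st, 5, maxY);
    sqlite3_step(st);
    return sqlite3_finalize(st);
}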

The algorithm to create the index is described next:

for each input point:
    if it is the first point:
        insert the first entry into the groups table and
        insert a big rectangle into the r-tree
        (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
    else:
        transform x1,y1 to integers: ix1, iy1
        SELECT groups.ROWID, data, minX, maxX, minY, maxY
            FROM groups, groups_rt
            WHERE groups.ROWID = groups_rt.Id
              AND maxX >= ix1 AND minX <= (ix1+1)
              AND maxY >= iy1 AND minY <= (iy1+1)
        There will always be at least one row and at most 4 rows
        in the result.

        Pick the closest rectangle to the point (closest in real
        coordinates).

        If the groups.data blob already contains too many points
        (or is bigger than some number of bytes, i.e. 4 KB):
            remove it from the database, split the rectangle into
            4 parts, then add the parts back to the database.

        Add the point to the right blob of data.
    endif
end for
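
For the split itself, here is a rough sketch of the median cut under my 
current plan (Pt is a made-up type; adapt to the actual blob layout):

#include <stdint.h>
#include <stdlib.h>

typedef struct Pt { int32_t ix, iy; uint32_t rec; } Pt;

static int cmp_ix(const void *a, const void *b) {
    int32_t u = ((const Pt *)a)->ix, v = ((const Pt *)b)->ix;
    return (u > v) - (u < v);
}
static int cmp_iy(const void *a, const void *b) {
    int32_t u = ((const Pt *)a)->iy, v = ((const Pt *)b)->iy;
    return (u > v) - (u < v);
}

/* Compute the cut lines for the four children of a full leaf:
   the quadrants are (ix <= *cutx or not) x (iy <= *cuty or not).
   n must be > 0; the sorts reorder pts in place. */
static void median_cuts(Pt *pts, size_t n,
                        int32_t *cutx, int32_t *cuty)
{
    qsort(pts, n, sizeof(Pt), cmp_ix);
    *cutx = pts[n / 2].ix;
    qsort(pts, n, sizeof(Pt), cmp_iy);
    *cuty = pts[n / 2].iy;
}

One degenerate case I still have to handle: if many points share the median 
coordinate, one of the four children can come out (nearly) empty.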

There is some additional optimization: since I process a bunch of points at 
a time and they are most likely to be close to each other, I keep the 
current blob/rectangle in memory and keep adding points to it until a 
different one is needed. This saves some updates to the database.
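
Something like a one-slot write-back cache, in other words (again, the 
types are just illustrative):

#include <sqlite3.h>
#include <stdint.h>

typedef struct Cache {
    sqlite3_int64 id;               /* groups.ROWID, 0 if empty */
    int32_t minX, maxX, minY, maxY; /* cached leaf rectangle    */
    int dirty;                      /* needs writing back?      */
    /* ... the cached leaf's points ... */
} Cache;

/* The cached leaf can absorb the point iff its cell falls
   inside the cached rectangle. */
static int cache_hit(const Cache *c, int32_t ix, int32_t iy)
{
    return c->id != 0 &&
           ix >= c->minX && ix <= c->maxX &&
           iy >= c->minY && iy <= c->maxY;
}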

Right now I'm thinking about what structure to use for the blob: for 
example, knowing that more points will be added later, is it worth 
allocating a bigger chunk initially so more points can be appended in place?
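
If I do preallocate, one possible layout is a small header with a count and 
a capacity, followed by fixed-size entries and zero padding (this is 
entirely hypothetical):

#include <stdint.h>

typedef struct BlobHeader {
    uint32_t npoints;   /* entries in use                */
    uint32_t capacity;  /* entries the blob has room for */
} BlobHeader;

typedef struct BlobEntry {
    int32_t  ix, iy;    /* quantized coordinates   */
    uint32_t rec;       /* 32-bit record position  */
} BlobEntry;

A nice property of a fixed capacity is that the blob size never changes, so 
the incremental blob I/O interface (sqlite3_blob_open()/sqlite3_blob_write()) 
can update it in place instead of rewriting the whole row.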

I'd really like to hear some comments or new ideas. It all seems a little 
complicated, so maybe I'm missing something. The rectangles stored in the 
r-tree are always non-overlapping: can this property be used for some 
optimization? Is SQLite the right choice for the job? I'm reluctant to 
build my own disk format, since dealing with all the cache management, 
journaling, and fragmentation would add significant development time .... 
and bugs.

Thank you for your time again, if you read this far :)
Cheers,
-IC

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
