[sqlite] Creating a spatial index for a large number of points

Igor Conom Sat, 17 Oct 2009 07:04:02 -0700

    
Hello everybody!

I need to create a spatial index for a
large number of points and I would really like to hear your comments.
Thank you in advance for your time.


There are some very large files containing scientific
data, stored in a very simple format (by very large  I mean
things in the range of 10 TB). One of these files is made of
equal-length records, each record containing a fix-length header and
a vector of floating point numbers. For each record there are two
spatial points associated: x1,y1 and x2,y2. There are few hundred millions of 
such records.

What I need is to create an index such
that it allows me to group all records with the same x1,y1 and do
various operations (by same x1,y1 I mean points within some
predefined range, i.e. two points can be considered the same if the
distance between the two is less than some specified constant).

The operations are:
1. for a given (x,y), quickly find the
record group with the closest (x1,y1).
2. given a rectangle, find the list of
(x1,y1) groups withing that rectangle.

The index needs to be build in a
reasonable amount of time (close to the time it takes to read through
the original data). Preferably a single pass through the data should
be required – but this is second to building an efficient index.
Once created, the index can be used only read-only, so one can take
some extra time building it.

I expect that the number of distinct
(x1,y1) groups will be in the range of few millions, with maybe few
hundred records associated for each group. A record can be uniquely
identified by a 32-bit integer (its position on the file(s)), so a
group will be defined by (x1,y1) and a list of integers.

My initial plan is to create two
tables: one storing big chunks of groups and one r-tree. The “big
chunks” will be a list of many groups contained in a larger rectangular
area, that I can further process in memory. I start with a big
rectangle representing a huge enough boundary to contain every point.
As I add a point, I use the r-tree to find a suitable rectangle for
it, then add the point point to its list. If the list grows over a give number 
of
points (or size in bytes), I split the rectangle in 4 parts, from the
median point, so each part will contain a balanced number of points).

Another thing: I can use integer
coordinates: I know that relative to a good choice of origin and cell
size, the areal extent will not contain more than 2^32 cells on each
axis (from INT_MIN to INT_MAX). I presume the operations are faster
with integers.

CREATE TABLE groups ( data BLOB);
CREATE VIRTUAL TABLE groups_rt USING
rtree_i32 
(Id INTEGER, minX INTEGER, maxX
INTEGER, minY INTEGER, maxY INTEGER);

The groups_rt.Id will be groups.ROWID .

The algorithm to create the index is
described next:
for each input point:
    if is the first point:
        insert the first entry to the
groups table and 
        insert a big rectangle in the
r-tree (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
         else:
        transform x1,y1 to integers:
ix1, iy1
        SELECT groups.ROWID, data,
minX,maxX,minY,maxX FROM groups, groups_rt
                           WHERE
groups.ROWID=groups_rt.Id AND
                                maxX>= ix1 AND
minX<=(ix1+1) AND 
                                maxY>= iy1 AND
minY<=(iy1+1)
        There will always be at least
one row and at most 4 rows in the result.

        Pick the closest rectangle to
the point (closest in real coordinates).

        If the groups.data blob already
contains too many points:
                         (or is bigger than some
number of bytes .ie. 4K), 

             remove it from database and split the rectangle in 4 part, 
             then add them to the
database.

        Add the point to the right blob
of data.
     endif
end for

There is some additional optimization:
based on the fact that I process a bunch of points at a time and they
are most likely to be close to each other I keep the current
blob/rect in memory and keep adding points to it until a different
one is needed. This saves some updates to the database.


Right now I'm thinking what structure to use for the blob: for example, knowing 
that more points will be added later, is it worth allocating a bigger chunk 
initially so more points can be added to it l


I'd really like to hear some comments
or new ideas. It seems to be a little complicated so maybe I miss
something. The rectangles stored in the r-tree are always
non-overlapping: can this property be used for some optimization? Is
Sqlite the right choice for the job? I'm reluctant to build my own
disk format – only dealing with all the cache management,
journaling, fragmentation will add significant development time ....
and bugs.

Thank you for your time again, if you
read this far :)
Cheers,
-IC


      

_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Creating a spatial index for a large number of points

Reply via email to