Hello everybody!
I need to create a spatial index for a
large number of points and I would really like to hear your comments.
Thank you in advance for your time.
There are some very large files containing scientific
data, stored in a very simple format (by very large I mean
things in the range of 10 TB). One of these files is made of
equal-length records, each record containing a fix-length header and
a vector of floating point numbers. For each record there are two
spatial points associated: x1,y1 and x2,y2. There are few hundred millions of
such records.
What I need is to create an index such
that it allows me to group all records with the same x1,y1 and do
various operations (by same x1,y1 I mean points within some
predefined range, i.e. two points can be considered the same if the
distance between the two is less than some specified constant).
The operations are:
1. for a given (x,y), quickly find the
record group with the closest (x1,y1).
2. given a rectangle, find the list of
(x1,y1) groups withing that rectangle.
The index needs to be build in a
reasonable amount of time (close to the time it takes to read through
the original data). Preferably a single pass through the data should
be required – but this is second to building an efficient index.
Once created, the index can be used only read-only, so one can take
some extra time building it.
I expect that the number of distinct
(x1,y1) groups will be in the range of few millions, with maybe few
hundred records associated for each group. A record can be uniquely
identified by a 32-bit integer (its position on the file(s)), so a
group will be defined by (x1,y1) and a list of integers.
My initial plan is to create two
tables: one storing big chunks of groups and one r-tree. The “big
chunks” will be a list of many groups contained in a larger rectangular
area, that I can further process in memory. I start with a big
rectangle representing a huge enough boundary to contain every point.
As I add a point, I use the r-tree to find a suitable rectangle for
it, then add the point point to its list. If the list grows over a give number
of
points (or size in bytes), I split the rectangle in 4 parts, from the
median point, so each part will contain a balanced number of points).
Another thing: I can use integer
coordinates: I know that relative to a good choice of origin and cell
size, the areal extent will not contain more than 2^32 cells on each
axis (from INT_MIN to INT_MAX). I presume the operations are faster
with integers.
CREATE TABLE groups ( data BLOB);
CREATE VIRTUAL TABLE groups_rt USING
rtree_i32
(Id INTEGER, minX INTEGER, maxX
INTEGER, minY INTEGER, maxY INTEGER);
The groups_rt.Id will be groups.ROWID .
The algorithm to create the index is
described next:
for each input point:
if is the first point:
insert the first entry to the
groups table and
insert a big rectangle in the
r-tree (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
else:
transform x1,y1 to integers:
ix1, iy1
SELECT groups.ROWID, data,
minX,maxX,minY,maxX FROM groups, groups_rt
WHERE
groups.ROWID=groups_rt.Id AND
maxX>= ix1 AND
minX<=(ix1+1) AND
maxY>= iy1 AND
minY<=(iy1+1)
There will always be at least
one row and at most 4 rows in the result.
Pick the closest rectangle to
the point (closest in real coordinates).
If the groups.data blob already
contains too many points:
(or is bigger than some
number of bytes .ie. 4K),
remove it from database and split the rectangle in 4 part,
then add them to the
database.
Add the point to the right blob
of data.
endif
end for
There is some additional optimization:
based on the fact that I process a bunch of points at a time and they
are most likely to be close to each other I keep the current
blob/rect in memory and keep adding points to it until a different
one is needed. This saves some updates to the database.
Right now I'm thinking what structure to use for the blob: for example, knowing
that more points will be added later, is it worth allocating a bigger chunk
initially so more points can be added to it l
I'd really like to hear some comments
or new ideas. It seems to be a little complicated so maybe I miss
something. The rectangles stored in the r-tree are always
non-overlapping: can this property be used for some optimization? Is
Sqlite the right choice for the job? I'm reluctant to build my own
disk format – only dealing with all the cache management,
journaling, fragmentation will add significant development time ....
and bugs.
Thank you for your time again, if you
read this far :)
Cheers,
-IC
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users