John, thank you for the comments!

Maybe I wasn't clear: the 10 TB of data is separate. It contains a lot of other 
data that I don't dream of storing in a database. But this bulk data is 
structured in fixed-length records, each record containing a vector of floating 
point values and some associated meta-information. Part of this meta-info is a 
pair of points, (x1,y1) and (x2,y2). The number of records is in the range of a 
few hundred million (say 500 million), which is the number of coordinate pairs 
I need to handle. Each coordinate is represented as a 4-byte IEEE floating 
point value, so 500 million records take 500e6 * 4 * 4 ~ 8 GB.
So I'll have to process about 8 GB of points.
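
To fix ideas, a record in the bulk file might look roughly like this (a sketch 
only; the field names, header contents, and vector length are placeholders, not 
the real format):

    #include <stdint.h>

    /* Hypothetical fixed-length record layout, for illustration only. */
    typedef struct {
        float    x1, y1;         /* first point of the pair (the group key) */
        float    x2, y2;         /* second point of the pair */
        uint8_t  meta[48];       /* other meta-information, size unknown here */
        float    samples[1000];  /* the vector of floating point values */
    } Record;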

Many of these points share the same (x1,y1). The number of groups 
(distinct (x1,y1)) is in the range of a few hundred thousand (say up to 1 
million). Now, in this "index file" that I'm trying to create, I'll have about 
1 million entries, each entry somehow containing (x1,y1) and a list of 
integers, which are really record numbers into the original data set. So if 
nothing else is stored, the entire index file should be about 8 GB in our case. 
From what I read, sqlite3 should handle a database a few GB large, containing a 
few million records, fairly well. The reality is that my "groups" table will 
store big rectangles, each containing many groups (I'm thinking about 1000), so 
I expect that while the total size of the database stays the same (a few GB, 
hopefully), the number of records in the r-tree and the associated data table 
will be in the few-thousands range.
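
Concretely, an index entry would carry something like the following (a sketch; 
all the names are mine, and the actual on-disk form would be the blob layout 
discussed further down):

    #include <stdint.h>

    /* Conceptual index entry: a group key plus the record numbers in it. */
    typedef struct {
        float    x1, y1;     /* representative coordinates of the group */
        uint32_t count;      /* number of records in the group */
        uint32_t recno[];    /* 32-bit record numbers into the original data */
    } GroupEntry;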

There are a few reasons SQLite is a candidate for this job:

1. This "index" that I try to build needs to be stored, since it will be used 
multiple times for further processing, so any "in-memory" structure that I may 
use needs to be stored on disk at some point in some form.

2. The indexing process may take some time (a few hours to one day, as it reads 
through 10 TB of data). If the process gets interrupted somehow, I need to be 
able to restart it gracefully. Here the journaling feature of SQLite comes in 
very handy: e.g., I process the input in chunks of a few tens of MB of points, 
and each chunk begins a transaction; if something happens, all the tables 
involved remain in a consistent state, and I can just restart from the last 
chunk processed (see the sketch after this list).

3. The database is a single self-contained file, which makes it easy to 
integrate into a bigger project.

4. I used it before for something else :)
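
To make point 2 concrete, here is a minimal sketch of the chunked-commit loop I 
have in mind (the progress table and the function are my own invention for 
illustration; error handling omitted):

    #include <sqlite3.h>
    #include <stdio.h>

    /* Each chunk of input is one transaction; the number of the last committed
       chunk is stored in the database itself, so after a crash we can resume
       from the first chunk that did not commit. */
    int build_index(sqlite3 *db, long total_chunks)
    {
        long start = 0;  /* in real code: read progress.last_chunk back here */
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS progress"
                         "(id INTEGER PRIMARY KEY, last_chunk INTEGER);",
                     0, 0, 0);
        for (long c = start; c < total_chunks; c++) {
            sqlite3_exec(db, "BEGIN;", 0, 0, 0);
            /* ... read a few tens of MB of points from the bulk file and
               update the groups and groups_rt tables ... */
            char sql[96];
            snprintf(sql, sizeof sql,
                     "INSERT OR REPLACE INTO progress VALUES(1, %ld);", c);
            sqlite3_exec(db, sql, 0, 0, 0);
            sqlite3_exec(db, "COMMIT;", 0, 0, 0); /* chunk durable or absent */
        }
        return SQLITE_OK;
    }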

Points 1 and 2 are the important ones. So, unless I implement my own storage 
system (think journal, think page management, etc.), I need to find something 
else to build my app on top of. The other candidates I considered were HDF5, 
and ... well, I heard Firebird can be used in serverless mode. Any other ideas 
are welcome :)

-IC


----- Original Message ----
From: John Crenshaw <johncrens...@priacta.com>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
Sent: Sat, October 17, 2009 2:33:30 PM
Subject: Re: [sqlite] Creating a spatial index for a large number of points- 
sorry for the text formating

I doubt SQLite is the right tool for this job, for a number of reasons.

First, if the data is as simple as you say, you are probably better off writing 
your logic as straight C, rather than SQL. SQLite is VERY fast, but there is 
still an incredible amount of overhead in executing a query, in any database. 
It will take many times more operations to do this with SQL than it would to do 
it without. The strength of SQL is in abstracting complex access to complex 
data. Simple lists of datapoints that only need a few simple transformations 
should probably be handled differently. 

Second, SQLite has in-memory overhead based on the size of the database. 10 TB 
worth of database would require a LOT of memory. I tried to find the numbers on 
this, but couldn't. I think it was something like 1 byte per page, with each 
page 1 KB by default. That would be 10 GB of RAM required for your data, and 
that assumes SQLite stores it as efficiently as it is stored now (unlikely).

Third, SQLite has a 2 TB limit using the default page size. By increasing the 
page size you can raise that limit, but you are entering wild territory.
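
(For reference, the page size has to be raised before any content is written to 
a fresh database file, e.g. right after opening it:

    /* Must run before the first table is created in a new database. */
    sqlite3_exec(db, "PRAGMA page_size = 32768;", 0, 0, 0);
)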

I'm a huge fan of SQLite, but I wouldn't use it for the job you described.

John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] 
On Behalf Of Igor Conom
Sent: Saturday, October 17, 2009 10:10 AM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Creating a spatial index for a large number of points- 
sorry for the text formating

Sorry for the formatting - it looked better when I sent it from Yahoo's web 
interface.



----- Original Message ----
From: Igor Conom <igor..co...@yahoo.com>
To: sqlite-users@sqlite.org
Sent: Sat, October 17, 2009 9:03:54 AM
Subject: [sqlite] Creating a spatial index for a large number of points

Hello everybody!

I need to create a spatial index for a large number of points and I would 
really like to hear your comments. Thank you in advance for your time.

There are some very large files containing scientific data, stored in a very 
simple format (by very large I mean things in the range of 10 TB). One of these 
files is made of equal-length records, each record containing a fixed-length 
header and a vector of floating point numbers. For each record there are two 
spatial points associated: (x1,y1) and (x2,y2). There are a few hundred million 
such records.

What I need is to create an index that allows me to group all records with the 
same (x1,y1) and do various operations (by "same (x1,y1)" I mean points within 
some predefined range, i.e. two points are considered the same if the distance 
between them is less than some specified constant).

The operations are:
1. for a given (x,y), quickly find the record group with the closest (x1,y1);
2. given a rectangle, find the list of (x1,y1) groups within that rectangle.

The index needs to be built in a reasonable amount of time (close to the time 
it takes to read through the original data). Preferably a single pass through 
the data should be required, but this is secondary to building an efficient 
index. Once created, the index will only be used read-only, so one can take 
some extra time building it.

I expect the number of distinct (x1,y1) groups to be in the range of a few 
million, with maybe a few hundred records associated with each group. A record 
can be uniquely identified by a 32-bit integer (its position in the file(s)), 
so a group will be defined by (x1,y1) and a list of integers.

My initial plan is to create two tables: one storing big chunks of groups, and 
one r-tree. A "big chunk" will be a list of many groups contained in a larger 
rectangular area, which I can further process in memory. I start with a big 
rectangle representing a boundary large enough to contain every point. As I add 
a point, I use the r-tree to find a suitable rectangle for it, then add the 
point to that rectangle's list. If the list grows past a given number of points 
(or size in bytes), I split the rectangle into 4 parts from the median point, 
so each part contains a balanced number of points.

Another thing: I can use integer coordinates. I know that, relative to a good 
choice of origin and cell size, the areal extent will not span more than 2^32 
cells on each axis (from INT_MIN to INT_MAX). I presume the operations are 
faster with integers.
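
A sketch of the coordinate transform (the origin and cell size are free 
parameters of my choosing):

    #include <math.h>
    #include <stdint.h>

    /* Quantize a real coordinate to a signed 32-bit cell index, given an
       origin and cell size chosen so the whole extent fits in 2^32 cells. */
    static int32_t quantize(double v, double origin, double cell_size)
    {
        return (int32_t)floor((v - origin) / cell_size);
    }

    /* e.g. ix1 = quantize(x1, ox, cell); iy1 = quantize(y1, oy, cell); */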

CREATE TABLE groups (data BLOB);
CREATE VIRTUAL TABLE groups_rt USING rtree_i32
    (Id INTEGER, minX INTEGER, maxX INTEGER, minY INTEGER, maxY INTEGER);

The groups_rt.Id will be groups.ROWID.
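
With this schema, the second operation above (groups within a rectangle) 
becomes a single r-tree range query; a sketch in C, assuming the query window 
is already quantized to cell indices:

    #include <sqlite3.h>

    /* Fetch every coarse rectangle overlapping the query window; each blob
       is then scanned in memory for the individual (x1,y1) groups. */
    static const char *kRectQuery =
        "SELECT groups.rowid, groups.data "
        "FROM groups JOIN groups_rt ON groups.rowid = groups_rt.Id "
        "WHERE groups_rt.maxX >= ?1 AND groups_rt.minX <= ?2 "
        "  AND groups_rt.maxY >= ?3 AND groups_rt.minY <= ?4;";

    void query_rect(sqlite3 *db, int qminx, int qmaxx, int qminy, int qmaxy)
    {
        sqlite3_stmt *st;
        sqlite3_prepare_v2(db, kRectQuery, -1, &st, 0);
        sqlite3_bind_int(st, 1, qminx);
        sqlite3_bind_int(st, 2, qmaxx);
        sqlite3_bind_int(st, 3, qminy);
        sqlite3_bind_int(st, 4, qmaxy);
        while (sqlite3_step(st) == SQLITE_ROW) {
            const void *blob = sqlite3_column_blob(st, 1);
            int nbytes = sqlite3_column_bytes(st, 1);
            /* ... scan the blob for groups inside the query window ... */
            (void)blob; (void)nbytes;
        }
        sqlite3_finalize(st);
    }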

The algorithm to create the index is described next:

for each input point:
    if it is the first point:
        insert the first entry into the groups table and
        insert a big rectangle into the r-tree
        (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
    else:
        transform x1,y1 to integers: ix1, iy1
        SELECT groups.ROWID, data, minX, maxX, minY, maxY
        FROM groups, groups_rt
        WHERE groups.ROWID = groups_rt.Id
          AND maxX >= ix1 AND minX <= (ix1+1)
          AND maxY >= iy1 AND minY <= (iy1+1)
        (there will always be at least one row and at most 4 rows
        in the result)

        Pick the rectangle closest to the point (closest in real coordinates).

        If the groups.data blob already contains too many points
        (or is bigger than some number of bytes, e.g. 4 KB):
            remove it from the database, split the rectangle into 4 parts,
            then add the parts to the database.

        Add the point to the right blob of data.
    endif
end for
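
The lookup in the loop above would be a single prepared statement, reused for 
every point; a sketch:

    #include <sqlite3.h>

    /* Prepared once and stepped once per point; with non-overlapping
       rectangles tiling the plane this returns 1 to 4 rows. */
    static const char *kPointQuery =
        "SELECT groups.rowid, groups.data, minX, maxX, minY, maxY "
        "FROM groups JOIN groups_rt ON groups.rowid = groups_rt.Id "
        "WHERE maxX >= ?1 AND minX <= ?1 + 1 "
        "  AND maxY >= ?2 AND minY <= ?2 + 1;";

    /* per point: bind ix1 to ?1 and iy1 to ?2, step through the rows,
       then sqlite3_reset() the statement before the next point. */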

There is one additional optimization: since I process a bunch of points at a 
time, and they are most likely close to each other, I keep the current 
blob/rectangle in memory and keep adding points to it until a different one is 
needed. This saves some updates to the database.

Right now I'm thinking about what structure to use for the blob: for example, 
knowing that more points will be added later, is it worth allocating a bigger 
chunk initially so more points can be appended without rewriting the whole 
blob?
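
One possible blob layout with headroom reserved up front (purely a sketch; 
whether the pre-allocation pays off depends on how the blob is rewritten):

    #include <stdint.h>

    /* Possible blob layout: a small header followed by packed point entries.
       'capacity' is over-allocated so new points can be appended in place
       with the incremental blob API (sqlite3_blob_write) instead of
       rewriting the whole row. */
    typedef struct {
        uint32_t capacity;   /* number of entry slots allocated */
        uint32_t count;      /* number of slots in use */
    } BlobHeader;

    typedef struct {
        float    x1, y1;     /* exact group coordinates */
        uint32_t recno;      /* record number in the original data set */
    } BlobEntry;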

I'd really like to hear some comments or new ideas. It seems a little 
complicated, so maybe I'm missing something. The rectangles stored in the 
r-tree are always non-overlapping: can this property be used for some 
optimization? Is SQLite the right choice for the job? I'm reluctant to build my 
own disk format; dealing with all the cache management, journaling, and 
fragmentation would add significant development time ... and bugs.

Thank you for your time again, if you read this far :)
Cheers,
-IC



_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
