I'm not sure I completely understand your data structure, but here are
some ideas:

First, experiment with different page and cache sizes.  I don't know
whether you jumped straight to a 32K page size or tried each size in
turn, but measuring is the way to go.
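
For example, you could sweep a few combinations with PRAGMAs along
these lines (the numbers are only starting points for the experiment,
not recommendations; page_size only takes effect on a new database or
after a VACUUM):

  PRAGMA page_size = 8192;    -- bytes per page; try 1024 through 32768
  VACUUM;                     -- rebuilds the file with the new page size
  PRAGMA cache_size = 32768;  -- in pages; 32768 * 8 KB = 256 MB of cache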

I'm guessing that in your cell_lc and cell_dist tables, rows are
unique on cell_id.  If so, you could make cell_id the primary key and
avoid the indexes altogether.
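
If that guess holds, a sketch of the link tables rebuilt that way
(assuming you can recreate and reload them) would be:

  CREATE TABLE cell_lc   (cell_id INTEGER PRIMARY KEY, lc_id INTEGER);
  CREATE TABLE cell_dist (cell_id INTEGER PRIMARY KEY, dist_id INTEGER);

The lookup on cell_id then goes straight through the rowid, and the
separate idx_cell_lc and idx_cell_dist indexes become unnecessary.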

When you posted before, you mentioned that you had multi-TB hard
drives, tons of memory, quad CPUs, etc., and didn't care about saving
space.  I'd denormalize the data, putting the lc and dist fields in the
cell table.  That eliminates joins #2 and #3.
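
A rough sketch of that, assuming one lc row and one dist row per cell;
the column names lc_code and dist_frac are invented stand-ins for
whichever lc and dist fields you actually use:

  ALTER TABLE cell ADD COLUMN lc_code INTEGER;   -- hypothetical lc field
  ALTER TABLE cell ADD COLUMN dist_frac REAL;    -- hypothetical dist field

  UPDATE cell SET lc_code =
    (SELECT l.lc_code FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
      WHERE c.cell_id = cell.cell_id);

Repeat for each column you need, and query [1] alone then returns what
queries [1] through [3] return today.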

You don't really say what you're doing with the results of the queries
below.  Are you doing a bunch of Perlish stuff with them?  It's
possible that most of your time is being spent in the Perl interpreter
rather than in SQLite.  To check, run the queries with the sqlite3
command-line tool and send the output to /dev/null.  If they finish
quickly, the overhead is in manipulating the results, not in the
queries themselves.
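
Something like this, with carbonmodel.db and the literal cell id
standing in for your real file and a representative :cell_id:

  $ cat queries.sql
  SELECT * FROM cell WHERE cell_id = 12345;
  SELECT l.* FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
    WHERE c.cell_id = 12345;
  SELECT d.* FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
    WHERE c.cell_id = 12345;

  $ time sqlite3 carbonmodel.db < queries.sql > /dev/null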

Good luck!
Jim


On 3/21/09, P Kishor <punk.k...@gmail.com> wrote:
> Part 1.
> -------
>
> I have the following schema in a SQLite db that is 430 MB on my
> Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
> cache.
>
> -- 1,000,000 rows
> CREATE TABLE cell (
>   cell_id INTEGER PRIMARY KEY,
>   met_cell_id INTEGER,
>   8 other INTEGER or REAL columns
> )
>
> -- 38 rows
> CREATE TABLE lc (
>   lc_id INTEGER PRIMARY KEY,
>   56 other INTEGER or REAL columns
> )
>
> -- 10 rows
> CREATE TABLE dist (
>   dist_id INTEGER PRIMARY KEY,
>   37 other INTEGER or REAL columns
> )
>
> -- 2,920,000 rows
> CREATE TABLE met (
>   met_id INTEGER PRIMARY KEY,
>   met_cell_id INTEGER,
>   9 other INTEGER or REAL columns
> )
>
> CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
> CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)
>
> CREATE INDEX idx_met_cell_id ON met (met_cell_id)
> CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
> CREATE INDEX idx_cell_dist ON cell_dist (cell_id)
>
> I also have an R*Tree index, but that is a different story, not relevant
> here.
>
> I retrieve *all* data for one cell ':cell_id' using the following queries
>
> [1] First retrieve all data from cell table
> SELECT * FROM cell WHERE cell_id = :cell_id
>
> [2] Retrieve the related lc
> SELECT l.*
> FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
> WHERE c.cell_id = :cell_id
>
> [3] Retrieve the related dist
> SELECT d.*
> FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
> WHERE c.cell_id = :cell_id
>
> [4] Retrieve the related met
> SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
>
> I did some benchmarking with the above schema using Perl DBI, and I
> get about 30 transactions per second as long as I am returning the
> data to memory.
>
> [08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl
>
> timethis 1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @
> 33.33/s (n=1)
>
> timethis 10:  0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @
> 30.30/s (n=10)
>
> timethis 100:  3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @
> 32.79/s (n=100)
>
> timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @
> 30.96/s (n=1000)
>
> If I write the data to a file, the speed drops to about 1 transaction
> per second
>
> timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
> CPU) @  1.33/s (n=1000)
>
> Even if I stick with manipulating the data in memory, at 30
> transactions per second (or 33 ms per transaction), it would take more
> than 9 hours to query each of the 1 million cells one by one.
>
> In the real world, I will first find the relevant cell ids based on
> lat-lon bounds (hence my R*Tree index) and then extract their data one
> by one.
>
> How can I, if at all, speed this up?
>
> Part 2.
> -------
>
> Alternatively, I could denormalize the data completely. Inspired by a
> post on the Flickr blog
> (http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
> in particular the para "To make this data available quickly from the
> server, we maintain and update a per-member cache in our database,
> where we store each member’s contact list in a text blob — this way
> it’s a single quick DB query to retrieve it. We can format this blob
> in any way we want: XML, JSON, etc" I decided to experiment with the
> same technique. So...
>
> CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);
>
> I then queried each cell as in Part 1, serialized it and stored it in
> the cell_blobs table. My intent is to simply retrieve a BLOB and
> deserialize it... it would *possibly* be quicker than 33 ms per
> retrieval. Well, I haven't yet completed this test because each BLOB
> is taking about 430 KB. At 1 million rows, that is going to occupy
> upward of 400 GB. I stopped the load_blob_table routine after about a
> third of the records had been processed because even loading the
> blobs was excruciatingly slow.
>
> Suggestions? I am posting the same query on Perlmonks so hopefully I
> will have a wealth of suggestions.
>
>
> --
> Puneet Kishor http://www.punkish.org/
> Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
> Carbon Model http://carbonmodel.org/
> Open Source Geospatial Foundation http://www.osgeo.org/
> Sent from: Madison WI United States.


-- 
Software first.  Software lasts!