[sqlite] speeding up row by row lookup in a large db

P Kishor Sat, 21 Mar 2009 07:31:50 -0700

Part 1.
-------

I have the following schema in a SQLite db that is 430 MB on my
Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
cache.


-- 1000,000 rows
CREATE TABLE cell (
  cell_id INTEGER PRIMARY KEY,
  met_cell_id INTEGER,
  8 other INTEGER or REAL columns
)

-- 38 rows
CREATE TABLE lc (
  lc_id INTEGER PRIMARY KEY,
  56 other INTEGER or REAL columns
)

-- 10 rows
CREATE TABLE dist (
  dist_id INTEGER PRIMARY KEY,
  37 other INTEGER or REAL columns
)

-- 2,920,000
CREATE TABLE met (
  met_id INTEGER PRIMARY KEY,
  met_cell_id INTEGER,
  9 other INTEGER or REAL columns
)

CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)

CREATE INDEX idx_met_cell_id ON met (met_cell_id)
CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
CREATE INDEX idx_cell_dist ON cell_dist (cell_id)

I also have an R*Tree index, but that is a different story, not relevant here.

I retrieve *all* data for one cell ':cell_id' using the following queries

[1] First retrieve all data from cell table
SELECT * FROM cell WHERE cell_id = :cell_id

[2] Now retrieve the related lc, dist and met
SELECT lc.*
FROM lc l JOIN cell_lc c on l.lc_id = c.lc_id
WHERE c.cell_id = :cell_id

[3] Retrieve the related dist
SELECT d.*
FROM dist d JOIN cell_lc c on d.dist_id = c.dist_id
WHERE c.cell_id = :cell_id

[4] Retrieve the related met
SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>

I did some benchmarking with the above schema using Perl DBI, and I
get about 30 transactions per second as long as I returning the data
to memory.

[08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl

timethis 1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @
33.33/s (n=1)

timethis 10:  0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @
30.30/s (n=10)

timethis 100:  3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @
32.79/s (n=100)

timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @
30.96/s (n=1000)

if I write the data to file, the speed drops to about 1 transaction per second

timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
CPU) @  1.33/s (n=1000)

Even if I stick with manipulating the data in memory, at 30
transactions per second (or 33 ms per transaction), it would take more
than 9 hours to query each of the 1 million cells one by one.

In the real world, I will first find the relevant cell ids based on
lat-lon bounds (hence my R*Tree index) and then extract their data one
by one.

How can I, if at all, speed this up?

Part 2.
-------

Alternatively, I could denormalize the data completely. Inspired by a
post on the Flickr blog
(http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
in particular the para "To make this data available quickly from the
server, we maintain and update a per-member cache in our database,
where we store each member’s contact list in a text blob — this way
it’s a single quick DB query to retrieve it. We can format this blob
in any way we want: XML, JSON, etc" I decided to experiment with the
same technique. So...

CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);

I then queried each cell as in Part 1, serialized it and stored it in
the cell_blobs table. My intent is to simply retrieve a BLOB and
deserialize it... it would *possibly* be quicker than 33 ms per
retrieval. Well, I haven't yet completed this test because each BLOB
is taking about 430 KB. At 1 million rows, that is going to occupy
upward of 400 GB. I broke the load_blob_table routine after about a
third of the records had been processed because I found even the
loading_the_blobs to be excruciatingly slow.

Suggestions? I am posting the same query on Perlmonks so hopefully I
will have a wealth of suggestions.


-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] speeding up row by row lookup in a large db

Reply via email to