My apologies for not constructing an all-encompassing, full-picture
email in the first place. Here are more explanations and responses --

On Sat, Mar 21, 2009 at 12:45 PM, Jim Wilcoxson <pri...@gmail.com> wrote:
> I'm not sure I completely understand your data structure, but here are
> some ideas:
>

The data structure is actually very simple.

Each cell has a bunch of attributes.

A couple of those attributes (dist and lc) can appear more than once;
that is, each cell can have zero or more dist and lc values. Hence the
cross-reference (many-to-many) cell_dist and cell_lc tables.

One attribute (met) is a one-to-many lookup; that is, each cell has one
and only one met value, which leads to several years' worth of daily
met (weather) data. Hence the separate met table with met_cell_id as
a FK.

The data structure is not complex at all... it is just a lot of data.
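
To make that concrete, here is a minimal Perl/DBI sketch of how one
cell's data fans out across those tables. It assumes the schema quoted
further down; the database filename and the example cell_id are just
placeholders.

use strict;
use warnings;
use DBI;

# placeholder filename; the real db is the 430 MB file described below
my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
    { RaiseError => 1 });

my $cell_id = 42;    # placeholder cell

# many-to-many: zero or more lc rows per cell, via cell_lc
my $lc = $dbh->selectall_arrayref(
    "SELECT l.* FROM lc l JOIN cell_lc x ON l.lc_id = x.lc_id
     WHERE x.cell_id = ?", undef, $cell_id);

# many-to-many: zero or more dist rows per cell, via cell_dist
my $dist = $dbh->selectall_arrayref(
    "SELECT d.* FROM dist d JOIN cell_dist x ON d.dist_id = x.dist_id
     WHERE x.cell_id = ?", undef, $cell_id);

# one-to-many: each cell's single met_cell_id points at several years
# of daily met (weather) rows
my ($met_cell_id) = $dbh->selectrow_array(
    "SELECT met_cell_id FROM cell WHERE cell_id = ?", undef, $cell_id);
my $met = $dbh->selectall_arrayref(
    "SELECT * FROM met WHERE met_cell_id = ?", undef, $met_cell_id);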

> First, conduct experiments with different page and cache sizes.  I
> don't know if you jumped to a 32K page size or experimented with each
> size, but experimentation is the way to go.

I started with a page_size of 4096 and then jumped to 32768, nothing
in between. The performance with 32K pages was slightly better than
with 4K in my limited testing.
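
For anyone repeating the experiment, a minimal sketch of how such a
test can be scripted with DBD::SQLite (the filename is a placeholder;
page_size only takes effect on a fresh database or after a VACUUM, and
cache_size is counted in pages):

use strict;
use warnings;
use DBI;

# fresh, throwaway database so the page_size pragma actually applies
my $dbh = DBI->connect("dbi:SQLite:dbname=test_32k.db", "", "",
    { RaiseError => 1 });

$dbh->do("PRAGMA page_size = 32768");   # set before creating any table
$dbh->do("PRAGMA cache_size = 20000");  # 20000 x 32 KB pages ~ 625 MB
# ... then create the tables, load a sample, and time the queries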

>
> I'm guessing that in your cell_lc and cell_dist tables, rows are
> unique on cell_id.  If so, you could make cell_id the primary key and
> avoid the indexes altogether.

No, they are cross-reference tables (see above), so each combination
of (cell_id, lc_id) or (cell_id, dist_id) is unique.
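
Continuing the sketch above (and to be clear, these composite UNIQUE
indexes are not part of my posted schema, only an illustration),
enforcing that uniqueness would also give queries [2] and [3] covering
indexes:

# illustration only; reuses $dbh from the earlier sketch
$dbh->do("CREATE UNIQUE INDEX idx_cell_lc_pair
              ON cell_lc (cell_id, lc_id)");
$dbh->do("CREATE UNIQUE INDEX idx_cell_dist_pair
              ON cell_dist (cell_id, dist_id)");
# with these, SQLite can read lc_id/dist_id straight from the index
# without touching the cell_lc/cell_dist tables themselves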


>
> When you posted before, you mentioned that you had multi-TB hard
> drives, tons of memory, quad CPU's, etc, and didn't care about saving
> space.  I'd denormalize the data, putting lc and dist fields in the
> cell table.  That will eliminate join #2 and join #3.

That is exactly what I am trying to do with my BLOB approach. By
storing a pre-formed array of arrays as a BLOB, I am, in effect,
denormalizing my data. That darn process, however, is taking a very
long time.
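
Roughly, the loading looks like the sketch below. Storable is one way
to do the serialization; assemble_cell() is only a stand-in for the
per-cell queries quoted in Part 1 below, not my actual code.

use strict;
use warnings;
use DBI qw(:sql_types);
use Storable qw(nfreeze);

my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
    { RaiseError => 1, AutoCommit => 0 });

my $ins = $dbh->prepare(
    "INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)");

for my $cell_id (1 .. 1_000_000) {
    my $cell = assemble_cell($dbh, $cell_id);
    $ins->bind_param(1, $cell_id);
    $ins->bind_param(2, nfreeze($cell), SQL_BLOB);  # store a real BLOB
    $ins->execute;
    $dbh->commit if $cell_id % 10_000 == 0;         # commit in batches
}
$dbh->commit;

# stand-in for queries [1]-[4] in Part 1; the real version also pulls
# the related lc, dist and met rows into the same structure
sub assemble_cell {
    my ($dbh, $cell_id) = @_;
    return {
        cell => $dbh->selectrow_hashref(
            "SELECT * FROM cell WHERE cell_id = ?", undef, $cell_id),
    };
}

Reading a cell back is then a single SELECT on cell_blobs followed by
Storable::thaw.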


>
> You don't really say what you're doing with the results of these
> queries below.  Are you doing a bunch of Perlish stuff with the
> results?  It's possible that most of your time is being spent in the
> Perl interpreter rather than in SQLite.  To check this, try sending
> the output to /dev/null and using the sqlite3 command line tool to
> execute your queries.  If they happen quickly, then the overhead is in
> the manipulation of the results, not the queries.

Yes, that is a possibility, but my (admittedly unscientific) hunch is
that the bottleneck is not Perl. I am not doing much on the Perl side
between queries, and DBI and DBD::SQLite are themselves written in C.
In the end, as Kees led me to conclude, I may simply have reached a
physical limit with my data and hardware, in which case I will just
live with that limit and look at other innovations.
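
That said, a cheap way to check where the time goes is DBI's built-in
profiling. A sketch, with the filename again a placeholder:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
    { RaiseError => 1 });

# per-statement timings are accumulated and dumped when the handle is
# destroyed; if the DBI/DBD::SQLite time accounts for most of the
# wallclock, the queries, not the surrounding Perl, are the bottleneck
$dbh->{Profile} = "!Statement";

# ... run the per-cell queries here exactly as in carbonmodel.pl ...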

What am I doing with the results? The results of each cell query are
fed into a model that calculates Carbon and Nitrogen budgets over a
very large area (a million-plus sq. km), iterating spatially on a
cell-by-cell basis and temporally on a daily basis over long time
periods (several centuries).

>
> Good luck!
> Jim
>
>
> On 3/21/09, P Kishor <punk.k...@gmail.com> wrote:
>> Part 1.
>> -------
>>
>> I have the following schema in a SQLite db that is 430 MB on my
>> Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
>> cache.
>>
>> -- 1,000,000 rows
>> CREATE TABLE cell (
>>   cell_id INTEGER PRIMARY KEY,
>>   met_cell_id INTEGER,
>>   8 other INTEGER or REAL columns
>> )
>>
>> -- 38 rows
>> CREATE TABLE lc (
>>   lc_id INTEGER PRIMARY KEY,
>>   56 other INTEGER or REAL columns
>> )
>>
>> -- 10 rows
>> CREATE TABLE dist (
>>   dist_id INTEGER PRIMARY KEY,
>>   37 other INTEGER or REAL columns
>> )
>>
>> -- 2,920,000 rows
>> CREATE TABLE met (
>>   met_id INTEGER PRIMARY KEY,
>>   met_cell_id INTEGER,
>>   9 other INTEGER or REAL columns
>> )
>>
>> CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
>> CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)
>>
>> CREATE INDEX idx_met_cell_id ON met (met_cell_id)
>> CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
>> CREATE INDEX idx_cell_dist ON cell_dist (cell_id)
>>
>> I also have an R*Tree index, but that is a different story, not relevant
>> here.
>>
>> I retrieve *all* data for one cell ':cell_id' using the following queries
>>
>> [1] First retrieve all data from cell table
>> SELECT * FROM cell WHERE cell_id = :cell_id
>>
>> [2] Now retrieve the related lc, dist and met
>> SELECT l.*
>> FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
>> WHERE c.cell_id = :cell_id
>>
>> [3] Retrieve the related dist
>> SELECT d.*
>> FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
>> WHERE c.cell_id = :cell_id
>>
>> [4] Retrieve the related met
>> SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
>>
>> I did some benchmarking with the above schema using Perl DBI, and I
>> get about 30 transactions per second as long as I am returning the data
>> to memory.
>>
>> [08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl
>>
>> timethis 1:  0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU) @
>> 33.33/s (n=1)
>>
>> timethis 10:  0 wallclock secs ( 0.31 usr +  0.02 sys =  0.33 CPU) @
>> 30.30/s (n=10)
>>
>> timethis 100:  3 wallclock secs ( 2.85 usr +  0.20 sys =  3.05 CPU) @
>> 32.79/s (n=100)
>>
>> timethis 1000: 33 wallclock secs (31.08 usr +  1.22 sys = 32.30 CPU) @
>> 30.96/s (n=1000)
>>
>> If I write the data to file, the speed drops to about 1 transaction per
>> second
>>
>> timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
>> CPU) @  1.33/s (n=1000)
>>
>> Even if I stick with manipulating the data in memory, at 30
>> transactions per second (or 33 ms per transaction), it would take more
>> than 9 hours to query each of the 1 million cells one by one.
>>
>> In the real world, I will first find the relevant cell ids based on
>> lat-lon bounds (hence my R*Tree index) and then extract their data one
>> by one.
>>
>> How can I, if at all, speed this up?
>>
>> Part 2.
>> -------
>>
>> Alternatively, I could denormalize the data completely. Inspired by a
>> post on the Flickr blog
>> (http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
>> in particular the para "To make this data available quickly from the
>> server, we maintain and update a per-member cache in our database,
>> where we store each member’s contact list in a text blob — this way
>> it’s a single quick DB query to retrieve it. We can format this blob
>> in any way we want: XML, JSON, etc" I decided to experiment with the
>> same technique. So...
>>
>> CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);
>>
>> I then queried each cell as in Part 1, serialized it and stored it in
>> the cell_blobs table. My intent is to simply retrieve a BLOB and
>> deserialize it... it would *possibly* be quicker than 33 ms per
>> retrieval. Well, I haven't yet completed this test because each BLOB
>> is taking about 430 KB. At 1 million rows, that is going to occupy
>> upward of 400 GB. I stopped the load_blob_table routine after about
>> a third of the records had been processed because even loading the
>> blobs was excruciatingly slow.
>>
>> Suggestions? I am posting the same query on Perlmonks so hopefully I
>> will have a wealth of suggestions.
>>
>>
>> --
>> Puneet Kishor http://www.punkish.org/
>> Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
>> Carbon Model http://carbonmodel.org/
>> Open Source Geospatial Foundation http://www.osgeo.org/
>> Sent from: Madison WI United States.
>>
>
>
> --
> Software first.  Software lasts!
>



-- 
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
