My apologies for not constructing an all-encompassing, full-picture email in the first place. Here are more explanations and responses --
On Sat, Mar 21, 2009 at 12:45 PM, Jim Wilcoxson <pri...@gmail.com> wrote:
> I'm not sure I completely understand your data structure, but here are
> some ideas:

The data structure is actually very simple. Each cell has a bunch of
attributes. A couple of those attributes (dist and lc) can appear more
than once, that is, each cell can have 0 or more dist and lc. Hence the
cross-reference (many-to-many) cell_dist and cell_lc tables. One
attribute (met) is a one-to-many lookup, that is, each cell has one and
only one met value that leads to several years' worth of daily met
(weather) data. Hence the separate met table with met_cell_id as a FK.
The data structure is not complex at all... it is just a lot of data.

> First, conduct experiments with different page and cache sizes. I
> don't know if you jumped to a 32K page size or experimented with each
> size, but experimentation is the way to go.

I started with a page_size of 4096 and then jumped to 32768, nothing in
between. The performance with 32K was slightly better than with 4096 in
my limited testing.

> I'm guessing that in your cell_lc and cell_dist tables, rows are
> unique on cell_id. If so, you could make cell_id the primary key and
> avoid the indexes altogether.

No, they are cross-reference tables (see above), so each combination of
cell_id,lc_id or cell_id,dist_id is unique.

> When you posted before, you mentioned that you had multi-TB hard
> drives, tons of memory, quad CPUs, etc., and didn't care about saving
> space. I'd denormalize the data, putting lc and dist fields in the
> cell table. That will eliminate join #2 and join #3.

That is exactly what I am trying to do with my BLOB approach. By storing
a pre-formed array of arrays as a BLOB, I am, in effect, denormalizing
my data. That darn process, however, is taking a very long time.
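In sketch form, the load step is something like the following (Storable,
the db file name, the batch size, and the fetch_cell() helper are
placeholders rather than my actual code; fetch_cell() stands in for the
four per-cell queries from Part 1 of my original message, quoted below):

    # load one serialized BLOB per cell into cell_blobs
    use DBI qw(:sql_types);
    use Storable qw(freeze);

    my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
        { RaiseError => 1, AutoCommit => 1 });

    my $ins = $dbh->prepare(
        "INSERT INTO cell_blobs (cell_id, cell_data) VALUES (?, ?)");

    $dbh->begin_work;
    for my $cell_id (1 .. 1_000_000) {
        my $data = fetch_cell($dbh, $cell_id);  # runs queries [1]-[4] for this cell
        my $blob = freeze($data);               # serialize the array of arrays
        $ins->bind_param(1, $cell_id);
        $ins->bind_param(2, $blob, SQL_BLOB);   # bind as a BLOB, not text
        $ins->execute;
        unless ($cell_id % 10_000) {            # commit in batches of 10,000 rows
            $dbh->commit;
            $dbh->begin_work;
        }
    }
    $dbh->commit;

With AutoCommit on, each INSERT would be its own transaction and force
its own sync, so batching the commits is the first thing I am
double-checking in the real load routine.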
> You don't really say what you're doing with the results of these
> queries below. Are you doing a bunch of Perlish stuff with the
> results? It's possible that most of your time is being spent in the
> Perl interpreter rather than in SQLite. To check this, try sending
> the output to /dev/null and using the sqlite3 command line tool to
> execute your queries. If they happen quickly, then the overhead is in
> the manipulation of the results, not the queries.

Yes, that is a possibility but, without any scientific basis for it, I
think the bottleneck is not Perl. Besides, I am not going in and out of
Perl, and Perl DBI and DBD::SQLite are written in C anyway, so my
(non-scientific) hunch is that it is not the bottleneck. In the end, as
Kees led me to conclude, I might simply have reached a physical limit
with my data and hardware, in which case I just live with that limit and
look at other innovations.

What am I doing with the results? The results of each cell query are fed
into a model that calculates Carbon and Nitrogen budgets over a very
large area (million+ sq. km), iterating spatially on a cell-by-cell
basis and, over long time periods (several centuries), temporally on a
daily basis.
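A quick aside on the where-is-the-time-going question: one way to check
would be to time just the bare fetches with prepared statements and
throw the rows away. A minimal sketch -- the db file name is a
placeholder, and the queries are the four from Part 1 of my original
message, quoted below:

    # time only the queries, discarding the rows
    use DBI;
    use Benchmark qw(timethis);

    my $dbh = DBI->connect("dbi:SQLite:dbname=carbonmodel.db", "", "",
        { RaiseError => 1 });

    my $q_cell = $dbh->prepare("SELECT * FROM cell WHERE cell_id = ?");
    my $q_lc   = $dbh->prepare("SELECT l.* FROM lc l JOIN cell_lc c " .
                               "ON l.lc_id = c.lc_id WHERE c.cell_id = ?");
    my $q_dist = $dbh->prepare("SELECT d.* FROM dist d JOIN cell_dist c " .
                               "ON d.dist_id = c.dist_id WHERE c.cell_id = ?");
    my $q_met  = $dbh->prepare("SELECT * FROM met WHERE met_cell_id = ?");

    timethis(1000, sub {
        my $cell_id = int(rand(1_000_000)) + 1;              # random cell
        my $cell = $dbh->selectrow_arrayref($q_cell, undef, $cell_id);
        my $met_cell_id = $cell->[1];                         # second column of cell
        $dbh->selectall_arrayref($q_lc,   undef, $cell_id);   # fetch and discard
        $dbh->selectall_arrayref($q_dist, undef, $cell_id);
        $dbh->selectall_arrayref($q_met,  undef, $met_cell_id);
    });

If that comes in well under 33 ms per iteration, the time is going into
the Perl-side handling of the results; if not, it really is the queries.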
> Good luck!
> Jim
>
> On 3/21/09, P Kishor <punk.k...@gmail.com> wrote:
>> Part 1.
>> -------
>>
>> I have the following schema in a SQLite db that is 430 MB on my
>> Macbook laptop's 320 GB HFS+ formatted 7200 RPM disk with an 8 MB
>> cache.
>>
>> -- 1,000,000 rows
>> CREATE TABLE cell (
>>   cell_id INTEGER PRIMARY KEY,
>>   met_cell_id INTEGER,
>>   8 other INTEGER or REAL columns
>> )
>>
>> -- 38 rows
>> CREATE TABLE lc (
>>   lc_id INTEGER PRIMARY KEY,
>>   56 other INTEGER or REAL columns
>> )
>>
>> -- 10 rows
>> CREATE TABLE dist (
>>   dist_id INTEGER PRIMARY KEY,
>>   37 other INTEGER or REAL columns
>> )
>>
>> -- 2,920,000 rows
>> CREATE TABLE met (
>>   met_id INTEGER PRIMARY KEY,
>>   met_cell_id INTEGER,
>>   9 other INTEGER or REAL columns
>> )
>>
>> CREATE TABLE cell_lc (cell_id INTEGER, lc_id INTEGER)
>> CREATE TABLE cell_dist (cell_id INTEGER, dist_id INTEGER)
>>
>> CREATE INDEX idx_met_cell_id ON met (met_cell_id)
>> CREATE INDEX idx_cell_lc ON cell_lc (cell_id)
>> CREATE INDEX idx_cell_dist ON cell_dist (cell_id)
>>
>> I also have an R*Tree index, but that is a different story, not
>> relevant here.
>>
>> I retrieve *all* data for one cell ':cell_id' using the following
>> queries
>>
>> [1] First retrieve all data from the cell table
>> SELECT * FROM cell WHERE cell_id = :cell_id
>>
>> [2] Now retrieve the related lc
>> SELECT l.*
>> FROM lc l JOIN cell_lc c ON l.lc_id = c.lc_id
>> WHERE c.cell_id = :cell_id
>>
>> [3] Retrieve the related dist
>> SELECT d.*
>> FROM dist d JOIN cell_dist c ON d.dist_id = c.dist_id
>> WHERE c.cell_id = :cell_id
>>
>> [4] Retrieve the related met
>> SELECT * FROM met WHERE met_cell_id = <met_cell_id from query [1] above>
>>
>> I did some benchmarking with the above schema using Perl DBI, and I
>> get about 30 transactions per second as long as I am returning the
>> data to memory.
>>
>> [08:38 AM] ~/Data/carbonmodel$perl carbonmodel.pl
>>
>> timethis 1: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @
>> 33.33/s (n=1)
>>
>> timethis 10: 0 wallclock secs ( 0.31 usr + 0.02 sys = 0.33 CPU) @
>> 30.30/s (n=10)
>>
>> timethis 100: 3 wallclock secs ( 2.85 usr + 0.20 sys = 3.05 CPU) @
>> 32.79/s (n=100)
>>
>> timethis 1000: 33 wallclock secs (31.08 usr + 1.22 sys = 32.30 CPU) @
>> 30.96/s (n=1000)
>>
>> If I write the data to a file, the speed drops to about 1 transaction
>> per second
>>
>> timethis 1000: 783 wallclock secs (732.26 usr + 18.22 sys = 750.48
>> CPU) @ 1.33/s (n=1000)
>>
>> Even if I stick with manipulating the data in memory, at 30
>> transactions per second (or 33 ms per transaction), it would take
>> more than 9 hours to query each of the 1 million cells one by one.
>>
>> In the real world, I will first find the relevant cell ids based on
>> lat-lon bounds (hence my R*Tree index) and then extract their data
>> one by one.
>>
>> How can I, if at all, speed this up?
>>
>> Part 2.
>> -------
>>
>> Alternatively, I could denormalize the data completely. Inspired by a
>> post on the Flickr blog
>> (http://code.flickr.com/blog/2009/03/18/building-fast-client-side-searches/),
>> in particular the para "To make this data available quickly from the
>> server, we maintain and update a per-member cache in our database,
>> where we store each member’s contact list in a text blob — this way
>> it’s a single quick DB query to retrieve it. We can format this blob
>> in any way we want: XML, JSON, etc" I decided to experiment with the
>> same technique. So...
>>
>> CREATE TABLE cell_blobs (cell_id INTEGER PRIMARY KEY, cell_data BLOB);
>>
>> I then queried each cell as in Part 1, serialized it and stored it in
>> the cell_blobs table. My intent is to simply retrieve a BLOB and
>> deserialize it... it would *possibly* be quicker than 33 ms per
>> retrieval. Well, I haven't yet completed this test because each BLOB
>> is taking about 430 KB. At 1 million rows, that is going to occupy
>> upward of 400 GB. I stopped the load_blob_table routine after about a
>> third of the records had been processed because I found even the
>> loading of the blobs to be excruciatingly slow.
>>
>> Suggestions? I am posting the same query on Perlmonks so hopefully I
>> will have a wealth of suggestions.
>>
>>
>> --
>> Puneet Kishor http://www.punkish.org/
>> Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
>> Carbon Model http://carbonmodel.org/
>> Open Source Geospatial Foundation http://www.osgeo.org/
>> Sent from: Madison WI United States.
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
>
> --
> Software first. Software lasts!

--
Puneet Kishor http://www.punkish.org/
Nelson Institute for Environmental Studies http://www.nelson.wisc.edu/
Carbon Model http://carbonmodel.org/
Open Source Geospatial Foundation http://www.osgeo.org/
Sent from: Madison WI United States.
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users