I have ~500,000 data files, each containing ~1,000 records, that I want to put into a database for easy access. A fictitious example for illustration: a file with meteorological data from a given station.

stat_id |  yr  | d_o_y | hr | mn | tmp  | wind
---------|------|-------|----|----|------|------
   78   | 2001 |  211  | 14 |  0 | 15.3 |  4.7
   78   | 2001 |  211  | 14 |  1 | 15.4 |  5.4
   78   | 2001 |  211  | 14 |  2 | 15.4 |  5.0
   ..   | .... |  ...  | .. |  . | .... |  ...
   78   | 2001 |  211  | 14 | 24 | 15.8 |  2.7

Some parameters are constant within any given file, e.g. the station id. Others are constant for only a certain fraction of the files (hr in ~58%, d_o_y in ~98%, yr in ~99.995%, due to the oddly chosen 25 minutes per file).

It seems like a table of file summaries (one row per file) would be very useful. Some of its columns and rows could be:

   file     | st_id |  yr  | d_o_y | avg_tmp | std_dev_tmp
-------------|-------|------|-------|---------|--------------
78_f6548.dat |   78  | 2001 |  211  |   15.5  |    0.24
57_f4296.dat |   57  | 2000 |   -1  |   8.3   |    0.11
   ...      |  ...  |  ... |  ...  |   ...   |    ...
(-1 for day of year indicates an odd file spanning across midnight)
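
In SQL this summary table could be sketched roughly like this (the column names are from the example above; the types are just my guess at a reasonable design):

-- Hypothetical summary table: one row per data file.
CREATE TABLE file_summary (
    file        text PRIMARY KEY,  -- original file name
    st_id       integer,           -- station id
    yr          integer,           -- year (if constant in the file)
    d_o_y       integer,           -- day of year, -1 if the file spans midnight
    avg_tmp     real,              -- average temperature over the file
    std_dev_tmp real               -- temperature standard deviation
);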

To store the detailed records, the SQL novice would construct one table per file and exclude any constant columns, since these are redundant (already given in the summary table). These detail tables would then have different column schemas: (mn,tmp,wind) for ~58% of the files, (hr,mn,tmp,wind) for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2%, and (yr,d_o_y,hr,mn,tmp,wind) for ~0.005%.
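
For illustration, two of those per-file detail tables might look like this (the table names are made up from the file names in the summary example):

-- ~58% of the files: yr, d_o_y and hr are all constant, so only mn varies
CREATE TABLE data_78_f6548 (mn integer, tmp real, wind real);

-- ~40% of the files: hr also varies within the file
CREATE TABLE data_57_f4296 (hr integer, mn integer, tmp real, wind real);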

SQL will discourage this novice: to enable simple queries, the data tables must be combined. Say I want to find all summer data with a high temperature standard deviation and low wind. If I made the ultimate table for easy queries, containing all columns from both the detailed data and the file summary, I could find the data like this:

SELECT file, std_dev_tmp, tmp, wind
FROM huge_table
WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0;

The ability to make this easy query comes at the cost of a ridiculous amount of redundant data in huge_table. To remove everything that is already given by the summary table, I would have to split the huge table into a number of tables, one for each of the column schemas mentioned above (in my case many more than in the example). That is difficult to build and gives no easy queries anyway: I'd need to write a C program to launch the many per-schema queries that would, in effect, return the same data as the single query above. That defeats the purpose of easy, efficient access.
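
To make that concrete, just one of those many queries might look like the sketch below (assuming a table detail_mn_only holding all the (mn,tmp,wind)-style rows plus a file column), and a similar query would be needed for every other schema:

SELECT s.file, s.std_dev_tmp, d.tmp, d.wind
FROM file_summary s
JOIN detail_mn_only d ON d.file = s.file
WHERE s.std_dev_tmp > 0.2 AND s.d_o_y > 150 AND s.d_o_y < 240 AND d.wind < 4.0;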

This novice must be missing a sneaky way to avoid massive redundancy while still maintaining easy access. It has been suggested that I look at inheritance and foreign keys. I don't see how foreign keys would help, but I could make an inheritance hierarchy of the split-up tables to make the parent table's columns (mn, tmp, wind in the example above) available across tables. But accessing the rest of the columns still requires child-table-specific queries, and my many child tables will not be nicely nested as in the example above, so I still don't see how it gets me there. I could really use some input.
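
In case it clarifies what I mean, the inheritance idea I'm picturing is something like this (made-up names again):

-- parent holds the columns shared by all detail tables
CREATE TABLE detail_base (file text, mn integer, tmp real, wind real);

-- children add the columns that are not constant for their files
CREATE TABLE detail_hr  (hr integer)                INHERITS (detail_base);
CREATE TABLE detail_doy (d_o_y integer, hr integer) INHERITS (detail_base);

-- selecting from the parent also scans the children, but only exposes the
-- parent's columns; hr and d_o_y still need child-specific queries
SELECT file, mn, tmp, wind FROM detail_base WHERE wind < 4.0;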

One more thought: PostgreSQL adds a system column, tableoid, to every table, and it holds the same value on every row of a given table. Is that value really stored 1,000 times for a 1,000-row table? Or...?
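
For example, it shows up on every row of any table (my_table is just a placeholder):

SELECT tableoid, tableoid::regclass AS table_name, * FROM my_table LIMIT 3;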

(Apologies for the length of this post)
