I have ~500,000 data files, each containing ~1,000 records, that I want to put into a database for easy access. A fictitious example for illustration: a file with meteorological data from a given station.

stat_id |  yr  | d_o_y | hr | mn | tmp  | wind
---------|------|-------|----|----|------|------
   78   | 2001 |  211  | 14 |  0 | 15.3 |  4.7
   78   | 2001 |  211  | 14 |  1 | 15.4 |  5.4
   78   | 2001 |  211  | 14 |  2 | 15.4 |  5.0
   ..   | .... |  ...  | .. |  . | .... |  ...
   78   | 2001 |  211  | 14 | 24 | 15.8 |  2.7

Some parameters are constant within any given file, e.g. the station id. Others are constant for only a certain fraction of the files (hr in ~58%, d_o_y in ~98%, yr in ~99.995%, due to the oddly chosen 25 minutes per file).

It seems like a table of file summaries (one row per file) would be very useful. Some of its columns and rows could be:

   file     | st_id |  yr  | d_o_y | avg_tmp | std_dev_tmp
-------------|-------|------|-------|---------|--------------
78_f6548.dat |   78  | 2001 |  211  |   15.5  |    0.24
57_f4296.dat |   57  | 2000 |   -1  |   8.3   |    0.11
   ...      |  ...  |  ... |  ...  |   ...   |    ...
(-1 for day of year indicates an odd file spanning across midnight)
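
In SQL this summary table could be sketched roughly like this (the column names are from the example above; the types are just my guess at a reasonable design):

-- Hypothetical summary table: one row per data file.
CREATE TABLE file_summary (
    file        text PRIMARY KEY,  -- original file name
    st_id       integer,           -- station id
    yr          integer,           -- year (if constant in the file)
    d_o_y       integer,           -- day of year, -1 if the file spans midnight
    avg_tmp     real,              -- average temperature over the file
    std_dev_tmp real               -- temperature standard deviation
);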

To store the detailed records, the SQL novice would construct one table per file and exclude any constant columns, since these are redundant (already given in the summary table). These detail tables would then have different column schemas: (mn,tmp,wind) for ~58% of the files, (hr,mn,tmp,wind) for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2%, and (yr,d_o_y,hr,mn,tmp,wind) for ~0.005%.
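
For illustration, two of those per-file detail tables might look like this (the table names are made up from the file names in the summary example):

-- ~58% of the files: yr, d_o_y and hr are all constant, so only mn varies
CREATE TABLE data_78_f6548 (mn integer, tmp real, wind real);

-- ~40% of the files: hr also varies within the file
CREATE TABLE data_57_f4296 (hr integer, mn integer, tmp real, wind real);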

SQL will discourage this novice: to enable simple queries, the data tables must be combined. Say I want to find all summer data with a high temperature standard deviation and low wind. If I made the ultimate table for easy queries, containing all columns from both the detailed data and the file summary, I could find the data like this:

SELECT file, std_dev_tmp, tmp, wind
FROM huge_table
WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0;

The ability to make this easy query comes at the cost of a ridiculous amount of redundant data in huge_table. To remove everything that is already given by the summary table, I would have to split the huge table into a number of tables, one for each of the column schemas mentioned above (in my case many more than in the example). That is difficult to build and gives no easy queries anyway: I'd need to write a C program to launch the many per-schema queries that would, in effect, return the same data as the single query above. That defeats the purpose of easy, efficient access.
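
To make that concrete, just one of those many queries might look like the sketch below (assuming a table detail_mn_only holding all the (mn,tmp,wind)-style rows plus a file column), and a similar query would be needed for every other schema:

SELECT s.file, s.std_dev_tmp, d.tmp, d.wind
FROM file_summary s
JOIN detail_mn_only d ON d.file = s.file
WHERE s.std_dev_tmp > 0.2 AND s.d_o_y > 150 AND s.d_o_y < 240 AND d.wind < 4.0;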

This novice must be missing a sneaky way to avoid massive redundancy while still maintaining easy access. It has been suggested that I look at inheritance and foreign keys. I don't see how foreign keys would help, but I could make an inheritance hierarchy of the split-up tables to make the parent table's columns (mn, tmp, wind in the example above) available across tables. But accessing the rest of the columns still requires child-table-specific queries, and my many child tables will not be nicely nested as in the example above, so I still don't see how it gets me there. I could really use some input.
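
In case it clarifies what I mean, the inheritance idea I'm picturing is something like this (made-up names again):

-- parent holds the columns shared by all detail tables
CREATE TABLE detail_base (file text, mn integer, tmp real, wind real);

-- children add the columns that are not constant for their files
CREATE TABLE detail_hr  (hr integer)                INHERITS (detail_base);
CREATE TABLE detail_doy (d_o_y integer, hr integer) INHERITS (detail_base);

-- selecting from the parent also scans the children, but only exposes the
-- parent's columns; hr and d_o_y still need child-specific queries
SELECT file, mn, tmp, wind FROM detail_base WHERE wind < 4.0;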

One more thought: PostgreSQL adds a system column, tableoid, to every table, and it holds the same value on every row of a given table. Is that value really stored 1,000 times for a 1,000-row table? Or...?
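
For example, it shows up on every row of any table (my_table is just a placeholder):

SELECT tableoid, tableoid::regclass AS table_name, * FROM my_table LIMIT 3;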

(Apologies for the length of this post)
