I have been struggling with this for a while now, and have even gone down a few paths but struck out, so I turn now to the community for ideas. First, the problem: store six daily variables for ~25 years for every cell in a grid.
* Number of vars = 6
* Number of cells ~ 13 million
* Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries:

Query one: Retrieve the value of a single var for all or a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.

So, I set about designing the db. The "grid" is in a table with 13 million rows

    CREATE TABLE cells (
        cell_id INTEGER,
        other_data ..
    )
    WITH ( OIDS=FALSE )

A single table *where every row is one day's values for one cell* looks like so

    CREATE TABLE d (
        yr      SMALLINT,
        yday    SMALLINT,
        a       SMALLINT,
        b       SMALLINT,
        c       SMALLINT,
        d       SMALLINT,
        e       SMALLINT,
        f       SMALLINT,
        cell_id INTEGER
    )
    WITH ( OIDS=FALSE )

The data would look like so

    yr    yday   a  b  c  d  e  f  cell_id
    ---------------------------------------
    1980     1   x  x  x  x  x  x  1
    ..
    1980   365   x  x  x  x  x  x  1
    1981     1   x  x  x  x  x  x  1
    ..
    1981   365   x  x  x  x  x  x  1
    ...
    2005     1   x  x  x  x  x  x  1
    ..
    2005   365   x  x  x  x  x  x  1
    1980     1   x  x  x  x  x  x  2
    ..
    1980   365   x  x  x  x  x  x  2
    ...

I could now (theoretically) conduct my queries like so:

Query 1a: Retrieve the value of a single var for all the cells for a single day.

    SELECT <var> FROM d WHERE yr = ? AND yday = ?;

I assume I would need an index on yr and yday, or perhaps a compound index on (yr, yday).

Query 1b: Retrieve the value of a single var for a portion of the cells for a single day.

    SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);

I assume I would need the same index on (yr, yday) plus an index on cell_id.

Query 2: Retrieve values for all the days or a duration of days for a single var for a single cell.

    SELECT <var> FROM d WHERE cell_id = ?;
    SELECT <var> FROM d WHERE cell_id IN (?,?,?...);

Once again, an index on cell_id would assist in the above.

The problem: the above table would have 13 M * 9125 rows ~ 118 billion rows. Huge indexes, slow queries, etc. In fact, there are major issues just loading the data in the first place. Since I am loading data in batches, I drop the indexes (takes time), COPY data into the table (takes time), rebuild the indexes (takes time), experiment with improving the performance (takes time), fail, rinse, lather, repeat. I actually tried the above with a subset of the data (around 100 M rows) and experienced all of the above. I don't remember the query times, but they were awful.

So, I partitioned the table by year like so

    CREATE TABLE d_<yyyy> (
        CHECK ( yr = <yyyy> )
    ) INHERITS (d)

Hmmm... still no satisfaction. I ended up with 1 master table + 25 inherited tables. Each of the year tables had ~4.75 billion rows (13 M * 365), and the queries were still very slow.

So, I partitioned it all by year and day like so

    CREATE TABLE d_<yyyy>_<yday> (
        CHECK ( yr = <yyyy> AND yday = <yday> )
    ) INHERITS (d)

Each table now has 13 million rows and is reasonably fast (although still not satisfactorily fast), but now I have ~9K tables. That has its own problems: I can't query the master table anymore, as Pg tries to lock all the tables and runs out of memory.
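For concreteness, one of those ~9K per-day children looks roughly like this, along with the kind of index I have been rebuilding after each bulk COPY (the index name is just illustrative, and the exact index set is still something I'm experimenting with):

    -- one per-day child out of ~9125 (25 years * 365 days)
    CREATE TABLE d_1980_1 (
        CHECK ( yr = 1980 AND yday = 1 )
    ) INHERITS (d);

    -- rebuilt after each bulk COPY; intended for query 1b and query 2
    CREATE INDEX d_1980_1_cell_id_idx ON d_1980_1 (cell_id);

    -- query 1a and 1b, aimed at the child directly rather than the master table
    SELECT a FROM d_1980_1;
    SELECT a FROM d_1980_1 WHERE cell_id IN (1, 2, 3);

Aiming queries at the individual children like this is what keeps query one tolerable.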
Additionally, I can no longer conduct query two above. I could do something like

    SELECT a FROM d_1980_1 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_2 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_3 WHERE cell_id = 1
    UNION ALL
    SELECT a FROM d_1980_4 WHERE cell_id = 1
    UNION ALL
    ...

but the above is hardly optimal.

Any suggestions, ideas, or brainstorms would be appreciated. Perhaps Pg, or even an RDBMS, is not the right tool for this problem, in which case suggestions for alternatives would be welcome as well.

Right now I am testing this on an Xserve with two dual-core 3 GHz Xeons and 12 GB RAM. The PGDATA directory is located on an attached RAID configured as RAID 5. A reasonable time for a query would be under 500 ms, although ultimately I would love the query to finish in under 250 ms, perhaps with RAID 10 and a machine with more RAM. I have access to an Xserve with two quad-core 3 GHz Xeons, 32 GB RAM, and an internal RAID, but before I move the data, I want to conclusively prove that Pg is the best solution (or not). Perhaps flat files are better, perhaps blobs are better, or perhaps Pg's array column type is the way to go (a rough sketch of what I mean by that is in the P.S. below). Any ideas/suggestions welcome.

Oh, one more thing. Once the data are loaded and everything is working, the data are read-only. They are historical, so they don't need to be changed.

--
Puneet Kishor
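P.S. To clarify the array column idea -- this is just a sketch, and the table and column names are made up: one row per (cell, var), with all 9125 daily values packed into a smallint[], so query two becomes a single-row fetch (query one, on the other hand, would then mean touching every row).

    -- sketch only: one row per cell per variable, all 9125 daily values in an array
    CREATE TABLE d_arr (
        cell_id INTEGER    NOT NULL,
        var     CHAR(1)    NOT NULL,     -- 'a' .. 'f'
        vals    SMALLINT[] NOT NULL,     -- element 1 = 1980, yday 1
        PRIMARY KEY (cell_id, var)
    );

    -- query two: all days of var 'a' for one cell
    SELECT vals FROM d_arr WHERE cell_id = 1 AND var = 'a';

    -- a duration of days, e.g. yday 1-90 of 1981 (elements 366..455)
    SELECT vals[366:455] FROM d_arr WHERE cell_id = 1 AND var = 'a';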