[GENERAL] Optimizing tuning and table design for large analytics DB

2009-05-07 Thread Rob W

Can anyone point me towards good articles or books that would help a PostgreSQL 
novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?

In this particular case, I need to efficiently analyze approximately 300 
million system log events (i.e. time series data). It's log data, so rows are 
only ever appended to the table and never modified afterwards. Only 90 days' 
worth of data will be retained, so old records need to be deleted periodically. 
Query performance will only matter for small subsets of the data (e.g. when 
analyzing a day's or week's worth of data); the rest of the reports will be run 
in batch mode. There will likely be only one user at a time doing ad-hoc 
queries.

This is a follow-up to the earlier suggestions that PostgreSQL will handle 
the volumes of data I plan to work with, so I figured I'd give it a shot. 

Rob



Re: [GENERAL] Optimizing tuning and table design for large analytics DB

2009-05-07 Thread John R Pierce

Rob W wrote:

Can anyone point me towards good articles or books that would help a PostgreSQL 
novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?

In this particular case, I need to efficiently analyze approximately 300 
million system log events (i.e. time series data). It's log data, so rows are 
only ever appended to the table and never modified afterwards. Only 90 days' 
worth of data will be retained, so old records need to be deleted periodically. 
Query performance will only matter for small subsets of the data (e.g. when 
analyzing a day's or week's worth of data); the rest of the reports will be run 
in batch mode. There will likely be only one user at a time doing ad-hoc 
queries.

This is a follow-up to the earlier suggestions that PostgreSQL will handle the volumes of data I plan to work with, so I figured I'd give it a shot. 
  



One approach to speeding up the handling of time-expired data like this 
is to partition it, maybe by week.  E.g., you create a separate table for 
each of 14 weeks, and have a view that UNIONs them all together for doing 
queries.  You insert your new records into the latest week's table, then 
each week truncate the oldest week's table and switch to using that one 
for the new inserts.  This is more efficient than having one large 
table and deleting individual rows.
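
Something along these lines, for instance (table and column names are made 
up purely for illustration, and a real log table will have more columns):

  -- one table per week, 14 in total; only the first two are shown here
  CREATE TABLE events_w01 (
      event_time  timestamptz NOT NULL,
      event_class text        NOT NULL,
      message     text
  );
  CREATE TABLE events_w02 (LIKE events_w01);
  CREATE INDEX events_w01_time_idx ON events_w01 (event_time);
  CREATE INDEX events_w02_time_idx ON events_w02 (event_time);

  -- a view that presents the weekly tables as one logical table
  -- (extend the UNION ALL list to cover all 14 weeks)
  CREATE VIEW all_events AS
      SELECT * FROM events_w01
      UNION ALL
      SELECT * FROM events_w02;

  -- weekly rotation: empty the oldest table, then point new inserts at it
  TRUNCATE events_w01;

The application (or a small bookkeeping table) just has to remember which 
weekly table is currently receiving inserts.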


You can speed up the inserts somewhat by doing them in batches, for 
instance collecting a few minutes' worth of new records and inserting 
them all as one transaction.  Depending on how many fields of these 
tables are indexed, this can greatly reduce the overhead of maintaining 
those indices.
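
For example (the table name and values are again just illustrative, and if 
your collector can write out a delimited file, COPY is faster still than 
batched INSERTs):

  BEGIN;
  INSERT INTO events_w02 VALUES ('2009-05-07 10:00:01', 'sshd',   'failed login');
  INSERT INTO events_w02 VALUES ('2009-05-07 10:00:03', 'kernel', 'eth0 link up');
  -- ... the rest of the accumulated batch ...
  COMMIT;

  -- or load the whole batch in one statement (the path is hypothetical)
  COPY events_w02 FROM '/var/spool/logloader/batch.csv' WITH CSV;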


See http://www.postgresql.org/docs/current/static/ddl-partitioning.html 
for more on this sort of partitioning.  Above, I mentioned using a UNION ALL 
view to read the whole set of tables as one; that page discusses using 
inheritance instead, which has advantages.
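
Roughly, the inheritance flavour looks like this (same made-up columns as 
before, one child table per week named after its starting date; you would 
use either this or the UNION ALL view, not both):

  -- the parent holds no rows itself; each child holds one week of data
  CREATE TABLE events (
      event_time  timestamptz NOT NULL,
      event_class text        NOT NULL,
      message     text
  );

  CREATE TABLE events_20090427 (
      CHECK (event_time >= '2009-04-27' AND event_time < '2009-05-04')
  ) INHERITS (events);
  CREATE TABLE events_20090504 (
      CHECK (event_time >= '2009-05-04' AND event_time < '2009-05-11')
  ) INHERITS (events);
  CREATE INDEX events_20090504_time_idx ON events_20090504 (event_time);

  -- with constraint exclusion on, a narrow time-range query only scans
  -- the child tables whose CHECK constraints could possibly match
  SET constraint_exclusion = on;
  SELECT count(*) FROM events
  WHERE event_time >= '2009-05-06' AND event_time < '2009-05-07';

  -- aging out the oldest week is then a single cheap statement
  DROP TABLE events_20090427;

The loader inserts directly into the current week's child table (or you can 
put a trigger on the parent to redirect rows for you).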


If your reporting requirements include the sorts of statistics that can 
be precomputed, it can be advantageous to keep a set of running tallies 
in separate tables, like per-hour and per-day counts for each event 
class, which can be used to reduce the amount of bulk querying required 
to generate statistical count reports.  Of course, these tally tables 
also need aging, but there's much, MUCH less data in them, so conventional 
row deletes are probably fine.
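
For example, a per-hour tally table might look like this (names are 
hypothetical; the rollup reads from the parent table in the inheritance 
sketch above, so swap in the view name if you use the UNION ALL approach):

  -- one row per event class per hour
  CREATE TABLE hourly_counts (
      hour        timestamptz NOT NULL,
      event_class text        NOT NULL,
      n           bigint      NOT NULL,
      PRIMARY KEY (hour, event_class)
  );

  -- roll up the hour that just finished (run from cron once an hour)
  INSERT INTO hourly_counts (hour, event_class, n)
  SELECT date_trunc('hour', event_time), event_class, count(*)
  FROM events
  WHERE event_time >= date_trunc('hour', now()) - interval '1 hour'
    AND event_time <  date_trunc('hour', now())
  GROUP BY 1, 2;

  -- aging the tallies with plain deletes is cheap at this volume
  DELETE FROM hourly_counts WHERE hour < now() - interval '90 days';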



