[GENERAL] Optimizing tuning and table design for large analytics DB

2009-05-07 Thread Rob W

Can anyone point me towards good articles or books that would help a PostgreSQL 
novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?

In this particular case, I need to efficiently analyze approximately 300 
million system log events (i.e. time series data). It's log data, so rows are 
only ever appended to the table and never modified afterwards. Only 90 days' 
worth of data will be retained, so old records need to be deleted periodically. 
Query performance will only matter for small subsets of the data (e.g. when 
analyzing a day's or week's worth of data); the rest of the reports will be run 
in batch mode. There will likely be only one user at a time doing ad-hoc 
queries.

This is a follow-up to the earlier suggestions that PostgreSQL will handle 
the volumes of data I plan to work with, so I figured I'd give it a shot. 

Rob



Re: [GENERAL] Optimizing tuning and table design for large analytics DB

2009-05-07 Thread John R Pierce

Rob W wrote:

Can anyone point me towards good articles or books that would help a PostgreSQL 
novice (i.e. me) learn the optimal approaches to setting up a DB for analytics?

In this particular case, I need to efficiently analyze approximately 300 
million system log events (i.e. time series data). It's log data, so rows are 
only ever appended to the table and never modified afterwards. Only 90 days' 
worth of data will be retained, so old records need to be deleted periodically. 
Query performance will only matter for small subsets of the data (e.g. when 
analyzing a day's or week's worth of data); the rest of the reports will be run 
in batch mode. There will likely be only one user at a time doing ad-hoc 
queries.

This is a follow-up to the earlier suggestions that PostgreSQL will handle the volumes of data I plan to work with, so I figured I'd give it a shot. 
  



One approach to speeding up the handling of time-expired data like this 
is to partition it, maybe by week.  E.g., you create a separate table for 
each of 14 weeks, and have a view that UNIONs them all together for doing 
queries.  You insert your new records into the latest week's table, then 
each week truncate the oldest week's table and switch to using that one 
for the new inserts.  This is more efficient than having one large 
table and deleting individual rows.
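
Something along these lines, for instance (table and column names are made 
up purely for illustration, and a real log table will have more columns):

  -- one table per week, 14 in total; only the first two are shown here
  CREATE TABLE events_w01 (
      event_time  timestamptz NOT NULL,
      event_class text        NOT NULL,
      message     text
  );
  CREATE TABLE events_w02 (LIKE events_w01);
  CREATE INDEX events_w01_time_idx ON events_w01 (event_time);
  CREATE INDEX events_w02_time_idx ON events_w02 (event_time);

  -- a view that presents the weekly tables as one logical table
  -- (extend the UNION ALL list to cover all 14 weeks)
  CREATE VIEW all_events AS
      SELECT * FROM events_w01
      UNION ALL
      SELECT * FROM events_w02;

  -- weekly rotation: empty the oldest table, then point new inserts at it
  TRUNCATE events_w01;

The application (or a small bookkeeping table) just has to remember which 
weekly table is currently receiving inserts.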


You can speed up the inserts somewhat by doing them in batches, for 
instance collecting a few minutes' worth of new records and inserting 
them all as one transaction.  Depending on how many fields of these 
tables are indexed, this can greatly reduce the overhead of maintaining 
those indices.
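
For example (the table name and values are again just illustrative, and if 
your collector can write out a delimited file, COPY is faster still than 
batched INSERTs):

  BEGIN;
  INSERT INTO events_w02 VALUES ('2009-05-07 10:00:01', 'sshd',   'failed login');
  INSERT INTO events_w02 VALUES ('2009-05-07 10:00:03', 'kernel', 'eth0 link up');
  -- ... the rest of the accumulated batch ...
  COMMIT;

  -- or load the whole batch in one statement (the path is hypothetical)
  COPY events_w02 FROM '/var/spool/logloader/batch.csv' WITH CSV;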


See http://www.postgresql.org/docs/current/static/ddl-partitioning.html 
for more on this sort of partitioning.  Above, I mentioned using a UNION ALL 
view to read the whole set of tables as one; that page discusses using 
inheritance instead, which has advantages.
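
Roughly, the inheritance flavour looks like this (same made-up columns as 
before, one child table per week named after its starting date; you would 
use either this or the UNION ALL view, not both):

  -- the parent holds no rows itself; each child holds one week of data
  CREATE TABLE events (
      event_time  timestamptz NOT NULL,
      event_class text        NOT NULL,
      message     text
  );

  CREATE TABLE events_20090427 (
      CHECK (event_time >= '2009-04-27' AND event_time < '2009-05-04')
  ) INHERITS (events);
  CREATE TABLE events_20090504 (
      CHECK (event_time >= '2009-05-04' AND event_time < '2009-05-11')
  ) INHERITS (events);
  CREATE INDEX events_20090504_time_idx ON events_20090504 (event_time);

  -- with constraint exclusion on, a narrow time-range query only scans
  -- the child tables whose CHECK constraints could possibly match
  SET constraint_exclusion = on;
  SELECT count(*) FROM events
  WHERE event_time >= '2009-05-06' AND event_time < '2009-05-07';

  -- aging out the oldest week is then a single cheap statement
  DROP TABLE events_20090427;

The loader inserts directly into the current week's child table (or you can 
put a trigger on the parent to redirect rows for you).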


If your reporting requirements include the sorts of statistics that can 
be precomputed, it can be advantageous to keep a set of running tallies 
in separate tables, like per-hour and per-day counts for each event 
class, which can be used to reduce the amount of bulk querying required 
to generate statistical count reports.  Of course, these tally tables 
also need aging, but there's much, MUCH less data in them, so conventional 
row deletes are probably fine.
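
For example, a per-hour tally table might look like this (names are 
hypothetical; the rollup reads from the parent table in the inheritance 
sketch above, so swap in the view name if you use the UNION ALL approach):

  -- one row per event class per hour
  CREATE TABLE hourly_counts (
      hour        timestamptz NOT NULL,
      event_class text        NOT NULL,
      n           bigint      NOT NULL,
      PRIMARY KEY (hour, event_class)
  );

  -- roll up the hour that just finished (run from cron once an hour)
  INSERT INTO hourly_counts (hour, event_class, n)
  SELECT date_trunc('hour', event_time), event_class, count(*)
  FROM events
  WHERE event_time >= date_trunc('hour', now()) - interval '1 hour'
    AND event_time <  date_trunc('hour', now())
  GROUP BY 1, 2;

  -- aging the tallies with plain deletes is cheap at this volume
  DELETE FROM hourly_counts WHERE hour < now() - interval '90 days';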



