Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-11-15 Thread Chitra Creta
Thanks for your example Chris. I will look into it as a long-term solution.

Partitioning tables as a strategy worked very well indeed. This will be my
short/medium term solution.

Another strategy that I would like to evaluate as a short/medium term
solution is archiving old records in a table before purging them.

I am aware that Oracle has a tool that allows records to be exported into a
file / archive table before purging them, along with a corresponding tool to
import these records.

Does PostgreSQL have similar tools to export to a file and re-import?

If PostgreSQL does not have a tool to do this, does anyone have any ideas
on what file format (e.g. a text file with a header row of column names
followed by one row per record) would be ideal for easy re-importing into a
PostgreSQL table?

Thank you for your ideas.


On Mon, Oct 22, 2012 at 12:14 AM, Chris Travers chris.trav...@gmail.com wrote:



 On Sun, Oct 21, 2012 at 5:46 AM, Chitra Creta chitracr...@gmail.com wrote:

 Thank you all.

 Ryan, would you mind sharing your one-time function to move it?

 Merlin, what are your suggestions to improve query performance?

 Shaun, thank you. I will look into facts and dimensions should all else
 fail.

 Chris, would you mind giving me an example of what you mean by your log,
 aggregate and snapshot approach? Also, with indexing, I believe composite
 and partial indexes are better than plain indexes, am I correct? Do you
 have any recommendations as to which index type (e.g. btree, hash) is
 better for which situations?


 Sure.  Suppose I have an accounting system.

 I may record the amounts of the transactions in journal_entry and
 journal_line tables.  These will be write once, read many.  However, over
 time you will end up having to digest millions of records (given
 sufficient volume) to find out the balance of a checking account, and this
 is not really ideal.

 So to deal with this, I might, for example, add a table called
 account_checkpoint which might have the following fields:

 account_id
 end_date
 debits
 credits
 balance

 And then, on closing of the books, I can snapshot the accumulated debits,
 credits, and balance to date.  If I need any of these numbers I can just
 grab the appropriate number from account_checkpoint and roll forward from
 end_date.  If I have too much volume I can have closings at a monthly
 level or whatever.

 The checkpoint table contains sufficient information for me to start a
 report at any point and end it at any other point without having to scan
 intervening checkpointed periods.  Additionally, if I want to purge old
 data, I can do so without losing current balance information.

 So, in essence, this approach gives you a way to purge without losing the
 aggregated information, and a way to skip portions of the table for
 aggregation purposes that you couldn't skip otherwise.  The big caveat is
 that you cannot insert (and if this is in doubt, you need to enforce this
 with a trigger) any records effective before the most recent checkpoint.

 Best Wishes,
 Chris Travers



Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-11-15 Thread Igor Romanchenko
On Thu, Nov 15, 2012 at 1:34 PM, Chitra Creta chitracr...@gmail.com wrote:

 Thanks for your example Chris. I will look into it as a long-term solution.

 Partitioning tables as a strategy worked very well indeed. This will be my
 short/medium term solution.

 Another strategy that I would like to evaluate as a short/medium term
 solution is archiving old records in a table before purging them.

 I am aware that Oracle has a tool that allows records to be exported into
 a file / archive table before purging them, along with a corresponding
 tool to import these records.

 Does PostgreSQL have similar tools to export to a file and re-import?

 If PostgreSQL does not have a tool to do this, does anyone have any ideas
 on what file format (e.g. a text file with a header row of column names
 followed by one row per record) would be ideal for easy re-importing
 into a PostgreSQL table?

 Thank you for your ideas.


PostgreSQL has COPY TO to export records to a file, and COPY FROM to
re-import them ( http://wiki.postgresql.org/wiki/COPY ).
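
For illustration, a minimal sketch of archiving to a file and re-importing
(the table, column and file names here are only placeholders; the WHERE
clause assumes a timestamp column such as created_at):

COPY (SELECT * FROM stats WHERE created_at < DATE '2012-01-01')
    TO '/var/tmp/stats_2011.csv' WITH (FORMAT csv, HEADER);
-- purge the archived rows
DELETE FROM stats WHERE created_at < DATE '2012-01-01';
-- later, re-import them
COPY stats FROM '/var/tmp/stats_2011.csv' WITH (FORMAT csv, HEADER);

Note that COPY ... TO/FROM a file runs with server-side permissions; from
psql you can use the client-side \copy variant instead.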


Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-21 Thread Chitra Creta
Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else
fail.

Chris, would you mind giving me an example of what you mean by your log,
aggregate and snapshot approach? Also, with indexing, I believe composite
and partial indexes are better than plain indexes, am I correct? Do you
have any recommendations as to which index type (e.g. btree, hash) is
better for which situations?

Thank you,
Chitra



On Thu, Oct 18, 2012 at 12:47 AM, Chris Travers chris.trav...@gmail.com wrote:



 On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta chitracr...@gmail.com wrote:

 Hi,

 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.

 Many queries are run on this table to obtain trend analysis. However,
 these queries are now starting to take a very long time (hours) to execute
 due to the size of the table.

 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required

 Does anyone know what is the best practice to handle this situation?


 The answer is: well, it depends.  Possibly some combination.

 One approach I like that may be included in #4 but not necessarily is the
 idea of summary tables which contain snapshots of the data, allowing you to
 roll forward or backward from defined points.  This is what I call the log,
 aggregate, and snapshot approach.   But it really depends on what you are
 doing, and there is no one-size-fits-all approach at this volume.

 Instead of reindexing, I would suggest also looking into partial indexes.

 Best Wishes,



Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-21 Thread Chris Travers
On Sun, Oct 21, 2012 at 5:46 AM, Chitra Creta chitracr...@gmail.com wrote:

 Thank you all.

 Ryan, would you mind sharing your one-time function to move it?

 Merlin, what are your suggestions to improve query performance?

 Shaun, thank you. I will look into facts and dimensions should all else
 fail.

 Chris, would you mind giving me an example of what you mean by your log,
 aggregate and snapshot approach? Also, with indexing, I believe composite
 and partial indexes are better than plain indexes, am I correct? Do you
 have any recommendations as to which index type (e.g. btree, hash) is
 better for which situations?


Sure.  Suppose I have an accounting system.

I may record the amounts of the transactions in journal_entry and
journal_line tables.  These will be write once, read many.  However, over
time you will end up having to digest millions of records (given sufficient
volume) to find out the balance of a checking account, and this is not
really ideal.

So to deal with this, I might, for example, add a table called
account_checkpoint which might have the following fields:

account_id
end_date
debits
credits
balance

And then, on closing of the books, I can snapshot the accumulated debits,
credits, and balance to date.  If I need any of these numbers I can just
grab the appropriate number from account_checkpoint and roll forward from
end_date.  If I have too much volume I can have closings at a monthly level
or whatever.
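
As a rough sketch (column names and types are illustrative, and the
journal_line columns used below, account_id, entry_date and amount, are
assumptions since the actual schema isn't spelled out here):

CREATE TABLE account_checkpoint (
    account_id  int     NOT NULL,
    end_date    date    NOT NULL,
    debits      numeric NOT NULL,
    credits     numeric NOT NULL,
    balance     numeric NOT NULL,
    PRIMARY KEY (account_id, end_date)
);

-- current balance = latest checkpoint + activity since that checkpoint
SELECT cp.balance + coalesce(sum(jl.amount), 0) AS current_balance
  FROM account_checkpoint cp
  LEFT JOIN journal_line jl
         ON jl.account_id = cp.account_id
        AND jl.entry_date > cp.end_date
 WHERE cp.account_id = 42
   AND cp.end_date = (SELECT max(end_date)
                        FROM account_checkpoint
                       WHERE account_id = 42)
 GROUP BY cp.balance;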

The checkpoint table contains sufficient information for me to start a
report at any point and end it at any other point without having to scan
intervening checkpointed periods.  Additionally, if I want to purge old
data, I can do so without losing current balance information.

So, in essence, this approach gives you a way to purge without losing the
aggregated information, and a way to skip portions of the table for
aggregation purposes that you couldn't skip otherwise.  The big caveat is
that you cannot insert (and if this is in doubt, you need to enforce this
with a trigger) any records effective before the most recent checkpoint.
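
Such a guard trigger might be sketched like this (again using the
hypothetical journal_line columns from the earlier sketch):

CREATE OR REPLACE FUNCTION reject_pre_checkpoint() RETURNS trigger AS $$
BEGIN
    -- block rows dated on or before the account's latest checkpoint
    IF NEW.entry_date <= (SELECT max(end_date)
                            FROM account_checkpoint
                           WHERE account_id = NEW.account_id) THEN
        RAISE EXCEPTION 'row dated % is before the latest checkpoint',
            NEW.entry_date;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER journal_line_checkpoint_guard
    BEFORE INSERT ON journal_line
    FOR EACH ROW EXECUTE PROCEDURE reject_pre_checkpoint();

If no checkpoint exists yet for the account, max() returns NULL, the
comparison is not true, and the insert is allowed.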

Best Wishes,
Chris Travers


Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-17 Thread Chris Travers
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta chitracr...@gmail.com wrote:

 Hi,

 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.

 Many queries are run on this table to obtain trend analysis. However,
 these queries are now starting to take a very long time (hours) to execute
 due to the size of the table.

 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required

 Does anyone know what is the best practice to handle this situation?


The answer is: well, it depends.  Possibly some combination.

One approach I like that may be included in #4 but not necessarily is the
idea of summary tables which contain snapshots of the data, allowing you to
roll forward or backward from defined points.  This is what I call the log,
aggregate, and snapshot approach.   But it really depends on what you are
doing, and there is no one-size-fits-all approach at this volume.

Instead of reindexing, I would suggest also looking into partial indexes.
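
For instance (table and column names here are purely illustrative), a
composite or partial index can be much smaller and more selective than a
plain single-column index:

-- composite index matching a common "metric within a date range" query
CREATE INDEX stats_metric_created_idx ON stats (metric_id, created_at);

-- partial index covering only the slice most queries actually touch
CREATE INDEX stats_recent_errors_idx ON stats (created_at)
    WHERE status = 'error';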

Best Wishes,


Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-16 Thread Chitra Creta
Thank you all for your suggestions. Since all of you recommended the
Partition option, I decided to try it out myself.

I found a function that created partition tables for every month in the
last two years. A trigger was also added to the parent table to ensure that
every insert into it henceforth will be inserted into the appropriate
child table.

However, there were a few observations that I made which I would appreciate
your comments on:

1. Since existing data was in the parent table, I had to do a pg_dump on
it, drop it, and then do a restore on it to force the trigger to work on
existing data. Is this how partitioning existing data should be done?

2. I noticed that there are two copies of the same record - i.e. the one
that was inserted into the parent table and another that was inserted in
the child table. If I delete the record in the parent table, the child
record gets automatically deleted. I was under the impression that
partitioning meant that my parent table will not be large anymore because
the data will be moved to smaller child tables. Is this the case?

3. Is there a way for me to evaluate the effectiveness of the partitioned
table? Would performing an Explain Analyse allow me to determine whether
querying the parent table for statistics is quicker than querying against a
massive non-partitioned table?

Thank you.
 On Oct 13, 2012 3:49 AM, John R Pierce pie...@hogranch.com wrote:

 On 10/12/12 7:44 AM, Chitra Creta wrote:


 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contains
 aggregated data specific to the statistics required


 if most of your queries read the majority of the tables,  indexing will be
 of little help.

 partitioning will aid in purging old data, as you can partition by date
 (for instance, by week) and drop whole partitions rather than deleting
 individual records.

 aggregate tables likely will be the biggest win for your statistics if
 they reduce the amount of data you need to query.


 --
 john r pierce                      N 37, W 122
 santa cruz ca mid-left coast






Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-16 Thread Ryan Kelly
On Tue, Oct 16, 2012 at 09:26:09PM +1100, Chitra Creta wrote:
 Thank you all for your suggestions. Since all of you recommended the
 Partition option, I decided to try it out myself.
 
 I found a function that created partition tables for every month in the
 last two years. A trigger was also added to the parent table to ensure that
 every insert into it henceforth will be inserted into the appropriate
 child table.
 
 However, there were a few observations that I made which I would appreciate
 your comments on:
 
 1. Since existing data was in the parent table, I had to do a pg_dump on
 it, drop it, and then do a restore on it to force the trigger to work on
 existing data. Is this how partitioning existing data should be done?
I just wrote a one-time function to move it.
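
(A rough sketch of one way to do such a move, assuming the parent table is
called foo and the insert-routing trigger is already installed; this is not
necessarily the function referred to above:)

BEGIN;
-- re-insert the rows that still sit in the parent; the INSERT trigger
-- redirects each one into the appropriate child partition
INSERT INTO foo SELECT * FROM ONLY foo;
-- then remove the originals that are physically stored in the parent
DELETE FROM ONLY foo;
COMMIT;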

 2. I noticed that there are two copies of the same record - i.e. the one
 that was inserted into the parent table and another that was inserted in
 the child table. If I delete the record in the parent table, the child
 record gets automatically deleted. I was under the impression that
 partitioning meant that my parent table will not be large anymore because
 the data will be moved to smaller child tables. Is this the case?
The data *is* in the child tables. Queries on the parent table, by
default, affect data in the child tables. So, issuing a SELECT against
your parent table will also query the child tables. DELETE will,
similarly, delete data in the child tables. You may target just the
parent table using ONLY, e.g. SELECT * FROM ONLY foo. This behavior is
also controlled by the GUC sql_inheritance, though I encourage you not
to change this value. To get a better idea of what is happening, look at
the output from EXPLAIN to see all the tables that are being included in
your plan.

 3. Is there a way for me to evaluate the effectiveness of the partitioned
 table? Would performing an Explain Analyse allow me to determine whether
 querying the parent table for statistics is quicker than querying against a
 massive non-partitioned table?
Well, you can do it with EXPLAIN ANALYZE, or you can do it by timing
your query, so that the overhead of EXPLAIN ANALYZE does not come into
play.
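
For example (assuming a parent table foo partitioned on a created_at date,
with a CHECK constraint on each child), you can verify that the planner is
actually skipping partitions:

-- 'partition' is the default; shown here just to be explicit
SET constraint_exclusion = partition;

EXPLAIN ANALYZE
SELECT count(*)
  FROM foo
 WHERE created_at >= DATE '2012-09-01'
   AND created_at <  DATE '2012-10-01';

The plan should only include the children whose CHECK constraints overlap
that date range.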

Also, I assume you've read this:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

That link will be helpful in understanding how partitioning could
benefit you.

-Ryan Kelly




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-16 Thread Merlin Moncure
On Fri, Oct 12, 2012 at 10:00 AM, Lonni J Friedman netll...@gmail.com wrote:
 On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta chitracr...@gmail.com wrote:
 Hi,

 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.

 Many queries are run on this table to obtain trend analysis. However, these
 queries are now starting to take a very long time (hours) to execute due to
 the size of the table.

 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required

 Does anyone know what is the best practice to handle this situation?

 I would appreciate knowledge sharing on the pros and cons of the above, or
 if there are any other strategies that I could put in place.

 Partitioning is prolly your best solution.  3 & 4 sound like
 variations on the same thing.  Before you go that route, you should
 make sure that your bottleneck is really a result of the massive
 amount of data, and not some other problem.  Are you sure that the
 indices you created are being used, and that you have all the indices
 that you need for your queries?  Look at the query plan output from
 EXPLAIN, and/or post here if you're unsure.

Partitioning is not a strategy to improve query performance unless your
queries exploit the partition structure in some way (for example, by
filtering on the partition key so whole partitions can be skipped).

merlin




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-16 Thread Shaun Thomas

On 10/12/2012 09:44 AM, Chitra Creta wrote:


4. Creation of daily, monthly, yearly summary tables that contain
aggregated data specific to the statistics required


If the partitioning doesn't help you, I strongly urge you to build one
or more Fact tables with appropriate Dimensions. If you don't know what
these terms mean, there are lots of books on the subject. They're very
versatile for producing fast reports on varying inputs. You can also
layer them by having cascading levels of aggregation from day to month
to year, and so on.


These kinds of reporting structures are perfect for huge data 
accumulation warehouses. The book most recommended to me back in the day 
was The Data Warehouse Toolkit, and I can also vouch that it covers 
these subjects pretty well.


I still recommend partitioning simply due to maintenance overhead, but 
you'll want to look into this too.


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
stho...@optionshouse.com



[GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread Chitra Creta
Hi,

I currently have a table that is growing very quickly - i.e. 7 million
records in 5 days. This table acts as a placeholder for statistics, and
hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these
queries are now starting to take a very long time (hours) to execute due to
the size of the table.

I have put indexes on this table, to no significant benefit.  Some of the
other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contain
aggregated data specific to the statistics required

Does anyone know what is the best practice to handle this situation?

I would appreciate knowledge sharing on the pros and cons of the above, or
if there are any other strategies that I could put in place.

Thanking you in advance.


Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread Lonni J Friedman
On Fri, Oct 12, 2012 at 7:44 AM, Chitra Creta chitracr...@gmail.com wrote:
 Hi,

 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.

 Many queries are run on this table to obtain trend analysis. However, these
 queries are now starting to take a very long time (hours) to execute due to
 the size of the table.

 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required

 Does anyone know what is the best practice to handle this situation?

 I would appreciate knowledge sharing on the pros and cons of the above, or
 if there are any other strategies that I could put in place.

Partitioning is prolly your best solution.  3 & 4 sound like
variations on the same thing.  Before you go that route, you should
make sure that your bottleneck is really a result of the massive
amount of data, and not some other problem.  Are you sure that the
indices you created are being used, and that you have all the indices
that you need for your queries?  Look at the query plan output from
EXPLAIN, and/or post here if you're unsure.

Reindexing shouldn't make a difference unless something is wrong with
the indices that you already have in place.

Purging old data is only a good solution if you do not need the data,
and never will need the data.




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread Ondrej Ivanič
Hi,

On 13 October 2012 01:44, Chitra Creta chitracr...@gmail.com wrote:
 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.

 Many queries are run on this table to obtain trend analysis. However, these
 queries are now starting to take a very long time (hours) to execute due to
 the size of the table.

Have you changed the autovacuum settings to make it more aggressive?
Another option is to run ANALYZE after loading.


 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 3. Partition

Those two go together. Partitioning is useful if you can constrain
queries to specific ranges, i.e. this query needs only the last two days.
You shouldn't go over 200-300 partitions per table. Partition granularity
should match the range of data an average query touches: if you run weekly
queries then use weekly partitions (bi-weekly or daily partitions might
work, but I do not have good experience with them).

It is easy to purge old data because you only need to drop the unwanted
partitions (no table/index bloat). Loading is a little bit tricky because
you have to load data into the right partition.
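
A minimal sketch of the usual trigger-based routing, following the pattern
in the partitioning docs (the table name stats and the created_at column
are only assumptions about your schema):

CREATE TABLE stats_2012_10 (
    CHECK (created_at >= DATE '2012-10-01' AND created_at < DATE '2012-11-01')
) INHERITS (stats);

CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.created_at >= DATE '2012-10-01'
       AND NEW.created_at < DATE '2012-11-01' THEN
        INSERT INTO stats_2012_10 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.created_at;
    END IF;
    RETURN NULL;  -- keep the routed row out of the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert_trigger
    BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();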

 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required

I think this is the way to hell. You start with a few tables and then you
add more tables until you realise that it takes longer to update them than
to run your queries :)

You might benefit from query parallelisation (for example, pgpool-II,
Stado, Postgres XC), or you could move away from Postgres entirely; for
example, a column-oriented NoSQL database might be a good choice.

-- 
Ondrej Ivanic
(ondrej.iva...@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread Ryan Kelly
On Sat, Oct 13, 2012 at 01:44:02AM +1100, Chitra Creta wrote:
 Hi,
 
 I currently have a table that is growing very quickly - i.e. 7 million
 records in 5 days. This table acts as a placeholder for statistics, and
 hence the records are merely inserted and never updated or deleted.
 
 Many queries are run on this table to obtain trend analysis. However, these
 queries are now starting to take a very long time (hours) to execute due to
 the size of the table.
 
 I have put indexes on this table, to no significant benefit.  Some of the
 other strategies I have thought of:
 1. Purge old data
 2. Reindex
 3. Partition
 4. Creation of daily, monthly, yearly summary tables that contain
 aggregated data specific to the statistics required
 
 Does anyone know what is the best practice to handle this situation?
Three and four will probably be your biggest wins. We do both. There are
trade-offs for both. If you need the absolute best in response times
(e.g., in a web application), summary tables are the way to go. If
you're regularly querying data in a specific date range, but response
times aren't as critical (e.g., daily report generated at the end of
day), partitioning will also help there.
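
A minimal sketch of a summary table and its nightly rollup (the stats
source table and its metric_id, value and created_at columns are only
assumptions about your schema):

CREATE TABLE stats_daily (
    day        date    NOT NULL,
    metric_id  int     NOT NULL,
    n          bigint  NOT NULL,
    total      numeric NOT NULL,
    PRIMARY KEY (day, metric_id)
);

-- run once per night to aggregate yesterday's rows
INSERT INTO stats_daily (day, metric_id, n, total)
SELECT created_at::date, metric_id, count(*), sum(value)
  FROM stats
 WHERE created_at >= current_date - 1
   AND created_at <  current_date
 GROUP BY 1, 2;

Reports then read the small stats_daily table instead of scanning the raw
data.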

Having the correct indexes is crucial, as always.

Additionally, you will be able to purge old data relatively quickly and
without bloat using the partitioning method.

 I would appreciate knowledge sharing on the pros and cons of the above, or
 if there are any other strategies that I could put in place.
 
 Thanking you in advance.

-Ryan Kelly




Re: [GENERAL] Strategies/Best Practises Handling Large Tables

2012-10-12 Thread John R Pierce

On 10/12/12 7:44 AM, Chitra Creta wrote:


1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, yearly summary tables that contain
aggregated data specific to the statistics required




if most of your queries read the majority of the tables,  indexing will 
be of little help.


partitioning will aid in purging old data, as you can partition by date
(for instance, by week) and drop whole partitions rather than deleting 
individual records.
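
For example (the child table name is made up), retiring a week becomes a
cheap metadata operation instead of a huge DELETE:

-- drop an old weekly partition outright
DROP TABLE stats_2012_w30;

-- or detach it first if you want to archive it before dropping
ALTER TABLE stats_2012_w30 NO INHERIT stats;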


aggregate tables likely will be the biggest win for your statistics if 
they reduce the amount of data you need to query.



--
john r pierce                      N 37, W 122
santa cruz ca mid-left coast


