Re: [PERFORM] How to determine whether to VACUUM or CLUSTER

2005-06-20 Thread Jim C. Nasby
On Wed, Jun 15, 2005 at 11:34:18AM -0400, Ken Shaw wrote:
 Hi All,
 
 I have an app that updates a PostgreSQL db in a batch fashion.  After
 each batch (or several batches), it issues VACUUM and ANALYZE calls on
 the updated tables.  Now I want to cluster some tables for better
 performance.  I understand that doing a VACUUM and a CLUSTER on a table
 is wasteful as the CLUSTER makes the VACUUM superfluous.  The app does
 not have a built-in list of the tables and whether each is clustered or
 not.  It looks to me as if the only way to determine whether to issue a
 VACUUM (on a non-clustered table) or a CLUSTER (on a clustered table) is
 to query the table pg_index, much like view pg_indexes does, for the
 column indisclustered.  Is this right?

I don't think that's what you want. 'indisclustered' only tells you which
index the table was last clustered on, not how disordered the table has
become since then. The best thing that comes to mind is looking at the
correlation of the first field of that index; you'll find this info in pg_stats.
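
For example (a sketch; my_table and the public schema are placeholders, not
from the original post), you could check which index, if any, the table was
last clustered on, and how well ordered the heap still is:

SELECT i.indexrelid::regclass AS clustered_index
FROM pg_index i
WHERE i.indrelid = 'my_table'::regclass
  AND i.indisclustered;

SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'my_table';

A correlation near 1.0 (or -1.0) for the leading index column means the heap
is still largely in index order, so a plain VACUUM is probably enough; a value
near 0 suggests a CLUSTER may be worthwhile.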

 Also, how expensive is CLUSTER compared to VACUUM?  Does CLUSTER read in
 the whole table, sort it, and write it back out?  Or write out a
 completely new file?  Is the time for a CLUSTER the same whether one row
 is out of place or the table is completely disordered?

AFAIK, CLUSTER completely re-creates the table from scratch, then
rebuilds all the indexes. It's basically the most expensive operation
you can perform on a table. The sort will probably go somewhat faster
if the table is already mostly in the right order, though.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



[PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread Alex Stapleton
Hi, I'm trying to optimise our autovacuum configuration so that it
vacuums / analyzes some of our larger tables better. It has been set
to the default settings for quite some time. We never delete
anything (well, not often, and not much) from the tables, so I am not
so worried about the VACUUM side of things, but I am wary of XID wraparound
nuking us at some point if we don't sort vacuuming out, so we VACUUM
at least once every year ;) However, not running ANALYZE for such huge
periods of time is probably hurting the accuracy of the statistics
somewhat, and I have seen some unusually slow queries at times.
Anyway, does anyone think we might benefit from a more aggressive
autovacuum configuration?




Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread Jacques Caron

Hi,

At 16:44 20/06/2005, Alex Stapleton wrote:

We never delete
anything  (well not often, and not much) from the tables, so I am not
so worried about the VACUUM status


DELETEs are not the only reason you might need to VACUUM. UPDATEs are 
just as important, if not more so. Tables that are constantly updated 
(statistics, session data, queues...) really need to be VACUUMed a lot.



but I am wary of XID wraparound
nuking us at some point if we don't sort vacuuming out so we VACUUM
at least once every year ;)


That would give you a maximum average of 31 transactions/sec... Don't know 
if that's high or low for you.
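
(For reference, that figure presumably comes from needing to VACUUM each 
table at least once per ~1 billion transactions to stay clear of wraparound: 
1,000,000,000 tx / 31,536,000 seconds per year ≈ 31.7 tx/sec.)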



 However not running ANALYZE for such huge
periods of time is probably impacting the statistics accuracy
somewhat, and I have seen some unusually slow queries at times.
Anyway, does anyone think we might benefit from a more aggressive
autovacuum configuration?


ANALYZE is not a very expensive operation; VACUUM, however, can definitely be 
a big strain and take a long time on big tables, depending on your 
setup. I've found that partitioning tables (at the application level) can 
be quite helpful if you manage to keep each partition to a reasonable size 
(under or close to available memory), especially if the partitioning scheme 
is somehow time-related. YMMV.


Jacques.





Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread Alex Stapleton


On 20 Jun 2005, at 15:59, Jacques Caron wrote:


Hi,

At 16:44 20/06/2005, Alex Stapleton wrote:


We never delete
anything  (well not often, and not much) from the tables, so I am not
so worried about the VACUUM status



DELETEs are not the only reason you might need to VACUUM. UPDATEs  
are important as well, if not more. Tables that are constantly  
updated (statistics, session data, queues...) really need to be  
VACUUMed a lot.


We UPDATE it even less often.




but I am wary of XID wraparound
nuking us at some point if we don't sort vacuuming out so we VACUUM
at least once every year ;)



That would give you a maximum average of 31 transactions/sec...  
Don't know if that's high or low for you.


That's high as far as inserts go for us; the app does them all at the end of
each minute.





 However not running ANALYZE for such huge
periods of time is probably impacting the statistics accuracy
somewhat, and I have seen some unusually slow queries at times.
Anyway, does anyone think we might benefit from a more aggressive
autovacuum configuration?



ANALYZE is not a very expensive operation, however VACUUM can  
definitely be a big strain and take a long time on big tables,  
depending on your setup. I've found that partitioning tables (at  
the application level) can be quite helpful if you manage to keep  
each partition to a reasonable size (under or close to available  
memory), especially if the partitioning scheme is somehow time-related. YMMV.


Jacques.


That's not currently an option as it would require a pretty large  
amount of work to implement. I think we will have to keep that in  
mind though.





Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread John Arbash Meinel
Alex Stapleton wrote:


 On 20 Jun 2005, at 15:59, Jacques Caron wrote:

...

 ANALYZE is not a very expensive operation, however VACUUM can
 definitely be a big strain and take a long time on big tables,
 depending on your setup. I've found that partitioning tables (at the
 application level) can be quite helpful if you manage to keep each
 partition to a reasonable size (under or close to available memory),
 especially if the partitioning scheme is somehow time-related. YMMV.

 Jacques.


 That's not currently an option as it would require a pretty large
 amount of work to implement. I think we will have to keep that in
 mind though.

Remember, you can fake it with a low-level set of tables, and then wrap
them into a UNION ALL view.
So you get something like:

CREATE VIEW orig_table AS
SELECT * FROM table_2005_04
UNION ALL SELECT * FROM table_2005_05
UNION ALL SELECT * FROM table_2005_06
...
;

Then at least your individual operations are fast. For inserts, you can
create a rule so that an INSERT into orig_table is instead redirected
into table_2005_07 (or whatever the current table is), along the lines
of the sketch below.
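
A minimal sketch (the rule name and the column list are placeholders, not
from the original posting; the target table would change every month):

CREATE RULE orig_table_insert AS
ON INSERT TO orig_table
DO INSTEAD
INSERT INTO table_2005_07 (ts, value)
VALUES (new.ts, new.value);
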
It takes a little bit of maintenance on the DB admin's part, since every
month they have to create a new table, and then update all of the views
and triggers. But it is pretty straightforward.
If you are doing append-only inserting, then you have the nice feature
that only the last table is ever modified, which means that the older
tables don't really need to be vacuumed or analyzed.
And even if you have to have each table modified as you go, you still
can break up a VACUUM into only doing one of the sub tables at a time.

I don't know your db schema, but I thought I would mention that while true
partitioning isn't implemented yet, you can still get something very
similar with views, triggers and rules.

John
=:-





Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?

2005-06-20 Thread Josh Berkus
Alex,

 Hi, i'm trying to optimise our autovacuum configuration so that it
 vacuums / analyzes some of our larger tables better. It has been set
 to the default settings for quite some time. We never delete
 anything  (well not often, and not much) from the tables, so I am not
 so worried about the VACUUM status, but I am wary of XID wraparound
 nuking us at some point if we don't sort vacuuming out so we VACUUM
 at least once every year ;) 

I personally don't use autovacuum on very large databases.   For DW, 
vacuuming is far better tied to ETL operations or a clock schedule of 
downtime.

XID wraparound may be further away than you think.   Try checking 
pg_controldata, which will give you the current XID, and you can calculate 
how long you are away from wraparound.  I just tested a 200G data warehouse 
and figured out that we are 800 months away from wraparound, despite hourly 
ETL.
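
As a rough sketch of that check (the data directory path is illustrative, 
not from the original post):

pg_controldata /var/lib/pgsql/data | grep NextXID

The "Latest checkpoint's NextXID" line gives the current transaction ID; 
divide your remaining headroom (on the order of a billion transactions 
before every table must be vacuumed) by your monthly transaction volume 
to get the months left.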

 However not running ANALYZE for such huge 
 periods of time is probably impacting the statistics accuracy
 somewhat, and I have seen some unusually slow queries at times.
 Anyway, does anyone think we might benefit from a more aggressive
 autovacuum configuration?

Hmmm, good point, you could use autovacuum for ANALYZE only.  Just set the 
VACUUM settings preposterously high (like 10x) so it never runs; then it'll 
run ANALYZE only.   I generally use a threshold of 200 and a scale factor of 
0.1 for analyze; that is, re-analyze after 200 + 10% of rows have changed.
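
To put that in perspective for a 500,000,000-row table (assuming the usual 
base threshold + scale factor * reltuples formula):

analyze threshold = 200 + 0.1 * 500,000,000 ≈ 50,000,200 changed rows

so an ANALYZE would fire roughly every 50 million inserted or updated rows.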

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] investigating slow queries through pg_stat_activity

2005-06-20 Thread Dan Harris
I've got some queries generated by my application that will, for some  
reason, run forever until I kill the pid.  Yet, when I run the  
queries manually to check them out, they usually work fine.  To get  
more information about these queries, I'm writing a utility to take  
snapshots of pg_stat_activity every 5 minutes.  If it finds a query  
that runs for longer than 15 minutes, it will trap the query so I can  
run 'explain analyze' on it and see where the weakness is.
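
A minimal sketch of the kind of query such a utility might run (column names 
as in the 8.0-era pg_stat_activity view; the cutoff is the 15 minutes 
mentioned above):

SELECT procpid, usename, query_start, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
  AND query_start < now() - interval '15 minutes';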


However, the problem I have is that pg_stat_activity only returns the  
first n (255?) characters of the SQL as current_query, so it gets  
chopped off at the end.  I would very much like to find out how I can  
get the *entire* query that is active.  Is this possible?


Also, I'm sure some people will respond with "turn on query logging".
I've explored that option, and the formatting of the log file and the
fact that EVERY query is logged are not what I'm after for this
project.  The infinite-running queries are unpredictable and may only
happen once a week.  Logging 24/7 in anticipation of one of these
occurrences is not something I'd like to do.


Thanks,

Dan Harris



Re: [PERFORM] investigating slow queries through pg_stat_activity

2005-06-20 Thread Jacques Caron

Hi,

At 19:55 20/06/2005, Dan Harris wrote:

Also, I'm sure some people will respond with "turn on query logging".
I've explored that option and the formatting of the log file and the
fact that EVERY query is logged is not what I'm after for this project.


You can log just those queries that take a little bit too much time. See 
log_min_duration_statement in postgresql.conf. Set it really high, and 
you'll only get those queries you're after.
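
For instance, a postgresql.conf line along these lines (the exact cutoff is 
illustrative) would log only statements running 15 minutes or more:

log_min_duration_statement = 900000    # in milliseconds; -1 disables, 0 logs everything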


Jacques.





Re: [PERFORM] investigating slow queries through pg_stat_activity

2005-06-20 Thread Tom Lane
Dan Harris [EMAIL PROTECTED] writes:
 However, the problem I have is that pg_stat_activity only returns the  
 first n (255?) characters of the SQL as current_query, so it gets  
 chopped off at the end.  I would very much like to find out how I can  
 get the *entire* query that is active.  Is this possible?

I think the limit is ~1000 characters in 8.0 and later.  However, it
can't realistically be unlimited because of constraints of the stats
messaging mechanism.

regards, tom lane



Re: [PERFORM] investigating slow queries through pg_stat_activity

2005-06-20 Thread George Essig
On 6/20/05, Dan Harris [EMAIL PROTECTED] wrote:
 Also, I'm sure some people will respond with "turn on query logging".
 I've explored that option and the formatting of the log file and the
 fact that EVERY query is logged is not what I'm after for this project.

You don't have to log every query.  You can set
log_min_duration_statement in postgresql.conf to log only the queries
that exceed a certain amount of time.

From the manual at
http://www.postgresql.org/docs/8.0/static/runtime-config.html:

log_min_duration_statement (integer)

Sets a minimum statement execution time (in milliseconds) that
causes a statement to be logged. All SQL statements that run for the
time specified or longer will be logged with their duration. Setting
this to zero will print all queries and their durations. Minus-one
(the default) disables the feature. For example, if you set it to 250
then all SQL statements that run 250ms or longer will be logged.
Enabling this option can be useful in tracking down unoptimized
queries in your applications. Only superusers can change this setting.

George Essig
