Re: [PERFORM] How to determine whether to VACUUM or CLUSTER
On Wed, Jun 15, 2005 at 11:34:18AM -0400, Ken Shaw wrote:
> Hi All, I have an app that updates a PostgreSQL db in a batch fashion. After each batch (or several batches), it issues VACUUM and ANALYZE calls on the updated tables. Now I want to cluster some tables for better performance. I understand that doing both a VACUUM and a CLUSTER on a table is wasteful, as the CLUSTER makes the VACUUM superfluous. The app does not have a built-in list of the tables and whether each is clustered or not. It looks to me as if the only way to determine whether to issue a VACUUM (on a non-clustered table) or a CLUSTER (on a clustered table) is to query pg_index, much like the pg_indexes view does, for the column indisclustered. Is this right?

I don't think that's what you want. indisclustered only indicates which index the table was last clustered on, not whether the table is still in that order. The best thing that comes to mind is looking at the correlation of the first field of the index for the table. You'll find this info in pg_stats.

> Also, how expensive is CLUSTER compared to VACUUM? Does CLUSTER read in the whole table, sort it, and write it back out? Or write out a completely new file? Is the time for a CLUSTER the same whether one row is out of place or the table is completely disordered?

AFAIK, CLUSTER completely re-creates the table from scratch and then rebuilds all the indexes. It's basically the most expensive operation you can perform on a table. There will probably be some speedup in the sort if the table is already mostly in the right order, though.

-- Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
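For reference, a minimal sketch of the two checks described above ('mytable' is a placeholder table name; pg_stats is only populated once ANALYZE has run):

    -- Which index, if any, was each table last CLUSTERed on?
    SELECT c.relname AS table_name, i.relname AS index_name
    FROM pg_index x
    JOIN pg_class c ON c.oid = x.indrelid
    JOIN pg_class i ON i.oid = x.indexrelid
    WHERE x.indisclustered;

    -- How well the heap currently matches each column's sort order:
    -- correlation near +1 or -1 means well ordered, near 0 means scattered.
    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'mytable';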
[PERFORM] autovacuum suggestions for 500,000,000+ row tables?
Hi, I'm trying to optimise our autovacuum configuration so that it vacuums/analyzes some of our larger tables better. It has been set to the default settings for quite some time. We never delete anything (well, not often, and not much) from the tables, so I am not so worried about the VACUUM status, but I am wary of XID wraparound nuking us at some point if we don't sort vacuuming out, so we VACUUM at least once every year ;) However, not running ANALYZE for such huge periods of time is probably impacting the statistics accuracy somewhat, and I have seen some unusually slow queries at times. Anyway, does anyone think we might benefit from a more aggressive autovacuum configuration?
Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?
Hi,

At 16:44 20/06/2005, Alex Stapleton wrote:
> We never delete anything (well not often, and not much) from the tables, so I am not so worried about the VACUUM status

DELETEs are not the only reason you might need to VACUUM. UPDATEs are important as well, if not more so, since every UPDATE leaves a dead row version behind. Tables that are constantly updated (statistics, session data, queues...) really need to be VACUUMed a lot.

> but I am wary of XID wraparound nuking us at some point if we don't sort vacuuming out so we VACUUM at least once every year ;)

That would give you a maximum average of 31 transactions/sec (about one billion transactions spread over a year). Don't know if that's high or low for you.

> However not running ANALYZE for such huge periods of time is probably impacting the statistics accuracy somewhat, and I have seen some unusually slow queries at times. Anyway, does anyone think we might benefit from a more aggressive autovacuum configuration?

ANALYZE is not a very expensive operation; VACUUM, however, can definitely be a big strain and take a long time on big tables, depending on your setup. I've found that partitioning tables (at the application level) can be quite helpful if you manage to keep each partition to a reasonable size (under or close to available memory), especially if the partitioning scheme is somehow time-related. YMMV.

Jacques.
Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?
On 20 Jun 2005, at 15:59, Jacques Caron wrote:
> DELETEs are not the only reason you might need to VACUUM. UPDATEs are important as well, if not more. Tables that are constantly updated (statistics, session data, queues...) really need to be VACUUMed a lot.

We UPDATE it even less often.

> That would give you a maximum average of 31 transactions/sec... Don't know if that's high or low for you.

It's high as far as inserts go for us. It does them all at the end of each minute.

> ANALYZE is not a very expensive operation; however, VACUUM can definitely be a big strain and take a long time on big tables, depending on your setup. I've found that partitioning tables (at the application level) can be quite helpful if you manage to keep each partition to a reasonable size (under or close to available memory), especially if the partitioning scheme is somehow time-related. YMMV.

That's not currently an option, as it would require a pretty large amount of work to implement. I think we will have to keep it in mind, though.
Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?
Alex Stapleton wrote:
> On 20 Jun 2005, at 15:59, Jacques Caron wrote:
>> ANALYZE is not a very expensive operation; however, VACUUM can definitely be a big strain and take a long time on big tables, depending on your setup. I've found that partitioning tables (at the application level) can be quite helpful if you manage to keep each partition to a reasonable size (under or close to available memory), especially if the partitioning scheme is somehow time-related. YMMV.
>
> That's not currently an option, as it would require a pretty large amount of work to implement. I think we will have to keep that in mind though.

Remember, you can fake it with a low-level set of tables, and then wrap them into a UNION ALL view. So you get something like:

    CREATE VIEW orig_table AS
        SELECT * FROM table_2005_04
        UNION ALL
        SELECT * FROM table_2005_05
        UNION ALL
        SELECT * FROM table_2005_06
        ...
        ;

Then at least your individual operations are fast. As you insert, you can create a rule so that an INSERT into orig_table instead inserts into table_2005_07 (or whatever the current table is), as shown in the sketch below.

It takes a little bit of maintenance on the DB admin's part, since every month they have to create a new table and then update all of the views and triggers. But it is pretty straightforward. If you are doing append-only inserting, then you get the nice feature that only the last table is ever modified, which means that the older tables don't really need to be vacuumed or analyzed. And even if every table does get modified as you go, you can still break a VACUUM up into doing only one of the sub-tables at a time.

I don't know your db schema, but I thought I would mention that although true partitioning isn't implemented yet, you can still get something very similar with views, triggers and rules.

John =:-
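A sketch of the redirect rule John mentions, reusing the table names from his example (the rule has to be re-pointed at the new table each month):

    -- Send inserts against the view to the current month's table;
    -- NEW.* expands to the full incoming row.
    CREATE RULE orig_table_insert AS
        ON INSERT TO orig_table
        DO INSTEAD
            INSERT INTO table_2005_07 VALUES (NEW.*);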
Re: [PERFORM] autovacuum suggestions for 500,000,000+ row tables?
Alex,

> Hi, I'm trying to optimise our autovacuum configuration so that it vacuums/analyzes some of our larger tables better. It has been set to the default settings for quite some time. We never delete anything (well, not often, and not much) from the tables, so I am not so worried about the VACUUM status, but I am wary of XID wraparound nuking us at some point if we don't sort vacuuming out so we VACUUM at least once every year ;)

I personally don't use autovacuum on very large databases. For DW use, vacuuming is far better tied to ETL operations or a clock schedule of downtime.

XID wraparound may be further away than you think. Try checking pg_controldata, which will give you the current XID, and you can calculate how far away you are from wraparound. I just tested a 200G data warehouse and figured out that we are 800 months away from wraparound, despite hourly ETL.

> However, not running ANALYZE for such huge periods of time is probably impacting the statistics accuracy somewhat, and I have seen some unusually slow queries at times. Anyway, does anyone think we might benefit from a more aggressive autovacuum configuration?

Hmmm, good point. You could use autovacuum for ANALYZE only: just set the VACUUM settings preposterously high (like 10x) so it never runs, and then it'll run ANALYZE only. I generally use a threshold of 200 and a multiplier of 0.1 for analyze; that is, re-analyze after 200 + 10% of the rows have changed.

-- Josh Berkus, Aglio Database Solutions, San Francisco
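As an alternative to pg_controldata, the same distance can be estimated from inside the database (a sketch; age() reports how many transactions old a database's frozen XID is, against a horizon of roughly two billion):

    -- How far each database is from its last freeze; divide the remaining
    -- headroom by your transaction rate to estimate time-to-wraparound.
    SELECT datname, age(datfrozenxid) AS xids_since_freeze
    FROM pg_database;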
[PERFORM] investigating slow queries through pg_stat_activity
I've got some queries generated by my application that will, for some reason, run forever until I kill the pid. Yet when I run the queries manually to check them out, they usually work fine.

To get more information about these queries, I'm writing a utility to take snapshots of pg_stat_activity every 5 minutes. If it finds a query that runs for longer than 15 minutes, it will trap the query so I can run EXPLAIN ANALYZE on it and see where the weakness is. However, the problem I have is that pg_stat_activity only returns the first n (255?) characters of the SQL as current_query, so it gets chopped off at the end. I would very much like to find out how I can get the *entire* query that is active. Is this possible?

Also, I'm sure some people will respond with "turn on query logging". I've explored that option, and the formatting of the log file and the fact that EVERY query is logged are not what I'm after for this project. The infinite-running queries are unpredictable and may only happen once a week; logging 24/7 in anticipation of one of these occurrences is not something I'd like to do.

Thanks, Dan Harris
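A sketch of the snapshot query such a utility might run (column names follow the 8.0-era pg_stat_activity view; stats_command_string must be enabled for current_query to be populated, and the 15-minute cutoff matches the threshold above):

    -- Backends whose current statement has been running over 15 minutes
    SELECT procpid,
           now() - query_start AS runtime,
           current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>'
      AND now() - query_start > interval '15 minutes'
    ORDER BY runtime DESC;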
Re: [PERFORM] investigating slow queries through pg_stat_activity
Hi,

At 19:55 20/06/2005, Dan Harris wrote:
> Also, I'm sure some people will respond with "turn on query logging". I've explored that option, and the formatting of the log file and the fact that EVERY query is logged are not what I'm after for this project.

You can log just those queries that take a little bit too much time. See log_min_duration_statement in postgresql.conf. Set it really high, and you'll only get the queries you're after.

Jacques.
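For example (a sketch; 900000 ms corresponds to Dan's 15-minute threshold):

    # postgresql.conf: log only statements that run for 15 minutes or longer
    log_min_duration_statement = 900000    # value is in milliseconds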
Re: [PERFORM] investigating slow queries through pg_stat_activity
Dan Harris [EMAIL PROTECTED] writes:
> However, the problem I have is that pg_stat_activity only returns the first n (255?) characters of the SQL as current_query, so it gets chopped off at the end. I would very much like to find out how I can get the *entire* query that is active. Is this possible?

I think the limit is ~1000 characters in 8.0 and later. It can't realistically be unlimited, though, because of constraints of the stats messaging mechanism.

regards, tom lane
Re: [PERFORM] investigating slow queries through pg_stat_activity
On 6/20/05, Dan Harris [EMAIL PROTECTED] wrote:
> Also, I'm sure some people will respond with "turn on query logging". I've explored that option and the formatting of the log file and the fact that EVERY query is logged is not what I'm after for this project.

You don't have to log every query. You can set log_min_duration_statement in postgresql.conf to log only the queries that exceed a certain amount of time. From the manual at http://www.postgresql.org/docs/8.0/static/runtime-config.html:

    log_min_duration_statement (integer)

        Sets a minimum statement execution time (in milliseconds) that causes a statement to be logged. All SQL statements that run for the time specified or longer will be logged with their duration. Setting this to zero will print all queries and their durations. Minus-one (the default) disables the feature. For example, if you set it to 250 then all SQL statements that run 250 ms or longer will be logged. Enabling this option can be useful in tracking down unoptimized queries in your applications. Only superusers can change this setting.

George Essig