Ð ÐÐÐ, 22.03.2004, Ð 00:12, Tom Lane ÐÐÑÐÑ:
> Please repost in some less proprietary format.  Plain text is generally
> considered the thing to use on this list.

-- 
Markus Bertheau <[EMAIL PROTECTED]>
pg_autovacuum Version 2
Design Document:


Exec Summary:
pg_autovacuum was initially released as a contrib module in PostgreSQL v7.4.   The 
version in 7.4 is by design very simple.  No configuration is required, and very 
little configuration is possible.  Despite these limitations it was voted the most 
popular new feature of PostgreSQL v7.4 according to the survey held on postgresql.org 
(http://www.postgresql.org/survey.php?View=1&SurveyID=23).  

Despite it's popularity there is much room for improvement.  This document sets out to 
define the most important improvements that would help pg_autovacuum to become a truly 
powerful asset to the suite of tools that come with PostgreSQL.


Current Problems & Limitations:

Based on user feedback from people using pg_autovacuum in the field, and my own 
observations, there are a number of problems and limitation with pg_autovacuum.  They 
are:

Inability to customize thresholds on a per table basis
Inability to set default thresholds on a per database basis
Inability to exclude specific databases / tables from pg_autovacuum monitoring
Inability to schedule vacuums during off-peak times
Lack of integration related to startup and shutdown
Ignorance of VACUUM and ANALYZE operations performed outside pg_autovacuum (requires 
backend integration? or can listen / notify can be used?)
Lack of logging options / syslog integration / log rotation options
Create table fails because template1 is busy

I'm not sure how to address all of these concerns, or that they all should be 
addressed right now.  One of my big questions is backend integration.  I am leaning 
towards leaving pg_autovacuum as a client application in contrib for one more release. 
 During this time, I can continue to tweak and improve pg_autovacuum so that we will 
have a very good idea what the final product should be before we make it a standard 
backend process.


For PostgreSQL 7.5,  I plan to implement these new features:

1.Per database defaults and per table thresholds (including total exclusion)
2.Persistent data
3.Single-Pass Mode (external scheduling from cron etc...)
4.Off peak scheduling

1. Per Database defaults and Per table Thresholds:

There are differing opinions as to the best way to providing these this feature.  The 
primary debate is where to save the configuration data.  I see three options:

1.Store config data inside a special pg_autovacuum table inside existing databases 
that wants custom settings.   

2.Use a config file.  This would require some additional coding to add the required 
parsing, but is possible. 

3.Create a pg_autovacuum database inside any cluster that wants to customize their 
settings.  

Since many people do not like tools that clutter their databases by adding tables, I 
think option 1 (adding a pg_autovacuum table to existing databases) is right out.  
Using a config file would be Ok, but would require additional parsing code.  My 
preference is option 3.  Since pg_autovacuum will (hopefully) eventually become an 
integrated part of the backend, it will eventually be able to add required data to the 
system catalogs.  Given these two premises, as long as pg_autovacuum remains a contrib 
module it could use it's own database to mimic having system tables.  If this database 
exists, it will be used, if it does not exist, then pg_autovacuum will work just as it 
did in the 7.4 release with very limited options available to it.  The user will be 
able to specify a non-default database.

Table Structure for database specific defaults and table specific thresholds:

databases_defaults: (will reference the pg_class system table)
        id                                      serial primary key
        exclude_database                        boolean
        default_vacuum_scaling_factor   float
        default_vacuum_base_value       int
        default_analyze_scaling_factor  float
        default_analyze_base_value      int
        dboid                                   oid references pg_database.oid
        
table_thresholds
        id                              serial primary key
        exclude_table                   boolean (exclude this table)
        vacuum_scaling_factor   float           (equivalent to  -v)
        vacuum_base_value               int             (equivalent to -V)
        vacuum_threshold                float           (if > 0, use this threshold)
        analyze_scaling_factor  float           (equivalent to -a)
        analyze_base_value              int             (equivalent to -A)
        analyze_threshold               float           (if > 0 use this threshold)
        relid                           oid references pg_classs.relid


2.Persistent pg_autovacuum Data:

Right now pg_autovacuum has no memory of what was going on the last time it was run.  
So if significant changes have happened while pg_autovacuum is not running, they will 
not be counted in the analysis of when to perform a vacuum or analyze operation which 
can result in under vacuuming.  So, pg_autovacuum should occasionally write down it's 
numbers to the database.  The data  will be stored in an additional table called 
table_data

table_data
        id                      serial primary key
        CountAtLastAnalyze long 
        CountAtLastVacuum long
        table_thresholds_id     int references table_thresholds


3.Single-Pass Mode (External Scheduling):

I have received requests to be able to run pg_autovacuum only on request (not as a 
daemon) making only one pass over all the tables (not looping indefinately).  The 
advantage being that it will operate more like the current vacuum command except that 
it will only vacuum tables that need to be vacuumed.  This feature could be useful as 
long as pg_autovacuum exists outside the backend.  If pg_autovacuum gets integrated 
into the backend and gets automatically started as a daemon during startup, then this 
option will no longer make sense.

Once we have persistent data (Step 2) then we can easily operate in Single-Pass Mode.


4.Off-Peak Scheduling:

A fundamental advantage of our vacuum system is that the work required to reclaim 
table space is taken out of the critical path and can be moved to and off-peak time 
when cycles are less precious.  One of the drawbacks of the current pg_autovacuum is 
that it doesn't have any way to factor this in.

In it's simplest form (which I will implement first) I would add the ability to add a 
second set of thresholds that will be active only during an ¡°off-peak¡± time that can 
be specified in the pg_autovacuum database, perhaps in a general_settings table.





A few notes about things I'm not planning on working on at the moment.

Better logging options:

An additional logging could would be added to the pg_autovacuum database and will log 
all activity (vacuums and analyzes) along with their corresponding duration.

Syslog support.  I'm not sure this is really needed, but a simple patch was submitted 
by one user and perhaps that can be reviewed / improved and applied.


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Reply via email to