Re: [HACKERS] pg_autovacuum next steps

Markus Bertheau Sun, 21 Mar 2004 16:10:58 -0800

Ð ÐÐÐ, 22.03.2004, Ð 00:12, Tom Lane ÐÐÑÐÑ:
> Please repost in some less proprietary format.  Plain text is generally
> considered the thing to use on this list.


-- 
Markus Bertheau <[EMAIL PROTECTED]>

pg_autovacuum Version 2
Design Document:

Exec Summary:
pg_autovacuum was initially released as a contrib module in PostgreSQL v7.4. The
version in 7.4 is by design very simple. No configuration is required, and very
little configuration is possible. Despite these limitations it was voted the most
popular new feature of PostgreSQL v7.4 according to the survey held on postgresql.org
(http://www.postgresql.org/survey.php?View=1&SurveyID=23).

Despite it's popularity there is much room for improvement. This document sets out to
define the most important improvements that would help pg_autovacuum to become a truly
powerful asset to the suite of tools that come with PostgreSQL.

Current Problems & Limitations:

Based on user feedback from people using pg_autovacuum in the field, and my own
observations, there are a number of problems and limitation with pg_autovacuum. They
are:

Inability to customize thresholds on a per table basis
Inability to set default thresholds on a per database basis
Inability to exclude specific databases / tables from pg_autovacuum monitoring
Inability to schedule vacuums during off-peak times
Lack of integration related to startup and shutdown
Ignorance of VACUUM and ANALYZE operations performed outside pg_autovacuum (requires
backend integration? or can listen / notify can be used?)
Lack of logging options / syslog integration / log rotation options
Create table fails because template1 is busy

I'm not sure how to address all of these concerns, or that they all should be
addressed right now. One of my big questions is backend integration. I am leaning
towards leaving pg_autovacuum as a client application in contrib for one more release.
During this time, I can continue to tweak and improve pg_autovacuum so that we will
have a very good idea what the final product should be before we make it a standard
backend process.

For PostgreSQL 7.5, I plan to implement these new features:

1.Per database defaults and per table thresholds (including total exclusion)
2.Persistent data
3.Single-Pass Mode (external scheduling from cron etc...)
4.Off peak scheduling

1. Per Database defaults and Per table Thresholds:

There are differing opinions as to the best way to providing these this feature. The
primary debate is where to save the configuration data. I see three options:

1.Store config data inside a special pg_autovacuum table inside existing databases
that wants custom settings.

2.Use a config file. This would require some additional coding to add the required
parsing, but is possible.

3.Create a pg_autovacuum database inside any cluster that wants to customize their
settings.

Since many people do not like tools that clutter their databases by adding tables, I
think option 1 (adding a pg_autovacuum table to existing databases) is right out.
Using a config file would be Ok, but would require additional parsing code. My
preference is option 3. Since pg_autovacuum will (hopefully) eventually become an
integrated part of the backend, it will eventually be able to add required data to the
system catalogs. Given these two premises, as long as pg_autovacuum remains a contrib
module it could use it's own database to mimic having system tables. If this database
exists, it will be used, if it does not exist, then pg_autovacuum will work just as it
did in the 7.4 release with very limited options available to it. The user will be
able to specify a non-default database.

Table Structure for database specific defaults and table specific thresholds:

databases_defaults: (will reference the pg_class system table)
id serial primary key
exclude_database boolean
default_vacuum_scaling_factor float
default_vacuum_base_value int
default_analyze_scaling_factor float
default_analyze_base_value int
dboid oid references pg_database.oid

table_thresholds
id serial primary key
exclude_table boolean (exclude this table)
vacuum_scaling_factor float (equivalent to -v)
vacuum_base_value int (equivalent to -V)
vacuum_threshold float (if > 0, use this threshold)
analyze_scaling_factor float (equivalent to -a)
analyze_base_value int (equivalent to -A)
analyze_threshold float (if > 0 use this threshold)
relid oid references pg_classs.relid

2.Persistent pg_autovacuum Data:

Right now pg_autovacuum has no memory of what was going on the last time it was run.
So if significant changes have happened while pg_autovacuum is not running, they will
not be counted in the analysis of when to perform a vacuum or analyze operation which
can result in under vacuuming. So, pg_autovacuum should occasionally write down it's
numbers to the database. The data will be stored in an additional table called
table_data

table_data
id serial primary key
CountAtLastAnalyze long
CountAtLastVacuum long
table_thresholds_id int references table_thresholds

3.Single-Pass Mode (External Scheduling):

I have received requests to be able to run pg_autovacuum only on request (not as a
daemon) making only one pass over all the tables (not looping indefinately). The
advantage being that it will operate more like the current vacuum command except that
it will only vacuum tables that need to be vacuumed. This feature could be useful as
long as pg_autovacuum exists outside the backend. If pg_autovacuum gets integrated
into the backend and gets automatically started as a daemon during startup, then this
option will no longer make sense.

Once we have persistent data (Step 2) then we can easily operate in Single-Pass Mode.

4.Off-Peak Scheduling:

A fundamental advantage of our vacuum system is that the work required to reclaim
table space is taken out of the critical path and can be moved to and off-peak time
when cycles are less precious. One of the drawbacks of the current pg_autovacuum is
that it doesn't have any way to factor this in.

In it's simplest form (which I will implement first) I would add the ability to add a
second set of thresholds that will be active only during an ¡°off-peak¡± time that can
be specified in the pg_autovacuum database, perhaps in a general_settings table.

A few notes about things I'm not planning on working on at the moment.

Better logging options:

An additional logging could would be added to the pg_autovacuum database and will log
all activity (vacuums and analyzes) along with their corresponding duration.

Syslog support. I'm not sure this is really needed, but a simple patch was submitted
by one user and perhaps that can be reviewed / improved and applied.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] pg_autovacuum next steps

Reply via email to