Truncating a heap at the end of vacuum, to release unused space back to
the OS, currently requires taking an AccessExclusiveLock. Although the
lock is only held for a short duration, that can be enough to cause a
hiccup in query processing. Also, if there is a continuous stream
of queries on the table, autovacuum never succeeds in acquiring the
lock, and thus the table never gets truncated.
I'd like to eliminate the need for AccessExclusiveLock while truncating.
Design
------
In shared memory, keep two watermarks: a "soft" truncation watermark,
and a "hard" truncation watermark. If there is no truncation in
progress, the values are not set and everything works like today.
The soft watermark is the relation size (ie. number of pages) that
vacuum wants to truncate the relation to. Backends can read pages above
the soft watermark normally, but should refrain from inserting new
tuples there. However, it's OK to update a page above the soft
watermark, including adding new tuples, if the page is not completely
empty (vacuum will check and not truncate away non-empty pages). If a
backend nevertheless has to insert a new tuple into an empty page above
the soft watermark, for example because there is no free space left on
any lower-numbered page, it must grab the extension lock and raise the
soft watermark while holding it.
The hard watermark is the point above which there are guaranteed to be no
tuples. A backend must not try to read or write any pages above the hard
watermark - it should be thought of as the end of file for all practical
purposes. If a backend needs to write above the hard watermark, ie. to
extend the relation, it must first grab the extension lock, and raise
the hard watermark.
The hard watermark is always >= the soft watermark.
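For illustration, the extension path could look something like this
(GetCachedHardWatermark() and RaiseHardWatermark() are made-up names,
the rest are existing functions):

    LockRelationForExtension(relation, ExclusiveLock);
    buffer = ReadBuffer(relation, P_NEW);   /* allocates a new page at the end */
    if (GetCachedHardWatermark(relation) != InvalidBlockNumber)
        RaiseHardWatermark(relation, BufferGetBlockNumber(buffer) + 1);
    UnlockRelationForExtension(relation, ExclusiveLock);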
Shared memory space is limited, but we only need the watermarks for any
in-progress truncations. Let's keep them in shared memory, in a small
fixed-size array. That limits the number of concurrent truncations that
can be in progress, but that should be ok. To avoid slowing down common
backend operations, the values (or lack thereof) are cached in the relcache.
To sync the relcache when the values change, there will be a new shared
cache invalidation event to force backends to refresh the cached
watermark values. A backend (vacuum) can ensure that all backends see
the new value by first updating the value in shared memory, sending the
sinval message, and waiting until everyone has received it.
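For illustration, the shared memory part could be as simple as something
like this (the names and the slot count are made up):

    #define MAX_CONCURRENT_TRUNCATIONS 8

    typedef struct TruncationWatermark
    {
        Oid         relid;          /* relation being truncated, InvalidOid if slot is free */
        BlockNumber softWatermark;  /* don't put new tuples on empty pages at/above this */
        BlockNumber hardWatermark;  /* treat as end-of-file; always >= softWatermark */
    } TruncationWatermark;

    typedef struct TruncationWatermarkArray
    {
        slock_t     mutex;          /* protects the slots */
        TruncationWatermark slots[MAX_CONCURRENT_TRUNCATIONS];
    } TruncationWatermarkArray;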
With the watermarks, truncation works like this:
1. Set soft watermark to the point where we think we can truncate the
relation. Wait until everyone sees it (send sinval message, wait).
2. Scan the pages to verify they are still empty.
3. Grab extension lock. Set hard watermark to current soft watermark (a
backend might have inserted a tuple and raised the soft watermark while
we were scanning). Release lock.
4. Wait until everyone sees the new hard watermark.
5. Grab extension lock.
6. Check (or wait) that there are no pinned buffers above the current
hard watermark. (A backend might have a scan in progress that started
before any of this and still holds a buffer above the watermark pinned,
even though the page is empty.)
7. Truncate relation to the current hard watermark.
8. Release extension lock.
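In code, the vacuum side would look roughly like this (the watermark,
sinval-wait and buffer-pin helpers are all made-up names; only the
extension lock and RelationTruncate() calls exist today):

    static void
    lazy_truncate_heap_concurrently(Relation rel, BlockNumber new_rel_pages)
    {
        BlockNumber hard;

        /* 1. Announce the intent and wait until every backend sees it. */
        SetSoftWatermark(rel, new_rel_pages);
        CacheInvalidateWatermark(rel);      /* hypothetical sinval message */
        WaitForSinvalCatchup();

        /* 2. Re-check that the pages above the soft watermark are still empty. */
        if (!PagesAboveWatermarkAreEmpty(rel, new_rel_pages))
            return;     /* someone started using the space, give up */

        /* 3. Freeze the (possibly raised) soft watermark into the hard watermark. */
        LockRelationForExtension(rel, ExclusiveLock);
        hard = GetSoftWatermark(rel);
        SetHardWatermark(rel, hard);
        UnlockRelationForExtension(rel, ExclusiveLock);

        /* 4. Wait until every backend sees the new hard watermark. */
        CacheInvalidateWatermark(rel);
        WaitForSinvalCatchup();

        /* 5-8. Truncate, holding the extension lock. */
        LockRelationForExtension(rel, ExclusiveLock);
        WaitForUnpinnedBuffersAbove(rel, hard);     /* step 6 */
        RelationTruncate(rel, hard);                /* step 7 */
        UnlockRelationForExtension(rel, ExclusiveLock);
    }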
If a backend inserts a new tuple before the scan at step 2 reaches the
page, the scan will see the tuple and won't truncate that page. If the
tuple is inserted after the scan has passed, the backend's cached soft
watermark is already up-to-date (thanks to the wait in step 1), and thus
the backend will have raised the soft watermark before the insert. Either
way, after the scan at step 2 has finished, all pages above the current
soft watermark must still be empty.
Implementation details
----------------------
There are three kinds of access to a heap page:
A) As a target for a new tuple.
B) Following an index pointer, ctid or similar.
C) A sequential scan (and bitmap heap scan?)
To refrain from inserting new tuples into empty pages above the soft
watermark (A), RelationGetBufferForTuple() is modified to check the soft
watermark (and raise it if necessary).
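Roughly like this, at the point where we've settled on a candidate page
(GetCachedSoftWatermark() and RaiseSoftWatermark() are made-up names):

    softmark = GetCachedSoftWatermark(relation);
    if (softmark != InvalidBlockNumber &&
        targetBlock >= softmark &&
        PageIsEmpty(BufferGetPage(buffer)))
    {
        /*
         * We're about to put the first tuple on a page that vacuum wants
         * to truncate away. Raise the soft watermark past this page,
         * holding the extension lock, so that vacuum backs off.
         */
        LockRelationForExtension(relation, ExclusiveLock);
        RaiseSoftWatermark(relation, targetBlock + 1);
        UnlockRelationForExtension(relation, ExclusiveLock);
    }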
An index scan (B) should never try to read beyond the hard watermark,
because there are no tuples above it, and thus there should be no
pointers to pages above it either.
A sequential scan (C) must refrain from reading beyond the hard
watermark. This can be implemented by always checking the (cached) hard
watermark value before stepping to the next page.
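Something like this when stepping to the next page in heapgettup(), for
example (GetCachedHardWatermark() again being a made-up name):

    hardmark = GetCachedHardWatermark(scan->rs_rd);
    if (hardmark != InvalidBlockNumber && page >= hardmark)
    {
        /* everything at or above the hard watermark is logically beyond EOF */
        finished = true;
    }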
Truncation during hot standby is a lot simpler: set soft and hard
watermarks to the truncation point, wait until everyone sees the new
values, and truncate the relation.
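The redo routine for the truncation record would do roughly this (same
made-up helpers as above, keyed by the RelFileNode from the WAL record):

    SetSoftWatermark(rnode, xlrec->blkno);
    SetHardWatermark(rnode, xlrec->blkno);
    CacheInvalidateWatermark(rnode);
    WaitForSinvalCatchup();
    smgrtruncate(smgropen(rnode, InvalidBackendId), MAIN_FORKNUM, xlrec->blkno);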
Does anyone see a flaw in this?
- Heikki