Truncating a heap at the end of vacuum, to release unused space back to
the OS, currently requires taking an AccessExclusiveLock. Although the lock is only held for a short time, that can be enough to cause a hiccup in query processing. Also, if there is a continuous stream of queries on the table, autovacuum never succeeds in acquiring the lock, and thus the table never gets truncated.

I'd like to eliminate the need for AccessExclusiveLock while truncating.

Design
------

In shared memory, keep two watermarks: a "soft" truncation watermark, and a "hard" truncation watermark. If there is no truncation in progress, the values are not set and everything works like today.

The soft watermark is the relation size (i.e. number of pages) that vacuum wants to truncate the relation to. Backends can read pages above the soft watermark normally, but should refrain from inserting new tuples there. However, it's OK to update a page above the soft watermark, including adding new tuples, if the page is not completely empty (vacuum will check and not truncate away non-empty pages). If a backend nevertheless has to insert a new tuple into an empty page above the soft watermark, for example because there is no more free space in any lower-numbered page, it must grab the extension lock and raise the soft watermark while holding it.

The hard watermark is the point above which there are guaranteed to be no tuples. A backend must not try to read or write any pages above the hard watermark - it should be thought of as the end of file for all practical purposes. If a backend needs to write above the hard watermark, i.e. to extend the relation, it must first grab the extension lock and raise the hard watermark.

The hard watermark is always >= the soft watermark.

Shared memory space is limited, but we only need the watermarks for any in-progress truncations. Let's keep them in shared memory, in a small fixed-size array. That limits the number of concurrent truncations that can be in-progress, but that should be ok. To not slow down common backend operations, the values (or lack thereof) are cached in relcache. To sync the relcache when the values change, there will be a new shared cache invalidation event to force backends to refresh the cached watermark values. A backend (vacuum) can ensure that all backends see the new value by first updating the value in shared memory, sending the sinval message, and waiting until everyone has received it.
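
Concretely, I'm imagining bookkeeping along the lines of the structs below. This is only a standalone sketch to make the discussion concrete - none of these names exist in the tree, and the real thing would of course use BlockNumber, proper locking, etc.:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNum;               /* stand-in for BlockNumber */
    #define INVALID_BLOCK ((BlockNum) 0xFFFFFFFF)

    #define MAX_CONCURRENT_TRUNCATIONS 8     /* size of the fixed shared array */

    /* One slot per in-progress truncation, kept in shared memory. */
    typedef struct TruncationWatermarks
    {
        uint32_t  rel_id;          /* which relation; 0 means the slot is free */
        BlockNum  soft_watermark;  /* don't put new tuples on empty pages >= this */
        BlockNum  hard_watermark;  /* no tuples at all >= this; treat as EOF */
    } TruncationWatermarks;

    typedef struct WatermarkShmemArray
    {
        TruncationWatermarks slots[MAX_CONCURRENT_TRUNCATIONS];
    } WatermarkShmemArray;

    /* Per-backend cached copy, refreshed when the new sinval message arrives. */
    typedef struct CachedWatermarks
    {
        bool      valid;           /* false until refreshed from shared memory */
        BlockNum  soft_watermark;  /* INVALID_BLOCK if no truncation in progress */
        BlockNum  hard_watermark;
    } CachedWatermarks;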

With the watermarks, truncation works like this:

1. Set soft watermark to the point where we think we can truncate the relation. Wait until everyone sees it (send sinval message, wait).

2. Scan the pages to verify they are still empty.

3. Grab extension lock. Set hard watermark to current soft watermark (a backend might have inserted a tuple and raised the soft watermark while we were scanning). Release lock.

4. Wait until everyone sees the new hard watermark.

5. Grab extension lock.

6. Check (or wait) that there are no pinned buffers above the current hard watermark. (A backend might still have a scan in progress that started before any of this, holding a buffer above the watermark pinned, even though the page is empty.)

7. Truncate relation to the current hard watermark.

8. Release extension lock.


If a backend inserts a new tuple before step 2, the vacuum scan will see it. If the tuple is inserted after step 2, the backend's cached soft watermark is already up-to-date, so the backend will update the soft watermark before the insert. Thus, after vacuum has finished the scan in step 2, all pages above the current soft watermark must still be empty.
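
In pseudo-C, the vacuum side would look roughly like the function below (continuing the sketch above; the helper functions are placeholders for "update shared memory", "send the new sinval message and wait until everyone has processed it", and so on - none of them exist today):

    /* Placeholder declarations; bodies elided in this sketch. */
    void     set_soft_watermark(uint32_t rel_id, BlockNum blk);
    void     set_hard_watermark(uint32_t rel_id, BlockNum blk);
    BlockNum get_soft_watermark(uint32_t rel_id);
    BlockNum get_hard_watermark(uint32_t rel_id);
    void     send_sinval_and_wait(uint32_t rel_id);
    void     lock_for_extension(uint32_t rel_id);
    void     unlock_for_extension(uint32_t rel_id);
    void     verify_pages_still_empty(uint32_t rel_id, BlockNum from);
    void     wait_for_no_pins_above(uint32_t rel_id, BlockNum blk);
    void     truncate_to(uint32_t rel_id, BlockNum new_size);

    void
    truncate_without_access_exclusive_lock(uint32_t rel_id, BlockNum wanted_size)
    {
        /* 1. Publish the soft watermark; wait until every backend has seen it. */
        set_soft_watermark(rel_id, wanted_size);
        send_sinval_and_wait(rel_id);

        /* 2. Re-check that the pages above the soft watermark are still empty. */
        verify_pages_still_empty(rel_id, wanted_size);

        /* 3. Freeze the (possibly raised) soft watermark as the hard watermark. */
        lock_for_extension(rel_id);
        set_hard_watermark(rel_id, get_soft_watermark(rel_id));
        unlock_for_extension(rel_id);

        /* 4. Wait until everyone sees the new hard watermark. */
        send_sinval_and_wait(rel_id);

        /*
         * 5.-8. Truncate under the extension lock, once no stray pins remain
         * above the hard watermark.
         */
        lock_for_extension(rel_id);
        wait_for_no_pins_above(rel_id, get_hard_watermark(rel_id));
        truncate_to(rel_id, get_hard_watermark(rel_id));
        unlock_for_extension(rel_id);
    }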


Implementation details
----------------------

There are three kinds of access to a heap page:

A) As a target for a new tuple.
B) Following an index pointer, ctid or similar.
C) A sequential scan (and bitmap heap scan?)


To keep backends from inserting new tuples to empty pages above the soft watermark (A), RelationGetBufferForTuple() is modified to check the soft watermark (and raise it if necessary).
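
Roughly like this, reusing the cached struct from the earlier sketch (this is not the real RelationGetBufferForTuple(), it only shows where the soft watermark check would go):

    /* Can this page be used as the target for a new tuple right away? */
    bool
    page_ok_for_new_tuple(const CachedWatermarks *cached, BlockNum blkno,
                          bool page_is_empty)
    {
        /* No truncation in progress: behave exactly as today. */
        if (!cached->valid || cached->soft_watermark == INVALID_BLOCK)
            return true;

        /* Pages below the soft watermark are always fair game. */
        if (blkno < cached->soft_watermark)
            return true;

        /*
         * Above the soft watermark, adding tuples to a page that already has
         * some is fine; vacuum will see that the page is non-empty and won't
         * truncate it away.
         */
        if (!page_is_empty)
            return true;

        /*
         * Inserting into an empty page above the soft watermark requires
         * first grabbing the extension lock and raising the soft watermark
         * past this page.
         */
        return false;
    }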

An index scan (B) should never try to read beyond the hard watermark, because there are no tuples above it, and thus there should be no pointers to pages above it either.

A sequential scan (C) must refrain from reading beyond the hard watermark. This can be implemented by always checking the (cached) hard watermark value before stepping to the next page.
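
Something like this, again just to illustrate (the real check would live in heapam's "get next page" logic):

    /* Is it OK for a sequential scan to step to next_blkno? */
    bool
    seqscan_page_in_bounds(const CachedWatermarks *cached, BlockNum next_blkno,
                           BlockNum relation_size)
    {
        BlockNum effective_end = relation_size;

        /* If a truncation is in progress, the hard watermark is the EOF. */
        if (cached->valid && cached->hard_watermark != INVALID_BLOCK &&
            cached->hard_watermark < effective_end)
            effective_end = cached->hard_watermark;

        return next_blkno < effective_end;
    }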


Truncation during hot standby is a lot simpler: set soft and hard watermarks to the truncation point, wait until everyone sees the new values, and truncate the relation.
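
With the same made-up helpers as in the vacuum sketch above, the standby side would be roughly:

    void
    redo_truncate_on_standby(uint32_t rel_id, BlockNum new_size)
    {
        /*
         * The primary has already verified that the pages are empty; just
         * publish the watermarks, wait for everyone to see them, and cut
         * the file.
         */
        set_soft_watermark(rel_id, new_size);
        set_hard_watermark(rel_id, new_size);
        send_sinval_and_wait(rel_id);
        truncate_to(rel_id, new_size);
    }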


Does anyone see a flaw in this?

- Heikki

