Truncating a heap at the end of vacuum, to release unused space back to
the OS, currently requires taking an AccessExclusiveLock. Although the lock is only held for a short time, that can be enough to cause a hiccup in query processing. Also, if there is a continuous stream of queries on the table, autovacuum never succeeds in acquiring the lock, and thus the table never gets truncated.

I'd like to eliminate the need for AccessExclusiveLock while truncating.

Design
------

In shared memory, keep two watermarks: a "soft" truncation watermark, and a "hard" truncation watermark. If there is no truncation in progress, the values are not set and everything works like today.

The soft watermark is the relation size (i.e. number of pages) that vacuum wants to truncate the relation to. Backends can read pages above the soft watermark normally, but should refrain from inserting new tuples there. However, it's OK to update a page above the soft watermark, including adding new tuples, if the page is not completely empty (vacuum will check and not truncate away non-empty pages). If a backend nevertheless has to insert a new tuple into an empty page above the soft watermark, for example because there is no more free space in any lower-numbered page, it must grab the extension lock and raise the soft watermark while holding it.

The hard watermark is the point above which there are guaranteed to be no tuples. A backend must not try to read or write any pages above the hard watermark - it should be thought of as the end of file for all practical purposes. If a backend needs to write above the hard watermark, i.e. to extend the relation, it must first grab the extension lock and raise the hard watermark.

The hard watermark is always >= the soft watermark.

Shared memory space is limited, but we only need the watermarks for any in-progress truncations. Let's keep them in shared memory, in a small fixed-size array. That limits the number of concurrent truncations that can be in-progress, but that should be ok. To not slow down common backend operations, the values (or lack thereof) are cached in relcache. To sync the relcache when the values change, there will be a new shared cache invalidation event to force backends to refresh the cached watermark values. A backend (vacuum) can ensure that all backends see the new value by first updating the value in shared memory, sending the sinval message, and waiting until everyone has received it.
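
Concretely, I'm imagining bookkeeping along the lines of the structs below. This is only a standalone sketch to make the discussion concrete - none of these names exist in the tree, and the real thing would of course use BlockNumber, proper locking, etc.:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNum;               /* stand-in for BlockNumber */
    #define INVALID_BLOCK ((BlockNum) 0xFFFFFFFF)

    #define MAX_CONCURRENT_TRUNCATIONS 8     /* size of the fixed shared array */

    /* One slot per in-progress truncation, kept in shared memory. */
    typedef struct TruncationWatermarks
    {
        uint32_t  rel_id;          /* which relation; 0 means the slot is free */
        BlockNum  soft_watermark;  /* don't put new tuples on empty pages >= this */
        BlockNum  hard_watermark;  /* no tuples at all >= this; treat as EOF */
    } TruncationWatermarks;

    typedef struct WatermarkShmemArray
    {
        TruncationWatermarks slots[MAX_CONCURRENT_TRUNCATIONS];
    } WatermarkShmemArray;

    /* Per-backend cached copy, refreshed when the new sinval message arrives. */
    typedef struct CachedWatermarks
    {
        bool      valid;           /* false until refreshed from shared memory */
        BlockNum  soft_watermark;  /* INVALID_BLOCK if no truncation in progress */
        BlockNum  hard_watermark;
    } CachedWatermarks;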

With the watermarks, truncation works like this:

1. Set soft watermark to the point where we think we can truncate the relation. Wait until everyone sees it (send sinval message, wait).

2. Scan the pages to verify they are still empty.

3. Grab extension lock. Set hard watermark to current soft watermark (a backend might have inserted a tuple and raised the soft watermark while we were scanning). Release lock.

4. Wait until everyone sees the new hard watermark.

5. Grab extension lock.

6. Check (or wait) that there are no pinned buffers above the current hard watermark. (A backend might still have a scan in progress that started before any of this, holding a buffer above the watermark pinned, even though the page is empty.)

7. Truncate relation to the current hard watermark.

8. Release extension lock.


If a backend inserts a new tuple before step 2, the vacuum scan will see it. If the tuple is inserted after step 2, the backend's cached soft watermark is already up-to-date, so the backend will update the soft watermark before the insert. Thus, after vacuum has finished the scan in step 2, all pages above the current soft watermark must still be empty.
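
In pseudo-C, the vacuum side would look roughly like the function below (continuing the sketch above; the helper functions are placeholders for "update shared memory", "send the new sinval message and wait until everyone has processed it", and so on - none of them exist today):

    /* Placeholder declarations; bodies elided in this sketch. */
    void     set_soft_watermark(uint32_t rel_id, BlockNum blk);
    void     set_hard_watermark(uint32_t rel_id, BlockNum blk);
    BlockNum get_soft_watermark(uint32_t rel_id);
    BlockNum get_hard_watermark(uint32_t rel_id);
    void     send_sinval_and_wait(uint32_t rel_id);
    void     lock_for_extension(uint32_t rel_id);
    void     unlock_for_extension(uint32_t rel_id);
    void     verify_pages_still_empty(uint32_t rel_id, BlockNum from);
    void     wait_for_no_pins_above(uint32_t rel_id, BlockNum blk);
    void     truncate_to(uint32_t rel_id, BlockNum new_size);

    void
    truncate_without_access_exclusive_lock(uint32_t rel_id, BlockNum wanted_size)
    {
        /* 1. Publish the soft watermark; wait until every backend has seen it. */
        set_soft_watermark(rel_id, wanted_size);
        send_sinval_and_wait(rel_id);

        /* 2. Re-check that the pages above the soft watermark are still empty. */
        verify_pages_still_empty(rel_id, wanted_size);

        /* 3. Freeze the (possibly raised) soft watermark as the hard watermark. */
        lock_for_extension(rel_id);
        set_hard_watermark(rel_id, get_soft_watermark(rel_id));
        unlock_for_extension(rel_id);

        /* 4. Wait until everyone sees the new hard watermark. */
        send_sinval_and_wait(rel_id);

        /*
         * 5.-8. Truncate under the extension lock, once no stray pins remain
         * above the hard watermark.
         */
        lock_for_extension(rel_id);
        wait_for_no_pins_above(rel_id, get_hard_watermark(rel_id));
        truncate_to(rel_id, get_hard_watermark(rel_id));
        unlock_for_extension(rel_id);
    }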


Implementation details
----------------------

There are three kinds of access to a heap page:

A) As a target for a new tuple.
B) Following an index pointer, ctid or similar.
C) A sequential scan (and bitmap heap scan?)


To keep backends from inserting new tuples to empty pages above the soft watermark (A), RelationGetBufferForTuple() is modified to check the soft watermark (and raise it if necessary).
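
Roughly like this, reusing the cached struct from the earlier sketch (this is not the real RelationGetBufferForTuple(), it only shows where the soft watermark check would go):

    /* Can this page be used as the target for a new tuple right away? */
    bool
    page_ok_for_new_tuple(const CachedWatermarks *cached, BlockNum blkno,
                          bool page_is_empty)
    {
        /* No truncation in progress: behave exactly as today. */
        if (!cached->valid || cached->soft_watermark == INVALID_BLOCK)
            return true;

        /* Pages below the soft watermark are always fair game. */
        if (blkno < cached->soft_watermark)
            return true;

        /*
         * Above the soft watermark, adding tuples to a page that already has
         * some is fine; vacuum will see that the page is non-empty and won't
         * truncate it away.
         */
        if (!page_is_empty)
            return true;

        /*
         * Inserting into an empty page above the soft watermark requires
         * first grabbing the extension lock and raising the soft watermark
         * past this page.
         */
        return false;
    }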

An index scan (B) should never try to read beyond the hard watermark, because there are no tuples above it, and thus there should be no pointers to pages above it either.

A sequential scan (C) must refrain from reading beyond the hard watermark. This can be implemented by always checking the (cached) hard watermark value before stepping to the next page.
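
Something like this, again just to illustrate (the real check would live in heapam's "get next page" logic):

    /* Is it OK for a sequential scan to step to next_blkno? */
    bool
    seqscan_page_in_bounds(const CachedWatermarks *cached, BlockNum next_blkno,
                           BlockNum relation_size)
    {
        BlockNum effective_end = relation_size;

        /* If a truncation is in progress, the hard watermark is the EOF. */
        if (cached->valid && cached->hard_watermark != INVALID_BLOCK &&
            cached->hard_watermark < effective_end)
            effective_end = cached->hard_watermark;

        return next_blkno < effective_end;
    }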


Truncation during hot standby is a lot simpler: set soft and hard watermarks to the truncation point, wait until everyone sees the new values, and truncate the relation.
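
With the same made-up helpers as in the vacuum sketch above, the standby side would be roughly:

    void
    redo_truncate_on_standby(uint32_t rel_id, BlockNum new_size)
    {
        /*
         * The primary has already verified that the pages are empty; just
         * publish the watermarks, wait for everyone to see them, and cut
         * the file.
         */
        set_soft_watermark(rel_id, new_size);
        set_hard_watermark(rel_id, new_size);
        send_sinval_and_wait(rel_id);
        truncate_to(rel_id, new_size);
    }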


Does anyone see a flaw in this?

- Heikki

