Alvaro Herrera wrote:

> This is a preliminary proposal for Minmax indexes.  I'm experimenting
> with the code, but it's too crude to post yet, so here's a document
> explaining what they are and how they work, so that reviewers can poke
> holes to have the design improved.

After going further with the implementation, I have added a new concept,
the reverse range map.

Reverse Range Map
-----------------

The reverse range map is a separate fork that each Minmax index has.  Its
purpose is to provide a quick way to find the index tuple corresponding to a
page range: for a given heap page number, there is an O(1) way to obtain the
TID of the corresponding index tuple.
 
To scan the index, we first obtain the TID of the index tuple for page 0.  If
this is a valid TID, we read that tuple to determine the min/max bounds
for this page range.  If it is invalid, then the range is unsummarized,
and the scan must return the whole range as needing scan.  After this
index entry has been processed, we obtain the TID of the index tuple for
page 0+pagesPerRange (currently this is a compile-time constant, but
there's no reason it couldn't be a per-index property), and so on, adding
pagesPerRange each time until we reach the end of the heap.

To read the TID during an index scan, we follow this protocol:

* read revmap page
* obtain share lock on the buffer
* read the TID
* LockTuple that TID (using the index as relation).  A shared lock is
  sufficient.  We need the LockTuple to prevent VACUUM from recycling
  the index tuple; see below.
* release buffer lock
* read the index tuple
* release the tuple lock
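In pseudocode, the steps above look roughly like this (the helper names are
illustrative, not the actual functions):

```
/* read the summary tuple for the range containing heapBlk */
buf  = read revmap page covering heapBlk
LockBuffer(buf, BUFFER_LOCK_SHARE)
tid  = TID stored in buf's slot for heapBlk's range
LockTuple(index, tid, ShareLock)      /* keeps VACUUM from recycling tid */
LockBuffer(buf, BUFFER_LOCK_UNLOCK)   /* revmap page no longer needed */
itup = fetch index tuple at tid       /* the min/max summary */
UnlockTuple(index, tid, ShareLock)
```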

On heap insert/update, it is normally cheaper to update the summary tuple
(grab the old tuple, expand the min/max range according to the new value
being inserted, insert the new index tuple, update the reverse range map)
than letting the range be unsummarized, which would require scanning the
pages in the range.

[However, when many updates on a range are going to be done, it'd be
better to mark it as unsummarized (i.e. set the revmap TID to Invalid)
and do a resummarization later, to prevent the index from bloating.
Also, it'd be sensible to allow postponing of the index update, if many
tuples are inserted; something like GIN's "pending list".  We would need
to keep track of the TIDs of the inserted heap tuples, so that we can
figure out the new min/max values, without having to scan the whole
range.]

To update the summary tuple for a page range, we use this protocol:

* insert a new index tuple anywhere; note its TID
* read revmap page
* obtain exclusive lock on buffer
* write the TID
* release lock

This ensures no concurrent reader can obtain a partially-written TID.
Note we don't need a tuple lock here.  Concurrent scans don't have to
worry about whether they got the old or the new index tuple: if they get
the old one, its tighter values are okay from a correctness standpoint,
because due to MVCC they can't possibly see the just-inserted heap tuples
anyway.
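The update side, sketched the same way (again with illustrative names):

```
itup   = build summary tuple with the expanded min/max values
newTid = insert itup into any index page with free space
buf    = read revmap page covering the range
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE)
store newTid in buf's slot for the range  /* old tuple now unreferenced;
                                           * vacuum reclaims it later */
LockBuffer(buf, BUFFER_LOCK_UNLOCK)
```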

For vacuuming, we need to figure out which index tuples are no longer
referenced from the reverse range map.  This requires some brute force,
but is simple:

1) scan the complete index, storing each existing TID in a dynahash.
   The hash key is the TID; the hash value is a boolean, initially set
   to false.
2) scan the complete revmap sequentially, reading the TIDs on each page.  A
   share lock on each page is sufficient.  For each TID so obtained, grab
   the element from the hash and set the boolean to true.
3) Scan the index again; for each tuple found, search the hash table.
   If the tuple is not present, it must have been added after our initial
   scan; ignore it.  If the hash flag is true, then the tuple is referenced;
   ignore it.  If the hash flag is false, then the index tuple is no longer
   referenced by the revmap; but it could be about to be accessed by a
   concurrent scan.  Do ConditionalLockTuple.  If this fails, ignore the
   tuple; it will be deleted by a future vacuum.  If the lock is acquired,
   then we can safely remove the index tuple.
4) Index pages with free space can be detected by this second scan.  Register
   those with the FSM.

Note this doesn't require scanning the heap at all, or being involved in
the heap's cleanup procedure.  (In particular, tables with only minmax
indexes do not require the removed tuples' TIDs to be collected during
the heap cleanup pass.)   XXX I think this is wrong in detail; we
probably need to be using LockBufferForCleanup.

With the reverse range map it is very easy to see which page ranges in
the heap need resummarization; it's the ones marked with InvalidTid.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

