---
 src/backend/replication/logical/DESIGN.txt         | 603 +++++++++++++++++++++
 src/backend/replication/logical/Makefile           |   6 +
 .../replication/logical/README.SNAPBUILD.txt       | 298 ++++++++++
 3 files changed, 907 insertions(+)
 create mode 100644 src/backend/replication/logical/DESIGN.txt
 create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt

diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt
new file mode 100644
index 0000000..8383602
--- /dev/null
+++ b/src/backend/replication/logical/DESIGN.txt
@@ -0,0 +1,603 @@
+//-*- mode: adoc -*-
+= High Level Design for Logical Replication in Postgres =
+:copyright: PostgreSQL Global Development Group 2012
+:author: Andres Freund, 2ndQuadrant Ltd.
+:email: and...@2ndquadrant.com
+
+== Introduction ==
+
+This document first explains why we think postgres needs another replication
+solution and what, in our opinion, that solution needs to offer. It then
+sketches out our proposed implementation.
+
+An earlier version of this design document discussed the implementation of
+four parts of a replication solution:
+
+1. Source data generation
+1. Transportation of that data
+1. Applying the changes
+1. Conflict resolution
+
+This version only discusses the first part in detail, as it is an independent
+and complex component usable for a wide range of use cases, and the one we want
+to get included into postgres as a first step.
+
+=== Previous discussions ===
+
+There are two rather large threads discussing several parts of the initial
+prototype and proposed architecture:
+
+- http://archives.postgresql.org/message-id/201206131327.24092.and...@2ndquadrant.com[Logical Replication/BDR prototype and architecture]
+- http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL]
+
+Those discussions led to some fundamental design changes which are presented in this document.
+
+=== Changes from v1 ===
+* At least a partial decoding step required/possible on the source system
+* No intermediate ("schema only") instances required
+* DDL handling, without event triggers
+* A very simple text conversion is provided for debugging/demo purposes
+* Smaller scope
+
+== Existing approaches to replication in Postgres ==
+
+If any currently used approach to replication can be made to support every
+use case/feature we need, it is likely not a good idea to implement something
+different. Three basic approaches are in use in/around postgres today:
+
+. Trigger based
+. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery]
+. Statement based
+
+Statement based replication has obvious and known problems with consistency and
+correctness, making it hard to use in the general case, so we will not discuss
+it further here.
+
+Let's have a look at the advantages/disadvantages of the other two approaches:
+
+=== Trigger based Replication ===
+
+This variant has a multitude of significant advantages:
+
+* implementable in userspace
+* easy to customize
+* just about everything can be made configurable
+* cross version support
+* cross architecture support
+* can feed into systems other than postgres
+* no overhead from writes to non-replicated tables
+* writable standbys
+* mature solutions
+* multimaster implementations possible & existing
+
+But also a number of disadvantages, some of them very hard to solve:
+
+* essentially doubles the amount of writes (or even more!)
+* synchronous replication hard or impossible to implement
+* noticeable CPU overhead
+** trigger functions
+** text conversion of data
+* complex parts implemented in several solutions
+* not in core
+
+Especially the higher amount of writes might seem easy to solve at first
+glance, but a solution not using a normal transactional table for its log/queue
+has to solve a lot of problems. The major ones are:
+
+* crash safety, restartability & spilling to disk
+* consistency with the commit status of transactions
+* only a minimal amount of synchronous work should be done inside individual
+transactions
+
+In our opinion those problems are restricting progress/wider distribution of
+this class of solutions. It is our aim though that existing solutions in this
+space - most prominently slony and londiste - can benefit from the work we are
+doing & planning to do by incorporating at least parts of the changeset
+generation infrastructure.
+
+=== Recovery based Replication ===
+
+This type of solution, being built into postgres and of increasing popularity,
+has and will have its use cases and we do not aim to replace it but to
+complement it. We plan to reuse some of the infrastructure and to make it
+possible to mix both modes of replication.
+
+Advantages:
+
+* builtin
+* built on existing infrastructure from crash recovery
+* efficient
+** minimal CPU, memory overhead on primary
+** low amount of additional writes
+* synchronous operation mode
+* low maintenance once set up
+* handles DDL
+
+Disadvantages:
+
+* standbys are read only
+* no cross version support
+* no cross architecture support
+* no replication into foreign systems
+* hard to customize
+* not configurable on the level of database, tables, ...
+
+== Goals ==
+
+As the previous short survey of the two major interesting classes of
+replication solutions shows, there is a significant gap between them. Our aim
+is to make it smaller.
+
+We aim for:
+
+* in core
+* low CPU overhead
+* low storage overhead
+* asynchronous, optionally synchronous operation modes
+* robust
+* modular
+* basis for other technologies (sharding, replication into other DBMS's, ...)
+* basis for at least one multi-master solution
+* make the implementation as unintrusive as possible, but not more
+
+== New Architecture ==
+
+=== Overview ===
+
+Our proposal is to reuse the basic principle of WAL based replication, namely
+reusing data that already needs to be written for another purpose, and extend
+it to allow most, but not all, of the flexibility of trigger based solutions.
+We want to do that by decoding the WAL back into a non-physical form.
+
+To get the flexibility we and others want, we propose that the last step of
+changeset generation, transforming it into a format that can be used by the
+replication consumer, is done in an extensible manner. In the schema below the
+part that does that is described as 'Output Plugin'. To keep the amount of
+duplication between different plugins as low as possible the plugin should only
+do a very limited amount of work.
+
+The following paragraphs contain the reasoning for the individual design
+decisions made and their high-level design.
+
+=== Schematics ===
+
+The basic proposed architecture for changeset extraction is presented in the
+following diagram. The first part should look familiar to anyone knowing
+postgres' architecture. The second is where most of the new magic happens.
+
+[[basic-schema]]
+.Architecture Schema
+["ditaa"]
+------------------------------------------------------------------------------
+        Traditional Stuff
+
+ +---------+---------+---------+---------+----+
+ | Backend | Backend | Backend | Autovac | ...|
+ +----+----+---+-----+----+----+----+----+-+--+
+      |        |          |         |      |
+      +------+ | +--------+         |      |
+    +-+      | | | +----------------+      |
+    |        | | | |                       |
+    |        v v v v                       |
+    |     +------------+                   |
+    |     | WAL writer |<------------------+
+    |     +------------+
+    |       | | | | |
+    v       v v v v v       +-------------------+
++--------+ +---------+   +->| Startup/Recovery  |
+|{s}     | |{s}      |   |  +-------------------+
+|Catalog | |   WAL   |---+->| SR/Hot Standby    |
+|        | |         |   |  +-------------------+
++--------+ +---------+   +->| Point in Time     |
+    ^          |            +-------------------+
+ ---|----------|--------------------------------
+    |       New Stuff
++---+          |
+|              v            Running separately
+| +----------------+  +=-------------------------+
+| | Walsender  |   |  |                          |
+| |            v   |  |    +-------------------+ |
+| +-------------+  |  | +->| Logical Rep.      | |
+| |     WAL     |  |  | |  +-------------------+ |
++-|  decoding   |  |  | +->| Multimaster       | |
+| +------+------/  |  | |  +-------------------+ |
+| |            |   |  | +->| Slony             | |
+| |            v   |  | |  +-------------------+ |
+| +-------------+  |  | +->| Auditing          | |
+| |     TX      |  |  | |  +-------------------+ |
++-| reassembly  |  |  | +->| Mysql/...         | |
+| +-------------/  |  | |  +-------------------+ |
+| |            |   |  | +->| Custom Solutions  | |
+| |            v   |  | |  +-------------------+ |
+| +-------------+  |  | +->| Debugging         | |
+| |   Output    |  |  | |  +-------------------+ |
++-|   Plugin    |--|--|-+->| Data Recovery     | |
+  +-------------/  |  |    +-------------------+ |
+  |                |  |                          |
+  +----------------+  +--------------------------|
+------------------------------------------------------------------------------
+
+=== WAL enrichment ===
+
+To be able to decode individual WAL records, at the very least they need to
+contain enough information to reconstruct what has happened to which row. The
+action is already encoded in the WAL record's header in most cases.
+
+As an example of missing data, the WAL record emitted when a row gets deleted
+only contains its physical location. At the very least we need a way to
+identify the deleted row: in a relational database the minimal amount of data
+that does that should be the primary key footnote:[Yes, there are use cases
+where the whole row is needed, or where no primary key can be found].
+
+We propose that for now it is enough to extend the relevant WAL records with
+additional data when the newly introduced 'wal_level = logical' is set.
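+
+Purely as an illustration (the concrete record layout is left to the patch and
+the names below are invented), the extra payload of a heap DELETE record under
+'wal_level = logical' could carry something like:
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdint.h>
+
+/*
+ * Hypothetical sketch: additional data appended to a heap DELETE record
+ * when wal_level = logical, so the deleted row can be identified during
+ * decoding.  The real patch extends the existing record formats instead.
+ */
+typedef struct xl_heap_delete_logical
+{
+	uint16_t	pkey_natts;		/* number of primary key columns logged */
+	uint32_t	pkey_len;		/* length of the logged key data, in bytes */
+	/* pkey_len bytes of primary key column data follow the struct */
+} xl_heap_delete_logical;
+------------------------------------------------------------------------------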
+
+It has previously been argued on the hackers mailing list that a generic 'WAL
+record annotation' mechanism might be a good thing. That mechanism would allow
+attaching arbitrary data to individual wal records, making it easier to extend
+postgres to support something like what we propose. While we don't oppose that
+idea we think it is a largely orthogonal issue to this proposal as a whole,
+because the format of WAL records is version dependent by nature and the
+necessary changes for our simple approach are small, so not much effort is lost.
+
+A full annotation capability is a complex endeavour on its own as the parts of
+the code generating the relevant WAL records have somewhat complex requirements
+and cannot easily be configured from the outside.
+
+Currently this is contained in the http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-and...@2ndquadrant.com[Log enough data into the wal to reconstruct logical changes from it] patch.
+
+=== WAL parsing & decoding ===
+
+The main complexity when reading the WAL as stored on disk is that the format
+is somewhat involved and the existing parser is too deeply integrated into the
+recovery system to be directly reusable. Once a reusable parser exists,
+decoding the binary data into individual WAL records is a small problem.
+
+Currently two competing proposals for this module exist, each having its own
+merits. In the grand scheme of this proposal it is irrelevant which one gets
+picked as long as the functionality gets integrated.
+
+The mailing list post
+http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-and...@2ndquadrant.com[Add
+support for a generic wal reading facility dubbed XLogReader] contains both
+competing patches and discussion around which one is preferable.
+
+Once the WAL has been decoded into individual records two major issues exist:
+
+1. records from different transactions and even individual user level actions
+are intermingled
+1. the data attached to records cannot be interpreted on its own; it is only
+meaningful together with a lot of additional information (including table,
+columns, types and more)
+
+The solution to the first issue is described in the next section: <<tx-reassembly>>
+
+The second problem is probably the reason why no mature solution to reuse the
+WAL for logical changeset generation exists today. See the <<snapbuilder>>
+paragraph for some details.
+
+As decoding, transaction reassembly and snapshot building are interdependent
+they are currently implemented in the same patch:
+http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-and...@2ndquadrant.com[Introduce
+wal decoding via catalog timetravel]
+
+That patch also includes a small demonstration that the approach works in the
+presence of DDL:
+
+[[example-of-decoding]]
+.Decoding example
+[NOTE]
+---------------------------
+/* just so we keep a sensible xmin horizon */
+ROLLBACK PREPARED 'f';
+BEGIN;
+CREATE TABLE keepalive();
+PREPARE TRANSACTION 'f';
+
+DROP TABLE IF EXISTS replication_example;
+
+SELECT pg_current_xlog_insert_location();
+CHECKPOINT;
+CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
+varchar(120));
+begin;
+INSERT INTO replication_example(somedata, text) VALUES (1, 1);
+INSERT INTO replication_example(somedata, text) VALUES (1, 2);
+commit;
+
+
+ALTER TABLE replication_example ADD COLUMN bar int;
+
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);
+
+BEGIN;
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
+INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);
+COMMIT;
+
+/* slightly more complex schema change, still no table rewrite */
+ALTER TABLE replication_example DROP COLUMN bar;
+INSERT INTO replication_example(somedata, text) VALUES (3, 1);
+
+BEGIN;
+INSERT INTO replication_example(somedata, text) VALUES (3, 2);
+INSERT INTO replication_example(somedata, text) VALUES (3, 3);
+commit;
+
+ALTER TABLE replication_example RENAME COLUMN text TO somenum;
+
+INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);
+
+/* complex schema change, changing types of existing column, rewriting the table */
+ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
+(somenum::int4);
+
+INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);
+
+SELECT pg_current_xlog_insert_location();
+
+/* now decode what has been written to the WAL during that time */
+
+SELECT decode_xlog('0/1893D78', '0/18BE398');
+
+WARNING:  BEGIN
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
+WARNING:  tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
+WARNING:  tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
+WARNING:  tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
+(null)
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
+WARNING:  tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  COMMIT
+WARNING:  BEGIN
+WARNING:  tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
+WARNING:  COMMIT
+
+---------------------------
+
+[[tx-reassembly]]
+=== TX reassembly ===
+
+In order to make usage of the decoded stream easy we want to present user
+level code with a correctly ordered image of each transaction at once, because
+otherwise every user would have to reassemble transactions themselves.
+
+Transaction reassembly needs to solve several problems:
+
+1. changes inside a transaction can be interspersed with other transactions
+1. a top level transaction only knows which subtransactions belong to it when
+it reads the commit record
+1. individual user level actions can be smeared over multiple records (TOAST)
+
+Our proposed module solves 1) and 2) by building individual streams of records
+split by xid. While not fully implemented yet we plan to spill those individual
+xid streams to disk after a certain amount of memory is used. This can be
+implemented without any change in the external interface.
+
+As all the individual streams are already sorted by LSN by definition - we
+build them from the WAL in a FIFO manner, and the position in the WAL is the
+definition of the LSN footnote:[the LSN is just the byte position in the WAL
+stream] - the individual changes can be merged efficiently by a k-way merge
+(without sorting!) by keeping the individual streams in a binary heap.
+
+To manipulate the binary heap a generic implementation is proposed. Several
+independent implementations of binary heaps already exist in the postgres code,
+but none of them is generic.  The patch is available at
+http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-and...@2ndquadrant.com[Add
+minimal binary heap implementation].
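+
+To illustrate the merge step, here is a small self-contained sketch - all names
+are invented and none of this is the patch's code - that merges per-transaction
+change streams by LSN using a binary min-heap:
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdio.h>
+#include <stdint.h>
+
+typedef struct ChangeStream
+{
+	const uint64_t *lsns;	/* changes of one (sub)transaction, LSN ascending */
+	int			nchanges;
+	int			pos;		/* next unread change */
+} ChangeStream;
+
+#define HEAD(s) ((s)->lsns[(s)->pos])
+
+/* restore the min-heap property below element i */
+static void
+sift_down(ChangeStream **heap, int n, int i)
+{
+	for (;;)
+	{
+		int			min = i,
+					l = 2 * i + 1,
+					r = 2 * i + 2;
+
+		if (l < n && HEAD(heap[l]) < HEAD(heap[min]))
+			min = l;
+		if (r < n && HEAD(heap[r]) < HEAD(heap[min]))
+			min = r;
+		if (min == i)
+			break;
+		ChangeStream *tmp = heap[i];
+		heap[i] = heap[min];
+		heap[min] = tmp;
+		i = min;
+	}
+}
+
+/* emit the union of all streams in LSN order, without sorting */
+static void
+merge_streams(ChangeStream **heap, int n)
+{
+	for (int i = n / 2 - 1; i >= 0; i--)	/* heapify the stream heads */
+		sift_down(heap, n, i);
+
+	while (n > 0)
+	{
+		ChangeStream *top = heap[0];
+
+		printf("change at lsn %llu\n", (unsigned long long) HEAD(top));
+
+		if (++top->pos == top->nchanges)
+			heap[0] = heap[--n];			/* stream exhausted, shrink heap */
+		sift_down(heap, n, 0);
+	}
+}
+
+int
+main(void)
+{
+	uint64_t	a[] = {10, 40, 70};
+	uint64_t	b[] = {20, 30, 90};
+	ChangeStream sa = {a, 3, 0};
+	ChangeStream sb = {b, 3, 0};
+	ChangeStream *heap[] = {&sa, &sb};
+
+	merge_streams(heap, 2);		/* prints 10, 20, 30, 40, 70, 90 */
+	return 0;
+}
+------------------------------------------------------------------------------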
+
+[NOTE]
+============
+The reassembly component was previously named ApplyCache because it was
+proposed to run on replication consumers just before applying changes. This is
+not the case anymore.
+
+It is still called that way in the source of the recently submitted patch.
+============
+
+[[snapbuilder]]
+=== Snapshot building  ===
+
+To decode wal records describing data changes we need to interpret and
+transform their contents. A single tuple is stored in a data structure
+called HeapTuple. As stored on disk that structure doesn't contain any
+information about the format of its contents.
+
+The basic problem is twofold:
+
+1. The wal records only contain the relfilenode, not the relation oid, of a
+table; worse, the relfilenode changes whenever an action performing a full
+table rewrite is performed
+1. To interpret a HeapTuple correctly the exact schema definition from back
+when the wal record was inserted into the wal stream needs to be available
+
+We chose to implement timetraveling access to the system catalog using
+postgres' MVCC nature & implementation because of the following advantages:
+
+* low amount of additional data in wal
+* genericity
+* similarity of implementation to Hot Standby, quite a bit of the infrastructure is reusable
+* all kinds of DDL can be handled in reliable manner
+* extensibility to user defined catalog-like tables
+
+Timetravel access to the catalog means that we are able to look at the catalog
+just as it looked when the changes were generated. That allows us to get the
+correct information about the contents of the aforementioned HeapTuples so we
+can decode them reliably.
+
+Other solutions we thought about that fell through:
+* catalog-only proxy instances that apply schema changes, via ``old fashioned''
+  wal replay, exactly up to the point currently being decoded
+* do the decoding on a 2nd machine, replicating all DDL exactly, and rely on the catalog there
+* do not allow DDL at all
+* always add enough data into the WAL to allow decoding
+* build a fully versioned catalog
+
+The email thread available under
+http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata
+consistency during changeset extraction from WAL] contains some details,
+advantages and disadvantages about the different possible implementations.
+
+How we build snapshots is somewhat intricate and seems to be out of scope for
+this document. We will provide a second document discussing the implementation
+in detail. Let's just assume it is possible from here on.
+
+[NOTE]
+Some details are already available in comments inside 'src/backend/replication/logical/snapbuild.{c,h}'.
+
+=== Output Plugin ===
+
+As mentioned previously, our aim is to make the implementation of output
+plugins as simple and non-redundant as possible, as we expect several different
+ones with different use cases to emerge quickly. See <<basic-schema>> for a
+list of possible output plugins that we think might emerge.
+
+Although for now we only plan to tackle logical replication, and based on that
+a multi-master implementation in the near future, we definitely aim to provide
+all use cases with something easily usable!
+
+To decode and translate local transactions an output plugin needs to be able to
+transform transactions as a whole so it can apply them as a meaningful
+transaction on the other side.
+
+To provide that, every time we find a transaction commit, and thus have
+completed reassembling the transaction, we start passing the individual changes
+to the output plugin. It currently only has to fill out three callbacks:
+[options="header"]
+|=====================================================================================================================================
+|Callback |Passed Parameters                    |Called per TX  | Use
+|begin    |xid                                  |once           |Begin of a reassembled transaction
+|change   |xid, subxid, change, mvcc snapshot   |every change   |Gets passed every change so it can transform it to the target format
+|commit   |xid                                  |once           |End of a reassembled transaction
+|=====================================================================================================================================
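+
+As a rough sketch (the names and signatures here are illustrative, not the
+patch's actual API), such a callback table could look like this in C:
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdint.h>
+
+typedef uint32_t TransactionId;
+
+/*
+ * Illustrative only: the three callbacks an output plugin fills out.
+ * "change" and "snapshot" are opaque here; the real API passes the
+ * patch's own structs.
+ */
+typedef struct OutputPluginCallbacks
+{
+	/* start of a completely reassembled transaction */
+	void		(*begin) (TransactionId xid);
+
+	/* one decoded change; transform it into the target format here */
+	void		(*change) (TransactionId xid, TransactionId subxid,
+						   void *change, void *mvcc_snapshot);
+
+	/* end of the reassembled transaction */
+	void		(*commit) (TransactionId xid);
+} OutputPluginCallbacks;
+------------------------------------------------------------------------------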
+
+During each of those callbacks an appropriate timetraveling SnapshotNow
+snapshot is set up so the callbacks can perform all read-only catalog accesses
+they need, including using the sys/rel/catcache. For obvious reasons only read
+access is allowed.
+
+The snapshot guarantees that the result of a lookup is the same as it
+was/would have been when the change was originally created.
+
+Additionally the callbacks get passed an MVCC snapshot, to e.g. run SQL queries
+on catalogs or similar.
+
+[IMPORTANT]
+============
+At the moment none of these snapshots can be used to access normal user
+tables. Adding additional tables to the allowed set is easy implementation-wise,
+but every transaction changing such tables incurs a noticeably higher
+overhead.
+============
+
+For now transactions won't be decoded/output in parallel. There are ideas to
+improve on this, but we don't think the complexity is appropriate for the first
+release of this feature.
+
+This is an adoption barrier for databases where large amounts of data get
+loaded/written in one transaction.
+
+=== Setup of replication nodes ===
+
+When setting up a new standby/consumer of a primary some problems exist
+independent of the implementation of the consumer. The gist of the problem is
+that when making a base backup and starting to stream all changes from that
+point on, transactions that were running at that time cannot be included:
+
+* Transactions that have not committed before starting to dump a database are
+  invisible to the dumping process
+
+* Transactions that began before the point from which on the WAL is being
+  decoded are incomplete and cannot be replayed
+
+Our proposal for a solution to this is to detect points in the WAL stream where we can provide:
+
+. A snapshot exported similarly to pg_export_snapshot() footnote:[http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION] that can be imported with +SET TRANSACTION SNAPSHOT+ footnote:[http://www.postgresql.org/docs/devel/static/sql-set-transaction.html]
+. A stream of changes that will include the complete data of all transactions seen as running by the snapshot generated in 1)
+
+See the diagram.
+
+[[setup-schema]]
+.Control flow during setup of a new node
+["ditaa",scaling="0.7"]
+------------------------------------------------------------------------------
++----------------+
+| Walsender  |   |                               +------------+
+|            v   |                               | Consumer   |
++-------------+  |<--IDENTIFY_SYSTEM-------------|            |
+|     WAL     |  |                               |            |
+|  decoding   |  |----....---------------------->|            |
++------+------/  |                               |            |
+|            |   |                               |            |
+|            v   |                               |            |
++-------------+  |<--INIT_LOGICAL $PLUGIN--------|            |
+|     TX      |  |                               |            |
+| reassembly  |  |---FOUND_STARTING %X/%X------->|            |
++-------------/  |                               |            |
+|            |   |---FOUND_CONSISTENT %X/%X----->|            |
+|            v   |---pg_dump snapshot----------->|            |
++-------------+  |---replication slot %P-------->|            |
+|   Output    |  |                               |            |
+|   Plugin    |  |    ^                          |            |
++-------------/  |    |                          |            |
+|                |    +-run pg_dump separately --|            |
+|                |                               |            |
+|                |<--STREAM_DATA-----------------|            |
+|                |                               |            |
+|                |---data ---------------------->|            |
+|                |                               |            |
+|                |                               |            |
+|                |  ---- SHUTDOWN -------------  |            |
+|                |                               |            |
+|                |                               |            |
+|                |<--RESTART_LOGICAL $PLUGIN %P--|            |
+|                |                               |            |
+|                |---data----------------------->|            |
+|                |                               |            |
+|                |                               |            |
++----------------+                               +------------+
+
+------------------------------------------------------------------------------
+
+=== Disadvantages of the approach ===
+
+* somewhat intricate code for snapshot timetravel
+* output plugins/walsenders need to work per database as they access the catalog
+* when sending to multiple standbys some work is done multiple times
+* decoding/applying multiple transactions in parallel is hard
+
+=== Unfinished/Undecided issues ===
+
+* declaration of user ``catalog'' tables (e.g. userspace enums)
+* finishing different parts of the implementation
+** spill to disk during transaction reassembly
+** mixed catalog/data transactions
+** snapshot refcounting
+** snapshot exporting
+** snapshot serialization
diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile
index cf040ef..92b4508 100644
--- a/src/backend/replication/logical/Makefile
+++ b/src/backend/replication/logical/Makefile
@@ -17,3 +17,9 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS)
 OBJS = decode.o logicalfuncs.o reorderbuffer.o snapbuild.o
 
 include $(top_srcdir)/src/backend/common.mk
+
+DESIGN.pdf: DESIGN.txt
+	a2x -v --fop -f pdf -D $(shell pwd) $<
+
+README.SNAPBUILD.pdf: README.SNAPBUILD.txt
+	a2x -v --fop -f pdf -D $(shell pwd) $<
diff --git a/src/backend/replication/logical/README.SNAPBUILD.txt b/src/backend/replication/logical/README.SNAPBUILD.txt
new file mode 100644
index 0000000..70e142f
--- /dev/null
+++ b/src/backend/replication/logical/README.SNAPBUILD.txt
@@ -0,0 +1,298 @@
+= Snapshot Building =
+:author: Andres Freund, 2ndQuadrant Ltd
+
+== Why do we need timetravel catalog access ==
+
+When doing wal decoding (see DESIGN.txt for reasons to do so) we need to know
+how the catalog looked at the point a record was inserted into the WAL, because
+without that information we don't know much more about the record than its
+length. It's just an arbitrary bunch of bytes without further information.
+Unfortunately, due to the possibility of the table definition changing, we
+cannot just access a newer version of the catalog and assume the table
+definition is still the same.
+
+If only the type information were required it might be enough to annotate the
+wal records with a bit more information (table oid, table name, column name,
+column type), but as we want to be able to convert the output to more useful
+formats like text we need to be able to call output functions. Those need a
+normal environment including the usual caches and normal catalog access to look
+up operators, functions and other types.
+
+Our solution to this is to add the capability to access the catalog in a way
+that makes it look like it did when the record was inserted into the WAL. The
+locking used during WAL generation guarantees the catalog is/was in a consistent
+state at that point.
+
+Interesting cases include:
+- enums
+- composite types
+- extension types
+- non-C functions
+- relfilenode to table oid mapping
+
+Due to postgres' MVCC nature regular modifications of a table's contents are
+theoretically non-destructive. The problem is that there is no way to access
+arbitrary points in time even if the data for them is there.
+
+This module adds the capability to do so in the very limited set of
+circumstances we need it in for wal decoding. It does *not* provide a facility
+to do so in general.
+
+A 'Snapshot' is the datastructure used in postgres to describe which tuples are
+visible and which are not. We need to build a Snapshot which can be used to
+access the catalog the way it looked when the wal record was inserted.
+
+Restrictions:
+* Only works for catalog tables
+* Snapshot modifications are somewhat expensive
+* It cannot build initial visibility information for every point in time; it
+  needs a specific set of circumstances for that
+* There is only a limited window in which we can build snapshots
+
+== How do we build timetravel snapshots ==
+
+Hot Standby added infrastructure to build snapshots from WAL during recovery in
+the 9.0 release. Most of that can be reused for our purposes.
+
+We cannot reuse all of the HS infrastructure because:
+* we are not in recovery
+* we need to look *inside* transactions
+* we need the capability to have multiple different snapshots around at the same time
+
+We need to provide two kinds of snapshots that are implemented rather
+differently in their plain postgres incarnation:
+* SnapshotNow
+* SnapshotMVCC
+
+We need both because if any operators use normal functions those will get
+executed with SnapshotMVCC semantics, while the catcache and related things
+rely on SnapshotNow semantics. Note that SnapshotNow here cannot be a normal
+SnapshotNow because we wouldn't access the old version of the catalog in that
+case. Instead it is something like an MVCC snapshot with the correct visibility
+information. That also means the snapshot won't have some of the race issues a
+normal SnapshotNow has.
+
+Every time a transaction that changed the catalog commits, all other
+transactions will need a new snapshot that marks that transaction (and its
+subtransactions) as visible.
+
+Our snapshot representation is a bit different from normal snapshots, but we
+still reuse the normal SnapshotData struct:
+* Snapshot->xip contains all transactions we consider committed
+* Snapshot->subxip contains all transactions belonging to our transaction,
+  including the toplevel one
+
+The meaning of ->xip is inverted in comparison with non-timetravel snapshots
+because usually only a tiny percentage of committed transactions will have
+modified the catalog between xmin and xmax. It also makes subtransaction
+handling easier (we cannot query pg_subtrans).
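+
+The following simplified sketch shows what the inverted meaning implies for
+visibility checks; the struct and function are stand-ins, not the patch's code,
+and xid wraparound is ignored:
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdbool.h>
+#include <stdint.h>
+
+typedef uint32_t TransactionId;
+
+typedef struct DecodingSnapshot
+{
+	TransactionId xmin;			/* below this, commit status is long decided */
+	TransactionId xmax;			/* this xid and everything newer is invisible */
+	TransactionId *xip;			/* inverted: xids that *committed* catalog changes */
+	int			xcnt;
+	TransactionId *subxip;		/* our own (sub)transactions, incl. the toplevel */
+	int			subxcnt;
+} DecodingSnapshot;
+
+static bool
+xid_in(TransactionId xid, const TransactionId *xids, int n)
+{
+	for (int i = 0; i < n; i++)
+		if (xids[i] == xid)
+			return true;
+	return false;
+}
+
+/* is a catalog tuple created by "xid" visible to this snapshot? */
+static bool
+decoding_xid_visible(const DecodingSnapshot *snap, TransactionId xid)
+{
+	if (xid_in(xid, snap->subxip, snap->subxcnt))
+		return true;			/* our own transaction sees its own changes */
+	if (xid >= snap->xmax)
+		return false;			/* started after the snapshot was taken */
+	if (xid < snap->xmin)
+		return true;			/* simplification: the real code checks clog/hint bits */
+	/* between xmin and xmax: visible only if listed as committed */
+	return xid_in(xid, snap->xip, snap->xcnt);
+}
+------------------------------------------------------------------------------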
+
+== Building of initial snapshot ==
+
+We can start building an initial snapshot as soon as we find either an
+XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT_SHUTDOWN record because both allow us
+to know how many transactions are running.
+
+We need to know which transactions were running when we start to build a
+snapshot/start decoding because we don't have enough information about them:
+they could have made catalog modifications before we started watching. We also
+wouldn't have the complete contents of those transactions as we started reading
+after they had begun. The latter is also important for building snapshots which
+can be used to create a consistent initial clone.
+
+There also is the problem that XLOG_RUNNING_XACTS records can be 'suboverflowed',
+which means there were more running subtransactions than fit into shared
+memory. In that case we use the same incremental building trick HS uses, which
+is either
+1) wait until further XLOG_RUNNING_XACTS records have a running->oldestRunningXid
+after the initial xl_running_xacts->nextXid, or
+2) wait for a further XLOG_RUNNING_XACTS record that is not overflowed or for
+an XLOG_CHECKPOINT_SHUTDOWN record
+
+XXX: we probably don't need to care about ->suboverflowed at all as we only
+need to know about committed XIDs and we get enough information about
+subtransactions at commit. More thinking needed.
+
+When we start building a snapshot we are in the 'SNAPBUILD_START' state. As
+soon as we find any visibility information, even if incomplete, we change to
+SNAPBUILD_INITIAL_POINT.
+
+When we have collected enough information to decode any transaction starting
+after that point in time we switch to SNAPBUILD_FULL_SNAPSHOT. If transactions
+that were already running at that point commit before the next state is
+reached, we throw their complete content away.
+
+When all transactions that were running when we switched over to FULL_SNAPSHOT
+have committed, we change into the 'SNAPBUILD_CONSISTENT' state. Every
+transaction that commits from now on gets handed to the output plugin.
+When doing the switch to CONSISTENT we optionally export a snapshot which makes
+all transactions that committed up to this point visible. That exported
+snapshot allows the user to run pg_dump on it and replay all changes received
+on the restored dump to get a consistent new clone.
+
+["ditaa",scaling="0.8"]
+---------------
+
++-------------------------+
+|SNAPBUILD_START          |-----------------------+
+|                         |-----------+           |
++-------------------------+           |           |
+             |                        |           |
+ XLOG_RUNNING_XACTS suboverflowed     |   saved snapshot
+             |                        |           |
+             |                        |           |
+             |                        |           |
+             v                        |           |
++-------------------------+           v           v
+|SNAPBUILD_INITIAL        |---------------------->+
+|                         |---------->+           |
++-------------------------+           |           |
+             |                        |           |
+ oldestRunningXid past initialNextXid |           |
+             |                        |           |
+             |  XLOG_RUNNING_XACTS !suboverflowed |
+             v                        |           |
++-------------------------+           |           |
+|SNAPBUILD_FULL_SNAPSHOT  |<----------+           v
+|                         |---------------------->+
++-------------------------+                       |
+             |                                    |
+             |                    XLOG_CHECKPOINT_SHUTDOWN
+ any running txn's finished                       |
+             |                                    |
+             v                                    |
++-------------------------+                       |
+|SNAPBUILD_CONSISTENT     |<----------------------+
+|                         |
++-------------------------+
+
+---------------
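+
+The same progression as a plain C enum, purely for illustration (the names
+follow the text above):
+
+[source,c]
+------------------------------------------------------------------------------
+typedef enum SnapBuildState
+{
+	SNAPBUILD_START,			/* no visibility information found yet */
+	SNAPBUILD_INITIAL_POINT,	/* some, possibly incomplete, information found */
+	SNAPBUILD_FULL_SNAPSHOT,	/* can decode transactions starting after this */
+	SNAPBUILD_CONSISTENT		/* every transaction committing now is decodable */
+} SnapBuildState;
+------------------------------------------------------------------------------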
+
+== Snapshot Management ==
+
+Whenever, during decoding, a transaction is detected as having started after
+SNAPBUILD_FULL_SNAPSHOT was reached, we distribute the currently maintained
+snapshot to it (i.e. call ApplyCacheAddBaseSnapshot). This serves as its
+initial SnapshotNow and SnapshotMVCC. Unless there are concurrent catalog
+changes that snapshot won't ever change.
+
+Whenever a transaction that made catalog changes commits, we iterate over all
+concurrently active transactions and add a new SnapshotNow to each of them
+(ApplyCacheAddBaseSnapshot(current_lsn)). This is required because any row
+written from that point on will have used the changed catalog
+contents. This can occur even with correct locking.
+
+SnapshotNow needs to be set up globally so the syscache and other pieces access
+it transparently. This is done using two new tqual.h functions:
+SetupDecodingSnapshots() and RevertFromDecodingSnapshots().
+
+== Catalog/User Table Detection ==
+
+To detect whether a record/transaction does catalog modifications - which we
+need to do for memory/performance reasons - we need to resolve the
+RelFileNodes in xlog records back to the original tables. Unfortunately
+RelFileNodes only contain the table's relfilenode, not its table oid. We can
+only do catalog access once we have reached FULL_SNAPSHOT; before that we can
+use some heuristics, but otherwise we have to assume that every record changes
+the catalog.
+
+The heuristics we can use are:
+* relfilenode->spcNode == GLOBALTABLESPACE_OID
+* relfilenode->relNode <= FirstNormalObjectId
+* RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid
+
+Those detect some catalog tables but not all (think VACUUM FULL), but if they
+detect one they are correct.
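+
+Spelled out as a single function, the heuristics look roughly like this; the
+OID constants carry the values from postgres' headers, and the relmapper lookup
+is stubbed out for the sketch:
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdbool.h>
+
+typedef unsigned int Oid;
+
+#define InvalidOid				((Oid) 0)
+#define GLOBALTABLESPACE_OID	1664	/* pg_global */
+#define FirstNormalObjectId		16384	/* first oid handed out to user objects */
+
+/* stand-in for RelationMapFilenodeToOid(); always answers "not mapped" here */
+static Oid
+relmap_filenode_to_oid(Oid relNode)
+{
+	(void) relNode;
+	return InvalidOid;
+}
+
+/*
+ * Sketch of the heuristics above: may miss catalog tables (e.g. after a
+ * VACUUM FULL assigned a new, high relfilenode), but whenever it returns
+ * true the relation really is a catalog table.
+ */
+static bool
+relfilenode_is_catalog(Oid spcNode, Oid relNode)
+{
+	if (spcNode == GLOBALTABLESPACE_OID)
+		return true;			/* shared catalogs live in pg_global */
+	if (relNode <= FirstNormalObjectId)
+		return true;			/* low filenode, assigned at initdb time */
+	if (relmap_filenode_to_oid(relNode) != InvalidOid)
+		return true;			/* relation tracked by the relmapper */
+	return false;
+}
+------------------------------------------------------------------------------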
+
+After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
+a table might not be a catalog table. For that we use the new RELFILENODE
+syscache with (spcNode, relNode).
+
+XXX: Note that that syscache is a bit problematic because it's not actually
+unique: shared/nailed catalogs store a 0 as relfilenode (they are stored
+in the relmapper). Those are never looked up though, so it might be
+ok. Unfortunately it doesn't seem to be possible to use a partial index (WHERE
+relfilenode != 0) here.
+
+XXX: For some use cases it would be useful to treat some user-specified tables
+as catalog tables.
+
+== System Table Rewrite Handling ==
+
+XXX, expand, XXX
+
+NOTES:
+* always using newest relmapper, use newest invalidations
+* old tuples are preserved across rewrites, that's fine
+* REINDEX/CLUSTER pg_class; in a transaction
+
+== Mixed DDL/DML Transaction Handling ==
+
+When a transaction mixes DDL and DML things get a bit more complicated, because
+we need to handle CommandIds and ComboCids in order to use the correct version
+of the catalog when decoding the individual tuples.
+
+CommandId handling itself is relatively simple; we can figure out the current
+CommandId relatively easily by looking at the one used by the current
+change. The problematic part is that those CommandIds frequently will not be
+actual cmin or cmax values but ComboCids. Those are used to minimize space in
+the heap. During normal operation cmin/cmax values are only used within the
+backend emitting those rows and only during one toplevel transaction, so
+instead of storing cmin/cmax only a reference to an in-memory value is stored
+that contains both. Whenever we see a new CommandId we call
+ApplyCacheAddNewCommandId.
+
+To resolve this problem, whenever we generate a new combocid in a catalog table
+during heap_* operations (detected via a new parameter to
+HeapTupleHeaderAdjustCmax) we log a new XLOG_HEAP2_NEW_COMBOCID record
+containing the mapping. During decoding this ComboCid is added to the
+applycache (ApplyCacheAddNewComboCid). ComboCids are only guaranteed to be
+valid within a single transaction, so we cannot simply set all of them up
+globally. Before calling the output plugin ComboCids are temporarily set up and
+torn down afterwards.
+
+All this only needs to happen in the transaction performing the DDL.
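+
+Purely as an illustration of what such a mapping record could contain (the
+struct and field names are invented here, not taken from the patch):
+
+[source,c]
+------------------------------------------------------------------------------
+#include <stdint.h>
+
+typedef uint32_t CommandId;
+
+/*
+ * Hypothetical payload of the proposed XLOG_HEAP2_NEW_COMBOCID record:
+ * enough to resolve a combo CommandId found in a catalog tuple back to
+ * the real cmin/cmax pair during decoding.
+ */
+typedef struct xl_heap_new_combocid
+{
+	CommandId	combocid;		/* value actually stored in the heap tuple */
+	CommandId	cmin;			/* inserting command within the transaction */
+	CommandId	cmax;			/* deleting/updating command within the transaction */
+} xl_heap_new_combocid;
+------------------------------------------------------------------------------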
+
+== Cache Handling ==
+
+As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate
+cache invalidation. For transactions without DDL that's easy as everything is
+already provided by HS. Every time we read a commit record we apply the sinval
+messages contained therein.
+
+For transactions that contain DDL and DML cache invalidation needs to happen
+more frequently because we need to tear down all caches that were just
+modified. To do that we simply take all invalidation messages collected at the
+end of the transaction and apply them after every single change.
+At some point this can get optimized by generating new local invalidation
+messages, but that seems too complicated for now.
+
+XXX: think/talk about syscache invalidation of relmapper/pg_class changes.
+
+== xmin Horizon Handling ==
+
+Reusing MVCC for timetravel access has one obvious major problem:
+VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+obviously, we want autovacuum/manual vacuum to work as before.
+
+The idea here is to reuse the infrastructure built for hot_standby_feedback
+which allows us to keep the xmin horizon of a walsender backend artificially
+low. We keep it low enough so we can restart decoding from the last location
+the client has confirmed to be safely received. That means that we keep it low
+enough to contain the last checkpoint's oldestXid value.
+
+That also means we need to make that value persist across restarts/crashes in a
+very similar manner to twophase.c's. That infrastructure would actually also be
+useful to make hot_standby_feedback work properly across primary restarts.
+
+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need the ability to
+start from a previously decoded location without going through the whole
+multi-phase setup again, because that would make it very hard to calculate up
+to where we need to keep information.
+
+To make that easier, every time a decoding process finds an online checkpoint
+record it exclusively takes a global lwlock and checks whether visibility
+information has already been written out for that checkpoint, and does so
+if not. We only need to do that once as visibility information is the same
+between all decoding backends.