--- src/backend/replication/logical/DESIGN.txt | 603 +++++++++++++++++++++ src/backend/replication/logical/Makefile | 6 + .../replication/logical/README.SNAPBUILD.txt | 298 ++++++++++ 3 files changed, 907 insertions(+) create mode 100644 src/backend/replication/logical/DESIGN.txt create mode 100644 src/backend/replication/logical/README.SNAPBUILD.txt
diff --git a/src/backend/replication/logical/DESIGN.txt b/src/backend/replication/logical/DESIGN.txt new file mode 100644 index 0000000..8383602 --- /dev/null +++ b/src/backend/replication/logical/DESIGN.txt @@ -0,0 +1,603 @@ +//-*- mode: adoc -*- += High Level Design for Logical Replication in Postgres = +:copyright: PostgreSQL Global Development Group 2012 +:author: Andres Freund, 2ndQuadrant Ltd. +:email: and...@2ndquadrant.com + +== Introduction == + +This document aims to first explain why we think postgres needs another +replication solution and what, in our opinion, that solution needs to offer. Then +it sketches out our proposed implementation. + +In contrast to an earlier version of the design document, which talked about the +implementation of four parts of replication solutions: + +1. Source data generation +1. Transportation of that data +1. Applying the changes +1. Conflict resolution + +this version only plans to talk about the first part in detail, as it is an +independent and complex part usable for a wide range of use cases which we want +to get included into postgres as a first step. + +=== Previous discussions === + +There are two rather large threads discussing several parts of the initial +prototype and proposed architecture: + +- http://archives.postgresql.org/message-id/201206131327.24092.and...@2ndquadrant.com[Logical Replication/BDR prototype and architecture] +- http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata consistency during changeset extraction from WAL] + +Those discussions led to some fundamental design changes, which are presented in this document.
+ +=== Changes from v1 === +* At least a partial decoding step required/possible on the source system +* No intermediate ("schema only") instances required +* DDL handling, without event triggers +* A very simple text conversion is provided for debugging/demo purposes +* Smaller scope + +== Existing approaches to replication in Postgres == + +If any currently used approach to replication can be made to support every +use-case/feature we need, it likely is not a good idea to implement something +different. Three basic approaches are in use in/around postgres +today: + +. Trigger based +. Recovery based/Physical footnote:[Often referred to by terms like Hot Standby, Streaming Replication, Point In Time Recovery] +. Statement based + +Statement based replication has obvious and known problems with consistency and +correctness, making it hard to use in the general case, so we will not +discuss it further here. + +Let's have a look at the advantages/disadvantages of the other approaches: + +=== Trigger based Replication === + +This variant has a multitude of significant advantages: + +* implementable in userspace +* easy to customize +* just about everything can be made configurable +* cross version support +* cross architecture support +* can feed into systems other than postgres +* no overhead from writes to non-replicated tables +* writable standbys +* mature solutions +* multimaster implementations possible & existing + +But it also has a number of disadvantages, some of them very hard to solve: + +* essentially duplicates the amount of writes (or even more!) +* synchronous replication hard or impossible to implement +* noticeable CPU overhead +** trigger functions +** text conversion of data +* complex parts implemented in several solutions +* not in core + +Especially the higher amount of writes might seem easy to solve at first +glance, but a solution not using a normal transactional table for its log/queue +has to solve a lot of problems.
The major ones are: + +* crash safety, restartability & spilling to disk +* consistency with the commit status of transactions +* only a minimal amount of synchronous work should be done inside individual +transactions + +In our opinion those problems are restricting progress/wider distribution of +this class of solutions. It is our aim though that existing solutions in this +space - most prominently slony and londiste - can benefit from the work we are +doing & planning to do by incorporating at least parts of the changeset +generation infrastructure. + +=== Recovery based Replication === + +This type of solution, being built into postgres and of increasing popularity, +has and will have its use cases, and we do not aim to replace it but to complement +it. We plan to reuse some of the infrastructure and to make it possible to mix +both modes of replication. + +Advantages: + +* builtin +* built on existing infrastructure from crash recovery +* efficient +** minimal CPU, memory overhead on primary +** low amount of additional writes +* synchronous operation mode +* low maintenance once set up +* handles DDL + +Disadvantages: + +* standbys are read only +* no cross version support +* no cross architecture support +* no replication into foreign systems +* hard to customize +* not configurable on the level of database, tables, ... + +== Goals == + +As seen in the previous short survey of the two major interesting classes of +replication solutions, there is a significant gap between them. Our aim is to +make it smaller. + +We aim for: + +* in core +* low CPU overhead +* low storage overhead +* asynchronous, optionally synchronous operation modes +* robust +* modular +* basis for other technologies (sharding, replication into other DBMS's, ...)
+* basis for at least one multi-master solution +* make the implementation as unintrusive as possible, but not more + +== New Architecture == + +=== Overview === + +Our proposal is to reuse the basic principle of WAL based replication, namely +reusing data that already needs to be written for another purpose, and extend +it to allow most, but not all, of the flexibility of trigger based solutions. +We want to do that by decoding the WAL back into a non-physical form. + +To get the flexibility we and others want, we propose that the last step of +changeset generation, transforming it into a format that can be used by the +replication consumer, is done in an extensible manner. In the schema the part +that does that is described as 'Output Plugin'. To keep the amount of +duplication between different plugins as low as possible, the plugin should only +do a very limited amount of work. + +The following paragraphs contain reasoning for the individual design decisions +made and their high-level design. + +=== Schematics === + +The basic proposed architecture for changeset extraction is presented in the +following diagram. The first part should look familiar to anyone knowing +postgres' architecture. The second is where most of the new magic happens.
+ +[[basic-schema]] +.Architecture Schema +["ditaa"] +------------------------------------------------------------------------------ + Traditional Stuff + + +---------+---------+---------+---------+----+ + | Backend | Backend | Backend | Autovac | ...| + +----+----+---+-----+----+----+----+----+-+--+ + | | | | | + +------+ | +--------+ | | + +-+ | | | +----------------+ | + | | | | | | + | v v v v | + | +------------+ | + | | WAL writer |<------------------+ + | +------------+ + | | | | | | + v v v v v v +-------------------+ ++--------+ +---------+ +->| Startup/Recovery | +|{s} | |{s} | | +-------------------+ +|Catalog | | WAL |---+->| SR/Hot Standby | +| | | | | +-------------------+ ++--------+ +---------+ +->| Point in Time | + ^ | +-------------------+ + ---|----------|-------------------------------- + | New Stuff ++---+ | +| v Running separately +| +----------------+ +=-------------------------+ +| | Walsender | | | | +| | v | | +-------------------+ | +| +-------------+ | | +->| Logical Rep. | | +| | WAL | | | | +-------------------+ | ++-| decoding | | | +->| Multimaster | | +| +------+------/ | | | +-------------------+ | +| | | | | +->| Slony | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Auditing | | +| | TX | | | | +-------------------+ | ++-| reassembly | | | +->| Mysql/... | | +| +-------------/ | | | +-------------------+ | +| | | | | +->| Custom Solutions | | +| | v | | | +-------------------+ | +| +-------------+ | | +->| Debugging | | +| | Output | | | | +-------------------+ | ++-| Plugin |--|--|-+->| Data Recovery | | + +-------------/ | | +-------------------+ | + | | | | + +----------------+ +--------------------------| +------------------------------------------------------------------------------ + +=== WAL enrichement === + +To be able to decode individual WAL records at the very minimal they need to +contain enough information to reconstruct what has happened to which row. 
The +action is already encoded in the WAL record's header in most cases. + +As an example of missing data, the WAL record emitted when a row gets deleted +only contains the row's physical location. At the very least we need a way to +identify the deleted row: in a relational database the minimal amount of data +that does that should be the primary key footnote:[Yes, there are use cases +where the whole row is needed, or where no primary key can be found]. + +We propose that for now it is enough to extend the relevant WAL records with +additional data when the newly introduced 'wal_level = logical' is set. + +Previously it has been argued on the hackers mailing list that a generic 'WAL +record annotation' mechanism might be a good thing. That mechanism would allow +attaching arbitrary data to individual WAL records, making it easier to extend +postgres to support something like what we propose. While we don't oppose that +idea, we think it is a largely orthogonal issue to this proposal as a whole +because the format of WAL records is version dependent by nature and the +necessary changes for our simple approach are small, so not much effort is lost. + +A full annotation capability is a complex endeavour on its own, as the parts of +the code generating the relevant WAL records have somewhat complex requirements +and cannot easily be configured from the outside. + +Currently this is contained in the http://archives.postgresql.org/message-id/1347669575-14371-6-git-send-email-and...@2ndquadrant.com[Log enough data into the wal to reconstruct logical changes from it] patch. + +=== WAL parsing & decoding === + +The main complexity when reading the WAL as stored on disk is that the format +is somewhat complex and the existing parser is too deeply integrated into the +recovery system to be directly reusable. Once a reusable parser exists, decoding +the binary data into individual WAL records is a small problem.
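Once such a parser exists, the read loop itself is mechanical. The following sketch illustrates the principle in Python (the real implementation is C); the record layout here - a little-endian length/xid/kind header followed by a payload - is invented for illustration and is far simpler than the real WAL format. Note how the 'LSN' of a record is simply its byte offset in the stream:

```python
import struct

# Toy record layout (invented for illustration, *not* the real WAL format):
# a little-endian header of total length (4 bytes), xid (4 bytes) and
# record kind (1 byte), followed by an opaque payload.
HDR = struct.Struct("<IIB")

def make_record(xid, kind, payload):
    """Serialize one toy record."""
    return HDR.pack(HDR.size + len(payload), xid, kind) + payload

def read_records(buf):
    """Yield (lsn, xid, kind, payload) for each record in the buffer.

    The 'lsn' is simply the record's byte offset, mirroring the fact that
    a real LSN is just a byte position in the WAL stream.
    """
    off = 0
    while off < len(buf):
        total, xid, kind = HDR.unpack_from(buf, off)
        yield off, xid, kind, buf[off + HDR.size : off + total]
        off += total

stream = (make_record(7, 1, b"INSERT a")
          + make_record(8, 1, b"INSERT b")
          + make_record(7, 2, b"COMMIT"))
recs = list(read_records(stream))
assert [(r[1], r[3]) for r in recs] == \
    [(7, b"INSERT a"), (8, b"INSERT b"), (7, b"COMMIT")]
```

A real reader additionally has to handle records crossing page boundaries, CRC checks and timeline switches, which is exactly the complexity a reusable parser has to encapsulate.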
+ +Currently two competing proposals for this module exist, each having its own +merits. In the grand scheme of this proposal it is irrelevant which one gets +picked as long as the functionality gets integrated. + +The mailing list post +http://archives.postgresql.org/message-id/1347669575-14371-3-git-send-email-and...@2ndquadrant.com[Add +support for a generic wal reading facility dubbed XLogReader] contains both +competing patches and discussion around which one is preferable. + +Once the WAL has been decoded into individual records, two major issues exist: + +1. records from different transactions and even individual user level actions +are intermingled +1. the data attached to records cannot be interpreted on its own; it is only +meaningful together with a lot of additional information (including table, columns, types +and more) + +The solution to the first issue is described in the next section: <<tx-reassembly>> + +The second problem is probably the reason why no mature solution to reuse the +WAL for logical changeset generation exists today. See the <<snapbuilder>> +paragraph for some details.
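The first issue amounts to demultiplexing the record stream by xid and handing out a transaction's changes only once its commit is seen. A toy Python sketch of that idea (the real reassembly module, described in the next section, is C and additionally handles subtransactions and spilling to disk):

```python
from collections import defaultdict

COMMIT = "commit"

def reassemble(records):
    """Group an interleaved (lsn, xid, action) stream into whole
    transactions, emitted in commit order.  A toy analogue of transaction
    reassembly; the real module is C, handles subtransactions, and spills
    large transactions to disk."""
    pending = defaultdict(list)          # xid -> changes, already LSN-ordered
    for lsn, xid, action in records:
        if action == COMMIT:
            yield xid, pending.pop(xid, [])
        else:
            pending[xid].append((lsn, action))

wal = [(1, 10, "ins t1"), (2, 11, "ins t2"), (3, 10, "ins t1"),
       (4, 11, COMMIT), (5, 10, COMMIT)]
txns = list(reassemble(wal))
# xid 11 commits first, so it is handed out first, its changes in LSN order
assert txns == [(11, [(2, "ins t2")]), (10, [(1, "ins t1"), (3, "ins t1")])]
```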
+ +As decoding, Transaction reassembly and Snapshot building are interdependent +they currently are implemented in the same patch: +http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-and...@2ndquadrant.com[Introduce +wal decoding via catalog timetravel] + +That patch also includes a small demonstration that the approach works in the +presence of DDL: + +[[example-of-decoding]] +.Decoding example +[NOTE] +--------------------------- +/* just so we keep a sensible xmin horizon */ +ROLLBACK PREPARED 'f'; +BEGIN; +CREATE TABLE keepalive(); +PREPARE TRANSACTION 'f'; + +DROP TABLE IF EXISTS replication_example; + +SELECT pg_current_xlog_insert_location(); +CHECKPOINT; +CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text +varchar(120)); +begin; +INSERT INTO replication_example(somedata, text) VALUES (1, 1); +INSERT INTO replication_example(somedata, text) VALUES (1, 2); +commit; + + +ALTER TABLE replication_example ADD COLUMN bar int; + +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4); + +BEGIN; +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4); +INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL); +COMMIT; + +/* slightly more complex schema change, still no table rewrite */ +ALTER TABLE replication_example DROP COLUMN bar; +INSERT INTO replication_example(somedata, text) VALUES (3, 1); + +BEGIN; +INSERT INTO replication_example(somedata, text) VALUES (3, 2); +INSERT INTO replication_example(somedata, text) VALUES (3, 3); +commit; + +ALTER TABLE replication_example RENAME COLUMN text TO somenum; + +INSERT INTO replication_example(somedata, somenum) VALUES (4, 1); + +/* complex schema change, changing types of existing column, rewriting the table */ +ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING +(somenum::int4); + +INSERT INTO replication_example(somedata, somenum) 
VALUES (5, 1); + +SELECT pg_current_xlog_insert_location(); + +/* now decode what has been written to the WAL during that time */ + +SELECT decode_xlog('0/1893D78', '0/18BE398'); + +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1 +WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4 +WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4 +WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]: +(null) +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2 +WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1 +WARNING: COMMIT +WARNING: BEGIN +WARNING: COMMIT +WARNING: BEGIN +WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1 +WARNING: COMMIT + +--------------------------- + +[[tx-reassembly]] +=== TX reassembly === + +In order to make usage of the decoded stream easy we want to present the user +level code with a correctly ordered image of individual transactions at once +because otherwise every user will have to reassemble transactions themselves. + +Transaction reassembly needs to solve several problems: + +1. changes inside a transaction can be interspersed with other transactions +1. a top level transaction only knows which subtransactions belong to it when +it reads the commit record +1. 
individual user level actions can be smeared over multiple records (TOAST) + +Our proposed module solves 1) and 2) by building individual streams of records +split by xid. While not fully implemented yet, we plan to spill those individual +xid streams to disk after a certain amount of memory is used. This can be +implemented without any change in the external interface. + +As all the individual streams are already sorted by LSN by definition - we +build them from the WAL in a FIFO manner, and the position in the WAL is the +definition of the LSN footnote:[the LSN is just the byte position in the WAL +stream] - the individual changes can be merged efficiently by a k-way merge +(without sorting!) by keeping the individual streams in a binary heap. + +To manipulate the binary heap, a generic implementation is proposed. Several +independent implementations of binary heaps already exist in the postgres code, +but none of them is generic. The patch is available at +http://archives.postgresql.org/message-id/1347669575-14371-2-git-send-email-and...@2ndquadrant.com[Add +minimal binary heap implementation]. + +[NOTE] +============ +The reassembly component was previously coined ApplyCache because it was +proposed to run on replication consumers just before applying changes. This is +not the case anymore. + +It is still called that way in the source of the recently submitted patch. +============ + +[[snapbuilder]] +=== Snapshot building === + +To decode the contents of WAL records describing data changes we need to +interpret and transform their contents. A single tuple is stored in a data structure +called HeapTuple. As stored on disk that structure doesn't contain any +information about the format of its contents. + +The basic problem is twofold: + +1. The WAL records only contain the relfilenode, not the relation oid, of a table +11. The relfilenode changes when an action performing a full table rewrite is performed +1.
To interpret a HeapTuple correctly, the exact schema definition from +when the WAL record was inserted into the WAL stream needs to be available. + +We chose to implement timetraveling access to the system catalog using +postgres' MVCC nature & implementation because of the following advantages: + +* low amount of additional data in the WAL +* genericity +* similarity of implementation to Hot Standby, quite a bit of the infrastructure is reusable +* all kinds of DDL can be handled in a reliable manner +* extensibility to user defined catalog-like tables + +Timetravel access to the catalog means that we are able to look at the catalog +just as it looked when the changes were generated. That allows us to get the +correct information about the contents of the aforementioned HeapTuples so we +can decode them reliably. + +Other solutions we considered, but that fell through: +* catalog-only proxy instances that apply schema changes, using ``old fashioned'' WAL replay, exactly to the point where we are decoding +* do the decoding on a 2nd machine, replicating all DDL exactly, and rely on the catalog there +* do not allow DDL at all +* always add enough data into the WAL to allow decoding +* build a fully versioned catalog + +The email thread available under +http://archives.postgresql.org/message-id/201206211341.25322.and...@2ndquadrant.com[Catalog/Metadata +consistency during changeset extraction from WAL] contains some details about, +and advantages and disadvantages of, the different possible implementations. + +How we build snapshots is somewhat intricate and seems to be +out of scope for this document. We will provide a second document discussing +the implementation in detail. Let's just assume it is possible from here on. + +[NOTE] +Some details are already available in comments inside 'src/backend/replication/logical/snapbuild.{c,h}'.
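Conceptually, what the snapshot machinery provides can be pictured as a per-table timeline of schema versions, looked up by the LSN of the record being decoded. The following Python sketch is a mental model only - the actual implementation derives this information from MVCC catalog access, not from an explicit version list, and the LSN values below are hypothetical - but it shows why decoding the earlier example keeps working across the ADD/DROP COLUMN statements:

```python
import bisect

class SchemaTimeline:
    """Mental model only: remember each schema version of a table together
    with the (hypothetical) LSN it became valid at, and pick the right one
    for the LSN of the record being decoded.  The real implementation gets
    this from MVCC catalog access instead of an explicit version list."""

    def __init__(self):
        self.lsns = []      # ascending LSNs at which DDL took effect
        self.schemas = []   # column list valid from the matching LSN on

    def record_ddl(self, lsn, columns):
        self.lsns.append(lsn)            # assumes LSNs arrive in order
        self.schemas.append(columns)

    def schema_at(self, lsn):
        return self.schemas[bisect.bisect_right(self.lsns, lsn) - 1]

# Hypothetical LSNs bracketing the ALTER TABLE statements of the example
t = SchemaTimeline()
t.record_ddl(0,   ["id", "somedata", "text"])
t.record_ddl(100, ["id", "somedata", "text", "bar"])  # ADD COLUMN bar
t.record_ddl(200, ["id", "somedata", "text"])         # DROP COLUMN bar
assert t.schema_at(50)  == ["id", "somedata", "text"]
assert t.schema_at(150) == ["id", "somedata", "text", "bar"]
```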
+ +=== Output Plugin === + +As mentioned previously, our aim is to make the implementation of output +plugins as simple and non-redundant as possible, as we expect several different +ones with different use cases to emerge quickly. See <<basic-schema>> for a +list of possible output plugins that we think might emerge. + +Although for now we only plan to tackle logical replication and, based on that, a +multi-master implementation in the near future, we definitely aim to provide all +use-cases with something easily usable! + +To decode and translate local transactions an output plugin needs to be able to +transform transactions as a whole so it can apply them as a meaningful +transaction on the other side. + +To provide that, every time we find a transaction commit and +thus have completed reassembling the transaction, we start to provide the +individual changes to the output plugin. It currently only has to fill out 3 +callbacks: +[options="header"] +|===================================================================================== +|Callback |Passed Parameters |Called per TX | Use +|begin |xid |once |Begin of a reassembled transaction +|change |xid, subxid, change, mvcc snapshot |every change |Gets passed every change so it can transform it to the target format +|commit |xid |once |End of a reassembled transaction +|===================================================================================== + +During each of those callbacks an appropriate timetraveling SnapshotNow snapshot +is set up so the callbacks can perform all read-only catalog accesses they need, +including using the sys/rel/catcache. For obvious reasons only read access is +allowed. + +The snapshot guarantees that the results of lookups are the same as they +were/would have been when the change was originally created. + +Additionally they get passed an MVCC snapshot, to e.g.
run SQL queries on +catalogs or similar. + +[IMPORTANT] +============ +At the moment none of these snapshots can be used to access normal user +tables. Adding additional tables to the allowed set is easy implementation-wise, +but every transaction changing such tables incurs a noticeably higher +overhead. +============ + +For now transactions won't be decoded/output in parallel. There are ideas to +improve on this, but we don't think the complexity is appropriate for the first +release of this feature. + +This is an adoption barrier for databases where large amounts of data get +loaded/written in one transaction. + +=== Setup of replication nodes === + +When setting up a new standby/consumer of a primary, some problems exist +independent of the implementation of the consumer. The gist of the problem is +that when making a base backup and starting to stream all changes since that +point, transactions that were running at that time cannot be included: + +* Transactions that have not committed before starting to dump a database are + invisible to the dumping process + +* Transactions that began before the point from which on the WAL is being + decoded are incomplete and cannot be replayed + +Our proposal for a solution to this is to detect points in the WAL stream where we can provide: + +. A snapshot exported similarly to pg_export_snapshot() footnote:[http://www.postgresql.org/docs/devel/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION] that can be imported with +SET TRANSACTION SNAPSHOT+ footnote:[http://www.postgresql.org/docs/devel/static/sql-set-transaction.html] +. A stream of changes that will include the complete data of all transactions seen as running by the snapshot generated in 1) + +See the diagram.
+ +[[setup-schema]] +.Control flow during setup of a new node +["ditaa",scaling="0.7"] +------------------------------------------------------------------------------ ++----------------+ +| Walsender | | +------------+ +| v | | Consumer | ++-------------+ |<--IDENTIFY_SYSTEM-------------| | +| WAL | | | | +| decoding | |----....---------------------->| | ++------+------/ | | | +| | | | | +| v | | | ++-------------+ |<--INIT_LOGICAL $PLUGIN--------| | +| TX | | | | +| reassembly | |---FOUND_STARTING %X/%X------->| | ++-------------/ | | | +| | |---FOUND_CONSISTENT %X/%X----->| | +| v |---pg_dump snapshot----------->| | ++-------------+ |---replication slot %P-------->| | +| Output | | | | +| Plugin | | ^ | | ++-------------/ | | | | +| | +-run pg_dump separately --| | +| | | | +| |<--STREAM_DATA-----------------| | +| | | | +| |---data ---------------------->| | +| | | | +| | | | +| | ---- SHUTDOWN ------------- | | +| | | | +| | | | +| |<--RESTART_LOGICAL $PLUGIN %P--| | +| | | | +| |---data----------------------->| | +| | | | +| | | | ++----------------+ +------------+ + +------------------------------------------------------------------------------ + +=== Disadvantages of the approach === + +* somewhat intricate code for snapshot timetravel +* output plugins/walsenders need to work per database as they access the catalog +* when sending to multiple standbys some work is done multiple times +* decoding/applying multiple transactions in parallel is hard + +=== Unfinished/Undecided issues === + +* declaration of user ``catalog'' tables (e.g. 
userspace enums) +* finishing different parts of the implementation + * spill to disk during transaction reassembly + * mixed catalog/data transactions + * snapshot refcounting + * snapshot exporting + * snapshot serialization diff --git a/src/backend/replication/logical/Makefile b/src/backend/replication/logical/Makefile index cf040ef..92b4508 100644 --- a/src/backend/replication/logical/Makefile +++ b/src/backend/replication/logical/Makefile @@ -17,3 +17,9 @@ override CPPFLAGS := -I$(srcdir) $(CPPFLAGS) OBJS = decode.o logicalfuncs.o reorderbuffer.o snapbuild.o include $(top_srcdir)/src/backend/common.mk + +DESIGN.pdf: DESIGN.txt + a2x -v --fop -f pdf -D $(shell pwd) $< + +README.SNAPBUILD.pdf: README.SNAPBUILD.txt + a2x -v --fop -f pdf -D $(shell pwd) $< diff --git a/src/backend/replication/logical/README.SNAPBUILD.txt b/src/backend/replication/logical/README.SNAPBUILD.txt new file mode 100644 index 0000000..70e142f --- /dev/null +++ b/src/backend/replication/logical/README.SNAPBUILD.txt @@ -0,0 +1,298 @@ += Snapshot Building = +:author: Andres Freund, 2ndQuadrant Ltd + +== Why do we need timetravel catalog access == + +When doing WAL decoding (see DESIGN.txt for reasons to do so) we need to know +how the catalog looked at the point a record was inserted into the WAL, because +without that information we don't know much more about the record than its +length. It's just an arbitrary bunch of bytes without further information. +Unfortunately, due to the possibility of the table definition changing, we cannot +just access a newer version of the catalog and assume the table definition is still the same. + +If only the type information were required it might be enough to annotate the +WAL records with a bit more information (table oid, table name, column name, +column type), but as we want to be able to convert the output to more useful +formats like text, we need to be able to call output functions.
Those need a +normal environment including the usual caches and normal catalog access to +look up operators, functions and other types. + +Our solution to this is to add the capability to access the catalog in a way +that makes it look like it did when the record was inserted into the WAL. The +locking used during WAL generation guarantees the catalog is/was in a consistent +state at that point. + +Interesting cases include: +- enums +- composite types +- extension types +- non-C functions +- relfilenode to table oid mapping + +Due to postgres' MVCC nature, regular modifications of a table's contents are +theoretically non-destructive. The problem is that there is no way to access +arbitrary points in time even if the data for them is there. + +This module adds the capability to do so in the very limited set of +circumstances we need it in for WAL decoding. It does *not* provide a facility +to do so in general. + +A 'Snapshot' is the data structure used in postgres to describe which tuples are +visible and which are not. We need to build a Snapshot which can be used to +access the catalog the way it looked when the WAL record was inserted. + +Restrictions: +* Only works for catalog tables +* Snapshot modifications are somewhat expensive +* it cannot build initial visibility information for every point in time, it + needs a specific set of circumstances for that +* limited window in which we can build snapshots + +== How do we build timetravel snapshots == + +Hot Standby added infrastructure to build snapshots from WAL during recovery in +the 9.0 release. Most of that can be reused for our purposes.
+ +We cannot reuse all of the HS infrastructure because: +* we are not in recovery +* we need to look *inside* transactions +* we need the capability to have multiple different snapshots around at the same time + +We need to provide two kinds of snapshots that are implemented rather +differently in their plain postgres incarnation: +* SnapshotNow +* SnapshotMVCC + +We need both because if any operators use normal functions they will get +executed with SnapshotMVCC semantics while the catcache and related things will +rely on SnapshotNow semantics. Note that SnapshotNow here cannot be a normal +SnapshotNow because we wouldn't access the old version of the catalog in that +case. Instead it is something like an MVCC snapshot with the correct visibility +information. That also means this snapshot won't have some of the race issues a normal +SnapshotNow has. + +Every time a transaction that changed the catalog commits, all other transactions +will need a new snapshot that marks that transaction (and its subtransactions) +as visible. + +Our snapshot representation is a bit different from normal snapshots, but we +still reuse the normal SnapshotData struct: +* Snapshot->xip contains all transactions we consider committed +* Snapshot->subxip contains all transactions belonging to our transaction, + including the toplevel one + +The meaning of ->xip is inverted in comparison with non-timetravel snapshots +because usually only a tiny percentage of committed transactions will have +modified the catalog between xmin and xmax. It also makes subtransaction +handling easier (we cannot query pg_subtrans). + +== Building of initial snapshot == + +We can start building an initial snapshot as soon as we find either an +XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT_SHUTDOWN record because both allow us +to know how many transactions are running.
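The inverted ->xip meaning described above can be captured in a few lines. This Python sketch of the visibility test is deliberately simplified (the real C code also has to deal with other transactions' subtransactions, hint bits and so on), but it shows the core rule: between xmin and xmax, only explicitly listed transactions count as committed:

```python
def committed_in_snapshot(snap, xid):
    """Simplified visibility test with the inverted ->xip meaning:
    below xmin everything is committed, from xmax on nothing is, and in
    between only the transactions *listed* in xip count as committed
    (the opposite of a normal snapshot, where xip lists in-progress xids).
    subxip holds our own (sub)transactions, which are always visible."""
    xmin, xmax, xip, subxip = snap
    if xid in subxip:
        return True
    if xid < xmin:
        return True
    if xid >= xmax:
        return False
    return xid in xip

snap = (700, 710, {703, 705}, {708})         # xmin, xmax, xip, subxip
assert committed_in_snapshot(snap, 699)      # before xmin: committed
assert committed_in_snapshot(snap, 703)      # listed catalog modifier
assert not committed_in_snapshot(snap, 704)  # in window, not listed
assert committed_in_snapshot(snap, 708)      # our own subtransaction
assert not committed_in_snapshot(snap, 712)  # at/after xmax
```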
+ +We need to know which transactions were running when we start to build a +snapshot/start decoding because we don't have enough information about them: they +could have made catalog modifications before we started watching. Also, we +wouldn't have the complete contents of those transactions, as we started reading +after they began. The latter is also important to build snapshots which can +be used to build a consistent initial clone. + +There also is the problem that XLOG_RUNNING_XACTS records can be 'suboverflowed', +which means there were more running subtransactions than fit into shared +memory. In that case we use the same incremental building trick HS uses, which +is to either +1) wait until further XLOG_RUNNING_XACTS records have a running->oldestRunningXid +past the initial xl_running_xacts->nextXid, or +2) wait for a further XLOG_RUNNING_XACTS record that is not overflowed or +an XLOG_CHECKPOINT_SHUTDOWN + +XXX: we probably don't need to care about ->suboverflowed at all as we only +need to know about committed XIDs and we get enough information about +subtransactions at commit. More thinking needed. + +When we start building a snapshot we are in the 'SNAPBUILD_START' state. As +soon as we find any visibility information, even if incomplete, we change to +SNAPBUILD_INITIAL_POINT. + +When we have collected enough information to decode any transaction starting +after that point in time, we move to SNAPBUILD_FULL_SNAPSHOT. If transactions that +were already running commit before the next state is reached, we throw their +complete content away. + +When all transactions that were running when we switched over to FULL_SNAPSHOT +have committed, we change into the 'SNAPBUILD_CONSISTENT' state. Every transaction +that commits from now on gets handed to the output plugin. +When doing the switch to CONSISTENT we optionally export a snapshot which makes +all transactions visible that committed up to this point.
That exported snapshot allows the user to run pg_dump on it and replay all
changes received on top of the restored dump to get a consistent new clone.

["ditaa",scaling="0.8"]
---------------

+-------------------------+
|SNAPBUILD_START          |-----------------------+
|                         |-----------+           |
+-------------------------+           |           |
            |                         |           |
  XLOG_RUNNING_XACTS suboverflowed    |           |
            |                         |  saved snapshot
            |                         |           |
            v                         |           |
+-------------------------+           v           v
|SNAPBUILD_INITIAL        |---------------------->+
|                         |---------->+           |
+-------------------------+           |           |
            |                         |           |
 oldestRunningXid past initialNextXid |           |
            |                         |           |
            | XLOG_RUNNING_XACTS !suboverflowed   |
            v                         |           |
+-------------------------+           |           |
|SNAPBUILD_FULL_SNAPSHOT  |<----------+           v
|                         |---------------------->+
+-------------------------+                       |
            |                                     |
            |            XLOG_CHECKPOINT_SHUTDOWN |
  any running txn's finished                      |
            |                                     |
            v                                     |
+-------------------------+                       |
|SNAPBUILD_CONSISTENT     |<----------------------+
|                         |
+-------------------------+

---------------

== Snapshot Management ==

Whenever a transaction is detected as having started during decoding, after
SNAPBUILD_FULL_SNAPSHOT has been reached, we distribute the currently
maintained snapshot to it (i.e. call ApplyCacheAddBaseSnapshot). This serves as
its initial SnapshotNow and SnapshotMVCC. Unless there are concurrent catalog
changes that snapshot won't ever change.

Whenever a transaction that had catalog changes commits, we iterate over all
concurrently active transactions and add a new SnapshotNow to each of them
(ApplyCacheAddBaseSnapshot(current_lsn)). This is required because any row
written from that point on will have used the changed catalog contents. This
can occur even with correct locking.

SnapshotNows need to be set up globally so the syscache and other pieces access
them transparently. This is done using two new tqual.h functions:
SetupDecodingSnapshots() and RevertFromDecodingSnapshots().
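The snapshot distribution described above can be sketched as follows. All
names here are invented stand-ins for illustration; the real code tracks
transactions in the applycache and calls ApplyCacheAddBaseSnapshot rather than
the toy helpers below.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_TXNS 16

/* toy model of an in-progress transaction tracked during decoding */
typedef struct DecodingTxn
{
    uint32_t   xid;
    XLogRecPtr base_snapshot_lsn;   /* snapshot this txn must decode with */
} DecodingTxn;

static DecodingTxn active_txns[MAX_TXNS];
static int         nactive = 0;

/* stand-in for ApplyCacheAddBaseSnapshot() */
static void
AddBaseSnapshot(DecodingTxn *txn, XLogRecPtr snapshot_lsn)
{
    txn->base_snapshot_lsn = snapshot_lsn;
}

/* a transaction's first change was seen: hand it the current snapshot */
static void
TxnStarted(uint32_t xid, XLogRecPtr current_snapshot_lsn)
{
    assert(nactive < MAX_TXNS);
    active_txns[nactive].xid = xid;
    AddBaseSnapshot(&active_txns[nactive], current_snapshot_lsn);
    nactive++;
}

/*
 * A catalog-modifying transaction committed at 'lsn': every concurrently
 * active transaction needs a new SnapshotNow from here on, because rows it
 * writes later may already use the changed catalog contents.
 */
static void
CatalogChangeCommitted(XLogRecPtr lsn)
{
    for (int i = 0; i < nactive; i++)
        AddBaseSnapshot(&active_txns[i], lsn);
}
```

A transaction that never overlaps a catalog-modifying commit thus keeps its
initial snapshot for its whole lifetime, which is the common case.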

== Catalog/User Table Detection ==

To detect whether a record/transaction does catalog modifications - which we
need to do for memory/performance reasons - we need to resolve the RelFileNodes
in xlog records back to the original tables. Unfortunately RelFileNodes only
contain the table's relfilenode, not its table oid. We can only do catalog
access once we have reached FULL_SNAPSHOT; before that we can use some
heuristics, but otherwise we have to assume that every record changes the
catalog.

The heuristics we can use are:
* relfilenode->spcNode == GLOBALTABLESPACE_OID
* relfilenode->relNode <= FirstNormalObjectId
* RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid

Those detect some catalog tables but not all (think VACUUM FULL), but if they
detect one they are correct.

After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
a table might not be a catalog table. For that we use the new RELFILENODE
syscache with (spcNode, relNode).

XXX: Note that this syscache is a bit problematic because it's not actually
unique: shared/nailed catalogs store a 0 as relfilenode (they are stored in the
relmapper). Those are never looked up though, so it might be ok. Unfortunately
it doesn't seem to be possible to use a partial index (WHERE relfilenode != 0)
here.

XXX: For some use cases it would be useful to treat some user specified tables
as catalog tables

== System Table Rewrite Handling ==

XXX, expand, XXX

NOTES:
* always using newest relmapper, use newest invalidations
* old tuples are preserved across rewrites, that's fine
* REINDEX/CLUSTER pg_class; in a transaction

== mixed DDL/DML transaction handling ==

When a transaction uses DDL and DML in the same transaction things get a bit
more complicated because we need to handle CommandIds and ComboCids, as we need
to use the correct version of the catalog when decoding the individual tuples.
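Returning to the detection heuristics listed above, the first two checks can be
sketched as a standalone function. The two constants are the values PostgreSQL
actually uses; the struct layout matches RelFileNode, but the function itself
is invented for this sketch, and the third heuristic (the relation mapper
lookup via RelationMapFilenodeToOid) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* actual PostgreSQL values (pg_global tablespace, end of initdb OID range) */
#define GLOBALTABLESPACE_OID 1664
#define FirstNormalObjectId  16384

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace oid */
    Oid dbNode;     /* database oid */
    Oid relNode;    /* relation's relfilenode */
} RelFileNode;

/*
 * Returns true only when the heuristics prove the relation is a system
 * catalog; false means "don't know", not "user table" (a VACUUM FULL can
 * assign a catalog a new, high relfilenode).  The relation mapper check
 * is omitted from this sketch.
 */
static bool
DefinitelyCatalogRelation(const RelFileNode *rnode)
{
    if (rnode->spcNode == GLOBALTABLESPACE_OID)
        return true;    /* shared catalogs live in pg_global */
    if (rnode->relNode <= FirstNormalObjectId)
        return true;    /* relfilenode assigned during initdb */
    return false;       /* would need a RELFILENODE syscache lookup */
}
```

Before FULL_SNAPSHOT is reached, a false result here forces the conservative
assumption that the record may change the catalog.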

CommandId handling itself is relatively simple; we can figure out the current
CommandId relatively easily by looking at the one used in the current change.
The problematic part is that those CommandIds frequently will not be actual
cmin or cmax values but ComboCids. Those are used to minimize space in the
heap. During normal operation cmin/cmax values are only used within the backend
emitting those rows and only during one toplevel transaction, so instead of
storing cmin/cmax, only a reference to an in-memory value is stored that
contains both. Whenever we see a new CommandId we call
ApplyCacheAddNewCommandId.

To resolve this problem, whenever we generate a new ComboCid during heap_*
(detected via a new parameter to HeapTupleHeaderAdjustCmax) in a catalog table,
we log a new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During
decoding this ComboCid is added to the applycache
(ApplyCacheAddNewComboCid). ComboCids are only guaranteed to be valid within a
single transaction, so we cannot simply set up all of them globally. Before
calling the output plugin ComboCids are temporarily set up and torn down
afterwards.

All this only needs to happen in the transaction performing the DDL.

== Cache Handling ==

As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate
cache invalidation. For transactions without DDL that's easy as everything is
already provided by HS. Every time we read a commit record we apply the sinval
messages contained therein.

For transactions that contain DDL and DML, cache invalidation needs to happen
more frequently because we need to tear down all caches that just got
modified. To do that we simply take all invalidation messages collected at the
end of the transaction and apply them after every single change. At some point
this can get optimized by generating new local invalidation messages, but that
seems too complicated for now.
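The conservative per-change replay of invalidations just described can be
sketched as below. The message struct and both functions are invented
stand-ins for the sinval machinery (the real code deals in
SharedInvalidationMessage and LocalExecuteInvalidationMessage); here the
"cache flush" is just a counter.

```c
#include <assert.h>
#include <stddef.h>

/* invented stand-in for SharedInvalidationMessage */
typedef struct InvalMessage
{
    int      cache_id;
    unsigned hash_value;
} InvalMessage;

static int cache_flushes = 0;   /* counts cache entries we threw away */

/* stand-in for executing one invalidation; here we only count */
static void
ExecuteInvalidation(const InvalMessage *msg)
{
    (void) msg;
    cache_flushes++;
}

/*
 * Decode one change of a DDL-containing transaction, then conservatively
 * replay *all* invalidation messages that were collected at its commit,
 * so the next change is decoded against a freshly rebuilt cache state.
 */
static void
DecodeChangeWithInvalidations(const InvalMessage *commit_msgs, size_t nmsgs)
{
    /* ... hand the change to the output plugin here ... */
    for (size_t i = 0; i < nmsgs; i++)
        ExecuteInvalidation(&commit_msgs[i]);
}
```

The cost is obvious: every change pays for the whole commit-time invalidation
set, which is exactly why the text flags per-change local invalidations as a
future optimization.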

XXX: think/talk about syscache invalidation of relmapper/pg_class changes.

== xmin Horizon Handling ==

Reusing MVCC for timetravel access has one obvious major problem:
VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
obviously, we want autovacuum/manual vacuum to work as before.

The idea here is to reuse the infrastructure built for hot_standby_feedback
which allows us to keep the xmin horizon of a walsender backend artificially
low. We keep it low enough so we can restart decoding from the last location
the client has confirmed to be safely received. That means we keep it low
enough to contain the last checkpoint's oldestXid value.

That also means we need to make that value persist across restarts/crashes in a
manner very similar to twophase.c's. That infrastructure is actually also
useful to make hot_standby_feedback work properly across primary restarts.

== Restartable Decoding ==

As we want to generate a consistent stream of changes we need the ability to
start from a previously decoded location without going through the whole
multi-phase setup, because that would make it very hard to calculate up to
where we need to keep information.

To make that easier, every time a decoding process finds an online checkpoint
record it exclusively takes a global lwlock and checks whether visibility
information has already been written out for that checkpoint, and does so if
not. We only need to do that once as visibility information is the same
between all decoding backends.
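The checkpoint-time serialization above can be sketched as follows. The lock
helpers, the function, and the bookkeeping variables are all invented for
illustration; the real code would use an lwlock and actually write the
visibility information to disk.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* invented stand-ins for a global lwlock and the serialized state */
static XLogRecPtr last_serialized_lsn = 0;
static int        nserializations = 0;

static void TakeGlobalSnapbuildLock(void)    { /* LWLockAcquire(...) */ }
static void ReleaseGlobalSnapbuildLock(void) { /* LWLockRelease(...) */ }

/*
 * Called whenever a decoding process replays an online checkpoint record.
 * Visibility information is identical between decoding backends, so only
 * the first process reaching a given checkpoint needs to write it out;
 * everyone else sees last_serialized_lsn already advanced and does nothing.
 */
static void
SerializeVisibilityAtCheckpoint(XLogRecPtr checkpoint_lsn)
{
    TakeGlobalSnapbuildLock();
    if (last_serialized_lsn < checkpoint_lsn)
    {
        /* ... write visibility information for checkpoint_lsn to disk ... */
        last_serialized_lsn = checkpoint_lsn;
        nserializations++;
    }
    ReleaseGlobalSnapbuildLock();
}
```

Restarting decoding then only needs to find the newest serialized state at or
before the requested start location instead of redoing the multi-phase setup.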