Re: [HACKERS] Replication Node Identifiers and crashsafe Apply Progress

Steve Singer Tue, 19 Nov 2013 15:51:06 -0800

On 11/14/2013 12:26 PM, Andres Freund wrote:

Hello,


As you know, the reason we are working changeset extraction is that we
want to build logical unidirection and bidirectional replication
ontop. To use changeset extraction effectively, I think one set of
related features ontop is very useful:

When extracting changes using the changeset extraction patchset (latest
version at [1]) the START_LOGICAL_REPLICATION command is used to stream
changes from a source system. When started it will continue to send
changes as long as the connection is up or it is aborted. For obvious
performance reasons it will *not* wait for an ACK for each transaction
commit it streams out.
Instead it relies on the receiver, exactly as in physical replication,
sending feedback messages containing the LSN up to which data has safely
been received.
That means frequently something like:
walsender: => COMMIT 0/10000000
walsender: => COMMIT 0/10000200
walsender: => COMMIT 0/10000400
walsender: => COMMIT 0/10000600
receiver:  <= ACKNOWLEDGE 0/10000270
walsender: => COMMIT 0/10000800
is possible and important for performance. I.e. the server has streamed
out more changes than it got confirmation for.

So, when the the replication connection goes down, e.g. because the
receiving side has crashed, we need to tell the server from where to
start. Every position between the last ACKed and the end of WAL is
legal.
The receiver then can ask the source to start replication from the last
replayed commit using START_LOGICAL_REPLICATION 'slot_name'
'0/10000600' which would then re-stream all the changes in the
transaction that committe at 0/10000600 and all that follow.

But for that the receiving side needs to know up to where changes have
been applied. One relatively easy solution for that is that the
receiving side does something like:
UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
before the end of every replayed transaction. But that obviously will
quickly cause bloat.

I don't see how this is going to cause any more bloat than whattrigger-based slony does today with sl_confirm and I don't hear a lot ofcomplaints about that being a big problem. This might be because slonydoesn't do a commit on the replica for every transaction but groups thetransactions together, logical slony will behave the same way where wewould only commit on SYNC transactions.

Our solution to that is that a replaying process can tell the backend
that it is currently doing so and setup three variables for every
transaction:
1) an identifier for the the source database
2) the LSN at which the replayed transaction has committed remotely
3) the time at which the replayed transaction has committed remotely

When the transaction then commits the commit record will set the
XACT_CONTAINS_ORIGIN flag to ->xinfo and will add that data to the end
of the commit record. During crash recovery the startup process will
remember the newest LSN for each remote database in shared memory.

This way, after a crash, restart, disconnect the replay process can look
into shared memory and check how far it has already replayed and restart
seamlessly. With minimal effort.

We previously discussed the topic and some were very adverse to using
any sort of numeric node identifiers across systems and suggested that
those should only be used internally. So what the attached patch does is
to add a new shared system catalog called 'pg_replication_identifier'
(suggestions for a better name welcome) which translates a number of
identifying traits into a numeric identifier.
The set of identifiers currently are:
* the sysid of the remote system, combined with the remote TLI
* the oid of the local database
* the oid of the remote database
* an optional name
but that's just what we needed in our multimaster prototype, and not
what I necessarily think is correct.

The added API (which surely need some work, I am not particularly happy
with the naming of functions for one) basically consists of two parts:
1) functions to query/create replication identifiers:
    * GetReplicationIdentifier(identifying traits) - search for a numeric 
replication identifier
    * CreateReplicationIdentifier(identifying traits) - creates a numeric 
replication identifier
    * GetReplicationInfoByIdentifier(numeric identifier) - returns identifying 
traits

2) functions to query/manipulate replication progress:
    * AdvanceReplicationIdentifier(node, local_lsn, remote_lsn)
    * XLogRecPtr RemoteCommitFromReplicationIdentifier(node)

Internally the code also maintains some on-disk data which is updated
during checkpoints to store the replication progress, otherwise it'd
vanish if we shutdown gracefully ;).

The attached code also integrates with the "commit timestamp" module
that Alvaro submitted ([2]). Everytime a remote transaction is committed
we store a) the remote commit's timestamp, b) the origin node id in it.
That allows to relatively easily build multimaster systems with conflict
resolution ontop, since whenever there's a conflict the originating
node, and originating commit timestamp for a row can be queried
efficiently.

Having information about the origin of a change/transaction allows to
implement complex replication topologies since the information is
available to changeset extration output plugins.
It allows to do write plugins that:
* decode all changes, independent from the system they were originally
   executed on by the user
* decode changes generated locally, but none from remote systems
* pick and choose between those, say only decode those the receiving
   system isn't replicating from itself

Questions are:
* Which identifying traits do we want to use to identify nodes?
* How do we want to manipulate replication identifiers? Currently they
   can only be manipulated by using C functions, which is fine for some users,
   but probably not for others?
* Do we want to allow setting (remote_lsn, remote_timestamp,
   remote_node_id) via SQL? Currently the remote_node_id can be set as a
   GUC, but the other's can't. They probably should be a function that
   can be called instead of GUCs?

A way of advancing the replication pointer via SQL would be nice,otherwise I'll just have to write my own C function that I will invokevia SQL (which sin't hard but everyone would need to do the same)

* Suggestions for better names!
* Would slony et al need something ontop to use this?

In the slony world we identifer nodes with a 32 bit integer. I thinkthe idea is that I'm going to have to pass arguments into


+extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
+                                                                               
  Oid remotedb, Name riname,
+                                                                               
  Oid rilocaldb);

to map my slony concept of a node id to a 16 bit "node id" that is onlyuseful on the local system. In additon to a slony 32 bit node id I havea conninfo that I can use to contact that node. I think the slon wouldneed to connect to the remote node with that conninfo (which it doesanyway) and get the remote oid's and then use the slony node_idconverted to a string as the "riname" value.I would then have to invoke AdvanceReplicationIdentifier at some pointbefore I issue the commit.



What does building up node_id key from

(sysid, tlid, remote_dbid, local_dbid, name) get us over just mapping from an 
arbitrary name field to a 16 bit node_id ?
I agree with the other comments on the thread that letting the replication 
system figure out its own unique naming is better.  If we were going t come up 
with a schema then I am also not sure if using the remote TLI as part of the 
node key is a good idea.

Todo:
* cleanup/naming
* Set returning function to see the replication progress
* remove old checkpoint files

Note that this only applies a) ontop the changeset extraction code b)
the commit timestamp code. The 'replication-identifiers' git branch
([3]) contains all integrated together.

Comments welcome!

Greetings,

Andres Freund

[1]http://archives.postgresql.org/message-id/20131114134647.GA26172%40alap2.anarazel.de
[2]http://archives.postgresql.org/message-id/20131022221600.GE4987%40eldon.alvh.no-ip.org
[3]http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/replication-identifiers




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Replication Node Identifiers and crashsafe Apply Progress

Reply via email to