First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking in reviewing this is to try to answer the following questions:

1) How might a future version of slony be able to use logical replication as described by your patch and design documents, and what would that look like?

2) What functionality is missing from the patch set that would stop me from implementing or prototyping the above?



Connecting slon to the remote postgresql
========================

Today the slony remote listener thread queries a bunch of events from sl_event for a batch of SYNC events. Then the remote helper thread queries data from sl_log_1 and sl_log_2. I see this changing: instead, the slony remote listener thread would connect to the remote system and get a logical replication stream.

1) Would slony connect as a normal client connection and call something like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication change records to process?
  OR
2) Would slony connect as a replication connection, similar to how the pg_receivelog program does today, and then process the logical changeset output itself instead of writing it to a file (as pg_receivelog does)?

It seems that the second approach is what is encouraged. I think we would put a lot of the pg_receivelog functionality into slon, and it would issue a command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical replication plugin. Slon would also have to provide feedback to the walsender about what it has processed so the origin database knows which catalog snapshots can be expired. Based on eyeballing pg_receivelog.c, it seems that about half the code in the roughly 700 line file is related to command line arguments etc., and the other half is related to looping over the copy out stream, sending feedback and other things that we would need to duplicate in slon.
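
Roughly, I'd expect the receive side of slon to end up with a loop like the one below. The libpq calls are real; apply_change_record() and send_feedback() are slon-side placeholders I'm inventing here, and the 'w'/'k' message handling is simplified relative to what pg_receivelog.c actually does:

#include "libpq-fe.h"

/* hypothetical slon-side helpers */
extern void apply_change_record(const char *buf, int len);
extern void send_feedback(PGconn *conn);

static void
slon_receive_changes(PGconn *conn)
{
    char   *copybuf = NULL;
    int     len;

    /* the connection is already in COPY BOTH mode at this point */
    while ((len = PQgetCopyData(conn, &copybuf, 0)) > 0)
    {
        if (copybuf[0] == 'w')
            apply_change_record(copybuf, len);  /* a logical change record */
        else if (copybuf[0] == 'k')
            send_feedback(conn);                /* keepalive: report progress */

        PQfreemem(copybuf);
        copybuf = NULL;
    }
    /* len == -1: stream ended normally; len == -2: error, see PQerrorMessage() */
}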

pg_receivelog.c has a comment:

/*
 * We have to use postgres.h not postgres_fe.h here, because there's so much
 * backend-only stuff in the XLOG include files we need.  But we need a
 * frontend-ish environment otherwise.  Hence this ugly hack.
 */

This looks more like a carryover from pg_receivexlog.c. From what I can tell we can eliminate the postgres.h include if we also eliminate the utils/datetime.h and utils/timestamp.h includes and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines, someone should speak up. I recall that in the past slon actually did include postgres.h and it caused some issues (I think with MSVC win32 builds). Since pg_receivelog.c will be used as a starting point/sample for third parties writing client programs, it would be better if it didn't encourage those programs to include postgres.h.


The Slony Output Plugin
=====================

Once we've modified slon to connect as a logical replication client we will need to write a slony output plugin.

As I understand the plugin API:

* A walsender processes WAL records; each time it sees a COMMIT record it will call my plugin's
  .begin
  .change (once for each change in the transaction)
  .commit

* The plugin for a particular stream/replication client will see one transaction at a time, passed to it in commit order. It won't see .change(t1) followed by .change(t2) followed by a second .change(t1). The reorder buffer code hides all of that complexity from me (yay).

From a slony point of view I think the output of the plugin will be rows, suitable to be passed to COPY IN of the form:

origin_id, table_namespace, table_name, command_type, cmd_updatencols, command_args

This is basically the Slony 2.2 sl_log format minus a few columns we no longer need (txid, actionseq). command_args is a postgresql text array of column=value pairs, e.g. [ {id=1},{name='steve'},{project='slony'} ].
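
To make that concrete, I'd expect the plugin to end up shaped something like the sketch below. The callback names and signatures are only illustrative (modelled loosely on test_decoding, not the patch's actual API), and slony_table_is_replicated()/local_node_id stand in for lookups we'd have to implement; the filtering question is discussed more below:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/* placeholders for slony-side lookups */
extern bool slony_table_is_replicated(Relation rel);
extern int  local_node_id;

static void
slony_decode_begin(StringInfo out)
{
    /* nothing to emit at transaction start, for now */
}

static void
slony_decode_change(StringInfo out, Relation rel,
                    const char *cmdtype, int updncols, const char *cmdargs)
{
    /* skip tables that aren't in a replicated set (see filtering below) */
    if (!slony_table_is_replicated(rel))
        return;

    /* emit one sl_log style row, suitable for COPY IN on the receiver */
    appendStringInfo(out, "%d\t%s\t%s\t%s\t%d\t%s\n",
                     local_node_id,
                     get_namespace_name(RelationGetNamespace(rel)),
                     RelationGetRelationName(rel),
                     cmdtype,
                     updncols,
                     cmdargs);
}

static void
slony_decode_commit(StringInfo out)
{
    /* transaction boundary; slon decides when to actually commit (see below) */
}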

I don't think our output plugin will be much more complicated than the test_decoding plugin. I suspect we will want to give it the ability to filter out non-replicated tables. We will also have to filter out change records that didn't originate on the local node and aren't part of a cascaded subscription. Remember that in a two node cluster slony will have connections from A-->B and from B--->A even if user tables only flow one way. Data that is replicated from A into B will show up in the WAL stream for B.

Exactly how we do this filtering is an open question; I think the output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on. This is easy to figure out if the output plugin is able/allowed to query its database. Will this be possible? I would expect to be able to query the database as it exists now (at plugin invocation time), not as it existed in the past when the WAL was generated. In addition to the node ID, I can see us wanting to be able to query other slony tables (sl_table, sl_set etc...)

b) What the slony node id is of the node we are streaming to. It would be nice if we could pass extra, arbitrary data/parameters to the output plugins that could include that, or other things. At the moment the start_logical_replication rule in repl_gram.y doesn't allow for that, but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out data in this stage.


<editorial> Everything above deals with the postgresql side of things, ie the patch in question or the plugin API we would have to work with. Much of what is below deals with slony-side changes and might be of limited interest to some on pgsql-hackers.
</editorial>

Slon Applying Changes
================

The next task we will have is to make slon and the replica instance able to apply these changes. In slony 2.2 we do a COPY from sl_log and apply that stream to a table on the replica with COPY. We then have triggers on the replica that decode the command_args and apply the changes as INSERT/UPDATE/DELETE statements on the user tables. I see this continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the logical replication stream that change sl_event. In many cases we won't just be inserting records into sl_event; we will instead need to run the logic in remote_worker.c for processing the different types of events. Worst case, we could parse the change records we receive from our version of pg_receivelog and split them into an sl_event stream and an sl_log stream. Another approach might be to have the slony apply trigger build up a list of events that the slon remote_worker code can then process.

2) Slony is normally bi-directional even if user data only replicates one way. Confirm (sl_confirm) entries go from a replica back to an origin. In a two node origin->replica scenario for data, the way I see this working is that the slon for the origin would connect to the replica (like it does today). It would receive the logical replication records, but since it isn't subscribed to any tables it won't receive/process the WAL for user tables; it will still receive/process sl_confirm rows. It will then insert the rows in sl_confirm that it 'replicated' from the remote node.


With what I have described so far, Slony would then be receiving a stream of events that looks like:

t1 - insert into foo [id=1, name='steve']
t1 - insert into bar [id=1, something='somethingelse']
t1 - commit
t2 - insert into foo [....]
t2 - commit
t3 - insert into sl_event [ev_type=SYNC, ev_origin=1, ev_seqno=12345]
t3 - commit

Even though, from a data-correctness point of view, slony could commit the transaction on the replica after it sees the t1 commit, we won't want it to do commits other than on a SYNC boundary. This means that the replicas will continue to move between consistent SYNC snapshots and that we can still track the state/progress of replication by knowing what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide feedback to the walsender on SYNC boundaries after the transaction has committed on the receiver. I don't see this as being an issue.
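
A sketch of how I picture the apply side, where ChangeRecord and the helper functions are hypothetical slon-side code:

#include <string.h>
#include "libpq-fe.h"

typedef struct ChangeRecord
{
    bool        is_sl_event;    /* row targets sl_event rather than a user table */
    const char *ev_type;        /* "SYNC", "SUBSCRIBE_SET", ... when is_sl_event */
    const char *commit_lsn;     /* upstream LSN of the containing commit */
} ChangeRecord;

extern void apply_change(PGconn *local, const ChangeRecord *rec);   /* COPY apply path */
extern void process_event(PGconn *local, const ChangeRecord *rec);  /* remote_worker.c logic */
extern void send_feedback_upto(const char *lsn);                    /* report to the walsender */

static void
apply_record(PGconn *local, const ChangeRecord *rec)
{
    if (rec->is_sl_event && strcmp(rec->ev_type, "SYNC") == 0)
    {
        process_event(local, rec);
        PQclear(PQexec(local, "commit"));   /* SYNC boundary: commit on the replica */
        send_feedback_upto(rec->commit_lsn);
    }
    else
    {
        apply_change(local, rec);           /* keep accumulating until the next SYNC */
    }
}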

Setting up Subscriptions
===================
At first we have a slon cluster with just one node; life is good. When a second node is created and a path (or pair of paths) is defined between the nodes, I think they will each:
1. Connect to the remote node with a normal libpq connection.
    a. Get the current xlog recptr.
    b. Query any non-SYNC events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and start streaming logical replication changes starting at the recptr we retrieved above.

Slon will then receive any future events from the remote sl_event as part of the logical replication stream. It won't receive any user tables because it isn't yet subscribed to any.
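
A sketch of that startup sequence; pg_current_xlog_location() exists today, the INIT_LOGICAL_REPLICATION/START_LOGICAL_REPLICATION spellings come from the patch and may well change, and '_slonycluster' plus the abbreviated sl_event query are placeholders:

#include <stdio.h>
#include "libpq-fe.h"

static void
slon_start_streaming(PGconn *normal_conn, PGconn *repl_conn)
{
    PGresult   *res;
    char        start_lsn[64];
    char        cmd[256];

    /* 1a. remember where the remote WAL currently ends */
    res = PQexec(normal_conn, "SELECT pg_current_xlog_location()");
    snprintf(start_lsn, sizeof(start_lsn), "%s", PQgetvalue(res, 0, 0));
    PQclear(res);

    /* 1b. pick up any outstanding non-SYNC events the traditional way */
    res = PQexec(normal_conn,
                 "SELECT * FROM _slonycluster.sl_event WHERE ...");
    /* ... process the rows as remote_worker.c does today ... */
    PQclear(res);

    /*
     * 2. Start streaming logical changes from that position.
     * (INIT_LOGICAL_REPLICATION 'slony' would have been issued once before
     * this; the exact argument spelling here is a guess.)
     */
    snprintf(cmd, sizeof(cmd),
             "START_LOGICAL_REPLICATION 'slony' %s", start_lsn);
    res = PQexec(repl_conn, cmd);   /* puts the connection into COPY BOTH mode */
    PQclear(res);

    /* ... then enter the receive loop sketched earlier ... */
}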

When a subscription is started, the SUBSCRIBE_SET and ENABLE_SUBSCRIPTION events will go through sl_event, and the INSERT INTO sl_event will be part of a change record in the replication stream, to be picked up by the subscriber's slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of the tables in the replication set such that any changes made to the tables after the copy is started get included in the replication stream. The approach proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for this. I *think* slony could get by with something less fancy as well but it would be ugly.

1. Make sure that the origin starts including change records for the tables in the set.
2. Have the slon (copy_set) wait until any transactions on the origin that started prior to the ENABLE_SUBSCRIPTION are committed. Slony does this today as part of the copy_set logic.
3. Get/remember the snapshot visibility information for the COPY's transaction.
4. When we start to process change records, filter out records for transactions that were already visible to the COPY.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit awkward/ugly. This isn't an issue today because we are already using the transaction visibility information when selecting from sl_log, so it works; but above I had proposed stripping the xid from the logical change records.
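
For illustration, step 4 could look roughly like the snippet below if the change records did still carry the origin xid. txid_current_snapshot() and txid_visible_in_snapshot() already exist; everything else is a placeholder, and in practice slon would parse the snapshot once and do the comparison locally rather than round-tripping per record:

#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

/*
 * copy_snap is the text form of the snapshot captured with
 * "SELECT txid_current_snapshot()" inside the COPY's transaction.
 */
static bool
change_already_copied(PGconn *conn, const char *copy_snap, long long xid)
{
    char        sql[256];
    PGresult   *res;
    bool        visible;

    snprintf(sql, sizeof(sql),
             "SELECT txid_visible_in_snapshot(%lld, '%s')", xid, copy_snap);
    res = PQexec(conn, sql);
    visible = (strcmp(PQgetvalue(res, 0, 0), "t") == 0);
    PQclear(res);

    return visible;     /* true => the initial COPY already saw this transaction */
}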



Cascading Replication
=================
A-->B--->C

The slon for B will insert records from A into B's tables. This insert will generate WAL records on B. The slon for C should be able to pull the data it needs (both sl_event entries with ev_origin=A, and user table data originating on A) from B's logical replication stream. I don't see any issues here nor do I see a need to 'cache' the data in an sl_log type of table on B.


Reshaping Replication
=================

In Slony, replication is reshaped by two types of events: a MOVE SET and a FAILOVER.

Move Set:
A replication set might be subscribed in a cascaded fashion like
A--->B--->C

When a MOVE SET is issued, node A stops accepting new write transactions for tables in the set, and a MOVE_SET(1,A,B) event is put into sl_event on node A. Node B receives the MOVE_SET command in the proper order, after it has processed the last SYNC generated while A was still accepting write transactions to those tables. When node B processes the MOVE_SET event it starts accepting write transactions on the tables, and it also generates an ACCEPT_SET event. Node C will then receive the MOVE_SET (ev_origin=A) and ACCEPT_SET (ev_origin=B) commands (after all SYNC events from A with data changes to the set) and then knows that it should start receiving data for those tables from B.

I don't see any of this changing with logical replication acting as the data source.

FAILOVER:
---------------
A---->B
|    .
v  .
C

Today with slony, if B is a valid failover target then it is a forwarding node of the set. This means that B keeps a record in sl_log of any changes originating on A until B knows that node C has received those changes. In the event of a failover, if node C is far behind, it can just get the missing data from sl_log on node B (the failover target/new origin).

I see a problem with what I have discussed above: B won't explicitly store the data from A in sl_log; a cascaded node would depend on B's WAL stream. The problem is that at FAILOVER time, B might have processed some changes from A. Node C might also be processing node B's WAL stream for events (or data from another set). Node C will discard/not receive the data for A's tables since it isn't subscribed to those tables from B. What happens then if at some later point B and C receive the FAILOVER event? What does node C do? It can't get the missing data from node A because node A has failed, and it can't get it from node B because node C has already processed the WAL changes from node B that included the data but ignored/discarded them. Maybe node C could reprocess older WAL from node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
==========================
I think so; the incomplete areas I see are the ones mentioned in the patch submission, including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those, even though we'd need them before I could build a production system.

Conclusions
=============
I like this design much better than the original design from the spring, which would have required keeping a catalog proxy on the decoding machine. Based on what I've seen, it should be possible to make slony use logical replication as a source for events instead of triggers populating sl_log. My thinking is that we want a way for log receiver programs to pass arguments/parameters to the output plugins. Beyond that, this looks like something slony can use.



