First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking in reviewing this is to try to answer the following questions:

1) How might a future version of slony be able to use logical replication as described by your patch and design documents, and what would that look like?

2) What functionality is missing from the patch set that would stop me from implementing or prototyping the above?



Connecting slon to the remote postgresql
========================

Today the slony remote listener thread queries a bunch of events from sl_event for a batch of SYNC events. Then the remote helper thread queries data from sl_log_1 and sl_log_2. I see this changing: instead, the slony remote listener thread would connect to the remote system and get a logical replication stream.

1) Would slony connect as a normal client connection and call something like 'select pg_slony_process_xlog(...)' to get a bunch of logical replication change records to process?
  OR
2) Would slony connect as a replication connection, similar to how the pg_receivelog program does today, and then process the logical changeset output itself instead of writing it to a file (as pg_receivelog does)?

It seems that the second approach is what is encouraged. I think we would put a lot of the pg_receivelog functionality into slon, and it would issue a command like INIT_LOGICAL_REPLICATION 'slony' to use the slony logical replication plugin. Slon would also have to provide feedback to the walsender about what it has processed so the origin database knows which catalog snapshots can be expired. Based on eyeballing pg_receivelog.c, it seems that about half the code in the roughly 700 line file is related to command line arguments etc., and the other half is related to looping over the copy out stream, sending feedback and other things that we would need to duplicate in slon.
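
Roughly, I'd expect the receive side of slon to end up with a loop like the one below. The libpq calls are real; apply_change_record() and send_feedback() are slon-side placeholders I'm inventing here, and the 'w'/'k' message handling is simplified relative to what pg_receivelog.c actually does:

#include "libpq-fe.h"

/* hypothetical slon-side helpers */
extern void apply_change_record(const char *buf, int len);
extern void send_feedback(PGconn *conn);

static void
slon_receive_changes(PGconn *conn)
{
    char   *copybuf = NULL;
    int     len;

    /* the connection is already in COPY BOTH mode at this point */
    while ((len = PQgetCopyData(conn, &copybuf, 0)) > 0)
    {
        if (copybuf[0] == 'w')
            apply_change_record(copybuf, len);  /* a logical change record */
        else if (copybuf[0] == 'k')
            send_feedback(conn);                /* keepalive: report progress */

        PQfreemem(copybuf);
        copybuf = NULL;
    }
    /* len == -1: stream ended normally; len == -2: error, see PQerrorMessage() */
}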

pg_receivelog.c has a comment:

/*
 * We have to use postgres.h not postgres_fe.h here, because there's so much
 * backend-only stuff in the XLOG include files we need.  But we need a
 * frontend-ish environment otherwise.  Hence this ugly hack.
 */

This looks more like a carryover from pg_receivexlog.c. From what I can tell we can eliminate the postgres.h include if we also eliminate the utils/datetime.h and utils/timestamp.h includes and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines, someone should speak up. I recall that in the past slon actually did include postgres.h and it caused some issues (I think with MSVC win32 builds). Since pg_receivelog.c will be used as a starting point/sample for third parties writing client programs, it would be better if it didn't encourage those programs to include postgres.h.


The Slony Output Plugin
=====================

Once we've modified slon to connect as a logical replication client we will need to write a slony output plugin.

As I understand the plugin API:

* A walsender processes WAL records; each time it sees a COMMIT record it will call my plugin's
  .begin
  .change (once for each change in the transaction)
  .commit

* The plugin for a particular stream/replication client will see one transaction at a time, passed to it in commit order. It won't see .change(t1) followed by .change(t2) followed by a second .change(t1). The reorder buffer code hides all of that complexity from me (yay).

From a slony point of view I think the output of the plugin will be rows, suitable to be passed to COPY IN of the form:

origin_id, table_namespace, table_name, command_type, cmd_updatencols, command_args

This is basically the Slony 2.2 sl_log format minus a few columns we no longer need (txid, actionseq). command_args is a postgresql text array of column=value pairs, e.g. [ {id=1},{name='steve'},{project='slony'} ].
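
To make that concrete, I'd expect the plugin to end up shaped something like the sketch below. The callback names and signatures are only illustrative (modelled loosely on test_decoding, not the patch's actual API), and slony_table_is_replicated()/local_node_id stand in for lookups we'd have to implement; the filtering question is discussed more below:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/* placeholders for slony-side lookups */
extern bool slony_table_is_replicated(Relation rel);
extern int  local_node_id;

static void
slony_decode_begin(StringInfo out)
{
    /* nothing to emit at transaction start, for now */
}

static void
slony_decode_change(StringInfo out, Relation rel,
                    const char *cmdtype, int updncols, const char *cmdargs)
{
    /* skip tables that aren't in a replicated set (see filtering below) */
    if (!slony_table_is_replicated(rel))
        return;

    /* emit one sl_log style row, suitable for COPY IN on the receiver */
    appendStringInfo(out, "%d\t%s\t%s\t%s\t%d\t%s\n",
                     local_node_id,
                     get_namespace_name(RelationGetNamespace(rel)),
                     RelationGetRelationName(rel),
                     cmdtype,
                     updncols,
                     cmdargs);
}

static void
slony_decode_commit(StringInfo out)
{
    /* transaction boundary; slon decides when to actually commit (see below) */
}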

I don't think our output plugin will be much more complicated than the test_decoding plugin. I suspect we will want to give it the ability to filter out non-replicated tables. We will also have to filter out change records that didn't originate on the local node and aren't part of a cascaded subscription. Remember that in a two node cluster slony will have connections from A-->B and from B--->A even if user tables only flow one way. Data that is replicated from A into B will show up in the WAL stream for B.

Exactly how we do this filtering is an open question; I think the output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on. This is easy to figure out if the output plugin is able/allowed to query its database. Will this be possible? I would expect to be able to query the database as it exists now (at plugin invocation time), not as it existed in the past when the WAL was generated. In addition to the node ID, I can see us wanting to be able to query other slony tables (sl_table, sl_set etc...)

b) What the slony node id is of the node we are streaming to. It would be nice if we could pass extra, arbitrary data/parameters to the output plugins that could include that, or other things. At the moment the start_logical_replication rule in repl_gram.y doesn't allow for that, but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out data in this stage.


<editorial> Everything above deals with the postgresql side of things, ie the patch in question or the plugin API we would have to work with. Much of what is below deals with slony-side changes and might be of limited interest to some on pgsql-hackers.
</editorial>

Slon Applying Changes
================

The next task we will have is to make slon and the replica instance able to apply these changes. In slony 2.2 we do a COPY from sl_log and apply that stream to a table on the replica with COPY. We then have triggers on the replica that decode the command_args and apply the changes as INSERT/UPDATE/DELETE statements on the user tables. I see this continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the logical replication stream that change sl_event. In many cases we won't just be inserting records into sl_event; we will instead need to run the logic in remote_worker.c for processing the different types of events. Worst case, we could parse the change records we receive from our version of pg_receivelog and split them into an sl_event stream and an sl_log stream. Another approach might be to have the slony apply trigger build up a list of events that the slon remote_worker code can then process.

2) Slony is normally bi-directional even if user data only replicates one way. Confirm (sl_confirm) entries go from a replica back to an origin. In a two node origin->replica scenario for data, the way I see this working is that the slon for the origin would connect to the replica (like it does today). It would receive the logical replication records, but since it isn't subscribed to any tables it won't receive/process the WAL for user tables; it will still receive/process sl_confirm rows. It will then insert the rows in sl_confirm that it 'replicated' from the remote node.


With what I have described so far, Slony would then be receiving a stream of events that looks like:

t1 - insert into foo [id=1, name='steve']
t1 - insert into bar [id=1, something='somethingelse']
t1 - commit
t2 - insert into foo [....]
t2 - commit
t3 - insert into sl_event [ev_type=SYNC, ev_origin=1, ev_seqno=12345]
t3 - commit

Even though, from a data-correctness point of view, slony could commit the transaction on the replica after it sees the t1 commit, we won't want it to do commits other than on a SYNC boundary. This means that the replicas will continue to move between consistent SYNC snapshots and that we can still track the state/progress of replication by knowing what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide feedback to the walsender on SYNC boundaries after the transaction has committed on the receiver. I don't see this as being an issue.
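
A sketch of how I picture the apply side, where ChangeRecord and the helper functions are hypothetical slon-side code:

#include <string.h>
#include "libpq-fe.h"

typedef struct ChangeRecord
{
    bool        is_sl_event;    /* row targets sl_event rather than a user table */
    const char *ev_type;        /* "SYNC", "SUBSCRIBE_SET", ... when is_sl_event */
    const char *commit_lsn;     /* upstream LSN of the containing commit */
} ChangeRecord;

extern void apply_change(PGconn *local, const ChangeRecord *rec);   /* COPY apply path */
extern void process_event(PGconn *local, const ChangeRecord *rec);  /* remote_worker.c logic */
extern void send_feedback_upto(const char *lsn);                    /* report to the walsender */

static void
apply_record(PGconn *local, const ChangeRecord *rec)
{
    if (rec->is_sl_event && strcmp(rec->ev_type, "SYNC") == 0)
    {
        process_event(local, rec);
        PQclear(PQexec(local, "commit"));   /* SYNC boundary: commit on the replica */
        send_feedback_upto(rec->commit_lsn);
    }
    else
    {
        apply_change(local, rec);           /* keep accumulating until the next SYNC */
    }
}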

Setting up Subscriptions
===================
At first we have a slon cluster with just one node; life is good. When a second node is created and a path (or pair of paths) is defined between the nodes, I think they will each:
1. Connect to the remote node with a normal libpq connection.
    a. Get the current xlog recptr.
    b. Query any non-SYNC events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and start streaming logical replication changes starting at the recptr we retrieved above.

Slon will then receive any future events from the remote sl_event as part of the logical replication stream. It won't receive any user tables because it isn't yet subscribed to any.
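
A sketch of that startup sequence; pg_current_xlog_location() exists today, the INIT_LOGICAL_REPLICATION/START_LOGICAL_REPLICATION spellings come from the patch and may well change, and '_slonycluster' plus the abbreviated sl_event query are placeholders:

#include <stdio.h>
#include "libpq-fe.h"

static void
slon_start_streaming(PGconn *normal_conn, PGconn *repl_conn)
{
    PGresult   *res;
    char        start_lsn[64];
    char        cmd[256];

    /* 1a. remember where the remote WAL currently ends */
    res = PQexec(normal_conn, "SELECT pg_current_xlog_location()");
    snprintf(start_lsn, sizeof(start_lsn), "%s", PQgetvalue(res, 0, 0));
    PQclear(res);

    /* 1b. pick up any outstanding non-SYNC events the traditional way */
    res = PQexec(normal_conn,
                 "SELECT * FROM _slonycluster.sl_event WHERE ...");
    /* ... process the rows as remote_worker.c does today ... */
    PQclear(res);

    /*
     * 2. Start streaming logical changes from that position.
     * (INIT_LOGICAL_REPLICATION 'slony' would have been issued once before
     * this; the exact argument spelling here is a guess.)
     */
    snprintf(cmd, sizeof(cmd),
             "START_LOGICAL_REPLICATION 'slony' %s", start_lsn);
    res = PQexec(repl_conn, cmd);   /* puts the connection into COPY BOTH mode */
    PQclear(res);

    /* ... then enter the receive loop sketched earlier ... */
}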

When a subscription is started, the SUBSCRIBE_SET and ENABLE_SUBSCRIPTION events will go through sl_event, and the INSERT INTO sl_event will be part of a change record in the replication stream, to be picked up by the subscriber's slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of the tables in the replication set such that any changes made to the tables after the copy is started get included in the replication stream. The approach proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for this. I *think* slony could get by with something less fancy as well but it would be ugly.

1. Make sure that the origin starts including change records for the tables in the set.
2. Have the slon (copy_set) wait until any transactions on the origin that started prior to the ENABLE_SUBSCRIPTION are committed. Slony does this today as part of the copy_set logic.
3. Get/remember the snapshot visibility information for the COPY's transaction.
4. When we start to process change records, filter out records for transactions that were already visible to the COPY.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit awkward/ugly. This isn't an issue today because we are already using the transaction visibility information when selecting from sl_log, so it works; but above I had proposed stripping the xid from the logical change records.
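
For illustration, step 4 could look roughly like the snippet below if the change records did still carry the origin xid. txid_current_snapshot() and txid_visible_in_snapshot() already exist; everything else is a placeholder, and in practice slon would parse the snapshot once and do the comparison locally rather than round-tripping per record:

#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

/*
 * copy_snap is the text form of the snapshot captured with
 * "SELECT txid_current_snapshot()" inside the COPY's transaction.
 */
static bool
change_already_copied(PGconn *conn, const char *copy_snap, long long xid)
{
    char        sql[256];
    PGresult   *res;
    bool        visible;

    snprintf(sql, sizeof(sql),
             "SELECT txid_visible_in_snapshot(%lld, '%s')", xid, copy_snap);
    res = PQexec(conn, sql);
    visible = (strcmp(PQgetvalue(res, 0, 0), "t") == 0);
    PQclear(res);

    return visible;     /* true => the initial COPY already saw this transaction */
}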



Cascading Replication
=================
A-->B--->C

The slon for B will insert records from A into B's tables. This insert will generate WAL records on B. The slon for C should be able to pull the data it needs (both sl_event entries with ev_origin=A, and user table data originating on A) from B's logical replication stream. I don't see any issues here nor do I see a need to 'cache' the data in an sl_log type of table on B.


Reshaping Replication
=================

In Slony, replication is reshaped by two types of events: a MOVE SET and a FAILOVER.

Move Set:
A replication set might be subscribed in a cascaded fashion like
A--->B--->C

When a MOVE SET is issued, node A stops accepting new write transactions for tables in the set, and a MOVE_SET(1,A,B) event is put into sl_event on node A. Node B receives the MOVE_SET command in the proper order, after it has processed the last SYNC generated while A was still accepting write transactions to those tables. When node B processes the MOVE_SET event it starts accepting write transactions on the tables, and it also generates an ACCEPT_SET event. Node C will then receive the MOVE_SET (ev_origin=A) and ACCEPT_SET (ev_origin=B) commands (after all SYNC events from A with data changes to the set) and then knows that it should start receiving data for those tables from B.

I don't see any of this changing with logical replication acting as the data source.

FAILOVER:
---------------
A---->B
|    .
v  .
C

Today with slony, if B is a valid failover target then it is a forwarding node of the set. This means that B keeps a record in sl_log of any changes originating on A until B knows that node C has received those changes. In the event of a failover, if node C is far behind, it can just get the missing data from sl_log on node B (the failover target/new origin).

I see a problem with what I have discussed above: B won't explicitly store the data from A in sl_log; a cascaded node would depend on B's WAL stream. The problem is that at FAILOVER time, B might have processed some changes from A. Node C might also be processing node B's WAL stream for events (or data from another set). Node C will discard/not receive the data for A's tables since it isn't subscribed to those tables from B. What happens then if at some later point B and C receive the FAILOVER event? What does node C do? It can't get the missing data from node A because node A has failed, and it can't get it from node B because node C has already processed the WAL changes from node B that included the data but ignored/discarded them. Maybe node C could reprocess older WAL from node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
==========================
I think so; the incomplete areas I see are the ones mentioned in the patch submission, including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those, even though we'd need them before I could build a production system.

Conclusions
=============
I like this design much better than the original design from the spring, which would have required keeping a catalog proxy on the decoding machine. Based on what I've seen, it should be possible to make slony use logical replication as a source for events instead of triggers populating sl_log. My thinking is that we want a way for log receiver programs to pass arguments/parameters to the output plugins. Beyond that, this looks like something slony can use.



