On 12-11-18 11:07 AM, Andres Freund wrote:
Hi Steve!


I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intricacies of feedback messages and such. Where that would
live? No idea.

libpglogicalrep.so?

I wholeheartedly agree. It should also be cleaned up a fair bit before
others copy it should we not go for having some client side library.

Imo the library could very roughly be something like:

state = SetupStreamingLLog(replication-slot, ...);
while ((message = StreamingLLogNextMessage(state)) != NULL)
{
      write(outfd, message->data, message->length);
      if (received_100_messages)
      {
           fsync(outfd);
           StreamingLLogConfirm(message);
      }
}

Although I guess that's not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.


How about we pass a timeout value to StreamingLLogNextMessage(...), so it returns when no data is available after the timeout, giving the caller a chance to do something else?
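Something like the following, perhaps. This is only a sketch of the consumption loop under that assumption: the StreamingLLog* names follow the sketch above, but the timeout parameter, the message struct, and the stub stream state are all illustrative, not a real API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed message and stream-state types; the StreamingLLog* names follow
 * the sketch above, but the timeout parameter and these structs are
 * illustrative only. */
typedef struct { const char *data; size_t length; } LLogMessage;
typedef struct { LLogMessage *msgs; int nmsgs; int pos; } LLogState;

/* Stub: returns the next message, or NULL when nothing arrives within
 * timeout_ms (here we just pretend the timeout expired once drained). */
static LLogMessage *
StreamingLLogNextMessage(LLogState *state, int timeout_ms)
{
    (void) timeout_ms;          /* a real implementation would poll/select */
    if (state->pos >= state->nmsgs)
        return NULL;
    return &state->msgs[state->pos++];
}

/* Consume messages, doing periodic housekeeping when a poll times out. */
static int
consume_stream(LLogState *state)
{
    int consumed = 0;
    int idle_rounds = 0;

    while (idle_rounds < 3)     /* give up after a few empty polls */
    {
        LLogMessage *msg = StreamingLLogNextMessage(state, 1000);

        if (msg == NULL)
        {
            idle_rounds++;      /* timed out: caller can do other work here */
            continue;
        }
        idle_rounds = 0;
        consumed++;             /* real code: write(outfd, msg->data, ...) */
    }
    return consumed;
}
```

The point is just that a timeout turns the blocking call into a poll, so the caller can interleave fsync/feedback/shutdown checks without threads.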

This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a PostgreSQL text array of column=value pairs, i.e.
[{id=1},{name='steve'},{project='slony'}]
It seems to me that that makes escaping unnecessarily complicated, but
given you already have all the code... ;)

When I look at the actual code, the representation we picked is closer to {column1,value1,column2,value2,...}
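For illustration, a minimal sketch of walking that flattened representation, assuming the array has already been parsed into C strings (real code would first have to parse the PostgreSQL array literal, including its escaping rules; format_pairs and the ";"-joined output format are made up for the example):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats the flattened element list into buf as "col=val;col=val;".
 * argc/argv stand in for an already-parsed text array.  Returns the
 * number of column/value pairs emitted. */
static int
format_pairs(int argc, const char **argv, char *buf, size_t buflen)
{
    int npairs = 0;
    size_t used = 0;

    buf[0] = '\0';
    for (int i = 0; i + 1 < argc; i += 2)   /* even: column, odd: value */
    {
        int n = snprintf(buf + used, buflen - used, "%s=%s;",
                         argv[i], argv[i + 1]);
        if (n < 0 || (size_t) n >= buflen - used)
            break;                          /* out of buffer space */
        used += (size_t) n;
        npairs++;
    }
    return npairs;
}
```

One upside of the flattened form is visible here: consumers pair elements positionally and never have to re-split a "column=value" string, so values containing '=' need no extra escaping.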



I don't think our output plugin will be much more complicated than the
test_decoding plugin.
Good. That's the idea ;). Are you ok with the interface as it is now or
would you like to change something?

I'm going to think about this some more and maybe try to write an example plugin before I can say anything with confidence.


Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in wal record. I think you actually reviewed that one ;)

That was exactly aimed at something like this...

Since then my thoughts about what the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid
   - never visible outside of wal/system catalogs
- externally visible it gets
   - assigned an uuid
   - optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
   - SET change_origin_uuid = 'uuid';
   - SET change_origin_name = 'user-settable-name';
   - defaults to the local node
- decoding callbacks get passed the origin of a change
   - txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins,
   so the others don't even get the ReorderBuffer treatment
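That last point might look very roughly like this inside an output plugin. Everything here is an illustrative assumption, not the actual decoding API: OriginId, DecodedTxn, and PluginState are stand-ins, and a real init callback would populate the interesting-origins array from configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the internal origin id and a decoded
 * transaction; not the actual decoding structures. */
typedef uint32_t OriginId;
typedef struct { OriginId origin; } DecodedTxn;

/* Per-plugin state: the init callback would fill in the origins this
 * consumer cares about. */
typedef struct { const OriginId *interesting; int ninteresting; } PluginState;

/* True if a transaction from this origin should be decoded at all;
 * everything else could skip the ReorderBuffer treatment entirely. */
static bool
origin_is_interesting(const PluginState *state, const DecodedTxn *txn)
{
    for (int i = 0; i < state->ninteresting; i++)
        if (state->interesting[i] == txn->origin)
            return true;
    return false;
}
```

Filtering this early is what makes the scheme cheap: uninteresting origins are dropped before any reordering or output work happens.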

I have to thank the discussion on -hackers and a march through prague
with Marko here...
So would the uuid and optional name assignment be done in the output plugin or somewhere else? When/how does the uuid get generated, and where do we store it so the same uuid gets returned when postgres restarts? Slony today stores all this type of stuff in user-level tables and user-level functions (because it has no other choice). What is the connection between these values and the 'slot-id' in your proposal for the init arguments? Does the slot-id need to be the external uuid of the other end, or is there no direct connection?

Today slony allows us to replicate between two databases in the same PostgreSQL cluster (I use this for testing all the time). Slony also allows for two different 'slony clusters' to be set up in the same database (or so I'm told; I don't think I have ever tried this myself).

Plugin functions that let me query the local database and then return the uuid and origin_name would work in this model.

+1 on being able to mark the 'change origin' in a SET command when the replication process is pushing data into the replica.

Exactly how we do this filtering is an open question. I think the output
plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on. This is easy to
figure out if the output plugin is able/allowed to query its database. Will
this be possible? I would expect to be able to query the database as it
exists now (at plugin invocation time), not as it existed in the past when the
WAL was generated. In addition to the node ID I can see us wanting to be
able to query other slony tables (sl_table, sl_set, etc...)
Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason it's not cheap is that you basically need to tear down the
postgres internal caches if you switch the timestream in which you are
working.

Would go something like:

TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory */
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();

Why do you need to be able to query the present? I thought it might be
necessary to allow additional tables to be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?

For example my output plugin would want the list of replicated tables (or the list of tables replicated to a particular replica). This list can change over time. As administrators issue commands to add or remove tables to replication or otherwise reshape the cluster the output plugin will need to know about this. I MIGHT be able to get away with having slon disconnect and reconnect on reconfiguration events so only the init() call would need this data, but I am not sure.

One of the ways slony allows you to shoot your foot off is by changing certain configuration things (like dropping a table from a set) while a subscription is in progress. Being able to timetravel the slony configuration tables might make this type of foot-gun a lot harder to encounter but that might be asking for too much.




b) What the slony node id is of the node we are streaming to.   It would be
nice if we could pass extra, arbitrary data/parameters to the output plugins
that could include that, or other things.  At the moment the
start_logical_replication rule in repl_gram.y doesn't allow for that but I
don't see why we couldn't make it do so.
Yes, I think we want something like that. I even asked input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.ga5...@awork2.anarazel.de

Input welcome!

How flexible will the datatypes for the arguments be? If I wanted to pass in a list of tables (i.e. an array?) could I? Above I talked about having the init() or change() methods query the local database. Another option might be to make the slon build up this data (by querying the database over a normal psql connection) and just pass the data in. However, that might mean passing in a list of a few thousand table names, which doesn't sound like a good idea.


Even though, from a data-correctness point of view, slony could commit the
transaction on the replica after it sees the t1 commit, we won't want it to
do commits other than on a SYNC boundary.  This means that the replicas will
continue to move between consistent SYNC snapshots and that we can still
track the state/progress of replication by knowing what events (SYNC or
otherwise) have been confirmed.
I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...

A lot of this has to do with the stuff I discuss in the section below on cluster reshaping that you didn't understand. Slony depends on knowing what data has, or hasn't, been sent to a replica at a particular event id. If 'some' transactions in between two SYNC events have committed but not others, then slony has no idea what data it needs to get elsewhere on a FAILOVER type event. There might be a way to make this work otherwise, but I'm not sure what that is or how long it will take to debug out the issues.

Cool! Don't hesitate to mention anything that you think would make your life easier; chances are that you're not the only one who could benefit from it... Thanks, Andres





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
