Hi!

On 2013-01-24 13:27:00 -0500, Robert Haas wrote:
> On Thu, Jan 24, 2013 at 6:14 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> Before getting bogged down in technical commentary, let me say this
> very clearly: I am enormously grateful for your work on this project.
> Logical replication based on WAL decoding is a feature of enormous
> value that PostgreSQL has needed for a long time, and your work has
> made that look like an achievable goal. Furthermore, it seems to me
> that you have pursued the community process with all the vigor and
> sincerity for which anyone could ask. Serious design concerns were
> raised early in the process and you made radical changes to the design
> which I believe have improved it tremendously, and you've continued to
> display an outstanding attitude at every phase of this process about
> which I can't say enough good things.

Very much appreciated. Especially as I can echo your feeling of not
having only positive feelings about the process ;)

> Now, the bad news is, I don't think it's very reasonable to try to
> commit this to 9.3. I think it is just too much stuff too late in the
> cycle. I've reviewed some of the patches from time to time but there
> is a lot more stuff and it's big and complicated and it's not really
> clear that we have the interface quite right yet, even though I think
> it's also clear that we are a lot closer than we were. I don't
> want to be fixing that during beta, much less after release.

It pains me to admit that you have a point there. What I am afraid of,
though, is that it basically goes on like this over the next
commitfests:

* 9.4-CF1: no "serious" reviewer comments because everyone is busy
  doing release work
* 9.4-CF2: everyone is relieved that the release is over, and a bit
  tired
* 9.4-CF3: first deeper review; some more complex restructuring required
* 9.4-CF4: too many changes to commit

If you look at the development of the feature, after the first prototype
and the resulting design changes, nobody with decision power had more
than a cursory look at the proposed interfaces.
That's very, very, very understandable: you are all busy people, and the
patch & the interfaces are complex, so reviewing takes a noticeable
amount of time. But it unfortunately doesn't help in getting an
acceptable interface nailed down. The problem with that is not only that
it sucks huge amounts of energy out of me and others, but also that it's
very hard to really build the layers/users above changeset extraction
without being able to rely on the interface and semantics. So we never
get to the actual benefits :(, and we don't get the users people require
for the feature to be committed.

So far, the only really effective way of getting people to comment on
patches of this state & complexity is the threat of an upcoming commit
because of the last commitfest :( I honestly don't know how to go about
fixing this...

> > I tried very, very hard to get the basics of the design & interface
> > solid. Which obviously doesn't mean I am succeeding - luckily not
> > being superhuman after all ;). And I think that's very much where
> > input is desperately needed and where I failed to raise enough
> > attention. The "output plugin" interface followed by the walsender
> > interface is what needs to be most closely vetted. Those are the
> > permanent, user/developer-exposed UIs and the ones we should try to
> > keep as stable as possible.
> >
> > The output plugin callbacks are defined here:
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/include/replication/output_plugin.h;hb=xlog-decoding-rebasing-cf4
> > To make it more agnostic of the technology used to implement
> > changeset extraction, we should possibly replace the
> > ReorderBuffer(TXN|Change) structs being passed by something more
> > implementation-agnostic.
> >
> > walsender interface:
> > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/replication/repl_gram.y;hb=xlog-decoding-rebasing-cf4
> > The interesting new commands are:
> > 1) K_INIT_LOGICAL_REPLICATION NAME NAME
> > 2) K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options
> > 3) K_FREE_LOGICAL_REPLICATION NAME
> >
> > 1 & 3 allocate (respectively free) the permanent state associated
> > with one changeset consumer, whereas START_LOGICAL_REPLICATION
> > streams out changes starting at RECPTR.
>
> Forgive me for not having looked at the patch, but to what extent is
> all this, ah, documented?

There are several mails on -hackers where I ask for input on whether
that interface is what people want, but all the comments have been from
non-core pg people, although mildly favorable. I couldn't convince
myself to write real low-level documentation, instead of just the
example code I needed for testing anyway and some higher-level docs,
before I had input from that side. Perhaps that was a mistake.

So, here's a slightly less quick overview of the walsender interface:

Whenever a new replication consumer wants to stream data, we need to
make sure on the primary that the data can be provided without gaps,
even across disconnects, crashes et al. The permanent state associated
with this is currently called a "replication slot".

    $ psql "port=5440 dbname=postgres replication=1"
    postgres=# INIT_LOGICAL_REPLICATION 'bdr-whatever-1' 'test_decoding';
     replication_id | consistent_point | snapshot_name |    plugin
    ----------------+------------------+---------------+---------------
     bdr-whatever-1 | 0/3E8DFA08       | 000F54F1-1    | test_decoding
    (1 row)

So now we have allocated a permanent slot identified by the name
'bdr-whatever-1'. It has also automatically exported the snapshot
'000F54F1-1', which can be imported into another transaction, e.g. to
consistently dump an initial snapshot of the data.
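A sketch of what consuming that exported snapshot could look like in a
second session (the snapshot name is just the example value from above;
SET TRANSACTION SNAPSHOT exists since 9.2 and requires an open
REPEATABLE READ or SERIALIZABLE transaction):

```sql
-- hypothetical second session on the same server
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000F54F1-1';
-- Queries in this transaction now see exactly the state the slot's
-- consistent_point corresponds to, e.g. for a consistent initial copy:
-- COPY some_table TO STDOUT;   (some_table is a made-up name)
COMMIT;
```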
The information returned in the 'consistent_point' column tells us that
we will be able to return all data from that LSN onwards.

That replication slot can *only* be used for replicating changes out of
the database postgres and with the plugin 'test_decoding' (a contrib
module). The slot will persist across restarts and everything else until
somebody issues a FREE_LOGICAL_REPLICATION 'bdr-whatever-1'.

To start streaming out changes, the command

    postgres=# START_LOGICAL_REPLICATION 'bdr-whatever-1' 0/3E8DFA08;
    WARNING: Starting logical replication
    unexpected PQresultStatus: 8
    Time: 76.346 ms

is used. Unfortunately psql isn't a suitable consumer, as it cannot deal
with the unrequested COPY, but that's what we have pg_receivellog for:

    $ ~/.../pg_receivellog -p 5440 -d postgres --slot bdr-whatever-1 -f - --start -v
    pg_receivellog: starting log streaming at 0/0 (slot bdr-whatever-1)
    pg_receivellog: initiated streaming

which will start streaming out changes when we do:

    $ psql -h /tmp -p 5440 -U andres postgres
    postgres=# CREATE TABLE frak(id serial primary key, data int);
    CREATE TABLE
    postgres=# INSERT INTO frak (data) SELECT * FROM generate_series(1, 1);
    INSERT 0 1

Back in pg_receivellog:

    BEGIN 1004786
    table "frak_id_seq": INSERT: sequence_name[name]:frak_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
    COMMIT 1004786
    pg_receivellog: confirming flush up to 0/3E8F0F30 (slot bdr-whatever-1)
    BEGIN 1004787
    table "frak": INSERT: id[int4]:1 data[int4]:1
    COMMIT 1004787
    pg_receivellog: confirming flush up to 0/3E8FCDC0 (slot bdr-whatever-1)

Makes sense so far?
The actual output you see there, the

    BEGIN 1004787
    table "frak": INSERT: id[int4]:1 data[int4]:1
    COMMIT 1004787

bit, is generated by the test_decoding plugin referenced previously,
which implements functions like:

    extern void pg_decode_init(struct LogicalDecodingContext *ctx, bool is_init);
    extern bool pg_decode_begin_txn(struct LogicalDecodingContext *ctx,
                                    ReorderBufferTXN *txn);
    extern bool pg_decode_commit_txn(struct LogicalDecodingContext *ctx,
                                     ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
    extern bool pg_decode_change(struct LogicalDecodingContext *ctx,
                                 ReorderBufferTXN *txn, Oid tableoid,
                                 ReorderBufferChange *change);

And e.g. begin_txn looks like:

    bool
    pg_decode_begin_txn(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
    {
        TestDecodingData *data = ctx->output_plugin_private;

        ctx->prepare_write(ctx, txn->lsn, txn->xid);

        if (data->include_xids)
            appendStringInfo(ctx->out, "BEGIN %u", txn->xid);
        else
            appendStringInfoString(ctx->out, "BEGIN");

        ctx->write(ctx, txn->lsn, txn->xid);
        return true;
    }

As you see, it has somehow gathered options from somewhere. Those can be
specified as optional arguments to START_LOGICAL_REPLICATION.

> > Btw, there are currently *no* changes to the wal format at all if
> > wal_level < logical, except that xl_running_xacts records are logged
> > more frequently, which obviously could easily be made conditional.
> > Barring bugs of course.
> >
> > The changes with wal_level >= logical aren't that big either imo:
> > * heap_insert, heap_update prevent full page writes from removing
> >   their normal record by using a separate XLogRecData block for the
> >   buffer and the record
> > * heap_delete adds more data (the pkey of the tuple) after the
> >   unchanged xl_heap_delete struct
> > * On changes to catalog tables, (relfilenode, tid, cmin, cmax) are
> >   logged.
> >
> > No changes to mvcc for normal backends at all, unless you count the
> > very slightly changed *Satisfies interface (getting passed a
> > HeapTuple instead of HeapTupleHeader).
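To make the plugin options mentioned above concrete: given the grammar
rule K_START_LOGICAL_REPLICATION NAME RECPTR plugin_options, passing the
option the begin_txn callback reads would presumably look something like
the following. The exact option spelling and syntax here are my guess
from the grammar, not taken verbatim from the patch:

```sql
-- hypothetical: stream changes without transaction ids in the output
START_LOGICAL_REPLICATION 'bdr-whatever-1' 0/3E8DFA08 ('include_xids', 'false');
```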
> >
> > I am not sure what you're concerned about WRT the on-disk format of
> > the tuples? We are pretty much nailed down on that due to
> > pg_upgrade'ability anyway, and it could be changed from this patch's
> > POV without a problem; the output plugin just sees normal HeapTuples.
> > Or are you concerned about the code extracting them from the xlog
> > records?
>
> Mostly, my concern is that you've accidentally broken something, or
> that your code will turn out to be flaky in ways we can't now predict.

I really think a look or two from experienced enough people should make
the heapam parts safe enough. The changes basically are like:

     heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     	xl_heap_insert xlrec;
     	xl_heap_header xlhdr;
     	XLogRecPtr	recptr;
    -	XLogRecData rdata[3];
    +	XLogRecData rdata[4];
     	Page		page = BufferGetPage(buffer);
     	uint8		info = XLOG_HEAP_INSERT;

    +	/*
    +	 * For the logical replication case we need the tuple even if we're
    +	 * doing a full page write. We could alternatively store a pointer
    +	 * into the fpw though. For that to work we add another rdata entry
    +	 * for the buffer in that case.
    +	 */
    +	bool		need_tuple_data = wal_level >= WAL_LEVEL_LOGICAL &&
    +		RelationGetRelid(relation) >= FirstNormalObjectId;
    +
    +	/* For logical decoding we need combocids to properly decode the catalog */
    +	if (wal_level >= WAL_LEVEL_LOGICAL &&
    +		RelationGetRelid(relation) < FirstNormalObjectId)
    +		log_heap_new_cid(relation, heaptup);

     ...

     	rdata[1].data = (char *) &xlhdr;
     	rdata[1].len = SizeOfHeapHeader;
    -	rdata[1].buffer = buffer;
    +	rdata[1].buffer = need_tuple_data ? InvalidBuffer : buffer;
     	rdata[1].buffer_std = true;
     	rdata[1].next = &(rdata[2]);

     	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
     	rdata[2].data = (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits);
     	rdata[2].len = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
    -	rdata[2].buffer = buffer;
    +	rdata[2].buffer = need_tuple_data ? InvalidBuffer : buffer;
     	rdata[2].buffer_std = true;
     	rdata[2].next = NULL;

    +	/*
    +	 * Add a record for the buffer without actual content; that's the
    +	 * entry that gets removed if a fpw is done for that buffer.
    +	 */
    +	if (need_tuple_data)
    +	{
    +		rdata[2].next = &(rdata[3]);
    +
    +		rdata[3].data = NULL;
    +		rdata[3].len = 0;
    +		rdata[3].buffer = buffer;
    +		rdata[3].buffer_std = true;
    +		rdata[3].next = NULL;
    +	}

Both of the wal_level >= logical && XXX checks are by now nicely
encapsulated, but showing them raw like this better illustrates the
complexity of what's being done...

That's about all the changes that are done to heapam.c. Well, the same
is done for update, multi_insert, and delete as well, but...

> My only really specific concern at this point is about the special
> treatment of catalog tables. We've never done anything like that
> before, and it feels like a bad idea. In particular, the fact that
> you have to WAL-log new information about cmin/cmax really suggests
> that we're committing ourselves to the MVCC infrastructure in a way
> that we weren't previously.

It basically restores the pre-8.3 (?) state again, where cmin/cmax were
really stored, only that it does so temporarily instead of permanently
bloating the tables again. It imo pretty closely resembles what the
normal code is doing with combocids, just that the combocid in this case
is slightly more complex because it needs to be looked up over a longer
timeframe.

I thought about simply re-adding cmin/cmax storage for catalog tables,
with some trickery that's not that hard to do (store it similarly to
oids), but the impact of that would have been far, far greater.

And the decision to treat only some tables that way? Well, that's a
question of overhead. There simply is no need to do something like that
for tables that aren't required for converting a HeapTuple to the format
the output plugin wants. From my pov it's somewhat similar to the way we
log differently for temporary, persistent, and unlogged tables.
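The full-page-write trick the diff implements can be illustrated with a
tiny standalone model. To be clear: these are not the real PostgreSQL
structs or functions; the struct and both helpers below are simplified
stand-ins I made up purely for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an XLogRecData chain entry. */
typedef struct RData
{
    size_t          len;        /* payload bytes carried by this entry */
    int             buffer;     /* associated buffer, 0 = none */
    struct RData   *next;
} RData;

/*
 * Build a simplified version of the chain heap_insert builds: entry 0 is
 * the fixed record, entries 1 and 2 carry the tuple header and body.
 * With need_tuple_data, entries 1 and 2 lose their buffer association
 * and a fourth, payload-less entry keeps the buffer reference instead.
 */
static void
build_insert_chain(RData rd[4], int buffer, bool need_tuple_data)
{
    rd[0].len = 16; rd[0].buffer = 0;      rd[0].next = &rd[1];
    rd[1].len = 8;                         rd[1].next = &rd[2];
    rd[2].len = 64;                        rd[2].next = NULL;
    rd[1].buffer = need_tuple_data ? 0 : buffer;
    rd[2].buffer = need_tuple_data ? 0 : buffer;

    if (need_tuple_data)
    {
        /* data-less entry: the only thing a full-page write removes */
        rd[2].next = &rd[3];
        rd[3].len = 0; rd[3].buffer = buffer; rd[3].next = NULL;
    }
}

/*
 * Simulate full-page-write elision: entries tied to a buffer that got a
 * full-page image contribute no payload. Returns the payload bytes that
 * remain in the record.
 */
static size_t
payload_after_fpw(RData *rd, int fpw_buffer)
{
    size_t remaining = 0;

    for (; rd != NULL; rd = rd->next)
        if (rd->buffer != fpw_buffer)
            remaining += rd->len;
    return remaining;
}
```

With need_tuple_data set, a full-page write only drops the empty fourth
entry, so the tuple data survives in the record for decoding; without
it, the tuple payload is elided as before.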
> There's some category of stuff that our
> MVCC implementation didn't previously require us to persist on disk
> which, after this, it will. I don't understand exactly where the
> boundaries of that are in terms of future changes we might want to
> make - but I don't like moving the goalposts in that area.

I don't really see a problem there. If we ever decide to get rid of MVCC
in a fundamental manner, this will be the absolutely smallest problem of
it all. IMNSHO ;)

Andres

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers