On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne <cbbro...@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> Btw, what do you mean with "conflating" the stream? I don't really
>>> see that being proposed.
>>
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream. I think that's a bad design. Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
>
> Yeah, that seems worth elaborating on.
>
> What has been said several times is that it's pretty necessary to
> capture the logical changes into WAL. That seems pretty needful, in
> order that the replication data gets fsync()ed avidly, and so that we
> don't add in the race condition of needing to fsync() something *else*
> almost exactly as avidly as is the case for WAL today.
Check.

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied. On a
> "physical streaming" replica, any logical data that gets captured will
> be useless. And on a "logical replica," the "physical" bits of WAL
> will be useless.
>
> What I *want* you to mean is that there would be:
> a) WAL readers that pull the "physical bits", and
> b) WAL readers that just pull "logical bits."
>
> I expect it would be fine to have a tool that pulls LCRs out of WAL to
> prepare that to be sent to remote locations. Is that what you have in
> mind?

Yes. I think it should be possible to generate LCRs from WAL, but I
think that the on-the-wire format for LCRs should be different from
the WAL format. Trying to use the same format for both things seems
like an unpleasant straitjacket. This discussion illustrates why:
we're talking about consuming scarce bit-space in WAL records for a
feature that only a tiny minority of users will use, and it's still
not really enough bit space. That stinks. If LCR transmission is a
separate protocol, this problem can be engineered away at a higher
level.

Suppose we have three servers, A, B, and C, that are doing
multi-master replication in a loop: A sends LCRs to B, B sends them to
C, and C sends them back to A. Obviously, we need to make sure that
each server applies each set of changes just once, but it suffices to
have enough information in WAL to distinguish between replication
transactions and non-replication transactions - that is, one bit.

So suppose a change is made on server A. A generates LCRs from WAL,
and tags each LCR with node_id = A. It then sends those LCRs to B. B
applies them, flagging the apply transaction in WAL as a replication
transaction, AND ALSO sends the LCRs to C. The LCR generator on B
sees the WAL from apply, but because it's flagged as a replication
transaction, it does not generate LCRs.
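To make the one-bit scheme concrete, here is a toy sketch - not
PostgreSQL code, all names invented for illustration - of a three-node
ring where each WAL record carries only an is-replication flag. The
LCR generator skips replication transactions, so a change originating
on A is applied exactly once everywhere and never re-emitted as a
fresh change:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.wal = []       # (change, is_replication_txn) pairs
        self.applied = []
        self.peer = None    # next node in the ring

    def local_change(self, change):
        # A locally originated transaction: flagged non-replication.
        self.wal.append((change, False))
        self.applied.append(change)

    def generate_lcrs(self):
        # The LCR generator reads WAL but emits nothing for
        # transactions flagged as replication applies -- the one bit
        # that breaks the cycle.
        return [(change, self.name) for change, is_repl in self.wal
                if not is_repl]

    def receive(self, change, origin):
        if origin == self.name:
            return  # the LCR's node_id says it came full circle; stop
        # Apply, flagging the transaction in WAL as replication,
        # and forward the LCR onward around the ring.
        self.wal.append((change, True))
        self.applied.append(change)
        self.peer.receive(change, origin)

# Wire up the ring A -> B -> C -> A.
a, b, c = Node("A"), Node("B"), Node("C")
a.peer, b.peer, c.peer = b, c, a

a.local_change("INSERT ...")
for change, origin in a.generate_lcrs():
    a.peer.receive(change, origin)

print([n.applied for n in (a, b, c)])
# -> [['INSERT ...'], ['INSERT ...'], ['INSERT ...']]
print(b.generate_lcrs())  # -> []  (apply txns generate no LCRs)
```

Note that the single node_id travels with the LCR, not in WAL; WAL
only records the one replication-transaction bit.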
So C receives LCRs from B just once, without any need for the node_id
to be known in WAL. C can now also apply those LCRs (again flagging
the apply transaction as replication) and it can also skip sending
them to A, because it sees that they originated at A.

Now suppose we have a more complex topology. Suppose we have a cluster
of four servers A .. D which, for improved tolerance against network
outages, are all connected pairwise. Normally all the links are up, so
each server sends all the LCRs it generates directly to all other
servers. But how do we prevent cycles? A generates a change and sends
it to B, C, and D. B then sees that the change came from A, so it
sends it to C and D. C, receiving that change, sees that it came from
A via B, so it sends it to D again, whereupon D, which got it from C
and knows that the origin is A, sends it to B, who will then send it
right back over to D. Obviously, we've got an infinite loop here, so
this topology will not work.

However, there are several obvious ways to patch it by changing the
LCR protocol. Most obviously, and somewhat stupidly, you could add a
TTL. A bit smarter, you could have each LCR carry a LIST of node_ids
that it had already visited, rather than a single node_id, refusing to
send it to any node it had already been to. Smarter still, you could
send handshaking messages around the cluster so that each node can
build up a spanning tree and prefix each LCR it sends with the list of
additional nodes to which the recipient must deliver it. So, normally,
A would send a message to each of B, C, and D destined only for that
node; but if the A-C link went down, A would choose either B or D and
send each LCR to that node destined for that node *and C*; then, the
receiving node would forward the message on to C.
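The "list of node_ids" fix can be sketched in a few lines - again a
toy model, not PostgreSQL code, with invented names. Each LCR carries
the set of nodes it has already visited, so it is never sent back to
one of them and the flood through the pairwise-connected A..D cluster
terminates; per-node dedup (the role the replication-transaction flag
plays in WAL) ensures each change is still applied only once even when
redundant copies arrive:

```python
NODES = {"A", "B", "C", "D"}          # fully connected, pairwise
applied = {n: [] for n in NODES}
seen = {n: set() for n in NODES}      # per-node dedup of change ids

def receive(node, change_id, change, visited):
    if change_id in seen[node]:
        return                        # duplicate copy; drop it
    seen[node].add(change_id)
    applied[node].append(change)      # apply exactly once
    visited = visited | {node}        # grow the LCR's visited list
    for peer in NODES - visited:      # never forward to a visited node
        receive(peer, change_id, change, visited)

# A change originates on A and floods the cluster.
receive("A", 1, "UPDATE ...", set())
print({n: len(applied[n]) for n in sorted(NODES)})
# -> {'A': 1, 'B': 1, 'C': 1, 'D': 1}
```

Because visited only ever grows, the recursion is guaranteed to
terminate, which is exactly what the bare single-node_id scheme cannot
guarantee in this topology.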
Or perhaps you think this is too complex and not worth supporting
anyway, and that might be true, but the point is that if you insist
that all of the identifying information must be carried in WAL, you've
pretty much ruled it out, because we are not going to put TTL fields,
or lists of node IDs, or lists of destinations, in WAL. But there is
no reason they can't be attached to LCRs, which is where they are
actually needed.

> Or are you feeling that the "logical bits" shouldn't get
> captured in WAL altogether, so we need to fsync() them into a
> different stream of files?

No, that would be ungood.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers