On Fri, Nov 11, 2022 at 5:49 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > The other option might be to make these messages non-transactional, in > which case we'd separate the ordering from COMMIT ordering, evading the > reordering problem. > > That'd mean we'd ignore rollbacks (which seems fine), we could probably > optimize this by checking if the state actually changed, etc. But we'd > also need to deal with transactions created in the (still uncommitted) > transaction. But I'm also worried it might lead to the same issue with > non-transactional behaviors that forced revert in v15.
I think it might be a good idea to step back slightly from implementation details and try to agree on a theoretical model of what's happening here. Let's start by banishing the words transactional and non-transactional from the conversation and talk about what logical replication is trying to do. We can imagine that the replicated objects on the primary pass through a series of states S1, S2, ..., Sn, where n keeps going up as new state changes occur. The state, for our purposes here, is the contents of the database as they could be observed by a user running SELECT queries at some moment in time chosen by the user. For instance, if the initial state of the database is S1, and then the user executes BEGIN, 2 single-row INSERT statements, and a COMMIT, then S2 is the state that differs from S1 in that both of those rows are now part of the database contents. There is no state where one of those rows is visible and the other is not. That was never observable by the user, except from within the transaction as it was executing, which we can and should discount. I believe that the goal of logical replication is to bring about a state of affairs where the set of states observable on the standby is a subset of the states observable on the primary. That is, if the primary goes from S1 to S2 to S3, the standby can do the same thing, or it can go straight from S1 to S3 without ever making it possible for the user to observe S2. Either is correct behavior. But the standby cannot invent any new states that didn't occur on the primary. It can't decide to go from S1 to S1.5 to S2.5 to S3, or something like that. It can only consolidate changes that occurred separately on the primary, never split them up. Neither can it reorder them. Now, if you accept this as a reasonable definition of correctness, then the next question is what consequences it has for transactional and non-transactional behavior. If all behavior is transactional, then we've basically got to replay each primary transaction in a single standby transaction, and commit those transactions in the same order that the corresponding primary transactions committed. We could legally choose to merge a group of transactions that committed one after the other on the primary into a single transaction on the standby, and it might even be a good idea if they're all very tiny, but it's not required. But if there are non-transactional things happening, then there are changes that become visible at some time other than at a transaction commit. For example, consider this sequence of events, in which each "thing" that happens is transactional except where the contrary is noted: T1: BEGIN; T2: BEGIN; T1: Do thing 1; T2: Do thing 2; T1: Do a non-transactional thing; T1: Do thing 3; T2: Do thing 4; T2: COMMIT; T1: COMMIT; >From the point of the user here, there are 4 observable states here: S1: Initiate state. S2: State after the non-transactional thing happens. S3: State after T2 commits (reflects the non-transactional thing plus things 2 and 4). S4: State after T1 commits. Basically, the non-transactional thing behaves a whole lot like a separate transaction. That non-transactional operation ought to be replicated before T2, which ought to be replicated before T1. Maybe logical replication ought to treat it in exactly that way: as a separate operation that needs to be replicated after any earlier transactions that completed prior to the history shown here, but before T2 or T1. Alternatively, you can merge the non-transactional change into T2, i.e. the first transaction that committed after it happened. But you can't merge it into T1, even though it happened in T1. If you do that, then you're creating states on the standby that never existed on the primary, which is wrong. You could argue that this is just nitpicking: who cares if the change in the sequence value doesn't get replicated at exactly the right moment? But I don't think it's a technicality at all: I think if we don't make the operation appear to happen at the same point in the sequence as it became visible on the master, then there will be endless artifacts and corner cases to the bottom of which we will never get. Just like if we replicated the actual transactions out of order, chaos would ensue, because there can be logical dependencies between them, so too can there be logical dependencies between non-transactional operations, or between a non-transactional operation and a transactional operation. To make it more concrete, consider two sessions concurrently running this SQL: insert into t1 select nextval('s1') from generate_series(1,1000000) g; There are, in effect, 2000002 transaction-like things here. The sequence gets incremented 2 million times, and then there are 2 commits that each insert a million rows. Perhaps the actual order of events looks something like this: 1. nextval the sequence N times, where N >= 1 million 2. commit the first transaction, adding a million rows to t1 3. nextval the sequence 2 million - N times 4. commit the second transaction, adding another million rows to t1 Unless we replicate all of the nextval operations that occur in step 1 at the same time or prior to replicating the first transaction in step 2, we might end up making visible a state where the next value of the sequence is less than the highest value present in the table, which would be bad. With that perhaps overly-long set of preliminaries, I'm going to move on to talking about the implementation ideas which you mention. You write that "the current solution is based on WAL-logging state of all sequences incremented by the transaction at COMMIT" and then, it seems to me, go on to demonstrate that it's simply incorrect. In my opinion, the fundamental problem is that it doesn't look at the order that things happened on the primary and do them in the same order on the standby. Instead, it accepts that the non-transactional operations are going to be replicated at the wrong time, and then tries to patch around the issue by attempting to scrounge up the correct values at some convenient point and use that data to compensate for our failure to do the right thing at an earlier point. That doesn't seem like a satisfying solution, and I think it will be hard to make it fully correct. Your alternative proposal says "The other option might be to make these messages non-transactional, in which case we'd separate the ordering from COMMIT ordering, evading the reordering problem." But I don't think that avoids the reordering problem at all. Nor do I think it's correct. I don't think you *can* separate the ordering of these operations from the COMMIT ordering. They are, as I argue here, essentially mini-commits that only bump the sequence value, and they need to be replicated after the transactions that commit prior to the sequence value bump and before those that commit afterward. If they aren't handled that way, I don't think you're going to get fully correct behavior. I'm going to confess that I have no really specific idea how to implement that. I'm just not sufficiently familiar with this code. However, I suspect that the solution lies in changing things on the decoding side rather than in the WAL format. I feel like the information that we need in order to do the right thing must already be present in the WAL. If it weren't, then how could crash recovery work correctly, or physical replication? At any given moment, you can choose to promote a physical standby, and at that point the state you observe on the new primary had better be some state that existed on the primary at some point in its history. At any moment, you can unplug the primary, restart it, and run crash recovery, and if you do, you had better end up with some state that existed on the primary at some point shortly before the crash. I think that there are actually a few subtle inaccuracies in the last two sentences, because actually the order in which transactions become visible on a physical standby can differ from the order in which it happens on the primary, but I don't think that actually changes the picture much. The point is that the WAL is the definitive source of information about what happened and in what order it happened, and we use it in that way already in the context of physical replication, and of standbys. If logical decoding has a problem with some case that those systems handle correctly, the problem is with logical decoding, not the WAL format. In particular, I think it's likely that the "non-transactional messages" that you mention earlier don't get applied at the point in the commit sequence where they were found in the WAL. Not sure why exactly, but perhaps the point at which we're reading WAL runs ahead of the decoding per se, or something like that, and thus those non-transactional messages arrive too early relative to the commit ordering. Possibly that could be changed, and they could be buffered until earlier commits are replicated. Or else, when we see a WAL record for a non-transactional sequence operation, we could arrange to bundle that operation into an "adjacent" replicated transaction i.e. the transaction whose commit record occurs most nearly prior to, or most nearly after, the WAL record for the operation itself. Or else, we could create "virtual" transactions for such operations and make sure those get replayed at the right point in the commit sequence. Or else, I don't know, maybe something else. But I think the overall picture is that we need to approach the problem by replicating changes in WAL order, as a physical standby would do. Saying that a change is "nontransactional" doesn't mean that it's exempt from ordering requirements; rather, it means that that change has its own place in that ordering, distinct from the transaction in which it occurred. -- Robert Haas EDB: http://www.enterprisedb.com