On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > Yes, I observed this same behavior.
> >
> > IIUC the only way for the tablesync worker to go from CATCHUP mode
> > to SYNCDONE is via the call to process_sync_tables.
> >
> > But a side-effect of this is that when messages arrive during this
> > CATCHUP phase, one tx will be handled by the tablesync worker
> > before process_sync_tables() is ever encountered.
> >
> > I have created and attached a simple patch which allows the
> > tablesync worker to detect if there is anything to do *before* it
> > enters the apply main loop. Calling process_sync_tables() before
> > the apply main loop offers a quick way out, so the message handling
> > will not be split unnecessarily between the workers.
>
> Yeah, this demonstrates the idea can work, but as mentioned in my
> previous email [1] this needs much more work to make the COPY and the
> later fetching of changes from the publisher consistent. So, let me
> summarize the discussion so far.
>
> We wanted to enhance the tablesync phase of Logical Replication to
> enable decoding of prepared transactions [2]. The problem was that
> when we stream prepared transactions in the tablesync worker, it will
> simply commit them, due to the requirement of maintaining a single
> transaction for the entire duration of the copy and the streaming of
> transactions afterward. We can't simply disable the decoding of
> prepared xacts for tablesync workers because that can skip some of
> the prepared xacts forever on the subscriber, as explained in one of
> the emails above [3]. Now, while investigating solutions to enhance
> tablesync to support decoding at prepare time, I found that due to
> the current design of tablesync we can see partial data of
> transactions on subscribers, which is also explained in the email
> above with an example [4].
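Just to illustrate the "check before the apply loop" idea with a toy
model (the names below are invented for illustration, not the actual
PostgreSQL functions): without the early check, the first transaction
arriving after COPY gets applied by the tablesync worker even though it
is already caught up.

```python
# Toy model (not PostgreSQL source): shows why checking for sync
# completion *before* entering the apply main loop avoids splitting
# incoming transactions between the tablesync and apply workers.

def caught_up():
    # assume the tablesync worker is already at the target LSN
    return True

def tablesync_worker(pending_txns, check_before_loop):
    """Return the transactions the tablesync worker ends up applying
    after COPY, before it reaches SYNCDONE."""
    applied = []
    if check_before_loop and caught_up():
        return applied          # quick exit: apply worker handles everything
    for txn in pending_txns:
        applied.append(txn)     # one txn handled before the catchup check
        if caught_up():         # i.e. process_sync_tables() -> SYNCDONE
            break
    return applied

print(tablesync_worker(["txn1", "txn2"], check_before_loop=False))  # ['txn1']
print(tablesync_worker(["txn1", "txn2"], check_before_loop=True))   # []
```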
> This problem of visibility has been there since Logical Replication
> was introduced in PostgreSQL, and the only answer I have got till now
> is that there doesn't seem to be any other alternative, which I think
> is not true, and I have provided one alternative as well.
>
> Next, we have discussed three different solutions, all of which will
> solve the first problem (allow the tablesync worker to decode
> transactions at prepare time) and one of which solves both the first
> and the second problem (partial transaction data visibility).
>
> Solution-1: Allow the tablesync worker to use multiple transactions.
> The reason for doing it in a single transaction today is that if,
> after the initial COPY, we commit and then crash while streaming
> changes of other transactions, the state of the table won't be known
> after the restart: as we are using a temporary slot, we don't know
> from where to restart syncing the table.
>
> IIUC, we primarily need to do two things to achieve multiple
> transactions. One is to have an additional state in the catalog (say
> "catchup") which will say that the initial copy is done. Then we need
> a permanent slot with which we can track the progress of the sync, so
> that after a restart (due to crash, connection break, etc.) we can
> start from the appropriate position. Now, this will allow us to do
> less work after recovering from a crash because we will know the
> restart point. As Craig mentioned, it also allows the sync slot to
> advance, freeing any held upstream resources before the whole sync is
> done, which is good if the upstream is busy and generating lots of
> WAL. Finally, committing as we go means we won't exceed the cid
> increment limit in a single txn.
>
> Solution-2: The next solution we discussed is to make the tablesync
> worker just gather messages after COPY, in a similar way to how the
> current streaming of in-progress transactions feature gathers
> messages into a "changes" file, so that they can be replayed later by
> the apply worker.
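The Solution-2 gather-then-replay idea can be modeled like this (a toy
sketch with invented names, not the real spool-file format or code):
the tablesync worker durably writes each decoded transaction to a file
instead of applying it, and the apply worker replays the file later.

```python
# Toy model of Solution-2: spool decoded transactions to a "changes"
# file after COPY, so no single long-lived transaction is needed in
# the tablesync worker; the apply worker replays the file afterward.
import json
import os
import tempfile

spool_path = os.path.join(tempfile.mkdtemp(), "rel_16384.changes")

def tablesync_spool(decoded_txns):
    # durably write each decoded transaction instead of applying it
    with open(spool_path, "w") as f:
        for txn in decoded_txns:
            f.write(json.dumps(txn) + "\n")
        f.flush()
        os.fsync(f.fileno())

def apply_replay():
    # later, the apply worker replays the spooled transactions in order
    with open(spool_path) as f:
        return [json.loads(line) for line in f]

tablesync_spool([{"xid": 501, "op": "INSERT"}, {"xid": 502, "op": "PREPARE"}])
print(apply_replay())
```

The two downsides mentioned above are visible even in the toy model:
every transaction pays the cost of a durable write, and nothing is
applied until the replay pass runs.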
> Now, here, as we don't need to replay the individual transactions in
> the tablesync worker in a single transaction, it will allow us to
> send decoded prepared transactions to the subscriber. This has some
> disadvantages: each transaction processed by the tablesync worker
> needs to be durably written to a file, and it can also lead to some
> apply lag later when the same transactions are processed by the apply
> worker.
>
> Solution-3: Allow the tablesync workers to just perform the initial
> COPY, and then once the COPY is done for all relations the apply
> worker will stream all the future changes. Now, surely if large
> copies are required for multiple relations then we would delay the
> replay of transactions by the apply worker a bit, but I don't know
> how much that matters as compared to the transaction visibility
> issue, and anyway we would have achieved the maximum parallelism by
> allowing copy via multiple workers. This would reduce the load (both
> CPU and I/O) on the publisher side by decoding the data only once,
> instead of once for each tablesync worker and separately for the
> apply worker. I think it will use fewer resources to finish the work.
>
> Currently, in the tablesync worker, we create a slot with the
> CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the
> publisher, and then we use the same snapshot for the COPY from the
> publisher. After this, when we try to receive the data from the
> publisher using the same slot, it will be in sync with the COPY. I
> think that to keep the same consistency between the COPY and the data
> we receive from the publisher in this approach, we need to export the
> snapshot while creating a slot in the apply worker by using
> CRS_EXPORT_SNAPSHOT, and then use the same snapshot in all the
> tablesync workers doing the copy. In the tablesync workers, we can
> use the SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY
> ISOLATION LEVEL REPEATABLE READ" to use the exported snapshot.
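A toy model of why the shared exported snapshot keeps things
consistent (invented names, not the real slot machinery): the snapshot
and the slot's start point are taken atomically, so every COPY sees
exactly the data up to that point, and the stream covers exactly what
comes after it.

```python
# Toy model of Solution-3's snapshot sharing: one snapshot exported at
# slot creation is used by all tablesync COPYs, so the copied data
# lines up exactly with the change stream the apply worker receives.

publisher = [("r1", 1), ("r2", 1)]          # committed rows at slot creation

def create_slot_export_snapshot():
    # stands in for creating a slot with an exported snapshot: the
    # snapshot and the slot's start point are captured together
    return {"snapshot": list(publisher), "start_pos": len(publisher)}

slot = create_slot_export_snapshot()
publisher.append(("r1", 2))                 # concurrent change on the publisher

# each tablesync worker COPYs its table under the same exported snapshot
copy_r1 = [r for r in slot["snapshot"] if r[0] == "r1"]
copy_r2 = [r for r in slot["snapshot"] if r[0] == "r2"]

# the apply worker streams only changes after the slot's start point,
# so nothing is missed and nothing is applied twice
streamed = publisher[slot["start_pos"]:]
print(copy_r1, copy_r2, streamed)
```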
> That way the COPY will use the same snapshot as is used for receiving
> the changes in the apply worker, and the data will be in sync.
>
> Then we also need a way to export a snapshot while the apply worker
> is already receiving the changes, because users can use 'ALTER
> SUBSCRIPTION name REFRESH PUBLICATION', which allows new tables to be
> synced. I think we need to introduce a new command in
> exec_replication_command() to export the snapshot from the existing
> slot, which can then be used by the new tablesync worker.
>
> Among the above three solutions, the first two will solve the first
> problem (allow the tablesync worker to decode transactions at prepare
> time) and the third solution will solve both the first and the second
> problem (partial transaction data visibility). The third solution
> requires quite some redesign of how the Logical Replication work is
> synchronized between the apply and tablesync workers, and might turn
> out to be a bigger implementation effort. I am tentatively thinking
> of going with the first or second solution at this stage, and if
> later people feel that we need some bigger redesign then we can go
> with something on the lines of Solution-3.
>
> Thoughts?
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/cahut+puemk4so8ogzxc_ftzpkga8uc-y5qi-krqhsy_p0i3...@mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com
>
> --
Hi Amit,

- Solution-3 has become too complicated to be attempted by me. Anyway,
we may be better off just focusing on eliminating the new problems
exposed by the 2PC work [1], rather than burning too much effort to fix
some other quirk which apparently has existed for years.

[1] https://www.postgresql.org/message-id/CAHut%2BPtm7E5Jj92tJWPtnnjbNjJN60_%3DaGGKYW3h23b7J%3DqeDg%40mail.gmail.com

- Solution-2 has some potential lag problems, and maybe file resource
problems as well. This idea did not get a very favourable response when
I first proposed it.

- This leaves Solution-1 as the best viable option to fix the currently
known 2PC trouble.

~~

So I will try to write a patch for the proposed Solution-1.

---
Kind Regards,
Peter Smith.
Fujitsu Australia