On Wed, Jan 6, 2021 at 5:18 PM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Tue, Jan 05, 2021 at 04:24:21PM +0530, Amit Kapila wrote:
> > There are already tests [1] in one of the upcoming patches for logical
> > decoding of 2PC which covers this code using which I have found this
> > problem. So, I thought those would be sufficient. I have not checked
> > the feasibility of using test_decoding because I thought adding more
> > using test_decoding will unnecessarily duplicate the tests.
>
> Hmm. This stuff does not check after replication origins even if it
> stresses 2PC, so that looks incomplete when seen from here.
>

I think it does. Let me try to explain in a bit more detail.

Internally, the apply worker uses replication origins to track the
progress of apply; see the code near ApplyWorkerMain->replorigin_create.
We store the progress (WAL LSN) for each commit (prepared) / rollback
prepared with this origin. If the server crashes and restarts, we use
the origin's LSN as the point to start decoding from (the subscriber
sends the last LSN to the publisher). The bug here is that, after a
restart, the origin was not advanced for rollback prepared, which I have
fixed with this patch.

Now, let us see how the tests I mentioned cover this code. In the first
test (check that 2PC gets replicated to subscriber then ROLLBACK
PREPARED), we run the following on the publisher and wait for it to be
applied on the subscriber:

    BEGIN;
    INSERT INTO tab_full VALUES (12);
    PREPARE TRANSACTION 'test_prepared_tab_full';
    ROLLBACK PREPARED 'test_prepared_tab_full';

Note that during apply we would have WAL-logged the LSN
(replication_origin_lsn) corresponding to ROLLBACK PREPARED on the
subscriber.

In the second test (check that ROLLBACK PREPARED is decoded properly on
crash restart (publisher and subscriber crash)), we prepare a
transaction and crash the server. After the restart, because we have not
advanced the replication origin during recovery of Rollback Prepared,
the subscriber does not consider that transaction to have been applied,
so it requests that transaction again.

Strictly speaking, we don't need the second test to reproduce this exact
problem. Restarting after the first test would have reproduced it too,
but I was consistently getting the problem with the tests as currently
written. However, we can change it slightly to restart after the first
test if we want.

--
With Regards,
Amit Kapila.
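
To illustrate the failure mode described above, here is a toy model in Python. It is not PostgreSQL code: the Subscriber class, its methods, the flag name and the LSN values 100 and 200 are all made up for illustration, and it assumes, as a simplification, that origin progress after a crash is rebuilt purely by replaying the subscriber's own WAL records.

```python
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    """Toy stand-in for an apply worker; every name here is hypothetical."""
    replay_advances_on_rollback_prepared: bool
    origin_lsn: int = 0                      # in-memory origin progress
    wal: list = field(default_factory=list)  # (kind, remote_lsn) logged during apply

    def apply(self, kind, remote_lsn):
        # During normal apply the remote LSN is WAL-logged and the origin advances.
        self.wal.append((kind, remote_lsn))
        self.origin_lsn = remote_lsn

    def crash_and_recover(self):
        # Assumption: progress is rebuilt only from replaying the local WAL.
        self.origin_lsn = 0
        for kind, remote_lsn in self.wal:
            if (kind == "ROLLBACK PREPARED"
                    and not self.replay_advances_on_rollback_prepared):
                continue  # the bug: origin is not advanced for this record
            self.origin_lsn = max(self.origin_lsn, remote_lsn)


def resent_after_restart(fixed):
    """Return the publisher records the subscriber asks for again after a crash."""
    publisher = [("PREPARE", 100), ("ROLLBACK PREPARED", 200)]
    sub = Subscriber(replay_advances_on_rollback_prepared=fixed)
    for kind, lsn in publisher:
        sub.apply(kind, lsn)
    sub.crash_and_recover()
    # The subscriber sends its origin LSN; everything after it is sent again.
    return [rec for rec in publisher if rec[1] > sub.origin_lsn]
```

With fixed=False the model returns the ROLLBACK PREPARED record, meaning it would be requested a second time; with fixed=True it returns an empty list.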