On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync never yet got to
SYNCDONE state (maybe crashed) then finish_sync_worker may not be
called.
So I think a rogue tablesync slot might still exist during DropSubscription.

>
> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.
(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far. All TAP tests pass, but more testing needed
for the origin stuff

>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far tested like this - I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row to cause a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and re-launches. Then I can remove the extra row so
the PK violation does not happen, so the (re-launched) tablesync can
complete and finish normally. The Apply worker then takes over.

I have attached some captured/annotated logging of my test scenario
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on)

---
Kind Regards,
Peter Smith.
Fujitsu Australia.
TEST CASE:

§  Insert some row to subscriber side only
§  Start the SUBSCRIPTION – to start tablesync worker
§  Let tablesync do the copy (so mode will be COPYDONE)
§  In debugger hold the tablesync worker at the CATCHUP point
§  Then issue INSERT (from publisher) to cause PK violation with same row that 
was previous inserted as subscriber side
§  Tablesync worker should ERROR when handling this duplicate insert
§  Then a *new* tablesync worker may be launched for same table
§  And this newtablesync worker should detect the state was at COPYDONE, and so 
this time it should bypass the initial copy phase. 
§  In debugger hold the (re-launched) tablesync worker at the CATCHUP point
§  Delete the subscriber row which was causeing the PK viloation
§  Let the tablesync proceed. Now it should be ableto handle replaciation and 
finish OK.
§  Then Apply worker takes over...

LOG FRAGMENTS BELOW
===================

##
## Start with empty tables
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "TRUNCATE test_tab;"
TRUNCATE TABLE
[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "TRUNCATE test_tab;"
TRUNCATE TABLE

##
## Poulate 1 row data on publisher.
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1

##
## Add a row to subscription side only (this is to cause a subsequent PK 
violation)
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "INSERT INTO test_tab 
VALUES(2, 'bar');"
INSERT 0 1

##
## Start SUBSCRIPTION
## Attach to tablesync
## Get to CATCHUP mode
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub WITH (two_phase = on);"
2020-12-18 22:54:04.280 AEDT [18765] LOG:  logical decoding found consistent 
point at 0/1653608
2020-12-18 22:54:04.280 AEDT [18765] DETAIL:  There are no running transactions.
2020-12-18 22:54:04.280 AEDT [18765] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2020-12-18 22:54:04.289 AEDT [18766] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2020-12-18 22:54:04.289 AEDT [18766] LOG:  !!>> The apply worker process has 
PID = 18766
[postgres@CentOS7-x64 ~]$ 2020-12-18 22:54:04.296 AEDT [18771] LOG:  starting 
logical decoding for slot "tap_sub"
2020-12-18 22:54:04.296 AEDT [18771] DETAIL:  Streaming transactions committing 
after 0/1653640, reading WAL from 0/1653608.
2020-12-18 22:54:04.296 AEDT [18771] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', two_phase 'on', publication_names 
'"tap_pub"')
2020-12-18 22:54:04.296 AEDT [18771] LOG:  logical decoding found consistent 
point at 0/1653608
2020-12-18 22:54:04.296 AEDT [18771] DETAIL:  There are no running transactions.
2020-12-18 22:54:04.296 AEDT [18771] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', two_phase 'on', publication_names 
'"tap_pub"')
2020-12-18 22:54:04.296 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:04.296 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:04.300 AEDT [18774] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2020-12-18 22:54:04.300 AEDT [18774] LOG:  !!>> The tablesync worker process 
has PID = 18774
2020-12-18 22:54:04.300 AEDT [18774] LOG:  !!>>
        
        
        Sleeping 30 secs. For debugging, attach to process 18774 now!
        
        
        
2020-12-18 22:54:05.311 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:05.311 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:06.312 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:06.313 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:07.314 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:07.314 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:08.315 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:08.315 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:09.317 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:09.317 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:10.319 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:10.319 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:11.320 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:11.321 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:12.323 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:12.323 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:13.325 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:13.325 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:14.331 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:14.332 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:14.378 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:14.378 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:15.382 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:15.382 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:16.384 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:16.384 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:17.386 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:17.386 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:18.388 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:18.388 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:18.656 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:18.657 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:19.658 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:19.658 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:20.662 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:20.662 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:21.666 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:21.666 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:22.677 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:22.677 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:23.680 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:23.680 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:24.680 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:24.680 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:25.686 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:25.687 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:26.688 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:26.688 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:27.689 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:27.689 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:28.694 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:28.694 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:29.696 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:29.696 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:30.700 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:30.700 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:31.703 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:31.703 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:32.712 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:32.712 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:33.715 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:33.715 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:34.307 AEDT [18774] LOG:  !!>> tablesync worker: About to call 
LogicalRepSyncTableStart to do initial syncing
2020-12-18 22:54:34.718 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:34.718 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:35.723 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:35.723 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:36.727 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:36.727 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:37.731 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:37.731 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:38.735 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:38.735 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:39.739 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:39.739 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:40.745 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:40.745 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:41.746 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:41.746 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:42.749 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:42.749 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:43.753 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:43.753 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:44.754 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:44.755 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:45.759 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:45.759 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:46.761 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:46.761 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:47.763 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:47.763 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:48.669 AEDT [18774] LOG:  !!>> LogicalRepSyncTableStart: 
walrcv_create_slot for "tap_sub_65608_sync_16385".
2020-12-18 22:54:48.684 AEDT [20106] LOG:  logical decoding found consistent 
point at 0/1653728
2020-12-18 22:54:48.684 AEDT [20106] DETAIL:  There are no running transactions.
2020-12-18 22:54:48.684 AEDT [20106] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub_65608_sync_16385" LOGICAL pgoutput USE_SNAPSHOT
2020-12-18 22:54:48.690 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:48.691 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 22:54:48.701 AEDT [18774] LOG:  !!>> LogicalRepSyncTableStart: 
create replication origin tracking "pg_65608_16385".
2020-12-18 22:54:48.702 AEDT [18774] LOG:  !!>> LogicalRepSyncTableStart: setup 
replication origin tracking "pg_65608_16385".
2020-12-18 22:54:48.702 AEDT [18774] LOG:  !!>> tablesync worker: wait for 
CATCHUP state notification
2020-12-18 22:54:48.702 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 22:54:48.702 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables


## COPYDONE state?
## Origin tracking names?

psql -d test_sub -p 54321 -c "select * from pg_subscription_rel;"
 srsubid | srrelid | srsubstate | srsublsn 
---------+---------+------------+----------
   65608 |   16385 | C          | 
(1 row)

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "select * from 
pg_replication_origin;"
 roident |     roname     
---------+----------------
       1 | pg_65608
       2 | pg_65608_16385
(2 rows)

## Copied data?

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "SELECT * FROM test_tab;"
 a |  b  |               c               |  d  
---+-----+-------------------------------+-----
 2 | bar | 2020-12-18 22:53:26.772818+11 | 999
 1 | foo | 2020-12-18 22:52:56.26031+11  | 999
(2 rows)

## debugging (in CATCHUP mode)

(gdb) p *MyLogicalRepWorker
$1 = {launch_time = 661607644296792, in_use = true, generation = 2, proc = 
0x7ff727e95180, 
  dbid = 16384, userid = 10, subid = 65608, relid = 16385, relstate = 99 'c', 
  relstate_lsn = 23410528, relmutex = 0 '\000', last_lsn = 0, last_send_time = 
661607644297610, 
  last_recv_time = 661607644297610, reply_lsn = 0, reply_time = 661607644297610}


## Now try to replicate a row which will cause PK violation

2020-12-18 23:01:32.897 AEDT [18771] CONTEXT:  slot "tap_sub", output plugin 
"pgoutput", in the commit callback, associated LSN 0/16538C0
2020-12-18 23:01:32.897 AEDT [18771] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', two_phase 'on', publication_names 
'"tap_pub"')
INSERT 0 1

##
## Let tablesync try to cope with that
##

[postgres@CentOS7-x64 ~]$ 2020-12-18 23:02:01.396 AEDT [18774] LOG:  !!>> 
tablesync worker: received CATCHUP state notification
2020-12-18 23:02:04.490 AEDT [18774] LOG:  !!>> tablesync worker: Returned from 
LogicalRepSyncTableStart
2020-12-18 23:02:07.522 AEDT [20106] LOG:  starting logical decoding for slot 
"tap_sub_65608_sync_16385"
2020-12-18 23:02:07.522 AEDT [20106] DETAIL:  Streaming transactions committing 
after 0/1653760, reading WAL from 0/1653728.
2020-12-18 23:02:07.522 AEDT [20106] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:02:07.522 AEDT [20106] LOG:  logical decoding found consistent 
point at 0/1653728
2020-12-18 23:02:07.522 AEDT [20106] DETAIL:  There are no running transactions.
2020-12-18 23:02:07.522 AEDT [20106] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:02:07.523 AEDT [20106] LOG:  !!>>
        
        pgoutput_begin_txn
        
        
2020-12-18 23:02:07.523 AEDT [20106] CONTEXT:  slot "tap_sub_65608_sync_16385", 
output plugin "pgoutput", in the begin callback, associated LSN 0/1653760
2020-12-18 23:02:07.523 AEDT [20106] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:02:07.523 AEDT [20106] LOG:  !!>>
        
        pgoutput_commit_txn
        
        
2020-12-18 23:02:07.523 AEDT [20106] CONTEXT:  slot "tap_sub_65608_sync_16385", 
output plugin "pgoutput", in the commit callback, associated LSN 0/16538C0
2020-12-18 23:02:07.523 AEDT [20106] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:02:09.681 AEDT [18774] LOG:  !!>> tablesync worker: 
LogicalRepApplyLoop
2020-12-18 23:02:11.457 AEDT [18774] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'B'
2020-12-18 23:02:13.482 AEDT [18774] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'R'
2020-12-18 23:02:15.450 AEDT [18774] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'I'
2020-12-18 23:02:17.309 AEDT [18774] LOG:  !!>> tablesync worker: 
should_apply_changes_for_rel: true
        rel->state == READY: false
        rel->state == SYNCDONE: false
        rel->statelsn <= remote_final_lsn: true
2020-12-18 23:02:17.309 AEDT [18774] ERROR:  duplicate key value violates 
unique constraint "test_tab_pkey"
2020-12-18 23:02:17.309 AEDT [18774] DETAIL:  Key (a)=(2) already exists.

## Above is the PK violation
## The apply worker proceeded but decided it should not yet do anything 
(because tablesync not finished?)

2020-12-18 23:02:17.311 AEDT [27204] LOG:  background worker "logical 
replication worker" (PID 18774) exited with exit code 1
2020-12-18 23:02:17.311 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:17.311 AEDT [18766] LOG:  !!>> apply worker: apply_dispatch 
for message kind 'B'
2020-12-18 23:02:17.311 AEDT [18766] LOG:  !!>> apply worker: apply_dispatch 
for message kind 'R'
2020-12-18 23:02:17.311 AEDT [18766] LOG:  !!>> apply worker: apply_dispatch 
for message kind 'I'
2020-12-18 23:02:17.312 AEDT [18766] LOG:  !!>> apply worker: 
should_apply_changes_for_rel: false
        rel->state == READY: false
        rel->state == SYNCDONE: false
        rel->statelsn <= remote_final_lsn: true
2020-12-18 23:02:17.312 AEDT [18766] LOG:  !!>> apply worker: apply_dispatch 
for message kind 'C'
2020-12-18 23:02:17.312 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables

## The replacement tablesyn relaunches
## It skips the COPY since already done
## It reused same origin tracking that was previously created

2020-12-18 23:02:17.316 AEDT [32050] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2020-12-18 23:02:17.316 AEDT [32050] LOG:  !!>> The tablesync worker process 
has PID = 32050
2020-12-18 23:02:17.316 AEDT [32050] LOG:  !!>>
        
        
        Sleeping 30 secs. For debugging, attach to process 32050 now!
        
        
        
2020-12-18 23:02:17.323 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:18.325 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:18.325 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:19.326 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:19.326 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:20.329 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:20.329 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
...
2020-12-18 23:02:44.417 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:45.419 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:45.419 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:46.420 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:46.420 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:47.332 AEDT [32050] LOG:  !!>> tablesync worker: About to call 
LogicalRepSyncTableStart to do initial syncing
2020-12-18 23:02:47.422 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:47.422 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:47.422 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:47.422 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:48.423 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:48.423 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:02:48.773 AEDT [32050] LOG:  !!>> LogicalRepSyncTableStart: 
tablesync relstate was SUBREL_STATE_COPYDONE.
2020-12-18 23:02:48.773 AEDT [32050] LOG:  !!>> LogicalRepSyncTableStart: setup 
replication origin tracking "pg_65608_16385".
2020-12-18 23:02:48.773 AEDT [32050] LOG:  !!>> tablesync worker: wait for 
CATCHUP state notification
2020-12-18 23:02:48.773 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:02:48.773 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables


## Here we are at the 2nd tablesync at CATCHUP mode

## Delete the subscriber row so this replication wont cause PK violation next 
time thru

psql -d test_sub -p 54321 -c "DELETE FROM test_tab WHERE a = 2";
DELETE 1

## And let 2nd tablesync keep going

[postgres@CentOS7-x64 ~]$ 2020-12-18 23:08:48.580 AEDT [32050] LOG:  !!>> 
tablesync worker: received CATCHUP state notification
2020-12-18 23:08:51.638 AEDT [32050] LOG:  !!>> tablesync worker: Returned from 
LogicalRepSyncTableStart
2020-12-18 23:09:03.006 AEDT [395] LOG:  starting logical decoding for slot 
"tap_sub_65608_sync_16385"
2020-12-18 23:09:03.006 AEDT [395] DETAIL:  Streaming transactions committing 
after 0/1653760, reading WAL from 0/1653728.
2020-12-18 23:09:03.006 AEDT [395] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:09:03.006 AEDT [395] LOG:  logical decoding found consistent 
point at 0/1653728
2020-12-18 23:09:03.006 AEDT [395] DETAIL:  There are no running transactions.
2020-12-18 23:09:03.006 AEDT [395] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:09:03.006 AEDT [395] LOG:  !!>>
        
        pgoutput_begin_txn
        
        
2020-12-18 23:09:03.006 AEDT [395] CONTEXT:  slot "tap_sub_65608_sync_16385", 
output plugin "pgoutput", in the begin callback, associated LSN 0/1653760
2020-12-18 23:09:03.006 AEDT [395] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')
2020-12-18 23:09:03.008 AEDT [395] LOG:  !!>>
        
        pgoutput_commit_txn
        
        
2020-12-18 23:09:03.008 AEDT [395] CONTEXT:  slot "tap_sub_65608_sync_16385", 
output plugin "pgoutput", in the commit callback, associated LSN 0/16538C0
2020-12-18 23:09:03.008 AEDT [395] STATEMENT:  START_REPLICATION SLOT 
"tap_sub_65608_sync_16385" LOGICAL 0/0 (proto_version '2', two_phase 'on', 
publication_names '"tap_pub"')

## This time tablesync succeeds the inserts
## And cleanly finishes

2020-12-18 23:09:05.077 AEDT [32050] LOG:  !!>> tablesync worker: 
LogicalRepApplyLoop
2020-12-18 23:09:08.085 AEDT [32050] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'B'
2020-12-18 23:09:11.228 AEDT [32050] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'R'
2020-12-18 23:09:12.868 AEDT [32050] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'I'
2020-12-18 23:09:15.189 AEDT [32050] LOG:  !!>> tablesync worker: 
should_apply_changes_for_rel: true
        rel->state == READY: false
        rel->state == SYNCDONE: false
        rel->statelsn <= remote_final_lsn: true
2020-12-18 23:09:29.383 AEDT [32050] LOG:  !!>> tablesync worker: 
apply_dispatch for message kind 'C'
2020-12-18 23:09:40.517 AEDT [32050] LOG:  !!>> tablesync worker: called 
process_syncing_tables
2020-12-18 23:10:31.043 AEDT [32050] LOG:  !!>> tablesync worker: called 
process_syncing_tables
2020-12-18 23:10:39.233 AEDT [32050] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has finished
2020-12-18 23:10:39.233 AEDT [32050] LOG:  !!>> finish_sync_worker: dropping 
the tablesync slot "tap_sub_65608_sync_16385".
2020-12-18 23:10:39.240 AEDT [32050] LOG:  dropped replication slot 
"tap_sub_65608_sync_16385" on publisher

## Appy worker takes over again, and removes the origin tracking from the (now 
completed) tablesync

2020-12-18 23:10:39.240 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:39.240 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:39.240 AEDT [18766] LOG:  !!>> apply worker: find tablesync 
origin tracking for "pg_65608_16385".
2020-12-18 23:10:39.240 AEDT [18766] LOG:  !!>> apply worker: dropping 
tablesync origin tracking for "pg_65608_16385".
2020-12-18 23:10:39.241 AEDT [18766] LOG:  !!>> apply worker: dropped tablesync 
origin tracking for "pg_65608_16385".
2020-12-18 23:10:39.241 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:39.241 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:40.243 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:40.243 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:41.244 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:41.244 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:42.246 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
...
2020-12-18 23:10:52.281 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:52.281 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:53.284 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:53.284 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
2020-12-18 23:10:54.287 AEDT [18766] LOG:  !!>> apply worker: 
LogicalRepApplyLoop
2020-12-18 23:10:54.287 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables
psql -d test_sub -p 54321 -c "DROP SUBSCRIPTION tap_sub;"2020-12-18 
23:10:55.294 AEDT [18766] LOG:  !!>> apply worker: LogicalRepApplyLoop
2020-12-18 23:10:55.294 AEDT [18766] LOG:  !!>> apply worker: called 
process_syncing_tables

2020-12-18 23:10:55.896 AEDT [18766] FATAL:  terminating logical replication 
worker due to administrator command
2020-12-18 23:10:55.897 AEDT [27204] LOG:  background worker "logical 
replication worker" (PID 18766) exited with exit code 1
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION


[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "select * from 
pg_replication_origin;"
 roident | roname 
---------+--------
(0 rows)


Reply via email to