On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu <m.melihmu...@gmail.com> wrote:
>
> Hi,
>
> Please see attached patches for the below changes.
>
> Thanks for reviewing,
> --
> Melih Mutlu
> Microsoft

Hello Melih,

Thank you for making the changes. I have a few more comments:

1)
src/backend/replication/logical/worker.c: (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has started",
src/backend/replication/logical/worker.c: (errmsg("logical replication table synchronization worker for subscription \"%s\" has moved to sync table \"%s\".",
src/backend/replication/logical/tablesync.c: (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has finished",

In the above, can we have rep_slot_id in the trace messages as well?
Otherwise it is not clear which worker moved to the next relation.
We may have something like:

logical replication table synchronization worker_%d for subscription
\"%s\" has moved to sync table ..., rep_slot_id, ...

Overall we need to improve the tracing; I will give my suggestions on
this later (in detail).
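For example, something along these lines (just a rough sketch of what I
mean for the message in worker.c; I am assuming the patch keeps the
reused slot id in a field like MyLogicalRepWorker->rep_slot_id, the
exact field name may differ):

    /*
     * Sketch only: include the worker's slot id in the message so the
     * log shows which worker picked up which table.
     */
    ereport(LOG,
            (errmsg("logical replication table synchronization worker_%d for subscription \"%s\" has moved to sync table \"%s\".",
                    MyLogicalRepWorker->rep_slot_id,
                    MySubscription->name,
                    get_rel_name(MyLogicalRepWorker->relid))));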
2)
I found a crash in the previous patch (v9), but have not tested it on
the latest one yet. The crash happens when all the replication slots
are consumed and we try to create a new one. I tweaked the settings as
below so that it can be reproduced easily:

max_sync_workers_per_subscription = 3
max_replication_slots = 2

and then ran the test case shared by you. I think there is some memory
corruption happening. (I tested in debug mode, have not tried release
mode yet.)

I added some traces to identify the root cause. I observed that
worker_1 keeps moving from one table to another correctly, but at some
point it gets corrupted, i.e. the origin name obtained for it is wrong;
it tries to advance that origin, and since the origin does not exist it
hits an assert and then something else crashes.

From the log (the new trace lines added by me are prefixed with
"shveta-"; I also tweaked the code along the lines of my comment 1 to
have clarity on the worker id): from the traces below it is clear that
worker_1 was moving from one relation to another, always getting the
correct origin 'pg_16688_1', but at the end it got 'pg_16688_49', which
does not exist. The second part of the trace shows who updated
'pg_16688_49': it was worker_49, which did not even get a chance to
create this origin because max_replication_slots was reached.

==============================
..............
2023-02-01 16:01:38.041 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub", table "table_93" has finished
2023-02-01 16:01:38.047 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub" has moved to sync table "table_98".
2023-02-01 16:01:38.055 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 get-origin-id2 originid:2, originname:pg_16688_1
2023-02-01 16:01:38.055 IST [9243] LOG: shveta- LogicalRepSyncTableStart- Worker_1 reusing slot:pg_16688_sync_1_7195132648087016333, originid:2, originname:pg_16688_1
2023-02-01 16:01:38.094 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 updated-origin2 originname:pg_16688_1
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub", table "table_98" has finished
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub" has moved to sync table "table_60".
2023-02-01 16:01:38.099 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 get-origin originid:0, originname:pg_16688_49
2023-02-01 16:01:38.099 IST [9243] LOG: could not drop replication slot "pg_16688_sync_49_7195132648087016333" on publisher: ERROR: replication slot "pg_16688_sync_49_7195132648087016333" does not exist
2023-02-01 16:01:38.103 IST [9243] LOG: shveta- LogicalRepSyncTableStart- Worker_1 reusing slot:pg_16688_sync_1_7195132648087016333, originid:0, originname:pg_16688_49
TRAP: failed Assert("node != InvalidRepOriginId"), File: "origin.c", Line: 892, PID: 9243
postgres: logical replication worker for subscription 16688 sync 16384 (ExceptionalCondition+0xbb)[0x56019194d3b7]
postgres: logical replication worker for subscription 16688 sync 16384 (replorigin_advance+0x6d)[0x5601916b53d4]
postgres: logical replication worker for subscription 16688 sync 16384 (LogicalRepSyncTableStart+0xbb4)[0x5601916cb648]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x5d25e2)[0x5601916d35e2]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x5d282c)[0x5601916d382c]
postgres: logical replication worker for subscription 16688 sync 16384 (ApplyWorkerMain+0x17b)[0x5601916d4078]
postgres: logical replication worker for subscription 16688 sync 16384 (StartBackgroundWorker+0x248)[0x56019167f943]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x589ad3)[0x56019168aad3]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x589ee3)[0x56019168aee3]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x588d8d)[0x560191689d8d]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x584604)[0x560191685604]
postgres: logical replication worker for subscription 16688 sync 16384 (PostmasterMain+0x14f1)[0x560191684f1e]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x446e05)[0x560191547e05]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f048cc58d90]
==============================

'pg_16688_49' updated by worker_49:

2023-02-01 16:01:37.083 IST [9348] LOG: shveta- LogicalRepSyncTableStart- worker_49 get-origin originid:0, originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] LOG: shveta- LogicalRepSyncTableStart- worker_49 updated-origin originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] LOG: shveta- LogicalRepSyncTableStart- worker_49 get-origin-id2 originid:0, originname:pg_16688_49
2023-02-01 16:01:37.083 IST [9348] ERROR: could not create replication slot "pg_16688_sync_49_7195132648087016333": ERROR: all replication slots are in use
HINT: Free one or increase max_replication_slots.
==============================

The rest of the workers keep exiting and getting recreated since they
cannot create a slot. The last_used_id keeps increasing on every
restart on the subscriber side until we kill it; in my case it reached
2k+.

thanks
Shveta