Hello,
I started taking a brief look at the v2 patch, and it does appear to work for
the basic case. The logical slot is synchronized to the standby, and I can
connect to the promoted standby and stream changes afterwards.
It's not clear to me what the correct behavior is when a logical slot that has
been synced to the replica is then deleted on the writer. Would we expect the
deletion to be propagated, or leave it up to the end-user to manage?
> + rawname = pstrdup(standby_slot_names);
> + SplitIdentifierString(rawname, ',', &namelist);
> +
> + while (true)
> + {
> + int wait_slots_remaining;
> + XLogRecPtr oldest_flush_pos = InvalidXLogRecPtr;
> + int rc;
> +
> + wait_slots_remaining = list_length(namelist);
> +
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> + for (int i = 0; i < max_replication_slots; i++)
> + {
Even though standby_slot_names is PGC_SIGHUP, we never reload/re-process the
value while waiting here. If there is a wrong entry in it, the backend stays
stuck until we re-establish the logical connection. Including
"postmaster/interrupt.h" and handling ConfigReloadPending with
ProcessConfigFile does seem to work.
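To illustrate the pattern I mean, here is a minimal standalone sketch. It is
not the patch code: config_reload_pending, reload_config() and wait_for_slot()
are hypothetical stand-ins for ConfigReloadPending, ProcessConfigFile(PGC_SIGHUP)
and the wait loop, and the "reload" simply copies a value passed in by the
caller. The point is only that the flag is checked on every iteration of the
wait loop, so a corrected setting lets the loop finish.

```c
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Stand-in for ConfigReloadPending; set from the SIGHUP handler. */
static volatile sig_atomic_t config_reload_pending = 0;

/* Stand-in for the standby_slot_names setting, initially a wrong entry. */
static char slot_names_setting[64] = "wrong_slot";

/* Hypothetical stand-in for ProcessConfigFile(PGC_SIGHUP). */
static void
reload_config(const char *new_value)
{
    strncpy(slot_names_setting, new_value, sizeof(slot_names_setting) - 1);
    slot_names_setting[sizeof(slot_names_setting) - 1] = '\0';
}

/* Signal handler: only records that a reload was requested. */
static void
sighup_handler(int signo)
{
    (void) signo;
    config_reload_pending = 1;
}

/*
 * Simulated wait loop. Returns the iteration on which the configured slot
 * name matched existing_slot, or -1 if it never matched within
 * max_iterations. pending_value is what a reload would read from the
 * (imaginary) configuration file.
 */
static int
wait_for_slot(const char *existing_slot, const char *pending_value,
              int max_iterations)
{
    assert(max_iterations > 0);

    for (int iter = 1; iter <= max_iterations; iter++)
    {
        /* Re-process the setting if a reload was requested. */
        if (config_reload_pending)
        {
            config_reload_pending = 0;
            reload_config(pending_value);
        }

        if (strcmp(slot_names_setting, existing_slot) == 0)
            return iter;

        /* The real loop would sleep on a latch here before retrying. */
    }
    return -1;
}
```

Without a pending reload, wait_for_slot("standby1", "standby1", 5) never
matches; once sighup_handler() has run, the next call picks up the corrected
value on its first iteration.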
Another thing I noticed is that once it starts waiting in this block, Ctrl+C
doesn't seem to terminate the backend?
pg_recvlogical -d postgres -p 5432 --slot regression_slot --start -f -
..
^Cpg_recvlogical: error: unexpected termination of replication stream:
The logical backend connection is still present:
ps aux | grep 51263
hsuchen 51263 80.7 0.0 320180 14304 ? Rs 01:11 3:04 postgres: walsender hsuchen [local] START_REPLICATION
pstack 51263
#0 0x00007ffee99e79a5 in clock_gettime ()
#1 0x00007f8705e88246 in clock_gettime () from /lib64/libc.so.6
#2 0x000000000075f141 in WaitEventSetWait ()
#3 0x000000000075f565 in WaitLatch ()
#4 0x0000000000720aea in ReorderBufferProcessTXN ()
#5 0x00000000007142a6 in DecodeXactOp ()
#6 0x000000000071460f in LogicalDecodingProcessRecord ()
It can be terminated with pg_terminate_backend() though.
If we have a physical slot named foo on the standby, and a logical slot is
then created on the writer with the same slot_name, it does error out on the
replica. This also prevents other slots from being synchronized, though that
is probably fine.
2021-12-16 02:10:29.709 UTC [73788] LOG: replication slot synchronization worker for database "postgres" has started
2021-12-16 02:10:29.713 UTC [73788] ERROR: cannot use physical replication slot for logical decoding
2021-12-16 02:10:29.714 UTC [73037] DEBUG: unregistering background worker "replication slot synchronization worker"
On 12/14/21, 2:26 PM, "Peter Eisentraut" <[email protected]>
wrote:
On 28.11.21 07:52, Bharath Rupireddy wrote:
> 1) Instead of a new LIST_SLOT command, can't we use
> READ_REPLICATION_SLOT (slight modifications needs to be done to make
> it support logical replication slots and to get more information from
> the subscriber).
I looked at that but didn't see an obvious way to consolidate them.
This is something we could look at again later.
> 2) How frequently the new bg worker is going to sync the slot info?
> How can it ensure that the latest information exists say when the
> subscriber is down/crashed before it picks up the latest slot
> information?
The interval is currently hardcoded, but could be a configuration
setting. In the v2 patch, there is a new setting that orders physical
replication before logical so that the logical subscribers cannot get
ahead of the physical standby.
> 3) Instead of the subscriber pulling the slot info, why can't the
> publisher (via the walsender or a new bg worker maybe?) push the
> latest slot info? I'm not sure we want to add more functionality to
> the walsender, if yes, isn't it going to be much simpler?
This sounds like the failover slot feature, which was rejected.