On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote: > On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]> > wrote: > > > > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]> > wrote: > > > > > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> > wrote: > > > > > > > > Normally, the slotsync worker updates the standby slot using the > > > > primary's slot state. However, when confirmed_flush_lsn matches > > > > but restart_lsn does not, the worker does not actually update the > > > > standby slot. Despite that, the current code of > > > > update_local_synced_slot() appears to treat this situation as if > > > > an update occurred. As a result, the worker sleeps only for the > > > > minimum interval (200 ms) before retrying. In the next cycle, it > > > > again assumes an update happened, and continues looping with the > > > > short sleep interval, causing the repeated logical decoding log > > > > messages. > Based on a quick analysis, this seems to be the root cause. > > > > > > > > I think update_local_synced_slot() should return false (i.e., no > > > > update > > > > happened) when confirmed_flush_lsn is equal but restart_lsn > > > > differs between primary and standby. > > > > > > > > > > We expect that in such a case update_local_synced_slot() should > > > advance local_slot's 'restart_lsn' via > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the > > > cheap code path next time. Normally, restart_lsn advancement should > > > happen when we process XLOG_RUNNING_XACTS and call > > > SnapBuildProcessRunningXacts(). In this particular case as both > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have > > > not debugged the exact case yet but you can try by emitting some > > > more records on publisher, it should let the standby advance the > > > slot. It is possible that we can do something like you are proposing > > > to silence the LOG messages but we should know what is going on here. > > > > I agree it's important to investigate why restart_lsn doesn't advance > > in that case and fix that issue. > > > > Separately, I think mistreating the slot as updated is another problem: > > the local slot is treated as updated even when it isn't, which causes > > the slotsync worker to use the minimum sleep interval. We should > > address this independently. > > > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local > > slot, but update_local_synced_slot() currently assumes it does. That > > seems to be the root cause. The attached 0001 patch makes > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was > > actually updated, and updates update_local_synced_slot() to act based on > that result. > > > > I've also attached 0002, which reduces the log level of some logical > > decoding messages (e.g., "logical decoding found consistent point") > > from LOG to DEBUG1. These are low-level, developer-oriented messages > > and can be noisy when they occur regularly. This is the same issue I > reported earlier at [1]. > > > > How about applying these patches while we continue working on the > > restart_lsn issue? > > > > * > @@ -2174,7 +2193,10 @@ > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto, > > if (XLogRecPtrIsValid(ctx->reader->EndRecPtr)) > { > - LogicalConfirmReceivedLocation(moveto); > + bool slot_updated = LogicalConfirmReceivedLocation(moveto); > + > + if (updated_xmin_or_lsn) > + *updated_xmin_or_lsn = slot_updated; > > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while > processing running_xact record, so not sure we can rely only on the explicit > call LogicalConfirmReceivedLocation() above to ascertain the same.
Perhaps we could simply compare the slot's old and new LSN/xmin to determine whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook into each individual update path at the lower level. Attaching a patch for reference. > > Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to > fix this issue but I'll look at it after the feature freeze. The other patches attempt to advance restart_lsn in more cases by modifying the logical decoding logic. However, those scenarios are relatively rare and may not produce significant improvement outside the slot synchronization context. So, I think modifying the slot sync worker to correctly increase the polling interval instead is also a good starting point. Best Regards, Hou zj
v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch
Description: v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch
