On Tuesday, September 10, 2024 7:25 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com>
> wrote:
> >
> > ---
> > ISSUES and SOLUTION
> > ---
> >
> > To detect update_deleted conflicts, we need to search for dead tuples
> > in the table. However, dead tuples can be removed by VACUUM at any
> > time. Therefore, to ensure consistent and accurate conflict detection,
> > tuples deleted by other origins must not be removed by VACUUM before
> > the conflict detection process. If the tuples are removed prematurely,
> > it might lead to incorrect conflict identification and resolution,
> > causing data divergence between nodes.
> >
> > Here is an example of how VACUUM could affect conflict detection and
> > how to prevent this issue. Assume we have a bidirectional cluster with
> > two nodes, A and B.
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1);
> >   T2: DELETE FROM t WHERE id = 1;
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1;
> >
> > To retain the deleted tuples, the initial idea was that once
> > transaction T2 had been applied to both nodes, there was no longer a
> > need to preserve the dead tuple on Node A. However, a scenario arises
> > where transactions T3 and T2 occur concurrently, with T3 committing
> > slightly earlier than T2. In this case, if Node B applies T2 and Node
> > A removes the dead tuple (1,1) via VACUUM, and then Node A applies T3
> > after the VACUUM operation, it can only result in an update_missing
> > conflict. Given that the default resolution for update_missing
> > conflicts is apply_or_skip (i.e. convert the update to an insert if
> > possible and apply the insert), Node A will eventually hold a row
> > (1,2) while Node B becomes empty, causing data inconsistency.
> >
> > Therefore, the strategy needs to be expanded as follows: Node A cannot
> > remove the dead tuple until:
> > (a) The DELETE operation is replayed on all remote nodes, *AND*
> > (b) The transactions on logical standbys occurring before the replay
> > of Node A's DELETE are replayed on Node A as well.
> >
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table, and to introduce a new logical standby feedback message. This
> > message reports a WAL position that has been replayed on the logical
> > standby, *AND* before which the changes occurring on the logical
> > standby have also been replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of a logical slot is unused during logical replication, so I think
> > it's safe and won't cause side effects to reuse the xmin for this
> > feature.
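> >
> > As a rough sketch of the walsender side (the handler name and the
> > helper that maps the reported LSN to an xmin are placeholders, not
> > actual patch code):
> >
> >   /* Hypothetical handler for the new logical standby feedback message */
> >   static void
> >   ProcessLogicalStandbyFeedbackMessage(void)
> >   {
> >       /* LSN the logical standby reports per the two conditions above */
> >       XLogRecPtr    replayed_lsn = pq_getmsgint64(&reply_message);
> >       TransactionId candidate_xmin;
> >
> >       /*
> >        * Placeholder: derive the oldest transaction ID whose dead
> >        * tuples are still needed for conflict detection, given that
> >        * everything up to replayed_lsn is already applied remotely.
> >        */
> >       candidate_xmin = GetDeadTupleRetentionXmin(replayed_lsn);
> >
> >       /* Advance slot.xmin, mirroring how catalog_xmin is advanced */
> >       SpinLockAcquire(&MyReplicationSlot->mutex);
> >       if (!TransactionIdIsNormal(MyReplicationSlot->data.xmin) ||
> >           TransactionIdPrecedes(MyReplicationSlot->data.xmin,
> >                                 candidate_xmin))
> >           MyReplicationSlot->data.xmin = candidate_xmin;
> >       SpinLockRelease(&MyReplicationSlot->mutex);
> >
> >       /* Recompute the global xmin horizon so VACUUM honors it */
> >       ReplicationSlotsComputeRequiredXmin(false);
> >   }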
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send that LSN as a feedback
> > message to the walsender. -- I also thought of making this automatic,
> > e.g. letting the apply worker select the slots acquired by the
> > walsenders that connect to the same remote server (e.g. if the apply
> > worker's connection info, or some other flag, is the same as the
> > walsender's connection info). But it seems tricky: if some slots are
> > inactive, meaning their walsenders are not running, the apply worker
> > could not find the correct slots to check unless we save the host
> > along with the slot's persistent data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> >
>
> Don't you need to deal with versioning stuff for sending this new
> message? I mean, what if the receiver side of this message is old and
> doesn't support this new message?

Yes, I think we can avoid sending the new message if the remote server
version doesn't support handling this message (e.g. server_version < 18).
Will address this in the next version.
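
Roughly what I have in mind in the apply worker's feedback path (the
helper name below is made up, just to illustrate the gating):

    /*
     * Send the new feedback message only if the publisher is new enough
     * (v18+) to understand it; an older publisher would treat it as an
     * unexpected message.
     */
    if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 180000)
        send_feedback_slot_lsn();    /* hypothetical helper */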

>
> One minor comment on 0003
> =======================
> 1.
> get_slot_confirmed_flush()
> {
> ...
> + /*
> +  * To prevent concurrent slot dropping and creation while filtering the
> +  * slots, take the ReplicationSlotControlLock outside of the loop.
> +  */
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +
> + foreach_ptr(String, name, MySubscription->feedback_slots)
> + {
> +     XLogRecPtr confirmed_flush;
> +     ReplicationSlot *slot;
> +
> +     slot = ValidateAndGetFeedbackSlot(strVal(name));
>
> Why do we need to validate slots each time here? Isn't it better to do
> it once?

I think it's possible that the slot was correct initially but is changed
or dropped later, so it could be useful to give a warning in this case to
hint the user to adjust the slots; otherwise, the xmin of the publisher's
slot won't be advanced and dead tuples might accumulate. This is similar
to the checks we perform for the slots in "synchronized_standby_slots"
(e.g. StandbySlotsHaveCaughtup).
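
Something like the following is the shape I was thinking of (a sketch
only; the exact checks and message wording are not final):

    /* Look up a feedback slot, warning instead of erroring if unusable */
    static ReplicationSlot *
    ValidateAndGetFeedbackSlot(const char *name)
    {
        /* Caller already holds ReplicationSlotControlLock */
        ReplicationSlot *slot = SearchNamedReplicationSlot(name, false);

        if (slot == NULL)
            ereport(WARNING,
                    errmsg("replication slot \"%s\" specified in feedback_slots does not exist",
                           name),
                    errhint("Create the slot or adjust feedback_slots, otherwise the publisher's slot.xmin cannot be advanced."));
        else if (SlotIsPhysical(slot))
            ereport(WARNING,
                    errmsg("cannot use physical replication slot \"%s\" in feedback_slots",
                           name));

        return slot;
    }

This way a misconfigured or dropped slot shows up in the logs each time
the feedback LSN is computed, instead of silently blocking the
advancement of the publisher's slot.xmin.

Best Regards,
Hou zj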