Hi,

On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.
Thanks for looking at it! Yeah, and that could lead to no conflict being
detected anymore (like in the case [2] reported up-thread).

> FWIW, a test code something like [1], can help detect above race issues,
> right?

I think so, and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!

> Some comments on the patch:
>
> 1.
>          last_signaled_pid = active_pid;
> +        terminated = true;
>      }
>
> Why is a separate variable needed? Can't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?

Yeah, probably. I thought about it but preferred to add a new variable for
this purpose, for clarity and to avoid race conditions (in case future
patches "touch" last_signaled_pid somehow). This part of the code is
already not that easy to follow.

> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?

I'm not sure, as I never saw it, but it should not hurt to also cover this
"potential" case, so it's done in v2 attached.

> I'm not sure if it
> can happen at all, but I think we can rely on previous conflict reason
> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

I'm not sure what you mean here. I think we should still keep the
"initial" values, so that the next check done with them still makes sense.
Recording the previous conflict reason, as you're proposing, also makes
sense to me, but for another reason: to PANIC in case the issue still
happens (for cases we did not think about, i.e. ones not covered by the
recorded initial values).

> 3. Is there a way to reproduce this racy issue, may be by adding some
> sleeps or something? If yes, can you share it here, just for the
> records and verify the whatever fix provided actually works?

Alexander was able to reproduce it on a slow machine, and the issue was
not there anymore with v1 in place. I think it's tricky to reproduce, as
it would need the slot to advance between the two checks.
Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 4665945cba163bcb4beadc6391ee65c755e566d8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Thu, 11 Jan 2024 13:57:53 +0000
Subject: [PATCH v2] Fix race condition in InvalidatePossiblyObsoleteSlot()

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot,
which means the effective_xmin and effective_catalog_xmin could advance
during that time.

Holding the mutex longer is not an option (given what we'd have to do
while holding it), so instead record the values that were used during
the backend termination.
---
 src/backend/replication/slot.c | 37 ++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..f136916285 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	bool		terminated = false;
+	TransactionId initial_effective_xmin = InvalidTransactionId;
+	TransactionId initial_catalog_effective_xmin = InvalidTransactionId;
+	XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
+	ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1386,11 +1391,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		 */
 		if (s->data.invalidated == RS_INVAL_NONE)
 		{
+			if (!terminated)
+			{
+				initial_restart_lsn = s->data.restart_lsn;
+				initial_effective_xmin = s->effective_xmin;
+				initial_catalog_effective_xmin = s->effective_catalog_xmin;
+			}
+
 			switch (cause)
 			{
 				case RS_INVAL_WAL_REMOVED:
-					if (s->data.restart_lsn != InvalidXLogRecPtr &&
-						s->data.restart_lsn < oldestLSN)
+					if (initial_restart_lsn != InvalidXLogRecPtr &&
+						initial_restart_lsn < oldestLSN)
 						conflict = cause;
 					break;
 				case RS_INVAL_HORIZON:
@@ -1399,12 +1411,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 					/* invalid DB oid signals a shared relation */
 					if (dboid != InvalidOid && dboid != s->data.database)
 						break;
-					if (TransactionIdIsValid(s->effective_xmin) &&
-						TransactionIdPrecedesOrEquals(s->effective_xmin,
+					if (TransactionIdIsValid(initial_effective_xmin) &&
+						TransactionIdPrecedesOrEquals(initial_effective_xmin,
 													  snapshotConflictHorizon))
 						conflict = cause;
-					else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
-							 TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
+					else if (TransactionIdIsValid(initial_catalog_effective_xmin) &&
+							 TransactionIdPrecedesOrEquals(initial_catalog_effective_xmin,
 														   snapshotConflictHorizon))
 						conflict = cause;
 					break;
@@ -1417,6 +1429,17 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			}
 		}
 
+		/*
+		 * Check if the conflict cause recorded previously before we
+		 * terminated the process changed now for any reason.
+		 */
+		if (conflict_prev != RS_INVAL_NONE && terminated &&
+			conflict_prev != conflict)
+			elog(PANIC, "conflict cause recorded before terminating process %d has been changed; previous cause: %d, current cause: %d",
+				 last_signaled_pid,
+				 conflict_prev,
+				 conflict);
+
 		/* if there's no conflict, we're done */
 		if (conflict == RS_INVAL_NONE)
 		{
@@ -1499,6 +1522,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			(void) kill(active_pid, SIGTERM);
 
 			last_signaled_pid = active_pid;
+			terminated = true;
+			conflict_prev = conflict;
 		}
 
 		/* Wait until the slot is released. */
-- 
2.34.1