On 2020-Jun-17, Fujii Masao wrote: > While reading InvalidateObsoleteReplicationSlots() code, I found another > issue. > > ereport(LOG, > (errmsg("terminating walsender %d > because replication slot \"%s\" is too far behind", > wspid, > NameStr(slotname)))); > (void) kill(wspid, SIGTERM); > > wspid indicates the PID of process using the slot. That process can be > a backend, for example, executing pg_replication_slot_advance(). > So "walsender" in the above log message is not always correct.
Good point. > int wspid = > ReplicationSlotAcquire(NameStr(slotname), > > SAB_Inquire); > > Why do we need to call ReplicationSlotAcquire() here and mark the slot as > used by the checkpointer? Isn't it enough to check directly the slot's > active_pid, instead? > > Maybe ReplicationSlotAcquire() is necessary because > ReplicationSlotRelease() is called later? If so, why do we need to call > ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's > active_pid is zero. No? I think the point here was that in order to modify the slot you have to acquire it -- it's not valid to modify a slot you don't own. > + /* > + * Signal to terminate the process using the replication slot. > + * > + * Try to signal every 100ms until it succeeds. > + */ > + if (!killed && kill(active_pid, SIGTERM) == 0) > + killed = true; > + ConditionVariableTimedSleep(&slot->active_cv, 100, > + > WAIT_EVENT_REPLICATION_SLOT_DROP); > + } while (ReplicationSlotIsActive(slot, NULL)); Note that here you're signalling only once and then sleeping many times in increments of 100ms -- you're not signalling every 100ms as the comment claims -- unless the signal fails, but you don't really expect that. On the contrary, I'd claim that the logic is reversed: if the signal fails, *then* you should stop signalling. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services