On 2025/06/30 20:32, Nisha Moond wrote:
On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
On 2025/06/27 15:32, Nisha Moond wrote:
On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
On 2025/06/26 15:46, Nisha Moond wrote:
On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
Hi,
The pg_replication_slots documentation mentions only max_slot_wal_keep_size
as a condition under which the wal_status column can show unreserved or lost.
However, since commit ac0e33136ab, idle_replication_slot_timeout can also
cause this behavior when it is set. This has not been documented yet.
https://www.postgresql.org/docs/devel/view-pg-replication-slots.html
+1 to the doc update.
Thanks for the review!
So, how about updating the documentation to also mention
idle_replication_slot_timeout as a factor that can cause wal_status to
become unreserved or lost? Patch attached.
Since idle_replication_slot_timeout can only cause wal_status to
become 'lost' and not 'unreserved', perhaps we can reword the sentence
slightly for clarity, suggestion -
"The last two states are seen when max_slot_wal_keep_size is
non-negative and, the 'lost' state may also appear when
idle_replication_slot_timeout is greater than zero."
I was thinking that when idle_replication_slot_timeout triggers,
the following functions are called, and that wal_status can become
"unreserved" before ReplicationSlotRelease() runs. It's very short
period, though. Am I wrong?
ReplicationSlotMarkDirty();
ReplicationSlotSave();
ReplicationSlotRelease();
Thank you for pointing it out.
You are correct that while the checkpointer is in the process of
invalidating a slot, it sets its PID as the slot’s active_pid. During
this short window, if a user queries pg_replication_slot, the
underlying function pg_get_replication_slots will compute the
wal_status as 'unreserved' for the invalidated slot because the slot
has a valid active_pid.
That said, it's reasonable to mention in the doc that 'unreserved' may
appear when idle_replication_slot_timeout is greater than zero, as
this can indeed happen. So, let's retain the current description.
However, this behavior isn’t specific to
idle_replication_slot_timeout. For example, when a slot is being
invalidated due to a different cause "wal_level_insufficient",
'unreserved' may also briefly appear in wal_status.
Yes, and "lost" can appear for various reasons, including
wal_level_insufficient,
so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost"
status in the note. It would probably be better to remove the mention of "lost"
from that note.
+1
Is this true starting from v16, when logical replication from standby was
introduced?
In other words, in v15 and earlier, only max_slot_wal_keep_size could cause
the wal_status to become "unreserved" or "lost"? I'm wondering where to
back-patch
this fix to.
As for "unreserved", it can also occur for different reasons, but typically,
it happens when max_slot_wal_keep_size is set to a non-negative value.
So it might make sense to keep the explanation focused just on "unreserved"
and max_slot_wal_keep_size. For example:
----------------------
<listitem>
<para>
<literal>unreserved</literal> means that the slot no longer
retains the required WAL files and some of them are to be removed
at
- the next checkpoint. This state can return
+ the next checkpoint. This can occur when
+ <xref linkend="guc-max-slot-wal-keep-size"/> is set to
+ a non-negative value. This state can return
to <literal>reserved</literal> or <literal>extended</literal>.
</para>
</listitem>
<listitem>
----------------------
What do you think?
The change LGTM, only a minor suggestion to add "typically", as “This
can typically occur when…” to indicate that max_slot_wal_keep_size is
one possible reason, not the only one.
OK.
Also, I noticed the note that says “If <structfield>restart_lsn</structfield>
is NULL, this field is null” seems inaccurate. For example, when "wal_removed"
happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove
that note as well?
You're right, the statement is not accurate.
We could rephrase it as: "If <structfield>restart_lsn</structfield> is
NULL, this field is either null or lost." But since 'unreserved' can
also appear briefly during invalidation, it might be better to remove
it altogether.
I agree with removing the description. Unless I'm missing something,
it has been incorrect since at least v13, so we should back-patch this fix
to all supported versions.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation