On Mon, Nov 19, 2018 at 01:39:58PM +0900, Michael Paquier wrote:
> I was just coming by to look a bit at the patch series, and bumped
> into that:

So I have been looking at the last patch series 0001-0004 posted on this
thread, continuing from here:
https://postgr.es/m/20181025.215518.189844649.horiguchi.kyot...@lab.ntt.co.jp

/* check that the slot is gone */
SELECT * FROM pg_replication_slots
It could be an idea to switch to expanded mode here, not that it
matters much.

+IsLsnStillAvaiable(XLogRecPtr targetLSN, uint64 *restBytes)
You mean Available here, not Avaiable.  This function is only used when
scanning for slot information with pg_replication_slots, so wouldn't it
be better to just return the status string in this case?

Not sure I see the point of the "remain" field, which can be found with
a simple calculation using the current insertion LSN, the segment size
and the amount of WAL that the slot is retaining.  It may be interesting
to document a query to do that though.
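Something like the following could be documented (untested sketch;
this assumes the patch's max_slot_wal_keep_size GUC together with the
existing pg_wal_lsn_diff(), pg_size_bytes() and
pg_current_wal_insert_lsn() functions):

```sql
-- Sketch: remaining retainable bytes per slot, computed on the fly
-- instead of exposing a dedicated "remain" field.
SELECT slot_name,
       pg_size_bytes(current_setting('max_slot_wal_keep_size'))
         - pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn)
         AS remain_bytes
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL;
```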

GetOldestXLogFileSegNo() has race conditions if WAL recycling runs in
parallel, no?  How is it safe to scan pg_wal from a process querying
pg_replication_slots while another process may manipulate its contents
(the checkpointer, or the startup process with an end-of-recovery
checkpoint)?  This routine relies on assumptions which are not
concurrent-safe.  You could avoid the problem by making sure instead
that lastRemovedSegNo is initialized correctly at startup, which would
normally be one segment older than what's in pg_wal, though relying on
that to track the oldest segment feels a bit hacky.

It seems to me that GetOldestXLogFileSegNo() should also check for
segments matching the current timeline, no?

+           if (prev_lost_segs != lost_segs)
+               ereport(WARNING,
+                       (errmsg("some replication slots have lost required WAL segments"),
+                        errdetail_plural(
+                            "The mostly affected slot has lost %ld segment.",
+                            "The mostly affected slot has lost %ld segments.",
+                            lost_segs, lost_segs)));
This can become very noisy over time, and it would actually be useful
to know which replication slot is impacted by that.

+      slot doesn't have valid restart_lsn, this field
Missing a determiner here, and restart_lsn should have <literal>
markup.

+    many WAL segments that they fill up the space allotted
s/allotted/allocated/.

+      available. The last two states are seen only when
+      <xref linkend="guc-max-slot-wal-keep-size"/> is non-negative. If the
+      slot doesn't have valid restart_lsn, this field
+      is <literal>unknown</literal>.
I am a bit confused by this statement.  The last two states are "lost"
and "keeping", but shouldn't "keeping" be the state showing up by
default, as it means that all WAL segments are kept around?

+# Advance WAL by ten segments (= 160MB) on master
+advance_wal($node_master, 10);
+$node_master->safe_psql('postgres', "CHECKPOINT;");
This makes the tests very costly, which is something we should avoid as
much as possible.  One trick which could be used here, on top of
reducing the number of segment switches, is to use initdb
--wal-segsize=1.
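For instance (untested sketch, assuming the usual PostgresNode TAP
API and its init() "extra" parameter):

```perl
use PostgresNode;

# Use 1MB WAL segments so that advancing ten segments generates
# 10MB of WAL instead of 160MB with the default 16MB segments.
my $node_master = get_new_node('master');
$node_master->init(
	allows_streaming => 1,
	extra            => ['--wal-segsize=1']);
$node_master->start;
```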
--
Michael
