On 2020-Apr-08, Kyotaro Horiguchi wrote:
> At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi
> <[email protected]> wrote in
>
> Just avoiding starting replication when restart_lsn is invalid is
> sufficient (the attached, which is equivalent to a part of what the
> invalidated flag did). I thing that the error message needs a Hint but
> it looks on the subscriber side as:
>
> [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL
> stream: ERROR: replication slot "s1" is invalidated
> HINT: The slot exceeds the limit by max_slot_wal_keep_size.
>
> I don't think it is not clean.. Perhaps the subscriber should remove
> the trailing line of the message from the publisher?
Thanks for the fix! I propose two changes:
1. reword the error like this:
ERROR: replication slot "regression_slot3" cannot be advanced
DETAIL: This slot has never previously reserved WAL, or has been invalidated
2. use the same error in one other place, to wit
pg_logical_slot_get_changes() and pg_replication_slot_advance(). I
made the DETAIL part the same in all places, but the ERROR line is
adjusted to what each callsite is doing.
I do think that this change in test_decoding is a bit unpleasant:
-ERROR: cannot use physical replication slot for logical decoding
+ERROR: cannot get changes from replication slot "repl"
The test is
-- check that we're detecting a streaming rep slot used for logical
decoding
SELECT 'init' FROM pg_create_physical_replication_slot('repl');
SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');
> > On the other hand, physical replication doesn't break by invlidation.
> > [...]
>
> If we don't mind that standby can reconnect after a walsender
> termination due to the invalidation, we don't need to do something for
> this. Restricting max_slot_wal_keep_size to be larger than a certain
> threshold would reduce the chance we see that behavior.
Yeah, I think you're referring to the fact that StartReplication()
doesn't verify the restart_lsn of the slot; and if we do add a check, a
few tests that rely on physical replication start to fail. This patch
only adds a comment in that spot. But I don't (yet) know what the
consequences of this are, or whether it can be fixed by setting a valid
restart_lsn ahead of time. This test in pg_basebackup fails, for
example:
# Running: pg_basebackup -D
/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl
-X stream -S slot1
pg_basebackup: error: could not send replication command "START_REPLICATION":
ERROR: cannot read from replication slot "slot1"
DETAIL: This slot has never previously reserved WAL, or has been invalidated
pg_basebackup: error: child process exited with exit code 1
pg_basebackup: removing data directory
"/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
not ok 95 - pg_basebackup -X stream with replication slot runs
# Failed test 'pg_basebackup -X stream with replication slot runs'
# at t/010_pg_basebackup.pl line 461.
Anyway I think the current patch can be applied as is -- and if we want
physical replication to have some other behavior, we can patch for that
afterwards.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 74dcfd5f6384d9bbdb1d23e6d4d69fcf798c7cff Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <[email protected]>
Date: Mon, 27 Apr 2020 17:32:28 -0400
Subject: [PATCH] check slot->restart_lsn in a couple of places
---
contrib/test_decoding/expected/ddl.out | 3 ++-
contrib/test_decoding/expected/slot.out | 3 ++-
src/backend/replication/logical/logicalfuncs.c | 7 +++++++
src/backend/replication/slotfuncs.c | 4 +++-
src/backend/replication/walsender.c | 14 +++++++++++++-
5 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/contrib/test_decoding/expected/ddl.out b/contrib/test_decoding/expected/ddl.out
index d79cd316b7..bbc8e445af 100644
--- a/contrib/test_decoding/expected/ddl.out
+++ b/contrib/test_decoding/expected/ddl.out
@@ -41,7 +41,8 @@ SELECT 'init' FROM pg_create_physical_replication_slot('repl');
(1 row)
SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
-ERROR: cannot use physical replication slot for logical decoding
+ERROR: cannot get changes from replication slot "repl"
+DETAIL: This slot has never previously reserved WAL, or has been invalidated
SELECT pg_drop_replication_slot('repl');
pg_drop_replication_slot
--------------------------
diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out
index 1000171530..b448e881d6 100644
--- a/contrib/test_decoding/expected/slot.out
+++ b/contrib/test_decoding/expected/slot.out
@@ -143,7 +143,8 @@ SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3');
SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN
ERROR: invalid target WAL LSN
SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
-ERROR: cannot advance replication slot that has not previously reserved WAL
+ERROR: replication slot "regression_slot3" cannot be advanced
+DETAIL: This slot has never previously reserved WAL, or has been invalidated
SELECT pg_drop_replication_slot('regression_slot3');
pg_drop_replication_slot
--------------------------
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index f5384f1df8..8412e14fe1 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -227,6 +227,13 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
(void) ReplicationSlotAcquire(NameStr(*name), SAB_Error);
+ if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot get changes from replication slot \"%s\"",
+ NameStr(*name)),
+ errdetail("This slot has never previously reserved WAL, or has been invalidated")));
+
PG_TRY();
{
/* restart at slot's confirmed_flush */
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index f776de3df7..1cfe2e5893 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -598,7 +598,9 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("cannot advance replication slot that has not previously reserved WAL")));
+ errmsg("replication slot \"%s\" cannot be advanced",
+ NameStr(*slotname)),
+ errdetail("This slot has never previously reserved WAL, or has been invalidated")));
/*
* Check if the slot is not moving backwards. Physical slots rely simply
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6c87f6087c..8373f00e80 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -600,6 +600,12 @@ StartReplication(StartReplicationCmd *cmd)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot use a logical replication slot for physical replication")));
+
+ /*
+ * We don't need to verify the slot's restart_lsn here; instead we
+ * rely on the caller requesting the starting point to use. If the
+ * WAL segment doesn't exist, we'll fail later.
+ */
}
/*
@@ -1134,6 +1140,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
(void) ReplicationSlotAcquire(cmd->slotname, SAB_Error);
+ if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot read from replication slot \"%s\"",
+ cmd->slotname),
+ errdetail("This slot has never previously reserved WAL, or has been invalidated")));
+
/*
* Force a disconnect, so that the decoding code doesn't need to care
* about an eventual switch from running in recovery, to running in a
@@ -1159,7 +1172,6 @@ StartLogicalReplication(StartReplicationCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
-
WalSndSetState(WALSNDSTATE_CATCHUP);
/* Send a CopyBothResponse message, and start streaming */
--
2.20.1