Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
Hi,
0004:

@@ -3037,6 +3037,43 @@ $SIG{TERM} = $SIG{INT} = sub {

  =pod

+=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub create_logical_slot_on_standby
+{

Any reason this has to be standby specific?


Because of the extra work needed in this case (i.e. waiting for restart_lsn
and triggering a checkpoint on the primary).
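
To give the idea, this is roughly what the helper has to do on top of a plain
slot creation (a sketch only, not the exact patch code; the wait condition and
the process handling may differ):

sub create_logical_slot_on_standby
{
	my ($self, $master, $slot_name, $dbname) = @_;

	# Start the slot creation on the standby; on a standby this blocks
	# until a running-xacts record arrives from the primary.
	my $handle = IPC::Run::start(
		[
			'pg_recvlogical', '-d', $self->connstr($dbname),
			'-P', 'test_decoding', '-S', $slot_name, '--create-slot'
		]);

	# Wait until the slot shows up on the standby ...
	$self->poll_query_until($dbname,
		"SELECT count(*) = 1 FROM pg_replication_slots WHERE slot_name = '$slot_name'"
	);

	# ... then emit the xl_running_xacts record it is waiting for.
	$master->safe_psql('postgres', 'CHECKPOINT');

	$handle->finish;
}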


+       # Now arrange for the xl_running_xacts record for which pg_recvlogical
+       # is waiting.
+       $master->safe_psql('postgres', 'CHECKPOINT');
+

Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
for us instead? This will likely also be needed in real applications after all.


Not sure I got it. What would the C helper be supposed to do?


+       print "starting pg_recvlogical\n";

I don't think tests should just print somewhere. Either diag() or note()
should be used.



Will be done.
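
For instance, the line above could simply become:

	note "starting pg_recvlogical";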

+       if ($wait)
+       # make sure activeslot is in use
+       {
+               $node_standby->poll_query_until('testdb',
+                       "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'activeslot' AND active_pid IS NOT NULL)"
+               ) or die "slot never became active";
+       }

That comment placement imo is quite odd.



Agree, will be done.
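
i.e. moving the comment next to what it explains, something like:

	# make sure activeslot is in use
	if ($wait)
	{
		$node_standby->poll_query_until('testdb',
			"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'activeslot' AND active_pid IS NOT NULL)"
		) or die "slot never became active";
	}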

+# test if basic decoding works
+is(scalar(my @foobar = split /^/m, $result),
+       14, 'Decoding produced 14 rows');

Maybe mention that it's 2 transactions + 10 rows?



Agree, will be done.
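
Something along these lines, for instance (assuming the count breaks down as
2 BEGIN/COMMIT pairs plus 10 data rows):

	# test if basic decoding works: 2 BEGIN/COMMIT pairs plus 10 data rows
	# give 14 rows of decoded output
	is(scalar(my @foobar = split /^/m, $result),
		14, 'Decoding produced 14 rows (2 txns + 10 rows)');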

+$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));

There's enough copies of this that I wonder if we shouldn't introduce a
Cluster.pm level helper for this.


Done in [1].
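
For reference, the helper proposed there boils down to something like this
(a sketch; the actual name and the flush/write mode used may differ in [1]):

sub wait_for_replay_catchup
{
	my ($self, $standby) = @_;

	# Wait until the given standby has replayed up to our current
	# flush location.
	$self->wait_for_catchup($standby, 'replay', $self->lsn('flush'));
}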


+print "waiting to replay $endpos\n";

See above.



Will be done.

+my $stdout_recv = $node_standby->pg_recvlogical_upto(
+    'testdb', 'activeslot', $endpos, 180,
+    'include-xids'     => '0',
+    'skip-empty-xacts' => '1');

I don't think this should use a hardcoded 180 but
$PostgreSQL::Test::Utils::timeout_default.



Agree, will be done.
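
i.e.:

my $stdout_recv = $node_standby->pg_recvlogical_upto(
	'testdb', 'activeslot', $endpos,
	$PostgreSQL::Test::Utils::timeout_default,
	'include-xids'     => '0',
	'skip-empty-xacts' => '1');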

+# One way to reproduce recovery conflict is to run VACUUM FULL with
+# hot_standby_feedback turned off on the standby.
+$node_standby->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_standby->restart;

IIRC a reload should suffice.



Right.

With a reload in place in my testing, I now notice that the catalog_xmin
is updated on the primary physical slot after the logical slots have been
invalidated, when hot_standby_feedback is reloaded from "off" back to "on".

This is not the case after a restart (i.e. catalog_xmin stays NULL).

I think a restart and a reload should produce identical behavior on
the primary physical slot. If so, I'm tempted to think that the catalog_xmin
should be updated in the restart case too (even if all the logical slots are
invalidated), because the slots are not dropped yet. What do you think?
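
To be explicit about what I'm comparing, this is the kind of check I'm running
on the primary physical slot in both runs (a sketch, reusing the test's
$primary_slotname):

# After hot_standby_feedback is switched back to "on" (via reload in one
# run, restart in the other), compare what the standby's physical slot
# reports on the primary.
my $catalog_xmin = $node_primary->safe_psql('postgres',
	"SELECT catalog_xmin FROM pg_replication_slots WHERE slot_name = '$primary_slotname'"
);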

+# This should trigger the conflict
+$node_primary->safe_psql('testdb', 'VACUUM FULL');

Can we do something cheaper than rewriting the entire database? Seems
rewriting a single table ought to be sufficient?


While implementing the test at the table level, I discovered that there seems
to be no guarantee that, say, a "vacuum full pg_class;" would produce a conflict.

Indeed, from what I can see in my testing, it can generate an XLOG_HEAP2_PRUNE
record with snapshotConflictHorizon set to 0:

"rmgr: Heap2       len (rec/tot):    107/   107, tx:        848, lsn: 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"

A snapshotConflictHorizon of zero leads to ResolveRecoveryConflictWithSnapshot()
simply returning without any conflict handling.

It looks like that's not the right behavior for the standby decoding case (and
that the xid that generated the pruning should be used instead). What do you think?
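
For context, this is the kind of single-table variant I was experimenting with
when I saw that record (a sketch of the attempt, not test code from the patch):

# Attempted single-table variant of the conflict trigger: rewrite just
# one catalog table instead of the whole database ...
$node_primary->safe_psql('testdb', 'VACUUM FULL pg_class;');
$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));

# ... but the resulting PRUNE record can carry snapshotConflictHorizon 0,
# in which case the standby performs no conflict handling and the slots
# are never invalidated.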

I think it'd also be good to test that rewriting a non-catalog table doesn't
trigger an issue.



Good point, but I need to understand the above first.


+##################################################
+# Recovery conflict: Invalidate conflicting slots, including in-use slots
+# Scenario 2: conflict due to row removal with hot_standby_feedback off.
+##################################################
+
+# get the position to search from in the standby logfile
+my $logstart = -s $node_standby->logfile;
+
+# drop the logical slots
+$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('inactiveslot')]);
+$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('activeslot')]);
+
+create_logical_slots();
+
+# One way to produce recovery conflict is to create/drop a relation and launch a vacuum
+# with hot_standby_feedback turned off on the standby.
+$node_standby->append_conf('postgresql.conf',q[
+hot_standby_feedback = off
+]);
+$node_standby->restart;
+# ensure walreceiver feedback off by waiting for expected xmin and
+# catalog_xmin on primary. Both should be NULL since hs_feedback is off
+wait_for_xmins($node_primary, $primary_slotname,
+                          "xmin IS NULL AND catalog_xmin IS NULL");
+
+$handle = make_slot_active(1);

This is a fair bit of repeated setup, maybe put it into a function?



Yeah, good point: will be done.
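
Something along these lines, for instance (hypothetical helper, name made up
here; it reuses create_logical_slots(), wait_for_xmins() and $primary_slotname
from the test):

# Drop the previous slots, recreate them, flip hot_standby_feedback and
# wait for the expected xmins on the primary physical slot.
sub reactive_slots_change_hfs_and_wait
{
	my ($hsf, $expected_xmins) = @_;

	$node_standby->psql('postgres',
		q[SELECT pg_drop_replication_slot('inactiveslot')]);
	$node_standby->psql('postgres',
		q[SELECT pg_drop_replication_slot('activeslot')]);

	create_logical_slots();

	$node_standby->append_conf('postgresql.conf',
		"hot_standby_feedback = $hsf");
	$node_standby->reload;

	wait_for_xmins($node_primary, $primary_slotname, $expected_xmins);
}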

I think it'd be good to test the ongoing decoding via the SQL interface also
gets correctly handled. But it might be too hard to do reliably.


+##################################################
+# Test standby promotion and logical decoding behavior
+# after the standby gets promoted.
+##################################################
+

I think this also should test the streaming / walsender case.


Do you mean a cascading standby?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: 
https://www.postgresql.org/message-id/flat/846724b5-0723-f4c2-8b13-75301ec7509e%40gmail.com

