On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <[email protected]> wrote:
>
...
> Thanks. I have fixed one of the issues reported by me earlier [1]
> wherein the tablesync worker can repeatedly fail if after dropping the
> slot there is an error while updating the SYNCDONE state in the
> database. I have moved the drop of the slot just before commit of the
> transaction where we are marking the state as SYNCDONE. Additionally,
> I have removed unnecessary includes in tablesync.c, updated the docs
> for Alter Subscription, and updated the comments at various places in
> the patch. I have also updated the commit message this time.
>
Below are my feedback comments for V17 (nothing functional)
~~
1.
V27 Commit message:
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronizes the
position in the stream with the main apply worker.
Typo:
"synchronizes" -> "synchronize"
~~
2.
@@ -48,6 +48,23 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
(Currently, all subscription owners must be superusers, so the owner checks
will be bypassed in practice. But this might change in the future.)
</para>
+
+ <para>
+ When refreshing a publication we remove the relations that are no longer
+ part of the publication and we also remove the tablesync slots if there are
+ any. It is necessary to remove tablesync slots so that the resources
+ allocated for the subscription on the remote host are released. If due to
+ network breakdown or some other error, we are not able to remove the slots,
+ we give WARNING and the user needs to manually remove such slots later as
+ otherwise, they will continue to reserve WAL and might eventually cause
+ the disk to fill up. See also <xref
linkend="logical-replication-subscription-slot"/>.
+ </para>
I think the content is good, but the 1st-person wording seemed strange.
e.g.
"we are not able to remove the slots, we give WARNING and the user needs..."
Maybe it should be like:
"... PostgreSQL is unable to remove the slots, so a WARNING is
reported. The user needs... "
~~
3.
@@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
bool copy_data)
...
+ * XXX If there is a network break down while dropping the
"network break down" -> "network breakdown"
~~
4.
@@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
(errmsg("could not connect to the publisher: %s", err)));
...
+ * XXX We could also instead try to drop the slot, last time we failed
+ * but for that, we might need to clean up the copy state as it might
+ * be in the middle of fetching the rows. Also, if there is a network
+ * break down then it wouldn't have succeeded so trying it next time
+ * seems like a better bet.
"network break down" -> "network breakdown"
~~
5.
@@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
cacheid, uint32 hashvalue)
...
+
+ /*
+ * Cleanup the tablesync slot.
+ *
+ * This has to be done after updating the state because otherwise if
+ * there is an error while doing the database operations we won't be
+ * able to rollback dropped slot.
+ */
+ ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+ MyLogicalRepWorker->relid,
+ syncslotname);
+
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
+
Should this comment also describe why the missing_ok is false for this case?
----
Kind Regards,
Peter Smith.
Fujitsu Australia