Re: Synchronizing slots from primary to standby

Masahiko Sawada Mon, 22 Jan 2024 07:13:01 -0800

On Mon, Jan 22, 2024 at 9:26 PM Amit Kapila <[email protected]> wrote:
>
> On Mon, Jan 22, 2024 at 5:28 PM Masahiko Sawada <[email protected]> wrote:
> >
> > On Sat, Jan 20, 2024 at 7:44 PM Amit Kapila <[email protected]> wrote:
> > >
> > > On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <[email protected]> 
> > > wrote:
> > > >
> > > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <[email protected]> 
> > > > wrote:
> > > > >
> > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <[email protected]> 
> > > > > wrote:
> > > > > >
> > > > >
> > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta
> > > > > on the topic of extending replication commands instead of using the
> > > > > current model where we fetch the required slot information via SQL
> > > > > using a database connection. I would like to summarize the discussion
> > > > > and would like to know the thoughts of others on this topic.
> > > > >
> > > > > In the current patch, we launch the slotsync worker on physical
> > > > > standby which connects to the specified database (currently we let
> > > > > users specify the required dbname in primary_conninfo) on the primary.
> > > > > It then fetches the required information for failover marked slots
> > > > > from the primary and also does some primitive checks on the upstream
> > > > > node via SQL (the additional checks are like whether the upstream node
> > > > > has a specified physical slot or whether the upstream node is a
> > > > > primary node or a standby node). To fetch the required information it
> > > > > uses a libpqwalreciever API which is mostly apt for this purpose as it
> > > > > supports SQL execution but for this patch, we don't need a replication
> > > > > connection, so we extend the libpqwalreciever connect API.
> > > >
> > > > What sort of extension we have done to 'libpqwalreciever'? Is it
> > > > something like by default this supports replication connections so we
> > > > have done an extension to the API so that we can provide an option
> > > > whether to create a replication connection or a normal connection?
> > > >
> > >
> > > Yeah and in the future there could be more as well. The other function
> > > added walrcv_get_dbname_from_conninfo doesn't appear to be a problem
> > > either for now.
> > >
> > > > > Now, the concerns related to this could be that users would probably
> > > > > need to change existing mechanisms/tools to update priamry_conninfo
> >
> > I'm concerned about this. In fact, a primary_conninfo value generated
> > by pg_basebackup does not work with enable_syncslot.
> >
>
> Right, but if we want can't we extend pg_basebackup to do that? It is
> just that I am not sure that it is a good idea to extend pg_basebackup
> in the first version.


Okay.

>
> > > > > and one of the alternatives proposed is to have an additional GUC like
> > > > > slot_sync_dbname. Users won't be able to drop the database this worker
> > > > > is connected to aka whatever is specified in slot_sync_dbname but as
> > > > > the user herself sets up the configuration it shouldn't be a big deal.
> > > >
> > > > Yeah for this purpose users may use template1 or so which they
> > > > generally don't plan to drop.
> > > >
> > >
> > > Using template1 has other problems like users won't be able to create
> > > a new database. See [2] (point number 2.2)
> > >
> > > >
> > > >  So in case the user wants to drop that
> > > > database user needs to turn off the slot syncing option and then it
> > > > can be done?
> > > >
> > >
> > > Right.
> >
> > If the user wants to continue using slot syncing, they need to switch
> > the database to connect. Which requires modifying primary_conninfo and
> > reloading the configuration file. Which further leads to restarting
> > the physical replication. If they use synchronous replication, it
> > means the application temporarily stops during that.
> >
>
> Yes, that would be an inconvenience but the point is we don't expect
> this to change often.
>
> > >
> > > > > Then we also discussed whether extending libpqwalreceiver's connect
> > > > > API is a good idea and whether we need to further extend it in the
> > > > > future. As far as I can see, slotsync worker's primary requirement is
> > > > > to execute SQL queries which the current API is sufficient, and don't
> > > > > see something that needs any drastic change in this API. Note that
> > > > > tablesync worker that executes SQL also uses these APIs, so we may
> > > > > need something in the future for either of those. Then finally we need
> > > > > a slotsync worker to also connect to a database to use SQL and fetch
> > > > > results.
> > > >
> > > > While looking into the patch v64-0002 I could not exactly point out
> > > > what sort of extensions are there in libpqwalreceiver.c, I just saw
> > > > one extra API for fetching the dbname from connection info?
> > > >
> > >
> > > Right, the worry was that we may need it in the future.
> >
> > Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a
> > non-replication connection and to execute SQL query. But neither of
> > them are relevant with replication.
> >
>
> But we are already using libpqwalreceiver to execute SQL queries via
> tablesync worker.

IIUC tablesync workers do both SQL queries and replication commands. I
think the slotsync worker is the first background process who does
only SQL queries in a non-replication command ( using
libpqwalreceiver).

>
>  I'm a bit concerned that when we
> > need to extend the slotsync feature in the future we will end up
> > extending libpqwalreceiver, even if the new feature is not also
> > relevant with replication.
> >
> > >
> > > > > Now, let us consider if we extend the replication commands like
> > > > > READ_REPLICATION_SLOT and or introduce a new set of replication
> > > > > commands to fetch the required information then we don't need a DB
> > > > > connection with primary or a connection in slotsync worker. As per my
> > > > > current understanding, it is quite doable but I think we will slowly
> > > > > go in the direction of making replication commands something like SQL
> > > > > because today we need to extend it to fetch all slots info that have
> > > > > failover marked as true, the existence of a particular replication,
> > > > > etc. Then tomorrow, if we want to extend this work to have multiple
> > > > > slotsync workers say workers perdb then we have to extend the
> > > > > replication command to fetch per-database failover marked slots. To
> > > > > me, it sounds more like we are slowly adding SQL-like features to
> > > > > replication commands.
> >
> > Right. How about filtering slots on the standby side? That is, for
> > example, the LIST_SLOT command returns all slots and the slotsync
> > worker filters out non-failover slots.
> >
>
> Yeah, we can do that but it could be unnecessary network overhead when
> there are very few failover slots. And it may not be just one time, we
> need to fetch slot information periodically.

True.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Synchronizing slots from primary to standby

Reply via email to