On Mon, Jan 22, 2024 at 9:26 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 5:28 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > > > On Sat, Jan 20, 2024 at 7:44 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <dilipbal...@gmail.com> > > > wrote: > > > > > > > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapil...@gmail.com> > > > > wrote: > > > > > > > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.ma...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > > > > > on the topic of extending replication commands instead of using the > > > > > current model where we fetch the required slot information via SQL > > > > > using a database connection. I would like to summarize the discussion > > > > > and would like to know the thoughts of others on this topic. > > > > > > > > > > In the current patch, we launch the slotsync worker on physical > > > > > standby which connects to the specified database (currently we let > > > > > users specify the required dbname in primary_conninfo) on the primary. > > > > > It then fetches the required information for failover marked slots > > > > > from the primary and also does some primitive checks on the upstream > > > > > node via SQL (the additional checks are like whether the upstream node > > > > > has a specified physical slot or whether the upstream node is a > > > > > primary node or a standby node). To fetch the required information it > > > > > uses a libpqwalreciever API which is mostly apt for this purpose as it > > > > > supports SQL execution but for this patch, we don't need a replication > > > > > connection, so we extend the libpqwalreciever connect API. > > > > > > > > What sort of extension we have done to 'libpqwalreciever'? Is it > > > > something like by default this supports replication connections so we > > > > have done an extension to the API so that we can provide an option > > > > whether to create a replication connection or a normal connection? > > > > > > > > > > Yeah and in the future there could be more as well. The other function > > > added walrcv_get_dbname_from_conninfo doesn't appear to be a problem > > > either for now. > > > > > > > > Now, the concerns related to this could be that users would probably > > > > > need to change existing mechanisms/tools to update priamry_conninfo > > > > I'm concerned about this. In fact, a primary_conninfo value generated > > by pg_basebackup does not work with enable_syncslot. > > > > Right, but if we want can't we extend pg_basebackup to do that? It is > just that I am not sure that it is a good idea to extend pg_basebackup > in the first version.
Okay. > > > > > > and one of the alternatives proposed is to have an additional GUC like > > > > > slot_sync_dbname. Users won't be able to drop the database this worker > > > > > is connected to aka whatever is specified in slot_sync_dbname but as > > > > > the user herself sets up the configuration it shouldn't be a big deal. > > > > > > > > Yeah for this purpose users may use template1 or so which they > > > > generally don't plan to drop. > > > > > > > > > > Using template1 has other problems like users won't be able to create > > > a new database. See [2] (point number 2.2) > > > > > > > > > > > So in case the user wants to drop that > > > > database user needs to turn off the slot syncing option and then it > > > > can be done? > > > > > > > > > > Right. > > > > If the user wants to continue using slot syncing, they need to switch > > the database to connect. Which requires modifying primary_conninfo and > > reloading the configuration file. Which further leads to restarting > > the physical replication. If they use synchronous replication, it > > means the application temporarily stops during that. > > > > Yes, that would be an inconvenience but the point is we don't expect > this to change often. > > > > > > > > > Then we also discussed whether extending libpqwalreceiver's connect > > > > > API is a good idea and whether we need to further extend it in the > > > > > future. As far as I can see, slotsync worker's primary requirement is > > > > > to execute SQL queries which the current API is sufficient, and don't > > > > > see something that needs any drastic change in this API. Note that > > > > > tablesync worker that executes SQL also uses these APIs, so we may > > > > > need something in the future for either of those. Then finally we need > > > > > a slotsync worker to also connect to a database to use SQL and fetch > > > > > results. > > > > > > > > While looking into the patch v64-0002 I could not exactly point out > > > > what sort of extensions are there in libpqwalreceiver.c, I just saw > > > > one extra API for fetching the dbname from connection info? > > > > > > > > > > Right, the worry was that we may need it in the future. > > > > Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a > > non-replication connection and to execute SQL query. But neither of > > them are relevant with replication. > > > > But we are already using libpqwalreceiver to execute SQL queries via > tablesync worker. IIUC tablesync workers do both SQL queries and replication commands. I think the slotsync worker is the first background process who does only SQL queries in a non-replication command ( using libpqwalreceiver). > > I'm a bit concerned that when we > > need to extend the slotsync feature in the future we will end up > > extending libpqwalreceiver, even if the new feature is not also > > relevant with replication. > > > > > > > > > > Now, let us consider if we extend the replication commands like > > > > > READ_REPLICATION_SLOT and or introduce a new set of replication > > > > > commands to fetch the required information then we don't need a DB > > > > > connection with primary or a connection in slotsync worker. As per my > > > > > current understanding, it is quite doable but I think we will slowly > > > > > go in the direction of making replication commands something like SQL > > > > > because today we need to extend it to fetch all slots info that have > > > > > failover marked as true, the existence of a particular replication, > > > > > etc. Then tomorrow, if we want to extend this work to have multiple > > > > > slotsync workers say workers perdb then we have to extend the > > > > > replication command to fetch per-database failover marked slots. To > > > > > me, it sounds more like we are slowly adding SQL-like features to > > > > > replication commands. > > > > Right. How about filtering slots on the standby side? That is, for > > example, the LIST_SLOT command returns all slots and the slotsync > > worker filters out non-failover slots. > > > > Yeah, we can do that but it could be unnecessary network overhead when > there are very few failover slots. And it may not be just one time, we > need to fetch slot information periodically. True. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com