On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta.ma...@gmail.com> wrote: > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > <bharath.rupireddyforpostg...@gmail.com> wrote: > > > > > > > > Is having one (or a few more - not > > > > necessarily one for each logical slot) worker for all logical slots > > > > enough? > > > > > > > > > > I guess for a large number of slots the is a possibility of a large > > > gap in syncing the slots which probably means we need to retain > > > corresponding WAL for a much longer time on the primary. If we can > > > prove that the gap won't be large enough to matter then this would be > > > probably worth considering otherwise, I think we should find a way to > > > scale the number of workers to avoid the large gap. > > > > > > > How about this: > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > doing currently). > > > > 2) Maintain statistics on activity against each primary's database on > > standby by any means. Could be by maintaining 'last_synced_time' and > > 'last_activity_seen time'. The last_synced_time is updated every time > > we sync/recheck slots for that particular database. The > > 'last_activity_seen_time' changes only if we get any slot on that > > database where actually confirmed_flush or say restart_lsn has changed > > from what was maintained already. > > > > 3) If at any moment, we find that 'last_synced_time' - > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > not active currently. Add it to list of inactive DB > > > > I think we should also increase the next_sync_time if in current sync, > there is no update.
+1 > > > 4) Launcher on the other hand is always checking if it needs to spawn > > any other extra worker for any new DB. It will additionally check if > > number of inactive databases (maintained on standby) has gone higher > > (> some threshold), then it brings down the workers for those and > > starts a common worker which takes care of all such inactive databases > > (or merge all in 1), while workers for active databases remain as such > > (i.e. one per db). Each worker maintains the list of DBs which it is > > responsible for. > > > > 5) If in the list of these inactive databases, we again find any > > active database using the above logic, then the launcher will spawn a > > separate worker for that. > > > > I wonder if we anyway some sort of design like this because we > shouldn't allow to spawn as many workers as the number of databases. > There has to be some existing or new GUC like max_sync_slot_workers > which decided the number of workers. > Currently it does not have any such GUC for sync-slot workers. It mainly uses the logical-rep-worker framework for the sync-slot worker part and thus it relies on 'max_logical_replication_workers' GUC. Also it errors out if 'max_replication_slots' is set to zero. I think it is not the correct way of doing things for sync-slot. We can have a new GUC (max_sync_slot_workers) as you suggested and if the number of databases < max_sync_slot_workers, then we can start 1 worker per dbid, else divide the work equally among the max sync-workers possible. And for inactive database cases, we can increase the next_sync_time rather than starting a special worker to handle all the inactive databases. Thoughts? thanks Shveta