On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignes...@gmail.com> wrote: > ... > a) Introduce a new latch to handle worker attach and exit.
IIUC there is normally only the “shared” latch or a “process local” latch. i.e. AFAICT is is quite uncommon to invent a new latches like option a is proposing. I did not see any examples of making new latches (e.g. MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source other than that ‘recoveryWakeupLatch’ mentioned above by Hou-san. So this option could be OK, but OTOH since there is hardly any precedent maybe that should be taken as an indication to avoid doing it this way (???) > b) Add a new GUC launcher_retry_time which gives more flexibility to > users as suggested by Amit at [1]. I'm not sure that introducing a new GUC is a good option because this seems a rare problem to have -- so it will be hard to tune since it will be difficult to know you even have this problem and then difficult to know that it is fixed. Anyway. this made me consider more what the WaitLatch timeout value should be. Here are some thoughts: Idea 1) I was wondering where did that DEFAULT_NAPTIME_PER_CYCLE value of 180 seconds come from or was that just a made-up number? AFAICT it just came into existence in the first pub/sub commit [1] but there is no explanation for why 180s was chosen. Anyway, I assume a low value (5s?) would be bad because it incurs unacceptable CPU usage, right? But if 180s is too long and 5s is too short then what does a “good” number even look like? E.g.,. if 60s is deemed OK, then is there any harm in just defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving it at that? Idea 2) Another idea could be to use a “dynamic timeout” in the WaitLatch of ApplyLauncherMain. Let me try to explain my thought bubble: - If the preceding foreach(lc, sublist) loop launched any workers then the WaitLatch timeout can be a much shorter 10s - If the preceding foreach(lc, sublist) loop did NOT launch any workers (this would be the most common case) then the WaitLatch timeout can be the usual 180s. IIUC this strategy will still give any recently launched workers enough time to attach shmem but it also means that any concurrent CREATE SUBSCRIPTION will be addressed within 10s instead of 180s. Maybe this is sufficient to make an already rare problem become insignificant. > c) Don't reset the latch at worker attach and allow launcher main to > identify and handle it. For this there is a patch v6-0002 available at > [2]. This option c seems the easiest. Can you explain what are the drawbacks of using this approach? ====== [1] github - https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf Kind Regards, Peter Smith. Fujitsu Australia