On Fri, Mar 24, 2023 at 1:21 AM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > On Fri, Mar 24, 2023 at 9:27 AM Melanie Plageman > <melanieplage...@gmail.com> wrote: > > > > On Thu, Mar 23, 2023 at 2:09 AM Masahiko Sawada <sawada.m...@gmail.com> > > wrote: > > > On Sun, Mar 19, 2023 at 7:47 AM Melanie Plageman > > > <melanieplage...@gmail.com> wrote: > > > Do we need to calculate the number of workers running with > > > nworkers_for_balance by iterating over the running worker list? I > > > guess autovacuum workers can increment/decrement it at the beginning > > > and end of vacuum. > > > > I don't think we can do that because if a worker crashes, we have no way > > of knowing if it had incremented or decremented the number, so we can't > > adjust for it. > > What kind of crash are you concerned about? If a worker raises an > ERROR, we can catch it in PG_CATCH() block. If it's a FATAL, we can do > that in FreeWorkerInfo(). A PANIC error ends up crashing the entire > server.
Yes, but what about a worker that segfaults? Since table AMs can define relation_vacuum(), this seems like a real possibility. I'll address your other code feedback in the next version. I realized nworkers_for_balance should be initialized to 0 and not 1 -- 1 is misleading since there are often 0 autovac workers. We just never want to use nworkers_for_balance when it is 0. But, workers put a floor of 1 on the number when they divide limit/nworkers_for_balance (since they know there must be at least one worker right now since they are a worker). I thought about whether or not they should call autovac_balance_cost() if they find that nworkers_for_balance is 0 when updating their own limit, but I'm not sure. > > > I need another few readthroughs to figure out of VacuumFailsafeActive > > > does what > > > I think it does, and should be doing, but in general I think this is a > > > good > > > idea and a patch in good condition close to being committable. > > Another approach would be to make VacuumCostActive a ternary value: > on, off, and never. When we trigger the failsafe mode we switch it to > never, meaning that it never becomes active even after reloading the > config file. A good point is that we don't need to add a new global > variable, but I'm not sure it's better than the current approach. Hmm, this is interesting. I don't love the word "never" since it kind of implies a duration longer than the current table being vacuumed. But we could find a different word or just document it well. For clarity, we might want to call it failsafe_mode or something. I wonder if the primary drawback to converting LVRelState->failsafe_active to a global VacuumFailsafeActive is just the general rule of limiting scope to the minimum needed. - Melanie