On Fri, Jan 23, 2026 at 9:31 PM Sami Imseih <[email protected]> wrote: > > Thanks for the detailed summary! > > It is important to point out that this feature is trying to do 2 distinct > things in 1 command. run analyze under when either one of these conditions > is true: > > 1/ Table has not been analyzed yet. > 2/ Table has been modified. >
Maybe this is all an aside, but I don't think that was the vision for what the OP was trying to do with his patch, in that sense he was approaching it from a different angle, and I've been reading this thread trying to decide if people are just talking past each other. But after thinking about it some more, I think the above might be the more useful mental model for the discussion. > > Thanks a lot for the detailed feedback — this has been very > > helpful.Answering to all mails in one. > > > > A few clarifications on intent and scope, and how this relates to the > > points raised: > > > > Autovacuum overlap > > I agree there is some conceptual overlap with autovacuum’s analyze decision > > logic. > > The intent here is not to replace or duplicate autovacuum heuristics, but > > to reduce > > Yes, I agree with this. > > > I agree that n_mod_since_analyze == 0 is a very simple condition > > and not “smart” in the general sense. That is intentional for now. > > This option is not trying to answer when statistics should be refreshed > > optimally, > > but only to skip relations that are known to be unchanged since the last > > analyze. > > If even a single tuple is modified, SMART ANALYZE will still re-run, > > preserving > > conservative behavior. > > Yes, this is my concern. Why would I want to analyze if 1 row or a negligible > amount of rows are modified? I understand that this feature is trying to > keep the decision making very simple, but I think it's too simple to actually > be helpful in addressing the wasted effort of an ANALYZE command. > > > Tables never analyzed > > As Christoph and Ilia pointed out earlier, skipping tables that were never > > analyzed would be incorrect. > > The current logic explicitly avoids that by requiring last_analyze or > > last_autoanalyze to be present > > before skipping. Tables without prior statistics are always analyzed. > > I agree with this, but I think it's more than just tables that have > not been analyzed. > What if a new column is added after the last (auto)analyze. Would we not want > to > trigger an analyze in that case? > Well, I don't know that we are "triggering" anything, but this is definitely a case where we have "missing stats". > > Relation to vacuumdb --missing-stats-only > > I agree this is related but slightly different in intent. > > --missing-stats-only > > answers “does this table have any statistics at all?”, while SMART ANALYZE > > answers “has this table changed since the last statistics collection?”. > > Both seem > > useful, but they target different use cases. I see SMART ANALYZE primarily > > as a performance optimization for repeated manual ANALYZE runs on > > mostly-static schemas. > > SMART ANALYZE is trying to answer 2 questions "which table does not > have any statistics at all" > and "has this table changed since the last statistics collection?”, right? > > So, maybe they need to be 2 separate options. > > > Although as sami said this SMART is not smart enough as it should be , > > I will change name accordingly in the further patches > > Yup, I am not too fond of SMART in the name. Also, then name itself > is vague. SKIP_LOCKED and BUFFER_USAGE_LIMIT on the other > hand tell you exactly what they[re used for. > So, tossing out a new proposal here, which is to offer ANALYZE with 2 new options... MISSING_STATS and MODIFIED_STATS. When MISSING_STATS is passed, we attempt to analyze only tables that have missing stats, essentially implementing a version of --missing-stats-only but for the ANALYZE command. In successive runs, this should reduce towards a no-op, although we need to decide what to do about system tables, which, iirc --missing-stats-only always assumes to be true, but this version probably doesn't want to assume that. When MODIFIED_STATS is passed, we would instead only analyze tables where some threshold of rows has been modified. I feel like the most obvious choice for this calculation would be based on a formula like "analyze threshold = analyze base threshold + analyze scale factor * number of tuples". Astute observers will note that this is the same threshold used by autoanalyze, which means if you had the same defaults you are just doing the work manually that autoanalyze would eventually get around to doing (which seems potentially useful on its own). But also if these were based on gucs, the OP could modify those gucs to achieve their desired behavior, ie. set analyze_base_threshold=1; set analyze_scale_factor=0; analyze (modified_stats); // this should analyze anything with 1 modified row Granted, I don't like that it is both more wordy than the original idea, and that we would need to add new gucs, but this would be pretty flexible. Robert Treat https://xzilla.net
