Hi all, First of all, sorry for the delay in posting the updated patch. I was occupied with some other work and holidays, and it took me some time to revisit the design carefully. After several iterations and reworking the logic based on the earlier feedback, I have now arrived at what I believe is a cleaner and more aligned implementation. I am attaching v4 of the patch for review.
The intention is to provide a SQL-level equivalent of vacuumdb --analyze-only --missing-stats-only, while keeping the default ANALYZE behavior completely unchanged. When this option is specified, ANALYZE will process only those relations that are missing statistics, and skip relations that already have complete statistics. A relation is considered to be missing statistics if at least one analyzable attribute (as determined by examine_attribute()) does not have a corresponding row in pg_statistic, or if there is an extended statistics object in pg_statistic_ext without a matching row in pg_statistic_ext_data. In other words, the implementation relies on catalog inspection rather than counters or timestamps, and it reuses existing ANALYZE internals instead of redefining missing-stats logic independently. The check is placed inside analyze_rel(), after the standard relation validation and skip conditions, so that it does not interfere with privilege checks or special-relation handling. The default ANALYZE path remains unchanged. In terms of behavior: * A brand new empty table is analyzed (since it has no statistics). * Re-running on an empty table analyzes again, because there are still no pg_statistic rows. * A table with data but no statistics is analyzed. * Re-running after statistics exist causes the table to be skipped. * If a new column is added and lacks statistics, the table is analyzed again. * After statistics are created for that column, subsequent runs skip the table. * If statistics are manually deleted or effectively lost (e.g., crash recovery scenarios affecting stats tracking), the table is analyzed again. Repeated runs therefore converge toward a no-op once all relations have complete statistics. Regression tests are included. As discussed earlier in the thread, I plan to start a new discussion and patch series for a separate ANALYZE (MODIFIED_STATS) option that would reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY and MODIFIED_STATS as separate, clearly defined options makes the semantics easier to reason about. I would greatly appreciate further review and feedback on this version. Thank you all for the detailed guidance and suggestions so far — especially regarding reuse of examine_attribute() and alignment with vacuumdb behavior. This process has been very educational for me. Thanks, Vasuki M C-DAC, Chennai
From 978bb7ba59c1a008e21f7ce126e37ed85cc3d9cd Mon Sep 17 00:00:00 2001 From: Vasuki M <[email protected]> Date: Fri, 13 Feb 2026 12:09:44 +0530 Subject: [PATCH v4] ANALYZE: add MISSING_STATS_ONLY option Introduce a new ANALYZE option, MISSING_STATS_ONLY, which limits ANALYZE to relations that are missing statistics. When specified, ANALYZE will process only those relations that have at least one analyzable attribute or extended statistics object without corresponding entries in pg_statistic or pg_statistic_ext_data. Relations that already have complete statistics are skipped. This provides SQL-level functionality similar in spirit to vacuumdb --missing-stats-only, allowing users to reduce unnecessary work during repeated manual ANALYZE runs across many relations. The behavior is conservative: - Tables that have never been analyzed are processed. - Tables that lost statistics (e.g., due to crash recovery) are processed. - Tables with newly added columns lacking statistics are processed. - Tables with complete statistics are skipped. Default ANALYZE behavior is unchanged. Regression tests are included. --- src/backend/commands/analyze.c | 115 ++++++++++++++++++ src/backend/commands/vacuum.c | 7 +- src/include/commands/vacuum.h | 2 + .../expected/analyze_missing_stats_only.out | 54 ++++++++ src/test/regress/parallel_schedule | 1 + .../sql/analyze_missing_stats_only.sql | 47 +++++++ 6 files changed, 225 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/analyze_missing_stats_only.out create mode 100644 src/test/regress/sql/analyze_missing_stats_only.sql diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index a4834241..124244a7 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -29,6 +29,8 @@ #include "catalog/index.h" #include "catalog/indexing.h" #include "catalog/pg_inherits.h" +#include "catalog/pg_statistic_ext.h" +#include "catalog/pg_statistic_ext_data.h" #include "commands/progress.h" #include "commands/tablecmds.h" #include "commands/vacuum.h" @@ -96,6 +98,84 @@ static void update_attstats(Oid relid, bool inh, static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull); static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull); +static bool +relation_has_missing_column_stats(Relation rel) +{ + int attnum; + + for (attnum = 1; attnum <= rel->rd_att->natts; attnum++) + { + VacAttrStats *stats; + HeapTuple statstup; + + /* Is this attribute analyzable at all? */ + stats = examine_attribute(rel, attnum, NULL); + if (stats == NULL) + continue; + + /* Does this attribute already have stats? */ + statstup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(RelationGetRelid(rel)), + Int16GetDatum(attnum), + BoolGetDatum(false)); + + if (!HeapTupleIsValid(statstup)) + return true; /* missing stats */ + + ReleaseSysCache(statstup); + } + + return false; +} + +static bool +relation_has_missing_extended_stats(Relation rel) +{ + Relation extrel; + SysScanDesc scan; + ScanKeyData key; + HeapTuple tup; + + extrel = table_open(StatisticExtRelationId, AccessShareLock); + + ScanKeyInit(&key, + Anum_pg_statistic_ext_stxrelid, + BTEqualStrategyNumber, + F_OIDEQ, + ObjectIdGetDatum(RelationGetRelid(rel))); + + scan = systable_beginscan(extrel, + StatisticExtRelidIndexId, + true, + NULL, + 1, + &key); + + while (HeapTupleIsValid(tup = systable_getnext(scan))) + { + Form_pg_statistic_ext e = + (Form_pg_statistic_ext) GETSTRUCT(tup); + + HeapTuple dtup = + SearchSysCache2(STATEXTDATASTXOID, + ObjectIdGetDatum(e->oid), + BoolGetDatum(false)); + + if (!HeapTupleIsValid(dtup)) + { + systable_endscan(scan); + table_close(extrel, AccessShareLock); + return true; + } + + ReleaseSysCache(dtup); + } + + systable_endscan(scan); + table_close(extrel, AccessShareLock); + + return false; +} /* * analyze_rel() -- analyze one relation @@ -234,6 +314,37 @@ analyze_rel(Oid relid, RangeVar *relation, return; } + /* + * ANALYZE (MISSING_STATS_ONLY): + * Skip relation if any relation already have complete statistics. + */ + if (params.options & VACOPT_MISSING_STATS_ONLY) + { + bool missing = false; + + /* Check column stats */ + if (relation_has_missing_column_stats(onerel)) + missing = true; + + /* Check extended stats */ + else if (relation_has_missing_extended_stats(onerel)) + missing = true; + + if (!missing) + { + elog(DEBUG1, + "ANALYZE (MISSING_STATS_ONLY): skipping relation \"%s\"", + RelationGetRelationName(onerel)); + + table_close(onerel, ShareUpdateExclusiveLock); + return; + } + + elog(DEBUG1, + "ANALYZE (MISSING_STATS_ONLY): relation eligible \"%s\"", + RelationGetRelationName(onerel)); + } + /* * OK, let's do it. First, initialize progress reporting. */ @@ -314,6 +425,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params, PgStat_Counter startreadtime = 0; PgStat_Counter startwritetime = 0; + elog(DEBUG1, "ANALYZE processing relation \"%s\" (OID %u)", + RelationGetRelationName(onerel), + RelationGetRelid(onerel)); + verbose = (params.options & VACOPT_VERBOSE) != 0; instrument = (verbose || (AmAutoVacuumWorkerProcess() && params.log_analyze_min_duration >= 0)); diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 03932f45..20981a9d 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -165,6 +165,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) BufferAccessStrategy bstrategy = NULL; bool verbose = false; bool skip_locked = false; + bool missing_stats_only = false; bool analyze = false; bool freeze = false; bool full = false; @@ -229,6 +230,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) ring_size = result; } + else if (strcmp(opt->defname, "missing_stats_only") == 0) + missing_stats_only = defGetBoolean(opt); + else if (!vacstmt->is_vacuumcmd) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), @@ -305,6 +309,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) (vacstmt->is_vacuumcmd ? VACOPT_VACUUM : VACOPT_ANALYZE) | (verbose ? VACOPT_VERBOSE : 0) | (skip_locked ? VACOPT_SKIP_LOCKED : 0) | + (missing_stats_only ? VACOPT_MISSING_STATS_ONLY :0) | (analyze ? VACOPT_ANALYZE : 0) | (freeze ? VACOPT_FREEZE : 0) | (full ? VACOPT_FULL : 0) | @@ -315,7 +320,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel) (only_database_stats ? VACOPT_ONLY_DATABASE_STATS : 0); /* sanity checks on options */ - Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE)); + Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE | VACOPT_MISSING_STATS_ONLY)); Assert((params.options & VACOPT_VACUUM) || !(params.options & (VACOPT_FULL | VACOPT_FREEZE))); diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h index e885a4b9..63e9d437 100644 --- a/src/include/commands/vacuum.h +++ b/src/include/commands/vacuum.h @@ -188,6 +188,8 @@ typedef struct VacAttrStats #define VACOPT_DISABLE_PAGE_SKIPPING 0x100 /* don't skip any pages */ #define VACOPT_SKIP_DATABASE_STATS 0x200 /* skip vac_update_datfrozenxid() */ #define VACOPT_ONLY_DATABASE_STATS 0x400 /* only vac_update_datfrozenxid() */ +#define VACOPT_MISSING_STATS_ONLY 0x800 /* ANALYZE if stats are missing */ + /* * Values used by index_cleanup and truncate params. diff --git a/src/test/regress/expected/analyze_missing_stats_only.out b/src/test/regress/expected/analyze_missing_stats_only.out new file mode 100644 index 00000000..9d33cd70 --- /dev/null +++ b/src/test/regress/expected/analyze_missing_stats_only.out @@ -0,0 +1,54 @@ +-- +-- ANALYZE (MISSING_STATS_ONLY) regression test +-- +-- 1. Brand new empty table +CREATE TABLE ms1 (a int); +-- No statistics yet → should analyze +ANALYZE (MISSING_STATS_ONLY); +-- pg_statistic should still be empty (table has no rows) +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + stat_rows +----------- + 0 +(1 row) + +-- 2. Insert data +INSERT INTO ms1 SELECT generate_series(1,10); +-- Stats missing → should analyze +ANALYZE (MISSING_STATS_ONLY); +-- Now stats must exist +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + stat_rows +----------- + 1 +(1 row) + +-- 3. Re-run → stats exist → should skip +ANALYZE (MISSING_STATS_ONLY); +-- Stats count should remain the same +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + stat_rows +----------- + 1 +(1 row) + +-- 4. Add new column → missing stats for column b +ALTER TABLE ms1 ADD COLUMN b int; +-- Should analyze again +ANALYZE (MISSING_STATS_ONLY); +-- Now both columns should have stats +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + stat_rows +----------- + 2 +(1 row) + +DROP TABLE ms1; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 549e9b2d..a64be382 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -90,6 +90,7 @@ test: rules psql psql_crosstab psql_pipeline amutils stats_ext collate.linux.utf test: select_parallel test: write_parallel test: vacuum_parallel +test: analyze_missing_stats_only # Run this alone, because concurrent DROP TABLE would make non-superuser # "ANALYZE;" fail with "relation with OID $n does not exist". diff --git a/src/test/regress/sql/analyze_missing_stats_only.sql b/src/test/regress/sql/analyze_missing_stats_only.sql new file mode 100644 index 00000000..0a6ed73c --- /dev/null +++ b/src/test/regress/sql/analyze_missing_stats_only.sql @@ -0,0 +1,47 @@ +-- +-- ANALYZE (MISSING_STATS_ONLY) regression test +-- + +-- 1. Brand new empty table +CREATE TABLE ms1 (a int); + +-- No statistics yet → should analyze +ANALYZE (MISSING_STATS_ONLY); + +-- pg_statistic should still be empty (table has no rows) +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + +-- 2. Insert data +INSERT INTO ms1 SELECT generate_series(1,10); + +-- Stats missing → should analyze +ANALYZE (MISSING_STATS_ONLY); + +-- Now stats must exist +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + +-- 3. Re-run → stats exist → should skip +ANALYZE (MISSING_STATS_ONLY); + +-- Stats count should remain the same +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + +-- 4. Add new column → missing stats for column b +ALTER TABLE ms1 ADD COLUMN b int; + +-- Should analyze again +ANALYZE (MISSING_STATS_ONLY); + +-- Now both columns should have stats +SELECT count(*) AS stat_rows +FROM pg_statistic +WHERE starelid = 'ms1'::regclass; + +DROP TABLE ms1; + -- 2.43.0
