Hi all,

First of all, sorry for the delay in posting the updated patch. I was
occupied with some other work and holidays, and it took me some time to
revisit the design carefully. After several iterations and reworking the
logic based on the earlier feedback, I have now arrived at what I believe
is a cleaner and more aligned implementation. I am attaching v4 of the
patch for review.

The intention is to provide a SQL-level equivalent of vacuumdb
--analyze-only --missing-stats-only, while keeping the default ANALYZE
behavior completely unchanged. When this option is specified, ANALYZE will
process only those relations that are missing statistics, and skip
relations that already have complete statistics.

A relation is considered to be missing statistics if at least one
analyzable attribute (as determined by examine_attribute()) does not have a
corresponding row in pg_statistic, or if there is an extended statistics
object in pg_statistic_ext without a matching row in pg_statistic_ext_data.
In other words, the implementation relies on catalog inspection rather than
counters or timestamps, and it reuses existing ANALYZE internals instead of
redefining missing-stats logic independently.

The check is placed inside analyze_rel(), after the standard relation
validation and skip conditions, so that it does not interfere with
privilege checks or special-relation handling. The default ANALYZE path
remains unchanged.

In terms of behavior:

* A brand new empty table is analyzed (since it has no statistics).
* Re-running on an empty table analyzes again, because there are still no
pg_statistic rows.
* A table with data but no statistics is analyzed.
* Re-running after statistics exist causes the table to be skipped.
* If a new column is added and lacks statistics, the table is analyzed
again.
* After statistics are created for that column, subsequent runs skip the
table.
* If statistics are manually deleted or effectively lost (e.g., crash
recovery scenarios affecting stats tracking), the table is analyzed again.

Repeated runs therefore converge toward a no-op once all relations have
complete statistics.

Regression tests are included.

As discussed earlier in the thread, I plan to start a new discussion and
patch series for a separate ANALYZE (MODIFIED_STATS) option that would
reuse autoanalyze-style thresholds. I believe keeping MISSING_STATS_ONLY
and MODIFIED_STATS as separate, clearly defined options makes the semantics
easier to reason about.

I would greatly appreciate further review and feedback on this version.
Thank you all for the detailed guidance and suggestions so far — especially
regarding reuse of examine_attribute() and alignment with vacuumdb
behavior. This process has been very educational for me.

Thanks,
Vasuki M
C-DAC, Chennai
From 978bb7ba59c1a008e21f7ce126e37ed85cc3d9cd Mon Sep 17 00:00:00 2001
From: Vasuki M <[email protected]>
Date: Fri, 13 Feb 2026 12:09:44 +0530
Subject: [PATCH v4] ANALYZE: add MISSING_STATS_ONLY option

Introduce a new ANALYZE option, MISSING_STATS_ONLY, which limits
ANALYZE to relations that are missing statistics.

When specified, ANALYZE will process only those relations that
have at least one analyzable attribute or extended statistics object
without corresponding entries in pg_statistic or pg_statistic_ext_data.

Relations that already have complete statistics are skipped.

This provides SQL-level functionality similar in spirit to
vacuumdb --missing-stats-only, allowing users to reduce unnecessary
work during repeated manual ANALYZE runs across many relations.

The behavior is conservative:
- Tables that have never been analyzed are processed.
- Tables that lost statistics (e.g., due to crash recovery) are processed.
- Tables with newly added columns lacking statistics are processed.
- Tables with complete statistics are skipped.

Default ANALYZE behavior is unchanged.

Regression tests are included.
---
 src/backend/commands/analyze.c                | 115 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |   7 +-
 src/include/commands/vacuum.h                 |   2 +
 .../expected/analyze_missing_stats_only.out   |  54 ++++++++
 src/test/regress/parallel_schedule            |   1 +
 .../sql/analyze_missing_stats_only.sql        |  47 +++++++
 6 files changed, 225 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/analyze_missing_stats_only.out
 create mode 100644 src/test/regress/sql/analyze_missing_stats_only.sql

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index a4834241..124244a7 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -29,6 +29,8 @@
 #include "catalog/index.h"
 #include "catalog/indexing.h"
 #include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic_ext.h"
+#include "catalog/pg_statistic_ext_data.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
@@ -96,6 +98,84 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static bool
+relation_has_missing_column_stats(Relation rel)
+{
+	int attnum;
+
+	for (attnum = 1; attnum <= rel->rd_att->natts; attnum++)
+	{
+		VacAttrStats *stats;
+		HeapTuple statstup;
+
+		/* Is this attribute analyzable at all? */
+		stats = examine_attribute(rel, attnum, NULL);
+		if (stats == NULL)
+			continue;
+
+		/* Does this attribute already have stats? */
+		statstup = SearchSysCache3(STATRELATTINH,
+						ObjectIdGetDatum(RelationGetRelid(rel)),
+						Int16GetDatum(attnum),
+						BoolGetDatum(false));
+
+		if (!HeapTupleIsValid(statstup))
+			return true;  /* missing stats */
+
+		ReleaseSysCache(statstup);
+	}
+
+	return false;
+}
+
+static bool
+relation_has_missing_extended_stats(Relation rel)
+{
+	Relation extrel;
+	SysScanDesc scan;
+	ScanKeyData key;
+	HeapTuple tup;
+
+	extrel = table_open(StatisticExtRelationId, AccessShareLock);
+
+	ScanKeyInit(&key,
+			Anum_pg_statistic_ext_stxrelid,
+			BTEqualStrategyNumber,
+			F_OIDEQ,
+			ObjectIdGetDatum(RelationGetRelid(rel)));
+
+	scan = systable_beginscan(extrel,
+					StatisticExtRelidIndexId,
+					true,
+					NULL,
+					1,
+					&key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_statistic_ext e =
+			(Form_pg_statistic_ext) GETSTRUCT(tup);
+
+		HeapTuple dtup =
+			SearchSysCache2(STATEXTDATASTXOID,
+						ObjectIdGetDatum(e->oid),
+						BoolGetDatum(false));
+
+		if (!HeapTupleIsValid(dtup))
+		{
+			systable_endscan(scan);
+			table_close(extrel, AccessShareLock);
+			return true;
+		}
+
+		ReleaseSysCache(dtup);
+	}
+
+	systable_endscan(scan);
+	table_close(extrel, AccessShareLock);
+
+	return false;
+}
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -234,6 +314,37 @@ analyze_rel(Oid relid, RangeVar *relation,
 		return;
 	}
 
+	/*
+	 * ANALYZE (MISSING_STATS_ONLY):
+	 * Skip relation if any relation already have complete statistics.
+	 */
+	if (params.options & VACOPT_MISSING_STATS_ONLY)
+	{
+		bool missing = false;
+
+		/* Check column stats */
+		if (relation_has_missing_column_stats(onerel))
+			missing = true;
+
+		/* Check extended stats */
+		else if (relation_has_missing_extended_stats(onerel))
+			missing = true;
+
+		if (!missing)
+		{
+			elog(DEBUG1,
+				"ANALYZE (MISSING_STATS_ONLY): skipping relation \"%s\"",
+				RelationGetRelationName(onerel));
+
+			table_close(onerel, ShareUpdateExclusiveLock);
+			return;
+		}
+
+		elog(DEBUG1,
+			"ANALYZE (MISSING_STATS_ONLY): relation eligible \"%s\"",
+			RelationGetRelationName(onerel));
+	}
+
 	/*
 	 * OK, let's do it.  First, initialize progress reporting.
 	 */
@@ -314,6 +425,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
 
+	elog(DEBUG1, "ANALYZE processing relation \"%s\" (OID %u)",
+			RelationGetRelationName(onerel),
+			RelationGetRelid(onerel));
+
 	verbose = (params.options & VACOPT_VERBOSE) != 0;
 	instrument = (verbose || (AmAutoVacuumWorkerProcess() &&
 							  params.log_analyze_min_duration >= 0));
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 03932f45..20981a9d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -165,6 +165,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	BufferAccessStrategy bstrategy = NULL;
 	bool		verbose = false;
 	bool		skip_locked = false;
+	bool		missing_stats_only = false;
 	bool		analyze = false;
 	bool		freeze = false;
 	bool		full = false;
@@ -229,6 +230,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 
 			ring_size = result;
 		}
+		else if (strcmp(opt->defname, "missing_stats_only") == 0)
+			missing_stats_only = defGetBoolean(opt);
+
 		else if (!vacstmt->is_vacuumcmd)
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -305,6 +309,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(vacstmt->is_vacuumcmd ? VACOPT_VACUUM : VACOPT_ANALYZE) |
 		(verbose ? VACOPT_VERBOSE : 0) |
 		(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
+		(missing_stats_only ? VACOPT_MISSING_STATS_ONLY :0) |
 		(analyze ? VACOPT_ANALYZE : 0) |
 		(freeze ? VACOPT_FREEZE : 0) |
 		(full ? VACOPT_FULL : 0) |
@@ -315,7 +320,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(only_database_stats ? VACOPT_ONLY_DATABASE_STATS : 0);
 
 	/* sanity checks on options */
-	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE));
+	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE | VACOPT_MISSING_STATS_ONLY));
 	Assert((params.options & VACOPT_VACUUM) ||
 		   !(params.options & (VACOPT_FULL | VACOPT_FREEZE)));
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e885a4b9..63e9d437 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -188,6 +188,8 @@ typedef struct VacAttrStats
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x100	/* don't skip any pages */
 #define VACOPT_SKIP_DATABASE_STATS 0x200	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x400	/* only vac_update_datfrozenxid() */
+#define VACOPT_MISSING_STATS_ONLY	0x800	/* ANALYZE if stats are missing */
+
 
 /*
  * Values used by index_cleanup and truncate params.
diff --git a/src/test/regress/expected/analyze_missing_stats_only.out b/src/test/regress/expected/analyze_missing_stats_only.out
new file mode 100644
index 00000000..9d33cd70
--- /dev/null
+++ b/src/test/regress/expected/analyze_missing_stats_only.out
@@ -0,0 +1,54 @@
+--
+-- ANALYZE (MISSING_STATS_ONLY) regression test
+--
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS_ONLY);
+-- pg_statistic should still be empty (table has no rows)
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         0
+(1 row)
+
+-- 2. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+-- Stats missing → should analyze
+ANALYZE (MISSING_STATS_ONLY);
+-- Now stats must exist
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         1
+(1 row)
+
+-- 3. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS_ONLY);
+-- Stats count should remain the same
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         1
+(1 row)
+
+-- 4. Add new column → missing stats for column b
+ALTER TABLE ms1 ADD COLUMN b int;
+-- Should analyze again
+ANALYZE (MISSING_STATS_ONLY);
+-- Now both columns should have stats
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+ stat_rows 
+-----------
+         2
+(1 row)
+
+DROP TABLE ms1;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 549e9b2d..a64be382 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -90,6 +90,7 @@ test: rules psql psql_crosstab psql_pipeline amutils stats_ext collate.linux.utf
 test: select_parallel
 test: write_parallel
 test: vacuum_parallel
+test: analyze_missing_stats_only
 
 # Run this alone, because concurrent DROP TABLE would make non-superuser
 # "ANALYZE;" fail with "relation with OID $n does not exist".
diff --git a/src/test/regress/sql/analyze_missing_stats_only.sql b/src/test/regress/sql/analyze_missing_stats_only.sql
new file mode 100644
index 00000000..0a6ed73c
--- /dev/null
+++ b/src/test/regress/sql/analyze_missing_stats_only.sql
@@ -0,0 +1,47 @@
+--
+-- ANALYZE (MISSING_STATS_ONLY) regression test
+--
+
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS_ONLY);
+
+-- pg_statistic should still be empty (table has no rows)
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 2. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+
+-- Stats missing → should analyze
+ANALYZE (MISSING_STATS_ONLY);
+
+-- Now stats must exist
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 3. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS_ONLY);
+
+-- Stats count should remain the same
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- 4. Add new column → missing stats for column b
+ALTER TABLE ms1 ADD COLUMN b int;
+
+-- Should analyze again
+ANALYZE (MISSING_STATS_ONLY);
+
+-- Now both columns should have stats
+SELECT count(*) AS stat_rows
+FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+DROP TABLE ms1;
+
-- 
2.43.0

Reply via email to