Hi all,

I am back with a new patch.

As a follow-up to the previous discussion, I’m posting a revised v3 patch
that introduces an opt-in ANALYZE (MISSING_STATS) option.

The goal is to reduce unnecessary work when running manual ANALYZE over
many relations, while keeping the default behavior unchanged.

Overview:
ANALYZE (MISSING_STATS) analyzes only relations that currently have no
statistics entries in pg_statistic. Relations that already have statistics
are skipped.

This is conceptually similar to vacuumdb --missing-stats-only, but exposed
at the SQL ANALYZE level for interactive and scripted use.

The decision is intentionally table-level and conservative:

- If a relation has at least one pg_statistic entry, it is considered to
  “have stats” and may be skipped.
- If no pg_statistic rows exist (new table, stats removed, crash reset),
  the relation is analyzed.

No thresholds or modification counters are used in this option.

Behavior summary for ANALYZE(MISSING_STATS);


New empty table------------------------------->analyzed
Re-run on empty table------------------------> analyzed
Table with data but no stats-----------------> analyzed
Re-run after stats exist-----------------------> skipped
Add new column after ANALYZE----------> analyzed
Re-run after column stats exist-------------> skipped
Statistics manually deleted (pg_statistic) -> analyzed
Statistics lost after crash recovery-----------> analyzed
Regular ANALYZE -------------------------------> unchanged behavior

This ensures that ANALYZE (MISSING_STATS) converges toward a no-op on
subsequent runs, while still recovering from missing or invalid statistics.

Scope and limitations:
- Applies only to regular relations.
- Uses pg_statistic directly (not pg_stats or pg_stat views).
- Does not consider modification thresholds or autovacuum heuristics.
- Partitioned tables, inheritance, and extended statistics are not handled
  yet and can be considered separately.

I would appreciate feedback on:

- Whether this behavior and naming align with expectations.
- Any edge cases I may have missed.
- Whether this is a reasonable first step before considering more advanced
  options (e.g., modified-stats thresholds).

While testing i have noted this :
analyze_test=# ALTER TABLE ms1 ADD COLUMN b int;
ALTER TABLE
Time: 44.665 ms
analyze_test=# ANALYZE (MISSING_STATS);
DEBUG:  ANALYZE processing relation "ms1" (OID 32791)
analyze_test=# SELECT attname
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a
  ON a.attrelid = c.oid AND a.attnum = s.staattnum
WHERE c.relname = 'ms1'
ORDER BY attname;
 attname
---------
 a
 b
(2 rows)

Time: 1.390 ms
analyze_test=# ANALYZE (MISSING_STATS);
DEBUG:  ANALYZE (MISSING_STATS): skipping relation "ms1" (OID 32791)
analyze_test=# SELECT
  a.attname,
  s.stanullfrac,
  s.stadistinct
FROM pg_statistic s
JOIN pg_class c ON c.oid = s.starelid
JOIN pg_attribute a
  ON a.attrelid = c.oid AND a.attnum = s.staattnum
WHERE c.relname = 'ms1';
 attname | stanullfrac | stadistinct
---------+-------------+-------------
 a       |           0 |          -1
 b       |           1 |           0
(2 rows)

Time: 0.733 ms

Note:
[1] The empty table[if a table has 0 rows] --> no pg_statistic rows at all
[2] If a table has >= 1 row then postgres creates pg_statistic rows for
every user column,even if the
      - Column is entirely null
      - The column was added later
      - The column has never had a non-null values

Thanks for your time and review, I will post the next patch for
modified_stats shortly.



*Vasuki MC-DAC,Chennai*
From 1ad50bbefbab0fa1e75aa696187066e1b11896f4 Mon Sep 17 00:00:00 2001
From: Vasuki M <[email protected]>
Date: Thu, 29 Jan 2026 17:25:26 +0530
Subject: [PATCH] ANALYZE: add MISSING_STATS option to skip relations with
 existing statistics

Introduce an opt-in ANALYZE (MISSING_STATS) mode that analyzes only
relations which currently have no entries in pg_statistic.

This allows repeated manual ANALYZE runs to avoid unnecessary work on
relations whose statistics already exist, while still analyzing:
- newly created tables,
- tables that have gained data but lack statistics,
- relations whose statistics were lost or reset.

The default ANALYZE behavior is unchanged.

Regression tests are included.
---
 src/backend/commands/analyze.c                | 48 +++++++++++++++++
 src/backend/commands/vacuum.c                 |  6 ++-
 src/include/commands/vacuum.h                 |  1 +
 .../expected/analyze_missing_stats.out        | 54 +++++++++++++++++++
 src/test/regress/parallel_schedule            |  1 +
 .../regress/sql/analyze_missing_stats.sql     | 52 ++++++++++++++++++
 6 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/analyze_missing_stats.out
 create mode 100644 src/test/regress/sql/analyze_missing_stats.sql

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index a4834241..9d846cf2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -46,6 +46,7 @@
 #include "storage/procarray.h"
 #include "utils/attoptcache.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -140,6 +141,53 @@ analyze_rel(Oid relid, RangeVar *relation,
 	onerel = vacuum_open_relation(relid, relation, params.options & ~(VACOPT_VACUUM),
 								  params.log_analyze_min_duration >= 0,
 								  ShareUpdateExclusiveLock);
+	/*
+	 * ANALYZE (MISSING_STATS):
+	 * Skip relations that already have statistics in pg_statistic.
+	 * This is table-level, not column-level.
+	 */
+	if ((params.options & VACOPT_MISSING_STATS) &&
+		onerel->rd_rel->relkind == RELKIND_RELATION)
+	{
+		Relation        pgstatistic;
+		SysScanDesc    scan;
+		ScanKeyData    key;
+		HeapTuple      tup;
+		bool            has_stats = false;
+
+		pgstatistic = table_open(StatisticRelationId, AccessShareLock);
+
+		ScanKeyInit(&key,
+				Anum_pg_statistic_starelid,
+				BTEqualStrategyNumber,
+				F_OIDEQ,
+				ObjectIdGetDatum(RelationGetRelid(onerel)));
+
+		scan = systable_beginscan(pgstatistic,
+						InvalidOid,
+						false,
+						NULL,
+						1,
+						&key);
+
+		tup = systable_getnext(scan);
+		if (HeapTupleIsValid(tup))
+			has_stats = true;
+
+		systable_endscan(scan);
+		table_close(pgstatistic, AccessShareLock);
+
+		if (has_stats)
+		{
+			elog(DEBUG1,
+				"ANALYZE (MISSING_STATS): skipping relation \"%s\"",
+				RelationGetRelationName(onerel));
+
+			table_close(onerel, ShareUpdateExclusiveLock);
+			return;
+		}
+	}
+
 
 	/* leave if relation could not be opened or locked */
 	if (!onerel)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 03932f45..3cf3f210 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -165,6 +165,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	BufferAccessStrategy bstrategy = NULL;
 	bool		verbose = false;
 	bool		skip_locked = false;
+        bool            missing_stats = false;
 	bool		analyze = false;
 	bool		freeze = false;
 	bool		full = false;
@@ -229,6 +230,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 
 			ring_size = result;
 		}
+		else if (strcmp(opt->defname, "missing_stats") == 0)
+			missing_stats = defGetBoolean(opt);
 		else if (!vacstmt->is_vacuumcmd)
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -305,6 +308,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(vacstmt->is_vacuumcmd ? VACOPT_VACUUM : VACOPT_ANALYZE) |
 		(verbose ? VACOPT_VERBOSE : 0) |
 		(skip_locked ? VACOPT_SKIP_LOCKED : 0) |
+		(missing_stats ? VACOPT_MISSING_STATS :0) |
 		(analyze ? VACOPT_ANALYZE : 0) |
 		(freeze ? VACOPT_FREEZE : 0) |
 		(full ? VACOPT_FULL : 0) |
@@ -315,7 +319,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		(only_database_stats ? VACOPT_ONLY_DATABASE_STATS : 0);
 
 	/* sanity checks on options */
-	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE));
+	Assert(params.options & (VACOPT_VACUUM | VACOPT_ANALYZE | VACOPT_MISSING_STATS));
 	Assert((params.options & VACOPT_VACUUM) ||
 		   !(params.options & (VACOPT_FULL | VACOPT_FREEZE)));
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e885a4b9..55c87958 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -188,6 +188,7 @@ typedef struct VacAttrStats
 #define VACOPT_DISABLE_PAGE_SKIPPING 0x100	/* don't skip any pages */
 #define VACOPT_SKIP_DATABASE_STATS 0x200	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x400	/* only vac_update_datfrozenxid() */
+#define VACOPT_MISSING_STATS       0x08		/* ANALYZE if stats are missing */
 
 /*
  * Values used by index_cleanup and truncate params.
diff --git a/src/test/regress/expected/analyze_missing_stats.out b/src/test/regress/expected/analyze_missing_stats.out
new file mode 100644
index 00000000..b39a82ca
--- /dev/null
+++ b/src/test/regress/expected/analyze_missing_stats.out
@@ -0,0 +1,54 @@
+--
+-- ANALYZE (MISSING_STATS) regression test
+--
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS);
+-- pg_statistic should still be empty (table has no rows)
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+ has_stats 
+-----------
+ f
+(1 row)
+
+-- 2. Re-run → still no stats → should analyze again
+ANALYZE (MISSING_STATS);
+-- 3. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+-- Stats still missing → should analyze
+ANALYZE (MISSING_STATS);
+-- Now stats must exist
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+ has_stats 
+-----------
+ t
+(1 row)
+
+-- 4. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS);
+-- 5. Simulate stats loss (crash/reset)
+DELETE FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+-- Stats missing again → should analyze
+ANALYZE (MISSING_STATS);
+-- Stats restored
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+ has_stats 
+-----------
+ t
+(1 row)
+
+DROP TABLE ms1;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 021d57f6..001def94 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -90,6 +90,7 @@ test: rules psql psql_crosstab psql_pipeline amutils stats_ext collate.linux.utf
 test: select_parallel
 test: write_parallel
 test: vacuum_parallel
+test: analyze_missing_stats
 
 # Run this alone, because concurrent DROP TABLE would make non-superuser
 # "ANALYZE;" fail with "relation with OID $n does not exist".
diff --git a/src/test/regress/sql/analyze_missing_stats.sql b/src/test/regress/sql/analyze_missing_stats.sql
new file mode 100644
index 00000000..9457ff04
--- /dev/null
+++ b/src/test/regress/sql/analyze_missing_stats.sql
@@ -0,0 +1,52 @@
+--
+-- ANALYZE (MISSING_STATS) regression test
+--
+
+-- 1. Brand new empty table
+CREATE TABLE ms1 (a int);
+
+-- No statistics yet → should analyze
+ANALYZE (MISSING_STATS);
+
+-- pg_statistic should still be empty (table has no rows)
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+
+-- 2. Re-run → still no stats → should analyze again
+ANALYZE (MISSING_STATS);
+
+-- 3. Insert data
+INSERT INTO ms1 SELECT generate_series(1,10);
+
+-- Stats still missing → should analyze
+ANALYZE (MISSING_STATS);
+
+-- Now stats must exist
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+
+-- 4. Re-run → stats exist → should skip
+ANALYZE (MISSING_STATS);
+
+-- 5. Simulate stats loss (crash/reset)
+DELETE FROM pg_statistic
+WHERE starelid = 'ms1'::regclass;
+
+-- Stats missing again → should analyze
+ANALYZE (MISSING_STATS);
+
+-- Stats restored
+SELECT EXISTS (
+  SELECT 1
+  FROM pg_statistic
+  WHERE starelid = 'ms1'::regclass
+) AS has_stats;
+
+DROP TABLE ms1;
+
-- 
2.43.0

Reply via email to