Performance optimization for ANALYZE with extended statistics (dependencies)

Ilia Evdokimov Wed, 11 Jun 2025 12:27:52 -0700

Hi hackers,

When calculating functional dependencies for extended statistics(dependencies), the current implementation always sorts rows in sampleto measure the degree of dependency between columns.

However, if the column a8 in the dependency list has ndistinct = 1 (allvalues are the same) or ndistinct = -1 (all values are unique), then thefunctional dependency {a1, [... a7]} => a8 is always 1.0.

This patch adds a fast path to dependency_degree() to return 1.0immediately in such cases, skipping the expensive sorting step, whichmay involve up to many rows from sample.

There is a benchmark with the most extreme case with'default_statistics_target = 10000' on a table with 3 million rows and 8integer columns with unique values:


CREATE TABLE my_test_table_all_unique (
    id SERIAL PRIMARY KEY,
    col1 INTEGER,
    col2 INTEGER,
    col3 INTEGER,
    col4 INTEGER,
    col5 INTEGER,
    col6 INTEGER,
    col7 INTEGER,
    col8 INTEGER
);

INSERT INTO my_test_table_all_unique (col1, col2, col3, col4, col5,col6, col7, col8)

SELECT
    n AS col1,
    n + 10000000 AS col2,
    n + 20000000 AS col3,
    n + 30000000 AS col4,
    n + 40000000 AS col5,
    n + 50000000 AS col6,
    n + 60000000 AS col7,
    n + 70000000 AS col8
FROM generate_series(1, 3000000) as n;

CREATE STATISTICS (dependencies) ON col1, col2, col3, col4, col5, col6,col7, col8 FROM my_test_table_all_unique;


Before patch:
ANALYZE my_test_table_all_unique;
Time: 142114,970 ms (02:22,115)

After patch:
ANALYZE my_test_table_all_unique;
Time: 1273,664 ms (00:01,274)

I am happy to hear any feedback or suggestions for improvement.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

From 5d8b7293c3c72bf33bd533d1e3ff03c65a0b2fdc Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <[email protected]>
Date: Wed, 11 Jun 2025 19:31:05 +0300
Subject: [PATCH v1] Optimize functional dependency calculation for unique and
 same values

When building extended statistics with dependencies, the calculation
always sorts the sample data to compute the degree of functional
dependency between columns. However, if the last column in the
dependency list contains either all-equal or all-unique values,
the dependency is trivially 1.0.

This patch adds a fast path to skip the expensive sorting step
in such cases, significantly reducing ANALYZE time for large tables
with high statistics targets.
---
 src/backend/statistics/dependencies.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index eb2fc4366b4..3f896d7bc99 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -225,6 +225,7 @@ dependency_degree(StatsBuildData *data, int k, AttrNumber *dependency)
 	MultiSortSupport mss;
 	SortItem   *items;
 	AttrNumber *attnums_dep;
+	VacAttrStats *last_col_stats;
 
 	/* counters valid within a group */
 	int			group_size = 0;
@@ -236,6 +237,17 @@ dependency_degree(StatsBuildData *data, int k, AttrNumber *dependency)
 	/* Make sure we have at least two input attributes. */
 	Assert(k >= 2);
 
+	/*
+	 * If the last column has only one distinct value,
+	 * or if it has as many distinct values as rows (implying a unique key),
+	 * the dependency degree is 1.0.
+	 */
+	last_col_stats = data->stats[dependency[k - 1]];
+	if (last_col_stats->stadistinct == 1.0 || last_col_stats->stadistinct == -1.0)
+	{
+		return 1.0;
+	}
+
 	/* sort info for all attributes columns */
 	mss = multi_sort_init(k);
 
-- 
2.34.1

Performance optimization for ANALYZE with extended statistics (dependencies)

Reply via email to