Hi hackers,
When calculating functional dependencies for extended statistics
(dependencies), the current implementation always sorts rows in sample
to measure the degree of dependency between columns.
However, if the column a8 in the dependency list has ndistinct = 1 (all
values are the same) or ndistinct = -1 (all values are unique), then the
functional dependency {a1, [... a7]} => a8 is always 1.0.
This patch adds a fast path to dependency_degree() to return 1.0
immediately in such cases, skipping the expensive sorting step, which
may involve up to many rows from sample.
There is a benchmark with the most extreme case with
'default_statistics_target = 10000' on a table with 3 million rows and 8
integer columns with unique values:
CREATE TABLE my_test_table_all_unique (
id SERIAL PRIMARY KEY,
col1 INTEGER,
col2 INTEGER,
col3 INTEGER,
col4 INTEGER,
col5 INTEGER,
col6 INTEGER,
col7 INTEGER,
col8 INTEGER
);
INSERT INTO my_test_table_all_unique (col1, col2, col3, col4, col5,
col6, col7, col8)
SELECT
n AS col1,
n + 10000000 AS col2,
n + 20000000 AS col3,
n + 30000000 AS col4,
n + 40000000 AS col5,
n + 50000000 AS col6,
n + 60000000 AS col7,
n + 70000000 AS col8
FROM generate_series(1, 3000000) as n;
CREATE STATISTICS (dependencies) ON col1, col2, col3, col4, col5, col6,
col7, col8 FROM my_test_table_all_unique;
Before patch:
ANALYZE my_test_table_all_unique;
Time: 142114,970 ms (02:22,115)
After patch:
ANALYZE my_test_table_all_unique;
Time: 1273,664 ms (00:01,274)
I am happy to hear any feedback or suggestions for improvement.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
From 5d8b7293c3c72bf33bd533d1e3ff03c65a0b2fdc Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdoki...@tantorlabs.ru>
Date: Wed, 11 Jun 2025 19:31:05 +0300
Subject: [PATCH v1] Optimize functional dependency calculation for unique and
same values
When building extended statistics with dependencies, the calculation
always sorts the sample data to compute the degree of functional
dependency between columns. However, if the last column in the
dependency list contains either all-equal or all-unique values,
the dependency is trivially 1.0.
This patch adds a fast path to skip the expensive sorting step
in such cases, significantly reducing ANALYZE time for large tables
with high statistics targets.
---
src/backend/statistics/dependencies.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index eb2fc4366b4..3f896d7bc99 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -225,6 +225,7 @@ dependency_degree(StatsBuildData *data, int k, AttrNumber *dependency)
MultiSortSupport mss;
SortItem *items;
AttrNumber *attnums_dep;
+ VacAttrStats *last_col_stats;
/* counters valid within a group */
int group_size = 0;
@@ -236,6 +237,17 @@ dependency_degree(StatsBuildData *data, int k, AttrNumber *dependency)
/* Make sure we have at least two input attributes. */
Assert(k >= 2);
+ /*
+ * If the last column has only one distinct value,
+ * or if it has as many distinct values as rows (implying a unique key),
+ * the dependency degree is 1.0.
+ */
+ last_col_stats = data->stats[dependency[k - 1]];
+ if (last_col_stats->stadistinct == 1.0 || last_col_stats->stadistinct == -1.0)
+ {
+ return 1.0;
+ }
+
/* sort info for all attributes columns */
mss = multi_sort_init(k);
--
2.34.1