Hi everyone,

one of the sessions I attended at PgDay last week was Heikki's session
about statistics in PostgreSQL. One of the issues he mentioned (and one
I regularly run into) is the absence of cross-column stats. When the
columns are not independent, this usually results in poor estimates (and
consequently in suboptimal plans).

I had been thinking about this issue before, but that session was the
final impulse that pushed me to hack up a proof of concept. So here it
is ;-)

Let's talk about one special case - I'll explain how the proposed
solution works, then how to make it more general, what improvements are
possible, and what issues remain. This is by no means a perfect or
complete solution - it's just a starting point.

------------------------ Short introduction ------------------------

Say we have a table with two INT columns - col_a and col_b, and we want
to estimate the number of rows for a condition involving both columns:

   WHERE (col_a BETWEEN m AND n) AND (col_b BETWEEN p AND q)

When the columns are independent, doing the estimate is just a matter of
multiplying the per-column selectivities. When the columns are dependent,
the estimate may be way off.
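
For illustration (the numbers here are made up), with independent columns
the planner simply multiplies the per-column selectivities, e.g.:

   -- 100,000 rows; the condition on col_a matches 10% of them,
   -- the condition on col_b matches 25% of them
   SELECT 100000 * 0.10 * 0.25;   -- => 2,500 estimated rows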

Let's assume there are histograms with 5 bins for both columns. What I
propose is basically building a 2D histogram - it resembles a
contingency table.

So we have a table like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===============||==============================
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
          20%  ||  4% |  4% |  4% |  4% |  4% |
===============||==============================

where each column / row represents a bin in the original histograms, and
each cell represents the expected fraction of rows falling into it (for
truly independent columns, it's 4%).

For dependent columns the actual values may of course be very different
- e.g. for strongly dependent columns it might look like this

col_b \ col_a  || 20% | 20% | 20% | 20% | 20% |
===============||==============================
          20%  || 20% |  0% |  0% |  0% |  0% |
          20%  ||  0% | 20% |  0% |  0% |  0% |
          20%  ||  0% |  0% | 20% |  0% |  0% |
          20%  ||  0% |  0% |  0% | 20% |  0% |
          20%  ||  0% |  0% |  0% |  0% | 20% |
===============||==============================

To estimate the number of rows matching the condition, you'd sum
estimates for cells matching the condition. I.e. when the condition on
col_a matches the lowest 20% (the first histogram bin) and the condition
on col_b matches the lowest 20% of values, this corresponds to the first
cell (20% of rows).

The current optimizer estimates this at 4%, as it believes the columns
are independent.
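
To put numbers on that (using the strongly dependent table above and, say,
100,000 rows in total):

   SELECT 100000 * 0.20;          -- cross-column estimate: 20,000 rows
   SELECT 100000 * 0.20 * 0.20;   -- independence assumption: 4,000 rows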

I'm not sure whether I've explained it well enough, but the essence of
the proposal is to build N-dimensional histograms (where N is the number
of columns covered by the statistics) just as we are building histograms
today.

----------------------- Proof of concept ---------------------------

I've hacked up a nasty proof of concept in PL/pgSQL (see the attachment).
It creates two tables - test_table and cross_stats. The former contains
the data (in two columns), the latter is used to store the cross-column
statistics of test_table.

Then there are several PL/pgSQL functions - two of them are really
important:

collect_stats(p_bins_a INT, p_bins_b INT)
  - collects the stats; the parameters specify the number of bins for
    the columns (the sides of the contingency table)
  - collect_stats(10,10) will build a contingency table with 100 cells

get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, p_to_b INT)
  - computes the estimate for the condition listed above (ranges on both
    columns)

So to run the PoC, just do something like this:

1) fill the table with some data

   INSERT INTO test_table SELECT round(random()*1000),
   round(random()*1000) FROM generate_series(1,100000);

2) collect the cross-column statistics

   SELECT collect_stats(10,10);

3) see the current estimate and the actual number of rows

   EXPLAIN ANALYZE SELECT * FROM test_table
                           WHERE (col_a BETWEEN 30 AND 129)
                             AND (col_b BETWEEN 234 AND 484);

4) see the estimate based on contingency table

   SELECT get_estimate(30, 129, 234, 484);
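
   (To cross-check the new estimate, the exact count can also be obtained
   directly with the same condition:

      SELECT count(*) FROM test_table
       WHERE (col_a BETWEEN 30 AND 129)
         AND (col_b BETWEEN 234 AND 484);

   which should match the actual row count reported by EXPLAIN ANALYZE.)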

Just two very simple tests for now - the col_a/col_b columns give the
ranges from the query, followed by the actual number of matching rows,
the current estimate, and finally the new estimates based on contingency
tables with various numbers of bins.

A) independent columns (showing that it produces estimates just as good
   as the current code)

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 |
  [50,70]  |  [50,70]  |    41  |     40   |   41  |    47 |
  [50,250] |  [50,250] |  4023  |   4024   | 4436  |  3944 |
  [50,250] | [750,950] |  4023  |   3955   | 4509  |  3933 |

B) strongly positively dependent columns (actually col_a = col_b)

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 | 100x100 |
  [50,70]  |  [50,70]  |   2143 |       57 |   391 |   729 |    1866 |
  [50,250] |  [50,250] |  20181 |     4111 | 15401 | 19983 |   19991 |
  [50,250] | [750,950] |      0 |     3977 |     0 |     0 |       0 |

  This shows that it significantly improves the estimates.

Again, I've hacked that in about 1 hour, so it's really slow and I think
some of the estimates may be further improved. And the precision
obviously depends on the number of cells.

----------------------- Additional notes ---------------------------

1) The whole thing can easily be extended to more than two columns -
   just build N-dimensional cubes.

2) I really don't think we should collect stats for all combinations of
   columns of a table - I do like the Oracle-like approach where a DBA
   has to enable cross-column stats using an ALTER TABLE (for a
   particular list of columns).

   The only exception might be columns from a multi-column index -
   collecting stats for those might be quite efficient, I guess.

3) There are independence tests for contingency tables (e.g. Pearson's
   Chi-squared test), so it's easy to find out whether the columns are
   independent. In that case we can just throw away these stats and use
   the simple estimation (a rough sketch of such a test follows after
   these notes).

   http://mathworld.wolfram.com/Chi-SquaredTest.html

4) Or we could store just those cells where the expected and observed
   values differ significantly (this may help if most of the values are
   independent, but there's a small glitch somewhere).

5) It does not make sense to collect stats for two lists of columns that
   share several columns - just collect cross stats for the union of the
   column lists and then aggregate ('roll up') as needed. An extreme case
   might be collecting stats for all columns in a table, but there are
   practical issues with that approach (huge contingency tables, very
   small counts within the individual cells).

6) The precision increases with the number of cells, but the number of
   cells grows exponentially with the number of columns. So it's
   necessary to choose a reasonable number of bins for the columns (this
   might be part of the ALTER TABLE command, and should not be firmly
   bound to the statistics target).

   At the cell level, the simple 'multiplication' estimate is still used
   (but only when the range covers a cell partially).

7) All this is done under the assumption that all the columns are in a
   single table - when doing research in the archives I've noticed
   suggestions to use this to optimize joins. Maybe it can be done but
   I really don't know how to do that.

8) I'm not sure where to store this and in what form - generally the
   table does not need to be significantly more complex than the
   cross_stats table from the PoC.

9) I've noticed demands to collect data about columns frequently used
   together in a single WHERE condition. I'm not sure how to do that or
   how to analyze the collected data. I think collecting data about the
   expected and observed number of rows should be just fine - but that's
   kind of independent from this proposal.
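
Regarding (3), here is a rough sketch (untested, and not part of the
attached PoC) of how Pearson's chi-squared statistic could be computed from
the cross_stats table - the resulting value would then be compared against a
chi-squared quantile with (bins_a - 1) * (bins_b - 1) degrees of freedom to
decide whether the columns look independent:

   CREATE OR REPLACE FUNCTION chi_squared_stat() RETURNS FLOAT AS $$
   DECLARE
     v_cont INT[];
     v_bins_a INT;
     v_bins_b INT;
     v_row_sums FLOAT[];
     v_col_sums FLOAT[];
     v_total FLOAT := 0;
     v_expected FLOAT;
     v_chi2 FLOAT := 0;
     v_idx INT;
   BEGIN

     -- load the linearized contingency table and the bin counts
     SELECT cross_histogram,
            array_upper(histogram_a,1) - 1,
            array_upper(histogram_b,1) - 1
       INTO v_cont, v_bins_a, v_bins_b FROM cross_stats;

     v_col_sums := array_fill(0::float, ARRAY[v_bins_a]);
     v_row_sums := array_fill(0::float, ARRAY[v_bins_b]);

     -- marginal (row / column) totals and the grand total
     FOR j IN 1..v_bins_b LOOP
       FOR i IN 1..v_bins_a LOOP
         v_idx := (j - 1) * v_bins_a + i;
         v_col_sums[i] := v_col_sums[i] + v_cont[v_idx];
         v_row_sums[j] := v_row_sums[j] + v_cont[v_idx];
         v_total := v_total + v_cont[v_idx];
       END LOOP;
     END LOOP;

     -- sum (observed - expected)^2 / expected over all cells
     FOR j IN 1..v_bins_b LOOP
       FOR i IN 1..v_bins_a LOOP
         v_expected := v_row_sums[j] * v_col_sums[i] / v_total;
         IF (v_expected > 0) THEN
           v_chi2 := v_chi2
                     + (v_cont[(j - 1) * v_bins_a + i] - v_expected)^2
                       / v_expected;
         END IF;
       END LOOP;
     END LOOP;

     RETURN v_chi2;

   END;
   $$ LANGUAGE plpgsql;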

regards
Tomas
-- a test table - just two integer columns
-- we'll fill it with data, collect stats and see what the estimated /
-- actual number of rows is
CREATE TABLE test_table (
    col_a INT,
    col_b INT
);

-- just to speed up the stats collection
CREATE INDEX test_table_a_idx ON test_table(col_a);
CREATE INDEX test_table_b_idx ON test_table(col_b);

-- table used to store statistics
CREATE TABLE cross_stats (
    histogram_a INT[],
    histogram_b INT[],
    cross_histogram INT[]
);

/*
* Collects statistics.
*
* This is a very stupid / slow implementation.
*/
CREATE OR REPLACE FUNCTION collect_stats(p_bins_a INT, p_bins_b INT) RETURNS void AS $$
DECLARE
  v_value INT;
  v_count INT;
  v_histogram_a INT[];
  v_histogram_b INT[];
  v_contingency_table INT[];
  v_row RECORD;
  v_bin_idx INT;
BEGIN

  -- count all rows
  SELECT count(*) INTO v_count FROM test_table;

  /* build histograms */

  -- lower borders
  SELECT MIN(col_a) INTO v_value FROM test_table;
  v_histogram_a := array_append(v_histogram_a, v_value);

  SELECT MIN(col_b) INTO v_value FROM test_table;
  v_histogram_b := array_append(v_histogram_b, v_value);

  -- inner borders
  FOR v_idx IN 1..(p_bins_a-1) LOOP
      SELECT col_a INTO v_value FROM test_table ORDER BY col_a
       LIMIT 1 OFFSET floor(v_idx * v_count / p_bins_a);
      v_histogram_a := array_append(v_histogram_a, v_value);
  END LOOP;

  FOR v_idx IN 1..(p_bins_b-1) LOOP
      SELECT col_b INTO v_value FROM test_table ORDER BY col_b
       LIMIT 1 OFFSET floor(v_idx * v_count / p_bins_b);
      v_histogram_b := array_append(v_histogram_b, v_value);
  END LOOP;

  -- upper borders
  SELECT MAX(col_a) INTO v_value FROM test_table;
  v_histogram_a := array_append(v_histogram_a, v_value);

  SELECT MAX(col_b) INTO v_value FROM test_table;
  v_histogram_b := array_append(v_histogram_b, v_value);

  /* build the contingency table */
  
  -- init
  FOR v_idx_a IN 1..(p_bins_a*p_bins_b) LOOP
    v_contingency_table := array_append(v_contingency_table, 0);
  END LOOP;

  FOR v_row IN (SELECT * FROM test_table) LOOP
    v_bin_idx := get_contingency_bin(v_histogram_a, v_histogram_b,
                                     v_row.col_a, v_row.col_b);
    v_contingency_table[v_bin_idx] := v_contingency_table[v_bin_idx] + 1;
  END LOOP;

  -- save stats
  DELETE FROM cross_stats;
  INSERT INTO cross_stats VALUES (v_histogram_a, v_histogram_b,
                                  v_contingency_table);

END;
$$ LANGUAGE plpgsql;

-- get ID of the bin in a (linearized) contingency table
CREATE OR REPLACE FUNCTION get_contingency_bin(p_histogram_a INT[], 
p_histogram_b INT[], p_value_a INT, p_value_b INT) RETURNS INT AS $$
DECLARE
  v_idx_a INT;
  v_idx_b INT;
BEGIN

  v_idx_a := get_histogram_bin(p_histogram_a, p_value_a);
  v_idx_b := get_histogram_bin(p_histogram_b, p_value_b);

  RETURN (v_idx_b - 1) * (array_upper(p_histogram_a,1)-1) + v_idx_a;

END;
$$ LANGUAGE plpgsql;

-- get bin in a histogram
CREATE OR REPLACE FUNCTION get_histogram_bin(p_histogram INT[], p_value INT) 
RETURNS INT AS $$
DECLARE
  v_tmp INT;
BEGIN

  -- slow, bisection should be used ...
  FOR v_idx IN 1..(array_upper(p_histogram,1)-1) LOOP
    IF (p_value >= p_histogram[v_idx]) THEN
      v_tmp := v_idx;
    END IF;
  END LOOP;

  RETURN v_tmp;

END;
$$ LANGUAGE plpgsql;
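
-- As noted above, the linear scan in get_histogram_bin is slow; the
-- following is just a sketch of a bisection-based variant (not used by the
-- functions above) that returns the same bin: the last bin whose lower
-- border is <= p_value.
CREATE OR REPLACE FUNCTION get_histogram_bin_bsearch(p_histogram INT[], p_value INT)
RETURNS INT AS $$
DECLARE
  v_lo INT := 1;
  v_hi INT := array_upper(p_histogram,1) - 1;   -- number of bins
  v_mid INT;
  v_result INT;
BEGIN

  WHILE (v_lo <= v_hi) LOOP
    v_mid := (v_lo + v_hi) / 2;
    IF (p_value >= p_histogram[v_mid]) THEN
      v_result := v_mid;     -- candidate bin, look further to the right
      v_lo := v_mid + 1;
    ELSE
      v_hi := v_mid - 1;
    END IF;
  END LOOP;

  RETURN v_result;

END;
$$ LANGUAGE plpgsql;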

-- compute the estimate when there are range conditions on both columns,
-- i.e. something like
-- ... WHERE (col_a BETWEEN 40 AND 75) AND (col_b BETWEEN 75 AND 1293)
CREATE OR REPLACE FUNCTION get_estimate(p_from_a INT, p_to_a INT, p_from_b INT, 
p_to_b INT) RETURNS INT AS $$
DECLARE

  -- bin indexes for col_a
  v_from_a_bin INT;
  v_to_a_bin INT;

  -- bin indexes for col_b
  v_from_b_bin INT;
  v_to_b_bin INT;

  -- the estimate
  v_estimate INT := 0;

  -- histograms (loaded from cross_stats)
  v_histogram_a INT[];
  v_histogram_b INT[];
  v_contingency INT[];
  v_cont_idx INT;

  -- coefficients (used to compute area of a single bin)
  v_coeff_a FLOAT;
  v_coeff_b FLOAT;

BEGIN

  SELECT histogram_a INTO v_histogram_a FROM cross_stats;
  SELECT histogram_b INTO v_histogram_b FROM cross_stats;
  SELECT cross_histogram INTO v_contingency FROM cross_stats;

  v_from_a_bin := get_histogram_bin(v_histogram_a, p_from_a);
  v_to_a_bin := get_histogram_bin(v_histogram_a, p_to_a);

  v_from_b_bin := get_histogram_bin(v_histogram_b, p_from_b);
  v_to_b_bin := get_histogram_bin(v_histogram_b, p_to_b);

  FOR v_idx_a IN v_from_a_bin..v_to_a_bin LOOP

    IF (v_from_a_bin = v_to_a_bin) THEN
      -- single bin
      v_coeff_a := (p_to_a - p_from_a)::float /
                   (v_histogram_a[v_from_a_bin+1] - v_histogram_a[v_from_a_bin]);
    ELSIF (v_idx_a = v_from_a_bin) THEN
      -- starting bin
      v_coeff_a := (v_histogram_a[v_from_a_bin+1] - p_from_a)::float /
                   (v_histogram_a[v_from_a_bin+1] - v_histogram_a[v_from_a_bin]);
    ELSIF (v_idx_a = v_to_a_bin) THEN
      -- last bin
      v_coeff_a := (p_to_a - v_histogram_a[v_to_a_bin])::float /
                   (v_histogram_a[v_to_a_bin+1] - v_histogram_a[v_to_a_bin]);
    ELSE
      -- inner bins
      v_coeff_a := 1;
    END IF;

    FOR v_idx_b IN v_from_b_bin..v_to_b_bin LOOP

      IF (v_from_b_bin = v_to_b_bin) THEN
        -- single bin
        v_coeff_b := (p_to_b - p_from_b)::float /
                     (v_histogram_b[v_from_b_bin+1] - v_histogram_b[v_from_b_bin]);
      ELSIF (v_idx_b = v_from_b_bin) THEN
        -- starting bin
        v_coeff_b := (v_histogram_b[v_from_b_bin+1] - p_from_b)::float /
                     (v_histogram_b[v_from_b_bin+1] - v_histogram_b[v_from_b_bin]);
      ELSIF (v_idx_b = v_to_b_bin) THEN
        -- last bin
        v_coeff_b := (p_to_b - v_histogram_b[v_to_b_bin])::float /
                     (v_histogram_b[v_to_b_bin+1] - v_histogram_b[v_to_b_bin]);
      ELSE
        -- inner bins
        v_coeff_b := 1;
      END IF;

      v_cont_idx := (v_idx_b - 1) * (array_upper(v_histogram_a,1)-1) + v_idx_a;

      v_estimate := v_estimate +
                    round(v_contingency[v_cont_idx] * v_coeff_a * v_coeff_b);

    END LOOP;
  END LOOP;

  RETURN v_estimate;

END;
$$ LANGUAGE plpgsql;

/*
  independent columns

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 |
  [50,70]  |  [50,70]  |    41  |     40   |   41  |    47 |
  [50,250] |  [50,250] |  4023  |   4024   | 4436  |  3944 |
  [50,250] | [750,950] |  4023  |   3955   | 4509  |  3933 |

*/
INSERT INTO test_table SELECT round(random()*1000), round(random()*1000) FROM 
generate_series(1,100000);
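
-- note: the two INSERTs in this file load alternative test data sets -
-- presumably the table should be emptied first (e.g. TRUNCATE test_table)
-- before switching from one scenario to the other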

/*
  positively dependent columns

   col_a   |   col_b   | actual | expected | 10x10 | 20x20 | 40x40 | 100x100 |
  [50,70]  |  [50,70]  |   2143 |     57   |   391 |   729 |  1468 |    1866 |
  [50,250] |  [50,250] |  20181 |   4111   | 15401 | 19983 | 19985 |   19991 |
  [50,250] | [750,950] |      0 |   3977   |     0 |     0 |     0 |       0 |

*/
INSERT INTO test_table SELECT val, val FROM (SELECT round(random()*1000) AS val 
FROM generate_series(1,100000)) foo;