Dne 12.12.2010 15:43, Heikki Linnakangas napsal(a):
> On 12.12.2010 15:17, Martijn van Oosterhout wrote:
>> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>> Very cool that you're working on this.
>
> +1
>
>>> Lets talk about one special case - I'll explain how the proposed
>>> solution works, and then I'll explain how to make it more general, what
>>> improvements are possible, what issues are there. Anyway this is by no
>>> means a perfect or complete solution - it's just a starting point.
>>
>> It looks like you handled most of the issues. Just a few points:
>>
>> - This is obviously applicable to more than just integers, probably
>> anything with a b-tree operator class. What you've coded seems rely
>> on calculations on the values. Have you thought about how it could
>> work for, for example, strings?
>>
>> The classic failure case has always been: postcodes and city names.
>> Strongly correlated, but in a way that the computer can't easily see.
>
> Yeah, and that's actually analogous to the example I used in my
> presentation.
>
> The way I think of that problem is that once you know the postcode,
> knowing the city name doesn't add any information. The postcode implies
> the city name. So the selectivity for "postcode = ? AND city = ?" should
> be the selectivity of "postcode = ?" alone. The measurement we need is
> "implicativeness": How strongly does column A imply a certain value for
> column B. Perhaps that could be measured by counting the number of
> distinct values of column B for each value of column A, or something
> like that. I don't know what the statisticians call that property, or if
> there's some existing theory on how to measure that from a sample.
>
> That's assuming the combination has any matches. It's possible that the
> user chooses a postcode and city combination that doesn't exist, but
> that's no different from a user doing "city = 'fsdfsdfsd'" on a single
> column, returning no matches. We should assume that the combination
> makes sense.
Well, I've read several papers this week, in an attempt to find out what
the options are for implementing cross-column statistics. One of the
papers describes what looks like a reasonable solution to this particular
problem - discrete data + equality queries.
The article is "A Bayesian Approach to Estimating The Selectivity of
Conjunctive Predicates" (written by Heimel, Markl and Murthy). It's
freely available as a PDF
http://subs.emis.de/LNI/Proceedings/Proceedings144/52.pdf
I do have a PDF with my own notes in it (as annotations), let me know if
you're interested ...
The basic principle is that instead of 'attribute value independence'
(which is exactly the assumption that breaks down with correlated
columns) they assume 'uniform correlation', which is a much weaker
assumption.
In the end, all they need to compute an estimate is the number of
distinct values for each of the columns (we already have that in
pg_stats) and the number of distinct values for the group of columns used
in the query. They really don't need any multi-dimensional histogram or
anything like that.
The funny thing is I had implemented most of the PoC before reading the
article - the only difference was the way to combine the per-column
estimates.
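Just to be explicit, the way the attached PoC combines the two per-column
estimates (this is my reading of the estimator from the article - see
get_estimate() in the attachment for the exact code) is roughly

    estimate(A=a AND B=b) ~= dist(A)/dist(A,B) * estimate(A=a)
                           + dist(B)/dist(A,B) * estimate(B=b)

where dist(A), dist(B) and dist(A,B) are the numbers of distinct values
of the individual columns and of the column pair.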
------------------------ pros and cons --------------------------
So let's see the pros:
* it's reasonably simple and I guess the overhead should be minimal
  (we're already estimating the number of distinct values per column, so
  doing the same for a group of columns should have a reasonable impact)
* there are no 'advanced data structures' such as multi-dimensional
  histograms that are expensive to build and maintain
* it consistently produces better estimates than the current estimator
  based on the attribute value independence assumption, and it's
  reasonably accurate (I've been playing with it for some time, but of
  course we'll need more tests)
* it's easily extensible to more columns
* I guess we could improve the estimates with our own heuristics, to
  catch the special case when one column is perfectly implied by another
  one (e.g. when the state is implied by the ZIP code) - we can detect
  that by 'distinct for the combination = distinct for the column' and
  then just use the estimate for that column (see the query sketched
  right after this list)
but there are some cons too:
* this really is not designed to work with real-valued data - it's a
  very nice solution for discrete data (namely 'labels' such as
  city/state names, ZIP codes etc.)
* you need to know the list of columns in advance (there's nothing like
  'build the stats for columns (a,b,c) and then query just (a,b)'), as
  you need to know the number of distinct values for exactly the queried
  columns (well, maybe it's possible - I guess we could 'integrate' over
  all possible values of the omitted column)
* it's built for 'equality' queries, although maybe we could modify it
  for inequalities too (but it won't be as elegant); I guess equality
  queries are much more common for discrete data anyway (OK, you can run
  something like

      WHERE (zip_code BETWEEN '49389' AND '54980') AND state = '48'

  but I guess that's not very frequent, and I think we could still use
  the solution described in the paper).
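Just to illustrate the heuristics mentioned in the pros above - checking
whether one column implies the other boils down to comparing distinct
counts, e.g. like this (this is the same concatenation trick
collect_stats() in the attachment uses, just run by hand):

    SELECT COUNT(DISTINCT zip_code)                 AS dist_zip,
           COUNT(DISTINCT state)                    AS dist_state,
           COUNT(DISTINCT zip_code || ':' || state) AS dist_both
      FROM zip_codes_usa;

If dist_both = dist_zip, the ZIP code implies the state and the estimate
for 'zip_code = ?' alone can be used for the combination.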
------------------------ proof of concept --------------------------
I've prepared another 'proof of concept' example - see the attachment.
It works with data from
http://www.census.gov/geo/www/tiger/zip1999.zip
which is a list of US ZIP codes from 1999, along with info on GPS
coordinates, state, county etc. Basically a perfect example of the fail
case described above. It's in DBF format, so use http://pgdbf.sf.net (or
something like that) to extract the data.
Then use the attached script to create the table 'zip_codes_usa' (with
appropriate data types), the table 'cross_stats' used to store the stats
(the numbers of distinct values), and the plpgsql functions
collect_stats, get_estimate and get_frequency.
Run 'collect_stats' with the two columns as parameters, e.g.

    SELECT collect_stats('zip_code', 'state');

and then you can run 'get_estimate' with values for the columns, e.g.
something like this

    SELECT get_estimate('07034', '34');

which prints some debug notices and returns three integers - the
estimates for the two columns separately, and then the combined result.
So what really matters is the third number ...
Some basic tests, compared with the current estimator:
a) queries involving 'zip_code' - this is where the heuristics described
above kick in, so it actually returns the best possible estimate (unless
the combination with the state does not make sense, of course)
ZIP   | state | actual | current estimate | get_estimate
=========================================================
07034 |    34 |     32 |                1 |           32
07034 |    32 |      0 |                1 |           32
There are 32 rows with this combination as I've copied the data
multiple times to make the table bigger.
The 'better performance' of the current estimator in the second case
is actually just a lucky coincidence, a side-effect of the
independence assumption. And as Heikki already pointed out, we should
assume the combination makes sense, so this actually is not an
advantage of the current estimator.
b) queries involving 'post_name' and 'state' - this can't use the
heuristics as in (a), as there are 18,940 different post names, 59 states
and 30,113 combinations. On the other hand, the post name usually implies
the state.
post name | state | actual | current estimate | get_estimate
=============================================================
ESSEX     |    09 |     32 |                1 |           75
ELMSFORD  |    36 |     32 |                4 |          188
Hm, in this case the estimate is not as good as in the first case, but I
guess it's still a bit better than the current estimate. In the first row
it's roughly a 2.5x overestimate (compared to the 32x underestimate of
the current estimator), and in the second row roughly a 6x overestimate
(compared to the 8x underestimate of the current one).
Again, with an invalid combination (e.g. ESSEX/36) the current estimator
would perform better.
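(For the record, the weights used by the combination formula above are
18940/30113 ~= 0.63 for the post name estimate and 59/30113 ~= 0.002 for
the state estimate in this case, so you can recompute the combined
numbers from the per-column estimates printed in the debug notices.)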
regards
Tomas
--
-- create the table and load the data (the 'zipnov99' table holds the raw
-- data extracted from the DBF file, e.g. using pgdbf)
create table zip_codes_usa (zip_code char(5), latitude float, longitude float,
    zip_class char(1), post_name varchar(28), state char(2), county char(3));

insert into zip_codes_usa select zip_code, latitude::float, longitude::float,
    zip_class, poname, state, county from zipnov99;
-- table used to store the statistics: 'columns' holds the column names,
-- 'ndistinct' holds [dist(col_a), dist(col_b), dist(col_a, col_b)]
CREATE TABLE cross_stats (
    columns   VARCHAR[],
    ndistinct INT[]
);
/*
* Collects statistics.
*
* This is a very stupid / slow implementation.
*/
CREATE OR REPLACE FUNCTION collect_stats(p_col_a VARCHAR, p_col_b VARCHAR)
RETURNS void AS $$
DECLARE
    v_columns    VARCHAR[];
    v_dist_total INT;
    v_dist_a     INT;
    v_dist_b     INT;
    v_distinct   INT[];
BEGIN
    /* columns */
    v_columns := array_append(v_columns, p_col_a);
    v_columns := array_append(v_columns, p_col_b);

    RAISE NOTICE 'counting distinct values ...';

    EXECUTE 'SELECT COUNT(DISTINCT ' || p_col_a || ') dist_a,
                    COUNT(DISTINCT ' || p_col_b || ') dist_b,
                    COUNT(DISTINCT ' || p_col_a || ' || '':'' || ' || p_col_b || ') dist_total
               FROM zip_codes_usa'
       INTO v_dist_a, v_dist_b, v_dist_total;

    v_distinct := array_append(v_distinct, v_dist_a);
    v_distinct := array_append(v_distinct, v_dist_b);
    v_distinct := array_append(v_distinct, v_dist_total);

    -- save stats
    DELETE FROM cross_stats;
    INSERT INTO cross_stats VALUES (v_columns, v_distinct);
END;
$$ LANGUAGE plpgsql;
-- compute the estimate when there are equality conditions on both columns,
-- i.e. something like
-- ... WHERE (col_a = '40999') AND (col_b = '029')
CREATE OR REPLACE FUNCTION get_estimate(p_value_a VARCHAR, p_value_b VARCHAR)
RETURNS INT[] AS $$
DECLARE
    -- the estimate
    v_estimate  FLOAT;
    v_estimates INT[];
    -- coefficients
    v_count     FLOAT;
    v_coeffs    FLOAT[];
    v_ndistinct INT[];
    v_columns   VARCHAR[];
BEGIN
    SELECT columns, ndistinct INTO v_columns, v_ndistinct FROM cross_stats;
    SELECT reltuples INTO v_count FROM pg_class WHERE relname = 'zip_codes_usa';

    -- per-column estimates (frequency * estimated number of rows)
    v_estimate := v_count * get_frequency('zip_codes_usa', v_columns[1], p_value_a);
    v_estimates := array_append(v_estimates, round(v_estimate)::int);
    RAISE NOTICE 'estimate for column % is %', v_columns[1], v_estimate;

    v_estimate := v_count * get_frequency('zip_codes_usa', v_columns[2], p_value_b);
    v_estimates := array_append(v_estimates, round(v_estimate)::int);
    RAISE NOTICE 'estimate for column % is %', v_columns[2], v_estimate;

    -- heuristics (not part of the solution described in the article):
    -- if one column has as many distinct values as the combination, it
    -- implies the other column, so use its estimate directly
    IF (v_ndistinct[1] = v_ndistinct[3]) THEN
        v_estimate := v_estimates[1];
    ELSIF (v_ndistinct[2] = v_ndistinct[3]) THEN
        v_estimate := v_estimates[2];
    ELSE
        -- otherwise combine the two estimates ('uniform correlation')
        v_estimate := (v_ndistinct[1]::float / v_ndistinct[3]) * v_estimates[1] +
                      (v_ndistinct[2]::float / v_ndistinct[3]) * v_estimates[2];
    END IF;

    v_estimates := array_append(v_estimates, round(v_estimate)::int);
    RAISE NOTICE 'combined estimate is %', v_estimate;

    RETURN v_estimates;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION get_frequency(p_table VARCHAR, p_column VARCHAR,
                                         p_value VARCHAR) RETURNS FLOAT AS $$
DECLARE
    v_values VARCHAR[];
    v_freqs  FLOAT[];
    v_freq   FLOAT := 0;
    v_ndist  INT;
BEGIN
    -- note: pg_stats.n_distinct may also be negative (a fraction of the
    -- row count) - this PoC simply assumes a positive distinct count
    SELECT n_distinct, most_common_vals, most_common_freqs
      INTO v_ndist, v_values, v_freqs
      FROM pg_stats WHERE tablename = p_table AND attname = p_column;

    -- no MCV list - assume the distinct values are uniformly distributed
    IF (v_values IS NULL) THEN
        v_freq := (1::float / v_ndist);
        RAISE NOTICE 'frequency for column % is %', p_column, v_freq;
        RETURN v_freq;
    END IF;

    -- is it one of the MCVs?
    FOR v_idx IN 1..array_upper(v_values,1) LOOP
        IF (v_values[v_idx] = p_value) THEN
            v_freq := v_freqs[v_idx];
            RAISE NOTICE 'frequency for column % is % (from MCV)', p_column, v_freq;
            RETURN v_freqs[v_idx];
        END IF;
        v_freq := v_freq + v_freqs[v_idx];
    END LOOP;

    -- not an MCV - spread the remaining frequency over the remaining
    -- distinct values
    v_ndist := (v_ndist - array_upper(v_values,1));
    IF (v_ndist <= 0) THEN
        v_ndist := 1;
    END IF;
    v_freq := (1 - v_freq) / v_ndist;

    RAISE NOTICE 'frequency for column % is %', p_column, v_freq;
    RETURN v_freq;
END;
$$ LANGUAGE plpgsql;