Re: [HACKERS] Statistics and selectivity estimation for ranges

Heikki Linnakangas Wed, 15 Aug 2012 01:15:23 -0700

On 15.08.2012 10:38, Alexander Korotkov wrote:

On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas<
heikki.linnakan...@enterprisedb.com>  wrote:

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note that
the estimation of overlap selectivity could be implemented using separate
histograms of lower bounds and upper bounds, without requiring a histogram
of range lengths, because a&&  b == NOT (a<<  b OR a>>  b). I'm not sure if
the estimates we'd get that way would be better or worse than your current
method, but I think it would be easier to understand.

I don't think the&<  and&>  operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.


Oh, actually I didn't touch those operators. Selectivity estimation
functions for them were already in the catalog, they didn't work previously
just because no statistics.

Yeah, without the histogram, the scalar selectivity estimator sort-ofworks, in that it returns the estimate just based on the most commonvalues and a constant.

Histogram of upper bounds would be both more
accurate and natural for some operators. However, it requires collecting
additional statistics while AFAICS it doesn't liberate us from having
histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram ofupper bounds, that would be roughly the same amount of data as for the"standard" histogram with both bounds in the same histogram.

The code in that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify that? My first
thought is that there should be helper function that gets a range length as
argument, and returns the fraction of tuples with length>= argument. It
would do the lookup in the length histogram to find the right histogram
bin, and do the linear interpolation within the bin. You're assuming that
length is independent of lower/upper bound, so you shouldn't need any other
parameters than range length for that estimation.

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops significantly,
but I wasn't sure if there's some more intelligence in the way you combine
values from the length and lower bound histograms that you couldn't do with
such a helper function.


Yes, I also thought about something like this. But, in order to save
current estimate accuracy, it should be more complicated in following
reasons:
1) In last version, I don't estimate just fraction of tuples with length>=
argument, but area under length histogram between two length bounds
(length_hist_summ).
2) In histogram ends up before reaching given length bound we also need to
return place where it happened. Now it is performed by hist_frac *= (length
- prev_dist) / (dist - prev_dist).
I'm going to try some simplification with taking care about both mentioned
aspects.


Thanks.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Statistics and selectivity estimation for ranges

Reply via email to