On 07/16/2018 12:16 AM, Tomas Vondra wrote:
On 07/15/2018 04:43 PM, Dean Rasheed wrote:
On 15 July 2018 at 14:29, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
It's quite unclear to me how this algorithm could reliably end up with
hist_sel=0 (in cases where we already don't end up with that). I mean,
if a bucket matches the conditions, then the only way to eliminate is by
deducing that MCV already contains all the matches - and that's rather
difficult for complex clauses ...
Ah, I didn't realise that you were using histograms for equality
clauses as well. I had assumed that they would only use the MCV stats,
as in the univariate case. Using histograms for equality seems
problematic -- if bucket_contains_value() returns STATS_MATCH_PARTIAL,
as things stand that would end up with an estimate of half the
bucket's frequency, which seems excessive.
Yeah, I think you're right - this is likely to produce over-estimates
and needs rethinking. With top-level equality clauses it should be
fairly trivial to use approach similar to the univariate case, i.e.
computing ndistinct and using
(1 - mcv_totalsel) / ndistinct
If there are ndistinct coefficients this might be pretty good estimate
of the non-MCV part, I think. But it only works for top-level
equalities, not for complex clauses as in your examples.
On further thought, it's a bit more complicated, actually. Firstly, we
already do that when there's no histogram (as in your example), and
clearly it does not help. I initially thought it's a mistake to use the
histogram in this case, but I can think of cases where it helps a lot.
1) when the equality clauses match nothing
In this case we may not find any buckets possibly matching the
combination of values, producing selectivity estimate 0.0. While by
using 1/ndistinct would give us something else.
2) when there are equality and inequality clauses
Similarly to the previous case, the equality clauses are useful in
eliminating some of the buckets.
Now, I agree estimating equality clauses using histogram is tricky, so
perhaps what we should do is using them as "conditions" to eliminate
histogram buckets, but use ndistinct to estimate the selectivity. That
is something like this:
P(a=1 & b=1 & c<10 & d>=100)
= P(a=1 & b=1) * P(c<10 & d>=100 | a=1 & b=1)
= 1/ndistinct(a,b) * P(c<10 & d>=100 | a=1 & b=1)
where the second part is estimated using the histogram.
Of course, this still only works for the top-level equality clauses :-(
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services