On Fri, Jul 7, 2017 at 2:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Kuntal Ghosh <kuntalghosh.2...@gmail.com> writes:

Wow. Thank you for the wonderful explanation.

>> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> if histogram_bounds are assigned as,
>> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,139,..........}
>> ...
>> So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets
>> which means it assumes val is included in the previous bucket.
>
> I looked at that again and realized that one of the answers I was
> missing lies in the behavior of analyze.c's compute_scalar_stats().
> When it has collected "nvals" values and wants to make a histogram
> containing "num_hist" entries, it does this:
>
>  * The object of this loop is to copy the first and last values[]
>  * entries along with evenly-spaced values in between.  So the
>  * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].
>
> (where i runs from 0 to num_hist - 1).  Because the "/" denotes
> integer division, this effectively means that for all but the first
> entry, we are taking the last value out of the corresponding bucket
> of samples.  That's obviously true for the last histogram entry,
> which will be the very last sample value, and it's also true for
> earlier ones --- except for the *first* histogram entry, which is
> necessarily the first one in its bucket.  So the first histogram
> bucket is actually a tad narrower than the others, which is visible
> in the typical data you showed above: 0 to 9 is only 9 counts wide,
> but all the remaining buckets are 10 counts wide.  That explains why
> we need a scale adjustment just in the first bucket.

Agreed.
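To convince myself, I spelled that index arithmetic out in a small
standalone program. This is only a sketch, not the actual analyze.c
loop; nvals = 100 evenly spaced samples and num_hist = 11 entries are
made-up numbers chosen to reproduce the {0,9,19,...} bounds above:

/* sketch only: the histogram-entry selection described above */
#include <stdio.h>

int
main(void)
{
    int     nvals = 100;    /* sorted samples, say values[j] = j */
    int     num_hist = 11;  /* histogram entries wanted */
    int     i;

    for (i = 0; i < num_hist; i++)
    {
        /*
         * Integer division picks the last sample of each bucket,
         * except for i = 0, which is the first sample of its bucket.
         */
        int     idx = (i * (nvals - 1)) / (num_hist - 1);

        printf("entry %d: values[%d]\n", i, idx);
    }
    return 0;
}

With values[j] = j this prints entries at indexes 0, 9, 19, ..., 99,
i.e. exactly the bounds above; only the first bucket (0 to 9) is 9
counts wide instead of 10.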
But I have some doubt regarding the amount of adjustment needed:

+       if (i == 1)
+           histfrac += eq_selec * (1.0 - binfrac);

For the predicate x <= p, when p lies in the first bucket, the
following is the amount by which binfrac is off from the actual
fraction. (Assume the same histogram boundaries, i.e., 0,9,19,...)

For p=0, (1/10) - (0/9) = 1/10 * (1 - 0)
For p=1, (2/10) - (1/9) = 1/10 * (1 - 1/9)
For p=2, (3/10) - (2/9) = 1/10 * (1 - 2/9)
For p=3, (4/10) - (3/9) = 1/10 * (1 - 3/9)

In general, the error in binfrac is

    1/(high + 1 - low) * (1.0 - binfrac)

and the resulting difference in histfrac is

    (1.0 - binfrac) / (high + 1 - low) / num_of_hist_buckets
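As a quick sanity check of that error term, here is a standalone
cross-check (again just a sketch, not the selfuncs.c code; low = 0
and high = 9 are the first bucket's bounds from the histogram above,
and the "actual" fraction assumes the 10 distinct values 0..9 are
equally common):

/* sketch only: cross-checks the per-bucket error term derived above */
#include <stdio.h>

int
main(void)
{
    double  low = 0.0;      /* first histogram bound */
    double  high = 9.0;     /* second histogram bound */
    int     p;

    for (p = 0; p <= 3; p++)
    {
        /* linear-interpolation estimate within the bucket */
        double  binfrac = (p - low) / (high - low);

        /* true fraction of the bucket's values satisfying x <= p */
        double  actual = (p + 1 - low) / (high + 1 - low);

        printf("p=%d error=%.4f formula=%.4f\n",
               p, actual - binfrac,
               (1.0 - binfrac) / (high + 1 - low));
    }
    return 0;
}

Both columns come out equal (0.1000, 0.0889, 0.0778, 0.0667),
matching the table above.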
So, the above adjustment for the first bucket looks correct only when

    otherdistinct ~ (high + 1 - low) * num_of_hist_buckets

Is that always true, or am I missing something?

> I think I'm also beginning to grasp why scalarineqsel's basic
> estimation process produces an estimate for "x <= p" rather than some
> other condition such as "x < p".  For clarity, imagine that we're
> given p equal to the last histogram entry.  If the test operator is
> in fact "<=", then it will say "true" for every histogram entry, and
> we'll fall off the end of the histogram and return 1.0, which is
> correct.  If the test operator is "<", it will return "true" for all
> but the last entry, so that we end up with "i" equal to
> sslot.nvalues - 1 --- but we will compute binfrac = 1.0, if
> convert_to_scalar() produces sane answers, again resulting in
> histfrac = 1.0.  Similar reasoning applies for ">" and ">=", so that
> in all cases we arrive at histfrac = 1.0 if p equals the last
> histogram entry.  More generally, if p is equal to some interior
> histogram entry, say the k'th one out of n total, we end up with
> histfrac = (k-1)/(n-1) no matter which operator we probe with,
> assuming that convert_to_scalar() correctly gives us a binfrac of
> 0.0 or 1.0 depending on which of the adjacent bins we end up
> examining.  So, remembering that the histogram entries occupy the
> right ends of their buckets, the proper interpretation of that is
> that it's the probability of "x <= p".  (This is wrong for the first
> histogram entry, but that's why we need a correction there.)

True. In general, histfrac = ((k - 1) + binfrac) / (n - 1), where
binfrac comes from the linear interpolation within the bucket. When p
is a histogram boundary, histfrac is always the probability of
"x <= p". When p lies inside a bucket and the distribution within the
bucket is flat, it is also the probability of "x <= p". But when there
is high variance inside a bucket and p lies in its interior, the
linear interpolation fails to capture the actual probability. As
you've mentioned earlier, though, I guess that's not a new problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers