On Fri, Jul 7, 2017 at 2:53 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Kuntal Ghosh <kuntalghosh.2...@gmail.com> writes:

Wow. Thank you for the wonderful explanation.

>> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> if histogram_bounds are assigned as,
>> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,139,..........}
>> ...
>> So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets
>> which means it assumes val is included in the previous bucket.
>
> I looked at that again and realized that one of the answers I was
> missing lies in the behavior of analyze.c's compute_scalar_stats().
> When it has collected "nvals" values and wants to make a histogram
> containing "num_hist" entries, it does this:
>
>  * The object of this loop is to copy the first and last values[]
>  * entries along with evenly-spaced values in between.  So the
>  * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].
>
> (where i runs from 0 to num_hist - 1).  Because the "/" denotes
> integer division, this effectively means that for all but the first
> entry, we are taking the last value out of the corresponding bucket
> of samples.  That's obviously true for the last histogram entry,
> which will be the very last sample value, and it's also true for
> earlier ones --- except for the *first* histogram entry, which is
> necessarily the first one in its bucket.  So the first histogram
> bucket is actually a tad narrower than the others, which is visible
> in the typical data you showed above: 0 to 9 is only 9 counts wide,
> but all the remaining buckets are 10 counts wide.  That explains why
> we need a scale adjustment just in the first bucket.

Agreed.
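To convince myself, I spelled that index arithmetic out in a small
standalone program. This is only a sketch, not the actual analyze.c
loop; nvals = 100 evenly spaced samples and num_hist = 11 entries are
made-up numbers chosen to reproduce the {0,9,19,...} bounds above:

/* sketch only: the histogram-entry selection described above */
#include <stdio.h>

int
main(void)
{
    int     nvals = 100;    /* sorted samples, say values[j] = j */
    int     num_hist = 11;  /* histogram entries wanted */
    int     i;

    for (i = 0; i < num_hist; i++)
    {
        /*
         * Integer division picks the last sample of each bucket,
         * except for i = 0, which is the first sample of its bucket.
         */
        int     idx = (i * (nvals - 1)) / (num_hist - 1);

        printf("entry %d: values[%d]\n", i, idx);
    }
    return 0;
}

With values[j] = j this prints entries at indexes 0, 9, 19, ..., 99,
i.e. exactly the bounds above; only the first bucket (0 to 9) is 9
counts wide instead of 10.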
But I have some doubt regarding the amount of adjustment needed:

+       if (i == 1)
+           histfrac += eq_selec * (1.0 - binfrac);

For the predicate x <= p, when p lies in the first bucket, the
following is the amount by which binfrac is off from the actual
fraction. (Assume the same histogram boundaries, i.e., 0,9,19,...)

For p=0, (1/10) - (0/9) = 1/10 * (1 - 0)
For p=1, (2/10) - (1/9) = 1/10 * (1 - 1/9)
For p=2, (3/10) - (2/9) = 1/10 * (1 - 2/9)
For p=3, (4/10) - (3/9) = 1/10 * (1 - 3/9)

In general, the error in binfrac is

    1/(high + 1 - low) * (1.0 - binfrac)

and the resulting difference in histfrac is

    (1.0 - binfrac) / (high + 1 - low) / num_of_hist_buckets
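As a quick sanity check of that error term, here is a standalone
cross-check (again just a sketch, not the selfuncs.c code; low = 0
and high = 9 are the first bucket's bounds from the histogram above,
and the "actual" fraction assumes the 10 distinct values 0..9 are
equally common):

/* sketch only: cross-checks the per-bucket error term derived above */
#include <stdio.h>

int
main(void)
{
    double  low = 0.0;      /* first histogram bound */
    double  high = 9.0;     /* second histogram bound */
    int     p;

    for (p = 0; p <= 3; p++)
    {
        /* linear-interpolation estimate within the bucket */
        double  binfrac = (p - low) / (high - low);

        /* true fraction of the bucket's values satisfying x <= p */
        double  actual = (p + 1 - low) / (high + 1 - low);

        printf("p=%d error=%.4f formula=%.4f\n",
               p, actual - binfrac,
               (1.0 - binfrac) / (high + 1 - low));
    }
    return 0;
}

Both columns come out equal (0.1000, 0.0889, 0.0778, 0.0667),
matching the table above.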
So, the above adjustment for the first bucket looks correct only when

    otherdistinct ~ (high + 1 - low) * num_of_hist_buckets

Is that always true, or am I missing something?

> I think I'm also beginning to grasp why scalarineqsel's basic
> estimation process produces an estimate for "x <= p" rather than some
> other condition such as "x < p".  For clarity, imagine that we're
> given p equal to the last histogram entry.  If the test operator is
> in fact "<=", then it will say "true" for every histogram entry, and
> we'll fall off the end of the histogram and return 1.0, which is
> correct.  If the test operator is "<", it will return "true" for all
> but the last entry, so that we end up with "i" equal to
> sslot.nvalues - 1 --- but we will compute binfrac = 1.0, if
> convert_to_scalar() produces sane answers, again resulting in
> histfrac = 1.0.  Similar reasoning applies for ">" and ">=", so that
> in all cases we arrive at histfrac = 1.0 if p equals the last
> histogram entry.  More generally, if p is equal to some interior
> histogram entry, say the k'th one out of n total, we end up with
> histfrac = (k-1)/(n-1) no matter which operator we probe with,
> assuming that convert_to_scalar() correctly gives us a binfrac of
> 0.0 or 1.0 depending on which of the adjacent bins we end up
> examining.  So, remembering that the histogram entries occupy the
> right ends of their buckets, the proper interpretation of that is
> that it's the probability of "x <= p".  (This is wrong for the first
> histogram entry, but that's why we need a correction there.)

True. In general, histfrac = ((k - 1) + binfrac) / (n - 1), where
binfrac comes from the linear interpolation within the bucket. When p
is a histogram boundary, histfrac is always the probability of
"x <= p". When p lies inside a bucket and the distribution within the
bucket is flat, it is also the probability of "x <= p". But when there
is high variance inside a bucket and p lies in its interior, the
linear interpolation fails to capture the actual probability. As
you've mentioned earlier, though, I guess that's not a new problem.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers