Kuntal Ghosh <kuntalghosh.2...@gmail.com> writes:
> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> + * In the first bin (i==1), add a fudge factor that ensures
> + * that histfrac is at least eq_selec. We do this because we
> + * know that the first histogram entry does satisfy the
> + * inequality (if !isgt) or not satisfy it (if isgt), so our
> + * estimate here should certainly not be zero even if binfrac
> + * is zero. (XXX experimentally this is the correct way to do
> + * it, but why isn't it a linear adjustment across the whole
> + * histogram rather than just the first bin?)
> Given that the values are distinct, (I've some doubts for real number case)

Well, really, we are *always* dealing with discrete distributions here.

> if histogram_bounds are assigned as,
> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..........}
> ...
> So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
> which means it assumes val is included in the previous bucket.

I looked at that again and realized that one of the answers I was
missing lies in the behavior of analyze.c's compute_scalar_stats().
When it has collected "nvals" values and wants to make a histogram
containing "num_hist" entries, it does this:

 * The object of this loop is to copy the first and last values[]
 * entries along with evenly-spaced values in between. So the
 * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].

(where i runs from 0 to num_hist - 1). Because the "/" denotes
integer division, this effectively means that for all but the first
entry, we are taking the last value out of the corresponding bucket
of samples. That's obviously true for the last histogram entry, which
will be the very last sample value, and it's also true for earlier
ones --- except for the *first* histogram entry, which is necessarily
the first one in its bucket.

So the first histogram bucket is actually a tad narrower than the
others, which is visible in the typical data you showed above: 0 to 9
is only 9 counts wide, but all the remaining buckets are 10 counts
wide. That explains why we need a scale adjustment just in the first
bucket.

I think I'm also beginning to grasp why scalarineqsel's basic
estimation process produces an estimate for "x <= p" rather than some
other condition such as "x < p". For clarity, imagine that we're
given p equal to the last histogram entry. If the test operator is in
fact "<=", then it will say "true" for every histogram entry, and
we'll fall off the end of the histogram and return 1.0, which is
correct. If the test operator is "<", it will return "true" for all
but the last entry, so that we end up with "i" equal to
sslot.nvalues - 1 --- but we will compute binfrac = 1.0, if
convert_to_scalar() produces sane answers, again resulting in
histfrac = 1.0. Similar reasoning applies for ">" and ">=", so that
in all cases we arrive at histfrac = 1.0 if p equals the last
histogram entry.

More generally, if p is equal to some interior histogram entry, say
the k'th one out of n total, we end up with histfrac = (k-1)/(n-1) no
matter which operator we probe with, assuming that convert_to_scalar()
correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
adjacent bins we end up examining. So, remembering that the histogram
entries occupy the right ends of their buckets, the proper
interpretation of that is that it's the probability of "x <= p".
(This is wrong for the first histogram entry, but that's why we need
a correction there.)

			regards, tom lane
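
PS: if anyone wants to see the arithmetic concretely, here's a quick
throwaway program (just an illustrative sketch, not code from the
tree) that plugs in the example numbers from upthread --- 100 evenly
distributed sample values 0..99 and an 11-entry histogram --- and
reproduces both effects: the narrower first bucket, and histfrac
coming out as (k-1)/(n-1), i.e. the fraction of samples satisfying
"x <= p".

#include <stdio.h>

int
main(void)
{
	int		values[100];
	int		bounds[11];
	int		nvals = 100;		/* number of sample values collected */
	int		num_hist = 11;		/* number of histogram entries wanted */
	int		i;

	/* pretend the sorted samples are simply 0..99, as in the example */
	for (i = 0; i < nvals; i++)
		values[i] = i;

	/* same index arithmetic as compute_scalar_stats() */
	for (i = 0; i < num_hist; i++)
		bounds[i] = values[(i * (nvals - 1)) / (num_hist - 1)];

	/* prints 0 9 19 29 ... 99: first bucket is 9 wide, the rest 10 */
	for (i = 0; i < num_hist; i++)
		printf("%d ", bounds[i]);
	printf("\n");

	/*
	 * For p equal to the k'th histogram entry (1-based), the histogram
	 * estimate comes out as (k-1)/(n-1), which matches the true
	 * fraction of samples with x <= p.  E.g. p = bounds[2] = 19, so
	 * k = 3, n = 11, and 20 of the 100 samples are <= 19:
	 */
	printf("histfrac = %g, actual P(x <= 19) = %g\n",
		   (double) (3 - 1) / (num_hist - 1),	/* 0.2 */
		   (double) 20 / nvals);				/* 0.2 */

	return 0;
}

Compile and run it and both printf's agree: the bounds come out as
0 9 19 29 ... 99, and histfrac matches the true "x <= p" fraction for
every interior entry --- but not for the first one, hence the fudge
factor in the patch.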